A comprehensive, specification-driven benchmarking system for analyzing the computational complexity of kompot differential expression and differential abundance methods with robust replicate statistics.
This project benchmarks kompot's performance across multiple dimensions with 10 replicates per configuration for robust median estimates and uncertainty quantification:
- Cell count scaling (1k to 422k cells)
- Gene count scaling (50 to 33k genes for DE)
- Component count scaling (10 to 100 dimensions for DA)
- Landmark scaling (500 to 10k landmarks)
- Sample variance impact (with/without, memory vs disk storage)
- Batching strategies (batched vs no-batch)
- CPU parallelization (1 CPU vs 16 CPUs)
- GPU acceleration (CPU vs GPU backends)
- 10 replicates per unique parameter combination
- 2870 total configs (287 unique parameter sets × 10 replicates)
- Median lines with 25th-75th percentile error bands in plots
- Robust statistics resilient to outliers and system variability
Important: Runtime measurements capture kompot computation only, excluding data preprocessing:
INCLUDED in runtime:
- Gaussian Process model fitting
- Sample variance estimation (when enabled)
- Mahalanobis distance computation (for DE)
- Null gene computation (when specified)
EXCLUDED from runtime:
- Data loading and subsetting
- Diffusion map computation (pre-computed in covid_preprocessed.h5ad)
- Sample ID assignment
- Disk storage setup
For DA (Differential Abundance), the pre-computed diffusion map embeddings (DM_EigenVectors, 50 dimensions) are used directly, so runtime reflects only the GP computation on those embeddings.
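For illustration, a minimal sketch of where this timing boundary sits. The function run_kompot_analysis is a hypothetical placeholder, not the actual call made in scripts/run_final_benchmark.py:

import time

import anndata as ad


def run_kompot_analysis(adata, config):
    """Hypothetical placeholder for the kompot DE/DA computation."""
    ...


def run_benchmark(config: dict) -> dict:
    # EXCLUDED from runtime: data loading and subsetting
    adata = ad.read_h5ad("data/covid_preprocessed.h5ad")
    adata = adata[: config["n_cells"]].copy()

    # INCLUDED in runtime: GP fitting, optional sample variance,
    # Mahalanobis distances, and null-gene computation inside kompot
    start = time.perf_counter()
    run_kompot_analysis(adata, config)
    runtime_seconds = time.perf_counter() - start

    return {"config_name": config["config_name"],
            "runtime_seconds": runtime_seconds,
            "success": True}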
ALL benchmark behavior is defined in benchmark_spec.yaml - zero hardcoding anywhere:
- Comparison groups: All parameter combinations with replicate counts
- Plot specifications: Which comparisons appear on which plots
- Resource estimation: Memory and time allocation rules for SLURM
- Partition selection: CPU (restart-new, canto) vs GPU (chorus) routing
- Constraints: Gene limits for SV, landmark bounds, batching rules
# Design Philosophy:
# - All comparisons defined once here
# - Config generation reads this file
# - Plotting code reads this file
# - SLURM scripts derive resources from this file
# - NO HARDCODING anywhere else

- Reproducibility: Single file defines the entire benchmark
- Maintainability: Change parameters in one place
- Transparency: All decisions explicit and documented
- Extensibility: Add new comparisons without touching code
# 1. Generate configs from spec (2870 total)
python3 scripts/generate_all_configs.py
# 2. Submit jobs to SLURM (automatic resource estimation)
python3 scripts/submit_all_jobs.py
# 3. Monitor progress (continuous updates)
python3 scripts/monitor_benchmark_runs.py
# 4. Generate plots with median + error bands
python3 scripts/plot_from_spec.py

# Download COVID-19 PBMC dataset (Meyer & Nikolic, 2021)
# Place in: data/meyer_nikolic_covid_pbmc.cellxgene.20210813.h5ad
# Preprocess dataset
python3 scripts/00_preprocess_covid_data.py \
--input data/meyer_nikolic_covid_pbmc.cellxgene.20210813.h5ad \
--output data/covid_preprocessed.h5ad \
--n-pca-components 50 \
--n-dm-components 50

Creates data/covid_preprocessed.h5ad with:
- 422,220 cells × 33,751 genes
- PCA embedding (50 components)
- Diffusion map eigenvectors (50 components) via Palantir - pre-computed once
- Required metadata: COVID_status (Healthy/COVID-19), patient_id
Note: Diffusion maps are computed once during preprocessing. DA benchmarks measure only kompot's GP computation time on these pre-computed embeddings.
python3 scripts/generate_all_configs.py

What it does:
- Reads benchmark_spec.yaml (single source of truth)
- Generates 10 replicates for each unique parameter combination (see the sketch below)
- Validates constraints (landmarks < cells for DE, max_genes for SV)
- Assigns resource requirements based on spec rules
- Outputs results/configs_generated.csv (2870 configurations)
Config structure:
- Each config has replicate_id (1-10)
- Config names include a replicate suffix: de_svdisk200g_c10000_g200_lm5000_r1
- SHA256 hash of parameters for result matching (config ID changes don't affect matching)
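A minimal sketch of the replicate expansion. The sweep values and field names below are placeholders; the real generator reads every comparison group and sweep from benchmark_spec.yaml and applies the full constraint and resource rules:

import itertools

# Hypothetical sweep for one comparison group; real values come from benchmark_spec.yaml
sweep = {"n_cells": [1000, 10000, 100000], "n_landmarks": [500, 5000]}
fixed = {"analysis_type": "de", "n_genes": 200, "use_sample_variance": True}
n_replicates = 10

configs = []
for values in itertools.product(*sweep.values()):
    params = {**fixed, **dict(zip(sweep.keys(), values))}
    # Example constraint from the spec: landmarks must stay below the cell count
    if params["n_landmarks"] >= params["n_cells"]:
        continue
    for rep in range(1, n_replicates + 1):
        configs.append({**params, "replicate_id": rep})

print(f"Generated {len(configs)} configs")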
# Submit all jobs
python3 scripts/submit_all_jobs.py
# Resubmit only missing/failed jobs (uses hash matching)
python3 scripts/submit_all_jobs.py --only-missing

What it does:
- Estimates memory/time from benchmark_spec.yaml resource rules
- Selects partition based on spec criteria:
  - restart-new: ≤680GB RAM, CPU-only jobs
  - canto: >680GB and <1500GB RAM, CPU-only jobs (high-memory nodes)
  - chorus: GPU-required jobs (auto-selected for use_gpu=true)
- Groups configs by resource requirements to minimize the job array count (see the grouping sketch below)
- Generates SLURM scripts from the template
- Saves job IDs to slurm/current_job_ids.txt for tracking
- Submits ~38 SLURM job arrays with automatic epilogue dependencies
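A minimal sketch of the resource-based grouping mentioned above (illustrative only; the real script also renders templates/benchmark_job.sh and submits each group with sbatch):

from collections import defaultdict


def group_into_job_arrays(configs: list[dict]) -> dict[tuple, list[dict]]:
    """Group configs that share resource requirements; each group becomes one SLURM job array."""
    groups = defaultdict(list)
    for cfg in configs:
        key = (cfg["partition"], cfg["memory_gb"], cfg["time_hours"], cfg["cpus"])
        groups[key].append(cfg)
    return groups

Each group maps to one job array whose array index selects a config row, which is how ~2870 configs collapse into roughly 38 submissions.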
Partition Selection Logic (from spec):
memory_thresholds:
  - partition: restart-new
    max_memory: 680
  - partition: canto
    max_memory: 1500
    requires_gpu: false   # High-memory CPU nodes
  - partition: chorus
    max_memory: 1500
    requires_gpu: true    # GPU nodes

Output:
Job IDs saved to slurm/current_job_ids.txt
Monitor with:
python3 scripts/monitor_benchmark_runs.py
Cancel these jobs:
python3 scripts/cancel_current_jobs.py
SLURM data capture:
Epilogue jobs will auto-populate SLURM data after completion
Manual fallback: python3 scripts/populate_slurm_data.py --all-missing
Problem: Config IDs change when benchmark_spec.yaml is modified, causing duplicate work.
Solution: SHA256 hash of benchmark parameters (excluding resource params like memory, time, partition).
Hash includes:
analysis_type, n_cells, n_genes, n_landmarks, n_components, use_sample_variance, compute_mahalanobis, store_on_disk, batch_size, null_genes, use_gpu, replicate_id
Hash excludes:
- Resource parameters: memory_gb, time_hours, partition, cpus
- Metadata: config_id, config_name, benchmark_type, plot_name
Benefits:
- Results survive spec modifications
- --only-missing correctly identifies completed work
- No duplicate runs when only resources change
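A minimal sketch of the hashing scheme, assumed to mirror scripts/config_hash.py in spirit; the field handling and hash length in the actual script may differ, and the 12-character truncation below simply matches the config_hash values shown in this README:

import hashlib
import json

HASH_FIELDS = [
    "analysis_type", "n_cells", "n_genes", "n_landmarks", "n_components",
    "use_sample_variance", "compute_mahalanobis", "store_on_disk",
    "batch_size", "null_genes", "use_gpu", "replicate_id",
]


def config_hash(config: dict) -> str:
    # Only benchmark-defining fields enter the hash; resource fields
    # (memory_gb, time_hours, partition, cpus) and metadata are excluded,
    # so hashes survive resource-only spec changes.
    payload = {k: config.get(k) for k in HASH_FIELDS}
    encoded = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()[:12]   # e.g. "5ded8cf42941"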
Verification:
# Check hash matching
python3 scripts/submit_all_jobs.py --only-missing --dry-run
# Shows: "Indexed 1929 valid results by hash"The Problem: SLURM MaxRSS (peak memory) is only available AFTER job completion in SLURM's accounting database. Querying from within a running job returns empty data.
Our Solution: Automatic dependent epilogue jobs.
How It Works:
Main Benchmark Job                Epilogue Job (dependent)           Result File
────────────────────────────────────────────────────────────────────────────────
┌──────────────────┐              ┌───────────────────┐              ┌───────────┐
│ 1. Run benchmark │    finish    │ 3. Start after    │    append    │ 5. MaxRSS │
│ 2. Save job IDs  │ ───────────> │    main completes │ ───────────> │    saved  │
│    to JSON       │              │ 4. Query sacct    │              │    forever│
└──────────────────┘              │    for MaxRSS     │              └───────────┘
                                  └───────────────────┘
Job 40881253                      Job 40881254
                                  (--dependency=afterany:40881253)
Automatic Epilogue Jobs:
- Submitted automatically by submit_all_jobs.py for each job array
- Use SLURM dependency: --dependency=afterany:MAIN_JOB_ID
- Wait for the main job to complete (success OR failure)
- Query SLURM accounting: sacct -j JOBID --format=JobID,State,MaxRSS,Elapsed (see the sketch below)
- Parse MaxRSS from the .batch subjob (contains actual resource usage)
- Append the data to the result JSON file permanently
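A minimal sketch of the sacct query and MaxRSS parsing step (assumed behavior; scripts/populate_slurm_data.py may parse differently):

import subprocess


def fetch_maxrss_gb(job_id: str) -> float | None:
    """Query SLURM accounting and return MaxRSS (in GB) from the .batch subjob."""
    out = subprocess.run(
        ["sacct", "-j", job_id, "--format=JobID,State,MaxRSS,Elapsed",
         "--parsable2", "--noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        jobid, _state, maxrss, _elapsed = line.split("|")
        if jobid.endswith(".batch") and maxrss:
            # Assumes MaxRSS carries a unit suffix, e.g. "3794176K" or "3.6G"
            units = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}
            return float(maxrss[:-1]) * units[maxrss[-1].upper()]
    return None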
Why This Is Critical:
- SLURM Data Retention: SLURM purges accounting data after 30-90 days
- Accurate Memory Metrics: MaxRSS captures ALL memory (C extensions, JAX, NumPy, etc.)
- Python tracemalloc: ~1.86 GB (Python allocations only)
- SLURM MaxRSS: ~3.62 GB (nearly 2x higher - true OS-level usage)
- Scientific Reproducibility: Result files preserve complete resource usage forever
- Publication Requirements: Accurate resource documentation for methods sections
Monitoring Epilogue Jobs:
# Check epilogue jobs (will show PENDING until main job finishes)
squeue -u $USER | grep slurm_epilogue
# After main job completes, epilogue should start and finish quickly (~10 seconds)
# Check epilogue logs
ls -lt slurm/logs/epilogue_*.out | head -10

Manual SLURM Data Population (fallback if epilogue fails):
# Populate SLURM data for all results missing it
python3 scripts/populate_slurm_data.py --all-missing
# Populate for specific job
python3 scripts/populate_slurm_data.py --job-id 12345678
# Check how many results are missing SLURM data
python3 -c "
import json
from pathlib import Path
missing = 0
for f in Path('results').glob('*.json'):
    data = json.load(open(f))
    if data.get('success') and not data.get('slurm_maxrss_gb'):
        missing += 1
print(f'Results missing SLURM data: {missing}')
"

Troubleshooting:
If epilogue jobs fail:
- Check epilogue logs: tail slurm/logs/epilogue_JOBID.out
- Verify SLURM accounting is available: sacct -j JOBID --format=JobID,State,MaxRSS
- Manually populate: python3 scripts/populate_slurm_data.py --job-id JOBID
If plots show missing memory data:
- Run: python3 scripts/populate_slurm_data.py --all-missing
- Check that result files have the slurm_maxrss_gb field
- Verify SLURM jobs haven't been purged (query within 30-90 days)
# Continuous monitoring (updates every 30s)
python3 scripts/monitor_benchmark_runs.py
# One-time check
python3 scripts/monitor_benchmark_runs.py --once
# Show missing configs
python3 scripts/monitor_benchmark_runs.py --missing
# Show failures with error messages
python3 scripts/monitor_benchmark_runs.py --failures
# Generate detailed report
python3 scripts/monitor_benchmark_runs.py --report

Monitor output:
Overall Progress: 2400/2870 configs completed (83.6%)
Completed: 2400
Failed: 470
Missing: 0
By status:
Successful with SLURM data: 2400
Failed (OOM, cancelled, etc): 470
# Generate all plots with median + error bands
python3 scripts/plot_from_spec.py
# Generate specific plot
python3 scripts/plot_from_spec.py --plot de_n_cells_sweep
# Custom output directory
python3 scripts/plot_from_spec.py --output-dir my_plots/

Plot features:
- Median lines (50th percentile) - robust central tendency
- Shaded bands (25th-75th percentile) - interquartile range
- Automatic filtering: Only uses successful runs with complete SLURM data
- Deduplication: Keeps one result per unique parameter combination
- High quality: 300 DPI PNG output
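A minimal sketch of the filtering, deduplication, and median/IQR aggregation described above (illustrative only; plot_from_spec.py takes the comparisons, groupings, and output names from benchmark_spec.yaml):

import json
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

# Keep only successful runs that have SLURM memory data
rows = [json.loads(p.read_text()) for p in Path("results").glob("*.json")]
df = pd.DataFrame([r for r in rows if r.get("success") and r.get("slurm_maxrss_gb")])

# Deduplicate: one result per unique parameter combination (config_hash)
df = df.drop_duplicates(subset="config_hash")

# Aggregate replicates: median line plus 25th-75th percentile band
# (grouping by n_cells here, as in a cells sweep)
stats = df.groupby("n_cells")["runtime_seconds"].quantile([0.25, 0.5, 0.75]).unstack()

plt.plot(stats.index, stats[0.5], label="median")
plt.fill_between(stats.index, stats[0.25], stats[0.75], alpha=0.3, label="IQR")
plt.xlabel("n_cells")
plt.ylabel("runtime (s)")
plt.legend()
plt.savefig("example_sweep.png", dpi=300)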
Generated plots:
- de_n_cells_sweep_{runtime|memory}.png
- de_n_genes_sweep_{runtime|memory}.png
- de_n_landmarks_sweep_{runtime|memory}.png
- de_n_components_sweep_{runtime|memory}.png
- da_n_cells_sweep_{runtime|memory}.png
- da_n_landmarks_sweep_{runtime|memory}.png
- da_n_components_sweep_{runtime|memory}.png
# Safe cancellation using tracked job IDs
python3 scripts/cancel_current_jobs.py
# Cancel by working directory (all jobs in this project)
python3 scripts/cancel_current_jobs.py --by-workdir
# Dry run (see what would be canceled)
python3 scripts/cancel_current_jobs.py --dry-run

# Check queue
squeue -u $USER
# View logs
tail -f slurm/logs/*.out
# Check job history
sacct -j JOB_ID --format=JobID,JobName,State,Elapsed,MaxRSS
# Check specific job array task
sacct -j JOB_ID_TASK_ID --format=JobID,State,MaxRSS,Elapsed

2025_kompot_complexity/
├── benchmark_spec.yaml              # SINGLE SOURCE OF TRUTH
├── README.md                        # This file
│
├── scripts/
│   ├── generate_all_configs.py      # Config generator from spec
│   ├── submit_all_jobs.py           # Job submitter with resource estimation
│   ├── run_final_benchmark.py       # Benchmark runner (timed)
│   ├── plot_from_spec.py            # Plotter (reads spec)
│   ├── monitor_benchmark_runs.py    # Progress monitor
│   ├── populate_slurm_data.py       # SLURM data collector (epilogue)
│   ├── cancel_current_jobs.py       # Safe job cancellation
│   ├── config_hash.py               # Hash-based result matching
│   └── add_hashes_to_results.py     # Add hashes to existing results
│
├── templates/
│   └── benchmark_job.sh             # SLURM job template
│
├── complexity_utils.py              # Analysis utilities
│
├── results/
│   ├── configs_generated.csv        # All 2870 configs with hashes
│   └── *.json                       # Benchmark results with SLURM data
│
├── slurm/
│   ├── current_job_ids.txt          # Tracked job IDs
│   ├── jobs/                        # Generated SLURM scripts
│   └── logs/                        # Job logs (main + epilogue)
│
├── data/
│   └── covid_preprocessed.h5ad      # Input dataset (with pre-computed DM)
│
└── complexity_analysis_plots/       # Generated plots
Differential Expression (DE):
- sv_disk_200g: Sample variance, disk storage, 200 genes fixed
- sv_mem_200g: Sample variance, memory storage, 200 genes fixed
- sv_disk_sweep: Sample variance, disk, sweeping genes (max 1000)
- sv_mem_sweep: Sample variance, memory, sweeping genes (max 200)
- nosv_200g: No SV, 200 genes fixed
- nosv_200g_gpu: No SV, 200 genes, GPU acceleration
- nosv_2000g_batched: No SV, 2000 genes, batched
- nosv_2000g_nobatch_1cpu: No SV, 2000 genes, no batch, 1 CPU
- nosv_2000g_nobatch_16cpu: No SV, 2000 genes, no batch, 16 CPUs
- nosv_2000g_gpu: No SV, 2000 genes, GPU
- nosv_allg_batched: No SV, all 33751 genes, batched
- nosv_allg_nobatch_1cpu: No SV, all genes, no batch, 1 CPU
- nosv_allg_nobatch_16cpu: No SV, all genes, no batch, 16 CPUs
- nosv_allg_gpu: No SV, all genes, GPU
Differential Abundance (DA):
- sv: Sample variance (memory only)
- nosv: No sample variance
- nosv_nobatch_16cpu: No SV, no batch, 16 CPUs
- nosv_nobatch_gpu: No SV, no batch, GPU
- nosv_batched: No SV, batched
DE Plots:
- de_n_cells_sweep: Cell count scaling (1k-422k cells)
- de_n_genes_sweep: Gene count scaling (50-33k genes)
- de_n_landmarks_sweep: Landmark scaling (500-10k landmarks)
- de_n_components_sweep: Component scaling (10-100 components)
DA Plots:
- da_n_cells_sweep: Cell count scaling (1k-422k cells)
- da_n_landmarks_sweep: Landmark scaling (500-10k landmarks)
- da_n_components_sweep: Component scaling (10-100 components)
Batching Rules:
- All SV runs: batch_size = 0 (no batching supported)
- Most no-SV runs: batch_size = 0 for direct comparison
- Batched variants: batch_size = 100 or 1000
Gene Count Limits:
- SV runs: Limited to ≤200 genes (disk) or ≤1000 genes (memory) due to computational cost
- No-SV runs: Can use all 33,751 genes
Null Genes:
- SV runs: null_genes = 0 (not needed with sample variance)
- No-SV runs: null_genes = 2000 (for proper FDR calculation)
GPU Support:
- GPU configs: use_gpu = true → automatically routed to the chorus partition
- CPU configs: use_gpu = false or unspecified → restart-new or canto, based on memory
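A minimal sketch of how these constraints could be checked during config generation, assuming the gene limits quoted above (the authoritative limits live in benchmark_spec.yaml):

def validate_config(cfg: dict) -> bool:
    """Return True if a generated config respects the spec constraints."""
    # Sample-variance runs do not support batching
    if cfg["use_sample_variance"] and cfg["batch_size"] != 0:
        return False
    # Gene count limits for sample-variance runs (disk vs memory storage),
    # using the limits stated in this section
    if cfg["use_sample_variance"]:
        max_genes = 200 if cfg["store_on_disk"] else 1000
        if cfg["n_genes"] > max_genes:
            return False
    # Landmarks must stay below the number of cells
    if cfg["n_landmarks"] >= cfg["n_cells"]:
        return False
    return True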
All rules defined in benchmark_spec.yaml:
memory_gb:
  sv_mem:                    # Sample variance with memory storage
    base: 100
    cell_factor: 0.00005
    landmark_factor: 0.015
    gene_factor: 0.5         # Significant gene scaling
  nosv_nobatch:              # No sample variance, no batching
    base: 60
    cell_factor: 0.001
    landmark_factor: 0.05
    gene_factor: 0.018
    cpu_factor: 3.0          # Per extra CPU beyond 1

time_hours:
  sv:
    base: 0.0
    scaling_factor: 1.7365   # (n_cells * n_landmarks / 1e9)
  nosv_1cpu:
    base: 0.0
    scaling_factor: 0.9233   # Faster without SV
  nosv_16cpu:
    base: 0.0
    scaling_factor: 0.0578   # 16x speedup with parallelization

partitions:
  restart-new:
    max_memory_gb: 680
    max_time_hours: 168
  canto:
    max_memory_gb: 1500      # High-memory CPU nodes
    max_time_hours: 168
  chorus:
    max_memory_gb: 1500      # GPU nodes
    requires_gpu: true

Selection logic:
- If use_gpu=true → chorus (GPU partition)
- Else if memory ≤ 680GB → restart-new (standard CPU)
- Else if memory > 680GB → canto (high-memory CPU)
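A minimal sketch of how these rules might be applied, assuming the memory estimate is a simple linear combination of the spec factors (the exact formula used by submit_all_jobs.py is defined by benchmark_spec.yaml and may differ):

def estimate_memory_gb(rule: dict, cfg: dict) -> float:
    """Assumed linear model: base plus per-cell, per-landmark, per-gene, and per-extra-CPU terms."""
    return (rule["base"]
            + rule.get("cell_factor", 0) * cfg["n_cells"]
            + rule.get("landmark_factor", 0) * cfg["n_landmarks"]
            + rule.get("gene_factor", 0) * cfg["n_genes"]
            + rule.get("cpu_factor", 0) * max(cfg.get("cpus", 1) - 1, 0))


def select_partition(memory_gb: float, use_gpu: bool) -> str:
    if use_gpu:
        return "chorus"        # GPU jobs always go to the GPU partition
    if memory_gb <= 680:
        return "restart-new"   # Standard CPU nodes
    return "canto"             # High-memory CPU nodes (<=1500 GB)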
# Find OOM failures
python3 scripts/monitor_benchmark_runs.py --failures | grep -i "memory\|OOM"
# Check memory estimates vs actual
# Compare config CSV memory_gb with result JSON slurm_maxrss_gb
# Adjust memory rules in benchmark_spec.yaml if needed
# Then regenerate and resubmit
python3 scripts/generate_all_configs.py
python3 scripts/submit_all_jobs.py --only-missing

Common OOM scenarios:
- GPU memory exhaustion: Reduce cell count or gene count
- CPU memory exhaustion: Jobs will automatically use canto partition if >680GB
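A minimal sketch of the estimated-vs-actual memory check suggested above, assuming both the config CSV and the result JSONs carry the config_hash and column names shown in this README:

import json
from pathlib import Path

import pandas as pd

configs = pd.read_csv("results/configs_generated.csv")
results = [json.loads(p.read_text()) for p in Path("results").glob("*.json")]
actual = pd.DataFrame([r for r in results if r.get("slurm_maxrss_gb")])

# Join estimated memory (memory_gb) with measured peak memory (slurm_maxrss_gb)
merged = configs.merge(actual[["config_hash", "slurm_maxrss_gb"]], on="config_hash")
merged["headroom_gb"] = merged["memory_gb"] - merged["slurm_maxrss_gb"]

# Configs whose actual usage approaches the estimate are OOM candidates
print(merged.nsmallest(10, "headroom_gb")[["config_name", "memory_gb", "slurm_maxrss_gb"]])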
Symptom: Config specifies cpus=1 but log shows CPUs allocated: 16
Diagnosis:
# Check SLURM log header
grep "CPUs allocated" slurm/logs/kompot_*.out
# Check actual job ID
cat results/RESULT.json | grep slurm_actual_job_id
# Find the log
ls slurm/logs/*ACTUAL_JOB_ID*.out

Solution: Results from misconfigured jobs must be removed and rerun:
# Move bad results to trash
mkdir -p .trash/wrong_config_YYYYMMDD
mv results/BAD_RESULT.json .trash/wrong_config_YYYYMMDD/
# Resubmit with --only-missing (hash-based detection)
python3 scripts/submit_all_jobs.py --only-missing

# Populate all missing SLURM data
python3 scripts/populate_slurm_data.py --all-missing
# If SLURM accounting has been purged (>90 days old):
# These jobs must be rerun - there's no way to recover the data

- Check if jobs are running: squeue -u $USER
- Check logs for errors: tail slurm/logs/*.out
- Verify the data file exists: ls -lh data/covid_preprocessed.h5ad
- Check for Python/JAX errors in logs: grep -i error slurm/logs/*.out
- Edit benchmark_spec.yaml:

comparison_groups:
  de:
    my_new_comparison:
      analysis_type: de
      use_sample_variance: false
      n_genes: 1000
      batch_size: 0
      cpus: 1
      null_genes: 2000
      description: "My new comparison"

- Add to relevant plots:

plots:
  de_n_cells_sweep:
    replicates: 10
    comparisons:
      - sv_disk_200g
      - my_new_comparison   # Add here

- Regenerate and submit:

python3 scripts/generate_all_configs.py
python3 scripts/submit_all_jobs.py

Edit memory/time formulas in benchmark_spec.yaml:
resource_rules:
  memory_gb:
    my_comparison:
      base: 50
      cell_factor: 0.0001   # Adjust based on observed memory usage
      landmark_factor: 0.01
      gene_factor: 0.5

Then regenerate configs and resubmit:

python3 scripts/generate_all_configs.py
python3 scripts/submit_all_jobs.py --only-missing   # Hash matching preserves completed work

Edit benchmark_spec.yaml:

plots:
  de_n_cells_sweep:
    replicates: 5   # Change from 10 to 5

Then regenerate and submit:

python3 scripts/generate_all_configs.py
python3 scripts/submit_all_jobs.py

- scripts/generate_all_configs.py - Generate configs from benchmark_spec.yaml
- scripts/submit_all_jobs.py - Submit jobs with automatic resource estimation
- scripts/monitor_benchmark_runs.py - Monitor progress with hash-based tracking
- scripts/plot_from_spec.py - Generate plots from benchmark_spec.yaml
- scripts/populate_slurm_data.py - Query SLURM and save MaxRSS data
  - Automatic: Called by epilogue jobs after each main job completes
  - Manual: --all-missing to backfill missing data
  - Manual: --job-id JOBID to populate a specific job
- Epilogue jobs: Auto-submitted by submit_all_jobs.py
  - Format: slurm_epilogue_JOBID
  - Dependency: --dependency=afterany:MAIN_JOB_ID
  - Resources: 8GB RAM, 10 min, 1 CPU
  - Logs: slurm/logs/epilogue_JOBID.out
- scripts/config_hash.py - Hash computation utilities
- scripts/add_hashes_to_results.py - Add hashes to existing results
  - Hashes stored in: configs_generated.csv (config_hash column) and result JSON files
- scripts/cancel_current_jobs.py - Cancel current run safely
- scripts/run_final_benchmark.py - Individual benchmark runner
- complexity_utils.py - Analysis utilities (plotting, data loading)
- benchmark_spec.yaml - Single source of truth (comparisons, plots, resources)
- results/configs_generated.csv - All 2870 configs with hashes
- results/*.json - Result files with permanent SLURM data and hashes
- slurm/current_job_ids.txt - Tracked job IDs for monitoring
- templates/benchmark_job.sh - SLURM job template
Each result JSON contains:
{
  "config_id": 1430,
  "config_name": "de_svmem200g_c422220_g200_lm10000_r1",
  "config_hash": "5ded8cf42941",
  "benchmark_type": "de_n_landmarks_sweep",
  "analysis_type": "de",
  "n_cells": 422220,
  "n_genes": 200,
  "n_landmarks": 10000,
  "use_sample_variance": true,
  "replicate_id": 1,
  "runtime_seconds": 5804.4,
  "success": true,
  "slurm_job_id": "41314142",
  "slurm_array_task_id": "1430",
  "slurm_actual_job_id": "41219175",
  "slurm_state": "COMPLETED",
  "slurm_maxrss_gb": 422.5,
  "slurm_elapsed": "01:36:44"
}