A comprehensive, specification-driven benchmarking system for analyzing the computational complexity of kompot differential expression and differential abundance methods with robust replicate statistics.
This project benchmarks kompot's performance across multiple dimensions with 10 replicates per configuration for robust median estimates and uncertainty quantification:
- Cell count scaling (1k to 422k cells)
- Gene count scaling (50 to 33k genes for DE)
- Component count scaling (10 to 100 dimensions for DA)
- Landmark scaling (500 to 10k landmarks)
- Sample variance impact (with/without, memory vs disk storage)
- Batching strategies (batched vs no-batch)
- CPU parallelization (1 CPU vs 16 CPUs)
- GPU acceleration (CPU vs GPU backends)
- 10 replicates per unique parameter combination
- 2870 total configs (287 unique parameter sets × 10 replicates)
- Median lines with 25th-75th percentile error bands in plots
- Robust statistics resilient to outliers and system variability
Important: Runtime measurements capture kompot computation only, excluding data preprocessing:
INCLUDED in runtime:
- Gaussian Process model fitting
- Sample variance estimation (when enabled)
- Mahalanobis distance computation (for DE)
- Null gene computation (when specified)
EXCLUDED from runtime:
- Data loading and subsetting
- Diffusion map computation (pre-computed in covid_preprocessed.h5ad)
- Sample ID assignment
- Disk storage setup
For DA (Differential Abundance), the pre-computed diffusion map embeddings (DM_EigenVectors, 50 dimensions) are used directly, so runtime reflects only the GP computation on those embeddings.
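For illustration, a minimal sketch of where this timing boundary sits. The function run_kompot_analysis is a hypothetical placeholder, not the actual call made in scripts/run_final_benchmark.py:

import time

import anndata as ad


def run_kompot_analysis(adata, config):
    """Hypothetical placeholder for the kompot DE/DA computation."""
    ...


def run_benchmark(config: dict) -> dict:
    # EXCLUDED from runtime: data loading and subsetting
    adata = ad.read_h5ad("data/covid_preprocessed.h5ad")
    adata = adata[: config["n_cells"]].copy()

    # INCLUDED in runtime: GP fitting, optional sample variance,
    # Mahalanobis distances, and null-gene computation inside kompot
    start = time.perf_counter()
    run_kompot_analysis(adata, config)
    runtime_seconds = time.perf_counter() - start

    return {"config_name": config["config_name"],
            "runtime_seconds": runtime_seconds,
            "success": True}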
ALL benchmark behavior is defined in benchmark_spec.yaml - zero hardcoding anywhere:
- Comparison groups: All parameter combinations with replicate counts
- Plot specifications: Which comparisons appear on which plots
- Resource estimation: Memory and time allocation rules for SLURM
- Partition selection: CPU (restart-new, canto) vs GPU (chorus) routing
- Constraints: Gene limits for SV, landmark bounds, batching rules
# Design Philosophy:
# - All comparisons defined once here
# - Config generation reads this file
# - Plotting code reads this file
# - SLURM scripts derive resources from this file
# - NO HARDCODING anywhere else

- Reproducibility: Single file defines the entire benchmark
- Maintainability: Change parameters in one place
- Transparency: All decisions explicit and documented
- Extensibility: Add new comparisons without touching code
# 1. Generate configs from spec (2870 total)
python3 scripts/generate_all_configs.py
# 2. Submit jobs to SLURM (automatic resource estimation)
python3 scripts/submit_all_jobs.py
# 3. Monitor progress (continuous updates)
python3 scripts/monitor_benchmark_runs.py
# 4. Generate plots with median + error bands
python3 scripts/plot_from_spec.py

# Download COVID-19 PBMC dataset (Meyer & Nikolic, 2021)
# Place in: data/meyer_nikolic_covid_pbmc.cellxgene.20210813.h5ad
# Preprocess dataset
python3 scripts/00_preprocess_covid_data.py \
--input data/meyer_nikolic_covid_pbmc.cellxgene.20210813.h5ad \
--output data/covid_preprocessed.h5ad \
--n-pca-components 50 \
--n-dm-components 50

Creates data/covid_preprocessed.h5ad with:
- 422,220 cells × 33,751 genes
- PCA embedding (50 components)
- Diffusion map eigenvectors (50 components) via Palantir - pre-computed once
- Required metadata: COVID_status (Healthy/COVID-19), patient_id
Note: Diffusion maps are computed once during preprocessing. DA benchmarks measure only kompot's GP computation time on these pre-computed embeddings.
python3 scripts/generate_all_configs.py

What it does:
- Reads benchmark_spec.yaml (single source of truth)
- Generates 10 replicates for each unique parameter combination (see the sketch below)
- Validates constraints (landmarks < cells for DE, max_genes for SV)
- Assigns resource requirements based on spec rules
- Outputs results/configs_generated.csv (2870 configurations)
Config structure:
- Each config has replicate_id (1-10)
- Config names include a replicate suffix: de_svdisk200g_c10000_g200_lm5000_r1
- SHA256 hash of parameters for result matching (config ID changes don't affect matching)
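A minimal sketch of the replicate expansion. The sweep values and field names below are placeholders; the real generator reads every comparison group and sweep from benchmark_spec.yaml and applies the full constraint and resource rules:

import itertools

# Hypothetical sweep for one comparison group; real values come from benchmark_spec.yaml
sweep = {"n_cells": [1000, 10000, 100000], "n_landmarks": [500, 5000]}
fixed = {"analysis_type": "de", "n_genes": 200, "use_sample_variance": True}
n_replicates = 10

configs = []
for values in itertools.product(*sweep.values()):
    params = {**fixed, **dict(zip(sweep.keys(), values))}
    # Example constraint from the spec: landmarks must stay below the cell count
    if params["n_landmarks"] >= params["n_cells"]:
        continue
    for rep in range(1, n_replicates + 1):
        configs.append({**params, "replicate_id": rep})

print(f"Generated {len(configs)} configs")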
# Submit all jobs
python3 scripts/submit_all_jobs.py
# Resubmit only missing/failed jobs (uses hash matching)
python3 scripts/submit_all_jobs.py --only-missing

What it does:
- Estimates memory/time from benchmark_spec.yaml resource rules
- Selects partition based on spec criteria:
  - restart-new: ≤680GB RAM, CPU-only jobs
  - canto: >680GB and <1500GB RAM, CPU-only jobs (high-memory nodes)
  - chorus: GPU-required jobs (auto-selected for use_gpu=true)
- Groups configs by resource requirements to minimize the job array count (see the grouping sketch below)
- Generates SLURM scripts from the template
- Saves job IDs to slurm/current_job_ids.txt for tracking
- Submits ~38 SLURM job arrays with automatic epilogue dependencies
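A minimal sketch of the resource-based grouping mentioned above (illustrative only; the real script also renders templates/benchmark_job.sh and submits each group with sbatch):

from collections import defaultdict


def group_into_job_arrays(configs: list[dict]) -> dict[tuple, list[dict]]:
    """Group configs that share resource requirements; each group becomes one SLURM job array."""
    groups = defaultdict(list)
    for cfg in configs:
        key = (cfg["partition"], cfg["memory_gb"], cfg["time_hours"], cfg["cpus"])
        groups[key].append(cfg)
    return groups

Each group maps to one job array whose array index selects a config row, which is how ~2870 configs collapse into roughly 38 submissions.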
Partition Selection Logic (from spec):
memory_thresholds:
  - partition: restart-new
    max_memory: 680
  - partition: canto
    max_memory: 1500
    requires_gpu: false   # High-memory CPU nodes
  - partition: chorus
    max_memory: 1500
    requires_gpu: true    # GPU nodes

Output:
Job IDs saved to slurm/current_job_ids.txt
Monitor with:
python3 scripts/monitor_benchmark_runs.py
Cancel these jobs:
python3 scripts/cancel_current_jobs.py
SLURM data capture:
Epilogue jobs will auto-populate SLURM data after completion
Manual fallback: python3 scripts/populate_slurm_data.py --all-missing
Problem: Config IDs change when benchmark_spec.yaml is modified, causing duplicate work.
Solution: SHA256 hash of benchmark parameters (excluding resource params like memory, time, partition).
Hash includes:
analysis_type, n_cells, n_genes, n_landmarks, n_components, use_sample_variance, compute_mahalanobis, store_on_disk, batch_size, null_genes, use_gpu, replicate_id
Hash excludes:
- Resource parameters: memory_gb, time_hours, partition, cpus
- Metadata: config_id, config_name, benchmark_type, plot_name
Benefits:
- Results survive spec modifications
- --only-missing correctly identifies completed work
- No duplicate runs when only resources change
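A minimal sketch of the hashing scheme, assumed to mirror scripts/config_hash.py in spirit; the field handling and hash length in the actual script may differ, and the 12-character truncation below simply matches the config_hash values shown in this README:

import hashlib
import json

HASH_FIELDS = [
    "analysis_type", "n_cells", "n_genes", "n_landmarks", "n_components",
    "use_sample_variance", "compute_mahalanobis", "store_on_disk",
    "batch_size", "null_genes", "use_gpu", "replicate_id",
]


def config_hash(config: dict) -> str:
    # Only benchmark-defining fields enter the hash; resource fields
    # (memory_gb, time_hours, partition, cpus) and metadata are excluded,
    # so hashes survive resource-only spec changes.
    payload = {k: config.get(k) for k in HASH_FIELDS}
    encoded = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()[:12]   # e.g. "5ded8cf42941"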
Verification:
# Check hash matching
python3 scripts/submit_all_jobs.py --only-missing --dry-run
# Shows: "Indexed 1929 valid results by hash"The Problem: SLURM MaxRSS (peak memory) is only available AFTER job completion in SLURM's accounting database. Querying from within a running job returns empty data.
Our Solution: Automatic dependent epilogue jobs.
How It Works:
Main Benchmark Job                Epilogue Job (dependent)           Result File
────────────────────────────────────────────────────────────────────────────────
┌──────────────────┐              ┌───────────────────┐              ┌───────────┐
│ 1. Run benchmark │    finish    │ 3. Start after    │    append    │ 5. MaxRSS │
│ 2. Save job IDs  │ ───────────> │    main completes │ ───────────> │    saved  │
│    to JSON       │              │ 4. Query sacct    │              │    forever│
└──────────────────┘              │    for MaxRSS     │              └───────────┘
                                  └───────────────────┘
Job 40881253                      Job 40881254
                                  (--dependency=afterany:40881253)
Automatic Epilogue Jobs:
- Submitted automatically by submit_all_jobs.py for each job array
- Use SLURM dependency: --dependency=afterany:MAIN_JOB_ID
- Wait for the main job to complete (success OR failure)
- Query SLURM accounting: sacct -j JOBID --format=JobID,State,MaxRSS,Elapsed (see the sketch below)
- Parse MaxRSS from the .batch subjob (contains actual resource usage)
- Append the data to the result JSON file permanently
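A minimal sketch of the sacct query and MaxRSS parsing step (assumed behavior; scripts/populate_slurm_data.py may parse differently):

import subprocess


def fetch_maxrss_gb(job_id: str) -> float | None:
    """Query SLURM accounting and return MaxRSS (in GB) from the .batch subjob."""
    out = subprocess.run(
        ["sacct", "-j", job_id, "--format=JobID,State,MaxRSS,Elapsed",
         "--parsable2", "--noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        jobid, _state, maxrss, _elapsed = line.split("|")
        if jobid.endswith(".batch") and maxrss:
            # Assumes MaxRSS carries a unit suffix, e.g. "3794176K" or "3.6G"
            units = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}
            return float(maxrss[:-1]) * units[maxrss[-1].upper()]
    return None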
Why This Is Critical:
- SLURM Data Retention: SLURM purges accounting data after 30-90 days
- Accurate Memory Metrics: MaxRSS captures ALL memory (C extensions, JAX, NumPy, etc.)
- Python tracemalloc: ~1.86 GB (Python allocations only)
- SLURM MaxRSS: ~3.62 GB (nearly 2x higher - true OS-level usage)
- Scientific Reproducibility: Result files preserve complete resource usage forever
- Publication Requirements: Accurate resource documentation for methods sections
Monitoring Epilogue Jobs:
# Check epilogue jobs (will show PENDING until main job finishes)
squeue -u $USER | grep slurm_epilogue
# After main job completes, epilogue should start and finish quickly (~10 seconds)
# Check epilogue logs
ls -lt slurm/logs/epilogue_*.out | head -10

Manual SLURM Data Population (fallback if epilogue fails):
# Populate SLURM data for all results missing it
python3 scripts/populate_slurm_data.py --all-missing
# Populate for specific job
python3 scripts/populate_slurm_data.py --job-id 12345678
# Check how many results are missing SLURM data
python3 -c "
import json
from pathlib import Path
missing = 0
for f in Path('results').glob('*.json'):
    data = json.load(open(f))
    if data.get('success') and not data.get('slurm_maxrss_gb'):
        missing += 1
print(f'Results missing SLURM data: {missing}')
"

Troubleshooting:
If epilogue jobs fail:
- Check epilogue logs: tail slurm/logs/epilogue_JOBID.out
- Verify SLURM accounting is available: sacct -j JOBID --format=JobID,State,MaxRSS
- Manually populate: python3 scripts/populate_slurm_data.py --job-id JOBID
If plots show missing memory data:
- Run: python3 scripts/populate_slurm_data.py --all-missing
- Check that result files have the slurm_maxrss_gb field
- Verify SLURM jobs haven't been purged (query within 30-90 days)
# Continuous monitoring (updates every 30s)
python3 scripts/monitor_benchmark_runs.py
# One-time check
python3 scripts/monitor_benchmark_runs.py --once
# Show missing configs
python3 scripts/monitor_benchmark_runs.py --missing
# Show failures with error messages
python3 scripts/monitor_benchmark_runs.py --failures
# Generate detailed report
python3 scripts/monitor_benchmark_runs.py --report

Monitor output:
Overall Progress: 2400/2870 configs completed (83.6%)
Completed: 2400
Failed: 470
Missing: 0
By status:
Successful with SLURM data: 2400
Failed (OOM, cancelled, etc): 470
# Generate all plots with median + error bands
python3 scripts/plot_from_spec.py
# Generate specific plot
python3 scripts/plot_from_spec.py --plot de_n_cells_sweep
# Custom output directory
python3 scripts/plot_from_spec.py --output-dir my_plots/

Plot features:
- Median lines (50th percentile) - robust central tendency
- Shaded bands (25th-75th percentile) - interquartile range
- Automatic filtering: Only uses successful runs with complete SLURM data
- Deduplication: Keeps one result per unique parameter combination
- High quality: 300 DPI PNG output
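A minimal sketch of the filtering, deduplication, and median/IQR aggregation described above (illustrative only; plot_from_spec.py takes the comparisons, groupings, and output names from benchmark_spec.yaml):

import json
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

# Keep only successful runs that have SLURM memory data
rows = [json.loads(p.read_text()) for p in Path("results").glob("*.json")]
df = pd.DataFrame([r for r in rows if r.get("success") and r.get("slurm_maxrss_gb")])

# Deduplicate: one result per unique parameter combination (config_hash)
df = df.drop_duplicates(subset="config_hash")

# Aggregate replicates: median line plus 25th-75th percentile band
# (grouping by n_cells here, as in a cells sweep)
stats = df.groupby("n_cells")["runtime_seconds"].quantile([0.25, 0.5, 0.75]).unstack()

plt.plot(stats.index, stats[0.5], label="median")
plt.fill_between(stats.index, stats[0.25], stats[0.75], alpha=0.3, label="IQR")
plt.xlabel("n_cells")
plt.ylabel("runtime (s)")
plt.legend()
plt.savefig("example_sweep.png", dpi=300)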
Generated plots:
- de_n_cells_sweep_{runtime|memory}.png
- de_n_genes_sweep_{runtime|memory}.png
- de_n_landmarks_sweep_{runtime|memory}.png
- de_n_components_sweep_{runtime|memory}.png
- da_n_cells_sweep_{runtime|memory}.png
- da_n_landmarks_sweep_{runtime|memory}.png
- da_n_components_sweep_{runtime|memory}.png
# Safe cancellation using tracked job IDs
python3 scripts/cancel_current_jobs.py
# Cancel by working directory (all jobs in this project)
python3 scripts/cancel_current_jobs.py --by-workdir
# Dry run (see what would be canceled)
python3 scripts/cancel_current_jobs.py --dry-run

# Check queue
squeue -u $USER
# View logs
tail -f slurm/logs/*.out
# Check job history
sacct -j JOB_ID --format=JobID,JobName,State,Elapsed,MaxRSS
# Check specific job array task
sacct -j JOB_ID_TASK_ID --format=JobID,State,MaxRSS,Elapsed

2025_kompot_complexity/
├── benchmark_spec.yaml              # SINGLE SOURCE OF TRUTH
├── README.md                        # This file
│
├── scripts/
│   ├── generate_all_configs.py      # Config generator from spec
│   ├── submit_all_jobs.py           # Job submitter with resource estimation
│   ├── run_final_benchmark.py       # Benchmark runner (timed)
│   ├── plot_from_spec.py            # Plotter (reads spec)
│   ├── monitor_benchmark_runs.py    # Progress monitor
│   ├── populate_slurm_data.py       # SLURM data collector (epilogue)
│   ├── cancel_current_jobs.py       # Safe job cancellation
│   ├── config_hash.py               # Hash-based result matching
│   └── add_hashes_to_results.py     # Add hashes to existing results
│
├── templates/
│   └── benchmark_job.sh             # SLURM job template
│
├── complexity_utils.py              # Analysis utilities
│
├── results/
│   ├── configs_generated.csv        # All 2870 configs with hashes
│   └── *.json                       # Benchmark results with SLURM data
│
├── slurm/
│   ├── current_job_ids.txt          # Tracked job IDs
│   ├── jobs/                        # Generated SLURM scripts
│   └── logs/                        # Job logs (main + epilogue)
│
├── data/
│   └── covid_preprocessed.h5ad      # Input dataset (with pre-computed DM)
│
└── complexity_analysis_plots/       # Generated plots
Differential Expression (DE):
- sv_disk_200g: Sample variance, disk storage, 200 genes fixed
- sv_mem_200g: Sample variance, memory storage, 200 genes fixed
- sv_disk_sweep: Sample variance, disk, sweeping genes (max 1000)
- sv_mem_sweep: Sample variance, memory, sweeping genes (max 200)
- nosv_200g: No SV, 200 genes fixed
- nosv_200g_gpu: No SV, 200 genes, GPU acceleration
- nosv_2000g_batched: No SV, 2000 genes, batched
- nosv_2000g_nobatch_1cpu: No SV, 2000 genes, no batch, 1 CPU
- nosv_2000g_nobatch_16cpu: No SV, 2000 genes, no batch, 16 CPUs
- nosv_2000g_gpu: No SV, 2000 genes, GPU
- nosv_allg_batched: No SV, all 33751 genes, batched
- nosv_allg_nobatch_1cpu: No SV, all genes, no batch, 1 CPU
- nosv_allg_nobatch_16cpu: No SV, all genes, no batch, 16 CPUs
- nosv_allg_gpu: No SV, all genes, GPU
Differential Abundance (DA):
- sv: Sample variance (memory only)
- nosv: No sample variance
- nosv_nobatch_16cpu: No SV, no batch, 16 CPUs
- nosv_nobatch_gpu: No SV, no batch, GPU
- nosv_batched: No SV, batched
DE Plots:
- de_n_cells_sweep: Cell count scaling (1k-422k cells)
- de_n_genes_sweep: Gene count scaling (50-33k genes)
- de_n_landmarks_sweep: Landmark scaling (500-10k landmarks)
- de_n_components_sweep: Component scaling (10-100 components)
DA Plots:
- da_n_cells_sweep: Cell count scaling (1k-422k cells)
- da_n_landmarks_sweep: Landmark scaling (500-10k landmarks)
- da_n_components_sweep: Component scaling (10-100 components)
Batching Rules:
- All SV runs: batch_size = 0 (no batching supported)
- Most no-SV runs: batch_size = 0 for direct comparison
- Batched variants: batch_size = 100 or 1000
Gene Count Limits:
- SV runs: Limited to ≤200 genes (disk) or ≤1000 genes (memory) due to computational cost
- No-SV runs: Can use all 33,751 genes
Null Genes:
- SV runs: null_genes = 0 (not needed with sample variance)
- No-SV runs: null_genes = 2000 (for proper FDR calculation)
GPU Support:
- GPU configs: use_gpu = true → automatically routed to the chorus partition
- CPU configs: use_gpu = false or unspecified → restart-new or canto, based on memory
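A minimal sketch of how these constraints could be checked during config generation, assuming the gene limits quoted above (the authoritative limits live in benchmark_spec.yaml):

def validate_config(cfg: dict) -> bool:
    """Return True if a generated config respects the spec constraints."""
    # Sample-variance runs do not support batching
    if cfg["use_sample_variance"] and cfg["batch_size"] != 0:
        return False
    # Gene count limits for sample-variance runs (disk vs memory storage),
    # using the limits stated in this section
    if cfg["use_sample_variance"]:
        max_genes = 200 if cfg["store_on_disk"] else 1000
        if cfg["n_genes"] > max_genes:
            return False
    # Landmarks must stay below the number of cells
    if cfg["n_landmarks"] >= cfg["n_cells"]:
        return False
    return True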
All rules defined in benchmark_spec.yaml:
memory_gb:
  sv_mem:                    # Sample variance with memory storage
    base: 100
    cell_factor: 0.00005
    landmark_factor: 0.015
    gene_factor: 0.5         # Significant gene scaling
  nosv_nobatch:              # No sample variance, no batching
    base: 60
    cell_factor: 0.001
    landmark_factor: 0.05
    gene_factor: 0.018
    cpu_factor: 3.0          # Per extra CPU beyond 1

time_hours:
  sv:
    base: 0.0
    scaling_factor: 1.7365   # (n_cells * n_landmarks / 1e9)
  nosv_1cpu:
    base: 0.0
    scaling_factor: 0.9233   # Faster without SV
  nosv_16cpu:
    base: 0.0
    scaling_factor: 0.0578   # 16x speedup with parallelization

partitions:
  restart-new:
    max_memory_gb: 680
    max_time_hours: 168
  canto:
    max_memory_gb: 1500      # High-memory CPU nodes
    max_time_hours: 168
  chorus:
    max_memory_gb: 1500      # GPU nodes
    requires_gpu: true

Selection logic:
- If use_gpu=true → chorus (GPU partition)
- Else if memory ≤ 680GB → restart-new (standard CPU)
- Else if memory > 680GB → canto (high-memory CPU)
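A minimal sketch of how these rules might be applied, assuming the memory estimate is a simple linear combination of the spec factors (the exact formula used by submit_all_jobs.py is defined by benchmark_spec.yaml and may differ):

def estimate_memory_gb(rule: dict, cfg: dict) -> float:
    """Assumed linear model: base plus per-cell, per-landmark, per-gene, and per-extra-CPU terms."""
    return (rule["base"]
            + rule.get("cell_factor", 0) * cfg["n_cells"]
            + rule.get("landmark_factor", 0) * cfg["n_landmarks"]
            + rule.get("gene_factor", 0) * cfg["n_genes"]
            + rule.get("cpu_factor", 0) * max(cfg.get("cpus", 1) - 1, 0))


def select_partition(memory_gb: float, use_gpu: bool) -> str:
    if use_gpu:
        return "chorus"        # GPU jobs always go to the GPU partition
    if memory_gb <= 680:
        return "restart-new"   # Standard CPU nodes
    return "canto"             # High-memory CPU nodes (<=1500 GB)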
# Find OOM failures
python3 scripts/monitor_benchmark_runs.py --failures | grep -i "memory\|OOM"
# Check memory estimates vs actual
# Compare config CSV memory_gb with result JSON slurm_maxrss_gb
# Adjust memory rules in benchmark_spec.yaml if needed
# Then regenerate and resubmit
python3 scripts/generate_all_configs.py
python3 scripts/submit_all_jobs.py --only-missing

Common OOM scenarios:
- GPU memory exhaustion: Reduce cell count or gene count
- CPU memory exhaustion: Jobs will automatically use canto partition if >680GB
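A minimal sketch of the estimated-vs-actual memory check suggested above, assuming both the config CSV and the result JSONs carry the config_hash and column names shown in this README:

import json
from pathlib import Path

import pandas as pd

configs = pd.read_csv("results/configs_generated.csv")
results = [json.loads(p.read_text()) for p in Path("results").glob("*.json")]
actual = pd.DataFrame([r for r in results if r.get("slurm_maxrss_gb")])

# Join estimated memory (memory_gb) with measured peak memory (slurm_maxrss_gb)
merged = configs.merge(actual[["config_hash", "slurm_maxrss_gb"]], on="config_hash")
merged["headroom_gb"] = merged["memory_gb"] - merged["slurm_maxrss_gb"]

# Configs whose actual usage approaches the estimate are OOM candidates
print(merged.nsmallest(10, "headroom_gb")[["config_name", "memory_gb", "slurm_maxrss_gb"]])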
Symptom: Config specifies cpus=1 but log shows CPUs allocated: 16
Diagnosis:
# Check SLURM log header
grep "CPUs allocated" slurm/logs/kompot_*.out
# Check actual job ID
cat results/RESULT.json | grep slurm_actual_job_id
# Find the log
ls slurm/logs/*ACTUAL_JOB_ID*.out

Solution: Results from misconfigured jobs must be removed and rerun:
# Move bad results to trash
mkdir -p .trash/wrong_config_YYYYMMDD
mv results/BAD_RESULT.json .trash/wrong_config_YYYYMMDD/
# Resubmit with --only-missing (hash-based detection)
python3 scripts/submit_all_jobs.py --only-missing

# Populate all missing SLURM data
python3 scripts/populate_slurm_data.py --all-missing
# If SLURM accounting has been purged (>90 days old):
# These jobs must be rerun - there's no way to recover the data

- Check if jobs are running: squeue -u $USER
- Check logs for errors: tail slurm/logs/*.out
- Verify the data file exists: ls -lh data/covid_preprocessed.h5ad
- Check for Python/JAX errors in logs: grep -i error slurm/logs/*.out
- Edit benchmark_spec.yaml:

comparison_groups:
  de:
    my_new_comparison:
      analysis_type: de
      use_sample_variance: false
      n_genes: 1000
      batch_size: 0
      cpus: 1
      null_genes: 2000
      description: "My new comparison"

- Add to relevant plots:

plots:
  de_n_cells_sweep:
    replicates: 10
    comparisons:
      - sv_disk_200g
      - my_new_comparison   # Add here

- Regenerate and submit:

python3 scripts/generate_all_configs.py
python3 scripts/submit_all_jobs.py

Edit memory/time formulas in benchmark_spec.yaml:
resource_rules:
  memory_gb:
    my_comparison:
      base: 50
      cell_factor: 0.0001   # Adjust based on observed memory usage
      landmark_factor: 0.01
      gene_factor: 0.5

Then regenerate configs and resubmit:

python3 scripts/generate_all_configs.py
python3 scripts/submit_all_jobs.py --only-missing   # Hash matching preserves completed work

Edit benchmark_spec.yaml:

plots:
  de_n_cells_sweep:
    replicates: 5   # Change from 10 to 5

Then regenerate and submit:

python3 scripts/generate_all_configs.py
python3 scripts/submit_all_jobs.py

- scripts/generate_all_configs.py - Generate configs from benchmark_spec.yaml
- scripts/submit_all_jobs.py - Submit jobs with automatic resource estimation
- scripts/monitor_benchmark_runs.py - Monitor progress with hash-based tracking
- scripts/plot_from_spec.py - Generate plots from benchmark_spec.yaml
- scripts/populate_slurm_data.py - Query SLURM and save MaxRSS data
  - Automatic: Called by epilogue jobs after each main job completes
  - Manual: --all-missing to backfill missing data
  - Manual: --job-id JOBID to populate a specific job
- Epilogue jobs: Auto-submitted by submit_all_jobs.py
  - Format: slurm_epilogue_JOBID
  - Dependency: --dependency=afterany:MAIN_JOB_ID
  - Resources: 8GB RAM, 10 min, 1 CPU
  - Logs: slurm/logs/epilogue_JOBID.out
- scripts/config_hash.py - Hash computation utilities
- scripts/add_hashes_to_results.py - Add hashes to existing results
  - Hashes stored in: configs_generated.csv (config_hash column) and result JSON files
- scripts/cancel_current_jobs.py - Cancel current run safely
- scripts/run_final_benchmark.py - Individual benchmark runner
- complexity_utils.py - Analysis utilities (plotting, data loading)
- benchmark_spec.yaml - Single source of truth (comparisons, plots, resources)
- results/configs_generated.csv - All 2870 configs with hashes
- results/*.json - Result files with permanent SLURM data and hashes
- slurm/current_job_ids.txt - Tracked job IDs for monitoring
- templates/benchmark_job.sh - SLURM job template
Each result JSON contains:
{
  "config_id": 1430,
  "config_name": "de_svmem200g_c422220_g200_lm10000_r1",
  "config_hash": "5ded8cf42941",
  "benchmark_type": "de_n_landmarks_sweep",
  "analysis_type": "de",
  "n_cells": 422220,
  "n_genes": 200,
  "n_landmarks": 10000,
  "use_sample_variance": true,
  "replicate_id": 1,
  "runtime_seconds": 5804.4,
  "success": true,
  "slurm_job_id": "41314142",
  "slurm_array_task_id": "1430",
  "slurm_actual_job_id": "41219175",
  "slurm_state": "COMPLETED",
  "slurm_maxrss_gb": 422.5,
  "slurm_elapsed": "01:36:44"
}