Kompot Complexity Benchmarking Suite

A comprehensive, specification-driven benchmarking system for analyzing the computational complexity of kompot's differential expression (DE) and differential abundance (DA) methods, with robust replicate statistics.

Overview

This project benchmarks kompot's performance across multiple dimensions with 10 replicates per configuration for robust median estimates and uncertainty quantification:

  • Cell count scaling (1k to 422k cells)
  • Gene count scaling (50 to 33k genes for DE)
  • Component count scaling (10 to 100 dimensions)
  • Landmark scaling (500 to 10k landmarks)
  • Sample variance impact (with/without, memory vs disk storage)
  • Batching strategies (batched vs no-batch)
  • CPU parallelization (1 CPU vs 16 CPUs)
  • GPU acceleration (CPU vs GPU backends)

Replicate System

  • 10 replicates per unique parameter combination
  • 2870 total configs (287 unique parameter sets × 10 replicates)
  • Median lines with 25th-75th percentile error bands in plots
  • Robust statistics resilient to outliers and system variability

Runtime Measurement Scope

Important: Runtime measurements capture kompot computation only, excluding data preprocessing:

INCLUDED in runtime:

  • Gaussian Process model fitting
  • Sample variance estimation (when enabled)
  • Mahalanobis distance computation (for DE)
  • Null gene computation (when specified)

EXCLUDED from runtime:

  • Data loading and subsetting
  • Diffusion map computation (pre-computed in covid_preprocessed.h5ad)
  • Sample ID assignment
  • Disk storage setup

For DA (Differential Abundance), the pre-computed diffusion map embeddings (DM_EigenVectors, 50 dimensions) are used directly, so runtime reflects only the GP computation on those embeddings.
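
A minimal sketch of how the timed region is delimited (the actual runner is scripts/run_final_benchmark.py; kompot_call is a placeholder passed in for the real DE/DA invocation and is not part of kompot's API):

import time
from typing import Callable

import anndata as ad

def run_timed_benchmark(adata_path: str, kompot_call: Callable) -> float:
    """Time only the kompot computation; data loading stays outside the timer."""
    # EXCLUDED from runtime: loading and subsetting the preprocessed dataset
    adata = ad.read_h5ad(adata_path)

    # INCLUDED in runtime: the kompot DE/DA computation itself
    start = time.perf_counter()
    kompot_call(adata)
    return time.perf_counter() - start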


Central Design Principle: benchmark_spec.yaml as Single Source of Truth

ALL benchmark behavior is defined in benchmark_spec.yaml, with zero hardcoding anywhere else:

What the spec file controls:

  • Comparison groups: All parameter combinations with replicate counts
  • Plot specifications: Which comparisons appear on which plots
  • Resource estimation: Memory and time allocation rules for SLURM
  • Partition selection: CPU (restart-new, canto) vs GPU (chorus) routing
  • Constraints: Gene limits for SV, landmark bounds, batching rules

Design philosophy from the spec:

# Design Philosophy:
# - All comparisons defined once here
# - Config generation reads this file
# - Plotting code reads this file
# - SLURM scripts derive resources from this file
# - NO HARDCODING anywhere else

Why this matters:

  1. Reproducibility: Single file defines entire benchmark
  2. Maintainability: Change parameters in one place
  3. Transparency: All decisions explicit and documented
  4. Extensibility: Add new comparisons without touching code
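
Every script begins by parsing the same file; a minimal sketch of that shared pattern (the top-level keys mirror the spec snippets quoted elsewhere in this README):

import yaml

with open("benchmark_spec.yaml") as fh:
    spec = yaml.safe_load(fh)

# All downstream behavior is looked up in the parsed spec, never hardcoded.
comparison_groups = spec["comparison_groups"]  # parameter combinations per analysis type
plots = spec["plots"]                          # which comparisons appear on which plot
resource_rules = spec["resource_rules"]        # memory/time estimation formulas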

Quick Start

# 1. Generate configs from spec (2870 total)
python3 scripts/generate_all_configs.py

# 2. Submit jobs to SLURM (automatic resource estimation)
python3 scripts/submit_all_jobs.py

# 3. Monitor progress (continuous updates)
python3 scripts/monitor_benchmark_runs.py

# 4. Generate plots with median + error bands
python3 scripts/plot_from_spec.py

Complete Workflow

Step 0: Data Preprocessing (One-time Setup)

# Download COVID-19 PBMC dataset (Meyer & Nikolic, 2021)
# Place in: data/meyer_nikolic_covid_pbmc.cellxgene.20210813.h5ad

# Preprocess dataset
python3 scripts/00_preprocess_covid_data.py \
    --input data/meyer_nikolic_covid_pbmc.cellxgene.20210813.h5ad \
    --output data/covid_preprocessed.h5ad \
    --n-pca-components 50 \
    --n-dm-components 50

Creates data/covid_preprocessed.h5ad with:

  • 422,220 cells × 33,751 genes
  • PCA embedding (50 components)
  • Diffusion map eigenvectors (50 components) via Palantir - pre-computed once
  • Required metadata: COVID_status (Healthy/COVID-19), patient_id

Note: Diffusion maps are computed once during preprocessing. DA benchmarks measure only kompot's GP computation time on these pre-computed embeddings.
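
A rough, hypothetical outline of this step, assuming Scanpy for PCA and Palantir for the diffusion map; the actual script (scripts/00_preprocess_covid_data.py) likely includes additional QC, normalization, and metadata handling:

import pandas as pd
import scanpy as sc
import palantir

adata = sc.read_h5ad("data/meyer_nikolic_covid_pbmc.cellxgene.20210813.h5ad")

# PCA embedding (50 components)
sc.pp.pca(adata, n_comps=50)

# Diffusion map eigenvectors via Palantir, computed once on the PCA projections
pca_df = pd.DataFrame(adata.obsm["X_pca"], index=adata.obs_names)
dm_res = palantir.utils.run_diffusion_maps(pca_df, n_components=50)
adata.obsm["DM_EigenVectors"] = dm_res["EigenVectors"].values

adata.write_h5ad("data/covid_preprocessed.h5ad")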

Step 1: Generate Benchmark Configurations

python3 scripts/generate_all_configs.py

What it does:

  • Reads benchmark_spec.yaml (single source of truth)
  • Generates 10 replicates for each unique parameter combination
  • Validates constraints (landmarks < cells for DE, max_genes for SV)
  • Assigns resource requirements based on spec rules
  • Outputs results/configs_generated.csv (2870 configurations)

Config structure:

  • Each config has replicate_id (1-10)
  • Config names include replicate suffix: de_svdisk200g_c10000_g200_lm5000_r1
  • SHA256 hash of parameters for result matching (config ID changes don't affect matching)

Step 2: Submit Jobs to SLURM

# Submit all jobs
python3 scripts/submit_all_jobs.py

# Resubmit only missing/failed jobs (uses hash matching)
python3 scripts/submit_all_jobs.py --only-missing

What it does:

  • Estimates memory/time from benchmark_spec.yaml resource rules
  • Selects partition based on spec criteria:
    • restart-new: ≤680GB RAM, CPU-only jobs
    • canto: >680GB and ≤1500GB RAM, CPU-only jobs (high-memory nodes)
    • chorus: GPU-required jobs (auto-selected for use_gpu=true)
  • Groups configs by resource requirements (minimizes job array count)
  • Generates SLURM scripts from template
  • Saves job IDs to slurm/current_job_ids.txt for tracking
  • Submits ~38 SLURM job arrays with automatic epilogue dependencies

Partition Selection Logic (from spec):

memory_thresholds:
  - partition: restart-new
    max_memory: 680
  - partition: canto
    max_memory: 1500
    requires_gpu: false  # High-memory CPU nodes
  - partition: chorus
    max_memory: 1500
    requires_gpu: true   # GPU nodes

Output:

Job IDs saved to slurm/current_job_ids.txt

Monitor with:
  python3 scripts/monitor_benchmark_runs.py

Cancel these jobs:
  python3 scripts/cancel_current_jobs.py

SLURM data capture:
  Epilogue jobs will auto-populate SLURM data after completion
  Manual fallback: python3 scripts/populate_slurm_data.py --all-missing

Hash-Based Result Detection

Problem: Config IDs change when benchmark_spec.yaml is modified, causing duplicate work.

Solution: SHA256 hash of benchmark parameters (excluding resource params like memory, time, partition).

Hash includes:

  • analysis_type, n_cells, n_genes, n_landmarks
  • n_components, use_sample_variance, compute_mahalanobis
  • store_on_disk, batch_size, null_genes
  • use_gpu, replicate_id

Hash excludes:

  • Resource parameters: memory_gb, time_hours, partition, cpus
  • Metadata: config_id, config_name, benchmark_type, plot_name
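
A minimal sketch of the idea: hash only the science-relevant fields listed above so the hash survives resource changes. The actual implementation in scripts/config_hash.py may differ in detail; the 12-character truncation is an assumption based on the example result file shown later.

import hashlib
import json

HASHED_FIELDS = [
    "analysis_type", "n_cells", "n_genes", "n_landmarks",
    "n_components", "use_sample_variance", "compute_mahalanobis",
    "store_on_disk", "batch_size", "null_genes",
    "use_gpu", "replicate_id",
]

def config_hash(config: dict, length: int = 12) -> str:
    """Stable hash over benchmark parameters, ignoring resource and metadata fields."""
    payload = {key: config.get(key) for key in HASHED_FIELDS}
    canonical = json.dumps(payload, sort_keys=True)  # deterministic serialization
    return hashlib.sha256(canonical.encode()).hexdigest()[:length]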

Benefits:

  • Results survive spec modifications
  • --only-missing correctly identifies completed work
  • No duplicate runs when only resources change

Verification:

# Check hash matching
python3 scripts/submit_all_jobs.py --only-missing --dry-run
# Shows: "Indexed 1929 valid results by hash"

SLURM Data Capture System

The Problem: SLURM MaxRSS (peak memory) is only available AFTER job completion in SLURM's accounting database. Querying from within a running job returns empty data.

Our Solution: Automatic dependent epilogue jobs.

How It Works:

Main Benchmark Job           Epilogue Job (dependent)        Result File
─────────────────────────────────────────────────────────────────────────
┌──────────────────┐         ┌───────────────────┐         ┌───────────┐
│ 1. Run benchmark │         │ 3. Start after    │         │ 5. MaxRSS │
│ 2. Save job IDs  │ ──────> │    main completes │ ──────> │    saved  │
│    to JSON       │ finish  │ 4. Query sacct    │ append  │    forever│
└──────────────────┘         │    for MaxRSS     │         └───────────┘
                             └───────────────────┘
      Job 40881253                Job 40881254
                               (--dependency=afterany:40881253)

Automatic Epilogue Jobs:

  • Submitted automatically by submit_all_jobs.py for each job array
  • Use SLURM dependency: --dependency=afterany:MAIN_JOB_ID
  • Wait for main job to complete (success OR failure)
  • Query SLURM accounting: sacct -j JOBID --format=JobID,State,MaxRSS,Elapsed
  • Parse MaxRSS from the .batch subjob, which contains the actual resource usage (see the sketch after this list)
  • Append data to result JSON file permanently
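
The epilogue logic reduces to one sacct query plus a parse of the .batch row; a simplified sketch of how such a query could work (the actual logic lives in scripts/populate_slurm_data.py and may differ, e.g. in unit handling):

import subprocess
from typing import Optional

def fetch_maxrss_gb(job_id: str) -> Optional[float]:
    """Return MaxRSS (GB) of a job's .batch step from SLURM accounting."""
    out = subprocess.run(
        ["sacct", "-j", job_id, "--parsable2", "--noheader",
         "--format=JobID,State,MaxRSS,Elapsed"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        jobid, state, maxrss, elapsed = line.split("|")
        if jobid.endswith(".batch") and maxrss:
            # MaxRSS carries a unit suffix, e.g. "3712456K"
            to_gb = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0}
            return float(maxrss[:-1]) * to_gb[maxrss[-1]]
    return None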

Why This Is Critical:

  1. SLURM Data Retention: SLURM purges accounting data after 30-90 days
  2. Accurate Memory Metrics: MaxRSS captures ALL memory (C extensions, JAX, NumPy, etc.)
    • Python tracemalloc: ~1.86 GB (Python allocations only)
    • SLURM MaxRSS: ~3.62 GB (nearly twice as high; the true OS-level usage)
  3. Scientific Reproducibility: Result files preserve complete resource usage forever
  4. Publication Requirements: Accurate resource documentation for methods sections

Monitoring Epilogue Jobs:

# Check epilogue jobs (will show PENDING until main job finishes)
squeue -u $USER | grep slurm_epilogue

# After main job completes, epilogue should start and finish quickly (~10 seconds)
# Check epilogue logs
ls -lt slurm/logs/epilogue_*.out | head -10

Manual SLURM Data Population (fallback if epilogue fails):

# Populate SLURM data for all results missing it
python3 scripts/populate_slurm_data.py --all-missing

# Populate for specific job
python3 scripts/populate_slurm_data.py --job-id 12345678

# Check how many results are missing SLURM data
python3 -c "
import json
from pathlib import Path
missing = 0
for f in Path('results').glob('*.json'):
    data = json.loads(f.read_text())
    if data.get('success') and not data.get('slurm_maxrss_gb'):
        missing += 1
print(f'Results missing SLURM data: {missing}')
"

Troubleshooting:

If epilogue jobs fail:

  1. Check epilogue logs: tail slurm/logs/epilogue_JOBID.out
  2. Verify SLURM accounting is available: sacct -j JOBID --format=JobID,State,MaxRSS
  3. Manually populate: python3 scripts/populate_slurm_data.py --job-id JOBID

If plots show missing memory data:

  1. Run: python3 scripts/populate_slurm_data.py --all-missing
  2. Check result files have slurm_maxrss_gb field
  3. Verify SLURM jobs haven't been purged (check within 30-90 days)

Step 3: Monitor Progress

# Continuous monitoring (updates every 30s)
python3 scripts/monitor_benchmark_runs.py

# One-time check
python3 scripts/monitor_benchmark_runs.py --once

# Show missing configs
python3 scripts/monitor_benchmark_runs.py --missing

# Show failures with error messages
python3 scripts/monitor_benchmark_runs.py --failures

# Generate detailed report
python3 scripts/monitor_benchmark_runs.py --report

Monitor output:

Overall Progress: 2400/2870 configs completed (83.6%)
   Completed: 2400
   Failed: 470
   Missing: 0

By status:
   Successful with SLURM data: 2400
   Failed (OOM, cancelled, etc): 470

Step 4: Generate Plots

# Generate all plots with median + error bands
python3 scripts/plot_from_spec.py

# Generate specific plot
python3 scripts/plot_from_spec.py --plot de_n_cells_sweep

# Custom output directory
python3 scripts/plot_from_spec.py --output-dir my_plots/

Plot features:

  • Median lines (50th percentile) - robust central tendency
  • Shaded bands (25th-75th percentile) - interquartile range (see the sketch after this list)
  • Automatic filtering: Only uses successful runs with complete SLURM data
  • Deduplication: Keeps one result per unique parameter combination
  • High quality: 300 DPI PNG output
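
A minimal sketch of one way the median lines and IQR bands could be drawn, assuming replicate values have already been collected per x value (plot_from_spec.py may differ in detail):

import numpy as np
import matplotlib.pyplot as plt

def plot_median_with_iqr(ax, x, replicate_values, label):
    """Median line with a shaded 25th-75th percentile band.

    replicate_values: 2D array, one row per x value, one column per replicate.
    """
    values = np.asarray(replicate_values)
    median = np.percentile(values, 50, axis=1)
    q25 = np.percentile(values, 25, axis=1)
    q75 = np.percentile(values, 75, axis=1)
    ax.plot(x, median, marker="o", label=label)
    ax.fill_between(x, q25, q75, alpha=0.3)

# Usage (illustrative):
# fig, ax = plt.subplots()
# plot_median_with_iqr(ax, [1_000, 10_000, 100_000], runtimes, "nosv_200g")
# fig.savefig("de_n_cells_sweep_runtime.png", dpi=300)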

Generated plots:

  • de_n_cells_sweep_{runtime|memory}.png
  • de_n_genes_sweep_{runtime|memory}.png
  • de_n_landmarks_sweep_{runtime|memory}.png
  • de_n_components_sweep_{runtime|memory}.png
  • da_n_cells_sweep_{runtime|memory}.png
  • da_n_landmarks_sweep_{runtime|memory}.png
  • da_n_components_sweep_{runtime|memory}.png

Job Management

Cancel Current Jobs

# Safe cancellation using tracked job IDs
python3 scripts/cancel_current_jobs.py

# Cancel by working directory (all jobs in this project)
python3 scripts/cancel_current_jobs.py --by-workdir

# Dry run (see what would be canceled)
python3 scripts/cancel_current_jobs.py --dry-run

Manual Commands

# Check queue
squeue -u $USER

# View logs
tail -f slurm/logs/*.out

# Check job history
sacct -j JOB_ID --format=JobID,JobName,State,Elapsed,MaxRSS

# Check specific job array task
sacct -j JOB_ID_TASK_ID --format=JobID,State,MaxRSS,Elapsed

Directory Structure

2025_kompot_complexity/
├── benchmark_spec.yaml          # SINGLE SOURCE OF TRUTH
├── README.md                     # This file
│
├── scripts/
│   ├── generate_all_configs.py       # Config generator from spec
│   ├── submit_all_jobs.py            # Job submitter with resource estimation
│   ├── run_final_benchmark.py        # Benchmark runner (timed)
│   ├── plot_from_spec.py             # Plotter (reads spec)
│   ├── monitor_benchmark_runs.py     # Progress monitor
│   ├── populate_slurm_data.py        # SLURM data collector (epilogue)
│   ├── cancel_current_jobs.py        # Safe job cancellation
│   ├── config_hash.py                # Hash-based result matching
│   └── add_hashes_to_results.py      # Add hashes to existing results
│
├── templates/
│   └── benchmark_job.sh         # SLURM job template
│
├── complexity_utils.py          # Analysis utilities
│
├── results/
│   ├── configs_generated.csv    # All 2870 configs with hashes
│   └── *.json                   # Benchmark results with SLURM data
│
├── slurm/
│   ├── current_job_ids.txt      # Tracked job IDs
│   ├── jobs/                    # Generated SLURM scripts
│   └── logs/                    # Job logs (main + epilogue)
│
├── data/
│   └── covid_preprocessed.h5ad  # Input dataset (with pre-computed DM)
│
└── complexity_analysis_plots/   # Generated plots

Benchmark Configuration

Comparison Groups Defined in benchmark_spec.yaml

Differential Expression (DE):

  • sv_disk_200g: Sample variance, disk storage, 200 genes fixed
  • sv_mem_200g: Sample variance, memory storage, 200 genes fixed
  • sv_disk_sweep: Sample variance, disk, sweeping genes (max 1000)
  • sv_mem_sweep: Sample variance, memory, sweeping genes (max 200)
  • nosv_200g: No SV, 200 genes fixed
  • nosv_200g_gpu: No SV, 200 genes, GPU acceleration
  • nosv_2000g_batched: No SV, 2000 genes, batched
  • nosv_2000g_nobatch_1cpu: No SV, 2000 genes, no batch, 1 CPU
  • nosv_2000g_nobatch_16cpu: No SV, 2000 genes, no batch, 16 CPUs
  • nosv_2000g_gpu: No SV, 2000 genes, GPU
  • nosv_allg_batched: No SV, all 33751 genes, batched
  • nosv_allg_nobatch_1cpu: No SV, all genes, no batch, 1 CPU
  • nosv_allg_nobatch_16cpu: No SV, all genes, no batch, 16 CPUs
  • nosv_allg_gpu: No SV, all genes, GPU

Differential Abundance (DA):

  • sv: Sample variance (memory only)
  • nosv: No sample variance
  • nosv_nobatch_16cpu: No SV, no batch, 16 CPUs
  • nosv_nobatch_gpu: No SV, no batch, GPU
  • nosv_batched: No SV, batched

Plot Specifications from benchmark_spec.yaml

DE Plots:

  • de_n_cells_sweep: Cell count scaling (1k-422k cells)
  • de_n_genes_sweep: Gene count scaling (50-33k genes)
  • de_n_landmarks_sweep: Landmark scaling (500-10k landmarks)
  • de_n_components_sweep: Component scaling (10-100 components)

DA Plots:

  • da_n_cells_sweep: Cell count scaling (1k-422k cells)
  • da_n_landmarks_sweep: Landmark scaling (500-10k landmarks)
  • da_n_components_sweep: Component scaling (10-100 components)

Important Constraints (from spec)

Batching Rules:

  • All SV runs: batch_size = 0 (no batching supported)
  • Most No-SV runs: batch_size = 0 for direct comparison
  • Batched variants: batch_size = 100 or 1000

Gene Count Limits:

  • SV runs: Limited to ≤1000 genes (disk) or ≤200 genes (memory) due to computational cost
  • No-SV runs: Can use all 33,751 genes

Null Genes:

  • SV runs: null_genes = 0 (not needed with sample variance)
  • No-SV runs: null_genes = 2000 (for proper FDR calculation)

GPU Support:

  • GPU configs: use_gpu = true → automatically routed to chorus partition
  • CPU configs: use_gpu = false or unspecified → restart-new or canto based on memory

Resource Estimation and Partition Selection

All rules defined in benchmark_spec.yaml:

Memory Estimation

memory_gb:
  sv_mem:  # Sample variance with memory storage
    base: 100
    cell_factor: 0.00005
    landmark_factor: 0.015
    gene_factor: 0.5  # Significant gene scaling

  nosv_nobatch:  # No sample variance, no batching
    base: 60
    cell_factor: 0.001
    landmark_factor: 0.05
    gene_factor: 0.018
    cpu_factor: 3.0  # Per extra CPU beyond 1

Time Estimation

time_hours:
  sv:
    base: 0.0
    scaling_factor: 1.7365  # (n_cells * n_landmarks / 1e9)

  nosv_1cpu:
    base: 0.0
    scaling_factor: 0.9233  # Faster without SV

  nosv_16cpu:
    base: 0.0
    scaling_factor: 0.0578  # 16x speedup with parallelization
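
Read together, these rules are simple linear models; a sketch of how they might be applied (factor names follow the YAML above, while the exact combination logic in submit_all_jobs.py may differ):

def estimate_memory_gb(rule: dict, cfg: dict) -> float:
    """Linear memory model: base plus per-cell, per-landmark, per-gene, and per-extra-CPU terms."""
    return (
        rule["base"]
        + rule.get("cell_factor", 0) * cfg["n_cells"]
        + rule.get("landmark_factor", 0) * cfg["n_landmarks"]
        + rule.get("gene_factor", 0) * cfg.get("n_genes", 0)
        + rule.get("cpu_factor", 0) * max(cfg.get("cpus", 1) - 1, 0)
    )

def estimate_time_hours(rule: dict, cfg: dict) -> float:
    """Time model scaled by (n_cells * n_landmarks / 1e9), per the spec comments."""
    return rule["base"] + rule["scaling_factor"] * cfg["n_cells"] * cfg["n_landmarks"] / 1e9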

Partition Selection

partitions:
  restart-new:
    max_memory_gb: 680
    max_time_hours: 168

  canto:
    max_memory_gb: 1500  # High-memory CPU nodes
    max_time_hours: 168

  chorus:
    max_memory_gb: 1500  # GPU nodes
    requires_gpu: true

Selection logic (sketched below):

  1. If use_gpu=true → chorus (GPU partition)
  2. Else if memory ≤ 680GB → restart-new (standard CPU)
  3. Else if memory > 680GB → canto (high-memory CPU)
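
The same three-step routing as a minimal sketch:

def select_partition(memory_gb: float, use_gpu: bool) -> str:
    """Route a job to a SLURM partition following the selection logic above."""
    if use_gpu:
        return "chorus"       # GPU jobs always go to the GPU partition
    if memory_gb <= 680:
        return "restart-new"  # standard CPU nodes
    return "canto"            # high-memory CPU nodes (up to 1500 GB)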

Troubleshooting

Jobs Failing Due to Memory

# Find OOM failures
python3 scripts/monitor_benchmark_runs.py --failures | grep -i "memory\|OOM"

# Check memory estimates vs actual
# Compare config CSV memory_gb with result JSON slurm_maxrss_gb

# Adjust memory rules in benchmark_spec.yaml if needed
# Then regenerate and resubmit
python3 scripts/generate_all_configs.py
python3 scripts/submit_all_jobs.py --only-missing

Common OOM scenarios:

  • GPU memory exhaustion: Reduce cell count or gene count
  • CPU memory exhaustion: Jobs will automatically use canto partition if >680GB

Jobs Running on Wrong Resources

Symptom: Config specifies cpus=1 but log shows CPUs allocated: 16

Diagnosis:

# Check SLURM log header
grep "CPUs allocated" slurm/logs/kompot_*.out

# Check actual job ID
cat results/RESULT.json | grep slurm_actual_job_id

# Find the log
ls slurm/logs/*ACTUAL_JOB_ID*.out

Solution: Results from misconfigured jobs must be removed and rerun:

# Move bad results to trash
mkdir -p .trash/wrong_config_YYYYMMDD
mv results/BAD_RESULT.json .trash/wrong_config_YYYYMMDD/

# Resubmit with --only-missing (hash-based detection)
python3 scripts/submit_all_jobs.py --only-missing

Missing SLURM Data

# Populate all missing SLURM data
python3 scripts/populate_slurm_data.py --all-missing

# If SLURM accounting has been purged (>90 days old):
# These jobs must be rerun - there's no way to recover the data

No Results Appearing

  1. Check if jobs are running: squeue -u $USER
  2. Check logs for errors: tail slurm/logs/*.out
  3. Verify data file exists: ls -lh data/covid_preprocessed.h5ad
  4. Check for Python/JAX errors in logs: grep -i error slurm/logs/*.out

Modifying the Benchmark

Add a New Comparison Group

  1. Edit benchmark_spec.yaml:
comparison_groups:
  de:
    my_new_comparison:
      analysis_type: de
      use_sample_variance: false
      n_genes: 1000
      batch_size: 0
      cpus: 1
      null_genes: 2000
      description: "My new comparison"
  2. Add to relevant plots:
plots:
  de_n_cells_sweep:
    replicates: 10
    comparisons:
      - sv_disk_200g
      - my_new_comparison  # Add here
  3. Regenerate and submit:
python3 scripts/generate_all_configs.py
python3 scripts/submit_all_jobs.py

Change Resource Estimates

Edit memory/time formulas in benchmark_spec.yaml:

resource_rules:
  memory_gb:
    my_comparison:
      base: 50
      cell_factor: 0.0001  # Adjust based on observed memory usage
      landmark_factor: 0.01
      gene_factor: 0.5

Then regenerate configs and resubmit:

python3 scripts/generate_all_configs.py
python3 scripts/submit_all_jobs.py --only-missing  # Hash matching preserves completed work

Change Replicate Count

Edit benchmark_spec.yaml:

plots:
  de_n_cells_sweep:
    replicates: 5  # Change from 10 to 5

Then regenerate and submit:

python3 scripts/generate_all_configs.py
python3 scripts/submit_all_jobs.py

Key Scripts Reference

Core Workflow

  • scripts/generate_all_configs.py - Generate configs from benchmark_spec.yaml
  • scripts/submit_all_jobs.py - Submit jobs with automatic resource estimation
  • scripts/monitor_benchmark_runs.py - Monitor progress with hash-based tracking
  • scripts/plot_from_spec.py - Generate plots from benchmark_spec.yaml

SLURM Data Capture System

  • scripts/populate_slurm_data.py - Query SLURM and save MaxRSS data
    • Automatic: Called by epilogue jobs after each main job completes
    • Manual: --all-missing to backfill missing data
    • Manual: --job-id JOBID to populate specific job
  • Epilogue jobs: Auto-submitted by submit_all_jobs.py
    • Format: slurm_epilogue_JOBID
    • Dependency: --dependency=afterany:MAIN_JOB_ID
    • Resources: 8GB RAM, 10 min, 1 CPU
    • Logs: slurm/logs/epilogue_JOBID.out

Hash-Based Result Matching

  • scripts/config_hash.py - Hash computation utilities
  • scripts/add_hashes_to_results.py - Add hashes to existing results
  • Hashes stored in: configs_generated.csv (config_hash column) and result JSON files

Utilities

  • scripts/cancel_current_jobs.py - Cancel current run safely
  • scripts/run_final_benchmark.py - Individual benchmark runner
  • complexity_utils.py - Analysis utilities (plotting, data loading)

Key Files

  • benchmark_spec.yaml - Single source of truth (comparisons, plots, resources)
  • results/configs_generated.csv - All 2870 configs with hashes
  • results/*.json - Result files with permanent SLURM data and hashes
  • slurm/current_job_ids.txt - Tracked job IDs for monitoring
  • templates/benchmark_job.sh - SLURM job template

Result File Format

Each result JSON contains:

{
  "config_id": 1430,
  "config_name": "de_svmem200g_c422220_g200_lm10000_r1",
  "config_hash": "5ded8cf42941",
  "benchmark_type": "de_n_landmarks_sweep",
  "analysis_type": "de",
  "n_cells": 422220,
  "n_genes": 200,
  "n_landmarks": 10000,
  "use_sample_variance": true,
  "replicate_id": 1,
  "runtime_seconds": 5804.4,
  "success": true,
  "slurm_job_id": "41314142",
  "slurm_array_task_id": "1430",
  "slurm_actual_job_id": "41219175",
  "slurm_state": "COMPLETED",
  "slurm_maxrss_gb": 422.5,
  "slurm_elapsed": "01:36:44"
}
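
Because each result file is self-describing, downstream analysis can work directly from the JSON; a sketch of collecting results into a table and summarizing replicates (pandas assumed; field names come from the example above):

import json
from pathlib import Path

import pandas as pd

records = []
for path in Path("results").glob("*.json"):
    data = json.loads(path.read_text())
    # Keep only successful runs that already have SLURM memory data
    if data.get("success") and data.get("slurm_maxrss_gb"):
        records.append(data)

df = pd.DataFrame(records)

# Median runtime and peak memory across replicates of each parameter combination
summary = (
    df.groupby(["analysis_type", "n_cells", "n_genes", "n_landmarks", "use_sample_variance"])
      [["runtime_seconds", "slurm_maxrss_gb"]]
      .median()
      .reset_index()
)
print(summary.head())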
