batch-detective

Statistical diagnosis of batch effects in bulk RNA-seq count matrices.

batch-detective is a zero-configuration, read-only CLI tool that diagnoses technical batch effects in bulk RNA-seq count matrices. It accepts a raw count matrix and a sample metadata table, and produces a self-contained HTML report with a quantitative assessment of variance attributable to technical vs. biological covariates.

It does not modify your data. It only reads and reports.

What it does

batch-detective runs a seven-stage analysis pipeline:

Validates input files — detects featureCounts annotation columns, STAR/HTSeq summary rows, mismatched sample IDs, non-integer counts, and transposed matrices
Quality control — library size outliers, dominant gene flags, zero saturation
Normalizes — CPM + log1p, MAD-based variable gene selection
PCA — mean-centered, top variable genes
Collinearity detection — flags when batch and biology are confounded before you spend time on correction
PC–covariate associations — Kruskal-Wallis η² (categorical) and Spearman ρ (continuous), BH FDR-corrected
ICC — Intraclass Correlation Coefficient ICC(1,1) per covariate, with bootstrap 95% CIs

Key outputs in the HTML report:

🔴🟡🟢 Traffic light — one-line summary for quick interpretation
ICC table — headline batch effect magnitude per covariate
PC–covariate association heatmap — which principal components are driven by which variables
PCA scatter plots — colored by each covariate
Collinearity warnings — with specific guidance when batch and treatment are confounded
Top batch-associated genes — gene-level ICC for the highest-variance genes
Sample outlier report — Mahalanobis distance + IQR detection
Next-steps recommendations — DESeq2/edgeR/ComBat code snippets, context-aware

Installation

Base install (CLI only)

pip install batch-detective

With progress bars

pip install "batch-detective[progress]"

With local web UI

pip install "batch-detective[serve]"

With PDF report export

pip install "batch-detective[pdf]"

# Additional system libraries required:
# Linux:   sudo apt install libpango-1.0-0 libcairo2 libgdk-pixbuf2.0-0
# macOS:   brew install pango cairo gdk-pixbuf
# Windows: https://doc.courtbouillon.org/weasyprint/stable/first_steps.html

Full install (progress + stats extras)

pip install "batch-detective[full]"

Developer install

git clone https://github.com/onedimkurt/batch-detective
cd batch-detective
pip install -e ".[dev]"
pytest  # run test suite

Quick start

batch-detective run \
  --counts counts.csv \
  --metadata metadata.csv \
  --output-dir ./results/ \
  --technical-covariates batch \
  --biological-covariates treatment

# Open the report
open results/report.html        # macOS
xdg-open results/report.html    # Linux

That's it. The report is a single self-contained HTML file — no server needed.

With multiple covariates

batch-detective run \
  --counts counts.csv \
  --metadata metadata.csv \
  --output-dir ./results/ \
  --technical-covariates "batch,plate,operator" \
  --biological-covariates "treatment,sex,age"

Preview without running (dry run)

batch-detective run \
  --counts counts.csv \
  --metadata metadata.csv \
  --output-dir ./results/ \
  --dry-run

This prints detected covariates, sample counts, and resolved parameters — without running the analysis.

Check tool is working

batch-detective validate

Runs on built-in synthetic data and confirms the tool is functioning correctly.

Input format

Count matrix (`--counts`)

A gene × sample count matrix in CSV or TSV format.

Rows = genes (ENSEMBL IDs or gene symbols)
Columns = samples
Values = raw integer counts (not normalized)
First column = gene IDs (used as row index)
First row = sample IDs (used as column headers)

gene_id,Sample_A,Sample_B,Sample_C,Sample_D
ENSG00000000003,423,387,512,449
ENSG00000000005,0,1,0,2
ENSG00000000419,891,1023,876,934

Accepted formats: CSV, TSV (delimiter auto-detected)

Not accepted: Excel (.xlsx/.xls), pre-normalized matrices (use --normalized)

Automatic handling:

featureCounts annotation columns (Chr, Start, End, Strand, Length) are detected and stripped automatically
Transposed matrices (samples as rows) are detected and corrected automatically
STAR (N_*) and HTSeq (__*) summary rows trigger a fatal error with instructions to remove them

Metadata table (`--metadata`)

A sample × covariate table in CSV or TSV format.

Rows = samples (must match sample IDs in counts)
Columns = covariate names
First column = sample IDs (used as row index)

sample_id,batch,treatment,sex
Sample_A,B1,control,M
Sample_B,B1,treated,F
Sample_C,B2,control,F
Sample_D,B2,treated,M

Tips:

Sample IDs must match between files (order does not matter)
Missing values are allowed — affected samples are excluded from that covariate's analysis
Categorical covariates are detected automatically (< 10 unique values, or non-numeric)
Continuous covariates are tested with Spearman correlation instead of Kruskal-Wallis

CLI reference

`batch-detective run`

Usage: batch-detective run [OPTIONS]

  Run batch effect diagnosis.

Options:
  --counts PATH                  Count matrix CSV/TSV (required)
  --metadata PATH                Sample metadata CSV/TSV (required)
  --output-dir PATH              Output directory (required)

  --normalized                   Input is pre-normalized; skip CPM step
  --n-variable-genes INT         Top variable genes for PCA [default: 2000]
  --n-pcs INT                    Number of principal components [default: 10]
  --min-cpm FLOAT                Minimum CPM for gene filtering [default: 1.0]
  --min-samples-expressing INT   Min samples expressing a gene [default: auto]
  --outlier-pval FLOAT           Mahalanobis outlier p-value [default: 0.001]

  --technical-covariates TEXT    Comma-separated list of technical covariate names
                                 (e.g. "batch,plate,operator")
  --biological-covariates TEXT   Comma-separated list of biological covariate names
                                 (e.g. "treatment,sex,age")
  --primary-covariate TEXT       Single covariate to condition on
  --condition-on TEXT            Comma-separated covariates to regress out of PCs
                                 before association testing

  --config PATH                  YAML config file (see Config file section)
  --dry-run                      Validate inputs and preview, no analysis run
  --overwrite                    Overwrite existing output directory
  --force                        Proceed past non-fatal validation warnings
  --anonymize-samples            Replace sample IDs with Sample_001, etc.

  --pdf                          Export PDF report (requires [pdf] extra)
  --dpi INT                      Figure resolution [default: 150 HTML, 300 PDF]
  --export-figures               Export all figures as PNG + SVG files

  --log-file PATH                Write full log to file
  --verbose                      Enable debug-level logging
  --quiet                        Suppress all non-error output

  --help                         Show this message and exit.

`batch-detective serve`

batch-detective serve [--port INT]
# Default port: 8501
# Requires: pip install "batch-detective[serve]"

Launches a local Streamlit web UI for interactive analysis (upload files, explore results).

`batch-detective validate`

batch-detective validate

Runs a built-in self-test on synthetic data (80 samples, 4 batches, known ICC ≈ 0.45). Exits 0 on success.

Exit codes

Code	Meaning
`0`	Analysis complete — no significant batch effects detected
`1`	Analysis complete — significant batch effects detected (not an error)
`2`	Analysis complete with data quality warnings
`3`	Fatal error — validation failed or unrecoverable error

Important: Exit code 1 means the tool ran successfully and found batch effects. It is not an error. See Pipeline integration for correct usage.

Interpreting results

ICC (Intraclass Correlation Coefficient)

ICC quantifies what fraction of transcriptome-wide variance is explained by a grouping variable, computed using ICC(1,1) — a one-way random effects model across all selected variable genes.

ICC	Interpretation
< 0.10	Negligible
0.10 – 0.30	Mild
0.30 – 0.60	Moderate
> 0.60	Strong

(Thresholds after Koo & Li, 2016)

High ICC for a biological variable (treatment, disease status) means your experiment worked — it is expected and not a problem.

High ICC for a technical variable (batch, plate, operator, sequencing date) indicates a batch effect that may confound downstream analysis.

Traffic light

The traffic light summarizes the highest technical-covariate ICC and PC associations:

Signal	Meaning
🟢 Green	No significant batch effects detected
🟡 Yellow	Possible batch effect — review recommended
🔴 Red	Significant batch effect — correction recommended

The traffic light is only informative when --technical-covariates are specified. Without labeling, all covariates are reported but not classified.

PC–covariate association heatmap

Shows effect size (η² for categorical, |ρ| for continuous) between each metadata covariate and the top 5 PCs. Asterisks show FDR significance: * q < 0.05, ** q < 0.01, *** q < 0.001.

Collinearity warnings

When batch and treatment are strongly collinear (Cramér's V ≥ 0.70), the report warns that batch correction is not recommended — it would remove biological signal along with the batch effect.

What to do when batch effects are found

If batch and treatment are not collinear:

# DESeq2
dds <- DESeqDataSetFromMatrix(countData, colData, design = ~ batch + treatment)

# edgeR
design <- model.matrix(~ batch + treatment)

# Visualization only (do NOT use for DE):
library(sva)
adjusted <- ComBat(dat = log_counts, batch = batch_vector)

If batch and treatment are collinear: do not apply correction. Proceed with standard analysis and acknowledge the limitation.

Output files

File	Description
`report.html`	Self-contained HTML report (main output)
`qc_summary.csv`	Per-sample QC metrics (library size, outlier flags)
`icc_table.csv`	ICC results per covariate with bootstrap CIs
`association_table.csv`	PC–covariate association test results (p-values, effect sizes)
`run_manifest.json`	Full reproducibility record (parameters, MD5 checksums, versions)
`figures/`	PNG + SVG exports (if `--export-figures` used)
`report.pdf`	PDF version (if `--pdf` used)

Config file

All CLI parameters can be specified in a YAML config file:

# batch_detective_config.yaml
counts: counts.csv
metadata: metadata.csv
output_dir: ./results/

n_variable_genes: 2000
n_pcs: 10
min_cpm: 1.0
outlier_pval: 0.001

technical_covariates: [batch, plate, operator]
biological_covariates: [treatment, sex, age]

batch-detective run --config batch_detective_config.yaml

Paths in the config file are resolved relative to the config file's location. CLI flags override config file values. Config file values override defaults.

Pipeline integration

batch-detective is designed for use in automated pipelines. Exit code 1 signals detected batch effects — this is a normal result, not an error.

# Correct pipeline pattern:
batch-detective run \
  --counts counts.csv \
  --metadata metadata.csv \
  --output-dir ./results/ \
  --technical-covariates batch \
  --quiet

EXIT=$?

if [ $EXIT -eq 0 ]; then
    echo "No batch effects. Proceeding with standard DE."
elif [ $EXIT -eq 1 ]; then
    echo "Batch effects detected. Adding batch to DE model."
elif [ $EXIT -eq 3 ]; then
    echo "Validation failed. Check inputs." >&2
    exit 1
fi

Nextflow / Snakemake

// Nextflow example
process BATCH_DETECTIVE {
    input:
    path counts
    path metadata

    output:
    path "results/report.html"
    path "results/run_manifest.json"

    script:
    """
    batch-detective run \\
        --counts ${counts} \\
        --metadata ${metadata} \\
        --output-dir results/ \\
        --technical-covariates batch \\
        --biological-covariates treatment || true
    """
}

# Snakemake example
rule batch_detective:
    input:
        counts="counts.csv",
        metadata="metadata.csv"
    output:
        report="results/report.html",
        manifest="results/run_manifest.json"
    shell:
        """
        batch-detective run \
            --counts {input.counts} \
            --metadata {input.metadata} \
            --output-dir results/ \
            --technical-covariates batch \
            --biological-covariates treatment || true
        """

The || true prevents the pipeline from aborting on exit code 1 (batch detected). Check the manifest for the actual exit code.

HPC / cluster usage

batch-detective runs entirely on CPU. Memory usage is typically < 2 GB for datasets up to 20,000 genes × 500 samples.

# Batch submission (SLURM example)
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1
#SBATCH --time=00:30:00

batch-detective run \
  --counts counts.csv \
  --metadata metadata.csv \
  --output-dir ./results/ \
  --log-file ./results/run.log

Web UI on HPC (SSH port forwarding):

# On HPC
batch-detective serve --port 8501

# On your local machine (in a separate terminal)
ssh -L 8501:localhost:8501 user@hpc.server.com

# Then open http://localhost:8501 in your local browser

Troubleshooting

`ERROR: Count matrix contains aligner summary rows`

Your count matrix contains STAR or HTSeq summary rows (N_unmapped, N_multimapping, __no_feature, etc.). Remove them before running:

import pandas as pd
df = pd.read_csv("counts.csv", index_col=0)
df = df[~df.index.str.startswith(("N_", "__"))]
df.to_csv("counts_clean.csv")

`ERROR: Count matrix contains non-integer values`

If your data is already normalized (RPKM, FPKM, TPM, log-transformed), use --normalized. If it contains floats due to rounding, use --force.

`ERROR: Output directory already contains files`

Use --overwrite to replace existing outputs:

batch-detective run ... --overwrite

weasyprint PDF dependencies missing

# Linux
sudo apt install libpango-1.0-0 libcairo2 libgdk-pixbuf2.0-0

# macOS
brew install pango cairo gdk-pixbuf

# Windows
# See: https://doc.courtbouillon.org/weasyprint/stable/first_steps.html

Port conflict with `serve`

batch-detective automatically finds the next available port if 8501 is in use. The port used is printed on startup.

Sample IDs do not match between files

Use --dry-run to inspect detected sample IDs before running the full analysis. Check for whitespace, case differences, or trailing characters in your CSV headers.

Large datasets (> 50,000 genes)

Gene filtering (CPM threshold) reduces the working set before PCA. For very large matrices, increase --min-cpm or decrease --n-variable-genes to reduce memory usage.

Known limitations

Repeated measures / longitudinal designs — Not supported. The tool detects likely subject-ID columns and warns, but statistical tests (Kruskal-Wallis, ICC) assume independent samples. Use DESeq2 LRT, lme4, or dream for repeated measures.
ICC underestimation under log1p(CPM) — log1p-CPM normalization is not variance-stabilized (cf. DESeq2 VST). High-expression genes may dominate variance. ICC values derived from log1p-CPM may underestimate true batch effect magnitude by approximately 20–35%.
Non-monotonic continuous relationships — Spearman correlation detects monotonic associations only. U-shaped or non-monotonic relationships between continuous covariates and PC scores are not detected.
Single-cell, spatial, and ATAC-seq — Not supported. The tool is designed for bulk RNA-seq count matrices.
Collinearity suppression — When batch and treatment are strongly collinear (Cramér's V ≥ 0.70), the tool reports ICC = 0 and suppresses PC association tests for the affected covariate. This is the correct behavior — associations cannot be interpreted when covariates are confounded.
Low statistical power at small n — With < 15 samples per group, the tool may miss moderate batch effects (ICC < 0.40). Power warnings are displayed in the report when this applies.
Fixed vs. random batch effects — ICC(1,1) treats batch labels as randomly sampled from a population. If your batches are fixed (i.e., you care specifically about these batches and would not replicate the study with different ones), ICC values may underestimate true effect magnitude.

Citation

If you use batch-detective in published research, please cite:

batch-detective v0.1.0 — Statistical diagnosis of batch effects in bulk RNA-seq count matrices.

And the underlying statistical methods:

ICC interpretation: Koo, T.K. & Li, M.Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163.
FDR correction: Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300.
PCA implementation: Pedregosa, F. et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Outlier detection (LedoitWolf): Ledoit, O. & Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2), 365–411.

Contributing

Pull requests are welcome. Please:

Run pytest before submitting
Run flake8 batch_detective/ to check style
Add tests for any new functionality
Update this README if CLI options change

License

MIT — see LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
.github/workflows		.github/workflows
batch_detective		batch_detective
docs		docs
tests		tests
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

batch-detective

Table of Contents

What it does

Installation

Base install (CLI only)

With progress bars

With local web UI

With PDF report export

Full install (progress + stats extras)

Developer install

Quick start

With multiple covariates

Preview without running (dry run)

Check tool is working

Input format

Count matrix (--counts)

Metadata table (--metadata)

CLI reference

batch-detective run

batch-detective serve

batch-detective validate

Exit codes

Interpreting results

ICC (Intraclass Correlation Coefficient)

Traffic light

PC–covariate association heatmap

Collinearity warnings

What to do when batch effects are found

Output files

Config file

Pipeline integration

Nextflow / Snakemake

HPC / cluster usage

Troubleshooting

ERROR: Count matrix contains aligner summary rows

ERROR: Count matrix contains non-integer values

ERROR: Output directory already contains files

weasyprint PDF dependencies missing

Port conflict with serve

Sample IDs do not match between files

Large datasets (> 50,000 genes)

Known limitations

Citation

Contributing

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Count matrix (`--counts`)

Metadata table (`--metadata`)

`batch-detective run`

`batch-detective serve`

`batch-detective validate`

`ERROR: Count matrix contains aligner summary rows`

`ERROR: Count matrix contains non-integer values`

`ERROR: Output directory already contains files`

Port conflict with `serve`

Packages