Statistical diagnosis of batch effects in bulk RNA-seq count matrices.
batch-detective is a zero-configuration, read-only CLI tool that diagnoses technical batch effects in bulk RNA-seq count matrices. It accepts a raw count matrix and a sample metadata table, and produces a self-contained HTML report with a quantitative assessment of variance attributable to technical vs. biological covariates.
It does not modify your data. It only reads and reports.
- What it does
- Installation
- Quick start
- Input format
- CLI reference
- Interpreting results
- Output files
- Config file
- Pipeline integration
- HPC / cluster usage
- Troubleshooting
- Known limitations
- Citation
- License
batch-detective runs a seven-stage analysis pipeline:
- Validates input files — detects featureCounts annotation columns, STAR/HTSeq summary rows, mismatched sample IDs, non-integer counts, and transposed matrices
- Quality control — library size outliers, dominant gene flags, zero saturation
- Normalizes — CPM + log1p, MAD-based variable gene selection
- PCA — mean-centered, top variable genes
- Collinearity detection — flags when batch and biology are confounded before you spend time on correction
- PC–covariate associations — Kruskal-Wallis η² (categorical) and Spearman ρ (continuous), BH FDR-corrected
- ICC — Intraclass Correlation Coefficient ICC(1,1) per covariate, with bootstrap 95% CIs
Key outputs in the HTML report:
- 🔴🟡🟢 Traffic light — one-line summary for quick interpretation
- ICC table — headline batch effect magnitude per covariate
- PC–covariate association heatmap — which principal components are driven by which variables
- PCA scatter plots — colored by each covariate
- Collinearity warnings — with specific guidance when batch and treatment are confounded
- Top batch-associated genes — gene-level ICC for the highest-variance genes
- Sample outlier report — Mahalanobis distance + IQR detection
- Next-steps recommendations — DESeq2/edgeR/ComBat code snippets, context-aware
pip install batch-detectivepip install "batch-detective[progress]"pip install "batch-detective[serve]"pip install "batch-detective[pdf]"
# Additional system libraries required:
# Linux: sudo apt install libpango-1.0-0 libcairo2 libgdk-pixbuf2.0-0
# macOS: brew install pango cairo gdk-pixbuf
# Windows: https://doc.courtbouillon.org/weasyprint/stable/first_steps.htmlpip install "batch-detective[full]"git clone https://github.com/onedimkurt/batch-detective
cd batch-detective
pip install -e ".[dev]"
pytest # run test suitebatch-detective run \
--counts counts.csv \
--metadata metadata.csv \
--output-dir ./results/ \
--technical-covariates batch \
--biological-covariates treatment
# Open the report
open results/report.html # macOS
xdg-open results/report.html # LinuxThat's it. The report is a single self-contained HTML file — no server needed.
batch-detective run \
--counts counts.csv \
--metadata metadata.csv \
--output-dir ./results/ \
--technical-covariates "batch,plate,operator" \
--biological-covariates "treatment,sex,age"batch-detective run \
--counts counts.csv \
--metadata metadata.csv \
--output-dir ./results/ \
--dry-runThis prints detected covariates, sample counts, and resolved parameters — without running the analysis.
batch-detective validateRuns on built-in synthetic data and confirms the tool is functioning correctly.
A gene × sample count matrix in CSV or TSV format.
- Rows = genes (ENSEMBL IDs or gene symbols)
- Columns = samples
- Values = raw integer counts (not normalized)
- First column = gene IDs (used as row index)
- First row = sample IDs (used as column headers)
gene_id,Sample_A,Sample_B,Sample_C,Sample_D
ENSG00000000003,423,387,512,449
ENSG00000000005,0,1,0,2
ENSG00000000419,891,1023,876,934
Accepted formats: CSV, TSV (delimiter auto-detected)
Not accepted: Excel (.xlsx/.xls), pre-normalized matrices (use --normalized)
Automatic handling:
- featureCounts annotation columns (Chr, Start, End, Strand, Length) are detected and stripped automatically
- Transposed matrices (samples as rows) are detected and corrected automatically
- STAR (
N_*) and HTSeq (__*) summary rows trigger a fatal error with instructions to remove them
A sample × covariate table in CSV or TSV format.
- Rows = samples (must match sample IDs in counts)
- Columns = covariate names
- First column = sample IDs (used as row index)
sample_id,batch,treatment,sex
Sample_A,B1,control,M
Sample_B,B1,treated,F
Sample_C,B2,control,F
Sample_D,B2,treated,M
Tips:
- Sample IDs must match between files (order does not matter)
- Missing values are allowed — affected samples are excluded from that covariate's analysis
- Categorical covariates are detected automatically (< 10 unique values, or non-numeric)
- Continuous covariates are tested with Spearman correlation instead of Kruskal-Wallis
Usage: batch-detective run [OPTIONS]
Run batch effect diagnosis.
Options:
--counts PATH Count matrix CSV/TSV (required)
--metadata PATH Sample metadata CSV/TSV (required)
--output-dir PATH Output directory (required)
--normalized Input is pre-normalized; skip CPM step
--n-variable-genes INT Top variable genes for PCA [default: 2000]
--n-pcs INT Number of principal components [default: 10]
--min-cpm FLOAT Minimum CPM for gene filtering [default: 1.0]
--min-samples-expressing INT Min samples expressing a gene [default: auto]
--outlier-pval FLOAT Mahalanobis outlier p-value [default: 0.001]
--technical-covariates TEXT Comma-separated list of technical covariate names
(e.g. "batch,plate,operator")
--biological-covariates TEXT Comma-separated list of biological covariate names
(e.g. "treatment,sex,age")
--primary-covariate TEXT Single covariate to condition on
--condition-on TEXT Comma-separated covariates to regress out of PCs
before association testing
--config PATH YAML config file (see Config file section)
--dry-run Validate inputs and preview, no analysis run
--overwrite Overwrite existing output directory
--force Proceed past non-fatal validation warnings
--anonymize-samples Replace sample IDs with Sample_001, etc.
--pdf Export PDF report (requires [pdf] extra)
--dpi INT Figure resolution [default: 150 HTML, 300 PDF]
--export-figures Export all figures as PNG + SVG files
--log-file PATH Write full log to file
--verbose Enable debug-level logging
--quiet Suppress all non-error output
--help Show this message and exit.
batch-detective serve [--port INT]
# Default port: 8501
# Requires: pip install "batch-detective[serve]"Launches a local Streamlit web UI for interactive analysis (upload files, explore results).
batch-detective validateRuns a built-in self-test on synthetic data (80 samples, 4 batches, known ICC ≈ 0.45). Exits 0 on success.
| Code | Meaning |
|---|---|
0 |
Analysis complete — no significant batch effects detected |
1 |
Analysis complete — significant batch effects detected (not an error) |
2 |
Analysis complete with data quality warnings |
3 |
Fatal error — validation failed or unrecoverable error |
Important: Exit code
1means the tool ran successfully and found batch effects. It is not an error. See Pipeline integration for correct usage.
ICC quantifies what fraction of transcriptome-wide variance is explained by a grouping variable, computed using ICC(1,1) — a one-way random effects model across all selected variable genes.
| ICC | Interpretation |
|---|---|
| < 0.10 | Negligible |
| 0.10 – 0.30 | Mild |
| 0.30 – 0.60 | Moderate |
| > 0.60 | Strong |
(Thresholds after Koo & Li, 2016)
High ICC for a biological variable (treatment, disease status) means your experiment worked — it is expected and not a problem.
High ICC for a technical variable (batch, plate, operator, sequencing date) indicates a batch effect that may confound downstream analysis.
The traffic light summarizes the highest technical-covariate ICC and PC associations:
| Signal | Meaning |
|---|---|
| 🟢 Green | No significant batch effects detected |
| 🟡 Yellow | Possible batch effect — review recommended |
| 🔴 Red | Significant batch effect — correction recommended |
The traffic light is only informative when --technical-covariates are specified. Without labeling, all covariates are reported but not classified.
Shows effect size (η² for categorical, |ρ| for continuous) between each metadata covariate and the top 5 PCs. Asterisks show FDR significance: * q < 0.05, ** q < 0.01, *** q < 0.001.
When batch and treatment are strongly collinear (Cramér's V ≥ 0.70), the report warns that batch correction is not recommended — it would remove biological signal along with the batch effect.
If batch and treatment are not collinear:
# DESeq2
dds <- DESeqDataSetFromMatrix(countData, colData, design = ~ batch + treatment)
# edgeR
design <- model.matrix(~ batch + treatment)
# Visualization only (do NOT use for DE):
library(sva)
adjusted <- ComBat(dat = log_counts, batch = batch_vector)If batch and treatment are collinear: do not apply correction. Proceed with standard analysis and acknowledge the limitation.
| File | Description |
|---|---|
report.html |
Self-contained HTML report (main output) |
qc_summary.csv |
Per-sample QC metrics (library size, outlier flags) |
icc_table.csv |
ICC results per covariate with bootstrap CIs |
association_table.csv |
PC–covariate association test results (p-values, effect sizes) |
run_manifest.json |
Full reproducibility record (parameters, MD5 checksums, versions) |
figures/ |
PNG + SVG exports (if --export-figures used) |
report.pdf |
PDF version (if --pdf used) |
All CLI parameters can be specified in a YAML config file:
# batch_detective_config.yaml
counts: counts.csv
metadata: metadata.csv
output_dir: ./results/
n_variable_genes: 2000
n_pcs: 10
min_cpm: 1.0
outlier_pval: 0.001
technical_covariates: [batch, plate, operator]
biological_covariates: [treatment, sex, age]batch-detective run --config batch_detective_config.yamlPaths in the config file are resolved relative to the config file's location. CLI flags override config file values. Config file values override defaults.
batch-detective is designed for use in automated pipelines. Exit code 1 signals detected batch effects — this is a normal result, not an error.
# Correct pipeline pattern:
batch-detective run \
--counts counts.csv \
--metadata metadata.csv \
--output-dir ./results/ \
--technical-covariates batch \
--quiet
EXIT=$?
if [ $EXIT -eq 0 ]; then
echo "No batch effects. Proceeding with standard DE."
elif [ $EXIT -eq 1 ]; then
echo "Batch effects detected. Adding batch to DE model."
elif [ $EXIT -eq 3 ]; then
echo "Validation failed. Check inputs." >&2
exit 1
fi// Nextflow example
process BATCH_DETECTIVE {
input:
path counts
path metadata
output:
path "results/report.html"
path "results/run_manifest.json"
script:
"""
batch-detective run \\
--counts ${counts} \\
--metadata ${metadata} \\
--output-dir results/ \\
--technical-covariates batch \\
--biological-covariates treatment || true
"""
}# Snakemake example
rule batch_detective:
input:
counts="counts.csv",
metadata="metadata.csv"
output:
report="results/report.html",
manifest="results/run_manifest.json"
shell:
"""
batch-detective run \
--counts {input.counts} \
--metadata {input.metadata} \
--output-dir results/ \
--technical-covariates batch \
--biological-covariates treatment || true
"""The
|| trueprevents the pipeline from aborting on exit code 1 (batch detected). Check the manifest for the actual exit code.
batch-detective runs entirely on CPU. Memory usage is typically < 2 GB for datasets up to 20,000 genes × 500 samples.
# Batch submission (SLURM example)
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1
#SBATCH --time=00:30:00
batch-detective run \
--counts counts.csv \
--metadata metadata.csv \
--output-dir ./results/ \
--log-file ./results/run.logWeb UI on HPC (SSH port forwarding):
# On HPC
batch-detective serve --port 8501
# On your local machine (in a separate terminal)
ssh -L 8501:localhost:8501 user@hpc.server.com
# Then open http://localhost:8501 in your local browserYour count matrix contains STAR or HTSeq summary rows (N_unmapped, N_multimapping, __no_feature, etc.). Remove them before running:
import pandas as pd
df = pd.read_csv("counts.csv", index_col=0)
df = df[~df.index.str.startswith(("N_", "__"))]
df.to_csv("counts_clean.csv")If your data is already normalized (RPKM, FPKM, TPM, log-transformed), use --normalized. If it contains floats due to rounding, use --force.
Use --overwrite to replace existing outputs:
batch-detective run ... --overwrite# Linux
sudo apt install libpango-1.0-0 libcairo2 libgdk-pixbuf2.0-0
# macOS
brew install pango cairo gdk-pixbuf
# Windows
# See: https://doc.courtbouillon.org/weasyprint/stable/first_steps.htmlbatch-detective automatically finds the next available port if 8501 is in use. The port used is printed on startup.
Use --dry-run to inspect detected sample IDs before running the full analysis. Check for whitespace, case differences, or trailing characters in your CSV headers.
Gene filtering (CPM threshold) reduces the working set before PCA. For very large matrices, increase --min-cpm or decrease --n-variable-genes to reduce memory usage.
-
Repeated measures / longitudinal designs — Not supported. The tool detects likely subject-ID columns and warns, but statistical tests (Kruskal-Wallis, ICC) assume independent samples. Use DESeq2 LRT, lme4, or dream for repeated measures.
-
ICC underestimation under log1p(CPM) — log1p-CPM normalization is not variance-stabilized (cf. DESeq2 VST). High-expression genes may dominate variance. ICC values derived from log1p-CPM may underestimate true batch effect magnitude by approximately 20–35%.
-
Non-monotonic continuous relationships — Spearman correlation detects monotonic associations only. U-shaped or non-monotonic relationships between continuous covariates and PC scores are not detected.
-
Single-cell, spatial, and ATAC-seq — Not supported. The tool is designed for bulk RNA-seq count matrices.
-
Collinearity suppression — When batch and treatment are strongly collinear (Cramér's V ≥ 0.70), the tool reports ICC = 0 and suppresses PC association tests for the affected covariate. This is the correct behavior — associations cannot be interpreted when covariates are confounded.
-
Low statistical power at small n — With < 15 samples per group, the tool may miss moderate batch effects (ICC < 0.40). Power warnings are displayed in the report when this applies.
-
Fixed vs. random batch effects — ICC(1,1) treats batch labels as randomly sampled from a population. If your batches are fixed (i.e., you care specifically about these batches and would not replicate the study with different ones), ICC values may underestimate true effect magnitude.
If you use batch-detective in published research, please cite:
batch-detective v0.1.0 — Statistical diagnosis of batch effects in bulk RNA-seq count matrices.
And the underlying statistical methods:
- ICC interpretation: Koo, T.K. & Li, M.Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163.
- FDR correction: Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300.
- PCA implementation: Pedregosa, F. et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
- Outlier detection (LedoitWolf): Ledoit, O. & Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2), 365–411.
Pull requests are welcome. Please:
- Run
pytestbefore submitting - Run
flake8 batch_detective/to check style - Add tests for any new functionality
- Update this README if CLI options change
MIT — see LICENSE for details.