hotspot3 is a peak caller for chromatin accessibility data. It is tailored to work on datasets without a control experiment (ATAC-seq and DNase-seq) using adaptive estimation of the background (nonspecific cleavages) with a negative binomial distribution. hotspot3 accounts for variation in both total signal level and signal-to-background ratio along the genome.
The main algorithm steps are:
- Adaptive background modeling using a negative binomial distribution, fitted within locally uniform genomic segments.
- Bayesian segmentation (via BABACHI) to partition the genome into regions with a homogeneous background (i.e., similar signal-to-noise ratio, modeled using a common overdispersion parameter in the background model).
- Per-base statistical testing to assign p-values and estimate FDR for enrichment at each position.
- Signal smoothing using the Maximal Overlap Discrete Wavelet Transform (MODWT) to suppress local noise and normalize fine-scale variability (e.g., transcription factor footprints)
- Hotspot calling, which identifies contiguous regions of signal enrichment at a specified FDR threshold.
- Peak calling, which detects peaks from the smoothed signal and reports those that overlap with significant bases.
hotspot3 is optimized for scalability on large datasets with chromosome-level parallelism and optional reuse of intermediate results.
- Command line interface
- Installation
- Usage example
- Input parameters
- Output files
- Interpreting output
- Performance and resource requirements
- Authors
hotspot3- Call peaks from input data using MODWT-smoothed signal and local background estimation.hotspot3-pvals- (Experimental) Extract raw p-values for regions defined in a reference BED file. This is useful when comparing or aggregating signal across multiple samples at predefined loci (e.g., shared peak positions), enabling downstream consensus ("core") peak calling.
conda (recommended: mamba for faster solves)
python ≥ 3.8
mamba env create -n hotspot3 -f environment.yml
conda activate hotspot3
mamba env create -n hotspot3 -f environment.macos.yml
conda activate hotspot3
Once the environment is activated:
pip install hotspot3
Or to install in the developer mode:
git clone https://github.com/vierstralab/hotspot3.git
cd hotspot3
pip install -e .
You can check that the CLI is working with:
hotspot3 --help
hotspot3 AG10883 \
--chrom_sizes /net/seq/data2/projects/sabramov/SuperIndex/hotspot3/GRCh38_no_alts.nuclear.chrom_sizes \
--mappable_bases /net/seq/data2/projects/sabramov/SuperIndex/GRCh38_no_alts.K36.n150.center_sites_and_extended_blacklist.bed.gz \
--bam AG10883.filtered.cram \
--fdrs 0.01 0.05 0.1 \
--outdir ./AG10883 \
--save_density \
--cpus 6
This command uses a CRAM file to call peaks at 1%, 5%, and 10% FDR, and save cutcount density. It uses 6 threads for parallel processing.
hotspot3 AG10883 \
--chrom_sizes /net/seq/data2/projects/sabramov/SuperIndex/hotspot3/GRCh38_no_alts.nuclear.chrom_sizes \
--mappable_bases /net/seq/data2/projects/sabramov/SuperIndex/GRCh38_no_alts.K36.n150.center_sites_and_extended_blacklist.bed.gz \
--cutcounts ./AG10883/AG10883.cutcounts.bed.gz \
--signal_parquet ./AG10883/debug/AG10883.smoothed_signal.parquet \
--fdrs_parquet ./AG10883/debug/AG10883.fdrs.parquet \
--fdrs 0.001 \
--outdir ./AG10883 \
--cpus 6
This reuses previously computed intermediate files to quickly generate peaks at an additional FDR threshold (e.g., 0.001). Runtime is ~30 seconds.
-
sample_id- Unique identifier for the sample (used for naming outputs). -
--chrom_sizes CHROM_SIZES- Two-column TSV file with chromosome names and sizes (no header). Local path or URL. For GRCh38 you can use GRCh38_no_alts.nuclear.chrom_sizes.- Chromosome names must match those in the BAM/CRAM and reference FASTA.
- The order of chromosomes in this file determines the order in BED, BigWig and BigBed outputs.
- Use UCSC-style naming (e.g.,
chr1,chr2, ...,chrX,chrY) and exclude alternate contigs unless needed.
-
--bam BAM- Path to the input BAM or CRAM file. -
--fdrs FDRS [FDRS ...]- Space separated list of FDR thresholds to generate peak calls and hotspots at.
All higher FDR values will be NaN, so you won’t be able to call peaks at higher thresholds later
without re-running the FDR step (e.g., starting from raw p-values).
--cutcounts CUTCOUNTS- Tabix-indexed file with per-base cut counts. Skips extracting cut counts from BAM/CRAM.--signal_parquet SIGNAL_PARQUET- Path to pre-calculated partitioned parquet file(s) with per-bp smoothed signal. Skips MODWT signal smoothing--pvals_parquet PVALS_PARQUET- Path to pre-calculated partitioned parquet file(s) with per-bp p-values. Skips p-value calculation--fdrs_parquet FDRS_PARQUET- Path to pre-calculated partinioned parquet file(s) with per-bp FDRs. (Experimental) Can be used with multiple_samples_fdr.py to FDR correct across samples.
-
--cpus CPUS- Number of CPUs to use. A high thread count increases memory usage — no benefit from using more CPUs than the number of chromosomes. -
--reference REFERENCE- Path to reference FASTA (required for CRAMs missing sequence dictionary). -
--mappable_bases MAPPABLE_BASES- Three column (chrom, start, end) tabix-indexed BED file listing mappable positions. For GRCh38 you can use K36 mappable bases with ENCODE blacklist regions excluded. GRCh38_no_alts.K36.n150.center_sites_and_extended_blacklist.bed.gz -
--chromosomes CHROMOSOMES [CHROMOSOMES ...]- Restrict to specific chromosomes (for debugging). -
--save_density- Save normalized cut count density as output. -
--debug- Enable debug mode (extra logging). -
--outdir OUTDIR- Path to output directory -
--tempdir TEMPDIR- Path to temp directory. Defaults to the system temp location.
--window WINDOW- Smoothing window size for cut counts.--background_window BACKGROUND_WINDOW- Background window size--signal_quantile SIGNAL_QUANTILE- Fraction of genomic positions to model as a background (default: 0.995). Used to exclude extreme outliers (e.g., regions with unusually high coverage) when fitting the background model. This reduces computational burden by limiting the background estimation to the lower portion of the signal distribution (e.g., the lowest 99.5% of sites if--signal_quantile 0.995).
hotspot3 produces several output files, including signal tracks, peak calls, and internal model diagnostics. Most files are saved in the output directory, while large intermediate files are placed in the debug/ folder for optional inspection or reuse.
-
{sample_id}.cutcounts.bed.gz— Tabix-indexed file with per-base cut counts (~200MB) -
{sample_id}.total_cutcounts— Total number of cut counts observed in the sample (~10KB) -
{sample_id}.fit_stats.bed.gz— Tabix-indexed BED file with background model fit statistics for each segment (~20MB) -
{sample_id}.pvals.parquet— Per-base raw p-values (~1.5GB). Can be used later to aggregate p-values across datasets. -
{sample_id}.normalized_density.bw— Normalized raw signal (cuts per million), generated if--save_densityis specified (~200MB). Useful for visualization. -
{sample_id}.background.bw— Final estimated background level at each position, based on the local negative binomial model (corresponding to thep = 0.005cutoff). Useful for visualization and model inspection. -
{sample_id}.thresholds.bw— Per-base signal threshold used to select background bases during model fitting. Useful for debugging background model behavior. -
{sample_id}.segment_background.bw— Segment-level background estimate, defined as the signal threshold corresponding to thep = 0.005cutoff under the fitted negative binomial model. Useful for interpreting background segmentation.
-
{fdr}/{sample_id}.hotspots.fdr{fdr}.bed.gz— Tabix-indexed file with called hotspots -
{fdr}/{sample_id}.hotspots.fdr{fdr}.bb— Hotspots in BigBed (BED12) format -
{fdr}/{sample_id}.peaks.fdr{fdr}.bed.gz— Tabix-indexed file with called peaks -
{fdr}/{sample_id}.peaks.fdr{fdr}.bb— Peaks in BigBed (BED12) format
The following large intermediate files are always saved to the debug/ subdirectory of the output path. These can be reused for re-calling peaks at additional FDR thresholds or inspected to understand model behavior.
{sample_id}.smoothed_signal.parquet— Per-base MODWT-smoothed signal (~2GB){sample_id}.fdrs.parquet— Per-base FDR values (only for the highest FDR used in the run) (~600MB){sample_id}.fit_params.parquet— Per-base final model parameters (~2GB)
hotspot3 does not automatically delete the debug/ folder upon completion. You can safely remove it to free up disk space if it is no longer needed.
Once hotspot3 has finished running, the most effective way to understand and validate the results is by visualizing key tracks and inspecting background fits.
You can load the following files in IGV or the UCSC Genome Browser for interactive inspection:

-
{sample_id}.normalized_density.bw— (black track) BigWig file with normalized cut count density (in cuts per million). -
{sample_id}.background.bw— (orange track) BigWig file representing the estimated background level at each position (defined by raw p-value > 0.005).
Overlay with the normalized density to visualize how much observed signal exceeds the modeled background. -
{sample_id}.hotspots.fdr{fdr}.bb- called hotspots at each FDR threshold -
{sample_id}.peaks.fdr{fdr}.bb- (green track) Called peaks at each FDR threshold.
The file {sample_id}.fit_stats.bed.gz contains background model parameters for each genomic segment identified by hotspot3. This file helps diagnose modeling issues that may affect the peak calling.
Each row includes:
- Coordinates:
#chr,start,end - Negative binomial parameters:
r,p - Background and total cut counts: Total number of tags in the region and how many are used to produce the best background fit
- Segment SPOT score: Signal Proportion Of Tags. Proportion of tags that were not used to fit the background model (i.e., signal tags)
- Overall SPOT score: Weighted median of segment SPOT scores — a dataset-level quality metric
- Status flags:
refit_with_constraint,success_fit,max_bg_reached
- The segment was initially assigned too much signal (i.e., too high SPOT score).
- It was re-fit using a minimum background proportion constraint to prevent overcalling peaks.
- This commonly occurs in regions where the signal distribution deviates from the negative binomial assumption, such as telomeres or centromeres.
- This is not a failure, but a safeguard to ensure model robustness.
In these cases, hotspot3 applies a conservative re-fit to enforce a maximum signal-to-noise ratio — by default, no more than 5x the global weighted median.
Outlier segments are identified using a robust linear fit (RANSACRegressor) signal-to-noise ratio vs segment length, and then re-fit to avoid inflating background estimates.
- The model failed to fit a negative binomial distribution for this segment (even when using all the data).
- This happens in regions with very low signal.
- ❗Peaks will not be called in these segments.
If many segments are flagged this way, it may indicate insufficient coverage for reliable modeling — consider increasing sequencing depth or focusing on higher-quality regions.
- Even after assigning 99.5% of positions as background, the remaining 0.5% still contained more signal than allowed.
- This means the segment exceeded the maximum signal-to-noise ratio threshold and could not be re-fit under constraints.
This usually indicates:
⚠️ Mapping artifacts (e.g., multimappers, misalignments)⚠️ Collapsed repeats or copy number variants⚠️ Biological outliers (e.g., viral integrations)
Segments with this flag may still produce peaks, but the background is likely overestimated, which can reduce sensitivity.
This flag is rare, typically occurring in a small number of samples (e.g., ~100 out of 16,000 analyzed datasets).
Tip: If many segments are flagged with max_bg_reached = True, consider:
- inspecting affected regions visually,
- excluding problematic chromosomes or regions from analysis.
- increasing
--signal_quantile(e.g., from 0.995 to 0.998)
-
Typical runtime: ~1 hour for a 100 million read CRAM file.
-
Runtime is largely independent of dataset coverage, since most computations are per-base rather than per-read.
-
Recommended input coverage: At least 10 million mapped reads for a somewhat reliable background fit.
-
CPU usage: Parallelized by chromosome. Using more CPUs (e.g., 24 for all chromosomes) reduces wall-clock time but significantly increases memory usage.
-
Memory usage: ~80 GB RAM with 6 CPUs. Can exceed 150 GB when using 24 threads (e.g., one per chromosome).
-
Recommendation: Use a machine with 100–200 GB RAM for high-threaded runs (significantly improves time). Reduce thread count to lower memory use.
Developed by:
- Sergey Abramov
- Alexandr Boytsov
- Jeff Vierstra
Altius Institute for Biomedical Sciences