# CUT&Tag Analysis Pipeline

Snakemake pipeline for CUT&Tag (Cleavage Under Targets and Tagmentation) data analysis.

## Overview

This pipeline processes CUT&Tag sequencing data through the following steps:
- **Quality Control**: FastQC analysis of raw and trimmed reads
- **Read Trimming**: Adapter and quality trimming with Trimmomatic/Cutadapt
- **Alignment**: Mapping to reference genome using Bowtie2
- **Filtering**: Quality filtering and duplicate removal
- **Peak Calling**: Peak identification using MACS2 and SEACR
- **Annotation**: Peak annotation with genomic features
- **Visualization**: BigWig generation and quality plots
- **Quality Assessment**: Fragment size analysis, TSS enrichment, correlation analysis
## Directory Structure

```
cut_tag/
├── workflow/
│   ├── Snakefile          # Main workflow
│   ├── rules/             # Individual rule modules
│   │   ├── qc.smk
│   │   ├── trimming.smk
│   │   ├── alignment.smk
│   │   ├── filtering.smk
│   │   ├── peaks.smk
│   │   └── visualization.smk
│   ├── scripts/           # Custom scripts
│   └── envs/              # Conda environments
├── config/
│   ├── config.yaml        # Pipeline configuration
│   └── samples.tsv        # Sample information
├── data/                  # Input FASTQ files
├── resources/             # Reference genome and annotations
├── results/               # Output files
└── README.md              # This file
```
## Installation

- Clone this repository:

  ```
  git clone <repository-url>
  cd cut_tag
  ```

- Install Snakemake (if not already installed):

  ```
  conda install -c conda-forge -c bioconda snakemake
  ```

## Input Data

Place your paired-end FASTQ files in the `data/` directory. The pipeline supports flexible naming conventions.
**Default pattern** (with `_001` suffix):

```
{sample_id}_R1_001.fastq.gz
{sample_id}_R2_001.fastq.gz
```

**Alternative patterns**: Modify `fastq_suffix` in `config/config.yaml`:

- For files without the `_001` suffix: change it to `.fastq.gz`
- For other patterns: adjust accordingly
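Before launching the pipeline, it can be worth confirming that every R1 file has a matching R2 mate. The snippet below is a sketch assuming the default `{sample_id}_R1_001.fastq.gz` pattern; adjust the suffix if you changed `fastq_suffix`:

```shell
# Verify every R1 FASTQ in data/ has a matching R2 mate
# (sketch; assumes the default {sample_id}_R1_001.fastq.gz pattern).
missing=0
for r1 in data/*_R1_001.fastq.gz; do
    [ -e "$r1" ] || continue                      # glob matched nothing
    r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"   # derive the mate's name
    if [ ! -e "$r2" ]; then
        echo "Missing mate for: $r1"
        missing=$((missing + 1))
    fi
done
echo "$missing unpaired R1 file(s)"
```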
### Sample Sheet

Edit `config/samples.tsv` with your sample information:

```
sample_id     condition  replicate  control   antibody
H3K4me3_rep1  H3K4me3    1          IgG_rep1  H3K4me3
H3K27ac_rep1  H3K27ac    1          IgG_rep1  H3K27ac
IgG_rep1      IgG        1          NA        IgG
```

### Pipeline Configuration

Edit `config/config.yaml` to specify:
- Reference genome files
- Analysis parameters
- Quality thresholds
### Reference Genome

Download and prepare your reference genome files:

```
# Example for mouse mm10
mkdir -p resources/genome
cd resources/genome

# Download genome fasta
wget http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
gunzip mm10.fa.gz
mv mm10.fa genome.fa

# Download GTF annotation
wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.annotation.gtf.gz
gunzip gencode.vM25.annotation.gtf.gz
mv gencode.vM25.annotation.gtf genome.gtf
```

## Running the Pipeline

Run a dry-run to check the workflow:
```
snakemake --dry-run --cores 1
```

Execute the pipeline:
```
# Local execution
snakemake --cores 8 --use-conda

# With resource limits
snakemake --cores 8 --use-conda --resources mem_mb=32000

# On a cluster (example for SLURM)
snakemake --profile slurm --cores 100
```

## Configuration Reference

**Reference Genome**:
- `genome.fasta`: Reference genome FASTA file
- `genome.gtf`: Gene annotation GTF file
- `genome.effective_size`: Effective genome size for MACS2
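A common rough approximation of the effective genome size is the number of non-N bases in the reference FASTA. The sketch below uses only standard shell tools and assumes the `resources/genome/genome.fa` path from the setup step above; dedicated tools give more refined estimates:

```shell
# Estimate the effective genome size as the count of non-N bases
# in the reference FASTA (simple approximation; sketch).
fasta=resources/genome/genome.fa
if [ -f "$fasta" ]; then
    grep -v '^>' "$fasta" | tr -d 'Nn\n' | wc -c
fi
```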
**Quality Filtering**:

- `filtering.min_quality`: Minimum mapping quality (default: 30)
- `filtering.max_fragment_size`: Maximum fragment size for nucleosome-free reads (default: 120)
**Peak Calling**:

- `macs2.extra_params`: Additional MACS2 parameters
- `seacr.threshold`: FDR threshold for SEACR peak calling
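Putting these keys together, a `config/config.yaml` might look like the sketch below. The key names follow the list above; the values (paths, thresholds, the mm10 effective genome size) are illustrative, not prescriptions:

```yaml
genome:
  fasta: resources/genome/genome.fa
  gtf: resources/genome/genome.gtf
  effective_size: 1.87e9      # MACS2 preset for mouse; set to match your genome

filtering:
  min_quality: 30             # minimum mapping quality
  max_fragment_size: 120      # nucleosome-free fragment cutoff

macs2:
  extra_params: ""            # extra CLI flags passed to MACS2

seacr:
  threshold: 0.01             # FDR threshold for SEACR
```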
## Key Outputs

- `results/qc/multiqc_report.html`: Comprehensive QC report
- `results/peaks/macs2/{sample}_peaks.narrowPeak`: MACS2 peaks
- `results/peaks/seacr/{sample}.stringent.bed`: SEACR peaks
- `results/bigwig/{sample}.bw`: Signal tracks for visualization
- `results/peaks/consensus_peaks.bed`: Consensus peaks across samples
### Output Directory Structure

```
results/
├── fastqc/          # FastQC reports
├── trimmed/         # Trimmed FASTQ files
├── aligned/         # BAM files
├── filtered/        # Quality-filtered BAMs
├── peaks/           # Peak calling results
│   ├── macs2/
│   ├── seacr/
│   └── annotated/
├── bigwig/          # BigWig signal tracks
└── qc/              # Quality control reports
```
## Quality Control

The pipeline generates several QC metrics:

- **Fragment Size Distribution**: Nucleosome-free vs nucleosome-bound fragments
- **TSS Enrichment**: Signal enrichment around transcription start sites
- **Library Complexity**: Assessment of PCR duplication rates
- **Peak Statistics**: Number and distribution of identified peaks
- **Sample Correlation**: Correlation between replicates
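As a concrete illustration of the fragment-size split, the snippet below classifies a list of fragment lengths (one integer per line, e.g. extracted from properly paired reads) into nucleosome-free (< 120 bp, matching `filtering.max_fragment_size`) versus nucleosome-bound fragments. The input file name is a placeholder:

```shell
# Classify fragment lengths (one integer per line) into
# nucleosome-free (<120 bp) vs nucleosome-bound (>=120 bp).
lengths=fragment_lengths.txt   # placeholder input file
if [ -f "$lengths" ]; then
    awk '{ if ($1 < 120) nf++; else nb++ }
         END { printf "nucleosome-free: %d\nnucleosome-bound: %d\n", nf, nb }' "$lengths"
fi
```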
## Advanced Usage

### Using a Single Peak Caller

To use only MACS2 or SEACR, modify the `all` rule in `workflow/Snakefile`.
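For example, an `all` rule restricted to MACS2 output might look like this sketch; the exact target paths and the `samples` variable depend on how the Snakefile defines them:

```python
# Sketch of an `all` rule requesting only MACS2 peaks
# (target paths and the `samples` list are illustrative).
rule all:
    input:
        expand("results/peaks/macs2/{sample}_peaks.narrowPeak",
               sample=samples),
        "results/qc/multiqc_report.html",
```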
### Adding New Samples

- Add FASTQ files to `data/`
- Update `config/samples.tsv`
- Re-run the pipeline (Snakemake will process only new samples)
### Cluster Execution

For large datasets, use cluster execution:

```
# Create cluster profile
mkdir -p ~/.config/snakemake/slurm
# Add cluster configuration files

# Submit to cluster
snakemake --profile slurm --jobs 100
```

## Troubleshooting

- **Memory errors**: Increase the `--resources mem_mb=` parameter
- **Missing conda packages**: Update the environment YAML files
- **Reference genome errors**: Verify file paths in `config.yaml`
- **Peak calling failures**: Check input BAM file quality and parameters
### Debugging

- Check `results/logs/` for detailed error messages
- Use `snakemake --reason` to understand rule execution
- Run with `--verbose` for detailed output
## Citation

If you use this pipeline, please cite:
- Snakemake: Köster, Johannes and Rahmann, Sven. "Snakemake - A scalable bioinformatics workflow engine". Bioinformatics 2012.
- CUT&Tag Protocol: Kaya-Okur et al. "CUT&Tag for efficient epigenomic profiling of small samples and single cells". Nature Communications 2019.
## License

This pipeline is released under the MIT License.
## Support

For questions and issues:
- Check the troubleshooting section above
- Review Snakemake documentation
- Open an issue in this repository