A minimal Snakemake pipeline for shotgun metagenomic sequencing analysis, implementing best practices for microbiome data processing.
This pipeline processes paired-end shotgun metagenomic sequencing data through the following steps:
Core Analysis:
- Quality Control (FastQC)
- Host Removal (KneadData)
- Taxonomic Profiling (MetaPhlAn4)
- Functional Profiling (HUMAnN3)
- Statistical Analysis (MaAsLin2)
- Quality Report (MultiQC)
Advanced Analysis (Optional):
- Assembly (MEGAHIT/metaSPAdes)
- Binning (MetaBAT2/CONCOCT/DAS Tool)
- MAG Annotation (Prokka/DRAM/eggNOG-mapper)
- Visualization (Krona/PCA/diversity plots)
Core Tools:
- FastQC: Quality control assessment of raw sequencing data
- KneadData: Host contamination removal and quality filtering
- MetaPhlAn4: Species-level taxonomic profiling
- HUMAnN3: Functional profiling and pathway analysis
- MaAsLin2: Multivariable statistical analysis for microbiome data
- MultiQC: Aggregate reporting of QC metrics
Advanced Tools (Optional):
- MEGAHIT/metaSPAdes: De novo metagenome assembly
- MetaBAT2/CONCOCT: Genome binning for MAG recovery
- DAS Tool: Bin optimization and dereplication
- CheckM: MAG quality assessment
- Prokka: Rapid prokaryotic genome annotation
- DRAM: Comprehensive metabolic annotation
- eggNOG-mapper: Functional annotation using eggNOG database
- Krona: Interactive taxonomic visualization
- Custom scripts: PCA, diversity analysis, and visualization
# Install snakemake
conda install -c bioconda -c conda-forge snakemake

You'll need to download and configure the following databases:
# Download human genome for KneadData
kneaddata_database --download human_genome bowtie2 /path/to/host/genome/

# Download MetaPhlAn database
metaphlan --install --bowtie2db /path/to/metaphlan/db/

# Download ChocoPhlAn database
humann_databases --download chocophlan full /path/to/chocophlan/
# Download UniRef90 database
humann_databases --download uniref uniref90_diamond /path/to/uniref90/

Edit config/config.yaml to specify the correct paths to your reference databases:
databases:
  host_genome: "/path/to/host/genome/index"
  metaphlan_db: "/path/to/metaphlan/db"
  chocophlan_db: "/path/to/chocophlan"
  uniref_db: "/path/to/uniref90"

Edit config/samples.csv with your sample information:
sample,sample_id,forward,reverse,condition
sample01,S001,sample01_R1.fastq.gz,sample01_R2.fastq.gz,treatment
sample02,S002,sample02_R1.fastq.gz,sample02_R2.fastq.gz,control

Organize your raw data as follows:
shotgun_pipe/
├── data/
│   └── raw/
│       ├── sample01_R1.fastq.gz
│       ├── sample01_R2.fastq.gz
│       ├── sample02_R1.fastq.gz
│       └── sample02_R2.fastq.gz
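Given the sample sheet and layout above, a Snakefile typically loads config/samples.csv at the top of the workflow to map sample names to input files. A minimal sketch using Python's stdlib csv module (function name and dict layout are illustrative, not the pipeline's actual code):

```python
import csv

def load_samples(path):
    """Parse the sample sheet into a dict keyed by sample name."""
    samples = {}
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            samples[row["sample"]] = {
                "id": row["sample_id"],
                "forward": row["forward"],
                "reverse": row["reverse"],
                "condition": row["condition"],
            }
    return samples

# In a Snakefile these paths would feed input functions, e.g.:
# SAMPLES = load_samples("config/samples.csv")
# fwd = lambda wc: f"data/raw/{SAMPLES[wc.sample]['forward']}"
```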
# Dry run to check workflow
snakemake --dry-run --cores 1
# Run core pipeline (basic analysis)
snakemake --use-conda --cores 8
# Run advanced pipeline (includes assembly, binning, annotation)
snakemake all_advanced --use-conda --cores 8
# Run with specific number of jobs
snakemake --use-conda --jobs 4

# Core analysis only
snakemake results/multiqc_report.html --use-conda --cores 4
# Taxonomic profiling only
snakemake results/metaphlan/merged_abundance_table.txt --use-conda --cores 4
# Assembly only
snakemake results/assembly/all_contigs_combined.fa --use-conda --cores 8
# Binning only (requires assembly)
snakemake results/binning/sample01/checkm/quality_summary.tsv --use-conda --cores 8
# Visualization only
snakemake results/visualization/metagenomics_dashboard.html --use-conda --cores 4

For SLURM clusters:

snakemake --use-conda --cluster "sbatch -t 24:00:00 -c {threads} --mem={resources.mem_mb}" --jobs 10

results/
├── fastqc/                    # FastQC reports
├── kneaddata/                 # Host-removed reads
├── metaphlan/                 # Taxonomic profiles
│   └── merged_abundance_table.txt
├── humann/                    # Functional profiles
│   ├── merged_genefamilies_relab.tsv
│   └── merged_pathabundance_relab.tsv
├── maaslin2/                  # Statistical analysis
│   ├── taxonomy/
│   └── function/
├── assembly/                  # Assembled contigs (optional)
│   ├── sample01/final.contigs.fa
│   └── all_contigs_combined.fa
├── binning/                   # MAG binning results (optional)
│   └── sample01/
│       ├── bins/              # MetaBAT2 bins
│       ├── concoct_bins/      # CONCOCT bins
│       ├── dastool_bins/      # Optimized bins
│       └── checkm/            # Quality assessment
├── annotation/                # MAG annotations (optional)
│   └── sample01/
│       ├── prokka/            # Prokka annotations
│       ├── dram/              # DRAM annotations
│       └── eggnog/            # eggNOG annotations
├── visualization/             # Plots and visualizations
│   ├── pca_taxonomy.png
│   ├── taxonomy_barplot.png
│   ├── metagenomics_dashboard.html
│   └── *_krona.html
└── multiqc_report.html        # Summary report
Core Analysis:
- multiqc_report.html: Comprehensive QC summary
- metaphlan/merged_abundance_table.txt: Species abundance table
- humann/merged_pathabundance_relab.tsv: Pathway abundance table
- humann/merged_genefamilies_relab.tsv: Gene family abundance table
- maaslin2/taxonomy/: Statistical results for taxonomic data
- maaslin2/function/: Statistical results for functional data
- visualization/pca_taxonomy.png: PCA plot of samples
- visualization/taxonomy_barplot.png: Taxonomic composition plot
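The diversity plots among these outputs are produced by the custom scripts. As an illustration of the underlying calculation, Shannon diversity for one sample can be computed from its abundance values like this (a sketch, not the pipeline's actual script):

```python
import math

def shannon_diversity(abundances):
    """Shannon index H' = -sum(p * ln p), computed over the
    nonzero relative abundances of one sample."""
    total = sum(abundances)
    if total == 0:
        return 0.0
    probs = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in probs)
```

A perfectly even community of n taxa gives the maximum value ln(n); a single-taxon sample gives 0.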
Advanced Analysis:
- assembly/all_contigs_combined.fa: Assembled contigs from all samples
- binning/{sample}/checkm/quality_summary.tsv: MAG quality metrics
- annotation/{sample}/dram/genome_summaries.tsv: Comprehensive MAG annotations
- visualization/metagenomics_dashboard.html: Interactive analysis dashboard
- visualization/*_krona.html: Interactive taxonomic trees
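The CheckM quality table is commonly filtered with MIMAG-style thresholds (≥90% completeness and ≤5% contamination for high-quality MAGs). A sketch, assuming a tab-separated table with `Bin Id`, `Completeness`, and `Contamination` columns (verify the column names against your CheckM version's output):

```python
import csv

def high_quality_bins(tsv_path, min_completeness=90.0, max_contamination=5.0):
    """Return bin IDs passing MIMAG-style high-quality thresholds."""
    keep = []
    with open(tsv_path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if (float(row["Completeness"]) >= min_completeness
                    and float(row["Contamination"]) <= max_contamination):
                keep.append(row["Bin Id"])
    return keep
```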
Modify config/config.yaml to adjust tool-specific parameters:
kneaddata:
  threads: 8
  trimmomatic_options: "SLIDINGWINDOW:4:20 MINLEN:50"
  bowtie2_options: "--very-sensitive --dovetail"
metaphlan:
  threads: 4
  add_viruses: true
  unknown_estimation: false
humann:
  threads: 8
  search_mode: "uniref90"
  bypass_translated_search: false
maaslin2:
  min_abundance: 0.0
  min_prevalence: 0.1
  normalization: "TSS"
  transform: "LOG"
  analysis_method: "LM"

To extend the pipeline, create additional rules in the workflow/rules/ directory and include them in the main Snakefile.
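As configured above, MaAsLin2 normalizes features by total-sum scaling (TSS) and then log-transforms them. Conceptually (an illustrative sketch only; MaAsLin2's own implementation chooses its pseudocount differently, so use the R package for real analyses):

```python
import math

def tss_log(counts, pseudocount=1e-6):
    """Total-sum scaling (each value divided by the sample total),
    followed by log2 with a small pseudocount to handle zeros."""
    total = sum(counts)
    rel = [c / total for c in counts]
    return [math.log2(r + pseudocount) for r in rel]
```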
- Database paths: Ensure all database paths in config.yaml are correct
- Sample names: Check that sample names in samples.csv match your FASTQ file names
- Memory requirements: Increase memory allocation for resource-intensive steps
- Environment conflicts: Use --use-conda to ensure proper environment isolation
Core Analysis:
- RAM: 16-32 GB recommended
- Disk: ~50-100 GB for intermediate files
- CPU: 8-16 cores for optimal performance
Advanced Analysis (with assembly/binning):
- RAM: 64-128 GB recommended (especially for metaSPAdes)
- Disk: ~200-500 GB for intermediate files (depending on dataset size)
- CPU: 16-32 cores for optimal performance
If you use this pipeline, please cite the following tools:
- KneadData: https://github.com/biobakery/kneaddata
- MetaPhlAn4: Blanco-Míguez et al., Nature Biotechnology 2023
- HUMAnN3: Beghini et al., eLife 2021
- MaAsLin2: Mallick et al., PLoS Computational Biology 2021
- Snakemake: Köster & Rahmann, Bioinformatics 2012
For pipeline-specific issues, please check:
- Snakemake logs in .snakemake/log/
- Tool-specific logs in the respective output directories
- Environment setup with conda list in each environment
This pipeline is provided under the MIT License.