Snakemake pipeline for merging, normalisation and integration of multisample/multimodal single cell datasets
- Modularised workflow can be modified and/or extended for different experiment designs
- Add as a submodule in a bioinformatics project GitHub repository
git submodule add https://github.com/redwanfarooq/single_cell_multi single_cell_multi
- Update submodule to the latest version
git submodule update --remote single_cell_multi
- Global environment
- Specific modules
- R >=v4.3
- docopt v0.7.1
- logger v0.3.0
- qs v0.26.3
- hdf5r v1.3.11
- tidyverse v2.0.0
- furrr v0.3.1
- MatrixExtra v0.1.15
- Seurat v5.1.0
- Signac v1.14.0
- harmony v1.2.1
- Bioconductor v3.18
- DropletUtils
- batchelor
- ensembldb
- AnnotationDbi
- EnsDb.Hsapiens.v86
- org.Hs.eg.db
- GenomicRanges
- GenomeInfoDb
- glmGamPoi
- MACS2 >=v2.2.9
- scvi-tools >=v1.3.2
- Install software for global environment (requires Anaconda or Miniconda - see installation instructions)
- Download environment YAML
- Create new conda environment from YAML
conda env create -f snakemake.yaml
- Install software for specific module(s)
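The environment YAML referenced above might look something like the following sketch (channels, versions and pinned dependencies are illustrative assumptions; use the `snakemake.yaml` shipped with the repository):

```yaml
# Hypothetical sketch of snakemake.yaml - the repository's actual file may differ
name: snakemake
channels:
  - conda-forge
  - bioconda
dependencies:
  - python>=3.10
  - snakemake
```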
- Manually install required software from source and check that executables are available in PATH (using `which`) and/or
- Create new conda environments with required software from YAML (as above - download environment YAMLs) and/or
- Check that required software is available to load as environment modules (using `module avail`)
- Set up pipeline configuration file config/config.yaml (see comments in file for detailed instructions)
- Set up profile configuration file profile/config.yaml (see comments in file for detailed instructions)
- Activate global environment
conda activate snakemake
- Execute run.py in root directory
Pipeline requires the following input files/folders:
REQUIRED:
- Post-QC multimodal single cell count matrices in 10x-formatted HDF5 file for each sample
- Cell barcode metadata table in delimited file format (e.g. TSV, CSV) for each sample
- 10x-formatted indexed fragments file for each sample with ATAC data (if applicable)
- Input metadata table in delimited file format (e.g. TSV, CSV) with the following required fields (with headers):
- sample_id: sample ID
- hdf5: path to 10x-formatted HDF5 file
- metadata: path to cell barcode metadata table
- fragments: path to 10x-formatted indexed fragments file (if applicable)
- summits: path to MACS2 peak summits BED file (if applicable)
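An input metadata table in TSV format might look like the following (sample IDs and file paths are hypothetical, for illustration only):

```tsv
sample_id	hdf5	metadata	fragments	summits
sample1	data/sample1/filtered.h5	data/sample1/barcodes.tsv	data/sample1/fragments.tsv.gz	data/sample1/summits.bed
sample2	data/sample2/filtered.h5	data/sample2/barcodes.tsv	data/sample2/fragments.tsv.gz	data/sample2/summits.bed
```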
Output directory will be created at the specified location with subfolders containing the output of each step in the specified module(s)
Available module(s):
- default
- Add entry to module rule specifications file config/modules.yaml with module name and list of rule names
- Add additional rule definition files in modules/rules folder (if needed)
- Rule definition file must also assign a list of pipeline target files generated by the rule to a variable with the same name as the rule
- Rule definition file must have the same file name as the rule with the file extension .smk
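A rule definition file for a hypothetical rule named `example` could be sketched as follows (saved as modules/rules/example.smk; the rule logic, paths and the `SAMPLES` variable are illustrative assumptions, not taken from the repository):

```snakemake
# modules/rules/example.smk - hypothetical sketch
rule example:
    input:
        "results/{sample}/input.qs"
    output:
        "results/{sample}/example.qs"
    shell:
        "some_command {input} > {output}"

# List of pipeline target files generated by the rule, assigned to a
# variable with the same name as the rule (required by the pipeline);
# SAMPLES is assumed to be defined elsewhere in the workflow
example = [f"results/{sample}/example.qs" for sample in SAMPLES]
```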
- Execute run.py in root directory with the --update flag (needs to be repeated if there are any further changes to the module rule specifications in config/modules.yaml)
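For instance, a new module entry in config/modules.yaml might look like the following sketch (module and rule names here are hypothetical; see the comments in the file itself for the authoritative format):

```yaml
# config/modules.yaml - hypothetical entries mapping module names to rule lists
default:
  - merge
  - normalise
  - integrate
my_module:
  - merge
  - example
```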