Snakemake pipeline for preprocessing single cell sequencing data
- Modularised workflow can be modified and/or extended for different library preparation protocols
- Add as a submodule in a bioinformatics project GitHub repository
git submodule add https://github.com/redwanfarooq/preprocessing preprocessing
- Update submodule to the latest version
git submodule update --remote preprocessing
- Global environment
- Specific modules
- Install software for global environment (requires Anaconda or Miniconda - see installation instructions)
- Download environment YAML
- Create new conda environment from YAML
conda env create -f snakemake.yaml
- Install software for specific module(s)
- Manually install required software from source and check that executables are available in PATH (using which) and/or
- Create new conda environments with required software from YAML (as above - download environment YAMLs) and/or
- Check that required software is available to load as environment modules (using module avail)
- Set up pipeline configuration file config/config.yaml (see comments in file for detailed instructions)
- Set up profile configuration file profile/config.yaml (see comments in file for detailed instructions)
- Activate global environment
conda activate snakemake
- Execute run.py in root directory
Pipeline requires the following input files/folders:
REQUIRED:
- Folder(s) containing input BCL files
- One folder for each Illumina sequencing run with standard directory structure (RunInfo.xml must be present at top level)
- Folders should ideally be named according to default convention for the system e.g. YYYYMMDD_InstrumentID_RunNumber_FlowCellID, but any folder naming convention ending in an underscore followed by a unique ID will suffice
or
- Folder(s) containing input FASTQ files
- One folder for each Illumina sequencing run with subfolders containing FASTQ files from each library type
- Folders should ideally be named according to default convention for the system e.g. YYYYMMDD_InstrumentID_RunNumber_FlowCellID, but any folder naming convention ending in an underscore followed by a unique ID will suffice
- Subfolders should be named according to library type (must match exactly with lib_type field entry in runs summary table)
- FASTQ files should be named according to default convention e.g. SampleID_Sx_Lxxx_Rx_001.fastq.gz
- Runs summary table in delimited file format (e.g. TSV, CSV) with the following required fields (with headers):
- run: run folder name
- format: file type (options: BCL, FASTQ)
- lib_type: library type (options: GEX, ATAC, ADT, HTO, CRISPR, BCR, TCR)
- sample_id: sample ID
- sample_index: either index name or literal i7 index sequence - only required if format is BCL
- lane: either lane number(s) (separated with spaces if more than one lane used) or * (for all lanes) - only required if format is BCL
- sample_index2: literal i5 index sequence (if applicable) - only required if format is BCL, dual indexing used and sample_index is literal i7 index sequence
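The field rules above can be checked before launching the pipeline. The following is a minimal sketch, not part of the pipeline itself: the function name check_runs_table is hypothetical, it assumes a tab-delimited table by default, and it assumes the format and lib_type options are matched exactly as written above.

```python
import csv
import io

# Field names and allowed values are taken from the list above.
REQUIRED_FIELDS = ["run", "format", "lib_type", "sample_id"]
BCL_FIELDS = ["sample_index", "lane"]  # only required if format is BCL
FORMATS = {"BCL", "FASTQ"}
LIB_TYPES = {"GEX", "ATAC", "ADT", "HTO", "CRISPR", "BCR", "TCR"}


def check_runs_table(text, delimiter="\t"):
    """Return a list of human-readable problems found in a runs summary table."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    header = reader.fieldnames or []
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in header]
    if problems:
        return problems
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["format"] not in FORMATS:
            problems.append(f"line {line_no}: unknown format {row['format']!r}")
        if row["lib_type"] not in LIB_TYPES:
            problems.append(f"line {line_no}: unknown lib_type {row['lib_type']!r}")
        if row["format"] == "BCL":
            for name in BCL_FIELDS:
                if not (row.get(name) or "").strip():
                    problems.append(f"line {line_no}: {name} is required when format is BCL")
    return problems


# Made-up run folder names and sample IDs, for illustration only.
example = (
    "run\tformat\tlib_type\tsample_id\tsample_index\tlane\n"
    "20240101_INSTR1_0001_FLOWCELL1\tBCL\tGEX\tS1\tSI-TT-A1\t1 2\n"
    "20240102_INSTR1_0002_FLOWCELL2\tFASTQ\tADT\tS1\t\t\n"
    "20240103_INSTR1_0003_FLOWCELL3\tBCL\tRNA\tS2\tSI-TT-A2\t\n"
)
for problem in check_runs_table(example):
    print(problem)
```

For a CSV table, pass delimiter="," instead. The sketch does not check sample_index2, because whether it is needed depends on the indexing scheme used.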
OPTIONAL:
- Index kit CSV files with the following required fields (with headers) - must be provided if any index names are used in sample_index field of runs summary table:
- index_name: index set name (e.g. SI-TT-A1); must be first field
- *: 1 or more fields specifying index sequences in the following order:
- Dual index kits: i7 sequence, i5 sequence (forward strand), i5 sequence (reverse complement)
- Single index kits: i7 sequence (additional fields if more than 1 sequence per index set)
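As an illustration of the dual index kit layout described above, the sketch below loads an index kit CSV into a lookup keyed by index set name and checks that the third sequence column is the reverse complement of the second. The helper names (load_index_kit, reverse_complement), the column headers other than index_name, the index set name EXAMPLE-A1 and the sequences are all placeholders, not values from a real kit.

```python
import csv
import io

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def load_index_kit(text):
    """Map each index set name to its list of index sequences."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if header[0] != "index_name":
        raise ValueError("index_name must be the first field")
    # Remaining fields are kept in file order: i7, i5 (forward), i5 (reverse complement)
    # for dual index kits, or one or more i7 sequences for single index kits.
    return {row[0]: row[1:] for row in reader if row}


# Placeholder kit with a single, made-up dual index set.
kit = load_index_kit(
    "index_name,i7,i5_forward,i5_reverse_complement\n"
    "EXAMPLE-A1,ACGTACGTAC,AACCGGTTAA,TTAACCGGTT\n"
)
i7, i5_forward, i5_rc = kit["EXAMPLE-A1"]
print(i5_rc == reverse_complement(i5_forward))
```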
gex:
REQUIRED:
- Reference files:
- STAR genome reference package
- Cell barcode whitelist
atac:
REQUIRED:
- Reference files:
- chromap genome reference and index
- Cell barcode whitelist
gex_atac:
REQUIRED:
- Reference files:
- STAR genome reference package
- chromap genome reference and index
- Cell barcode whitelist (GEX)
- Cell barcode whitelist (ATAC)
cite_seq:
REQUIRED:
- Reference files:
- STAR genome reference package
- Cell barcode whitelist (GEX)
- Antibody tag list in CSV format with the following required fields (without headers):
- Tag sequence (length 15nt) - must begin at first base in read 2 (if leading bases are present, FASTQ files must be trimmed e.g. TotalSeq-B and TotalSeq-C antibodies)
- Tag name
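An antibody tag list in this format can be sanity-checked with a few lines of Python. This is a sketch under stated assumptions: the function name check_tag_list is hypothetical, tag sequences are assumed to contain only A, C, G and T, and the tag names in the example are made up. A header row would be reported as an invalid sequence, which is useful because the file must not have headers.

```python
import csv
import io

TAG_LENGTH = 15  # tag sequence length stated above


def check_tag_list(text):
    """Return a list of problems found in a headerless tag list (sequence, name)."""
    problems = []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 fields, found {len(row)}")
            continue
        sequence = row[0].strip()
        # Assumption: only unambiguous bases are allowed in a tag sequence.
        if len(sequence) != TAG_LENGTH or set(sequence.upper()) - set("ACGT"):
            problems.append(f"line {line_no}: {sequence!r} is not a {TAG_LENGTH} nt A/C/G/T sequence")
    return problems


# Made-up sequences and tag names, for illustration only.
example = "ACGTACGTACGTACG,TagA\nACGTACGT,TagB\n"
for problem in check_tag_list(example):
    print(problem)
```

The check cannot tell whether the tag really begins at the first base of read 2; that still has to be confirmed against the antibody type and handled by trimming if needed.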
tea_seq:
REQUIRED:
- Reference files:
- STAR genome reference package
- chromap genome reference and index
- Cell barcode whitelist (GEX)
- Cell barcode whitelist (ATAC)
- Antibody tag list in CSV format with the following required fields (without headers):
- Tag sequence (length 15nt) - must begin at first base in read 2 (if leading bases are present, FASTQ files must be trimmed e.g. TotalSeq-B and TotalSeq-C antibodies)
- Tag name
gex_fb_cellranger:
REQUIRED:
- Reference files:
- Cell Ranger genome reference package
- Feature reference in CSV format - see specifications (if using feature barcoding)
atac_cellranger:
REQUIRED:
- Reference files:
- Cell Ranger genome reference package
gex_atac_cellranger:
REQUIRED:
- Reference files:
- Cell Ranger genome reference package
vdj_cellranger:
REQUIRED:
- Reference files:
- Cell Ranger VDJ reference package
tea_seq_cellranger:
REQUIRED:
- Reference files:
- Cell Ranger genome reference package
- Cell barcode whitelist (GEX)
- Antibody tag list in CSV format with the following required fields (without headers):
- Tag sequence (length 15nt) - must begin at first base in read 2 (if leading bases are present, FASTQ files must be trimmed e.g. TotalSeq-B and TotalSeq-C antibodies)
- Tag name
gex_fb_vdj_cellranger: 10X GEX +/- feature barcoding +/- immune profiling protocol (with optional sample hashing)
REQUIRED:
- Reference files:
- Cell Ranger genome reference package
- Cell Ranger VDJ reference package (if using immune profiling)
- Feature reference in CSV format - see specifications (if using feature barcoding)
- Sample hashing table in delimited file format (e.g. TSV, CSV) with the following required fields (with headers) (if using sample hashing):
- sample_id: sample ID (must match sample_id field in runs summary table)
- hash_id: hashed sample ID (must be unique for each hashed sample)
- ocm_barcode_ids/hashtag_ids/cmo_ids: OCM barcode/hashtag/CMO IDs (if antibody hashtags are used, must match id field in feature reference CSV)
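The two constraints on the sample hashing table (sample_id must match the runs summary table, hash_id must be unique) can be checked as sketched below. The function name check_hashing_table is hypothetical, the table is assumed to be tab-delimited by default, and the third field is assumed to be exactly one of ocm_barcode_ids, hashtag_ids or cmo_ids. The sketch does not parse the ID field itself, since the separator for multiple IDs is not specified here.

```python
import csv
import io
from collections import Counter

# Assumption: the table carries one of these three field names.
ID_FIELDS = ("ocm_barcode_ids", "hashtag_ids", "cmo_ids")


def check_hashing_table(text, known_sample_ids, delimiter="\t"):
    """Return a list of problems found in a sample hashing table."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    header = reader.fieldnames or []
    problems = [f"missing field: {name}" for name in ("sample_id", "hash_id") if name not in header]
    if not any(name in header for name in ID_FIELDS):
        problems.append("missing field: one of " + "/".join(ID_FIELDS))
    if problems:
        return problems
    rows = list(reader)
    counts = Counter(row["hash_id"] for row in rows)
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        if row["sample_id"] not in known_sample_ids:
            problems.append(f"line {line_no}: sample_id {row['sample_id']!r} not in runs summary table")
        if counts[row["hash_id"]] > 1:
            problems.append(f"line {line_no}: hash_id {row['hash_id']!r} is not unique")
    return problems


# Made-up sample, hash and hashtag IDs, for illustration only.
example = (
    "sample_id\thash_id\thashtag_ids\n"
    "S1\tS1_a\tHTO1\n"
    "S1\tS1_a\tHTO2\n"
    "S9\tS9_a\tHTO3\n"
)
for problem in check_hashing_table(example, known_sample_ids={"S1"}):
    print(problem)
```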
Output directory will be created in specified location with subfolders containing the output of each software tool specified in the module.
- gex
- atac
- gex_atac
- cite_seq
- tea_seq
- gex_fb_cellranger
- atac_cellranger
- gex_atac_cellranger
- vdj_cellranger
- tea_seq_cellranger
- gex_fb_vdj_cellranger
- Add entry to module rule specifications file config/modules.yaml with module name and list of rule names
- Add additional rule definition files in modules/rules folder (if needed)
- Rule definition file must also assign a list of pipeline target files generated by the rule to a variable with the same name as the rule
- Rule definition file must have the same file name as the rule with the file extension .smk
- Execute run.py in root directory with --update flag (needs to be repeated if there are any further changes to the module rule specification in config/modules.yaml)
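Before running the update step, the file naming and variable naming requirements for rule definition files can be checked with a short script. This is a sketch only: check_rule_files is a hypothetical helper, the module specification is passed in as a plain dict (module name mapped to a list of rule names) rather than read from config/modules.yaml, and the variable check is a simple text heuristic that looks for an assignment at the start of a line.

```python
import re
from pathlib import Path


def check_rule_files(modules, rules_dir):
    """Return a list of problems for the rules named in a module specification."""
    rules_dir = Path(rules_dir)
    problems = []
    for module, rules in modules.items():
        for rule in rules:
            # Rule definition file must be named <rule>.smk
            path = rules_dir / f"{rule}.smk"
            if not path.is_file():
                problems.append(f"{module}: {path.name} not found")
                continue
            # Heuristic: look for a top-level assignment such as `my_rule = [...]`.
            pattern = rf"^{re.escape(rule)}\s*=(?!=)"
            if not re.search(pattern, path.read_text(), re.MULTILINE):
                problems.append(f"{module}: {path.name} does not assign a variable named {rule}")
    return problems


# Hypothetical module and rule names, for illustration only.
modules = {"my_module": ["my_rule_a", "my_rule_b"]}
for problem in check_rule_files(modules, "modules/rules"):
    print(problem)
```

The heuristic only confirms that a variable with the rule's name is assigned somewhere in the file; it does not verify that the value is a list of target files.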