Description

Snakemake pipeline for preprocessing single-cell sequencing data

Modularised workflow can be modified and/or extended for different library preparation protocols
Add as a submodule in a bioinformatics project GitHub repository

git submodule add https://github.com/redwanfarooq/preprocessing preprocessing

Update submodule to the latest version

git submodule update --remote preprocessing

Required software

Global environment
Specific modules

Setup

Install software for global environment (requires Anaconda or Miniconda - see installation instructions)
- Download environment YAML
- Create new conda environment from YAML
```
conda env create -f snakemake.yaml
```
Install software for specific module(s)
- Manually install required software from source and check that executables are available in PATH (using which) and/or
- Create new conda environments with required software from YAML (as above - download environment YAMLs) and/or
- Check that required software is available to load as environment modules (using module avail)
Set up pipeline configuration file config/config.yaml (see comments in file for detailed instructions)
Set up profile configuration file profile/config.yaml (see comments in file for detailed instructions)
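
The PATH check mentioned above for manually installed software can also be scripted. The following is a minimal sketch using only the Python standard library; the tool names in `required` are assumptions for illustration and should be replaced with the executables your chosen module actually calls.

```python
import shutil


def missing_executables(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Hypothetical tool list: substitute the executables your module needs.
    required = ["STAR", "chromap"]
    missing = missing_executables(required)
    if missing:
        print("Not found in PATH:", ", ".join(missing))
    else:
        print("All required executables found")
```

`shutil.which` performs the same lookup as the `which` shell command, so an empty result means every listed executable is reachable from the current environment.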

Run

Activate global environment

conda activate snakemake

Execute run.py in root directory

Input

Pipeline requires the following input files/folders:

General

REQUIRED:

Folder(s) containing input BCL files

One folder for each Illumina sequencing run with standard directory structure (RunInfo.xml must be present at top level)
Folders should ideally be named according to default convention for the system e.g. YYYYMMDD_InstrumentID_RunNumber_FlowCellID, but any folder naming convention ending in an underscore followed by a unique ID will suffice

or

Folder(s) containing input FASTQ files

One folder for each Illumina sequencing run with subfolders containing FASTQ files from each library type
Folders should ideally be named according to default convention for the system e.g. YYYYMMDD_InstrumentID_RunNumber_FlowCellID, but any folder naming convention ending in an underscore followed by a unique ID will suffice
Subfolders should be named according to library type (must match exactly with lib_type field entry in runs summary table)
FASTQ files should be named according to default convention e.g. SampleID_Sx_Lxxx_Rx_001.fastq.gz
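
The naming conventions above can be checked before launching the pipeline. This is a sketch only, not the pipeline's own parsing logic: `run_id` and `parse_fastq_name` are hypothetical helper names, and the regular expression assumes the default Illumina pattern SampleID_Sx_Lxxx_Rx_001.fastq.gz (with index reads named I1/I2).

```python
import re

# Default Illumina-style FASTQ name: SampleID_Sx_Lxxx_Rx_001.fastq.gz
FASTQ_RE = re.compile(
    r"^(?P<sample_id>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>[RI]\d)_001\.fastq\.gz$"
)


def run_id(folder_name):
    """Return the unique ID after the last underscore, or None if absent."""
    head, sep, tail = folder_name.rpartition("_")
    return tail if sep and head and tail else None


def parse_fastq_name(file_name):
    """Return the name components as a dict, or None if the name does not match."""
    match = FASTQ_RE.match(file_name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Made-up names for illustration
    print(run_id("20240101_A00123_0042_AHXXXXXXXX"))
    print(parse_fastq_name("PBMC_1_S1_L001_R1_001.fastq.gz"))
```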

Runs summary table in delimited file format (e.g. TSV, CSV) with the following required fields (with headers):

run: run folder name
format: file type (options: BCL, FASTQ)
lib_type: library type (options: GEX, ATAC, ADT, HTO, CRISPR, BCR, TCR)
sample_id: sample ID
sample_index: either index name or literal i7 index sequence - only required if format is BCL
lane: either lane number(s) (separated with spaces if more than one lane used) or * (for all lanes) - only required if format is BCL
sample_index2: literal i5 index sequence (if applicable) - only required if format is BCL, dual indexing used and sample_index is literal i7 index sequence
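
To catch formatting problems early, the field rules listed above can be checked with a short script. The sketch below assumes a tab-delimited file and only covers the rules stated in this section (allowed `format` and `lib_type` values, and the BCL-only fields); `check_runs_table` is a hypothetical helper, not part of the pipeline.

```python
import csv
import io

REQUIRED = ("run", "format", "lib_type", "sample_id")
FORMATS = {"BCL", "FASTQ"}
LIB_TYPES = {"GEX", "ATAC", "ADT", "HTO", "CRISPR", "BCR", "TCR"}


def check_runs_table(handle, delimiter="\t"):
    """Return a list of human-readable problems found in a runs summary table."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    missing = [f for f in REQUIRED if f not in (reader.fieldnames or [])]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]
    problems = []
    for n, row in enumerate(reader, start=2):  # line 1 is the header
        if row["format"] not in FORMATS:
            problems.append(f"line {n}: unknown format {row['format']!r}")
        if row["lib_type"] not in LIB_TYPES:
            problems.append(f"line {n}: unknown lib_type {row['lib_type']!r}")
        if row["format"] == "BCL":
            # sample_index and lane are only required for BCL input
            for field in ("sample_index", "lane"):
                if not (row.get(field) or "").strip():
                    problems.append(f"line {n}: {field} is required for BCL input")
    return problems


if __name__ == "__main__":
    # Made-up example table
    example = io.StringIO(
        "run\tformat\tlib_type\tsample_id\tsample_index\tlane\n"
        "20240101_A00123_0042_AHXXXXXXXX\tBCL\tGEX\tsample1\tSI-TT-A1\t*\n"
    )
    print(check_runs_table(example) or "no problems found")
```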

OPTIONAL:

Index kit CSV files with the following required fields (with headers) - must be provided if any index names are used in sample_index field of runs summary table:

index_name: index set name (e.g. SI-TT-A1); must be first field
*: 1 or more fields specifying index sequences in the following order:
- Dual index kits: i7 sequence, i5 sequence (forward strand), i5 sequence (reverse complement)
- Single index kits: i7 sequence (additional fields if more than 1 sequence per index set)
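
As an illustration of the index kit layout described above, the sketch below loads a kit CSV into a dictionary and tests whether an entry has the three-sequence dual index shape (i7, i5 forward, i5 reverse complement). The helper names and the example index set are hypothetical, and the sequences are made up rather than taken from a real kit.

```python
import csv
import io

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def load_index_kit(handle):
    """Map index_name -> list of index sequences, in file order."""
    reader = csv.reader(handle)
    header = next(reader, None)
    if not header or header[0] != "index_name":
        raise ValueError("first field must be index_name")
    return {
        row[0]: [s.strip().upper() for s in row[1:] if s.strip()]
        for row in reader
        if row
    }


def looks_dual(seqs):
    """True if there are three sequences and the third is the reverse complement of the second."""
    return len(seqs) == 3 and seqs[2] == reverse_complement(seqs[1])


if __name__ == "__main__":
    # Made-up index set name and sequences, for illustration only
    kit = load_index_kit(io.StringIO(
        "index_name,i7,i5_fwd,i5_rc\n"
        "SI-XX-Z9,ACGTACGTAC,AAACCCGGGT,ACCCGGGTTT\n"
    ))
    for name, seqs in kit.items():
        print(name, "dual" if looks_dual(seqs) else "single", seqs)
```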

Module-specific

gex: GEX only protocol

REQUIRED:

Reference files:

STAR genome reference package
Cell barcode whitelist

atac: ATAC only protocol

REQUIRED:

Reference files:

chromap genome reference and index
Cell barcode whitelist

gex_atac: 10X multiome (GEX + ATAC) protocol

REQUIRED:

Reference files:

STAR genome reference package
chromap genome reference and index
Cell barcode whitelist (GEX)
Cell barcode whitelist (ATAC)

cite_seq: CITE-seq protocol

REQUIRED:

Reference files:

STAR genome reference package
Cell barcode whitelist (GEX)

Antibody tag list in CSV format with the following required fields (without headers):

Tag sequence (length 15nt) - must begin at first base in read 2 (if leading bases are present, FASTQ files must be trimmed e.g. TotalSeq-B and TotalSeq-C antibodies)
Tag name
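
The antibody tag list format above (two fields, no header, 15nt tag sequences) is simple enough to verify with a few lines of Python. This is a sketch under stated assumptions: `check_tag_list` is a hypothetical helper, it treats only A, C, G and T as valid bases, and it additionally flags duplicate tag sequences, which the format description does not explicitly forbid.

```python
import csv
import io


def check_tag_list(handle, tag_length=15):
    """Return a list of problems found in a headerless tag list CSV."""
    problems = []
    seen = set()
    for n, row in enumerate(csv.reader(handle), start=1):
        if len(row) < 2:
            problems.append(f"line {n}: expected tag sequence and tag name")
            continue
        seq, name = row[0].strip().upper(), row[1].strip()
        if len(seq) != tag_length or set(seq) - set("ACGT"):
            problems.append(f"line {n}: {seq!r} is not a {tag_length}nt ACGT sequence")
        if seq in seen:
            problems.append(f"line {n}: duplicate tag sequence {seq!r}")
        seen.add(seq)
        if not name:
            problems.append(f"line {n}: empty tag name")
    return problems


if __name__ == "__main__":
    # Made-up tag sequences and names
    example = io.StringIO("ACGTACGTACGTACG,CD3\nTTTTGGGGCCCCAAA,CD4\n")
    print(check_tag_list(example) or "no problems found")
```

The same check applies to the tag lists required by the other antibody-based modules, since they share this format.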

tea_seq: TEA-seq/DOGMA-seq protocol

REQUIRED:

Reference files:

STAR genome reference package
chromap genome reference and index
Cell barcode whitelist (GEX)
Cell barcode whitelist (ATAC)

Antibody tag list in CSV format with the following required fields (without headers):

Tag sequence (length 15nt) - must begin at first base in read 2 (if leading bases are present, FASTQ files must be trimmed e.g. TotalSeq-B and TotalSeq-C antibodies)
Tag name

gex_fb_cellranger: 10X GEX +/- feature barcoding protocol

REQUIRED:

Reference files:

Cell Ranger genome reference package

Feature reference in CSV format - see specifications (if using feature barcoding)

atac_cellranger: 10X ATAC only protocol

REQUIRED:

Reference files:

Cell Ranger genome reference package

gex_atac_cellranger: 10X multiome (GEX + ATAC) protocol

REQUIRED:

Reference files:

Cell Ranger genome reference package

vdj_cellranger: 10X 5' immune profiling protocol

REQUIRED:

Reference files:

Cell Ranger VDJ reference package

tea_seq_cellranger: TEA-seq/DOGMA-seq protocol

REQUIRED:

Reference files:

Cell Ranger genome reference package
Cell barcode whitelist (GEX)

Antibody tag list in CSV format with the following required fields (without headers):

Tag sequence (length 15nt) - must begin at first base in read 2 (if leading bases are present, FASTQ files must be trimmed e.g. TotalSeq-B and TotalSeq-C antibodies)
Tag name

gex_fb_vdj_cellranger: 10X GEX +/- feature barcoding +/- immune profiling protocol (with optional sample hashing)

REQUIRED:

Reference files:

Cell Ranger genome reference package
Cell Ranger VDJ reference package (if using immune profiling)

Feature reference in CSV format - see specifications (if using feature barcoding)
Sample hashing table in delimited file format (e.g. TSV, CSV) with the following required fields (with headers) (if using sample hashing):

sample_id: sample ID (must match sample_id field in runs summary table)
hash_id: hashed sample ID (must be unique for each hashed sample)
ocm_barcode_ids/hashtag_ids/cmo_ids: OCM barcode/hashtag/CMO IDs (if antibody hashtags are used, must match id field in feature reference CSV)
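
The two constraints stated above (each `sample_id` must appear in the runs summary table, each `hash_id` must be unique) can be checked as follows. This is a minimal sketch assuming a tab-delimited file; `check_hashing_table` is a hypothetical helper, and it reads the third field description as meaning that one of the three ID columns is present, which is an interpretation rather than a documented rule.

```python
import csv
import io

ID_COLUMNS = ("ocm_barcode_ids", "hashtag_ids", "cmo_ids")


def check_hashing_table(handle, run_sample_ids, delimiter="\t"):
    """Return a list of problems found in a sample hashing table."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    fields = reader.fieldnames or []
    problems = [
        f"missing column: {c}" for c in ("sample_id", "hash_id") if c not in fields
    ]
    if not any(c in fields for c in ID_COLUMNS):
        problems.append("missing column: one of " + "/".join(ID_COLUMNS))
    if problems:
        return problems
    seen = set()
    for n, row in enumerate(reader, start=2):  # line 1 is the header
        if row["sample_id"] not in run_sample_ids:
            problems.append(
                f"line {n}: sample_id {row['sample_id']!r} not in runs summary table"
            )
        if row["hash_id"] in seen:
            problems.append(f"line {n}: duplicate hash_id {row['hash_id']!r}")
        seen.add(row["hash_id"])
    return problems


if __name__ == "__main__":
    # Made-up sample and hashtag IDs
    table = io.StringIO(
        "sample_id\thash_id\thashtag_ids\n"
        "pool1\tdonorA\tHTO_1\n"
        "pool1\tdonorB\tHTO_2\n"
    )
    print(check_hashing_table(table, {"pool1"}) or "no problems found")
```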

Output

Output directory will be created in specified location with subfolders containing the output of each software tool specified in the module.

Modules

Available modules

gex
atac
gex_atac
cite_seq
tea_seq
gex_fb_cellranger
atac_cellranger
gex_atac_cellranger
vdj_cellranger
tea_seq_cellranger
gex_fb_vdj_cellranger

Adding new module

Add entry to module rule specifications file config/modules.yaml with module name and list of rule names
Add additional rule definition files in modules/rules folder (if needed)

Rule definition file must also assign a list of pipeline target files generated by the rule to a variable with the same name as the rule
Rule definition file must have the same file name as the rule with the file extension .smk

Execute run.py in root directory with --update flag (needs to be repeated if there are any further changes to the module rule specification in config/modules.yaml)
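
Before running with --update, it can help to confirm that every rule named for a module has a matching .smk file. The sketch below assumes the module specification can be represented as a mapping from module name to a list of rule names; the exact layout of config/modules.yaml is not shown here, so treat that structure, the helper name `missing_rule_files` and the example names as assumptions.

```python
from pathlib import Path


def missing_rule_files(modules, rules_dir):
    """Return {module: [rules without a <rule>.smk file]} for incomplete modules.

    `modules` is assumed to map each module name to a list of rule names.
    """
    rules_dir = Path(rules_dir)
    report = {}
    for module, rules in modules.items():
        missing = [r for r in rules if not (rules_dir / f"{r}.smk").is_file()]
        if missing:
            report[module] = missing
    return report


if __name__ == "__main__":
    # Hypothetical module and rule names
    spec = {"my_module": ["my_alignment_rule", "my_counting_rule"]}
    print(missing_rule_files(spec, "modules/rules") or "all rule files present")
```

This only checks that the files exist; it does not verify that each file assigns its target list to a variable named after the rule.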
