A Snakemake pipeline for extracting and mapping variant/barcode sequences from Nanopore reads.
The pipeline runs three steps in sequence:
- A1 — Convert raw input (`bam`/`fasta`/`fastq`) to CSV
- A2 — Extract `var` and `bc` sequences using anchor sequences
- A3 — Match variants to a reference using `massive-seq-finder`
Final output: `output_dir/var_bc_reads_named.csv`
```
.
├── snakefile
├── run_snakemake.sh
├── requirements.txt
├── config/
│   ├── config.yaml   # User configuration — edit this
│   └── sbatch.yaml   # SLURM resource settings
└── module/
    ├── A1.transform_to_csv.py
    ├── A2.Split_Reads_Dask.py
    └── A3.matching_variant.py
```
```
pip install -r requirements.txt
```
`massive-seq-finder` is installed directly from GitHub and is listed in `requirements.txt`.
Only config/config.yaml needs to be edited between runs.
Required:

| Key | Description |
|---|---|
| `samples` | Path to the input file |
| `input_type` | `bam`, `fasta`, or `fastq` |
| `anchor_seqs` | Four comma-separated anchor sequences |
| `oligo_length` | Expected variant sequence length |
| `bc_length` | Expected barcode length |
| `var_dict` | Path to reference variant table (must contain a `seq` column) |
| `output_dir` | Directory where outputs will be written |
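To make the `anchor_seqs`, `oligo_length`, and `bc_length` keys concrete, here is a minimal sketch of anchor-based extraction in the style of A2, assuming the first anchor pair flanks the variant and the second pair flanks the barcode. The function name and anchor layout are illustrative only, not the module's actual API:

```python
def extract_var_bc(read, anchors, oligo_length, bc_length):
    """Locate four anchors in a read and slice out the variant
    (between anchors 1 and 2) and barcode (between anchors 3 and 4).
    Returns (var, bc), or None if an anchor is missing or a slice
    has an unexpected length.
    """
    a1, a2, a3, a4 = anchors
    i1 = read.find(a1)
    if i1 == -1:
        return None
    var_start = i1 + len(a1)
    i2 = read.find(a2, var_start)
    if i2 == -1:
        return None
    var = read[var_start:i2]
    i3 = read.find(a3, i2 + len(a2))
    if i3 == -1:
        return None
    bc_start = i3 + len(a3)
    i4 = read.find(a4, bc_start)
    if i4 == -1:
        return None
    bc = read[bc_start:i4]
    if len(var) != oligo_length or len(bc) != bc_length:
        return None
    return var, bc

# Toy read: AAAA/TTTT flank a 6 nt variant, GGGG/CCCC flank a 4 nt barcode.
read = "xx" + "AAAA" + "CGTACG" + "TTTT" + "GGGG" + "ACGT" + "CCCC" + "yy"
print(extract_var_bc(read, ("AAAA", "TTTT", "GGGG", "CCCC"), 6, 4))
```

Reads where any anchor is absent, or where the extracted slice does not match the configured lengths, are dropped rather than guessed at.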
Runtime (optional, with defaults):

| Key | Default | Description |
|---|---|---|
| `conda_env` | `anaconda_env` | Conda environment to activate |
| `snakemake_jobs` | `45` | Max concurrent SLURM jobs |
| `n_workers` | `32` | Dask workers for A2 |
| `mem_per_worker` | `9GB` | Memory per Dask worker |
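Putting the keys together, a `config/config.yaml` might look like the following. All paths, anchor sequences, and lengths are placeholders, not values shipped with the repository:

```yaml
# --- Required ---
samples: /data/run1/reads.bam                 # input file (placeholder path)
input_type: bam                               # bam, fasta, or fastq
anchor_seqs: "ACGTAC,TTGGCC,GGAATT,CCTTAA"    # four comma-separated anchors
oligo_length: 120                             # expected variant length
bc_length: 16                                 # expected barcode length
var_dict: /data/ref/variants.csv              # must contain a `seq` column
output_dir: /data/run1/output

# --- Runtime (optional, defaults shown) ---
conda_env: anaconda_env
snakemake_jobs: 45
n_workers: 32
mem_per_worker: 9GB
```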
Submit to SLURM cluster:
```
bash run_snakemake.sh
```
Dry-run (no execution):
```
snakemake -n --configfile config/config.yaml
```
Generate DAG image:
```
snakemake --dag --configfile config/config.yaml | dot -Tpng > dag.png
```
All outputs are written to `output_dir/`:
| File | Description |
|---|---|
| `raw_reads.csv` | Converted reads from A1 |
| `var_bc_reads.csv` | Extracted variant/barcode sequences from A2 |
| `var_bc_reads_named.csv` | Variant-matched and named output from A3 |
| `variant_matching.log` | Matching log from A3 |
- A3 uses the `MSF` class from `massive-seq-finder` for nearest-reference matching.
- If the variant table contains a `names`, `name`, `ref_name`, or `variant_name` column, it is automatically mapped to the `var_name` output column.
- SLURM job logs are written to `logs/`.
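The nearest-reference matching itself is delegated to `massive-seq-finder`'s `MSF` class. As a rough illustration of the underlying idea only (not the library's API), a brute-force nearest match by edit distance looks like this:

```python
def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def nearest_reference(query, references):
    """Return (name, distance) of the reference seq closest to query."""
    return min(((name, edit_distance(query, seq))
                for name, seq in references.items()),
               key=lambda t: t[1])

refs = {"v1": "ACGTACGT", "v2": "ACGGACGT", "v3": "TTTTTTTT"}
print(nearest_reference("ACGTACGA", refs))  # → ('v1', 1)
```

A dedicated tool avoids this O(reads x references) scan with indexing, which is why the pipeline uses `massive-seq-finder` rather than pairwise comparison.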