thewonlab/Longread_Sequencing_BCmapping

Nanopore BCmapping

A Snakemake pipeline for extracting and mapping variant/barcode sequences from Nanopore reads.

Overview

The pipeline runs three steps in sequence:

  1. A1 — Convert raw input (bam / fasta / fastq) to CSV
  2. A2 — Extract var and bc sequences using anchor sequences
  3. A3 — Match variants to a reference using massive-seq-finder

Final output: output_dir/var_bc_reads_named.csv
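To make step A2 concrete, here is a minimal Python sketch of anchor-based extraction. It is not the code in module/A2.Split_Reads_Dask.py. The function name `extract_var_bc`, the example anchors, and the read layout (variant between anchors 1 and 2, barcode between anchors 3 and 4) are assumptions made for illustration only.

```python
from typing import Optional, Tuple


def extract_var_bc(
    read: str,
    anchors: Tuple[str, str, str, str],
    oligo_length: int,
    bc_length: int,
) -> Optional[Tuple[str, str]]:
    """Return (var, bc) from one read, or None if the read does not fit.

    Assumed layout (hypothetical, not confirmed by the repository):
    anchor1 + var + anchor2 ... anchor3 + bc + anchor4
    """
    a1, a2, a3, a4 = anchors

    # Locate the variant between the first two anchors.
    i1 = read.find(a1)
    if i1 < 0:
        return None
    var_start = i1 + len(a1)
    i2 = read.find(a2, var_start)
    if i2 < 0:
        return None
    var = read[var_start:i2]

    # Locate the barcode between the last two anchors, searching after anchor 2.
    i3 = read.find(a3, i2 + len(a2))
    if i3 < 0:
        return None
    bc_start = i3 + len(a3)
    i4 = read.find(a4, bc_start)
    if i4 < 0:
        return None
    bc = read[bc_start:i4]

    # Keep only reads whose extracted lengths match the configured values.
    if len(var) != oligo_length or len(bc) != bc_length:
        return None
    return var, bc


# Toy anchors and read, made up for this example.
anchors = ("AAGG", "CCTT", "GGAA", "TTCC")
read = "NN" + "AAGG" + "ACGTAC" + "CCTT" + "N" + "GGAA" + "TGCA" + "TTCC" + "NN"
result = extract_var_bc(read, anchors, oligo_length=6, bc_length=4)
print(result)  # ('ACGTAC', 'TGCA')
```

The sketch uses exact string matching and a single strand. Real Nanopore reads contain errors and may come from either strand, so the actual module may well handle mismatches and reverse complements differently.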

Repository Structure

.
├── snakefile
├── run_snakemake.sh
├── requirements.txt
├── config/
│   ├── config.yaml       # User configuration — edit this
│   └── sbatch.yaml       # SLURM resource settings
└── module/
    ├── A1.transform_to_csv.py
    ├── A2.Split_Reads_Dask.py
    └── A3.matching_variant.py

Installation

pip install -r requirements.txt

massive-seq-finder is installed directly from GitHub and is listed in requirements.txt.

Configuration

Only config/config.yaml needs to be edited between runs.

Required:

Key           Description
samples       Path to the input file
input_type    bam, fasta, or fastq
anchor_seqs   Four comma-separated anchor sequences
oligo_length  Expected variant sequence length
bc_length     Expected barcode length
var_dict      Path to reference variant table (must contain a seq column)
output_dir    Directory where outputs will be written
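The required keys can be checked before a run is submitted. The sketch below is a hypothetical helper, not part of the pipeline. It works on an already-loaded config dictionary, and the example paths and anchor sequences are placeholders.

```python
REQUIRED_KEYS = (
    "samples", "input_type", "anchor_seqs",
    "oligo_length", "bc_length", "var_dict", "output_dir",
)
ALLOWED_INPUT_TYPES = {"bam", "fasta", "fastq"}


def validate_config(cfg: dict) -> list:
    """Return a list of problems found; an empty list means the config looks OK."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in cfg]

    if "input_type" in cfg and cfg["input_type"] not in ALLOWED_INPUT_TYPES:
        problems.append(f"input_type must be one of {sorted(ALLOWED_INPUT_TYPES)}")

    if "anchor_seqs" in cfg:
        # Assumes anchor_seqs is stored as a single comma-separated string.
        anchor_list = [a.strip() for a in str(cfg["anchor_seqs"]).split(",")]
        if len(anchor_list) != 4 or not all(anchor_list):
            problems.append("anchor_seqs must contain four comma-separated sequences")

    for key in ("oligo_length", "bc_length"):
        if key in cfg and not (isinstance(cfg[key], int) and cfg[key] > 0):
            problems.append(f"{key} must be a positive integer")

    return problems


# Placeholder values for illustration only.
example_cfg = {
    "samples": "data/reads.fastq",
    "input_type": "fastq",
    "anchor_seqs": "AAGG,CCTT,GGAA,TTCC",
    "oligo_length": 200,
    "bc_length": 20,
    "var_dict": "data/variants.csv",
    "output_dir": "results",
}
print(validate_config(example_cfg))  # []
```

Returning a list of problems, rather than raising on the first one, lets all configuration mistakes be reported in a single pass.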

Runtime (optional, with defaults):

Key             Default       Description
conda_env       anaconda_env  Conda environment to activate
snakemake_jobs  45            Max concurrent SLURM jobs
n_workers       32            Dask workers for A2
mem_per_worker  9GB           Memory per Dask worker
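As a rough illustration of how these defaults combine with user settings, the sketch below fills in missing runtime keys and estimates the total Dask memory for A2. The helper names are hypothetical, and the estimate assumes every worker may use its full mem_per_worker allotment.

```python
RUNTIME_DEFAULTS = {
    "conda_env": "anaconda_env",
    "snakemake_jobs": 45,
    "n_workers": 32,
    "mem_per_worker": "9GB",
}


def with_runtime_defaults(cfg: dict) -> dict:
    """Return a copy of cfg with missing runtime keys filled from the defaults."""
    return {**RUNTIME_DEFAULTS, **cfg}


def total_memory_gb(cfg: dict) -> float:
    """Estimate total Dask memory in GB; only values ending in 'GB' are handled."""
    mem = str(cfg["mem_per_worker"]).strip()
    if not mem.upper().endswith("GB"):
        raise ValueError(f"unsupported memory unit: {mem}")
    return cfg["n_workers"] * float(mem[:-2])


defaults_only = with_runtime_defaults({})
print(total_memory_gb(defaults_only))  # 288.0

smaller = with_runtime_defaults({"n_workers": 8})
print(total_memory_gb(smaller))  # 72.0
```

With the defaults, A2 could use up to 32 × 9 GB = 288 GB, so the SLURM resources in config/sbatch.yaml presumably need to accommodate that.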

Running

Submit to a SLURM cluster:

bash run_snakemake.sh

Dry-run (no execution):

snakemake -n --configfile config/config.yaml

Generate DAG image:

snakemake --dag --configfile config/config.yaml | dot -Tpng > dag.png

Outputs

All outputs are written to output_dir/:

File                    Description
raw_reads.csv           Converted reads from A1
var_bc_reads.csv        Extracted variant/barcode sequences from A2
var_bc_reads_named.csv  Variant-matched and named output from A3
variant_matching.log    Matching log from A3

Notes

  • A3 uses the MSF class from massive-seq-finder for nearest-reference matching.
  • If the variant table contains a names, name, ref_name, or variant_name column, it is automatically mapped to the var_name output column.
  • SLURM job logs are written to logs/.
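The name-column mapping described above can be sketched as follows. This is an illustration using only the standard library, not the code in module/A3.matching_variant.py. The function names, and the precedence order when more than one alias is present, are assumptions.

```python
import csv
import io

# Aliases listed in the notes; the precedence order here is an assumption.
NAME_ALIASES = ("names", "name", "ref_name", "variant_name")


def find_name_column(columns):
    """Return the first alias present in columns, or None if there is none."""
    for alias in NAME_ALIASES:
        if alias in columns:
            return alias
    return None


def load_variant_table(handle):
    """Read a CSV variant table and return rows with 'seq' and 'var_name' keys."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    if "seq" not in columns:
        raise ValueError("variant table must contain a 'seq' column")
    name_col = find_name_column(columns)
    rows = []
    for row in reader:
        rows.append({
            "seq": row["seq"],
            "var_name": row[name_col] if name_col else None,
        })
    return rows


# Toy table made up for this example.
table = io.StringIO("ref_name,seq\nvarA,ACGT\nvarB,TTGA\n")
named_rows = load_variant_table(table)
print(named_rows)
# [{'seq': 'ACGT', 'var_name': 'varA'}, {'seq': 'TTGA', 'var_name': 'varB'}]
```

A table with none of the aliases still loads in this sketch, with var_name left as None.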

About

BC mapping Snakemake pipeline for long-read sequencing
