Pipelines for processing single-cell whole genome sequencing (CapWGS) and genome-transcriptome co-assay (CapGTA) data. Includes tools for alignment, cell detection, variant calling, and AnnData generation.
git clone https://github.com/SrivatsanLab/SPC_genome
cd SPC_genome
# Create the environment
micromamba create -n spc_genome -f environment.yml
micromamba activate spc_genome

Complete pipeline for single-cell whole genome sequencing data:
- BWA-MEM alignment
- Cell detection from barcode counts
- Single-cell BAM extraction
- GATK variant calling (HaplotypeCaller + GenotypeGVCFs)
- AnnData generation with variant calls
Output: data/{sample}/, data/{sample}/sc_outputs/, results/{sample}/variants.h5ad
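The steps above do not say how cells are called from barcode counts. As a rough illustration only, the sketch below applies a common rank-based heuristic: take the barcodes ranked highest by read count, use a high-percentile count among them as a robust maximum, and keep every barcode whose count is at least a fixed fraction of that value. The function name `detect_cells`, the `expected_cells` parameter, and the 10% ratio are assumptions made for this example, not parameters of the pipeline.

```python
from typing import Dict, List


def detect_cells(counts: Dict[str, int], expected_cells: int, ratio: float = 0.1) -> List[str]:
    """Return barcodes whose read count passes a rank-based cutoff.

    Hypothetical helper: the real pipeline may use a different rule.
    """
    if expected_cells < 1:
        raise ValueError("expected_cells must be at least 1")
    ranked = sorted(counts.values(), reverse=True)
    top = ranked[:expected_cells]
    if not top:
        return []
    # Robust maximum: the count about 1% of the way down the top barcodes.
    # With many barcodes this keeps one extreme barcode from setting the cutoff.
    robust_max = top[int(0.01 * len(top))]
    cutoff = robust_max * ratio
    # Sorted output makes the result deterministic.
    return sorted(bc for bc, n in counts.items() if n >= cutoff)


# Toy input: three well-covered barcodes and two background barcodes.
barcode_counts = {"AAAC": 1000, "AAAG": 900, "AAAT": 800, "CCCA": 5, "CCCG": 3}
print(detect_cells(barcode_counts, expected_cells=3))
```

In practice the cutoff is usually checked against a rank-versus-count plot before being trusted.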
QC-only pipeline for coverage benchmarking (no variant calling):
- BWA-MEM alignment
- Cell detection
- Single-cell BAM extraction
- BigWig generation
- Lorenz curve analysis for coverage uniformity
Output: data/{sample}/sc_outputs/, results/{sample}/ (QC metrics, Lorenz curves)
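Coverage uniformity is often summarized with a Lorenz curve: sort genomic bins by depth, then plot the cumulative fraction of reads against the cumulative fraction of bins. A perfectly uniform cell follows the diagonal, and the Gini coefficient measures the gap from it. The sketch below assumes per-bin read counts are already available as a list (for example, extracted from a BigWig); the function names are illustrative and are not taken from the pipeline's scripts.

```python
from typing import List, Sequence, Tuple


def lorenz_curve(depths: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (x, y): cumulative fraction of bins vs. cumulative fraction of reads."""
    ordered = sorted(depths)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        raise ValueError("need at least one bin with non-zero coverage")
    x, y = [0.0], [0.0]
    running = 0.0
    for i, depth in enumerate(ordered, start=1):
        running += depth
        x.append(i / n)
        y.append(running / total)
    return x, y


def gini(depths: Sequence[float]) -> float:
    """Gini coefficient: 1 minus twice the trapezoid area under the Lorenz curve."""
    x, y = lorenz_curve(depths)
    area = sum((x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2 for i in range(1, len(x)))
    return 1 - 2 * area


# Toy per-bin depths: an even cell and a cell with all reads in one bin.
print(gini([5, 5, 5, 5]))
print(gini([0, 0, 0, 10]))
```

A value near 0 indicates even coverage; values approaching 1 indicate that a few bins hold most of the reads.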
Pipeline for simultaneous genome and transcriptome profiling:
- STAR alignment (separates DNA/RNA by splice junctions)
- Dual-modality cell detection
- Single-cell DNA and RNA BAM extraction
- BCFtools variant calling on DNA
- RNA count matrix generation
- Dual AnnData output (variants + gene expression)
Output: data/{sample}/sc_outputs/, results/{sample}/variants.h5ad, results/{sample}/rna_counts.h5ad
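The DNA/RNA separation step relies on splice junctions. In SAM/BAM records, a spliced alignment carries an `N` (skipped reference region) operation in its CIGAR string, so one simple way to sketch the idea is to label reads with an `N` as RNA and all others as DNA. Whether the pipeline applies exactly this rule, or adds further filters, is an assumption; the helper names below are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_spliced(cigar: str) -> bool:
    """True if the alignment skips a reference region (CIGAR 'N'), i.e. spans an intron."""
    return any(op == "N" for _, op in CIGAR_OP.findall(cigar))


def assign_modality(cigar: str) -> str:
    """Label a read 'RNA' if its alignment is spliced, otherwise 'DNA' (assumed rule)."""
    return "RNA" if is_spliced(cigar) else "DNA"


for cigar in ["100M", "50M1000N50M", "5S95M"]:
    print(cigar, assign_modality(cigar))
```

Note that under this rule an RNA read lying entirely within one exon would be labeled DNA, so the labels are best read as "spliced" versus "unspliced".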
./CapWGS_PP.sh \
-o sample_name \
-1 /path/to/read1.fastq.gz \
-2 /path/to/read2.fastq.gz \
-r 1000000000 \
-g /shared/biodata/reference/GATK/hg38/

Required arguments:
- -o: Sample name (creates data/{sample}/ and results/{sample}/)
- -1: Read 1 FASTQ file(s)
- -2: Read 2 FASTQ file(s)
- -r: Total read count
- -g: Reference genome directory
Optional arguments:
- -s: Scripts directory (default: .)
- -n: Number of chunks for parallelization (default: 500)
- -t: Temporary directory (default: /hpc/temp/srivatsan_s/SPC_genome_preprocessing/{sample}/)
- -h: Show help message
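How the pipeline combines the total read count (-r) with the number of chunks (-n) is not spelled out here. A plausible reading, and only an assumption, is that the reads are divided evenly across chunks so each chunk can be processed in parallel. The sketch below shows that arithmetic; the function names are hypothetical.

```python
FASTQ_LINES_PER_READ = 4  # each FASTQ record spans four lines


def reads_per_chunk(total_reads: int, n_chunks: int = 500) -> int:
    """Ceiling division, so the last chunk absorbs any remainder."""
    if total_reads < 0 or n_chunks < 1:
        raise ValueError("total_reads must be >= 0 and n_chunks >= 1")
    return -(-total_reads // n_chunks)


def lines_per_chunk(total_reads: int, n_chunks: int = 500) -> int:
    """Number of FASTQ lines to place in each chunk (assumed even split)."""
    return FASTQ_LINES_PER_READ * reads_per_chunk(total_reads, n_chunks)


# Values from the usage example: -r 1000000000 with the default of 500 chunks.
print(reads_per_chunk(1_000_000_000))
print(lines_per_chunk(1_000_000_000))
```

With the example values this gives 2,000,000 reads (8,000,000 FASTQ lines) per chunk.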
All pipelines support the same argument structure and can optionally use config.yaml for default values.
Create config.yaml to set defaults:
processing:
  n_chunks: 500
  tmp_dir: /hpc/temp/srivatsan_s/SPC_genome_preprocessing
reference:
  genome_dir: /shared/biodata/reference/GATK/hg38/
data:
  read1: /path/to/read1.fastq.gz
  read2: /path/to/read2.fastq.gz
  read_count: 1000000000
output:
  sample_name: my_sample

Command-line arguments override config values.
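The precedence rule can be sketched as a lookup that prefers a command-line value and falls back to the config file. The example below works on an already-parsed config dictionary so that it needs no YAML library. The mapping from flags to config keys is inferred from the argument list and the example config, and `resolve_settings` is a hypothetical name, not a function in this repository.

```python
from typing import Any, Dict

# Assumed mapping from command-line flag to (section, key) in config.yaml.
FLAG_TO_CONFIG = {
    "o": ("output", "sample_name"),
    "1": ("data", "read1"),
    "2": ("data", "read2"),
    "r": ("data", "read_count"),
    "g": ("reference", "genome_dir"),
    "n": ("processing", "n_chunks"),
    "t": ("processing", "tmp_dir"),
}


def resolve_settings(config: Dict[str, Dict[str, Any]], cli: Dict[str, Any]) -> Dict[str, Any]:
    """Return one value per setting: the CLI value if given, else the config value."""
    resolved = {}
    for flag, (section, key) in FLAG_TO_CONFIG.items():
        value = cli.get(flag)
        if value is None:
            # Fall back to the config file; missing entries resolve to None.
            value = config.get(section, {}).get(key)
        resolved[key] = value
    return resolved


config = {"processing": {"n_chunks": 500}, "output": {"sample_name": "my_sample"}}
cli = {"n": 100}  # as if the user passed "-n 100"
print(resolve_settings(config, cli))
```

Here `n_chunks` resolves to the command-line value 100, while `sample_name` falls back to `my_sample` from the config.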
SPC_genome/
├── CapWGS_PP.sh              # Main WGS pipeline
├── CapWGS_PP_QC_only.sh      # QC-only pipeline
├── CapGTA_PP.sh              # Genome-transcriptome pipeline
├── scripts/                  # Pipeline scripts
│   ├── CapWGS/               # WGS + GATK scripts
│   ├── CapWGS_QC/            # Coverage QC scripts
│   ├── CapGTA/               # Genome-transcriptome scripts
│   ├── bulk/                 # Bulk processing utilities
│   └── utils/                # Shared utilities
├── bin/                      # Experiment metadata and one-off scripts
├── data/                     # Alignments and single-cell outputs (gitignored)
│   └── {sample}/
│       ├── {sample}.bam      # Bulk alignment
│       └── sc_outputs/       # Single-cell BAMs, VCFs, bigwigs
├── results/                  # Final outputs (gitignored)
│   └── {sample}/
│       ├── variants.h5ad     # Variant AnnData
│       ├── rna_counts.h5ad   # RNA AnnData (CapGTA only)
│       └── *_qc_summary.csv  # QC metrics
├── notebooks/                # Analysis notebooks
│   ├── K562_tree/            # K562 tree experiment analysis
│   ├── benchmarking/         # CapWGS benchmarking analysis
│   └── PolE_worm_pilot/      # C. elegans CapGTA analysis
└── environment.yml           # Micromamba environment
- K562_tree/ - K562 lineage tree experiment
  - sc_analysis.ipynb - Single-cell variant analysis
  - bulk_spectra_analysis.ipynb - Mutational spectra
- benchmarking/ - CapWGS validation
  - benchmarking.ipynb - Comparison with public scWGS datasets
- PolE_worm_pilot/ - C. elegans CapGTA analysis
  - PolE_worm_pilot_analysis.ipynb - Multi-modal analysis
Capsule-based Whole Genome Sequencing and lineage tracing - bioRxiv