Skip to content

ccmbioinfo/crg2-pacbio

Repository files navigation

Clinical research pipeline for exploring variants in PacBio HiFi whole genome (WGS) data (Hg38)

crg2-pacbio is a research pipeline aimed at discovering clinically relevant variants from PacBio HiFi whole genome sequence data in a family-based manner. crg2-pacbio uses Snakemake and Conda to manage jobs and software dependencies.

crg2-pacbio takes as input the following outputs from PacBio's WGS pipeline:

  • pbmm2 aligned BAM files for each sample
  • DeepVariant joint-genotyped small variant VCF
  • pbsv joint-genotyped structural variant VCF
  • TRGT VCFs genotyped against the adotto repeat catalog (see 'Repeat expansion outlier report' section for more information), one per family member. TRGT-denovo expects that these VCFs sit inside the same directory that the aligned BAMs are stored in. TRGT-denovo also expects that the TRGT spanning BAMs are in the same directory as aligned BAMs. If they are stored elsewhere, you can create softlinks to these files in the directory storing the aligned BAMs.

crg2-pacbio also has a module for identifying, filtering, and annotating methylation outliers, see 'Methylation outlier report' below.

Clone crg2-pacbio: git clone https://github.com/ccmbioinfo/crg2-pacbio/. Note that two files, crg2-pacbio/crg2-pacbio.sh and crg2-pacbio/config.yaml, assume you have cloned crg2-pacbio to your home directory. If this is NOT the case, please update the paths in these files.

How to run the pipeline

  1. Make a folder in a directory with sufficient space. Copy over the template files crg2-pacbio/samples.tsv, crg2-pacbio/units.tsv, crg2-pacbio/config.yaml, crg2-pacbio/crg2-pacbio.sh, crg2-pacbio/slurm_profile/slurm-config.yaml. Note that slurm-config.yaml is for submitting each rule as cluster jobs, so ignore this if not running on cluster.
$ mkdir NA12878
$ cp crg2-pacbio/samples.tsv crg2-pacbio/units.tsv crg2-pacbio/config.yaml crg2-pacbio/crg2-pacbio.$ $ sh crg2-pacbio/slurm_profile/slurm-config.yaml NA12878
$ cd NA12878
  1. Set up pipeline run:
  • reconfigure samples.tsv and units.tsv to reflect sample names and input files. Note that because several of the inputs are joint-genotyped, one row in the units.tsv file corresponds to one family, not one sample. units.tsv must be configured with the path to the joint-genotyped (or singleton, if no family members were sequenced) deepvariant small_variant_vcf, and the joint-genotyped (or singleton) pbsv pbsv_vcf. The samples.tsv file should contain one row per family member, with the sample column corresponding to the sample ID and the BAM column corresponding to the path to the BAM file for that sample. units.tsv example:
family	platform	small_variant_vcf	pbsv_vcf
FAM01	PACBIO	/path/to/FAM01.joint.GRCh38.small_variants.phased.vcf.gz   /path/to/FAM01.joint.GRCh38.structural_variants.phased.vcf.gz

samples.tsv example (note that the case_or_control field is ONLY required for the methylation outlier workflow, see below):

sample	BAM	case_or_control
01	/path/to/FAM01_01.GRCh38.aligned.haplotagged.bam  
02	/path/to/FAM01_02.GRCh38.aligned.haplotagged.bam 
  • Add paths to the HPO term file and pedigree file to config.yaml.
  • Do a dry run: add a -n flag to the Snakemake command in crg2-pacbio.sh. This will print out the rules that will be run, but not actually run them.
  • If the dry run is successful, run the pipeline: sbatch crg2-pacbio.sh.
  • If you just want to generate a single report, you can specify the report name in the Snakemake command in crg2-pacbio.sh. For example, to generate the repeat outlier report, add repeat_outliers/{family}.repeat.outliers.annotated.csv to the Snakemake command.

Example:

#!/bin/bash
#SBATCH --job-name=crg2-pacbio
#SBATCH --time=50:00:00
#SBATCH --ntasks-per-node=1
#SBATCH --mem=4G
#SBATCH --output=%x-%j.out

SF=~/crg2-pacbio/Snakefile
CP="/hpf/largeprojects/ccm_dccforge/dccdipg/Common/snakemake"
SLURM=~/crg2-pacbio/slurm_profile/
CONFIG="config.yaml"

source /hpf/largeprojects/ccm_dccforge/dccdipg/Common/anaconda3/etc/profile.d/conda.sh
conda activate snakemake

snakemake --use-conda -s ${SF} --cores 4 --conda-prefix ${CP} --configfile ${CONFIG} --profile ${SLURM} repeat_outliers/{family}.repeat.outliers.annotated.csv

Small variant annotation and report: small_variants/coding/{family}/{family}.wes.regular.{date}.csv

  • annotate variants with VEP and vcfanno
  • generate a gemini variant database
  • generate a rare (less than 1% maximum gnomAD population AF) variant report with medium to high impact variants. Report generation scripts are derived from cre, but copied to this repo for convenience. Report details are described here; additional columns included are allele frequencies and counts from the CoLoRSDB cohort.

Small variant panel report: small_variants/panel/{family}/{family}.wgs.regular.{date}.csv, small_variants/panel-flank/{family}/{family}.wgs.regular.{date}.csv

  • The annotation pipeline is the same as above, but only variants in a gene panel are considered. The panel report includes variants of any impact (i.e. it includes non-coding variants and intronic variants).
  • To generate a gene panel from an HPO term to gene text file exported from Phenotips ('Suggested genes'), add the HPO filepath to config["run"]["hpo"].
  • The first time you run the pipeline (NOTE- not required if you're running this on SickKids HPC4Health), you will also need to generate Ensembl and RefSeq gene files as well as an HGNC gene mapping file.
  • Download and unzip Ensembl gtf: wget -qO- https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz | gunzip -c > Homo_sapiens.GRCh38.112.gtf
  • Download and unzip RefSeq gff: wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz | gunzip -c > GRCh38_latest_genomic.gff
  • Download RefSeq chromosome mapping file: wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_assembly_report.txt
  • Run script to parse the above files: python scripts/clean_gtf.py --ensembl_gtf /path/to/Homo_sapiens.GRCh38.112.gtf --refseq_gff3 /path/to/GRCh38_latest_genomic.gff --refseq_assembly /path/to/GRCh38_latest_assembly_report.txt
  • Add the paths to the output files, Homo_sapiens.GRCh38.112.gtf_subset.csv and GRCh38_latest_genomic.gff_subset.csv, to the config["gene"]["ensembl"] and config["gene"]["refseq"] fields.
  • You will also need the HGNC alias file: download this from https://www.genenames.org/download/custom/ using the default fields. Add the path this file to config["gene"]["hgnc"].

Structural variant annotation and report: sv/{family}.pbsv.{date}.csv

Repeat expansion outlier report: repeat_outliers/{family}.repeat.outliers.annotated.csv

  • this module is modified from the find-outlier-expansions workflow developed by Egor Dolzhenko (PacBio), Adam English (Baylor College of Medicine), Tom Mokveld (PacBio), Giulia Del Gobbo (CHEO), and Madeline Couse (SickKids)
  • genotype repeats per-sample using TRGTv1.1.0 against 937,122 repeats originally released by the Genome in a Bottle tandem repeat benchmarking project
  • compute length of longest pure segment (LPS) in repeat alleles using TRGT-LPS v0.4.0 merge sample TRGT VCFs into multi-sample VCF
  • compute repeat allele length and LPS Z-scores relative to background distribution (Children's Mercy Hospital cohort, n=1436)
  • identify repeats with outlying size in family members
  • annnotate repeat outliers with custom Python script

De novo tandem repeat variant report: TRGT_denovo/{family}_{child}.TRGT.denovo.annotated.csv

  • identify de novo tandem repeats in probands using TRGT-denovo v0.2.0
  • note that the sample TRGT spanning BAM files must reside in the same directories as the BAMs specified per-sample in samples.tsv to successfully run the denovo tandem repeat report
  • filter annotate de novo repeats with custom Python script

Pathogenic repeat loci report: pathogenic_repeats/{family}.known.path.str.loci.csv

Methylation outlier report: methylation_outliers/{family}_{sample}.annotated.outliers.{date}.csv

The methylation outlier workflow uses pb-cpg-tools to extract CpG sites from the BAM files (coverage threshold of 5x) and Methbat to identify haplotype-specific methylation outliers in one sample against a background of samples. You must list both the case sample and the control samples in the samples.tsv file, specifying the 'case_or_control' field as 'case' for the case sample and 'control' for the control samples. There must only be one case sample, and ideally as many controls samples as possible. Example samples.tsv for methylation outlier analysis:

sample	BAM	case_or_control
01	/path/to/01.GRCh38.aligned.haplotagged.bam	case
02	/path/to/02.GRCh38.aligned.haplotagged.bam	control
03	/path/to/03.GRCh38.aligned.haplotagged.bam	control
04	/path/to/04.GRCh38.aligned.haplotagged.bam	control
05	/path/to/05.GRCh38.aligned.haplotagged.bam	control
(and so on)

In order to annotate methylation outlier against variants, you must first run the small variant, structural variant, and repeat outlier workflows. Then, you must add the paths to the output files to the variants_for_methbat.tsv file, and add the path to the variants_for_methbat.tsv file to the config["run"]["variants_for_methbat"] field in config.yaml. Note the following:

  • The small_variants field points to the annotated and decomposed VCF file generated by the small variant workflow. The VCF must be annotated with gnomAD allele frequencies.
  • The SVs field points to the structural variant report generated by the structural variant workflow.
  • The TR_outliers field points to the repeat outliers generated by the repeat outlier workflow.
  • The CNVs field points to the CNV report generated by TCAG (not crg2-pacbio).

Example variants_for_methbat.tsv:

variant_type	path
small_variants	small_variants/coding/{family}/{family}-ensemble-annotated-decomposed.vcf.gz
SVs	sv/{family}.pbsv.{date}.csv
TR_outliers	repeat_outliers/{family}.repeat.outliers.csv
CNVs	cnvs/{family}.cnvs.{date}.csv

The workflow is as follows:

  • extract CpG sites from BAMs using pb-cpg-tools
  • generate MethBat profiles for each sample using Methbat
  • build a background profile using the control samples using MethBat
  • identify methylation outliers using MethBat
  • annotate methylation outliers with custom Python script

To run the methylation outlier workflow, follow the instructions above for running the pipeline, but make sure your samples.tsv is configured for methylation outlier analysis, and use the methylation_outliers.sh job script instead of crg2-pacbio.sh.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •