talkowski-lab/lr-annotation

Long-Read Annotation

This repository serves as a home for all scripts, workflows and processes for annotating long-read callsets.

Cohort

  • HPRC.
    • 232 total samples.
    • Metadata File.
      • 234 total samples.
      • Additionally includes GRCh38 and CHM13.
      • Fabio's table and the DeepVariant callset also include HG03492.
  • HGSVC.
    • 65 total samples.
    • Metadata File.
      • 67 total samples.
      • Renames NA21487 to GM21487.
      • Duplicates NA19129 (also includes GM19129) and NA20355 (also includes GM20355).
      • Additionally includes GM19320.
      • Misses NA24385 (HG002 in Terra).
  • HPRC & HGSVC Overlapping Samples: HG002, HG00733, HG02818, NA19036, NA19240.
  • All of Us Phase 1.
    • 1027 total samples.
    • Metadata file extracted from VCF - all samples are unrelated and of African ancestry.

References

  • coding_gtf: GENCODE v39 from the gnomAD workspace.
  • dbgap_vcf: Build 156 variants for GRCh38 from dbGaP.
  • dbgap_vcf_idx: Index for dbgap_vcf.
  • exons_bed: Loci for GRCh38 from the references listed in the SVAN repository.
  • fix_variant_collisions_java: Script for phasing from the DSP Long-Read SV team.
  • genetic_maps_tsv: Maps for GRCh38 from the GLIMPSE references.
  • gnomad_sv_vcf: gnomAD V4 structural variants from the gnomAD downloads.
  • gnomad_sv_vcf_idx: Index for gnomad_sv_vcf.
  • gnomad_tr_json: gnomAD V4 tandem repeats from the str-analysis references.
  • gnomad_vcfs: gnomAD V4 short variants from the gnomAD downloads.
  • gnomad_vcf_idxs: Indexes for gnomad_vcfs.
  • mei_catalog: Loci for GRCh38, combining ALU and LINE loci from RepeatMasker with SVA loci from van Bree et al.
  • mei_fa: Sequences for GRCh38 from the SVAN references.
  • mei_fa_indices: BWA and Minimap indices for mei_fa.
  • noncoding_bed: Panel for GRCh38 from the GATK-SV references.
  • par_bed: Panel for GRCh38 from the GATK-SV references.
  • ploidy_bed_female: Panel for GRCh38 from the Kanpig references.
  • ploidy_bed_male: Panel for GRCh38 from the Kanpig references.
  • ref_dict: Sequence dictionary for ref_fa.
  • ref_fa: Sequences for GRCh38 from the PAV references.
  • ref_fai: Index for ref_fa.
  • ref_fa_bwa_indices: BWA indices for ref_fa.
  • ref_fa_indices: BWA and Minimap indices for ref_fa.
  • ref_vep_cache: Cache for v105 from the VEP archives, with the most up-to-date list of these found here.
  • repeat_masker_bed: Loci for GRCh38 from the GATK-SV references.
  • repeats_bed: Loci for GRCh38 from the SVAN references.
  • repeat_catalog_trgt: Catalog for GRCh38 from the TR Catalog references, as used in All of Us Phase.
  • seg_dup_bed: Loci for GRCh38 from the GATK-SV references.
  • simple_repeat_bed: Loci for GRCh38 from the GATK-SV references.
  • seqrepo_tar: Sequence repository v2024-12-20 for GRCh38 from the latest release of seqrepo.
  • vntr_bed: Loci for GRCh38 from the SVAN references.

Annotation Workflows

This workflow leverages AnnotateVcf from the GATK-SV pipeline in order to annotate internal allele frequencies based on sample sexes and ancestries. It runs on all variants in the input VCF, including SVs.

Inputs:

  • sample_pop_assignments: Two column file containing sample IDs in the first column and ancestry labels in the second column.
  • ped_file: Six column file containing the cohort pedigree, with specifications described in this article.
  • contigs.
  • par_bed.
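
As a rough illustration of the two sample-level inputs above, the sketch below reads them and counts samples per ancestry and sex group. It is not the workflow's own code: the function names are hypothetical, and it assumes that both files are tab-separated, that lines starting with # are headers, and that the pedigree follows the common column order of family, sample, father, mother, sex and phenotype.

```python
import csv
import io
from collections import Counter

# Assumed column order for the six-column pedigree file.
PED_COLUMNS = ["family_id", "sample_id", "paternal_id", "maternal_id", "sex", "phenotype"]


def read_rows(handle, expected_columns):
    """Yield tab-separated rows, skipping blank lines and '#' header lines."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        if len(row) != expected_columns:
            raise ValueError(f"expected {expected_columns} columns, got {len(row)}: {row}")
        yield row


def read_pop_assignments(handle):
    """Map sample ID -> ancestry label from the two-column file."""
    return {sample: pop for sample, pop in read_rows(handle, 2)}


def read_ped_sexes(handle):
    """Map sample ID -> sex code from the six-column pedigree file."""
    sexes = {}
    for row in read_rows(handle, 6):
        record = dict(zip(PED_COLUMNS, row))
        sexes[record["sample_id"]] = record["sex"]
    return sexes


def group_sizes(pops, sexes):
    """Count samples per (ancestry, sex) group, skipping samples absent from the pedigree."""
    return Counter((pops[sample], sexes[sample]) for sample in pops if sample in sexes)


# Example with made-up sample IDs and labels.
example_pops = read_pop_assignments(io.StringIO("S1\tafr\nS2\teur\n"))
example_sexes = read_ped_sexes(io.StringIO("F1\tS1\t0\t0\t1\t0\nF2\tS2\t0\t0\t2\t0\n"))
example_groups = group_sizes(example_pops, example_sexes)
```

Raising on an unexpected column count catches a malformed metadata file before it reaches the annotation step.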

TODO

TODO

TODO

TODO

This workflow first runs RepeatMasker on an input VCF. It then uses its output to run L1ME-AID and INTACT_MEI in order to annotate and filter MEI calls.

TODO

TODO

TODO

This workflow leverages PALMER in order to annotate MEI calls for a cohort in a given cohort VCF. It retains the genotypes present in the VCF, simply adding an INFO field ME_TYPE to insertions whose characteristics match those of the PALMER calls.

Inputs:

  • mei_types: MEI types to run on - must be a subset of ALU, SVA, LINE and HERVK.
  • rm_fa: Output by RepeatMasker.
  • rm_out: Output by RepeatMasker.
  • contigs.
  • ref_fai.

TODO

TODO

This workflow leverages SVAN in order to annotate Mobile Element Insertions (MEIs), Mobile Element Deletions, Tandem Duplications, Dispersed Duplications and Nuclear Mitochondrial Segments (NUMT). It involves running Tandem Repeat Finder (TRF) on the inserted or deleted sequence for each SV in the input VCF.

Inputs:

  • contigs.
  • exons_bed.
  • mei_fasta.
  • ref_fa.
  • repeats_bed.
  • vntr_bed.

This workflow leverages SVAnnotate in order to annotate predicted functional effects for SVs. It runs only SVs through this workflow, ignoring all SNVs and indels.

Inputs:

  • coding_gtf.
  • contigs.
  • noncoding_bed.

TODO

This workflow leverages the Ensembl Variant Effect Predictor (VEP) in order to annotate predicted functional effects based on site-level information. It requires numerous references that provide context to these annotations, and uses Hail in order to run this annotation process in a more efficient and scalable manner.

Inputs:

  • ref_fa.
  • ref_vep_cache.

TODO

This workflow ingests two VCFs and finds matching variants across them in order to compare the AF & VEP annotations of the matched pairs. This serves as a form of benchmarking, as it checks that the annotations we produce are in line with those applied to a larger cohort (e.g. gnomAD). It also enables the identification of variants that are outliers relative to existing cohorts by pulling out those whose annotations are highly discordant across the callsets.

The workflow undergoes multiple rounds of variant matching in order to determine matched pairs:

  1. Exact match across CHROM, POS, REF and ALT.
  2. Truvari match with overlap percentages of 90%, 70% and 50%.
  3. Matching based on bedtools closest, fine-tuned for SVs.
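
The first round above can be sketched as a dictionary lookup keyed on CHROM, POS, REF and ALT. This is a minimal illustration rather than the workflow's implementation: records are modelled as plain dictionaries with hypothetical keys (id, chrom, pos, ref, alt), and variants left unmatched are returned so that a later round (Truvari, then bedtools closest) could pick them up.

```python
def variant_key(record):
    """Build the exact-match key from CHROM, POS, REF and ALT."""
    return (record["chrom"], record["pos"], record["ref"], record["alt"])


def exact_match(eval_records, truth_records):
    """Return (matches, unmatched) for the exact-match round.

    matches holds (eval_id, truth_id, match_type) tuples; unmatched holds the
    evaluation records that would be passed on to the next matching round.
    """
    truth_by_key = {}
    for record in truth_records:
        # Keep the first truth record seen for each key (an arbitrary choice).
        truth_by_key.setdefault(variant_key(record), record["id"])

    matches, unmatched = [], []
    for record in eval_records:
        truth_id = truth_by_key.get(variant_key(record))
        if truth_id is None:
            unmatched.append(record)
        else:
            matches.append((record["id"], truth_id, "EXACT_MATCH"))
    return matches, unmatched


# Example with made-up variants.
truth = [{"id": "t1", "chrom": "chr1", "pos": 100, "ref": "A", "alt": "T"}]
evals = [
    {"id": "e1", "chrom": "chr1", "pos": 100, "ref": "A", "alt": "T"},
    {"id": "e2", "chrom": "chr1", "pos": 200, "ref": "G", "alt": "GA"},
]
matched, remaining = exact_match(evals, truth)
```

Running the cheapest and strictest comparison first means the looser rounds only see the variants that still need them.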

Inputs:

  • vcf_truth: VCF containing SNVs & indels to evaluate against.
  • vcf_sv_truth: VCF containing SVs to evaluate against.
  • ref_fa.
  • ref_fai.

Annotation Utilities

TODO

TODO

TODO

TODO

TODO

TODO

TODO

TODO

TODO

TODO

TODO

TODO

TODO

TODO

TODO

TODO

TODO

TODO

Tools

TODO

TODO

TODO

TODO

TODO

TODO

TODO

This workflow leverages Minimap2 in order to align assemblies to a reference.

Inputs:

  • assembly_mat: Maternal assembly.
  • assembly_pat: Paternal assembly.
  • minimap_flags: Parameters to use when running Minimap2.
  • ref_fa.
  • ref_fai.

TODO

This workflow runs PALMER on a pair of aligned assembly haplotypes in order to generate MEI calls. It then converts the raw PALMER calls into a VCF, merges calls across the haplotypes to create a diploid VCF per haplotype and then finally integrates these into a final VCF containing multiple MEI types.

Inputs:

  • bam_pat: Aligned assembly for paternal haplotype.
  • bai_pat: Index for aligned assembly for paternal haplotype.
  • bam_mat: Aligned assembly for maternal haplotype.
  • bai_mat: Index for aligned assembly for maternal haplotype.
  • mei_types: Series of MEI modes to run PALMER in - a subset of ALU, SVA, LINE and HERVK.
  • truvari_collapse_params: Truvari parameters to use when merging across haplotypes.
  • override_palmer_calls_pat: Optional PALMER calls for paternal haplotype, causing the workflow to bypass its execution.
  • override_palmer_tsd_reads_pat: Optional PALMER TSD reads for paternal haplotype, causing the workflow to bypass its execution.
  • override_palmer_calls_mat: Optional PALMER calls for maternal haplotype, causing the workflow to bypass its execution.
  • override_palmer_tsd_reads_mat: Optional PALMER TSD reads for maternal haplotype, causing the workflow to bypass its execution.
  • contigs.
  • ref_fa.
  • ref_fai.

TODO

TODO

This workflow leverages RepeatMasker in order to annotate repeated regions in an input VCF.

TODO

TODO

TODO

This workflow leverages TRGT in order to genotype short-tandem repeats.

Inputs:

  • bam: Aligned reads.
  • bai: Index for aligned reads.
  • sex: Sex of sample (one of M or F).
  • ref_fa.
  • ref_fai.
  • repeat_catalog_trgt.

TODO

TODO

TODO

Output Schema

  • INFO/allele_length: Allele length - positive for insertions, negative for deletions and 0 for SNVs.
  • INFO/allele_type: Allele type, which is one of the below.
    • snv: Single nucleotide variant.
    • ins: Insertion.
    • del: Deletion.
    • trv: Tandem repeat.
    • dup: Tandem duplication.
    • dup_interspersed: Interspersed duplication.
    • complex_dup: Complex duplication.
    • inv_dup: Inverted duplication.
    • numt: Nuclear-mitochondrial segment.
    • {ME_TYPE}_ins: Mobile element insertion, where {ME_TYPE} is one of ALU, LINE or SVA.
    • {ME_TYPE}_del: Mobile element deletion, where {ME_TYPE} is one of ALU, LINE or SVA.
  • SOURCE: Source of call, which is one of the below.
    • DeepVariant: SNV or indel call made by the DeepVariant pipeline.
    • HPRC_SV_Integration: Structural variant call made by the HPRC SV Integration pipeline.
    • TRExplorer: Tandem repeat loci from the TRExplorer v1.0.1 catalog.
    • Vamos: Tandem repeat loci from the Vamos v2.1 catalog.
  • TR_ENVELOPED: Flag indicating a variant with allele_type != "trv" is completely enveloped by a variant with allele_type = "trv".
  • TRID: TR identifier for TR calls; source of enveloping variant with allele_type = "trv" for non-TR calls with the TR_ENVELOPED flag.
  • TR_PARSED: Flag indicating a variant with allele_type != "trv" is flagged as a tandem repeat by the AnnotateIndelTRs workflow.
  • ORIGIN: Origin of duplicated sequence for duplications and NUMTs.
  • SUB_FAMILY: Sub-family for MEI calls.
  • dbGaP_ID: Variant ID from dbGaP for matched variants.
  • REGION: Genomic region, which is one of SR (for simple repeats), SD (for segmental duplications), RM (for RepeatMasker annotated regions) or US (for unique sequences, or more simply, none of the previous regions).
  • Functional Annotations.
    • vep: Annotations from the Variant Effect Predictor (VEP).
    • PREDICTED_: Annotations from SVAnnotate, which are all prefixed by PREDICTED_.
  • gnomAD_V4 Benchmarking.
    • gnomAD_V4_match_type: Method for generating match, which is one of the below.
      • EXACT_MATCH: Exact match across CHROM, POS, REF and ALT.
      • TRUVARI_{X}: Truvari match requiring X% sequence similarity.
      • BEDTOOLS_CLOSEST: Bedtools closest match fine-tuned for SVs.
    • gnomAD_V4_match_ID: Variant ID of matched variant.
    • gnomAD_V4_match_source: Source of matched variant, which is one of the below.
      • SNV_indel: SNV & indel callset.
      • SV: SV callset.
  • Allele Frequencies.
    • AN: Count of alleles genotyped.
    • AC: Count of non-reference alleles.
    • AF: Proportion of alleles that are non-reference.
    • NCR: Proportion of alleles that don't have a genotype call.
    • AP_allele: Allele purity per-allele (multiallelic sites only).
    • MC_allele: Motif count per-allele (multiallelic sites only).
    • LPS_allele: Longest polymer sequence per-allele (multiallelic sites only).
  • VRS Annotations: As described in the VRS documentation.
    • VRS_Allele_IDs.
    • VRS_Starts.
    • VRS_Ends.
    • VRS_States.
    • VRS_Lengths.
    • VRS_RepeatSubunitLengths.
  • Filters.
    • LARGE_SNV_INDEL: Variant with SOURCE = "DeepVariant" that has INFO/allele_length ≥ 50.
    • SMALL_SV: Variant with SOURCE = "HPRC_SV_Integration" that has INFO/allele_length < 50.
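
To make a few of the fields above concrete, the sketch below derives allele_length, the two size filters and the basic allele-frequency fields. It is an illustration under stated assumptions, not the pipeline's code: the function names are hypothetical, alleles are assumed to be sequence-resolved (no symbolic ALTs), the size filters are assumed to compare the magnitude of allele_length so that deletions are treated like insertions, and genotypes are modelled as tuples of allele indexes with None for a missing call.

```python
def allele_length(ref, alt):
    """Positive for insertions, negative for deletions and 0 for SNVs."""
    return len(alt) - len(ref)


def size_filter(source, length, threshold=50):
    """Return the size filter for a call, or None if neither applies.

    Assumption: the comparison uses abs(length), so a 60 bp deletion
    (length -60) is treated the same as a 60 bp insertion.
    """
    size = abs(length)
    if source == "DeepVariant" and size >= threshold:
        return "LARGE_SNV_INDEL"
    if source == "HPRC_SV_Integration" and size < threshold:
        return "SMALL_SV"
    return None


def allele_stats(genotypes):
    """Compute AN, AC, AF and NCR for one biallelic site."""
    alleles = [allele for genotype in genotypes for allele in genotype]
    total = len(alleles)
    an = sum(1 for allele in alleles if allele is not None)  # alleles genotyped
    ac = sum(1 for allele in alleles if allele is not None and allele > 0)  # non-reference
    af = ac / an if an else None
    ncr = (total - an) / total if total else None  # share of alleles without a call
    return {"AN": an, "AC": ac, "AF": af, "NCR": ncr}


# Example with made-up genotypes for four diploid samples.
stats = allele_stats([(0, 1), (1, 1), (None, None), (0, 0)])
```

AF and NCR are returned as None rather than 0 when there is nothing to divide by, which keeps "no data" distinct from "no non-reference alleles".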

Code Conventions

WDL

  • Workflows should be structured in the following order, with each of the below separated by a blank line:
    1. Imports.
    2. Inputs.
    3. Definition of variables dynamically generated in the workflow itself.
    4. Calls to tasks.
    5. Outputs.
  • Tasks should be structured in the following order, with each of the below separated by a blank line:
    1. Inputs.
    2. Definition of variables dynamically generated in the task itself.
    3. Command.
    4. Outputs.
    5. Runtime settings - default parameters, followed by a select first with the runtime override, then the actual runtime block.
  • Inputs should be structured in the following order, with each of the below separated by a blank line:
    1. Core input files that will be run through the workflow - e.g. VCFs being annotated, BAMs being analyzed etc. (as well as their indexes if applicable). This group also includes the contigs to be run on and the prefix.
    2. Parameters that govern how the file will be processed - e.g. prefixes, modes, input arguments to tools being called, PEDs, metadata files etc.
    3. Reference files - e.g. reference fasta, their indexes, catalogs used for annotations, etc.
    4. Runtime-related inputs that are not of type RuntimeAttr - e.g. docker paths, cores if applicable, sharding information if applicable.
    5. All RuntimeAttr? inputs - there should be one per task called, with its name reflective of the task's function.
  • Workflows should take in an input prefix that is passed to every task that creates output files, which should be used in conjunction with a descriptive suffix when creating outputs.
  • Workflow imports should not be renamed using the as operator.
  • Workflows should never contain any blank comments - e.g. #########################.
  • Workflows should never contain any consecutive blank lines - i.e. they should have a maximum of one blank line at a time.
  • Inputs passed to a task should not have blank lines between inputs.
  • The order of inputs passed to a task should reflect their order in the inputs on the workflow level.
  • Inputs passed to a task should have a space on either side of the = character.
  • The inputs section of a task should not have blank lines between inputs.
  • Tasks should always have input fields docker and runtime_attr_override defined, though what is passed to each one of these when calling the task should be explicitly named - e.g. gatk_docker and runtime_attr_override_svannotate respectively.
  • Tasks should also have a prefix input defined, which is passed and set at the workflow level - the outputs from the task should simply use the prefix along with the file type.
  • Every command block within a task should begin with set -euo pipefail followed by a blank line.
  • The default disk_size for a task should be calculated dynamically based on the largest input file - or on several files if there are multiple large inputs, such as multiple reference fastas or input catalogs. It should be defined in-line in the default runtime attributes section, unless it is a complicated function, in which case it can have a dedicated variable disk_size.
  • The default mem_gb, boot_disk_gb and compute_cores for a task should be explicitly defined rather than based on an input file - they should be set based on the intensity of compute needed by that task.
  • The default preemptible_tries for a task should always be 2.
  • The default max_retries for a task should always be 0.
  • The names of workflows and tasks should never include a _ character within them - rather, they should always be in Pascal case.
  • The names of inputs, variables and outputs should include a _ to separate words, and be entirely lowercase unless they refer to a noun that is capitalized (e.g. PALMER or L1MEAID) - i.e. they should always be in snake case.
  • There should never be any additional indentation in order to better align parts of the code to the length or horizontal/vertical spacing of other components in its section - indentation should only be applied at the start of a line.
  • All mentions of fasta should instead use fa - e.g. ref_fa instead of ref_fasta.
  • All mentions of fasta_index, fasta_fai or fa_fai should instead use fai - e.g. ref_fai instead of ref_fasta_index, ref_fasta_fai or ref_fa_fai.
  • All mentions of vcf_index or vcf_tbi should instead use vcf_idx.
  • All VCFs should have suffix _vcf, and be coupled with a VCF index file that has a suffix _vcf_idx.
  • Tasks that can be generalized and used across workflows should live in Helpers.wdl and be imported by consumer workflows, rather than explicitly defined in a standalone workflow itself.
  • Workflow file names must always match the workflow defined within them.
  • Annotation workflows should always output a TSV file rather than a VCF, unless their annotations are applied to every single variant in the input VCF or the underlying workflow is designed to annotate variants in a VCF.
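
Several of the conventions above are mechanical enough to check automatically. The sketch below shows one possible way to do so in Python. It is not a tool from this repository: all function names are hypothetical, it covers only the naming, renaming, blank-comment and blank-line rules, and it works on plain strings rather than parsing WDL.

```python
import re

# Pascal case: starts with a capital letter and contains no underscores.
PASCAL_CASE = re.compile(r"^[A-Z][A-Za-z0-9]*$")
# Snake case: underscore-separated words, each all lowercase or, for
# capitalized nouns such as PALMER, all uppercase.
WORD = r"(?:[a-z0-9]+|[A-Z0-9]+)"
SNAKE_CASE = re.compile(rf"^{WORD}(?:_{WORD})*$")

# Longer patterns come first so that fasta_index is not rewritten to fa_index.
RENAMES = [
    ("fasta_index", "fai"),
    ("fasta_fai", "fai"),
    ("fa_fai", "fai"),
    ("fasta", "fa"),
    ("vcf_index", "vcf_idx"),
    ("vcf_tbi", "vcf_idx"),
]


def is_pascal_case(name):
    """Check a workflow or task name."""
    return PASCAL_CASE.match(name) is not None


def is_snake_case(name):
    """Check an input, variable or output name."""
    return SNAKE_CASE.match(name) is not None


def suggest_name(name):
    """Apply the fa, fai and vcf_idx renaming conventions to a name."""
    for old, new in RENAMES:
        name = name.replace(old, new)
    return name


def lint_lines(text):
    """Return (line_number, problem) pairs for blank comments and repeated blank lines."""
    problems = []
    lines = text.splitlines()
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped and set(stripped) == {"#"}:
            problems.append((number, "blank comment"))
        if number > 1 and not stripped and not lines[number - 2].strip():
            problems.append((number, "consecutive blank lines"))
    return problems


# Example on a made-up snippet.
found = lint_lines("version 1.0\n\n\nworkflow Example {\n####\n}\n")
```

Plain string substitution in suggest_name can over-match inside unrelated words, so its output is best treated as a suggestion to review rather than an automatic rewrite.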

Python

  • All code should be formatted in line with black's formatting, which can be applied via the command: black .
  • All code should be compliant with flake8.

Codebase

  • Workflows in wdl/annotation/ should begin with Annotate.
  • Workflows directly run in the pipeline should be in one of wdl/annotation/, wdl/annotation_utils/ or wdl/tools/, and should have entries in dockstore.yml and README.md.

Workspace

  • All reference files - i.e. those not specific to an input callset - should be passed in via workspace data.
  • All dockers should be passed in via workspace data.
