This repository serves as a home for all scripts, workflows and processes for annotating long-read callsets.
- HPRC.
  - 232 total samples.
  - Metadata File.
    - 234 total samples.
    - Additionally includes GRCh38 and CHM13.
    - Fabio's table and the DeepVariant callset also include HG03492.
- HGSVC.
  - 65 total samples.
  - Metadata File.
    - 67 total samples.
    - Renames NA21487 to GM21487.
    - Duplicates NA19129 (also includes GM19129) and NA20355 (also includes GM20355).
    - Additionally includes GM19320.
    - Misses NA24385 (HG002 in Terra).
- HPRC & HGSVC Overlapping Samples: HG002, HG00733, HG02818, NA19036, NA19240.
- All of Us Phase 1.
  - 1027 total samples.
  - Metadata file extracted from VCF - all samples are unrelated and of African ancestry.
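The `NA`/`GM` discrepancies noted for the HGSVC metadata file can be caught with a small check. The following is a minimal sketch, not part of the repository: it assumes that a `GM` prefix and an `NA` prefix refer to the same sample, and the helper names `normalize_sample_id` and `find_duplicates` are hypothetical.

```python
from collections import defaultdict


def normalize_sample_id(sample_id):
    """Map a 'GM' prefix onto 'NA' (assumption: both denote the same sample)."""
    if sample_id.startswith("GM"):
        return "NA" + sample_id[2:]
    return sample_id


def find_duplicates(sample_ids):
    """Group IDs by their normalized form and keep groups with more than one ID."""
    groups = defaultdict(list)
    for sample_id in sample_ids:
        groups[normalize_sample_id(sample_id)].append(sample_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# IDs taken from the HGSVC notes above.
metadata_ids = ["NA19129", "GM19129", "NA20355", "GM20355", "GM21487", "GM19320"]
duplicates = find_duplicates(metadata_ids)
print(duplicates)
```

Running this on the IDs listed above reports `NA19129` and `NA20355` as duplicated, matching the notes on the metadata file.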
- `coding_gtf`: GENCODE v39 from the gnomAD workspace.
- `dbgap_vcf`: Build 156 variants for GRCh38 from dbGaP.
- `dbgap_vcf_idx`: Index for `dbgap_vcf`.
- `exons_bed`: Loci for GRCh38 from the references listed in the SVAN repository.
- `fix_variant_collisions_java`: Script for phasing from the DSP Long-Read SV team.
- `genetic_maps_tsv`: Maps for GRCh38 from the GLIMPSE references.
- `gnomad_sv_vcf`: gnomAD V4 structural variants from the gnomAD downloads.
- `gnomad_sv_vcf_idx`: Index for `gnomad_sv_vcf`.
- `gnomad_tr_json`: gnomAD V4 tandem repeats from the str-analysis references.
- `gnomad_vcfs`: gnomAD V4 short variants from the gnomAD downloads.
- `gnomad_vcf_idxs`: Indexes for `gnomad_vcfs`.
- `mei_catalog`: Loci for GRCh38, combining ALU and LINE loci from RepeatMasker with SVA loci from van Bree et al.
- `mei_fa`: Sequences for GRCh38 from the SVAN references.
- `mei_fa_indices`: BWA and Minimap indices for `mei_fa`.
- `noncoding_bed`: Panel for GRCh38 from the GATK-SV references.
- `par_bed`: Panel for GRCh38 from the GATK-SV references.
- `ploidy_bed_female`: Panel for GRCh38 from the Kanpig references.
- `ploidy_bed_male`: Panel for GRCh38 from the Kanpig references.
- `ref_dict`: Sequence dictionary for `ref_fa`.
- `ref_fa`: Sequences for GRCh38 from the PAV references.
- `ref_fai`: Index for `ref_fa`.
- `ref_fa_bwa_indices`: BWA indices for `ref_fa`.
- `ref_fa_indices`: BWA and Minimap indices for `ref_fa`.
- `ref_vep_cache`: Cache for v105 from the VEP archives, with the most up-to-date list of these found here.
- `repeat_masker_bed`: Loci for GRCh38 from the GATK-SV references.
- `repeats_bed`: Loci for GRCh38 from the SVAN references.
- `repeat_catalog_trgt`: Catalog for GRCh38 from the TR Catalog references, as used in All of Us Phase.
- `seg_dup_bed`: Loci for GRCh38 from the GATK-SV references.
- `simple_repeat_bed`: Loci for GRCh38 from the GATK-SV references.
- `seqrepo_tar`: Sequence repository v2024-12-20 for GRCh38 from the latest release of seqrepo.
- `vntr_bed`: Loci for GRCh38 from the SVAN references.
This workflow leverages AnnotateVcf from the GATK-SV pipeline in order to annotate internal allele frequencies based on sample sexes and ancestries. It runs on all variants in the input VCF, including SVs.
Inputs:
- `sample_pop_assignments`: Two-column file containing sample IDs in the first column and ancestry labels in the second column.
- `ped_file`: Six-column file containing the cohort pedigree, with specifications described in this article.
- `contigs`.
- `par_bed`.
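To make the role of `sample_pop_assignments` concrete, here is a minimal sketch of per-ancestry allele frequency counting. It is not how AnnotateVcf is implemented: it assumes the file is tab-delimited, ignores sex and PAR handling entirely, and uses hypothetical sample IDs, ancestry labels and function names.

```python
import csv
import io
from collections import defaultdict


def load_pop_assignments(handle):
    """Read a two-column file: sample ID, then ancestry label (tab-delimited is an assumption)."""
    return {row[0]: row[1] for row in csv.reader(handle, delimiter="\t") if row}


def af_by_population(genotypes, sample_to_pop):
    """Count AN/AC/AF per ancestry for one site.

    genotypes maps sample ID -> tuple of allele indices, with None for a missing call.
    """
    counts = defaultdict(lambda: {"AN": 0, "AC": 0})
    for sample, alleles in genotypes.items():
        pop = sample_to_pop.get(sample)
        if pop is None:
            continue  # sample has no ancestry label
        for allele in alleles:
            if allele is None:
                continue  # uncalled alleles do not contribute to AN
            counts[pop]["AN"] += 1
            if allele > 0:
                counts[pop]["AC"] += 1
    return {
        pop: {**c, "AF": c["AC"] / c["AN"] if c["AN"] else None}
        for pop, c in counts.items()
    }


# Hypothetical assignments and genotypes for a single biallelic site.
assignments = load_pop_assignments(io.StringIO("S1\tAFR\nS2\tAFR\nS3\tEUR\n"))
site_genotypes = {"S1": (0, 1), "S2": (1, 1), "S3": (0, None)}
result = af_by_population(site_genotypes, assignments)
print(result)
```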
TODO
TODO
TODO
TODO
This workflow first runs RepeatMasker on an input VCF. It then uses its output to run L1ME-AID and INTACT_MEI in order to annotate and filter MEI calls.
TODO
TODO
TODO
This workflow leverages PALMER in order to annotate MEI calls for a cohort in a given cohort VCF. It retains the genotypes present in the VCF, simply adding an INFO field ME_TYPE to insertions whose characteristics match those of the PALMER calls.
Inputs:
- `mei_types`: MEI types to run on - must be a subset of `ALU`, `SVA`, `LINE` or `HERVK`.
- `rm_fa`: Output by RepeatMasker.
- `rm_out`: Output by RepeatMasker.
- `contigs`.
- `ref_fai`.
TODO
TODO
This workflow leverages SVAN in order to annotate Mobile Element Insertions (MEIs), Mobile Element Deletions, Tandem Duplications, Dispersed Duplications and Nuclear Mitochondrial Segments (NUMT). It involves running Tandem Repeat Finder (TRF) on the inserted or deleted sequence for each SV in the input VCF.
Inputs:
- `contigs`.
- `exons_bed`.
- `mei_fasta`.
- `ref_fa`.
- `repeats_bed`.
- `vntr_bed`.

This workflow leverages SVAnnotate in order to annotate predicted functional effects for SVs. It runs only SVs through the annotation, ignoring all SNVs and indels.
Inputs:
- `coding_gtf`.
- `contigs`.
- `noncoding_bed`.
TODO
This workflow leverages the Ensembl Variant Effect Predictor (VEP) in order to annotate predicted functional effects based on site-level information. It requires numerous references that provide context to these annotations, and uses Hail in order to run this annotation process in a more efficient and scalable manner.
Inputs:
- `ref_fa`.
- `ref_vep_cache`.
TODO
This workflow ingests two VCFs and finds matching variants across them in order to compare the AF & VEP annotations of these matched pairs. This serves as a degree of benchmarking, as it checks that our annotations are in line with those applied to a larger cohort (e.g. gnomAD). It also enables the identification of variants that are outliers relative to existing cohorts by pulling out those with a large amount of discordance in their annotations across the callsets.
The workflow undergoes multiple rounds of variant matching in order to determine matched pairs:
- Exact match across CHROM, POS, REF and ALT.
- Truvari match with overlap percentages of 90%, 70% and 50%.
- Matching based on `bedtools closest`, fine-tuned for SVs.
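The tiered structure of these rounds can be sketched as follows. This is only an illustration under stated assumptions: reciprocal interval overlap stands in for the Truvari comparison, the `bedtools closest` round is omitted, and the record fields (`id`, `chrom`, `pos`, `end`, `ref`, `alt`) and function names are hypothetical.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Fraction of overlap relative to the longer of the two intervals."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def match_variant(query, truth, thresholds=(90, 70, 50)):
    """Return (match_type, truth_id), trying the strictest round first."""
    # Round 1: exact match across CHROM, POS, REF and ALT.
    key = (query["chrom"], query["pos"], query["ref"], query["alt"])
    for cand in truth:
        if (cand["chrom"], cand["pos"], cand["ref"], cand["alt"]) == key:
            return "EXACT_MATCH", cand["id"]
    # Round 2: progressively looser thresholds (a stand-in for Truvari, not Truvari itself).
    for pct in thresholds:
        best = None
        for cand in truth:
            if cand["chrom"] != query["chrom"]:
                continue
            ro = reciprocal_overlap(query["pos"], query["end"], cand["pos"], cand["end"])
            if ro >= pct / 100 and (best is None or ro > best[0]):
                best = (ro, cand["id"])
        if best is not None:
            return f"TRUVARI_{pct}", best[1]
    return None, None


truth_calls = [
    {"id": "t1", "chrom": "chr1", "pos": 100, "end": 200, "ref": "N", "alt": "<DEL>"},
    {"id": "t2", "chrom": "chr1", "pos": 1000, "end": 2000, "ref": "N", "alt": "<DEL>"},
]
query_call = {"chrom": "chr1", "pos": 1250, "end": 2000, "ref": "N", "alt": "<DEL>"}
print(match_variant(query_call, truth_calls))
```

A variant is assigned to the first round in which it finds a partner, so a pair that matches exactly is never re-evaluated at a looser threshold.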
Inputs:
- `vcf_truth`: VCF containing SNVs & indels to evaluate against.
- `vcf_sv_truth`: VCF containing SVs to evaluate against.
- `ref_fa`.
- `ref_fai`.
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
This workflow leverages Minimap2 in order to align assemblies to a reference.
Inputs:
- `assembly_mat`: Maternal assembly.
- `assembly_pat`: Paternal assembly.
- `minimap_flags`: Parameters to use when running Minimap2.
- `ref_fa`.
- `ref_fai`.
TODO
This workflow runs PALMER on a pair of aligned assembly haplotypes in order to generate MEI calls. It then converts the raw PALMER calls into VCFs, merges calls across the haplotypes to create a diploid VCF per MEI type, and finally integrates these into a single VCF containing multiple MEI types.
Inputs:
- `bam_pat`: Aligned assembly for paternal haplotype.
- `bai_pat`: Index for aligned assembly for paternal haplotype.
- `bam_mat`: Aligned assembly for maternal haplotype.
- `bai_mat`: Index for aligned assembly for maternal haplotype.
- `mei_types`: Series of MEI modes to run PALMER in - a subset of `ALU`, `SVA`, `LINE` or `HERVK`.
- `truvari_collapse_params`: Truvari parameters to use when merging across haplotypes.
- `override_palmer_calls_pat`: Optional PALMER calls for paternal haplotype, causing the workflow to bypass its execution.
- `override_palmer_tsd_reads_pat`: Optional PALMER TSD reads for paternal haplotype, causing the workflow to bypass its execution.
- `override_palmer_calls_mat`: Optional PALMER calls for maternal haplotype, causing the workflow to bypass its execution.
- `override_palmer_tsd_reads_mat`: Optional PALMER TSD reads for maternal haplotype, causing the workflow to bypass its execution.
- `contigs`.
- `ref_fa`.
- `ref_fai`.
TODO
TODO
This workflow leverages RepeatMasker in order to annotate repeated regions in an input VCF.
TODO
TODO
TODO
This workflow leverages TRGT in order to genotype short tandem repeats.
Inputs:
- `bam`: Aligned reads.
- `bai`: Index for aligned reads.
- `sex`: Sex of sample (one of `M` or `F`).
- `ref_fa`.
- `ref_fai`.
- `repeat_catalog_trgt`.
TODO
TODO
TODO
- `INFO/allele_length`: Allele length - positive for insertions, negative for deletions and 0 for SNVs.
- `INFO/allele_type`: Allele type, which is one of the below.
  - `snv`: Single nucleotide variant.
  - `ins`: Insertion.
  - `del`: Deletion.
  - `trv`: Tandem repeat.
  - `dup`: Tandem duplication.
  - `dup_interspersed`: Interspersed duplication.
  - `complex_dup`: Complex duplication.
  - `inv_dup`: Inverted duplication.
  - `numt`: Nuclear-mitochondrial segment.
  - `{ME_TYPE}_ins`: Mobile element insertion, where `{ME_TYPE}` is one of `ALU`, `LINE` or `SVA`.
  - `{ME_TYPE}_del`: Mobile element deletion, where `{ME_TYPE}` is one of `ALU`, `LINE` or `SVA`.
- `SOURCE`: Source of call, which is one of the below.
  - `DeepVariant`: SNV or indel call made by the DeepVariant pipeline.
  - `HPRC_SV_Integration`: Structural variant call made by the HPRC SV Integration pipeline.
  - `TRExplorer`: Tandem repeat loci from the TRExplorer v1.0.1 catalog.
  - `Vamos`: Tandem repeat loci from the Vamos v2.1 catalog.
- `TR_ENVELOPED`: Flag indicating a variant with `allele_type != "trv"` is completely enveloped by a variant with `allele_type = "trv"`.
- `TRID`: TR identifier for TR calls; source of enveloping variant with `allele_type = "trv"` for non-TR calls with the `TR_ENVELOPED` flag.
- `TR_PARSED`: Flag indicating a variant with `allele_type != "trv"` is flagged as a tandem repeat by the `AnnotateIndelTRs` workflow.
- `ORIGIN`: Origin of duplicated sequence for duplications and NUMTs.
- `SUB_FAMILY`: Sub-family for MEI calls.
- `dbGaP_ID`: Variant ID from dbGaP for matched variants.
- `REGION`: Genomic region, which is one of `SR` (for simple repeats), `SD` (for segmental duplications), `RM` (for RepeatMasker annotated regions) or `US` (for unique sequences, or more simply, none of the previous regions).
- Functional Annotations.
  - `vep`: Annotations from the Variant Effect Predictor (VEP).
  - `PREDICTED_`: Annotations from SVAnnotate, which are all prefixed by `PREDICTED_`.
- gnomAD_V4 Benchmarking.
  - `gnomAD_V4_match_type`: Method for generating match, which is one of the below.
    - `EXACT_MATCH`: Exact match across CHROM, POS, REF and ALT.
    - `TRUVARI_{X}`: Truvari match requiring X% sequence similarity.
    - `BEDTOOLS_CLOSEST`: Bedtools closest match fine-tuned for SVs.
  - `gnomAD_V4_match_ID`: Variant ID of matched variant.
  - `gnomAD_V4_match_source`: Source of matched variant, which is one of the below.
    - `SNV_indel`: SNV & indel callset.
    - `SV`: SV callset.
- Allele Frequencies.
  - `AN`: Count of alleles genotyped.
  - `AC`: Count of non-reference alleles.
  - `AF`: Proportion of alleles that are non-reference.
  - `NCR`: Proportion of alleles that don't have a genotype call.
  - `AP_allele`: Allele purity per-allele (multiallelic sites only).
  - `MC_allele`: Motif count per-allele (multiallelic sites only).
  - `LPS_allele`: Longest polymer sequence per-allele (multiallelic sites only).
- VRS Annotations: As described in the VRS documentation.
  - `VRS_Allele_IDs`.
  - `VRS_Starts`.
  - `VRS_Ends`.
  - `VRS_States`.
  - `VRS_Lengths`.
  - `VRS_RepeatSubunitLengths`.
- Filters.
  - `LARGE_SNV_INDEL`: Variant with `SOURCE = "DeepVariant"` that has `INFO/allele_length ≥ 50`.
  - `SMALL_SV`: Variant with `SOURCE = "HPRC_SV_Integration"` that has `INFO/allele_length < 50`.
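As an illustration of how `INFO/allele_length` and the two size filters relate, here is a minimal sketch. It assumes sequence-resolved REF and ALT alleles (no symbolic alleles such as `<DEL>`), and it assumes the 50 bp threshold is applied to the absolute allele length so that large deletions are treated like large insertions; that reading is not confirmed by the field definitions above. The function names are hypothetical.

```python
def allele_length(ref, alt):
    """Positive for insertions, negative for deletions, 0 for SNVs."""
    return len(alt) - len(ref)


def size_filter(source, length, threshold=50):
    """Return the filter name that applies, or None if the call passes.

    Assumption: the threshold compares against the absolute allele length.
    """
    if source == "DeepVariant" and abs(length) >= threshold:
        return "LARGE_SNV_INDEL"
    if source == "HPRC_SV_Integration" and abs(length) < threshold:
        return "SMALL_SV"
    return None


print(allele_length("A", "ATTT"))               # insertion of 3 bp
print(allele_length("ATTT", "A"))               # deletion of 3 bp
print(size_filter("DeepVariant", 60))           # indel caller emitting an SV-sized call
print(size_filter("HPRC_SV_Integration", -10))  # SV caller emitting an indel-sized call
```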
- Workflows should be structured in the following order, with each of the below separated by a blank line:
  - Imports.
  - Inputs.
  - Definition of variables dynamically generated in the workflow itself.
  - Calls to tasks.
  - Outputs.
- Tasks should be structured in the following order, with each of the below separated by a blank line:
  - Inputs.
  - Definition of variables dynamically generated in the task itself.
  - Command.
  - Outputs.
  - Runtime settings - default parameters, followed by a `select_first` with the runtime override, then the actual runtime block.
- Inputs should be structured in the following order, with each of the below separated by a blank line:
  - Core input files that will be run through the workflow - e.g. VCFs being annotated, BAMs being analyzed etc (as well as their indexes if applicable). Also the contigs to be run on as well as the prefix.
  - Parameters that govern how the file will be processed - e.g. prefixes, modes, input arguments to tools being called, PEDs, metadata files etc.
  - Reference files - e.g. reference fasta, their indexes, catalogs used for annotations, etc.
  - Runtime-related information that is not of type `RuntimeAttr` - e.g. docker paths, cores if applicable, sharding information if applicable.
  - All `RuntimeAttr?` inputs - there should be one per task called, with its name reflective of the task's function.
- Workflows should take in an input `prefix` that is passed to every task that creates output files, which should be used in conjunction with a descriptive suffix when creating outputs.
- Workflow imports should not be renamed using the `as` operator.
- Workflows should never contain any blank comments - e.g. `#########################`.
- Workflows should never contain any consecutive blank lines - i.e. they should have a maximum of one blank line at a time.
- Inputs passed to a task should not have blank lines between inputs.
- The order of inputs passed to a task should reflect their order in the inputs on the workflow level.
- Inputs passed to a task should have a space on either side of the `=` character.
- The inputs section of a task should not have blank lines between inputs.
- Tasks should always have input fields `docker` and `runtime_attr_override` defined, though what is passed to each one of these when calling the task should be explicitly named - e.g. `gatk_docker` and `runtime_attr_override_svannotate` respectively.
- Tasks should also have a `prefix` input defined, which is passed and set at the workflow level - the outputs from the task should simply use the prefix along with the file type.
- Every command block within a task should begin with `set -euo pipefail` followed by a blank line.
- The default `disk_size` for a task should be calculated dynamically based on the largest sized input file - or multiple if there are several large inputs, like multiple reference fastas or input catalogs. It should be defined in-line in the default runtime attributes section, unless it is a complicated function, in which case it can have a dedicated variable `disk_size`.
- The default `mem_gb`, `boot_disk_gb` and `compute_cores` for a task should be explicitly defined rather than based on an input file - they should be set based on the intensity of compute needed by that task.
- The default `preemptible_tries` for a task should always be 2.
- The default `max_retries` for a task should always be 0.
- The names of workflows and tasks should never include a `_` character within them - rather, they should always be in Pascal case.
- The names of inputs, variables and outputs should include a `_` to separate words, and be entirely lowercase unless they refer to a noun that is capitalized (e.g. PALMER or L1MEAID) - i.e. they should always be in snake case.
- There should never be any additional indentation in order to better align parts of the code to the length or horizontal/vertical spacing of other components in its section - indentation should only be applied at the start of a line.
- All mentions of `fasta` should instead use `fa` - e.g. `ref_fa` instead of `ref_fasta`.
- All mentions of `fasta_index`, `fasta_fai` or `fa_fai` should instead use `fai` - e.g. `ref_fai` instead of `ref_fasta_index`, `ref_fasta_fai` or `ref_fa_fai`.
- All mentions of `vcf_index` or `vcf_tbi` should instead use `vcf_idx`.
- All VCFs should have suffix `_vcf`, and be coupled with a VCF index file that has a suffix `_vcf_idx`.
- Tasks that can be generalized and used across workflows should live in `Helpers.wdl` and be imported by consumer workflows, rather than explicitly defined in a standalone workflow itself.
- Workflow file names must always match the workflow defined within them.
- Annotation workflows should always output a TSV file rather than a VCF, unless their annotations are done for every single variant in the input VCF or if the underlying workflow is designed to annotate variants in a VCF.
- All code should be formatted in-line with black's formatting, which can be applied via `black .`.
- All code should be compliant with `flake8`.
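Several of the WDL conventions above are mechanical enough to check automatically. The following is a rough sketch of such a check, not a tool that exists in this repository: the function name `lint_wdl` is hypothetical, it works on raw text with regular expressions rather than parsing WDL, and it covers only a handful of the rules.

```python
import re


def lint_wdl(text):
    """Return (line_number, message) pairs for a few of the conventions."""
    problems = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        number = i + 1
        # At most one blank line at a time.
        if i > 0 and not line.strip() and not lines[i - 1].strip():
            problems.append((number, "consecutive blank lines"))
        # Comments consisting solely of '#' characters.
        if re.fullmatch(r"\s*#+\s*", line):
            problems.append((number, "blank comment"))
        # Imports renamed with the 'as' operator.
        if re.match(r'\s*import\s+"[^"]+"\s+as\s+\w+', line):
            problems.append((number, "import renamed with 'as'"))
        # Workflow and task names must not contain '_'.
        named = re.match(r"\s*(workflow|task)\s+(\w+)", line)
        if named and "_" in named.group(2):
            problems.append((number, f"{named.group(1)} name contains '_'"))
        # Command blocks must open with 'set -euo pipefail' and a blank line.
        if re.match(r"\s*command\s*(<<<|\{)", line):
            first = lines[i + 1].strip() if i + 1 < len(lines) else ""
            second = lines[i + 2].strip() if i + 2 < len(lines) else ""
            if first != "set -euo pipefail" or second != "":
                problems.append((number, "command block lacks 'set -euo pipefail' header"))
    return problems


# A deliberately non-compliant snippet for demonstration.
sample = "\n".join([
    "version 1.0",
    "",
    'import "Helpers.wdl" as helpers',
    "",
    "",
    "task annotate_vcf {",
    "    command <<<",
    "        echo hi",
    "    >>>",
    "}",
])
problems = lint_wdl(sample)
for number, message in problems:
    print(f"line {number}: {message}")
```

A text-based check like this is easy to run in CI alongside `black` and `flake8`, though it will miss anything that needs real parsing, such as the ordering of input sections.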
- Workflows in `wdl/annotation/` should begin with `Annotate`.
- Workflows directly run in the pipeline should be in one of `wdl/annotation/`, `wdl/annotation_utils/` or `wdl/tools/`, and should have entries in `dockstore.yml` and `README.md`.
- All reference files - i.e. those not specific to an input callset - should be passed in via workspace data.
- All dockers should be passed in via workspace data.