broadinstitute/str-analysis

This repo contains scripts and utilities for analyzing tandem repeats (TRs).

Installation

To install the latest version using pip, run:

python3 -m pip install --upgrade str-analysis@git+https://github.com/broadinstitute/str-analysis

Alternatively, use the Docker image (though it may not contain the latest version of the code):

docker run -it weisburd/str-analysis:latest

Table of Contents


Variant Filtering & Detection

  • call_non_ref_motifs - Takes a BAM/CRAM file and, optionally, an ExpansionHunter variant catalog. For each locus, determines which STR motifs are supported by reads overlapping that locus before running ExpansionHunter on the detected motif(s). Useful for detecting non-reference pathogenic motifs (e.g., RFC1). (docs)
  • filter_vcf_to_tandem_repeats - Modern iteration of filter_vcf_to_STR_variants. Takes a VCF (single-sample or multi-sample) and filters it to insertions and deletions that represent tandem repeat expansions or contractions. Uses both brute-force k-mer search for perfect repeats and TandemRepeatFinder for imperfect repeats (VNTRs). Separates locus discovery from genotyping for more flexible downstream analysis.
  • filter_vcf_to_STR_variants - Original version of the above tool. Takes a single-sample VCF file and filters it to the INS/DEL variants that represent tandem repeat expansions or contractions by performing brute-force k-mer search on each variant's inserted or deleted bases. This tool was a core part of Weisburd, B., Tiao, G. & Rehm, H. L. Insights from a genome-wide truth set of tandem repeat variation. (2023)
  • parse_motif_composition - Takes a table of motifs known to occur at a particular VNTR or STR locus with motif variability (such as CACNA1C or RFC1), together with a BAM/CRAM file, and computes the observed frequency of these motifs in the reads aligned to this locus.
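
The brute-force k-mer search mentioned for filter_vcf_to_STR_variants can be illustrated with a minimal sketch. This is not the tool's actual code: the function name `find_repeat_unit` and the `min_repeats` parameter are hypothetical, and the sketch only checks whether an inserted or deleted sequence is itself a perfect repeat of some motif, without looking at the flanking reference sequence.

```python
def find_repeat_unit(seq, min_repeats=2):
    """Return (motif, repeat_count) if seq is a perfect tandem repeat, else None.

    Hypothetical helper for illustration only. Motif sizes are tried from
    shortest to longest, so the shortest motif that explains seq is returned.
    """
    seq = seq.upper()
    for motif_size in range(1, len(seq) // min_repeats + 1):
        # A perfect repeat must consist of a whole number of motif copies.
        if len(seq) % motif_size != 0:
            continue
        motif = seq[:motif_size]
        repeats = len(seq) // motif_size
        if motif * repeats == seq:
            return motif, repeats
    return None


if __name__ == "__main__":
    print(find_repeat_unit("CAGCAGCAG"))  # inserted bases of a hypothetical INS
    print(find_repeat_unit("ACGT"))       # not a tandem repeat
```

A real filter would presumably also compare the motif against the reference sequence next to the variant and handle imperfect repeats, which this sketch leaves out.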

Catalog Management & Annotation

  • merge_loci - Takes one or more STR catalogs and combines them into a single catalog while removing duplicates based on overlap and repeat motif.
  • annotate_and_filter_str_catalog - Takes an STR catalog and annotates loci based on their overlap with genes and known disease-associated STRs. Allows filtering by motif size, gene region, and various other criteria.
  • compute_catalog_stats - Takes an annotated catalog output by annotate_and_filter_str_catalog and computes various summary statistics.
  • add_offtarget_regions - Takes an ExpansionHunter variant catalog and adds a list of off-target regions to each locus definition by querying a database of off-target regions precomputed for each TR motif. This database was generated by using wgsim to simulate fully-repetitive reads for each motif, and then recording where these reads mapped on hg19 and hg38 after aligning them using bwa.
  • add_adjacent_loci_to_expansion_hunter_catalog - Takes an ExpansionHunter variant catalog and a BED file containing all simple repeats in the reference genome. Outputs a new catalog with updated LocusStructures and ReferenceRegions that include any adjacent repeats found near each locus.
  • split_adjacent_loci_in_expansion_hunter_catalog - Splits loci with adjacent repeats back into separate loci.
  • filter_out_loci_with_Ns_in_flanks - Removes loci from an ExpansionHunter catalog if their flanks contain enough Ns to trigger an ExpansionHunter error.
  • maximize_motif_purity - Adjusts motifs to maximize purity for each catalog interval.
  • trim_repeat_loci - Trims BED intervals so locus size is a multiple of motif size.
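
As a rough illustration of what trimming and purity mean for a catalog interval, here is a minimal sketch. Both functions are hypothetical and are not taken from trim_repeat_loci or maximize_motif_purity: the sketch assumes 0-based half-open coordinates, trims from the right end only, and defines purity as the fraction of bases that match a perfect repeat of the motif. The actual tools may make different choices.

```python
def trim_interval(start, end, motif):
    """Shrink a 0-based half-open interval to a whole number of motif copies.

    Hypothetical helper: bases are dropped from the right end only.
    """
    remainder = (end - start) % len(motif)
    return start, end - remainder


def motif_purity(seq, motif):
    """Fraction of bases in seq that match a perfect repeat of motif.

    Assumed definition for illustration: the repeat is anchored at the
    first base of seq, and each position is compared independently.
    """
    if not seq:
        return 0.0
    matches = sum(
        1 for i, base in enumerate(seq) if base == motif[i % len(motif)]
    )
    return matches / len(seq)


if __name__ == "__main__":
    print(trim_interval(100, 110, "CAG"))
    print(motif_purity("CAGCAGCAA", "CAG"))
```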

ExpansionHunter Results Processing

  • combine_str_json_to_tsv - Takes a set of ExpansionHunter JSON output files and combines them into a single TSV table.
  • combine_json_to_tsv - Takes a set of arbitrary JSON files that share the same schema and combines their top-level fields into a single TSV file.
  • postprocess_expansion_hunter_results - Cleans and post-processes ExpansionHunter output (filter hom-ref calls, convert formats).
  • filter_to_largest_expansions - Takes a combined table of TR genotypes (from combine_str_json_to_tsv) and filters it to the N most expanded genotypes per locus, based on either the long allele or the short allele (for recessive searches). Optionally filters by quality score, purity, and can discard non-polymorphic loci.
  • copy_EH_vcf_fields_to_json - Takes the ExpansionHunter output VCF and JSON file for a given sample and copies fields that are only present in the VCF to the JSON file.
  • check_trios_for_mendelian_violations - Takes a table of combined ExpansionHunter calls generated by combine_str_json_to_tsv as well as a FAM or PED file with parent/child relationships, and outputs a table of Mendelian violations in the callset.
  • compute_average_genotype_quality_per_sample - Computes average Q score per sample from a genotype table.
  • check_combined_results_tsv_for_pathogenic_repeats - Filters combined results for potential pathogenic expansions at known disease loci.
  • check_combined_results_tsv_for_genome_wide_repeats - Filters combined results for potential pathogenic expansions genome-wide.
  • run_reviewer - Takes ExpansionHunter output files for a single sample and runs REViewer on the subset of loci where the genotypes exceed locus-specific thresholds specified in the variant catalog.
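
The idea behind combine_json_to_tsv, flattening the top-level fields of several same-schema JSON files into one table, can be sketched as follows. The function name `json_records_to_tsv` is hypothetical and the sketch works on JSON strings rather than file paths. Serializing nested values with `json.dumps` is an assumption, not necessarily what the tool does.

```python
import csv
import io
import json


def json_records_to_tsv(json_strings):
    """Combine the top-level fields of several JSON objects into a TSV string.

    Hypothetical helper for illustration. Columns appear in first-seen order,
    and fields missing from a record are left empty.
    """
    records = [json.loads(s) for s in json_strings]

    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, delimiter="\t", lineterminator="\n", restval=""
    )
    writer.writeheader()
    for record in records:
        # Nested objects and lists are kept as compact JSON text (assumption).
        row = {
            key: json.dumps(value) if isinstance(value, (dict, list)) else value
            for key, value in record.items()
        }
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    print(json_records_to_tsv(['{"sample": "s1", "n": 3}', '{"sample": "s2", "n": 5}']))
```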

TRGT and LPS Processing

  • combine_lps_to_allele_histograms - Takes one or more tables of LPS (likelihood per sample) scores and combines them into per-locus allele histograms. Useful for aggregating genotype data across large cohorts.

ExpansionHunterDenovo Post-Processing

  • annotate_EHdn_locus_outliers - Takes an ExpansionHunterDenovo outlier result table (locus outliers or case-control) as well as a BED file containing all simple repeats in the reference genome and, optionally, a gene models GTF file, a variant catalog of known disease-associated loci, and/or other BED files with genomic regions of interest. Outputs a new table where each EHdn outlier is annotated with multiple columns related to the provided reference data.
  • convert_annotated_EHdn_locus_outliers_to_expansion_hunter_catalog - Takes the output table from annotate_EHdn_locus_outliers and lets the user apply a range of filters before writing out the passing loci to an ExpansionHunter variant catalog.
  • convert_expansion_hunter_denovo_locus_tsv_to_bed - Converts ExpansionHunterDenovo locus TSV results to BED format.

Read Extraction

  • make_bamlet - Optimized version of ExpansionHunterDenovo's make-bamlet.py. For a given STR region, extracts all relevant reads from a BAM or CRAM file into a much smaller BAMlet, which can be used as input to ExpansionHunter instead of the full BAM/CRAM and yields the same genotype. Reduces I/O operations for better performance.
  • make_minicram_for_expansion_hunter - Extracts minimal CRAM subset needed for ExpansionHunter genotyping.
  • print_reads - Extracts reads from CRAM/BAM files overlapping genomic intervals (lightweight alternative to GATK PrintReads).

Simulation

  • simulate_str_expansions - Uses wgsim to generate BAM files with simulated read data containing STR expansions at a given locus, with a specified number of repeats, motif, zygosity, etc.

gnomAD STR Data Generation

  • generate_gnomad_json - Generates gnomAD-formatted tables and JSON files with readviz metadata. Used to combine the gnomAD STR calls into the files available for download on the gnomAD website.
  • generate_gnomad_v2_json - Generates gnomAD v2 format output (legacy version).

Catalog Format Conversion

  • convert_bed_to_expansion_hunter_catalog - Converts BED files to ExpansionHunter variant catalog JSON format.
  • convert_expansion_hunter_catalog_to_bed - Converts ExpansionHunter variant catalogs to BED format.
  • convert_dat_to_bed - Converts Tandem Repeat Finder (TRF) .dat output files to BED format.
  • convert_expansion_hunter_catalog_to_gangstr_spec - Converts ExpansionHunter catalogs to GangSTR spec format.
  • convert_gangstr_spec_to_expansion_hunter_catalog - Converts GangSTR spec to ExpansionHunter variant catalog.
  • convert_gangstr_vcf_to_expansion_hunter_json - Converts GangSTR VCF output to ExpansionHunter JSON format.
  • convert_expansion_hunter_catalog_to_hipstr_format - Converts ExpansionHunter catalogs to HipSTR format.
  • convert_hipstr_vcf_to_expansion_hunter_json - Converts HipSTR VCF output to ExpansionHunter JSON format.
  • convert_expansion_hunter_catalog_to_trgt_catalog - Converts ExpansionHunter catalogs to TRGT catalog format.
  • convert_trgt_catalog_to_expansion_hunter_catalog - Converts TRGT repeat catalog BED files to ExpansionHunter variant catalog JSON format.
  • convert_trgt_vcf_to_expansion_hunter_json - Converts TRGT VCF output to ExpansionHunter JSON format.
  • convert_expansion_hunter_catalog_to_longtr_format - Converts ExpansionHunter catalogs to LongTR format.
  • convert_expansion_hunter_catalog_to_vamos_catalog - Converts ExpansionHunter catalogs to VAMOS catalog format.
  • convert_straglr_bed_to_expansion_hunter_json - Converts Straglr BED output to ExpansionHunter JSON format.
  • convert_strling_calls_to_expansion_hunter_json - Converts STRling calls to ExpansionHunter JSON format.
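
To give a feel for what a catalog conversion involves, here is a minimal sketch that turns BED lines into ExpansionHunter-style catalog entries. It is not the code of convert_bed_to_expansion_hunter_catalog: the function name `bed_to_catalog` and the LocusId naming scheme are hypothetical, and the sketch assumes that the fourth BED column holds the repeat motif and that both formats use 0-based half-open coordinates.

```python
import json


def bed_to_catalog(bed_lines):
    """Convert tab-separated BED lines (chrom, start, end, motif) to catalog entries.

    Hypothetical helper for illustration only.
    """
    catalog = []
    for line in bed_lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start, end, motif = line.rstrip("\n").split("\t")[:4]
        catalog.append({
            "LocusId": f"{chrom}-{start}-{end}-{motif}",  # assumed naming scheme
            "LocusStructure": f"({motif})*",
            "ReferenceRegion": f"{chrom}:{start}-{end}",
            "VariantType": "Repeat",
        })
    return catalog


if __name__ == "__main__":
    entries = bed_to_catalog(["chr1\t100\t130\tCAG\n"])
    print(json.dumps(entries, indent=2))
```

The reverse direction amounts to parsing `ReferenceRegion` and `LocusStructure` back into BED columns, though loci with several adjacent repeats need extra handling that this sketch does not cover.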
