dbVar Structural Variant Clusters (SVCs) for Data Analysis and Variant Comparison: Computing and Annotating SVCs on a Reference Genome
NCBI Hackathon 01.04.2016 - 01.06.2016
The NCBI dbVar Entrez database stores millions of "submitted structural variants" (ssv), organized by study. In the past, users had to download every study, which can amount to hundreds of files depending on the organism, to find out which regions or variants in dbVar are similar to or different from their own variant data. Comparison is further complicated because the submitted variants are of different types, come from different platforms and samples, and can be reported with either precise or fuzzy locations. The goal of this project is to create a set of Structural Variant Clusters (SVCs) on the GRCh38 assembly, where an SVC is defined as:
- A single variant in a genomic region: the SVC is defined by the start and end of the single variant and SVC count = 1
- Multiple variants in the genomic region with shared (overlapping) positions: the SVC boundaries are defined by the start and stop of the non-overlapping and overlapping regions and the SVC count = number of overlaps.
The benefits of having a defined set of SVCs are:
- Defined genomic regions with coordinates and ranges improve data exchange, data mining, computation, and reporting.
- Searching and matching of genomic coordinates across studies becomes easier.
- Annotations that co-locate with an SVC, such as disease and phenotype, frequency, and genomic features, are easier to aggregate.
- Display in the sequence viewer is simplified to an aggregated histogram or density track from all studies. Currently dbVar displays each study as a separate track, which can be slow to render and difficult to display on small screens.
Create a set of Structural Variant Clusters (SVCs) across the Reference Genome (RG), based on structural variants in the NCBI dbVar database on the GRCh38 assembly.
An SVC is:
- A single variant in a genomic region: the SVC is defined by the start and end of the single variant and SVC observed count = 1
- Multiple variants in the genomic region with shared (overlapping) positions: The SVC boundaries are defined by the start and stop of the non-overlapping and overlapping structural variant calls in dbVar, and the SVC observed count = number of overlaps. (Thus, SVCs do not overlap with each other.)
- No variants in a genomic region: SVC created to show non-variant regions in the genome, count = 0.
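The three cases above can be read as a segmentation of each chromosome at every variant start and stop, with the observed count being the number of variants covering each segment. The following is a minimal sketch of that reading, not the hackathon's implementation. It assumes 1-based inclusive coordinates, one chromosome at a time, and precise (non-fuzzy) locations; the function name `compute_svcs` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def compute_svcs(variants, chrom_length):
    """Split one chromosome into non-overlapping segments with coverage counts.

    variants: iterable of (start, end) pairs, 1-based inclusive (assumed).
    Returns a list of (start, end, count) tuples covering 1..chrom_length.
    """
    # Record where coverage changes: +1 at each start, -1 just past each end.
    delta = defaultdict(int)
    for start, end in variants:
        delta[start] += 1
        delta[end + 1] -= 1

    svcs = []
    pos = 1      # start of the segment currently being built
    depth = 0    # number of variants covering that segment
    for boundary in sorted(delta):
        if boundary > pos:
            # Close the segment that ends just before this boundary.
            svcs.append((pos, boundary - 1, depth))
            pos = boundary
        depth += delta[boundary]

    # Trailing segment up to the end of the chromosome, if any.
    if pos <= chrom_length:
        svcs.append((pos, chrom_length, depth))
    return svcs
```

For example, two variants at 100-200 and 150-300 on a 400 bp sequence yield five segments: 1-99 (count 0), 100-149 (count 1), 150-200 (count 2), 201-300 (count 1), and 301-400 (count 0).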
Generate a GVF file of SVCs as defined above based on GRCh38. Each region will have a unique ID (SVC1, SVC2, etc.). The SVC GVF file will be used as the basis for generating aggregated data, filtering, generating sequence viewer tracks, and for comparison with user data.
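As an illustration of what such a file could look like, the sketch below writes SVC segments as nine-column, tab-separated GVF records with sequential IDs. Several details are assumptions rather than project decisions: the `source` value, the use of `region` as the feature type, the lowercase `svc_count` attribute name, and the version number in the `##gvf-version` pragma. The function name `svcs_to_gvf` is hypothetical.

```python
def svcs_to_gvf(chrom, svcs, first_id=1, source="dbVar_SVC", feature_type="region"):
    """Format (start, end, count) segments as GVF text.

    first_id lets the caller continue numbering across chromosomes so that
    every SVC ID stays unique within the file.
    """
    # Header pragmas; the GVF version shown here is an assumption.
    lines = ["##gff-version 3", "##gvf-version 1.10"]
    for offset, (start, end, count) in enumerate(svcs):
        # svc_count is a hypothetical custom attribute holding the observed count.
        attributes = f"ID=SVC{first_id + offset};svc_count={count}"
        # Columns: seqid, source, type, start, end, score, strand, phase, attributes.
        fields = [chrom, source, feature_type, str(start), str(end), ".", ".", ".", attributes]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"
```

Calling `svcs_to_gvf("chr1", [(100, 149, 1), (150, 200, 2)])` produces two header lines followed by one record each for `SVC1` and `SVC2`.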
Generate a histogram track for the sequence viewer to show the frequency of the regions across studies in genomic context. Annotate SVCs with genes, colocated dbSNP RS and ClinVar RSV, and other colocated features.
- A tool for filtering the SVC GVF by variant types, region size, region count, and other parameters.
  - by chromosome
  - by variant types
  - other user-defined splitting and filtering
- A tool for users to compare their data with the SVC GVF and report matching regions of overlap.
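A rough sketch of how the filtering and comparison tools could work on SVC segments is shown below. It assumes segments are `(start, end, count)` tuples on a single chromosome with 1-based inclusive coordinates, and it covers only filtering by size and count; filtering by variant type would need that information carried on each record. The names `filter_svcs` and `overlapping_svcs` are hypothetical.

```python
def filter_svcs(svcs, min_count=1, min_size=1, max_size=None):
    """Keep segments whose observed count and size fall within the given limits."""
    kept = []
    for start, end, count in svcs:
        size = end - start + 1  # inclusive coordinates
        if count < min_count or size < min_size:
            continue
        if max_size is not None and size > max_size:
            continue
        kept.append((start, end, count))
    return kept


def overlapping_svcs(svcs, query_start, query_end):
    """Report segments that overlap a user region, with the shared interval.

    Returns (svc_start, svc_end, count, overlap_start, overlap_end) tuples.
    """
    hits = []
    for start, end, count in svcs:
        # Two inclusive intervals overlap when each starts before the other ends.
        if start <= query_end and end >= query_start:
            hits.append((start, end, count, max(start, query_start), min(end, query_end)))
    return hits
```

The linear scan is kept for readability; for genome-scale data, a sorted list with binary search or an interval tree would be a more practical choice.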
Additional annotation sources:
Annotations from other sources may be used as long as the source provides specific start and stop coordinates on the reference genome in use. Possible additional annotation sources to consider: colocated dbSNP RS, ClinVar RSV (Clinical Significance), and other colocated features.
Generate a histogram track for sequence viewer to show the frequency of the SVCs across studies in genomic context.
Reference: The dbVar home page is at: