RLE normalization on microsatellites / tandem repeat changes #592
Replies: 2 comments 2 replies
My opinion is that the current behavior is correct. Expanding unchanged reference alleles would lead to equivalent VRS IDs where they should not exist. A simpler example: With the current behavior, those two HGVS expressions would be assigned different VRS IDs. If roll left and expand is applied, they would have the same VRS ID.
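The collapse described above can be illustrated on a toy sequence. This is a minimal sketch, not the example from the comment: the reference string, its flanks, `expand_reference_span`, and `toy_id` are all hypothetical, and `toy_id` is a stand-in hash rather than the real VRS digest. It assumes "roll left and expand" means extending a span in both directions for as long as the sequence stays periodic with the span's own length.

```python
import hashlib

MOTIF = "TTCCTCTCCTCCTGCCCCACC"
# Hypothetical toy reference: flank + GCCCCACC + 3 motif copies + flank.
REF = "AAAA" + "GCCCCACC" + MOTIF * 3 + "GGGG"


def expand_reference_span(ref, start, end):
    """Extend the interbase span [start, end) left and right while the
    sequence stays periodic with period (end - start)."""
    period = end - start
    while start > 0 and ref[start - 1] == ref[start - 1 + period]:
        start -= 1
    while end < len(ref) and ref[end] == ref[end - period]:
        end += 1
    return start, end


def toy_id(start, end):
    """Stand-in identifier for a same-as-ref span; NOT the VRS digest."""
    return hashlib.sha256(f"{start}:{end}".encode()).hexdigest()[:12]


# Two same-as-reference spans that name different copies of the repeat.
first_copy = (12, 33)
second_copy = (33, 54)

# Kept as given: two distinct spans, two distinct identifiers.
as_given = {toy_id(*first_copy), toy_id(*second_copy)}

# Expanded: both spans grow to the same region and share one identifier.
expanded = {
    toy_id(*expand_reference_span(REF, *first_copy)),
    toy_id(*expand_reference_span(REF, *second_copy)),
}

print(len(as_given), len(expanded))
```

Under these assumptions the two spans stay distinguishable only when same-as-reference alleles are left unexpanded, which is the concern raised in the comment.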
Hi @ehclark, thanks for the reply. This is useful feedback and I think you may be correct, so thanks for prompting me to pump the brakes.

In the specification: https://vrs.ga4gh.org/en/latest/conventions/normalization.html#allele-normalization

In the del/ins cases, though, it falls past 2.a, continues to be VOCA-normalized, and rolls left 8 bp down to 930081.

It is a little confusing that deletions, insertions, and substitutions are VOCA-normalized, while same-as-ref locations are not normalized and can refer to specific repeats within a longer repeating region. But @larrybabb (@ahwagner) said there may be some biological or clinical reason that this is desirable specifically in the same-as-ref case but not in the other variation cases. Do you have any thoughts on why that would be desirable? Or @ahwagner?

Also, in any case, could we add some additional documentation here: https://vrs.ga4gh.org/en/latest/conventions/normalization.html#general-normalization-rules

For non-same-as-ref Alleles, the user can either normalize or not (they should). e.g. these two pairs of variants refer to the same base pairs but the
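To make the branch point at 2.a concrete, here is a minimal sketch of the trim-and-compare step. It assumes the linked algorithm trims common flanking sequence and then compares what is left: both empty means a reference allele that is returned unmodified, both non-empty means a substitution, and exactly one empty means an insertion or deletion that continues on to rolling and expansion. The function names and the 2.b/2.c labels in the comments are this sketch's reading of the specification, not vrs-python API.

```python
MOTIF = "TTCCTCTCCTCCTGCCCCACC"


def trim_common(ref_seq, alt_seq):
    """Trim the common suffix, then the common prefix, of two sequences."""
    suffix = 0
    limit = min(len(ref_seq), len(alt_seq))
    while suffix < limit and ref_seq[-1 - suffix] == alt_seq[-1 - suffix]:
        suffix += 1
    if suffix:
        ref_seq, alt_seq = ref_seq[:-suffix], alt_seq[:-suffix]

    prefix = 0
    limit = min(len(ref_seq), len(alt_seq))
    while prefix < limit and ref_seq[prefix] == alt_seq[prefix]:
        prefix += 1
    return ref_seq[prefix:], alt_seq[prefix:]


def classify(ref_seq, alt_seq):
    """Classify an allele by what remains after trimming."""
    ref_rest, alt_rest = trim_common(ref_seq, alt_seq)
    if not ref_rest and not alt_rest:
        return "reference"      # 2.a: returned unmodified, never rolled
    if ref_rest and alt_rest:
        return "substitution"   # assumed 2.b: returned after trimming
    return "indel"              # assumed 2.c: continues to roll and expand


print(classify(MOTIF * 3, MOTIF * 3))  # three copies stated, three in ref
print(classify(MOTIF * 3, MOTIF * 2))  # one copy deleted
print(classify("A", "G"))
```

Read this way, the same-as-ref case exits before any rolling happens, which matches the difference in start coordinates (930089 versus 930081) discussed in this thread.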
I was prompted by this issue: #582 to look into what it would take to implement some handling for simple microsatellite expressions in vrs-python as a workaround for biocommons/hgvs not supporting them yet (issue: biocommons/hgvs#113). Since it is relevant to ReferenceLengthExpressions, I wanted to look at it now while it's fresh in my mind, and it may be useful to do some part of it now.

I am working off the vrs-python branch issue-577-VCF-RLE (PR: #589).

I came across some interesting behavior which maybe is correct and maybe needs fixing. I am not sure yet.
I started with this HGVS expression:

NC_000001.11:g.930090TTCCTCTCCTCCTGCCCCACC[2]

This can be converted into SPDI or an HGVS del or ins by going to the reference sequence, counting how many copies are there right now, and then determining whether it's a del or ins by comparing that count to the count 2 in this variant expression. In this case it turns out the reference has 3 copies of TTCCTCTCCTCCTGCCCCACC. So this is a deletion of one TTCCTCTCCTCCTGCCCCACC, a 21 base pair deletion.

We can take 930090 and add 21*2 to get the start of the rightmost copy, or add 21*3-1 to get the end of the rightmost copy.

Which leads to this HGVS del expression:

NC_000001.11:g.930132_930152del

From this expression, AlleleTranslator returns:

{
  "id": "ga4gh:VA.DqozDL1k644wgQiHgjVp8PODJkDFn2dS",
  "type": "Allele",
  "digest": "DqozDL1k644wgQiHgjVp8PODJkDFn2dS",
  "location": {
    "id": "ga4gh:SL.Z2axmOUY_oy7Sy26jzOduCprU3WIKzAf",
    "type": "SequenceLocation",
    "digest": "Z2axmOUY_oy7Sy26jzOduCprU3WIKzAf",
    "sequenceReference": {
      "type": "SequenceReference",
      "refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO"
    },
    "start": 930081,
    "end": 930152
  },
  "state": {
    "type": "ReferenceLengthExpression",
    "length": 50,
    "sequence": "GCCCCACCTTCCTCTCCTCCTGCCCCACCTTCCTCTCCTCCTGCCCCACC",
    "repeatSubunitLength": 21
  }
}

So vrs-python/bioutils determined that this TTCCTCTCCTCCTGCCCCACC is actually part of a repeating region which can be rolled left to include the preceding GCCCCACC, meaning the repeating motif is actually GCCCCACCTTCCTCTCCTCCT.

An expansion to 4 microsatellite (non-normalized) copies would then be an expansion to 92bp.
{
  "id": "ga4gh:VA.yRLo23g3Le4t2P4R0rdtWOolw4Qo4yfs",
  "type": "Allele",
  "digest": "yRLo23g3Le4t2P4R0rdtWOolw4Qo4yfs",
  "location": {
    "id": "ga4gh:SL.Z2axmOUY_oy7Sy26jzOduCprU3WIKzAf",
    "type": "SequenceLocation",
    "digest": "Z2axmOUY_oy7Sy26jzOduCprU3WIKzAf",
    "sequenceReference": {
      "type": "SequenceReference",
      "refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO"
    },
    "start": 930081,
    "end": 930152
  },
  "state": {
    "type": "ReferenceLengthExpression",
    "length": 92,
    "sequence": "GCCCCACCTTCCTCTCCTCCTGCCCCACCTTCCTCTCCTCCTGCCCCACCTTCCTCTCCTCCTGCCCCACCTTCCTCTCCTCCTGCCCCACC",
    "repeatSubunitLength": 21
  }
}

But let's say the variant was same-as-reference / no-change. So 3 copies.
The current code in the branch (revision 26eeba3) gives this result. It does not roll left down to 930081.

{
  "id": "ga4gh:VA.ahoVwNQ9Ag-B9zLWWWgNLCDyPt_-gHr5",
  "type": "Allele",
  "digest": "ahoVwNQ9Ag-B9zLWWWgNLCDyPt_-gHr5",
  "location": {
    "id": "ga4gh:SL.m2kYCQSMG9-Wqo_kaUC_iZ-5lBoTSY6u",
    "type": "SequenceLocation",
    "digest": "m2kYCQSMG9-Wqo_kaUC_iZ-5lBoTSY6u",
    "sequenceReference": {
      "type": "SequenceReference",
      "refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO"
    },
    "start": 930089,
    "end": 930152
  },
  "state": {
    "type": "ReferenceLengthExpression",
    "length": 63,
    "sequence": "TTCCTCTCCTCCTGCCCCACCTTCCTCTCCTCCTGCCCCACCTTCCTCTCCTCCTGCCCCACC",
    "repeatSubunitLength": 63
  }
}
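The coordinate arithmetic in this post can be checked with a short script. This is a minimal sketch under stated assumptions: the helper names and the toy reference with its AAAA/GGGG flanks are hypothetical and not part of vrs-python or bioutils, while the motif, positions, and location bounds are the ones quoted above. It counts tandem copies, works out the HGVS del span for the rightmost copy, and recomputes the ReferenceLengthExpression lengths from the reported locations.

```python
MOTIF = "TTCCTCTCCTCCTGCCCCACC"
# Hypothetical toy reference; the real flanks on NC_000001.11 are not used.
TOY_REF = "AAAA" + MOTIF * 3 + "GGGG"


def count_tandem_copies(seq, offset, motif):
    """Count back-to-back copies of motif starting at 0-based offset."""
    copies = 0
    while seq.startswith(motif, offset + copies * len(motif)):
        copies += 1
    return copies


def classify_change(ref_copies, target_copies, unit_len):
    """Return (kind, signed base-pair change) for a repeat-count change."""
    delta = (target_copies - ref_copies) * unit_len
    kind = "del" if delta < 0 else "ins" if delta > 0 else "no-change"
    return kind, delta


def rightmost_deleted_span(first_base, unit_len, ref_copies, target_copies):
    """1-based inclusive span removed when the rightmost copies are deleted."""
    if target_copies >= ref_copies:
        raise ValueError("not a deletion")
    start = first_base + unit_len * target_copies
    end = first_base + unit_len * ref_copies - 1
    return start, end


def rle_length(loc_start, loc_end, delta):
    """State length for an interbase location after a signed length change."""
    return (loc_end - loc_start) + delta


unit_len = len(MOTIF)                                  # 21
ref_copies = count_tandem_copies(TOY_REF, 4, MOTIF)    # 3 in the toy reference
kind, delta = classify_change(ref_copies, 2, unit_len)
start, end = rightmost_deleted_span(930090, unit_len, ref_copies, 2)
hgvs = f"NC_000001.11:g.{start}_{end}del"
print(kind, hgvs)

# Lengths for the locations reported above (interbase start/end).
print(rle_length(930081, 930152, -unit_len))  # 2 copies
print(rle_length(930081, 930152, +unit_len))  # 4 copies
print(rle_length(930089, 930152, 0))          # same-as-reference
```

The three lengths agree with the `length` fields in the Alleles shown above (50, 92, and 63), so the only open question is whether the same-as-reference location should also start at 930081.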