-
Notifications
You must be signed in to change notification settings - Fork 13
Description
This is referring to the step "Parse dbSNP dump" in the section "Rebuilding the database for SNP normalization" of this page.
I have wrote a port to Python for your code parsing the dnSNP database (JSON Format) and in the process I have found that the current logic of ParseJSONToFile.java seems to miss names ("hgvs" notations for a given rsid) which are instead reported in the web version of dbSNP.
This can be seen for instance in the results for "rs757229".
In the file rs757229.txt there's the entry in JSON format (the extension is .txt but it is a JSON file: GitHub won't allow uploading JSON file) as reported in the current version of dbSNP.
The script below uses jq to reproduce the extraction logic of ParseJSONToFile.java and the proposed modification (sorry, Java is really not my thing, but it should be easy port it).
#!/usr/bin/env bash
FILE="rs757229.txt"
if [[ ! -f "$FILE" ]]; then
echo "Plase place the filer rs757229.txt in this folder"
exit 1
fi
echo "RS757229"
echo "SETH Implementation"
ALLELE_ANNOTATIONS=$(jq -r ".primary_snapshot_data.allele_annotations | length" "$FILE")
for i in $(seq 0 $(("$ALLELE_ANNOTATIONS"-1))); do
ASSEMBLY_ANNOTATIONS=$(jq -r ".primary_snapshot_data.allele_annotations[$i].assembly_annotation | length" "$FILE")
for j in $(seq 0 $(("$ASSEMBLY_ANNOTATIONS"-1))); do
GENES=$(jq -r ".primary_snapshot_data.allele_annotations[$i].assembly_annotation[$j].genes | length" "$FILE")
for k in $(seq 0 $(("$GENES"-1))); do
RNAS=$(jq -r ".primary_snapshot_data.allele_annotations[$i].assembly_annotation[$j].genes[$k].rnas | length" "$FILE")
for l in $(seq 0 $(("$RNAS"-1))); do
REFSEQ=$(jq ".primary_snapshot_data.allele_annotations[$i].assembly_annotation[$j].genes[$k].rnas[$l].id" "$FILE")
HGVS=$(jq ".primary_snapshot_data.allele_annotations[$i].assembly_annotation[$j].genes[$k].rnas[$l].hgvs" "$FILE")
echo "$REFSEQ - $HGVS"
done
done
done
done
echo "Proposed changed (match data reported by browser version of dbsnp)"
PLACEMENT_WITH_ALLELES=$(jq -r ".primary_snapshot_data.placements_with_allele | length" "$FILE")
for i in $(seq 0 $(("$PLACEMENT_WITH_ALLELES"-1))); do
ALLELES=$(jq -r ".primary_snapshot_data.placements_with_allele[$i].alleles | length" "$FILE")
for j in $(seq 0 $(("$ALLELES"-1))); do
REFSEQ_HGVS=$(jq -r ".primary_snapshot_data.placements_with_allele[$i].alleles[$j].hgvs" "$FILE")
echo "$REFSEQ_HGVS"
done
doneIf you run it with the file I provided the output will look like this:
RS757229
SETH Implementation
"NM_001039847.3" - null
"NM_001039848.4" - null
"NM_001367832.1" - null
"NM_002085.5" - null
"NM_001039847.3" - null
"NM_001039848.4" - null
"NM_001367832.1" - null
"NM_002085.5" - null
"NM_001039847.3" - null
"NM_001039848.4" - null
"NM_001367832.1" - null
"NM_002085.5" - null
"NM_001039847.3" - null
"NM_001039848.4" - null
"NM_001367832.1" - null
"NM_002085.5" - null
Proposed changed (match data reported by browser version of dbsnp)
NC_000019.10:g.1102115=
NC_000019.10:g.1102115G>A
NC_000019.10:g.1102115G>C
NC_000019.10:g.1102115G>T
NC_000019.9:g.1102114=
NC_000019.9:g.1102114G>A
NC_000019.9:g.1102114G>C
NC_000019.9:g.1102114G>T
NG_050621.1:g.3190=
NG_050621.1:g.3190G>A
NG_050621.1:g.3190G>C
NG_050621.1:g.3190G>TAs you can see the current logic misses all "hgvs" names which are present in the web verision.
The only issue w/ the proposed solution is that you need to get rid of the "no change" entries like "NC_000019.10:g.1102115=".