Skip to content

Convert dbSNP into TSV-files: ParseJSONToFile.java misses HGVS names #45

@sg-wbi

Description

@sg-wbi

This is referring to the step "Parse dbSNP dump" in the section "Rebuilding the database for SNP normalization" of this page.

I have wrote a port to Python for your code parsing the dnSNP database (JSON Format) and in the process I have found that the current logic of ParseJSONToFile.java seems to miss names ("hgvs" notations for a given rsid) which are instead reported in the web version of dbSNP.

This can be seen for instance in the results for "rs757229".

In the file rs757229.txt there's the entry in JSON format (the extension is .txt but it is a JSON file: GitHub won't allow uploading JSON file) as reported in the current version of dbSNP.

The script below uses jq to reproduce the extraction logic of ParseJSONToFile.java and the proposed modification (sorry, Java is really not my thing, but it should be easy port it).

#!/usr/bin/env bash
FILE="rs757229.txt"

if [[ ! -f "$FILE" ]]; then
  echo "Plase place the filer rs757229.txt in this folder"
  exit 1
fi

echo "RS757229"

echo "SETH Implementation"
ALLELE_ANNOTATIONS=$(jq -r ".primary_snapshot_data.allele_annotations | length" "$FILE")
for i in $(seq 0 $(("$ALLELE_ANNOTATIONS"-1))); do
  ASSEMBLY_ANNOTATIONS=$(jq -r ".primary_snapshot_data.allele_annotations[$i].assembly_annotation | length"  "$FILE")
  for j in $(seq 0 $(("$ASSEMBLY_ANNOTATIONS"-1))); do
    GENES=$(jq -r ".primary_snapshot_data.allele_annotations[$i].assembly_annotation[$j].genes | length"  "$FILE")
    for k in $(seq 0 $(("$GENES"-1))); do
      RNAS=$(jq -r ".primary_snapshot_data.allele_annotations[$i].assembly_annotation[$j].genes[$k].rnas | length"  "$FILE")
      for l in $(seq 0 $(("$RNAS"-1))); do
        REFSEQ=$(jq ".primary_snapshot_data.allele_annotations[$i].assembly_annotation[$j].genes[$k].rnas[$l].id"  "$FILE")
        HGVS=$(jq ".primary_snapshot_data.allele_annotations[$i].assembly_annotation[$j].genes[$k].rnas[$l].hgvs"  "$FILE")
        echo "$REFSEQ - $HGVS"
      done
    done
  done
done

echo "Proposed changed (match data reported by browser version of dbsnp)"
PLACEMENT_WITH_ALLELES=$(jq -r ".primary_snapshot_data.placements_with_allele | length" "$FILE")
for i in $(seq 0 $(("$PLACEMENT_WITH_ALLELES"-1))); do
  ALLELES=$(jq -r ".primary_snapshot_data.placements_with_allele[$i].alleles | length" "$FILE")
  for j in $(seq 0 $(("$ALLELES"-1))); do
    REFSEQ_HGVS=$(jq -r ".primary_snapshot_data.placements_with_allele[$i].alleles[$j].hgvs" "$FILE")
    echo "$REFSEQ_HGVS"
  done
done

If you run it with the file I provided the output will look like this:

RS757229
SETH Implementation
"NM_001039847.3" - null
"NM_001039848.4" - null
"NM_001367832.1" - null
"NM_002085.5" - null
"NM_001039847.3" - null
"NM_001039848.4" - null
"NM_001367832.1" - null
"NM_002085.5" - null
"NM_001039847.3" - null
"NM_001039848.4" - null
"NM_001367832.1" - null
"NM_002085.5" - null
"NM_001039847.3" - null
"NM_001039848.4" - null
"NM_001367832.1" - null
"NM_002085.5" - null
Proposed changed (match data reported by browser version of dbsnp)
NC_000019.10:g.1102115=
NC_000019.10:g.1102115G>A
NC_000019.10:g.1102115G>C
NC_000019.10:g.1102115G>T
NC_000019.9:g.1102114=
NC_000019.9:g.1102114G>A
NC_000019.9:g.1102114G>C
NC_000019.9:g.1102114G>T
NG_050621.1:g.3190=
NG_050621.1:g.3190G>A
NG_050621.1:g.3190G>C
NG_050621.1:g.3190G>T

As you can see the current logic misses all "hgvs" names which are present in the web verision.

The only issue w/ the proposed solution is that you need to get rid of the "no change" entries like "NC_000019.10:g.1102115=".

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions