Convert dbSNP into TSV-files: ParseJSONToFile.java misses HGVS names

This is referring to the step "Parse dbSNP dump" in the section "Rebuilding the database for SNP normalization" of [this page](https://rockt.github.io/SETH/).

I have wrote a port to Python for your code parsing the dnSNP database (JSON Format) and in the process I have found that the current logic of `ParseJSONToFile.java` seems to miss names ("hgvs" notations for a given rsid) which are instead reported in the web version of dbSNP.

This can be seen for instance in the results for "rs757229".

In the file [rs757229.txt](https://github.com/rockt/SETH/files/11610838/rs757229.txt)  there's the entry in JSON format (the extension is `.txt` but it is a JSON file: GitHub won't allow uploading JSON file) as reported in the current version of dbSNP.

The script below uses [jq](https://jqlang.github.io/jq/) to reproduce the extraction logic of  `ParseJSONToFile.java` and the proposed modification (sorry, Java is really not my thing, but it should be easy port it).

```bash
#!/usr/bin/env bash
FILE="rs757229.txt"

if [[ ! -f "$FILE" ]]; then
  echo "Plase place the filer rs757229.txt in this folder"
  exit 1
fi

echo "RS757229"

echo "SETH Implementation"
ALLELE_ANNOTATIONS=$(jq -r ".primary_snapshot_data.allele_annotations | length" "$FILE")
for i in $(seq 0 $(("$ALLELE_ANNOTATIONS"-1))); do
  ASSEMBLY_ANNOTATIONS=$(jq -r ".primary_snapshot_data.allele_annotations[$i].assembly_annotation | length"  "$FILE")
  for j in $(seq 0 $(("$ASSEMBLY_ANNOTATIONS"-1))); do
    GENES=$(jq -r ".primary_snapshot_data.allele_annotations[$i].assembly_annotation[$j].genes | length"  "$FILE")
    for k in $(seq 0 $(("$GENES"-1))); do
      RNAS=$(jq -r ".primary_snapshot_data.allele_annotations[$i].assembly_annotation[$j].genes[$k].rnas | length"  "$FILE")
      for l in $(seq 0 $(("$RNAS"-1))); do
        REFSEQ=$(jq ".primary_snapshot_data.allele_annotations[$i].assembly_annotation[$j].genes[$k].rnas[$l].id"  "$FILE")
        HGVS=$(jq ".primary_snapshot_data.allele_annotations[$i].assembly_annotation[$j].genes[$k].rnas[$l].hgvs"  "$FILE")
        echo "$REFSEQ - $HGVS"
      done
    done
  done
done

echo "Proposed changed (match data reported by browser version of dbsnp)"
PLACEMENT_WITH_ALLELES=$(jq -r ".primary_snapshot_data.placements_with_allele | length" "$FILE")
for i in $(seq 0 $(("$PLACEMENT_WITH_ALLELES"-1))); do
  ALLELES=$(jq -r ".primary_snapshot_data.placements_with_allele[$i].alleles | length" "$FILE")
  for j in $(seq 0 $(("$ALLELES"-1))); do
    REFSEQ_HGVS=$(jq -r ".primary_snapshot_data.placements_with_allele[$i].alleles[$j].hgvs" "$FILE")
    echo "$REFSEQ_HGVS"
  done
done
```

If you run it with the file I provided the output will look like this:
```bash
RS757229
SETH Implementation
"NM_001039847.3" - null
"NM_001039848.4" - null
"NM_001367832.1" - null
"NM_002085.5" - null
"NM_001039847.3" - null
"NM_001039848.4" - null
"NM_001367832.1" - null
"NM_002085.5" - null
"NM_001039847.3" - null
"NM_001039848.4" - null
"NM_001367832.1" - null
"NM_002085.5" - null
"NM_001039847.3" - null
"NM_001039848.4" - null
"NM_001367832.1" - null
"NM_002085.5" - null
Proposed changed (match data reported by browser version of dbsnp)
NC_000019.10:g.1102115=
NC_000019.10:g.1102115G>A
NC_000019.10:g.1102115G>C
NC_000019.10:g.1102115G>T
NC_000019.9:g.1102114=
NC_000019.9:g.1102114G>A
NC_000019.9:g.1102114G>C
NC_000019.9:g.1102114G>T
NG_050621.1:g.3190=
NG_050621.1:g.3190G>A
NG_050621.1:g.3190G>C
NG_050621.1:g.3190G>T
```
As you can see the current logic misses all "hgvs" names which are present in the [web verision](https://www.ncbi.nlm.nih.gov/snp/?term=757229).

The only issue w/ the proposed solution is that you need to get rid of the "no change" entries like "NC_000019.10:g.1102115=".


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Convert dbSNP into TSV-files: ParseJSONToFile.java misses HGVS names #45

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Convert dbSNP into TSV-files: ParseJSONToFile.java misses HGVS names #45

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions