Entrez Gene is the NCBI database of gene-specific information. It provides "tracked, unique identifiers for genes" and reports "information associated with those identifiers for unrestricted public use" [source]. We use Entrez Gene as the primary gene vocabulary for our drug repurposing research.
This repository creates user-friendly datasets from Entrez Gene. We currently focus on human genes only.
The Python notebook process.ipynb executes the analysis. Files downloaded from external locations are stored in the download directory. The following created datasets reside in the data directory:
- genes-human.tsv: human genes with a select set of fields storing additional attributes
- symbols-human.tsv: a table of GeneID, symbol, and symbol type (synonym or primary)
- symbols-human.json: a Symbol–GeneID mapping of primary symbols only
- synonyms-human.json: a Symbol–GeneIDs mapping for synonyms
- symbol-map.json: a Symbol–GeneID mapping with approved symbols and unambiguous synonyms
- xrefs-human.tsv: mappings to external resources
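
To illustrate how the symbol files relate to one another, here is a minimal sketch of how a mapping like symbol-map.json could be derived from rows shaped like symbols-human.tsv. It is not the code in process.ipynb. The function name `build_symbol_map`, the tuple layout, the type labels `"primary"` and `"synonym"`, and the toy symbols and IDs are all assumptions made for this example. The sketch also assumes that a primary symbol takes precedence over a synonym with the same text, and that a synonym counts as unambiguous when it points to exactly one GeneID.

```python
from collections import defaultdict


def build_symbol_map(rows):
    """Combine primary symbols with unambiguous synonyms (hypothetical helper).

    rows: iterable of (gene_id, symbol, symbol_type) tuples, where
    symbol_type is assumed to be "primary" or "synonym".
    """
    primary = {}
    synonyms = defaultdict(set)
    for gene_id, symbol, symbol_type in rows:
        if symbol_type == "primary":
            primary[symbol] = gene_id
        else:
            synonyms[symbol].add(gene_id)

    # Start from the primary symbols, then add synonyms that map to a
    # single gene and do not collide with any primary symbol.
    symbol_map = dict(primary)
    for symbol, gene_ids in synonyms.items():
        if symbol not in primary and len(gene_ids) == 1:
            symbol_map[symbol] = next(iter(gene_ids))
    return symbol_map


# Toy rows with made-up symbols and GeneIDs, not real Entrez Gene records.
rows = [
    (1, "GENEA", "primary"),
    (2, "GENEB", "primary"),
    (1, "ALT1", "synonym"),    # unambiguous synonym: kept
    (1, "SHARED", "synonym"),  # maps to two genes: dropped
    (2, "SHARED", "synonym"),
    (2, "GENEA", "synonym"),   # collides with a primary symbol: dropped
]
symbol_map = build_symbol_map(rows)
print(symbol_map)
```

With real data, the rows would come from reading symbols-human.tsv with a TSV reader. Check that file's header for the actual column names and type labels before adapting the sketch.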