A pipeline that converts NCBI GeneRIF free-text annotations into a structured, confidence-weighted knowledge graph of gene–disease–chemical–pathway relationships.
GeneRIF (Gene Reference Into Function) is an NCBI database of ~1 million curator-submitted sentences, each linking a gene to a functional claim supported by one or more PubMed IDs. For example:
"HIF-1alpha induces VEGF expression and promotes angiogenesis in response to hypoxia." — Gene: HIF1A · PMID: 12345678
Each entry is short, focused, and tied to primary literature. Collectively they represent one of the densest structured summaries of gene function available without running full NLP on the primary literature.
This project converts those sentences into a directed, typed knowledge graph:
HIF1A ──PROMOTES──► VEGF (90 PMIDs, conf 0.85)
HIF1A ──INHIBITS──► tumor (9 PMIDs, conf 0.86)
VEGF ──PROMOTES──► angiogenesis (7 PMIDs, conf 0.82)
Four scripts run sequentially. All data lives in /data/; all outputs go to /results/.
01_entity_analysis.py — Entity type survey
Scans the full GeneRIF corpus to characterise what biomedical entity types appear and in what proportions. Uses:
- BC5CDR (scispaCy en_ner_bc5cdr_md) for DISEASE and CHEMICAL recognition
- Regex patterns for genes, miRNAs, lncRNAs, mutations, pathways, biological processes, tissues, and subcellular locations
Output: entity_type_summary.tsv, entity_examples.tsv
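The regex half of the survey can be sketched as a per-type pattern table plus a counter. The two patterns and the function name `survey_entity_types` below are illustrative assumptions, not the project's actual patterns, and the BC5CDR pass is left out.

```python
import re
from collections import Counter

# Hypothetical patterns for two of the regex-detected types. The real script
# covers more types, and its patterns are not reproduced here.
PATTERNS = {
    "miRNA": re.compile(r"\b(?:hsa-)?(?:miR|let)-\d+[a-z]?(?:-[35]p)?\b"),
    "Mutation": re.compile(r"\b[A-Z]\d+[A-Z]\b"),  # e.g. V600E
}


def survey_entity_types(sentences):
    """Count regex matches per entity type across an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        for etype, pattern in PATTERNS.items():
            counts[etype] += len(pattern.findall(sentence))
    return counts
```

A pattern this loose will produce false positives (the mutation pattern also matches gene symbols such as E2F), which is one reason a survey of proportions is useful before committing to an extraction schema.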
02_graph_extraction.py — Prototype triple extraction
Development scaffold. Runs steps 1–3 (NER → relation extraction → entity normalisation) on a sample of GeneRIF entries and writes example triples for inspection.
- Relation extraction: 15 edge types classified by verb-window pattern matching (PROMOTES, INHIBITS, REGULATES, PHOSPHORYLATES, BINDS, ASSOCIATED_WITH, LOCALIZES_TO, …)
- Entity normalisation: Entrez Gene IDs (NCBI gene_info), UMLS CUIs via MeSH linker (diseases and chemicals), standardised miRNA names
Output: graph_triples.tsv, graph_examples.tsv, normalization_examples.tsv
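Verb-window classification can be sketched as follows, assuming the window is the text between the subject and object mentions and that the earliest trigger in that window wins. The trigger lexicon, the fallback edge type, and the name `classify_edge` are hypothetical; only four of the 15 edge types are shown.

```python
import re

# Hypothetical trigger lexicon for a subset of edge types.
EDGE_PATTERNS = {
    "PROMOTES": re.compile(r"\b(?:promot|driv)\w*", re.I),
    "INHIBITS": re.compile(r"\binhibit\w*", re.I),
    "PHOSPHORYLATES": re.compile(r"\bphosphorylat\w*", re.I),
    "BINDS": re.compile(r"\bbinds?\b|\bbinding\b", re.I),
}


def classify_edge(sentence, subj_end, obj_start, default="ASSOCIATED_WITH"):
    """Return the edge type whose trigger occurs first between subject and object.

    subj_end and obj_start are character offsets into `sentence`.
    Falls back to `default` (an assumed choice) when no trigger is found.
    """
    window = sentence[subj_end:obj_start]
    best = None  # (offset of trigger in window, edge type)
    for edge, pattern in EDGE_PATTERNS.items():
        match = pattern.search(window)
        if match and (best is None or match.start() < best[0]):
            best = (match.start(), edge)
    return best[1] if best else default
```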
03_hif1a_graph.py — Full single-gene graph (HIF1A)
Production pipeline for all 3,567 GeneRIF entries annotated to HIF1A (Entrez Gene 3091). Adds two components not in the prototype:
Multi-evidence confidence scoring (score_v2) assigns each triple a score in [0, 1]:
confidence = 0.25 · pred_score
+ 0.20 · obj_quality
+ 0.10 · subj_quality
+ 0.45 · evidence_score
evidence_score = 0.40 · support_breadth (log-scaled PMID count, plateau at 20)
+ 0.30 · assertion_quality (direct > neutral > hedged > negated)
+ 0.20 · entity_link_quality (both norms known > one > none)
+ 0.10 · temporal_spread (years spanned across supporting PMIDs)
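The weighted sums above translate directly into code. In this sketch the weights come from the formula, while the exact shape of `support_breadth` (a `log1p` ratio capped at 1.0 once 20 PMIDs are reached) is an assumption about what "log-scaled, plateau at 20" means. All component scores are expected in [0, 1].

```python
import math


def support_breadth(n_pmids, plateau=20):
    """Log-scaled PMID count that reaches 1.0 at `plateau` PMIDs (assumed form)."""
    if n_pmids <= 0:
        return 0.0
    return min(1.0, math.log1p(n_pmids) / math.log1p(plateau))


def evidence_score(breadth, assertion_quality, entity_link_quality, temporal_spread):
    """Combine the four evidence components with the documented weights."""
    return (0.40 * breadth
            + 0.30 * assertion_quality
            + 0.20 * entity_link_quality
            + 0.10 * temporal_spread)


def confidence(pred_score, obj_quality, subj_quality, evidence):
    """Combine the four top-level components with the documented weights."""
    return (0.25 * pred_score
            + 0.20 * obj_quality
            + 0.10 * subj_quality
            + 0.45 * evidence)
```

Both weight sets sum to 1.0, so the result stays in [0, 1] whenever every input does.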
Pair-wise relation extraction — the subject of each triple is the closest valid preceding entity in the sentence, not always the source gene. A sentence like:
"HIF1A promotes VEGF, which drives angiogenesis"
yields two triples:
HIF1A ──PROMOTES──► VEGF
VEGF ──PROMOTES──► angiogenesis
rather than the single, incorrect HIF1A ──PROMOTES──► angiogenesis produced by a fixed-subject model.
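The subject-selection rule can be sketched as below: for each relation trigger, take the closest valid entity before it as the subject and the closest entity after it as the object. The `Mention` and `Trigger` types, the set of valid subject types, and the object rule are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    start: int   # character offset in the sentence
    etype: str   # e.g. "Gene", "BiologicalProcess"


@dataclass
class Trigger:
    edge: str    # e.g. "PROMOTES"
    start: int


# Assumed definition of a "valid" subject; the real rule may differ.
VALID_SUBJECT_TYPES = {"Gene", "miRNA", "lncRNA", "Chemical"}


def pairwise_triples(mentions, triggers):
    """Pair each trigger with its nearest preceding subject and following object."""
    triples = []
    for trig in sorted(triggers, key=lambda t: t.start):
        before = [m for m in mentions
                  if m.start < trig.start and m.etype in VALID_SUBJECT_TYPES]
        after = [m for m in mentions if m.start > trig.start]
        if not before or not after:
            continue
        subj = max(before, key=lambda m: m.start)  # closest preceding entity
        obj = min(after, key=lambda m: m.start)    # closest following entity
        triples.append((subj.text, trig.edge, obj.text))
    return triples
```

Applied to the example sentence, the second trigger ("drives") picks VEGF rather than HIF1A as its subject because VEGF is the nearer preceding entity.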
Entity schema — nine typed node classes:
| Type | Source | Examples |
|---|---|---|
| Gene | regex + Entrez norm | HIF1A, VEGF, AKT, miR-210 |
| miRNA | regex | hsa-miR-210 |
| lncRNA | regex | H19, MALAT1 |
| Disease | BC5CDR + MeSH | breast cancer, hypoxia |
| Chemical | BC5CDR + MeSH | glucose, oxygen, doxorubicin |
| BiologicalProcess | regex | angiogenesis, glycolysis |
| Pathway | regex | PI3K/AKT, mTOR |
| Tissue | regex | liver, endothelium |
| SubcellularLocation | regex (noun forms only) | mitochondria, nucleus, lysosome |
Outputs: hif1a_triples.tsv, hif1a_graph.graphml, hif1a_graph.html (interactive), hif1a_graph_full.png, hif1a_graph_highconf.png
04_expand_graph.py — 1-hop neighbour expansion
Extends the HIF1A graph by loading GeneRIF entries for every Gene-type target of HIF1A that has a valid Entrez ID (e.g. VEGF, AKT, ROS). Triples from those entries are merged back with hop=2 and source_gene provenance attributes, revealing two-hop paths:
HIF1A ──PROMOTES──► VEGF ──PROMOTES──► angiogenesis
HIF1A ──PROMOTES──► VEGF ──ASSOCIATED_WITH──► colorectal cancer
HIF1A ──PROMOTES──► AKT ──PROMOTES──► proliferation
Outputs: expanded_triples.tsv, expanded_graph.graphml, expanded_graph.html
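The merge step can be sketched as follows. The function name, the dictionary layout of a merged triple, the `load_triples` callable, and the policy of dropping triples already seen are all assumptions; only the `hop` and `source_gene` attributes come from the description above.

```python
def expand_one_hop(seed_gene, seed_triples, node_types, entrez_ids, load_triples):
    """Merge hop-2 triples for every Gene-type target of `seed_gene` with an Entrez ID.

    seed_triples : list of (subj, edge, obj) tuples from the seed graph
    node_types   : dict mapping node name -> entity type (e.g. "Gene")
    entrez_ids   : dict mapping gene symbol -> Entrez Gene ID
    load_triples : callable returning (subj, edge, obj) tuples for one gene
    """
    merged = [
        {"subj": s, "edge": e, "obj": o, "hop": 1, "source_gene": seed_gene}
        for s, e, o in seed_triples
    ]
    seen = set(seed_triples)
    targets = sorted(
        {o for s, _, o in seed_triples
         if s == seed_gene and node_types.get(o) == "Gene" and o in entrez_ids}
    )
    for gene in targets:
        for triple in load_triples(gene):
            if triple in seen:  # assumed policy: keep the first occurrence only
                continue
            seen.add(triple)
            s, e, o = triple
            merged.append(
                {"subj": s, "edge": e, "obj": o, "hop": 2, "source_gene": gene}
            )
    return merged
```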
| Metric | Value | Note |
|---|---|---|
| Nodes | 2,286 | |
| Edges | 4,625 | unique (subj, edge, obj) triples |
| Graph type | directed multigraph | |
| Density | 0.00152 | sparse scale-free |
| Avg clustering coefficient | 0.1413 | ×108 vs fixed-subject model — pathway modules visible |
| Largest WCC | 2,215 / 2,286 (96.9%) | nearly fully connected |
| Largest SCC | 228 nodes (10%) | real regulatory cycles |
| Diameter | 8 | multi-hop paths exist |
| HIF1A betweenness | 0.098 | down from 0.947 in fixed-subject model |
| Bidirectional pairs | 86 | feedback loops |
Louvain partitioning yields four main biological communities, which together cover 1,468 of the 2,286 nodes (64%):
| Community | Size | Key hubs | Biological identity |
|---|---|---|---|
| C1 | 736 (32%) | HIF1A | Direct HIF1A interactome |
| C2 | 405 (18%) | VEGF, glycolysis, metastasis | Cancer / angiogenesis effector arm |
| C3 | 185 (8%) | HIF-1, oxygen, BNIP3, GLUT1 | HIF-1 complex / oxygen-sensing layer |
| C4 | 142 (6%) | hypoxia, PHD2, proline, iron | PHD2 hydroxylation / degradation pathway |
Every triple carries a single confidence score (above). The next modelling tier — claim-level confidence — aggregates across all edges sharing the same (subject, direction, object) using a Noisy-OR formula:
claim_conf = 1 − ∏(1 − edge_conf_i) for all edges in the same directional group
Predicates are grouped by direction: UP (PROMOTES, ACTIVATES, INDUCES, …), DOWN (INHIBITS, SUPPRESSES, CLEAVES, …), NEUTRAL (REGULATES, BINDS, ASSOCIATED_WITH, …). A (subject, object) pair that has both an UP and a DOWN claim is flagged context_dependent = True. Such pairs are kept as biologically meaningful rather than treated as errors: a gene can promote a disease while its inhibition treats it.
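A minimal sketch of the Noisy-OR aggregation and the context-dependence flag follows. The `DIRECTION` table lists only the predicates named above, and mapping any unlisted predicate to NEUTRAL is an assumption, as are the function names.

```python
import math
from collections import defaultdict

# Only the predicates named in the text; the full mapping is not shown there.
DIRECTION = {
    "PROMOTES": "UP", "ACTIVATES": "UP", "INDUCES": "UP",
    "INHIBITS": "DOWN", "SUPPRESSES": "DOWN", "CLEAVES": "DOWN",
    "REGULATES": "NEUTRAL", "BINDS": "NEUTRAL", "ASSOCIATED_WITH": "NEUTRAL",
}


def claim_confidence(edges):
    """Noisy-OR over edges sharing (subject, direction, object).

    edges: iterable of (subj, predicate, obj, edge_conf) tuples.
    Returns a dict mapping (subj, direction, obj) -> claim_conf.
    """
    groups = defaultdict(list)
    for subj, pred, obj, conf in edges:
        direction = DIRECTION.get(pred, "NEUTRAL")  # assumed fallback
        groups[(subj, direction, obj)].append(conf)
    return {
        key: 1.0 - math.prod(1.0 - c for c in confs)
        for key, confs in groups.items()
    }


def context_dependent_pairs(claims):
    """Return (subj, obj) pairs that have both an UP and a DOWN claim."""
    ups = {(s, o) for (s, d, o) in claims if d == "UP"}
    downs = {(s, o) for (s, d, o) in claims if d == "DOWN"}
    return ups & downs
```

Two independent edges at 0.5 each combine to 0.75, so converging evidence raises claim confidence without ever exceeding 1.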
See results/design_notes.md for the full specification.
The following analysis was produced by querying the HIF1A graph directly. No external sources were consulted; all claims trace to specific PMIDs in the GeneRIF corpus.
Graph answer (5-step programme):
| Step | Actor → Target | PMIDs | Sentence excerpt |
|---|---|---|---|
| 1 | HIF1A → NDUFA4L2 | 1 | "HIF-1alpha regulates NDUFA4L2 … under hypoxic conditions through mitochondrial NDUFA4L2" — suppresses Complex I |
| 2 | HIF1/HIF2 → mitochondria (–) | 1 | "CPT1A is repressed by HIF1 and HIF2, reducing fatty acid transport into the mitochondria, forcing fatty acids to lipid droplets" |
| 3 | HIF1A → BNIP3 (+) | 10 | BNIP3-mediated mitophagy; selective removal of damaged mitochondria |
| 4 | ROS → PI3K/ERK/PKC (+) | 3–8 | Mitochondrial ROS feed forward to maintain HIF1A stabilisation |
| 5 | miR-210 → hypoxia (+) | 7 | miR-210 extends and amplifies the hypoxic programme |
What the graph added:
- Ranked competing claims by PMID support — high-confidence edges separate well-replicated biology from single-study observations
- Captured the contested claim: "mitochondria-derived ROS are NOT necessary for oxygen sensing in HeLa cells" (PMID 12237125) as a negated edge, flagging it as a debated assertion rather than consensus
- Provided two-hop paths that are invisible in a HIF1A-only star graph: NDUFA4L2 → non-small cell lung cancer, ROS → PI3K → AKT
Gaps (nodes absent from the graph):
- ISCU and COX10 — known HIF1A-regulated ETC assembly factors — have no GeneRIF entries pointing to mitochondria
- Causal ordering and mechanistic depth require reasoning beyond edge existence
The three nodes do not form a closed triangle. They connect through HIF1A / HIF-1 as the central relay, with four mechanistic lines of evidence:
Line 1 — Hypoxia suppresses mitochondria
hypoxia ──(via HIF1/HIF2)──► CPT1A repression ──► mitochondria (–)
hypoxia ──(via HIF1A)──────► NDUFA4L2 induction ──► ETC Complex I (–)
"CPT1A is repressed by HIF1 and HIF2, reducing fatty acid transport into the mitochondria" — PMID 29176561, conf 0.64
"Mitochondria and HIFs are intimately connected to regulate each other resulting in appropriate responses to hypoxia" — PMID 20158574, conf 0.64 (bidirectional)
Line 2 — Hypoxia redirects glucose away from mitochondria (Warburg effect)
hypoxia ──► HIF-1 ──► miR-23a~27a~24 cluster ──► glucose flux away from TCA cycle
"HIF-1alpha-induced miR-23a~27a~24 cluster collectively regulates glucose metabolic flux" — PMID 30393198, conf 0.68 (glucose –REGULATES→ TCA)
Line 3 — Glucose availability modulates the hypoxic response (conditional switch)
Two opposing edges from the same experimental system (PMID 23538299):
glucose (present) ──PROMOTES──► HIF-1 ──PROMOTES──► BNIP3 ──► mitophagy
glucose (absent) ──INHIBITS──► HIF-1 ──► OXPHOS restored
"glucose availability significantly affects the hypoxia-induced HIF-1/BNIP3 response, and in particular glucose absence results in enhancing the oxidative phosphorylation rate" — PMID 23538299
A separate entry (PMID 18762723) corroborates the inhibitory arm:
"reduced availability of glucose under hypoxia downregulates HIF-1 in part through inhibition of HIF-1α mRNA translation"
The graph captures both edges with different assertion scores, correctly representing this as a glucose-concentration-dependent switch rather than a contradiction.
Line 4 — Contested: mitochondrial ROS as the oxygen sensor
mitochondria ──PROMOTES──► hypoxia [NEGATED, conf 0.52, PMID 12237125]
"mitochondria-derived ROS generated in response to hypoxia … are NOT necessary for oxygen sensing in HeLa cells"
A separate hedged edge (ZnO nanoparticle experiment, PMID 29622477) supports the opposite view: ETC Complex III → ROS → PHD inhibition → HIF-1 stabilisation. The graph surfaces this as a debated claim via contrasting assertion categories.
Summary diagram:
                 hypoxia
                    │
         (via HIF1A/HIF-1/HIF2)
                    │
          ┌─────────┼──────────────────────────┐
          ▼         ▼                          ▼
    mitochondria  glucose metabolism         BNIP3
    (–) CPT1A     (–) TCA flux            (mitophagy)
    (–) ETC       (+) GLUT1/LDHA               │
          ▲                                    ▼
          └─── glucose absent ──► OXPHOS  mitochondria removed
               glucose present ──► HIF-1 sustained ──► mitophagy
| File | Description |
|---|---|
| /data/generifs_basic.gz | NCBI GeneRIF — ~1 M entries, human + other species |
| /data/Homo_sapiens.gene_info.gz | NCBI gene_info — symbol and name for all human genes |
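Selecting the entries for one gene from the GeneRIF dump can be sketched as below. The column order (tax ID, gene ID, comma-separated PMID list, timestamp, text) is an assumption about the file layout and should be checked against the header line of the downloaded file; the function names are illustrative.

```python
import gzip


def parse_generif_lines(lines, gene_id, tax_id="9606"):
    """Yield (pmids, text) for one gene from tab-separated GeneRIF lines.

    Assumed columns: tax ID, gene ID, PMID list, last update, GeneRIF text.
    """
    for line in lines:
        if line.startswith("#"):  # skip the header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        tax, gene, pmids, _updated, text = fields[:5]
        if tax == tax_id and gene == gene_id:
            yield pmids.split(","), text


def load_generifs(path, gene_id, tax_id="9606"):
    """Read a gzipped GeneRIF file and return the entries for one gene."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return list(parse_generif_lines(handle, gene_id, tax_id))
```

For HIF1A this would be called as `load_generifs("/data/generifs_basic.gz", "3091")`.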
| Package | Purpose |
|---|---|
| scispacy + en_ner_bc5cdr_md | Biomedical NER (DISEASE, CHEMICAL) |
| scispacy MeSH linker | Entity normalisation to UMLS CUIs |
| networkx | Graph construction and topology analysis |
| pyvis | Interactive HTML graph visualisation |
| matplotlib | Static graph plots |
All scripts write to /results/ — the Code Ocean managed output folder published with the capsule. During active development, outputs can be copied to scratch/ for persistence across sessions (/results/ is cleared on each new run).
| File | Description |
|---|---|
| results/hif1a_triples.tsv | 4,625 triples with confidence scores and example sentences |
| results/hif1a_graph.graphml | Full graph, importable into Gephi / Cytoscape |
| results/hif1a_graph.html | Interactive browser visualisation |
| results/hif1a_graph_highconf.png | Static plot, confidence ≥ 0.70 edges only |
| results/HIF1A_topology_summary.md | Full network topology report |
| results/HIF1A_functional_summary.md | Biological function summary derived from the graph |
| results/design_notes.md | Claim-level confidence model and expansion design |
- Claim-level confidence — Noisy-OR aggregation across converging edges: claim_conf = 1 − ∏(1 − edge_conf_i), grouped by directional category (UP / DOWN / NEUTRAL). Specification in results/design_notes.md.
- Multi-gene expansion — run 04_expand_graph.py for all Gene-type HIF1A targets to reveal two-hop pathway structure at scale
- Full corpus graph — scale 03_hif1a_graph.py to all ~20,000 human genes in GeneRIF
- Cross-gene normalisation — coverage-adjusted confidence to account for well-studied vs under-studied genes