aion-labs/GeneRIF-Network

GeneRIF Knowledge Graph

A pipeline that converts NCBI GeneRIF free-text annotations into a structured, confidence-weighted knowledge graph of gene–disease–chemical–pathway relationships.


Background

GeneRIF (Gene Reference Into Function) is an NCBI database of ~1 million curator-submitted sentences, each linking a gene to a functional claim supported by one or more PubMed IDs. For example:

"HIF-1alpha induces VEGF expression and promotes angiogenesis in response to hypoxia." — Gene: HIF1A · PMID: 12345678

Each entry is short, focused, and tied to primary literature. Collectively, they form one of the densest curated summaries of gene function available without running full NLP on the primary literature.

This project converts those sentences into a directed, typed knowledge graph:

HIF1A  ──PROMOTES──►  VEGF           (90 PMIDs, conf 0.85)
HIF1A  ──INHIBITS──►  tumor          (9 PMIDs,  conf 0.86)
VEGF   ──PROMOTES──►  angiogenesis   (7 PMIDs,  conf 0.82)

Pipeline

Four scripts run sequentially. All data lives in /data/; all outputs go to /results/.

01_entity_analysis.py — Entity type survey

Scans the full GeneRIF corpus to characterise what biomedical entity types appear and in what proportions. Uses:

  • BC5CDR (scispaCy en_ner_bc5cdr_md) for DISEASE and CHEMICAL recognition
  • Regex patterns for genes, miRNAs, lncRNAs, mutations, pathways, biological processes, tissues, and subcellular locations

Output: entity_type_summary.tsv, entity_examples.tsv
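
The regex side of the survey can be sketched as below. The patterns are hypothetical stand-ins, since the actual patterns in 01_entity_analysis.py are not shown here, and they cover only three of the listed entity types.

```python
import re

# Hypothetical patterns for illustration; not the ones used by the script.
PATTERNS = {
    "miRNA": re.compile(r"\b(?:hsa-)?miR-\d+[a-z]?(?:-[35]p)?\b", re.IGNORECASE),
    # Protein-level substitutions such as V600E (one-letter amino acid codes).
    "Mutation": re.compile(r"\b[ACDEFGHIKLMNPQRSTVWY]\d+[ACDEFGHIKLMNPQRSTVWY]\b"),
    # Slash-joined names followed by a pathway keyword, e.g. "PI3K/AKT pathway".
    "Pathway": re.compile(
        r"\b[A-Za-z0-9]+(?:/[A-Za-z0-9]+)+\b(?=\s+(?:pathway|signaling|signalling))"
    ),
}


def tag_entities(sentence):
    """Return (entity_type, text, start, end) tuples sorted by position."""
    hits = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(sentence):
            hits.append((entity_type, match.group(0), match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])


example = "miR-210 targets the PI3K/AKT pathway in cells carrying the V600E mutation."
for entity_type, text, start, end in tag_entities(example):
    print(entity_type, text, start, end)
```

Counting the entity_type values over the whole corpus would give the kind of proportions reported in entity_type_summary.tsv.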


02_graph_extraction.py — Prototype triple extraction

Development scaffold. Runs steps 1–3 (NER → relation extraction → entity normalisation) on a sample of GeneRIF entries and writes example triples for inspection.

  • Relation extraction: 15 edge types classified by verb-window pattern matching (PROMOTES, INHIBITS, REGULATES, PHOSPHORYLATES, BINDS, ASSOCIATED_WITH, LOCALIZES_TO, …)
  • Entity normalisation: Entrez Gene IDs (NCBI gene_info), UMLS CUIs via MeSH linker (diseases and chemicals), standardised miRNA names

Output: graph_triples.tsv, graph_examples.tsv, normalization_examples.tsv
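
A minimal sketch of verb-window pattern matching, assuming the window is the text between the subject and the object. The trigger lexicon, the fallback edge, and the function name are illustrative assumptions covering only a few of the 15 edge types.

```python
import re

# Hypothetical trigger lexicon; order matters, the first match wins.
TRIGGERS = [
    ("INHIBITS", re.compile(r"\b(?:inhibit|suppress|repress)\w*", re.IGNORECASE)),
    ("PROMOTES", re.compile(r"\b(?:promot|induc|enhanc)\w*", re.IGNORECASE)),
    ("PHOSPHORYLATES", re.compile(r"\bphosphorylat\w*", re.IGNORECASE)),
    ("BINDS", re.compile(r"\b(?:binds?|bound)\b", re.IGNORECASE)),
    ("REGULATES", re.compile(r"\bregulat\w*", re.IGNORECASE)),
]
DEFAULT_EDGE = "ASSOCIATED_WITH"  # assumed fallback when no trigger is found


def classify_relation(sentence, subj_end, obj_start):
    """Classify the edge type from the text between subject and object."""
    window = sentence[subj_end:obj_start]
    for edge, pattern in TRIGGERS:
        if pattern.search(window):
            return edge
    return DEFAULT_EDGE


sentence = "HIF-1alpha induces VEGF expression"
edge = classify_relation(sentence, len("HIF-1alpha"), sentence.index("VEGF"))
print(edge)
```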


03_hif1a_graph.py — Full single-gene graph (HIF1A)

Production pipeline for all 3,567 GeneRIF entries annotated to HIF1A (Entrez Gene 3091). Adds two components not in the prototype:

Multi-evidence confidence scoring (score_v2) assigns each triple a score in [0, 1]:

confidence = 0.25 · pred_score
           + 0.20 · obj_quality
           + 0.10 · subj_quality
           + 0.45 · evidence_score

evidence_score = 0.40 · support_breadth   (log-scaled PMID count, plateau at 20)
               + 0.30 · assertion_quality  (direct > neutral > hedged > negated)
               + 0.20 · entity_link_quality (both norms known > one > none)
               + 0.10 · temporal_spread    (years spanned across supporting PMIDs)
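
The weighting above can be written out as a short sketch. The weights come from the formula. The log scaling of support_breadth, the numeric values for each assertion and linking category, and the assumption that temporal_spread is already normalised to [0, 1] are illustrative guesses, because only the ordering of the categories is stated.

```python
import math

# Hypothetical category values; only the ordering direct > neutral > hedged > negated is given.
ASSERTION_QUALITY = {"direct": 1.0, "neutral": 0.7, "hedged": 0.4, "negated": 0.1}
# Keyed by the number of endpoints with a known normalisation (both > one > none).
LINK_QUALITY = {2: 1.0, 1: 0.5, 0: 0.0}


def support_breadth(n_pmids, plateau=20):
    """Log-scaled PMID count that saturates at 1.0 from `plateau` PMIDs upward."""
    return min(1.0, math.log1p(n_pmids) / math.log1p(plateau))


def evidence_score(n_pmids, assertion, n_normalised, temporal_spread):
    """Weighted evidence term; temporal_spread is assumed to lie in [0, 1]."""
    return (0.40 * support_breadth(n_pmids)
            + 0.30 * ASSERTION_QUALITY[assertion]
            + 0.20 * LINK_QUALITY[n_normalised]
            + 0.10 * temporal_spread)


def confidence(pred_score, obj_quality, subj_quality, evidence):
    """Top-level triple confidence in [0, 1] when all inputs lie in [0, 1]."""
    return (0.25 * pred_score + 0.20 * obj_quality
            + 0.10 * subj_quality + 0.45 * evidence)


evidence = evidence_score(n_pmids=7, assertion="direct", n_normalised=2, temporal_spread=0.5)
print(round(confidence(0.9, 0.8, 1.0, evidence), 3))
```

Both weight sets sum to 1, so the score stays in [0, 1] as long as every input does.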

Pair-wise relation extraction — the subject of each triple is the closest valid preceding entity in the sentence, not always the source gene. A sentence like:

"HIF1A promotes VEGF, which drives angiogenesis"

yields two triples:

HIF1A ──PROMOTES──► VEGF
VEGF  ──PROMOTES──► angiogenesis

rather than the single, incorrect HIF1A ──PROMOTES──► angiogenesis produced by a fixed-subject model.
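
One way to sketch the subject selection, assuming "closest preceding" is measured relative to each predicate and that the object is the closest entity after it. The input layout, the set of valid subject types, and the function name are hypothetical.

```python
def pairwise_triples(entities, predicates, subject_types=("Gene", "miRNA", "lncRNA")):
    """Build one triple per predicate.

    entities:   list of (start_offset, text, entity_type)
    predicates: list of (start_offset, edge_type)
    """
    triples = []
    for pred_pos, edge in predicates:
        before = [e for e in entities if e[0] < pred_pos and e[2] in subject_types]
        after = [e for e in entities if e[0] > pred_pos]
        if before and after:
            subj = max(before, key=lambda e: e[0])  # closest valid entity to the left
            obj = min(after, key=lambda e: e[0])    # closest entity to the right
            triples.append((subj[1], edge, obj[1]))
    return triples


sentence = "HIF1A promotes VEGF, which drives angiogenesis"
entities = [
    (sentence.index("HIF1A"), "HIF1A", "Gene"),
    (sentence.index("VEGF"), "VEGF", "Gene"),
    (sentence.index("angiogenesis"), "angiogenesis", "BiologicalProcess"),
]
predicates = [
    (sentence.index("promotes"), "PROMOTES"),
    (sentence.index("drives"), "PROMOTES"),
]
print(pairwise_triples(entities, predicates))
```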

Entity schema — nine typed node classes:

| Type | Source | Examples |
| --- | --- | --- |
| Gene | regex + Entrez norm | HIF1A, VEGF, AKT, miR-210 |
| miRNA | regex | hsa-miR-210 |
| lncRNA | regex | H19, MALAT1 |
| Disease | BC5CDR + MeSH | breast cancer, hypoxia |
| Chemical | BC5CDR + MeSH | glucose, oxygen, doxorubicin |
| BiologicalProcess | regex | angiogenesis, glycolysis |
| Pathway | regex | PI3K/AKT, mTOR |
| Tissue | regex | liver, endothelium |
| SubcellularLocation | regex (noun forms only) | mitochondria, nucleus, lysosome |

Outputs: hif1a_triples.tsv, hif1a_graph.graphml, hif1a_graph.html (interactive), hif1a_graph_full.png, hif1a_graph_highconf.png


04_expand_graph.py — 1-hop neighbour expansion

Extends the HIF1A graph by loading GeneRIF entries for every Gene-type target of HIF1A that has a valid Entrez ID (e.g. VEGF, AKT, ROS). Triples from those entries are merged back with hop=2 and source_gene provenance attributes, revealing two-hop paths:

HIF1A ──PROMOTES──► VEGF ──PROMOTES──► angiogenesis
HIF1A ──PROMOTES──► VEGF ──ASSOCIATED_WITH──► colorectal cancer
HIF1A ──PROMOTES──► AKT  ──PROMOTES──► proliferation

Outputs: expanded_triples.tsv, expanded_graph.graphml, expanded_graph.html
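
The merge step might look like the sketch below. The dictionary keys (subj, edge, obj, obj_type, obj_entrez), the function name, and the de-duplication rule are assumptions made for illustration. Only the hop and source_gene attributes are named in the description above.

```python
def expand_one_hop(seed_triples, triples_by_gene, seed_gene="HIF1A"):
    """Merge triples of every Gene-type target (with an Entrez ID) into the seed graph.

    seed_triples:    list of dicts with subj, edge, obj, obj_type, obj_entrez
    triples_by_gene: mapping gene symbol -> triples extracted from its GeneRIF entries
    """
    merged = [dict(t, hop=1, source_gene=seed_gene) for t in seed_triples]
    seen = {(t["subj"], t["edge"], t["obj"]) for t in seed_triples}
    targets = sorted({t["obj"] for t in seed_triples
                      if t.get("obj_type") == "Gene" and t.get("obj_entrez")})
    for gene in targets:
        for t in triples_by_gene.get(gene, []):
            key = (t["subj"], t["edge"], t["obj"])
            if key not in seen:  # keep the first occurrence of each triple
                seen.add(key)
                merged.append(dict(t, hop=2, source_gene=gene))
    return merged


seed = [
    {"subj": "HIF1A", "edge": "PROMOTES", "obj": "VEGF",
     "obj_type": "Gene", "obj_entrez": 7422},
    {"subj": "HIF1A", "edge": "PROMOTES", "obj": "angiogenesis",
     "obj_type": "BiologicalProcess", "obj_entrez": None},
]
by_gene = {"VEGF": [{"subj": "VEGF", "edge": "PROMOTES", "obj": "angiogenesis"}]}
for triple in expand_one_hop(seed, by_gene):
    print(triple["hop"], triple["source_gene"], triple["subj"], triple["edge"], triple["obj"])
```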


Graph structure (HIF1A, v2)

| Metric | Value | Note |
| --- | --- | --- |
| Nodes | 2,286 | |
| Edges | 4,625 | unique (subj, edge, obj) triples |
| Graph type | directed MultiGraph | |
| Density | 0.00152 | sparse scale-free |
| Avg clustering coefficient | 0.1413 | ×108 vs fixed-subject model — pathway modules visible |
| Largest WCC | 2,215 / 2,286 (96.9%) | nearly fully connected |
| Largest SCC | 228 nodes (10%) | real regulatory cycles |
| Diameter | 8 | multi-hop paths exist |
| HIF1A betweenness | 0.098 | down from 0.947 in fixed-subject model |
| Bidirectional pairs | 86 | feedback loops |
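
Two of these metrics are simple enough to sketch with the standard library alone. The functions below are an illustration on toy triples, not the code that produced the numbers above. networkx offers equivalents such as weakly_connected_components.

```python
from collections import defaultdict


def bidirectional_pairs(edges):
    """Unordered node pairs {a, b} that are connected in both directions."""
    directed = {(s, o) for s, _, o in edges if s != o}
    return {frozenset((s, o)) for s, o in directed if (o, s) in directed}


def largest_weak_component(edges):
    """Size of the largest weakly connected component (edge direction ignored)."""
    adjacency = defaultdict(set)
    for s, _, o in edges:
        adjacency[s].add(o)
        adjacency[o].add(s)
    seen, best = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                stack.append(neighbour)
        best = max(best, size)
    return best


toy_edges = [
    ("HIF1A", "PROMOTES", "VEGF"),
    ("VEGF", "REGULATES", "HIF1A"),
    ("VEGF", "PROMOTES", "angiogenesis"),
    ("PHD2", "BINDS", "iron"),
]
print(len(bidirectional_pairs(toy_edges)), largest_weak_component(toy_edges))
```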

Louvain partitioning yields four main biological communities, which together cover 1,468 of the 2,286 nodes (64%):

| Community | Size | Key hubs | Biological identity |
| --- | --- | --- | --- |
| C1 | 736 (32%) | HIF1A | Direct HIF1A interactome |
| C2 | 405 (18%) | VEGF, glycolysis, metastasis | Cancer / angiogenesis effector arm |
| C3 | 185 (8%) | HIF-1, oxygen, BNIP3, GLUT1 | HIF-1 complex / oxygen-sensing layer |
| C4 | 142 (6%) | hypoxia, PHD2, proline, iron | PHD2 hydroxylation / degradation pathway |

Confidence model

Every triple carries a single confidence score (score_v2, above). The next modelling tier, claim-level confidence, is listed on the roadmap. It aggregates all edges sharing the same (subject, direction, object) with a Noisy-OR formula:

claim_conf = 1 − ∏(1 − edge_conf_i)   for all edges in the same directional group

Predicates are grouped by direction: UP (PROMOTES, ACTIVATES, INDUCES, …), DOWN (INHIBITS, SUPPRESSES, CLEAVES, …), NEUTRAL (REGULATES, BINDS, ASSOCIATED_WITH, …). Pairs where both UP and DOWN exist are flagged context_dependent = True — biologically meaningful (a gene can promote a disease while its inhibition treats it).
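
A minimal sketch of the Noisy-OR aggregation and the context_dependent flag. The direction map lists only the predicates named above, and treating any unlisted predicate as NEUTRAL is an assumption. The full specification is in results/design_notes.md.

```python
from collections import defaultdict
from math import prod

# Only the predicates named in this README; the full grouping is longer.
DIRECTION = {
    "PROMOTES": "UP", "ACTIVATES": "UP", "INDUCES": "UP",
    "INHIBITS": "DOWN", "SUPPRESSES": "DOWN", "CLEAVES": "DOWN",
    "REGULATES": "NEUTRAL", "BINDS": "NEUTRAL", "ASSOCIATED_WITH": "NEUTRAL",
}


def claim_confidence(edges):
    """Aggregate (subj, predicate, obj, edge_conf) tuples into claim-level scores.

    Returns ({(subj, direction, obj): claim_conf}, {(subj, obj) with UP and DOWN}).
    """
    groups = defaultdict(list)
    for subj, predicate, obj, conf in edges:
        # Assumption: predicates missing from the map are treated as NEUTRAL.
        direction = DIRECTION.get(predicate, "NEUTRAL")
        groups[(subj, direction, obj)].append(conf)
    claims = {key: 1.0 - prod(1.0 - c for c in confs) for key, confs in groups.items()}
    context_dependent = {(s, o) for (s, d, o) in claims
                         if d == "UP" and (s, "DOWN", o) in claims}
    return claims, context_dependent


edges = [
    ("HIF1A", "PROMOTES", "VEGF", 0.5),
    ("HIF1A", "INDUCES", "VEGF", 0.5),
    ("HIF1A", "INHIBITS", "VEGF", 0.4),
]
claims, flagged = claim_confidence(edges)
print(claims, flagged)
```

Two UP edges at 0.5 combine to 1 − 0.5 · 0.5 = 0.75, so converging evidence raises the claim score without ever exceeding 1.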

See results/design_notes.md for the full specification.


Use case — mitochondria, glucose, and hypoxia

The following analysis was produced by querying the HIF1A graph directly. No external sources were consulted; all claims trace to specific PMIDs in the GeneRIF corpus.

Question 1: What happens to mitochondria when HIF1 is stabilised?

Graph answer (5-step programme):

| Step | Actor → Target | PMIDs | Sentence excerpt |
| --- | --- | --- | --- |
| 1 | HIF1A → NDUFA4L2 | 1 | "HIF-1alpha regulates NDUFA4L2 … under hypoxic conditions through mitochondrial NDUFA4L2" — suppresses Complex I |
| 2 | HIF1/HIF2 → mitochondria (–) | 1 | "CPT1A is repressed by HIF1 and HIF2, reducing fatty acid transport into the mitochondria, forcing fatty acids to lipid droplets" |
| 3 | HIF1A → BNIP3 (+) | 10 | BNIP3-mediated mitophagy; selective removal of damaged mitochondria |
| 4 | ROS → PI3K/ERK/PKC (+) | 3–8 | Mitochondrial ROS feed forward to maintain HIF1A stabilisation |
| 5 | miR-210 → hypoxia (+) | 7 | miR-210 extends and amplifies the hypoxic programme |

What the graph added:

  • Ranked competing claims by PMID support — high-confidence edges separate well-replicated biology from single-study observations
  • Captured the contested claim: "mitochondria-derived ROS are NOT necessary for oxygen sensing in HeLa cells" (PMID 12237125) as a negated edge, flagging it as a debated assertion rather than consensus
  • Provided two-hop paths that are invisible in a HIF1A-only star graph: NDUFA4L2 → non-small cell lung cancer, ROS → PI3K → AKT

Gaps (nodes absent from the graph):

  • ISCU and COX10 — known HIF1A-regulated ETC assembly factors — have no GeneRIF entries pointing to mitochondria
  • Causal ordering and mechanistic depth require reasoning beyond edge existence
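
Two-hop paths of the kind mentioned above can be enumerated with a few lines of Python. This is a sketch over a plain list of (subject, edge, object) tuples with a hypothetical function name, not the query code used for the analysis.

```python
from collections import defaultdict


def two_hop_paths(triples, source):
    """Return (source, edge1, mid, edge2, target) for every two-hop path from source."""
    outgoing = defaultdict(list)
    for subj, edge, obj in triples:
        outgoing[subj].append((edge, obj))
    paths = []
    for edge1, mid in outgoing.get(source, []):
        for edge2, target in outgoing.get(mid, []):
            if target != source:  # skip paths that loop straight back
                paths.append((source, edge1, mid, edge2, target))
    return paths


triples = [
    ("HIF1A", "PROMOTES", "VEGF"),
    ("VEGF", "PROMOTES", "angiogenesis"),
    ("VEGF", "ASSOCIATED_WITH", "colorectal cancer"),
    ("VEGF", "REGULATES", "HIF1A"),
]
for path in two_hop_paths(triples, "HIF1A"):
    print(" -> ".join(path))
```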

Question 2: What is the association between mitochondria, glucose, and hypoxia?

The three nodes do not form a closed triangle. They connect through HIF1A / HIF-1 as the central relay, with four mechanistic lines of evidence:

Line 1 — Hypoxia suppresses mitochondria

hypoxia ──(via HIF1/HIF2)──► CPT1A repression ──► mitochondria (–)
hypoxia ──(via HIF1A)──────► NDUFA4L2 induction ──► ETC Complex I (–)

"CPT1A is repressed by HIF1 and HIF2, reducing fatty acid transport into the mitochondria" — PMID 29176561, conf 0.64

"Mitochondria and HIFs are intimately connected to regulate each other resulting in appropriate responses to hypoxia" — PMID 20158574, conf 0.64 (bidirectional)

Line 2 — Hypoxia redirects glucose away from mitochondria (Warburg effect)

hypoxia ──► HIF-1 ──► miR-23a~27a~24 cluster ──► glucose flux away from TCA cycle

"HIF-1alpha-induced miR-23a~27a~24 cluster collectively regulates glucose metabolic flux" — PMID 30393198, conf 0.68 (glucose –REGULATES→ TCA)

Line 3 — Glucose availability modulates the hypoxic response (conditional switch)

Two opposing edges from the same experimental system (PMID 23538299):

glucose (present)  ──PROMOTES──► HIF-1  ──PROMOTES──► BNIP3  ──► mitophagy
glucose (absent)   ──INHIBITS──► HIF-1                       ──► OXPHOS restored

"glucose availability significantly affects the hypoxia-induced HIF-1/BNIP3 response, and in particular glucose absence results in enhancing the oxidative phosphorylation rate" — PMID 23538299

A separate entry (PMID 18762723) corroborates the inhibitory arm:

"reduced availability of glucose under hypoxia downregulates HIF-1 in part through inhibition of HIF-1α mRNA translation"

The graph captures both edges with different assertion scores, correctly representing this as a glucose-concentration-dependent switch rather than a contradiction.

Line 4 — Contested: mitochondrial ROS as the oxygen sensor

mitochondria ──PROMOTES──► hypoxia   [NEGATED, conf 0.52, PMID 12237125]

"mitochondria-derived ROS generated in response to hypoxia … are NOT necessary for oxygen sensing in HeLa cells"

A separate hedged edge (ZnO nanoparticle experiment, PMID 29622477) supports the opposite view: ETC Complex III → ROS → PHD inhibition → HIF-1 stabilisation. The graph surfaces this as a debated claim via contrasting assertion categories.

Summary diagram:

          hypoxia
             │
      (via HIF1A/HIF-1/HIF2)
             │
   ┌─────────┼──────────────────────────┐
   ▼         ▼                          ▼
mitochondria glucose metabolism     BNIP3
(–) CPT1A  (–) TCA flux          (mitophagy)
(–) ETC    (+) GLUT1/LDHA             │
   ▲                                   ▼
   └─── glucose absent ──► OXPHOS  mitochondria removed
        glucose present ──► HIF-1 sustained ──► mitophagy

Data sources

| File | Description |
| --- | --- |
| /data/generifs_basic.gz | NCBI GeneRIF — ~1 M entries, human + other species |
| /data/Homo_sapiens.gene_info.gz | NCBI gene_info — symbol and name for all human genes |
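
A hedged sketch of reading the GeneRIF file. It assumes the five tab-separated columns of generifs_basic (tax ID, gene ID, comma-separated PMID list, last update timestamp, GeneRIF text) and a header line starting with "#". Check the header of your download before relying on it. The function names are hypothetical and the sample rows use placeholder PMIDs.

```python
import csv
import gzip


def read_generifs(lines, tax_id="9606", gene_id=None):
    """Yield one dict per GeneRIF row for the given taxon (and gene, if set)."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#") or len(row) < 5:
            continue  # skip the header, blank lines, and malformed rows
        tax, gene, pmids, _updated, text = row[:5]
        if tax != tax_id or (gene_id is not None and gene != gene_id):
            continue
        yield {"gene_id": gene,
               "pmids": [p for p in pmids.split(",") if p],
               "text": text}


def read_generifs_file(path="/data/generifs_basic.gz", **filters):
    """Stream rows from the gzipped file without loading it all into memory."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from read_generifs(handle, **filters)


# Placeholder rows for illustration; the PMIDs are not real references.
sample = [
    "#Tax ID\tGene ID\tPubMed ID (PMID) list\tlast update timestamp\tGeneRIF text",
    "9606\t3091\t111,222\t2020-01-01 00:00\tHIF-1alpha induces VEGF expression.",
    "10090\t15251\t333\t2020-01-01 00:00\tA mouse entry.",
]
print(list(read_generifs(sample, gene_id="3091")))
```

Calling read_generifs_file(gene_id="3091") would then stream the human HIF1A entries from /data/.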

Dependencies

| Package | Purpose |
| --- | --- |
| scispacy + en_ner_bc5cdr_md | Biomedical NER (DISEASE, CHEMICAL) |
| scispacy MeSH linker | Entity normalisation to UMLS CUIs |
| networkx | Graph construction and topology analysis |
| pyvis | Interactive HTML graph visualisation |
| matplotlib | Static graph plots |

Outputs

All scripts write to /results/ — the Code Ocean managed output folder published with the capsule. During active development, outputs can be copied to scratch/ for persistence across sessions (/results/ is cleared on each new run).

| File | Description |
| --- | --- |
| results/hif1a_triples.tsv | 4,625 triples with confidence scores and example sentences |
| results/hif1a_graph.graphml | Full graph, importable into Gephi / Cytoscape |
| results/hif1a_graph.html | Interactive browser visualisation |
| results/hif1a_graph_highconf.png | Static plot, confidence ≥ 0.70 edges only |
| results/HIF1A_topology_summary.md | Full network topology report |
| results/HIF1A_functional_summary.md | Biological function summary derived from the graph |
| results/design_notes.md | Claim-level confidence model and expansion design |
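
To reproduce the ≥ 0.70 cut-off used for the high-confidence plot from the triples file, something like the following may work. The column names (subj, edge, obj, confidence) are hypothetical, so adjust them to the actual header of hif1a_triples.tsv.

```python
import csv
import io


def high_confidence(tsv_handle, threshold=0.70, column="confidence"):
    """Return the rows whose confidence column is at least `threshold`."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [row for row in reader if float(row[column]) >= threshold]


# In-memory stand-in for open("results/hif1a_triples.tsv", newline="").
demo = io.StringIO(
    "subj\tedge\tobj\tconfidence\n"
    "HIF1A\tPROMOTES\tVEGF\t0.85\n"
    "mitochondria\tPROMOTES\thypoxia\t0.52\n"
)
for row in high_confidence(demo):
    print(row["subj"], row["edge"], row["obj"], row["confidence"])
```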

Roadmap

  • Claim-level confidence — Noisy-OR aggregation across converging edges: claim_conf = 1 − ∏(1 − edge_conf_i), grouped by directional category (UP / DOWN / NEUTRAL). Specification in results/design_notes.md.
  • Multi-gene expansion — run 04_expand_graph.py for all Gene-type HIF1A targets to reveal two-hop pathway structure at scale
  • Full corpus graph — scale 03_hif1a_graph.py to all ~20,000 human genes in GeneRIF
  • Cross-gene normalisation — coverage-adjusted confidence to account for well-studied vs under-studied genes

About

creating a graph for AI analysis from GeneRIF claims
