Normalize PubtatorNDD gene counts to filter research popularity bias #175

@berntpopp

Problem

The PubtatorNDD gene prioritization table ranks genes by raw publication co-occurrence count with NDD terms. This conflates true NDD relevance with research popularity bias — heavily-studied genes like TP53 (282K total pubs), APP (125K), MAPT (73K), and APOE (69K) appear in the top 10 despite having no specific NDD role.

Current top-10 false positive rate: ~40% (4 of 10 genes are noise).

Evidence

By querying PubTator for each gene's total publication count, we see clear separation:

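| Gene | NDD Pubs | Total Pubs | NDD/Total Ratio | True NDD? |
|--------|----------|------------|-----------------|-----------|
| GRIN2B | 86 | 13,459 | 0.639% | Yes |
| SCN1A | 20 | 5,238 | 0.382% | Yes |
| MECP2 | 16 | 10,677 | 0.150% | Yes |
| TP53 | 8 | 282,103 | 0.003% | No |
| APP | 8 | 124,598 | 0.006% | No |
| ALB | 3 | 269,569 | 0.001% | No |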

True NDD genes show 0.1–0.6% NDD/Total ratio; noise genes show 0.001–0.01% — two orders of magnitude difference.
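The ratios in the table can be reproduced with a few lines of arithmetic. The sketch below uses only the six rows quoted above; the function and variable names are illustrative and not taken from the codebase.

```python
# Rows copied from the Evidence table: (gene, NDD pubs, total pubs, true NDD gene?)
ROWS = [
    ("GRIN2B", 86, 13_459, True),
    ("SCN1A", 20, 5_238, True),
    ("MECP2", 16, 10_677, True),
    ("TP53", 8, 282_103, False),
    ("APP", 8, 124_598, False),
    ("ALB", 3, 269_569, False),
]


def ndd_ratio_percent(ndd_pubs: int, total_pubs: int) -> float:
    """Share of a gene's publications that co-occur with NDD terms, in percent."""
    return 100.0 * ndd_pubs / total_pubs


ratios = {gene: ndd_ratio_percent(ndd, total) for gene, ndd, total, _ in ROWS}

# Smallest ratio among true NDD genes versus largest ratio among noise genes.
weakest_true = min(ratios[g] for g, _, _, is_ndd in ROWS if is_ndd)
strongest_noise = max(ratios[g] for g, _, _, is_ndd in ROWS if not is_ndd)
separation = weakest_true / strongest_noise

if __name__ == "__main__":
    for gene, pct in ratios.items():
        print(f"{gene:7s} {pct:.3f}%")
    print(f"worst-case separation: {separation:.1f}x")
```

Note that the worst-case gap in this sample (MECP2 versus APP) is roughly 23-fold, narrower than the gap between the strongest true gene and the weakest noise gene, so any threshold on the raw ratio should be validated on more than six genes.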

Proposed Solution

Phase 1: Background Count Collection

  • For each gene in pubtator_human_gene_entity_view, query the PubTator API for its total publication count: GET https://www.ncbi.nlm.nih.gov/research/pubtator3-api/search/?text=@GENE_{SYMBOL}&page=1, then read response.count
  • Cache in a new DB column or lookup table
  • Also store NDD corpus size (@DISEASE_neurodevelopmental search count)
  • Refresh monthly
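A minimal sketch of the collection step, assuming the search endpoint returns JSON with a top-level `count` field as described above. The helper names (`build_search_url`, `parse_count`, `fetch_count`, `collect_background_counts`) are hypothetical, and caching, retries, and any request throttling the API may require are left out.

```python
import json
import urllib.parse
import urllib.request

# Endpoint as quoted in this issue.
PUBTATOR_SEARCH_URL = "https://www.ncbi.nlm.nih.gov/research/pubtator3-api/search/"


def build_search_url(query: str, page: int = 1) -> str:
    """Build the search URL; urlencode percent-encodes the leading '@'."""
    params = urllib.parse.urlencode({"text": query, "page": page})
    return f"{PUBTATOR_SEARCH_URL}?{params}"


def parse_count(payload: str) -> int:
    """Extract the total hit count (assumed to be the top-level 'count' field)."""
    return int(json.loads(payload)["count"])


def fetch_count(query: str, timeout: float = 30.0) -> int:
    """Run one search request and return its total hit count."""
    with urllib.request.urlopen(build_search_url(query), timeout=timeout) as resp:
        return parse_count(resp.read().decode("utf-8"))


def collect_background_counts(symbols, fetch=fetch_count) -> dict:
    """Map each gene symbol to its total publication count.

    `fetch` is injectable so the loop can be tested without network access.
    """
    return {symbol: fetch(f"@GENE_{symbol}") for symbol in symbols}
```

The same `fetch_count` call with the `@DISEASE_neurodevelopmental` query would yield the NDD corpus size mentioned in the list above. Keeping the fetcher injectable makes it straightforward to swap in a cached or rate-limited client later.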

Phase 2: Compute Normalized Scores

For each gene, compute:

  • Enrichment Ratio: observed / (NDD_corpus × bg_count / total_corpus) — simple ranking
  • NPMI (Normalized Pointwise Mutual Information): range [-1, 1], inherently normalizes for gene popularity
  • Fisher's exact test p-value + Benjamini-Hochberg FDR correction — statistical significance
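The three scores above could be computed along the following lines. This is a sketch under stated assumptions, not the planned implementation: it uses only the standard library, treats Fisher's exact test as a one-sided (enrichment) hypergeometric tail, and takes `total_corpus` (the total number of indexed publications) as an input that this issue does not specify. All function names are illustrative.

```python
import math


def enrichment_ratio(observed, bg_count, ndd_corpus, total_corpus):
    """observed / expected, where expected = NDD_corpus * bg_count / total_corpus."""
    expected = ndd_corpus * bg_count / total_corpus
    return observed / expected


def npmi(observed, bg_count, ndd_corpus, total_corpus):
    """Normalized pointwise mutual information in [-1, 1]."""
    if observed == 0:
        return -1.0  # limit value when gene and NDD terms never co-occur
    p_xy = observed / total_corpus
    if p_xy == 1.0:
        return 1.0  # degenerate case: every publication contains both
    p_x = bg_count / total_corpus
    p_y = ndd_corpus / total_corpus
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)


def _log_comb(n, k):
    """log of the binomial coefficient C(n, k), via lgamma to avoid huge integers."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_enrichment_p(observed, bg_count, ndd_corpus, total_corpus):
    """One-sided p-value P(X >= observed) for X ~ Hypergeometric(N, K, n)."""
    log_denom = _log_comb(total_corpus, ndd_corpus)
    p = 0.0
    for k in range(observed, min(bg_count, ndd_corpus) + 1):
        p += math.exp(
            _log_comb(bg_count, k)
            + _log_comb(total_corpus - bg_count, ndd_corpus - k)
            - log_denom
        )
    return min(p, 1.0)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A gene whose co-occurrence count matches chance gets an enrichment ratio of 1 and an NPMI of 0; popularity-biased genes fall below those values. In production, a vetted library routine (for example SciPy's Fisher and hypergeometric functions) may be preferable to the hand-rolled tail sum.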

Phase 3: Frontend Integration

  • Add columns to PubtatorNDDGenes table: Background Pubs, Enrichment Ratio (color-coded), FDR (significance stars)
  • Add "Top Genes by Enrichment Score" chart mode to PubtatorNDDStats
  • Consider a volcano plot: Enrichment Ratio (x) vs -log₁₀(FDR) (y)
  • Change default sort from publication_count to enrichment_ratio

Phase 4: Composite Score (Future)

  • Combine NPMI + Fisher + enrichment into a single "NDD Association Confidence" score (0–5 stars), following the DISEASES database approach (Jensen Lab)

Expected Impact

  • Before: Top-10 noise rate ~40%
  • After: Top-10 noise rate ~0% — all popularity-biased genes drop to negative NPMI scores

Key References

  • Stoeger et al. (2018). "Large-scale investigation of the reasons why potentially important genes are ignored." PLOS Biology
  • Pletscher-Frankild et al. (2015). "DISEASES: Text mining and data integration of disease-gene associations." Methods 74:83-89
  • Junge & Jensen (2020). "CoCoScore: Context-aware co-occurrence scoring for text mining applications using distant supervision." Bioinformatics 36(1):264-271

Design Document

See .planning/pubtator-gene-normalization-report.md for full analysis with formulas, worked examples, and implementation details.

Metadata

Labels: enhancement (New feature or request)
