Open
Labels: enhancement (New feature or request)
Description
Problem
The PubtatorNDD gene prioritization table ranks genes by raw publication co-occurrence count with NDD terms. This conflates true NDD relevance with research popularity bias — heavily-studied genes like TP53 (282K total pubs), APP (125K), MAPT (73K), and APOE (69K) appear in the top 10 despite having no specific NDD role.
Current top-10 false positive rate: ~40% (4 of 10 genes are noise).
Evidence
By querying PubTator for each gene's total publication count, we see clear separation:
| Gene | NDD Pubs | Total Pubs | NDD/Total Ratio | True NDD? |
|---|---|---|---|---|
| GRIN2B | 86 | 13,459 | 0.639% | Yes |
| SCN1A | 20 | 5,238 | 0.382% | Yes |
| MECP2 | 16 | 10,677 | 0.150% | Yes |
| TP53 | 8 | 282,103 | 0.003% | No |
| APP | 8 | 124,598 | 0.006% | No |
| ALB | 3 | 269,569 | 0.001% | No |
True NDD genes show an NDD/Total ratio of 0.1–0.6%, whereas noise genes show 0.001–0.01%: a difference of roughly two orders of magnitude.
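The ratio column can be reproduced directly from the two count columns. The following is a minimal sketch that uses only the six rows from the table above; the function and variable names are illustrative and not part of the existing codebase.

```python
# (gene, NDD pubs, total pubs), copied from the evidence table above.
rows = [
    ("GRIN2B", 86, 13_459),
    ("SCN1A", 20, 5_238),
    ("MECP2", 16, 10_677),
    ("TP53", 8, 282_103),
    ("APP", 8, 124_598),
    ("ALB", 3, 269_569),
]


def ndd_ratio_percent(ndd_pubs, total_pubs):
    """Share of a gene's publications that co-occur with NDD terms, in percent."""
    return 100.0 * ndd_pubs / total_pubs


ratios = {gene: ndd_ratio_percent(ndd, total) for gene, ndd, total in rows}

# Rank genes by ratio instead of by raw NDD publication count.
ranked = sorted(ratios, key=ratios.get, reverse=True)

for gene in ranked:
    print(f"{gene:7s} {ratios[gene]:.3f}%")
```

Sorting by this ratio already moves TP53, APP, and ALB below the three true NDD genes in this small sample, which is the separation the normalized scores below are meant to formalize.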
Proposed Solution
Phase 1: Background Count Collection
- For each gene in `pubtator_human_gene_entity_view`, query PubTator API for total publication count: `GET https://www.ncbi.nlm.nih.gov/research/pubtator3-api/search/?text=@GENE_{SYMBOL}&page=1` → `response.count`
- Cache in a new DB column or lookup table
- Also store NDD corpus size (`@DISEASE_neurodevelopmental` search count)
- Refresh monthly
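A rough sketch of the collection step, using only the Python standard library. The endpoint, the `@GENE_{SYMBOL}` query form, and the `count` response field are taken from the description above; the helper names, the injectable `get_json` parameter, and the `delay` between requests are assumptions made for illustration, and the actual response shape should be checked against the live API before relying on it.

```python
import json
import time
from urllib.parse import quote
from urllib.request import urlopen

PUBTATOR_SEARCH = "https://www.ncbi.nlm.nih.gov/research/pubtator3-api/search/"


def build_search_url(query, page=1):
    """Build the search URL; '@' is kept literal to match the documented query form."""
    return f"{PUBTATOR_SEARCH}?text={quote(query, safe='@')}&page={page}"


def http_get_json(url, timeout=30):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def fetch_count(query, get_json=http_get_json):
    """Return the `count` field of the search response for `query`."""
    return int(get_json(build_search_url(query))["count"])


def collect_background_counts(symbols, get_json=http_get_json, delay=0.5):
    """Map each gene symbol to its total publication count.

    `delay` (seconds between requests) is an assumed politeness setting,
    not a documented requirement.
    """
    counts = {}
    for symbol in symbols:
        counts[symbol] = fetch_count(f"@GENE_{symbol}", get_json)
        time.sleep(delay)
    return counts
```

Calling `collect_background_counts(["GRIN2B", "TP53"])` would issue one request per gene, and `fetch_count("@DISEASE_neurodevelopmental")` would give the NDD corpus size. Passing `get_json` as a parameter keeps the function testable without network access; the results would then be written to the cache column or lookup table.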
Phase 2: Compute Normalized Scores
For each gene, compute:
- Enrichment Ratio: `observed / (NDD_corpus × bg_count / total_corpus)`, for simple ranking
- NPMI (Normalized Pointwise Mutual Information): range [-1, 1], inherently normalizes for gene popularity
- Fisher's exact test p-value + Benjamini-Hochberg FDR correction, for statistical significance
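The three scores can be sketched in plain Python as follows. Here `observed` is the gene's NDD co-occurrence count and `bg_count` its total publication count. Several points are assumptions rather than decisions stated in this issue: the Fisher test is computed one-sided (enrichment only) from the hypergeometric tail, NPMI is set to -1 when there is no co-occurrence, and the corpus sizes `NDD_CORPUS` and `TOTAL_CORPUS` are hypothetical placeholders, not measured values.

```python
import math


def enrichment_ratio(observed, bg_count, ndd_corpus, total_corpus):
    """observed / expected, with expected = NDD_corpus * bg_count / total_corpus."""
    return observed / (ndd_corpus * bg_count / total_corpus)


def npmi(observed, bg_count, ndd_corpus, total_corpus):
    """Normalized PMI in [-1, 1]; -1 by convention when there is no co-occurrence."""
    if observed == 0:
        return -1.0
    if observed == total_corpus:  # degenerate case: every paper mentions both
        return 1.0
    p_xy = observed / total_corpus
    p_x = bg_count / total_corpus
    p_y = ndd_corpus / total_corpus
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)


def _log_comb(n, k):
    """log of the binomial coefficient C(n, k), stable for large n."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_greater(observed, bg_count, ndd_corpus, total_corpus):
    """One-sided p-value P(X >= observed) under the hypergeometric null."""
    log_denom = _log_comb(total_corpus, ndd_corpus)
    p = 0.0
    for i in range(observed, min(bg_count, ndd_corpus) + 1):
        p += math.exp(
            _log_comb(bg_count, i)
            + _log_comb(total_corpus - bg_count, ndd_corpus - i)
            - log_denom
        )
    return min(p, 1.0)


def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values (FDR) in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from largest to smallest p-value
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# HYPOTHETICAL corpus sizes, for illustration only (not measured values).
NDD_CORPUS, TOTAL_CORPUS = 20_000, 36_000_000

# (gene, observed NDD pubs, background pubs) from the evidence table.
genes = [("GRIN2B", 86, 13_459), ("SCN1A", 20, 5_238), ("MECP2", 16, 10_677),
         ("TP53", 8, 282_103), ("APP", 8, 124_598), ("ALB", 3, 269_569)]

pvals = [fisher_exact_greater(k, g, NDD_CORPUS, TOTAL_CORPUS) for _, k, g in genes]
fdrs = benjamini_hochberg(pvals)
results = {
    name: {
        "enrichment_ratio": enrichment_ratio(k, g, NDD_CORPUS, TOTAL_CORPUS),
        "npmi": npmi(k, g, NDD_CORPUS, TOTAL_CORPUS),
        "fdr": fdr,
    }
    for (name, k, g), fdr in zip(genes, fdrs)
}
```

Under these placeholder corpus sizes the three true NDD genes get a positive NPMI and the three popular genes a negative one, but the actual values depend on the real corpus counts collected in Phase 1. In production, a library routine such as SciPy's `fisher_exact` may be preferable to the hand-rolled tail sum.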
Phase 3: Frontend Integration
- Add columns to PubtatorNDDGenes table: Background Pubs, Enrichment Ratio (color-coded), FDR (significance stars)
- Add "Top Genes by Enrichment Score" chart mode to PubtatorNDDStats
- Consider a volcano plot: Enrichment Ratio (x) vs -log₁₀(FDR) (y)
- Change default sort from `publication_count` to `enrichment_ratio`
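The display logic behind these items is small enough to sketch. The example below is written in Python for consistency with the other sketches, although the real frontend would implement it in its own language. The star thresholds (0.05, 0.01, 0.001) are a common convention assumed here, the row fields other than `publication_count` and `enrichment_ratio` are hypothetical, and the sample rows are placeholder values rather than real results.

```python
import math


def significance_stars(fdr):
    """Map an FDR value to stars using conventional (assumed) thresholds."""
    if fdr < 0.001:
        return "***"
    if fdr < 0.01:
        return "**"
    if fdr < 0.05:
        return "*"
    return ""


def volcano_point(enrichment_ratio, fdr, min_fdr=1e-300):
    """(x, y) = (enrichment ratio, -log10(FDR)); FDR is floored to avoid log10(0)."""
    return enrichment_ratio, -math.log10(max(fdr, min_fdr))


def sort_for_display(rows):
    """New default sort: highest enrichment ratio first."""
    return sorted(rows, key=lambda r: r["enrichment_ratio"], reverse=True)


# Placeholder rows with made-up values, for illustration only.
rows = [
    {"gene": "GENE_A", "publication_count": 8, "enrichment_ratio": 0.05, "fdr": 1.0},
    {"gene": "GENE_B", "publication_count": 5, "enrichment_ratio": 9.0, "fdr": 0.0004},
    {"gene": "GENE_C", "publication_count": 6, "enrichment_ratio": 2.5, "fdr": 0.03},
]

for row in sort_for_display(rows):
    x, y = volcano_point(row["enrichment_ratio"], row["fdr"])
    print(row["gene"], significance_stars(row["fdr"]), round(x, 2), round(y, 2))
```

A log-scaled x axis is a common choice for volcano plots and could be applied at the charting layer without changing these values.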
Phase 4: Composite Score (Future)
- Combine NPMI + Fisher + enrichment into a single "NDD Association Confidence" score (0–5 stars), following the DISEASES database approach (Jensen Lab)
Expected Impact
- Before: Top-10 noise rate ~40%
- After: Top-10 noise rate ~0% — all popularity-biased genes drop to negative NPMI scores
Key References
- Stoeger et al. (2018). "Large-scale investigation of the reasons why potentially important genes are ignored." PLOS Biology
- Pletscher-Frankild et al. (2015). "DISEASES: Text mining and data integration." Methods 74:83-89
- Junge & Jensen (2020). "CoCoScore: Context-aware co-occurrence scoring." Bioinformatics 36(1):264-271
Design Document
See .planning/pubtator-gene-normalization-report.md for full analysis with formulas, worked examples, and implementation details.