feat(rvdna): add health biomarker analysis engine with streaming simulation #199
Merged
…lation

Implement ADR-014 Health Biomarker Analysis Architecture:

- biomarker.rs: Composite risk scoring engine with 17-SNP weight matrix, gene-gene interaction modifiers (COMT×OPRM1, MTHFR compound, BRCA1×TP53), 64-dim HNSW-aligned profile vectors, clinical reference ranges for 12 biomarkers, and deterministic synthetic population generation
- biomarker_stream.rs: Streaming biomarker simulator with generic RingBuffer, configurable noise/drift/anomaly injection, z-score anomaly detection, linear regression trend analysis, and exponential moving averages
- 35 unit tests + 15 integration tests (168 total, 0 failures)
- Criterion benchmark suite targeting ADR-014 performance budgets

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
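The streaming statistics named in this commit (z-score anomaly detection, linear regression trend analysis, exponential moving averages) can be illustrated with a minimal sketch. The function names, the zero-variance guard, and the threshold parameter below are assumptions made for illustration; they are not taken from biomarker_stream.rs.

```rust
/// Least-squares slope of `values` against their indices 0, 1, 2, ...
/// Returns 0.0 when fewer than two points are available.
pub fn trend_slope(values: &[f64]) -> f64 {
    let n = values.len();
    if n < 2 {
        return 0.0;
    }
    let nf = n as f64;
    let mean_x = (nf - 1.0) / 2.0;
    let mean_y = values.iter().sum::<f64>() / nf;
    let mut num = 0.0;
    let mut den = 0.0;
    for (i, &y) in values.iter().enumerate() {
        let dx = i as f64 - mean_x;
        num += dx * (y - mean_y);
        den += dx * dx;
    }
    num / den
}

/// One exponential-moving-average step; `alpha` is expected in (0, 1].
pub fn ema_update(prev: f64, value: f64, alpha: f64) -> f64 {
    alpha * value + (1.0 - alpha) * prev
}

/// Z-score of `value`; returns 0.0 when the spread is effectively zero
/// (an assumed guard, chosen here to avoid division by zero).
pub fn z_score(value: f64, mean: f64, std: f64) -> f64 {
    if std > 1e-12 {
        (value - mean) / std
    } else {
        0.0
    }
}

/// Flags a reading whose absolute z-score exceeds `threshold`.
pub fn is_anomaly(value: f64, mean: f64, std: f64, threshold: f64) -> bool {
    z_score(value, mean, std).abs() > threshold
}
```

In a streaming setting, `mean` and `std` would come from a rolling window of recent readings rather than from the whole history.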
… halve ring buffer memory

- Fix snp_idx silent fallback: unwrap_or(0) masked missing SNPs with incorrect index-0 lookups; now returns Option<usize>
- RingBuffer: eliminate Option<T> wrapper, halving per-slot memory for f64 (8 bytes vs 16); use T::Default instead
- window_mean_std: replace two-pass sum+variance with single-pass Welford's online algorithm (2x fewer cache misses)
- compute_risk_scores: pre-compute category max scores via category_meta() to avoid re-scanning SNP_WEIGHTS per call; use &str keys in intermediate HashMap to reduce String allocations
- HashMap capacity hints throughout (StreamProcessor, genotypes, biomarker_values, cat_scores) to eliminate rehashing
- generate_synthetic_population: hoist APOE lookup out of inner loop, reserve biomarker_values capacity upfront
- All 48 tests pass (33 unit + 15 integration), benchmark compiles

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
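The single-pass replacement for window_mean_std can be sketched with Welford's online algorithm as follows. The function name `mean_std` and the use of the sample (n − 1) denominator are assumptions; the commit does not say which denominator the crate uses.

```rust
/// Single-pass mean and standard deviation using Welford's online algorithm.
/// Uses the sample (n - 1) denominator, which is an assumption of this sketch.
/// Returns (mean, 0.0) when fewer than two values are supplied.
pub fn mean_std(values: &[f64]) -> (f64, f64) {
    let mut count = 0.0_f64;
    let mut mean = 0.0_f64;
    let mut m2 = 0.0_f64; // running sum of squared deviations from the mean

    for &x in values {
        count += 1.0;
        let delta = x - mean; // deviation from the old mean
        mean += delta / count;
        m2 += delta * (x - mean); // uses both old and new mean
    }

    if count < 2.0 {
        return (mean, 0.0);
    }
    (mean, (m2 / (count - 1.0)).sqrt())
}
```

Because each value is read once, the data is traversed a single time instead of once for the sum and again for the variance.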
…fehacks clinical data

Evidence-based adjustments from geneticlifehacks.com research articles:

- MTHFR C677T (rs1801133): het weight 0.30→0.35 to match documented 40% enzyme activity decrease
- MTHFR A1298C (rs1801131): het 0.15→0.10, hom_alt 0.35→0.25 to match documented ~20% enzyme decrease
- Homocysteine reference range: 4-12→5-15 μmol/L (clinical consensus), critical_high 50→30 (moderate hyperhomocysteinemia threshold)
- Add MTHFR A1298C × COMT interaction (1.25x Neurological): A1298C homozygous + COMT slow = amplified depression risk
- Add DRD2/ANKK1 × COMT interaction (1.2x Neurological): rs1800497 × Val158Met working memory interaction
- Guard vector encoding with .take(4) so expanded interaction table (now 6 entries) doesn't overflow dims 56-59

Sources:

- geneticlifehacks.com/mthfr/ (enzyme activity percentages)
- geneticlifehacks.com/mthfr-c677t/ (MTHFR-COMT depression data)
- geneticlifehacks.com/understanding-homocysteine-levels/ (ref ranges)
- geneticlifehacks.com/dopamine-receptor-genes/ (DRD2×COMT interaction)

All 48 tests pass (33 unit + 15 integration), benchmark compiles.

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
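The .take(4) guard from the last bullet can be sketched like this. Only the slot range (dims 56-59) and the table size (6 entries) come from the commit message; the constant names, the function name, and the choice to store raw modifier values are hypothetical.

```rust
pub const VECTOR_DIM: usize = 64;
/// First dimension reserved for interaction modifiers (dims 56-59 per the commit).
pub const INTERACTION_OFFSET: usize = 56;
/// Number of dimensions reserved for interactions.
pub const INTERACTION_SLOTS: usize = 4;

/// Writes at most `INTERACTION_SLOTS` modifiers into the profile vector.
/// Without `.take(..)`, a table longer than four entries would spill
/// into dims 60 and above.
pub fn encode_interactions(vector: &mut [f32; VECTOR_DIM], modifiers: &[f32]) {
    for (slot, &modifier) in modifiers.iter().take(INTERACTION_SLOTS).enumerate() {
        vector[INTERACTION_OFFSET + slot] = modifier;
    }
}
```

With a six-entry table, the fifth and sixth modifiers are simply not encoded, so the neighbouring dimensions keep their meaning.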
Evidence-based refinements from peer-reviewed clinical research:

- TP53 rs1042522 (Pro72Arg): hom_ref 0.10→0.00; CC/Pro/Pro is not independently risk-associated, and the prior non-zero baseline was unjustified
- BRCA2 rs11571833 (K3326X): het 0.25→0.20; aligned with iCOGS meta-analysis OR 1.28 for breast cancer (Meeks et al., JNCI 2016, 76,637 cases / 83,796 controls)
- NQO1 rs1800566 (Pro187Ser): het 0.20→0.15, hom_alt 0.45→0.30; aligned with comprehensive meta-analysis OR 1.18 for TT vs CC (Lajin & Alachkar, Br J Cancer 2013, 92 studies, 21,178 cases); larger 2022 meta-analysis (43,736 cases) found no overall association

Validated unchanged weights against SOTA evidence:

- APOE rs429358: OR 3-4x het, 8-15x hom (Belloy JAMA Neurology 2023)
- SLCO1B1 rs4363657: OR 4.5/allele, 16.9 hom (SEARCH/NEJM; CPIC 2022)
- COMT×OPRM1 interaction: confirmed p=0.037 (orthopedic trauma study)

All 48 tests pass (33 unit + 15 integration).

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
…tion, and interaction tests

- Add gene→biomarker correlations in synthetic population: APOE e4→lower HDL/higher triglycerides, MTHFR→lower B12, NQO1 null→higher CRP
- Add CUSUM changepoint detection algorithm to StreamProcessor for detecting sustained biomarker shifts beyond simple anomaly detection
- Add 4 new integration tests: MTHFR×COMT interaction, DRD2×COMT interaction, APOE→HDL population correlation, CUSUM changepoint detection
- Remove unused variant_categories import
- All 172 tests pass, all ADR-014 performance targets exceeded

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
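To show how CUSUM differs from a per-reading z-score check, here is a minimal two-sided CUSUM sketch. The struct, its parameters (`target`, `slack`, `threshold`), and the reset-on-alarm behaviour are assumptions for illustration; the commit does not describe how StreamProcessor parameterises the detector.

```rust
/// Two-sided CUSUM detector (illustrative; not the crate's actual API).
pub struct Cusum {
    target: f64,    // expected baseline level
    slack: f64,     // per-sample drift that is tolerated without accumulating
    threshold: f64, // cumulative deviation that triggers a changepoint
    pos: f64,       // accumulated upward deviation
    neg: f64,       // accumulated downward deviation
}

impl Cusum {
    pub fn new(target: f64, slack: f64, threshold: f64) -> Self {
        Self { target, slack, threshold, pos: 0.0, neg: 0.0 }
    }

    /// Feeds one reading; returns true when a sustained shift is detected.
    /// Both accumulators are reset after an alarm (an assumed policy).
    pub fn update(&mut self, x: f64) -> bool {
        self.pos = (self.pos + x - self.target - self.slack).max(0.0);
        self.neg = (self.neg + self.target - x - self.slack).max(0.0);
        if self.pos > self.threshold || self.neg > self.threshold {
            self.pos = 0.0;
            self.neg = 0.0;
            true
        } else {
            false
        }
    }
}
```

A single reading of 103 against a target of 100 would not trip a typical z-score threshold, but three such readings in a row accumulate past a threshold of 5 when the slack is 1.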
- Add Health Biomarker Engine section to rvDNA README with usage examples for composite risk scoring, streaming processing, and synthetic populations
- Add biomarker.rs and biomarker_stream.rs to Modules table
- Update test count from 102 to 172 (added biomarker tests)
- Add biomarker benchmark results to Speed table
- Add Welford, CUSUM, and PRS to Published Algorithms table
- Update root README Genomics & Health capabilities (49 → 51 features)
- Add health biomarker engine and streaming biomarkers to root feature table
- Update rvDNA details section with risk scoring and streaming capabilities

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
…eaming

Structural improvements from deep code review:

- Consolidate 5 parallel arrays (SNP_WEIGHTS, HOM_REF, HOM_ALT, HET, ALLELE_FREQS) into single SnpDef struct array; eliminates entire class of parallel-array misalignment bugs
- Cache category_meta() with LazyLock; avoids per-call Vec allocation (critical in generate_synthetic_population hot path)
- Hoist Normal::new out of inner loop in generate_readings; pre-compute distributions per biomarker instead of per-step*per-biomarker
- Add clinically meaningful lower bounds: LDL normal_low 0→50 mg/dL (critical_low 25), Triglycerides normal_low 0→35 mg/dL (critical_low 20)
- Optimize RingBuffer::clear from O(capacity) to O(1): head/len reset is sufficient since push overwrites before read
- Use NUM_SNPS const for vector encoding bounds instead of magic number 51

All 172 tests pass, zero clippy warnings for rvdna.

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
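The RingBuffer changes described across these commits (no Option<T> wrapper, slots filled with T::default(), O(1) clear) can be sketched as below. The field names and method set are assumptions; the actual RingBuffer in biomarker_stream.rs may differ.

```rust
/// Fixed-capacity ring buffer storing `T` directly (no `Option<T>` per slot).
pub struct RingBuffer<T> {
    data: Vec<T>,
    head: usize, // index of the next slot to write
    len: usize,  // number of valid elements
}

impl<T: Default + Clone> RingBuffer<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { data: vec![T::default(); capacity], head: 0, len: 0 }
    }

    /// Appends a value, overwriting the oldest one when the buffer is full.
    pub fn push(&mut self, value: T) {
        let cap = self.data.len();
        self.data[self.head] = value;
        self.head = (self.head + 1) % cap;
        if self.len < cap {
            self.len += 1;
        }
    }

    pub fn len(&self) -> usize {
        self.len
    }

    pub fn is_empty(&self) -> bool {
        self.len == 0
    }

    /// Iterates from oldest to newest over the valid elements only.
    pub fn iter(&self) -> impl Iterator<Item = &T> + '_ {
        let cap = self.data.len();
        let start = (self.head + cap - self.len) % cap;
        (0..self.len).map(move |i| &self.data[(start + i) % cap])
    }

    /// O(1) clear: stale slots are never read because `iter` is bounded
    /// by `len`, and `push` overwrites a slot before it becomes visible.
    pub fn clear(&mut self) {
        self.head = 0;
        self.len = 0;
    }
}
```

The stale values left behind by `clear` are harmless in this design because every read path is bounded by `len`.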
…ence

Add rs10455872 (OR 1.6-1.75/allele CHD) and rs3798220 (OR 1.49-1.54/allele) from 2024 LPA meta-analyses. Include Lp(a) biomarker reference (0-75 nmol/L) and gene-biomarker correlation in population model. Separate NUM_ONEHOT_SNPS (17) from NUM_SNPS (19) to preserve 64-dim vector layout with LPA encoded in summary dimension 63.

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
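The split between NUM_ONEHOT_SNPS and NUM_SNPS can be illustrated with a hypothetical encoder. The constants 17, 19, 64, and dimension 63 come from the commit message. Everything else is an assumption of this sketch: the genotype coding (0 = hom ref, 1 = het, 2 = hom alt), one-hot blocks starting at dimension 0, the LPA SNPs occupying the last two panel positions, and the summary value being the fraction of LPA risk alleles.

```rust
pub const NUM_ONEHOT_SNPS: usize = 17; // SNPs that get a 3-wide one-hot block
pub const NUM_SNPS: usize = 19;        // full panel, including the two LPA SNPs
pub const VECTOR_DIM: usize = 64;
pub const LPA_SUMMARY_DIM: usize = 63;

/// Encodes genotype codes (0, 1, 2) into a 64-dim profile vector.
/// Only the first 17 SNPs are one-hot encoded (17 * 3 = 51 dims), so adding
/// SNPs to the panel does not shift the rest of the layout.
pub fn encode_profile(genotypes: &[u8; NUM_SNPS]) -> [f32; VECTOR_DIM] {
    let mut v = [0.0_f32; VECTOR_DIM];

    for (i, &g) in genotypes.iter().take(NUM_ONEHOT_SNPS).enumerate() {
        v[i * 3 + usize::from(g.min(2))] = 1.0;
    }

    // Assumed summary: risk-allele count across the two LPA SNPs, scaled to [0, 1].
    let lpa_alleles: u8 = genotypes[NUM_ONEHOT_SNPS..].iter().map(|&g| g.min(2)).sum();
    v[LPA_SUMMARY_DIM] = f32::from(lpa_alleles) / 4.0;

    v
}
```

Keeping the one-hot region at a fixed 51 dimensions is what lets previously stored vectors stay comparable with new ones.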
Add PCSK9 R46L loss-of-function variant (NEJM 2006: OR 0.77 CHD, 0.40 MI) as a protective cardiovascular SNP with negative weights. Include PCSK9→LDL-C biomarker correlation (15-21% lower LDL in carriers). Refactor gene-biomarker correlations from match to additive if-chain so multiple gene effects can stack on the same biomarker (e.g., APOE raises LDL while PCSK9 R46L lowers it). Panel expanded to 20 SNPs.

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
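The reason for moving from `match` to an if-chain is that a `match` on the biomarker or gene picks exactly one arm, so only one gene effect could apply. A sketch of the stacked form follows. The `Carrier` struct, the function name, and the APOE effect size of +10 mg/dL per e4 allele are hypothetical; the 15% PCSK9 reduction is the low end of the 15-21% range quoted in the commit.

```rust
/// Hypothetical carrier summary used only for this illustration.
pub struct Carrier {
    pub apoe_e4_alleles: u8,
    pub pcsk9_r46l: bool,
}

/// Applies every matching gene effect to the LDL baseline (mg/dL).
/// Each `if` is independent, so effects accumulate instead of the
/// first matching arm winning.
pub fn adjust_ldl(baseline: f64, carrier: &Carrier) -> f64 {
    let mut ldl = baseline;

    // Hypothetical effect size: +10 mg/dL per APOE e4 allele.
    if carrier.apoe_e4_alleles > 0 {
        ldl += 10.0 * f64::from(carrier.apoe_e4_alleles);
    }

    // PCSK9 R46L carriers: 15% lower LDL (low end of the cited 15-21%).
    if carrier.pcsk9_r46l {
        ldl -= 0.15 * baseline;
    }

    ldl
}
```

For a subject carrying one e4 allele and R46L, a baseline of 100 becomes 100 + 10 − 15 = 95 under these assumed effect sizes.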
Update all references from 17 SNPs to 20 SNPs reflecting the addition of LPA rs10455872/rs3798220 and PCSK9 rs11591147. Document new gene-biomarker correlations (LPA→Lp(a), PCSK9→LDL) in synthetic population section. Update module table line counts.

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
…nd benchmarks

ADR-015: Pure-JS biomarker engine mirroring Rust biomarker.rs and biomarker_stream.rs exactly. Includes:

- src/biomarker.js: 20-SNP composite risk scoring, 6 gene-gene interactions, 64-dim L2-normalized profile vectors, synthetic population generation with Mulberry32 PRNG
- src/stream.js: RingBuffer, StreamProcessor with Welford online stats, CUSUM changepoint detection, z-score anomaly detection, linear regression trend analysis, batch reading generation
- tests/test-biomarker.js: 35 tests + 5 benchmarks covering all classification levels, risk scoring, vector encoding, population generation, streaming, anomaly/trend detection
- index.d.ts: Full TypeScript definitions for all biomarker APIs
- package.json: Bump to v0.3.0, add biomarker keywords

Benchmark results (Node.js):

  computeRiskScores: 7.33 us/op
  encodeProfileVector: 9.51 us/op
  RingBuffer push+iter: 3.32 us/op

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
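Deterministic population generation needs a small seedable generator. The following is a Rust sketch of Mulberry32 in its commonly circulated form (32-bit state, additive constant 0x6D2B79F5). It is written in Rust to match the other examples here; whether src/biomarker.js uses exactly these steps is an assumption, and the struct and method names are illustrative.

```rust
/// Mulberry32: a small 32-bit seedable PRNG (commonly circulated formulation).
pub struct Mulberry32 {
    state: u32,
}

impl Mulberry32 {
    pub fn new(seed: u32) -> Self {
        Self { state: seed }
    }

    /// Advances the state and returns the next 32-bit output.
    /// Wrapping arithmetic mirrors 32-bit integer overflow semantics.
    pub fn next_u32(&mut self) -> u32 {
        self.state = self.state.wrapping_add(0x6D2B_79F5);
        let mut t = self.state;
        t = (t ^ (t >> 15)).wrapping_mul(t | 1);
        t ^= t.wrapping_add((t ^ (t >> 7)).wrapping_mul(t | 61));
        t ^ (t >> 14)
    }

    /// Returns a float in [0, 1) by dividing the 32-bit output by 2^32.
    pub fn next_f64(&mut self) -> f64 {
        f64::from(self.next_u32()) / 4_294_967_296.0
    }
}
```

Two generators created with the same seed produce identical sequences, which is the property a reproducible synthetic population relies on.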
Optimizations (1.7-2x speedup across all hot paths):

- biomarker.js: Replace O(n) findIndex with pre-built RSID_INDEX Map for O(1) SNP lookups; cache LPA SNP references to avoid repeated array iteration in vector encoding and population generation
- stream.js: Add RingBuffer.pushPop() returning evicted value; replace O(n) windowMeanStd buffer scan with O(1) incremental windowed Welford algorithm in StreamProcessor

Benchmark improvements (before → after):

  computeRiskScores: 7.33 → 3.70 us/op (1.98x)
  encodeProfileVector: 9.51 → 5.25 us/op (1.81x)
  StreamProcessor.processReading: 220 → 110 us/op (2.00x)
  generateSyntheticPopulation(100): 1090 → 595 us/op (1.83x)

Real-data integration tests (25 new tests):

- 4 realistic 23andMe fixture files (29 SNPs each) covering: high-risk cardio, low-risk baseline, multi-risk, PCSK9-protective
- End-to-end pipeline: parse 23andMe → biomarker scoring → streaming
- Clinical scenarios: APOE e4/e4, BRCA1 carrier, MTHFR compound het, COMT×OPRM1 pain, DRD2×COMT, PCSK9 protective
- Cross-validation: 8 JS↔Rust parity assertions on tables, z-scores, classification, vector layout, risk thresholds
- Population correlations: APOE→HDL, LPA→Lp(a), score distribution, clinical biomarker range validation (500 subjects)
- Full pipeline benchmark: 220 us end-to-end

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
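The O(1) windowed update works because, once the window is full, each push evicts exactly one value: the mean and the sum of squared deviations can be corrected from the (new, evicted) pair instead of rescanning the buffer. A minimal sketch follows, shown in Rust for consistency with the other examples. `VecDeque` stands in for the pushPop-style ring buffer, and the struct name, method names, and sample (n − 1) denominator are assumptions.

```rust
use std::collections::VecDeque;

/// Rolling mean/std over the last `capacity` values with O(1) updates.
pub struct WindowedStats {
    window: VecDeque<f64>,
    capacity: usize,
    mean: f64,
    m2: f64, // sum of squared deviations from the current window mean
}

impl WindowedStats {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { window: VecDeque::with_capacity(capacity), capacity, mean: 0.0, m2: 0.0 }
    }

    pub fn push(&mut self, x: f64) {
        if self.window.len() == self.capacity {
            // Full window: swap the evicted value for the new one.
            let old = self.window.pop_front().expect("full window is non-empty");
            let old_mean = self.mean;
            self.mean += (x - old) / self.capacity as f64;
            self.m2 += (x - old) * (x - self.mean + old - old_mean);
        } else {
            // Window still filling: standard Welford step.
            let n = (self.window.len() + 1) as f64;
            let delta = x - self.mean;
            self.mean += delta / n;
            self.m2 += delta * (x - self.mean);
        }
        self.window.push_back(x);
    }

    pub fn mean(&self) -> f64 {
        self.mean
    }

    /// Sample standard deviation; 0.0 with fewer than two values.
    /// `max(0.0)` guards against tiny negative values from rounding.
    pub fn std(&self) -> f64 {
        let n = self.window.len();
        if n < 2 {
            return 0.0;
        }
        (self.m2.max(0.0) / (n as f64 - 1.0)).sqrt()
    }
}
```

Incremental updates of this kind can accumulate floating-point error over very long streams, so a periodic full recomputation is a reasonable safeguard.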