feat(rvdna): add health biomarker analysis engine with streaming simulation #199

Merged: ruvnet merged 13 commits into main from claude/health-biomarker-adr-ESZy4 on Feb 22, 2026

@ruvnet ruvnet commented Feb 22, 2026

Implement ADR-014 Health Biomarker Analysis Architecture:

  • biomarker.rs: Composite risk scoring engine with 17-SNP weight matrix,
    gene-gene interaction modifiers (COMT×OPRM1, MTHFR compound, BRCA1×TP53),
    64-dim HNSW-aligned profile vectors, clinical reference ranges for 12
    biomarkers, and deterministic synthetic population generation
  • biomarker_stream.rs: Streaming biomarker simulator with generic RingBuffer,
    configurable noise/drift/anomaly injection, z-score anomaly detection,
    linear regression trend analysis, and exponential moving averages
  • 35 unit tests + 15 integration tests (168 total, 0 failures)
  • Criterion benchmark suite targeting ADR-014 performance budgets

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
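
The exponential moving averages and linear regression trend analysis listed for biomarker_stream.rs can be illustrated with a minimal Rust sketch. The function names `ema` and `trend_slope`, their signatures, and the choice to seed the average with the first sample are assumptions for illustration, not the module's actual API.

```rust
/// Exponential moving average: ema_t = alpha * x_t + (1 - alpha) * ema_{t-1}.
/// Assumption: the first sample seeds the average. Returns None for empty input.
pub fn ema(values: &[f64], alpha: f64) -> Option<f64> {
    let mut iter = values.iter().copied();
    let first = iter.next()?;
    Some(iter.fold(first, |acc, x| alpha * x + (1.0 - alpha) * acc))
}

/// Ordinary least-squares slope of `values` regressed on their indices 0..n.
/// Returns None when fewer than two points are available.
pub fn trend_slope(values: &[f64]) -> Option<f64> {
    let n = values.len();
    if n < 2 {
        return None;
    }
    let n_f = n as f64;
    let mean_x = (n_f - 1.0) / 2.0; // mean of 0, 1, ..., n-1
    let mean_y = values.iter().sum::<f64>() / n_f;
    let (mut sxy, mut sxx) = (0.0_f64, 0.0_f64);
    for (i, &y) in values.iter().enumerate() {
        let dx = i as f64 - mean_x;
        sxy += dx * (y - mean_y);
        sxx += dx * dx;
    }
    Some(sxy / sxx)
}
```

A positive slope over the window would indicate an upward drift in the biomarker; how the real StreamProcessor thresholds that slope is not stated in this PR.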

… halve ring buffer memory

- Fix snp_idx silent fallback: unwrap_or(0) masked missing SNPs with
  incorrect index-0 lookups; now returns Option<usize>
- RingBuffer: eliminate Option<T> wrapper, halving per-slot memory
  for f64 (8 bytes vs 16); use T::Default instead
- window_mean_std: replace two-pass sum+variance with single-pass
  Welford's online algorithm (2x fewer cache misses)
- compute_risk_scores: pre-compute category max scores via
  category_meta() to avoid re-scanning SNP_WEIGHTS per call;
  use &str keys in intermediate HashMap to reduce String allocations
- HashMap capacity hints throughout (StreamProcessor, genotypes,
  biomarker_values, cat_scores) to eliminate rehashing
- generate_synthetic_population: hoist APOE lookup out of inner loop,
  reserve biomarker_values capacity upfront
- All 48 tests pass (33 unit + 15 integration), benchmark compiles

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
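
The switch to Welford's online algorithm described in the commit above can be sketched as follows. This is a minimal illustration: the slice-based signature and the use of the sample (n - 1) variance are assumptions, and the real `window_mean_std` may differ. The `z_score` helper is a hypothetical name showing how the result would feed z-score anomaly detection.

```rust
/// Single-pass mean and standard deviation using Welford's online update.
/// Assumption: sample standard deviation (divides by n - 1).
pub fn window_mean_std(values: &[f64]) -> (f64, f64) {
    let mut count = 0.0_f64;
    let mut mean = 0.0_f64;
    let mut m2 = 0.0_f64; // running sum of squared deviations from the mean
    for &x in values {
        count += 1.0;
        let delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean); // uses the updated mean
    }
    let std = if count > 1.0 {
        (m2 / (count - 1.0)).sqrt()
    } else {
        0.0
    };
    (mean, std)
}

/// Hypothetical helper: z-score of a reading against window statistics.
/// Returns 0.0 when the window has no spread, to avoid dividing by zero.
pub fn z_score(x: f64, mean: f64, std: f64) -> f64 {
    if std > 0.0 {
        (x - mean) / std
    } else {
        0.0
    }
}
```

Each value is read once, which is what removes the second pass over the buffer.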
…fehacks clinical data

Evidence-based adjustments from geneticlifehacks.com research articles:

- MTHFR C677T (rs1801133): het weight 0.30→0.35 to match documented
  40% enzyme activity decrease
- MTHFR A1298C (rs1801131): het 0.15→0.10, hom_alt 0.35→0.25 to
  match documented ~20% enzyme decrease
- Homocysteine reference range: 4-12→5-15 μmol/L (clinical consensus),
  critical_high 50→30 (moderate hyperhomocysteinemia threshold)
- Add MTHFR A1298C × COMT interaction (1.25x Neurological): A1298C
  homozygous + COMT slow = amplified depression risk
- Add DRD2/ANKK1 × COMT interaction (1.2x Neurological): rs1800497 ×
  Val158Met working memory interaction
- Guard vector encoding with .take(4) so expanded interaction table
  (now 6 entries) doesn't overflow dims 56-59

Sources:
- geneticlifehacks.com/mthfr/ (enzyme activity percentages)
- geneticlifehacks.com/mthfr-c677t/ (MTHFR-COMT depression data)
- geneticlifehacks.com/understanding-homocysteine-levels/ (ref ranges)
- geneticlifehacks.com/dopamine-receptor-genes/ (DRD2×COMT interaction)

All 48 tests pass (33 unit + 15 integration), benchmark compiles.

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
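
The interaction multipliers and the `.take(4)` guard from the commit above can be sketched like this. The `Interaction` struct, the function names, and the idea that dims 56-59 store the multiplier values are assumptions for illustration; only the two multipliers (1.25x and 1.2x, both Neurological) and the four-slot limit come from the commit message.

```rust
/// Hypothetical interaction record; field names are illustrative.
pub struct Interaction {
    pub label: &'static str,
    pub category: &'static str,
    pub multiplier: f64,
}

/// The two interactions whose multipliers are stated in the commit message.
pub static KNOWN_INTERACTIONS: [Interaction; 2] = [
    Interaction { label: "MTHFR A1298C x COMT", category: "Neurological", multiplier: 1.25 },
    Interaction { label: "DRD2/ANKK1 x COMT", category: "Neurological", multiplier: 1.2 },
];

/// Scale a category score by every triggered interaction in that category.
/// Assumption: multipliers compose multiplicatively.
pub fn apply_interactions(base: f64, category: &str, triggered: &[&Interaction]) -> f64 {
    triggered
        .iter()
        .filter(|i| i.category == category)
        .fold(base, |score, i| score * i.multiplier)
}

/// Write interaction values into dims 56..60 of a 64-dim profile vector.
/// `.take(4)` keeps a longer table from spilling into dim 60 and beyond.
pub fn encode_interactions(vector: &mut [f32; 64], multipliers: &[f64]) {
    for (slot, &m) in multipliers.iter().take(4).enumerate() {
        vector[56 + slot] = m as f32;
    }
}
```

Without the guard, a six-entry table would write to dims 60 and 61, which is the overflow the commit describes.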
Evidence-based refinements from peer-reviewed clinical research:

- TP53 rs1042522 (Pro72Arg): hom_ref 0.10→0.00 — CC/Pro/Pro is not
  independently risk-associated; prior non-zero baseline was unjustified
- BRCA2 rs11571833 (K3326X): het 0.25→0.20 — aligned with iCOGS
  meta-analysis OR 1.28 for breast cancer (Meeks et al., JNCI 2016,
  76,637 cases / 83,796 controls)
- NQO1 rs1800566 (Pro187Ser): het 0.20→0.15, hom_alt 0.45→0.30 —
  aligned with comprehensive meta-analysis OR 1.18 for TT vs CC
  (Lajin & Alachkar, Br J Cancer 2013, 92 studies, 21,178 cases);
  larger 2022 meta-analysis (43,736 cases) found no overall association

Validated unchanged weights against SOTA evidence:
- APOE rs429358: OR 3-4x het, 8-15x hom (Belloy JAMA Neurology 2023)
- SLCO1B1 rs4363657: OR 4.5/allele, 16.9 hom (SEARCH/NEJM; CPIC 2022)
- COMT×OPRM1 interaction: confirmed p=0.037 (orthopedic trauma study)

All 48 tests pass (33 unit + 15 integration).

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
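
The per-genotype weights adjusted in the commit above (hom_ref, het, hom_alt) can be read as a small lookup table. The sketch below uses the revised NQO1 values from the commit; the `SnpWeights` and `Genotype` types, the `classify` helper, and the hom_ref value of 0.0 for NQO1 are assumptions, since the commit does not state them.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Genotype {
    HomRef,
    Het,
    HomAlt,
}

/// Hypothetical per-SNP weight record.
pub struct SnpWeights {
    pub rsid: &'static str,
    pub hom_ref: f64,
    pub het: f64,
    pub hom_alt: f64,
}

impl SnpWeights {
    /// Risk weight contributed by a genotype at this SNP.
    pub fn weight(&self, genotype: Genotype) -> f64 {
        match genotype {
            Genotype::HomRef => self.hom_ref,
            Genotype::Het => self.het,
            Genotype::HomAlt => self.hom_alt,
        }
    }
}

/// NQO1 Pro187Ser with the revised het and hom_alt weights.
/// Assumption: hom_ref is 0.0 (not stated in the commit).
pub const NQO1: SnpWeights = SnpWeights {
    rsid: "rs1800566",
    hom_ref: 0.0,
    het: 0.15,
    hom_alt: 0.30,
};

/// Classify a two-letter genotype call such as "CT" by counting alleles that
/// differ from the reference. Simplification: any non-reference allele is
/// treated as the alternate allele.
pub fn classify(call: &str, ref_allele: char) -> Option<Genotype> {
    let alleles: Vec<char> = call.chars().collect();
    if alleles.len() != 2 {
        return None;
    }
    match alleles.iter().filter(|&&a| a != ref_allele).count() {
        0 => Some(Genotype::HomRef),
        1 => Some(Genotype::Het),
        _ => Some(Genotype::HomAlt),
    }
}
```

Returning `Option` from `classify` keeps malformed or missing calls from being silently scored as a real genotype.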
…tion, and interaction tests

- Add gene→biomarker correlations in synthetic population: APOE e4→lower HDL/higher
  triglycerides, MTHFR→lower B12, NQO1 null→higher CRP
- Add CUSUM changepoint detection algorithm to StreamProcessor for detecting
  sustained biomarker shifts beyond simple anomaly detection
- Add 4 new integration tests: MTHFR×COMT interaction, DRD2×COMT interaction,
  APOE→HDL population correlation, CUSUM changepoint detection
- Remove unused variant_categories import
- All 172 tests pass, all ADR-014 performance targets exceeded

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
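
CUSUM changepoint detection, as added in the commit above, accumulates deviations from a target level and signals once the running sum crosses a threshold. The sketch below is a standard two-sided CUSUM; the `Cusum` type, its parameter names, and the reset-after-alarm behaviour are assumptions, not the StreamProcessor implementation.

```rust
/// Two-sided CUSUM detector (illustrative; names are hypothetical).
pub struct Cusum {
    target: f64,    // expected level of the biomarker
    slack: f64,     // allowance: deviations smaller than this are ignored
    threshold: f64, // alarm level for the cumulative sums
    pos: f64,       // cumulative evidence of an upward shift
    neg: f64,       // cumulative evidence of a downward shift
}

impl Cusum {
    pub fn new(target: f64, slack: f64, threshold: f64) -> Self {
        Self { target, slack, threshold, pos: 0.0, neg: 0.0 }
    }

    /// Feed one reading; returns true when a sustained shift is detected.
    /// Assumption: both sums reset to zero after an alarm.
    pub fn update(&mut self, x: f64) -> bool {
        self.pos = (self.pos + x - self.target - self.slack).max(0.0);
        self.neg = (self.neg + self.target - x - self.slack).max(0.0);
        let fired = self.pos > self.threshold || self.neg > self.threshold;
        if fired {
            self.pos = 0.0;
            self.neg = 0.0;
        }
        fired
    }
}
```

Unlike a z-score test, which reacts to a single outlying reading, this only fires after several readings lean the same way, which matches the "sustained biomarker shifts" wording in the commit.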
- Add Health Biomarker Engine section to rvDNA README with usage examples
  for composite risk scoring, streaming processing, and synthetic populations
- Add biomarker.rs and biomarker_stream.rs to Modules table
- Update test count from 102 to 172 (added biomarker tests)
- Add biomarker benchmark results to Speed table
- Add Welford, CUSUM, and PRS to Published Algorithms table
- Update root README Genomics & Health capabilities (49 → 51 features)
- Add health biomarker engine and streaming biomarkers to root feature table
- Update rvDNA details section with risk scoring and streaming capabilities

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
…eaming

Structural improvements from deep code review:

- Consolidate 5 parallel arrays (SNP_WEIGHTS, HOM_REF, HOM_ALT, HET,
  ALLELE_FREQS) into single SnpDef struct array — eliminates entire class
  of parallel-array misalignment bugs
- Cache category_meta() with LazyLock — avoids per-call Vec allocation
  (critical in generate_synthetic_population hot path)
- Hoist Normal::new out of inner loop in generate_readings — pre-compute
  distributions per biomarker instead of per-step*per-biomarker
- Add clinically meaningful lower bounds: LDL normal_low 0→50 mg/dL
  (critical_low 25), Triglycerides normal_low 0→35 mg/dL (critical_low 20)
- Optimize RingBuffer::clear from O(capacity) to O(1) — head/len reset
  is sufficient since push overwrites before read
- Use NUM_SNPS const for vector encoding bounds instead of magic number 51

All 172 tests pass, zero clippy warnings for rvdna.

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
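
The RingBuffer changes described in this PR (slots stored as plain `T` filled with `T::default()` rather than `Option<T>`, and an O(1) `clear`) can be sketched as below. Field and method names are assumptions; the point is that `clear` only resets `head` and `len`, because stale slots are overwritten by `push` before `iter` can reach them.

```rust
/// Fixed-capacity ring buffer without an Option<T> wrapper (illustrative).
pub struct RingBuffer<T> {
    buf: Vec<T>,
    head: usize, // next write position
    len: usize,  // number of valid elements
}

impl<T: Default + Clone> RingBuffer<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { buf: vec![T::default(); capacity], head: 0, len: 0 }
    }

    /// Append a value, overwriting the oldest one when full.
    pub fn push(&mut self, value: T) {
        let cap = self.buf.len();
        self.buf[self.head] = value;
        self.head = (self.head + 1) % cap;
        if self.len < cap {
            self.len += 1;
        }
    }

    /// Iterate from oldest to newest over the valid elements only.
    pub fn iter(&self) -> impl Iterator<Item = &T> + '_ {
        let cap = self.buf.len();
        let start = (self.head + cap - self.len) % cap;
        (0..self.len).map(move |i| &self.buf[(start + i) % cap])
    }

    /// O(1): stale slots stay in memory but are unreachable until overwritten.
    pub fn clear(&mut self) {
        self.head = 0;
        self.len = 0;
    }

    pub fn len(&self) -> usize {
        self.len
    }
}
```

The trade-off of this design is the `T: Default` bound: element types without a sensible default would still need an `Option` or `MaybeUninit` based layout.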
…ence

Add rs10455872 (OR 1.6-1.75/allele CHD) and rs3798220 (OR 1.49-1.54/allele)
from 2024 LPA meta-analyses. Include Lp(a) biomarker reference (0-75 nmol/L)
and gene-biomarker correlation in population model. Separate NUM_ONEHOT_SNPS
(17) from NUM_SNPS (19) to preserve 64-dim vector layout with LPA encoded
in summary dimension 63.

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
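
One way to read the layout described above: the 17 one-hot SNPs fill three dimensions each (17 × 3 = 51, matching the bound mentioned in the earlier refactor), while the two LPA SNPs are folded into a single summary value at dimension 63. The sketch below follows that reading. The input representation (alt-allele counts), the LPA summary formula, and the final L2 normalization step are assumptions, and the dimensions this PR does not describe are left at zero.

```rust
pub const NUM_ONEHOT_SNPS: usize = 17;
pub const NUM_SNPS: usize = 19;
pub const DIM: usize = 64;
pub const LPA_DIM: usize = 63;

/// Encode genotypes into a 64-dim vector.
/// `genotypes[i]` is the alt-allele count (0, 1 or 2) for SNP i, None if missing.
/// Assumption: the two LPA SNPs are the last two entries of the panel.
pub fn encode(genotypes: &[Option<u8>; NUM_SNPS]) -> [f32; DIM] {
    let mut v = [0.0_f32; DIM];

    // One-hot block: three dims per SNP, dims 0..51.
    for (i, g) in genotypes.iter().take(NUM_ONEHOT_SNPS).enumerate() {
        if let Some(alt) = g {
            v[i * 3 + (*alt).min(2) as usize] = 1.0;
        }
    }

    // LPA summary: fraction of risk alleles across both LPA SNPs (0.0..=1.0).
    // This formula is illustrative only.
    let lpa_alt: u32 = genotypes[NUM_ONEHOT_SNPS..]
        .iter()
        .flatten()
        .map(|&a| a.min(2) as u32)
        .sum();
    v[LPA_DIM] = lpa_alt as f32 / 4.0;

    // L2-normalize so that cosine and dot-product similarity agree.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
    v
}
```

Keeping `NUM_ONEHOT_SNPS` separate from `NUM_SNPS` is what lets the panel grow without shifting any existing dimension.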
Add PCSK9 R46L loss-of-function variant (NEJM 2006: OR 0.77 CHD,
0.40 MI) as a protective cardiovascular SNP with negative weights.
Include PCSK9→LDL-C biomarker correlation (15-21% lower LDL in
carriers). Refactor gene-biomarker correlations from match to
additive if-chain so multiple gene effects can stack on the same
biomarker (e.g., APOE raises LDL while PCSK9 R46L lowers it).
Panel expanded to 20 SNPs.

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
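
The move from `match` to an additive if-chain can be illustrated with LDL, where APOE and PCSK9 R46L pull in opposite directions. In the sketch below the `Carrier` struct and `adjust_ldl` function are hypothetical, the APOE effect size (+10%) is a placeholder, and the PCSK9 effect (-18%) is simply a value inside the 15-21% range quoted in the commit.

```rust
/// Hypothetical carrier flags for one synthetic subject.
pub struct Carrier {
    pub apoe_e4: bool,
    pub pcsk9_r46l: bool,
}

/// Adjust a baseline LDL value (mg/dL) for genetic effects.
/// Each `if` contributes independently, so effects stack; a `match` on the
/// gene would apply only the first arm that fits.
pub fn adjust_ldl(baseline: f64, carrier: &Carrier) -> f64 {
    let mut factor = 1.0;
    if carrier.apoe_e4 {
        factor += 0.10; // placeholder effect size, not from the PR
    }
    if carrier.pcsk9_r46l {
        factor -= 0.18; // within the 15-21% reduction cited in the commit
    }
    baseline * factor
}
```

A subject carrying both variants ends up with a net factor of about 0.92, which a single-arm `match` could not express.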
Update all references from 17 SNPs to 20 SNPs reflecting the
addition of LPA rs10455872/rs3798220 and PCSK9 rs11591147.
Document new gene-biomarker correlations (LPA→Lp(a), PCSK9→LDL)
in synthetic population section. Update module table line counts.

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
…nd benchmarks

ADR-015: Pure-JS biomarker engine mirroring Rust biomarker.rs and
biomarker_stream.rs exactly. Includes:

- src/biomarker.js: 20-SNP composite risk scoring, 6 gene-gene
  interactions, 64-dim L2-normalized profile vectors, synthetic
  population generation with Mulberry32 PRNG
- src/stream.js: RingBuffer, StreamProcessor with Welford online
  stats, CUSUM changepoint detection, z-score anomaly detection,
  linear regression trend analysis, batch reading generation
- tests/test-biomarker.js: 35 tests + 5 benchmarks covering all
  classification levels, risk scoring, vector encoding, population
  generation, streaming, anomaly/trend detection
- index.d.ts: Full TypeScript definitions for all biomarker APIs
- package.json: Bump to v0.3.0, add biomarker keywords

Benchmark results (Node.js):
  computeRiskScores: 7.33 us/op
  encodeProfileVector: 9.51 us/op
  RingBuffer push+iter: 3.32 us/op

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
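
For readers comparing the JS port with the Rust crate, the commonly circulated Mulberry32 generator can be written in Rust as below. This is a sketch of the public-domain algorithm using wrapping 32-bit arithmetic; it is not taken from src/biomarker.js, and whether the JS engine seeds or scales it the same way is not stated in this PR. The `Mulberry32` type and its method names are assumptions.

```rust
/// Mulberry32: a small 32-bit-state PRNG, handy for deterministic test data.
pub struct Mulberry32 {
    state: u32,
}

impl Mulberry32 {
    pub fn new(seed: u32) -> Self {
        Self { state: seed }
    }

    /// Advance the state and return the next 32-bit output.
    pub fn next_u32(&mut self) -> u32 {
        self.state = self.state.wrapping_add(0x6D2B_79F5);
        let mut t = self.state;
        t = (t ^ (t >> 15)).wrapping_mul(t | 1);
        t ^= t.wrapping_add((t ^ (t >> 7)).wrapping_mul(t | 61));
        t ^ (t >> 14)
    }

    /// Uniform value in [0, 1), obtained by dividing by 2^32.
    pub fn next_f64(&mut self) -> f64 {
        self.next_u32() as f64 / 4_294_967_296.0
    }
}
```

A fixed seed yields the same sequence on every run, which is the property a deterministic synthetic population depends on.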
Optimizations (1.7-2x speedup across all hot paths):
- biomarker.js: Replace O(n) findIndex with pre-built RSID_INDEX Map
  for O(1) SNP lookups; cache LPA SNP references to avoid repeated
  array iteration in vector encoding and population generation
- stream.js: Add RingBuffer.pushPop() returning evicted value;
  replace O(n) windowMeanStd buffer scan with O(1) incremental
  windowed Welford algorithm in StreamProcessor

Benchmark improvements (before → after):
  computeRiskScores: 7.33 → 3.70 us/op (1.98x)
  encodeProfileVector: 9.51 → 5.25 us/op (1.81x)
  StreamProcessor.processReading: 220 → 110 us/op (2.00x)
  generateSyntheticPopulation(100): 1090 → 595 us/op (1.83x)

Real-data integration tests (25 new tests):
- 4 realistic 23andMe fixture files (29 SNPs each) covering:
  high-risk cardio, low-risk baseline, multi-risk, PCSK9-protective
- End-to-end pipeline: parse 23andMe → biomarker scoring → streaming
- Clinical scenarios: APOE e4/e4, BRCA1 carrier, MTHFR compound het,
  COMT×OPRM1 pain, DRD2×COMT, PCSK9 protective
- Cross-validation: 8 JS↔Rust parity assertions on tables, z-scores,
  classification, vector layout, risk thresholds
- Population correlations: APOE→HDL, LPA→Lp(a), score distribution,
  clinical biomarker range validation (500 subjects)
- Full pipeline benchmark: 220 us end-to-end

https://claude.ai/code/session_014FpaYVohmyLH5dcBZTgmSY
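
The incremental windowed Welford idea from the stream.js optimization can be sketched in Rust as follows. When the window is full, the value evicted by `push_pop` is used to update the mean and the sum of squared deviations in constant time instead of rescanning the buffer. The `WindowStats` type and its method names are hypothetical, and the sample (n - 1) variance is an assumption.

```rust
/// Sliding-window mean/std with O(1) updates (illustrative).
pub struct WindowStats {
    buf: Vec<f64>,
    head: usize, // next write position; also the oldest slot once full
    len: usize,
    mean: f64,
    m2: f64, // sum of squared deviations from the current mean
}

impl WindowStats {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { buf: vec![0.0; capacity], head: 0, len: 0, mean: 0.0, m2: 0.0 }
    }

    /// Insert `x`; returns the evicted value once the window is full.
    pub fn push_pop(&mut self, x: f64) -> Option<f64> {
        let cap = self.buf.len();
        let evicted = if self.len == cap { Some(self.buf[self.head]) } else { None };
        self.buf[self.head] = x;
        self.head = (self.head + 1) % cap;
        match evicted {
            // Window still filling: ordinary Welford update.
            None => {
                self.len += 1;
                let delta = x - self.mean;
                self.mean += delta / self.len as f64;
                self.m2 += delta * (x - self.mean);
            }
            // Window full: replace `old` with `x` in closed form.
            Some(old) => {
                let old_mean = self.mean;
                self.mean += (x - old) / cap as f64;
                self.m2 += (x - old) * (x - self.mean + old - old_mean);
            }
        }
        evicted
    }

    pub fn mean(&self) -> f64 {
        self.mean
    }

    /// Sample standard deviation; m2 is clamped at 0 against rounding error.
    pub fn std(&self) -> f64 {
        if self.len > 1 {
            (self.m2.max(0.0) / (self.len as f64 - 1.0)).sqrt()
        } else {
            0.0
        }
    }
}
```

On very long streams the incremental sums can accumulate rounding error, so a periodic full recomputation from the buffer may be worth considering.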
@ruvnet ruvnet merged commit f957eb7 into main Feb 22, 2026
7 checks passed