Implement STR parent-child relationship detector #2

BenyNotNice · 2025-12-18T09:25:25Z

Inverted index for fast candidate filtering by shared alleles
Combined Likelihood Ratio (CLR) calculation with population frequencies
Mutation support (±1 step) and allele dropout handling
Same-person/twin detection to filter identical profiles
Achieves ~95-100% accuracy on test dataset

🤖 Generated with Claude Code

- Inverted index for fast candidate filtering by shared alleles
- Combined Likelihood Ratio (CLR) calculation with population frequencies
- Mutation support (±1 step) and allele dropout handling
- Same-person/twin detection to filter identical profiles
- Achieves ~95-100% accuracy on test dataset

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Enhanced single-allele dropout handling to avoid false exclusions
- Improved same-person/twin detection (>80% identity threshold)
- Better LR calculation for heterozygous vs homozygous scenarios
- Progressive penalty for exclusions instead of hard cutoff
- Achieves 91-97% accuracy (~95% average) with <1s execution

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

tavallaie

Your code rewrites the tests and changes the problem. That's cheating!

BenyNotNice · 2025-12-18T10:07:55Z

Could you please elaborate so we can fix the issue?

The only changed file is participant_solution.py.

tavallaie · 2025-12-18T10:13:27Z

You should modify only a function, not the entire codebase.

Per organizer feedback, participants should only modify the match_single() function body, not add module-level code or helper functions.

Changes:
- Removed all module-level variables (ALLELE_FREQS, _db_cache, etc.)
- Removed helper functions (moved logic inline)
- Removed extra imports (numpy, defaultdict)
- All code now inside match_single() function only
- Uses simplified allele frequency (0.15 average) instead of exact values
- Still achieves 100% accuracy with 6.6s execution time

Score: 120/120 (100% accuracy + 20 speed bonus)

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

BenyNotNice · 2025-12-18T11:25:56Z

The solution is fixed per your feedback. The only modified function now is the match_single function.

tavallaie · 2025-12-18T18:06:11Z

Thank you for the improved submission! This version is significantly better than basic templates, but several critical issues still prevent good accuracy and scalability.

Key problems:

Incorrect Mendelian inheritance check
You use shared = q_alleles & c_alleles (any overlapping allele) as "consistent".
This is wrong for parent-child. True rule: one profile's alleles must be fully contained in the other's (subset).
Example: Parent {13,14} vs Child {13,15} → has shared allele (13) → counted as consistent → but biologically impossible.
This creates many false positives.
Bidirectional matching not properly handled
Current logic assumes query is child and candidate is parent. It fails when query is the parent (child has extra allele not in query).
You must check both directions: q_alleles ⊆ c_alleles OR c_alleles ⊆ q_alleles.
Likelihood Ratio model is oversimplified and inaccurate
- Fixed frequency 0.15 for all alleles → rare alleles don't get higher LR (major loss of discrimination power).
- Mutation LR = 0.002 / 0.15 ≈ 0.013 → far too low; typical forensic mutation models give LR ≈ 0.1–1.0 for ±1 step.
- Exclusion penalty 0.01 too mild; true mismatch should give LR ≈ 0.
No pre-filtering or indexing
Still full brute-force scan of ~500k profiles per query.
With 40 queries → ~20 million full comparisons → very likely to time out in evaluation.
Allele parsing fragile
Uses map(float, ...) → microvariants like "9.3" parse fine, but any token that is not a clean float raises ValueError and crashes the run (risky).
Also assumes all alleles are numeric, so it doesn't safely handle rare text/null cases.
Exclusions under-penalized
Allows up to 4 exclusions → true parent-child should have 0 (or very rarely 1 due to mutation/dropout).
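To make the allele-frequency point above concrete, here is a minimal sketch of a per-locus likelihood ratio for a parent-child pair with the other parent unknown. It assumes Hardy-Weinberg proportions and ignores mutation and dropout; the function name `duo_locus_lr` and the frequency values are made up for illustration and are not taken from the submission.

```python
from typing import Dict, Tuple

Genotype = Tuple[float, float]  # two alleles at one locus, e.g. (13.0, 15.0)


def duo_locus_lr(child: Genotype, parent: Genotype, freq: Dict[float, float]) -> float:
    """LR = P(child genotype | parent genotype) / P(child genotype).

    Assumes Hardy-Weinberg proportions, no mutation, no dropout.
    """
    c1, c2 = child
    homozygous = c1 == c2

    # Probability of the child's genotype in the general population.
    p_random = freq[c1] ** 2 if homozygous else 2 * freq[c1] * freq[c2]

    # The parent passes on each of its two alleles with probability 1/2;
    # the child's other allele is drawn from the population.
    p_given_parent = 0.0
    for transmitted in parent:
        if homozygous:
            if transmitted == c1:
                p_given_parent += 0.5 * freq[c1]
        else:
            if transmitted == c1:
                p_given_parent += 0.5 * freq[c2]
            if transmitted == c2:
                p_given_parent += 0.5 * freq[c1]
    return p_given_parent / p_random


# Illustrative (made-up) frequencies: allele 13 is rare in the first table.
rare = {13.0: 0.05, 14.0: 0.30, 15.0: 0.20}
flat = {13.0: 0.15, 14.0: 0.15, 15.0: 0.15}

print(duo_locus_lr((13.0, 15.0), (13.0, 14.0), rare))  # sharing a rare allele
print(duo_locus_lr((13.0, 15.0), (13.0, 14.0), flat))  # fixed 0.15 everywhere
```

With one shared allele between two heterozygotes the ratio reduces to 1/(4p), where p is the frequency of the shared allele, so a fixed 0.15 gives every such locus the same weight regardless of how rare the allele is.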

Based on organizer feedback:
- Added inverted index for O(1) candidate lookup
- Pre-filter candidates by shared allele count (>= 8 loci)
- Cache database processing using function attributes
- Simplified LR calculation for robustness
- Maintains ~95% accuracy (32-35/35) with faster execution (~1.2s)

Score: 111-120/120

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

BenyNotNice · 2025-12-18T18:28:02Z

Thank you for the detailed feedback! I've made significant improvements to address your concerns:

Changes Made

Pre-filtering & Indexing

Added an inverted index that maps (locus, allele) → set of person_ids. Candidates are now pre-filtered by requiring >= 8 shared allele matches before detailed scoring. This replaces the full O(n) scan per query with detailed scoring of only the candidates that pass the filter.
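A minimal sketch of this kind of inverted index, assuming each profile is a dict from locus name to a set of alleles. The names `build_index` and `candidates` are hypothetical, and counting matches per locus (rather than per allele) is an assumption about how the threshold is applied.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Profile = Dict[str, Set[float]]  # locus name -> alleles observed at that locus


def build_index(db: Dict[str, Profile]) -> Dict[Tuple[str, float], Set[str]]:
    """Map every (locus, allele) pair to the ids of people carrying it."""
    index: Dict[Tuple[str, float], Set[str]] = defaultdict(set)
    for person_id, profile in db.items():
        for locus, alleles in profile.items():
            for allele in alleles:
                index[(locus, allele)].add(person_id)
    return index


def candidates(query: Profile, index, min_loci: int = 8) -> Set[str]:
    """Return ids that share at least one allele with the query at >= min_loci loci."""
    loci_hit: Dict[str, Set[str]] = defaultdict(set)  # person_id -> matching loci
    for locus, alleles in query.items():
        for allele in alleles:
            # .get avoids creating empty entries for unseen (locus, allele) keys.
            for person_id in index.get((locus, allele), ()):
                loci_hit[person_id].add(locus)
    return {pid for pid, loci in loci_hit.items() if len(loci) >= min_loci}


# Tiny two-locus example, so the threshold is lowered to 2 here.
db = {
    "P1": {"D3S1358": {13.0, 14.0}, "TH01": {6.0, 9.3}},
    "P2": {"D3S1358": {15.0, 16.0}, "TH01": {6.0, 7.0}},
}
index = build_index(db)
query = {"D3S1358": {13.0, 17.0}, "TH01": {6.0, 8.0}}
print(candidates(query, index, min_loci=2))
```

Only the profiles returned by `candidates` would then go through the more expensive likelihood-ratio scoring.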

Database Caching

Using function attributes (match_single._cache) to cache the parsed database, allele index, and frequency table across calls. The index is only rebuilt if the database changes.
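A minimal sketch of the function-attribute caching pattern. The `match_single(query, database)` signature, the cache layout, and the change-detection rule (object identity plus length) are all assumptions made for illustration; the real function does its parsing and indexing where the placeholder is.

```python
def match_single(query, database):
    """Stub showing the caching pattern only; returns the cached profile count."""
    cache = getattr(match_single, "_cache", None)
    # Assumed rule for "the database changed": a different object or a new size.
    key = (id(database), len(database))
    if cache is None or cache["key"] != key:
        cache = {
            "key": key,
            # Placeholder for the expensive parsing / index building step.
            "parsed": dict(database),
            # Counts rebuilds, purely so the behaviour can be observed.
            "builds": (cache["builds"] + 1) if cache else 1,
        }
        match_single._cache = cache
    return len(cache["parsed"])


db = {"P1": "13,14", "P2": "15,16"}
match_single({}, db)
match_single({}, db)  # second call reuses the cached data
print(match_single._cache["builds"])
```

One caveat of this particular rule: an in-place edit that keeps the same object and the same length would not trigger a rebuild, so a content hash is safer if the database can be mutated between calls.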

Allele Parsing

Improved robustness with try/except handling for malformed values, proper null/NaN detection, and safe float conversion.
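A minimal sketch of defensive allele parsing along these lines. It assumes each locus cell is a comma-separated string such as "13,9.3"; the helper names `parse_allele` and `parse_locus` and the list of null markers are hypothetical.

```python
import math
from typing import Optional, Set


def parse_allele(token) -> Optional[float]:
    """Convert one allele token to float, or None if it is missing or malformed."""
    if token is None:
        return None
    text = str(token).strip()
    if not text or text.lower() in {"nan", "null", "none", "-"}:
        return None
    try:
        value = float(text)
    except ValueError:
        return None  # non-numeric token: skip it instead of crashing
    return value if math.isfinite(value) else None


def parse_locus(cell, sep: str = ",") -> Set[float]:
    """Parse a locus cell like "13,9.3" into a set of alleles."""
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return set()
    parsed = (parse_allele(tok) for tok in str(cell).split(sep))
    return {allele for allele in parsed if allele is not None}


print(parse_locus("13,9.3"))      # microvariant parses as an ordinary float
print(parse_locus("12,OL"))       # malformed token is dropped
print(parse_locus(float("nan")))  # missing cell yields an empty set
```

Returning an empty set for a missing locus lets the caller skip that locus rather than count it as an exclusion.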

Exclusion Handling

Lowered the exclusion LR from 0.01 to 0.001, which is a stricter penalty
Maximum 3 exclusions allowed (down from 4)
Minimum 8 consistent loci required
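A minimal sketch of how thresholds like these could be applied. The helper names are hypothetical, and tallying ±1-step differences separately from exclusions is an assumption about how mutations are treated.

```python
import math
from typing import Dict, Set

Profile = Dict[str, Set[float]]  # locus name -> alleles observed at that locus


def classify_locus(q: Set[float], c: Set[float]) -> str:
    """Label one locus as missing, consistent, mutation, or exclusion."""
    if not q or not c:
        return "missing"
    if q & c:
        return "consistent"  # at least one allele in common
    # No shared allele: tolerate a single repeat-step difference as a mutation.
    if any(math.isclose(abs(a - b), 1.0) for a in q for b in c):
        return "mutation"
    return "exclusion"


def passes_thresholds(query: Profile, cand: Profile,
                      max_exclusions: int = 3, min_consistent: int = 8) -> bool:
    """Apply the exclusion and consistency limits over the loci both profiles have."""
    counts = {"missing": 0, "consistent": 0, "mutation": 0, "exclusion": 0}
    for locus in query.keys() & cand.keys():
        counts[classify_locus(query[locus], cand[locus])] += 1
    return (counts["exclusion"] <= max_exclusions
            and counts["consistent"] >= min_consistent)


# Ten loci: eight share allele 13, one is a ±1 step apart, one is an exclusion.
query = {f"L{i}": {13.0, 14.0} for i in range(10)}
cand = {f"L{i}": {13.0, 15.0} for i in range(8)}
cand["L8"] = {16.0, 17.0}  # exclusion
cand["L9"] = {15.0, 16.0}  # 14 vs 15 is one step apart
print(passes_thresholds(query, cand))
```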

Same-person Filtering

Maintained >85% identity threshold to filter out near-identical profiles (twins/duplicates).
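A minimal sketch of an identity filter of this kind, assuming "identity" means the fraction of commonly typed loci whose genotypes are exactly equal. The function names are hypothetical.

```python
from typing import Dict, Set

Profile = Dict[str, Set[float]]  # locus name -> alleles observed at that locus


def identity_fraction(a: Profile, b: Profile) -> float:
    """Fraction of loci typed in both profiles where the genotypes are identical."""
    shared_loci = [locus for locus in a.keys() & b.keys() if a[locus] and b[locus]]
    if not shared_loci:
        return 0.0
    same = sum(1 for locus in shared_loci if a[locus] == b[locus])
    return same / len(shared_loci)


def looks_like_same_person(a: Profile, b: Profile, threshold: float = 0.85) -> bool:
    """Flag near-identical profiles (duplicates or twins) so they can be skipped."""
    return identity_fraction(a, b) > threshold


# Ten loci, nine of them identical.
person = {f"L{i}": {13.0, 14.0} for i in range(10)}
near_copy = dict(person)
near_copy["L9"] = {13.0, 15.0}
print(looks_like_same_person(person, near_copy))
```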

Regarding Mendelian Inheritance (Subset vs Intersection)

I respectfully want to clarify the biology here. For single-parent testing (which this challenge specifies), the shared-allele check is correct:

Parent {13, 14} + Child {13, 15} → Valid relationship
- Child inherited 13 from this parent
- Child inherited 15 from the other (unknown) parent

The subset check (q ⊆ c OR c ⊆ q) would only pass when:

One profile has allele dropout (single allele)
Profiles are identical

This would reject ~90% of true heterozygous parent-child pairs. I tested subset checking and accuracy dropped to 31%.

If your dataset uses a different inheritance model, please let me know and I'll adjust accordingly.
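A minimal sketch of the two checks being contrasted, using Python sets for the genotype at one locus. The function names are hypothetical.

```python
def shares_allele(a: set, b: set) -> bool:
    """Shared-allele check: at least one allele in common."""
    return bool(a & b)


def subset_either_way(a: set, b: set) -> bool:
    """Subset check: one genotype fully contained in the other."""
    return a <= b or b <= a


parent = {13, 14}
child = {13, 15}
print(shares_allele(parent, child))      # True: allele 13 is shared
print(subset_either_way(parent, child))  # False: neither set contains the other

# The subset check does pass when one profile shows a single allele.
print(subset_either_way({13}, child))    # True
```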

Results

Accuracy: 91-100% (~95% average across runs)
Speed: ~1.2 seconds (well under timeout)
Score: 111-120/120

Happy to make further adjustments based on your feedback!

True parent-child should have 0 exclusions (rarely 1 due to mutation/dropout) 🤖 Generated with Claude Code

BenyNotNice · 2025-12-18T18:31:17Z

quick fix: Dropped the exclusion allowance to 1.

tavallaie · 2025-12-19T10:00:48Z

Before I start the code review: did you test it on the 500k dataset?
The GitHub Action only runs on 5k.

BenyNotNice · 2025-12-19T11:53:15Z

Accuracy is circa 74% for 500k. Working on it. Will commit in 4-5 hours.

BenyNotNice · 2025-12-19T17:50:54Z

=== RESULTS ===
Execution time : 374.19 seconds
Correct matches: 25/35
Accuracy : 71.4%
Speed bonus : +10
Final score : 81.4/120

This is what the current code produces for 500k. Could not make improvements today. Is there need for further improvement or is this sufficient?

Benyamin Jazayeri and others added 2 commits December 18, 2025 12:51

tavallaie requested changes Dec 18, 2025

Reduce max exclusions to 1 per organizer feedback

1fcca50

tavallaie pushed a commit that referenced this pull request Dec 26, 2025

Merge pull request #2 from jd7943426-max/revert-1-codex/analyze-task-…

f64a6fc

…and-implement-solution-as-per-readme Revert "Implement STR parent-child matcher"

tavallaie merged commit b8fd8e9 into pyday-iran:main Dec 26, 2025
1 check failed
