Implement STR parent-child relationship detector #2
- Inverted index for fast candidate filtering by shared alleles
- Combined Likelihood Ratio (CLR) calculation with population frequencies
- Mutation support (±1 step) and allele dropout handling
- Same-person/twin detection to filter identical profiles
- Achieves ~95-100% accuracy on test dataset

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
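For readers unfamiliar with the CLR mentioned in this commit, here is a minimal sketch of a single-parent (duo) likelihood ratio, multiplied across loci. It is not the code from `participant_solution.py`: the function names, the profile layout (locus name mapped to an allele pair) and the frequency table are assumptions for illustration, and mutation and dropout handling are left out.

```python
def locus_lr(child, parent, freqs):
    """Likelihood ratio for one locus: P(child | parent is a true parent) / P(child | unrelated).

    child, parent: allele pairs such as (15, 16). freqs: allele -> population frequency.
    Illustrative only; no mutation or dropout model.
    """
    a, b = child

    def transmit(allele):
        # Each of the parent's two allele copies is passed on with probability 1/2.
        return sum(0.5 for x in parent if x == allele)

    if a == b:
        # Homozygous child: the parent transmits a, the other copy comes from the population.
        p_given_parent = transmit(a) * freqs[a]
        p_random = freqs[a] ** 2
    else:
        # Heterozygous child: either allele may be the one inherited from this parent.
        p_given_parent = transmit(a) * freqs[b] + transmit(b) * freqs[a]
        p_random = 2 * freqs[a] * freqs[b]
    return p_given_parent / p_random


def combined_lr(child_profile, parent_profile, freq_table):
    """Product of per-locus LRs over loci typed in both profiles."""
    clr = 1.0
    for locus, child in child_profile.items():
        if locus in parent_profile:
            clr *= locus_lr(child, parent_profile[locus], freq_table[locus])
    return clr
```

With no shared allele at a locus this sketch returns 0 for that locus, which zeroes the product; a real matcher that tolerates ±1-step mutations would need a small non-zero term there instead.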
- Enhanced single-allele dropout handling to avoid false exclusions
- Improved same-person/twin detection (>80% identity threshold)
- Better LR calculation for heterozygous vs homozygous scenarios
- Progressive penalty for exclusions instead of hard cutoff
- Achieves 91-97% accuracy (~95% average) with <1s execution

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
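The identity threshold in this commit can be pictured as the fraction of loci with identical genotypes. The sketch below is an assumption about how such a check might look, not the submitted code; `identity_fraction`, `looks_like_same_person` and the profile layout are hypothetical names.

```python
def identity_fraction(profile_a, profile_b):
    """Fraction of loci typed in both profiles whose genotypes are identical."""
    shared = [locus for locus in profile_a if locus in profile_b]
    if not shared:
        return 0.0
    # Allele order within a genotype does not matter, so compare sorted pairs.
    same = sum(1 for locus in shared
               if sorted(profile_a[locus]) == sorted(profile_b[locus]))
    return same / len(shared)


def looks_like_same_person(profile_a, profile_b, threshold=0.8):
    # The threshold mirrors the ">80%" figure in the commit message.
    return identity_fraction(profile_a, profile_b) > threshold
```

A true parent and child share one allele per locus but rarely match on both alleles at most loci, which is why a high identity fraction points to a duplicate or twin rather than a parent.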
tavallaie
left a comment
Your code rewrites the tests and changes the problem. That's cheating!
|
Could you please elaborate so we can fix the issue? The only changed file is participant_solution.py.
|
You should modify only a function, not the entire codebase. |
Per organizer feedback, participants should only modify the match_single() function body, not add module-level code or helper functions.

Changes:
- Removed all module-level variables (ALLELE_FREQS, _db_cache, etc.)
- Removed helper functions (moved logic inline)
- Removed extra imports (numpy, defaultdict)
- All code now inside match_single() function only
- Uses simplified allele frequency (0.15 average) instead of exact values
- Still achieves 100% accuracy with 6.6s execution time

Score: 120/120 (100% accuracy + 20 speed bonus)

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
The solution is fixed per your feedback. The only modified function now is the match_single function. |
|
Thank you for the improved submission! This version is significantly better than basic templates, but several critical issues still prevent good accuracy and scalability. Key problems:
|
Based on organizer feedback:
- Added inverted index for O(1) candidate lookup
- Pre-filter candidates by shared allele count (>= 8 loci)
- Cache database processing using function attributes
- Simplified LR calculation for robustness
- Maintains ~95% accuracy (32-35/35) with faster execution (~1.2s)

Score: 111-120/120

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
|
Thank you for the detailed feedback! I've made significant improvements to address your concerns:

Changes Made
Added an inverted index that maps (locus, allele) → set of person_ids. Now candidates are pre-filtered by requiring >= 8 shared allele matches before detailed scoring. This reduces comparisons from O(n) to evaluating only promising candidates.
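As a rough illustration of the (locus, allele) → set of person_ids index described above, the following sketch builds the index and pre-filters candidates. The names, the profile layout and the choice to count loci with at least one shared allele are assumptions, and helper functions are used here only for readability (the actual submission had to keep everything inside `match_single()`).

```python
from collections import Counter, defaultdict


def build_allele_index(database):
    """Map (locus, allele) -> set of person ids carrying that allele."""
    index = defaultdict(set)
    for person_id, profile in database.items():
        for locus, alleles in profile.items():
            for allele in set(alleles):
                index[(locus, allele)].add(person_id)
    return index


def prefilter_candidates(query, index, min_shared_loci=8):
    """Return ids sharing at least one allele with the query at >= min_shared_loci loci."""
    loci_hits = Counter()
    for locus, alleles in query.items():
        people = set()
        for allele in set(alleles):
            people |= index.get((locus, allele), set())
        # Count each person at most once per locus.
        for person_id in people:
            loci_hits[person_id] += 1
    return {pid for pid, n in loci_hits.items() if n >= min_shared_loci}
```

Each lookup in the index is a dictionary access, so the cost of a query depends on how many people carry the query's alleles rather than on the full database size.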
Using function attributes (match_single._cache) to cache the parsed database, allele index, and frequency table across calls. The index is only rebuilt if the database changes.
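A minimal sketch of the function-attribute caching idea follows. The signature of `match_single()`, the cache key and the `builds` counter are hypothetical and exist only to make the pattern visible; the real function's arguments and change-detection logic are not shown in this thread.

```python
def match_single(query_id, database):
    # Hypothetical signature: the challenge's real match_single() may differ.
    # Assumed change detection: same dict object and same size means "unchanged".
    key = (id(database), len(database))
    cache = getattr(match_single, "_cache", None)
    if cache is None or cache["key"] != key:
        builds = 0 if cache is None else cache["builds"]
        cache = {
            "key": key,
            # Stand-in for the expensive parsing/indexing step.
            "profiles": {pid: dict(profile) for pid, profile in database.items()},
            "builds": builds + 1,  # counts rebuilds, for demonstration only
        }
        match_single._cache = cache
    # Placeholder for the actual matching logic.
    return cache["profiles"].get(query_id)
```

Storing state on the function object keeps everything inside `match_single()` without module-level variables. Note that keying on `id()` is only safe while the original database object stays alive, and that in-place edits that keep the size unchanged would go unnoticed.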
Improved robustness with try/except handling for malformed values, proper null/NaN detection, and safe float conversion.
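The kind of defensive parsing described here might look like the sketch below. `parse_allele` is a hypothetical name, and treating every unparseable value as missing is an assumption about the intended behavior.

```python
import math


def parse_allele(value):
    """Convert a raw allele value to float, or return None if missing or malformed."""
    if value is None:
        return None
    try:
        number = float(value)
    except (TypeError, ValueError):
        # Covers empty strings, stray text and non-numeric objects.
        return None
    # float("nan") parses successfully, so NaN needs an explicit check.
    return None if math.isnan(number) else number
```

For example, `parse_allele("9.3")` yields `9.3`, while `parse_allele("")` and `parse_allele(float("nan"))` both yield `None`.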
Maintained >85% identity threshold to filter out near-identical profiles (twins/duplicates).

Regarding Mendelian Inheritance (Subset vs Intersection)

I respectfully want to clarify the biology here. For single-parent testing (which this challenge specifies), the shared-allele check is correct:
The subset check (q ⊆ c OR c ⊆ q) would only pass when:
This would reject ~90% of true heterozygous parent-child pairs. I tested subset checking and accuracy dropped to 31%. If your dataset uses a different inheritance model, please let me know and I'll adjust accordingly.

Results
Happy to make further adjustments based on your feedback! |
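The difference between the two checks debated in the comment above can be seen on a single locus. This is an illustrative sketch with hypothetical function names, not code from the submission.

```python
def shares_allele(q, c):
    """Intersection check: the two genotypes have at least one allele in common."""
    return bool(set(q) & set(c))


def subset_either_way(q, c):
    """Subset check: one genotype's allele set is contained in the other's."""
    q_set, c_set = set(q), set(c)
    return q_set <= c_set or c_set <= q_set


# A child (12, 14) who inherited allele 12 from a parent typed (12, 16):
child, parent = (12, 14), (12, 16)
print(shares_allele(child, parent))      # True: allele 12 is shared
print(subset_either_way(child, parent))  # False: neither set contains the other
```

With only one parent tested, the child's second allele comes from the untested parent, so it generally does not appear in the tested parent's genotype. The subset check passes only when the two genotypes are identical or one of them is homozygous for a shared allele.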
True parent-child should have 0 exclusions (rarely 1 due to mutation/dropout) 🤖 Generated with Claude Code
|
quick fix: Dropped the exclusion allowance to 1. |
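One way to read the exclusion allowance is as a cap on the number of loci where child and candidate share no allele. The sketch below assumes that interpretation; the names and the profile layout are hypothetical.

```python
def count_exclusions(child, candidate):
    """Number of loci typed in both profiles at which no allele is shared."""
    exclusions = 0
    for locus, child_alleles in child.items():
        candidate_alleles = candidate.get(locus)
        if candidate_alleles is None:
            continue  # locus not typed in the candidate: no evidence either way
        if not set(child_alleles) & set(candidate_alleles):
            exclusions += 1
    return exclusions


def within_allowance(child, candidate, max_exclusions=1):
    # max_exclusions=1 mirrors the allowance mentioned in the comment above.
    return count_exclusions(child, candidate) <= max_exclusions
```

An allowance of 1 leaves room for a single mutation or dropout event while still rejecting candidates that mismatch at several loci.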
|
Before starting the code review, |
|
Accuracy is circa 74% for 500k. Working on it. Will commit in 4-5 hours. |
|
=== RESULTS === This is what the current code produces for 500k. I could not make improvements today. Is further improvement needed, or is this sufficient?
…and-implement-solution-as-per-readme Revert "Implement STR parent-child matcher"
🤖 Generated with Claude Code