Submit final solution for Code Challenge 2025 #7
bahman-farhadian commented on Dec 19, 2025
- Implemented robust parent-child kinship matching using likelihood ratios
- Added allele frequencies for all 21 loci
- Per-locus classification: identical, partial, mutation, mismatch, missing
- CLR based on allele frequencies with penalties for mutations/mismatches
- Filter: require ≥3 partial matches to confirm parent-child relationship
- Prune unrelated candidates (>2 mismatches)
- Score boosted by number of partial matches
- Added detailed explanation in Hossein_Hamzehei_Bahman_Farhadian_Explanation.md
- Included test results and data batch
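
The per-locus classification and the two filters listed above could look roughly like the sketch below. It is not the code from this PR: the genotype format (a string such as `"12,14"`), the function names `parse_locus`, `classify_locus` and `passes_filters`, and the rule that a one-repeat difference counts as a "mutation" are all assumptions made for illustration.

```python
import math
from typing import FrozenSet, Optional, Sequence


def parse_locus(value: Optional[str]) -> Optional[FrozenSet[float]]:
    """Parse a genotype string such as "12,14" into a set of alleles.

    Assumed format: comma-separated allele numbers; None, "" or "-" mean missing.
    """
    if value is None or value.strip() in ("", "-"):
        return None
    return frozenset(float(a) for a in value.split(","))


def classify_locus(query: Optional[str], candidate: Optional[str]) -> str:
    """Label one locus as identical, partial, mutation, mismatch or missing."""
    q, c = parse_locus(query), parse_locus(candidate)
    if q is None or c is None:
        return "missing"
    if q == c:
        return "identical"   # same alleles on both sides
    if q & c:
        return "partial"     # different genotype, at least one shared allele
    # Assumed rule: no shared allele, but two alleles differ by one repeat unit.
    if any(math.isclose(abs(a - b), 1.0) for a in q for b in c):
        return "mutation"
    return "mismatch"


def passes_filters(labels: Sequence[str]) -> bool:
    """Apply the two thresholds from the PR description:
    at least 3 partial matches and no more than 2 mismatches."""
    labels = list(labels)
    return labels.count("partial") >= 3 and labels.count("mismatch") <= 2
```

A candidate would be classified locus by locus, and the resulting labels passed to `passes_filters` before any likelihood ratio is computed.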
- Implement caching and an inverted index to pre-filter candidates, reducing runtime from ~60s to ~5s.
- Calculate allele frequencies dynamically from the database instead of using hardcoded values for improved robustness.
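
As a rough illustration of the inverted-index pre-filter and the dynamic allele frequencies, here is a minimal sketch. The data layout (a dict mapping profile id to `{locus: (allele, allele)}`), the function names, and the `min_shared_loci` threshold are hypothetical and are not taken from the PR.

```python
from collections import Counter, defaultdict


def build_index(database):
    """Inverted index: (locus, allele) -> set of profile ids carrying that allele."""
    index = defaultdict(set)
    for pid, profile in database.items():
        for locus, alleles in profile.items():
            for allele in alleles:
                index[(locus, allele)].add(pid)
    return index


def allele_frequencies(database):
    """Estimate per-locus allele frequencies from the database itself."""
    counts = defaultdict(Counter)
    for profile in database.values():
        for locus, alleles in profile.items():
            counts[locus].update(alleles)  # homozygous (7, 7) counts allele 7 twice
    return {
        locus: {allele: n / sum(c.values()) for allele, n in c.items()}
        for locus, c in counts.items()
    }


def prefilter(query, index, min_shared_loci):
    """Return ids of profiles sharing an allele with the query at enough loci."""
    shared = Counter()
    for locus, alleles in query.items():
        hits = set()
        for allele in alleles:
            hits |= index.get((locus, allele), set())
        for pid in hits:
            shared[pid] += 1  # one vote per locus, however many alleles match
    return {pid for pid, n in shared.items() if n >= min_shared_loci}
```

Only the profiles returned by `prefilter` would then go through the more expensive likelihood-ratio scoring, which is where a speed-up of this kind would come from.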
|
Hi Ali, Thanks for reviewing my PR! The CI result was: However, my local testing shows better results:
Is this variance due to random dataset generation, or could my approach be improved? If there's room for improvement, I'd appreciate any suggestions on how to make the function more robust and reliable with less variance. Also, would a CI re-run be possible? Thank you. |
|
Did you run it on the 500k dataset, as the problem requires? |
|
Yes, I ran it using the |
|
You made several unrelated changes that introduced errors: First, you should not have modified the function that calls the match function. |
|
Hi Ali,

✓ Removed ZIP files

Thanks for the feedback! |
|
I also noticed my posterior probability calculation wasn't following the Bayesian formula with a 50% prior — I've fixed that now. I tested the function 100 times against the 50k dataset:
The average accuracy looks promising, but I'm still seeing quite a bit of variance across runs (min 77.1%, max 100%). I'm trying to make the matching behavior more consistent and reliable. I know you're busy, but if you have any suggestions or ideas on how to stabilize the performance and reduce the variance, I would really appreciate your guidance. Thanks a lot for your time and help! |
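
For reference, the Bayesian conversion mentioned above, from a combined likelihood ratio to a posterior probability with a 50% prior, can be sketched as follows. The function names are hypothetical, and combining per-locus ratios by multiplication assumes the loci are independent; the PR's own scoring also applies penalties that are not reproduced here.

```python
import math
from typing import Iterable


def combined_likelihood_ratio(per_locus_lrs: Iterable[float]) -> float:
    """Multiply per-locus likelihood ratios (assumes independent loci)."""
    return math.prod(per_locus_lrs)


def posterior_probability(likelihood_ratio: float, prior: float = 0.5) -> float:
    """Bayes' rule in odds form: posterior odds = LR * prior odds."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if likelihood_ratio < 0:
        raise ValueError("likelihood ratio must be non-negative")
    if math.isinf(likelihood_ratio):
        return 1.0
    prior_odds = prior / (1.0 - prior)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)
```

With a 50% prior the prior odds are 1, so the posterior reduces to `LR / (LR + 1)`.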
|
Just a quick update, Ali: I've updated the matching algorithm. I added a filter to reject self-matches and identical twins using a ratio-based approach: if a candidate has >90% identical loci AND fewer than 3 partial matches, it's rejected as likely same-person/twin rather than parent-child. The filter logic:

```python
if identical_ratio > 0.9 and partial_count < 3:
    return None  # Reject self-match/twin
```

This is based on the fact that true parent-child pairs have many partial matches (different strings but shared allele), while self-matches have almost all identical matches.