@bahman-farhadian

  • Implemented robust parent-child kinship matching using likelihood ratios
  • Added allele frequencies for all 21 loci
  • Per-locus classification: identical, partial, mutation, mismatch, missing
  • CLR based on allele frequencies with penalties for mutations/mismatches
  • Filter: require ≥3 partial matches to confirm parent-child relationship
  • Prune unrelated candidates (>2 mismatches)
  • Score boosted by number of partial matches
  • Added detailed explanation in Hossein_Hamzehei_Bahman_Farhadian_Explanation.md
  • Included test results and data batch
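The per-locus classification listed above could be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the genotype string format (`"12,14"`), the helper names `parse_genotype` and `classify_locus`, and the rule that a "mutation" means a single repeat-step difference are all assumptions made for the example.

```python
from typing import Optional, Tuple

Genotype = Optional[Tuple[float, float]]


def parse_genotype(value: Optional[str]) -> Genotype:
    """Parse a genotype string such as "12,14" (assumed format) into a sorted pair.

    Returns None when the value is missing or empty.
    """
    if value is None:
        return None
    value = value.strip()
    if not value or value == "-":
        return None
    alleles = [float(part) for part in value.split(",")]
    if len(alleles) == 1:
        alleles = alleles * 2  # assume a single allele means homozygous
    return tuple(sorted(alleles[:2]))


def classify_locus(query: Genotype, candidate: Genotype) -> str:
    """Label one locus as identical, partial, mutation, mismatch, or missing."""
    if query is None or candidate is None:
        return "missing"
    if query == candidate:
        return "identical"
    if set(query) & set(candidate):
        return "partial"  # genotypes differ but share at least one allele
    # Assumption: one repeat-step difference is treated as a possible mutation.
    if any(abs(abs(a - b) - 1.0) < 1e-9 for a in query for b in candidate):
        return "mutation"
    return "mismatch"
```

Counting the labels across all loci then gives the quantities the filters refer to, such as the number of partial matches and the number of mismatches.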

Hossein Hamzehei added 2 commits December 19, 2025 03:32
- Implemented robust parent-child kinship matching using likelihood ratios
- Added allele frequencies for all 21 loci
- Per-locus classification: identical, partial, mutation, mismatch, missing
- CLR based on allele frequencies with penalties for mutations/mismatches
- Filter: require ≥3 partial matches to confirm parent-child relationship
- Prune unrelated candidates (>2 mismatches)
- Score boosted by number of partial matches
- Added detailed explanation in Hossein_Hamzehei_Bahman_Farhadian_Explanation.md
- Included test results and data batch
- Implement caching and an inverted index to pre-filter candidates, reducing runtime from ~60s to ~5s.
- Calculate allele frequencies dynamically from the database instead of using hardcoded values for improved robustness.
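As a rough sketch of the last two commit items, an inverted index from `(locus, allele)` to profile IDs can be built in the same pass that counts allele frequencies. The data layout (a dict of profile ID to a dict of locus to allele tuple), the function names, and the `min_shared_loci` parameter are assumptions for illustration and are not taken from the PR.

```python
from collections import Counter, defaultdict


def build_allele_index(profiles):
    """Build an inverted index and per-locus allele frequencies in one pass.

    profiles: dict of profile_id -> dict of locus -> (allele, allele) or None.
    """
    index = defaultdict(set)        # (locus, allele) -> set of profile ids
    counts = defaultdict(Counter)   # locus -> Counter of allele occurrences
    for pid, loci in profiles.items():
        for locus, genotype in loci.items():
            if genotype is None:
                continue
            for allele in genotype:
                counts[locus][allele] += 1
            for allele in set(genotype):
                index[(locus, allele)].add(pid)
    freqs = {
        locus: {a: n / sum(c.values()) for a, n in c.items()}
        for locus, c in counts.items()
    }
    return index, freqs


def candidate_ids(query, index, min_shared_loci):
    """Return ids sharing at least one allele with the query at enough loci."""
    shared = Counter()
    for locus, genotype in query.items():
        if genotype is None:
            continue
        hits = set()
        for allele in set(genotype):
            hits |= index.get((locus, allele), set())
        for pid in hits:
            shared[pid] += 1
    return {pid for pid, n in shared.items() if n >= min_shared_loci}
```

Only the profiles returned by `candidate_ids` then need the full likelihood-ratio scoring, which is where a pre-filter of this kind would save time on a large database.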
@bahman-farhadian
Author

Hi Ali,

Thanks for reviewing my PR! The CI result was:

Correct matches: 27/35
Accuracy       : 77.1%
Execution time : 3.86 seconds
Final score    : 97.1/120

However, my local testing shows better results:

| Metric | Highest | Average | Lowest |
|---|---|---|---|
| Accuracy | 100% (35/35) | 88.5% (31/35) | 71.4% (25/35) |
| Execution time | 8.55 seconds | ~9.22 seconds | 10.77 seconds |
| Final score | 120/120 | 108.5/120 | 91.4/120 |

Is this variance due to random dataset generation, or could my approach be improved? If there's room for improvement, I'd appreciate any suggestions on how to make the function more robust and reliable with less variance.

Also, would a CI re-run be possible?

Thank you.

@tavallaie
Contributor

Did you run it on the 500k dataset, as the problem requires?
CI only runs on 5k.

@bahman-farhadian
Author

Yes, I ran it using the make all command, which generates a data directory containing the str_database.csv file. Each time, this file includes 500,000 rows. Did I make a mistake here during execution?

@tavallaie
Contributor

You made several unrelated changes that introduced errors:

First, you should not have modified the function that calls the match function.
Second, according to the challenge requirements, you should not perform simple pairwise matching. Instead, you must build an index and perform matching using that index.
Third, you should not have changed the README or uploaded any ZIP files. Your pull request should include only the updated match function.

@bahman-farhadian
Author

Hi Ali,
Sorry about that. I thought including my AI interaction logs and a short report would be helpful for transparency, but I understand now that was out of scope.
I've cleaned up my commit and addressed all points:

✓ Removed ZIP files
✓ Only participant_solution.py is changed
✓ No modifications to find_matches()
✓ Uses index-based matching with allele_index for fast candidate lookup

Thanks for the feedback!

@bahman-farhadian
Author

I also noticed my posterior probability calculation wasn't following the Bayesian formula with a 50% prior — I've fixed that now.
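For reference, the standard Bayesian conversion from a combined likelihood ratio to a posterior probability can be sketched as below. The function name `posterior_probability` is hypothetical, and this is only the textbook formula, not necessarily the exact code in the PR. With a 50% prior the prior odds are 1, so the result reduces to CLR / (CLR + 1).

```python
import math


def posterior_probability(clr: float, prior: float = 0.5) -> float:
    """Convert a combined likelihood ratio into a posterior probability.

    posterior odds = CLR * prior odds, then odds are mapped back to a probability.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if clr < 0.0:
        raise ValueError("likelihood ratio cannot be negative")
    if math.isinf(clr):
        return 1.0
    prior_odds = prior / (1.0 - prior)
    posterior_odds = clr * prior_odds
    return posterior_odds / (1.0 + posterior_odds)
```

For example, a CLR of 99 with the default prior gives a posterior of 0.99, and a CLR of 1 leaves the probability at the prior of 0.5.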

I tested the function 100 times against the 50k dataset:

| Max Accuracy | Average Accuracy | Min Accuracy | Variance |
|---|---|---|---|
| 100.0% | 89.5% | 77.1% | 23.93 |

The average accuracy looks promising, but I'm still seeing quite a bit of variance across runs (min 77.1%, max 100%). I'm trying to make the matching behavior more consistent and reliable.
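A summary like the one above can be produced from a list of per-run accuracies with the standard library. The sketch below is illustrative only: the function name `summarize_runs` is hypothetical, and it uses the sample variance, whereas the comment does not say whether the reported figure is a sample or a population variance.

```python
import statistics
from typing import Dict, Sequence


def summarize_runs(accuracies: Sequence[float]) -> Dict[str, float]:
    """Summarize per-run accuracy percentages (needs at least two runs)."""
    return {
        "max": max(accuracies),
        "mean": statistics.mean(accuracies),
        "min": min(accuracies),
        # Sample variance (n - 1 denominator); use pvariance for population.
        "variance": statistics.variance(accuracies),
    }
```

Passing the same random seed to the dataset generator for every run, if the generator supports it, would help separate variance caused by the data from variance caused by the matching logic.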

I know you're busy, but if you have any suggestions or ideas on how to stabilize the performance and reduce the variance, I would really appreciate your guidance.

Thanks a lot for your time and help!

@bahman-farhadian
Author

bahman-farhadian commented Dec 20, 2025

Just a quick update, Ali: I've updated the matching algorithm. I added a filter to reject self-matches and identical twins using a ratio-based approach: if a candidate has >90% identical loci AND fewer than 3 partial matches, it's rejected as likely same-person/twin rather than parent-child.

The filter logic:

if identical_ratio > 0.9 and partial_count < 3:
    return None  # Reject self-match/twin

This relies on the fact that true parent-child pairs show many partial matches (the genotype strings differ but share an allele), while self-matches are identical at almost every locus.
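Wrapped as a standalone function, the filter might look like the sketch below. The function name `is_self_or_twin`, the list-of-labels input, and the choice to compute the ratio over non-missing loci only are assumptions for illustration. The thresholds (0.9 and 3) are the ones quoted in the comment.

```python
from typing import Sequence


def is_self_or_twin(
    labels: Sequence[str],
    identical_threshold: float = 0.9,
    min_partial: int = 3,
) -> bool:
    """Return True when a candidate looks like the same person or a twin.

    labels: one per-locus label per locus, e.g. "identical", "partial",
    "mutation", "mismatch", or "missing".
    """
    # Assumption: loci with missing data are excluded from the ratio.
    compared = [label for label in labels if label != "missing"]
    if not compared:
        return False
    identical_ratio = sum(label == "identical" for label in compared) / len(compared)
    partial_count = sum(label == "partial" for label in compared)
    return identical_ratio > identical_threshold and partial_count < min_partial
```

A caller would reject the candidate (for example by returning `None`, as in the snippet above) whenever this function returns `True`.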

@tavallaie tavallaie merged commit 6521fb7 into pyday-iran:main Dec 26, 2025