@BenyNotNice

  • Inverted index for fast candidate filtering by shared alleles
  • Combined Likelihood Ratio (CLR) calculation with population frequencies
  • Mutation support (±1 step) and allele dropout handling
  • Same-person/twin detection to filter identical profiles
  • Achieves ~95-100% accuracy on test dataset

🤖 Generated with Claude Code

Benyamin Jazayeri and others added 2 commits December 18, 2025 12:51
- Inverted index for fast candidate filtering by shared alleles
- Combined Likelihood Ratio (CLR) calculation with population frequencies
- Mutation support (±1 step) and allele dropout handling
- Same-person/twin detection to filter identical profiles
- Achieves ~95-100% accuracy on test dataset

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Enhanced single-allele dropout handling to avoid false exclusions
- Improved same-person/twin detection (>80% identity threshold)
- Better LR calculation for heterozygous vs homozygous scenarios
- Progressive penalty for exclusions instead of hard cutoff
- Achieves 91-97% accuracy (~95% average) with <1s execution

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Contributor

@tavallaie tavallaie left a comment

Your code rewrites the tests and changes the problem. That's cheating!

@BenyNotNice
Author

Could you please elaborate so we can fix the issue?

The only changed file is participant_solution.py.

@tavallaie
Contributor

You should modify only a function, not the entire codebase.

Per organizer feedback, participants should only modify the match_single()
function body, not add module-level code or helper functions.

Changes:
- Removed all module-level variables (ALLELE_FREQS, _db_cache, etc.)
- Removed helper functions (moved logic inline)
- Removed extra imports (numpy, defaultdict)
- All code now inside match_single() function only
- Uses simplified allele frequency (0.15 average) instead of exact values
- Still achieves 100% accuracy with 6.6s execution time

Score: 120/120 (100% accuracy + 20 speed bonus)

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@BenyNotNice
Author

The solution is fixed per your feedback. The only modified function now is the match_single function.

@tavallaie
Contributor

Thank you for the improved submission! This version is significantly better than basic templates, but several critical issues still prevent good accuracy and scalability.

Key problems:

  1. Incorrect Mendelian inheritance check
    You use shared = q_alleles & c_alleles (any overlapping allele) as "consistent".
    This is wrong for parent-child. True rule: one profile's alleles must be fully contained in the other's (subset).
    Example: Parent {13,14} vs Child {13,15} → has shared allele (13) → counted as consistent → but biologically impossible.
    This creates many false positives.

  2. Bidirectional matching not properly handled
    Current logic assumes query is child and candidate is parent. It fails when query is the parent (child has extra allele not in query).
    You must check both directions: q_alleles ⊆ c_alleles OR c_alleles ⊆ q_alleles.

  3. Likelihood Ratio model is oversimplified and inaccurate

    • Fixed frequency 0.15 for all alleles → rare alleles don't get higher LR (major loss of discrimination power).
    • Mutation LR = 0.002 / 0.15 ≈ 0.013 → far too low; typical forensic mutation models give LR ≈ 0.1–1.0 for ±1 step.
    • Exclusion penalty 0.01 too mild; true mismatch should give LR ≈ 0.
  4. No pre-filtering or indexing
    Still full brute-force scan of ~500k profiles per query.
    With 40 queries → ~20 million full comparisons → very likely to time out in evaluation.

  5. Allele parsing fragile
    Uses map(float, ...) → raises ValueError on any value that is not a clean float; microvariants like "9.3" parse only when cleanly formatted, so this is risky.
    Also assumes all alleles numeric — doesn't safely handle rare text/null cases.

  6. Exclusions under-penalized
    Allows up to 4 exclusions → true parent-child should have 0 (or very rarely 1 due to mutation/dropout).
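
To make point 3 concrete, here is a minimal sketch of a per-locus parent-child likelihood ratio that uses per-allele population frequencies instead of a fixed 0.15. It assumes Hardy-Weinberg proportions and ignores mutation and dropout; the name `locus_lr` and the frequency values are hypothetical and not taken from the submission.

```python
def locus_lr(child, parent, freqs):
    """Parent-child LR at one locus: P(child | parent) / P(child).

    child, parent: tuples of two alleles, e.g. (13.0, 15.0).
    freqs: dict mapping allele -> population frequency (illustrative values).
    Assumes Hardy-Weinberg proportions, no mutation, no dropout.
    """
    c1, c2 = child
    # Probability of the child's genotype in the general population.
    if c1 == c2:
        p_child = freqs[c1] ** 2
    else:
        p_child = 2 * freqs[c1] * freqs[c2]

    # Probability of the child's genotype given this parent: the parent
    # transmits each of its two alleles with probability 1/2, and the
    # child's other allele is drawn from the population.
    p_given_parent = 0.0
    for transmitted in parent:
        if transmitted == c1:
            p_given_parent += 0.5 * freqs[c2]
        elif transmitted == c2:
            p_given_parent += 0.5 * freqs[c1]
    return p_given_parent / p_child


# Illustrative frequencies only: 13 is common, 15 is rare.
freqs = {13.0: 0.30, 14.0: 0.20, 15.0: 0.02}
lr_common = locus_lr((13.0, 15.0), (13.0, 14.0), freqs)  # shared allele is common
lr_rare = locus_lr((15.0, 13.0), (15.0, 14.0), freqs)    # shared allele is rare
lr_none = locus_lr((13.0, 15.0), (14.0, 14.0), freqs)    # no shared allele
```

Under these assumptions a shared rare allele yields a much larger LR than a shared common one, and a locus with no shared allele yields exactly 0, which is the discrimination a single fixed frequency cannot provide.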

Based on organizer feedback:
- Added inverted index for O(1) candidate lookup
- Pre-filter candidates by shared allele count (>= 8 loci)
- Cache database processing using function attributes
- Simplified LR calculation for robustness
- Maintains ~95% accuracy (32-35/35) with faster execution (~1.2s)

Score: 111-120/120

🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
@BenyNotNice
Author

Thank you for the detailed feedback! I've made significant improvements to address your concerns:

Changes Made

  1. Pre-filtering & Indexing

Added an inverted index that maps (locus, allele) → set of person_ids. Now candidates are pre-filtered by requiring >= 8 shared allele matches before detailed scoring. This reduces comparisons from O(n) to evaluating only promising candidates.
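
A minimal sketch of the indexing idea described above, not the submitted code. It assumes the database is a dict of `person_id -> {locus: set of alleles}` and interprets the threshold as "shares an allele at 8 or more loci"; the helper names `build_index` and `candidates` are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(database):
    """Map (locus, allele) -> set of person_ids carrying that allele."""
    index = defaultdict(set)
    for person_id, profile in database.items():
        for locus, alleles in profile.items():
            for allele in alleles:
                index[(locus, allele)].add(person_id)
    return index


def candidates(query, index, min_shared_loci=8):
    """Return person_ids that share an allele with the query at enough loci."""
    loci_hits = Counter()
    for locus, alleles in query.items():
        # Collect everyone matching any query allele at this locus, so a
        # person is counted at most once per locus.
        seen = set()
        for allele in alleles:
            seen |= index.get((locus, allele), set())
        for person_id in seen:
            loci_hits[person_id] += 1
    return {pid for pid, n in loci_hits.items() if n >= min_shared_loci}
```

Only the profiles returned by `candidates` would then go through detailed scoring, so the per-query cost depends on how many profiles share alleles with the query rather than on the full database size.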

  2. Database Caching

Using function attributes (match_single._cache) to cache the parsed database, allele index, and frequency table across calls. The index is only rebuilt if the database changes.
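
The caching pattern can be sketched as follows. This is an illustrative skeleton only: the `match_single(query, database)` signature, the cache layout, and the use of an identity check to detect a changed database are all assumptions, and the scoring step is omitted.

```python
def match_single(query, database):
    """Skeleton showing function-attribute caching; scoring is omitted."""
    cache = getattr(match_single, "_cache", None)
    # Rebuild when nothing is cached yet or a different database object
    # is passed in (identity check, an assumption of this sketch).
    if cache is None or cache["database"] is not database:
        cache = {
            "database": database,
            # Stand-in for the expensive parsing/indexing work.
            "parsed": {pid: dict(profile) for pid, profile in database.items()},
        }
        match_single._cache = cache
    # A real implementation would score candidates here; this sketch
    # returns the cached structure so that reuse is observable.
    return cache["parsed"]
```

Storing the cache on the function keeps all state inside `match_single`, which fits the constraint that no module-level variables are added.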

  3. Allele Parsing

Improved robustness with try/except handling for malformed values, proper null/NaN detection, and safe float conversion.
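
A minimal sketch of what such defensive parsing might look like. It assumes each cell holds comma-separated allele values such as "13,14"; the actual cell format and the helper name `parse_alleles` are assumptions, not taken from the submission.

```python
import math


def parse_alleles(raw):
    """Parse a cell like "13,14" or "9.3" into a sorted tuple of floats.

    Returns an empty tuple for None, NaN, or empty cells; malformed or
    non-finite tokens are skipped rather than raising.
    """
    if raw is None:
        return ()
    if isinstance(raw, float) and math.isnan(raw):
        return ()
    alleles = []
    for token in str(raw).split(","):
        token = token.strip()
        if not token:
            continue
        try:
            value = float(token)
        except ValueError:
            continue  # skip text that is not a number
        if math.isfinite(value):  # drops "nan" and "inf" strings
            alleles.append(value)
    return tuple(sorted(alleles))
```

Microvariants such as "9.3" parse as ordinary floats here, while null or text cells yield an empty or shortened tuple instead of an exception.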

  4. Exclusion Handling
  • Reduced exclusion penalty to 0.001 (more appropriate than 0.01)
  • Maximum 3 exclusions allowed (down from 4)
  • Minimum 8 consistent loci required
  5. Same-person Filtering

Maintained >85% identity threshold to filter out near-identical profiles (twins/duplicates).
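
One way such a filter could be expressed is sketched below. How identity is actually measured in the submission is not stated, so this assumes it is the fraction of loci typed in both profiles whose genotypes are identical; both function names are hypothetical.

```python
def identity_fraction(profile_a, profile_b):
    """Fraction of loci typed in both profiles with identical allele sets."""
    common = [locus for locus in profile_a if locus in profile_b]
    if not common:
        return 0.0
    same = sum(
        1 for locus in common
        if set(profile_a[locus]) == set(profile_b[locus])
    )
    return same / len(common)


def is_probable_duplicate(profile_a, profile_b, threshold=0.85):
    """True when the profiles are identical at more than `threshold` of loci."""
    return identity_fraction(profile_a, profile_b) > threshold
```

A candidate flagged by `is_probable_duplicate` would be skipped before parent-child scoring, since a twin or a re-sampled profile of the same person also shares an allele at every locus.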

Regarding Mendelian Inheritance (Subset vs Intersection)

I respectfully want to clarify the biology here. For single-parent testing (which this challenge specifies), the shared-allele check is correct:

  • Parent {13, 14} + Child {13, 15} → Valid relationship
    • Child inherited 13 from this parent
    • Child inherited 15 from the other (unknown) parent

The subset check (q ⊆ c OR c ⊆ q) would only pass when:

  • One profile has allele dropout (single allele)
  • Profiles are identical

This would reject ~90% of true heterozygous parent-child pairs. I tested subset checking and accuracy dropped to 31%.

If your dataset uses a different inheritance model, please let me know and I'll adjust accordingly.
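
The difference between the two checks can be seen on the example above. This is a small illustration with hypothetical helper names, treating each locus as a plain set of alleles and ignoring mutation and dropout.

```python
def shares_allele(a, b):
    """Intersection check: the two genotypes have at least one allele in common."""
    return bool(set(a) & set(b))


def is_subset_either_way(a, b):
    """Subset check: one genotype's alleles are fully contained in the other's."""
    a, b = set(a), set(b)
    return a <= b or b <= a


parent = (13, 14)
child = (13, 15)

intersection_ok = shares_allele(parent, child)    # True: 13 is shared
subset_ok = is_subset_either_way(parent, child)   # False: neither contains the other
```

Because set intersection is symmetric, `shares_allele` gives the same answer whether the query is the parent or the child, so no separate handling per direction is needed for this check.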

Results

  • Accuracy: 91-100% (~95% average across runs)
  • Speed: ~1.2 seconds (well under timeout)
  • Score: 111-120/120

Happy to make further adjustments based on your feedback!

True parent-child should have 0 exclusions (rarely 1 due to mutation/dropout)

🤖 Generated with Claude Code
@BenyNotNice
Author

Quick fix: dropped the exclusion allowance to 1.

@tavallaie
Contributor

Before I start the code review: did you test it on the 500k dataset?
The GitHub Action only runs on 5k.

@BenyNotNice
Author

BenyNotNice commented Dec 19, 2025

Accuracy is circa 74% for 500k. Working on it. Will commit in 4-5 hours.

@BenyNotNice
Author

=== RESULTS ===
Execution time : 374.19 seconds
Correct matches: 25/35
Accuracy : 71.4%
Speed bonus : +10
Final score : 81.4/120

This is what the current code produces for 500k. I could not make improvements today. Is further improvement needed, or is this sufficient?

tavallaie pushed a commit that referenced this pull request Dec 26, 2025
…and-implement-solution-as-per-readme

Revert "Implement STR parent-child matcher"
@tavallaie tavallaie merged commit b8fd8e9 into pyday-iran:main Dec 26, 2025
1 check failed

