Context
While investigating #1011 (scoring fewer candidates), we analyzed the correlation between ES index scores and LogicV2 algo scores using ~20,800 scoring log entries from production (418 queries).
The per-query Spearman rank correlation between ES and LogicV2 has a median of just 0.42:
| Correlation band | % of queries |
| --- | --- |
| Negative (< 0): ES ranking is worse than random | 21.7% |
| Weak (0–0.3) | 13.0% |
| Moderate (0.3–0.6) | 37.0% |
| Strong (>= 0.6) | 28.3% |
Top-5 overlap between ES and algo rankings is only 35%. In the worst observed inversion, the best algo result (score 0.592) sat at ES rank 153 out of 192 candidates.
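For reference, the two metrics above can be reproduced along these lines. This is a minimal sketch, not the analysis script behind the numbers in this issue: the function names, the tie handling (average ranks), and the definition of top-k overlap (shared candidates divided by k) are assumptions, and the sample scores are made up.

```python
from statistics import mean, median


def average_ranks(values):
    """Rank values ascending (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


def top_k_overlap(es_scores, algo_scores, k=5):
    """Share of the top-k candidates that both rankings agree on."""
    k = min(k, len(es_scores))  # assumes at least one candidate

    def top(scores):
        idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(idx[:k])

    return len(top(es_scores) & top(algo_scores)) / k


# Made-up scores for two queries: (ES scores, algo scores) per candidate.
per_query = {
    "q1": ([12.0, 9.5, 7.1, 3.3], [0.4, 0.9, 0.7, 0.1]),
    "q2": ([8.0, 6.0, 2.0], [0.9, 0.5, 0.2]),
}
rhos = [spearman(es, algo) for es, algo in per_query.values()]
median_rho = median(r for r in rhos if r is not None)
```

Queries where either side is constant return `None` and are left out of the median here; whether the original analysis did the same is not stated.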
Better alignment wouldn't replace LogicV2 scoring, but would mean the top ES candidates are more likely to include the actual best matches — reducing wasted scoring work and improving the effectiveness of early stopping heuristics (#1011).
Root cause: property weighting in entity_query()
The current entity_query() in yente/search/queries.py treats all non-name matchable properties equally — identifiers, dates, and countries all become unweighted term queries (boost 1.0). This diverges significantly from how LogicV2 values these signals:
| Signal | LogicV2 weight | Current ES boost |
| --- | --- | --- |
| Specific identifiers (LEI, ISIN, BIC, IMO, etc.) | 0.95–0.98 | 1.0 (same as a name part) |
| Generic identifiers (tax/registration IDs) | 0.85 | 1.0 |
| Name match | 1.0 | 3.0 (full name), varies (parts) |
| Birth date mismatch | -0.15 to -0.25 penalty | no penalty, just no bonus |
| Country mismatch | -0.20 penalty | no penalty, just no bonus |
ES can't do negative scoring (penalties for mismatches), but it can weight positive signals to reflect their relative importance.
Proposed changes
1. Boost identifier matches heavily
Identifier matches are near-deterministic in LogicV2 (a matching LEI code almost certainly means the same entity), but in ES they are currently just one more term query. In entity_query():
    # Identifier types should get a high boost to reflect their
    # near-deterministic nature in matching
    IDENTIFIER_BOOSTS = {
        registry.identifier: 8.0,
        # Specific types could get even higher boosts if available
    }

    for prop, value in entity.itervalues():
        if prop.type == registry.name or not prop.matchable:
            continue
        if prop.type == registry.address:
            query = {"match": {prop.type.group: value}}
            shoulds.append(query)
        elif prop.type.group is not None:
            boost = IDENTIFIER_BOOSTS.get(prop.type, 1.0)
            shoulds.append(tq(prop.type.group, value, boost))
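To make the effect concrete, here is roughly what a boosted identifier clause would look like in the generated query. This is a sketch under assumptions: the `tq()` below is a stand-in written for this example (the real helper's signature may differ), `"identifiers"` is assumed to be the value of `registry.identifier.group`, and the LEI is made up.

```python
from typing import Any, Dict

Clause = Dict[str, Any]


def tq(field: str, value: str, boost: float = 1.0) -> Clause:
    """Stand-in for the tq() helper: an ES term query with a boost."""
    return {"term": {field: {"value": value, "boost": boost}}}


# Assumed group name and a made-up identifier value.
clause = tq("identifiers", "5299000000EXAMPLE000", boost=8.0)
```

With a boost of 8.0, a single identifier hit would outweigh a full-name match at boost 3.0, which is the intended ordering.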
2. Boost date matches
Birth dates are highly discriminating. A matching date should contribute more than a single name part:
    if prop.type == registry.date:
        shoulds.append(tq(prop.type.group, value, boost=3.0))
3. Modest boost for country matches
Countries are informative but less discriminating than dates or identifiers:
    if prop.type == registry.country:
        shoulds.append(tq(prop.type.group, value, boost=1.5))
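The three proposals could also be folded into a single lookup table rather than separate branches. The sketch below is self-contained and deliberately avoids the real `registry` and `entity` objects: property types are represented by plain strings, `build_shoulds()` is a hypothetical name, and the boost values are the untuned ones proposed in this issue.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Proposed boosts keyed by property type name; anything not listed
# keeps the current default of 1.0.
TYPE_BOOSTS: Dict[str, float] = {
    "identifier": 8.0,
    "date": 3.0,
    "country": 1.5,
}


def build_shoulds(values: Iterable[Tuple[str, str, str]]) -> List[Dict[str, Any]]:
    """Build should-clauses from (type_name, group, value) triples."""
    shoulds: List[Dict[str, Any]] = []
    for type_name, group, value in values:
        if type_name == "name":
            continue  # names are handled separately
        if type_name == "address":
            shoulds.append({"match": {group: value}})
            continue
        boost = TYPE_BOOSTS.get(type_name, 1.0)
        shoulds.append({"term": {group: {"value": value, "boost": boost}}})
    return shoulds


# Made-up property values for illustration.
shoulds = build_shoulds(
    [
        ("name", "names", "Jane Doe"),
        ("date", "dates", "1970-01-01"),
        ("country", "countries", "de"),
        ("identifier", "identifiers", "X123"),
        ("phone", "phones", "+4912345"),
    ]
)
```

Keeping the weights in one table makes them easier to tune against the scoring logs than boosts scattered across branches.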
Limitations
These changes won't make ES perfectly match LogicV2 — there are structural differences that can't be bridged:
- ES uses additive scoring (sum of all matching clauses), LogicV2 uses max of main features + qualifier adjustments
- ES has no negative signals (can't penalize country/date/gender mismatches, only reward matches)
- ES can't do cultural name analysis (symbol-aware pairing, family name weighting, etc.)
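The first two differences are enough to flip a ranking, as the toy example below illustrates. It is a simplified model, not the actual scoring code of either system: both functions and all numbers are made up, and LogicV2 is reduced to "max of main features plus the sum of qualifier adjustments".

```python
def es_like_score(clause_scores):
    """Additive model: every matching should-clause adds to the total."""
    return sum(clause_scores)


def logic_like_score(main_features, qualifiers):
    """Simplified model: best main feature plus qualifier adjustments."""
    return max(main_features, default=0.0) + sum(qualifiers)


# Candidate A: strong name match and a country match, but the birth date
# differs. ES sees two matching clauses and no penalty.
es_a = es_like_score([3.0, 1.0])
logic_a = logic_like_score([0.9], [-0.2])

# Candidate B: slightly weaker name match, nothing else to compare.
es_b = es_like_score([3.0])
logic_b = logic_like_score([0.8], [])

es_prefers_a = es_a > es_b
logic_prefers_b = logic_b > logic_a
```

In this setup ES ranks A first while the LogicV2-like model ranks B first, because the date mismatch costs A nothing on the ES side.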
But aligning the relative weighting of ES should-clauses with LogicV2's priorities should meaningfully improve the correlation, meaning the candidates ES ranks highest are more likely to be the ones LogicV2 scores highest.
🤖 Generated with Claude Code