Improve ES query weighting in entity_query to better align with LogicV2 scoring #1093

@pudo

Context

While investigating #1011 (scoring fewer candidates), we analyzed the correlation between ES index scores and LogicV2 algo scores using ~20,800 scoring log entries from production (418 queries).

The per-query Spearman rank correlation between ES and LogicV2 has a median of just 0.42:

| Correlation band | % of queries |
| --- | --- |
| Negative (< 0): ES ranking is worse than random | 21.7% |
| Weak (0–0.3) | 13.0% |
| Moderate (0.3–0.6) | 37.0% |
| Strong (>= 0.6) | 28.3% |

Top-5 overlap between ES and algo rankings is only 35%. In the worst observed inversion, the best algo result (score 0.592) sat at ES rank 153 out of 192 candidates.
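For reference, the two metrics above can be reproduced with something like the following. This is a minimal sketch, not the analysis script that was actually used: the `(query_id, es_score, algo_score)` row layout and all function names are assumptions made for illustration, and ties get average ranks.

```python
from collections import defaultdict
from statistics import median


def average_ranks(values):
    """Rank values ascending (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; None if either side has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def top_k_overlap(es_scores, algo_scores, k=5):
    """Fraction of the top-k candidates (by index) shared by both rankings."""
    if not es_scores:
        return 0.0
    k = min(k, len(es_scores))

    def top(scores):
        idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(idx[:k])

    return len(top(es_scores) & top(algo_scores)) / k


def per_query_stats(rows, k=5):
    """rows: iterable of (query_id, es_score, algo_score), a hypothetical layout.

    Returns (median per-query Spearman, mean top-k overlap).
    """
    by_query = defaultdict(list)
    for query_id, es, algo in rows:
        by_query[query_id].append((es, algo))
    rhos, overlaps = [], []
    for pairs in by_query.values():
        es = [p[0] for p in pairs]
        algo = [p[1] for p in pairs]
        rho = spearman(es, algo)
        if rho is not None:
            rhos.append(rho)
        overlaps.append(top_k_overlap(es, algo, k))
    return median(rhos), sum(overlaps) / len(overlaps)
```

Queries where either score is constant across all candidates have no defined rank correlation and are skipped in the median here; whether the production analysis did the same is not stated.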

Better alignment wouldn't replace LogicV2 scoring, but would mean the top ES candidates are more likely to include the actual best matches — reducing wasted scoring work and improving the effectiveness of early stopping heuristics (#1011).

Root cause: property weighting in entity_query()

The current entity_query() in yente/search/queries.py treats all non-name matchable properties equally — identifiers, dates, and countries all become unweighted term queries (boost 1.0). This diverges significantly from how LogicV2 values these signals:

| Signal | LogicV2 weight | Current ES boost |
| --- | --- | --- |
| Specific identifiers (LEI, ISIN, BIC, IMO, etc.) | 0.95–0.98 | 1.0 (same as a name part) |
| Generic identifiers (tax/registration IDs) | 0.85 | 1.0 |
| Name match | 1.0 | 3.0 (full name), varies (parts) |
| Birth date mismatch | -0.15 to -0.25 penalty | no penalty, just no bonus |
| Country mismatch | -0.20 penalty | no penalty, just no bonus |

ES relevance scores can't be negative, so a should-clause can't penalize a mismatch the way LogicV2 does. What ES can do is weight positive signals to reflect their relative importance.

Proposed changes

1. Boost identifier matches heavily

Identifier matches are near-deterministic in LogicV2 (a matching LEI code almost certainly means same entity) but are currently just one more term query. In entity_query():

# Identifier types should get a high boost to reflect their
# near-deterministic nature in matching
IDENTIFIER_BOOSTS = {
    registry.identifier: 8.0,
    # Specific types could get even higher boosts if available
}

for prop, value in entity.itervalues():
    if prop.type == registry.name or not prop.matchable:
        continue
    if prop.type == registry.address:
        query = {"match": {prop.type.group: value}}
        shoulds.append(query)
    elif prop.type.group is not None:
        boost = IDENTIFIER_BOOSTS.get(prop.type, 1.0)
        shoulds.append(tq(prop.type.group, value, boost))

2. Boost date matches

Birth dates are highly discriminating. A matching date should contribute more than a single name part:

if prop.type == registry.date:
    shoulds.append(tq(prop.type.group, value, boost=3.0))

3. Modest boost for country matches

Countries are informative but less discriminating than dates or identifiers:

if prop.type == registry.country:
    shoulds.append(tq(prop.type.group, value, boost=1.5))
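Taken together, the three changes amount to a single boost lookup per property type. The sketch below shows that shape in isolation so it runs without yente or followthemoney: `term_query` is a hypothetical stand-in for `tq`, properties are passed as plain `(group, value)` pairs, and the group names (`identifiers`, `dates`, `countries`, `addresses`) are assumed rather than read from the registry. The boost values are the ones proposed above and would need tuning against the scoring logs.

```python
from typing import Any, Dict, List, Tuple

# Proposed boosts keyed by type group name (assumed names, see lead-in).
GROUP_BOOSTS: Dict[str, float] = {
    "identifiers": 8.0,  # near-deterministic signal
    "dates": 3.0,        # highly discriminating
    "countries": 1.5,    # informative, but weak on its own
}


def term_query(field: str, value: str, boost: float = 1.0) -> Dict[str, Any]:
    """Hypothetical stand-in for tq(): an ES term query with a boost."""
    return {"term": {field: {"value": value, "boost": boost}}}


def build_shoulds(values: List[Tuple[str, str]]) -> List[Dict[str, Any]]:
    """Build should-clauses for non-name properties given (group, value) pairs."""
    shoulds: List[Dict[str, Any]] = []
    for group, value in values:
        if group == "addresses":
            # Addresses stay full-text matches, as in the current code.
            shoulds.append({"match": {group: value}})
        else:
            # Any group without an entry keeps the current boost of 1.0.
            boost = GROUP_BOOSTS.get(group, 1.0)
            shoulds.append(term_query(group, value, boost))
    return shoulds
```

Keeping the weights in one table rather than in separate `if` branches avoids appending the same clause twice and makes it easier to tune the values later.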

Limitations

These changes won't make ES perfectly match LogicV2 — there are structural differences that can't be bridged:

  • ES uses additive scoring (sum of all matching clauses), whereas LogicV2 takes the max of the main features and then applies qualifier adjustments
  • ES has no negative signals (can't penalize country/date/gender mismatches, only reward matches)
  • ES can't do cultural name analysis (symbol-aware pairing, family name weighting, etc.)
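The first two differences can be illustrated with a toy comparison. This is a deliberately simplified model and not how either system computes scores: it pretends each matching ES clause contributes exactly its boost (real scores also depend on BM25 term statistics), and it reduces LogicV2 to "max of features plus qualifiers". Both function names are made up for the example; the weights come from the table above.

```python
def es_like_score(matched_boosts):
    """Toy additive model: every matching should-clause adds its boost."""
    return sum(matched_boosts)


def logic_like_score(features, qualifiers):
    """Toy LogicV2-style model: best main feature plus qualifier adjustments."""
    return max(features, default=0.0) + sum(qualifiers)


# Candidate A: full name and country match, but the birth date mismatches.
# Candidate B: only a specific identifier (e.g. an LEI) matches.

# Current boosts: name 3.0, country 1.0, identifier 1.0.
a_es_current = es_like_score([3.0, 1.0])
b_es_current = es_like_score([1.0])

# Proposed boosts: name 3.0, country 1.5, identifier 8.0.
a_es_proposed = es_like_score([3.0, 1.5])
b_es_proposed = es_like_score([8.0])

# LogicV2-style: A is a name match (1.0) with a date penalty (-0.25),
# B is a specific identifier match (0.95).
a_logic = logic_like_score([1.0], [-0.25])
b_logic = logic_like_score([0.95], [])

print(a_es_current > b_es_current)    # ES currently ranks A first
print(b_es_proposed > a_es_proposed)  # with the proposed boosts, B comes first
print(b_logic > a_logic)              # the LogicV2-style score also prefers B
```

In this toy case the proposed boosts bring the ES ordering in line with the LogicV2-style ordering, but the date mismatch on candidate A still costs nothing on the ES side, which is the part that can't be bridged.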

But aligning the relative weighting of ES should-clauses with LogicV2's priorities should meaningfully improve the correlation, meaning the candidates ES ranks highest are more likely to be the ones LogicV2 scores highest.

🤖 Generated with Claude Code
