Improve ES query weighting in entity_query to better align with LogicV2 scoring (#1093)

## Context

While investigating opensanctions/yente#1011 (scoring fewer candidates), we analyzed the correlation between ES index scores and LogicV2 algo scores using ~20,800 scoring log entries from production (418 queries).

The per-query Spearman rank correlation between ES and LogicV2 has a **median of just 0.42**:

| Correlation band | % of queries |
|---|---|
| Negative (< 0) — ES ranking is *worse than random* | 21.7% |
| Weak (0–0.3) | 13.0% |
| Moderate (0.3–0.6) | 37.0% |
| Strong (>= 0.6) | 28.3% |

Top-5 overlap between ES and algo rankings is only **35%**. In the worst observed inversion, the best algo result (score 0.592) sat at ES rank 153 out of 192 candidates.
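
For reference, the two metrics above can be computed along these lines. This is a minimal sketch, not the script used for the analysis: the `(es_score, algo_score)` pair layout is an assumption about how one query's log entries are shaped, and `average_ranks`, `spearman` and `top_k_overlap` are hypothetical helper names. Tied scores receive average ranks.

```python
from statistics import mean


def average_ranks(values):
    """Rank values ascending (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


def top_k_overlap(es_scores, algo_scores, k=5):
    """Share of the top-k candidates by ES score that are also top-k by algo score."""
    k = min(k, len(es_scores))

    def top(scores):
        idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(idx[:k])

    return len(top(es_scores) & top(algo_scores)) / k


# Made-up entries for a single query: (es_score, algo_score) per candidate.
candidates = [(12.1, 0.31), (11.4, 0.92), (9.8, 0.40), (7.5, 0.88), (3.2, 0.10)]
es = [c[0] for c in candidates]
algo = [c[1] for c in candidates]
rho = spearman(es, algo)                 # about 0.3 for this toy data
overlap = top_k_overlap(es, algo, k=2)   # 0.5 for this toy data
```

Running `spearman` once per query and taking the median of the results would give a per-query figure comparable to the one quoted above.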

Better alignment wouldn't replace LogicV2 scoring, but the top ES candidates would be more likely to include the actual best matches. That reduces wasted scoring work and makes early stopping heuristics (#1011) more effective.

## Root cause: property weighting in `entity_query()`

The current `entity_query()` in `yente/search/queries.py` treats all non-name matchable properties equally: apart from addresses, which use a `match` query, identifiers, dates, and countries all become unweighted `term` queries (boost 1.0). This diverges significantly from how LogicV2 values these signals:

| Signal | LogicV2 weight | Current ES boost |
|---|---|---|
| Specific identifiers (LEI, ISIN, BIC, IMO, etc.) | 0.95–0.98 | 1.0 (same as a name part) |
| Generic identifiers (tax/registration IDs) | 0.85 | 1.0 |
| Name match | 1.0 | 3.0 (full name), varies (parts) |
| Birth date mismatch | -0.15 to -0.25 penalty | no penalty, just no bonus |
| Country mismatch | -0.20 penalty | no penalty, just no bonus |

ES `should` clauses can only add to a score, so mismatches can't be penalized the way LogicV2 does it, but positive signals can be weighted to reflect their relative importance.

## Proposed changes

### 1. Boost identifier matches heavily

Identifier matches are near-deterministic in LogicV2 (a matching LEI code almost certainly means same entity) but are currently just one more term query. In `entity_query()`:

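```python
from followthemoney.types import registry

# Identifier types should get a high boost to reflect their
# near-deterministic nature in matching
IDENTIFIER_BOOSTS = {
    registry.identifier: 8.0,
    # Specific types could get even higher boosts if available
}

# Fragment of entity_query(): `entity`, `shoulds` and `tq` come from
# the surrounding code and are not defined here.
for prop, value in entity.itervalues():
    # Names are skipped here, as are non-matchable properties.
    if prop.type == registry.name or not prop.matchable:
        continue
    if prop.type == registry.address:
        # Addresses remain full-text `match` queries.
        query = {"match": {prop.type.group: value}}
        shoulds.append(query)
    elif prop.type.group is not None:
        # Everything else is a term query; only identifiers get a boost.
        boost = IDENTIFIER_BOOSTS.get(prop.type, 1.0)
        shoulds.append(tq(prop.type.group, value, boost))
```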

### 2. Boost date matches

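Birth dates are highly discriminating. A matching date should contribute more than a single name part. Note that `registry.date` covers every matchable date property, not only birth dates, so this branch in the loop above would boost all of them: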

```python
if prop.type == registry.date:
    shoulds.append(tq(prop.type.group, value, boost=3.0))
```

### 3. Modest boost for country matches

Countries are informative but less discriminating than dates or identifiers:

```python
if prop.type == registry.country:
    shoulds.append(tq(prop.type.group, value, boost=1.5))
```
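
The three proposals could also be folded into a single lookup instead of separate branches. The sketch below is illustrative only: it keys on plain type-name strings rather than `registry` objects so that it runs without followthemoney installed, `term_clause` is a stand-in for the `tq` helper (whose actual signature is not shown in this issue), the `Prop` dataclass and `property_shoulds` are hypothetical, and the boost values are the untuned starting points proposed above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional, Tuple

# Starting-point boosts from this proposal, keyed by property type name.
# Types that are not listed keep the current default of 1.0.
TYPE_BOOSTS: Dict[str, float] = {
    "identifier": 8.0,
    "date": 3.0,
    "country": 1.5,
}


@dataclass(frozen=True)
class Prop:
    """Hypothetical stand-in for a followthemoney property."""

    type_name: str
    group: Optional[str]
    matchable: bool = True


def term_clause(field: str, value: str, boost: float = 1.0) -> Dict[str, Any]:
    """Build an Elasticsearch term query clause with an explicit boost."""
    return {"term": {field: {"value": value, "boost": boost}}}


def property_shoulds(values: Iterable[Tuple[Prop, str]]) -> List[Dict[str, Any]]:
    """Turn non-name matchable property values into weighted should clauses."""
    shoulds: List[Dict[str, Any]] = []
    for prop, value in values:
        if prop.type_name == "name" or not prop.matchable:
            continue  # names are skipped here, as in the loop above
        if prop.type_name == "address":
            # Addresses stay full-text matches, as in the current code.
            shoulds.append({"match": {prop.group: value}})
        elif prop.group is not None:
            boost = TYPE_BOOSTS.get(prop.type_name, 1.0)
            shoulds.append(term_clause(prop.group, value, boost))
    return shoulds


# Made-up property values for one query entity.
example = [
    (Prop("identifier", "identifiers"), "LEI-EXAMPLE-0001"),
    (Prop("date", "dates"), "1970-01-01"),
    (Prop("country", "countries"), "de"),
    (Prop("name", "names"), "Example GmbH"),
]
clauses = property_shoulds(example)
```

Keeping the boosts in one table makes them easy to tune against the logged correlation data without touching the loop itself.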

## Limitations

These changes won't make ES perfectly match LogicV2 — there are structural differences that can't be bridged:
- ES uses additive scoring (`sum` of all matching clauses), whereas LogicV2 uses `max` of main features + qualifier adjustments
- ES has no negative signals (can't penalize country/date/gender mismatches, only reward matches)
- ES can't do cultural name analysis (symbol-aware pairing, family name weighting, etc.)

But aligning the relative weighting of ES should-clauses with LogicV2's priorities should improve the correlation, so that the candidates ES ranks highest are more likely to be the ones LogicV2 scores highest.
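
To make the additive-versus-`max` difference concrete, here is a toy comparison. Everything in it is invented for illustration: the two candidates, the name-part boost of 1.0, the 0.6 name score, and both scoring functions, neither of which reproduces the real ES or LogicV2 formulas. Only the 0.95 identifier weight and the proposed boosts are taken from this issue.

```python
# All non-name signals currently share boost 1.0; the proposal reweights them.
CURRENT_BOOSTS = {"name_part": 1.0, "identifier": 1.0, "date": 1.0, "country": 1.0}
PROPOSED_BOOSTS = {"name_part": 1.0, "identifier": 8.0, "date": 3.0, "country": 1.5}


def additive_score(matched, boosts):
    """ES-style: each matching should clause adds its boost to the total."""
    return sum(boosts[signal] for signal in matched)


def max_based_score(main_features, adjustments=()):
    """LogicV2-style: strongest main feature plus qualifier adjustments."""
    return max(main_features, default=0.0) + sum(adjustments)


# Candidate A: weak name overlap plus matching date and country.
# Candidate B: nothing but a matching identifier.
a_matched = ["name_part", "date", "country"]
b_matched = ["identifier"]

a_algo = max_based_score([0.6])    # assumed partial name score
b_algo = max_based_score([0.95])   # identifier weight from the table above

# Current boosts: A outranks B in ES although the algo prefers B.
current = (additive_score(a_matched, CURRENT_BOOSTS),
           additive_score(b_matched, CURRENT_BOOSTS))
# Proposed boosts: B moves ahead of A, in line with the algo ordering.
proposed = (additive_score(a_matched, PROPOSED_BOOSTS),
            additive_score(b_matched, PROPOSED_BOOSTS))
```

With the current boosts several weak matches add up to more than one strong one. The proposed boosts reverse that for this pair, but they cannot remove the additive behaviour itself.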

🤖 Generated with [Claude Code](https://claude.com/claude-code)
