
Problem(s) with the evaluation script #4

@utkuu-cerebras

Hello,

The current evaluation script (evaluate_predictions.py) falls back to ANLS (Average Normalized Levenshtein Similarity, a character-level edit-distance metric) for non-numeric answers:

def evaluate_single_answer(
    target: str,
    prediction: str,
    max_relative_change: float = 0.05
) -> float:
    """
    Evaluates a single target-prediction pair:
    - Numeric within tolerance or exact year match inside this helper.
    - Falls back to ANLS for text.
    """
    t = target.strip().strip('%').strip()
    p = prediction.strip().strip('%').strip()
    #print("Stripped", t, p)
    # Attempt numeric
    t_f = to_float(t)
    p_f = to_float(p)
    if t_f is not None and p_f is not None:
        if t_f == 0.0:
            return 1.0 if p_f == 0.0 else 0.0
        change = abs(p_f - t_f) / abs(t_f)
        return 1.0 if change <= max_relative_change else 0.0
    # Fallback text
    #print("P:", p, "T: ", t)
    return anls_score(prediction=p.lower(), gold_labels=[t.lower()], threshold=0.5)
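For reference, ANLS can be sketched directly from its definition: one minus the Levenshtein distance normalized by the longer string, zeroed below the threshold. This is a from-scratch illustration, not the repo's anls_score implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for cb in b:
            curr.append(min(prev[len(curr)] + 1,            # deletion
                            curr[-1] + 1,                   # insertion
                            prev[len(curr) - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls(pred: str, gold: str, threshold: float = 0.5) -> float:
    sim = 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold), 1)
    return sim if sim >= threshold else 0.0

print(anls("unlikely", "likely"))  # 0.75: only the "un" prefix differs
```

With strings of length 8 and 6 differing by two insertions, the similarity is 1 - 2/8 = 0.75, comfortably above the 0.5 threshold.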

Because ANLS is purely string-based, it awards high similarity to antonyms that share many characters. Example:

Ground-Truth (GT): Likely

Prediction: Unlikely

Score: 0.75

Printed via print(gt, pred, year_flags_per_row, score), which outputs: likely Unlikely ['NO'] 0.75

I built a small visualizer to display the chart, model output, and the script’s score. In that UI, anything >50% was marked “Correct” (ignore that label, it’s crude), but the underlying 0.75 similarity is still misleading for evaluation.

Screenshots

  1. Opposite-label case (“Likely” vs “Unlikely”): [screenshot]
  2. Similar issue on another chart: [screenshot]

Why this is a problem

  • Character overlap ≠ semantic agreement. For categorical/ordinal labels (“Likely”, “Unlikely”, “Yes”, “No”), edit distance can invert reality.

  • Inflated scores distort overall metrics and mask real model errors.

Steps to Reproduce

  • Run evaluate_single_answer("Likely", "Unlikely").

  • Observe a score of 0.75 (above the 0.5 ANLS threshold, so it is not zeroed out).

  • Repeat with other near-antonym pairs that share substrings (“increase” vs “decrease”, “present” vs “absent”, etc.).
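The pairs above can be checked with a short script. The Levenshtein distance here is computed from scratch; the repo's anls_score is assumed to use the same underlying distance:

```python
def lev(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for cb in b:
            curr.append(min(prev[len(curr)] + 1,            # deletion
                            curr[-1] + 1,                   # insertion
                            prev[len(curr) - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

for gt, pred in [("likely", "unlikely"),
                 ("increase", "decrease"),
                 ("present", "absent")]:
    sim = 1 - lev(gt, pred) / max(len(gt), len(pred))
    print(f"{gt!r} vs {pred!r}: {sim:.2f}")
# All three pairs clear the 0.5 threshold despite having opposite meanings.
```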

Expected Behavior

  • Semantically opposite categorical answers should score 0 (or near 0).

  • Only genuinely equivalent/synonymous strings (case/whitespace variants, minor typos) should pass the threshold.
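One possible direction, sketched here rather than proposed as a patch: score short categorical answers by exact match after normalization, and keep the ANLS fallback only for longer free-form strings. The anls helper below is a from-definition stand-in for the repo's anls_score, and categorical_max_words is an illustrative cutoff, not anything from the repo:

```python
def anls(pred: str, gold: str, threshold: float = 0.5) -> float:
    """Stand-in for the repo's anls_score: normalized Levenshtein similarity."""
    prev = list(range(len(gold) + 1))
    for i, cp in enumerate(pred, 1):
        curr = [i]
        for cg in gold:
            curr.append(min(prev[len(curr)] + 1,            # deletion
                            curr[-1] + 1,                   # insertion
                            prev[len(curr) - 1] + (cp != cg)))  # substitution
        prev = curr
    sim = 1.0 - prev[-1] / max(len(pred), len(gold), 1)
    return sim if sim >= threshold else 0.0

def evaluate_text(target: str, prediction: str,
                  categorical_max_words: int = 2) -> float:
    t = " ".join(target.lower().split())
    p = " ".join(prediction.lower().split())
    if len(t.split()) <= categorical_max_words:  # "Likely", "Yes", "Increase", ...
        return 1.0 if p == t else 0.0            # antonyms now score 0
    return anls(p, t)                            # free-form text keeps ANLS

print(evaluate_text("Likely", "Unlikely"))  # 0.0
print(evaluate_text("Likely", "likely"))    # 1.0
```

Exact match is strict (it drops credit for minor typos on short labels), so the word-count cutoff is a trade-off; matching against a known label vocabulary would be another option.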

Actual Behavior

  • ANLS returns high similarity for antonyms because it measures edit distance, not meaning.
