
Problem(s) with the evaluation script #4

@utkuu-cerebras

Hello,

The current evaluation script (evaluate_predictions.py) falls back to ANLS (Average Normalized Levenshtein Similarity, a character-level edit-distance metric) for non-numeric answers:

def evaluate_single_answer(
    target: str,
    prediction: str,
    max_relative_change: float = 0.05
) -> float:
    """
    Evaluates a single target-prediction pair:
    - Numeric within tolerance or exact year match inside this helper.
    - Falls back to ANLS for text.
    """
    t = target.strip().strip('%').strip()
    p = prediction.strip().strip('%').strip()
    #print("Stripped", t, p)
    # Attempt numeric
    t_f = to_float(t)
    p_f = to_float(p)
    if t_f is not None and p_f is not None:
        if t_f == 0.0:
            return 1.0 if p_f == 0.0 else 0.0
        change = abs(p_f - t_f) / abs(t_f)
        return 1.0 if change <= max_relative_change else 0.0
    # Fallback text
    #print("P:", p, "T: ", t)
    return anls_score(prediction=p.lower(), gold_labels=[t.lower()], threshold=0.5)
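For reference, ANLS can be sketched directly from its definition: one minus the Levenshtein distance normalized by the longer string, zeroed below the threshold. This is a from-scratch illustration, not the repo's anls_score implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for cb in b:
            curr.append(min(prev[len(curr)] + 1,            # deletion
                            curr[-1] + 1,                   # insertion
                            prev[len(curr) - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls(pred: str, gold: str, threshold: float = 0.5) -> float:
    sim = 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold), 1)
    return sim if sim >= threshold else 0.0

print(anls("unlikely", "likely"))  # 0.75: only the "un" prefix differs
```

With strings of length 8 and 6 differing by two insertions, the similarity is 1 - 2/8 = 0.75, comfortably above the 0.5 threshold.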

Because ANLS is purely string-based, it awards high similarity to antonyms that share many characters. Example:

Ground-Truth (GT): Likely

Prediction: Unlikely

Score: 0.75

Printed via print(gt, pred, year_flags_per_row, score), which outputs: likely Unlikely ['NO'] 0.75

I built a small visualizer to display the chart, model output, and the script’s score. In that UI, anything >50% was marked “Correct” (ignore that label, it’s crude), but the underlying 0.75 similarity is still misleading for evaluation.

Screenshots

  1. Opposite-label case (“Likely” vs “Unlikely”): [screenshot]
  2. Similar issue on another chart: [screenshot]

Why this is a problem

  • Character overlap ≠ semantic agreement. For categorical/ordinal labels (“Likely”, “Unlikely”, “Yes”, “No”), edit distance can invert reality.

  • Inflated scores distort overall metrics and mask real model errors.

Steps to Reproduce

  • Run evaluate_single_answer("Likely", "Unlikely").

  • Observe a score of 0.75 (above the 0.5 ANLS threshold, so it is not zeroed out).

  • Repeat with other near-antonym pairs that share substrings (“increase” vs “decrease”, “present” vs “absent”, etc.).
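The pairs above can be checked with a short script. The Levenshtein distance here is computed from scratch; the repo's anls_score is assumed to use the same underlying distance:

```python
def lev(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for cb in b:
            curr.append(min(prev[len(curr)] + 1,            # deletion
                            curr[-1] + 1,                   # insertion
                            prev[len(curr) - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

for gt, pred in [("likely", "unlikely"),
                 ("increase", "decrease"),
                 ("present", "absent")]:
    sim = 1 - lev(gt, pred) / max(len(gt), len(pred))
    print(f"{gt!r} vs {pred!r}: {sim:.2f}")
# All three pairs clear the 0.5 threshold despite having opposite meanings.
```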

Expected Behavior

  • Semantically opposite categorical answers should score 0 (or near 0).

  • Only genuinely equivalent/synonymous strings (case/whitespace variants, minor typos) should pass the threshold.
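One possible direction, sketched here rather than proposed as a patch: score short categorical answers by exact match after normalization, and keep the ANLS fallback only for longer free-form strings. The anls helper below is a from-definition stand-in for the repo's anls_score, and categorical_max_words is an illustrative cutoff, not anything from the repo:

```python
def anls(pred: str, gold: str, threshold: float = 0.5) -> float:
    """Stand-in for the repo's anls_score: normalized Levenshtein similarity."""
    prev = list(range(len(gold) + 1))
    for i, cp in enumerate(pred, 1):
        curr = [i]
        for cg in gold:
            curr.append(min(prev[len(curr)] + 1,            # deletion
                            curr[-1] + 1,                   # insertion
                            prev[len(curr) - 1] + (cp != cg)))  # substitution
        prev = curr
    sim = 1.0 - prev[-1] / max(len(pred), len(gold), 1)
    return sim if sim >= threshold else 0.0

def evaluate_text(target: str, prediction: str,
                  categorical_max_words: int = 2) -> float:
    t = " ".join(target.lower().split())
    p = " ".join(prediction.lower().split())
    if len(t.split()) <= categorical_max_words:  # "Likely", "Yes", "Increase", ...
        return 1.0 if p == t else 0.0            # antonyms now score 0
    return anls(p, t)                            # free-form text keeps ANLS

print(evaluate_text("Likely", "Unlikely"))  # 0.0
print(evaluate_text("Likely", "likely"))    # 1.0
```

Exact match is strict (it drops credit for minor typos on short labels), so the word-count cutoff is a trade-off; matching against a known label vocabulary would be another option.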

Actual Behavior

  • ANLS returns high similarity for antonyms because it measures edit distance, not meaning.
