Hello,
The current evaluation script (evaluate_predictions.py) falls back to ANLS (a character-level edit-distance metric) for non-numeric answers:
def evaluate_single_answer(
    target: str,
    prediction: str,
    max_relative_change: float = 0.05
) -> float:
    """
    Evaluates a single target-prediction pair:
    - Numeric within tolerance or exact year match inside this helper.
    - Falls back to ANLS for text.
    """
    t = target.strip().strip('%').strip()
    p = prediction.strip().strip('%').strip()
    # Try a numeric comparison first.
    t_f = to_float(t)
    p_f = to_float(p)
    if t_f is not None and p_f is not None:
        if t_f == 0.0:
            return 1.0 if p_f == 0.0 else 0.0
        change = abs(p_f - t_f) / abs(t_f)
        return 1.0 if change <= max_relative_change else 0.0
    # Fall back to string similarity (ANLS) for non-numeric answers.
    return anls_score(prediction=p.lower(), gold_labels=[t.lower()], threshold=0.5)
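For context, the numeric branch itself behaves as intended. A minimal sketch of just that branch, with an illustrative to_float stub (not the repo's actual helper):

```python
def to_float(s: str):
    """Illustrative stub: parse a plain number, None if not numeric."""
    try:
        return float(s.replace(',', ''))
    except ValueError:
        return None

def numeric_match(target: str, prediction: str, max_relative_change: float = 0.05):
    t_f, p_f = to_float(target), to_float(prediction)
    if t_f is None or p_f is None:
        return None  # not numeric; the real script falls back to ANLS here
    if t_f == 0.0:
        return 1.0 if p_f == 0.0 else 0.0
    return 1.0 if abs(p_f - t_f) / abs(t_f) <= max_relative_change else 0.0

print(numeric_match("100", "104"))  # 1.0: 4% relative change is within the 5% tolerance
print(numeric_match("100", "106"))  # 0.0: 6% exceeds the tolerance
```

The problem reported below is confined to the text fallback, not this branch.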
Because ANLS is purely string-based, it awards high similarity to antonyms that share many characters. Example:
Ground-Truth (GT): Likely
Prediction: Unlikely
Score: 0.75
Printed via: print(gt, pred, year_flags_per_row, score) → likely Unlikely ['NO'] 0.75
I built a small visualizer to display the chart, model output, and the script’s score. In that UI, anything >50% was marked “Correct” (ignore that label, it’s crude), but the underlying 0.75 similarity is still misleading for evaluation.
Screenshots
- Opposite-label case (“Likely” vs “Unlikely”):
- Similar issue on another chart:
Why this is a problem
- Character overlap ≠ semantic agreement. For categorical/ordinal labels (“Likely”, “Unlikely”, “Yes”, “No”), edit distance can invert reality.
- Inflated scores distort overall metrics and mask real model errors.
Steps to Reproduce
- Run evaluate_single_answer("Likely", "Unlikely").
- Observe a score of ~0.75 (above the 0.5 ANLS threshold, so it is not zeroed out).
- Repeat with other near-antonym pairs that share substrings (“increase” vs “decrease”, “present” vs “absent”, etc.).
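These scores can be reproduced without the evaluation script at all, with a plain normalized-Levenshtein sketch (assuming anls_score implements standard ANLS: 1 − edit_distance / max_length, zeroed below the threshold):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def nls(pred: str, gold: str, threshold: float = 0.5) -> float:
    # Normalized Levenshtein similarity, zeroed below the threshold (ANLS-style).
    sim = 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold), 1)
    return sim if sim >= threshold else 0.0

for pred, gold in [("unlikely", "likely"), ("decrease", "increase"), ("absent", "present")]:
    print(pred, gold, round(nls(pred, gold), 2))
# unlikely likely 0.75   — "un" costs only 2 edits out of 8 characters
# decrease increase 0.75 — only the prefix differs
# absent present 0.57    — still above the 0.5 threshold
```

All three antonym pairs clear the 0.5 threshold, so none of them is zeroed out.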
Expected Behavior
- Semantically opposite categorical answers should score 0 (or near 0).
- Only genuinely equivalent/synonymous strings (case/whitespace variants, minor typos) should pass the threshold.
Actual Behavior
- ANLS returns high similarity for antonyms because it measures edit distance, not meaning.
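One possible direction (a sketch, not a finished fix): when the gold answer belongs to a known categorical vocabulary, require an exact match after normalization, and keep ANLS only for free-form text. CATEGORICAL_LABELS and the helper names below are illustrative, not from the actual codebase:

```python
# Hypothetical guard: exact match (after normalization) for known categorical
# labels. CATEGORICAL_LABELS is an illustrative set, not from the repo.
CATEGORICAL_LABELS = {"likely", "unlikely", "yes", "no", "increase", "decrease"}

def normalize(s: str) -> str:
    # Mirrors the script's existing normalization: strip whitespace and '%'.
    return s.strip().strip('%').strip().lower()

def score_categorical_or_none(target: str, prediction: str):
    t, p = normalize(target), normalize(prediction)
    if t in CATEGORICAL_LABELS:
        return 1.0 if t == p else 0.0  # antonyms now score 0, not 0.75
    return None  # not categorical: caller falls back to ANLS as before

print(score_categorical_or_none("Likely", "Unlikely"))  # 0.0
print(score_categorical_or_none("Likely", " likely "))  # 1.0
```

This keeps the numeric branch and free-form ANLS fallback untouched while closing the antonym loophole for label-style answers.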