Conversation

@Sudhendra (Owner)
Summary

Adds a comprehensive task-equivalence evaluation system that measures whether LLMs produce equivalent answers when given verbose vs compressed context. This is the correct metric for semantic compression (not perplexity/loss).

Changes

New Evaluation System (src/evaluation/)

  • Multi-model validation (Claude + GPT by default)
  • Task-equivalence scoring based on semantic similarity (see the scoring sketch after this list)
  • Artifact stripping (<think>, </tool_call>, backtick wrappers)
  • Proper Qwen3 chat template handling (was broken before)
  • Configurable equivalence threshold (default 0.80)
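
The core scoring step is: generate an answer from the verbose context and one from the compressed context, have each judge model rate how semantically equivalent the two answers are, then average the ratings and compare against the threshold. A minimal sketch of that aggregation in Python, assuming hypothetical judge callables (the names below are illustrative, not the module's actual API):

from statistics import mean
from typing import Callable

# A judge takes (verbose_answer, compressed_answer) and returns a 0.0-1.0 rating.
Judge = Callable[[str, str], float]

def equivalence_score(verbose_answer: str, compressed_answer: str,
                      judges: list[Judge]) -> float:
    """Average the equivalence ratings across judge models
    (Claude + GPT by default; Gemini was dropped for inconsistent scores)."""
    return mean(judge(verbose_answer, compressed_answer) for judge in judges)

def passes(verbose_answer: str, compressed_answer: str,
           judges: list[Judge], threshold: float = 0.80) -> bool:
    """An example passes when the averaged score meets the threshold."""
    return equivalence_score(verbose_answer, compressed_answer, judges) >= threshold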

V1 Adapter Results

Evaluated models/runs/mlx/2026-01-30_17-14-36/adapter (iter-500):

Metric             Value
Pass Rate          89% (at 0.80 threshold)
Avg Compression    62% (38% token savings)
Avg Equivalence    0.859
Failures           11/100
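
For reference, the headline numbers above can be reproduced from per-example records with a small aggregation like the following. This is a sketch only; the field names (equivalence, verbose_tokens, compressed_tokens) are assumptions about the output schema, not the script's actual format:

def summarize(results: list[dict], threshold: float = 0.80) -> dict:
    """Aggregate per-example eval records into the headline metrics."""
    n = len(results)
    passed = sum(r["equivalence"] >= threshold for r in results)
    # Compression ratio = compressed tokens / verbose tokens, so an average
    # of 0.62 means 62% of the tokens are kept (38% savings).
    ratios = [r["compressed_tokens"] / r["verbose_tokens"] for r in results]
    return {
        "pass_rate": passed / n,                                        # 0.89
        "avg_compression": sum(ratios) / n,                             # 0.62
        "avg_equivalence": sum(r["equivalence"] for r in results) / n,  # 0.859
        "failures": n - passed,                                         # 11/100
    }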

Key Findings

  • Gemini removed: its scores are inconsistent with Claude/GPT
  • 0.80 threshold is appropriate: 21 of 35 failures in earlier runs scored between 0.75 and 0.80
  • Aggressive compression fails: examples compressed below a 50% ratio pass only 26.7% of the time (see the bucketing sketch below)
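
The threshold and compression findings both fall out of simple bucketing over the per-example records; a sketch, reusing the assumed field names from the aggregation above:

def bucket_failures(results: list[dict], threshold: float = 0.80) -> dict:
    """Count near-miss failures and the pass rate of aggressively compressed examples."""
    failures = [r for r in results if r["equivalence"] < threshold]
    near_miss = [r for r in failures if 0.75 <= r["equivalence"] < threshold]
    aggressive = [r for r in results
                  if r["compressed_tokens"] / r["verbose_tokens"] < 0.50]
    aggressive_passed = sum(r["equivalence"] >= threshold for r in aggressive)
    return {
        "near_miss_failures": len(near_miss),       # 21 of 35 in earlier runs
        "aggressive_pass_rate": (aggressive_passed / len(aggressive))
        if aggressive else None,                    # ~26.7% for <50% ratios
    }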

Bug Fixes

  • Fixed chat template handling (raw text was being passed to generation instead of the templated prompt; see the templating sketch below)
  • Added --mask-prompt to MLX eval for accurate perplexity
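
For context on the template fix: prompts are now wrapped with the tokenizer's chat template before generation instead of being passed as raw text. A minimal sketch assuming a Hugging Face-style tokenizer with apply_chat_template (Qwen3 tokenizers ship one); the model id and message contents are placeholders:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")  # substitute the adapter's base model
messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # included when --include-system-prompt is set
    {"role": "user", "content": "<compressed context>\n\n<question>"},
]
# Before the fix the raw prompt string went straight to generation, so the model
# never saw the chat structure it was trained on. The fix templates the messages first.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return the templated string, not token ids
    add_generation_prompt=True,  # append the assistant-turn header
)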

Documentation

  • Updated docs/TASKS.md with Phase 3-5 completion status
  • Added docs/plans/2026-01-31-v2-production-training.md for scaling to 10K+ pairs

Testing

pytest tests/test_evaluate_adapter.py -v  # All 3 tests pass

Usage

python scripts/evaluate_adapter.py \
  --adapter-path models/runs/mlx/latest/adapter \
  --data data/training/test.jsonl \
  --limit 100 \
  --models claude gpt \
  --equivalence-threshold 0.80 \
  --include-system-prompt

- Add src/evaluation/evaluate_adapter.py with multi-model validation
- Fix chat template handling (was passing raw text instead of templated)
- Add artifact stripping (<think>, </tool_call>, backticks)
- Add --include-system-prompt flag for proper generation
- Remove Gemini from default evaluators (inconsistent scoring)
- Add --mask-prompt to MLX eval for accurate perplexity

V1 adapter results (models/runs/mlx/2026-01-30_17-14-36/adapter):
- 89% pass rate at 0.80 threshold (Claude + GPT)
- 62% avg compression ratio (38% token savings)
- 0.859 avg equivalence score

Also adds v2 production training plan for scaling to 10K+ pairs.
Sudhendra merged commit 159b2f0 into main on Jan 31, 2026.
3 checks passed.
