feat: add task-equivalence evaluation system for adapters #11
## Summary
Adds a comprehensive task-equivalence evaluation system that measures whether LLMs produce equivalent answers when given verbose vs compressed context. This is the correct metric for semantic compression (not perplexity/loss).
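To make the metric concrete, here is a minimal sketch of the idea, assuming a generic `generate_answer(context, question)` callable; all names below are hypothetical and are not the actual API in `src/evaluation/`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class EvalCase:
    question: str
    verbose_context: str
    compressed_context: str

def answers_equivalent(a: str, b: str) -> bool:
    # Placeholder check; a real evaluator might use an LLM judge,
    # exact match on extracted answers, or task-specific scoring.
    return a.strip().lower() == b.strip().lower()

def task_equivalence_rate(cases: Iterable[EvalCase],
                          generate_answer: Callable[[str, str], str]) -> float:
    """Fraction of cases where the compressed context yields an answer
    equivalent to the one produced from the verbose context."""
    cases = list(cases)
    hits = sum(
        answers_equivalent(
            generate_answer(c.verbose_context, c.question),
            generate_answer(c.compressed_context, c.question),
        )
        for c in cases
    )
    return hits / len(cases)
```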
## Changes

### New Evaluation System (`src/evaluation/`)

- Cleans generated answers before comparison, stripping `<think>`, `</tool_call>`, and backtick wrappers (see the sketch below)
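The cleanup step is the part of the system visible from this description, so here is a minimal sketch of what it could look like. `clean_model_output` is a hypothetical name used only for illustration; the actual helper in `src/evaluation/` may differ.

```python
import re

def clean_model_output(text: str) -> str:
    """Strip reasoning tags, stray tool-call tags, and backtick wrappers
    from a generated answer before comparing it against a reference."""
    # Remove <think>...</think> blocks and any leftover think/tool_call tags.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    text = re.sub(r"</?(think|tool_call)>", "", text)
    text = text.strip()
    # Unwrap a fenced code block, dropping an optional language hint line.
    if text.startswith("```") and text.endswith("```"):
        text = re.sub(r"^[a-zA-Z]*\n", "", text[3:-3])
    # Unwrap single-backtick wrapping.
    return text.strip().strip("`").strip()
```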
### V1 Adapter Results

Evaluated `models/runs/mlx/2026-01-30_17-14-36/adapter` (iter-500):

#### Key Findings
### Bug Fixes

- Added `--mask-prompt` to MLX eval for accurate perplexity
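Why this matters for the reported number, sketched under the assumption that `--mask-prompt` excludes prompt tokens from the loss: only completion tokens should contribute to the averaged negative log-likelihood, otherwise easy-to-predict prompt text deflates perplexity. A conceptual illustration, not MLX's implementation:

```python
import math

def masked_perplexity(token_nlls: list[float], is_prompt: list[bool]) -> float:
    # Average negative log-likelihood over completion tokens only.
    completion_nlls = [nll for nll, p in zip(token_nlls, is_prompt) if not p]
    return math.exp(sum(completion_nlls) / len(completion_nlls))
```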
### Documentation

- Updated `docs/TASKS.md` with Phase 3-5 completion status
- Added `docs/plans/2026-01-31-v2-production-training.md` for scaling to 10K+ pairs

### Testing
```bash
pytest tests/test_evaluate_adapter.py -v  # All 3 tests pass
```

### Usage