Conversation

@Sudhendra (Owner)
Summary

Adds a comprehensive task-equivalence evaluation system that measures whether LLMs produce equivalent answers when given verbose vs compressed context. This is the correct metric for semantic compression (not perplexity/loss).

Changes

New Evaluation System (src/evaluation/)

  • Multi-model validation (Claude + GPT by default)
  • Task-equivalence scoring based on semantic similarity (see the scoring sketch after this list)
  • Artifact stripping (<think>, </tool_call>, backtick wrappers)
  • Proper Qwen3 chat template handling (was broken before)
  • Configurable equivalence threshold (default 0.80)
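
The core scoring step is: generate an answer from the verbose context and one from the compressed context, have each judge model rate how semantically equivalent the two answers are, then average the ratings and compare against the threshold. A minimal sketch of that aggregation in Python, assuming hypothetical judge callables (the names below are illustrative, not the module's actual API):

from statistics import mean
from typing import Callable

# A judge takes (verbose_answer, compressed_answer) and returns a 0.0-1.0 rating.
Judge = Callable[[str, str], float]

def equivalence_score(verbose_answer: str, compressed_answer: str,
                      judges: list[Judge]) -> float:
    """Average the equivalence ratings across judge models
    (Claude + GPT by default; Gemini was dropped for inconsistent scores)."""
    return mean(judge(verbose_answer, compressed_answer) for judge in judges)

def passes(verbose_answer: str, compressed_answer: str,
           judges: list[Judge], threshold: float = 0.80) -> bool:
    """An example passes when the averaged score meets the threshold."""
    return equivalence_score(verbose_answer, compressed_answer, judges) >= threshold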

V1 Adapter Results

Evaluated models/runs/mlx/2026-01-30_17-14-36/adapter (iter-500):

Metric             Value
Pass Rate          89% (at 0.80 threshold)
Avg Compression    62% (38% token savings)
Avg Equivalence    0.859
Failures           11/100
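
For reference, the headline numbers above can be reproduced from per-example records with a small aggregation like the following. This is a sketch only; the field names (equivalence, verbose_tokens, compressed_tokens) are assumptions about the output schema, not the script's actual format:

def summarize(results: list[dict], threshold: float = 0.80) -> dict:
    """Aggregate per-example eval records into the headline metrics."""
    n = len(results)
    passed = sum(r["equivalence"] >= threshold for r in results)
    # Compression ratio = compressed tokens / verbose tokens, so an average
    # of 0.62 means 62% of the tokens are kept (38% savings).
    ratios = [r["compressed_tokens"] / r["verbose_tokens"] for r in results]
    return {
        "pass_rate": passed / n,                                        # 0.89
        "avg_compression": sum(ratios) / n,                             # 0.62
        "avg_equivalence": sum(r["equivalence"] for r in results) / n,  # 0.859
        "failures": n - passed,                                         # 11/100
    }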

Key Findings

  • Gemini removed: its scores are inconsistent with Claude/GPT
  • 0.80 threshold is appropriate: 21 of 35 failures in earlier runs scored between 0.75 and 0.80
  • Aggressive compression fails: examples compressed below a 50% ratio pass only 26.7% of the time (see the bucketing sketch below)
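
The threshold and compression findings both fall out of simple bucketing over the per-example records; a sketch, reusing the assumed field names from the aggregation above:

def bucket_failures(results: list[dict], threshold: float = 0.80) -> dict:
    """Count near-miss failures and the pass rate of aggressively compressed examples."""
    failures = [r for r in results if r["equivalence"] < threshold]
    near_miss = [r for r in failures if 0.75 <= r["equivalence"] < threshold]
    aggressive = [r for r in results
                  if r["compressed_tokens"] / r["verbose_tokens"] < 0.50]
    aggressive_passed = sum(r["equivalence"] >= threshold for r in aggressive)
    return {
        "near_miss_failures": len(near_miss),       # 21 of 35 in earlier runs
        "aggressive_pass_rate": (aggressive_passed / len(aggressive))
        if aggressive else None,                    # ~26.7% for <50% ratios
    }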

Bug Fixes

  • Fixed chat template handling (raw text was being passed to generation instead of the templated prompt; see the templating sketch below)
  • Added --mask-prompt to MLX eval for accurate perplexity
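
For context on the template fix: prompts are now wrapped with the tokenizer's chat template before generation instead of being passed as raw text. A minimal sketch assuming a Hugging Face-style tokenizer with apply_chat_template (Qwen3 tokenizers ship one); the model id and message contents are placeholders:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")  # substitute the adapter's base model
messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # included when --include-system-prompt is set
    {"role": "user", "content": "<compressed context>\n\n<question>"},
]
# Before the fix the raw prompt string went straight to generation, so the model
# never saw the chat structure it was trained on. The fix templates the messages first.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return the templated string, not token ids
    add_generation_prompt=True,  # append the assistant-turn header
)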

Documentation

  • Updated docs/TASKS.md with Phase 3-5 completion status
  • Added docs/plans/2026-01-31-v2-production-training.md for scaling to 10K+ pairs

Testing

pytest tests/test_evaluate_adapter.py -v  # All 3 tests pass

Usage

python scripts/evaluate_adapter.py \
  --adapter-path models/runs/mlx/latest/adapter \
  --data data/training/test.jsonl \
  --limit 100 \
  --models claude gpt \
  --equivalence-threshold 0.80 \
  --include-system-prompt

- Add src/evaluation/evaluate_adapter.py with multi-model validation
- Fix chat template handling (was passing raw text instead of templated)
- Add artifact stripping (<think>, </tool_call>, backticks)
- Add --include-system-prompt flag for proper generation
- Remove Gemini from default evaluators (inconsistent scoring)
- Add --mask-prompt to MLX eval for accurate perplexity

V1 adapter results (models/runs/mlx/2026-01-30_17-14-36/adapter):
- 89% pass rate at 0.80 threshold (Claude + GPT)
- 62% avg compression ratio (38% token savings)
- 0.859 avg equivalence score

Also adds v2 production training plan for scaling to 10K+ pairs.
Sudhendra merged commit 159b2f0 into main on Jan 31, 2026.
3 checks passed.
