An RL training task in which an agent learns to compress a language model via quantization.
```
python3 -m venv venv
source venv/bin/activate
pip install .
```

Create `.env` with your API key:

```
echo "ANTHROPIC_API_KEY=sk-ant-api03-xxxxxxxxxxxxx" > .env
```

Quick test (3 runs):

```
python main.py
```

Full evaluation (30 runs):

```
pytest
```

(If you run into memory issues, run `python main.py` directly with the same parameters as in `test_task.py`.)
Given a pre-trained language model (GPT-2), the agent must:
- Apply quantization to reduce model size by at least 50%
- Keep accuracy degradation under 7% (perplexity ratio ≤ 1.07x)
- Submit code that defines both `original_model` and `quantized_model`
The grader executes the submitted code and independently measures both models. No self-reported metrics - everything is verified.
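The grader's harness isn't reproduced here, but a perplexity comparison of this kind typically boils down to exponentiating the mean token-level cross-entropy. A minimal sketch (the `perplexity` helper and the ratio check are illustrative, not the task's actual grading code):

```python
import math

import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean token-level negative log-likelihood)."""
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return math.exp(nll.item())

# Sanity check: uniform logits over a 10-token vocab give perplexity exactly 10.
logits = torch.zeros(1, 4, 10)
targets = torch.zeros(1, 4, dtype=torch.long)
print(round(perplexity(logits, targets)))  # -> 10

# A passing submission satisfies: perplexity(quantized) / perplexity(original) <= 1.07
```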
Multiple approaches work: FP16 (`.half()`), INT8 (`torch.quantization` or bitsandbytes), custom quantization schemes, etc.
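For the FP16 route, the size bookkeeping can be sketched as follows. This uses a small stand-in module instead of GPT-2 so it runs without downloads, and `model_size_bytes` is an illustrative helper, not part of the task:

```python
import copy

import torch
import torch.nn as nn

def model_size_bytes(model: nn.Module) -> int:
    # Count buffers as well as parameters -- forgetting buffers is a
    # classic way to land at ~48% instead of the required >= 50% reduction.
    tensors = list(model.parameters()) + list(model.buffers())
    return sum(t.numel() * t.element_size() for t in tensors)

# Stand-in for GPT-2: one FP32 MLP block.
original_model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

# deepcopy first: .half() converts tensors in place and returns the same module.
quantized_model = copy.deepcopy(original_model).half()

reduction = 1 - model_size_bytes(quantized_model) / model_size_bytes(original_model)
print(f"size reduction: {reduction:.0%}")  # FP32 -> FP16 halves every tensor: 50%
```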
Model quantization is a real production skill. This task requires understanding size/accuracy tradeoffs, not just calling an API. Expected pass rate: 10-40% on Claude Haiku.
Common failure modes:
- Not aggressive enough (e.g., forgot to quantize buffers, only got 48% reduction)
- Too aggressive (accuracy drops too much)
- In-place operations (forgot `deepcopy`, so both models end up quantized)
- Wrong method for hardware (bitsandbytes doesn't work on CPU)
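The in-place pitfall is easy to reproduce: `Module.half()` converts the module's tensors in place and returns `self`, so without a `deepcopy` both names point at the same FP16 model. A minimal demonstration:

```python
import copy

import torch
import torch.nn as nn

broken = nn.Linear(8, 8)
alias = broken.half()        # converts `broken` in place and returns the same object
print(alias is broken)       # -> True: "both models" are now FP16

original = nn.Linear(8, 8)
quantized = copy.deepcopy(original).half()   # the copy is converted, not the original
print(next(original.parameters()).dtype)     # -> torch.float32
print(next(quantized.parameters()).dtype)    # -> torch.float16
```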
Edit main.py to change:
- `NUM_RUNS`: number of test runs (default: 3)
- `MAX_STEPS`: max agent steps per run (default: 25)
- `VERBOSE`: show detailed transcripts (default: True)
- `CONCURRENT`: run tests in parallel (default: False)
All task logic lives in task.py. The test suite (test_task.py) is read-only.
- Python 3.12+
- PyTorch with transformers
- ~500MB disk space for GPT-2 model
- API costs: ~$0.05-0.10 per run, ~$2-3 for full 30-run evaluation