LLM Inference Optimization through Quantization

An RL training task where models learn to optimize language models via quantization.

Setup

python3 -m venv venv
source venv/bin/activate
pip install .

Create .env with your API key:

echo "ANTHROPIC_API_KEY=sk-ant-api03-xxxxxxxxxxxxx" > .env

Running

Quick test (3 runs):

python main.py

Full evaluation (30 runs):

pytest

(If you run into memory issues, run python main.py directly with the same parameters used in test_task.py.)

The Task

Given a pre-trained language model (GPT-2), the agent must:

  1. Apply quantization to reduce model size by at least 50%
  2. Keep accuracy degradation under 7% (perplexity ratio ≤ 1.07x)
  3. Submit code that defines both original_model and quantized_model

The grader executes the submitted code and independently measures both models; no self-reported metrics are trusted, everything is verified.
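The perplexity-ratio criterion follows from the standard definition: perplexity is the exponential of the mean per-token negative log-likelihood, so the ratio of two perplexities is exp of the difference of their mean losses. A minimal sketch (the per-token loss values below are made up for illustration; the real grader evaluates both models on actual text):

```python
import math

def perplexity(nlls):
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(nlls) / len(nlls))

# Hypothetical per-token losses for the two models on the same eval text.
orig_ppl = perplexity([3.20, 3.35, 3.10])
quant_ppl = perplexity([3.25, 3.41, 3.17])

ratio = quant_ppl / orig_ppl
print(f"perplexity ratio: {ratio:.3f}")  # must be <= 1.07 to pass
```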

Multiple approaches work: FP16 (.half()), INT8 (torch.quantization or bitsandbytes), custom quantization, etc.
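The FP16 route can be sketched as follows. This is a toy illustration using a tiny stand-in nn.Sequential rather than GPT-2, and model_bytes is a hypothetical helper, not part of the task; the shape of the submission (an original_model and a quantized_model) is what carries over:

```python
import copy
import torch.nn as nn

def model_bytes(model):
    """Total bytes across parameters and buffers."""
    return sum(t.numel() * t.element_size()
               for t in list(model.parameters()) + list(model.buffers()))

# Tiny stand-in model; the real task loads GPT-2 via transformers.
original_model = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 64))

# Deep-copy first so the original stays FP32, then cast the copy to FP16.
quantized_model = copy.deepcopy(original_model).half()

reduction = 1 - model_bytes(quantized_model) / model_bytes(original_model)
print(f"size reduction: {reduction:.0%}")  # FP32 -> FP16 halves every tensor
```

Note that FP16 lands exactly on the 50% threshold, so it passes the size check with no margin; summing over buffers as well as parameters matters for the real model.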

Why This Task

Model quantization is a real production skill. This task requires understanding size/accuracy tradeoffs, not just calling an API. Expected pass rate: 10-40% on Claude Haiku.

Common failure modes:

  • Not aggressive enough (e.g., forgot to quantize buffers, only got 48% reduction)
  • Too aggressive (accuracy drops too much)
  • In-place operations (forgot deepcopy, both models become quantized)
  • Wrong method for hardware (bitsandbytes doesn't work on CPU)
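The in-place pitfall above is easy to reproduce: nn.Module.half() converts the module in place and returns the same object, so without a deepcopy both names point at one quantized model. A minimal sketch:

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(8, 8)

# Pitfall: .half() mutates the module and returns self,
# so "both" models end up quantized.
aliased = model.half()
print(aliased is model)  # True: one model, two names

# Fix: copy first, then convert only the copy.
original = nn.Linear(8, 8)
quantized = copy.deepcopy(original).half()
print(next(original.parameters()).dtype)   # torch.float32
print(next(quantized.parameters()).dtype)  # torch.float16
```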

Configuration

Edit main.py to change:

  • NUM_RUNS: number of test runs (default: 3)
  • MAX_STEPS: max agent steps per run (default: 25)
  • VERBOSE: show detailed transcripts (default: True)
  • CONCURRENT: run tests in parallel (default: False)

All task logic lives in task.py. The test suite (test_task.py) is read-only.

Requirements

  • Python 3.12+
  • PyTorch with transformers
  • ~500MB disk space for GPT-2 model
  • API costs: ~$0.05-0.10 per run, ~$2-3 for full 30-run evaluation
