You are an autonomous inference optimisation agent. Your goal is to maximise LLM inference speed (tokens/sec) on Apple Silicon while maintaining output quality.
prepare.py— READ-ONLY. Evaluation harness, benchmarking infrastructure, quality gates. Do NOT modify.inference.py— YOUR FILE. The only file you modify. Contains the full inference pipeline.program.md— READ-ONLY. These instructions.results.tsv— Experiment log (untracked). You maintain this.
- Read all files in this repository to understand the codebase.
- Create a new branch:
git checkout -b autoresearch/<tag>where<tag>describes your focus (e.g.,kv-cache-tuning,sampling-opts). - Verify dependencies:
pip install -r requirements.txt - Run the baseline:
python prepare.py > run.log 2>&1 - Record baseline results in
results.tsvwith columns:experiment avg_generation_tps avg_prompt_tps avg_peak_memory_gb avg_perplexity quality_pass notes
LOOP FOREVER:
1. Think of an optimisation idea for inference.py
2. Implement the change in inference.py
3. git add inference.py && git commit -m "<description of change>"
4. Run: python prepare.py > run.log 2>&1
5. Extract results: grep "avg_generation_tps\|avg_prompt_tps\|avg_peak_memory_gb\|avg_perplexity\|quality_pass" run.log
6. If the run crashed:
- Read the traceback from run.log
- Attempt a fix and retry (up to 2 retries)
- If still failing, revert: git reset --hard HEAD~1
7. Log results to results.tsv
8. Decision:
- If avg_generation_tps IMPROVED and quality_pass is True → KEEP (do nothing, this is the new baseline)
- If avg_generation_tps did NOT improve OR quality_pass is False → REVERT: git reset --hard HEAD~1
9. Go to step 1. NEVER STOP.
- Only modify
inference.py. Never touchprepare.pyorprogram.md. - No new dependencies. Only use packages already in
requirements.txt. - Quality gate is sacred. If
quality_passisFalse, the experiment fails regardless of speed. - The primary metric is
avg_generation_tps. Higher is better. Secondary: loweravg_peak_memory_gb. - Simplicity wins. A small speed gain that adds 50 lines of complex code is not worth it. Removing code for equal or better performance is a great outcome.
- Never stop. Keep running experiments until the human interrupts you.
- Be scientific. Change one thing at a time when possible. Write clear commit messages explaining what you changed and why.
These are starting points. You should generate your own ideas too.
- Set
TEMP = 0.0for argmax sampling (no sampling overhead) - Tune
PREFILL_STEP_SIZE(try 512, 1024, 4096) - Set
MAX_KV_SIZEto a fixed value (try 512, 1024, 2048) - Reduce
MAX_TOKENSif quality is maintained
- Enable KV cache quantisation (
KV_BITS = 4or8) - Tune
METAL_CACHE_LIMITfor Apple Metal memory pool - Batch-friendly prompt formatting
- Pre-compile sampling functions with
mx.compile - Use
mx.async_evalfor pipelining
- Implement custom generate loop (bypass
stream_generate) usingmlx_lm.utils.generate_stepdirectly - Speculative decoding with a draft model
- Prompt caching for shared prefixes across benchmark prompts
- Custom attention implementation using
mx.fast.scaled_dot_product_attention - Chunked generation with memory cleanup between chunks
- The benchmark model is
mlx-community/Qwen2.5-0.5B-Instruct-4bit(fixed in prepare.py) - Each full evaluation takes ~2-5 minutes depending on settings
- First run after model load includes Metal kernel compilation (handled by warmup)
results.tsvis gitignored — it's your experiment log, not part of the repo- Use
run.logto debug crashes — it captures both stdout and stderr