4,800 evaluations across 3 models, 12 tasks, and up to 3 runners per model.
| | GPT-5-nano | GPT-5-mini | GPT-5.2 |
|---|---|---|---|
| minRLM Accuracy | 53.7% | 72.7% | 78.2% |
| Vanilla Accuracy | 63.2% | 69.5% | 48.2% |
| Official RLM Accuracy | 43.3% | 69.7% | - |
| minRLM vs Vanilla | -9.5pp | +3.2pp | +30.0pp |
| Evaluations | 1,800 | 1,800 | 1,200 |
The RLM advantage grows with model capability: from -9.5pp on the small model, to +3.2pp on the mid-tier, to +30.0pp on the frontier model.
Model: gpt-5-mini | Evaluations: 1,800 | Tasks: 12 | Iterations: 50 per task per runner | Date: 2026-03-13
Three runners compared: minRLM (this implementation), Vanilla (direct LLM call), Official RLM (paper's reference implementation).
| | minRLM | Vanilla LLM | Official RLM |
|---|---|---|---|
| Accuracy | 72.7% | 69.5% | 69.7% |
| Avg Tokens | 8,151 | 20,967 | 29,327 |
| Avg Latency | 25.8s | 24.2s | 60.9s |
| Total Cost (600 evals) | $2.86 | $4.74 | $7.92 |
minRLM vs Vanilla: 2.6x fewer tokens, 1.7x cheaper, +3.2pp accuracy.
minRLM vs Official: 3.6x fewer tokens, 2.8x cheaper, +3.0pp accuracy.
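The headline multipliers follow directly from the summary table; a quick sanity check (all numbers copied from the table above):

```python
# Averages copied from the GPT-5-mini summary table above
avg_tokens = {"minrlm": 8_151, "vanilla": 20_967, "official": 29_327}
total_cost = {"minrlm": 2.86, "vanilla": 4.74, "official": 7.92}  # USD per 600 evals

for other in ("vanilla", "official"):
    token_ratio = avg_tokens[other] / avg_tokens["minrlm"]
    cost_ratio = total_cost[other] / total_cost["minrlm"]
    print(f"vs {other}: {token_ratio:.1f}x fewer tokens, {cost_ratio:.1f}x cheaper")
# vs vanilla: 2.6x fewer tokens, 1.7x cheaper
# vs official: 3.6x fewer tokens, 2.8x cheaper
```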
| Task | minRLM | Vanilla | Official | N |
|---|---|---|---|---|
| SNIAH | 94% | 100% | 76% | 50 |
| OOLONG | 92% | 78% | 80% | 50 |
| GDP Val | 86% | 54% | 50% | 50 |
| IFEval | 84% | 78% | 78% | 50 |
| MMLU-Pro | 82% | 90% | 86% | 50 |
| LiveCodeBench | 80% | 64% | 60% | 50 |
| AIME 2025 | 74% | 88% | 84% | 50 |
| GPQA Diamond | 70% | 66% | 74% | 50 |
| BrowseComp | 62% | 16% | 66% | 50 |
| RepoQA | 62% | 98% | 96% | 50 |
| LongBench V2 | 46% | 56% | 48% | 50 |
| CodeQA | 40% | 46% | 38% | 50 |
minRLM is the top scorer on 4 of 12 tasks (OOLONG, GDP Val, IFEval, LiveCodeBench), and additionally beats Official RLM on SNIAH and Vanilla on BrowseComp. Vanilla collapses on BrowseComp (16%) because the full context exceeds its token limit.
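The BrowseComp failure mode illustrates the core RLM idea. A minimal sketch (illustrative only, not minRLM's actual code): the runner keeps the large context in a program variable and lets the model query it with generated code, so only the query results, not the whole corpus, consume prompt tokens.

```python
# Illustrative sketch, NOT minRLM's actual implementation: the context lives
# in a variable inside a REPL; the model writes code against it instead of
# reading it all in the prompt.
context = "\n".join(
    f"doc {i}: " + ("ERROR timeout" if i % 97 == 0 else "ok")
    for i in range(100_000)
)  # well over a million characters, far beyond a single prompt's budget

# Code the model might generate: scan programmatically instead of reading it all
errors = [line for line in context.splitlines() if "ERROR" in line]
print(f"{len(context):,} chars of context, {len(errors)} ERROR lines")
```

Only the final print output (a few dozen characters) ever reaches the model's context window.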
Sorted by minRLM savings vs Official RLM.
| Task | minRLM | Vanilla | Official | vs Vanilla | vs Official |
|---|---|---|---|---|---|
| CodeQA | 9,724 | 95,332 | 78,232 | 9.8x | 8.0x |
| LongBench V2 | 10,767 | 87,813 | 83,807 | 8.2x | 7.8x |
| BrowseComp | 10,740 | 34,084 | 68,354 | 3.2x | 6.4x |
| SNIAH | 6,328 | 3,758 | 16,283 | - | 2.6x |
| OOLONG | 6,184 | 12,196 | 14,373 | 2.0x | 2.3x |
| RepoQA | 8,026 | 3,958 | 17,944 | - | 2.2x |
| GPQA Diamond | 6,679 | 2,140 | 14,272 | - | 2.1x |
| GDP Val | 12,007 | 4,236 | 20,458 | - | 1.7x |
| IFEval | 5,963 | 1,360 | 9,316 | - | 1.6x |
| AIME 2025 | 7,951 | 3,965 | 11,300 | - | 1.4x |
| MMLU-Pro | 6,341 | 885 | 8,461 | - | 1.3x |
| LiveCodeBench | 7,106 | 1,877 | 9,128 | - | 1.3x |
"-" = vanilla uses fewer tokens on that task. minRLM uses fewer tokens than Official RLM on every task (1.3x-8.0x).
50 evaluations per runner per task.
Aggregate totals: minRLM $2.86, Vanilla $4.74, Official $7.92 (600 evals each).
minRLM is cheaper than Official RLM on every task, and cheaper than Vanilla on the tasks with large contexts (BrowseComp, CodeQA, LongBench V2, OOLONG, RepoQA).
| Task | minRLM | Vanilla | Official | Speedup vs Official |
|---|---|---|---|---|
| OOLONG | 15.0s | 40.4s | 32.7s | 2.2x |
| SNIAH | 13.8s | 2.1s | 30.4s | 2.2x |
| IFEval | 18.0s | 19.8s | 33.0s | 1.8x |
| LiveCodeBench | 18.4s | 20.3s | 26.9s | 1.5x |
| MMLU-Pro | 18.5s | 11.0s | 26.3s | 1.4x |
| CodeQA | 20.8s | 19.8s | 84.4s | 4.1x |
| LongBench V2 | 21.9s | 23.8s | 96.4s | 4.4x |
| GPQA Diamond | 24.7s | 27.5s | 77.0s | 3.1x |
| RepoQA | 25.3s | 8.7s | 23.9s | 0.9x |
| BrowseComp | 27.8s | 7.0s | 123.6s | 4.4x |
| AIME 2025 | 39.0s | 54.9s | 74.3s | 1.9x |
| GDP Val | 66.3s | 55.6s | 102.1s | 1.5x |
minRLM is faster than Official RLM on 11 of 12 tasks.
| Task | minRLM Avg Iterations |
|---|---|
| OOLONG | 1.0 |
| CodeQA | 1.0 |
| LongBench V2 | 1.0 |
| SNIAH | 1.0 |
| GPQA Diamond | 1.0 |
| MMLU-Pro | 1.0 |
| IFEval | 1.0 |
| BrowseComp | 1.1 |
| RepoQA | 1.1 |
| AIME 2025 | 1.1 |
| LiveCodeBench | 1.1 |
| GDP Val | 1.2 |
Most tasks complete in a single iteration. GDP Val occasionally requires a second pass.
Model: gpt-5.2 | Evaluations: 1,200 | Tasks: 12 | Iterations: 50 per task per runner | Date: 2026-03-15
Two runners compared: minRLM and Vanilla. No official RLM runner for this model.
| | minRLM | Vanilla LLM |
|---|---|---|
| Accuracy | 78.2% | 48.2% |
| Avg Tokens | 8,096 | 14,196 |
| Avg Latency | 20.4s | 8.0s |
| Total Cost (600 evals) | $18.93 | $16.50 |
minRLM vs Vanilla: +30.0pp accuracy, 1.8x fewer tokens.
| Task | minRLM | Vanilla | N |
|---|---|---|---|
| SNIAH | 100% | 100% | 50 |
| AIME 2025 | 96% | 0% | 50 |
| OOLONG | 96% | 64% | 50 |
| MMLU-Pro | 92% | 42% | 50 |
| RepoQA | 84% | 98% | 50 |
| IFEval | 82% | 76% | 50 |
| GPQA Diamond | 76% | 46% | 50 |
| GDP Val | 74% | 50% | 50 |
| BrowseComp | 72% | 14% | 50 |
| LiveCodeBench | 66% | 42% | 50 |
| CodeQA | 56% | 20% | 50 |
| LongBench V2 | 44% | 26% | 50 |
minRLM wins 10 of 12 tasks. Vanilla wins only on RepoQA (full-context retrieval) and ties on SNIAH. AIME 2025 is the most dramatic flip: 96% vs 0%.
Model: gpt-5-nano | Evaluations: 1,800 | Tasks: 12 | Iterations: 50 per task per runner | Date: 2026-03-14
Three runners compared: minRLM, Vanilla, Official RLM.
| | minRLM | Vanilla LLM | Official RLM |
|---|---|---|---|
| Accuracy | 53.7% | 63.2% | 43.3% |
| Avg Tokens | 13,811 | 18,137 | 27,176 |
| Avg Latency | 14.3s | 23.5s | 81.2s |
| Total Cost (600 evals) | $0.74 | $1.16 | $2.68 |
On the smallest model, vanilla LLM outperforms both RLM implementations. minRLM still beats the official RLM by +10.4pp while costing 3.6x less.
| Task | minRLM | Vanilla | Official | N |
|---|---|---|---|---|
| SNIAH | 90% | 100% | 56% | 50 |
| GDP Val | 82% | 60% | 42% | 50 |
| MMLU-Pro | 80% | 92% | 70% | 50 |
| OOLONG | 76% | 70% | 34% | 50 |
| IFEval | 70% | 74% | 48% | 50 |
| AIME 2025 | 68% | 86% | 80% | 50 |
| GPQA Diamond | 56% | 68% | 58% | 50 |
| CodeQA | 38% | 28% | 36% | 50 |
| BrowseComp | 36% | 14% | 28% | 50 |
| LongBench V2 | 32% | 34% | 24% | 50 |
| RepoQA | 14% | 96% | 32% | 50 |
| LiveCodeBench | 2% | 36% | 12% | 50 |
minRLM wins on 4 of 12 tasks (GDP Val, OOLONG, CodeQA, BrowseComp). The small model struggles most with code generation (LiveCodeBench: 2%) and code retrieval (RepoQA: 14%).
| Task name | Dataset | Source |
|---|---|---|
| official_browsecomp | BrowseComp-Plus | Tevatron/browsecomp-plus |
| official_sniah | RULER NIAH | tonychenxyz/ruler-full |
| official_oolong | OOLONG | oolongbench/oolong-synth |
| official_longbench_v2 | LongBench-v2 | zai-org/LongBench-v2 |
| official_codeqa | LongBench-v2 (code subset) | zai-org/LongBench-v2 |
| official_repoqa | RepoQA | evalplus/repoqa_release |
| official_gdpval | GDP Val | openai/gdpval |
| official_aime_2025 | AIME 2025 | MathArena/aime_2025 |
| official_gpqa_diamond | GPQA Diamond | Idavidrein/gpqa (gated) |
| official_mmlu_pro | MMLU-Pro | TIGER-Lab/MMLU-Pro |
| official_ifeval | IFEval | google/IFEval |
| official_livecodebench | LiveCodeBench v6 | livecodebench/code_generation_lite |
GDP Val, AIME 2025, GPQA Diamond, MMLU-Pro, IFEval, and LiveCodeBench are auto-downloaded at runtime.
GPQA Diamond is a gated dataset - you must accept the license at huggingface.co/datasets/Idavidrein/gpqa and run huggingface-cli login first.
The remaining datasets must be pre-downloaded to evals/data/:
uv run --with datasets,huggingface_hub python -c "
from datasets import load_dataset
datasets = {
    'oolong': ('oolongbench/oolong-synth', None),
    'longbench_v2': ('zai-org/LongBench-v2', None),
    'browsecomp_plus': ('Tevatron/browsecomp-plus', None),
    'ruler_full_mirror': ('tonychenxyz/ruler-full', 'plain'),
}
for name, (repo, config) in datasets.items():
    print(f'Downloading {repo}...')
    ds = load_dataset(repo, config)
    ds.save_to_disk(f'evals/data/{name}')
    print(f'  Saved to evals/data/{name}')
"

export OPENAI_API_KEY="your-key"
# Just a task
uvx minrlm "What is the sum of the first 100 primes?"
# Task + file as context
uvx minrlm "How many ERROR lines in the last hour?" ./server.log
# Pipe context from stdin
cat huge_dataset.csv | uvx minrlm "Which product had the highest return rate?"
# Show generated code (-s) and token stats (-v)
uvx minrlm -sv "Return the sum of all primes up to 1,000,000."
# -> Sieve of Eratosthenes in 6,215 tokens, 1 iteration
# -> Answer: 37550402023
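For reference, the generated program for this task is typically a straightforward sieve. A sketch of the kind of code minRLM emits (illustrative, not the verbatim generated output):

```python
# Illustrative: the sort of Sieve of Eratosthenes minRLM generates for this task
def prime_sum(n):
    sieve = bytearray([1]) * (n + 1)  # sieve[i] == 1 means i is (still) prime
    sieve[0] = sieve[1] = 0
    for p in range(2, int(n**0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(sieve[p * p :: p]))  # mark multiples
    return sum(i for i, is_prime in enumerate(sieve) if is_prime)

print(prime_sum(1_000_000))  # -> 37550402023, matching the answer above
```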
uvx minrlm -sv "Return all primes up to 1,000,000, reversed. Return a list of numbers."
# -> 999983, 999979, 999961, 999959, 999953, ...
# -> Tokens: 6,258 | Output: 616,964 chars (~154K tokens) | 25x savings

# Install eval dependencies
git clone https://github.com/avilum/minrlm && cd minrlm
uv sync --extra eval
export OPENAI_API_KEY="your-key"
# Quick smoke test (3 tasks, 3 runs each)
uv run python eval/quickstart.py
# Single task
uv run python eval/run.py --model gpt-5-mini --tasks official_sniah --runs 10
# All tasks - single runner
uv run python eval/run.py \
--model gpt-5-mini \
--tasks all \
--runners minrlm-reasoning \
--runs 50 \
--parallel 5 \
--output-dir logs/my_eval
# Full multi-runner benchmark (reproduces the GPT-5-mini table above)
uv run python eval/run.py \
--tasks all \
--runners minrlm-reasoning,vanilla,official \
--runs 50 --parallel 12 --task-parallel 12 \
--output-dir logs/my_eval
# Cross-model (swap --model for gpt-5-nano or gpt-5.2)
uv run python eval/run.py \
--model gpt-5.2 \
--tasks all \
--runners minrlm-reasoning,vanilla \
--runs 50 --parallel 12 --task-parallel 12 \
  --output-dir logs/my_eval_gpt52

Raw data:
- GPT-5-mini: BEST_EVALS/BEST_new-entropy-prompts-12-tasks-all-runners-gpt-5-mini-50-runs-BEST/eval_20260313_195547.json
- GPT-5.2: BEST_EVALS/BEST_new-entropy-prompts-12-tasks-all-runners-gpt-5.2-50-runs-after-opus-46/eval_20260315_184830.json
- GPT-5-nano: BEST_EVALS/BEST_new-entropy-prompts-12-tasks-all-runners-gpt-5-nano-50-runs-after-opus-46/eval_20260314_024652.json
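To slice the raw files yourself, something like the sketch below works. The field names ("runs", "runner", "task", "correct") are hypothetical; check the actual JSON schema before relying on them.

```python
import json
from collections import defaultdict

# Hypothetical schema: a top-level "runs" list of per-evaluation records with
# "runner", "task", and boolean "correct" fields. Adjust to the real files.
def accuracy_by_runner_task(path):
    with open(path) as f:
        runs = json.load(f)["runs"]
    total, correct = defaultdict(int), defaultdict(int)
    for r in runs:
        key = (r["runner"], r["task"])
        total[key] += 1
        correct[key] += bool(r["correct"])
    return {key: correct[key] / total[key] for key in total}
```

Pointed at one of the files above, this would reproduce the per-task accuracy columns (modulo the real schema).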







