# Benchmark Results

4,800 evaluations across 3 models, 12 tasks, and up to 3 runners per model.

## Model Scaling

|                        | GPT-5-nano | GPT-5-mini | GPT-5.2 |
|------------------------|------------|------------|---------|
| minRLM Accuracy        | 53.7%      | 72.7%      | 78.2%   |
| Vanilla Accuracy       | 63.2%      | 69.5%      | 48.2%   |
| Official RLM Accuracy  | 43.3%      | 69.7%      | -       |
| minRLM vs Vanilla      | -9.5pp     | +3.2pp     | +30.0pp |
| Evaluations            | 1,800      | 1,800      | 1,200   |

The RLM advantage grows with model capability: from -9.5pp on the small model, to +3.2pp on the mid-tier, to +30.0pp on the frontier model.
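The "minRLM vs Vanilla" row is just the difference of the two accuracy rows; a trivial sketch using the figures from the table above:

```python
# Recompute the "minRLM vs Vanilla" deltas from the per-model accuracies above.
accuracy = {
    "gpt-5-nano": {"minrlm": 53.7, "vanilla": 63.2},
    "gpt-5-mini": {"minrlm": 72.7, "vanilla": 69.5},
    "gpt-5.2":    {"minrlm": 78.2, "vanilla": 48.2},
}

for model, acc in accuracy.items():
    delta = acc["minrlm"] - acc["vanilla"]
    print(f"{model}: {delta:+.1f}pp")  # -9.5pp, +3.2pp, +30.0pp
```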


## GPT-5-mini (Primary Benchmark)

Model: gpt-5-mini | Evaluations: 1,800 | Tasks: 12 | Iterations: 50 per task per runner | Date: 2026-03-13

Three runners compared: minRLM (this implementation), Vanilla (direct LLM call), Official RLM (paper's reference implementation).

### Summary

|                        | minRLM | Vanilla LLM | Official RLM |
|------------------------|--------|-------------|--------------|
| Accuracy               | 72.7%  | 69.5%       | 69.7%        |
| Avg Tokens             | 8,151  | 20,967      | 29,327       |
| Avg Latency            | 25.8s  | 24.2s       | 60.9s        |
| Total Cost (600 evals) | $2.86  | $4.74       | $7.92        |

- minRLM vs Vanilla: 2.6x fewer tokens, 1.7x cheaper, +3.2pp accuracy
- minRLM vs Official: 3.6x fewer tokens, 2.8x cheaper, +3.0pp accuracy

*Chart: Summary Dashboard*

### Accuracy by Task

| Task          | minRLM | Vanilla | Official | N  |
|---------------|--------|---------|----------|----|
| SNIAH         | 94%    | 100%    | 76%      | 50 |
| OOLONG        | 92%    | 78%     | 80%      | 50 |
| GDP Val       | 86%    | 54%     | 50%      | 50 |
| IFEval        | 84%    | 78%     | 78%      | 50 |
| MMLU-Pro      | 82%    | 90%     | 86%      | 50 |
| LiveCodeBench | 80%    | 64%     | 60%      | 50 |
| AIME 2025     | 74%    | 88%     | 84%      | 50 |
| GPQA Diamond  | 70%    | 66%     | 74%      | 50 |
| BrowseComp    | 62%    | 16%     | 66%      | 50 |
| RepoQA        | 62%    | 98%     | 96%      | 50 |
| LongBench V2  | 46%    | 56%     | 48%      | 50 |
| CodeQA        | 40%    | 46%     | 38%      | 50 |

minRLM leads all three runners on 4 of 12 tasks (OOLONG, GDP Val, IFEval, LiveCodeBench), beats Official RLM on SNIAH, and beats Vanilla on BrowseComp. Vanilla collapses on BrowseComp (16%) because the full context exceeds its token limit.

*Chart: Accuracy per Task*

### Token Efficiency by Task

Sorted by minRLM savings vs Official RLM.

| Task          | minRLM | Vanilla | Official | vs Vanilla | vs Official |
|---------------|--------|---------|----------|------------|-------------|
| CodeQA        | 9,724  | 95,332  | 78,232   | 9.8x       | 8.0x        |
| LongBench V2  | 10,767 | 87,813  | 83,807   | 8.2x       | 7.8x        |
| BrowseComp    | 10,740 | 34,084  | 68,354   | 3.2x       | 6.4x        |
| SNIAH         | 6,328  | 3,758   | 16,283   | -          | 2.6x        |
| OOLONG        | 6,184  | 12,196  | 14,373   | 2.0x       | 2.3x        |
| RepoQA        | 8,026  | 3,958   | 17,944   | -          | 2.2x        |
| GPQA Diamond  | 6,679  | 2,140   | 14,272   | -          | 2.1x        |
| GDP Val       | 12,007 | 4,236   | 20,458   | -          | 1.7x        |
| IFEval        | 5,963  | 1,360   | 9,316    | -          | 1.6x        |
| AIME 2025     | 7,951  | 3,965   | 11,300   | -          | 1.4x        |
| MMLU-Pro      | 6,341  | 885     | 8,461    | -          | 1.3x        |
| LiveCodeBench | 7,106  | 1,877   | 9,128    | -          | 1.3x        |

"-" means Vanilla used fewer tokens than minRLM on that task. minRLM uses fewer tokens than Official RLM on every task (1.3x-8.0x).
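The savings multipliers are simple ratios of the per-task average token counts; for example, the CodeQA row above:

```python
# Savings multipliers are ratios of average token counts (CodeQA row above).
minrlm, vanilla, official = 9_724, 95_332, 78_232

print(f"vs Vanilla:  {vanilla / minrlm:.1f}x")   # 9.8x
print(f"vs Official: {official / minrlm:.1f}x")  # 8.0x
```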

*Charts: Tokens per Task, Token Savings*

### Cost by Task

50 evaluations per runner per task.

Aggregate totals: minRLM $2.86, Vanilla $4.74, Official $7.92 (600 evals each).

minRLM is cheaper than Official RLM on every task. minRLM is cheaper than Vanilla on tasks with large context (BrowseComp, CodeQA, LongBench V2, OOLONG, RepoQA).

*Charts: Cost per Task, Accuracy vs Cost*

### Latency by Task

| Task          | minRLM | Vanilla | Official | Faster than Official |
|---------------|--------|---------|----------|----------------------|
| OOLONG        | 15.0s  | 40.4s   | 32.7s    | 2.2x                 |
| SNIAH         | 13.8s  | 2.1s    | 30.4s    | 2.2x                 |
| IFEval        | 18.0s  | 19.8s   | 33.0s    | 1.8x                 |
| LiveCodeBench | 18.4s  | 20.3s   | 26.9s    | 1.5x                 |
| MMLU-Pro      | 18.5s  | 11.0s   | 26.3s    | 1.4x                 |
| CodeQA        | 20.8s  | 19.8s   | 84.4s    | 4.1x                 |
| LongBench V2  | 21.9s  | 23.8s   | 96.4s    | 4.4x                 |
| GPQA Diamond  | 24.7s  | 27.5s   | 77.0s    | 3.1x                 |
| RepoQA        | 25.3s  | 8.7s    | 23.9s    | 0.9x                 |
| BrowseComp    | 27.8s  | 7.0s    | 123.6s   | 4.4x                 |
| AIME 2025     | 39.0s  | 54.9s   | 74.3s    | 1.9x                 |
| GDP Val       | 66.3s  | 55.6s   | 102.1s   | 1.5x                 |

minRLM is faster than Official RLM on 11 of 12 tasks; RepoQA is the lone exception (0.9x).

*Charts: Latency per Task, Accuracy vs Latency*

### Iterations by Task

| Task          | minRLM Avg Iterations |
|---------------|-----------------------|
| OOLONG        | 1.0                   |
| CodeQA        | 1.0                   |
| LongBench V2  | 1.0                   |
| SNIAH         | 1.0                   |
| GPQA Diamond  | 1.0                   |
| MMLU-Pro      | 1.0                   |
| IFEval        | 1.0                   |
| BrowseComp    | 1.1                   |
| RepoQA        | 1.1                   |
| AIME 2025     | 1.1                   |
| LiveCodeBench | 1.1                   |
| GDP Val       | 1.2                   |

Most tasks complete in a single iteration. GDP Val occasionally requires a second pass.


## GPT-5.2

Model: gpt-5.2 | Evaluations: 1,200 | Tasks: 12 | Iterations: 50 per task per runner | Date: 2026-03-15

Two runners compared: minRLM and Vanilla. No official RLM runner for this model.

### Summary

|                        | minRLM | Vanilla LLM |
|------------------------|--------|-------------|
| Accuracy               | 78.2%  | 48.2%       |
| Avg Tokens             | 8,096  | 14,196      |
| Avg Latency            | 20.4s  | 8.0s        |
| Total Cost (600 evals) | $18.93 | $16.50      |

minRLM vs Vanilla: +30.0pp accuracy and 1.8x fewer tokens, at the cost of higher latency (20.4s vs 8.0s) and slightly higher total cost ($18.93 vs $16.50).

### Accuracy by Task

| Task          | minRLM | Vanilla | N  |
|---------------|--------|---------|----|
| SNIAH         | 100%   | 100%    | 50 |
| AIME 2025     | 96%    | 0%      | 50 |
| OOLONG        | 96%    | 64%     | 50 |
| MMLU-Pro      | 92%    | 42%     | 50 |
| RepoQA        | 84%    | 98%     | 50 |
| IFEval        | 82%    | 76%     | 50 |
| GPQA Diamond  | 76%    | 46%     | 50 |
| GDP Val       | 74%    | 50%     | 50 |
| BrowseComp    | 72%    | 14%     | 50 |
| LiveCodeBench | 66%    | 42%     | 50 |
| CodeQA        | 56%    | 20%     | 50 |
| LongBench V2  | 44%    | 26%     | 50 |

minRLM wins 10 of 12 tasks. Vanilla wins only on RepoQA (full-context retrieval) and ties on SNIAH. AIME 2025 is the most dramatic flip: 96% vs 0%.


## GPT-5-nano

Model: gpt-5-nano | Evaluations: 1,800 | Tasks: 12 | Iterations: 50 per task per runner | Date: 2026-03-14

Three runners compared: minRLM, Vanilla, Official RLM.

### Summary

|                        | minRLM | Vanilla LLM | Official RLM |
|------------------------|--------|-------------|--------------|
| Accuracy               | 53.7%  | 63.2%       | 43.3%        |
| Avg Tokens             | 13,811 | 18,137      | 27,176       |
| Avg Latency            | 14.3s  | 23.5s       | 81.2s        |
| Total Cost (600 evals) | $0.74  | $1.16       | $2.68        |

On the smallest model, vanilla LLM outperforms both RLM implementations. minRLM still beats the official RLM by +10.4pp while costing 3.6x less.

### Accuracy by Task

| Task          | minRLM | Vanilla | Official | N  |
|---------------|--------|---------|----------|----|
| SNIAH         | 90%    | 100%    | 56%      | 50 |
| GDP Val       | 82%    | 60%     | 42%      | 50 |
| MMLU-Pro      | 80%    | 92%     | 70%      | 50 |
| OOLONG        | 76%    | 70%     | 34%      | 50 |
| IFEval        | 70%    | 74%     | 48%      | 50 |
| AIME 2025     | 68%    | 86%     | 80%      | 50 |
| GPQA Diamond  | 56%    | 68%     | 58%      | 50 |
| CodeQA        | 38%    | 28%     | 36%      | 50 |
| BrowseComp    | 36%    | 14%     | 28%      | 50 |
| LongBench V2  | 32%    | 34%     | 24%      | 50 |
| RepoQA        | 14%    | 96%     | 32%      | 50 |
| LiveCodeBench | 2%     | 36%     | 12%      | 50 |

minRLM wins on 4 of 12 tasks (GDP Val, OOLONG, CodeQA, BrowseComp). The small model struggles most with code generation (LiveCodeBench: 2%) and code retrieval (RepoQA: 14%).


## Datasets

| Task name                | Dataset                  | Source                             |
|--------------------------|--------------------------|------------------------------------|
| `official_browsecomp`    | BrowseComp-Plus          | Tevatron/browsecomp-plus           |
| `official_sniah`         | RULER NIAH               | tonychenxyz/ruler-full             |
| `official_oolong`        | OOLONG                   | oolongbench/oolong-synth           |
| `official_longbench_v2`  | LongBench-v2             | zai-org/LongBench-v2               |
| `official_codeqa`        | LongBench-v2 (code subset) | zai-org/LongBench-v2             |
| `official_repoqa`        | RepoQA                   | evalplus/repoqa_release            |
| `official_gdpval`        | GDP Val                  | openai/gdpval                      |
| `official_aime_2025`     | AIME 2025                | MathArena/aime_2025                |
| `official_gpqa_diamond`  | GPQA Diamond             | Idavidrein/gpqa (gated)            |
| `official_mmlu_pro`      | MMLU-Pro                 | TIGER-Lab/MMLU-Pro                 |
| `official_ifeval`        | IFEval                   | google/IFEval                      |
| `official_livecodebench` | LiveCodeBench v6         | livecodebench/code_generation_lite |

### Downloading datasets

GDP Val, AIME 2025, GPQA Diamond, MMLU-Pro, IFEval, and LiveCodeBench are auto-downloaded at runtime. GPQA Diamond is a gated dataset: accept the license at huggingface.co/datasets/Idavidrein/gpqa and run `huggingface-cli login` first.

The remaining datasets must be pre-downloaded to `evals/data/`:

```shell
uv run --with datasets,huggingface_hub python -c "
from datasets import load_dataset

datasets = {
    'oolong': ('oolongbench/oolong-synth', None),
    'longbench_v2': ('zai-org/LongBench-v2', None),
    'browsecomp_plus': ('Tevatron/browsecomp-plus', None),
    'ruler_full_mirror': ('tonychenxyz/ruler-full', 'plain'),
}

for name, (repo, config) in datasets.items():
    print(f'Downloading {repo}...')
    ds = load_dataset(repo, config)
    ds.save_to_disk(f'evals/data/{name}')
    print(f'  Saved to evals/data/{name}')
"
```

## Try it (zero-install)

```shell
export OPENAI_API_KEY="your-key"

# Just a task
uvx minrlm "What is the sum of the first 100 primes?"

# Task + file as context
uvx minrlm "How many ERROR lines in the last hour?" ./server.log

# Pipe context from stdin
cat huge_dataset.csv | uvx minrlm "Which product had the highest return rate?"

# Show generated code (-s) and token stats (-v)
uvx minrlm -sv "Return the sum of all primes up to 1,000,000."
# -> Sieve of Eratosthenes in 6,215 tokens, 1 iteration
# -> Answer: 37550402023

uvx minrlm -sv "Return all primes up to 1,000,000, reversed. Return a list of numbers."
# -> 999983, 999979, 999961, 999959, 999953, ...
# -> Tokens: 6,258 | Output: 616,964 chars (~154K tokens) | 25x savings
```
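The "~154K tokens" and "25x savings" figures in the second example can be sanity-checked with the common ~4-characters-per-token heuristic (a rough approximation, not the tokenizer's exact count):

```python
# Rough sanity check of the second example's savings, using the common
# ~4-chars-per-token heuristic (an approximation, not exact BPE counts).
output_chars = 616_964   # characters of output the generated code produced
spent_tokens = 6_258     # tokens minRLM actually consumed

est_output_tokens = output_chars / 4        # tokens if emitted directly
savings = est_output_tokens / spent_tokens  # ratio vs. generating code instead

print(f"~{est_output_tokens / 1000:.0f}K tokens, ~{savings:.0f}x savings")
```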

## Reproduction

```shell
# Install eval dependencies
git clone https://github.com/avilum/minrlm && cd minrlm
uv sync --extra eval

export OPENAI_API_KEY="your-key"

# Quick smoke test (3 tasks, 3 runs each)
uv run python eval/quickstart.py

# Single task
uv run python eval/run.py --model gpt-5-mini --tasks official_sniah --runs 10

# All tasks - single runner
uv run python eval/run.py \
    --model gpt-5-mini \
    --tasks all \
    --runners minrlm-reasoning \
    --runs 50 \
    --parallel 5 \
    --output-dir logs/my_eval

# Full multi-runner benchmark (reproduces the GPT-5-mini table above)
uv run python eval/run.py \
    --tasks all \
    --runners minrlm-reasoning,vanilla,official \
    --runs 50 --parallel 12 --task-parallel 12 \
    --output-dir logs/my_eval

# Cross-model (swap --model for gpt-5-nano or gpt-5.2)
uv run python eval/run.py \
    --model gpt-5.2 \
    --tasks all \
    --runners minrlm-reasoning,vanilla \
    --runs 50 --parallel 12 --task-parallel 12 \
    --output-dir logs/my_eval_gpt52
```

Raw data: