A verifiers RLM environment for studying whether adaptive recursive search can outperform brittle manual RAG choreography on long synthetic corpora with overrides, contradictions, stale snapshots, and cross-document joins.
This environment is motivated by a narrow, testable Bitter Lesson claim about retrieval-heavy agent systems.
When models are weak or context is scarce, builders often compensate with hand-authored retrieval policy: fixed chunk sizes, fixed top-k retrieval, fixed reranking stages, fixed summarization schedules, fixed context-compaction rules, and fixed read orders. These recipes can work surprisingly well on tidy benchmarks, but much of the apparent competence lives in the human-written choreography rather than in the learned policy.
The Bitter Lesson suggests that this kind of handcrafted inner loop is often temporary. Over time, systems that can spend computation on search, adaptation, and learning tend to outperform systems that hard-code human intuitions about the right intermediate procedure. In the retrieval setting, the implication is not that external memory, tools, or document access disappear. The implication is narrower: the durable part of the stack is the interface to documents, tools, and feedback; the fragile part is the human-authored policy that prespecifies exactly how retrieval should happen.
adaptive_rag_rlm turns that idea into a trainable benchmark. Instead of rewarding one blessed RAG recipe, it asks whether an RLM-style policy can learn when to:
- scan broadly before reading deeply
- branch into targeted sub-searches
- reconcile stale snapshots with newer overrides
- join structured and unstructured evidence
- compress or rank candidate evidence adaptively
- cite only the documents that truly justify the answer
The benchmark is deliberately designed so that common handcrafted shortcuts are unstable. Filenames are opaque. MANIFEST.json exposes only weak metadata. Distractors are paraphrased rather than trivially irrelevant. Many tasks require recency resolution, specificity resolution, or multi-hop joins across documents. A policy that relies on lexical proximity, filename authority, or one fixed retrieval recipe should degrade as corpus size, contradiction rate, and evidence-hop complexity increase.
This makes the benchmark useful both as an evaluation and as an RL training environment: it does not ask whether a model can retrieve from a corpus in the abstract, but whether it can learn a retrieval-and-reading strategy that is more robust than brittle manual choreography.
This environment is not testing whether retrieval itself is unnecessary.
It is testing whether adaptive search policies can outperform manually choreographed retrieval policies under realistic failure modes.
You can think of the benchmark as contrasting two approaches:
Manual RAG choreography
- fixed chunking
- fixed top-k retrieval
- fixed reranking and summarization passes
- static context handoff rules
- one retrieval recipe applied to every instance
Adaptive retrieval behavior
- decide when to scan broadly vs. read deeply
- decide when to spawn sub-searches
- decide when to aggregate or compress candidate evidence
- decide how much evidence is enough before answering
- adapt the retrieval strategy to corpus scale, noise, and conflict structure
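As an illustration only (these helper names are hypothetical and not part of the package), the contrast can be sketched as a fixed recipe versus a loop that decides how much to read:

```python
# Illustrative sketch: hypothetical helpers contrasting fixed top-k
# choreography with an adaptive stopping rule. Not the package's API.

def lexical_score(query: str, doc: str) -> int:
    # Crude lexical overlap: count shared lowercase tokens.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def fixed_topk(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    # Manual choreography: always retrieve exactly k documents.
    ranked = sorted(corpus, key=lambda d: lexical_score(query, corpus[d]), reverse=True)
    return ranked[:k]

def adaptive_retrieve(query: str, corpus: dict[str, str]) -> list[str]:
    # Adaptive behavior: keep widening the read set until documents stop
    # carrying any signal, instead of committing to a fixed k up front.
    ranked = sorted(corpus, key=lambda d: lexical_score(query, corpus[d]), reverse=True)
    kept = []
    for doc_id in ranked:
        if lexical_score(query, corpus[doc_id]) == 0:
            break  # no remaining evidence worth reading
        kept.append(doc_id)
    return kept

corpus = {
    "doc_a": "handoff code updated to R-7421 in the latest note",
    "doc_b": "stale snapshot lists handoff code R-1100",
    "doc_c": "unrelated changelog entry about packaging",
}
print(fixed_topk("latest handoff code", corpus))
print(adaptive_retrieve("latest handoff code", corpus))
```

The point of the benchmark is that the stopping rule, branching decisions, and reconciliation logic should be learned rather than hard-coded, and should respond to corpus scale and conflict structure.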
The environment is therefore best interpreted as a benchmark for the proposition that, in retrieval-heavy agent systems, learned or search-based policies can replace a meaningful fraction of today’s hand-written retrieval logic.
A strong result here would support a claim like:
As corpora become larger, noisier, more contradictory, and more update-heavy, fixed retrieval choreography degrades faster than adaptive recursive search policies.
A weak or negative result would also be informative. It would suggest that at least in this regime, manual retrieval structure is still doing irreplaceable work.
The corpus generator is intentionally hostile to brittle retrieval shortcuts.
It includes:
- stale snapshots and explicit overrides
- contradictory or near-duplicate documents
- structured and unstructured files in the same corpus
- multi-hop joins where the answer is not stated in one place
- opaque filenames and low-information manifest metadata
These choices are meant to break the assumptions behind many standard RAG pipelines. A benchmark that leaks the answer through filenames, lexical nearest-neighbor matches, or overly clean corpora does not really test adaptive retrieval strategy. This one tries to.
The benchmark scores both the final answer and the cited evidence.
That matters for the Bitter Lesson framing because many brittle retrieval systems can produce plausible answers without demonstrating that they actually found the right support. By requiring exact answer matching together with evidence-set quality, adaptive_rag_rlm evaluates not just answer generation but the quality of the retrieval-and-reading policy that produced it.
This is especially important in environments with overrides, stale documents, and paraphrased distractors, where the difference between “found something plausible” and “found the correct supporting evidence” is the whole point.
Each example is seeded and fully procedural. The generator creates mixed-file corpora with:
- markdown notes and handoff docs
- logs and changelog files
- CSV and JSON tables
- short config snippets
- paraphrased distractors and stale snapshots
Task families include:
- latest override selection by recency
- specificity override resolution
- multi-hop request-to-entity-to-route joins
- structured plus unstructured joins for reviewer routing
The corpus layout is materialized lazily at rollout time. Filenames are opaque, and MANIFEST.json lists only path, kind, byte size, and content hash. Evidence IDs must be recovered from document contents rather than filenames or the manifest.
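The exact manifest schema is not reproduced here, but based on the fields listed above (path, kind, byte size, content hash), a single entry might look roughly like this sketch; the key names and hash algorithm are assumptions, not the generator's actual choices:

```python
import hashlib
import json

# Hypothetical sketch of one MANIFEST.json entry. Field names follow the
# README's description (path, kind, byte size, content hash); the real
# generator's key names and hash algorithm may differ.
content = b"handoff code updated to R-7421"
entry = {
    "path": "docs/a91f3c.md",  # opaque filename: no semantic hint
    "kind": "markdown",
    "bytes": len(content),
    "sha256": hashlib.sha256(content).hexdigest(),
}
print(json.dumps(entry, indent=2))
# Note: nothing in this metadata reveals an evidence ID; those must be
# recovered by reading the document bodies themselves.
```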
The model must end with a raw JSON object:
{"answer": "...", "evidence_ids": ["doc_..."]}Scoring is strict:
- malformed JSON gets `output_json_valid = 0` and zero reward
- `answer_exact` is binary exact match on the oracle answer string
- `evidence_f1` is computed from the predicted and oracle evidence ID sets
- main reward is `0.7 * answer_exact + 0.3 * evidence_f1`
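A minimal sketch of the scoring rule described above (the real scorer lives in the environment; this just restates the formula):

```python
def evidence_f1(pred: set[str], oracle: set[str]) -> float:
    # Standard set F1 over evidence IDs.
    if not pred or not oracle:
        return 0.0
    tp = len(pred & oracle)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(oracle)
    return 2 * precision * recall / (precision + recall)

def main_reward(answer: str, oracle_answer: str,
                pred_ids: set[str], oracle_ids: set[str]) -> float:
    # 0.7 * binary exact match + 0.3 * evidence-set F1, per the README.
    answer_exact = 1.0 if answer == oracle_answer else 0.0
    return 0.7 * answer_exact + 0.3 * evidence_f1(pred_ids, oracle_ids)

# Correct answer but one spurious citation: full answer credit,
# partial evidence credit.
print(main_reward("R-7421", "R-7421", {"doc_a", "doc_b"}, {"doc_a"}))
```

Note how an over-broad evidence set is penalized through precision even when the answer string is exactly right, which is what makes citation quality part of the learned policy.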
The intended finalization path is the RLM answer channel: in the Python REPL, set `answer["content"]` to the raw JSON string and then set `answer["ready"] = True`.
If the model misses `answer["ready"]` but leaves behind either a valid raw JSON object or a plain-text `answer["content"] = '...json...'` assignment in the final root assistant text, the environment captures that as an `assistant_fallback` instead of dropping the sample on formatting alone.
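Concretely, the final REPL step might look like the following; in the real scaffold the `answer` dict already exists, so the stand-in here is only to make the snippet run standalone, and the payload values are hypothetical:

```python
import json

# Stand-in for the scaffold-provided answer channel; in the real RLM
# REPL `answer` already exists and should not be recreated.
answer = {"content": "", "ready": False}

# Hypothetical final answer and evidence IDs.
payload = {"answer": "R-7421", "evidence_ids": ["doc_a", "doc_b"]}
answer["content"] = json.dumps(payload)
answer["ready"] = True

print(answer["content"])
```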
Additional metrics include evidence precision and recall, exact evidence-set match, corpus size stats, rollout runtime, final-answer provenance, and built-in RLM monitor metrics such as sub-LLM batch count and mean batch size.
```bash
uv sync --group dev
uv pip install -e .
```

The package exposes `load_environment()`.
Verify import and loading:

```bash
uv run python - <<'PY'
import verifiers as vf
env = vf.load_environment("adaptive_rag_rlm", difficulty="easy", num_train=2, num_eval=2)
print(type(env).__name__, len(env.get_eval_dataset()))
PY
```

Run a local easy smoke eval:
```bash
prime eval run adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 3 -r 1 \
  -a '{"difficulty":"easy","max_turns":12,"include_env_tips":true}' \
  --debug
```

Run a medium eval that is more likely to exercise `llm_batch`:
```bash
prime eval run adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 8 -r 1 -c 4 \
  -a '{"difficulty":"medium","include_env_tips":true}' \
  --debug
```

If your Prime token does not have `tunnel:write`, you can still run adaptive_rag_rlm locally by supplying your own public interception URL. This is a developer-only fallback when Prime Tunnel auth is unavailable. It is not the primary hosted runtime path.
Use the public base origin only. adaptive_rag_rlm appends the rollout path automatically.
```bash
ngrok http 8765
```

Then run:
```bash
prime eval run adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 3 -r 1 \
  -a '{"difficulty":"easy","max_turns":12,"include_env_tips":true,"interception_url":"https://<your-ngrok-url>","interception_port":8765}' \
  --debug
```

Alternatively, with cloudflared:

```bash
cloudflared tunnel --url http://127.0.0.1:8765
```

Then run:
```bash
prime eval run adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 3 -r 1 \
  -a '{"difficulty":"easy","max_turns":12,"include_env_tips":true,"interception_url":"https://<your-cloudflare-url>","interception_port":8765}' \
  --debug
```

Notes:
- use the public origin only, not a rollout-specific path
- keep `interception_port` aligned with the local tunnel target
- run one tunnel-backed eval at a time on a given local port
- in external-tunnel mode, `adaptive_rag_rlm` caps `max_sub_llm_parallelism` to `2`
- for medium and hard local sweeps, prefer `prime eval run -c 4`
- hosted Hub evals should use a `PRIME_API_KEY` with `tunnel:write`
Inspect one generated corpus and write its oracle alongside the files:
```bash
uv run python scripts/inspect_corpus.py --difficulty easy --seed 0 --example-index 0 --output-dir ./outputs/example0
```

Run the naive fixed-chunk lexical baseline on one example:
```bash
uv run python scripts/fixed_chunk_baseline.py --difficulty easy --seed 0 --example-index 0
```

Run the integration checker from source:
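The script's internals aren't reproduced here, but a fixed-chunk lexical baseline of the general kind it implements can be sketched as follows; all names are hypothetical and the real script may differ:

```python
# Hypothetical sketch of a fixed-chunk lexical baseline; not the actual
# contents of scripts/fixed_chunk_baseline.py.

def chunk(text: str, size: int = 40) -> list[str]:
    # Fixed-size character chunks: the hallmark of manual choreography.
    return [text[i:i + size] for i in range(0, len(text), size)]

def baseline_answer(query: str, docs: dict[str, str], top_k: int = 3) -> list[tuple[str, str]]:
    # Score every chunk by lexical overlap with the query and return the
    # top-k (doc_id, chunk) pairs: no recency or specificity reasoning.
    q = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        for c in chunk(text):
            scored.append((len(q & set(c.lower().split())), doc_id, c))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(doc_id, c) for _, doc_id, c in scored[:top_k]]

docs = {
    "doc_a": "the handoff code is R-7421",
    "doc_b": "unrelated packaging notes",
}
print(baseline_answer("handoff code", docs, top_k=1))
```

Because the scoring is purely lexical and the chunk boundaries are fixed, a paraphrased distractor or an override stated in different words can outrank the true evidence, which is exactly the failure mode the corpus generator targets.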
```bash
uv run python scripts/check_prime_env_integration.py --env-dir . --load-name adaptive_rag_rlm
```

| Arg | Type | Default | Description |
|---|---|---|---|
| `difficulty` | `easy` \| `medium` \| `hard` | `"medium"` | Chooses the default corpus scale. |
| `include_env_tips` | `bool` | `false` | Adds a short strategy hint about adaptive scanning and evidence tracking. |
| `prompt_in_context_file` | `bool` | `false` | Writes the semantic query to TASK_QUERY.md and keeps the prompt short. |
| `max_turns` | `int` | `20` | Root-model turn budget for the RLM loop. |
| `sub_model` | `str` \| `None` | `None` | Optional model override for `llm_batch` subcalls. |
| `max_sub_llm_parallelism` | `int` | `5` | Concurrency cap for `llm_batch`. |
| `repl_language` | `python` \| `bash` | `"python"` | Root REPL language. Python is the default validated path. |
| `seed` | `int` | `0` | Global seed for deterministic corpus generation. |
| `num_docs` | `int` \| `None` | preset | Override for corpus document count. |
| `approx_context_size` | `int` \| `None` | preset | Override for approximate corpus byte size. |
| `num_entities` | `int` \| `None` | preset | Override for the number of synthetic entities in the world model. |
| `num_overrides` | `int` \| `None` | preset | Override for stale or conflicting updates. |
| `hop_count` | `int` \| `None` | preset | Override for the intended reasoning depth. |
| `distractor_similarity` | `float` \| `None` | preset | Controls how lexically similar distractors are to the target path. |
| `task_family` | `latest_handoff_code` \| `specificity_fingerprint` \| `package_route` \| `reviewer_handle` \| `None` | `None` | Optional fixed task family for targeted debugging or slice-specific validation. |
| `num_train` | `int` \| `None` | preset | Training split size. |
| `num_eval` | `int` \| `None` | preset | Evaluation split size. |
| `interception_url` | `str` \| `None` | `None` | Optional public base URL for local sub-LLM interception when Prime Tunnel is unavailable. |
| `interception_port` | `int` \| `None` | `None` | Local interception server port used with `interception_url`. Defaults to 8765 when `interception_url` is set. |
| `root_prompt_verbosity` | `light` \| `medium` \| `heavy` | `"heavy"` | RLM scaffold verbosity for the root model. |
| `sub_prompt_verbosity` | `light` \| `medium` \| `heavy` | `"medium"` | Prompt verbosity for sub-LLM calls. |
Preset scales:

- `easy`: 8 docs, about 7 KB, 3 entities, 1 override
- `medium`: 40 docs, about 80 KB, 10 entities, 8 overrides
- `hard`: 120 docs, about 250 KB, 24 entities, 20 overrides
Run the local validation gates from the repo root:
```bash
uv run --with pytest python -m pytest tests/test_adaptive_rag_rlm.py
uv build --out-dir dist
uv run --with twine python -m twine check dist/*
uv run python scripts/check_prime_env_integration.py --env-dir . --load-name adaptive_rag_rlm
```

For local external-tunnel medium validation:
```bash
prime eval run adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 32 -r 1 -c 4 \
  -a '{"difficulty":"medium","include_env_tips":true,"interception_url":"https://<your-ngrok-url>","interception_port":8765}' \
  --debug
```

Push a private Prime Hub build after local checks pass:

```bash
prime env push --path . --name adaptive_rag_rlm --visibility PRIVATE
```

Run a hosted smoke eval after publishing:
```bash
prime eval run casella/adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 3 -r 1 \
  -a '{"difficulty":"easy","max_turns":12,"include_env_tips":true}' \
  --hosted --allow-sandbox-access --follow --debug
```

Hosted RLMEnv execution requires a `PRIME_API_KEY` with `tunnel:write`, because sub-LLM interception is routed through Prime Tunnel unless `interception_url` is provided. The external interceptor flow in this README is intended for local development on your own machine.
- The environment is fully procedural and deterministic by seed.
- The published benchmark contract assumes the Python REPL; `bash` remains experimental until explicitly validated in release checks.
- `RLMEnv` is imported from `verifiers.envs.experimental.rlm_env`, which is the current installed location in `verifiers 0.1.11`.