A verifiers RLM environment for studying whether adaptive recursive search can outperform brittle manual RAG choreography on long synthetic corpora with overrides, contradictions, stale snapshots, and cross-document joins.
This environment is motivated by a narrow, testable Bitter Lesson claim about retrieval-heavy agent systems.
When models are weak or context is scarce, builders often compensate with hand-authored retrieval policy: fixed chunk sizes, fixed top-k retrieval, fixed reranking stages, fixed summarization schedules, fixed context-compaction rules, and fixed read orders. These recipes can work surprisingly well on tidy benchmarks, but much of the apparent competence lives in the human-written choreography rather than in the learned policy.
The Bitter Lesson suggests that this kind of handcrafted inner loop is often temporary. Over time, systems that can spend computation on search, adaptation, and learning tend to outperform systems that hard-code human intuitions about the right intermediate procedure. In the retrieval setting, the implication is not that external memory, tools, or document access disappear. The implication is narrower: the durable part of the stack is the interface to documents, tools, and feedback; the fragile part is the human-authored policy that prespecifies exactly how retrieval should happen.
adaptive_rag_rlm turns that idea into a trainable benchmark. Instead of rewarding one blessed RAG recipe, it asks whether an RLM-style policy can learn when to:
- scan broadly before reading deeply
- branch into targeted sub-searches
- reconcile stale snapshots with newer overrides
- join structured and unstructured evidence
- compress or rank candidate evidence adaptively
- cite only the documents that truly justify the answer
The benchmark is deliberately designed so that common handcrafted shortcuts are unstable. Filenames are opaque. MANIFEST.json exposes only weak metadata. Distractors are paraphrased rather than trivially irrelevant. Many tasks require recency resolution, specificity resolution, or multi-hop joins across documents. A policy that relies on lexical proximity, filename authority, or one fixed retrieval recipe should degrade as corpus size, contradiction rate, and evidence-hop complexity increase.
This makes the benchmark useful both as an evaluation and as an RL training environment: it does not ask whether a model can retrieve from a corpus in the abstract, but whether it can learn a retrieval-and-reading strategy that is more robust than brittle manual choreography.
This environment is not testing whether retrieval itself is unnecessary.
It is testing whether adaptive search policies can outperform manually choreographed retrieval policies under realistic failure modes.
You can think of the benchmark as contrasting two approaches:
Manual RAG choreography
- fixed chunking
- fixed top-k retrieval
- fixed reranking and summarization passes
- static context handoff rules
- one retrieval recipe applied to every instance
Adaptive retrieval behavior
- decide when to scan broadly vs. read deeply
- decide when to spawn sub-searches
- decide when to aggregate or compress candidate evidence
- decide how much evidence is enough before answering
- adapt the retrieval strategy to corpus scale, noise, and conflict structure
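As an illustration only (these helper names are hypothetical and not part of the package), the contrast can be sketched as a fixed recipe versus a loop that decides how much to read:

```python
# Illustrative sketch: hypothetical helpers contrasting fixed top-k
# choreography with an adaptive stopping rule. Not the package's API.

def lexical_score(query: str, doc: str) -> int:
    # Crude lexical overlap: count shared lowercase tokens.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def fixed_topk(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    # Manual choreography: always retrieve exactly k documents.
    ranked = sorted(corpus, key=lambda d: lexical_score(query, corpus[d]), reverse=True)
    return ranked[:k]

def adaptive_retrieve(query: str, corpus: dict[str, str]) -> list[str]:
    # Adaptive behavior: keep widening the read set until documents stop
    # carrying any signal, instead of committing to a fixed k up front.
    ranked = sorted(corpus, key=lambda d: lexical_score(query, corpus[d]), reverse=True)
    kept = []
    for doc_id in ranked:
        if lexical_score(query, corpus[doc_id]) == 0:
            break  # no remaining evidence worth reading
        kept.append(doc_id)
    return kept

corpus = {
    "doc_a": "handoff code updated to R-7421 in the latest note",
    "doc_b": "stale snapshot lists handoff code R-1100",
    "doc_c": "unrelated changelog entry about packaging",
}
print(fixed_topk("latest handoff code", corpus))
print(adaptive_retrieve("latest handoff code", corpus))
```

The point of the benchmark is that the stopping rule, branching decisions, and reconciliation logic should be learned rather than hard-coded, and should respond to corpus scale and conflict structure.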
The environment is therefore best interpreted as a benchmark for the proposition that, in retrieval-heavy agent systems, learned or search-based policies can replace a meaningful fraction of today’s hand-written retrieval logic.
A strong result here would support a claim like:
As corpora become larger, noisier, more contradictory, and more update-heavy, fixed retrieval choreography degrades faster than adaptive recursive search policies.
A weak or negative result would also be informative. It would suggest that at least in this regime, manual retrieval structure is still doing irreplaceable work.
The corpus generator is intentionally hostile to brittle retrieval shortcuts.
It includes:
- stale snapshots and explicit overrides
- contradictory or near-duplicate documents
- structured and unstructured files in the same corpus
- multi-hop joins where the answer is not stated in one place
- opaque filenames and low-information manifest metadata
These choices are meant to break the assumptions behind many standard RAG pipelines. A benchmark that leaks the answer through filenames, lexical nearest-neighbor matches, or overly clean corpora does not really test adaptive retrieval strategy. This one tries to.
The benchmark scores both the final answer and the cited evidence.
That matters for the Bitter Lesson framing because many brittle retrieval systems can produce plausible answers without demonstrating that they actually found the right support. By requiring exact answer matching together with evidence-set quality, adaptive_rag_rlm evaluates not just answer generation but the quality of the retrieval-and-reading policy that produced it.
This is especially important in environments with overrides, stale documents, and paraphrased distractors, where the difference between “found something plausible” and “found the correct supporting evidence” is the whole point.
Each example is seeded and fully procedural. The generator creates mixed-file corpora with:
- markdown notes and handoff docs
- logs and changelog files
- CSV and JSON tables
- short config snippets
- paraphrased distractors and stale snapshots
Task families include:
- latest override selection by recency
- specificity override resolution
- multi-hop request-to-entity-to-route joins
- structured plus unstructured joins for reviewer routing
The corpus layout is materialized lazily at rollout time. Filenames are opaque, and MANIFEST.json lists only path, kind, byte size, and content hash. Evidence IDs must be recovered from document contents rather than filenames or the manifest.
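The exact manifest schema is not reproduced here, but based on the fields listed above (path, kind, byte size, content hash), a single entry might look roughly like this sketch; the key names and hash algorithm are assumptions, not the generator's actual choices:

```python
import hashlib
import json

# Hypothetical sketch of one MANIFEST.json entry. Field names follow the
# README's description (path, kind, byte size, content hash); the real
# generator's key names and hash algorithm may differ.
content = b"handoff code updated to R-7421"
entry = {
    "path": "docs/a91f3c.md",  # opaque filename: no semantic hint
    "kind": "markdown",
    "bytes": len(content),
    "sha256": hashlib.sha256(content).hexdigest(),
}
print(json.dumps(entry, indent=2))
# Note: nothing in this metadata reveals an evidence ID; those must be
# recovered by reading the document bodies themselves.
```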
The model must end with a raw JSON object:
{"answer": "...", "evidence_ids": ["doc_..."]}Scoring is strict:
- malformed JSON gets `output_json_valid = 0` and zero reward
- `answer_exact` is binary exact match on the oracle answer string
- `evidence_f1` is computed from the predicted and oracle evidence ID sets
- main reward is `0.7 * answer_exact + 0.3 * evidence_f1`
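A minimal sketch of the scoring rule described above (the real scorer lives in the environment; this just restates the formula):

```python
def evidence_f1(pred: set[str], oracle: set[str]) -> float:
    # Standard set F1 over evidence IDs.
    if not pred or not oracle:
        return 0.0
    tp = len(pred & oracle)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(oracle)
    return 2 * precision * recall / (precision + recall)

def main_reward(answer: str, oracle_answer: str,
                pred_ids: set[str], oracle_ids: set[str]) -> float:
    # 0.7 * binary exact match + 0.3 * evidence-set F1, per the README.
    answer_exact = 1.0 if answer == oracle_answer else 0.0
    return 0.7 * answer_exact + 0.3 * evidence_f1(pred_ids, oracle_ids)

# Correct answer but one spurious citation: full answer credit,
# partial evidence credit.
print(main_reward("R-7421", "R-7421", {"doc_a", "doc_b"}, {"doc_a"}))
```

Note how an over-broad evidence set is penalized through precision even when the answer string is exactly right, which is what makes citation quality part of the learned policy.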
The intended finalization path is the RLM answer channel: in the Python REPL, set `answer["content"]` to the raw JSON string and then set `answer["ready"] = True`.
If the model misses `answer["ready"]` but leaves behind either a valid raw JSON object or a plain-text `answer["content"] = '...json...'` assignment in the final root assistant text, the environment captures that as an `assistant_fallback` instead of dropping the sample on formatting alone.
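Concretely, the final REPL step might look like the following; in the real scaffold the `answer` dict already exists, so the stand-in here is only to make the snippet run standalone, and the payload values are hypothetical:

```python
import json

# Stand-in for the scaffold-provided answer channel; in the real RLM
# REPL `answer` already exists and should not be recreated.
answer = {"content": "", "ready": False}

# Hypothetical final answer and evidence IDs.
payload = {"answer": "R-7421", "evidence_ids": ["doc_a", "doc_b"]}
answer["content"] = json.dumps(payload)
answer["ready"] = True

print(answer["content"])
```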
Additional metrics include evidence precision and recall, exact evidence-set match, corpus size stats, rollout runtime, final-answer provenance, and built-in RLM monitor metrics such as sub-LLM batch count and mean batch size.
```bash
uv sync --group dev
uv pip install -e .
```

The package exposes `load_environment()`.
Verify import and loading:

```bash
uv run python - <<'PY'
import verifiers as vf
env = vf.load_environment("adaptive_rag_rlm", difficulty="easy", num_train=2, num_eval=2)
print(type(env).__name__, len(env.get_eval_dataset()))
PY
```

Run a local easy smoke eval:
```bash
prime eval run adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 3 -r 1 \
  -a '{"difficulty":"easy","max_turns":12,"include_env_tips":true}' \
  --debug
```

Run a medium eval that is more likely to exercise `llm_batch`:
```bash
prime eval run adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 8 -r 1 -c 4 \
  -a '{"difficulty":"medium","include_env_tips":true}' \
  --debug
```

If your Prime token does not have `tunnel:write`, you can still run adaptive_rag_rlm locally by supplying your own public interception URL. This is a developer-only fallback when Prime Tunnel auth is unavailable. It is not the primary hosted runtime path.
Use the public base origin only. adaptive_rag_rlm appends the rollout path automatically.
```bash
ngrok http 8765
```

Then run:
```bash
prime eval run adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 3 -r 1 \
  -a '{"difficulty":"easy","max_turns":12,"include_env_tips":true,"interception_url":"https://<your-ngrok-url>","interception_port":8765}' \
  --debug
```

Alternatively, with cloudflared:

```bash
cloudflared tunnel --url http://127.0.0.1:8765
```

Then run:
```bash
prime eval run adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 3 -r 1 \
  -a '{"difficulty":"easy","max_turns":12,"include_env_tips":true,"interception_url":"https://<your-cloudflare-url>","interception_port":8765}' \
  --debug
```

Notes:
- use the public origin only, not a rollout-specific path
- keep `interception_port` aligned with the local tunnel target
- run one tunnel-backed eval at a time on a given local port
- in external-tunnel mode, `adaptive_rag_rlm` caps `max_sub_llm_parallelism` to `2`
- for medium and hard local sweeps, prefer `prime eval run -c 4`
- hosted Hub evals should use a `PRIME_API_KEY` with `tunnel:write`
Inspect one generated corpus and write its oracle alongside the files:
```bash
uv run python scripts/inspect_corpus.py --difficulty easy --seed 0 --example-index 0 --output-dir ./outputs/example0
```

Run the naive fixed-chunk lexical baseline on one example:
```bash
uv run python scripts/fixed_chunk_baseline.py --difficulty easy --seed 0 --example-index 0
```

Run the integration checker from source:
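The script's internals aren't reproduced here, but a fixed-chunk lexical baseline of the general kind it implements can be sketched as follows; all names are hypothetical and the real script may differ:

```python
# Hypothetical sketch of a fixed-chunk lexical baseline; not the actual
# contents of scripts/fixed_chunk_baseline.py.

def chunk(text: str, size: int = 40) -> list[str]:
    # Fixed-size character chunks: the hallmark of manual choreography.
    return [text[i:i + size] for i in range(0, len(text), size)]

def baseline_answer(query: str, docs: dict[str, str], top_k: int = 3) -> list[tuple[str, str]]:
    # Score every chunk by lexical overlap with the query and return the
    # top-k (doc_id, chunk) pairs: no recency or specificity reasoning.
    q = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        for c in chunk(text):
            scored.append((len(q & set(c.lower().split())), doc_id, c))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(doc_id, c) for _, doc_id, c in scored[:top_k]]

docs = {
    "doc_a": "the handoff code is R-7421",
    "doc_b": "unrelated packaging notes",
}
print(baseline_answer("handoff code", docs, top_k=1))
```

Because the scoring is purely lexical and the chunk boundaries are fixed, a paraphrased distractor or an override stated in different words can outrank the true evidence, which is exactly the failure mode the corpus generator targets.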
```bash
uv run python scripts/check_prime_env_integration.py --env-dir . --load-name adaptive_rag_rlm
```

| Arg | Type | Default | Description |
|---|---|---|---|
| `difficulty` | `easy` \| `medium` \| `hard` | `"medium"` | Chooses the default corpus scale. |
| `include_env_tips` | `bool` | `false` | Adds a short strategy hint about adaptive scanning and evidence tracking. |
| `prompt_in_context_file` | `bool` | `false` | Writes the semantic query to TASK_QUERY.md and keeps the prompt short. |
| `max_turns` | `int` | `20` | Root-model turn budget for the RLM loop. |
| `sub_model` | `str` \| `None` | `None` | Optional model override for `llm_batch` subcalls. |
| `max_sub_llm_parallelism` | `int` | `5` | Concurrency cap for `llm_batch`. |
| `repl_language` | `python` \| `bash` | `"python"` | Root REPL language. Python is the default validated path. |
| `seed` | `int` | `0` | Global seed for deterministic corpus generation. |
| `num_docs` | `int` \| `None` | preset | Override for corpus document count. |
| `approx_context_size` | `int` \| `None` | preset | Override for approximate corpus byte size. |
| `num_entities` | `int` \| `None` | preset | Override for the number of synthetic entities in the world model. |
| `num_overrides` | `int` \| `None` | preset | Override for stale or conflicting updates. |
| `hop_count` | `int` \| `None` | preset | Override for the intended reasoning depth. |
| `distractor_similarity` | `float` \| `None` | preset | Controls how lexically similar distractors are to the target path. |
| `task_family` | `latest_handoff_code` \| `specificity_fingerprint` \| `package_route` \| `reviewer_handle` \| `None` | `None` | Optional fixed task family for targeted debugging or slice-specific validation. |
| `num_train` | `int` \| `None` | preset | Training split size. |
| `num_eval` | `int` \| `None` | preset | Evaluation split size. |
| `interception_url` | `str` \| `None` | `None` | Optional public base URL for local sub-LLM interception when Prime Tunnel is unavailable. |
| `interception_port` | `int` \| `None` | `None` | Local interception server port used with `interception_url`. Defaults to 8765 when `interception_url` is set. |
| `root_prompt_verbosity` | `light` \| `medium` \| `heavy` | `"heavy"` | RLM scaffold verbosity for the root model. |
| `sub_prompt_verbosity` | `light` \| `medium` \| `heavy` | `"medium"` | Prompt verbosity for sub-LLM calls. |
Preset scales:

- `easy`: 8 docs, about 7 KB, 3 entities, 1 override
- `medium`: 40 docs, about 80 KB, 10 entities, 8 overrides
- `hard`: 120 docs, about 250 KB, 24 entities, 20 overrides
Run the local validation gates from the repo root:
```bash
uv run --with pytest python -m pytest tests/test_adaptive_rag_rlm.py
uv build --out-dir dist
uv run --with twine python -m twine check dist/*
uv run python scripts/check_prime_env_integration.py --env-dir . --load-name adaptive_rag_rlm
```

For local external-tunnel medium validation:
```bash
prime eval run adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 32 -r 1 -c 4 \
  -a '{"difficulty":"medium","include_env_tips":true,"interception_url":"https://<your-ngrok-url>","interception_port":8765}' \
  --debug
```

Push a private Prime Hub build after local checks pass:

```bash
prime env push --path . --name adaptive_rag_rlm --visibility PRIVATE
```

Run a hosted smoke eval after publishing:
```bash
prime eval run casella/adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 3 -r 1 \
  -a '{"difficulty":"easy","max_turns":12,"include_env_tips":true}' \
  --hosted --allow-sandbox-access --follow --debug
```

Hosted RLMEnv execution requires a `PRIME_API_KEY` with `tunnel:write`, because sub-LLM interception is routed through Prime Tunnel unless `interception_url` is provided. The external interceptor flow in this README is intended for local development on your own machine.
- The environment is fully procedural and deterministic by seed.
- The published benchmark contract assumes the Python REPL; `bash` remains experimental until explicitly validated in release checks.
- `RLMEnv` is imported from `verifiers.envs.experimental.rlm_env`, which is the current installed location in `verifiers 0.1.11`.