scasella/adaptive_rag_rlm

adaptive-rag-rlm

A verifiers RLM environment for studying whether adaptive recursive search can outperform brittle manual RAG choreography on long synthetic corpora with overrides, contradictions, stale snapshots, and cross-document joins.

Bitter Lesson Motivation

This environment is motivated by a narrow, testable Bitter Lesson claim about retrieval-heavy agent systems.

When models are weak or context is scarce, builders often compensate with hand-authored retrieval policy: fixed chunk sizes, fixed top-k retrieval, fixed reranking stages, fixed summarization schedules, fixed context-compaction rules, and fixed read orders. These recipes can work surprisingly well on tidy benchmarks, but much of the apparent competence lives in the human-written choreography rather than in the learned policy.

The Bitter Lesson suggests that this kind of handcrafted inner loop is often temporary. Over time, systems that can spend computation on search, adaptation, and learning tend to outperform systems that hard-code human intuitions about the right intermediate procedure. In the retrieval setting, the implication is not that external memory, tools, or document access disappear. The implication is narrower: the durable part of the stack is the interface to documents, tools, and feedback; the fragile part is the human-authored policy that prespecifies exactly how retrieval should happen.

adaptive_rag_rlm turns that idea into a trainable benchmark. Instead of rewarding one blessed RAG recipe, it asks whether an RLM-style policy can learn when to:

  • scan broadly before reading deeply
  • branch into targeted sub-searches
  • reconcile stale snapshots with newer overrides
  • join structured and unstructured evidence
  • compress or rank candidate evidence adaptively
  • cite only the documents that truly justify the answer

The benchmark is deliberately designed so that common handcrafted shortcuts are unstable. Filenames are opaque. MANIFEST.json exposes only weak metadata. Distractors are paraphrased rather than trivially irrelevant. Many tasks require recency resolution, specificity resolution, or multi-hop joins across documents. A policy that relies on lexical proximity, filename authority, or one fixed retrieval recipe should degrade as corpus size, contradiction rate, and evidence-hop complexity increase.

This makes the benchmark useful both as an evaluation and as an RL training environment: it does not ask whether a model can retrieve from a corpus in the abstract, but whether it can learn a retrieval-and-reading strategy that is more robust than brittle manual choreography.

What This Environment Is Testing

This environment is not testing whether retrieval itself is unnecessary.

It is testing whether adaptive search policies can outperform manually choreographed retrieval policies under realistic failure modes.

You can think of the benchmark as contrasting two approaches:

Manual RAG choreography

  • fixed chunking
  • fixed top-k retrieval
  • fixed reranking and summarization passes
  • static context handoff rules
  • one retrieval recipe applied to every instance

Adaptive retrieval behavior

  • decide when to scan broadly vs. read deeply
  • decide when to spawn sub-searches
  • decide when to aggregate or compress candidate evidence
  • decide how much evidence is enough before answering
  • adapt the retrieval strategy to corpus scale, noise, and conflict structure

The environment is therefore best interpreted as a benchmark for the proposition that, in retrieval-heavy agent systems, learned or search-based policies can replace a meaningful fraction of today’s hand-written retrieval logic.

A strong result here would support a claim like:

As corpora become larger, noisier, more contradictory, and more update-heavy, fixed retrieval choreography degrades faster than adaptive recursive search policies.

A weak or negative result would also be informative: it would suggest that, at least in this regime, manual retrieval structure is still doing irreplaceable work.

Why The Corpus Is Structured This Way

The corpus generator is intentionally hostile to brittle retrieval shortcuts.

It includes:

  • stale snapshots and explicit overrides
  • contradictory or near-duplicate documents
  • structured and unstructured files in the same corpus
  • multi-hop joins where the answer is not stated in one place
  • opaque filenames and low-information manifest metadata

These choices are meant to break the assumptions behind many standard RAG pipelines. A benchmark that leaks the answer through filenames, lexical nearest-neighbor matches, or overly clean corpora does not really test adaptive retrieval strategy. This one tries to.

Why Evidence Matters

The benchmark scores both the final answer and the cited evidence.

That matters for the Bitter Lesson framing because many brittle retrieval systems can produce plausible answers without demonstrating that they actually found the right support. By requiring exact answer matching together with evidence-set quality, adaptive_rag_rlm evaluates not just answer generation but the quality of the retrieval-and-reading policy that produced it.

This is especially important in environments with overrides, stale documents, and paraphrased distractors, where the difference between “found something plausible” and “found the correct supporting evidence” is the whole point.

Task Design

Each example is seeded and fully procedural. The generator creates mixed-file corpora with:

  • markdown notes and handoff docs
  • logs and changelog files
  • CSV and JSON tables
  • short config snippets
  • paraphrased distractors and stale snapshots

Task families include:

  • latest override selection by recency
  • specificity override resolution
  • multi-hop request-to-entity-to-route joins
  • structured plus unstructured joins for reviewer routing
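To make the first family concrete, recency resolution reduces to trusting the newest dated statement over stale snapshots. A toy sketch with invented records (the real documents are free text, not structured records, so the hard part is extracting these statements in the first place):

```python
from datetime import date

# Hypothetical toy records: several documents state a handoff code, some stale.
statements = [
    {"doc_id": "doc_a", "date": date(2024, 3, 1), "handoff_code": "HX-12"},
    {"doc_id": "doc_b", "date": date(2024, 5, 9), "handoff_code": "HX-31"},  # latest override
    {"doc_id": "doc_c", "date": date(2024, 4, 2), "handoff_code": "HX-12"},  # stale snapshot
]

def latest_value(statements, field):
    """Resolve a recency override: the newest dated statement wins."""
    newest = max(statements, key=lambda s: s["date"])
    return newest[field], newest["doc_id"]
```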

The corpus layout is materialized lazily at rollout time. Filenames are opaque, and MANIFEST.json lists only path, kind, byte size, and content hash. Evidence IDs must be recovered from document contents rather than filenames or the manifest.

Output And Scoring

The model must end with a raw JSON object:

{"answer": "...", "evidence_ids": ["doc_..."]}

Scoring is strict:

  • malformed JSON gets output_json_valid = 0 and zero reward
  • answer_exact is binary exact match on the oracle answer string
  • evidence_f1 is computed from the predicted and oracle evidence ID sets
  • main reward is 0.7 * answer_exact + 0.3 * evidence_f1
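A minimal sketch of that scoring rule (illustrative only; the environment's actual reward code may handle normalization and edge cases differently):

```python
def score(predicted_answer, predicted_ids, oracle_answer, oracle_ids):
    """Compute 0.7 * answer_exact + 0.3 * evidence_f1 as described above."""
    answer_exact = 1.0 if predicted_answer == oracle_answer else 0.0
    pred, gold = set(predicted_ids), set(oracle_ids)
    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    denom = precision + recall
    evidence_f1 = (2 * precision * recall / denom) if denom else 0.0
    return 0.7 * answer_exact + 0.3 * evidence_f1
```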

The intended finalization path is the RLM answer channel: in the Python REPL, set answer["content"] to the raw JSON string and then set answer["ready"] = True.

If the model never sets answer["ready"] but the final root assistant text still contains either a valid raw JSON object or a plain-text answer["content"] = '...json...' assignment, the environment records the output as an assistant_fallback instead of dropping the sample on formatting alone.
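Inside the Python REPL, the intended finalization step looks like the following sketch. In the real REPL the answer dict is injected by the RLM scaffold; here it is stubbed as a plain dict, and the answer values are hypothetical:

```python
import json

answer = {}  # stub: in the real REPL, `answer` is provided by the scaffold

# Hypothetical answer string and evidence IDs, for illustration only.
final = {"answer": "route-7", "evidence_ids": ["doc_3f2a", "doc_91cc"]}

answer["content"] = json.dumps(final)  # raw JSON string, no code fences
answer["ready"] = True                 # tells the environment to finalize
```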

Additional metrics include evidence precision and recall, exact evidence-set match, corpus size stats, rollout runtime, final-answer provenance, and built-in RLM monitor metrics such as sub-LLM batch count and mean batch size.

Installation

uv sync --group dev
uv pip install -e .

The package exposes load_environment().

Quickstart

Verify import and loading:

uv run python - <<'PY'
import verifiers as vf
env = vf.load_environment("adaptive_rag_rlm", difficulty="easy", num_train=2, num_eval=2)
print(type(env).__name__, len(env.get_eval_dataset()))
PY

Run a local easy smoke eval:

prime eval run adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 3 -r 1 \
  -a '{"difficulty":"easy","max_turns":12,"include_env_tips":true}' \
  --debug

Run a medium eval that is more likely to exercise llm_batch:

prime eval run adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 8 -r 1 -c 4 \
  -a '{"difficulty":"medium","include_env_tips":true}' \
  --debug

Local Setup Without tunnel:write

If your Prime token does not have tunnel:write, you can still run adaptive_rag_rlm locally by supplying your own public interception URL. This is a developer-only fallback when Prime Tunnel auth is unavailable. It is not the primary hosted runtime path.

Use the public base origin only. adaptive_rag_rlm appends the rollout path automatically.

Option 1: ngrok

ngrok http 8765

Then run:

prime eval run adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 3 -r 1 \
  -a '{"difficulty":"easy","max_turns":12,"include_env_tips":true,"interception_url":"https://<your-ngrok-url>","interception_port":8765}' \
  --debug

Option 2: cloudflared

cloudflared tunnel --url http://127.0.0.1:8765

Then run:

prime eval run adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 3 -r 1 \
  -a '{"difficulty":"easy","max_turns":12,"include_env_tips":true,"interception_url":"https://<your-cloudflare-url>","interception_port":8765}' \
  --debug

Notes:

  • use the public origin only, not a rollout-specific path
  • keep interception_port aligned with the local tunnel target
  • run one tunnel-backed eval at a time on a given local port
  • in external-tunnel mode, adaptive_rag_rlm caps max_sub_llm_parallelism to 2
  • for medium and hard local sweeps, prefer prime eval run -c 4
  • hosted Hub evals should use a PRIME_API_KEY with tunnel:write

Helper Scripts

Inspect one generated corpus and write its oracle alongside the files:

uv run python scripts/inspect_corpus.py --difficulty easy --seed 0 --example-index 0 --output-dir ./outputs/example0

Run the naive fixed-chunk lexical baseline on one example:

uv run python scripts/fixed_chunk_baseline.py --difficulty easy --seed 0 --example-index 0

Run the integration checker from source:

uv run python scripts/check_prime_env_integration.py --env-dir . --load-name adaptive_rag_rlm

Environment Arguments

| Arg | Type | Default | Description |
| --- | --- | --- | --- |
| difficulty | easy \| medium \| hard | "medium" | Chooses the default corpus scale. |
| include_env_tips | bool | false | Adds a short strategy hint about adaptive scanning and evidence tracking. |
| prompt_in_context_file | bool | false | Writes the semantic query to TASK_QUERY.md and keeps the prompt short. |
| max_turns | int | 20 | Root-model turn budget for the RLM loop. |
| sub_model | str \| None | None | Optional model override for llm_batch subcalls. |
| max_sub_llm_parallelism | int | 5 | Concurrency cap for llm_batch. |
| repl_language | python \| bash | "python" | Root REPL language. Python is the default validated path. |
| seed | int | 0 | Global seed for deterministic corpus generation. |
| num_docs | int \| None | preset | Override for corpus document count. |
| approx_context_size | int \| None | preset | Override for approximate corpus byte size. |
| num_entities | int \| None | preset | Override for the number of synthetic entities in the world model. |
| num_overrides | int \| None | preset | Override for stale or conflicting updates. |
| hop_count | int \| None | preset | Override for the intended reasoning depth. |
| distractor_similarity | float \| None | preset | Controls how lexically similar distractors are to the target path. |
| task_family | latest_handoff_code \| specificity_fingerprint \| package_route \| reviewer_handle \| None | None | Optional fixed task family for targeted debugging or slice-specific validation. |
| num_train | int \| None | preset | Training split size. |
| num_eval | int \| None | preset | Evaluation split size. |
| interception_url | str \| None | None | Optional public base URL for local sub-LLM interception when Prime Tunnel is unavailable. |
| interception_port | int \| None | None | Local interception server port used with interception_url. Defaults to 8765 when interception_url is set. |
| root_prompt_verbosity | light \| medium \| heavy | "heavy" | RLM scaffold verbosity for the root model. |
| sub_prompt_verbosity | light \| medium \| heavy | "medium" | Prompt verbosity for sub-LLM calls. |

Preset scales:

  • easy: 8 docs, about 7 KB, 3 entities, 1 override
  • medium: 40 docs, about 80 KB, 10 entities, 8 overrides
  • hard: 120 docs, about 250 KB, 24 entities, 20 overrides
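One way to read the int-or-None arguments above: each difficulty supplies preset defaults, and any non-None override replaces the corresponding value. A sketch of that resolution logic (assumed behavior, not the actual implementation; the byte sizes are the approximate figures listed above):

```python
# Approximate preset scales from the README; treated here as plain defaults.
PRESETS = {
    "easy":   {"num_docs": 8,   "approx_context_size": 7_000,   "num_entities": 3,  "num_overrides": 1},
    "medium": {"num_docs": 40,  "approx_context_size": 80_000,  "num_entities": 10, "num_overrides": 8},
    "hard":   {"num_docs": 120, "approx_context_size": 250_000, "num_entities": 24, "num_overrides": 20},
}

def resolve_scale(difficulty="medium", **overrides):
    """Start from the difficulty preset, then apply non-None overrides."""
    scale = dict(PRESETS[difficulty])
    scale.update({k: v for k, v in overrides.items() if v is not None})
    return scale
```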

Validation

Run the local validation gates from the repo root:

uv run --with pytest python -m pytest tests/test_adaptive_rag_rlm.py
uv build --out-dir dist
uv run --with twine python -m twine check dist/*
uv run python scripts/check_prime_env_integration.py --env-dir . --load-name adaptive_rag_rlm

For local external-tunnel medium validation:

prime eval run adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 32 -r 1 -c 4 \
  -a '{"difficulty":"medium","include_env_tips":true,"interception_url":"https://<your-ngrok-url>","interception_port":8765}' \
  --debug

Prime Hub Notes

Push a private Prime Hub build after local checks pass:

prime env push --path . --name adaptive_rag_rlm --visibility PRIVATE

Run a hosted smoke eval after publishing:

prime eval run casella/adaptive_rag_rlm \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -m openai/gpt-4.1-mini \
  -n 3 -r 1 \
  -a '{"difficulty":"easy","max_turns":12,"include_env_tips":true}' \
  --hosted --allow-sandbox-access --follow --debug

Hosted RLMEnv execution requires a PRIME_API_KEY with tunnel:write, because sub-LLM interception is routed through Prime Tunnel unless interception_url is provided. The external interceptor flow in this README is intended for local development on your own machine.

Notes

  • The environment is fully procedural and deterministic by seed.
  • The published benchmark contract assumes Python REPL; bash remains experimental until explicitly validated in release checks.
  • RLMEnv is imported from verifiers.envs.experimental.rlm_env, which is the current installed location in verifiers 0.1.11.
