A verifiers environment for training models to come up with a new idea, back it up with retrieved evidence, and end with an experiment that could prove it wrong.
Most public RL environments train models to be correct. They do not put much pressure on models to be interesting in a disciplined way: to search over a design space instead of jumping to the nearest familiar answer, to ground claims in evidence that was actually gathered, and to propose the next experiment that could prove the idea wrong.
I built hypothesis-forge because I wanted an environment that rewards novelty with accountability.
A concrete sequence of behaviors:
- Search a constrained design problem broadly enough to find multiple plausible directions.
- Know what is already known instead of paraphrasing the nearest baseline.
- Use tools to inspect evidence instead of bluffing.
- Commit to a mechanism that deserves follow-up.
- Propose an experiment that distinguishes the idea from what came before.
This is not a creativity benchmark for nice-sounding prose. The target is exploratory, evidence-grounded, falsifiable ideation.
Pure open-web plus judge-model setups are slow, expensive, noisy, and easy to game under RL. Pure symbolic novelty puzzles are clean and cheap but too toy-like to say much about real research behavior.
hypothesis-forge sits in the middle: synthetic enough to run cheaply and deterministically, structured enough for honest reward functions, open-ended enough that the model still has to search, and tool-based so evidence gathering is part of the task.
Each episode is a research brief from one of four domains:
- materials discovery
- biosensor design
- surrogate model discovery
- LLM memory systems
The model sees a research objective, a shortlist of candidate components, and tool access to evidence cards and known baselines. It returns one JSON object with `title`, `selected_components`, `cited_cards`, `hypothesis`, and `experiment` fields.
Five tools:
- `list_brief()`
- `list_components()`
- `search_evidence(query)`
- `read_card(card_id)`
- `list_known_ideas()`
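A typical episode can be sketched against these tools. The tool names below match the list above, but the stub implementations and the `episode()` driver are assumptions for illustration only (a real rollout is driven by the model's own tool calls, and only a subset of the five tools is shown):

```python
# Illustrative episode flow. Tool names match the environment's tool list;
# the stub data and implementations are invented for demonstration.
STUB_CARDS = {
    "EV-0001": "Residual heads reduce extrapolation error on stiff systems.",
}

def list_brief():
    return {"objective": "Improve surrogate calibration",
            "domain": "surrogate model discovery"}

def search_evidence(query):
    # Return IDs of cards whose text mentions the query term.
    return [cid for cid, text in STUB_CARDS.items()
            if query.lower() in text.lower()]

def read_card(card_id):
    return STUB_CARDS[card_id]

def episode():
    brief = list_brief()
    hits = search_evidence("residual")
    # Grounding only credits cards the model actually opened.
    opened = {cid: read_card(cid) for cid in hits}
    return {
        "title": f"Proposal for {brief['objective']}",
        "cited_cards": sorted(opened),
    }
```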
Several choices make the reward signal more honest:
- Evidence-card IDs and baseline IDs are opaque per task.
- Grounding only credits evidence the model actually surfaced through tool use.
- Realistic-mode grounding cannot max out from search-only snippets; at least one opened card is required.
- `distinguishes_from` only scores when the referenced baseline was actually revealed.
- Novelty is constrained by feasibility and baseline-relative improvement, so obviously weak ideas do not win for being merely different.
- Grounding uses citation precision, not citation stuffing.
- The default benchmark is a blended hard public dev set: 20 hard synthetic tasks and 20 realistic surrogate tasks.
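The grounding rules above can be made concrete. The following is a minimal sketch, assuming `cited`, `opened`, and `searched` card-ID sets; the actual scorer's internals are not shown in this document and may differ:

```python
def grounding_reward(cited, opened, searched, realistic_mode=True):
    """Illustrative precision-based grounding sketch.

    cited    : card IDs the model cited in its final JSON
    opened   : card IDs it actually retrieved with read_card
    searched : card IDs that only appeared in search snippets
    """
    if not cited:
        return 0.0
    surfaced = opened | searched
    # Precision over citations: stuffing extra uninspected IDs lowers the score.
    precision = len(set(cited) & surfaced) / len(cited)
    if realistic_mode and not (set(cited) & opened):
        # Search-only snippets cannot max out grounding in realistic mode.
        precision = min(precision, 0.5)
    return precision
```

The key property is that citation *precision*, not citation count, drives the score, and that credit flows only through evidence the model surfaced itself.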
| Reward | Weight | Rationale |
|---|---|---|
| `feasibility_reward` | 0.25 | Clever nonsense should not win |
| `grounding_reward` | 0.22 | Ideas need evidence that was actually inspected |
| `experiment_reward` | 0.20 | Ideas need a discriminating next step |
| `novelty_reward` | 0.15 | Reward movement beyond baseline imitation |
| `explanation_reward` | 0.13 | Mechanism clarity should matter, but not dominate |
| `format_reward` | 0.03 | Structure enforcement |
| `group_diversity_bonus` | 0.02 | Encourage varied proposals only after a quality floor |
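A weighted-sum aggregation over the table's weights would look like the sketch below. The weights come from the table; the linear aggregation form and the `[0, 1]` component range are assumptions:

```python
# Weights copied from the reward table; they sum to 1.0.
WEIGHTS = {
    "feasibility_reward": 0.25,
    "grounding_reward": 0.22,
    "experiment_reward": 0.20,
    "novelty_reward": 0.15,
    "explanation_reward": 0.13,
    "format_reward": 0.03,
    "group_diversity_bonus": 0.02,
}

def total_reward(scores):
    # Assumes each component score lies in [0, 1]; missing components score 0.
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
```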
```json
{
  "title": "Adaptive residual surrogate",
  "selected_components": ["C1", "C4", "C7"],
  "cited_cards": ["EV-7K2Q9M", "EV-R4D8TX", "EV-N6P3LC"],
  "hypothesis": "Combining the physics residual head with the uncertainty calibrator and graph prior encoder should improve calibration while preserving fast inference.",
  "experiment": {
    "manipulation": "Remove the uncertainty calibrator while keeping the rest fixed.",
    "readout": "Expected calibration error and OOD performance on shifted inputs.",
    "prediction": "The full stack should maintain better uncertainty calibration and robustness than the ablated variant.",
    "distinguishes_from": "KI-3M7QPD"
  }
}
```

```bash
# From source
uv sync --group dev

# After package release
uv pip install hypothesis-forge
```

The package exposes `load_environment()`.
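The JSON contract shown above can be checked structurally with a few lines. This validator is illustrative only; the environment's actual format scorer may be stricter:

```python
# Required keys taken from the example output; the validator itself is a sketch.
REQUIRED_TOP = {"title", "selected_components", "cited_cards",
                "hypothesis", "experiment"}
REQUIRED_EXP = {"manipulation", "readout", "prediction", "distinguishes_from"}

def is_well_formed(obj, max_components=4):
    # Structural checks only; semantic quality is scored by the reward functions.
    if not isinstance(obj, dict) or not REQUIRED_TOP <= obj.keys():
        return False
    exp = obj["experiment"]
    if not isinstance(exp, dict) or not REQUIRED_EXP <= exp.keys():
        return False
    return 1 <= len(obj["selected_components"]) <= max_components
```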
The public Prime Hub environment lives at casella/hypothesis-forge.
```bash
# Inspect the published environment
prime env info casella/hypothesis-forge

# Install it from Prime Hub
prime env install casella/hypothesis-forge

# Run an official Prime eval with Prime Inference
prime eval run casella/hypothesis-forge --model openai/gpt-4.1-mini

# Run the realistic surrogate-mode track through Prime
prime eval run casella/hypothesis-forge \
  --model openai/gpt-4.1-mini \
  --env-args '{"task_mode":"surrogate_realistic","domain_filter":"surrogate model discovery"}'
```

Official Prime evals use Prime Inference by default, so the Hub-hosted path does not need the OpenRouter base URL or API-key flags.
Official Prime evals were run on `casella/hypothesis-forge@0.2.1` using Prime Inference, the default `blended_hard` benchmark, `max_turns=25`, 40 examples, and 4 rollouts per example. These runs were executed on March 7, 2026.
| Model | Avg Reward | Format | Feasibility | Novelty | Grounding | Experiment | Explanation | Eval |
|---|---|---|---|---|---|---|---|---|
| `openai/gpt-4.1-mini` | 0.401 | 1.000 | 0.360 | 0.213 | 0.460 | 0.427 | 0.477 | `du1xh23l7j55ockk1xd6hxca` |
| `openai/gpt-5-mini` | 0.466 | 1.000 | 0.345 | 0.203 | 0.476 | 0.801 | 0.416 | `ocm2b5q4kmkiz4o6gc945rgm` |
| `openai/gpt-5` | 0.502 | 1.000 | 0.414 | 0.238 | 0.493 | 0.801 | 0.482 | `jg4v0ocd7q4zpf0anq3zeikd` |
```bash
# Default local eval
uv run vf-eval hypothesis-forge

# OpenRouter
export OPENROUTER_API_KEY=...
uv run vf-eval hypothesis-forge -b https://openrouter.ai/api/v1 -k OPENROUTER_API_KEY --debug

# Bundled realistic surrogate-model pack
uv run vf-eval hypothesis-forge -a '{"task_mode":"surrogate_realistic","domain_filter":"surrogate model discovery"}'

# Explicit default blended-hard public dev set
uv run vf-eval hypothesis-forge -n 40 -r 4 -a '{"task_mode":"blended_hard"}'

# Private holdout pack outside the public repo
uv run vf-eval hypothesis-forge -a '{"task_pack_path":"./holdout/private_surrogate_holdout.json"}'

# Calibration ladder helper
python scripts/calibrate_ladder.py --dry-run
```

| Arg | Type | Default | Description |
|---|---|---|---|
| `num_train` | int | 200 | Generated train examples |
| `num_eval` | int | 40 | Generated eval examples |
| `seed` | int | 0 | Global seed |
| `max_turns` | int | 25 | Max tool-use turns |
| `max_components` | int | 4 | Max selected components |
| `domain_filter` | str \| None | None | Restrict tasks to one domain, such as "surrogate model discovery" |
| `task_mode` | str | "blended_hard" | Task source mode: "blended_hard" (default), "synthetic", or the bundled "surrogate_realistic" |
| `task_pack_path` | str \| None | None | Optional JSON or JSONL task pack path for private dev or holdout packs |
| `system_prompt` | str | built-in | Optional prompt override |
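The same arguments are what the `-a` / `--env-args` flag carries as a JSON string. A small sketch of building that string programmatically (the argument names come from the table above; the specific values are illustrative):

```python
import json

# Arguments taken from the table above; the chosen values are examples only.
env_args = {
    "task_mode": "surrogate_realistic",
    "domain_filter": "surrogate model discovery",
    "max_turns": 25,
    "seed": 0,
}

# Serialise to the JSON string passed via -a / --env-args on the CLI.
cli_flag = json.dumps(env_args)
```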
- Only surrogate model discovery has a realistic public dev pack; the other domains are still synthetic.
- The scorer is symbolic and still not calibrated as sharply as I want. Official Prime evals now separate models, but the ladder is still more compressed than the target.
- There is no cross-episode memory, curriculum, or bundled private holdout yet.
The next step is recalibration: make the public benchmark harder, expand the realistic surrogate pack, add a private holdout through task_pack_path, and keep tuning until official Prime evals show a wider spread across the model ladder.