A verifiers environment for training models to come up with a new idea, back it up with retrieved evidence, and end with an experiment that could prove it wrong.
Most public RL environments train models to be correct. They do not put much pressure on models to be interesting in a disciplined way: to search over a design space instead of jumping to the nearest familiar answer, to ground claims in evidence that was actually gathered, and to propose the next experiment that could prove the idea wrong.
I built hypothesis-forge because I wanted an environment that rewards novelty with accountability.
A concrete sequence of behaviors:
- Search a constrained design problem broadly enough to find multiple plausible directions.
- Know what is already known instead of paraphrasing the nearest baseline.
- Use tools to inspect evidence instead of bluffing.
- Commit to a mechanism that deserves follow-up.
- Propose an experiment that distinguishes the idea from what came before.
This is not a creativity benchmark for nice-sounding prose. The target is exploratory, evidence-grounded, falsifiable ideation.
Pure open-web plus judge-model setups are slow, expensive, noisy, and easy to game under RL. Pure symbolic novelty puzzles are clean and cheap but too toy-like to say much about real research behavior.
hypothesis-forge sits in the middle: synthetic enough to run cheaply and deterministically, structured enough for honest reward functions, open-ended enough that the model still has to search, and tool-based so evidence gathering is part of the task.
Each episode is a research brief from one of four domains:
- materials discovery
- biosensor design
- surrogate model discovery
- LLM memory systems
The model sees a research objective, a shortlist of candidate components, and tool access to evidence cards and known baselines. It returns one JSON object with `title`, `selected_components`, `cited_cards`, `hypothesis`, and `experiment` fields.
Five tools:
- `list_brief()`
- `list_components()`
- `search_evidence(query)`
- `read_card(card_id)`
- `list_known_ideas()`
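A typical episode can be sketched against these tools. The tool names below match the list above, but the stub implementations and the `episode()` driver are assumptions for illustration only (a real rollout is driven by the model's own tool calls, and only a subset of the five tools is shown):

```python
# Illustrative episode flow. Tool names match the environment's tool list;
# the stub data and implementations are invented for demonstration.
STUB_CARDS = {
    "EV-0001": "Residual heads reduce extrapolation error on stiff systems.",
}

def list_brief():
    return {"objective": "Improve surrogate calibration",
            "domain": "surrogate model discovery"}

def search_evidence(query):
    # Return IDs of cards whose text mentions the query term.
    return [cid for cid, text in STUB_CARDS.items()
            if query.lower() in text.lower()]

def read_card(card_id):
    return STUB_CARDS[card_id]

def episode():
    brief = list_brief()
    hits = search_evidence("residual")
    # Grounding only credits cards the model actually opened.
    opened = {cid: read_card(cid) for cid in hits}
    return {
        "title": f"Proposal for {brief['objective']}",
        "cited_cards": sorted(opened),
    }
```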
Several choices make the reward signal more honest:
- Evidence-card IDs and baseline IDs are opaque per task.
- Grounding only credits evidence the model actually surfaced through tool use.
- Realistic-mode grounding cannot max out from search-only snippets; at least one opened card is required.
- `distinguishes_from` only scores when the referenced baseline was actually revealed.
- Novelty is constrained by feasibility and baseline-relative improvement, so obviously weak ideas do not win for being merely different.
- Grounding uses citation precision, not citation stuffing.
- The default benchmark is a blended hard public dev set: 20 hard synthetic tasks and 20 realistic surrogate tasks.
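The grounding rules above can be made concrete. The following is a minimal sketch, assuming `cited`, `opened`, and `searched` card-ID sets; the actual scorer's internals are not shown in this document and may differ:

```python
def grounding_reward(cited, opened, searched, realistic_mode=True):
    """Illustrative precision-based grounding sketch.

    cited    : card IDs the model cited in its final JSON
    opened   : card IDs it actually retrieved with read_card
    searched : card IDs that only appeared in search snippets
    """
    if not cited:
        return 0.0
    surfaced = opened | searched
    # Precision over citations: stuffing extra uninspected IDs lowers the score.
    precision = len(set(cited) & surfaced) / len(cited)
    if realistic_mode and not (set(cited) & opened):
        # Search-only snippets cannot max out grounding in realistic mode.
        precision = min(precision, 0.5)
    return precision
```

The key property is that citation *precision*, not citation count, drives the score, and that credit flows only through evidence the model surfaced itself.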
| Reward | Weight | Rationale |
|---|---|---|
| `feasibility_reward` | 0.25 | Clever nonsense should not win |
| `grounding_reward` | 0.22 | Ideas need evidence that was actually inspected |
| `experiment_reward` | 0.20 | Ideas need a discriminating next step |
| `novelty_reward` | 0.15 | Reward movement beyond baseline imitation |
| `explanation_reward` | 0.13 | Mechanism clarity should matter, but not dominate |
| `format_reward` | 0.03 | Structure enforcement |
| `group_diversity_bonus` | 0.02 | Encourage varied proposals only after a quality floor |
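A weighted-sum aggregation over the table's weights would look like the sketch below. The weights come from the table; the linear aggregation form and the `[0, 1]` component range are assumptions:

```python
# Weights copied from the reward table; they sum to 1.0.
WEIGHTS = {
    "feasibility_reward": 0.25,
    "grounding_reward": 0.22,
    "experiment_reward": 0.20,
    "novelty_reward": 0.15,
    "explanation_reward": 0.13,
    "format_reward": 0.03,
    "group_diversity_bonus": 0.02,
}

def total_reward(scores):
    # Assumes each component score lies in [0, 1]; missing components score 0.
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
```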
```json
{
  "title": "Adaptive residual surrogate",
  "selected_components": ["C1", "C4", "C7"],
  "cited_cards": ["EV-7K2Q9M", "EV-R4D8TX", "EV-N6P3LC"],
  "hypothesis": "Combining the physics residual head with the uncertainty calibrator and graph prior encoder should improve calibration while preserving fast inference.",
  "experiment": {
    "manipulation": "Remove the uncertainty calibrator while keeping the rest fixed.",
    "readout": "Expected calibration error and OOD performance on shifted inputs.",
    "prediction": "The full stack should maintain better uncertainty calibration and robustness than the ablated variant.",
    "distinguishes_from": "KI-3M7QPD"
  }
}
```

```bash
# From source
uv sync --group dev

# After package release
uv pip install hypothesis-forge
```

The package exposes `load_environment()`.
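The JSON contract shown above can be checked structurally with a few lines. This validator is illustrative only; the environment's actual format scorer may be stricter:

```python
# Required keys taken from the example output; the validator itself is a sketch.
REQUIRED_TOP = {"title", "selected_components", "cited_cards",
                "hypothesis", "experiment"}
REQUIRED_EXP = {"manipulation", "readout", "prediction", "distinguishes_from"}

def is_well_formed(obj, max_components=4):
    # Structural checks only; semantic quality is scored by the reward functions.
    if not isinstance(obj, dict) or not REQUIRED_TOP <= obj.keys():
        return False
    exp = obj["experiment"]
    if not isinstance(exp, dict) or not REQUIRED_EXP <= exp.keys():
        return False
    return 1 <= len(obj["selected_components"]) <= max_components
```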
The public Prime Hub environment lives at casella/hypothesis-forge.
```bash
# Inspect the published environment
prime env info casella/hypothesis-forge

# Install it from Prime Hub
prime env install casella/hypothesis-forge

# Run an official Prime eval with Prime Inference
prime eval run casella/hypothesis-forge --model openai/gpt-4.1-mini

# Run the realistic surrogate-mode track through Prime
prime eval run casella/hypothesis-forge \
  --model openai/gpt-4.1-mini \
  --env-args '{"task_mode":"surrogate_realistic","domain_filter":"surrogate model discovery"}'
```

Official Prime evals use Prime Inference by default, so the Hub-hosted path does not need the OpenRouter base URL or API-key flags.
Official Prime evals were run on `casella/hypothesis-forge@0.2.1` using Prime Inference, the default `blended_hard` benchmark, `max_turns=25`, 40 examples, and 4 rollouts per example. These runs were executed on March 7, 2026.
| Model | Avg Reward | Format | Feasibility | Novelty | Grounding | Experiment | Explanation | Eval |
|---|---|---|---|---|---|---|---|---|
| `openai/gpt-4.1-mini` | 0.401 | 1.000 | 0.360 | 0.213 | 0.460 | 0.427 | 0.477 | `du1xh23l7j55ockk1xd6hxca` |
| `openai/gpt-5-mini` | 0.466 | 1.000 | 0.345 | 0.203 | 0.476 | 0.801 | 0.416 | `ocm2b5q4kmkiz4o6gc945rgm` |
| `openai/gpt-5` | 0.502 | 1.000 | 0.414 | 0.238 | 0.493 | 0.801 | 0.482 | `jg4v0ocd7q4zpf0anq3zeikd` |
```bash
# Default local eval
uv run vf-eval hypothesis-forge

# OpenRouter
export OPENROUTER_API_KEY=...
uv run vf-eval hypothesis-forge -b https://openrouter.ai/api/v1 -k OPENROUTER_API_KEY --debug

# Bundled realistic surrogate-model pack
uv run vf-eval hypothesis-forge -a '{"task_mode":"surrogate_realistic","domain_filter":"surrogate model discovery"}'

# Explicit default blended-hard public dev set
uv run vf-eval hypothesis-forge -n 40 -r 4 -a '{"task_mode":"blended_hard"}'

# Private holdout pack outside the public repo
uv run vf-eval hypothesis-forge -a '{"task_pack_path":"./holdout/private_surrogate_holdout.json"}'

# Calibration ladder helper
python scripts/calibrate_ladder.py --dry-run
```

| Arg | Type | Default | Description |
|---|---|---|---|
| `num_train` | int | 200 | Generated train examples |
| `num_eval` | int | 40 | Generated eval examples |
| `seed` | int | 0 | Global seed |
| `max_turns` | int | 25 | Max tool-use turns |
| `max_components` | int | 4 | Max selected components |
| `domain_filter` | str \| None | None | Restrict tasks to one domain, such as "surrogate model discovery" |
| `task_mode` | str | "blended_hard" | Task source mode: "blended_hard" (default), "synthetic", or the bundled "surrogate_realistic" |
| `task_pack_path` | str \| None | None | Optional JSON or JSONL task pack path for private dev or holdout packs |
| `system_prompt` | str | built-in | Optional prompt override |
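The same arguments are what the `-a` / `--env-args` flag carries as a JSON string. A small sketch of building that string programmatically (the argument names come from the table above; the specific values are illustrative):

```python
import json

# Arguments taken from the table above; the chosen values are examples only.
env_args = {
    "task_mode": "surrogate_realistic",
    "domain_filter": "surrogate model discovery",
    "max_turns": 25,
    "seed": 0,
}

# Serialise to the JSON string passed via -a / --env-args on the CLI.
cli_flag = json.dumps(env_args)
```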
- Only surrogate model discovery has a realistic public dev pack; the other domains are still synthetic.
- The scorer is symbolic and still not calibrated as sharply as I want. Official Prime evals now separate models, but the ladder is still more compressed than the target.
- There is no cross-episode memory, curriculum, or bundled private holdout yet.
The next step is recalibration: make the public benchmark harder, expand the realistic surrogate pack, add a private holdout through task_pack_path, and keep tuning until official Prime evals show a wider spread across the model ladder.