hypothesis-forge

A verifiers environment for training models to come up with a new idea, back it up with retrieved evidence, and end with an experiment that could prove it wrong.

Why

Most public RL environments train models to be correct. They do not put much pressure on models to be interesting in a disciplined way: to search over a design space instead of jumping to the nearest familiar answer, to ground claims in evidence that was actually gathered, and to propose the next experiment that could prove the idea wrong.

I built hypothesis-forge because I wanted an environment that rewards novelty with accountability.

What it trains

A concrete sequence of behaviors:

  1. Search a constrained design problem broadly enough to find multiple plausible directions.
  2. Know what is already known instead of paraphrasing the nearest baseline.
  3. Use tools to inspect evidence instead of bluffing.
  4. Commit to a mechanism that deserves follow-up.
  5. Propose an experiment that distinguishes the idea from what came before.

This is not a creativity benchmark for nice-sounding prose. The target is exploratory, evidence-grounded, falsifiable ideation.

Why hybrid synthetic

Pure open-web plus judge-model setups are slow, expensive, noisy, and easy to game under RL. Pure symbolic novelty puzzles are clean and cheap but too toy-like to say much about real research behavior.

hypothesis-forge sits in the middle: synthetic enough to run cheaply and deterministically, structured enough for honest reward functions, open-ended enough that the model still has to search, and tool-based so evidence gathering is part of the task.

How an episode works

Each episode is a research brief from one of four domains:

  • materials discovery
  • biosensor design
  • surrogate model discovery
  • LLM memory systems

The model sees a research objective, a shortlist of candidate components, and tool access to evidence cards and known baselines. It returns one JSON object with a title, selected_components, cited_cards, hypothesis, and experiment.

Five tools:

  • list_brief()
  • list_components()
  • search_evidence(query)
  • read_card(card_id)
  • list_known_ideas()

Hardening

Several choices make the reward signal more honest:

  • Evidence-card IDs and baseline IDs are opaque per task.
  • Grounding only credits evidence the model actually surfaced through tool use.
  • Realistic-mode grounding cannot max out from search-only snippets; at least one opened card is required.
  • distinguishes_from only scores when the referenced baseline was actually revealed.
  • Novelty is constrained by feasibility and baseline-relative improvement, so obviously weak ideas do not win for being merely different.
  • Grounding uses citation precision, not citation stuffing.
  • The default benchmark is a blended hard public dev set: 20 hard synthetic tasks and 20 realistic surrogate tasks.
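The precision-over-stuffing rule can be sketched as follows. This is a hypothetical scorer, not the environment's reward code; in particular, the 0.5 cap for search-only citations is an assumed value chosen for illustration.

```python
def citation_precision(cited, surfaced, opened):
    """Hypothetical grounding sketch: credit only citations the model
    surfaced through tool use, and cap the score when no cited card
    was actually opened (the realistic-mode rule)."""
    if not cited:
        return 0.0
    valid = [c for c in cited if c in surfaced]
    precision = len(valid) / len(cited)   # stuffing extra IDs lowers the score
    if not (opened & set(cited)):
        precision = min(precision, 0.5)   # assumed cap: snippets alone cannot max out
    return precision
```

Because the denominator is the full citation list, padding `cited_cards` with unsurfaced IDs strictly hurts the score rather than helping it.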

Reward design

| Reward | Weight | Rationale |
| --- | --- | --- |
| feasibility_reward | 0.25 | Clever nonsense should not win |
| grounding_reward | 0.22 | Ideas need evidence that was actually inspected |
| experiment_reward | 0.20 | Ideas need a discriminating next step |
| novelty_reward | 0.15 | Reward movement beyond baseline imitation |
| explanation_reward | 0.13 | Mechanism clarity should matter, but not dominate |
| format_reward | 0.03 | Structure enforcement |
| group_diversity_bonus | 0.02 | Encourage varied proposals only after a quality floor |
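Assuming each component score lies in [0, 1] and the scalar reward is the weighted sum of the components (a sketch, not the environment's actual code), the weights combine like this; note they sum to exactly 1.0:

```python
# Weights copied from the table above; the combination rule is an assumption.
WEIGHTS = {
    "feasibility_reward": 0.25,
    "grounding_reward": 0.22,
    "experiment_reward": 0.20,
    "novelty_reward": 0.15,
    "explanation_reward": 0.13,
    "format_reward": 0.03,
    "group_diversity_bonus": 0.02,
}

def total_reward(scores):
    """Weighted sum of component scores; missing components count as 0."""
    return sum(w * scores.get(name, 0.0) for name, w in WEIGHTS.items())

# Sanity check: the ladder is normalized.
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
```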

Output format

{
  "title": "Adaptive residual surrogate",
  "selected_components": ["C1", "C4", "C7"],
  "cited_cards": ["EV-7K2Q9M", "EV-R4D8TX", "EV-N6P3LC"],
  "hypothesis": "Combining the physics residual head with the uncertainty calibrator and graph prior encoder should improve calibration while preserving fast inference.",
  "experiment": {
    "manipulation": "Remove the uncertainty calibrator while keeping the rest fixed.",
    "readout": "Expected calibration error and OOD performance on shifted inputs.",
    "prediction": "The full stack should maintain better uncertainty calibration and robustness than the ablated variant.",
    "distinguishes_from": "KI-3M7QPD"
  }
}
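A minimal structural check of that object can be written against the field names shown above. The validator itself is a hypothetical helper for illustration, not part of the package; the `max_components` default mirrors the environment argument of the same name.

```python
# Field names taken from the README's output format; the checker is illustrative.
REQUIRED_TOP = {"title", "selected_components", "cited_cards", "hypothesis", "experiment"}
REQUIRED_EXP = {"manipulation", "readout", "prediction", "distinguishes_from"}

def is_well_formed(obj, max_components=4):
    """Return True if obj has the required keys and a legal component count."""
    if not isinstance(obj, dict) or not REQUIRED_TOP <= obj.keys():
        return False
    comps = obj["selected_components"]
    if not (isinstance(comps, list) and 1 <= len(comps) <= max_components):
        return False
    exp = obj["experiment"]
    return isinstance(exp, dict) and REQUIRED_EXP <= exp.keys()
```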

Installation

# From source
uv sync --group dev

# After package release
uv pip install hypothesis-forge

The package exposes load_environment().

Prime Hub

The public Prime Hub environment lives at casella/hypothesis-forge.

# Inspect the published environment
prime env info casella/hypothesis-forge

# Install it from Prime Hub
prime env install casella/hypothesis-forge

# Run an official Prime eval with Prime Inference
prime eval run casella/hypothesis-forge --model openai/gpt-4.1-mini

# Run the realistic surrogate-mode track through Prime
prime eval run casella/hypothesis-forge \
  --model openai/gpt-4.1-mini \
  --env-args '{"task_mode":"surrogate_realistic","domain_filter":"surrogate model discovery"}'

Official Prime evals use Prime Inference by default, so the Hub-hosted path does not need the OpenRouter base URL or API-key flags.

Official Prime Eval Snapshot

Official Prime Evals on casella/hypothesis-forge@0.2.1 using Prime Inference, the default blended_hard benchmark, max_turns=25, 40 examples, and 4 rollouts per example. These runs were executed on March 7, 2026.

| Model | Avg Reward | Format | Feasibility | Novelty | Grounding | Experiment | Explanation | Eval |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| openai/gpt-4.1-mini | 0.401 | 1.000 | 0.360 | 0.213 | 0.460 | 0.427 | 0.477 | du1xh23l7j55ockk1xd6hxca |
| openai/gpt-5-mini | 0.466 | 1.000 | 0.345 | 0.203 | 0.476 | 0.801 | 0.416 | ocm2b5q4kmkiz4o6gc945rgm |
| openai/gpt-5 | 0.502 | 1.000 | 0.414 | 0.238 | 0.493 | 0.801 | 0.482 | jg4v0ocd7q4zpf0anq3zeikd |

Quickstart

# Default local eval
uv run vf-eval hypothesis-forge

# OpenRouter
export OPENROUTER_API_KEY=...
uv run vf-eval hypothesis-forge -b https://openrouter.ai/api/v1 -k OPENROUTER_API_KEY --debug

# Bundled realistic surrogate-model pack
uv run vf-eval hypothesis-forge -a '{"task_mode":"surrogate_realistic","domain_filter":"surrogate model discovery"}'

# Explicit default blended-hard public dev set
uv run vf-eval hypothesis-forge -n 40 -r 4 -a '{"task_mode":"blended_hard"}'

# Private holdout pack outside the public repo
uv run vf-eval hypothesis-forge -a '{"task_pack_path":"./holdout/private_surrogate_holdout.json"}'

# Calibration ladder helper
python scripts/calibrate_ladder.py --dry-run

Environment arguments

| Arg | Type | Default | Description |
| --- | --- | --- | --- |
| num_train | int | 200 | Generated train examples |
| num_eval | int | 40 | Generated eval examples |
| seed | int | 0 | Global seed |
| max_turns | int | 25 | Max tool-use turns |
| max_components | int | 4 | Max selected components |
| domain_filter | str \| None | None | Restrict tasks to one domain, such as "surrogate model discovery" |
| task_mode | str | "blended_hard" | Task source: "blended_hard" (default), "synthetic", or the bundled "surrogate_realistic" |
| task_pack_path | str \| None | None | Optional JSON or JSONL task pack path for private dev or holdout packs |
| system_prompt | str | built-in | Optional prompt override |
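Expressed as plain data, the defaults above look like this. The dict and the override helper are a sketch for illustration; the authoritative signature is `load_environment()` in the package itself.

```python
# Mirrors the environment-arguments table; this dict is illustrative,
# not the package's actual defaults object.
DEFAULT_ENV_ARGS = {
    "num_train": 200,
    "num_eval": 40,
    "seed": 0,
    "max_turns": 25,
    "max_components": 4,
    "domain_filter": None,
    "task_mode": "blended_hard",
    "task_pack_path": None,
}

def merged_args(**overrides):
    """Overlay user overrides on the defaults, rejecting unknown keys."""
    unknown = set(overrides) - set(DEFAULT_ENV_ARGS)
    if unknown:
        raise TypeError(f"unknown env args: {sorted(unknown)}")
    return {**DEFAULT_ENV_ARGS, **overrides}
```

This mirrors how `--env-args` JSON from the Prime and vf-eval commands above would be applied on top of the defaults.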

Limitations

  • Only surrogate model discovery has a realistic public dev pack; the other domains are still synthetic.
  • The scorer is symbolic and still not calibrated as sharply as I want. Official Prime evals now separate models, but the ladder is still more compressed than the target.
  • There is no cross-episode memory, curriculum, or bundled private holdout yet.

Next

The next step is recalibration: make the public benchmark harder, expand the realistic surrogate pack, add a private holdout through task_pack_path, and keep tuning until official Prime evals show a wider spread across the model ladder.
