
ResearchClawBench

Official Site · GitHub · License: MIT · Python 3.10+

Evaluating AI Agents for Automated Research from Re-Discovery to New-Discovery

Quick Start | How It Works | Domains | Leaderboard | Add Your Agent

SGI Overview


ResearchClawBench is a benchmark that measures whether AI coding agents can independently conduct scientific research — from reading raw data to producing publication-quality reports — and then rigorously evaluates the results against real human-authored papers.

Unlike benchmarks that test coding ability or factual recall, ResearchClawBench asks: given the same data and tools a human researcher had, can an AI agent arrive at the same (or better) scientific conclusions?

✨ Highlights

  • 🔄 Two-Stage Pipeline — Autonomous research + rigorous peer-review-style evaluation
  • 🧪 40 Real-Science Tasks — 10 disciplines, complete datasets from published papers
  • 👁️ Expert-Annotated Data — Tasks, checklists & datasets curated by domain experts
  • 🤖 Multi-Agent Support — Claude Code, Codex CLI, OpenClaw, Nanobot & custom agents
  • 🚀 Re-Discovery to New-Discovery — 50 = match the paper, 70+ = surpass it
  • 📋 Fine-Grained Checklist — Per-item keywords, weights & reasoning
  • 📡 Live Streaming UI — Watch agents code, plot & write in real time
  • 🍃 Lightweight Dependencies — Pure Flask + vanilla JS, no heavy frameworks

📢 News

  • 2026-03-20 🐈 Added Nanobot as a new agent — ultra-lightweight OpenClaw alternative with reliable multi-step tool execution. Agent config moved to agents.json for easy customization.
  • 2026-03-19 🚀 Initial release with Claude Code, Codex CLI, and OpenClaw support. 40 tasks across 10 scientific domains.

🎬 Demo

demo.mp4

💡 Why ResearchClawBench?

Most AI benchmarks evaluate what models know. We evaluate what agents can do.

  • Real science, not toy problems. 40 tasks sourced from published papers across 10 disciplines, each with complete experimental datasets.
  • Two-stage pipeline. Autonomous research first, rigorous evaluation second — just like peer review.
  • Fine-grained, multimodal scoring. A weighted checklist with text and image criteria, judged by an LLM acting as a strict peer reviewer.
  • Agent-agnostic. Ships with first-class support for Claude Code, Codex CLI, and OpenClaw. Bring your own agent in one line.
  • From Re-Discovery to New-Discovery. Scoring above 50 means matching the original paper; above 70 means surpassing it. The frontier is wide open.

🏗️ Data Construction

Every task in ResearchClawBench is built through a rigorous, expert-driven pipeline to ensure scientific validity and reproducibility:

flowchart TD
    A["📄 High-Quality Paper Collection\n(Target Paper)"] --> B["🧑‍🔬 Human Expert Extraction\n(Core Task Instructions)"]
    B --> C["📋 Evaluation Checklist\n(Criteria + Keywords + Weights)"]
    B --> D["📂 Data & Related Work Collection\n(Datasets + Reference Papers)"]
    C --> E["✅ Human Reproduction & Validation\n(Verify checklist is reproducible)"]
    D --> E

    style A fill:#e0f2fe,stroke:#0284c7,stroke-width:2px
    style B fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
    style C fill:#fce7f3,stroke:#ec4899,stroke-width:2px
    style D fill:#f0fdf4,stroke:#22c55e,stroke-width:2px
    style E fill:#f5f3ff,stroke:#8b5cf6,stroke-width:2px
  1. High-Quality Paper Collection — Domain experts select recent, high-impact publications with clear methodology and reproducible results across 10 scientific disciplines.

  2. Expert Task Extraction — Human experts read each paper and distill the core research task into structured instructions, identifying the key scientific question, input data, and expected outputs.

  3. Checklist Design — Experts create a fine-grained evaluation checklist with weighted criteria (text and image items), each with specific technical keywords that a judge must verify.

  4. Data & Related Work Collection — The original datasets used in the paper are gathered, along with relevant reference materials, to form a self-contained research workspace.

  5. Human Reproduction & Validation — Human researchers independently reproduce the paper's results using only the provided data and instructions, verifying that every checklist item is achievable. This ensures the benchmark is fair and the checklist is grounded in reality.


⚙️ How It Works

ResearchClawBench operates in two distinct stages:

flowchart LR
    subgraph Stage1["Stage 1 — Auto Research"]
        A["Raw Data\n+ Instructions"] --> B["AI Agent\n(autonomous)"]
        B --> C["Code\n+ Figures\n+ Report"]
    end

    subgraph Stage2["Stage 2 — Evaluation"]
        C --> D["LLM Judge"]
        E["Target Paper\n+ Checklist"] --> D
        D --> F["Per-Item Scores\n+ Reasoning"]
    end

    style Stage1 fill:#f0f4ff,stroke:#3b82f6,stroke-width:2px
    style Stage2 fill:#fff7ed,stroke:#f59e0b,stroke-width:2px

Stage 1: Autonomous Research

Auto Research view — file explorer, live code output, and real-time agent conversation

The AI agent receives a workspace containing raw datasets, reference materials, and task instructions. It must independently:

  1. Explore the data and understand the research question
  2. Write code to analyze, model, and visualize the data
  3. Produce a research report (report/report.md) with figures, methodology, results, and discussion

No hand-holding. No chain-of-thought hints. The agent works in its own sandboxed workspace with full tool access — just like a real researcher.
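Conceptually, launching an agent in its sandboxed workspace is a subprocess call rooted at the workspace directory. The sketch below is illustrative only — it is not the repository's actual `run_task.py` logic, and the function name is invented — but it captures the shape of the step (the 3600 s timeout mirrors the OpenClaw preset listed under Supported Agents):

```python
import subprocess

def run_agent(cmd: str, workspace: str, timeout: int = 3600):
    """Run an agent command inside its workspace sandbox, capturing output.

    cmd is the fully substituted command line from agents.json; the working
    directory is the task workspace, so the agent only sees its own files.
    """
    return subprocess.run(
        cmd,
        shell=True,
        cwd=workspace,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

In the real server the subprocess output is streamed live to the UI rather than captured at the end, but the sandboxing idea — one process, one working directory per run — is the same.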

Stage 2: Reference-Based Evaluation

Evaluation view — target paper (left), AI report (center), scored checklist (right)

Once the agent finishes, its report is evaluated against the original published paper using a fine-grained checklist. The judge receives the task instructions, the AI report, and the checklist criteria — then scores each item using a dual-mode rubric:

flowchart TD
    subgraph Inputs
        I["INSTRUCTIONS.md\n(task background)"]
        R["Agent Report\n(text + figures)"]
        CL["Checklist\n(from target paper)"]
    end

    I & R & CL --> J["Multimodal LLM Judge"]

    J --> DET{"Determine\nEvaluation Mode"}

    DET -->|"Quantitative\nresults"| OBJ["Mode A: Objective\n(Metric Optimization)"]
    DET -->|"Qualitative\nreasoning"| SUB["Mode B: Subjective\n(Mechanism Analysis)"]

    OBJ --> SO["Score by metric\naccuracy vs paper"]
    SUB --> SS["Score by evidence\nstrength vs paper"]

    SO & SS --> T["Per-Item Scores\n+ Reasoning\n→ Weighted Total"]

    style Inputs fill:#f0f4ff,stroke:#3b82f6,stroke-width:2px
    style J fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
    style OBJ fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
    style SUB fill:#fce7f3,stroke:#ec4899,stroke-width:2px
    style T fill:#f0fdf4,stroke:#22c55e,stroke-width:2px

Each checklist item includes:

  • Specific criteria extracted from the paper's key contributions
  • Technical keywords the judge must verify (e.g., "ROC-AUC improvement", "Monte Carlo integration")
  • Weight reflecting the item's importance
  • Type — text for methodology/findings, image for figure comparison (multimodal vision)

The judge automatically determines which evaluation mode applies to each item, then scores it with the corresponding rubric (see below).
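For illustration, a single checklist item might look like the Python dictionary below. The field names (`criterion`, `keywords`, `weight`, `type`) mirror the bullets above, but the exact schema is an assumption, not the repository's actual task format; the keyword pre-check is likewise a sketch of one simple verification the judge could perform:

```python
# Hypothetical checklist item; field names follow the bullets above,
# not the repository's actual schema.
checklist_item = {
    "criterion": "Report reproduces the ROC-AUC improvement over the baseline",
    "keywords": ["ROC-AUC improvement", "baseline"],
    "weight": 0.15,   # relative importance in the weighted total
    "type": "text",   # "text" for methodology/findings, "image" for figures
}

def keywords_present(report_text: str, item: dict) -> bool:
    """Cheap pre-check: does the report mention every required keyword?"""
    text = report_text.lower()
    return all(kw.lower() in text for kw in item["keywords"])
```

In practice the multimodal judge does far more than substring matching — it reads the report and figures and scores each item against the rubrics below — but a keyword check of this kind is a useful sanity filter.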

Mode A: Objective Evaluation (Metric Optimization)

For checklist items involving specific numerical results, metrics, or quantitative outcomes:

| Score  | Meaning |
|--------|---------|
| 0      | Criterion completely absent |
| 1–10   | Mentioned but no quantitative results provided |
| 11–20  | Results given but methodology has fundamental errors |
| 21–30  | Significant methodological flaws; metrics deviate severely |
| 31–40  | Methodology mostly correct but metrics notably worse than the paper |
| 41–50  | Metrics roughly comparable to the paper |
| 51–60  | Metrics slightly better than the paper |
| 61–70  | Metrics clearly better than the paper |
| 71–80  | Methodology and metrics both substantially improved |
| 81–90  | Metrics dramatically surpass the paper |
| 91–100 | Breakthrough results far exceeding the paper |

Mode B: Subjective Evaluation (Mechanism Analysis)

For checklist items involving theoretical explanations, mechanistic insights, or interpretive analysis:

| Score  | Meaning |
|--------|---------|
| 0      | Criterion completely absent |
| 1–10   | Mentioned only with vague, generic statements |
| 11–20  | Some description but no substantive analysis |
| 21–30  | Analysis attempted but evidence insufficient or logic has gaps |
| 31–40  | Correct direction but lacks depth; key arguments missing |
| 41–50  | Analysis depth and rigor comparable to the paper |
| 51–60  | More supporting evidence provided than the paper |
| 61–70  | More complete logical chain and more rigorous argumentation |
| 71–80  | Significantly deeper analysis with novel insights |
| 81–90  | Analysis depth far exceeds the paper |
| 91–100 | Original contributions with breakthrough insights |

Strict by design. The judge is highly skeptical of AI-generated content — plausible-sounding claims must be backed by concrete evidence. Longer reports do not score higher. Substance over style.
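The "Weighted Total" produced by the judge reduces to a weighted average of the per-item scores. The sketch below assumes normalization by the sum of weights, which is the conventional choice but is not confirmed by the source:

```python
def weighted_total(items):
    """Weighted average of per-item scores.

    items: iterable of (score, weight) pairs, with scores on the 0-100 rubric.
    Assumes normalization by total weight (conventional, not confirmed).
    """
    items = list(items)
    total_weight = sum(w for _, w in items)
    if total_weight == 0:
        return 0.0
    return sum(s * w for s, w in items) / total_weight
```

For example, two items scored 50 (weight 2) and 70 (weight 1) yield (50·2 + 70·1) / 3 ≈ 56.7 — under this scheme, heavily weighted items dominate the total, matching the intent of per-item weights.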


🔬 10 Scientific Domains

Each domain contains 4 carefully curated tasks with complete experimental data from real published research:

| Domain | Example Topics | Data Types |
|--------|----------------|------------|
| Astronomy | Black hole superradiance, Bayesian stellar inference | .dat, .csv |
| Chemistry | GNN molecular prediction, protein-ligand docking | .pdb, .sdf, .csv |
| Earth | Glacier mass balance, climate datasets | .csv, multi-region series |
| Energy | Battery degradation, renewable energy modeling | .xlsx, time series |
| Information | NLP benchmarks, deep learning analysis | .pdf, .tex, .ipynb |
| Life | Nanopore sequencing, genomic analysis | .csv, .xlsx |
| Material | Materials property prediction, pretrained models | .pt, .csv |
| Math | Multi-agent pathfinding, optimization | .json, .npy, grid maps |
| Neuroscience | Neural decoding, brain signal processing | .csv, .h5, .yaml |
| Physics | Quantum geometry, superfluid stiffness | .h5, .json, .csv |

40 tasks total — each a self-contained research challenge selected from high-quality human-authored publications, spanning the full spectrum from data analysis to novel scientific insight.


🚀 Quick Start

1. Install

git clone https://github.com/InternScience/ResearchClawBench.git
cd ResearchClawBench
pip install -r evaluation/requirements.txt

2. Configure

Create evaluation/.env with your scoring model credentials:

OPENAI_API_KEY=sk-xxx
OPENAI_BASE_URL=https://api.openai.com/v1
SCORER_MODEL=gpt-5.1

3. Launch

python -m evaluation

Open http://localhost:5000 — browse tasks, pick an agent, hit Start Run, and watch the research happen live.

4. Score

After a run completes, switch to the Evaluation tab and click Score. The multimodal LLM judge evaluates each checklist item and returns per-item scores with reasoning.


🤖 Supported Agents

ResearchClawBench ships with built-in support for four frontier coding agents:

| Agent | Command | Notes |
|-------|---------|-------|
| Claude Code | `claude -p ...` | Anthropic, stream-JSON output |
| Codex CLI | `codex exec --full-auto ...` | OpenAI, full-auto mode |
| OpenClaw | `openclaw agent ...` | Self-hosted gateway, 3600 s timeout |
| Nanobot | `nanobot agent -m ...` | Ultra-lightweight, reliable tool execution |

🔧 Add Your Own Agent

Agent configuration is stored in evaluation/agents.json. To add a new agent, simply append an entry:

{
  "my_agent": {
    "label": "My Agent",
    "icon": "M",
    "logo": "/static/logos/my_agent.svg",
    "cmd": "my-agent run -m <PROMPT> -w <WORKSPACE>"
  }
}
| Placeholder | Replaced With | Notes |
|-------------|---------------|-------|
| `<PROMPT>` | Prompt content (via file path or `$(cat ...)`) | Required. For `-p`-style flags, replaced with the file path; otherwise replaced with `"$(cat 'path')"` to pass the content |
| `<WORKSPACE>` | Absolute path to the workspace directory | Optional. Only replaced if present in `cmd` |

The prompt injected into <PROMPT> is auto-generated from evaluation/instructions_tmpl.py, which combines a unified agent persona (autonomous execution guidelines, workspace sandbox rules) with task-specific instructions (description, data files, deliverables). All agents receive the exact same prompt — no code changes required, just edit the JSON file and restart the server.
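The placeholder substitution described above can be sketched roughly as follows. The function name, the `prompt_as_path` switch, and the exact quoting rules are assumptions for illustration — the real logic lives inside the evaluation package:

```python
import shlex

def build_command(cmd_template: str, prompt_path: str, workspace: str,
                  prompt_as_path: bool = False) -> str:
    """Fill the <PROMPT>/<WORKSPACE> placeholders in an agents.json cmd template.

    prompt_as_path: True for agents whose -p style flag takes a file path;
    otherwise the prompt content is passed via shell command substitution.
    """
    if prompt_as_path:
        prompt_arg = shlex.quote(prompt_path)
    else:
        prompt_arg = f'"$(cat {shlex.quote(prompt_path)})"'
    cmd = cmd_template.replace("<PROMPT>", prompt_arg)
    # <WORKSPACE> is optional: only substituted when the template uses it.
    if "<WORKSPACE>" in cmd:
        cmd = cmd.replace("<WORKSPACE>", shlex.quote(workspace))
    return cmd
```

Under these assumptions, the example template above would expand to something like `my-agent run -m "$(cat /path/to/prompt.md)" -w /path/to/workspace` at run time.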


🏆 Leaderboard

The built-in dashboard aggregates the best score per (task, agent) pair and displays:

  • Frontier chart — best score per task across all agents
  • Leaderboard table — clickable cells linking to individual runs
  • Per-task breakdown — view any agent's report, code, and score reasoning

The frontier represents the state of the art — every point above 50 is uncharted territory where AI surpasses human researchers on that specific task.
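Aggregating the best score per (task, agent) pair, as the dashboard does, reduces to a max over runs. A minimal sketch, assuming runs are available as (task, agent, score) tuples (the actual server reads them from stored run records):

```python
def best_scores(runs):
    """runs: iterable of (task, agent, score) tuples.

    Returns {(task, agent): best score} -- the leaderboard table cells.
    """
    best = {}
    for task, agent, score in runs:
        key = (task, agent)
        if key not in best or score > best[key]:
            best[key] = score
    return best

def frontier(best):
    """Best score per task across all agents (the frontier chart)."""
    front = {}
    for (task, _agent), score in best.items():
        front[task] = max(front.get(task, float("-inf")), score)
    return front
```

Keeping only the best score per pair means a single strong run defines an agent's standing on a task, regardless of how many weaker attempts preceded it.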


📁 Project Structure

ResearchClawBench/
├── evaluation/                 # Core evaluation framework
│   ├── server.py               # Flask API + SSE streaming
│   ├── run_task.py             # Workspace setup + agent subprocess
│   ├── score.py                # Multimodal LLM scoring engine
│   ├── config.py               # Paths, constants, loads agents.json
│   ├── agents.json             # Agent presets (add your own here)
│   ├── instructions_tmpl.py    # Unified prompt template for all agents
│   ├── utils.py                # File tree, path safety, discovery
│   ├── static/app.js           # Single-file frontend
│   └── templates/index.html    # Entry point
├── tasks/                      # 40 research tasks
│   ├── Astronomy_000/
│   │   ├── task_info.json      # Task description + data manifest
│   │   ├── data/               # Raw experimental datasets
│   │   ├── related_work/       # Reference papers
│   │   └── target_study/       # Paper + checklist + images
│   ├── Chemistry_000/
│   └── ...                     # 10 domains x 4 tasks
└── workspaces/                 # Generated at runtime (gitignored)

🤝 Contributing

We welcome contributions in several forms:

  • New tasks — Add research challenges in existing or new domains
  • New agents — Add presets for emerging coding agents
  • Bug reports — Open an issue

📧 Email: xu_wanghan@sjtu.edu.cn


📜 Citation

If you would like to cite our work, please use the following BibTeX entry:

@article{xu2025probing,
  title={Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows},
  author={Xu, Wanghan and Zhou, Yuhao and Zhou, Yifan and Cao, Qinglong and Li, Shuo and Bu, Jia and Liu, Bo and Chen, Yixin and He, Xuming and Zhao, Xiangyu and others},
  journal={arXiv preprint arXiv:2512.16969},
  year={2025}
}
