
ResearchClawBench

Official Site · GitHub · License: MIT · Python 3.10+

Evaluating AI Agents for Automated Research from Re-Discovery to New-Discovery

Quick Start | How It Works | Domains | Leaderboard | Add Your Agent

SGI Overview


ResearchClawBench is a benchmark that measures whether AI coding agents can independently conduct scientific research — from reading raw data to producing publication-quality reports — and then rigorously evaluates the results against real human-authored papers.

Unlike benchmarks that test coding ability or factual recall, ResearchClawBench asks: given the same data and tools a human researcher had, can an AI agent arrive at the same (or better) scientific conclusions?

✨ Highlights

  • 🔄 Two-Stage Pipeline — Autonomous research + rigorous peer-review-style evaluation
  • 🧪 40 Real-Science Tasks — 10 disciplines, complete datasets from published papers
  • 👁️ Expert-Annotated Data — Tasks, checklists & datasets curated by domain experts
  • 🤖 Multi-Agent Support — Claude Code, Codex CLI, OpenClaw, Nanobot & custom agents
  • 🚀 Re-Discovery to New-Discovery — 50 = match the paper, 70+ = surpass it
  • 📋 Fine-Grained Checklist — Per-item keywords, weights & reasoning
  • 📡 Live Streaming UI — Watch agents code, plot & write in real time
  • 🍃 Lightweight Dependencies — Pure Flask + vanilla JS, no heavy frameworks

📢 News

  • 2026-03-20 🐈 Added Nanobot as a new agent — ultra-lightweight OpenClaw alternative with reliable multi-step tool execution. Agent config moved to agents.json for easy customization.
  • 2026-03-19 🚀 Initial release with Claude Code, Codex CLI, and OpenClaw support. 40 tasks across 10 scientific domains.

🎬 Demo

demo.mp4

💡 Why ResearchClawBench?

Most AI benchmarks evaluate what models know. We evaluate what agents can do.

  • Real science, not toy problems. 40 tasks sourced from published papers across 10 disciplines, each with complete experimental datasets.
  • Two-stage pipeline. Autonomous research first, rigorous evaluation second — just like peer review.
  • Fine-grained, multimodal scoring. A weighted checklist with text and image criteria, judged by an LLM acting as a strict peer reviewer.
  • Agent-agnostic. Ships with first-class support for Claude Code, Codex CLI, and OpenClaw. Bring your own agent in one line.
  • From Re-Discovery to New-Discovery. Scoring above 50 means matching the original paper; above 70 means surpassing it. The frontier is wide open.

🏗️ Data Construction

Every task in ResearchClawBench is built through a rigorous, expert-driven pipeline to ensure scientific validity and reproducibility:

flowchart TD
    A["📄 High-Quality Paper Collection\n(Target Paper)"] --> B["🧑‍🔬 Human Expert Extraction\n(Core Task Instructions)"]
    B --> C["📋 Evaluation Checklist\n(Criteria + Keywords + Weights)"]
    B --> D["📂 Data & Related Work Collection\n(Datasets + Reference Papers)"]
    C --> E["✅ Human Reproduction & Validation\n(Verify checklist is reproducible)"]
    D --> E

    style A fill:#e0f2fe,stroke:#0284c7,stroke-width:2px
    style B fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
    style C fill:#fce7f3,stroke:#ec4899,stroke-width:2px
    style D fill:#f0fdf4,stroke:#22c55e,stroke-width:2px
    style E fill:#f5f3ff,stroke:#8b5cf6,stroke-width:2px
  1. High-Quality Paper Collection — Domain experts select recent, high-impact publications with clear methodology and reproducible results across 10 scientific disciplines.

  2. Expert Task Extraction — Human experts read each paper and distill the core research task into structured instructions, identifying the key scientific question, input data, and expected outputs.

  3. Checklist Design — Experts create a fine-grained evaluation checklist with weighted criteria (text and image items), each with specific technical keywords that a judge must verify.

  4. Data & Related Work Collection — The original datasets used in the paper are gathered, along with relevant reference materials, to form a self-contained research workspace.

  5. Human Reproduction & Validation — Human researchers independently reproduce the paper's results using only the provided data and instructions, verifying that every checklist item is achievable. This ensures the benchmark is fair and the checklist is grounded in reality.


⚙️ How It Works

ResearchClawBench operates in two distinct stages:

flowchart LR
    subgraph Stage1["Stage 1 — Auto Research"]
        A["Raw Data\n+ Instructions"] --> B["AI Agent\n(autonomous)"]
        B --> C["Code\n+ Figures\n+ Report"]
    end

    subgraph Stage2["Stage 2 — Evaluation"]
        C --> D["LLM Judge"]
        E["Target Paper\n+ Checklist"] --> D
        D --> F["Per-Item Scores\n+ Reasoning"]
    end

    style Stage1 fill:#f0f4ff,stroke:#3b82f6,stroke-width:2px
    style Stage2 fill:#fff7ed,stroke:#f59e0b,stroke-width:2px

Stage 1: Autonomous Research

Auto Research view — file explorer, live code output, and real-time agent conversation

The AI agent receives a workspace containing raw datasets, reference materials, and task instructions. It must independently:

  1. Explore the data and understand the research question
  2. Write code to analyze, model, and visualize the data
  3. Produce a research report (report/report.md) with figures, methodology, results, and discussion

No hand-holding. No chain-of-thought hints. The agent works in its own sandboxed workspace with full tool access — just like a real researcher.
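Conceptually, launching an agent in its sandboxed workspace is a subprocess call rooted at the workspace directory. The sketch below is illustrative only — it is not the repository's actual `run_task.py` logic, and the function name is invented — but it captures the shape of the step (the 3600 s timeout mirrors the OpenClaw preset listed under Supported Agents):

```python
import subprocess

def run_agent(cmd: str, workspace: str, timeout: int = 3600):
    """Run an agent command inside its workspace sandbox, capturing output.

    cmd is the fully substituted command line from agents.json; the working
    directory is the task workspace, so the agent only sees its own files.
    """
    return subprocess.run(
        cmd,
        shell=True,
        cwd=workspace,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

In the real server the subprocess output is streamed live to the UI rather than captured at the end, but the sandboxing idea — one process, one working directory per run — is the same.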

Stage 2: Reference-Based Evaluation

Evaluation view — target paper (left), AI report (center), scored checklist (right)

Once the agent finishes, its report is evaluated against the original published paper using a fine-grained checklist. The judge receives the task instructions, the AI report, and the checklist criteria — then scores each item using a dual-mode rubric:

flowchart TD
    subgraph Inputs
        I["INSTRUCTIONS.md\n(task background)"]
        R["Agent Report\n(text + figures)"]
        CL["Checklist\n(from target paper)"]
    end

    I & R & CL --> J["Multimodal LLM Judge"]

    J --> DET{"Determine\nEvaluation Mode"}

    DET -->|"Quantitative\nresults"| OBJ["Mode A: Objective\n(Metric Optimization)"]
    DET -->|"Qualitative\nreasoning"| SUB["Mode B: Subjective\n(Mechanism Analysis)"]

    OBJ --> SO["Score by metric\naccuracy vs paper"]
    SUB --> SS["Score by evidence\nstrength vs paper"]

    SO & SS --> T["Per-Item Scores\n+ Reasoning\n→ Weighted Total"]

    style Inputs fill:#f0f4ff,stroke:#3b82f6,stroke-width:2px
    style J fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
    style OBJ fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
    style SUB fill:#fce7f3,stroke:#ec4899,stroke-width:2px
    style T fill:#f0fdf4,stroke:#22c55e,stroke-width:2px

Each checklist item includes:

  • Specific criteria extracted from the paper's key contributions
  • Technical keywords the judge must verify (e.g., "ROC-AUC improvement", "Monte Carlo integration")
  • Weight reflecting the item's importance
  • Type — text for methodology/findings, image for figure comparison (multimodal vision)

The judge automatically determines which evaluation mode applies to each item, then scores it with the corresponding rubric (see below).
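For illustration, a single checklist item might look like the Python dictionary below. The field names (`criterion`, `keywords`, `weight`, `type`) mirror the bullets above, but the exact schema is an assumption, not the repository's actual task format; the keyword pre-check is likewise a sketch of one simple verification the judge could perform:

```python
# Hypothetical checklist item; field names follow the bullets above,
# not the repository's actual schema.
checklist_item = {
    "criterion": "Report reproduces the ROC-AUC improvement over the baseline",
    "keywords": ["ROC-AUC improvement", "baseline"],
    "weight": 0.15,   # relative importance in the weighted total
    "type": "text",   # "text" for methodology/findings, "image" for figures
}

def keywords_present(report_text: str, item: dict) -> bool:
    """Cheap pre-check: does the report mention every required keyword?"""
    text = report_text.lower()
    return all(kw.lower() in text for kw in item["keywords"])
```

In practice the multimodal judge does far more than substring matching — it reads the report and figures and scores each item against the rubrics below — but a keyword check of this kind is a useful sanity filter.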

Mode A: Objective Evaluation (Metric Optimization)

For checklist items involving specific numerical results, metrics, or quantitative outcomes:

| Score  | Meaning |
|--------|---------|
| 0      | Criterion completely absent |
| 1–10   | Mentioned but no quantitative results provided |
| 11–20  | Results given but methodology has fundamental errors |
| 21–30  | Significant methodological flaws; metrics deviate severely |
| 31–40  | Methodology mostly correct but metrics notably worse than the paper |
| 41–50  | Metrics roughly comparable to the paper |
| 51–60  | Metrics slightly better than the paper |
| 61–70  | Metrics clearly better than the paper |
| 71–80  | Methodology and metrics both substantially improved |
| 81–90  | Metrics dramatically surpass the paper |
| 91–100 | Breakthrough results far exceeding the paper |

Mode B: Subjective Evaluation (Mechanism Analysis)

For checklist items involving theoretical explanations, mechanistic insights, or interpretive analysis:

| Score  | Meaning |
|--------|---------|
| 0      | Criterion completely absent |
| 1–10   | Mentioned only with vague, generic statements |
| 11–20  | Some description but no substantive analysis |
| 21–30  | Analysis attempted but evidence insufficient or logic has gaps |
| 31–40  | Correct direction but lacks depth; key arguments missing |
| 41–50  | Analysis depth and rigor comparable to the paper |
| 51–60  | More supporting evidence provided than the paper |
| 61–70  | More complete logical chain and more rigorous argumentation |
| 71–80  | Significantly deeper analysis with novel insights |
| 81–90  | Analysis depth far exceeds the paper |
| 91–100 | Original contributions with breakthrough insights |

Strict by design. The judge is highly skeptical of AI-generated content — plausible-sounding claims must be backed by concrete evidence. Longer reports do not score higher. Substance over style.
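The "Weighted Total" produced by the judge reduces to a weighted average of the per-item scores. The sketch below assumes normalization by the sum of weights, which is the conventional choice but is not confirmed by the source:

```python
def weighted_total(items):
    """Weighted average of per-item scores.

    items: iterable of (score, weight) pairs, with scores on the 0-100 rubric.
    Assumes normalization by total weight (conventional, not confirmed).
    """
    items = list(items)
    total_weight = sum(w for _, w in items)
    if total_weight == 0:
        return 0.0
    return sum(s * w for s, w in items) / total_weight
```

For example, two items scored 50 (weight 2) and 70 (weight 1) yield (50·2 + 70·1) / 3 ≈ 56.7 — under this scheme, heavily weighted items dominate the total, matching the intent of per-item weights.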


🔬 10 Scientific Domains

Each domain contains 4 carefully curated tasks with complete experimental data from real published research:

| Domain | Example Topics | Data Types |
|--------|----------------|------------|
| Astronomy | Black hole superradiance, Bayesian stellar inference | .dat, .csv |
| Chemistry | GNN molecular prediction, protein-ligand docking | .pdb, .sdf, .csv |
| Earth | Glacier mass balance, climate datasets | .csv, multi-region series |
| Energy | Battery degradation, renewable energy modeling | .xlsx, time series |
| Information | NLP benchmarks, deep learning analysis | .pdf, .tex, .ipynb |
| Life | Nanopore sequencing, genomic analysis | .csv, .xlsx |
| Material | Materials property prediction, pretrained models | .pt, .csv |
| Math | Multi-agent pathfinding, optimization | .json, .npy, grid maps |
| Neuroscience | Neural decoding, brain signal processing | .csv, .h5, .yaml |
| Physics | Quantum geometry, superfluid stiffness | .h5, .json, .csv |

40 tasks total — each a self-contained research challenge selected from high-quality human-authored publications, spanning the full spectrum from data analysis to novel scientific insight.


🚀 Quick Start

1. Install

git clone https://github.com/InternScience/ResearchClawBench.git
cd ResearchClawBench
pip install -r evaluation/requirements.txt

2. Configure

Create evaluation/.env with your scoring model credentials:

OPENAI_API_KEY=sk-xxx
OPENAI_BASE_URL=https://api.openai.com/v1
SCORER_MODEL=gpt-5.1

3. Launch

python -m evaluation

Open http://localhost:5000 — browse tasks, pick an agent, hit Start Run, and watch the research happen live.

4. Score

After a run completes, switch to the Evaluation tab and click Score. The multimodal LLM judge evaluates each checklist item and returns per-item scores with reasoning.


🤖 Supported Agents

ResearchClawBench ships with built-in support for four frontier coding agents:

| Agent | Command | Notes |
|-------|---------|-------|
| Claude Code | `claude -p ...` | Anthropic, stream-JSON output |
| Codex CLI | `codex exec --full-auto ...` | OpenAI, full-auto mode |
| OpenClaw | `openclaw agent ...` | Self-hosted gateway, 3600 s timeout |
| Nanobot | `nanobot agent -m ...` | Ultra-lightweight, reliable tool execution |

🔧 Add Your Own Agent

Agent configuration is stored in evaluation/agents.json. To add a new agent, simply append an entry:

{
  "my_agent": {
    "label": "My Agent",
    "icon": "M",
    "logo": "/static/logos/my_agent.svg",
    "cmd": "my-agent run -m <PROMPT> -w <WORKSPACE>"
  }
}
| Placeholder | Replaced With | Notes |
|-------------|---------------|-------|
| `<PROMPT>` | Prompt content (via file path or `$(cat ...)`) | Required. For `-p`-style flags, replaced with the file path; otherwise replaced with `"$(cat 'path')"` to pass the content |
| `<WORKSPACE>` | Absolute path to the workspace directory | Optional. Only replaced if present in `cmd` |

The prompt injected into <PROMPT> is auto-generated from evaluation/instructions_tmpl.py, which combines a unified agent persona (autonomous execution guidelines, workspace sandbox rules) with task-specific instructions (description, data files, deliverables). All agents receive the exact same prompt — no code changes required, just edit the JSON file and restart the server.
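The placeholder substitution described above can be sketched roughly as follows. The function name, the `prompt_as_path` switch, and the exact quoting rules are assumptions for illustration — the real logic lives inside the evaluation package:

```python
import shlex

def build_command(cmd_template: str, prompt_path: str, workspace: str,
                  prompt_as_path: bool = False) -> str:
    """Fill the <PROMPT>/<WORKSPACE> placeholders in an agents.json cmd template.

    prompt_as_path: True for agents whose -p style flag takes a file path;
    otherwise the prompt content is passed via shell command substitution.
    """
    if prompt_as_path:
        prompt_arg = shlex.quote(prompt_path)
    else:
        prompt_arg = f'"$(cat {shlex.quote(prompt_path)})"'
    cmd = cmd_template.replace("<PROMPT>", prompt_arg)
    # <WORKSPACE> is optional: only substituted when the template uses it.
    if "<WORKSPACE>" in cmd:
        cmd = cmd.replace("<WORKSPACE>", shlex.quote(workspace))
    return cmd
```

Under these assumptions, the example template above would expand to something like `my-agent run -m "$(cat /path/to/prompt.md)" -w /path/to/workspace` at run time.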


🏆 Leaderboard

The built-in dashboard aggregates the best score per (task, agent) pair and displays:

  • Frontier chart — best score per task across all agents
  • Leaderboard table — clickable cells linking to individual runs
  • Per-task breakdown — view any agent's report, code, and score reasoning

The frontier represents the state of the art — every point above 50 is uncharted territory where AI surpasses human researchers on that specific task.
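Aggregating the best score per (task, agent) pair, as the dashboard does, reduces to a max over runs. A minimal sketch, assuming runs are available as (task, agent, score) tuples (the actual server reads them from stored run records):

```python
def best_scores(runs):
    """runs: iterable of (task, agent, score) tuples.

    Returns {(task, agent): best score} -- the leaderboard table cells.
    """
    best = {}
    for task, agent, score in runs:
        key = (task, agent)
        if key not in best or score > best[key]:
            best[key] = score
    return best

def frontier(best):
    """Best score per task across all agents (the frontier chart)."""
    front = {}
    for (task, _agent), score in best.items():
        front[task] = max(front.get(task, float("-inf")), score)
    return front
```

Keeping only the best score per pair means a single strong run defines an agent's standing on a task, regardless of how many weaker attempts preceded it.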


📁 Project Structure

ResearchClawBench/
├── evaluation/                 # Core evaluation framework
│   ├── server.py               # Flask API + SSE streaming
│   ├── run_task.py             # Workspace setup + agent subprocess
│   ├── score.py                # Multimodal LLM scoring engine
│   ├── config.py               # Paths, constants, loads agents.json
│   ├── agents.json             # Agent presets (add your own here)
│   ├── instructions_tmpl.py    # Unified prompt template for all agents
│   ├── utils.py                # File tree, path safety, discovery
│   ├── static/app.js           # Single-file frontend
│   └── templates/index.html    # Entry point
├── tasks/                      # 40 research tasks
│   ├── Astronomy_000/
│   │   ├── task_info.json      # Task description + data manifest
│   │   ├── data/               # Raw experimental datasets
│   │   ├── related_work/       # Reference papers
│   │   └── target_study/       # Paper + checklist + images
│   ├── Chemistry_000/
│   └── ...                     # 10 domains x 4 tasks
└── workspaces/                 # Generated at runtime (gitignored)

🤝 Contributing

We welcome contributions in several forms:

  • New tasks — Add research challenges in existing or new domains
  • New agents — Add presets for emerging coding agents
  • Bug reports — Open an issue

📧 Email: xu_wanghan@sjtu.edu.cn


📜 Citation

If you would like to cite our work, please use the following BibTeX entry:

@article{xu2025probing,
  title={Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows},
  author={Xu, Wanghan and Zhou, Yuhao and Zhou, Yifan and Cao, Qinglong and Li, Shuo and Bu, Jia and Liu, Bo and Chen, Yixin and He, Xuming and Zhao, Xiangyu and others},
  journal={arXiv preprint arXiv:2512.16969},
  year={2025}
}
