🏆 Live Leaderboard | 🤖 DeepScholar Live Preview
📊 Dataset | 📄 Paper | 🎮 Discord
DeepScholar-Bench provides a live benchmark dataset and holistic evaluation of generative research synthesis, an emerging capability among AI systems designed for DeepResearch. We also developed DeepScholar-base, a strong open-source reference pipeline.
This repository provides:
- Dataset Scripts - which allow you to collect new datasets from recent, high-quality arXiv papers using our automated data-collection pipeline. You can set your own configurations (e.g., the valid date range and arXiv domains) to customize your dataset.
- An Evaluation Suite - for measuring the performance of long-form research synthesis answers. Our evaluation framework supports a holistic set of metrics, which show high agreement with human annotations. The eval suite is built using the LOTUS framework for LLM-based data processing, which provides a library for LLM-based evaluations and can be used directly to instantiate your own custom LLM-judges (see the sketch after this list).
- DeepScholar-base - our open-source reference pipeline for generative research synthesis. It is built on top of the LOTUS framework, which introduces and serves semantic operators for LLM-powered data processing. LOTUS' semantic operators provide a rich set of primitives, forming a superset of RAG that goes beyond search() and LM() calls. On DeepScholar-bench, our reference pipeline achieves performance competitive with OpenAI's DeepResearch while running 2x faster.
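Because the evaluation suite's judges are ordinary LOTUS programs over pandas DataFrames, you can prototype your own LLM-judge in a few lines. The sketch below is illustrative only: it assumes gpt-4o as the judge model and uses a made-up prompt, DataFrame, and column name rather than the benchmark's actual judge prompts.

```python
import pandas as pd

import lotus
from lotus.models import LM

# Configure LOTUS with the judge model (illustrative choice)
lotus.settings.configure(lm=LM(model="gpt-4o"))

# Toy answers to judge; in practice these would be system-generated related-works sections
answers = pd.DataFrame({
    "answer": [
        "Prior work on retrieval-augmented generation falls into three threads...",
        "Several papers study long-form synthesis; early approaches relied on...",
    ]
})

# sem_map runs the judging instruction over each row and appends the model's output as a new column
judged = answers.sem_map(
    "Rate the organization of this related-works section from 1 (poor) to 5 (excellent) "
    "and justify the score in one sentence: {answer}"
)
print(judged)
```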
If you run into any problems with the code in this repo, the leaderboard, or the dataset, please feel free to raise an issue and we will address it promptly. If you would like to add your AI system to the DeepScholar-bench leaderboard, please fill out this form.
To get started, make sure you are using Python 3.10, then clone the repository and install dependencies as follows:
```bash
# Clone the repository
git clone git@github.com:guestrin-lab/deepscholar-bench.git
cd deepscholar-bench

# Install dependencies
conda create -n dsbench python=3.10 -y
conda activate dsbench
pip install -r requirements.txt
```

You can start scraping your own datasets and running our holistic, automated evaluation suite using the commands below. For more details and a full introduction, please continue to our Dataset Scripts Description and/or our Evaluation library Description.
```bash
# Collect recent AI papers published since May 1, 2025
python -m data_pipeline.main \
    --categories cs.AI \
    --start-date 2025-05-01

# Evaluate the system answers generated by deepscholar_base_gpt_4.1, using gpt-4o as the judge model
# to assess the organization, nugget coverage, reference coverage, and citation precision metrics
python -m eval.main \
    --modes deepscholar_base \
    --evals organization nugget_coverage reference_coverage cite_p \
    --input_folder tests/baselines_results/deepscholar_base_gpt_4.1 \
    --output_folder results \
    --dataset_path dataset/related_works_combined.csv \
    --model_name gpt-4o
```

DeepScholar-Base is our reference research synthesis pipeline, which generates comprehensive literature reviews from a research query. It serves as a strong, open-source baseline and is built on LOTUS for efficient LLM-based data processing. For detailed documentation, see the DeepScholar Base README.
```python
import asyncio
from datetime import datetime

from deepscholar_base import deepscholar_base
from deepscholar_base.configs import Configs
from lotus.models import LM

configs = Configs(lm=LM(model="gpt-4o", temperature=1.0, max_tokens=10000))

async def main():
    final_report, docs_df, stats = await deepscholar_base(
        topic="What are the latest developments in retrieval-augmented generation?",
        end_date=datetime(2025, 1, 1),  # Only papers before this date
        configs=configs,
    )
    print(final_report)

asyncio.run(main())
```

We welcome contributions to DeepScholar-Bench! Please feel free to submit a PR for code contributions. If you would like to add your AI system to the DeepScholar-bench leaderboard, please fill out this form.
If you use DeepScholar-Bench in academic work, we would greatly appreciate it if you cite it as follows:
```bibtex
@article{patel2025deepscholarbench,
  title={DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis},
  author={Liana Patel and Negar Arabzadeh and Harshit Gupta and Ankita Sundar and Ion Stoica and Matei Zaharia and Carlos Guestrin},
  year={2025},
  url={https://arxiv.org/abs/2508.20033},
}
```