bibextract

A Python package (with a Rust backend) for extracting survey content and bibliography from arXiv papers.

There are a lot of arXiv MCP tools already. This is another.

What it does differently is that it extracts content directly from the LaTeX source of the paper, rather than parsing the PDF.

It also focuses entirely on survey/background/related work sections. Right now this tool will ignore all the other sections.
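To make the idea concrete, here is a minimal sketch of pulling survey-style sections out of LaTeX source. It is not the package's actual implementation (that lives in the Rust backend); the keyword list, the regex, and the function name `extract_survey_sections` are assumptions made for illustration, and it only looks at top-level `\section` headings.

```python
import re

# Hypothetical heading keywords; the real matching rules may differ.
SURVEY_KEYWORDS = ("related work", "background", "survey", "literature review")

# Matches \section{Title} and \section*{Title} (no nested braces in the title).
SECTION_RE = re.compile(r"\\section\*?\{([^}]*)\}")


def extract_survey_sections(latex: str) -> dict[str, str]:
    """Return {title: body} for top-level sections whose title matches a keyword."""
    matches = list(SECTION_RE.finditer(latex))
    sections = {}
    for i, m in enumerate(matches):
        title = m.group(1).strip()
        if not any(k in title.lower() for k in SURVEY_KEYWORDS):
            continue  # every other section is ignored
        # A section body runs until the next \section, or to the end of the source.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(latex)
        sections[title] = latex[m.end():end].strip()
    return sections
```

A real extractor would also need to handle `\input` files, subsections, and trailing material such as `\bibliography` or `\end{document}`, which this sketch leaves in the last section's body.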

Once it extracts the content, it also looks at the BBL file and tries to reconstruct the .bibtex file and normalise the entries. Not all BBL files work (see tests/fixtures for examples). Once it has a title/author/year, it will try to look up the arXiv ID or DOI of the paper, and use that in the BibTeX entry instead of the raw entry from the BBL file.

This citation normalisation means that you can pass multiple papers to it and it will extract the related work content and bibliography from all of them, merging them into a single output, with limited overlap.
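As a rough illustration of why normalised identifiers limit overlap, the sketch below merges several bibliographies by keying each entry on its arXiv ID, then its DOI, then a title/year fallback. The entry dictionaries and their field names (`arxiv_id`, `doi`, `title`, `year`) are hypothetical and do not reflect the package's internal data model.

```python
def entry_key(entry: dict) -> str:
    """Prefer a stable identifier (arXiv ID, then DOI); fall back to title and year."""
    if entry.get("arxiv_id"):
        return "arxiv:" + entry["arxiv_id"].lower()
    if entry.get("doi"):
        return "doi:" + entry["doi"].lower()
    # Collapse whitespace and case so trivially different titles still collide.
    title = " ".join(entry.get("title", "").lower().split())
    return f"title:{title}:{entry.get('year', '')}"


def merge_bibliographies(*bibliographies: list[dict]) -> list[dict]:
    """Merge entries from several papers, keeping the first entry seen per key."""
    merged = {}
    for bib in bibliographies:
        for entry in bib:
            merged.setdefault(entry_key(entry), entry)
    return list(merged.values())
```

Entries that could not be resolved to an arXiv ID or DOI only deduplicate when their titles and years agree after normalisation, which is one reason some overlap can remain.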

The goal of this tool is to make it easy to get LLM agents to read/cite/write background sections of papers. In a loop, an agent could read a paper, extract the related work section, and then use all the arXiv IDs in that section to extract the related work sections of those papers, and so on. This way, you can build a large corpus of related work content without having to manually search for papers.
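The loop described above could look something like the following sketch. It assumes the extraction function takes a list of arXiv IDs and returns a dict with a `bibtex` string, as `bibextract.extract_survey` does in the Python API section; the function name `crawl_related_work`, the `max_papers` cap, and the ID regex are assumptions added for illustration.

```python
import re
from collections import deque
from typing import Callable

# Matches new-style arXiv IDs such as 2104.08653, with an optional version suffix.
# This is a heuristic: it can also match ID-like digit runs inside DOIs.
ARXIV_ID_RE = re.compile(r"\b(\d{4}\.\d{4,5})(?:v\d+)?\b")


def crawl_related_work(seed_ids, extract: Callable[[list], dict], max_papers: int = 20) -> dict:
    """Breadth-first expansion over cited arXiv IDs, capped at max_papers."""
    queue = deque(seed_ids)
    seen = set()
    results = {}
    while queue and len(results) < max_papers:
        paper_id = queue.popleft()
        if paper_id in seen:
            continue
        seen.add(paper_id)
        result = extract([paper_id])
        results[paper_id] = result
        # Queue every arXiv ID found in this paper's bibliography.
        for cited in ARXIV_ID_RE.findall(result.get("bibtex", "")):
            if cited not in seen:
                queue.append(cited)
    return results
```

With the package installed, this might be called as `crawl_related_work(['2104.08653'], bibextract.extract_survey)`. The cap matters because citation graphs grow quickly, and each call downloads a paper's source.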

Some future todos

improve test coverage
add more .bbl files to tests
improve the MCP docs for the tool

Installation

Installing via Smithery

To install bibextract for Claude Desktop automatically via Smithery:

npx -y @smithery/cli install @gautierdag/bibextract --client claude

fastMCP server implementation

uv run bibextract_mcp.py

fastMCP from URL

# obviously check the file before running it, don't trust random scripts from the internet
uv run --python 3.12 https://raw.githubusercontent.com/gautierdag/bibextract/refs/heads/main/bibextract_mcp.py

From PyPI

uv add bibextract

From Source

Install Rust (if not already installed):

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env

Install maturin:
```
pip install maturin
```

Clone and build:

git clone https://github.com/gautierdag/bibextract.git
cd bibextract
maturin develop

Usage

Python API

import bibextract

# Process one or more arXiv papers
result = bibextract.extract_survey(['2104.08653', '1912.02292'])

# Access the extracted content
survey_text = result['survey_text']  # Raw LaTeX with sections
bibtex = result['bibtex']           # BibTeX bibliography

# Save to files
with open('survey.tex', 'w') as f:
    f.write(survey_text)

with open('bibliography.bib', 'w') as f:
    f.write(bibtex)

Command Line (original Rust binary)

# Build the CLI tool
cargo build --release

# Process papers
./target/release/bibextract --paper-ids 2104.08653 1912.02292 --output survey.tex

Development

Running Tests

cargo test
pytest tests

License

This project is licensed under the MIT License - see the LICENSE file for details.
