Free pipeline: arXiv OAI-PMH harvest, PDF download, pymupdf4llm conversion, and push to the Hugging Face Hub. Produces fresh post-2024 corpora in the common-pile schema.
Updated Apr 28, 2026 - Python
CLI for downloading Project Gutenberg e-texts - orderly, efficient, and vaguely Teutonic
Turn a folder of PDFs into an analysis-ready text corpus. Wizard + skills + scripts for Claude Code and Codex.