scdenney/corpus-building

Corpus Building

A wizard, six Claude Code skills, and the scripts and templates that connect them — for turning a folder of documents (PDFs, DOCX, HTML, ePub, XML, plain text) into an analysis-ready text corpus.

Live wizard: https://scdenney.github.io/corpus-building/


What this is

A text corpus is a structured collection of documents — usually one row per article or page, with metadata — ready to load into analysis software like Orange, R, or Python. Getting there from a folder of source files (PDFs most often, but also Word documents, HTML, ePub, XML, plain text) involves real decisions: which extraction or OCR approach to use, what metadata to track, how to structure the output. The repo gives you a wizard that picks sensible defaults, a set of Claude Code skills that codify the choices, and the scripts and templates that make the pipeline reproducible.
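As a rough illustration of the "one row per document, with metadata" shape described above, the following sketch writes a two-row stand-in corpus and loads it back with Python's standard library. The column names (`doc_id`, `date`, `text`) and the sample rows are assumptions made for this example, not the schema this repo actually emits.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical columns and rows; the real corpus.csv schema may differ.
FIELDS = ["doc_id", "date", "text"]
SAMPLE_ROWS = [
    {"doc_id": "ed_001", "date": "1987-06-10", "text": "First editorial text."},
    {"doc_id": "ed_002", "date": "1987-06-29", "text": "Second editorial text."},
]


def write_sample_corpus(path):
    """Write a tiny stand-in corpus so the example runs without real data."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)


def load_corpus(path):
    """Read a one-row-per-document CSV into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


corpus_path = Path(tempfile.mkdtemp()) / "corpus.csv"
write_sample_corpus(corpus_path)
corpus = load_corpus(corpus_path)
print(len(corpus), "documents; columns:", list(corpus[0].keys()))
```

The same file would load equally well in R or Orange; the point is only that each row carries one document's text alongside its metadata.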

Built primarily for students and staff at Leiden University, it works for anyone with documents and a Claude or OpenAI account. The repo stops at the analysis-ready corpus; what happens after (topic modeling, NER, classification, embeddings) is a separate future module.

The pipeline at a glance

flowchart LR
  A[Folder of source files] --> B[inventory_builder.py]
  B --> C[manifest.csv]
  C --> D{Wizard picks<br/>a path}
  D -->|API| E[Cloud API<br/>Claude / GPT / Gemini]
  D -->|HPC| F[ALICE<br/>vLLM + SLURM]
  D -->|Local GPU| G[HF Transformers<br/>8B / 13B / 32B]
  D -->|Non-PDF formats| K[source-file-extraction<br/>DOCX / HTML / ePub / XML]
  E --> H[results_raw.json<br/>per document]
  F --> H
  G --> H
  K --> H
  H --> I[corpus_assembler.py]
  I --> J[corpus.csv + corpus.json<br/>analysis-ready]
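The first and last stages of the flowchart can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the code in inventory_builder.py or corpus_assembler.py: the function names, the manifest columns, the `<doc_id>.json` naming of per-document results, and the `text` key inside each result are all hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical manifest columns, chosen for this sketch only.
MANIFEST_FIELDS = ["doc_id", "filename", "format"]


def build_manifest(source_dir, manifest_path):
    """List every file under source_dir as one manifest row and save it as CSV."""
    rows = []
    for path in sorted(Path(source_dir).rglob("*")):
        if path.is_file():
            rows.append({
                "doc_id": path.stem,
                "filename": path.name,
                "format": path.suffix.lstrip(".").lower(),
            })
    with open(manifest_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=MANIFEST_FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows


def assemble_corpus(manifest_rows, results_dir):
    """Attach extracted text to each manifest row; skip documents with no result."""
    corpus = []
    for row in manifest_rows:
        result_file = Path(results_dir) / f"{row['doc_id']}.json"
        if not result_file.exists():
            continue  # extraction has not produced output for this document
        result = json.loads(result_file.read_text(encoding="utf-8"))
        corpus.append({**row, "text": result["text"]})
    return corpus


# Throwaway demo data: two stub source files, one finished extraction result.
work = Path(tempfile.mkdtemp())
sources, results = work / "sources", work / "results"
sources.mkdir()
results.mkdir()
(sources / "a.pdf").write_bytes(b"%PDF-1.4 stub")
(sources / "b.docx").write_bytes(b"stub")
(results / "a.json").write_text(
    json.dumps({"text": "Extracted text of a."}), encoding="utf-8"
)

manifest = build_manifest(sources, work / "manifest.csv")
corpus = assemble_corpus(manifest, results)
print(len(manifest), "files inventoried;", len(corpus), "assembled")
```

Keeping the manifest as the single source of metadata means the extraction step in the middle (API, HPC, local GPU, or direct parsing) can change without touching the assembly logic.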

Quick start

See what it produces. Start with a student-scale example (~75 documents):

Build your own. Take the wizard. Seven questions in, you have a starter kit: which skills to read, which templates to copy, which quality checks to run, and a one-line terminal command that launches Claude Code or Codex already primed with your specifics.

Use the skills in Claude Code directly. Each skill in skills/ has YAML frontmatter with trigger phrases; Claude Code auto-detects them. Install project-level (cp -r skills/corpus-from-pdfs /your/project/.claude/skills/) or user-level (cp -r skills/* ~/.claude/skills/). Call one explicitly with /corpus-from-pdfs I have 75 Korean newspaper editorials....
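On systems where `cp -r` is not available, the project-level install above could be approximated with Python's standard library. This is a sketch only: `install_skill` is a hypothetical helper, and the demo uses throwaway directories and a stand-in file rather than the real skill folders.

```python
import shutil
import tempfile
from pathlib import Path


def install_skill(skill_dir, skills_root):
    """Copy one skill folder into a skills directory, similar to `cp -r`."""
    skill_dir = Path(skill_dir)
    target = Path(skills_root) / skill_dir.name
    target.parent.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok (Python 3.8+) lets a re-install overwrite an earlier copy.
    shutil.copytree(skill_dir, target, dirs_exist_ok=True)
    return target


# Demo with throwaway paths; point these at skills/ and your project instead.
work = Path(tempfile.mkdtemp())
demo_skill = work / "skills" / "corpus-from-pdfs"
demo_skill.mkdir(parents=True)
(demo_skill / "SKILL.md").write_text("stand-in skill file\n", encoding="utf-8")

installed = install_skill(demo_skill, work / "project" / ".claude" / "skills")
print("installed to", installed)
```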


What's inside

Skills (skills/) — the decisions: corpus-from-pdfs · source-file-extraction · corpus-metadata-design · api-ocr-runner · hf-transformers-ocr · alice-vllm-deploy

Scripts (scripts/) — the mechanics (each supports --help): inventory_builder.py · cost_estimator.py · vllm_health_check.sh · alice_deploy.sh · corpus_assembler.py

Templates (templates/) — fill-in starting points: run_ocr.slurm.template · manifest.csv.example · prompts.py.template

Scenarios (examples/) — narrated walkthroughs that the wizard's cold-entry links point to.

Embed snippets (embed/) — self-contained HTML + CSS blocks for linking to the wizard from another site (mini-wizard form + faux-terminal clickable card). See embed/README.md.
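Of the scripts listed above, cost_estimator.py deals with the kind of back-of-envelope arithmetic that is worth doing before sending pages to a cloud API. The sketch below shows one plausible form of that calculation. It is not the script's actual logic, and every number in it (tokens per page, prices per million tokens) is a made-up placeholder rather than a real provider rate.

```python
def estimate_cost(pages, input_tokens_per_page, output_tokens_per_page,
                  input_price_per_mtok, output_price_per_mtok):
    """Estimate API cost from page count, per-page tokens, and per-million-token prices."""
    input_cost = pages * input_tokens_per_page / 1_000_000 * input_price_per_mtok
    output_cost = pages * output_tokens_per_page / 1_000_000 * output_price_per_mtok
    return round(input_cost + output_cost, 2)


# Placeholder figures for illustration only; check current provider pricing.
estimate = estimate_cost(
    pages=300,
    input_tokens_per_page=1500,
    output_tokens_per_page=800,
    input_price_per_mtok=3.0,
    output_price_per_mtok=15.0,
)
print(f"Estimated cost: ${estimate:.2f}")
```

Output tokens are priced separately because transcribed text is returned as model output, which providers typically bill at a different rate than input.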


New to Claude Code or Codex?

The starter kit will tell you which skills to read and which commands to run, but the surrounding workflow (structuring the project, writing a CLAUDE.md or AGENTS.md, managing what the agent knows and remembers) is its own skill. Rather than duplicate that material here, this section points to the best sources to start with.

Official docs

Practitioner voices

Codex-specific practitioner writing is thin in early 2026 — most named voices focus on Claude Code. Both tools work for this repo's skills and commands.


Context

This repo is the computational-methods deep dive that pairs with the corpus-building primer on Thesis & Research Supervision. Students who need a conceptual introduction start there; students whose projects require LLM-based OCR, programmatic pipelines, HPC deployment, or non-PDF source handling continue here.

The repo is deliberately standalone so it can evolve independently — skills, templates, and the wizard mature on their own cadence; the supervision site links in rather than duplicating content.


License

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). You're free to share and adapt for any purpose, including commercial, as long as you give appropriate credit. See LICENSE for the full notice.

Use, fork, and remix encouraged — especially for teaching.


Developed by Dr. Steven Denney at Leiden University, Faculty of Humanities.
