AI-first, modular ingestion lab for turning scanned books, PDFs, and images into faithful semantic HTML and other structured artifacts with full traceability.
- Produce the trustworthy semantic HTML baseline for Onward so an external website project can publish it.
- Mature the reusable structural-website runtime and graduate it into `doc-web`, which Dossier will consume through a stable, versioned boundary.
- Continue upstream ingestion R&D for future books and documents until those capabilities are stable enough to move into `doc-web` and then into Dossier.
Codex-forge intentionally stops at semantic HTML and structural-website R&D. doc-web owns the reusable runtime boundary. Presentation-layer website generation remains outside this repo's scope.
- Runbook & Operations Guide: START HERE for running the pipeline, resuming runs, and troubleshooting.
- Agent Guide: Guidelines for AI agents and developers contributing to the codebase.
- Benchmarks: Systematic model evaluation using `promptfoo`.
The pipeline follows a 5-stage model:
- Intake → IR (generic): PDF/images → structured elements (Unstructured library provides rich IR with text, types, coordinates, tables)
- Verify IR (generic): QA checks on completeness, page coverage, element quality
- Portionize (domain-specific): Identify logical portions (CYOA sections, genealogy chapters, textbook problems) and reference IR elements
- Augment (domain-specific): Enrich portions with domain data (choices/combat for CYOA, relationships for genealogy)
- Export (format-specific): Output to target format (FF Engine JSON, HTML, Markdown) using IR + augmentations
Steps 1-2 are universal across all document types. Steps 3-4 vary by domain (gamebooks vs genealogies vs textbooks). Step 5 is tied to output requirements.
Reusability goal: Keep upstream intake/OCR modules as generic as possible. Prefer pushing booktype-specific heuristics/normalization downstream into booktype-aware modules.
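The stage model above can be sketched in code. This is an illustrative sketch only: the `Element` dataclass, `verify_ir`, and `portionize` names are assumptions, not the real `modules/` API, and the real stages are wired through recipe YAML rather than hard-coded calls.

```python
"""Illustrative sketch of stages 1-3 of the pipeline (hypothetical names)."""
from dataclasses import dataclass


@dataclass
class Element:
    """One IR element: text plus layout metadata, as produced by intake."""
    text: str
    etype: str              # e.g. "Title", "NarrativeText", "Table"
    page: int
    coords: tuple = ()      # bounding-box coordinates, if available


def verify_ir(elements):
    """Stage 2 (generic QA): non-empty IR with contiguous page coverage."""
    assert elements, "empty IR"
    pages = {e.page for e in elements}
    assert pages == set(range(min(pages), max(pages) + 1)), "page gap"
    return elements


def portionize(elements):
    """Stage 3 (domain-specific): group elements into logical portions.

    As a CYOA-style stand-in, a new portion starts at each Title element.
    """
    portions, current = [], []
    for e in elements:
        if e.etype == "Title" and current:
            portions.append(current)
            current = []
        current.append(e)
    if current:
        portions.append(current)
    return portions
```

Stages 4-5 would then enrich each portion with domain data and render it via an exporter, keeping the generic IR untouched.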
- `driver.py`: Main orchestration script.
- `modules/`: Pipeline stages (`extract`, `transform`, `adapter`, etc.).
- `configs/recipes/`: YAML files defining pipeline stages (logic).
- `configs/presets/`: YAML files defining model/cost settings (parameters).
- `output/runs/`: All pipeline artifacts and logs.
- `output/run_*.jsonl`: Shared run registries for manifest, health, and AI review assessments.
- `docs/`: Documentation and story tracking.
- `benchmarks/`: Model evaluation workspace.
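To make the recipe/preset split concrete, here is a hypothetical recipe file shape. The keys and values below are invented for illustration; the actual schema is whatever `driver.py` expects under `configs/recipes/`.

```yaml
# Hypothetical shape of a file under configs/recipes/ (assumed schema).
stages:
  - name: intake
    module: extract          # stage 1: PDF/images -> IR
  - name: verify_ir
    module: transform        # stage 2: generic QA
  - name: portionize
    module: transform
    booktype: cyoa           # stage 3: domain-specific
  - name: export
    module: adapter
    format: html             # stage 5: format-specific output
```

Model and cost settings stay out of the recipe and live in `configs/presets/`, so the same stage logic can run under different model budgets.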
- Canonical (GPT-5.1 OCR): Runs on any architecture (x86_64 or ARM64).
pip install --no-cache-dir -r requirements.txt
Set the following environment variables:
- `OPENAI_API_KEY`: For GPT-4/5 models.
- `GEMINI_API_KEY`: For Google Gemini models.
- `ANTHROPIC_API_KEY`: For Claude models (benchmarking/judging).
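A small preflight check can fail fast before a long run starts if a key is unset. This helper is a sketch, not part of the real driver; `check_api_keys` is a hypothetical name.

```python
"""Hypothetical preflight check for required API keys."""
import os


def check_api_keys(required=("OPENAI_API_KEY",)):
    """Return the names of any required keys that are unset or blank."""
    return [name for name in required if not os.environ.get(name, "").strip()]


if __name__ == "__main__":
    missing = check_api_keys(("OPENAI_API_KEY", "GEMINI_API_KEY"))
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```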
If using the deprecated legacy pipeline with local OCR:
- ARM64 (Apple Silicon): Recommended for the `hi_res` strategy using `jax-metal`.
- x86_64 (Rosetta): Use for `ocr_only` compatibility.
- See `docs/legacy/environment_setup.md` (if created) or check git history for detailed setup of legacy environments.
python -m unittest discover -s tests -p "driver_*test.py"

View pipeline progress and artifacts visually:
python -m http.server 8000
# Open http://localhost:8000/docs/pipeline-visibility.html