AI-first, modular ingestion lab for turning scanned books, PDFs, and images into faithful semantic HTML and other structured artifacts with full traceability.
- Produce the trustworthy semantic HTML baseline for Onward so an external website project can publish it.
- Mature the reusable structural-website runtime and graduate it into `doc-web`, which Dossier will consume through a stable, versioned boundary.
- Continue upstream ingestion R&D for future books and documents until those capabilities are stable enough to move into `doc-web` and then into Dossier.
Codex-forge intentionally stops at semantic HTML and structural-website R&D. doc-web owns the reusable runtime boundary. Presentation-layer website generation remains outside this repo's scope.
- Runbook & Operations Guide: START HERE for running the pipeline, resuming runs, and troubleshooting.
- Agent Guide: Guidelines for AI agents and developers contributing to the codebase.
- Benchmarks: Systematic model evaluation using `promptfoo`.
The pipeline follows a 5-stage model:
- Intake → IR (generic): PDF/images → structured elements (Unstructured library provides rich IR with text, types, coordinates, tables)
- Verify IR (generic): QA checks on completeness, page coverage, element quality
- Portionize (domain-specific): Identify logical portions (CYOA sections, genealogy chapters, textbook problems) and reference IR elements
- Augment (domain-specific): Enrich portions with domain data (choices/combat for CYOA, relationships for genealogy)
- Export (format-specific): Output to target format (FF Engine JSON, HTML, Markdown) using IR + augmentations
Steps 1-2 are universal across all document types. Steps 3-4 vary by domain (gamebooks vs genealogies vs textbooks). Step 5 is tied to output requirements.
Reusability goal: Keep upstream intake/OCR modules as generic as possible. Prefer pushing booktype-specific heuristics/normalization downstream into booktype-aware modules.
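The stage model above can be sketched in code. This is an illustrative sketch only: the `Element` dataclass, `verify_ir`, and `portionize` names are assumptions, not the real `modules/` API, and the real stages are wired through recipe YAML rather than hard-coded calls.

```python
"""Illustrative sketch of stages 1-3 of the pipeline (hypothetical names)."""
from dataclasses import dataclass


@dataclass
class Element:
    """One IR element: text plus layout metadata, as produced by intake."""
    text: str
    etype: str              # e.g. "Title", "NarrativeText", "Table"
    page: int
    coords: tuple = ()      # bounding-box coordinates, if available


def verify_ir(elements):
    """Stage 2 (generic QA): non-empty IR with contiguous page coverage."""
    assert elements, "empty IR"
    pages = {e.page for e in elements}
    assert pages == set(range(min(pages), max(pages) + 1)), "page gap"
    return elements


def portionize(elements):
    """Stage 3 (domain-specific): group elements into logical portions.

    As a CYOA-style stand-in, a new portion starts at each Title element.
    """
    portions, current = [], []
    for e in elements:
        if e.etype == "Title" and current:
            portions.append(current)
            current = []
        current.append(e)
    if current:
        portions.append(current)
    return portions
```

Stages 4-5 would then enrich each portion with domain data and render it via an exporter, keeping the generic IR untouched.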
- `driver.py`: Main orchestration script.
- `modules/`: Pipeline stages (`extract`, `transform`, `adapter`, etc.).
- `configs/recipes/`: YAML files defining pipeline stages (logic).
- `configs/presets/`: YAML files defining model/cost settings (parameters).
- `output/runs/`: All pipeline artifacts and logs.
- `output/run_*.jsonl`: Shared run registries for manifest, health, and AI review assessments.
- `docs/`: Documentation and story tracking.
- `benchmarks/`: Model evaluation workspace.
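To make the recipe/preset split concrete, here is a hypothetical recipe file shape. The keys and values below are invented for illustration; the actual schema is whatever `driver.py` expects under `configs/recipes/`.

```yaml
# Hypothetical shape of a file under configs/recipes/ (assumed schema).
stages:
  - name: intake
    module: extract          # stage 1: PDF/images -> IR
  - name: verify_ir
    module: transform        # stage 2: generic QA
  - name: portionize
    module: transform
    booktype: cyoa           # stage 3: domain-specific
  - name: export
    module: adapter
    format: html             # stage 5: format-specific output
```

Model and cost settings stay out of the recipe and live in `configs/presets/`, so the same stage logic can run under different model budgets.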
- Canonical (GPT-5.1 OCR): Runs on any architecture (x86_64 or ARM64).
pip install --no-cache-dir -r requirements.txt
Set the following environment variables:
- `OPENAI_API_KEY`: For GPT-4/5 models.
- `GEMINI_API_KEY`: For Google Gemini models.
- `ANTHROPIC_API_KEY`: For Claude models (benchmarking/judging).
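A small preflight check can fail fast before a long run starts if a key is unset. This helper is a sketch, not part of the real driver; `check_api_keys` is a hypothetical name.

```python
"""Hypothetical preflight check for required API keys."""
import os


def check_api_keys(required=("OPENAI_API_KEY",)):
    """Return the names of any required keys that are unset or blank."""
    return [name for name in required if not os.environ.get(name, "").strip()]


if __name__ == "__main__":
    missing = check_api_keys(("OPENAI_API_KEY", "GEMINI_API_KEY"))
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```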
If using the deprecated legacy pipeline with local OCR:
- ARM64 (Apple Silicon): Recommended for the `hi_res` strategy using `jax-metal`.
- x86_64 (Rosetta): Use for `ocr_only` compatibility.
- See `docs/legacy/environment_setup.md` (if created) or check git history for detailed setup of legacy environments.
python -m unittest discover -s tests -p "driver_*test.py"

View pipeline progress and artifacts visually:
python -m http.server 8000
# Open http://localhost:8000/docs/pipeline-visibility.html