hadro/directory-pipeline

Give it a URL or IIIF manifest from the Library of Congress, the Internet Archive, NYPL Digital Collections, or any public IIIF endpoint — it returns a structured CSV of every entry in that digitized historical directory, extracting fields such as name, address, city, state, and category.

Every row in the output CSV carries a canvas_fragment column: a IIIF URI that links directly back to the source scan. With the precision upgrade (--surya-ocr --align-ocr), the fragment includes a #xywh= bounding box pinpointing the exact line on the page. The auto-generated data explorer and Clover IIIF viewer use IIIF Content State to turn that coordinate into a live deep link — click any row and the viewer opens at exactly that entry, highlighted in the original document.
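For downstream scripts, a canvas_fragment value splits cleanly into its canvas URI and optional pixel box. A minimal parser sketch (the example URI below is a placeholder, not a real canvas):

```python
import re

def parse_canvas_fragment(fragment: str):
    """Split a canvas_fragment value into (canvas URI, optional #xywh= box)."""
    canvas, _, frag = fragment.partition("#")
    m = re.match(r"xywh=(\d+),(\d+),(\d+),(\d+)", frag)
    box = tuple(int(v) for v in m.groups()) if m else None
    return canvas, box

canvas, box = parse_canvas_fragment(
    "https://example.org/iiif/canvas/p12#xywh=310,1184,620,38"
)
# canvas -> "https://example.org/iiif/canvas/p12"; box -> (310, 1184, 620, 38)
```

Rows produced without the precision upgrade simply return `None` for the box.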

No manual transcription or ground truth required to get started. No custom code per collection needed.


Quick start

The first three commands:

  • pull down the images
  • select a few representative pages
  • generate OCR and entity-recognition prompts for data extraction
# One-time calibration for a new collection type:
python main.py https://archive.org/details/ldpd_11290437_000/ --download
python main.py https://archive.org/details/ldpd_11290437_000/ --select-pages
python main.py https://archive.org/details/ldpd_11290437_000/ --generate-prompts

# Automated run — produces a structured entries CSV:
python main.py https://archive.org/details/ldpd_11290437_000/ --to-csv

For any additional volume in the same series, reuse an earlier generated prompt — no re-calibration needed:

python main.py https://archive.org/details/ldpd_11290437_001/ --to-csv \
  --ner-prompt output/ldpd_11290437_000/ner_prompt.md

--ner-prompt points to the prompt generated for the first volume. If you forget it, extract_entries.py will warn you and suggest nearby candidates automatically.

Requires GEMINI_API_KEY. See Installation.

These steps are safe to re-run: each detects existing output and skips the work unless explicitly told to redo it with --force.


How it works

The core automated path:

--download → --gemini-ocr → --extract-entries

--to-csv is shorthand for this chain.

Two interactive calibration steps run once per collection type/similar volumes:

| Step | What it does | Output |
| --- | --- | --- |
| --select-pages | Browser UI — pick 4–10 representative pages | selection.txt |
| --generate-prompts | Gemini analyzes sample pages and writes tailored prompts | ocr_prompt.md, ner_prompt.md |

Calibrate once, run many. --select-pages and --generate-prompts prompt the model with the vocabulary of a specific document: field names, abbreviations, column structure, city/state heading conventions. Run them once for a new series. Generated prompts are saved to output/{slug}/. For additional volumes in the same series, pass --ner-prompt output/{first-slug}/ner_prompt.md to reuse it — no re-calibration needed. If you forget, extract_entries.py warns you and lists any nearby candidate prompts it finds.

What each automated step produces:

| Step | Output |
| --- | --- |
| --download | JPEG images + manifest.json (IIIF canvas URIs for linking) |
| --gemini-ocr | One .txt file per page |
| --extract-entries | entries_{model}.csv + per-page *_{model}_entries.json sidecars |

The output CSV includes a canvas_fragment column: a IIIF URI pointing back to the exact canvas (and, with the precision upgrade, the exact line) for every row. This is the foundation for the full source-linking chain:

CSV row  →  explorer thumbnail  →  Clover viewer  →  highlighted entry in original scan

With --surya-ocr --align-ocr, the fragment gains a #xywh= bounding box. The explorer and the deployed Clover viewer encode that coordinate as a IIIF Content State parameter, so the "View in source document" link opens the viewer scrolled and zoomed directly to that entry — no manual page-hunting required.
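The Content State mechanics are simple to reproduce: per the IIIF Content State API 1.0, the annotation JSON is serialized, base64url-encoded with padding stripped, and passed to the viewer as the iiif-content query parameter. A sketch with placeholder URIs (the exact state shape the explorer emits may differ):

```python
import base64, json

def encode_content_state(state: dict) -> str:
    """UTF-8 JSON -> base64url with padding stripped (Content State API 1.0)."""
    raw = json.dumps(state, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")

# Hypothetical target: one entry's #xywh= region on a canvas.
state = {
    "id": "https://example.org/iiif/canvas/p12#xywh=310,1184,620,38",
    "type": "Canvas",
    "partOf": [{"id": "https://example.org/iiif/manifest.json",
                "type": "Manifest"}],
}
link = f"https://example.org/viewer?iiif-content={encode_content_state(state)}"
```

A viewer with Content State support decodes the parameter and opens zoomed to that region.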


Going further

These stages extend the core CSV output but are not required.

Precision upgrade — adds spatial bounding boxes to canvas_fragment:

python main.py URL --surya-ocr --align-ocr        # adds #xywh= coordinates to every row
python main.py URL --review-alignment              # interactive correction of unmatched lines

Geocoding and mapping — resolves addresses to lat/lon, builds an interactive map:

python main.py URL --geocode --map

IIIF annotation export — W3C/IIIF Annotation Pages for all entries:

python pipeline/iiif/export_entry_boxes.py output/{slug}/{item_id}/
python pipeline/iiif/export_annotations.py output/{slug}/{item_id}/

GitHub Pages viewer — Clover IIIF viewer with annotations and data explorer:

./scripts/make-git-repo.sh output/{slug}/{item_id}/ ~/github/my-repo https://username.github.io/my-repo

Copies manifest.json, annotation files, and explorer.html to the destination folder, and generates a Clover-based index.html viewer with full IIIF Content State support (deep-linking directly to highlighted entries). Requires clover.umd.patched.js in scripts/ (included).

--full-run is the maximal shorthand: --download --surya-ocr --gemini-ocr --align-ocr --review-alignment --extract-entries --geocode --map, with --batch-size and --workers defaulted to 8.


Screenshots

Page selection (--select-pages)

Page selector browser UI showing the Sample and Scope tabs

Two-tab browser UI. The Sample tab picks 4–10 representative pages for prompt calibration. The Scope tab (all pages selected by default) lets you deselect frontmatter, ads, and almanac sections so they're skipped entirely during OCR and extraction.


Field-value explorer (--to-csv --explore)

Interactive field-value explorer with density chart, bar charts, and results table

Auto-generated self-contained HTML explorer. Categorical bar charts show distribution; filtering is done via sidebar facet checkboxes and per-column filter inputs. Clicking a row opens a detail panel with all fields, a IIIF thumbnail of the source page, and a "View in source document" link — a IIIF Content State deep link that opens the Clover viewer scrolled directly to that entry in the original scan.


Alignment visualization (--align-ocr --visualize)

NW alignment output drawn on a source page scan — green bounding boxes on matched lines

Needleman-Wunsch alignment result drawn on the source image. Green boxes are word-confidence matches between Surya OCR and Gemini text. Unmatched Gemini lines (no bounding box found) are listed in the margin in red.


Interactive alignment review (--review-alignment)

Flask alignment review UI with page sidebar, canvas, and proposed match panel

Flask-based review UI for fixing pages where automatic alignment left unmatched entries. Draw bounding boxes on the canvas, re-run Surya on the crop, then accept proposed Surya → Gemini pairs. Accepted matches are written back to the aligned JSON with "confidence": "manual".


Geocoded map (--geocode --map)

Leaflet map with clustered markers, category sidebar, and a popup with IIIF page thumbnail

Self-contained Leaflet HTML map. Markers are clustered and color-coded by establishment category. The sidebar has live search, state filter, and category checkboxes. Popups include a IIIF page thumbnail fetched directly from the source institution's image server.


All pipeline stages

Stages always run in the fixed order below, regardless of flag order on the command line. All stages are optional — run only what you need.

Core stages (URL → CSV):

| Stage | Script | Output |
| --- | --- | --- |
| --download | pipeline/download_images.py | output/{slug}/ |
| --select-pages | pipeline/select_pages.py | selection.txt, included_pages.txt (interactive, once per volume) |
| --generate-prompts | pipeline/generate_prompt.py | ocr_prompt.md, ner_prompt.md (once per collection type) |
| --gemini-ocr | pipeline/run_gemini_ocr.py | *_{model}.txt |
| --extract-entries | pipeline/extract_entries.py | entries_{model}.csv, *_{model}_entries.json |

Precision upgrade (adds #xywh= bounding boxes to canvas_fragment):

| Stage | Script | Output |
| --- | --- | --- |
| --surya-ocr | pipeline/run_surya_ocr.py | *_surya.json, *_surya.txt |
| --align-ocr | pipeline/align_ocr.py | *_{model}_aligned.json |
| --review-alignment | pipeline/review_alignment.py | updated *_{model}_aligned.json (interactive) |

Extensions:

| Stage | Script | Output |
| --- | --- | --- |
| --nypl-csv | sources/nypl_collection_csv.py | output/{slug}/{slug}.csv |
| --loc-csv | sources/loc_collection_csv.py | output/{slug}/{slug}.csv |
| --ia-csv | sources/ia_collection_csv.py | output/{slug}/{slug}.csv |
| --iiif-csv | sources/iiif_manifest_csv.py | output/{slug}/{slug}.csv — any public IIIF v2/v3 manifest or collection |
| --chandra-ocr | pipeline/run_chandra_ocr.py | *_chandra-ocr-2.txt — local 5B model, no API key |
| --compare-ocr | analysis/compare_ocr.py | *_comparison.html, ocr_comparison_stats.csv |
| --detect-spreads | pipeline/detect_spreads.py | spreads_report.csv |
| --split-spreads | pipeline/split_spreads.py | *_left.jpg, *_right.jpg |
| --surya-detect | pipeline/surya_detect.py | columns_report.csv |
| --detect-columns | pipeline/detect_columns.py | columns_report.csv (legacy) |
| --tesseract | old/run_ocr.py | *_tesseract.hocr, *_tesseract.txt (legacy) |
| --visualize | analysis/visualize_alignment.py | *_{model}_viz.jpg |
| --explore | pipeline/explore_entries.py | entries_{model}_explorer.html |
| --geocode | pipeline/geo/geocode_entries.py | entries_{model}_geocoded.csv |
| --map | pipeline/geo/map_entries.py | entries_{model}.html |
| (standalone) | pipeline/iiif/export_annotations.py | *_{model}_annotations.json, *_{model}_entry_annotations.json |
| (standalone) | pipeline/iiif/export_entry_boxes.py | *_{model}_box_annotations.json |
| (standalone) | pipeline/iiif/build_ranges.py | ranges_{model}.json (directory collections) |
| (standalone) | scripts/make-git-repo.sh | GitHub Pages deployable folder |

See docs/pipeline-stages.md for detailed documentation on each stage.


Directory layout

directory-pipeline/
├── main.py                           # Pipeline orchestrator
│
├── pipeline/                         # Active pipeline stage scripts
│   ├── download_images.py            # Download images from IIIF manifests
│   ├── detect_spreads.py             # Spread detection
│   ├── split_spreads.py              # Spread splitting
│   ├── select_pages.py               # Interactive browser UI for picking sample pages
│   ├── generate_prompt.py            # Gemini-generated volume-specific OCR + NER prompts
│   ├── surya_detect.py               # Surya neural column detection (preferred)
│   ├── detect_columns.py             # Pixel-projection column detection (legacy)
│   ├── run_surya_ocr.py              # Surya OCR — line-level bboxes (preferred)
│   ├── run_gemini_ocr.py             # Gemini OCR
│   ├── run_chandra_ocr.py            # Chandra OCR — local 5B model, no API key
│   ├── align_ocr.py                  # NW alignment (Surya preferred, Tesseract fallback)
│   ├── review_alignment.py           # Interactive alignment review UI (Flask)
│   ├── extract_entries.py            # Structured entry extraction (NER)
│   ├── explore_entries.py            # Self-contained HTML data explorer
│   ├── geo/
│   │   ├── geocode_entries.py        # Entry geocoding
│   │   └── map_entries.py            # Interactive map generation (IIIF popup thumbnails + Content State links)
│   └── iiif/
│       ├── export_annotations.py     # IIIF Annotation Pages export (W3C Web Annotation)
│       ├── export_entry_boxes.py     # IIIF colored entry bounding boxes (standalone)
│       └── build_ranges.py           # IIIF table of contents from geocoded entries (standalone)
│
├── sources/                          # Collection metadata exporters
│   ├── nypl_collection_csv.py        # NYPL Digital Collections
│   ├── loc_collection_csv.py         # Library of Congress
│   ├── ia_collection_csv.py          # Internet Archive
│   └── iiif_manifest_csv.py          # Any public IIIF v2/v3 manifest or collection
│
├── analysis/                         # Dev tools (not in main pipeline)
│   ├── compare_ocr.py                # Side-by-side OCR model comparison
│   ├── visualize_alignment.py        # Draw alignment boxes on images → *_viz.jpg
│   ├── compare_extraction.py         # Compare entry extraction across models
│   └── visualize_entries.py          # Draw entry bounding boxes on images
│
├── old/                              # Legacy and superseded scripts
│   └── run_ocr.py                    # Tesseract OCR — word-level hOCR (use --surya-ocr instead)
│
├── utils/                            # Shared utilities
│   └── iiif_utils.py                 # IIIF v2/v3 manifest parsing
│
├── prompts/                          # Gemini system prompts
│   ├── ocr_prompt.md                 # Generic OCR transcription prompt (global fallback)
│   └── ner_prompt.md                 # Generic NER extraction prompt (global fallback)
│
├── docs/                             # Reference documentation
│   ├── pipeline-stages.md            # Detailed per-stage documentation
│   ├── usage-examples.md             # Full usage examples by source and stage
│   ├── key-design-decisions.md       # Technical architecture notes
│   └── prior-work.md                 # Annotated citations of related work
│
├── scripts/                          # Deployment utilities
│   └── make-git-repo.sh              # Assemble pipeline output into a GitHub Pages folder
│
├── pyproject.toml                    # Python project config and dependencies
└── output/
    └── {slug}/                       # e.g. the_negro_motorist_green_book_1947_4bea2040/
        ├── {slug}.csv                                # collection metadata CSV (from --*-csv stages)
        ├── selection.txt                             # sample page filenames (from --select-pages)
        ├── included_pages.txt                        # scope filter — pages to include in OCR/extraction
        ├── ocr_prompt.md                             # volume-specific OCR prompt (from --generate-prompts)
        ├── ner_prompt.md                             # volume-specific NER prompt (from --generate-prompts)
        └── {item_id}/                # NYPL UUID or LoC/IA identifier
            ├── manifest.json
            ├── select_pages.html                     # page-selector UI (from --select-pages)
            ├── 0001_{image_id}.jpg
            ├── 0001_{image_id}_left.jpg              # if spread-split
            ├── 0001_{image_id}_right.jpg             # if spread-split
            ├── 0001_{image_id}_split.json            # split coordinate sidecar
            ├── 0001_{image_id}_surya.json            # Surya line bboxes + text
            ├── 0001_{image_id}_surya.txt             # Surya plain text
            ├── 0001_{image_id}_chandra-ocr-2.txt     # Chandra OCR plain text
            ├── 0001_{image_id}_tesseract.hocr        # (legacy Tesseract output)
            ├── 0001_{image_id}_tesseract.txt
            ├── 0001_{image_id}_{model}.txt           # Gemini plain text
            ├── 0001_{image_id}_{model}_aligned.json            # NW alignment output
            ├── 0001_{image_id}_{model}_viz.jpg               # alignment visualization
            ├── 0001_{image_id}_{model}_comparison.html        # OCR model comparison
            ├── 0001_{image_id}_{model}_entries.json          # per-page entries
            ├── 0001_{image_id}_{model}_annotations.json      # IIIF line-level annotation page
            ├── 0001_{image_id}_{model}_entry_annotations.json # IIIF entry-level annotation page
            ├── 0001_{image_id}_{model}_box_annotations.json  # IIIF colored entry bounding boxes
            ├── spreads_report.csv
            ├── columns_report.csv
            ├── ocr_comparison_stats.csv                      # summary stats from --compare-ocr
            ├── entries_{model}.csv                   # aggregate entries for collection
            ├── entries_{model}_geocoded.csv          # entries with lat/lon
            ├── entries_{model}_explorer.html         # interactive data explorer
            ├── entries_{model}.html                  # interactive Leaflet map
            └── geocache.json                         # geocoding cache

For NYPL collections, {slug} is derived as {title_words}_{uuid8}, e.g. the_negro_motorist_green_book_1940_feb978b0. For LoC items it is derived from the item title and numeric ID. For IA items it is derived from the item title and IA identifier. For arbitrary IIIF manifests, it is derived from the manifest label and a segment of the URL. Pass --slug to override.
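As a rough illustration of the {title_words}_{uuid8} pattern (the real derivation lives in the sources/ exporter scripts and handles more edge cases):

```python
import re

def make_slug(title: str, identifier: str) -> str:
    """Sketch of slug derivation: lowercased title words + first 8 chars of the ID."""
    words = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    return f"{words}_{identifier[:8]}"

make_slug("The Negro Motorist Green Book 1940", "feb978b0-1234-...")
# -> "the_negro_motorist_green_book_1940_feb978b0"
```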


Installation

Requires Python 3.11+ and uv. Tesseract is only needed for the legacy --tesseract stage.

# Optional: install Tesseract for legacy OCR support
brew install tesseract          # macOS
apt install tesseract-ocr       # Debian/Ubuntu

# Install core dependencies (Gemini OCR, entry extraction, geocoding)
uv sync

# Add Surya OCR support (requires GPU or Apple Silicon for reasonable speed)
uv sync --extra gpu

# Add geocoding + map generation
uv sync --extra geo

# Everything
uv sync --all-extras

Dependencies are declared in pyproject.toml and locked in uv.lock. To run any pipeline command without activating the virtual environment: uv run python main.py ...

Set environment variables (or copy .env.template to .env):

export GEMINI_API_KEY=your_key_here
export NYPL_API_TOKEN=your_token_here      # from https://api.repo.nypl.org/sign_up
                                            # (not needed for LoC, IA, or generic IIIF)
export GOOGLE_MAPS_API_KEY=your_key_here   # optional; enables address-level geocoding

Estimated costs

Two cost categories: API charges (variable; applies on any platform) and platform costs (compute infrastructure).

Gemini API

--gemini-ocr and --extract-entries both call the Gemini API. Pricing as of early 2026 (verify current rates at ai.google.dev/pricing):

| Stage | Model (default) | Input | Output |
| --- | --- | --- | --- |
| --gemini-ocr | gemini-2.0-flash | $0.10 / 1M tokens | $0.40 / 1M tokens |
| --extract-entries | gemini-3.1-flash-lite-preview | $0.25 / 1M tokens | $1.50 / 1M tokens |
| fallback (dense pages) | gemini-2.5-flash | $0.30 / 1M tokens | $2.50 / 1M tokens |

A Green Book page generates roughly 2,000 input tokens and 1,000 output tokens for OCR (gemini-2.0-flash, ~$0.0006/page), and another ~10,000 input / 2,000 output tokens for NER entry extraction (gemini-3.1-flash-lite-preview, ~$0.0055/page) — about $0.006 per page combined. Dense pages that exceed the output token limit automatically retry with gemini-2.5-flash, but this affects fewer than 5% of pages in practice.
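The per-page figures follow directly from the quoted token counts and rates:

```python
# Reproducing the per-page cost arithmetic above (rates quoted per 1M tokens).
def page_cost(tokens_in, tokens_out, rate_in, rate_out):
    return (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000

ocr = page_cost(2_000, 1_000, 0.10, 0.40)    # gemini-2.0-flash
ner = page_cost(10_000, 2_000, 0.25, 1.50)   # gemini-3.1-flash-lite-preview

print(f"OCR ${ocr:.4f}/page, NER ${ner:.4f}/page, total ${ocr + ner:.4f}/page")
# OCR $0.0006/page, NER $0.0055/page, total $0.0061/page
```

Multiply by page count for the collection estimates below (e.g. ~100 pages ≈ $0.61).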

--generate-prompts (generate_prompt.py) makes 2 Gemini calls (one for the OCR prompt, one for the NER prompt) with 4–8 sample images each. This is a one-time per-volume cost of roughly $0.01–$0.05 total, negligible compared to the full run. gemini-3-flash-preview is used by default for prompt generation because it produces higher-quality meta-prompts; you can override with --prompt-model.

Rough collection estimates:

| Collection | Pages | OCR (gemini-2.0-flash) | NER (gemini-3.1-flash-lite-preview) | Total | Prompt generation |
| --- | --- | --- | --- | --- | --- |
| One Green Book volume | ~100 | ~$0.06 | ~$0.55 | ~$0.61 | ~$0.02 (one-time) |
| Full Green Books corpus (14 volumes) | ~1,400 | ~$0.84 | ~$7.70 | ~$8.54 | ~$0.02 (one-time per volume) |
| Large city directory (500+ pages) | 500 | ~$0.30 | ~$2.75 | ~$3.05 | ~$0.02 (one-time) |

Free tier: The Gemini API free tier (no billing required) covers both models at no charge, subject to rate limits of 15 requests/minute and ~1,500 requests/day for gemini-2.0-flash. A single 100-page volume (~200 API calls total) fits comfortably within a single day's free quota, though the 15 RPM cap means the API stages take ~15–20 minutes rather than a few minutes. For the full multi-volume corpus you will either need billing enabled or spread the run across several days.

Chandra OCR (--chandra-ocr) uses a local 5B model — no API key or API cost. Requires GPU; see platform costs below.

Google Maps Geocoding (optional)

The --geocode stage uses Nominatim (free, city-level accuracy) by default. Setting GOOGLE_MAPS_API_KEY enables address-level geocoding at roughly $0.005/request. Google Maps includes a $200/month free credit, which covers ~40,000 geocoding requests — more than the entire Green Books corpus.
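The credit arithmetic is easy to verify:

```python
# $200 monthly credit at ~$0.005 per address-level geocoding request.
requests_covered = 200 / 0.005
assert round(requests_covered) == 40_000
```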

Platform costs

The stages that use significant compute are Surya OCR (--surya-ocr, --surya-detect, --review-alignment) and Chandra OCR (--chandra-ocr). Gemini API stages are network-bound and run equally fast everywhere.

| Platform | Cost | Surya OCR (200 pages) | Notes |
| --- | --- | --- | --- |
| Mac (M-series, 16 GB+) | $0 (electricity) | ~5–8 min (MPS, --batch-size 4) | Good for development and single-volume runs |
| Mac (8 GB) | $0 | ~10–15 min (MPS, --batch-size 1–2) | Works; reduce batch size if OOM errors occur |
| Google Colab (free T4) | $0 | ~2–3 min (CUDA, --batch-size 8) | Sessions expire; T4 not always available at peak times; --review-alignment requires a tunnel (e.g. ngrok) |
| Google Colab Pro | ~$10/month (also pay-as-you-go) | ~1–2 min (T4/L4, --batch-size 8) | Reliable GPU access, longer sessions |
| Google Colab Pro+ | ~$50/month | <1 min (A100, --batch-size 16) | Background execution; best for large multi-volume runs |

The pipeline is designed so that the compute-heavy steps — Surya OCR and the interactive alignment review — can be run on a GPU machine while everything else (downloading, Gemini OCR, entry extraction, geocoding, map generation) runs fine on a laptop. Gemini API calls are network-bound and complete in seconds regardless of the machine; Surya is a neural vision model that is 5–20× faster on a GPU than on an Apple Silicon Mac and significantly slower or impractical on CPU-only hardware.

A ready-to-run Colab notebook covering the Surya OCR, alignment, and review steps is in colab/ocr-align-review.ipynb. Open it directly in Colab, mount your Drive, and follow the cells in order.

Chandra evaluation (analysis/chandra_eval.py) runs Qwen3-VL 7B and requires ~9 GB VRAM with --quantize. On a Colab T4 this is roughly 50 seconds per image; on an M-series Mac with MPS it is ~25 minutes per image. Chandra is practical only on Colab or a machine with a CUDA GPU.


Usage

# Minimal: download → OCR → CSV
python main.py collections.txt --to-csv

# Full pipeline: also includes Surya alignment, geocoding, and map
python main.py collections.txt --full-run

# Dry run — show commands without running anything
python main.py URL --to-csv --dry-run

See docs/usage-examples.md for full usage examples by source (LoC, IA, NYPL, IIIF manifest) and stage.


Key design decisions

  • Gemini for accuracy, Surya for coordinates. Gemini transcribes historical print far more accurately than conventional OCR; Surya provides line-level bounding boxes that anchor coordinates.
  • Anchored Needleman-Wunsch alignment. City/state headings that appear verbatim in both sources are committed as fixed anchors before the NW pass, preventing misalignment drift on long pages.
  • Schema-agnostic NER. extract_entries.py hard-codes nothing. The NER prompt defines all field names; CSV columns are inferred dynamically — no code changes for a new collection type.
  • IIIF-native output. Every aligned line and entry carries a canvas_fragment (#xywh=) URI in natural image pixel coordinates, directly consumable by IIIF viewers and annotation tools.
  • Any IIIF source. --iiif-csv and --download accept any public IIIF Presentation v2 or v3 manifest URL, not just NYPL/LoC/IA. IIIF Collection manifests are enumerated automatically, writing one CSV row per child manifest.
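In miniature, the anchored scheme works like this: a plain Needleman-Wunsch over line strings, restarted between anchors. (The real align_ocr.py scores fuzzy word-level matches and carries bounding boxes; this sketch only matches exact strings.)

```python
def nw_align(a, b, match=2, mismatch=-1, gap=-1):
    """Minimal Needleman-Wunsch over two lists of line strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback into (a_line, b_line) pairs; None marks a gap.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            pairs.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None)); i -= 1
        else:
            pairs.append((None, b[j - 1])); j -= 1
    return pairs[::-1]

def anchored_align(a, b, anchors):
    """Commit lines appearing verbatim in both sources as fixed anchors,
    then run NW independently on each segment between them."""
    out, ai, bi = [], 0, 0
    for anchor in anchors:
        ia, ib = a.index(anchor, ai), b.index(anchor, bi)
        out += nw_align(a[ai:ia], b[bi:ib]) + [(anchor, anchor)]
        ai, bi = ia + 1, ib + 1
    return out + nw_align(a[ai:], b[bi:])
```

Because each segment is aligned independently, a noisy run before one heading cannot drag entries after it out of register — the drift-prevention property described above.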

See docs/key-design-decisions.md for full technical notes.


Prior work and inspirations

  • Greif et al. (2025) — foundational benchmark showing multimodal LLMs beat Tesseract + Transkribus on historical city directories (0.84% CER with Gemini 2.0 Flash). Directly motivates the two-stage OCR + NER architecture. (arXiv:2504.00414)
  • Bell et al. — directoreadr (2020) — closest prior work; end-to-end pipeline for Polk city directories using classical CV + Tesseract. Documents the brittle year-specific heuristics this pipeline replaces. (PLOS ONE)
  • Fleischhacker et al. (2025) — layout detection as preprocessing improves OCR accuracy by 15+ pp on multi-column historical docs. Motivates column reading-order correction in align_ocr.py. (Int. J. Digital Libraries)
  • Cook et al. (2020) — canonical prior Green Books digitization (entirely manual; OCR rejected). Source of the six-category establishment taxonomy used here. (NBER WP 26819)
  • Smith & Cordell (2018) — practitioner research agenda naming layout analysis as the top barrier to historical OCR and validating NW-style sequence alignment for ground truth creation. (NEH report)
  • Carlson et al. — EffOCR (2023) — OCR benchmarks on historical newspapers (Tesseract ~10.6% CER, fine-tuned TrOCR 1.3%). Establishes the noisy-input baseline for the alignment stage. (arXiv:2304.02737)
  • Wolf et al. (2020) — machine-readable NYC directory entries 1850–1890 from NYPL digitizations. Direct precedent for applying this pipeline to city directories. (NYU Faculty Digital Archive)

See docs/prior-work.md for full annotated citations.

About

A pipeline for turning digital collections into structured data -- an LLM-assisted, IIIF-native tool to jump into working with sources like digitized print directories.
