
document-intelligence-pipeline

Images → JSONL → Structured Evidence

⚠️ Work In Progress (Active Development)


What This Repository Is

document-intelligence-pipeline is a schema-first, cost-aware slide extraction backbone.

It converts slide decks (PNG/JPG images) into:

  • Structured JSONL records
  • Cropped visual assets
  • Optional OCR evidence
  • Schema-enforced extraction outputs
  • Query-ready Parquet tables

This repository contains the generic document intelligence backbone only.

It is intentionally domain-agnostic.


What This Repository Is NOT

This repository does NOT include:

  • Industry-specific taxonomies
  • Proprietary signal scoring logic
  • Domain intelligence aggregation engines
  • Strategic interpretation layers
  • Commercial analytics modules

Those components belong in a separate domain intelligence layer.

This repository provides the structured foundation on which such layers can be built.


Architectural Philosophy

This pipeline enforces a strict separation:

Backbone (public) → Structured Evidence → Intelligence Layer (private/domain-specific)

The backbone performs:

  • Extraction
  • Structuring
  • Validation
  • Persistence

It does NOT perform:

  • Interpretation
  • Strategic scoring
  • Industry reasoning

Notebook Overview

The system is notebook-first and organised into clear stages.


✅ Notebook 01 — Ingest & Index (Complete)

Purpose:

  • Discover slide images
  • Extract slide IDs
  • Capture basic image metadata
  • Persist minimal slides.jsonl

Establishes the persistence boundary.

Output: slides.jsonl
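The ingest stage can be sketched as a short discovery loop. The field names below are illustrative only; the notebook's actual slides.jsonl schema may differ:

```python
import json
from pathlib import Path

def ingest(image_dir: str, out_path: str) -> int:
    """Discover slide images and persist one minimal JSONL record per slide.

    Field names are illustrative, not the repository's actual schema.
    """
    records = []
    for img in sorted(Path(image_dir).glob("*.png")):
        records.append({
            "slide_id": img.stem,         # e.g. derived from the filename
            "source_path": str(img),
            "bytes": img.stat().st_size,  # basic image metadata
        })
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return len(records)
```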


✅ Notebook 02 — Routing & Asset Detection (Complete)

Purpose:

  • Add coarse layout tags
  • Compute visual routing metadata
  • Detect and crop visual regions (charts, UI, QR, etc.)
  • Persist cropped assets
  • Write slides_routed.jsonl

Routing is lightweight and operational only.
It does not interpret slide meaning.

Output:

  • slides_routed.jsonl
  • assets/
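Conceptually, routing only decorates an ingest record with operational metadata. A hypothetical sketch (tag and field names are ours, not the pipeline's schema):

```python
def route_slide(record: dict, crops: list[tuple[int, int, int, int]]) -> dict:
    """Attach coarse routing metadata to an ingested slide record.

    Purely operational: layout tags and crop boxes, no interpretation
    of slide meaning. Field names are illustrative.
    """
    routed = dict(record)
    routed["layout_tags"] = ["has_visual"] if crops else ["text_only"]
    routed["assets"] = [
        {"asset_id": f"{record['slide_id']}_crop{i}", "bbox": list(box)}
        for i, box in enumerate(crops)
    ]
    return routed
```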

🟡 Notebook 03 — Signal Gating + Schema-First Extraction (In Progress)

Purpose:

  • Optional OCR on cropped assets (pytesseract)
  • Deterministic signal gating (open-source only)
  • Budget-aware LLM escalation
  • Schema-enforced structured extraction
  • Validation
  • Persist slides_extracted.jsonl

Signal gating happens here — not in Notebook 02.

Only slides meeting explicit criteria escalate to paid LLM extraction.

Output:

  • slides_extracted.jsonl
  • tables/*.parquet
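Schema-enforced extraction means every record is coerced to a fixed shape before persistence, with null for anything uncertain. A minimal stdlib sketch of that validation step (the notebook may use a schema library instead; the field set here is illustrative):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class SlideExtraction:
    """Illustrative schema; the real field set lives in Notebook 03."""
    slide_id: str
    title: Optional[str] = None      # null when uncertain, never guessed
    body_text: Optional[str] = None
    ocr_text: Optional[str] = None

def validate(raw: dict) -> dict:
    """Drop unknown keys, require slide_id, and null-fill missing fields."""
    allowed = set(SlideExtraction.__dataclass_fields__)
    known = {k: v for k, v in raw.items() if k in allowed}
    return asdict(SlideExtraction(**known))
```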

Where Domain-Specific Intelligence Fits

This repository deliberately separates:

Backbone (public)

  • Slide ingest
  • Routing
  • Cropping
  • OCR evidence
  • Structured extraction
  • JSONL persistence

Intelligence Layer (domain-specific)

  • Industry taxonomies
  • Signal lexicons
  • Embedding similarity models
  • Opportunity scoring
  • Cross-slide aggregation
  • Strategic synthesis

The intelligence layer consumes: slides_extracted.jsonl

It does not modify the backbone.


How to Add Your Own Domain Intelligence

This pipeline is designed to allow domain-specific logic to plug in before expensive LLM calls.

You have two integration points:


1️⃣ Add Deterministic Signal Gating (Recommended)

Inside Notebook 03, before LLM extraction, you can add:

  • Keyword lexicons
  • Regex-based detection
  • spaCy matchers
  • Sentence-transformer similarity
  • Local embedding scoring
  • Any open-source model

This produces:

  • domain_signal_score
  • domain_signal_flags
  • llm_escalation = true | false

Only slides with llm_escalation = true are sent to the paid LLM.

This keeps costs controlled and precision high.
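A deterministic gate can be as small as a lexicon intersection. The lexicon and threshold below are placeholders; substitute your own domain terms:

```python
import re

# Placeholder lexicon and threshold; substitute your own domain terms.
LEXICON = {"revenue", "pipeline", "churn", "forecast"}
ESCALATION_THRESHOLD = 2

def gate(slide_text: str) -> dict:
    """Score a slide with deterministic, open-source-only signals."""
    tokens = set(re.findall(r"[a-z]+", slide_text.lower()))
    flags = sorted(LEXICON & tokens)
    score = len(flags)
    return {
        "domain_signal_score": score,
        "domain_signal_flags": flags,
        "llm_escalation": score >= ESCALATION_THRESHOLD,
    }
```

Because the gate is pure string matching, the same input always yields the same escalation decision, which keeps the audit trail reproducible.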


2️⃣ Add a Separate Intelligence Notebook (Private Layer)

After slides_extracted.jsonl is written, create a new notebook outside this repo (or ignored via .gitignore) that:

  • Loads structured JSON
  • Applies your domain taxonomy
  • Aggregates across slides
  • Scores opportunity density
  • Produces reports

This keeps proprietary intelligence separate from extraction infrastructure.
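Such a notebook only reads the backbone's output and never writes back to it. A minimal aggregation sketch (the domain_signal_flags field assumes the gating step described earlier):

```python
import json
from collections import Counter

def aggregate_signals(jsonl_path: str) -> Counter:
    """Count domain signal flags across all extracted slides.

    Reads the backbone's JSONL output; never modifies it.
    Field names are illustrative.
    """
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            counts.update(rec.get("domain_signal_flags", []))
    return counts
```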


Why This Design Matters

Most document pipelines:

  • Send everything to an LLM
  • Mix extraction with interpretation
  • Burn API budget quickly
  • Lack auditability

This pipeline enforces:

  • No guessing
  • Schema-first outputs
  • Deterministic behaviour
  • Budget-aware escalation
  • JSONL as source-of-truth
  • Clean public/private separation

Repository Structure

document-intelligence-pipeline/
├── notebooks/
│   ├── Notebook01_Ingest.ipynb
│   ├── Notebook02_Routing.ipynb
│   └── Notebook03_Extraction.ipynb
├── src/
├── data/
├── config.yaml
├── config.local.yaml (ignored)
├── requirements.txt
└── README.md

Installation

Python 3.11 recommended.

pip install -r requirements.txt

If using OCR, install the Tesseract engine for your operating system.


Usage

  1. Place slide images in: data/Images_ToRead/
  2. Configure: config.yaml
  3. Run notebooks in order:
    • Notebook 01
    • Notebook 02
    • Notebook 03
  4. Inspect:
    • slides_extracted.jsonl
    • tables/
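The steps above can be sanity-checked with a few lines of stdlib Python (the parquet tables additionally need pandas or pyarrow to inspect; the llm_escalation field name is illustrative):

```python
import json

def summarize(jsonl_path: str) -> dict:
    """Count total records and escalated slides in a pipeline output file."""
    total = escalated = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            total += 1
            if rec.get("llm_escalation"):
                escalated += 1
    return {"slides": total, "escalated": escalated}
```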

Project Status

Completed:

  • Ingest pipeline
  • Routing & asset detection
  • Cropped asset persistence
  • JSONL persistence boundary
  • Config-driven behaviour

In Progress:

  • Signal gating framework
  • Cost instrumentation
  • Escalation controls
  • Validation hardening

Planned:

  • Embedding-based triage
  • Evaluation harness
  • CLI runner
  • Schema versioning

Living Documentation & Roadmap

This repository is under active development.

As additional notebooks and architectural layers are introduced, this README will be updated to reflect:

  • New pipeline stages
  • Expanded schema definitions
  • Cost-control refinements
  • Signal gating improvements
  • Validation enhancements
  • CLI or production hardening

The goal is for this README to remain the authoritative public reference for the backbone layer.

Future notebooks (e.g., advanced validation, cost instrumentation, embedding-based triage, CLI runners) will be documented here as they become stable.


Versioning Approach

  • Structural changes will be reflected in this README.
  • Major architectural changes will be noted in commit history.
  • The separation between backbone and domain-specific intelligence will remain enforced.

Important

The backbone will continue to evolve.

The intelligence layer remains domain-specific and is intentionally not part of this repository.

Users are encouraged to fork and extend the pipeline for their own domain-specific intelligence layers.


Design Principles

  • Deterministic execution
  • No inference during extraction
  • Null when uncertain
  • Structured outputs enforced via schema
  • Config-driven runtime behaviour
  • Budget-protected LLM usage
  • Strict separation of extraction and intelligence

Intended Audience

  • Engineers, analysts, and scientists building document intelligence systems
  • Teams requiring structured slide extraction
  • Researchers analysing presentation decks
  • Domain experts adding industry-specific intelligence
  • Startups building vertical AI layers

License

AGPL-3.0 license


Maintainer

Malixor Zero

About

A multimodal pipeline for extracting, structuring, and analysing content from hybrid documents such as slide decks and long-form reports. Combines text extraction, image-based parsing, and LLM-driven understanding to support offline review, summarisation, and analysis.
