document-intelligence-pipeline is a schema-first, cost-aware slide extraction backbone.
It converts slide decks (PNG/JPG images) into:
- Structured JSONL records
- Cropped visual assets
- Optional OCR evidence
- Schema-enforced extraction outputs
- Query-ready Parquet tables
This repository contains the generic document intelligence backbone only.
It is intentionally domain-agnostic.
This repository does NOT include:
- Industry-specific taxonomies
- Proprietary signal scoring logic
- Domain intelligence aggregation engines
- Strategic interpretation layers
- Commercial analytics modules
Those components belong in a separate domain intelligence layer.
This repository provides the structured foundation on which such layers can be built.
This pipeline enforces a strict separation:
Backbone (public) → Structured Evidence → Intelligence Layer (private/domain-specific)
The backbone performs:
- Extraction
- Structuring
- Validation
- Persistence
It does NOT perform:
- Interpretation
- Strategic scoring
- Industry reasoning
The system is notebook-first and organised into clear stages.
Notebook 01: Ingest
Purpose:
- Discover slide images
- Extract slide IDs
- Capture basic image metadata
- Persist a minimal slides.jsonl
Establishes the persistence boundary.
Output: slides.jsonl
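The ingest stage described above can be sketched roughly as follows. The field names (slide_id, source_path, bytes), the ID-from-filename convention, and the glob pattern are illustrative assumptions, not the repository's actual schema:

```python
import json
from pathlib import Path

def ingest(image_dir: str, out_path: str) -> int:
    """Discover slide images, derive IDs, and persist one JSON record per line."""
    records = []
    for path in sorted(Path(image_dir).glob("*")):
        if path.suffix.lower() not in {".png", ".jpg", ".jpeg"}:
            continue
        records.append({
            "slide_id": path.stem,         # assumed: ID derived from filename
            "source_path": str(path),
            "bytes": path.stat().st_size,  # minimal metadata only, no interpretation
        })
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")  # JSONL: one record per line
    return len(records)
```

Writing JSONL here, rather than a database, is what establishes the persistence boundary: downstream stages only ever read this file.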
Notebook 02: Routing
Purpose:
- Add coarse layout tags
- Compute visual routing metadata
- Detect and crop visual regions (charts, UI, QR, etc.)
- Persist cropped assets
- Write slides_routed.jsonl
Routing is lightweight and operational only.
It does not interpret slide meaning.
Output:
- slides_routed.jsonl
- assets/
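The crop-and-persist step might look like the sketch below, assuming Pillow for image handling and bounding boxes supplied by an upstream lightweight detector. The function name, record fields, and asset naming scheme are all illustrative:

```python
import json
from pathlib import Path
from PIL import Image

def crop_regions(image_path: str, regions, asset_dir: str):
    """Crop detected regions and persist them as standalone asset files.

    `regions` is a list of (tag, (left, top, right, bottom)) tuples,
    assumed to come from a lightweight detector in the routing stage.
    """
    Path(asset_dir).mkdir(parents=True, exist_ok=True)
    stem = Path(image_path).stem
    asset_records = []
    with Image.open(image_path) as img:
        for i, (tag, box) in enumerate(regions):
            crop = img.crop(box)
            out = Path(asset_dir) / f"{stem}_{tag}_{i}.png"
            crop.save(out)
            asset_records.append({"slide_id": stem, "tag": tag,
                                  "box": list(box), "asset_path": str(out)})
    return asset_records
```

Note the routing stage records only operational facts (tag, box, path); nothing here decides what the chart or UI region means.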
Notebook 03: Extraction
Purpose:
- Optional OCR on cropped assets (pytesseract)
- Deterministic signal gating (open-source only)
- Budget-aware LLM escalation
- Schema-enforced structured extraction
- Validation
- Persist slides_extracted.jsonl
Signal gating happens here — not in Notebook 02.
Only slides meeting explicit criteria escalate to paid LLM extraction.
Output:
- slides_extracted.jsonl
- tables/*.parquet
This repository deliberately separates the backbone from the intelligence layer.
Included here (backbone):
- Slide ingest
- Routing
- Cropping
- OCR evidence
- Structured extraction
- JSONL persistence
Excluded here (intelligence layer):
- Industry taxonomies
- Signal lexicons
- Embedding similarity models
- Opportunity scoring
- Cross-slide aggregation
- Strategic synthesis
The intelligence layer consumes: slides_extracted.jsonl
It does not modify the backbone.
This pipeline is designed to allow domain-specific logic to plug in before expensive LLM calls.
You have two integration points:
Integration point 1: inside Notebook 03, before LLM extraction.
You can add:
- Keyword lexicons
- Regex-based detection
- spaCy matchers
- Sentence-transformer similarity
- Local embedding scoring
- Any open-source model
This produces:
- domain_signal_score
- domain_signal_flags
- llm_escalation = true | false
Only slides with llm_escalation = true are sent to the paid LLM.
This keeps costs controlled and precision high.
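A minimal version of such a gate might look like the sketch below. The lexicon, weights, threshold, and output field shapes are placeholders for whatever deterministic domain logic you plug in; only the three field names come from the list above:

```python
import re

def gate_slide(ocr_text: str, lexicon: dict, threshold: float = 1.0) -> dict:
    """Score OCR text against a weighted regex lexicon and decide on escalation.

    `lexicon` maps regex patterns to weights. This is a deterministic,
    open-source-only stand-in for richer spaCy or embedding scorers.
    """
    score = 0.0
    flags = []
    for pattern, weight in lexicon.items():
        if re.search(pattern, ocr_text, re.IGNORECASE):
            score += weight
            flags.append(pattern)
    return {
        "domain_signal_score": score,
        "domain_signal_flags": flags,
        "llm_escalation": score >= threshold,  # only these slides cost money
    }
```

Because the gate is pure regex, it runs in microseconds per slide and its decisions are fully reproducible and auditable.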
Integration point 2: after slides_extracted.jsonl is written.
Create a new notebook outside this repo (or kept inside but ignored via .gitignore) that:
- Loads structured JSON
- Applies your domain taxonomy
- Aggregates across slides
- Scores opportunity density
- Produces reports
This keeps proprietary intelligence separate from extraction infrastructure.
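A toy version of that downstream consumer, using only the standard library, could look like this. The taxonomy, the ocr_text field name, and the aggregation shape are all hypothetical; a real layer would use proper matchers, but the read-only loop over the JSONL is the essential pattern:

```python
import json
from collections import Counter

def aggregate_signals(jsonl_path: str, taxonomy: dict) -> dict:
    """Count taxonomy-category hits across slides without touching the backbone.

    `taxonomy` maps category names to keyword lists, e.g.
    {"metrics": ["arr", "churn"]}.
    """
    counts = Counter()
    total = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            total += 1
            text = (rec.get("ocr_text") or "").lower()  # assumed field name
            for category, keywords in taxonomy.items():
                if any(kw in text for kw in keywords):
                    counts[category] += 1
    return {"slides": total, "category_hits": dict(counts)}
```

The consumer opens the JSONL read-only, which is exactly the contract the backbone promises: the intelligence layer never modifies extraction outputs.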
Most document pipelines:
- Send everything to an LLM
- Mix extraction with interpretation
- Burn API budget quickly
- Lack auditability
This pipeline enforces:
- No guessing
- Schema-first outputs
- Deterministic behaviour
- Budget-aware escalation
- JSONL as source-of-truth
- Clean public/private separation
document-intelligence-pipeline/
├── notebooks/
│ ├── Notebook01_Ingest.ipynb
│ ├── Notebook02_Routing.ipynb
│ └── Notebook03_Extraction.ipynb
├── src/
├── data/
├── config.yaml
├── config.local.yaml (ignored)
├── requirements.txt
└── README.md
Python 3.11 is recommended.
pip install -r requirements.txt
If using OCR, also install the Tesseract engine for your operating system (pytesseract is only a Python wrapper around it).
- Place slide images in: data/Images_ToRead/
- Configure: config.yaml
- Run notebooks in order:
- Notebook 01
- Notebook 02
- Notebook 03
- Inspect:
- slides_extracted.jsonl
- tables/
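The config.yaml step in the quickstart might contain something along these lines. Every key below is hypothetical, shown only to suggest the kind of knobs a budget-aware pipeline exposes; consult the repository's actual config.yaml for the real schema:

```yaml
# Hypothetical keys -- the real schema lives in config.yaml.
paths:
  images: data/Images_ToRead/
  output: data/
ocr:
  enabled: true
llm:
  max_budget_usd: 5.00       # hard ceiling on paid extraction calls
  escalation_threshold: 1.0  # minimum domain_signal_score to escalate
```

Secrets and machine-specific overrides would go in config.local.yaml, which the tree above shows as gitignored.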
Completed:
- Ingest pipeline
- Routing & asset detection
- Cropped asset persistence
- JSONL persistence boundary
- Config-driven behaviour
In Progress:
- Signal gating framework
- Cost instrumentation
- Escalation controls
- Validation hardening
Planned:
- Embedding-based triage
- Evaluation harness
- CLI runner
- Schema versioning
This repository is under active development.
As additional notebooks and architectural layers are introduced, this README will be updated to reflect:
- New pipeline stages
- Expanded schema definitions
- Cost-control refinements
- Signal gating improvements
- Validation enhancements
- CLI or production hardening
The goal is for this README to remain the authoritative public reference for the backbone layer.
Future notebooks (e.g., advanced validation, cost instrumentation, embedding-based triage, CLI runners) will be documented here as they become stable.
- Structural changes will be reflected in this README.
- Major architectural changes will be noted in commit history.
- The separation between backbone and domain-specific intelligence will remain enforced.
The backbone will continue to evolve.
The intelligence layer remains domain-specific and is intentionally not part of this repository.
Users are encouraged to fork and extend the pipeline for their own domain-specific intelligence layers.
Design principles:
- Deterministic execution
- No inference during extraction
- Null when uncertain
- Structured outputs enforced via schema
- Config-driven runtime behaviour
- Budget-protected LLM usage
- Strict separation of extraction and intelligence
Who this is for:
- Engineers, analysts, and scientists building document intelligence systems
- Teams requiring structured slide extraction
- Researchers analysing presentation decks
- Domain experts adding industry-specific intelligence
- Startups building vertical AI layers
License: AGPL-3.0
Malixor Zero