
document-intelligence-pipeline

Images → JSONL → Structured Evidence

⚠️ Work In Progress (Active Development)


What This Repository Is

document-intelligence-pipeline is a schema-first, cost-aware slide extraction backbone.

It converts slide decks (PNG/JPG images) into:

  • Structured JSONL records
  • Cropped visual assets
  • Optional OCR evidence
  • Schema-enforced extraction outputs
  • Query-ready Parquet tables

This repository contains the generic document intelligence backbone only.

It is intentionally domain-agnostic.


What This Repository Is NOT

This repository does NOT include:

  • Industry-specific taxonomies
  • Proprietary signal scoring logic
  • Domain intelligence aggregation engines
  • Strategic interpretation layers
  • Commercial analytics modules

Those components belong in a separate domain intelligence layer.

This repository provides the structured foundation on which such layers can be built.


Architectural Philosophy

This pipeline enforces a strict separation:

Backbone (public) → Structured Evidence → Intelligence Layer (private/domain-specific)

The backbone performs:

  • Extraction
  • Structuring
  • Validation
  • Persistence

It does NOT perform:

  • Interpretation
  • Strategic scoring
  • Industry reasoning

Notebook Overview

The system is notebook-first and organised into clear stages.


✅ Notebook 01 — Ingest & Index (Complete)

Purpose:

  • Discover slide images
  • Extract slide IDs
  • Capture basic image metadata
  • Persist minimal slides.jsonl

Establishes the persistence boundary.

Output: slides.jsonl
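The ingest stage can be sketched as a short discovery loop. The field names below are illustrative only; the notebook's actual slides.jsonl schema may differ:

```python
import json
from pathlib import Path

def ingest(image_dir: str, out_path: str) -> int:
    """Discover slide images and persist one minimal JSONL record per slide.

    Field names are illustrative, not the repository's actual schema.
    """
    records = []
    for img in sorted(Path(image_dir).glob("*.png")):
        records.append({
            "slide_id": img.stem,         # e.g. derived from the filename
            "source_path": str(img),
            "bytes": img.stat().st_size,  # basic image metadata
        })
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return len(records)
```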


✅ Notebook 02 — Routing & Asset Detection (Complete)

Purpose:

  • Add coarse layout tags
  • Compute visual routing metadata
  • Detect and crop visual regions (charts, UI, QR, etc.)
  • Persist cropped assets
  • Write slides_routed.jsonl

Routing is lightweight and operational only.
It does not interpret slide meaning.

Output:

  • slides_routed.jsonl
  • assets/
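Conceptually, routing only decorates an ingest record with operational metadata. A hypothetical sketch (tag and field names are ours, not the pipeline's schema):

```python
def route_slide(record: dict, crops: list[tuple[int, int, int, int]]) -> dict:
    """Attach coarse routing metadata to an ingested slide record.

    Purely operational: layout tags and crop boxes, no interpretation
    of slide meaning. Field names are illustrative.
    """
    routed = dict(record)
    routed["layout_tags"] = ["has_visual"] if crops else ["text_only"]
    routed["assets"] = [
        {"asset_id": f"{record['slide_id']}_crop{i}", "bbox": list(box)}
        for i, box in enumerate(crops)
    ]
    return routed
```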

🟡 Notebook 03 — Signal Gating + Schema-First Extraction (In Progress)

Purpose:

  • Optional OCR on cropped assets (pytesseract)
  • Deterministic signal gating (open-source only)
  • Budget-aware LLM escalation
  • Schema-enforced structured extraction
  • Validation
  • Persist slides_extracted.jsonl

Signal gating happens here — not in Notebook 02.

Only slides meeting explicit criteria escalate to paid LLM extraction.

Output:

  • slides_extracted.jsonl
  • tables/*.parquet
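Schema-enforced extraction means every record is coerced to a fixed shape before persistence, with null for anything uncertain. A minimal stdlib sketch of that validation step (the notebook may use a schema library instead; the field set here is illustrative):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class SlideExtraction:
    """Illustrative schema; the real field set lives in Notebook 03."""
    slide_id: str
    title: Optional[str] = None      # null when uncertain, never guessed
    body_text: Optional[str] = None
    ocr_text: Optional[str] = None

def validate(raw: dict) -> dict:
    """Drop unknown keys, require slide_id, and null-fill missing fields."""
    allowed = set(SlideExtraction.__dataclass_fields__)
    known = {k: v for k, v in raw.items() if k in allowed}
    return asdict(SlideExtraction(**known))
```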

Where Domain-Specific Intelligence Fits

This repository deliberately separates:

Backbone (public)

  • Slide ingest
  • Routing
  • Cropping
  • OCR evidence
  • Structured extraction
  • JSONL persistence

Intelligence Layer (domain-specific)

  • Industry taxonomies
  • Signal lexicons
  • Embedding similarity models
  • Opportunity scoring
  • Cross-slide aggregation
  • Strategic synthesis

The intelligence layer consumes: slides_extracted.jsonl

It does not modify the backbone.


How to Add Your Own Domain Intelligence

This pipeline is designed to allow domain-specific logic to plug in before expensive LLM calls.

You have two integration points:


1️⃣ Add Deterministic Signal Gating (Recommended)

Inside Notebook 03, before LLM extraction, you can add:

  • Keyword lexicons
  • Regex-based detection
  • spaCy matchers
  • Sentence-transformer similarity
  • Local embedding scoring
  • Any open-source model

This produces:

  • domain_signal_score
  • domain_signal_flags
  • llm_escalation = true | false

Only slides with llm_escalation = true are sent to the paid LLM.

This keeps costs controlled and precision high.
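A deterministic gate can be as small as a lexicon intersection. The lexicon and threshold below are placeholders; substitute your own domain terms:

```python
import re

# Placeholder lexicon and threshold; substitute your own domain terms.
LEXICON = {"revenue", "pipeline", "churn", "forecast"}
ESCALATION_THRESHOLD = 2

def gate(slide_text: str) -> dict:
    """Score a slide with deterministic, open-source-only signals."""
    tokens = set(re.findall(r"[a-z]+", slide_text.lower()))
    flags = sorted(LEXICON & tokens)
    score = len(flags)
    return {
        "domain_signal_score": score,
        "domain_signal_flags": flags,
        "llm_escalation": score >= ESCALATION_THRESHOLD,
    }
```

Because the gate is pure string matching, the same input always yields the same escalation decision, which keeps the audit trail reproducible.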


2️⃣ Add a Separate Intelligence Notebook (Private Layer)

After slides_extracted.jsonl is written, create a new notebook outside this repo (or ignored via .gitignore) that:

  • Loads structured JSON
  • Applies your domain taxonomy
  • Aggregates across slides
  • Scores opportunity density
  • Produces reports

This keeps proprietary intelligence separate from extraction infrastructure.
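Such a notebook only reads the backbone's output and never writes back to it. A minimal aggregation sketch (the domain_signal_flags field assumes the gating step described earlier):

```python
import json
from collections import Counter

def aggregate_signals(jsonl_path: str) -> Counter:
    """Count domain signal flags across all extracted slides.

    Reads the backbone's JSONL output; never modifies it.
    Field names are illustrative.
    """
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            counts.update(rec.get("domain_signal_flags", []))
    return counts
```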


Why This Design Matters

Most document pipelines:

  • Send everything to an LLM
  • Mix extraction with interpretation
  • Burn API budget quickly
  • Lack auditability

This pipeline enforces:

  • No guessing
  • Schema-first outputs
  • Deterministic behaviour
  • Budget-aware escalation
  • JSONL as source-of-truth
  • Clean public/private separation

Repository Structure

document-intelligence-pipeline/
├── notebooks/
│   ├── Notebook01_Ingest.ipynb
│   ├── Notebook02_Routing.ipynb
│   └── Notebook03_Extraction.ipynb
├── src/
├── data/
├── config.yaml
├── config.local.yaml (ignored)
├── requirements.txt
└── README.md

Installation

Python 3.11 recommended.

pip install -r requirements.txt

If using OCR, install the Tesseract engine for your operating system.


Usage

  1. Place slide images in: data/Images_ToRead/
  2. Configure: config.yaml
  3. Run notebooks in order:
    • Notebook 01
    • Notebook 02
    • Notebook 03
  4. Inspect:
    • slides_extracted.jsonl
    • tables/
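The steps above can be sanity-checked with a few lines of stdlib Python (the parquet tables additionally need pandas or pyarrow to inspect; the llm_escalation field name is illustrative):

```python
import json

def summarize(jsonl_path: str) -> dict:
    """Count total records and escalated slides in a pipeline output file."""
    total = escalated = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            total += 1
            if rec.get("llm_escalation"):
                escalated += 1
    return {"slides": total, "escalated": escalated}
```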

Project Status

Completed:

  • Ingest pipeline
  • Routing & asset detection
  • Cropped asset persistence
  • JSONL persistence boundary
  • Config-driven behaviour

In Progress:

  • Signal gating framework
  • Cost instrumentation
  • Escalation controls
  • Validation hardening

Planned:

  • Embedding-based triage
  • Evaluation harness
  • CLI runner
  • Schema versioning

Living Documentation & Roadmap

This repository is under active development.

As additional notebooks and architectural layers are introduced, this README will be updated to reflect:

  • New pipeline stages
  • Expanded schema definitions
  • Cost-control refinements
  • Signal gating improvements
  • Validation enhancements
  • CLI or production hardening

The goal is for this README to remain the authoritative public reference for the backbone layer.

Future notebooks (e.g., advanced validation, cost instrumentation, embedding-based triage, CLI runners) will be documented here as they become stable.


Versioning Approach

  • Structural changes will be reflected in this README.
  • Major architectural changes will be noted in commit history.
  • The separation between backbone and domain-specific intelligence will remain enforced.

Important

The backbone will continue to evolve.

The intelligence layer remains domain-specific and is intentionally not part of this repository.

Users are encouraged to fork and extend the pipeline for their own domain-specific intelligence layers.


Design Principles

  • Deterministic execution
  • No inference during extraction
  • Null when uncertain
  • Structured outputs enforced via schema
  • Config-driven runtime behaviour
  • Budget-protected LLM usage
  • Strict separation of extraction and intelligence

Intended Audience

  • Engineers, analysts, and scientists building document intelligence systems
  • Teams requiring structured slide extraction
  • Researchers analysing presentation decks
  • Domain experts adding industry-specific intelligence
  • Startups building vertical AI layers

License

AGPL-3.0 license


Maintainer

Malixor Zero

About

A multimodal pipeline for extracting, structuring, and analysing content from hybrid documents such as slide decks and long-form reports. Combines text extraction, image-based parsing, and LLM-driven understanding to support offline review, summarisation, and analysis.
