Historical Document Intelligence

AI-powered document processing system that extracts text from scanned documents and analyzes content using OCR and Large Language Models.


Overview

A full-stack application for researchers who need to extract structured information from scanned documents.

Core Capabilities:

  • Intelligent OCR - Multi-column layout support, automatic image preprocessing for low-quality scans
  • AI Entity Extraction - People, locations, dates, organizations (customizable)
  • Document Summarization - LLM-generated 2-3 sentence summaries
  • Image Quality Assessment - Research-based quality scoring with automatic preprocessing
  • Full Traceability - Before/after comparisons (images + text), OCR confidence scores, every processing step auditable
  • Processing History - All results persisted and searchable

Architecture

┌─────────────┐     ┌─────────────┐     ┌──────────────────────────────────┐
│   Frontend  │────▶│   Backend   │────▶│        Python AI Pipeline        │
│  React/MUI  │     │ Node/Express│     │           (LangGraph)            │
└─────────────┘     └─────────────┘     └──────────────────────────────────┘
                           │                           │
                    ┌──────┴──────┐            ┌───────┼───────┐
                    │             │            │       │       │
                    ▼             ▼            ▼       ▼       ▼
               ┌────────┐   ┌────────┐   ┌────────┐ ┌────────┐ ┌────────┐
               │DynamoDB│   │   S3   │   │   S3   │ │Textract│ │ OpenAI │
               └────────┘   └────────┘   └────────┘ └────────┘ └────────┘
                               ↑             ↑
                               └─────────────┘
                               (shared storage)

Processing Flow

  1. User uploads scanned document via frontend
  2. Backend generates pre-signed S3 URL and stores metadata in DynamoDB
  3. AI pipeline performs:
    • Image quality assessment (blur, contrast, noise, brightness)
    • Conditional preprocessing if quality is poor (deskew, denoise, binarization)
    • OCR with Textract (LAYOUT enabled for multi-column support)
    • Text cleaning to remove OCR artifacts
    • Entity extraction and summarization via LLM
  4. Results and processed images persisted to S3/DynamoDB
  5. Frontend displays before/after comparison and extracted insights
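The quality-assessment step (3) can be sketched in plain Python. The actual pipeline presumably uses OpenCV; the metrics below (Laplacian variance for blur, intensity standard deviation for contrast) and the threshold values are illustrative assumptions:

```python
import statistics

# Illustrative quality metrics; the real pipeline likely uses OpenCV
# (e.g. cv2.Laplacian). Threshold values here are hypothetical.

def laplacian_variance(img):
    """Blur metric: variance of the discrete Laplacian. Low values = blurry."""
    h, w = len(img), len(img[0])
    lap = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap.append(img[y-1][x] + img[y+1][x] + img[y][x-1] + img[y][x+1]
                       - 4 * img[y][x])
    return statistics.pvariance(lap)

def contrast(img):
    """Contrast metric: standard deviation of pixel intensities."""
    pixels = [p for row in img for p in row]
    return statistics.pstdev(pixels)

def needs_preprocessing(img, blur_thresh=100.0, contrast_thresh=30.0):
    # Hypothetical decision rule: preprocess when either metric is poor.
    return laplacian_variance(img) < blur_thresh or contrast(img) < contrast_thresh

# A flat (featureless) 5x5 "scan" scores poorly on both metrics:
flat = [[128] * 5 for _ in range(5)]
assert needs_preprocessing(flat)
```

On a real scan the same decision would gate the conditional deskew/denoise/binarization step.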

Design Decisions

Node.js + Python hybrid - The AI pipeline uses Python because LangChain and OpenCV have no mature Node.js alternatives. The two services are kept separate for resource isolation: OpenCV image processing is memory-intensive, and if it crashes it won't take down the main API.

LangGraph over n8n - LangGraph provides built-in state management and visibility into each node's state, making it easy to trace data flow through the pipeline. n8n might offer similar capabilities, but evaluating it would have required significant research time, and it may have unknown limitations.
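The state-passing idea behind this choice can be sketched in plain Python. This is NOT the LangGraph API, just an illustration of why explicit per-node state makes the pipeline traceable; the node and field names mirror the processing flow above and are hypothetical:

```python
from typing import Callable, TypedDict

# Minimal sketch of LangGraph-style state passing (not the real API).
# Field names (image_key, raw_text, ...) are illustrative assumptions.

class PipelineState(TypedDict, total=False):
    image_key: str
    quality_score: float
    raw_text: str
    entities: list

def assess_quality(state: PipelineState) -> PipelineState:
    # Each node receives the full state and returns an updated copy.
    return {**state, "quality_score": 0.9}          # placeholder score

def run_ocr(state: PipelineState) -> PipelineState:
    return {**state, "raw_text": "extracted text"}  # placeholder OCR result

def extract_entities(state: PipelineState) -> PipelineState:
    return {**state, "entities": []}                # placeholder LLM output

NODES: list[Callable[[PipelineState], PipelineState]] = [
    assess_quality, run_ocr, extract_entities,
]

def run_pipeline(initial: PipelineState) -> PipelineState:
    state = initial
    for node in NODES:
        state = node(state)  # state is inspectable (and loggable) between nodes
    return state

final = run_pipeline({"image_key": "scans/doc-001.png"})
```

Because every node's input and output is a plain state object, each intermediate step can be logged, which is what makes the pipeline auditable end to end.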

Textract over Tesseract - The system is deployed on AWS, where Textract integrates seamlessly (it reads directly from S3). It is a mature enterprise service and cost-effective (1,000 pages/month free tier).
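Extracting line text and confidence scores (used for the traceability feature) from a Textract analyze_document response is straightforward. The payload below is a heavily abbreviated, hypothetical sample; real responses also carry LAYOUT, WORD, and geometry blocks:

```python
# Abbreviated, hypothetical Textract analyze_document response.
response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Text": "Minutes of the Town Council", "Confidence": 98.7},
        {"BlockType": "LINE", "Text": "12th of March, 1894", "Confidence": 91.2},
    ]
}

def extract_lines(resp):
    """Return (text, confidence) pairs for every LINE block."""
    return [(b["Text"], b["Confidence"])
            for b in resp["Blocks"] if b["BlockType"] == "LINE"]

lines = extract_lines(response)
mean_confidence = sum(c for _, c in lines) / len(lines)
```

Per-line confidence is what feeds the before/after comparison view in the frontend.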


Tech Stack

Layer             Technology
Frontend          React 18, TypeScript, Vite, Material UI v7
Backend           Node.js, Express, ES Modules
Database          AWS DynamoDB
Storage           AWS S3 (pre-signed URLs)
OCR               AWS Textract (analyze_document + LAYOUT)
LLM               OpenAI GPT-4o-mini (structured output)
AI Pipeline       Python, LangGraph, Pydantic
Image Processing  OpenCV, unpaper

Project Structure

historical-doc-intelligence/
├── frontend/           # React + TypeScript + MUI v7
├── backend/            # Node.js + Express + DynamoDB
├── python-services/    # LangGraph AI Pipeline
└── context/            # Design documentation

See individual README files in each directory for details.


Deployment (AWS)

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│     S3      │────▶│ ECS on EC2  │────▶│   Lambda    │
│  (Static)   │     │  (Backend)  │     │(AI Pipeline)│
└─────────────┘     └─────────────┘     └─────────────┘
   Frontend           Node.js            Python
Component    Service     Notes
Frontend     S3          Static website hosting
Backend      ECS on EC2  Containerized Node.js, Elastic IP
AI Pipeline  Lambda      Container image (OpenCV/unpaper)
Storage      S3          Document images, processed results
Database     DynamoDB    Processing history

See Deployment Guide for details.


Quick Start

Prerequisites

  • Node.js 18+
  • Python 3.12+
  • AWS Account (S3, DynamoDB, Textract)
  • OpenAI API Key

Installation

# Clone
git clone https://github.com/JOJOMRJ/historical-doc-intelligence.git
cd historical-doc-intelligence

# Frontend
cd frontend && npm install

# Backend
cd ../backend && npm install

# Python (with virtual environment)
cd ../python-services
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Environment Variables

backend/.env

PORT=3000
S3_BUCKET=your-bucket
S3_REGION=ap-southeast-2
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
DYNAMODB_TABLE=historical-documents

python-services/.env

OCR_S3_BUCKET=your-bucket
OCR_AWS_REGION=ap-southeast-2
OCR_AWS_ACCESS_KEY_ID=xxx
OCR_AWS_SECRET_ACCESS_KEY=xxx
OPENAI_API_KEY=sk-xxx

Running

# Terminal 1: Backend
cd backend && npm run dev

# Terminal 2: Frontend
cd frontend && npm run dev

Open http://localhost:5173


Documentation


License

MIT


Author

Jojo - GitHub
