AI-powered document processing system that extracts text from scanned documents and analyzes content using OCR and Large Language Models.
A full-stack application for researchers who need to extract structured information from scanned documents.
Core Capabilities:
- Intelligent OCR - Multi-column layout support, automatic image preprocessing for low-quality scans
- AI Entity Extraction - People, locations, dates, organizations (customizable)
- Document Summarization - LLM-generated 2-3 sentence summaries
- Image Quality Assessment - Research-based quality scoring with automatic preprocessing
- Full Traceability - Before/after comparisons (images + text), OCR confidence scores, every processing step auditable
- Processing History - All results persisted and searchable
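The entity and summary outputs listed above can be pictured as one typed record. The following is a minimal sketch using only the standard library; the project itself lists Pydantic for this role, and the class names, field names, and `parse_insights` helper are hypothetical, not taken from the codebase.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Entities:
    # The four default entity types named in the capability list.
    people: list[str] = field(default_factory=list)
    locations: list[str] = field(default_factory=list)
    dates: list[str] = field(default_factory=list)
    organizations: list[str] = field(default_factory=list)


@dataclass
class DocumentInsights:
    summary: str
    entities: Entities


def parse_insights(raw_json: str) -> DocumentInsights:
    """Validate a JSON reply (assumed shape) and build a typed record."""
    data = json.loads(raw_json)
    summary = data["summary"].strip()
    if not summary:
        raise ValueError("summary must not be empty")
    # An unknown entity type raises TypeError here, which surfaces schema drift early.
    entities = Entities(**{k: list(v) for k, v in data.get("entities", {}).items()})
    return DocumentInsights(summary=summary, entities=entities)


# Hand-written sample reply, for illustration only.
reply = json.dumps({
    "summary": "A letter from 1912 about a land sale. It names the buyer and seller.",
    "entities": {"people": ["Mary Hill"], "dates": ["1912"]},
})
insights = parse_insights(reply)
print(asdict(insights))
```

Entity types that the reply omits default to empty lists, so downstream code can iterate over every field without checking for missing keys.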
┌─────────────┐ ┌─────────────┐ ┌──────────────────────────────────┐
│ Frontend │────▶│ Backend │────▶│ Python AI Pipeline │
│ React/MUI │ │ Node/Express│ │ (LangGraph) │
└─────────────┘ └─────────────┘ └──────────────────────────────────┘
│ │
┌──────┴──────┐ ┌───────┼───────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│DynamoDB│ │ S3 │ │ S3 │ │Textract│ │ OpenAI │
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘
↑ ↑
└─────────────┘
(shared storage)
- User uploads scanned document via frontend
- Backend generates pre-signed S3 URL and stores metadata in DynamoDB
- AI pipeline performs:
  - Image quality assessment (blur, contrast, noise, brightness)
  - Conditional preprocessing if quality is poor (deskew, denoise, binarization)
  - OCR with Textract (LAYOUT enabled for multi-column support)
  - Text cleaning to remove OCR artifacts
  - Entity extraction and summarization via LLM
- Results and processed images persisted to S3/DynamoDB
- Frontend displays before/after comparison and extracted insights
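The conditional preprocessing and text-cleaning steps in the flow above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: the metrics are assumed to be normalised to 0.0 to 1.0, and the scoring rule, the `QUALITY_THRESHOLD` value, the step names, and the cleaning rules are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class QualityMetrics:
    # Each metric is assumed to be normalised: 0.0 (bad) to 1.0 (good).
    blur: float
    contrast: float
    noise: float
    brightness: float

    def score(self) -> float:
        # Hypothetical scoring rule: unweighted mean of the four metrics.
        return (self.blur + self.contrast + self.noise + self.brightness) / 4


QUALITY_THRESHOLD = 0.6  # hypothetical cut-off, not taken from the project


def plan_steps(metrics: QualityMetrics) -> list[str]:
    """Return the ordered pipeline steps, adding preprocessing only for poor scans."""
    steps = ["assess_quality"]
    if metrics.score() < QUALITY_THRESHOLD:
        steps += ["deskew", "denoise", "binarize"]
    steps += ["ocr", "clean_text", "extract_entities_and_summarize", "persist"]
    return steps


def clean_ocr_text(text: str) -> str:
    """Remove a few common OCR artifacts (illustrative rules only)."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)          # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line
    return text.strip()


good_scan = QualityMetrics(blur=0.9, contrast=0.9, noise=0.9, brightness=0.9)
poor_scan = QualityMetrics(blur=0.2, contrast=0.5, noise=0.4, brightness=0.6)
print(plan_steps(good_scan))
print(plan_steps(poor_scan))
print(repr(clean_ocr_text("docu-\nment   text \n\n\n\nend ")))
```

The hyphen rule also merges genuine compounds that happen to break across lines, so a real cleaner would likely need a dictionary check or a more conservative rule.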
Node.js + Python hybrid - The AI pipeline runs in Python because LangChain and OpenCV have no mature Node.js alternatives. The backend and the pipeline run as separate services for resource isolation: OpenCV image processing is memory-intensive, so if it crashes it does not take down the main API.
LangGraph over n8n - LangGraph provides built-in state management and visibility into each node's state, making it easy to trace data flow through the pipeline. n8n might be able to do this, but would require significant research time and may have unknown limitations.
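To illustrate what per-node state visibility means, here is a minimal sketch in plain Python. It deliberately does not use the LangGraph API: `PipelineState`, `run_with_trace`, and the three stand-in nodes are hypothetical, and the node outputs are hard-coded. The idea it borrows is that each node returns a partial update that is merged into a shared state, which is then snapshotted.

```python
from copy import deepcopy
from typing import Callable, TypedDict


class PipelineState(TypedDict, total=False):
    image_key: str
    quality_score: float
    raw_text: str
    clean_text: str


Node = Callable[[PipelineState], dict]


def run_with_trace(nodes: list[tuple[str, Node]], state: PipelineState):
    """Run nodes in order; merge each partial update and snapshot the state."""
    trace = []
    for name, node in nodes:
        update = node(state)
        state = {**state, **update}
        trace.append((name, deepcopy(state)))
    return state, trace


# Stand-in nodes with hard-coded results, for illustration only.
def assess(state: PipelineState) -> dict:
    return {"quality_score": 0.8}


def ocr(state: PipelineState) -> dict:
    return {"raw_text": "Hello  world"}


def clean(state: PipelineState) -> dict:
    return {"clean_text": " ".join(state["raw_text"].split())}


final, trace = run_with_trace(
    [("assess", assess), ("ocr", ocr), ("clean", clean)],
    {"image_key": "uploads/page-1.png"},
)
for name, snapshot in trace:
    print(name, sorted(snapshot))
```

Each entry in `trace` shows exactly which keys existed after a given node ran, which is the kind of step-by-step audit trail the design note refers to.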
Textract over Tesseract - The system is deployed on AWS, so Textract integrates directly and reads documents straight from S3. It is a mature managed service and cost-effective at this scale (1000 pages/month free).
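Reading directly from S3 means the request only needs a bucket and object key rather than image bytes. The sketch below builds the arguments for boto3's `analyze_document` call with the LAYOUT feature and pulls text lines with their confidence scores out of a response. The helper names, the object key, and the sample response are hypothetical; the sample is hand-written so the code runs offline without AWS credentials.

```python
def build_analyze_request(bucket: str, key: str) -> dict:
    """Keyword arguments for Textract's analyze_document, reading from S3."""
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["LAYOUT"],
    }


def lines_with_confidence(response: dict) -> list[tuple[str, float]]:
    """Collect (text, confidence) for every LINE block in a Textract response."""
    return [
        (block["Text"], block["Confidence"])
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# With AWS credentials configured, the real call would look roughly like:
#   import boto3
#   client = boto3.client("textract", region_name="ap-southeast-2")
#   response = client.analyze_document(
#       **build_analyze_request("your-bucket", "uploads/page-1.png"))

# Hand-written sample shaped like a Textract response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Dear Sir,", "Confidence": 98.7},
        {"BlockType": "WORD", "Text": "Dear", "Confidence": 99.1},
        {"BlockType": "LINE", "Text": "I write to confirm", "Confidence": 91.2},
    ]
}

request = build_analyze_request("your-bucket", "uploads/page-1.png")
lines = lines_with_confidence(sample_response)
print(request)
print(lines)
```

Keeping only LINE blocks avoids double-counting text, since each line's words also appear as separate WORD blocks in the same response.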
| Layer | Technology |
|---|---|
| Frontend | React 18, TypeScript, Vite, Material UI v7 |
| Backend | Node.js, Express, ES Modules |
| Database | AWS DynamoDB |
| Storage | AWS S3 (Pre-signed URLs) |
| OCR | AWS Textract (analyze_document + LAYOUT) |
| LLM | OpenAI GPT-4o-mini (Structured Output) |
| AI Pipeline | Python, LangGraph, Pydantic |
| Image Processing | OpenCV, unpaper |
historical-doc-intelligence/
├── frontend/ # React + TypeScript + MUI v7
├── backend/ # Node.js + Express + DynamoDB
├── python-services/ # LangGraph AI Pipeline
└── context/ # Design documentation
See individual README files in each directory for details.
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│     S3      │────▶│ ECS on EC2  │────▶│   Lambda    │
│  (Static)   │     │  (Backend)  │     │(AI Pipeline)│
└─────────────┘     └─────────────┘     └─────────────┘
   Frontend             Node.js             Python
| Component | Service | Notes |
|---|---|---|
| Frontend | S3 | Static website hosting |
| Backend | ECS on EC2 | Containerized Node.js, Elastic IP |
| AI Pipeline | Lambda | Container image (OpenCV/unpaper) |
| Storage | S3 | Document images, processed results |
| Database | DynamoDB | Processing history |
See Deployment Guide for details.
- Node.js 18+
- Python 3.12+
- AWS Account (S3, DynamoDB, Textract)
- OpenAI API Key
# Clone
git clone https://github.com/JOJOMRJ/historical-doc-intelligence.git
cd historical-doc-intelligence
# Frontend
cd frontend && npm install
# Backend
cd ../backend && npm install
# Python (with virtual environment)
cd ../python-services
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
backend/.env
PORT=3000
S3_BUCKET=your-bucket
S3_REGION=ap-southeast-2
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
DYNAMODB_TABLE=historical-documents
python-services/.env
OCR_S3_BUCKET=your-bucket
OCR_AWS_REGION=ap-southeast-2
OCR_AWS_ACCESS_KEY_ID=xxx
OCR_AWS_SECRET_ACCESS_KEY=xxx
OPENAI_API_KEY=sk-xxx
# Terminal 1: Backend
cd backend && npm run dev
# Terminal 2: Frontend
cd frontend && npm run dev
License: MIT
Jojo - GitHub