Historical Document Intelligence

AI-powered document processing system that extracts text from scanned documents and analyzes content using OCR and Large Language Models.


Overview

A full-stack application for researchers who need to extract structured information from scanned documents.

Core Capabilities:

  • Intelligent OCR - Multi-column layout support, automatic image preprocessing for low-quality scans
  • AI Entity Extraction - People, locations, dates, organizations (customizable)
  • Document Summarization - LLM-generated 2-3 sentence summaries
  • Image Quality Assessment - Research-based quality scoring with automatic preprocessing
  • Full Traceability - Before/after comparisons (images + text), OCR confidence scores, every processing step auditable
  • Processing History - All results persisted and searchable

Architecture

┌─────────────┐     ┌─────────────┐     ┌──────────────────────────────────┐
│   Frontend  │────▶│   Backend   │────▶│        Python AI Pipeline        │
│  React/MUI  │     │ Node/Express│     │           (LangGraph)            │
└─────────────┘     └─────────────┘     └──────────────────────────────────┘
                           │                           │
                    ┌──────┴──────┐            ┌───────┼───────┐
                    │             │            │       │       │
                    ▼             ▼            ▼       ▼       ▼
               ┌────────┐   ┌────────┐   ┌────────┐ ┌────────┐ ┌────────┐
               │DynamoDB│   │   S3   │   │   S3   │ │Textract│ │ OpenAI │
               └────────┘   └────────┘   └────────┘ └────────┘ └────────┘
                               ↑             ↑
                               └─────────────┘
                               (shared storage)

Processing Flow

  1. User uploads scanned document via frontend
  2. Backend generates pre-signed S3 URL and stores metadata in DynamoDB
  3. AI pipeline performs:
    • Image quality assessment (blur, contrast, noise, brightness)
    • Conditional preprocessing if quality is poor (deskew, denoise, binarization)
    • OCR with Textract (LAYOUT enabled for multi-column support)
    • Text cleaning to remove OCR artifacts
    • Entity extraction and summarization via LLM
  4. Results and processed images persisted to S3/DynamoDB
  5. Frontend displays before/after comparison and extracted insights
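The quality-assessment step (3) can be sketched in plain Python. The actual pipeline presumably uses OpenCV; the metrics below (Laplacian variance for blur, intensity standard deviation for contrast) and the threshold values are illustrative assumptions:

```python
import statistics

# Illustrative quality metrics; the real pipeline likely uses OpenCV
# (e.g. cv2.Laplacian). Threshold values here are hypothetical.

def laplacian_variance(img):
    """Blur metric: variance of the discrete Laplacian. Low values = blurry."""
    h, w = len(img), len(img[0])
    lap = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap.append(img[y-1][x] + img[y+1][x] + img[y][x-1] + img[y][x+1]
                       - 4 * img[y][x])
    return statistics.pvariance(lap)

def contrast(img):
    """Contrast metric: standard deviation of pixel intensities."""
    pixels = [p for row in img for p in row]
    return statistics.pstdev(pixels)

def needs_preprocessing(img, blur_thresh=100.0, contrast_thresh=30.0):
    # Hypothetical decision rule: preprocess when either metric is poor.
    return laplacian_variance(img) < blur_thresh or contrast(img) < contrast_thresh

# A flat (featureless) 5x5 "scan" scores poorly on both metrics:
flat = [[128] * 5 for _ in range(5)]
assert needs_preprocessing(flat)
```

On a real scan the same decision would gate the conditional deskew/denoise/binarization step.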

Design Decisions

Node.js + Python hybrid - The AI pipeline uses Python because LangChain and OpenCV have no mature Node.js alternatives. The two services are kept separate for resource isolation: OpenCV image processing is memory-intensive, and if it crashes it won't take down the main API.

LangGraph over n8n - LangGraph provides built-in state management and visibility into each node's state, making it easy to trace data flow through the pipeline. n8n might offer similar capabilities, but evaluating it would have required significant research time, and it may have unknown limitations.
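The state-passing idea behind this choice can be sketched in plain Python. This is NOT the LangGraph API, just an illustration of why explicit per-node state makes the pipeline traceable; the node and field names mirror the processing flow above and are hypothetical:

```python
from typing import Callable, TypedDict

# Minimal sketch of LangGraph-style state passing (not the real API).
# Field names (image_key, raw_text, ...) are illustrative assumptions.

class PipelineState(TypedDict, total=False):
    image_key: str
    quality_score: float
    raw_text: str
    entities: list

def assess_quality(state: PipelineState) -> PipelineState:
    # Each node receives the full state and returns an updated copy.
    return {**state, "quality_score": 0.9}          # placeholder score

def run_ocr(state: PipelineState) -> PipelineState:
    return {**state, "raw_text": "extracted text"}  # placeholder OCR result

def extract_entities(state: PipelineState) -> PipelineState:
    return {**state, "entities": []}                # placeholder LLM output

NODES: list[Callable[[PipelineState], PipelineState]] = [
    assess_quality, run_ocr, extract_entities,
]

def run_pipeline(initial: PipelineState) -> PipelineState:
    state = initial
    for node in NODES:
        state = node(state)  # state is inspectable (and loggable) between nodes
    return state

final = run_pipeline({"image_key": "scans/doc-001.png"})
```

Because every node's input and output is a plain state object, each intermediate step can be logged, which is what makes the pipeline auditable end to end.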

Textract over Tesseract - The system is deployed on AWS, where Textract integrates seamlessly (it reads directly from S3). It is a mature enterprise service and cost-effective (1,000 pages/month free tier).
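Extracting line text and confidence scores (used for the traceability feature) from a Textract analyze_document response is straightforward. The payload below is a heavily abbreviated, hypothetical sample; real responses also carry LAYOUT, WORD, and geometry blocks:

```python
# Abbreviated, hypothetical Textract analyze_document response.
response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Text": "Minutes of the Town Council", "Confidence": 98.7},
        {"BlockType": "LINE", "Text": "12th of March, 1894", "Confidence": 91.2},
    ]
}

def extract_lines(resp):
    """Return (text, confidence) pairs for every LINE block."""
    return [(b["Text"], b["Confidence"])
            for b in resp["Blocks"] if b["BlockType"] == "LINE"]

lines = extract_lines(response)
mean_confidence = sum(c for _, c in lines) / len(lines)
```

Per-line confidence is what feeds the before/after comparison view in the frontend.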


Tech Stack

Layer             Technology
Frontend          React 18, TypeScript, Vite, Material UI v7
Backend           Node.js, Express, ES Modules
Database          AWS DynamoDB
Storage           AWS S3 (pre-signed URLs)
OCR               AWS Textract (analyze_document + LAYOUT)
LLM               OpenAI GPT-4o-mini (structured output)
AI Pipeline       Python, LangGraph, Pydantic
Image Processing  OpenCV, unpaper

Project Structure

historical-doc-intelligence/
├── frontend/           # React + TypeScript + MUI v7
├── backend/            # Node.js + Express + DynamoDB
├── python-services/    # LangGraph AI Pipeline
└── context/            # Design documentation

See individual README files in each directory for details.


Deployment (AWS)

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│     S3      │────▶│ ECS on EC2  │────▶│   Lambda    │
│  (Static)   │     │  (Backend)  │     │(AI Pipeline)│
└─────────────┘     └─────────────┘     └─────────────┘
   Frontend           Node.js            Python
Component    Service     Notes
Frontend     S3          Static website hosting
Backend      ECS on EC2  Containerized Node.js, Elastic IP
AI Pipeline  Lambda      Container image (OpenCV/unpaper)
Storage      S3          Document images, processed results
Database     DynamoDB    Processing history

See Deployment Guide for details.


Quick Start

Prerequisites

  • Node.js 18+
  • Python 3.12+
  • AWS Account (S3, DynamoDB, Textract)
  • OpenAI API Key

Installation

# Clone
git clone https://github.com/JOJOMRJ/historical-doc-intelligence.git
cd historical-doc-intelligence

# Frontend
cd frontend && npm install

# Backend
cd ../backend && npm install

# Python (with virtual environment)
cd ../python-services
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Environment Variables

backend/.env

PORT=3000
S3_BUCKET=your-bucket
S3_REGION=ap-southeast-2
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
DYNAMODB_TABLE=historical-documents

python-services/.env

OCR_S3_BUCKET=your-bucket
OCR_AWS_REGION=ap-southeast-2
OCR_AWS_ACCESS_KEY_ID=xxx
OCR_AWS_SECRET_ACCESS_KEY=xxx
OPENAI_API_KEY=sk-xxx

Running

# Terminal 1: Backend
cd backend && npm run dev

# Terminal 2: Frontend
cd frontend && npm run dev

Open http://localhost:5173


Documentation


License

MIT


Author

Jojo - GitHub
