Intelligent document processing. Extract structured data like JSON, Markdown and HTML from documents using AI.

docuglean-ai/docuglean-ocr

Intelligent document processing using state-of-the-art AI models.

If you find Docuglean helpful, please ⭐ this repository to show your support!

What is Docuglean?

Docuglean is a unified SDK for intelligent document processing using state-of-the-art AI models. It provides multilingual and multimodal capabilities with plug-and-play APIs for document OCR, structured data extraction, annotation, classification, summarization, and translation. It also comes with built-in tools and supports different types of documents out of the box.

Features

  • 🚀 Easy to Use: Simple, intuitive API with detailed documentation. Pass in a file and get Markdown back.
  • 🔍 OCR Capabilities: Extract text from images and scanned documents
  • 📊 Structured Data Extraction: Use Zod/Pydantic schemas for type-safe structured data extraction
  • 📄 Multimodal Support: Process PDFs and images with ease
  • 🤖 Multiple AI Providers: Support for OpenAI, Mistral, and Google Gemini, with more coming soon
  • 🔒 Type Safety: Full TypeScript support with comprehensive types
  • Summarize: Get structured TL;DRs of long documents
  • Local OCR (PDF): Parse PDFs locally without calling external APIs

Available SDKs

📦 Node.js/TypeScript SDK

Package: docuglean-ocr

npm install docuglean-ocr

Repository: node-ocr/

Quick Start:

OCR Function - Pure OCR Processing

Extracts text from documents and images, returning content and metadata such as bounding boxes (provider-dependent).

import { ocr, extract } from 'docuglean-ocr';

// Extract raw text from documents (supports URLs and local files)
const ocrResult = await ocr({
  filePath: 'https://arxiv.org/pdf/2302.12854',
  provider: 'openai',
  model: 'gpt-4o-mini',
  apiKey: 'your-api-key'
});

Extract Function - Structured Data Extraction

Extracts structured data from documents using custom schemas. Also handles summarization via a custom prompt and a compact schema.

import { z } from 'zod';

// Define schema for structured extraction
const ReceiptSchema = z.object({
  date: z.string(),
  total: z.number(),
  items: z.array(z.object({
    name: z.string(),
    price: z.number()
  }))
});

// Extract structured data from documents
const extractResult = await extract({
  filePath: './receipt.pdf',
  provider: 'mistral',
  model: 'mistral-small-latest',
  apiKey: 'your-api-key',
  responseFormat: ReceiptSchema,
  prompt: 'Extract receipt details including date, total, and items'
});

// Summarization via extract
const SummarySchema = z.object({
  title: z.string().optional(),
  summary: z.string().min(50),
  keyPoints: z.array(z.string()).min(3).max(7),
});

const summary = await extract({
  filePath: './long-report.pdf',
  provider: 'openai',
  apiKey: 'your-api-key',
  responseFormat: SummarySchema,
  prompt: 'Provide a concise 3-sentence summary of this document and 3–7 key points.'
});
console.log('Summary:', summary.summary);

Note: you can also use extract with a targeted "search" prompt (e.g., "Find all occurrences of X and return matching passages") to perform semantic search within a document.

🐍 Python SDK

Package: docuglean-ocr

pip install docuglean-ocr

Repository: python-ocr/

Quick Start:

OCR Function - Pure OCR Processing

Extracts text from documents and images, returning content and metadata such as bounding boxes (provider-dependent).

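import asyncio

from docuglean import ocr, extract

async def main():
    # Extract raw text from documents (supports URLs and local files)
    ocr_result = await ocr(
        file_path="./test/data/testocr.png",
        provider="gemini",
        model="gemini-2.5-flash",
        api_key="your-api-key"
    )
    print(ocr_result)

# `await` is only valid inside a coroutine, so run the call through asyncio
asyncio.run(main())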
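The examples here pass the key as a literal `api_key="your-api-key"`. In real code it is usually safer to read the key from the environment. The helper below is a minimal sketch of that idea: the function name `api_key_for` and the environment variable names are assumptions of this example (common conventions for each provider), not something the SDK is documented to read on its own.

```python
import os

# Conventional environment variable names per provider id.
# Assumption of this sketch: the SDK itself is not documented to read these.
API_KEY_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def api_key_for(provider: str) -> str:
    """Return the API key for `provider` from the environment, or raise."""
    try:
        env_var = API_KEY_ENV_VARS[provider]
    except KeyError:
        raise ValueError(f"Unknown provider: {provider!r}") from None
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

The result can then be passed as `api_key=api_key_for("gemini")` in place of the literal string.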

Extract Function - Structured Data Extraction

Extracts structured data from documents using custom schemas. Requires a response format schema and returns parsed data.

import asyncio
from typing import List

from docuglean import extract
from pydantic import BaseModel

# Define schema for structured extraction
class Item(BaseModel):
    name: str
    price: float

class Receipt(BaseModel):
    date: str
    total: float
    items: List[Item]

async def main():
    # Extract structured data from documents
    extract_result = await extract(
        file_path="./receipt.pdf",
        provider="mistral",
        model="mistral-small-latest",
        api_key="your-api-key",
        response_format=Receipt,
        prompt="Extract receipt details including date, total, and items"
    )
    print(extract_result)

# `await` is only valid inside a coroutine, so run the call through asyncio
asyncio.run(main())
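Model output can be well-typed and still wrong, so it may be worth sanity-checking extracted values before using them. The sketch below checks that the item prices add up to the stated total. It assumes the parsed result has been converted to a plain dict with the same field names as the `Receipt` schema; the helper name `receipt_total_matches` and the sample data are hypothetical, and the exact return type of `extract` is not specified here.

```python
import math


def receipt_total_matches(receipt: dict, tolerance: float = 0.01) -> bool:
    """Return True if the item prices sum to the stated total, within `tolerance`."""
    items_sum = math.fsum(item["price"] for item in receipt["items"])
    return math.isclose(items_sum, receipt["total"], abs_tol=tolerance)


# Hypothetical data shaped like the Receipt schema (not real extraction output)
sample = {
    "date": "2024-01-15",
    "total": 12.50,
    "items": [
        {"name": "Coffee", "price": 4.50},
        {"name": "Sandwich", "price": 8.00},
    ],
}

print(receipt_total_matches(sample))
```

Receipts that include tax, tips, or discounts will not pass this check unless those amounts are also captured as fields, so treat a mismatch as a prompt for review rather than a hard failure.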

Coming Soon

  • 🏷️ classify(): Document type classifier (receipt, ID, invoice, etc.)
  • 🤖 More Models, More Providers: Integration with Meta's Llama, Together AI, OpenRouter, and more
  • 🌍 Multilingual: Support for multiple languages
  • 🎯 Smart Classification: Automatic document type detection

Provider Options

Currently supported providers and models:

  • OpenAI: gpt-4o-mini, gpt-4o, gpt-4-turbo, gpt-3.5-turbo, o1-mini, o1-preview
  • Mistral: mistral-ocr-latest, mistral-small-latest, ministral-8b-latest
  • Google Gemini: gemini-2.5-flash, gemini-2.5-pro, gemini-1.5-flash, gemini-1.5-pro
  • Hugging Face: Qwen/Qwen2.5-VL-3B-Instruct and other vision-language models (Python only)
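One way to catch a mismatched provider and model before making a network call is a small lookup table built from the list above. This is a sketch, not part of the SDK: `SUPPORTED_MODELS` and `check_provider_model` are hypothetical names, and the provider ids are the ones used in the Quick Start examples. Hugging Face is left out because its model list is open-ended and its provider id is not shown here.

```python
# Provider ids as used in the Quick Start examples; model names copied from the list.
SUPPORTED_MODELS = {
    "openai": {
        "gpt-4o-mini", "gpt-4o", "gpt-4-turbo",
        "gpt-3.5-turbo", "o1-mini", "o1-preview",
    },
    "mistral": {"mistral-ocr-latest", "mistral-small-latest", "ministral-8b-latest"},
    "gemini": {"gemini-2.5-flash", "gemini-2.5-pro", "gemini-1.5-flash", "gemini-1.5-pro"},
}


def check_provider_model(provider: str, model: str) -> None:
    """Raise ValueError if the provider/model pair is not in the table."""
    models = SUPPORTED_MODELS.get(provider)
    if models is None:
        raise ValueError(f"Unsupported provider: {provider!r}")
    if model not in models:
        raise ValueError(f"Model {model!r} is not listed for provider {provider!r}")


check_provider_model("gemini", "gemini-2.5-flash")  # valid pair, passes silently
```

A table like this goes stale as providers add models, so it is best treated as a convenience check that you update alongside the list.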

Development

Node.js SDK

cd node-ocr
npm install
npm run build
npm test

Python SDK

cd python-ocr
uv sync
uv run pytest

Contributing

We welcome contributions! Please see our Contributing Guide for details.

License

Apache 2.0 - see the LICENSE file for details.

Stay Up to Date

⭐ Star this repo to get notified about new releases and updates!
