A Python-based proof-of-concept system that extracts structured data from car insurance policy PDFs using OCR and AI (OpenAI GPT), outputting JSON conforming to a defined schema.
- PDF Text Extraction: Direct text extraction with OCR fallback for scanned documents
- AI-Powered Extraction: Uses OpenAI GPT models to intelligently parse and extract structured data
- Schema Validation: Validates extracted data against a comprehensive JSON schema
- REST API: FastAPI-based API for easy integration and testing
The system consists of four main components:
- PDF Processing Layer: Extracts text from PDFs using `pdfplumber` and falls back to OCR (`pytesseract`) for scanned documents
- AI Extraction Layer: Uses OpenAI GPT models to parse the extracted text and structure it according to the schema
- Schema Validation: Validates extracted JSON against Pydantic models
- API Interface: FastAPI REST API for file upload and extraction
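The direct-extraction-with-OCR-fallback logic in the PDF processing layer can be sketched as follows. This is a minimal illustration, not the project's actual implementation: `direct_extract` and `ocr_extract` are injected callables standing in for the `pdfplumber` and `pytesseract` calls, and the 50-character threshold is an assumed heuristic.

```python
def extract_text(pdf_path, direct_extract, ocr_extract, min_chars=50):
    """Return PDF text, preferring direct extraction over OCR.

    direct_extract / ocr_extract stand in for pdfplumber and pytesseract;
    min_chars is an assumed heuristic for deciding that a PDF is scanned
    (i.e. contains little or no embedded text).
    """
    text = direct_extract(pdf_path) or ""
    if len(text.strip()) >= min_chars:
        return text
    # Too little embedded text: likely a scanned document, fall back to OCR.
    return ocr_extract(pdf_path)


# Tiny demo with stub extractors:
digital = extract_text("a.pdf", lambda p: "x" * 100, lambda p: "ocr text")
scanned = extract_text("b.pdf", lambda p: "", lambda p: "ocr text")
```

Injecting the extractor callables keeps the fallback decision testable without a real PDF or a Tesseract install.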
- Python 3.9 or higher
- Tesseract OCR (for OCR functionality)
- macOS: `brew install tesseract` and `brew install tesseract-lang` (for Chinese support)
- Ubuntu/Debian: `sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim` (for Chinese support)
- Windows: Download from GitHub and install the Chinese language data
- OpenAI API key
- Clone or navigate to the project directory:

```bash
cd project-3
```

- Create a virtual environment (recommended):

```bash
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up environment variables: create a `.env` file in the project root with your OpenAI API key:

```
OPENAI_API_KEY=your_api_key_here
```
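The application needs this key at runtime; a minimal sketch of the lookup with a clear failure message is shown below (the project presumably loads `.env` via a library such as python-dotenv, so this only illustrates the environment read, not the project's actual config code).

```python
import os


def load_openai_key():
    """Read OPENAI_API_KEY from the environment (e.g. populated from .env).

    Raises a descriptive error instead of failing later with an opaque
    authentication error from the OpenAI client.
    """
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
    return key
```

Failing fast here gives a much friendlier error than a 401 deep inside the extraction call.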
Start the FastAPI server:

```bash
python -m src.api.main
```

Or using uvicorn directly:

```bash
uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
```

The API will be available at http://localhost:8000
`GET /health`

Returns:

```json
{
  "status": "healthy",
  "service": "policy-extraction"
}
```

`POST /extract` (Content-Type: `multipart/form-data`)

Upload a PDF file:

```bash
curl -X POST "http://localhost:8000/extract" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@policy-docs/test12.pdf" \
  -o out_test12.json
```

Response format:
```json
{
  "success": true,
  "data": {
    "policyholder": { ... },
    "vehicle": { ... },
    "coverage": { ... },
    "premiumAndDiscounts": { ... },
    "insurerAndPolicyDetails": { ... },
    "additionalEndorsements": { ... }
  },
  "validation": {
    "is_valid": true,
    "errors": [],
    "missing_fields": []
  }
}
```

Calling the API from Python:

```python
import requests

url = "http://localhost:8000/extract"
with open("policy.pdf", "rb") as f:
    files = {"file": f}
    response = requests.post(url, files=files)

data = response.json()
print(data)
```

Using the extractor modules directly:

```python
from src.extractor.pdf_processor import extract_text_from_pdf
from src.extractor.ai_extractor import extract_policy_data
from src.extractor.schema_validator import validate_and_format

# Extract text from PDF
pdf_text = extract_text_from_pdf("policy.pdf")

# Extract structured data using AI
extracted_data = extract_policy_data(pdf_text)

# Validate the data
validated_data, validation_result = validate_and_format(extracted_data)

if validation_result.is_valid:
    print("Extraction successful!")
    print(validated_data)
else:
    print("Validation errors:", validation_result.errors)
```

Project structure:

```
project-3/
├── policy-docs/              # Sample PDFs (existing)
├── schema.pdf                # Schema definition (existing)
├── validator.json            # JSON schema for validation
├── .env                      # Environment variables (create this)
├── src/
│   ├── __init__.py
│   ├── extractor/
│   │   ├── __init__.py
│   │   ├── pdf_processor.py      # PDF text extraction & OCR
│   │   ├── ai_extractor.py       # OpenAI-based extraction (key-value pairs)
│   │   └── schema_validator.py   # JSON schema validation
│   ├── models/
│   │   ├── __init__.py
│   │   └── policy_schema.py      # Pydantic models (for reference)
│   └── api/
│       ├── __init__.py
│       └── main.py               # FastAPI application
├── tests/
│   └── test_extractor.py
├── requirements.txt
├── README.md
└── .gitignore
```
The extracted data conforms to the JSON schema defined in `validator.json`. If upstream APIs require changes to the JSON structure, this schema file can be updated directly.
The schema includes the following main sections:
- policyholder: Name, address, occupation, named drivers
- vehicle: Registration mark, make/model, year, VIN, engine details, seating capacity, body type, estimated value
- coverage: Type of cover, liability limits, excess/deductibles, limitations on use, authorized drivers
- premiumAndDiscounts: Premium amount, total payable, no-claim discount, levies
- insurerAndPolicyDetails: Insurer name, policy number, period of insurance, date of issue
- additionalEndorsements: Endorsements/clauses, hire purchase/mortgagee
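The validation response shape (`is_valid`, `errors`, `missing_fields`) can be illustrated with a minimal section-presence check. This is a sketch only: the project validates with Pydantic models and `validator.json`, whereas this uses plain dataclasses and checks just the top-level sections listed above.

```python
from dataclasses import dataclass, field

# Top-level sections from the schema described above.
REQUIRED_SECTIONS = [
    "policyholder", "vehicle", "coverage",
    "premiumAndDiscounts", "insurerAndPolicyDetails", "additionalEndorsements",
]


@dataclass
class ValidationResult:
    """Mirrors the shape of the API's "validation" object."""
    is_valid: bool = True
    errors: list = field(default_factory=list)
    missing_fields: list = field(default_factory=list)


def check_sections(data: dict) -> ValidationResult:
    """Report which required top-level sections are missing from the output."""
    result = ValidationResult()
    for section in REQUIRED_SECTIONS:
        if section not in data:
            result.missing_fields.append(section)
    result.is_valid = not result.missing_fields
    return result
```

Reporting missing fields separately from hard errors lets callers distinguish "the model skipped a section" from "the output is malformed".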
See schema.pdf for the detailed field descriptions and validator.json for the JSON schema definition used for validation.
Run the test suite:

```bash
python -m pytest tests/
```

Or run individual tests:

```bash
python tests/test_extractor.py
```

- OCR Dependency: OCR functionality requires Tesseract to be installed on the system
- API Costs: Uses OpenAI API which incurs costs per request
- PDF Quality: Extraction accuracy depends on PDF quality and format
- Token Limits: Very long PDFs may be truncated to fit within API token limits
- Redacted Documents: Documents with redacted/blacked-out fields will show "REDACTED" for those fields
- Multilingual Support: Supports English and Chinese (Simplified) documents. For Chinese documents, ensure Tesseract Chinese language data is installed
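The token-limit truncation mentioned above can be sketched as a crude character-budget cut. Both the token limit and the characters-per-token ratio here are illustrative assumptions (roughly 4 characters per token is a common rule of thumb for English), not the project's actual values.

```python
def truncate_for_token_limit(text: str, max_tokens: int = 100_000,
                             chars_per_token: float = 4.0) -> str:
    """Crudely truncate long PDF text to fit a model's context window.

    Assumes ~chars_per_token characters per token; a real implementation
    would use a proper tokenizer for an exact count.
    """
    max_chars = int(max_tokens * chars_per_token)
    return text if len(text) <= max_chars else text[:max_chars]
```

A tokenizer-based count would be exact, but a character budget is often good enough for deciding whether truncation is needed at all.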
The API returns appropriate HTTP status codes:
- 200: Successful extraction
- 400: Invalid file type or request
- 422: Unable to extract text from PDF
- 500: Server error (AI extraction failure, validation error, etc.)
Validation errors and missing fields are included in the response for debugging.
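A client might branch on these status codes along the following lines; this is a hedged sketch, and the action names are illustrative, not part of the API.

```python
def classify_extraction_error(status_code: int) -> str:
    """Map the API's documented status codes to a coarse client-side action."""
    if status_code == 200:
        return "ok"
    if status_code == 400:
        return "fix-request"      # wrong file type or malformed request
    if status_code == 422:
        return "unreadable-pdf"   # no extractable text in the PDF
    if status_code == 500:
        return "retry-or-report"  # AI extraction or validation failure
    return "unexpected"
```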
- Batch processing multiple PDFs
- Web UI for file upload
- Caching extracted results
- Integration with quote generation APIs
- Support for additional document formats
- Support for additional languages (beyond English and Simplified Chinese)
This is a proof-of-concept project.