A complete web application for extracting structured information from document images using OCR and verifying user-entered data against extracted values.
- Frontend: HTML, CSS, JavaScript (vanilla)
- Backend: Spring Boot (Java) REST API
- OCR Engine: Python + Microsoft TrOCR (PyTorch)
- Communication: Spring Boot executes Python scripts locally via ProcessBuilder
OCR_PROJECT/
├── backend/ # Spring Boot application
│ ├── src/main/java/com/ocr/
│ │ ├── OcrApplication.java # Main Spring Boot app
│ │ ├── config/
│ │ │ └── PythonConfig.java # Python execution config
│ │ ├── controller/
│ │ │ └── OcrController.java # REST endpoints
│ │ ├── service/
│ │ │ ├── OcrService.java # OCR extraction service
│ │ │ └── VerificationService.java # Data verification service
│ │ └── dto/ # Data Transfer Objects
│ ├── src/main/resources/
│ │ └── application.properties # Application configuration
│ └── pom.xml # Maven dependencies
│
├── python-ocr/ # Python OCR module
│ ├── ocr_processor.py # TrOCR model wrapper
│ ├── field_extractor.py # Regex-based field extraction
│ └── requirements.txt # Python dependencies
│
└── frontend/ # Frontend web application
├── index.html # Main HTML page
├── style.css # Styling
└── app.js # JavaScript logic
- Java Development Kit (JDK) 17 or higher
  - Download from: https://adoptium.net/
  - Verify: java -version
- Maven 3.6+
  - Download from: https://maven.apache.org/download.cgi
  - Verify: mvn -version
- Python 3.8+
  - Download from: https://www.python.org/downloads/
  - Verify: python --version or python3 --version
- pip (Python package manager)
  - Usually comes with Python
  - Verify: pip --version
- RAM: Minimum 4GB (8GB+ recommended for TrOCR model)
- Storage: ~2GB free space for Python dependencies and models
- OS: Windows, Linux, or macOS
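The prerequisite checks above can also be scripted. The following is a minimal sketch, not part of the project: the function names are hypothetical, and it only checks that the commands are on PATH and that the running interpreter is Python 3.8+. It does not confirm the JDK or Maven versions.

```python
import shutil
import sys

# Commands this README asks you to verify (names assumed to be on PATH as-is).
REQUIRED_TOOLS = ["java", "mvn", "pip"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_version_ok(minimum=(3, 8)):
    """True if the interpreter running this script meets the minimum version."""
    return sys.version_info[:2] >= minimum
```

Calling find_missing_tools() returns an empty list when java, mvn, and pip are all available.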
- Navigate to the python-ocr directory: cd python-ocr
- Create a virtual environment (recommended): python -m venv venv
- Activate the virtual environment:
  - Windows: venv\Scripts\activate
  - Linux/macOS: source venv/bin/activate
- Install Python dependencies: pip install -r requirements.txt
Note: This downloads PyTorch and the other Python dependencies (~1.5GB). The TrOCR model weights are downloaded separately on the first run.
- Test the Python OCR script: python ocr_processor.py <path_to_test_image>
- Navigate to the backend directory: cd backend
- Edit src/main/resources/application.properties:
  - Update python.executable if your Python command is python3 instead of python
  - Verify that the python.ocr.script path is correct relative to the project root
- Build the Spring Boot application: mvn clean install
  This will download all Maven dependencies and compile the project.
The frontend files are static HTML/CSS/JS files. No build process required.
Important: Update the API URL in frontend/app.js if your backend runs on a different port:
const API_BASE_URL = 'http://localhost:8080/api';
- Navigate to the backend directory: cd backend
- Run the Spring Boot application: mvn spring-boot:run
  Or run the JAR file: java -jar target/ocr-application-1.0.0.jar
- Verify the backend is running:
  - Open browser: http://localhost:8080
  - You should see a Spring Boot error page (expected, as there's no root endpoint)
  - Check logs for: "Started OcrApplication"
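If you prefer a scripted check over opening a browser, a TCP connection attempt is enough to tell whether something is listening on the backend port. This is a small sketch with a hypothetical helper name; it only confirms that the port accepts connections, not that the process behind it is this application.

```python
import socket


def is_port_open(host="localhost", port=8080, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unresolved hosts.
        return False
```

The same helper can be pointed at port 8000 to check the frontend server.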
You can serve the frontend in several ways:
Option A: Using Python HTTP Server (simplest)
cd frontend
python -m http.server 8000
Option B: Using Node.js http-server
npx http-server frontend -p 8000
Option C: Using any web server
- Copy the contents of frontend/ to your web server directory
- Ensure CORS is enabled (backend already allows all origins)
- Open browser: http://localhost:8000
- Upload a document image (JPG, PNG, or PDF)
- Click "Extract Information"
- Review and edit extracted fields
- Click "Verify Information" to see comparison results
Extract text and structured fields from uploaded image/PDF.
Request:
- Method: POST
- Content-Type: multipart/form-data
- Body: file (image or PDF file)
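The request described above can be sent without the frontend. The sketch below uses only the Python standard library and assumes the endpoint path /api/extract (the path that appears in the CORS error message in this README) and the form field name file. The helper names are hypothetical.

```python
import json
import mimetypes
import os
import urllib.request
import uuid

API_BASE_URL = "http://localhost:8080/api"  # same default as frontend/app.js


def build_multipart(field_name, filename, data):
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract(image_path):
    """POST the image to the extract endpoint and return the decoded JSON reply."""
    with open(image_path, "rb") as f:
        body, content_type = build_multipart("file", os.path.basename(image_path), f.read())
    request = urllib.request.Request(
        f"{API_BASE_URL}/extract",  # assumed path, taken from the CORS error example
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the backend running, extract("sample_id.png") (a placeholder file name) should return a dictionary shaped like the response documented next.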
Response:
{
"success": true,
"message": "OCR extraction successful",
"rawText": "Full extracted text...",
"extractedFields": {
"name": "John Doe",
"dob": "01/15/1990",
"id_number": "AB123456789",
"address": "123 Main Street, New York, NY 10001"
}
}
Verify user-entered form data against OCR extracted data.
Request:
- Method: POST
- Content-Type: multipart/form-data
- Body:
  - file: Original uploaded image file
  - formData: JSON string containing form data map
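To make the comparison concrete, here is a rough Python illustration of what a per-field check could look like. It is not the logic in VerificationService.java: the similarity measure (difflib's SequenceMatcher) and the averaging of confidences are assumptions made for this sketch, and only the 0.85 threshold comes from this README. The numbers it produces will not match the sample response.

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.85  # threshold value quoted from VerificationService.java


def compare_field(extracted, user):
    """Compare one field; string similarity stands in for the real confidence."""
    a, b = extracted.strip().lower(), user.strip().lower()
    confidence = SequenceMatcher(None, a, b).ratio()
    return {
        "match": confidence >= MATCH_THRESHOLD,
        "confidence": round(confidence, 3),
        "extractedValue": extracted,
        "userValue": user,
    }


def verify(extracted_fields, form_data):
    """Compare every submitted field and average the confidences (assumed rule)."""
    results = {
        name: compare_field(extracted_fields.get(name, ""), value)
        for name, value in form_data.items()
    }
    overall = sum(r["confidence"] for r in results.values()) / len(results) if results else 0.0
    return {"fieldResults": results, "overallConfidence": round(overall, 3)}
```

The returned dictionary mirrors the fieldResults and overallConfidence keys of the response format.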
Response:
{
"success": true,
"message": "Verification completed",
"fieldResults": {
"name": {
"match": true,
"confidence": 0.95,
"extractedValue": "John Doe",
"userValue": "John Doe"
},
"dob": {
"match": false,
"confidence": 0.72,
"extractedValue": "01/15/1990",
"userValue": "01/16/1990"
}
},
"overallConfidence": 0.835
}
# Python executable path
python.executable=python
# Path to OCR script (relative to project root)
python.ocr.script=python-ocr/ocr_processor.py
# Timeout for Python script execution (milliseconds)
python.ocr.timeout=30000
# File upload size limit
spring.servlet.multipart.max-file-size=10MB
// Backend API URL
const API_BASE_URL = 'http://localhost:8080/api';
Error: Python OCR script failed with exit code: 1
Solution:
- Verify the Python script path in application.properties
- Ensure the script is executable (Linux/macOS): chmod +x python-ocr/ocr_processor.py
- Test the script manually: python python-ocr/ocr_processor.py <image_path>
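When the exit code alone is not informative, running the script from a small harness that captures stderr can help. The sketch below roughly imitates how the backend launches the script as a child process. The 30-second timeout mirrors python.ocr.timeout=30000, and passing the image path as the first argument follows the manual test command above. The function name is hypothetical.

```python
import subprocess
import sys


def run_ocr_script(image_path, script="python-ocr/ocr_processor.py",
                   python=sys.executable, timeout_seconds=30):
    """Run the OCR script as a child process; return (exit_code, stdout, stderr)."""
    completed = subprocess.run(
        [python, script, image_path],
        capture_output=True,      # collect stdout and stderr instead of printing them
        text=True,                # decode output as text
        timeout=timeout_seconds,  # raises subprocess.TimeoutExpired if exceeded
    )
    return completed.returncode, completed.stdout, completed.stderr
```

A non-zero exit code together with the captured stderr usually points to the failing import or file path.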
Error: Error loading TrOCR model
Solution:
- Ensure internet connection for first-time model download
- Model will be cached in ~/.cache/huggingface/ after first download
- Check available disk space (model is ~500MB)
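Both checks can be done from Python. This is a minimal sketch assuming the default cache location named above. The function name and the 500MB figure used as a default are taken from this README, not from any library.

```python
import shutil
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "huggingface"  # location named in this README


def cache_status(cache_dir=CACHE_DIR, needed_bytes=500 * 1024**2):
    """Report whether the cache exists and whether enough disk space is free."""
    # disk_usage needs an existing path, so walk up to the nearest existing parent.
    probe = Path(cache_dir)
    while not probe.exists():
        probe = probe.parent
    free = shutil.disk_usage(probe).free
    return {
        "cache_exists": Path(cache_dir).exists(),
        "free_bytes": free,
        "enough_space": free >= needed_bytes,
    }
```

If cache_exists is False and enough_space is True, the next OCR run should trigger a fresh download, provided the machine is online.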
Error: Port 8080 is already in use
Solution:
- Change port in application.properties: server.port=8081
- Update frontend API_BASE_URL accordingly
- Or stop the process using port 8080
Error: Access to fetch at 'http://localhost:8080/api/extract' from origin 'http://localhost:8000' has been blocked by CORS policy
Solution:
- Backend already allows all origins via @CrossOrigin(origins = "*")
- If issue persists, verify frontend is accessing correct backend URL
Error: Maximum upload size exceeded
Solution:
- Increase limit in application.properties:
  spring.servlet.multipart.max-file-size=20MB
  spring.servlet.multipart.max-request-size=20MB
The field_extractor.py uses regex patterns to extract:
- Name: Full names (2+ words, capitalized)
- DOB: Dates in MM/DD/YYYY, DD/MM/YYYY, or YYYY-MM-DD format
- ID Number: Alphanumeric IDs (6+ characters)
- Address: Street addresses with city/state/zip
You can customize patterns in python-ocr/field_extractor.py for your specific document types.
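As a starting point for customization, the sketch below shows the general shape such patterns might take. These are illustrative patterns written for this example, not the ones in field_extractor.py. The address pattern is omitted, and the ID pattern additionally requires at least one letter and one digit to avoid matching ordinary uppercase words.

```python
import re

# Illustrative patterns only; the real ones live in python-ocr/field_extractor.py.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")
DOB_PATTERN = re.compile(r"\b(?:\d{2}/\d{2}/\d{4}|\d{4}-\d{2}-\d{2})\b")
ID_PATTERN = re.compile(r"\b(?=[A-Z0-9]*\d)(?=[A-Z0-9]*[A-Z])[A-Z0-9]{6,}\b")

FIELD_PATTERNS = (
    ("name", NAME_PATTERN),
    ("dob", DOB_PATTERN),
    ("id_number", ID_PATTERN),
)


def extract_fields(raw_text):
    """Return the first match of each pattern, keyed by field name."""
    fields = {}
    for key, pattern in FIELD_PATTERNS:
        match = pattern.search(raw_text)
        if match:
            fields[key] = match.group(0)
    return fields
```

Supporting another document type then amounts to adding a (key, pattern) pair to FIELD_PATTERNS.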
- First OCR run: Slow (~10-30 seconds) due to model loading
- Subsequent runs: Faster (~2-5 seconds) as model stays in memory
- GPU acceleration: Automatically used if CUDA is available
- Memory usage: ~2-3GB RAM for TrOCR model
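The GPU selection mentioned above typically comes down to a single PyTorch check. The sketch below shows that check in isolation. The helper name is hypothetical, and how ocr_processor.py actually chooses its device is not shown in this README.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # installed via python-ocr/requirements.txt
    except ImportError:
        # Without PyTorch there is nothing to accelerate; fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

Printing pick_device() inside the activated virtual environment is a quick way to see which device OCR runs are likely to use.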
- Update field_extractor.py with new regex patterns
- Update frontend form (index.html) with new input fields
- Update DTOs if needed
Edit VerificationService.java:
boolean match = confidence >= 0.85; // Change threshold here
This project uses open-source libraries:
- Spring Boot (Apache License 2.0)
- TrOCR (MIT License)
- PyTorch (BSD-style License)
For issues or questions:
- Check logs in Spring Boot console
- Verify Python script runs independently
- Test API endpoints with Postman/curl
- Check browser console for frontend errors
- Add PDF text extraction support
- Implement batch processing
- Add database storage for extracted data
- Enhance field extraction with ML models
- Add user authentication