Created by vibe coding with Claude Code. Excuse the code quality.
A robust CLI tool to extract blood test information from lab report PDFs and output the results as CSV files. Supports multiple lab formats including Quest Diagnostics, LabCorp, and Cleveland HeartLab.
pdf_to_csv.py- Original rule-based PDF extractor for Quest/LabCorp formats using pattern matchingunified_ai_extractor.py- AI-powered extractor using Claude or OpenAI for any PDF formatwellavy_ai_extractor.py- Wellavy-specific extractor with intelligent database marker mapping
api.py- FastAPI service providing HTTP endpoints for PDF extraction/api/v1/ai-extract- Basic AI extraction/api/v1/ai-extract-mapped- AI extraction with Wellavy database marker mapping/convert- Legacy rule-based extraction
test_api.py- Tests for the API endpointstest_ai_api.py- Tests specifically for AI extraction endpoints
run_unified_extractor.sh- Shell script to run the unified AI extractorinstall.sh- Installation script for dependencies
- Multi-format Support: Handles Quest Diagnostics (Analyte/Value), LabCorp, Function Health Dashboard, and Vibrant America formats
- Intelligent Format Detection: Automatically detects and uses the appropriate extraction method
- Optional Format Override: Force specific lab format (
--format=quest,--format=labcorp, or--format=function_health) when needed - Reference Range Extraction: Optional extraction of reference ranges with
--include-rangesflag - Comprehensive Extraction: Extracts 80+ markers including CBC, CMP, hormones, lipids, and fatty acids
- Single CSV Output: All markers in one file, preserving the original PDF order
- Flexible Output Formats: Standard CSV or enhanced CSV with reference ranges
- Date Auto-detection: Automatically extracts test dates from PDF content
- Value Validation: Validates extracted values against realistic medical ranges
- API Support: FastAPI web service for programmatic access
- Fallback PDF Reading: Uses PyPDF2 with pdfplumber fallback for robust text extraction
- Configurable: JSON-based configuration for markers and extraction settings
- Install Python dependencies:
pip3 install -r requirements.txt- Make the script executable:
chmod +x pdf_to_csv.pyOr use the install script:
./install.shpython3 pdf_to_csv.py your_lab_report.pdfThis will create a single CSV file: your_lab_report.csv containing all extracted markers in the order they appear in the PDF.
python3 pdf_to_csv.py your_lab_report.pdf --output results.csvThis creates: results.csv with all extracted markers.
python3 pdf_to_csv.py your_lab_report.pdf --verbosepython3 pdf_to_csv.py your_lab_report.pdf --include-ranges
# or
python3 pdf_to_csv.py your_lab_report.pdf -rpython3 pdf_to_csv.py your_lab_report.pdf --format quest
python3 pdf_to_csv.py your_lab_report.pdf --format labcorp
python3 pdf_to_csv.py your_lab_report.pdf --format function_health--output, -o: Specify the output CSV file path--verbose, -v: Enable verbose output to see extracted data--include-ranges, -r: Include reference ranges (MinRange, MaxRange) in output--format: Force specific lab format (quest,labcorp, orfunction_health) - auto-detects if not specified--config-dir: Specify custom configuration directory (default: config)--help: Show help message
- Format: Analyte/Value structured tables
- Detection: Looks for "Analyte" and "Value" headers
- Examples: Wild Health reports, most Quest lab reports
- Extraction: 80-90 markers typically extracted
- Format: LabCorp structured results + Cleveland fatty acid profiles
- Detection: Identifies both LabCorp codes (01, 02, 03, 04) and Cleveland HeartLab sections
- Examples: Combination reports with comprehensive panels + specialized fatty acid analysis
- Extraction: 80+ markers from both lab formats
- Format: Specialized fatty acid and cardiovascular markers
- Detection: Cleveland HeartLab headers with fatty acid data
- Examples: Cardiometabolic reports
- Extraction: Omega-3, Omega-6, EPA, DHA, ratios
- Format: Function Health's comprehensive biomarker dashboard export
- Detection: Looks for "In Range", "Out of Range", "Biomarkers" patterns and characteristic layout
- Examples: Function Health dashboard PDF exports
- Extraction: 80-110+ markers including autoimmunity, biological age, hormones, vitamins, minerals
- Note: PDFs may have spaced text extraction issues with PyPDF2; the tool automatically uses pdfplumber for better results
The tool extracts a comprehensive set of blood markers across multiple categories:
WBC, RBC, Hemoglobin, Hematocrit, MCV, MCH, MCHC, RDW, Platelets, MPV, Neutrophils, Lymphocytes, Monocytes, Eosinophils, Basophils (both % and absolute counts)
Glucose, BUN, Creatinine, eGFR, Sodium, Potassium, Chloride, CO2, Calcium, Total Protein, Albumin, Globulin, Bilirubin, Alkaline Phosphatase, AST, ALT
Total Cholesterol, HDL, LDL, Triglycerides, Non-HDL Cholesterol, LDL Particle Number, LDL Small, HDL Large, Apolipoprotein A1, Apolipoprotein B, Lipoprotein(a)
Testosterone (Total & Free), DHEA Sulfate, Sex Hormone Binding Globulin, Cortisol, Estradiol, Progesterone, TSH, Free T4, Free T3
Omega-3 Total, Omega-6 Total, EPA, DHA, DPA, Arachidonic Acid, Linoleic Acid, Omega-6/Omega-3 Ratio, Arachidonic Acid/EPA Ratio, hs-CRP, LP-PLA2
Vitamin D, Vitamin B12, Vitamin B6, Folate (Serum & RBC), Hemoglobin A1c, Insulin, Ferritin, Iron, TIBC, Homocysteine, Uric Acid, TMAO, Coenzyme Q10
The CSV file contains all extracted markers in the order they appear in the PDF:
Marker Name,2024-04-04
GLUCOSE,83
UREA NITROGEN (BUN),14
CREATININE,0.98
EGFR,100
WHITE BLOOD CELL COUNT,6.4
HEMOGLOBIN,16.6
LDL-CHOLESTEROL,80
HDL CHOLESTEROL,72
TESTOSTERONE, TOTAL, MS,589
VITAMIN D, 25-OH, TOTAL,42.6When using the --include-ranges flag, the CSV includes reference range columns:
Marker,MinRange,MaxRange,2025-06-10
LDL-P,,1000,1258
LDL-C (NIH Calc),0,99,113
HDL-C,39,,69
Triglycerides,0,149,40
Glucose,70,99,101
BUN,6,24,15
Creatinine,0.57,1.00,0.67
WBC,3.4,10.8,6.2
Hemoglobin,11.1,15.9,14.4
TSH,0.450,4.500,1.240To install as a system-wide command:
pip3 install -e .Then you can use it as:
pdf-to-csv your_lab_report.pdfThe tool includes a FastAPI web service with two extraction methods:
- Pattern-based extraction (public, no auth required)
- AI-powered extraction (secured with API key)
# Start locally
python api.py
# Or deploy on Railway (automatic from GitHub)POST /convert - Extract using regex patterns
Query Parameters:
include_ranges(boolean, optional): Include reference ranges in output (default: false)format(string, optional): Force specific lab format (quest,labcorp, orfunction_health) - auto-detects if not specified
Example:
curl -X POST "http://localhost:8000/convert?include_ranges=true" \
-F "file=@your_lab_report.pdf" \
-o results.csvPOST /api/v1/ai-extract - Extract using Claude AI
Authentication Required: Include API key in header
X-API-Key: your-api-key-here
Query Parameters:
include_ranges(boolean, optional): Include reference ranges in output (default: false)
Response Format:
{
"success": true,
"test_date": "2024-01-15",
"marker_count": 85,
"results": [
{
"marker": "Glucose",
"value": "95",
"min_range": "70", // Only if include_ranges=true
"max_range": "100" // Only if include_ranges=true
}
]
}Example:
# Basic extraction
curl -X POST "https://bloodpdftocsv.amandeep.app/api/v1/ai-extract" \
-H "X-API-Key: your-api-key" \
-F "file=@blood_test.pdf"
# With reference ranges
curl -X POST "https://bloodpdftocsv.amandeep.app/api/v1/ai-extract?include_ranges=true" \
-H "X-API-Key: your-api-key" \
-F "file=@blood_test.pdf"POST /api/v1/ai-extract-mapped - Extract and map to database markers
This endpoint is designed specifically for Wellavy integration, intelligently mapping extracted markers to a provided database schema.
Authentication Required: Include API key in header
X-API-Key: your-api-key-here
Request:
file: PDF file to extract (multipart/form-data)database_markers: JSON array of database markers (in form data)
Database Markers Format:
[
{"id": "be9a1341-7ce3-4e18-b3d8-4147d5bb6366", "name": "Glucose"},
{"id": "b562e4ad-2f5d-4da6-8eb7-4b7ece904d69", "name": "Cholesterol, Total"},
{"id": "340c24f2-6e6b-4ab8-9a3b-719d4b557d88", "name": "WBC"}
]Response Format:
{
"success": true,
"test_date": "07/22/2025",
"lab_name": "Wild Health",
"marker_count": 83,
"mapping_stats": {
"total_extracted": 83,
"successfully_mapped": 75,
"unmapped": 8,
"mapping_rate": 0.90
},
"results": [
{
"original_marker": "Cholesterol Total",
"value": "155",
"unit": "mg/dL",
"min_range": "100",
"max_range": "199",
"mapped_marker_name": "Cholesterol, Total",
"mapped_marker_id": "b562e4ad-2f5d-4da6-8eb7-4b7ece904d69",
"confidence": 0.95
}
]
}Mapping Features:
- Intelligent marker name matching (handles variations like "Total Cholesterol" → "Cholesterol, Total")
- Confidence scores for each mapping
- Preserves original marker names for audit trail
- Maps common abbreviations (WBC, RBC, CRP, etc.)
- Handles lab-specific naming conventions
Example:
# Extract with marker mapping
curl -X POST "https://extract.wellavy.co/api/v1/ai-extract-mapped?include_ranges=true" \
-H "X-API-Key: your-api-key" \
-F "file=@blood_test.pdf" \
-F "database_markers=$(cat markers.json)"- Fork/Clone this repository
- Connect to Railway:
- Create new project on Railway
- Connect your GitHub repository
- Set Environment Variables:
API_SECRET_KEY=your-secure-random-key ANTHROPIC_API_KEY=sk-ant-your-claude-key - Deploy: Railway will auto-deploy on push
The AI extraction endpoint requires authentication to:
- Protect against unauthorized usage
- Control Claude API costs
- Track usage per client
Generate a secure API key:
import secrets
print(secrets.token_urlsafe(32))import requests
class BloodTestClient:
def __init__(self, api_url, api_key=None):
self.api_url = api_url
self.api_key = api_key
def extract_with_ai(self, pdf_path, include_ranges=False):
"""Use AI extraction (requires API key)"""
url = f"{self.api_url}/api/v1/ai-extract"
headers = {"X-API-Key": self.api_key}
params = {"include_ranges": include_ranges}
with open(pdf_path, 'rb') as f:
files = {'file': f}
response = requests.post(url, headers=headers, files=files, params=params)
return response.json()
def extract_with_patterns(self, pdf_path, include_ranges=False):
"""Use pattern extraction (no auth required)"""
url = f"{self.api_url}/convert"
params = {"include_ranges": include_ranges}
with open(pdf_path, 'rb') as f:
files = {'file': f}
response = requests.post(url, files=files, params=params)
return response.text # CSV content
# Usage
client = BloodTestClient(
api_url="https://bloodpdftocsv.amandeep.app",
api_key="your-api-key"
)
# AI extraction
results = client.extract_with_ai("blood_test.pdf", include_ranges=True)
print(f"Found {results['marker_count']} markers")For comprehensive API documentation including rate limiting, error handling, and more examples, see API_DOCUMENTATION.md.
The project includes an AI-powered extractor that uses Claude or GPT-4o to extract blood test results directly from PDFs without relying on text extraction patterns.
- Direct PDF Processing: Sends PDF as base64 to AI models for native OCR/document processing
- Dual AI Support: Choose between Claude (Anthropic) or GPT-4o (OpenAI)
- Structured Output: Returns JSON with standardized marker names and reference ranges
- Flexible Output: Export as CSV or JSON format
- Create a
.env.localfile (see.env.local.example):
ANTHROPIC_API_KEY=your-claude-api-key
OPENAI_API_KEY=your-openai-api-key- Install additional dependencies:
pip install anthropic openai python-dotenvBasic usage with Claude (default):
python unified_ai_extractor.py your_lab_report.pdfUse GPT-4o instead:
python unified_ai_extractor.py your_lab_report.pdf --service gpt4oInclude reference ranges:
python unified_ai_extractor.py your_lab_report.pdf --include-rangesOutput as JSON:
python unified_ai_extractor.py your_lab_report.pdf --json --output results.jsonUsing the shell script:
./run_unified_extractor.sh your_lab_report.pdf -s gpt4o -r--service/-s: Choose AI service (claude,openai, orgpt4o)--output/-o: Specify output file path--include-ranges/-r: Include reference ranges in output--json: Output as JSON instead of CSV
JSON output structure:
{
"results": [
{
"marker": "Glucose",
"value": "95",
"min_range": "70",
"max_range": "100"
},
{
"marker": "Cholesterol",
"value": "180",
"min_range": null,
"max_range": "200"
}
],
"test_date": "2024-01-15"
}- Claude:
claude-sonnet-4-20250514(Anthropic's latest Sonnet model) - GPT-4o:
gpt-4o(OpenAI's multimodal model with vision capabilities)
The AI extractor is particularly useful for:
- PDFs with complex layouts that pattern-based extraction struggles with
- Scanned documents or images where OCR is needed
- Reports from labs not yet supported by pattern-based extractors
- Quick extraction without needing to configure patterns
Extracted 85 markers including comprehensive CBC, CMP, hormones, advanced lipids, and fatty acid profiles:
Marker Name,2025-04-04
WHITE BLOOD CELL COUNT,6.4
HEMOGLOBIN,16.6
GLUCOSE,83
TESTOSTERONE, TOTAL, MS,589
LDL PARTICLE NUMBER,1605
OMEGA-3 TOTAL,7.6
EPA,1.6
DHA,4.0
VITAMIN D, 25-OH, TOTAL,42.6
HEMOGLOBIN A1c,5.3Extracted 82 markers from both lab formats in a single comprehensive report:
Marker Name,2024-10-30
WBC,4.3
Hemoglobin,15.6
Glucose,82
Testosterone,789
Arachidonic Acid/EPA Ratio,11.4
Omega3 Total,6.2
EPA,1.0
DHA,3.5
TMAO (Trimethylamine N-oxide),3.3
Vitamin D, 25-Hydroxy,50.4The tool uses JSON configuration files in the config/ directory:
Defines marker patterns, validation ranges, and categories. Organized by:
default_markers: Core lab markers by category (CBC, CMP, hormones, etc.)other_markers: Additional specialized markers
Contains extraction settings:
date_patterns: Regex patterns for date extractionvalue_patterns: Patterns for marker-value extractionexclusion_lists: Keywords to filter out noiseextraction_settings: Thresholds and parameters
- No data extracted: PDF format not recognized. Use
--verboseto see extracted text - Low marker count: May indicate wrong format detection. Check if PDF has Analyte/Value structure
- Wrong date: Date extraction failed, current date used. Edit CSV header manually if needed
- Missing specific markers: Pattern matching may need adjustment in
config/markers.json
- Quest reports: Should detect Analyte/Value structure and extract 80+ markers
- LabCorp combination: Needs both LabCorp codes and Cleveland sections for full extraction
- Fragmented PDFs: Tool will attempt fragmented extraction method
- Use
--verboseflag to see extraction method and marker counts - Check that pdfplumber is installed for problematic PDFs:
pip install pdfplumber - Examine extracted text patterns if markers are missing
- Python 3.7+
- PyPDF2
- pdfplumber (optional, for fallback PDF reading)
- click
- python-dateutil
The tool uses a modular object-oriented design:
- ConfigLoader: Manages JSON configuration files
- ValueValidator: Validates extracted values against medical ranges
- PatternMatcher: Compiles and manages regex patterns
- TextProcessor: Cleans and processes PDF text
- DateExtractor: Extracts dates from text
- LabReportExtractor: Main extraction logic for different formats
- BloodTestExtractor: Orchestrates the entire extraction process