This system extracts structured payment data from contract documents and transforms it into different output formats. It consists of two main phases: extraction and transformation.
The system processes document chunks stored in data/headers/ and:
- Extracts payment schedule data using GPT-4o-mini
- Transforms the data into nested structure formats
- Install dependencies:
pip install -r requirements.txt- Set your OpenAI API key as an environment variable:
export OPENAI_API_KEY="your-api-key-here"Alternatively, create a .env file in the project root:
OPENAI_API_KEY=your-api-key-here
Advantages:
- Faster (7 API calls)
- More cost-effective
- Proven for known document formats
Limitations:
- Would miss new service categories not in hardcoded patterns
- Would miss new plan types not explicitly programmed
- Requires code changes for new document types
Usage:
python extract_payment_schedules.pyPerformance: ~72 seconds (7 API calls)
Advantages:
- Automatically detects new service categories and plan types
- More flexible for varying document formats
- No code changes needed for new document types
- More precise terminology detection
Trade-offs:
- More API calls (18 vs 7)
- Slightly higher cost and time
Usage:
python extract_payment_schedules_v2.pyPerformance: ~81 seconds (18 API calls)
- Test the setup (recommended):
python test_extraction.py- Choose your extraction method:
# Option A: Original (fast, hardcoded patterns)
python extract_payment_schedules.py
# Option B: LLM-based (flexible, dynamic detection)
python extract_payment_schedules_v2.pydata/output/payment_schedules_v2.json- Original extraction (flat structure)data/output/payment_schedules_v2_llm.json- LLM-based extraction (flat structure)
The system extracts the following payment fields:
- Lesser of Logic Language included (Y/N)
- Lesser of Rate
- Reimb Methodology
- Reimb Methodology short
- Flat Fee
- Provider Type, IP/OP, Service Type, Plan Type (hierarchy fields)
Transform the extracted flat data into nested hierarchical structures that match the target format.
Advantages:
- Fast and deterministic (~0.047 seconds)
- No external API dependencies
- Consistent results
- Cost-effective
Usage:
# Transform original extraction
python transform_payment_data.py
# Or modify the script to transform LLM-based extraction:
# Change input file from payment_schedules_v2.json to payment_schedules_v2_llm.jsonPerformance: ~47 milliseconds (1,700x faster than extraction!)
data/output/payment_schedules_nested.json- Nested structure format
# 1. Extract payment data (~72 seconds, 7 API calls)
python extract_payment_schedules.py
# 2. Transform to nested structure (~0.047 seconds)
python transform_payment_data.py# 1. Extract payment data (~81 seconds, 18 API calls)
python extract_payment_schedules_v2.py
# 2. Transform to nested structure (~0.047 seconds)
# Note: Modify transform_payment_data.py input file to payment_schedules_v2_llm.json
python transform_payment_data.pyContract Documents (markdown)
↓ (extract_payment_schedules.py OR extract_payment_schedules_v2.py)
payment_schedules_v2.json OR payment_schedules_v2_llm.json (flat structure)
↓ (transform_payment_data.py)
payment_schedules_nested.json (nested structure)
Flat Structure (payment_schedules_v2.json):
{
"hierarchy": {
"line_of_business": "MEDICARE",
"provider_type": "Professional Services",
"ip_op": "OP",
"service_type": "Therapy Services",
"plan_type": "MA PLAN"
},
"payment_fields": [...]
}Nested Structure (payment_schedules_nested.json):
{
"sections": [{
"section_name": "Main Contract Details",
"subsections": [{
"section_name": "Medicare",
"subsections": [{
"section_name": "Professional Services",
"subsections": [{
"section_name": "OP",
"subsections": [{
"section_name": "Therapy Services",
"subsections": [{
"section_name": "MA PLAN",
"extracted_fields": [...]
}]
}]
}]
}]
}]
}]
}Defines the data types and structure:
ExtractedData: Base fields for each data pointExtractedField: Extends ExtractedData with history and subfieldsFinalOutputIntermediateRepresentation: Top-level structure
Shows the desired nested output format with sample data.
/
├── data/
│ ├── headers/ # Input: Pre-chunked documents
│ ├── templates/
│ │ └── foir.yml # Schema definition
│ ├── target_output_nested.json # Example output format
│ └── output/ # Generated results
│ ├── payment_schedules_v2.json # Original extraction
│ ├── payment_schedules_v2_llm.json # LLM-based extraction
│ └── payment_schedules_nested.json # Nested (transformed)
├── extract_payment_schedules.py # Original extraction (hardcoded)
├── extract_payment_schedules_v2.py # LLM-based extraction (dynamic)
├── transform_payment_data.py # Programmatic transformation
├── test_extraction.py # Test setup
├── requirements.txt
└── README.md
| Method | Time | API Calls | Flexibility | Cost |
|---|---|---|---|---|
| Original Extraction | ~72 seconds | 7 | ❌ Hardcoded | $ |
| LLM-Based Extraction | ~81 seconds | 18 | ✅ Dynamic | $$ |
| Transformation | ~0.047 seconds | 0 | ✅ Fast | Free |
| Feature | Original | LLM-Based v2 |
|---|---|---|
| Service Detection | 5 hardcoded patterns | 7+ dynamically detected |
| New Categories | ❌ Would miss | ✅ Auto-detects |
| New Plan Types | ❌ Would miss | ✅ Auto-detects |
| API Calls | 7 | 18 |
| Schedules Created | 7 | 9 |
| Code Changes for New Docs | Required | Not needed |
- Document Processing: Reads pre-chunked contract documents
- Pattern Matching: Uses hardcoded regex patterns for service/plan detection
- LLM Analysis: Uses GPT-4o-mini only for payment field extraction
- Hierarchical Organization: Organizes by predefined categories
- Document Processing: Reads pre-chunked contract documents
- LLM Detection: Uses GPT-4o-mini for line of business, service categories, and plan types
- LLM Analysis: Uses GPT-4o-mini for payment field extraction
- Dynamic Organization: Organizes by LLM-detected categories
- Structure Mapping: Maps flat hierarchy to nested sections/subsections
- Field Transformation: Converts
payment_fieldstoextracted_fieldsformat - Data Preservation: Maintains all field values, citations, and rationales
Update the extraction fields in either extraction script
Edit hierarchy mapping in transform_payment_data.py
Modify the target schema in data/templates/foir.yml
- API Key Error: Make sure your OpenAI API key is set correctly
- Rate Limiting: Reduce API call frequency if needed
- Memory Issues: Process fewer chunks at a time for large documents
- Slow Performance: Use original extraction for speed, LLM-based for flexibility
- Missing Categories: Use LLM-based extraction for new document types
- Use original extraction for known document formats (faster, cheaper)
- Use LLM-based extraction for new/varying document types (more flexible)
- Always use programmatic transformation for speed (1,700x faster than LLM)
- Test extraction first with
test_extraction.pybefore full runs - Monitor API costs during extraction phase
- Document formats are stable and known
- Service categories don't change frequently
- Speed and cost are priorities
- You have a consistent document template
- Processing new document types
- Service categories may vary
- Plan types are not standardized
- You need maximum flexibility and accuracy