Automated document processing pipeline using Azure Document Intelligence for OCR, Azure Functions for orchestration, and Power Automate for workflow integration. Transforms manual data entry processes into automated workflows.
Many organizations still rely on manual data entry for processing forms, invoices, and handwritten documents. This pipeline automates the extraction of structured data from uploaded documents and routes it to downstream systems.
Before: 15 minutes per form, manual data entry, error-prone
After: Under 2 minutes per form, automated extraction, validation checks
┌─────────────────────────────────────────────────────────────────────────────┐
│ DOCUMENT INGESTION │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ SharePoint │ │ Email │ │ Manual │ │
│ │ Library │ │ Inbox │ │ Upload │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └──────────────────┴──────────────────┘ │
│ │ │
└─────────────────────────────┼───────────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ AZURE BLOB STORAGE │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Container: documents-incoming │ │
│ │ │ │
│ │ /2024/01/15/form-001.pdf │ │
│ │ /2024/01/15/form-002.jpg │ │
│ │ /2024/01/16/form-003.pdf │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ Blob Trigger │
│ ▼ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ AZURE FUNCTIONS │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ ProcessDocument (Blob Trigger) │ │
│ │ │ │
│ │ 1. Validate file type (PDF, JPG, PNG, TIFF) │ │
│ │ 2. Check file size limits │ │
│ │ 3. Call Document Intelligence API │ │
│ │ 4. Parse extracted fields │ │
│ │ 5. Validate required fields present │ │
│ │ 6. Write results to output container │ │
│ │ 7. Send notification (success/failure) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Azure Document Intelligence │ │
│ │ │ │
│ │ - Prebuilt models (invoice, receipt, ID, business card) │ │
│ │ - Custom models (trained on org-specific forms) │ │
│ │ - Handwriting recognition │ │
│ │ - Table extraction │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
└─────────────────────────────┼───────────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ OUTPUT LAYER │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Blob │ │ CosmosDB │ │ Power │ │
│ │ Storage │ │ (results) │ │ Automate │ │
│ │ (JSON out) │ │ │ │ (workflow) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ │ │ ▼ │
│ │ │ ┌─────────────┐ │
│ │ │ │ Downstream │ │
│ │ │ │ Systems │ │
│ │ │ │ (SharePoint,│ │
│ │ │ │ Salesforce,│ │
│ │ │ │ Email) │ │
│ │ │ └─────────────┘ │
│ │ │ │
└─────────┴──────────────────┴───────────────────────────────────────────────┘
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Upload │──▶│ Validate │──▶│ OCR │──▶│ Parse │──▶│ Output │
│ │ │ │ │ │ │ │ │ │
└──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Invalid │ │ API │ │ Missing │ │ Success │
│ File │ │ Error │ │ Fields │ │ or │
│ Type │ │ │ │ │ │ Failure │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
│ │ │ │
└──────────────┴──────────────┴──────────────┘
│
▼
┌─────────────┐
│ Error │
│ Queue │
│ (retry/ │
│ review) │
└─────────────┘
Document Ingestion
- Multiple input sources: SharePoint, email attachments, direct upload
- Supported formats: PDF, JPEG, PNG, TIFF, BMP
- Automatic file validation and size checks
OCR and Extraction
- Azure Document Intelligence for high-accuracy OCR
- Prebuilt models for common document types
- Custom model support for organization-specific forms
- Handwriting recognition
- Table and key-value pair extraction
Processing Logic
- Field validation (required fields, format checks)
- Confidence score thresholds
- Error handling and retry logic
- Dead-letter queue for manual review
Output and Integration
- Structured JSON output
- Power Automate triggers for downstream workflows
- Direct integration with SharePoint, Salesforce, databases
| Component | Technology |
|---|---|
| Orchestration | Azure Functions (Node.js/TypeScript) |
| OCR Engine | Azure Document Intelligence |
| Storage | Azure Blob Storage |
| Results Store | Azure CosmosDB (optional) |
| Workflow | Power Automate |
| Monitoring | Application Insights |
| Infrastructure | Bicep / ARM Templates |
azure-document-pipeline/
├── functions/
│ ├── src/
│ │ ├── ProcessDocument/
│ │ │ ├── index.ts # Blob trigger function
│ │ │ └── function.json
│ │ ├── RetryFailedDocuments/
│ │ │ ├── index.ts # Timer trigger for retries
│ │ │ └── function.json
│ │ └── lib/
│ │ ├── documentIntelligence.ts
│ │ ├── validation.ts
│ │ ├── parser.ts
│ │ └── notifications.ts
│ ├── host.json
│ ├── local.settings.json
│ └── package.json
├── infrastructure/
│ ├── main.bicep
│ ├── parameters.dev.json
│ └── parameters.prod.json
├── power-automate/
│ └── document-processed-flow.zip # Exported flow
├── .github/
│ └── workflows/
│ └── deploy.yml
├── docs/
│ ├── custom-model-training.md
│ └── troubleshooting.md
└── README.md
Environment Variables
DOCUMENT_INTELLIGENCE_ENDPOINT=https://<resource>.cognitiveservices.azure.com/
DOCUMENT_INTELLIGENCE_KEY=<key>
STORAGE_CONNECTION_STRING=<connection-string>
COSMOS_CONNECTION_STRING=<connection-string> # Optional
NOTIFICATION_WEBHOOK_URL=<power-automate-http-trigger>
CONFIDENCE_THRESHOLD=0.8
Prebuilt Models
| Model | Use Case |
|---|---|
| prebuilt-invoice | Vendor invoices |
| prebuilt-receipt | Expense receipts |
| prebuilt-idDocument | ID cards, driver's licenses |
| prebuilt-businessCard | Contact cards |
| prebuilt-document | General documents |
Custom Models For organization-specific forms, train a custom model:
- Collect 5-10 sample documents
- Label fields in Document Intelligence Studio
- Train the model
- Deploy and reference by model ID
See docs/custom-model-training.md for detailed instructions.
Documents that fail processing are moved to an error queue for review:
| Error Type | Handling |
|---|---|
| Invalid file type | Reject, notify uploader |
| File too large | Reject, notify uploader |
| OCR confidence low | Queue for manual review |
| Missing required fields | Queue for manual review |
| API failure | Retry 3x with exponential backoff |
| Unknown error | Dead-letter queue, alert ops team |
Application Insights tracks:
- Documents processed per day
- Average processing time
- Success/failure rates
- Confidence score distribution
- API latency to Document Intelligence
Alert Rules
- Failure rate > 10% in 1 hour
- Processing time > 30 seconds average
- Dead-letter queue depth > 50
Production deployment processing 200+ documents monthly:
| Metric | Before | After |
|---|---|---|
| Time per document | 15 min | < 2 min |
| Error rate | ~5% (typos) | < 1% |
| Staff hours/month | 50+ hours | 5 hours (review only) |
Prerequisites
- Node.js 18+
- Azure Functions Core Tools v4
- Azure Storage Emulator or Azurite
- Document Intelligence resource (no local emulator available)
Setup
cd functions
npm install
cp local.settings.json.example local.settings.json
# Edit local.settings.json with your keys
# Start the function
func startTesting
Upload a file to the documents-incoming container in your storage account (or emulator). The function will trigger and process it.
Automated via GitHub Actions:
- Build and test Functions
- Deploy infrastructure (Bicep)
- Deploy Functions to Azure
- Update Power Automate connection (manual step)
- Azure React Dashboard - Could display processing stats
- GitHub Actions for Azure - CI/CD workflows