st0rm-bless3d/azure-document-pipeline

azure-document-pipeline

Automated document processing pipeline using Azure Document Intelligence for OCR, Azure Functions for orchestration, and Power Automate for workflow integration. Transforms manual data entry processes into automated workflows.

Overview

Many organizations still rely on manual data entry for processing forms, invoices, and handwritten documents. This pipeline automates the extraction of structured data from uploaded documents and routes it to downstream systems.

Before: 15 minutes per form, manual data entry, error-prone
After: Under 2 minutes per form, automated extraction, validation checks

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           DOCUMENT INGESTION                                │
│                                                                             │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                    │
│   │  SharePoint │    │    Email    │    │   Manual    │                    │
│   │   Library   │    │   Inbox     │    │   Upload    │                    │
│   └──────┬──────┘    └──────┬──────┘    └──────┬──────┘                    │
│          │                  │                  │                            │
│          └──────────────────┴──────────────────┘                            │
│                             │                                               │
└─────────────────────────────┼───────────────────────────────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          AZURE BLOB STORAGE                                 │
│                                                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  Container: documents-incoming                                      │   │
│   │                                                                     │   │
│   │  /2024/01/15/form-001.pdf                                           │   │
│   │  /2024/01/15/form-002.jpg                                           │   │
│   │  /2024/01/16/form-003.pdf                                           │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                             │                                               │
│                             │ Blob Trigger                                  │
│                             ▼                                               │
└─────────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         AZURE FUNCTIONS                                     │
│                                                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  ProcessDocument (Blob Trigger)                                     │   │
│   │                                                                     │   │
│   │  1. Validate file type (PDF, JPG, PNG, TIFF)                        │   │
│   │  2. Check file size limits                                          │   │
│   │  3. Call Document Intelligence API                                  │   │
│   │  4. Parse extracted fields                                          │   │
│   │  5. Validate required fields present                                │   │
│   │  6. Write results to output container                               │   │
│   │  7. Send notification (success/failure)                             │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                             │                                               │
│                             ▼                                               │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  Azure Document Intelligence                                        │   │
│   │                                                                     │   │
│   │  - Prebuilt models (invoice, receipt, ID, business card)            │   │
│   │  - Custom models (trained on org-specific forms)                    │   │
│   │  - Handwriting recognition                                          │   │
│   │  - Table extraction                                                 │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                             │                                               │
└─────────────────────────────┼───────────────────────────────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          OUTPUT LAYER                                       │
│                                                                             │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                    │
│   │    Blob     │    │   CosmosDB  │    │    Power    │                    │
│   │   Storage   │    │  (results)  │    │   Automate  │                    │
│   │  (JSON out) │    │             │    │  (workflow) │                    │
│   └─────────────┘    └─────────────┘    └─────────────┘                    │
│         │                  │                  │                            │
│         │                  │                  ▼                            │
│         │                  │         ┌─────────────┐                       │
│         │                  │         │ Downstream  │                       │
│         │                  │         │  Systems    │                       │
│         │                  │         │ (SharePoint,│                       │
│         │                  │         │  Salesforce,│                       │
│         │                  │         │  Email)     │                       │
│         │                  │         └─────────────┘                       │
│         │                  │                                               │
└─────────┴──────────────────┴───────────────────────────────────────────────┘
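The sample paths in the diagram suggest that incoming blobs are grouped by upload date. A minimal sketch of building such a name follows. The helper name `buildBlobName`, the use of UTC, and the omission of a leading slash are assumptions, not something this repository confirms.

```typescript
// Hypothetical helper: builds a date-partitioned blob name such as
// "2024/01/15/form-001.pdf". Using UTC for the date parts is an assumption.
function pad2(value: number): string {
  return value < 10 ? "0" + value : String(value);
}

function buildBlobName(uploadedAt: Date, fileName: string): string {
  const year = uploadedAt.getUTCFullYear();
  const month = pad2(uploadedAt.getUTCMonth() + 1); // getUTCMonth() is zero-based
  const day = pad2(uploadedAt.getUTCDate());
  return `${year}/${month}/${day}/${fileName}`;
}
```

Grouping by date keeps listings for a single day cheap and makes it easier to apply retention rules per prefix.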

Processing Flow

┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│  Upload  │──▶│ Validate │──▶│   OCR    │──▶│  Parse   │──▶│  Output  │
│          │   │          │   │          │   │          │   │          │
└──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘
                    │              │              │              │
                    ▼              ▼              ▼              ▼
               ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
               │ Invalid │   │  API    │   │ Missing │   │ Success │
               │  File   │   │  Error  │   │ Fields  │   │   or    │
               │  Type   │   │         │   │         │   │ Failure │
               └─────────┘   └─────────┘   └─────────┘   └─────────┘
                    │              │              │              │
                    └──────────────┴──────────────┴──────────────┘
                                        │
                                        ▼
                                 ┌─────────────┐
                                 │   Error     │
                                 │   Queue     │
                                 │  (retry/    │
                                 │   review)   │
                                 └─────────────┘
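The flow above can be sketched as a list of stages that run in order, with the first failure routed to an error queue. The stage names come from the diagram. The `runPipeline` function, the `StageFailure` shape, and the in-memory array standing in for the queue are illustrative assumptions.

```typescript
// Stage names follow the Processing Flow diagram; everything else is illustrative.
type Stage = "validate" | "ocr" | "parse" | "output";

interface StageFailure {
  stage: Stage;
  reason: string;
}

// A stage signals failure by throwing an Error.
type StageFn = (documentName: string) => void;

// Runs the stages in order. The first stage that throws stops the run, and
// its failure is pushed to errorQueue (an in-memory stand-in for a real queue).
function runPipeline(
  documentName: string,
  stages: Array<{ stage: Stage; run: StageFn }>,
  errorQueue: StageFailure[]
): boolean {
  for (let i = 0; i < stages.length; i++) {
    try {
      stages[i].run(documentName);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      errorQueue.push({ stage: stages[i].stage, reason: reason });
      return false;
    }
  }
  return true;
}
```

Recording the failing stage alongside the reason lets a reviewer tell an invalid file apart from an API error without re-running the document.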

Features

Document Ingestion

  • Multiple input sources: SharePoint, email attachments, direct upload
  • Supported formats: PDF, JPEG, PNG, TIFF, BMP
  • Automatic file validation and size checks
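A minimal sketch of the file validation and size check, assuming validation is done on the file extension. The extension list mirrors the supported formats above, with the `jpg` and `tif` spellings added as an assumption. The 50 MB limit is a placeholder for illustration only, not a value taken from this repository.

```typescript
// Extensions based on the "Supported formats" bullet; "jpg" and "tif" are
// assumed aliases for JPEG and TIFF.
const SUPPORTED_EXTENSIONS = ["pdf", "jpg", "jpeg", "png", "tif", "tiff", "bmp"];

// Assumed limit for illustration only; check the limits of your
// Document Intelligence tier before relying on a value.
const MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024;

interface ValidationResult {
  valid: boolean;
  reason?: string;
}

function validateUpload(fileName: string, sizeBytes: number): ValidationResult {
  const dot = fileName.lastIndexOf(".");
  const extension = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  if (SUPPORTED_EXTENSIONS.indexOf(extension) === -1) {
    return { valid: false, reason: "Invalid file type" };
  }
  if (sizeBytes <= 0) {
    return { valid: false, reason: "File is empty" };
  }
  if (sizeBytes > MAX_FILE_SIZE_BYTES) {
    return { valid: false, reason: "File too large" };
  }
  return { valid: true };
}
```

Checking the extension is cheap but not conclusive. Inspecting the file's leading bytes would be a stricter alternative.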

OCR and Extraction

  • Azure Document Intelligence for high-accuracy OCR
  • Prebuilt models for common document types
  • Custom model support for organization-specific forms
  • Handwriting recognition
  • Table and key-value pair extraction

Processing Logic

  • Field validation (required fields, format checks)
  • Confidence score thresholds
  • Error handling and retry logic
  • Dead-letter queue for manual review
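The required-field and confidence checks could look roughly like the sketch below. `ExtractedField` is a simplified stand-in for the field objects returned by Document Intelligence, and `checkFields` is a hypothetical helper. Treating a field without a confidence score as confidence 0 is also an assumption.

```typescript
// Simplified stand-in for an extracted field; real SDK types carry more detail.
interface ExtractedField {
  content?: string;
  confidence?: number;
}

interface FieldCheck {
  missing: string[];       // required fields that are absent or empty
  lowConfidence: string[]; // present, but below the threshold
}

function checkFields(
  fields: { [fieldName: string]: ExtractedField | undefined },
  requiredFields: string[],
  threshold: number
): FieldCheck {
  const result: FieldCheck = { missing: [], lowConfidence: [] };
  requiredFields.forEach((fieldName) => {
    const field = fields[fieldName];
    if (!field || !field.content || field.content.trim() === "") {
      result.missing.push(fieldName);
      return;
    }
    // Assumption: a field with no confidence score is treated as confidence 0.
    const confidence = field.confidence === undefined ? 0 : field.confidence;
    if (confidence < threshold) {
      result.lowConfidence.push(fieldName);
    }
  });
  return result;
}
```

A document with any entry in `missing` or `lowConfidence` would then be queued for manual review rather than written to the output container.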

Output and Integration

  • Structured JSON output
  • Power Automate triggers for downstream workflows
  • Direct integration with SharePoint, Salesforce, databases

Tech Stack

| Component | Technology |
| --- | --- |
| Orchestration | Azure Functions (Node.js/TypeScript) |
| OCR Engine | Azure Document Intelligence |
| Storage | Azure Blob Storage |
| Results Store | Azure CosmosDB (optional) |
| Workflow | Power Automate |
| Monitoring | Application Insights |
| Infrastructure | Bicep / ARM Templates |

Project Structure

azure-document-pipeline/
├── functions/
│   ├── src/
│   │   ├── ProcessDocument/
│   │   │   ├── index.ts          # Blob trigger function
│   │   │   └── function.json
│   │   ├── RetryFailedDocuments/
│   │   │   ├── index.ts          # Timer trigger for retries
│   │   │   └── function.json
│   │   └── lib/
│   │       ├── documentIntelligence.ts
│   │       ├── validation.ts
│   │       ├── parser.ts
│   │       └── notifications.ts
│   ├── host.json
│   ├── local.settings.json
│   └── package.json
├── infrastructure/
│   ├── main.bicep
│   ├── parameters.dev.json
│   └── parameters.prod.json
├── power-automate/
│   └── document-processed-flow.zip   # Exported flow
├── .github/
│   └── workflows/
│       └── deploy.yml
├── docs/
│   ├── custom-model-training.md
│   └── troubleshooting.md
└── README.md

Configuration

Environment Variables

DOCUMENT_INTELLIGENCE_ENDPOINT=https://<resource>.cognitiveservices.azure.com/
DOCUMENT_INTELLIGENCE_KEY=<key>
STORAGE_CONNECTION_STRING=<connection-string>
COSMOS_CONNECTION_STRING=<connection-string>  # Optional
NOTIFICATION_WEBHOOK_URL=<power-automate-http-trigger>
CONFIDENCE_THRESHOLD=0.8
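A minimal sketch of reading these settings at startup and failing fast when one is missing. The `PipelineConfig` shape and `loadConfig` function are hypothetical. Treating every variable except `COSMOS_CONNECTION_STRING` as required, and defaulting the threshold to 0.8 when unset, are assumptions.

```typescript
interface PipelineConfig {
  documentIntelligenceEndpoint: string;
  documentIntelligenceKey: string;
  storageConnectionString: string;
  cosmosConnectionString?: string; // optional, per the list above
  notificationWebhookUrl: string;
  confidenceThreshold: number;
}

type Env = { [key: string]: string | undefined };

function loadConfig(env: Env): PipelineConfig {
  function required(variableName: string): string {
    const value = env[variableName];
    if (!value) {
      throw new Error("Missing environment variable: " + variableName);
    }
    return value;
  }

  // Assumption: fall back to the sample value 0.8 when the variable is unset.
  const threshold = Number(env["CONFIDENCE_THRESHOLD"] || "0.8");
  if (isNaN(threshold) || threshold < 0 || threshold > 1) {
    throw new Error("CONFIDENCE_THRESHOLD must be a number between 0 and 1");
  }

  return {
    documentIntelligenceEndpoint: required("DOCUMENT_INTELLIGENCE_ENDPOINT"),
    documentIntelligenceKey: required("DOCUMENT_INTELLIGENCE_KEY"),
    storageConnectionString: required("STORAGE_CONNECTION_STRING"),
    cosmosConnectionString: env["COSMOS_CONNECTION_STRING"],
    notificationWebhookUrl: required("NOTIFICATION_WEBHOOK_URL"),
    confidenceThreshold: threshold,
  };
}
```

In a Functions app this would typically be called once as `loadConfig(process.env)`.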

Document Intelligence Models

Prebuilt Models

| Model | Use Case |
| --- | --- |
| prebuilt-invoice | Vendor invoices |
| prebuilt-receipt | Expense receipts |
| prebuilt-idDocument | ID cards, driver's licenses |
| prebuilt-businessCard | Contact cards |
| prebuilt-document | General documents |

Custom Models

For organization-specific forms, train a custom model:

  1. Collect 5-10 sample documents
  2. Label fields in Document Intelligence Studio
  3. Train the model
  4. Deploy and reference by model ID

See docs/custom-model-training.md for detailed instructions.
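Choosing which model to call can be reduced to a lookup with an optional custom override, as in the sketch below. The model IDs are the prebuilt ones listed above. The category names and the `resolveModelId` helper are hypothetical.

```typescript
// Category names are hypothetical; model IDs are the prebuilt ones listed above.
type DocumentCategory = "invoice" | "receipt" | "id" | "businessCard" | "general";

const MODEL_BY_CATEGORY: { [K in DocumentCategory]: string } = {
  invoice: "prebuilt-invoice",
  receipt: "prebuilt-receipt",
  id: "prebuilt-idDocument",
  businessCard: "prebuilt-businessCard",
  general: "prebuilt-document",
};

// A trained custom model, when configured, takes precedence over the prebuilt one.
function resolveModelId(category: DocumentCategory, customModelId?: string): string {
  return customModelId ? customModelId : MODEL_BY_CATEGORY[category];
}
```

Availability of individual prebuilt models depends on the Document Intelligence API version in use, so the table is worth checking against the version you deploy.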

Error Handling

Documents that fail processing are handled according to the type of failure. Some are rejected outright, some are retried, and the rest are queued for review:

| Error Type | Handling |
| --- | --- |
| Invalid file type | Reject, notify uploader |
| File too large | Reject, notify uploader |
| OCR confidence low | Queue for manual review |
| Missing required fields | Queue for manual review |
| API failure | Retry 3x with exponential backoff |
| Unknown error | Dead-letter queue, alert ops team |
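The "retry 3x with exponential backoff" policy for API failures can be sketched as follows. This reads "3x" as three retries after the initial attempt, which is an assumption, as are the 1-second base delay and the `withRetry` name. The sleep function is injectable so the logic can be tested without waiting.

```typescript
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });

// Calls operation once, then retries up to maxRetries times on failure,
// waiting baseDelayMs, then 2x, then 4x (and so on) between attempts.
// If every attempt fails, the last error is rethrown to the caller.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxRetries: number = 3,
  baseDelayMs: number = 1000,
  sleep: Sleep = defaultSleep
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) {
        await sleep(baseDelayMs * Math.pow(2, attempt));
      }
    }
  }
  throw lastError;
}
```

With the defaults, the waits are 1 s, 2 s, and 4 s. Adding random jitter to each delay is a common refinement when many documents fail at once.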

Monitoring

Application Insights tracks:

  • Documents processed per day
  • Average processing time
  • Success/failure rates
  • Confidence score distribution
  • API latency to Document Intelligence

Alert Rules

  • Failure rate > 10% in 1 hour
  • Processing time > 30 seconds average
  • Dead-letter queue depth > 50
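The three thresholds can be expressed as a small pure function, which is handy for unit-testing the rules before encoding them as alert rules in Azure. This is only an illustration of the thresholds listed above. The `WindowMetrics` shape and the alert names are hypothetical, and a real deployment would normally define these rules in Azure Monitor rather than in application code.

```typescript
// Metrics aggregated over the evaluation window (one hour for the failure rate).
interface WindowMetrics {
  processed: number;
  failed: number;
  averageProcessingSeconds: number;
  deadLetterDepth: number;
}

// Returns the names of the alert rules that the given metrics would trip.
function evaluateAlerts(metrics: WindowMetrics): string[] {
  const alerts: string[] = [];
  const failureRate = metrics.processed > 0 ? metrics.failed / metrics.processed : 0;
  if (failureRate > 0.1) alerts.push("failure-rate");
  if (metrics.averageProcessingSeconds > 30) alerts.push("processing-time");
  if (metrics.deadLetterDepth > 50) alerts.push("dead-letter-depth");
  return alerts;
}
```

All three comparisons are strict, matching the ">" in the rules, so a value exactly at the threshold does not raise an alert.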

Results

In production, the pipeline processes 200+ documents per month:

| Metric | Before | After |
| --- | --- | --- |
| Time per document | 15 min | < 2 min |
| Error rate | ~5% (typos) | < 1% |
| Staff hours/month | 50+ hours | 5 hours (review only) |

Local Development

Prerequisites

  • Node.js 18+
  • Azure Functions Core Tools v4
  • Azurite (or the legacy Azure Storage Emulator, which is deprecated)
  • Document Intelligence resource (no local emulator available)

Setup

cd functions
npm install
cp local.settings.json.example local.settings.json
# Edit local.settings.json with your keys

# Start the function
func start

Testing

Upload a file to the documents-incoming container in your storage account (or emulator). The function will trigger and process it.

Deployment

Automated via GitHub Actions:

  1. Build and test Functions
  2. Deploy infrastructure (Bicep)
  3. Deploy Functions to Azure
  4. Update Power Automate connection (manual step)
