A document processing and analytics platform built on Microsoft Azure. It demonstrates a microservices architecture, AI/ML with custom model fine-tuning, intelligent document routing, real-time analytics, database migration tools, and enterprise integrations.
Production Ready: Implemented with Azure integration, error handling, SQL persistence, real-time monitoring, Docker containerization, an intelligent routing system, and resilience patterns (circuit breaker, retry, rate limiting).
- Key Features
- Architecture
- Microservices
- Technology Stack
- Getting Started
- Documentation
- Project Structure
- Contributing
- License
- Azure OpenAI Integration: GPT-4/GPT-4o for document understanding and analysis
- Azure Form Recognizer: Advanced OCR, layout analysis, and custom model training
- Custom ML Models: Support for Hugging Face models (BERT, BART, DistilBERT, RoBERTa)
- Intelligent Document Routing: Automatic selection of optimal processing mode based on complexity analysis
- Azure OpenAI Fine-Tuning: Industry-specific model customization workflows
- Automated Training Pipelines: End-to-end workflow orchestration with quality assessment
- Real-Time Monitoring: WebSocket-based training dashboards with live metrics
- Model Evaluation: Comprehensive accuracy, precision, recall, and F1-score tracking
- LLMOps Automation: Enhanced model lifecycle management with automation metrics
- Invoice Processing Chain: Automated upload → extract → validate → classify → store workflow
- Document Analysis Chain: Intelligent document summarization and entity extraction
- Fine-Tuning Workflow Chain: Automated data preparation, training, and evaluation
- Multi-Agent Workflow: Coordinated processing with specialized agents
- RAG-Based Q&A: Retrieval-augmented generation for document queries
- Conversation History: Multi-turn dialogue management
- Document Context: Semantic search with Azure Cognitive Search
- Streaming Responses: Real-time answer generation
- 14 Independent Microservices: Fully containerized with Docker
- Event Sourcing: Complete audit trail with domain events
- CQRS Pattern: Separate read/write models for optimal performance
- Service Mesh: API Gateway for centralized routing and authentication
- Circuit Breaker Pattern: Prevents cascading failures across services
- Retry Logic: Exponential backoff for transient failure handling
- Connection Pooling: Optimized database connections with configurable limits
- HTTP/2 Support: Enhanced connection efficiency
- Rate Limiting: Token bucket algorithm for API quota management
- Health Checks: Kubernetes-ready liveness and readiness probes
- Real-Time Dashboards: WebSocket-based performance monitoring
- Prometheus Integration: Time-series metrics collection
- Centralized Logging: Azure Monitor and Application Insights
- Performance Tracking: Request latency, throughput, error rates
- Azure Key Vault: Centralized secret management with DefaultAzureCredential
- JWT Authentication: Secure API access with role-based authorization (RBAC)
- Encryption: TLS 1.3 for data in transit, AES-256 for data at rest
- Data Lineage: Complete audit trail and relationship mapping
- Event Sourcing: Immutable event log for compliance
- AI-Native Tools: Expose platform capabilities to Claude, ChatGPT, and other AI agents
- MCP 0.9.1 Protocol: Standard compliance with REST endpoints
- Tool Library: Invoice extraction, validation, classification, automation metrics
- Resource Access: Documents, analytics, automation scores, fine-tuning jobs
- Teradata Migration: Schema and data migration with teradatasql driver
- Netezza Migration: Data warehouse migration support
- Oracle Migration: Database and stored procedure conversion with cx_Oracle
- Schema Converter: Automated DDL translation across database platforms
- OneLake: Unified data lake with Delta Lake support
- Fabric Data Warehouse: Serverless SQL pools with auto-scaling
- Real-Time Intelligence: KQL queries and stream processing
- Data Warehouse: Advanced analytics capabilities
- Outlook Connector: Email processing automation
- Teams Bot: Document collaboration and chat
- SharePoint Sync: Document library integration
- OneDrive: Personal file management
┌─────────────────────────────────────────────────────────────────────┐
│ DOCUMENT INTELLIGENCE PLATFORM │
│ (Production-Ready) │
├─────────────────────────────────────────────────────────────────────┤
│ Client Layer │
│ ├── Web Dashboard (FastAPI + Jinja2) │
│ ├── API Service (REST) │
│ └── WebSocket Connections (Real-time updates) │
├─────────────────────────────────────────────────────────────────────┤
│ Gateway & Load Balancing │
│ ├── Nginx (Rate Limiting, SSL/TLS Termination) │
│ └── API Gateway (Authentication, Routing, Rate Limiting) │
├─────────────────────────────────────────────────────────────────────┤
│ Microservices Layer (14 Services) │
│ ├── Document Ingestion (8000) - Upload, validation │
│ ├── AI Processing (8001) - AI/ML, fine-tuning, LangChain │
│ ├── Analytics (8002) - Metrics, automation scoring │
│ ├── API Gateway (8003) - Auth, routing, validation │
│ ├── AI Chat (8004) - RAG-based Q&A │
│ ├── Batch Processor (8005) - Bulk operations, ETL │
│ ├── Data Quality (8006) - Validation, quality scoring │
│ ├── Data Catalog (8007) - Metadata, lineage tracking │
│ ├── Performance Dashboard (8008) - Real-time monitoring │
│ ├── Migration Service - Database migration tools │
│ ├── MCP Server (8012) - Model Context Protocol │
│ ├── LLM Optimization - Prompt engineering │
│ ├── M365 Integration - Microsoft 365 connectors │
│ └── Experimentation - A/B testing framework │
├─────────────────────────────────────────────────────────────────────┤
│ AI/ML Layer │
│ ├── Azure OpenAI (GPT-4o, GPT-4o-mini, Embeddings) │
│ ├── Azure Form Recognizer (Layout, Custom Models) │
│ ├── Hugging Face Models (BERT, BART, DistilBERT, RoBERTa) │
│ └── Azure Cognitive Search (Vector, Semantic, Hybrid) │
├─────────────────────────────────────────────────────────────────────┤
│ Shared Infrastructure │
│ ├── Intelligent Router - Complexity-based routing │
│ ├── HTTP Client Pool - Connection pooling with HTTP/2 │
│ ├── Circuit Breaker - Fault tolerance │
│ ├── Retry Logic - Exponential backoff │
│ ├── Rate Limiter - Token bucket algorithm │
│ ├── Health Monitor - Kubernetes probes │
│ └── Caching Layer - Redis for hot data │
├─────────────────────────────────────────────────────────────────────┤
│ Event & Messaging │
│ ├── Azure Event Hubs (Event streaming) │
│ ├── Azure Service Bus (Message queuing) │
│ ├── Event Sourcing (Domain events, audit trail) │
│ └── Redis (Caching, Pub/Sub) │
├─────────────────────────────────────────────────────────────────────┤
│ Data Layer │
│ ├── Azure SQL Database (Primary storage, connection pooling) │
│ ├── Azure Blob Storage (Document storage, tiered) │
│ ├── Azure Data Lake Gen2 (Analytics, raw/curated zones) │
│ └── Microsoft Fabric (OneLake, Data Warehouse) │
├─────────────────────────────────────────────────────────────────────┤
│ Monitoring & Security │
│ ├── Azure Key Vault (Secret management) │
│ ├── Azure Monitor (Application Insights, Log Analytics) │
│ ├── Prometheus (Metrics collection) │
│ └── RBAC & Azure AD (Authentication, authorization) │
└─────────────────────────────────────────────────────────────────────┘
- Microservices: 14 independent, scalable services with clear boundaries
- Event-Driven: Asynchronous communication via Azure Event Hubs and Service Bus
- Event Sourcing: Complete audit trail with immutable domain events
- CQRS: Separate read/write models for optimal performance
- API Gateway: Centralized authentication, rate limiting, and routing
- Circuit Breaker: Fault tolerance with automatic failure detection
- Connection Pooling: Optimized database connections with configurable limits
- Caching Strategy: Redis for hot data, reducing database load
- Health Checks: Liveness and readiness probes for Kubernetes
- Retry Logic: Exponential backoff for transient failures
- Rate Limiting: Token bucket algorithm for API quota management
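The circuit breaker principle listed above can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: the class name `CircuitBreaker`, the thresholds, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # one trial call is allowed through
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Rejecting calls while the circuit is open is what keeps one failing dependency from tying up callers across services; the half-open state lets a single trial request decide whether to resume traffic.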
Port: 8000 | Tech: FastAPI, Azure Blob Storage
- Document upload and validation (PDF, DOCX, images)
- Metadata extraction and storage
- Azure Blob Storage integration (Hot/Cool/Archive tiers)
- Event publishing for processing pipeline
Port: 8001 | Tech: Azure OpenAI, Form Recognizer, LangChain
- Azure OpenAI integration (GPT-4o, embeddings, fine-tuning)
- Azure Form Recognizer (layout analysis, custom models)
- LangChain orchestration (invoice, document analysis workflows)
- Fine-tuning workflow orchestration
- ML model inference (BERT, BART, DistilBERT)
- Enhanced LLMOps with automation tracking
- Intelligent document routing integration
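To make complexity-based routing concrete, here is a rough sketch. The features (`page_count`, `table_count`, `has_handwriting`, `ocr_confidence`), the weights, the thresholds, and the mode names are all hypothetical; the platform's actual router may use different signals entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentProfile:
    """Hypothetical features a router might inspect."""
    page_count: int
    table_count: int
    has_handwriting: bool
    ocr_confidence: float  # 0.0 (poor scan) to 1.0 (clean text)


def complexity_score(doc: DocumentProfile) -> float:
    """Combine features into a score in [0, 1]; weights are illustrative."""
    score = min(doc.page_count, 20) / 20 * 0.3
    score += min(doc.table_count, 5) / 5 * 0.3
    score += 0.2 if doc.has_handwriting else 0.0
    score += (1.0 - doc.ocr_confidence) * 0.2
    return round(score, 3)


def choose_mode(doc: DocumentProfile) -> str:
    """Map the score to a processing mode (names are assumptions)."""
    score = complexity_score(doc)
    if score < 0.25:
        return "fast"       # e.g. template or layout extraction only
    if score < 0.6:
        return "standard"   # e.g. layout analysis plus a small model
    return "advanced"       # e.g. a full LLM pass


simple = DocumentProfile(page_count=1, table_count=0, has_handwriting=False, ocr_confidence=0.98)
scanned = DocumentProfile(page_count=40, table_count=8, has_handwriting=True, ocr_confidence=0.6)
print(choose_mode(simple), choose_mode(scanned))
```

The point of routing this way is cost control: cheap documents never reach the most expensive processing path.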
Port: 8002 | Tech: Pandas, Azure Synapse, Power BI
- Real-time metrics aggregation
- Automation scoring system
- Power BI integration
- Custom report generation
- Statistical analysis and trend detection
Port: 8004 | Tech: RAG, Vector Search, GPT-4o
- Conversational AI with RAG architecture
- Document Q&A with context retrieval
- Conversation history management
- Semantic search integration
- Streaming response generation
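The retrieval step of a RAG pipeline can be illustrated without any external service. The sketch below ranks toy three-dimensional vectors by cosine similarity and builds a prompt from the top matches. In the platform, embeddings would come from Azure OpenAI and the search from Azure Cognitive Search; the function names and the in-memory index here are assumptions for illustration only.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: Sequence[float],
    index: Dict[str, Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k chunks most similar to the query vector."""
    scored = [(chunk, cosine_similarity(query_vec, vec)) for chunk, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


def build_prompt(question: str, chunks: Sequence[str]) -> str:
    """Ground the model's answer in the retrieved chunks."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Toy index: real embeddings have hundreds or thousands of dimensions.
index = {
    "Invoice total is 120 EUR": [1.0, 0.0, 0.0],
    "Payment is due in 30 days": [0.0, 1.0, 0.0],
    "The vendor is Contoso": [0.0, 0.0, 1.0],
}
hits = retrieve([0.9, 0.1, 0.0], index)
prompt = build_prompt("What is the invoice total?", [chunk for chunk, _ in hits])
```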
Port: 8003 | Tech: FastAPI, JWT, Rate Limiting
- JWT-based authentication
- Role-based access control (RBAC)
- Request validation with Pydantic
- Service routing and load balancing
- Rate limiting (configurable per client)
- Circuit breaker and retry status endpoints
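The token bucket algorithm named for rate limiting works roughly as follows. This is a standalone sketch; the class name, parameters, and the injectable `clock` are assumptions, and the gateway's real limiter is likely backed by shared state (for example Redis) rather than process memory.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_rate` tokens per second."""

    def __init__(
        self,
        capacity: float,
        refill_rate: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)
        self._last = now

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; otherwise reject the request."""
        self._refill()
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A per-client limit, as the gateway describes, would keep one bucket per client identifier, for instance in a dictionary keyed by API key.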
Port: 8005 | Tech: Celery, Azure Data Factory
- Bulk document processing
- ETL pipeline orchestration
- Scheduled job execution
- Data transformation workflows
- Error handling and retry logic
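Retry with exponential backoff, as used for transient failures, might look like the sketch below. The helper names, the default delays, the choice of "full jitter", and the set of retryable exceptions are assumptions for illustration, not the project's actual retry module.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound on the wait before retry number `attempt` (0-based)."""
    return min(cap, base * 2 ** attempt)


def retry(
    func: Callable[[], T],
    max_attempts: int = 4,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying only on exceptions listed in `retry_on`."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time up to the exponential bound.
            sleep(random.uniform(0, backoff_delay(attempt)))
    raise AssertionError("unreachable")
```

Restricting `retry_on` to transient error types matters: retrying a validation error or an authentication failure only adds load without any chance of success.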
Port: 8006 | Tech: Great Expectations, Pandas
- Data validation rules engine
- Quality scoring and metrics
- Anomaly detection
- Data profiling and statistics
- Automated quality reports
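As a rough picture of rule-based quality scoring, the sketch below evaluates a few named rules over records and reports the pass rate per rule. The rule names and record fields are invented for the example; the service itself lists Great Expectations, which offers a far richer rule vocabulary than this plain-Python stand-in.

```python
from typing import Callable, Dict, Mapping, Sequence

Record = Mapping[str, object]
Rule = Callable[[Record], bool]

# Hypothetical validation rules for invoice records.
RULES: Dict[str, Rule] = {
    "has_invoice_number": lambda r: bool(r.get("invoice_number")),
    "amount_is_positive": lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] > 0,
    "currency_is_3_letters": lambda r: isinstance(r.get("currency"), str) and len(r["currency"]) == 3,
}


def quality_report(records: Sequence[Record]) -> Dict[str, float]:
    """Return the fraction of records passing each rule (0.0 for empty input)."""
    total = len(records)
    if total == 0:
        return {name: 0.0 for name in RULES}
    return {
        name: sum(1 for record in records if rule(record)) / total
        for name, rule in RULES.items()
    }


records = [
    {"invoice_number": "INV-001", "amount": 120.0, "currency": "EUR"},
    {"invoice_number": "", "amount": -5, "currency": "EURO"},
]
report = quality_report(records)
overall = sum(report.values()) / len(report)
```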
Port: 8007 | Tech: Apache Atlas concepts, NetworkX
- Metadata management and discovery
- Data lineage tracking (upstream/downstream)
- Relationship mapping and visualization
- Business glossary
- Data governance compliance
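Upstream and downstream lineage queries reduce to graph reachability. The service lists NetworkX for this; the sketch below uses only the standard library as a stand-in, and the class name and dataset names are hypothetical.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Iterable, Mapping, Set, Tuple


class LineageGraph:
    """Directed graph where an edge source -> target means 'target is derived from source'."""

    def __init__(self, edges: Iterable[Tuple[str, str]] = ()) -> None:
        self._down: DefaultDict[str, Set[str]] = defaultdict(set)
        self._up: DefaultDict[str, Set[str]] = defaultdict(set)
        for source, target in edges:
            self.add_edge(source, target)

    def add_edge(self, source: str, target: str) -> None:
        self._down[source].add(target)
        self._up[target].add(source)

    @staticmethod
    def _reach(start: str, adjacency: Mapping[str, Set[str]]) -> Set[str]:
        """Breadth-first search collecting every node reachable from start."""
        seen: Set[str] = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    def upstream(self, node: str) -> Set[str]:
        return self._reach(node, self._up)

    def downstream(self, node: str) -> Set[str]:
        return self._reach(node, self._down)


graph = LineageGraph([
    ("raw_blob", "extracted_invoice"),
    ("extracted_invoice", "validated_invoice"),
    ("validated_invoice", "analytics_table"),
])
```

With this in place, `graph.downstream("extracted_invoice")` answers the impact question (what breaks if this dataset changes) and `graph.upstream(...)` answers the provenance question.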
Port: 8008 | Tech: WebSocket, FastAPI, Prometheus
- Real-time metrics via WebSocket
- System health monitoring
- Prometheus metrics integration
- Alert management
- Performance percentiles (P50, P95, P99)
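For reference, latency percentiles such as P50, P95, and P99 can be computed with the nearest-rank method. This is one of several common definitions, chosen here for simplicity; the dashboard may well rely on Prometheus histograms instead, and the function name is an assumption.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


latencies_ms = [120, 80, 200, 95, 150]
summary = {f"P{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

Tail percentiles are reported alongside the median because a small share of slow requests can dominate user experience while barely moving the average.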
Tech: OpenAI, Prompt Engineering
- Advanced prompt templates
- Chain-of-thought reasoning
- Few-shot learning
- Prompt validation and security
- Reusable template library
Port: 8012 | Tech: FastAPI, MCP Protocol 0.9.1
- Model Context Protocol implementation
- AI-native tools for external agents
- Invoice extraction, validation, classification
- Automation metrics exposure
- Resource access (documents, analytics, jobs)
Tech: Microsoft Graph API
- Outlook email processing
- Teams bot integration
- SharePoint document sync
- OneDrive file management
- M365 Copilot extensions
Tech: Statsmodels, SciPy
- A/B testing framework
- Statistical significance testing
- Bayesian analysis
- Experiment tracking
- Traffic splitting
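Statistical significance testing for an A/B experiment can be illustrated with a two-proportion z-test. The service lists Statsmodels and SciPy; this standard-library sketch only shows the underlying arithmetic, and the function name and sample counts are made up for the example.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    success_a: int, total_a: int, success_b: int, total_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 0.0, 1.0  # no variation at all: nothing to distinguish
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: variant B extracts more invoices correctly.
z, p_value = two_proportion_z_test(success_a=100, total_a=1000, success_b=200, total_b=1000)
significant = p_value < 0.05
```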
Tech: pyodbc, teradatasql, cx_Oracle
- Teradata to Azure SQL migration
- Netezza data warehouse migration
- Oracle database migration
- Schema converter (automated DDL translation)
- Data validation and integrity checks
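Automated DDL translation is, at its simplest, a mapping between type names. The sketch below rewrites a few Oracle column types to T-SQL equivalents with regular expressions. The mapping table is an illustrative assumption rather than the project's converter, and a real tool would parse the DDL instead of pattern-matching it.

```python
import re
from typing import List, Pattern, Tuple

# Illustrative Oracle -> T-SQL type mapping (an assumption, not exhaustive).
ORACLE_TO_TSQL: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"\bVARCHAR2\b", re.IGNORECASE), "VARCHAR"),
    # NUMBER without precision needs manual review: DECIMAL defaults to scale 0.
    (re.compile(r"\bNUMBER\b", re.IGNORECASE), "DECIMAL"),
    (re.compile(r"\bCLOB\b", re.IGNORECASE), "VARCHAR(MAX)"),
    # Oracle DATE carries a time component, so map it to a date-time type.
    (re.compile(r"\bDATE\b", re.IGNORECASE), "DATETIME2(0)"),
]


def translate_ddl(ddl: str) -> str:
    """Apply each type substitution in order; naive about identifiers named like types."""
    for pattern, replacement in ORACLE_TO_TSQL:
        ddl = pattern.sub(replacement, ddl)
    return ddl


source = "CREATE TABLE invoices (id NUMBER(10,0), vendor VARCHAR2(100), notes CLOB, issued DATE)"
translated = translate_ddl(source)
```

Because the patterns match on word boundaries, a column such as `order_date` is left alone, but a column literally named `date` would be rewritten, which is one reason a parser-based approach is preferable in practice.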
| Category | Technologies |
|---|---|
| Language | Python 3.11+ |
| Web Framework | FastAPI, Uvicorn |
| Data Validation | Pydantic, Pydantic-Settings |
| Async Runtime | asyncio, httpx, aiofiles |
| Containerization | Docker, Docker Compose |
| IaC | Azure Bicep |
| CI/CD | GitHub Actions |
| Service | Purpose |
|---|---|
| Azure OpenAI | GPT-4o, GPT-4o-mini, embeddings, fine-tuning |
| Azure Form Recognizer | OCR, layout analysis, custom models |
| Azure SQL Database | Primary data storage with connection pooling |
| Azure Blob Storage | Document storage (Hot/Cool/Archive) |
| Azure Data Lake Gen2 | Analytics data storage |
| Azure Key Vault | Secret management with DefaultAzureCredential |
| Azure Event Hubs | Event streaming |
| Azure Service Bus | Message queuing |
| Azure Cognitive Search | Vector search, semantic search, hybrid retrieval |
| Azure Monitor | Application Insights, Log Analytics |
| Microsoft Fabric | OneLake, Data Warehouse, Real-time Intelligence |
| Technology | Use Case |
|---|---|
| Pandas | Data manipulation and analysis |
| NumPy | Numerical computing |
| Polars | High-performance DataFrames |
| PyArrow | Columnar data processing |
| NetworkX | Graph analysis for lineage tracking |
| SciPy | Statistical analysis |
| Library | Purpose |
|---|---|
| Transformers | Hugging Face model integration |
| Torch | Deep learning |
| TensorFlow | ML model training |
| Scikit-learn | Classical machine learning |
| LangChain | LLM orchestration and chains |
| LangChain-OpenAI | Azure OpenAI integration |
| TikToken | Token counting |
| Driver | Database |
|---|---|
| pyodbc | Azure SQL, SQL Server |
| teradatasql | Teradata migration |
| cx_Oracle | Oracle migration |
| psycopg2-binary | PostgreSQL |
| PyMySQL | MySQL |
| Technology | Purpose |
|---|---|
| Redis | Caching, Pub/Sub |
| Prometheus | Metrics collection |
| Nginx | Load balancing, SSL termination |
| Docker | Containerization |
| Docker Compose | Local orchestration |
| Tool | Purpose |
|---|---|
| pytest | Unit and integration testing |
| pytest-asyncio | Async test support |
| pytest-cov | Code coverage |
| httpx | HTTP client testing |
- Azure Subscription with contributor access
- Python 3.11+ installed
- Docker Desktop (latest version)
- Azure CLI installed
- Git for version control
- Minimum 4GB RAM for local development
git clone https://github.com/saidulIslam1602/Document-Intelligence-Platform.git
cd Document-Intelligence-Platform

# Copy environment template
cp env.example .env
# Edit with your Azure credentials
nano .env

Required Environment Variables:
# Azure Services
AZURE_STORAGE_ACCOUNT_NAME=your_storage_account
AZURE_STORAGE_ACCOUNT_KEY=your_storage_key
AZURE_SQL_SERVER=your_sql_server.database.windows.net
AZURE_SQL_DATABASE=your_database
AZURE_SQL_USERNAME=your_username
AZURE_SQL_PASSWORD=your_password
# Azure OpenAI
OPENAI_API_KEY=your_openai_key
OPENAI_ENDPOINT=https://your-openai.openai.azure.com/
OPENAI_DEPLOYMENT=gpt-4o
# Azure Form Recognizer
FORM_RECOGNIZER_ENDPOINT=https://your-formrecognizer.cognitiveservices.azure.com/
FORM_RECOGNIZER_KEY=your_form_recognizer_key
# Security
JWT_SECRET_KEY=your-secret-key-here
KEY_VAULT_URL=https://your-keyvault.vault.azure.net/
# Redis
REDIS_HOST=localhost
REDIS_PORT=6379
# Optional Configuration
ENVIRONMENT=development
LOG_LEVEL=INFO

# Install Python dependencies
pip install -r requirements.txt
# For testing only
pip install -r requirements-test.txt

# Build and start all services
docker-compose up -d
# View logs
docker-compose logs -f
# Check service health
docker-compose ps

- Web Dashboard: http://localhost:5000
- API Documentation: http://localhost:8003/docs
- Performance Dashboard: http://localhost:8008
- Prometheus Metrics: http://localhost:9090
- MCP Server: http://localhost:8012
# Login to Azure
az login
# Set subscription
az account set --subscription "your-subscription-id"
# Deploy using Bicep
az deployment group create \
--resource-group document-intelligence-rg \
--template-file infrastructure/main.bicep \
--parameters environment=production
# Or use deployment script
chmod +x scripts/deploy.sh
./scripts/deploy.sh

# Set Key Vault secrets
az keyvault secret set \
--vault-name your-keyvault \
--name "openai-api-key" \
--value "your-openai-key"

# Execute SQL schema migrations
python -m alembic upgrade head

# Run all unit tests (skip integration tests)
pytest tests/ -m "not integration" -v
# Run with coverage
pytest tests/ --cov=src --cov-report=html
# Run specific test file
pytest tests/test_unit.py -v
# Quick validation
python tests/quick_test.py

Comprehensive documentation is available in the docs/ folder:
- Quick Start - Get started in 5 minutes
- Quick Start: Intelligent Routing - Intelligent routing setup
- Integration Guide - Complete integration documentation
- Intelligent Routing Guide - Routing system documentation
- Azure Deployment Guide - Detailed Azure deployment
- Implementation Summary - Complete implementation details
- Enhancements README - v2.0 feature overview
- Validation Checklist - Pre-deployment validation
- Retry Logic Usage - Retry pattern implementation
- Circuit Breaker Usage - Circuit breaker pattern
- CI/CD Fix Documentation - CI/CD pipeline configuration
- Comprehensive Analysis - Platform analysis
- Implementation Progress - Progress tracking
- Project Completion Report - Achievement summary
Document-Intelligence-Platform/
├── docs/ # Comprehensive documentation (15+ guides)
├── infrastructure/ # Azure Bicep IaC templates
├── nginx/ # Load balancer configuration
├── monitoring/ # Prometheus configuration
├── scripts/ # Deployment & utility scripts
├── src/
│ ├── microservices/ # 14 microservices
│ │ ├── document-ingestion/ # Upload & validation
│ │ ├── ai-processing/ # AI/ML, fine-tuning, LangChain
│ │ ├── analytics/ # Metrics, automation scoring
│ │ ├── ai-chat/ # RAG-based chat
│ │ ├── api-gateway/ # Auth, routing, rate limiting
│ │ ├── batch-processor/ # Bulk operations
│ │ ├── data-quality/ # Validation, quality scoring
│ │ ├── data-catalog/ # Metadata, lineage
│ │ ├── performance-dashboard/ # Real-time monitoring
│ │ ├── mcp-server/ # Model Context Protocol
│ │ ├── llm-optimization/ # Prompt engineering
│ │ ├── m365-integration/ # Microsoft 365 connectors
│ │ ├── experimentation/ # A/B testing
│ │ └── data-pipeline/ # Stream processing
│ ├── services/ # Enterprise services
│ │ ├── migration-service/ # Database migration tools
│ │ ├── fabric-integration/ # Microsoft Fabric
│ │ └── demo-service/ # PoC framework
│ ├── shared/ # Shared libraries
│ │ ├── auth/ # Authentication
│ │ ├── cache/ # Redis caching
│ │ ├── config/ # Configuration management
│ │ ├── events/ # Event sourcing
│ │ ├── health/ # Health check system
│ │ ├── http/ # HTTP client pool
│ │ ├── monitoring/ # Performance monitoring
│ │ ├── rate_limiting/ # Rate limiter
│ │ ├── resilience/ # Circuit breaker, retry
│ │ ├── routing/ # Intelligent router
│ │ ├── services/ # Common services
│ │ ├── storage/ # SQL, Blob, Data Lake
│ │ └── utils/ # Utilities
│ └── web/ # Web applications
├── tests/ # Test suite
│ ├── test_unit.py # Unit tests
│ ├── test_integration.py # Integration tests
│ ├── quick_test.py # Quick validation
│ └── demo_script.py # Demo script
├── docker-compose.yml # Local development
├── requirements.txt # Python dependencies
├── requirements-test.txt # Test dependencies
├── setup.cfg # Pytest configuration
└── README.md # This file
We welcome contributions to improve the Document Intelligence Platform!
- Fork the Repository
git clone https://github.com/saidulIslam1602/Document-Intelligence-Platform.git
cd Document-Intelligence-Platform
- Create a Feature Branch
git checkout -b feature/your-feature-name
- Make Your Changes
- Follow Python PEP 8 style guidelines
- Add comprehensive docstrings
- Include type hints
- Write unit tests for new features
- Update documentation
- Run Tests
pytest tests/ -m "not integration" -v
- Commit Your Changes
git add .
git commit -m "Add feature: description"
- Push to Your Fork
git push origin feature/your-feature-name
- Open a Pull Request
- Provide clear description
- Reference related issues
- Ensure CI/CD pipeline passes
- Code Style: Follow PEP 8, use type hints
- Documentation: Update README and inline docs
- Testing: Maintain test coverage
- Commits: Use descriptive commit messages
- Security: Never commit secrets or credentials
This project is licensed under the MIT License - see the LICENSE file for details.
- ✅ Commercial use
- ✅ Modification
- ✅ Distribution
- ✅ Private use
- ❌ Liability
- ❌ Warranty
- Complete Azure integration with Key Vault
- Comprehensive error handling and logging
- SQL-based persistence for all workflows
- Docker containerization for all 14 microservices
- Production-grade resilience patterns
- True microservices with independent deployability
- Event-driven design with Event Hubs and Service Bus
- Intelligent document routing with complexity analysis
- Connection pooling and HTTP/2 support
- Circuit breaker and retry mechanisms
- Health checks for Kubernetes
- Rate limiting with token bucket algorithm
- Custom Azure OpenAI fine-tuning workflows
- RAG-based intelligent chat
- LangChain orchestration for complex workflows
- Multi-model support (GPT-4o, BERT, BART, DistilBERT)
- Real-time training monitoring via WebSockets
- Automated model evaluation and deployment
- Database migration tools (Teradata, Netezza, Oracle)
- Microsoft Fabric integration (OneLake, Data Warehouse)
- Data lineage tracking and cataloging
- M365 integration (Teams, Outlook, SharePoint)
- A/B testing framework
- PoC generation tools
- Invoice processing automation
- Contract analysis and risk assessment
- Regulatory compliance checking
- Fraud detection in financial documents
- Medical records extraction
- Clinical notes structuring
- Insurance claims processing
- HIPAA-compliant document handling
- Contract review automation
- Due diligence document analysis
- Legal research and case law search
- E-discovery at scale
- Quality report analysis
- Maintenance log insights
- Supply chain documentation
- Compliance tracking
- Product catalog extraction
- Customer feedback analysis
- Supplier invoice automation
- Inventory documentation
- 14 microservices implementation
- Azure OpenAI fine-tuning integration
- RAG-based chat functionality
- LangChain orchestration
- Intelligent document routing
- Database migration tools
- Microsoft Fabric integration
- MCP Server implementation
- Resilience patterns (circuit breaker, retry, rate limiting)
- Health check system
- Real-time performance monitoring
- Azure Key Vault integration
- Docker containerization
- CI/CD pipeline
- Comprehensive documentation
- Azure Container Apps deployment
- Power BI embedded dashboards
- Advanced security features
- Multi-language support
- Kubernetes orchestration
- GraphQL API
- Advanced AI agents
- Edge computing support
- Documentation: Browse the docs/ folder
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- GitHub: @saidulIslam1602
- LinkedIn: Md Saidul Islam
If you find this project helpful:
- Star the repository
- Fork for your own projects
- Share with your network
- Report issues or bugs
- Suggest new features
Built with ❤️ for the Azure and AI/ML community
Last Updated: December 2025