A production-ready FastAPI microservice for AI-powered file processing with comprehensive enterprise features. Deployed and tested with 91% coverage across 439 tests with dual AI model support (OpenAI + Local Models).
┌─────────────────────────────────────────────────────────────┐
│ AI File Utilities v1.0.0 - DEPLOYED │
├─────────────────────────────────────────────────────────────┤
│ FastAPI + JWT Auth Port: 8000 │
│ ├── General Files (/api/v1/general) │
│ ├── Image Processing (/api/v1/image) │
│ ├── Audio/STT (/api/v1/audio) │
│ ├── Video Analysis (/api/v1/video) │
│ └── PDF Operations (/api/v1/pdf) │
├─────────────────────────────────────────────────────────────┤
│ Cloud Integration Stack │
│ ├── Redis Cloud (Caching + Locks) │
│ ├── Supabase Storage (Files + Database) │
│ ├── OpenAI GPT-4 + Whisper OR Local Models (CLIP) │
│ └── Kafka Events (Docker Local) │
├─────────────────────────────────────────────────────────────┤
│ Monitoring & Observability │
│ ├── Prometheus Metrics Port: 9090 │
│ ├── Grafana Dashboard Port: 3000 │
│ ├── Structured JSON Logging │
│ └── Health Checks + Service Status │
└─────────────────────────────────────────────────────────────┘
| Service Category | Endpoints | Key Features |
|-----------------|-----------|--------------|
| Authentication | 2 endpoints | JWT tokens, secure auth |
| General Files | 5 endpoints | File_ID architecture: upload, metadata extraction, text extraction, download, delete |
| Image Processing | 4 endpoints | Convert, classify, defect detection, thumbnails |
| Audio Processing | 4 endpoints | Transcription with timestamps, language detection, summarization, conversion |
| Video Processing | 3 endpoints | Frame extraction, object detection, summarization |
| PDF Operations | 2 endpoints | PDF splitting, cached text retrieval |
- File ID Architecture: Secure file_id-based processing for all operations
- Multi-format Support: Images, Videos, Audio, PDFs, Documents
- Dual AI Models: OpenAI GPT-4/Whisper OR Local Models (CLIP for CV, Whisper for STT)
- Format Conversion: Image/audio conversion, PDF operations
- Metadata Extraction: EXIF, PDF properties, video specs
- OCR + Text Extraction: Tesseract integration
Complete Model Mapping Table - OpenAI vs Local Models
| Service | API Endpoint | OpenAI Model (USE_OPENAI=true) | Local Model (USE_OPENAI=false) | Local Model Recommendations | Purpose |
|---|---|---|---|---|---|
| Audio/STT | POST /api/v1/audio/transcribe | whisper-1 | openai/whisper-tiny | Better: whisper-base (290MB) or whisper-small (966MB) for higher accuracy. Best: whisper-medium (3GB) for production quality. Enable GPU with device="cuda" | Audio transcription with timestamps |
| Audio/STT | POST /api/v1/audio/detect-language | whisper-1 | openai/whisper-tiny | Better: whisper-base for more reliable language detection. Use longer audio samples (>10s) for better accuracy | Audio language detection |
| Audio/STT | POST /api/v1/audio/summarize | whisper-1 + gpt-3.5-turbo | openai/whisper-tiny + t5-small | STT: use whisper-base for cleaner transcripts. Summary: upgrade to t5-base (850MB) or flan-t5-base for better summarization quality | Transcription + summarization |
| Audio/STT | POST /api/v1/audio/convert | No AI (FFmpeg) | No AI (FFmpeg) | No AI - pure FFmpeg processing | Audio format conversion |
| Image/CV | POST /api/v1/image/classify | gpt-4o | openai/clip-vit-base-patch32 | Better: clip-vit-base-patch16 (350MB) for higher-resolution analysis. Best: clip-vit-large-patch14 (1.7GB) for production. Use image preprocessing (resize, normalize) for stability | Image classification & description |
| Image/CV | POST /api/v1/image/detect-defects | gpt-4o | openai/clip-vit-base-patch32 | Better: clip-vit-base-patch16 + custom defect categories. Advanced: fine-tune CLIP on defect datasets or use specialized models like microsoft/DiT-base for industrial QA | Defect detection & quality analysis |
| Image/CV | POST /api/v1/image/convert | No AI (PIL) | No AI (PIL) | No AI - pure PIL processing | Image format conversion |
| Image/CV | POST /api/v1/image/thumbnail | No AI (PIL) | No AI (PIL) | No AI - pure PIL processing | Thumbnail generation |
| Video/CV | POST /api/v1/video/extract-frames | No AI (OpenCV) | No AI (OpenCV) | No AI - pure OpenCV processing | Frame extraction |
| Video/CV | POST /api/v1/video/detect-objects | gpt-4o | openai/clip-vit-base-patch32 | Better: clip-vit-base-patch16 + a frame sampling strategy (every 5th frame). Advanced: YOLOv8 or DETR models for real object detection vs semantic analysis | Object detection in video frames |
| Video/CV | POST /api/v1/video/summarize | gpt-4o | openai/clip-vit-base-patch32 + t5-small | CV: clip-vit-base-patch16 for better frame analysis. Text gen: upgrade to t5-base or bart-large-cnn (1.6GB) for video summarization. Implement keyframe extraction | Video content summarization |
| PDF | POST /api/v1/pdf/split | No AI (PyPDF2) | No AI (PyPDF2) | No AI - pure PyPDF2 processing | PDF page splitting |
| PDF | GET /api/v1/pdf/text/{cache_id} | No AI (Tesseract OCR) | No AI (Tesseract OCR) | Better OCR: use PaddleOCR or EasyOCR for higher accuracy. Pre-process images (deskew, denoise) before OCR | Cached OCR text retrieval |
| General | POST /api/v1/general/extract-text | No AI (Tesseract OCR) | No AI (Tesseract OCR) | Better OCR: upgrade to PaddleOCR (multi-language) or the TrOCR transformer model for document text extraction | Document text extraction |
OpenAI Models:
- whisper-1: OpenAI's production Whisper model for audio transcription
- gpt-3.5-turbo: Text summarization (200 tokens max)
- gpt-4o: Vision model for image/video analysis (200-300 tokens max)

Local Models:
- openai/whisper-tiny: 39M-parameter Whisper model (fast initialization, 244MB download)
  - Upgrade Path: whisper-base (290MB) → whisper-small (966MB) → whisper-medium (3GB)
  - Best For: Development/testing, low-resource environments
- t5-small: 60M-parameter T5 model for text summarization (242MB download)
  - Upgrade Path: t5-base (850MB) → flan-t5-base (990MB) → bart-large-cnn (1.6GB)
  - Best For: Quick summarization, limited GPU memory
- openai/clip-vit-base-patch32: 151MB CLIP model for computer vision (optimized for speed)
  - Upgrade Path: clip-vit-base-patch16 (350MB) → clip-vit-large-patch14 (1.7GB)
  - Best For: Fast image classification, real-time processing
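The upgrade paths above can be captured as a small registry that picks the largest model tier fitting a given download budget. This is an illustrative sketch, not part of the service; model names and sizes follow the lists above, and the helper function is hypothetical.

```python
# Illustrative sketch: pick a local model tier from the upgrade paths above.
# Approximate sizes (MB) follow the README; pick_model itself is hypothetical.
UPGRADE_PATHS = {
    "stt": [("openai/whisper-tiny", 244), ("whisper-base", 290),
            ("whisper-small", 966), ("whisper-medium", 3000)],
    "summarize": [("t5-small", 242), ("t5-base", 850),
                  ("flan-t5-base", 990), ("bart-large-cnn", 1600)],
    "vision": [("openai/clip-vit-base-patch32", 151),
               ("clip-vit-base-patch16", 350), ("clip-vit-large-patch14", 1700)],
}

def pick_model(task: str, budget_mb: int) -> str:
    """Return the largest model on the task's upgrade path within budget."""
    path = UPGRADE_PATHS[task]
    fitting = [name for name, size in path if size <= budget_mb]
    # Fall back to the smallest tier when nothing fits the budget.
    return fitting[-1] if fitting else path[0][0]

print(pick_model("stt", 1000))    # whisper-small
print(pick_model("vision", 200))  # openai/clip-vit-base-patch32
```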
Performance Optimization:
- GPU Acceleration: All models support CUDA with torch.float16 optimization
- Model Quantization: Use 8-bit or 4-bit quantization for memory reduction
- Batch Processing: Process multiple files simultaneously for better throughput
- Model Caching: Models stay loaded in memory after first use
Quality Improvements:
- Whisper Audio: Use 16kHz sampling rate, reduce background noise, longer audio clips (>10s)
- CLIP Vision: Resize images to optimal dimensions (224x224 or 336x336), proper normalization
- T5 Text: Provide context, use appropriate prompt engineering, limit output length
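The CLIP preprocessing advice above (resize to 224x224 or 336x336, proper normalization) amounts to a centered crop plus CLIP's published per-channel mean/std normalization. In practice the transformers CLIPProcessor does this for you; the stdlib-only sketch below just shows the math, and the helper names are illustrative.

```python
# Sketch of CLIP-style preprocessing math (normally done by CLIPProcessor).
# The mean/std constants are CLIP's published values; helpers are illustrative.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def center_crop_box(width: int, height: int, size: int = 224):
    """Return (left, top, right, bottom) for a centered size x size crop."""
    left = (width - size) // 2
    top = (height - size) // 2
    return (left, top, left + size, top + size)

def normalize_pixel(rgb):
    """Scale 0-255 RGB to 0-1, then apply CLIP per-channel normalization."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))
```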
Production Recommendations:
- Memory Requirements:
- Tiny models: 2-4GB RAM
- Base models: 6-8GB RAM
- Large models: 12-16GB RAM + GPU
- Storage: Pre-download models to avoid cold starts
- Monitoring: Track model inference times and memory usage
- Environment Variable: USE_OPENAI=true|false controls model selection
- Automatic Fallback: No fallback - pure OpenAI or pure local model operation
- Performance: Local models optimized with torch.float16 for CUDA acceleration
- Initialization: Local models downloaded and cached on first use from Hugging Face
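The single-flag, no-fallback behavior described above can be sketched as a backend factory that reads USE_OPENAI once and fails loudly instead of falling back. The function name and error message below are assumptions for illustration, not the service's actual internals.

```python
import os

# Hypothetical sketch of the USE_OPENAI switch: one flag, no silent fallback.
def select_backend() -> str:
    """Return 'openai' or 'local' based on USE_OPENAI; raise on bad config."""
    use_openai = os.environ.get("USE_OPENAI", "true").lower() == "true"
    if use_openai:
        if not os.environ.get("OPENAI_API_KEY"):
            # Pure OpenAI mode: a missing key is an error, not a fallback.
            raise RuntimeError("USE_OPENAI=true but OPENAI_API_KEY is not set")
        return "openai"
    return "local"

os.environ["USE_OPENAI"] = "false"
print(select_backend())  # local
```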
- JWT Authentication: Secure token-based auth with expiration
- Authorization Middleware: Protected endpoints with user context
- Input Validation: Comprehensive request/response validation
- Secure Headers: CORS, security headers implemented
- File ID-Based Operations: All processing endpoints use secure file_id parameters instead of file uploads
- Supabase-Only Storage: 100% cloud-native storage with NO local fallback - all files stored in Supabase buckets
- Decoupled Storage: Files uploaded once via /upload, then processed by reference
- General API Refactored: Extract-text endpoint now uses file_id instead of file upload
- Metadata Bug Fixed: Async metadata service now properly awaits coroutines (was causing 500 errors)
- Enhanced Security: No direct file exposure in processing endpoints
- Improved Performance: Cached file access and reduced data transfer
- Async Processing: Background processing with event-driven architecture
- Consistent Architecture: All APIs (general, image, audio, video, PDF) use the file_id pattern
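The upload-once, process-by-reference contract above can be sketched as a registry: upload returns a UUID file_id, and every later operation resolves that id instead of accepting bytes. The names below are illustrative and the dict is a stand-in — the real service persists objects to Supabase.

```python
import uuid

# Illustrative in-memory stand-in for the file_id architecture.
# The real service stores files in Supabase; this dict just shows the contract.
_store: dict[str, bytes] = {}

def upload(data: bytes) -> str:
    """Upload once; all later operations reference the returned file_id."""
    file_id = str(uuid.uuid4())
    _store[file_id] = data
    return file_id

def extract_text(file_id: str) -> str:
    """Process by reference - no file bytes cross the wire after upload."""
    data = _store.get(file_id)
    if data is None:
        raise KeyError(f"unknown file_id: {file_id}")
    return data.decode("utf-8", errors="replace")

fid = upload(b"hello world")
print(extract_text(fid))  # hello world
```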
- 91% Test Coverage (439 tests passing across all APIs - June 2025)
- Comprehensive Module Coverage:
- CV Service: 91% coverage (387 statements, 34 missing) - Full OpenAI/Local support
- STT Service: 86% coverage (366 statements, 51 missing) - Full OpenAI/Local support
- Audio API: 95% coverage (188 statements, 10 missing) - Dual model endpoints
- Image API: 94% coverage (197 statements, 11 missing) - Complete file_id architecture
- Video API: 96% coverage (141 statements, 6 missing) - Frame extraction + AI analysis
- PDF API: 100% coverage (81 statements, 0 missing) - Split + cached text
- General API: 78% coverage (185 statements, 41 missing) - File_id refactoring
- Auth API: 100% coverage (35 statements, 0 missing) - JWT token management
- Storage Services: 93-100% coverage - Redis (100%), Supabase (100%), Local Storage (93%)
- Prometheus Metrics: Request counters, duration histograms
- Structured Logging: JSON logs with correlation IDs
- Health Endpoints: /health with service status
- Performance Tracking: Response time monitoring
- Dual Model Testing: All CV, STT, and Audio services tested with both OpenAI and local models
- Redis Cloud: Distributed caching and session management
- Supabase Storage: Scalable file storage with CDN (100% cloud-native, no local fallback)
- Kafka Events: Async event processing for workflows
- Docker Orchestration: Multi-service deployment ready
# 1. Clone repository
git clone <repository-url>
cd ai-file-utilities
# 2. Configure environment
Copy-Item .env.example .env
# Edit .env with your API keys (see Configuration section)
# 3. Deploy full stack
docker-compose up -d
# 4. Verify deployment
docker-compose ps
# Expected: All services "Up (healthy)"
# 5. Test API
Invoke-RestMethod "http://localhost:8000/health"

Services deployed:
- API: http://localhost:8000 (Interactive docs: /docs)
- Grafana: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
- Redis: localhost:6379
- Kafka: localhost:9092
# Create and activate virtual environment
python -m venv venv
.\venv\Scripts\Activate.ps1
# Install dependencies
pip install -r requirements.txt
# Set environment variables
$env:OPENAI_API_KEY = "your_openai_api_key_here"
$env:SUPABASE_URL = "your_supabase_url"
$env:SUPABASE_KEY = "your_supabase_anon_key"
# Install Tesseract OCR (Windows)
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
# Add to PATH or set TESSERACT_CMD environment variable
# Run the service
uvicorn ai_file_utilities.main:app --reload --host 0.0.0.0 --port 8000

| Variable | Description | Default | Required |
|---|---|---|---|
| OPENAI_API_KEY | OpenAI API key for AI features | - | Yes |
| SUPABASE_URL | Supabase project URL | - | No |
| SUPABASE_KEY | Supabase anon key | - | No |
| JWT_SECRET_KEY | Secret for JWT token signing | auto-generated | No |
| REDIS_URL | Redis connection URL | redis://localhost:6379 | No |
| KAFKA_BOOTSTRAP_SERVERS | Kafka brokers | localhost:9092 | No |
| KAFKA_ENABLED | Enable Kafka events | true | No |
| UPLOAD_DIR | Local upload directory | ./uploads | No |
| TESSERACT_CMD | Tesseract executable path | tesseract | No |
# Core AI Services
OPENAI_API_KEY=sk-your-openai-api-key-here
# JWT Authentication
JWT_SECRET=your-secure-jwt-secret-minimum-64-characters
JWT_ALGORITHM=HS256
JWT_EXPIRATION_HOURS=24
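An HS256 JWT like the one configured above is just two base64url-encoded JSON segments signed with HMAC-SHA256 using JWT_SECRET. The stdlib-only sketch below shows the signing step for illustration; production code should use a vetted library such as PyJWT rather than this hand-rolled version.

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url(data: bytes) -> str:
    # JWT uses unpadded base64url encoding for each segment.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(user_id: str, secret: str, expiration_hours: int = 24) -> str:
    """Build an HS256 JWT: base64url(header).base64url(payload).signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": user_id, "exp": int(time.time()) + expiration_hours * 3600}
    signing_input = (f"{_b64url(json.dumps(header).encode())}."
                     f"{_b64url(json.dumps(payload).encode())}")
    sig = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(sig)}"

token = sign_jwt("test-user", "a" * 64)
print(token.count("."))  # 2
```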
# Supabase Cloud Storage
SUPABASE_URL=https://your-project-id.supabase.co
SUPABASE_ANON_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
SUPABASE_SERVICE_KEY=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
SUPABASE_BUCKET=ai-file-utilities

# 1. Go to https://supabase.com and sign up
# 2. Create new project (takes 2-3 minutes)
# 3. Navigate to Project Settings → API
# 4. Copy Project URL and API keys

-- In Supabase Dashboard → Storage, create bucket:
-- Name: ai-file-utilities
-- Public: false (private)
-- File size limit: 100MB

-- Go to Storage → Policies and add these:
-- Allow uploads for authenticated users
CREATE POLICY "Allow upload for authenticated users"
ON storage.objects FOR INSERT
WITH CHECK (bucket_id = 'ai-file-utilities');
-- Allow downloads for authenticated users
CREATE POLICY "Allow download for authenticated users"
ON storage.objects FOR SELECT
USING (bucket_id = 'ai-file-utilities');
-- Allow deletions for authenticated users
CREATE POLICY "Allow delete for authenticated users"
ON storage.objects FOR DELETE
USING (bucket_id = 'ai-file-utilities');

# After configuring .env, test with health check:
Invoke-RestMethod "http://localhost:8000/health" | ConvertTo-Json
# Should show: "supabase": "configured"

# For production, use Redis Cloud:
REDIS_ENABLED=true
REDIS_URL=redis://default:password@host:port
REDIS_DEFAULT_TTL=3600

# For Docker (local development):
KAFKA_ENABLED=true
KAFKA_BOOTSTRAP_SERVERS=kafka:9092
# For production Kafka cluster:
# KAFKA_BOOTSTRAP_SERVERS=your-kafka-cluster:9092

# Windows: Download Tesseract OCR
# https://github.com/UB-Mannheim/tesseract/wiki
# Add to PATH or set:
$env:TESSERACT_CMD = "C:\Program Files\Tesseract-OCR\tesseract.exe"

# For production deployment:
LOG_LEVEL=INFO
DEBUG=false
MAX_WORKERS=4
REQUEST_TIMEOUT=300
PROCESSING_TIMEOUT=600

# For development:
LOG_LEVEL=DEBUG
DEBUG=true
RELOAD=true
TEST_MODE=false

# Set environment variables for OpenAI testing
$env:USE_OPENAI="true"
$env:OPENAI_API_KEY="test-key"
# Run all tests with coverage
python -m pytest tests/ --cov=ai_file_utilities --cov-report=html --cov-report=xml --cov-report=term-missing
# View HTML coverage report
start htmlcov/index.html

# First install local model dependencies (if not already installed)
pip install torch torchvision transformers pillow
# Set environment variables for local model testing
$env:USE_OPENAI="false"
# Run tests (now with local model dependencies available)
python -m pytest tests/ --cov=ai_file_utilities --cov-report=html --cov-report=xml --cov-report=term-missing
# Note: Local models will be downloaded automatically on first use
# Models downloaded: whisper-tiny (~244MB), CLIP (~151MB), T5-small (~242MB)

# Fast test run with OpenAI mocking
$env:USE_OPENAI="true"; $env:OPENAI_API_KEY="test-key"; python -m pytest tests/ -v
# Fast test run with local models
$env:USE_OPENAI="false"; python -m pytest tests/ -v

# Enable OpenAI models for testing
$env:USE_OPENAI="true"
$env:OPENAI_API_KEY="test-key"
# Run specific test modules
python -m pytest tests/test_cv_service_comprehensive_fixed.py -v
python -m pytest tests/test_stt_service_fixed.py -v
python -m pytest tests/test_audio_endpoints_updated.py -v

# First install local model dependencies
pip install torch torchvision transformers pillow
# Enable local models for testing
$env:USE_OPENAI="false"
# Run dual-model test suites
python -m pytest tests/test_cv_service_comprehensive_fixed.py -v
python -m pytest tests/test_stt_service_fixed.py -v
python -m pytest tests/test_audio_endpoints_updated.py -v
# Note: First run will download models (~640MB total)

- API Endpoint Tests: Complete coverage of all REST endpoints with dual model support
- CV Service Tests: Both OpenAI GPT-4o and local CLIP model testing (65 tests)
- STT Service Tests: OpenAI Whisper and local Whisper model testing (55 tests)
- Audio Processing Tests: Dual model transcription, language detection, summarization (30 tests)
- Service Layer Tests: Storage, OCR, metadata services with comprehensive mocking
- Integration Tests: End-to-end workflows with environment variable switching
- File_ID Architecture Tests: All processing endpoints use secure file_id parameters
- Total Tests: 439 tests across 19 comprehensive test modules
- Test Status: All tests passing (0 failures)
- Coverage: 91% overall code coverage (2412 statements, 219 missing)
- Dual Model Test Support: Both OpenAI and local model paths fully tested
- Test Files:
  - test_cv_service_comprehensive_fixed.py - CV service with OpenAI + local models (65 tests)
  - test_stt_service_fixed.py - STT service with OpenAI Whisper + local models (55 tests)
  - test_audio_endpoints_updated.py - Audio API with dual model support (30 tests)
  - test_image_fileid.py - Image API endpoints with file_id architecture (20 tests)
  - test_video_fileid.py - Video API endpoints (28 tests)
  - test_*_comprehensive.py - Complete service coverage (241 additional tests)
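The environment-variable switching these tests rely on is easiest to exercise with unittest.mock.patch.dict, which restores os.environ after each case. A hedged sketch — the checked function is a stand-in for whichever service code reads USE_OPENAI, not the service's real internals:

```python
import os
from unittest import mock

# Illustrative pattern for dual-model tests: patch USE_OPENAI per case so
# each test sees a clean environment. model_for_classify is a stand-in.
def model_for_classify() -> str:
    if os.environ.get("USE_OPENAI") == "true":
        return "gpt-4o"
    return "openai/clip-vit-base-patch32"

def test_both_backends():
    with mock.patch.dict(os.environ, {"USE_OPENAI": "true"}):
        assert model_for_classify() == "gpt-4o"
    with mock.patch.dict(os.environ, {"USE_OPENAI": "false"}):
        assert model_for_classify() == "openai/clip-vit-base-patch32"

test_both_backends()
```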
- CV Service: 91% coverage (387 statements, 34 missing) - Full OpenAI/Local support
- STT Service: 86% coverage (366 statements, 51 missing) - Full OpenAI/Local support
- Audio API: 95% coverage (188 statements, 10 missing) - Dual model endpoints
- Image API: 94% coverage (197 statements, 11 missing) - Complete file_id architecture
- Video API: 96% coverage (141 statements, 6 missing) - Frame extraction + AI analysis
- PDF API: 100% coverage (81 statements, 0 missing) - Split + cached text
- General API: 78% coverage (185 statements, 41 missing) - File_id refactoring
- Auth API: 100% coverage (35 statements, 0 missing) - JWT token management
- Storage Services: 93-100% coverage - Redis (100%), Supabase (100%), Local Storage (93%)
- Utility Services: 89-100% coverage - Metadata (100%), OCR (100%), Kafka (95%)
- General API: 78% coverage (file upload, metadata, text extraction)
- Image API: 94% coverage (classification, defect detection, conversion, thumbnails)
- Audio API: 95% coverage (transcription, language detection, conversion)
- Video API: 96% coverage (frame extraction, object detection, summarization)
- PDF API: 100% coverage (text extraction, metadata, cached retrieval)
- CV Service: 91% coverage (dual OpenAI/local model support)
- STT Service: 86% coverage (dual OpenAI Whisper/local model support)
- Storage Service: 93% coverage (Supabase integration)
- Redis Service: 100% coverage (caching and distributed locks)
- Kafka Events: 95% coverage (event publishing)
The test suite supports both OpenAI and local models:
OpenAI Model Tests (Default - set USE_OPENAI=true):
- Computer Vision: Uses GPT-4 Vision for image classification and defect detection
- Speech-to-Text: Uses OpenAI Whisper for audio transcription
- Audio Processing: Uses OpenAI models for language detection and summarization
Local Model Tests (set USE_OPENAI=false):
- Computer Vision: Uses CLIP and T5 models for local processing
- Speech-to-Text: Uses local Whisper models and ASR libraries
- Audio Processing: Uses local NLP models for analysis
# Generate JWT token
$response = Invoke-RestMethod -Uri "http://localhost:8000/auth/generate-token" -Method POST -ContentType "application/json" -Body '{"user_id": "test-user"}'
$token = $response.access_token
# Use token in requests
$headers = @{ "Authorization" = "Bearer $token" }

The General API has been updated to use a file_id-based architecture for improved security and performance:
# Step 1: Upload file (ONLY upload endpoint accepts files)
$uploadResult = Invoke-RestMethod -Uri "http://localhost:8000/api/v1/general/upload" -Method POST -Form @{ file = Get-Item "document.txt" } -Headers $headers
$fileId = $uploadResult.file_id
# Step 2: Extract text using file_id (NO file upload required)
$extractResult = Invoke-RestMethod -Uri "http://localhost:8000/api/v1/general/extract-text" -Method POST -Form @{ file_id = $fileId } -Headers $headers
# Step 3: Get metadata using file_id
$metadata = Invoke-RestMethod -Uri "http://localhost:8000/api/v1/general/metadata/$fileId" -Method GET -Headers $headers
# Step 4: Download file using file_id
Invoke-WebRequest -Uri "http://localhost:8000/api/v1/general/download/$fileId" -Method GET -Headers $headers -OutFile "downloaded_file.txt"
# Step 5: Delete file using file_id
$deleteResult = Invoke-RestMethod -Uri "http://localhost:8000/api/v1/general/$fileId" -Method DELETE -Headers $headers

General API:
- Upload: Only the /upload endpoint accepts file uploads
- Extract Text: Now uses a file_id parameter instead of a file upload
- Metadata: Uses file_id in the URL path
- Download: Uses file_id in the URL path
- Delete: Uses file_id in the URL path
- Security: No file exposure in processing endpoints
- Performance: Cached file access and reduced data transfer
# Upload and process image
$form = @{
file = Get-Item "image.jpg"
target_format = "png"
width = 800
height = 600
}
$result = Invoke-RestMethod -Uri "http://localhost:8000/api/v1/image/convert" -Method POST -Form $form -Headers $headers

Complete PowerShell examples are available in the examples/ directory with full dual model support:
# Run with OpenAI models (default)
$env:USE_OPENAI="true"; $env:OPENAI_API_KEY="your-key-here"
.\examples\image_working_test.ps1
# Switch to local models (CLIP for CV)
$env:USE_OPENAI="false"
.\examples\image_working_test.ps1
# Compare OpenAI vs Local model results
$env:USE_OPENAI="true"; .\examples\image_working_test.ps1
$env:USE_OPENAI="false"; .\examples\image_working_test.ps1

# Audio transcription with OpenAI Whisper
$env:USE_OPENAI="true"; $env:OPENAI_API_KEY="your-key-here"
.\examples\audio_working_test.ps1
# Switch to local STT models (whisper-tiny)
$env:USE_OPENAI="false"
.\examples\audio_working_test.ps1
# Test language detection with both models
$env:USE_OPENAI="true"; .\examples\audio_working_test.ps1 # Uses OpenAI whisper-1
$env:USE_OPENAI="false"; .\examples\audio_working_test.ps1 # Uses local whisper-tiny

# Video analysis with OpenAI GPT-4o
$env:USE_OPENAI="true"; $env:OPENAI_API_KEY="your-key-here"
.\examples\video_working_test.ps1
# Test with local models (CLIP + T5)
$env:USE_OPENAI="false"
.\examples\video_working_test.ps1
# Compare AI model outputs
$env:USE_OPENAI="true"; .\examples\video_working_test.ps1 # GPT-4o analysis
$env:USE_OPENAI="false"; .\examples\video_working_test.ps1 # Local model analysis

# Run comprehensive tests across all services with OpenAI
$env:USE_OPENAI="true"; $env:OPENAI_API_KEY="your-key-here"
.\examples\complete_test_suite.ps1
# Test with local models
$env:USE_OPENAI="false"
.\examples\complete_test_suite.ps1
# Test API endpoints with curl (model-agnostic)
.\examples\curl_test_suite.ps1

Switch between OpenAI and local models at runtime:
# Set environment variables for OpenAI models
$env:USE_OPENAI="true"
$env:OPENAI_API_KEY="your-openai-api-key"
# Verify configuration
Write-Output "Model Mode: $env:USE_OPENAI"
Write-Output "API Key Set: $($env:OPENAI_API_KEY -ne $null)"
# Run example with OpenAI
.\examples\audio_working_test.ps1

# Set environment variables for local models
$env:USE_OPENAI="false"
# Note: Local models are automatically downloaded and cached on first use
# Verify configuration
Write-Output "Model Mode: $env:USE_OPENAI"
Write-Output "Local models will be used (Whisper-tiny, CLIP, T5-small)"
# Run example with local models (no API key required)
.\examples\audio_working_test.ps1

# Test with OpenAI, then switch to local models
$env:USE_OPENAI="true"; $env:OPENAI_API_KEY="test-key"; .\examples\complete_test_suite.ps1
$env:USE_OPENAI="false"; .\examples\complete_test_suite.ps1
# Run specific service tests with model switching
$env:USE_OPENAI="true"; python -m pytest tests/test_cv_service_comprehensive_fixed.py -v
$env:USE_OPENAI="false"; python -m pytest tests/test_cv_service_comprehensive_fixed.py -v

- general_api_demo.ps1 - Complete General API file_id workflow (Upload → Extract-Text → Metadata → Download → Delete)
- audio_working_test.ps1 - Audio processing with dual model support (OpenAI Whisper + Local whisper-tiny)
- video_working_test.ps1 - Video processing with dual model support (OpenAI GPT-4o + Local CLIP/T5)
- image_working_test.ps1 - Image processing with dual model support (OpenAI GPT-4o + Local CLIP)
- complete_test_suite.ps1 - Full API test suite with model switching (All endpoints + dual models)
- pdf_test.ps1 - PDF processing (split, text extraction, cached retrieval)
- curl_test_suite.ps1 - cURL command examples for all endpoints
- image_test.ps1 - Legacy image processing test (kept for compatibility)
- video_test.ps1 - Legacy video processing test (kept for compatibility)
examples/
├── PowerShell Scripts (All with Dual Model Support)
│ ├── audio_working_test.ps1 # STT: OpenAI Whisper vs Local whisper-tiny
│ ├── video_working_test.ps1 # CV: OpenAI GPT-4o vs Local CLIP+T5
│ ├── image_working_test.ps1 # CV: OpenAI GPT-4o vs Local CLIP
│ ├── complete_test_suite.ps1 # Full test suite with model switching
│ ├── general_api_demo.ps1 # Complete file_id workflow demo
│ ├── pdf_test.ps1 # PDF processing examples
│ └── curl_test_suite.ps1 # cURL command examples
├── Test Media Files (Real files for testing)
│ ├── audio.mp3 # Real audio file (41KB) - English speech
│ ├── video.mp4 # Real video file (1.7MB) - Sample video
│ ├── test_image.png # Real image file (1KB) - Test image
│ └── test.pdf # Real PDF file (218KB) - Multi-page document
└── Generated Test Results
├── test_debug.png # Debug output images from processing
├── test_final.png # Final processed images
└── test_fresh.png # Fresh test outputs
All example scripts support environment variable configuration for immediate model switching:
# Set environment for OpenAI models and run examples
$env:USE_OPENAI="true"; $env:OPENAI_API_KEY="your-key"; .\examples\audio_working_test.ps1
$env:USE_OPENAI="true"; $env:OPENAI_API_KEY="your-key"; .\examples\image_working_test.ps1
$env:USE_OPENAI="true"; $env:OPENAI_API_KEY="your-key"; .\examples\video_working_test.ps1
# Run comprehensive test suite with OpenAI
$env:USE_OPENAI="true"; $env:OPENAI_API_KEY="your-key"; .\examples\complete_test_suite.ps1

# First install local model dependencies (one-time setup)
pip install torch torchvision transformers pillow
# Set environment for local models and run examples (no API key required)
$env:USE_OPENAI="false"; .\examples\audio_working_test.ps1 # Local whisper-tiny + T5-small
$env:USE_OPENAI="false"; .\examples\image_working_test.ps1 # Local CLIP model
$env:USE_OPENAI="false"; .\examples\video_working_test.ps1 # Local CLIP + T5-small
# Run comprehensive test suite with local models
$env:USE_OPENAI="false"; .\examples\complete_test_suite.ps1
# Note: Models are downloaded automatically on first use (~640MB total)

# Compare outputs between OpenAI and local models
Write-Output "=== Testing with OpenAI Models ==="
$env:USE_OPENAI="true"; $env:OPENAI_API_KEY="your-key"; .\examples\audio_working_test.ps1
Write-Output "=== Testing with Local Models ==="
$env:USE_OPENAI="false"; .\examples\audio_working_test.ps1
# Results will show model differences in transcription quality, speed, and output format

├── general_api_demo.ps1 # Complete General API demo (file_id architecture)
├── audio_working_test.ps1 # Audio processing with OpenAI/Local models (141 lines)
├── video_working_test.ps1 # Video processing with OpenAI/Local models (164 lines)
├── image_working_test.ps1 # Image processing with OpenAI/Local models (172 lines)
├── pdf_test.ps1 # PDF processing test
├── curl_test_suite.ps1 # cURL commands reference
├── complete_test_suite.ps1 # Comprehensive test suite with model switching
├── image_test.ps1 # Legacy image processing test
├── video_test.ps1 # Legacy video processing test
├── audio.mp3 # Real audio file for testing (41KB)
├── video.mp4 # Real video file for testing (1.7MB)
├── test_debug.png # Test image file (335 bytes)
├── test_final.png # Test image file (335 bytes)
├── test_fresh.png # Test image file (485 bytes)
├── test_image.png # Real image file for testing (1KB)
└── test.pdf # Real PDF file for testing (218KB)
### **General API Demo Features**
**File:** `examples/general_api_demo.ps1`
**Complete File_ID Architecture Test:**
1. **Upload:** `POST /api/v1/general/upload` - Accepts file uploads → Returns `file_id`
2. **Extract-Text:** `POST /api/v1/general/extract-text` - Uses `file_id` (NOT file upload)
3. **Metadata:** `GET /api/v1/general/metadata/{file_id}` - File metadata extraction
4. **Download:** `GET /api/v1/general/download/{file_id}` - File download
5. **Delete:** `DELETE /api/v1/general/{file_id}` - File cleanup
**Sample Output:**
=== General API Complete Test - File ID Architecture ===
Upload successful! File ID: df90a98c-8f82-4503-bc4c-009aa0d94984
Text extraction successful! Processing Time: 2.149s
Metadata retrieval successful! Keys: file_size, content_type, filename, page_count, is_encrypted, pdf_metadata
Download successful! Downloaded: 218,741 bytes
File deleted successfully!
## **EXAMPLE WORKFLOWS** (Working & Tested)
### **Pre-requisites for Examples**
```powershell
# 1. Ensure Docker stack is running
docker-compose ps # All services should be "Up (healthy)"
# 2. Generate authentication token
$token = (Invoke-RestMethod -Uri "http://localhost:8000/auth/generate-token" -Method POST -ContentType "application/json" -Body '{"user_id": "test-user"}').access_token
Write-Host "Token: $token"
# Navigate to examples directory
cd examples
# Run complete General API file_id architecture demo
powershell.exe -ExecutionPolicy Bypass -File general_api_demo.ps1
# Navigate to examples directory
cd examples
# Run audio transcription and conversion workflow
powershell.exe -ExecutionPolicy Bypass -File audio_test.ps1
# Test video frame extraction and analysis
powershell.exe -ExecutionPolicy Bypass -File video_test.ps1
# Test image processing pipeline
powershell.exe -ExecutionPolicy Bypass -File image_test.ps1
# Test PDF processing capabilities
powershell.exe -ExecutionPolicy Bypass -File pdf_test.ps1
```
### **OpenAI/Local Model Switching Examples**
The following examples demonstrate automatic model switching between OpenAI and Local models:
#### **Audio Processing with Model Switching**
```powershell
# Navigate to examples directory
cd examples
# Test with LOCAL models (default)
powershell.exe -ExecutionPolicy Bypass -File audio_working_test.ps1
# Test with OPENAI models (requires OPENAI_API_KEY)
powershell.exe -ExecutionPolicy Bypass -File audio_working_test.ps1 -UseOpenAI
# Test with LOCAL models (CLIP)
powershell.exe -ExecutionPolicy Bypass -File image_working_test.ps1
# Test with OPENAI models (GPT-4 Vision)
powershell.exe -ExecutionPolicy Bypass -File image_working_test.ps1 -UseOpenAI
# Test with LOCAL models (CLIP, OpenCV)
powershell.exe -ExecutionPolicy Bypass -File video_working_test.ps1
# Test with OPENAI models (GPT-4 Vision for frames)
powershell.exe -ExecutionPolicy Bypass -File video_working_test.ps1 -UseOpenAI
# Option 1: Environment variable (affects all services)
$env:USE_OPENAI = "true"
$env:OPENAI_API_KEY = "your-api-key"
# Option 2: Parameter-based switching (per test)
powershell.exe -ExecutionPolicy Bypass -File audio_working_test.ps1 -UseOpenAI
powershell.exe -ExecutionPolicy Bypass -File image_working_test.ps1 -UseOpenAI
powershell.exe -ExecutionPolicy Bypass -File video_working_test.ps1 -UseOpenAI
# Option 3: Docker environment switching
# Update docker-compose.yml:
# - USE_OPENAI=true
# - OPENAI_API_KEY=${OPENAI_API_KEY}
docker-compose down && docker-compose up -d
```

| Method | Endpoint | Description | OpenAI Model | Local Model |
|---|---|---|---|---|
| POST | /api/v1/auth/generate-token | Generate JWT token | None | None |
| GET | /api/v1/auth/verify | Verify JWT token | None | None |
Features:
- JWT token generation with configurable expiration
- Token verification and user context extraction
- Secure authentication middleware
| Method | Endpoint | Description | OpenAI Model | Local Model |
|---|---|---|---|---|
| POST | /api/v1/general/upload | Upload files to Supabase storage | None | None |
| GET | /api/v1/general/download/{file_id} | Download files by file_id | None | None |
| GET | /api/v1/general/metadata/{file_id} | Extract file metadata | None | None |
| POST | /api/v1/general/extract-text | OCR text extraction (file_id) | None | None |
| DELETE | /api/v1/general/{file_id} | Delete files from storage | None | None |
Architecture:
- File-ID Based: All processing uses secure file_id parameters
- Supabase-Only Storage: 100% cloud-native, no local fallback
- OCR Integration: Tesseract OCR for text extraction
- Metadata Extraction: EXIF, PDF properties, document analysis
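The file_id architecture can be illustrated with a minimal sketch. The registry dict, the uploads/ path prefix, and the field names here are assumptions for illustration, not the service's actual Supabase schema.

```python
import mimetypes
import pathlib
import uuid

def register_upload(filename: str, store: dict) -> str:
    """Assign an opaque file_id to an upload and record where the object lives.
    All later operations (metadata, download, delete) key off this id, never a path."""
    file_id = uuid.uuid4().hex
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    store[file_id] = {
        "object_path": f"uploads/{file_id}{pathlib.Path(filename).suffix}",  # hypothetical layout
        "content_type": content_type,
        "original_name": filename,
    }
    return file_id

store = {}
fid = register_upload("report.pdf", store)
print(store[fid]["content_type"])  # application/pdf
```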
| Method | Endpoint | Description | OpenAI Model | Local Model |
|---|---|---|---|---|
| POST | /api/v1/audio/transcribe | Audio transcription with timestamps | whisper-1 | openai/whisper-tiny |
| POST | /api/v1/audio/detect-language | Audio language detection | whisper-1 | openai/whisper-tiny |
| POST | /api/v1/audio/summarize | Transcription + AI summarization | whisper-1 + gpt-3.5-turbo | whisper-tiny + t5-small |
| POST | /api/v1/audio/convert | Audio format conversion | None (FFmpeg) | None (FFmpeg) |
Features:
- Dual Model Support: OpenAI Whisper-1 OR local Whisper-tiny
- Language Detection: Automatic language identification
- VTT Timestamps: WebVTT format with precise timing
- AI Summarization: GPT-3.5-turbo OR T5-small for summaries
- Format Conversion: MP3, WAV, M4A, FLAC support
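The WebVTT output that timestamped transcription produces can be sketched as follows; the (start, end, text) tuples stand in for whatever segment structure the STT model actually returns.

```python
def to_vtt(segments):
    """Render (start_sec, end_sec, text) segments as a WebVTT document."""
    def ts(sec):
        # WebVTT cue timestamps: HH:MM:SS.mmm
        h, rem = divmod(sec, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"
    lines = ["WEBVTT", ""]
    for start, end, text in segments:
        lines += [f"{ts(start)} --> {ts(end)}", text, ""]
    return "\n".join(lines)

print(to_vtt([(0.0, 2.5, "Hello world."), (2.5, 5.0, "Second cue.")]))
```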
| Method | Endpoint | Description | OpenAI Model | Local Model |
|---|---|---|---|---|
| POST | /api/v1/image/convert | Image format conversion | None (PIL) | None (PIL) |
| POST | /api/v1/image/classify | AI image classification | gpt-4o | openai/clip-vit-base-patch32 |
| POST | /api/v1/image/detect-defects | AI defect detection | gpt-4o | openai/clip-vit-base-patch32 |
| POST | /api/v1/image/thumbnail | Thumbnail generation | None (PIL) | None (PIL) |
AI Features:
- Classification: Detailed image description and categorization
- Defect Detection: Quality control and damage assessment
- Dual Models: GPT-4o vision OR CLIP for computer vision tasks
- Format Support: JPEG, PNG, GIF, BMP, TIFF, WebP
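Thumbnail generation preserves aspect ratio; a minimal sketch of the size computation, roughly what PIL's Image.thumbnail does (only downscaling, never enlarging):

```python
def thumbnail_size(width: int, height: int, max_side: int = 256):
    """Fit (width, height) inside a max_side box, keeping aspect ratio."""
    scale = min(max_side / width, max_side / height, 1.0)  # never upscale
    return (round(width * scale), round(height * scale))

print(thumbnail_size(1920, 1080))  # (256, 144)
print(thumbnail_size(100, 100))   # (100, 100) - already small enough
```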
| Method | Endpoint | Description | OpenAI Model | Local Model |
|---|---|---|---|---|
| POST | /api/v1/video/extract-frames | Extract frames from video | None (OpenCV) | None (OpenCV) |
| POST | /api/v1/video/detect-objects | AI object detection in video | gpt-4o | openai/clip-vit-base-patch32 |
| POST | /api/v1/video/summarize | AI video content summarization | gpt-4o | clip-vit-base-patch32 + t5-small |
AI Features:
- Object Detection: Identify and describe objects in video frames
- Content Summarization: AI-powered video content analysis
- Frame Analysis: Processes first 3-5 frames for optimization
- Dual Models: GPT-4o vision OR CLIP + T5 for video analysis
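The frame-budget optimization above (only the first 3-5 frames are analyzed) can be sketched as a selection step that also maps frame indices to timestamps; the helper name and tuple shape are made up for illustration.

```python
def select_frames(total_frames: int, fps: float, max_frames: int = 5):
    """Take the first few frame indices and pair each with its timestamp in
    seconds (frame_index / fps), capping the per-video analysis cost."""
    indices = range(min(total_frames, max_frames))
    return [(i, round(i / fps, 3)) for i in indices]

print(select_frames(300, 30.0, max_frames=3))  # [(0, 0.0), (1, 0.033), (2, 0.067)]
```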
| Method | Endpoint | Description | OpenAI Model | Local Model |
|---|---|---|---|---|
| POST | /api/v1/pdf/split | Split PDF by page ranges | None (PyPDF2) | None (PyPDF2) |
| GET | /api/v1/pdf/text/{cache_id} | Retrieve cached OCR text | None (Tesseract) | None (Tesseract) |
Features:
- Page Range Splitting: Split PDFs by specified page ranges (e.g., "1-3,5-7")
- Cached Text Retrieval: Redis-cached OCR text with cache_id lookup
- PDF Processing: PyPDF2 for manipulation, Tesseract for OCR
- Multiple Cache Patterns: Supports various cache key formats for compatibility
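Parsing a page-range spec like "1-3,5-7" into explicit page numbers can be done as below; the helper name is hypothetical, not the service's actual function.

```python
def parse_page_ranges(spec: str):
    """Expand a spec like "1-3,5-7" into a list of 1-based page numbers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = map(int, part.split("-"))
            pages.extend(range(start, end + 1))  # ranges are inclusive
        else:
            pages.append(int(part))
    return pages

print(parse_page_ranges("1-3,5-7"))  # [1, 2, 3, 5, 6, 7]
```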
USE_OPENAI=true   # Use OpenAI models (whisper-1, gpt-4o, gpt-3.5-turbo)
USE_OPENAI=false  # Use local models (whisper-tiny, clip-vit-base-patch32, t5-small)

- OpenAI: Cloud-based, consistent performance, API costs
- Local: Self-hosted, CUDA acceleration, one-time download cost
- Optimization: Local models use torch.float16 for CUDA acceleration
- Initialization: Local models downloaded from Hugging Face on first use
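The USE_OPENAI switch maps to a per-task model choice; a minimal sketch, assuming a task-to-model registry (the helper name select_model is made up; the model names come from the endpoint tables above):

```python
import os

# Dual-model registry mirroring the endpoint tables above
MODELS = {
    "stt": {"openai": "whisper-1", "local": "openai/whisper-tiny"},
    "vision": {"openai": "gpt-4o", "local": "openai/clip-vit-base-patch32"},
    "summarize": {"openai": "gpt-3.5-turbo", "local": "t5-small"},
}

def select_model(task: str) -> str:
    """Pick the OpenAI or local model for a task based on the USE_OPENAI env var."""
    use_openai = os.getenv("USE_OPENAI", "false").lower() == "true"
    return MODELS[task]["openai" if use_openai else "local"]

os.environ["USE_OPENAI"] = "true"
print(select_model("stt"))     # whisper-1
os.environ["USE_OPENAI"] = "false"
print(select_model("vision"))  # openai/clip-vit-base-patch32
```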
tests/
├── conftest.py # Test configuration and fixtures
├── test_api_endpoints.py # API endpoint integration tests
├── test_auth_comprehensive.py # Authentication system tests
├── test_audio_endpoints_updated.py # Audio API endpoint tests
├── test_audio_conversion_comprehensive.py # Audio conversion service tests
├── test_cv_service_comprehensive_fixed.py # Computer Vision service tests
├── test_general_updated.py # General API tests
├── test_image_fileid.py # Image API file_id tests
├── test_video_fileid.py # Video API file_id tests
├── test_pdf_fileid.py # PDF API file_id tests
├── test_storage_comprehensive.py # Storage service tests
├── test_supabase_storage_comprehensive.py # Supabase storage tests
├── test_stt_service_fixed.py # Speech-to-Text service tests
├── test_ocr_service_comprehensive.py # OCR service tests
├── test_redis_service_comprehensive.py # Redis caching tests
├── test_kafka_events_comprehensive.py # Kafka event handling tests
├── test_metadata_service_comprehensive.py # Metadata extraction tests
├── test_middleware_comprehensive.py # Middleware tests
└── __init__.py # Package initialization
Note: Tests require a Docker environment with all dependencies installed:
# Run tests in Docker container
docker-compose up -d
docker-compose exec ai-file-utilities python -m pytest tests/ -v