A powerful multimodal AI system that analyzes video content by extracting and combining information from multiple sources: audio transcription, visual object detection, and text recognition. Built with FastAPI for high-performance video processing.
- 🎙️ Audio Transcription: Automatic speech-to-text using OpenAI's Whisper model
- 👁️ Object Detection: Real-time object recognition using YOLOv8
- 📝 Text Extraction: OCR capabilities with Tesseract for detecting text in video frames
- 🤖 AI Summarization: Intelligent context fusion using local LLM (with fallback to rule-based summarization)
- ⚡ Fast API: RESTful API built with FastAPI for easy integration
- 🔄 Automatic Cleanup: Temporary files are automatically managed and cleaned up
```
Video Upload → Frame Extraction → Parallel Processing
├─ Audio  → Whisper   → Transcription
├─ Visual → YOLO      → Object Detection
└─ Text   → Tesseract → OCR
        ↓
  LLM Summarizer
        ↓
  Unified Summary
```
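In code, this fan-out amounts to running Whisper, YOLOv8, and Tesseract independently over the extracted audio and frames, then handing all three results to the summarizer. The sketch below illustrates that stage with the libraries the project uses; the function names, placeholder file paths, and threading setup are illustrative assumptions, not the repository's actual `video_processor.py`:

```python
# Illustrative sketch of the parallel analysis stage (not the repository's actual code)
from concurrent.futures import ThreadPoolExecutor

import pytesseract
import whisper
from PIL import Image
from ultralytics import YOLO

def transcribe(audio_path: str) -> str:
    model = whisper.load_model("base")
    return model.transcribe(audio_path, fp16=False)["text"]

def detect_objects(frame_path: str) -> list:
    results = YOLO("yolov8n.pt")(frame_path)
    return [results[0].names[int(c)] for c in results[0].boxes.cls]

def read_text(frame_path: str) -> str:
    return pytesseract.image_to_string(Image.open(frame_path))

# The three modalities are independent, so they can run concurrently
with ThreadPoolExecutor(max_workers=3) as pool:
    transcription = pool.submit(transcribe, "audio.wav")   # placeholder audio path
    objects = pool.submit(detect_objects, "frame_1.jpg")   # placeholder frame path
    on_screen_text = pool.submit(read_text, "frame_1.jpg")
    print(transcription.result(), objects.result(), on_screen_text.result())
```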
- Python: 3.9 or higher
- Operating System: Windows, macOS, or Linux
- RAM: Minimum 8GB (16GB recommended for larger models)
- Storage: At least 5GB free space for model weights
**Important:** Tesseract OCR must be installed separately on your system:
- Windows: Download from GitHub Releases
- macOS: `brew install tesseract`
- Linux: `sudo apt-get install tesseract-ocr`
If Tesseract is not in your system PATH, update line 19 in `visual_analyser.py`:

```python
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
```

UV is a fast Python package installer and resolver.
- Install UV (if not already installed):

  ```bash
  # Windows (PowerShell)
  powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

  # macOS/Linux
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

- Clone or navigate to the project directory:

  ```bash
  cd path/to/Contex_AI
  ```

- Install dependencies:

  ```bash
  uv sync
  ```
Alternatively, using pip and a virtual environment:

```bash
# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

The system works out of the box with a rule-based fallback summarizer. For enhanced AI summarization:
- Download a GGUF model file (e.g., from Hugging Face)
- Update `video_processor.py` line 14:

  ```python
  self.llm_summarizer = LLMSummarizer(model_path="path/to/your/model.gguf")
  ```
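For reference, the fallback behavior described above could be structured roughly as follows. This is a hedged sketch using llama-cpp-python; the prompt, parameters, and class body are assumptions, not the actual contents of `llm_summarizer.py`:

```python
# Illustrative sketch of an LLM summarizer with a rule-based fallback
from typing import Optional

from llama_cpp import Llama

class LLMSummarizer:
    def __init__(self, model_path: Optional[str] = None):
        # Only load the local GGUF model when a path is provided
        self.llm = Llama(model_path=model_path, n_ctx=2048) if model_path else None

    def summarize(self, transcription: str, objects: str, text: str) -> str:
        if self.llm is None:
            # Rule-based fallback: concatenate the three modalities
            return (f"Audio Content: {transcription} | "
                    f"Detected Objects: {objects} | Text in Video: {text}")
        prompt = (
            "Summarize this video from its transcript, detected objects, and on-screen text.\n"
            f"Transcript: {transcription}\nObjects: {objects}\nOn-screen text: {text}\nSummary:"
        )
        out = self.llm(prompt, max_tokens=200, stop=["\n\n"])
        return out["choices"][0]["text"].strip()
```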
On first run, the following models will be automatically downloaded:
- Whisper (base model): ~140MB
- YOLOv8n: ~6MB
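If you prefer to fetch these weights ahead of time (for example on a slow connection or a build machine), a short helper script like the one below should work. The script is only a convenience and is not part of the repository:

```python
# prefetch_models.py - optional helper to download model weights up front
import whisper
from ultralytics import YOLO

whisper.load_model("base")  # pulls the Whisper "base" weights (~140MB) on first call
YOLO("yolov8n.pt")          # pulls the YOLOv8n weights (~6MB) on first call
```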
Start the server:

```bash
# Using UV
uv run python main.py

# Using standard Python
python main.py
```

The server will start at http://localhost:8000.
Once the server is running, visit:
- Interactive API docs: http://localhost:8000/docs
- Alternative docs: http://localhost:8000/redoc
cURL:

```bash
curl -X POST "http://localhost:8000/analyze_video/" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@your_video.mp4"
```

Python:

```python
import requests
url = "http://localhost:8000/analyze_video/"
files = {"file": open("your_video.mp4", "rb")}
response = requests.post(url, files=files)
result = response.json()
print("Summary:", result["summary"])
print("Transcription:", result["transcription"])
print("Detected Objects:", result["detected_objects"])const formData = new FormData();
formData.append('file', videoFile);
fetch('http://localhost:8000/analyze_video/', {
method: 'POST',
body: formData
})
.then(response => response.json())
.then(data => {
console.log('Summary:', data.summary);
console.log('Transcription:', data.transcription);
console.log('Detected Objects:', data.detected_objects);
});
```

Example response:

```json
{
"summary": "Audio Content: This is a tutorial about... | Detected Objects: person, laptop, book | Text in Video: Tutorial Title",
"transcription": "Full audio transcription text here...",
"detected_objects": {
"frame_1": {
"objects": [
{"object": "person", "confidence": 0.95},
{"object": "laptop", "confidence": 0.87}
],
"text": "Detected text from frame"
},
"frame_2": {
"objects": [...],
"text": "..."
}
}
}
```

Project structure:

```
Contex_AI/
├── main.py # FastAPI application entry point
├── video_processor.py # Main video processing orchestrator
├── audio_analyser.py # Whisper-based audio transcription
├── visual_analyser.py # YOLO object detection + Tesseract OCR
├── llm_summarizer.py # LLM-based context summarization
├── requirements.txt # Python dependencies (legacy)
├── pyproject.toml # Modern Python project configuration
├── uv.lock # Locked dependency versions
└── README.md            # This file
```
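To connect the project layout to the API above, a `main.py`-style entry point typically looks something like the sketch below. The endpoint body, the `process_video` stand-in, and the temporary-file handling are assumptions for illustration, not the file's verbatim contents:

```python
# Illustrative FastAPI entry point (not the repository's actual main.py)
import os
import tempfile

import uvicorn
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

def process_video(path: str) -> dict:
    # Placeholder for the real pipeline (Whisper + YOLO + Tesseract + summarizer)
    return {"summary": "", "transcription": "", "detected_objects": {}}

@app.post("/analyze_video/")
async def analyze_video(file: UploadFile = File(...)):
    # Persist the upload to a temporary file so the analysers can read it from disk
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name
    try:
        return process_video(tmp_path)
    finally:
        os.remove(tmp_path)  # the temporary upload is always cleaned up

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```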
Issue: `ModuleNotFoundError: No module named 'whisper'`
- Solution: Install openai-whisper: `uv add openai-whisper` or `pip install openai-whisper`
Issue: `pytesseract.pytesseract.TesseractNotFoundError`
- Solution: Install Tesseract OCR (see Requirements section) and configure the path in `visual_analyser.py`
Issue: CUDA/GPU errors with Whisper
- Solution: The code uses `fp16=False` for CPU compatibility. For GPU acceleration, ensure CUDA is installed and set `fp16=True`.
Issue: Out of memory errors
- Solution:
  - Use a smaller Whisper model: change `whisper.load_model("base")` to `"tiny"` in `audio_analyser.py`
  - Reduce the frame extraction interval in `video_processor.py` line 53
Issue: Slow processing
- Solution:
  - Use GPU acceleration if available
  - Increase the `frame_interval` parameter (default: 5 seconds), as sketched below
  - Use smaller AI models
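The memory and speed tweaks above come down to a couple of one-line changes. The snippet below shows the general idea; the exact line numbers, defaults, and surrounding code in `audio_analyser.py` and `video_processor.py` will differ, so treat it as an illustration only:

```python
# audio_analyser.py (illustrative): load a smaller model, toggle fp16 for GPU
import whisper

model = whisper.load_model("tiny")                  # "tiny" needs far less RAM than "base"
result = model.transcribe("audio.wav", fp16=False)  # set fp16=True only on a CUDA GPU

# video_processor.py (illustrative): sample frames less often to reduce work
frame_interval = 10  # seconds between extracted frames; the default is 5
```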
For development:

```bash
# Install dev dependencies
uv sync --all-extras

# Run tests (when available)
pytest
```

```bash
# Format code with Black
black .

# Lint with Ruff
ruff check .
```

Key dependencies:
- fastapi (>=0.104.0): Web framework for building APIs
- uvicorn (>=0.24.0): ASGI server
- openai-whisper (>=20231117): Audio transcription
- ultralytics (>=8.0.0): YOLOv8 object detection
- pytesseract (>=0.3.10): OCR text extraction
- llama-cpp-python (>=0.2.0): Local LLM inference
- moviepy (>=1.0.3): Video processing
- opencv-python (>=4.8.0): Computer vision operations
- Pillow (>=10.0.0): Image processing
- python-multipart (>=0.0.6): File upload handling
Contributions are welcome! Please feel free to submit issues or pull requests.
MIT License - feel free to use this project for personal or commercial purposes.
- OpenAI Whisper: State-of-the-art speech recognition
- Ultralytics YOLOv8: Real-time object detection
- Tesseract OCR: Open-source text recognition
- FastAPI: Modern, fast web framework
Built with ❤️ for multimodal AI analysis