
Context AI - Video Context Extraction API

A powerful multimodal AI system that analyzes video content by extracting and combining information from multiple sources: audio transcription, visual object detection, and text recognition. Built with FastAPI for high-performance video processing.

🎯 Features

  • 🎙️ Audio Transcription: Automatic speech-to-text using OpenAI's Whisper model
  • 👁️ Object Detection: Real-time object recognition using YOLOv8
  • 📝 Text Extraction: OCR capabilities with Tesseract for detecting text in video frames
  • 🤖 AI Summarization: Intelligent context fusion using local LLM (with fallback to rule-based summarization)
  • ⚡ Fast API: RESTful API built with FastAPI for easy integration
  • 🔄 Automatic Cleanup: Temporary files are automatically managed and cleaned up (see the endpoint sketch below)
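
As a rough picture of how uploads and cleanup fit together, here is a minimal FastAPI sketch. It is illustrative only: process_video() is a hypothetical stand-in for the real pipeline, and the actual endpoint lives in main.py.

import os
import shutil
import tempfile
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

def process_video(path: str) -> dict:
    # Hypothetical stand-in for the real pipeline in video_processor.py
    return {"summary": "", "transcription": "", "detected_objects": {}}

@app.post("/analyze_video/")
async def analyze_video(file: UploadFile = File(...)):
    # Persist the upload to a temporary file so the analyzers can read it from disk
    suffix = os.path.splitext(file.filename or "video.mp4")[1]
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        shutil.copyfileobj(file.file, tmp)
        tmp_path = tmp.name
    try:
        return process_video(tmp_path)
    finally:
        # Automatic cleanup: remove the temporary file whether processing succeeded or failed
        os.remove(tmp_path)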

🏗️ Architecture

Video Upload → Frame Extraction → Parallel Processing
                                   ├─ Audio → Whisper → Transcription
                                   ├─ Visual → YOLO → Object Detection
                                   └─ Text → Tesseract → OCR
                                          ↓
                                   LLM Summarizer
                                          ↓
                                   Unified Summary
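
The flow above can be approximated directly with the underlying libraries. The sketch below is illustrative only, assuming a 5-second sampling interval and temporary file names; the project's actual logic lives in video_processor.py, audio_analyser.py, and visual_analyser.py.

import cv2
import pytesseract
import whisper
from moviepy.editor import VideoFileClip
from ultralytics import YOLO

def analyze(video_path: str, frame_interval: int = 5) -> dict:
    # Audio: extract the soundtrack and transcribe it with Whisper
    VideoFileClip(video_path).audio.write_audiofile("temp_audio.wav")
    transcription = whisper.load_model("base").transcribe("temp_audio.wav", fp16=False)["text"]

    # Visual: sample one frame every `frame_interval` seconds, run YOLO + Tesseract on it
    detector = YOLO("yolov8n.pt")
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    detections, idx = {}, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % int(fps * frame_interval) == 0:
            result = detector(frame, verbose=False)[0]
            objects = [
                {"object": result.names[int(box.cls[0])], "confidence": float(box.conf[0])}
                for box in result.boxes
            ]
            text = pytesseract.image_to_string(frame).strip()
            detections[f"frame_{len(detections) + 1}"] = {"objects": objects, "text": text}
        idx += 1
    cap.release()

    # Fusion: combine the modalities (the real system delegates this to the LLM summarizer)
    names = sorted({o["object"] for f in detections.values() for o in f["objects"]})
    summary = f"Audio Content: {transcription[:200]} | Detected Objects: {', '.join(names)}"
    return {"summary": summary, "transcription": transcription, "detected_objects": detections}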

📋 Requirements

System Requirements

  • Python: 3.9 or higher
  • Operating System: Windows, macOS, or Linux
  • RAM: Minimum 8GB (16GB recommended for larger models)
  • Storage: At least 5GB free space for model weights

External Dependencies

Important

Tesseract OCR must be installed separately on your system:

  • Windows: Download from GitHub Releases
  • macOS: brew install tesseract
  • Linux: sudo apt-get install tesseract-ocr

If Tesseract is not in your system PATH, update line 19 in visual_analyser.py:

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

🚀 Installation

Using UV (Recommended)

UV is a fast Python package installer and resolver.

  1. Install UV (if not already installed):

    # Windows (PowerShell)
    powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
    
    # macOS/Linux
    curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Clone or navigate to the project directory:

    cd path/to/Context_AI
  3. Install dependencies:

    uv sync

Using pip (Alternative)

# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

⚙️ Configuration

Optional: LLM Model Setup

The system works out-of-the-box with a rule-based fallback summarizer. For enhanced AI summarization:

  1. Download a GGUF model file (e.g., from Hugging Face)
  2. Update video_processor.py line 14 (see the llama-cpp-python sketch below):
    self.llm_summarizer = LLMSummarizer(model_path="path/to/your/model.gguf")
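
For reference, the summarizer is built on llama-cpp-python. The snippet below is only a sketch of how a GGUF model is typically loaded and prompted with that library; the prompt text and parameter values are assumptions, not the project's exact code.

from llama_cpp import Llama

# Load the quantized GGUF model (n_ctx sets the context window)
llm = Llama(model_path="path/to/your/model.gguf", n_ctx=2048, verbose=False)

prompt = (
    "Summarize the following video context.\n"
    "Transcription: ...\nDetected objects: ...\nOn-screen text: ...\nSummary:"
)

# Plain completion call; the generated text is under choices[0]["text"]
output = llm(prompt, max_tokens=256, temperature=0.3)
print(output["choices"][0]["text"].strip())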

Model Downloads

On first run, the following models are downloaded automatically (to pre-fetch them, see the snippet after this list):

  • Whisper (base model): ~140MB
  • YOLOv8n: ~6MB
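
To pre-fetch these weights (for example on a machine that will later run offline), the same calls the analyzers make will trigger the downloads:

import whisper
from ultralytics import YOLO

# Downloads and caches the Whisper "base" weights (~140MB) on first use
whisper.load_model("base")

# Downloads yolov8n.pt (~6MB) into the working directory on first use
YOLO("yolov8n.pt")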

🎬 Usage

Starting the API Server

# Using UV
uv run python main.py

# Using standard Python
python main.py

The server will start at http://localhost:8000

API Documentation

Once the server is running, visit:

  • Swagger UI (interactive docs): http://localhost:8000/docs
  • ReDoc (alternative docs): http://localhost:8000/redoc

Making API Requests

Using cURL

curl -X POST "http://localhost:8000/analyze_video/" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@your_video.mp4"

Using Python

import requests

url = "http://localhost:8000/analyze_video/"

# Use a context manager so the file handle is closed after the upload
with open("your_video.mp4", "rb") as f:
    response = requests.post(url, files={"file": f})

result = response.json()

print("Summary:", result["summary"])
print("Transcription:", result["transcription"])
print("Detected Objects:", result["detected_objects"])

Using JavaScript/Fetch

const formData = new FormData();
formData.append('file', videoFile);

fetch('http://localhost:8000/analyze_video/', {
  method: 'POST',
  body: formData
})
  .then(response => response.json())
  .then(data => {
    console.log('Summary:', data.summary);
    console.log('Transcription:', data.transcription);
    console.log('Detected Objects:', data.detected_objects);
  });

Response Format

{
  "summary": "Audio Content: This is a tutorial about... | Detected Objects: person, laptop, book | Text in Video: Tutorial Title",
  "transcription": "Full audio transcription text here...",
  "detected_objects": {
    "frame_1": {
      "objects": [
        {"object": "person", "confidence": 0.95},
        {"object": "laptop", "confidence": 0.87}
      ],
      "text": "Detected text from frame"
    },
    "frame_2": {
      "objects": [...],
      "text": "..."
    }
  }
}
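
As an illustration of how a client might walk this structure, the helper below prints the highest-confidence detection and any OCR text for each sampled frame:

def top_detections(result: dict) -> None:
    # `result` is the parsed JSON response from /analyze_video/
    for frame_id, frame in result["detected_objects"].items():
        objects = frame.get("objects", [])
        if not objects:
            continue
        best = max(objects, key=lambda o: o["confidence"])
        print(f"{frame_id}: {best['object']} ({best['confidence']:.2f}) | text: {frame.get('text', '')!r}")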

📦 Project Structure

Context_AI/
├── main.py                 # FastAPI application entry point
├── video_processor.py      # Main video processing orchestrator
├── audio_analyser.py       # Whisper-based audio transcription
├── visual_analyser.py      # YOLO object detection + Tesseract OCR
├── llm_summarizer.py       # LLM-based context summarization
├── requirements.txt        # Python dependencies (legacy)
├── pyproject.toml          # Modern Python project configuration
├── uv.lock                 # Locked dependency versions
└── README.md               # This file

🔧 Troubleshooting

Common Issues

Issue: ModuleNotFoundError: No module named 'whisper'

  • Solution: Install openai-whisper: uv add openai-whisper or pip install openai-whisper

Issue: pytesseract.pytesseract.TesseractNotFoundError

  • Solution: Install Tesseract OCR (see Requirements section) and configure the path in visual_analyser.py

Issue: CUDA/GPU errors with Whisper

  • Solution: The code uses fp16=False for CPU compatibility. For GPU acceleration, ensure you have CUDA installed and set fp16=True

Issue: Out of memory errors

  • Solution:
    • Use smaller Whisper model: Change whisper.load_model("base") to "tiny" in audio_analyser.py
    • Reduce frame extraction interval in video_processor.py line 53

Issue: Slow processing

  • Solution:
    • Use GPU acceleration if available
    • Increase frame_interval parameter (default: 5 seconds)
    • Use smaller AI models (see the tuning snippet below)
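
The tuning changes mentioned above are small edits; exact variable names and line numbers in audio_analyser.py and video_processor.py may differ, so treat the following as an illustration:

import whisper

# audio_analyser.py: trade accuracy for memory and speed with a smaller Whisper model
model = whisper.load_model("tiny")  # instead of "base"

# video_processor.py: analyze frames less often (seconds between sampled frames)
frame_interval = 10  # instead of the default 5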

🛠️ Development

Running Tests

# Install dev dependencies
uv sync --all-extras

# Run tests (when available)
pytest

Code Formatting

# Format code with Black
black .

# Lint with Ruff
ruff check .

📝 Dependencies

Core Dependencies

  • fastapi (>=0.104.0): Web framework for building APIs
  • uvicorn (>=0.24.0): ASGI server
  • openai-whisper (>=20231117): Audio transcription
  • ultralytics (>=8.0.0): YOLOv8 object detection
  • pytesseract (>=0.3.10): OCR text extraction
  • llama-cpp-python (>=0.2.0): Local LLM inference
  • moviepy (>=1.0.3): Video processing
  • opencv-python (>=4.8.0): Computer vision operations
  • Pillow (>=10.0.0): Image processing
  • python-multipart (>=0.0.6): File upload handling

🤝 Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

📄 License

MIT License - feel free to use this project for personal or commercial purposes.

🙏 Acknowledgments

  • OpenAI Whisper: State-of-the-art speech recognition
  • Ultralytics YOLOv8: Real-time object detection
  • Tesseract OCR: Open-source text recognition
  • FastAPI: Modern, fast web framework

Built with ❤️ for multimodal AI analysis
