Viva Stage AI - Backend

AI-powered video processing service that transforms YouTube videos into engaging short-form content

Overview

Viva Stage AI is a FastAPI-based backend service that automatically extracts highlight moments from long-form YouTube videos and converts them into vertical (9:16) short-format videos optimized for TikTok, Instagram Reels, and YouTube Shorts.

The service uses AI to analyze video transcripts, identifies the most engaging segments, applies intelligent face-centered cropping, and generates professional-looking shorts with optional captions—all through a single API call.

Key Features

AI-Powered Highlight Extraction - Uses LLM analysis to identify engaging moments from video transcripts
Intelligent Face Detection & Tracking - Tracks speakers across frames and centers the crop on the active face
Smart 9:16 Cropping - Face-centered cropping with automatic fallback to letterboxing for complex scenes
Automatic Caption Generation - Word-level timed captions positioned for mobile viewing
LLM Provider Flexibility - Switch between local (Ollama) and API-based (OpenAI) LLMs with zero code changes
YouTube Integration - Direct video download and processing from YouTube URLs

Tech Stack

Web Framework

FastAPI - Modern async Python web framework
Uvicorn - ASGI server for production deployment
Pydantic - Data validation and settings management

AI & Machine Learning

LLMs - OpenAI's GPT models + Ollama models
Groq - Whisper Large V3 for audio transcription

Video & Audio Processing

OpenCV - Computer vision and video manipulation
face_recognition - Face detection and tracking
moviepy - Video editing and clip extraction
pytubefix - YouTube video downloading
FFmpeg - Fast audio/video processing

Security & HTTP

Supabase - JWT authentication
python-jose - JWT token handling
httpx - Async HTTP client

Architecture

The project follows a clean layered architecture:

API Layer (Controllers)
        ↓
Service Layer (ReelService)
        ↓
Engine Layer (VideoEngine, AudioEngine, LLMEngine, CaptionEngine)
        ↓
Provider Layer (LLMProvider abstraction)

Key Design Patterns:

Service-Oriented Architecture - Clear separation between orchestration and business logic
Provider Pattern - Abstract LLM provider interface with factory instantiation
Dependency Injection - FastAPI's DI system for clean component composition

Quick Start

Prerequisites

Python 3.11 or higher
Ollama for local LLM inference

Installation

# 1. Clone the repository
git clone <repository-url>
cd backend

# 2. Install Python dependencies
pip install -r requirements.txt

# 3. Configure environment variables
cp .env.example .env
# Edit .env with your API keys (see Configuration section)

# 4. For local LLM usage
# Install Ollama from https://ollama.com
# The model you pull must match LOCAL_LLM_MODEL in your .env file
ollama pull llama3.1:8b

Running the Server

uvicorn main:app --reload

The API will be available at http://localhost:8000

Configuration

Required environment variables:

# API Keys
GROQ_API_KEY=your_groq_api_key          # Required for transcription
OPENAI_API_KEY=your_openai_key          # Required if LLM_PROVIDER=openai

# LLM Provider Configuration
LLM_PROVIDER=local                      # Options: "local" or "openai"
LOCAL_LLM_URL=http://localhost:11434    # Ollama endpoint
LOCAL_LLM_MODEL=qwen3:8b                # Model name for Ollama
LOCAL_LLM_TIMEOUT=300                   # Request timeout in seconds

# Authentication
SUPABASE_URL=your_supabase_url
SUPABASE_SECRET_KEY=your_secret_key
SUPABASE_JWKS_URL=your_jwks_endpoint
SUPABASE_AUDIENCE=authenticated
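
A minimal sketch of how the LLM and transcription variables above might be loaded and validated at startup. It uses only the standard library; the project lists Pydantic for settings management, so the real code probably looks different. The `Settings` class, the `load_settings` function, and the fallback defaults (copied from the example values above) are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the LLM and transcription variables."""
    groq_api_key: str
    llm_provider: str
    openai_api_key: Optional[str]
    local_llm_url: str
    local_llm_model: str
    local_llm_timeout: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping (the process environment by default)."""
    provider = env.get("LLM_PROVIDER", "local")
    if provider not in ("local", "openai"):
        raise ValueError(f"LLM_PROVIDER must be 'local' or 'openai', got {provider!r}")

    groq_key = env.get("GROQ_API_KEY")
    if not groq_key:
        raise ValueError("GROQ_API_KEY is required for transcription")

    openai_key = env.get("OPENAI_API_KEY")
    if provider == "openai" and not openai_key:
        raise ValueError("OPENAI_API_KEY is required when LLM_PROVIDER=openai")

    return Settings(
        groq_api_key=groq_key,
        llm_provider=provider,
        openai_api_key=openai_key,
        # Assumed defaults, taken from the example values in this section.
        local_llm_url=env.get("LOCAL_LLM_URL", "http://localhost:11434"),
        local_llm_model=env.get("LOCAL_LLM_MODEL", "qwen3:8b"),
        local_llm_timeout=int(env.get("LOCAL_LLM_TIMEOUT", "300")),
    )
```

Failing fast on a missing key at startup avoids discovering the problem halfway through a long video job.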

API Endpoints

`POST /reels/extract`

Processes a YouTube video and generates short-form vertical videos.

Request Body:

{
  "youtube_url": "https://www.youtube.com/watch?v=VIDEO_ID",
  "captions": true,
  "language": "en"
}

Note: The number of reels generated is automatically determined based on video duration:

0-15 minutes: Up to 4 reels
15-30 minutes: Up to 6 reels
30-60 minutes: Up to 8 reels
60+ minutes: Up to 10 reels
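
The tiers above can be sketched as a small lookup function. The name `max_reels_for_duration` is hypothetical, and because the listed ranges share their endpoints, this sketch assumes a boundary value (for example exactly 15 minutes) falls into the higher tier; the actual service may decide differently.

```python
def max_reels_for_duration(duration_seconds: float) -> int:
    """Return the upper limit on reels for a video of the given length."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    minutes = duration_seconds / 60
    if minutes < 15:
        return 4
    if minutes < 30:
        return 6
    if minutes < 60:
        return 8
    return 10
```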

Response:

{
  "status": "success",
  "processing_id": "20240115_143022",
  "message": "Reels generated successfully"
}

Output: Generated videos are saved in output/{processing_id}/shorts/
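
A rough sketch of building this request from Python with the standard library. The helper name `build_extract_request` is hypothetical, and sending the Supabase JWT as an `Authorization: Bearer` header is an assumption; the README only states that JWT authentication is used.

```python
import json
import urllib.request


def build_extract_request(base_url, youtube_url, token, captions=True, language="en"):
    """Build (but do not send) a POST /reels/extract request."""
    payload = {
        "youtube_url": youtube_url,
        "captions": captions,
        "language": language,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/reels/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed header format for the JWT.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Sending the request (processing can take a long time, so use a generous timeout):
#
#   req = build_extract_request("http://localhost:8000", video_url, jwt)
#   with urllib.request.urlopen(req, timeout=1800) as resp:
#       result = json.load(resp)
#   print(result["processing_id"])
```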

How It Works

The processing pipeline consists of 8 automated steps:

1. Setup - Creates timestamped output directories for organized file management
2. Download - Downloads video and audio from YouTube, merges streams using FFmpeg
3. Transcribe - Processes audio in chunks using Groq's Whisper API for word-level timestamps
4. Analyze - LLM analyzes transcript to identify highlight moments based on engagement
5. Cut - Extracts video clips using the identified timestamps
6. Process - Applies face detection and intelligent 9:16 aspect ratio cropping
7. Caption - Adds dynamic word-level captions to videos (if requested)
8. Cleanup - Removes temporary files and returns processing results
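
Because the Transcribe step works on audio chunks, word timestamps from each chunk have to be shifted back onto the timeline of the full video before the Analyze and Cut steps can use them. The sketch below shows one way to do that. It assumes fixed-length, non-overlapping chunks and word entries shaped like `{"word", "start", "end"}` with times relative to the chunk start; the function name and data shape are hypothetical, not taken from the project.

```python
from typing import Dict, List


def merge_chunk_words(chunks: List[List[Dict]], chunk_seconds: float) -> List[Dict]:
    """Combine per-chunk word timings into one list on the full-video timeline."""
    merged = []
    for index, words in enumerate(chunks):
        # Chunk N starts N * chunk_seconds into the original audio.
        offset = index * chunk_seconds
        for word in words:
            merged.append({
                "word": word["word"],
                "start": word["start"] + offset,
                "end": word["end"] + offset,
            })
    return merged
```

If the real chunks overlap or vary in length, the offset would need to come from each chunk's actual start time instead.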

Project Highlights

LLM Provider Abstraction Layer

The standout architectural feature is a flexible LLM integration system that allows seamless switching between providers:

Abstract base class defines the contract for all providers
Factory pattern handles provider instantiation based on configuration
Provider-specific prompts optimized for each LLM (ChatML for local, OpenAI format for API)
Structured output support ensures type-safe responses across all providers

Why this matters: Develop and test with free local models (Ollama), then deploy with production APIs (OpenAI) using a single environment variable change. This demonstrates architectural foresight and cost-conscious engineering.
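
A minimal sketch of the abstract-base-class plus factory arrangement described above. Only the `LLMProvider` name appears in this README; `LocalProvider`, `OpenAIProvider`, `create_provider`, and the `complete` method are hypothetical, and the provider bodies are stubs that make no network calls.

```python
from abc import ABC, abstractmethod
from typing import Mapping


class LLMProvider(ABC):
    """Contract that every provider must satisfy."""

    @abstractmethod
    def complete(self, prompt: str) -> str:
        """Return the model's reply to a prompt."""


class LocalProvider(LLMProvider):
    def __init__(self, base_url: str, model: str) -> None:
        self.base_url = base_url
        self.model = model

    def complete(self, prompt: str) -> str:
        # Stub: a real implementation would call the Ollama endpoint here.
        return f"[local:{self.model}] {prompt}"


class OpenAIProvider(LLMProvider):
    def __init__(self, api_key: str) -> None:
        self.api_key = api_key

    def complete(self, prompt: str) -> str:
        # Stub: a real implementation would call the OpenAI API here.
        return f"[openai] {prompt}"


def create_provider(config: Mapping[str, str]) -> LLMProvider:
    """Factory: pick a provider from the LLM_PROVIDER setting."""
    kind = config.get("LLM_PROVIDER", "local")
    if kind == "local":
        return LocalProvider(
            config.get("LOCAL_LLM_URL", "http://localhost:11434"),
            config.get("LOCAL_LLM_MODEL", "qwen3:8b"),
        )
    if kind == "openai":
        return OpenAIProvider(config["OPENAI_API_KEY"])
    raise ValueError(f"Unknown LLM_PROVIDER: {kind!r}")
```

Callers depend only on `LLMProvider`, so switching providers touches configuration and nothing else.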

Face Tracking Algorithm

Beyond simple cropping, the face detection system implements sophisticated tracking:

Encodes and compares faces across frames to identify unique speakers
Builds continuous segments where the same person is centered
Applies temporal smoothing to prevent jarring transitions
Gracefully handles edge cases (no faces, multiple faces) with letterbox fallback

This creates a professional viewing experience similar to manually edited content.
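
To make the cropping and smoothing ideas concrete, here is a small sketch in plain Python. It is not the project's algorithm: the README does not say which smoothing method is used, so an exponential moving average stands in for it, and the function names are hypothetical. Face detection itself is out of scope; the input is assumed to be the horizontal face center for each frame.

```python
def crop_window(face_center_x, frame_width, frame_height):
    """Return (left, right) pixel columns of a 9:16 window centered on the face."""
    # Full frame height is kept, so the crop width follows from the 9:16 ratio.
    crop_width = min(frame_width, frame_height * 9 // 16)
    left = int(face_center_x - crop_width / 2)
    # Clamp so the window never leaves the frame.
    left = max(0, min(left, frame_width - crop_width))
    return left, left + crop_width


def smooth_centers(centers, alpha=0.2):
    """Exponential moving average over per-frame face centers.

    A smaller alpha follows the face more slowly and gives steadier framing.
    """
    smoothed = []
    previous = None
    for center in centers:
        previous = center if previous is None else alpha * center + (1 - alpha) * previous
        smoothed.append(previous)
    return smoothed
```

For a 1920x1080 frame the window is 607 pixels wide, so a face near either edge is clamped to the frame boundary instead of producing an out-of-range crop.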

Clean Architecture

The codebase demonstrates production-ready software design:

Separation of concerns - Controllers handle HTTP, services orchestrate, engines execute
Dependency injection - Clean component composition without tight coupling
Error handling - Custom exception hierarchy with informative error messages
Security - JWT authentication, input validation, CORS configuration
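
As an illustration of the exception-hierarchy point, a sketch might look like the following. Every class name here is hypothetical, and the error payload shape (`"status": "error"`) is an assumption modeled on the success response shown earlier.

```python
class ReelProcessingError(Exception):
    """Base class for pipeline failures; records which step failed."""

    step = "unknown"


class DownloadError(ReelProcessingError):
    step = "download"


class TranscriptionError(ReelProcessingError):
    step = "transcribe"


def to_error_response(exc: ReelProcessingError) -> dict:
    """Turn a pipeline exception into a JSON-serializable payload."""
    return {"status": "error", "step": exc.step, "message": str(exc)}
```

A controller can then catch `ReelProcessingError` once and still report which pipeline step failed.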

Output Structure

Processed videos are organized in timestamped directories:

output/{processing_id}/
├── audio/           # Extracted and merged audio files
├── video/           # Downloaded and processed videos
├── transcription/   # JSON transcripts with word-level timing
├── llm/             # LLM formatted transcript
└── shorts/          # Final 9:16 vertical videos (DELIVERABLES)
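
A short sketch of how such a tree could be created with `pathlib`. The `create_output_dirs` name is hypothetical, and the `%Y%m%d_%H%M%S` timestamp format is inferred from the example `processing_id` (`20240115_143022`) shown in the API response, not confirmed by the source.

```python
from datetime import datetime
from pathlib import Path

SUBDIRS = ("audio", "video", "transcription", "llm", "shorts")


def create_output_dirs(root, now=None):
    """Create output/{processing_id}/ with its subdirectories and return the base path."""
    now = now or datetime.now()
    # Assumed ID format, inferred from the example processing_id.
    processing_id = now.strftime("%Y%m%d_%H%M%S")
    base = Path(root) / processing_id
    for name in SUBDIRS:
        (base / name).mkdir(parents=True, exist_ok=True)
    return base
```

Note that a second-resolution timestamp can collide if two requests start within the same second; a real service would need to account for that.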

Development

# Install dependencies
pip install -r requirements.txt

# Run with auto-reload
uvicorn main:app --reload

License

Version 1.0 — 2025

1. Purpose

This project is made public solely for the purpose of technical evaluation (e.g., job interviews, portfolio showcase). No permission is granted for any other use.

2. No Permission to Use, Copy, or Modify

Except for viewing the source code directly on GitHub, you are NOT allowed to:

use this code, in whole or in part, for any purpose
copy it
modify it
distribute it
reproduce it
build upon it
incorporate it into your own software
use it in any commercial or non-commercial context
use it for machine learning training, dataset creation, or code generation

All of these actions are strictly prohibited.

3. No Redistribution

You may not redistribute the code in any form, including forks, clones, mirrors, or archives. Forking is disabled at the repository level, but any attempt to bypass this restriction is strictly prohibited.

4. No Warranty

This project is provided “as is”, without any warranties of any kind.

5. Automatic License Termination

If you violate any of the above terms:

all permissions immediately terminate
you must destroy all copies of the code
you may be liable for damages under applicable law

6. Ownership

All intellectual property rights remain with Mesaros David, the sole creator of VivaStage.

Built with FastAPI, OpenCV, and AI | Showcasing modern Python backend development practices
