AI-powered video editing that transforms raw footage into engaging short-form content.
Auto Edit uses speech transcription, audio analysis, and LLM-based narrative intelligence to automatically identify the most compelling moments in your videos and assemble them into polished, story-driven clips.
```
Upload Video(s)
      │
      ▼
┌───────────────────────────────────────────────────────────┐
│ TRANSCRIPTION & ANALYSIS                                  │
│ • Whisper extracts word-level transcripts                 │
│ • Librosa computes energy, pitch, and pause metrics       │
│ • Filler words detected and flagged                       │
└───────────────────────────────────────────────────────────┘
      │
      ▼
┌───────────────────────────────────────────────────────────┐
│ STAGE 1: NARRATIVE ASSEMBLY (Quality-First)               │
│ • Sentences scored for engagement                         │
│ • LLM selects best clips for story arc                    │
│ • Hook optimization for first 3 seconds                   │
│ • Cross-video mixing for multi-clip projects              │
└───────────────────────────────────────────────────────────┘
      │
      ▼
┌───────────────────────────────────────────────────────────┐
│ STAGE 2: LENGTH OPTIMIZATION                              │
│ • Trim to target duration (~45s)                          │
│ • Remove low-engagement content                           │
│ • Preserve "sacred elements" (hook, core message)         │
└───────────────────────────────────────────────────────────┘
      │
      ▼
FFmpeg assembles final video with frame-accurate cuts
```
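Frame accuracy in the final assembly step comes down to how the FFmpeg commands are built. As a hedged sketch (function and file names here are illustrative, not the project's actual API): each selected clip is re-encoded rather than stream-copied, because stream copy can only cut on keyframes.

```python
def cut_args(src: str, start: float, end: float, out: str) -> list[str]:
    """Build an ffmpeg command extracting [start, end) of src into out."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-ss", f"{start:.3f}",   # output-side seek: slower but sample-accurate
        "-to", f"{end:.3f}",
        "-c:v", "libx264",       # re-encode so cuts are not snapped to keyframes
        "-c:a", "aac",
        out,
    ]

args = cut_args("talk.mp4", 12.5, 19.25, "clip_000.mp4")
```

The re-encoded clips can then be joined losslessly with FFmpeg's concat demuxer, since they share one codec and set of parameters.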
The system doesn't just cut for length; it understands story structure. Using Claude or GPT-4, it identifies:
- Hooks that grab attention in the first 3 seconds
- Core value statements that deliver the main message
- Proof/demo moments that back up claims
- Natural transitions that maintain flow
A key architectural decision: build the best possible story first, then optimize for length. This ensures the narrative never suffers from premature length constraints.
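The two-stage split can be sketched in a few lines. This is a minimal illustration of the idea, not the actual implementation; the threshold, scores, and class names are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class Clip:
    text: str
    duration: float
    score: float          # engagement score from audio/LLM analysis
    sacred: bool = False  # hook or core message: never trimmed

def assemble_story(clips: list[Clip]) -> list[Clip]:
    """Stage 1: quality-first — keep what makes the best story, ignore length."""
    return [c for c in clips if c.score > 0.3]  # illustrative threshold

def optimize_length(story: list[Clip], target: float = 45.0) -> list[Clip]:
    """Stage 2: drop lowest-scoring non-sacred clips until under target."""
    story = list(story)
    while sum(c.duration for c in story) > target:
        candidates = [c for c in story if not c.sacred]
        if not candidates:
            break
        story.remove(min(candidates, key=lambda c: c.score))
    return story
```

Because stage 2 only ever removes non-sacred clips, the hook and core message survive even aggressive trimming.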
Every word is enriched with:
- Energy (RMS loudness): identifies emphasis
- Pitch variance: detects questions, excitement
- Pause duration: natural cut points
- Filler detection: "um", "uh", "like", "you know"
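As a NumPy-only sketch of what this enrichment involves (the project uses Librosa; the function names and word-dict shape here are hypothetical): energy is the RMS over each word's samples, and pauses are the gaps between consecutive Whisper word timestamps.

```python
import numpy as np

FILLERS = {"um", "uh", "like", "you know"}  # multi-word fillers would need phrase matching

def rms_energy(samples: np.ndarray) -> float:
    """Root-mean-square loudness of one word's audio samples."""
    return float(np.sqrt(np.mean(samples ** 2)))

def enrich(words: list[dict], sr: int, audio: np.ndarray) -> list[dict]:
    """Attach energy, pause-before, and filler flags to Whisper word dicts."""
    out, prev_end = [], 0.0
    for w in words:
        seg = audio[int(w["start"] * sr):int(w["end"] * sr)]
        out.append({
            **w,
            "energy": rms_energy(seg),
            "pause_before": round(w["start"] - prev_end, 3),
            "is_filler": w["word"].lower().strip() in FILLERS,
        })
        prev_end = w["end"]
    return out
```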
By aggregating words into sentences before LLM processing, the system achieves 75-80% token cost reduction while maintaining coherent narrative selection.
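The saving is easy to see with a back-of-envelope sketch (illustrative only): one prompt item per sentence instead of one per word means the per-item metadata (timestamps, scores) stops repeating for every word.

```python
def group_sentences(words: list[str]) -> list[str]:
    """Merge a word stream into sentences at terminal punctuation."""
    sentences, current = [], []
    for w in words:
        current.append(w)
        if w.endswith((".", "!", "?")):
            sentences.append(" ".join(current))
            current = []
    if current:                      # flush any trailing partial sentence
        sentences.append(" ".join(current))
    return sentences

words = "This is a hook. It grabs you fast. Then we deliver value.".split()
sentences = group_sentences(words)   # 12 word items collapse to 3 sentence items
```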
Upload multiple clips and Auto Edit will intelligently weave them into a single, cohesive narrative, mixing freely across sources while maintaining logical flow.
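One way to picture cross-video mixing (a hypothetical sketch, not the actual selection code): the LLM assigns each chosen clip a position in the story arc, and the timeline simply orders clips by that position, regardless of source video.

```python
def build_timeline(selections: list[tuple[str, str, int]]) -> list[tuple[str, str]]:
    """selections: (source_video, clip_id, narrative_position) tuples."""
    return [(src, clip) for src, clip, _pos
            in sorted(selections, key=lambda s: s[2])]

timeline = build_timeline([
    ("video_b.mp4", "hook", 0),
    ("video_a.mp4", "demo", 2),
    ("video_b.mp4", "value", 1),
])
```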
```
┌───────────────────────────────────────────────────────────────────┐
│                            FRONTEND                               │
│                Next.js + TypeScript + Tailwind CSS                │
│   Real-time progress tracking • Video preview • Caption editing   │
└───────────────────────────────────────────────────────────────────┘
                             │ REST API
┌───────────────────────────────────────────────────────────────────┐
│                            BACKEND                                │
│           FastAPI + Python 3.13 + Async Background Tasks          │
├───────────────────────────────────────────────────────────────────┤
│ SERVICES                                                          │
│ ├─ transcription.py        Whisper (faster-whisper, GPU-accel)    │
│ ├─ word_metrics.py         Librosa audio analysis                 │
│ ├─ clip_analyzer.py        Sentence grouping & scoring            │
│ ├─ narrative_assembler.py  LLM story selection (Claude/GPT-4)     │
│ ├─ length_optimizer.py     Duration trimming                      │
│ ├─ assembly.py             FFmpeg video stitching                 │
│ └─ captions.py             SRT/VTT generation                     │
└───────────────────────────────────────────────────────────────────┘
```
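The caption service's SRT output hinges on one formatting detail: SRT timestamps use comma-separated milliseconds and each cue is numbered. A small sketch (function names are hypothetical, not captions.py's actual API):

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:00:03,500."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """One numbered SRT cue block."""
    return f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n"

cue = srt_cue(1, 3.5, 6.25, "Here's the hook.")
```

VTT differs mainly in using a dot instead of a comma for milliseconds and a `WEBVTT` header.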
| Layer | Technology |
|---|---|
| Frontend | Next.js, TypeScript, Tailwind CSS |
| Backend | FastAPI, Python 3.13, Uvicorn |
| Transcription | Whisper (faster-whisper with GPU acceleration) |
| Audio Analysis | Librosa, NumPy, SciPy |
| LLM | Claude (Anthropic) or GPT-4 (OpenAI) |
| Video Processing | FFmpeg (H.264 encoding, concatenation) |
| ML/NLP | PyTorch, HuggingFace Transformers |
Auto-detects and uses the best available hardware:
- Apple Silicon (M1/M2/M3) via VideoToolbox
- NVIDIA CUDA for GPU-accelerated transcription
- Intel Quick Sync for video encoding
- CPU fallback with int8 quantization
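The priority order of that fallback chain can be expressed as pure logic. This is an illustrative sketch: the real detection presumably probes torch and FFmpeg at runtime, and the encoder names (`h264_nvenc`, `h264_videotoolbox`, `h264_qsv`) are standard FFmpeg encoders assumed here rather than confirmed project choices.

```python
def pick_backend(has_cuda: bool = False,
                 has_videotoolbox: bool = False,
                 has_qsv: bool = False) -> tuple[str, str]:
    """Return (transcription device, video encoder) by priority."""
    if has_cuda:
        return ("cuda", "h264_nvenc")          # GPU-accelerated transcription
    if has_videotoolbox:
        return ("cpu", "h264_videotoolbox")    # Apple Silicon (M1/M2/M3)
    if has_qsv:
        return ("cpu", "h264_qsv")             # Intel Quick Sync
    return ("cpu-int8", "libx264")             # int8-quantized CPU fallback
```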
This project is guided by Architecture Decision Records (ADRs) documenting key technical choices:
| ADR | Decision |
|---|---|
| 0002 | Human-equivalent editing approach with on-demand metrics |
| 0003 | Three-pass editing system (evolved into two-stage) |
| 0004 | Two-stage quality-first pipeline: separate "what makes a good story" from "what fits in time" |
| 0005 | Sentence-level LLM processing for 75% token savings |
| 0006 | Phased audio-first approach (visual analysis planned) |
```
.
├── api/                       # Python FastAPI backend
│   ├── services/              # Core processing services
│   │   ├── transcription.py       # Whisper integration
│   │   ├── word_metrics.py        # Audio feature extraction
│   │   ├── clip_analyzer.py       # Sentence segmentation
│   │   ├── narrative_assembler.py # LLM narrative selection
│   │   ├── length_optimizer.py    # Duration optimization
│   │   └── assembly.py            # FFmpeg video assembly
│   ├── routers/               # API endpoints
│   └── config.py              # Environment configuration
├── web/                       # Next.js frontend
│   ├── app/                   # App router pages
│   └── components/            # React components
├── docs/adr/                  # Architecture Decision Records
├── packages/                  # Shared TypeScript packages
│   ├── shared/                # Types and utilities
│   └── ui/                    # Shared UI components
├── worker/                    # Background job worker (planned)
└── infra/                     # Infrastructure as Code (planned)
```
- Python 3.11+
- Node.js 18+
- pnpm 8+
- FFmpeg installed
```bash
# Clone and install dependencies
git clone https://github.com/your-username/auto-edit.git
cd auto-edit
pnpm install

# Configure API key
cp api/.env.example api/.env.local
# Edit api/.env.local and add: ANTHROPIC_API_KEY=sk-ant-...

# Start the API server
pnpm dev:api

# In another terminal, start the frontend
pnpm dev
```

- API: http://localhost:8000/docs (Swagger UI)
- Frontend: http://localhost:3000
Key environment variables in `api/.env.local`:

```bash
# Required
ANTHROPIC_API_KEY=sk-ant-...    # Get from console.anthropic.com

# Transcription
WHISPER_MODEL=base              # tiny | base | small | medium | large-v3

# LLM Provider
LLM_PROVIDER=anthropic          # anthropic | openai
LLM_MODEL=claude-sonnet-4-5     # Model for narrative selection

# Content Filtering
SILENCE_THRESHOLD=0.1           # Silence detection (0.0-1.0)
EXTREME_FILLER_THRESHOLD=0.5    # Filler ratio threshold
```

See `api/ENV_CONFIG.md` for full documentation.
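A hedged sketch of how such settings might be read with defaults and validation (the project's actual `config.py` is not shown here, so names and behavior are assumptions):

```python
import os

def load_settings(env=os.environ) -> dict:
    """Read pipeline settings from the environment, applying defaults."""
    threshold = float(env.get("SILENCE_THRESHOLD", "0.1"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("SILENCE_THRESHOLD must be within 0.0-1.0")
    return {
        "llm_provider": env.get("LLM_PROVIDER", "anthropic"),
        "whisper_model": env.get("WHISPER_MODEL", "base"),
        "silence_threshold": threshold,
    }

settings = load_settings({"SILENCE_THRESHOLD": "0.2"})
```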
| Endpoint | Method | Description |
|---|---|---|
| `/upload` | POST | Upload video file(s) |
| `/videos/{id}/status` | GET | Check processing status |
| `/videos/{id}/result` | GET | Get final video + metadata |
| `/projects` | POST | Create multi-video project |
| `/projects/{id}/status` | GET | Project processing status |
| `/projects/{id}/result` | GET | Get assembled video |
Full API documentation is available at `/docs` when the server is running.
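Since processing runs as a background task, clients poll the status endpoint until a terminal state. An illustrative polling loop (the `fetch` callable and the `state` field are assumptions, so the pattern stays independent of any HTTP client):

```python
import time

def poll_until_done(fetch, video_id: str, interval: float = 2.0,
                    max_tries: int = 50) -> dict:
    """Call GET /videos/{id}/status until it reports a terminal state."""
    for _ in range(max_tries):
        status = fetch(f"/videos/{video_id}/status")
        if status["state"] in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError("processing did not finish in time")
```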
```bash
# API development (with auto-reload)
pnpm dev:api

# Frontend development
pnpm dev

# Run API tests
cd api && make test

# Type checking
pnpm typecheck
```

MIT
- faster-whisper for efficient transcription
- Anthropic Claude for narrative intelligence
- FFmpeg for video processing