SuperMonstor/auto-edit

Auto Edit

AI-powered video editing that transforms raw footage into engaging short-form content.

Auto Edit uses speech transcription, audio analysis, and LLM-based narrative intelligence to automatically identify the most compelling moments in your videos and assemble them into polished, story-driven clips.


How It Works

Upload Video(s)
      ↓
┌─────────────────────────────────────────────────────────┐
│  TRANSCRIPTION & ANALYSIS                               │
│  • Whisper extracts word-level transcripts              │
│  • Librosa computes energy, pitch, and pause metrics    │
│  • Filler words detected and flagged                    │
└─────────────────────────────────────────────────────────┘
      ↓
┌─────────────────────────────────────────────────────────┐
│  STAGE 1: NARRATIVE ASSEMBLY (Quality-First)            │
│  • Sentences scored for engagement                      │
│  • LLM selects best clips for story arc                 │
│  • Hook optimization for first 3 seconds                │
│  • Cross-video mixing for multi-clip projects           │
└─────────────────────────────────────────────────────────┘
      ↓
┌─────────────────────────────────────────────────────────┐
│  STAGE 2: LENGTH OPTIMIZATION                           │
│  • Trim to target duration (~45s)                       │
│  • Remove low-engagement content                        │
│  • Preserve "sacred elements" (hook, core message)      │
└─────────────────────────────────────────────────────────┘
      ↓
FFmpeg assembles final video with frame-accurate cuts
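
As a rough illustration of the final assembly step, the sketch below builds FFmpeg command lines that cut a clip by re-encoding (so the cut does not have to land on a keyframe) and then join the clips with the concat demuxer. The function names and file names are hypothetical and are not taken from assembly.py; only the FFmpeg options themselves are standard.

```python
from typing import List, Sequence


def build_cut_command(src: str, start: float, end: float, dst: str) -> List[str]:
    """Build an ffmpeg argv that re-encodes src[start:end] into dst (hypothetical helper)."""
    if end <= start:
        raise ValueError("end must be greater than start")
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",       # seek to the clip start
        "-i", src,
        "-t", f"{end - start:.3f}",  # keep this many seconds
        "-c:v", "libx264",           # re-encode so the cut need not sit on a keyframe
        "-c:a", "aac",
        dst,
    ]


def build_concat_list(clips: Sequence[str]) -> str:
    """Return the text of a concat-demuxer list file, one clip per line."""
    lines = []
    for path in clips:
        escaped = path.replace("'", "'\\''")  # escape single quotes inside the path
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def build_concat_command(list_file: str, dst: str) -> List[str]:
    """Build an ffmpeg argv that joins the listed clips without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", dst]


if __name__ == "__main__":
    print(" ".join(build_cut_command("input.mp4", 12.5, 15.0, "clip0.mp4")))
    print(build_concat_list(["clip0.mp4", "clip1.mp4"]), end="")
    print(" ".join(build_concat_command("clips.txt", "final.mp4")))
```

The argv lists can be passed to subprocess.run. Stream copy in the concat step assumes every clip was encoded with the same settings, which holds when they all come from build_cut_command.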

Key Features

Intelligent Narrative Selection

The system doesn't just cut for length; it understands story structure. Using Claude or GPT-4, it identifies:

  • Hooks that grab attention in the first 3 seconds
  • Core value statements that deliver the main message
  • Proof/demo moments that back up claims
  • Natural transitions that maintain flow

Two-Stage Quality-First Pipeline

A key architectural decision: build the best possible story first, then optimize for length. This ensures the narrative never suffers from premature length constraints.
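
The separation of the two stages can be sketched as follows. This is a simplified illustration, not the project's implementation: Stage 1 is stood in for by a plain score threshold (the real pipeline uses an LLM), and the Sentence fields, function names, and thresholds are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sentence:
    text: str
    duration: float       # seconds
    score: float          # engagement score, higher is better (assumed scale 0..1)
    sacred: bool = False  # hook or core message, never trimmed


def assemble_story(sentences: List[Sentence], min_score: float) -> List[Sentence]:
    """Stage 1 stand-in: keep strong or sacred sentences, ignoring length entirely."""
    return [s for s in sentences if s.sacred or s.score >= min_score]


def trim_to_length(story: List[Sentence], target_seconds: float) -> List[Sentence]:
    """Stage 2: drop the weakest non-sacred sentences until the story fits."""
    keep = [True] * len(story)
    total = sum(s.duration for s in story)
    weakest_first = sorted(
        (i for i, s in enumerate(story) if not s.sacred),
        key=lambda i: story[i].score,
    )
    for i in weakest_first:
        if total <= target_seconds:
            break
        keep[i] = False
        total -= story[i].duration
    return [s for s, k in zip(story, keep) if k]


if __name__ == "__main__":
    raw = [
        Sentence("hook", 3, 0.9, sacred=True),
        Sentence("a", 20, 0.5),
        Sentence("b", 15, 0.8),
        Sentence("c", 20, 0.3),
        Sentence("core", 10, 0.7, sacred=True),
    ]
    story = assemble_story(raw, min_score=0.4)
    final = trim_to_length(story, target_seconds=45)
    print([s.text for s in final])
```

Because the trimming step only ever sees the output of the story step, the length target cannot influence which sentences are considered good in the first place.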

Word-Level Audio Metrics

Every word is enriched with:

  • Energy (RMS loudness): identifies emphasis
  • Pitch variance: detects questions, excitement
  • Pause duration: natural cut points
  • Filler detection: "um", "uh", "like", "you know"
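
A minimal sketch of three of these metrics, using only the standard library so it stays self-contained. The project itself uses Librosa; the Word type, the function names, and the way "you know" is matched as a two-word filler are assumptions made for illustration. Pitch variance is omitted because it needs a pitch tracker.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

SINGLE_FILLERS = {"um", "uh", "like"}
PAIR_FILLERS = {("you", "know")}


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude; 0.0 for an empty slice."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def word_energy(samples: Sequence[float], sample_rate: int, word: Word) -> float:
    """RMS loudness of the audio samples spanned by one word."""
    lo = int(word.start * sample_rate)
    hi = int(word.end * sample_rate)
    return rms(samples[lo:hi])


def pauses_after(words: List[Word]) -> List[float]:
    """Silence between each word and the next one (0.0 after the last word)."""
    gaps = [max(0.0, nxt.start - cur.end) for cur, nxt in zip(words, words[1:])]
    return gaps + [0.0] if words else []


def flag_fillers(words: List[Word]) -> List[bool]:
    """Mark single-word fillers and both halves of two-word fillers."""
    norm = [w.text.lower().strip(".,!?;:\"'") for w in words]
    flags = [t in SINGLE_FILLERS for t in norm]
    for i in range(len(norm) - 1):
        if (norm[i], norm[i + 1]) in PAIR_FILLERS:
            flags[i] = flags[i + 1] = True
    return flags
```

Note that a word such as "like" is only a filler candidate here; telling a filler from a legitimate use would need context that this sketch does not look at.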

Sentence-Level LLM Processing

By aggregating words into sentences before LLM processing, the system achieves 75-80% token cost reduction while maintaining coherent narrative selection.
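
The idea can be illustrated with a small sketch that groups timed words into sentences at terminal punctuation and compares the size of the two JSON payloads. The splitting rule, field names, and function name are assumptions, and the character count is only a crude proxy for tokens, so the printed ratio is not the figure quoted above.

```python
import json
from typing import Dict, List


def group_into_sentences(words: List[Dict]) -> List[Dict]:
    """Merge word dicts ({"text", "start", "end"}) into sentence dicts."""
    sentences: List[Dict] = []
    current: List[Dict] = []
    for word in words:
        current.append(word)
        if word["text"].endswith((".", "!", "?")):
            sentences.append(_merge(current))
            current = []
    if current:  # trailing words without terminal punctuation
        sentences.append(_merge(current))
    return sentences


def _merge(chunk: List[Dict]) -> Dict:
    return {
        "text": " ".join(w["text"] for w in chunk),
        "start": chunk[0]["start"],
        "end": chunk[-1]["end"],
    }


if __name__ == "__main__":
    words = [
        {"text": "Editing", "start": 0.0, "end": 0.4},
        {"text": "is", "start": 0.4, "end": 0.5},
        {"text": "slow.", "start": 0.5, "end": 0.9},
        {"text": "Auto", "start": 1.2, "end": 1.5},
        {"text": "Edit", "start": 1.5, "end": 1.8},
        {"text": "helps.", "start": 1.8, "end": 2.2},
    ]
    sentences = group_into_sentences(words)
    print(len(json.dumps(words)), "chars at word level")
    print(len(json.dumps(sentences)), "chars at sentence level")
```

The saving comes from dropping the per-word timestamps and field names from the prompt; word-level detail stays available locally for choosing exact cut points.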

Multi-Video Projects

Upload multiple clips and Auto Edit will intelligently weave them into a single, cohesive narrative, mixing freely across sources while maintaining logical flow.


Architecture

┌──────────────────────────────────────────────────────────────────┐
│                         FRONTEND                                 │
│  Next.js + TypeScript + Tailwind CSS                             │
│  Real-time progress tracking • Video preview • Caption editing   │
└──────────────────────────────────────────────────────────────────┘
                              ↓ REST API
┌──────────────────────────────────────────────────────────────────┐
│                          BACKEND                                 │
│  FastAPI + Python 3.13 + Async Background Tasks                  │
├──────────────────────────────────────────────────────────────────┤
│  SERVICES                                                        │
│  ├─ transcription.py     Whisper (faster-whisper, GPU-accel)     │
│  ├─ word_metrics.py      Librosa audio analysis                  │
│  ├─ clip_analyzer.py     Sentence grouping & scoring             │
│  ├─ narrative_assembler.py  LLM story selection (Claude/GPT-4)   │
│  ├─ length_optimizer.py  Duration trimming                       │
│  ├─ assembly.py          FFmpeg video stitching                  │
│  └─ captions.py          SRT/VTT generation                      │
└──────────────────────────────────────────────────────────────────┘
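
For the caption service listed above, SRT output might be produced along these lines. This is a sketch of the SRT file format only; the function names and the (start, end, text) cue shape are assumptions and do not describe captions.py.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) cues as numbered SRT blocks separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0.0, 1.25, "Hello"), (1.25, 2.0, "World")]), end="")
```

WebVTT differs mainly in starting with a WEBVTT header line and using a period instead of a comma in timestamps, so the same cue list can feed both formats.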

Tech Stack

Layer             Technology
Frontend          Next.js, TypeScript, Tailwind CSS
Backend           FastAPI, Python 3.13, Uvicorn
Transcription     Whisper (faster-whisper with GPU acceleration)
Audio Analysis    Librosa, NumPy, SciPy
LLM               Claude (Anthropic) or GPT-4 (OpenAI)
Video Processing  FFmpeg (H.264 encoding, concatenation)
ML/NLP            PyTorch, HuggingFace Transformers

Hardware Acceleration

Auto-detects and uses the best available hardware:

  • Apple Silicon (M1/M2/M3) via VideoToolbox
  • NVIDIA CUDA for GPU-accelerated transcription
  • Intel Quick Sync for video encoding
  • CPU fallback with int8 quantization
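
One way such detection could work is sketched below. It is an assumption about the approach, not the project's code: the encoder is chosen by scanning the text printed by `ffmpeg -hide_banner -encoders`, and the transcription settings use the `device` and `compute_type` arguments that faster-whisper's WhisperModel accepts. The function names and the preference order are hypothetical.

```python
from typing import Dict

# Hardware H.264 encoders to prefer, in order, before falling back to software.
PREFERRED_ENCODERS = ("h264_videotoolbox", "h264_qsv")


def pick_h264_encoder(ffmpeg_encoders_output: str) -> str:
    """Pick an encoder name from the output of `ffmpeg -hide_banner -encoders`."""
    for name in PREFERRED_ENCODERS:
        if name in ffmpeg_encoders_output:
            return name
    return "libx264"  # CPU fallback


def whisper_settings(cuda_available: bool) -> Dict[str, str]:
    """Keyword arguments for faster-whisper's WhisperModel on GPU or CPU."""
    if cuda_available:
        return {"device": "cuda", "compute_type": "float16"}
    return {"device": "cpu", "compute_type": "int8"}


if __name__ == "__main__":
    listing = " V....D h264_videotoolbox    VideoToolbox H.264 Encoder"
    print(pick_h264_encoder(listing))
    print(whisper_settings(cuda_available=False))
```

An encoder appearing in the listing only means FFmpeg was built with it; a robust check would also try a short test encode before committing to a hardware encoder.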

Engineering Decisions

This project is guided by Architecture Decision Records (ADRs) documenting key technical choices:

ADR   Decision
0002  Human-equivalent editing approach with on-demand metrics
0003  Three-pass editing system (evolved into two-stage)
0004  Two-stage quality-first pipeline: separate "what makes a good story" from "what fits in time"
0005  Sentence-level LLM processing for 75% token savings
0006  Phased audio-first approach (visual analysis planned)

Project Structure

.
├── api/                    # Python FastAPI backend
│   ├── services/           # Core processing services
│   │   ├── transcription.py       # Whisper integration
│   │   ├── word_metrics.py        # Audio feature extraction
│   │   ├── clip_analyzer.py       # Sentence segmentation
│   │   ├── narrative_assembler.py # LLM narrative selection
│   │   ├── length_optimizer.py    # Duration optimization
│   │   └── assembly.py            # FFmpeg video assembly
│   ├── routers/            # API endpoints
│   └── config.py           # Environment configuration
├── web/                    # Next.js frontend
│   ├── app/                # App router pages
│   └── components/         # React components
├── docs/adr/               # Architecture Decision Records
├── packages/               # Shared TypeScript packages
│   ├── shared/             # Types and utilities
│   └── ui/                 # Shared UI components
├── worker/                 # Background job worker (planned)
└── infra/                  # Infrastructure as Code (planned)

Quick Start

Prerequisites

  • Python 3.11+
  • Node.js 18+
  • pnpm 8+
  • FFmpeg installed

Setup

# Clone and install dependencies
git clone https://github.com/your-username/auto-edit.git
cd auto-edit
pnpm install

# Configure API key
cp api/.env.example api/.env.local
# Edit api/.env.local and add: ANTHROPIC_API_KEY=sk-ant-...

# Start the API server
pnpm dev:api

# In another terminal, start the frontend
pnpm dev

Verify Installation


Configuration

Key environment variables in api/.env.local:

# Required
ANTHROPIC_API_KEY=sk-ant-...          # Get from console.anthropic.com

# Transcription
WHISPER_MODEL=base                     # tiny | base | small | medium | large-v3

# LLM Provider
LLM_PROVIDER=anthropic                 # anthropic | openai
LLM_MODEL=claude-sonnet-4-5            # Model for narrative selection

# Content Filtering
SILENCE_THRESHOLD=0.1                  # Silence detection (0.0-1.0)
EXTREME_FILLER_THRESHOLD=0.5           # Filler ratio threshold

See api/ENV_CONFIG.md for full documentation.
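
To show how these variables might be read and validated, here is a minimal sketch. It is not the contents of api/config.py: the Settings class and load_settings function are hypothetical, the defaults simply reuse the example values above, and it assumes the variables are already present in the process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping

WHISPER_MODELS = {"tiny", "base", "small", "medium", "large-v3"}
LLM_PROVIDERS = {"anthropic", "openai"}


@dataclass(frozen=True)
class Settings:
    anthropic_api_key: str
    whisper_model: str
    llm_provider: str
    llm_model: str
    silence_threshold: float
    extreme_filler_threshold: float


def _ratio(env: Mapping[str, str], name: str, default: str) -> float:
    """Read a float that must lie in the documented 0.0-1.0 range."""
    value = float(env.get(name, default))
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    return value


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, failing fast on bad values."""
    key = env.get("ANTHROPIC_API_KEY", "")
    if not key:
        raise ValueError("ANTHROPIC_API_KEY is required")
    model = env.get("WHISPER_MODEL", "base")
    if model not in WHISPER_MODELS:
        raise ValueError(f"unknown WHISPER_MODEL: {model}")
    provider = env.get("LLM_PROVIDER", "anthropic")
    if provider not in LLM_PROVIDERS:
        raise ValueError(f"unknown LLM_PROVIDER: {provider}")
    return Settings(
        anthropic_api_key=key,
        whisper_model=model,
        llm_provider=provider,
        llm_model=env.get("LLM_MODEL", "claude-sonnet-4-5"),
        silence_threshold=_ratio(env, "SILENCE_THRESHOLD", "0.1"),
        extreme_filler_threshold=_ratio(env, "EXTREME_FILLER_THRESHOLD", "0.5"),
    )
```

Passing the environment as a mapping keeps the function easy to test with a plain dict instead of real environment variables.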


API Endpoints

Endpoint               Method  Description
/upload                POST    Upload video file(s)
/videos/{id}/status    GET     Check processing status
/videos/{id}/result    GET     Get final video + metadata
/projects              POST    Create multi-video project
/projects/{id}/status  GET     Project processing status
/projects/{id}/result  GET     Get assembled video

Full API documentation available at /docs when the server is running.
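
A client would typically upload, poll the status endpoint, then fetch the result. The sketch below covers only the polling part and rests on several assumptions that the endpoint table does not confirm: the base URL, a JSON response containing a "status" field, and the terminal values "completed" and "failed". Check /docs on a running server for the actual response schema.

```python
import time
from typing import Callable, Dict


def status_url(base_url: str, video_id: str) -> str:
    """URL of the documented /videos/{id}/status endpoint."""
    return f"{base_url.rstrip('/')}/videos/{video_id}/status"


def result_url(base_url: str, video_id: str) -> str:
    """URL of the documented /videos/{id}/result endpoint."""
    return f"{base_url.rstrip('/')}/videos/{video_id}/result"


def wait_for_completion(
    fetch_status: Callable[[], Dict],
    *,
    interval: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch_status until it reports a terminal state (assumed field and values)."""
    for _ in range(max_polls):
        payload = fetch_status()
        if payload.get("status") in ("completed", "failed"):  # hypothetical values
            return payload
        sleep(interval)
    raise TimeoutError(f"no terminal status after {max_polls} polls")


if __name__ == "__main__":
    # Stand-in for an HTTP GET on status_url(...); replace with a real request.
    responses = iter([{"status": "processing"}, {"status": "completed"}])
    final = wait_for_completion(lambda: next(responses), interval=0.0)
    print(status_url("http://localhost:8000", "abc123"), final)
```

Injecting fetch_status keeps the polling loop independent of any particular HTTP library.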


Development

# API development (with auto-reload)
pnpm dev:api

# Frontend development
pnpm dev

# Run API tests
cd api && make test

# Type checking
pnpm typecheck

License

MIT


Acknowledgments
