Local AI-powered transcription for YouTube videos and local media files. Runs entirely on your hardware with GPU acceleration.
v1.1.0 NEW: Speaker Identification — Create a voice profile and extract only YOUR voice from videos!
- 🚀 GPU Accelerated - CUDA support via faster-whisper (CTranslate2)
- 🎯 High Accuracy - Whisper large-v3 model with word-level timestamps
- 📺 YouTube Support - Videos, playlists, and channels
- 📁 Local Files - MP4, MKV, MOV, MP3, WAV, and many more formats
- 🎤 Speaker Identification - Extract only YOUR voice from multi-speaker videos (NEW in v1.1.0)
- 🗣️ Speaker Diarization - Identify who said what (optional)
- 📊 Multiple Formats - SRT, VTT, TXT, JSON output
- 🔄 Batch Processing - Process folders, playlists, or URL lists
- 🌐 Web Interface - Drag-and-drop uploads with real-time progress
- 💾 Resume Support - Skip already-processed files
- 🏠 Fully Local - No cloud APIs, everything on your machine
Speaker Identification lets you:
- Create a "voice profile" from a sample of you speaking
- Automatically find and extract YOUR segments from any video
- Perfect for pulling your voice out of family videos, meetings, or podcasts
```bash
# One-time: Create your voice profile
python voice_profile.py create kyle ~/Videos/me_talking.mp4
# Extract your voice from any video
python extract_speaker.py kyle family_reunion.mp4
# Output: just your text, SRT subtitles, or an audio-only file
```

Quick start (Docker web UI):

```bash
git clone https://github.com/kylefoxaustin/media-transcriber.git
cd media-transcriber
docker compose up -d
# Open http://localhost:8000
```

Quick start (native CLI):

```bash
git clone https://github.com/kylefoxaustin/media-transcriber.git
cd media-transcriber
bash setup.sh
source venv/bin/activate
# Transcribe
python transcribe.py https://www.youtube.com/watch?v=VIDEO_ID
python transcribe.py /path/to/video.mp4
```

To create a voice profile, you need a sample where only you are speaking (30-60 seconds is ideal):

```bash
# From a video file
python voice_profile.py create kyle ~/Videos/me_presenting.mp4
# From a specific time range (you at 0:30-1:30)
python voice_profile.py create kyle video.mp4 --start 30 --end 90
# From an audio file
python voice_profile.py create kyle my_voice_memo.m4a
```

Extracting your voice from a video:

```bash
# Extract your segments (outputs text, SRT, audio, JSON)
python extract_speaker.py kyle family_video.mp4
# Just get the text
python extract_speaker.py kyle video.mp4 --output-text
# Just get subtitles of your parts
python extract_speaker.py kyle video.mp4 --output-srt
# Just get audio of you speaking
python extract_speaker.py kyle video.mp4 --output-audio
# Lower threshold = more matches (may include false positives)
python extract_speaker.py kyle video.mp4 --threshold 0.5
```

Managing profiles:

```bash
# List all profiles
python voice_profile.py list
# Test if a file matches your voice
python voice_profile.py test kyle unknown_video.mp4
# Delete a profile
python voice_profile.py delete kyle
# View profile details
python voice_profile.py info kyle
```

How speaker identification works:

- Voice Embedding: Uses SpeechBrain's ECAPA-TDNN model to create a 192-dimensional "voiceprint" from your sample
- Transcription: Whisper transcribes the target video into segments
- Matching: Each segment's audio is compared to your voiceprint using cosine similarity
- Extraction: Segments above the threshold (default 0.6) are extracted (see the sketch below)
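For illustration, here is a minimal sketch of the embedding-and-matching step, assuming speechbrain and torchaudio are installed. The file names and helper function are hypothetical; this shows the technique, not the repo's actual internals:

```python
# Minimal sketch: ECAPA-TDNN voiceprints compared by cosine similarity.
import torch
import torchaudio
from speechbrain.inference import EncoderClassifier  # speechbrain.pretrained on older versions

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def voiceprint(path: str) -> torch.Tensor:
    """Return a 192-dim embedding for an audio file."""
    wav, sr = torchaudio.load(path)                       # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)                   # downmix to mono
    wav = torchaudio.functional.resample(wav, sr, 16000)  # model expects 16 kHz
    return encoder.encode_batch(wav).squeeze()            # (192,)

profile = voiceprint("me_talking.wav")     # your enrollment sample
segment = voiceprint("segment_0001.wav")   # one transcribed segment's audio
score = torch.nn.functional.cosine_similarity(profile, segment, dim=0).item()
print(f"similarity={score:.3f}  match={score >= 0.6}")   # 0.6 = default threshold
```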
Tips for a good voice profile:

- Good samples: Clear audio, just your voice, minimal background noise
- 30-60 seconds: Enough to capture your voice characteristics
- Multiple samples: You can add more samples to improve accuracy: `python voice_profile.py create kyle sample1.wav sample2.wav sample3.wav`
- Threshold tuning: Start at 0.6, lower to 0.5 if missing segments, raise to 0.7 if getting false positives
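If you are unsure where to land, one option is to sweep the documented `--threshold` flag and inspect each run's output. A hypothetical convenience script, not part of the repo:

```python
# Hypothetical sweep over extract_speaker.py thresholds.
# Uses only CLI flags documented above; review each run before choosing.
import subprocess

for t in (0.4, 0.5, 0.6, 0.7):
    print(f"--- threshold {t} ---")
    subprocess.run(
        ["python", "extract_speaker.py", "kyle", "video.mp4",
         "--threshold", str(t), "--output-text"],
        check=True,
    )
```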
Transcription usage:

```bash
# YouTube video
python transcribe.py https://www.youtube.com/watch?v=VIDEO_ID
# Local video file
python transcribe.py /path/to/video.mp4
# Local audio file
python transcribe.py /path/to/podcast.mp3
# Multiple files
python transcribe.py video1.mp4 video2.mov audio.mp3
# From a text file (URLs or paths, one per line)
python transcribe.py --file inputs.txt
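# Example inputs.txt contents (illustrative):
#   https://www.youtube.com/watch?v=VIDEO_ID
#   /path/to/local/video.mp4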
# YouTube playlist
python transcribe.py --playlist "https://www.youtube.com/playlist?list=PLxxxxx"
# Folder of media files
python transcribe.py --folder /path/to/videos/
# Folder recursive (include subfolders)
python transcribe.py --folder /path/to/videos/ --recursive
```

Whisper model options:

| Model | VRAM | Speed | Quality |
|---|---|---|---|
| tiny | ~1GB | ⚡⚡⚡⚡⚡ | ★★☆☆☆ |
| base | ~1GB | ⚡⚡⚡⚡ | ★★★☆☆ |
| small | ~2GB | ⚡⚡⚡ | ★★★☆☆ |
| medium | ~5GB | ⚡⚡ | ★★★★☆ |
| large-v3 | ~10GB | ⚡ | ★★★★★ |
```bash
# Use a smaller/faster model
python transcribe.py --model medium video.mp4
# Use CPU (slower but no GPU needed)
python transcribe.py --device cpu --model small video.mp4
```

Web interface:

```bash
docker compose up -d
# Open http://localhost:8000
```

- YouTube Tab: Paste URLs, watch real-time progress
- Upload Tab: Drag-and-drop local files
- Download results in SRT, VTT, TXT, JSON formats
Container management:

```bash
./manage.sh start   # Start container
./manage.sh stop # Stop container
./manage.sh logs # View live logs
./manage.sh gpu     # Verify GPU access
```

Output layout:

```
output/
├── transcripts/
│ └── VIDEO_ID/
│ ├── metadata.json
│ ├── Video_Title.srt
│ ├── Video_Title.vtt
│ ├── Video_Title.txt
│ └── Video_Title.json
│
└── speaker_extract/ # Speaker ID outputs
├── video_kyle.txt # Just your text
├── video_kyle.srt # Your subtitles
├── video_kyle.wav # Audio of just you
└── video_kyle.json    # Full data with confidence scores
```
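The `video_kyle.json` file carries the full per-segment data. A hypothetical way to post-process it; the field names here are assumptions, so check an actual output file for the real schema:

```python
# Hypothetical post-processing of the speaker-extract JSON.
# Field names ("segments", "start", "end", "confidence", "text") are
# assumptions for illustration, not a documented schema.
import json

with open("output/speaker_extract/video_kyle.json") as f:
    data = json.load(f)

for seg in data["segments"]:
    print(f'{seg["start"]:7.1f}-{seg["end"]:7.1f}s  '
          f'conf={seg["confidence"]:.2f}  {seg["text"]}')
```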
Supported formats:

- Video: MP4, MKV, AVI, MOV, WMV, FLV, WebM, M4V, MPEG, MPG, 3GP
- Audio: MP3, WAV, M4A, AAC, OGG, FLAC, WMA, Opus
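For scripting around the tool, here is a small sketch, not repo code, that mirrors what `--folder`/`--recursive` scanning presumably does, using the extension lists above:

```python
# Illustrative helper mirroring --folder/--recursive scanning (assumption,
# not the repo's actual implementation).
from pathlib import Path

VIDEO = {".mp4", ".mkv", ".avi", ".mov", ".wmv", ".flv",
         ".webm", ".m4v", ".mpeg", ".mpg", ".3gp"}
AUDIO = {".mp3", ".wav", ".m4a", ".aac", ".ogg", ".flac", ".wma", ".opus"}

def media_files(folder: str, recursive: bool = True):
    """Yield supported media files under folder."""
    pattern = "**/*" if recursive else "*"
    for path in sorted(Path(folder).glob(pattern)):
        if path.suffix.lower() in VIDEO | AUDIO:
            yield path

if __name__ == "__main__":
    for f in media_files("/path/to/videos"):
        print(f)
```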
Installation with the setup script:

```bash
git clone https://github.com/kylefoxaustin/media-transcriber.git
cd media-transcriber
bash setup.sh
source venv/bin/activate
```

Or install manually:

```bash
# System dependencies
sudo apt update
sudo apt install ffmpeg python3-pip python3-venv
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install packages
pip install -r requirements.txt
```

Troubleshooting: if you get `Unable to load libcudnn_ops.so`:

```bash
sudo apt install libcudnn9-cuda-12
sudo ldconfig
```

Requirements:

- Python 3.10+
- NVIDIA GPU with CUDA support (or CPU fallback)
- ~10GB disk space for Whisper large-v3 model
- ~1GB for speaker recognition model
- ffmpeg
- Docker (optional, for web UI)
Performance:

RTX 5090 + large-v3:
- Transcription: ~15x realtime (1 hour video ≈ 4 minutes)
- Speaker ID: ~30 segments/minute
RTX 3080 + large-v3:
- Transcription: ~5x realtime (1 hour video ≈ 12 minutes)
If transcription runs out of GPU memory, switch to a smaller model:

```bash
python transcribe.py --model medium video.mp4
```

If the speaker-identification dependencies are missing:

```bash
pip install speechbrain torchaudio
```

If speaker identification misses segments that are yours:

- Use a cleaner voice sample (just you, no background noise)
- Try a longer sample (60+ seconds)
- Lower the threshold: `--threshold 0.5`
If no segments match at all:

- The threshold might be too high: try `--threshold 0.4`
- The voice sample quality might be poor
- There may be background noise in the target video
Author: Kyle Fox (GitHub: kylefoxaustin)
MIT License - Use freely for personal and commercial projects.
Built with:

- faster-whisper
- SpeechBrain - Speaker verification
- yt-dlp
- FastAPI
- OpenAI Whisper
Changelog:

- v1.1.0 - Speaker identification: voice profiles and speaker extraction
- v1.0.0 - Initial release: YouTube + local file transcription