```bash
# Run setup script
./setup.sh

# Or manually:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Edit the `.env` file:
```bash
# Required
YOUTUBE_API_KEY=your_key_here         # Already set
YOUTUBE_CHANNEL_URL=your_channel_url  # Add your channel
HF_TOKEN=your_huggingface_token       # Required for speaker diarization

# Optional
OPENAI_API_KEY=your_openai_key        # Only if using Whisper API (not needed for local)
```

Important: Get a Hugging Face token:
- Go to https://huggingface.co/settings/tokens
- Create a new token
- Accept the user agreement at https://huggingface.co/pyannote/speaker-diarization
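For reference, here is a minimal sketch of how a script might read these values from `.env`. This is an illustrative assumption — the actual scripts may use a library such as `python-dotenv` instead, and this simple parser does not handle values containing `#`:

```python
def load_env(path=".env"):
    """Parse simple KEY=value lines from a .env file into a dict.

    Skips blank lines and comments; strips inline comments after '#'.
    (Illustrative sketch -- the real scripts may use python-dotenv.)
    """
    env = {}
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comment portion
            if "=" in line:
                key, _, value = line.partition("=")
                env[key.strip()] = value.strip()
    return env

# To make the values visible to libraries that read os.environ:
# for key, value in load_env().items():
#     os.environ.setdefault(key, value)
```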
This analyzes your voice from the reference video so the system can identify you vs. guests:

```bash
source venv/bin/activate
python create_voice_profile.py
```

This will:
- Download the reference video (https://www.youtube.com/watch?v=HN_hOuyXUkc)
- Extract your voice from 0:29 onward
- Create a voice embedding/profile
- Save it to `voice_profiles/jon_radoff_voice_profile.pkl`
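Conceptually, the saved profile is a fixed-length voice embedding, and each speaker detected later is compared against it. A minimal sketch of that comparison using cosine similarity — pure Python here for illustration; in the real pipeline the embeddings come from SpeechBrain's ECAPA-TDNN model, and the 0.25 threshold is an assumed value, not the script's actual setting:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def label_speaker(profile_embedding, segment_embedding, threshold=0.25):
    """Label a diarized segment as the profiled host or a guest.

    `threshold` is an assumed cutoff; the real script may tune it differently.
    """
    if cosine_similarity(profile_embedding, segment_embedding) >= threshold:
        return "Jon Radoff"
    return "Guest"

# The profile itself is just a pickled embedding, e.g.:
# with open("voice_profiles/jon_radoff_voice_profile.pkl", "rb") as f:
#     profile = pickle.load(f)
```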
Before processing all videos, test with one:

```bash
python test_single_video.py https://www.youtube.com/watch?v=VIDEO_ID
```

Check the output in `output/VIDEO_ID.txt`.
```bash
python transcribe_channel.py
```

This will:
- Fetch all videos from your YouTube channel
- For each video:
  - Download the video
  - Extract audio
  - Transcribe with Whisper (state-of-the-art accuracy)
  - Perform speaker diarization (detect when different people speak)
  - Identify which speaker is you vs. guests
  - Try to extract guest names from video titles/descriptions
  - Generate a formatted transcript with timestamps and speaker labels
- Save transcripts to the `output/` directory
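The key join in the per-video loop is matching Whisper's timed segments to pyannote's speaker turns. A simplified sketch of that alignment by timestamp overlap — the data shapes and names here are assumptions for illustration, not the script's actual structures:

```python
def assign_speakers(transcript_segments, speaker_turns):
    """Attach a speaker label to each transcribed segment.

    transcript_segments: [(start, end, text), ...] from transcription
    speaker_turns:       [(start, end, label), ...] from diarization
    Picks the speaker whose turn overlaps the segment the most.
    """
    labeled = []
    for seg_start, seg_end, text in transcript_segments:
        best_label, best_overlap = "Unknown", 0.0
        for turn_start, turn_end, label in speaker_turns:
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_label, best_overlap = label, overlap
        labeled.append((seg_start, best_label, text))
    return labeled
```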
Transcripts are saved as `output/VIDEO_ID.txt`:

```text
Video: My Amazing Interview with Jane Doe
URL: https://www.youtube.com/watch?v=abc123
Published: 2024-01-15
Transcribed: transcriptor
---
[0:00:00] Jon Radoff: Welcome to the show everyone.
[0:00:15] Jon Radoff: Today we have a very special guest.
[0:00:29] Jane Doe: Thanks for having me!
[0:00:35] Jon Radoff: Let's dive right in...
```
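The `[H:MM:SS]` prefixes above can be produced from raw second offsets; a small sketch (the function name is illustrative, not necessarily what the script uses):

```python
def format_line(seconds, speaker, text):
    """Render one transcript line like '[0:00:29] Jane Doe: Thanks!'."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"[{hours}:{minutes:02d}:{secs:02d}] {speaker}: {text}"
```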
The system uses multiple state-of-the-art models:
- Whisper (OpenAI): Speech-to-text transcription
  - Using the `large-v3` model for maximum accuracy
  - Generates word-level timestamps
- pyannote.audio: Speaker diarization
  - Detects when different speakers talk
  - Creates speaker segments with timestamps
- SpeechBrain ECAPA-TDNN: Speaker verification
  - Generates voice embeddings
  - Compares speakers against your voice profile
  - Identifies "Jon Radoff" vs. "Guest" speakers
- Guest Name Detection:
  - Analyzes video titles and descriptions
  - Looks for patterns like "with [Name]", "[Name] on...", etc.
  - Automatically labels guests when detected
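That kind of pattern matching can be sketched with a couple of regular expressions. The two patterns below are illustrative assumptions covering the "with [Name]" and "[Name] on..." cases; the script may use a larger or different set:

```python
import re

# Illustrative patterns: "... with Jane Doe", "Jane Doe on ..."
GUEST_PATTERNS = [
    re.compile(r"\bwith\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)"),
    re.compile(r"^([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\s+on\b"),
]

def extract_guest_name(title):
    """Return a likely guest name from a video title, or None."""
    for pattern in GUEST_PATTERNS:
        match = pattern.search(title)
        if match:
            return match.group(1)
    return None
```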
First run will download several models (~5GB total):
- Whisper large-v3 (~3GB)
- pyannote diarization (~500MB)
- SpeechBrain speaker recognition (~500MB)
- Get token at https://huggingface.co/settings/tokens
- Accept agreement at https://huggingface.co/pyannote/speaker-diarization
- Add it to the `.env` file
- Run `python create_voice_profile.py` first
- Reduce the Whisper model size in `transcribe_channel.py`:
  - Change `load_model("large-v3")` to `load_model("base")` or `load_model("medium")`
  - Accuracy will be slightly lower, but memory usage will be much lower
- Ensure the reference video has clear audio of your voice
- Try adjusting `REFERENCE_START_TIME` in `.env` to a section with clearer speech
- Re-run `create_voice_profile.py`
- Whisper large-v3 is very accurate but slow
- For faster processing, use a smaller model (medium, small, or base)
- Consider using a GPU if available (automatic with PyTorch + CUDA)
```text
transcriptor/
├── .env                      # Configuration
├── requirements.txt          # Python dependencies
├── setup.sh                  # Setup script
├── create_voice_profile.py   # Step 1: Create voice profile
├── youtube_fetcher.py        # Fetch videos from channel
├── transcribe_channel.py     # Main transcription script
├── test_single_video.py      # Test single video
├── downloads/                # Downloaded videos (can delete to save space)
├── output/                   # Transcription output files
└── voice_profiles/           # Your voice profile
    └── jon_radoff_voice_profile.pkl
```
- Save Space: Delete files in `downloads/` after transcription to free disk space
- Batch Processing: The script processes videos sequentially; for faster processing with multiple GPUs, you could modify it to process in parallel
- Re-transcribe: Delete a video's `.txt` file in `output/` to re-transcribe that video
- Custom Voice Segments: Edit `.env` to use a different reference video or timestamp for voice profiling
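The batch-processing tip could look roughly like the sketch below. `transcribe_video` is a hypothetical stand-in for the per-video work (download, transcribe, diarize, save), stubbed here to just return the path it would write; a real multi-GPU version would also pin each worker to a device:

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe_video(video_id):
    """Hypothetical per-video worker, stubbed for illustration."""
    return f"output/{video_id}.txt"

def transcribe_all(video_ids, workers=2):
    """Process several videos concurrently instead of sequentially."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order in its results
        return list(pool.map(transcribe_video, video_ids))
```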