A Python package for creating intelligent closed captions with face detection and speaker recognition.
- Audio Transcription: Powered by OpenAI Whisper for high-quality speech-to-text
- Speaker Diarization: Identifies different speakers in audio
- Face Recognition: Links speakers to known faces for character identification
- Multiple Output Formats: Supports SRT, VTT, and SAMI caption formats
- Voice Activity Detection: Intelligently detects speech vs non-speech segments
- GPU Acceleration: Automatic CUDA support when available
```
pip install captionalchemy
```

If you have a GPU and want to use hardware acceleration:

```
pip install captionalchemy[cuda]
```

- Python 3.10+
- FFmpeg (for video/audio processing)
- CUDA-capable GPU (optional, but highly recommended for diarization)
- whisper.cpp (optional, on macOS)
If using whisper.cpp on macOS, follow the installation instructions [here] and clone the whisper.cpp repo into your working directory.
- Set up environment variables (create a `.env` file):

  ```
  HF_AUTH_TOKEN=your_huggingface_token_here
  ```

- Prepare known faces (optional, for speaker identification): create `known_faces.json`:

  ```json
  [
    {
      "name": "Speaker Name",
      "image_path": "path/to/speaker/photo.jpg"
    }
  ]
  ```

- Generate captions:

  ```
  captionalchemy video.mp4 -f srt -o my_captions
  ```

  or in a Python script:
```python
from dotenv import load_dotenv

from captionalchemy import caption

load_dotenv()

caption.run_pipeline(
    video_url_or_path="path/to/your/video.mp4",  # can be a video URL or a local file
    character_identification=False,  # True by default
    known_faces_json="path/to/known_faces.json",
    embed_faces_json="path/to/embed_faces.json",  # name of the output file
    caption_output_path="my_captions/output",  # writes output.srt (or .vtt/.smi)
    caption_format="srt"
)
```

```bash
# Generate SRT captions from a video file
captionalchemy video.mp4

# Generate VTT captions from a YouTube URL
captionalchemy "https://youtube.com/watch?v=VIDEO_ID" -f vtt -o output

# Disable face recognition
captionalchemy video.mp4 --no-face-id
```

```
captionalchemy VIDEO [OPTIONS]
```
```
Arguments:
  VIDEO                  Video file path or URL

Options:
  -f, --format           Caption format: srt, vtt, smi (default: srt)
  -o, --output           Output file base name (default: output_captions)
  --no-face-id           Disable face recognition
  --known-faces-json     Path to known faces JSON (default: example/known_faces.json)
  --embed-faces-json     Path to face embeddings JSON (default: example/embed_faces.json)
  -v, --verbose          Enable debug logging
```
- Face Embedding: Pre-processes known faces into embeddings
- Audio Extraction: Extracts audio from video files
- Voice Activity Detection: Identifies speech segments
- Speaker Diarization: Separates different speakers
- Transcription: Converts speech to text using Whisper
- Face Recognition: Matches speakers to known faces (if enabled)
- Caption Generation: Creates timestamped captions with speaker names
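The diarization and transcription stages produce two independent timelines that the caption step has to reconcile. A minimal sketch of that merge, using plain tuples rather than captionalchemy's actual internal types (the data shapes and function names here are illustrative assumptions):

```python
# Sketch: assign each transcribed segment the speaker whose
# diarization turn overlaps it the most.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(transcript, turns):
    """transcript: [(start, end, text)]; turns: [(start, end, speaker)]."""
    captions = []
    for t_start, t_end, text in transcript:
        best = max(
            turns,
            key=lambda turn: overlap(t_start, t_end, turn[0], turn[1]),
            default=None,
        )
        if best and overlap(t_start, t_end, best[0], best[1]) > 0:
            speaker = best[2]
        else:
            speaker = "Unknown"
        captions.append((t_start, t_end, f"{speaker}: {text}"))
    return captions

transcript = [(3.25, 6.89, "Welcome to our presentation."),
              (7.12, 10.46, "Thanks John.")]
turns = [(3.0, 7.0, "John Doe"), (7.0, 11.0, "Jane Smith")]
print(label_segments(transcript, turns))
```

Picking the turn with the largest overlap (rather than exact containment) makes the merge robust to the small timing disagreements that Whisper and pyannote segments typically have.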
Create a known_faces.json file with speaker information:
```json
[
  {
    "name": "John Doe",
    "image_path": "photos/john_doe.jpg"
  },
  {
    "name": "Jane Smith",
    "image_path": "photos/jane_smith.png"
  }
]
```

`HF_AUTH_TOKEN`: Hugging Face token for accessing pyannote models
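Face-recognition errors usually trace back to bad entries in this file. A small validation sketch using only the standard library (this helper is illustrative, not part of captionalchemy's API):

```python
import json
import os

def validate_known_faces(path):
    """Return a list of problems found in a known_faces.json file."""
    problems = []
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    for i, entry in enumerate(entries):
        if "name" not in entry:
            problems.append(f"entry {i}: missing 'name'")
        image = entry.get("image_path")
        if not image:
            problems.append(f"entry {i}: missing 'image_path'")
        elif not os.path.isfile(image):
            problems.append(f"entry {i}: image not found: {image}")
    return problems
```

Running this before a long captioning job catches missing keys and stale image paths early, instead of failing partway through the pipeline.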
```
1
00:00:03,254 --> 00:00:06,890
John Doe: Welcome to our presentation on quantum computing.

2
00:00:07,120 --> 00:00:10,456
Jane Smith: Thanks John. Let's start with the basics.
```
```
WEBVTT

00:03.254 --> 00:06.890
John Doe: Welcome to our presentation on quantum computing.

00:07.120 --> 00:10.456
Jane Smith: Thanks John. Let's start with the basics.
```
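The two formats differ mainly in timestamp syntax: SRT uses `HH:MM:SS,mmm` with a comma separator, while the short WebVTT form above uses `MM:SS.mmm` with a period. A sketch of formatting both from a float of seconds (these helpers are illustrative, not captionalchemy's API):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm (comma separator)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a short WebVTT timestamp: MM:SS.mmm (period separator)."""
    ms = round(seconds * 1000)
    m, rem = divmod(ms, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{m:02d}:{s:02d}.{ms:03d}"

print(srt_timestamp(3.254))   # 00:00:03,254
print(vtt_timestamp(3.254))   # 00:03.254
```

Working in integer milliseconds (via `round`) avoids the floating-point drift you get when slicing seconds with repeated modulo arithmetic.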
```bash
# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Linting
flake8

# Code formatting
black src/ tests/
```

See requirements.txt for the complete list of dependencies. Key packages include:

- `openai-whisper`: Speech transcription
- `pyannote.audio`: Speaker diarization
- `opencv-python`: Computer vision
- `insightface`: Face recognition
- `torch`: Deep learning framework
MIT License - see LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Run the test suite
- Submit a pull request
- CUDA out of memory: Use CPU-only mode or reduce batch sizes
- Missing models: Ensure whisper.cpp models are downloaded
- Face recognition errors: Verify image paths in known_faces.json
- Audio extraction fails: Check that FFmpeg is installed
- Check the logs with the `-v` flag for detailed error information
- Ensure all dependencies are properly installed
- Verify video file format compatibility
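A quick way to confirm the FFmpeg requirement before debugging audio-extraction failures, using only the standard library (this helper is an illustration, not part of captionalchemy):

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable is on the PATH."""
    return shutil.which("ffmpeg") is not None

if __name__ == "__main__":
    if ffmpeg_available():
        print("FFmpeg found.")
    else:
        print("FFmpeg missing: install it before running captionalchemy.")
```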