CaptionAlchemy

A Python package for creating intelligent closed captions with face detection and speaker recognition.

Features

  • Audio Transcription: Powered by OpenAI Whisper for high-quality speech-to-text
  • Speaker Diarization: Identifies different speakers in audio
  • Face Recognition: Links speakers to known faces for character identification
  • Multiple Output Formats: Supports SRT, VTT, and SAMI caption formats
  • Voice Activity Detection: Intelligently detects speech vs non-speech segments
  • GPU Acceleration: Automatic CUDA support when available

Installation

pip install captionalchemy

If you have a GPU and want to use hardware acceleration:

pip install captionalchemy[cuda]

Prerequisites

  • Python 3.10+
  • FFmpeg (for video/audio processing)
  • CUDA-capable GPU (optional, but highly recommended for speaker diarization)
  • whisper.cpp (optional, for use on macOS)

If using whisper.cpp on macOS, follow the whisper.cpp installation instructions and clone the whisper.cpp repository into your working directory.

Quick Start

  1. Set up environment variables (create .env file):

    HF_AUTH_TOKEN=your_huggingface_token_here
    
  2. Prepare known faces (optional, for speaker identification) by creating known_faces.json:

    [
      {
        "name": "Speaker Name",
        "image_path": "path/to/speaker/photo.jpg"
      }
    ]
  3. Generate captions:

captionalchemy video.mp4 -f srt -o my_captions

Or run the pipeline from a Python script:

from dotenv import load_dotenv
from captionalchemy import caption

load_dotenv()

caption.run_pipeline(
    video_url_or_path="path/to/your/video.mp4",   # video URL or local file path
    character_identification=False,               # defaults to True
    known_faces_json="path/to/known_faces.json",
    embed_faces_json="path/to/embed_faces.json",  # output file for the face embeddings
    caption_output_path="my_captions/output",     # writes output.srt (or .vtt/.smi)
    caption_format="srt"
)

Usage

Basic Usage

# Generate SRT captions from video file
captionalchemy video.mp4

# Generate VTT captions from YouTube URL
captionalchemy "https://youtube.com/watch?v=VIDEO_ID" -f vtt -o output

# Disable face recognition
captionalchemy video.mp4 --no-face-id

Command Line Options

captionalchemy VIDEO [OPTIONS]

Arguments:
  VIDEO                Video file path or URL

Options:
  -f, --format         Caption format: srt, vtt, smi (default: srt)
  -o, --output         Output file base name (default: output_captions)
  --no-face-id         Disable face recognition
  --known-faces-json   Path to known faces JSON (default: example/known_faces.json)
  --embed-faces-json   Path to face embeddings JSON (default: example/embed_faces.json)
  -v, --verbose        Enable debug logging
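
For example, several options can be combined in one invocation (the file names here are placeholders):

captionalchemy lecture.mp4 -f vtt -o lecture_captions --known-faces-json faces/known_faces.json -v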

How It Works

  1. Face Embedding: Pre-processes known faces into embeddings
  2. Audio Extraction: Extracts audio from video files (see the FFmpeg sketch after this list)
  3. Voice Activity Detection: Identifies speech segments
  4. Speaker Diarization: Separates different speakers
  5. Transcription: Converts speech to text using Whisper
  6. Face Recognition: Matches speakers to known faces (if enabled)
  7. Caption Generation: Creates timestamped captions with speaker names
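
For illustration, the audio-extraction step can be reproduced on its own with FFmpeg. Below is a minimal sketch using Python's subprocess; it assumes ffmpeg is on your PATH, and the 16 kHz mono WAV settings are a common choice for speech models rather than a confirmed detail of captionalchemy's internals:

import subprocess

def extract_audio(video_path: str, wav_path: str) -> None:
    """Extract mono 16 kHz WAV audio from a video file with FFmpeg."""
    subprocess.run(
        [
            "ffmpeg",
            "-y",                    # overwrite output if it exists
            "-i", video_path,        # input video
            "-vn",                   # drop the video stream
            "-acodec", "pcm_s16le",  # 16-bit PCM
            "-ar", "16000",          # 16 kHz sample rate
            "-ac", "1",              # mono
            wav_path,
        ],
        check=True,
    )

extract_audio("video.mp4", "audio.wav")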

Configuration

Known Faces Setup

Create a known_faces.json file with speaker information:

[
  {
    "name": "John Doe",
    "image_path": "photos/john_doe.jpg"
  },
  {
    "name": "Jane Smith",
    "image_path": "photos/jane_smith.png"
  }
]
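
Since a bad image path is a common source of face-recognition errors (see Troubleshooting), it can help to sanity-check the file before running the pipeline. A small standard-library sketch:

import json
from pathlib import Path

with open("known_faces.json") as f:
    faces = json.load(f)

for entry in faces:
    photo = Path(entry["image_path"])
    if not photo.is_file():
        print(f"Missing photo for {entry['name']}: {photo}")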

Environment Variables

  • HF_AUTH_TOKEN: Hugging Face token for accessing pyannote models
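
The token is read from the environment at runtime. A quick check that it is set, using python-dotenv as in the Quick Start example:

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory

if not os.environ.get("HF_AUTH_TOKEN"):
    raise SystemExit("HF_AUTH_TOKEN is not set; add it to your .env file")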

Output Examples

SRT Format

1
00:00:03,254 --> 00:00:06,890
John Doe: Welcome to our presentation on quantum computing.

2
00:00:07,120 --> 00:00:10,456
Jane Smith: Thanks John. Let's start with the basics.

VTT Format

WEBVTT

00:03.254 --> 00:06.890
John Doe: Welcome to our presentation on quantum computing.

00:07.120 --> 00:10.456
Jane Smith: Thanks John. Let's start with the basics.
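
Note that the two formats differ in their timestamp syntax: SRT uses a comma before the milliseconds, WebVTT a period. A sketch of the SRT conversion from seconds, matching the sample above (for illustration only, not captionalchemy's internal code):

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(3.254))  # 00:00:03,254
print(srt_timestamp(6.890))  # 00:00:06,890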

Development and Contributing

Setup Development Environment

# Install in development mode
pip install -e ".[dev]"

Running Tests

pytest

Code Quality

# Linting
flake8

# Code formatting
black src/ tests/

Requirements

See requirements.txt for the complete list of dependencies. Key packages include:

  • openai-whisper: Speech transcription
  • pyannote.audio: Speaker diarization
  • opencv-python: Computer vision
  • insightface: Face recognition
  • torch: Deep learning framework

License

MIT License - see LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite
  6. Submit a pull request

Troubleshooting

Common Issues

  • CUDA out of memory: Use CPU-only mode or reduce batch sizes (see the device-check sketch after this list)
  • Missing models: Ensure whisper.cpp models are downloaded
  • Face recognition errors: Verify image paths in known_faces.json
  • Audio extraction fails: Check that FFmpeg is installed
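
For the CUDA issue, a quick way to confirm whether PyTorch can actually see your GPU (torch is already a dependency):

import torch

# Fall back to CPU when no CUDA device is visible
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on {device}")
if device == "cuda":
    print(torch.cuda.get_device_name(0))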

Getting Help

  • Run with the -v flag for detailed error information in the logs
  • Ensure all dependencies are properly installed
  • Verify video file format compatibility
