timcrob/episcan

Episcan

Match TV episode files to the right episodes by comparing audio transcriptions against episode subtitles (default) or episode descriptions.

Features

  • Audio Transcription: Uses OpenAI Whisper to transcribe video audio (loaded on-demand)
  • Time-Synchronized Comparison: Compares matching time segments so the video text and the reference subtitles cover the same content
  • Smart Subtitle Track Selection: Finds best matching video subtitle track using similarity analysis
  • Comprehensive Subtitle Caching: OS-appropriate cache storage with automatic management
  • Enhanced Subtitle Provider Support:
    • Local subtitle files (--subtitles-dir)
    • Embedded video subtitles (--try-subtitles)
    • Subliminal library (default) - Multiple providers with authentication support
    • External ID matching (TMDB/TVDB/IMDB) for better provider accuracy
  • Robust Subtitle Retry System:
    • Configurable retry attempts with exponential backoff (default: 5 retries)
    • Smart failure handling (exit/prompt/continue after retries exhausted)
  • Intelligent Transcription Defaults:
    • Subtitle comparison: 3 minutes starting at 1 minute (skips intros)
    • Description comparison: Full episode transcription
  • Universal Subtitle Support: Handles SRT, WebVTT, ASS/SSA, MicroDVD, MPL2, TMP, and JSON formats via pysubs2
  • Advanced File Management: Smart conflict resolution for renaming with file preservation
  • Optimal Episode Matching: Uses SBERT embeddings and Hungarian algorithm for unique assignments
  • GPU Acceleration: CUDA support for both Whisper and sentence transformers
  • Memory Efficient: Conditional model loading saves resources when not needed
  • Progress Tracking: ETA calculations and detailed processing feedback
  • Multiple APIs: TMDB (preferred) and TVDB support with external ID enrichment

Quick Start

# Default: Use subliminal for subtitle downloads with 5 retry attempts
uv run python main.py /path/to/videos

# Use local subtitle files
uv run python main.py /path/to/videos --subtitles-dir /path/to/subtitles

# Compare against episode descriptions instead
uv run python main.py /path/to/videos --use-descriptions

# Custom retry behavior: 3 retries, then continue anyway
uv run python main.py /path/to/videos --subtitle-retries 3 --on-subtitle-failure continue

# Clear cache and disable caching for fresh downloads
uv run python main.py /path/to/videos --clear-cache --no-cache

Usage Examples

Basic Usage

# Default behavior: subtitle comparison using subliminal
episcan /path/to/videos

# Use local subtitle files for comparison
episcan /path/to/videos --subtitles-dir /path/to/subtitles

# Compare against episode descriptions (full transcription)
episcan /path/to/videos --use-descriptions

# Force full episode transcription even for subtitle comparison
episcan /path/to/videos --max-duration 0

Advanced Options

# Custom models, auto-rename, verbose output
episcan /path/to/videos \
    --whisper-model medium \
    --sbert-model sentence-transformers/all-MiniLM-L6-v2 \
    --rename auto \
    --verbose

# Try embedded subtitles first, fallback to Whisper
episcan /path/to/videos --try-subtitles

# Cache management for re-processing shows
episcan /path/to/videos --clear-cache  # Clear existing cache
episcan /path/to/videos --no-cache     # Disable caching entirely

# Adjust transcription timing
episcan /path/to/videos --max-duration 300 --start-offset 30

Environment Variables

export TMDB_API_KEY="your_tmdb_key"             # Preferred
export TVDB_API_KEY="your_tvdb_key"             # Fallback

# Optional: Subtitle provider authentication (improves success rates)
export ADDIC7ED_USERNAME="your_username"        # Addic7ed account
export ADDIC7ED_PASSWORD="your_password"
export OPENSUBTITLES_USERNAME="your_username"   # OpenSubtitles account
export OPENSUBTITLES_PASSWORD="your_password"
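
The "Preferred" and "Fallback" comments above suggest a simple selection rule. The sketch below shows one way such a rule could look. The function name `select_metadata_api` and its return shape are hypothetical; only the variable names `TMDB_API_KEY` and `TVDB_API_KEY` and the `--force-tvdb` option come from this README, and episcan's actual logic may differ.

```python
import os
from typing import Mapping, Optional, Tuple


def select_metadata_api(
    env: Mapping[str, str] = os.environ,
    force_tvdb: bool = False,
) -> Optional[Tuple[str, str]]:
    """Return (api_name, key), preferring TMDB unless TVDB is forced.

    Hypothetical helper: episcan's real selection logic may differ.
    """
    tmdb_key = env.get("TMDB_API_KEY")
    tvdb_key = env.get("TVDB_API_KEY")
    if tmdb_key and not force_tvdb:
        return ("tmdb", tmdb_key)
    if tvdb_key:
        return ("tvdb", tvdb_key)
    return None  # no usable key found
```

Passing the environment as a mapping keeps the rule easy to test without touching the real process environment.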

How It Works

Subtitle Comparison (Default)

  1. Check Cache: Looks for previously downloaded subtitles in the OS-appropriate cache directory
  2. Download Subtitles: Uses subliminal with enhanced provider configurations and external ID matching
  3. Enhanced Matching: Utilizes TMDB/TVDB/IMDB IDs for better provider accuracy
  4. Retry System: Automatically retries failed downloads with exponential backoff (2s→4s→8s→16s→32s)
  5. Cache Storage: Saves downloaded subtitles with metadata for future use
  6. Time-Synchronized Extraction: For partial transcription, extracts matching time segments from episode subtitles
  7. Smart Track Selection: When using --try-subtitles, finds the video subtitle track most similar to the episode content
  8. Fair Comparison: Compares equivalent content (3min transcript vs 3min subtitle segment)
  9. Optimal Assignment: Uses Hungarian algorithm to ensure unique episode matches
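
To see why step 9 matters, consider what "unique episode matches" means: each file gets a different episode, and the total similarity across all pairs is as high as possible. The sketch below is not the Hungarian algorithm itself. It is a brute-force stand-in over all permutations, using only the standard library, that produces the same optimal result on small inputs. The function name `best_unique_assignment` and the score matrix are illustrative assumptions. For real workloads, `scipy.optimize.linear_sum_assignment(matrix, maximize=True)` solves the same problem efficiently, and scipy is listed under Dependencies for this purpose.

```python
from itertools import permutations
from typing import Dict, List


def best_unique_assignment(similarity: List[List[float]]) -> Dict[int, int]:
    """Map each file index to a distinct episode index, maximising total similarity.

    Brute force over permutations: fine for a handful of files, far too slow
    for a full season, where a Hungarian-algorithm solver is the right tool.
    Assumes there are at least as many episodes as files.
    """
    n_files = len(similarity)
    n_episodes = len(similarity[0]) if similarity else 0
    best_total, best_choice = float("-inf"), ()
    for choice in permutations(range(n_episodes), n_files):
        total = sum(similarity[f][e] for f, e in enumerate(choice))
        if total > best_total:
            best_total, best_choice = total, choice
    return dict(enumerate(best_choice))


# Rows are files, columns are episodes (made-up scores).
# Picking each file's best episode independently would assign episode 0 twice;
# the optimal unique assignment is file 0 -> episode 1, file 1 -> episode 0.
scores = [
    [0.90, 0.80],
    [0.85, 0.10],
]
print(best_unique_assignment(scores))
```

The example shows the point of a global assignment: file 0 gives up its individually best episode because the total (0.80 + 0.85) beats the alternative (0.90 + 0.10).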

Description Comparison

  1. Full Transcription: Transcribes entire episodes for comprehensive comparison
  2. Metadata Matching: Compares transcripts against episode descriptions from TMDB/TVDB
  3. Similarity Scoring: Uses cosine similarity with sentence transformers
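
For reference, cosine similarity between two embedding vectors can be computed as follows. This is a minimal standard-library sketch on plain lists of floats. In episcan the vectors would come from a sentence-transformers model, which is not loaded here, and the function name `cosine_similarity` is chosen for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the score depends only on direction, not magnitude, a long transcript and a short description can still score highly when they are about the same thing.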

Subtitle Sources

Priority Order:

  1. Local Files (--subtitles-dir) - Custom subtitle directory
  2. Subliminal (default) - Enhanced provider support with authentication:
    • Multiple provider strategies with intelligent fallback
    • External ID matching (TMDB/TVDB/IMDB) for better accuracy
    • Optional authentication for Addic7ed and OpenSubtitles (via environment variables)
    • Configurable retry system with exponential backoff (default: 5 attempts)
    • OS-appropriate cache storage (macOS: ~/Library/Caches/episcan/)
    • Cache hits eliminate re-downloads for repeated processing
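
The retry schedule documented under How It Works (2s→4s→8s→16s→32s with the default of 5 attempts) is a doubling sequence. The sketch below reproduces that schedule and wraps an arbitrary fetch function in it. The names `backoff_delays` and `fetch_with_retries` are hypothetical, and the `sleep` parameter exists only so the loop can be exercised without waiting; this is not episcan's actual retry code.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int = 5, base: float = 2.0) -> List[float]:
    """Delays in seconds before each retry: base, 2*base, 4*base, ..."""
    return [base * (2 ** attempt) for attempt in range(retries)]


def fetch_with_retries(
    fetch: Callable[[], Optional[T]],
    retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fetch() once, then retry after each backoff delay until it returns a value."""
    result = fetch()
    for delay in backoff_delays(retries):
        if result is not None:
            break
        sleep(delay)
        result = fetch()
    return result
```

With the defaults, a download that never succeeds waits 62 seconds in total (2 + 4 + 8 + 16 + 32) before the `--on-subtitle-failure` action would apply.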

Why Subliminal?

  • Enhanced Provider Support: Multiple subtitle providers with authentication for higher success rates
  • External ID Matching: Uses TMDB/TVDB/IMDB IDs for more accurate content matching
  • Robust Retry System: Automatic retries with exponential backoff for temporary failures
  • Smart Caching: Prevents re-downloads with persistent storage and metadata
  • Format Support: Handles various subtitle formats automatically
  • Respectful: Built-in rate limiting and intelligent provider rotation
  • Zero Configuration: Works without API keys, but supports authentication for better results
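
The caching behaviour described above follows a common get-or-download pattern. The sketch below illustrates that pattern under assumed details: the file naming scheme, the JSON layout, and the function name `cached_subtitles` are all hypothetical and are not taken from episcan's source.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_subtitles(
    cache_dir: Path,
    series_id: int,
    season: int,
    episode: int,
    download: Callable[[], str],
) -> str:
    """Return subtitle text from the cache, downloading and storing it on a miss."""
    entry = cache_dir / f"{series_id}_S{season:02d}E{episode:02d}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))["text"]
    text = download()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Store a little metadata next to the text (assumed layout).
    payload = {"series_id": series_id, "season": season, "episode": episode, "text": text}
    entry.write_text(json.dumps(payload), encoding="utf-8")
    return text


# Usage with a throwaway directory and a stand-in downloader.
with tempfile.TemporaryDirectory() as tmp:
    downloads = []

    def fake_download() -> str:
        downloads.append(1)
        return "1\n00:01:00,000 --> 00:01:02,000\nHello\n"

    first = cached_subtitles(Path(tmp), 1396, 1, 1, fake_download)
    second = cached_subtitles(Path(tmp), 1396, 1, 1, fake_download)  # cache hit
```

The second call returns the stored text without invoking the downloader, which is the effect behind the faster repeat runs.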

Command Line Options

positional arguments:
  video_dir                Directory containing video files (default: current directory)

options:
  --tvdb-api-key           TVDB API key (or use TVDB_API_KEY environment variable)
  --tmdb-api-key           TMDB API key (or use TMDB_API_KEY environment variable)
  --subtitles-dir          Directory containing subtitle files for episodes
  --use-descriptions       Use episode descriptions instead of subtitles (default: use subtitles)
  --force-tvdb             Force TVDB even if TMDB key available
  --whisper-model          Whisper model: tiny, base, small, medium, large (default: base)
  --sbert-model            Sentence transformer model (default: all-mpnet-base-v2)
  --max-duration           Transcription duration in seconds; 0 = full episode (default: 180 for subtitle comparison, full episode for descriptions)
  --start-offset           Seconds to skip at the start, e.g. intros (default: 60)
  --rename                 File renaming: none, prompt, auto (default: none)
  --try-subtitles          Try embedded subtitles first, fallback to Whisper
  --subtitle-retries       Number of retry attempts for missing subtitles (default: 5)
  --on-subtitle-failure    Action when subtitles still missing after retries: exit, prompt, continue (default: exit)
  --clear-cache            Clear subtitle cache before processing
  --no-cache               Disable subtitle caching (always download fresh)
  --verbose                Detailed processing information
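
The mode-dependent default of `--max-duration` is the least obvious part of this list. The sketch below shows how a subset of these options could be declared with `argparse` and how that default could be resolved. It is a hypothetical reconstruction from the help text above, not episcan's actual parser, and the helper names `build_parser`, `effective_max_duration`, and `parse` are invented for illustration.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Declare a subset of the options listed above (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="episcan")
    parser.add_argument("video_dir", nargs="?", default=".")
    parser.add_argument("--use-descriptions", action="store_true")
    parser.add_argument("--max-duration", type=int, default=None)
    parser.add_argument("--start-offset", type=int, default=60)
    parser.add_argument("--rename", choices=["none", "prompt", "auto"], default="none")
    parser.add_argument("--subtitle-retries", type=int, default=5)
    parser.add_argument(
        "--on-subtitle-failure",
        choices=["exit", "prompt", "continue"],
        default="exit",
    )
    return parser


def effective_max_duration(args: argparse.Namespace) -> int:
    """Resolve the mode-dependent default: 180 s for subtitles, 0 (full) for descriptions."""
    if args.max_duration is not None:
        return args.max_duration
    return 0 if args.use_descriptions else 180


def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv (or sys.argv when None) with the sketch parser."""
    return build_parser().parse_args(argv)
```

Leaving the parser default as `None` is what lets the code tell "not given" apart from an explicit `--max-duration 0`.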

Supported Subtitle Formats

Thanks to pysubs2 integration, episcan supports all major subtitle formats:

  • SubRip (.srt)
  • WebVTT (.vtt)
  • Advanced SubStation Alpha (.ass, .ssa)
  • MicroDVD (.sub)
  • MPL2 (.mpl)
  • TMP (.tmp)
  • JSON subtitles

Example Output

Using TMDB API
Found series: Breaking Bad (TMDB ID: 1396, Year: 2008)
  External IDs - IMDB: tt0903747, TVDB: 81189
Loading sentence transformer model (sentence-transformers/all-mpnet-base-v2) on cuda...
Found 8 video files
Detected: Breaking Bad Season 1
Fetching episode subtitles using subliminal...
  Attempting to get subtitles for 8 episodes (checking cache first)...
✓ All episodes have subtitles
Processing 8 video files...
  1/8: episode1.mkv ✓ (12.3s)
  2/8: episode2.mkv ✓ (11.8s)
  ...

Calculating optimal matches...

=== FINAL MATCHES ===
✓ episode1.mkv -> S01E01 - Pilot
  Similarity: 0.847

→ episode5.mkv -> S01E05 - Gray Matter
  Similarity: 0.723

=== FILE RENAMING ===
Planned renames:
  episode1.mkv → Breaking Bad - S01E01 - Pilot.mkv
  episode5.mkv → Breaking Bad - S01E05 - Gray Matter.mkv

Renamed 8/8 files successfully

Performance Tips

  • Caching Benefits: Repeat runs on the same show are much faster because subtitles are served from the cache instead of being downloaded again
  • GPU Acceleration: Use CUDA-compatible hardware for 5-10x speed improvement
  • Memory Optimization: Use --try-subtitles to avoid loading Whisper when embedded subtitles are available
  • Cache Management:
    • Cache persists between runs for faster re-processing
    • Use --clear-cache when switching show versions or subtitle preferences
    • Cache stored in OS-appropriate locations (Linux: ~/.cache/episcan/, Windows: %LOCALAPPDATA%\episcan\)
  • Model Selection:
    • whisper-model base: Good balance of speed/accuracy (loaded on-demand)
    • sbert-model all-MiniLM-L6-v2: Faster but less accurate than default
  • Transcription Optimization:
    • Default 3-minute excerpts work well for most shows with time-synchronized comparison
    • Use --max-duration 0 for very short episodes or poor matches
    • Adjust --start-offset for shows with long intros
  • Subtitle Extraction: --try-subtitles can be much faster than transcription for videos with embedded subs
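
The transcription tips above depend on comparing the same time window on both sides. The sketch below shows how a matching window could be cut from reference subtitles given `--start-offset` and `--max-duration` values. The `Cue` type and the function `cues_in_window` are hypothetical, and treating `0` as "to the end while still honouring the start offset" is an assumption about behaviour this README does not spell out.

```python
from typing import List, NamedTuple


class Cue(NamedTuple):
    start: float  # seconds from the start of the episode
    end: float
    text: str


def cues_in_window(
    cues: List[Cue],
    start_offset: float = 60.0,
    max_duration: float = 180.0,
) -> str:
    """Join the text of cues that overlap [start_offset, start_offset + max_duration).

    max_duration == 0 means "to the end of the episode", mirroring --max-duration 0.
    Assumption: the start offset still applies in that case.
    """
    window_end = float("inf") if max_duration == 0 else start_offset + max_duration
    selected = [c.text for c in cues if c.end > start_offset and c.start < window_end]
    return " ".join(selected)


cues = [
    Cue(5.0, 8.0, "recap"),
    Cue(61.0, 64.0, "first line"),
    Cue(230.0, 233.0, "second line"),
    Cue(300.0, 303.0, "late line"),
]
print(cues_in_window(cues))
```

With the defaults, the window runs from 60 s to 240 s, so the recap and the late line are excluded and only the text a 3-minute transcript could contain is compared.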

Troubleshooting

Common Issues

Missing subtitles despite retries:

# Set provider credentials for better access
export ADDIC7ED_USERNAME="your_user"
export ADDIC7ED_PASSWORD="your_pass"

# Increase retry attempts
episcan /path/to/videos --subtitle-retries 10

# Continue anyway with partial coverage
episcan /path/to/videos --on-subtitle-failure continue

# Use description comparison as fallback
episcan /path/to/videos --use-descriptions

File renaming conflicts:

# Conflicts are resolved automatically and files are never deleted.
# A conflicting file is kept under an UNMATCHED_ prefix with a short unique suffix.
# Example: "Show - S01E01.mkv" becomes "UNMATCHED_Show - S01E01_a1b2c3d4.mkv"
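
The naming pattern in the example above can be reproduced with a few lines of Python. This sketch assumes the suffix is the first 8 hex characters of a random UUID and that the file already occupying the target name is the one moved aside. Both are guesses from the example, and the function names `unmatched_name` and `rename_preserving` are hypothetical.

```python
import uuid
from pathlib import Path


def unmatched_name(path: Path) -> Path:
    """Build an UNMATCHED_ name with a short random suffix for a conflicting file."""
    suffix = uuid.uuid4().hex[:8]  # 8 hex characters, e.g. "a1b2c3d4"
    return path.with_name(f"UNMATCHED_{path.stem}_{suffix}{path.suffix}")


def rename_preserving(source: Path, target: Path) -> Path:
    """Rename source to target, moving any existing target aside instead of overwriting it."""
    if target.exists() and target != source:
        target.rename(unmatched_name(target))
    source.rename(target)
    return target
```

Moving the existing file aside before renaming means no step ever overwrites data, which matches the "never deleted" guarantee stated above.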

Poor matching accuracy:

# Try full episode transcription
episcan /path/to/videos --max-duration 0

# Use a larger Whisper model (all-mpnet-base-v2 is already the default SBERT model)
episcan /path/to/videos --whisper-model medium --sbert-model sentence-transformers/all-mpnet-base-v2

Subtitle provider rate limiting:

# Automatic exponential backoff handles most rate limiting
# For persistent issues, try local subtitle files
episcan /path/to/videos --subtitles-dir /path/to/subtitles

Dependencies

  • subliminal>=2.1.0 - Multi-provider subtitle downloads with dynamic discovery
  • pysubs2>=1.6.0 - Universal subtitle parsing
  • platformdirs>=3.0.0 - OS-appropriate cache directories
  • sentence-transformers - Text similarity embeddings
  • openai-whisper - Audio transcription
  • tmdbsimple - TMDB API client
  • tvdb-v4-official - TVDB API client
  • scipy - Hungarian algorithm for optimal matching
  • torch - GPU acceleration support

License

MIT License - see LICENSE file for details.
