Fix YouTube transcript download from datacenter IP (bot detection) #87

@AndreRobitaille

Description

Problem

yt-dlp works perfectly for downloading YouTube auto-generated captions from residential IPs, but gets blocked from the Hetzner VPS (datacenter IP) with:

ERROR: [youtube] S5TgyP91F5Q: Sign in to confirm you're not a bot.
Use --cookies-from-browser or --cookies for the authentication.

This means the automated DiscoverTranscriptsJob → DownloadTranscriptJob pipeline works for discovery (listing channel videos via --flat-playlist) but fails on the actual caption download step.

Current workaround

Download SRT files locally (where yt-dlp works fine), SCP to server, import via bin/rails transcripts:import[/path].

Solution Options

Option 1: YouTube Data API for captions (recommended to investigate first)

Use the official YouTube Data API v3 instead of yt-dlp for caption download. Requires a Google API key (free tier: 10,000 units/day, more than enough).

Pros:

  • No bot detection — it's an official API with a key
  • More reliable long-term than scraping
  • Could also replace yt-dlp --flat-playlist for video discovery

Cons:

  • The Captions API historically doesn't serve auto-generated captions for third-party channels. This may have changed — needs investigation.
  • Requires a Google Cloud project + API key in credentials

Investigation steps:

  1. Create a Google Cloud project, enable YouTube Data API v3
  2. Test GET /youtube/v3/captions?part=snippet&videoId=S8rW22zizHc to list caption tracks (the part parameter is required by captions.list)
  3. Check if auto-generated English captions appear in the list
  4. If yes, test GET /youtube/v3/captions/{captionId} to download — note this endpoint requires OAuth authorization, not just an API key, which is part of what needs verifying
  5. If auto-captions aren't available via the API, move to option 2 or 3

Implementation: Replace Open3.capture3("yt-dlp", ...) in DownloadTranscriptJob with HTTP calls to the YouTube API. Store the API key in Rails credentials.
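A minimal sketch of what the listing half of that swap could look like. Assumptions: the method names (`captions_list_uri`, `list_caption_tracks`) are hypothetical, the API key would really come from Rails credentials rather than a parameter, and the captions.download step is omitted because it needs OAuth rather than a plain key:

```ruby
require "net/http"
require "json"
require "uri"

# Build the captions.list request URL (YouTube Data API v3).
# The key is a parameter here so the helper stays testable offline;
# in DownloadTranscriptJob it would come from Rails credentials.
def captions_list_uri(video_id, api_key)
  URI::HTTPS.build(
    host:  "www.googleapis.com",
    path:  "/youtube/v3/captions",
    query: URI.encode_www_form(part: "snippet", videoId: video_id, key: api_key)
  )
end

# Fetch the caption tracks for a video. Auto-generated tracks are
# reported with a snippet.trackKind of "ASR" (case may vary in responses,
# so compare case-insensitively).
def list_caption_tracks(video_id, api_key)
  response = Net::HTTP.get_response(captions_list_uri(video_id, api_key))
  raise "captions.list failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  JSON.parse(response.body).fetch("items", [])
end
```

If investigation step 3 confirms ASR tracks show up here for third-party channels, the remaining question is whether the download endpoint will serve them at all, which is exactly what step 4 probes.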

Option 2: GitHub Actions as a proxy

Run yt-dlp from a GitHub Actions workflow instead of the VPS. GitHub's IPs rotate frequently and are less aggressively flagged.

Flow:

  1. Scheduled GitHub Action runs nightly
  2. Action calls yt-dlp to download captions for recent videos
  3. Action uploads SRT files as artifacts or pushes to a known location
  4. Production app fetches and imports them
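The flow above could be sketched as a workflow like the following. The file name, artifact name, and hard-coded video ID are all placeholders; a real workflow would need a step that asks the app which videos to fetch:

```yaml
# .github/workflows/transcripts.yml (hypothetical name)
name: Download transcripts
on:
  schedule:
    - cron: "0 6 * * *"   # nightly
  workflow_dispatch:       # allow manual runs while testing
jobs:
  download:
    runs-on: ubuntu-latest
    steps:
      - name: Install yt-dlp
        run: pip install yt-dlp
      - name: Download auto captions
        run: |
          # The video ID list would come from the app (e.g. an endpoint
          # or a checked-in file); hard-coded here as a placeholder.
          for id in S5TgyP91F5Q; do
            yt-dlp --skip-download --write-auto-subs --sub-langs en \
                   --convert-subs srt -o "subs/%(id)s" \
                   "https://www.youtube.com/watch?v=$id"
          done
      - name: Upload SRT files
        uses: actions/upload-artifact@v4
        with:
          name: transcripts
          path: subs/*.srt
```

Step 4 (fetching into production) could then reuse the existing bin/rails transcripts:import task against the downloaded artifact.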

Pros:

  • yt-dlp already works, just from a different IP
  • GitHub Actions IPs are generally less blocked than datacenter IPs
  • Free for public repos

Cons:

  • Adds infrastructure complexity (Action workflow, artifact transfer)
  • GitHub IPs could get blocked too eventually
  • Introduces a dependency on GitHub Actions availability

Option 3: Cookie-based authentication

Export YouTube cookies from a logged-in browser session, store on the server, pass to yt-dlp via --cookies.

Flow:

  1. Export cookies from browser: yt-dlp --cookies-from-browser chrome --cookies cookies.txt
  2. SCP cookies.txt to production server
  3. Update DownloadTranscriptJob to pass --cookies /path/to/cookies.txt
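A sketch of step 3, assuming the job shells out via Open3.capture3 as described above. The cookie path constant and the helper name are hypothetical, and the flag is added conditionally so environments without the cookie file (e.g. local dev, where yt-dlp already works) are unaffected:

```ruby
require "open3"

# Hypothetical cookie location on the VPS; the real path belongs in
# configuration, not a constant.
COOKIES_PATH = "/var/app/secrets/youtube-cookies.txt"

# Build the yt-dlp argument vector, appending --cookies only when the
# cookie file actually exists on this host.
def yt_dlp_args(video_id, cookies_path: COOKIES_PATH)
  args = ["yt-dlp", "--skip-download", "--write-auto-subs",
          "--sub-langs", "en", "--convert-subs", "srt",
          "https://www.youtube.com/watch?v=#{video_id}"]
  args += ["--cookies", cookies_path] if File.exist?(cookies_path)
  args
end

# In DownloadTranscriptJob this would become:
#   stdout, stderr, status = Open3.capture3(*yt_dlp_args(video_id))
```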

Pros:

  • Simple, uses existing yt-dlp infrastructure
  • Works immediately

Cons:

  • Cookies expire, and YouTube is known to rotate session cookies in active use, so an exported cookies.txt can stop working well before any nominal expiry
  • Requires periodic manual refresh
  • Storing Google cookies on a server has security implications
  • YouTube may flag the account if cookies are used from a datacenter IP

Option 4: Residential proxy

Route yt-dlp through a residential proxy service.

Pros: Most reliable — appears as a normal residential user
Cons: Monthly cost ($5-20+), adds a dependency, may violate YouTube ToS

Option 5: yt-dlp PO token / visitor data workaround

yt-dlp has experimental support for --extractor-args "youtube:player_client=web;po_token=..." to bypass bot detection. This involves generating a Proof of Origin token.

Pros: Free, uses existing tooling
Cons: Fragile — YouTube changes these mechanisms frequently. The po_token generation process is complex and poorly documented.

Recommendation

Investigate Option 1 (YouTube Data API) first — if auto-captions are available via the official API, it's the cleanest long-term solution. If not, Option 2 (GitHub Actions) is the most practical fallback. Options 3-5 are fragile workarounds.

Context

  • Feature PR: Add YouTube transcript ingestion for same-day council meeting summaries #86
  • Design spec: docs/superpowers/specs/2026-04-09-youtube-transcript-ingestion-design.md
  • The transcript pipeline itself works perfectly — this issue is specifically about the download step from datacenter IPs
  • Only affects DownloadTranscriptJob; DiscoverTranscriptsJob (video listing) uses --flat-playlist which still works from the VPS
