Problem
yt-dlp works perfectly for downloading YouTube auto-generated captions from residential IPs, but is blocked from the Hetzner VPS (datacenter IP) with:
ERROR: [youtube] S5TgyP91F5Q: Sign in to confirm you're not a bot.
Use --cookies-from-browser or --cookies for the authentication.
This means the automated DiscoverTranscriptsJob → DownloadTranscriptJob pipeline works for discovery (listing channel videos via --flat-playlist) but fails on the actual caption download step.
Current workaround
Download SRT files locally (where yt-dlp works fine), SCP to server, import via bin/rails transcripts:import[/path].
Solution Options
Option 1: YouTube Data API for captions (recommended to investigate first)
Use the official YouTube Data API v3 instead of yt-dlp for caption download. Requires a Google API key (free tier: 10,000 units/day, more than enough).
Pros:
No bot detection — it's an official API with a key
More reliable long-term than scraping
Could also replace yt-dlp --flat-playlist for video discovery
Cons:
The Captions API historically doesn't serve auto-generated captions for third-party channels. This may have changed — needs investigation.
Requires a Google Cloud project + API key in credentials
Investigation steps:
Create a Google Cloud project, enable YouTube Data API v3
Test GET /youtube/v3/captions?part=snippet&videoId=S8rW22zizHc to list caption tracks
Check if auto-generated English captions appear in the list
If yes, test GET /youtube/v3/captions/{captionId} to download (note: captions.download has historically required OAuth authorization from the video owner, not just an API key)
If auto-captions aren't available via the API, move to option 2 or 3
Implementation: Replace Open3.capture3("yt-dlp", ...) in DownloadTranscriptJob with HTTP calls to the YouTube API. Store the API key in Rails credentials.
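The captions.list call from the investigation steps could be sketched as below. This is a minimal sketch, not the actual job code: the helper names are hypothetical, and in the real DownloadTranscriptJob the API key would come from Rails credentials rather than being passed in directly.

```ruby
require "net/http"
require "json"
require "uri"

# Build the captions.list URI for a video (part=snippet is required).
def captions_list_uri(video_id, api_key)
  URI("https://www.googleapis.com/youtube/v3/captions" \
      "?part=snippet&videoId=#{video_id}&key=#{api_key}")
end

# Fetch and parse the caption track list for a video.
def list_caption_tracks(video_id, api_key)
  response = Net::HTTP.get_response(captions_list_uri(video_id, api_key))
  raise "captions.list failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  JSON.parse(response.body).fetch("items", [])
end

# Auto-generated tracks are reported with trackKind "ASR" (automatic
# speech recognition); casing has varied, so compare case-insensitively.
def auto_generated?(track)
  track.dig("snippet", "trackKind").to_s.casecmp?("asr")
end
```

If `list_caption_tracks` returns any track where `auto_generated?` is true, step 4 (attempting the download) is worth trying; otherwise this option is a dead end.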
Option 2: GitHub Actions as a proxy
Run yt-dlp from a GitHub Actions workflow instead of the VPS. GitHub's IPs rotate frequently and are less aggressively flagged.
Flow:
Scheduled GitHub Action runs nightly
Action calls yt-dlp to download captions for recent videos
Action uploads SRT files as artifacts or pushes to a known location
Production app fetches and imports them
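The flow above could be sketched as a nightly workflow. Everything here is illustrative: the filename, the batch file of video IDs, and the artifact handling are placeholders, and the final fetch-and-import step on the production side is not shown.

```yaml
# Hypothetical .github/workflows/fetch-captions.yml
name: fetch-captions
on:
  schedule:
    - cron: "0 3 * * *"   # nightly
  workflow_dispatch:

jobs:
  captions:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install yt-dlp
        run: python3 -m pip install --upgrade yt-dlp
      - name: Download auto-generated captions
        # video_ids.txt (one URL or ID per line) would be produced by
        # the discovery step or committed to the repo.
        run: |
          yt-dlp --skip-download --write-auto-subs --sub-langs en \
                 --convert-subs srt -o "subs/%(id)s.%(ext)s" \
                 --batch-file video_ids.txt
      - name: Upload SRT files
        uses: actions/upload-artifact@v4
        with:
          name: captions
          path: subs/
```

Pushing the SRT files to object storage or a repo branch instead of an artifact would make the production-side import simpler, at the cost of extra credentials in the workflow.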
Pros:
yt-dlp already works, just from a different IP
GitHub Actions IPs are generally less blocked than datacenter IPs
Cons:
Introduces a dependency on GitHub Actions availability
Option 3: Cookie-based authentication
Export YouTube cookies from a logged-in browser session, store on the server, pass to yt-dlp via --cookies.
Flow:
Export cookies locally: yt-dlp --cookies-from-browser chrome --cookies cookies.txt <any video URL> (yt-dlp needs at least one URL to run; it writes the cookie jar to cookies.txt on exit)
SCP cookies.txt to production server
Update DownloadTranscriptJob to pass --cookies /path/to/cookies.txt
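The job change could look something like this. It is a sketch only: `yt_dlp_args`, the cookies path, and the particular caption flags are illustrative, not the app's actual code.

```ruby
require "open3"

# Build the yt-dlp argument vector for downloading auto-generated
# English captions, optionally authenticated with an exported cookie jar.
def yt_dlp_args(video_id, out_dir, cookies_path: nil)
  args = [
    "yt-dlp", "--skip-download",
    "--write-auto-subs", "--sub-langs", "en",
    "--convert-subs", "srt",
    "-o", File.join(out_dir, "%(id)s.%(ext)s"),
  ]
  # Only pass --cookies when the exported jar is actually on disk, so a
  # lapsed export surfaces as a yt-dlp auth error rather than a crash here.
  args += ["--cookies", cookies_path] if cookies_path && File.exist?(cookies_path)
  args << "https://www.youtube.com/watch?v=#{video_id}"
end

# Inside the job, roughly:
#   stdout, stderr, status = Open3.capture3(*yt_dlp_args(video_id, "tmp",
#     cookies_path: "/var/app/secrets/cookies.txt"))
```

Keeping the cookie path in a config value (credentials or ENV) would make rotation a one-step operation when the export is refreshed.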
Pros:
Simple, uses existing yt-dlp infrastructure
Works immediately
Cons:
Cookies expire (usually 6-12 months for Google)
Requires periodic manual refresh
Storing Google cookies on a server has security implications
YouTube may flag the account if cookies are used from a datacenter IP
Option 4: Residential proxy
Route yt-dlp through a residential proxy service.
Pros: Most reliable — appears as a normal residential user
Cons: Monthly cost ($5-20+), adds a dependency, may violate YouTube ToS
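Wiring-wise this is a one-flag change, since yt-dlp has built-in proxy support. The proxy URL below is a placeholder for whatever endpoint and credentials the provider issues.

```shell
# Placeholder proxy URL; residential providers typically issue
# http(s) or socks5 endpoints with per-session credentials.
yt-dlp --proxy "http://USER:PASS@proxy.example.net:8000" \
       --skip-download --write-auto-subs --sub-langs en --convert-subs srt \
       "https://www.youtube.com/watch?v=S5TgyP91F5Q"
```

In the job, this would just be two more elements in the existing Open3.capture3 argument list.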
Option 5: yt-dlp PO token / visitor data workaround
yt-dlp has experimental support for --extractor-args "youtube:player_client=web;po_token=..." to bypass bot detection. This involves generating a Proof of Origin token.
Pros: Free, uses existing tooling
Cons: Fragile — YouTube changes these mechanisms frequently, and the po_token generation process is complex and poorly documented.
Recommendation
Investigate Option 1 (YouTube Data API) first — if auto-captions are available via the official API, it's the cleanest long-term solution. If not, Option 2 (GitHub Actions) is the most practical fallback. Options 3-5 are fragile workarounds.
Context
Design spec: docs/superpowers/specs/2026-04-09-youtube-transcript-ingestion-design.md
Caption download happens in DownloadTranscriptJob; DiscoverTranscriptsJob (video listing) uses --flat-playlist, which still works from the VPS.