Fix YouTube transcript download from datacenter IP (bot detection) #87

@AndreRobitaille

Description

Problem

yt-dlp works perfectly for downloading YouTube auto-generated captions from residential IPs, but gets blocked from the Hetzner VPS (datacenter IP) with:

ERROR: [youtube] S5TgyP91F5Q: Sign in to confirm you're not a bot.
Use --cookies-from-browser or --cookies for the authentication.

This means the automated DiscoverTranscriptsJob → DownloadTranscriptJob pipeline works for discovery (listing channel videos via --flat-playlist) but fails on the actual caption download step.

Current workaround

Download SRT files locally (where yt-dlp works fine), SCP to server, import via bin/rails transcripts:import[/path].

Solution Options

Option 1: YouTube Data API for captions (recommended to investigate first)

Use the official YouTube Data API v3 instead of yt-dlp for caption download. Requires a Google API key (free tier: 10,000 units/day, more than enough).

Pros:

  • No bot detection — it's an official API with a key
  • More reliable long-term than scraping
  • Could also replace yt-dlp --flat-playlist for video discovery

Cons:

  • The Captions API historically doesn't serve auto-generated captions for third-party channels. This may have changed — needs investigation.
  • Requires a Google Cloud project + API key in credentials

Investigation steps:

  1. Create a Google Cloud project, enable YouTube Data API v3
  2. Test GET /youtube/v3/captions?part=snippet&videoId=S8rW22zizHc to list caption tracks (the part parameter is required by captions.list)
  3. Check if auto-generated English captions appear in the list
  4. If yes, test GET /youtube/v3/captions/{captionId} to download — note this endpoint requires OAuth authorization, not just an API key, which is part of what needs verifying
  5. If auto-captions aren't available via the API, move to option 2 or 3

Implementation: Replace Open3.capture3("yt-dlp", ...) in DownloadTranscriptJob with HTTP calls to the YouTube API. Store the API key in Rails credentials.
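A minimal sketch of what the listing half of that swap could look like. Assumptions: the method names (`captions_list_uri`, `list_caption_tracks`) are hypothetical, the API key would really come from Rails credentials rather than a parameter, and the captions.download step is omitted because it needs OAuth rather than a plain key:

```ruby
require "net/http"
require "json"
require "uri"

# Build the captions.list request URL (YouTube Data API v3).
# The key is a parameter here so the helper stays testable offline;
# in DownloadTranscriptJob it would come from Rails credentials.
def captions_list_uri(video_id, api_key)
  URI::HTTPS.build(
    host:  "www.googleapis.com",
    path:  "/youtube/v3/captions",
    query: URI.encode_www_form(part: "snippet", videoId: video_id, key: api_key)
  )
end

# Fetch the caption tracks for a video. Auto-generated tracks are
# reported with a snippet.trackKind of "ASR" (case may vary in responses,
# so compare case-insensitively).
def list_caption_tracks(video_id, api_key)
  response = Net::HTTP.get_response(captions_list_uri(video_id, api_key))
  raise "captions.list failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  JSON.parse(response.body).fetch("items", [])
end
```

If investigation step 3 confirms ASR tracks show up here for third-party channels, the remaining question is whether the download endpoint will serve them at all, which is exactly what step 4 probes.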

Option 2: GitHub Actions as a proxy

Run yt-dlp from a GitHub Actions workflow instead of the VPS. GitHub's IPs rotate frequently and are less aggressively flagged.

Flow:

  1. Scheduled GitHub Action runs nightly
  2. Action calls yt-dlp to download captions for recent videos
  3. Action uploads SRT files as artifacts or pushes to a known location
  4. Production app fetches and imports them
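The flow above could be sketched as a workflow like the following. The file name, artifact name, and hard-coded video ID are all placeholders; a real workflow would need a step that asks the app which videos to fetch:

```yaml
# .github/workflows/transcripts.yml (hypothetical name)
name: Download transcripts
on:
  schedule:
    - cron: "0 6 * * *"   # nightly
  workflow_dispatch:       # allow manual runs while testing
jobs:
  download:
    runs-on: ubuntu-latest
    steps:
      - name: Install yt-dlp
        run: pip install yt-dlp
      - name: Download auto captions
        run: |
          # The video ID list would come from the app (e.g. an endpoint
          # or a checked-in file); hard-coded here as a placeholder.
          for id in S5TgyP91F5Q; do
            yt-dlp --skip-download --write-auto-subs --sub-langs en \
                   --convert-subs srt -o "subs/%(id)s" \
                   "https://www.youtube.com/watch?v=$id"
          done
      - name: Upload SRT files
        uses: actions/upload-artifact@v4
        with:
          name: transcripts
          path: subs/*.srt
```

Step 4 (fetching into production) could then reuse the existing bin/rails transcripts:import task against the downloaded artifact.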

Pros:

  • yt-dlp already works, just from a different IP
  • GitHub Actions IPs are generally less blocked than datacenter IPs
  • Free for public repos

Cons:

  • Adds infrastructure complexity (Action workflow, artifact transfer)
  • GitHub IPs could get blocked too eventually
  • Introduces a dependency on GitHub Actions availability

Option 3: Cookie-based authentication

Export YouTube cookies from a logged-in browser session, store on the server, pass to yt-dlp via --cookies.

Flow:

  1. Export cookies from browser: yt-dlp --cookies-from-browser chrome --cookies cookies.txt
  2. SCP cookies.txt to production server
  3. Update DownloadTranscriptJob to pass --cookies /path/to/cookies.txt
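A sketch of step 3, assuming the job shells out via Open3.capture3 as described above. The cookie path constant and the helper name are hypothetical, and the flag is added conditionally so environments without the cookie file (e.g. local dev, where yt-dlp already works) are unaffected:

```ruby
require "open3"

# Hypothetical cookie location on the VPS; the real path belongs in
# configuration, not a constant.
COOKIES_PATH = "/var/app/secrets/youtube-cookies.txt"

# Build the yt-dlp argument vector, appending --cookies only when the
# cookie file actually exists on this host.
def yt_dlp_args(video_id, cookies_path: COOKIES_PATH)
  args = ["yt-dlp", "--skip-download", "--write-auto-subs",
          "--sub-langs", "en", "--convert-subs", "srt",
          "https://www.youtube.com/watch?v=#{video_id}"]
  args += ["--cookies", cookies_path] if File.exist?(cookies_path)
  args
end

# In DownloadTranscriptJob this would become:
#   stdout, stderr, status = Open3.capture3(*yt_dlp_args(video_id))
```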

Pros:

  • Simple, uses existing yt-dlp infrastructure
  • Works immediately

Cons:

  • Cookies expire, and YouTube is known to rotate session cookies in active use, so an exported cookies.txt can stop working well before any nominal expiry
  • Requires periodic manual refresh
  • Storing Google cookies on a server has security implications
  • YouTube may flag the account if cookies are used from a datacenter IP

Option 4: Residential proxy

Route yt-dlp through a residential proxy service.

Pros: Most reliable — appears as a normal residential user
Cons: Monthly cost ($5-20+), adds a dependency, may violate YouTube ToS

Option 5: yt-dlp PO token / visitor data workaround

yt-dlp has experimental support for --extractor-args "youtube:player_client=web;po_token=..." to bypass bot detection. This involves generating a Proof of Origin token.

Pros: Free, uses existing tooling
Cons: Fragile — YouTube changes these mechanisms frequently. The po_token generation process is complex and poorly documented.

Recommendation

Investigate Option 1 (YouTube Data API) first — if auto-captions are available via the official API, it's the cleanest long-term solution. If not, Option 2 (GitHub Actions) is the most practical fallback. Options 3-5 are fragile workarounds.

Context

  • Feature PR: Add YouTube transcript ingestion for same-day council meeting summaries #86
  • Design spec: docs/superpowers/specs/2026-04-09-youtube-transcript-ingestion-design.md
  • The transcript pipeline itself works perfectly — this issue is specifically about the download step from datacenter IPs
  • Only affects DownloadTranscriptJob; DiscoverTranscriptsJob (video listing) uses --flat-playlist which still works from the VPS
