Add YouTube transcript ingestion for same-day council meeting summaries #86

Merged
AndreRobitaille merged 12 commits into master from feature/youtube-transcript-ingestion on Apr 9, 2026

@AndreRobitaille (Owner)

Summary

  • Ingests YouTube auto-generated captions from council meeting recordings via yt-dlp
  • Produces same-day preliminary summaries when official minutes aren't available yet
  • Enriches minutes-based summaries with transcript context when minutes arrive later
  • Shows a visible banner on meeting pages when the summary is transcript-sourced
  • The transcript document in the Documents section links to the recording on YouTube ("Watch Recording")

Design

  • Spec: docs/superpowers/specs/2026-04-09-youtube-transcript-ingestion-design.md
  • Plan: docs/superpowers/plans/2026-04-09-youtube-transcript-ingestion.md

Changes

  • New jobs: Scrapers::DiscoverTranscriptsJob (finds YouTube videos for recent council meetings), Documents::DownloadTranscriptJob (fetches auto-captions, creates MeetingDocument)
  • Modified: SummarizeMeetingJob (transcript priority tier, supplementary context, source_type tracking), DiscoverMeetingsJob (triggers transcript discovery), MeetingsController (finds transcript summaries), meeting show view (transcript banner + document display)
  • Infrastructure: yt-dlp added to Dockerfile, Brakeman ignore for false positive on safe Open3.capture3 usage
  • Model: Meeting#document_status gains :transcript tier (minutes > packet > transcript > agenda)
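
The tier ordering for Meeting#document_status can be pictured as a first-match lookup. This is a minimal sketch, not the PR's code: it assumes document types are available as plain strings or symbols, and the `document_status_for` helper and the `:none` fallback are hypothetical names introduced for illustration.

```ruby
# Hypothetical sketch: pick the highest-priority tier present among a
# meeting's documents. Tier order follows the PR description:
# minutes > packet > transcript > agenda.
DOCUMENT_TIERS = %i[minutes packet transcript agenda].freeze

# document_types: array of strings or symbols, e.g. ["agenda", "transcript"]
def document_status_for(document_types)
  present = document_types.map(&:to_sym)
  # :none is an assumed fallback for meetings with no documents at all.
  DOCUMENT_TIERS.find { |tier| present.include?(tier) } || :none
end
```

Keeping the order in one frozen array means a new tier only has to be added in one place.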

How it works

DiscoverMeetingsJob (11pm daily)
  → DiscoverTranscriptsJob (checks YouTube for council meetings in last 48h)
    → DownloadTranscriptJob (fetches SRT captions, stores as MeetingDocument)
      → SummarizeMeetingJob (produces preliminary "transcript_recap" summary)

When minutes arrive weeks later:
  → SummarizeMeetingJob replaces transcript_recap with minutes_recap
  → Transcript text used as supplementary context (15K chars)
  → Banner automatically removed
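
The supplementary-context step could look roughly like the sketch below. It assumes the 15K limit is applied by keeping the first 15,000 characters of the transcript, which the PR does not confirm; `build_summary_input` and the separator text are hypothetical.

```ruby
# Hypothetical sketch of attaching transcript text to a minutes-based input.
TRANSCRIPT_CONTEXT_LIMIT = 15_000 # characters, per the PR description

def build_summary_input(minutes_text, transcript_text)
  # With no transcript, the minutes are used unchanged.
  return minutes_text if transcript_text.nil? || transcript_text.empty?

  # Assumption: keep the first 15K characters; the real job may slice differently.
  excerpt = transcript_text[0, TRANSCRIPT_CONTEXT_LIMIT]
  "#{minutes_text}\n\n--- Supplementary transcript context ---\n#{excerpt}"
end
```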

Test plan

  • bin/rails test — 437 tests, 0 failures
  • bin/rubocop — 0 offenses
  • bin/ci — all checks pass
  • Manually test with a real YouTube video ID to verify yt-dlp caption download works in production
  • Verify transcript banner renders correctly on meeting show page
  • Verify banner disappears when minutes-based summary exists

🤖 Generated with Claude Code

AndreRobitaille and others added 12 commits April 9, 2026 09:01
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t council meetings

Queries for council/work session meetings within 48 hours, fetches the Two
Rivers WI YouTube channel stream list via yt-dlp, matches video titles to
meeting dates, and enqueues Documents::DownloadTranscriptJob for each match.
Also adds a stub DownloadTranscriptJob placeholder for Task 4.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
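
The title-to-date matching in the commit above might be sketched as follows. The PR does not show the channel's title format, so this assumes titles embed a date in one of three common spellings; `title_matches_date?`, `match_videos_to_meetings`, and the `{ id:, title: }` video hashes are all hypothetical.

```ruby
require "date"

# Hypothetical: assumes video titles embed a date such as "April 6, 2026",
# "4/6/2026", or "2026-04-06". The real title format is not shown in the PR.
def title_matches_date?(title, date)
  candidates = [
    date.strftime("%B %-d, %Y"), # e.g. "April 6, 2026"
    date.strftime("%-m/%-d/%Y"), # e.g. "4/6/2026"
    date.strftime("%Y-%m-%d")    # e.g. "2026-04-06"
  ]
  candidates.any? { |text| title.include?(text) }
end

# videos: array of { id:, title: } hashes; meeting_dates: array of Date.
# Returns a hash of meeting date => video id for each date with a match.
def match_videos_to_meetings(videos, meeting_dates)
  meeting_dates.each_with_object({}) do |date, matches|
    video = videos.find { |v| title_matches_date?(v[:title], date) }
    matches[date] = video[:id] if video
  end
end
```

Plain substring matching is deliberately loose; a real implementation would likely also check that the title names a council meeting or work session.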
Implements yt-dlp-based caption download, SRT-to-plaintext parsing, MeetingDocument creation, and conditional SummarizeMeetingJob enqueue. Includes full Minitest coverage (5 tests: happy path, idempotency, summarization gate, and yt-dlp failure handling).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
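
For reference, SRT-to-plaintext parsing of the kind described above can be sketched in a few lines. This is an illustrative version, not the job's actual parser; `srt_to_plaintext` is a hypothetical name, and dropping adjacent duplicate lines is an assumption about how repeated auto-caption cues would be handled.

```ruby
# Hypothetical sketch: strip SRT cue numbers and timestamp lines and join
# the remaining caption text into one string.
def srt_to_plaintext(srt)
  lines = srt.lines.map(&:strip).reject do |line|
    line.empty? ||
      line.match?(/\A\d+\z/) ||                     # cue index, e.g. "12"
      line.match?(/\A\d{2}:\d{2}:\d{2},\d{3} --> /) # timestamp range
  end
  # Captions can repeat a line across consecutive cues, so collapse
  # adjacent duplicates (an assumption about the desired cleanup).
  lines.chunk_while { |a, b| a == b }.map(&:first).join(" ")
end
```

One known limitation of this sketch: a caption line consisting only of digits would be mistaken for a cue index and dropped.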
Document priority is now minutes > transcript > packet. When minutes
exist, transcript text is appended as supplementary context. When only
a transcript exists, it becomes the primary input with summary_type
"transcript_recap". The source_type field in generation_data tracks
which input combination was used.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
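
The input selection described in the commit above might be sketched like this. Only "transcript_recap", "minutes_recap", and "minutes_with_transcript" appear in this PR; the other source_type strings, the nil summary_type for packets, and the `choose_summary_inputs` helper are placeholders for illustration.

```ruby
# Hypothetical sketch of the priority minutes > transcript > packet.
# Returns nil when no usable document text exists.
def choose_summary_inputs(minutes: nil, transcript: nil, packet: nil)
  if minutes && transcript
    # Minutes stay primary; the transcript rides along as extra context.
    { primary: minutes, supplementary: transcript,
      summary_type: "minutes_recap", source_type: "minutes_with_transcript" }
  elsif minutes
    { primary: minutes, supplementary: nil,
      summary_type: "minutes_recap", source_type: "minutes" } # assumed value
  elsif transcript
    { primary: transcript, supplementary: nil,
      summary_type: "transcript_recap", source_type: "transcript" } # assumed value
  elsif packet
    # The PR does not name the packet summary_type, so it is left nil here.
    { primary: packet, supplementary: nil,
      summary_type: nil, source_type: "packet" } # assumed value
  end
end
```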
Shows a cool-toned informational banner when the summary is based on
the video recording instead of official minutes. Automatically removed
when minutes arrive and the summary is regenerated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Validates video_url matches YouTube URL pattern before passing to
yt-dlp. Brakeman false positive ignored since Open3.capture3 with
array arguments doesn't use a shell.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
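
To illustrate why the array form is safe, here is a rough sketch of validate-then-invoke. The regex, the `fetch_captions` helper, and the exact yt-dlp flag set are assumptions; the PR's actual pattern and command line are not shown here.

```ruby
require "open3"

# Hypothetical pattern: accepts only standard watch URLs with an
# 11-character video ID. The PR's real regex may differ.
YOUTUBE_URL = %r{\Ahttps://(www\.)?youtube\.com/watch\?v=[\w-]{11}\z}

def valid_youtube_url?(url)
  url.is_a?(String) && url.match?(YOUTUBE_URL)
end

def fetch_captions(video_url, out_dir)
  raise ArgumentError, "not a YouTube URL" unless valid_youtube_url?(video_url)

  # Each argument is its own array element, so Open3 spawns yt-dlp directly
  # and no shell ever interprets the URL.
  _stdout, stderr, status = Open3.capture3(
    "yt-dlp", "--skip-download", "--write-auto-subs",
    "--sub-langs", "en", "--convert-subs", "srt",
    "-o", File.join(out_dir, "%(id)s"), video_url
  )
  raise "yt-dlp failed: #{stderr}" unless status.success?

  true
end
```

The validation is defense in depth: even without a shell, rejecting unexpected URLs keeps yt-dlp from being pointed at arbitrary hosts or local paths.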
- Add upper time bound to DiscoverTranscriptsJob candidate query
  (exclude future meetings)
- Add test for URL validation rejection in DownloadTranscriptJob
- Assert minutes_with_transcript source_type in combined minutes+transcript test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@AndreRobitaille AndreRobitaille merged commit a52e174 into master Apr 9, 2026
2 of 3 checks passed
@AndreRobitaille AndreRobitaille deleted the feature/youtube-transcript-ingestion branch April 9, 2026 14:34