Extracts structured lot data from Brazilian cattle auction YouTube videos.
- Download — fetches the video with `yt-dlp` and extracts a 16 kHz mono audio track
- Transcribe — transcribes audio in PT-BR using MLX Whisper (local, Metal), whisper.cpp (local, Metal), or the Groq API (cloud)
- Screenshots — extracts one frame every 30 seconds with `ffmpeg`, with a live progress bar
- OCR — reads text visible on screen using RapidOCR (ONNX-based, fast, no native deps), with a live progress bar
- Aggregate — merges transcript segments and OCR results into 10-minute windows
- Extract lots — sends each window to an LLM with a structured PT-BR prompt to pull out lot data (number, sex, category, count, breed, price, sold status)
- Extract metadata — scans the first windows to extract auction-level info: date, city, auctioneer, farm, auction type
- Output — saves `lots_<video_id>.json` and `metadata_<video_id>.json`, and prints a summary table
Each stage is checkpointed. Interrupted runs resume automatically from where they left off.
## Requirements

- Python 3.11+
- `uv` for dependency management
- `ffmpeg` installed on the system
- `deno` runtime (used by yt-dlp for YouTube download)
## Installation

```bash
# macOS
brew install ffmpeg deno

# Windows
winget install Gyan.FFmpeg DenoLand.Deno
```

```bash
git clone <repo>
cd cattle-auction
uv sync --no-install-project                 # base deps
uv sync --extra local --no-install-project   # + mlx-whisper (Apple Silicon only)
```

Fill in your API keys in `.env`:
```bash
# .env
OPENAI_API_KEY=sk-...
GROQ_API_KEY=gsk-...
OPENROUTER_API_KEY=sk-or-...   # only if using --provider openrouter
```

Keys are loaded automatically from `.env` on every run. Only set the keys you need. The default run uses OpenRouter for extraction and Groq for transcription, so `OPENROUTER_API_KEY` and `GROQ_API_KEY` are required for the default path. Use `OPENAI_API_KEY` only when running with `--provider openai`.
## Usage

```bash
uv run python main.py <youtube_url> [OPTIONS]
```

### Transcription backends

| Backend | Flag | Speed | Cost |
|---|---|---|---|
| MLX Whisper | `--transcriber mlx` | Fast (Metal, Apple Silicon) | Free |
| whisper.cpp | `--transcriber cpp` | Fast (Metal, Apple Silicon) | Free |
| Groq API | `--transcriber groq` | ~228× realtime | ~$0.20 / 5h video |
whisper.cpp setup (if using `--transcriber cpp`):

```bash
brew install whisper-cpp
whisper-cpp-download-ggml-model medium
```

### Options

| Flag | Default | Description |
|---|---|---|
| `--transcriber` | `groq` | Transcription backend: `mlx`, `cpp`, or `groq` |
| `--whisper-model` | `medium` | Model size for mlx/cpp: `tiny` / `base` / `small` / `medium` / `large-v3` |
| `--cpp-model` | auto | Path to ggml model file (whisper.cpp only) |
| `--provider` | `openrouter` | LLM provider: `openrouter` or `openai` |
| `--screenshot-interval` | `30` | Seconds between captured frames |
| `--output-dir` | `output` | Base directory for all generated files |
| `--no-resume` | off | Ignore cached stages and rerun everything |
| `--metadata` / `--no-metadata` | on | Display auction metadata (date, city, auctioneer, farm, type) |
| `--summary` / `--no-summary` | on | Display summary statistics (totals, averages, counts by category) |
| `--table` / `--no-table` | on | Display full table of all lots with detailed information |
## Examples

```bash
# Default: OpenRouter Gemini 2.5 Flash-Lite extraction + Groq transcription
uv run python main.py "https://www.youtube.com/watch?v=..."

# Local MLX transcription (Apple Silicon only, requires --extra local)
uv run python main.py "https://www.youtube.com/watch?v=..." --transcriber mlx

# OpenRouter extraction (default provider)
uv run python main.py "https://www.youtube.com/watch?v=..." --provider openrouter

# OpenAI extraction alternative
uv run python main.py "https://www.youtube.com/watch?v=..." --provider openai

# Show only metadata and summary (no table)
uv run python main.py "https://www.youtube.com/watch?v=..." --no-table

# Show only the table (no metadata or summary)
uv run python main.py "https://www.youtube.com/watch?v=..." --no-metadata --no-summary

# Force full rerun (ignore all cached stages)
uv run python main.py "https://www.youtube.com/watch?v=..." --no-resume
```

## Output

All files are written to `output/<video_id>/`:
| File | Contents |
|---|---|
| `video_<video_id>.mp4` | Downloaded video |
| `audio_<video_id>.wav` | 16 kHz mono audio for Whisper |
| `transcript_<video_id>.json` | Timestamped transcript segments |
| `screenshots_<video_id>/` | JPEG frames at every N seconds |
| `screenshots_<video_id>.json` | Index of frames with timestamps |
| `ocr_results_<video_id>.json` | Screen text per timestamp |
| `lots_<video_id>.json` | Extracted lots (array) |
| `metadata_<video_id>.json` | Auction metadata (date, city, auctioneer, farm, type) |
| `result_<video_id>.json` | Final result with metadata and lots |
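The summary statistics displayed at the end of a run (totals, averages, counts by category) can be computed from the extracted lots array alone. A minimal sketch, assuming the lot field names shown in the sample output below; the `summarize` function and its output keys are illustrative, not the project's API.

```python
from collections import defaultdict

def summarize(lots: list[dict]) -> dict:
    """Totals and per-category averages over an extracted lots array."""
    animals = defaultdict(int)
    prices = defaultdict(list)
    sold = 0
    for lot in lots:
        cat = lot.get("category") or "desconhecida"
        animals[cat] += lot.get("num_animals") or 0
        if lot.get("unit_price") is not None:
            prices[cat].append(lot["unit_price"])
        sold += bool(lot.get("sold"))
    return {
        "total_lots": len(lots),
        "sold_lots": sold,
        "animals_by_category": dict(animals),
        "avg_price_by_category": {c: sum(v) / len(v) for c, v in prices.items()},
    }
```

Feeding it the parsed contents of `lots_<video_id>.json` yields the kind of per-category roll-up the `--summary` view prints.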
Example lot entry in `lots_<video_id>.json`:

```json
{
  "lot_number": 12,
  "sex": "macho",
  "category": "garrote",
  "num_animals": 30,
  "age_months": 18,
  "breed": "Nelore",
  "unit_price": 3200.00,
  "total_price": null,
  "sold": true,
  "timestamp_start": "01:24:35",
  "notes": null
}
```

## Performance

| Stage | mlx / cpp | groq |
|---|---|---|
| Download | ~5–15 min (network) | same |
| Transcribe | ~20–40 min (medium) | ~1–2 min ($0.20) |
| Screenshots (30s interval) | ~2–3 min | same |
| OCR (~600 frames) | ~5–10 min | same |
| LLM extraction (~30 windows) | ~3–8 min | same |
## LLM providers

| Provider | Flag | Default model | Auth |
|---|---|---|---|
| OpenRouter | `--provider openrouter` | `google/gemini-2.5-flash-lite-preview-09-2025` | `OPENROUTER_API_KEY` |
| OpenAI | `--provider openai` | `gpt-4.1-mini` | `OPENAI_API_KEY` |
The shipped model catalog is intentionally narrow and benchmark-driven:

| Provider | Model | Cost/video | Speed | Coverage | Accuracy (MAPE) |
|---|---|---|---|---|---|
| openrouter (default) | `google/gemini-2.5-flash-lite-preview-09-2025` | ~$0.05 | 13–24 s | 92–100% | 1.9–2.0% |
| openai (alt) | `gpt-4.1-mini` | ~$0.13 | 31 s | 100% | 0.1% |
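For reference, MAPE (mean absolute percentage error) in the accuracy column is the standard metric below; how the benchmark pairs extracted values with ground truth is not specified here, so this is just the formula, not the harness.

```python
def mape(predicted: list[float], actual: list[float]) -> float:
    """Mean absolute percentage error, in percent; skips zero actuals."""
    pairs = [(p, a) for p, a in zip(predicted, actual) if a]
    return 100.0 * sum(abs(p - a) / abs(a) for p, a in pairs) / len(pairs)
```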
Use `bench/` for the current benchmark harness. `benchmark.py` is a single-video comparison script that now targets the same two shipping models.
## Testing

The project includes a comprehensive unit test suite with 132 tests covering:

- Model validation — `Lot` and `AuctionResult` data validation, Brazilian number format coercion, price mis-parsing guards, required field checks
- LLM response parsing — JSON extraction with extra-text tolerance, lot merging, sold field detection
- Data aggregation — window overlap logic, transcript + OCR merging, broadcast clock filtering, empty window placeholders
- Summary statistics — animal counts by category and sex, average prices by category, sold/unsold tracking
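The data-aggregation behavior exercised above (grouping transcript and OCR output into fixed 10-minute windows, with empty placeholders) can be sketched as follows; the real data structures are assumed, so take the `(seconds, text)` tuple shape as illustrative.

```python
def aggregate_windows(segments, ocr_results, window_s: int = 600):
    """Group timestamped transcript segments and OCR hits into fixed windows.

    segments / ocr_results: lists of (start_seconds, text) tuples.
    Returns {window_index: {"transcript": [...], "ocr": [...]}}, including
    empty placeholders so downstream extraction sees every window.
    """
    end = max((t for t, _ in [*segments, *ocr_results]), default=0)
    windows = {i: {"transcript": [], "ocr": []} for i in range(int(end // window_s) + 1)}
    for t, text in segments:
        windows[int(t // window_s)]["transcript"].append(text)
    for t, text in ocr_results:
        windows[int(t // window_s)]["ocr"].append(text)
    return windows
```

Keeping empty windows in the result is deliberate: a silent stretch of video still produces a window, so window indices stay aligned with video time.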
Run tests:

```bash
uv run pytest tests/ -v                    # Run all tests with verbose output
uv run pytest tests/test_lot_model.py -v   # Run model tests only
```

All tests are pure unit tests with no external dependencies (no API calls, file I/O, or fixtures).
## License

This project is licensed under the MIT License — see the LICENSE file for details.