TeodoroRodrigo/cattle-auction

Cattle Auction Extractor

Extracts structured lot data from Brazilian cattle auction YouTube videos.

How it works

  1. Download — fetches the video with yt-dlp and extracts a 16kHz mono audio track
  2. Transcribe — transcribes audio in PT-BR using MLX Whisper (local, Metal), whisper.cpp (local, Metal), or Groq API (cloud)
  3. Screenshots — extracts one frame every 30 seconds with ffmpeg, with a live progress bar
  4. OCR — reads text visible on screen using RapidOCR (ONNX-based, fast, no native deps), with a live progress bar
  5. Aggregate — merges transcript segments and OCR results into 10-minute windows
  6. Extract lots — sends each window to an LLM with a structured PT-BR prompt to pull out lot data (number, sex, category, count, breed, price, sold status)
  7. Extract metadata — scans the first windows to extract auction-level info: date, city, auctioneer, farm, auction type
  8. Output — saves lots_<video_id>.json, metadata_<video_id>.json, and prints a summary table

Each stage is checkpointed. Interrupted runs resume automatically from where they left off.
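Step 5's windowing can be sketched roughly as below. This is an illustrative sketch, not the project's actual code; the segment field names ("start", "timestamp", "text") and return shape are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 600  # 10-minute windows, matching the aggregation step


def aggregate(transcript_segments, ocr_results):
    """Merge timestamped transcript segments and OCR hits into windows.

    Both inputs are assumed to be lists of dicts carrying a time in seconds
    ("start" for transcript, "timestamp" for OCR) plus a "text" field.
    """
    windows = defaultdict(lambda: {"transcript": [], "ocr": []})
    for seg in transcript_segments:
        windows[int(seg["start"]) // WINDOW_SECONDS]["transcript"].append(seg["text"])
    for hit in ocr_results:
        windows[int(hit["timestamp"]) // WINDOW_SECONDS]["ocr"].append(hit["text"])
    return dict(sorted(windows.items()))
```

Each window is then serialized into the LLM prompt for lot extraction, so a lot announced in speech and its price shown on screen land in the same context.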

Requirements

  • Python 3.11+
  • uv for dependency management
  • ffmpeg installed on the system
  • deno runtime (used by yt-dlp for YouTube download)
# macOS
brew install ffmpeg deno

# Windows
winget install Gyan.FFmpeg DenoLand.Deno

Setup

git clone <repo>
cd cattle-auction
uv sync --no-install-project                        # base deps
uv sync --extra local --no-install-project          # + mlx-whisper (Apple Silicon only)

Fill in your API keys in .env:

# .env
OPENAI_API_KEY=sk-...
GROQ_API_KEY=gsk-...
OPENROUTER_API_KEY=sk-or-...   # only if using --provider openrouter

Keys are loaded automatically from .env on every run. Only set the keys you need. The default run uses OpenRouter for extraction and Groq for transcription, so OPENROUTER_API_KEY and GROQ_API_KEY are required for the default path. Use OPENAI_API_KEY only when running with --provider openai.
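The loading step amounts to reading KEY=value pairs into the environment before the first API call. The stdlib-only sketch below shows the idea; the project itself likely uses a library such as python-dotenv rather than this hand-rolled parser.

```python
import os
from pathlib import Path


def load_env(path=".env"):
    """Minimal .env loader (illustrative only).

    Lines look like KEY=value; blanks and '#' comments are skipped.
    Existing environment variables are not overwritten.
    """
    p = Path(path)
    if not p.exists():
        return
    for line in p.read_text().splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" in line:
            key, value = line.split("=", 1)
            os.environ.setdefault(key.strip(), value.strip())
```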

Usage

uv run python main.py <youtube_url> [OPTIONS]

Transcription backends

| Backend | Flag | Speed | Cost |
| --- | --- | --- | --- |
| MLX Whisper | --transcriber mlx | Fast (Metal, Apple Silicon) | Free |
| whisper.cpp | --transcriber cpp | Fast (Metal, Apple Silicon) | Free |
| Groq API | --transcriber groq | ~228× realtime | ~$0.20 / 5h video |
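Internally, a flag like --transcriber typically maps to a simple dispatch over the three backends. The sketch below is hypothetical (the stub names and return shape are not the project's actual code); it only illustrates the mlx/cpp/groq switch.

```python
def transcribe_mlx(audio_path):
    """Placeholder: MLX Whisper (Apple Silicon, Metal)."""


def transcribe_cpp(audio_path):
    """Placeholder: whisper.cpp (local, Metal)."""


def transcribe_groq(audio_path):
    """Placeholder: Groq Whisper API (cloud)."""


BACKENDS = {"mlx": transcribe_mlx, "cpp": transcribe_cpp, "groq": transcribe_groq}


def transcribe(audio_path, backend="groq"):
    """Route transcription to the backend named by --transcriber."""
    try:
        return BACKENDS[backend](audio_path)
    except KeyError:
        raise ValueError(f"Unknown transcriber: {backend!r}") from None
```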

whisper.cpp setup (if using --transcriber cpp):

brew install whisper-cpp
whisper-cpp-download-ggml-model medium

Options

| Flag | Default | Description |
| --- | --- | --- |
| --transcriber | groq | Transcription backend: mlx, cpp, or groq |
| --whisper-model | medium | Model size for mlx/cpp: tiny / base / small / medium / large-v3 |
| --cpp-model | auto | Path to ggml model file (whisper.cpp only) |
| --provider | openrouter | LLM provider: openrouter or openai |
| --screenshot-interval | 30 | Seconds between captured frames |
| --output-dir | output | Base directory for all generated files |
| --no-resume | off | Ignore cached stages and rerun everything |
| --metadata / --no-metadata | on | Display auction metadata (date, city, auctioneer, farm, type) |
| --summary / --no-summary | on | Display summary statistics (totals, averages, counts by category) |
| --table / --no-table | on | Display full table of all lots with detailed information |

Examples

# Default: OpenRouter Gemini 2.5 Flash-Lite extraction + Groq transcription
uv run python main.py "https://www.youtube.com/watch?v=..."

# Local MLX transcription (Apple Silicon only, requires --extra local)
uv run python main.py "https://www.youtube.com/watch?v=..." --transcriber mlx

# OpenRouter extraction (default provider)
uv run python main.py "https://www.youtube.com/watch?v=..." --provider openrouter

# OpenAI extraction alternative
uv run python main.py "https://www.youtube.com/watch?v=..." --provider openai

# Show only metadata and summary (no table)
uv run python main.py "https://www.youtube.com/watch?v=..." --no-table

# Show only the table (no metadata or summary)
uv run python main.py "https://www.youtube.com/watch?v=..." --no-metadata --no-summary

# Force full rerun (ignore all cached stages)
uv run python main.py "https://www.youtube.com/watch?v=..." --no-resume

Output

All files are written to output/<video_id>/:

| File | Contents |
| --- | --- |
| video_<video_id>.mp4 | Downloaded video |
| audio_<video_id>.wav | 16kHz mono audio for Whisper |
| transcript_<video_id>.json | Timestamped transcript segments |
| screenshots_<video_id>/ | JPEG frames captured every N seconds |
| screenshots_<video_id>.json | Index of frames with timestamps |
| ocr_results_<video_id>.json | Screen text per timestamp |
| lots_<video_id>.json | Extracted lots (array) |
| metadata_<video_id>.json | Auction metadata (date, city, auctioneer, farm, type) |
| result_<video_id>.json | Final result with metadata and lots |

Lot schema

{
  "lot_number": 12,
  "sex": "macho",
  "category": "garrote",
  "num_animals": 30,
  "age_months": 18,
  "breed": "Nelore",
  "unit_price": 3200.00,
  "total_price": null,
  "sold": true,
  "timestamp_start": "01:24:35",
  "notes": null
}
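Price fields like unit_price arrive from the transcript and OCR in Brazilian notation (thousands dot, decimal comma: "3.200,00" or "R$ 3.200"). A coercion helper along these lines normalizes them before the schema is filled; this is an illustrative sketch, not the project's actual validator.

```python
def parse_brl(value):
    """Coerce Brazilian-formatted numbers ("3.200,00", "R$ 3.200") to float.

    Returns None for empty/absent values so fields like total_price can
    stay null, as in the lot schema above.
    """
    if value is None or value == "":
        return None
    if isinstance(value, (int, float)):
        return float(value)
    s = value.replace("R$", "").strip()
    s = s.replace(".", "").replace(",", ".")  # thousands dot, decimal comma
    return float(s)
```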

Estimated run times (5-hour video, Apple Silicon M2)

| Stage | mlx / cpp | groq |
| --- | --- | --- |
| Download | ~5–15 min (network) | same |
| Transcribe | ~20–40 min (medium) | ~1–2 min (~$0.20) |
| Screenshots (30s interval) | ~2–3 min | same |
| OCR (~600 frames) | ~5–10 min | same |
| LLM extraction (~30 windows) | ~3–8 min | same |

LLM providers

| Provider | Flag | Default model | Auth |
| --- | --- | --- | --- |
| OpenRouter | --provider openrouter | google/gemini-2.5-flash-lite-preview-09-2025 | OPENROUTER_API_KEY |
| OpenAI | --provider openai | gpt-4.1-mini | OPENAI_API_KEY |

Model benchmark

The shipped model catalog is intentionally narrow and benchmark-driven:

| Provider | Model | Cost/video | Speed | Coverage | Accuracy (MAPE) |
| --- | --- | --- | --- | --- | --- |
| openrouter (default) | google/gemini-2.5-flash-lite-preview-09-2025 | ~$0.05 | 13–24s | 92–100% | 1.9–2.0% |
| openai (alt) | gpt-4.1-mini | ~$0.13 | 31s | 100% | 0.1% |

The bench/ directory contains the current benchmark harness; benchmark.py is a single-video comparison script that targets the same two shipping models.
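The accuracy column is MAPE (mean absolute percentage error) over extracted values. For reference, it is computed as:

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    actual and predicted are equal-length sequences; actual values must be
    nonzero (prices always are here).
    """
    assert len(actual) == len(predicted) and len(actual) > 0
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100 * total / len(actual)
```

So a 2.0% MAPE on a lot priced at R$ 3.200 corresponds to an average error of about R$ 64.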

Testing

The project includes a comprehensive unit test suite with 132 tests covering:

  • Model validation — Lot and AuctionResult data validation, Brazilian number format coercion, price mis-parsing guards, required field checks
  • LLM response parsing — JSON extraction with extra-text tolerance, lot merging, sold field detection
  • Data aggregation — Window overlap logic, transcript + OCR merging, broadcast clock filtering, empty window placeholders
  • Summary statistics — Animal counts by category and sex, average prices by category, sold/unsold tracking
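A test in this suite looks roughly like the sketch below. The helper coerce_count and its behavior are hypothetical stand-ins for the real model validators; only the plain-assert pytest style matches the project.

```python
def coerce_count(raw):
    """Stand-in validator: turn OCR'd animal counts like "30 cab" into ints."""
    digits = "".join(ch for ch in str(raw) if ch.isdigit())
    return int(digits) if digits else None


def test_coerce_count():
    assert coerce_count("30") == 30
    assert coerce_count("30 cab") == 30  # OCR noise after the number
    assert coerce_count("") is None      # absent value stays null
```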

Run tests:

uv run pytest tests/ -v              # Run all tests with verbose output
uv run pytest tests/test_lot_model.py -v  # Run model tests only

All tests are pure unit tests with no external dependencies (no API calls, file I/O, or fixtures).

License

This project is licensed under the MIT License — see the LICENSE file for details.

About

Python CLI that downloads Brazilian cattle auction YouTube videos, transcribes Portuguese audio locally (MLX/whisper.cpp) or via Groq, performs OCR on screenshots, and uses LLMs to extract structured lot data (number, sex, category, count, breed, price). Outputs JSON with checkpoint support for resumable runs.
