TeodoroRodrigo/cattle-auction

Cattle Auction Extractor

Extracts structured lot data from Brazilian cattle auction YouTube videos.

How it works

  1. Download — fetches the video with yt-dlp and extracts a 16kHz mono audio track
  2. Transcribe — transcribes audio in PT-BR using MLX Whisper (local, Metal), whisper.cpp (local, Metal), or Groq API (cloud)
  3. Screenshots — extracts one frame every 30 seconds with ffmpeg, with a live progress bar
  4. OCR — reads text visible on screen using RapidOCR (ONNX-based, fast, no native deps), with a live progress bar
  5. Aggregate — merges transcript segments and OCR results into 10-minute windows
  6. Extract lots — sends each window to an LLM with a structured PT-BR prompt to pull out lot data (number, sex, category, count, breed, price, sold status)
  7. Extract metadata — scans the first windows to extract auction-level info: date, city, auctioneer, farm, auction type
  8. Output — saves lots_<video_id>.json, metadata_<video_id>.json, and prints a summary table

Each stage is checkpointed. Interrupted runs resume automatically from where they left off.
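Step 5's windowing can be sketched roughly as below. This is an illustrative sketch, not the project's actual code; the segment field names ("start", "timestamp", "text") and return shape are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 600  # 10-minute windows, matching the aggregation step


def aggregate(transcript_segments, ocr_results):
    """Merge timestamped transcript segments and OCR hits into windows.

    Both inputs are assumed to be lists of dicts carrying a time in seconds
    ("start" for transcript, "timestamp" for OCR) plus a "text" field.
    """
    windows = defaultdict(lambda: {"transcript": [], "ocr": []})
    for seg in transcript_segments:
        windows[int(seg["start"]) // WINDOW_SECONDS]["transcript"].append(seg["text"])
    for hit in ocr_results:
        windows[int(hit["timestamp"]) // WINDOW_SECONDS]["ocr"].append(hit["text"])
    return dict(sorted(windows.items()))
```

Each window is then serialized into the LLM prompt for lot extraction, so a lot announced in speech and its price shown on screen land in the same context.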

Requirements

  • Python 3.11+
  • uv for dependency management
  • ffmpeg installed on the system
  • deno runtime (used by yt-dlp for YouTube download)
# macOS
brew install ffmpeg deno

# Windows
winget install Gyan.FFmpeg DenoLand.Deno

Setup

git clone <repo>
cd cattle-auction
uv sync --no-install-project                        # base deps
uv sync --extra local --no-install-project          # + mlx-whisper (Apple Silicon only)

Fill in your API keys in .env:

# .env
OPENAI_API_KEY=sk-...
GROQ_API_KEY=gsk-...
OPENROUTER_API_KEY=sk-or-...   # only if using --provider openrouter

Keys are loaded automatically from .env on every run. Only set the keys you need. The default run uses OpenRouter for extraction and Groq for transcription, so OPENROUTER_API_KEY and GROQ_API_KEY are required for the default path. Use OPENAI_API_KEY only when running with --provider openai.
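The loading step amounts to reading KEY=value pairs into the environment before the first API call. The stdlib-only sketch below shows the idea; the project itself likely uses a library such as python-dotenv rather than this hand-rolled parser.

```python
import os
from pathlib import Path


def load_env(path=".env"):
    """Minimal .env loader (illustrative only).

    Lines look like KEY=value; blanks and '#' comments are skipped.
    Existing environment variables are not overwritten.
    """
    p = Path(path)
    if not p.exists():
        return
    for line in p.read_text().splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" in line:
            key, value = line.split("=", 1)
            os.environ.setdefault(key.strip(), value.strip())
```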

Usage

uv run python main.py <youtube_url> [OPTIONS]

Transcription backends

| Backend | Flag | Speed | Cost |
| --- | --- | --- | --- |
| MLX Whisper | --transcriber mlx | Fast (Metal, Apple Silicon) | Free |
| whisper.cpp | --transcriber cpp | Fast (Metal, Apple Silicon) | Free |
| Groq API | --transcriber groq | ~228× realtime | ~$0.20 / 5h video |
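Internally, a flag like --transcriber typically maps to a simple dispatch over the three backends. The sketch below is hypothetical (the stub names and return shape are not the project's actual code); it only illustrates the mlx/cpp/groq switch.

```python
def transcribe_mlx(audio_path):
    """Placeholder: MLX Whisper (Apple Silicon, Metal)."""


def transcribe_cpp(audio_path):
    """Placeholder: whisper.cpp (local, Metal)."""


def transcribe_groq(audio_path):
    """Placeholder: Groq Whisper API (cloud)."""


BACKENDS = {"mlx": transcribe_mlx, "cpp": transcribe_cpp, "groq": transcribe_groq}


def transcribe(audio_path, backend="groq"):
    """Route transcription to the backend named by --transcriber."""
    try:
        return BACKENDS[backend](audio_path)
    except KeyError:
        raise ValueError(f"Unknown transcriber: {backend!r}") from None
```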

whisper.cpp setup (if using --transcriber cpp):

brew install whisper-cpp
whisper-cpp-download-ggml-model medium

Options

| Flag | Default | Description |
| --- | --- | --- |
| --transcriber | groq | Transcription backend: mlx, cpp, or groq |
| --whisper-model | medium | Model size for mlx/cpp: tiny / base / small / medium / large-v3 |
| --cpp-model | auto | Path to ggml model file (whisper.cpp only) |
| --provider | openrouter | LLM provider: openrouter or openai |
| --screenshot-interval | 30 | Seconds between captured frames |
| --output-dir | output | Base directory for all generated files |
| --no-resume | off | Ignore cached stages and rerun everything |
| --metadata / --no-metadata | on | Display auction metadata (date, city, auctioneer, farm, type) |
| --summary / --no-summary | on | Display summary statistics (totals, averages, counts by category) |
| --table / --no-table | on | Display full table of all lots with detailed information |

Examples

# Default: OpenRouter Gemini 2.5 Flash-Lite extraction + Groq transcription
uv run python main.py "https://www.youtube.com/watch?v=..."

# Local MLX transcription (Apple Silicon only, requires --extra local)
uv run python main.py "https://www.youtube.com/watch?v=..." --transcriber mlx

# OpenRouter extraction (default provider)
uv run python main.py "https://www.youtube.com/watch?v=..." --provider openrouter

# OpenAI extraction alternative
uv run python main.py "https://www.youtube.com/watch?v=..." --provider openai

# Show only metadata and summary (no table)
uv run python main.py "https://www.youtube.com/watch?v=..." --no-table

# Show only the table (no metadata or summary)
uv run python main.py "https://www.youtube.com/watch?v=..." --no-metadata --no-summary

# Force full rerun (ignore all cached stages)
uv run python main.py "https://www.youtube.com/watch?v=..." --no-resume

Output

All files are written to output/<video_id>/:

| File | Contents |
| --- | --- |
| video_<video_id>.mp4 | Downloaded video |
| audio_<video_id>.wav | 16kHz mono audio for Whisper |
| transcript_<video_id>.json | Timestamped transcript segments |
| screenshots_<video_id>/ | JPEG frames captured every N seconds |
| screenshots_<video_id>.json | Index of frames with timestamps |
| ocr_results_<video_id>.json | Screen text per timestamp |
| lots_<video_id>.json | Extracted lots (array) |
| metadata_<video_id>.json | Auction metadata (date, city, auctioneer, farm, type) |
| result_<video_id>.json | Final result with metadata and lots |

Lot schema

{
  "lot_number": 12,
  "sex": "macho",
  "category": "garrote",
  "num_animals": 30,
  "age_months": 18,
  "breed": "Nelore",
  "unit_price": 3200.00,
  "total_price": null,
  "sold": true,
  "timestamp_start": "01:24:35",
  "notes": null
}
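Price fields like unit_price arrive from the transcript and OCR in Brazilian notation (thousands dot, decimal comma: "3.200,00" or "R$ 3.200"). A coercion helper along these lines normalizes them before the schema is filled; this is an illustrative sketch, not the project's actual validator.

```python
def parse_brl(value):
    """Coerce Brazilian-formatted numbers ("3.200,00", "R$ 3.200") to float.

    Returns None for empty/absent values so fields like total_price can
    stay null, as in the lot schema above.
    """
    if value is None or value == "":
        return None
    if isinstance(value, (int, float)):
        return float(value)
    s = value.replace("R$", "").strip()
    s = s.replace(".", "").replace(",", ".")  # thousands dot, decimal comma
    return float(s)
```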

Estimated run times (5-hour video, Apple Silicon M2)

| Stage | mlx / cpp | groq |
| --- | --- | --- |
| Download | ~5–15 min (network) | same |
| Transcribe | ~20–40 min (medium) | ~1–2 min (~$0.20) |
| Screenshots (30s interval) | ~2–3 min | same |
| OCR (~600 frames) | ~5–10 min | same |
| LLM extraction (~30 windows) | ~3–8 min | same |

LLM providers

| Provider | Flag | Default model | Auth |
| --- | --- | --- | --- |
| OpenRouter | --provider openrouter | google/gemini-2.5-flash-lite-preview-09-2025 | OPENROUTER_API_KEY |
| OpenAI | --provider openai | gpt-4.1-mini | OPENAI_API_KEY |

Model benchmark

The shipped model catalog is intentionally narrow and benchmark-driven:

| Provider | Model | Cost/video | Speed | Coverage | Accuracy (MAPE) |
| --- | --- | --- | --- | --- | --- |
| openrouter (default) | google/gemini-2.5-flash-lite-preview-09-2025 | ~$0.05 | 13–24s | 92–100% | 1.9–2.0% |
| openai (alt) | gpt-4.1-mini | ~$0.13 | 31s | 100% | 0.1% |

The bench/ directory contains the current benchmark harness; benchmark.py is a single-video comparison script that targets the same two shipping models.
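The accuracy column is MAPE (mean absolute percentage error) over extracted values. For reference, it is computed as:

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    actual and predicted are equal-length sequences; actual values must be
    nonzero (prices always are here).
    """
    assert len(actual) == len(predicted) and len(actual) > 0
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100 * total / len(actual)
```

So a 2.0% MAPE on a lot priced at R$ 3.200 corresponds to an average error of about R$ 64.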

Testing

The project includes a comprehensive unit test suite with 132 tests covering:

  • Model validation — Lot and AuctionResult data validation, Brazilian number format coercion, price mis-parsing guards, required field checks
  • LLM response parsing — JSON extraction with extra-text tolerance, lot merging, sold field detection
  • Data aggregation — Window overlap logic, transcript + OCR merging, broadcast clock filtering, empty window placeholders
  • Summary statistics — Animal counts by category and sex, average prices by category, sold/unsold tracking
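A test in this suite looks roughly like the sketch below. The helper coerce_count and its behavior are hypothetical stand-ins for the real model validators; only the plain-assert pytest style matches the project.

```python
def coerce_count(raw):
    """Stand-in validator: turn OCR'd animal counts like "30 cab" into ints."""
    digits = "".join(ch for ch in str(raw) if ch.isdigit())
    return int(digits) if digits else None


def test_coerce_count():
    assert coerce_count("30") == 30
    assert coerce_count("30 cab") == 30  # OCR noise after the number
    assert coerce_count("") is None      # absent value stays null
```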

Run tests:

uv run pytest tests/ -v              # Run all tests with verbose output
uv run pytest tests/test_lot_model.py -v  # Run model tests only

All tests are pure unit tests with no external dependencies (no API calls, file I/O, or fixtures).

License

This project is licensed under the MIT License — see the LICENSE file for details.

About

Python CLI that downloads Brazilian cattle auction YouTube videos, transcribes Portuguese audio locally (MLX/whisper.cpp) or via Groq, performs OCR on screenshots, and uses LLMs to extract structured lot data (number, sex, category, count, breed, price). Outputs JSON with checkpoint support for resumable runs.
