Local video/audio transcription on Apple Silicon using MLX Whisper.
No API keys. No cloud. No cost. Runs entirely on your Mac.
Supports YouTube, Bilibili (Bη«™), Xiaohongshu (ε°ηΊ’δΉ¦), Douyin (ζŠ–ιŸ³), podcasts, and local files.
```bash
# Install dependencies
brew install yt-dlp ffmpeg
python3 -m venv ~/.openclaw/venvs/whisper
~/.openclaw/venvs/whisper/bin/pip install mlx-whisper

# Transcribe
bash scripts/transcribe.sh "https://www.youtube.com/watch?v=..."
```

Output: `/tmp/whisper_output.txt` (plain text) + `/tmp/whisper_output.json` (with timestamps)
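Under the hood, the quick-start boils down to two steps: extract the audio with yt-dlp, then feed it to MLX Whisper. Here is a minimal Python sketch of that pipeline; the audio path, output format, and lazy import are illustrative assumptions, not what `scripts/transcribe.sh` literally does:

```python
import subprocess

# Assumed temp path for the extracted audio (not the script's actual value).
AUDIO_PATH = "/tmp/whisper_input.m4a"

def build_download_cmd(url: str, out_path: str = AUDIO_PATH) -> list[str]:
    """Build a yt-dlp invocation that extracts audio only (-x)."""
    return ["yt-dlp", "-x", "--audio-format", "m4a", "-o", out_path, url]

def transcribe(url: str) -> dict:
    """Download the audio, then transcribe it locally with MLX Whisper."""
    subprocess.run(build_download_cmd(url), check=True)
    import mlx_whisper  # imported lazily; requires Apple Silicon + mlx-whisper
    # transcribe() returns a dict with "text" plus per-segment "segments"
    return mlx_whisper.transcribe(AUDIO_PATH)
```

For local files, the yt-dlp step can be skipped entirely and the file path passed straight to `mlx_whisper.transcribe`.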
- **Apple Silicon optimized** – MLX framework, fast inference on M1/M2/M3/M4
- **Multi-language** – auto-detects the language; strong Chinese/English/Japanese support
- **Multi-platform** – YouTube, Bilibili, Xiaohongshu, Douyin, and 1000+ other sites
- **Local files** – MP4, MP3, WAV, M4A, etc.
- **Timestamps** – JSON output includes per-segment timing
- **OpenClaw ready** – drop into `skills/` and let your AI agent transcribe & summarize
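The per-segment timing in the JSON output can be rendered as SRT-style subtitles. A minimal sketch, assuming the file follows Whisper's usual schema (a top-level `segments` list whose entries carry `start`, `end`, and `text`):

```python
import json

def fmt_ts(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (the SRT convention)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments: list[dict]) -> str:
    """Render Whisper-style segments as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{fmt_ts(seg['start'])} --> {fmt_ts(seg['end'])}\n{seg['text'].strip()}"
        )
    return "\n\n".join(blocks)

# In practice, load the transcriber's output:
# segments = json.load(open("/tmp/whisper_output.json"))["segments"]
segments = [{"start": 0.0, "end": 4.5, "text": " Hello."},
            {"start": 4.5, "end": 9.0, "text": " World."}]
print(to_srt(segments))
```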
On a Mac mini M4 (16GB):
| Video Length | Time (medium model) |
|---|---|
| 5 min | ~30-40s |
| 10 min | ~60-90s |
| 30 min | ~3-4 min |
| 60 min | ~6-8 min |
See SKILL.md for full docs, model options, and the integration guide.
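Model choice trades speed for accuracy (the table above uses `medium`). A hypothetical helper for switching models; the `mlx-community` repo names below are illustrative assumptions, so check SKILL.md for the options the script actually supports:

```python
# Map a model size to an MLX-converted Whisper repo on Hugging Face.
# Repo names are assumptions for illustration; verify before use.
MODELS = {
    "tiny": "mlx-community/whisper-tiny",
    "medium": "mlx-community/whisper-medium-mlx",
    "large": "mlx-community/whisper-large-v3-mlx",
}

def model_repo(size: str = "medium") -> str:
    """Return the Hugging Face repo for a given Whisper model size."""
    if size not in MODELS:
        raise ValueError(f"unknown model size: {size!r}")
    return MODELS[size]

# Usage (requires Apple Silicon):
# import mlx_whisper
# result = mlx_whisper.transcribe("audio.m4a", path_or_hf_repo=model_repo("large"))
```

Smaller models roughly halve the times in the table at the cost of accuracy, especially on non-English audio; `large` does the reverse.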
MIT