A Flask-based web application for transcribing audio and video files using the Mistral AI Voxtral-Mini-3B model. Features a modern web interface with real-time progress updates, REST API, and WebSocket support, while leveraging local AI processing for privacy-focused, cost-free transcription.
- 🌐 Modern Web Interface - User-friendly drag-and-drop file upload
- ⚡ Real-Time Progress - Live updates via WebSocket during transcription
- 📊 Progress Tracking - Visual progress bar with chunk-by-chunk updates
- 📋 Easy Export - Copy to clipboard or download as text file
- 🎨 Responsive Design - Works on desktop and mobile browsers
- 🎵 Audio & Video Support - WAV, MP3, FLAC, M4A, MP4, AVI, MOV
- 🎬 FFmpeg Pre-Conversion - Converts complex formats to clean PCM WAV for reliable processing
- 🌍 30+ Languages - Multilingual support (English, French, Spanish, and more)
- 🗣️ Auto Language Detection - Automatically detects language per chunk using SpeechBrain
- 🔄 Smart Chunking - 90-second chunks optimized for multi-language content
- 🎚️ Audio Normalization - Automatic volume normalization for better recognition
- 🎯 High Accuracy - Powered by Mistral AI Voxtral-Mini-3B model
- 🎙️ Distant Speaker Enhancement - Optional FFmpeg filter chain for audio recorded 5+ meters from microphone
- 🚀 Device Auto-Detection - MPS (Apple Silicon), CUDA (NVIDIA GPU), or CPU
- 🔒 Privacy-Focused - All processing on local hardware, no cloud uploads
- 💰 No API Costs - Completely free to use
- ⚙️ Efficient Memory - Smart chunking prevents memory overflow
- 📊 Memory Monitoring - Real-time RAM usage warnings to prevent system slowdown
- 🔄 Auto-Updates - Automatic update notifications from GitHub releases
flowchart TD
A[Web Browser<br/>HTML/CSS/JS + Socket.IO] -->|HTTP/WebSocket| B[Flask Application<br/>app.py]
B --> C[REST API + WebSocket<br/>+ Model Selector]
C --> P[Audio Preprocessing<br/>FFmpeg Conversion]
P -->|Optional| E1[Distant Speaker<br/>Enhancement]
E1 --> D
E1 --> E
E1 --> F
P --> D[MLX Engine<br/>Mac M1+]
P --> E[Voxtral Engine<br/>CUDA/MPS/CPU]
P --> F[GGUF Backend<br/>Future]
D --> G[MLX Models<br/>3-25 GB<br/>Apple Silicon]
E --> H[PyTorch Models<br/>9-97 GB<br/>Any Platform]
F --> I[GGUF Models<br/>Future<br/>CPU/GPU]
| Model | Size | Platform | RAM Required | Load Time | Best For |
|---|---|---|---|---|---|
| MLX Mini 3B (4-bit) | 3.2 GB | Mac M1/M2/M3/M4 | 4-5 GB | < 1 min | ⭐ Recommended for Mac |
| MLX Mini 3B (8-bit) | 5.3 GB | Mac M1/M2/M3/M4 | 6-7 GB | < 2 min | Better quality on Mac |
| MLX Small 24B (8-bit) | 25 GB | Mac (64GB+ RAM) | 50-60 GB | 5-10 min | Best quality on Mac |
| Voxtral Mini 3B (Full) | 9.4 GB | Any (CUDA/MPS/CPU) | 20-30 GB | 10-30 min | Cross-platform |
| Voxtral Mini 3B (4-bit) | 9.4 GB | NVIDIA GPU | 5-8 GB | 2-5 min | NVIDIA GPU users |
| Voxtral Small 24B (Full) | 97 GB | GPU (55GB+ VRAM) | 55+ GB | N/A | Enterprise GPUs |
| Voxtral Small 24B (4-bit) | 97 GB | NVIDIA (16GB+) | 16-20 GB | 5-10 min | High-end NVIDIA |
Apple Silicon Mac (M1/M2/M3/M4):
- 8-16 GB RAM → MLX Mini 3B (4-bit) ⭐
- 16-32 GB RAM → MLX Mini 3B (8-bit)
- 64+ GB RAM → MLX Small 24B (8-bit)
NVIDIA GPU:
- 8 GB VRAM → Voxtral Mini 3B (4-bit)
- 16+ GB VRAM → Voxtral Small 24B (4-bit)
CPU Only (Windows/Linux):
- Voxtral Mini 3B (Full) - slow but works
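As a hedged illustration, the hardware-to-model guidance above can be expressed as a small helper. The function name and platform strings are ours, not part of the application:

```python
def recommend_model(platform: str, ram_gb: int = 0, vram_gb: int = 0) -> str:
    """Map hardware to a suggested model, per the recommendations above."""
    if platform == "apple_silicon":
        if ram_gb >= 64:
            return "MLX Small 24B (8-bit)"
        if ram_gb >= 16:
            return "MLX Mini 3B (8-bit)"
        return "MLX Mini 3B (4-bit)"   # 8-16 GB RAM, the recommended default
    if platform == "nvidia":
        if vram_gb >= 16:
            return "Voxtral Small 24B (4-bit)"
        return "Voxtral Mini 3B (4-bit)"
    return "Voxtral Mini 3B (Full)"    # CPU-only fallback: slow but works
```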
- Python 3.11 or later
- Operating System: macOS, Linux, or Windows
- Disk Space: 3-25 GB depending on model (see table above)
- RAM: 4-60 GB depending on model (see table above)
- Internet: Required for initial model download only
- FFmpeg - For audio/video pre-conversion (ensures reliable processing)
- macOS: `brew install ffmpeg`
- Windows: Download from ffmpeg.org or `choco install ffmpeg`
- Linux: `sudo apt install ffmpeg`
| Platform | First Time | After Setup |
|---|---|---|
| macOS | Double-click `Start Voxtral Web - Mac.command` | Same - just double-click |
| Windows | Double-click `Start Voxtral Web - Windows.bat` | Same - just double-click |
Both launchers will prompt you to run setup if needed on first use.
macOS/Linux:

```bash
cd transcribe-voxtral-main/VoxtralApp
./setup.sh      # First time only
./start_web.sh  # Start the app
```

Windows:

```batch
cd transcribe-voxtral-main\VoxtralApp
python -m venv voxtral_env && voxtral_env\Scripts\activate.bat && pip install -r requirements.txt
start_web.bat
```

- Open http://localhost:8000 in your browser
- Drag and drop an audio/video file
- Select the language
- Optional: Enable "Distant Speaker Enhancement" if speakers are far from the mic
- Click "Start Transcription"
- Copy or download your transcript when done
transcribe-voxtral-main/VoxtralApp/
├── app.py # Flask web application
├── transcription_engine.py # Core transcription logic
├── transcribe_voxtral.py # CLI script for batch processing
├── requirements.txt # Production dependencies
├── requirements-dev.txt # Development dependencies
├── static/
│ ├── css/ # Stylesheets
│ ├── js/ # JavaScript frontend
│ └── assets/ # Images and icons
├── templates/
│ └── index.html # Main web interface
├── tests/ # Comprehensive test suite
├── docs/
│ ├── API_DOCUMENTATION.md # API reference
│ └── USER_GUIDE.md # User manual
├── uploads/ # Temporary uploads (auto-cleanup)
├── transcriptions_voxtral_final/ # Saved transcripts
└── voxtral_env/ # Python virtual environment
Perfect for interactive use with real-time feedback:
```bash
cd transcribe-voxtral-main/VoxtralApp
./start_web.sh  # or start_web.bat on Windows
```

Access at http://localhost:8000
Features:
- Drag-and-drop file upload
- Live progress updates
- Visual feedback
- Copy/download transcripts
- Language selection
For automating multiple files or integration with scripts:
```bash
cd transcribe-voxtral-main/VoxtralApp
source voxtral_env/bin/activate
python transcribe_voxtral.py
```

Features:
- Batch processing of all audio files in a directory
- Headless operation
- Scriptable and automatable
- Lower memory overhead
The Voxtral model supports 30+ languages:
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| `en` | English | `fr` | French | `es` | Spanish |
| `de` | German | `it` | Italian | `pt` | Portuguese |
| `nl` | Dutch | `pl` | Polish | `ru` | Russian |
| `zh` | Chinese | `ja` | Japanese | `ko` | Korean |
| `ar` | Arabic | `hi` | Hindi | `tr` | Turkish |
| `sv` | Swedish | `da` | Danish | `no` | Norwegian |
| `fi` | Finnish | `cs` | Czech | `sk` | Slovak |
| `uk` | Ukrainian | `ro` | Romanian | `el` | Greek |
| `he` | Hebrew | `id` | Indonesian | `vi` | Vietnamese |
| `th` | Thai | `ms` | Malay | `ca` | Catalan |
The Flask application provides a REST API and WebSocket interface for programmatic access.
- `POST /api/upload` - Upload audio/video file
- `POST /api/transcribe` - Start transcription job
- `GET /api/status/<job_id>` - Get job status
- `GET /api/transcript/<job_id>` - Retrieve transcript
- `GET /api/transcript/<job_id>/download` - Download as file
- `GET /api/languages` - Get supported languages
- `GET /api/device-info` - Get device information
- `GET /api/system/memory` - Get current memory usage and status
- `GET /api/version` - Get current application version
- `GET /api/updates/check` - Check for available updates from GitHub
- `transcription_progress` - Real-time progress updates
- `transcription_complete` - Completion notification
- `transcription_error` - Error notifications
- `memory_warning` - Real-time memory usage warnings (80%+ RAM)
For complete API reference, see VoxtralApp/docs/API_DOCUMENTATION.md
Edit `app.py` to configure:

```python
MAX_FILE_SIZE = 500 * 1024 * 1024  # 500MB (line 26)
UPLOAD_FOLDER = BASE_DIR / "uploads"  # Upload directory (line 23)
OUTPUT_FOLDER = BASE_DIR / "transcriptions_voxtral_final"  # Output (line 24)
```

Edit `transcription_engine.py` to configure:

```python
chunk_duration_s: int = 2 * 60  # Chunk size in seconds (line 150)
sample_rate: int = 16000  # Audio sample rate (line 151)
```

Edit `transcribe_voxtral.py` to configure batch processing:

```python
INPUT_DIRECTORY = "."  # Where to find audio files
OUTPUT_SUBFOLDER_NAME = "transcriptions_voxtral_final"  # Output folder
```

The application automatically detects and uses the best available hardware:
| Device | Speed | Example (10 min audio) |
|---|---|---|
| Apple M1/M2/M3 (MPS) | ~1-2x realtime | 5-10 min processing |
| NVIDIA GPU (CUDA) | ~1-3x realtime | 3-10 min processing |
| CPU (Fallback) | ~0.1-0.5x realtime | 20-100 min processing |
Note: Actual speed varies with audio quality, language, and specific hardware.
MPS (Apple Silicon)
- M1, M2, M3, M4 chips
- Uses `bfloat16` precision
- Automatic cache clearing
- Fastest on Apple devices

CUDA (NVIDIA GPUs)
- Requires CUDA-compatible GPU
- Uses `bfloat16` precision
- Requires CUDA toolkit

CPU (Universal)
- Works on all systems
- Uses `float32` precision
- Slower but reliable
The application includes a comprehensive test suite with pytest.
```bash
cd transcribe-voxtral-main/VoxtralApp

# Activate test environment
source test_venv/bin/activate

# Run all tests (excluding model/GPU tests)
export TESTING=1
pytest tests/ -v -m "not requires_model and not requires_gpu and not slow"

# Run specific test categories
pytest tests/test_api.py -v          # API tests
pytest tests/test_integration.py -v  # Integration tests
pytest tests/ -v -m unit             # Unit tests only
```

Test markers:
- `unit` - Unit tests for individual components
- `api` - API endpoint tests
- `integration` - Integration tests
- `slow` - Long-running tests
- `requires_model` - Tests needing the ML model (skipped in CI)
- `requires_gpu` - Tests requiring GPU (skipped in CI)
- `cross_platform` - Platform compatibility tests
For more details, see VoxtralApp/tests/README.md
Model: Mistral AI Voxtral-Mini-3B-2507
- Type: Conditional Generation (Audio-to-Text)
- Size: ~20GB download
- Architecture: Transformer-based encoder-decoder
- Sample Rate: 16kHz (automatically resampled)
- License: HuggingFace Model Page
The model is downloaded automatically on first use:
- Download Size: ~20GB
- Download Time: 10-60 minutes (depends on internet speed)
- Cache Location: `~/.cache/huggingface/hub/`
- Redownload: Not needed - model is cached locally
- Model loads from cache in 10-30 seconds
- No internet connection required
- Processing starts immediately
Check Python version:

```bash
python --version  # Should be 3.11+
```

Reinstall dependencies:

```bash
pip install -r requirements.txt --force-reinstall
```

Solutions:
- Verify server is running (check terminal output)
- Try http://127.0.0.1:8000 instead
- Check if port 8000 is in use
- Check firewall settings
Solutions:
- Ensure 20GB+ free disk space
- Check internet connection
- Check firewall/proxy settings
- Downloads resume automatically - try again
Solutions:
- Close other applications
- Reduce chunk size in `transcription_engine.py`
- Process shorter files
- Restart your computer
Solutions:
- Verify correct language selected
- Use high-quality audio (minimal background noise)
- Ensure adequate audio volume
- Try with clear speech examples first
Solutions:
- Install FFmpeg (see requirements)
- Install moviepy: `pip install moviepy`
- Convert video manually using FFmpeg
- Try different video format
For detailed troubleshooting, see VoxtralApp/docs/USER_GUIDE.md
Format code:

```bash
cd transcribe-voxtral-main/VoxtralApp
source test_venv/bin/activate

# Auto-format with black
black app.py transcription_engine.py transcribe_voxtral.py tests/*.py

# Sort imports
isort app.py transcription_engine.py transcribe_voxtral.py tests/*.py --skip test_venv
```

Lint code:

```bash
flake8 . --config=.flake8
```

- Run tests before committing
- Follow code style (black, isort)
- Add tests for new features
- Update documentation
- User Guide - Complete user manual
- API Documentation - API reference
- Test Documentation - Testing guide
- torch - Deep learning framework
- torchaudio - Audio processing for PyTorch
- transformers - HuggingFace model interface
- librosa - Audio processing and normalization
- soundfile - Audio file I/O
- speechbrain - Automatic language detection
- Flask - Web framework
- Flask-SocketIO - Real-time WebSocket support
- mistral-common - Mistral AI utilities
- psutil - System memory monitoring
- pytest - Testing framework
- black - Code formatting
- isort - Import sorting
- flake8 - Linting
See requirements.txt and requirements-dev.txt for complete list.
- ✅ 100% Local Processing - No cloud uploads
- ✅ No Data Collection - No analytics or tracking
- ✅ Open Source - Fully auditable code
- ✅ No Account Required - Use immediately
- ✅ Automatic Cleanup - Temporary files deleted after processing
- Test First - Start with a short audio clip to verify setup
- Correct Language - Always select the spoken language
- Quality Audio - Use clear recordings for best results
- Adequate Storage - Ensure 20GB+ free for model + files
- Monitor Progress - Watch first transcription to verify quality
- Save Transcripts - Download/copy before closing browser
The application includes intelligent memory monitoring to prevent system slowdown:
- Normal (< 80% RAM) - No warnings, optimal performance
- Warning (80-90% RAM) - Yellow banner appears, transcription continues
- Critical (> 90% RAM) - Red banner appears, consider stopping transcription
The transcription engine automatically adjusts based on available memory:
- < 2GB available - Uses 60-second chunks (reduced memory footprint)
- 2-4GB available - Uses 90-second chunks (balanced performance)
- > 4GB available - Uses 120-second chunks (optimal performance)
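The threshold mapping above reduces to a few lines; this is a sketch with an illustrative function name (the real engine reads available memory via `psutil`):

```python
def pick_chunk_seconds(available_gb: float) -> int:
    """Map available RAM to chunk length, per the tiers above."""
    if available_gb < 2:
        return 60    # reduced memory footprint
    if available_gb <= 4:
        return 90    # balanced performance
    return 120       # optimal performance
```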
- Real-time Monitoring - Memory status checked every 15 seconds
- WebSocket Alerts - Instant warnings when RAM usage exceeds thresholds
- Automatic Cleanup - Garbage collection after each audio chunk
- Device Cache Clearing - MPS/CUDA caches cleared between chunks
- Visual Banners - Clear on-screen warnings with RAM percentage
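The per-chunk cleanup described above (garbage collection plus device-cache clearing) might look roughly like this sketch; the import is guarded so it also runs on machines without PyTorch:

```python
import gc
import importlib.util

def cleanup_after_chunk() -> None:
    """Free per-chunk memory: collect garbage, then clear device caches."""
    gc.collect()
    if importlib.util.find_spec("torch") is None:
        return  # nothing else to clear without PyTorch
    import torch
    if torch.backends.mps.is_available():
        torch.mps.empty_cache()   # clear the MPS cache between chunks
    elif torch.cuda.is_available():
        torch.cuda.empty_cache()  # clear the CUDA cache between chunks
```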
- Close Other Applications - Free up RAM before transcribing
- Monitor Banners - Watch for memory warnings during processing
- Restart if Needed - Stop transcription if critical warning appears
- Chunk Size - System automatically adjusts based on available RAM
For recordings where speakers are far from the microphone (5+ meters), enable the Distant Speaker Enhancement checkbox before transcribing.
- Conference room recordings with distant speakers
- Lectures captured from the back of a room
- Interviews with inconsistent microphone distances
- Any recording where speech sounds quiet or muddy
The enhancement applies a sophisticated FFmpeg audio filter chain:
| Filter | Purpose |
|---|---|
| High-pass (80Hz) | Removes low-frequency rumble (air conditioning, traffic) |
| Low-pass (8kHz) | Removes high-frequency hiss |
| Compand | Dynamic compression to bring up quiet speech |
| EQ Boost | Boosts voice frequencies (300Hz, 1kHz, 2.5kHz, 3.5kHz) |
| Loudnorm (-14 LUFS) | Normalizes loudness to broadcast standard |
For complete technical details, see Audio Enhancement Documentation.
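To show how the filters in the table combine, here is an illustrative sketch that assembles a comparable FFmpeg filter chain; the parameter values are examples, not the application's exact settings:

```python
# Illustrative filter chain mirroring the table above; values are examples.
FILTER_CHAIN = ",".join([
    "highpass=f=80",                  # remove low-frequency rumble
    "lowpass=f=8000",                 # remove high-frequency hiss
    "compand=attacks=0.3:points=-70/-60|-30/-10|0/-3",  # lift quiet speech
    "equalizer=f=1000:t=q:w=1:g=3",   # boost a voice frequency band
    "loudnorm=I=-14",                 # normalize to -14 LUFS
])

def build_ffmpeg_cmd(src: str, dst: str) -> list[str]:
    """Assemble an ffmpeg command applying the enhancement chain."""
    return ["ffmpeg", "-i", src, "-af", FILTER_CHAIN,
            "-ar", "16000", "-ac", "1", dst]
```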
The application uses semantic versioning (vMAJOR.MINOR.PATCH) and automatically checks for updates.
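The vMAJOR.MINOR.PATCH comparison behind the update check can be sketched in a few lines (function names are illustrative, not the application's):

```python
def parse_version(tag: str) -> tuple[int, ...]:
    """Parse a tag like 'v1.4.2' into a comparable tuple (1, 4, 2)."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))

def update_available(current: str, latest: str) -> bool:
    """True when the latest release is strictly newer than the current one."""
    return parse_version(latest) > parse_version(current)
```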
Check your version at startup or via the API:

```bash
curl http://localhost:8000/api/version
```

- On Startup - Checks GitHub for new releases when app starts
- Update Banner - Green notification appears when new version available
- Release Information - Click "View Release" to see changelog and download
```bash
curl http://localhost:8000/api/updates/check
```

When a new version is available:
- View Release Notes - Click the banner link to see what's new
- Download Update - Download from GitHub releases page
- Stop Application - Close the web interface (server auto-shuts down)
- Replace Files - Extract new version over existing installation
- Restart - Launch application normally
The current version is stored in:
VoxtralApp/VERSION
All releases are published at: https://github.com/debrockb/transcribe-voxtral/releases
For educational and research purposes. Check Mistral AI's license terms for the Voxtral model on its HuggingFace Model Page.
For issues or questions:
- User Guide - See USER_GUIDE.md for detailed help
- API Docs - See API_DOCUMENTATION.md for technical details
- Test Docs - See tests/README.md for testing help
- Troubleshooting - Check error messages and logs in terminal
Powered by Mistral AI Voxtral and MLX.
Thank you for using Voxtral Transcription Application! 🎙️