A Flask-based web application for transcribing audio and video files using the Mistral AI Voxtral-Mini-3B model. Features a modern web interface with real-time progress updates, REST API, and WebSocket support, while leveraging local AI processing for privacy-focused, cost-free transcription.
- 🌐 Modern Web Interface - User-friendly drag-and-drop file upload
- ⚡ Real-Time Progress - Live updates via WebSocket during transcription
- 📊 Progress Tracking - Visual progress bar with chunk-by-chunk updates
- 📋 Easy Export - Copy to clipboard or download as text file
- 🎨 Responsive Design - Works on desktop and mobile browsers
- 🎵 Audio & Video Support - WAV, MP3, FLAC, M4A, MP4, AVI, MOV
- 🎬 FFmpeg Pre-Conversion - Converts complex formats to clean PCM WAV for reliable processing
- 🌍 30+ Languages - Multilingual support (English, French, Spanish, and more)
- 🗣️ Auto Language Detection - Automatically detects language per chunk using SpeechBrain
- 🔄 Smart Chunking - 90-second chunks optimized for multi-language content
- 🎚️ Audio Normalization - Automatic volume normalization for better recognition
- 🎯 High Accuracy - Powered by Mistral AI Voxtral-Mini-3B model
- 🎙️ Distant Speaker Enhancement - Optional FFmpeg filter chain for audio recorded 5+ meters from microphone
- 🚀 Device Auto-Detection - MPS (Apple Silicon), CUDA (NVIDIA GPU), or CPU
- 🔒 Privacy-Focused - All processing on local hardware, no cloud uploads
- 💰 No API Costs - Completely free to use
- ⚙️ Efficient Memory - Smart chunking prevents memory overflow
- 📊 Memory Monitoring - Real-time RAM usage warnings to prevent system slowdown
- 🔄 Auto-Updates - Automatic update notifications from GitHub releases
flowchart TD
A[Web Browser<br/>HTML/CSS/JS + Socket.IO] -->|HTTP/WebSocket| B[Flask Application<br/>app.py]
B --> C[REST API + WebSocket<br/>+ Model Selector]
C --> P[Audio Preprocessing<br/>FFmpeg Conversion]
P -->|Optional| E1[Distant Speaker<br/>Enhancement]
E1 --> D
E1 --> E
E1 --> F
P --> D[MLX Engine<br/>Mac M1+]
P --> E[Voxtral Engine<br/>CUDA/MPS/CPU]
P --> F[GGUF Backend<br/>Future]
D --> G[MLX Models<br/>3-25 GB<br/>Apple Silicon]
E --> H[PyTorch Models<br/>9-97 GB<br/>Any Platform]
F --> I[GGUF Models<br/>Future<br/>CPU/GPU]
| Model | Size | Platform | RAM Required | Load Time | Best For |
|---|---|---|---|---|---|
| MLX Mini 3B (4-bit) | 3.2 GB | Mac M1/M2/M3/M4 | 4-5 GB | < 1 min | ⭐ Recommended for Mac |
| MLX Mini 3B (8-bit) | 5.3 GB | Mac M1/M2/M3/M4 | 6-7 GB | < 2 min | Better quality on Mac |
| MLX Small 24B (8-bit) | 25 GB | Mac (64GB+ RAM) | 50-60 GB | 5-10 min | Best quality on Mac |
| Voxtral Mini 3B (Full) | 9.4 GB | Any (CUDA/MPS/CPU) | 20-30 GB | 10-30 min | Cross-platform |
| Voxtral Mini 3B (4-bit) | 9.4 GB | NVIDIA GPU | 5-8 GB | 2-5 min | NVIDIA GPU users |
| Voxtral Small 24B (Full) | 97 GB | GPU (55GB+ VRAM) | 55+ GB | N/A | Enterprise GPUs |
| Voxtral Small 24B (4-bit) | 97 GB | NVIDIA (16GB+) | 16-20 GB | 5-10 min | High-end NVIDIA |
Apple Silicon Mac (M1/M2/M3/M4):
- 8-16 GB RAM → MLX Mini 3B (4-bit) ⭐
- 16-32 GB RAM → MLX Mini 3B (8-bit)
- 64+ GB RAM → MLX Small 24B (8-bit)
NVIDIA GPU:
- 8 GB VRAM → Voxtral Mini 3B (4-bit)
- 16+ GB VRAM → Voxtral Small 24B (4-bit)
CPU Only (Windows/Linux):
- Voxtral Mini 3B (Full) - slow but works
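As a hedged illustration, the hardware-to-model guidance above can be expressed as a small helper. The function name and platform strings are ours, not part of the application:

```python
def recommend_model(platform: str, ram_gb: int = 0, vram_gb: int = 0) -> str:
    """Map hardware to a suggested model, per the recommendations above."""
    if platform == "apple_silicon":
        if ram_gb >= 64:
            return "MLX Small 24B (8-bit)"
        if ram_gb >= 16:
            return "MLX Mini 3B (8-bit)"
        return "MLX Mini 3B (4-bit)"   # 8-16 GB RAM, the recommended default
    if platform == "nvidia":
        if vram_gb >= 16:
            return "Voxtral Small 24B (4-bit)"
        return "Voxtral Mini 3B (4-bit)"
    return "Voxtral Mini 3B (Full)"    # CPU-only fallback: slow but works
```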
- Python 3.11 or later
- Operating System: macOS, Linux, or Windows
- Disk Space: 3-25 GB depending on model (see table above)
- RAM: 4-60 GB depending on model (see table above)
- Internet: Required for initial model download only
- FFmpeg - For audio/video pre-conversion (ensures reliable processing)
- macOS: `brew install ffmpeg`
- Windows: Download from ffmpeg.org or `choco install ffmpeg`
- Linux: `sudo apt install ffmpeg`
| Platform | First Time | After Setup |
|---|---|---|
| macOS | Double-click `Start Voxtral Web - Mac.command` | Same - just double-click |
| Windows | Double-click `Start Voxtral Web - Windows.bat` | Same - just double-click |
Both launchers will prompt you to run setup if needed on first use.
macOS/Linux:

```bash
cd transcribe-voxtral-main/VoxtralApp
./setup.sh      # First time only
./start_web.sh  # Start the app
```

Windows:

```batch
cd transcribe-voxtral-main\VoxtralApp
python -m venv voxtral_env && voxtral_env\Scripts\activate.bat && pip install -r requirements.txt
start_web.bat
```

- Open http://localhost:8000 in your browser
- Drag and drop an audio/video file
- Select the language
- Optional: Enable "Distant Speaker Enhancement" if speakers are far from the mic
- Click "Start Transcription"
- Copy or download your transcript when done
transcribe-voxtral-main/VoxtralApp/
├── app.py # Flask web application
├── transcription_engine.py # Core transcription logic
├── transcribe_voxtral.py # CLI script for batch processing
├── requirements.txt # Production dependencies
├── requirements-dev.txt # Development dependencies
├── static/
│ ├── css/ # Stylesheets
│ ├── js/ # JavaScript frontend
│ └── assets/ # Images and icons
├── templates/
│ └── index.html # Main web interface
├── tests/ # Comprehensive test suite
├── docs/
│ ├── API_DOCUMENTATION.md # API reference
│ └── USER_GUIDE.md # User manual
├── uploads/ # Temporary uploads (auto-cleanup)
├── transcriptions_voxtral_final/ # Saved transcripts
└── voxtral_env/ # Python virtual environment
Perfect for interactive use with real-time feedback:
```bash
cd transcribe-voxtral-main/VoxtralApp
./start_web.sh  # or start_web.bat on Windows
```

Access at http://localhost:8000
Features:
- Drag-and-drop file upload
- Live progress updates
- Visual feedback
- Copy/download transcripts
- Language selection
For automating multiple files or integration with scripts:
```bash
cd transcribe-voxtral-main/VoxtralApp
source voxtral_env/bin/activate
python transcribe_voxtral.py
```

Features:
- Batch processing of all audio files in a directory
- Headless operation
- Scriptable and automatable
- Lower memory overhead
The Voxtral model supports 30+ languages:
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| `en` | English | `fr` | French | `es` | Spanish |
| `de` | German | `it` | Italian | `pt` | Portuguese |
| `nl` | Dutch | `pl` | Polish | `ru` | Russian |
| `zh` | Chinese | `ja` | Japanese | `ko` | Korean |
| `ar` | Arabic | `hi` | Hindi | `tr` | Turkish |
| `sv` | Swedish | `da` | Danish | `no` | Norwegian |
| `fi` | Finnish | `cs` | Czech | `sk` | Slovak |
| `uk` | Ukrainian | `ro` | Romanian | `el` | Greek |
| `he` | Hebrew | `id` | Indonesian | `vi` | Vietnamese |
| `th` | Thai | `ms` | Malay | `ca` | Catalan |
The Flask application provides a REST API and WebSocket interface for programmatic access.
- `POST /api/upload` - Upload audio/video file
- `POST /api/transcribe` - Start transcription job
- `GET /api/status/<job_id>` - Get job status
- `GET /api/transcript/<job_id>` - Retrieve transcript
- `GET /api/transcript/<job_id>/download` - Download as file
- `GET /api/languages` - Get supported languages
- `GET /api/device-info` - Get device information
- `GET /api/system/memory` - Get current memory usage and status
- `GET /api/version` - Get current application version
- `GET /api/updates/check` - Check for available updates from GitHub
- `transcription_progress` - Real-time progress updates
- `transcription_complete` - Completion notification
- `transcription_error` - Error notifications
- `memory_warning` - Real-time memory usage warnings (80%+ RAM)
For complete API reference, see VoxtralApp/docs/API_DOCUMENTATION.md
Edit `app.py` to configure:

```python
MAX_FILE_SIZE = 500 * 1024 * 1024  # 500MB (line 26)
UPLOAD_FOLDER = BASE_DIR / "uploads"  # Upload directory (line 23)
OUTPUT_FOLDER = BASE_DIR / "transcriptions_voxtral_final"  # Output (line 24)
```

Edit `transcription_engine.py` to configure:

```python
chunk_duration_s: int = 2 * 60  # Chunk size in seconds (line 150)
sample_rate: int = 16000  # Audio sample rate (line 151)
```

Edit `transcribe_voxtral.py` to configure batch processing:

```python
INPUT_DIRECTORY = "."  # Where to find audio files
OUTPUT_SUBFOLDER_NAME = "transcriptions_voxtral_final"  # Output folder
```

The application automatically detects and uses the best available hardware:
| Device | Speed | Example (10 min audio) |
|---|---|---|
| Apple M1/M2/M3 (MPS) | ~1-2x realtime | 5-10 min processing |
| NVIDIA GPU (CUDA) | ~1-3x realtime | 3-10 min processing |
| CPU (Fallback) | ~0.1-0.5x realtime | 20-100 min processing |
Note: Actual speed varies with audio quality, language, and specific hardware.
MPS (Apple Silicon)
- M1, M2, M3, M4 chips
- Uses `bfloat16` precision
- Automatic cache clearing
- Fastest on Apple devices

CUDA (NVIDIA GPUs)
- Requires CUDA-compatible GPU
- Uses `bfloat16` precision
- Requires CUDA toolkit

CPU (Universal)
- Works on all systems
- Uses `float32` precision
- Slower but reliable
The application includes a comprehensive test suite with pytest.
```bash
cd transcribe-voxtral-main/VoxtralApp

# Activate test environment
source test_venv/bin/activate

# Run all tests (excluding model/GPU tests)
export TESTING=1
pytest tests/ -v -m "not requires_model and not requires_gpu and not slow"

# Run specific test categories
pytest tests/test_api.py -v          # API tests
pytest tests/test_integration.py -v  # Integration tests
pytest tests/ -v -m unit             # Unit tests only
```

Test markers:
- `unit` - Unit tests for individual components
- `api` - API endpoint tests
- `integration` - Integration tests
- `slow` - Long-running tests
- `requires_model` - Tests needing the ML model (skipped in CI)
- `requires_gpu` - Tests requiring GPU (skipped in CI)
- `cross_platform` - Platform compatibility tests
For more details, see VoxtralApp/tests/README.md
Model: Mistral AI Voxtral-Mini-3B-2507
- Type: Conditional Generation (Audio-to-Text)
- Size: ~20GB download
- Architecture: Transformer-based encoder-decoder
- Sample Rate: 16kHz (automatically resampled)
- License: HuggingFace Model Page
The model is downloaded automatically on first use:
- Download Size: ~20GB
- Download Time: 10-60 minutes (depends on internet speed)
- Cache Location: `~/.cache/huggingface/hub/`
- Redownload: Not needed - model is cached locally
- Model loads from cache in 10-30 seconds
- No internet connection required
- Processing starts immediately
Check Python version:

```bash
python --version  # Should be 3.11+
```

Reinstall dependencies:

```bash
pip install -r requirements.txt --force-reinstall
```

Solutions:
- Verify server is running (check terminal output)
- Try http://127.0.0.1:8000 instead
- Check if port 8000 is in use
- Check firewall settings
Solutions:
- Ensure 20GB+ free disk space
- Check internet connection
- Check firewall/proxy settings
- Downloads resume automatically - try again
Solutions:
- Close other applications
- Reduce chunk size in `transcription_engine.py`
- Process shorter files
- Restart your computer
Solutions:
- Verify correct language selected
- Use high-quality audio (minimal background noise)
- Ensure adequate audio volume
- Try with clear speech examples first
Solutions:
- Install FFmpeg (see requirements)
- Install moviepy: `pip install moviepy`
- Convert video manually using FFmpeg
- Try different video format
For detailed troubleshooting, see VoxtralApp/docs/USER_GUIDE.md
Format code:

```bash
cd transcribe-voxtral-main/VoxtralApp
source test_venv/bin/activate

# Auto-format with black
black app.py transcription_engine.py transcribe_voxtral.py tests/*.py

# Sort imports
isort app.py transcription_engine.py transcribe_voxtral.py tests/*.py --skip test_venv
```

Lint code:

```bash
flake8 . --config=.flake8
```

- Run tests before committing
- Follow code style (black, isort)
- Add tests for new features
- Update documentation
- User Guide - Complete user manual
- API Documentation - API reference
- Test Documentation - Testing guide
- torch - Deep learning framework
- torchaudio - Audio processing for PyTorch
- transformers - HuggingFace model interface
- librosa - Audio processing and normalization
- soundfile - Audio file I/O
- speechbrain - Automatic language detection
- Flask - Web framework
- Flask-SocketIO - Real-time WebSocket support
- mistral-common - Mistral AI utilities
- psutil - System memory monitoring
- pytest - Testing framework
- black - Code formatting
- isort - Import sorting
- flake8 - Linting
See requirements.txt and requirements-dev.txt for complete list.
- ✅ 100% Local Processing - No cloud uploads
- ✅ No Data Collection - No analytics or tracking
- ✅ Open Source - Fully auditable code
- ✅ No Account Required - Use immediately
- ✅ Automatic Cleanup - Temporary files deleted after processing
- Test First - Start with a short audio clip to verify setup
- Correct Language - Always select the spoken language
- Quality Audio - Use clear recordings for best results
- Adequate Storage - Ensure 20GB+ free for model + files
- Monitor Progress - Watch first transcription to verify quality
- Save Transcripts - Download/copy before closing browser
The application includes intelligent memory monitoring to prevent system slowdown:
- Normal (< 80% RAM) - No warnings, optimal performance
- Warning (80-90% RAM) - Yellow banner appears, transcription continues
- Critical (> 90% RAM) - Red banner appears, consider stopping transcription
The transcription engine automatically adjusts based on available memory:
- < 2GB available - Uses 60-second chunks (reduced memory footprint)
- 2-4GB available - Uses 90-second chunks (balanced performance)
- > 4GB available - Uses 120-second chunks (optimal performance)
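The threshold mapping above reduces to a few lines; this is a sketch with an illustrative function name (the real engine reads available memory via `psutil`):

```python
def pick_chunk_seconds(available_gb: float) -> int:
    """Map available RAM to chunk length, per the tiers above."""
    if available_gb < 2:
        return 60    # reduced memory footprint
    if available_gb <= 4:
        return 90    # balanced performance
    return 120       # optimal performance
```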
- Real-time Monitoring - Memory status checked every 15 seconds
- WebSocket Alerts - Instant warnings when RAM usage exceeds thresholds
- Automatic Cleanup - Garbage collection after each audio chunk
- Device Cache Clearing - MPS/CUDA caches cleared between chunks
- Visual Banners - Clear on-screen warnings with RAM percentage
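The per-chunk cleanup described above (garbage collection plus device-cache clearing) might look roughly like this sketch; the import is guarded so it also runs on machines without PyTorch:

```python
import gc
import importlib.util

def cleanup_after_chunk() -> None:
    """Free per-chunk memory: collect garbage, then clear device caches."""
    gc.collect()
    if importlib.util.find_spec("torch") is None:
        return  # nothing else to clear without PyTorch
    import torch
    if torch.backends.mps.is_available():
        torch.mps.empty_cache()   # clear the MPS cache between chunks
    elif torch.cuda.is_available():
        torch.cuda.empty_cache()  # clear the CUDA cache between chunks
```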
- Close Other Applications - Free up RAM before transcribing
- Monitor Banners - Watch for memory warnings during processing
- Restart if Needed - Stop transcription if critical warning appears
- Chunk Size - System automatically adjusts based on available RAM
For recordings where speakers are far from the microphone (5+ meters), enable the Distant Speaker Enhancement checkbox before transcribing.
- Conference room recordings with distant speakers
- Lectures captured from the back of a room
- Interviews with inconsistent microphone distances
- Any recording where speech sounds quiet or muddy
The enhancement applies a sophisticated FFmpeg audio filter chain:
| Filter | Purpose |
|---|---|
| High-pass (80Hz) | Removes low-frequency rumble (air conditioning, traffic) |
| Low-pass (8kHz) | Removes high-frequency hiss |
| Compand | Dynamic compression to bring up quiet speech |
| EQ Boost | Boosts voice frequencies (300Hz, 1kHz, 2.5kHz, 3.5kHz) |
| Loudnorm (-14 LUFS) | Normalizes loudness to broadcast standard |
For complete technical details, see Audio Enhancement Documentation.
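To show how the filters in the table combine, here is an illustrative sketch that assembles a comparable FFmpeg filter chain; the parameter values are examples, not the application's exact settings:

```python
# Illustrative filter chain mirroring the table above; values are examples.
FILTER_CHAIN = ",".join([
    "highpass=f=80",                  # remove low-frequency rumble
    "lowpass=f=8000",                 # remove high-frequency hiss
    "compand=attacks=0.3:points=-70/-60|-30/-10|0/-3",  # lift quiet speech
    "equalizer=f=1000:t=q:w=1:g=3",   # boost a voice frequency band
    "loudnorm=I=-14",                 # normalize to -14 LUFS
])

def build_ffmpeg_cmd(src: str, dst: str) -> list[str]:
    """Assemble an ffmpeg command applying the enhancement chain."""
    return ["ffmpeg", "-i", src, "-af", FILTER_CHAIN,
            "-ar", "16000", "-ac", "1", dst]
```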
The application uses semantic versioning (vMAJOR.MINOR.PATCH) and automatically checks for updates.
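The vMAJOR.MINOR.PATCH comparison behind the update check can be sketched in a few lines (function names are illustrative, not the application's):

```python
def parse_version(tag: str) -> tuple[int, ...]:
    """Parse a tag like 'v1.4.2' into a comparable tuple (1, 4, 2)."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))

def update_available(current: str, latest: str) -> bool:
    """True when the latest release is strictly newer than the current one."""
    return parse_version(latest) > parse_version(current)
```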
Check your version at startup or via the API:

```bash
curl http://localhost:8000/api/version
```

- On Startup - Checks GitHub for new releases when app starts
- Update Banner - Green notification appears when new version available
- Release Information - Click "View Release" to see changelog and download
```bash
curl http://localhost:8000/api/updates/check
```

When a new version is available:
- View Release Notes - Click the banner link to see what's new
- Download Update - Download from GitHub releases page
- Stop Application - Close the web interface (server auto-shuts down)
- Replace Files - Extract new version over existing installation
- Restart - Launch application normally
The current version is stored in:
VoxtralApp/VERSION
All releases are published at: https://github.com/debrockb/transcribe-voxtral/releases
For educational and research purposes. Check Mistral AI's license terms for the Voxtral model on its HuggingFace Model Page.
For issues or questions:
- User Guide - See USER_GUIDE.md for detailed help
- API Docs - See API_DOCUMENTATION.md for technical details
- Test Docs - See tests/README.md for testing help
- Troubleshooting - Check error messages and logs in terminal
Powered by Mistral AI Voxtral and MLX.
Thank you for using Voxtral Transcription Application! 🎙️