A Streamlit web application for transcribing audio files to text using a local Whisper model. All processing happens on your machine; after the one-time model download, the app works completely offline and sends no data to external servers.
- Upload one or multiple audio files for transcription
- Support for various audio formats (MP3, WAV, M4A, OGG, FLAC, AAC)
- Speaker diarization to identify different speakers in the audio
- Language selection (English or Spanish)
- Multiple model size options to balance speed and accuracy
- Download transcription results as text files
- Completely local processing - no data sent to external servers
- Detailed metadata for each transcription (processing time, detected language)
- Installation
- Usage
- Project Structure
- Advanced Usage
- Speaker Diarization
- Model Information
- Requirements
- Performance Considerations
- Contributing
- License
- Clone this repository or download the files:

  ```bash
  git clone https://github.com/yourusername/Transcribe_UI.git
  cd Transcribe_UI
  ```

- Create a virtual environment (recommended):

  ```bash
  # Using venv
  python -m venv venv

  # On Windows
  venv\Scripts\activate

  # On macOS/Linux
  source venv/bin/activate
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install FFmpeg (required for audio processing):
  - Windows: Download a build from the FFmpeg website and add it to your PATH
  - Linux: `sudo apt install ffmpeg`
  - macOS: `brew install ffmpeg`
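You can confirm FFmpeg is available on your PATH before launching the app:

```bash
ffmpeg -version
```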
- Run the Streamlit app:

  ```bash
  streamlit run app.py
  ```

- The app will open in your web browser at `http://localhost:8501`
- Select model settings in the sidebar:
  - Model size (tiny, base, small, medium, large-v2)
  - Language (English or Spanish)
  - Compute device (CPU or CUDA for GPU acceleration)
- Upload one or more audio files using the file uploader
- Click "Transcribe" for each file you want to process
- View and download the transcription results
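As a rough illustration of how these steps map to code, a minimal Streamlit skeleton might look like the sketch below. This is not the actual app.py (which delegates to helpers in utils/ and adds metadata and speaker diarization); the widget labels and the temporary-file handling are illustrative assumptions.

```python
# Illustrative skeleton only; the real app.py is organized around utils/.
import os
import tempfile

import streamlit as st
from faster_whisper import WhisperModel

model_size = st.sidebar.selectbox("Model size", ["tiny", "base", "small", "medium", "large-v2"])
language = st.sidebar.selectbox("Language", ["en", "es"])
device = st.sidebar.radio("Compute device", ["cpu", "cuda"])

files = st.file_uploader(
    "Upload audio files",
    type=["mp3", "wav", "m4a", "ogg", "flac", "aac"],
    accept_multiple_files=True,
)

for audio in files or []:
    if st.button(f"Transcribe {audio.name}"):
        # Persist the upload so the audio decoder can read it from disk.
        with tempfile.NamedTemporaryFile(suffix=os.path.splitext(audio.name)[1], delete=False) as tmp:
            tmp.write(audio.getvalue())
        model = WhisperModel(model_size, device=device)
        segments, _ = model.transcribe(tmp.name, language=language)
        text = "".join(segment.text for segment in segments)
        st.text_area(f"Transcription of {audio.name}", text)
        st.download_button("Download as .txt", text, file_name=f"{audio.name}.txt")
        os.remove(tmp.name)
```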
```
Transcribe_UI/
├── app.py                  # Main application file
├── requirements.txt        # Python dependencies
├── README.md               # Project documentation
├── utils/                  # Utility functions
│   ├── __init__.py         # Package initialization
│   ├── audio.py            # Audio processing utilities
│   ├── model.py            # Model loading and transcription
│   └── ui.py               # UI components and helpers
├── docs/                   # Documentation assets
│   └── app_screenshot.png  # Application screenshot
└── tests/                  # Unit tests
    └── test_utils.py       # Tests for utility functions
```
For batch processing of multiple files, simply upload all files at once and click the transcribe button for each file. Results can be downloaded individually.
While the UI currently supports English and Spanish, the underlying Whisper model supports many more languages. Advanced users can modify the language selection options in the code.
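For example, if the options live in a simple name-to-code mapping (the variable name and location below are assumptions, not the app's actual code), adding a language is a one-line change, since Whisper uses ISO 639-1 codes:

```python
# Hypothetical language map; the real variable name and location may differ.
LANGUAGES = {
    "English": "en",
    "Spanish": "es",
    "French": "fr",   # additional Whisper-supported languages
    "German": "de",
}
```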
The transcription output can easily be integrated with other NLP tools for further processing (a short example follows this list):
- Text summarization
- Sentiment analysis
- Translation
- Named entity recognition
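For instance, a downloaded transcript can be fed straight into an NLP library. The sketch below assumes spaCy and its en_core_web_sm model are installed, and my_transcript.txt is a placeholder for a downloaded result:

```python
# Named entity recognition on a saved transcript.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

with open("my_transcript.txt", encoding="utf-8") as f:
    doc = nlp(f.read())

for ent in doc.ents:
    print(ent.text, ent.label_)
```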
The app includes speaker diarization capabilities to identify different speakers in your audio files. The process works in three stages (a code sketch follows the list):
- Audio Feature Extraction: The system extracts MFCC (Mel-frequency cepstral coefficients) features from the audio.
- Speaker Clustering: Using spectral clustering algorithms, the system identifies distinct speakers based on audio characteristics.
- Segment Labeling: Each segment of the transcription is labeled with a speaker identifier (e.g., SPEAKER_1, SPEAKER_2).
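A simplified sketch of that pipeline is shown below. It is not the app's exact implementation (which lives in utils/); the function name, the fixed 16 kHz sample rate, and the per-segment mean-MFCC features are illustrative choices.

```python
# Illustrative diarization sketch: MFCC features + spectral clustering.
import librosa
import numpy as np
from sklearn.cluster import SpectralClustering

def label_speakers(audio_path, segments, n_speakers=2):
    """segments: list of (start, end) times in seconds from the transcription."""
    y, sr = librosa.load(audio_path, sr=16000)

    # One MFCC-based feature vector per transcription segment.
    features = []
    for start, end in segments:
        clip = y[int(start * sr):int(end * sr)]
        mfcc = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=20)
        features.append(mfcc.mean(axis=1))

    # Group segments with similar voice characteristics.
    clustering = SpectralClustering(n_clusters=n_speakers, random_state=0)
    labels = clustering.fit_predict(np.array(features))

    return [f"SPEAKER_{label + 1}" for label in labels]
```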
- FFmpeg: Required for audio processing. If FFmpeg is not installed, the app will attempt to use PyDub as a fallback.
- SoundFile: Used for reliable audio loading.
- scikit-learn: Used for the clustering algorithms.
- Number of Speakers: You can specify the expected number of speakers or let the system estimate it automatically.
- Minimum Segment Duration: Controls the minimum length of a speaker segment (default: 1.0 seconds).
- Speaker diarization works best with clear audio and distinct speakers
- Background noise can affect the accuracy of speaker identification
- Very short speaker turns may not be accurately identified
- The system assigns generic labels (SPEAKER_1, SPEAKER_2) rather than identifying actual individuals
This app uses the faster-whisper implementation of OpenAI's Whisper model, which offers the following (a minimal usage example appears after this list):
- Improved performance over the original Whisper implementation
- Reduced memory usage
- Support for CPU and GPU acceleration
- Multiple model sizes to balance accuracy and speed
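For reference, a minimal faster-whisper call (independent of this app's UI) looks like this; example.mp3 is a placeholder file name:

```python
from faster_whisper import WhisperModel

# "base" keeps the first-time download small; int8 is a good CPU default.
model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe("example.mp3", language="en")
print(f"Detected language: {info.language} (probability {info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```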
Model size comparison:
| Model Size | Parameters | Relative Speed | Memory Usage | Accuracy |
|---|---|---|---|---|
| tiny | 39M | Very Fast | Low | Basic |
| base | 74M | Fast | Low | Good |
| small | 244M | Medium | Medium | Better |
| medium | 769M | Slow | High | Great |
| large-v2 | 1550M | Very Slow | Very High | Best |
- Python 3.7+
- FFmpeg (for audio processing and diarization)
- Streamlit
- faster-whisper
- NumPy
- PyDub
- SoundFile (for audio processing)
- scikit-learn (for speaker diarization)
- librosa (for audio feature extraction)
See requirements.txt for specific version requirements.
- Larger models provide better accuracy but require more memory and processing power
- GPU acceleration (CUDA) significantly improves performance but requires a compatible NVIDIA GPU (see the sketch after this list)
- First-time model usage will download the model files (one-time process)
- Processing long audio files (>10 minutes) may take significant time with larger models
- Consider using the 'tiny' or 'base' model for quick testing before using larger models
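As a concrete illustration of the CPU/GPU trade-off (an illustrative heuristic, not the app's own logic), faster-whisper exposes a compute_type parameter; int8 quantization on CPU and float16 on a CUDA GPU are common defaults:

```python
from faster_whisper import WhisperModel

def load_model(model_size: str = "base", device: str = "cpu") -> WhisperModel:
    # Quantize on CPU for speed and memory; use half precision on a CUDA GPU.
    compute_type = "float16" if device == "cuda" else "int8"
    return WhisperModel(model_size, device=device, compute_type=compute_type)
```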
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.