A Streamlit web application for transcribing audio files to text using a local Whisper model. All processing happens on your machine; after the one-time model download, the app works completely offline and sends no data to external servers.
- Upload one or multiple audio files for transcription
- Support for various audio formats (MP3, WAV, M4A, OGG, FLAC, AAC)
- Speaker diarization to identify different speakers in the audio
- Language selection (English or Spanish)
- Multiple model size options to balance speed and accuracy
- Download transcription results as text files
- Completely local processing - no data sent to external servers
- Detailed metadata for each transcription (processing time, detected language)
- Installation
- Usage
- Project Structure
- Advanced Usage
- Speaker Diarization
- Model Information
- Requirements
- Performance Considerations
- Contributing
- License
- Clone this repository or download the files:

  ```bash
  git clone https://github.com/yourusername/Transcribe_UI.git
  cd Transcribe_UI
  ```

- Create a virtual environment (recommended):

  ```bash
  # Using venv
  python -m venv venv

  # On Windows
  venv\Scripts\activate

  # On macOS/Linux
  source venv/bin/activate
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install FFmpeg (required for audio processing):
  - Windows: Download a build from the FFmpeg website and add it to your PATH
  - Linux: `sudo apt install ffmpeg`
  - macOS: `brew install ffmpeg`
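You can confirm FFmpeg is available on your PATH before launching the app:

```bash
ffmpeg -version
```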
- Run the Streamlit app:

  ```bash
  streamlit run app.py
  ```

- The app will open in your web browser at `http://localhost:8501`
- Select model settings in the sidebar:
  - Model size (tiny, base, small, medium, large-v2)
  - Language (English or Spanish)
  - Compute device (CPU or CUDA for GPU acceleration)
- Upload one or more audio files using the file uploader
- Click "Transcribe" for each file you want to process
- View and download the transcription results
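As a rough illustration of how these steps map to code, a minimal Streamlit skeleton might look like the sketch below. This is not the actual app.py (which delegates to helpers in utils/ and adds metadata and speaker diarization); the widget labels and the temporary-file handling are illustrative assumptions.

```python
# Illustrative skeleton only; the real app.py is organized around utils/.
import os
import tempfile

import streamlit as st
from faster_whisper import WhisperModel

model_size = st.sidebar.selectbox("Model size", ["tiny", "base", "small", "medium", "large-v2"])
language = st.sidebar.selectbox("Language", ["en", "es"])
device = st.sidebar.radio("Compute device", ["cpu", "cuda"])

files = st.file_uploader(
    "Upload audio files",
    type=["mp3", "wav", "m4a", "ogg", "flac", "aac"],
    accept_multiple_files=True,
)

for audio in files or []:
    if st.button(f"Transcribe {audio.name}"):
        # Persist the upload so the audio decoder can read it from disk.
        with tempfile.NamedTemporaryFile(suffix=os.path.splitext(audio.name)[1], delete=False) as tmp:
            tmp.write(audio.getvalue())
        model = WhisperModel(model_size, device=device)
        segments, _ = model.transcribe(tmp.name, language=language)
        text = "".join(segment.text for segment in segments)
        st.text_area(f"Transcription of {audio.name}", text)
        st.download_button("Download as .txt", text, file_name=f"{audio.name}.txt")
        os.remove(tmp.name)
```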
```
Transcribe_UI/
├── app.py                  # Main application file
├── requirements.txt        # Python dependencies
├── README.md               # Project documentation
├── utils/                  # Utility functions
│   ├── __init__.py         # Package initialization
│   ├── audio.py            # Audio processing utilities
│   ├── model.py            # Model loading and transcription
│   └── ui.py               # UI components and helpers
├── docs/                   # Documentation assets
│   └── app_screenshot.png  # Application screenshot
└── tests/                  # Unit tests
    └── test_utils.py       # Tests for utility functions
```
For batch processing of multiple files, simply upload all files at once and click the transcribe button for each file. Results can be downloaded individually.
While the UI currently supports English and Spanish, the underlying Whisper model supports many more languages. Advanced users can modify the language selection options in the code.
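For example, if the options live in a simple name-to-code mapping (the variable name and location below are assumptions, not the app's actual code), adding a language is a one-line change, since Whisper uses ISO 639-1 codes:

```python
# Hypothetical language map; the real variable name and location may differ.
LANGUAGES = {
    "English": "en",
    "Spanish": "es",
    "French": "fr",   # additional Whisper-supported languages
    "German": "de",
}
```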
The transcription output can easily be integrated with other NLP tools for further processing (a short example follows this list):
- Text summarization
- Sentiment analysis
- Translation
- Named entity recognition
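For instance, a downloaded transcript can be fed straight into an NLP library. The sketch below assumes spaCy and its en_core_web_sm model are installed, and my_transcript.txt is a placeholder for a downloaded result:

```python
# Named entity recognition on a saved transcript.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

with open("my_transcript.txt", encoding="utf-8") as f:
    doc = nlp(f.read())

for ent in doc.ents:
    print(ent.text, ent.label_)
```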
The app includes speaker diarization capabilities to identify different speakers in your audio files. The process works in three stages (a code sketch follows the list):
- Audio Feature Extraction: The system extracts MFCC (Mel-frequency cepstral coefficients) features from the audio.
- Speaker Clustering: Using spectral clustering algorithms, the system identifies distinct speakers based on audio characteristics.
- Segment Labeling: Each segment of the transcription is labeled with a speaker identifier (e.g., SPEAKER_1, SPEAKER_2).
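A simplified sketch of that pipeline is shown below. It is not the app's exact implementation (which lives in utils/); the function name, the fixed 16 kHz sample rate, and the per-segment mean-MFCC features are illustrative choices.

```python
# Illustrative diarization sketch: MFCC features + spectral clustering.
import librosa
import numpy as np
from sklearn.cluster import SpectralClustering

def label_speakers(audio_path, segments, n_speakers=2):
    """segments: list of (start, end) times in seconds from the transcription."""
    y, sr = librosa.load(audio_path, sr=16000)

    # One MFCC-based feature vector per transcription segment.
    features = []
    for start, end in segments:
        clip = y[int(start * sr):int(end * sr)]
        mfcc = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=20)
        features.append(mfcc.mean(axis=1))

    # Group segments with similar voice characteristics.
    clustering = SpectralClustering(n_clusters=n_speakers, random_state=0)
    labels = clustering.fit_predict(np.array(features))

    return [f"SPEAKER_{label + 1}" for label in labels]
```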
- FFmpeg: Required for audio processing. If FFmpeg is not installed, the app will attempt to use PyDub as a fallback.
- SoundFile: Used for reliable audio loading.
- scikit-learn: Used for the clustering algorithms.
- Number of Speakers: You can specify the expected number of speakers or let the system estimate it automatically.
- Minimum Segment Duration: Controls the minimum length of a speaker segment (default: 1.0 seconds).
- Speaker diarization works best with clear audio and distinct speakers
- Background noise can affect the accuracy of speaker identification
- Very short speaker turns may not be accurately identified
- The system assigns generic labels (SPEAKER_1, SPEAKER_2) rather than identifying actual individuals
This app uses the faster-whisper implementation of OpenAI's Whisper model, which offers the following (a minimal usage example appears after this list):
- Improved performance over the original Whisper implementation
- Reduced memory usage
- Support for CPU and GPU acceleration
- Multiple model sizes to balance accuracy and speed
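For reference, a minimal faster-whisper call (independent of this app's UI) looks like this; example.mp3 is a placeholder file name:

```python
from faster_whisper import WhisperModel

# "base" keeps the first-time download small; int8 is a good CPU default.
model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe("example.mp3", language="en")
print(f"Detected language: {info.language} (probability {info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```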
Model size comparison:
| Model Size | Parameters | Relative Speed | Memory Usage | Accuracy |
|---|---|---|---|---|
| tiny | 39M | Very Fast | Low | Basic |
| base | 74M | Fast | Low | Good |
| small | 244M | Medium | Medium | Better |
| medium | 769M | Slow | High | Great |
| large-v2 | 1550M | Very Slow | Very High | Best |
- Python 3.7+
- FFmpeg (for audio processing and diarization)
- Streamlit
- faster-whisper
- NumPy
- PyDub
- SoundFile (for audio processing)
- scikit-learn (for speaker diarization)
- librosa (for audio feature extraction)
See requirements.txt for specific version requirements.
- Larger models provide better accuracy but require more memory and processing power
- GPU acceleration (CUDA) significantly improves performance but requires a compatible NVIDIA GPU (see the sketch after this list)
- First-time model usage will download the model files (one-time process)
- Processing long audio files (>10 minutes) may take significant time with larger models
- Consider using the 'tiny' or 'base' model for quick testing before using larger models
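As a concrete illustration of the CPU/GPU trade-off (an illustrative heuristic, not the app's own logic), faster-whisper exposes a compute_type parameter; int8 quantization on CPU and float16 on a CUDA GPU are common defaults:

```python
from faster_whisper import WhisperModel

def load_model(model_size: str = "base", device: str = "cpu") -> WhisperModel:
    # Quantize on CPU for speed and memory; use half precision on a CUDA GPU.
    compute_type = "float16" if device == "cuda" else "int8"
    return WhisperModel(model_size, device=device, compute_type=compute_type)
```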
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.