A web application for splitting bilingual sermons into two clean language-only audio tracks.
Interpret allows users to upload MP3 files containing bilingual audio (e.g., sermons with interpretation) and automatically separates them into two clean audio tracks, one for each language. It uses AI-powered speaker diarization via pyannote.audio to identify and separate speakers.
- Upload: User drops MP3 file, browser converts to base64
- Process: Base64 sent directly to Modal GPU endpoint
- Diarize: pyannote.audio identifies the two speakers
- Separate: Audio segments are grouped by speaker (the speaker with more total talk time becomes Track 1)
- Return: Two base64-encoded MP3s returned to browser
- Download: Browser decodes and offers file downloads
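In the app this round trip happens in the browser, but the exchange is simple enough to sketch in Python. The `audio_base64`, `track1`, and `track2` field names here are assumptions for illustration, not the documented API:

```python
import base64
import requests

ENDPOINT = "https://your-modal-endpoint.modal.run"  # value of NEXT_PUBLIC_MODAL_ENDPOINT

# Encode the MP3 as base64 and send it straight to the Modal endpoint.
with open("sermon.mp3", "rb") as f:
    payload = {"audio_base64": base64.b64encode(f.read()).decode()}

resp = requests.post(ENDPOINT, json=payload, timeout=600)
resp.raise_for_status()

# Decode the two base64-encoded MP3s returned by the service.
tracks = resp.json()
for name in ("track1", "track2"):
    with open(f"{name}.mp3", "wb") as out:
        out.write(base64.b64decode(tracks[name]))
```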
The core separation happens in `run-service/modal_app.py`:

**1. Audio Preprocessing**
- Load MP3 with torchaudio
- Convert stereo to mono (average channels)
- Resample to 16kHz (required by pyannote)
- Peak normalize to [-1, 1] range
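A minimal sketch of these steps with torchaudio (the function name and structure are illustrative, not the exact code in `modal_app.py`):

```python
import torch
import torchaudio

def preprocess(path: str) -> torch.Tensor:
    """Load an MP3 and return mono, 16 kHz, peak-normalized samples."""
    waveform, sample_rate = torchaudio.load(path)  # shape: (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)  # stereo -> mono by averaging channels
    if sample_rate != 16_000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
    peak = waveform.abs().max()
    if peak > 0:
        waveform = waveform / peak  # peak normalize to [-1, 1]
    return waveform
```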
**2. Speaker Diarization (pyannote.audio)**
- Neural network identifies "who spoke when"
- Forces exactly 2 speaker clusters (`num_speakers=2`)
- Outputs timestamped segments: `[(0.5s, 3.2s, SPEAKER_00), (3.2s, 8.1s, SPEAKER_01), ...]`
- Uses FP16 mixed precision and a batch size of 64 for GPU optimization
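A sketch of the core diarization call, assuming the `pyannote/speaker-diarization-3.1` checkpoint (the dependency list below pins pyannote.audio 3.1) and the HuggingFace token from the Modal secret:

```python
import os
import torch
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HUGGING_FACE_TOKEN"],
)
pipeline.to(torch.device("cuda"))

# Force exactly two speaker clusters.
diarization = pipeline("sermon.mp3", num_speakers=2)

# Each segment says who spoke when.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s to {turn.end:.1f}s: {speaker}")
```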
**3. Speaker-to-Track Assignment**
- Calculate total speaking duration per speaker
- Speaker with more total time becomes Track 1
- Assumes both languages have roughly equal content
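Continuing the sketch above, the assignment reduces to a duration tally:

```python
from collections import defaultdict

# Total speaking time per speaker label.
durations: dict[str, float] = defaultdict(float)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    durations[speaker] += turn.end - turn.start

# The speaker with more total talk time becomes Track 1.
track1_speaker = max(durations, key=durations.get)
```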
**4. Track Building**
- Iterate through diarization segments chronologically
- Slice the audio array for each segment: `audio[start:end]`
- Concatenate all segments per speaker into continuous tracks
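Sketched with NumPy, reusing `waveform` from the preprocessing step and `diarization` from the pipeline call:

```python
import numpy as np
from collections import defaultdict

SAMPLE_RATE = 16_000
audio = waveform.squeeze(0).numpy()  # 1-D float32 array at 16 kHz

# Collect each speaker's slices in chronological order, then concatenate.
chunks: dict[str, list] = defaultdict(list)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    start, end = int(turn.start * SAMPLE_RATE), int(turn.end * SAMPLE_RATE)
    chunks[speaker].append(audio[start:end])

tracks = {speaker: np.concatenate(parts) for speaker, parts in chunks.items()}
```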
**5. MP3 Export**
- Convert float32 samples to int16
- Export via pydub at 128kbps
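The conversion and export fit in one small helper; this is a hypothetical sketch of the step, not the actual code:

```python
import io
import numpy as np
from pydub import AudioSegment

def to_mp3_bytes(samples: np.ndarray, sample_rate: int = 16_000) -> bytes:
    """Convert float32 samples in [-1, 1] to 128 kbps MP3 bytes."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)  # float32 -> int16
    segment = AudioSegment(
        pcm.tobytes(),
        frame_rate=sample_rate,
        sample_width=2,  # int16 = 2 bytes
        channels=1,
    )
    buf = io.BytesIO()
    segment.export(buf, format="mp3", bitrate="128k")
    return buf.getvalue()
```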
**Important:** The pipeline separates by voice identity, not by language detection. It assumes the two speakers are speaking different languages (e.g., original speaker + interpreter).
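Tying steps 1-5 together, the service plausibly exposes a single POST endpoint on Modal. This is a sketch, not the actual `modal_app.py`: the `run_separation` wrapper and JSON field names are hypothetical, and the decorators follow Modal's documented pattern:

```python
import base64
import modal

app = modal.App("interpret")

@app.function(gpu="L4", secrets=[modal.Secret.from_name("huggingface")])
@modal.fastapi_endpoint(method="POST")
def separate(body: dict) -> dict:
    # Decode the uploaded MP3, run the pipeline above, return two MP3s.
    mp3_bytes = base64.b64decode(body["audio_base64"])
    track1_mp3, track2_mp3 = run_separation(mp3_bytes)  # hypothetical wrapper over steps 1-5
    return {
        "track1": base64.b64encode(track1_mp3).decode(),
        "track2": base64.b64encode(track2_mp3).decode(),
    }
```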
- Frontend: Next.js 16 with React 19, Tailwind CSS v4
- GPU Processing: Modal serverless GPU (L4) with pyannote.audio speaker diarization
- Communication: Direct client-to-Modal API
- Node.js 18+
- npm or yarn
- Modal account (for GPU processing)
- HuggingFace account with access to pyannote models
**1. Install frontend dependencies:**

```bash
npm install
```
**2. Configure environment:**

```bash
cp .env.example .env.local
```

Fill in your Modal endpoint URL after deployment.
**3. Deploy the Modal service:**

```bash
cd run-service
modal secret create huggingface HUGGING_FACE_TOKEN=hf_your_token
modal deploy modal_app.py
```

Copy the web endpoint URL into your `.env.local`.
**4. Set up YouTube cookies** (required for YouTube downloads):
YouTube's bot detection now requires authenticated requests for downloads. Export your browser cookies:
**Option A: Browser extension (recommended)**
- Install "Get cookies.txt LOCALLY" extension (Chrome / Firefox)
- Go to youtube.com and sign in
- Click the extension icon → Export cookies
- Save the content
**Option B: yt-dlp**
```bash
yt-dlp --cookies-from-browser chrome --cookies cookies.txt https://youtube.com
cat cookies.txt  # copy the content
```

Create the Modal secret:

```bash
modal secret create youtube-cookies YOUTUBE_COOKIES="$(cat cookies.txt)"
```

Or manually:

```bash
modal secret create youtube-cookies
# When prompted, paste: YOUTUBE_COOKIES=<paste cookie content>
```

Redeploy after adding the secret:

```bash
modal deploy modal_app.py
```
**5. Run the development server:**

```bash
npm run dev
```
**6. Open the app:** Visit http://localhost:3000
```
interpret/
├── app/ # Next.js app directory
│ ├── page.tsx # Main page with upload/download logic
│ ├── layout.tsx # Root layout
│ └── globals.css # Global styles (Tailwind v4)
├── components/ # React components
│ └── ui/
│ ├── file-upload.tsx # Drag-and-drop upload (react-dropzone)
│ ├── input.tsx # Input component
│ └── simple-growth-tree.tsx # Animated tree visualization
├── lib/ # Utility functions
│ ├── types.ts # TypeScript interfaces
│ └── utils.ts # General utilities (cn helper)
├── run-service/ # Modal GPU service
│ ├── modal_app.py # AudioSeparator class with pyannote pipeline
│ └── requirements.txt # Python dependencies
└── .env.local # Local environment variables
```
- Next.js 16 - React framework with App Router
- React 19 - UI library
- Tailwind CSS v4 - Utility-first CSS
- Framer Motion - Animation library
- React Dropzone - File upload handling
- TypeScript - Type safety
- Modal - Serverless GPU platform
- Python 3.10 - Programming language
- pyannote.audio 3.1 - Speaker diarization
- PyTorch + CUDA - GPU acceleration
- torchaudio - Audio loading/preprocessing
- pydub - MP3 export
```bash
NEXT_PUBLIC_MODAL_ENDPOINT=https://your-modal-endpoint.modal.run
HUGGING_FACE_TOKEN=hf_your_token  # for the Modal secret
```