
Interpret - Bilingual Audio Separator

A web application for splitting bilingual sermons into two clean language-only audio tracks.

Overview

Interpret lets users upload MP3 files containing bilingual audio (e.g., sermons with live interpretation) and automatically separates them into two clean audio tracks, one per language. It uses AI-powered speaker diarization via pyannote.audio to identify and separate the speakers.

How It Works

High-Level Flow

  1. Upload: The user drops an MP3 file, and the browser converts it to base64
  2. Process: The base64 payload is sent directly to the Modal GPU endpoint (see the client sketch after this list)
  3. Diarize: pyannote.audio identifies exactly two speakers
  4. Separate: Audio segments are grouped by speaker (the speaker with more total speaking time becomes Track 1)
  5. Return: Two base64-encoded MP3s are returned to the browser
  6. Download: The browser decodes them and offers the files for download
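
The round trip can be illustrated with a minimal Python client (the browser does the same thing with fetch). This is a sketch only: the endpoint URL and the JSON field names audio_base64, track1, and track2 are placeholders, not the verified contract of the Modal endpoint.

    import base64
    import requests

    MODAL_ENDPOINT = "https://your-modal-endpoint.modal.run"  # NEXT_PUBLIC_MODAL_ENDPOINT

    def separate(mp3_path: str) -> None:
        # Read the MP3 and base64-encode it, mirroring what the browser does on upload.
        with open(mp3_path, "rb") as f:
            audio_b64 = base64.b64encode(f.read()).decode("ascii")

        # POST the payload to the Modal GPU endpoint and wait for the separated tracks.
        resp = requests.post(MODAL_ENDPOINT, json={"audio_base64": audio_b64}, timeout=600)
        resp.raise_for_status()
        result = resp.json()

        # Decode each returned base64 MP3 and write it to disk.
        for name in ("track1", "track2"):
            with open(f"{name}.mp3", "wb") as out:
                out.write(base64.b64decode(result[name]))

    separate("sermon.mp3")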

Audio Processing Pipeline (Modal GPU)

The core separation happens in run-service/modal_app.py (a condensed sketch of the full pipeline follows the steps below):

  1. Audio Preprocessing

    • Load MP3 with torchaudio
    • Convert stereo to mono (average channels)
    • Resample to 16kHz (required by pyannote)
    • Peak normalize to [-1, 1] range
  2. Speaker Diarization (pyannote.audio)

    • Neural network identifies "who spoke when"
    • Forces exactly 2 speaker clusters (num_speakers=2)
    • Outputs timestamped segments: [(0.5s, 3.2s, SPEAKER_00), (3.2s, 8.1s, SPEAKER_01), ...]
    • Uses FP16 mixed precision and batch size 64 for GPU optimization
  3. Speaker-to-Track Assignment

    • Calculate total speaking duration per speaker
    • Speaker with more total time becomes Track 1
    • Assumes both languages have roughly equal content
  4. Track Building

    • Iterate through diarization segments chronologically
    • Slice audio array for each segment: audio[start:end]
    • Concatenate all segments per speaker into continuous tracks
  5. MP3 Export

    • Convert float32 samples to int16
    • Export via pydub at 128kbps
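
Put together, the pipeline looks roughly like the sketch below. This is a condensed illustration rather than the actual contents of run-service/modal_app.py: file names and the HF_TOKEN placeholder are assumptions, and the real service additionally enables FP16 mixed precision and a batch size of 64.

    import numpy as np
    import torch
    import torchaudio
    from pyannote.audio import Pipeline
    from pydub import AudioSegment

    HF_TOKEN = "hf_your_token"
    TARGET_SR = 16_000

    # 1. Preprocess: load, downmix to mono, resample to 16 kHz, peak-normalize.
    waveform, sr = torchaudio.load("sermon.mp3")
    waveform = waveform.mean(dim=0, keepdim=True)
    waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    waveform = waveform / waveform.abs().max()

    # 2. Diarize: force exactly two speaker clusters.
    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=HF_TOKEN)
    pipeline.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
    diarization = pipeline({"waveform": waveform, "sample_rate": TARGET_SR}, num_speakers=2)

    # 3. + 4. Accumulate per-speaker durations and slice out each segment chronologically.
    samples = waveform.squeeze(0).numpy()
    segments: dict[str, list[np.ndarray]] = {}
    durations: dict[str, float] = {}
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        start, end = int(turn.start * TARGET_SR), int(turn.end * TARGET_SR)
        segments.setdefault(speaker, []).append(samples[start:end])
        durations[speaker] = durations.get(speaker, 0.0) + (turn.end - turn.start)

    # The speaker with the most total speaking time becomes Track 1.
    ordered = sorted(durations, key=durations.get, reverse=True)

    # 5. Export: concatenate segments, convert float32 -> int16, write MP3 at 128 kbps.
    for i, speaker in enumerate(ordered, start=1):
        pcm = (np.concatenate(segments[speaker]) * 32767).astype(np.int16)
        track = AudioSegment(pcm.tobytes(), sample_width=2, frame_rate=TARGET_SR, channels=1)
        track.export(f"track{i}.mp3", format="mp3", bitrate="128k")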

Important: The pipeline separates by voice identity, not by language detection. It assumes the two speakers are speaking different languages (e.g., original speaker + interpreter).

Architecture

  • Frontend: Next.js 16 with React 19, Tailwind CSS v4
  • GPU Processing: Modal serverless GPU (L4) with pyannote.audio speaker diarization
  • Communication: Direct client-to-Modal API

Getting Started

Prerequisites

  • Node.js 18+
  • npm or yarn
  • Modal account (for GPU processing)
  • HuggingFace account with access to pyannote models

Local Development

  1. Install frontend dependencies:

    npm install
  2. Configure environment:

    cp .env.example .env.local

    Fill in your Modal endpoint URL after deployment.

  3. Deploy Modal service:

    cd run-service
    modal secret create huggingface HUGGING_FACE_TOKEN=hf_your_token
    modal deploy modal_app.py

    Copy the web endpoint URL to your .env.local.

  4. Setup YouTube cookies (Required for YouTube downloads):

    YouTube's bot detection now requires authenticated requests, so the download step needs your browser cookies (a sketch of how the service can use this secret appears after these setup steps). Export your cookies:

    Option A: Using Browser Extension (Recommended)

    • Install "Get cookies.txt LOCALLY" extension (Chrome / Firefox)
    • Go to youtube.com and sign in
    • Click the extension icon → Export cookies
    • Save the content

    Option B: Using yt-dlp

    yt-dlp --cookies-from-browser chrome --cookies cookies.txt https://youtube.com
    cat cookies.txt  # Copy the content

    Create Modal secret:

    modal secret create youtube-cookies YOUTUBE_COOKIES="$(cat cookies.txt)"

    Or manually:

    modal secret create youtube-cookies
    # When prompted, paste: YOUTUBE_COOKIES=<paste cookie content>

    Redeploy after adding the secret:

    modal deploy modal_app.py
  5. Run development server:

    npm run dev
  6. Open browser: Visit http://localhost:3000
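
For context on step 4, here is a rough sketch (not the project's actual code) of how a Modal function can consume the youtube-cookies secret: Modal exposes it as the YOUTUBE_COOKIES environment variable, which is written to a temporary Netscape-format cookies file and passed to yt-dlp.

    import os
    import tempfile

    import yt_dlp

    def download_audio(url: str) -> None:
        # Write the cookie text from the Modal secret into a cookies.txt file for yt-dlp.
        with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
            f.write(os.environ["YOUTUBE_COOKIES"])
            cookie_file = f.name

        opts = {
            "format": "bestaudio/best",     # best available audio-only stream
            "outtmpl": "input.%(ext)s",     # output filename template
            "cookiefile": cookie_file,      # authenticate requests with the exported cookies
        }
        with yt_dlp.YoutubeDL(opts) as ydl:
            ydl.download([url])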

Project Structure

interpret/
├── app/                          # Next.js app directory
│   ├── page.tsx                  # Main page with upload/download logic
│   ├── layout.tsx                # Root layout
│   └── globals.css               # Global styles (Tailwind v4)
├── components/                   # React components
│   └── ui/
│       ├── file-upload.tsx       # Drag-and-drop upload (react-dropzone)
│       ├── input.tsx             # Input component
│       └── simple-growth-tree.tsx # Animated tree visualization
├── lib/                          # Utility functions
│   ├── types.ts                  # TypeScript interfaces
│   └── utils.ts                  # General utilities (cn helper)
├── run-service/                  # Modal GPU service
│   ├── modal_app.py              # AudioSeparator class with pyannote pipeline
│   └── requirements.txt          # Python dependencies
└── .env.local                    # Local environment variables

Technology Stack

Frontend

  • Next.js 16 - React framework with App Router
  • React 19 - UI library
  • Tailwind CSS v4 - Utility-first CSS
  • Framer Motion - Animation library
  • React Dropzone - File upload handling
  • TypeScript - Type safety

GPU Processing Service

  • Modal - Serverless GPU platform
  • Python 3.10 - Programming language
  • pyannote.audio 3.1 - Speaker diarization
  • PyTorch + CUDA - GPU acceleration
  • torchaudio - Audio loading/preprocessing
  • pydub - MP3 export

Environment Variables

NEXT_PUBLIC_MODAL_ENDPOINT=https://your-modal-endpoint.modal.run
HUGGING_FACE_TOKEN=hf_your_token  # For Modal secret
