A web application for creating large-scale Egyptian Arabic ASR datasets from YouTube videos.
- Channel Monitoring: Automatically fetch all videos from specified YouTube channels.
- Audio Extraction: Download high-quality audio using `pytubefix`.
- AI Transcription: High-accuracy Egyptian Arabic transcription using Google Gemini 2.0 Flash.
- VAD Filtering: Intelligent Voice Activity Detection to remove non-speech chunks.
- Transcript Alignment: Fetch existing Arabic captions (manual or auto-generated) as a fallback.
- Intelligent Splitting: Silence-based audio splitting using `pydub` and VAD.
- Slack Notifications: Real-time alerts for system status, errors, and daily summaries.
- Dataset Export: Create HuggingFace-compatible datasets (Arrow/Parquet).
- Modern UI: Dark-themed dashboard with real-time progress tracking.
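The VAD filtering and silence-based splitting above can be sketched with a simple RMS-energy gate. This is a simplified illustration of the idea only, not the app's actual pipeline (which uses `pydub` and a dedicated VAD model); all names here are hypothetical.

```python
import numpy as np

def find_speech_chunks(samples, rate, frame_ms=30, energy_thresh=0.01):
    """Split a mono signal into (start, end) sample regions whose
    per-frame RMS energy exceeds a threshold -- a crude VAD stand-in."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    chunks, start = [], None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        if rms > energy_thresh:
            if start is None:
                start = i * frame_len  # speech begins
        elif start is not None:
            chunks.append((start, i * frame_len))  # speech ends
            start = None
    if start is not None:
        chunks.append((start, n_frames * frame_len))
    return chunks

# Synthetic example: 1 s silence, 1 s of a 440 Hz tone, 1 s silence at 16 kHz.
rate = 16000
t = np.arange(rate) / rate
signal = np.concatenate([np.zeros(rate), 0.5 * np.sin(2 * np.pi * 440 * t), np.zeros(rate)])
chunks = find_speech_chunks(signal, rate)
```

On this synthetic signal the gate recovers a single "speech" region around the middle second; real speech needs a proper VAD, since energy alone confuses music and noise with speech.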
- Backend: FastAPI, SQLAlchemy (MySQL), Celery, Redis.
- Frontend: React, TypeScript, Vite, Vanilla CSS.
- Processing: Pydub, Librosa, Pytubefix, Google Gemini API.
- Docker and Docker Compose
- HuggingFace Token (optional, for Hub export)
- Google Gemini API Key (for transcription)
- Slack Webhook URL (optional, for notifications)
- Clone the repository.
- Create a `.env` file from `.env.example`: `cp .env.example .env`
- Update `.env` with your credentials:
  - `GEMINI_API_KEY`: Your Google Gemini API key.
  - `SLACK_WEBHOOK_URL`: Your Slack webhook URL.
  - `DATABASE_URL`: Your MySQL connection string (e.g., `mysql+pymysql://user:password@localhost/dbname`).
- Start the application: `docker-compose up --build`
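For reference, a filled-in `.env` might look like this (all values are placeholders, and the exact variable names should be checked against `.env.example`):

```
GEMINI_API_KEY=your-gemini-api-key
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ
DATABASE_URL=mysql+pymysql://user:password@localhost/dbname
REDIS_URL=redis://localhost:6379/0
```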
- Access the UI at `http://localhost:5173`.
- Upload a `.txt` file with channel URLs in the "Channels" tab.
- Monitor progress in the "Videos" tab.
- Browse and preview chunks in the "Chunks" tab.
- Use the "Transcribe" feature to generate accurate text for your chunks.
- Export your dataset in the "Export" tab.
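The channel-list `.txt` file is expected to contain one channel URL per line, for example (URLs are illustrative):

```
https://www.youtube.com/@ExampleEgyptianChannel
https://www.youtube.com/@AnotherExampleChannel
```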
If you prefer to run the application without Docker, follow these steps:
- Python 3.10+: Install from python.org.
- Node.js 20+: Install from nodejs.org.
- MySQL Server: Recommended for production/large datasets.
- Redis Server: Required for Celery.
- Mac: `brew install redis`
- Linux: `sudo apt install redis-server`
- Windows: Use Redis on WSL.
- Navigate to the backend directory: `cd backend`
- Set up the environment and install dependencies using `uv`:

  ```bash
  uv venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  uv pip install -r requirements.txt
  ```
- Start Redis (in a separate terminal): `redis-server`
- Start the Celery worker (in a separate terminal, ensure `REDIS_URL` is set):

  ```bash
  export REDIS_URL="redis://localhost:6379/0"  # Make sure your .env is loaded or variables are set
  uv run celery -A app.celery_app worker --loglevel=info
  ```

- Start the FastAPI server: `uv run uvicorn app.main:app --reload`
- Navigate to the frontend directory: `cd frontend`
- Install dependencies: `npm install`
- Start the development server: `npm run dev`
- `backend/`: FastAPI application, Celery tasks, and Gemini integration.
- `frontend/`: React application.
- `data/`: Local storage for audio and transcripts (ignored by git).
- `logs/`: Error logs (ignored by git).
To use MySQL, ensure `DATABASE_URL` in `.env` is set to a MySQL connection string (e.g., `mysql+pymysql://...`). The application will verify the connection on startup.
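A startup connection check of this kind can be sketched with SQLAlchemy as follows; the function name is hypothetical, and the SQLite fallback exists only so the snippet runs anywhere (the app itself targets MySQL):

```python
import os

from sqlalchemy import create_engine, text

def verify_connection(url: str) -> bool:
    """Return True if a trivial query succeeds against the database."""
    engine = create_engine(url, pool_pre_ping=True)
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        return True
    except Exception:
        return False

# In the app this would be the mysql+pymysql://... URL from .env.
db_url = os.environ.get("DATABASE_URL", "sqlite:///:memory:")
ok = verify_connection(db_url)
```

`pool_pre_ping=True` makes SQLAlchemy test pooled connections before reuse, which helps against MySQL's habit of closing idle connections during long Celery runs.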