
SilenceVoice

A visual speech recognition (VSR) tool that reads your lips in real-time and types whatever you silently mouth.

SilenceVoice uses state-of-the-art AI to translate visual lip movements into text, making communication accessible for people who cannot speak and enabling silent dictation in quiet environments.

🚀 Features

  • Real-Time Lip Reading: Translates silent speech instantly using advanced computer vision.
  • AI Correction: Uses Google Gemma-3-27B to refine raw phonetic detections into natural, grammatically correct sentences.
  • Text-to-Speech: Built-in functionality to speak the corrected text aloud.
  • Privacy-First: The core VSR model runs locally on your machine.
  • Modern UI: A clean, accessible interface built with Next.js and Tailwind CSS.

🛠️ Technology Stack

Frontend

  • Framework: Next.js
  • Styling: Tailwind CSS

Backend

  • Framework: FastAPI (Python)
  • Server: Uvicorn

AI & Machine Learning

  • Visual Speech Recognition (VSR):
    • Based on Auto-AVSR architecture.
    • Trained on the LRS3 (Lip Reading Sentences 3) dataset.
    • Libraries: torch, torchvision, torchaudio, mediapipe (for face tracking), opencv-python.
  • Large Language Model (LLM):
    • Google Gemma-3-27B, used to refine raw VSR output into natural, grammatically correct sentences.
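To illustrate how the LLM fits into the stack: the VSR model emits a raw, often phonetically garbled transcription, which is wrapped in a correction instruction before being sent to Gemma-3-27B. The prompt wording below is a hypothetical sketch, not the project's actual prompt:

```python
def build_correction_prompt(raw_vsr_text: str) -> str:
    """Wrap raw VSR output in a correction instruction for the LLM.

    Hypothetical illustration of the correction step; the real prompt
    used by SilenceVoice may differ.
    """
    return (
        "The text below was produced by a lip-reading (VSR) model and may "
        "contain phonetic errors. Rewrite it as a natural, grammatically "
        "correct sentence that preserves the intended meaning.\n\n"
        f"Raw transcription: {raw_vsr_text}"
    )
```

The returned string would then be sent to Gemma-3-27B, and the model's reply displayed as the final text.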
📦 Installation

Prerequisites

  • Python 3.11+
  • Node.js 18+
  • Git

1. Clone the Repository

```bash
git clone https://github.com/your-username/silencevoice.git
cd silencevoice
```

2. Setup Backend & Models

Run the setup script to download the required VSR models (approx. 1GB):

```bash
./setup.sh
```

Install Python dependencies:

```bash
pip install -r requirements.txt
pip install -r backend_requirements.txt
```

Start the Backend Server:

```bash
python backend/main.py
```

The server listens on http://0.0.0.0:8000 (all network interfaces, port 8000).

3. Setup Frontend

Open a new terminal window and navigate to the frontend directory:

```bash
cd frontend
npm install
```

Start the Frontend Application:

```bash
npm run dev
```

The application will be available at http://localhost:3000

🎮 Usage

  1. Open your browser to http://localhost:3000.
  2. Allow camera access when prompted.
  3. Click "Start Recognition" to begin the session.
  4. Speak silently (mouth words without sound) into the camera.
  5. The VSR model converts your lip movements into a raw transcription, which Gemma-3 then refines.
  6. The final text will appear on screen.
  7. Toggle "Text-To-Speech On" to have the system read the text aloud automatically.
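The steps above form a simple pipeline: camera frames go through the VSR model, the raw transcription is corrected by the LLM, and the result is optionally spoken aloud. A minimal sketch of that flow, where every function name is a hypothetical placeholder rather than the project's actual API:

```python
from typing import Callable, Iterable, Optional


def run_pipeline(
    frames: Iterable,                      # camera frames of the speaker's mouth
    vsr_model: Callable[[Iterable], str],  # frames -> raw phonetic transcription
    corrector: Callable[[str], str],       # raw text -> corrected sentence (LLM)
    speak: Optional[Callable[[str], None]] = None,  # optional TTS callback
) -> str:
    """One recognition pass: VSR -> LLM correction -> optional TTS.

    Hypothetical sketch of the flow described in the Usage steps; the
    real project structures this differently.
    """
    raw_text = vsr_model(frames)      # raw detection from the VSR model
    final_text = corrector(raw_text)  # refined by the LLM
    if speak is not None:             # "Text-To-Speech On" toggle
        speak(final_text)             # read the final text aloud
    return final_text                 # text shown on screen
```

For example, `run_pipeline(frames, model, correct, speak=tts.say)` would transcribe, correct, and speak one utterance.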

📄 License

MIT
