A Python app that listens to your voice, transcribes it, analyzes an image, and returns a caption or answer using Google's Gemma 3n multimodal model.
## Features

- Speech Recognition: Captures and transcribes spoken input from your microphone.
- Image Analysis: Loads and processes an image for visual context.
- Multimodal Reasoning: Uses the Gemma 3n E4B model to generate answers or captions from the combined text and image input.
## Requirements

- Python 3.8+
- transformers >= 4.53.1
- SpeechRecognition >= 3.14.3
- pillow >= 11.3.0
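If the project does not already ship one, the constraints above can be captured in a `requirements.txt` like the following (a minimal sketch listing only the packages named in this README):

```
transformers>=4.53.1
SpeechRecognition>=3.14.3
pillow>=11.3.0
```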
## Installation & Usage

Install dependencies:

```
pip install -r requirements.txt
```

- Place an image file named `sample.jpg` in the project directory (or modify the code to use your own image path).
- Run the app:

```
python main.py
```

- Speak into your microphone when prompted. The app will transcribe your speech, analyze the image, and generate a response using Gemma 3n.
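The speech step can be sketched with the SpeechRecognition library's documented `Recognizer` interface; the helper names and fallback messages below are illustrative, not taken from this project's actual code:

```python
# Sketch of the microphone capture and transcription step.
# Assumes: pip install SpeechRecognition pyaudio
from typing import Optional


def normalize_transcript(text: str) -> str:
    """Trim whitespace and make sure the question ends with punctuation."""
    text = text.strip()
    if text and text[-1] not in ".?!":
        text += "?"
    return text


def listen_and_transcribe() -> Optional[str]:
    # Heavy import kept local so the pure helper above has no dependencies.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        print("🎙️ Speak now...")
        recognizer.adjust_for_ambient_noise(source)  # calibrate for background noise
        audio = recognizer.listen(source)
    try:
        return normalize_transcript(recognizer.recognize_google(audio))
    except sr.UnknownValueError:
        print("Could not understand the audio.")
    except sr.RequestError as err:
        print(f"Recognition service error: {err}")
    return None
```

Catching `UnknownValueError` and `RequestError` separately lets the app distinguish unintelligible speech from network failures instead of crashing mid-session.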
## Example

```
🎙️ Speak now...
📝 Transcribed: What is the dog doing on the beach?
🤖 Response: The dog is sitting on the beach, possibly enjoying the view or resting.
```
## Model

This project uses the google/gemma-3n-E4B model, a state-of-the-art, open, multimodal model from Google DeepMind. Gemma 3n accepts text, image, and audio input and is optimized for efficient execution on a wide range of devices. Learn more in the official documentation.
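The text-plus-image reasoning step can be sketched with the Hugging Face transformers chat API. The message layout follows the documented chat-template convention for image-text models; the function names and the 128-token limit here are illustrative assumptions, not this project's exact code:

```python
# Sketch of the multimodal reasoning step (helper names are illustrative).
# Assumes: pip install transformers torch pillow


def build_messages(question: str, image_path: str) -> list:
    """Assemble a chat-style message pairing the image with the transcribed question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


def answer(question: str, image_path: str = "sample.jpg") -> str:
    # Heavy imports kept local so build_messages stays dependency-free.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model_id = "google/gemma-3n-E4B"  # gated on Hugging Face; accept the license first
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

    inputs = processor.apply_chat_template(
        build_messages(question, image_path),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    output = model.generate(**inputs, max_new_tokens=128)
    # Decode only the newly generated tokens, skipping the echoed prompt.
    return processor.decode(
        output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
```

Slicing the output at the prompt length before decoding keeps the returned string to just the model's answer rather than the full conversation.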
## Notes

- The app requires a working microphone and an image file named `sample.jpg` in the project directory.
- For best results, speak clearly and use high-quality images.
- The first run may take some time while the model weights download from Hugging Face.