Skip to content

mynamevansh/Gemini-API

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🤖 AI Voice Assistant with Push-to-Talk

� Advanced Voice Conversation System

A high-performance, real-time voice AI assistant featuring push-to-talk functionality, optimized response times, and intelligent conversation flow. Built with Google Gemini AI for natural, human-like interactions.

🌟 Key Features

🎤 Push-to-Talk System

  • Hold & Release: Hold button/spacebar to record, release to send automatically
  • Smart Auto-Response: AI responds immediately when you stop talking or release button
  • Interrupt Capability: Press button during AI speech to stop and start new conversation
  • Voice Activity Detection: Automatic silence detection for seamless conversations

Performance Optimized

  • 3x Faster Response: Reduced response time from 10-15s to 3-6s
  • Smart Timeout Handling: 8-second API timeout with graceful fallback
  • Optimized Audio Processing: Enhanced voice detection and processing
  • Minimal Latency: Real-time WebSocket communication

🎨 Intelligent UI/UX

  • Smart Animations: Wave visualizer only animates when AI is speaking
  • Clean Interface: Minimal, distraction-free design
  • Visual Feedback: Clear status indicators for each interaction state
  • Responsive Design: Works perfectly on desktop and mobile

🚀 Quick Start

1. Clone & Setup

git clone https://github.com/mynamevansh/Gemini-API.git
cd Gemini-API
npm install

2. Configure API Key

Create a .env file with your Gemini API key:

GEMINI_API_KEY=your_gemini_api_key_here

3. Start the Server

npm start

4. Open & Use

  • Navigate to: http://localhost:3000
  • Allow microphone permission
  • Hold "Hold to Talk" button → Speak → Release → AI responds automatically!

� How to Use (Push-to-Talk)

🖱️ Mouse/Touch

  1. Press and Hold the "Hold to Talk" button
  2. Speak your message clearly
  3. Release the button when finished
  4. AI responds automatically - no extra button presses needed!

⌨️ Keyboard (Hands-free)

  1. Press and Hold Spacebar
  2. Speak your message
  3. Release Spacebar
  4. AI responds immediately

🛑 During AI Response

  • Press button once to stop AI and start new conversation
  • ESC key to stop AI response

💬 Conversation Examples

Perfect for natural conversations:

  • Questions: "What's the weather like today?"
  • Explanations: "How does quantum computing work?"
  • Creative Tasks: "Write me a short poem about coffee"
  • Problem Solving: "Help me plan a weekend trip"
  • Casual Chat: "Tell me something interesting"
  • Technical Help: "Explain JavaScript promises"

🔧 Technical Architecture

Frontend Stack

  • HTML5 + CSS3: Modern responsive interface
  • JavaScript ES6+: Advanced audio processing and WebSocket handling
  • Web Speech API: Real-time speech recognition
  • WebRTC: High-quality audio capture and playback

Backend Stack

  • Node.js + Express: High-performance server
  • WebSocket (ws): Real-time bidirectional communication
  • Google Gemini AI: Advanced language model integration
  • Environment Variables: Secure API key management

Performance Features

  • Optimized API Calls: Reduced token usage for faster responses
  • Smart Timeout Management: 8s timeout with automatic retry
  • Efficient Audio Processing: Optimized voice activity detection
  • Memory Management: Clean state management and garbage collection

📊 Status Indicators & Visual Feedback

State Visual Description
🟢 Ready Static waves, button ready Ready for input
🔵 Listening Soft green glow, "Recording..." Voice recording active
🟡 Processing No animation, "Processing..." AI thinking
🟠 AI Speaking Animated orange waves AI responding
🔴 Error Error message display Connection/API issues

⚙️ Configuration & Requirements

Prerequisites

  • Node.js (v14 or higher)
  • Modern Browser with Web Speech API support
  • Microphone access and permissions
  • Google Gemini API Key (Get one here)

Environment Setup

# Install dependencies
npm install

# Create .env file
echo "GEMINI_API_KEY=your_api_key_here" > .env

Supported Browsers

Chrome (Recommended) - Full feature support
Edge - Full feature support
Firefox - Good support
Safari - Basic support

🐛 Troubleshooting Guide

🎤 Microphone Issues

Problem: "Microphone access denied"
Solutions:
• Click 🔒 in address bar → Allow microphone
• Check Windows Privacy Settings → Microphone access
• Ensure no other apps are using microphone
• Try refreshing the page (Ctrl+F5)

🔗 Connection Problems

Problem: "Connection error" or timeouts
Solutions:
• Verify server is running: npm start
• Check API key in .env file
• Test internet connection
• Try: taskkill /F /IM node.exe && npm start

⚡ Performance Issues

Problem: Slow responses or no responses
Solutions:
• Check console for errors (F12)
• Verify API key has sufficient quota
• Try "Test AI" button first
• Close other browser tabs to free resources

🔊 Audio Problems

Problem: No AI voice output
Solutions:
• Check system volume and browser audio
• Verify speakers/headphones are working
• Try different browser or incognito mode
• Check browser's autoplay policy settings

🎨 Customization Options

Voice & Audio Settings

// In script.js - Customize speech synthesis
currentUtterance.rate = 1.2;     // Speech speed (0.5-2.0)
currentUtterance.pitch = 1;      // Voice pitch (0.5-2.0)
currentUtterance.volume = 1;     // Volume (0.0-1.0)

Voice Detection Sensitivity

// In script.js - Adjust detection thresholds
const SILENCE_THRESHOLD = 1200;  // Silence detection (ms)
const VOLUME_THRESHOLD = 0.003;  // Voice sensitivity
const MIN_SPEECH_DURATION = 200; // Minimum speech length (ms)

UI Theme Customization

/* In style.css - Change color scheme */
--primary-gradient: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
--success-gradient: linear-gradient(135deg, #28a745, #20c997);
--speaking-gradient: linear-gradient(135deg, #ff6b6b, #feca57);

AI Response Customization

// In server/index.js - Modify AI behavior
generationConfig: {
  temperature: 0.3,      // Creativity (0.0-1.0)
  maxOutputTokens: 500,  // Response length
  candidateCount: 1      // Number of responses
}

🚀 Advanced Features & Tips

🔥 Power User Features

  • Multiple Conversations: Each button press starts fresh context
  • Interrupt & Resume: Stop AI anytime to ask follow-up questions
  • Smart Timeout Handling: Automatic retry on network issues
  • Voice Activity Detection: Hands-free operation with spacebar
  • Mobile Optimized: Touch-friendly interface for smartphones

� Performance Tips

  • Use Chrome for best performance and compatibility
  • Close unused tabs to free up system resources
  • Stable internet ensures faster AI response times
  • Clear browser cache if experiencing issues
  • Update browser for latest Web Speech API features

🔒 Privacy & Security

  • Local Processing: Voice processing happens in your browser
  • Secure Connection: HTTPS/WSS encryption for data transmission
  • No Storage: Conversations are not stored or logged
  • API Key Security: Keep your Gemini API key private

🌟 What Makes This Special?

Instant Interaction: No complex button sequences - just hold, talk, and release
🚀 Lightning Fast: Optimized for 3-6 second total response times
🎯 Zero Friction: Push-to-talk eliminates UI complexity
🔄 Smart Flow: AI automatically responds when you finish speaking
🎨 Polished UX: Clean animations that enhance rather than distract
🛡️ Bulletproof: Comprehensive error handling and recovery systems


📝 Recent Updates (v2.0)

  • 🎤 Push-to-Talk System: Complete interaction model overhaul
  • 3x Performance Boost: Dramatically improved response times
  • 🎨 Smart Animations: Context-aware visual feedback
  • 🔧 Enhanced Error Handling: Robust timeout and retry mechanisms
  • 📱 Mobile Optimization: Touch-friendly interface improvements
  • 🧹 Clean Console Output: Minimal, developer-friendly logging

Enjoy seamless voice conversations with AI! 🎉


Built with ❤️ using Node.js, Google Gemini AI, and modern web technologies

About

This project demonstrates how to integrate Google’s Gemini API with a modern web application. It showcases real-time interaction with a powerful large-language-model (LLM) to perform tasks such as text generation, summarization, and conversational AI.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors