samarthvmurthy/friday-voice-agent

🎙️ Friday: Voice-Controlled Browser Agent

Say "hey Friday" and watch AI browse the web for you.

Friday is a hands-free, wake-word-activated voice assistant that controls a real browser using Google Gemini's multimodal vision and planning capabilities. Just speak a task: Friday navigates, clicks, fills forms, and reports back when done.

Built for the Google Gemini Hackathon.


✨ Features

  • 🎙️ Wake-word activation: always listening for "hey Friday", completely hands-free
  • 🧠 Gemini Vision + Planning: Gemini sees the live browser screen and plans multi-step tasks
  • 🖥️ Floating HUD: persistent GUI showing live agent status and action log
  • 🔊 Voice feedback: Friday speaks its status back to you in real time
  • ⏹️ Cancellable tasks: stop mid-execution with a single button press
  • 🔇 Mute toggle: silence voice output without interrupting the agent
  • ⌨️ Browser shortcuts: back, refresh, new tab, and home, all from the GUI
  • 🔁 Loop detection: automatically recovers from stuck states

🏗️ Architecture

┌──────────────────────────────────────────────────────┐
│                   Main Thread (GUI)                  │
│                   Tkinter HUD                        │
│         status · log · mic · stop · mute             │
└───────────────────┬──────────────────────────────────┘
                    │ callbacks + threading.Event
┌───────────────────▼──────────────────────────────────┐
│             Background Thread (Agent Loop)           │
│                                                      │
│  VoiceListener ──► wake word ──► record command      │
│                                       │              │
│                              asyncio event loop      │
│                                       │              │
│                              browser-use Agent       │
│                                       │              │
│                            Google Gemini API         │
│                         (Vision + Planning)          │
│                                       │              │
│                             Playwright Browser       │
└──────────────────────────────────────────────────────┘

Two threads run in parallel:

  • Main thread: Tkinter GUI, never blocks
  • Agent thread: owns its own asyncio event loop, handles voice I/O and all browser automation

Tasks are wrapped as asyncio.Task objects so they can be cancelled cleanly at any point via loop.call_soon_threadsafe(task.cancel).
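The cancellation path described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Friday's actual code: `AgentWorker`, `submit`, `stop`, and `long_task` are hypothetical names, and `asyncio.sleep` stands in for a real browser-use agent run.

```python
import asyncio
import threading


class AgentWorker:
    """Hypothetical agent thread that owns its own asyncio event loop."""

    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        self.task = None
        self.status = "Idle"
        self.finished = threading.Event()
        # The loop runs in a background thread, so the caller's thread stays free.
        threading.Thread(target=self.loop.run_forever, daemon=True).start()

    def submit(self, coro_fn) -> None:
        """Schedule coro_fn() on the agent loop as a cancellable asyncio.Task."""
        created = threading.Event()

        def _start() -> None:
            # Runs inside the loop thread, where creating tasks is safe.
            self.task = self.loop.create_task(coro_fn())
            self.task.add_done_callback(self._on_done)
            self.status = "Running"
            created.set()

        self.finished.clear()
        self.loop.call_soon_threadsafe(_start)
        created.wait()  # block briefly until the Task object exists

    def _on_done(self, task: asyncio.Task) -> None:
        self.status = "Stopped." if task.cancelled() else "Done."
        self.finished.set()

    def stop(self) -> None:
        """Safe to call from any thread, for example a GUI button handler."""
        if self.task is not None:
            self.loop.call_soon_threadsafe(self.task.cancel)


async def long_task() -> str:
    await asyncio.sleep(60)  # stands in for a long browser automation run
    return "done"


if __name__ == "__main__":
    worker = AgentWorker()
    worker.submit(long_task)
    worker.stop()
    worker.finished.wait(timeout=2)
    print(worker.status)
```

The key design point is that `task.cancel` is never called directly from the GUI thread. Handing it to `loop.call_soon_threadsafe` lets the loop thread perform the cancellation itself, which is the documented way to interact with an event loop from another thread.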


🛠️ Tech Stack

Google Technologies

| Technology | Role |
| --- | --- |
| Gemini API (gemini-2.5-flash) | Core LLM: reasoning, planning, and decision-making |
| Gemini Vision | Reads and understands the live browser screen |
| Gemini Planning | Decomposes complex tasks into multi-step execution plans |
| ChatGoogle (browser_use.llm.google) | Native Google Gemini API integration used in agent.py |
| Google AI Studio | API key management and usage monitoring |

Supporting Stack

| Technology | Role |
| --- | --- |
| browser-use | Browser agent framework |
| Playwright | Browser engine (via browser-use) |
| Vosk | Offline wake-word detection + speech recognition |
| SpeechRecognition | Microphone input |
| pyttsx3 | Text-to-speech (Friday's voice) |
| Tkinter | Floating HUD / GUI |
| pyautogui | Browser keyboard shortcut passthrough |
| python-dotenv | API key management |
| Python 3.11+ | Language |

🚀 Getting Started

Prerequisites

  • Python 3.11 or higher
  • A Gemini API key from Google AI Studio
  • A working microphone

Installation

# 1. Clone the repo
git clone https://github.com/yourusername/friday.git
cd friday

# 2. Create and activate a virtual environment
python -m venv venv
source venv/bin/activate        # macOS/Linux
venv\Scripts\activate           # Windows

# 3. Install dependencies
pip install -r requirements.txt

# 4. Install Playwright browsers
playwright install chromium

# 5. Set up your API key
cp .env.example .env
# Edit .env and add your GEMINI_API_KEY

Environment Variables

Create a .env file in the project root:

GEMINI_API_KEY=your_gemini_api_key_here

Get your key from Google AI Studio; it's free.
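Friday's stack lists python-dotenv for loading this file. As a dependency-free illustration of the same idea, the sketch below parses simple KEY=VALUE lines and fails fast when the key is missing or still set to the placeholder. `read_env_file` and `require_api_key` are hypothetical helpers introduced here, not functions from the repository.

```python
import os
from pathlib import Path


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(path: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("GEMINI_API_KEY") or read_env_file(path).get("GEMINI_API_KEY", "")
    if not key or key == "your_gemini_api_key_here":
        raise RuntimeError("GEMINI_API_KEY is missing; add it to .env")
    return key
```

Checking for the placeholder value catches the common case where `.env.example` was copied but never edited.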

Run

python main.py

Friday will calibrate your microphone on startup, then say "Friday is ready."


🎤 Usage

Voice Control

  1. Say "hey Friday" to activate
  2. Wait for the "Listening" confirmation
  3. Speak your task clearly:
    • "Search for flights from New York to London next month"
    • "Go to YouTube and find a tutorial on Python asyncio"
    • "Open Gmail and summarize my latest unread emails"
  4. Friday will confirm your command and start executing
  5. Say or press Stop to cancel at any time

GUI Controls

| Button | Action |
| --- | --- |
| 🎙️ Mic | Manually trigger listening (no wake word needed) |
| ⏹️ Stop | Cancel the current task immediately |
| 🔇 Mute | Toggle voice feedback on/off |
| ← Back | Browser back (only when no task running) |
| ↻ Refresh | Browser refresh (only when no task running) |
| + New Tab | Open a new browser tab |
| 🏠 Home | Navigate to browser home |
| ✕ Quit | Exit Friday |

📁 Project Structure

friday/
├── main.py          # Entry point: orchestrates all components
├── agent.py         # Google Gemini LLM setup via ChatGoogle (gemini-2.5-flash)
├── browser.py       # Browser profile configuration for browser-use
├── voice.py         # Wake-word detection, recording, TTS
├── gui.py           # Tkinter floating HUD
├── .env             # Your API keys (never commit this)
├── .env.example     # Template for environment variables
└── requirements.txt # Python dependencies

⚙️ Configuration

Key parameters in main.py you can tune:

agent = Agent(
    # task and llm arguments omitted for brevity
    use_vision=True,              # Gemini reads the screen visually
    enable_planning=True,         # Multi-step task decomposition
    max_failures=5,               # Retries before giving up
    loop_detection_enabled=True,  # Detects and escapes stuck loops
)

🧪 Reproducible Testing Instructions

Wake word not triggering? Friday has a manual trigger button in the GUI, so you can test every feature without saying "hey Friday". You still need a microphone to speak the task.

Setup (5 minutes)

# 1. Clone and enter the repo
git clone https://github.com/yourusername/friday.git
cd friday

# 2. Create a virtual environment
python -m venv venv
source venv/bin/activate        # macOS/Linux
venv\Scripts\activate           # Windows

# 3. Install all dependencies
pip install -r requirements.txt

# 4. Install the browser engine
playwright install chromium

# 5. Add your Gemini API key
echo "GEMINI_API_KEY=your_key_here" > .env

Get a free Gemini API key at aistudio.google.com; it takes about 30 seconds.


Run Friday

python main.py

Wait for the floating HUD to appear and Friday to say "Friday is ready."


How to give commands (two ways)

Option A: Voice (recommended). Say "hey Friday" clearly, wait for "Listening", then speak your task.

Option B: Manual button (no wake word needed). Click the 🎙️ mic button in the HUD, then speak your task. The behavior is identical; only the wake word is skipped.


Suggested Test Commands

Try these in order. They go from simple to complex and cover the full range of Friday's capabilities:

1. Sanity check

"Go to google.com"

The browser should open and navigate to Google. Friday confirms when done.

2. Google Search + Gemini vision

"Search Google for the latest Gemini AI news and tell me the top result"

Friday searches, reads the results page visually using Gemini, and speaks the top headline back to you.

3. Google News summarization

"Open Google News and summarize the top 3 stories right now"

Friday navigates to news.google.com, reads the live headlines using Gemini vision, and gives you a spoken summary.

4. YouTube navigation

"Go to YouTube and find the most viewed video about Google Gemini"

Multi-step task: Friday searches, reads view counts visually, and opens the top result.

5. The fun one: Rickroll

"Hey Friday, rickroll me"

Friday interprets intent (not just literal words), navigates to YouTube, finds Rick Astley's Never Gonna Give You Up, and plays it. Tests Gemini's reasoning ability.

6. Multi-step Google task

"Open Google Maps and find the highest rated coffee shop near me"

Tests location-aware browsing, reading ratings and reviews, and multi-step decision-making, all in one command.


Test Task Cancellation

  1. Give Friday a long task: "Search Google for every country in the world and list them all"
  2. While it's running, press ⏹️ Stop in the HUD
  3. Friday should say "Stopped." and return to idle within 2 seconds, with no crash and no hang

What to look for

Signal What it means
HUD status changes in real time Agent loop and GUI are communicating correctly
Browser navigates without manual input browser-use + Playwright working
Friday speaks results back Gemini vision successfully read the page
Stop button cancels cleanly Async task cancellation working
Log panel fills with action steps Gemini planning is decomposing tasks correctly

🔧 Troubleshooting

Friday doesn't hear the wake word

  • Run with a quiet background: Friday calibrates the microphone to ambient noise on startup
  • Speak clearly and at a normal pace
  • Check your microphone is set as the default input device

Browser doesn't open

  • Make sure you ran playwright install chromium
  • Check your browser.py profile configuration

Gemini API errors

  • Verify your GEMINI_API_KEY in .env is valid and has quota
  • Check Google AI Studio for usage limits

Task gets stuck

  • Press the Stop button in the GUI; this cleanly cancels the async task
  • loop_detection_enabled=True will also auto-recover in most cases
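For intuition about what loop detection means, the sketch below flags a stuck state when the most recent steps are all identical. This is purely illustrative: `LoopDetector` is a hypothetical class, and it is not a description of how browser-use implements the feature internally.

```python
from collections import deque


class LoopDetector:
    """Hypothetical detector: flags when the recent actions are all identical."""

    def __init__(self, window: int = 3) -> None:
        # Only the last `window` steps are kept; older ones fall off the left.
        self.recent: deque[tuple[str, str]] = deque(maxlen=window)

    def record(self, action: str, target: str) -> bool:
        """Store one step and return True if the last `window` steps are the same."""
        self.recent.append((action, target))
        return len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1
```

An agent loop could call `record` after every step and switch strategy, or give up, once it returns True.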

☁️ Google API Integration: Proof of Use

Friday's entire AI brain is powered by a direct call to Google's Gemini API via the official ChatGoogle client from browser-use:

# agent.py
from browser_use.llm.google.chat import ChatGoogle


def create_llm(api_key: str) -> ChatGoogle:
    """
    Creates a Google Gemini LLM instance via the official Google API.
    Model: gemini-2.5-flash, Google's latest multimodal model.
    API key sourced from GEMINI_API_KEY environment variable.
    Used by browser-use Agent for vision, planning, and task execution.
    """
    return ChatGoogle(
        model="gemini-2.5-flash",
        api_key=api_key,
        temperature=0.1,
    )

Every voice command Friday receives triggers a live call to gemini-2.5-flash, Google's latest model, for vision-based screen reading, multi-step task planning, and action execution. No other LLM provider is used anywhere in the codebase.

API usage is tracked and visible in Google AI Studio: 500+ requests, 5M+ input tokens, and a 100% success rate logged over 28 days of development.

🔗 View agent.py on GitHub


🤝 Contributing

Pull requests are welcome! For major changes, please open an issue first.

  1. Fork the repo
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

MIT License: see LICENSE for details.


🙏 Acknowledgements

  • browser-use: the incredible browser agent framework that makes this possible
  • Google Gemini: for vision and planning capabilities
  • Vosk: for fast, offline speech recognition

Built with ❤️ for the Google Gemini Hackathon

About

Friday is a hands-free AI assistant that listens to your voice and autonomously controls your web browser to complete tasks. Say "Hey Friday" or press the mic button, give a command in plain English, and Friday handles the rest - navigating, clicking, and interacting with websites on your behalf.
