Add Voice Command Interface using Pipecat #2

Draft

codegen-sh wants to merge 3 commits into main from gen/d05251e7-6222-46cb-a48e-5b24e4dd5ced

VOICE_COMMANDS.md

@@ -0,0 +1,154 @@
+    # Voice Command Interface for CompUse
+    This document explains how to use the voice command interface for CompUse, which is implemented using the Pipecat framework.
+    ## Overview
+    The voice command interface allows you to control your computer using voice commands. It uses:
+    - **Pipecat**: An open-source framework for building voice and multimodal conversational agents
+    - **Whisper**: OpenAI's speech recognition model for accurate transcription
+    - **ElevenLabs**: For high-quality text-to-speech feedback (optional)
+    - **Voice Activity Detection (VAD)**: For detecting when you've finished speaking
+    ## Installation
+    1. Install the required dependencies:
+       ```bash
+       pip install -r requirements.txt
+       ```
+    2. Set up your API keys in a `.env` file:
+       ```
+       OPENAI_API_KEY=your_openai_api_key
+       ELEVENLABS_API_KEY=your_elevenlabs_api_key  # Optional, for voice feedback
+       ELEVENLABS_VOICE_ID=your_elevenlabs_voice_id  # Optional
+       COMPUSE_WAKE_WORD=computer  # Default wake word
+       ```
+    ## Usage
+    ### Starting the Voice Interface
+    Run the voice command interface:
+    ```bash
+    python voice_cli.py
+    ```
+    Optional arguments:
+    - `--wake-word TEXT`: Set a custom wake word (default: "computer")
+    - `--auto-start`: Automatically start voice recognition on startup
+    - `--push-to-talk`: Use push-to-talk mode instead of wake word (press Ctrl+Space to talk)
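
The three flags documented above could be parsed roughly as follows. This is a minimal sketch using the standard-library `argparse` module; the `build_parser` helper is hypothetical, and the diff does not show how `voice_cli.py` actually handles its arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags listed in VOICE_COMMANDS.md.

    Hypothetical helper: the real voice_cli.py may be structured differently.
    """
    parser = argparse.ArgumentParser(
        prog="voice_cli.py",
        description="Voice command interface for CompUse",
    )
    parser.add_argument(
        "--wake-word",
        metavar="TEXT",
        default="computer",
        help='Set a custom wake word (default: "computer")',
    )
    parser.add_argument(
        "--auto-start",
        action="store_true",
        help="Automatically start voice recognition on startup",
    )
    parser.add_argument(
        "--push-to-talk",
        action="store_true",
        help="Use push-to-talk mode instead of a wake word",
    )
    return parser
```

Calling `build_parser().parse_args(["--push-to-talk"])` yields a namespace with `push_to_talk` set and the wake word left at its documented default.
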
+    ### Available Commands
+    Once the CLI is running, you can use these text commands:
+    - `start`: Start voice recognition
+    - `stop`: Stop voice recognition
+    - `status`: Check if voice recognition is active
+    - `history`: Show voice command history
+    - `help`: Show available commands
+    - `exit`: Exit the application
+    ### Using Voice Commands
+    When voice recognition is active:
+    1. Say the wake word followed by your command:
+       - "Computer, take a screenshot"
+       - "Computer, click at 500 300"
+       - "Computer, open Chrome"
+    2. To stop listening:
+       - "Computer, stop listening"
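
One way to separate the wake word from the command in a transcript such as "Computer, take a screenshot" is sketched below. The `extract_command` function is an illustrative assumption, not code from this PR; it assumes matching is case-insensitive and that the wake word must be the first word of the utterance.

```python
import re
from typing import Optional


def extract_command(transcript: str, wake_word: str = "computer") -> Optional[str]:
    """Return the text following the wake word, or None if it is absent.

    Hypothetical helper: assumes the wake word opens the utterance and is
    followed by optional punctuation, as in the documented examples.
    """
    # \b keeps "computers are great" from matching the wake word "computer".
    pattern = rf"^\s*{re.escape(wake_word)}\b[\s,.:;!?-]*(.*)$"
    match = re.match(pattern, transcript, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    # Drop trailing sentence punctuation that transcription may add.
    command = match.group(1).strip().rstrip(".!?")
    return command or None
```

Under these assumptions, `extract_command("Computer, stop listening.")` returns `"stop listening"`, while a transcript without the wake word returns `None` and can be ignored.
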
+    ### Push-to-Talk Mode
+    If you prefer not to use a wake word, you can use push-to-talk mode:
+    ```bash
+    python voice_cli.py --push-to-talk
+    ```
+    In this mode:
+    1. Press and hold Ctrl+Space to start recording
+    2. Speak your command
+    3. Release Ctrl+Space to process the command
+    ## Integration with CompUse
+    The voice command interface integrates with CompUse's existing tools:
+    - **GUI Tools**: Control mouse, keyboard, take screenshots, etc.
+    - **Browser Tools**: Control web browsers via Puppeteer
+    - **System Tools**: Interact with applications and system functions
+    All tools available in the CompUse CLI are accessible through voice commands.
+    ## Command History
+    The voice interface keeps track of all commands you've issued. To view your command history:
+    1. Type `history` in the CLI
+    2. The system will display a table with timestamps and commands
+    This is useful for:
+    - Reviewing what commands you've already tried
+    - Debugging recognition issues
+    - Keeping track of your workflow
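
The history table described above could be backed by something like the following. This is a sketch only: `CommandHistory` is a hypothetical class, and it renders plain text with the standard library, whereas the PR may well use `rich` (already listed in `requirements.txt`) for the table.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class CommandHistory:
    """Hypothetical in-memory log of recognized voice commands."""

    entries: List[Tuple[datetime, str]] = field(default_factory=list)

    def record(self, command: str, when: Optional[datetime] = None) -> None:
        # Default to the current time when no timestamp is supplied.
        self.entries.append((when or datetime.now(), command))

    def as_table(self) -> str:
        # Two-column plain-text table: timestamp, then command.
        header = f"{'Timestamp':<19}  Command"
        rows = [f"{ts:%Y-%m-%d %H:%M:%S}  {cmd}" for ts, cmd in self.entries]
        return "\n".join([header, "-" * len(header), *rows])
```

Each recognized command would be passed to `record`, and the `history` CLI command would print `as_table()`.
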
+    ## Customization
+    ### Changing the Wake Word
+    You can change the wake word in three ways:
+    1. Set the `COMPUSE_WAKE_WORD` environment variable
+    2. Use the `--wake-word` command-line argument
+    3. Edit the `.env` file
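
The document does not say which of these three sources wins when more than one is set. The sketch below assumes a common precedence (command-line argument, then process environment, then `.env` file, then the default `computer`); both the ordering and the `resolve_wake_word` helper are assumptions, not behavior confirmed by this PR.

```python
from typing import Mapping, Optional

DEFAULT_WAKE_WORD = "computer"


def resolve_wake_word(
    cli_value: Optional[str],
    environ: Mapping[str, str],
    dotenv: Mapping[str, str],
) -> str:
    """Pick the wake word from the first source that provides one.

    Assumed order: --wake-word, then the COMPUSE_WAKE_WORD environment
    variable, then the same key in the .env file, then the default.
    """
    candidates = (
        cli_value,
        environ.get("COMPUSE_WAKE_WORD"),
        dotenv.get("COMPUSE_WAKE_WORD"),
    )
    for candidate in candidates:
        # Skip unset and blank values.
        if candidate and candidate.strip():
            return candidate.strip()
    return DEFAULT_WAKE_WORD
```

For this to work with a command-line parser, the `--wake-word` option would need a default of `None` so that an omitted flag can be told apart from an explicit value.
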
+    ### Disabling Voice Feedback
+    Voice feedback can be disabled by setting the `feedback_enabled` parameter to `False` when initializing `VoiceCommandManager`.
+    ## Troubleshooting
+    ### Microphone Issues
+    If your microphone isn't being detected:
+    1. Check your system's microphone settings
+    2. Ensure your microphone is set as the default input device
+    3. Try running with administrator/sudo privileges
+    ### Recognition Accuracy
+    If voice recognition accuracy is poor:
+    1. Speak clearly and at a moderate pace
+    2. Reduce background noise
+    3. Use a better quality microphone
+    4. Consider using a different wake word that's more distinct
+    5. Try push-to-talk mode instead of wake word detection
+    ### API Key Issues
+    If you encounter API key errors:
+    1. Verify your API keys in the `.env` file
+    2. Check that you have sufficient credits/quota for the services
+    3. Ensure your network can reach the API endpoints
+    ## Advanced Configuration
+    For advanced users, the `VoiceCommandManager` class accepts several configuration options:
+    - `whisper_api_key`: OpenAI API key for Whisper STT
+    - `elevenlabs_api_key`: ElevenLabs API key for TTS
+    - `elevenlabs_voice_id`: ElevenLabs voice ID for TTS
+    - `wake_word`: Wake word to activate voice listening
+    - `feedback_enabled`: Whether to provide audio feedback
+    These can be customized when initializing the manager in your code.
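
The options listed above line up with the `.env` keys from the installation section, so they could be collected in one place before the manager is constructed. The sketch below is illustrative: `VoiceSettings` is a hypothetical class, and treating `OPENAI_API_KEY` as required and deriving `feedback_enabled` from the presence of an ElevenLabs key are assumptions rather than behavior shown in this PR.

```python
import os
from dataclasses import asdict, dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical container mirroring the documented manager options."""

    whisper_api_key: str
    elevenlabs_api_key: Optional[str] = None
    elevenlabs_voice_id: Optional[str] = None
    wake_word: str = "computer"
    feedback_enabled: bool = True

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "VoiceSettings":
        env = os.environ if env is None else env
        openai_key = env.get("OPENAI_API_KEY")
        if not openai_key:
            # Assumption: transcription cannot run without this key.
            raise ValueError("OPENAI_API_KEY is required for Whisper transcription")
        tts_key = env.get("ELEVENLABS_API_KEY") or None
        return cls(
            whisper_api_key=openai_key,
            elevenlabs_api_key=tts_key,
            elevenlabs_voice_id=env.get("ELEVENLABS_VOICE_ID") or None,
            wake_word=env.get("COMPUSE_WAKE_WORD", "computer"),
            # Assumption: spoken feedback only makes sense with a TTS key.
            feedback_enabled=tts_key is not None,
        )


# Possible usage, assuming VoiceCommandManager takes these as keyword arguments:
#   manager = VoiceCommandManager(**asdict(VoiceSettings.from_env()))
```

Keeping the settings in a frozen dataclass makes a missing key fail early, before any audio pipeline is started.
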

requirements.txt

@@ -11,3 +11,8 @@ pytest-asyncio
     rich>=13.3.5
     prompt_toolkit
     pyobjc-framework-Cocoa>=8.5; platform_system == "Darwin"
+    # Voice command dependencies
+    pipecat-ai>=0.1.0
+    pipecat-ai[whisper]
+    pipecat-ai[silero]
+    aiohttp>=3.8.5
