Automate desktop interactions using AI vision & language capabilities.
LLM Desktop Automation is a Python framework that lets you control your desktop UI with natural language prompts. It uses the Gemini 2.5 Vision API to visually understand your desktop, interpret your commands, and perform mouse clicks driven by AI reasoning.
- Vision-powered recognition of desktop icons and windows
- Smart mouse movement & clicks via `pyautogui`
- Natural language instructions powered by the Gemini 2.5 LLM
- Built-in safety prompts before high-impact actions
- Pythonic and readable code, ready to extend
- User types a command, e.g. `Open Recycle Bin`
- A screenshot of the desktop is taken
- Screenshot is sent to Gemini 2.5 Vision for detection of clickable elements
- Structured screen analysis (elements + coordinates) is returned
- LLM matches user instruction to visual data, selects the target
- Python script uses `pyautogui` to move the mouse and click the target
- (Optional) User is prompted to confirm risky actions
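The matching step (instruction → detected element) can be sketched as a simple scoring function over the structured screen analysis. The element schema and scoring rule below are illustrative assumptions, not the project's actual format:

```python
def pick_target(instruction, elements):
    """Pick the element whose label best overlaps the instruction words.

    `elements` is assumed to resemble the structured analysis returned by
    the vision step: [{"label": "...", "x": ..., "y": ...}, ...]
    (hypothetical schema). Returns the best match, or None if nothing overlaps.
    """
    words = set(instruction.lower().split())
    best, best_score = None, 0
    for element in elements:
        score = len(words & set(element["label"].lower().split()))
        if score > best_score:
            best, best_score = element, score
    return best

elements = [
    {"label": "Google Chrome", "x": 40, "y": 60},
    {"label": "Recycle Bin", "x": 40, "y": 140},
]
target = pick_target("Open Recycle Bin", elements)
# `target` holds the Recycle Bin entry; pyautogui would then click its (x, y)
```

A real implementation would let the LLM do this matching, but a fallback like this keeps the pipeline testable without API calls.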
```bash
# Clone the repository
git clone https://github.com/dalijon-byte/LLM-ComputerUse.git
cd LLM-ComputerUse

# Install Python dependencies (Python 3.8+ recommended)
pip install -r requirements.txt
```

- Set up Gemini API key:
- Sign up for Gemini API and get your key: Get API Key
- Create a `.env` file in your project root:

```
GEMINI_API_KEY=your_gemini_2.5_vision_api_key_here
```
- Adjust permissions: This script requires permission to capture your screen and control your mouse.
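Reading the key back at startup can be sketched as follows. The helper name is illustrative; it follows the standard `python-dotenv` pattern but degrades gracefully if the package is missing:

```python
import os

def load_api_key():
    """Read GEMINI_API_KEY from the environment, loading .env if python-dotenv is available."""
    try:
        from dotenv import load_dotenv  # optional: pulls .env entries into os.environ
        load_dotenv()
    except ImportError:
        pass  # fall back to the plain environment
    key = os.environ.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; create a .env file or export it.")
    return key

# genai.configure(api_key=load_api_key())  # then configure the Gemini client
```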
```bash
python desktop_automation.py
```

When prompted, type commands like:

```
Click the Chrome icon
Open Notepad
Open the Recycle Bin
```
```python
import os
import pyautogui
from PIL import ImageGrab
import google.generativeai as genai

# ... set up Gemini as in the documentation
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

screenshot = ImageGrab.grab()

# Send the screenshot and prompt to Gemini, then
# parse the result to get the target coordinates
x, y = 100, 200  # placeholder: coordinates come from the parsed response
pyautogui.moveTo(x, y)
pyautogui.click()
```

This project now supports two different approaches to desktop automation:
- Direct Coordinate-Based Automation (original approach)
- Template Matching-Based Automation (new approach)
The template-based approach provides greater reliability across different screen resolutions and window positions:
- Screen Analysis: Capture the screen and send to Gemini 2.5 Vision
- Element Extraction: Gemini identifies UI elements and returns their bounding boxes
- Template Creation: Small images of each UI element are cropped and saved
- Template Matching: When actions are needed, PyAutoGUI looks for these templates on screen
- Action Execution: Once found, the system can click, type, drag, etc.
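The template-creation step (cropping each detected element out of the screenshot) can be sketched with Pillow. The function name, the `templates/` output directory, and the `(left, top, right, bottom)` box convention are assumptions for illustration:

```python
import os
from PIL import Image

def crop_template(screenshot, box, name, out_dir="templates"):
    """Crop one UI element from a screenshot and save it as a template image.

    `box` is the (left, top, right, bottom) bounding box reported by the
    vision step; coordinates are clamped to the screenshot bounds so a
    slightly-off box from the model still produces a valid crop.
    """
    w, h = screenshot.size
    left, top, right, bottom = box
    left, top = max(0, left), max(0, top)
    right, bottom = min(w, right), min(h, bottom)
    template = screenshot.crop((left, top, right, bottom))
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{name}.png")
    template.save(path)
    return path
```

Later, `pyautogui.locateCenterOnScreen(path)` can search for the saved template regardless of where the window has moved.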
The system now supports these advanced interactions:
- `click(start_box='[x1, y1, x2, y2]')` - Single left click
- `left_double(start_box='[x1, y1, x2, y2]')` - Double left click
- `right_single(start_box='[x1, y1, x2, y2]')` - Single right click
- `drag(start_box='[x1, y1, x2, y2]', end_box='[x3, y3, x4, y4]')` - Drag and drop
- `hotkey(key='ctrl+c')` - Press keyboard shortcut
- `type(content='Hello world\n')` - Type text (use `'\n'` for Enter)
- `scroll(start_box='[x1, y1, x2, y2]', direction='down')` - Scroll in specified direction
- `wait()` - Pause for 5 seconds
- `finished()` - Mark task as complete
- `call_user()` - Request human assistance
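Executing these actions comes down to parsing the box string and dispatching to the matching `pyautogui` call. This is a sketch, not the project's actual dispatcher; the function names here are illustrative, and `pyautogui` is imported lazily so the parsing logic works without a display:

```python
import ast

def box_center(box_str):
    """Turn a '[x1, y1, x2, y2]' box string into its center point."""
    x1, y1, x2, y2 = ast.literal_eval(box_str)
    return (x1 + x2) // 2, (y1 + y2) // 2

def execute(action, **kwargs):
    """Dispatch one parsed action to pyautogui."""
    import pyautogui
    if action == "click":
        pyautogui.click(*box_center(kwargs["start_box"]))
    elif action == "left_double":
        pyautogui.doubleClick(*box_center(kwargs["start_box"]))
    elif action == "right_single":
        pyautogui.rightClick(*box_center(kwargs["start_box"]))
    elif action == "drag":
        pyautogui.moveTo(*box_center(kwargs["start_box"]))
        pyautogui.dragTo(*box_center(kwargs["end_box"]))
    elif action == "hotkey":
        pyautogui.hotkey(*kwargs["key"].split("+"))
    elif action == "type":
        pyautogui.write(kwargs["content"])
```

For example, `execute("click", start_box='[10, 20, 30, 40]')` clicks at (20, 30).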
```bash
# Template-based automation
python template_automation.py

# When prompted:
# "What would you like me to do?"

# Try these commands:
"Open Google Chrome"
"Create a new text document"
"Move the calculator to the right side of the screen"
```

Warning: Running this code gives the AI limited control of your mouse and keyboard. Only use it in safe, controlled environments. Carefully review all actions before confirming.
```
LLM-ComputerUse/
├── README.md
├── desktop_automation.py (original approach)
├── template_automation.py (new template-based approach)
├── requirements.txt
├── templates/ (directory for extracted templates)
└── utils/
    ├── __init__.py
    ├── screen_capture.py
    ├── element_extraction.py
    └── action_execution.py
```
- `python-dotenv`
- `google-generativeai`
- `pyautogui`
- `pillow` (PIL)
You can add:
- Voice command input via `SpeechRecognition`
- Better vision models with OpenCV, YOLO, or LLaVA
- Self-verification for critical clicks
- Errors about permissions? See your OS's privacy/accessibility settings for screen and input control
- Mouse not clicking where expected? Check your display scaling and resolution settings
- Gemini errors? Ensure your API key is correct and you have remaining quota
Here are some additional enhancements you could consider:
- Template Database: Store templates with metadata for reuse across sessions
- Visual Feedback: Show bounding boxes on detected elements for user verification
- Action Sequences: Record and replay multiple actions as macros
- Error Recovery: Implement retry mechanisms when template matching fails
- Context Awareness: Maintain a model of desktop state between actions
- Voice Integration: Add voice control capabilities for hands-free operation
- Application-Specific Templates: Pre-train on common applications like Office, browsers
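The "Error Recovery" idea above can be sketched as a small retry wrapper around any lookup function. The helper name and defaults are assumptions, not part of the project:

```python
import time

def with_retries(find, attempts=3, delay=1.0):
    """Call `find` (e.g. a template-matching lookup) until it returns a
    non-None result, sleeping between attempts. Returns None if all fail."""
    for i in range(attempts):
        result = find()
        if result is not None:
            return result
        if i < attempts - 1:
            time.sleep(delay)
    return None
```

Wired up, this might look like `with_retries(lambda: pyautogui.locateCenterOnScreen("templates/chrome.png", confidence=0.8))`, giving the UI time to settle before giving up.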
- Confidence Parameter: PyAutoGUI's `locateCenterOnScreen` function has a `confidence` parameter (requires OpenCV). Start with 0.8 and adjust based on reliability.
- Template Size: Smaller templates may be less distinctive but find more matches. Larger templates are more specific but might fail with small UI changes.
- Error Handling: Template matching can fail for many reasons - implement good error handling and fallback mechanisms.
- Security Considerations: Continue to prioritize safety checks, especially with broader action capabilities.
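One way to apply the confidence tip is to try a ladder of decreasing confidence values. The function below is an illustrative sketch: it takes the locate step as a callable so the fallback logic stays independent of the screen:

```python
def locate_with_fallback(locate, confidences=(0.9, 0.8, 0.7)):
    """Try a locate callable (e.g. a wrapper around
    pyautogui.locateCenterOnScreen) at decreasing confidence levels.

    Returns (point, confidence_used) on success, or (None, None) if no
    level produced a match.
    """
    for conf in confidences:
        point = locate(confidence=conf)
        if point is not None:
            return point, conf
    return None, None
```

In practice, `locate` could be `lambda confidence: pyautogui.locateCenterOnScreen("templates/icon.png", confidence=confidence)`; remember that the `confidence` argument requires OpenCV to be installed.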
Is this production ready?
No – it is a research/prototype tool. Use in controlled environments only.
Can I use other LLMs?
Yes, with API adjustments, but Gemini 2.5 Vision is recommended for the best multi-modal performance.
Can it close popups or interact with notifications?
If they appear in the screenshot, and are visually distinct, yes.
This project is MIT Licensed. Copyright © 2025 Dalibor JONIC, MSc
MIT License
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Powered by Gemini 2.5 Vision · Built for research and innovation.
