LLM Desktop Automation with Gemini 2.5 Vision

Automate desktop interactions using AI vision & language capabilities.

Overview

LLM Desktop Automation is a Python framework that lets you control your desktop UI with natural language prompts. It uses the Gemini 2.5 Vision API to visually understand your desktop, interpret your commands, and perform mouse clicks guided by AI reasoning.

LLM Desktop Automation Demo

Features

  • Vision-powered recognition of desktop icons and windows
  • Smart mouse movement & click via pyautogui
  • Natural language instructions powered by Gemini 2.5 LLM
  • Built-in safety prompts before high-impact actions
  • Pythonic and readable code, ready to extend

How It Works

  1. User types a command: Open Recycle Bin
  2. Screenshot of desktop is taken
  3. Screenshot is sent to Gemini 2.5 Vision for detection of clickable elements
  4. Structured screen analysis (elements + coordinates) is returned
  5. LLM matches user instruction to visual data, selects the target
  6. Python script uses pyautogui to move mouse and click the target
  7. (Optional) User is prompted to confirm risky actions
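
A minimal sketch of steps 4 to 6, assuming Gemini returns a JSON list of detected elements with labels and bounding boxes (the response shape below is illustrative, not the actual API format):

import json
import pyautogui

# Hypothetical structured analysis returned by Gemini (illustrative shape only)
analysis = json.loads("""
[
  {"label": "Recycle Bin", "box": [24, 18, 88, 82]},
  {"label": "Google Chrome", "box": [24, 110, 88, 174]}
]
""")

instruction = "Open Recycle Bin"

# Step 5: match the instruction to a detected element (naive substring match here;
# in the project the LLM itself performs this matching)
target = next(e for e in analysis if e["label"].lower() in instruction.lower())

# Step 6: move the mouse to the center of the matched bounding box and click
x1, y1, x2, y2 = target["box"]
pyautogui.moveTo((x1 + x2) / 2, (y1 + y2) / 2)
pyautogui.click()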

Installation

# Clone the repository
git clone https://github.com/dalijon-byte/LLM-ComputerUse.git
cd LLM-ComputerUse

# Install Python dependencies (Python 3.8+ recommended)
pip install -r requirements.txt

Configuration

  1. Set up Gemini API key:
    • Sign up for the Gemini API and obtain an API key
    • Create a .env file in your project root:
      GEMINI_API_KEY=your_gemini_2.5_vision_api_key_here
  2. Adjust permissions: This script requires permission to capture your screen and control your mouse.
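
Once the .env file is in place, a quick configuration check might look like the following sketch (the model name is a placeholder; use whichever Gemini 2.5 vision-capable model your account provides):

import os
from dotenv import load_dotenv
import google.generativeai as genai

# Load GEMINI_API_KEY from the .env file in the project root
load_dotenv()

api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise RuntimeError("GEMINI_API_KEY is not set; check your .env file")

genai.configure(api_key=api_key)

# Placeholder model name; substitute the Gemini 2.5 vision-capable model you use
model = genai.GenerativeModel("gemini-2.5-flash")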

Usage Example

python desktop_automation.py

When prompted, type commands like:

  • Click the Chrome icon
  • Open Notepad
  • Open the Recycle Bin

Minimal Example

A minimal sketch of the full pipeline (the model name, prompt, and coordinate parsing are illustrative; adapt them to your setup):

import os
import pyautogui
from PIL import ImageGrab
import google.generativeai as genai

# Configure Gemini as in the Configuration section above
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
model = genai.GenerativeModel("gemini-2.5-flash")  # placeholder model name

# Capture the desktop and send it to Gemini together with the instruction
screenshot = ImageGrab.grab()
prompt = "Locate the Recycle Bin icon and reply with only its coordinates as 'x, y'."
response = model.generate_content([prompt, screenshot])

# Parse the reply (assumes the model answered exactly in the requested 'x, y' format)
x, y = map(int, response.text.strip().split(","))

# Move the mouse and click the target
pyautogui.moveTo(x, y)
pyautogui.click()

Advanced Template-Based Workflow

This project now supports two different approaches to desktop automation:

  1. Direct Coordinate-Based Automation (original approach)
  2. Template Matching-Based Automation (new approach)

Template Matching Workflow

The template-based approach provides greater reliability across different screen resolutions and window positions:

  1. Screen Analysis: Capture the screen and send to Gemini 2.5 Vision
  2. Element Extraction: Gemini identifies UI elements and returns their bounding boxes
  3. Template Creation: Small images of each UI element are cropped and saved
  4. Template Matching: When actions are needed, PyAutoGUI looks for these templates on screen
  5. Action Execution: Once found, the system can click, type, drag, etc.
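
A simplified sketch of steps 3 to 5, assuming the templates/ directory holds element crops saved by the extraction step (file names and coordinates below are illustrative):

import pyautogui
from PIL import ImageGrab

def save_template(box, path):
    """Crop a UI element from the current screen and save it as a template image."""
    # box is (left, top, right, bottom) in screen coordinates
    ImageGrab.grab().crop(box).save(path)

def click_template(path, confidence=0.8):
    """Locate a saved template on screen and click its center."""
    # The confidence parameter requires OpenCV; older PyAutoGUI versions return None
    # when the template is not found, newer ones raise ImageNotFoundException.
    center = pyautogui.locateCenterOnScreen(path, confidence=confidence)
    if center is None:
        raise RuntimeError(f"Template not found on screen: {path}")
    pyautogui.click(center)

# Illustrative usage: crop a detected element once, then click it later by template
save_template((24, 18, 88, 82), "templates/recycle_bin.png")
click_template("templates/recycle_bin.png")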

Advanced Features

The system now supports these advanced interactions:

  • click(start_box='[x1, y1, x2, y2]') - Single left click
  • left_double(start_box='[x1, y1, x2, y2]') - Double left click
  • right_single(start_box='[x1, y1, x2, y2]') - Single right click
  • drag(start_box='[x1, y1, x2, y2]', end_box='[x3, y3, x4, y4]') - Drag and drop
  • hotkey(key='ctrl+c') - Press keyboard shortcuts
  • type(content='Hello world\n') - Type text (use '\n' for Enter)
  • scroll(start_box='[x1, y1, x2, y2]', direction='down') - Scroll in specified direction
  • wait() - Pause for 5 seconds
  • finished() - Mark task as complete
  • call_user() - Request human assistance
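
A rough sketch of how a few of these actions could map onto PyAutoGUI calls (parsing the action string out of the LLM output is omitted, and box coordinates are assumed to already be pixel values):

import time
import pyautogui

def center_of(box):
    """Return the center point of an [x1, y1, x2, y2] bounding box."""
    x1, y1, x2, y2 = box
    return (x1 + x2) // 2, (y1 + y2) // 2

def execute(action, **kwargs):
    """Dispatch a parsed action name and its arguments to PyAutoGUI."""
    if action == "click":
        pyautogui.click(*center_of(kwargs["start_box"]))
    elif action == "left_double":
        pyautogui.doubleClick(*center_of(kwargs["start_box"]))
    elif action == "right_single":
        pyautogui.rightClick(*center_of(kwargs["start_box"]))
    elif action == "drag":
        pyautogui.moveTo(*center_of(kwargs["start_box"]))
        pyautogui.dragTo(*center_of(kwargs["end_box"]), duration=0.5)
    elif action == "hotkey":
        pyautogui.hotkey(*kwargs["key"].split("+"))
    elif action == "type":
        pyautogui.write(kwargs["content"], interval=0.02)  # '\n' presses Enter
    elif action == "scroll":
        pyautogui.scroll(-500 if kwargs["direction"] == "down" else 500)
    elif action == "wait":
        time.sleep(5)

# Illustrative usage
execute("click", start_box=[100, 200, 140, 240])
execute("hotkey", key="ctrl+c")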

Usage Example

# Template-based automation
python template_automation.py

# When prompted:
# "What would you like me to do?"

# Try these commands:
"Open Google Chrome"
"Create a new text document"
"Move the calculator to the right side of the screen"

Security Warning

Warning: Running this code gives the AI limited control of your mouse and keyboard. Only use in safe, controlled environments. Carefully review all actions before confirming.

Project Structure

LLM-ComputerUse/
├── README.md
├── desktop_automation.py (original approach)
├── template_automation.py (new template-based approach)
├── requirements.txt
├── templates/ (directory for extracted templates)
└── utils/
    ├── __init__.py
    ├── screen_capture.py
    ├── element_extraction.py
    └── action_execution.py

Dependencies

  • python-dotenv
  • google-generativeai
  • pyautogui
  • pillow (PIL)

Extending

You can add:

  • Voice command input via SpeechRecognition
  • Better vision models with OpenCV, YOLO, or LLaVA
  • Self-verification for critical clicks

Troubleshooting

  • Errors about permissions? See your OS's privacy/accessibility settings for screen and input control
  • Mouse not clicking where expected? Check your display scaling and resolution settings (see the sketch after this list)
  • Gemini errors? Ensure API key is correct and you have quota
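
If clicks land in the wrong place, a common cause is OS display scaling, where the screenshot resolution differs from PyAutoGUI's logical screen size. A quick check and correction sketch (the rescaling assumes coordinates were produced in screenshot pixels):

import pyautogui
from PIL import ImageGrab

# Compare the screenshot size with PyAutoGUI's logical screen size
shot_w, shot_h = ImageGrab.grab().size
screen_w, screen_h = pyautogui.size()
print(f"screenshot: {shot_w}x{shot_h}, pyautogui: {screen_w}x{screen_h}")

# If they differ (e.g. 150% display scaling), rescale screenshot coordinates before clicking
def to_screen(x, y):
    return x * screen_w / shot_w, y * screen_h / shot_h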

Potential Future Enhancements

Here are some additional enhancements you could consider:

  1. Template Database: Store templates with metadata for reuse across sessions
  2. Visual Feedback: Show bounding boxes on detected elements for user verification
  3. Action Sequences: Record and replay multiple actions as macros
  4. Error Recovery: Implement retry mechanisms when template matching fails
  5. Context Awareness: Maintain a model of desktop state between actions
  6. Voice Integration: Add voice control capabilities for hands-free operation
  7. Application-Specific Templates: Pre-train on common applications like Office, browsers

Implementation Tips

  1. Confidence Parameter: PyAutoGUI's locateCenterOnScreen function has a confidence parameter (requires OpenCV). Start with 0.8 and adjust based on reliability.

  2. Template Size: Smaller templates are less distinctive and can produce false matches; larger templates are more specific but may fail after small UI changes.

  3. Error Handling: Template matching can fail for many reasons, so implement good error handling and fallback mechanisms (see the retry sketch after these tips).

  4. Security Considerations: Continue to prioritize safety checks, especially with broader action capabilities.
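
A small sketch combining tips 1 and 3: retrying template matching with progressively lower confidence thresholds (the thresholds, attempt counts, and delays are illustrative, not tuned values):

import time
import pyautogui

def locate_with_retry(template_path, confidences=(0.9, 0.8, 0.7), attempts=3, delay=1.0):
    """Try to find a template on screen, retrying with progressively lower confidence."""
    for _ in range(attempts):
        for conf in confidences:
            try:
                # The confidence parameter requires OpenCV (opencv-python)
                center = pyautogui.locateCenterOnScreen(template_path, confidence=conf)
            except pyautogui.ImageNotFoundException:
                center = None  # newer PyAutoGUI versions raise instead of returning None
            if center is not None:
                return center
        time.sleep(delay)  # give the UI time to settle before the next attempt
    return None  # caller decides on a fallback, e.g. re-running the Gemini screen analysis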

FAQ

Is this production ready?

No – it is a research/prototype tool. Use in controlled environments only.

Can I use other LLMs?

Yes, with API adjustments, but Gemini 2.5 Vision is recommended for the best multimodal performance.

Can it close popups or interact with notifications?

Yes, as long as they appear in the screenshot and are visually distinct.

License

This project is MIT Licensed. Copyright © 2025 Dalibor JONIC, MSc

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Powered by Gemini 2.5 Vision · Built for research and innovation.
