A cross-platform PC automation agent that uses screenshots and LLM to control your computer via natural language instructions.
- Cross-platform: Works on macOS and Linux (X11 and Wayland)
- Multi-backend LLM support: OpenAI-compatible API (Ollama, vLLM, etc.), Codex CLI, Claude CLI
- Visual understanding: Takes screenshots and sends to vision-capable LLMs
- Platform-aware control:
- macOS:
cliclick+osascript - Linux X11:
xdotool+wmctrl - Linux Wayland:
ydotool+grim
- macOS:
- Multi-command execution: 1-5 commands per LLM call for efficiency
- Action history: JSON-structured history provides context across iterations
- Notes storage: Agent can store and retrieve information (URLs, keys, etc.) across iterations
- Interactive model selection: Prompts for model if not specified
- Safety validation: Commands are validated before execution
# Required
brew install cliclick# Input tool
sudo apt install xdotool
# Screenshot tool (one of)
sudo apt install scrot # Recommended
sudo apt install gnome-screenshot
# Optional
sudo apt install wmctrl # Window management
sudo apt install xclip # Clipboard# Input tool (requires ydotoold service)
sudo apt install ydotool
sudo systemctl enable --now ydotool
# Screenshot tool
sudo apt install gnome-screenshot # Recommended for Wayland
sudo apt install grim # Alternative
# Optional
sudo apt install wl-clipboard # Clipboard (wl-copy/wl-paste)The default backend uses any OpenAI-compatible API server (Ollama, vLLM, etc.)
# Example: Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull a vision model
ollama pull llava
# or
ollama pull moondream
ollama pull minicpm-v# For codex backend
npm install -g @openai/codex
# For claude backend
npm install -g @anthropic-ai/claude-codegit clone git@github.com:gsportelli/agentkvm.git
cd agentkvm
chmod +x agent.py
# Check dependencies
./agent.py --check-deps# Using OpenAI-compatible API (will prompt for model selection)
./agent.py "Open Firefox and search for weather"
# Specify model directly
./agent.py --model llava "Open browser and go to google.com"
# Connect to remote API server
./agent.py --host 192.168.1.100 --port 11434 --model llava "Open terminal"
# Using Claude CLI
./agent.py -b claude "Open Safari and search for weather"
# Using Codex
./agent.py -b codex "Click the Settings icon"
# With verbose output
./agent.py -v --model llava "Open file manager"
# Reset history and start fresh
./agent.py -r --model moondream "Open browser"
# Check dependencies only
./agent.py --check-deps| Option | Description |
|---|---|
-b, --backend |
LLM backend: openai, codex, claude (default: openai) |
--host |
OpenAI API host (default: localhost, or OPENAI_API_HOST env) |
--port |
OpenAI API port (default: 11434, or OPENAI_API_PORT env) |
--model |
Model to use (will prompt if not specified) |
-m, --max-iter |
Maximum iterations (default: 50) |
-r, --reset |
Reset action history before starting |
-v, --verbose |
Verbose output |
-w, --workdir |
Working directory for output files (default: current directory) |
--check-deps |
Check dependencies and exit |
-h, --help |
Show help |
| Variable | Description |
|---|---|
OPENAI_API_HOST |
Default API host |
OPENAI_API_PORT |
Default API port |
- Detect Platform: Identifies macOS or Linux (X11/Wayland), checks for required tools
- Select Model: If no model specified, shows available vision models to choose from
- Screenshot: Captures screen using platform-appropriate tool
- Analyze: Sends screenshot + prompt + history + notes to LLM
- Parse: Extracts observation, reasoning, notes, and command(s) from response
- Store Notes: Saves any notes (key-value pairs) for future iterations
- Validate: Checks all commands for safety before execution
- Execute: Runs command sequence with delays (stops on first failure)
- Record: Saves action to history for context in next iteration
- Loop: Repeats until goal achieved or max iterations
The agent can execute 1-5 commands per LLM call to reduce API calls:
###CMD
ydotool mousemove -a 500 300
ydotool click 0xC0
ydotool type "hello world"
ydotool key enter
The agent can store information it discovers during execution for later use. This is useful for multi-step tasks where the agent needs to remember URLs, keys, credentials, or other data:
###NOTE
api_key: sk-abc123xyz
dashboard_url: https://example.com/dashboard
username: admin@example.com
Notes persist across iterations and are automatically included in the context shown to the LLM. The agent can reference stored notes in later iterations to use the information it gathered earlier.
cliclick m:x,y # Move mouse
cliclick c:x,y # Click at position
cliclick dc:x,y # Double-click
cliclick rc:x,y # Right-click
cliclick t:"text" # Type text
cliclick kp:enter # Press key
cliclick kp:cmd-l # Key comboxdotool mousemove x y # Move mouse
xdotool mousemove x y click 1 # Move and click
xdotool click 1 # Left click (1=left, 3=right)
xdotool type "text here" # Type text
xdotool key Return # Press key
xdotool key ctrl+l # Key combo
xdotool key super # Super keyydotool mousemove -a x y # Move mouse (absolute)
ydotool click 0xC0 # Left click
ydotool click 0xC1 # Right click
ydotool type "text here" # Type text
ydotool key enter # Press key
ydotool key ctrl+l # Key combo
ydotool key super # Super key- Uses
ydotoolinstead ofxdotool(works natively on Wayland) - Uses
grimfor screenshots - Requires
ydotooldservice running:sudo systemctl enable --now ydotool
If --model is not specified, the agent will:
- Connect to the API server
- List available models
- Prompt you to select one
Available models:
1. llava:latest
2. moondream:latest
3. minicpm-v:latest
Select model (1-3):
Commands are validated before execution:
Allowed commands:
- macOS:
cliclick,osascript - Linux (X11):
xdotool,wmctrl - Linux (Wayland):
ydotool,wmctrl
Blocked patterns:
- Shell metacharacters:
; && || >> \$()` - Dangerous commands:
rm,sudo,curl,wget,kill
| File | Description |
|---|---|
agent.py |
Main agent script (Python) |
Output files (written to working directory, configurable with --workdir):
| File | Description |
|---|---|
action_history.json |
Structured action history (JSON) |
action_history.txt |
Human-readable action history |
currentscreen.png |
Latest screenshot |
currentscreen.md |
Latest LLM response |
logs/ |
Execution logs |
past_screens/ |
Archived screenshots and responses |
# For Ollama, make sure it's running
ollama serve
# Check if it's accessible
curl http://localhost:11434/api/tags# Check if ydotoold is running
systemctl status ydotool
# Enable and start service
sudo systemctl enable --now ydotool
# You may need to add user to input group
sudo usermod -aG input $USER
# Then logout and login# Install gnome-screenshot for Wayland screenshots (recommended)
sudo apt install gnome-screenshotSystem Preferences > Security & Privacy > Privacy > Accessibility Add Terminal (or your terminal app) to the list.
MIT