linux-computer-use


Linux/X11 computer-use tools for AI agents — Pi, Claude Code, OpenCode, and any MCP-aware client. AT-SPI + xdotool, ~1k LOC.


A Linux port of @injaneity/pi-computer-use. One bridge, three frontends: Pi, Claude Code, and OpenCode, plus any other MCP client.

The macOS original uses Apple's Accessibility API + AppleScript + ScreenCaptureKit (~6,800 lines of Swift + TS). This port replaces the entire native layer with AT-SPI 2 + xdotool + scrot, ships a single ~470-line Python bridge, and trims the tool surface from 15 → 8 to keep prompts cheap.

| | upstream macOS | this port |
|---|---|---|
| Total LOC | ~6,866 | ~1,200 (-83%) |
| Tools registered | ~15 | 8 |
| Native helper | 2,065 lines Swift | 471 lines Python |
| Runtime deps | Swift toolchain, codesign | python3-gi, xdotool, wmctrl, scrot |
| Frontends | macOS only | Pi · Claude Code · OpenCode · any MCP client |

System dependencies (all installs)

# Debian/Ubuntu
sudo apt-get install -y python3 python3-gi gir1.2-atspi-2.0 xdotool wmctrl scrot

# Enable AT-SPI on the desktop session (GNOME)
gsettings set org.gnome.desktop.interface toolkit-accessibility true

X11 only — Wayland sessions cannot capture other-app windows or synthesize input via xdotool. Run a GNOME-on-Xorg, KDE-on-X11, or XFCE session.
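
Before installing a frontend, it can help to confirm that the session meets the requirements above. The following is a minimal sketch, not part of this repository: the function name `session_problems` is hypothetical, and it assumes the login manager sets `XDG_SESSION_TYPE` (as systemd-logind sessions normally do). It checks only the three command-line tools from the apt-get line, not the Python GObject bindings.

```python
import shutil

# Executables named in the apt-get line above. The Python bindings
# (python3-gi, gir1.2-atspi-2.0) are not checked by this sketch.
REQUIRED_TOOLS = ("xdotool", "wmctrl", "scrot")


def session_problems(env, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK.

    `env` is a mapping such as os.environ. `which` is injectable so the
    check can be exercised without touching the real PATH.
    """
    problems = []
    # Assumption: XDG_SESSION_TYPE is "x11" or "wayland" when set.
    if env.get("XDG_SESSION_TYPE", "").lower() == "wayland":
        problems.append("Wayland session detected; log in to an X11 session")
    if not env.get("DISPLAY"):
        problems.append("DISPLAY is not set; no X server to talk to")
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"missing executable: {tool}")
    return problems
```

Calling `session_problems(os.environ)` from a terminal inside the desktop session returns an empty list when nothing obvious is wrong. It does not prove that AT-SPI itself is enabled.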

Install

Option 1 — Pi (mariozechner/pi-coding-agent)

pi install git:github.com/tak-uukti/linux-computer-use@v0.2.0

The postinstall script writes a small bash wrapper to ~/.pi/agent/helpers/linux-computer-use/bridge that execs python3 bridge/bridge.py. No build step, no codesign, no native compile.

In a Pi session, call screenshot first — it picks the focused window, returns AT-SPI refs (@e1, @e2, …) plus a PNG, then you can click({ref:"@e3"}), set_text({ref:"@e2", text:"…"}), etc.

Option 2 — Claude Code (MCP)

Installable as an MCP server straight from GitHub via uvx (no clone, no manual venv):

claude mcp add linux-computer-use -- uvx --from git+https://github.com/tak-uukti/linux-computer-use linux-computer-use-mcp

Or, equivalently, drop this into your Claude Code MCP config file (~/.claude.json under mcpServers):

{
  "mcpServers": {
    "linux-computer-use": {
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/tak-uukti/linux-computer-use",
        "linux-computer-use-mcp"
      ]
    }
  }
}

Restart Claude Code; the 8 tools (list_windows, screenshot, click, type_text, set_text, keypress, scroll, computer_actions) appear under the linux-computer-use namespace.
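
If the config file already lists other MCP servers, the entry has to be merged in rather than pasted over the whole file. A small sketch of that merge, operating on an already-parsed config dict: the helper name `with_server` and the `other-server` entry are hypothetical, and reading or writing `~/.claude.json` is deliberately left out.

```python
import json

SERVER_NAME = "linux-computer-use"
SERVER_ENTRY = {
    "command": "uvx",
    "args": [
        "--from",
        "git+https://github.com/tak-uukti/linux-computer-use",
        "linux-computer-use-mcp",
    ],
}


def with_server(config):
    """Return a copy of `config` with the server added under mcpServers.

    Existing servers and unrelated top-level keys are preserved.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[SERVER_NAME] = SERVER_ENTRY
    merged["mcpServers"] = servers
    return merged


# Hypothetical pre-existing config with one unrelated server.
existing = {"mcpServers": {"other": {"command": "other-server"}}}
print(json.dumps(with_server(existing), indent=2))
```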

Option 3 — OpenCode (MCP)

Add to ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "mcp": {
    "linux-computer-use": {
      "type": "local",
      "command": [
        "uvx",
        "--from",
        "git+https://github.com/tak-uukti/linux-computer-use",
        "linux-computer-use-mcp"
      ],
      "enabled": true
    }
  }
}

Restart OpenCode and the tools become available to the agent.

Tools

8 total. Schemas are deliberately terse — see extensions/computer-use.ts (Pi) or mcp_server/server.py (MCP).

| name | purpose |
|---|---|
| list_windows | enumerate visible X11 windows; returns @wN, title, pid, geometry, focus state |
| screenshot | focus a window, capture PNG, walk AT-SPI tree → @eN targets with role / name / bounds / capabilities |
| click | click @eN, @wN, or x,y; supports button and clickCount |
| type_text | xdotool-type literal text at the cursor |
| set_text | replace value of an @eN text/entry via AT-SPI EditableText (falls back to focus + Ctrl+A + type) |
| keypress | press keys/chords, e.g. ["Return"], ["Ctrl","A"], ["ctrl+l","Return"] |
| scroll | scroll at ref/coords by pixel delta |
| computer_actions | batch up to 20 actions in a single call |
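
The click tool accepts three kinds of target: an element ref, a window ref, or coordinates. The sketch below shows one way to tell them apart. It is an illustration, not the bridge's code: `Target` and `parse_target` are hypothetical names, and encoding coordinates as a single `"x,y"` string is an assumption (the real schemas may pass x and y as separate fields).

```python
import re
from typing import NamedTuple, Optional, Tuple


class Target(NamedTuple):
    kind: str                                # "element", "window", or "point"
    index: Optional[int] = None              # N in @eN / @wN
    point: Optional[Tuple[int, int]] = None  # (x, y) for coordinate targets


_REF = re.compile(r"^@([ew])(\d+)$")
_POINT = re.compile(r"^\s*(-?\d+)\s*,\s*(-?\d+)\s*$")


def parse_target(text: str) -> Target:
    """Classify a click target string; raise ValueError if unrecognized."""
    m = _REF.match(text)
    if m:
        kind = "element" if m.group(1) == "e" else "window"
        return Target(kind, index=int(m.group(2)))
    m = _POINT.match(text)
    if m:
        return Target("point", point=(int(m.group(1)), int(m.group(2))))
    raise ValueError(f"not a ref or x,y pair: {text!r}")
```

For example, `parse_target("@e3")` yields an element target with index 3, while `parse_target("120, 340")` yields a point target.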

Architecture

┌──────────────────────────────────────────────┐
│  Pi          Claude Code        OpenCode     │
└──────┬─────────────┬──────────────────┬──────┘
       │             │                  │
       │ extension   │ MCP stdio        │ MCP stdio
       ▼             ▼                  ▼
┌──────────────┐  ┌────────────────────────────┐
│ extensions/  │  │ mcp_server/server.py       │
│ computer-    │  │ FastMCP wrapper (8 tools)  │
│ use.ts       │  └─────────────┬──────────────┘
└──────┬───────┘                │
       │                        │
       ▼                        ▼
┌──────────────────────────────────────────────┐
│ bridge/bridge.py  newline-JSON over stdio    │
│ AT-SPI walk · xdotool · wmctrl · scrot       │
└──────────────────────────────────────────────┘
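
The bridge speaks newline-delimited JSON over stdio: one JSON object per line in each direction. A minimal client-side sketch of that framing follows. The request shape (`id` plus `cmd`) matches the sanity-check command in the Development section; the assumption that each response echoes the request `id` is mine, and the helper names are hypothetical.

```python
import json


def encode_request(req_id, cmd, **args):
    """Serialize one request as a single JSON line."""
    payload = {"id": req_id, "cmd": cmd, **args}
    # Compact separators keep each message on one short line.
    return json.dumps(payload, separators=(",", ":")) + "\n"


def decode_line(line):
    """Parse one line read from the bridge; blank lines yield None."""
    line = line.strip()
    return json.loads(line) if line else None


def match_response(lines, req_id):
    """Return the first decoded message whose id equals req_id, else None.

    Assumes responses are JSON objects that echo the request id.
    """
    for line in lines:
        msg = decode_line(line)
        if msg is not None and msg.get("id") == req_id:
            return msg
    return None
```

In practice `lines` would be the stdout of a `subprocess.Popen` running `python3 bridge/bridge.py`, and the string returned by `encode_request` would be written to its stdin.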

The AT-SPI walker is depth-capped (12) and element-capped (200) to keep prompts lean. Element bounds use SCREEN coords with a fallback to WINDOW coords + window offset (necessary for GTK4 / Xwayland which report SCREEN as 0,0).
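
The two caps and the bounds fallback can be sketched as follows. This is an illustration under stated assumptions rather than the bridge's implementation: nodes are plain dicts standing in for AT-SPI accessibles, each carrying precomputed `screen` and `window` extents as `(x, y, w, h)`, and the interpretation that a node at depth 12 is kept but its children are not is a guess.

```python
MAX_DEPTH = 12      # depth cap quoted above
MAX_ELEMENTS = 200  # element cap quoted above


def resolve_bounds(screen, window, window_origin):
    """Prefer SCREEN bounds; if their origin is (0, 0), offset WINDOW bounds."""
    x, y, _w, _h = screen
    if (x, y) != (0, 0):
        return screen
    wx, wy, ww, wh = window
    ox, oy = window_origin
    return (wx + ox, wy + oy, ww, wh)


def walk(root, window_origin=(0, 0)):
    """Depth-first walk that stops at either cap and assigns @eN refs."""
    elements = []
    stack = [(root, 0)]
    while stack and len(elements) < MAX_ELEMENTS:
        node, depth = stack.pop()
        elements.append({
            "ref": f"@e{len(elements) + 1}",
            "role": node["role"],
            "name": node.get("name", ""),
            "bounds": resolve_bounds(node["screen"], node["window"], window_origin),
        })
        if depth < MAX_DEPTH:
            # Reversed so that children are visited in document order.
            for child in reversed(node.get("children", [])):
                stack.append((child, depth + 1))
    return elements
```

An element that really sits at screen origin also takes the fallback path, which is harmless as long as the window-relative bounds and the window offset are consistent with each other.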

Verified end-to-end

These captures are from the bridge running against an Xvfb :99 + openbox session, driving real Linux apps. Screenshots were taken via scrot after the bridge issued the actions.

gnome-calculator — keypress flow

keypress: 7, +, 8, Return → display shows 15. 26 AT-SPI elements detected, every push button reports canPress: true and accurate bounds.

gnome-calculator — AT-SPI @eN ref clicks

computer_actions: [click @e3, click @e7] (which the bridge resolves to push buttons "4" and "5") → display shows 45.

gedit — full type_text round-trip

type_text: "Hello sir, … Linux X11 + AT-SPI + xdotool working end-to-end." → 169 characters typed. 190 AT-SPI elements found in gedit's window.

gedit — clear and retype

keypress ctrl+a → keypress Delete → type_text "Taksheel". Status bar reads Ln 1, Col 9.

App compatibility matrix

| App | screenshot | AT-SPI refs | input |
|---|---|---|---|
| gnome-calculator | | ✅ 26 elements, full action metadata | |
| gedit | | ✅ 190 elements | |
| GTK / Qt apps with AT-SPI | | | |
| Google Chrome / Chromium | | ⚠️ AT-SPI tree empty unless launched with --force-renderer-accessibility | ✅ (coords / keypress) |
| Firefox | | ✅ on a real session (gates on gsettings toolkit-accessibility) | |
| Electron apps | | ⚠️ same as Chrome, needs --force-renderer-accessibility | |
| LibreOffice (real Xorg session) | | ✅ via SAL_USE_COMMON_ONE_ACCESSIBILITY=1 | |
| Xvfb / nested X | | partial (some apps misbehave under Xvfb without a real session bus) | |

Limitations

  • X11 only. Wayland sessions cannot capture other-app windows or synthesize input via xdotool.
  • Apps must export AT-SPI for @eN refs to populate. Most GTK / Qt apps do; Electron / Chromium need --force-renderer-accessibility.
  • Mouse cursor physically moves — no stealth pointer on X11.
  • Dropped vs upstream: move_mouse, drag, wait, double_click, arrange_window, navigate_browser, list_apps. Use keypress, type_text, and computer_actions to compose what you need.

Development

git clone https://github.com/tak-uukti/linux-computer-use
cd linux-computer-use

# Pi side (TypeScript)
npm install
npm run typecheck

# Bridge sanity
python3 -c "import ast; ast.parse(open('bridge/bridge.py').read())"
echo '{"id":"1","cmd":"list_windows"}' | python3 bridge/bridge.py

# MCP side
python3 -m venv .venv && .venv/bin/pip install -e .
.venv/bin/linux-computer-use-mcp   # speaks MCP over stdio

The Pi extension API surface is stubbed locally in src/types.ts so typecheck runs without @mariozechner/pi-coding-agent installed.

Layout

.
├── assets/                          logo + screenshots
├── bridge/
│   ├── bridge.py                    471-line Python helper (AT-SPI + xdotool + scrot)
│   └── requirements.txt
├── extensions/
│   └── computer-use.ts              Pi tool registration + JSON schemas
├── mcp_server/
│   ├── __init__.py
│   └── server.py                    FastMCP wrapper around the bridge (8 tools)
├── scripts/
│   └── setup-helper.mjs             Pi postinstall — writes ~/.pi/.../bridge wrapper
├── skills/computer-use/SKILL.md     pi skill — Quick Start + Pitfalls
├── src/
│   ├── bridge.ts                    Pi-side subprocess manager + JSON-line protocol
│   └── types.ts                     local stubs for the pi-coding-agent extension API
├── package.json                     npm metadata (Pi extension)
├── pyproject.toml                   MCP server packaging (uvx-installable)
├── tsconfig.json
├── CHANGELOG.md
├── LICENSE
└── README.md

Credits

License

MIT © 2026 Tak1tak · built by Tak1tak
