An Osaurus plugin for efficient macOS automation via accessibility APIs. Features decoupled actions/observations, element-based interactions, and smart filtering for minimal context usage.
Accessibility permissions are required. Grant permission in:
- System Preferences > Security & Privacy > Privacy > Accessibility
Add the application using this plugin (e.g., Osaurus, or your terminal if running from CLI).
This plugin separates actions (click, type, press key) from observations (get UI elements), allowing agents to:
- Observe the UI once to understand the layout
- Execute multiple actions without re-observing
- Observe again only when needed (after navigation, dialogs, etc.)
This dramatically reduces context usage compared to returning the full UI tree after every action.
Opens or activates an application by name, bundle ID, or path.
{ "identifier": "Safari" }Returns: { "pid": 1234, "bundleId": "com.apple.Safari", "name": "Safari" }
Clicks an element by its ID (from get_ui_elements). Uses AXPress action when available, falls back to coordinate click.
{ "id": 5 }Focuses an element by its ID. Useful for text fields before typing.
{ "id": 3 }Clicks at raw screen coordinates. Use click_element instead when possible.
{ "x": 100, "y": 200, "button": "left", "doubleClick": false }Types text into the currently focused element.
{ "text": "Hello, world!" }Presses a keyboard key with optional modifiers.
{ "key": "return", "modifiers": ["command"] }Scrolls in the specified direction.
{ "direction": "down", "amount": 5 }Traverses the accessibility tree and returns interactive UI elements with assigned IDs.
{
"pid": 1234,
"maxElements": 100,
"maxDepth": 15,
"interactiveOnly": true
}Returns compact element array:
{
"pid": 1234,
"app": "Safari",
"elementCount": 25,
"elements": [
{
"id": 1,
"role": "button",
"label": "Back",
"x": 50,
"y": 100,
"w": 30,
"h": 30,
"actions": ["press"]
},
{
"id": 2,
"role": "textfield",
"label": "Address",
"x": 100,
"y": 100,
"w": 400,
"h": 30,
"actions": ["focus"]
}
]
}Returns information about the currently active window.
{}Returns: { "pid": 1234, "app": "Safari", "title": "Apple", "x": 0, "y": 25, "w": 1440, "h": 875 }
Lists all connected displays with their positions and dimensions.
{}Returns:
{
"displays": [
{
"index": 0,
"displayId": 1,
"x": 0,
"y": 0,
"width": 2560,
"height": 1440,
"isMain": true
},
{
"index": 1,
"displayId": 2,
"x": 2560,
"y": 0,
"width": 1920,
"height": 1080,
"isMain": false
}
]
}Captures a screenshot with multi-monitor support. Returns images in MCP ImageContent format for vision model support.
Defaults: format=jpeg, quality=0.7, scale=0.5 (suitable for most use cases)
{ "displayIndex": 0 } // Capture main display
{ "displayIndex": 1 } // Capture second display
{ "allDisplays": true } // Capture all displays as one image
{ "pid": 1234 } // Capture specific window (works on any display)
{ "savePath": "/tmp/screen.jpg" } // Save to file instead of base64 (avoids token limits)
{ "scale": 1.0, "format": "png" } // Full resolution PNG (larger output)Returns MCP CallToolResult format with content array (enables vision models to "see" the image):
{
"content": [
{
"type": "image",
"data": "<base64-encoded-image>",
"mimeType": "image/jpeg"
}
]
}Save to file - Use savePath to save the screenshot to disk instead of returning base64. This completely avoids token limit issues:
{ "savePath": "/tmp/screenshot.jpg" }Returns: { "width": 1440, "height": 900, "path": "/tmp/screenshot.jpg" }
Clicks an element and returns the updated UI state.
{ "id": 5, "maxElements": 100, "interactiveOnly": true }Types text and returns the updated UI state.
{ "text": "Hello", "pid": 1234 }Presses a key and returns the updated UI state.
{ "key": "return", "pid": 1234 }1. open_application({ "identifier": "Notes" })
→ { "pid": 1234, "name": "Notes" }
2. get_ui_elements({ "pid": 1234 })
→ Returns 30 elements with IDs 1-30
3. click_element({ "id": 5 }) // Click "New Note" button
→ { "success": true }
4. type_text({ "text": "My note content" })
→ { "success": true }
5. press_key({ "key": "s", "modifiers": ["command"] }) // Save
→ { "success": true }
Token usage: ~3K vs ~150K with the old approach (returning full tree after every action).
Elements are identified by numeric IDs assigned during get_ui_elements. The plugin:
- Tries AXPress action first - Works regardless of mouse position, immune to user interference
- Falls back to coordinate click - Re-queries element position before clicking to minimize stale data
This makes interactions more reliable than raw coordinate clicks.
- Native macOS apps (Finder, Mail, Notes, System Settings) - Full AX action support
- Browser chrome (tabs, bookmarks, toolbar) - Good AX support
- Well-built Electron apps - Varies by implementation
- Web content inside browsers - Use Playwright for better reliability
- Canvas-based apps (Figma, games) - Coordinate clicks only, no element tree
- Poorly accessible apps - Falls back to coordinate-based interaction
-
Build:
swift build -c release cp .build/release/libosaurus-macos-use.dylib ./libosaurus-macos-use.dylib
-
Install locally:
osaurus tools install .
codesign --force --options runtime --timestamp \
--sign "Developer ID Application: Your Name (TEAMID)" \
.build/release/libosaurus-macos-use.dylibosaurus tools package osaurus.macos-use 0.2.0MIT