osaurus-macos-use

An Osaurus plugin for efficient macOS automation via accessibility APIs. Features decoupled actions/observations, element-based interactions, and smart filtering for minimal context usage.

Prerequisites

Accessibility permissions are required. Grant permission in:

System Settings > Privacy & Security > Accessibility (on macOS 12 and earlier: System Preferences > Security & Privacy > Privacy > Accessibility)

Add the application using this plugin (e.g., Osaurus, or your terminal if running from CLI).

Architecture

This plugin separates actions (click, type, press key) from observations (get UI elements), allowing agents to:

1. Observe the UI once to understand the layout
2. Execute multiple actions without re-observing
3. Observe again only when needed (after navigation, dialogs, etc.)

This dramatically reduces context usage compared to returning the full UI tree after every action.
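
The observe-once pattern above can be sketched from the agent side. This is a minimal Python illustration, not part of the plugin: `call_tool` is a hypothetical stand-in for whatever client function dispatches tool calls, and the `reobserve` flag on a step is an assumed convention for marking actions that change the layout.

```python
from typing import Any, Callable, Dict, List

# Hypothetical dispatcher: (tool name, arguments) -> tool result.
ToolCall = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def run_actions(call_tool: ToolCall, pid: int, steps: List[Dict[str, Any]]) -> int:
    """Run steps, observing once up front and again only after flagged steps.

    Returns the number of get_ui_elements calls that were made.
    """
    call_tool("get_ui_elements", {"pid": pid})  # single initial observation
    observations = 1
    for step in steps:
        call_tool(step["tool"], step["args"])  # lean action call
        if step.get("reobserve"):  # e.g. after navigation or a dialog
            call_tool("get_ui_elements", {"pid": pid})
            observations += 1
    return observations
```

Three actions with one flagged step cost two observations rather than one per action.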

Tools

Core Action Tools (Lean responses ~100 tokens)

`open_application`

Opens or activates an application by name, bundle ID, or path.

{ "identifier": "Safari" }

Returns: { "pid": 1234, "bundleId": "com.apple.Safari", "name": "Safari" }

`click_element`

Clicks an element by its ID (from get_ui_elements). Uses the AXPress action when available and falls back to a coordinate click otherwise.

{ "id": 5 }

`focus_element`

Focuses an element by its ID. Useful for text fields before typing.

{ "id": 3 }

`click`

Clicks at raw screen coordinates. Use click_element instead when possible.

{ "x": 100, "y": 200, "button": "left", "doubleClick": false }

`type_text`

Types text into the currently focused element.

{ "text": "Hello, world!" }

`press_key`

Presses a keyboard key with optional modifiers.

{ "key": "return", "modifiers": ["command"] }

`scroll`

Scrolls in the specified direction.

{ "direction": "down", "amount": 5 }

Observation Tools (On-demand ~2-5K tokens)

`get_ui_elements`

Traverses the accessibility tree and returns interactive UI elements with assigned IDs.

{
  "pid": 1234,
  "maxElements": 100,
  "maxDepth": 15,
  "interactiveOnly": true
}

Returns a compact element array:

{
  "pid": 1234,
  "app": "Safari",
  "elementCount": 25,
  "elements": [
    {
      "id": 1,
      "role": "button",
      "label": "Back",
      "x": 50,
      "y": 100,
      "w": 30,
      "h": 30,
      "actions": ["press"]
    },
    {
      "id": 2,
      "role": "textfield",
      "label": "Address",
      "x": 100,
      "y": 100,
      "w": 400,
      "h": 30,
      "actions": ["focus"]
    }
  ]
}
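
An agent usually needs to pick one element out of this array before acting on it. The following Python sketch shows one way to do that on the client side. The helper names `find_element` and `center` are illustrative, and the code assumes that `x`/`y` give the top-left corner and `w`/`h` the size, which the response format does not state explicitly.

```python
from typing import Any, Dict, List, Optional, Tuple

Element = Dict[str, Any]


def find_element(elements: List[Element], role: str, label: str) -> Optional[Element]:
    """Return the first element with this role and a case-insensitively equal label."""
    wanted = label.lower()
    for el in elements:
        if el.get("role") == role and str(el.get("label", "")).lower() == wanted:
            return el
    return None


def center(el: Element) -> Tuple[int, int]:
    # Assumption: x/y is the top-left corner, w/h the width and height.
    return (el["x"] + el["w"] // 2, el["y"] + el["h"] // 2)


# The two elements from the example response above.
elements = [
    {"id": 1, "role": "button", "label": "Back",
     "x": 50, "y": 100, "w": 30, "h": 30, "actions": ["press"]},
    {"id": 2, "role": "textfield", "label": "Address",
     "x": 100, "y": 100, "w": 400, "h": 30, "actions": ["focus"]},
]

address = find_element(elements, "textfield", "address")
print(address["id"], center(address))  # prints: 2 (300, 115)
```

The returned `id` is what click_element and focus_element expect. The center point is only needed when falling back to the raw click tool.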

`get_active_window`

Returns information about the currently active window.

{}

Returns: { "pid": 1234, "app": "Safari", "title": "Apple", "x": 0, "y": 25, "w": 1440, "h": 875 }

`list_displays`

Lists all connected displays with their positions and dimensions.

{}

Returns:

{
  "displays": [
    {
      "index": 0,
      "displayId": 1,
      "x": 0,
      "y": 0,
      "width": 2560,
      "height": 1440,
      "isMain": true
    },
    {
      "index": 1,
      "displayId": 2,
      "x": 2560,
      "y": 0,
      "width": 1920,
      "height": 1080,
      "isMain": false
    }
  ]
}
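
The display rectangles can be used to work out which display a coordinate falls on, for example to choose a `displayIndex` for take_screenshot. A minimal Python sketch follows. The helper name `display_for_point` is hypothetical, and it assumes the display `x`/`y` values share one global coordinate space with the coordinates used elsewhere, which is not confirmed here.

```python
from typing import Any, Dict, List, Optional


def display_for_point(displays: List[Dict[str, Any]], x: int, y: int) -> Optional[int]:
    """Return the index of the display whose rectangle contains (x, y), else None."""
    for d in displays:
        inside_x = d["x"] <= x < d["x"] + d["width"]
        inside_y = d["y"] <= y < d["y"] + d["height"]
        if inside_x and inside_y:
            return d["index"]
    return None
```

With the two displays from the example above, a point at x=3000 lies past the 2560-wide main display and so maps to index 1.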

`take_screenshot`

Captures a screenshot with multi-monitor support. Returns images in MCP ImageContent format for vision model support.

Defaults: format=jpeg, quality=0.7, scale=0.5 (suitable for most use cases)

{ "displayIndex": 0 }           // Capture main display
{ "displayIndex": 1 }           // Capture second display
{ "allDisplays": true }         // Capture all displays as one image
{ "pid": 1234 }                 // Capture specific window (works on any display)
{ "savePath": "/tmp/screen.jpg" }  // Save to file instead of base64 (avoids token limits)
{ "scale": 1.0, "format": "png" }  // Full resolution PNG (larger output)

Returns an MCP CallToolResult with a content array, which lets vision models "see" the image:

{
  "content": [
    {
      "type": "image",
      "data": "<base64-encoded-image>",
      "mimeType": "image/jpeg"
    }
  ]
}

Save to file - Set savePath to write the screenshot to disk instead of returning base64 data, which avoids token limit issues entirely:

{ "savePath": "/tmp/screenshot.jpg" }

Returns: { "width": 1440, "height": 900, "path": "/tmp/screenshot.jpg" }
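
When the base64 form is used, the client has to decode the image items itself. The Python sketch below shows one way to pull them out of a result shaped like the content array above. The function name `extract_images` is illustrative and the fallback MIME type is an assumption.

```python
import base64
from typing import Any, Dict, List, Tuple


def extract_images(result: Dict[str, Any]) -> List[Tuple[str, bytes]]:
    """Return (mimeType, raw bytes) for every image item in a tool result."""
    images = []
    for item in result.get("content", []):
        if item.get("type") != "image":
            continue  # skip text or other content items
        mime = item.get("mimeType", "application/octet-stream")  # assumed fallback
        images.append((mime, base64.b64decode(item["data"])))
    return images
```

The decoded bytes can be written straight to a file or handed to an image library.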

Convenience Tools (Action + Observation combined)

`click_element_and_observe`

Clicks an element and returns the updated UI state.

{ "id": 5, "maxElements": 100, "interactiveOnly": true }

`type_and_observe`

Types text and returns the updated UI state.

{ "text": "Hello", "pid": 1234 }

`press_key_and_observe`

Presses a key and returns the updated UI state.

{ "key": "return", "pid": 1234 }
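
Choosing between a plain action and its combined variant can be reduced to a small lookup. This Python sketch uses only the tool names listed above. The function `plan_call` and the `expect_ui_change` flag are hypothetical, and the rule it encodes is one reasonable policy rather than something the plugin enforces.

```python
from typing import Dict, List

# Plain actions that have an action-plus-observation variant.
COMBINED: Dict[str, str] = {
    "click_element": "click_element_and_observe",
    "type_text": "type_and_observe",
    "press_key": "press_key_and_observe",
}


def plan_call(action: str, expect_ui_change: bool) -> List[str]:
    """Return the tool names to call, in order, for one step."""
    if not expect_ui_change:
        return [action]  # lean response, no observation needed
    if action in COMBINED:
        return [COMBINED[action]]  # one round trip instead of two
    return [action, "get_ui_elements"]  # no combined variant, observe separately
```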

Typical Workflow

1. open_application({ "identifier": "Notes" })
   → { "pid": 1234, "name": "Notes" }

2. get_ui_elements({ "pid": 1234 })
   → Returns 30 elements with IDs 1-30

3. click_element({ "id": 5 })      // Click "New Note" button
   → { "success": true }

4. type_text({ "text": "My note content" })
   → { "success": true }

5. press_key({ "key": "s", "modifiers": ["command"] })  // Save
   → { "success": true }

Token usage: ~3K, versus ~150K with the old approach (returning the full tree after every action).
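
The ~3K figure is consistent with the per-call sizes quoted in the tool headings: the workflow makes one observation (~2-5K tokens) and four action calls (~100 tokens each). A back-of-envelope estimate in Python, where the function name and the default costs are illustrative assumptions rather than measurements:

```python
def estimate_tokens(actions: int, observations: int,
                    action_cost: int = 100, observation_cost: int = 2500) -> int:
    """Rough response-token total for a run of action and observation calls."""
    return actions * action_cost + observations * observation_cost


# Workflow above: open_application, click_element, type_text, press_key
# are four action calls, plus one get_ui_elements observation.
low = estimate_tokens(4, 1, observation_cost=2000)
high = estimate_tokens(4, 1, observation_cost=5000)
print(low, high)  # prints: 2400 5400
```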

Element-Based Interactions

Elements are identified by numeric IDs assigned during get_ui_elements. The plugin:

1. Tries the AXPress action first - Works regardless of mouse position, immune to user interference
2. Falls back to a coordinate click - Re-queries the element position before clicking to minimize stale data

This makes interactions more reliable than raw coordinate clicks.
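
The fallback order described above can be sketched as follows. This is a Python illustration of the control flow only, not the plugin's Swift implementation. The three callables are hypothetical stand-ins for the accessibility press, the position re-query, and the raw click.

```python
from typing import Callable, Optional, Tuple

Frame = Tuple[int, int, int, int]  # x, y, w, h


def click_with_fallback(
    try_press: Callable[[], bool],
    query_frame: Callable[[], Optional[Frame]],
    click_at: Callable[[int, int], None],
) -> str:
    """Prefer the accessibility press; otherwise click the element's current center."""
    if try_press():
        return "ax_press"  # no mouse movement involved
    frame = query_frame()  # re-query right before clicking to avoid stale coordinates
    if frame is None:
        return "failed"  # element no longer exists
    x, y, w, h = frame
    click_at(x + w // 2, y + h // 2)
    return "coordinate_click"
```

Re-querying the frame inside the fallback branch, rather than reusing the coordinates from the last observation, is what keeps the click accurate if the window moved in between.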

Best Use Cases

- Native macOS apps (Finder, Mail, Notes, System Settings) - Full AX action support
- Browser chrome (tabs, bookmarks, toolbar) - Good AX support
- Well-built Electron apps - Varies by implementation

Limitations

- Web content inside browsers - Use Playwright for better reliability
- Canvas-based apps (Figma, games) - Coordinate clicks only, no element tree
- Poorly accessible apps - Falls back to coordinate-based interaction

Development

Build:

swift build -c release
cp .build/release/libosaurus-macos-use.dylib ./libosaurus-macos-use.dylib

Install locally:
```
osaurus tools install .
```

Publishing

Code Signing (Required for Distribution)

codesign --force --options runtime --timestamp \
  --sign "Developer ID Application: Your Name (TEAMID)" \
  .build/release/libosaurus-macos-use.dylib

Package and Distribute

osaurus tools package osaurus.macos-use 0.2.0

License

MIT
