Harmony macOS VM Orchestrator

This repository provides a complete toolkit to drive a macOS virtual machine from Telegram using natural language. The planner runs on gpt-oss-20b (Harmony) exposed by LM Studio, with fast vision handled by qwen3-vl-4b and automatic OCR fallback. The agent focuses on DOM-first web automation, but falls back to HTTP scraping or full-screen vision when necessary.

Project layout

src/
├── agent/
│   ├── config.py            # load .env and YAML configuration
│   ├── harmony_client.py    # Harmony request helper + tool schema
│   ├── orchestrator.py      # Telegram loop, tool execution, logging
│   ├── tools_dom.py         # Chrome/Selenium helpers with CSS path
│   ├── tools_http.py        # requests + BeautifulSoup scraping helpers
│   ├── tools_telegram.py    # Telegram polling and messaging
│   ├── tools_ui.py          # Humanised mouse/keyboard/screenshot helpers
│   └── tools_vlm.py         # Vision model client + Tesseract fallback
└── web/
    └── web_agent.py         # Standalone DOM-first Selenium agent

configs/
├── .env.example             # Copy to .env and customise credentials
└── config.yaml              # Default runtime parameters

scripts/
├── run_orchestrator.sh      # Launch the Telegram orchestrator
└── run_web_agent.sh          # Launch the DOM-only agent

tests/                       # pytest coverage for the protocol and tools

Quick installation

Install system requirements (macOS host):
```
brew install tesseract
```
Create and activate a Python 3.10+ virtual environment.
Install Python dependencies:
```
pip install -r requirements.txt
```
Copy the environment template and edit it:
```
cp configs/.env.example .env
```
Fill in at minimum TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID, and confirm the LM Studio endpoints.

macOS permissions

The orchestrator controls the GUI directly. Grant the host application running the Python process the following permissions in System Settings → Privacy & Security:

Accessibility (keyboard and mouse automation)
Screen Recording (for screenshots sent as proof)

Restart the terminal after toggling the permissions.

Environment variables

All configuration values can be provided through .env or the environment. The most relevant keys are:

Variable	Description
`LM_STUDIO_API_BASE_TEXT`	HTTP base URL for the text model (`/v1` path is required).
`LM_STUDIO_API_BASE_VLM`	HTTP base URL for the vision model.
`MODEL_TEXT` / `MODEL_VLM`	Model identifiers in LM Studio.
`TELEGRAM_BOT_TOKEN` / `TELEGRAM_CHAT_ID`	Telegram bot credentials.
`AGENT_LOG_DIR` / `AGENT_CAPTURE_DIR`	Override log and screenshot directories.
`ALLOW_AWAIT_USER_REPLY`	Set to `1` to allow `/await_user_reply` style questions.
`AGENT_HISTORY_LIMIT`	Number of conversational messages kept (default: 20).

The YAML file configs/config.yaml exposes sensible defaults for timeouts, Selenium options, and poll intervals.

Running the orchestrator

Start LM Studio with both the text and vision models served over the OpenAI-compatible API. Then launch the Telegram-driven orchestrator:

./scripts/run_orchestrator.sh

The script loads .env, initialises all tools, and starts polling Telegram. Every natural-language message in the configured chat is forwarded to the Harmony planner. Tool invocations are logged to logs/agent.jsonl alongside OCR/VLM evidence and screenshot paths saved in captures/.

Telegram commands

/shot – force an immediate screenshot and send it back.
/end – stop the orchestrator gracefully (also closes Chrome).

The agent never asks clarifying questions unless you explicitly enable ALLOW_AWAIT_USER_REPLY=1.

Tool orchestration policy

DOM-first: try to derive a precise CSS selector and use the Selenium DOM tools.
HTTP fallback: if DOM selectors are unreliable, fetch and parse HTML through simple_scrape.
Vision/OCR: capture the screen and ask the vision model (vision_describe), falling back to Tesseract when needed.
Proof before finish: every /finish call must attach at least one screenshot; the orchestrator enforces this and relays the proof to Telegram.

Each tool invocation is appended to logs/agent.jsonl using an append-only JSONL structure:

{"ts": "2024-04-12T10:45:21Z", "tool": "dom_navigate", "result": {"ok": true, "url": "https://example.com"}}

Standalone DOM agent

The DOM-only agent in src/web/web_agent.py keeps the previous single-process workflow (no Telegram). Run it with:

./scripts/run_web_agent.sh --start-url https://example.com --objective "Cherche la météo de Paris"

It sends structured page descriptions to the model and expects strict JSON replies (navigate, click, input, scroll, finish). Use this mode when debugging selectors locally without the full Telegram orchestration stack.

Testing

Run the unit tests with pytest:

pytest

The suite covers Harmony payload formatting, HTTP scraping helpers, and UI stubs in test mode.

Troubleshooting

No screenshots are sent: ensure Screen Recording permission is granted and that the captures/ directory is writable.
proof_required when finishing: call take_screenshot (or /shot) before the planner issues finish.
Selenium cannot launch Chrome: install a recent Chrome build and make sure webdriver-manager can download the corresponding driver.
Vision call fails: the orchestrator automatically falls back to OCR, but check that Tesseract is installed via Homebrew.
Telegram messages ignored: double-check the chat ID matches the destination chat (private, group, or channel).

Licensing

The project is released under the MIT licence. See LICENSE for the full text.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
configs		configs
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
2508.10925v1.pdf		2508.10925v1.pdf
LICENSE		LICENSE
README.md		README.md
agent_harmony.py		agent_harmony.py
agent_vm_control.py		agent_vm_control.py
agent_vm_control_human.py		agent_vm_control_human.py
agent_vm_orchestrator.py		agent_vm_orchestrator.py
agent_vm_orchestrator_nl.py		agent_vm_orchestrator_nl.py
backend.py		backend.py
browser_server.py		browser_server.py
docker_tool.py		docker_tool.py
format.md		format.md
page_contents.py		page_contents.py
python.md		python.md
requirements.txt		requirements.txt
simple_browser_tool.py		simple_browser_tool.py
web_agent.py		web_agent.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Harmony macOS VM Orchestrator

Project layout

Quick installation

macOS permissions

Environment variables

Running the orchestrator

Telegram commands

Tool orchestration policy

Standalone DOM agent

Testing

Troubleshooting

Licensing

About

Uh oh!

Releases

Packages

Languages

License

eauchs/gpt-oss-web-agent

Folders and files

Latest commit

History

Repository files navigation

Harmony macOS VM Orchestrator

Project layout

Quick installation

macOS permissions

Environment variables

Running the orchestrator

Telegram commands

Tool orchestration policy

Standalone DOM agent

Testing

Troubleshooting

Licensing

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages