This repository provides a complete toolkit to drive a macOS virtual machine from Telegram using natural language. The planner runs on gpt-oss-20b (Harmony) exposed by LM Studio, with fast vision handled by qwen3-vl-4b and automatic OCR fallback. The agent focuses on DOM-first web automation, but falls back to HTTP scraping or full-screen vision when necessary.
src/
├── agent/
│   ├── config.py            # Load .env and YAML configuration
│   ├── harmony_client.py    # Harmony request helper + tool schema
│   ├── orchestrator.py      # Telegram loop, tool execution, logging
│   ├── tools_dom.py         # Chrome/Selenium helpers with CSS path
│   ├── tools_http.py        # requests + BeautifulSoup scraping helpers
│   ├── tools_telegram.py    # Telegram polling and messaging
│   ├── tools_ui.py          # Humanised mouse/keyboard/screenshot helpers
│   └── tools_vlm.py         # Vision model client + Tesseract fallback
└── web/
    └── web_agent.py         # Standalone DOM-first Selenium agent
configs/
├── .env.example             # Copy to .env and customise credentials
└── config.yaml              # Default runtime parameters
scripts/
├── run_orchestrator.sh      # Launch the Telegram orchestrator
└── run_web_agent.sh         # Launch the DOM-only agent
tests/                       # pytest coverage for the protocol and tools
- Install system requirements (macOS host):
brew install tesseract
- Create and activate a Python 3.10+ virtual environment.
- Install Python dependencies:
pip install -r requirements.txt
- Copy the environment template and edit it:
cp configs/.env.example .env
Fill in at minimum TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID, and confirm the LM Studio endpoints.
The orchestrator controls the GUI directly. Grant the host application running the Python process the following permissions in System Settings → Privacy & Security:
- Accessibility (keyboard and mouse automation)
- Screen Recording (for screenshots sent as proof)
Restart the terminal after toggling the permissions.
All configuration values can be provided through .env or the environment. The most relevant keys are:
| Variable | Description |
|---|---|
| LM_STUDIO_API_BASE_TEXT | HTTP base URL for the text model (/v1 path is required). |
| LM_STUDIO_API_BASE_VLM | HTTP base URL for the vision model. |
| MODEL_TEXT / MODEL_VLM | Model identifiers in LM Studio. |
| TELEGRAM_BOT_TOKEN / TELEGRAM_CHAT_ID | Telegram bot credentials. |
| AGENT_LOG_DIR / AGENT_CAPTURE_DIR | Override log and screenshot directories. |
| ALLOW_AWAIT_USER_REPLY | Set to 1 to allow /await_user_reply style questions. |
| AGENT_HISTORY_LIMIT | Number of conversational messages kept (default: 20). |
The YAML file configs/config.yaml exposes sensible defaults for timeouts, Selenium options, and poll intervals.
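As an illustration of how these variables might be consumed, here is a minimal sketch that reads them from a mapping such as os.environ. The AgentSettings class, the load_settings function, the fallback URL http://localhost:1234/v1, and the choice to reuse the text endpoint when no vision endpoint is set are all assumptions made for this example; the actual src/agent/config.py may be organised differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSettings:
    """Hypothetical settings container (not taken from the repository)."""
    api_base_text: str
    api_base_vlm: str
    telegram_bot_token: str
    telegram_chat_id: str
    allow_await_user_reply: bool
    history_limit: int


def load_settings(env=None):
    """Build settings from a mapping of environment variables."""
    env = os.environ if env is None else env

    # The Telegram credentials are the minimum the README asks you to fill in.
    missing = [k for k in ("TELEGRAM_BOT_TOKEN", "TELEGRAM_CHAT_ID") if not env.get(k)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))

    # Assumed default: a local LM Studio server on port 1234.
    api_base_text = env.get("LM_STUDIO_API_BASE_TEXT", "http://localhost:1234/v1")
    if not api_base_text.rstrip("/").endswith("/v1"):
        raise ValueError("LM_STUDIO_API_BASE_TEXT must include the /v1 path")

    return AgentSettings(
        api_base_text=api_base_text,
        # Assumption: reuse the text endpoint when no vision endpoint is given.
        api_base_vlm=env.get("LM_STUDIO_API_BASE_VLM", api_base_text),
        telegram_bot_token=env["TELEGRAM_BOT_TOKEN"],
        telegram_chat_id=env["TELEGRAM_CHAT_ID"],
        allow_await_user_reply=env.get("ALLOW_AWAIT_USER_REPLY", "0") == "1",
        history_limit=int(env.get("AGENT_HISTORY_LIMIT", "20")),
    )
```

Passing a plain dict instead of os.environ keeps the function easy to exercise in tests without touching the real environment.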
Start LM Studio with both the text and vision models served over the OpenAI-compatible API. Then launch the Telegram-driven orchestrator:
./scripts/run_orchestrator.sh

The script loads .env, initialises all tools, and starts polling Telegram. Every natural-language message in the configured chat is forwarded to the Harmony planner. Tool invocations are logged to logs/agent.jsonl alongside OCR/VLM evidence and screenshot paths saved in captures/.
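The logging step can be pictured with the following sketch, which appends one JSON object per line. The function name append_tool_record is hypothetical, and only the ts, tool, and result fields are taken from the sample record in this README; the orchestrator's real logger may record more.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_tool_record(log_path, tool, result, now=None):
    """Append one tool invocation to a JSONL file and return the record."""
    # Normalise to UTC so the trailing "Z" in the timestamp is accurate.
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    record = {
        "ts": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "tool": tool,
        "result": result,
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" only ever adds lines, so earlier records are never rewritten.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

Because each line is a complete JSON document, the file can be inspected with line-oriented tools or parsed one line at a time with json.loads.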
Built-in chat commands:
- /shot – force an immediate screenshot and send it back.
- /end – stop the orchestrator gracefully (also closes Chrome).
The agent never asks clarifying questions unless you explicitly enable ALLOW_AWAIT_USER_REPLY=1.
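One way to separate the /shot and /end commands from natural-language requests is sketched below. The classify_message function and its return labels are hypothetical and are not taken from orchestrator.py; the handling of an @botname suffix reflects how Telegram addresses commands in group chats.

```python
def classify_message(text):
    """Return a (kind, payload) pair for an incoming chat message.

    kind is "shot", "end", "plan", or "ignore"; these labels are illustrative.
    """
    stripped = text.strip()
    if not stripped:
        return ("ignore", "")

    # Only the first token can be a command; compare it case-insensitively.
    first_token = stripped.split(maxsplit=1)[0].lower()
    # In group chats Telegram may deliver commands as "/shot@botname".
    command = first_token.split("@", 1)[0]

    if command == "/shot":
        return ("shot", "")
    if command == "/end":
        return ("end", "")
    # Everything else is treated as a request for the planner.
    return ("plan", stripped)
```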
- DOM-first: try to derive a precise CSS selector and use the Selenium DOM tools.
- HTTP fallback: if DOM selectors are unreliable, fetch and parse HTML through simple_scrape.
- Vision/OCR: capture the screen and ask the vision model (vision_describe), falling back to Tesseract when needed.
- Proof before finish: every /finish call must attach at least one screenshot; the orchestrator enforces this and relays the proof to Telegram.
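The ordering above can be sketched as a simple fallback chain plus a guard on finish. The names run_with_fallback and check_finish, and the shape of the returned dictionaries, are assumptions for illustration; only the proof_required error label appears elsewhere in this README.

```python
def run_with_fallback(objective, strategies):
    """Try each (name, callable) pair in order and return the first success.

    A strategy fails by raising an exception or by returning a falsy value.
    """
    errors = []
    for name, attempt in strategies:
        try:
            value = attempt(objective)
        except Exception as exc:  # any failure moves on to the next strategy
            errors.append(f"{name}: {exc}")
            continue
        if value:
            return name, value
        errors.append(f"{name}: no result")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))


def check_finish(screenshots):
    """Reject a finish request that carries no screenshot proof."""
    if not screenshots:
        return {"ok": False, "error": "proof_required"}
    return {"ok": True, "proof": list(screenshots)}
```

In this sketch the strategies would be passed in the order DOM, HTTP, vision, so the cheaper and more precise approaches are always tried first.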
Each tool invocation is appended to logs/agent.jsonl using an append-only JSONL structure:
{"ts": "2024-04-12T10:45:21Z", "tool": "dom_navigate", "result": {"ok": true, "url": "https://example.com"}}

The DOM-only agent in src/web/web_agent.py keeps the previous single-process workflow (no Telegram). Run it with:
./scripts/run_web_agent.sh --start-url https://example.com --objective "Look up the weather in Paris"

It sends structured page descriptions to the model and expects strict JSON replies (navigate, click, input, scroll, finish). Use this mode when debugging selectors locally without the full Telegram orchestration stack.
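To make "strict JSON replies" concrete, the sketch below rejects anything that is not a JSON object naming one of the five actions. The "action" key and the parse_reply function are assumptions; the reply schema actually used by web_agent.py is not documented here and may differ.

```python
import json

# Action names listed in this README for the DOM-only agent.
ALLOWED_ACTIONS = {"navigate", "click", "input", "scroll", "finish"}


def parse_reply(raw):
    """Parse a model reply and return it as a dict, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    # Assumption: the action name is stored under an "action" key.
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return data
```

Rejecting malformed replies early keeps selector debugging focused on the page rather than on model formatting errors.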
Run the unit tests with pytest:
pytest

The suite covers Harmony payload formatting, HTTP scraping helpers, and UI stubs in test mode.
- No screenshots are sent: ensure Screen Recording permission is granted and that the captures/ directory is writable.
- proof_required when finishing: call take_screenshot (or /shot) before the planner issues finish.
- Selenium cannot launch Chrome: install a recent Chrome build and make sure webdriver-manager can download the corresponding driver.
- Vision call fails: the orchestrator automatically falls back to OCR, but check that Tesseract is installed via Homebrew.
- Telegram messages ignored: double-check that the chat ID matches the destination chat (private, group, or channel).
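Some of these checks can be automated before launching the orchestrator. The following preflight function is a hypothetical helper, not part of the repository; it covers only the Tesseract binary, the writability of the captures directory, and the presence of the Telegram variables, and it cannot verify macOS privacy permissions.

```python
import os
import shutil
import tempfile
from pathlib import Path


def preflight(capture_dir="captures", env=None):
    """Return a list of human-readable problems; an empty list means all clear."""
    env = os.environ if env is None else env
    problems = []

    # OCR fallback needs the tesseract binary on PATH.
    if shutil.which("tesseract") is None:
        problems.append("tesseract not found on PATH (try: brew install tesseract)")

    # Screenshots can only be saved if the capture directory accepts new files.
    path = Path(capture_dir)
    try:
        path.mkdir(parents=True, exist_ok=True)
        with tempfile.NamedTemporaryFile(dir=path):
            pass  # the temporary file is removed when the block exits
    except OSError as exc:
        problems.append(f"{path} is not writable: {exc}")

    # Without these the orchestrator cannot receive or answer messages.
    for key in ("TELEGRAM_BOT_TOKEN", "TELEGRAM_CHAT_ID"):
        if not env.get(key):
            problems.append(f"{key} is not set")

    return problems
```

A usage example would be to print each returned problem and exit with a non-zero status if the list is not empty.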
The project is released under the MIT licence. See LICENSE for the full text.