Skip to content

Autonomous web automation agent powered by a GPT-OSS 20b. It analyzes the web page structure and executes actions (click, input, navigation) via an OpenAI-compatible API to accomplish complex goals.

License

Notifications You must be signed in to change notification settings

eauchs/gpt-oss-web-agent

Repository files navigation

Harmony macOS VM Orchestrator

This repository provides a complete toolkit to drive a macOS virtual machine from Telegram using natural language. The planner runs on gpt-oss-20b (Harmony) exposed by LM Studio, with fast vision handled by qwen3-vl-4b and automatic OCR fallback. The agent focuses on DOM-first web automation, but falls back to HTTP scraping or full-screen vision when necessary.

Project layout

src/
├── agent/
│   ├── config.py            # load .env and YAML configuration
│   ├── harmony_client.py    # Harmony request helper + tool schema
│   ├── orchestrator.py      # Telegram loop, tool execution, logging
│   ├── tools_dom.py         # Chrome/Selenium helpers with CSS path
│   ├── tools_http.py        # requests + BeautifulSoup scraping helpers
│   ├── tools_telegram.py    # Telegram polling and messaging
│   ├── tools_ui.py          # Humanised mouse/keyboard/screenshot helpers
│   └── tools_vlm.py         # Vision model client + Tesseract fallback
└── web/
    └── web_agent.py         # Standalone DOM-first Selenium agent

configs/
├── .env.example             # Copy to .env and customise credentials
└── config.yaml              # Default runtime parameters

scripts/
├── run_orchestrator.sh      # Launch the Telegram orchestrator
└── run_web_agent.sh          # Launch the DOM-only agent

tests/                       # pytest coverage for the protocol and tools

Quick installation

  1. Install system requirements (macOS host):
    brew install tesseract
  2. Create and activate a Python 3.10+ virtual environment.
  3. Install Python dependencies:
    pip install -r requirements.txt
  4. Copy the environment template and edit it:
    cp configs/.env.example .env
    Fill in at minimum TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID, and confirm the LM Studio endpoints.

macOS permissions

The orchestrator controls the GUI directly. Grant the host application running the Python process the following permissions in System Settings → Privacy & Security:

  • Accessibility (keyboard and mouse automation)
  • Screen Recording (for screenshots sent as proof)

Restart the terminal after toggling the permissions.

Environment variables

All configuration values can be provided through .env or the environment. The most relevant keys are:

Variable Description
LM_STUDIO_API_BASE_TEXT HTTP base URL for the text model (/v1 path is required).
LM_STUDIO_API_BASE_VLM HTTP base URL for the vision model.
MODEL_TEXT / MODEL_VLM Model identifiers in LM Studio.
TELEGRAM_BOT_TOKEN / TELEGRAM_CHAT_ID Telegram bot credentials.
AGENT_LOG_DIR / AGENT_CAPTURE_DIR Override log and screenshot directories.
ALLOW_AWAIT_USER_REPLY Set to 1 to allow /await_user_reply style questions.
AGENT_HISTORY_LIMIT Number of conversational messages kept (default: 20).

The YAML file configs/config.yaml exposes sensible defaults for timeouts, Selenium options, and poll intervals.

Running the orchestrator

Start LM Studio with both the text and vision models served over the OpenAI-compatible API. Then launch the Telegram-driven orchestrator:

./scripts/run_orchestrator.sh

The script loads .env, initialises all tools, and starts polling Telegram. Every natural-language message in the configured chat is forwarded to the Harmony planner. Tool invocations are logged to logs/agent.jsonl alongside OCR/VLM evidence and screenshot paths saved in captures/.

Telegram commands

  • /shot – force an immediate screenshot and send it back.
  • /end – stop the orchestrator gracefully (also closes Chrome).

The agent never asks clarifying questions unless you explicitly enable ALLOW_AWAIT_USER_REPLY=1.

Tool orchestration policy

  1. DOM-first: try to derive a precise CSS selector and use the Selenium DOM tools.
  2. HTTP fallback: if DOM selectors are unreliable, fetch and parse HTML through simple_scrape.
  3. Vision/OCR: capture the screen and ask the vision model (vision_describe), falling back to Tesseract when needed.
  4. Proof before finish: every /finish call must attach at least one screenshot; the orchestrator enforces this and relays the proof to Telegram.

Each tool invocation is appended to logs/agent.jsonl using an append-only JSONL structure:

{"ts": "2024-04-12T10:45:21Z", "tool": "dom_navigate", "result": {"ok": true, "url": "https://example.com"}}

Standalone DOM agent

The DOM-only agent in src/web/web_agent.py keeps the previous single-process workflow (no Telegram). Run it with:

./scripts/run_web_agent.sh --start-url https://example.com --objective "Cherche la météo de Paris"

It sends structured page descriptions to the model and expects strict JSON replies (navigate, click, input, scroll, finish). Use this mode when debugging selectors locally without the full Telegram orchestration stack.

Testing

Run the unit tests with pytest:

pytest

The suite covers Harmony payload formatting, HTTP scraping helpers, and UI stubs in test mode.

Troubleshooting

  • No screenshots are sent: ensure Screen Recording permission is granted and that the captures/ directory is writable.
  • proof_required when finishing: call take_screenshot (or /shot) before the planner issues finish.
  • Selenium cannot launch Chrome: install a recent Chrome build and make sure webdriver-manager can download the corresponding driver.
  • Vision call fails: the orchestrator automatically falls back to OCR, but check that Tesseract is installed via Homebrew.
  • Telegram messages ignored: double-check the chat ID matches the destination chat (private, group, or channel).

Licensing

The project is released under the MIT licence. See LICENSE for the full text.

About

Autonomous web automation agent powered by a GPT-OSS 20b. It analyzes the web page structure and executes actions (click, input, navigation) via an OpenAI-compatible API to accomplish complex goals.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published