Daily browser-agent benchmark for logical puzzles.
- A benchmark engine with pluggable model, puzzle, and sandbox adapters
- Local SQLite persistence for puzzle instances, runs, steps, artifacts, and daily leaderboard rows
- A FastAPI web app that shows the current leaderboard and run detail pages
- A local development path with fixture-based puzzle adapters and a deterministic mock model
- A production-oriented BrowserbaseSandboxProvider scaffold for remote browser sessions
python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload

Health check:
curl -s http://127.0.0.1:8000/health

Seed fixture/demo data:
python3 -m app.cli seed-demo

Then open http://127.0.0.1:8000.
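When scripting the steps above, it can help to wait until the server answers before seeding data. The following is a minimal sketch of such a wait loop, not part of this repo: the names fetch_status and wait_for_health are hypothetical, and it assumes that /health returns HTTP 200 once the app is ready, which this README does not state.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str) -> int:
    """Return the HTTP status of a GET request, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return exc.code
    except OSError:
        # Connection refused, timeout, DNS failure, and similar.
        return 0


def wait_for_health(
    url: str,
    attempts: int = 10,
    delay: float = 0.5,
    fetch: Callable[[str], int] = fetch_status,
) -> bool:
    """Poll `url` until it returns 200 (assumed to mean healthy) or attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For example, wait_for_health("http://127.0.0.1:8000/health") could be called after starting uvicorn and before running seed-demo. The fetch parameter exists only so the loop can be exercised without a running server.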
Run the live NYT Wordle benchmark in a real browser:
python3 -m app.cli run-live-wordle

Run Wordle with a real OpenAI model decision loop:
export OPENAI_API_KEY=...
export GAME_BENCH_HEADLESS=0
python3 -m app.cli run-live-wordle-openai

Run Wordle with a separate, fully agentic browser-use loop in which the model chooses low-level UI actions:
export OPENAI_API_KEY=...
export GAME_BENCH_AGENTIC_HEADLESS=0
export GAME_BENCH_AGENTIC_KEEP_OPEN_SECONDS=20
python3 -m app.cli run-live-wordle-openai-agentic

Run the same agentic browser-use path with Claude computer use:
export ANTHROPIC_API_KEY=...
export GAME_BENCH_AGENTIC_HEADLESS=0
export GAME_BENCH_AGENTIC_KEEP_OPEN_SECONDS=20
python3 -m app.cli run-live-wordle-claude-agentic

Or use the generic provider-based entrypoint:
python3 -m app.cli run-live-wordle-agentic anthropic
python3 -m app.cli run-live-wordle-agentic openai

Trigger the agentic run through the internal admin endpoint:
export GAME_BENCH_ADMIN_TOKEN=...
curl -X POST http://127.0.0.1:8000/internal/runs/wordle-agentic \
-H "Authorization: Bearer $GAME_BENCH_ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"target_date":"2026-03-29","provider":"anthropic","model_id":"claude-sonnet-4-20250514"}'

CLI commands:
python3 -m app.cli seed-demo
python3 -m app.cli run-daily-benchmark
python3 -m app.cli run-live-wordle
python3 -m app.cli run-live-wordle-openai
python3 -m app.cli run-live-wordle-openai-agentic
python3 -m app.cli run-live-wordle-claude-agentic
python3 -m app.cli run-live-wordle-agentic anthropic
python3 -m app.cli fetch-daily-puzzles
python3 -m app.cli recompute-leaderboard

Environment variables:
- GAME_BENCH_DB_PATH: SQLite database path. Defaults to data/game_bench.db
- GAME_BENCH_DATA_ROOT: Root for persistent app data. Defaults to data
- GAME_BENCH_ARTIFACTS_ROOT: Artifact and replay directory. Defaults to <data-root>/artifacts
- GAME_BENCH_TIMEZONE: Canonical benchmark timezone. Defaults to UTC
- GAME_BENCH_ADMIN_TOKEN: Bearer token for internal run-trigger endpoints
- GAME_BENCH_HEADLESS: Set to 0 to show the Playwright browser window. Defaults to 1
- GAME_BENCH_KEEP_OPEN_SECONDS: How long a headed browser stays open after the run ends. Defaults to 10 when headed, 0 when headless
- GAME_BENCH_BROWSER_PROVIDER: Browser backend for live runs. Use browserbase in production, local in development. Defaults to local
- GAME_BENCH_AGENTIC_HEADLESS: Set to 0 to show the browser window for the separate computer-use agent. Defaults to 0
- GAME_BENCH_AGENTIC_KEEP_OPEN_SECONDS: How long the separate agentic browser stays open after the run ends. Defaults to 15 when headed
- GAME_BENCH_AGENTIC_MAX_TURNS: Max computer-use turns before the separate agentic run aborts. Defaults to 30
- OPENAI_COMPUTER_MODEL: OpenAI model for the separate agentic browser-use path. Defaults to gpt-5.4
- ANTHROPIC_API_KEY: Anthropic API key for Claude computer use
- ANTHROPIC_COMPUTER_MODEL: Anthropic model for the separate agentic browser-use path. Defaults to claude-sonnet-4-20250514
- BROWSERBASE_API_KEY: Optional Browserbase API key
- BROWSERBASE_PROJECT_ID: Optional Browserbase project id
- BROWSERBASE_REGION: Browserbase region for remote sessions. Defaults to us-west-2
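To make the interplay of these defaults concrete, here is a rough sketch of how a subset of the variables above could be resolved. It is illustrative only and not the repo's actual configuration code: the Settings class and load_settings function are hypothetical names, and treating every value other than 0 as "headless" is an assumption.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    data_root: Path
    db_path: Path
    artifacts_root: Path
    timezone: str
    headless: bool
    keep_open_seconds: int
    browser_provider: str


def load_settings(env: Mapping[str, str]) -> Settings:
    """Resolve settings from an environment mapping using the documented defaults."""
    data_root = Path(env.get("GAME_BENCH_DATA_ROOT", "data"))
    # Assumption: only the literal value "0" switches to a headed browser.
    headless = env.get("GAME_BENCH_HEADLESS", "1") != "0"
    # Documented default: 10 seconds when headed, 0 when headless.
    default_keep_open = "0" if headless else "10"
    return Settings(
        data_root=data_root,
        # The README gives a literal default here; whether the real default
        # follows GAME_BENCH_DATA_ROOT is not stated, so it is kept literal.
        db_path=Path(env.get("GAME_BENCH_DB_PATH", "data/game_bench.db")),
        # Artifacts default to a directory under the resolved data root.
        artifacts_root=Path(
            env.get("GAME_BENCH_ARTIFACTS_ROOT", str(data_root / "artifacts"))
        ),
        timezone=env.get("GAME_BENCH_TIMEZONE", "UTC"),
        headless=headless,
        keep_open_seconds=int(
            env.get("GAME_BENCH_KEEP_OPEN_SECONDS", default_keep_open)
        ),
        browser_provider=env.get("GAME_BENCH_BROWSER_PROVIDER", "local"),
    )
```

Passing os.environ as env resolves the settings for the current process. Taking a plain mapping instead of reading os.environ directly keeps the function easy to test.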
This repo includes render.yaml for a single persistent-disk web service plus a daily cron trigger.
Production defaults:
- FastAPI app runs as a single Render web service
- SQLite and artifacts live under /var/data
- Browser automation uses Browserbase via GAME_BENCH_BROWSER_PROVIDER=browserbase
- The cron job calls the protected /internal/runs/wordle-agentic endpoint
Before enabling the cron service, set:
- GAME_BENCH_BASE_URL
- GAME_BENCH_ADMIN_TOKEN
- OPENAI_API_KEY
- BROWSERBASE_API_KEY
- BROWSERBASE_PROJECT_ID
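As an illustration of what the cron trigger has to send, the sketch below builds the same POST request as the curl example for the internal admin endpoint, reading GAME_BENCH_BASE_URL and GAME_BENCH_ADMIN_TOKEN from an environment mapping. The function name build_agentic_run_request is hypothetical, and the actual cron command in render.yaml may work differently.

```python
import json
import urllib.request
from typing import Mapping


def build_agentic_run_request(
    env: Mapping[str, str],
    target_date: str,
    provider: str,
    model_id: str,
) -> urllib.request.Request:
    """Build (but do not send) the POST request for /internal/runs/wordle-agentic."""
    # Both variables must be set; a missing one raises KeyError early.
    base_url = env["GAME_BENCH_BASE_URL"].rstrip("/")
    token = env["GAME_BENCH_ADMIN_TOKEN"]
    # Same JSON fields as the curl example in this README.
    payload = {
        "target_date": target_date,
        "provider": provider,
        "model_id": model_id,
    }
    return urllib.request.Request(
        url=f"{base_url}/internal/runs/wordle-agentic",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The request can then be sent with urllib.request.urlopen(request). Building and sending are kept separate so the headers and payload can be inspected without touching a live service.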