game-bench

A daily browser-agent benchmark for logic puzzles.

What is included

A benchmark engine with pluggable model, puzzle, and sandbox adapters
Local SQLite persistence for puzzle instances, runs, steps, artifacts, and daily leaderboard rows
A FastAPI web app that shows the current leaderboard and run detail pages
A local development path with fixture-based puzzle adapters and a deterministic mock model
A production-oriented BrowserbaseSandboxProvider scaffold for remote browser sessions
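
To make "pluggable adapters" and "deterministic mock model" concrete, here is a minimal sketch of what a model adapter and a scripted mock could look like. The names ModelAdapter, MockModel, next_guess, and play are hypothetical and are not taken from this repo; the real interfaces under app/ may differ.

```python
from dataclasses import dataclass
from typing import Protocol


class ModelAdapter(Protocol):
    """Hypothetical interface: picks the next guess from the guesses so far."""

    def next_guess(self, history: list[str]) -> str: ...


@dataclass
class MockModel:
    """Deterministic stand-in model that replays a fixed script of guesses."""

    script: tuple[str, ...] = ("crane", "slate", "pious")

    def next_guess(self, history: list[str]) -> str:
        # The guess depends only on how many guesses were already made,
        # so repeated runs produce identical traces.
        return self.script[len(history) % len(self.script)]


def play(model: ModelAdapter, answer: str, max_turns: int = 6) -> list[str]:
    """Ask the model for guesses until it hits the answer or runs out of turns."""
    history: list[str] = []
    for _ in range(max_turns):
        guess = model.next_guess(history)
        history.append(guess)
        if guess == answer:
            break
    return history
```

Because the engine in this sketch only depends on the protocol, a real model client could be swapped in for MockModel without changing play.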

Quick start

python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload

Health check:

curl -s http://127.0.0.1:8000/health

Seed fixture/demo data:

python3 -m app.cli seed-demo

Then open http://127.0.0.1:8000.

Run the live NYT Wordle benchmark in a real browser:

python3 -m app.cli run-live-wordle

Run Wordle with a real OpenAI model decision loop:

export OPENAI_API_KEY=...
export GAME_BENCH_HEADLESS=0
python3 -m app.cli run-live-wordle-openai

Run Wordle with a separate fully agentic browser-use loop where the model chooses low-level UI actions:

export OPENAI_API_KEY=...
export GAME_BENCH_AGENTIC_HEADLESS=0
export GAME_BENCH_AGENTIC_KEEP_OPEN_SECONDS=20
python3 -m app.cli run-live-wordle-openai-agentic

Run the same agentic browser-use path with Claude computer use:

export ANTHROPIC_API_KEY=...
export GAME_BENCH_AGENTIC_HEADLESS=0
export GAME_BENCH_AGENTIC_KEEP_OPEN_SECONDS=20
python3 -m app.cli run-live-wordle-claude-agentic

Or use the generic provider-based entrypoint:

python3 -m app.cli run-live-wordle-agentic anthropic
python3 -m app.cli run-live-wordle-agentic openai

Trigger the agentic run through the internal admin endpoint:

export GAME_BENCH_ADMIN_TOKEN=...
curl -X POST http://127.0.0.1:8000/internal/runs/wordle-agentic \
  -H "Authorization: Bearer $GAME_BENCH_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"target_date":"2026-03-29","provider":"anthropic","model_id":"claude-sonnet-4-20250514"}'
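
The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl call above; the helper names build_trigger_request and trigger_run are illustrative and not part of the repo.

```python
import json
import urllib.request


def build_trigger_request(base_url: str, token: str, target_date: str,
                          provider: str, model_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the agentic run endpoint."""
    payload = {"target_date": target_date, "provider": provider, "model_id": model_id}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/internal/runs/wordle-agentic",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def trigger_run(request: urllib.request.Request) -> tuple[int, str]:
    """Send the request; this needs the app to be reachable at the given URL."""
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")
```

Keeping request construction separate from sending makes it easy to inspect the URL, headers, and body without a running server.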

Commands

python3 -m app.cli seed-demo
python3 -m app.cli run-daily-benchmark
python3 -m app.cli run-live-wordle
python3 -m app.cli run-live-wordle-openai
python3 -m app.cli run-live-wordle-openai-agentic
python3 -m app.cli run-live-wordle-claude-agentic
python3 -m app.cli run-live-wordle-agentic anthropic
python3 -m app.cli run-live-wordle-agentic openai
python3 -m app.cli fetch-daily-puzzles
python3 -m app.cli recompute-leaderboard

Environment

GAME_BENCH_DB_PATH: SQLite database path. Defaults to data/game_bench.db
GAME_BENCH_DATA_ROOT: Root for persistent app data. Defaults to data
GAME_BENCH_ARTIFACTS_ROOT: Artifact and replay directory. Defaults to <data-root>/artifacts
GAME_BENCH_TIMEZONE: Canonical benchmark timezone. Defaults to UTC
GAME_BENCH_ADMIN_TOKEN: Bearer token for internal run-trigger endpoints
GAME_BENCH_HEADLESS: Set to 0 to show the Playwright browser window. Defaults to 1
GAME_BENCH_KEEP_OPEN_SECONDS: How long a headed browser stays open after the run ends. Defaults to 10 when headed, 0 when headless
GAME_BENCH_BROWSER_PROVIDER: Browser backend for live runs. Use browserbase in production, local in development. Defaults to local
GAME_BENCH_AGENTIC_HEADLESS: Set to 0 to show the browser window for the separate computer-use agent. Defaults to 0
GAME_BENCH_AGENTIC_KEEP_OPEN_SECONDS: How long the separate agentic browser stays open after the run ends. Defaults to 15 when headed
GAME_BENCH_AGENTIC_MAX_TURNS: Max computer-use turns before the separate agentic run aborts. Defaults to 30
OPENAI_COMPUTER_MODEL: OpenAI model for the separate agentic browser-use path. Defaults to gpt-5.4
ANTHROPIC_API_KEY: Anthropic API key for Claude computer use
ANTHROPIC_COMPUTER_MODEL: Anthropic model for the separate agentic browser-use path. Defaults to claude-sonnet-4-20250514
BROWSERBASE_API_KEY: Optional Browserbase API key
BROWSERBASE_PROJECT_ID: Optional Browserbase project id
BROWSERBASE_REGION: Browserbase region for remote sessions. Defaults to us-west-2
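
As an illustration of how the documented defaults fit together, the sketch below resolves a subset of these variables into a settings object. Settings and load_settings are hypothetical names, not the repo's actual config module. It also assumes that any GAME_BENCH_HEADLESS value other than 0 means headless, which the list above does not spell out.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    data_root: Path
    db_path: Path
    artifacts_root: Path
    timezone: str
    headless: bool
    keep_open_seconds: int
    browser_provider: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Resolve a subset of the documented variables, applying the listed defaults."""
    data_root = Path(env.get("GAME_BENCH_DATA_ROOT", "data"))
    # Assumption: only the literal value "0" switches to a headed browser.
    headless = env.get("GAME_BENCH_HEADLESS", "1") != "0"
    # Documented default: 10 seconds when headed, 0 when headless.
    default_keep_open = "0" if headless else "10"
    return Settings(
        data_root=data_root,
        db_path=Path(env.get("GAME_BENCH_DB_PATH", "data/game_bench.db")),
        # Documented default: <data-root>/artifacts.
        artifacts_root=Path(
            env.get("GAME_BENCH_ARTIFACTS_ROOT", str(data_root / "artifacts"))
        ),
        timezone=env.get("GAME_BENCH_TIMEZONE", "UTC"),
        headless=headless,
        keep_open_seconds=int(env.get("GAME_BENCH_KEEP_OPEN_SECONDS", default_keep_open)),
        browser_provider=env.get("GAME_BENCH_BROWSER_PROVIDER", "local"),
    )
```

Passing a plain dict instead of os.environ, for example load_settings({"GAME_BENCH_HEADLESS": "0"}), is a convenient way to check which defaults apply to a given configuration.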

Render deployment

This repo includes render.yaml for a single persistent-disk web service plus a daily cron trigger.

Production defaults:

FastAPI app runs as a single Render web service
SQLite and artifacts live under /var/data
Browser automation uses Browserbase via GAME_BENCH_BROWSER_PROVIDER=browserbase
The cron job calls the protected /internal/runs/wordle-agentic endpoint
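
A daily trigger has to decide which target_date to send. The sketch below shows one way a cron script could derive it from the canonical benchmark timezone and assemble the request body. The function names are hypothetical, and the actual cron command in render.yaml may work differently.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def benchmark_date(now_utc: datetime, tz_name: str = "UTC") -> str:
    """Return the calendar date (YYYY-MM-DD) of `now_utc` in the benchmark timezone."""
    # UTC is handled without a tz database lookup so the default always works.
    tz = timezone.utc if tz_name == "UTC" else ZoneInfo(tz_name)
    return now_utc.astimezone(tz).date().isoformat()


def cron_payload(now_utc: datetime, provider: str, model_id: str,
                 tz_name: str = "UTC") -> dict[str, str]:
    """Assemble the JSON body fields used by /internal/runs/wordle-agentic."""
    return {
        "target_date": benchmark_date(now_utc, tz_name),
        "provider": provider,
        "model_id": model_id,
    }
```

Deriving the date from GAME_BENCH_TIMEZONE rather than the host clock keeps the cron run and the leaderboard on the same day boundary.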

Before enabling the cron service, set:

GAME_BENCH_BASE_URL
GAME_BENCH_ADMIN_TOKEN
OPENAI_API_KEY
BROWSERBASE_API_KEY
BROWSERBASE_PROJECT_ID
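
A small preflight check can catch a missing variable before the cron service is switched on. This is an optional sketch, not something shipped in the repo; missing_cron_settings is a hypothetical helper, and the variable names come from the list above.

```python
import os
from typing import Mapping

REQUIRED_FOR_CRON = (
    "GAME_BENCH_BASE_URL",
    "GAME_BENCH_ADMIN_TOKEN",
    "OPENAI_API_KEY",
    "BROWSERBASE_API_KEY",
    "BROWSERBASE_PROJECT_ID",
)


def missing_cron_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_FOR_CRON if not env.get(name, "").strip()]
```

An empty list from missing_cron_settings() means every variable listed above has a non-blank value in the current environment.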
