diff --git a/.claude/skills/benchflow/SKILL.md b/.claude/skills/benchflow/SKILL.md
index 1f21b5fd..c740791b 100644
--- a/.claude/skills/benchflow/SKILL.md
+++ b/.claude/skills/benchflow/SKILL.md
@@ -23,7 +23,7 @@ Arguments passed: `$ARGUMENTS`
 
 ### No args or `status` — show current state
 
-1. Check if benchflow is installed: `pip show benchflow`
+1. Check if benchflow is installed: `uv tool list | grep benchflow`
 2. Check if `.env` exists with API keys
 3. Check available agents: `benchflow agents`
 4. Show recent job results if any exist in `jobs/`
@@ -199,7 +199,7 @@ asyncio.run(main())
 ## Setup
 
 ```bash
-pip install benchflow # or: pip install -e . (from source)
+uv tool install benchflow # or: uv tool install -e . (from source)
 source .env # ANTHROPIC_API_KEY, DAYTONA_API_KEY
 ```
 
diff --git a/.claude/skills/benchflow/tasks/benchflow-knowledge/environment/benchflow/SKILL.md b/.claude/skills/benchflow/tasks/benchflow-knowledge/environment/benchflow/SKILL.md
index 7d3817bf..57af0d81 100644
--- a/.claude/skills/benchflow/tasks/benchflow-knowledge/environment/benchflow/SKILL.md
+++ b/.claude/skills/benchflow/tasks/benchflow-knowledge/environment/benchflow/SKILL.md
@@ -23,7 +23,7 @@ Arguments passed: `$ARGUMENTS`
 
 ### No args or `status` — show current state
 
-1. Check if benchflow is installed: `pip show benchflow`
+1. Check if benchflow is installed: `uv tool list | grep benchflow`
 2. Check if `.env` exists with API keys
 3. Check available agents: `benchflow agents`
 4. Show recent job results if any exist in `jobs/`
@@ -182,7 +182,7 @@ asyncio.run(main())
 ## Setup
 
 ```bash
-pip install benchflow # or: pip install -e . (from source)
+uv tool install benchflow # or: uv tool install -e . (from source)
 source .env # ANTHROPIC_API_KEY, DAYTONA_API_KEY
 ```
 
diff --git a/.claude/skills/benchflow/tasks/create-simple-task/environment/benchflow/SKILL.md b/.claude/skills/benchflow/tasks/create-simple-task/environment/benchflow/SKILL.md
index 7d3817bf..57af0d81 100644
--- a/.claude/skills/benchflow/tasks/create-simple-task/environment/benchflow/SKILL.md
+++ b/.claude/skills/benchflow/tasks/create-simple-task/environment/benchflow/SKILL.md
@@ -23,7 +23,7 @@ Arguments passed: `$ARGUMENTS`
 
 ### No args or `status` — show current state
 
-1. Check if benchflow is installed: `pip show benchflow`
+1. Check if benchflow is installed: `uv tool list | grep benchflow`
 2. Check if `.env` exists with API keys
 3. Check available agents: `benchflow agents`
 4. Show recent job results if any exist in `jobs/`
@@ -182,7 +182,7 @@ asyncio.run(main())
 ## Setup
 
 ```bash
-pip install benchflow # or: pip install -e . (from source)
+uv tool install benchflow # or: uv tool install -e . (from source)
 source .env # ANTHROPIC_API_KEY, DAYTONA_API_KEY
 ```
 
diff --git a/README.md b/README.md
index 4c38823d..e56d4553 100644
--- a/README.md
+++ b/README.md
@@ -21,10 +21,10 @@ BenchFlow runs AI agents against benchmark tasks in sandboxed environments. It s
 ## Install
 
 ```bash
-pip install benchflow==0.3.0a3
+uv tool install benchflow
 ```
 
-Requires Python 3.12+. For cloud sandboxes, set `DAYTONA_API_KEY`.
+Requires Python 3.12+ and [uv](https://docs.astral.sh/uv/). For cloud sandboxes, set `DAYTONA_API_KEY`.
 
 ## Quick Start
 
diff --git a/docs/api-reference.md b/docs/api-reference.md
index 52484595..786f0c87 100644
--- a/docs/api-reference.md
+++ b/docs/api-reference.md
@@ -5,7 +5,7 @@ The Trial/Scene API is the primary way to run agent benchmarks programmatically.
 ## Install
 
 ```bash
-pip install benchflow==0.3.0a3
+uv tool install benchflow
 ```
 
 ## Quick Start
diff --git a/docs/quickstart.md b/docs/quickstart.md
index e087ea74..34fc2dc1 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -4,14 +4,14 @@ Get a benchmark result in under 5 minutes.
 
 ## Prerequisites
 
-- Python 3.12+
+- Python 3.12+ and [uv](https://docs.astral.sh/uv/)
 - A Daytona API key (`DAYTONA_API_KEY`) for cloud sandboxes
 - An agent API key (e.g. `GEMINI_API_KEY` for Gemini)
 
 ## Install
 
 ```bash
-pip install benchflow==0.3.0a3
+uv tool install benchflow
 ```
 
 ## Run your first evaluation
diff --git a/docs/skill-eval-guide.md b/docs/skill-eval-guide.md
index f0a04c68..1268222e 100644
--- a/docs/skill-eval-guide.md
+++ b/docs/skill-eval-guide.md
@@ -5,7 +5,7 @@ Test whether your agent skill actually helps agents perform better.
 ## Install
 
 ```bash
-pip install benchflow==0.3.0a3
+uv tool install benchflow
 ```
 
 ## Overview
@@ -382,7 +382,7 @@ BenchFlow generates everything ephemeral — only results persist.
 **CI integration:**
 ```bash
 # In your skill's CI pipeline
-pip install benchflow==0.3.0a3
+uv tool install benchflow
 bench skills eval . -a claude-agent-acp --no-baseline
 # Exit code 1 if any case scores < 0.5
 ```