Merged
4 changes: 2 additions & 2 deletions .claude/skills/benchflow/SKILL.md
@@ -23,7 +23,7 @@ Arguments passed: `$ARGUMENTS`

### No args or `status` — show current state

-1. Check if benchflow is installed: `pip show benchflow`
+1. Check if benchflow is installed: `uv tool list | grep benchflow`
2. Check if `.env` exists with API keys
3. Check available agents: `benchflow agents`
4. Show recent job results if any exist in `jobs/`
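
The status steps above could be scripted roughly as follows. This is a minimal sketch, not part of BenchFlow: the helper names (`tool_is_listed`, `env_keys`, `recent_jobs`) are hypothetical, and the parser assumes `uv tool list` prints each tool as `<name> v<version>` followed by `- <executable>` lines.

```python
from pathlib import Path


def tool_is_listed(tool_list_output: str, name: str = "benchflow") -> bool:
    """Return True if `name` is the first token of a line in `uv tool list` output.

    Assumption: tools are printed as "<name> v<version>"; executable lines
    start with "- " and are therefore ignored by this check.
    """
    for line in tool_list_output.splitlines():
        parts = line.split()
        if parts and parts[0] == name:
            return True
    return False


def env_keys(env_text: str) -> set[str]:
    """Collect variable names from KEY=VALUE lines of a .env file."""
    keys: set[str] = set()
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        keys.add(key)
    return keys


def recent_jobs(jobs_dir: Path, limit: int = 5) -> list[str]:
    """List the most recently modified entries in jobs/, newest first."""
    if not jobs_dir.is_dir():
        return []
    entries = sorted(jobs_dir.iterdir(), key=lambda p: p.stat().st_mtime, reverse=True)
    return [p.name for p in entries[:limit]]
```

To use it, pass the captured output of `uv tool list` to `tool_is_listed`, the contents of `.env` to `env_keys`, and `Path("jobs")` to `recent_jobs`. Step 3 (`benchflow agents`) is left to the CLI itself.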
@@ -199,7 +199,7 @@ asyncio.run(main())
## Setup

```bash
-pip install benchflow # or: pip install -e . (from source)
+uv tool install benchflow # or: uv tool install -e . (from source)
source .env # ANTHROPIC_API_KEY, DAYTONA_API_KEY
```

@@ -23,7 +23,7 @@ Arguments passed: `$ARGUMENTS`

### No args or `status` — show current state

-1. Check if benchflow is installed: `pip show benchflow`
+1. Check if benchflow is installed: `uv tool list | grep benchflow`
2. Check if `.env` exists with API keys
3. Check available agents: `benchflow agents`
4. Show recent job results if any exist in `jobs/`
@@ -182,7 +182,7 @@ asyncio.run(main())
## Setup

```bash
-pip install benchflow # or: pip install -e . (from source)
+uv tool install benchflow # or: uv tool install -e . (from source)
source .env # ANTHROPIC_API_KEY, DAYTONA_API_KEY
```

@@ -23,7 +23,7 @@ Arguments passed: `$ARGUMENTS`

### No args or `status` — show current state

-1. Check if benchflow is installed: `pip show benchflow`
+1. Check if benchflow is installed: `uv tool list | grep benchflow`
2. Check if `.env` exists with API keys
3. Check available agents: `benchflow agents`
4. Show recent job results if any exist in `jobs/`
@@ -182,7 +182,7 @@ asyncio.run(main())
## Setup

```bash
-pip install benchflow # or: pip install -e . (from source)
+uv tool install benchflow # or: uv tool install -e . (from source)
source .env # ANTHROPIC_API_KEY, DAYTONA_API_KEY
```

4 changes: 2 additions & 2 deletions README.md
@@ -21,10 +21,10 @@ BenchFlow runs AI agents against benchmark tasks in sandboxed environments. It s
## Install

```bash
-pip install benchflow==0.3.0a3
+uv tool install benchflow
```

-Requires Python 3.12+. For cloud sandboxes, set `DAYTONA_API_KEY`.
+Requires Python 3.12+ and [uv](https://docs.astral.sh/uv/). For cloud sandboxes, set `DAYTONA_API_KEY`.

## Quick Start

2 changes: 1 addition & 1 deletion docs/api-reference.md
@@ -5,7 +5,7 @@ The Trial/Scene API is the primary way to run agent benchmarks programmatically.
## Install

```bash
-pip install benchflow==0.3.0a3
+uv tool install benchflow
```

## Quick Start
4 changes: 2 additions & 2 deletions docs/quickstart.md
@@ -4,14 +4,14 @@ Get a benchmark result in under 5 minutes.

## Prerequisites

-- Python 3.12+
+- Python 3.12+ and [uv](https://docs.astral.sh/uv/)
- A Daytona API key (`DAYTONA_API_KEY`) for cloud sandboxes
- An agent API key (e.g. `GEMINI_API_KEY` for Gemini)
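
A quick way to confirm these prerequisites before installing is sketched below. The function names are hypothetical and not part of BenchFlow. The sketch assumes cloud sandboxes are in use (so `DAYTONA_API_KEY` is treated as required) and that the agent key is `GEMINI_API_KEY`, as in the example above.

```python
import os
import shutil
import sys


def missing_prerequisites(
    env: dict[str, str],
    python_version: tuple[int, int],
    uv_found: bool,
    agent_key: str = "GEMINI_API_KEY",
) -> list[str]:
    """Return a human-readable list of unmet prerequisites (empty if all met)."""
    problems: list[str] = []
    if python_version < (3, 12):
        problems.append("Python 3.12+ is required")
    if not uv_found:
        problems.append("uv is not on PATH")
    # Assumption: both keys are needed; pass a different agent_key for other agents.
    for key in ("DAYTONA_API_KEY", agent_key):
        if not env.get(key):
            problems.append(f"{key} is not set")
    return problems


def check_current_environment() -> list[str]:
    """Run the checks against the live interpreter, PATH, and environment."""
    return missing_prerequisites(
        env=dict(os.environ),
        python_version=sys.version_info[:2],
        uv_found=shutil.which("uv") is not None,
    )
```

Calling `check_current_environment()` returns an empty list when everything is in place. Otherwise each entry names one thing to fix.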

## Install

```bash
-pip install benchflow==0.3.0a3
+uv tool install benchflow
```

## Run your first evaluation
4 changes: 2 additions & 2 deletions docs/skill-eval-guide.md
@@ -5,7 +5,7 @@ Test whether your agent skill actually helps agents perform better.
## Install

```bash
-pip install benchflow==0.3.0a3
+uv tool install benchflow
```

## Overview
@@ -382,7 +382,7 @@ BenchFlow generates everything ephemeral — only results persist.
**CI integration:**
```bash
# In your skill's CI pipeline
-pip install benchflow==0.3.0a3
+uv tool install benchflow
bench skills eval . -a claude-agent-acp --no-baseline
# Exit code 1 if any case scores < 0.5
```
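
The exit-code rule in the comment above can be illustrated with a small sketch. `ci_exit_code` is a hypothetical helper, not BenchFlow's implementation. It only restates the rule "exit 1 if any case scores below 0.5", and returning 0 for an empty score list is an assumption.

```python
def ci_exit_code(scores: list[float], threshold: float = 0.5) -> int:
    """Return 1 if any case scored strictly below `threshold`, otherwise 0."""
    return 1 if any(score < threshold for score in scores) else 0
```

A score of exactly 0.5 passes under this reading, since the rule is stated as a strict "less than".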