cline-bench early access

Real-world coding benchmarks derived from actual Cline user sessions. Tasks are challenging, verified, and represent genuine engineering problems solved in production.

Prerequisites

  • Python 3.13
  • uv (Python package manager)
  • Docker (for local testing)
  • Daytona API key (for cloud execution)
  • LLM Provider API key (Anthropic, OpenRouter, OpenAI, or OpenAI-compatible)

Directory Structure

cline-bench/
├── README.md           # This file
└── tasks/              # Benchmark tasks
    ├── 01k6kr5hbv8za80v8vnze3at8h-every-plugin-api-migration/
    ├── 01k6n26zm27ffa7qqbcx0prrnw-police-sync-segfault/
    ├── 01k6rkmyfgbwpvf7h81gh4pdgd-intercept-axios-error-handling/
    ├── 01k6zz0nyj31znwsevx4sn6zb2-telegram-plugin-refactor/
    ├── 01k7a12sd1nk15j08e6x0x7v9e-discord-trivia-approval-keyerror/
    ├── 01k7x8zyeg4nzx6ehdb0fg5gfx-terraform-azurerm-deployment-stacks/
    └── etc.

Each task contains:

  • instruction.md - Task description (agent input)
  • task.toml - Harbor configuration (timeouts, resources)
  • environment/Dockerfile - Container with broken initial state
  • solution/solve.sh - Oracle's ground truth solution
  • tests/ - Pytest verification suite
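The per-task layout above can be checked mechanically. This is an illustrative sketch (not part of Harbor) that verifies each task directory contains the files listed above:

```python
from pathlib import Path

# Required paths taken from the task layout described in this README.
REQUIRED = [
    "instruction.md",          # task description (agent input)
    "task.toml",               # Harbor configuration
    "environment/Dockerfile",  # container with broken initial state
    "solution/solve.sh",       # oracle ground-truth solution
    "tests",                   # pytest verification suite
]

def missing_files(task_dir: Path) -> list[str]:
    """Return the required paths that are absent from one task directory."""
    return [p for p in REQUIRED if not (task_dir / p).exists()]

def validate_tasks(tasks_root: Path) -> dict[str, list[str]]:
    """Map each task directory name to its list of missing files."""
    return {d.name: missing_files(d)
            for d in sorted(tasks_root.iterdir()) if d.is_dir()}
```

Running `validate_tasks(Path("tasks"))` returns an empty list for every complete task.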

Installation

# 1. Create Python 3.13 virtual environment
uv venv --python 3.13
source .venv/bin/activate

# 2. Install Harbor
uv tool install harbor

# 3. Verify installation
python --version  # Should show 3.13.x
which harbor      # Should show harbor binary location

Running Benchmarks

Environment Variables

Cline CLI supports multiple LLM providers via environment variables:

# Required: Cloud execution
export DAYTONA_API_KEY=dtn_your-key-here

# Required: API key for your provider
export API_KEY=your-api-key-here

# Optional: For openai provider only (custom OpenAI-compatible endpoints)
export BASE_URL=https://your-endpoint.com/v1

Model Name Format: provider:model-id

The provider is specified as a prefix on the model name, separated from the model ID by a colon (:).

Examples:

# Option 1: Anthropic Direct
export API_KEY=sk-ant-xxx
harbor run ... -m anthropic:claude-sonnet-4-20250514

# Option 2: OpenRouter
export API_KEY=sk-or-v1-xxx
harbor run ... -m openrouter:anthropic/claude-sonnet-4-5:1m

# Option 3: OpenAI Native
export API_KEY=sk-proj-xxx
harbor run ... -m openai-native:gpt-4o

# Option 4: OpenAI (custom OpenAI-compatible endpoints: local LLMs, Ollama, vLLM)
export API_KEY=your-key
export BASE_URL=http://localhost:8000/v1
harbor run ... -m openai:your-model-id
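The examples above show why the format must be split on the first colon only: OpenRouter model IDs such as anthropic/claude-sonnet-4-5:1m themselves contain a colon. A small illustrative parser (not part of Harbor) makes this explicit:

```python
def parse_model_name(name: str) -> tuple[str, str]:
    """Split 'provider:model-id' on the FIRST colon only, since model IDs
    like 'anthropic/claude-sonnet-4-5:1m' may themselves contain colons."""
    provider, sep, model_id = name.partition(":")
    if not sep or not provider or not model_id:
        raise ValueError(f"expected 'provider:model-id', got {name!r}")
    return provider, model_id
```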

Local Execution (Docker)

# Activate venv
source .venv/bin/activate

# Run single task with Oracle (ground truth)
harbor run -p tasks/01k7a12sd1nk15j08e6x0x7v9e-discord-trivia-approval-keyerror \
           -a oracle \
           --env docker

# Run with Cline CLI using Anthropic direct
export API_KEY=sk-ant-your-key
harbor run -p tasks/01k7a12sd1nk15j08e6x0x7v9e-discord-trivia-approval-keyerror \
           -a cline-cli \
           -m anthropic:claude-sonnet-4-20250514 \
           --env docker

# Run with Cline CLI using OpenRouter
export API_KEY=sk-or-v1-your-key
harbor run -p tasks/01k7a12sd1nk15j08e6x0x7v9e-discord-trivia-approval-keyerror \
           -a cline-cli \
           -m openrouter:anthropic/claude-sonnet-4-5:1m \
           --env docker

# Run all tasks
export API_KEY=sk-ant-your-key
harbor run -p tasks \
           -a cline-cli \
           -m anthropic:claude-sonnet-4-5:1m \
           --env docker

Local vs Cloud:

  • Docker (local): slower, uses your machine's resources, free
  • Daytona (cloud): faster, parallel execution, costs money

Recommendation: test single tasks with Docker first, then scale to Daytona for batch evaluation.

Cloud Execution (Daytona)

# Activate venv
source .venv/bin/activate

# Set environment variables
export DAYTONA_API_KEY=dtn_your-key
export API_KEY=sk-ant-your-key

# Run single task with Anthropic direct
harbor run -p tasks/01k7a12sd1nk15j08e6x0x7v9e-discord-trivia-approval-keyerror \
           -a cline-cli \
           -m anthropic:claude-sonnet-4-20250514 \
           --env daytona \
           --force-build

# Run single task with OpenRouter
export API_KEY=sk-or-v1-your-key
harbor run -p tasks/01k7a12sd1nk15j08e6x0x7v9e-discord-trivia-approval-keyerror \
           -a cline-cli \
           -m openrouter:anthropic/claude-sonnet-4-5:1m \
           --env daytona \
           --force-build

# Run all tasks on Daytona (batch execution)
export API_KEY=sk-ant-your-key
harbor run -p tasks \
           -a cline-cli \
           -m anthropic:claude-sonnet-4-5:1m \
           --env daytona \
           -n 200 \
           -k 1 \
           --max-retries 2 \
           --force-build

Note: Daytona has known issues with concurrent sandbox creation. If you encounter "Session is closed" errors during batch runs, run tasks sequentially or reduce concurrency.
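If you drive runs from a script, the sequential fallback can be combined with a generic retry helper. This sketch is not Harbor-specific; the error substring it matches is simply the one quoted above:

```python
import time

def retry(fn, attempts=3, delay=1.0, transient=("Session is closed",)):
    """Call fn(), retrying with a fixed delay when it raises an exception
    whose message contains a known transient substring; re-raise otherwise."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts or not any(t in str(exc) for t in transient):
                raise
            time.sleep(delay)
```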

Additional Provider Examples

Using OpenAI-compatible endpoints (local LLMs, Ollama, vLLM):

source .venv/bin/activate
export DAYTONA_API_KEY=dtn_your-key
export API_KEY=your-key
export BASE_URL=http://localhost:8000/v1

harbor run -p tasks/01k7a12sd1nk15j08e6x0x7v9e-discord-trivia-approval-keyerror \
           -a cline-cli \
           -m openai:your-model-name \
           --env daytona \
           --force-build

Comparing multiple models:

# Test Sonnet 4.5 via Anthropic direct
export API_KEY=sk-ant-your-key
harbor run -p tasks -a cline-cli -m anthropic:claude-sonnet-4-5:1m --env daytona -n 200 -k 1 --max-retries 2 --force-build

# Test Sonnet 4 via OpenRouter
export API_KEY=sk-or-v1-your-key
harbor run -p tasks -a cline-cli -m openrouter:anthropic/claude-sonnet-4-20250514 --env daytona -n 200 -k 1 --max-retries 2 --force-build

# Test GPT-4 via OpenRouter
export API_KEY=sk-or-v1-your-key
harbor run -p tasks -a cline-cli -m openrouter:openai/gpt-4 --env daytona -n 200 -k 1 --max-retries 2 --force-build

Interpreting Results

Results are written to the jobs/ directory:

jobs/
└── 2025-11-11__15-30-00/           # Job timestamp
    ├── config.json                 # Job configuration
    ├── result.json                 # Aggregate results
    └── <task-id>__<hash>/          # Trial directory (task-id + hash)
        ├── config.json             # Trial configuration
        ├── result.json             # Trial result (reward, timing, costs)
        ├── agent/                  # Agent execution logs
        │   ├── cline.txt           # Full Cline conversation log
        │   ├── install.sh          # Generated installation script
        │   ├── setup/              # Agent installation phase
        │   │   ├── stdout.txt      # Installation output
        │   │   ├── stderr.txt      # Installation errors
        │   │   └── return-code.txt # Exit code (0 = success)
        │   ├── command-0/          # First command (config setup)
        │   │   ├── command.txt     # Command that ran
        │   │   └── return-code.txt
        │   └── command-1/          # Second command (cline execution)
        │       ├── command.txt
        │       ├── stdout.txt      # Same as cline.txt (tee'd)
        │       └── return-code.txt
        └── verifier/               # Test results
            ├── reward.txt          # 1 (pass) or 0 (fail)
            ├── test-stdout.txt     # Pytest output
            └── test-stderr.txt     # Pytest warnings

Quick result check

# Find latest job
LATEST=$(ls -td jobs/2025-*/ | head -1)
echo "Latest: $LATEST"

# Check reward (1 = pass, 0 = fail)
cat ${LATEST}/*/verifier/reward.txt

# View test results
grep -E "PASSED|FAILED" ${LATEST}/*/verifier/test-stdout.txt

# Count passed/failed tests
grep -c "PASSED" ${LATEST}/*/verifier/test-stdout.txt
grep -c "FAILED" ${LATEST}/*/verifier/test-stdout.txt

# View the agent's last actions
tail -50 ${LATEST}/*/agent/cline.txt

# Check whether the agent installed successfully
cat ${LATEST}/*/agent/setup/return-code.txt  # Should be 0

Example of a successful run:

$ cat jobs/2025-11-11__10-47-23/01k7a12s...disco__fhSEuhr/verifier/reward.txt
1

$ tail -3 jobs/2025-11-11__10-47-23/01k7a12s...disco__fhSEuhr/verifier/test-stdout.txt
PASSED test_approval_session_fix.py::test_approval_session_creation
PASSED test_approval_session_fix.py::test_trivia_approval_function
========================= 2 passed, 1 warning in 5.70s =========================
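The per-trial reward.txt files can also be aggregated programmatically. This sketch assumes only the jobs/ layout shown above:

```python
from pathlib import Path

def pass_rate(job_dir: Path) -> float:
    """Aggregate binary rewards across all trials in one job directory.
    Each trial writes verifier/reward.txt containing 1 (pass) or 0 (fail)."""
    rewards = [int(p.read_text().strip())
               for p in sorted(job_dir.glob("*/verifier/reward.txt"))]
    if not rewards:
        raise ValueError(f"no reward.txt files under {job_dir}")
    return sum(rewards) / len(rewards)
```

For example, a job with one passing and one failing trial yields a pass rate of 0.5.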

Scoring

Binary pass/fail:

  • reward = 1.0 - All tests pass
  • reward = 0.0 - Any test fails

This mirrors real-world standards: code either works or it doesn't.

Resource Limits

Daytona Tier 3:

  • Max memory per sandbox: 8GB
  • All tasks validated to work within 8GB
  • Complex tasks (Qt WASM, Android) take 20-30 minutes

Common Issues

Venv not activated

Symptom: harbor: command not found or wrong harbor version

Fix:

source .venv/bin/activate
which harbor  # Verify: should show .venv/bin/harbor

Wrong Python version

Symptom: Installation errors or compatibility issues

Fix:

rm -rf .venv
uv venv --python 3.13
source .venv/bin/activate
uv tool install harbor

Architecture

cline-bench uses Harbor, the official successor to Terminal-Bench:

  • Agent-agnostic - Pre-integrated support for 10+ coding agents
  • Cloud-native - Daytona, Modal, E2B integration
  • Flexible - Docker compose for local, cloud for scale
  • Battle-tested - Production-ready framework from Laude Institute

Harbor repository: https://github.com/laude-institute/harbor

Citation

If you use cline-bench in your research:

@misc{cline-bench2025,
  title={cline-bench: Real-World Coding Benchmarks from Production Agent Sessions},
  author={[Your name/org]},
  year={2025},
  howpublished={\url{https://github.com/[your-repo]}}
}

Support

For issues, please open an issue on this repository.

Hill-climbing on cline-bench & Modifying Cline's Agentic Loop

To modify Cline's behavior (system prompt, tools, API shapes like Responses API), you need to:

  1. Create a PR to cline/cline
  2. Set the environment variable when running cline-cli in Harbor: export CUSTOM_CLINE_BUILD=

Recommendation: this process is involved during early access; consider waiting for the GA release, where it will be streamlined.
