Update adapter tutorial with new registry instructions #22

Open — crystalxyz wants to merge 2 commits into harbor-framework:main from crystalxyz:update-registry-format
@@ -23,16 +23,20 @@ See [this section](#translating-terminal-bench-adapters-to-harbor) to learn about

```bash
# List available datasets
-harbor datasets list
+harbor dataset list

# Start the interactive wizard to create a new adapter
-harbor adapters init
+harbor adapter init

# Initialize with specific arguments (skipping some prompts)
-harbor adapters init my-adapter --name "My Benchmark"
+harbor adapter init my-adapter --name "My Benchmark"
```

-Use the above commands to view our supported datasets and start creating your new ones. The `harbor adapters init` command will create starter code and template files.
+Use the above commands to view our supported datasets and start creating your new ones. The `harbor adapter init` command will create starter code and template files.

+<Callout title="CLI command changes">
+Harbor CLI commands have moved from plural to singular form (e.g., `harbor datasets` → `harbor dataset`, `harbor adapters` → `harbor adapter`, `harbor jobs` → `harbor job`, `harbor trials` → `harbor trial`). The old plural forms are still supported for backwards compatibility.
+</Callout>

For more details about what adapters are and how we ensure equivalence between the original benchmark and its harbor adapter, please continue reading.
@@ -49,7 +53,7 @@ Here's a quick look at the typical steps:
5. **[Run Parity Experiments](#5-run-parity-experiments):** Run parity experiments to verify your adapter's performance against the original benchmark baseline results.
6. **[Record Parity Results](#6-record-parity-results):** Formally document the performance comparison in `parity_experiment.json`.
7. **[Upload Parity Results](#7-upload-parity-results):** Upload parity and oracle results to the HuggingFace dataset repository.
-8. **[Register the Dataset](#8-register-the-dataset):** Add your new tasks to the official dataset repository and registry, then verify with `--registry-path`.
+8. **[Register the Dataset](#8-register-the-dataset):** Prepare your dataset for the [Harbor Registry](https://registry.harborframework.com) using `harbor init` and `dataset.toml`, then coordinate with the Harbor team to publish.
9. **[Document and Submit](#9-document-and-submit):** Document your adapter's usage, parity results, and comprehensive adaptation details in a `README.md`, then submit your work through a pull request.

We'll break down each step in detail below. Let's get started!
@@ -193,44 +197,42 @@ There are several ways to run Harbor harness on your adapter:
**Option 1: Using individual trials (for testing single tasks)**
```bash
# Run oracle agent on a single task
-uv run harbor trials start -p datasets/<your-adapter-name>/<task-id>
+harbor trial start -p datasets/<your-adapter-name>/<task-id>

# Run with specific agent and model
-uv run harbor trials start -p datasets/<your-adapter-name>/<task-id> -a <agent-name> -m <model-name>
+harbor trial start -p datasets/<your-adapter-name>/<task-id> -a <agent-name> -m <model-name>
```

**Option 2: Using jobs with local dataset path**
```bash
# Run on entire local dataset
-uv run harbor jobs start -p datasets/<your-adapter-name> -a <agent-name> -m <model-name>
+harbor job start -p datasets/<your-adapter-name> -a <agent-name> -m <model-name>
```

**Option 3: Using jobs with configuration file**. Refer to [harbor/examples/configs](https://github.com/laude-institute/harbor/tree/main/examples/configs) for configuration examples. It's highly recommended to write a reference config file for your adapter to ensure reproducibility.
```bash
# Create a job config YAML (see harbor/examples/configs/ for examples)
-uv run harbor jobs start -c adapters/<your-adapter-name>/<config>.yaml -a <agent-name> -m <model-name>
+harbor job start -c adapters/<your-adapter-name>/<config>.yaml -a <agent-name> -m <model-name>
```

-**Option 4: Using local registry after your [dataset registry PR](#8-register-the-dataset) gets merged**. This step is required to check the correctness of (1) your registered dataset and (2) your updated `registry.json` in the Harbor repository. If this run successfully passes all oracle tests, then after your adapter PR gets merged, people can directly use `-d` without `--registry-path` to run evaluation (the same way as Option 5).
+**Option 4: Using registry dataset (after [publishing](#8-publish-to-the-registry))**. Registry testing is only available after the dataset has been published, which ensures the correct data structure.
```bash
-# Run from registry
-uv run harbor jobs start -d <your-adapter-name> --registry-path registry.json -a <agent-name> -m "<model-name>"
-```
+# Single task
+harbor run -t terminal-bench/adaptive-rejection-sampler -a <agent-name> -m <model-name>

-**Option 5: Using registry dataset (after registration and all PRs merged)**
-```bash
-# Run from registry
-uv run harbor jobs start -d <your-adapter-name> -a <agent-name> -m "<model-name>"
+# Entire dataset
+harbor run -d terminal-bench/terminal-bench-2 -a <agent-name> -m <model-name>
```

-You should include instructions for running in multiple ways in the `README.md` for your adapter, following the [Harbor adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md). **Note that the order of these options is organized differently in the final adapter README**. This is because from the user's perspective, Option 5 is the primary way to run the adapter without needing to prepare task directories; the adapter code and other running methods are mainly used for development and reproduction.
+You should include instructions for running in multiple ways in the `README.md` for your adapter, following the [Harbor adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md). **Note that the order of these options is organized differently in the final adapter README**. This is because from the user's perspective, Option 4 (registry) is the primary way to run the adapter without needing to prepare task directories; the adapter code and other running methods are mainly used for development and reproduction.

#### 3.1 Verify Oracle Solutions Pass 100%

Before proceeding further, you must ensure that all oracle solutions pass with a 100% reward. Run the oracle agent on your entire dataset:

```bash
-uv run harbor jobs start -p datasets/<your-adapter-name>
+harbor job start -p datasets/<your-adapter-name>
```

Once you've verified that all oracle solutions pass, you can create a Work-In-Progress (WIP) pull request to the Harbor repository:
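Section 3.1's requirement — every oracle solution at exactly 1.0 reward — is easy to check mechanically once you have per-task rewards. A minimal sketch, assuming a hypothetical `rewards.txt` of `<task-id> <reward>` lines (Harbor's actual results layout may differ; adapt the parsing accordingly):

```shell
# Hypothetical reward dump: one "<task-id> <reward>" pair per line.
# This file format is an assumption for illustration, not Harbor's real output.
cat > rewards.txt <<'EOF'
task-a 1.0
task-b 1.0
task-c 0.5
EOF

# Print any task whose oracle reward is not exactly 1.0
awk '$2 != 1.0 { print "FAIL:", $1 }' rewards.txt
```

Any `FAIL:` line here means the corresponding task needs fixing before you open the WIP PR.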
@@ -286,7 +288,7 @@ This approach has two important implications:
uv run run_adapter.py --output-dir /path/to/output
```

-2. **Registry Version Naming:** When uploading the dataset to the registry, use the version name `"parity"` instead of `"1.0"` or `"2.0"` to avoid confusion. This allows users to run parity reproduction using `-d <adapter_name>@parity` while keeping the full dataset available separately.
+2. **Registry Version Naming:** When publishing the dataset to the [Harbor Registry](https://registry.harborframework.com), use the version name `"parity"` instead of `"1.0"` or `"2.0"` in your `dataset.toml` to avoid confusion. This allows users to run parity reproduction using `-d <adapter_name>@parity` while keeping the full dataset available separately.

</Callout>
@@ -379,6 +381,7 @@ adapters/

### 8. Register the Dataset

+#### 8.1 Generate dataset
Once your adapter correctly generates tasks and you verify the parity experiments, you should add them to the official [Harbor datasets repository](https://github.com/laude-institute/harbor-datasets).

- **Fork and clone the dataset repository:**
@@ -392,56 +395,54 @@ Once your adapter correctly generates tasks and you verify the parity experiments:
# Specify custom path to the harbor-datasets repo
uv run run_adapter.py --output-dir /path/to/harbor-datasets/datasets/<your-adapter-name>
```
+- Generate `dataset.toml`:
+```bash
+# Initialize the dataset (creates dataset.toml, auto-detects tasks in the directory)
+cd harbor-datasets/datasets/<your-adapter-name>
+harbor init
+# Select "dataset" when prompted
+```
+- Edit the generated `dataset.toml` to fill in the required metadata. Your dataset description should include:
+  - **Parity experiment results:** A summary of your parity findings (see [Step 6](#6-record-parity-results))
+  - **Adapter author credits:** Names and contact information for the adapter contributors
+  - **Any other acknowledgment:** e.g., funding support
-- **Pull Request:** Create a pull request to the `harbor-datasets` repository. It's recommended to link the original benchmark's GitHub repository in your PR. Request @Slimshilin for review, and he will merge it so you can try `--registry-path` for the `harbor` harness. You may always submit another PR to update the dataset registry.

+**Version naming:** Use `"1.0"` by default. If the original benchmark has named versions (e.g., "verified", "lite"), follow their naming. If you ran parity experiments on a subset (see [Section 4](#4-discuss-parity-plans-and-implement-agents)), use `"parity"` for the parity subset to allow users to run `-d <adapter_name>@parity` for parity reproduction.
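The metadata requirements above can be pictured with a short `dataset.toml` sketch. The field names and layout here are assumptions for illustration only — defer to whatever `harbor init` actually generates:

```toml
# Illustrative sketch — not the authoritative schema generated by harbor init.
name = "my-adapter"     # hypothetical adapter name
version = "1.0"         # or "parity" for a parity subset; follow original splits if named
description = """
Adapter for <original benchmark> (link the original repository here).
Parity experiment summary: see parity_experiment.json and the adapter README.
Adapter authors: <names and contact info>. Acknowledgments: <funding, etc.>.
"""
```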
-Then you should navigate to the `harbor` repository (not the dataset repo!) and add a new entry to the `registry.json` file in the root. **Note:** Harbor's registry format uses task-level entries with Git URLs.
+#### 8.2 Test Locally
+Before submitting for publishing, verify your dataset works correctly using the `-p` path parameter:

-**Version naming:** Use `"version": "1.0"` by default. If there's a version naming of the original benchmark (e.g., "verified", "lite"), then please follow their naming. If you ran parity experiments on a subset (see [Section 4](#4-discuss-parity-plans-and-implement-agents)), use `"version": "parity"` for the parity subset to allow users to run `-d <adapter_name>@parity` for parity reproduction.

-For example:
-```json
-[
-  // existing entries...
-  {
-    "name": "<your-adapter-name>",
-    "version": "1.0", // or "parity" if this is a parity subset; if there are named splits from the original benchmark, use them accordingly
-    "description": "A brief description of the adapter. Original benchmark: [URL]. More details at [adapter README URL].",
-    "tasks": [
-      {
-        "name": "<task-id-1>",
-        "git_url": "https://github.com/laude-institute/harbor-datasets.git",
-        "git_commit_id": "<commit-hash>",
-        "path": "datasets/<your-adapter-name>/<task-id-1>"
-      },
-      {
-        "name": "<task-id-2>",
-        "git_url": "https://github.com/laude-institute/harbor-datasets.git",
-        "git_commit_id": "<commit-hash>",
-        "path": "datasets/<your-adapter-name>/<task-id-2>"
-      }
-      // ... more tasks
-    ]
-  }
-]
-```
+```bash
+# Run oracle agent on your local dataset
+harbor job start -p /path/to/your/dataset
+```

-For initial development, you can use `"git_commit_id": "head"` to reference the latest commit, but for production you should pin to a specific commit hash for reproducibility.
+<Callout title="Registry testing is only available post-publish">
+You cannot test against the registry (using `-d`) until the dataset has been published. This ensures the published data structure is correct. Use `-p` (local path) for all pre-publish testing.
+</Callout>

-#### 8.1 Verify Registry Configuration
+#### 8.3 Submit for Publishing
+Include your tasks directory and `dataset.toml` in your adapter PR.

-**Important:** After your dataset registry PR is merged, you must verify that your registered dataset and `registry.json` are correctly configured. Run the following command to test oracle solutions using the registry:
+Once your adapter PR gets approved, the Harbor team will review and publish the dataset to the registry:

```bash
-# Run from registry
-uv run harbor jobs start -d <your-adapter-name> --registry-path registry.json
+# (Harbor team) Authenticate with the registry via GitHub
+harbor auth login

+# (Harbor team) Publish the dataset (supports optional tags and concurrency settings)
+harbor publish
```

-Once the oracle tests pass successfully, **take a screenshot of your terminal** showing both:
-- The command you ran
-- The successful oracle test logs/results
+#### 8.4 Verify Post-Publish

+Once the dataset is published to the registry, verify that it loads and runs correctly:

-**Paste this screenshot in your adapter PR** to demonstrate that the registry configuration is correct and that the dataset can be successfully loaded from the registry.
+```bash
+# Run oracle agent from the registry
+harbor run -d <your-adapter-name>
+```

### 9. Document and Submit
@@ -531,7 +532,7 @@ The following table summarizes the main differences between Terminal-Bench and Harbor:
| **Docker Compose** | `docker-compose.yaml` in task root | Not typically used per-task |
| **Default Output Directory** | `tasks/<adapter-name>` | `datasets/<adapter-name>` |
| **Registry Format** | Dataset-level with `dataset_path` | Task-level with `git_url` and `path` per task |
-| **CLI Commands** | `tb run --dataset` / `tb run --dataset-path` | `harbor jobs start -d` / `harbor trials start -p` |
+| **CLI Commands** | `tb run --dataset` / `tb run --dataset-path` | `harbor run -d` / `harbor run -t` / `harbor job start -p` |
| **Metrics** | Resolved rate (binary pass/fail per task) | Rewards that support multiple metrics and float-type values |

**IMPORTANT:** If the Terminal-Bench adapter used a tweaked metric (e.g., threshold-based scoring, ignoring certain metrics), then you'll need to re-implement the adapter for Harbor to support the **original** metrics used by the benchmark, as Harbor now supports multiple metrics as rewards.
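The metrics distinction in the table — binary resolved rate versus float-valued rewards — can be sketched with plain shell arithmetic. The reward values below are made up for illustration:

```shell
# Hypothetical per-task rewards (floats, as Harbor allows; Terminal-Bench
# would instead record a binary resolved/unresolved outcome per task).
printf '0.85\n1.0\n0.4\n' |
  awk '{ sum += $1; n += 1 } END { printf "mean reward: %.2f\n", sum / n }'
```

With binary scoring the third task would simply count as a failure; float rewards preserve the partial credit the original benchmark assigned.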
@@ -665,6 +666,8 @@ fi

#### Step 5: Update Registry Format

+Terminal-Bench uses a `registry.json` file with dataset-level entries. Harbor uses the [Harbor Registry](https://registry.harborframework.com) with `dataset.toml` configuration and the `harbor publish` workflow.

**Terminal-Bench registry.json:**
```json
{

@@ -677,24 +680,19 @@ fi
}
```

-**Harbor registry.json:**
-```json
-{
-  "name": "my-adapter",
-  "version": "1.0",
-  "description": "...",
-  "tasks": [
-    {
-      "name": "task-1",
-      "git_url": "https://github.com/laude-institute/harbor-datasets.git",
-      "git_commit_id": "abc123",
-      "path": "datasets/my-adapter/task-1"
-    }
-    // ... one entry per task
-  ]
-}
-```
+**Harbor registry (dataset.toml + publish):**
+```bash
+# Initialize dataset configuration (auto-detects tasks)
+harbor init # select "dataset"

+# Edit dataset.toml with descriptions, authors, credits
+# Then submit to Harbor team for publishing
+harbor auth login

> **Contributor (review comment):** Probably best not to include these steps bc they may get confused and publish without us?

+harbor publish
+```

+See [Step 8: Register the Dataset](#8-register-the-dataset) for the full publishing workflow.

### Getting Help

If you have any questions about translating your Terminal-Bench adapter to Harbor, please ask in the `#adapters-spam` channel in our [Discord](https://discord.com/invite/6xWPKhGDbA) or reach out to [Lin Shi](mailto:ls2282@cornell.edu).
> **Contributor (review comment):** You don't need this callout bc we made all the commands backwards compatible in case ppl have muscle memory or stale docs