feat: replace pipeline with agent-build trial tooling#54
feat: replace pipeline with agent-build trial tooling#54EdwardIrby wants to merge 7 commits intomainfrom
Conversation
Full replacement of src/ and bin/ with trial runner, schemas, and CLI utilities ported from agent-build. Strips BP-specific types (no RISK_TAG, SelectionBid, TOOL_STATUS). Inlines simplified TrajectoryStepSchema with 3 step types. BREAKING: New CLI entry point, new exports structure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
PR 54 code review - see inline findings below |
These skills taught the old pipeline's concepts (Docker evals, calibration flow, headless browser adapters) which no longer exist after the trial-runner replacement. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR Review: feat/trial-runner-replacementThis is a large-scale replacement PR — nearly the entire codebase is removed and replaced with a minimal new structure. Here is my analysis based on the PR description, commit messages, and changed file list. Summary of Changes
The PR strips BP-specific types, removes the headless adapter layer, the pipeline commands, and the full harness — replacing everything with a new CLI entry point ( Issues Found1. Test coverage gap — new code has zero testsThe only new source files are Additionally, the test plan claims 2. Stub implementations shipped as preserved contractThe PR summary states: But also claims "Downstream contract preserved." A stub is not a preserved contract — it is a placeholder. Any consumers relying on
3. AGENTS.md skills table is now stale
Both of those skill directories are deleted in this PR. The replacements ( 4. Two unchecked test plan itemsThe PR test plan explicitly leaves these unchecked:
These are the two most critical integration checks. The PR should not be merged until both are verified and checked, or the intent of leaving them unchecked should be explained. 5.
|
Gemini Code Assist MCP server config from the old pipeline. No longer relevant after the trial-runner replacement. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Old integration test wrapper for the previous pipeline's integration_tests/ directory which no longer exists. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Delete docker-compose.test.yml, Dockerfile.test, and the test-integration CI job — all referenced integration_tests/ which no longer exists. CI now runs a single test job with check + test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
API keys were for the deleted Docker integration tests. The trial runner delegates credential management to adapters. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR Review: feat/trial-runner-replacementI read all new source files directly from the merged state. Here is a fresh review based on the actual code. Bug:
|
PR Review: feat: replace pipeline with agent-build trial toolingThis is a significant rewrite — ~24K lines deleted and ~2.7K added. The direction (simplifying away BP-specific types, removing headless adapter complexity) is reasonable, but there are several issues that should be addressed before merging. Critical Issues1. No tests for new source code
The testing rules require happy path, edge cases, and error paths. The CLI routing and utility functions need at minimum:
2. Stub implementations shipped as preserved contractThe PR description says: A stub is not a preserved contract. Any downstream consumer calling 3. Two unchecked test plan items are the most critical verificationsThe PR test plan leaves unchecked:
These aren't optional — they're the smoke tests for the two primary promises of the PR. Please check these before merging. Moderate Issues4. AGENTS.md skills table is stale
Both skill directories are deleted in this PR. The replacements ( 5. Package.json export contract needs explicit verification
6. "51 tests, 0 failures" claim needs clarificationAll What's Good
Recommended Before Merge
|
AGENTS.md rewritten to reflect new trial runner architecture: updated structure diagram, skills table, capabilities, commands. Removed stale Docker test references and old skill names. Added src/tests/cli.spec.ts with coverage for: - CLI routing (trials/compare/calibrate/unknown/no-command) - parseCli meta flags (--help, -h, --schema input) - Input validation (valid JSON, invalid JSON, missing input) - Export contract verification (runTrial, schemas) All 65 tests pass. Types clean. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code ReviewReviewed against AGENTS.md conventions. CI passes (test + CodeQL green). Here are my findings: Bugs / Correctness
The PR description states "Downstream contract preserved" but both subcommands call
Two test plan items are unchecked These should be verified or explicitly deferred with a follow-up issue linked. Convention Violations (AGENTS.md)
Per the Bun APIs rule: prefer
The convention is arrow functions: Documentation
The The new flat structure is: The This should be updated before merge. Test CoverageThe 723-line
Scale / ScopeThis is a near-total rewrite (2,704 additions / 23,801 deletions across 114 files). CI passes, which is reassuring, but the removed Docker integration tests ( Summary
The core implementation quality is good. The blocking items before merge are: updating |
|
Follow-up review after reading actual branch code. Correcting the earlier review and adding new findings. |
|
CORRECTION from earlier review: Tests DO exist The earlier review claimed no test files were added and questioned the 51-test claim. After reading the actual branch, this is wrong. The PR adds 7 new source files, not 2:
The 51-test count is accurate. The cli.ts import of ./trial.ts is valid. The earlier review was working from the files-changed list in the PR metadata, which only showed 2 new files — likely a rendering issue with the large diff. |
|
Code issue 1: debug parameter is dead code TrialInputSchema and TrialConfig both accept a debug field. It is forwarded from the CLI through to runTrial config. But inside runTrial, the destructuring omits debug entirely and it is never read. Either add debug logging (e.g. log adapter stdout, raw timing, per-trial inputs) or remove it from the schema, type, and CLI input schema. As-is it is dead code that adds surface area without any behavior. |
|
Code issue 2: resolvePath uses string concatenation instead of node:path Current implementation in trial.utils.ts: AGENTS.md rule: Keep node:path for join/resolve/dirname. Manual concatenation can produce double slashes and does not normalize .. segments. The fix is a one-liner: resolve() from node:path already handles absolute vs relative paths correctly, making the startsWith check unnecessary. |
|
Code issue 3: Promise.race timeout leaks the adapter Promise In trial.ts inside processPrompt, the timeout pattern is: Two issues:
At minimum, store the timer handle and clear it in the adapter path. For full correctness, document that adapters cannot be cancelled mid-flight and that running k=N concurrently may have up to N zombie adapter processes after a timeout. |
|
Code issue 4 and 5: AGENTS.md is stale; compare/calibrate behavior clarification Issue 4 (confirmed from earlier review): AGENTS.md still documents agent-eval-harness and headless-adapters skills in the Skills table, both of which are deleted in this PR. The three new skills (trial-runner, trial-adapters, compare-trials) need to be added to the table before merge. Issue 5 (correction of earlier review): The earlier review said compare and calibrate silently get no output. The actual behavior in cli.ts is that both subcommands call process.exit(1) with an explicit error message: 'compare command not yet implemented'. This is explicit failure, not silent failure. The concern about shipping unimplemented subcommands still stands, but the behavior is correct for a stub. |
|
What is well done After reading the actual code:
Summary of blockers before merge
|
PR Review: feat/trial-runner-replacement (post-fixup pass)The fixup commit ( What the fixup resolved
Remaining issues1.
|
| Area | Status |
|---|---|
| AGENTS.md docs | ✅ Updated |
| CLI tests | ✅ Added |
| Core trial runner logic | ✅ Solid |
| CI (unit tests, CodeQL) | ✅ Green |
debug dead code |
|
resolvePath path handling |
path.resolve() |
| Stub exit code framing | |
appendFile exception |
ℹ️ Add inline comment |
The two blocking items before merge are debug dead code and resolvePath path handling. The rest are minor polish.
Summary
Downstream contract preserved
import type { Grader } from '@plaited/agent-eval-harness/schemas'✓bunx @plaited/agent-eval-harness trials ...✓bunx @plaited/agent-eval-harness compare(stub)bunx @plaited/agent-eval-harness calibrate(stub)Test plan
bun --bun tsc --noEmitpassesbun test src/passes (51 tests, 0 failures)🤖 Generated with Claude Code