feat: replace pipeline with agent-build trial tooling by EdwardIrby · Pull Request #54 · plaited/agent-eval-harness

EdwardIrby · 2026-03-11T04:23:03Z

Summary

Full replacement of existing src/ and bin/ with trial runner tooling ported from agent-build
Strips all BP-specific types (RISK_TAG, SelectionBid, TOOL_STATUS, AGENT_EVENTS)
Inlines simplified TrajectoryStepSchema with 3 core step types (thought, message, tool_call)
New unified CLI entry point with trials/compare/calibrate subcommands
Skills copied to .agents/skills/

Downstream contract preserved

import type { Grader } from '@plaited/agent-eval-harness/schemas' ✓
bunx @plaited/agent-eval-harness trials ... ✓
bunx @plaited/agent-eval-harness compare (stub)
bunx @plaited/agent-eval-harness calibrate (stub)

Test plan

bun --bun tsc --noEmit passes
bun test src/ passes (51 tests, 0 failures)
Verify CLI entry point routes correctly
Verify downstream consumers can import schemas

🤖 Generated with Claude Code

Full replacement of src/ and bin/ with trial runner, schemas, and CLI utilities ported from agent-build. Strips BP-specific types (no RISK_TAG, SelectionBid, TOOL_STATUS). Inlines simplified TrajectoryStepSchema with 3 step types. BREAKING: New CLI entry point, new exports structure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude · 2026-03-11T04:28:31Z

PR 54 code review - see inline findings below

These skills taught the old pipeline's concepts (Docker evals, calibration flow, headless browser adapters) which no longer exist after the trial-runner replacement. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude · 2026-03-11T04:57:43Z

PR Review: feat/trial-runner-replacement

This is a large-scale replacement PR — nearly the entire codebase is removed and replaced with a minimal new structure. Here is my analysis based on the PR description, commit messages, and changed file list.

Summary of Changes

Category	Count
Files deleted	~95
Files added	8
Net lines removed	~15,000+

The PR strips BP-specific types, removes the headless adapter layer, the pipeline commands, and the full harness — replacing everything with a new CLI entry point (src/cli.ts, src/cli.utils.ts) and three new skills (compare-trials, trial-adapters, trial-runner).

Issues Found

1. Test coverage gap — new code has zero tests

The only new source files are src/cli.ts (22 lines) and src/cli.utils.ts (99 lines). No test files are added for either. Per the testing rules in AGENTS.md, tests are required — especially for CLI utilities where happy path, edge cases, and error paths all need coverage.

Additionally, the test plan claims bun test src/ passes (51 tests, 0 failures), but the PR deletes every *.spec.ts file that previously lived in src/. If all 15+ test files are removed, bun test src/ should find ~0 tests, not 51. This discrepancy needs clarification — are those 51 tests from an unreferenced location, or was the test plan run against main rather than the PR branch?

2. Stub implementations shipped as preserved contract

The PR summary states:

bunx @plaited/agent-eval-harness compare  (stub)
bunx @plaited/agent-eval-harness calibrate  (stub)

But also claims "Downstream contract preserved." A stub is not a preserved contract — it is a placeholder. Any consumers relying on compare or calibrate will silently get no output. These should either:

Be fully implemented before merging, or
Be explicitly removed from the CLI and docs until they are ready

3. AGENTS.md skills table is now stale

AGENTS.md currently documents these skills:

| agent-eval-harness | capture, trials, summarize, calibrate, validate-refs, balance, schemas, run, extract, grade, format, compare |
| headless-adapters  | headless |

Both of those skill directories are deleted in this PR. The replacements (compare-trials, trial-adapters, trial-runner) are added, but AGENTS.md is not updated to reflect the new skill names and command surface. This will cause confusion for any agent using the skill table as a reference.

4. Two unchecked test plan items

The PR test plan explicitly leaves these unchecked:

[ ] Verify CLI entry point routes correctly
[ ] Verify downstream consumers can import schemas

These are the two most critical integration checks. The PR should not be merged until both are verified and checked, or the intent of leaving them unchecked should be explained.

5. `package.json` exports — verify schema export contract

The PR modifies package.json with 7 additions and 16 deletions. The claim import type { Grader } from '@plaited/agent-eval-harness/schemas' is preserved, but given the extensive restructuring, this should be explicitly verified with a consumer test or export audit. The old src/schemas/schemas.ts (1073 lines) is deleted — confirm the replacement exposes all public types that downstream packages depend on.
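
An export audit of the kind suggested above could be as small as the following sketch. The `hasSubpathExport` helper and the manifest contents are hypothetical placeholders, not the repository's actual package.json; only the `./schemas` subpath name comes from the PR description.

```typescript
// Hypothetical audit helper: checks that a manifest declares a given subpath export.
type PackageJson = { exports?: Record<string, unknown> | string }

const hasSubpathExport = (pkg: PackageJson, subpath: string): boolean => {
  const { exports } = pkg
  if (typeof exports !== 'object' || exports === null) return false
  return Object.keys(exports).includes(subpath)
}

// Placeholder manifest for illustration only; the target paths are invented.
const manifest: PackageJson = {
  exports: {
    '.': './src/example-entry.ts',
    './schemas': './src/example-schemas.ts',
  },
}

console.log(hasSubpathExport(manifest, './schemas'))
console.log(hasSubpathExport(manifest, './pipeline'))
```

A check like this only confirms that the subpath is declared. Whether the Grader type is still exported from it would need a consumer-side type import.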

What Looks Good

Conventional commit messages with co-author attribution
The simplification rationale is clear — removing BP-specific types reduces coupling
bun --bun tsc --noEmit passing suggests the TypeScript structure is sound
New skills (trial-runner, trial-adapters, compare-trials) appear focused and purposeful

Recommended Before Merge

Add src/cli.spec.ts and src/cli.utils.spec.ts with basic tests
Clarify or fix the "51 tests" claim in the test plan
Update AGENTS.md skills table to reflect new skill names
Either implement compare/calibrate or remove them from the CLI and docs
Explicitly verify the @plaited/agent-eval-harness/schemas export
Check the two unchecked test plan items

Gemini Code Assist MCP server config from the old pipeline. No longer relevant after the trial-runner replacement. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Old integration test wrapper for the previous pipeline's integration_tests/ directory which no longer exists. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Delete docker-compose.test.yml, Dockerfile.test, and the test-integration CI job — all referenced integration_tests/ which no longer exists. CI now runs a single test job with check + test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

API keys were for the deleted Docker integration tests. The trial runner delegates credential management to adapters. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude · 2026-03-11T05:14:22Z

PR Review: feat/trial-runner-replacement

I read all new source files directly from the merged state. Here is a fresh review based on the actual code.

Bug: `Dockerfile.test` CMD references a deleted script

Dockerfile.test:42

CMD ["bash", "scripts/bun-test-wrapper.sh"]

This PR deletes scripts/bun-test-wrapper.sh (commit 854390c), but Dockerfile.test still points to it. The test-integration CI job runs docker compose -f docker-compose.test.yml run --rm test, which builds this image and executes the missing CMD. The container will fail with bash: scripts/bun-test-wrapper.sh: No such file or directory.

Additionally, src/integration_tests/claude.spec.ts and src/integration_tests/gemini.spec.ts are both deleted, so there are no integration tests left to run anyway. The fix is either:

Update Dockerfile.test/docker-compose.test.yml CMD to run bun test src/ (or remove the integration CI job entirely), or
Keep a note that Docker integration tests are a follow-up

AGENTS.md is completely stale

The file was not updated in this PR. It describes directories and skills that no longer exist:

Stale structure section:

src/
├── harness/        # Core capture engine
├── headless/       # Headless adapter implementation
├── pipeline/       # Unix-style pipeline commands
└── schemas/        # Zod schemas + types

All four of these directories were deleted. The actual structure is now:

src/
├── cli.ts          # CLI entry point
├── cli.utils.ts    # Shared CLI parsing utilities
├── trial.ts        # Library API + CLI handler
├── trial.schemas.ts
├── trial.utils.ts
├── trial.constants.ts
└── tests/trial.spec.ts

Stale skills table:

| agent-eval-harness | capture, trials, ... |
| headless-adapters  | headless |

Both skill directories were removed. The replacements (trial-runner, trial-adapters, compare-trials) are not in the table.

Stale capabilities section still lists "Headless adapter: Schema-driven JSON wrapper for any CLI agent" and MCP auto-discovery — neither exists in the new codebase.

`node:fs/promises` import (`trial.utils.ts:11`)

import { appendFile, mkdir } from 'node:fs/promises'

The Bun API rules say to prefer Bun over Node.js for file system operations. mkdir is fine (no Bun equivalent with { recursive: true } async support), and appendFile is used in writeOutput, where Bun.write would overwrite existing content instead of appending to it. This is a minor rule compliance issue: appendFile is the right tool here, so either note it as an exception or update the rule to acknowledge this case.

What looks good

Real test file with solid coverage — src/tests/trial.spec.ts covers pass@k math, k=1/k=3 trials, concurrency, grading, timeouts, adapter errors, JSONL output, append mode, workspace isolation, worker pool behavior, and polyglot loading. The "51 tests, 0 failures" claim in the test plan is credible.
Code quality — Arrow functions, explicit .ts extensions, Bun APIs used correctly (Bun.file, Bun.write, Bun.spawn, Bun.stdin), object params for functions with >2 args, type over interface.
TSDoc — All public APIs have @public/@internal annotations with @remarks.
Schema-first — Zod schemas are the source of truth, types derived via z.infer<>.
resolvePath is straightforward — process.cwd() is appropriate for runtime file resolution (not module resolution), consistent with Bun API guidance.
Stub commands exit with code 1 — This is the right behavior. Silent no-ops would be worse; a clear error is honest.
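
The pass@k math that trial.spec.ts exercises is not shown in this thread. As a rough sketch only, assuming the simple definitions where p is the observed pass rate over k trials, pass@k is the chance that at least one of k attempts passes, and pass^k is the chance that all k pass, it could look like the code below. The function names are hypothetical and the harness may define these metrics differently.

```typescript
// Hypothetical helpers; assumes pass@k = 1 - (1 - p)^k and pass^k = p^k.
const passRate = (passes: number, k: number): number => (k === 0 ? 0 : passes / k)

// Probability that at least one of k independent attempts passes.
const passAtK = (passes: number, k: number): number => 1 - (1 - passRate(passes, k)) ** k

// Probability that all k independent attempts pass.
const passExpK = (passes: number, k: number): number => passRate(passes, k) ** k

console.log(passAtK(2, 3), passExpK(2, 3))
```

Under these assumptions, 2 passes in 3 trials gives a pass@k of 26/27 (about 0.963) and a pass^k of 8/27 (about 0.296).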

Required before merge

Fix Dockerfile.test:42 — Update the CMD to not reference the deleted wrapper script, or remove/update the Docker integration CI job
Update AGENTS.md — Update overview, structure diagram, skills table, and capabilities to reflect the new trial runner architecture

Optional / follow-up

resolvePath uses ${process.cwd()}/${path} — path.join(process.cwd(), path) from node:path is more robust against edge cases (trailing slashes, double slashes). Minor.
The createWorkspaceDir test cleanup (Bun.$`rm -rf ${base}`.quiet()) runs inline rather than in afterEach. Low risk since base dir is unique per test, but consistent cleanup patterns make test failures easier to debug.

claude · 2026-03-11T05:14:28Z

PR Review: feat: replace pipeline with agent-build trial tooling

This is a significant rewrite — ~24K lines deleted and ~2.7K added. The direction (simplifying away BP-specific types, removing headless adapter complexity) is reasonable, but there are several issues that should be addressed before merging.

Critical Issues

1. No tests for new source code

src/cli.ts (22 lines) and src/cli.utils.ts (99 lines) are the only new source files, and neither has a corresponding test file. Per AGENTS.md:

Run: `bun test` before commit

The testing rules require happy path, edge cases, and error paths. The CLI routing and utility functions need at minimum:

Command dispatch test (does trials route correctly?)
Unknown subcommand handling
Argument parsing edge cases in cli.utils.ts

2. Stub implementations shipped as preserved contract

The PR description says:

## Downstream contract preserved
- bunx @plaited/agent-eval-harness compare  (stub)
- bunx @plaited/agent-eval-harness calibrate  (stub)

A stub is not a preserved contract. Any downstream consumer calling compare or calibrate will silently get no meaningful output. These should either be fully implemented or removed from the CLI surface and the README/docs until ready. Shipping stubs that look like real commands is worse than removing them.

3. Two unchecked test plan items are the most critical verifications

The PR test plan leaves unchecked:

[ ] Verify CLI entry point routes correctly
[ ] Verify downstream consumers can import schemas

These aren't optional — they're the smoke tests for the two primary promises of the PR. Please check these before merging.

Moderate Issues

4. AGENTS.md skills table is stale

AGENTS.md still documents:

| agent-eval-harness | capture, trials, summarize, calibrate, validate-refs, balance, schemas, run, extract, grade, format, compare |
| headless-adapters  | headless |

Both skill directories are deleted in this PR. The replacements (compare-trials, trial-adapters, trial-runner) are added but AGENTS.md is not updated. This creates a broken reference for any agent using the skills table.

5. Package.json export contract needs explicit verification

src/schemas/schemas.ts (1073 lines) is deleted. The PR claims import type { Grader } from '@plaited/agent-eval-harness/schemas' still works, but this needs to be explicitly verified. The package.json changed 7 additions / 16 deletions — if the exports field changed, downstream packages may break silently (TypeScript resolves successfully but the wrong types are exported).

6. "51 tests, 0 failures" claim needs clarification

All *.spec.ts files in src/ are deleted. If bun test src/ finds 0 spec files, it should report 0 tests — not 51. Were the 51 tests run against main before the deletion? Or is there a test directory outside src/? This discrepancy should be explained.

What's Good

Commit messages follow conventional commits with clear context
bun --bun tsc --noEmit passing confirms structural TypeScript soundness
CI test check passed (SUCCESS)
New skills (trial-runner, trial-adapters, compare-trials) look well-focused
Removing BP-specific types (RISK_TAG, SelectionBid, TOOL_STATUS, AGENT_EVENTS) reduces unnecessary coupling

Recommended Before Merge

Add src/cli.spec.ts and src/cli.utils.spec.ts with basic routing and utility tests
Implement or remove compare and calibrate commands — don't ship stubs as contract
Update AGENTS.md skills table to reflect the three new skills
Verify and check the two unchecked test plan items
Confirm @plaited/agent-eval-harness/schemas export still resolves correctly
Clarify the 51-test count

AGENTS.md rewritten to reflect new trial runner architecture: updated structure diagram, skills table, capabilities, commands. Removed stale Docker test references and old skill names.

Added src/tests/cli.spec.ts with coverage for:
- CLI routing (trials/compare/calibrate/unknown/no-command)
- parseCli meta flags (--help, -h, --schema input)
- Input validation (valid JSON, invalid JSON, missing input)
- Export contract verification (runTrial, schemas)

All 65 tests pass. Types clean. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude · 2026-03-11T05:18:11Z

Code Review

Reviewed against AGENTS.md conventions. CI passes (test + CodeQL green). Here are my findings:

Bugs / Correctness

compare and calibrate stubs break the contract claim

The PR description states "Downstream contract preserved" but both subcommands call console.error() and process.exit(1). Any consumer that runs these commands will get a non-zero exit code — that is a broken contract, not a preserved one. The PR body acknowledges they are stubs, but the framing is misleading. Suggest either:

Removing them from the bin entry and description until implemented, or
Explicitly noting they are not yet available rather than "preserved"

Two test plan items are unchecked

[ ] Verify CLI entry point routes correctly
[ ] Verify downstream consumers can import schemas

These should be verified or explicitly deferred with a follow-up issue linked.

Convention Violations (AGENTS.md)

node:fs/promises instead of Bun APIs (trial.utils.ts)

Per the Bun APIs rule: prefer Bun.write over fs write APIs. The file imports appendFile and mkdir from node:fs/promises. For mkdir, Bun.$`mkdir -p ${dir}` is the idiomatic replacement. For appendFile, Bun doesn't have a direct append API, so this may be acceptable, but it is worth a comment explaining why the Node fallback is used here.

export function declarations (if any in trial.ts / trial.utils.ts)

The convention is arrow functions: const fn = () => over function fn(). I couldn't diff the full file in this review environment, but worth verifying no named function exports snuck in.

Documentation

AGENTS.md is definitively stale

The Structure section still documents the old layout:

src/
├── harness/
├── headless/
├── pipeline/
└── schemas/

The new flat structure is:

src/
├── cli.ts
├── cli.utils.ts
├── trial.ts
├── trial.schemas.ts
├── trial.utils.ts
└── trial.constants.ts

The Skills table lists agent-eval-harness and headless-adapters (both deleted). The new skills (trial-runner, trial-adapters, compare-trials) are not mentioned. The Commands table (capture, trials, summarize, calibrate, validate-refs, balance, schemas, run, extract, grade, format, compare) reflects the old CLI surface — the new surface is just trials (with compare/calibrate as non-functional stubs).

This should be updated before merge.

Test Coverage

The 723-line src/tests/trial.spec.ts is comprehensive and covers the happy paths, error paths, concurrency, k-shot math, and CLI contract well. One gap:

cli.ts routing is not directly tested. The dispatch logic (trials → trialCli, compare → stub, calibrate → stub) is covered indirectly at best. Since the two unchecked test plan items include "Verify CLI entry point routes correctly", a minimal routing test would close this gap.
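
A minimal routing test is easier to write when dispatch returns a result instead of calling process.exit directly. The sketch below assumes that shape. The `route` function, the `CommandResult` type, and the handler table are hypothetical, and only the subcommand names and the stub error message come from this PR.

```typescript
// Hypothetical dispatch table; the real cli.ts may be structured differently.
type CommandResult = { exitCode: number; message?: string }
type Handler = (args: string[]) => CommandResult

// Stubs fail loudly with a non-zero exit code rather than doing nothing.
const notImplemented = (name: string): Handler => () => ({
  exitCode: 1,
  message: `${name} command not yet implemented`,
})

const handlers = new Map<string, Handler>([
  ['trials', () => ({ exitCode: 0 })], // stand-in for the real trial handler
  ['compare', notImplemented('compare')],
  ['calibrate', notImplemented('calibrate')],
])

const route = (argv: string[]): CommandResult => {
  const [command, ...rest] = argv
  const handler = command === undefined ? undefined : handlers.get(command)
  if (!handler) return { exitCode: 1, message: `Unknown command: ${command ?? '(none)'}` }
  return handler(rest)
}

console.log(route(['trials']))
console.log(route(['compare']))
console.log(route([]))
```

The caller at the process boundary would then print the message to stderr and exit with the returned code, which keeps the routing logic itself free of side effects.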

Scale / Scope

This is a near-total rewrite (2,704 additions / 23,801 deletions across 114 files). CI passes, which is reassuring, but the removed Docker integration tests (Dockerfile.test, docker-compose.test.yml) eliminated coverage of real API interactions (Claude, Gemini). If this is intentional for the scope reduction, it should be documented as a known trade-off.

Summary

Area	Status
CI (unit tests, CodeQL)	✅ Green
Core trial runner logic	✅ Looks solid
AGENTS.md docs	❌ Stale — must update
`compare`/`calibrate` contract	⚠️ Misleading — clarify or remove
Bun API conventions	⚠️ `node:fs/promises` used for append/mkdir
CLI routing test	⚠️ Unchecked test plan item
Schema contract verification	⚠️ Unchecked test plan item

The core implementation quality is good. The blocking items before merge are: updating AGENTS.md and clarifying the stub subcommand status.

claude · 2026-03-11T05:18:14Z

Follow-up review after reading actual branch code. Correcting the earlier review and adding new findings.

claude · 2026-03-11T05:18:28Z

CORRECTION from earlier review: Tests DO exist

The earlier review claimed no test files were added and questioned the 51-test claim. After reading the actual branch, this is wrong. The PR adds 7 new source files, not 2:

src/cli.ts (22 lines)
src/cli.utils.ts (99 lines)
src/trial.constants.ts (11 lines)
src/trial.schemas.ts (288 lines)
src/trial.ts (319 lines)
src/trial.utils.ts (444 lines)
src/tests/trial.spec.ts (723 lines, 51 tests)

The 51-test count is accurate. The cli.ts import of ./trial.ts is valid. The earlier review was working from the files-changed list in the PR metadata, which only showed 2 new files — likely a rendering issue with the large diff.

claude · 2026-03-11T05:18:36Z

Code issue 1: debug parameter is dead code

TrialInputSchema and TrialConfig both accept a debug field. It is forwarded from the CLI through to runTrial config. But inside runTrial, the destructuring omits debug entirely and it is never read.

Either add debug logging (e.g. log adapter stdout, raw timing, per-trial inputs) or remove it from the schema, type, and CLI input schema. As-is it is dead code that adds surface area without any behavior.
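
If the field is kept, one way to give it behavior is sketched below. The `runTrials` function, the `Logger` type, and the trimmed `TrialConfig` shape are hypothetical simplifications, not the code in trial.ts.

```typescript
// Hypothetical shapes; the real TrialConfig has more fields than this.
type TrialConfig = { k: number; debug?: boolean }
type Logger = (line: string) => void

const runTrials = async (
  { k, debug = false }: TrialConfig,
  runOnce: (index: number) => Promise<boolean>,
  log: Logger = (line) => console.error(line),
): Promise<boolean[]> => {
  const results: boolean[] = []
  for (let i = 0; i < k; i++) {
    const started = Date.now()
    const passed = await runOnce(i)
    // debug is destructured and read here, so the flag has an observable effect.
    if (debug) log(`[debug] trial ${i + 1}/${k} passed=${passed} ms=${Date.now() - started}`)
    results.push(passed)
  }
  return results
}
```

Passing the logger as a parameter is an illustrative choice that makes the debug path easy to assert on in a test.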

claude · 2026-03-11T05:18:42Z

Code issue 2: resolvePath uses string concatenation instead of node:path

Current implementation in trial.utils.ts:

export const resolvePath = (path: string): string => {
  if (path.startsWith('/')) return path
  return process.cwd() + '/' + path
}

AGENTS.md rule: Keep node:path for join/resolve/dirname. Manual concatenation can produce double slashes and does not normalize .. segments. The fix is a one-liner:

import { resolve } from 'node:path'
export const resolvePath = (path: string): string => resolve(path)

resolve() from node:path already handles absolute vs relative paths correctly, making the startsWith check unnecessary.
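
To make the difference concrete, the sketch below runs both versions on the same inputs. `concatPath` is just the quoted implementation under a different name for comparison, and the file paths are made up. The output comments assume a POSIX system.

```typescript
import { resolve } from 'node:path'

// The concatenating version quoted above, renamed for side-by-side comparison.
const concatPath = (path: string): string => {
  if (path.startsWith('/')) return path
  return process.cwd() + '/' + path
}

// The suggested replacement.
const resolvePath = (path: string): string => resolve(path)

// Absolute input with a ".." segment: concatenation returns it untouched.
console.log(concatPath('/tmp/out/../results.jsonl')) // /tmp/out/../results.jsonl
console.log(resolvePath('/tmp/out/../results.jsonl')) // /tmp/results.jsonl

// Doubled slash: resolve collapses it.
console.log(resolvePath('/tmp//results.jsonl')) // /tmp/results.jsonl
```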

claude · 2026-03-11T05:18:51Z

Code issue 3: Promise.race timeout leaks the adapter Promise

In trial.ts inside processPrompt, the timeout pattern is:

const adapterResult = await Promise.race([
  adapter({ prompt: promptCase.input, cwd: promptCwd }),
  new Promise((_, reject) =>
    setTimeout(() => reject(new Error('Trial timed out')), effectiveTimeout)
  ),
])

Two issues:

When the adapter wins, the setTimeout is never cleared (minor resource leak per trial).
When the timeout fires, the adapter Promise continues running in the background with no way to cancel it. Depending on what the adapter does, this can consume API quota, write to cwd, or hold open subprocess handles for the remainder of the adapter's own execution time.

At minimum, store the timer handle and clear it in the adapter path. For full correctness, document that adapters cannot be cancelled mid-flight and that running k=N concurrently may have up to N zombie adapter processes after a timeout.
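
A minimal sketch of the timer cleanup, assuming the race is wrapped in a helper. `withTimeout` is a hypothetical name and not a function in trial.ts.

```typescript
// Hypothetical helper: races a promise against a timer and always clears the timer.
const withTimeout = async <T>(work: Promise<T>, ms: number): Promise<T> => {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('Trial timed out')), ms)
  })
  try {
    return await Promise.race([work, timeout])
  } finally {
    // Runs whether the work or the timer settled first, so no timer outlives the trial.
    clearTimeout(timer)
  }
}
```

This only fixes the first of the two issues. The work promise still runs to completion after a timeout, so cancelling the adapter itself would need a separate mechanism that the adapter cooperates with.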

claude · 2026-03-11T05:19:00Z

Code issues 4 and 5: AGENTS.md is stale; compare/calibrate behavior clarification

Issue 4 (confirmed from earlier review): AGENTS.md still documents agent-eval-harness and headless-adapters skills in the Skills table, both of which are deleted in this PR. The three new skills (trial-runner, trial-adapters, compare-trials) need to be added to the table before merge.

Issue 5 (correction of earlier review): The earlier review said compare and calibrate silently get no output. The actual behavior in cli.ts is that both subcommands call process.exit(1) with an explicit error message: 'compare command not yet implemented'. This is explicit failure, not silent failure. The concern about shipping unimplemented subcommands still stands, but the behavior is correct for a stub.

claude · 2026-03-11T05:19:20Z

What is well done

After reading the actual code:

trial.spec.ts (723 lines, 51 tests) is thorough. It covers runTrial at k=1 and k=3, pass@k and pass^k math, grading with always-pass and flaky graders, concurrency behavior, CLI contract flags (--help, --schema input, --schema output), loadPolyglot for both TS modules and executables, and every utility function. Good coverage.
trial.schemas.ts is clean schema-first design with Zod 4, z.infer<> types, and PascalCase naming throughout. No any types.
trial.utils.ts correctly uses Bun.file/Bun.write/Bun.stdin where Bun APIs exist. The node:fs/promises usage for appendFile and mkdir is acceptable since Bun has no native equivalents for those operations.
createWriteMutex: chain.then(fn, fn) is intentional and correct. It ensures the next write always runs regardless of whether the previous write succeeded or failed, which is the right mutex semantic for ordered JSONL output.
loadPolyglot cleanly handles both TS/JS modules (via dynamic import + named export check) and executable subprocesses (via stdin/stdout JSON protocol with schema validation). The error messages are clear.
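
For readers unfamiliar with the chain.then(fn, fn) pattern praised above, here is a small reconstruction of the idea. `createMutex` is a hypothetical stand-in and may differ from the PR's createWriteMutex in its details.

```typescript
// Hypothetical reconstruction of a promise-chain mutex.
const createMutex = () => {
  let chain: Promise<unknown> = Promise.resolve()
  return <T>(fn: () => Promise<T>): Promise<T> => {
    // then(fn, fn): run fn once the previous task settles, whether it resolved or rejected.
    const next = chain.then(fn, fn)
    chain = next
    return next
  }
}

// Usage: tasks run strictly in submission order, and a failure does not block later tasks.
const lock = createMutex()
const order: string[] = []
const first = lock(async () => {
  await new Promise((done) => setTimeout(done, 20))
  order.push('first')
})
const second = lock(async () => {
  order.push('second')
  throw new Error('write failed')
})
const third = lock(async () => {
  order.push('third')
})
```

Each caller still receives its own task's result or rejection, so errors surface to whoever queued the failing write.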

Summary of blockers before merge

Remove or implement debug (dead code in schema, type, and CLI)
Fix resolvePath to use node:path resolve()
Clear setTimeout handle when adapter wins the race (or document the limitation)
Update AGENTS.md skills table
Decide on compare/calibrate: implement or remove from CLI and docs

claude · 2026-03-11T05:21:50Z

PR Review: feat/trial-runner-replacement (post-fixup pass)

The fixup commit (16e594e) addressed the main blockers from earlier reviews — AGENTS.md is updated and src/tests/cli.spec.ts now exists with 65 tests passing. This pass reviews what remains.

What the fixup resolved

AGENTS.md — Structure diagram, skills table, and capabilities now reflect the trial runner architecture. ✅
CLI tests — src/tests/cli.spec.ts covers routing, parseCli meta-flags, input validation, and export contract. ✅
Test count discrepancy — 65 tests (51 in trial.spec.ts + 14 in cli.spec.ts) is now credible. ✅

Remaining issues

1. `debug` field is dead code

TrialInputSchema and TrialConfig both accept a debug field. It is forwarded from the CLI through to runTrial config but never read inside runTrial — the destructuring omits it entirely.

Either implement debug logging (e.g. log raw adapter stdout, per-trial timing) or remove the field from the schema, type, and CLI input. Dead fields on a public schema are a maintenance burden: consumers may rely on them expecting behavior that never arrives.

2. `resolvePath` — string concatenation instead of `node:path`

The Bun API rules in AGENTS.md say: Keep node:path for join/resolve/dirname. The current implementation does manual concatenation:

export const resolvePath = (path: string): string => {
  if (path.startsWith('/')) return path
  return process.cwd() + '/' + path
}

This does not normalize .. segments and can produce double slashes. The idiomatic replacement is one line:

import { resolve } from 'node:path'
export const resolvePath = (path: string): string => resolve(path)

resolve() already handles absolute vs relative correctly, making the startsWith('/') guard unnecessary.

3. `compare` and `calibrate` stubs — framing vs behavior

The stubs exit with code 1, which is honest. The framing concern from earlier reviews still holds: the PR summary says "Downstream contract preserved" and lists these commands with (stub) notation, but a non-zero exit is a broken contract for any downstream script that calls them.

Suggest one of:

Reword the PR summary to say "not yet available" rather than "preserved"
Or exit with code 0 and print a clear "not yet implemented" message to avoid breaking pipelines that check exit codes

4. `node:fs/promises` — acceptable but undocumented exception

appendFile is imported from node:fs/promises. The Bun APIs rule flags this for review. As noted in earlier reviews, Bun has no async append equivalent, so this is the right choice — but a brief inline comment like // Bun has no async append equivalent would make the exception self-documenting and prevent a future reviewer from flagging it again.
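
A sketch of what the self-documenting exception might look like. `appendJsonl` is a hypothetical helper and the PR's writeOutput may be shaped differently. The comment wording follows the reasoning given in this review.

```typescript
import { appendFile, mkdir } from 'node:fs/promises'
import { dirname } from 'node:path'

// Hypothetical helper: appends one JSON record per line, creating parent directories first.
const appendJsonl = async (path: string, record: unknown): Promise<void> => {
  await mkdir(dirname(path), { recursive: true })
  // Exception to the "prefer Bun APIs" rule: Bun.write replaces file contents,
  // so appending goes through node:fs/promises.
  await appendFile(path, `${JSON.stringify(record)}\n`)
}
```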

Overall assessment

Area	Status
AGENTS.md docs	✅ Updated
CLI tests	✅ Added
Core trial runner logic	✅ Solid
CI (unit tests, CodeQL)	✅ Green
`debug` dead code	⚠️ Remove or implement
`resolvePath` path handling	⚠️ Use `path.resolve()`
Stub exit code framing	⚠️ Clarify in PR description
`appendFile` exception	ℹ️ Add inline comment

The two blocking items before merge are debug dead code and resolvePath path handling. The rest are minor polish.

EdwardIrby requested a review from alisonailea as a code owner March 11, 2026 04:23

EdwardIrby and others added 4 commits March 10, 2026 22:10

chore: remove stale .gemini config

5d45878

Gemini Code Assist MCP server config from the old pipeline. No longer relevant after the trial-runner replacement. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: remove stale bun-test-wrapper script

854390c

Old integration test wrapper for the previous pipeline's integration_tests/ directory which no longer exists. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: remove stale .env.example

e4a1524

API keys were for the deleted Docker integration tests. The trial runner delegates credential management to adapters. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
