29 changes: 29 additions & 0 deletions README.md
@@ -97,6 +97,11 @@ directory, or a suite directory:
bun run eval:spec -- update-add-single-nongoal-preserve-system-overview --deterministic-only skills/spec-creator/evals/runs/<run>/candidate.md
```

Update-mode validation contracts can set `skip_base_check_ids` when a packet
explicitly asks to preserve legacy text that would otherwise fail a generic
create-mode style rule. Keep these skips narrow, and pair each one with
required/forbidden patterns that check the requested edit itself.
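
As a rough illustration of the guidance above, the following TypeScript sketch shows one way a narrow skip could be combined with required and forbidden patterns. Only `skip_base_check_ids` comes from this README; the contract shape, the `required_patterns` and `forbidden_patterns` field names, the check IDs, and `validateUpdate` are hypothetical and may not match the real validators.

```typescript
// Hypothetical shapes: only `skip_base_check_ids` is named in this README.
interface BaseCheck {
  id: string;
  passes: (doc: string) => boolean;
}

interface UpdateContract {
  skip_base_check_ids?: string[];
  required_patterns?: RegExp[]; // assumed field name
  forbidden_patterns?: RegExp[]; // assumed field name
}

// Returns a list of failure labels; an empty list means the candidate passed.
function validateUpdate(
  doc: string,
  baseChecks: BaseCheck[],
  contract: UpdateContract,
): string[] {
  const skipped = new Set(contract.skip_base_check_ids ?? []);
  const failures: string[] = [];

  // Run every base check except the ones the contract explicitly skips.
  for (const check of baseChecks) {
    if (skipped.has(check.id)) continue;
    if (!check.passes(doc)) failures.push(`base:${check.id}`);
  }

  // A skip alone verifies nothing, so the requested edit is asserted directly.
  // Patterns are assumed to be non-global, since /g regexes keep state in test().
  for (const pattern of contract.required_patterns ?? []) {
    if (!pattern.test(doc)) failures.push(`missing:${pattern.source}`);
  }
  for (const pattern of contract.forbidden_patterns ?? []) {
    if (pattern.test(doc)) failures.push(`forbidden:${pattern.source}`);
  }
  return failures;
}
```

In this sketch a skipped check simply never runs, which is why the required and forbidden patterns carry the burden of proving that the requested edit actually happened.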

Current comparison variants:

- `current` — working tree prompt stack
@@ -149,6 +154,30 @@ skill package names. Current values are `foundation-doc` and `service-spec`.
Optional Braintrust environment variables are `BRAINTRUST_EXPERIMENT` for
manual curated runs and `BRAINTRUST_ORG` for org selection.

Braintrust experiments can also be inspected from the terminal without opening the UI:

```bash
bun run braintrust:list -- --limit 5
bun run braintrust:latest -- --capability foundation-doc
bun run braintrust:latest -- --capability service-spec
bun run braintrust:show -- foundation-doc.smoke.fast.model.20260423-1015.0a10e79
```

These commands use Braintrust's API and BTQL directly, summarize experiment
rows, and print combined status counts, LLM status counts, deterministic
failures, open issues, timing, and per-eval row status. They require
`BRAINTRUST_API_KEY` and use `BRAINTRUST_PROJECT` when set.
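
The kind of summary described above can be approximated with a small reducer. This is a sketch only: the `EvalRow` shape, its `status` and `durationMs` fields, the example status values, and `summarizeRows` are assumptions for illustration and are not taken from `scripts/braintrust-evals.ts` or from Braintrust's API.

```typescript
// Hypothetical row shape; real experiment rows may look different.
interface EvalRow {
  evalId: string;
  status: string; // e.g. "pass" or "fail" (assumed values)
  durationMs: number;
}

interface Summary {
  total: number;
  statusCounts: Record<string, number>;
  totalDurationMs: number;
}

// Tallies rows by status and accumulates total duration in one pass.
function summarizeRows(rows: EvalRow[]): Summary {
  const statusCounts: Record<string, number> = {};
  let totalDurationMs = 0;
  for (const row of rows) {
    statusCounts[row.status] = (statusCounts[row.status] ?? 0) + 1;
    totalDurationMs += row.durationMs;
  }
  return { total: rows.length, statusCounts, totalDurationMs };
}
```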

Braintrust also provides an optional beta `bt` CLI for listing experiments,
running BTQL, and syncing experiment data locally:

```bash
curl -fsSL https://bt.dev/cli/install.sh | bash
bt experiments list --project lightfast-skills --env-file .env --json --no-input
bt sql "SELECT id, input, scores FROM experiment('<experiment-id>') LIMIT 20" --env-file .env --json --no-input
bt sync pull experiment:<experiment-name> --project lightfast-skills --env-file .env
```
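
When the `bt sql` call above is driven from a script, assembling the arguments as an array avoids shell quoting problems around the query. The sketch below only builds the argument list and reuses the flags shown above; `buildBtSqlArgs` and its parameters are hypothetical helpers, not part of the `bt` CLI or this repository.

```typescript
// Builds the argument list for `bt sql`, mirroring the README example.
// The function name and parameters are illustrative assumptions.
function buildBtSqlArgs(
  experimentId: string,
  limit = 20,
  envFile = ".env",
): string[] {
  if (!Number.isInteger(limit) || limit <= 0) {
    throw new RangeError("limit must be a positive integer");
  }
  // The ID is interpolated into a quoted literal, so reject embedded quotes.
  if (experimentId.includes("'")) {
    throw new Error("experiment id must not contain a single quote");
  }
  const query =
    `SELECT id, input, scores FROM experiment('${experimentId}') LIMIT ${limit}`;
  return ["sql", query, "--env-file", envFile, "--json", "--no-input"];
}
```

The resulting array can be handed to a process spawner such as Node's `child_process.spawnSync("bt", args)`, which passes each element as a separate argument without going through a shell.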

Eval manifests also carry lightweight taxonomy metadata
(`scenario_type`, `input_shape`, `ambiguity_level`, `domain_profile`,
`primary_risks`) so benchmark runs can be grouped by failure mode. Shared
5 changes: 4 additions & 1 deletion package.json
@@ -15,7 +15,10 @@
"eval:spec": "bun run with-env -- bun ./scripts/run-baml-eval.ts spec-creator",
"eval:spec:smoke": "bun run eval:spec -- --smoke",
"eval:check": "bun ./scripts/check-eval-fixtures.ts foundation-creator spec-creator",
- "eval:typecheck": "tsc --noEmit --allowImportingTsExtensions --moduleResolution bundler --module esnext --target esnext --skipLibCheck --types node scripts/check-eval-fixtures.ts scripts/run-baml-eval.ts scripts/evals/*.ts scripts/evals/validators/*.ts",
+ "eval:typecheck": "tsc --noEmit --allowImportingTsExtensions --moduleResolution bundler --module esnext --target esnext --skipLibCheck --types node scripts/check-eval-fixtures.ts scripts/run-baml-eval.ts scripts/braintrust-evals.ts scripts/evals/*.ts scripts/evals/validators/*.ts",
+ "braintrust:list": "bun run with-env -- bun ./scripts/braintrust-evals.ts list",
+ "braintrust:latest": "bun run with-env -- bun ./scripts/braintrust-evals.ts latest",
+ "braintrust:show": "bun run with-env -- bun ./scripts/braintrust-evals.ts show",
"ci:check": "bun run eval:check && bun run baml:generate:foundation && bun run baml:generate:spec && bun run eval:typecheck"
},
"dependencies": {
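
The three `braintrust:*` scripts forward a subcommand (`list`, `latest`, or `show`) as the first argument to `scripts/braintrust-evals.ts`. The body of that script is not shown in this diff, so the following is only a guess at how such dispatch could look. `parseCommand`, `flagValue`, and the returned shapes are hypothetical, while the `--limit` and `--capability` flags and the positional experiment name mirror the README examples.

```typescript
type Command =
  | { kind: "list"; limit: number | undefined }
  | { kind: "latest"; capability: string | undefined }
  | { kind: "show"; experiment: string };

// Returns the value that follows a flag, or undefined when the flag is absent.
function flagValue(args: string[], flag: string): string | undefined {
  const index = args.indexOf(flag);
  return index >= 0 ? args[index + 1] : undefined;
}

// `argv` is expected to hold the arguments after the script path,
// for example process.argv.slice(2) under Bun or Node.
function parseCommand(argv: string[]): Command {
  const [subcommand, ...rest] = argv;
  switch (subcommand) {
    case "list": {
      const raw = flagValue(rest, "--limit");
      const limit = raw === undefined ? undefined : Number.parseInt(raw, 10);
      if (limit !== undefined && Number.isNaN(limit)) {
        throw new Error(`invalid --limit: ${raw}`);
      }
      return { kind: "list", limit };
    }
    case "latest":
      return { kind: "latest", capability: flagValue(rest, "--capability") };
    case "show": {
      // The first non-flag argument is treated as the experiment name.
      const experiment = rest.find((arg) => !arg.startsWith("--"));
      if (!experiment) throw new Error("show requires an experiment name");
      return { kind: "show", experiment };
    }
    default:
      throw new Error(`unknown subcommand: ${String(subcommand)}`);
  }
}
```

Modeling the result as a discriminated union keeps each subcommand's options separate, so a caller that switches on `kind` cannot read `experiment` from a `list` command by mistake.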