Merged
30 commits
- `5bbcf8e` add foundation/spec compiler eval scaffolding (jeevanpillay, Apr 20, 2026)
- `0686878` Migrate evals to AI Gateway and Bun (jeevanpillay, Apr 20, 2026)
- `b19bf60` Tighten foundation creator prompts and eval coverage (jeevanpillay, Apr 20, 2026)
- `2679aea` Add deterministic checks and benchmarks to eval runner (jeevanpillay, Apr 20, 2026)
- `caea972` Tighten evals and expand foundation coverage (jeevanpillay, Apr 20, 2026)
- `65d821f` Add foundation update-mode eval coverage (jeevanpillay, Apr 20, 2026)
- `a5e56f3` Add replacement-heavy foundation update eval (jeevanpillay, Apr 21, 2026)
- `c771030` Add baseline comparison eval harness (jeevanpillay, Apr 21, 2026)
- `e4028a7` Tune foundation and spec eval profiles (jeevanpillay, Apr 21, 2026)
- `5f53fbc` Add spec ambiguity and cross-domain eval packets (jeevanpillay, Apr 21, 2026)
- `f91d310` Harden spec creator update evals (jeevanpillay, Apr 23, 2026)
- `6cbdaa4` Refactor BAML eval runner modules (jeevanpillay, Apr 23, 2026)
- `5f0ad1a` Add Braintrust-ready eval reporting (jeevanpillay, Apr 23, 2026)
- `f23f9ee` Tune spec eval domain modeling prompts (jeevanpillay, Apr 23, 2026)
- `af53340` Persist Braintrust suite metadata (jeevanpillay, Apr 23, 2026)
- `3e7e6dd` Relax foundation update validator phrasing (jeevanpillay, Apr 23, 2026)
- `433f625` Tighten foundation strategic bet language (jeevanpillay, Apr 23, 2026)
- `29e7d78` Add eval CI smoke workflow (jeevanpillay, Apr 23, 2026)
- `32eaed2` Tighten spec rendering prompt boundaries (jeevanpillay, Apr 23, 2026)
- `2acdb85` Upgrade CI to Node 24 LTS (jeevanpillay, Apr 23, 2026)
- `d073591` Stabilize foundation update eval prompt (jeevanpillay, Apr 23, 2026)
- `978c152` Stabilize foundation update smoke eval (jeevanpillay, Apr 23, 2026)
- `98c574d` Harden foundation open-question prompts (jeevanpillay, Apr 23, 2026)
- `aa8ff92` Preserve spec lineage brief stage (jeevanpillay, Apr 23, 2026)
- `68ef8ee` Preserve foundation update thesis bullets (jeevanpillay, Apr 23, 2026)
- `134e23e` Add static eval contract checks (jeevanpillay, Apr 23, 2026)
- `765e6fc` Align foundation public-source bet checks (jeevanpillay, Apr 23, 2026)
- `83d291c` Broaden foundation strategic bet leads (jeevanpillay, Apr 23, 2026)
- `60e8561` Align eval docs with validator taxonomy (jeevanpillay, Apr 23, 2026)
- `8db1504` Require explicit spec open questions (jeevanpillay, Apr 23, 2026)
70 changes: 70 additions & 0 deletions .github/workflows/evals.yml
@@ -0,0 +1,70 @@
name: Evals

on:
pull_request:
push:
branches:
- main
workflow_dispatch:

concurrency:
group: evals-${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true

env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true

jobs:
checks:
name: Typecheck and smoke evals
runs-on: ubuntu-latest
timeout-minutes: 30
permissions:
contents: read
env:
AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
BRAINTRUST_PROJECT: lightfast-skills
steps:
- name: Checkout
uses: actions/checkout@v4

- name: Set up Node
uses: actions/setup-node@v4
with:
node-version-file: .node-version

- name: Set up Bun
uses: oven-sh/setup-bun@v2
with:
bun-version: 1.3.9

- name: Install dependencies
run: bun install --frozen-lockfile

- name: Static eval checks
run: bun run ci:check

- name: Run live smoke evals
if: env.AI_GATEWAY_API_KEY != ''
run: |
reporter="local"
if [ -n "${BRAINTRUST_API_KEY:-}" ]; then
reporter="local,braintrust"
fi

bun run eval:foundation:smoke -- --reporter "$reporter"
bun run eval:spec:smoke -- --reporter "$reporter"

- name: Skip live smoke evals
if: env.AI_GATEWAY_API_KEY == ''
run: echo "AI_GATEWAY_API_KEY is not configured; skipping model-backed smoke evals."

- name: Upload eval artifacts
if: always() && env.AI_GATEWAY_API_KEY != ''
uses: actions/upload-artifact@v4
with:
name: eval-runs
path: skills/*/evals/runs/**
if-no-files-found: ignore
retention-days: 7
22 changes: 17 additions & 5 deletions .gitignore
@@ -1,6 +1,18 @@
node_modules/

# Generated BAML clients
skills/**/baml_client/
skills/**/baml_client_dist/

# Local eval outputs
skills/**/evals/runs/

# Local environment files
.env
.env.local
.env.*.local

# Local OS and tooling noise
.DS_Store
__pycache__/
*.py[cod]
.venv/
.idea/
.vscode/
*.log
.tmp-baml-client-tsconfig.json
1 change: 1 addition & 0 deletions .node-version
@@ -0,0 +1 @@
24.15.0
155 changes: 155 additions & 0 deletions README.md
@@ -6,18 +6,173 @@ Agent skills published by Lightfast. Compatible with [Claude Code](https://docs.

| Skill | Purpose |
|---|---|
| [`foundation-creator`](skills/foundation-creator/) | Draft a top-level foundation document for a product or company primitive: thesis, mission, boundaries, actor model, surfaces, strategic bets, and open questions. |
| [`spec-creator`](skills/spec-creator/) | Write and update a top-level `SPEC.md` service specification following a strict template and language guide. |

## Install

Each skill is a subdirectory under `skills/`. To install one into a project:

```bash
npx skills add lightfastai/skills --skill foundation-creator
npx skills add lightfastai/skills --skill spec-creator
```

Or copy the directory directly into `.claude/skills/` in your project.

## Local evals

This repo now includes BAML-backed fixture evals for `foundation-creator` and
`spec-creator`.

```bash
bun install
bun run ci:check
bun run eval:check
bun run eval:typecheck
bun run eval:foundation -- create-foundation-from-vercel-source-packet
bun run eval:foundation -- create-foundation-from-lightfast-founder-notes
bun run eval:foundation -- update-lightfast-foundation-boundary-surface-question
bun run eval:foundation -- update-lightfast-foundation-tighten-overreach
bun run eval:spec -- create-from-vercel-mcp-source-packet
bun run eval:foundation:smoke
bun run eval:spec:smoke
bun run eval:spec -- --all
bun run with-env -- bun ./scripts/run-baml-eval.ts foundation-creator create-foundation-from-cloudflare-source-packet --eval-profile gate --trials 3
bun run with-env -- bun ./scripts/run-baml-eval.ts foundation-creator update-lightfast-foundation-tighten-overreach --eval-profile fast --compare previous,profile:no-skill
bun run with-env -- bun ./scripts/run-baml-eval.ts foundation-creator create-foundation-from-lightfast-founder-notes --eval-profile cross
```

Each run writes packet, brief, candidate document, and evaluation report
artifacts under `skills/<skill>/evals/runs/`.

`bun run eval:check` is the cheap deterministic CI guard. It validates eval
manifests, fixture paths, validation regexes, smoke membership, and BAML runner
function wiring without calling any model.

Current `foundation-creator` corpus includes:

- `create-foundation-from-vercel-source-packet`
- `create-foundation-from-cloudflare-source-packet`
- `create-foundation-from-lightfast-founder-notes`
- `create-foundation-from-harbor-care-source-packet`
- `update-lightfast-foundation-boundary-surface-question`
- `update-lightfast-foundation-tighten-overreach`

The runner now also writes:

- `deterministic_checks.json` — reference-driven checks derived from the skill's
`template.md` and `language.md`
- `timing.json` — per-stage local timing
- `summary.json` — per-trial LLM status + combined status
- `benchmark.json` — aggregated status counts and timing summaries across all
trials

When `--compare` is used, the run directory also includes:

- `comparison.json` — head-to-head summary across variants, all judged by the
current skill's evaluator
- `variants/<label>/...` — per-variant packet/brief/candidate/report artifacts
and `benchmark.json`

When `--all` is used, the runner executes every eval in the selected skill
manifest and writes a suite directory under `skills/<skill>/evals/runs/` with:

- `suite.json` — aggregate status summary for every eval in the manifest
- `<eval-name>/...` — the normal per-eval artifacts for each manifest entry

Suite mode exits nonzero if any eval has a non-`Pass` combined status, making it
suitable for CI gates.
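As a minimal sketch, a CI step could gate on the suite exit code like this (the `eval:spec` script name is taken from the commands above):

```bash
# Suite mode: run every spec-creator eval in the manifest.
# The runner exits nonzero if any eval's combined status is not Pass.
if bun run eval:spec -- --all; then
  echo "spec eval suite passed"
else
  echo "spec eval suite failed" >&2
  exit 1
fi
```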

When `--smoke` is used, the runner executes only manifest entries marked with
`"smoke": true`. The package scripts `eval:foundation:smoke` and
`eval:spec:smoke` are the intended lightweight CI commands.

When `--deterministic-only <path>` is used, the runner validates an existing
`candidate.md` artifact against deterministic reference checks without calling
the candidate model or LLM judge. The path can point to a `candidate.md`, a run
directory, or a suite directory:

```bash
bun run eval:spec -- update-add-single-nongoal-preserve-system-overview --deterministic-only skills/spec-creator/evals/runs/<run>/candidate.md
```

Current comparison variants:

- `current` — working tree prompt stack
- `previous` — `HEAD~1` snapshot of the skill
- `profile:no-skill` — intentionally under-scaffolded baseline profile for
measuring how much the foundation-specific prompt constraints matter

Current eval profiles:

- `fast` — candidate and judge both run on `openai/gpt-5.4-mini`
- `gate` — candidate runs on `openai/gpt-5.4-mini`, judge runs on `openai/gpt-5.4`
- `prod` — candidate uses the skill's default authoring model from `baml_src/clients.baml`, judge runs on `openai/gpt-5.4`
- `cross` — candidate runs on `openai/gpt-5.4-mini`, judge runs on `anthropic/claude-opus-4-7`

The default authoring client in each skill's `baml_src/clients.baml` is
`openai/gpt-5.4` for higher-quality foundation/spec generation. Eval profiles
override that default so the tuning loop can stay on cheaper candidate models.

`fast` is the default when `--eval-profile` is omitted.
Model profiles are applied as overlay fixtures, so prompt comparisons against
`previous` or `profile:no-skill` stay on the same candidate/judge model split.
The `cross` profile requires Anthropic model access through Vercel AI Gateway.

Local JSON artifacts remain the source of truth. Optional Braintrust export can
be enabled with:

```bash
bun run eval:spec -- create-from-vercel-mcp-source-packet --reporter local,braintrust
```

Braintrust export requires `BRAINTRUST_API_KEY`. The default project is
`lightfast-skills`, which can be overridden with `BRAINTRUST_PROJECT`.
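For example, a one-off run can export to a different Braintrust project by overriding the variable inline (the project name here is illustrative):

```bash
# Override the default `lightfast-skills` project for this invocation only
BRAINTRUST_PROJECT=my-skills-evals \
  bun run eval:spec -- create-from-vercel-mcp-source-packet --reporter local,braintrust
```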

Experiment names are generated as:

```text
<capability-id>.<suite-mode>.<profile>.<run-kind>.<yyyymmdd-HHMM>.<git-sha>
```

Examples:

```text
foundation-doc.smoke.fast.model.20260423-0423.6cbdaa4
service-spec.smoke.fast.deterministic.20260423-0422.6cbdaa4
service-spec.compare.gate.model.20260423-0530.6cbdaa4
```

Use stable `capability_id` values in manifests instead of relying on mutable
skill package names. Current values are `foundation-doc` and `service-spec`.
Optional Braintrust environment variables are `BRAINTRUST_EXPERIMENT` for
manual curated runs and `BRAINTRUST_ORG` for org selection.

Eval manifests also carry lightweight taxonomy metadata
(`scenario_type`, `input_shape`, `ambiguity_level`, `domain_profile`,
`primary_risks`) so benchmark runs can be grouped by failure mode. Shared
taxonomy guidance lives in [`evals/TAXONOMY.md`](evals/TAXONOMY.md).

When `--trials N` is used, the run directory contains `trial-1/`, `trial-2/`,
... plus a top-level `benchmark.json`.

`bun run eval:*` loads `.env` automatically through `dotenv-cli`, so
`AI_GATEWAY_API_KEY` can live in the repo-local `.env` without manual
`source` steps.
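A minimal repo-local `.env` might look like this (all values are placeholders; only `AI_GATEWAY_API_KEY` is needed for model-backed runs):

```bash
# .env — loaded automatically by `bun run eval:*` via dotenv-cli
AI_GATEWAY_API_KEY=replace-me

# Optional: enable Braintrust export
BRAINTRUST_API_KEY=replace-me
BRAINTRUST_PROJECT=lightfast-skills
```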

CI runs `bun run ci:check` on every pull request and push to `main`. Live smoke
evals run only when `AI_GATEWAY_API_KEY` is configured in GitHub Actions
secrets. Braintrust export remains optional: if `BRAINTRUST_API_KEY` is present,
CI uses `local,braintrust`; otherwise local JSON artifacts remain the source of
truth.

For other local commands that should inherit `.env`, use:

```bash
bun run with-env -- bun ./scripts/run-baml-eval.ts foundation-creator create-foundation-from-vercel-source-packet
```

## License

MIT