39 changes: 39 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE/tier5-queries.md
@@ -0,0 +1,39 @@
<!--
Tier 5.5 Externally-Authored Query Submission template
See eval/CONTRIBUTING.md for the full workflow.
-->

## Summary

Submitting **N** Tier 5.5 queries for BrainBench.

- Author handle: `@your-handle`
- File location: `eval/external-authors/your-handle/queries.json`
- Queries authored fresh (not copy-pasted from a model output)
- Slugs verified against `eval/data/world-v1/` (via `bun run eval:world:view`)

## Checklist

- [ ] `bun run eval:query:validate eval/external-authors/your-handle/queries.json` passes
- [ ] At least 20 queries
- [ ] Each query has either `gold.relevant` (with real slugs) or `gold.expected_abstention: true`
- [ ] Temporal queries have `as_of_date` set (`corpus-end` | `per-source` | ISO-8601)
- [ ] Phrasing is varied (not all the same template)
- [ ] `author` field matches my handle
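A minimal query shape consistent with the checklist above (the slugs, `id` format, and exact field layout here are illustrative, not authoritative — `bun run eval:query:validate` and `eval/CONTRIBUTING.md` are the source of truth):

```json
{
  "id": "your-handle-001",
  "author": "your-handle",
  "query": "Who works at Anchor?",
  "gold": { "relevant": ["carol-wilson"] }
}
```

An abstention-bait query would instead set `"gold": { "expected_abstention": true }`, and a temporal query adds `"as_of_date"` (`corpus-end`, `per-source`, or an ISO-8601 date).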

## Phrasing variety (optional self-audit)

Tick the styles represented in your batch:

- [ ] Full sentence questions
- [ ] Fragment-style ("crypto founder Goldman Sachs background")
- [ ] Comparison ("X vs Y")
- [ ] Follow-up ("And who else...")
- [ ] Imperative ("Pull up Alice Davis")
- [ ] Trait-based ("the demanding engineering leader")
- [ ] Abstention bait (answer is "not in corpus")

## Notes to reviewer

Anything worth flagging — ambiguous cases, corpus gaps you found, specific
phrasings you were uncertain about.
40 changes: 40 additions & 0 deletions .github/workflows/eval-tests.yml
@@ -0,0 +1,40 @@
name: Eval tests

on:
push:
branches: [master]
paths:
- 'eval/**'
- 'src/core/link-extraction.ts'
- 'src/core/search/**'
pull_request:
branches: [master]
paths:
- 'eval/**'
- 'src/core/link-extraction.ts'
- 'src/core/search/**'

permissions:
contents: read

jobs:
eval-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
- uses: oven-sh/setup-bun@0c5077e51419868618aeaa5fe8019c62421857d6 # v2
with:
bun-version: latest
- run: bun install

# Validate the built-in Tier 5 + 5.5 query set.
- name: Validate built-in queries
run: bun run eval:query:validate

# Pure-function unit tests — zero API calls, fast.
- name: Run eval unit tests
run: bun run test:eval

# Smoke-test the world.html renderer against the committed corpus.
- name: Render world.html
run: bun run eval:world:render
1 change: 1 addition & 0 deletions .gitignore
@@ -12,3 +12,4 @@ supabase/.temp/
.claude/skills/
.idea
eval/reports/
eval/data/world-v1/world.html
99 changes: 99 additions & 0 deletions docs/benchmarks/2026-04-19-brainbench-multi-adapter.md
@@ -0,0 +1,99 @@
# BrainBench — multi-adapter side-by-side (2026-04-19)

**Branch:** `garrytan/gbrain-evals`
**Commit:** `b81373d`
**Engine:** PGLite (in-memory)
**Corpus:** `eval/data/world-v1/` (240 rich-prose fictional pages, committed)
**Runner:** `bun run eval:run` (N=5, page-order shuffled per run, seeded LCG)
**Wall time:** ~11.5 min

## Headline

| Adapter | Runs | Queries | P@5 | R@5 | Correct in top-5 (run 1) |
|------------------|------|---------|--------------|--------------|--------------------------|
| **gbrain-after** | 5 | 145 | **49.1%** ±0 | **97.9%** ±0 | **248 / 261** |
| hybrid-nograph | 5 | 145 | 17.8% | 65.1% | 129 / 261 |
| ripgrep-bm25 | 5 | 145 | 17.1% | 62.4% | 124 / 261 |
| vector-only | 5 | 145 | 10.8% | 40.7% | 78 / 261 |

Stddev = 0 across all adapters this run — every adapter is deterministic over
page ordering. That's the correct signal for the shipped code (non-zero would
surface an order-dependent tie-break bug).
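For reference, the two headline metrics reduce to a few lines. This is a sketch, not the harness's actual scoring code (type and function names here are illustrative):

```typescript
// P@k and R@k over one query's ranked result list vs its gold set.
type Scored = { gold: Set<string>; ranked: string[] };

function precisionAtK(q: Scored, k = 5): number {
  const hits = q.ranked.slice(0, k).filter((s) => q.gold.has(s)).length;
  return hits / k; // denominator is k, even if fewer results came back
}

function recallAtK(q: Scored, k = 5): number {
  if (q.gold.size === 0) return 0;
  const hits = q.ranked.slice(0, k).filter((s) => q.gold.has(s)).length;
  return hits / q.gold.size;
}

// Example: 2 gold pages, both inside the top 5
const q: Scored = {
  gold: new Set(["carol-wilson", "dan-ng"]),
  ranked: ["carol-wilson", "acme", "dan-ng", "bob", "eve"],
};
precisionAtK(q); // 0.4
recallAtK(q);    // 1
```

Note that if P@5 divides by k (as sketched here), gold sets smaller than 5 cap P@5 below 100% even for perfect retrieval, which would explain a ~49% P@5 coexisting with a ~98% R@5.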

### Deltas vs gbrain-after

- hybrid-nograph: P@5 **−31.4 pts**, R@5 **−32.9 pts**, correct-in-top-5 **−119**
- ripgrep-bm25: P@5 **−32.0 pts**, R@5 **−35.5 pts**, correct-in-top-5 **−124**
- vector-only: P@5 **−38.4 pts**, R@5 **−57.2 pts**, correct-in-top-5 **−170**

### Per-adapter wall time (5 runs)

| Adapter | Time | Per run | Notes |
|----------------|---------|---------|------------------------------------------|
| gbrain-after | 7.4s | ~1.5s | PGLite + extract (graph) + grep fallback |
| hybrid-nograph | 555.1s | ~111s | Re-embeds 240 pages every run |
| ripgrep-bm25 | 0.1s | ~20ms | Pure in-memory term matching |
| vector-only | 131.8s | ~26s | Embeds once, cosine per query |

## What this confirms

The graph layer is doing the work.

`hybrid-nograph` is gbrain's own hybrid retrieval stack with the graph disabled —
same embedder, same chunking, same RRF, same codebase. It lands at 17.8% P@5,
barely a point above classic BM25. Add typed-edge traversal back in and P@5
jumps to 49.1%. That's **+31.4 points from the graph alone**, holding everything
else constant.
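The RRF step that both stacks share is the standard reciprocal-rank-fusion formula, score(d) = sum over lists of 1/(k + rank). A minimal sketch (k = 60 is the conventional constant; gbrain's actual parameters aren't shown here):

```typescript
// Fuse several ranked lists of page slugs into one ranking.
function rrf(lists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((doc, rank) => {
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([doc]) => doc);
}

const fused = rrf([
  ["a", "b", "c"], // e.g. a lexical ranking
  ["a", "b", "d"], // e.g. a vector ranking
]);
// "a" wins (top of both lists), "b" second
```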

Vector-only is the worst on these relational queries. Cosine similarity over
bio prose doesn't know that "Carol Wilson" appearing in a paragraph about
Anchor means she's employed there — it ranks by semantic neighborhood, which
puts other engineering people at other startups ahead of actual coworkers.
40.7% R@5 is the floor.

## Reproducibility

```sh
# From a clean checkout at commit b81373d
export OPENAI_API_KEY=sk-proj-... # embedding-based adapters need this
bun install
bun run eval:run
```

Deterministic adapters (`gbrain-after`, `ripgrep-bm25`, `vector-only`) match
this scorecard byte-for-byte. `hybrid-nograph` matches within tolerance bands
(N=5 smooths embedding nondeterminism).

For faster iteration: `BRAINBENCH_N=1 bun run eval:run:dev` (one run per adapter,
~2 min total).

## Methodology

- **Corpus:** 240 Opus-generated fictional biographical pages — 80 people, 80
companies, 50 meetings, 30 concepts. Committed at
`eval/data/world-v1/`, zero private data, no regen needed.
- **Gold:** 145 relational queries derived from each page's `_facts` metadata
— "Who attended X?", "Who works at X?", "Who invested in X?", "Who advises X?"
No `_facts` ever cross the adapter boundary; adapters see raw prose only
(enforced structurally in `Adapter.init`).
- **Metrics:** mean P@5 and R@5. Top-5 is what agents actually read in ranked
results.
- **N=5 runs per adapter**, page ingestion order shuffled with a per-run seed
(`shuffleSeeded`, LCG). Stddev surfaces order-dependent bugs. Zero stddev on
deterministic adapters is the expected-correct signal.
- **Temporal queries** (none in this 145-query set) require explicit
`as_of_date`, validated at query-authoring time.
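The seeded-shuffle step above can be sketched as an LCG driving a Fisher-Yates pass, so a given seed always produces the same page order. The constants and names below are illustrative, not gbrain's actual `shuffleSeeded`:

```typescript
// Tiny LCG: deterministic pseudo-random stream from a 32-bit seed.
function lcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    // Numerical Recipes constants
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 2 ** 32; // in [0, 1)
  };
}

// Fisher-Yates shuffle driven by the seeded stream.
function shuffleSeeded<T>(items: T[], seed: number): T[] {
  const rand = lcg(seed);
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Same seed, same order — which is what makes per-run seeds reproducible.
const a = shuffleSeeded([1, 2, 3, 4, 5], 42);
const b = shuffleSeeded([1, 2, 3, 4, 5], 42);
```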

## Notes

- This is a reproduction of the multi-adapter scorecard shipped with the
eval harness at `b81373d`. Numbers match the README table exactly for
`gbrain-after`, `ripgrep-bm25`, `vector-only` (deterministic) and are within
tolerance for `hybrid-nograph` (embedder nondeterminism).
- `bun run eval:run` exits with code 99 at the very end despite printing the
full scorecard cleanly. Tracked separately; the metrics above are all from
the completed run.
- For the BEFORE/AFTER PR #188 evaluation (graph layer vs no graph layer on
an earlier commit), see `2026-04-18-brainbench-v1.md`. This file is the
neutrality scorecard — gbrain compared to external baselines anyone could
reimplement.
133 changes: 133 additions & 0 deletions docs/benchmarks/2026-04-19-brainbench-v0_11-vs-v0_12.md
@@ -0,0 +1,133 @@
# BrainBench — gbrain v0.11.1 vs v0.12.1 (2026-04-19)

Historical regression comparison. Same harness, same corpus, same 145 queries
— only the gbrain `src/` tree varies. Answers the question "did the v0.12 work
make retrieval better, or are the external-adapter numbers the whole story?"

**Short answer:** v0.12.1 moves gbrain-after from **P@5 22.1% → 49.1%** and
**R@5 54.6% → 97.9%** on identical inputs. The v0.12 extract upgrades alone
explain most of the multi-adapter gap.

## Setup

| Slot | SHA | Dated | Version label |
|-----------------|-----------|--------------|----------------|
| BEFORE | `d861336` | 2026-04-18 | v0.11.1 (Minions + canonical migration) |
| AFTER (HEAD) | `b81373d` | 2026-04-19 | v0.12.1 base + eval harness Phase 3 |

Method:
1. `git worktree add ../gbrain-eval-v0.11 d861336` — old `src/` tree in isolation
2. Copy current `eval/` (harness, corpus, queries) into the worktree so both
runs score the identical benchmark
3. Patch the worktree's `gbrain-after` adapter to call `getLinks`/`getBacklinks`
(v0.11 graph API) with the same linkType filter + direction semantics as
`traversePaths` (v0.12). Same ranking logic, different underlying primitives.
4. Run `bun eval/runner/multi-adapter.ts --adapter=gbrain-after` at N=5 on both.

The external baselines (`ripgrep-bm25`, `vector-only`) share no code with
gbrain's `src/`, so their numbers are invariant across the two SHAs. Included
below for context only.

## Headline

| Adapter (config) | BEFORE v0.11.1 | AFTER v0.12.1 | Δ |
|-------------------------|----------------|----------------|---------------|
| **gbrain-after — P@5** | 22.1% | **49.1%** | **+27.0 pts** |
| **gbrain-after — R@5** | 54.6% | **97.9%** | **+43.3 pts** |
| Correct in top-5 (run 1)| 99 / 261 | **248 / 261** | **+149** |
| hybrid-nograph — P@5 | 17.8% | 17.8% | — |
| hybrid-nograph — R@5 | 65.1% | 65.1% | — |

Stddev = 0 on both versions — both adapter codepaths are deterministic over
ingestion order. The entire movement is on `gbrain-after`; `hybrid-nograph`
holds flat because v0.12 didn't change `hybridSearch`, chunking, or embedding.

## Where the gain came from

`runExtract` is the hinge. Same 240 raw pages in, very different graph out:

| What got extracted | v0.11.1 | v0.12.1 | Δ |
|--------------------------|-------------|-------------|------------|
| Pages with extractable links | 124 / 240 | 240 / 240 | +116 pages |
| Typed links created | 136 | 499 | **×3.7** |
| Timeline entries created | 27 | 2,208 | **×82** |

Three shipped fixes account for the jump (all on master between the two SHAs):
1. **`inferLinkType` rewrite** (PR #188 five-part patch) — `invested_in`,
`works_at`, `founded`, `advises` regexes extended to the narrative verbs
Opus-generated prose actually uses ("led the Series A", "early investor",
"the founder", "joined as partner"). Context window 80 → 240 chars.
2. **Auto-link on `put_page`** (v0.12.0) — typed edges get extracted on every
write instead of only when the user runs `extract` manually.
3. **Timeline extraction in `extract --source db`** (v0.12.0) — walks the
whole brain, pulls dated lines into structured entries. v0.11 only did
this on filesystem sync, so DB-only ingestion paths (like this benchmark)
saw almost no timeline data.
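In the spirit of fix 1, narrative-verb link typing looks roughly like the sketch below. The real patterns live in gbrain's `src/` and are not reproduced here; these regexes and the helper shape are illustrative only:

```typescript
// Map narrative phrasing near a link to a typed edge. The doc above notes
// the context window grew from 80 to 240 chars around the link.
const LINK_PATTERNS: Array<[RegExp, string]> = [
  [/led the Series [A-Z]|early investor|invested in/i, "invested_in"],
  [/joined as|works at|engineer at/i, "works_at"],
  [/the founder|founded|co-founded/i, "founded"],
  [/advises|advisor to/i, "advises"],
];

function inferLinkType(context: string): string | null {
  const window = context.slice(0, 240);
  for (const [re, type] of LINK_PATTERNS) {
    if (re.test(window)) return type;
  }
  return null;
}

inferLinkType("She led the Series A at Anchor"); // → "invested_in"
inferLinkType("joined as partner in 2024");      // → "works_at"
```

The v0.11 shortfall described above corresponds to narrower patterns (and a smaller window) returning `null` for most of this corpus's phrasings.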

## What this means for the multi-adapter scorecard

The April-18 multi-adapter scorecard shows gbrain-after beating
hybrid-nograph by 31 points P@5. This comparison explains the shape of that
gap: on v0.11.1 the same architecture only beats hybrid-nograph by **4.3
points P@5** (22.1% vs 17.8%). The 27-point extra lift came from v0.12's
extract quality, not the graph layer being present vs absent.

That's a useful refinement of the "the graph layer does the work" claim from
the April-18 benchmark. Sharper version:

> **The typed-edge graph + high-quality extraction together do the work.**
> Either piece alone only moves the needle a few points. Both pieces in
> combination account for the +31 P@5 gap.

## Reproducibility

```sh
# From providence/, current HEAD (b81373d)
bun run eval:run --adapter=gbrain-after
# gbrain-after N=5: P@5 49.1% ±0, R@5 97.9% ±0

# Historical side
git worktree add -f ../gbrain-eval-v0.11 d861336
cp -r eval ../gbrain-eval-v0.11/
ln -s $PWD/node_modules ../gbrain-eval-v0.11/node_modules
# Patch: gbrain-after uses getLinks/getBacklinks instead of traversePaths
# (v0.11 doesn't have traversePaths). Same direction + linkType filter
# semantics, different primitive. See the perl one-liner in the
# session commit message for the exact diff.
cd ../gbrain-eval-v0.11
bun eval/runner/multi-adapter.ts --adapter=gbrain-after
# gbrain-after N=5: P@5 22.1% ±0, R@5 54.6% ±0
```

## Methodology notes

- The v0.11 shim swaps `traversePaths(seed, {depth:1, direction, linkType})`
for `getLinks(seed)` / `getBacklinks(seed)` filtered in-memory by
`link_type`. At depth=1 this is semantically identical; it would diverge if
the query asked for depth>=2 (none here do). So the reported delta is
attributable to gbrain's extraction + storage, not to differences in how
the adapter interprets the graph at query time.
- External baselines would be identical on both SHAs by construction.
Re-running them adds no signal. If we later add a baseline that shares
gbrain code (a hybrid variant, say), we'd need to re-run it on both sides.
- `pages with extractable links` is the count `extract --source db` logs
after walking the brain. On v0.11 the filtering was narrower, so only
124/240 pages contributed any typed edge. On v0.12 every page contributes
at least one.
- The exit-99 fix on the multi-adapter runner (teardown of PGLite engines)
was applied to both sides before running, so neither run spuriously
returns a failing status to CI.
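The depth-1 equivalence in the first note above can be made concrete. Signatures here are hypothetical (v0.11's actual API shape is not reproduced); the point is only that one-hop traversal filtered by `link_type` reduces to `getLinks`/`getBacklinks` plus an in-memory filter:

```typescript
type Link = { target: string; link_type: string };

// Minimal stand-in for the v0.11 graph surface.
interface GraphV11 {
  getLinks(slug: string): Link[];      // outgoing edges
  getBacklinks(slug: string): Link[];  // incoming edges
}

// What traversePaths(seed, {depth: 1, direction, linkType}) computes,
// expressed with the older primitives.
function traverseDepth1(
  g: GraphV11,
  seed: string,
  direction: "out" | "in",
  linkType?: string,
): string[] {
  const edges = direction === "out" ? g.getLinks(seed) : g.getBacklinks(seed);
  return edges
    .filter((e) => linkType === undefined || e.link_type === linkType)
    .map((e) => e.target);
}
```

At depth >= 2 the two diverge (the shim would need its own frontier loop), which is why the note restricts the claim to the depth-1 queries actually in the set.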

## What this does not test

- **Other retrieval-adjacent v0.12 work** (sync quality, publish, lint,
integrations). BrainBench is scoped to retrieval. Per-feature tests still
live in `test/`.
- **Real prose beyond the 240-page fictional corpus.** The extract regex
wins on Opus-generated biographical prose. Real brain pages have their
own vocabulary quirks — future work is corpus diversity (tracked in
`eval/README.md` three-contributor-paths section).
- **Wall-clock or token cost.** v0.12's extract is slightly slower
(auto-link on every put_page, 2,208 timeline entries vs 27), but we
haven't benchmarked the difference. If that ever matters for autopilot,
it needs a separate pass.