39 changes: 39 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE/tier5-queries.md
@@ -0,0 +1,39 @@
<!--
Tier 5.5 Externally-Authored Query Submission template
See eval/CONTRIBUTING.md for the full workflow.
-->

## Summary

Submitting **N** Tier 5.5 queries for BrainBench.

- Author handle: `@your-handle`
- File location: `eval/external-authors/your-handle/queries.json`
- Queries authored fresh (not copy-pasted from a model output)
- Slugs verified against `eval/data/world-v1/` (via `bun run eval:world:view`)

## Checklist

- [ ] `bun run eval:query:validate eval/external-authors/your-handle/queries.json` passes
- [ ] At least 20 queries
- [ ] Each query has either `gold.relevant` (with real slugs) or `gold.expected_abstention: true`
- [ ] Temporal queries have `as_of_date` set (`corpus-end` | `per-source` | ISO-8601)
- [ ] Phrasing is varied (not all the same template)
- [ ] `author` field matches my handle
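A minimal query shape consistent with the checklist above (the slugs, `id` format, and exact field layout here are illustrative, not authoritative — `bun run eval:query:validate` and `eval/CONTRIBUTING.md` are the source of truth):

```json
{
  "id": "your-handle-001",
  "author": "your-handle",
  "query": "Who works at Anchor?",
  "gold": { "relevant": ["carol-wilson"] }
}
```

An abstention-bait query would instead set `"gold": { "expected_abstention": true }`, and a temporal query adds `"as_of_date"` (`corpus-end`, `per-source`, or an ISO-8601 date).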

## Phrasing variety (optional self-audit)

Tick the styles represented in your batch:

- [ ] Full sentence questions
- [ ] Fragment-style ("crypto founder Goldman Sachs background")
- [ ] Comparison ("X vs Y")
- [ ] Follow-up ("And who else...")
- [ ] Imperative ("Pull up Alice Davis")
- [ ] Trait-based ("the demanding engineering leader")
- [ ] Abstention bait (answer is "not in corpus")

## Notes to reviewer

Anything worth flagging — ambiguous cases, corpus gaps you found, specific
phrasings you were uncertain about.
40 changes: 40 additions & 0 deletions .github/workflows/eval-tests.yml
@@ -0,0 +1,40 @@
name: Eval tests

on:
push:
branches: [master]
paths:
- 'eval/**'
- 'src/core/link-extraction.ts'
- 'src/core/search/**'
pull_request:
branches: [master]
paths:
- 'eval/**'
- 'src/core/link-extraction.ts'
- 'src/core/search/**'

permissions:
contents: read

jobs:
eval-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
- uses: oven-sh/setup-bun@0c5077e51419868618aeaa5fe8019c62421857d6 # v2
with:
bun-version: latest
- run: bun install

# Validate the built-in Tier 5 + 5.5 query set.
- name: Validate built-in queries
run: bun run eval:query:validate

# Pure-function unit tests — zero API calls, fast.
- name: Run eval unit tests
run: bun run test:eval

# Smoke-test the world.html renderer against the committed corpus.
- name: Render world.html
run: bun run eval:world:render
1 change: 1 addition & 0 deletions .gitignore
@@ -12,3 +12,4 @@ supabase/.temp/
.claude/skills/
.idea
eval/reports/
eval/data/world-v1/world.html
99 changes: 99 additions & 0 deletions docs/benchmarks/2026-04-19-brainbench-multi-adapter.md
@@ -0,0 +1,99 @@
# BrainBench — multi-adapter side-by-side (2026-04-19)

**Branch:** `garrytan/gbrain-evals`
**Commit:** `b81373d`
**Engine:** PGLite (in-memory)
**Corpus:** `eval/data/world-v1/` (240 rich-prose fictional pages, committed)
**Runner:** `bun run eval:run` (N=5, page-order shuffled per run, seeded LCG)
**Wall time:** ~11.5 min

## Headline

| Adapter | Runs | Queries | P@5 | R@5 | Correct in top-5 (run 1) |
|------------------|------|---------|--------------|--------------|--------------------------|
| **gbrain-after** | 5 | 145 | **49.1%** ±0 | **97.9%** ±0 | **248 / 261** |
| hybrid-nograph | 5 | 145 | 17.8% | 65.1% | 129 / 261 |
| ripgrep-bm25 | 5 | 145 | 17.1% | 62.4% | 124 / 261 |
| vector-only | 5 | 145 | 10.8% | 40.7% | 78 / 261 |

Stddev = 0 across all adapters this run — every adapter is deterministic over
page ordering. That's the correct signal for the shipped code (non-zero would
surface an order-dependent tie-break bug).
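For reference, the two headline metrics reduce to a few lines. This is a sketch, not the harness's actual scoring code (type and function names here are illustrative):

```typescript
// P@k and R@k over one query's ranked result list vs its gold set.
type Scored = { gold: Set<string>; ranked: string[] };

function precisionAtK(q: Scored, k = 5): number {
  const hits = q.ranked.slice(0, k).filter((s) => q.gold.has(s)).length;
  return hits / k; // denominator is k, even if fewer results came back
}

function recallAtK(q: Scored, k = 5): number {
  if (q.gold.size === 0) return 0;
  const hits = q.ranked.slice(0, k).filter((s) => q.gold.has(s)).length;
  return hits / q.gold.size;
}

// Example: 2 gold pages, both inside the top 5
const q: Scored = {
  gold: new Set(["carol-wilson", "dan-ng"]),
  ranked: ["carol-wilson", "acme", "dan-ng", "bob", "eve"],
};
precisionAtK(q); // 0.4
recallAtK(q);    // 1
```

Note that if P@5 divides by k (as sketched here), gold sets smaller than 5 cap P@5 below 100% even for perfect retrieval, which would explain a ~49% P@5 coexisting with a ~98% R@5.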

### Deltas vs gbrain-after

- hybrid-nograph: P@5 **−31.4 pts**, R@5 **−32.9 pts**, correct-in-top-5 **−119**
- ripgrep-bm25: P@5 **−32.0 pts**, R@5 **−35.5 pts**, correct-in-top-5 **−124**
- vector-only: P@5 **−38.4 pts**, R@5 **−57.2 pts**, correct-in-top-5 **−170**

### Per-adapter wall time (5 runs)

| Adapter | Time | Per run | Notes |
|----------------|---------|---------|------------------------------------------|
| gbrain-after | 7.4s | ~1.5s | PGLite + extract (graph) + grep fallback |
| hybrid-nograph | 555.1s | ~111s | Re-embeds 240 pages every run |
| ripgrep-bm25 | 0.1s | ~20ms | Pure in-memory term matching |
| vector-only | 131.8s | ~26s | Embeds once, cosine per query |

## What this confirms

The graph layer is doing the work.

`hybrid-nograph` is gbrain's own hybrid retrieval stack with the graph disabled —
same embedder, same chunking, same RRF, same codebase. It lands at 17.8% P@5,
barely a point above classic BM25. Add typed-edge traversal back in and P@5
jumps to 49.1%. That's **+31.4 points from the graph alone**, holding everything
else constant.
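The RRF step that both stacks share is the standard reciprocal-rank-fusion formula, score(d) = sum over lists of 1/(k + rank). A minimal sketch (k = 60 is the conventional constant; gbrain's actual parameters aren't shown here):

```typescript
// Fuse several ranked lists of page slugs into one ranking.
function rrf(lists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((doc, rank) => {
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([doc]) => doc);
}

const fused = rrf([
  ["a", "b", "c"], // e.g. a lexical ranking
  ["a", "b", "d"], // e.g. a vector ranking
]);
// "a" wins (top of both lists), "b" second
```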

Vector-only is the worst on these relational queries. Cosine similarity over
bio prose doesn't know that "Carol Wilson" appearing in a paragraph about
Anchor means she's employed there — it ranks by semantic neighborhood, which
puts other engineering people at other startups ahead of actual coworkers.
40.7% R@5 is the floor.

## Reproducibility

```sh
# From a clean checkout at commit b81373d
export OPENAI_API_KEY=sk-proj-... # embedding-based adapters need this
bun install
bun run eval:run
```

Deterministic adapters (`gbrain-after`, `ripgrep-bm25`, `vector-only`) match
this scorecard byte-for-byte. `hybrid-nograph` matches within tolerance bands
(N=5 smooths embedding nondeterminism).

For faster iteration: `BRAINBENCH_N=1 bun run eval:run:dev` (one run per adapter,
~2 min total).

## Methodology

- **Corpus:** 240 Opus-generated fictional biographical pages — 80 people, 80
companies, 50 meetings, 30 concepts. Committed at
`eval/data/world-v1/`, zero private data, no regen needed.
- **Gold:** 145 relational queries derived from each page's `_facts` metadata
— "Who attended X?", "Who works at X?", "Who invested in X?", "Who advises X?"
No `_facts` ever cross the adapter boundary; adapters see raw prose only
(enforced structurally in `Adapter.init`).
- **Metrics:** mean P@5 and R@5. Top-5 is what agents actually read in ranked
results.
- **N=5 runs per adapter**, page ingestion order shuffled with a per-run seed
(`shuffleSeeded`, LCG). Stddev surfaces order-dependent bugs. Zero stddev on
deterministic adapters is the expected-correct signal.
- **Temporal queries** (none in this 145-query set) require explicit
`as_of_date`, validated at query-authoring time.
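The seeded-shuffle step above can be sketched as an LCG driving a Fisher-Yates pass, so a given seed always produces the same page order. The constants and names below are illustrative, not gbrain's actual `shuffleSeeded`:

```typescript
// Tiny LCG: deterministic pseudo-random stream from a 32-bit seed.
function lcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    // Numerical Recipes constants
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 2 ** 32; // in [0, 1)
  };
}

// Fisher-Yates shuffle driven by the seeded stream.
function shuffleSeeded<T>(items: T[], seed: number): T[] {
  const rand = lcg(seed);
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Same seed, same order — which is what makes per-run seeds reproducible.
const a = shuffleSeeded([1, 2, 3, 4, 5], 42);
const b = shuffleSeeded([1, 2, 3, 4, 5], 42);
```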

## Notes

- This is a reproduction of the multi-adapter scorecard shipped with the
eval harness at `b81373d`. Numbers match the README table exactly for
`gbrain-after`, `ripgrep-bm25`, `vector-only` (deterministic) and are within
tolerance for `hybrid-nograph` (embedder nondeterminism).
- `bun run eval:run` exits with code 99 at the very end despite printing the
full scorecard cleanly. Tracked separately; the metrics above are all from
the completed run.
- For the BEFORE/AFTER PR #188 evaluation (graph layer vs no graph layer on
an earlier commit), see `2026-04-18-brainbench-v1.md`. This file is the
neutrality scorecard — gbrain compared to external baselines anyone could
reimplement.
133 changes: 133 additions & 0 deletions docs/benchmarks/2026-04-19-brainbench-v0_11-vs-v0_12.md
@@ -0,0 +1,133 @@
# BrainBench — gbrain v0.11.1 vs v0.12.1 (2026-04-19)

Historical regression comparison. Same harness, same corpus, same 145 queries
— only the gbrain `src/` tree varies. Answers the question "did the v0.12 work
make retrieval better, or are the external-adapter numbers the whole story?"

**Short answer:** v0.12.1 moves gbrain-after from **P@5 22.1% → 49.1%** and
**R@5 54.6% → 97.9%** on identical inputs. The v0.12 extract upgrades alone
explain most of the multi-adapter gap.

## Setup

| Slot | SHA | Dated | Version label |
|-----------------|-----------|--------------|----------------|
| BEFORE | `d861336` | 2026-04-18 | v0.11.1 (Minions + canonical migration) |
| AFTER (HEAD) | `b81373d` | 2026-04-19 | v0.12.1 base + eval harness Phase 3 |

Method:
1. `git worktree add ../gbrain-eval-v0.11 d861336` — old `src/` tree in isolation
2. Copy current `eval/` (harness, corpus, queries) into the worktree so both
runs score the identical benchmark
3. Patch the worktree's `gbrain-after` adapter to call `getLinks`/`getBacklinks`
(v0.11 graph API) with the same linkType filter + direction semantics as
`traversePaths` (v0.12). Same ranking logic, different underlying primitives.
4. Run `bun eval/runner/multi-adapter.ts --adapter=gbrain-after` at N=5 on both.

The external baselines (`ripgrep-bm25`, `vector-only`) share no code with
gbrain's `src/`, so their numbers are invariant across the two SHAs. Included
below for context only.

## Headline

| Adapter (config) | BEFORE v0.11.1 | AFTER v0.12.1 | Δ |
|-------------------------|----------------|----------------|---------------|
| **gbrain-after — P@5** | 22.1% | **49.1%** | **+27.0 pts** |
| **gbrain-after — R@5** | 54.6% | **97.9%** | **+43.3 pts** |
| Correct in top-5 (run 1)| 99 / 261 | **248 / 261** | **+149** |
| hybrid-nograph — P@5 | 17.8% | 17.8% | — |
| hybrid-nograph — R@5 | 65.1% | 65.1% | — |

Stddev = 0 on both versions — both adapter codepaths are deterministic over
ingestion order. The entire movement is on `gbrain-after`; `hybrid-nograph`
holds flat because v0.12 didn't change `hybridSearch`, chunking, or embedding.

## Where the gain came from

`runExtract` is the hinge. Same 240 raw pages in, very different graph out:

| What got extracted | v0.11.1 | v0.12.1 | Δ |
|--------------------------|-------------|-------------|------------|
| Pages with extractable links | 124 / 240 | 240 / 240 | +116 pages |
| Typed links created | 136 | 499 | **×3.7** |
| Timeline entries created | 27 | 2,208 | **×82** |

Three shipped fixes account for the jump (all on master between the two SHAs):
1. **`inferLinkType` rewrite** (PR #188 five-part patch) — `invested_in`,
`works_at`, `founded`, `advises` regexes extended to the narrative verbs
Opus-generated prose actually uses ("led the Series A", "early investor",
"the founder", "joined as partner"). Context window 80 → 240 chars.
2. **Auto-link on `put_page`** (v0.12.0) — typed edges get extracted on every
write instead of only when the user runs `extract` manually.
3. **Timeline extraction in `extract --source db`** (v0.12.0) — walks the
whole brain, pulls dated lines into structured entries. v0.11 only did
this on filesystem sync, so DB-only ingestion paths (like this benchmark)
saw almost no timeline data.
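In the spirit of fix 1, narrative-verb link typing looks roughly like the sketch below. The real patterns live in gbrain's `src/` and are not reproduced here; these regexes and the helper shape are illustrative only:

```typescript
// Map narrative phrasing near a link to a typed edge. The doc above notes
// the context window grew from 80 to 240 chars around the link.
const LINK_PATTERNS: Array<[RegExp, string]> = [
  [/led the Series [A-Z]|early investor|invested in/i, "invested_in"],
  [/joined as|works at|engineer at/i, "works_at"],
  [/the founder|founded|co-founded/i, "founded"],
  [/advises|advisor to/i, "advises"],
];

function inferLinkType(context: string): string | null {
  const window = context.slice(0, 240);
  for (const [re, type] of LINK_PATTERNS) {
    if (re.test(window)) return type;
  }
  return null;
}

inferLinkType("She led the Series A at Anchor"); // → "invested_in"
inferLinkType("joined as partner in 2024");      // → "works_at"
```

The v0.11 shortfall described above corresponds to narrower patterns (and a smaller window) returning `null` for most of this corpus's phrasings.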

## What this means for the multi-adapter scorecard

The April-18 multi-adapter scorecard shows gbrain-after beating
hybrid-nograph by 31 points P@5. This comparison explains the shape of that
gap: on v0.11.1 the same architecture only beats hybrid-nograph by **4.3
points P@5** (22.1% vs 17.8%). The 27-point extra lift came from v0.12's
extract quality, not the graph layer being present vs absent.

That's a useful refinement of the "the graph layer does the work" claim from
the April-18 benchmark. Sharper version:

> **The typed-edge graph + high-quality extraction together do the work.**
> Either piece alone only moves the needle a few points. Both pieces in
> combination account for the +31 P@5 gap.

## Reproducibility

```sh
# From providence/, current HEAD (b81373d)
bun run eval:run --adapter=gbrain-after
# gbrain-after N=5: P@5 49.1% ±0, R@5 97.9% ±0

# Historical side
git worktree add -f ../gbrain-eval-v0.11 d861336
cp -r eval ../gbrain-eval-v0.11/
ln -s $PWD/node_modules ../gbrain-eval-v0.11/node_modules
# Patch: gbrain-after uses getLinks/getBacklinks instead of traversePaths
# (v0.11 doesn't have traversePaths). Same direction + linkType filter
# semantics, different primitive. See the perl one-liner in the
# session commit message for the exact diff.
cd ../gbrain-eval-v0.11
bun eval/runner/multi-adapter.ts --adapter=gbrain-after
# gbrain-after N=5: P@5 22.1% ±0, R@5 54.6% ±0
```

## Methodology notes

- The v0.11 shim swaps `traversePaths(seed, {depth:1, direction, linkType})`
for `getLinks(seed)` / `getBacklinks(seed)` filtered in-memory by
`link_type`. At depth=1 this is semantically identical; it would diverge if
the query asked for depth>=2 (none here do). So the reported delta is
attributable to gbrain's extraction + storage, not to differences in how
the adapter interprets the graph at query time.
- External baselines would be identical on both SHAs by construction.
Re-running them adds no signal. If we later add a baseline that shares
gbrain code (a hybrid variant, say), we'd need to re-run it on both sides.
- `pages with extractable links` is the count `extract --source db` logs
after walking the brain. On v0.11 the filtering was narrower, so only
124/240 pages contributed any typed edge. On v0.12 every page contributes
at least one.
- The exit-99 fix on the multi-adapter runner (teardown of PGLite engines)
was applied to both sides before running, so neither run spuriously
returns a failing status to CI.
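The depth-1 equivalence in the first note above can be made concrete. Signatures here are hypothetical (v0.11's actual API shape is not reproduced); the point is only that one-hop traversal filtered by `link_type` reduces to `getLinks`/`getBacklinks` plus an in-memory filter:

```typescript
type Link = { target: string; link_type: string };

// Minimal stand-in for the v0.11 graph surface.
interface GraphV11 {
  getLinks(slug: string): Link[];      // outgoing edges
  getBacklinks(slug: string): Link[];  // incoming edges
}

// What traversePaths(seed, {depth: 1, direction, linkType}) computes,
// expressed with the older primitives.
function traverseDepth1(
  g: GraphV11,
  seed: string,
  direction: "out" | "in",
  linkType?: string,
): string[] {
  const edges = direction === "out" ? g.getLinks(seed) : g.getBacklinks(seed);
  return edges
    .filter((e) => linkType === undefined || e.link_type === linkType)
    .map((e) => e.target);
}
```

At depth >= 2 the two diverge (the shim would need its own frontier loop), which is why the note restricts the claim to the depth-1 queries actually in the set.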

## What this does not test

- **Other retrieval-adjacent v0.12 work** (sync quality, publish, lint,
integrations). BrainBench is scoped to retrieval. Per-feature tests still
live in `test/`.
- **Real prose beyond the 240-page fictional corpus.** The extract regex
wins on Opus-generated biographical prose. Real brain pages have their
own vocabulary quirks — future work is corpus diversity (tracked in
`eval/README.md` three-contributor-paths section).
- **Wall-clock or token cost.** v0.12's extract is slightly slower
(auto-link on every put_page, 2,208 timeline entries vs 27), but we
haven't benchmarked the difference. If that ever matters for autopilot,
it needs a separate pass.