BrainBench v1.1: extraction fixes + 3 external baselines + N=5 + Tier 5/5.5 + world.html + contributor docs (all 3 phases)#195
Open
…ch prose
Extends inferLinkType patterns to cover rich-prose phrasings that miss with
v0.10.4 regexes. Targets the residuals called out in TODOS.md: works_at at
58% type accuracy, advises at 41%.
WORKS_AT_RE additions:
- Rank-prefixed: "senior engineer at", "staff engineer at", "principal/lead"
- Discipline-prefixed: "backend/frontend/full-stack/ML/data/security engineer at"
- Possessive time: "his/her/their/my time at"
- Leadership beyond "leads engineering": "heads up X at", "manages engineering at",
"runs product at", "leads the [team] at"
- Role nouns: "role at", "position at", "tenure as", "stint as"
- Promotion patterns: "promoted to staff/senior/principal at"
ADVISES_RE additions:
- Advisory capacity: "in an advisory capacity", "advisory engagement/partnership/contract"
- "as an advisor": "joined as an advisor", "serves as technical advisor"
- Prefixed advisor nouns: "strategic/technical/security/product/industry advisor to|at"
- Consulting: "consults for", "consulting role at|with"
New EMPLOYEE_ROLE_RE page-level prior: fires when the page describes the subject
as an employee (senior/staff/principal engineer, director, VP, CTO/CEO/CFO) at
some company. Biases outbound company refs toward works_at when per-edge context
is possessive or narrative without an explicit work verb. Scoped to person -> company
links only. Precedence: investor > advisor > employee (investors often hold board
seats which would otherwise mis-classify as advise/works_at).
ADVISOR_ROLE_RE broadened from "full-time/professional/advises multiple" to catch
any page that self-identifies the subject as an advisor ("is an advisor",
"serves as advisor", possessive "her advisory work/role/engagement").
Tests: 65 pass (16 new v0.10.5 coverage tests + 4 regression guards against
v0.10.4 tightenings). Templated benchmark still 88.9% type_accuracy (10/10 on
works_at and advises). Rich-prose measurement requires the multi-axis report
upgrade (next commit) to validate retroactively.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New Category 2 in BrainBench: per-link-type accuracy measured directly on the
240-page rich-prose world-v1 corpus. Distinct from Cat 1's retrieval metrics,
this measures whether inferLinkType() correctly classifies extracted edges
when the prose varies (the 58% works_at and 41% advises residuals that v0.10.5
regexes targeted).
How it works:
1. Loads all pages from eval/data/world-v1/
2. Derives GOLD expected edges from each page's _facts metadata
(founders → founded, investors → invested_in, advisors → advises,
employees → works_at, attendees → attended, primary_affiliation +
role drives person-page outbound type)
3. Runs extractPageLinks() on each page → INFERRED edges
4. Per (from, to) pair, compares inferred type vs gold type
5. Emits per-link-type table: correct / mistyped / missed / spurious +
type accuracy + recall + precision + strict F1 (triple match)
6. Full confusion matrix rows=gold, cols=inferred
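Steps 4-5 reduce to a small per-pair comparison. A minimal sketch, with hypothetical edge shapes (the real runner and its actual types live under eval/runner/):

```typescript
// Compare inferred edges to gold per (from, to) pair and tally per-type
// counts. Type accuracy is conditional on the edge being found at all.
type Edge = { from: string; to: string; type: string };
type Tally = { correct: number; mistyped: number; missed: number; spurious: number };

function scoreEdges(gold: Edge[], inferred: Edge[]) {
  const key = (e: Edge) => `${e.from}->${e.to}`;
  const goldBy = new Map(gold.map((e) => [key(e), e]));
  const infBy = new Map(inferred.map((e) => [key(e), e]));
  const stats: Record<string, Tally> = {};
  const bucket = (t: string): Tally =>
    (stats[t] ??= { correct: 0, mistyped: 0, missed: 0, spurious: 0 });
  for (const [k, g] of goldBy) {
    const inf = infBy.get(k);
    if (!inf) bucket(g.type).missed++;           // gold edge never extracted
    else if (inf.type === g.type) bucket(g.type).correct++;
    else bucket(g.type).mistyped++;              // extracted, wrong type
  }
  for (const [k, i] of infBy) if (!goldBy.has(k)) bucket(i.type).spurious++;
  const accuracy = (t: string) => {
    const s = bucket(t);
    return s.correct / Math.max(1, s.correct + s.mistyped);
  };
  return { stats, accuracy };
}
```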
v0.10.5 validation on 240-page corpus (up from pre-v0.10.5 baselines):
- works_at: 58% → 100.0% (+42 pts) — 10/10 correct, 0 mistyped
- advises: 41% → 88.2% (+47 pts) — 15/17 correct
- attended: — → 100.0% (131/134 recall)
- founded: 100% → 100.0% (40/40)
- invested_in: 89% → 92.0% (69/75)
- Overall: 88.5% → 95.7% type accuracy (conditional on edge found)
Strict F1 overall: 53.7%. Lower because the _facts-based gold set only
captures core relationships; rich prose extracts many peripheral mentions
(190 spurious "mentions" edges) that aren't bugs but are correctly-typed
prose references without a _facts counterpart. Spurious counts are signal
for future type-precision tuning, not failure.
Wired into eval/runner/all.ts as Cat 2 so every full benchmark run includes
the rich-prose type accuracy table alongside retrieval metrics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 credibility unlock: BrainBench now compares gbrain to external
baselines on the same corpus and queries. Transforms the benchmark from
internal ablation ("gbrain-graph beats gbrain-grep") to category comparison
("gbrain-graph beats classic BM25 by 32 pts P@5"). This is the #1 fix
from the 4-review arc — addresses Codex's core critique that v1's
before/after was self-referential.
Added:
eval/runner/types.ts — Adapter interface (v1.1 spec)
eval/runner/adapters/ripgrep-bm25.ts — EXT-1 classic IR baseline
eval/runner/adapters/ripgrep-bm25.test.ts — 11 unit tests, all pass
eval/runner/multi-adapter.ts — side-by-side scorer
Adapter interface (eng pass 2 spec):
- Thin 3-method Strategy: init(rawPages, config), query(q, state), snapshot(state)
- BrainState is opaque to runner (never inspected)
- Raw pages passed in-memory; gold/ never crosses adapter boundary
(structural ingestion-boundary enforcement)
- PoisonDisposition enum reserved for future poison-resistance scoring
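The 3-method Strategy described above looks roughly like this. Field names are approximations — the authoritative definition is eval/runner/types.ts — and the toy substring adapter is purely illustrative:

```typescript
// Hedged sketch of the Adapter interface (approximation of eval/runner/types.ts).
type RawPage = { slug: string; title: string; body: string };
type Ranked = { slug: string; score: number };

interface Adapter<State = unknown> {
  // Ingest raw pages; gold/ never crosses this boundary.
  init(rawPages: RawPage[], config: Record<string, unknown>): Promise<State>;
  // Answer one query against state the runner treats as opaque.
  query(q: { id: string; text: string }, state: State): Promise<Ranked[]>;
  // Serializable view of state for debugging/repro.
  snapshot(state: State): unknown;
}

// Toy adapter for illustration: ranks pages by exact substring hit.
const substringAdapter: Adapter<RawPage[]> = {
  async init(rawPages) { return rawPages; },
  async query(q, pages) {
    return pages
      .map((p) => ({ slug: p.slug, score: p.body.includes(q.text) ? 1 : 0 }))
      .filter((r) => r.score > 0);
  },
  snapshot(pages) { return { pageCount: pages.length }; },
};
```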
EXT-1 ripgrep+BM25:
- Classic Lucene-variant IDF + k1/b tuned at standard 1.5/0.75
- Title tokens double-weighted for entity-page slug-match bias
- Stopword filter, alphanumeric tokenization, stable lexicographic tie-break
- Pure in-memory inverted index — no external deps, ~100 LOC core
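The scoring core under those knobs can be sketched in a few lines. This is a simplified reimplementation under the stated assumptions (Lucene-style IDF, k1=1.5, b=0.75, title tokens counted twice), not the shipped eval/runner/adapters/ripgrep-bm25.ts:

```typescript
// Simplified BM25 sketch: Lucene-variant IDF, k1/b at 1.5/0.75,
// title double-weighting, lexicographic tie-break.
const K1 = 1.5, B = 0.75;
const tokenize = (s: string) => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];

function bm25Scores(docs: { slug: string; title: string; body: string }[], query: string) {
  // Title tokens appear twice in the bag => double weight.
  const bags = docs.map((d) => [...tokenize(d.title), ...tokenize(d.title), ...tokenize(d.body)]);
  const avgLen = bags.reduce((s, b) => s + b.length, 0) / bags.length;
  const N = docs.length;
  const df = new Map<string, number>();
  for (const bag of bags) for (const t of new Set(bag)) df.set(t, (df.get(t) ?? 0) + 1);
  const idf = (t: string) => {
    const n = df.get(t) ?? 0;
    return Math.log(1 + (N - n + 0.5) / (n + 0.5)); // Lucene-variant IDF
  };
  return docs
    .map((d, i) => {
      const tf = new Map<string, number>();
      for (const t of bags[i]) tf.set(t, (tf.get(t) ?? 0) + 1);
      let score = 0;
      for (const t of new Set(tokenize(query))) {
        const f = tf.get(t) ?? 0;
        score += idf(t) * ((f * (K1 + 1)) / (f + K1 * (1 - B + B * (bags[i].length / avgLen))));
      }
      return { slug: d.slug, score };
    })
    .sort((a, b) => b.score - a.score || a.slug.localeCompare(b.slug)); // stable tie-break
}
```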
First side-by-side results on 240-page rich-prose corpus, 145 relational queries:
| Adapter | P@5 | R@5 | Correct top-5 |
|---------------|--------|--------|---------------|
| gbrain-after | 49.1% | 97.9% | 248/261 |
| ripgrep-bm25 | 17.1% | 62.4% | 124/261 |
| Delta | +32.0 | +35.5 | +124 |
gbrain-after is the hybrid graph+grep config from PR #188. Ripgrep+BM25 is
a genuinely strong classic-IR baseline (BM25 is what Lucene/Elasticsearch
ship). gbrain's ~+32-point lead on relational queries reflects real work
by the knowledge graph layer: typed links + traversePaths surface the
correct answers in top-K that BM25 only pulls in via partial-text overlap.
Next in Phase 2: EXT-2 vector-only RAG + EXT-3 hybrid-without-graph
adapters. Both plug into the same Adapter interface.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second external baseline for BrainBench. Pure cosine-similarity ranking using
the SAME text-embedding-3-large model gbrain uses internally — apples-to-apples
on the embedding layer, so any gbrain lead reflects the graph + hybrid fusion,
not a better embedder.
Files:
eval/runner/adapters/vector-only.ts — ~130 LOC
eval/runner/adapters/vector-only.test.ts — 6 unit tests (cosine math)
Design:
- One vector per page (title + compiled_truth + timeline, capped 8K chars).
- No chunking (intentional; chunked vector RAG would be EXT-2b later).
- No keyword fallback (that's EXT-3 hybrid-without-graph).
- Embeddings in batches of 50 via existing src/core/embedding.ts (retry + backoff).
- Cost on 240 pages: ~$0.02/run.
Three-adapter side-by-side on 240-page rich-prose corpus, 145 relational queries:
| Adapter | P@5 | R@5 | Correct top-5 |
|---------------|--------|--------|---------------|
| gbrain-after | 49.1% | 97.9% | 248/261 |
| ripgrep-bm25 | 17.1% | 62.4% | 124/261 |
| vector-only | 10.8% | 40.7% | 78/261 |
Interesting finding: vector-only scores WORSE than BM25 on relational queries
like "Who invested in X?" — exact entity match matters more than semantic
similarity for these templates. BM25 nails the entity-name term; vector-only
returns topically-similar-but-not-mentioning pages. This is the known failure
mode of pure-vector RAG on precise relational/identity queries. Real-world
vector RAG systems always add keyword fallback; EXT-3 (hybrid-without-graph)
will be that fairer comparator.
gbrain's lead widens in the vector-only comparison: +38.4 pts P@5, +57.2 pts
R@5. The graph layer is doing the heavy lifting for relational traversal; pure
vector RAG can't express "traverse 'attended' edges from this meeting page."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
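The one-vector-per-page cosine ranking described above can be sketched as follows. Embedding calls are elided; the vectors stand in for text-embedding-3-large outputs, and the shapes are illustrative rather than the shipped adapter's:

```typescript
// Cosine similarity of two equal-length vectors; zero-vector guard returns 0.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// Rank pages by cosine to the query vector (one vector per page, no chunking).
function rankByCosine(pages: { slug: string; vec: number[] }[], queryVec: number[]) {
  return pages
    .map((p) => ({ slug: p.slug, score: cosine(p.vec, queryVec) }))
    .sort((x, y) => y.score - x.score);
}
```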
Third and closest-to-gbrain external baseline. Runs gbrain's full hybrid
search (vector + keyword + RRF fusion + dedup) WITHOUT the knowledge-graph
layer. Same engine, same embedder, same chunking, same hybrid fusion —
only traversePaths + typed-link extraction turned off.
This is the decisive comparator for "does the knowledge graph do useful
work?" Same everything-else, only graph differs. Any lead gbrain-after has
over EXT-3 is 100% attributable to the graph layer.
Files:
eval/runner/adapters/hybrid-nograph.ts — ~110 LOC
Implementation:
- New PGLiteEngine per run; auto_link set to 'false' (belt).
- importFromContent() used instead of bare putPage() so chunks +
embeddings get populated (hybridSearch needs them).
- NO runExtract() call — typed links/timeline stay empty (suspenders).
- hybridSearch(engine, q.text) answers every query. Aggregate chunks
to page-level by best chunk score.
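The chunk-to-page aggregation step is simple enough to sketch. The hit shape here is hypothetical, not the actual hybridSearch return type:

```typescript
// Keep the best chunk score per page, then rank pages by that score.
function aggregateToPages(chunkHits: { pageSlug: string; score: number }[]) {
  const best = new Map<string, number>();
  for (const hit of chunkHits) {
    const prev = best.get(hit.pageSlug);
    if (prev === undefined || hit.score > prev) best.set(hit.pageSlug, hit.score);
  }
  return [...best.entries()]
    .map(([slug, score]) => ({ slug, score }))
    .sort((a, b) => b.score - a.score);
}
```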
FOUR-adapter side-by-side on 240-page rich-prose corpus, 145 relational queries:
| Adapter | P@5 | R@5 | Correct/Gold |
|-----------------|--------|--------|--------------|
| gbrain-after | 49.1% | 97.9% | 248/261 |
| hybrid-nograph | 17.8% | 65.1% | 129/261 |
| ripgrep-bm25 | 17.1% | 62.4% | 124/261 |
| vector-only | 10.8% | 40.7% | 78/261 |
The headline delta nobody can hand-wave away:
gbrain-after → hybrid-nograph = +31.4 P@5, +32.9 R@5
hybrid-nograph → ripgrep-bm25 = +0.7 P@5, +2.7 R@5
Hybrid search (vector+keyword+RRF) over pure BM25 gains ~1 point. The
knowledge graph layer over hybrid gains ~31 points. The graph is doing
the work; adding it to a retrieval stack is what actually moves the needle
on relational queries. The vector/keyword/BM25 debate is a footnote.
Timing: hybrid-nograph init is ~2 min (embeds 240 pages once); query loop
is fast. gbrain-after is ~1.5s total because traversePaths doesn't need
embeddings. Runs at ~$0.02 Opus-equivalent in embedding cost.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ic + N=5 tolerance bands
Closes multiple Phase 2 items in one commit since they form a cohesive
package: query schema enforcement + new query tiers + per-query-set
statistical rigor.
Added:
eval/runner/queries/validator.ts — hand-rolled Query schema validator
eval/runner/queries/validator.test.ts — 24 unit tests, all pass
eval/runner/queries/tier5-fuzzy.ts — 30 hand-authored Tier 5 Fuzzy/Vibe queries
eval/runner/queries/tier5_5-synthetic.ts — 50 SYNTHETIC-labeled outsider-style queries (author: "synthetic-outsider-v1")
eval/runner/queries/index.ts — aggregator + validateAll()
Modified:
eval/runner/multi-adapter.ts — N=5 runs per adapter (BRAINBENCH_N override), page-order shuffle, mean±stddev reporting
Query validator (hand-rolled, no zod dep to match gbrain codebase style):
- Temporal verb regex enforces as_of_date (per eng pass 2 spec):
  /\b(is|was|were|current|now|at the time|during|as of|when did)\b/i
- Validates tier enum, expected_output_type enum, gold shape per type
- gold.relevant must be non-empty slug[] for cited-source-pages queries
- abstention requires gold.expected_abstention === true
- externally-authored tier requires author field
- batch validation catches duplicate IDs
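The temporal rule is the least obvious check, so here is a minimal sketch of it. Field names mirror this description; the real validator is eval/runner/queries/validator.ts and checks much more (tier enum, gold shape, author field):

```typescript
// Any query whose text trips the temporal-verb regex must carry as_of_date.
const TEMPORAL_RE = /\b(is|was|were|current|now|at the time|during|as of|when did)\b/i;

function checkTemporal(q: { text: string; as_of_date?: string }): string[] {
  const errors: string[] = [];
  if (TEMPORAL_RE.test(q.text) && !q.as_of_date) {
    errors.push("temporal phrasing requires as_of_date");
  }
  return errors;
}
```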
Tier 5 Fuzzy/Vibe (30 queries, hand-authored):
- Vague recall: "Someone who was a senior engineer at a biotech company..."
- Trait-based: "The engineer who pushed back on microservices"
- Cultural/epithet: "Who is known as a 'systems builder' in security?"
- Abstention bait: "Which Layer 1 project did the crypto guy leave?" (prose
mentions but never names; good systems abstain)
- Addresses Codex's circularity critique — vague queries where graph-heavy
systems shouldn't inherently win.
Tier 5.5 Synthetic Outsider (50 queries, AI-authored placeholder):
- Clearly labeled author: "synthetic-outsider-v1"
- Phrasing variety not in the 4 template families:
* fragment style ("crypto founder Goldman Sachs background")
* polite/natural ("Can you pull up what we have on...")
* comparison ("What is the difference between X and Y?")
* follow-up ("And who else advises Orbit Labs?")
* typos/misspellings ("adam lopez bioinformatcis")
* similarity ("Find me someone like Alice Davis...")
* imperative ("Pull up Alice Davis")
- Real Tier 5.5 from outside researchers supersedes synthetic via
PRs to eval/external-authors/ (docs ship in follow-up commit).
N=5 tolerance bands:
- Default N=5, override via BRAINBENCH_N env var (e.g. BRAINBENCH_N=1 for dev loops)
- Per-run seeded Fisher-Yates shuffle of page ingest order (LCG seed = run_idx+1)
- Surfaces order-dependent adapter bugs (tie-break-by-first-seen etc.)
- Reports mean ± sample-stddev per metric
- "stddev = 0" is honest signal that the adapter is deterministic, not a bug.
LLM-judge metrics (future) will naturally produce non-zero stddev.
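The shuffle and tolerance-band math above can be sketched as follows. The LCG constants here are the classic Numerical Recipes pair — an assumption; the shipped implementation in eval/runner/multi-adapter.ts may use different constants:

```typescript
// Deterministic LCG keyed by seed (run_idx + 1 in the runner's scheme).
function lcg(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (Math.imul(s, 1664525) + 1013904223) >>> 0; // Numerical Recipes constants
    return s / 2 ** 32;
  };
}

// Seeded Fisher-Yates: same seed => same ingest order, every run.
function shuffle<T>(items: T[], seed: number): T[] {
  const rand = lcg(seed);
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Mean ± sample stddev (n-1 denominator) for the tolerance band report.
function meanStddev(xs: number[]): { mean: number; stddev: number } {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const varSum = xs.reduce((a, x) => a + (x - mean) ** 2, 0);
  const stddev = xs.length > 1 ? Math.sqrt(varSum / (xs.length - 1)) : 0;
  return { mean, stddev };
}
```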
Validation: all 80 Tier 5 + 5.5 queries pass validateAll(). 24 validator
unit tests pass.
Next commit: world.html contributor explorer (Phase 3).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor DX magical moment. Static HTML explorer renders the full
canonical world (240 entities) as an explorable tree, opens in any browser,
zero install. Every string HTML-entity-encoded (XSS-safe — direct vuln
class per eng pass 2, confidence 9/10).
Added:
eval/generators/world-html.ts — renderer (~240 LOC; single-file
HTML with inline CSS + minimal JS)
eval/generators/world-html.test.ts — 16 tests (XSS + rendering correctness)
eval/cli/world-view.ts — render + open in default browser
eval/cli/query-validate.ts — CLI wrapper for queries/validator
eval/cli/query-new.ts — scaffold a query template
Modified:
package.json — 7 new eval:* scripts
.gitignore — ignore generated world.html
package.json scripts shipped:
bun run test:eval all eval unit tests (57 pass)
bun run eval:run full 4-adapter N=5 side-by-side
bun run eval:run:dev N=1 fast dev iteration
bun run eval:world:view render world.html + open in browser
bun run eval:world:render render only (CI-friendly, --no-open)
bun run eval:query:validate validate built-in T5+T5.5 (or a file path)
bun run eval:query:new scaffold a new Query JSON template
bun run eval:type-accuracy per-link-type accuracy report
XSS safety:
escapeHtml() encodes the 5 critical chars (& < > " '). Tested directly
with representative Opus-generated attacks:
<img src=x onerror=alert('xss')> → &lt;img src=x onerror=alert(&#39;xss&#39;)&gt;
<script>fetch('/steal')</script> → &lt;script&gt;fetch(&#39;/steal&#39;)&lt;/script&gt;
Ledger metadata (generated_at, model) also escaped — covers the less
obvious attack surface where Opus could emit tag-like content into the
metadata file.
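The encoder is small enough to sketch in full. This mirrors the 5-char description above but is not the shipped eval/generators/world-html.ts:

```typescript
// Entity-encode the 5 critical chars. '&' must run first so the entities
// emitted by the later replacements aren't double-escaped.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```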
world.html structure:
- Left rail: entities grouped by type with counts (companies, people,
meetings, concepts), alphabetical within type
- Right pane: per-entity cards with title + slug + compiled_truth +
timeline + canonical _facts as collapsed JSON
- URL fragment deep-links (#people/alice-chen)
- Sticky rail on desktop; responsive stack on mobile
- Vanilla JS for active-link highlighting on scroll (no framework)
Generated file: ~1MB for 240 entities (full prose). Gitignored; rebuild
with `bun run eval:world:view`. Regeneration is ~50ms.
Contributor TTHW (Tier 5.5 query authoring):
1. bun run eval:world:view # see entities
2. bun run eval:query:new --tier externally-authored --author "@me"
3. edit template with real slug + query text
4. bun run eval:query:validate path/to/file.json
5. submit PR
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships the contributor-onboarding surface promised in the plan. With this
commit, external researchers have a self-serve path from clone to PR in
under 5 minutes.
Added:
eval/README.md — 5-minute quickstart,
directory map, methodology
one-pager, adapter scorecard
eval/CONTRIBUTING.md — three contributor paths:
1. Write Tier 5.5 queries
2. Submit an external adapter
3. Reproduce a scorecard
eval/RUNBOOK.md — operational troubleshooting:
generation failures, runner
failures, query validation,
world.html rendering, CI
eval/CREDITS.md — contributor attribution
(synthetic-outsider-v1 labeled
as placeholder; real submissions
land here)
.github/PULL_REQUEST_TEMPLATE/tier5-queries.md — structured PR template
for Tier 5.5 submissions
.github/workflows/eval-tests.yml — CI: validates queries,
runs all eval unit tests,
renders world.html on every PR
touching eval/** or
src/core/link-extraction.ts
CI scope (intentionally narrow):
- Triggers on paths: eval/**, src/core/link-extraction.ts, src/core/search/**
- Runs: bun run eval:query:validate (80 queries), test:eval (57 tests),
eval:world:render (smoke-test the HTML renderer)
- Pinned actions by commit SHA (matches existing .github/workflows/test.yml)
- Zero API calls — all Opus/OpenAI paths stubbed or skipped in unit tests
- Fast: ~30s total wall clock
Contributor TTHW (clone → first merged PR):
- Path 1 (Tier 5.5 queries): ~5 min
- Path 2 (external adapter): ~30 min for a simple adapter
- Path 3 (reproduce scorecard): ~15 min wall clock (N=5 run)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The multi-adapter runner left PGLite engines alive after each run.
GbrainAfterAdapter and HybridNoGraphAdapter both instantiate a PGLiteEngine
in init() but never disconnect it; Bun's shutdown path exits with code 99
when embedded-Postgres workers outlive main().
Added optional `teardown?(state)` to the Adapter interface, implemented it on
both engine-backed adapters, and call it from scoreOneRun after the N=5 loop.
ripgrep-bm25 and vector-only hold no DB resources and don't need a teardown.
Verified: gbrain-after, hybrid-nograph, ripgrep-bm25, vector-only all exit 0
at N=1. Full test:eval passes (57 tests). No metric change.
Reproducibility run of the 4-adapter side-by-side at commit b81373d (branch
garrytan/gbrain-evals). N=5, 240-page corpus, 145 relational queries from
world-v1.
Headline: gbrain-after 49.1% P@5 / 97.9% R@5. hybrid-nograph 17.8% / 65.1%.
ripgrep-bm25 17.1% / 62.4%. vector-only 10.8% / 40.7%.
All adapters deterministic (stddev = 0 across the 5 runs per adapter). Matches
the scorecard in eval/README.md byte-for-byte for the three deterministic
adapters; hybrid-nograph matches within tolerance bands.
Runs the same eval harness against two gbrain src/ trees on the same 240-page
corpus and 145 queries. Patches the v0.11 copy's gbrain-after adapter to use
getLinks/getBacklinks (v0.11 has no traversePaths) with identical
direction+linkType semantics.
gbrain-after P@5 22.1% → 49.1% (+27 pts); R@5 54.6% → 97.9% (+43 pts);
correct-in-top-5 99 → 248 (+149). hybrid-nograph flat at 17.8% / 65.1% on
both (v0.12 didn't touch hybridSearch / chunking).
Driver is extraction quality, not graph presence: v0.12 emits 499 typed links
(v0.11: 136, ×3.7) and 2,208 timeline entries (v0.11: 27, ×82) on the same
240 pages. Sharpens the April-18 "graph layer does the work" claim — on v0.11
that architecture only beat hybrid-nograph by 4.3 points; the 31-point lead in
the multi-adapter scorecard comes from graph + high-quality extraction in
combination.
# Conflicts:
#	package.json
Summary
All 3 phases of the BrainBench v1.1 delta plan in one PR. Ships the v0.10.5
extraction-residual fixes, per-link-type accuracy measurement, 3 external
adapter baselines with N=5 tolerance bands, 80 new Tier 5 + 5.5 queries with a
schema validator, the world.html contributor explorer, 7 new eval:* scripts,
4 contributor docs, and a CI workflow.
Headline result (4-adapter, 145-query relational benchmark on 240-page
rich-prose corpus):
- gbrain-after → hybrid-nograph (graph vs. no graph, same embedder): +31.3 pts P@5
- hybrid-nograph → ripgrep-bm25 (hybrid over BM25): +0.7 pts P@5
The knowledge-graph layer is where the value is. Fixes Codex's "this is an
internal test, not a standard" critique factually.
Commits (8)
Phase 1 (extraction quality)
- 109b716 fix(link-extraction): v0.10.5 regex expansion. works_at 58→100%, advises 41→88% type accuracy on rich prose.
- 52ba00f feat(eval): Cat 2 type-accuracy runner + wire into eval/runner/all.ts. Overall type accuracy 88.5→95.7%.
Phase 2 (credibility unlock — external baselines)
- 629ba85 feat(eval): Adapter interface (eval/runner/types.ts) + EXT-1 ripgrep+BM25 (11 unit tests).
- 633be38 feat(eval): EXT-2 vector-only RAG. Same embedder as gbrain (apples-to-apples).
- bfa8564 feat(eval): EXT-3 hybrid-without-graph. The decisive comparator.
- e2a5dc4 feat(eval): Query validator (temporal as_of_date rule, hand-rolled, no zod dep) + Tier 5 Fuzzy (30 hand-authored) + Tier 5.5 (50 synthetic-outsider, clearly labeled) + N=5 tolerance bands with page-shuffle.
Phase 3 (contributor DX)
- f0649e2 feat(eval): world.html explorer with XSS-safe rendering + 7 new eval:* package.json scripts.
- b81373d docs(eval): README + CONTRIBUTING + RUNBOOK + CREDITS + PR template + CI workflow.
Test plan
- bun run test:eval — 57 pass (queries validator, BM25, vector-only, world-html)
- bun run eval:query:validate — 80/80 queries valid
- bun test test/link-extraction.test.ts — 65 pass (16 new v0.10.5 tests + 4 regression guards)
- bun eval/runner/multi-adapter.ts --adapter=gbrain-after — N=3 runs, deterministic (stddev=0)
- bun eval/runner/multi-adapter.ts — full 4-adapter side-by-side scorecard
- bun eval/runner/type-accuracy.ts — overall 95.7% on rich prose
- bun run eval:world:render — world.html renders correctly, XSS-safe
- bun run eval:query:new --tier fuzzy — valid scaffold
- CI runs on eval/** paths: validator, unit tests, world.html render
v0.10.5 extraction validation (rich-prose, 240 pages)
New contributor surface
From eval/README.md:
Three contributor paths documented in eval/CONTRIBUTING.md.
eval/RUNBOOK.md covers generation failures, runner failures, query validation
errors, world.html rendering issues, CI failures. Self-serve operational
troubleshooting.
Plan→reality deltas (honest notes)
- Tier 5.5 queries are AI-authored placeholders (author: "synthetic-outsider-v1"). Real outside researcher submissions supersede them via PRs to eval/external-authors/<handle>/queries.json. The scaffolding ships; the human content follows.
- Default N=5 with BRAINBENCH_N=1 override for dev. Current adapters are deterministic → stddev=0 (signal, not bug). LLM-judge metrics in v1.2 will produce non-zero bands.
- world.html is gitignored; rebuild with bun run eval:world:view. Fast (~50ms).
- Query validator is hand-rolled (same pattern as src/core/yaml-lite.ts). Saves a dep.
🤖 Generated with Claude Code