
BrainBench v1.1: extraction fixes + 3 external baselines + N=5 + Tier 5/5.5 + world.html + contributor docs (all 3 phases)#195

Open
garrytan wants to merge 12 commits into master from garrytan/gbrain-evals

Conversation


@garrytan garrytan commented Apr 18, 2026

Summary

All 3 phases of the BrainBench v1.1 delta plan in one PR. Ships the v0.10.5 extraction residual fixes, per-link-type accuracy measurement, 3 external adapter baselines with N=5 tolerance bands, 80 new Tier 5 + 5.5 queries with a schema validator, the world.html contributor explorer, 7 new eval:* scripts, 4 contributor docs, and a CI workflow.

Headline result (4-adapter, 145-query relational benchmark on 240-page rich-prose corpus):

| Adapter        | P@5   | R@5   | Correct top-5 |
|----------------|-------|-------|---------------|
| gbrain-after   | 49.1% | 97.9% | 248/261       |
| hybrid-nograph | 17.8% | 65.1% | 129/261       |
| ripgrep-bm25   | 17.1% | 62.4% | 124/261       |
| vector-only    | 10.8% | 40.7% |  78/261       |

gbrain-after → hybrid-nograph (graph vs. no graph, same embedder): +31.3 pts P@5. hybrid-nograph → ripgrep-bm25 (hybrid over BM25): +0.7 pts P@5. The knowledge-graph layer is where the value lies. This answers Codex's "this is an internal test, not a standard" critique with measured external baselines.
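The P@5 / R@5 columns above follow the standard definitions; a minimal sketch, assuming the harness scores each query's top-5 ranked slugs against a gold relevant set (function name hypothetical):

```typescript
// Sketch (assumed semantics): precision@5 = relevant hits in top 5 / 5,
// recall@5 = relevant hits in top 5 / |gold|; averaged over all queries.
function precisionRecallAt5(top5: string[], gold: Set<string>) {
  const hits = top5.slice(0, 5).filter((slug) => gold.has(slug)).length;
  return { p5: hits / 5, r5: gold.size === 0 ? 0 : hits / gold.size };
}
```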

Commits (8)

Phase 1 (extraction quality)

  1. 109b716 fix(link-extraction): v0.10.5 regex expansion. works_at 58→100%, advises 41→88% type accuracy on rich prose.
  2. 52ba00f feat(eval): Cat 2 type-accuracy runner + wire into eval/runner/all.ts. Overall type accuracy 88.5→95.7%.

Phase 2 (credibility unlock — external baselines)

  1. 629ba85 feat(eval): Adapter interface (eval/runner/types.ts) + EXT-1 ripgrep+BM25 (11 unit tests).
  2. 633be38 feat(eval): EXT-2 vector-only RAG. Same embedder as gbrain (apples-to-apples).
  3. bfa8564 feat(eval): EXT-3 hybrid-without-graph. The decisive comparator.
  4. e2a5dc4 feat(eval): Query validator (temporal as_of_date rule, hand-rolled, no zod dep) + Tier 5 Fuzzy (30 hand-authored) + Tier 5.5 (50 synthetic-outsider, clearly labeled) + N=5 tolerance bands with page-shuffle.

Phase 3 (contributor DX)

  1. f0649e2 feat(eval): world.html explorer with XSS-safe rendering + 7 new eval:* package.json scripts.
  2. b81373d docs(eval): README + CONTRIBUTING + RUNBOOK + CREDITS + PR template + CI workflow.

Test plan

  • bun run test:eval — 57 pass (queries validator, BM25, vector-only, world-html)
  • bun run eval:query:validate — 80/80 queries valid
  • bun test test/link-extraction.test.ts — 65 pass (16 new v0.10.5 tests + 4 regression guards)
  • bun eval/runner/multi-adapter.ts --adapter=gbrain-after — N=3 runs, deterministic (stddev=0)
  • bun eval/runner/multi-adapter.ts — full 4-adapter side-by-side scorecard
  • bun eval/runner/type-accuracy.ts — Overall 95.7% on rich prose
  • bun run eval:world:render — world.html renders correctly, XSS-safe
  • bun run eval:query:new --tier fuzzy — valid scaffold
  • CI workflow: triggers on eval/** paths; runs validator, unit tests, world.html render

v0.10.5 extraction validation (rich-prose, 240 pages)

| Link type   | Before | After  | Target / Notes          |
|-------------|--------|--------|-------------------------|
| works_at    | 58%    | 100.0% | >85% ✓ (+42 pts)        |
| advises     | 41%    | 88.2%  | >85% ✓ (+47 pts)        |
| attended    | —      | 100.0% | 131/134                 |
| founded     | 100%   | 100.0% | 40/40                   |
| invested_in | ~89%   | 92.0%  | 69/75                   |
| Overall     | 88.5%  | 95.7%  | ✓ (+7.2 pts)            |

New contributor surface

From eval/README.md:

# 5-minute quickstart
bun run eval:run              # full 4-adapter N=5 benchmark (~15 min)
bun run eval:run:dev          # N=1 fast iteration
bun run eval:type-accuracy    # per-link-type report
bun run eval:world:view       # render + open world.html (magical moment)

# Query authoring (Tier 5.5)
bun run eval:query:new --tier externally-authored --author "@me"
bun run eval:query:validate path/to/queries.json

Three contributor paths documented in eval/CONTRIBUTING.md:

  1. Write Tier 5.5 queries (~5 min clone to PR)
  2. Submit an external adapter (~30 min for simple adapter)
  3. Reproduce a scorecard (~15 min wall clock)

eval/RUNBOOK.md covers generation failures, runner failures, query validation errors, world.html rendering issues, CI failures. Self-serve operational troubleshooting.

Plan→reality deltas (honest notes)

  • Tier 5.5 queries are AI-authored placeholders (author: "synthetic-outsider-v1"). Real outside researcher submissions supersede them via PRs to eval/external-authors/<handle>/queries.json. The scaffolding ships; the human content follows.
  • N=5 tolerance bands default with BRAINBENCH_N=1 override for dev. Current adapters are deterministic → stddev=0 (signal, not bug). LLM-judge metrics in v1.2 will produce non-zero bands.
  • world.html is gitignored (~1MB for 240 entities). Regenerate with bun run eval:world:view. Fast (~50ms).
  • Hand-rolled validator, not zod. Matches existing gbrain style (see src/core/yaml-lite.ts). Saves a dep.

Next (follow-up PRs, not this one)

  • LLM-judge-based metrics (personalization scoring, citation accuracy)
  • Real external researcher submissions for Tier 5.5
  • Competitor adapters (mem0, supermemory, Letta, Cognee) — framework ready
  • Scale up to 2K corpus (v1.1.x)

🤖 Generated with Claude Code

garrytan and others added 8 commits April 18, 2026 23:21
…ch prose

Extends inferLinkType patterns to cover rich-prose phrasings that miss with
v0.10.4 regexes. Targets the residuals called out in TODOS.md: works_at at
58% type accuracy, advises at 41%.

WORKS_AT_RE additions:
- Rank-prefixed: "senior engineer at", "staff engineer at", "principal/lead"
- Discipline-prefixed: "backend/frontend/full-stack/ML/data/security engineer at"
- Possessive time: "his/her/their/my time at"
- Leadership beyond "leads engineering": "heads up X at", "manages engineering at",
  "runs product at", "leads the [team] at"
- Role nouns: "role at", "position at", "tenure as", "stint as"
- Promotion patterns: "promoted to staff/senior/principal at"

ADVISES_RE additions:
- Advisory capacity: "in an advisory capacity", "advisory engagement/partnership/contract"
- "as an advisor": "joined as an advisor", "serves as technical advisor"
- Prefixed advisor nouns: "strategic/technical/security/product/industry advisor to|at"
- Consulting: "consults for", "consulting role at|with"

New EMPLOYEE_ROLE_RE page-level prior: fires when the page describes the subject
as an employee (senior/staff/principal engineer, director, VP, CTO/CEO/CFO) at
some company. Biases outbound company refs toward works_at when per-edge context
is possessive or narrative without an explicit work verb. Scoped to person -> company
links only. Precedence: investor > advisor > employee (investors often hold board
seats which would otherwise mis-classify as advise/works_at).
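The investor > advisor > employee precedence above could be sketched as follows (hypothetical function and flag names; the shipped regex-prior logic in src/core is more involved):

```typescript
type LinkType = "invested_in" | "advises" | "works_at" | "mentions";

// Sketch of the stated precedence: page-level role priors are consulted in
// order investor > advisor > employee, so an investor who also holds a board
// seat classifies as invested_in rather than advises/works_at.
function resolveRolePrior(flags: {
  investor: boolean;
  advisor: boolean;
  employee: boolean;
}): LinkType {
  if (flags.investor) return "invested_in";
  if (flags.advisor) return "advises";
  if (flags.employee) return "works_at";
  return "mentions";
}
```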

ADVISOR_ROLE_RE broadened from "full-time/professional/advises multiple" to catch
any page that self-identifies the subject as an advisor ("is an advisor",
"serves as advisor", possessive "her advisory work/role/engagement").

Tests: 65 pass (16 new v0.10.5 coverage tests + 4 regression guards against
v0.10.4 tightenings). Templated benchmark still 88.9% type_accuracy (10/10 on
works_at and advises). Rich-prose measurement requires the multi-axis report
upgrade (next commit) to validate retroactively.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New Category 2 in BrainBench: per-link-type accuracy measured directly on the
240-page rich-prose world-v1 corpus. Distinct from Cat 1's retrieval metrics,
this measures whether inferLinkType() correctly classifies extracted edges
when the prose varies (the 58% works_at and 41% advises residuals that v0.10.5
regexes targeted).

How it works:
  1. Loads all pages from eval/data/world-v1/
  2. Derives GOLD expected edges from each page's _facts metadata
     (founders → founded, investors → invested_in, advisors → advises,
      employees → works_at, attendees → attended, primary_affiliation +
      role drives person-page outbound type)
  3. Runs extractPageLinks() on each page → INFERRED edges
  4. Per (from, to) pair, compares inferred type vs gold type
  5. Emits per-link-type table: correct / mistyped / missed / spurious +
     type accuracy + recall + precision + strict F1 (triple match)
  6. Full confusion matrix rows=gold, cols=inferred
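Steps 4–5 above amount to a per-pair tally; a minimal sketch under the assumption that edges are keyed by (from, to) (names hypothetical, not the actual type-accuracy runner):

```typescript
interface Edge { from: string; to: string; type: string }

// Per (from, to) pair: matching type = correct, differing type = mistyped,
// gold-only = missed, inferred-only = spurious. Type accuracy is conditional
// on the edge being found at all, matching the "Overall" row semantics.
function tally(gold: Edge[], inferred: Edge[]) {
  const key = (e: Edge) => `${e.from}->${e.to}`;
  const inf = new Map(inferred.map((e) => [key(e), e.type]));
  let correct = 0, mistyped = 0, missed = 0;
  for (const g of gold) {
    const t = inf.get(key(g));
    if (t === undefined) missed++;
    else if (t === g.type) correct++;
    else mistyped++;
    inf.delete(key(g));
  }
  const spurious = inf.size; // inferred edges with no gold counterpart
  return {
    correct, mistyped, missed, spurious,
    typeAccuracy: correct / Math.max(1, correct + mistyped),
  };
}
```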

v0.10.5 validation on 240-page corpus (up from pre-v0.10.5 baselines):
  - works_at:    58%  → 100.0%   (+42 pts) — 10/10 correct, 0 mistyped
  - advises:     41%  → 88.2%    (+47 pts) — 15/17 correct
  - attended:    —    → 100.0%   131/134 recall
  - founded:    100%  → 100.0%   40/40
  - invested_in: 89%  → 92.0%    69/75
  - Overall:    88.5% → 95.7%    type accuracy (conditional on edge found)

Strict F1 overall: 53.7%. Lower because the _facts-based gold set only
captures core relationships; rich prose extracts many peripheral mentions
(190 spurious "mentions" edges) that aren't bugs but are correctly-typed
prose references without a _facts counterpart. Spurious counts are signal
for future type-precision tuning, not failure.

Wired into eval/runner/all.ts as Cat 2 so every full benchmark run includes
the rich-prose type accuracy table alongside retrieval metrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 credibility unlock: BrainBench now compares gbrain to external
baselines on the same corpus and queries. Transforms the benchmark from
internal ablation ("gbrain-graph beats gbrain-grep") to category comparison
("gbrain-graph beats classic BM25 by 32 pts P@5"). This is the #1 fix
from the 4-review arc — addresses Codex's core critique that v1's
before/after was self-referential.

Added:
  eval/runner/types.ts                      — Adapter interface (v1.1 spec)
  eval/runner/adapters/ripgrep-bm25.ts      — EXT-1 classic IR baseline
  eval/runner/adapters/ripgrep-bm25.test.ts — 11 unit tests, all pass
  eval/runner/multi-adapter.ts              — side-by-side scorer

Adapter interface (eng pass 2 spec):
  - Thin 3-method Strategy: init(rawPages, config), query(q, state), snapshot(state)
  - BrainState is opaque to runner (never inspected)
  - Raw pages passed in-memory; gold/ never crosses adapter boundary
    (structural ingestion-boundary enforcement)
  - PoisonDisposition enum reserved for future poison-resistance scoring

EXT-1 ripgrep+BM25:
  - Classic Lucene-variant IDF + k1/b tuned at standard 1.5/0.75
  - Title tokens double-weighted for entity-page slug-match bias
  - Stopword filter, alphanumeric tokenization, stable lexicographic tie-break
  - Pure in-memory inverted index — no external deps, ~100 LOC core
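The core per-term scoring described above (Lucene-style IDF, k1=1.5, b=0.75) reduces to the standard BM25 formula; a sketch, not the adapter's exact code, which adds stopwords, title double-weighting, and tie-breaks:

```typescript
// Standard BM25 term score with Lucene-style smoothed IDF.
function bm25Score(
  tf: number,     // term frequency in the document
  df: number,     // number of documents containing the term
  N: number,      // total documents in the index
  dl: number,     // document length in tokens
  avgdl: number,  // mean document length
  k1 = 1.5,
  b = 0.75,
): number {
  const idf = Math.log(1 + (N - df + 0.5) / (df + 0.5));
  return (idf * (tf * (k1 + 1))) / (tf + k1 * (1 - b + b * (dl / avgdl)));
}
```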

First side-by-side results on 240-page rich-prose corpus, 145 relational queries:

| Adapter       | P@5    | R@5    | Correct top-5 |
|---------------|--------|--------|---------------|
| gbrain-after  | 49.1%  | 97.9%  | 248/261       |
| ripgrep-bm25  | 17.1%  | 62.4%  | 124/261       |
| Delta         | +32.0  | +35.5  | +124          |

gbrain-after is the hybrid graph+grep config from PR #188. Ripgrep+BM25 is
a genuinely strong classic-IR baseline (BM25 is what Lucene/Elasticsearch
ship). gbrain's ~+32-point lead on relational queries reflects real work
by the knowledge graph layer: typed links + traversePaths surface the
correct answers in top-K that BM25 only pulls in via partial-text overlap.

Next in Phase 2: EXT-2 vector-only RAG + EXT-3 hybrid-without-graph
adapters. Both plug into the same Adapter interface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second external baseline for BrainBench. Pure cosine-similarity ranking
using the SAME text-embedding-3-large model gbrain uses internally —
apples-to-apples on the embedding layer so any gbrain lead reflects the
graph + hybrid fusion, not a better embedder.

Files:
  eval/runner/adapters/vector-only.ts      ~130 LOC
  eval/runner/adapters/vector-only.test.ts 6 unit tests (cosine math)

Design:
  - One vector per page (title + compiled_truth + timeline, capped 8K chars).
  - No chunking (intentional; chunked vector RAG would be EXT-2b later).
  - No keyword fallback (that's EXT-3 hybrid-without-graph).
  - Embeddings in batches of 50 via existing src/core/embedding.ts (retry+backoff).
  - Cost on 240 pages: ~$0.02/run.
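The ranking itself is plain cosine similarity over the per-page vectors; the standard formula, sketched (not the adapter's exact code):

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```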

Three-adapter side-by-side on 240-page rich-prose corpus, 145 relational queries:

| Adapter       | P@5    | R@5    | Correct top-5 |
|---------------|--------|--------|---------------|
| gbrain-after  | 49.1%  | 97.9%  | 248/261       |
| ripgrep-bm25  | 17.1%  | 62.4%  | 124/261       |
| vector-only   | 10.8%  | 40.7%  |  78/261       |

Interesting finding: vector-only scores WORSE than BM25 on relational queries
like "Who invested in X?" — exact entity match matters more than semantic
similarity for these templates. BM25 nails the entity-name term; vector-only
returns topically-similar-but-not-mentioning pages. This is the known failure
mode of pure-vector RAG on precise relational/identity queries. Real-world
vector RAG systems always add keyword fallback; EXT-3 (hybrid-without-graph)
will be that fairer comparator.

gbrain's lead widens in vector-only comparison: +38.4 pts P@5, +57.2 pts R@5.
The graph layer is doing the heavy lifting for relational traversal; pure
vector RAG can't express "traverse 'attended' edges from this meeting page."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third and closest-to-gbrain external baseline. Runs gbrain's full hybrid
search (vector + keyword + RRF fusion + dedup) WITHOUT the knowledge-graph
layer. Same engine, same embedder, same chunking, same hybrid fusion —
only traversePaths + typed-link extraction turned off.

This is the decisive comparator for "does the knowledge graph do useful
work?" Same everything-else, only graph differs. Any lead gbrain-after has
over EXT-3 is 100% attributable to the graph layer.

Files:
  eval/runner/adapters/hybrid-nograph.ts   — ~110 LOC

Implementation:
  - New PGLiteEngine per run; auto_link set to 'false' (belt).
  - importFromContent() used instead of bare putPage() so chunks +
    embeddings get populated (hybridSearch needs them).
  - NO runExtract() call — typed links/timeline stay empty (suspenders).
  - hybridSearch(engine, q.text) answers every query. Aggregate chunks
    to page-level by best chunk score.
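The "aggregate chunks to page-level by best chunk score" step could look like this (a sketch with hypothetical names, not the shipped hybrid-nograph.ts code):

```typescript
interface ChunkHit { pageSlug: string; score: number }

// Collapse chunk hits to one entry per page, keeping each page's best
// chunk score, then rank pages by that score descending.
function bestChunkPerPage(hits: ChunkHit[]): Array<[string, number]> {
  const best = new Map<string, number>();
  for (const h of hits) {
    const prev = best.get(h.pageSlug);
    if (prev === undefined || h.score > prev) best.set(h.pageSlug, h.score);
  }
  return [...best.entries()].sort((x, y) => y[1] - x[1]);
}
```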

FOUR-adapter side-by-side on 240-page rich-prose corpus, 145 relational queries:

| Adapter         | P@5    | R@5    | Correct/Gold |
|-----------------|--------|--------|--------------|
| gbrain-after    | 49.1%  | 97.9%  | 248/261      |
| hybrid-nograph  | 17.8%  | 65.1%  | 129/261      |
| ripgrep-bm25    | 17.1%  | 62.4%  | 124/261      |
| vector-only     | 10.8%  | 40.7%  |  78/261      |

The headline delta nobody can hand-wave away:
  gbrain-after → hybrid-nograph  = +31.3 P@5, +32.8 R@5
  hybrid-nograph → ripgrep-bm25  = +0.7 P@5,  +2.7 R@5

Hybrid search (vector+keyword+RRF) over pure BM25 gains ~1 point. The
knowledge graph layer over hybrid gains ~31 points. The graph is doing
the work; adding it to a retrieval stack is what actually moves the needle
on relational queries. The vector/keyword/BM25 debate is a footnote.

Timing: hybrid-nograph init is ~2 min (embeds 240 pages once); the query
loop is fast. gbrain-after finishes in ~1.5s total because traversePaths
doesn't need embeddings. Embedding cost is ~$0.02 per run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ic + N=5 tolerance bands

Closes multiple Phase 2 items in one commit since they form a cohesive
package: query schema enforcement + new query tiers + per-query-set
statistical rigor.

Added:
  eval/runner/queries/validator.ts               — hand-rolled Query schema validator
  eval/runner/queries/validator.test.ts          — 24 unit tests, all pass
  eval/runner/queries/tier5-fuzzy.ts             — 30 hand-authored Tier 5 Fuzzy/Vibe queries
  eval/runner/queries/tier5_5-synthetic.ts       — 50 SYNTHETIC-labeled outsider-style queries (author: "synthetic-outsider-v1")
  eval/runner/queries/index.ts                   — aggregator + validateAll()

Modified:
  eval/runner/multi-adapter.ts                   — N=5 runs per adapter (BRAINBENCH_N override), page-order shuffle, mean±stddev reporting

Query validator (hand-rolled, no zod dep to match gbrain codebase style):
  - Temporal verb regex enforces as_of_date (per eng pass 2 spec):
    /\\b(is|was|were|current|now|at the time|during|as of|when did)\\b/i
  - Validates tier enum, expected_output_type enum, gold shape per type
  - gold.relevant must be non-empty slug[] for cited-source-pages queries
  - abstention requires gold.expected_abstention === true
  - externally-authored tier requires author field
  - batch validation catches duplicate IDs
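The temporal rule above can be sketched as a single check (field names assumed from the commit message; the real validator covers tiers, gold shapes, and duplicate IDs too):

```typescript
// Queries whose text matches a temporal verb must carry as_of_date.
const TEMPORAL_RE =
  /\b(is|was|were|current|now|at the time|during|as of|when did)\b/i;

function checkTemporal(q: { text: string; as_of_date?: string }): string[] {
  const errors: string[] = [];
  if (TEMPORAL_RE.test(q.text) && !q.as_of_date) {
    errors.push(`query "${q.text}" uses a temporal verb but has no as_of_date`);
  }
  return errors;
}
```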

Tier 5 Fuzzy/Vibe (30 queries, hand-authored):
  - Vague recall: "Someone who was a senior engineer at a biotech company..."
  - Trait-based: "The engineer who pushed back on microservices"
  - Cultural/epithet: "Who is known as a 'systems builder' in security?"
  - Abstention bait: "Which Layer 1 project did the crypto guy leave?" (prose
    mentions but never names; good systems abstain)
  - Addresses Codex's circularity critique — vague queries where graph-heavy
    systems shouldn't inherently win.

Tier 5.5 Synthetic Outsider (50 queries, AI-authored placeholder):
  - Clearly labeled author: "synthetic-outsider-v1"
  - Phrasing variety not in the 4 template families:
    * fragment style ("crypto founder Goldman Sachs background")
    * polite/natural ("Can you pull up what we have on...")
    * comparison ("What is the difference between X and Y?")
    * follow-up ("And who else advises Orbit Labs?")
    * typos/misspellings ("adam lopez bioinformatcis")
    * similarity ("Find me someone like Alice Davis...")
    * imperative ("Pull up Alice Davis")
  - Real Tier 5.5 from outside researchers supersedes synthetic via
    PRs to eval/external-authors/ (docs ship in follow-up commit).

N=5 tolerance bands:
  - Default N=5, override via BRAINBENCH_N env var (e.g. BRAINBENCH_N=1 for dev loops)
  - Per-run seeded Fisher-Yates shuffle of page ingest order (LCG seed = run_idx+1)
  - Surfaces order-dependent adapter bugs (tie-break-by-first-seen etc.)
  - Reports mean ± sample-stddev per metric
  - "stddev = 0" is honest signal that the adapter is deterministic, not a bug.
    LLM-judge metrics (future) will naturally produce non-zero stddev.
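The seeded shuffle above could be sketched as an LCG-driven Fisher-Yates pass (the LCG constants here are an assumption; the commit only specifies seed = run_idx+1):

```typescript
// Deterministic page-order shuffle: a 32-bit LCG (Numerical Recipes
// constants, assumed) feeds a standard Fisher-Yates pass. Same seed,
// same permutation, so N-run results are reproducible.
function seededShuffle<T>(items: T[], seed: number): T[] {
  let s = seed >>> 0;
  const next = () => ((s = (s * 1664525 + 1013904223) >>> 0) / 2 ** 32);
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(next() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```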

Validation: all 80 Tier 5 + 5.5 queries pass validateAll(). 24 validator
unit tests pass.

Next commit: world.html contributor explorer (Phase 3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor DX magical moment. Static HTML explorer renders the full
canonical world (240 entities) as an explorable tree, opens in any browser,
zero install. Every string HTML-entity-encoded (XSS-safe — direct vuln
class per eng pass 2, confidence 9/10).

Added:
  eval/generators/world-html.ts         — renderer (~240 LOC; single-file
                                          HTML with inline CSS + minimal JS)
  eval/generators/world-html.test.ts    — 16 tests (XSS + rendering correctness)
  eval/cli/world-view.ts                — render + open in default browser
  eval/cli/query-validate.ts            — CLI wrapper for queries/validator
  eval/cli/query-new.ts                 — scaffold a query template

Modified:
  package.json                          — 7 new eval:* scripts
  .gitignore                            — ignore generated world.html

package.json scripts shipped:
  bun run test:eval                 all eval unit tests (57 pass)
  bun run eval:run                  full 4-adapter N=5 side-by-side
  bun run eval:run:dev              N=1 fast dev iteration
  bun run eval:world:view           render world.html + open in browser
  bun run eval:world:render         render only (CI-friendly, --no-open)
  bun run eval:query:validate       validate built-in T5+T5.5 (or a file path)
  bun run eval:query:new            scaffold a new Query JSON template
  bun run eval:type-accuracy        per-link-type accuracy report

XSS safety:
  escapeHtml() encodes the 5 critical chars (& < > " '). Tested directly
  with representative Opus-generated attacks:
    <img src=x onerror=alert('xss')>  → &lt;img src=x onerror=alert(&#39;xss&#39;)&gt;
    <script>fetch('/steal')</script>  → &lt;script&gt;fetch(&#39;/steal&#39;)&lt;/script&gt;
  Ledger metadata (generated_at, model) also escaped — covers the less
  obvious attack surface where Opus could emit tag-like content into the
  metadata file.
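An escapeHtml covering the 5 critical characters as described would look roughly like this (a sketch; the shipped eval/generators/world-html.ts version may differ in detail):

```typescript
// Entity-encode the 5 characters that matter for HTML injection.
// Ampersand must go first so already-encoded output isn't double-escaped.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```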

world.html structure:
  - Left rail: entities grouped by type with counts (companies, people,
    meetings, concepts), alphabetical within type
  - Right pane: per-entity cards with title + slug + compiled_truth +
    timeline + canonical _facts as collapsed JSON
  - URL fragment deep-links (#people/alice-chen)
  - Sticky rail on desktop; responsive stack on mobile
  - Vanilla JS for active-link highlighting on scroll (no framework)

Generated file: ~1MB for 240 entities (full prose). Gitignored; rebuild
with `bun run eval:world:view`. Regeneration is ~50ms.

Contributor TTHW (Tier 5.5 query authoring):
  1. bun run eval:world:view                         # see entities
  2. bun run eval:query:new --tier externally-authored --author "@me"
  3. edit template with real slug + query text
  4. bun run eval:query:validate path/to/file.json
  5. submit PR

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships the contributor-onboarding surface promised in the plan. With this
commit, external researchers have a self-serve path from clone to PR in
under 5 minutes.

Added:
  eval/README.md                                — 5-minute quickstart,
                                                  directory map, methodology
                                                  one-pager, adapter scorecard
  eval/CONTRIBUTING.md                          — three contributor paths:
                                                    1. Write Tier 5.5 queries
                                                    2. Submit an external adapter
                                                    3. Reproduce a scorecard
  eval/RUNBOOK.md                               — operational troubleshooting:
                                                  generation failures, runner
                                                  failures, query validation,
                                                  world.html rendering, CI
  eval/CREDITS.md                               — contributor attribution
                                                  (synthetic-outsider-v1 labeled
                                                  as placeholder; real submissions
                                                  land here)
  .github/PULL_REQUEST_TEMPLATE/tier5-queries.md — structured PR template
                                                  for Tier 5.5 submissions
  .github/workflows/eval-tests.yml              — CI: validates queries,
                                                  runs all eval unit tests,
                                                  renders world.html on every PR
                                                  touching eval/** or
                                                  src/core/link-extraction.ts

CI scope (intentionally narrow):
  - Triggers on paths: eval/**, src/core/link-extraction.ts, src/core/search/**
  - Runs: bun run eval:query:validate (80 queries), test:eval (57 tests),
          eval:world:render (smoke-test the HTML renderer)
  - Pinned actions by commit SHA (matches existing .github/workflows/test.yml)
  - Zero API calls — all Opus/OpenAI paths stubbed or skipped in unit tests
  - Fast: ~30s total wall clock

Contributor TTHW (clone → first merged PR):
  - Path 1 (Tier 5.5 queries): ~5 min
  - Path 2 (external adapter): ~30 min for a simple adapter
  - Path 3 (reproduce scorecard): ~15 min wall clock (N=5 run)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan changed the title BrainBench v1.1: v0.10.5 extraction + Phase 2 external baselines (gbrain beats all 3 by 32 pts) BrainBench v1.1: extraction fixes + 3 external baselines + N=5 + Tier 5/5.5 + world.html + contributor docs (all 3 phases) Apr 18, 2026
The multi-adapter runner left PGLite engines alive after each run.
GbrainAfterAdapter and HybridNoGraphAdapter both instantiate a
PGLiteEngine in init() but never disconnect it; Bun's shutdown path
exits with code 99 when embedded-Postgres workers outlive main().

Added optional `teardown?(state)` to the Adapter interface, implemented
it on both engine-backed adapters, and call it from scoreOneRun after
the N=5 loop. ripgrep-bm25 and vector-only hold no DB resources and
don't need a teardown.
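The optional teardown hook could be wired as follows (hypothetical minimal interface and runner names, sketched synchronously; the real hook presumably awaits PGLite disconnect):

```typescript
// Adapters may expose teardown; the runner calls it after the run loop so
// engine-backed adapters can release DB resources and the process exits 0.
interface TeardownAdapter<State> {
  init(pages: unknown[]): State;
  query(q: string, state: State): string[];
  teardown?(state: State): void;
}

function runAndClean<State>(
  a: TeardownAdapter<State>,
  pages: unknown[],
  qs: string[],
): string[][] {
  const state = a.init(pages);
  try {
    return qs.map((q) => a.query(q, state));
  } finally {
    a.teardown?.(state); // no-op for adapters holding no resources
  }
}
```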

Verified: gbrain-after, hybrid-nograph, ripgrep-bm25, vector-only all
exit 0 at N=1. Full test:eval passes (57 tests). No metric change.
Reproducibility run of the 4-adapter side-by-side at commit b81373d
(branch garrytan/gbrain-evals). N=5, 240-page corpus, 145 relational
queries from world-v1.

Headline: gbrain-after 49.1% P@5 / 97.9% R@5. hybrid-nograph 17.8% /
65.1%. ripgrep-bm25 17.1% / 62.4%. vector-only 10.8% / 40.7%. All
adapters deterministic (stddev = 0 across the 5 runs per adapter).

Matches the scorecard in eval/README.md byte-for-byte for the three
deterministic adapters; hybrid-nograph matches within tolerance bands.
Runs the same eval harness against two gbrain src/ trees on the same
240-page corpus and 145 queries. Patches the v0.11 copy's gbrain-after
adapter to use getLinks/getBacklinks (v0.11 has no traversePaths)
with identical direction+linkType semantics.

gbrain-after P@5 22.1% -> 49.1% (+27 pts); R@5 54.6% -> 97.9% (+43
pts); correct-in-top-5 99 -> 248 (+149). hybrid-nograph flat at 17.8%
/ 65.1% on both (v0.12 didn't touch hybridSearch / chunking).

Driver is extraction quality, not graph presence: v0.12 emits 499
typed links (v0.11: 136, 3.7x) and 2,208 timeline entries (v0.11: 27,
82x) on the same 240 pages. This sharpens the April-18 "graph layer does
the work" claim: on v0.11, that architecture only beat hybrid-nograph
by 4.3 points; the 31-point lead in the multi-adapter scorecard comes
from graph plus high-quality extraction in combination.