fix: JSONB double-encode + splitBody wiki + parseEmbedding (v0.12.1)#196

Merged
garrytan merged 7 commits into master from garrytan/jsonb-hotfix
Apr 18, 2026

@garrytan
Owner

Summary

Data-correctness hotfix for v0.12.0 Postgres-backed brains. PGLite users were unaffected. Bundles community PRs #187 (@knee5) and #175 (@leonardsellem) with expanded migration scope, schema audit (5 affected JSONB columns vs 3 originally reported), CI grep guard, and an E2E regression test that should have caught the original bug.

JSONB double-encode (Postgres only). Every ${JSON.stringify(value)}::jsonb interpolation in postgres-engine.ts and files.ts caused postgres.js v3 to stringify again on the wire, storing JSONB columns as quoted string literals. Every frontmatter->>'key' returned NULL on Postgres-backed brains. GIN indexes were inert. Switched to sql.json(value) (postgres.js native, OID 3802). Affected columns: pages.frontmatter, raw_data.data, ingest_log.pages_updated, files.metadata, page_versions.frontmatter. PGLite hid this bug entirely — different driver path.
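The failure mode can be reproduced without a database. The sketch below is a simulation only: `encodeOnWire` is a hypothetical stand-in for a driver's serialization step, not postgres.js code, and the sample `frontmatter` object is made up for illustration.

```typescript
// Hypothetical stand-in for a driver that JSON-encodes JSONB parameters once on the wire.
function encodeOnWire(param: unknown): string {
  return JSON.stringify(param);
}

const frontmatter = { title: "Hello", tags: ["a", "b"] };

// Buggy pattern: the caller stringifies first, then the value is encoded again.
const doubleEncoded = encodeOnWire(JSON.stringify(frontmatter));

// Fixed pattern: pass the raw value and let it be encoded exactly once.
const singleEncoded = encodeOnWire(frontmatter);

// What a JSON parser on the receiving side would reconstruct.
const storedBuggy: unknown = JSON.parse(doubleEncoded);
const storedFixed: unknown = JSON.parse(singleEncoded);

console.log(typeof storedBuggy); // "string": a quoted literal with no keys to look up
console.log(typeof storedFixed); // "object": key lookups behave as expected
```

The first value is valid JSON, which is why nothing fails at write time; it is simply a JSON string rather than a JSON object, so key lookups come back empty.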

splitBody truncation. Treated any standalone --- as timeline separator, causing 83% content truncation on wiki corpora (1,991-article wiki, 4,856 of 6,680 wikilinks lost). New behavior requires explicit sentinel: <!-- timeline -->, --- timeline ---, or --- directly before ## Timeline / ## History heading.
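A minimal sketch of the sentinel contract, assuming a line-based scan. `splitBodySketch` and its `{ body, timeline }` return shape are hypothetical and not the actual `splitBody` signature; tolerating blank lines between `---` and the heading is also an assumption.

```typescript
interface SplitResult {
  body: string;     // content before the sentinel
  timeline: string; // content after the sentinel ("" when no sentinel is present)
}

const TIMELINE_HEADING = /^##\s+(Timeline|History)\s*$/;

function splitBodySketch(text: string): SplitResult {
  const lines = text.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const line = (lines[i] ?? "").trim();
    const explicit = line === "<!-- timeline -->" || line === "--- timeline ---";

    let ruleBeforeHeading = false;
    if (line === "---") {
      // Assumption: blank lines between the rule and the heading are allowed.
      let j = i + 1;
      while (j < lines.length && (lines[j] ?? "").trim() === "") j++;
      ruleBeforeHeading = TIMELINE_HEADING.test((lines[j] ?? "").trim());
    }

    if (explicit || ruleBeforeHeading) {
      return {
        body: lines.slice(0, i).join("\n").trim(),
        timeline: lines.slice(i + 1).join("\n").trim(),
      };
    }
  }
  // A bare --- is treated as a horizontal rule, so nothing is cut off.
  return { body: text.trim(), timeline: "" };
}

const wikiArticle = "Intro\n\n---\n\nMore article text";
const withSentinel = "Intro\n\n<!-- timeline -->\n\n- 2026-04-18: shipped";

console.log(splitBodySketch(wikiArticle).timeline === ""); // true
console.log(splitBodySketch(withSentinel).timeline);       // "- 2026-04-18: shipped"
```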

inferType wiki subtypes. Added /writing/, /wiki/analysis/, /wiki/guides/, /wiki/hardware/, /wiki/architecture/, /wiki/concepts/. Path order is most-specific-first so projects/blog/writing/essay.md → writing, not project.
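The ordering rule can be illustrated with a first-match-wins table. This is a sketch under assumptions: `inferTypeSketch`, the `RULES` table, the substring matching, the `"concept"` type name, and the `null` fallback are all hypothetical and may differ from the real `inferType`.

```typescript
type PageTypeSketch =
  | "writing" | "analysis" | "guide" | "hardware"
  | "architecture" | "concept" | "project";

// Ordered most-specific-first: the first matching rule wins.
const RULES: ReadonlyArray<readonly [string, PageTypeSketch]> = [
  ["/wiki/analysis/", "analysis"],
  ["/wiki/guides/", "guide"],
  ["/wiki/hardware/", "hardware"],
  ["/wiki/architecture/", "architecture"],
  ["/wiki/concepts/", "concept"],
  ["/writing/", "writing"],
  ["/projects/", "project"], // broad rule, so it is checked last
];

function inferTypeSketch(path: string): PageTypeSketch | null {
  // Prefix a slash so a leading "projects/..." still matches "/projects/".
  const normalized = "/" + path.replace(/^\/+/, "");
  for (const [segment, type] of RULES) {
    if (normalized.includes(segment)) return type;
  }
  return null; // assumption: the real function likely has its own default
}

console.log(inferTypeSketch("projects/blog/writing/essay.md")); // "writing"
console.log(inferTypeSketch("projects/roadmap.md"));            // "project"
```

If the `/projects/` rule were listed first, the essay path would be classified as a project, which is the misclassification the ordering avoids.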

pgvector NaN scores (Supabase). getEmbeddingsByChunkIds returned strings instead of Float32Array on Supabase, producing [NaN] cosine scores. Adds parseEmbedding() helper. Throws loud on malformed vectors (no silent NaN); returns null for non-vector strings.
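A sketch of the parsing contract described above. `parseEmbeddingSketch` is a hypothetical name, and the boundary between "non-vector string" (returns null) and "malformed vector" (throws) is an assumption here: a string counts as a vector only if it is wrapped in square brackets.

```typescript
function toFloat32(parts: readonly unknown[]): Float32Array {
  const out = new Float32Array(parts.length);
  parts.forEach((part, i) => {
    const text = String(part).trim();
    // Number("") is 0, so empty components are rejected explicitly.
    const n = typeof part === "number" ? part : text === "" ? NaN : Number(text);
    if (!Number.isFinite(n)) {
      throw new Error(`malformed vector component at index ${i}: ${text}`);
    }
    out[i] = n;
  });
  return out;
}

function parseEmbeddingSketch(value: unknown): Float32Array | null {
  if (value === null || value === undefined) return null;
  if (value instanceof Float32Array) return value; // passthrough
  if (Array.isArray(value)) return toFloat32(value);
  if (typeof value === "string") {
    const s = value.trim();
    // Assumption: only bracketed strings are treated as pgvector text.
    if (!s.startsWith("[") || !s.endsWith("]")) return null;
    return toFloat32(s.slice(1, -1).split(","));
  }
  throw new Error(`unsupported embedding type: ${typeof value}`);
}

const parsed = parseEmbeddingSketch("[0.5,1,-2]");
console.log(parsed?.length); // 3
```

Throwing on a bad component, rather than letting `NaN` into the array, is what keeps a single corrupt row from silently poisoning every cosine score computed against it.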

Wikilink extraction. [[page]] and [[page|Display]] syntaxes now extracted alongside standard [text](page.md). resolveSlug() does ancestor-search for wiki KBs that omit ../.
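For reference, the two wikilink forms can be matched with a single regular expression. This is an illustrative sketch: `extractWikilinksSketch` and the `WikiLink` shape are hypothetical and not the project's extractor.

```typescript
interface WikiLink {
  target: string;         // the page part of [[page]] or [[page|Display]]
  display: string | null; // the text after "|", if any
}

function extractWikilinksSketch(markdown: string): WikiLink[] {
  // [[target]] or [[target|display]]; brackets may not appear inside either part.
  const pattern = /\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]/g;
  const links: WikiLink[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(markdown)) !== null) {
    links.push({
      target: (match[1] ?? "").trim(),
      display: match[2] !== undefined ? match[2].trim() : null,
    });
  }
  return links;
}

const found = extractWikilinksSketch("See [[page]] and [[other/page|Display Text]].");
console.log(found.length); // 2
```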

Migration. New gbrain repair-jsonb command + v0_12_1 orchestrator (schema → repair → verify → record). Idempotent — re-running is a no-op. PGLite engines short-circuit cleanly.
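The idempotency claim follows from the shape of the repair: only values that are still JSON strings are touched. Below is an in-memory analogue, not the migration code. `repairRows` is a hypothetical helper, and it assumes every string-typed value is a double-encoded document (a legitimate plain-string value would make `JSON.parse` throw).

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Mirrors the idea of rewriting rows where jsonb_typeof(col) = 'string'
// via (col #>> '{}')::jsonb, but over a plain array instead of a table.
function repairRows(rows: Json[]): { repaired: number; rows: Json[] } {
  let repaired = 0;
  const out = rows.map((value): Json => {
    if (typeof value !== "string") return value; // already healthy: leave untouched
    repaired++;
    return JSON.parse(value) as Json; // unwrap the quoted literal
  });
  return { repaired, rows: out };
}

const table: Json[] = ['{"title":"Hello"}', { title: "Already fine" }];

const firstPass = repairRows(table);
const secondPass = repairRows(firstPass.rows);

console.log(firstPass.repaired);  // 1
console.log(secondPass.repaired); // 0, so re-running is a no-op
```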

CI grep guard at scripts/check-jsonb-pattern.sh fails the build if anyone reintroduces the ${JSON.stringify(x)}::jsonb pattern.
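The guard itself is a shell script; the TypeScript sketch below only illustrates the kind of line-based pattern such a check could use. The regular expression, `findViolations`, and the sample SQL lines are assumptions for illustration, not the contents of `check-jsonb-pattern.sh`.

```typescript
// Matches ${JSON.stringify(<anything without a closing paren>)}::jsonb on one line.
const DOUBLE_ENCODE_PATTERN = /\$\{\s*JSON\.stringify\([^)]*\)\s*\}::jsonb/;

// Returns the 1-based line numbers that contain the banned interpolation.
function findViolations(source: string): number[] {
  const hits: number[] = [];
  source.split("\n").forEach((line, index) => {
    if (DOUBLE_ENCODE_PATTERN.test(line)) hits.push(index + 1);
  });
  return hits;
}

// Hypothetical source lines; ordinary strings, so nothing is interpolated here.
const sampleSource = [
  "await sql`INSERT INTO pages (frontmatter) VALUES (${JSON.stringify(fm)}::jsonb)`;",
  "await sql`INSERT INTO pages (frontmatter) VALUES (${sql.json(fm)})`;",
].join("\n");

console.log(findViolations(sampleSource)); // only line 1 is flagged
```

A line-by-line match like this cannot see an interpolation split across lines or hidden behind a helper, which is why a round-trip test against a real database remains the stronger check.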

Test Coverage

[+] src/core/markdown.ts — splitBody/serializeMarkdown/inferType
    └── [★★★] markdown.test.ts: 10 splitBody + 5 round-trip + wiki/writing inferType cases
[+] src/core/postgres-engine.ts — sql.json() at putPage/putRawData/logIngest
    └── [★★★] e2e/postgres-jsonb.test.ts: 4 round-trip assertions on real Postgres
[+] src/core/utils.ts — parseEmbedding helper, rowToChunk delegation
    └── [★★★] utils.test.ts: F32A passthrough, pgvector string, null, throw on garbage
[+] src/commands/extract.ts — wikilink extraction, resolveSlug ancestor search
    └── [★★★] extract.test.ts existing coverage
[+] src/commands/files.ts:254 — sql.json metadata
    └── [★★★] e2e/postgres-jsonb.test.ts: files.metadata round-trip
[+] src/commands/migrations/v0_12_1.ts — JSONB repair orchestrator
    └── [★★★] migrations-v0_12_1.test.ts: registry, dry-run, phase exports
[+] src/commands/repair-jsonb.ts — repair command + PGLite short-circuit
    └── [★★★] repair-jsonb.test.ts (PGLite no-op) + e2e/postgres-jsonb.test.ts (real repair)
[+] scripts/check-jsonb-pattern.sh — CI grep guard
    └── [★★] Manual + wired into bun test

Coverage: 8/8 paths (100%). Tests: 1361 → 1415 (+54 new).

E2E suite: 120/120 pass against pgvector/pg16 Docker container. Unit suite: 1415/1415 pass. CI grep guard passes on this diff (no JSON.stringify(x)::jsonb patterns in src/).

Pre-Landing Review

No new issues found. Specialist review was already covered comprehensively by /plan-eng-review + Codex outside-voice review during planning (25+ findings, 3 material tensions adjudicated). repair-jsonb.ts uses sql.unsafe with table/column names from a hardcoded TARGETS array, so there is no injection vector. Migration is idempotent. parseEmbedding throws loud on malformed input per Codex's no-silent-NaN requirement.

Plan Completion

All 24 planned items DONE. Scope reduced from 9-PR bundle to 2-PR hotfix per Codex outside-voice scope challenge. The remaining 7 PRs (#184, #177, #132, #114, #115, #119, #123) deferred to v0.12.2 follow-up wave per /Users/garrytan/.claude/plans/system-instruction-you-are-working-elegant-squid.md.

TODOS

No items in TODOS.md were specifically completed by this PR (it focused on BrainBench eval work).

Documentation

Documentation was synced to v0.12.1 in commit 998ef82. Six files updated to reflect the JSONB hotfix, splitBody sentinel contract, wiki inferType, and native wikilink extraction.

  • README.md — added gbrain repair-jsonb [--dry-run] to the ADMIN command reference
  • CLAUDE.md — registered new files, documented splitBody sentinel precedence and inferType wiki subtypes
  • INSTALL_FOR_AGENTS.md — fixed stale verification check counts, added v0.12.1 upgrade guidance
  • docs/GBRAIN_VERIFY.md — added check #8 (JSONB Frontmatter Integrity)
  • docs/UPGRADING_DOWNSTREAM_AGENTS.md — added v0.12.1 hotfix section explaining the splitBody contract change
  • skills/migrate/SKILL.md — documented native [[wikilink]] and [[wikilink|Display]] extraction

CHANGELOG.md was left untouched (already comprehensive). VERSION bumped to 0.12.1.

Test plan

  • All unit tests pass (1322/1322)
  • All E2E tests pass against real pgvector (120/120, including new test/e2e/postgres-jsonb.test.ts)
  • CI grep guard passes on current src/
  • PGLite repair-jsonb test confirms no DB connection / 0 rows reported
  • Migration dry-run skips all side-effect phases
  • Manual smoke on a real Postgres-backed brain: gbrain put a page with frontmatter, query frontmatter->>'key' returns the value
  • Manual smoke: gbrain repair-jsonb --dry-run against a brain with double-encoded rows reports the correct count

Attribution

Built on community PRs #187 (@knee5, JSONB triple-fix) and #175 (@leonardsellem, parseEmbedding).

Both PRs reported the bugs and proposed the fixes. Codex outside-voice review during planning surfaced the missed page_versions.frontmatter propagation path, dropped the noisy-truncated-diagnostic anti-pattern from scope, and pushed for the engine-aware migration.

🤖 Generated with Claude Code

garrytan and others added 6 commits April 18, 2026 23:49
- splitBody now requires explicit timeline sentinel (<!-- timeline -->,
  --- timeline ---, or --- directly before ## Timeline / ## History).
  A bare --- in body text is a markdown horizontal rule, not a separator.
  This fixes the 83% content truncation @knee5 reported on a 1,991-article
  wiki where 4,856 of 6,680 wikilinks were lost.

- serializeMarkdown emits <!-- timeline --> sentinel for round-trip stability.

- inferType extended with /writing/, /wiki/analysis/, /wiki/guides/,
  /wiki/hardware/, /wiki/architecture/, /wiki/concepts/. Path order is
  most-specific-first so projects/blog/writing/essay.md → writing,
  not project.

- PageType union extended: writing, analysis, guide, hardware, architecture.

Updates test/import-file.test.ts to use the new sentinel.

Co-Authored-By: @knee5 (PR #187)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two related Postgres-string-typed-data bugs that PGLite hid:

1. JSONB double-encode (postgres-engine.ts:107,668,846 + files.ts:254):
   ${JSON.stringify(value)}::jsonb in postgres.js v3 stringified again
   on the wire, storing JSONB columns as quoted string literals. Every
   frontmatter->>'key' returned NULL on Postgres-backed brains; GIN
   indexes were inert. Switched to sql.json(value), which is the
   postgres.js-native JSONB encoder (Parameter with OID 3802).
   Affected columns: pages.frontmatter, raw_data.data,
   ingest_log.pages_updated, files.metadata. page_versions.frontmatter
   is downstream via INSERT...SELECT and propagates the fix.

2. pgvector embeddings returning as strings (utils.ts):
   getEmbeddingsByChunkIds returned "[0.1,0.2,...]" instead of
   Float32Array on Supabase, producing [NaN] cosine scores.
   Adds parseEmbedding() helper handling Float32Array, numeric arrays,
   and pgvector string format. Throws loud on malformed vectors
   (per Codex's no-silent-NaN requirement); returns null for
   non-vector strings (treated as "no embedding here"). rowToChunk
   delegates to parseEmbedding.

E2E regression test at test/e2e/postgres-jsonb.test.ts asserts
jsonb_typeof = 'object' AND col->>'k' returns expected scalar across
all 5 affected columns — the test that should have caught the original
bug. Runs in CI via the existing pgvector service.

Co-Authored-By: @knee5 (PR #187 — JSONB triple-fix)
Co-Authored-By: @leonardsellem (PR #175 — parseEmbedding)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

extractMarkdownLinks now handles [[page]] and [[page|Display Text]]
alongside standard [text](page.md). For wiki KBs where authors omit
leading ../ (thinking in wiki-root-relative terms), resolveSlug
walks ancestor directories until it finds a matching slug.

Without this, wikilinks under tech/wiki/analysis/ targeting
[[../../finance/wiki/concepts/foo]] silently dangled when the
correct relative depth was 3 × ../ instead of 2.
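The ancestor walk can be sketched as follows, using the example above. `resolveSlugSketch`, `normalizeSlugPath`, and the `known` slug set are hypothetical, and clamping `..` at the root (instead of failing) is an assumption rather than confirmed `resolveSlug` behavior.

```typescript
// Collapse "." and ".." segments; ".." at the root is clamped instead of escaping.
function normalizeSlugPath(path: string): string {
  const out: string[] = [];
  for (const part of path.split("/")) {
    if (part === "" || part === ".") continue;
    if (part === "..") out.pop();
    else out.push(part);
  }
  return out.join("/");
}

function resolveSlugSketch(
  fromSlug: string,
  target: string,
  known: ReadonlySet<string>,
): string | null {
  const dirParts = fromSlug.split("/").slice(0, -1); // directory of the linking page
  // Try the page's own directory first, then each ancestor up to the root.
  for (let depth = dirParts.length; depth >= 0; depth--) {
    const base = dirParts.slice(0, depth).join("/");
    const candidate = normalizeSlugPath(base + "/" + target);
    if (known.has(candidate)) return candidate;
  }
  return null; // still dangling after every ancestor was tried
}

const knownSlugs = new Set(["finance/wiki/concepts/foo"]);
console.log(
  resolveSlugSketch("tech/wiki/analysis/report", "../../finance/wiki/concepts/foo", knownSlugs),
); // "finance/wiki/concepts/foo"
```

Resolved strictly from `tech/wiki/analysis/`, the link points at `tech/finance/wiki/concepts/foo`, which does not exist; retrying from the parent directory `tech/wiki/` lands on the intended slug.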

Co-Authored-By: @knee5 (PR #187)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- New gbrain repair-jsonb command. Detects rows where
  jsonb_typeof(col) = 'string' and rewrites them via
  (col #>> '{}')::jsonb across 5 affected columns:
  pages.frontmatter, raw_data.data, ingest_log.pages_updated,
  files.metadata, page_versions.frontmatter. Idempotent — re-running
  is a no-op. PGLite engines short-circuit cleanly (the bug never
  affected the parameterized encode path PGLite uses). --dry-run
  shows what would be repaired; --json for scripting.

- New v0_12_1.ts migration orchestrator. Phases: schema → repair → verify.
  Modeled on v0_12_0 pattern, registered in migrations/index.ts.
  Runs automatically via gbrain upgrade / apply-migrations.

- CI grep guard at scripts/check-jsonb-pattern.sh fails the build if
  anyone reintroduces the ${JSON.stringify(x)}::jsonb interpolation
  pattern. Wired into bun test via package.json. Best-effort static
  analysis (multi-line and helper-wrapped variants are caught by the
  E2E round-trip test instead).

- Updates apply-migrations.test.ts expectations to account for the new
  v0.12.1 entry in the registry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- CLAUDE.md: document repair-jsonb command, v0_12_1 migration,
  splitBody sentinel contract, inferType wiki subtypes, CI grep
  guard, new test files (repair-jsonb, migrations-v0_12_1, markdown)
- README.md: add gbrain repair-jsonb to ADMIN command reference
- INSTALL_FOR_AGENTS.md: fix verification count (6 -> 7), add
  v0.12.1 upgrade guidance for Postgres brains
- docs/GBRAIN_VERIFY.md: add check #8 for JSONB integrity on
  Postgres-backed brains
- docs/UPGRADING_DOWNSTREAM_AGENTS.md: add v0.12.1 section with
  migration steps, splitBody contract, wiki subtype inference
- skills/migrate/SKILL.md: document native wikilink extraction
  via gbrain extract links (v0.12.1+)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan merged commit c0b6219 into master on Apr 18, 2026
4 checks passed