fix: parse pgvector embeddings before cosine rescoring #175

leonardsellem wants to merge 2 commits into garrytan:master
Conversation
- change health check url to a more stable endpoint (`users/by/username/X`)
- update auth label for clarity
Two related Postgres-string-typed-data bugs that PGLite hid:
1. JSONB double-encode (postgres-engine.ts:107,668,846 + files.ts:254):
${JSON.stringify(value)}::jsonb in postgres.js v3 stringified again
on the wire, storing JSONB columns as quoted string literals. Every
frontmatter->>'key' returned NULL on Postgres-backed brains; GIN
indexes were inert. Switched to sql.json(value), which is the
postgres.js-native JSONB encoder (Parameter with OID 3802).
Affected columns: pages.frontmatter, raw_data.data,
ingest_log.pages_updated, files.metadata. page_versions.frontmatter
is downstream via INSERT...SELECT and propagates the fix.
2. pgvector embeddings returning as strings (utils.ts):
getEmbeddingsByChunkIds returned "[0.1,0.2,...]" instead of
Float32Array on Supabase, producing [NaN] cosine scores.
Adds parseEmbedding() helper handling Float32Array, numeric arrays,
and pgvector string format. Throws loud on malformed vectors
(per Codex's no-silent-NaN requirement); returns null for
non-vector strings (treated as "no embedding here"). rowToChunk
delegates to parseEmbedding.
E2E regression test at test/e2e/postgres-jsonb.test.ts asserts
jsonb_typeof = 'object' AND col->>'k' returns expected scalar across
all 5 affected columns — the test that should have caught the original
bug. Runs in CI via the existing pgvector service.
Co-Authored-By: @knee5 (PR #187 — JSONB triple-fix)
Co-Authored-By: @leonardsellem (PR #175 — parseEmbedding)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hey @leonardsellem — thank you. The Supabase NaN-score bug was a sharp catch. PR #196 (v0.12.1 hotfix) bundles your fix with the JSONB double-encode work, since they're the same Postgres-string-typed-data class of bug. One small adjustment, per Codex's outside-voice review during planning: I re-implemented rather than cherry-picked, because the file overlapped heavily with #187's rowToChunk update — a single coherent commit was cleaner. Co-authorship preserved. Mind if I close this PR in favor of #196? Thanks for the fix.
…196)

* fix: splitBody and inferType for wiki-style markdown content

  - splitBody now requires an explicit timeline sentinel (`<!-- timeline -->`,
    `--- timeline ---`, or `---` directly before `## Timeline` / `## History`).
    A bare `---` in body text is a markdown horizontal rule, not a separator.
    This fixes the 83% content truncation @knee5 reported on a 1,991-article
    wiki where 4,856 of 6,680 wikilinks were lost.
  - serializeMarkdown emits the `<!-- timeline -->` sentinel for round-trip
    stability.
  - inferType extended with /writing/, /wiki/analysis/, /wiki/guides/,
    /wiki/hardware/, /wiki/architecture/, /wiki/concepts/. Path order is
    most-specific-first so projects/blog/writing/essay.md → writing, not
    project.
  - PageType union extended: writing, analysis, guide, hardware, architecture.
  - Updates test/import-file.test.ts to use the new sentinel.

  Co-Authored-By: @knee5 (PR #187)
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: JSONB double-encode bug on Postgres + parseEmbedding NaN scores

  Co-Authored-By: @knee5 (PR #187 — JSONB triple-fix)
  Co-Authored-By: @leonardsellem (PR #175 — parseEmbedding)
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: extract wikilink syntax with ancestor-search slug resolution

  extractMarkdownLinks now handles [[page]] and [[page|Display Text]]
  alongside standard [text](page.md). For wiki KBs where authors omit leading
  ../ (thinking in wiki-root-relative terms), resolveSlug walks ancestor
  directories until it finds a matching slug. Without this, wikilinks under
  tech/wiki/analysis/ targeting [[../../finance/wiki/concepts/foo]] silently
  dangled when the correct relative depth was 3 × ../ instead of 2.

  Co-Authored-By: @knee5 (PR #187)
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: gbrain repair-jsonb + v0.12.1 migration + CI grep guard

  - New gbrain repair-jsonb command. Detects rows where jsonb_typeof(col) =
    'string' and rewrites them via (col #>> '{}')::jsonb across the 5
    affected columns: pages.frontmatter, raw_data.data,
    ingest_log.pages_updated, files.metadata, page_versions.frontmatter.
    Idempotent — re-running is a no-op. PGLite engines short-circuit cleanly
    (the bug never affected the parameterized encode path PGLite uses).
    --dry-run shows what would be repaired; --json for scripting.
  - New v0_12_1.ts migration orchestrator. Phases: schema → repair → verify.
    Modeled on the v0_12_0 pattern, registered in migrations/index.ts. Runs
    automatically via gbrain upgrade / apply-migrations.
  - CI grep guard at scripts/check-jsonb-pattern.sh fails the build if anyone
    reintroduces the ${JSON.stringify(x)}::jsonb interpolation pattern. Wired
    into bun test via package.json. Best-effort static analysis (multi-line
    and helper-wrapped variants are caught by the E2E round-trip test
    instead).
  - Updates apply-migrations.test.ts expectations to account for the new
    v0.12.1 entry in the registry.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.12.1)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.12.1

  - CLAUDE.md: document repair-jsonb command, v0_12_1 migration, splitBody
    sentinel contract, inferType wiki subtypes, CI grep guard, new test files
    (repair-jsonb, migrations-v0_12_1, markdown)
  - README.md: add gbrain repair-jsonb to ADMIN command reference
  - INSTALL_FOR_AGENTS.md: fix verification count (6 -> 7), add v0.12.1
    upgrade guidance for Postgres brains
  - docs/GBRAIN_VERIFY.md: add check #8 for JSONB integrity on
    Postgres-backed brains
  - docs/UPGRADING_DOWNSTREAM_AGENTS.md: add v0.12.1 section with migration
    steps, splitBody contract, wiki subtype inference
  - skills/migrate/SKILL.md: document native wikilink extraction via gbrain
    extract links (v0.12.1+)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

Fix a Postgres/Supabase search bug where `gbrain query` could render `[NaN]`
scores after cosine rescoring.

The root cause was that `getEmbeddingsByChunkIds()` could return pgvector
values as strings (for example `"[0.1, 0.2, ...]"`) instead of
`Float32Array`s. Those string values were then passed into the cosine
similarity math, which produced `NaN` scores.

This change adds a shared embedding parser and uses it in the Postgres path
before cosine rescoring.
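A minimal sketch of what such a parser can look like (the name matches the PR's `parseEmbedding()`, but the exact edge-case handling below is an illustrative assumption, not the merged implementation):

```typescript
// Normalize a DB embedding value to Float32Array. Handles Float32Array
// passthrough, plain numeric arrays, and the pgvector text format
// "[0.1,0.2,...]". Throws on malformed vectors rather than letting NaN
// leak into cosine math; returns null for non-vector strings.
function parseEmbedding(value: unknown): Float32Array | null {
  if (value instanceof Float32Array) return value;
  if (Array.isArray(value)) return Float32Array.from(value as number[]);
  if (typeof value === "string") {
    const s = value.trim();
    if (s.startsWith("[") && s.endsWith("]")) {
      const inner = s.slice(1, -1).trim();
      if (inner === "") return new Float32Array(0);
      const nums = inner.split(",").map(Number);
      if (nums.some(Number.isNaN)) {
        throw new Error(`Malformed pgvector value: ${s.slice(0, 40)}`);
      }
      return Float32Array.from(nums);
    }
    return null; // non-vector string: treated as "no embedding here"
  }
  return null;
}
```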
Changes

- `parseEmbedding()` added in `src/core/utils.ts`
- `parseEmbedding()` used in `rowToChunk()`
- `parseEmbedding()` used in `PostgresEngine.getEmbeddingsByChunkIds()`
- Unit tests in `test/utils.test.ts`

Why this is needed
Before this fix, hybrid search itself was returning valid RRF scores, but the
Postgres/Supabase cosine rescoring step converted them into `NaN` because the
embedding type coming back from the DB was not normalized.
After this fix, query results render normal numeric scores again and semantic
ranking works as expected on the Postgres engine.
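For intuition, here is a tiny illustration of why the un-normalized type produces `NaN` (a naive cosine function for demonstration, not the engine's actual rescoring code): arithmetic on a string element coerces to `NaN`, and `NaN` propagates through every dot-product term.

```typescript
// Naive cosine similarity over number arrays (illustrative only).
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const query = [0.1, 0.2];
// The bug: a pgvector string slipped through typed as number[].
// Indexing it yields characters ("[", "0", ...), and "[" * 0.1 is NaN.
const fromDb = "[0.1,0.2]" as unknown as number[];

console.log(cosine(query, query));  // normal numeric score
console.log(cosine(query, fromDb)); // NaN
```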
Test Plan
Run the targeted unit tests:
`bun test test/utils.test.ts test/search.test.ts`

Manual verification against a real Supabase-backed brain:

`gbrain query "what happened in ayor pmo meetings"`

Before this fix, the query output showed `[NaN]` scores. After this fix, the
query output shows normal numeric scores like `[0.8315]`.