fix(migrate): parse PGLite string embeddings before re-serializing for Postgres#199

Open
ShadowRaptor wants to merge 1 commit into garrytan:master from ShadowRaptor:fix/migrate-vector-serialization

Conversation

@ShadowRaptor

Summary

gbrain migrate --to <postgres-url> (e.g. PGLite → Supabase) fails on the very first page with:

invalid input syntax for type vector: "[[,-,0,.,0,1,2,5,4,2,7,2,5,,,-,..."

The corrupted vector value is the embedding string being iterated character by character and joined with commas.

Root cause

pglite-engine.ts:getChunksWithEmbeddings returns the embedding column as a JSON-stringified array (e.g. "[0.1,0.2,...]"), not a Float32Array. PGLite's pgvector returns vector columns this way.

The migrate command then passes those chunks to postgres-engine.ts:upsertChunks, which serializes embeddings:

const embeddingStr = chunk.embedding
  ? '[' + Array.from(chunk.embedding).join(',') + ']'
  : null;

When chunk.embedding is a string, Array.from(string) iterates the string character by character (["[", "0", ".", "1", ",", ...]). Joining those characters with "," produces the malformed "[[,0,.,1,,,0,.,2,..." and pgvector rejects it.

The sister method pglite-engine.ts:getEmbeddingsByChunkIds already does the right thing — parses the string back to Float32Array. getChunksWithEmbeddings was missing the same defensive parse.

Fix (defense in depth)

  1. pglite-engine.ts getChunksWithEmbeddings — parse string embeddings to Float32Array at source, mirroring the pattern in getEmbeddingsByChunkIds.
  2. postgres-engine.ts upsertChunks — also accept already-stringified pgvector-format embeddings (pass-through), in case any other call site hands us a string.
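The shape of the two changes, sketched (function names per the files above; bodies are illustrative, not the literal patch):

```typescript
// 1. At the source (getChunksWithEmbeddings): normalize PGLite's
//    stringified vector back to Float32Array, mirroring the existing
//    pattern in getEmbeddingsByChunkIds.
function parseEmbedding(value: Float32Array | string | null): Float32Array | null {
  if (value == null) return null;
  if (typeof value === 'string') return new Float32Array(JSON.parse(value));
  return value;
}

// 2. At the sink (upsertChunks): pass through a value that is already a
//    pgvector literal instead of re-iterating it.
function serializeEmbedding(value: Float32Array | string | null): string | null {
  if (value == null) return null;
  if (typeof value === 'string') return value; // already "[0.1,0.2,...]"
  return '[' + Array.from(value).join(',') + ']';
}
```

With both in place, either representation round-trips to a valid pgvector literal, so the migrate path is safe even if only one of the two changes lands.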

13 lines added, 2 lines changed.

Reproduction

Any PGLite source brain with chunks.embedding populated:

gbrain init                          # creates PGLite brain
# add some content with embeddings
gbrain migrate --to <postgres-url>   # fails on first page

Verification

After patch: migrate completes successfully, embeddings transfer intact.

Verified on a real 153-page PGLite → Supabase migration (472 chunks, 1536-dim text-embedding-3-large vectors). Pre-patch failed immediately; post-patch completed in ~3 minutes with all embeddings intact.

Notes

  • No new tests added — happy to add them if you want a specific test format. The fix follows the same defensive pattern already in getEmbeddingsByChunkIds, so it's symmetry rather than a new contract.
  • Found while migrating a 0.10.2 brain. Verified bug still present on master (81b3f7a).
  • Separate (lower-priority) finding from the same investigation: gbrain put returns the misleading error Page not found: <slug> when the slug contains uppercase letters. I'll file that as a separate issue.

🤖 Generated with Claude Code

fix(migrate): parse PGLite string embeddings before re-serializing for Postgres

When `gbrain migrate --to <postgres-target>` reads chunks via
`pglite-engine.getChunksWithEmbeddings`, PGLite returns the `embedding`
column as a JSON-stringified array (e.g. `"[0.1,0.2,...]"`), not a
`Float32Array`. The migrate then passes those chunks to
`postgres-engine.upsertChunks` which calls
`Array.from(chunk.embedding).join(',')` — but `Array.from(string)`
iterates the string CHARACTER-BY-CHARACTER, producing
`"[,0,.,1,,,0,.,2,...]"` and a pgvector parse error:

  invalid input syntax for type vector: "[[,-,0,.,0,1,2,5,4,2,7,2,5,,,..."

`pglite-engine.getEmbeddingsByChunkIds` already does the right thing
(parses the string back to `Float32Array`), but `getChunksWithEmbeddings`
was missing the same defensive conversion.

Fix in two places (defense in depth):

1. `pglite-engine.ts getChunksWithEmbeddings` — parse string embeddings
   to `Float32Array` at the source, mirroring the pattern in
   `getEmbeddingsByChunkIds`.
2. `postgres-engine.ts upsertChunks` — also accept already-stringified
   pgvector-format embeddings (pass-through), in case any other call
   site hands us a string.

Reproduction (any PGLite source brain that has `chunks.embedding` set):
  gbrain init                          # creates PGLite brain
  echo "..." > brain/test.md
  gbrain sync                          # populates chunks + embeddings
  gbrain migrate --to <postgres-url>   # fails

After patch: migrate completes successfully, embeddings transfer intact.

Verified locally on a 153-page PGLite → Supabase migration (472 chunks,
1536-dim text-embedding-3-large vectors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
