fix: markdown parsing bugs affecting wiki-style content #187
knee5 wants to merge 7 commits into garrytan:master
- splitBody now requires an explicit timeline sentinel (<!-- timeline -->, --- timeline ---, or --- directly before ## Timeline / ## History). A bare --- in body text is a markdown horizontal rule, not a separator. This fixes the truncation @knee5 reported on a 1,991-article wiki, where 83% of articles were cut short and 4,856 of 6,680 wikilinks were lost.
- serializeMarkdown emits the <!-- timeline --> sentinel for round-trip stability.
- inferType extended with /writing/, /wiki/analysis/, /wiki/guides/, /wiki/hardware/, /wiki/architecture/, /wiki/concepts/. Path order is most-specific-first so projects/blog/writing/essay.md → writing, not project.
- PageType union extended: writing, analysis, guide, hardware, architecture.

Updates test/import-file.test.ts to use the new sentinel.

Co-Authored-By: @knee5 (PR #187)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related Postgres-string-typed-data bugs that PGLite hid:
1. JSONB double-encode (postgres-engine.ts:107,668,846 + files.ts:254):
${JSON.stringify(value)}::jsonb in postgres.js v3 stringified again
on the wire, storing JSONB columns as quoted string literals. Every
frontmatter->>'key' returned NULL on Postgres-backed brains; GIN
indexes were inert. Switched to sql.json(value), which is the
postgres.js-native JSONB encoder (Parameter with OID 3802).
Affected columns: pages.frontmatter, raw_data.data,
ingest_log.pages_updated, files.metadata. page_versions.frontmatter
is downstream via INSERT...SELECT and propagates the fix.
2. pgvector embeddings returning as strings (utils.ts):
getEmbeddingsByChunkIds returned "[0.1,0.2,...]" instead of
Float32Array on Supabase, producing [NaN] cosine scores.
Adds parseEmbedding() helper handling Float32Array, numeric arrays,
and pgvector string format. Throws loud on malformed vectors
(per Codex's no-silent-NaN requirement); returns null for
non-vector strings (treated as "no embedding here"). rowToChunk
delegates to parseEmbedding.
E2E regression test at test/e2e/postgres-jsonb.test.ts asserts
jsonb_typeof = 'object' AND col->>'k' returns expected scalar across
all 5 affected columns — the test that should have caught the original
bug. Runs in CI via the existing pgvector service.
Co-Authored-By: @knee5 (PR #187 — JSONB triple-fix)
Co-Authored-By: @leonardsellem (PR #175 — parseEmbedding)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
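The parseEmbedding behavior described in this commit message can be illustrated with a minimal sketch. The function name `parseEmbeddingSketch`, the helper `toFloat32Sketch`, and the exact error text are assumptions for illustration; the real helper in utils.ts may differ in its details.

```typescript
// Hypothetical sketch of the described contract, not the code from the PR:
// - Float32Array and numeric arrays are accepted
// - pgvector text format "[0.1,0.2,...]" is parsed
// - non-vector strings return null ("no embedding here")
// - malformed vectors throw instead of silently producing NaN
function toFloat32Sketch(parts: unknown[]): Float32Array {
  const out = new Float32Array(parts.length);
  for (let i = 0; i < parts.length; i++) {
    const raw = parts[i];
    const n =
      typeof raw === "number" ? raw :
      typeof raw === "string" && raw !== "" ? Number(raw) :
      NaN;
    if (!isFinite(n)) {
      throw new Error(`Malformed embedding component at index ${i}`);
    }
    out[i] = n;
  }
  return out;
}

function parseEmbeddingSketch(value: unknown): Float32Array | null {
  if (value === null || value === undefined) return null;
  if (value instanceof Float32Array) return value;
  if (Array.isArray(value)) return toFloat32Sketch(value);
  if (typeof value === "string") {
    const s = value.trim();
    // Not bracketed: not a vector at all, so report "no embedding".
    if (s.charAt(0) !== "[" || s.charAt(s.length - 1) !== "]") return null;
    const parts = s.slice(1, -1).split(",").map((p) => p.trim());
    return toFloat32Sketch(parts);
  }
  return null;
}
```

Throwing on a bracketed but unparsable string keeps a bad row from turning into a NaN cosine score further down the pipeline.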
extractMarkdownLinks now handles [[page]] and [[page|Display Text]] alongside standard [text](page.md). For wiki KBs where authors omit leading ../ (thinking in wiki-root-relative terms), resolveSlug walks ancestor directories until it finds a matching slug. Without this, wikilinks under tech/wiki/analysis/ targeting [[../../finance/wiki/concepts/foo]] silently dangled when the correct relative depth was 3 × ../ instead of 2.

Co-Authored-By: @knee5 (PR #187)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hey @knee5 — thank you for finding and fixing this. The 1,991-article wiki repro made the bug case-shut. PR #196 (v0.12.1 hotfix) just merged into the queue and re-implements all three of your fixes (splitBody sentinel, inferType wiki subtypes, JSONB triple-fix) with expanded scope:
Re-implemented rather than cherry-picked because PR #187 went CONFLICTING after Minions v7 / knowledge graph landed. Your test cases are ported verbatim. Co-authorship preserved in the commit trailers.

Mind if I close this PR in favor of #196? Want to make sure you're OK with the merger before I do; your name's on every relevant commit and the CHANGELOG.

Thanks again.
Plain `---` in article body was treated as compiled_truth/timeline separator. Wikis using `---` as horizontal rules between sections experienced severe content truncation — a 23,887-byte article could store as 593 bytes, and 4,856 of 6,680 wikilinks were lost from the DB (73%) across a 1,991-article knowledge base. Now splits only on explicit `<!-- timeline -->` sentinel, `--- timeline ---`, or `---` when immediately followed by a `## Timeline` or `## History` heading. serializeMarkdown updated to emit `<!-- timeline -->` for round-trip stability. Tests added for horizontal rules, sentinel splits, and heading-gated splits.
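The split rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: `splitBodySketch`, `cutSketch`, and the `{ compiledTruth, timeline }` return shape are hypothetical names, and the input is assumed to be the article body with frontmatter already removed.

```typescript
// Hypothetical sketch of the described split rule, not the code from the PR.
interface SplitResultSketch {
  compiledTruth: string;
  timeline: string;
}

const TIMELINE_HEADING_SKETCH = /^##\s+(Timeline|History)\b/;

function cutSketch(lines: string[], at: number): SplitResultSketch {
  return {
    compiledTruth: lines.slice(0, at).join("\n").trim(),
    timeline: lines.slice(at + 1).join("\n").trim(),
  };
}

function splitBodySketch(body: string): SplitResultSketch {
  const lines = body.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    // Explicit sentinels always split.
    if (line === "<!-- timeline -->" || line === "--- timeline ---") {
      return cutSketch(lines, i);
    }
    // A bare --- splits only when the next non-empty line is a
    // "## Timeline" or "## History" heading; otherwise it is a horizontal rule.
    if (line === "---") {
      let j = i + 1;
      while (j < lines.length && lines[j].trim() === "") j++;
      if (j < lines.length && TIMELINE_HEADING_SKETCH.test(lines[j].trim())) {
        return cutSketch(lines, i);
      }
    }
  }
  // No sentinel found: the whole body is kept.
  return { compiledTruth: body.trim(), timeline: "" };
}
```

With this rule, a body such as `"Intro\n\n---\n\nMore"` is kept whole, which is the case that previously caused the truncation.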
Only `/wiki/concepts/` was mapped; articles under `/wiki/analysis/`, `/wiki/guides/`, `/wiki/hardware/`, and `/wiki/architecture/` all silently defaulted to `type='concept'`, producing incorrect metadata and breaking any type-filtered queries. Adds explicit path-segment mappings for the four missing subtypes. `concept` remains the default fallback.
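A path-segment mapping of this kind might look like the sketch below. The names `PageTypeSketch`, `WIKI_SEGMENTS_SKETCH`, and `inferTypeSketch` are hypothetical, and only the wiki subtypes named in this commit are covered; the real inferType handles other page types as well.

```typescript
// Hypothetical sketch of wiki-subtype inference, not the code from the PR.
type PageTypeSketch = "concept" | "analysis" | "guide" | "hardware" | "architecture";

// Each entry maps a path segment to the page type it implies.
const WIKI_SEGMENTS_SKETCH: Array<[string, PageTypeSketch]> = [
  ["/wiki/analysis/", "analysis"],
  ["/wiki/guides/", "guide"],
  ["/wiki/hardware/", "hardware"],
  ["/wiki/architecture/", "architecture"],
  ["/wiki/concepts/", "concept"],
];

function inferTypeSketch(filePath: string): PageTypeSketch {
  // Normalize separators and add a leading slash so a path that starts
  // with "wiki/..." still matches the "/wiki/..." segments.
  const p = "/" + filePath.replace(/\\/g, "/").toLowerCase();
  for (const [segment, type] of WIKI_SEGMENTS_SKETCH) {
    if (p.indexOf(segment) !== -1) return type;
  }
  return "concept"; // default fallback, as stated in the commit message
}
```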
…gify()::jsonb

The Postgres engine was passing `JSON.stringify(x)::jsonb` to postgres.js. Because postgres.js v3 sends that as a plain string, the DB stores a JSONB value that is itself a JSON string literal, not an object. Consequently `frontmatter->>'key'` returns NULL in SQL and GIN indexes are ineffective.

Replace all three call sites (putPage, putRawData, logIngest) with `this.sql.json(x)`, which is postgres.js v3's native JSONB serialization and causes the driver to send the value with the correct wire type.

Also fix rowToChunk in utils.ts to handle embeddings returned as JSON strings (a related symptom of the same driver/cast mismatch).

PGLite engine is unaffected; it uses `$n::jsonb` with JSON.stringify, which is correct for that driver.
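The effect of the double encode can be reproduced without a database. The sketch below simulates it in plain TypeScript; `jsonbTypeofSketch` and `arrowTextSketch` are hypothetical stand-ins for Postgres's `jsonb_typeof()` and the `->>` operator, not real driver or server APIs.

```typescript
// Illustration only: simulates the stored value in plain TypeScript.
function jsonbTypeofSketch(value: unknown): string {
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  return typeof value; // "object", "string", "number" or "boolean" for JSON values
}

// Mimics frontmatter->>'key': only objects have keys to look up.
function arrowTextSketch(stored: unknown, key: string): string | null {
  if (typeof stored !== "object" || stored === null || Array.isArray(stored)) return null;
  const v = (stored as Record<string, unknown>)[key];
  if (v === undefined || v === null) return null;
  return typeof v === "string" ? v : JSON.stringify(v);
}

const frontmatterSketch = { title: "Foo", tags: ["a", "b"] };

// Buggy path: the caller stringifies, then the value is encoded as JSON
// a second time, so the column ends up holding a JSON string.
const storedBuggy: unknown = JSON.parse(JSON.stringify(JSON.stringify(frontmatterSketch)));

// Fixed path: the value is encoded exactly once, so the column holds an object.
const storedFixed: unknown = JSON.parse(JSON.stringify(frontmatterSketch));
```

Here `jsonbTypeofSketch(storedBuggy)` is `"string"` and the key lookup yields `null`, which matches the symptom reported above; the single-encoded value behaves as an object.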
The extract command's regex only matched standard markdown links `[text](path.md)`, missing the `[[path|display]]` wikilinks used by Obsidian-style knowledge bases. A 2,000-article vault with thousands of wikilinks extracted 0 links because of this.

Now handles both syntaxes:
- Standard markdown: `[text](relative/path.md)`
- Wikilinks: `[[path/to/page]]` and `[[path/to/page|Display Text]]`

Skips external URLs in both cases. Normalizes wikilink targets to include .md suffix when missing.

Note: target-slug resolution for wikilinks still needs refinement: relative paths like `[[concepts/foo]]` don't map cleanly to DB slugs like `tech/wiki/concepts/foo` without context. Tracked for follow-up.

Tests added for wikilink patterns, display text handling, external URL filtering.
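A two-pattern extractor along these lines could be sketched as below. The regular expressions, the `LinkSketch` shape, and the function name `extractLinksSketch` are assumptions for illustration and are not taken from the PR.

```typescript
// Hypothetical sketch of dual-syntax link extraction, not the code from the PR.
interface LinkSketch {
  target: string;
  text: string;
}

// Anything with a URL scheme (https:, mailto:, ...) counts as external.
const EXTERNAL_SKETCH = /^[a-z][a-z0-9+.-]*:/i;

function extractLinksSketch(markdown: string): LinkSketch[] {
  const links: LinkSketch[] = [];
  // Regexes are created per call so the /g lastIndex state never leaks.
  const mdLink = /\[([^\]]+)\]\(([^)\s]+)\)/g;
  const wikiLink = /\[\[([^\]|]+)(?:\|([^\]]*))?\]\]/g;
  let m: RegExpExecArray | null;

  // Standard markdown: [text](relative/path.md)
  while ((m = mdLink.exec(markdown)) !== null) {
    if (!EXTERNAL_SKETCH.test(m[2])) links.push({ target: m[2], text: m[1] });
  }

  // Wikilinks: [[path/to/page]] and [[path/to/page|Display Text]]
  while ((m = wikiLink.exec(markdown)) !== null) {
    const raw = m[1].trim();
    if (EXTERNAL_SKETCH.test(raw)) continue;
    // Normalize the target to carry a .md suffix.
    const target = /\.md$/.test(raw) ? raw : raw + ".md";
    links.push({ target, text: (m[2] || raw).trim() });
  }
  return links;
}
```

For a wikilink with no display text, the sketch falls back to the raw target as the link text.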
…slugs

Wikilinks in wiki-style KBs use various formats that the previous extractor failed to resolve, dropping ~30% of valid links:
- Relative bare name: [[foo]] in tech/wiki/concepts/ → tech/wiki/concepts/foo
- Cross-type shorthand: [[analysis/foo]] in tech/wiki/guides/ → tech/wiki/analysis/foo (authors omit leading ../ thinking in wiki-root-relative terms)
- Cross-domain under-specified: [[../../finance/wiki/...]] from depth-3 dirs resolves one level short because authors write 2× ../ when 3× is needed to reach KB root; ancestor search corrects this
- Fully-qualified: [[tech/wiki/concepts/foo]], now handled by root fallback
- Section anchors: [[page#section]], now stripped; bare [[#anchor]] skipped

Adds resolveSlug(fileDir, relTarget, allSlugs) that first tries the standard path.join resolution, then retries from each ancestor of fileDir in turn, up to the KB root (ancestor search), until a matching slug is found. Returns null for genuinely dangling targets (no matching page exists anywhere in the KB).

Also strips section anchors (#heading) from wikilink paths in extractMarkdownLinks; they're intra-page refs and were causing lookup misses.

Analysis on the user's 2,074-page KB:
- Previously resolved: 6,760 raw / 5,039 unique deduped disk links
- After fix: 8,594 raw / 6,641 unique deduped disk links (+32% unique)
- Remaining 1,241 raw links are genuinely dangling (no matching page)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
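The ancestor search can be sketched as follows, assuming slugs are slash-separated paths without the .md suffix. The signature mirrors the one named in the commit message, but `resolveSlugSketch`, `normalizeSlugSketch`, and the rule that a path climbing above the KB root yields no candidate are assumptions made for this illustration.

```typescript
// Hypothetical sketch of ancestor-search resolution, not the code from the PR.

// Collapses "." and ".." segments; returns null if the path climbs above the root.
function normalizeSlugSketch(path: string): string | null {
  const out: string[] = [];
  for (const part of path.split("/")) {
    if (part === "" || part === ".") continue;
    if (part === "..") {
      if (out.length === 0) return null;
      out.pop();
    } else {
      out.push(part);
    }
  }
  return out.join("/");
}

function resolveSlugSketch(
  fileDir: string,
  relTarget: string,
  allSlugs: Set<string>,
): string | null {
  const target = relTarget.replace(/\.md$/, "");
  const dirParts = fileDir.split("/").filter((p) => p !== "");
  // depth === dirParts.length is the standard relative resolution;
  // each smaller depth retries from the next ancestor, ending at the root.
  for (let depth = dirParts.length; depth >= 0; depth--) {
    const joined = dirParts.slice(0, depth).concat(target).join("/");
    const candidate = normalizeSlugSketch(joined);
    if (candidate !== null && allSlugs.has(candidate)) return candidate;
  }
  return null; // genuinely dangling
}
```

Trying the nearest directory first means a bare `[[foo]]` prefers a sibling page over a same-named page elsewhere in the KB.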
… pages

Surfaces pages with zero inbound wikilinks. Essential for content enrichment cycles in KBs with 1000+ pages.

By default filters out auto-generated pages, raw sources, and pseudo-pages where having no inbound links is expected; pass --include-pseudo to disable the filter.

Supports text (grouped by domain), --json, and --count outputs. Also exposed as the find_orphans MCP operation.

Tests cover basic detection, filtering, and all output modes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
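The core of orphan detection is a set difference between all pages and the pages that appear as a link target. The sketch below shows only that core; `findOrphansSketch` and `EdgeSketch` are hypothetical names, the default filters and output modes are omitted, and ignoring self-links is an assumption, not something the commit message states.

```typescript
// Hypothetical sketch of orphan detection, not the code from the PR.
interface EdgeSketch {
  from: string;
  to: string;
}

function findOrphansSketch(slugs: string[], edges: EdgeSketch[]): string[] {
  const linked = new Set<string>();
  for (const edge of edges) {
    // Assumption: a page linking to itself does not count as inbound.
    if (edge.from !== edge.to) linked.add(edge.to);
  }
  // A page is an orphan when no other page links to it.
  return slugs.filter((slug) => !linked.has(slug)).sort();
}
```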
25e1c56 to f50954f
…n canonical extractor

extractEntityRefs now recognizes both syntaxes equally:
  [Name](people/slug) -- upstream original
  [[people/slug|Name]] -- Obsidian wikilink (new)

Extends DIR_PATTERN to include domain-organized wiki slugs used by Karpathy-style knowledge bases:
- entities (legacy prefix some brains keep during migration)
- projects (gbrain canonical, was missing from regex)
- tech, finance, personal, openclaw (domain-organized wiki roots)

Before this change, a 2,100-page brain with wikilinks throughout extracted zero auto-links on put_page because the regex only matched markdown-style [name](path). After: 1,377 new typed edges on a single extract --source db pass over the same corpus.

Matches the behavior of the extract.ts filesystem walker (which already handled wikilinks as of the wiki-markdown-compat fix wave), so the db and fs sources now produce the same link graph from the same content. Both patterns share the DIR_PATTERN constant so adding a new entity dir only requires updating one string.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
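Building both regexes from one directory string could look like the sketch below. The constant value, the regex details, and the names `DIR_PATTERN_SKETCH` and `extractEntityRefsSketch` are assumptions; the real DIR_PATTERN likely contains more directories than the ones named in this commit message.

```typescript
// Hypothetical sketch of a shared-pattern entity extractor, not the code from the PR.
// Assumed subset of directories; adding one here updates both regexes.
const DIR_PATTERN_SKETCH = "people|entities|projects|tech|finance|personal|openclaw";

interface EntityRefSketch {
  name: string;
  slug: string;
}

function extractEntityRefsSketch(text: string): EntityRefSketch[] {
  // A slug starts with a known directory and runs until a delimiter.
  const slug = `((?:${DIR_PATTERN_SKETCH})/[^\\s)\\]|#]+)`;
  const markdown = new RegExp(`\\[([^\\]]+)\\]\\(${slug}\\)`, "g");
  const wikilink = new RegExp(`\\[\\[${slug}(?:\\|([^\\]]+))?\\]\\]`, "g");

  const refs: EntityRefSketch[] = [];
  let m: RegExpExecArray | null;
  // [Name](people/slug)
  while ((m = markdown.exec(text)) !== null) {
    refs.push({ name: m[1], slug: m[2] });
  }
  // [[people/slug|Name]] or [[people/slug]]; the slug doubles as the name
  // when no display text is given.
  while ((m = wikilink.exec(text)) !== null) {
    refs.push({ name: m[2] || m[1], slug: m[1] });
  }
  return refs;
}
```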
Hey @garrytan — thank you, better outcome than I hoped for. The expanded JSONB audit ( Yes, please close #187 in favor of #196. Two scope items from the original PR aren't in v0.12.0 and I'd like to offer as focused follow-up PRs if you want them:
Either/both sound useful, or should I drop them? Happy to open a clean PR against v0.12.0 master if you want to evaluate. Thanks again for the thorough treatment — I'll upgrade and run |
…196)

* fix: splitBody and inferType for wiki-style markdown content

* fix: JSONB double-encode bug on Postgres + parseEmbedding NaN scores

* feat: extract wikilink syntax with ancestor-search slug resolution

* feat: gbrain repair-jsonb + v0.12.1 migration + CI grep guard

- New gbrain repair-jsonb command. Detects rows where jsonb_typeof(col) = 'string' and rewrites them via (col #>> '{}')::jsonb across 5 affected columns: pages.frontmatter, raw_data.data, ingest_log.pages_updated, files.metadata, page_versions.frontmatter. Idempotent: re-running is a no-op. PGLite engines short-circuit cleanly (the bug never affected the parameterized encode path PGLite uses). --dry-run shows what would be repaired; --json for scripting.
- New v0_12_1.ts migration orchestrator. Phases: schema → repair → verify. Modeled on v0_12_0 pattern, registered in migrations/index.ts. Runs automatically via gbrain upgrade / apply-migrations.
- CI grep guard at scripts/check-jsonb-pattern.sh fails the build if anyone reintroduces the ${JSON.stringify(x)}::jsonb interpolation pattern. Wired into bun test via package.json. Best-effort static analysis (multi-line and helper-wrapped variants are caught by the E2E round-trip test instead).
- Updates apply-migrations.test.ts expectations to account for the new v0.12.1 entry in the registry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.12.1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.12.1

- CLAUDE.md: document repair-jsonb command, v0_12_1 migration, splitBody sentinel contract, inferType wiki subtypes, CI grep guard, new test files (repair-jsonb, migrations-v0_12_1, markdown)
- README.md: add gbrain repair-jsonb to ADMIN command reference
- INSTALL_FOR_AGENTS.md: fix verification count (6 -> 7), add v0.12.1 upgrade guidance for Postgres brains
- docs/GBRAIN_VERIFY.md: add check #8 for JSONB integrity on Postgres-backed brains
- docs/UPGRADING_DOWNSTREAM_AGENTS.md: add v0.12.1 section with migration steps, splitBody contract, wiki subtype inference
- skills/migrate/SKILL.md: document native wikilink extraction via gbrain extract links (v0.12.1+)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

- `---` in article bodies was treated as a compiled_truth/timeline separator, causing 83% content truncation across a 1,991-article knowledge base (e.g., a 23,887-byte article stored as 593 bytes; 4,856 of 6,680 wikilinks lost from the DB).
- `/wiki/analysis/`, `/wiki/guides/`, `/wiki/hardware/`, `/wiki/architecture/` all silently defaulted to `type='concept'`, breaking type-filtered queries.
- `JSON.stringify(x)::jsonb` causes postgres.js v3 to store a JSONB string literal instead of an object, making `frontmatter->>'key'` return NULL and GIN indexes ineffective.

Why this matters

Any wiki/notebook with markdown horizontal rules or non-default subdirectory types hits these. Found while migrating a 1,991-article knowledge base where 83% of articles were truncated in the DB.

Changes

- splitBody(): No longer treats plain `---` as a sentinel. Recognized split points are now: `<!-- timeline -->` (preferred, unambiguous), `--- timeline ---` (decorated separator), or `---` only when the next non-empty line is a `## Timeline` or `## History` heading (backward-compat fallback). `serializeMarkdown` updated to emit `<!-- timeline -->` for round-trip stability.
- inferType(): Added path-segment mappings for `/wiki/analysis/`, `/wiki/guides/` (and `/wiki/guide/`), `/wiki/hardware/`, `/wiki/architecture/`, and `/wiki/concepts/` (and `/wiki/concept/`). `concept` remains the default fallback.
- postgres-engine JSONB: Replaced all three `JSON.stringify(x)::jsonb` call sites (`putPage`, `putRawData`, `logIngest`) with `this.sql.json(x)`, which is postgres.js v3's native JSONB serialization. Also fixed `rowToChunk` in `utils.ts` to handle embeddings returned as JSON strings (a symptom of the same mismatch). PGLite engine is unaffected.

Testing

854 tests pass. New tests added for: horizontal rules in body, `<!-- timeline -->` sentinel, heading-gated splits, wiki subtype inference.

Backwards compat

Existing PGLite behavior unchanged. Postgres engine writes now produce valid JSONB (queries that used `frontmatter->>'key'` which returned NULL will start returning values after next write; this is a correctness fix, but downstream code that relied on the NULL behavior should be checked).