feat(link-extraction): Obsidian vault support — configurable entity_dirs + wikilinks by gopalpatel · Pull Request #194 · garrytan/gbrain

gopalpatel · 2026-04-18T13:43:55Z

Summary

Adds Obsidian-vault compatibility to the auto-link extractor via two coordinated changes:

Configurable entity_dirs so vaults with non-default directory structures (Johnny Decimal 01-notes/, 03-subjects/, etc.) get typed-link extraction without forking.
Explicit-path wikilink support: [[dir/slug]] and [[dir/slug|alias]] are now recognized alongside the existing [Name](dir/slug) markdown syntax.

Existing behavior is preserved for everyone not touching the config — the default dir list is unchanged, and extractEntityRefs / extractPageLinks have optional (additive) parameters.

Motivation

Discovered while migrating a 2,052-page Obsidian vault to gbrain v0.12.0. The auto-link pass extracted 0 links because (a) the vault uses 01-notes/ and 03-subjects/ instead of people/ and companies/, and (b) the content uses [[wikilinks]] rather than [Name](path) markdown links. Both gaps meant the knowledge graph stayed empty.

What changed

`src/core/link-extraction.ts`

DEFAULT_ENTITY_DIRS is exported as a frozen (Object.freeze) readonly list of the current canonical dirs.
New internal helpers: escapeRegexChars, buildEntityRefRegex, buildBareSlugRegex, buildWikilinkRegex (defense-in-depth regex construction from any dir list).
New getEntityDirs(engine) config reader:
- Reads entity_dirs (comma-separated string) + entity_dirs_mode ("union" default / "replace" opt-out).
- Union by default — custom dirs ADD to defaults so hybrid vaults work without surprise. Replace mode is a deliberate opt-out.
- Validates each entry against /^[a-z0-9][a-z0-9-]*$/. Invalid entries trigger console.warn + fall back to defaults (fail-safe, but with a visible signal so users can diagnose typos rather than silently getting zero links).
extractEntityRefs(content, dirs?) and extractPageLinks(content, fm, type, dirs?) accept an optional dirs parameter. Omitting it reuses the module-level default regex (zero perf cost for the common path).
Explicit-path wikilink extraction: [[dir/slug]] and [[dir/slug|Display Alias]] are recognized. The dir prefix must be in the configured dir list — this keeps the extractor consistent with its markdown-ref sibling and avoids accidental matches on unrelated [[...]] text. Alias display text is captured but currently unused.
Out of scope (by design): bare [[name]] wikilinks without a dir prefix. Resolving those requires a page-identity lookup against the engine, which would break the pure-function contract the file documents at the top. Bare-slug resolution is a natural follow-up PR.

Call sites wired up

src/core/operations.ts — runAutoLink (put_page post-hook) loads dirs via getEntityDirs(engine) once inside the transaction.
src/commands/extract.ts — batch DB extractor wires the same call.

Docs

README.md — new "Configuring entity directories" + "Wikilink scope" subsections.
CHANGELOG.md — new [Unreleased] entry covering the three user-visible additions.

Test plan

bun test test/link-extraction.test.ts — 74 pass / 0 fail (up from 48 baseline). Adds 26 new tests covering DEFAULT_ENTITY_DIRS, buildEntityRefRegex, extractEntityRefs / extractPageLinks with custom dirs (union + replace modes), getEntityDirs parsing / validation / warning behavior, wikilink extraction, wikilink alias handling, wikilink code-block skipping, and wikilink dir filtering.
bun test — 1326 pass / 141 skip / 0 fail. Zero regressions.

Implementation notes

All 8 commits are atomic, TDD-ordered (failing test first, implementation, green). Clean history for reviewing piece-by-piece.
The dir lists flow through both the markdown-ref regex AND the bare-slug regex, so bare-text references like "See 01-notes/alice for context." also pick up custom dirs automatically.
Union-by-default was chosen over replace-by-default specifically because real vaults migrate incrementally — hybrid structures (some legacy people/, some new 01-notes/) are normal and shouldn't require a full mental flip to support.

Out of scope / potential follow-ups

Bare [[name]] wikilink resolution via page lookup. Natural next PR.
Alias-aware link types (use the |Display Alias as a hint for relationship inference).
gbrain doctor warning surface for invalid entity_dirs config values.

Extracts the hardcoded entity directory list (people, companies, meetings, concepts, deal, civic, project, source, media, yc) into an exported frozen readonly array. This is the first step toward configurable entity dirs; subsequent commits build the regex dynamically from this list and add a config reader so users can extend or replace the defaults (e.g. Johnny Decimal filesystems).
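As a rough illustration of the frozen-list pattern this commit describes: the entries below are the dirs named above, while the exact declaration in the PR and the `tryPush` helper are assumptions made for this sketch.

```typescript
// Sketch of the default list (exported in the PR; `export` omitted here).
// Entries are the dirs named in the commit message.
const DEFAULT_ENTITY_DIRS: readonly string[] = Object.freeze([
  "people", "companies", "meetings", "concepts", "deal",
  "civic", "project", "source", "media", "yc",
]);

// `readonly string[]` rejects mutation at compile time; Object.freeze also
// rejects it at runtime if a caller casts the type away.
// `tryPush` is a hypothetical helper that only demonstrates the runtime guard.
function tryPush(list: readonly string[], dir: string): boolean {
  try {
    (list as string[]).push(dir);
    return true;
  } catch {
    return false; // TypeError: the array is frozen
  }
}
```

The two layers are complementary: the type stops well-typed callers, and the freeze stops everything else.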

Replaces the hardcoded ENTITY_REF_RE literal with a builder that composes the alternation from a dir list. The default regex is now built once at module load from DEFAULT_ENTITY_DIRS, preserving the fast path for the common case. Adds escapeRegexChars as defense-in-depth — getEntityDirs already validates dir names against /^[a-z0-9][a-z0-9-]*$/, so no metachars should ever reach the regex builder, but future callers who reach for buildEntityRefRegex directly still get safe output. Both helpers are internal (not exported) to keep the public surface small.
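A minimal sketch of how such a builder might compose the alternation. The helper names come from the commit message, but the regex shape, the slug character class, the empty-list guard, and the `matchSlugs` helper are assumptions, not the PR's actual code.

```typescript
// Escape every regex metacharacter so a dir name is always matched literally.
function escapeRegexChars(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Compose the dir alternation into a [Name](dir/slug) matcher.
// The slug character class is an assumption borrowed from the dir validator.
function buildEntityRefRegex(dirs: readonly string[]): RegExp {
  if (dirs.length === 0) return /(?!)/g; // never matches; avoids an empty alternation
  const alternation = dirs.map(escapeRegexChars).join("|");
  return new RegExp(
    `\\[([^\\]]+)\\]\\(((?:${alternation})/[a-z0-9][a-z0-9-]*)\\)`,
    "g",
  );
}

// Hypothetical helper: collect the slug (capture group 2) of every match.
function matchSlugs(re: RegExp, text: string): string[] {
  const slugs: string[] = [];
  re.lastIndex = 0;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) slugs.push(m[2]);
  return slugs;
}
```

Escaping matters only if an unvalidated name slips through: with it, a dir such as `a.b` matches the literal text `a.b/` and not `axb/`.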

extractEntityRefs(content, dirs?) now takes an optional readonly dir list. When omitted, the module-level regex built from DEFAULT_ENTITY_DIRS is reused (fast path, zero compile cost). When provided, a scoped regex is compiled from the custom list — only those dirs match. Callers who want custom dirs IN ADDITION to defaults must pass the union themselves; the upcoming getEntityDirs helper does exactly that. Unlocks non-default filesystem layouts (e.g. Johnny Decimal 01-notes/).
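The fast-path versus scoped-compile behaviour could look roughly like the following. This is a sketch under assumptions: the default list is abbreviated to two entries, the regex shape is illustrative, and `dirs` is assumed to be a non-empty, already validated list.

```typescript
interface EntityRef {
  name: string;
  slug: string;
}

// Abbreviated stand-in for the exported default list.
const DEFAULT_ENTITY_DIRS: readonly string[] = Object.freeze(["people", "companies"]);

// Illustrative builder; assumes a non-empty, already validated dir list.
function buildEntityRefRegex(dirs: readonly string[]): RegExp {
  const alternation = dirs
    .map((d) => d.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join("|");
  return new RegExp(`\\[([^\\]]+)\\]\\(((?:${alternation})/[a-z0-9][a-z0-9-]*)\\)`, "g");
}

// Built once at module load; reused whenever `dirs` is omitted.
const DEFAULT_ENTITY_REF_RE = buildEntityRefRegex(DEFAULT_ENTITY_DIRS);

function extractEntityRefs(content: string, dirs?: readonly string[]): EntityRef[] {
  // Fast path: no compile. Custom path: a regex scoped to exactly `dirs`.
  const re = dirs === undefined ? DEFAULT_ENTITY_REF_RE : buildEntityRefRegex(dirs);
  re.lastIndex = 0; // the shared regex is global, so reset its cursor
  const refs: EntityRef[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(content)) !== null) {
    refs.push({ name: m[1], slug: m[2] });
  }
  return refs;
}

// Custom dirs replace the defaults unless the caller passes the union.
const text = "[Alice](people/alice) wrote [Plan](01-notes/plan)";
const defaultsOnly = extractEntityRefs(text);
const customOnly = extractEntityRefs(text, ["01-notes"]);
const union = extractEntityRefs(text, [...DEFAULT_ENTITY_DIRS, "01-notes"]);
```

Here `defaultsOnly` finds only the `people/` link, `customOnly` finds only the `01-notes/` link, and `union` finds both, which is why the caller (or a config helper) has to build the union explicitly.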

extractPageLinks(content, frontmatter, pageType, dirs?) now threads the optional dir list through both the markdown-ref extractor and the bare-slug regex. Both paths use the same dir list so behavior stays consistent. When dirs is omitted, module-level defaults are reused (no per-call regex compile). When provided, scoped regexes are built from the custom list. Adds buildBareSlugRegex as the bare-slug counterpart to buildEntityRefRegex — both internal, both built from the same escaped alternation.
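For the bare-slug side, a builder along these lines would share the same escaped alternation. The lookbehind guard and the `findBareSlugs` helper are assumptions of this sketch; the PR's actual pattern is not shown in the description.

```typescript
function buildBareSlugRegex(dirs: readonly string[]): RegExp {
  const alternation = dirs
    .map((d) => d.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join("|");
  // Lookbehind (an assumption of this sketch) skips slugs that are part of a
  // longer path, a markdown link target, or a wikilink.
  return new RegExp(`(?<![\\w/(\\[-])((?:${alternation})/[a-z0-9][a-z0-9-]*)`, "g");
}

// Hypothetical helper: return every bare dir/slug mention in the text.
function findBareSlugs(content: string, dirs: readonly string[]): string[] {
  const re = buildBareSlugRegex(dirs);
  const slugs: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(content)) !== null) slugs.push(m[1]);
  return slugs;
}
```

With `["people", "01-notes"]` as the dir list, the sentence "See 01-notes/alice for context." yields the single slug `01-notes/alice`; the trailing period is excluded because it is outside the slug character class.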

…lace modes

Reads the effective entity-dir list from engine config:
- entity_dirs: comma-separated custom dir names (optional, defaults to empty)
- entity_dirs_mode: 'union' (default) or 'replace'

Union mode ADDS custom dirs to DEFAULT_ENTITY_DIRS (defaults first, custom appended, deduped). Replace mode uses ONLY the custom list. An empty replace list falls back to defaults to prevent accidentally disabling extraction.

Each custom entry is validated against /^[a-z0-9][a-z0-9-]*$/. On any invalid entry, the function logs a warning and returns defaults, a fail-safe that prevents malformed config from silently breaking the graph layer. Validation runs BEFORE mode resolution so bad input is caught once, regardless of mode.
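The resolution rules described in this commit can be sketched as a pure function. `resolveEntityDirs` is a hypothetical name; the real getEntityDirs takes the engine and reads the two config values itself, which is not shown here. How blank segments and unknown mode values are treated is also an assumption.

```typescript
const DEFAULT_ENTITY_DIRS: readonly string[] = Object.freeze([
  "people", "companies", "meetings", "concepts", "deal",
  "civic", "project", "source", "media", "yc",
]);

const VALID_DIR = /^[a-z0-9][a-z0-9-]*$/;

// Pure core of the config reader. `raw` and `mode` stand for the stored
// entity_dirs and entity_dirs_mode values.
function resolveEntityDirs(raw?: string | null, mode?: string | null): readonly string[] {
  // Assumption: blank segments (e.g. a trailing comma) are ignored, not rejected.
  const custom = (raw ?? "")
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);

  // Validate before looking at the mode, so bad input is caught once.
  const invalid = custom.filter((entry) => !VALID_DIR.test(entry));
  if (invalid.length > 0) {
    console.warn(`entity_dirs: invalid entries ${JSON.stringify(invalid)}; using defaults`);
    return DEFAULT_ENTITY_DIRS;
  }

  // Nothing custom: union adds nothing, and an empty replace falls back.
  if (custom.length === 0) return DEFAULT_ENTITY_DIRS;

  // Assumption: any mode other than "replace" is treated as union.
  const base: readonly string[] = mode === "replace" ? [] : DEFAULT_ENTITY_DIRS;
  return Array.from(new Set([...base, ...custom])); // Set keeps first-seen order
}
```

For example, `resolveEntityDirs("01-notes, 03-subjects")` returns the ten defaults followed by the two custom dirs, while `resolveEntityDirs("01-notes", "replace")` returns only `01-notes`.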

extractEntityRefs now also picks up explicit-path wikilinks:
- [[people/alice]] -> { name: 'alice', slug: 'people/alice' }
- [[people/alice|Alice Chen]] -> same slug, alias consumed but not captured

Scope is intentionally limited to explicit dir-prefixed wikilinks. The bare [[alice]] form is OUT OF SCOPE: resolving it requires engine page lookup (walk the slug table, disambiguate aliases), which breaks the pure-function contract of extractEntityRefs. Documented in code and README.

The wikilink regex honors the same configured dir list as the markdown extractor, so a custom dir (e.g. 01-notes) matches in both [Name](…) and [[…]] forms. The alias segment is length-bounded (100 chars) to cap worst-case regex cost. The slug segment is also bounded, so there is no ReDoS surface. Wikilinks inside fenced code blocks or inline code are excluded via the existing stripCodeBlocks pass.

Full test suite: 1326 pass / 141 skip / 0 fail.
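A hedged sketch of explicit-path wikilink matching as described here. The 100-char alias cap comes from the commit message; the 200-char slug cap, the regex shape, the `extractWikilinkRefs` name, and the simplified `stripCodeBlocks` stand-in are assumptions.

```typescript
interface EntityRef {
  name: string;
  slug: string;
}

// Simplified stand-in for the repo's stripCodeBlocks: removes fenced
// blocks and inline code spans before matching.
function stripCodeBlocks(content: string): string {
  return content.replace(/`{3}[\s\S]*?`{3}/g, "").replace(/`[^`\n]*`/g, "");
}

function buildWikilinkRegex(dirs: readonly string[]): RegExp {
  const alternation = dirs
    .map((d) => d.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join("|");
  // Group 1: dir/slug, group 2: slug tail. The optional alias is capped at
  // 100 chars; the 200-char slug cap is an assumed value.
  return new RegExp(
    `\\[\\[((?:${alternation})/([a-z0-9][a-z0-9-]{0,199}))(?:\\|[^\\]\\n]{1,100})?\\]\\]`,
    "g",
  );
}

function extractWikilinkRefs(content: string, dirs: readonly string[]): EntityRef[] {
  const re = buildWikilinkRegex(dirs);
  const text = stripCodeBlocks(content);
  const refs: EntityRef[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    refs.push({ name: m[2], slug: m[1] }); // the alias is matched but not captured
  }
  return refs;
}
```

Because the pattern requires a configured dir followed by a slash, bare forms such as `[[alice]]` and unknown prefixes such as `[[misc/x]]` simply do not match, which keeps the function free of any page lookup.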

Both production callsites of extractPageLinks now resolve the entity-dir list from config before extracting candidates:
- src/core/operations.ts runAutoLink (put_page post-hook): one read per put_page. Runs inside the auto-link branch (after the remote/disabled guards) so disabled callers skip the config read too.
- src/commands/extract.ts extractLinksFromDB (batch backfill): one read per run, outside the page loop. Config doesn't change mid-run.

Timeline extraction (parseTimelineEntries) has no dir dependency, so no changes there.

Full test suite: 1326 pass / 141 skip / 0 fail.

README gains a 'Configuring entity directories' subsection under Knowledge Graph explaining union vs replace modes, validation rules, and the fail-safe fallback on invalid input. Plus a 'Wikilink scope' subsection documenting the explicit-path-only design decision — bare [[name]] wikilinks are out of scope because resolving them requires engine-side slug lookup, which would break the pure-function contract of the extractor. CHANGELOG gets an Unreleased section covering the configurable dirs and wikilink additions, including the new exported API surface (DEFAULT_ENTITY_DIRS, getEntityDirs, optional dirs param on extractEntityRefs and extractPageLinks).

gopalpatel added 8 commits April 18, 2026 14:32

jamebobob mentioned this pull request Apr 18, 2026

extractAndEnrich writes pages from untrusted text without a gate #160

Open

gopalpatel wants to merge 8 commits into garrytan:master from gopalpatel:feat/configurable-entity-dirs
