feat(link-extraction): Obsidian vault support — configurable entity_dirs + wikilinks #194

Open
gopalpatel wants to merge 8 commits into garrytan:master from gopalpatel:feat/configurable-entity-dirs

Summary

Adds Obsidian-vault compatibility to the auto-link extractor via two coordinated changes:

  1. Configurable entity_dirs so vaults with non-default directory structures (Johnny Decimal 01-notes/, 03-subjects/, etc.) get typed-link extraction without forking.
  2. Explicit-path wikilink support: [[dir/slug]] and [[dir/slug|alias]] are now recognized alongside the existing [Name](dir/slug) markdown syntax.

Existing behavior is preserved for everyone not touching the config — the default dir list is unchanged, and extractEntityRefs / extractPageLinks have optional (additive) parameters.

Motivation

Discovered while migrating a 2,052-page Obsidian vault to gbrain v0.12.0. The auto-link pass extracted 0 links because (a) the vault uses 01-notes/ and 03-subjects/ instead of people/ and companies/, and (b) the content uses [[wikilinks]] rather than [Name](path) markdown links. Both gaps meant the knowledge graph stayed empty.

What changed

src/core/link-extraction.ts

  • DEFAULT_ENTITY_DIRS is exported as a frozen (Object.freeze) readonly list of the current canonical dirs.
  • New internal helpers: escapeRegexChars, buildEntityRefRegex, buildBareSlugRegex, buildWikilinkRegex (defense-in-depth regex construction from any dir list).
  • New getEntityDirs(engine) config reader:
    • Reads entity_dirs (comma-separated string) + entity_dirs_mode ("union" default / "replace" opt-out).
    • Union by default — custom dirs ADD to defaults so hybrid vaults work without surprise. Replace mode is a deliberate opt-out.
    • Validates each entry against /^[a-z0-9][a-z0-9-]*$/. Invalid entries trigger console.warn + fall back to defaults (fail-safe, but with a visible signal so users can diagnose typos rather than silently get zero links).
  • extractEntityRefs(content, dirs?) and extractPageLinks(content, fm, type, dirs?) accept an optional dirs parameter. Omitting it reuses the module-level default regex (zero perf cost for the common path).
  • Explicit-path wikilink extraction: [[dir/slug]] and [[dir/slug|Display Alias]] are recognized. The dir prefix must be in the configured dir list — this keeps the extractor consistent with its markdown-ref sibling and avoids accidental matches on unrelated [[...]] text. Alias display text is captured but currently unused.
  • Out of scope (by design): bare [[name]] wikilinks without a dir prefix. Resolving those requires a page-identity lookup against the engine, which would break the pure-function contract the file documents at the top. Bare-slug resolution is a natural follow-up PR.
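
The validation rule in the getEntityDirs bullet above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `parseEntityDirs` is a hypothetical name, and returning `null` to signal "fall back to defaults" is an assumption about how the fallback might be wired.

```typescript
// Sketch only: `parseEntityDirs` is a hypothetical helper, not part of the PR.
// The pattern is the one quoted in the PR description.
const VALID_DIR = /^[a-z0-9][a-z0-9-]*$/;

// Parses the comma-separated `entity_dirs` value.
// Returns the cleaned entries, or null when any entry is invalid so that
// the caller can fall back to the default dir list.
function parseEntityDirs(raw: string): string[] | null {
  const entries = raw
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const invalid = entries.filter((e) => !VALID_DIR.test(e));
  if (invalid.length > 0) {
    // Visible signal, so a typo does not silently produce zero links.
    console.warn(`entity_dirs: invalid entries: ${invalid.join(", ")}`);
    return null;
  }
  return entries;
}
```

Under this sketch, `"01-notes, 03-subjects"` parses cleanly, while an entry such as `People` or `notes/sub` rejects the whole value because uppercase letters and slashes are outside the allowed character set.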

Call sites wired up

  • src/core/operations.ts: runAutoLink (put_page post-hook) loads dirs via getEntityDirs(engine) once inside the transaction.
  • src/commands/extract.ts: the batch DB extractor wires the same call.

Docs

  • README.md — new "Configuring entity directories" + "Wikilink scope" subsections.
  • CHANGELOG.md — new [Unreleased] entry covering the three user-visible additions.

Test plan

  • bun test test/link-extraction.test.ts — 74 pass / 0 fail (up from 48 baseline). Adds 26 new tests covering DEFAULT_ENTITY_DIRS, buildEntityRefRegex, extractEntityRefs / extractPageLinks with custom dirs (union + replace modes), getEntityDirs parsing / validation / warning behavior, wikilink extraction, wikilink alias handling, wikilink code-block skipping, and wikilink dir filtering.
  • bun test — 1326 pass / 141 skip / 0 fail. Zero regressions.

Implementation notes

  • All 8 commits are atomic, TDD-ordered (failing test first, implementation, green). Clean history for reviewing piece-by-piece.
  • The dir lists flow through both the markdown-ref regex AND the bare-slug regex, so bare-text references like "See 01-notes/alice for context." also pick up custom dirs automatically.
  • Union-by-default was chosen over replace-by-default specifically because real vaults migrate incrementally — hybrid structures (some legacy people/, some new 01-notes/) are normal and shouldn't require a full mental flip to support.

Out of scope / potential follow-ups

  • Bare [[name]] wikilink resolution via page lookup. Natural next PR.
  • Alias-aware link types (use the |Display Alias as a hint for relationship inference).
  • gbrain doctor warning surface for invalid entity_dirs config values.

Extracts the hardcoded entity directory list (people, companies, meetings,
concepts, deal, civic, project, source, media, yc) into an exported frozen
readonly array. This is the first step toward configurable entity dirs;
subsequent commits build the regex dynamically from this list and add a
config reader so users can extend or replace the defaults (e.g. Johnny
Decimal filesystems).

Replaces the hardcoded ENTITY_REF_RE literal with a builder that composes
the alternation from a dir list. The default regex is now built once at
module load from DEFAULT_ENTITY_DIRS, preserving the fast path for the
common case. Adds escapeRegexChars as defense-in-depth — getEntityDirs
already validates dir names against /^[a-z0-9][a-z0-9-]*$/, so no
metachars should ever reach the regex builder, but future callers who
reach for buildEntityRefRegex directly still get safe output. Both
helpers are internal (not exported) to keep the public surface small.

extractEntityRefs(content, dirs?) now takes an optional readonly dir
list. When omitted, the module-level regex built from DEFAULT_ENTITY_DIRS
is reused (fast path, zero compile cost). When provided, a scoped regex
is compiled from the custom list — only those dirs match. Callers who
want custom dirs IN ADDITION to defaults must pass the union themselves;
the upcoming getEntityDirs helper does exactly that.
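
A minimal sketch of the builder plus the optional-parameter fast path described above. The regex shape, the `EntityRef` type, and the abbreviated default list are assumptions for illustration; the actual patterns in link-extraction.ts are not shown in this description.

```typescript
// Sketch under assumptions: pattern shape and types are illustrative only.
interface EntityRef {
  name: string;
  slug: string;
}

// Abbreviated stand-in for DEFAULT_ENTITY_DIRS.
const SKETCH_DEFAULT_DIRS: readonly string[] = ["people", "companies", "meetings"];

// Escapes regex metacharacters so a dir name can never alter the pattern.
function escapeRegexChars(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Matches [Name](dir/slug) where dir is one of the given directories.
function buildEntityRefRegex(dirs: readonly string[]): RegExp {
  const alternation = dirs.map(escapeRegexChars).join("|");
  return new RegExp(
    `\\[([^\\]]+)\\]\\(((?:${alternation})/[a-z0-9][a-z0-9-]*)\\)`,
    "g",
  );
}

// Compiled once at module load: the fast path when `dirs` is omitted.
const DEFAULT_REF_RE = buildEntityRefRegex(SKETCH_DEFAULT_DIRS);

function extractEntityRefs(content: string, dirs?: readonly string[]): EntityRef[] {
  // A custom list compiles a scoped regex; only those dirs match.
  const re = dirs ? buildEntityRefRegex(dirs) : DEFAULT_REF_RE;
  return Array.from(content.matchAll(re), (m) => ({ name: m[1], slug: m[2] }));
}
```

Note that passing `["01-notes"]` alone would stop `people/` refs from matching in this sketch, which is why a caller wanting both has to pass the union.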

Unlocks non-default filesystem layouts (e.g. Johnny Decimal 01-notes/).

extractPageLinks(content, frontmatter, pageType, dirs?) now threads the
optional dir list through both the markdown-ref extractor and the
bare-slug regex. Both paths use the same dir list so behavior stays
consistent. When dirs is omitted, module-level defaults are reused
(no per-call regex compile). When provided, scoped regexes are built
from the custom list.
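
As a rough illustration of the bare-slug path, the sketch below builds a regex for unlinked `dir/slug` text from the same dir list. The pattern is an assumption (the real bare-slug regex is not quoted here), and `extractBareSlugs` is a hypothetical name. It skips slugs that directly follow `(` or `[` so that markdown and wikilink targets are left to their own extractors.

```typescript
// Sketch only: the pattern shape is an assumption, not the PR's regex.
function escapeDirName(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Matches a dir-prefixed slug in running text. Group 1 is the character
// before the slug (or empty at start of input); group 2 is the slug.
function buildBareSlugRegex(dirs: readonly string[]): RegExp {
  const alternation = dirs.map(escapeDirName).join("|");
  return new RegExp(
    `(^|[^a-z0-9/\\[(-])((?:${alternation})/[a-z0-9][a-z0-9-]*)`,
    "g",
  );
}

// Hypothetical helper: returns every bare slug found in the content.
function extractBareSlugs(content: string, dirs: readonly string[]): string[] {
  return Array.from(content.matchAll(buildBareSlugRegex(dirs)), (m) => m[2]);
}
```

With dirs `["people", "01-notes"]`, the text `See 01-notes/alice for context, and [Bob](people/bob).` yields only `01-notes/alice` in this sketch, because `people/bob` sits inside a markdown link target.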

Adds buildBareSlugRegex as the bare-slug counterpart to
buildEntityRefRegex — both internal, both built from the same escaped
alternation.

…lace modes

Reads the effective entity-dir list from engine config:

  - entity_dirs: comma-separated custom dir names (optional, defaults empty)
  - entity_dirs_mode: 'union' (default) or 'replace'

Union mode ADDS custom dirs to DEFAULT_ENTITY_DIRS (defaults first, custom
appended, deduped). Replace mode uses ONLY the custom list. Empty replace
falls back to defaults to prevent accidentally disabling extraction.
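
The mode semantics above could be sketched like this. `resolveEntityDirs` is a hypothetical name, and the sketch assumes the custom list has already passed validation; the default list is the one enumerated in the first commit message.

```typescript
// Sketch only: `resolveEntityDirs` is a hypothetical helper.
type EntityDirsMode = "union" | "replace";

const DEFAULT_ENTITY_DIRS: readonly string[] = Object.freeze([
  "people", "companies", "meetings", "concepts", "deal",
  "civic", "project", "source", "media", "yc",
]);

function resolveEntityDirs(
  custom: readonly string[],
  mode: EntityDirsMode,
): readonly string[] {
  if (mode === "replace") {
    // Empty replace falls back to defaults rather than disabling extraction.
    return custom.length > 0 ? Array.from(new Set(custom)) : DEFAULT_ENTITY_DIRS;
  }
  // Union: defaults first, custom appended. A Set keeps first-seen order,
  // so duplicates of a default are dropped without reordering.
  return Array.from(new Set([...DEFAULT_ENTITY_DIRS, ...custom]));
}
```

For a hybrid vault, `resolveEntityDirs(["01-notes", "people"], "union")` returns the ten defaults followed by `01-notes`; the repeated `people` is deduplicated.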

Each custom entry is validated against /^[a-z0-9][a-z0-9-]*$/. On any
invalid entry, the function logs a warning and returns defaults — a
fail-safe that prevents malformed config from silently breaking the
graph layer. Validation runs BEFORE mode resolution so bad input is
caught once, regardless of mode.

extractEntityRefs now also picks up explicit-path wikilinks:

  [[people/alice]]              -> { name: 'alice', slug: 'people/alice' }
  [[people/alice|Alice Chen]]   -> same slug, alias consumed but not captured

Scope intentionally limited to explicit dir-prefixed wikilinks. Bare
[[alice]] form is OUT OF SCOPE — resolving it requires engine page
lookup (walk the slug table, disambiguate aliases), which breaks the
pure-function contract of extractEntityRefs. Documented in code and
README.

The wikilink regex honors the same configured dir list as the markdown
extractor, so a custom dir (e.g. 01-notes) matches in both [Name](…)
and [[…]] forms. Alias segment is length-bounded (100 chars) to cap
worst-case regex cost. Slug segment is bounded — no ReDoS surface.
Wikilinks inside fenced or inline code blocks are excluded via the
existing stripCodeBlocks pass.
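
A sketch of how such a wikilink pass might look. Everything here is illustrative: `stripCode` is a simplified stand-in for the existing stripCodeBlocks helper, `extractWikilinks` is a hypothetical name, and the slug length bound is omitted because this description does not state its value. The alias group is non-capturing, following the "consumed but not captured" wording above.

```typescript
// Sketch only: simplified stand-ins, not the PR's implementation.
interface WikilinkRef {
  name: string;
  slug: string;
}

// Simplified stand-in for stripCodeBlocks: drops fenced and inline code.
function stripCode(content: string): string {
  return content
    .replace(/`{3}[\s\S]*?`{3}/g, "") // fenced blocks
    .replace(/`[^`\n]*`/g, ""); // inline code spans
}

function escapeDir(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Matches [[dir/slug]] and [[dir/slug|alias]]; the alias is capped at
// 100 characters and consumed without being captured.
function buildWikilinkRegex(dirs: readonly string[]): RegExp {
  const alternation = dirs.map(escapeDir).join("|");
  return new RegExp(
    `\\[\\[((?:${alternation})/[a-z0-9][a-z0-9-]*)(?:\\|[^\\]\\n]{1,100})?\\]\\]`,
    "g",
  );
}

function extractWikilinks(content: string, dirs: readonly string[]): WikilinkRef[] {
  const re = buildWikilinkRegex(dirs);
  return Array.from(stripCode(content).matchAll(re), (m) => {
    const slug = m[1];
    // The name is the slug segment after the dir prefix, as in the examples above.
    return { name: slug.slice(slug.indexOf("/") + 1), slug };
  });
}
```

In this sketch a bare `[[alice]]` and a `[[random/thing]]` outside the configured dirs both produce no match, which mirrors the scope limits described above.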

Full test suite: 1326 pass / 141 skip / 0 fail.

Both production callsites of extractPageLinks now resolve the entity-dir
list from config before extracting candidates:

  - src/core/operations.ts runAutoLink (put_page post-hook) — one read
    per put_page. Runs inside the auto-link branch (after the remote/
    disabled guards) so disabled callers skip the config read too.
  - src/commands/extract.ts extractLinksFromDB (batch backfill) — one
    read per run, outside the page loop. Config doesn't change mid-run.
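
The "one read per run" pattern for the batch path might look roughly like the sketch below. The `ConfigReader` interface, `loadDirs`, `countRefs`, and `backfill` are all hypothetical; gbrain's real engine API is not shown in this description, and the async `getConfig` signature is an assumption.

```typescript
// Sketch only: hypothetical engine surface and helper names.
interface ConfigReader {
  getConfig(key: string): Promise<string | null>;
}

// Reads the dir list once. Entries are assumed to be validated already.
async function loadDirs(engine: ConfigReader): Promise<string[]> {
  const raw = (await engine.getConfig("entity_dirs")) ?? "";
  return raw.split(",").map((s) => s.trim()).filter((s) => s.length > 0);
}

// Stand-in extractor: counts [Name](dir/slug) targets for the given dirs.
function countRefs(content: string, dirs: readonly string[]): number {
  if (dirs.length === 0) return 0;
  const re = new RegExp(`\\]\\((?:${dirs.join("|")})/[a-z0-9-]+\\)`, "g");
  return (content.match(re) ?? []).length;
}

async function backfill(engine: ConfigReader, pages: readonly string[]): Promise<number> {
  // Hoisted out of the loop: config does not change mid-run.
  const dirs = await loadDirs(engine);
  let total = 0;
  for (const page of pages) {
    total += countRefs(page, dirs);
  }
  return total;
}
```

The point of the shape is only that the config lookup happens before the page loop, so its cost does not scale with the number of pages.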

Timeline extraction (parseTimelineEntries) has no dir dependency, so
it is unchanged.

Full test suite: 1326 pass / 141 skip / 0 fail.

README gains a 'Configuring entity directories' subsection under Knowledge
Graph explaining union vs replace modes, validation rules, and the
fail-safe fallback on invalid input. Plus a 'Wikilink scope' subsection
documenting the explicit-path-only design decision — bare [[name]]
wikilinks are out of scope because resolving them requires engine-side
slug lookup, which would break the pure-function contract of the
extractor.

CHANGELOG gets an Unreleased section covering the configurable dirs and
wikilink additions, including the new exported API surface
(DEFAULT_ENTITY_DIRS, getEntityDirs, optional dirs param on
extractEntityRefs and extractPageLinks).