feat(link-extraction): Obsidian vault support — configurable entity_dirs + wikilinks#194
Open
gopalpatel wants to merge 8 commits into garrytan:master
Extracts the hardcoded entity directory list (people, companies, meetings, concepts, deal, civic, project, source, media, yc) into an exported frozen readonly array. This is the first step toward configurable entity dirs; subsequent commits build the regex dynamically from this list and add a config reader so users can extend or replace the defaults (e.g. Johnny Decimal filesystems).
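A minimal sketch of what the extracted constant might look like, assuming a plain string array frozen with Object.freeze. The dir names come from the commit message above; the exact declaration is an assumption, and the `export` keyword is left out only so the snippet runs standalone. `tryMutate` is a hypothetical helper added purely to show the runtime effect of freezing.

```typescript
// Sketch only: dir names are taken from the commit message; the real declaration may differ.
// In the module this constant is exported; `export` is omitted so the snippet runs as a script.
const DEFAULT_ENTITY_DIRS: readonly string[] = Object.freeze([
  "people", "companies", "meetings", "concepts", "deal",
  "civic", "project", "source", "media", "yc",
]);

// `readonly string[]` blocks mutation at compile time; Object.freeze blocks it at runtime.
// tryMutate is a hypothetical helper that only demonstrates the runtime behavior.
function tryMutate(): boolean {
  try {
    (DEFAULT_ENTITY_DIRS as string[]).push("01-notes");
    return true;
  } catch {
    return false; // a frozen array rejects push with a TypeError
  }
}
```

Freezing matters here because the list is shared module state: a caller that wants extra dirs has to build a new array rather than push onto the default one.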
Replaces the hardcoded ENTITY_REF_RE literal with a builder that composes the alternation from a dir list. The default regex is now built once at module load from DEFAULT_ENTITY_DIRS, preserving the fast path for the common case. Adds escapeRegexChars as defense-in-depth — getEntityDirs already validates dir names against /^[a-z0-9][a-z0-9-]*$/, so no metachars should ever reach the regex builder, but future callers who reach for buildEntityRefRegex directly still get safe output. Both helpers are internal (not exported) to keep the public surface small.
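The builder described above could be sketched roughly as follows. The helper names come from the commit message, but the pattern itself is an assumption: it supposes ENTITY_REF_RE matches `[Name](dir/slug)` with the name in group 1 and the slug in group 2, which the PR text does not spell out.

```typescript
// Escape regex metacharacters so a dir name is always matched literally.
function escapeRegexChars(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Assumed shape of the entity-ref pattern: [Name](dir/slug).
// Group 1 = display name, group 2 = "dir/slug". The real pattern may differ.
function buildEntityRefRegex(dirs: readonly string[]): RegExp {
  const alternation = dirs.map(escapeRegexChars).join("|");
  return new RegExp(`\\[([^\\]]+)\\]\\(((?:${alternation})/[^)\\s]+)\\)`, "g");
}

// Collect the slugs matched in a piece of text (illustrative helper, not from the PR).
function matchSlugs(text: string, dirs: readonly string[]): string[] {
  const re = buildEntityRefRegex(dirs);
  const slugs: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    slugs.push(m[2]);
  }
  return slugs;
}
```

Escaping before joining is what makes the builder safe for direct callers: a name such as `a.b` becomes `a\.b` in the alternation instead of matching any character.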
extractEntityRefs(content, dirs?) now takes an optional readonly dir list. When omitted, the module-level regex built from DEFAULT_ENTITY_DIRS is reused (fast path, zero compile cost). When provided, a scoped regex is compiled from the custom list — only those dirs match. Callers who want custom dirs IN ADDITION to defaults must pass the union themselves; the upcoming getEntityDirs helper does exactly that. Unlocks non-default filesystem layouts (e.g. Johnny Decimal 01-notes/).
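To illustrate the fast path versus the scoped path, here is a hedged sketch. The `EntityRef` shape, the abbreviated default list, and the regex are assumptions made for the example; only the function name and the optional `dirs` parameter come from the commit message.

```typescript
interface EntityRef { name: string; slug: string } // assumed shape

// Abbreviated default list, for the sketch only.
const DEFAULT_ENTITY_DIRS: readonly string[] = ["people", "companies", "meetings"];

// Assumed pattern for [Name](dir/slug); dirs are taken to be already validated.
function buildEntityRefRegex(dirs: readonly string[]): RegExp {
  return new RegExp(`\\[([^\\]]+)\\]\\(((?:${dirs.join("|")})/[^)\\s]+)\\)`, "g");
}

// Compiled once at module load and reused whenever `dirs` is omitted.
const DEFAULT_ENTITY_REF_RE = buildEntityRefRegex(DEFAULT_ENTITY_DIRS);

function extractEntityRefs(content: string, dirs?: readonly string[]): EntityRef[] {
  const re = dirs ? buildEntityRefRegex(dirs) : DEFAULT_ENTITY_REF_RE;
  re.lastIndex = 0; // a shared global regex carries state between calls
  const refs: EntityRef[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(content)) !== null) {
    refs.push({ name: m[1], slug: m[2] });
  }
  return refs;
}

const text = "Met [Alice](people/alice); see [Inbox](01-notes/inbox).";
const defaultsOnly = extractEntityRefs(text);                                   // fast path
const customOnly = extractEntityRefs(text, ["01-notes"]);                       // scoped regex
const both = extractEntityRefs(text, [...DEFAULT_ENTITY_DIRS, "01-notes"]);     // caller-built union
```

Passing a custom list replaces the defaults rather than extending them, so the third call builds the union explicitly.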
extractPageLinks(content, frontmatter, pageType, dirs?) now threads the optional dir list through both the markdown-ref extractor and the bare-slug regex. Both paths use the same dir list so behavior stays consistent. When dirs is omitted, module-level defaults are reused (no per-call regex compile). When provided, scoped regexes are built from the custom list. Adds buildBareSlugRegex as the bare-slug counterpart to buildEntityRefRegex — both internal, both built from the same escaped alternation.
…lace modes
Reads the effective entity-dir list from engine config:
- entity_dirs: comma-separated custom dir names (optional, defaults empty)
- entity_dirs_mode: 'union' (default) or 'replace'
Union mode ADDS custom dirs to DEFAULT_ENTITY_DIRS (defaults first, custom appended, deduped). Replace mode uses ONLY the custom list. Empty replace falls back to defaults to prevent accidentally disabling extraction.
Each custom entry is validated against /^[a-z0-9][a-z0-9-]*$/. On any invalid entry, the function logs a warning and returns defaults, a fail-safe that prevents malformed config from silently breaking the graph layer. Validation runs BEFORE mode resolution so bad input is caught once, regardless of mode.
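The resolution rules above can be sketched as a pure function. `resolveEntityDirs` is a hypothetical name: the PR's getEntityDirs reads these values from engine config, and that lookup is left out because its API is not shown here. Trimming whitespace, dropping empty entries, and treating any mode other than 'replace' as union are assumptions of this sketch.

```typescript
const DEFAULT_ENTITY_DIRS: readonly string[] = [
  "people", "companies", "meetings", "concepts", "deal",
  "civic", "project", "source", "media", "yc",
];

const DIR_NAME_RE = /^[a-z0-9][a-z0-9-]*$/;

// Hypothetical pure core of the config reader: takes the raw config strings
// and returns the effective dir list.
function resolveEntityDirs(rawDirs?: string, rawMode?: string): readonly string[] {
  // Assumption: entries are trimmed and empty entries are ignored.
  const custom = (rawDirs ?? "")
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  // Validate BEFORE looking at the mode, so bad input is caught once.
  const invalid = custom.filter((d) => !DIR_NAME_RE.test(d));
  if (invalid.length > 0) {
    console.warn(`entity_dirs: invalid entries (${invalid.join(", ")}); using defaults`);
    return DEFAULT_ENTITY_DIRS;
  }

  if (rawMode === "replace") {
    // Empty replace falls back to defaults so extraction is never disabled by accident.
    return custom.length > 0 ? custom : DEFAULT_ENTITY_DIRS;
  }

  // Union: defaults first, custom appended, duplicates dropped.
  return Array.from(new Set([...DEFAULT_ENTITY_DIRS, ...custom]));
}
```

With these rules, `resolveEntityDirs("01-notes, people")` yields the ten defaults followed by `01-notes`, and `resolveEntityDirs("01-notes", "replace")` yields only `01-notes`.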
extractEntityRefs now also picks up explicit-path wikilinks:
[[people/alice]] -> { name: 'alice', slug: 'people/alice' }
[[people/alice|Alice Chen]] -> same slug, alias consumed but not captured
Scope intentionally limited to explicit dir-prefixed wikilinks. Bare
[[alice]] form is OUT OF SCOPE — resolving it requires engine page
lookup (walk the slug table, disambiguate aliases), which breaks the
pure-function contract of extractEntityRefs. Documented in code and
README.
The wikilink regex honors the same configured dir list as the markdown
extractor, so a custom dir (e.g. 01-notes) matches in both [Name](…)
and [[…]] forms. Alias segment is length-bounded (100 chars) to cap
worst-case regex cost. Slug segment is bounded — no ReDoS surface.
Wikilinks inside fenced or inline code blocks are excluded via the
existing stripCodeBlocks pass.
Full test suite: 1326 pass / 141 skip / 0 fail.
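A hedged sketch of the explicit-path wikilink matching described in this commit. The pattern, the `extractWikilinkRefs` helper, and the 200-character slug bound are assumptions; only the 100-character alias bound and the `{ name, slug }` output come from the commit message. The stripCodeBlocks pass is omitted, so this version would also match wikilinks inside code.

```typescript
interface EntityRef { name: string; slug: string } // shape taken from the examples above

// Assumed pattern: [[dir/slug]] or [[dir/slug|alias]].
// The alias sits in a non-capturing group, so it is consumed but not captured.
// The slug bound of 200 is an assumption; the alias bound of 100 is from the commit message.
function buildWikilinkRegex(dirs: readonly string[]): RegExp {
  const alternation = dirs.join("|"); // dirs are taken to be already validated
  return new RegExp(
    `\\[\\[((?:${alternation})/[^\\]|\\s]{1,200})(?:\\|[^\\]]{1,100})?\\]\\]`,
    "g",
  );
}

// Hypothetical helper; in the PR this logic lives inside extractEntityRefs.
function extractWikilinkRefs(content: string, dirs: readonly string[]): EntityRef[] {
  const re = buildWikilinkRegex(dirs);
  const refs: EntityRef[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(content)) !== null) {
    const slug = m[1];
    // The name is the last path segment: "people/alice" -> "alice".
    refs.push({ name: slug.slice(slug.lastIndexOf("/") + 1), slug });
  }
  return refs;
}

const sample = "[[people/alice]] [[people/alice|Alice Chen]] [[alice]] [[other/x]]";
const wikilinkRefs = extractWikilinkRefs(sample, ["people"]);
```

In this sketch the bare `[[alice]]` and the unconfigured `[[other/x]]` are skipped, because the dir prefix is part of the pattern itself.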
Both production callsites of extractPageLinks now resolve the entity-dir
list from config before extracting candidates:
- src/core/operations.ts runAutoLink (put_page post-hook) — one read
per put_page. Runs inside the auto-link branch (after the remote/
disabled guards) so disabled callers skip the config read too.
- src/commands/extract.ts extractLinksFromDB (batch backfill) — one
read per run, outside the page loop. Config doesn't change mid-run.
timeline extraction (parseTimelineEntries) has no dir dependency, so
no changes there.
Full test suite: 1326 pass / 141 skip / 0 fail.
README gains a 'Configuring entity directories' subsection under Knowledge Graph explaining union vs replace modes, validation rules, and the fail-safe fallback on invalid input. Plus a 'Wikilink scope' subsection documenting the explicit-path-only design decision — bare [[name]] wikilinks are out of scope because resolving them requires engine-side slug lookup, which would break the pure-function contract of the extractor. CHANGELOG gets an Unreleased section covering the configurable dirs and wikilink additions, including the new exported API surface (DEFAULT_ENTITY_DIRS, getEntityDirs, optional dirs param on extractEntityRefs and extractPageLinks).
Summary
Adds Obsidian-vault compatibility to the auto-link extractor via two coordinated changes:
- entity_dirs so vaults with non-default directory structures (Johnny Decimal 01-notes/, 03-subjects/, etc.) get typed-link extraction without forking.
- [[dir/slug]] and [[dir/slug|alias]] are now recognized alongside the existing [Name](dir/slug) markdown syntax.
Existing behavior is preserved for everyone not touching the config: the default dir list is unchanged, and extractEntityRefs / extractPageLinks have optional (additive) parameters.
Motivation
Discovered while migrating a 2,052-page Obsidian vault to gbrain v0.12.0. The auto-link pass extracted 0 links because (a) the vault uses 01-notes/ and 03-subjects/ instead of people/ and companies/, and (b) the content uses [[wikilinks]] rather than [Name](path) markdown links. Both gaps meant the knowledge graph stayed empty.
What changed
src/core/link-extraction.ts
- DEFAULT_ENTITY_DIRS exported as a readonly list frozen with Object.freeze (current canonical dirs).
- escapeRegexChars, buildEntityRefRegex, buildBareSlugRegex, buildWikilinkRegex (defense-in-depth regex construction from any dir list).
- getEntityDirs(engine) config reader: entity_dirs (comma-separated string) + entity_dirs_mode ("union" default / "replace" opt-out).
- Each custom entry is validated against /^[a-z0-9][a-z0-9-]*$/. Invalid entries trigger console.warn + fall back to defaults (fail-safe, but with a visible signal so users can diagnose typos rather than silently get zero links).
- extractEntityRefs(content, dirs?) and extractPageLinks(content, fm, type, dirs?) accept an optional dirs parameter. Omitting it reuses the module-level default regex (zero perf cost for the common path).
- [[dir/slug]] and [[dir/slug|Display Alias]] are recognized. The dir prefix must be in the configured dir list; this keeps the extractor consistent with its markdown-ref sibling and avoids accidental matches on unrelated [[...]] text. Alias display text is captured but currently unused.
- Out of scope: bare [[name]] wikilinks without a dir prefix. Resolving those requires a page-identity lookup against the engine, which would break the pure-function contract the file documents at the top. Bare-slug resolution is a natural follow-up PR.
Call sites wired up
- src/core/operations.ts: runAutoLink (put_page post-hook) loads dirs via getEntityDirs(engine) once inside the transaction.
- src/commands/extract.ts: batch DB extractor wires the same call.
Docs
- README.md: new "Configuring entity directories" + "Wikilink scope" subsections.
- CHANGELOG.md: new [Unreleased] entry covering the three user-visible additions.
Test plan
- bun test test/link-extraction.test.ts: 74 pass / 0 fail (up from 48 baseline). Adds 26 new tests covering DEFAULT_ENTITY_DIRS, buildEntityRefRegex, extractEntityRefs / extractPageLinks with custom dirs (union + replace modes), getEntityDirs parsing / validation / warning behavior, wikilink extraction, wikilink alias handling, wikilink code-block skipping, and wikilink dir filtering.
- bun test: 1326 pass / 141 skip / 0 fail. Zero regressions.
Implementation notes
- See 01-notes/alice for context. also pick up custom dirs automatically.
- people/, some new 01-notes/) are normal and shouldn't require a full mental flip to support.
Out of scope / potential follow-ups
- [[name]] wikilink resolution via page lookup. Natural next PR.
- |Display Alias as a hint for relationship inference).
- gbrain doctor warning surface for invalid entity_dirs config values.