Markdown & TOML Parsing Support by josephismikhail · Pull Request #147 · ix-infrastructure/Ix

josephismikhail · 2026-04-05T22:55:06Z

Summary

Adds TOML and Markdown file parsing to the ingestion pipeline.

.toml files are parsed into the graph as config_entry entities — table headers ([package], [profile.release]) and key=value pairs — with CONTAINS relationships linking them.

.md/.markdown files are parsed into the graph as section entities derived from headings (ATX #, setext underline, and HTML <h1>–<h6>), with hierarchical CONTAINS relationships, YAML frontmatter extraction, and fenced code block boundaries respected. Also fixes ix contains disambiguation when the same name appears across multiple files.

Closes #145

Type

Changes

TOML

feat(core-ingestion): Add TOML parser — extracts [table] headers and key = value pairs as config_entry entities with CONTAINS relationships, mirroring the YAML parser pattern
fix(toml): Emit intermediate nodes for dotted table headers — [profile.release] now materialises a profile node in the graph, not just release; a per-file seen-set deduplicates shared prefixes
fix: Add TOML to the isGrammarSupported early-return guard — .toml files were being silently dropped before reaching parseFile
feat(cli): Add --path filter to ix contains — allows disambiguating by file path when multiple entities share the same name across workspaces or files

Markdown

feat(core-ingestion): Add Markdown parser — extracts headings as section entities with hierarchical CONTAINS relationships; supports ATX headings, setext headings, and HTML headings (<h1>–<h6>); parses YAML frontmatter as a body chunk; respects fenced code block boundaries so heading-like lines inside code aren't parsed as headings
fix(ingest): Add .md and .markdown to SUPPORTED_EXTENSIONS
fix(core-ingestion): Store section chunks with kind 'section'; bump extractor to 1.21
fix(cli): Wire up ix config set workspace so it actually takes effect; scope CLI search to active workspace
fix(cli): Resolve absolute paths, short IDs, and section kind display
fix(md-parse): Hierarchical section scoping and root-file disambiguation
fix(md-parse): Clean VitePress heading syntax from entity names (strip {#anchor}, badges, <sup>, escaped angle brackets)
fix(md-parse): Setext heading detection, fence delimiter matching, frontmatter body chunk

Validation

TOML

Tested against ripgrep (real-world Rust workspace with multiple Cargo.toml files)
ix text --language toml returns results from Cargo.toml
name, version, edition resolve as config_entry with language: toml
[package], [dependencies] resolve as table entities
ix contains package returns correct child keys (name, version, edition, authors, etc.)
[profile.release] → profile intermediate node present; opt-level resolves as child
ix contains package --path crates/regex resolves without ambiguity prompt
263-line test suite in queries.toml.test.ts covers the parser

Markdown

Checklist

Tests pass
Smoke tests pass
No raw errors introduced
CLI output follows Ix format

Parses TOML files line-by-line, extracting [table] headers and key=value pairs as config_entry entities with CONTAINS relationships. Mirrors the YAML parser pattern — no external dependencies needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
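A minimal sketch of the line-by-line approach described above, not the actual parseTomlFile implementation. The `ConfigEntry` and `ContainsEdge` shapes and the function name `parseTomlSketch` are hypothetical. The sketch ignores `[[array]]` tables, trailing comments on header lines, multi-line values, and quoted keys that contain `=`.

```typescript
// Hypothetical shapes; the real entity and edge types in core-ingestion may differ.
interface ConfigEntry { name: string; kind: "config_entry"; line: number; }
interface ContainsEdge { parent: string; child: string; }

function parseTomlSketch(source: string): { entities: ConfigEntry[]; edges: ContainsEdge[] } {
  const entities: ConfigEntry[] = [];
  const edges: ContainsEdge[] = [];
  let currentTable: string | null = null;
  let lineNo = 0;

  for (const raw of source.split(/\r?\n/)) {
    lineNo++;
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue; // blank line or comment

    // Table header such as [package] or [profile.release]
    if (line.startsWith("[") && !line.startsWith("[[") && line.endsWith("]")) {
      currentTable = line.slice(1, -1).trim();
      entities.push({ name: currentTable, kind: "config_entry", line: lineNo });
      continue;
    }

    // key = value pair: everything before the first '=' is treated as the key
    const eq = line.indexOf("=");
    if (eq > 0) {
      const key = line.slice(0, eq).trim();
      entities.push({ name: key, kind: "config_entry", line: lineNo });
      if (currentTable !== null) edges.push({ parent: currentTable, child: key });
    }
  }
  return { entities, edges };
}
```

Keys that appear before any table header get an entity but no CONTAINS edge in this sketch; how the real parser attaches top-level keys is not stated here.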

TOML has a custom parser (parseTomlFile) like YAML/JSON/SQL/Dockerfile, but was missing from the isGrammarSupported check. This caused all .toml files to be silently dropped before reaching parseFile. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[profile.release] now materialises a `profile` node in the graph, not just `release`. A per-file seen-set deduplicates shared prefixes so [profile.release] + [profile.dev] produce exactly one `profile` entity. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
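The prefix deduplication can be sketched as follows. `TableNode` and `expandTableHeader` are hypothetical names used for illustration, and the sketch assumes header segments are split on plain dots (quoted segments containing dots are not handled).

```typescript
// Hypothetical node shape: `path` is the full dotted path, `name` the last segment.
interface TableNode { path: string; name: string; parent: string | null; }

// Returns only the nodes not yet emitted for this file; `seen` is the per-file set.
function expandTableHeader(header: string, seen: Set<string>): TableNode[] {
  const parts = header.split(".").map((p) => p.trim());
  const nodes: TableNode[] = [];
  let parent: string | null = null;

  for (let i = 0; i < parts.length; i++) {
    const path = parts.slice(0, i + 1).join(".");
    if (!seen.has(path)) {
      seen.add(path);
      nodes.push({ path, name: parts[i], parent });
    }
    parent = path; // the next segment is contained by this one
  }
  return nodes;
}
```

With one shared set, `profile.release` yields `profile` and `release`, and a later `profile.dev` yields only `dev`, whose parent is the existing `profile` node.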

Allows disambiguating by file path when multiple entities share the same name across workspaces or files. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Parses ATX headings as heading entities with hierarchical CONTAINS relationships, YAML frontmatter as a frontmatter entity/chunk, and section chunks spanning each heading's content. Skips headings inside fenced code blocks. Falls back to file_body for files with no headings. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
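A rough sketch of ATX heading extraction with a heading stack and a simple fence guard. `HeadingNode` and `parseAtxHeadings` are hypothetical names; the heading regex uses the `[ \t]+(\S[^\r\n]*)` form mentioned in a later commit of this PR, and the boolean fence flag matches this early version rather than the later fenceState fix.

```typescript
// Hypothetical node shape; `parent` is the name of the nearest shallower heading.
interface HeadingNode { name: string; level: number; line: number; parent: string | null; }

function parseAtxHeadings(source: string): HeadingNode[] {
  const result: HeadingNode[] = [];
  const stack: HeadingNode[] = []; // currently open ancestors, outermost first
  let inFence = false;
  let lineNo = 0;

  for (const raw of source.split(/\r?\n/)) {
    lineNo++;
    // Toggle on any fence delimiter line (backticks or tildes)
    if (/^ {0,3}(`{3,}|~{3,})/.test(raw)) { inFence = !inFence; continue; }
    if (inFence) continue; // heading-like lines inside code are ignored

    const m = /^(#{1,6})[ \t]+(\S[^\r\n]*)$/.exec(raw);
    if (!m) continue;

    const level = m[1].length;
    // Pop siblings and deeper headings so the top of the stack is the parent
    while (stack.length > 0 && stack[stack.length - 1].level >= level) stack.pop();
    const parent = stack.length > 0 ? stack[stack.length - 1].name : null;

    const node: HeadingNode = { name: m[2].trim(), level, line: lineNo, parent };
    result.push(node);
    stack.push(node);
  }
  return result;
}
```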

File discovery in ingest.ts maintains a local extension set that mirrors core-ingestion's EXT_MAP but was missing .md and .markdown. This caused markdown files to be silently excluded before reaching the parser, so ix map never produced heading, frontmatter, or file nodes for .md files — even though parseMarkdownFile was fully implemented. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Two bugs from the markdown-parsing test run:
1. LANGUAGE_QUERIES was missing a Markdown entry, causing a TS2741 compile error that prevented core-ingestion from building. Added [SupportedLanguages.Markdown]: '' (markdown uses its own hand-written parser, not tree-sitter queries).
2. ix config set workspace <name> wrote a top-level workspace: key to config.yaml but resolveWorkspaceRoot never read it, so all commands continued routing to the workspace with default: true. Added workspace?: string to IxConfig and a lookup step in resolveWorkspaceRoot that checks cfg.workspace by name before falling back to the default workspace.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
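The workspace lookup order from the second fix might look roughly like this. Only `IxConfig.workspace` and the name `resolveWorkspaceRoot` come from the commit message; the `WorkspaceEntry` shape, the `workspaces` array, and the `root` field are assumptions made for the sketch.

```typescript
// Assumed config shapes; only `workspace?: string` is confirmed by the commit message.
interface WorkspaceEntry { name: string; root: string; default?: boolean; }
interface IxConfig { workspace?: string; workspaces: WorkspaceEntry[]; }

function resolveWorkspaceRoot(cfg: IxConfig): string | undefined {
  // 1. Prefer the workspace selected via `ix config set workspace <name>`
  if (cfg.workspace !== undefined) {
    const active = cfg.workspaces.find((w) => w.name === cfg.workspace);
    if (active) return active.root;
  }
  // 2. Fall back to the workspace flagged `default: true`
  return cfg.workspaces.find((w) => w.default === true)?.root;
}
```

In this sketch an unknown workspace name silently falls through to the default; whether the real CLI warns in that case is not stated.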

…xtractor to 1.21
Section chunks were stored as kind 'chunk', making ix search --kind section return no results. Giving them a first-class 'section' kind (consistent with heading/frontmatter) makes them discoverable via --kind filtering.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- ix contains: absolute paths now match against graph's relative URIs by checking targetLower.endsWith(uri); URI-length tiebreaker picks the most specific match when multiple quality-0 candidates exist
- ix contains/explain: 8–31 char hex inputs (short IDs from CLI output) now attempt resolvePrefix before falling back to symbol resolution
- ix explain: context section always shows Kind so section (and all other) entity types are visible in the detail view
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
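A sketch of the two resolution checks in the first two items above. The function names are hypothetical, and the sketch assumes the "most specific" match for an absolute path is the longest URI that is a suffix of it. A plain `endsWith` can also match in the middle of a path segment; the sketch does not guard against that.

```typescript
// Picks the graph URI that best matches an absolute path (assumed: longest suffix wins).
function pickUriForAbsolutePath(target: string, uris: string[]): string | undefined {
  const targetLower = target.toLowerCase();
  let best: string | undefined;
  for (const uri of uris) {
    if (!targetLower.endsWith(uri.toLowerCase())) continue;
    if (best === undefined || uri.length > best.length) best = uri;
  }
  return best;
}

// True for 8 to 31 hex characters, the short-ID range named in the commit message.
function looksLikeShortId(input: string): boolean {
  return /^[0-9a-f]{8,31}$/i.test(input);
}
```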

- Section chunks now span to the next heading at the same or shallower level, so parent sections include their full nested subtree content
- Bare filename tie-breaking now prefers shorter URIs (root-level files) over deeply-nested ones, fixing `ix contains README.md` resolving to a fixture file instead of the root README
- Add test asserting parent section lineEnd covers nested child sections
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
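The section-span rule in the first item can be illustrated with a small helper. `Heading` and `sectionEnds` are hypothetical names, and line numbers are assumed to be 1-based.

```typescript
// Hypothetical heading record: `line` is the 1-based line of the heading itself.
interface Heading { level: number; line: number; }

// For each heading, returns the last line of its section: the line before the next
// heading at the same or a shallower level, or the end of the file.
function sectionEnds(headings: Heading[], totalLines: number): number[] {
  return headings.map((h, i) => {
    for (let j = i + 1; j < headings.length; j++) {
      if (headings[j].level <= h.level) return headings[j].line - 1;
    }
    return totalLines;
  });
}
```

For headings at lines 1 (level 1), 3 (level 2), 6 (level 2), and 9 (level 1) in a 12-line file, the first section ends at line 8, so it covers both nested level-2 sections.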

Replaces single-line HTML strip with a full pipeline that handles anchor IDs ({#...}), backtick-wrapped component names, backslash-escaped angle brackets, stability markers (\*\*), inline HTML badges, and double-space normalization. Adds 6 regression tests for vuejs/docs cases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… body chunk
- Bug 1: emit file_body when no section headings exist even if a frontmatter chunk is already present; lineStart/startByte now point past the frontmatter
- Bug 2: detect setext-style headings (=== and --- underlines) inside the heading loop with a one-line lookahead; respects inFence guard and integrates with existing headingStack nesting
- Bug 3: replace boolean inFence with fenceState {char, len} so a backtick fence can only be closed by backticks of >= the opening length, and vice versa for tildes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
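The fence-matching rule from Bug 3 can be sketched as a small state transition. `FenceState` mirrors the `{char, len}` shape named above; the function name `updateFence` and the requirement that a closing fence carry no trailing text are assumptions made for the sketch.

```typescript
// Open-fence state: which delimiter character opened the fence and how long the run was.
interface FenceState { char: "`" | "~"; len: number; }

// Returns the fence state after reading `line`; null means "not inside a fence".
function updateFence(line: string, state: FenceState | null): FenceState | null {
  const match = /^ {0,3}(`{3,}|~{3,})(.*)$/.exec(line);
  if (!match) return state; // not a fence delimiter line

  const run = match[1];
  const char = run[0] as "`" | "~";

  if (state === null) return { char, len: run.length }; // opening fence

  // A fence closes only on the same character, with a run at least as long
  // as the opener, and nothing but whitespace after it.
  if (char === state.char && run.length >= state.len && match[2].trim() === "") return null;
  return state; // delimiter-like line inside the fence: still code
}
```

A block opened with four backticks therefore stays open across a line of three backticks or a line of tildes, and closes only on four or more backticks.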

core-ingestion/src/index.ts

Restructured keyPattern to handle quoted and bare keys as separate alternatives, eliminating the space/\s* overlap that caused polynomial backtracking (CodeQL js/polynomial-redos). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
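One way to write a key pattern with quoted and bare keys as separate alternatives is shown below. This is an illustrative pattern, not the one committed in the PR; the bare-key character class and the `extractKey` helper are assumptions. Each alternative starts with a distinct character, and the only whitespace quantifier sits between the key and `=`, so no two quantifiers compete for the same characters.

```typescript
// Illustrative pattern: double-quoted key | single-quoted key | bare (possibly dotted) key.
const keyPattern = /^(?:"([^"\\]*)"|'([^']*)'|([A-Za-z0-9_.-]+))[ \t]*=/;

// Returns the key of a `key = value` line, or null if the line is not a key line.
function extractKey(line: string): string | null {
  const m = keyPattern.exec(line.trimStart());
  if (!m) return null;
  // Exactly one of the three capture groups participates in a match
  return m.slice(1).find((g) => g !== undefined) ?? null;
}
```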

/\s*\{#[^}]+\}\s*$/ had three overlapping quantifiers on uncontrolled input causing O(n²) backtracking. Replaced with lastIndexOf/indexOf string methods (CodeQL js/polynomial-redos). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
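A sketch of how the trailing `{#anchor}` strip can be done with string methods instead of the regex. The function name `stripTrailingAnchor` is hypothetical, and this is one possible equivalent of the original pattern rather than the committed code.

```typescript
// Removes a trailing "{#id}" (plus surrounding whitespace) from a heading name.
// Returns the input unchanged when there is no such anchor.
function stripTrailingAnchor(name: string): string {
  const trimmed = name.trimEnd();
  if (!trimmed.endsWith("}")) return name;

  const end = trimmed.length - 1; // index of the final '}'
  // Start searching after any earlier '}' so the ID itself contains none
  const prevClose = trimmed.lastIndexOf("}", end - 1);
  const open = trimmed.indexOf("{#", prevClose + 1);

  if (open === -1 || open + 2 >= end) return name; // no anchor, or empty ID
  return trimmed.slice(0, open).trimEnd();
}
```

Each step is a single linear scan, so the cost stays proportional to the heading length regardless of how many spaces or braces the input contains.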

…parser
- Loop <[^>]+> replacement until stable to handle split tags like <scr<x>ipt> (CodeQL js/incomplete-multi-character-sanitization)
- Replace headingPattern (.+?)(?:\s+#+)?\s*$ with (.*\S) to eliminate three overlapping quantifier pairs on spaces
- Replace htmlHeadingPattern (.*?)\s*$ with greedy .* anchored by closing tag, eliminating (.*?)/\s*$ overlap (CodeQL js/polynomial-redos)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
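The repeat-until-stable replacement from the first item can be sketched like this. The function name `stripTags` is hypothetical. The loop stops once a pass changes nothing, so the returned string contains no remaining match for the pattern.

```typescript
// Removes every <...> run, repeating until a pass leaves the string unchanged.
function stripTags(input: string): string {
  let previous: string;
  let current = input;
  do {
    previous = current;
    current = current.replace(/<[^>]+>/g, "");
  } while (current !== previous);
  return current;
}
```

Note that the surrounding spaces are left in place, which is why the pipeline in this PR also includes a double-space normalization step.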

core-ingestion/src/index.ts

- headingPattern: replace \s+(.*\S) with [ \t]+(\S[^\r\n]*) so content must start with non-whitespace, removing the \s+/.* overlap that caused O(n²) backtracking on space-only lines
- rawName: replace /\s+#+$/ regex with a string walk, removing +#+$ backtracking on strings with # runs followed by no end-of-string match (CodeQL js/polynomial-redos)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
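The string walk for rawName might look like the following. The function name `stripClosingHashes` is hypothetical; the sketch aims to match what replacing /\s+#+$/ with an empty string does, using two backward scans.

```typescript
// Removes a closing "#" run only when it is preceded by whitespace.
function stripClosingHashes(rawName: string): string {
  // Walk back over the trailing '#' run
  let end = rawName.length;
  while (end > 0 && rawName[end - 1] === "#") end--;
  if (end === rawName.length) return rawName; // no trailing '#'

  // Walk back over the whitespace that must precede the run
  let ws = end;
  while (ws > 0 && /\s/.test(rawName[ws - 1])) ws--;
  if (ws === end) return rawName; // '#' is attached to the text, e.g. "C#"

  return rawName.slice(0, ws);
}
```

Each character is visited at most once, so the walk is linear even for names with long `#` or space runs.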

jsmikhai and others added 14 commits April 2, 2026 17:05

feat(cli): add --path filter to ix contains command

6d340eb

Allows disambiguating by file path when multiple entities share the same name across workspaces or files. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Support HTML headings in markdown docs

a7c351d

Scope CLI search to active workspace

7c8bf74

github-advanced-security bot found potential problems Apr 5, 2026

core-ingestion/src/index.ts (4 findings, each marked Fixed)

jsmikhai and others added 3 commits April 5, 2026 16:20

github-advanced-security bot found potential problems Apr 5, 2026

core-ingestion/src/index.ts (1 finding, marked Fixed)

riley0227 approved these changes Apr 6, 2026

josephismikhail wants to merge 18 commits into main from feat/markdown-toml-parsing-support

4 participants
