
Markdown & TOML Parsing Support #147

Open
josephismikhail wants to merge 18 commits into main from feat/markdown-toml-parsing-support


josephismikhail commented Apr 5, 2026

Summary

Adds TOML and Markdown file parsing to the ingestion pipeline.

.toml files are parsed into the graph as config_entry entities — table headers ([package], [profile.release]) and key=value pairs — with CONTAINS relationships linking them.

.md/.markdown files are parsed into the graph as section entities derived from headings (ATX #, setext underline, and HTML <h1> through <h6>), with hierarchical CONTAINS relationships, YAML frontmatter extraction, and fenced code block boundaries respected. Also fixes ix contains disambiguation when the same name appears across multiple files.

Closes #145

Type

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Test
  • CI

Changes

TOML

  • feat(core-ingestion): Add TOML parser — extracts [table] headers and key = value pairs as config_entry entities with CONTAINS relationships, mirroring the YAML parser pattern
  • fix(toml): Emit intermediate nodes for dotted table headers — [profile.release] now materialises a profile node in the graph, not just release; a per-file seen-set deduplicates shared prefixes
  • fix: Add TOML to the isGrammarSupported early-return guard — .toml files were being silently dropped before reaching parseFile
  • feat(cli): Add --path filter to ix contains — allows disambiguating by file path when multiple entities share the same name across workspaces or files

Markdown

  • feat(core-ingestion): Add Markdown parser that extracts headings as section entities with hierarchical CONTAINS relationships; supports ATX headings, setext headings, and HTML headings (<h1> through <h6>); parses YAML frontmatter as a body chunk; respects fenced code block boundaries so heading-like lines inside code aren't parsed as headings
  • fix(ingest): Add .md and .markdown to SUPPORTED_EXTENSIONS
  • fix(core-ingestion): Store section chunks with kind 'section'; bump extractor to 1.21
  • fix(cli): Wire up ix config set workspace so it actually takes effect; scope CLI search to active workspace
  • fix(cli): Resolve absolute paths, short IDs, and section kind display
  • fix(md-parse): Hierarchical section scoping and root-file disambiguation
  • fix(md-parse): Clean VitePress heading syntax from entity names (strip {#anchor}, badges, <sup>, escaped angle brackets)
  • fix(md-parse): Setext heading detection, fence delimiter matching, frontmatter body chunk

Validation

TOML

  • Tested against ripgrep (real-world Rust workspace with multiple Cargo.toml files)
  • ix text --language toml returns results from Cargo.toml
  • name, version, edition resolve as config_entry with language: toml
  • [package], [dependencies] resolve as table entities
  • ix contains package returns correct child keys (name, version, edition, authors, etc.)
  • [profile.release] → profile intermediate node present; opt-level resolves as child
  • ix contains package --path crates/regex resolves without ambiguity prompt
  • 263-line test suite in queries.toml.test.ts covers the parser

Markdown

  • Tested on 3 repos
  • No regressions on existing parsers (ix map on a known TS/JS repo, verify counts unchanged)
  • Unit + smoke tests pass (npm test in ix-cli)
  • Heading hierarchy chains correctly: file → h1 → h2 → h3
  • Both ATX and setext heading syntax ingested
  • ix contains returns children for a known markdown file and heading
  • Frontmatter entities present where expected
  • Section chunks created alongside heading entities
  • ix text returns results with language: markdown
  • Both .md and .markdown extensions ingested

Checklist

  • Tests pass
  • Smoke tests pass
  • No raw errors introduced
  • CLI output follows Ix format

jsmikhai and others added 14 commits April 2, 2026 17:05
Parses TOML files line-by-line, extracting [table] headers and key=value
pairs as config_entry entities with CONTAINS relationships. Mirrors the
YAML parser pattern — no external dependencies needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
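
As a rough illustration of the line-by-line approach described in this commit, a minimal sketch might look like the following. The `ConfigEntry` and `Contains` shapes and the function name `parseTomlSketch` are hypothetical and are not taken from core-ingestion; multi-line values, quoted keys, and array-of-tables headers are deliberately ignored.

```typescript
// Hypothetical shapes for illustration; the real entity types may differ.
interface ConfigEntry { name: string; kind: "config_entry"; line: number; }
interface Contains { parent: string; child: string; }

function parseTomlSketch(source: string): { entities: ConfigEntry[]; edges: Contains[] } {
  const entities: ConfigEntry[] = [];
  const edges: Contains[] = [];
  const lines = source.split(/\r?\n/);
  let currentTable: string | null = null;

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    if (line === "" || line.startsWith("#")) continue; // blank line or comment

    // [table] header: the text between the brackets becomes the table name.
    if (line.startsWith("[") && !line.startsWith("[[") && line.endsWith("]")) {
      currentTable = line.slice(1, -1).trim();
      entities.push({ name: currentTable, kind: "config_entry", line: i + 1 });
      continue;
    }

    // key = value: everything before the first '=' is treated as the key.
    const eq = line.indexOf("=");
    if (eq > 0) {
      const key = line.slice(0, eq).trim();
      entities.push({ name: key, kind: "config_entry", line: i + 1 });
      if (currentTable !== null) edges.push({ parent: currentTable, child: key });
    }
  }
  return { entities, edges };
}
```

Keys that appear before any table header get no CONTAINS edge in this sketch; how the actual parser attaches them to the file node is not shown here.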
TOML has a custom parser (parseTomlFile) like YAML/JSON/SQL/Dockerfile,
but was missing from the isGrammarSupported check. This caused all .toml
files to be silently dropped before reaching parseFile.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
[profile.release] now materialises a `profile` node in the graph, not
just `release`. A per-file seen-set deduplicates shared prefixes so
[profile.release] + [profile.dev] produce exactly one `profile` entity.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
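
The prefix deduplication described above can be sketched roughly as follows. The function name `expandDottedHeaders`, the `Edge` shape, and the choice to identify each node by its full dotted path are assumptions made for this example; quoted keys containing dots are not handled.

```typescript
interface Edge { parent: string; child: string; }

// Expands dotted table headers into one node per path prefix.
// `seen` is scoped to a single file so shared prefixes are emitted once.
function expandDottedHeaders(headers: string[]): { nodes: string[]; edges: Edge[] } {
  const seen = new Set<string>();
  const nodes: string[] = [];
  const edges: Edge[] = [];

  for (const header of headers) {
    const parts = header.split(".").map((p) => p.trim());
    let prefix = "";
    for (const part of parts) {
      const parent = prefix;
      prefix = prefix === "" ? part : `${prefix}.${part}`;
      if (seen.has(prefix)) continue; // already materialised by an earlier header
      seen.add(prefix);
      nodes.push(prefix);
      if (parent !== "") edges.push({ parent, child: prefix });
    }
  }
  return { nodes, edges };
}
```

Called with `["profile.release", "profile.dev"]`, this yields a single `profile` node with two children, which matches the behaviour the commit message describes.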
Allows disambiguating by file path when multiple entities share the
same name across workspaces or files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parses ATX headings as heading entities with hierarchical CONTAINS
relationships, YAML frontmatter as a frontmatter entity/chunk, and
section chunks spanning each heading's content. Skips headings inside
fenced code blocks. Falls back to file_body for files with no headings.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
File discovery in ingest.ts maintains a local extension set that mirrors
core-ingestion's EXT_MAP but was missing .md and .markdown. This caused
markdown files to be silently excluded before reaching the parser, so
ix map never produced heading, frontmatter, or file nodes for .md files
— even though parseMarkdownFile was fully implemented.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs from the markdown-parsing test run:

1. LANGUAGE_QUERIES was missing a Markdown entry, causing a TS2741
   compile error that prevented core-ingestion from building.
   Added [SupportedLanguages.Markdown]: '' (markdown uses its own
   hand-written parser, not tree-sitter queries).

2. ix config set workspace <name> wrote a top-level workspace: key to
   config.yaml but resolveWorkspaceRoot never read it — all commands
   continued routing to the workspace with default: true.
   Added workspace?: string to IxConfig and a lookup step in
   resolveWorkspaceRoot that checks cfg.workspace by name before
   falling back to the default workspace.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
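
For the second fix, the lookup order can be sketched as below. Only `workspace?: string`, `IxConfig`, `resolveWorkspaceRoot`, and `default: true` are named in the commit message; the `workspaces` array, the `WorkspaceEntry` shape, and the single-argument signature are assumptions for illustration.

```typescript
// Hypothetical config shape; the real IxConfig likely has more fields.
interface WorkspaceEntry { name: string; root: string; default?: boolean; }
interface IxConfig { workspace?: string; workspaces: WorkspaceEntry[]; }

function resolveWorkspaceRoot(cfg: IxConfig): string | undefined {
  // Step 1: honour the explicitly selected workspace, if it exists.
  if (cfg.workspace !== undefined) {
    const named = cfg.workspaces.find((w) => w.name === cfg.workspace);
    if (named !== undefined) return named.root;
  }
  // Step 2: fall back to the workspace flagged as default.
  const fallback = cfg.workspaces.find((w) => w.default === true);
  return fallback === undefined ? undefined : fallback.root;
}
```

In this sketch an unknown workspace name silently falls through to the default; whether the real CLI warns in that case is not stated in the commit.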
…xtractor to 1.21

Section chunks were stored as kind 'chunk', making ix search --kind section
return no results. Giving them a first-class 'section' kind (consistent with
heading/frontmatter) makes them discoverable via --kind filtering.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- ix contains: absolute paths now match against graph's relative URIs by
  checking targetLower.endsWith(uri); URI-length tiebreaker picks the
  most specific match when multiple quality-0 candidates exist
- ix contains/explain: 8–31 char hex inputs (short IDs from CLI output)
  now attempt resolvePrefix before falling back to symbol resolution
- ix explain: context section always shows Kind so section (and all
  other) entity types are visible in the detail view

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
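
The absolute-path matching in the first bullet can be illustrated with a small sketch. The function name `pickUriForTarget` is hypothetical, and the quality scoring mentioned in the commit is left out; only the `endsWith` check and the longest-URI tiebreaker are shown.

```typescript
// Picks the graph URI that an absolute path refers to, assuming the graph
// stores paths relative to the workspace root.
function pickUriForTarget(target: string, uris: string[]): string | undefined {
  const targetLower = target.toLowerCase();
  let best: string | undefined;
  for (const uri of uris) {
    if (!targetLower.endsWith(uri.toLowerCase())) continue;
    // Tiebreaker: the longest matching URI is the most specific one.
    if (best === undefined || uri.length > best.length) best = uri;
  }
  return best;
}
```

A stricter version would also require a path separator before the matched suffix so that `xCargo.toml` does not match `Cargo.toml`; the commit message does not say whether the real code does this.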
- Section chunks now span to the next heading at the same or shallower
  level, so parent sections include their full nested subtree content
- Bare filename tie-breaking now prefers shorter URIs (root-level files)
  over deeply-nested ones, fixing `ix contains README.md` resolving to
  a fixture file instead of the root README
- Add test asserting parent section lineEnd covers nested child sections

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
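
The section-span rule in the first bullet can be sketched as follows. The `Heading` shape and the function name `sectionEnds` are hypothetical; line numbers are 1-based, and the last section is assumed to run to the end of the file.

```typescript
interface Heading { level: number; line: number; }

// For each heading, returns the last line of its section: the line before the
// next heading at the same or a shallower level, or the end of the file.
function sectionEnds(headings: Heading[], totalLines: number): number[] {
  return headings.map((h, i) => {
    for (let j = i + 1; j < headings.length; j++) {
      if (headings[j].level <= h.level) return headings[j].line - 1;
    }
    return totalLines;
  });
}
```

Because deeper headings never terminate the scan, a parent section's end line always covers its nested children, which is the property the added test asserts.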
Replaces single-line HTML strip with a full pipeline that handles
anchor IDs ({#...}), backtick-wrapped component names, backslash-escaped
angle brackets, stability markers (\*\*), inline HTML badges, and
double-space normalization. Adds 6 regression tests for vuejs/docs cases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… body chunk

- Bug 1: emit file_body when no section headings exist even if a frontmatter
  chunk is already present; lineStart/startByte now point past the frontmatter
- Bug 2: detect setext-style headings (=== and --- underlines) inside the
  heading loop with a one-line lookahead; respects inFence guard and integrates
  with existing headingStack nesting
- Bug 3: replace boolean inFence with fenceState {char, len} so a backtick
  fence can only be closed by backticks of >= the opening length, and vice
  versa for tildes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
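
The fence tracking from Bug 3 might be sketched like this. The `FenceState` field names follow the commit message, while the function name `insideFence`, the regular expression, and the three-space indent limit are assumptions based on common Markdown fence rules rather than the actual implementation.

```typescript
interface FenceState { char: string; len: number; }

// Marks each line as inside (true) or outside (false) a fenced code block.
// Fence delimiter lines themselves count as inside.
function insideFence(lines: string[]): boolean[] {
  const result: boolean[] = [];
  let fence: FenceState | null = null;

  for (const line of lines) {
    // A run of three or more backticks or tildes, indented at most three spaces.
    const m = /^ {0,3}(`{3,}|~{3,})(.*)$/.exec(line);
    if (fence === null) {
      if (m !== null) {
        fence = { char: m[1][0], len: m[1].length };
        result.push(true);
      } else {
        result.push(false);
      }
      continue;
    }
    // Closes only on the same character, at least the opening length,
    // with nothing but whitespace after the run.
    const closes = m !== null && m[1][0] === fence.char &&
      m[1].length >= fence.len && m[2].trim() === "";
    if (closes) fence = null;
    result.push(true);
  }
  return result;
}
```

A heading scanner would then skip every line flagged `true`, so a shorter backtick run or a tilde run inside a backtick fence no longer ends the block early.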
jsmikhai and others added 3 commits April 5, 2026 16:20
Restructured keyPattern to handle quoted and bare keys as separate
alternatives, eliminating the space/\s* overlap that caused polynomial
backtracking (CodeQL js/polynomial-redos).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
/\s*\{#[^}]+\}\s*$/ had three overlapping quantifiers on uncontrolled
input causing O(n²) backtracking. Replaced with lastIndexOf/indexOf
string methods (CodeQL js/polynomial-redos).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
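
A regex-free way to drop a trailing `{#anchor}` could look like the sketch below. The function names are hypothetical, and this version may differ from the original regex on unusual inputs such as nested `{#` sequences; it is meant only to show the linear-time string-method approach.

```typescript
// Drops trailing spaces and tabs with an index walk instead of a regex.
function rtrim(s: string): string {
  let end = s.length;
  while (end > 0 && (s[end - 1] === " " || s[end - 1] === "\t")) end--;
  return s.slice(0, end);
}

// Removes a trailing "{#id}" suffix, returning the input unchanged otherwise.
function stripTrailingAnchor(heading: string): string {
  const trimmed = rtrim(heading);
  if (!trimmed.endsWith("}")) return heading;
  const open = trimmed.lastIndexOf("{#");
  if (open === -1) return heading;
  const id = trimmed.slice(open + 2, trimmed.length - 1);
  // Require a non-empty id with no closing brace inside it.
  if (id.length === 0 || id.indexOf("}") !== -1) return heading;
  return rtrim(trimmed.slice(0, open));
}
```

Each step scans the string at most once, so the cost stays linear in the heading length regardless of how many spaces or braces the input contains.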
…parser

- Loop <[^>]+> replacement until stable to handle split tags like
  <scr<x>ipt> (CodeQL js/incomplete-multi-character-sanitization)
- Replace headingPattern (.+?)(?:\s+#+)?\s*$ with (.*\S) to eliminate
  three overlapping quantifier pairs on spaces
- Replace htmlHeadingPattern (.*?)\s*$ with greedy .* anchored by
  closing tag, eliminating (.*?)/\s*$ overlap
  (CodeQL js/polynomial-redos)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
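
The repeat-until-stable replacement from the first bullet can be sketched as below. The function name `stripTags` is hypothetical; the pattern `<[^>]+>` is the one quoted in the commit message.

```typescript
// Removes tag-like spans, repeating until a pass makes no further change,
// so the returned string is a fixed point of the replacement.
function stripTags(input: string): string {
  let previous: string;
  let current = input;
  do {
    previous = current;
    current = current.replace(/<[^>]+>/g, "");
  } while (current !== previous);
  return current;
}
```

The loop guarantees that the result contains no remaining match of the pattern, which is the property the CodeQL rule checks for; stray unmatched `<` or `>` characters may still remain.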
- headingPattern: replace \s+(.*\S) with [ \t]+(\S[^\r\n]*) so content
  must start with non-whitespace, removing the \s+/.* overlap that
  caused O(n²) backtracking on space-only lines
- rawName: replace /\s+#+$/ regex with a string walk, removing +#+$
  backtracking on strings with # runs followed by no end-of-string match
  (CodeQL js/polynomial-redos)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
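
The string walk that replaces `/\s+#+$/` might look roughly like this. The function name `stripClosingHashes` is hypothetical, and only spaces and tabs are treated as whitespace here, which is narrower than `\s`.

```typescript
// Removes a closing run of '#' characters, but only when the run is
// preceded by at least one space or tab (as /\s+#+$/ required).
function stripClosingHashes(rawName: string): string {
  let end = rawName.length;
  while (end > 0 && rawName[end - 1] === "#") end--;
  if (end === rawName.length) return rawName; // no trailing '#'

  let ws = end;
  while (ws > 0 && (rawName[ws - 1] === " " || rawName[ws - 1] === "\t")) ws--;
  if (ws === end) return rawName; // '#' run is attached to the text, keep it

  return rawName.slice(0, ws);
}
```

Two backward scans over the tail of the string replace the nested quantifiers, so a name such as `C#` is left intact while `Title ##` loses its closing sequence.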

Development

Successfully merging this pull request may close these issues.

Add Markdown parser support (.md, .markdown)

4 participants