feat: symbol extraction and search enrichment (Phase 2) by jamesrisberg · Pull Request #484 · tobi/qmd

jamesrisberg · 2026-03-29T16:45:33Z

Summary

Builds on the AST-aware chunking infrastructure from #449 to extract symbol metadata (functions, classes, interfaces, types, enums) during indexing and surface it through search results.

When --chunk-strategy auto is enabled, qmd now:

Extracts symbols from code files during the chunking pass (single tree-sitter traversal — no extra cost)
Enriches embeddings with symbol names before vectorization, improving semantic retrieval for code queries
Returns symbols in results across CLI, JSON output, MCP, and REST endpoints
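The enrichment step in the list above can be pictured with a minimal sketch. It is an illustration under assumptions, not the actual formatDocForEmbedding from src/llm.ts: the function name formatDocWithSymbols, the "title:" and "symbols:" prefixes, and the de-duplication step are all hypothetical.

```typescript
// Hypothetical sketch of prepending symbol names to the text that gets embedded.
// The real formatDocForEmbedding signature and output format may differ.
function formatDocWithSymbols(
  text: string,
  title?: string,
  symbolNames: string[] = [],
): string {
  const parts: string[] = [];
  if (title) parts.push(`title: ${title}`);

  // Drop repeated names while keeping first-seen order (an assumption of this sketch).
  const unique = Array.from(new Set(symbolNames));
  if (unique.length > 0) parts.push(`symbols: ${unique.join(", ")}`);

  parts.push(text);
  return parts.join("\n");
}
```

When no symbols are available (for example, a markdown file or a missing grammar), the text passes through unchanged, which matches the "additive only" behavior the PR describes.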

What gets extracted

Language	Symbols
TypeScript/JavaScript	functions, classes, methods, interfaces, type aliases, enums
Python	functions, classes, methods
Go	functions, methods, types
Rust	functions, structs, traits, impls, enums

Example output

[function] authenticate(user: User, token: string): Promise<boolean>
[class] AuthService
[interface] AuthConfig
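A rough sketch of how lines like these could be rendered from extracted symbols. The DisplaySymbol shape and formatSymbolLine helper are hypothetical names for illustration; the PR only states that each symbol carries a name, kind, signature, line number, and byte offset.

```typescript
// Hypothetical display shape; the exported SymbolInfo type may have different fields.
interface DisplaySymbol {
  kind: string;        // e.g. "function", "class", "interface"
  name: string;
  signature?: string;  // full signature when the extractor captured one
}

// Render "[kind] signature", falling back to the bare name when no signature exists.
function formatSymbolLine(symbol: DisplaySymbol): string {
  return `[${symbol.kind}] ${symbol.signature ?? symbol.name}`;
}

const lines = [
  { kind: "function", name: "authenticate", signature: "authenticate(user: User, token: string): Promise<boolean>" },
  { kind: "class", name: "AuthService" },
].map(formatSymbolLine);
```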

Changes

src/ast.ts — extractAllSymbols() and parseCodeFile() for single-pass breakpoint + symbol extraction; per-language tree-sitter queries with name, kind, signature, line number, and byte offset. WASM parser/tree cleanup via try/finally to prevent leaks. JS-uses-TS-grammar decision documented.
src/store.ts — Symbol-to-chunk mapping by byte range; symbols column in content_vectors; symbols read from DB at query time (no re-parsing); sequential enrichment to avoid unbounded WASM memory pressure; Store interface and proxy updated for symbols parameter; migration only catches "duplicate column" errors.
src/cli/qmd.ts — Symbols displayed in default + JSON output; grammar availability shown in status.
src/llm.ts — formatDocForEmbedding accepts optional symbol names to prepend.
src/mcp/server.ts — symbols field in MCP search results and REST /query endpoint.
src/index.ts — Exports SymbolInfo type; corrected JSDoc default ("regex", not "auto").
test/ast.test.ts — 50 unit tests covering the supported languages (TS, JS, Python, Go, Rust) + unicode identifiers + type/enum assertions.
test/ast-chunking.test.ts — 10 integration tests: merge breakpoints, AST vs regex split comparison, markdown regression, strategy bypass, overlapping chunk symbols, embedding enrichment.
test-ast-chunking.mjs — Deleted. Standalone 1,112-line test script replaced by proper vitest suites above.
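The symbol-to-chunk mapping by byte range mentioned for src/store.ts can be sketched as follows. This is a simplified illustration, assuming half-open [start, end) chunk ranges and a single byteOffset per symbol; the type and function names are hypothetical and not taken from the PR.

```typescript
// Hypothetical shapes for illustration only.
interface OffsetSymbol {
  name: string;
  kind: string;
  byteOffset: number; // byte position where the symbol starts in the file
}

interface ChunkRange {
  start: number; // inclusive byte offset
  end: number;   // exclusive byte offset
}

// A symbol belongs to every chunk whose byte range contains its offset,
// so overlapping chunks can both report the same symbol.
function symbolsForChunk(symbols: OffsetSymbol[], chunk: ChunkRange): OffsetSymbol[] {
  return symbols.filter((s) => s.byteOffset >= chunk.start && s.byteOffset < chunk.end);
}
```

With chunks [0, 100) and [80, 200), a symbol at offset 90 falls in both, which is the overlapping-chunk case the integration tests are said to cover.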

Design decisions

Single-pass parsing — breakpoints and symbols extracted in one tree-sitter traversal, cached per file
DB-first symbol lookup — enrichWithSymbols reads stored symbols from content_vectors at query time instead of re-parsing every search result; falls back to on-demand parsing only for files not yet re-embedded
Sequential enrichment — avoids Promise.all on N concurrent WASM parses; DB path is sync anyway
Opt-in — Default remains "regex". No behavior change for existing users
Graceful degradation — Missing grammars or parse failures silently fall back to regex with no symbols
WASM safety — try/finally ensures parser and tree are always freed, even on exception
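Three of the decisions above (DB-first lookup, sequential enrichment, and try/finally cleanup) fit together roughly as in the sketch below. It is written against hypothetical interfaces: ParserHandle stands in for a WASM-backed parser, and lookupStored and parseOnDemand are placeholder callbacks, not functions from the PR.

```typescript
interface SearchHit {
  path: string;
  symbols?: string[];
}

// Stand-in for a WASM-backed parser that must be freed explicitly.
interface ParserHandle {
  parse(source: string): string[];
  free(): void;
}

// try/finally guarantees free() runs whether parse() returns or throws.
function parseWithCleanup(makeParser: () => ParserHandle, source: string): string[] {
  const parser = makeParser();
  try {
    return parser.parse(source);
  } finally {
    parser.free();
  }
}

// Enrich hits one at a time: prefer stored symbols, parse only when none are stored.
async function enrichSequentially(
  hits: SearchHit[],
  lookupStored: (path: string) => string[] | undefined,
  parseOnDemand: (path: string) => Promise<string[]>,
): Promise<SearchHit[]> {
  const enriched: SearchHit[] = [];
  for (const hit of hits) {
    const stored = lookupStored(hit.path);
    let symbols: string[];
    if (stored !== undefined) {
      symbols = stored;
    } else {
      try {
        symbols = await parseOnDemand(hit.path); // at most one parse in flight
      } catch {
        symbols = []; // degrade to "no symbols" rather than failing the search
      }
    }
    enriched.push({ ...hit, symbols });
  }
  return enriched;
}
```

The plain for loop with await is what keeps parses from running concurrently; swapping it for Promise.all over the hits would start every fallback parse at once.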

Test plan

50 unit tests: all supported languages, unicode, type/enum, error cases, range filtering
10 integration tests: chunking comparison, markdown regression, overlapping symbols, embedding format
Zero new TypeScript type errors
Zero new test regressions (6 pre-existing upstream handelize failures unchanged)
Rebased cleanly onto current main (24 commits ahead of the #449 merge point)

Migration

After upgrading, re-embed to generate symbol metadata:

qmd embed -f --chunk-strategy auto

Existing indexes continue to work — symbols are additive only.
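On the schema side, the Changes list says the migration only catches "duplicate column" errors. A minimal sketch of that pattern, assuming a SQLite-style error message ("duplicate column name: ...") and a hypothetical exec callback in place of the real database handle:

```typescript
// Hypothetical helper: add a column, tolerating only the "already exists" case.
// exec stands in for whatever runs SQL against the index database.
function addColumnIfMissing(
  exec: (sql: string) => void,
  table: string,
  column: string,
  type: string,
): void {
  try {
    exec(`ALTER TABLE ${table} ADD COLUMN ${column} ${type}`);
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    // Re-running the migration on an upgraded index is fine; anything else is a real failure.
    if (!/duplicate column/i.test(message)) throw err;
  }
}
```

Narrowing the catch this way keeps the migration idempotent without hiding problems such as a missing table or a locked database.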

🤖 Generated with Claude Code

Builds on AST-aware chunking (tobi#449) to extract symbol metadata (functions, classes, interfaces, types, enums) during indexing and surface them through search results.

Key changes:
- Single-pass symbol extraction via tree-sitter during embedding
- Symbols stored in content_vectors and read at query time (no re-parse)
- Sequential enrichment to avoid unbounded WASM memory pressure
- Symbol-enriched embeddings for improved semantic retrieval
- CLI, JSON, MCP, and REST endpoints all return symbols
- Store interface/proxy updated for symbols parameter
- WASM parser cleanup via try/finally (no leak on exception)
- DB migration only swallows "duplicate column" errors
- Standalone test script replaced with proper vitest suite

Tests: 60 passing (50 unit + 10 integration)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jamesrisberg closed this Mar 29, 2026

jamesrisberg wants to merge 1 commit into tobi:main from jamesrisberg:feat/phase2-symbol-extraction
