feat: symbol extraction and search enrichment (Phase 2)#484

Closed
jamesrisberg wants to merge 1 commit into tobi:main from jamesrisberg:feat/phase2-symbol-extraction

@jamesrisberg
Contributor

Summary

Builds on the AST-aware chunking infrastructure from #449 to extract symbol metadata (functions, classes, interfaces, types, enums) during indexing and surface it through search results.

When --chunk-strategy auto is enabled, qmd now:

  • Extracts symbols from code files during the chunking pass (single tree-sitter traversal — no extra cost)
  • Enriches embeddings with symbol names before vectorization, improving semantic retrieval for code queries
  • Returns symbols in results across CLI, JSON output, MCP, and REST endpoints
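As a rough illustration of the enrichment step above, the sketch below prepends symbol names to a chunk's text before it is embedded. The helper name `prependSymbols` and the `symbols:` prefix format are assumptions made for this example; the PR's own `formatDocForEmbedding` may format the text differently.

```typescript
// Hypothetical helper illustrating the idea; not the PR's formatDocForEmbedding.
function prependSymbols(text: string, symbolNames: string[] = []): string {
  // Drop empty names and de-duplicate while keeping first-seen order.
  const unique = symbolNames.filter(
    (n, i) => n.length > 0 && symbolNames.indexOf(n) === i,
  );
  if (unique.length === 0) return text; // no symbols: text is left unchanged
  // The "symbols:" prefix is an assumed format, chosen only for illustration.
  return "symbols: " + unique.join(", ") + "\n" + text;
}

const enriched = prependSymbols("class AuthService { /* ... */ }", [
  "AuthService",
  "authenticate",
  "AuthService",
]);
console.log(enriched);
```

Keeping the no-symbol case an identity function matches the additive intent described in this PR: chunks without symbols embed exactly as before.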

What gets extracted

| Language | Symbols |
| --- | --- |
| TypeScript/JavaScript | functions, classes, methods, interfaces, type aliases, enums |
| Python | functions, classes, methods |
| Go | functions, methods, types |
| Rust | functions, structs, traits, impls, enums |

Example output

[function] authenticate(user: User, token: string): Promise<boolean>
[class] AuthService
[interface] AuthConfig
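
One way the `[kind] signature` lines above could be produced is sketched here. The `SymbolInfo` shape is inferred from the fields this PR names (name, kind, signature, line number, byte offset); the exact property names and the `formatSymbol` helper are assumptions, not the exported API.

```typescript
// Assumed shape, inferred from the fields named in this PR description.
// The real exported SymbolInfo type may use different property names.
interface SymbolInfo {
  name: string;
  kind: string; // e.g. "function", "class", "interface"
  signature?: string; // full signature when one was captured
  line: number;
  byteOffset: number;
}

// Render a symbol as "[kind] signature", falling back to the bare name.
function formatSymbol(sym: SymbolInfo): string {
  const label = sym.signature !== undefined ? sym.signature : sym.name;
  return "[" + sym.kind + "] " + label;
}

// Illustrative values only; line numbers and offsets are made up.
const demoSymbols: SymbolInfo[] = [
  {
    name: "authenticate",
    kind: "function",
    signature: "authenticate(user: User, token: string): Promise<boolean>",
    line: 12,
    byteOffset: 240,
  },
  { name: "AuthService", kind: "class", line: 30, byteOffset: 810 },
];
for (const sym of demoSymbols) console.log(formatSymbol(sym));
```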

Changes

  • src/ast.ts — extractAllSymbols() and parseCodeFile() for single-pass breakpoint + symbol extraction; per-language tree-sitter queries with name, kind, signature, line number, and byte offset. WASM parser/tree cleanup via try/finally to prevent leaks. JS-uses-TS-grammar decision documented.
  • src/store.ts — Symbol-to-chunk mapping by byte range; symbols column in content_vectors; symbols read from DB at query time (no re-parsing); sequential enrichment to avoid unbounded WASM memory pressure; Store interface and proxy updated for symbols parameter; migration only catches "duplicate column" errors.
  • src/cli/qmd.ts — Symbols displayed in default + JSON output; grammar availability shown in status.
  • src/llm.ts — formatDocForEmbedding accepts optional symbol names to prepend.
  • src/mcp/server.ts — symbols field in MCP search results and REST /query endpoint.
  • src/index.ts — Exports SymbolInfo type; corrected JSDoc default ("regex", not "auto").
  • test/ast.test.ts — 50 unit tests covering all 6 languages (TS, JS, Python, Go, Rust) + unicode identifiers + type/enum assertions.
  • test/ast-chunking.test.ts — 10 integration tests: merge breakpoints, AST vs regex split comparison, markdown regression, strategy bypass, overlapping chunk symbols, embedding enrichment.
  • test-ast-chunking.mjs — Deleted. Standalone 1,112-line test script replaced by proper vitest suites above.
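
The symbol-to-chunk mapping by byte range mentioned for src/store.ts can be pictured with the sketch below. It assumes a symbol belongs to a chunk when its start offset falls inside the chunk's half-open byte range; the PR may use a different rule (for example, range overlap), and the type and function names here are hypothetical.

```typescript
// Hypothetical types for illustration; not taken from the PR's source.
interface SymbolSpan {
  name: string;
  byteOffset: number; // start offset of the symbol in the file
}
interface ChunkRange {
  start: number; // inclusive
  end: number; // exclusive
}

// Assumed rule: a symbol is attached to every chunk whose range contains
// the symbol's start offset.
function symbolsForChunk(chunk: ChunkRange, symbols: SymbolSpan[]): SymbolSpan[] {
  return symbols.filter(
    (s) => s.byteOffset >= chunk.start && s.byteOffset < chunk.end,
  );
}

// Overlapping chunks: a symbol in the overlap is reported for both.
const demoChunks: ChunkRange[] = [
  { start: 0, end: 100 },
  { start: 80, end: 200 },
];
const demoSpans: SymbolSpan[] = [
  { name: "AuthConfig", byteOffset: 10 },
  { name: "AuthService", byteOffset: 90 },
  { name: "authenticate", byteOffset: 150 },
];
for (const chunk of demoChunks) {
  const found = symbolsForChunk(chunk, demoSpans).map((s) => s.name);
  console.log(chunk.start + "-" + chunk.end + ": " + found.join(", "));
}
```

Under this rule a symbol that starts in the overlap between two chunks is listed for both, which is the case the "overlapping chunk symbols" integration test is described as covering.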

Design decisions

  • Single-pass parsing — breakpoints and symbols extracted in one tree-sitter traversal, cached per file
  • DB-first symbol lookup — enrichWithSymbols reads stored symbols from content_vectors at query time instead of re-parsing every search result; falls back to on-demand parsing only for files not yet re-embedded
  • Sequential enrichment — avoids Promise.all on N concurrent WASM parses; DB path is sync anyway
  • Opt-in — Default remains "regex". No behavior change for existing users
  • Graceful degradation — Missing grammars or parse failures silently fall back to regex with no symbols
  • WASM safety — try/finally ensures parser and tree are always freed, even on exception
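
Two of the decisions above, try/finally cleanup and sequential DB-first enrichment, are sketched below with stub types in place of the real WASM parser. Every name here (`StubParser`, `StubTree`, `parseSymbolNames`, `enrichSequentially`, and the callback parameters) is hypothetical; the sketch only assumes that parser and tree objects expose an explicit `delete()` for freeing memory.

```typescript
// Stubs standing in for WASM-backed objects that must be freed explicitly.
interface StubTree {
  symbolNames: string[];
  delete(): void;
}
interface StubParser {
  parse(source: string): StubTree;
  delete(): void;
}

// try/finally guarantees both objects are freed, even if parse() throws.
function parseSymbolNames(makeParser: () => StubParser, source: string): string[] {
  const parser = makeParser();
  let tree: StubTree | undefined;
  try {
    tree = parser.parse(source);
    return tree.symbolNames.slice(); // copy before the tree is freed
  } finally {
    if (tree !== undefined) tree.delete();
    parser.delete();
  }
}

interface SearchHit {
  path: string;
  source: string;
  symbols?: string[];
}

// Sequential enrichment: at most one on-demand parse is in flight at a time,
// unlike Promise.all over N results. Stored symbols are preferred when present.
async function enrichSequentially(
  hits: SearchHit[],
  lookupStored: (path: string) => string[] | undefined, // sync DB read
  parseOnDemand: (source: string) => Promise<string[]>, // fallback path
): Promise<SearchHit[]> {
  const out: SearchHit[] = [];
  for (const hit of hits) {
    const stored = lookupStored(hit.path);
    const symbols =
      stored !== undefined
        ? stored
        : await parseOnDemand(hit.source).catch(() => [] as string[]); // degrade to no symbols
    out.push({ path: hit.path, source: hit.source, symbols });
  }
  return out;
}
```

Awaiting inside the loop trades some latency for a bounded memory footprint, which is the trade-off the "Sequential enrichment" bullet describes.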

Test plan

  • 50 unit tests: all 6 languages, unicode, type/enum, error cases, range filtering
  • 10 integration tests: chunking comparison, markdown regression, overlapping symbols, embedding format
  • Zero new TypeScript type errors
  • Zero new test regressions (6 pre-existing upstream handelize failures unchanged)
  • Rebased cleanly onto current main (24 commits ahead of the merge point of #449, "feat: AST-aware chunking for code files via tree-sitter")

Migration

After upgrading, re-embed to generate symbol metadata:

qmd embed -f --chunk-strategy auto

Existing indexes continue to work — symbols are additive only.
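
The additive migration can be pictured roughly as follows: add the column, and treat only a "duplicate column" error as "already migrated". This is a sketch under assumptions: the `SqlHandle` interface, the `addSymbolsColumn` name, and the `TEXT` column type are invented for illustration, and the actual migration in src/store.ts may differ.

```typescript
// Minimal stand-in for a SQLite handle; only exec() is assumed here.
interface SqlHandle {
  exec(sql: string): void;
}

// Returns true if the column was added, false if it already existed.
function addSymbolsColumn(db: SqlHandle): boolean {
  try {
    // Column type TEXT is an assumption made for this sketch.
    db.exec("ALTER TABLE content_vectors ADD COLUMN symbols TEXT");
    return true;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    // SQLite reports an existing column as "duplicate column name: <name>".
    if (/duplicate column/i.test(message)) return false;
    throw err; // any other failure is a real error and must surface
  }
}
```

Rethrowing everything except the duplicate-column case keeps genuine failures, such as a locked or corrupt database, from being silently ignored.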

🤖 Generated with Claude Code

Builds on AST-aware chunking (tobi#449) to extract symbol metadata
(functions, classes, interfaces, types, enums) during indexing
and surface them through search results.

Key changes:
- Single-pass symbol extraction via tree-sitter during embedding
- Symbols stored in content_vectors and read at query time (no re-parse)
- Sequential enrichment to avoid unbounded WASM memory pressure
- Symbol-enriched embeddings for improved semantic retrieval
- CLI, JSON, MCP, and REST endpoints all return symbols
- Store interface/proxy updated for symbols parameter
- WASM parser cleanup via try/finally (no leak on exception)
- DB migration only swallows "duplicate column" errors
- Standalone test script replaced with proper vitest suite

Tests: 60 passing (50 unit + 10 integration)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>