fix: handle hyphenated and underscore terms in lex and vec/hyde queries #418

Open
fxstein wants to merge 1 commit into tobi:main from fxstein:fix/hyphenated-query-handling

@fxstein fxstein commented Mar 16, 2026

Problem

Hyphens and underscores in compound words and identifiers break both lex and vec/hyde search:

Lex — hyphens (#417): Hyphenated identifiers like DEC-0054, RFC-0011, CVE-2024-1234 are unsearchable:

  • Bare DEC-0054 → parsed as negation (DEC minus 0054) → 0 results
  • Quoted "DEC-0054" → sanitizeFTS5Term() strips hyphen → dec0054 → doesn't match FTS5 unicode61 tokens (dec, 0054) → 0 results

Lex — underscores (#305): Snake_case identifiers like apply_secrets, __init__, my_variable are unsearchable:

  • sanitizeFTS5Term() strips underscores → applysecrets → doesn't match FTS5 unicode61 tokens (apply, secrets) → 0 results

Vec/hyde (#414): validateSemanticQuery() uses /-\w/ which rejects compound words like multi-agent, role-based, chain-of-thought as negation syntax.

Root Cause

FTS5's unicode61 tokenizer splits on hyphens and underscores at index time (DEC-0054 → dec + 0054, apply_secrets → apply + secrets). But sanitizeFTS5Term() strips these characters entirely, concatenating the parts into a single token (dec0054, applysecrets) that can never match the index.
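
To make the mismatch concrete, here is a minimal sketch. `oldSanitizeFTS5Term` reproduces the pre-fix regex from the diff in this PR; `approxUnicode61Tokens` is a hypothetical stand-in for the tokenizer (it only lowercases and splits on non-alphanumeric characters, ignoring porter stemming and diacritic handling), not the real FTS5 implementation.

```typescript
// Pre-fix behaviour: every character that is not a letter, digit, or
// apostrophe is removed, so separators disappear and the parts are fused.
function oldSanitizeFTS5Term(term: string): string {
  return term.replace(/[^\p{L}\p{N}']/gu, "").toLowerCase();
}

// Hypothetical approximation of unicode61 (assumption, not the real tokenizer):
// lowercase, then split on any run of non-letter, non-digit characters.
function approxUnicode61Tokens(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter((t) => t.length > 0);
}

const indexed = approxUnicode61Tokens("DEC-0054"); // ["dec", "0054"]
const queryTerm = oldSanitizeFTS5Term("DEC-0054"); // "dec0054"

// The fused query term is not among the indexed tokens.
console.log(indexed.includes(queryTerm)); // false
```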

Fix

Two changes in store.ts:

  1. sanitizeFTS5Term() — preserve hyphens and underscores:
- return term.replace(/[^\p{L}\p{N}']/gu, '').toLowerCase();
+ return term.replace(/[^\p{L}\p{N}'_-]/gu, '').toLowerCase();

FTS5 applies the same tokenizer to query strings as to indexed content. Preserving the separator lets FTS5 split the query symmetrically — producing precise adjacency/phrase matches. This is simpler and more accurate than splitting at the JS level (which would produce AND terms matching anywhere in the document, not just adjacent occurrences).

  2. validateSemanticQuery(): only match hyphens preceded by whitespace or the start of the string (actual negation), not internal hyphens in compound words:
- if (/-\w/.test(query) || /-"/.test(query)) {
+ if (/(?:^|\s)-[\w"]/.test(query)) {

No changes to buildFTS5Query() — FTS5 handles the splitting correctly when the separator characters are preserved.
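
The effect of both changes can be checked in isolation. The sketch below copies the two regexes from the diffs above; the `oldLooksLikeNegation` / `newLooksLikeNegation` wrappers are hypothetical names that reduce the check to a boolean, since the actual return value of validateSemanticQuery() is not shown in this PR.

```typescript
// Post-fix sanitizer (regex taken from the diff): hyphens and underscores
// are kept so FTS5 can split the term itself at query time.
function sanitizeFTS5Term(term: string): string {
  return term.replace(/[^\p{L}\p{N}'_-]/gu, "").toLowerCase();
}

// Hypothetical wrapper around the pre-fix check: any hyphen followed by a
// word character or a quote counts as negation.
function oldLooksLikeNegation(query: string): boolean {
  return /-\w/.test(query) || /-"/.test(query);
}

// Hypothetical wrapper around the post-fix check: the hyphen must sit at the
// start of the string or directly after whitespace.
function newLooksLikeNegation(query: string): boolean {
  return /(?:^|\s)-[\w"]/.test(query);
}

console.log(sanitizeFTS5Term("DEC-0054"));      // "dec-0054"
console.log(sanitizeFTS5Term("apply_secrets")); // "apply_secrets"

console.log(oldLooksLikeNegation("multi-agent orchestration")); // true (false positive)
console.log(newLooksLikeNegation("multi-agent orchestration")); // false
console.log(newLooksLikeNegation("spawn -orchestrator"));       // true (real negation)
```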

Testing

Verified against a real FTS5 index using the porter unicode61 tokenizer (in-memory, plus a 7,665-document production index):

| Query | Before | After |
| --- | --- | --- |
| lex "DEC-0054" | 0 results | ✅ 93% top hit |
| lex DEC-0054 | 0 results | ✅ 93% top hit |
| lex apply_secrets | 0 results | ✅ match (phrase, adjacent only) |
| lex __init__ | 0 results | ✅ match |
| lex my-app_v2 | 0 results | ✅ match (mixed separators) |
| vec "multi-agent orchestration" | ❌ Negation error | ✅ 88% top hit |
| lex spawn -orchestrator | ✅ Works | ✅ Still works (no regression) |

Precision verified: "apply_secrets" produces a phrase match (1 hit for adjacent apply + secrets), not an AND match (which would also hit documents containing both words non-adjacently). FTS5's symmetric tokenization gives us adjacency for free.
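
The difference between phrase and AND semantics can be illustrated without SQLite. This is a simplified model under stated assumptions: `tokens` is a hypothetical approximation of unicode61 splitting (no porter stemming), and `matchesAll` / `matchesPhrase` are illustrative helpers, not FTS5 or QMD APIs.

```typescript
// Hypothetical approximation of unicode61 splitting (assumption): lowercase,
// then split on runs of non-letter, non-digit characters.
function tokens(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter((t) => t.length > 0);
}

// AND semantics: every term occurs somewhere in the document.
function matchesAll(doc: string[], terms: string[]): boolean {
  return terms.every((t) => doc.includes(t));
}

// Phrase semantics: the terms occur consecutively and in order.
function matchesPhrase(doc: string[], terms: string[]): boolean {
  if (terms.length === 0) return false;
  for (let i = 0; i + terms.length <= doc.length; i++) {
    if (terms.every((t, j) => doc[i + j] === t)) return true;
  }
  return false;
}

const query = tokens("apply_secrets"); // ["apply", "secrets"]
const adjacent = tokens("call apply_secrets before deploy");
const scattered = tokens("apply the patch, then rotate secrets");

console.log(matchesPhrase(adjacent, query));  // true
console.log(matchesAll(scattered, query));    // true  (AND would over-match)
console.log(matchesPhrase(scattered, query)); // false (phrase does not)
```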

Relationship to #404

Complements #404, which also addresses #305 (underscores) via a similar sanitizeFTS5Term change. This PR extends the fix to hyphens and adds the validateSemanticQuery fix for vec/hyde false positives, which #404 does not cover.

Fixes #305, fixes #414, fixes #417
Related: #404

Environment

  • QMD: v2.0.1
  • Platform: macOS (Apple Silicon)
  • Node: v24.2.0

Preserve hyphens and underscores in sanitizeFTS5Term so FTS5's unicode61
tokenizer can split them symmetrically at query time, producing precise
phrase matches. Also fix validateSemanticQuery false positive that rejected
hyphenated terms like DEC-0054 as negation syntax in vec/hyde queries.

Complements tobi#404 (underscore-only fix) by also covering hyphens.
Refs: tobi#305, tobi#417
fxstein force-pushed the fix/hyphenated-query-handling branch from 889b0e3 to b5f4286 on March 18, 2026 12:20
fxstein changed the title from "fix: handle hyphenated terms in lex and vec/hyde queries" to "fix: handle hyphenated and underscore terms in lex and vec/hyde queries" on Mar 18, 2026
zeattacker pushed a commit to zeattacker/qmd that referenced this pull request Mar 26, 2026
Merges dev-upstream-fixes (cherry-picked PRs tobi#462, tobi#463, tobi#455, tobi#418,
tobi#456, tobi#442, tobi#453) into dev. Resolved mcp/server.ts bind conflict —
keep 0.0.0.0 for Docker container accessibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>