fix: handle hyphenated and underscore terms in lex and vec/hyde queries #418
Open
Preserve hyphens and underscores in sanitizeFTS5Term so FTS5's unicode61 tokenizer can split them symmetrically at query time, producing precise phrase matches. Also fix validateSemanticQuery false positive that rejected hyphenated terms like DEC-0054 as negation syntax in vec/hyde queries. Complements tobi#404 (underscore-only fix) by also covering hyphens. Refs: tobi#305, tobi#417
Problem
Hyphens and underscores in compound words and identifiers break both lex and vec/hyde search:

Lex, hyphens (#417): Hyphenated identifiers like `DEC-0054`, `RFC-0011`, `CVE-2024-1234` are unsearchable:
- `DEC-0054` → parsed as negation (`DEC` minus `0054`) → 0 results
- `"DEC-0054"` → `sanitizeFTS5Term()` strips the hyphen → `dec0054` → doesn't match FTS5 `unicode61` tokens (`dec`, `0054`) → 0 results

Lex, underscores (#305): Snake_case identifiers like `apply_secrets`, `__init__`, `my_variable` are unsearchable:
- `sanitizeFTS5Term()` strips underscores → `applysecrets` → doesn't match FTS5 `unicode61` tokens (`apply`, `secrets`) → 0 results

Vec/hyde (#414): `validateSemanticQuery()` uses `/-\w/`, which rejects compound words like `multi-agent`, `role-based`, `chain-of-thought` as negation syntax.

Root Cause
FTS5's `unicode61` tokenizer splits on hyphens and underscores at index time (`DEC-0054` → `dec` + `0054`, `apply_secrets` → `apply` + `secrets`). But `sanitizeFTS5Term()` strips these characters entirely, concatenating the parts into a single token (`dec0054`, `applysecrets`) that can never match the index.

Fix
Two changes in `store.ts`:

`sanitizeFTS5Term()`: preserve hyphens and underscores. FTS5 applies the same tokenizer to query strings as to indexed content. Preserving the separator lets FTS5 split the query symmetrically, producing precise adjacency/phrase matches. This is simpler and more accurate than splitting at the JS level (which would produce AND terms matching anywhere in the document, not just adjacent occurrences).

`validateSemanticQuery()`: only match hyphens preceded by whitespace or string start (actual negation), not internal hyphens in compound words.

No changes to `buildFTS5Query()`: FTS5 handles the splitting correctly when the separator characters are preserved.

Testing
Verified against real FTS5 `porter unicode61` (in-memory + 7,665-document production index):
- `lex "DEC-0054"`
- `lex DEC-0054`
- `lex apply_secrets`
- `lex __init__`
- `lex my-app_v2`
- `vec "multi-agent orchestration"`
- `lex spawn -orchestrator`

Precision verified: `"apply_secrets"` produces a phrase match (1 hit for adjacent `apply` + `secrets`), not an AND match (which would also hit documents containing both words non-adjacently). FTS5's symmetric tokenization gives us adjacency for free.

Relationship to #404
Complements #404, which also addresses #305 (underscores) via a similar `sanitizeFTS5Term` change. This PR extends the fix to hyphens and adds the `validateSemanticQuery` fix for vec/hyde false positives, which #404 does not cover.

Fixes #305, fixes #414, fixes #417
Related: #404
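
For reviewers who want to see the two behaviors side by side, here is a minimal sketch of the tokenization mismatch and the negation false positive described in this PR. It is not the code in `store.ts`. `approxTokens`, `sanitizeStripping`, `sanitizePreserving`, `looseNegation`, and `strictNegation` are hypothetical names. The tokenizer is an ASCII-only approximation of `unicode61`. The `strictNegation` pattern is just one way to express "hyphen preceded by whitespace or string start", so the actual regex in the change may differ.

```typescript
// Illustrative stand-ins only: the real sanitizeFTS5Term() and
// validateSemanticQuery() live in store.ts and may differ.

// ASCII-only approximation of unicode61: runs of letters/digits are tokens,
// everything else (including '-' and '_') acts as a separator.
function approxTokens(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0);
}

// Behavior described under Problem: separators are stripped from the term,
// so the parts are glued into one token.
function sanitizeStripping(term: string): string {
  return term.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Behavior described under Fix: hyphens and underscores survive, so the
// tokenizer can split the query the same way it split the indexed text.
function sanitizePreserving(term: string): string {
  return term.toLowerCase().replace(/[^a-z0-9_-]/g, "");
}

// Pattern quoted in the description: any hyphen followed by a word character.
const looseNegation = /-\w/;

// Assumed tightened form: the hyphen must be at string start or after whitespace.
const strictNegation = /(^|\s)-\w/;

const indexedTokens = approxTokens("DEC-0054");
const oldQueryTokens = approxTokens(sanitizeStripping("DEC-0054"));
const newQueryTokens = approxTokens(sanitizePreserving("DEC-0054"));

console.log(indexedTokens);  // ["dec", "0054"]
console.log(oldQueryTokens); // ["dec0054"], which cannot match the index
console.log(newQueryTokens); // ["dec", "0054"], same split as the index

console.log(looseNegation.test("multi-agent orchestration"));  // true (false positive)
console.log(strictNegation.test("multi-agent orchestration")); // false
console.log(strictNegation.test("spawn -orchestrator"));       // true (real negation)
```

The sketch only compares token lists. Adjacency and phrase matching are left to FTS5 itself, which is the point of preserving the separators rather than splitting in JS.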
Environment