feat(memory): replace LanceDB with tantivy-jieba TantivyStore #162
Compared KA's LanceDB-backed tiered memory with Hermes's file+SQLite FTS5 approach. Key finding: Hermes proves pure text search with CJK tokenization is sufficient — no embedding vectors needed. Created issue #158 with detailed refactoring plan using tantivy + tantivy-jieba. [CC-Main]
- Remove WarmStore (LanceDB), HotStore (LRU), TieredMemoryStore, EmbeddingGenerator/HashEmbedding
- Add TantivyStore: tantivy inverted index + tantivy-jieba CJK tokenization + BM25 scoring
- All filtering (category, confidence, text) pushed down to tantivy queries
- Remove embedding fields from MemoryEntry and MemoryQuery
- Update MemoryConfig: remove embedding_dim/warm_store_path, add tantivy_store_path
- Rename MemoryError::LanceDb to MemoryError::SearchEngine
- Update all upstream crates (kestrel-tools, kestrel-agent, kestrel-heartbeat)
[CC-Main]
- Use RangeQuery::new(Bound<Term>, Bound<Term>) instead of removed new_f64_bounds
- Use TopDocs::with_limit(n).order_by_score() instead of bare TopDocs
- Fix config.tantivy_store_path() field access (not a method)
- Add mut to all IndexWriter lock acquisitions
- Fix missing parentheses in .into() error closures
- Remove dead InvalidEmbedding and ConcurrentWrite error variants
Bahtya
- Use JiebaTokenizer::new() instead of struct literal (private fields)
- Pass collector by reference to searcher.search()
- Apply cargo fmt formatting across all modified files
Bahtya
…e dead schema field
- Replace HotStore/WarmStore/TieredMemoryStore with TantivyStore in gateway.rs and heartbeat.rs
- Remove EmbeddingGenerator/HashEmbedding from all function signatures and test code
- Remove build_memory_entry embedding generation (no longer needed)
- Remove schema field from TantivyStore (unused, caused dead_code clippy warning)
- Update store.rs doc comment
- Update memory config to use tantivy_store_path
Bahtya
…arch JiebaTokenizer does not lowercase tokens by default, causing "Rust" in content to not match "rust" in queries. Chain LowerCaser filter to make BM25 search case-insensitive for both English and CJK text. Bahtya
JiebaTokenizer does not lowercase tokens. Instead of composing TextAnalyzer+LowerCaser (which didn't work with TokenizerManager), explicitly lowercase content at index time and query text at search time. Bahtya
Add content_search field (lowercased, tokenized, not stored) alongside content field (original case, stored). This fixes the test assertion that expects "Rust" in retrieved content while search remains case-insensitive. Bahtya
TantivyStore requires the index directory to exist on disk. The tempdir was being dropped before the store was used, causing the health check to report Unhealthy. Bahtya
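The lifetime issue behind this fix can be illustrated with a minimal standard-library sketch. ScratchDir and is_healthy are hypothetical names invented for this example; they stand in for a temp-dir guard and the store's health check and are not the PR's code.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical temp-dir guard: the directory is deleted when the guard drops.
pub struct ScratchDir {
    path: PathBuf,
}

impl ScratchDir {
    pub fn create(name: &str) -> io::Result<Self> {
        // Process id keeps concurrent runs from colliding on the same path.
        let path = std::env::temp_dir().join(format!("{}-{}", name, std::process::id()));
        fs::create_dir_all(&path)?;
        Ok(Self { path })
    }

    pub fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for ScratchDir {
    fn drop(&mut self) {
        // Best-effort cleanup; errors are ignored in this sketch.
        let _ = fs::remove_dir_all(&self.path);
    }
}

/// Stand-in for a health check that requires the index directory on disk.
pub fn is_healthy(index_dir: &Path) -> bool {
    index_dir.is_dir()
}
```

Copying the path out of a guard that is dropped straight away leaves a path to a directory that no longer exists. Keeping the guard bound to a variable for as long as the store is in use avoids that.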
The capacity check was rejecting upserts when the store was at max capacity because delete_term doesn't immediately update the reader's doc count. Now we detect overwrites and skip capacity enforcement. Bahtya
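A minimal sketch of the overwrite-aware capacity check, assuming a plain HashMap in place of the tantivy index. BoundedStore and StoreError are hypothetical names for illustration; the PR's actual detection logic is not shown in this conversation.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum StoreError {
    CapacityExceeded { max: usize },
}

/// Hypothetical in-memory stand-in for the store; not the PR's TantivyStore.
pub struct BoundedStore {
    max_entries: usize,
    entries: HashMap<String, String>,
}

impl BoundedStore {
    pub fn new(max_entries: usize) -> Self {
        Self { max_entries, entries: HashMap::new() }
    }

    /// Inserts or replaces an entry. Capacity is enforced only for new ids,
    /// because replacing an existing id does not grow the store.
    pub fn upsert(&mut self, id: &str, content: &str) -> Result<(), StoreError> {
        let is_overwrite = self.entries.contains_key(id);
        if !is_overwrite && self.entries.len() >= self.max_entries {
            return Err(StoreError::CapacityExceeded { max: self.max_entries });
        }
        self.entries.insert(id.to_string(), content.to_string());
        Ok(())
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

With a capacity of 2, upserting an id that is already present succeeds when the store is full, while a third distinct id is rejected.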
Adopt CC-Adv's suggestion: apply LowerCaser at the tokenizer level instead of manually lowercasing content. This is more robust as the normalization is handled by tantivy's tokenizer pipeline consistently for both indexing and querying. Bahtya
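The idea of normalizing inside the analysis pipeline can be sketched without tantivy. In the toy index below, one analyze function runs on both indexed content and query text, so the two cannot drift apart, and the stored content keeps its original case. TinyIndex and analyze are hypothetical names; the sketch splits on non-alphanumeric characters and does no CJK segmentation or BM25 scoring.

```rust
use std::collections::{BTreeSet, HashMap};

/// Splits on non-alphanumeric characters and lowercases each token.
/// Stand-in for a tokenizer followed by a lowercasing filter.
pub fn analyze(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

#[derive(Default)]
pub struct TinyIndex {
    docs: Vec<String>,                          // original-case content
    postings: HashMap<String, BTreeSet<usize>>, // lowercased token -> doc ids
}

impl TinyIndex {
    /// Indexes content through the shared pipeline; returns the new doc id.
    pub fn add(&mut self, content: &str) -> usize {
        let id = self.docs.len();
        for token in analyze(content) {
            self.postings.entry(token).or_default().insert(id);
        }
        self.docs.push(content.to_string());
        id
    }

    /// Returns the content of every doc matching any query token.
    pub fn search(&self, query: &str) -> Vec<&str> {
        let mut hits = BTreeSet::new();
        for token in analyze(query) {
            if let Some(ids) = self.postings.get(&token) {
                hits.extend(ids.iter().copied());
            }
        }
        hits.into_iter().map(|id| self.docs[id].as_str()).collect()
    }
}
```

Because lowercasing happens only on tokens, a query for "rust" finds the document and still returns "Rust" in its original case, with no second field needed.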
1. recall() no longer triggers write operations (no access_count update in hot path). Eliminates writer Mutex contention on reads.
2. Revert to single content field with LowerCaser in tokenizer pipeline. No storage overhead from dual-field approach.
Both changes adopted from CC-Adv's review on PR #160. Bahtya
recall() no longer increments access_count. Test now verifies recall is a non-mutating read operation. Bahtya
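A minimal sketch of a non-mutating recall, assuming an in-memory map guarded by std::sync::RwLock as an analogy for the index reader and writer. ReadMostlyStore and Entry are hypothetical names; the PR's store holds a tantivy IndexWriter behind a lock instead.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

#[derive(Clone, Debug, PartialEq)]
pub struct Entry {
    pub content: String,
    pub access_count: u64,
}

/// Hypothetical store in which reads never take the write lock.
pub struct ReadMostlyStore {
    entries: RwLock<HashMap<String, Entry>>,
}

impl ReadMostlyStore {
    pub fn new() -> Self {
        Self { entries: RwLock::new(HashMap::new()) }
    }

    /// Write path: takes the exclusive lock.
    pub fn store(&self, id: &str, content: &str) {
        let mut guard = self.entries.write().unwrap();
        guard.insert(
            id.to_string(),
            Entry { content: content.to_string(), access_count: 0 },
        );
    }

    /// Read path: takes only the shared lock and leaves access_count untouched,
    /// so concurrent recalls do not serialize behind a writer.
    pub fn recall(&self, id: &str) -> Option<Entry> {
        let guard = self.entries.read().unwrap();
        guard.get(id).cloned()
    }
}
```

A test in this style recalls the same entry twice and checks that access_count is identical both times.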
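[CC-Adv] PR #162 adversarial review: comparison with PR #161
PR #161 (CC-Adv's two-layer architecture proposal) passes CI in full, but it conflicts with the already-merged #160 and cannot be merged directly. The review of PR #162 follows.
Correctly implemented parts ✅
Architecture differences: PR #162 vs PR #161
Detailed code review
1. Field organization: a style difference, not a problem. PR #162 uses
2.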
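[CC-Main] Review confirmed complete: requesting merge
Thanks to CC-Adv for the detailed comparative review. Confirming point by point:
CI is 6/6 green and CC-Adv has APPROVED. @Bahtya, please merge this PR.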
Summary
Replaces LanceDB WarmStore with tantivy-jieba inverted index + BM25 full-text search for memory storage and retrieval.
- tantivy_store.rs implementing MemoryStore trait with tantivy 0.26
- tantivy-jieba + LowerCaser for case-insensitive Chinese/English search
Key design decisions (from CC-Adv adversarial review)
- recall() is pure read: no access_count write, avoids writer lock contention
- TextAnalyzer::builder(JiebaTokenizer::new()).filter(LowerCaser).build() for tokenizer pipeline
Addresses
Test plan
Bahtya