feat: Implement semantic search with fastembed-rs and Qdrant #9

cyrusagent · 2025-10-19T17:06:58Z

Summary

Implements semantic search capabilities for the Logseq knowledge base using local embeddings (fastembed-rs) and vector storage (Qdrant). This follows the simplified DDD architecture established in the project.

Key Features

Local Embedding Generation: Uses fastembed-rs with all-MiniLM-L6-v2 model (384 dimensions, ~90MB)
Vector Storage: Qdrant client integration for efficient similarity search
Smart Text Processing: Cleans Logseq syntax, adds hierarchical context, intelligent chunking
Batch Processing: Efficient embedding generation and storage (configurable batch sizes)
Flexible Search: Supports both traditional and semantic search modes
Page Filtering: Scope searches to specific pages

Domain Layer Extensions

Value Objects

ChunkId: Unique identifier for text chunks
EmbeddingVector: 384-dimensional vectors with cosine similarity computation
SimilarityScore: Normalized similarity scores (0.0-1.0)
EmbeddingModel: Supported embedding models enum
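
As an illustration of how the value objects above might fit together, here is a minimal sketch using only the standard library. The type names mirror the list, but the constructor, the `DIMENSIONS` constant, the `from_cosine` helper, and the choice to clamp negative cosines to zero are assumptions for this example, not the PR's actual code.

```rust
/// Number of dimensions produced by all-MiniLM-L6-v2.
pub const DIMENSIONS: usize = 384;

/// Hypothetical stand-in for the PR's `EmbeddingVector` value object.
#[derive(Debug, Clone, PartialEq)]
pub struct EmbeddingVector(Vec<f32>);

impl EmbeddingVector {
    /// Rejects vectors that do not have exactly `DIMENSIONS` components.
    pub fn new(values: Vec<f32>) -> Result<Self, String> {
        if values.len() != DIMENSIONS {
            return Err(format!(
                "expected {DIMENSIONS} dimensions, got {}",
                values.len()
            ));
        }
        Ok(Self(values))
    }

    /// Cosine similarity in [-1.0, 1.0]; returns 0.0 if either vector has zero norm.
    pub fn cosine_similarity(&self, other: &Self) -> f32 {
        let dot: f32 = self.0.iter().zip(&other.0).map(|(a, b)| a * b).sum();
        let norm_a = self.0.iter().map(|a| a * a).sum::<f32>().sqrt();
        let norm_b = other.0.iter().map(|b| b * b).sum::<f32>().sqrt();
        if norm_a == 0.0 || norm_b == 0.0 {
            return 0.0;
        }
        dot / (norm_a * norm_b)
    }
}

/// Hypothetical stand-in for `SimilarityScore`, kept within 0.0-1.0.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct SimilarityScore(f32);

impl SimilarityScore {
    /// Assumption: negative cosine values are clamped to 0.0.
    pub fn from_cosine(cosine: f32) -> Self {
        Self(cosine.clamp(0.0, 1.0))
    }

    pub fn value(&self) -> f32 {
        self.0
    }
}
```

Validating the length in the constructor means every `EmbeddingVector` that exists can be compared safely without re-checking dimensions at each call site.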

Entities

TextChunk: Preprocessed text with full metadata (block/page context, hierarchy, embeddings)

Infrastructure Layer

FastEmbed Service

Local embedding generation using fastembed 5.2
all-MiniLM-L6-v2 model (384 dimensions)
Batch processing for efficiency
Async/await throughout

Qdrant Vector Store

Collection management with cosine distance
Metadata storage (page/block IDs, hierarchy, content)
CRUD operations with filtering support
Statistics and monitoring

Text Preprocessor

Logseq syntax cleaning (TODO/DONE markers, page references, tags)
Context addition (page title + hierarchy path)
Smart chunking with word overlap (default: 150 words, 50 overlap)
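
The cleaning and chunking steps above can be sketched roughly as follows. The PR lists `regex` for text preprocessing; this example deliberately swaps in plain standard-library string methods so it stays self-contained. The function names `clean_logseq` and `chunk_words` are hypothetical, and the real preprocessor may treat edge cases differently.

```rust
/// Strips a leading TODO/DONE marker, unwraps [[page]] references,
/// and drops the leading '#' from tags. Simplified: no regex, no nested syntax.
pub fn clean_logseq(line: &str) -> String {
    let trimmed = line.trim();
    let without_marker = ["TODO ", "DONE "]
        .iter()
        .find_map(|marker| trimmed.strip_prefix(*marker))
        .unwrap_or(trimmed);
    without_marker
        .replace("[[", "")
        .replace("]]", "")
        .split_whitespace()
        .map(|word| word.strip_prefix('#').unwrap_or(word))
        .filter(|word| !word.is_empty())
        .collect::<Vec<_>>()
        .join(" ")
}

/// Splits text into chunks of at most `max_words` words, where each chunk
/// repeats the last `overlap` words of the previous one.
pub fn chunk_words(text: &str, max_words: usize, overlap: usize) -> Vec<String> {
    assert!(max_words > 0, "max_words must be positive");
    assert!(overlap < max_words, "overlap must be smaller than max_words");

    let words: Vec<&str> = text.split_whitespace().collect();
    let step = max_words - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < words.len() {
        let end = (start + max_words).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break; // the final chunk already reaches the end of the text
        }
        start += step;
    }
    chunks
}
```

With the stated defaults (150 words, 50 overlap), the window advances 100 words at a time, so a 350-word block yields three chunks covering words 0-149, 100-249, and 200-349.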

Application Layer

Embedding Service

Orchestrates preprocessing → embedding → storage pipeline
Page-level and bulk operations
Statistics tracking (blocks processed, chunks created/stored, errors)
Delete operations (by page or block)
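
To make the batching and statistics flow concrete, here is a simplified, synchronous sketch. The `Embedder` and `VectorStore` traits, the `EmbeddingStats` fields, and `embed_and_store` are hypothetical names invented for this example; the PR's service is async and talks to fastembed and Qdrant directly.

```rust
/// Hypothetical abstraction over the embedding backend.
pub trait Embedder {
    fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, String>;
}

/// Hypothetical abstraction over the vector store.
pub trait VectorStore {
    fn upsert(&mut self, chunks: &[String], vectors: Vec<Vec<f32>>) -> Result<(), String>;
}

/// Counters loosely modeled on the statistics listed above.
#[derive(Debug, Default, PartialEq)]
pub struct EmbeddingStats {
    pub chunks_created: usize,
    pub chunks_stored: usize,
    pub errors: usize,
}

/// Embeds and stores chunks in batches; a failed batch is counted and skipped.
pub fn embed_and_store<E: Embedder, S: VectorStore>(
    embedder: &E,
    store: &mut S,
    chunks: &[String],
    batch_size: usize,
) -> EmbeddingStats {
    let mut stats = EmbeddingStats {
        chunks_created: chunks.len(),
        ..Default::default()
    };
    // `max(1)` guards against a zero batch size, which `chunks()` would panic on.
    for batch in chunks.chunks(batch_size.max(1)) {
        let result = embedder
            .embed_batch(batch)
            .and_then(|vectors| store.upsert(batch, vectors));
        match result {
            Ok(()) => stats.chunks_stored += batch.len(),
            Err(_) => stats.errors += 1,
        }
    }
    stats
}
```

Counting errors per batch instead of aborting lets a bulk run continue past one bad batch, which is one plausible reading of the "errors" statistic.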

Search Integration

Extended SearchPagesAndBlocks with semantic search
New SearchType enum (Traditional, Semantic)
Async execute() method
Combined semantic + traditional results
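
The description does not say how semantic and traditional results are combined, so the following is only one possible approach: keep semantic hits first and drop traditional hits for blocks already returned. `SearchHit` and `combine_results` are hypothetical; only the `SearchType` variants come from the PR.

```rust
use std::collections::HashSet;

/// Variants as named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SearchType {
    Traditional,
    Semantic,
}

/// Hypothetical result record for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct SearchHit {
    pub block_id: String,
    pub score: f32,
    pub source: SearchType,
}

/// Assumed merge policy: semantic hits first, then traditional hits
/// whose block has not been seen yet.
pub fn combine_results(semantic: Vec<SearchHit>, traditional: Vec<SearchHit>) -> Vec<SearchHit> {
    let mut seen = HashSet::new();
    let mut combined = Vec::new();
    for hit in semantic.into_iter().chain(traditional) {
        // `insert` returns false when the block id was already present.
        if seen.insert(hit.block_id.clone()) {
            combined.push(hit);
        }
    }
    combined
}
```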

Testing

169 total tests (162 passed, 7 ignored)
7 comprehensive semantic search integration tests
All existing tests updated to async/await
Tests marked #[ignore] require running Qdrant instance

Run with Qdrant: cargo test -- --ignored

Dependencies Added

fastembed = "5.2": Local embedding generation
qdrant-client = "1.11": Vector database client
regex = "1.10": Text preprocessing
chrono = "0.4": Timestamp handling

Configuration

Default EmbeddingServiceConfig:

Model: all-MiniLM-L6-v2 (384 dimensions)
Qdrant URL: http://localhost:6334
Collection: logseq_blocks
Max words per chunk: 150 (~512 tokens with margin)
Overlap words: 50
Batch size: 32
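
For readers wiring this up, the defaults above could be expressed as a `Default` implementation along these lines. The values are taken from the list; the field names are guesses and may not match the actual `EmbeddingServiceConfig`.

```rust
/// Sketch of the configuration; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct EmbeddingServiceConfig {
    pub model: String,
    pub dimensions: usize,
    pub qdrant_url: String,
    pub collection: String,
    pub max_words_per_chunk: usize,
    pub overlap_words: usize,
    pub batch_size: usize,
}

impl Default for EmbeddingServiceConfig {
    fn default() -> Self {
        Self {
            model: "all-MiniLM-L6-v2".to_string(),
            dimensions: 384,
            qdrant_url: "http://localhost:6334".to_string(),
            collection: "logseq_blocks".to_string(),
            max_words_per_chunk: 150,
            overlap_words: 50,
            batch_size: 32,
        }
    }
}
```

Under these assumptions, a caller can override a single setting with struct update syntax, for example `EmbeddingServiceConfig { batch_size: 8, ..Default::default() }`.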

Setup Requirements

Requires Qdrant running locally (Docker recommended):

docker run -p 6334:6334 qdrant/qdrant

Files Changed

14 files modified/created
2,210 insertions, 49 deletions
4 new infrastructure modules
1 new application service
1 comprehensive test suite

Breaking Changes

SearchPagesAndBlocks::execute() is now async (requires .await)
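
A rough sketch of what this means for callers: an async caller adds `.await`, while a synchronous caller needs something to drive the future. The `execute` signature and body below are placeholders, not the PR's real ones, and the tiny `block_on` exists only to keep the example dependency-free. In an actual project an async runtime would normally fill that role.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for the PR's use case type.
pub struct SearchPagesAndBlocks;

impl SearchPagesAndBlocks {
    /// Placeholder signature; the real method takes the project's own request type.
    pub async fn execute(&self, query: &str) -> Vec<String> {
        vec![format!("result for {query}")]
    }
}

/// Async callers only need to add `.await`.
pub async fn run_search(query: &str) -> Vec<String> {
    let search = SearchPagesAndBlocks;
    search.execute(query).await
}

/// Waker that does nothing; sufficient because `block_on` polls in a loop.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor for demonstration only.
pub fn block_on<F: Future>(future: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut future = pin!(future);
    loop {
        if let Poll::Ready(output) = future.as_mut().poll(&mut cx) {
            return output;
        }
        std::thread::yield_now();
    }
}
```

A synchronous call site that previously used the return value of `execute()` directly would, under these assumptions, become `block_on(run_search("query"))` or the runtime's equivalent.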

Test Plan

🤖 Generated with Claude Code

Adds comprehensive semantic search capabilities to the Logseq knowledge base application using local embeddings (fastembed-rs) and vector storage (Qdrant). This implementation follows the simplified DDD architecture established in the project.

## Domain Layer Extensions

- **Value Objects**:
  - `ChunkId`: Unique identifier for text chunks (format: block-id-chunk-index)
  - `EmbeddingVector`: 384-dimensional vector with cosine similarity computation
  - `SimilarityScore`: Normalized similarity score (0.0-1.0)
  - `EmbeddingModel`: Enum for supported models (currently all-MiniLM-L6-v2)
- **Entities**:
  - `TextChunk`: Preprocessed text with metadata (block/page context, hierarchy path, embeddings)

## Infrastructure Layer

- **FastEmbed Service** (`fastembed_service.rs`):
  - Local embedding generation using fastembed 5.2
  - all-MiniLM-L6-v2 model (384 dimensions, ~90MB)
  - Batch processing for efficiency
  - Async/await for non-blocking operations
- **Qdrant Vector Store** (`qdrant_store.rs`):
  - Qdrant client integration (requires Docker instance)
  - Collection management with cosine distance metric
  - Metadata storage (page/block IDs, hierarchy, content)
  - Point-based CRUD operations
  - Filter support for page-scoped searches
- **Text Preprocessor** (`text_preprocessor.rs`):
  - Logseq syntax cleaning (removes TODO/DONE markers)
  - Page reference conversion ([[page]] → page)
  - Tag normalization (#tag → tag)
  - Context addition (page title + hierarchy path)
  - Smart chunking with word overlap (configurable)

## Application Layer

- **Embedding Service** (`embedding_service.rs`):
  - Orchestrates preprocessing → embedding → storage pipeline
  - Configurable batch processing (default: 32 chunks)
  - Page-level and bulk operations
  - Statistics tracking (blocks processed, chunks created/stored, errors)
  - Delete operations (by page or block)
  - Vector store statistics
- **Search Integration** (`search.rs`):
  - Extended `SearchPagesAndBlocks` with semantic search support
  - New `SearchType` enum (Traditional, Semantic)
  - Async `execute()` method (breaking change)
  - Combined semantic + traditional search results
  - Maintains existing filtering (pages, result types)

## Testing

- Updated all integration tests to async/await
- Added comprehensive semantic search integration tests:
  - Semantic similarity validation
  - Page filtering
  - Chunking for long content
  - Hierarchical context preservation
  - Semantic vs traditional search comparison
  - Embedding statistics and collection management
  - Delete operations

## Dependencies

- `fastembed = "5.2"`: Local embedding generation
- `qdrant-client = "1.11"`: Vector database client
- `regex = "1.10"`: Text preprocessing
- `chrono = "0.4"`: Timestamp handling

## Configuration

Default configuration (EmbeddingServiceConfig):

- Model: all-MiniLM-L6-v2 (384 dimensions)
- Qdrant URL: http://localhost:6334
- Collection: logseq_blocks
- Max words per chunk: 150 (~512 tokens with margin)
- Overlap words: 50
- Batch size: 32

## Testing

All tests pass (169 total: 162 passed, 7 ignored):

- 7 semantic search tests require running Qdrant instance (ignored by default)
- Use `cargo test -- --ignored` to run with Qdrant

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

weswalla merged commit d019744 into main Oct 20, 2025
1 check passed

weswalla deleted the cyrus/per-6-implement-semantic-search-with-fastembed-rs-qdrant-using branch October 20, 2025 04:24
