@cyrusagent cyrusagent bot commented Oct 19, 2025

Summary

Implements semantic search for the Logseq knowledge base using local embeddings (fastembed-rs) and vector storage (Qdrant). This follows the simplified domain-driven design (DDD) architecture established in the project.

Key Features

  • Local Embedding Generation: Uses fastembed-rs with all-MiniLM-L6-v2 model (384 dimensions, ~90MB)
  • Vector Storage: Qdrant client integration for efficient similarity search
  • Smart Text Processing: Cleans Logseq syntax, adds hierarchical context, intelligent chunking
  • Batch Processing: Efficient embedding generation and storage (configurable batch sizes)
  • Flexible Search: Supports both traditional and semantic search modes
  • Page Filtering: Scope searches to specific pages

Domain Layer Extensions

Value Objects

  • ChunkId: Unique identifier for text chunks
  • EmbeddingVector: 384-dimensional vectors with cosine similarity computation
  • SimilarityScore: Normalized similarity scores (0.0-1.0)
  • EmbeddingModel: Supported embedding models enum
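
As a rough illustration of the value objects above, the sketch below shows one way a fixed-size vector and a normalized score could fit together. The type and method names mirror the PR's vocabulary, but the bodies are assumptions, in particular the mapping of cosine similarity from [-1, 1] onto [0, 1]; the merged code may normalize differently.

```rust
/// Output size of all-MiniLM-L6-v2, as stated in the PR.
pub const DIMENSIONS: usize = 384;

/// Hypothetical stand-in for the PR's `EmbeddingVector` value object.
#[derive(Debug, Clone, PartialEq)]
pub struct EmbeddingVector(Vec<f32>);

/// Hypothetical stand-in for `SimilarityScore`; always within 0.0..=1.0.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct SimilarityScore(f32);

impl EmbeddingVector {
    /// Rejects vectors whose length does not match the model's output size.
    pub fn new(values: Vec<f32>) -> Result<Self, String> {
        if values.len() != DIMENSIONS {
            return Err(format!(
                "expected {} dimensions, got {}",
                DIMENSIONS,
                values.len()
            ));
        }
        Ok(Self(values))
    }

    /// Raw cosine similarity in [-1.0, 1.0]; 0.0 if either vector has zero norm.
    pub fn cosine_similarity(&self, other: &Self) -> f32 {
        let dot: f32 = self.0.iter().zip(&other.0).map(|(a, b)| a * b).sum();
        let norm_a = self.0.iter().map(|a| a * a).sum::<f32>().sqrt();
        let norm_b = other.0.iter().map(|b| b * b).sum::<f32>().sqrt();
        if norm_a == 0.0 || norm_b == 0.0 {
            return 0.0;
        }
        (dot / (norm_a * norm_b)).clamp(-1.0, 1.0)
    }
}

impl SimilarityScore {
    /// Maps cosine similarity from [-1, 1] onto [0, 1] (assumed normalization).
    pub fn from_cosine(cosine: f32) -> Self {
        Self(((cosine + 1.0) / 2.0).clamp(0.0, 1.0))
    }

    pub fn value(&self) -> f32 {
        self.0
    }
}
```

Validating the dimension in the constructor means every `EmbeddingVector` that exists is safe to compare, so `cosine_similarity` needs no length check of its own.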

Entities

  • TextChunk: Preprocessed text with full metadata (block/page context, hierarchy, embeddings)

Infrastructure Layer

FastEmbed Service

  • Local embedding generation using fastembed 5.2
  • all-MiniLM-L6-v2 model (384 dimensions)
  • Batch processing for efficiency
  • Async/await throughout

Qdrant Vector Store

  • Collection management with cosine distance
  • Metadata storage (page/block IDs, hierarchy, content)
  • CRUD operations with filtering support
  • Statistics and monitoring

Text Preprocessor

  • Logseq syntax cleaning (TODO/DONE markers, page references, tags)
  • Context addition (page title + hierarchy path)
  • Smart chunking with word overlap (default: 150 words per chunk, 50-word overlap)
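
The word-overlap chunking can be sketched as a sliding window over whitespace-separated words. This is a minimal illustration under that assumption; `chunk_words` is a hypothetical name, and the PR's preprocessor may split on different boundaries.

```rust
/// Splits `text` into chunks of at most `max_words` words. Each chunk after
/// the first starts with the last `overlap` words of the previous chunk.
pub fn chunk_words(text: &str, max_words: usize, overlap: usize) -> Vec<String> {
    assert!(
        max_words > 0 && overlap < max_words,
        "overlap must be smaller than max_words"
    );
    let words: Vec<&str> = text.split_whitespace().collect();
    if words.is_empty() {
        return Vec::new();
    }

    // The window advances by this many words each iteration.
    let step = max_words - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    loop {
        let end = (start + max_words).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break;
        }
        start += step;
    }
    chunks
}
```

With the defaults (150 words, 50 overlap) the window advances 100 words at a time, so a 400-word block yields four chunks and a block of 150 words or fewer yields exactly one.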

Application Layer

Embedding Service

  • Orchestrates preprocessing → embedding → storage pipeline
  • Page-level and bulk operations
  • Statistics tracking (blocks processed, chunks created/stored, errors)
  • Delete operations (by page or block)

Search Integration

  • Extended SearchPagesAndBlocks with semantic search
  • New SearchType enum (Traditional, Semantic)
  • Async execute() method
  • Combined semantic + traditional results
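
The PR does not spell out how semantic and traditional results are combined. One plausible approach, sketched here purely as an assumption, is to deduplicate by block ID, keep the higher-scoring hit, and sort by score. `Hit` and `merge_hits` are hypothetical names, and treating the two score scales as comparable is a simplification.

```rust
use std::collections::HashMap;

/// Mirrors the `SearchType` enum named in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SearchType {
    Traditional,
    Semantic,
}

/// Hypothetical search hit; the PR's result type is not shown.
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub block_id: String,
    pub score: f32,
    pub source: SearchType,
}

/// Merges both result lists, keeping the best-scoring hit per block and
/// returning hits in descending score order (ties broken by block ID).
pub fn merge_hits(traditional: Vec<Hit>, semantic: Vec<Hit>) -> Vec<Hit> {
    let mut best: HashMap<String, Hit> = HashMap::new();
    for hit in traditional.into_iter().chain(semantic) {
        let replace = best
            .get(&hit.block_id)
            .map_or(true, |existing| hit.score > existing.score);
        if replace {
            best.insert(hit.block_id.clone(), hit);
        }
    }
    let mut merged: Vec<Hit> = best.into_values().collect();
    merged.sort_by(|a, b| {
        b.score
            .total_cmp(&a.score)
            .then_with(|| a.block_id.cmp(&b.block_id))
    });
    merged
}
```

Keeping `source` on each hit lets a caller show which mode found a block even after the lists are merged.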

Testing

  • 169 total tests (162 passed, 7 ignored)
  • 7 comprehensive semantic search integration tests
  • All existing tests updated to async/await
  • Tests marked #[ignore] require running Qdrant instance

Run the ignored tests against a running Qdrant instance: cargo test -- --ignored

Dependencies Added

  • fastembed = "5.2": Local embedding generation
  • qdrant-client = "1.11": Vector database client
  • regex = "1.10": Text preprocessing
  • chrono = "0.4": Timestamp handling

Configuration

Default EmbeddingServiceConfig:

  • Model: all-MiniLM-L6-v2 (384 dimensions)
  • Qdrant URL: http://localhost:6334
  • Collection: logseq_blocks
  • Max words per chunk: 150 (keeps chunks under a 512-token limit with margin)
  • Overlap words: 50
  • Batch size: 32
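
The defaults listed above could be expressed as a `Default` implementation along these lines. The struct name comes from the PR, but the field names are assumptions chosen for illustration.

```rust
/// Sketch of the PR's `EmbeddingServiceConfig`; field names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct EmbeddingServiceConfig {
    pub model_name: String,
    pub dimensions: usize,
    pub qdrant_url: String,
    pub collection_name: String,
    pub max_words_per_chunk: usize,
    pub overlap_words: usize,
    pub batch_size: usize,
}

impl Default for EmbeddingServiceConfig {
    /// Values taken from the defaults listed in the PR description.
    fn default() -> Self {
        Self {
            model_name: "all-MiniLM-L6-v2".to_string(),
            dimensions: 384,
            qdrant_url: "http://localhost:6334".to_string(),
            collection_name: "logseq_blocks".to_string(),
            max_words_per_chunk: 150,
            overlap_words: 50,
            batch_size: 32,
        }
    }
}

impl EmbeddingServiceConfig {
    /// Returns a copy of the config pointing at a different Qdrant endpoint.
    pub fn with_qdrant_url(mut self, url: &str) -> Self {
        self.qdrant_url = url.to_string();
        self
    }
}
```

A caller that only needs a different endpoint could then write `EmbeddingServiceConfig::default().with_qdrant_url("http://qdrant:6334")` and leave the remaining defaults in place.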

Setup Requirements

Requires Qdrant running locally (Docker recommended):

docker run -p 6334:6334 qdrant/qdrant

Files Changed

  • 14 files modified/created
  • 2,210 insertions, 49 deletions
  • 4 new infrastructure modules
  • 1 new application service
  • 1 comprehensive test suite

Breaking Changes

  • SearchPagesAndBlocks::execute() is now async (requires .await)

Test Plan

  • All existing tests pass
  • Domain layer value objects and entities tested
  • Text preprocessing validates Logseq syntax cleaning
  • Embedding generation produces correct dimensions
  • Vector store CRUD operations work correctly
  • Search integration supports both modes
  • Semantic search finds similar content (requires Qdrant)
  • Page filtering works with semantic search
  • Chunking handles long content properly
  • Hierarchical context preserved in embeddings

🤖 Generated with Claude Code

Adds comprehensive semantic search capabilities to the Logseq knowledge base application using local embeddings (fastembed-rs) and vector storage (Qdrant). This implementation follows the simplified DDD architecture established in the project.

## Domain Layer Extensions

- **Value Objects**:
  - `ChunkId`: Unique identifier for text chunks (format: block-id-chunk-index)
  - `EmbeddingVector`: 384-dimensional vector with cosine similarity computation
  - `SimilarityScore`: Normalized similarity score (0.0-1.0)
  - `EmbeddingModel`: Enum for supported models (currently all-MiniLM-L6-v2)

- **Entities**:
  - `TextChunk`: Preprocessed text with metadata (block/page context, hierarchy path, embeddings)
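
A minimal sketch of the `ChunkId` format, reading "block-id-chunk-index" as `<block id>-chunk-<index>`. That reading is an assumption, as are the method names; the merged code may build the identifier differently.

```rust
/// Hypothetical `ChunkId` that encodes the source block and chunk position.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ChunkId(String);

impl ChunkId {
    /// Builds an ID such as "abc-123-chunk-0" from a block ID and chunk index.
    pub fn new(block_id: &str, chunk_index: usize) -> Self {
        Self(format!("{}-chunk-{}", block_id, chunk_index))
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }

    /// Recovers (block ID, chunk index). Splitting on the last "-chunk-"
    /// keeps block IDs that themselves contain hyphens intact.
    pub fn parts(&self) -> Option<(&str, usize)> {
        let (block_id, index) = self.0.rsplit_once("-chunk-")?;
        Some((block_id, index.parse().ok()?))
    }
}
```

Deriving the ID from the block ID and index makes it deterministic, so re-embedding a block produces the same chunk IDs.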

## Infrastructure Layer

- **FastEmbed Service** (`fastembed_service.rs`):
  - Local embedding generation using fastembed 5.2
  - all-MiniLM-L6-v2 model (384 dimensions, ~90MB)
  - Batch processing for efficiency
  - Async/await for non-blocking operations

- **Qdrant Vector Store** (`qdrant_store.rs`):
  - Qdrant client integration (requires a running Qdrant instance, e.g. via Docker)
  - Collection management with cosine distance metric
  - Metadata storage (page/block IDs, hierarchy, content)
  - Point-based CRUD operations
  - Filter support for page-scoped searches

- **Text Preprocessor** (`text_preprocessor.rs`):
  - Logseq syntax cleaning (removes TODO/DONE markers)
  - Page reference conversion ([[page]] → page)
  - Tag normalization (#tag → tag)
  - Context addition (page title + hierarchy path)
  - Smart chunking with word overlap (configurable)
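
The cleaning and context steps above can be illustrated with plain string operations. This sketch deliberately uses only the standard library rather than the `regex` crate the PR depends on, so it is not the PR's implementation; `clean_logseq`, `add_context`, and the `Page > Parent: text` layout are all assumptions.

```rust
/// Removes a leading TODO/DONE marker, unwraps [[page]] references, and
/// strips the '#' from tags. Whitespace is collapsed to single spaces.
pub fn clean_logseq(line: &str) -> String {
    let mut rest = line.trim();
    for marker in ["TODO ", "DONE "] {
        if let Some(stripped) = rest.strip_prefix(marker) {
            rest = stripped;
            break;
        }
    }

    // [[page]] -> page
    let without_refs = rest.replace("[[", "").replace("]]", "");

    // #tag -> tag (a lone '#' is left untouched)
    let words: Vec<&str> = without_refs
        .split_whitespace()
        .map(|word| match word.strip_prefix('#') {
            Some(tag) if !tag.is_empty() => tag,
            _ => word,
        })
        .collect();
    words.join(" ")
}

/// Prefixes cleaned text with the page title and hierarchy path.
/// The "A > B: text" layout is an assumed format.
pub fn add_context(page_title: &str, hierarchy: &[&str], text: &str) -> String {
    let mut parts = vec![page_title];
    parts.extend_from_slice(hierarchy);
    format!("{}: {}", parts.join(" > "), text)
}
```

Prepending the page title and parent path gives short blocks enough surrounding meaning to embed usefully, which appears to be the motivation for the context step.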

## Application Layer

- **Embedding Service** (`embedding_service.rs`):
  - Orchestrates preprocessing → embedding → storage pipeline
  - Configurable batch processing (default: 32 chunks)
  - Page-level and bulk operations
  - Statistics tracking (blocks processed, chunks created/stored, errors)
  - Delete operations (by page or block)
  - Vector store statistics
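
The batch-and-track behaviour of the embedding service can be sketched as follows. This is a synchronous simplification (the PR's service is async), and the `Embedder` and `VectorStore` traits, the stats fields, and the two test doubles are hypothetical stand-ins for the fastembed and Qdrant integrations.

```rust
/// Hypothetical embedding backend; the PR's service wraps fastembed instead.
pub trait Embedder {
    fn embed(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, String>;
}

/// Hypothetical vector store; the PR's store talks to Qdrant instead.
pub trait VectorStore {
    fn upsert(&mut self, points: Vec<(String, Vec<f32>)>) -> Result<(), String>;
}

#[derive(Debug, Default, Clone, PartialEq)]
pub struct EmbeddingStats {
    pub chunks_created: usize,
    pub chunks_stored: usize,
    pub errors: usize,
}

/// Embeds and stores chunks in batches. A failed batch is counted as one
/// error and skipped so the remaining batches are still processed.
pub fn embed_and_store(
    embedder: &dyn Embedder,
    store: &mut dyn VectorStore,
    chunks: &[String],
    batch_size: usize,
) -> EmbeddingStats {
    let mut stats = EmbeddingStats {
        chunks_created: chunks.len(),
        ..Default::default()
    };
    for batch in chunks.chunks(batch_size.max(1)) {
        let outcome = match embedder.embed(batch) {
            Ok(vectors) => store.upsert(batch.iter().cloned().zip(vectors).collect()),
            Err(e) => Err(e),
        };
        match outcome {
            Ok(()) => stats.chunks_stored += batch.len(),
            Err(_) => stats.errors += 1,
        }
    }
    stats
}

/// Test double: one zero vector per text, failing on any empty text.
pub struct FakeEmbedder;

impl Embedder for FakeEmbedder {
    fn embed(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, String> {
        if texts.iter().any(|t| t.is_empty()) {
            return Err("empty chunk".to_string());
        }
        Ok(texts.iter().map(|_| vec![0.0; 384]).collect())
    }
}

/// Test double: keeps points in memory.
#[derive(Default)]
pub struct InMemoryStore {
    pub points: Vec<(String, Vec<f32>)>,
}

impl VectorStore for InMemoryStore {
    fn upsert(&mut self, points: Vec<(String, Vec<f32>)>) -> Result<(), String> {
        self.points.extend(points);
        Ok(())
    }
}
```

With a batch size of 32, 70 chunks are processed as batches of 32, 32, and 6. Counting errors per batch rather than aborting is a design choice of this sketch, not something the PR confirms.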

- **Search Integration** (`search.rs`):
  - Extended `SearchPagesAndBlocks` with semantic search support
  - New `SearchType` enum (Traditional, Semantic)
  - Async `execute()` method (breaking change)
  - Combined semantic + traditional search results
  - Maintains existing filtering (pages, result types)

## Testing

- Updated all integration tests to async/await
- Added comprehensive semantic search integration tests:
  - Semantic similarity validation
  - Page filtering
  - Chunking for long content
  - Hierarchical context preservation
  - Semantic vs traditional search comparison
  - Embedding statistics and collection management
  - Delete operations

## Dependencies

- `fastembed = "5.2"`: Local embedding generation
- `qdrant-client = "1.11"`: Vector database client
- `regex = "1.10"`: Text preprocessing
- `chrono = "0.4"`: Timestamp handling

## Configuration

Default configuration (EmbeddingServiceConfig):
- Model: all-MiniLM-L6-v2 (384 dimensions)
- Qdrant URL: http://localhost:6334
- Collection: logseq_blocks
- Max words per chunk: 150 (keeps chunks under a 512-token limit with margin)
- Overlap words: 50
- Batch size: 32

## Test Results

All tests pass (169 total: 162 passed, 7 ignored):
- 7 semantic search tests require running Qdrant instance (ignored by default)
- Use `cargo test -- --ignored` to run only the ignored tests against a running Qdrant instance

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@weswalla weswalla merged commit d019744 into main Oct 20, 2025
1 check passed
@weswalla weswalla deleted the cyrus/per-6-implement-semantic-search-with-fastembed-rs-qdrant-using branch October 20, 2025 04:24