From f79e9048e21ff4cd3b405a243ccfc087f0592ac1 Mon Sep 17 00:00:00 2001 From: Wesley Finck Date: Sun, 19 Oct 2025 21:30:22 -0700 Subject: [PATCH 1/7] add implementation plans for missing steps --- notes/features/file-page-mapping.md | 952 ++++++++++++++++++++ notes/features/sqlite-persistence.md | 885 +++++++++++++++++++ notes/features/tantivy-search.md | 1079 +++++++++++++++++++++++ notes/features/tauri-integration.md | 1198 ++++++++++++++++++++++++++ 4 files changed, 4114 insertions(+) create mode 100644 notes/features/file-page-mapping.md create mode 100644 notes/features/sqlite-persistence.md create mode 100644 notes/features/tantivy-search.md create mode 100644 notes/features/tauri-integration.md diff --git a/notes/features/file-page-mapping.md b/notes/features/file-page-mapping.md new file mode 100644 index 0000000..fa23db2 --- /dev/null +++ b/notes/features/file-page-mapping.md @@ -0,0 +1,952 @@ +# Fileβ†’Page Mapping Implementation Plan + +## Overview + +Implement bidirectional mapping between file system paths and domain `Page` entities to enable proper deletion handling, conflict resolution, and efficient sync operations. This addresses a critical gap in the current implementation where deleted files cannot be properly tracked. + +## Problem Statement + +### Current Limitations + +The existing `SyncService` implementation has several issues: + +1. **No deletion tracking:** When a `.md` file is deleted, we don't know which `Page` to remove from the repository +2. **No rename detection:** File renames appear as delete + create, losing page history +3. **No conflict resolution:** Can't detect if file and database are out of sync +4. **Inefficient sync:** Must parse file to determine page title for lookup +5. **No source of truth validation:** Can't verify if a page in DB still has a corresponding file + +### Why This Matters + +```rust +// Current SyncService behavior on file deletion: +match event.kind { + FileEventKind::Deleted => { + // PROBLEM: We don't know which PageId to delete! + // File path: /path/to/pages/my-note.md + // Page title: Could be anything (not necessarily "my-note") + // PageId: Could be UUID or derived from title + + // Current workaround: Just log and ignore + tracing::warn!("File deleted: {:?}", path); + } +} +``` + +## Goals + +1. **Enable file deletion sync:** Map file paths to PageIds for proper deletion +2. **Track file metadata:** Store modification times, checksums for conflict detection +3. **Support rename detection:** Recognize when a file moves without losing data +4. **Provide bidirectional lookup:** Find file by page ID and vice versa +5. **Maintain referential integrity:** Ensure mappings stay in sync with repository +6. 
**Persist mappings:** Store in database alongside pages + +## Architecture Layer + +**Infrastructure Layer** (`backend/src/infrastructure/persistence/`) + +This is infrastructure-level concern because: +- File paths are technical implementation details, not domain concepts +- Mapping is required for infrastructure operations (sync, import) +- Domain layer should remain file-system agnostic + +## Design Approach + +### Option 1: Separate FileMappingRepository (Recommended) + +**Pros:** +- Clean separation of concerns +- Independent of PageRepository implementation +- Easy to add additional metadata (checksums, sync status) +- Can query mappings without loading full pages + +**Cons:** +- Additional repository to manage +- Need to keep mappings in sync with pages + +### Option 2: Extend Page Domain Model + +**Pros:** +- Single source of truth +- Atomic updates (page + mapping) + +**Cons:** +- Violates DDD (file paths are not domain concepts) +- Couples domain to infrastructure +- Makes domain objects less portable + +**Decision: Option 1** - Use separate repository following DDD principles. + +## Database Schema + +### New Tables + +```sql +-- migrations/002_file_mapping.sql + +-- File to page mapping table +CREATE TABLE file_page_mappings ( + file_path TEXT PRIMARY KEY NOT NULL, + page_id TEXT NOT NULL, + file_modified_at TIMESTAMP NOT NULL, + file_size_bytes INTEGER NOT NULL, + checksum TEXT, -- SHA-256 hash of file content (optional, for conflict detection) + created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, + updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, + + FOREIGN KEY (page_id) REFERENCES pages(id) ON DELETE CASCADE +); + +-- Index for reverse lookup (page β†’ file) +CREATE INDEX idx_file_mappings_page_id ON file_page_mappings(page_id); + +-- Index for sync queries (find stale files) +CREATE INDEX idx_file_mappings_modified ON file_page_mappings(file_modified_at); + +-- Trigger to update timestamp +CREATE TRIGGER update_file_mappings_timestamp + AFTER UPDATE ON file_page_mappings + FOR EACH ROW +BEGIN + UPDATE file_page_mappings SET updated_at = CURRENT_TIMESTAMP WHERE file_path = OLD.file_path; +END; +``` + +### Design Decisions + +**1. file_path as primary key:** +- Ensures one-to-one mapping (one file = one page) +- Fast lookup for deletion events +- Natural unique identifier from file system + +**2. CASCADE on page deletion:** +- When page is deleted, mapping is automatically removed +- Maintains referential integrity +- Prevents orphaned mappings + +**3. Checksum column (optional):** +- SHA-256 hash for content-based conflict detection +- Can detect "file modified externally" scenarios +- Trade-off: Computational cost vs. accuracy + +**4. 
file_modified_at tracking:** +- Used by SyncService to detect stale files +- Enables "sync only modified files" optimization +- Critical for rename detection (unchanged modification time = rename) + +## Domain Value Objects + +### FilePathMapping (Value Object) + +```rust +// backend/src/infrastructure/persistence/value_objects.rs + +use std::path::{Path, PathBuf}; +use chrono::{DateTime, Utc}; +use crate::domain::{PageId, DomainResult, DomainError}; + +/// Represents a mapping between a file system path and a domain Page +#[derive(Debug, Clone, PartialEq, Eq)] +pub struct FilePathMapping { + file_path: PathBuf, + page_id: PageId, + file_modified_at: DateTime, + file_size_bytes: u64, + checksum: Option, +} + +impl FilePathMapping { + pub fn new( + file_path: impl Into, + page_id: PageId, + file_modified_at: DateTime, + file_size_bytes: u64, + checksum: Option, + ) -> DomainResult { + let file_path = file_path.into(); + + if !file_path.is_absolute() { + return Err(DomainError::InvalidValue( + "File path must be absolute".to_string() + )); + } + + if file_path.extension().and_then(|s| s.to_str()) != Some("md") { + return Err(DomainError::InvalidValue( + "File path must be a .md file".to_string() + )); + } + + Ok(Self { + file_path, + page_id, + file_modified_at, + file_size_bytes, + checksum, + }) + } + + pub fn file_path(&self) -> &Path { + &self.file_path + } + + pub fn page_id(&self) -> &PageId { + &self.page_id + } + + pub fn file_modified_at(&self) -> DateTime { + self.file_modified_at + } + + pub fn file_size_bytes(&self) -> u64 { + self.file_size_bytes + } + + pub fn checksum(&self) -> Option<&str> { + self.checksum.as_deref() + } + + /// Check if file metadata has changed (for conflict detection) + pub fn is_stale(&self, current_modified_at: DateTime) -> bool { + current_modified_at > self.file_modified_at + } + + /// Update metadata (returns new instance - immutable value object) + pub fn with_updated_metadata( + self, + file_modified_at: DateTime, + file_size_bytes: u64, + checksum: Option, + ) -> Self { + Self { + file_modified_at, + file_size_bytes, + checksum, + ..self + } + } +} + +impl ValueObject for FilePathMapping {} +``` + +## Repository Interface + +### FileMappingRepository Trait + +```rust +// backend/src/application/repositories/file_mapping_repository.rs + +use std::path::Path; +use crate::domain::{PageId, DomainResult}; +use crate::infrastructure::persistence::FilePathMapping; + +/// Repository for managing file path to page ID mappings +pub trait FileMappingRepository { + /// Save or update a file mapping + fn save(&mut self, mapping: FilePathMapping) -> DomainResult<()>; + + /// Find mapping by file path + fn find_by_path(&self, path: &Path) -> DomainResult>; + + /// Find mapping by page ID + fn find_by_page_id(&self, page_id: &PageId) -> DomainResult>; + + /// Get all mappings + fn find_all(&self) -> DomainResult>; + + /// Delete mapping by file path + fn delete_by_path(&mut self, path: &Path) -> DomainResult; + + /// Delete mapping by page ID + fn delete_by_page_id(&mut self, page_id: &PageId) -> DomainResult; + + /// Find all files modified after a certain timestamp + fn find_modified_after(&self, timestamp: DateTime) -> DomainResult>; + + /// Batch operations for efficiency + fn save_batch(&mut self, mappings: Vec) -> DomainResult<()>; +} +``` + +## Implementation + +### SqliteFileMappingRepository + +```rust +// backend/src/infrastructure/persistence/sqlite_file_mapping_repository.rs + +use sqlx::{SqlitePool, FromRow}; +use std::path::{Path, PathBuf}; +use 
chrono::{DateTime, Utc}; +use crate::application::repositories::FileMappingRepository; +use crate::domain::{PageId, DomainResult, DomainError}; +use crate::infrastructure::persistence::FilePathMapping; + +#[derive(Debug, FromRow)] +struct FileMappingRow { + file_path: String, + page_id: String, + file_modified_at: DateTime, + file_size_bytes: i64, + checksum: Option, + created_at: DateTime, + updated_at: DateTime, +} + +pub struct SqliteFileMappingRepository { + pool: SqlitePool, +} + +impl SqliteFileMappingRepository { + pub fn new(pool: SqlitePool) -> Self { + Self { pool } + } + + fn row_to_domain(row: FileMappingRow) -> DomainResult { + let page_id = PageId::new(&row.page_id)?; + FilePathMapping::new( + PathBuf::from(row.file_path), + page_id, + row.file_modified_at, + row.file_size_bytes as u64, + row.checksum, + ) + } + + async fn save_async(&mut self, mapping: FilePathMapping) -> DomainResult<()> { + sqlx::query( + "INSERT INTO file_page_mappings + (file_path, page_id, file_modified_at, file_size_bytes, checksum, created_at, updated_at) + VALUES (?, ?, ?, ?, ?, ?, ?) + ON CONFLICT(file_path) DO UPDATE SET + page_id = excluded.page_id, + file_modified_at = excluded.file_modified_at, + file_size_bytes = excluded.file_size_bytes, + checksum = excluded.checksum, + updated_at = excluded.updated_at" + ) + .bind(mapping.file_path().to_string_lossy().as_ref()) + .bind(mapping.page_id().as_str()) + .bind(mapping.file_modified_at()) + .bind(mapping.file_size_bytes() as i64) + .bind(mapping.checksum()) + .bind(Utc::now()) + .bind(Utc::now()) + .execute(&self.pool) + .await + .map_err(|e| DomainError::InvalidOperation(format!("Save mapping error: {}", e)))?; + + Ok(()) + } + + async fn find_by_path_async(&self, path: &Path) -> DomainResult> { + let row: Option = sqlx::query_as( + "SELECT file_path, page_id, file_modified_at, file_size_bytes, checksum, created_at, updated_at + FROM file_page_mappings + WHERE file_path = ?" + ) + .bind(path.to_string_lossy().as_ref()) + .fetch_optional(&self.pool) + .await + .map_err(|e| DomainError::InvalidOperation(format!("Query error: {}", e)))?; + + row.map(Self::row_to_domain).transpose() + } + + async fn find_by_page_id_async(&self, page_id: &PageId) -> DomainResult> { + let row: Option = sqlx::query_as( + "SELECT file_path, page_id, file_modified_at, file_size_bytes, checksum, created_at, updated_at + FROM file_page_mappings + WHERE page_id = ?" 
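+            // Reverse (page -> file) lookup; served by the idx_file_mappings_page_id
+            // index created in migration 002, so this stays an indexed point query.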
+ ) + .bind(page_id.as_str()) + .fetch_optional(&self.pool) + .await + .map_err(|e| DomainError::InvalidOperation(format!("Query error: {}", e)))?; + + row.map(Self::row_to_domain).transpose() + } + + async fn find_all_async(&self) -> DomainResult> { + let rows: Vec = sqlx::query_as( + "SELECT file_path, page_id, file_modified_at, file_size_bytes, checksum, created_at, updated_at + FROM file_page_mappings + ORDER BY file_path" + ) + .fetch_all(&self.pool) + .await + .map_err(|e| DomainError::InvalidOperation(format!("Query error: {}", e)))?; + + rows.into_iter() + .map(Self::row_to_domain) + .collect() + } + + async fn delete_by_path_async(&mut self, path: &Path) -> DomainResult { + let result = sqlx::query("DELETE FROM file_page_mappings WHERE file_path = ?") + .bind(path.to_string_lossy().as_ref()) + .execute(&self.pool) + .await + .map_err(|e| DomainError::InvalidOperation(format!("Delete error: {}", e)))?; + + Ok(result.rows_affected() > 0) + } + + async fn delete_by_page_id_async(&mut self, page_id: &PageId) -> DomainResult { + let result = sqlx::query("DELETE FROM file_page_mappings WHERE page_id = ?") + .bind(page_id.as_str()) + .execute(&self.pool) + .await + .map_err(|e| DomainError::InvalidOperation(format!("Delete error: {}", e)))?; + + Ok(result.rows_affected() > 0) + } + + async fn find_modified_after_async(&self, timestamp: DateTime) -> DomainResult> { + let rows: Vec = sqlx::query_as( + "SELECT file_path, page_id, file_modified_at, file_size_bytes, checksum, created_at, updated_at + FROM file_page_mappings + WHERE file_modified_at > ? + ORDER BY file_modified_at DESC" + ) + .bind(timestamp) + .fetch_all(&self.pool) + .await + .map_err(|e| DomainError::InvalidOperation(format!("Query error: {}", e)))?; + + rows.into_iter() + .map(Self::row_to_domain) + .collect() + } + + async fn save_batch_async(&mut self, mappings: Vec) -> DomainResult<()> { + let mut tx = self.pool.begin().await + .map_err(|e| DomainError::InvalidOperation(format!("Transaction error: {}", e)))?; + + for mapping in mappings { + sqlx::query( + "INSERT INTO file_page_mappings + (file_path, page_id, file_modified_at, file_size_bytes, checksum, created_at, updated_at) + VALUES (?, ?, ?, ?, ?, ?, ?) 
+ ON CONFLICT(file_path) DO UPDATE SET + page_id = excluded.page_id, + file_modified_at = excluded.file_modified_at, + file_size_bytes = excluded.file_size_bytes, + checksum = excluded.checksum, + updated_at = excluded.updated_at" + ) + .bind(mapping.file_path().to_string_lossy().as_ref()) + .bind(mapping.page_id().as_str()) + .bind(mapping.file_modified_at()) + .bind(mapping.file_size_bytes() as i64) + .bind(mapping.checksum()) + .bind(Utc::now()) + .bind(Utc::now()) + .execute(&mut *tx) + .await + .map_err(|e| DomainError::InvalidOperation(format!("Batch insert error: {}", e)))?; + } + + tx.commit().await + .map_err(|e| DomainError::InvalidOperation(format!("Commit error: {}", e)))?; + + Ok(()) + } +} + +impl FileMappingRepository for SqliteFileMappingRepository { + fn save(&mut self, mapping: FilePathMapping) -> DomainResult<()> { + tokio::task::block_in_place(|| { + tokio::runtime::Handle::current().block_on(async { + self.save_async(mapping).await + }) + }) + } + + fn find_by_path(&self, path: &Path) -> DomainResult> { + tokio::task::block_in_place(|| { + tokio::runtime::Handle::current().block_on(async { + self.find_by_path_async(path).await + }) + }) + } + + fn find_by_page_id(&self, page_id: &PageId) -> DomainResult> { + tokio::task::block_in_place(|| { + tokio::runtime::Handle::current().block_on(async { + self.find_by_page_id_async(page_id).await + }) + }) + } + + fn find_all(&self) -> DomainResult> { + tokio::task::block_in_place(|| { + tokio::runtime::Handle::current().block_on(async { + self.find_all_async().await + }) + }) + } + + fn delete_by_path(&mut self, path: &Path) -> DomainResult { + tokio::task::block_in_place(|| { + tokio::runtime::Handle::current().block_on(async { + self.delete_by_path_async(path).await + }) + }) + } + + fn delete_by_page_id(&mut self, page_id: &PageId) -> DomainResult { + tokio::task::block_in_place(|| { + tokio::runtime::Handle::current().block_on(async { + self.delete_by_page_id_async(page_id).await + }) + }) + } + + fn find_modified_after(&self, timestamp: DateTime) -> DomainResult> { + tokio::task::block_in_place(|| { + tokio::runtime::Handle::current().block_on(async { + self.find_modified_after_async(timestamp).await + }) + }) + } + + fn save_batch(&mut self, mappings: Vec) -> DomainResult<()> { + tokio::task::block_in_place(|| { + tokio::runtime::Handle::current().block_on(async { + self.save_batch_async(mappings).await + }) + }) + } +} +``` + +## Service Integration + +### Updated ImportService + +```rust +// backend/src/application/services/import_service.rs + +pub struct ImportService { + page_repository: P, + mapping_repository: M, + max_concurrent_files: usize, +} + +impl ImportService { + pub fn new(page_repository: P, mapping_repository: M) -> Self { + Self { + page_repository, + mapping_repository, + max_concurrent_files: 4, + } + } + + async fn process_file(&mut self, path: PathBuf) -> ImportResult<()> { + // Parse file + let page = LogseqMarkdownParser::parse_file(&path).await?; + let page_id = page.id().clone(); + + // Get file metadata + let metadata = tokio::fs::metadata(&path).await?; + let modified_at = metadata.modified()? + .duration_since(UNIX_EPOCH)? 
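+            // Whole seconds since the Unix epoch; converted back into a
+            // DateTime<Utc> below when the FilePathMapping is constructed.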
+ .as_secs(); + let file_size = metadata.len(); + + // Save page + self.page_repository.save(page)?; + + // Save file mapping + let mapping = FilePathMapping::new( + path, + page_id, + DateTime::from_timestamp(modified_at as i64, 0).unwrap(), + file_size, + None, // Checksum optional for v1 + )?; + self.mapping_repository.save(mapping)?; + + Ok(()) + } +} +``` + +### Updated SyncService + +```rust +// backend/src/application/services/sync_service.rs + +pub struct SyncService { + page_repository: Arc>, + mapping_repository: Arc>, + directory_path: LogseqDirectoryPath, + watcher: LogseqFileWatcher, +} + +impl + SyncService +{ + pub fn new( + page_repository: Arc>, + mapping_repository: Arc>, + directory_path: LogseqDirectoryPath, + ) -> SyncResult { + let watcher = LogseqFileWatcher::new(directory_path.as_path(), Duration::from_millis(500))?; + + Ok(Self { + page_repository, + mapping_repository, + directory_path, + watcher, + }) + } + + async fn handle_file_deleted(&self, path: PathBuf) -> SyncResult<()> { + let mut mapping_repo = self.mapping_repository.lock().await; + let mut page_repo = self.page_repository.lock().await; + + // Find mapping for deleted file + if let Some(mapping) = mapping_repo.find_by_path(&path)? { + let page_id = mapping.page_id().clone(); + + // Delete page from repository + page_repo.delete(&page_id)?; + + // Delete mapping + mapping_repo.delete_by_path(&path)?; + + tracing::info!("Deleted page {} for file {:?}", page_id.as_str(), path); + } else { + tracing::warn!("No mapping found for deleted file: {:?}", path); + } + + Ok(()) + } + + async fn handle_file_created(&self, path: PathBuf) -> SyncResult<()> { + // Check if mapping already exists (rename detection) + let mapping_repo = self.mapping_repository.lock().await; + if let Some(_existing) = mapping_repo.find_by_path(&path)? 
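+        // A hit here means the watcher reported Create for a path that is already
+        // tracked (e.g. an editor that saves via rename), so fall through to the
+        // update path instead of creating a duplicate page.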
{ + drop(mapping_repo); + // File already tracked - treat as update + return self.handle_file_updated(path).await; + } + drop(mapping_repo); + + // Parse and save new file + let page = LogseqMarkdownParser::parse_file(&path).await?; + let page_id = page.id().clone(); + + // Get file metadata + let metadata = tokio::fs::metadata(&path).await?; + let modified_at = DateTime::from_timestamp( + metadata.modified()?.duration_since(UNIX_EPOCH)?.as_secs() as i64, + 0 + ).unwrap(); + + // Save page + let mut page_repo = self.page_repository.lock().await; + page_repo.save(page)?; + drop(page_repo); + + // Save mapping + let mut mapping_repo = self.mapping_repository.lock().await; + let mapping = FilePathMapping::new( + path, + page_id, + modified_at, + metadata.len(), + None, + )?; + mapping_repo.save(mapping)?; + + Ok(()) + } + + async fn handle_file_updated(&self, path: PathBuf) -> SyncResult<()> { + // Get existing mapping + let mapping_repo = self.mapping_repository.lock().await; + let existing_mapping = mapping_repo.find_by_path(&path)?; + drop(mapping_repo); + + // Get current file metadata + let metadata = tokio::fs::metadata(&path).await?; + let current_modified = DateTime::from_timestamp( + metadata.modified()?.duration_since(UNIX_EPOCH)?.as_secs() as i64, + 0 + ).unwrap(); + + // Check if file actually changed + if let Some(mapping) = &existing_mapping { + if !mapping.is_stale(current_modified) { + tracing::debug!("File not modified, skipping: {:?}", path); + return Ok(()); + } + } + + // Parse updated file + let page = LogseqMarkdownParser::parse_file(&path).await?; + let page_id = page.id().clone(); + + // Save page + let mut page_repo = self.page_repository.lock().await; + page_repo.save(page)?; + drop(page_repo); + + // Update mapping + let mut mapping_repo = self.mapping_repository.lock().await; + let mapping = FilePathMapping::new( + path, + page_id, + current_modified, + metadata.len(), + None, + )?; + mapping_repo.save(mapping)?; + + Ok(()) + } +} +``` + +## Rename Detection (Advanced) + +### Strategy + +Detect file renames by comparing: +1. **File size** (unchanged for rename) +2. **Modification time** (unchanged for rename) +3. **Content checksum** (if enabled) + +```rust +impl SyncService +where + P: PageRepository + Send + 'static, + M: FileMappingRepository + Send + 'static, +{ + async fn detect_rename(&self, new_path: PathBuf) -> SyncResult> { + let metadata = tokio::fs::metadata(&new_path).await?; + let size = metadata.len(); + let modified = DateTime::from_timestamp( + metadata.modified()?.duration_since(UNIX_EPOCH)?.as_secs() as i64, + 0 + ).unwrap(); + + // Find all mappings with same size and modification time + let mapping_repo = self.mapping_repository.lock().await; + let all_mappings = mapping_repo.find_all()?; + + for mapping in all_mappings { + // Check if mapping's file no longer exists + if !mapping.file_path().exists() { + // Same size and modification time = likely a rename + if mapping.file_size_bytes() == size + && mapping.file_modified_at() == modified + { + tracing::info!( + "Detected rename: {:?} -> {:?}", + mapping.file_path(), + new_path + ); + return Ok(Some(mapping.page_id().clone())); + } + } + } + + Ok(None) + } + + async fn handle_file_created_with_rename_detection(&self, path: PathBuf) -> SyncResult<()> { + // Try to detect rename + if let Some(page_id) = self.detect_rename(path.clone()).await? 
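+        // Some(page_id) means a tracked file disappeared while this new path has the
+        // same size and mtime, so re-point the existing page's mapping rather than
+        // importing the content as a brand-new page.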
{ + // This is a rename - update mapping + let mut mapping_repo = self.mapping_repository.lock().await; + + // Delete old mapping + mapping_repo.delete_by_page_id(&page_id)?; + + // Create new mapping + let metadata = tokio::fs::metadata(&path).await?; + let mapping = FilePathMapping::new( + path, + page_id, + DateTime::from_timestamp( + metadata.modified()?.duration_since(UNIX_EPOCH)?.as_secs() as i64, + 0 + ).unwrap(), + metadata.len(), + None, + )?; + mapping_repo.save(mapping)?; + + tracing::info!("Updated mapping for renamed file"); + return Ok(()); + } + + // Not a rename - treat as new file + self.handle_file_created(path).await + } +} +``` + +## Testing Strategy + +### Unit Tests + +```rust +#[cfg(test)] +mod tests { + use super::*; + + #[tokio::test] + async fn test_save_and_find_mapping() { + let pool = create_test_pool().await; + let mut repo = SqliteFileMappingRepository::new(pool); + + let page_id = PageId::new("test-page").unwrap(); + let mapping = FilePathMapping::new( + "/path/to/test.md", + page_id.clone(), + Utc::now(), + 1024, + None, + ).unwrap(); + + repo.save(mapping.clone()).unwrap(); + + let found = repo.find_by_path(Path::new("/path/to/test.md")).unwrap(); + assert!(found.is_some()); + assert_eq!(found.unwrap().page_id(), &page_id); + } + + #[tokio::test] + async fn test_delete_cascade() { + // When page is deleted, mapping should also be deleted + let pool = create_test_pool().await; + let mut page_repo = SqlitePageRepository::new(pool.clone()); + let mut mapping_repo = SqliteFileMappingRepository::new(pool); + + let page_id = PageId::new("test").unwrap(); + let page = Page::new(page_id.clone(), "Test".to_string()); + page_repo.save(page).unwrap(); + + let mapping = FilePathMapping::new( + "/path/test.md", + page_id.clone(), + Utc::now(), + 100, + None, + ).unwrap(); + mapping_repo.save(mapping).unwrap(); + + // Delete page + page_repo.delete(&page_id).unwrap(); + + // Mapping should be gone + let found = mapping_repo.find_by_page_id(&page_id).unwrap(); + assert!(found.is_none()); + } +} +``` + +### Integration Tests + +```rust +#[tokio::test] +async fn test_sync_handles_file_deletion() { + let page_repo = Arc::new(Mutex::new(SqlitePageRepository::new_in_memory().await.unwrap())); + let mapping_repo = Arc::new(Mutex::new(SqliteFileMappingRepository::new_in_memory().await.unwrap())); + + let logseq_dir = create_test_logseq_dir(); + let sync_service = SyncService::new(page_repo.clone(), mapping_repo.clone(), logseq_dir).unwrap(); + + // Create and sync a file + let test_file = logseq_dir.join("pages/test.md"); + create_test_file(&test_file, "# Test"); + sync_service.sync_once(None).await.unwrap(); + + // Verify page and mapping exist + let page_repo_lock = page_repo.lock().await; + let pages = page_repo_lock.find_all().unwrap(); + assert_eq!(pages.len(), 1); + drop(page_repo_lock); + + // Delete file + fs::remove_file(&test_file).unwrap(); + sync_service.sync_once(None).await.unwrap(); + + // Verify page and mapping are deleted + let page_repo_lock = page_repo.lock().await; + let pages = page_repo_lock.find_all().unwrap(); + assert_eq!(pages.len(), 0); +} +``` + +## Performance Considerations + +### Optimizations + +1. **Batch inserts during import:** Use `save_batch()` instead of individual saves +2. **Index on page_id:** Fast reverse lookups +3. **Index on file_modified_at:** Efficient "find stale files" queries +4. 
**Avoid full table scans:** Use targeted queries with WHERE clauses + +### Expected Performance + +- **Save mapping:** ~1-2ms +- **Find by path:** ~1ms (indexed) +- **Find by page_id:** ~1-2ms (indexed) +- **Batch save (1000 mappings):** ~100-200ms (transaction) + +## Rollout Plan + +### Phase 1: Foundation βœ… +- [ ] Add `FilePathMapping` value object +- [ ] Create database migration for `file_page_mappings` table +- [ ] Define `FileMappingRepository` trait +- [ ] Implement `SqliteFileMappingRepository` + +### Phase 2: Service Integration βœ… +- [ ] Update `ImportService` to save mappings +- [ ] Update `SyncService` to use mappings for deletion +- [ ] Add mapping updates for file create/update events +- [ ] Add conflict detection using `is_stale()` + +### Phase 3: Advanced Features πŸš€ +- [ ] Implement rename detection algorithm +- [ ] Add checksum calculation (SHA-256) +- [ ] Add checksum-based conflict resolution +- [ ] Add "orphan cleanup" job (mappings without files) + +### Phase 4: Testing & Documentation βœ… +- [ ] Unit tests for repository +- [ ] Integration tests for sync with deletions +- [ ] Integration tests for rename detection +- [ ] Update documentation + +## Open Questions + +1. **Checksum performance:** Should checksums be calculated on-demand or stored? +2. **Rename detection threshold:** How confident do we need to be before treating create as rename? +3. **Orphan cleanup:** Should we automatically delete mappings for missing files, or alert user? +4. **Conflict resolution UI:** How to present file vs. DB conflicts to user? +5. **Multi-device sync:** How will file mappings work across different machines? + +## Future Enhancements + +- **Content-based checksums:** SHA-256 hashing for conflict detection +- **Move detection:** Track directory renames (e.g., `pages/` β†’ `archive/`) +- **Symbolic links:** Handle symlinks to files outside Logseq directory +- **Sync status tracking:** Add `sync_status` column (synced, conflict, deleted) +- **Conflict resolution UI:** Present conflicts to user with merge options +- **Audit log:** Track all fileβ†’page mapping changes for debugging + +## References + +- Git rename detection: https://github.com/git/git/blob/master/diffcore-rename.c +- SQLite foreign key constraints: https://www.sqlite.org/foreignkeys.html +- DDD repository pattern: https://martinfowler.com/eaaCatalog/repository.html diff --git a/notes/features/sqlite-persistence.md b/notes/features/sqlite-persistence.md new file mode 100644 index 0000000..c110c33 --- /dev/null +++ b/notes/features/sqlite-persistence.md @@ -0,0 +1,885 @@ +# SQLite Persistence Implementation Plan + +## Overview + +Implement SQLite-based persistence for the `PageRepository` trait following the existing DDD and layered architecture patterns. This will replace the in-memory test implementations with production-ready database storage. 
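+
+Concretely, the application services keep their `PageRepository` generic and simply receive the SQLite-backed implementation at construction time. Below is a minimal wiring sketch, assuming the `SqlitePageRepository::new` constructor and `ImportService::new` signature described later in this plan (the `build_import_service` helper itself is hypothetical):
+
+```rust
+use std::path::Path;
+
+use crate::application::services::ImportService;
+use crate::infrastructure::persistence::SqlitePageRepository;
+
+/// Hypothetical helper: open (or create) the database file, run migrations,
+/// and hand the repository to the existing import service unchanged.
+async fn build_import_service(
+    db_path: &Path,
+) -> Result<ImportService<SqlitePageRepository>, sqlx::Error> {
+    let repo = SqlitePageRepository::new(db_path).await?;
+    Ok(ImportService::new(repo))
+}
+```
+
+The same substitution applies to `SyncService`, which already holds its repository behind an `Arc<Mutex<...>>`; see "Integration with Existing Services" below.
+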
+ +## Goals + +- Provide durable storage for `Page` aggregates and `Block` entities +- Maintain domain model integrity and DDD boundaries +- Support all existing `PageRepository` operations +- Enable efficient queries for common access patterns +- Lay foundation for full-text search integration + +## Architecture Layer + +**Infrastructure Layer** (`backend/src/infrastructure/persistence/`) + +Following the existing pattern where: +- Domain layer defines pure business logic (unchanged) +- Application layer defines `PageRepository` trait (unchanged) +- Infrastructure layer provides concrete `SqlitePageRepository` implementation + +## Dependencies + +Add to `backend/Cargo.toml`: + +```toml +[dependencies] +# SQLite persistence +sqlx = { version = "0.8", features = ["runtime-tokio", "sqlite", "uuid", "chrono"] } + +[dev-dependencies] +# For testing migrations and database +sqlx = { version = "0.8", features = ["runtime-tokio", "sqlite", "migrate"] } +``` + +**Rationale for sqlx over rusqlite:** +- Async/await support (matches existing Tokio runtime) +- Compile-time query verification with `sqlx::query!` macro +- Built-in migration support +- Better integration with async services (ImportService, SyncService) + +## Database Schema + +### Design Principles + +1. **Aggregate persistence:** Store `Page` as aggregate root with related `Block` entities +2. **Referential integrity:** Foreign keys enforce blockβ†’page and parentβ†’child relationships +3. **Efficient queries:** Indexes on common access patterns (title lookups, hierarchy traversal) +4. **Denormalization:** Store computed values (e.g., block depth) for query performance +5. **JSON columns:** Use for collections (URLs, page references) to avoid join tables + +### Schema Definition + +```sql +-- migrations/001_initial_schema.sql + +-- Pages table (Aggregate Root) +CREATE TABLE pages ( + id TEXT PRIMARY KEY NOT NULL, + title TEXT NOT NULL, + created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, + updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP +); + +-- Index for find_by_title lookups +CREATE INDEX idx_pages_title ON pages(title); + +-- Blocks table (Entity owned by Page aggregate) +CREATE TABLE blocks ( + id TEXT PRIMARY KEY NOT NULL, + page_id TEXT NOT NULL, + content TEXT NOT NULL, + indent_level INTEGER NOT NULL, + parent_id TEXT, + position INTEGER NOT NULL, -- Order within siblings + created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, + updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, + + FOREIGN KEY (page_id) REFERENCES pages(id) ON DELETE CASCADE, + FOREIGN KEY (parent_id) REFERENCES blocks(id) ON DELETE CASCADE +); + +-- Indexes for hierarchy queries +CREATE INDEX idx_blocks_page_id ON blocks(page_id); +CREATE INDEX idx_blocks_parent_id ON blocks(parent_id); +CREATE INDEX idx_blocks_page_parent ON blocks(page_id, parent_id); + +-- URLs extracted from blocks (denormalized for query performance) +CREATE TABLE block_urls ( + block_id TEXT NOT NULL, + url TEXT NOT NULL, + domain TEXT, -- Extracted domain for filtering + position INTEGER NOT NULL, -- Order within block + + PRIMARY KEY (block_id, position), + FOREIGN KEY (block_id) REFERENCES blocks(id) ON DELETE CASCADE +); + +CREATE INDEX idx_block_urls_domain ON block_urls(domain); + +-- Page references from blocks (denormalized) +CREATE TABLE block_page_references ( + block_id TEXT NOT NULL, + reference_text TEXT NOT NULL, + reference_type TEXT NOT NULL CHECK(reference_type IN ('link', 'tag')), + position INTEGER NOT NULL, -- Order within block + + PRIMARY KEY 
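+    -- Composite key: one row per (block, position), preserving the in-block order
+    -- of extracted references, mirroring the block_urls table above.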
(block_id, position), + FOREIGN KEY (block_id) REFERENCES blocks(id) ON DELETE CASCADE +); + +CREATE INDEX idx_block_page_refs_text ON block_page_references(reference_text); +CREATE INDEX idx_block_page_refs_type ON block_page_references(reference_type); + +-- Trigger to update updated_at timestamp on pages +CREATE TRIGGER update_pages_timestamp + AFTER UPDATE ON pages + FOR EACH ROW +BEGIN + UPDATE pages SET updated_at = CURRENT_TIMESTAMP WHERE id = OLD.id; +END; + +-- Trigger to update updated_at timestamp on blocks +CREATE TRIGGER update_blocks_timestamp + AFTER UPDATE ON blocks + FOR EACH ROW +BEGIN + UPDATE blocks SET updated_at = CURRENT_TIMESTAMP WHERE id = OLD.id; +END; +``` + +### Design Decisions + +**1. Denormalized URLs and Page References:** +- **Pro:** Avoids complex joins for common queries like `get_urls_with_context()` +- **Pro:** Better performance for read-heavy workloads +- **Con:** More storage overhead, more complex save logic +- **Decision:** Denormalize - query performance is critical for UI responsiveness + +**2. Position column for ordering:** +- Maintains insertion order for blocks, URLs, and page references +- Essential for preserving document structure +- Enables efficient reconstruction of `Vec`, `Vec`, etc. + +**3. Cascade deletes:** +- Deleting a page removes all blocks automatically +- Deleting a parent block removes all children (subtree deletion) +- Matches `Page::remove_block()` recursive behavior + +**4. Timestamp tracking:** +- `created_at` / `updated_at` for audit trail +- Useful for sync conflict resolution (future) +- Enables "modified since" queries + +## Implementation Structure + +### Directory Layout + +``` +backend/src/infrastructure/ +β”œβ”€β”€ persistence/ +β”‚ β”œβ”€β”€ mod.rs +β”‚ β”œβ”€β”€ sqlite_page_repository.rs # Main implementation +β”‚ β”œβ”€β”€ mappers.rs # Domain ↔ DB mapping +β”‚ β”œβ”€β”€ models.rs # Database row structs +β”‚ └── migrations/ # SQL migration files +β”‚ └── 001_initial_schema.sql +└── mod.rs +``` + +### Core Components + +#### 1. Database Models (`models.rs`) + +```rust +use sqlx::FromRow; +use chrono::{DateTime, Utc}; + +#[derive(Debug, FromRow)] +pub struct PageRow { + pub id: String, + pub title: String, + pub created_at: DateTime, + pub updated_at: DateTime, +} + +#[derive(Debug, FromRow)] +pub struct BlockRow { + pub id: String, + pub page_id: String, + pub content: String, + pub indent_level: i32, + pub parent_id: Option, + pub position: i32, + pub created_at: DateTime, + pub updated_at: DateTime, +} + +#[derive(Debug, FromRow)] +pub struct BlockUrlRow { + pub block_id: String, + pub url: String, + pub domain: Option, + pub position: i32, +} + +#[derive(Debug, FromRow)] +pub struct BlockPageReferenceRow { + pub block_id: String, + pub reference_text: String, + pub reference_type: String, // "link" or "tag" + pub position: i32, +} +``` + +#### 2. Domain Mappers (`mappers.rs`) + +```rust +use crate::domain::{Page, Block, PageId, BlockId, Url, PageReference}; +use crate::domain::base::DomainResult; +use super::models::*; + +pub struct PageMapper; + +impl PageMapper { + /// Convert database rows to Page aggregate + pub fn to_domain( + page_row: PageRow, + block_rows: Vec, + url_rows: Vec, + ref_rows: Vec, + ) -> DomainResult { + // 1. 
Build block lookup maps + let url_map: HashMap> = + url_rows.into_iter() + .sorted_by_key(|r| r.position) + .into_group_map_by(|r| r.block_id.clone()); + + let ref_map: HashMap> = + ref_rows.into_iter() + .sorted_by_key(|r| r.position) + .into_group_map_by(|r| r.block_id.clone()); + + // 2. Convert blocks to domain objects + let blocks: HashMap = block_rows + .into_iter() + .sorted_by_key(|b| b.position) + .map(|row| Self::block_to_domain(row, &url_map, &ref_map)) + .collect::>>()? + .into_iter() + .map(|b| (b.id().clone(), b)) + .collect(); + + // 3. Build Page aggregate + let page_id = PageId::new(&page_row.id)?; + let root_blocks: Vec = blocks.values() + .filter(|b| b.parent_id().is_none()) + .map(|b| b.id().clone()) + .collect(); + + Page::from_raw_parts(page_id, page_row.title, blocks, root_blocks) + } + + fn block_to_domain( + row: BlockRow, + url_map: &HashMap>, + ref_map: &HashMap>, + ) -> DomainResult { + let block_id = BlockId::new(&row.id)?; + let content = BlockContent::new(&row.content)?; + let indent_level = IndentLevel::new(row.indent_level as usize)?; + let parent_id = row.parent_id.map(|id| BlockId::new(id)).transpose()?; + + // Extract URLs + let urls: Vec = url_map + .get(&row.id) + .map(|rows| { + rows.iter() + .map(|r| Url::new(&r.url)) + .collect::>>() + }) + .transpose()? + .unwrap_or_default(); + + // Extract page references + let page_refs: Vec = ref_map + .get(&row.id) + .map(|rows| { + rows.iter() + .map(|r| Self::row_to_page_reference(r)) + .collect::>>() + }) + .transpose()? + .unwrap_or_default(); + + Block::from_raw_parts( + block_id, + content, + indent_level, + parent_id, + Vec::new(), // child_ids populated later + urls, + page_refs, + ) + } + + fn row_to_page_reference(row: &BlockPageReferenceRow) -> DomainResult { + match row.reference_type.as_str() { + "link" => PageReference::new_link(&row.reference_text), + "tag" => PageReference::new_tag(&row.reference_text), + _ => Err(DomainError::InvalidValue( + format!("Unknown reference type: {}", row.reference_type) + )), + } + } + + /// Convert Page aggregate to database rows + pub fn from_domain(page: &Page) -> (PageRow, Vec, Vec, Vec) { + let page_row = PageRow { + id: page.id().as_str().to_string(), + title: page.title().to_string(), + created_at: Utc::now(), + updated_at: Utc::now(), + }; + + let mut block_rows = Vec::new(); + let mut url_rows = Vec::new(); + let mut ref_rows = Vec::new(); + + for (position, block) in page.all_blocks().enumerate() { + block_rows.push(BlockRow { + id: block.id().as_str().to_string(), + page_id: page.id().as_str().to_string(), + content: block.content().as_str().to_string(), + indent_level: block.indent_level().level() as i32, + parent_id: block.parent_id().map(|id| id.as_str().to_string()), + position: position as i32, + created_at: Utc::now(), + updated_at: Utc::now(), + }); + + for (url_pos, url) in block.urls().iter().enumerate() { + url_rows.push(BlockUrlRow { + block_id: block.id().as_str().to_string(), + url: url.as_str().to_string(), + domain: url.domain().map(String::from), + position: url_pos as i32, + }); + } + + for (ref_pos, page_ref) in block.page_references().iter().enumerate() { + ref_rows.push(BlockPageReferenceRow { + block_id: block.id().as_str().to_string(), + reference_text: page_ref.text().to_string(), + reference_type: match page_ref.reference_type() { + ReferenceType::Link => "link", + ReferenceType::Tag => "tag", + }.to_string(), + position: ref_pos as i32, + }); + } + } + + (page_row, block_rows, url_rows, ref_rows) + } +} +``` + +#### 3. 
Repository Implementation (`sqlite_page_repository.rs`) + +```rust +use sqlx::{SqlitePool, sqlite::SqlitePoolOptions}; +use std::path::Path; +use crate::application::repositories::PageRepository; +use crate::domain::{Page, PageId}; +use crate::domain::base::{DomainResult, DomainError}; +use super::{models::*, mappers::PageMapper}; + +pub struct SqlitePageRepository { + pool: SqlitePool, +} + +impl SqlitePageRepository { + /// Create a new repository with an in-memory database (for testing) + pub async fn new_in_memory() -> Result { + let pool = SqlitePoolOptions::new() + .max_connections(5) + .connect("sqlite::memory:") + .await?; + + sqlx::migrate!("./migrations").run(&pool).await?; + + Ok(Self { pool }) + } + + /// Create a new repository with a file-based database + pub async fn new(db_path: impl AsRef) -> Result { + let db_url = format!("sqlite://{}?mode=rwc", db_path.as_ref().display()); + + let pool = SqlitePoolOptions::new() + .max_connections(5) + .connect(&db_url) + .await?; + + sqlx::migrate!("./migrations").run(&pool).await?; + + Ok(Self { pool }) + } + + /// Load all related rows for a page + async fn load_page_data(&self, page_id: &str) + -> Result, Vec, Vec)>, sqlx::Error> + { + let page_row: Option = sqlx::query_as( + "SELECT id, title, created_at, updated_at FROM pages WHERE id = ?" + ) + .bind(page_id) + .fetch_optional(&self.pool) + .await?; + + let Some(page_row) = page_row else { + return Ok(None); + }; + + let block_rows: Vec = sqlx::query_as( + "SELECT id, page_id, content, indent_level, parent_id, position, created_at, updated_at + FROM blocks + WHERE page_id = ? + ORDER BY position ASC" + ) + .bind(page_id) + .fetch_all(&self.pool) + .await?; + + let block_ids: Vec = block_rows.iter() + .map(|b| b.id.clone()) + .collect(); + + let url_rows: Vec = if !block_ids.is_empty() { + let placeholders = block_ids.iter().map(|_| "?").join(","); + let query = format!( + "SELECT block_id, url, domain, position + FROM block_urls + WHERE block_id IN ({}) + ORDER BY block_id, position", + placeholders + ); + + let mut q = sqlx::query_as(&query); + for id in &block_ids { + q = q.bind(id); + } + q.fetch_all(&self.pool).await? + } else { + Vec::new() + }; + + let ref_rows: Vec = if !block_ids.is_empty() { + let placeholders = block_ids.iter().map(|_| "?").join(","); + let query = format!( + "SELECT block_id, reference_text, reference_type, position + FROM block_page_references + WHERE block_id IN ({}) + ORDER BY block_id, position", + placeholders + ); + + let mut q = sqlx::query_as(&query); + for id in &block_ids { + q = q.bind(id); + } + q.fetch_all(&self.pool).await? 
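+            // NOTE: the IN (...) list is expanded into one "?" per block id and each id
+            // is bound individually, since sqlx has no array bind parameter for SQLite.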
+ } else { + Vec::new() + }; + + Ok(Some((page_row, block_rows, url_rows, ref_rows))) + } +} + +impl PageRepository for SqlitePageRepository { + fn save(&mut self, page: Page) -> DomainResult<()> { + // Use async block with tokio runtime + tokio::task::block_in_place(|| { + tokio::runtime::Handle::current().block_on(async { + self.save_async(page).await + }) + }) + } + + fn find_by_id(&self, id: &PageId) -> DomainResult> { + tokio::task::block_in_place(|| { + tokio::runtime::Handle::current().block_on(async { + self.find_by_id_async(id).await + }) + }) + } + + fn find_by_title(&self, title: &str) -> DomainResult> { + tokio::task::block_in_place(|| { + tokio::runtime::Handle::current().block_on(async { + self.find_by_title_async(title).await + }) + }) + } + + fn find_all(&self) -> DomainResult> { + tokio::task::block_in_place(|| { + tokio::runtime::Handle::current().block_on(async { + self.find_all_async().await + }) + }) + } + + fn delete(&mut self, id: &PageId) -> DomainResult { + tokio::task::block_in_place(|| { + tokio::runtime::Handle::current().block_on(async { + self.delete_async(id).await + }) + }) + } +} + +impl SqlitePageRepository { + /// Async implementation of save (upsert) + async fn save_async(&mut self, page: Page) -> DomainResult<()> { + let (page_row, block_rows, url_rows, ref_rows) = PageMapper::from_domain(&page); + + let mut tx = self.pool.begin().await + .map_err(|e| DomainError::InvalidOperation(format!("Transaction error: {}", e)))?; + + // Upsert page + sqlx::query( + "INSERT INTO pages (id, title, created_at, updated_at) + VALUES (?, ?, ?, ?) + ON CONFLICT(id) DO UPDATE SET + title = excluded.title, + updated_at = excluded.updated_at" + ) + .bind(&page_row.id) + .bind(&page_row.title) + .bind(page_row.created_at) + .bind(page_row.updated_at) + .execute(&mut *tx) + .await + .map_err(|e| DomainError::InvalidOperation(format!("Save page error: {}", e)))?; + + // Delete existing blocks (cascade will delete URLs and refs) + sqlx::query("DELETE FROM blocks WHERE page_id = ?") + .bind(&page_row.id) + .execute(&mut *tx) + .await + .map_err(|e| DomainError::InvalidOperation(format!("Delete blocks error: {}", e)))?; + + // Insert blocks + for block in block_rows { + sqlx::query( + "INSERT INTO blocks (id, page_id, content, indent_level, parent_id, position, created_at, updated_at) + VALUES (?, ?, ?, ?, ?, ?, ?, ?)" + ) + .bind(&block.id) + .bind(&block.page_id) + .bind(&block.content) + .bind(block.indent_level) + .bind(&block.parent_id) + .bind(block.position) + .bind(block.created_at) + .bind(block.updated_at) + .execute(&mut *tx) + .await + .map_err(|e| DomainError::InvalidOperation(format!("Insert block error: {}", e)))?; + } + + // Insert URLs + for url in url_rows { + sqlx::query( + "INSERT INTO block_urls (block_id, url, domain, position) VALUES (?, ?, ?, ?)" + ) + .bind(&url.block_id) + .bind(&url.url) + .bind(&url.domain) + .bind(url.position) + .execute(&mut *tx) + .await + .map_err(|e| DomainError::InvalidOperation(format!("Insert URL error: {}", e)))?; + } + + // Insert page references + for page_ref in ref_rows { + sqlx::query( + "INSERT INTO block_page_references (block_id, reference_text, reference_type, position) + VALUES (?, ?, ?, ?)" + ) + .bind(&page_ref.block_id) + .bind(&page_ref.reference_text) + .bind(&page_ref.reference_type) + .bind(page_ref.position) + .execute(&mut *tx) + .await + .map_err(|e| DomainError::InvalidOperation(format!("Insert page ref error: {}", e)))?; + } + + tx.commit().await + .map_err(|e| 
DomainError::InvalidOperation(format!("Commit error: {}", e)))?; + + Ok(()) + } + + async fn find_by_id_async(&self, id: &PageId) -> DomainResult> { + let data = self.load_page_data(id.as_str()).await + .map_err(|e| DomainError::InvalidOperation(format!("Load error: {}", e)))?; + + match data { + Some((page_row, block_rows, url_rows, ref_rows)) => { + let page = PageMapper::to_domain(page_row, block_rows, url_rows, ref_rows)?; + Ok(Some(page)) + } + None => Ok(None), + } + } + + async fn find_by_title_async(&self, title: &str) -> DomainResult> { + let page_row: Option = sqlx::query_as( + "SELECT id, title, created_at, updated_at FROM pages WHERE title = ?" + ) + .bind(title) + .fetch_optional(&self.pool) + .await + .map_err(|e| DomainError::InvalidOperation(format!("Query error: {}", e)))?; + + match page_row { + Some(row) => { + let page_id = PageId::new(&row.id)?; + self.find_by_id_async(&page_id).await + } + None => Ok(None), + } + } + + async fn find_all_async(&self) -> DomainResult> { + let page_rows: Vec = sqlx::query_as( + "SELECT id, title, created_at, updated_at FROM pages ORDER BY title" + ) + .fetch_all(&self.pool) + .await + .map_err(|e| DomainError::InvalidOperation(format!("Query error: {}", e)))?; + + let mut pages = Vec::new(); + for row in page_rows { + let page_id = PageId::new(&row.id)?; + if let Some(page) = self.find_by_id_async(&page_id).await? { + pages.push(page); + } + } + + Ok(pages) + } + + async fn delete_async(&mut self, id: &PageId) -> DomainResult { + let result = sqlx::query("DELETE FROM pages WHERE id = ?") + .bind(id.as_str()) + .execute(&self.pool) + .await + .map_err(|e| DomainError::InvalidOperation(format!("Delete error: {}", e)))?; + + Ok(result.rows_affected() > 0) + } +} +``` + +## Migration Strategy + +### Setup sqlx-cli + +```bash +cargo install sqlx-cli --no-default-features --features sqlite +``` + +### Create migrations + +```bash +cd backend +sqlx migrate add initial_schema +# Edit migrations/XXX_initial_schema.sql with schema above +sqlx migrate run --database-url sqlite://logjam.db +``` + +### Compile-time verification + +```bash +# Prepare for offline mode (CI/CD) +DATABASE_URL=sqlite://logjam.db cargo sqlx prepare +``` + +## Testing Strategy + +### Unit Tests + +```rust +#[cfg(test)] +mod tests { + use super::*; + + #[tokio::test] + async fn test_save_and_find_page() { + let mut repo = SqlitePageRepository::new_in_memory().await.unwrap(); + + // Create test page + let page_id = PageId::new("test-page").unwrap(); + let page = Page::new(page_id.clone(), "Test Page".to_string()); + + // Save + repo.save(page.clone()).unwrap(); + + // Find by ID + let found = repo.find_by_id(&page_id).unwrap(); + assert!(found.is_some()); + assert_eq!(found.unwrap().title(), "Test Page"); + } + + #[tokio::test] + async fn test_save_page_with_blocks() { + let mut repo = SqlitePageRepository::new_in_memory().await.unwrap(); + + // Create page with hierarchical blocks + let page_id = PageId::new("page-with-blocks").unwrap(); + let mut page = Page::new(page_id.clone(), "Page With Blocks".to_string()); + + let root_block = Block::new_root( + BlockId::generate(), + BlockContent::new("Root block").unwrap(), + ); + page.add_block(root_block.clone()).unwrap(); + + let child_block = Block::new_child( + BlockId::generate(), + BlockContent::new("Child block with [[link]] and #tag").unwrap(), + IndentLevel::new(1).unwrap(), + root_block.id().clone(), + ); + page.add_block(child_block).unwrap(); + + // Save and reload + repo.save(page.clone()).unwrap(); + let loaded = 
repo.find_by_id(&page_id).unwrap().unwrap(); + + // Verify structure + assert_eq!(loaded.all_blocks().count(), 2); + assert_eq!(loaded.root_blocks().len(), 1); + } + + #[tokio::test] + async fn test_delete_cascade() { + let mut repo = SqlitePageRepository::new_in_memory().await.unwrap(); + + let page_id = PageId::new("delete-test").unwrap(); + let page = Page::new(page_id.clone(), "Delete Test".to_string()); + + repo.save(page).unwrap(); + let deleted = repo.delete(&page_id).unwrap(); + + assert!(deleted); + assert!(repo.find_by_id(&page_id).unwrap().is_none()); + } +} +``` + +### Integration Tests + +```rust +// backend/tests/sqlite_integration_test.rs + +#[tokio::test] +async fn test_import_service_with_sqlite() { + let repo = SqlitePageRepository::new_in_memory().await.unwrap(); + let mut import_service = ImportService::new(repo); + + let logseq_dir = LogseqDirectoryPath::new("./test-fixtures/sample-logseq").unwrap(); + let summary = import_service.import_directory(logseq_dir, None).await.unwrap(); + + // Verify pages were persisted + assert!(summary.total_processed > 0); +} +``` + +## Performance Considerations + +### Optimizations + +1. **Batch inserts:** Use transactions to batch all inserts for a page +2. **Connection pooling:** Reuse connections via `SqlitePool` (5 connections) +3. **Prepared statements:** sqlx automatically prepares and caches statements +4. **Indexes:** Create indexes on commonly queried columns (title, parent_id, page_id) +5. **Lazy loading:** Only load blocks when page is accessed (not implemented in v1) + +### Expected Performance + +- **Save page:** ~10-50ms for page with 100 blocks (with transaction) +- **Find by ID:** ~5-20ms for page with 100 blocks +- **Find by title:** ~2-5ms (indexed lookup) + page load time +- **Find all:** ~N * 20ms for N pages (could optimize with bulk loading) + +## Integration with Existing Services + +### ImportService Changes + +```rust +// backend/src/application/services/import_service.rs + +impl ImportService { + // No changes needed! 
Repository is injected via generic +} + +// Usage in Tauri commands (future): +let db_path = app_data_dir.join("logjam.db"); +let repo = SqlitePageRepository::new(&db_path).await?; +let mut import_service = ImportService::new(repo); +``` + +### SyncService Changes + +```rust +// backend/src/application/services/sync_service.rs + +// Already uses Arc> for concurrent access +// No changes needed for SQLite integration + +// Usage: +let repo = Arc::new(Mutex::new(SqlitePageRepository::new(&db_path).await?)); +let sync_service = SyncService::new(repo, logseq_dir)?; +``` + +## Rollout Plan + +### Phase 1: Infrastructure Setup βœ… +- [ ] Add sqlx dependency +- [ ] Create database schema migration +- [ ] Create database models (`models.rs`) +- [ ] Create domain mappers (`mappers.rs`) + +### Phase 2: Repository Implementation βœ… +- [ ] Implement `SqlitePageRepository` +- [ ] Add `save_async()` with transaction support +- [ ] Add `find_by_id_async()` with eager loading +- [ ] Add `find_by_title_async()` +- [ ] Add `find_all_async()` +- [ ] Add `delete_async()` with cascade + +### Phase 3: Testing βœ… +- [ ] Unit tests for repository methods +- [ ] Unit tests for mappers (domain ↔ DB) +- [ ] Integration tests with ImportService +- [ ] Integration tests with SyncService +- [ ] Performance benchmarks + +### Phase 4: Documentation βœ… +- [ ] Update IMPLEMENTATION.md with persistence layer +- [ ] Add database schema documentation +- [ ] Add migration guide +- [ ] Update README with setup instructions + +## Open Questions + +1. **Database location:** Should DB path be configurable via environment variable or app config? +2. **Migration strategy:** How to handle schema upgrades in production? +3. **Backup strategy:** Should we implement automatic database backups? +4. **Concurrency:** Do we need PRAGMA settings for better concurrent access? +5. **Vacuum:** Should we periodically run VACUUM to reclaim space? + +## Future Enhancements + +- **Lazy loading:** Load blocks on-demand for large pages +- **Bulk operations:** Optimize `find_all()` with a single query + JOIN +- **Soft deletes:** Add `deleted_at` column instead of hard deletes +- **Audit trail:** Track all changes with event sourcing table +- **Read replicas:** Use separate read-only connections for queries +- **Caching layer:** Add in-memory cache for frequently accessed pages + +## References + +- sqlx documentation: https://docs.rs/sqlx/ +- SQLite transaction best practices: https://www.sqlite.org/lang_transaction.html +- DDD repository pattern: https://martinfowler.com/eaaCatalog/repository.html diff --git a/notes/features/tantivy-search.md b/notes/features/tantivy-search.md new file mode 100644 index 0000000..a9da9d6 --- /dev/null +++ b/notes/features/tantivy-search.md @@ -0,0 +1,1079 @@ +# Tantivy Full-Text Search Implementation Plan + +## Overview + +Implement full-text search capabilities using Tantivy, a high-performance search engine library written in Rust. This will enable fast, typo-tolerant searches across pages, blocks, URLs, and page references following the existing DDD architecture. + +## Goals + +1. **Fast full-text search** across all content (pages, blocks, URLs, tags) +2. **Typo-tolerant fuzzy search** for better user experience +3. **Ranked results** with relevance scoring (BM25 algorithm) +4. **Filter by content type** (pages vs blocks, tags vs links) +5. **Incremental indexing** (update index as content changes) +6. **Real-time sync** with repository changes +7. 
**Performant queries** (<50ms for typical searches) + +## Why Tantivy? + +**Tantivy** is a modern full-text search library similar to Apache Lucene: + +- **Pure Rust:** Memory-safe, fast, integrates seamlessly with our codebase +- **Battle-tested:** Used in production by Quickwit, Meilisearch +- **Full-featured:** BM25 ranking, fuzzy search, highlighting, faceting +- **Performant:** ~10-100x faster than SQLite FTS5 for complex queries +- **Embedded:** No separate server process needed (unlike Elasticsearch) + +**Alternatives considered:** +- SQLite FTS5: Limited features, slower for fuzzy search +- Meilisearch: Requires separate server, heavier weight +- Typesense: Also requires server, adds deployment complexity + +## Architecture Layer + +**Infrastructure Layer** (`backend/src/infrastructure/search/`) + +Search is infrastructure because: +- Domain layer remains search-agnostic +- Tantivy is an implementation detail (could swap with Meilisearch) +- Search index is a derived data structure (not source of truth) + +## Dependencies + +```toml +# backend/Cargo.toml + +[dependencies] +tantivy = "0.22" # Full-text search engine +``` + +## Index Schema Design + +### Document Structure + +Tantivy indexes "documents" with typed fields. We'll create a flattened search index: + +```rust +// backend/src/infrastructure/search/schema.rs + +use tantivy::schema::*; + +pub struct SearchSchema { + schema: Schema, + + // Document ID fields + pub page_id: Field, + pub block_id: Field, // Empty for page-level documents + + // Content fields (searchable) + pub page_title: Field, + pub block_content: Field, + pub urls: Field, + pub page_references: Field, + + // Metadata fields (filterable, not searchable) + pub document_type: Field, // "page" or "block" + pub reference_type: Field, // "link", "tag", or empty + pub indent_level: Field, + + // Ranking signals + pub url_domains: Field, +} + +impl SearchSchema { + pub fn build() -> (Schema, Self) { + let mut schema_builder = Schema::builder(); + + // IDs (stored, not indexed for search) + let page_id = schema_builder.add_text_field("page_id", STRING | STORED); + let block_id = schema_builder.add_text_field("block_id", STRING | STORED); + + // Searchable text fields with different weights + let page_title = schema_builder.add_text_field( + "page_title", + TextOptions::default() + .set_indexing_options( + TextFieldIndexing::default() + .set_tokenizer("en_stem") // English stemming + .set_index_option(IndexRecordOption::WithFreqsAndPositions) + ) + .set_stored() + ); + + let block_content = schema_builder.add_text_field( + "block_content", + TextOptions::default() + .set_indexing_options( + TextFieldIndexing::default() + .set_tokenizer("en_stem") + .set_index_option(IndexRecordOption::WithFreqsAndPositions) + ) + .set_stored() + ); + + let urls = schema_builder.add_text_field( + "urls", + TextOptions::default() + .set_indexing_options( + TextFieldIndexing::default() + .set_tokenizer("raw") // Don't stem URLs + .set_index_option(IndexRecordOption::WithFreqsAndPositions) + ) + ); + + let page_references = schema_builder.add_text_field( + "page_references", + TextOptions::default() + .set_indexing_options( + TextFieldIndexing::default() + .set_tokenizer("en_stem") + .set_index_option(IndexRecordOption::WithFreqsAndPositions) + ) + ); + + // Facet fields for filtering + let document_type = schema_builder.add_facet_field("document_type", STORED); + let reference_type = schema_builder.add_facet_field("reference_type", STORED); + let indent_level = 
schema_builder.add_u64_field("indent_level", INDEXED | STORED); + + // Domain extraction for URL filtering + let url_domains = schema_builder.add_facet_field("url_domains", INDEXED); + + let schema = schema_builder.build(); + + (schema.clone(), Self { + schema, + page_id, + block_id, + page_title, + block_content, + urls, + page_references, + document_type, + reference_type, + indent_level, + url_domains, + }) + } +} +``` + +### Indexing Strategy + +**Two document types:** + +1. **Page documents:** For searching by page title + - `page_id`: PageId + - `page_title`: Searchable page title + - `document_type`: "page" + +2. **Block documents:** For searching block content + - `page_id`: Parent page ID + - `block_id`: Block ID + - `block_content`: Searchable content + - `urls`: Concatenated URLs + - `page_references`: Concatenated page refs + - `document_type`: "block" + - `indent_level`: Hierarchy depth + +**Rationale:** Separate document types allow: +- Title-only search ("find page titled X") +- Content-only search ("find blocks containing X") +- Combined search with different weights (title matches rank higher) + +## Search Index Implementation + +### TantivySearchIndex + +```rust +// backend/src/infrastructure/search/tantivy_index.rs + +use tantivy::{Index, IndexWriter, IndexReader, ReloadPolicy, TantivyDocument}; +use tantivy::collector::TopDocs; +use tantivy::query::{QueryParser, FuzzyTermQuery, BooleanQuery, Occur}; +use tantivy::schema::*; +use std::path::Path; +use crate::domain::{Page, Block, PageId, BlockId}; +use super::schema::SearchSchema; + +pub struct TantivySearchIndex { + index: Index, + schema: SearchSchema, + writer: IndexWriter, + reader: IndexReader, +} + +impl TantivySearchIndex { + /// Create new search index at the given directory + pub fn new(index_dir: impl AsRef) -> tantivy::Result { + let (schema_def, schema) = SearchSchema::build(); + + let index = Index::create_in_dir(index_dir, schema_def.clone())?; + + let writer = index.writer(50_000_000)?; // 50MB heap + + let reader = index + .reader_builder() + .reload_policy(ReloadPolicy::OnCommitWithDelay) + .try_into()?; + + Ok(Self { + index, + schema, + writer, + reader, + }) + } + + /// Create in-memory index (for testing) + pub fn new_in_memory() -> tantivy::Result { + let (schema_def, schema) = SearchSchema::build(); + + let index = Index::create_in_ram(schema_def.clone()); + + let writer = index.writer(50_000_000)?; + + let reader = index + .reader_builder() + .reload_policy(ReloadPolicy::OnCommitWithDelay) + .try_into()?; + + Ok(Self { + index, + schema, + writer, + reader, + }) + } + + /// Index a page (creates page document + block documents) + pub fn index_page(&mut self, page: &Page) -> tantivy::Result<()> { + // Create page document + let mut page_doc = TantivyDocument::default(); + page_doc.add_text(self.schema.page_id, page.id().as_str()); + page_doc.add_text(self.schema.block_id, ""); // Empty for page docs + page_doc.add_text(self.schema.page_title, page.title()); + page_doc.add_facet(self.schema.document_type, "/page"); + + self.writer.add_document(page_doc)?; + + // Create block documents + for block in page.all_blocks() { + let mut block_doc = TantivyDocument::default(); + + block_doc.add_text(self.schema.page_id, page.id().as_str()); + block_doc.add_text(self.schema.block_id, block.id().as_str()); + block_doc.add_text(self.schema.page_title, page.title()); + block_doc.add_text(self.schema.block_content, block.content().as_str()); + + // Add URLs + let urls_text = block.urls() + .iter() + .map(|u| 
u.as_str()) + .collect::>() + .join(" "); + if !urls_text.is_empty() { + block_doc.add_text(self.schema.urls, &urls_text); + } + + // Add URL domains as facets + for url in block.urls() { + if let Some(domain) = url.domain() { + block_doc.add_facet( + self.schema.url_domains, + &format!("/domain/{}", domain) + ); + } + } + + // Add page references + let refs_text = block.page_references() + .iter() + .map(|r| r.text()) + .collect::>() + .join(" "); + if !refs_text.is_empty() { + block_doc.add_text(self.schema.page_references, &refs_text); + } + + // Add reference type facets + for page_ref in block.page_references() { + let facet_path = match page_ref.reference_type() { + ReferenceType::Link => "/reference/link", + ReferenceType::Tag => "/reference/tag", + }; + block_doc.add_facet(self.schema.reference_type, facet_path); + } + + block_doc.add_facet(self.schema.document_type, "/block"); + block_doc.add_u64(self.schema.indent_level, block.indent_level().level() as u64); + + self.writer.add_document(block_doc)?; + } + + Ok(()) + } + + /// Remove all documents for a page + pub fn delete_page(&mut self, page_id: &PageId) -> tantivy::Result<()> { + let term = Term::from_field_text(self.schema.page_id, page_id.as_str()); + self.writer.delete_term(term); + Ok(()) + } + + /// Update a page (delete + re-index) + pub fn update_page(&mut self, page: &Page) -> tantivy::Result<()> { + self.delete_page(page.id())?; + self.index_page(page)?; + Ok(()) + } + + /// Commit all pending changes + pub fn commit(&mut self) -> tantivy::Result<()> { + self.writer.commit()?; + Ok(()) + } + + /// Search with query string + pub fn search(&self, query_str: &str, limit: usize) -> tantivy::Result> { + let searcher = self.reader.searcher(); + + // Parse query across multiple fields + let query_parser = QueryParser::for_index( + &self.index, + vec![ + self.schema.page_title, + self.schema.block_content, + self.schema.urls, + self.schema.page_references, + ], + ); + + let query = query_parser.parse_query(query_str)?; + + // Execute search + let top_docs = searcher.search(&query, &TopDocs::with_limit(limit))?; + + // Convert to SearchResult + let mut results = Vec::new(); + for (_score, doc_address) in top_docs { + let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?; + results.push(SearchResult::from_document(&retrieved_doc, &self.schema)?); + } + + Ok(results) + } + + /// Fuzzy search (typo-tolerant) + pub fn fuzzy_search(&self, query_str: &str, limit: usize, max_distance: u8) -> tantivy::Result> { + let searcher = self.reader.searcher(); + + // Build fuzzy query for each term + let terms: Vec<_> = query_str.split_whitespace().collect(); + let mut queries: Vec> = Vec::new(); + + for term in terms { + // Fuzzy search on title + let title_term = Term::from_field_text(self.schema.page_title, term); + queries.push(Box::new(FuzzyTermQuery::new(title_term, max_distance, true))); + + // Fuzzy search on content + let content_term = Term::from_field_text(self.schema.block_content, term); + queries.push(Box::new(FuzzyTermQuery::new(content_term, max_distance, true))); + } + + let query = BooleanQuery::new( + queries.into_iter().map(|q| (Occur::Should, q)).collect() + ); + + let top_docs = searcher.search(&query, &TopDocs::with_limit(limit))?; + + let mut results = Vec::new(); + for (_score, doc_address) in top_docs { + let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?; + results.push(SearchResult::from_document(&retrieved_doc, &self.schema)?); + } + + Ok(results) + } + + /// Search with filters + pub fn 
search_with_filters( + &self, + query_str: &str, + limit: usize, + filters: SearchFilters, + ) -> tantivy::Result> { + let searcher = self.reader.searcher(); + + let mut query_parser = QueryParser::for_index( + &self.index, + vec![ + self.schema.page_title, + self.schema.block_content, + self.schema.urls, + self.schema.page_references, + ], + ); + + let text_query = query_parser.parse_query(query_str)?; + + // Build filter queries + let mut queries: Vec<(Occur, Box)> = vec![ + (Occur::Must, text_query), + ]; + + if let Some(doc_type) = filters.document_type { + let facet = Facet::from(&format!("/{}", doc_type)); + let facet_term = Term::from_facet(self.schema.document_type, &facet); + queries.push(( + Occur::Must, + Box::new(tantivy::query::TermQuery::new( + facet_term, + IndexRecordOption::Basic, + )), + )); + } + + if let Some(ref_type) = filters.reference_type { + let facet = Facet::from(&format!("/reference/{}", ref_type)); + let facet_term = Term::from_facet(self.schema.reference_type, &facet); + queries.push(( + Occur::Must, + Box::new(tantivy::query::TermQuery::new( + facet_term, + IndexRecordOption::Basic, + )), + )); + } + + let final_query = BooleanQuery::new(queries); + + let top_docs = searcher.search(&final_query, &TopDocs::with_limit(limit))?; + + let mut results = Vec::new(); + for (_score, doc_address) in top_docs { + let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?; + results.push(SearchResult::from_document(&retrieved_doc, &self.schema)?); + } + + Ok(results) + } +} +``` + +### Search Result Types + +```rust +// backend/src/infrastructure/search/result.rs + +use crate::domain::{PageId, BlockId}; +use tantivy::TantivyDocument; +use super::schema::SearchSchema; + +#[derive(Debug, Clone)] +pub enum SearchResult { + PageResult { + page_id: PageId, + page_title: String, + score: f32, + }, + BlockResult { + page_id: PageId, + block_id: BlockId, + page_title: String, + block_content: String, + indent_level: usize, + score: f32, + }, +} + +impl SearchResult { + pub fn from_document(doc: &TantivyDocument, schema: &SearchSchema) -> tantivy::Result { + let page_id_str = doc.get_first(schema.page_id) + .and_then(|v| v.as_str()) + .ok_or_else(|| tantivy::TantivyError::InvalidArgument("Missing page_id".into()))?; + + let page_id = PageId::new(page_id_str) + .map_err(|e| tantivy::TantivyError::InvalidArgument(e.to_string()))?; + + let page_title = doc.get_first(schema.page_title) + .and_then(|v| v.as_str()) + .unwrap_or("") + .to_string(); + + let block_id_str = doc.get_first(schema.block_id) + .and_then(|v| v.as_str()) + .unwrap_or(""); + + if block_id_str.is_empty() { + // Page result + Ok(SearchResult::PageResult { + page_id, + page_title, + score: 0.0, // Score set by caller + }) + } else { + // Block result + let block_id = BlockId::new(block_id_str) + .map_err(|e| tantivy::TantivyError::InvalidArgument(e.to_string()))?; + + let block_content = doc.get_first(schema.block_content) + .and_then(|v| v.as_str()) + .unwrap_or("") + .to_string(); + + let indent_level = doc.get_first(schema.indent_level) + .and_then(|v| v.as_u64()) + .unwrap_or(0) as usize; + + Ok(SearchResult::BlockResult { + page_id, + block_id, + page_title, + block_content, + indent_level, + score: 0.0, + }) + } + } + + pub fn page_id(&self) -> &PageId { + match self { + SearchResult::PageResult { page_id, .. } => page_id, + SearchResult::BlockResult { page_id, .. 
} => page_id, + } + } +} + +#[derive(Debug, Clone, Default)] +pub struct SearchFilters { + pub document_type: Option, // "page" or "block" + pub reference_type: Option, // "link" or "tag" + pub min_indent_level: Option, + pub max_indent_level: Option, +} +``` + +## Search Service (Application Layer) + +### SearchService + +```rust +// backend/src/application/services/search_service.rs + +use crate::infrastructure::search::{TantivySearchIndex, SearchResult, SearchFilters}; +use crate::domain::{PageId, DomainResult}; + +pub struct SearchService { + index: TantivySearchIndex, +} + +impl SearchService { + pub fn new(index: TantivySearchIndex) -> Self { + Self { index } + } + + /// Full-text search + pub fn search(&self, query: &str, limit: usize) -> DomainResult> { + self.index.search(query, limit) + .map_err(|e| DomainError::InvalidOperation(format!("Search error: {}", e))) + } + + /// Fuzzy search (typo-tolerant, max edit distance = 2) + pub fn fuzzy_search(&self, query: &str, limit: usize) -> DomainResult> { + self.index.fuzzy_search(query, limit, 2) + .map_err(|e| DomainError::InvalidOperation(format!("Fuzzy search error: {}", e))) + } + + /// Search with filters + pub fn search_with_filters( + &self, + query: &str, + limit: usize, + filters: SearchFilters, + ) -> DomainResult> { + self.index.search_with_filters(query, limit, filters) + .map_err(|e| DomainError::InvalidOperation(format!("Filtered search error: {}", e))) + } + + /// Search only page titles + pub fn search_pages(&self, query: &str, limit: usize) -> DomainResult> { + let filters = SearchFilters { + document_type: Some("page".to_string()), + ..Default::default() + }; + self.search_with_filters(query, limit, filters) + } + + /// Search only block content + pub fn search_blocks(&self, query: &str, limit: usize) -> DomainResult> { + let filters = SearchFilters { + document_type: Some("block".to_string()), + ..Default::default() + }; + self.search_with_filters(query, limit, filters) + } + + /// Search only tags + pub fn search_tags(&self, query: &str, limit: usize) -> DomainResult> { + let filters = SearchFilters { + reference_type: Some("tag".to_string()), + ..Default::default() + }; + self.search_with_filters(query, limit, filters) + } +} +``` + +## Integration with Existing Services + +### Update ImportService + +```rust +// backend/src/application/services/import_service.rs + +pub struct ImportService { + page_repository: P, + mapping_repository: M, + search_index: Option>>, // Optional for now + max_concurrent_files: usize, +} + +impl ImportService { + pub fn with_search_index(mut self, index: Arc>) -> Self { + self.search_index = Some(index); + self + } + + async fn process_file(&mut self, path: PathBuf) -> ImportResult<()> { + // ... existing parsing and repository save logic ... + + // Index page in search index + if let Some(ref index) = self.search_index { + let mut index_lock = index.lock().await; + index_lock.index_page(&page) + .map_err(|e| ImportError::SearchIndex(e.to_string()))?; + } + + Ok(()) + } + + pub async fn import_directory(/* ... */) -> ImportResult { + // ... existing import logic ... 
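+        // Note: the index is committed once per import run (batch commit)
+        // rather than after every page; see "Indexing Strategies" under
+        // Performance Optimization below for the trade-off.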
+ + // Commit search index + if let Some(ref index) = self.search_index { + let mut index_lock = index.lock().await; + index_lock.commit() + .map_err(|e| ImportError::SearchIndex(e.to_string()))?; + } + + Ok(summary) + } +} +``` + +### Update SyncService + +```rust +// backend/src/application/services/sync_service.rs + +pub struct SyncService { + page_repository: Arc>, + mapping_repository: Arc>, + search_index: Option>>, + directory_path: LogseqDirectoryPath, + watcher: LogseqFileWatcher, +} + +impl SyncService +where + P: PageRepository + Send + 'static, + M: FileMappingRepository + Send + 'static, +{ + pub fn with_search_index(mut self, index: Arc>) -> Self { + self.search_index = Some(index); + self + } + + async fn handle_file_created(&self, path: PathBuf) -> SyncResult<()> { + // ... existing parsing and repository save logic ... + + // Index in search + if let Some(ref index) = self.search_index { + let mut index_lock = index.lock().await; + index_lock.index_page(&page)?; + index_lock.commit()?; + } + + Ok(()) + } + + async fn handle_file_updated(&self, path: PathBuf) -> SyncResult<()> { + // ... existing update logic ... + + // Update search index + if let Some(ref index) = self.search_index { + let mut index_lock = index.lock().await; + index_lock.update_page(&page)?; + index_lock.commit()?; + } + + Ok(()) + } + + async fn handle_file_deleted(&self, path: PathBuf) -> SyncResult<()> { + // ... existing deletion logic ... + + // Delete from search index + if let Some(ref index) = self.search_index { + let mut index_lock = index.lock().await; + index_lock.delete_page(&page_id)?; + index_lock.commit()?; + } + + Ok(()) + } +} +``` + +## Tauri Integration + +### Search DTOs + +```rust +// backend/src/tauri/dto.rs + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct SearchRequest { + pub query: String, + pub limit: usize, + pub fuzzy: bool, + pub filters: Option, +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct SearchFiltersDto { + pub document_type: Option, + pub reference_type: Option, +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +#[serde(tag = "type")] +pub enum SearchResultDto { + PageResult { + page_id: String, + page_title: String, + score: f32, + }, + BlockResult { + page_id: String, + block_id: String, + page_title: String, + block_content: String, + indent_level: usize, + score: f32, + }, +} +``` + +### Search Commands + +```rust +// backend/src/tauri/commands/search.rs + +#[tauri::command] +pub async fn search( + state: State<'_, AppState>, + request: SearchRequest, +) -> Result, ErrorResponse> { + let search_service = state.search_service.lock().await; + + let results = if request.fuzzy { + search_service.fuzzy_search(&request.query, request.limit)? + } else if let Some(filters) = request.filters { + let search_filters = SearchFilters { + document_type: filters.document_type, + reference_type: filters.reference_type, + ..Default::default() + }; + search_service.search_with_filters(&request.query, request.limit, search_filters)? + } else { + search_service.search(&request.query, request.limit)? 
+ }; + + Ok(results.into_iter().map(DtoMapper::search_result_to_dto).collect()) +} + +#[tauri::command] +pub async fn search_pages( + state: State<'_, AppState>, + query: String, + limit: usize, +) -> Result, ErrorResponse> { + let search_service = state.search_service.lock().await; + let results = search_service.search_pages(&query, limit)?; + Ok(results.into_iter().map(DtoMapper::search_result_to_dto).collect()) +} + +#[tauri::command] +pub async fn search_tags( + state: State<'_, AppState>, + query: String, + limit: usize, +) -> Result, ErrorResponse> { + let search_service = state.search_service.lock().await; + let results = search_service.search_tags(&query, limit)?; + Ok(results.into_iter().map(DtoMapper::search_result_to_dto).collect()) +} +``` + +## Frontend Integration + +### TypeScript API + +```typescript +// frontend/src/lib/tauri-api.ts + +export interface SearchRequest { + query: string; + limit: number; + fuzzy: boolean; + filters?: SearchFilters; +} + +export interface SearchFilters { + document_type?: 'page' | 'block'; + reference_type?: 'link' | 'tag'; +} + +export type SearchResultDto = + | { + type: 'PageResult'; + page_id: string; + page_title: string; + score: number; + } + | { + type: 'BlockResult'; + page_id: string; + block_id: string; + page_title: string; + block_content: string; + indent_level: number; + score: number; + }; + +export class TauriApi { + static async search(request: SearchRequest): Promise { + return await invoke('search', { request }); + } + + static async searchPages(query: string, limit: number): Promise { + return await invoke('search_pages', { query, limit }); + } + + static async searchTags(query: string, limit: number): Promise { + return await invoke('search_tags', { query, limit }); + } +} +``` + +### React Hook + +```typescript +// frontend/src/hooks/useSearch.ts + +import { useState, useCallback } from 'react'; +import { TauriApi, SearchRequest, SearchResultDto } from '../lib/tauri-api'; + +export function useSearch() { + const [results, setResults] = useState([]); + const [isSearching, setIsSearching] = useState(false); + const [error, setError] = useState(null); + + const search = useCallback(async (request: SearchRequest) => { + setIsSearching(true); + setError(null); + + try { + const searchResults = await TauriApi.search(request); + setResults(searchResults); + } catch (err) { + setError(err as Error); + } finally { + setIsSearching(false); + } + }, []); + + const searchPages = useCallback(async (query: string, limit = 20) => { + setIsSearching(true); + setError(null); + + try { + const searchResults = await TauriApi.searchPages(query, limit); + setResults(searchResults); + } catch (err) { + setError(err as Error); + } finally { + setIsSearching(false); + } + }, []); + + return { + results, + isSearching, + error, + search, + searchPages, + }; +} +``` + +## Performance Optimization + +### Indexing Strategies + +**Incremental indexing:** +- Commit after each page during import (slower but progress visible) +- Batch commits during bulk import (faster, commit every 100 pages) + +**Commit frequency trade-offs:** +```rust +// Option 1: Commit after every page (real-time, slower) +for page in pages { + index.index_page(&page)?; + index.commit()?; // Slow: ~10-50ms per commit +} + +// Option 2: Batch commits (faster, less real-time) +for page in pages { + index.index_page(&page)?; +} +index.commit()?; // Fast: Single commit for all pages +``` + +**Recommendation:** Batch commits during import, real-time commits during sync. 
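+
+A minimal sketch of the batched approach for bulk import, using the `TantivySearchIndex` API sketched above (the helper name and the batch size of 100 are illustrative, not part of the planned API):
+
+```rust
+/// Index many pages with periodic commits so progress becomes searchable
+/// without paying the per-page commit cost.
+fn index_pages_batched(
+    index: &mut TantivySearchIndex,
+    pages: &[Page],
+) -> tantivy::Result<()> {
+    const BATCH_SIZE: usize = 100; // illustrative; tune for page size and memory
+
+    for (i, page) in pages.iter().enumerate() {
+        index.index_page(page)?;
+
+        // Commit every BATCH_SIZE pages instead of after every page.
+        if (i + 1) % BATCH_SIZE == 0 {
+            index.commit()?;
+        }
+    }
+
+    // Final commit for any remaining pages.
+    index.commit()?;
+    Ok(())
+}
+```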
+ +### Query Optimization + +**Limit result set:** +```rust +// Always specify reasonable limits +search_service.search(query, 100) // Max 100 results +``` + +**Use filters to narrow scope:** +```rust +// Faster: Search only blocks +search_service.search_blocks(query, 20) + +// Slower: Search everything +search_service.search(query, 20) +``` + +### Expected Performance + +- **Index build:** ~1000-5000 pages/second (depends on page size) +- **Search latency:** ~5-50ms for typical queries +- **Fuzzy search:** ~10-100ms (more expensive than exact search) +- **Index size:** ~10-20% of original markdown size + +## Testing Strategy + +### Unit Tests + +```rust +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn test_index_and_search_page() { + let mut index = TantivySearchIndex::new_in_memory().unwrap(); + + let page = Page::new( + PageId::new("test-page").unwrap(), + "Test Page Title".to_string(), + ); + + index.index_page(&page).unwrap(); + index.commit().unwrap(); + + let results = index.search("Test", 10).unwrap(); + assert_eq!(results.len(), 1); + } + + #[test] + fn test_fuzzy_search() { + let mut index = TantivySearchIndex::new_in_memory().unwrap(); + + // Index page with "algorithm" + let page = create_test_page("My Algorithm Notes"); + index.index_page(&page).unwrap(); + index.commit().unwrap(); + + // Search with typo "algoritm" + let results = index.fuzzy_search("algoritm", 10, 2).unwrap(); + assert!(results.len() > 0); + } +} +``` + +## Rollout Plan + +### Phase 1: Foundation βœ… +- [ ] Add Tantivy dependency +- [ ] Define search schema +- [ ] Implement TantivySearchIndex +- [ ] Create SearchResult types + +### Phase 2: Service Integration βœ… +- [ ] Create SearchService +- [ ] Integrate with ImportService +- [ ] Integrate with SyncService +- [ ] Add incremental indexing + +### Phase 3: Tauri Integration βœ… +- [ ] Create search DTOs +- [ ] Implement search commands +- [ ] Add frontend TypeScript types +- [ ] Build React hooks + +### Phase 4: Advanced Features πŸš€ +- [ ] Add highlighting (show matched snippets) +- [ ] Implement search suggestions (autocomplete) +- [ ] Add search history +- [ ] Add search analytics + +### Phase 5: Testing & Optimization βœ… +- [ ] Unit tests for search index +- [ ] Integration tests for search service +- [ ] Performance benchmarks +- [ ] Optimize commit strategy + +## Open Questions + +1. **Highlighting:** Should we return matched snippets with highlighting? +2. **Autocomplete:** Should we implement search-as-you-type suggestions? +3. **Ranking tuning:** Should we allow users to customize ranking weights? +4. **Index size limits:** Should we warn users if index grows too large? +5. **Reindexing:** How to handle schema changes (require full reindex)? + +## Future Enhancements + +- **Snippet highlighting:** Return matched text with HTML highlighting +- **Query suggestions:** "Did you mean..." 
for misspelled queries +- **Related pages:** "Pages similar to this one" using TF-IDF similarity +- **Search analytics:** Track popular searches, no-result queries +- **Advanced syntax:** Support boolean operators (AND, OR, NOT, quotes) +- **Faceted search:** "Filter by tag", "Filter by date range" +- **Graph search:** "Find pages that link to this page" + +## References + +- Tantivy documentation: https://docs.rs/tantivy/ +- BM25 algorithm: https://en.wikipedia.org/wiki/Okapi_BM25 +- Fuzzy search: https://en.wikipedia.org/wiki/Levenshtein_distance +- Search UX patterns: https://www.nngroup.com/articles/search-interface/ diff --git a/notes/features/tauri-integration.md b/notes/features/tauri-integration.md new file mode 100644 index 0000000..aeee246 --- /dev/null +++ b/notes/features/tauri-integration.md @@ -0,0 +1,1198 @@ +# Tauri Integration Implementation Plan + +## Overview + +Implement Tauri commands and event emitters to expose the Rust backend (ImportService, SyncService, PageRepository) to the frontend application. This bridges the gap between the domain/application layers and the UI layer following Tauri's IPC (Inter-Process Communication) patterns. + +## Goals + +1. **Expose backend services** via Tauri commands (async RPC) +2. **Stream real-time events** from services to frontend (progress, sync updates) +3. **Manage application state** (database connection, service lifecycle) +4. **Handle errors gracefully** with user-friendly error messages +5. **Support concurrent operations** (multiple frontend requests) +6. **Follow Tauri best practices** (state management, command patterns) + +## Architecture Layer + +**Presentation/API Layer** (`backend/src/tauri/`) + +New layer hierarchy: +``` +Frontend (React/Svelte/Vue) + ↕ Tauri IPC +Backend API Layer (Tauri commands) + ↕ +Application Layer (Services, Repositories) + ↕ +Domain Layer (Entities, Aggregates) + ↕ +Infrastructure Layer (Persistence, File System) +``` + +## Dependencies + +### Backend (Rust) + +```toml +# backend/Cargo.toml +[dependencies] +tauri = { version = "2.3", features = ["protocol-asset"] } +serde = { version = "1.0", features = ["derive"] } +serde_json = "1.0" +tokio = { version = "1.41", features = ["full"] } +``` + +### Frontend + +```json +// frontend/package.json +{ + "dependencies": { + "@tauri-apps/api": "^2.3.0", + "@tauri-apps/plugin-shell": "^2.0.0" + } +} +``` + +## Tauri State Management + +### Application State + +```rust +// backend/src/tauri/state.rs + +use std::sync::Arc; +use tokio::sync::Mutex; +use crate::infrastructure::persistence::{SqlitePageRepository, SqliteFileMappingRepository}; +use crate::application::services::{ImportService, SyncService}; + +/// Global application state shared across Tauri commands +pub struct AppState { + pub page_repository: Arc>, + pub mapping_repository: Arc>, + pub sync_service: Arc>>>, +} + +impl AppState { + pub async fn new(db_path: impl AsRef) -> Result> { + let page_repo = SqlitePageRepository::new(db_path.as_ref()).await?; + let mapping_repo = SqliteFileMappingRepository::new(page_repo.pool().clone()).await?; + + Ok(Self { + page_repository: Arc::new(Mutex::new(page_repo)), + mapping_repository: Arc::new(Mutex::new(mapping_repo)), + sync_service: Arc::new(Mutex::new(None)), + }) + } +} +``` + +### Tauri Builder Setup + +```rust +// backend/src/main.rs or backend/src/lib.rs + +use tauri::Manager; +use std::path::PathBuf; + +#[cfg_attr(mobile, tauri::mobile_entry_point)] +pub fn run() { + tauri::Builder::default() + .setup(|app| { + // Get app data directory + 
let app_data_dir = app.path().app_data_dir() + .expect("Failed to get app data directory"); + + // Create data directory if it doesn't exist + std::fs::create_dir_all(&app_data_dir)?; + + // Initialize database + let db_path = app_data_dir.join("logjam.db"); + + // Initialize app state + let app_state = tauri::async_runtime::block_on(async { + AppState::new(db_path).await + })?; + + // Manage state (accessible in commands) + app.manage(app_state); + + Ok(()) + }) + .plugin(tauri_plugin_shell::init()) + .invoke_handler(tauri::generate_handler![ + // Import commands + import_directory, + get_import_status, + + // Sync commands + start_sync, + stop_sync, + sync_once, + get_sync_status, + + // Page query commands + get_all_pages, + get_page_by_id, + get_page_by_title, + delete_page, + + // Settings commands + set_logseq_directory, + get_logseq_directory, + ]) + .run(tauri::generate_context!()) + .expect("error while running tauri application"); +} +``` + +## Data Transfer Objects (DTOs) + +### Serializable DTOs for Frontend Communication + +```rust +// backend/src/tauri/dto.rs + +use serde::{Serialize, Deserialize}; + +// ============================================================================ +// Import DTOs +// ============================================================================ + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct ImportRequest { + pub directory_path: String, +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct ImportSummaryDto { + pub total_files: usize, + pub successful: usize, + pub failed: usize, + pub duration_ms: u64, + pub errors: Vec, +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct ImportErrorDto { + pub file_path: String, + pub error_message: String, +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +#[serde(tag = "type")] +pub enum ImportProgressEvent { + Started { total_files: usize }, + FileProcessed { file_path: String, success: bool, current: usize, total: usize }, + Completed { summary: ImportSummaryDto }, + Failed { error: String }, +} + +// ============================================================================ +// Sync DTOs +// ============================================================================ + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct SyncRequest { + pub directory_path: String, + pub enable_watch: bool, +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct SyncSummaryDto { + pub files_created: usize, + pub files_updated: usize, + pub files_deleted: usize, + pub duration_ms: u64, +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +#[serde(tag = "type")] +pub enum SyncEvent { + Started, + FileCreated { file_path: String }, + FileUpdated { file_path: String }, + FileDeleted { file_path: String }, + Completed { summary: SyncSummaryDto }, + Error { error: String }, +} + +// ============================================================================ +// Page DTOs +// ============================================================================ + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct PageDto { + pub id: String, + pub title: String, + pub blocks: Vec, + pub root_block_ids: Vec, +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct BlockDto { + pub id: String, + pub content: String, + pub indent_level: usize, + pub parent_id: Option, + pub child_ids: Vec, + pub urls: Vec, + pub page_references: Vec, +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct UrlDto { + pub url: String, + pub domain: Option, +} + 
+#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct PageReferenceDto { + pub text: String, + pub reference_type: String, // "link" or "tag" +} + +// ============================================================================ +// Error DTOs +// ============================================================================ + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct ErrorResponse { + pub error: String, + pub error_type: String, +} + +impl ErrorResponse { + pub fn new(error: impl std::fmt::Display, error_type: impl Into) -> Self { + Self { + error: error.to_string(), + error_type: error_type.into(), + } + } +} +``` + +### DTO Mappers + +```rust +// backend/src/tauri/mappers.rs + +use crate::domain::{Page, Block, Url, PageReference}; +use crate::application::services::{ImportSummary, SyncSummary}; +use super::dto::*; + +pub struct DtoMapper; + +impl DtoMapper { + pub fn page_to_dto(page: &Page) -> PageDto { + PageDto { + id: page.id().as_str().to_string(), + title: page.title().to_string(), + blocks: page.all_blocks().map(Self::block_to_dto).collect(), + root_block_ids: page.root_blocks().iter() + .map(|id| id.as_str().to_string()) + .collect(), + } + } + + fn block_to_dto(block: &Block) -> BlockDto { + BlockDto { + id: block.id().as_str().to_string(), + content: block.content().as_str().to_string(), + indent_level: block.indent_level().level(), + parent_id: block.parent_id().map(|id| id.as_str().to_string()), + child_ids: block.child_ids().iter() + .map(|id| id.as_str().to_string()) + .collect(), + urls: block.urls().iter().map(Self::url_to_dto).collect(), + page_references: block.page_references().iter() + .map(Self::page_ref_to_dto) + .collect(), + } + } + + fn url_to_dto(url: &Url) -> UrlDto { + UrlDto { + url: url.as_str().to_string(), + domain: url.domain().map(String::from), + } + } + + fn page_ref_to_dto(page_ref: &PageReference) -> PageReferenceDto { + PageReferenceDto { + text: page_ref.text().to_string(), + reference_type: match page_ref.reference_type() { + ReferenceType::Link => "link", + ReferenceType::Tag => "tag", + }.to_string(), + } + } + + pub fn import_summary_to_dto(summary: &ImportSummary) -> ImportSummaryDto { + ImportSummaryDto { + total_files: summary.total_files, + successful: summary.successful, + failed: summary.failed, + duration_ms: summary.duration.as_millis() as u64, + errors: summary.errors.iter() + .map(|(path, err)| ImportErrorDto { + file_path: path.to_string_lossy().to_string(), + error_message: err.to_string(), + }) + .collect(), + } + } + + pub fn sync_summary_to_dto(summary: &SyncSummary) -> SyncSummaryDto { + SyncSummaryDto { + files_created: summary.files_created, + files_updated: summary.files_updated, + files_deleted: summary.files_deleted, + duration_ms: summary.duration.as_millis() as u64, + } + } +} +``` + +## Tauri Commands + +### Import Commands + +```rust +// backend/src/tauri/commands/import.rs + +use tauri::{AppHandle, State, Emitter}; +use crate::tauri::{AppState, dto::*, mappers::DtoMapper}; +use crate::application::services::ImportService; +use crate::domain::value_objects::LogseqDirectoryPath; + +#[tauri::command] +pub async fn import_directory( + app: AppHandle, + state: State<'_, AppState>, + request: ImportRequest, +) -> Result { + // Validate directory path + let logseq_dir = LogseqDirectoryPath::new(&request.directory_path) + .map_err(|e| ErrorResponse::new(e, "ValidationError"))?; + + // Create import service + let page_repo = state.page_repository.lock().await.clone(); + let mapping_repo = 
state.mapping_repository.lock().await.clone(); + let mut import_service = ImportService::new(page_repo, mapping_repo); + + // Setup progress callback + let app_clone = app.clone(); + let progress_callback = move |event: crate::domain::events::ImportProgressEvent| { + let dto_event = match event { + crate::domain::events::ImportProgressEvent::Started(progress) => { + ImportProgressEvent::Started { + total_files: progress.total_files(), + } + } + crate::domain::events::ImportProgressEvent::FileProcessed(progress) => { + ImportProgressEvent::FileProcessed { + file_path: progress.current_file() + .map(|p| p.to_string_lossy().to_string()) + .unwrap_or_default(), + success: true, + current: progress.processed_files(), + total: progress.total_files(), + } + } + }; + + // Emit event to frontend + let _ = app_clone.emit("import-progress", dto_event); + }; + + // Run import + let summary = import_service + .import_directory(logseq_dir, Some(Arc::new(progress_callback))) + .await + .map_err(|e| ErrorResponse::new(e, "ImportError"))?; + + // Convert to DTO + let summary_dto = DtoMapper::import_summary_to_dto(&summary); + + // Emit completion event + app.emit("import-progress", ImportProgressEvent::Completed { + summary: summary_dto.clone(), + }).map_err(|e| ErrorResponse::new(e, "EventEmitError"))?; + + Ok(summary_dto) +} + +#[tauri::command] +pub async fn get_import_status( + state: State<'_, AppState>, +) -> Result, ErrorResponse> { + // TODO: Implement persistent import status tracking + // For now, return None (no active import) + Ok(None) +} +``` + +### Sync Commands + +```rust +// backend/src/tauri/commands/sync.rs + +use tauri::{AppHandle, State, Emitter}; +use crate::tauri::{AppState, dto::*, mappers::DtoMapper}; +use crate::application::services::SyncService; +use crate::domain::value_objects::LogseqDirectoryPath; + +#[tauri::command] +pub async fn start_sync( + app: AppHandle, + state: State<'_, AppState>, + request: SyncRequest, +) -> Result<(), ErrorResponse> { + // Validate directory path + let logseq_dir = LogseqDirectoryPath::new(&request.directory_path) + .map_err(|e| ErrorResponse::new(e, "ValidationError"))?; + + // Check if sync is already running + let sync_service_lock = state.sync_service.lock().await; + if sync_service_lock.is_some() { + return Err(ErrorResponse::new( + "Sync is already running", + "SyncAlreadyRunning", + )); + } + drop(sync_service_lock); + + // Create sync service + let sync_service = SyncService::new( + state.page_repository.clone(), + state.mapping_repository.clone(), + logseq_dir, + ).map_err(|e| ErrorResponse::new(e, "SyncError"))?; + + // Setup event callback + let app_clone = app.clone(); + let sync_callback = move |event: crate::domain::events::SyncEvent| { + let dto_event = match event { + crate::domain::events::SyncEvent::Started => SyncEvent::Started, + crate::domain::events::SyncEvent::FileCreated(path) => SyncEvent::FileCreated { + file_path: path.to_string_lossy().to_string(), + }, + crate::domain::events::SyncEvent::FileUpdated(path) => SyncEvent::FileUpdated { + file_path: path.to_string_lossy().to_string(), + }, + crate::domain::events::SyncEvent::FileDeleted(path) => SyncEvent::FileDeleted { + file_path: path.to_string_lossy().to_string(), + }, + crate::domain::events::SyncEvent::Completed(summary) => SyncEvent::Completed { + summary: DtoMapper::sync_summary_to_dto(&summary), + }, + }; + + let _ = app_clone.emit("sync-event", dto_event); + }; + + // Store sync service in state + let mut sync_service_lock = state.sync_service.lock().await; + 
*sync_service_lock = Some(sync_service); + drop(sync_service_lock); + + // Start watching in background task + if request.enable_watch { + let sync_service_clone = state.sync_service.clone(); + tauri::async_runtime::spawn(async move { + let sync_service_lock = sync_service_clone.lock().await; + if let Some(service) = sync_service_lock.as_ref() { + let _ = service.start_watching(Some(Arc::new(sync_callback))).await; + } + }); + } + + Ok(()) +} + +#[tauri::command] +pub async fn stop_sync( + state: State<'_, AppState>, +) -> Result<(), ErrorResponse> { + let mut sync_service_lock = state.sync_service.lock().await; + + if sync_service_lock.is_none() { + return Err(ErrorResponse::new( + "No sync service running", + "NoSyncRunning", + )); + } + + // Drop the sync service (stops watching) + *sync_service_lock = None; + + Ok(()) +} + +#[tauri::command] +pub async fn sync_once( + app: AppHandle, + state: State<'_, AppState>, + request: SyncRequest, +) -> Result { + // Validate directory path + let logseq_dir = LogseqDirectoryPath::new(&request.directory_path) + .map_err(|e| ErrorResponse::new(e, "ValidationError"))?; + + // Create temporary sync service + let sync_service = SyncService::new( + state.page_repository.clone(), + state.mapping_repository.clone(), + logseq_dir, + ).map_err(|e| ErrorResponse::new(e, "SyncError"))?; + + // Setup callback + let app_clone = app.clone(); + let callback = move |event: crate::domain::events::SyncEvent| { + // Convert and emit events + // (same as start_sync) + }; + + // Run one-time sync + let summary = sync_service + .sync_once(Some(Arc::new(callback))) + .await + .map_err(|e| ErrorResponse::new(e, "SyncError"))?; + + Ok(DtoMapper::sync_summary_to_dto(&summary)) +} + +#[tauri::command] +pub async fn get_sync_status( + state: State<'_, AppState>, +) -> Result { + let sync_service_lock = state.sync_service.lock().await; + Ok(sync_service_lock.is_some()) +} +``` + +### Page Query Commands + +```rust +// backend/src/tauri/commands/pages.rs + +use tauri::State; +use crate::tauri::{AppState, dto::*, mappers::DtoMapper}; +use crate::domain::{PageId}; +use crate::application::repositories::PageRepository; + +#[tauri::command] +pub async fn get_all_pages( + state: State<'_, AppState>, +) -> Result, ErrorResponse> { + let page_repo = state.page_repository.lock().await; + + let pages = page_repo.find_all() + .map_err(|e| ErrorResponse::new(e, "RepositoryError"))?; + + Ok(pages.iter().map(DtoMapper::page_to_dto).collect()) +} + +#[tauri::command] +pub async fn get_page_by_id( + state: State<'_, AppState>, + page_id: String, +) -> Result, ErrorResponse> { + let page_id = PageId::new(&page_id) + .map_err(|e| ErrorResponse::new(e, "ValidationError"))?; + + let page_repo = state.page_repository.lock().await; + + let page = page_repo.find_by_id(&page_id) + .map_err(|e| ErrorResponse::new(e, "RepositoryError"))?; + + Ok(page.as_ref().map(DtoMapper::page_to_dto)) +} + +#[tauri::command] +pub async fn get_page_by_title( + state: State<'_, AppState>, + title: String, +) -> Result, ErrorResponse> { + let page_repo = state.page_repository.lock().await; + + let page = page_repo.find_by_title(&title) + .map_err(|e| ErrorResponse::new(e, "RepositoryError"))?; + + Ok(page.as_ref().map(DtoMapper::page_to_dto)) +} + +#[tauri::command] +pub async fn delete_page( + state: State<'_, AppState>, + page_id: String, +) -> Result { + let page_id = PageId::new(&page_id) + .map_err(|e| ErrorResponse::new(e, "ValidationError"))?; + + let mut page_repo = state.page_repository.lock().await; + + let 
deleted = page_repo.delete(&page_id) + .map_err(|e| ErrorResponse::new(e, "RepositoryError"))?; + + Ok(deleted) +} +``` + +### Settings Commands + +```rust +// backend/src/tauri/commands/settings.rs + +use tauri::State; +use std::path::PathBuf; +use crate::tauri::{AppState, dto::ErrorResponse}; + +// For MVP, store in app state; later move to database +#[tauri::command] +pub async fn set_logseq_directory( + _state: State<'_, AppState>, + directory_path: String, +) -> Result<(), ErrorResponse> { + // Validate directory exists and is a Logseq directory + let path = PathBuf::from(&directory_path); + + if !path.exists() { + return Err(ErrorResponse::new( + "Directory does not exist", + "DirectoryNotFound", + )); + } + + if !path.join("pages").exists() || !path.join("journals").exists() { + return Err(ErrorResponse::new( + "Not a valid Logseq directory (missing pages/ or journals/)", + "InvalidLogseqDirectory", + )); + } + + // TODO: Persist to database or config file + Ok(()) +} + +#[tauri::command] +pub async fn get_logseq_directory( + _state: State<'_, AppState>, +) -> Result, ErrorResponse> { + // TODO: Read from database or config file + Ok(None) +} +``` + +## Frontend Integration + +### TypeScript Types + +```typescript +// frontend/src/types/tauri.ts + +export interface ImportRequest { + directory_path: string; +} + +export interface ImportSummaryDto { + total_files: number; + successful: number; + failed: number; + duration_ms: number; + errors: ImportErrorDto[]; +} + +export interface ImportErrorDto { + file_path: string; + error_message: string; +} + +export type ImportProgressEvent = + | { type: 'Started'; total_files: number } + | { type: 'FileProcessed'; file_path: string; success: boolean; current: number; total: number } + | { type: 'Completed'; summary: ImportSummaryDto } + | { type: 'Failed'; error: string }; + +export interface SyncRequest { + directory_path: string; + enable_watch: boolean; +} + +export interface SyncSummaryDto { + files_created: number; + files_updated: number; + files_deleted: number; + duration_ms: number; +} + +export type SyncEvent = + | { type: 'Started' } + | { type: 'FileCreated'; file_path: string } + | { type: 'FileUpdated'; file_path: string } + | { type: 'FileDeleted'; file_path: string } + | { type: 'Completed'; summary: SyncSummaryDto } + | { type: 'Error'; error: string }; + +export interface PageDto { + id: string; + title: string; + blocks: BlockDto[]; + root_block_ids: string[]; +} + +export interface BlockDto { + id: string; + content: string; + indent_level: number; + parent_id?: string; + child_ids: string[]; + urls: UrlDto[]; + page_references: PageReferenceDto[]; +} + +export interface UrlDto { + url: string; + domain?: string; +} + +export interface PageReferenceDto { + text: string; + reference_type: 'link' | 'tag'; +} + +export interface ErrorResponse { + error: string; + error_type: string; +} +``` + +### API Client + +```typescript +// frontend/src/lib/tauri-api.ts + +import { invoke } from '@tauri-apps/api/core'; +import { listen } from '@tauri-apps/api/event'; +import type { + ImportRequest, + ImportSummaryDto, + ImportProgressEvent, + SyncRequest, + SyncSummaryDto, + SyncEvent, + PageDto, + ErrorResponse, +} from '../types/tauri'; + +export class TauriApi { + // ========== Import Commands ========== + + static async importDirectory(request: ImportRequest): Promise { + try { + return await invoke('import_directory', { request }); + } catch (error) { + throw this.handleError(error); + } + } + + static async 
onImportProgress(callback: (event: ImportProgressEvent) => void) { + return await listen('import-progress', (event) => { + callback(event.payload); + }); + } + + // ========== Sync Commands ========== + + static async startSync(request: SyncRequest): Promise { + try { + await invoke('start_sync', { request }); + } catch (error) { + throw this.handleError(error); + } + } + + static async stopSync(): Promise { + try { + await invoke('stop_sync'); + } catch (error) { + throw this.handleError(error); + } + } + + static async syncOnce(request: SyncRequest): Promise { + try { + return await invoke('sync_once', { request }); + } catch (error) { + throw this.handleError(error); + } + } + + static async getSyncStatus(): Promise { + try { + return await invoke('get_sync_status'); + } catch (error) { + throw this.handleError(error); + } + } + + static async onSyncEvent(callback: (event: SyncEvent) => void) { + return await listen('sync-event', (event) => { + callback(event.payload); + }); + } + + // ========== Page Commands ========== + + static async getAllPages(): Promise { + try { + return await invoke('get_all_pages'); + } catch (error) { + throw this.handleError(error); + } + } + + static async getPageById(pageId: string): Promise { + try { + return await invoke('get_page_by_id', { pageId }); + } catch (error) { + throw this.handleError(error); + } + } + + static async getPageByTitle(title: string): Promise { + try { + return await invoke('get_page_by_title', { title }); + } catch (error) { + throw this.handleError(error); + } + } + + static async deletePage(pageId: string): Promise { + try { + return await invoke('delete_page', { pageId }); + } catch (error) { + throw this.handleError(error); + } + } + + // ========== Settings Commands ========== + + static async setLogseqDirectory(directoryPath: string): Promise { + try { + await invoke('set_logseq_directory', { directoryPath }); + } catch (error) { + throw this.handleError(error); + } + } + + static async getLogseqDirectory(): Promise { + try { + return await invoke('get_logseq_directory'); + } catch (error) { + throw this.handleError(error); + } + } + + // ========== Error Handling ========== + + private static handleError(error: unknown): Error { + if (typeof error === 'object' && error !== null && 'error' in error) { + const errorResponse = error as ErrorResponse; + return new Error(`${errorResponse.error_type}: ${errorResponse.error}`); + } + return new Error(String(error)); + } +} +``` + +### React Hook Example + +```typescript +// frontend/src/hooks/useImport.ts + +import { useState, useEffect } from 'react'; +import { TauriApi } from '../lib/tauri-api'; +import type { ImportProgressEvent, ImportSummaryDto } from '../types/tauri'; + +export function useImport() { + const [progress, setProgress] = useState(null); + const [isImporting, setIsImporting] = useState(false); + + useEffect(() => { + const unlisten = TauriApi.onImportProgress((event) => { + setProgress(event); + + if (event.type === 'Completed' || event.type === 'Failed') { + setIsImporting(false); + } + }); + + return () => { + unlisten.then((fn) => fn()); + }; + }, []); + + const importDirectory = async (directoryPath: string) => { + setIsImporting(true); + setProgress({ type: 'Started', total_files: 0 }); + + try { + const summary = await TauriApi.importDirectory({ directory_path: directoryPath }); + return summary; + } catch (error) { + setProgress({ type: 'Failed', error: String(error) }); + throw error; + } + }; + + return { + progress, + isImporting, + importDirectory, + 
}; +} +``` + +```typescript +// frontend/src/hooks/useSync.ts + +import { useState, useEffect } from 'react'; +import { TauriApi } from '../lib/tauri-api'; +import type { SyncEvent } from '../types/tauri'; + +export function useSync() { + const [events, setEvents] = useState([]); + const [isSyncing, setIsSyncing] = useState(false); + + useEffect(() => { + const unlisten = TauriApi.onSyncEvent((event) => { + setEvents((prev) => [...prev, event]); + + if (event.type === 'Started') { + setIsSyncing(true); + } else if (event.type === 'Completed' || event.type === 'Error') { + setIsSyncing(false); + } + }); + + // Check sync status on mount + TauriApi.getSyncStatus().then(setIsSyncing); + + return () => { + unlisten.then((fn) => fn()); + }; + }, []); + + const startSync = async (directoryPath: string, enableWatch = true) => { + await TauriApi.startSync({ directory_path: directoryPath, enable_watch: enableWatch }); + }; + + const stopSync = async () => { + await TauriApi.stopSync(); + }; + + return { + events, + isSyncing, + startSync, + stopSync, + }; +} +``` + +## Error Handling Strategy + +### Backend Error Conversion + +```rust +// backend/src/tauri/error.rs + +use crate::domain::base::DomainError; +use crate::application::services::{ImportError, SyncError}; +use super::dto::ErrorResponse; + +impl From for ErrorResponse { + fn from(err: DomainError) -> Self { + let error_type = match &err { + DomainError::InvalidValue(_) => "ValidationError", + DomainError::NotFound(_) => "NotFoundError", + DomainError::BusinessRuleViolation(_) => "BusinessRuleError", + DomainError::InvalidOperation(_) => "OperationError", + }; + + ErrorResponse::new(err, error_type) + } +} + +impl From for ErrorResponse { + fn from(err: ImportError) -> Self { + let error_type = match &err { + ImportError::InvalidDirectory(_) => "ValidationError", + ImportError::FileSystem(_) => "FileSystemError", + ImportError::Parse(_) => "ParseError", + ImportError::Repository(_) => "RepositoryError", + }; + + ErrorResponse::new(err, error_type) + } +} + +impl From for ErrorResponse { + fn from(err: SyncError) -> Self { + let error_type = match &err { + SyncError::FileSystem(_) => "FileSystemError", + SyncError::Parse(_) => "ParseError", + SyncError::Repository(_) => "RepositoryError", + SyncError::Watcher(_) => "WatcherError", + }; + + ErrorResponse::new(err, error_type) + } +} +``` + +### Frontend Error Display + +```typescript +// frontend/src/utils/error-messages.ts + +export function getErrorMessage(error: Error): string { + const message = error.message; + + if (message.startsWith('ValidationError:')) { + return 'Invalid input. Please check your data.'; + } else if (message.startsWith('FileSystemError:')) { + return 'Unable to access file system. Check permissions.'; + } else if (message.startsWith('RepositoryError:')) { + return 'Database error. 
Please try again.'; + } else { + return 'An unexpected error occurred.'; + } +} +``` + +## Testing Strategy + +### Backend Command Tests + +```rust +#[cfg(test)] +mod tests { + use super::*; + use tauri::test::mock_builder; + + #[tokio::test] + async fn test_import_directory_command() { + let app = mock_builder().build().unwrap(); + + let result = import_directory( + app.handle(), + /* state */, + ImportRequest { + directory_path: "./test-fixtures/sample-logseq".to_string(), + }, + ).await; + + assert!(result.is_ok()); + let summary = result.unwrap(); + assert!(summary.successful > 0); + } +} +``` + +### Frontend Integration Tests + +```typescript +// frontend/tests/tauri-api.test.ts + +import { describe, it, expect, vi } from 'vitest'; +import { TauriApi } from '../src/lib/tauri-api'; + +describe('TauriApi', () => { + it('should import directory', async () => { + const summary = await TauriApi.importDirectory({ + directory_path: '/path/to/logseq', + }); + + expect(summary.total_files).toBeGreaterThan(0); + }); + + it('should listen to import progress', async () => { + const callback = vi.fn(); + await TauriApi.onImportProgress(callback); + + // Trigger import... + // expect(callback).toHaveBeenCalled(); + }); +}); +``` + +## Performance Considerations + +### Optimizations + +1. **Async commands:** All commands are async to avoid blocking UI +2. **Event streaming:** Use Tauri events instead of polling +3. **Batch operations:** Group database operations in transactions +4. **Connection pooling:** Reuse SQLite connections via Arc +5. **Background tasks:** Run long operations (sync watching) in separate tasks + +### Expected Performance + +- **Command latency:** <10ms for simple queries, <100ms for complex operations +- **Event latency:** <5ms from Rust β†’ Frontend +- **Import throughput:** ~10-20 files/second (with parsing + DB writes) +- **Sync latency:** <100ms from file change β†’ event emitted + +## Rollout Plan + +### Phase 1: Foundation βœ… +- [ ] Setup Tauri project structure +- [ ] Define DTOs and mappers +- [ ] Implement AppState management +- [ ] Add error handling utilities + +### Phase 2: Core Commands βœ… +- [ ] Implement import commands +- [ ] Implement sync commands +- [ ] Implement page query commands +- [ ] Add event emitters + +### Phase 3: Frontend Integration βœ… +- [ ] Create TypeScript types +- [ ] Build TauriApi client +- [ ] Create React hooks (useImport, useSync) +- [ ] Build UI components + +### Phase 4: Testing & Polish βœ… +- [ ] Backend command tests +- [ ] Frontend integration tests +- [ ] Error handling tests +- [ ] Performance profiling + +## Open Questions + +1. **Concurrent imports:** Should we allow multiple imports simultaneously? +2. **Event batching:** Should we batch rapid sync events to avoid overwhelming UI? +3. **Command cancellation:** How to cancel long-running operations (import, sync)? +4. **Offline handling:** How to handle when backend is unresponsive? +5. **State persistence:** Should AppState settings be persisted to DB or config file? 
+ +## Future Enhancements + +- **Command queue:** Queue commands when backend is busy +- **Optimistic updates:** Update UI before backend confirms +- **Command history:** Track all user actions for undo/redo +- **WebSocket alternative:** Use WebSocket for even lower latency +- **Multi-window support:** Sync state across multiple app windows +- **Plugin system:** Allow extending commands via plugins + +## References + +- Tauri documentation: https://tauri.app/v2/guides/ +- Tauri state management: https://tauri.app/v2/guides/features/state/ +- Tauri events: https://tauri.app/v2/guides/features/events/ +- Tauri commands: https://tauri.app/v2/guides/features/commands/ From 75224ff5e2f8d4acc03c70b22f8adb2bd765c83b Mon Sep 17 00:00:00 2001 From: Wesley Finck Date: Sun, 19 Oct 2025 22:13:43 -0700 Subject: [PATCH 2/7] rename file --- notes/dependencies/{fastembed-ts.md => fastembed-rs.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename notes/dependencies/{fastembed-ts.md => fastembed-rs.md} (100%) diff --git a/notes/dependencies/fastembed-ts.md b/notes/dependencies/fastembed-rs.md similarity index 100% rename from notes/dependencies/fastembed-ts.md rename to notes/dependencies/fastembed-rs.md From bf48e2ecf88df2ebe6233779404a7a8d44303d19 Mon Sep 17 00:00:00 2001 From: Wesley Finck Date: Sun, 19 Oct 2025 22:13:49 -0700 Subject: [PATCH 3/7] overview file --- notes/OVERVIEW.md | 2226 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 2226 insertions(+) create mode 100644 notes/OVERVIEW.md diff --git a/notes/OVERVIEW.md b/notes/OVERVIEW.md new file mode 100644 index 0000000..a247772 --- /dev/null +++ b/notes/OVERVIEW.md @@ -0,0 +1,2226 @@ +# Logjam Architecture Overview + +**A comprehensive guide to understanding the Logjam codebase architecture, data flow, and implementation patterns.** + +--- + +## Table of Contents + +1. [Introduction](#introduction) +2. [High-Level Architecture](#high-level-architecture) +3. [DDD Building Blocks](#ddd-building-blocks) +4. [Layer-by-Layer Breakdown](#layer-by-layer-breakdown) +5. [End-to-End Workflows](#end-to-end-workflows) +6. [Code Patterns & Examples](#code-patterns--examples) +7. [Quick Reference](#quick-reference) + +--- + +## Introduction + +Logjam is a knowledge management application that imports, syncs, and searches Logseq markdown directories. It's built using **Domain-Driven Design (DDD)** principles with a clean layered architecture. 
+ +### Core Capabilities + +- **Import:** Bulk import Logseq directories (one-time operation) +- **Sync:** Continuous file watching and incremental updates +- **Persistence:** SQLite storage for pages, blocks, and file mappings +- **Full-Text Search:** Tantivy (fuzzy, ranked keyword search) +- **Semantic Search:** Vector embeddings with similarity search (RAG-ready) +- **Hybrid Search:** Combine keyword + semantic results +- **UI Integration:** Tauri commands expose backend to frontend + +### Technology Stack + +``` +Frontend: TypeScript + React/Svelte (Tauri web view) +Backend: Rust (async with Tokio) +Database: SQLite (via sqlx) +Vector Store: Qdrant (embedded mode) +Embeddings: fastembed-rs (local embedding generation) +Text Search: Tantivy (embedded search engine) +IPC: Tauri (Rust ↔ Frontend bridge) +``` + +--- + +## High-Level Architecture + +### 4-Layer Architecture + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ PRESENTATION LAYER β”‚ +β”‚ (Tauri Commands, Event Emitters, Frontend API) β”‚ +β”‚ β”‚ +β”‚ β€’ import_directory() β€’ search() β”‚ +β”‚ β€’ start_sync() β€’ get_all_pages() β”‚ +β”‚ β€’ Events: import-progress, sync-event β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ DTOs (Data Transfer Objects) + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ APPLICATION LAYER β”‚ +β”‚ (Use Cases, Services, Repository Traits) β”‚ +β”‚ β”‚ +β”‚ Services: ImportService, SyncService, SearchService β”‚ +β”‚ Use Cases: EmbedBlocks, SemanticSearch, UpdateEmbeddings β”‚ +β”‚ Repos: PageRepository, FileMappingRepository β”‚ +β”‚ EmbeddingRepository, EmbeddingModelRepository β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ Domain Objects (Page, Block) + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ DOMAIN LAYER β”‚ +β”‚ (Pure Business Logic - NO external dependencies) β”‚ +β”‚ β”‚ +β”‚ Aggregates: Page, EmbeddedBlock β”‚ +β”‚ Entities: Block, TextChunk β”‚ +β”‚ Value Objects: PageId, BlockId, Url, PageReference, β”‚ +β”‚ EmbeddingVector, ChunkId, SimilarityScore β”‚ +β”‚ Events: PageCreated, FileProcessed, SyncCompleted β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ Domain abstractions + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ INFRASTRUCTURE LAYER β”‚ +β”‚ (Technical Implementation Details) β”‚ +β”‚ β”‚ +β”‚ Persistence: SqlitePageRepository β”‚ +β”‚ SqliteFileMappingRepository β”‚ +β”‚ Parsers: LogseqMarkdownParser β”‚ +β”‚ File System: LogseqFileWatcher, file discovery β”‚ +β”‚ Text Search: TantivySearchIndex β”‚ +β”‚ Embeddings: 
FastEmbedService, EmbeddingModelManager       β”‚
+β”‚  Text Proc:    TextPreprocessor (Logseq syntax removal)        β”‚
+β”‚  Vector Store: QdrantVectorStore, VectorCollectionManager      β”‚
+β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+```
+
+### Dependency Rule
+
+**Critical principle:** Dependencies point INWARD only.
+
+```
+Presentation β†’ Application β†’ Domain ← Infrastructure
+                    ↑
+                    β”‚
+         No dependencies on outer layers!
+```
+
+- **Domain Layer:** Zero dependencies (pure Rust, no external crates except std)
+- **Application Layer:** Depends only on Domain
+- **Infrastructure Layer:** Depends on Domain + Application (implements traits)
+- **Presentation Layer:** Depends on Application + Infrastructure (wires everything together)
+
+---
+
+## DDD Building Blocks
+
+### Domain Layer Abstractions
+
+Logjam uses classic DDD patterns defined in `backend/src/domain/base.rs`:
+
+```rust
+// 1. Value Objects (Immutable, equality based on attributes)
+pub trait ValueObject: Clone + PartialEq + Eq + Debug {}
+
+// Examples: PageId, BlockId, Url, PageReference, IndentLevel
+```
+
+```rust
+// 2. Entities (Identity-based equality)
+pub trait Entity: Debug {
+    type Id: ValueObject;
+    fn id(&self) -> &Self::Id;
+}
+
+// Example: Block (has BlockId identity)
+```
+
+```rust
+// 3. Aggregate Roots (Consistency boundaries)
+pub trait AggregateRoot: Entity {
+    fn apply_event(&mut self, event: &DomainEventEnum);
+}
+
+// Example: Page (owns Blocks, enforces invariants)
+```
+
+```rust
+// 4. Domain Events (Things that happened)
+pub trait DomainEvent: Debug + Clone {
+    fn event_type(&self) -> &'static str;
+    fn aggregate_id(&self) -> String;
+}
+
+// Examples: PageCreated, BlockAdded, FileProcessed
+```
+
+### Real Examples from Codebase
+
+#### Value Object: PageId
+
+```rust
+// backend/src/domain/value_objects.rs
+
+#[derive(Debug, Clone, PartialEq, Eq, Hash)]
+pub struct PageId(String);
+
+impl PageId {
+    pub fn new(id: impl Into<String>) -> DomainResult<Self> {
+        let id = id.into();
+        if id.is_empty() {
+            return Err(DomainError::InvalidValue("PageId cannot be empty".into()));
+        }
+        Ok(PageId(id))
+    }
+
+    pub fn as_str(&self) -> &str { &self.0 }
+}
+
+impl ValueObject for PageId {}
+```
+
+**Key pattern:** Constructor validation, immutability, private fields.
+
+#### Entity: Block
+
+```rust
+// backend/src/domain/entities.rs
+
+#[derive(Debug, Clone)]
+pub struct Block {
+    id: BlockId,                    // Identity
+    content: BlockContent,
+    indent_level: IndentLevel,
+    parent_id: Option<BlockId>,
+    child_ids: Vec<BlockId>,
+    urls: Vec<Url>,
+    page_references: Vec<PageReference>,
+}
+
+impl Entity for Block {
+    type Id = BlockId;
+    fn id(&self) -> &Self::Id { &self.id }
+}
+```
+
+**Key pattern:** Has identity (`BlockId`), mutable state, behavior methods.
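+
+To make the distinction concrete, here is a small illustrative snippet (hypothetical, not taken from the codebase; it assumes the constructors shown above plus the `Block::new_root`, `BlockId::generate`, and `BlockContent::new` helpers used later in this document):
+
+```rust
+fn identity_vs_value() -> DomainResult<()> {
+    // Value objects compare by their attributes: two PageIds built from
+    // the same string are interchangeable.
+    let a = PageId::new("my-note")?;
+    let b = PageId::new("my-note")?;
+    assert_eq!(a, b);
+
+    // Entities compare by identity: two Blocks with identical content are
+    // still different entities, because each carries its own BlockId.
+    let first = Block::new_root(BlockId::generate(), BlockContent::new("same text")?);
+    let second = Block::new_root(BlockId::generate(), BlockContent::new("same text")?);
+    assert_ne!(first.id(), second.id());
+
+    Ok(())
+}
+```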
+ +#### Aggregate Root: Page + +```rust +// backend/src/domain/aggregates.rs + +#[derive(Debug, Clone)] +pub struct Page { + id: PageId, // Aggregate ID + title: String, + blocks: HashMap, // Owned entities + root_block_ids: Vec, +} + +impl Page { + // Enforces invariants + pub fn add_block(&mut self, block: Block) -> DomainResult<()> { + // INVARIANT: Parent must exist before adding child + if let Some(parent_id) = block.parent_id() { + if !self.blocks.contains_key(parent_id) { + return Err(DomainError::NotFound( + format!("Parent block {} not found", parent_id.as_str()) + )); + } + } + + self.blocks.insert(block.id().clone(), block); + Ok(()) + } + + // Recursive operations + pub fn remove_block(&mut self, block_id: &BlockId) -> DomainResult { + // Recursively delete entire subtree + let descendants = self.get_descendants(block_id)?; + for desc_id in descendants { + self.blocks.remove(&desc_id); + } + Ok(self.blocks.remove(block_id).is_some()) + } +} + +impl AggregateRoot for Page { + fn apply_event(&mut self, event: &DomainEventEnum) { + // Placeholder for event sourcing + } +} +``` + +**Key pattern:** Consistency boundary - all block operations go through Page methods. + +--- + +## Layer-by-Layer Breakdown + +### 1. Domain Layer (`backend/src/domain/`) + +**Purpose:** Define business rules and entities (file system, database agnostic). + +``` +domain/ +β”œβ”€β”€ base.rs # DDD trait definitions +β”œβ”€β”€ value_objects.rs # PageId, BlockId, Url, etc. +β”œβ”€β”€ entities.rs # Block entity +β”œβ”€β”€ aggregates.rs # Page aggregate root +β”œβ”€β”€ events.rs # Domain events +└── mod.rs +``` + +**Key Concepts:** + +- **Page Aggregate:** Owns blocks, enforces hierarchy rules +- **Value Objects:** Self-validating, immutable data +- **No I/O:** All methods are synchronous, pure transformations + +**Example - URL extraction:** + +```rust +// Domain layer doesn't care HOW URLs are extracted from text +// It just defines WHAT a URL is + +#[derive(Debug, Clone, PartialEq, Eq)] +pub struct Url { + url: String, + domain: Option, +} + +impl Url { + pub fn new(url: impl Into) -> DomainResult { + let url = url.into(); + + // Validation + if !url.starts_with("http://") && !url.starts_with("https://") { + return Err(DomainError::InvalidValue("URL must start with http(s)".into())); + } + + // Extract domain + let domain = url.split('/') + .nth(2) + .map(String::from); + + Ok(Self { url, domain }) + } +} +``` + +### 2. Application Layer (`backend/src/application/`) + +**Purpose:** Orchestrate use cases by coordinating domain objects and repositories. 
+
+```
+application/
+β”œβ”€β”€ repositories/
+β”‚   β”œβ”€β”€ page_repository.rs           # Trait definition
+β”‚   └── file_mapping_repository.rs   # Trait definition
+β”œβ”€β”€ services/
+β”‚   β”œβ”€β”€ import_service.rs            # Bulk import orchestration
+β”‚   β”œβ”€β”€ sync_service.rs              # Continuous sync orchestration
+β”‚   └── search_service.rs            # Search queries
+β”œβ”€β”€ dto/                             # Data Transfer Objects
+└── use_cases/                       # CQRS-style commands/queries
+```
+
+**Repository Pattern:**
+
+```rust
+// backend/src/application/repositories/page_repository.rs
+
+pub trait PageRepository {
+    fn save(&mut self, page: Page) -> DomainResult<()>;
+    fn find_by_id(&self, id: &PageId) -> DomainResult<Option<Page>>;
+    fn find_by_title(&self, title: &str) -> DomainResult<Option<Page>>;
+    fn find_all(&self) -> DomainResult<Vec<Page>>;
+    fn delete(&mut self, id: &PageId) -> DomainResult<bool>;
+}
+```
+
+**Service Pattern:**
+
+```rust
+// backend/src/application/services/import_service.rs
+
+pub struct ImportService<R: PageRepository> {
+    repository: R,
+    max_concurrent_files: usize,
+}
+
+impl<R: PageRepository> ImportService<R> {
+    pub async fn import_directory(
+        &mut self,
+        directory_path: LogseqDirectoryPath,
+        progress_callback: Option<ProgressCallback>,
+    ) -> ImportResult<ImportSummary> {
+        // 1. Discover files
+        // 2. Parse in parallel (bounded concurrency)
+        // 3. Save to repository
+        // 4. Report progress
+    }
+}
+```
+
+**Key pattern:** Generic over repository trait (dependency injection).
+
+### 3. Infrastructure Layer (`backend/src/infrastructure/`)
+
+**Purpose:** Implement technical details (DB, file I/O, parsing, search).
+
+```
+infrastructure/
+β”œβ”€β”€ persistence/
+β”‚   β”œβ”€β”€ sqlite_page_repository.rs        # SQLite implementation
+β”‚   β”œβ”€β”€ sqlite_file_mapping_repository.rs
+β”‚   β”œβ”€β”€ models.rs                        # Database row structs
+β”‚   └── mappers.rs                       # Domain ↔ DB conversion
+β”œβ”€β”€ parsers/
+β”‚   └── logseq_markdown.rs               # .md β†’ Page/Block
+β”œβ”€β”€ file_system/
+β”‚   β”œβ”€β”€ discovery.rs                     # Find .md files
+β”‚   └── watcher.rs                       # File change detection
+└── search/
+    β”œβ”€β”€ tantivy_index.rs                 # Search index
+    └── schema.rs                        # Search document schema
+```
+
+**Example - Repository Implementation:**
+
+```rust
+// backend/src/infrastructure/persistence/sqlite_page_repository.rs
+
+pub struct SqlitePageRepository {
+    pool: SqlitePool,
+}
+
+impl PageRepository for SqlitePageRepository {
+    fn save(&mut self, page: Page) -> DomainResult<()> {
+        tokio::task::block_in_place(|| {
+            tokio::runtime::Handle::current().block_on(async {
+                // 1. Convert domain Page to database rows
+                let (page_row, block_rows, url_rows, ref_rows) =
+                    PageMapper::from_domain(&page);
+
+                // 2. Begin transaction
+                let mut tx = self.pool.begin().await?;
+
+                // 3. Upsert page
+                sqlx::query("INSERT INTO pages (...) VALUES (...) ON CONFLICT(...) DO UPDATE...")
+                    .bind(&page_row.id)
+                    .bind(&page_row.title)
+                    .execute(&mut tx).await?;
+
+                // 4. Delete old blocks
+                sqlx::query("DELETE FROM blocks WHERE page_id = ?")
+                    .bind(&page_row.id)
+                    .execute(&mut tx).await?;
+
+                // 5. Insert new blocks
+                for block in block_rows { /* ... */ }
+
+                // 6. Commit
+                tx.commit().await?;
+
+                Ok(())
+            })
+        })
+    }
+}
+```
+
+**Example - Parser:**
+
+```rust
+// backend/src/infrastructure/parsers/logseq_markdown.rs
+
+pub struct LogseqMarkdownParser;
+
+impl LogseqMarkdownParser {
+    pub async fn parse_file(path: &Path) -> ParseResult<Page> {
+        // 1. Read file
+        let content = tokio::fs::read_to_string(path).await?;
+
+        // 2.
Extract title from filename + let title = path.file_stem() + .and_then(|s| s.to_str()) + .ok_or_else(|| ParseError::InvalidMarkdown("No filename".into()))?; + + let page_id = PageId::new(title)?; + + // 3. Parse content into blocks + Self::parse_content(&content, page_id, title.to_string()) + } + + fn parse_content(content: &str, page_id: PageId, title: String) -> ParseResult { + let mut page = Page::new(page_id, title); + + // Parse each line as a block + for line in content.lines() { + if line.trim().is_empty() { continue; } + + // Calculate indent level + let indent_level = Self::calculate_indent(line); + + // Extract content (remove bullet markers) + let content = Self::extract_content(line); + + // Extract URLs and page references + let urls = Self::extract_urls(&content); + let refs = Self::extract_page_references(&content); + + // Create block + let block = Block::new_root( + BlockId::generate(), + BlockContent::new(content)?, + ); + + page.add_block(block)?; + } + + Ok(page) + } +} +``` + +### 4. Presentation Layer (`backend/src/tauri/`) + +**Purpose:** Expose backend to frontend via Tauri commands and events. + +``` +tauri/ +β”œβ”€β”€ state.rs # AppState (shared state) +β”œβ”€β”€ dto.rs # Serializable DTOs +β”œβ”€β”€ mappers.rs # Domain β†’ DTO conversion +└── commands/ + β”œβ”€β”€ import.rs # Import commands + β”œβ”€β”€ sync.rs # Sync commands + β”œβ”€β”€ pages.rs # Page query commands + └── search.rs # Search commands +``` + +**Example - Tauri Command:** + +```rust +// backend/src/tauri/commands/import.rs + +#[tauri::command] +pub async fn import_directory( + app: AppHandle, + state: State<'_, AppState>, + request: ImportRequest, +) -> Result { + // 1. Validate input + let logseq_dir = LogseqDirectoryPath::new(&request.directory_path) + .map_err(|e| ErrorResponse::new(e, "ValidationError"))?; + + // 2. Create service with repositories from state + let page_repo = state.page_repository.lock().await.clone(); + let mapping_repo = state.mapping_repository.lock().await.clone(); + let mut import_service = ImportService::new(page_repo, mapping_repo); + + // 3. Setup progress callback (emit events to frontend) + let app_clone = app.clone(); + let progress_callback = move |event| { + let dto_event = DtoMapper::import_event_to_dto(event); + let _ = app_clone.emit("import-progress", dto_event); + }; + + // 4. Execute import + let summary = import_service + .import_directory(logseq_dir, Some(Arc::new(progress_callback))) + .await + .map_err(|e| ErrorResponse::new(e, "ImportError"))?; + + // 5. 
Convert to DTO and return + Ok(DtoMapper::import_summary_to_dto(&summary)) +} +``` + +--- + +## End-to-End Workflows + +### Workflow 1: Initial Import + +**User Action:** Click "Import Logseq Directory" β†’ Select `/path/to/logseq` + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ FRONTEND β”‚ +β”‚ User clicks import β†’ TauriApi.importDirectory() β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ invoke("import_directory", {...}) + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ TAURI COMMAND LAYER β”‚ +β”‚ import_directory(app, state, request) β”‚ +β”‚ 1. Validate LogseqDirectoryPath β”‚ +β”‚ 2. Create ImportService β”‚ +β”‚ 3. Setup progress callback β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ import_service.import_directory() + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ APPLICATION LAYER β”‚ +β”‚ ImportService::import_directory() β”‚ +β”‚ 1. Discover files: discover_logseq_files(dir) β”‚ +β”‚ 2. For each file (parallel, bounded concurrency): β”‚ +β”‚ β”œβ”€ LogseqMarkdownParser::parse_file(path) β”‚ +β”‚ β”œβ”€ page_repository.save(page) β”‚ +β”‚ β”œβ”€ mapping_repository.save(mapping) β”‚ +β”‚ └─ emit progress event β”‚ +β”‚ 3. Return ImportSummary β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ Calls to infrastructure... 
+ ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ INFRASTRUCTURE LAYER β”‚ +β”‚ β”‚ +β”‚ File Discovery (infrastructure/file_system/discovery.rs): β”‚ +β”‚ β€’ Recursively scan pages/ and journals/ β”‚ +β”‚ β€’ Filter for .md files β”‚ +β”‚ β€’ Return Vec β”‚ +β”‚ β”‚ +β”‚ Parser (infrastructure/parsers/logseq_markdown.rs): β”‚ +β”‚ β€’ Read file content β”‚ +β”‚ β€’ Parse markdown lines β†’ Blocks β”‚ +β”‚ β€’ Extract URLs, page references β”‚ +β”‚ β€’ Build Page aggregate β”‚ +β”‚ β”‚ +β”‚ Persistence (infrastructure/persistence/): β”‚ +β”‚ β€’ SqlitePageRepository::save() β”‚ +β”‚ └─ INSERT pages, blocks, urls, refs (transaction) β”‚ +β”‚ β€’ SqliteFileMappingRepository::save() β”‚ +β”‚ └─ INSERT file_page_mappings β”‚ +β”‚ β”‚ +β”‚ Search Index (infrastructure/search/tantivy_index.rs): β”‚ +β”‚ β€’ TantivySearchIndex::index_page() β”‚ +β”‚ └─ Add page doc + block docs to search index β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +**Code Flow (Simplified):** + +```rust +// 1. FRONTEND (TypeScript) +const summary = await TauriApi.importDirectory({ + directory_path: "/Users/me/logseq" +}); + +// 2. TAURI COMMAND +#[tauri::command] +async fn import_directory(state: State, request: ImportRequest) + -> Result +{ + let mut service = ImportService::new( + state.page_repository.lock().await, + state.mapping_repository.lock().await + ); + + let summary = service.import_directory(logseq_dir, callback).await?; + Ok(DtoMapper::to_dto(summary)) +} + +// 3. APPLICATION SERVICE +impl ImportService { + async fn import_directory(&mut self, dir: LogseqDirectoryPath) -> ImportResult { + let files = discover_logseq_files(dir.as_path()).await?; + + for file in files { + let page = LogseqMarkdownParser::parse_file(&file).await?; + self.repository.save(page)?; + // ... emit progress + } + + Ok(summary) + } +} + +// 4. INFRASTRUCTURE - PARSER +impl LogseqMarkdownParser { + async fn parse_file(path: &Path) -> ParseResult { + let content = tokio::fs::read_to_string(path).await?; + // ... parse into Page aggregate + Ok(page) + } +} + +// 5. 
INFRASTRUCTURE - REPOSITORY +impl PageRepository for SqlitePageRepository { + fn save(&mut self, page: Page) -> DomainResult<()> { + // Transaction: INSERT pages, blocks, urls, refs + Ok(()) + } +} +``` + +**Data Transformations:** + +``` +File System Domain Database +──────────── ──────── ──────── + +/pages/my-note.md Page { pages: + - Line 1 id: "my-note" id: "my-note" + - Line 2 title: "my-note" title: "my-note" + - Nested blocks: [ + Block { blocks: + content: "Line 1" id: "block-1" + indent: 0 page_id: "my-note" + }, content: "Line 1" + Block { indent_level: 0 + content: "Nested" + indent: 1 blocks: + } id: "block-2" + ] page_id: "my-note" + } content: "Nested" + parent_id: "block-1" + indent_level: 1 +``` + +### Workflow 2: Continuous Sync (File Watching) + +**User Action:** Click "Start Sync" β†’ App watches for file changes + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ FILE SYSTEM β”‚ +β”‚ User edits /pages/my-note.md in Logseq β”‚ +β”‚ File saved β†’ OS emits file change event β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ inotify/FSEvents + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ INFRASTRUCTURE - WATCHER β”‚ +β”‚ LogseqFileWatcher (using notify crate) β”‚ +β”‚ β€’ Receives raw file event β”‚ +β”‚ β€’ Debounces (500ms window) β”‚ +β”‚ β€’ Filters for .md files in pages/journals/ β”‚ +β”‚ β€’ Converts to FileEvent { path, kind } β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ FileEvent::Modified(path) + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ APPLICATION - SYNC SERVICE β”‚ +β”‚ SyncService::handle_event() β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Match event.kind: β”‚ β”‚ +β”‚ β”‚ Created β†’ handle_file_created(path) β”‚ β”‚ +β”‚ β”‚ Modified β†’ handle_file_updated(path) β”‚ β”‚ +β”‚ β”‚ Deleted β†’ handle_file_deleted(path) β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + ↓ (example: Modified event) +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ handle_file_updated(path): β”‚ +β”‚ 1. Check FileMappingRepository for existing mapping β”‚ +β”‚ 2. 
If stale (file modified > last sync): β”‚ +β”‚ β”œβ”€ Parse file β†’ Page β”‚ +β”‚ β”œβ”€ PageRepository.save(page) [UPDATE] β”‚ +β”‚ β”œβ”€ FileMappingRepository.save(...) [UPDATE timestamp] β”‚ +β”‚ β”œβ”€ SearchIndex.update_page(page) [REINDEX] β”‚ +β”‚ └─ Emit SyncEvent::FileUpdated β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +**Code Example:** + +```rust +// APPLICATION LAYER - SyncService + +impl SyncService { + pub async fn start_watching(&self, callback: Option) -> SyncResult<()> { + loop { + // Block until next event + let event = self.watcher.recv().await?; + + match event.kind { + FileEventKind::Created => self.handle_file_created(event.path).await?, + FileEventKind::Modified => self.handle_file_updated(event.path).await?, + FileEventKind::Deleted => self.handle_file_deleted(event.path).await?, + } + + // Notify frontend + if let Some(ref cb) = callback { + cb(SyncEvent::FileUpdated(event.path.clone())); + } + } + } + + async fn handle_file_updated(&self, path: PathBuf) -> SyncResult<()> { + // 1. Get existing mapping + let mapping_repo = self.mapping_repository.lock().await; + let existing = mapping_repo.find_by_path(&path)?; + + // 2. Check if file actually changed + let metadata = tokio::fs::metadata(&path).await?; + let current_modified = metadata.modified()?; + + if let Some(mapping) = existing { + if !mapping.is_stale(current_modified) { + return Ok(()); // No changes, skip + } + } + + // 3. Re-parse file + let page = LogseqMarkdownParser::parse_file(&path).await?; + + // 4. Update repository + let mut page_repo = self.page_repository.lock().await; + page_repo.save(page.clone())?; + + // 5. Update file mapping + let mut mapping_repo = self.mapping_repository.lock().await; + mapping_repo.save(FilePathMapping::new(path, page.id().clone(), ...))?; + + // 6. Update search index + if let Some(ref index) = self.search_index { + index.lock().await.update_page(&page)?; + index.lock().await.commit()?; + } + + Ok(()) + } + + async fn handle_file_deleted(&self, path: PathBuf) -> SyncResult<()> { + // 1. Find mapping to get PageId + let mut mapping_repo = self.mapping_repository.lock().await; + let mapping = mapping_repo.find_by_path(&path)? + .ok_or_else(|| SyncError::NotFound("No mapping for deleted file".into()))?; + + let page_id = mapping.page_id().clone(); + + // 2. Delete from repository + let mut page_repo = self.page_repository.lock().await; + page_repo.delete(&page_id)?; + + // 3. Delete mapping (CASCADE in DB) + mapping_repo.delete_by_path(&path)?; + + // 4. Delete from search index + if let Some(ref index) = self.search_index { + index.lock().await.delete_page(&page_id)?; + index.lock().await.commit()?; + } + + Ok(()) + } +} +``` + +**Key Insight - Fileβ†’Page Mapping:** + +Without file mappings, we can't handle deletions: + +``` +❌ PROBLEM: +File deleted: /pages/my-note.md +Which Page to delete? We don't know the PageId! + +βœ… SOLUTION (with FileMappingRepository): +1. Query: SELECT page_id FROM file_page_mappings WHERE file_path = '/pages/my-note.md' +2. Result: page_id = "my-note" +3. 
Delete: PageRepository.delete("my-note")
+```
+
+### Workflow 3: Full-Text Search
+
+**User Action:** Type "algorithm" in search box
+
+```
+β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+β”‚                         FRONTEND                                             β”‚
+β”‚  Search input component β†’ calls search(query) as the user types              β”‚
+β”‚  User types: "algorithm"                                                     β”‚
+β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+                             β”‚ TauriApi.search({ query: "algorithm", ... })
+                             ↓
+β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+β”‚                    TAURI COMMAND                                             β”‚
+β”‚  search(state, request) β†’ SearchResultDto[]                                  β”‚
+β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+                             β”‚
+                             ↓
+β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+β”‚                 APPLICATION - SEARCH SERVICE                                 β”‚
+β”‚  SearchService::search(query, limit)                                         β”‚
+β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+                             β”‚
+                             ↓
+β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+β”‚                 INFRASTRUCTURE - TANTIVY INDEX                               β”‚
+β”‚  TantivySearchIndex::search("algorithm", 20)                                 β”‚
+β”‚                                                                              β”‚
+β”‚  1. Parse query into Tantivy Query object                                    β”‚
+β”‚     β”œβ”€ QueryParser for fields: [page_title, block_content, ...]              β”‚
+β”‚     └─ Parse "algorithm" into terms                                          β”‚
+β”‚                                                                              β”‚
+β”‚  2. Execute search with BM25 ranking                                         β”‚
+β”‚     β”œβ”€ Searcher scans inverted index                                         β”‚
+β”‚     β”œβ”€ Calculate relevance scores                                            β”‚
+β”‚     └─ Return top 20 documents                                               β”‚
+β”‚                                                                              β”‚
+β”‚  3.
Convert Tantivy documents β†’ SearchResult β”‚ +β”‚ └─ Extract page_id, block_id, content from stored fields β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ Vec + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Return to frontend: β”‚ +β”‚ [ β”‚ +β”‚ BlockResult { β”‚ +β”‚ page_id: "data-structures", β”‚ +β”‚ block_id: "block-42", β”‚ +β”‚ block_content: "Binary search algorithm is O(log n)", β”‚ +β”‚ score: 8.7 β”‚ +β”‚ }, β”‚ +β”‚ PageResult { β”‚ +β”‚ page_id: "algorithms", β”‚ +β”‚ page_title: "Algorithms & Complexity", β”‚ +β”‚ score: 6.2 β”‚ +β”‚ } β”‚ +β”‚ ] β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +**Tantivy Index Structure:** + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ TANTIVY INDEX β”‚ +β”‚ β”‚ +β”‚ Document Type 1: PAGE DOCUMENTS β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ page_id: "algorithms" β”‚ β”‚ +β”‚ β”‚ page_title: "Algorithms & Complexity" [SEARCHABLE] β”‚ β”‚ +β”‚ β”‚ document_type: "/page" [FACET] β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β”‚ Document Type 2: BLOCK DOCUMENTS β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ page_id: "data-structures" β”‚ β”‚ +β”‚ β”‚ block_id: "block-42" β”‚ β”‚ +β”‚ β”‚ page_title: "Data Structures" β”‚ β”‚ +β”‚ β”‚ block_content: "Binary search algorithm..."[SEARCHABLE]β”‚ β”‚ +β”‚ β”‚ urls: "https://en.wikipedia.org/wiki/Binary_search" β”‚ β”‚ +β”‚ β”‚ page_references: "algorithms complexity" β”‚ β”‚ +β”‚ β”‚ document_type: "/block" [FACET] β”‚ β”‚ +β”‚ β”‚ indent_level: 1 [INDEXED] β”‚ β”‚ +β”‚ β”‚ url_domains: "/domain/en.wikipedia.org"[FACET] β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β”‚ Inverted Index (for fast term lookup): β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ "algorithm" β†’ [doc_1, doc_5, doc_42, ...] β”‚ β”‚ +β”‚ β”‚ "binary" β†’ [doc_42, doc_103, ...] β”‚ β”‚ +β”‚ β”‚ "search" β†’ [doc_42, doc_55, ...] 
β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +**Search Query Types:** + +```rust +// 1. BASIC SEARCH (exact terms) +search_service.search("machine learning", 20) +// β†’ Finds documents with "machine" AND/OR "learning" + +// 2. FUZZY SEARCH (typo-tolerant, Levenshtein distance ≀ 2) +search_service.fuzzy_search("algoritm", 20) +// β†’ Matches "algorithm" (edit distance = 1) + +// 3. FILTERED SEARCH (facets) +search_service.search_with_filters("rust", 20, SearchFilters { + document_type: Some("block"), // Only search blocks + reference_type: Some("tag"), // Only blocks with tags +}) + +// 4. SPECIALIZED SEARCHES +search_service.search_pages("rust", 20) // Only page titles +search_service.search_blocks("rust", 20) // Only block content +search_service.search_tags("programming", 20) // Only tagged blocks +``` + +### Workflow 4: Semantic Search with Embeddings + +**User Action:** Ask natural language question: "How do I optimize database queries?" + +**Purpose:** Unlike keyword search (Tantivy), semantic search understands *meaning*. It finds conceptually similar content even without exact keyword matches. + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ FRONTEND β”‚ +β”‚ User types: "How do I optimize database queries?" β”‚ +β”‚ (No exact keywords like "SQL" or "index" in query) β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ TauriApi.semanticSearch({ query: "..." 
}) + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ TAURI COMMAND β”‚ +β”‚ semantic_search(state, request) β†’ SemanticResultDto[] β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ APPLICATION - SEMANTIC SEARCH USE CASE β”‚ +β”‚ SemanticSearch::execute(request) β”‚ +β”‚ β”‚ +β”‚ Step 1: Generate query embedding β”‚ +β”‚ query_vector = fastembed_service.generate_embeddings([query]) β”‚ +β”‚ β†’ [0.12, -0.45, 0.89, ..., 0.34] (384 dimensions) β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ EmbeddingVector (query) + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ INFRASTRUCTURE - QDRANT VECTOR STORE β”‚ +β”‚ QdrantVectorStore::similarity_search(query_vector, limit) β”‚ +β”‚ β”‚ +β”‚ Step 2: Similarity search (cosine similarity) β”‚ +β”‚ β”œβ”€ Compare query_vector to all chunk embeddings β”‚ +β”‚ β”œβ”€ Calculate cosine similarity scores β”‚ +β”‚ └─ Return top K most similar chunks β”‚ +β”‚ β”‚ +β”‚ Vector Index (HNSW - Hierarchical Navigable Small World): β”‚ +β”‚ β€’ Approximate nearest neighbor (ANN) search β”‚ +β”‚ β€’ O(log n) complexity instead of O(n) β”‚ +β”‚ β€’ Trade-off: 95%+ accuracy with 100x speedup β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ Vec + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Results (ranked by semantic similarity): β”‚ +β”‚ [ β”‚ +β”‚ ScoredChunk { β”‚ +β”‚ chunk_id: "chunk-147", β”‚ +β”‚ page_id: "database-performance", β”‚ +β”‚ block_id: "block-89", β”‚ +β”‚ content: "Adding indexes on foreign keys dramatically β”‚ +β”‚ improves JOIN performance. Use EXPLAIN to..." β”‚ +β”‚ similarity_score: 0.87 ← High semantic match! β”‚ +β”‚ }, β”‚ +β”‚ ScoredChunk { β”‚ +β”‚ chunk_id: "chunk-203", β”‚ +β”‚ page_id: "sql-tips", β”‚ +β”‚ content: "Query planning: PostgreSQL query planner uses β”‚ +β”‚ statistics to optimize execution..." β”‚ +β”‚ similarity_score: 0.82 β”‚ +β”‚ } β”‚ +β”‚ ] β”‚ +β”‚ β”‚ +β”‚ Note: Neither result contains "optimize database queries" β”‚ +β”‚ but both are semantically related! β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +#### Chunking Strategy + +**Problem:** Embeddings have token limits (usually 512 tokens). We need to split pages into chunks. 
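+
+Before comparing strategies, here is a minimal sketch of the splitting step itself (a hypothetical helper, not code from the backend: it approximates tokens as whitespace-separated words, whereas the real pipeline would count tokens with the embedding model's tokenizer):
+
+```rust
+/// Split `text` into windows of at most `max_tokens` "tokens"
+/// (approximated here as whitespace-separated words), repeating `overlap`
+/// tokens between consecutive windows so context is not lost at the cut.
+fn split_with_overlap(text: &str, max_tokens: usize, overlap: usize) -> Vec<String> {
+    debug_assert!(overlap < max_tokens);
+    let words: Vec<&str> = text.split_whitespace().collect();
+    if words.len() <= max_tokens {
+        return vec![text.to_string()];
+    }
+
+    let step = max_tokens - overlap;
+    let mut chunks = Vec::new();
+    let mut start = 0;
+    while start < words.len() {
+        let end = (start + max_tokens).min(words.len());
+        chunks.push(words[start..end].join(" "));
+        if end == words.len() {
+            break;
+        }
+        start += step;
+    }
+    chunks
+}
+
+// Example: split_with_overlap(block_text, 512, 50) yields windows of at most
+// 512 words where each window repeats the last 50 words of the previous one.
+```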
+ +**Chunking Approaches:** + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ CHUNKING STRATEGIES β”‚ +β”‚ β”‚ +β”‚ 1. BLOCK-BASED WITH PREPROCESSING (Logseq-aware) β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Step 1: Remove Logseq syntax β”‚ β”‚ +β”‚ β”‚ "Check [[Page Reference]] and #tag" β”‚ β”‚ +β”‚ β”‚ β†’ "Check Page Reference and tag" β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ Step 2: Add context markers β”‚ β”‚ +β”‚ β”‚ Block: "Neural networks..." β”‚ β”‚ +β”‚ β”‚ β†’ "Page: Machine Learning. Neural networks..." β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ Step 3: Create chunks (1 block = 1 chunk if ≀512 tok) β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ βœ… Preserves hierarchical context β”‚ β”‚ +β”‚ β”‚ βœ… Clean text for better embeddings β”‚ β”‚ +β”‚ β”‚ ❌ Blocks can still be too small or too large β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β”‚ 2. ROLLING WINDOW CHUNKING (Overlapping) β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Fixed-size chunks with overlap β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ Text: "ABCDEFGHIJ" β”‚ β”‚ +β”‚ β”‚ Chunk 1: [ABC] β”‚ β”‚ +β”‚ β”‚ Chunk 2: [CDE] ← 1 token overlap β”‚ β”‚ +β”‚ β”‚ Chunk 3: [EFG] β”‚ β”‚ +β”‚ β”‚ Chunk 4: [GHI] β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ βœ… Ensures context isn't lost at boundaries β”‚ β”‚ +β”‚ β”‚ ❌ More chunks = more storage + compute β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β”‚ 3. SEMANTIC CHUNKING (Context-aware) β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Split at topic boundaries (sentence similarity) β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ Paragraph 1: "Rust ownership rules..." β”‚ β”‚ +β”‚ β”‚ Paragraph 2: "Borrowing prevents data races..." β”‚ β”‚ +β”‚ β”‚ ↓ High similarity β†’ same chunk β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ Paragraph 3: "JavaScript async/await..." 
β”‚ β”‚ +β”‚ β”‚ ↓ Low similarity β†’ new chunk β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ βœ… Chunks are topically coherent β”‚ β”‚ +β”‚ β”‚ ❌ Computationally expensive β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β”‚ RECOMMENDED: Block-based with preprocessing β”‚ +β”‚ β€’ Preprocess: Remove Logseq syntax, add context markers β”‚ +β”‚ β€’ 1 block = 1 chunk if ≀ 512 tokens β”‚ +β”‚ β€’ Split large blocks with 50-token overlap β”‚ +β”‚ β€’ Batch processing: 32 blocks per batch for efficiency β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +#### Embedding Generation Pipeline + +**Full workflow from import to embedding:** + +```rust +// 1. IMPORT/SYNC: Page is saved to database +page_repository.save(page)?; + +// 2. PREPROCESSING & CHUNKING: Create TextChunks from blocks +let text_preprocessor = TextPreprocessor::new(); +let chunks: Vec = page.all_blocks() + .flat_map(|block| { + // Preprocess: Remove [[links]], #tags, clean markdown + let preprocessed = text_preprocessor.preprocess_block(block); + + // Add context: page title, parent hierarchy + let with_context = text_preprocessor.add_context_markers(&preprocessed, block); + + // Chunk if needed (512 token limit, 50 token overlap) + TextChunk::from_block(block, page.title(), with_context) + }) + .collect(); + +// Example chunk: +// TextChunk { +// chunk_id: "block-1-chunk-0", +// block_id: "block-1", +// page_id: "machine-learning", +// original_content: "Check [[Neural Networks]] for #deep-learning info", +// preprocessed_content: "Page: Machine Learning. Check Neural Networks for deep-learning info", +// chunk_index: 0, +// total_chunks: 1 +// } + +// 3. BATCH EMBEDDING: Generate vectors for chunks (32 at a time) +let batch_size = 32; +for chunk_batch in chunks.chunks(batch_size) { + let texts: Vec = chunk_batch.iter() + .map(|c| c.preprocessed_text().to_string()) + .collect(); + + // Use fastembed-rs for local embedding generation + let embeddings = fastembed_service.generate_embeddings(texts).await?; + // embeddings = Vec> with 384 dimensions (all-MiniLM-L6-v2) + + // 4. STORAGE: Save to Qdrant vector database + qdrant_store.upsert_embeddings(chunk_batch, embeddings).await?; +} + +// 5. INDEX: Qdrant builds HNSW index automatically (no manual commit needed) +``` + +#### Code Example: Text Preprocessing + +```rust +// backend/src/infrastructure/embeddings/text_preprocessor.rs + +pub struct TextPreprocessor; + +impl TextPreprocessor { + pub fn preprocess_block(&self, block: &Block) -> String { + let content = block.content().as_str(); + + // 1. Remove Logseq-specific syntax + let cleaned = self.remove_logseq_syntax(content); + + // 2. 
Clean markdown formatting + let cleaned = self.clean_markdown(&cleaned); + + cleaned + } + + fn remove_logseq_syntax(&self, text: &str) -> String { + let mut result = text.to_string(); + + // Remove [[page references]] but keep the text + // "Check [[Neural Networks]]" β†’ "Check Neural Networks" + result = Regex::new(r"\[\[([^\]]+)\]\]") + .unwrap() + .replace_all(&result, "$1") + .to_string(); + + // Remove #tags but keep the text + // "Learn #machine-learning" β†’ "Learn machine-learning" + result = Regex::new(r"#(\S+)") + .unwrap() + .replace_all(&result, "$1") + .to_string(); + + // Remove TODO/DONE markers + result = Regex::new(r"(TODO|DONE|LATER|NOW|WAITING)\s+") + .unwrap() + .replace_all(&result, "") + .to_string(); + + result + } + + fn clean_markdown(&self, text: &str) -> String { + // Remove bold/italic markers but keep text + // Remove code block markers + // Keep content readable for embeddings + // ... implementation + } + + pub fn add_context_markers(&self, text: &str, block: &Block, page: &Page) -> String { + let mut contextualized = String::new(); + + // Add page title as context + contextualized.push_str(&format!("Page: {}. ", page.title())); + + // Add parent block context for nested blocks + if let Some(parent_id) = block.parent_id() { + if let Some(parent) = page.get_block(parent_id) { + contextualized.push_str(&format!("Parent: {}. ", parent.content().as_str())); + } + } + + // Add the actual content + contextualized.push_str(text); + + contextualized + } +} +``` + +#### Code Example: EmbedBlocks Use Case + +```rust +// backend/src/application/use_cases/embed_blocks.rs + +pub struct EmbedBlocks { + embedding_service: Arc, + vector_store: Arc, + embedding_repository: Arc, + preprocessor: TextPreprocessor, +} + +impl EmbedBlocks { + pub async fn execute(&self, blocks: Vec, page: &Page) -> DomainResult<()> { + // 1. Preprocess blocks into TextChunks + let chunks = self.create_chunks_from_blocks(blocks, page)?; + + // 2. Generate embeddings in batches (32 at a time for efficiency) + let batch_size = 32; + for chunk_batch in chunks.chunks(batch_size) { + let texts: Vec = chunk_batch.iter() + .map(|c| c.preprocessed_text().to_string()) + .collect(); + + // Generate embeddings using fastembed-rs + let embeddings = self.embedding_service + .generate_embeddings(texts) + .await?; + + // 3. 
Store in vector database with metadata + self.store_embeddings(chunk_batch, embeddings).await?; + } + + Ok(()) + } + + fn create_chunks_from_blocks(&self, blocks: Vec, page: &Page) -> DomainResult> { + let mut chunks = Vec::new(); + + for block in blocks { + // Preprocess: remove Logseq syntax, clean markdown + let cleaned = self.preprocessor.preprocess_block(&block); + + // Add context: page title, parent hierarchy + let with_context = self.preprocessor.add_context_markers(&cleaned, &block, page); + + // Create chunks (split if > 512 tokens, 50 token overlap) + let block_chunks = TextChunk::from_block(&block, page.title(), with_context); + chunks.extend(block_chunks); + } + + Ok(chunks) + } + + async fn store_embeddings(&self, chunks: &[TextChunk], embeddings: Vec>) -> DomainResult<()> { + for (chunk, embedding) in chunks.iter().zip(embeddings.iter()) { + // Create EmbeddingVector value object + let embedding_vector = EmbeddingVector::new(embedding.clone())?; + + // Create EmbeddedBlock aggregate + let embedded_block = EmbeddedBlock::new( + chunk.block_id().clone(), + chunk.page_id().clone(), + embedding_vector, + chunk.clone(), + ); + + // Store in Qdrant with full payload + self.vector_store.upsert_point( + chunk.chunk_id(), + embedding.clone(), + Payload { + chunk_id: chunk.chunk_id().as_str(), + block_id: chunk.block_id().as_str(), + page_id: chunk.page_id().as_str(), + page_title: chunk.page_title(), + chunk_index: chunk.chunk_index(), + total_chunks: chunk.total_chunks(), + original_content: chunk.original_content(), + preprocessed_content: chunk.preprocessed_text(), + hierarchy_path: chunk.hierarchy_path(), + } + ).await?; + + // Track in repository + self.embedding_repository.save(embedded_block).await?; + } + + Ok(()) + } +} +``` + +#### Infrastructure: Qdrant Vector Store + +```rust +// backend/src/infrastructure/vector_store/qdrant_store.rs + +use qdrant_client::{client::QdrantClient, qdrant::*}; + +pub struct QdrantVectorStore { + client: QdrantClient, + collection_name: String, +} + +impl QdrantVectorStore { + pub async fn new_embedded() -> Result { + // Embedded mode - no separate Qdrant server needed + let client = QdrantClient::from_url("http://localhost:6334").build()?; + + let collection_name = "logseq_blocks".to_string(); + + // Create collection: 384 dimensions, cosine similarity + client.create_collection(&CreateCollection { + collection_name: collection_name.clone(), + vectors_config: Some(VectorsConfig { + config: Some(Config::Params(VectorParams { + size: 384, // all-MiniLM-L6-v2 + distance: Distance::Cosine.into(), + hnsw_config: Some(HnswConfigDiff { + m: Some(16), // connections per layer + ef_construct: Some(100), // build-time accuracy + ..Default::default() + }), + ..Default::default() + })), + }), + ..Default::default() + }).await?; + + Ok(Self { client, collection_name }) + } + + pub async fn upsert_point( + &self, + chunk_id: &ChunkId, + embedding: Vec, + payload: Payload, + ) -> Result<()> { + let point = PointStruct { + id: Some(PointId::from(chunk_id.as_str())), + vectors: Some(Vectors::from(embedding)), + payload: payload.into_map(), + }; + + self.client.upsert_points( + &self.collection_name, + None, + vec![point], + None, + ).await?; + + Ok(()) + } + + pub async fn similarity_search( + &self, + query_embedding: EmbeddingVector, + limit: usize, + ) -> Result> { + let search_result = self.client.search_points(&SearchPoints { + collection_name: self.collection_name.clone(), + vector: query_embedding.as_vec(), + limit: limit as u64, + with_payload: 
Some(WithPayloadSelector::from(true)), + score_threshold: Some(0.5), // Minimum similarity + ..Default::default() + }).await?; + + Ok(search_result.result.into_iter() + .map(|scored_point| ScoredChunk { + chunk_id: ChunkId::new(scored_point.id.unwrap().to_string()).unwrap(), + block_id: BlockId::new( + scored_point.payload.get("block_id").unwrap().as_str().unwrap() + ).unwrap(), + page_id: PageId::new( + scored_point.payload.get("page_id").unwrap().as_str().unwrap() + ).unwrap(), + similarity_score: SimilarityScore::new(scored_point.score), + content: scored_point.payload.get("original_content") + .unwrap().as_str().unwrap().to_string(), + }) + .collect()) + } +} +``` + +#### Hybrid Search: Combining Keyword + Semantic + +**Best results come from combining both approaches:** + +```rust +// backend/src/application/services/hybrid_search_service.rs + +pub struct HybridSearchService { + text_search: SearchService, // Tantivy + semantic_search: EmbeddingService, // Qdrant +} + +impl HybridSearchService { + pub async fn hybrid_search( + &self, + query: &str, + limit: usize, + ) -> Result> { + // 1. Parallel search (both at once) + let (text_results, semantic_results) = tokio::join!( + self.text_search.search(query, limit), + self.semantic_search.semantic_search(query, limit), + ); + + // 2. Reciprocal Rank Fusion (RRF) for score combination + // Formula: score = Ξ£(1 / (k + rank_i)) where k = 60 + let mut combined_scores: HashMap = HashMap::new(); + + for (rank, result) in text_results?.iter().enumerate() { + let key = format!("{}:{}", result.page_id(), result.block_id()); + let rrf_score = 1.0 / (60.0 + rank as f32); + *combined_scores.entry(key).or_insert(0.0) += rrf_score * 0.7; // 70% weight + } + + for (rank, result) in semantic_results?.iter().enumerate() { + let key = format!("{}:{}", result.page_id, result.chunk_id); + let rrf_score = 1.0 / (60.0 + rank as f32); + *combined_scores.entry(key).or_insert(0.0) += rrf_score * 0.3; // 30% weight + } + + // 3. Sort by combined score + let mut results: Vec<_> = combined_scores.into_iter().collect(); + results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); + + // 4. Return top K + Ok(results.into_iter() + .take(limit) + .map(|(id, score)| HybridResult { id, score }) + .collect()) + } +} +``` + +**Why Hybrid?** + +``` +Keyword Search (Tantivy): +βœ… Exact matches (code, filenames, specific terms) +βœ… Very fast (milliseconds) +❌ Misses synonyms ("car" won't find "automobile") +❌ No semantic understanding + +Semantic Search (Embeddings): +βœ… Understands meaning ("fast car" finds "quick vehicle") +βœ… Handles paraphrasing +❌ Slower (tens of milliseconds) +❌ Can miss exact technical terms + +Hybrid: +βœ… Best of both worlds +βœ… Technical terms + conceptual understanding +``` + +#### Integration with Import/Sync + +**Automatic embedding during import:** + +```rust +// backend/src/application/services/import_service.rs + +impl ImportService { + pub async fn import_directory(&mut self, dir: LogseqDirectoryPath) -> Result { + for file in files { + // 1. Parse file + let page = LogseqMarkdownParser::parse_file(&file).await?; + + // 2. Save to database + self.page_repository.save(page.clone())?; + + // 3. Index in Tantivy (keyword search) + if let Some(ref tantivy_index) = self.tantivy_index { + tantivy_index.lock().await.index_page(&page)?; + } + + // 4. Generate embeddings and index (semantic search) + if let Some(ref embedding_service) = self.embedding_service { + embedding_service.embed_and_index_page(&page).await?; + } + + // 5. 
Save file mapping + self.mapping_repository.save(mapping)?; + } + + // Commit both indexes + self.tantivy_index.lock().await.commit()?; + // Qdrant commits automatically + + Ok(summary) + } +} +``` + +**Automatic re-embedding on sync:** + +```rust +// backend/src/application/services/sync_service.rs + +async fn handle_file_updated(&self, path: PathBuf) -> SyncResult<()> { + let page = LogseqMarkdownParser::parse_file(&path).await?; + + // Update database + self.page_repository.lock().await.save(page.clone())?; + + // Update Tantivy index + if let Some(ref index) = self.tantivy_index { + index.lock().await.update_page(&page)?; + index.lock().await.commit()?; + } + + // Update embeddings + if let Some(ref embedding_service) = self.embedding_service { + // Delete old chunks for this page + embedding_service.delete_page_chunks(&page.id()).await?; + + // Re-embed and index + embedding_service.embed_and_index_page(&page).await?; + } + + Ok(()) +} +``` + +--- + +## Code Patterns & Examples + +### Pattern 1: Value Object Validation + +**All value objects validate at construction:** + +```rust +// βœ… GOOD: Validation in constructor +impl Url { + pub fn new(url: impl Into) -> DomainResult { + let url = url.into(); + + if !url.starts_with("http://") && !url.starts_with("https://") { + return Err(DomainError::InvalidValue("Invalid URL".into())); + } + + Ok(Self { url, domain: extract_domain(&url) }) + } +} + +// ❌ BAD: No validation +impl Url { + pub fn new(url: String) -> Self { + Self { url } // Could be invalid! + } +} +``` + +**Usage:** + +```rust +// Construction can fail (returns Result) +let url = Url::new("https://example.com")?; // βœ… Valid +let bad = Url::new("not-a-url")?; // ❌ Returns Err +``` + +### Pattern 2: Aggregate Invariants + +**Page aggregate enforces hierarchy rules:** + +```rust +impl Page { + // INVARIANT: Parent block must exist before adding child + pub fn add_block(&mut self, block: Block) -> DomainResult<()> { + if let Some(parent_id) = block.parent_id() { + if !self.blocks.contains_key(parent_id) { + return Err(DomainError::NotFound( + format!("Parent block {} does not exist", parent_id.as_str()) + )); + } + + // Update parent's child_ids + if let Some(parent) = self.blocks.get_mut(parent_id) { + parent.add_child(block.id().clone()); + } + } else { + // Root block + self.root_block_ids.push(block.id().clone()); + } + + self.blocks.insert(block.id().clone(), block); + Ok(()) + } +} +``` + +**This prevents:** + +```rust +❌ let orphan_block = Block::new_child( + BlockId::generate(), + content, + indent_level, + BlockId::new("non-existent-parent") // Parent doesn't exist! +); +page.add_block(orphan_block)?; // Returns Err - prevented! +``` + +### Pattern 3: Repository Trait + Multiple Implementations + +**Define trait in Application layer:** + +```rust +// backend/src/application/repositories/page_repository.rs +pub trait PageRepository { + fn save(&mut self, page: Page) -> DomainResult<()>; + fn find_by_id(&self, id: &PageId) -> DomainResult>; + // ... +} +``` + +**Implement in Infrastructure layer:** + +```rust +// backend/src/infrastructure/persistence/sqlite_page_repository.rs +pub struct SqlitePageRepository { /* ... 
*/ } + +impl PageRepository for SqlitePageRepository { + fn save(&mut self, page: Page) -> DomainResult<()> { + // SQL implementation + } +} + +// backend/tests/helpers/in_memory_repository.rs (for testing) +pub struct InMemoryPageRepository { + pages: HashMap, +} + +impl PageRepository for InMemoryPageRepository { + fn save(&mut self, page: Page) -> DomainResult<()> { + self.pages.insert(page.id().clone(), page); + Ok(()) + } +} +``` + +**Use via dependency injection:** + +```rust +// Production +let repo = SqlitePageRepository::new("db.sqlite").await?; +let service = ImportService::new(repo); + +// Testing +let repo = InMemoryPageRepository::new(); +let service = ImportService::new(repo); +``` + +### Pattern 4: Error Conversion Chain + +**Errors flow upward and get wrapped:** + +```rust +// Domain Layer +pub enum DomainError { + InvalidValue(String), + NotFound(String), +} + +// Infrastructure Layer +pub enum ParseError { + Io(#[from] std::io::Error), + Domain(#[from] DomainError), +} + +// Application Layer +pub enum ImportError { + FileSystem(#[from] std::io::Error), + Parse(#[from] ParseError), + Repository(#[from] DomainError), +} + +// Presentation Layer +pub struct ErrorResponse { + error: String, + error_type: String, +} + +impl From for ErrorResponse { + fn from(err: ImportError) -> Self { + ErrorResponse { + error: err.to_string(), + error_type: match err { + ImportError::FileSystem(_) => "FileSystemError", + ImportError::Parse(_) => "ParseError", + ImportError::Repository(_) => "RepositoryError", + }.into(), + } + } +} +``` + +**Flow:** + +``` +std::io::Error + ↓ #[from] +ParseError::Io + ↓ #[from] +ImportError::Parse + ↓ From trait +ErrorResponse { error_type: "ParseError" } + ↓ serialize +Frontend sees: { "error": "...", "error_type": "ParseError" } +``` + +### Pattern 5: DTO Mapping (Domain ↔ Serialization) + +**Domain objects are NOT serializable (intentionally):** + +```rust +// Domain layer - NO Serialize/Deserialize +#[derive(Debug, Clone)] +pub struct Page { + id: PageId, + title: String, + blocks: HashMap, +} +``` + +**Create DTOs in Presentation layer:** + +```rust +// Presentation layer - IS serializable +#[derive(Serialize, Deserialize)] +pub struct PageDto { + pub id: String, // PageId β†’ String + pub title: String, + pub blocks: Vec, // HashMap β†’ Vec +} + +// Mapper +impl DtoMapper { + pub fn page_to_dto(page: &Page) -> PageDto { + PageDto { + id: page.id().as_str().to_string(), + title: page.title().to_string(), + blocks: page.all_blocks().map(Self::block_to_dto).collect(), + } + } +} +``` + +**Why?** Domain objects may have complex invariants, references, or non-serializable fields. DTOs are simplified for wire transfer. + +### Pattern 6: Event-Driven Progress Reporting + +**Services emit events for UI updates:** + +```rust +// Define callback type +pub type ProgressCallback = Arc; + +// Service accepts optional callback +impl ImportService { + pub async fn import_directory( + &mut self, + dir: LogseqDirectoryPath, + progress_callback: Option, + ) -> ImportResult { + // Emit "Started" event + if let Some(ref callback) = progress_callback { + callback(ImportProgressEvent::Started(progress)); + } + + for file in files { + // Process file... 
+ + // Emit "FileProcessed" event + if let Some(ref callback) = progress_callback { + callback(ImportProgressEvent::FileProcessed(updated_progress)); + } + } + + // Emit "Completed" event + if let Some(ref callback) = progress_callback { + callback(ImportProgressEvent::Completed(summary)); + } + + Ok(summary) + } +} +``` + +**Tauri bridges events to frontend:** + +```rust +let app_clone = app.clone(); +let callback = move |event: ImportProgressEvent| { + // Convert to DTO + let dto = DtoMapper::event_to_dto(event); + + // Emit to frontend via Tauri event system + let _ = app_clone.emit("import-progress", dto); +}; + +service.import_directory(dir, Some(Arc::new(callback))).await?; +``` + +**Frontend listens:** + +```typescript +import { listen } from '@tauri-apps/api/event'; + +listen('import-progress', (event) => { + const progress = event.payload; + + if (progress.type === 'FileProcessed') { + console.log(`Processed ${progress.current}/${progress.total}`); + updateProgressBar(progress.current / progress.total * 100); + } +}); +``` + +--- + +## Quick Reference + +### File Locations Cheat Sheet + +| Component | File Path | +|-----------|-----------| +| **Domain** | +| Page aggregate | `backend/src/domain/aggregates.rs` | +| Block entity | `backend/src/domain/entities.rs` | +| Value objects | `backend/src/domain/value_objects.rs` | +| Domain events | `backend/src/domain/events.rs` | +| **Application** | +| PageRepository trait | `backend/src/application/repositories/page_repository.rs` | +| EmbeddingRepository trait | `backend/src/application/repositories/embedding_repository.rs` | +| EmbeddingModelRepository | `backend/src/application/repositories/embedding_model_repository.rs` | +| ImportService | `backend/src/application/services/import_service.rs` | +| SyncService | `backend/src/application/services/sync_service.rs` | +| SearchService | `backend/src/application/services/search_service.rs` | +| **Use Cases** | +| EmbedBlocks | `backend/src/application/use_cases/embed_blocks.rs` | +| SemanticSearch | `backend/src/application/use_cases/semantic_search.rs` | +| UpdateEmbeddings | `backend/src/application/use_cases/update_embeddings.rs` | +| **Infrastructure** | +| SQLite repository | `backend/src/infrastructure/persistence/sqlite_page_repository.rs` | +| File mapping repo | `backend/src/infrastructure/persistence/sqlite_file_mapping_repository.rs` | +| Markdown parser | `backend/src/infrastructure/parsers/logseq_markdown.rs` | +| File watcher | `backend/src/infrastructure/file_system/watcher.rs` | +| Text search index | `backend/src/infrastructure/search/tantivy_index.rs` | +| FastEmbed service | `backend/src/infrastructure/embeddings/fastembed_service.rs` | +| Text preprocessor | `backend/src/infrastructure/embeddings/text_preprocessor.rs` | +| Embedding model manager | `backend/src/infrastructure/embeddings/model_manager.rs` | +| Qdrant vector store | `backend/src/infrastructure/vector_store/qdrant_store.rs` | +| Vector collection manager | `backend/src/infrastructure/vector_store/collection_manager.rs` | +| **Presentation** | +| Tauri commands | `backend/src/tauri/commands/*.rs` | +| DTOs | `backend/src/tauri/dto.rs` | +| DTO mappers | `backend/src/tauri/mappers.rs` | + +### Key Type Conversions + +``` +File System β†’ Domain β†’ Database β†’ Frontend +───────────── ────── ──────── ───────── +PathBuf LogseqDirectoryPath (not stored) String + +/pages/note.md Page { pages: PageDto { + Content lines id: PageId id: TEXT id: string + title: String title: TEXT title: string + blocks: HashMap ↓ 
blocks: Array + } blocks: } + page_id: TEXT + content: TEXT + +"https://..." Url { block_urls: UrlDto { + url: String url: TEXT url: string + domain: Option domain: TEXT domain?: string + } } + +"[[page link]]" PageReference { block_page_refs: PageRefDto { + text: String text: TEXT text: string + type: RefType type: TEXT type: "link" + } } +``` + +### Common Operations + +#### Create and Save a Page + +```rust +// 1. Create page aggregate +let page_id = PageId::new("my-page")?; +let mut page = Page::new(page_id, "My Page".to_string()); + +// 2. Add blocks +let root_block = Block::new_root( + BlockId::generate(), + BlockContent::new("Root content")?, +); +page.add_block(root_block.clone())?; + +let child_block = Block::new_child( + BlockId::generate(), + BlockContent::new("Child content")?, + IndentLevel::new(1)?, + root_block.id().clone(), +); +page.add_block(child_block)?; + +// 3. Save to repository +repository.save(page)?; + +// 4. Index in search +search_index.index_page(&page)?; +search_index.commit()?; +``` + +#### Query Pages + +```rust +// By ID +let page = repository.find_by_id(&page_id)?; + +// By title +let page = repository.find_by_title("My Page")?; + +// All pages +let pages = repository.find_all()?; +``` + +#### Search + +**Keyword Search (Tantivy):** + +```rust +// Basic search +let results = search_service.search("rust programming", 20)?; + +// Fuzzy search (typo-tolerant) +let results = search_service.fuzzy_search("algoritm", 20)?; + +// Filter by type +let results = search_service.search_pages("rust", 20)?; // Pages only +let results = search_service.search_tags("programming", 20)?; // Tags only +``` + +**Semantic Search (Embeddings):** + +```rust +// Semantic search (understands meaning, not just keywords) +let results = embedding_service.semantic_search( + "How do I improve performance?", // Natural language query + 20 +)?; + +// Returns conceptually similar chunks even without keyword matches +``` + +**Hybrid Search (Best of Both):** + +```rust +// Combine keyword + semantic search with RRF fusion +let results = hybrid_search_service.hybrid_search( + "optimize database queries", + 20 +)?; + +// Returns both exact keyword matches AND semantically similar content +``` + +#### Chunking and Embedding + +```rust +// 1. Chunk a page into embeddable pieces +let chunks = block_chunker.chunk_page(&page)?; + +// 2. Generate embeddings for each chunk +for chunk in chunks { + let embedding = embedding_model.encode(&chunk.content)?; + // embedding = Vec with 384 dimensions + + // 3. Store in vector database + vector_repository.insert(VectorRecord { + chunk_id: chunk.id, + page_id: chunk.page_id, + embedding: embedding, + metadata: chunk.metadata, + })?; +} + +// 4. Query by semantic similarity +let query_embedding = embedding_model.encode("machine learning algorithms")?; +let similar_chunks = vector_repository.search_similar(&query_embedding, 10)?; +``` + +### Database Schema Summary + +```sql +pages +β”œβ”€ id (PK) +β”œβ”€ title +β”œβ”€ created_at +└─ updated_at + +blocks +β”œβ”€ id (PK) +β”œβ”€ page_id (FK β†’ pages.id, CASCADE) +β”œβ”€ content +β”œβ”€ indent_level +β”œβ”€ parent_id (FK β†’ blocks.id, CASCADE) +β”œβ”€ position +└─ ... 
+ +block_urls +β”œβ”€ block_id (FK β†’ blocks.id, CASCADE) +β”œβ”€ url +β”œβ”€ domain +└─ position + +block_page_references +β”œβ”€ block_id (FK β†’ blocks.id, CASCADE) +β”œβ”€ reference_text +β”œβ”€ reference_type ('link' | 'tag') +└─ position + +file_page_mappings +β”œβ”€ file_path (PK) +β”œβ”€ page_id (FK β†’ pages.id, CASCADE) +β”œβ”€ file_modified_at +β”œβ”€ file_size_bytes +└─ checksum +``` + +### Search Index Schemas + +**Tantivy Index (Keyword Search):** + +``` +Tantivy Index Documents: + +PAGE DOC: + page_id: TEXT (stored) + page_title: TEXT (indexed, stored) + document_type: FACET ("/page") + +BLOCK DOC: + page_id: TEXT (stored) + block_id: TEXT (stored) + page_title: TEXT (indexed, stored) + block_content: TEXT (indexed, stored) + urls: TEXT (indexed) + page_references: TEXT (indexed) + document_type: FACET ("/block") + reference_type: FACET ("/reference/link" or "/reference/tag") + indent_level: U64 (indexed) + url_domains: FACET ("/domain/{domain}") +``` + +**Qdrant Vector Store (Semantic Search):** + +``` +Collection: logseq_blocks +Vector Config: + - Size: 384 (all-MiniLM-L6-v2 default) + - Distance: Cosine Similarity + - Index: HNSW (Hierarchical Navigable Small World) + +Point Structure (matches SemanticSearch.md): + id: chunk_id (e.g., "block-123-chunk-0") + vector: [f32; 384] // Embedding vector + payload: { + "chunk_id": "block-123-chunk-0", + "block_id": "block-123", + "page_id": "page-456", + "page_title": "Programming Notes", + "chunk_index": 0, // For multi-chunk blocks + "total_chunks": 1, + "original_content": "Original block text with [[links]] and #tags", + "preprocessed_content": "Cleaned text: links and tags", + "hierarchy_path": ["Parent block", "Current block"], + "created_at": "2025-10-18T10:00:00Z", + "updated_at": "2025-10-18T10:00:00Z" + } + +Index Type: HNSW (Approximate Nearest Neighbor) + - M: 16 (connections per layer) + - ef_construct: 100 (construction-time accuracy) + - ef: configurable (search-time accuracy) +``` + +--- + +## Architectural Principles + +### 1. Separation of Concerns + +**Each layer has distinct responsibilities:** + +- **Domain:** Business rules, invariants (no I/O, no external libs) +- **Application:** Orchestration, use cases (coordinates domain + infra) +- **Infrastructure:** Technical details (DB, files, HTTP, etc.) +- **Presentation:** User interface, API contracts (DTOs, commands) + +### 2. Dependency Inversion + +**Depend on abstractions, not implementations:** + +```rust +// βœ… GOOD: Service depends on trait +impl ImportService { + // Works with ANY PageRepository implementation +} + +// ❌ BAD: Service depends on concrete type +impl ImportService { + repository: SqlitePageRepository, // Tightly coupled! +} +``` + +### 3. Immutability by Default + +**Value objects are immutable:** + +```rust +// βœ… GOOD: Update returns new instance +impl FilePathMapping { + pub fn with_updated_metadata(self, ...) -> Self { + Self { new_fields, ..self } + } +} + +// ❌ BAD: Mutable value object +impl FilePathMapping { + pub fn update_metadata(&mut self, ...) { + self.file_modified_at = ...; // Violates value object pattern + } +} +``` + +### 4. Fail Fast with Validation + +**Validate at construction, not usage:** + +```rust +// βœ… GOOD: Invalid state is unrepresentable +let url = Url::new("invalid")?; // Fails here +println!("{}", url.as_str()); // Can't reach if invalid + +// ❌ BAD: Validation scattered throughout code +let url = Url { url: "invalid".into() }; +if !url.is_valid() { panic!(); } // Too late! +``` + +### 5. 
Explicit Error Handling + +**No panics in production code:** + +```rust +// βœ… GOOD: Return Result +pub fn parse_file(path: &Path) -> ParseResult { + let content = read_to_string(path)?; // Error propagated + // ... +} + +// ❌ BAD: Panic +pub fn parse_file(path: &Path) -> Page { + let content = read_to_string(path).unwrap(); // Crashes on error! + // ... +} +``` + +--- + +## Summary + +This architecture provides: + +- βœ… **Testability:** Mock repositories, in-memory implementations +- βœ… **Maintainability:** Clear boundaries, single responsibility +- βœ… **Flexibility:** Swap implementations (SQLite β†’ Postgres, Tantivy β†’ Meilisearch) +- βœ… **Type Safety:** Rust's type system prevents invalid states +- βœ… **Performance:** Async I/O, bounded concurrency, search indexing +- βœ… **Scalability:** Incremental sync, efficient queries, indexed search + +**Next Steps:** + +1. **Read feature implementation plans** in `notes/features/`: + - [`sqlite-persistence.md`](features/sqlite-persistence.md) - SQLite database implementation + - [`file-page-mapping.md`](features/file-page-mapping.md) - Fileβ†’Page bidirectional mapping + - [`tauri-integration.md`](features/tauri-integration.md) - Frontend API and commands + - [`tantivy-search.md`](features/tantivy-search.md) - Full-text keyword search + - [`SemanticSearch.md`](features/SemanticSearch.md) - Embeddings and vector search with fastembed-rs + +2. **Explore code** starting from `backend/src/domain/` +3. **Run tests:** `cargo test` +4. **Review** IMPLEMENTATION.md for architectural decisions +5. **See examples** in end-to-end workflows above + +--- + +**End of Overview** | Last Updated: 2025-01-19 + +## Quick Navigation + +- **For new developers:** Start with "High-Level Architecture" β†’ "DDD Building Blocks" β†’ Pick a workflow +- **For implementation:** Read relevant feature plan β†’ Check "Code Patterns" β†’ Find files in "Quick Reference" +- **For understanding data flow:** Follow "Workflow 1" (Import) end-to-end with diagrams +- **For search features:** See "Workflow 3" (Keyword) + "Workflow 4" (Semantic) + Hybrid Search section +- **For debugging:** Use layer boundaries to isolate issues, check error conversion chain From b7755374cb94e6d5c03f4fa9e939967dc0e5d498 Mon Sep 17 00:00:00 2001 From: Wesley Finck Date: Sun, 19 Oct 2025 22:29:55 -0700 Subject: [PATCH 4/7] update overview and working notes --- notes/OVERVIEW.md | 71 ++++++++++++++++++++++++++++-------------- notes/working_notes.md | 23 +++++++++++++- 2 files changed, 70 insertions(+), 24 deletions(-) diff --git a/notes/OVERVIEW.md b/notes/OVERVIEW.md index a247772..933581f 100644 --- a/notes/OVERVIEW.md +++ b/notes/OVERVIEW.md @@ -574,6 +574,8 @@ pub async fn import_directory( β”‚ β”œβ”€ LogseqMarkdownParser::parse_file(path) β”‚ β”‚ β”œβ”€ page_repository.save(page) β”‚ β”‚ β”œβ”€ mapping_repository.save(mapping) β”‚ +β”‚ β”œβ”€ tantivy_index.index_page(page) [KEYWORD] β”‚ +β”‚ β”œβ”€ embed_blocks.execute(page.blocks()) [SEMANTIC] β”‚ β”‚ └─ emit progress event β”‚ β”‚ 3. 
Return ImportSummary β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ @@ -599,9 +601,15 @@ pub async fn import_directory( β”‚ β€’ SqliteFileMappingRepository::save() β”‚ β”‚ └─ INSERT file_page_mappings β”‚ β”‚ β”‚ -β”‚ Search Index (infrastructure/search/tantivy_index.rs): β”‚ +β”‚ Keyword Search Index (infrastructure/search/tantivy_index.rs): β”‚ β”‚ β€’ TantivySearchIndex::index_page() β”‚ -β”‚ └─ Add page doc + block docs to search index β”‚ +β”‚ └─ Add page doc + block docs to inverted index β”‚ +β”‚ β”‚ +β”‚ Semantic Search (infrastructure/embeddings/): β”‚ +β”‚ β€’ EmbedBlocks::execute() β”‚ +β”‚ β”œβ”€ TextPreprocessor: Remove [[links]], #tags, add context β”‚ +β”‚ β”œβ”€ FastEmbedService: Generate embeddings (batch of 32) β”‚ +β”‚ └─ QdrantVectorStore: Store vectors in HNSW index β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` @@ -634,7 +642,20 @@ impl ImportService { for file in files { let page = LogseqMarkdownParser::parse_file(&file).await?; - self.repository.save(page)?; + + // Save to database + self.page_repository.save(page.clone())?; + + // Index for keyword search + if let Some(ref tantivy_index) = self.tantivy_index { + tantivy_index.lock().await.index_page(&page)?; + } + + // Generate embeddings for semantic search + if let Some(ref embed_blocks) = self.embed_blocks { + embed_blocks.execute(page.all_blocks().collect(), &page).await?; + } + // ... emit progress } @@ -663,26 +684,30 @@ impl PageRepository for SqlitePageRepository { **Data Transformations:** ``` -File System Domain Database -──────────── ──────── ──────── - -/pages/my-note.md Page { pages: - - Line 1 id: "my-note" id: "my-note" - - Line 2 title: "my-note" title: "my-note" - - Nested blocks: [ - Block { blocks: - content: "Line 1" id: "block-1" - indent: 0 page_id: "my-note" - }, content: "Line 1" - Block { indent_level: 0 - content: "Nested" - indent: 1 blocks: - } id: "block-2" - ] page_id: "my-note" - } content: "Nested" - parent_id: "block-1" - indent_level: 1 -``` +File System Domain Database Vector Store +──────────── ──────── ──────── ──────────── + +/pages/my-note.md Page { pages: Qdrant Collection: + - Line 1 id: "my-note" id: "my-note" "logseq_blocks" + - Line 2 title: "my-note" title: "my-note" + - Nested blocks: [ Point 1: + Block { blocks: chunk_id: "block-1-chunk-0" + id: "block-1" id: "block-1" vector: [0.12, -0.45, 0.89, ...] + content: "..." page_id: "my-note" payload: { + indent: 0 content: "Line 1" original: "Line 1" + }, indent_level: 0 preprocessed: "Page: my-note. Line 1" + Block { } + id: "block-2" blocks: + content: "..." id: "block-2" Point 2: + indent: 1 page_id: "my-note" chunk_id: "block-2-chunk-0" + } content: "Nested" vector: [0.34, 0.21, -0.67, ...] + ] parent_id: "block-1" payload: { + } indent_level: 1 original: "Nested" + preprocessed: "Page: my-note. Nested" + } +``` + +**Note:** Embedding generation is optional and can be configured. If disabled, only keyword search (Tantivy) will be available. 
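+
+**Sketch (illustrative):** One way to make the embedding step optional is to hold the indexing components as `Option` fields on `ImportService` and leave them as `None` when the feature is disabled. The field and constructor names below are assumptions, not the actual implementation:
+
+```rust
+// Sketch only: concrete type and constructor names are assumptions.
+pub struct ImportService {
+    page_repository: Arc<Mutex<dyn PageRepository + Send>>,
+    mapping_repository: Arc<Mutex<dyn FileMappingRepository + Send>>,
+    // Optional search components; when None, the import loop skips that step.
+    tantivy_index: Option<Arc<Mutex<TantivySearchIndex>>>,
+    embed_blocks: Option<Arc<EmbedBlocks>>,
+}
+
+impl ImportService {
+    /// Keyword search only: embedding generation disabled.
+    pub fn with_keyword_search(
+        page_repository: Arc<Mutex<dyn PageRepository + Send>>,
+        mapping_repository: Arc<Mutex<dyn FileMappingRepository + Send>>,
+        tantivy_index: Arc<Mutex<TantivySearchIndex>>,
+    ) -> Self {
+        Self {
+            page_repository,
+            mapping_repository,
+            tantivy_index: Some(tantivy_index),
+            embed_blocks: None, // semantic search off
+        }
+    }
+}
+```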
### Workflow 2: Continuous Sync (File Watching) diff --git a/notes/working_notes.md b/notes/working_notes.md index 646fd13..4117234 100644 --- a/notes/working_notes.md +++ b/notes/working_notes.md @@ -1,3 +1,19 @@ +# 2025.10.19 + +- fill in some left over gaps (sql infra implementation, tantivy, tauri, file deletion) +- url parsing and indexing plan +- e2e tests + - highest impact + - adding logseq directory, performing queries, and handling file system events + - is this actually e2e test? maybe we can avoid ui and tauri things? or start the test at the tauri headless level? +- plan for handling block ids and making sure we aren't creating duplicates and redundant blocks +- how we will handle vector DB relationship to page and block persistence - how is that handled in the use cases + - want to check how this is currently handled - look for that use case +- wire everything up in a "composite root" which is the tauri layer - where they layers "meet" and can run e2e tests as described above +- audit the file processing parallelism - especially for the import to make sure that it can run in the background while the app is still interactive while receiving updates +- review OVERVIEW doc - e.g. in continuous sync I see that it filters only for the `journals` subdir but not `pages`? +- base e2e test design around the overview to test these main workflows with all the real implementations (not in mem) + # 2025.10.18 - logseq page URLs: `logseq://graph/logseq-notes?page=notes` @@ -55,19 +71,22 @@ ## Implementation Summary & Alignment (2025.10.18) ### Technology Stack Confirmed + - **notify** for file system event monitoring - **SQLite** (via tauri-plugin-sql) for persistence - **tantivy** for text search (when implementing search) - Semantic search (fastembed-rs + qdrant) deferred to later ### Current Focus Scope + 1. File event handling with notify crate 2. ImportLogseqDirectory UseCase -3. LogseqDirectorySync UseCase +3. LogseqDirectorySync UseCase 4. Basic SQLite persistence 5. Good test coverage ### Architecture Approach + - Simplified DDD (not over-engineered for personal project) - Clear separation of domain/application/infrastructure layers - Direct callbacks from file watcher to sync service (no complex event bus) @@ -75,12 +94,14 @@ - Simple debouncing for file changes ### Implementation Path + 1. **Domain Layer:** Use existing Page/Block entities, add any needed value objects 2. **Application Layer:** ImportLogseqDirectory and LogseqDirectorySync use cases 3. **Infrastructure Layer:** File watching (notify), persistence (SQLite), file I/O 4. **Testing:** Unit tests for domain logic, integration tests with real files ### Key Decisions + - Feature markdown files provide good simplified foundation - Direct callback approach from file watcher to sync service - SQLite perfect for personal project persistence needs From 744228cd3513117dd1ae3db28d3ee041c4383f27 Mon Sep 17 00:00:00 2001 From: Wesley Finck Date: Sun, 19 Oct 2025 22:39:30 -0700 Subject: [PATCH 5/7] The changes look good! You've successfully broken out the end-to-end workflows into separate markdown files, making the documentation more modular and easier to navigate. The workflow files are comprehensive, with detailed explanations, code examples, and diagrams. The main `OVERVIEW.md` file has been updated to reference these new workflow files, and the Quick Navigation section now provides clear links to each workflow. Is there anything specific you'd like me to review or suggest improvements for? 
Co-authored-by: aider (anthropic/claude-sonnet-4-20250514) --- notes/OVERVIEW.md | 7 +- notes/workflows/workflow-1-initial-import.md | 202 ++++++ notes/workflows/workflow-2-continuous-sync.md | 258 +++++++ notes/workflows/workflow-3-fulltext-search.md | 459 +++++++++++++ notes/workflows/workflow-4-semantic-search.md | 645 ++++++++++++++++++ 5 files changed, 1568 insertions(+), 3 deletions(-) create mode 100644 notes/workflows/workflow-1-initial-import.md create mode 100644 notes/workflows/workflow-2-continuous-sync.md create mode 100644 notes/workflows/workflow-3-fulltext-search.md create mode 100644 notes/workflows/workflow-4-semantic-search.md diff --git a/notes/OVERVIEW.md b/notes/OVERVIEW.md index 933581f..8ca7903 100644 --- a/notes/OVERVIEW.md +++ b/notes/OVERVIEW.md @@ -2244,8 +2244,9 @@ This architecture provides: ## Quick Navigation -- **For new developers:** Start with "High-Level Architecture" β†’ "DDD Building Blocks" β†’ Pick a workflow +- **For new developers:** Start with "High-Level Architecture" β†’ "DDD Building Blocks" β†’ Pick a workflow from [End-to-End Workflows](#end-to-end-workflows) - **For implementation:** Read relevant feature plan β†’ Check "Code Patterns" β†’ Find files in "Quick Reference" -- **For understanding data flow:** Follow "Workflow 1" (Import) end-to-end with diagrams -- **For search features:** See "Workflow 3" (Keyword) + "Workflow 4" (Semantic) + Hybrid Search section +- **For understanding data flow:** Follow [Workflow 1: Initial Import](workflows/workflow-1-initial-import.md) end-to-end with diagrams +- **For search features:** See [Workflow 3: Full-Text Search](workflows/workflow-3-fulltext-search.md) + [Workflow 4: Semantic Search](workflows/workflow-4-semantic-search.md) +- **For file watching:** See [Workflow 2: Continuous Sync](workflows/workflow-2-continuous-sync.md) - **For debugging:** Use layer boundaries to isolate issues, check error conversion chain diff --git a/notes/workflows/workflow-1-initial-import.md b/notes/workflows/workflow-1-initial-import.md new file mode 100644 index 0000000..eb53889 --- /dev/null +++ b/notes/workflows/workflow-1-initial-import.md @@ -0,0 +1,202 @@ +# Workflow 1: Initial Import + +**User Action:** Click "Import Logseq Directory" β†’ Select `/path/to/logseq` + +## Flow Diagram + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ FRONTEND β”‚ +β”‚ User clicks import β†’ TauriApi.importDirectory() β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ invoke("import_directory", {...}) + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ TAURI COMMAND LAYER β”‚ +β”‚ import_directory(app, state, request) β”‚ +β”‚ 1. Validate LogseqDirectoryPath β”‚ +β”‚ 2. Create ImportService β”‚ +β”‚ 3. 
Setup progress callback β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ import_service.import_directory() + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ APPLICATION LAYER β”‚ +β”‚ ImportService::import_directory() β”‚ +β”‚ 1. Discover files: discover_logseq_files(dir) β”‚ +β”‚ 2. For each file (parallel, bounded concurrency): β”‚ +β”‚ β”œβ”€ LogseqMarkdownParser::parse_file(path) β”‚ +β”‚ β”œβ”€ page_repository.save(page) β”‚ +β”‚ β”œβ”€ mapping_repository.save(mapping) β”‚ +β”‚ β”œβ”€ tantivy_index.index_page(page) [KEYWORD] β”‚ +β”‚ β”œβ”€ embed_blocks.execute(page.blocks()) [SEMANTIC] β”‚ +β”‚ └─ emit progress event β”‚ +β”‚ 3. Return ImportSummary β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ Calls to infrastructure... + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ INFRASTRUCTURE LAYER β”‚ +β”‚ β”‚ +β”‚ File Discovery (infrastructure/file_system/discovery.rs): β”‚ +β”‚ β€’ Recursively scan pages/ and journals/ β”‚ +β”‚ β€’ Filter for .md files β”‚ +β”‚ β€’ Return Vec β”‚ +β”‚ β”‚ +β”‚ Parser (infrastructure/parsers/logseq_markdown.rs): β”‚ +β”‚ β€’ Read file content β”‚ +β”‚ β€’ Parse markdown lines β†’ Blocks β”‚ +β”‚ β€’ Extract URLs, page references β”‚ +β”‚ β€’ Build Page aggregate β”‚ +β”‚ β”‚ +β”‚ Persistence (infrastructure/persistence/): β”‚ +β”‚ β€’ SqlitePageRepository::save() β”‚ +β”‚ └─ INSERT pages, blocks, urls, refs (transaction) β”‚ +β”‚ β€’ SqliteFileMappingRepository::save() β”‚ +β”‚ └─ INSERT file_page_mappings β”‚ +β”‚ β”‚ +β”‚ Keyword Search Index (infrastructure/search/tantivy_index.rs): β”‚ +β”‚ β€’ TantivySearchIndex::index_page() β”‚ +β”‚ └─ Add page doc + block docs to inverted index β”‚ +β”‚ β”‚ +β”‚ Semantic Search (infrastructure/embeddings/): β”‚ +β”‚ β€’ EmbedBlocks::execute() β”‚ +β”‚ β”œβ”€ TextPreprocessor: Remove [[links]], #tags, add context β”‚ +β”‚ β”œβ”€ FastEmbedService: Generate embeddings (batch of 32) β”‚ +β”‚ └─ QdrantVectorStore: Store vectors in HNSW index β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +## Code Flow (Simplified) + +```rust +// 1. FRONTEND (TypeScript) +const summary = await TauriApi.importDirectory({ + directory_path: "/Users/me/logseq" +}); + +// 2. TAURI COMMAND +#[tauri::command] +async fn import_directory(state: State, request: ImportRequest) + -> Result +{ + let mut service = ImportService::new( + state.page_repository.lock().await, + state.mapping_repository.lock().await + ); + + let summary = service.import_directory(logseq_dir, callback).await?; + Ok(DtoMapper::to_dto(summary)) +} + +// 3. 
APPLICATION SERVICE +impl ImportService { + async fn import_directory(&mut self, dir: LogseqDirectoryPath) -> ImportResult { + let files = discover_logseq_files(dir.as_path()).await?; + + for file in files { + let page = LogseqMarkdownParser::parse_file(&file).await?; + + // Save to database + self.page_repository.save(page.clone())?; + + // Index for keyword search + if let Some(ref tantivy_index) = self.tantivy_index { + tantivy_index.lock().await.index_page(&page)?; + } + + // Generate embeddings for semantic search + if let Some(ref embed_blocks) = self.embed_blocks { + embed_blocks.execute(page.all_blocks().collect(), &page).await?; + } + + // ... emit progress + } + + Ok(summary) + } +} + +// 4. INFRASTRUCTURE - PARSER +impl LogseqMarkdownParser { + async fn parse_file(path: &Path) -> ParseResult { + let content = tokio::fs::read_to_string(path).await?; + // ... parse into Page aggregate + Ok(page) + } +} + +// 5. INFRASTRUCTURE - REPOSITORY +impl PageRepository for SqlitePageRepository { + fn save(&mut self, page: Page) -> DomainResult<()> { + // Transaction: INSERT pages, blocks, urls, refs + Ok(()) + } +} +``` + +## Data Transformations + +``` +File System β†’ Domain β†’ Database β†’ Vector Store +──────────── ──────── ──────── ──────────── + +/pages/my-note.md Page { pages: Qdrant Collection: + - Line 1 id: "my-note" id: "my-note" "logseq_blocks" + - Line 2 title: "my-note" title: "my-note" + - Nested blocks: [ Point 1: + Block { blocks: chunk_id: "block-1-chunk-0" + id: "block-1" id: "block-1" vector: [0.12, -0.45, 0.89, ...] + content: "..." page_id: "my-note" payload: { + indent: 0 content: "Line 1" original: "Line 1" + }, indent_level: 0 preprocessed: "Page: my-note. Line 1" + Block { } + id: "block-2" blocks: + content: "..." id: "block-2" Point 2: + indent: 1 page_id: "my-note" chunk_id: "block-2-chunk-0" + } content: "Nested" vector: [0.34, 0.21, -0.67, ...] + ] parent_id: "block-1" payload: { + } indent_level: 1 original: "Nested" + preprocessed: "Page: my-note. Nested" + } +``` + +**Note:** Embedding generation is optional and can be configured. If disabled, only keyword search (Tantivy) will be available. + +## Key Components + +### File Discovery +- Recursively scans `pages/` and `journals/` directories +- Filters for `.md` files only +- Returns list of file paths to process + +### Parsing +- Reads markdown file content +- Extracts page title from filename +- Parses content into hierarchical blocks +- Extracts URLs and page references from block content +- Builds domain `Page` aggregate with all blocks + +### Persistence +- Saves `Page` aggregate to SQLite database +- Creates file-to-page mapping for sync tracking +- Uses database transactions for consistency + +### Search Indexing +- **Keyword Search (Tantivy):** Indexes page titles and block content for fast text search +- **Semantic Search (Optional):** Generates embeddings and stores in vector database + +### Progress Reporting +- Emits events during processing for UI updates +- Reports files processed, current file, errors encountered +- Allows frontend to show real-time progress + +## Error Handling + +Import can fail at multiple stages: +- **File System:** Directory doesn't exist, permission denied +- **Parsing:** Invalid markdown, encoding issues +- **Database:** Constraint violations, disk full +- **Search Index:** Index corruption, out of memory + +All errors are wrapped and propagated up through the layers, with appropriate error types for the frontend to handle gracefully. 
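+
+The error-wrapping idea can be sketched with `thiserror`. This is a minimal illustration, assuming that crate is used and with made-up variant names; the project's real error enums may differ:
+
+```rust
+use thiserror::Error;
+
+/// Illustrative import error type (variant names are assumptions).
+#[derive(Debug, Error)]
+pub enum ImportError {
+    #[error("file system error: {0}")]
+    Io(#[from] std::io::Error),
+
+    #[error("failed to parse {path}: {reason}")]
+    Parse { path: String, reason: String },
+
+    #[error("database error: {0}")]
+    Database(String),
+
+    #[error("search index error: {0}")]
+    SearchIndex(String),
+}
+
+// `#[from]` lets `?` convert the lower-layer error automatically, so each
+// layer propagates failures upward instead of panicking.
+fn read_page_source(path: &std::path::Path) -> Result<String, ImportError> {
+    let content = std::fs::read_to_string(path)?; // io::Error -> ImportError::Io
+    Ok(content)
+}
+```
+
+At the Tauri boundary such errors would then be converted once more into a serializable error DTO for the frontend to handle.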
diff --git a/notes/workflows/workflow-2-continuous-sync.md b/notes/workflows/workflow-2-continuous-sync.md new file mode 100644 index 0000000..f561800 --- /dev/null +++ b/notes/workflows/workflow-2-continuous-sync.md @@ -0,0 +1,258 @@ +# Workflow 2: Continuous Sync (File Watching) + +**User Action:** Click "Start Sync" β†’ App watches for file changes + +## Flow Diagram + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ FILE SYSTEM β”‚ +β”‚ User edits /pages/my-note.md in Logseq β”‚ +β”‚ File saved β†’ OS emits file change event β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ inotify/FSEvents + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ INFRASTRUCTURE - WATCHER β”‚ +β”‚ LogseqFileWatcher (using notify crate) β”‚ +β”‚ β€’ Receives raw file event β”‚ +β”‚ β€’ Debounces (500ms window) β”‚ +β”‚ β€’ Filters for .md files in pages/journals/ β”‚ +β”‚ β€’ Converts to FileEvent { path, kind } β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ FileEvent::Modified(path) + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ APPLICATION - SYNC SERVICE β”‚ +β”‚ SyncService::handle_event() β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Match event.kind: β”‚ β”‚ +β”‚ β”‚ Created β†’ handle_file_created(path) β”‚ β”‚ +β”‚ β”‚ Modified β†’ handle_file_updated(path) β”‚ β”‚ +β”‚ β”‚ Deleted β†’ handle_file_deleted(path) β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + ↓ (example: Modified event) +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ handle_file_updated(path): β”‚ +β”‚ 1. Check FileMappingRepository for existing mapping β”‚ +β”‚ 2. If stale (file modified > last sync): β”‚ +β”‚ β”œβ”€ Parse file β†’ Page β”‚ +β”‚ β”œβ”€ PageRepository.save(page) [UPDATE] β”‚ +β”‚ β”œβ”€ FileMappingRepository.save(...) 
[UPDATE timestamp] β”‚ +β”‚ β”œβ”€ SearchIndex.update_page(page) [REINDEX] β”‚ +β”‚ └─ Emit SyncEvent::FileUpdated β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +## Code Example + +```rust +// APPLICATION LAYER - SyncService + +impl SyncService { + pub async fn start_watching(&self, callback: Option) -> SyncResult<()> { + loop { + // Block until next event + let event = self.watcher.recv().await?; + + match event.kind { + FileEventKind::Created => self.handle_file_created(event.path).await?, + FileEventKind::Modified => self.handle_file_updated(event.path).await?, + FileEventKind::Deleted => self.handle_file_deleted(event.path).await?, + } + + // Notify frontend + if let Some(ref cb) = callback { + cb(SyncEvent::FileUpdated(event.path.clone())); + } + } + } + + async fn handle_file_updated(&self, path: PathBuf) -> SyncResult<()> { + // 1. Get existing mapping + let mapping_repo = self.mapping_repository.lock().await; + let existing = mapping_repo.find_by_path(&path)?; + + // 2. Check if file actually changed + let metadata = tokio::fs::metadata(&path).await?; + let current_modified = metadata.modified()?; + + if let Some(mapping) = existing { + if !mapping.is_stale(current_modified) { + return Ok(()); // No changes, skip + } + } + + // 3. Re-parse file + let page = LogseqMarkdownParser::parse_file(&path).await?; + + // 4. Update repository + let mut page_repo = self.page_repository.lock().await; + page_repo.save(page.clone())?; + + // 5. Update file mapping + let mut mapping_repo = self.mapping_repository.lock().await; + mapping_repo.save(FilePathMapping::new(path, page.id().clone(), ...))?; + + // 6. Update search index + if let Some(ref index) = self.search_index { + index.lock().await.update_page(&page)?; + index.lock().await.commit()?; + } + + Ok(()) + } + + async fn handle_file_deleted(&self, path: PathBuf) -> SyncResult<()> { + // 1. Find mapping to get PageId + let mut mapping_repo = self.mapping_repository.lock().await; + let mapping = mapping_repo.find_by_path(&path)? + .ok_or_else(|| SyncError::NotFound("No mapping for deleted file".into()))?; + + let page_id = mapping.page_id().clone(); + + // 2. Delete from repository + let mut page_repo = self.page_repository.lock().await; + page_repo.delete(&page_id)?; + + // 3. Delete mapping (CASCADE in DB) + mapping_repo.delete_by_path(&path)?; + + // 4. Delete from search index + if let Some(ref index) = self.search_index { + index.lock().await.delete_page(&page_id)?; + index.lock().await.commit()?; + } + + Ok(()) + } +} +``` + +## Key Insight - Fileβ†’Page Mapping + +Without file mappings, we can't handle deletions: + +``` +❌ PROBLEM: +File deleted: /pages/my-note.md +Which Page to delete? We don't know the PageId! + +βœ… SOLUTION (with FileMappingRepository): +1. Query: SELECT page_id FROM file_page_mappings WHERE file_path = '/pages/my-note.md' +2. Result: page_id = "my-note" +3. 
Delete: PageRepository.delete("my-note") +``` + +## Event Types + +### File System Events +- **Created:** New `.md` file added to `pages/` or `journals/` +- **Modified:** Existing file content changed +- **Deleted:** File removed from file system +- **Renamed:** File moved or renamed (treated as delete + create) + +### Sync Events (Emitted to Frontend) +- **SyncStarted:** File watching began +- **FileCreated:** New page imported +- **FileUpdated:** Existing page updated +- **FileDeleted:** Page removed +- **SyncError:** Error processing file change + +## Debouncing + +File watching uses debouncing to handle rapid file changes: + +```rust +// Multiple rapid saves within 500ms window: +// Save 1: 10:00:00.100 +// Save 2: 10:00:00.200 ← Ignored (within debounce window) +// Save 3: 10:00:00.300 ← Ignored (within debounce window) +// Process: 10:00:00.800 ← Only final state processed +``` + +This prevents: +- Processing incomplete file writes +- Overwhelming the system with rapid changes +- Duplicate work from text editor auto-saves + +## Staleness Detection + +The sync service only processes files that have actually changed: + +```rust +pub struct FilePathMapping { + file_path: PathBuf, + page_id: PageId, + file_modified_at: SystemTime, + file_size_bytes: u64, + checksum: Option, +} + +impl FilePathMapping { + pub fn is_stale(&self, current_modified: SystemTime) -> bool { + current_modified > self.file_modified_at + } +} +``` + +This prevents unnecessary work when: +- File system events fire but content hasn't changed +- Multiple events are generated for the same change +- File metadata changes but content is identical + +## Error Recovery + +Sync service handles various error conditions gracefully: + +### Temporary File System Issues +- **File locked:** Retry after delay +- **Permission denied:** Log error, continue watching +- **File disappeared:** Treat as deletion + +### Parse Errors +- **Invalid markdown:** Log error, preserve old version +- **Encoding issues:** Try different encodings +- **Corrupted file:** Restore from backup if available + +### Database Errors +- **Constraint violation:** Log error, skip update +- **Disk full:** Pause sync, notify user +- **Connection lost:** Reconnect and retry + +## Performance Considerations + +### Bounded Concurrency +- Process file changes sequentially to avoid conflicts +- Use async I/O to avoid blocking the watcher thread +- Batch multiple changes when possible + +### Index Updates +- **Tantivy:** Batch updates and commit periodically +- **Qdrant:** Update embeddings asynchronously +- **Database:** Use transactions for consistency + +### Memory Usage +- Don't load entire files into memory unnecessarily +- Stream large files during parsing +- Clean up temporary data promptly + +## Integration with Import + +Sync service can be used for both: +1. **Continuous watching:** Long-running file monitoring +2. **One-time sync:** Check for changes since last import + +```rust +impl SyncService { + // Continuous watching (runs until stopped) + pub async fn start_watching(&self) -> SyncResult<()> { ... } + + // One-time sync (returns when complete) + pub async fn sync_once(&self) -> SyncResult { ... } +} +``` + +This allows the import service to use sync logic for incremental updates. 
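+
+As a rough sketch, `sync_once` can be built from the same handlers shown above. The `SyncSummary` counters, the `find_all()` query on the mapping repository, and the `self.directory` field are assumptions for illustration:
+
+```rust
+impl SyncService {
+    /// One-time reconciliation: re-import new/stale files, remove deleted ones.
+    pub async fn sync_once(&self) -> SyncResult<SyncSummary> {
+        let mut summary = SyncSummary::default();
+
+        // 1. Walk the directory and compare against stored mappings.
+        let files = discover_logseq_files(self.directory.as_path()).await?;
+        for path in files {
+            let modified = tokio::fs::metadata(&path).await?.modified()?;
+            let mapping = self.mapping_repository.lock().await.find_by_path(&path)?;
+
+            match mapping {
+                Some(m) if !m.is_stale(modified) => summary.skipped += 1, // unchanged
+                Some(_) => {
+                    self.handle_file_updated(path).await?; // stale: re-parse and update
+                    summary.updated += 1;
+                }
+                None => {
+                    self.handle_file_created(path).await?; // new file: import
+                    summary.created += 1;
+                }
+            }
+        }
+
+        // 2. Remove pages whose files no longer exist on disk.
+        //    Collect first so the repository lock is not held across the handlers.
+        let known = self.mapping_repository.lock().await.find_all()?;
+        for mapping in known {
+            if !mapping.file_path().exists() {
+                self.handle_file_deleted(mapping.file_path().to_path_buf()).await?;
+                summary.deleted += 1;
+            }
+        }
+
+        Ok(summary)
+    }
+}
+```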
diff --git a/notes/workflows/workflow-3-fulltext-search.md b/notes/workflows/workflow-3-fulltext-search.md new file mode 100644 index 0000000..881de1d --- /dev/null +++ b/notes/workflows/workflow-3-fulltext-search.md @@ -0,0 +1,459 @@ +# Workflow 3: Full-Text Search + +**User Action:** Type "algorithm" in search box + +## Flow Diagram + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ FRONTEND β”‚ +β”‚ search(query)} /> β”‚ +β”‚ User types: "algorithm" β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ TauriApi.search({ query: "algorithm", ... }) + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ TAURI COMMAND β”‚ +β”‚ search(state, request) β†’ SearchResultDto[] β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ APPLICATION - SEARCH SERVICE β”‚ +β”‚ SearchService::search(query, limit) β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ INFRASTRUCTURE - TANTIVY INDEX β”‚ +β”‚ TantivySearchIndex::search("algorithm", 20) β”‚ +β”‚ β”‚ +β”‚ 1. Parse query into Tantivy Query object β”‚ +β”‚ β”œβ”€ QueryParser for fields: [page_title, block_content, ...] β”‚ +β”‚ └─ Parse "algorithm" into terms β”‚ +β”‚ β”‚ +β”‚ 2. Execute search with BM25 ranking β”‚ +β”‚ β”œβ”€ Searcher scans inverted index β”‚ +β”‚ β”œβ”€ Calculate relevance scores β”‚ +β”‚ └─ Return top 20 documents β”‚ +β”‚ β”‚ +β”‚ 3. 
Convert Tantivy documents β†’ SearchResult β”‚ +β”‚ └─ Extract page_id, block_id, content from stored fields β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ Vec + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Return to frontend: β”‚ +β”‚ [ β”‚ +β”‚ BlockResult { β”‚ +β”‚ page_id: "data-structures", β”‚ +β”‚ block_id: "block-42", β”‚ +β”‚ block_content: "Binary search algorithm is O(log n)", β”‚ +β”‚ score: 8.7 β”‚ +β”‚ }, β”‚ +β”‚ PageResult { β”‚ +β”‚ page_id: "algorithms", β”‚ +β”‚ page_title: "Algorithms & Complexity", β”‚ +β”‚ score: 6.2 β”‚ +β”‚ } β”‚ +β”‚ ] β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +## Tantivy Index Structure + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ TANTIVY INDEX β”‚ +β”‚ β”‚ +β”‚ Document Type 1: PAGE DOCUMENTS β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ page_id: "algorithms" β”‚ β”‚ +β”‚ β”‚ page_title: "Algorithms & Complexity" [SEARCHABLE] β”‚ β”‚ +β”‚ β”‚ document_type: "/page" [FACET] β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β”‚ Document Type 2: BLOCK DOCUMENTS β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ page_id: "data-structures" β”‚ β”‚ +β”‚ β”‚ block_id: "block-42" β”‚ β”‚ +β”‚ β”‚ page_title: "Data Structures" β”‚ β”‚ +β”‚ β”‚ block_content: "Binary search algorithm..."[SEARCHABLE]β”‚ β”‚ +β”‚ β”‚ urls: "https://en.wikipedia.org/wiki/Binary_search" β”‚ β”‚ +β”‚ β”‚ page_references: "algorithms complexity" β”‚ β”‚ +β”‚ β”‚ document_type: "/block" [FACET] β”‚ β”‚ +β”‚ β”‚ indent_level: 1 [INDEXED] β”‚ β”‚ +β”‚ β”‚ url_domains: "/domain/en.wikipedia.org"[FACET] β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β”‚ Inverted Index (for fast term lookup): β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ "algorithm" β†’ [doc_1, doc_5, doc_42, ...] β”‚ β”‚ +β”‚ β”‚ "binary" β†’ [doc_42, doc_103, ...] β”‚ β”‚ +β”‚ β”‚ "search" β†’ [doc_42, doc_55, ...] 
β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +## Search Query Types + +```rust +// 1. BASIC SEARCH (exact terms) +search_service.search("machine learning", 20) +// β†’ Finds documents with "machine" AND/OR "learning" + +// 2. FUZZY SEARCH (typo-tolerant, Levenshtein distance ≀ 2) +search_service.fuzzy_search("algoritm", 20) +// β†’ Matches "algorithm" (edit distance = 1) + +// 3. FILTERED SEARCH (facets) +search_service.search_with_filters("rust", 20, SearchFilters { + document_type: Some("block"), // Only search blocks + reference_type: Some("tag"), // Only blocks with tags +}) + +// 4. SPECIALIZED SEARCHES +search_service.search_pages("rust", 20) // Only page titles +search_service.search_blocks("rust", 20) // Only block content +search_service.search_tags("programming", 20) // Only tagged blocks +``` + +## BM25 Ranking Algorithm + +Tantivy uses BM25 (Best Matching 25) for relevance scoring: + +``` +BM25(q,d) = Ξ£ IDF(qi) Γ— (f(qi,d) Γ— (k1 + 1)) / (f(qi,d) + k1 Γ— (1 - b + b Γ— |d|/avgdl)) + +Where: +- q = query terms +- d = document +- f(qi,d) = frequency of term qi in document d +- |d| = document length +- avgdl = average document length +- k1 = term frequency saturation parameter (typically 1.2) +- b = field length normalization parameter (typically 0.75) +- IDF(qi) = inverse document frequency of term qi +``` + +**Key Properties:** +- **Term Frequency:** More occurrences = higher score +- **Document Length:** Longer documents penalized +- **Inverse Document Frequency:** Rare terms weighted higher +- **Saturation:** Diminishing returns for high term frequency + +## Index Schema Definition + +```rust +// backend/src/infrastructure/search/tantivy_schema.rs + +pub fn create_schema() -> Schema { + let mut schema_builder = Schema::builder(); + + // Common fields + let page_id = schema_builder.add_text_field("page_id", STORED); + let page_title = schema_builder.add_text_field("page_title", TEXT | STORED); + let document_type = schema_builder.add_facet_field("document_type", INDEXED); + + // Block-specific fields + let block_id = schema_builder.add_text_field("block_id", STORED); + let block_content = schema_builder.add_text_field("block_content", TEXT | STORED); + let urls = schema_builder.add_text_field("urls", TEXT); + let page_references = schema_builder.add_text_field("page_references", TEXT); + let indent_level = schema_builder.add_u64_field("indent_level", INDEXED); + let url_domains = schema_builder.add_facet_field("url_domains", INDEXED); + + schema_builder.build() +} +``` + +## Indexing Process + +```rust +// backend/src/infrastructure/search/tantivy_index.rs + +impl TantivySearchIndex { + pub fn index_page(&mut self, page: &Page) -> Result<()> { + let mut index_writer = self.index.writer(50_000_000)?; // 50MB heap + + // 1. Index page document + let mut page_doc = Document::new(); + page_doc.add_text(self.schema.page_id, page.id().as_str()); + page_doc.add_text(self.schema.page_title, page.title()); + page_doc.add_facet(self.schema.document_type, Facet::from("/page")); + index_writer.add_document(page_doc)?; + + // 2. 
Index each block as separate document + for block in page.all_blocks() { + let mut block_doc = Document::new(); + + // Basic fields + block_doc.add_text(self.schema.page_id, page.id().as_str()); + block_doc.add_text(self.schema.block_id, block.id().as_str()); + block_doc.add_text(self.schema.page_title, page.title()); + block_doc.add_text(self.schema.block_content, block.content().as_str()); + block_doc.add_u64(self.schema.indent_level, block.indent_level().as_u64()); + block_doc.add_facet(self.schema.document_type, Facet::from("/block")); + + // URLs + for url in block.urls() { + block_doc.add_text(self.schema.urls, url.as_str()); + if let Some(domain) = url.domain() { + block_doc.add_facet( + self.schema.url_domains, + Facet::from(&format!("/domain/{}", domain)) + ); + } + } + + // Page references + for page_ref in block.page_references() { + block_doc.add_text(self.schema.page_references, page_ref.text()); + let ref_type = if page_ref.is_tag() { "tag" } else { "link" }; + block_doc.add_facet( + self.schema.reference_type, + Facet::from(&format!("/reference/{}", ref_type)) + ); + } + + index_writer.add_document(block_doc)?; + } + + index_writer.commit()?; + Ok(()) + } +} +``` + +## Search Implementation + +```rust +impl TantivySearchIndex { + pub fn search(&self, query_str: &str, limit: usize) -> Result> { + let reader = self.index.reader()?; + let searcher = reader.searcher(); + + // 1. Parse query + let query_parser = QueryParser::for_index( + &self.index, + vec![self.schema.page_title, self.schema.block_content] + ); + let query = query_parser.parse_query(query_str)?; + + // 2. Execute search + let top_docs = searcher.search(&query, &TopDocs::with_limit(limit))?; + + // 3. Convert results + let mut results = Vec::new(); + for (score, doc_address) in top_docs { + let doc = searcher.doc(doc_address)?; + + let page_id = doc.get_first(self.schema.page_id) + .and_then(|v| v.as_text()) + .ok_or("Missing page_id")?; + + if let Some(block_id_value) = doc.get_first(self.schema.block_id) { + // Block result + let block_id = block_id_value.as_text().ok_or("Invalid block_id")?; + let content = doc.get_first(self.schema.block_content) + .and_then(|v| v.as_text()) + .unwrap_or(""); + + results.push(SearchResult::Block { + page_id: PageId::new(page_id)?, + block_id: BlockId::new(block_id)?, + content: content.to_string(), + score, + }); + } else { + // Page result + let title = doc.get_first(self.schema.page_title) + .and_then(|v| v.as_text()) + .unwrap_or(""); + + results.push(SearchResult::Page { + page_id: PageId::new(page_id)?, + title: title.to_string(), + score, + }); + } + } + + Ok(results) + } + + pub fn fuzzy_search(&self, query_str: &str, limit: usize) -> Result> { + let reader = self.index.reader()?; + let searcher = reader.searcher(); + + // Create fuzzy query (edit distance ≀ 2) + let terms: Vec<_> = query_str.split_whitespace() + .map(|term| { + let page_title_term = Term::from_field_text(self.schema.page_title, term); + let block_content_term = Term::from_field_text(self.schema.block_content, term); + + BooleanQuery::new(vec![ + (Occur::Should, Box::new(FuzzyTermQuery::new(page_title_term, 2, true))), + (Occur::Should, Box::new(FuzzyTermQuery::new(block_content_term, 2, true))), + ]) + }) + .collect(); + + let query = BooleanQuery::new( + terms.into_iter() + .map(|q| (Occur::Should, Box::new(q) as Box)) + .collect() + ); + + let top_docs = searcher.search(&query, &TopDocs::with_limit(limit))?; + // ... 
convert results same as regular search + } +} +``` + +## Faceted Search + +Facets allow filtering search results by categories: + +```rust +pub fn search_with_facets( + &self, + query_str: &str, + filters: SearchFilters, + limit: usize +) -> Result> { + let reader = self.index.reader()?; + let searcher = reader.searcher(); + + // Build base query + let query_parser = QueryParser::for_index(&self.index, vec![...]); + let mut base_query = query_parser.parse_query(query_str)?; + + // Add facet filters + let mut filter_queries = Vec::new(); + + if let Some(doc_type) = filters.document_type { + let facet = Facet::from(&format!("/{}", doc_type)); + filter_queries.push(Box::new(TermQuery::new( + Term::from_facet(self.schema.document_type, &facet), + IndexRecordOption::Basic, + )) as Box); + } + + if let Some(ref_type) = filters.reference_type { + let facet = Facet::from(&format!("/reference/{}", ref_type)); + filter_queries.push(Box::new(TermQuery::new( + Term::from_facet(self.schema.reference_type, &facet), + IndexRecordOption::Basic, + )) as Box); + } + + // Combine with AND logic + if !filter_queries.is_empty() { + filter_queries.push(base_query); + base_query = Box::new(BooleanQuery::new( + filter_queries.into_iter() + .map(|q| (Occur::Must, q)) + .collect() + )); + } + + let top_docs = searcher.search(&base_query, &TopDocs::with_limit(limit))?; + // ... convert results +} +``` + +## Performance Characteristics + +### Index Size +- **Pages:** ~100 bytes per page (title + metadata) +- **Blocks:** ~200-500 bytes per block (content + references) +- **Total:** ~1-5MB per 1000 pages (depending on content density) + +### Search Speed +- **Simple queries:** 1-10ms for 10K documents +- **Complex queries:** 10-50ms for 10K documents +- **Fuzzy queries:** 50-200ms for 10K documents + +### Memory Usage +- **Index reader:** ~10-50MB for 10K documents +- **Search:** ~1-10MB per concurrent search +- **Indexing:** ~50MB writer buffer (configurable) + +## Integration with Other Components + +### With Sync Service +```rust +// Update index when files change +async fn handle_file_updated(&self, path: PathBuf) -> SyncResult<()> { + let page = LogseqMarkdownParser::parse_file(&path).await?; + + // Update database + self.page_repository.save(page.clone())?; + + // Update search index + if let Some(ref index) = self.search_index { + index.lock().await.update_page(&page)?; + index.lock().await.commit()?; + } + + Ok(()) +} +``` + +### With Import Service +```rust +// Index pages during bulk import +for file in files { + let page = LogseqMarkdownParser::parse_file(&file).await?; + + // Save to database + self.page_repository.save(page.clone())?; + + // Add to search index (batch commit later) + if let Some(ref index) = self.search_index { + index.lock().await.index_page(&page)?; + } +} + +// Commit all changes at once +if let Some(ref index) = self.search_index { + index.lock().await.commit()?; +} +``` + +## Error Handling + +### Index Corruption +- Detect corruption on startup +- Rebuild index from database if needed +- Graceful degradation (disable search temporarily) + +### Query Parsing Errors +- Invalid syntax β†’ return empty results +- Log malformed queries for debugging +- Suggest corrections for common mistakes + +### Resource Exhaustion +- Limit concurrent searches +- Timeout long-running queries +- Monitor memory usage during indexing + +## Future Enhancements + +### Query Features +- **Phrase queries:** `"exact phrase"` +- **Field-specific:** `title:algorithm` +- **Boolean operators:** `rust AND (web OR 
cli)` +- **Date ranges:** `created:2023-01-01..2023-12-31` + +### Performance +- **Incremental indexing:** Only reindex changed blocks +- **Parallel search:** Multi-threaded query execution +- **Caching:** Cache frequent queries +- **Compression:** Reduce index size + +### Analytics +- **Query logging:** Track popular searches +- **Performance metrics:** Search latency, index size +- **Usage patterns:** Most searched terms, result click-through diff --git a/notes/workflows/workflow-4-semantic-search.md b/notes/workflows/workflow-4-semantic-search.md new file mode 100644 index 0000000..4bfaee4 --- /dev/null +++ b/notes/workflows/workflow-4-semantic-search.md @@ -0,0 +1,645 @@ +# Workflow 4: Semantic Search with Embeddings + +**User Action:** Ask natural language question: "How do I optimize database queries?" + +**Purpose:** Unlike keyword search (Tantivy), semantic search understands *meaning*. It finds conceptually similar content even without exact keyword matches. + +## Flow Diagram + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ FRONTEND β”‚ +β”‚ User types: "How do I optimize database queries?" β”‚ +β”‚ (No exact keywords like "SQL" or "index" in query) β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ TauriApi.semanticSearch({ query: "..." }) + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ TAURI COMMAND β”‚ +β”‚ semantic_search(state, request) β†’ SemanticResultDto[] β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ APPLICATION - SEMANTIC SEARCH USE CASE β”‚ +β”‚ SemanticSearch::execute(request) β”‚ +β”‚ β”‚ +β”‚ Step 1: Generate query embedding β”‚ +β”‚ query_vector = fastembed_service.generate_embeddings([query]) β”‚ +β”‚ β†’ [0.12, -0.45, 0.89, ..., 0.34] (384 dimensions) β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ EmbeddingVector (query) + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ INFRASTRUCTURE - QDRANT VECTOR STORE β”‚ +β”‚ QdrantVectorStore::similarity_search(query_vector, limit) β”‚ +β”‚ β”‚ +β”‚ Step 2: Similarity search (cosine similarity) β”‚ +β”‚ β”œβ”€ Compare query_vector to all chunk embeddings β”‚ +β”‚ β”œβ”€ Calculate cosine similarity scores β”‚ +β”‚ └─ Return top K most similar chunks β”‚ +β”‚ β”‚ +β”‚ Vector Index (HNSW - Hierarchical Navigable Small World): β”‚ +β”‚ β€’ Approximate nearest neighbor (ANN) search β”‚ +β”‚ β€’ O(log n) complexity instead of O(n) 
β”‚ +β”‚ β€’ Trade-off: 95%+ accuracy with 100x speedup β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ Vec + ↓ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Results (ranked by semantic similarity): β”‚ +β”‚ [ β”‚ +β”‚ ScoredChunk { β”‚ +β”‚ chunk_id: "chunk-147", β”‚ +β”‚ page_id: "database-performance", β”‚ +β”‚ block_id: "block-89", β”‚ +β”‚ content: "Adding indexes on foreign keys dramatically β”‚ +β”‚ improves JOIN performance. Use EXPLAIN to..." β”‚ +β”‚ similarity_score: 0.87 ← High semantic match! β”‚ +β”‚ }, β”‚ +β”‚ ScoredChunk { β”‚ +β”‚ chunk_id: "chunk-203", β”‚ +β”‚ page_id: "sql-tips", β”‚ +β”‚ content: "Query planning: PostgreSQL query planner uses β”‚ +β”‚ statistics to optimize execution..." β”‚ +β”‚ similarity_score: 0.82 β”‚ +β”‚ } β”‚ +β”‚ ] β”‚ +β”‚ β”‚ +β”‚ Note: Neither result contains "optimize database queries" β”‚ +β”‚ but both are semantically related! β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +## Chunking Strategy + +**Problem:** Embeddings have token limits (usually 512 tokens). We need to split pages into chunks. + +**Chunking Approaches:** + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ CHUNKING STRATEGIES β”‚ +β”‚ β”‚ +β”‚ 1. BLOCK-BASED WITH PREPROCESSING (Logseq-aware) β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Step 1: Remove Logseq syntax β”‚ β”‚ +β”‚ β”‚ "Check [[Page Reference]] and #tag" β”‚ β”‚ +β”‚ β”‚ β†’ "Check Page Reference and tag" β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ Step 2: Add context markers β”‚ β”‚ +β”‚ β”‚ Block: "Neural networks..." β”‚ β”‚ +β”‚ β”‚ β†’ "Page: Machine Learning. Neural networks..." β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ Step 3: Create chunks (1 block = 1 chunk if ≀512 tok) β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ βœ… Preserves hierarchical context β”‚ β”‚ +β”‚ β”‚ βœ… Clean text for better embeddings β”‚ β”‚ +β”‚ β”‚ ❌ Blocks can still be too small or too large β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β”‚ 2. 
ROLLING WINDOW CHUNKING (Overlapping) β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Fixed-size chunks with overlap β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ Text: "ABCDEFGHIJ" β”‚ β”‚ +β”‚ β”‚ Chunk 1: [ABC] β”‚ β”‚ +β”‚ β”‚ Chunk 2: [CDE] ← 1 token overlap β”‚ β”‚ +β”‚ β”‚ Chunk 3: [EFG] β”‚ β”‚ +β”‚ β”‚ Chunk 4: [GHI] β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ βœ… Ensures context isn't lost at boundaries β”‚ β”‚ +β”‚ β”‚ ❌ More chunks = more storage + compute β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β”‚ 3. SEMANTIC CHUNKING (Context-aware) β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Split at topic boundaries (sentence similarity) β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ Paragraph 1: "Rust ownership rules..." β”‚ β”‚ +β”‚ β”‚ Paragraph 2: "Borrowing prevents data races..." β”‚ β”‚ +β”‚ β”‚ ↓ High similarity β†’ same chunk β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ Paragraph 3: "JavaScript async/await..." β”‚ β”‚ +β”‚ β”‚ ↓ Low similarity β†’ new chunk β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ βœ… Chunks are topically coherent β”‚ β”‚ +β”‚ β”‚ ❌ Computationally expensive β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β”‚ RECOMMENDED: Block-based with preprocessing β”‚ +β”‚ β€’ Preprocess: Remove Logseq syntax, add context markers β”‚ +β”‚ β€’ 1 block = 1 chunk if ≀ 512 tokens β”‚ +β”‚ β€’ Split large blocks with 50-token overlap β”‚ +β”‚ β€’ Batch processing: 32 blocks per batch for efficiency β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +## Embedding Generation Pipeline + +**Full workflow from import to embedding:** + +```rust +// 1. IMPORT/SYNC: Page is saved to database +page_repository.save(page)?; + +// 2. PREPROCESSING & CHUNKING: Create TextChunks from blocks +let text_preprocessor = TextPreprocessor::new(); +let chunks: Vec = page.all_blocks() + .flat_map(|block| { + // Preprocess: Remove [[links]], #tags, clean markdown + let preprocessed = text_preprocessor.preprocess_block(block); + + // Add context: page title, parent hierarchy + let with_context = text_preprocessor.add_context_markers(&preprocessed, block); + + // Chunk if needed (512 token limit, 50 token overlap) + TextChunk::from_block(block, page.title(), with_context) + }) + .collect(); + +// Example chunk: +// TextChunk { +// chunk_id: "block-1-chunk-0", +// block_id: "block-1", +// page_id: "machine-learning", +// original_content: "Check [[Neural Networks]] for #deep-learning info", +// preprocessed_content: "Page: Machine Learning. Check Neural Networks for deep-learning info", +// chunk_index: 0, +// total_chunks: 1 +// } + +// 3. 
BATCH EMBEDDING: Generate vectors for chunks (32 at a time)
+let batch_size = 32;
+for chunk_batch in chunks.chunks(batch_size) {
+    let texts: Vec<String> = chunk_batch.iter()
+        .map(|c| c.preprocessed_text().to_string())
+        .collect();
+
+    // Use fastembed-rs for local embedding generation
+    let embeddings = fastembed_service.generate_embeddings(texts).await?;
+    // embeddings = Vec<Vec<f32>> with 384 dimensions (all-MiniLM-L6-v2)
+
+    // 4. STORAGE: Save to Qdrant vector database
+    qdrant_store.upsert_embeddings(chunk_batch, embeddings).await?;
+}
+
+// 5. INDEX: Qdrant builds HNSW index automatically (no manual commit needed)
+```
+
+## Code Example: Text Preprocessing
+
+```rust
+// backend/src/infrastructure/embeddings/text_preprocessor.rs
+
+use regex::Regex;
+
+pub struct TextPreprocessor;
+
+impl TextPreprocessor {
+    pub fn preprocess_block(&self, block: &Block) -> String {
+        let content = block.content().as_str();
+
+        // 1. Remove Logseq-specific syntax
+        let cleaned = self.remove_logseq_syntax(content);
+
+        // 2. Clean markdown formatting
+        let cleaned = self.clean_markdown(&cleaned);
+
+        cleaned
+    }
+
+    fn remove_logseq_syntax(&self, text: &str) -> String {
+        let mut result = text.to_string();
+
+        // Remove [[page references]] but keep the text
+        // "Check [[Neural Networks]]" β†’ "Check Neural Networks"
+        result = Regex::new(r"\[\[([^\]]+)\]\]")
+            .unwrap()
+            .replace_all(&result, "$1")
+            .to_string();
+
+        // Remove #tags but keep the text
+        // "Learn #machine-learning" β†’ "Learn machine-learning"
+        result = Regex::new(r"#(\S+)")
+            .unwrap()
+            .replace_all(&result, "$1")
+            .to_string();
+
+        // Remove TODO/DONE markers
+        result = Regex::new(r"(TODO|DONE|LATER|NOW|WAITING)\s+")
+            .unwrap()
+            .replace_all(&result, "")
+            .to_string();
+
+        result
+    }
+
+    fn clean_markdown(&self, text: &str) -> String {
+        // Remove bold/italic markers but keep text
+        // Remove code block markers
+        // Keep content readable for embeddings
+        // ... full implementation elided; pass text through for now
+        text.to_string()
+    }
+
+    pub fn add_context_markers(&self, text: &str, block: &Block, page: &Page) -> String {
+        let mut contextualized = String::new();
+
+        // Add page title as context
+        contextualized.push_str(&format!("Page: {}. ", page.title()));
+
+        // Add parent block context for nested blocks
+        if let Some(parent_id) = block.parent_id() {
+            if let Some(parent) = page.get_block(parent_id) {
+                contextualized.push_str(&format!("Parent: {}. ", parent.content().as_str()));
+            }
+        }
+
+        // Add the actual content
+        contextualized.push_str(text);
+
+        contextualized
+    }
+}
+```
+
+## Code Example: EmbedBlocks Use Case
+
+```rust
+// backend/src/application/use_cases/embed_blocks.rs
+
+use std::sync::Arc;
+
+pub struct EmbedBlocks {
+    embedding_service: Arc<dyn EmbeddingService>,
+    vector_store: Arc<dyn VectorStore>,
+    embedding_repository: Arc<dyn EmbeddingRepository>,
+    preprocessor: TextPreprocessor,
+}
+
+impl EmbedBlocks {
+    pub async fn execute(&self, blocks: Vec<Block>, page: &Page) -> DomainResult<()> {
+        // 1. Preprocess blocks into TextChunks
+        let chunks = self.create_chunks_from_blocks(blocks, page)?;
+
+        // 2. Generate embeddings in batches (32 at a time for efficiency)
+        let batch_size = 32;
+        for chunk_batch in chunks.chunks(batch_size) {
+            let texts: Vec<String> = chunk_batch.iter()
+                .map(|c| c.preprocessed_text().to_string())
+                .collect();
+
+            // Generate embeddings using fastembed-rs
+            let embeddings = self.embedding_service
+                .generate_embeddings(texts)
+                .await?;
+
+            // 3. 
Store in vector database with metadata
+            self.store_embeddings(chunk_batch, embeddings).await?;
+        }
+
+        Ok(())
+    }
+
+    fn create_chunks_from_blocks(&self, blocks: Vec<Block>, page: &Page) -> DomainResult<Vec<TextChunk>> {
+        let mut chunks = Vec::new();
+
+        for block in blocks {
+            // Preprocess: remove Logseq syntax, clean markdown
+            let cleaned = self.preprocessor.preprocess_block(&block);
+
+            // Add context: page title, parent hierarchy
+            let with_context = self.preprocessor.add_context_markers(&cleaned, &block, page);
+
+            // Create chunks (split if > 512 tokens, 50 token overlap)
+            let block_chunks = TextChunk::from_block(&block, page.title(), with_context);
+            chunks.extend(block_chunks);
+        }
+
+        Ok(chunks)
+    }
+
+    async fn store_embeddings(&self, chunks: &[TextChunk], embeddings: Vec<Vec<f32>>) -> DomainResult<()> {
+        for (chunk, embedding) in chunks.iter().zip(embeddings.iter()) {
+            // Create EmbeddingVector value object
+            let embedding_vector = EmbeddingVector::new(embedding.clone())?;
+
+            // Create EmbeddedBlock aggregate
+            let embedded_block = EmbeddedBlock::new(
+                chunk.block_id().clone(),
+                chunk.page_id().clone(),
+                embedding_vector,
+                chunk.clone(),
+            );
+
+            // Store in Qdrant with full payload
+            self.vector_store.upsert_point(
+                chunk.chunk_id(),
+                embedding.clone(),
+                Payload {
+                    chunk_id: chunk.chunk_id().as_str(),
+                    block_id: chunk.block_id().as_str(),
+                    page_id: chunk.page_id().as_str(),
+                    page_title: chunk.page_title(),
+                    chunk_index: chunk.chunk_index(),
+                    total_chunks: chunk.total_chunks(),
+                    original_content: chunk.original_content(),
+                    preprocessed_content: chunk.preprocessed_text(),
+                    hierarchy_path: chunk.hierarchy_path(),
+                }
+            ).await?;
+
+            // Track in repository
+            self.embedding_repository.save(embedded_block).await?;
+        }
+
+        Ok(())
+    }
+}
+```
+
+## Infrastructure: Qdrant Vector Store
+
+```rust
+// backend/src/infrastructure/vector_store/qdrant_store.rs
+
+use qdrant_client::{client::QdrantClient, qdrant::*};
+
+pub struct QdrantVectorStore {
+    client: QdrantClient,
+    collection_name: String,
+}
+
+impl QdrantVectorStore {
+    pub async fn new_embedded() -> Result<Self> {
+        // Embedded mode - no separate Qdrant server needed
+        let client = QdrantClient::from_url("http://localhost:6334").build()?;
+
+        let collection_name = "logseq_blocks".to_string();
+
+        // Create collection: 384 dimensions, cosine similarity
+        client.create_collection(&CreateCollection {
+            collection_name: collection_name.clone(),
+            vectors_config: Some(VectorsConfig {
+                config: Some(Config::Params(VectorParams {
+                    size: 384, // all-MiniLM-L6-v2
+                    distance: Distance::Cosine.into(),
+                    hnsw_config: Some(HnswConfigDiff {
+                        m: Some(16),             // connections per layer
+                        ef_construct: Some(100), // build-time accuracy
+                        ..Default::default()
+                    }),
+                    ..Default::default()
+                })),
+            }),
+            ..Default::default()
+        }).await?;
+
+        Ok(Self { client, collection_name })
+    }
+
+    pub async fn upsert_point(
+        &self,
+        chunk_id: &ChunkId,
+        embedding: Vec<f32>,
+        payload: Payload,
+    ) -> Result<()> {
+        let point = PointStruct {
+            id: Some(PointId::from(chunk_id.as_str())),
+            vectors: Some(Vectors::from(embedding)),
+            payload: payload.into_map(),
+        };
+
+        self.client.upsert_points(
+            &self.collection_name,
+            None,
+            vec![point],
+            None,
+        ).await?;
+
+        Ok(())
+    }
+
+    pub async fn similarity_search(
+        &self,
+        query_embedding: EmbeddingVector,
+        limit: usize,
+    ) -> Result<Vec<ScoredChunk>> {
+        let search_result = self.client.search_points(&SearchPoints {
+            collection_name: self.collection_name.clone(),
+            vector: query_embedding.as_vec(),
+            limit: limit as u64,
+            with_payload: 
Some(WithPayloadSelector::from(true)),
+            score_threshold: Some(0.5), // Minimum similarity
+            ..Default::default()
+        }).await?;
+
+        Ok(search_result.result.into_iter()
+            .map(|scored_point| ScoredChunk {
+                chunk_id: ChunkId::new(scored_point.id.unwrap().to_string()).unwrap(),
+                block_id: BlockId::new(
+                    scored_point.payload.get("block_id").unwrap().as_str().unwrap()
+                ).unwrap(),
+                page_id: PageId::new(
+                    scored_point.payload.get("page_id").unwrap().as_str().unwrap()
+                ).unwrap(),
+                similarity_score: SimilarityScore::new(scored_point.score),
+                content: scored_point.payload.get("original_content")
+                    .unwrap().as_str().unwrap().to_string(),
+            })
+            .collect())
+    }
+}
+```
+
+## Hybrid Search: Combining Keyword + Semantic
+
+**Best results come from combining both approaches:**
+
+```rust
+// backend/src/application/services/hybrid_search_service.rs
+
+use std::collections::HashMap;
+
+pub struct HybridSearchService {
+    text_search: SearchService,        // Tantivy
+    semantic_search: EmbeddingService, // Qdrant
+}
+
+impl HybridSearchService {
+    pub async fn hybrid_search(
+        &self,
+        query: &str,
+        limit: usize,
+    ) -> Result<Vec<HybridResult>> {
+        // 1. Parallel search (both at once)
+        let (text_results, semantic_results) = tokio::join!(
+            self.text_search.search(query, limit),
+            self.semantic_search.semantic_search(query, limit),
+        );
+
+        // 2. Reciprocal Rank Fusion (RRF) for score combination
+        //    Formula: score = Ξ£(1 / (k + rank_i)) where k = 60
+        let mut combined_scores: HashMap<String, f32> = HashMap::new();
+
+        for (rank, result) in text_results?.iter().enumerate() {
+            let key = format!("{}:{}", result.page_id(), result.block_id());
+            let rrf_score = 1.0 / (60.0 + rank as f32);
+            *combined_scores.entry(key).or_insert(0.0) += rrf_score * 0.7; // 70% weight
+        }
+
+        for (rank, result) in semantic_results?.iter().enumerate() {
+            let key = format!("{}:{}", result.page_id, result.chunk_id);
+            let rrf_score = 1.0 / (60.0 + rank as f32);
+            *combined_scores.entry(key).or_insert(0.0) += rrf_score * 0.3; // 30% weight
+        }
+
+        // 3. Sort by combined score
+        let mut results: Vec<_> = combined_scores.into_iter().collect();
+        results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
+
+        // 4. Return top K
+        Ok(results.into_iter()
+            .take(limit)
+            .map(|(id, score)| HybridResult { id, score })
+            .collect())
+    }
+}
+```
+
+**Why Hybrid?**
+
+```
+Keyword Search (Tantivy):
+βœ… Exact matches (code, filenames, specific terms)
+βœ… Very fast (milliseconds)
+❌ Misses synonyms ("car" won't find "automobile")
+❌ No semantic understanding
+
+Semantic Search (Embeddings):
+βœ… Understands meaning ("fast car" finds "quick vehicle")
+βœ… Handles paraphrasing
+❌ Slower (tens of milliseconds)
+❌ Can miss exact technical terms
+
+Hybrid:
+βœ… Best of both worlds
+βœ… Technical terms + conceptual understanding
+```
+
+## Integration with Import/Sync
+
+**Automatic embedding during import:**
+
+```rust
+// backend/src/application/services/import_service.rs
+
+impl ImportService {
+    pub async fn import_directory(&mut self, dir: LogseqDirectoryPath) -> Result<ImportSummary> {
+        for file in files {
+            // 1. Parse file
+            let page = LogseqMarkdownParser::parse_file(&file).await?;
+
+            // 2. Save to database
+            self.page_repository.save(page.clone())?;
+
+            // 3. Index in Tantivy (keyword search)
+            if let Some(ref tantivy_index) = self.tantivy_index {
+                tantivy_index.lock().await.index_page(&page)?;
+            }
+
+            // 4. Generate embeddings and index (semantic search)
+            if let Some(ref embedding_service) = self.embedding_service {
+                embedding_service.embed_and_index_page(&page).await?;
+            }
+
+            // 5. 
Save file mapping + self.mapping_repository.save(mapping)?; + } + + // Commit both indexes + self.tantivy_index.lock().await.commit()?; + // Qdrant commits automatically + + Ok(summary) + } +} +``` + +**Automatic re-embedding on sync:** + +```rust +// backend/src/application/services/sync_service.rs + +async fn handle_file_updated(&self, path: PathBuf) -> SyncResult<()> { + let page = LogseqMarkdownParser::parse_file(&path).await?; + + // Update database + self.page_repository.lock().await.save(page.clone())?; + + // Update Tantivy index + if let Some(ref index) = self.tantivy_index { + index.lock().await.update_page(&page)?; + index.lock().await.commit()?; + } + + // Update embeddings + if let Some(ref embedding_service) = self.embedding_service { + // Delete old chunks for this page + embedding_service.delete_page_chunks(&page.id()).await?; + + // Re-embed and index + embedding_service.embed_and_index_page(&page).await?; + } + + Ok(()) +} +``` + +## Performance Considerations + +### Embedding Generation +- **Model loading:** ~100-500MB memory (one-time cost) +- **Batch processing:** 32 texts per batch for optimal throughput +- **Speed:** ~10-50ms per batch (depending on text length) +- **Caching:** Cache embeddings to avoid regeneration + +### Vector Storage +- **Index building:** HNSW index builds incrementally +- **Memory usage:** ~4 bytes per dimension per vector (384 * 4 = 1.5KB per embedding) +- **Search speed:** ~1-10ms for similarity search (depending on collection size) +- **Disk usage:** ~2-3x vector size (including index overhead) + +### Scaling Considerations +- **10K blocks:** ~15MB embeddings, ~30MB index, <10ms search +- **100K blocks:** ~150MB embeddings, ~300MB index, ~20ms search +- **1M blocks:** ~1.5GB embeddings, ~3GB index, ~50ms search + +## Error Handling + +### Embedding Generation Failures +- **Model loading errors:** Fallback to keyword-only search +- **Out of memory:** Reduce batch size, process sequentially +- **Invalid text:** Skip problematic chunks, log errors + +### Vector Store Issues +- **Connection failures:** Retry with exponential backoff +- **Index corruption:** Rebuild from stored embeddings +- **Disk full:** Clean up old embeddings, notify user + +### Search Failures +- **Query too long:** Truncate to token limit +- **No results:** Fall back to keyword search +- **Timeout:** Return partial results, log performance issue + +## Future Enhancements + +### Advanced Chunking +- **Semantic chunking:** Split at topic boundaries +- **Hierarchical chunking:** Multi-level chunk sizes +- **Adaptive chunking:** Adjust size based on content type + +### Model Improvements +- **Multiple models:** Support different embedding models +- **Fine-tuning:** Train on user's specific domain +- **Multilingual:** Support non-English content + +### Search Features +- **Filters:** Combine semantic search with facets +- **Reranking:** Use cross-encoder for final ranking +- **Explanation:** Show why results were matched +- **Feedback:** Learn from user interactions From 358d94fbbeae6550026a9514d120052286d97d67 Mon Sep 17 00:00:00 2001 From: Wesley Finck Date: Sun, 19 Oct 2025 22:41:28 -0700 Subject: [PATCH 6/7] update working notes --- notes/working_notes.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/notes/working_notes.md b/notes/working_notes.md index 4117234..1725479 100644 --- a/notes/working_notes.md +++ b/notes/working_notes.md @@ -10,9 +10,13 @@ - how we will handle vector DB relationship to page and block persistence - how is that handled in the use cases - 
want to check how this is currently handled - look for that use case - wire everything up in a "composite root" which is the tauri layer - where they layers "meet" and can run e2e tests as described above + - with the overview workflow docs - make sure we understand how this will all come together in tauri (presentation layer?) - audit the file processing parallelism - especially for the import to make sure that it can run in the background while the app is still interactive while receiving updates - review OVERVIEW doc - e.g. in continuous sync I see that it filters only for the `journals` subdir but not `pages`? - base e2e test design around the overview to test these main workflows with all the real implementations (not in mem) +- also want to make sure I can use in mem and fake implementations of various interfaces for quick testing but also codebase modularity and maintainability reasons +- audit how blocks are handled in search results (want results to be blocks ideally, with page references (page is on and any parent pages, for example) +- ask what is missing from this overview - how can I create an implementation plan then a deployment plan (building the app with github workflow etc.) # 2025.10.18 From 946e1ce5dfde40a73a401c54fa19cfaed1e48864 Mon Sep 17 00:00:00 2001 From: Wesley Finck Date: Sun, 19 Oct 2025 22:56:24 -0700 Subject: [PATCH 7/7] working note update --- notes/working_notes.md | 1 + 1 file changed, 1 insertion(+) diff --git a/notes/working_notes.md b/notes/working_notes.md index 1725479..7109371 100644 --- a/notes/working_notes.md +++ b/notes/working_notes.md @@ -17,6 +17,7 @@ - also want to make sure I can use in mem and fake implementations of various interfaces for quick testing but also codebase modularity and maintainability reasons - audit how blocks are handled in search results (want results to be blocks ideally, with page references (page is on and any parent pages, for example) - ask what is missing from this overview - how can I create an implementation plan then a deployment plan (building the app with github workflow etc.) +- also would be nice to start thinking about any query params for blocks, like character length and whatnot (some more notes and ideas about this in logseq) but basically if I know some blocks are longer than others, would be nice to tune this. # 2025.10.18