Batch chunk ingestion and batch embeddings for indexing/search #6
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 33d8ec64cf
sia_code/indexer/coordinator.py (Outdated)

    # Flush any remaining chunks
    flush_chunks()
Handle final batch flush errors
If store_chunks_batch raises during the final flush_chunks() (e.g., disk full, embedding backend error), the exception now escapes because the flush happens outside the per-file try block. This aborts indexing and skips error accounting, whereas previously errors were caught and recorded per file. The same pattern appears in index_directory_parallel, so a late failure can crash the whole run rather than being reported in stats.
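A minimal sketch of the handling the review suggests, assuming the coordinator keeps a per-run stats dict; the helper name and the shape of `stats` are illustrative, not the actual sia-code code:

```python
import logging

logger = logging.getLogger(__name__)


def flush_with_error_accounting(flush_chunks, stats):
    """Hypothetical helper: run the final flush, but record a late failure in
    the run's stats instead of letting the exception escape and abort indexing."""
    try:
        flush_chunks()  # flush any chunks still buffered after the per-file loop
    except Exception as exc:  # e.g. disk full, embedding backend error
        logger.error("Final batch flush failed: %s", exc)
        stats["errors"] = stats.get("errors", 0) + 1
```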
On main, store_chunks_batch used _embed() per chunk with an LRU cache, so incremental re-indexing of unchanged chunks was instant (cache hit). The batching PR replaced this with _embed_batch(), which bypassed the LRU cache entirely and re-encoded all chunks from scratch every time. This caused test_index_update to regress from 7s to 5m35s, with similar regressions for test_compact_healthy_index and test_compact_force.

Fix: Add a dict-based embedding cache to _embed_batch that stores vectors keyed by text content. On cache hit, skip encoding. On miss, batch-encode only the uncached texts. This restores the fast path for incremental indexing while preserving batch encoding for new chunks.
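A rough sketch of the cache-aware batch embedding described above. The class, attribute names, and the `model.encode` interface are assumptions for illustration, not the exact PR code (the comment below also notes this process-local cache was later removed):

```python
from typing import Dict, List

import numpy as np


class BatchEmbedderSketch:
    """Illustrative cache-aware _embed_batch: vectors are keyed by text content,
    cache hits skip encoding, and only misses are batch-encoded."""

    def __init__(self, model):
        self.model = model  # assumed to expose encode(list[str]) -> array of vectors
        self._cache: Dict[str, np.ndarray] = {}

    def _embed_batch(self, texts: List[str]) -> List[np.ndarray]:
        # Collect texts not yet in the cache.
        missing = [t for t in texts if t not in self._cache]
        if missing:
            # One batch encoding call covering only the uncached texts.
            vectors = self.model.encode(missing)
            for text, vec in zip(missing, vectors):
                self._cache[text] = vec
        # Return vectors in the original order, served from the cache.
        return [self._cache[t] for t in texts]
```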
…aemon from CI

Root cause fixes for 30min+ CI timeout:

1. store_chunks_batch: INSERT-first, embed-after pattern
   - Restructured to INSERT all chunks first with per-chunk IntegrityError handling
   - Only batch-embed successfully inserted chunks (skips duplicates)
   - Eliminates 7100% regression on incremental indexing (test_index_update)
2. Remove embedding daemon from CI workflow
   - Daemon adds 2.3x serialization overhead (JSON encode/decode of 384-dim vectors)
   - Eliminated 130% regression on test_index_full and test_index_clean
   - Daemon remains available for real multi-repo usage where model sharing helps
3. Simplify _embed_batch: remove process-local cache
   - Cache added in a7fc966 was ineffective (each CLI call = new subprocess)
   - Simplified to straightforward batch encoding in slices
4. Standardize on bge-small-en-v1.5 (384d)
   - Aligned config.py default with backend default
   - Faster encoding than bge-base (768d)

Preserved improvements:
- Batch WHERE id IN (?) lookups in search methods (faster)
- pending_chunks buffering in coordinator
- Hardware-aware batch sizing

Tests: All unit tests pass (test_usearch_backend_batching.py, test_config.py)
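A hedged sketch of the INSERT-first, embed-after pattern from item 1 above. The table names, schema, and `embed_batch` callable are illustrative assumptions, not the actual sia-code backend:

```python
import json
import sqlite3
from typing import Callable, List, Sequence, Tuple


def store_chunks_batch_sketch(
    conn: sqlite3.Connection,
    embed_batch: Callable[[List[str]], Sequence[Sequence[float]]],
    chunks: List[Tuple[str, str]],
) -> int:
    """Illustrative INSERT-first, embed-after flow: insert every chunk with
    per-chunk IntegrityError handling, then batch-embed only the rows that
    were actually inserted, so duplicate chunks are never re-encoded."""
    inserted: List[Tuple[str, str]] = []  # (chunk_id, text) pairs that were new
    for chunk_id, text in chunks:
        try:
            conn.execute("INSERT INTO chunks (id, text) VALUES (?, ?)", (chunk_id, text))
            inserted.append((chunk_id, text))
        except sqlite3.IntegrityError:
            continue  # duplicate chunk id: already indexed, skip embedding entirely
    if inserted:
        # Single batch embedding call covering only the newly inserted chunks.
        vectors = embed_batch([text for _, text in inserted])
        conn.executemany(
            "INSERT INTO embeddings (chunk_id, vector) VALUES (?, ?)",
            [(cid, json.dumps(list(vec))) for (cid, _), vec in zip(inserted, vectors)],
        )
    conn.commit()
    return len(inserted)
```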
1. Revert model to bge-base-en-v1.5 (768d) for backward compatibility
   - Revert config.py default from bge-small (384d) to bge-base (768d)
   - Revert usearch_backend.py default ndim from 384 to 768
   - Maintains compatibility with existing indexes
2. Add dimension mismatch detection in open_index()
   - Validate loaded index dimension matches config after .view()
   - Clear error message guides users to run 'sia-code index --clean'
   - Prevents silent corruption when embedding model changes
3. Wrap final flush_chunks() in error handling
   - Add try/except around final flush in index_directory()
   - Add try/except around final flush in index_directory_parallel()
   - Prevents crashes on final batch failures, logs errors instead
4. Fix store_chunks_batch partial commit on embedding failure
   - Wrap Phase 2 (batch embedding) in try/except
   - Call conn.rollback() if embedding fails
   - Prevents chunks from being stored without embeddings
5. Replace newline-delimited framing with length-prefixed framing
   - Add 4-byte big-endian length prefix to Message.encode()
   - Add Message.read_from_socket() helper for chunked reading
   - Update client.py and daemon.py to use length-prefixed framing
   - Update test_embed_client_framing.py to validate new framing
   - Eliminates risk of message corruption from embedded newlines

All fixes validated with unit tests. No breaking changes to public APIs.
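A minimal sketch of the length-prefixed framing described in item 5. The class here is a stand-in with an assumed JSON payload and assumed method names modeled on the commit message, not the PR's actual Message implementation:

```python
import json
import socket
import struct


class FramedMessage:
    """Illustrative length-prefixed framing: a 4-byte big-endian length header
    followed by the JSON body, so embedded newlines cannot corrupt messages."""

    def __init__(self, payload: dict):
        self.payload = payload

    def encode(self) -> bytes:
        body = json.dumps(self.payload).encode("utf-8")
        return struct.pack(">I", len(body)) + body  # 4-byte big-endian length prefix

    @classmethod
    def read_from_socket(cls, sock: socket.socket) -> "FramedMessage":
        header = cls._read_exact(sock, 4)
        (length,) = struct.unpack(">I", header)
        body = cls._read_exact(sock, length)
        return cls(json.loads(body.decode("utf-8")))

    @staticmethod
    def _read_exact(sock: socket.socket, n: int) -> bytes:
        # Loop until exactly n bytes arrive; recv() may return partial chunks.
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("socket closed mid-message")
            buf += chunk
        return buf
```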
Problem: Java E2E job took 28+ minutes and was cancelled at the 30-minute job timeout. Two tests FAILED with 600s per-test timeouts:
- test_index_full: 10 min (redundant re-index)
- test_index_clean: 10 min (full rebuild)

Root Cause: With embeddings enabled (default), each full index pass of google/gson takes ~10 minutes on CI CPU. The test suite was doing 3+ redundant full indexing passes:
1. indexed_repo fixture: sia-code index . (~10 min)
2. test_index_full: sia-code index . (~10 min, redundant!)
3. test_index_clean: sia-code index --clean . (~10 min, full rebuild)
4. test_compact_force: sia-code index --update . + compact (partial)

Solution:
1. Changed test_index_full to use the indexed_repo fixture instead of initialized_repo. It now verifies the existing index (instant) rather than re-indexing, eliminating a 10-minute pass.
2. Reduced the test_index_clean timeout from 600s to 300s. This test intentionally does a full rebuild and may time out on large repos with embeddings, but 300s is sufficient for most cases.
3. Bumped the workflow job timeout from 30 to 40 minutes for a safety margin.

Impact:
- Java E2E: ~28 min → expected ~8-12 min (eliminates one full pass)
- All E2E jobs: more robust against future regressions
- Applied to all 10 language E2E test files for consistency

Files changed: 11 files, 229 insertions(+), 71 deletions(-)
- Switch CI to BAAI/bge-small-en-v1.5 (384d) via conftest.py config override
- Add HuggingFace model caching in GitHub Actions workflow
- Use --clean in indexed_repo fixture to recreate index with correct 384d dimensions
- Revert test_index_clean timeout from 300s to 600s (fast with bge-small)

This keeps embeddings fully ON in CI while avoiding CPU timeout issues. bge-small is ~3x faster to encode and ~3x smaller to download than bge-base, while still testing the complete embedding pipeline (model load, encode, vector store, vector search, research multi-hop). The production default remains bge-base-en-v1.5 (768d), unchanged.

Expected CI timing: ~60-90s for Java/gson instead of 600s+, total per job ~3-5 min instead of 28+ min.
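A hypothetical illustration of the kind of conftest.py override described above; the fixture name and the environment-variable mechanism are assumptions, since the PR's actual test configuration isn't shown here:

```python
# conftest.py (hypothetical sketch): force the smaller embedding model in CI so
# the full embedding pipeline still runs, but encoding is roughly 3x faster.
import pytest


@pytest.fixture(autouse=True)
def use_small_embedding_model(monkeypatch):
    # Assumed mechanism: an environment variable read by the config layer.
    # The real override in the PR may patch the config object directly instead.
    monkeypatch.setenv("SIA_CODE_EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5")
    yield
```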
Motivation
Reduce write and embedding overhead during indexing by buffering chunks and performing batch embeddings and batch DB lookups to improve indexing/search throughput and scalability.
Changes
Batch chunk ingestion and embeddings (core feature):
- Added `chunk_batch_size` to `IndexingConfig` (default 500) to control buffering
- Buffer pending chunks in `IndexingCoordinator` and flush to the backend in batches (see the sketch below)
- Added `_embed_batch` in `UsearchSqliteBackend` for single batch embedding calls
- Updated `search_semantic` and `search_lexical` to bulk-fetch chunk rows

Embedding server daemon (from merged PR #5):
Critical fixes from review:
- Dimension mismatch detection in `open_index()` (raises `ValueError` with guidance to `--clean`)
- Wrapped final `flush_chunks()` in try/except in the coordinator (both serial and parallel paths)
- Rollback on embedding failure in `store_chunks_batch` (prevents chunks being stored without embeddings)

E2E CI fixes:
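A schematic sketch of the chunk buffering listed under the core feature above; the coordinator and backend interfaces here are simplified assumptions rather than the actual sia-code classes:

```python
from typing import List


class BufferingCoordinatorSketch:
    """Illustrative buffering pattern: chunks accumulate in pending_chunks and
    are flushed to the backend in batches of chunk_batch_size (default 500)."""

    def __init__(self, backend, chunk_batch_size: int = 500):
        self.backend = backend  # assumed to expose store_chunks_batch(chunks)
        self.chunk_batch_size = chunk_batch_size
        self.pending_chunks: List[dict] = []

    def add_chunk(self, chunk: dict) -> None:
        self.pending_chunks.append(chunk)
        if len(self.pending_chunks) >= self.chunk_batch_size:
            self.flush_chunks()

    def flush_chunks(self) -> None:
        # One backend call (and one batch embedding pass) per buffered batch.
        if self.pending_chunks:
            self.backend.store_chunks_batch(self.pending_chunks)
            self.pending_chunks = []
```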
Testing
- Unit tests: `test_usearch_backend_batching.py`, `test_indexer_buffering.py`, `test_embed_client_framing.py`, `test_embed_daemon_start.py`, `test_config.py`
- `test_batch_indexing_search.py`