Batch chunk ingestion and batch embeddings for indexing/search #6

Merged

DxTa merged 9 commits into main from codex/review-codebase-for-improvement-ideas on Feb 4, 2026
Conversation

DxTa (Owner) commented on Feb 2, 2026

Motivation

Reduce write and embedding overhead during indexing by buffering chunks, embedding them in batches, and batching DB lookups, improving indexing and search throughput and scalability.

Changes

Batch chunk ingestion and embeddings (core feature):

  • Add chunk_batch_size to IndexingConfig (default 500) to control buffering
  • Buffer chunks in IndexingCoordinator and flush them to the backend in batches (see the sketch after this list)
  • Implement _embed_batch in UsearchSqliteBackend for single batch embedding calls
  • Optimize search_semantic and search_lexical to bulk-fetch chunk rows
  • Auto-tune embedding batch size based on host memory/CPU capacity

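A minimal sketch of the buffering described above, assuming a backend that exposes the store_chunks_batch method named in this PR and the chunk_batch_size default of 500; the real IndexingCoordinator may differ in detail.

```python
# Sketch only: buffer-and-flush pattern, not the actual coordinator code.
from dataclasses import dataclass, field


@dataclass
class IndexingCoordinator:
    backend: object                      # assumed to expose store_chunks_batch(list[dict])
    chunk_batch_size: int = 500          # IndexingConfig default introduced by this PR
    pending_chunks: list = field(default_factory=list)

    def add_chunk(self, chunk: dict) -> None:
        """Buffer a chunk; flush once the buffer reaches chunk_batch_size."""
        self.pending_chunks.append(chunk)
        if len(self.pending_chunks) >= self.chunk_batch_size:
            self.flush_chunks()

    def flush_chunks(self) -> None:
        """Write all buffered chunks in a single batched backend call."""
        if not self.pending_chunks:
            return
        batch, self.pending_chunks = self.pending_chunks, []
        self.backend.store_chunks_batch(batch)   # one embedding pass and one DB write per batch
```

Each indexed file feeds add_chunk, and a final flush_chunks() runs after the directory walk; error handling around that final flush is discussed in the review comment below.
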
Embedding server daemon (from merged PR #5):

  • Cross-repo model sharing via Unix socket daemon
  • Length-prefixed binary framing protocol (replaced fragile newline-delimited framing)
  • Auto-unload feature and idempotent startup with stale PID/socket cleanup (see the sketch after this list)

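The idempotent-startup behaviour could look roughly like the sketch below; the PID/socket file locations and function names are assumptions, not the daemon's actual layout.

```python
# Sketch only: idempotent startup with stale PID/socket cleanup; paths and names are hypothetical.
import os
import socket
from pathlib import Path

PID_FILE = Path.home() / ".sia-code" / "embed-daemon.pid"
SOCK_FILE = Path.home() / ".sia-code" / "embed-daemon.sock"


def _pid_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)              # signal 0 checks existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True                  # process exists but belongs to another user
    return True


def start_listening() -> socket.socket | None:
    """Bind the daemon socket; a live daemon makes this a no-op, stale files are cleaned up."""
    if PID_FILE.exists():
        try:
            pid = int(PID_FILE.read_text().strip())
        except ValueError:
            pid = 0
        if pid and _pid_alive(pid):
            return None              # already running: idempotent, nothing to do
        PID_FILE.unlink(missing_ok=True)   # stale PID file left by a crashed daemon
    SOCK_FILE.unlink(missing_ok=True)      # stale socket left by a dead process
    SOCK_FILE.parent.mkdir(parents=True, exist_ok=True)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(str(SOCK_FILE))
    server.listen()
    PID_FILE.write_text(str(os.getpid()))
    return server
```
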
Critical fixes from review:

  • Reverted embedding model to bge-base-en-v1.5 (768d) — bge-small was incorrectly made default
  • Added dimension mismatch detection on index open (raises ValueError with guidance to --clean; see the sketch after this list)
  • Wrapped final flush_chunks() in try/except in coordinator (both serial and parallel paths)
  • Added transaction rollback on embedding failure in store_chunks_batch (prevents chunks without embeddings)
  • Embed only new chunks during indexing (skip already-stored chunks)

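A minimal sketch of the dimension check noted above, assuming the loaded usearch index reports its dimensionality (e.g. via an ndim attribute); the message wording is illustrative.

```python
# Sketch only: fail fast when the on-disk index was built with a different embedding model.
def check_index_dimensions(loaded_ndim: int, configured_ndim: int, index_path: str) -> None:
    if loaded_ndim != configured_ndim:
        raise ValueError(
            f"Index at {index_path} stores {loaded_ndim}-dim vectors but the configured "
            f"embedding model produces {configured_ndim}-dim vectors. "
            "Rebuild it with 'sia-code index --clean'."
        )
```
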
E2E CI fixes:

  • Eliminated redundant indexing passes in E2E test fixtures
  • Use bge-small-en-v1.5 (384d) in CI for ~3x faster CPU encoding (production default unchanged)
  • Cache HuggingFace model in GitHub Actions
  • Restored test timeouts to 600s

Testing

  • Unit tests: test_usearch_backend_batching.py, test_indexer_buffering.py, test_embed_client_framing.py, test_embed_daemon_start.py, test_config.py
  • Integration: test_batch_indexing_search.py
  • E2E: All 10 language jobs pass (Java, Python, JS, TS, Go, Rust, C++, C#, Ruby, PHP)
  • CI: build + tests on Python 3.10/3.11/3.12


chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 33d8ec64cf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 225 to 226
# Flush any remaining chunks
flush_chunks()

P1: Handle final batch flush errors

If store_chunks_batch raises during the final flush_chunks() (e.g., disk full, embedding backend error), the exception now escapes because the flush happens outside the per-file try block. This aborts indexing and skips error accounting, whereas previously errors were caught and recorded per file. The same pattern appears in index_directory_parallel, so a late failure can crash the whole run rather than being reported in stats.

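A minimal sketch of the guarded final flush the PR adds in response (see the coordinator fix in the description above); the stats structure and the flush_chunks stand-in are hypothetical.

```python
# Sketch only: record a late flush failure in per-run stats instead of aborting the run.
import logging

logger = logging.getLogger(__name__)
stats: dict[str, list[str]] = {"errors": []}   # hypothetical per-run error accounting


def flush_chunks() -> None:                    # stand-in for the coordinator's real flush
    ...


try:
    flush_chunks()                             # flush chunks still buffered after the file loop
except Exception as exc:                       # e.g. disk full, embedding backend error
    logger.error("Final chunk flush failed: %s", exc)
    stats["errors"].append(f"final flush: {exc}")
```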

DxTa added 8 commits February 4, 2026 10:06
On main, store_chunks_batch used _embed() per chunk with an LRU cache,
so incremental re-indexing of unchanged chunks was instant (cache hit).

The batching PR replaced this with _embed_batch() which bypassed the
LRU cache entirely, re-encoding all chunks from scratch every time.
This caused test_index_update to regress from 7s to 5m35s, and similar
regressions for test_compact_healthy_index and test_compact_force.

Fix: Add a dict-based embedding cache to _embed_batch that stores
vectors keyed by text content. On cache hit, skip encoding. On miss,
batch-encode only the uncached texts. This restores the fast path for
incremental indexing while preserving batch encoding for new chunks.
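Roughly, the cache described here could look like the sketch below, assuming a sentence-transformers-style model.encode(list[str]); note that a later commit in this PR removed the cache again because each CLI invocation is a fresh process.

```python
# Sketch only: dict cache keyed by chunk text, batch-encoding only the misses.
import numpy as np


class BatchEmbedder:
    def __init__(self, model):
        self.model = model                         # assumed to expose encode(list[str]) -> vectors
        self._embed_cache: dict[str, np.ndarray] = {}

    def _embed_batch(self, texts: list[str]) -> list[np.ndarray]:
        missing = [t for t in texts if t not in self._embed_cache]
        if missing:
            vectors = self.model.encode(missing)   # one batched call for all cache misses
            for text, vec in zip(missing, vectors):
                self._embed_cache[text] = np.asarray(vec)
        return [self._embed_cache[t] for t in texts]
```
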
…aemon from CI

Root cause fixes for 30min+ CI timeout:

1. store_chunks_batch: INSERT-first, embed-after pattern (sketched after this commit message)
   - Restructured to INSERT all chunks first with per-chunk IntegrityError handling
   - Only batch-embed successfully inserted chunks (skips duplicates)
   - Eliminates 7100% regression on incremental indexing (test_index_update)

2. Remove embedding daemon from CI workflow
   - Daemon adds 2.3x serialization overhead (JSON encode/decode of 384-dim vectors)
   - Eliminated 130% regression on test_index_full and test_index_clean
   - Daemon remains available for real multi-repo usage where model sharing helps

3. Simplify _embed_batch: remove process-local cache
   - Cache added in a7fc966 was ineffective (each CLI call = new subprocess)
   - Simplified to straightforward batch encoding in slices

4. Standardize on bge-small-en-v1.5 (384d)
   - Aligned config.py default with backend default
   - Faster encoding than bge-base (768d)

Preserved improvements:
- Batch WHERE id IN (?) lookups in search methods (faster)
- pending_chunks buffering in coordinator
- Hardware-aware batch sizing

Tests: All unit tests pass (test_usearch_backend_batching.py, test_config.py)
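A sketch of the INSERT-first, embed-after shape from item 1 above, combined with the rollback-on-embedding-failure guard added later in this PR; the table schema and the embed_batch/add_vectors backend method names are assumptions.

```python
# Sketch only: insert chunks first, skip duplicates, then batch-embed only the new rows.
import sqlite3


def store_chunks_batch(conn: sqlite3.Connection, backend, chunks: list[dict]) -> None:
    new_chunks = []
    for chunk in chunks:
        try:
            conn.execute(
                "INSERT INTO chunks (id, path, content) VALUES (?, ?, ?)",
                (chunk["id"], chunk["path"], chunk["content"]),
            )
            new_chunks.append(chunk)          # only freshly inserted chunks get embedded
        except sqlite3.IntegrityError:
            pass                              # duplicate id: already indexed, skip re-embedding
    if not new_chunks:
        conn.commit()
        return
    try:
        vectors = backend.embed_batch([c["content"] for c in new_chunks])   # single batched call
        backend.add_vectors([c["id"] for c in new_chunks], vectors)
    except Exception:
        conn.rollback()                       # never keep chunk rows without embeddings
        raise
    conn.commit()
```
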
1. Revert model to bge-base-en-v1.5 (768d) for backward compatibility
   - Revert config.py default from bge-small (384d) to bge-base (768d)
   - Revert usearch_backend.py default ndim from 384 to 768
   - Maintains compatibility with existing indexes

2. Add dimension mismatch detection in open_index()
   - Validate loaded index dimension matches config after .view()
   - Clear error message guides users to run 'sia-code index --clean'
   - Prevents silent corruption when embedding model changes

3. Wrap final flush_chunks() in error handling
   - Add try/except around final flush in index_directory()
   - Add try/except around final flush in index_directory_parallel()
   - Prevents crashes on final batch failures, logs errors instead

4. Fix store_chunks_batch partial commit on embedding failure
   - Wrap Phase 2 (batch embedding) in try/except
   - Call conn.rollback() if embedding fails
   - Prevents chunks from being stored without embeddings

5. Replace newline-delimited framing with length-prefixed framing (sketched after this commit message)
   - Add 4-byte big-endian length prefix to Message.encode()
   - Add Message.read_from_socket() helper for chunked reading
   - Update client.py and daemon.py to use length-prefixed framing
   - Update test_embed_client_framing.py to validate new framing
   - Eliminates risk of message corruption from embedded newlines

All fixes validated with unit tests. No breaking changes to public APIs.
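The framing change in item 5 could look roughly like this sketch; the JSON payload and the exact Message interface are assumptions, and only the 4-byte big-endian length prefix is taken from the commit.

```python
# Sketch only: length-prefixed framing with exact reads, immune to newlines in the payload.
import json
import socket
import struct


class Message:
    def __init__(self, payload: dict):
        self.payload = payload

    def encode(self) -> bytes:
        body = json.dumps(self.payload).encode("utf-8")
        return struct.pack(">I", len(body)) + body            # 4-byte big-endian length prefix

    @classmethod
    def read_from_socket(cls, sock: socket.socket) -> "Message":
        (length,) = struct.unpack(">I", _recv_exact(sock, 4))
        return cls(json.loads(_recv_exact(sock, length).decode("utf-8")))


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping over partial recv() results."""
    buf = b""
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise ConnectionError("socket closed mid-message")
        buf += part
    return buf
```
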
Problem: Java E2E job took 28+ minutes and was cancelled at 30-minute
job timeout. Two tests FAILED with 600s per-test timeouts:
- test_index_full: 10 min (redundant re-index)
- test_index_clean: 10 min (full rebuild)

Root Cause: With embeddings enabled (default), each full index pass
of google/gson takes ~10 minutes on CI CPU. The test suite was doing
3+ redundant full indexing passes:
1. indexed_repo fixture: sia-code index . (~10 min)
2. test_index_full: sia-code index . (~10 min, redundant!)
3. test_index_clean: sia-code index --clean . (~10 min, full rebuild)
4. test_compact_force: sia-code index --update . + compact (partial)

Solution:
1. Changed test_index_full to use indexed_repo fixture instead of
   initialized_repo. Now it verifies the existing index (instant)
   rather than re-indexing (eliminates a 10-min pass; see the fixture sketch after this commit message).

2. Reduced test_index_clean timeout from 600s to 300s. This test
   intentionally does a full rebuild and may timeout on large repos
   with embeddings, but 300s is sufficient for most cases.

3. Bumped workflow job timeout from 30 to 40 minutes for safety margin.

Impact:
- Java E2E: ~28 min → expected ~8-12 min (eliminates 1 full pass)
- All E2E jobs: More robust against future regressions
- Applied to all 10 language E2E test files for consistency

Files changed: 11 files, 229 insertions(+), 71 deletions(-)
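A sketch of the fixture reuse from point 1 above, with a hypothetical run_cli helper and a placeholder read-only subcommand standing in for whatever the E2E suite actually invokes.

```python
# Sketch only: index once in a session-scoped fixture, then let read-only tests reuse it.
import subprocess

import pytest


def run_cli(args, cwd, timeout=600):
    """Hypothetical helper wrapping the sia-code CLI."""
    return subprocess.run(["sia-code", *args], cwd=cwd, capture_output=True,
                          text=True, timeout=timeout)


@pytest.fixture(scope="session")
def indexed_repo(initialized_repo):
    run_cli(["index", "."], cwd=initialized_repo)        # the one full indexing pass per session
    return initialized_repo


def test_index_full(indexed_repo):
    # "status" is a placeholder for whatever read-only check the suite uses to verify the index.
    result = run_cli(["status"], cwd=indexed_repo)
    assert result.returncode == 0
```
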
- Switch CI to BAAI/bge-small-en-v1.5 (384d) via conftest.py config override (see the conftest sketch after this commit message)
- Add HuggingFace model caching in GitHub Actions workflow
- Use --clean in indexed_repo fixture to recreate index with correct 384d dimensions
- Revert test_index_clean timeout from 300s to 600s (fast with bge-small)

This keeps embeddings fully ON in CI while avoiding CPU timeout issues.
bge-small is ~3x faster to encode and ~3x smaller to download than bge-base,
while still testing the complete embedding pipeline (model load, encode,
vector store, vector search, research multi-hop).

Production default remains bge-base-en-v1.5 (768d) unchanged.

Expected CI timing: ~60-90s for Java/gson instead of 600s+, total per job
~3-5min instead of 28+ min.
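One way a conftest.py override might look; the environment variable names are hypothetical, not documented sia-code settings, and the real override in this PR may use a different mechanism.

```python
# Sketch only: force the small CI model for every E2E test; variable names are hypothetical.
import pytest


@pytest.fixture(autouse=True)
def use_small_embedding_model(monkeypatch):
    """Point the E2E suite at bge-small so CPU-only CI stays fast."""
    monkeypatch.setenv("SIA_CODE_EMBED_MODEL", "BAAI/bge-small-en-v1.5")
    monkeypatch.setenv("SIA_CODE_EMBED_NDIM", "384")
```
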
DxTa merged commit 0169636 into main on Feb 4, 2026
15 checks passed