Batch chunk ingestion and batch embeddings for indexing/search #6

Merged

DxTa merged 9 commits into main from codex/review-codebase-for-improvement-ideas on Feb 4, 2026
Conversation

DxTa (Owner) commented on Feb 2, 2026

Motivation

Reduce write and embedding overhead during indexing by buffering chunks, embedding them in batches, and batching DB lookups, improving indexing and search throughput and scalability.

Changes

Batch chunk ingestion and embeddings (core feature):

  • Add chunk_batch_size to IndexingConfig (default 500) to control buffering
  • Buffer chunks in IndexingCoordinator and flush them to the backend in batches (see the sketch after this list)
  • Implement _embed_batch in UsearchSqliteBackend for single batch embedding calls
  • Optimize search_semantic and search_lexical to bulk-fetch chunk rows
  • Auto-tune embedding batch size based on host memory/CPU capacity

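A minimal sketch of the buffering described above, assuming a backend that exposes the store_chunks_batch method named in this PR and the chunk_batch_size default of 500; the real IndexingCoordinator may differ in detail.

```python
# Sketch only: buffer-and-flush pattern, not the actual coordinator code.
from dataclasses import dataclass, field


@dataclass
class IndexingCoordinator:
    backend: object                      # assumed to expose store_chunks_batch(list[dict])
    chunk_batch_size: int = 500          # IndexingConfig default introduced by this PR
    pending_chunks: list = field(default_factory=list)

    def add_chunk(self, chunk: dict) -> None:
        """Buffer a chunk; flush once the buffer reaches chunk_batch_size."""
        self.pending_chunks.append(chunk)
        if len(self.pending_chunks) >= self.chunk_batch_size:
            self.flush_chunks()

    def flush_chunks(self) -> None:
        """Write all buffered chunks in a single batched backend call."""
        if not self.pending_chunks:
            return
        batch, self.pending_chunks = self.pending_chunks, []
        self.backend.store_chunks_batch(batch)   # one embedding pass and one DB write per batch
```

Each indexed file feeds add_chunk, and a final flush_chunks() runs after the directory walk; error handling around that final flush is discussed in the review comment below.
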
Embedding server daemon (from merged PR #5):

  • Cross-repo model sharing via Unix socket daemon
  • Length-prefixed binary framing protocol (replaced fragile newline-delimited framing)
  • Auto-unload feature and idempotent startup with stale PID/socket cleanup (see the sketch after this list)

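The idempotent-startup behaviour could look roughly like the sketch below; the PID/socket file locations and function names are assumptions, not the daemon's actual layout.

```python
# Sketch only: idempotent startup with stale PID/socket cleanup; paths and names are hypothetical.
import os
import socket
from pathlib import Path

PID_FILE = Path.home() / ".sia-code" / "embed-daemon.pid"
SOCK_FILE = Path.home() / ".sia-code" / "embed-daemon.sock"


def _pid_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)              # signal 0 checks existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True                  # process exists but belongs to another user
    return True


def start_listening() -> socket.socket | None:
    """Bind the daemon socket; a live daemon makes this a no-op, stale files are cleaned up."""
    if PID_FILE.exists():
        try:
            pid = int(PID_FILE.read_text().strip())
        except ValueError:
            pid = 0
        if pid and _pid_alive(pid):
            return None              # already running: idempotent, nothing to do
        PID_FILE.unlink(missing_ok=True)   # stale PID file left by a crashed daemon
    SOCK_FILE.unlink(missing_ok=True)      # stale socket left by a dead process
    SOCK_FILE.parent.mkdir(parents=True, exist_ok=True)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(str(SOCK_FILE))
    server.listen()
    PID_FILE.write_text(str(os.getpid()))
    return server
```
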
Critical fixes from review:

  • Reverted embedding model to bge-base-en-v1.5 (768d) — bge-small was incorrectly made default
  • Added dimension mismatch detection on index open (raises ValueError with guidance to --clean; see the sketch after this list)
  • Wrapped final flush_chunks() in try/except in coordinator (both serial and parallel paths)
  • Added transaction rollback on embedding failure in store_chunks_batch (prevents chunks without embeddings)
  • Embed only new chunks during indexing (skip already-stored chunks)

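A minimal sketch of the dimension check noted above, assuming the loaded usearch index reports its dimensionality (e.g. via an ndim attribute); the message wording is illustrative.

```python
# Sketch only: fail fast when the on-disk index was built with a different embedding model.
def check_index_dimensions(loaded_ndim: int, configured_ndim: int, index_path: str) -> None:
    if loaded_ndim != configured_ndim:
        raise ValueError(
            f"Index at {index_path} stores {loaded_ndim}-dim vectors but the configured "
            f"embedding model produces {configured_ndim}-dim vectors. "
            "Rebuild it with 'sia-code index --clean'."
        )
```
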
E2E CI fixes:

  • Eliminated redundant indexing passes in E2E test fixtures
  • Use bge-small-en-v1.5 (384d) in CI for ~3x faster CPU encoding (production default unchanged)
  • Cache HuggingFace model in GitHub Actions
  • Restored test timeouts to 600s

Testing

  • Unit tests: test_usearch_backend_batching.py, test_indexer_buffering.py, test_embed_client_framing.py, test_embed_daemon_start.py, test_config.py
  • Integration: test_batch_indexing_search.py
  • E2E: All 10 language jobs pass (Java, Python, JS, TS, Go, Rust, C++, C#, Ruby, PHP)
  • CI: build + tests on Python 3.10/3.11/3.12


chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 33d8ec64cf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 225 to 226
# Flush any remaining chunks
flush_chunks()

P1: Handle final batch flush errors

If store_chunks_batch raises during the final flush_chunks() (e.g., disk full, embedding backend error), the exception now escapes because the flush happens outside the per-file try block. This aborts indexing and skips error accounting, whereas previously errors were caught and recorded per file. The same pattern appears in index_directory_parallel, so a late failure can crash the whole run rather than being reported in stats.

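A minimal sketch of the guarded final flush the PR adds in response (see the coordinator fix in the description above); the stats structure and the flush_chunks stand-in are hypothetical.

```python
# Sketch only: record a late flush failure in per-run stats instead of aborting the run.
import logging

logger = logging.getLogger(__name__)
stats: dict[str, list[str]] = {"errors": []}   # hypothetical per-run error accounting


def flush_chunks() -> None:                    # stand-in for the coordinator's real flush
    ...


try:
    flush_chunks()                             # flush chunks still buffered after the file loop
except Exception as exc:                       # e.g. disk full, embedding backend error
    logger.error("Final chunk flush failed: %s", exc)
    stats["errors"].append(f"final flush: {exc}")
```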

DxTa added 8 commits February 4, 2026 10:06
On main, store_chunks_batch used _embed() per chunk with an LRU cache,
so incremental re-indexing of unchanged chunks was instant (cache hit).

The batching PR replaced this with _embed_batch() which bypassed the
LRU cache entirely, re-encoding all chunks from scratch every time.
This caused test_index_update to regress from 7s to 5m35s, and similar
regressions for test_compact_healthy_index and test_compact_force.

Fix: Add a dict-based embedding cache to _embed_batch that stores
vectors keyed by text content. On cache hit, skip encoding. On miss,
batch-encode only the uncached texts. This restores the fast path for
incremental indexing while preserving batch encoding for new chunks.
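Roughly, the cache described here could look like the sketch below, assuming a sentence-transformers-style model.encode(list[str]); note that a later commit in this PR removed the cache again because each CLI invocation is a fresh process.

```python
# Sketch only: dict cache keyed by chunk text, batch-encoding only the misses.
import numpy as np


class BatchEmbedder:
    def __init__(self, model):
        self.model = model                         # assumed to expose encode(list[str]) -> vectors
        self._embed_cache: dict[str, np.ndarray] = {}

    def _embed_batch(self, texts: list[str]) -> list[np.ndarray]:
        missing = [t for t in texts if t not in self._embed_cache]
        if missing:
            vectors = self.model.encode(missing)   # one batched call for all cache misses
            for text, vec in zip(missing, vectors):
                self._embed_cache[text] = np.asarray(vec)
        return [self._embed_cache[t] for t in texts]
```
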
…aemon from CI

Root cause fixes for 30min+ CI timeout:

1. store_chunks_batch: INSERT-first, embed-after pattern (sketched after this commit message)
   - Restructured to INSERT all chunks first with per-chunk IntegrityError handling
   - Only batch-embed successfully inserted chunks (skips duplicates)
   - Eliminates 7100% regression on incremental indexing (test_index_update)

2. Remove embedding daemon from CI workflow
   - Daemon adds 2.3x serialization overhead (JSON encode/decode of 384-dim vectors)
   - Eliminated 130% regression on test_index_full and test_index_clean
   - Daemon remains available for real multi-repo usage where model sharing helps

3. Simplify _embed_batch: remove process-local cache
   - Cache added in a7fc966 was ineffective (each CLI call = new subprocess)
   - Simplified to straightforward batch encoding in slices

4. Standardize on bge-small-en-v1.5 (384d)
   - Aligned config.py default with backend default
   - Faster encoding than bge-base (768d)

Preserved improvements:
- Batch WHERE id IN (?) lookups in search methods (faster)
- pending_chunks buffering in coordinator
- Hardware-aware batch sizing

Tests: All unit tests pass (test_usearch_backend_batching.py, test_config.py)
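A sketch of the INSERT-first, embed-after shape from item 1 above, combined with the rollback-on-embedding-failure guard added later in this PR; the table schema and the embed_batch/add_vectors backend method names are assumptions.

```python
# Sketch only: insert chunks first, skip duplicates, then batch-embed only the new rows.
import sqlite3


def store_chunks_batch(conn: sqlite3.Connection, backend, chunks: list[dict]) -> None:
    new_chunks = []
    for chunk in chunks:
        try:
            conn.execute(
                "INSERT INTO chunks (id, path, content) VALUES (?, ?, ?)",
                (chunk["id"], chunk["path"], chunk["content"]),
            )
            new_chunks.append(chunk)          # only freshly inserted chunks get embedded
        except sqlite3.IntegrityError:
            pass                              # duplicate id: already indexed, skip re-embedding
    if not new_chunks:
        conn.commit()
        return
    try:
        vectors = backend.embed_batch([c["content"] for c in new_chunks])   # single batched call
        backend.add_vectors([c["id"] for c in new_chunks], vectors)
    except Exception:
        conn.rollback()                       # never keep chunk rows without embeddings
        raise
    conn.commit()
```
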
1. Revert model to bge-base-en-v1.5 (768d) for backward compatibility
   - Revert config.py default from bge-small (384d) to bge-base (768d)
   - Revert usearch_backend.py default ndim from 384 to 768
   - Maintains compatibility with existing indexes

2. Add dimension mismatch detection in open_index()
   - Validate loaded index dimension matches config after .view()
   - Clear error message guides users to run 'sia-code index --clean'
   - Prevents silent corruption when embedding model changes

3. Wrap final flush_chunks() in error handling
   - Add try/except around final flush in index_directory()
   - Add try/except around final flush in index_directory_parallel()
   - Prevents crashes on final batch failures, logs errors instead

4. Fix store_chunks_batch partial commit on embedding failure
   - Wrap Phase 2 (batch embedding) in try/except
   - Call conn.rollback() if embedding fails
   - Prevents chunks from being stored without embeddings

5. Replace newline-delimited framing with length-prefixed framing (sketched after this commit message)
   - Add 4-byte big-endian length prefix to Message.encode()
   - Add Message.read_from_socket() helper for chunked reading
   - Update client.py and daemon.py to use length-prefixed framing
   - Update test_embed_client_framing.py to validate new framing
   - Eliminates risk of message corruption from embedded newlines

All fixes validated with unit tests. No breaking changes to public APIs.
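The framing change in item 5 could look roughly like this sketch; the JSON payload and the exact Message interface are assumptions, and only the 4-byte big-endian length prefix is taken from the commit.

```python
# Sketch only: length-prefixed framing with exact reads, immune to newlines in the payload.
import json
import socket
import struct


class Message:
    def __init__(self, payload: dict):
        self.payload = payload

    def encode(self) -> bytes:
        body = json.dumps(self.payload).encode("utf-8")
        return struct.pack(">I", len(body)) + body            # 4-byte big-endian length prefix

    @classmethod
    def read_from_socket(cls, sock: socket.socket) -> "Message":
        (length,) = struct.unpack(">I", _recv_exact(sock, 4))
        return cls(json.loads(_recv_exact(sock, length).decode("utf-8")))


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping over partial recv() results."""
    buf = b""
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise ConnectionError("socket closed mid-message")
        buf += part
    return buf
```
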
Problem: Java E2E job took 28+ minutes and was cancelled at 30-minute
job timeout. Two tests FAILED with 600s per-test timeouts:
- test_index_full: 10 min (redundant re-index)
- test_index_clean: 10 min (full rebuild)

Root Cause: With embeddings enabled (default), each full index pass
of google/gson takes ~10 minutes on CI CPU. The test suite was doing
3+ redundant full indexing passes:
1. indexed_repo fixture: sia-code index . (~10 min)
2. test_index_full: sia-code index . (~10 min, redundant!)
3. test_index_clean: sia-code index --clean . (~10 min, full rebuild)
4. test_compact_force: sia-code index --update . + compact (partial)

Solution:
1. Changed test_index_full to use indexed_repo fixture instead of
   initialized_repo. Now it verifies the existing index (instant)
   rather than re-indexing (eliminates a 10-min pass; see the fixture sketch after this commit message).

2. Reduced test_index_clean timeout from 600s to 300s. This test
   intentionally does a full rebuild and may timeout on large repos
   with embeddings, but 300s is sufficient for most cases.

3. Bumped workflow job timeout from 30 to 40 minutes for safety margin.

Impact:
- Java E2E: ~28 min → expected ~8-12 min (eliminates 1 full pass)
- All E2E jobs: More robust against future regressions
- Applied to all 10 language E2E test files for consistency

Files changed: 11 files, 229 insertions(+), 71 deletions(-)
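A sketch of the fixture reuse from point 1 above, with a hypothetical run_cli helper and a placeholder read-only subcommand standing in for whatever the E2E suite actually invokes.

```python
# Sketch only: index once in a session-scoped fixture, then let read-only tests reuse it.
import subprocess

import pytest


def run_cli(args, cwd, timeout=600):
    """Hypothetical helper wrapping the sia-code CLI."""
    return subprocess.run(["sia-code", *args], cwd=cwd, capture_output=True,
                          text=True, timeout=timeout)


@pytest.fixture(scope="session")
def indexed_repo(initialized_repo):
    run_cli(["index", "."], cwd=initialized_repo)        # the one full indexing pass per session
    return initialized_repo


def test_index_full(indexed_repo):
    # "status" is a placeholder for whatever read-only check the suite uses to verify the index.
    result = run_cli(["status"], cwd=indexed_repo)
    assert result.returncode == 0
```
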
- Switch CI to BAAI/bge-small-en-v1.5 (384d) via conftest.py config override (see the conftest sketch after this commit message)
- Add HuggingFace model caching in GitHub Actions workflow
- Use --clean in indexed_repo fixture to recreate index with correct 384d dimensions
- Revert test_index_clean timeout from 300s to 600s (fast with bge-small)

This keeps embeddings fully ON in CI while avoiding CPU timeout issues.
bge-small is ~3x faster to encode and ~3x smaller to download than bge-base,
while still testing the complete embedding pipeline (model load, encode,
vector store, vector search, research multi-hop).

Production default remains bge-base-en-v1.5 (768d) unchanged.

Expected CI timing: ~60-90s for Java/gson instead of 600s+, total per job
~3-5min instead of 28+ min.
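One way a conftest.py override might look; the environment variable names are hypothetical, not documented sia-code settings, and the real override in this PR may use a different mechanism.

```python
# Sketch only: force the small CI model for every E2E test; variable names are hypothetical.
import pytest


@pytest.fixture(autouse=True)
def use_small_embedding_model(monkeypatch):
    """Point the E2E suite at bge-small so CPU-only CI stays fast."""
    monkeypatch.setenv("SIA_CODE_EMBED_MODEL", "BAAI/bge-small-en-v1.5")
    monkeypatch.setenv("SIA_CODE_EMBED_NDIM", "384")
```
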
DxTa merged commit 0169636 into main on Feb 4, 2026
15 checks passed