The Monte Carlo boundary search in `chunker.rs` calls the ONNX embedding model 10–50 times per document to locate semantic split points between sentences. Sentence embeddings computed during candidate sampling are discarded, and the same sentences are re-embedded during boundary refinement.
Caching sentence embeddings (keyed by sentence text or index) within a single `chunk()` call would eliminate redundant ONNX inference and significantly reduce ingest latency for long documents, where chunking is currently the dominant cost.
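A minimal sketch of the idea, assuming nothing about the actual `chunker.rs` internals: the `Embedder` stand-in and `EmbeddingCache` below are hypothetical names, with a counter in place of real ONNX inference so the effect of the cache is visible. The cache lives only for the duration of one `chunk()`-style call and is keyed by sentence index, so repeated passes (candidate sampling, then refinement) embed each sentence at most once.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the ONNX embedding call; counts invocations
/// so the number of avoided redundant inferences is observable.
struct Embedder {
    calls: usize,
}

impl Embedder {
    fn embed(&mut self, sentence: &str) -> Vec<f32> {
        self.calls += 1;
        // Toy "embedding" (byte sum), purely illustrative.
        vec![sentence.bytes().map(|b| b as f32).sum()]
    }
}

/// Per-call cache keyed by sentence index; dropped when chunking returns,
/// so it adds no long-lived memory overhead.
struct EmbeddingCache {
    cached: HashMap<usize, Vec<f32>>,
}

impl EmbeddingCache {
    fn new() -> Self {
        Self { cached: HashMap::new() }
    }

    /// Return the embedding for sentence `idx`, computing it at most once.
    fn get_or_embed(
        &mut self,
        idx: usize,
        sentence: &str,
        embedder: &mut Embedder,
    ) -> &Vec<f32> {
        self.cached
            .entry(idx)
            .or_insert_with(|| embedder.embed(sentence))
    }
}

fn main() {
    let sentences = ["First sentence.", "Second sentence.", "Third sentence."];
    let mut embedder = Embedder { calls: 0 };
    let mut cache = EmbeddingCache::new();

    // Candidate sampling and boundary refinement both revisit the same
    // sentences; with the cache, each sentence is embedded exactly once
    // no matter how many passes run.
    for _pass in 0..5 {
        for (i, s) in sentences.iter().enumerate() {
            let _emb = cache.get_or_embed(i, s, &mut embedder);
        }
    }

    println!("embed calls: {}", embedder.calls);
    assert_eq!(embedder.calls, sentences.len());
}
```

Keying by index is cheaper than hashing full sentence text, but keying by text would additionally deduplicate identical sentences appearing at different positions; either works as long as the key is stable across sampling and refinement.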