fix(prefilter): spawn_blocking handlers + alive AND at substitute#208
Merged
fix(prefilter): spawn_blocking handlers + alive AND at substitute#208
Conversation
Review findings from Gemini + GPT + Plan Reviewer on PR #207 surfaced two critical bugs plus a few design-doc deviations worth documenting. 1. Deleted-doc resurrection (Gemini, CRITICAL) The engine uses clean-delete: on delete, the doc's bits are cleared from every filter/sort bitmap AND from alive, so filter bitmaps stay clean and the query hot path has no `alive AND filter`. The cached prefilter bitmap is NOT maintained by clean-delete — it's a frozen snapshot. Between refreshes, deleted docs still have bits set in the cached prefilter. A query whose clauses match the prefilter (no extra clauses to narrow against clean filter bitmaps) would have acc = stale bitmap → deleted docs leak through. Fix: substitute() now takes an `&RoaringBitmap alive` argument and intersects the cached prefilter bitmap with alive before wrapping it in a BucketBitmap clause. Cost: one bitmap AND (~20-50ms at 107M) per substituted query — still a large win vs the 300-900ms saved by eliding the original clauses. Verified by: - unit test substitute_excludes_dead_slots_from_stale_prefilter - integration test deleted_docs_excluded_from_stale_prefilter (registers prefilter, deletes 3/10 docs without refresh, confirms substituted query returns 7 IDs excluding the deleted slots) 2. Tokio reactor starvation (Gemini, CRITICAL) handle_register_prefilter and handle_refresh_prefilter called engine.register_prefilter / engine.refresh_prefilter directly on the Axum async worker thread. compute_filters over 100M-bit bitmaps is ~300-900ms of CPU work — running that on a reactor thread starves the runtime and tanks p99 for every other request. Fix: wrap both in tokio::task::spawn_blocking. Errors propagate as JoinError → 500 Internal Server Error with message. No other behavioral change. 3. Or-children canonicalization (Plan Reviewer, low priority) Design open question #1: whether Or(A, B) and Or(B, A) should canonicalize identically. Current answer is "no" — we don't reorder Or children. Added unit test canonicalize_does_not_reorder_or_children as a regression backstop: if a future change enables Or sorting, the test must be updated deliberately. 4. Documented deviations + Phase 1 hazards in the design doc - Registry on ConcurrentEngine, not InnerEngine (reason: avoid snapshot publish on registration; clause sets are immutable so no coherence issue) - Phase 1b flag gate skipped (empty-registry early-return is the kill switch) - Cache/prefilter drift bifurcation: cache-hit path and cache-miss path may return slightly different results under drift; bounded by refresh interval; acceptable for Phase 1 per design's explicit "mild drift is acceptable"; Phase 2 includes prefilter version in cache key - Delete-drift / alive-intersection rationale Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Review-driven follow-ups to PR #207 (merged as v1.0.218). Two CRITICAL fixes + docs.
Blockers found by review
1. Deleted-doc resurrection (Gemini, CRITICAL)
Engine uses clean-delete — filter bitmaps stay clean, no `alive AND filter` in the query hot path. The cached prefilter bitmap is NOT maintained by clean-delete — it's a frozen snapshot, so deleted docs still have bits set between refreshes. If a query's clauses match the prefilter with no extra constraints, `acc = stale_bitmap` and the deleted doc leaks through.
Fix: `substitute()` now takes `&RoaringBitmap alive` and intersects at substitute time. Cost: one bitmap AND (~20–50 ms at 107 M) — still a large net win vs the 300–900 ms saved. Verified by new integration test `deleted_docs_excluded_from_stale_prefilter` (register prefilter → delete 3/10 docs without refreshing → substituted query returns the remaining 7).
2. Tokio reactor starvation (Gemini, CRITICAL)
`handle_register_prefilter` and `handle_refresh_prefilter` called into `compute_filters` directly on the Axum async worker thread. At prod scale that's ~300–900 ms of CPU work stalling the reactor and tanking p99 for every other request. Fix: wrap both in `tokio::task::spawn_blocking`.
Smaller items
Test plan
Follow-ups not in this PR
🤖 Generated with Claude Code