Skip to content

fix(prefilter): spawn_blocking handlers + alive AND at substitute#208

Merged
JustMaier merged 1 commit intomainfrom
ivy-prefilter-fixes
Apr 14, 2026
Merged

fix(prefilter): spawn_blocking handlers + alive AND at substitute#208
JustMaier merged 1 commit intomainfrom
ivy-prefilter-fixes

Conversation

@JustMaier
Copy link
Copy Markdown
Contributor

Summary

Review-driven follow-ups to PR #207 (merged as v1.0.218). Two CRITICAL fixes + docs.

Blockers found by review

1. Deleted-doc resurrection (Gemini, CRITICAL)

Engine uses clean-delete — filter bitmaps stay clean, no `alive AND filter` in the query hot path. The cached prefilter bitmap is NOT maintained by clean-delete — it's a frozen snapshot, so deleted docs still have bits set between refreshes. If a query's clauses match the prefilter with no extra constraints, `acc = stale_bitmap` and the deleted doc leaks through.

Fix: `substitute()` now takes `&RoaringBitmap alive` and intersects at substitute time. Cost: one bitmap AND (~20–50 ms at 107 M) — still a large net win vs the 300–900 ms saved. Verified by new integration test `deleted_docs_excluded_from_stale_prefilter` (register prefilter → delete 3/10 docs without refreshing → substituted query returns the remaining 7).

2. Tokio reactor starvation (Gemini, CRITICAL)

`handle_register_prefilter` and `handle_refresh_prefilter` called into `compute_filters` directly on the Axum async worker thread. At prod scale that's ~300–900 ms of CPU work stalling the reactor and tanking p99 for every other request. Fix: wrap both in `tokio::task::spawn_blocking`.

Smaller items

  • Or-children canonicalization (Plan Reviewer, low priority). Pinned the existing behavior — `Or(A,B)` ≠ `Or(B,A)` — with a regression test (`canonicalize_does_not_reorder_or_children`) so a future reordering change has to update it deliberately.
  • Design-doc deviations written up explicitly:
    • Registry placed on `ConcurrentEngine`, not `InnerEngine`
    • Phase 1b flag-gate skipped (empty-registry early-return is the kill switch)
    • Cache/prefilter drift bifurcation (cache-hit vs cache-miss can diverge under drift; Phase 2 fix: include prefilter version in cache key)
    • Delete-drift + alive-intersection rationale

Test plan

  • `cargo test --lib prefilter::` → 17 unit tests green (incl. new dead-slot guard + Or canonicalization backstop)
  • `cargo test --test prefilter_integration` → 3 integration tests green (incl. new `deleted_docs_excluded_from_stale_prefilter`)
  • Aidan: deploy as v1.0.219 over current v1.0.218 — no schema or on-disk changes, pure bug fix; safe to roll out without a data migration.

Follow-ups not in this PR

🤖 Generated with Claude Code

Review findings from Gemini + GPT + Plan Reviewer on PR #207 surfaced two
critical bugs plus a few design-doc deviations worth documenting.

1. Deleted-doc resurrection (Gemini, CRITICAL)

   The engine uses clean-delete: on delete, the doc's bits are cleared from
   every filter/sort bitmap AND from alive, so filter bitmaps stay clean and
   the query hot path has no `alive AND filter`. The cached prefilter bitmap
   is NOT maintained by clean-delete — it's a frozen snapshot. Between
   refreshes, deleted docs still have bits set in the cached prefilter. A
   query whose clauses match the prefilter (no extra clauses to narrow
   against clean filter bitmaps) would have acc = stale bitmap → deleted
   docs leak through.

   Fix: substitute() now takes an `&RoaringBitmap alive` argument and
   intersects the cached prefilter bitmap with alive before wrapping it in a
   BucketBitmap clause. Cost: one bitmap AND (~20-50ms at 107M) per
   substituted query — still a large win vs the 300-900ms saved by eliding
   the original clauses. Verified by:
     - unit test substitute_excludes_dead_slots_from_stale_prefilter
     - integration test deleted_docs_excluded_from_stale_prefilter (registers
       prefilter, deletes 3/10 docs without refresh, confirms substituted
       query returns 7 IDs excluding the deleted slots)

2. Tokio reactor starvation (Gemini, CRITICAL)

   handle_register_prefilter and handle_refresh_prefilter called
   engine.register_prefilter / engine.refresh_prefilter directly on the Axum
   async worker thread. compute_filters over 100M-bit bitmaps is ~300-900ms
   of CPU work — running that on a reactor thread starves the runtime and
   tanks p99 for every other request.

   Fix: wrap both in tokio::task::spawn_blocking. Errors propagate as
   JoinError → 500 Internal Server Error with message. No other behavioral
   change.

3. Or-children canonicalization (Plan Reviewer, low priority)

   Design open question #1: whether Or(A, B) and Or(B, A) should canonicalize
   identically. Current answer is "no" — we don't reorder Or children. Added
   unit test canonicalize_does_not_reorder_or_children as a regression
   backstop: if a future change enables Or sorting, the test must be
   updated deliberately.

4. Documented deviations + Phase 1 hazards in the design doc

   - Registry on ConcurrentEngine, not InnerEngine (reason: avoid snapshot
     publish on registration; clause sets are immutable so no coherence
     issue)
   - Phase 1b flag gate skipped (empty-registry early-return is the kill
     switch)
   - Cache/prefilter drift bifurcation: cache-hit path and cache-miss path
     may return slightly different results under drift; bounded by refresh
     interval; acceptable for Phase 1 per design's explicit "mild drift is
     acceptable"; Phase 2 includes prefilter version in cache key
   - Delete-drift / alive-intersection rationale

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JustMaier JustMaier merged commit 0f1d727 into main Apr 14, 2026
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant