Skip to content

perf(p99): eliminate ForcePublish blocking — lazy load 1-5s→actual disk time#197

Merged
JustMaier merged 1 commit intomainfrom
ivy/eliminate-force-publish-block
Apr 13, 2026
Merged

perf(p99): eliminate ForcePublish blocking — lazy load 1-5s→actual disk time#197
JustMaier merged 1 commit intomainfrom
ivy/eliminate-force-publish-block

Conversation

@JustMaier
Copy link
Copy Markdown
Contributor

Summary

Root cause found: After lazy-loading a filter/sort bitmap from disk, the query thread blocked on done_rx.recv_timeout(5s) waiting for the flush thread to process a ForcePublish command. With 431ms flush cycles, every lazy-load query blocked for the full cycle duration.

This was the hidden serialization point Justin identified — it explains why all trace phases appeared slow (lazy_load, pre_cache, docs, sort). Queries queued behind the flush thread, inflating every metric.

Fix

Apply loaded bitmaps directly to the published snapshot via ArcSwap::store(), then send to flush thread non-blocking (fire-and-forget). Query continues immediately with the updated snapshot.

Before: lazy_load → disk read (10ms) → send ForcePublish → block 431ms → resume
After: lazy_load → disk read (10ms) → direct publish (1ms) → resume immediately

Safety

  • ArcSwap::store is atomic — safe for concurrent publishers
  • Flush thread still receives data via lazy_tx channel on its next cycle
  • No data loss — flush thread applies it redundantly (idempotent)

Evidence

Traces show lazy_load_us at 1-5s when actual disk reads are 10-100ms. The 431ms flush_last_duration_nanos matches the unexplained gap. Queries with pre_cache_us: 1800ms, filter_us: 0, sort_us: 0 — all the time is between function entry and cache lookup, exactly where ForcePublish blocks.

Test plan

  • cargo check --features server passes
  • Deploy, check traces: lazy_load_us should drop from 1-5s to 10-100ms
  • Verify P99 improvement
  • Watch for any bitmap consistency issues (shouldn't happen — ArcSwap is atomic)

🤖 Generated with Claude Code

ROOT CAUSE: After lazy-loading a filter/sort value from disk, the query
thread sent a ForcePublish command to the flush thread and blocked via
done_rx.recv_timeout(5s) waiting for the response. With ~431ms flush
cycles, every lazy-load query blocked for the entire flush duration.

This was the hidden serialization point causing ALL trace phases to
appear slow — lazy_load, pre_cache, docs, sort all inflated because
queries piled up behind the flush thread's cycle.

Fix: Apply loaded bitmaps directly to the published snapshot via
ArcSwap::store(), then send to the flush thread non-blocking
(fire-and-forget via lazy_tx). The query continues immediately with
the updated snapshot instead of waiting for the flush thread.

Safety: ArcSwap::store is atomic. The flush thread will also receive
and apply the data on its next cycle. Two concurrent publishers is
safe — the flush thread's next publish will include all lazy-loaded
data from the lazy_tx channel.

Expected: lazy_load_us should drop from 1-5s to the actual disk read
time (10-100ms). pre_cache_us should drop to near-zero.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JustMaier JustMaier merged commit 03075cd into main Apr 13, 2026
1 check failed
JustMaier added a commit that referenced this pull request Apr 13, 2026
JustMaier added a commit that referenced this pull request Apr 13, 2026
… internal RwLock

Previous attempt (PR #197, reverted) tried to clone + publish the entire
InnerEngine from the query thread, which was O(all fields) and caused a
1.85s flush regression.

New approach: FilterField::load_values() and load_field_complete() take
&self and use internal RwLock — they can be called directly on the
published snapshot's fields without cloning the engine or publishing.
Loaded bitmaps become visible to all readers immediately through the
shared Arc<FilterField>.

Key changes:
- Filter loads (load_field_complete, load_values): apply directly to
  the current snapshot, skip ForcePublish entirely. Fire-and-forget
  send to flush thread for its staging copy.
- Sort loads (load_layers needs &mut self): still use ForcePublish
  round-trip to flush thread.

This is safe because:
- FilterField::bitmaps is RwLock<HashMap> — internal mutation is sound
- The flush thread also applies the same data from lazy_tx (idempotent)
- No engine clone, no ArcSwap::store race

Expected: filter lazy_load_us drops from 1-5s (ForcePublish block) to
10-100ms (actual disk read time). Sort lazy loads still block but are
much rarer than filter loads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JustMaier added a commit that referenced this pull request Apr 13, 2026
… internal RwLock (#198)

Previous attempt (PR #197, reverted) tried to clone + publish the entire
InnerEngine from the query thread, which was O(all fields) and caused a
1.85s flush regression.

New approach: FilterField::load_values() and load_field_complete() take
&self and use internal RwLock — they can be called directly on the
published snapshot's fields without cloning the engine or publishing.
Loaded bitmaps become visible to all readers immediately through the
shared Arc<FilterField>.

Key changes:
- Filter loads (load_field_complete, load_values): apply directly to
  the current snapshot, skip ForcePublish entirely. Fire-and-forget
  send to flush thread for its staging copy.
- Sort loads (load_layers needs &mut self): still use ForcePublish
  round-trip to flush thread.

This is safe because:
- FilterField::bitmaps is RwLock<HashMap> — internal mutation is sound
- The flush thread also applies the same data from lazy_tx (idempotent)
- No engine clone, no ArcSwap::store race

Expected: filter lazy_load_us drops from 1-5s (ForcePublish block) to
10-100ms (actual disk read time). Sort lazy loads still block but are
much rarer than filter loads.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant