feat(metrics): expose prefilter registry state via Prometheus #209
Merged
Wires the 5 metrics from design-prefilter-registry.md plus one extra age gauge into the collect-on-scrape path. Atomics were already tracked on PrefilterEntry from PR #207; this PR adds scrape-side plumbing so Grafana can show registered count, cardinality, substitution rate, last compute duration, refresh errors, and age-since-refresh.

- bitdex_prefilter_registered{index}: gauge, count of registered entries
- bitdex_prefilter_cardinality{index,name}: gauge, current bitmap size
- bitdex_prefilter_substitutions_total{index,name}: counter surfaced as gauge (matches the existing cache_hits_total pattern; the value comes from an AtomicU64 counter on the entry, not from a live Prometheus counter)
- bitdex_prefilter_last_compute_seconds{index,name}: gauge
- bitdex_prefilter_refresh_errors_total{index,name}: gauge
- bitdex_prefilter_age_seconds{index,name}: gauge, alert candidate: if an SWR thread or orchestrator isn't refreshing, age will grow unbounded

Tests: lib 17/17 + integration 3/3 still green. No behavior change to the substitute() hot path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JustMaier added a commit that referenced this pull request on Apr 14, 2026
Closes design Goal 2 ("Stale-while-revalidate — periodic refresh without
blocking queries") and the last Phase-1 gap called out by the post-merge
Plan Review.
A single dedicated thread per engine, spawned from the server's boot
phase 7 (after eager preload + bound cache). Each tick (default 10s):
    for entry in registry.entries():
        if entry.is_stale(now): refresh_prefilter(entry.name)
`refresh_prefilter` recomputes off the query path and atomically swaps
the bitmap via ArcSwap — in-flight queries holding the old Arc continue
reading; next query sees the new one.
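The swap semantics can be illustrated with a minimal std-only sketch. It is not the commit's code: `RwLock<Arc<_>>` is swapped in for ArcSwap to avoid an external crate, and `Bitmap` and `PrefilterSlot` are hypothetical names. The point it shows is that a reader holding the old `Arc` keeps a consistent view while later loads see the new bitmap.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the real bitmap type.
type Bitmap = Vec<u32>;

/// Hypothetical slot holding the current bitmap. The commit uses ArcSwap;
/// RwLock<Arc<_>> is used here only to keep the sketch std-only.
struct PrefilterSlot {
    current: RwLock<Arc<Bitmap>>,
}

impl PrefilterSlot {
    fn new(initial: Bitmap) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Query path: clone the Arc under a short read lock, then read lock-free.
    fn load(&self) -> Arc<Bitmap> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Refresh path: the caller computes `fresh` beforehand, off the query
    /// path, so the write lock is held only for the pointer replacement.
    fn store(&self, fresh: Bitmap) {
        *self.current.write().unwrap() = Arc::new(fresh);
    }
}
```

A query that called `load()` before `store()` continues to read the old bitmap until it drops its `Arc`; the old allocation is freed only after the last such reader finishes.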
Lifecycle safety:
- Holds `Weak<ConcurrentEngine>`, so engine drop breaks the cycle.
- Checks `self.shutdown` between entries (not just per-tick), so the
thread can exit promptly even if it's mid-refresh on a stale entry.
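The loop and its lifecycle rules might look roughly like the sketch below. Everything here is an assumption made for illustration: `Engine` is a heavily reduced stand-in for `ConcurrentEngine`, staleness is a plain flag rather than a time check, and `spawn_swr` is a hypothetical name. It shows the `Weak` upgrade per tick and the shutdown check between entries.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::{Arc, Weak};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical, heavily reduced stand-in for ConcurrentEngine.
struct Engine {
    /// One staleness flag per registered entry (the real check is time-based).
    stale: Vec<AtomicBool>,
    /// Number of refreshes performed, kept only so the sketch is observable.
    refreshed: AtomicU64,
}

impl Engine {
    fn refresh_prefilter(&self, idx: usize) {
        self.stale[idx].store(false, Ordering::Release);
        self.refreshed.fetch_add(1, Ordering::Relaxed);
    }
}

/// Spawn the refresh thread. It holds only a Weak reference between ticks,
/// so dropping the last strong Arc<Engine> lets the thread exit.
fn spawn_swr(weak: Weak<Engine>, shutdown: Arc<AtomicBool>, tick: Duration) -> JoinHandle<()> {
    thread::spawn(move || loop {
        if shutdown.load(Ordering::Acquire) {
            return;
        }
        // Upgrade once per tick; failure means the engine was dropped.
        let Some(engine) = weak.upgrade() else { return };
        for idx in 0..engine.stale.len() {
            // Check shutdown between entries, not just once per tick.
            if shutdown.load(Ordering::Acquire) {
                return;
            }
            if engine.stale[idx].load(Ordering::Acquire) {
                engine.refresh_prefilter(idx);
            }
        }
        // Release the strong reference before sleeping so the sleeping
        // thread does not keep the engine alive.
        drop(engine);
        thread::sleep(tick);
    })
}
```

Dropping the strong reference before the sleep is the detail that makes the `Weak` handle effective: otherwise the thread would pin the engine for a full tick at a time.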
Integration test `swr_thread_refreshes_stale_prefilters` verifies
end-to-end: register 10 docs → insert 5 more → rewind last_refreshed →
cardinality flips to 15 within 5s.
With the age gauge from PR #209, ops can alert on SWR thread liveness:
if bitdex_prefilter_age_seconds exceeds 2 × refresh_interval, something
is wedged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Final Phase-1 prefilter follow-up: wires the design doc's Prometheus metrics (plus one extra age gauge) so Grafana can show substitution rate once `civitai_safe` is registered on v1.0.219.
The atomics were already tracked on `PrefilterEntry` (from PR #207); this is pure scrape-side plumbing. Collect-on-scrape fits the existing pattern in `handle_metrics`.
Metrics added
`substitutions_total` and `refresh_errors_total` are surfaced as gauges read from `AtomicU64` counters on the entry — same pattern as the existing `cache_hits_total` / `cache_misses_total` gauges.
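As a rough illustration of the collect-on-scrape pattern, the sketch below reads `AtomicU64` counters at scrape time and renders them as gauge samples in the Prometheus text exposition format. It is not the code in `handle_metrics`: `EntryStats`, `write_gauge`, and `render_prefilter_metrics` are hypothetical names, only two of the metrics are shown, and label values are assumed not to need escaping.

```rust
use std::fmt::Write;
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical subset of the counters kept on a PrefilterEntry.
struct EntryStats {
    name: String,
    substitutions: AtomicU64,
    refresh_errors: AtomicU64,
}

/// Append a TYPE line and one gauge sample per entry for a single metric.
/// Assumes label values contain no quotes, backslashes, or newlines.
fn write_gauge(
    out: &mut String,
    metric: &str,
    index: &str,
    entries: &[EntryStats],
    read: impl Fn(&EntryStats) -> u64,
) {
    // Writing to a String cannot fail, so the results are ignored.
    let _ = writeln!(out, "# TYPE {metric} gauge");
    for e in entries {
        let _ = writeln!(out, "{metric}{{index=\"{index}\",name=\"{}\"}} {}", e.name, read(e));
    }
}

/// Called on each scrape: values are read from the atomics at that moment,
/// so nothing is registered or updated on the query hot path.
fn render_prefilter_metrics(index: &str, entries: &[EntryStats]) -> String {
    let mut out = String::new();
    write_gauge(&mut out, "bitdex_prefilter_substitutions_total", index, entries, |e| {
        e.substitutions.load(Ordering::Relaxed)
    });
    write_gauge(&mut out, "bitdex_prefilter_refresh_errors_total", index, entries, |e| {
        e.refresh_errors.load(Ordering::Relaxed)
    });
    out
}
```

Because the values only ever increase, queries such as `rate()` still behave as expected on the scraped series even though the declared type is gauge.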
Test plan
Follow-ups remaining
🤖 Generated with Claude Code