fix: wire up missing Prometheus metrics #102
Merged
lance0 merged 10 commits into lance0:main on Apr 3, 2026
Increment prefixd_events_ingested_total after each ban event is stored. Increment prefixd_events_rejected_total on duplicate events and guardrail rejections. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- MITIGATIONS_CREATED: increment on successful mitigation creation (both via event ingest and manual API)
- MITIGATIONS_WITHDRAWN: increment on detector unban and operator withdrawal, with reason label
- MITIGATIONS_EXPIRED: increment in reconciliation expire loop
- MITIGATIONS_ACTIVE: set as gauge in reconciliation sync, grouped by action_type and pop

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
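The grouping behind the MITIGATIONS_ACTIVE gauge can be sketched roughly as below. This is a minimal, dependency-free illustration and not the code from this pull request: the `Mitigation` struct, its `active` flag, and the `active_counts` helper are hypothetical, and a plain `HashMap` stands in for the labelled gauge.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down mitigation record (assumption: the real
/// prefixd type carries more fields and may model status differently).
struct Mitigation {
    action_type: String,
    pop: String,
    active: bool,
}

/// Count active mitigations per (action_type, pop) pair.
/// Each map value is what a labelled gauge would be set to for that pair.
fn active_counts(mitigations: &[Mitigation]) -> HashMap<(String, String), i64> {
    let mut counts = HashMap::new();
    for m in mitigations.iter().filter(|m| m.active) {
        *counts
            .entry((m.action_type.clone(), m.pop.clone()))
            .or_insert(0) += 1;
    }
    counts
}
```

One thing to watch with this approach: a label pair whose count drops to zero no longer appears in the map, so a real gauge would need to be reset or explicitly set to 0 for pairs seen in earlier cycles.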
Instrument all announce() and withdraw() call sites with counters and latency histograms:
- handle_ban announce
- handle_unban withdraw
- create_mitigation announce
- withdraw_mitigation withdraw
- bulk_withdraw withdraw
- reconciliation expire withdraw
- reconciliation re-announce

Uses "unknown" as the peer label since GoBGP manages peer selection internally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- RECONCILIATION_RUNS: increment with success/error label after each reconciliation cycle (initial and periodic)
- BGP_SESSION_UP: poll session_status() each cycle and set gauge per peer (1=established, 0=down)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
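The two mappings in this commit (cycle result to a success/error label, session state to a 1/0 gauge value) might look something like the following. The `SessionState` enum and both helper names are assumptions made for illustration; the actual return type of session_status() is not shown in this pull request.

```rust
/// Hypothetical session state (assumption: the real value comes from
/// session_status() and may use a different representation).
#[derive(Clone, Copy, PartialEq, Debug)]
enum SessionState {
    Established,
    Idle,
    Active,
}

/// Gauge value per peer: 1 when established, 0 for every other state.
fn session_gauge_value(state: SessionState) -> i64 {
    if state == SessionState::Established {
        1
    } else {
        0
    }
}

/// Label value for the reconciliation-run counter.
fn run_label<T, E>(result: &Result<T, E>) -> &'static str {
    if result.is_ok() {
        "success"
    } else {
        "error"
    }
}
```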
Increment prefixd_guardrail_rejections_total when guardrails reject a mitigation, with the variant name as the reason label (e.g. Safelisted, QuotaExceeded, TtlRequired). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
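As a rough illustration of how a variant name can be pulled out of Debug output for use as the reason label, consider the sketch below. The enum is a stand-in: only the variant names and the `ip` field appear in this pull request, while the `limit` field and the `reason_label` helper are hypothetical.

```rust
/// Stand-in for prefixd's guardrail error (assumption: `limit` is an
/// illustrative field; the real QuotaExceeded fields are not shown).
#[derive(Debug)]
enum GuardrailError {
    TtlRequired,
    Safelisted { ip: String },
    QuotaExceeded { limit: u32 },
}

/// Use the first whitespace-delimited token of the Debug output as the label.
/// Derived Debug prints struct variants as `Name { field: value }`,
/// so the first token is the bare variant name.
fn reason_label(g: &GuardrailError) -> String {
    format!("{:?}", g)
        .split_whitespace()
        .next()
        .unwrap_or("unknown")
        .to_string()
}
```

This works for unit and struct-like variants. A tuple variant would be a caveat: derived Debug prints it as `Name(value)` with no space, so the payload would end up in the label.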
Use .as_str() for String fields passed alongside &str literals in with_label_values() calls. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
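The type mismatch this commit fixes can be reproduced without the metrics crate. In the sketch below, `with_label_values` is a local stand-in that takes `&[&str]` (an assumption; the real method's signature depends on the prometheus crate version), and `record` is a hypothetical call site.

```rust
/// Local stand-in for a labelled-metric lookup taking a slice of &str.
/// It only joins the values so the example stays dependency-free.
fn with_label_values(vals: &[&str]) -> String {
    vals.join(",")
}

/// Hypothetical call site mixing owned String fields with a &str literal.
fn record(action_type: String, pop: String) -> String {
    // `&[action_type, pop, "success"]` would not compile: all array elements
    // must share one type, and String is not &str. Borrowing each String
    // with .as_str() makes every element a &str.
    with_label_values(&[action_type.as_str(), pop.as_str(), "success"])
}
```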
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
prefixd calls GoBGP's AddPath/DeletePath RPCs which operate on the global RIB. GoBGP distributes routes to all peers based on policy, so there is no per-peer dimension at the announce/withdraw layer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Failed announce/withdraw calls should not be counted as announcements. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
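A success-only instrumentation wrapper could be sketched as follows. It is not the code from this pull request: `AnnounceMetrics` and `instrumented` are hypothetical, and a plain integer and a `Vec<Duration>` stand in for the counter and the latency histogram.

```rust
use std::time::{Duration, Instant};

/// Dependency-free stand-ins for a counter and a latency histogram.
#[derive(Default)]
struct AnnounceMetrics {
    announcements_total: u64,
    latencies: Vec<Duration>,
}

/// Run a BGP call and record metrics only when it returns Ok.
/// Failed calls leave both the counter and the latency samples untouched.
fn instrumented<T, E>(
    metrics: &mut AnnounceMetrics,
    call: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    let start = Instant::now();
    let result = call();
    if result.is_ok() {
        metrics.announcements_total += 1;
        metrics.latencies.push(start.elapsed());
    }
    result
}
```

Passing the call as a closure keeps the timing and the success check in one place, at the cost of an extra layer around each call site.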
No labels are needed since the peer label was removed. Using a plain Histogram avoids the awkward empty slice cast at call sites. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bswinnerton commented Apr 3, 2026
Comment on lines +903 to +907
    PrefixdError::GuardrailViolation(g) => format!("{:?}", g)
        .split_whitespace()
        .next()
        .unwrap_or("unknown")
        .to_string(),
Contributor
Author
This is a little clunky but does the trick. If we wanted to take a dependency on strum, we could clean this up to be something like:
    #[derive(strum::AsRefStr)]
    enum GuardrailError {
        TtlRequired,               // .as_ref() => "TtlRequired"
        Safelisted { ip: String }, // .as_ref() => "Safelisted"
        QuotaExceeded { .. },      // .as_ref() => "QuotaExceeded"
    }

    let reason = match &e {
        PrefixdError::GuardrailViolation(g) => g.as_ref(),
        _ => "unknown",
    };
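For comparison, the same stable label can be had without a new dependency by writing the mapping by hand. The sketch below is an alternative to the strum suggestion, not what was merged; the `limit` field and the `as_str` method name are assumptions made for illustration.

```rust
/// Stand-in for prefixd's guardrail error (assumption: `limit` is an
/// illustrative field; the real QuotaExceeded fields are not shown).
enum GuardrailError {
    TtlRequired,
    Safelisted { ip: String },
    QuotaExceeded { limit: u32 },
}

impl GuardrailError {
    /// Hand-written variant-name mapping, independent of Debug formatting.
    fn as_str(&self) -> &'static str {
        match self {
            GuardrailError::TtlRequired => "TtlRequired",
            GuardrailError::Safelisted { .. } => "Safelisted",
            GuardrailError::QuotaExceeded { .. } => "QuotaExceeded",
        }
    }
}
```

The match is exhaustive, so adding a variant later fails to compile until a label is chosen for it. The trade-off against a derive is that the strings have to be kept in sync manually.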
bswinnerton commented Apr 3, 2026
    register_counter_vec!(
        "prefixd_announcements_total",
        "Total number of BGP announcements",
        &["peer", "status"]
Contributor
Author
I couldn't find an easy way to derive the peer, so I opted to remove this from the instrumentation schema, but happy to put it back and add a placeholder like "global".
I also moved over to a Histogram so we don't have to do something like this:
    crate::observability::metrics::ANNOUNCEMENTS_LATENCY
        .with_label_values(&[] as &[&str])
Owner
Thanks for catching this @bswinnerton! These metrics were defined but completely dead -- nice work wiring them all up. The label simplification on announcements_total/latency makes sense too since we don't have peer context at the handler level. Merged.
Description
All 16 prefixd_* metrics referenced in the Grafana dashboards (prefixd-operations.json and prefixd-security.json) were defined and initialized in metrics.rs, but only 5 were instrumented at runtime. This pull request wires up the remaining 11:
- prefixd_events_ingested_total: after ban events are stored
- prefixd_events_rejected_total: on duplicate events and guardrail rejections
- prefixd_mitigations_created_total: on successful mitigation creation (event ingest + manual API)
- prefixd_mitigations_active: gauge updated each reconciliation cycle, grouped by action_type/pop
- prefixd_mitigations_withdrawn_total: on detector unban and operator withdrawal
- prefixd_mitigations_expired_total: in the reconciliation expire loop
- prefixd_announcements_total: on successful announce/withdraw BGP calls
- prefixd_announcements_latency_seconds: histogram around successful announce/withdraw calls
- prefixd_reconciliation_runs_total: after each reconciliation cycle with success/error label
- prefixd_bgp_session_up: polls session_status() each reconciliation cycle
- prefixd_guardrail_rejections_total: with Debug variant name as reason label

Type of Change
Checklist
- cargo fmt and cargo clippy
- cargo test

Note: no new tests added. The codebase has no existing pattern for asserting on Prometheus metric values. The instrumented code paths are already covered by the existing integration test suite.
Testing
- cargo fmt --check: passes
- cargo clippy -- -D warnings: passes
- cargo test: all 298 tests pass