Skip to content

fix: wire up missing Prometheus metrics#102

Merged
lance0 merged 10 commits intolance0:mainfrom
bswinnerton:fix/wire-up-prometheus-metrics
Apr 3, 2026
Merged

fix: wire up missing Prometheus metrics#102
lance0 merged 10 commits intolance0:mainfrom
bswinnerton:fix/wire-up-prometheus-metrics

Conversation

@bswinnerton
Copy link
Copy Markdown
Contributor

Description

All 16 prefixd_* metrics referenced in the Grafana dashboards (prefixd-operations.json and prefixd-security.json) were defined and initialized in metrics.rs, but only 5 were instrumented at runtime. This pull request wires up the remaining 11:

  • prefixd_events_ingested_total — after ban events are stored
  • prefixd_events_rejected_total — on duplicate events and guardrail rejections
  • prefixd_mitigations_created_total — on successful mitigation creation (event ingest + manual API)
  • prefixd_mitigations_active — gauge updated each reconciliation cycle, grouped by action_type/pop
  • prefixd_mitigations_withdrawn_total — on detector unban and operator withdrawal
  • prefixd_mitigations_expired_total — in the reconciliation expire loop
  • prefixd_announcements_total — on successful announce/withdraw BGP calls
  • prefixd_announcements_latency_seconds — histogram around successful announce/withdraw calls
  • prefixd_reconciliation_runs_total — after each reconciliation cycle with success/error label
  • prefixd_bgp_session_up — polls session_status() each reconciliation cycle
  • prefixd_guardrail_rejections_total — with Debug variant name as reason label

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Checklist

  • My code follows the project's coding style
  • I have run cargo fmt and cargo clippy
  • I have added tests that prove my fix/feature works
  • All new and existing tests pass (cargo test)
  • I have updated documentation as needed
  • I have updated CHANGELOG.md (for user-facing changes)

Note: no new tests added — the codebase has no existing pattern for asserting on Prometheus metric values. The instrumented code paths are already covered by the existing integration test suite.

Testing

  • cargo fmt --check — passes
  • cargo clippy -- -D warnings — passes
  • cargo test — all 298 tests pass

bswinnerton and others added 10 commits April 3, 2026 00:10
Increment prefixd_events_ingested_total after each ban event is stored.
Increment prefixd_events_rejected_total on duplicate events and
guardrail rejections.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- MITIGATIONS_CREATED: increment on successful mitigation creation
  (both via event ingest and manual API)
- MITIGATIONS_WITHDRAWN: increment on detector unban and operator
  withdrawal, with reason label
- MITIGATIONS_EXPIRED: increment in reconciliation expire loop
- MITIGATIONS_ACTIVE: set as gauge in reconciliation sync, grouped
  by action_type and pop

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instrument all announce() and withdraw() call sites with counters
and latency histograms:
- handle_ban announce
- handle_unban withdraw
- create_mitigation announce
- withdraw_mitigation withdraw
- bulk_withdraw withdraw
- reconciliation expire withdraw
- reconciliation re-announce

Uses "unknown" as the peer label since GoBGP manages peer selection
internally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- RECONCILIATION_RUNS: increment with success/error label after each
  reconciliation cycle (initial and periodic)
- BGP_SESSION_UP: poll session_status() each cycle and set gauge per
  peer (1=established, 0=down)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Increment prefixd_guardrail_rejections_total when guardrails reject
a mitigation, with the variant name as the reason label (e.g.
Safelisted, QuotaExceeded, TtlRequired).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use .as_str() for String fields passed alongside &str literals
in with_label_values() calls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
prefixd calls GoBGP's AddPath/DeletePath RPCs which operate on the
global RIB. GoBGP distributes routes to all peers based on policy,
so there is no per-peer dimension at the announce/withdraw layer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Failed announce/withdraw calls should not be counted as
announcements.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No labels are needed since the peer label was removed. Using a plain
Histogram avoids the awkward empty slice cast at call sites.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/api/handlers.rs
Comment on lines +903 to +907
PrefixdError::GuardrailViolation(g) => format!("{:?}", g)
.split_whitespace()
.next()
.unwrap_or("unknown")
.to_string(),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little clunky but does the trick. If we wanted to take a dependency on strum, we could clean this up to be something like:

#[derive(strum::AsRefStr)]
enum GuardrailError {
    TtlRequired,           // .as_ref() => "TtlRequired"
    Safelisted { ip: String }, // .as_ref() => "Safelisted"
    QuotaExceeded { .. },  // .as_ref() => "QuotaExceeded"
}
let reason = match &e {
    PrefixdError::GuardrailViolation(g) => g.as_ref(),
    _ => "unknown",
};

register_counter_vec!(
"prefixd_announcements_total",
"Total number of BGP announcements",
&["peer", "status"]
Copy link
Copy Markdown
Contributor Author

@bswinnerton bswinnerton Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find an easy way to derive the peer, so I opted to remove this from the instrumentation schema, but happy to put it back and add a placeholder like "global".

I also moved over to a Histogram so we don't have to do something like this:

crate::observability::metrics::ANNOUNCEMENTS_LATENCY
  .with_label_values(&[] as &[&str])

@lance0 lance0 merged commit ce989ae into lance0:main Apr 3, 2026
4 of 5 checks passed
@lance0
Copy link
Copy Markdown
Owner

lance0 commented Apr 3, 2026

Thanks for catching this @bswinnerton! These metrics were defined but completely dead -- nice work wiring them all up. The label simplification on announcements_total/latency makes sense too since we don't have peer context at the handler level. Merged.

@bswinnerton bswinnerton deleted the fix/wire-up-prometheus-metrics branch April 3, 2026 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants