feat(metrics): add Prometheus metrics expansion with DNS-specific buckets #142
Conversation
📝 Walkthrough
This PR adds comprehensive Prometheus metrics instrumentation to the DNS server: custom histogram buckets, derived cache-hit ratio, DNSSEC key statistics collection, rate-limit counters, recursive resolution outcome metrics, cache/AXFR/NOTIFY instrumentation, startup zone/record polling, and related tests.
Changes: Metrics Instrumentation Overhaul
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
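The walkthrough above and the commit messages further down mention replacing Prometheus' HTTP-oriented default buckets with DNS-appropriate ones spanning roughly 50µs–10s. The PR's exact bucket boundaries and label set are not reproduced in this review, so the following Go sketch only illustrates the pattern:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// dnsDurationBuckets is an illustrative spread from 50µs to 10s; the actual
// values chosen in the PR may differ.
var dnsDurationBuckets = []float64{
	0.00005, 0.0001, 0.00025, 0.0005, 0.001, 0.0025, 0.005,
	0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10,
}

// QueryDuration tracks DNS query handling latency with DNS-specific buckets.
// The label names here ("qtype", "result") are assumptions, not necessarily
// the labels used in the PR.
var QueryDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "clouddns_query_duration_seconds",
		Help:    "DNS query handling duration in seconds.",
		Buckets: dnsDurationBuckets,
	},
	[]string{"qtype", "result"},
)

func init() {
	prometheus.MustRegister(QueryDuration)
}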
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/dns/server/server.go (1)
957-989: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
CacheHitRatio is heavily skewed by multi-level miss double-counting.
On the request path, a single query that misses L1+L2 and falls through to the DB produces three RecordCacheMiss() calls (lines 965, 988, 1089) versus zero hits. An L2 hit produces one hit and one miss for the same query. Net result: with even a healthy ~90% L1 hit rate, the published clouddns_cache_hit_ratio will look much worse than reality, and the metric becomes hard to interpret.
Pick one canonical layer to feed the ratio (typically L1 — the help text already says "L1 cache hit ratio") and only call RecordCacheHit/Miss once per query at that layer. The L2/DB miss recording should be removed (the existing CacheOperations CounterVec already exposes per-level hit/miss counts for finer-grained dashboards).
♻️ Suggested change
@@ line ~985
-	} else if s.Redis != nil {
-		// Redis was checked but key not found = L2 miss
-		metrics.CacheOperations.WithLabelValues("l2", "miss").Inc()
-		metrics.RecordCacheMiss()
+	} else if s.Redis != nil {
+		// Redis was checked but key not found = L2 miss (recorded only on the
+		// CounterVec; the L1 RecordCacheMiss above already covers the ratio).
+		metrics.CacheOperations.WithLabelValues("l2", "miss").Inc()
 	}
@@ line ~1089
-	metrics.RecordCacheMiss() // DB lookup is a cache miss for ratio purposes
+	// Note: the L1 miss above already recorded this query as a miss for the
+	// cache hit ratio; do not double-count here.
Also applies to: 1089-1089
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/dns/server/server.go` around lines 957 - 989, The cache hit/miss accounting double-counts across L1/L2/DB; only use RecordCacheHit/RecordCacheMiss once per request at the canonical L1 decision point. Keep metrics.CacheOperations.WithLabelValues(...) per-layer counters, but remove the extra metrics.RecordCacheHit() / metrics.RecordCacheMiss() calls in the L2 hit/miss branch (the block using s.Redis.GetWithTTL and its else branch) and in the downstream DB miss path so that RecordCacheHit is only invoked in the L1-hit branch and RecordCacheMiss only in the L1-miss branch; leave per-layer CacheOperations increments in place for detailed dashboards. Ensure the L1 branch still calls QueryDuration/QueriesTotal as before and add a short comment near s.Redis.GetWithTTL explaining that L2/DB no longer touch the global hit/miss aggregate.
🧹 Nitpick comments (5)
internal/dns/server/server.go (1)
1322-1323: ⚡ Quick win
NotifiesTotal always uses "accepted" — the result label is effectively dead.
The metric is incremented before any of the actual decisions (zone exists, role is slave, refresh succeeds) happen, and it's hard-coded to "accepted". Anyone querying clouddns_notifies_total{result="rejected"} (or ignored/refresh_failed) will see zero forever, and a malformed/unauthorized NOTIFY isn't distinguishable from a real one.
Either move the increment after the relevant branch is decided and emit the actual outcome (accepted, ignored_not_slave, unknown_zone, refresh_failed), or drop the result label entirely if you never plan to populate it.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/dns/server/server.go` around lines 1322 - 1323, The NotifiesTotal metric is incremented prematurely with a hard-coded "accepted" result; update the logic so metrics.NotifiesTotal.WithLabelValues(request.Questions[0].Name, <result>).Inc() is called after you determine the actual outcome (e.g., "accepted" when zone exists and role == slave and refresh succeeded; "unknown_zone" when zone not found; "ignored_not_slave" when not a slave; "refresh_failed" when Notify refresh fails), or if you don't intend to record different outcomes remove the result label entirely; modify code around s.Logger.Info(...) and the NOTIFY handling branches to emit the correct label value using the existing metrics.NotifiesTotal.WithLabelValues call.
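For illustration, a minimal Go sketch of the outcome-first approach suggested above. The helper name and its parameters are hypothetical; only metrics.NotifiesTotal.WithLabelValues(...) and the label values come from the review, and the real handleNotify derives the inputs from its own state:

// notifyOutcome maps the NOTIFY handling decision to the result label value.
// zoneFound, isSlave and refreshErr are hypothetical inputs for the sketch.
func notifyOutcome(zoneFound, isSlave bool, refreshErr error) string {
	switch {
	case !zoneFound:
		return "unknown_zone"
	case !isSlave:
		return "ignored_not_slave"
	case refreshErr != nil:
		return "refresh_failed"
	default:
		return "accepted"
	}
}

// In handleNotify, record the metric only once the outcome is known:
//   metrics.NotifiesTotal.WithLabelValues(request.Questions[0].Name,
//       notifyOutcome(zoneFound, isSlave, refreshErr)).Inc()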
internal/infrastructure/metrics/metrics_test.go (1)
91-109: ⚡ Quick win
Background ticker can race with the test body and leave the gauge in a surprising state.
NewDerivedMetricCollector(50 * time.Millisecond) starts a real goroutine that fires compute() every 50ms while the test is also calling compute() and mutating CacheHitCount/CacheMissCount. Because the four atomic.StoreUint64 calls on lines 95–100 are not atomic as a group, the goroutine can observe a transient (0, 20) or (80, 0) state and write the corresponding ratio to CacheHitRatio between the test setup and compute() call. Also, collector.Stop() runs after the restore, so on a test failure (t.Fatal/t.Errorf later) the goroutine outlives the test.
Two cheap improvements:
- Use a long interval (e.g. time.Hour) so the test deterministically only triggers compute() itself, like TestDerivedMetricCollector_Stop already does.
- defer collector.Stop() immediately after construction.
♻️ Suggested change
-	collector := NewDerivedMetricCollector(50 * time.Millisecond)
-	collector.compute()
-
-	// Restore
-	atomic.StoreUint64(&CacheHitCount, origHit)
-	atomic.StoreUint64(&CacheMissCount, origMiss)
-	collector.Stop()
+	collector := NewDerivedMetricCollector(time.Hour)
+	defer collector.Stop()
+	collector.compute()
+
+	// Restore
+	atomic.StoreUint64(&CacheHitCount, origHit)
+	atomic.StoreUint64(&CacheMissCount, origMiss)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/infrastructure/metrics/metrics_test.go` around lines 91 - 109, The test races with the collector's background ticker because NewDerivedMetricCollector(50*time.Millisecond) starts a goroutine; change the constructor call to use a very long interval (e.g. time.Hour) so the test only exercises compute() explicitly, and call defer collector.Stop() immediately after creating the collector to ensure the goroutine is stopped even on test failures; adjust references to NewDerivedMetricCollector, compute, CacheHitCount, CacheMissCount, CacheHitRatio, and Stop accordingly.
internal/dns/server/recursive.go (1)
162-198: ⚡ Quick win
RecursiveResolutionsTotal is missing from several terminal paths.
The metric is only incremented in three places. Other outcome paths return without recording anything, so the counter undercounts and label distributions become misleading:
- Line 163 — CAA/SRV NOERROR with zero answers (returns success-equivalent).
- Line 190 — fallback resolver returned a usable response (success or NXDOMAIN).
- Line 175 — pre-fallback timeout (only the in-loop and per-fallback timeouts are counted).
- Line 197 — final failure after roots and fallbacks are exhausted.
Suggest adding label values like "success" for the fallback success branch (using resp.Header.ResCode to differentiate nxdomain vs success), "timeout" for line 175, and a new "failure" (or "servfail") label at line 197 so the metric remains a complete classification of outcomes.
♻️ Example for the fallback success path
 if err == nil && (resp.Header.ResCode == 0 || resp.Header.ResCode == 3) {
 	s.Logger.Info("fallback resolution successful", "name", name, "fallback", fallback)
+	result := "success"
+	if resp.Header.ResCode == 3 {
+		result = "nxdomain"
+	}
+	metrics.RecursiveResolutionsTotal.WithLabelValues(result).Inc()
 	return resp, nil
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/dns/server/recursive.go` around lines 162 - 198, The RecursiveResolutionsTotal metric is not incremented on several terminal paths; update the control flow in the recursive resolution function to increment metrics. Specifically: increment metrics.RecursiveResolutionsTotal.WithLabelValues("success") (or "nxdomain"/"success" based on resp.Header.ResCode) before returning in the CAA/SRV no-answer early-return path and in the fallback-success branch (use resp.Header.ResCode to select "nxdomain" vs "success"); increment metrics.RecursiveResolutionsTotal.WithLabelValues("timeout") before the pre-fallback timeout return (the branch that returns errRecursiveTimeout when time.Since(resolveStart) >= recursiveTimeout); and increment metrics.RecursiveResolutionsTotal.WithLabelValues("failure") (or "servfail") immediately before the final fmt.Errorf return that wraps lastErr, ensuring each terminal return path uses metrics.RecursiveResolutionsTotal so outcomes are fully classified.
internal/infrastructure/metrics/metrics.go (1)
112-119: 🏗️ Heavy lift
CacheHitRatio is a lifetime ratio, not a recent-behavior ratio.
hits and misses are atomic counters that only ever increase since process start, so clouddns_cache_hit_ratio is the cumulative ratio over the lifetime of the process. After a few days of healthy traffic the ratio is dominated by historical data and a sudden cache-warmup event or invalidation storm will barely move the gauge — exactly the situations operators want to alert on.
Two common alternatives:
- Drop the gauge and let Prometheus compute it server-side via rate(clouddns_cache_operations_total{result="hit"}[5m]) / rate(clouddns_cache_operations_total[5m]) (since CacheOperations is already a CounterVec, this is essentially free and exposes any window operators want).
- If a server-computed gauge is required, snapshot the previous tick's (hits, misses) and compute the per-interval ratio.
The first option also lets you delete CacheHitCount/CacheMissCount/RecordCacheHit/RecordCacheMiss and the entire DerivedMetricCollector, simplifying the surface area considerably.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/infrastructure/metrics/metrics.go` around lines 112 - 119, The current CacheHitRatio gauge is computed from lifetime counters (CacheHitCount/CacheMissCount) in DerivedMetricCollector, producing a cumulative ratio; either remove this gauge and rely on Prometheus-side computation using the existing CacheOperations CounterVec (operators can use rate(...) over any window), deleting CacheHitCount/CacheMissCount/RecordCacheHit/RecordCacheMiss and DerivedMetricCollector, or change DerivedMetricCollector.compute to calculate the ratio from interval deltas by storing previous snapshot fields (e.g., prevHits, prevMisses) on DerivedMetricCollector, computing hitsDelta = hits - prevHits and missesDelta = misses - prevMisses, and setting CacheHitRatio to hitsDelta/(hitsDelta+missesDelta) only when the delta total > 0, then update the snapshots.
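A rough Go sketch of the second alternative (per-interval ratio). It reuses the names that appear in this review at this point (DerivedMetricCollector, CacheHitCount, CacheMissCount, CacheHitRatio), but the prevHits/prevMisses fields and the body below are illustrative, not the PR's code:

// compute publishes the ratio observed since the previous tick instead of the
// lifetime ratio. prevHits/prevMisses would be new fields on
// DerivedMetricCollector, touched only from the collector goroutine.
// Assumes the surrounding metrics package already imports sync/atomic.
func (c *DerivedMetricCollector) compute() {
	hits := atomic.LoadUint64(&CacheHitCount)
	misses := atomic.LoadUint64(&CacheMissCount)

	hitsDelta := hits - c.prevHits
	missesDelta := misses - c.prevMisses
	c.prevHits, c.prevMisses = hits, misses

	if total := hitsDelta + missesDelta; total > 0 {
		CacheHitRatio.Set(float64(hitsDelta) / float64(total))
	}
	// With no traffic during the interval, the previous gauge value is kept.
}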
internal/core/services/dnssec_service.go (1)
215-242: 💤 Low value
Silent error swallowing on per-zone key listing.
When ListKeysForZone fails for a zone, the error is dropped without any log entry. For a metrics collector this is acceptable in the sense that it won't break the loop, but it makes diagnosis impossible when DNSSEC metrics start showing missing zones. Consider at least a debug/warn log so operators can correlate gaps in the metric with repo errors.
♻️ Suggested change
 	keys, err := s.repo.ListKeysForZone(ctx, zone.ID)
 	if err != nil {
+		// Skip this zone but record the failure for observability
+		// (caller is the metrics collector; one zone failing shouldn't fail the whole sweep)
 		continue
 	}
If a logger is plumbed into DNSSECService later, swap the comment for a Warn call.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/core/services/dnssec_service.go` around lines 215 - 242, In CollectKeyStats, ListKeysForZone errors are silently ignored; update the loop in DNSSECService.CollectKeyStats to log a warning containing the zone identifier (zone.ID and/or zone.Name) and the error when s.repo.ListKeysForZone(ctx, zone.ID) returns an error, then continue the loop so metrics collection isn’t aborted; if a logger field (e.g., s.logger or s.log) is available use its Warn/Warning method, otherwise add a TODO comment that records the zone and error for later plumbing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/dns/server/server.go`:
- Around line 254-271: updateDNSSECMetrics is wrong: it increments signedZones
per key, hides multi-key rollover counts, and overwrites ages. Fix by (1)
tracking unique zones with a map[string]struct{} (e.g., seenZones) and after the
loop set metrics.DNSSECZonesSigned to len(seenZones); (2) after
metrics.DNSSECKeysTotal.Reset() call
metrics.DNSSECKeysTotal.WithLabelValues(st.ZoneName, st.KeyType,
fmt.Sprintf("%d", st.Algorithm)).Inc() for each stat instead of Set(1) so
duplicate label tuples accumulate counts; and (3) avoid overwriting ages by
either adding a unique key identifier label (e.g., st.KeyTag or st.KeyID) to
metrics.DNSSECKeysAgeSeconds.WithLabelValues(...) before Set(st.AgeSeconds) or
by pre-aggregating (choose the oldest/desired age per (ZoneName,KeyType) in a
map and call Set once per group). Apply these changes inside updateDNSSECMetrics
using the existing stats elements returned by s.DNSSEC.CollectKeyStats.
In `@internal/infrastructure/metrics/metrics.go`:
- Around line 247-251: The ZoneRecordRepo interface is unimplementable and
RecordsTotal is never set: change the interface signatures from []interface{} to
the concrete domain slice types used by the real repo (e.g., ListZones(ctx
context.Context, tenantID string) ([]domain.Zone, error) and
ListRecordsForZone(ctx context.Context, zoneID, tenantID string)
([]domain.Record, error)) so the real ports.DNSRepository can be passed into
NewZoneRecordCounter, and then update the collect() method in the
ZoneRecordCounter implementation to call ListRecordsForZone for each zone and
set the RecordsTotal gauge appropriately (in addition to ZonesTotal) so the
metric reflects actual record counts.
- Around line 88-96: The DerivedMetricCollector and ZoneRecordCounter are never
instantiated in production; instantiate them during server startup by calling
NewDerivedMetricCollector(...) (pass a sensible interval) and
NewZoneRecordCounter() and keep their returned pointers, and ensure you call
their Stop() methods during server shutdown/cleanup (e.g., in your main
startup/shutdown sequence or signal handler) so gauges like CacheHitRatio,
ZonesTotal, and RecordsTotal are populated and collectors are cleaned up.
---
Outside diff comments:
In `@internal/dns/server/server.go`:
- Around line 957-989: The cache hit/miss accounting double-counts across
L1/L2/DB; only use RecordCacheHit/RecordCacheMiss once per request at the
canonical L1 decision point. Keep metrics.CacheOperations.WithLabelValues(...)
per-layer counters, but remove the extra metrics.RecordCacheHit() /
metrics.RecordCacheMiss() calls in the L2 hit/miss branch (the block using
s.Redis.GetWithTTL and its else branch) and in the downstream DB miss path so
that RecordCacheHit is only invoked in the L1-hit branch and RecordCacheMiss
only in the L1-miss branch; leave per-layer CacheOperations increments in place
for detailed dashboards. Ensure the L1 branch still calls
QueryDuration/QueriesTotal as before and add a short comment near
s.Redis.GetWithTTL explaining that L2/DB no longer touch the global hit/miss
aggregate.
---
Nitpick comments:
In `@internal/core/services/dnssec_service.go`:
- Around line 215-242: In CollectKeyStats, ListKeysForZone errors are silently
ignored; update the loop in DNSSECService.CollectKeyStats to log a warning
containing the zone identifier (zone.ID and/or zone.Name) and the error when
s.repo.ListKeysForZone(ctx, zone.ID) returns an error, then continue the loop so
metrics collection isn’t aborted; if a logger field (e.g., s.logger or s.log) is
available use its Warn/Warning method, otherwise add a TODO comment that records
the zone and error for later plumbing.
In `@internal/dns/server/recursive.go`:
- Around line 162-198: The RecursiveResolutionsTotal metric is not incremented
on several terminal paths; update the control flow in the recursive resolution
function to increment metrics. Specifically: increment
metrics.RecursiveResolutionsTotal.WithLabelValues("success") (or
"nxdomain"/"success" based on resp.Header.ResCode) before returning in the
CAA/SRV no-answer early-return path and in the fallback-success branch (use
resp.Header.ResCode to select "nxdomain" vs "success"); increment
metrics.RecursiveResolutionsTotal.WithLabelValues("timeout") before the
pre-fallback timeout return (the branch that returns errRecursiveTimeout when
time.Since(resolveStart) >= recursiveTimeout); and increment
metrics.RecursiveResolutionsTotal.WithLabelValues("failure") (or "servfail")
immediately before the final fmt.Errorf return that wraps lastErr, ensuring each
terminal return path uses metrics.RecursiveResolutionsTotal so outcomes are
fully classified.
In `@internal/dns/server/server.go`:
- Around line 1322-1323: The NotifiesTotal metric is incremented prematurely
with a hard-coded "accepted" result; update the logic so
metrics.NotifiesTotal.WithLabelValues(request.Questions[0].Name, <result>).Inc()
is called after you determine the actual outcome (e.g., "accepted" when zone
exists and role == slave and refresh succeeded; "unknown_zone" when zone not
found; "ignored_not_slave" when not a slave; "refresh_failed" when Notify
refresh fails), or if you don't intend to record different outcomes remove the
result label entirely; modify code around s.Logger.Info(...) and the NOTIFY
handling branches to emit the correct label value using the existing
metrics.NotifiesTotal.WithLabelValues call.
In `@internal/infrastructure/metrics/metrics_test.go`:
- Around line 91-109: The test races with the collector's background ticker
because NewDerivedMetricCollector(50*time.Millisecond) starts a goroutine;
change the constructor call to use a very long interval (e.g. time.Hour) so the
test only exercises compute() explicitly, and call defer collector.Stop()
immediately after creating the collector to ensure the goroutine is stopped even
on test failures; adjust references to NewDerivedMetricCollector, compute,
CacheHitCount, CacheMissCount, CacheHitRatio, and Stop accordingly.
In `@internal/infrastructure/metrics/metrics.go`:
- Around line 112-119: The current CacheHitRatio gauge is computed from lifetime
counters (CacheHitCount/CacheMissCount) in DerivedMetricCollector, producing a
cumulative ratio; either remove this gauge and rely on Prometheus-side
computation using the existing CacheOperations CounterVec (operators can use
rate(...) over any window), deleting
CacheHitCount/CacheMissCount/RecordCacheHit/RecordCacheMiss and
DerivedMetricCollector, or change DerivedMetricCollector.compute to calculate
the ratio from interval deltas by storing previous snapshot fields (e.g.,
prevHits, prevMisses) on DerivedMetricCollector, computing hitsDelta = hits -
prevHits and missesDelta = misses - prevMisses, and setting CacheHitRatio to
hitsDelta/(hitsDelta+missesDelta) only when the delta total > 0, then update the
snapshots.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 192c47bc-4673-42c2-959e-c459aaaa6ba3
📒 Files selected for processing (6)
- internal/core/services/dnssec_service.go
- internal/dns/server/ratelimit.go
- internal/dns/server/recursive.go
- internal/dns/server/server.go
- internal/infrastructure/metrics/metrics.go
- internal/infrastructure/metrics/metrics_test.go
func (s *Server) updateDNSSECMetrics(ctx context.Context) {
	if s.DNSSEC == nil {
		return
	}
	stats, err := s.DNSSEC.CollectKeyStats(ctx)
	if err != nil {
		return
	}
	metrics.DNSSECKeysTotal.Reset()
	metrics.DNSSECKeysAgeSeconds.Reset()
	signedZones := 0
	for _, st := range stats {
		metrics.DNSSECKeysTotal.WithLabelValues(st.ZoneName, st.KeyType, fmt.Sprintf("%d", st.Algorithm)).Set(1)
		metrics.DNSSECKeysAgeSeconds.WithLabelValues(st.ZoneName, st.KeyType).Set(st.AgeSeconds)
		signedZones++
	}
	metrics.DNSSECZonesSigned.Set(float64(signedZones))
}
updateDNSSECMetrics has three correctness bugs.
1. DNSSECZonesSigned counts keys, not zones. signedZones++ runs once per key, so a zone with both KSK and ZSK (or multiple algorithms / mid-rollover) inflates the gauge. Track unique zone names (or zone IDs) in a map[string]struct{} and report len(set).
2. DNSSECKeysTotal.Set(1) cannot represent rollover. During rollover a zone legitimately has two active keys of the same (zone, key_type, algorithm) tuple, but Set(1) keeps the gauge at 1 and the second key disappears from the metric. Use Inc() after Reset() so each label combination accumulates the actual count.
3. DNSSECKeysAgeSeconds.Set overwrites siblings. With overlap rollover you'll have two keys sharing (zone, key_type); the second Set discards the first key's age. Either expand the label set with the key tag/ID, or aggregate (e.g. report only the oldest active key's age).
♻️ Proposed fix
metrics.DNSSECKeysTotal.Reset()
metrics.DNSSECKeysAgeSeconds.Reset()
- signedZones := 0
+ zoneSet := make(map[string]struct{}, len(stats))
+ oldestAge := make(map[string]float64) // key: zone|key_type
for _, st := range stats {
- metrics.DNSSECKeysTotal.WithLabelValues(st.ZoneName, st.KeyType, fmt.Sprintf("%d", st.Algorithm)).Set(1)
- metrics.DNSSECKeysAgeSeconds.WithLabelValues(st.ZoneName, st.KeyType).Set(st.AgeSeconds)
- signedZones++
+ metrics.DNSSECKeysTotal.WithLabelValues(st.ZoneName, st.KeyType, fmt.Sprintf("%d", st.Algorithm)).Inc()
+ k := st.ZoneName + "|" + st.KeyType
+ if cur, ok := oldestAge[k]; !ok || st.AgeSeconds > cur {
+ oldestAge[k] = st.AgeSeconds
+ }
+ zoneSet[st.ZoneID] = struct{}{}
}
- metrics.DNSSECZonesSigned.Set(float64(signedZones))
+ for k, age := range oldestAge {
+ parts := strings.SplitN(k, "|", 2)
+ metrics.DNSSECKeysAgeSeconds.WithLabelValues(parts[0], parts[1]).Set(age)
+ }
+ metrics.DNSSECZonesSigned.Set(float64(len(zoneSet)))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/dns/server/server.go` around lines 254 - 271, updateDNSSECMetrics is
wrong: it increments signedZones per key, hides multi-key rollover counts, and
overwrites ages. Fix by (1) tracking unique zones with a map[string]struct{}
(e.g., seenZones) and after the loop set metrics.DNSSECZonesSigned to
len(seenZones); (2) after metrics.DNSSECKeysTotal.Reset() call
metrics.DNSSECKeysTotal.WithLabelValues(st.ZoneName, st.KeyType,
fmt.Sprintf("%d", st.Algorithm)).Inc() for each stat instead of Set(1) so
duplicate label tuples accumulate counts; and (3) avoid overwriting ages by
either adding a unique key identifier label (e.g., st.KeyTag or st.KeyID) to
metrics.DNSSECKeysAgeSeconds.WithLabelValues(...) before Set(st.AgeSeconds) or
by pre-aggregating (choose the oldest/desired age per (ZoneName,KeyType) in a
map and call Set once per group). Apply these changes inside updateDNSSECMetrics
using the existing stats elements returned by s.DNSSEC.CollectKeyStats.
… collector
- Replace HTTP-oriented DefBuckets with DNS-appropriate buckets (50µs-10s)
- Add CacheHitRatio gauge with atomic hit/miss counters for ratio computation
- Add DNSSEC key metrics: DNSSECKeysTotal, DNSSECKeysAgeSeconds, DNSSECZonesSigned
- Add NotifiesTotal, RateLimitedTotal, AXFRBytesTotal, RecursiveResolutionsTotal counters
- Add ZonesTotal and RecordsTotal gauges
- Add DerivedMetricCollector background goroutine for periodic ratio computation
- Add ZoneRecordCounter for periodic zone/record count updates
- Add comprehensive tests for all new metrics and bucket validation
- Add rateLimited atomic counter to rateLimiter struct
- Increment RateLimitedTotal metric when Allow() returns false
- Add RateLimited() accessor method for testing
- Add CollectKeyStats() to DNSSECService for key age tracking
- Add KeyStats struct with ZoneID, ZoneName, KeyType, Algorithm, AgeSeconds
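As a rough illustration of the rate-limit accounting this commit describes (the PR's actual rateLimiter internals, the Allow() signature, and the module import path are not shown in this review, so the details below are assumptions):

package server

import (
	"sync/atomic"

	"golang.org/x/time/rate"

	// Module path assumed from the repository name; the real import may differ.
	"github.com/poyrazK/cloudDNS/internal/infrastructure/metrics"
)

// rateLimiter wraps a token bucket and counts rejected queries.
type rateLimiter struct {
	limiter     *rate.Limiter
	rateLimited atomic.Uint64
}

// Allow reports whether a query may proceed; rejections are counted both
// locally and on the clouddns_rate_limited_total counter.
func (r *rateLimiter) Allow() bool {
	if r.limiter.Allow() {
		return true
	}
	r.rateLimited.Add(1)
	metrics.RateLimitedTotal.Inc()
	return false
}

// RateLimited exposes the rejection count for tests.
func (r *rateLimiter) RateLimited() uint64 { return r.rateLimited.Load() }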
…on metrics
- Track NOTIFY messages with zone and result labels in handleNotify()
- Track AXFR bytes transferred in sendAXFRRecord()
- Add RecordCacheHit/RecordCacheMiss atomics for L1/L2 cache operations
- Add RecursiveResolutionsTotal counter with success/nxdomain/timeout labels
- Add updateDNSSECMetrics() to populate DNSSEC key age gauges
- Call updateDNSSECMetrics() after automateDNSSEC() completes
…Stats, add test
- Fix ZoneRecordRepo interface to use []domain.Zone and []domain.Record
- Wire ZoneRecordCounter into main.go with 30s interval
- Unexport KeyStats to keyStats (internal only)
- Add TestRateLimiter_RateLimited for RateLimited() accessor
- Add trailing newline to metrics_test.go
- Add log/slog logger to DNSSECService for error visibility
- Log debug message when ListKeysForZone fails in CollectKeyStats
- Log debug message when CollectKeyStats fails in updateDNSSECMetrics
- Implement RecordsTotal: loop all zones and count records per zone
- Add trailing newline to metrics.go
5335ff5 to b884635
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/dns/server/recursive.go (1)
176-178: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Two metric increments missing from the fallback resolution paths.
Pre-fallback timeout (lines 176–178): This early-return timeout path does not increment the "timeout" label, silently dropping these events from clouddns_recursive_resolutions_total.
Fallback resolution outcomes (lines 191–193): When a fallback resolver succeeds with ResCode==0 or ResCode==3, neither "success" nor "nxdomain" is incremented. All fallback-sourced resolutions are invisible in the counter.
🐛 Proposed fix
 if time.Since(resolveStart) >= recursiveTimeout {
 	s.Logger.Warn("recursive resolution timed out before fallback", "name", name)
+	metrics.RecursiveResolutionsTotal.WithLabelValues("timeout").Inc()
 	return nil, errors.New(errRecursiveTimeout)
 }

 if err == nil && (resp.Header.ResCode == 0 || resp.Header.ResCode == 3) {
 	s.Logger.Info("fallback resolution successful", "name", name, "fallback", fallback)
+	if resp.Header.ResCode == 3 {
+		metrics.RecursiveResolutionsTotal.WithLabelValues("nxdomain").Inc()
+	} else {
+		metrics.RecursiveResolutionsTotal.WithLabelValues("success").Inc()
+	}
 	return resp, nil
 }
Also applies to: 191-193
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/dns/server/recursive.go` around lines 176 - 178, The early-return timeout branch using resolveStart/recursiveTimeout (where s.Logger.Warn is called and errRecursiveTimeout is returned) must increment clouddns_recursive_resolutions_total with label "timeout" before returning; likewise, when handling fallback resolver outcomes (check ResCode on the fallback result for ResCode==0 and ResCode==3) increment the same metric with labels "success" and "nxdomain" respectively so fallback-sourced resolutions are counted; add these metric.Inc calls immediately prior to each return in the timeout branch and the fallback success/nxdomain branches, referencing the existing metric name clouddns_recursive_resolutions_total and the ResCode checks already present in the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/clouddns/main.go`:
- Around line 182-192: The lint failure is due to Start() and its internal
collect method not accepting a context; update the ZoneRecordCounter
implementation in metrics.go so that collect(ctx context.Context) takes a
context and Start(ctx context.Context) accepts and passes that context into
collect (and any spawned goroutines), and update NewZoneRecordCounter usage
sites (e.g., where counter := metrics.NewZoneRecordCounter(repo, interval)) to
call counter.Start(runCtx) so lifecycle/cancellation is driven by the provided
runCtx; ensure Stop still cancels/stops appropriately and adjust any helper
methods that call collect to accept/forward the context.
In `@internal/core/services/dnssec_service.go`:
- Around line 207-245: The exported method CollectKeyStats returns an unexported
type keyStats causing a lint error; rename the struct keyStats to KeyStats (and
update its fields if any need exporting) and update CollectKeyStats's return
signature and all usages to return []KeyStats instead of []keyStats (search for
keyStats, CollectKeyStats, and any callers that declare or type-assert the slice
and update them accordingly).
In `@internal/infrastructure/metrics/metrics.go`:
- Around line 265-282: ZoneRecordCounter.Start currently spawns a goroutine that
calls collect which uses context.Background(), causing un-cancellable DB
queries; change the signature to Start(ctx context.Context) and collect(ctx
context.Context), update the goroutine in ZoneRecordCounter.Start to pass the
provided ctx into collect and to select on ctx.Done() in addition to c.stopCh
and ticker.C, and update any internal calls to collect to stop using
context.Background() and instead use the propagated ctx so shutdown cancels DB
queries; ensure callers (e.g., in cmd/clouddns/main.go) call Start with their
context and adjust any tests or references to the new signatures.
- Around line 291-307: The collect method in ZoneRecordCounter currently sets
RecordsTotal even when c.repo.ListZones returns an error, which zeroes the gauge
on transient failures; update ZoneRecordCounter.collect so that after calling
c.repo.ListZones(ctx, "") you only proceed to iterate zones, compute
totalRecords, and call RecordsTotal.Set(...) when err == nil (same guard used
for ZonesTotal.Set); keep the existing per-zone error checks around
c.repo.ListRecordsForZone(ctx, z.ID, "") and only add record counts when those
calls succeed.
- Around line 54-57: The two exported counters CacheHitCount and CacheMissCount
should be made unexported since they're internal implementation details; rename
them (e.g., cacheHitCount, cacheMissCount) in the struct and update all
references in RecordCacheHit, RecordCacheMiss and compute methods to the new
names so the linter no longer requires doc comments for them.
- Around line 88-97: Instantiate and start the DerivedMetricCollector in the
metrics setup of cmd/clouddns/main.go so cache-derived metrics get computed:
call NewDerivedMetricCollector(...) (using the same interval used for
ZoneRecordCounter) and store the returned *DerivedMetricCollector, mirroring how
ZoneRecordCounter is initialized; also ensure you cleanly stop it on shutdown
(like ZoneRecordCounter) so RecordCacheHit/RecordCacheMiss updates produce the
clouddns_cache_hit_ratio metric.
---
Outside diff comments:
In `@internal/dns/server/recursive.go`:
- Around line 176-178: The early-return timeout branch using
resolveStart/recursiveTimeout (where s.Logger.Warn is called and
errRecursiveTimeout is returned) must increment
clouddns_recursive_resolutions_total with label "timeout" before returning;
likewise, when handling fallback resolver outcomes (check ResCode on the
fallback result for ResCode==0 and ResCode==3) increment the same metric with
labels "success" and "nxdomain" respectively so fallback-sourced resolutions are
counted; add these metric.Inc calls immediately prior to each return in the
timeout branch and the fallback success/nxdomain branches, referencing the
existing metric name clouddns_recursive_resolutions_total and the ResCode checks
already present in the code.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 352a93cd-eae9-4591-a5b9-1312d97c34e5
📒 Files selected for processing (8)
- cmd/clouddns/main.go
- internal/core/services/dnssec_service.go
- internal/dns/server/ratelimit.go
- internal/dns/server/ratelimit_test.go
- internal/dns/server/recursive.go
- internal/dns/server/server.go
- internal/infrastructure/metrics/metrics.go
- internal/infrastructure/metrics/metrics_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/infrastructure/metrics/metrics_test.go
- internal/dns/server/server.go
// NewDerivedMetricCollector creates a collector that updates derived metrics at the given interval.
func NewDerivedMetricCollector(interval time.Duration) *DerivedMetricCollector {
	c := &DerivedMetricCollector{
		interval: interval,
		stopCh:   make(chan struct{}),
		doneCh:   make(chan struct{}),
	}
	go c.run()
	return c
}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if NewDerivedMetricCollector is called anywhere in non-test production code
echo "=== NewDerivedMetricCollector call sites (excluding tests) ==="
rg -nP --type=go 'NewDerivedMetricCollector' -g '!*_test.go' -C3
echo
echo "=== NewDerivedMetricCollector call sites (tests only for comparison) ==="
rg -nP --type=go 'NewDerivedMetricCollector' --glob '*_test.go' -C2
Repository: poyrazK/cloudDNS
Length of output: 1662
🏁 Script executed:
# First, let's find the main entry point and server initialization files
fd -type f -name 'main.go'
fd -type f -name 'server.go'
fd -type f -name '*server*.go' -type f
Repository: poyrazK/cloudDNS
Length of output: 573
🏁 Script executed:
# Check the cmd/clouddns/main.go file to see how metrics are initialized
if [ -f "cmd/clouddns/main.go" ]; then
wc -l cmd/clouddns/main.go
fi
Repository: poyrazK/cloudDNS
Length of output: 85
🏁 Script executed:
# Search for all places where ZoneRecordCounter is mentioned to understand the metric initialization pattern
rg -n 'ZoneRecordCounter' -g '!*_test.go' --type=go
Repository: poyrazK/cloudDNS
Length of output: 962
🏁 Script executed:
# Search for any initialization or startup related to DerivedMetricCollector in comments or code
rg -n 'DerivedMetricCollector\|derived.*metric\|cache.*ratio' -g '!*_test.go' --type=go -i
Repository: poyrazK/cloudDNS
Length of output: 42
🏁 Script executed:
# Read the main.go file to see the complete context around the metric initialization
sed -n '180,200p' cmd/clouddns/main.go
Repository: poyrazK/cloudDNS
Length of output: 576
🏁 Script executed:
# Find all go files that might be related to server initialization
fd -name 'server.go' -o -name '*server*.go'
Repository: poyrazK/cloudDNS
Length of output: 291
🏁 Script executed:
# Search for where metrics are used in main.go more broadly
grep -n 'metrics\|ZoneRecordCounter\|DerivedMetric' cmd/clouddns/main.go
Repository: poyrazK/cloudDNS
Length of output: 332
🏁 Script executed:
# Find server-related files
fd 'server' --type f
Repository: poyrazK/cloudDNS
Length of output: 169
🏁 Script executed:
# Let's check if there are any server.go files in the internal directory
find . -name 'server.go' -type f
Repository: poyrazK/cloudDNS
Length of output: 92
🏁 Script executed:
# Check the metrics.go file to understand what DerivedMetricCollector does and what it computes
sed -n '78,120p' internal/infrastructure/metrics/metrics.go
Repository: poyrazK/cloudDNS
Length of output: 1069
🏁 Script executed:
# Look for clouddns_cache_hit_ratio mention in the codebase
rg 'cache_hit_ratio|cache.*hit' --type-list | head -20
Repository: poyrazK/cloudDNS
Length of output: 450
🏁 Script executed:
# Check the server.go file to see if it initializes DerivedMetricCollector
sed -n '1,50p' internal/dns/server/server.go
Repository: poyrazK/cloudDNS
Length of output: 1195
🏁 Script executed:
# Search for any initialization or startup patterns in server.go that might relate to metrics
rg 'DerivedMetricCollector|CacheHit|metrics' internal/dns/server/server.go -A2 -B2
Repository: poyrazK/cloudDNS
Length of output: 4180
🏁 Script executed:
# Check if there's any global initialization of DerivedMetricCollector in init() or similar
rg 'func init\|NewDerivedMetricCollector' --type=go -g '!*_test.go'
Repository: poyrazK/cloudDNS
Length of output: 42
🏁 Script executed:
# Let's check where the cache hit/miss recording happens
rg 'RecordCacheHit|RecordCacheMiss|CacheHitCount|CacheMissCount' --type=go -g '!*_test.go' -B2 -A2
Repository: poyrazK/cloudDNS
Length of output: 4009
Start DerivedMetricCollector in cmd/clouddns/main.go.
RecordCacheHit() and RecordCacheMiss() are called throughout server.go during query processing, but NewDerivedMetricCollector() is never instantiated in production code. This means clouddns_cache_hit_ratio remains uncomputed. Add initialization similar to ZoneRecordCounter in the metrics setup section of main.go.
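A hedged sketch of the wiring this comment asks for, mirroring the ZoneRecordCounter setup referenced in the review. The interval, variable names, and exact placement within main.go are assumptions:

// Inside cmd/clouddns/main.go startup (sketch only): start both background
// metric collectors and stop them on shutdown. Assumes the 30s interval used
// for ZoneRecordCounter elsewhere in this PR.
derived := metrics.NewDerivedMetricCollector(30 * time.Second)
defer derived.Stop()

zoneCounter := metrics.NewZoneRecordCounter(repo, 30*time.Second)
zoneCounter.Start(runCtx) // runCtx is the server's run context
defer zoneCounter.Stop()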
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/infrastructure/metrics/metrics.go` around lines 88 - 97, Instantiate
and start the DerivedMetricCollector in the metrics setup of
cmd/clouddns/main.go so cache-derived metrics get computed: call
NewDerivedMetricCollector(...) (using the same interval used for
ZoneRecordCounter) and store the returned *DerivedMetricCollector, mirroring how
ZoneRecordCounter is initialized; also ensure you cleanly stop it on shutdown
(like ZoneRecordCounter) so RecordCacheHit/RecordCacheMiss updates produce the
clouddns_cache_hit_ratio metric.
…ed return, exported var
- Pass runCtx to ZoneRecordCounter.Start() to satisfy contextcheck linter
- Unexport cacheHitCount/cacheMissCount to fix revive:exported rule
- Restore CacheHitRatio gauge declaration in var block
- Update test to use atomic.Uint64 methods instead of raw atomic package
- Keep CollectKeyStats exported (nolint approach rejected by user)
…ests
- Add TestCollectKeyStats_AllZonesFail to verify empty slice on all-zones-fail
- Add TestCollectKeyStats_Normal to verify correct stats when keys exist
- Add TestZoneRecordCounter, TestZoneRecordCounter_EmptyZones, TestZoneRecordCounter_ZonesWithNoRecords
- Split mockDNSSECRepo err into keysErr and listErr for granular error control
…rd, fallback metrics, DerivedMetricCollector start
- Export KeyStats struct (was keyStats) to fix unexported-return revive error
- Guard RecordsTotal.Set with same err==nil check used for ZonesTotal
- Add timeout/resolveStart metric Inc before fallback early-return
- Add success/nxdomain metric Inc for fallback resolution outcomes
- Start DerivedMetricCollector in main.go alongside ZoneRecordCounter
Summary
- clouddns_cache_hit_ratio gauge computed via background collector
- clouddns_dnssec_keys_total, clouddns_dnssec_keys_age_seconds, clouddns_zones_signed for DNSSEC visibility
- clouddns_notifies_total (CounterVec with zone/result labels)
- clouddns_rate_limited_total counter tracking rejected queries
- clouddns_axfr_bytes_total counter for zone transfer monitoring
- clouddns_recursive_resolutions_total (CounterVec with result: success/nxdomain/timeout)
- clouddns_zones_total and clouddns_records_total gauges
- RecordCacheHit()/RecordCacheMiss() atomics for accurate hit ratio
- CollectKeyStats() on DNSSECService for key age metrics
Test plan
- go test ./internal/infrastructure/metrics/... passes
- go test ./internal/dns/server/... passes
- curl localhost:8080/metrics | grep clouddns_ shows all new metrics
Summary by CodeRabbit