fix: address CodeRabbit review findings across certificate, dlq, webhook, idempotency, and systemplane#420
Conversation
…ook, idempotency, and systemplane packages - certificate: preserve intermediate chain in Rotate via variadic intermediates, deep-copy DER chain in TLSCertificate to prevent aliasing, add LoadFromFilesWithChain - dlq/consumer: fix Run restart after ctx shutdown by clearing stopCh on exit, escalate re-enqueue failures to ERROR with metrics (prevents silent message loss) - webhook: enforce redirect blocking on custom HTTP clients (SSRF protection), fix TLS SNI for HTTPS endpoints after DNS pinning via ServerName on cloned transport - idempotency: use Redis pipeline for atomic response cache + state marker writes, correct godoc from at-most-once to best-effort idempotency (fail-open on Redis outage) - systemplane/domain: add nil-receiver guard to ConfigValue and setting helpers, clone mutable defaults in DefaultSnapshotFromKeyDefs to prevent Value/Default aliasing, use platform-independent overflow bounds in intFromFloat64 and scaleDurationFloat64 - systemplane/bootstrap: wrap resource-validation errors with backend kind context - systemplane/catalog: inline containsString to direct slices.Contains call - systemplane/ports: fix AllowAllAuthorizer godoc to reflect fail-closed behavior, add TenantID exact-max-length boundary test - systemplane/bootstrap: use snapshot() in tests to honor registry mutex contract
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR makes targeted functional changes across multiple packages. Certificate manager: Rotate now accepts optional DER-encoded intermediates, re-parses and deep-copies the leaf, TLSCertificate returns a deep-copied DER chain, and LoadFromFilesWithChain was added. DLQ consumer: Run uses a per-invocation stop channel with safer cleanup; processMessage records lost metrics and calls libOtel.HandleSpanError on re-enqueue failures, and returns immediately on loss paths. Idempotency middleware: Redis writes/deletes now use a single pipeline per outcome. Webhook deliverer: WithHTTPClient clones the client and blocks redirects; added httpsClientForPinnedIP to set SNI for pinned-IP requests. 🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
commons/systemplane/domain/snapshot.go (1)
80-98: 🧹 Nitpick | 🔵 TrivialConsider adding explicit nil-receiver checks to
GlobalSettingValueandTenantSettingValuefor consistency.While these methods are implicitly nil-safe (via their
Get*calls), adding explicit nil-receiver checks would provide consistency withConfigValueand make the nil-safety immediately obvious to readers.♻️ Suggested consistency improvement
// GlobalSettingValue returns the setting value for the given key, or the fallback -// if the key is not present. +// if the key is not present. Nil-receiver safe. func (s *Snapshot) GlobalSettingValue(key string, fallback any) any { + if s == nil { + return fallback + } + if v, ok := s.GetGlobalSetting(key); ok { return v.Value } return fallback } // TenantSettingValue returns the tenant setting value for the given key, or the -// fallback if the key is not present. +// fallback if the key is not present. Nil-receiver safe. func (s *Snapshot) TenantSettingValue(tenantID, key string, fallback any) any { + if s == nil { + return fallback + } + if v, ok := s.GetTenantSetting(tenantID, key); ok { return v.Value } return fallback }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commons/systemplane/domain/snapshot.go` around lines 80 - 98, Add explicit nil-receiver checks to Snapshot.GlobalSettingValue and Snapshot.TenantSettingValue: at the start of each method check if s == nil and immediately return the provided fallback, then proceed to call GetGlobalSetting / GetTenantSetting as before; this mirrors the nil-safety pattern used by ConfigValue and makes the intent obvious to readers while keeping existing behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@commons/certificate/certificate.go`:
- Around line 123-126: The Rotate implementation builds m.chain by appending the
provided intermediates slice directly (chain := ...; chain = append(chain,
cert.Raw; chain = append(chain, intermediates...); m.chain = chain), which
allows external mutation to alter the stored chain; modify Rotate to deep-copy
each intermediate []byte before appending (allocate new byte slices and copy
contents) so m.chain holds independent buffers (consistent with
TLSCertificate()), ensuring m.chain and cert.Raw are not aliased to caller-owned
memory.
In `@commons/dlq/consumer.go`:
- Around line 397-406: The branch that handles a failed re-enqueue for a
not-yet-ready message currently logs (c.logger.Log), records exhaustion
(c.metrics.RecordExhausted via c.sanitizeMetricSource) and then falls through to
the "bounced back" path by returning false, which prevents the failure from
counting toward the poll's BatchSize; change behavior so the failed re-enqueue
is treated as consumed for batching: after logging and calling
c.metrics.RecordExhausted, mark the message as counted toward the batch (i.e.
increment whatever batch counter or return true from this handler) so the
message contributes to the poll BatchSize instead of using the bounced-back
return false path.
- Around line 182-192: The Run() start logic currently unconditionally replaces
c.stopCh with a new runStopCh which can allow overlapping Run loops because an
existing loop might still be in safeProcessOnce; fix by holding c.stopMu,
checking if c.stopCh != nil and if so do not start a new loop (return/exit),
otherwise set c.stopCh = runStopCh and unlock; keep the deferred cleanup that
clears c.stopCh only if it still equals runStopCh to avoid clearing a
newly-started loop. Ensure references: Run(), Stop(), c.stopCh, runStopCh,
c.stopMu, and safeProcessOnce are used as described.
In `@commons/net/http/idempotency/idempotency.go`:
- Around line 298-301: The two separate Redis deletions queued via pipe.Del(ctx,
key) and pipe.Del(ctx, responseKey) should be combined into a single variadic
Del call to reduce roundtrips; update the code around client.Pipeline() where
pipe.Del is used (the two calls for key and responseKey) to a single
pipe.Del(ctx, key, responseKey) so both keys are deleted in one Redis command.
In `@commons/systemplane/domain/snapshot_from_keydefs.go`:
- Around line 41-64: The domain's cloneRuntimeValue implementation is narrower
than the reflection-based generic version in manager_helpers.go; replace the
type-asserting cloneRuntimeValue in snapshot_from_keydefs.go with the reusable
reflection-based implementation from manager_helpers.go (keeping the same
function name cloneRuntimeValue) so it handles any map/slice types and includes
explicit nil checks, or if you intentionally need only map[string]any and []any,
add a comment explaining that constraint and ensure its behavior (including nil
handling) matches expectations; update only the implementation body to reuse the
shared logic and remove duplication.
In `@commons/webhook/deliverer.go`:
- Around line 575-600: The httpsClientForPinnedIP function currently clones the
Transport on every call which breaks connection reuse; change it to cache one
cloned Transport/Client per originalHost to preserve connection pooling while
avoiding races from mutating ServerName. Implement a host->*http.Client cache on
Deliverer (e.g., a sync.Map or a map with mutex) keyed by originalHost, where
the stored client uses a transport clone whose TLSClientConfig.ServerName is set
once; ensure cached clients inherit d.client.Timeout and CheckRedirect, provide
an eviction/CloseIdleConnections strategy for long-lived entries, and keep the
fallback behavior when d.client.Transport is not an *http.Transport.
- Around line 588-594: The cloned transport `pinned` should enforce a minimum
TLS version of TLS1.2: after creating or copying `pinned.TLSClientConfig` (the
block around `pinned := transport.Clone()` and `if pinned.TLSClientConfig == nil
{ pinned.TLSClientConfig = &tls.Config{MinVersion: tls.VersionTLS12} }`), ensure
you set `pinned.TLSClientConfig.MinVersion = tls.VersionTLS12` when the existing
MinVersion is zero or less than `tls.VersionTLS12` so callers cannot downgrade
TLS; keep `pinned.TLSClientConfig.ServerName = originalHost` as-is.
---
Outside diff comments:
In `@commons/systemplane/domain/snapshot.go`:
- Around line 80-98: Add explicit nil-receiver checks to
Snapshot.GlobalSettingValue and Snapshot.TenantSettingValue: at the start of
each method check if s == nil and immediately return the provided fallback, then
proceed to call GetGlobalSetting / GetTenantSetting as before; this mirrors the
nil-safety pattern used by ConfigValue and makes the intent obvious to readers
while keeping existing behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: aef8ef36-540f-4abc-bb9b-2a991d030178
📒 Files selected for processing (14)
commons/certificate/certificate.gocommons/dlq/consumer.gocommons/net/http/idempotency/doc.gocommons/net/http/idempotency/idempotency.gocommons/systemplane/bootstrap/backend.gocommons/systemplane/bootstrap/backend_test.gocommons/systemplane/catalog/validate.gocommons/systemplane/domain/coercion_helpers.gocommons/systemplane/domain/setting_helpers.gocommons/systemplane/domain/snapshot.gocommons/systemplane/domain/snapshot_from_keydefs.gocommons/systemplane/ports/authorizer_defaults.gocommons/systemplane/ports/identity_defaults_test.gocommons/webhook/deliverer.go
- certificate: deep-copy intermediate DER bytes in Rotate to prevent caller mutation - dlq/consumer: simplify Run() overlap guard to reject when stopCh is non-nil (prevents concurrent loops while previous goroutine drains safeProcessOnce); lost not-yet-ready messages now count against BatchSize (return true) - idempotency: combine two pipe.Del calls into single variadic Del(ctx, key, responseKey) - webhook: enforce minimum TLS 1.2 on cloned transport even when caller config is weaker - systemplane/domain: add nil-receiver guards to GlobalSettingValue and TenantSettingValue; add comment explaining intentional narrower clone scope vs reflection-based version
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@commons/certificate/certificate.go`:
- Around line 83-87: Update the godoc to explicitly show that
LoadFromFilesWithChain returns a slice containing the leaf followed by
intermediates and that Rotate expects only the intermediates (not the leaf); add
a short example like calling m.Rotate(cert, signer, chain[1:]...) to make the
hot-reload path unambiguous and note that passing chain[1:] avoids duplicating
the leaf in m.chain; apply the same clarification to the other related comment
block that documents this behavior near the second Rotate/Manager usage.
- Around line 121-132: Rotate stores m.cert and m.chain entries by reference so
the leaf certificate and chain[0] alias the original DER buffer; fix by making
Rotate own a deep copy/reparse of the leaf before assigning m.cert and m.chain
(e.g., re-parse a copy of cert.Raw or clone via x509.ParseCertificate on a
copied buffer) so m.cert.Raw does not point to external mutable memory
(intermediates are already copied); additionally, change TLSCertificate to
return a cloned Leaf (not m.cert directly) so external callers cannot mutate the
manager's internal certificate (affects GetCertificate(), ExpiresAt(),
PublicKey()); apply the same copying/cloning approach to the other Rotate/assign
block referenced around lines 228-238.
In `@commons/dlq/consumer.go`:
- Around line 392-405: The terminal-loss branches that log and meter a lost
message (the branch guarding msg.NextRetryAt.IsZero/now.Before and the similar
branch around lines 438-450) do not mark the per-message tracing span as failed;
update those branches to record the span error/status before returning (e.g.,
call the per-message span's error recording API such as span.RecordError(err)
and/or span.SetStatus(Error) or equivalent in your tracing wrapper) so the trace
reflects the failure alongside the existing c.logger.Log and
c.metrics.RecordExhausted calls; apply this to the branch handling
c.handler.Enqueue failure (refer to msg.NextRetryAt, c.handler.Enqueue,
c.logger.Log, c.metrics.RecordExhausted) and the other terminal-loss branch
referenced in the review.
In `@commons/net/http/idempotency/idempotency.go`:
- Around line 255-257: The comment wrongly claims atomicity: update the godoc on
saveResult to reflect that it uses redis.Client.Pipeline() which batches
commands for a single round-trip but is not transactionally atomic (other
clients may interleave); either change the wording from "atomically" to "in a
single round-trip" (or "in a single pipeline") or, if true MULTI/EXEC semantics
are required, replace Pipeline() with TxPipeline()/MULTI-EXEC and adjust logic
accordingly while referencing saveResult and the Pipeline() call.
In `@commons/webhook/deliverer.go`:
- Around line 597-600: The returned http.Client is being constructed from
scratch and drops fields like d.client.Jar; instead clone the existing client
(as done in WithHTTPClient()) and only replace its Transport with the pinned
transport so caller-configured fields (Jar, Timeout, CheckRedirect, etc.) are
preserved; locate the code that builds the client (uses d.client and pinned) and
change it to copy/clone d.client into a new client instance and assign
newClient.Transport = pinned before returning it.
- Around line 581-585: The pinned-HTTPS path currently type-asserts
d.client.Transport to *http.Transport and returns early when Transport is nil
(or non-*http.Transport), breaking TLS ServerName setup; update the logic in the
function that builds the pinned client (where d.client and transport are used)
to treat a nil Transport as http.DefaultTransport before doing the type
assertion, then if the resulting transport is an *http.Transport clone its
TLSClientConfig (or create one) and set ServerName to the original hostname
(preserving other fields and concurrency safety), falling back only if the
transport truly isn't an *http.Transport after normalizing nil to
http.DefaultTransport.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: daa27325-8f8f-4325-9665-f9ad1e297744
📒 Files selected for processing (6)
commons/certificate/certificate.gocommons/dlq/consumer.gocommons/net/http/idempotency/idempotency.gocommons/systemplane/domain/snapshot.gocommons/systemplane/domain/snapshot_from_keydefs.gocommons/webhook/deliverer.go
- certificate: deep-copy leaf DER (cert.Raw) in Rotate to prevent aliasing; add chain[1:]... usage example to Rotate and LoadFromFilesWithChain godocs - dlq/consumer: record span error via HandleSpanError on both terminal message-loss paths (not-yet-ready re-enqueue failure and retry re-enqueue failure) - idempotency: correct saveResult godoc from 'atomically' to 'in a single round-trip' (Pipeline batches but does not provide MULTI/EXEC transactional atomicity) - webhook: normalize nil Transport to http.DefaultTransport before type assertion in httpsClientForPinnedIP; clone d.client instead of building from scratch to preserve Jar and other caller-configured fields
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@commons/certificate/certificate.go`:
- Around line 235-245: The returned tls.Certificate from TLSCertificate
currently deep-copies Certificate bytes but sets Leaf: m.cert, exposing the
manager's internal *x509.Certificate; update the TLSCertificate godoc to
explicitly state that the Leaf pointer is shared with the manager (same contract
as GetCertificate) and must not be mutated by callers, referencing
TLSCertificate, Leaf and m.cert so readers can locate the behavior and
expectations.
In `@commons/dlq/consumer.go`:
- Around line 402-405: Add a distinct "lost" metric instead of reusing
RecordExhausted: update the DLQMetrics interface to include RecordLost(ctx
context.Context, source string, reason string) and, where the code currently
treats re-enqueue failures as exhaustion (the site calling
c.metrics.RecordExhausted after sanitizeMetricSource), change that call to
c.metrics.RecordLost(ctx, metricSource, "<brief-reason-or-error>") while leaving
true max-retry paths to call RecordExhausted; use sanitizeMetricSource,
c.metrics and the surrounding re-enqueue/error handling block to locate the spot
and include the error or reason string when calling RecordLost.
In `@commons/net/http/idempotency/idempotency.go`:
- Around line 290-305: The warning logs currently include the raw composite
idempotency key (variable "key") which may leak
high-cardinality/request-specific identifiers; in the m.logger.Log calls inside
the branches that call pipe.Exec (and the pipeline started via client.Pipeline
with pipe.Del using "responseKey"), remove or replace the raw "key" field with a
redacted or stable hash (e.g., compute a truncated SHA256 or a fixed "REDACTED"
token before logging) and log that instead; update both the "idempotency: failed
to atomically cache response and mark complete" and "idempotency: failed to
delete keys after handler error" log invocations (and any other m.logger.Log
usages nearby that pass "key") to use the sanitized value rather than the raw
variable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 53e480d8-1ecc-4208-a44c-91c04cad2dba
📒 Files selected for processing (4)
commons/certificate/certificate.gocommons/dlq/consumer.gocommons/net/http/idempotency/idempotency.gocommons/webhook/deliverer.go
Deep-copy leaf certificate in Rotate to prevent aliasing caller-owned memory, document shared Leaf pointer in TLSCertificate godoc, and remove high-cardinality composite key from idempotency warning logs.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@commons/net/http/idempotency/idempotency.go`:
- Around line 297-298: The current pipeline call pipe.Del(ctx, key, responseKey)
can trigger a Redis Cluster CROSSSLOT error because key and responseKey hash to
different slots; replace the multi-key DEL with two single-key deletes in the
same pipeline (e.g., pipe.Del(ctx, key) and pipe.Del(ctx, responseKey)) so each
DEL targets a single key, then execute the pipeline and handle any error from
the Exec; update the cleanup logic around client.Pipeline(), pipe.Del, key and
responseKey to perform two separate Del calls and propagate/log Exec errors as
before.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 54f94232-fd7e-49f7-9894-c435863da012
📒 Files selected for processing (2)
commons/certificate/certificate.gocommons/net/http/idempotency/idempotency.go
…haustion Adds a dedicated RecordLost method to the DLQMetrics interface so operators can separately alert on infrastructure-caused message loss (re-enqueue failures) versus expected exhaustion (max retries reached).
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@commons/dlq/consumer_test.go`:
- Around line 64-69: Add a new test that exercises the RecordLost metric by
simulating a Redis write failure after dequeue: use newTestConsumer to create a
consumer with a mockMetrics instance and a retryFn that returns an error, inject
a retryable FailedMessage via injectMessage into the appropriate DLQ queue, call
mr.SetError("READONLY") to force Redis write failures, run c.ProcessOnce(ctx)
and assert mockMetrics.lostCount() > 0 (or equals 1). This targets the
lostCount() helper and the code path in ProcessOnce that should call RecordLost
when re-enqueue/retry writes fail.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c23a6d40-0463-4847-8f20-bf6a672e0fd6
📒 Files selected for processing (3)
commons/dlq/consumer.gocommons/dlq/consumer_test.gocommons/dlq/handler.go
…empotency Split pipe.Del(ctx, key, responseKey) into two single-key DEL calls within the same pipeline to prevent CROSSSLOT errors when keys hash to different Redis Cluster slots.
|
@CodeRabbit -- everything ok? mergeable? |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 45719
✅ CI: Both ❌ Blocked: There are 18 unresolved review comments across 5 review rounds, all with 🔴 Critical (1)
🟠 Major (5)
🔵 Nitpick/Minor (12)Includes: span not marked failed on terminal-loss paths, Would you like me to use the Autofix (Beta) to push a commit addressing all of these, or would you prefer to handle them manually? |
Summary
Addresses all 20 actionable CodeRabbit review findings from PR #419 across 14 files.
Critical fixes
Rotatenow preserves intermediate chain via variadicintermediates ...[]byteparameter;TLSCertificate()deep-copies DER chain to prevent callers from mutating manager state; newLoadFromFilesWithChainexposes the full chain for hot-reload workflowsRun()now clearsstopChon every exit path so subsequentRun()calls work correctly after context cancellation; re-enqueue failures escalated to ERROR withRecordExhaustedmetric — prevents silent message lossWithHTTPClientclones the caller's client and forces redirect blocking to preserve SSRF protection; HTTPS endpoints after DNS pinning now setTLSClientConfig.ServerNameon a cloned transport for correct TLS SNI and cert verificationMajor fixes
saveResultuses Redis pipeline for atomic response cache + state marker writes; godoc corrected from "at-most-once" to "best-effort idempotency" reflecting fail-open behaviorConfigValueand allSnapSetting*helpers are now nil-receiver safe;DefaultSnapshotFromKeyDefsclones mutable defaults to prevent Value/Default aliasing;intFromFloat64usesfloat64(math.MaxInt)+1for platform-independent boundsMinor / nitpick fixes
containsStringto directslices.Containssnapshot()to honor registry mutex contractVerification
make cipasses: lint, format, tidy, security, vet, unit tests, and integration tests all green.