Skip to content

release: lib-commons v4 — multi-tenant platform, systemplane, outbox, and 20+ new packages#419

Closed
fredcamaral wants to merge 4 commits intomainfrom
develop
Closed

release: lib-commons v4 — multi-tenant platform, systemplane, outbox, and 20+ new packages#419
fredcamaral wants to merge 4 commits intomainfrom
develop

Conversation

@fredcamaral
Copy link
Copy Markdown
Member

Summary

Comprehensive v4 release of lib-commons consolidating all shared infrastructure for Lerian's financial services platform. This release introduces multi-tenant database-per-tenant isolation, a runtime configuration plane (systemplane), transactional outbox, webhook delivery, and 20+ new packages — alongside significant hardening and test coverage improvements across all existing packages.

Key Additions

New Packages

  • tenant-manager/ — Full multi-tenant lifecycle: PostgreSQL, MongoDB, RabbitMQ, Redis, and S3 managers with event-driven tenant discovery (Redis Pub/Sub), LRU eviction, circuit breaker integration, and Fiber middleware (WithPG/WithMB)
  • systemplane/ — Runtime configuration plane with domain model, key registry, Postgres/MongoDB store adapters, changefeed adapters (LISTEN/NOTIFY + change streams), Fiber HTTP handler, bootstrap layer, catalog of known keys, and supervisor/escalation lifecycle
  • outbox/ — Transactional outbox primitives with PostgreSQL adapter, migrations, dispatcher with concurrency control, tenant-scoped event classification, and metrics
  • webhook/ — Outbound webhook delivery with SSRF protection (DNS pinning, private IP blocking, redirect blocking), HMAC-SHA256 signing, encrypted secret support, concurrency-limited fan-out, and exponential backoff retries
  • net/http/idempotency/ — Redis-backed at-most-once request middleware (SetNX, fail-open, 409 for in-flight, response replay, tenant-scoped keys)
  • net/http/ratelimit/ — Redis-backed sliding window rate limiter with SHA-256 key hashing, identity extraction helpers, and configurable bypass
  • dlq/ — Redis-backed dead letter queue with tenant-scoped keys, exponential backoff (AWS Full Jitter), background consumer with retry/exhaust lifecycle, and ScanQueues for cross-tenant discovery
  • assert/ — Production-safe assertions with telemetry integration, domain predicates library, and zero-panic guarantee
  • runtime/ — Panic recovery, panic metrics, safe goroutine wrappers (SafeGo/SafeGoWithContext), error reporter, production mode, and span recording
  • safe/ — Panic-free math (division, percentage), regex (with caching), and slice operations with error returns
  • certificate/ — Thread-safe TLS certificate manager with hot-reload via Rotate(), PEM file loading, PKCS#8/PKCS#1/EC key support, file permission enforcement, and tls.Config integration
  • backoff/ — Exponential backoff with jitter and context-aware sleep
  • errgroup/ — Goroutine coordination with panic recovery
  • jwt/ — HMAC-based JWT signing, verification, and time-claim validation (HS256/384/512)
  • cron/ — Cron expression parsing and scheduling
  • secretsmanager/ — AWS Secrets Manager M2M credential retrieval for per-tenant service authentication
  • security_override.goALLOW_* progressive security policy gates with environment-tier enforcement

Major Enhancements

  • Postgresdbresolver with read/write splitting, backoff-based lazy-connect, OTEL spans, migration integration tests, resilience tests, periodic pool settings revalidation
  • MongoDB — Functional options constructor, URI builder, index helpers, OTEL spans, integration tests, ResolveClient with reconnect
  • Redis — Topology-based config (standalone/sentinel/cluster), GCP IAM auth, distributed locking (Redsync), backoff-based reconnect, resilience integration tests
  • RabbitMQ — Context-aware lifecycle methods, DLQ support, publisher with confirms/retry, trace propagation integration tests
  • OpenTelemetryRedactor with RedactionRule patterns (replaces FieldObfuscator), RedactingAttrBagSpanProcessor, noop global providers on empty endpoint
  • MetricsMetricsFactory with fluent builders (Counter/Gauge/Histogram), system metrics, NopFactory for tests
  • Logging — 5-method Logger interface, GoLogger with CWE-117 log-injection prevention, sanitizer, zap adapter with ISO8601 timestamps and OTEL bridge
  • HTTP — Reverse proxy with SSRF protection, CORS hardening, pagination (offset/cursor/timestamp/sort), validation helpers, ownership verification, telemetry middleware refactoring
  • ServerServerManager-based graceful shutdown with chainable config and ServersStarted() for test coordination
  • Circuit BreakerManager interface with preset configs, metrics support, health checker with validation
  • Transaction — Intent-based planning, balance eligibility validation, posting flow with typed operations

Infrastructure

  • Module path: github.com/LerianStudio/lib-commons/v4
  • Go 1.25.7
  • .env.reference for all consumed environment variables
  • AGENTS.md with full API invariants for coding agents
  • MIGRATION_MAP.md for v1→v4 symbol mapping
  • REVIEW.md with review guidelines
  • Makefile overhaul: make ci, coverage targets, integration test support, LOW_RESOURCE mode
  • golangci-lint v2 with 3-tier linter configuration (safety → quality → zero-issue guards)

Stats

  • 604 files changed
  • ~142,700 insertions / ~11,500 deletions
  • Comprehensive unit and integration test coverage across all packages

fredcamaral and others added 3 commits March 28, 2026 13:42
… service layer (#404)

* feat(systemplane): add standardization helpers, catalog, and hardened service layer

Adds catalog package with shared key validation and registration for
Postgres and Redis backends. Introduces domain coercion helpers for
type-safe snapshot value conversion (bool, int, float64, duration, string
slice) with overflow/NaN/Inf guards and case-insensitive bool parsing.

Adds fail-closed default auth adapters (deny mutations, allow reads) and
default identity adapters (system identity for internal operations) with
Actor.ID length validation.

Refactors service/manager into reads, writes, and helpers files. Adds
component diff detection for identifying changed settings between
snapshots. Introduces shutdown helpers for ordered systemplane teardown.
Extracts supervisor helpers with overflow-safe phase sorting and
non-mutating tenant ID merge.

Hardens BootstrapConfig.Validate with default case, DelegatingAuthorizer
with unconditional empty-action rejection, cloneSnapshot with nil
TenantSettings preservation, and Swagger MergeInto with tag
deduplication. Includes comprehensive tests.

X-Lerian-Ref: 0x1

* style(tenant-manager): apply lint-fix and gofmt auto-formatting

Sort import groups alphabetically, remove trailing blank line in test,
and normalize struct field comment alignment in Manager.

* fix(systemplane): address all CodeRabbit review findings

- Add sync.RWMutex to backend registry for concurrent safety
- Use snapshot consistently for factory lookup in NewBackendFromConfig
- Extract withRegistrySnapshot test helper to reduce duplication
- Add ErrUnhandledBackend sentinel error in config validation
- Fix EnvVar validation gap when catalog defines no env vars
- Fix float-to-int boundary at 2^63 (>= instead of > for overflow check)
- Accept Secret+RedactMask in KeyDef.Validate (runtime normalizes to Full)
- Add godocs for Actor/TenantID, maxTenantIDLength, whitespace test case
- Add DeepEqual semantics comment on effectiveValueChanged
- Return persisted revision on escalation failure in persistAndApplyWrite
- Assert unconditionally in shutdown_test (covers all-nil-steps case)
- Check key existence before asserting Redacted in snapshot_builder_test
- Use discardFailedCandidate for incremental build error path
- Document concurrency safety invariant in prepareReloadBuild
- Update MergeInto godoc and deduplicate srcArr tags in mergeTags
- Auto-formatting: struct field alignment in tenant-manager packages

* fix(systemplane): address remaining CodeRabbit review findings

- Guard nil init errors in RecordInitError to prevent false positives
- Propagate WriteResult.Revision on escalation failure in PatchConfigs/PatchSettings
- Include Revision and BuiltAt in zero-snapshot check for accurate diff
- Return error instead of silently skipping malformed source tags in swagger merge
- Reject empty-name tags in swagger merge to prevent spec corruption
- Route backend_test.go registry mutations through locked helpers
- Align Actor godoc with whitespace-only fallback behavior

* fix(systemplane): eliminate torn-read window and log cleanup errors

Replace two separate atomic pointers (snapshot + bundle) with a single
atomic.Pointer[supervisorState] so lock-free readers always see a
consistent (snapshot, bundle) pair. This eliminates the torn-read window
where Current() and Snapshot() could return mismatched states.

Record cleanup errors from discard/close paths to the active OTEL span
instead of silently discarding them with _ assignment.

Add 64-bit platform note to intFromFloat64 boundary constant.

---------

Co-authored-by: Jefferson Rodrigues <jeff@lerian.studio>
- Add missing packages to structure tree: internal/nilcheck, outbox,
  secretsmanager, systemplane (7 sub-dirs), tenant-manager (14 sub-dirs),
  and root files environment.go, security_override.go
- Update enabled linters: remove stale thelper/tparallel, add Tier 1-3
  linters (27 total) matching actual .golangci.yml
- Fix API Invariants heading: v2 -> v4
- Update allowed dependencies table with mongo-driver/v2,
  testcontainers-go, go-sqlmock, goleak, jwt/v5, aws-sdk-go-v2,
  golang.org/x/sync, golang.org/x/text, grpc, protobuf
- Fix security section: replace non-existent SECURE_LOG_FIELDS with
  actual LOG_OBFUSCATION_DISABLED and security.IsSensitiveField()
…lane enhancements (#418)

* feat(certificate): add thread-safe TLS certificate manager with hot-reload

Provides PEM-based certificate loading (PKCS#8/PKCS#1/EC key parsing order), strict file-permission enforcement (0600), atomic hot-reload via Rotate with expiry and public-key match validation, and tls.Config integration via GetCertificateFunc for transparent certificate rotation without restart.

X-Lerian-Ref: 0x1

* feat(dlq): add Redis-backed dead letter queue with consumer lifecycle

Tenant-scoped Redis keys with exponential backoff (AWS Full Jitter), background consumer with poll-based retry/exhaust lifecycle, non-blocking SCAN for tenant discovery, and pruning for exhausted messages. Nil-safe handler and consumer with functional options for logger, tracer, and metrics.

X-Lerian-Ref: 0x1

* feat(idempotency): add Redis-backed at-most-once request middleware for Fiber

SetNX-based idempotency enforcement with tenant-scoped keys, fail-open on Redis unavailability, cached response replay with Idempotency-Replayed header, 409 Conflict for in-flight duplicates, and automatic key cleanup on handler error to allow client retry.

X-Lerian-Ref: 0x1

* feat(webhook): add outbound webhook delivery with SSRF protection and HMAC signing

Concurrent fan-out delivery to active endpoints with semaphore-capped concurrency, DNS-pinned SSRF protection (validates all resolved IPs against private/loopback/CGNAT/RFC-reserved ranges to eliminate TOCTOU), redirect blocking, HMAC-SHA256 payload signing, encrypted secret support via SecretDecryptor, and exponential backoff retries (non-retryable on 4xx except 429).

X-Lerian-Ref: 0x1

* feat(systemplane): add bootstrap helpers, validation options, and snapshot builder

Adds ApplyKeyDefs for propagating KeyDef behaviors into bootstrap config with auto-configured secret encryption, LoadFromEnvOrDefault for zero-config Postgres fallback, ValidateKeyDefsWithOptions with WithIgnoreFields/WithKnownDeviation for suppressing intentional catalog deviations, SecretStoreConfig validation with base64 key support, and SnapshotFromKeyDefs domain helper.

X-Lerian-Ref: 0x1

* docs: document certificate, dlq, idempotency, webhook packages and systemplane catalog env vars

Updates AGENTS.md API reference, README.md package listings, PROJECT_RULES.md structure and invariant tables, and .env.reference with systemplane catalog environment variables (server TLS, CORS, rate limiting, auth, telemetry, and all shared catalog keys).

X-Lerian-Ref: 0x1

* fix(dlq): resolve modernize lint issues in handler

Use omitzero tag for time.Time struct field (omitempty has no effect on nested structs) and replace manual floor comparison with builtin max.

X-Lerian-Ref: 0x1

* fix: address CodeRabbit review feedback across all new packages

Security (critical):
- SSRF: fail-closed on DNS lookup failure instead of falling back to raw URL
- SSRF: pin to first valid IP (skip unparseable entries) and bracket-wrap bare IPv6
- SSRF test: assert error identity (errors.Is) instead of message substring

DLQ hardening:
- Consumer Run() guards against concurrent invocations (prevents orphaned stopCh)
- retryFunc panic recovery via safeRetryFunc (prevents message loss on panic)
- Source validation (validateKeySegment) on all read/scan APIs, not just Enqueue
- PruneExhaustedMessages distinguishes empty-queue from real errors, propagates both Dequeue and Enqueue failures

Documentation:
- certificate/doc.go: example now handles LoadFromFiles and Rotate errors
- idempotency/doc.go: example now handles redis.New error
- webhook/doc.go: clarifies timestamp is unsigned and insufficient for replay protection
- webhook/errors.go: broadened ErrSSRFBlocked and ErrInvalidURL godoc to match actual scope

Webhook delivery:
- Pre-populate DeliveryResult with EndpointID before goroutine launch (panic-safe)

Systemplane:
- env_or_default tests: explicit t.Setenv(EnvBackend, "") to isolate from ambient env
- ValidateOption nil guard in newValidateConfig to prevent panic
- ValidateKeyDefs godoc: removed misleading "append to returned slice" guidance

X-Lerian-Ref: 0x1

* fix: address second-round CodeRabbit review feedback

- certificate/doc.go: document EC (SEC 1) key fallback alongside PKCS#8 and PKCS#1
- dlq/handler.go: add backslash to validateKeySegment disallowed characters (Redis escape char)
- idempotency/doc.go: clarify tenant-isolation is scoped when tenant is present, global otherwise
- validate.go: correct godoc to reference ValidateKeyDefsWithOptions as the options entry point
- ssrf.go: panic on invalid hardcoded CIDR in init() instead of silently skipping

X-Lerian-Ref: 0x1

* refactor(webhook): replace runtime CIDR parsing with static net.IPNet literals

Eliminates all net.ParseCIDR calls, the self-invoking cgnatBlock initializer, and the init() panic path. SSRF blocklist entries are now compile-time-constructed via a cidr4 helper, so typos surface as test failures rather than startup crashes.

X-Lerian-Ref: 0x1
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 29, 2026

Walkthrough

Adds multiple new commons packages and features: a thread-safe TLS certificate manager with hot-rotation (commons/certificate); a Redis-backed tenant-scoped dead-letter queue and background consumer (commons/dlq); an at-most-once idempotency Fiber middleware backed by Redis (commons/net/http/idempotency); and a concurrent webhook deliverer with SSRF protection, DNS pinning, HMAC signing, and retry/backoff (commons/webhook). Also introduces a shared systemplane catalog and validation framework, config/catalog key additions (Postgres, Redis, CORS, auth, telemetry, RabbitMQ, server TLS and HTTP sizing), secret master-key handling and encryption notes, snapshot/config helpers, component-diffing, and various manager/refactor changes and tests.

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Client
participant Deliverer
participant EndpointLister
participant DNS as "DNS Resolver"
participant HTTP as "HTTP Client"
participant Endpoint
Client->>Deliverer: Deliver(event)
Deliverer->>EndpointLister: ListActiveEndpoints()
EndpointLister-->>Deliverer: []Endpoint
loop for each endpoint (concurrent, limited by semaphore)
Deliverer->>DNS: resolveAndValidateIP(endpoint.URL)
DNS-->>Deliverer: pinnedIP, hostname / ErrSSRFBlocked
alt resolved and allowed
Deliverer->>Deliverer: decrypt secret (optional)
Deliverer->>HTTP: POST pinnedIP with SNI/Host header and headers (signed if secret)
HTTP-->>Deliverer: status code / response
Deliverer->>Endpoint: record DeliveryResult (status, attempts, success/failure)
else SSRF blocked or secret decrypt error
Deliverer-->>Endpoint: record DeliveryResult (error)
end
end
Deliverer-->>Client: aggregated result (or nil error)

Suggested reviewers

  • ClaraTersi
  • qnen
  • gandalf-at-lerian

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 24

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@commons/certificate/certificate_test.go`:
- Around line 863-878: Tests reveal TLSCertificate() is not nil-receiver-safe
while other Manager methods are; update the production code to either document
this asymmetry or make the method nil-receiver-safe: either add a godoc note on
TLSCertificate explaining it must not be called on a nil *Manager, or move the
nil-check so TLSCertificate performs "if m == nil" before acquiring the RLock
(adjust function TLSCertificate in the Manager implementation accordingly);
ensure behavior remains consistent with NewManager-created empty managers and
update any related comments/tests as needed.

In `@commons/certificate/certificate.go`:
- Around line 125-135: Get rid of exposing writable internal state by returning
defensive copies: in Manager.GetCertificate() acquire the read lock,
clone/deep-copy m.cert into a new *x509.Certificate instance (do the deep copy
while holding the lock) and return the copy instead of m.cert; similarly update
Manager.TLSCertificate() (and the code around m.chain/m.cert at the 201-217
region) to build and return a new tls.Certificate whose Certificate [][]byte is
a deep copy of m.chain bytes and whose Leaf is a cloned copy of m.cert (or nil)
so callers never receive references aliasing m.chain or m.cert; ensure copies
are made while holding m.mu.RLock() and release the lock before returning.
- Around line 58-75: Rotate is currently overwriting m.chain with only the leaf
DER (cert.Raw), dropping intermediates; update Rotate (and the similar block at
the second occurrence) to preserve the full DER chain returned by loadFromFiles
by assigning the full chain value (the chain variable returned by loadFromFiles)
to m.chain instead of cert.Raw so TLSCertificate() continues to serve the
complete certificate chain after hot reloads; ensure the same fix is applied to
both places where m.chain is set during load/rotate.

In `@commons/dlq/consumer.go`:
- Around line 167-183: The Run method leaves c.stopCh set after the goroutine
exits via ctx.Done(), causing future Run calls to think it's still running;
modify Run so that every exit path clears the stored channel: on goroutine exit
(after the main loop that watches ctx.Done()) and in the other early-exit path
referenced (the branch around the 194-203 region), acquire c.stopMu and set
c.stopCh = nil (and/or close it if intended) before returning so subsequent Run
invocations can create a fresh channel; ensure you use the existing c.stopMu
lock/unlock around these assignments to avoid races and reference c.stopCh,
c.stopMu, and Run() when making the change.
- Around line 386-394: The re-enqueue failure paths in the DLQ consumer (inside
Dequeue where it checks msg.NextRetryAt and the later branch around lines
416-437) currently only log errors and continue, which causes a popped message
to be silently lost on Redis failures; change these branches that call
c.handler.Enqueue(ctx, msg) to treat Enqueue errors as hard failures by
returning an error (or otherwise propagating failure to the Dequeue caller)
instead of just logging, ensuring the caller knows the message was not requeued;
update both the NextRetryAt branch and the later requeue branch so they return
an error when c.handler.Enqueue fails (and fix any callers to handle the
propagated error).

In `@commons/net/http/idempotency/doc.go`:
- Around line 1-2: The package-level godoc for package idempotency currently
promises “at-most-once” semantics but the implementation fails open on Redis
outages; update the package comment to describe “best-effort idempotency” (so
callers don’t rely on a strict guarantee) OR add a configurable failure-policy
option (e.g., add a FailureMode / FailOnRedisError flag on the exported Options
and honor it in the idempotency middleware constructor such as NewMiddleware or
whatever exported constructor is used) and default that flag to fail-closed for
mutation routes; ensure the package godoc and the exported constructor/options
(e.g., NewMiddleware, Options, FailureMode) are updated to reflect the behavior
and default.

In `@commons/net/http/idempotency/idempotency.go`:
- Around line 265-290: The response cache write and the state-marker update must
be done atomically to avoid a crash leaving a "processing" state without a
cached response; modify the cache path in the idempotency middleware (the block
that builds cachedResponse and calls client.Set(ctx, responseKey, ...)) to use a
Redis MULTI/EXEC or pipeline on the same Redis client (referencing client,
responseKey, stateKey, and m.keyTTL) so both the responseKey and the stateKey
update are enqueued and executed together; ensure error handling logs via
m.logger (log.LevelWarn, log.String("key", responseKey), log.Err(setErr)) if the
transaction/pipeline fails and fall back to the existing behavior only on
transaction failure.

In `@commons/systemplane/bootstrap/backend_test.go`:
- Around line 359-360: The test directly reads backendRegistry.factories and
backendRegistry.initErrors (e.g., the lines that reference factory :=
backendRegistry.factories[testKind]) which bypasses the registry mutex; change
these assertions to call backendRegistry.snapshot() and assert against the
returned snapshot's factories and initErrors (e.g., snapshot :=
backendRegistry.snapshot(); snapshot.factories[testKind] and
snapshot.initErrors) so the test honors the synchronization contract; apply the
same change to the other occurrences noted (around lines 379-380 and 408-415).

In `@commons/systemplane/bootstrap/backend.go`:
- Around line 162-163: The call to validateBackendResources returns sentinel
errors but is returned directly, losing which backend produced the failure; wrap
the returned error with the backend kind before returning so errors.Is stays
intact (e.g., return nil, fmt.Errorf("backend %s: %w",
<backend-kind-identifier>, err)). Update the code path where
validateBackendResources(...) is called in backend.go to wrap the error using %w
and reference the backend kind field/variable used in this scope (e.g., b.Kind
or backend.Kind) so callers see which backend failed.

In `@commons/systemplane/catalog/validate.go`:
- Around line 240-242: The small wrapper function containsString simply forwards
to slices.Contains and should be inlined: remove the containsString function and
replace all its call sites with direct calls to slices.Contains(values, target)
(e.g., the current call site that uses containsString). This reduces indirection
and dead code while preserving behavior; ensure imports remain correct and run
tests to verify no references to containsString remain.

In `@commons/systemplane/domain/coercion_helpers.go`:
- Around line 109-126: The overflow check in intFromFloat64 currently uses the
hardcoded 1<<63 boundary which is wrong on 32-bit platforms; update the bounds
check to use the platform-specific values by comparing truncated against
float64(math.MaxInt) and float64(math.MinInt) (e.g., replace the "if truncated
>= 1<<63 || truncated < math.MinInt" condition with a check like "if truncated >
float64(math.MaxInt) || truncated < float64(math.MinInt)" so conversions respect
the actual int size), keeping the rest of intFromFloat64 (truncated,
math.IsNaN/IsInf) unchanged.

In `@commons/systemplane/domain/config_helpers.go`:
- Around line 9-11: The exported helpers (e.g., SnapString) call
snap.ConfigValue(...) without guarding against a nil Snapshot, and ConfigValue
itself lacks a nil-receiver check; add a nil check to ConfigValue: in func (s
*Snapshot) ConfigValue(key string, fallback any) any return fallback immediately
if s == nil (this prevents s.GetConfig from panicking), and optionally ensure
the callers SnapString/SnapInt/etc. rely on that nil-safe ConfigValue rather
than duplicating checks; reference ConfigValue, Snapshot.GetConfig, and helper
functions like SnapString when making the change.

In `@commons/systemplane/domain/setting_helpers.go`:
- Around line 43-57: SnapSettingBool can panic if the returned raw setting is
nil (or its Value is nil) before calling tryCoerceBool; update the function to
guard the tenant and global branches by checking raw != nil (and raw.Value !=
nil if Value can be nil) before passing raw.Value to tryCoerceBool.
Specifically, modify the branches that call snap.GetTenantSetting and
snap.GetGlobalSetting in SnapSettingBool to only call tryCoerceBool when ok is
true AND raw is non-nil (and raw.Value is non-nil when applicable), otherwise
fall through to the next check or return fallback.
- Around line 25-39: SnapSettingInt lacks nil-safety: ensure you guard against a
nil Snapshot and nil returned raw settings before accessing raw.Value. Update
SnapSettingInt to first check snap != nil, then when calling
snap.GetTenantSetting and snap.GetGlobalSetting only proceed if ok && raw !=
nil, and pass raw.Value to tryCoerceInt; otherwise fall through to the fallback.
Keep the function name SnapSettingInt and the existing tryCoerceInt usage but
add the nil checks around snap and raw to prevent nil dereferences.

In `@commons/systemplane/domain/snapshot_from_keydefs.go`:
- Around line 23-30: The code currently assigns the same def.DefaultValue
reference to both EffectiveValue.Value and EffectiveValue.Default causing
aliasing for mutable defaults; modify the snapshot_from_keydefs logic that
builds configs[def.Key] (the EffectiveValue struct population) to deep-clone
def.DefaultValue for the runtime Value and, separately, for the stored Default
so mutations to Value cannot affect Default; add a nil-safe, recursive helper
(e.g., cloneRuntimeValue) in the package that handles map[string]any and []any
by copying recursively and returns primitives as-is, and call that helper when
assigning both Value and Default (also preserve existing Redacted, Source,
Revision assignments).

In `@commons/systemplane/ports/authorizer_defaults.go`:
- Around line 12-14: Update the godoc for AllowAllAuthorizer to reflect its
actual checks: state that it permits operations only when the receiver is
non-nil, the provided context is non-nil, and the permission string is
non-empty; explicitly note that it will return an error/deny for a nil receiver,
nil context, or blank permission. Reference the AllowAllAuthorizer type and its
Authorize behavior so readers understand the preconditions required for it to
"allow all" operations.

In `@commons/systemplane/ports/identity_defaults_test.go`:
- Around line 153-211: Add a boundary unit test named
TestFuncIdentityResolver_TenantID_ExactMaxLength_Allowed that constructs exactID
:= strings.Repeat("t", maxTenantIDLength), creates a FuncIdentityResolver with
TenantFunc returning exactID, calls resolver.TenantID(context.Background()), and
asserts no error and that the returned tenant equals exactID; this mirrors the
Actor exact-length test and verifies the TenantID length check uses strict
greater-than rather than greater-or-equal.

In `@commons/systemplane/service/supervisor.go`:
- Around line 215-226: You’re loading supervisor.state twice — once here to
build prevSnap/tenantIDs and again inside prepareReloadBuild — so change
prepareReloadBuild to accept the already-loaded state (or at least prevSnap and
previousBundle) instead of re-loading atomically; update the call site (replace
prepareReloadBuild(ctx, tenantIDs) with prepareReloadBuild(ctx, tenantIDs, st)
or prepareReloadBuild(ctx, tenantIDs, prevSnap)) and adjust the
prepareReloadBuild signature and its internal uses to use the provided
state/prevSnap/previousBundle rather than calling supervisor.state.Load() again.

In `@commons/webhook/deliverer.go`:
- Around line 562-574: sanitizeURL currently only strips query params but can
still leak credentials via the URL Userinfo or by returning the raw string on
parse failure; update sanitizeURL to clear u.User (set to nil) before returning
u.String() and, on parse error, return a redacted fallback (e.g., strip any
userinfo pattern) instead of the raw input so credentials aren’t logged. Locate
the sanitizeURL function and ensure you remove userinfo by clearing u.User and
handle the parse-failure branch to return a log-safe string rather than rawURL.
- Around line 105-111: WithHTTPClient currently assigns the caller's
*http.Client directly, bypassing the redirect protection from defaultHTTPClient;
instead, clone the provided client (copy its struct fields) into a new
http.Client, set its CheckRedirect to a function that returns
http.ErrUseLastResponse to block redirects, and then assign that clone to
Deliverer.client (preserving timeouts, Transport, Jar, etc. but not the original
pointer) so custom clients cannot re-enable redirects; keep the existing nil
check in WithHTTPClient and reference the WithHTTPClient function and
Deliverer.client when making the change.
- Around line 450-455: The code sets req.Host to originalHost after rewriting
the URL to a pinned IP, which doesn't preserve TLS SNI and will break
certificate validation for HTTPS; update the HTTP client logic that performs the
pinning (where pinnedURL and req are used) to avoid modifying req.Host for TLS
purposes and instead create a custom http.Transport with a DialContext that
connects to the pinned IP while setting TLSClientConfig.ServerName to the
original hostname (and ensure the Transport is used by the request client),
e.g., locate the pinning logic around pinnedURL/req.Host in deliverer.go and
replace the simple req.Host assignment with a Transport-based solution that
preserves the Host header for HTTP and sets ServerName for TLS.

In `@commons/webhook/doc.go`:
- Around line 20-29: The documentation warns that X-Webhook-Timestamp is
unsigned and recommends adding a freshness value to prevent replay; update the
webhook signing and verification design to adopt a versioned signature format
(e.g., include a version prefix and combine timestamp with payload in the HMAC
input such as "v<n>:timestamp.payload") and update doc.go text to describe the
new format and required verification steps (verify version, include timestamp in
HMAC input, and enforce an allowed clock skew/window or nonce/event-ID
tracking). Modify any signing/verification functions that reference
X-Webhook-Signature or X-Webhook-Timestamp to consume and produce the versioned
signature string and to validate freshness before accepting a webhook. Ensure
backward compatibility notes are added explaining the migration path for
existing consumers.

In `@commons/webhook/ssrf_test.go`:
- Around line 132-155: TestResolveAndValidateIP_AllowedSchemes depends on live
DNS because resolveAndValidateIP wraps DNS failures with ErrSSRFBlocked; to fix,
separate the URL scheme check into a pure function (e.g., isAllowedScheme or
validateScheme) and update resolveAndValidateIP to call it, then change this
unit test to call the pure helper instead of resolveAndValidateIP;
alternatively, make resolveAndValidateIP accept an injectable resolver interface
and pass a mock resolver from the test so the scheme-check can be asserted
hermetically without performing real DNS resolution.

In `@commons/webhook/ssrf.go`:
- Around line 58-60: The function that builds and returns pinnedURL and
originalHost currently uses u.Hostname(), which drops explicit ports and thus
cannot be reused as the HTTP Host header; update the code so you preserve the
original authority (u.Host) as the value returned for the Host header (e.g.,
keep originalHost = u.Host) and use u.Hostname() only for the SNI/TLS hostname
(e.g., sniName = u.Hostname()). Ensure callers use the preserved originalHost
for req.Host and the SNI-only value for TLS config.ServerName; adjust any return
tuple/struct (pinnedURL, originalHost) and related variable names to reflect
both values so ports are not lost.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3057fce9-fdc1-42db-a6a8-9f62a8d2d232

📥 Commits

Reviewing files that changed from the base of the PR and between 439c777 and 41dedc2.

📒 Files selected for processing (82)
  • .env.reference
  • AGENTS.md
  • README.md
  • commons/certificate/certificate.go
  • commons/certificate/certificate_test.go
  • commons/certificate/doc.go
  • commons/dlq/consumer.go
  • commons/dlq/consumer_test.go
  • commons/dlq/doc.go
  • commons/dlq/errors.go
  • commons/dlq/handler.go
  • commons/dlq/handler_test.go
  • commons/net/http/idempotency/doc.go
  • commons/net/http/idempotency/idempotency.go
  • commons/net/http/idempotency/idempotency_test.go
  • commons/systemplane/adapters/http/fiber/dto.go
  • commons/systemplane/adapters/http/fiber/dto_test.go
  • commons/systemplane/bootstrap/apply_keydefs_test.go
  • commons/systemplane/bootstrap/backend.go
  • commons/systemplane/bootstrap/backend_test.go
  • commons/systemplane/bootstrap/config.go
  • commons/systemplane/bootstrap/env.go
  • commons/systemplane/bootstrap/env_or_default_test.go
  • commons/systemplane/catalog/doc.go
  • commons/systemplane/catalog/keys_postgres.go
  • commons/systemplane/catalog/keys_redis.go
  • commons/systemplane/catalog/keys_shared.go
  • commons/systemplane/catalog/shared_key.go
  • commons/systemplane/catalog/validate.go
  • commons/systemplane/catalog/validate_options_test.go
  • commons/systemplane/catalog/validate_test.go
  • commons/systemplane/domain/coercion_helpers.go
  • commons/systemplane/domain/config_helpers.go
  • commons/systemplane/domain/key_def.go
  • commons/systemplane/domain/key_def_test.go
  • commons/systemplane/domain/setting_helpers.go
  • commons/systemplane/domain/snapshot_config_helpers_test.go
  • commons/systemplane/domain/snapshot_duration_slice_helpers_test.go
  • commons/systemplane/domain/snapshot_from_keydefs.go
  • commons/systemplane/domain/snapshot_from_keydefs_test.go
  • commons/systemplane/domain/snapshot_setting_helpers_test.go
  • commons/systemplane/domain/snapshot_test_helpers_test.go
  • commons/systemplane/ports/authorizer_defaults.go
  • commons/systemplane/ports/authorizer_defaults_test.go
  • commons/systemplane/ports/identity_defaults.go
  • commons/systemplane/ports/identity_defaults_test.go
  • commons/systemplane/service/component_diff.go
  • commons/systemplane/service/component_diff_test.go
  • commons/systemplane/service/manager.go
  • commons/systemplane/service/manager_helpers.go
  • commons/systemplane/service/manager_helpers_test.go
  • commons/systemplane/service/manager_read_helpers.go
  • commons/systemplane/service/manager_reads.go
  • commons/systemplane/service/manager_reads_test.go
  • commons/systemplane/service/manager_schema_test.go
  • commons/systemplane/service/manager_test.go
  • commons/systemplane/service/manager_writes.go
  • commons/systemplane/service/manager_writes_helpers.go
  • commons/systemplane/service/shutdown.go
  • commons/systemplane/service/shutdown_test.go
  • commons/systemplane/service/snapshot_builder.go
  • commons/systemplane/service/snapshot_builder_test.go
  • commons/systemplane/service/supervisor.go
  • commons/systemplane/service/supervisor_helpers.go
  • commons/systemplane/swagger/spec.json
  • commons/systemplane/swagger/swagger.go
  • commons/systemplane/swagger/swagger_test.go
  • commons/tenant-manager/client/client.go
  • commons/tenant-manager/consumer/multi_tenant_optional_rabbitmq_test.go
  • commons/tenant-manager/mongo/manager.go
  • commons/tenant-manager/mongo/manager_test.go
  • commons/tenant-manager/postgres/manager.go
  • commons/tenant-manager/rabbitmq/manager.go
  • commons/tenant-manager/rabbitmq/manager_test.go
  • commons/webhook/deliverer.go
  • commons/webhook/deliverer_test.go
  • commons/webhook/doc.go
  • commons/webhook/errors.go
  • commons/webhook/ssrf.go
  • commons/webhook/ssrf_test.go
  • commons/webhook/types.go
  • docs/PROJECT_RULES.md
💤 Files with no reviewable changes (2)
  • commons/systemplane/service/manager_reads_test.go
  • commons/systemplane/service/manager_test.go

Comment on lines +863 to +878
t.Run("TLSCertificate returns empty", func(t *testing.T) {
t.Parallel()
// TLSCertificate checks m == nil after acquiring the lock which it
// can't do on a nil receiver — production code checks m == nil inside
// the method body after the RLock, but since m is nil the lock call
// itself will panic. The method guards with `if m == nil` but the
// guard is AFTER the RLock. Let's verify by calling it safely.
// Actually the method body is: mu.RLock then `if m == nil` — a nil
// pointer dereference would occur. We document that TLSCertificate
// is NOT nil-receiver-safe (unlike the other methods) by asserting
// the returned value from a non-nil empty manager instead.
empty, emptyErr := NewManager("", "")
require.NoError(t, emptyErr)
tlsCert := empty.TLSCertificate()
assert.Equal(t, tls.Certificate{}, tlsCert)
})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Document the nil-receiver inconsistency for TLSCertificate.

The comment correctly identifies that TLSCertificate() is not nil-receiver-safe while other methods are. This asymmetry may surprise callers. Consider adding this caveat to the method's godoc in the production code, or restructuring the guard to check m == nil before acquiring the lock (if feasible).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/certificate/certificate_test.go` around lines 863 - 878, Tests reveal
TLSCertificate() is not nil-receiver-safe while other Manager methods are;
update the production code to either document this asymmetry or make the method
nil-receiver-safe: either add a godoc note on TLSCertificate explaining it must
not be called on a nil *Manager, or move the nil-check so TLSCertificate
performs "if m == nil" before acquiring the RLock (adjust function
TLSCertificate in the Manager implementation accordingly); ensure behavior
remains consistent with NewManager-created empty managers and update any related
comments/tests as needed.

Comment on lines +125 to +135
// GetCertificate returns the current certificate, or nil if none is loaded.
func (m *Manager) GetCertificate() *x509.Certificate {
if m == nil {
return nil
}

m.mu.RLock()
defer m.mu.RUnlock()

return m.cert
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

These accessors expose mutable internal state.

GetCertificate() returns the live *x509.Certificate, and TLSCertificate() returns a struct whose Certificate slice and Leaf pointer still alias m.chain/m.cert. Callers can mutate manager state after the lock is released, which violates the “safe for concurrent use” contract on Manager. Return defensive copies or narrow the API so callers never get writable references to the backing state.

Also applies to: 201-217

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/certificate/certificate.go` around lines 125 - 135, Get rid of
exposing writable internal state by returning defensive copies: in
Manager.GetCertificate() acquire the read lock, clone/deep-copy m.cert into a
new *x509.Certificate instance (do the deep copy while holding the lock) and
return the copy instead of m.cert; similarly update Manager.TLSCertificate()
(and the code around m.chain/m.cert at the 201-217 region) to build and return a
new tls.Certificate whose Certificate [][]byte is a deep copy of m.chain bytes
and whose Leaf is a cloned copy of m.cert (or nil) so callers never receive
references aliasing m.chain or m.cert; ensure copies are made while holding
m.mu.RLock() and release the lock before returning.

Comment on lines +562 to +574
// sanitizeURL strips query parameters from a URL before logging to prevent
// credential leakage. Webhook URLs may carry tokens in query params
// (e.g., ?token=..., ?api_key=...) that must not appear in log output.
// On parse failure the raw string is returned unchanged so no log line is lost.
func sanitizeURL(rawURL string) string {
u, err := url.Parse(rawURL)
if err != nil {
return rawURL
}

u.RawQuery = ""

return u.String()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Redact userinfo before logging webhook URLs.

This only strips the query string. u.String() still includes user:password@host, and the raw fallback on parse errors can also leak credentials. Clear u.User before returning a log-safe value.

🛡️ Harden URL redaction in logs
 func sanitizeURL(rawURL string) string {
 	u, err := url.Parse(rawURL)
 	if err != nil {
-		return rawURL
+		return "<invalid-url>"
 	}
 
+	u.User = nil
 	u.RawQuery = ""
+	u.Fragment = ""
 
 	return u.String()
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/webhook/deliverer.go` around lines 562 - 574, sanitizeURL currently
only strips query params but can still leak credentials via the URL Userinfo or
by returning the raw string on parse failure; update sanitizeURL to clear u.User
(set to nil) before returning u.String() and, on parse error, return a redacted
fallback (e.g., strip any userinfo pattern) instead of the raw input so
credentials aren’t logged. Locate the sanitizeURL function and ensure you remove
userinfo by clearing u.User and handle the parse-failure branch to return a
log-safe string rather than rawURL.

Comment on lines +20 to +29
// The X-Webhook-Signature header carries HMAC-SHA256 computed over the raw
// payload bytes only. X-Webhook-Timestamp is sent as a separate informational
// header but is NOT covered by the HMAC — an attacker who captures a valid
// payload+signature pair can replay it with a fresh timestamp.
//
// Receivers who need replay protection must implement it independently, for
// example by tracking event IDs or embedding a nonce in the payload itself.
// Timestamp-window checks alone are insufficient because the timestamp is
// unsigned. Including the timestamp in the HMAC would be a breaking change
// for existing consumers.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Include a freshness value in the signature input.

Lines 20-29 describe a scheme where X-Webhook-Timestamp is unsigned, so a captured body+signature pair can be replayed indefinitely with a new timestamp. If compatibility matters, ship a versioned signature format now (for example timestamp.payload) rather than releasing v4 with replayable signatures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/webhook/doc.go` around lines 20 - 29, The documentation warns that
X-Webhook-Timestamp is unsigned and recommends adding a freshness value to
prevent replay; update the webhook signing and verification design to adopt a
versioned signature format (e.g., include a version prefix and combine timestamp
with payload in the HMAC input such as "v<n>:timestamp.payload") and update
doc.go text to describe the new format and required verification steps (verify
version, include timestamp in HMAC input, and enforce an allowed clock
skew/window or nonce/event-ID tracking). Modify any signing/verification
functions that reference X-Webhook-Signature or X-Webhook-Timestamp to consume
and produce the versioned signature string and to validate freshness before
accepting a webhook. Ensure backward compatibility notes are added explaining
the migration path for existing consumers.

Comment on lines +132 to +155
func TestResolveAndValidateIP_AllowedSchemes(t *testing.T) {
t.Parallel()

// These schemes should pass the scheme check (they may still fail DNS
// resolution, but they should NOT fail with ErrSSRFBlocked for scheme).
tests := []struct {
name string
url string
}{
{name: "http scheme", url: "http://example.com"},
{name: "https scheme", url: "https://example.com"},
}

for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
t.Parallel()

_, _, err := resolveAndValidateIP(context.Background(), tt.url)
// The error, if any, should NOT be ErrSSRFBlocked for scheme reasons.
if err != nil {
assert.False(t, errors.Is(err, ErrSSRFBlocked), "http/https should not be SSRF-blocked: %v", err)
}
})
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

This unit test still depends on live DNS and can fail offline.

resolveAndValidateIP wraps DNS lookup failures with ErrSSRFBlocked, so the assertion on Lines 149-153 turns example.com resolution into a hidden prerequisite. In CI environments without outbound DNS, this test fails even when the allowed-scheme check is correct. Split scheme validation into a pure helper or inject a resolver so this case stays hermetic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/webhook/ssrf_test.go` around lines 132 - 155,
TestResolveAndValidateIP_AllowedSchemes depends on live DNS because
resolveAndValidateIP wraps DNS failures with ErrSSRFBlocked; to fix, separate
the URL scheme check into a pure function (e.g., isAllowedScheme or
validateScheme) and update resolveAndValidateIP to call it, then change this
unit test to call the pure helper instead of resolveAndValidateIP;
alternatively, make resolveAndValidateIP accept an injectable resolver interface
and pass a mock resolver from the test so the scheme-check can be asserted
hermetically without performing real DNS resolution.

Comment on lines +58 to +60
// On success it returns:
// - pinnedURL — original URL with the hostname replaced by the first resolved IP.
// - originalHost — the original hostname, for use as the HTTP Host header (TLS SNI).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Return the original authority separately from the SNI hostname.

url.URL.Hostname() on Line 76 strips explicit ports, so the value described on Lines 58-60 cannot safely be reused as the HTTP Host header. For https://example.com:8443/..., callers would end up with example.com instead of example.com:8443, which can break virtual-host routing on non-default ports. Keep u.Host for req.Host and u.Hostname() only for TLS SNI.

Also applies to: 76-79

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/webhook/ssrf.go` around lines 58 - 60, The function that builds and
returns pinnedURL and originalHost currently uses u.Hostname(), which drops
explicit ports and thus cannot be reused as the HTTP Host header; update the
code so you preserve the original authority (u.Host) as the value returned for
the Host header (e.g., keep originalHost = u.Host) and use u.Hostname() only for
the SNI/TLS hostname (e.g., sniName = u.Hostname()). Ensure callers use the
preserved originalHost for req.Host and the SNI-only value for TLS
config.ServerName; adjust any return tuple/struct (pinnedURL, originalHost) and
related variable names to reflect both values so ports are not lost.

…ook, idempotency, and systemplane (#420)

* fix: address CodeRabbit review findings across certificate, dlq, webhook, idempotency, and systemplane packages

- certificate: preserve intermediate chain in Rotate via variadic intermediates,
  deep-copy DER chain in TLSCertificate to prevent aliasing, add LoadFromFilesWithChain
- dlq/consumer: fix Run restart after ctx shutdown by clearing stopCh on exit,
  escalate re-enqueue failures to ERROR with metrics (prevents silent message loss)
- webhook: enforce redirect blocking on custom HTTP clients (SSRF protection),
  fix TLS SNI for HTTPS endpoints after DNS pinning via ServerName on cloned transport
- idempotency: use Redis pipeline for atomic response cache + state marker writes,
  correct godoc from at-most-once to best-effort idempotency (fail-open on Redis outage)
- systemplane/domain: add nil-receiver guard to ConfigValue and setting helpers,
  clone mutable defaults in DefaultSnapshotFromKeyDefs to prevent Value/Default aliasing,
  use platform-independent overflow bounds in intFromFloat64 and scaleDurationFloat64
- systemplane/bootstrap: wrap resource-validation errors with backend kind context
- systemplane/catalog: inline containsString to direct slices.Contains call
- systemplane/ports: fix AllowAllAuthorizer godoc to reflect fail-closed behavior,
  add TenantID exact-max-length boundary test
- systemplane/bootstrap: use snapshot() in tests to honor registry mutex contract

* fix: address follow-up CodeRabbit findings on PR #420

- certificate: deep-copy intermediate DER bytes in Rotate to prevent caller mutation
- dlq/consumer: simplify Run() overlap guard to reject when stopCh is non-nil
  (prevents concurrent loops while previous goroutine drains safeProcessOnce);
  lost not-yet-ready messages now count against BatchSize (return true)
- idempotency: combine two pipe.Del calls into single variadic Del(ctx, key, responseKey)
- webhook: enforce minimum TLS 1.2 on cloned transport even when caller config is weaker
- systemplane/domain: add nil-receiver guards to GlobalSettingValue and TenantSettingValue;
  add comment explaining intentional narrower clone scope vs reflection-based version

* fix: address second-round CodeRabbit findings on PR #420

- certificate: deep-copy leaf DER (cert.Raw) in Rotate to prevent aliasing;
  add chain[1:]... usage example to Rotate and LoadFromFilesWithChain godocs
- dlq/consumer: record span error via HandleSpanError on both terminal
  message-loss paths (not-yet-ready re-enqueue failure and retry re-enqueue failure)
- idempotency: correct saveResult godoc from 'atomically' to 'in a single round-trip'
  (Pipeline batches but does not provide MULTI/EXEC transactional atomicity)
- webhook: normalize nil Transport to http.DefaultTransport before type assertion
  in httpsClientForPinnedIP; clone d.client instead of building from scratch
  to preserve Jar and other caller-configured fields

* fix: address third-round CodeRabbit findings on PR #420

Deep-copy leaf certificate in Rotate to prevent aliasing caller-owned
memory, document shared Leaf pointer in TLSCertificate godoc, and
remove high-cardinality composite key from idempotency warning logs.

* fix: add RecordLost to DLQMetrics to distinguish message loss from exhaustion

Adds a dedicated RecordLost method to the DLQMetrics interface so operators
can separately alert on infrastructure-caused message loss (re-enqueue
failures) versus expected exhaustion (max retries reached).

* fix: split multi-key DEL to avoid Redis Cluster CROSSSLOT error in idempotency

Split pipe.Del(ctx, key, responseKey) into two single-key DEL calls
within the same pipeline to prevent CROSSSLOT errors when keys hash
to different Redis Cluster slots.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

♻️ Duplicate comments (3)
commons/systemplane/domain/snapshot_from_keydefs.go (1)

41-68: ⚠️ Potential issue | 🟠 Major

Broaden cloning beyond map[string]any / []any to avoid remaining mutable aliasing.

cloneRuntimeValue currently misses typed mutable containers (for example []string, map[string]string). If any KeyDef.DefaultValue uses those, Value and Default can still share backing data and mutate together.

🔧 Suggested direction
 func cloneRuntimeValue(v any) any {
 	switch x := v.(type) {
 	case map[string]any:
 		out := make(map[string]any, len(x))
 		for k, vv := range x {
 			out[k] = cloneRuntimeValue(vv)
 		}
 		return out
 	case []any:
 		out := make([]any, len(x))
 		for i, vv := range x {
 			out[i] = cloneRuntimeValue(vv)
 		}
 		return out
+	case []string:
+		return append([]string(nil), x...)
+	case map[string]string:
+		out := make(map[string]string, len(x))
+		for k, vv := range x {
+			out[k] = vv
+		}
+		return out
 	default:
 		return v
 	}
 }

Use this read-only check to confirm whether typed mutable defaults are present in key definitions:

#!/bin/bash
set -euo pipefail

# Find KeyDef default literals that are mutable but not covered by cloneRuntimeValue.
rg -nP --type go -C2 'DefaultValue\s*:\s*\[\]string|DefaultValue\s*:\s*map\[string\]string|DefaultValue\s*:\s*\[\][A-Za-z_]\w*|DefaultValue\s*:\s*map\[[^]]+\][A-Za-z_]\w*'

Expected result: any match indicates current clone coverage can still alias mutable defaults.

As per coding guidelines "Keep behavior nil-safe and concurrency-safe by default in exported APIs".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/systemplane/domain/snapshot_from_keydefs.go` around lines 41 - 68,
cloneRuntimeValue currently only handles map[string]any and []any so typed
mutable containers (e.g. []string, map[string]string or other []T/map[K]V) in
KeyDef.DefaultValue can still alias; update cloneRuntimeValue to detect any
slice or map via reflection (reflect.Kind == Slice or Map), create a new value
of the same concrete type and capacity, iterate elements/keys, recursively clone
each element/value by calling cloneRuntimeValue, and set them into the new
reflected container; ensure nil slices/maps are preserved as nil, preserve the
original concrete type (so []T stays []T and map[K]V stays map[K]V), and keep
the existing fast path for map[string]any / []any if you prefer for performance.
commons/dlq/consumer.go (1)

392-407: ⚠️ Potential issue | 🔴 Critical

Re-enqueue failures are still being swallowed after Dequeue.

RecordLost is useful telemetry, but both branches still return true after a destructive pop. A transient Redis write error therefore becomes permanent message loss while the poll cycle looks successful to the caller. Bubble the failure out of processMessage/drainSource, or move this path to a pending-list/LMOVE flow.

As per coding guidelines "Do not swallow errors; return or handle errors with context".

Also applies to: 440-455

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/dlq/consumer.go` around lines 392 - 407, The code swallows re-enqueue
failures after a destructive pop by returning true; change this so Enqueue
errors are propagated instead of treated as success: in the block handling
msg.NextRetryAt (and the similar block around lines 440-455), remove the
unconditional return true and instead return the Enqueue error (or wrap it with
context) from processMessage/drainSource so the caller sees the failure; keep
logging/RecordLost/metrics but ensure c.handler.Enqueue(ctx, msg) failures cause
processMessage/drainSource to return an error (or a distinct status) rather than
indicating successful processing.
commons/net/http/idempotency/idempotency.go (1)

171-173: ⚠️ Potential issue | 🟠 Major

These two-key updates are still non-atomic with the current key layout.

key and responseKey are separate raw Redis keys, so with the current construction they land in different cluster slots unless callers manually inject a shared hash tag via WithKeyPrefix. client.Pipeline() therefore only batches the success-path SETs and error-path DELs; it cannot guarantee they succeed together on a redis.UniversalClient. That leaves the same split-state failure mode previously raised here (processing without :response, or a stale :response after cleanup). Either co-locate both records with a hash tag and switch to TxPipeline, or collapse state + replay payload into one key. The “atomically” wording on Line 292 also overstates the current guarantee.

#!/bin/bash
python - <<'PY'
def crc16(data: bytes) -> int:
    crc = 0
    for b in data:
        crc ^= b << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def slot(key: str) -> int:
    lb = key.find("{")
    rb = key.find("}", lb + 1)
    hashtag = key[lb + 1:rb] if lb != -1 and rb > lb + 1 else key
    return crc16(hashtag.encode()) % 16384

for key in [
    "idempotency:tenant:abc",
    "idempotency:tenant:abc:response",
]:
    print(f"{key} -> slot {slot(key)}")
PY

Also applies to: 192-193, 255-257, 268-305

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/net/http/idempotency/idempotency.go` around lines 171 - 173, The two
Redis keys built from m.keyPrefix + tenantID + idempotencyKey and its
":response" sibling can land in different Redis cluster slots, so
client.Pipeline() cannot make their SET/DEL operations atomic; fix by
co-locating them and using a transactional pipeline or collapsing into one key:
either (A) change the key construction in the idempotency key builders (where
tenantID is derived via tmcore.GetTenantIDContext(c.UserContext()) and keys are
formed from m.keyPrefix and idempotencyKey) to include a shared Redis hash tag
(e.g., wrap the tenant-scoped portion in {...}) so both key and responseKey map
to the same slot, and replace client.Pipeline() calls with client.TxPipeline()
to execute SET/DEL under MULTI/EXEC; or (B) collapse state + response payload
into a single Redis value under the existing key and remove the separate
responseKey, updating the logic that reads/writes responseKey accordingly; also
update the "atomically" wording near the idempotency commit to reflect the
actual guarantee after you implement one of these fixes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@commons/certificate/certificate.go`:
- Around line 214-227: The doc comment for Manager.DaysUntilExpiry is incomplete
about expired certificates; update the godoc for the DaysUntilExpiry method to
state that it returns -1 when no certificate is loaded (m == nil or
ExpiresAt().IsZero()) and otherwise returns the number of days until expiry,
which may be negative for already expired certificates (e.g., -3 means expired 3
days ago); reference the Manager type, DaysUntilExpiry method, and ExpiresAt()
helper in the comment so readers know the exact behavior.

In `@commons/dlq/consumer.go`:
- Around line 109-115: The WithSources option currently assigns the caller's
slice directly into Consumer.cfg.Sources which can lead to races if the caller
mutates it; change WithSources (the returned func that takes *Consumer) to clone
the input slice before assignment (e.g., allocate a new []string of len(sources)
and copy into it) and then set c.cfg.Sources to that new slice so the consumer
has its own backing array and is safe for concurrent use.

In `@commons/dlq/handler.go`:
- Around line 166-177: Update the enqueue normalization to reject negative
MaxRetries by treating any non-positive value as uninitialized: in the block
that currently checks msg.MaxRetries == 0, change the condition to
msg.MaxRetries <= 0 and set msg.MaxRetries = h.maxRetries (preserving the
existing behavior for zero while preventing negative values from being passed
through). Keep the initialEnqueue/CreatedAt logic intact.
- Around line 179-188: Validate the chosen effectiveTenant before using
tenantScopedKeyForTenant: after computing effectiveTenant (using ctxTenant and
msg.TenantID) add a tenant key-segment validation (e.g. allow only
[A-Za-z0-9._-] or use an existing validateTenant/IsValidTenant function if
available) and if the tenant is invalid return an error instead of silently
falling back to the global key; ensure this change touches the same block using
ctxTenant, effectiveTenant, msg.TenantID and tenantScopedKeyForTenant so
tenant-scoped routing is only used for validated tenant IDs.

In `@commons/net/http/idempotency/doc.go`:
- Around line 9-14: The package godoc overstates duplicate-request behavior;
update the package documentation in doc.go to list the actual branches: when a
request is already in-flight handle() returns 409, when no cached response
exists callers receive the generic IDEMPOTENT payload rather than the original
response, HEAD requests are bypassed by handle(), oversized idempotency keys may
invoke the configured rejection handler (not always a 400), and successful
replays set the X-Idempotency-Replayed header to "true"; mention tenant-scoped
vs global key composition and reference these symbols (handle(), IDEMPOTENT,
X-Idempotency-Replayed, rejection handler) so readers see the exact behavior.

In `@commons/net/http/idempotency/idempotency.go`:
- Around line 221-225: In handleDuplicate, stop swallowing errors from
client.Get for keyValue and cached response; replace the ignored-error calls
with proper err checks: call client.Get(...).Result() into (keyValue, err) and
(cached, cacheErr), treat redis.Nil as “missing key” but for any other non-nil
error (timeouts/connection failures) log the error and fail open by calling
c.Next() (mirroring the SetNX error path) instead of fabricating an "IDEMPOTENT"
response; ensure only when cacheErr==nil and cached present you replay the
cached response, and when cacheErr==redis.Nil you proceed with normal 409/200
logic.
- Around line 23-28: The cachedResponse struct is lossy: change Body from string
to []byte and add a Headers field (map[string][]string) to persist response
headers like Location, ETag, Set-Cookie while keeping StatusCode and
ContentType; when replaying, write headers from cachedResponse.Headers and use
Send(...) (or the binary-safe response writer) instead of SendString(...) so
binary/non-UTF8 bodies are preserved and headers are faithfully replayed; update
all related usages (including the other cachedResponse definitions/usages
referenced around the other occurrences) to marshal/unmarshal the new struct and
to set headers before sending.

In `@commons/systemplane/bootstrap/backend_test.go`:
- Around line 410-412: The test currently asserts exact factory count with
assert.Len(t, factories, 1); instead assert that the specific factory registered
by the test setup is present to make the test resilient—call
backendRegistry.snapshot() as before, then replace the length assertion with an
assertion that the expected factory key (the identifier your test registers via
withRegistrySnapshot) exists in factories (e.g., using assert.Contains or an
equivalent map-key check), and keep assert.NotEmpty(t, initErrors) unchanged.

In `@commons/systemplane/catalog/validate.go`:
- Around line 150-164: buildCatalogIndexes currently lets later SharedKey
entries silently overwrite earlier env var mappings in envIndex; modify
buildCatalogIndexes to detect duplicates by checking if envVar already exists in
envIndex (compare existing SharedKey.Key to sk.Key) while iterating
allowedEnvVars(sk), collect conflicting envVars (and their previous/new
SharedKey keys) and surface them—either return an error/list of conflicts, or
emit a clear warning via your logger—so callers of buildCatalogIndexes (or
callers of SharedKey validation) can fail fast or be informed; reference
buildCatalogIndexes, SharedKey, keyIndex, envIndex, and allowedEnvVars when
making the change.

In `@commons/webhook/deliverer.go`:
- Around line 408-414: The metrics call uses attempt+1 which is inconsistent
with the result struct; change the recordMetrics invocation in the non-retryable
branch to use result.Attempts instead of attempt+1 so it matches the value used
elsewhere (e.g., later calls that pass result.Attempts), keeping the same
context variables (ep.ID, statusCode, ctx) and leaving result.Error assignment
as-is.
- Around line 205-206: The code calls filterActive(endpoints) after receiving
endpoints from ListActiveEndpoints(), which is redundant or inconsistent with
the interface; either remove the call to filterActive and use the results from
ListActiveEndpoints() directly, or if defensive filtering is intended, document
that filterActive is a safety check and keep it (or rename ListActiveEndpoints
to ListEndpoints if the interface does not guarantee filtering). Update the
call-site around ListActiveEndpoints and the filterActive usage (and the
ListActiveEndpoints documentation in types.go) so the contract is consistent:
choose one of the three options (remove filterActive, document defensive
filtering, or rename ListActiveEndpoints) and apply the corresponding change.

---

Duplicate comments:
In `@commons/dlq/consumer.go`:
- Around line 392-407: The code swallows re-enqueue failures after a destructive
pop by returning true; change this so Enqueue errors are propagated instead of
treated as success: in the block handling msg.NextRetryAt (and the similar block
around lines 440-455), remove the unconditional return true and instead return
the Enqueue error (or wrap it with context) from processMessage/drainSource so
the caller sees the failure; keep logging/RecordLost/metrics but ensure
c.handler.Enqueue(ctx, msg) failures cause processMessage/drainSource to return
an error (or a distinct status) rather than indicating successful processing.

In `@commons/net/http/idempotency/idempotency.go`:
- Around line 171-173: The two Redis keys built from m.keyPrefix + tenantID +
idempotencyKey and its ":response" sibling can land in different Redis cluster
slots, so client.Pipeline() cannot make their SET/DEL operations atomic; fix by
co-locating them and using a transactional pipeline or collapsing into one key:
either (A) change the key construction in the idempotency key builders (where
tenantID is derived via tmcore.GetTenantIDContext(c.UserContext()) and keys are
formed from m.keyPrefix and idempotencyKey) to include a shared Redis hash tag
(e.g., wrap the tenant-scoped portion in {...}) so both key and responseKey map
to the same slot, and replace client.Pipeline() calls with client.TxPipeline()
to execute SET/DEL under MULTI/EXEC; or (B) collapse state + response payload
into a single Redis value under the existing key and remove the separate
responseKey, updating the logic that reads/writes responseKey accordingly; also
update the "atomically" wording near the idempotency commit to reflect the
actual guarantee after you implement one of these fixes.

In `@commons/systemplane/domain/snapshot_from_keydefs.go`:
- Around line 41-68: cloneRuntimeValue currently only handles map[string]any and
[]any so typed mutable containers (e.g. []string, map[string]string or other
[]T/map[K]V) in KeyDef.DefaultValue can still alias; update cloneRuntimeValue to
detect any slice or map via reflection (reflect.Kind == Slice or Map), create a
new value of the same concrete type and capacity, iterate elements/keys,
recursively clone each element/value by calling cloneRuntimeValue, and set them
into the new reflected container; ensure nil slices/maps are preserved as nil,
preserve the original concrete type (so []T stays []T and map[K]V stays
map[K]V), and keep the existing fast path for map[string]any / []any if you
prefer for performance.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 692b30bf-e18a-4c81-a62c-a50439627af8

📥 Commits

Reviewing files that changed from the base of the PR and between 41dedc2 and 7b684cf.

📒 Files selected for processing (16)
  • commons/certificate/certificate.go
  • commons/dlq/consumer.go
  • commons/dlq/consumer_test.go
  • commons/dlq/handler.go
  • commons/net/http/idempotency/doc.go
  • commons/net/http/idempotency/idempotency.go
  • commons/systemplane/bootstrap/backend.go
  • commons/systemplane/bootstrap/backend_test.go
  • commons/systemplane/catalog/validate.go
  • commons/systemplane/domain/coercion_helpers.go
  • commons/systemplane/domain/setting_helpers.go
  • commons/systemplane/domain/snapshot.go
  • commons/systemplane/domain/snapshot_from_keydefs.go
  • commons/systemplane/ports/authorizer_defaults.go
  • commons/systemplane/ports/identity_defaults_test.go
  • commons/webhook/deliverer.go

Comment on lines +214 to +227
// DaysUntilExpiry returns the number of days until the certificate expires.
// Returns -1 if no certificate is loaded.
func (m *Manager) DaysUntilExpiry() int {
if m == nil {
return -1
}

exp := m.ExpiresAt()
if exp.IsZero() {
return -1
}

return int(time.Until(exp).Hours() / 24)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Minor doc gap: behavior for expired certificates.

The docstring documents -1 for unloaded certificates but doesn't mention that expired certificates return a negative day count (e.g., -3 for a cert that expired 3 days ago). Consider clarifying this in the godoc for completeness.

📝 Suggested documentation enhancement
 // DaysUntilExpiry returns the number of days until the certificate expires.
-// Returns -1 if no certificate is loaded.
+// Returns -1 if no certificate is loaded, or a negative value if the
+// certificate is already expired (e.g., -3 means expired 3 days ago).
 func (m *Manager) DaysUntilExpiry() int {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// DaysUntilExpiry returns the number of days until the certificate expires.
// Returns -1 if no certificate is loaded.
func (m *Manager) DaysUntilExpiry() int {
if m == nil {
return -1
}
exp := m.ExpiresAt()
if exp.IsZero() {
return -1
}
return int(time.Until(exp).Hours() / 24)
}
// DaysUntilExpiry returns the number of days until the certificate expires.
// Returns -1 if no certificate is loaded, or a negative value if the
// certificate is already expired (e.g., -3 means expired 3 days ago).
func (m *Manager) DaysUntilExpiry() int {
if m == nil {
return -1
}
exp := m.ExpiresAt()
if exp.IsZero() {
return -1
}
return int(time.Until(exp).Hours() / 24)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/certificate/certificate.go` around lines 214 - 227, The doc comment
for Manager.DaysUntilExpiry is incomplete about expired certificates; update the
godoc for the DaysUntilExpiry method to state that it returns -1 when no
certificate is loaded (m == nil or ExpiresAt().IsZero()) and otherwise returns
the number of days until expiry, which may be negative for already expired
certificates (e.g., -3 means expired 3 days ago); reference the Manager type,
DaysUntilExpiry method, and ExpiresAt() helper in the comment so readers know
the exact behavior.

Comment on lines +109 to +115
// WithSources sets the DLQ source queue names to consume.
func WithSources(sources ...string) ConsumerOption {
return func(c *Consumer) {
if len(sources) > 0 {
c.cfg.Sources = sources
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Clone the sources slice on assignment.

WithSources(existing...) keeps the caller's backing array. Mutating that slice after construction changes the live consumer config and can race with Run().

♻️ Suggested change
 func WithSources(sources ...string) ConsumerOption {
 	return func(c *Consumer) {
 		if len(sources) > 0 {
-			c.cfg.Sources = sources
+			c.cfg.Sources = slices.Clone(sources)
 		}
 	}
 }

As per coding guidelines "Keep behavior nil-safe and concurrency-safe by default in exported APIs".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/dlq/consumer.go` around lines 109 - 115, The WithSources option
currently assigns the caller's slice directly into Consumer.cfg.Sources which
can lead to races if the caller mutates it; change WithSources (the returned
func that takes *Consumer) to clone the input slice before assignment (e.g.,
allocate a new []string of len(sources) and copy into it) and then set
c.cfg.Sources to that new slice so the consumer has its own backing array and is
safe for concurrent use.

Comment on lines +166 to +177
// Only stamp CreatedAt and MaxRetries on initial enqueue (zero-valued).
// Re-enqueue paths (consumer retry-failed, not-yet-ready, prune) pass
// messages that already carry the original values; overwriting them would
// permanently lose the original failure timestamp and retry budget.
initialEnqueue := msg.CreatedAt.IsZero()
if initialEnqueue {
msg.CreatedAt = time.Now().UTC()
}

if msg.MaxRetries == 0 {
msg.MaxRetries = h.maxRetries
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Reject negative MaxRetries before enqueue.

Only 0 is normalized here. A negative value passes through and makes the consumer treat the message as already exhausted on its first dequeue, so malformed input is discarded without any retry.

♻️ Suggested change
-	initialEnqueue := msg.CreatedAt.IsZero()
-	if initialEnqueue {
-		msg.CreatedAt = time.Now().UTC()
-	}
-
-	if msg.MaxRetries == 0 {
+	if msg.MaxRetries < 0 {
+		return errors.New("dlq: enqueue: max_retries must not be negative")
+	}
+
+	initialEnqueue := msg.CreatedAt.IsZero()
+	if initialEnqueue {
+		msg.CreatedAt = time.Now().UTC()
+	}
+
+	if msg.MaxRetries == 0 {
 		msg.MaxRetries = h.maxRetries
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Only stamp CreatedAt and MaxRetries on initial enqueue (zero-valued).
// Re-enqueue paths (consumer retry-failed, not-yet-ready, prune) pass
// messages that already carry the original values; overwriting them would
// permanently lose the original failure timestamp and retry budget.
initialEnqueue := msg.CreatedAt.IsZero()
if initialEnqueue {
msg.CreatedAt = time.Now().UTC()
}
if msg.MaxRetries == 0 {
msg.MaxRetries = h.maxRetries
}
if msg.MaxRetries < 0 {
return errors.New("dlq: enqueue: max_retries must not be negative")
}
// Only stamp CreatedAt and MaxRetries on initial enqueue (zero-valued).
// Re-enqueue paths (consumer retry-failed, not-yet-ready, prune) pass
// messages that already carry the original values; overwriting them would
// permanently lose the original failure timestamp and retry budget.
initialEnqueue := msg.CreatedAt.IsZero()
if initialEnqueue {
msg.CreatedAt = time.Now().UTC()
}
if msg.MaxRetries == 0 {
msg.MaxRetries = h.maxRetries
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/dlq/handler.go` around lines 166 - 177, Update the enqueue
normalization to reject negative MaxRetries by treating any non-positive value
as uninitialized: in the block that currently checks msg.MaxRetries == 0, change
the condition to msg.MaxRetries <= 0 and set msg.MaxRetries = h.maxRetries
(preserving the existing behavior for zero while preventing negative values from
being passed through). Keep the initialEnqueue/CreatedAt logic intact.

Comment on lines +179 to +188
ctxTenant := tmcore.GetTenantIDContext(ctx)

effectiveTenant := msg.TenantID
if effectiveTenant == "" {
effectiveTenant = ctxTenant
msg.TenantID = effectiveTenant
}

if effectiveTenant != "" && ctxTenant != "" && effectiveTenant != ctxTenant {
return fmt.Errorf("dlq: enqueue: tenant mismatch between message (%s) and context (%s)", effectiveTenant, ctxTenant)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate effectiveTenant before choosing the Redis key.

effectiveTenant can come from the message or the context, but it is never key-segment validated here. If it contains :, *, etc., tenantScopedKeyForTenant silently reroutes the message to the global key, which breaks tenant routing and can strand the message once any tenant-scoped queue exists.

♻️ Suggested change
 	if effectiveTenant != "" && ctxTenant != "" && effectiveTenant != ctxTenant {
 		return fmt.Errorf("dlq: enqueue: tenant mismatch between message (%s) and context (%s)", effectiveTenant, ctxTenant)
 	}
+
+	if effectiveTenant != "" {
+		if err := validateKeySegment("tenantID", effectiveTenant); err != nil {
+			return fmt.Errorf("dlq: enqueue: %w", err)
+		}
+	}

Based on learnings "consumer.go’s processSource uses a two-path model: (1) scan-based discovery of tenant-scoped keys, and (2) a bare-context fallback that drains the global key only when no tenant keys are found."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/dlq/handler.go` around lines 179 - 188, Validate the chosen
effectiveTenant before using tenantScopedKeyForTenant: after computing
effectiveTenant (using ctxTenant and msg.TenantID) add a tenant key-segment
validation (e.g. allow only [A-Za-z0-9._-] or use an existing
validateTenant/IsValidTenant function if available) and if the tenant is invalid
return an error instead of silently falling back to the global key; ensure this
change touches the same block using ctxTenant, effectiveTenant, msg.TenantID and
tenantScopedKeyForTenant so tenant-scoped routing is only used for validated
tenant IDs.

Comment on lines +9 to +14
// The middleware uses the X-Idempotency request header combined with the tenant ID
// (from tenant-manager context) to form a composite Redis key. When a tenant ID is
// present, keys are scoped per-tenant to prevent cross-tenant collisions. When no
// tenant is in context (e.g., non-tenant-scoped routes), keys are still valid but
// are shared across the global namespace. Duplicate requests receive the original
// response with the X-Idempotency-Replayed header set to "true".
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Package docs still over-promise duplicate and rejection behavior.

The implementation does not always return the original response on duplicates: in-flight requests return 409, uncached responses fall back to the generic IDEMPOTENT payload, HEAD is also bypassed in handle(), and oversized keys can be routed to a custom rejection handler instead of the default 400. Please spell out those branches here so callers do not infer stronger behavior from the exported package comment. As per coding guidelines, "Keep exported godocs aligned with actual function behavior".

Suggested doc wording
-// are shared across the global namespace. Duplicate requests receive the original
-// response with the X-Idempotency-Replayed header set to "true".
+// are shared across the global namespace. Duplicate requests set
+// X-Idempotency-Replayed: "true" and either replay the cached response, return
+// 409 while the original request is still in flight, or fall back to a generic
+// "already processed" response when no cached body is available.
@@
-//   - GET/OPTIONS requests pass through (idempotency is irrelevant for reads).
+//   - GET/HEAD/OPTIONS requests pass through (idempotency is irrelevant for reads).
 //   - Absent header: request proceeds normally (idempotency is opt-in).
-//   - Header exceeds MaxKeyLength (default 256): request rejected with 400.
+//   - Header exceeds MaxKeyLength (default 256): request rejected by the configured handler
+//     (default: 400 JSON).

Also applies to: 27-32

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/net/http/idempotency/doc.go` around lines 9 - 14, The package godoc
overstates duplicate-request behavior; update the package documentation in
doc.go to list the actual branches: when a request is already in-flight handle()
returns 409, when no cached response exists callers receive the generic
IDEMPOTENT payload rather than the original response, HEAD requests are bypassed
by handle(), oversized idempotency keys may invoke the configured rejection
handler (not always a 400), and successful replays set the
X-Idempotency-Replayed header to "true"; mention tenant-scoped vs global key
composition and reference these symbols (handle(), IDEMPOTENT,
X-Idempotency-Replayed, rejection handler) so readers see the exact behavior.

Comment on lines +221 to +225
// Read the current key value to distinguish in-flight from completed.
keyValue, _ := client.Get(ctx, key).Result()

// Try to replay the cached response (true idempotency).
cached, cacheErr := client.Get(ctx, responseKey).Result()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the file to understand the error handling pattern
wc -l commons/net/http/idempotency/idempotency.go

Repository: LerianStudio/lib-commons

Length of output: 115


🏁 Script executed:

# Read the relevant sections with surrounding context (lines 200-260)
sed -n '200,260p' commons/net/http/idempotency/idempotency.go | cat -n

Repository: LerianStudio/lib-commons

Length of output: 2818


🏁 Script executed:

# Also check around the SetNX call to understand the fail-open/fail-closed pattern mentioned
sed -n '160,190p' commons/net/http/idempotency/idempotency.go | cat -n

Repository: LerianStudio/lib-commons

Length of output: 1403


🏁 Script executed:

# Read the file beginning to find keyStateProcessing constant and understand key states
head -n 100 commons/net/http/idempotency/idempotency.go | cat -n

Repository: LerianStudio/lib-commons

Length of output: 3129


🏁 Script executed:

# Also check the full handleDuplicate function to see complete flow
sed -n '210,260p' commons/net/http/idempotency/idempotency.go | cat -n

Repository: LerianStudio/lib-commons

Length of output: 2412


Apply consistent error handling to GET operations in handleDuplicate: distinguish missing keys from transport errors, and fail open on timeouts instead of fabricating synthetic replay responses.

Line 223 swallows the GET error outright, and line 226 only handles the cached response on success. When either read times out or the Redis connection fails between the SETNX check and these follow-up reads, the code falls through to the 409/200 branches without distinguishing whether the key state is truly known. This contradicts the SetNX path (lines 176–180), which logs and fails open via c.Next() on errors.

Missing keys (redis.Nil) and transient failures (timeout, connection issues) require different handling. Apply the same fail-open policy you use for SetNX errors instead of returning a synthetic "IDEMPOTENT" response when the actual state is unknown.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/net/http/idempotency/idempotency.go` around lines 221 - 225, In
handleDuplicate, stop swallowing errors from client.Get for keyValue and cached
response; replace the ignored-error calls with proper err checks: call
client.Get(...).Result() into (keyValue, err) and (cached, cacheErr), treat
redis.Nil as “missing key” but for any other non-nil error (timeouts/connection
failures) log the error and fail open by calling c.Next() (mirroring the SetNX
error path) instead of fabricating an "IDEMPOTENT" response; ensure only when
cacheErr==nil and cached present you replay the cached response, and when
cacheErr==redis.Nil you proceed with normal 409/200 logic.

Comment on lines +410 to 412
factories, initErrors := backendRegistry.snapshot()
assert.Len(t, factories, 1)
assert.NotEmpty(t, initErrors)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider making the factory count assertion more robust.

The assertion assert.Len(t, factories, 1) assumes the registry contains exactly one factory after the test setup. While this works correctly when tests run in isolation (due to withRegistrySnapshot), the assertion could be more explicit:

💡 Optional: Assert on specific key presence instead of total count
 factories, initErrors := backendRegistry.snapshot()
-assert.Len(t, factories, 1)
+_, ok := factories[domain.BackendPostgres]
+assert.True(t, ok, "expected postgres factory to be registered")
 assert.NotEmpty(t, initErrors)

This makes the test intention clearer and is resilient to any future changes in test setup.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
factories, initErrors := backendRegistry.snapshot()
assert.Len(t, factories, 1)
assert.NotEmpty(t, initErrors)
factories, initErrors := backendRegistry.snapshot()
_, ok := factories[domain.BackendPostgres]
assert.True(t, ok, "expected postgres factory to be registered")
assert.NotEmpty(t, initErrors)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/systemplane/bootstrap/backend_test.go` around lines 410 - 412, The
test currently asserts exact factory count with assert.Len(t, factories, 1);
instead assert that the specific factory registered by the test setup is present
to make the test resilient—call backendRegistry.snapshot() as before, then
replace the length assertion with an assertion that the expected factory key
(the identifier your test registers via withRegistrySnapshot) exists in
factories (e.g., using assert.Contains or an equivalent map-key check), and keep
assert.NotEmpty(t, initErrors) unchanged.

Comment on lines +150 to +164
func buildCatalogIndexes(catalogKeys ...[]SharedKey) (map[string]SharedKey, map[string]SharedKey) {
keyIndex := make(map[string]SharedKey)
envIndex := make(map[string]SharedKey)

for _, slice := range catalogKeys {
for _, sk := range slice {
keyIndex[sk.Key] = sk
for _, envVar := range allowedEnvVars(sk) {
envIndex[envVar] = sk
}
}
}

return keyIndex, envIndex
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider detecting duplicate env var mappings in catalog.

If multiple SharedKey entries share the same EnvVar or have overlapping MatchEnvVars, later entries silently overwrite earlier ones in envIndex (lines 157-159). This could cause a product KeyDef to match an unintended catalog key.

If duplicate env vars in the catalog indicate a configuration error, consider detecting and reporting them. If last-wins semantics are intentional, documenting this behavior would help future maintainers.

♻️ Optional: Detect and warn on duplicate env var mappings
 func buildCatalogIndexes(catalogKeys ...[]SharedKey) (map[string]SharedKey, map[string]SharedKey) {
 	keyIndex := make(map[string]SharedKey)
 	envIndex := make(map[string]SharedKey)
 
 	for _, slice := range catalogKeys {
 		for _, sk := range slice {
 			keyIndex[sk.Key] = sk
 			for _, envVar := range allowedEnvVars(sk) {
+				if existing, dup := envIndex[envVar]; dup && existing.Key != sk.Key {
+					// Log or collect duplicate for caller visibility
+				}
 				envIndex[envVar] = sk
 			}
 		}
 	}
 
 	return keyIndex, envIndex
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/systemplane/catalog/validate.go` around lines 150 - 164,
buildCatalogIndexes currently lets later SharedKey entries silently overwrite
earlier env var mappings in envIndex; modify buildCatalogIndexes to detect
duplicates by checking if envVar already exists in envIndex (compare existing
SharedKey.Key to sk.Key) while iterating allowedEnvVars(sk), collect conflicting
envVars (and their previous/new SharedKey keys) and surface them—either return
an error/list of conflicts, or emit a clear warning via your logger—so callers
of buildCatalogIndexes (or callers of SharedKey validation) can fail fast or be
informed; reference buildCatalogIndexes, SharedKey, keyIndex, envIndex, and
allowedEnvVars when making the change.

Comment on lines +205 to +206
active := filterActive(endpoints)
if len(active) == 0 {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Redundant filtering — clarify interface contract or remove double-check.

ListActiveEndpoints() is documented to return "active webhook endpoints" (per types.go:33-38), yet filterActive() is applied to its result. Either:

  1. The interface contract doesn't guarantee pre-filtering (rename to ListEndpoints or update docs), or
  2. This filter is defensive and should be documented as such, or
  3. Remove the redundant filterActive() call.

The current mismatch between interface naming and implementation creates confusion about where filtering responsibility lies.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/webhook/deliverer.go` around lines 205 - 206, The code calls
filterActive(endpoints) after receiving endpoints from ListActiveEndpoints(),
which is redundant or inconsistent with the interface; either remove the call to
filterActive and use the results from ListActiveEndpoints() directly, or if
defensive filtering is intended, document that filterActive is a safety check
and keep it (or rename ListActiveEndpoints to ListEndpoints if the interface
does not guarantee filtering). Update the call-site around ListActiveEndpoints
and the filterActive usage (and the ListActiveEndpoints documentation in
types.go) so the contract is consistent: choose one of the three options (remove
filterActive, document defensive filtering, or rename ListActiveEndpoints) and
apply the corresponding change.

Comment on lines +408 to +414
// Non-retryable client errors — break immediately (except 429 Too Many Requests).
if statusCode >= http.StatusBadRequest && statusCode < http.StatusInternalServerError && statusCode != http.StatusTooManyRequests {
result.Error = fmt.Errorf("webhook: non-retryable status %d", statusCode)
d.recordMetrics(ctx, ep.ID, false, statusCode, attempt+1)

return result
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Use result.Attempts for metric consistency.

Line 411 uses attempt+1 while line 425 uses result.Attempts. Both represent the same value, but using result.Attempts consistently improves readability and ensures the metric matches the result struct.

♻️ Suggested fix
 		if statusCode >= http.StatusBadRequest && statusCode < http.StatusInternalServerError && statusCode != http.StatusTooManyRequests {
 			result.Error = fmt.Errorf("webhook: non-retryable status %d", statusCode)
-			d.recordMetrics(ctx, ep.ID, false, statusCode, attempt+1)
+			d.recordMetrics(ctx, ep.ID, false, statusCode, result.Attempts)
 
 			return result
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/webhook/deliverer.go` around lines 408 - 414, The metrics call uses
attempt+1 which is inconsistent with the result struct; change the recordMetrics
invocation in the non-retryable branch to use result.Attempts instead of
attempt+1 so it matches the value used elsewhere (e.g., later calls that pass
result.Attempts), keeping the same context variables (ep.ID, statusCode, ctx)
and leaving result.Error assignment as-is.

fredcamaral added a commit that referenced this pull request Mar 30, 2026
Address 21 unresolved CodeRabbit review findings across certificate,
systemplane, webhook, DLQ, and idempotency packages:

- certificate: defensive copies in GetCertificate/TLSCertificate, nil-receiver
  safety for TLSCertificate, expanded DaysUntilExpiry godoc
- systemplane: nil-safety guards for SnapSettingInt/SnapSettingBool,
  consolidate redundant state loads in supervisor reload, robust factory
  assertion in backend test, duplicate env var detection in catalog validation
- webhook: redact URL userinfo in logs, versioned HMAC signature format (v1)
  with backward-compatible migration path, DNS-free scheme validation tests,
  preserve original authority for HTTP Host header, defensive filter comment,
  consistent metric reporting with result.Attempts
- dlq: clone sources slice in WithSources, reject negative MaxRetries,
  validate tenant key segment before Redis routing
- idempotency: binary-safe cached response with headers, fail-open error
  handling in handleDuplicate, accurate package documentation
@fredcamaral
Copy link
Copy Markdown
Member Author

CodeRabbit Auto-Fixes Applied

Fixed 16 file(s) based on 21 unresolved review comment(s).

Files modified:

  • commons/certificate/certificate.go
  • commons/certificate/certificate_test.go
  • commons/dlq/consumer.go
  • commons/dlq/handler.go
  • commons/net/http/idempotency/doc.go
  • commons/net/http/idempotency/idempotency.go
  • commons/systemplane/bootstrap/backend_test.go
  • commons/systemplane/catalog/validate.go
  • commons/systemplane/domain/setting_helpers.go
  • commons/systemplane/service/supervisor.go
  • commons/systemplane/service/supervisor_helpers.go
  • commons/webhook/deliverer.go
  • commons/webhook/deliverer_test.go
  • commons/webhook/doc.go
  • commons/webhook/ssrf.go
  • commons/webhook/ssrf_test.go

Commits: 418ce48, ce56bc2

The fixes are on the fix/coderabbit-autofix-pr419 branch → PR #421.

fredcamaral added a commit that referenced this pull request Mar 30, 2026
* fix: apply CodeRabbit auto-fixes for PR #419

Address 21 unresolved CodeRabbit review findings across certificate,
systemplane, webhook, DLQ, and idempotency packages:

- certificate: defensive copies in GetCertificate/TLSCertificate, nil-receiver
  safety for TLSCertificate, expanded DaysUntilExpiry godoc
- systemplane: nil-safety guards for SnapSettingInt/SnapSettingBool,
  consolidate redundant state loads in supervisor reload, robust factory
  assertion in backend test, duplicate env var detection in catalog validation
- webhook: redact URL userinfo in logs, versioned HMAC signature format (v1)
  with backward-compatible migration path, DNS-free scheme validation tests,
  preserve original authority for HTTP Host header, defensive filter comment,
  consistent metric reporting with result.Attempts
- dlq: clone sources slice in WithSources, reject negative MaxRetries,
  validate tenant key segment before Redis routing
- idempotency: binary-safe cached response with headers, fail-open error
  handling in handleDuplicate, accurate package documentation

* fix: reduce Enqueue cyclomatic complexity and add missing errors import

- Extract validateEnqueueMessage, stampInitialEnqueue, and
  resolveAndValidateTenant helpers from Handler.Enqueue to bring
  cyclomatic complexity from 18 down to 7
- Add missing "errors" import in idempotency package for errors.Is calls

* fix: replace deprecated Header.VisitAll with Header.All iterator

staticcheck SA1019: fasthttp Header.VisitAll is deprecated in favor of
the range-compatible All() iterator.

* fix: address CodeRabbit round-2 findings on PR #421

- idempotency: use Header.Add for multi-value replay, log unmarshal
  failures, rename shadowed loop variable
- catalog: introduce ValidationResult to separate env var conflicts from
  Mismatch, mark duplicate env vars as ambiguous in index
- webhook: strip URL fragments in sanitizeURL, fix godoc references to
  VerifySignature/VerifySignatureWithFreshness, pin TestComputeHMACv1
  to independently computed expected value

* feat(security/ssrf): add canonical SSRF validation package

Introduce commons/security/ssrf as the single source of truth for SSRF
protection across all Lerian services. Consolidates two internal,
duplicated implementations into one exported package.

New exported API:
- IsBlockedIP(net.IP) / IsBlockedAddr(netip.Addr): IP-level blocking
  with canonical CIDR blocklist (8 ranges + stdlib predicates)
- IsBlockedHostname(hostname): hostname-level blocking for localhost,
  cloud metadata endpoints, .local/.internal/.cluster.local suffixes
- BlockedPrefixes(): returns copy of CIDR blocklist for auditing
- ValidateURL(ctx, url, opts...): scheme + hostname + IP validation
  without DNS resolution
- ResolveAndValidate(ctx, url, opts...): DNS-pinned validation with
  TOCTOU elimination, returns ResolveResult{PinnedURL, Authority,
  SNIHostname}
- Functional options: WithHTTPSOnly, WithAllowPrivateNetwork,
  WithLookupFunc, WithAllowHostname
- Sentinel errors: ErrBlocked, ErrInvalidURL, ErrDNSFailed

Refactored consumers:
- commons/webhook/ssrf.go: resolveAndValidateIP delegates to
  ssrf.ResolveAndValidate, removed duplicated isPrivateIP/CIDR blocklist
- commons/net/http/proxy_validation.go: isUnsafeIP delegates to
  ssrf.IsBlockedIP, removed duplicated blockedProxyPrefixes

Canonicalized on netip.Prefix (modern Go) with net.IP bridge for
legacy callers. All tests hermetic via WithLookupFunc injection.

* docs: document commons/security/ssrf package in AGENTS.md and CLAUDE.md

Add repository shape entry, API invariants section, and other-packages
bullet for the new canonical SSRF validation package.

* style(security/ssrf): fix comment alignment in CIDR blocklist

X-Lerian-Ref: 0x1

* test(webhook): refactor SSRF tests and add mapSSRFError coverage

Remove duplicate test functions (InvalidScheme was duplicate of BlockedSchemes, PrivateIP replaced by simpler BlockedHostname). Add dedicated TestMapSSRFError covering all four sentinel translation branches. Upgrade assert.Error to require.Error for fail-fast on nil errors.

X-Lerian-Ref: 0x1

* fix: apply CodeRabbit auto-fixes for PR #421
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant