@Uzziee Uzziee commented Nov 18, 2025

This proposal adds hot reload functionality, enabling the app to reload changes to the virtual cluster config without needing a restart.

Signed-off-by: Urjit Patel <105218041+Uzziee@users.noreply.github.com>

@SamBarker SamBarker left a comment


Design PR#83 Feedback - Configuration Reload Design

Date: 2026-01-28
Reviewer: Sam Barker
Design PR: #83

Executive Summary

Thank you for putting together this design proposal! Configuration reload is a critical operational feature that many users have been asking for, and your design work provides a solid foundation for moving this forward.

The current proposal focuses on file watch as the primary mechanism. This feedback suggests an alternative HTTP-first approach with 2-phase validation and discusses enhancements that will make either approach production-ready. The feedback builds on analysis of the POC implementation (PR#3176).

Your POC demonstrates the core reload mechanism works well - the questions here are primarily about the trigger mechanism and operator integration patterns. The groundwork you've laid out makes these decisions much clearer.

Proposed Change to Design: HTTP Endpoints as Primary Interface

Current Design Proposal

The design PR currently proposes file watch as the primary mechanism for configuration reload (Part 1), with potential HTTP endpoints as future work.

Recommended Alternative: HTTP-First Approach

I recommend inverting this: make HTTP endpoints the primary interface, with file watching as an optional convenience layer.

Rationale for HTTP-first:

Universal: Works on bare metal, Kubernetes, and any deployment model
Operator-friendly: Natural integration point for Kubernetes operator (operator detects ConfigMap changes → POST /admin/config/reload)
Testable: Easy to test programmatically (integration tests can POST directly)
Observable: Clear success/failure responses (200 OK vs 400 Bad Request with error details)
Composable: File watching can be implemented as a layer that calls the HTTP endpoint internally
Kubernetes-native: Aligns with how operators interact with workloads (API calls, not filesystem)

File watching challenges:

  • ❌ Read-only filesystem (Kubernetes security best practice blocks file writes)
  • ❌ ConfigMap mounting complexity (..data symlinks, atomic updates)
  • ❌ No feedback mechanism (how does operator know reload succeeded/failed?)
  • ❌ Race conditions (file watch triggers before ConfigMap fully mounted)

Proposed architecture:

Core: HTTP Management Endpoints

Proxy exposes on localhost:9190 (management port):
    ↓
POST /admin/config/validate (validate without applying)
POST /admin/config/reload (apply changes)
GET /admin/config/status (current config version, last operation status)
GET /admin/health (proxy health for liveness/readiness, already exists)
    ↓
Core reload mechanism (shared by all trigger mechanisms)

2-Phase Workflow:

  1. Validate: Build models, initialize filters, check internal consistency (no port binding)
  2. Reload: If validation passes, apply changes (bind ports, register gateways)
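
To make the 2-phase workflow concrete, here is a minimal sketch of a client driving both phases with the JDK HTTP client. The endpoint paths and port follow the proposal above; the class name, config path, and error handling are illustrative only.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class TwoPhaseReloadClient {
    public static void main(String[] args) throws Exception {
        Path config = Path.of("/etc/kroxylicious/config.yaml");
        HttpClient client = HttpClient.newHttpClient();

        // Phase 1: validate without applying any changes.
        HttpResponse<String> validation = client.send(
                post("http://localhost:9190/admin/config/validate", config),
                HttpResponse.BodyHandlers.ofString());
        if (validation.statusCode() != 200) {
            System.err.println("Validation failed, not reloading: " + validation.body());
            return;
        }

        // Phase 2: apply the already-validated configuration.
        HttpResponse<String> reload = client.send(
                post("http://localhost:9190/admin/config/reload", config),
                HttpResponse.BodyHandlers.ofString());
        System.out.println("Reload response (" + reload.statusCode() + "): " + reload.body());
    }

    private static HttpRequest post(String uri, Path body) throws java.io.FileNotFoundException {
        return HttpRequest.newBuilder(URI.create(uri))
                .header("Content-Type", "application/yaml")
                .POST(HttpRequest.BodyPublishers.ofFile(body))
                .build();
    }
}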

Security:

  • Default bind: localhost:9190 (local access only)
  • For Kubernetes: Bind to 0.0.0.0:9190 (pod IP accessible to operator)
  • Authentication: Optional (TLS client certificates, bearer tokens)
  • Recommendations:
    • Bare metal: Keep localhost binding, use local access controls
    • Kubernetes: Use NetworkPolicy to restrict operator→proxy traffic
    • Production: Consider mTLS for operator↔proxy communication

Trigger Mechanisms (How to Call HTTP Endpoints)

Option 1: Direct HTTP (Kubernetes Operator)

Operator detects ConfigMap change
    ↓
POST /admin/config/validate to management Service
    ↓
POST /admin/config/reload to all pod IPs

✅ Native Kubernetes integration
✅ Immediate feedback via HTTP responses
✅ No filesystem coupling

Option 2: File Watcher (Bare Metal)

Sidecar process watches config file
    ↓
On file change → POST to localhost:9190/admin/config/validate
    ↓
If valid → POST to localhost:9190/admin/config/reload

Sidecar options:

  • Shell script: Simple inotifywait wrapper
    # -m keeps watching after the first event; --fail makes curl exit non-zero on HTTP 4xx/5xx
    inotifywait -m -e modify /etc/kroxylicious/config.yaml | while read; do
      if curl --fail -s -X POST -H "Content-Type: application/yaml" --data-binary @/etc/kroxylicious/config.yaml http://localhost:9190/admin/config/validate; then
        curl -X POST -H "Content-Type: application/yaml" --data-binary @/etc/kroxylicious/config.yaml http://localhost:9190/admin/config/reload
      fi
    done
  • Go binary: More robust error handling, retry logic
  • In-process Java: WatchService (if proxy can write to filesystem for persistence); see the sketch below

✅ Familiar workflow for bare metal users
✅ Decoupled from proxy (sidecar can be restarted independently)
✅ Uses same HTTP endpoints as Kubernetes
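
As a sketch of the in-process Java WatchService option (the config path is illustrative, and the HTTP calls would be the same validate→reload pair shown elsewhere in this feedback):

import java.nio.file.*;

// Minimal sketch of a Java file watcher. It watches the config directory because
// ConfigMap and editor updates often replace the file rather than modify it in place.
public class ConfigFileWatcher {
    public static void main(String[] args) throws Exception {
        Path configFile = Path.of("/etc/kroxylicious/config.yaml");
        WatchService watchService = FileSystems.getDefault().newWatchService();
        configFile.getParent().register(watchService,
                StandardWatchEventKinds.ENTRY_MODIFY, StandardWatchEventKinds.ENTRY_CREATE);

        while (true) {
            WatchKey key = watchService.take(); // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                if (configFile.getFileName().equals(event.context())) {
                    // POST /admin/config/validate, then /admin/config/reload if valid
                    // (same HTTP calls as the two-phase client sketch above).
                }
            }
            key.reset();
        }
    }
}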

This means:

  • HTTP endpoints are the primitive (required)
  • File watching is optional convenience (can be added later)
  • Both deployment models use same tested, validated endpoints
  • Validation catches config errors before any cluster goes down

Note: This is a significant change from the current design proposal, which focuses on file watch without a validation phase. If the community prefers file watch as the primary mechanism, we should address the challenges listed above (read-only filesystem, feedback mechanisms, etc.) in the design.

Cluster Modification Semantics

The design's remove→add pattern is architecturally necessary:

The proxy's channel state machine has a fundamental constraint: each frontend channel (client→proxy) has a 1:1 relationship with a backend channel (proxy→broker). There's no mechanism to redirect an existing backend connection without closing the frontend connection.

This means:

  • Any cluster modification requires draining connections (1-30 seconds downtime per cluster)
  • "Atomic swap" approaches don't eliminate downtime—they would require hot-swapping filters in the Netty pipeline, which introduces filter state management complexity
  • The remove→add pattern is the correct architectural choice, not a limitation to be overcome

Implication for design: Document that cluster modifications incur brief downtime (1-30s) and this is by design, not a quality issue.

Rollback Strategy (Needs Discussion)

Current POC behavior: Rollback ALL clusters on ANY failure (all-or-nothing semantics)

This is a critical design decision that requires community consensus. The choice affects operational complexity, user experience, and downtime characteristics. See "Questions for Design Discussion" below for detailed analysis of trade-offs.

Key question: When cluster-a succeeds but cluster-b fails, should we:

  • Option A: Rollback cluster-a (all-or-nothing) → simpler operations, more downtime
  • Option B: Keep cluster-a on new config (partial success) → less downtime, more complexity

Recommendation for design: Dedicate a section to this decision, present both options fairly, and explicitly request community feedback before proceeding.

Core Design: HTTP Endpoints with 2-Phase Commit

Validation Endpoint (Core Component)

API:

POST /admin/config/validate
Content-Type: application/yaml

{new configuration YAML}

Response (200 OK):
{
  "valid": true,
  "configVersion": "a3f5b2c19e4d"  // SHA-256 hash of config
}

Response (400 Bad Request):
{
  "valid": false,
  "errors": [
    "Filter 'record-encryption' initialization failed: KMS URL required",
    "Port conflict: 9293 used by cluster-a and cluster-b"
  ]
}

What it validates:

  • ✅ YAML syntax and structure
  • ✅ Filter types exist (registered via SPI)
  • ✅ FilterFactory.initialize() succeeds (filter config valid)
  • ✅ Port ranges internally consistent (no duplicate ports in config)

What it doesn't validate (runtime concerns):

  • ❌ Ports actually available on the OS (might be in use)
  • ❌ External dependencies reachable (KMS might be down during reload)
  • ❌ Upstream Kafka cluster healthy

Why this split is acceptable:

Validation is about catching configuration errors (syntax, invalid filter config). Runtime failures (port conflicts, KMS down at reload time) are handled by rollback. We can't guarantee "config valid at 10:00am" means "will succeed at 10:02am" for external dependencies.

Implementation note: Validation should build models and initialize filters without binding ports or registering gateways. This makes validation:

  • Fast (no network operations)
  • Deterministic (same result on all pods)
  • Resource-light (no double-memory usage)
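
A rough sketch of that validation pass, using stand-in types (ProxyConfiguration, VirtualClusterDefinition, ValidationResult) rather than the real Kroxylicious model classes:

import java.util.ArrayList;
import java.util.List;

// Hypothetical types standing in for the proxy's internal configuration model.
final class ConfigValidator {

    record ValidationResult(boolean valid, List<String> errors) {}

    ValidationResult validate(ProxyConfiguration candidate) {
        List<String> errors = new ArrayList<>();
        for (VirtualClusterDefinition cluster : candidate.virtualClusters()) {
            try {
                // Build the in-memory model: resolves filter types via SPI and calls
                // FilterFactory.initialize(), but never binds a port or registers a
                // gateway, so it is safe to run on every pod.
                buildModel(cluster);
            }
            catch (Exception e) {
                errors.add("Cluster '" + cluster.name() + "': " + e.getMessage());
            }
        }
        errors.addAll(checkForDuplicatePorts(candidate));
        return new ValidationResult(errors.isEmpty(), errors);
    }

    // Placeholders for the real model-building and consistency checks.
    private void buildModel(VirtualClusterDefinition cluster) { /* ... */ }
    private List<String> checkForDuplicatePorts(ProxyConfiguration candidate) { return List.of(); }

    interface ProxyConfiguration { List<VirtualClusterDefinition> virtualClusters(); }
    interface VirtualClusterDefinition { String name(); }
}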

Reload Endpoint (Core Component)

API:

POST /admin/config/reload
Content-Type: application/yaml

{new configuration YAML}

Response (200 OK):
{
  "success": true,
  "configVersion": "a3f5b2c19e4d",
  "clustersModified": ["cluster-a", "cluster-b"]
}

Response (500 Internal Server Error):
{
  "success": false,
  "error": "Failed to modify cluster-b: filter initialization failed",
  "configVersion": "abc123"  // Rolled back to previous version
}

What it does:

  1. Applies configuration changes (remove→add clusters as needed)
  2. If any operation fails → rollback all changes
  3. Returns success/failure with current config version
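
A sketch of the all-or-nothing apply/rollback loop, with hypothetical helper methods; it shows the shape of the logic, not the actual Kroxylicious implementation:

import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical reload loop: apply cluster changes one by one; on any failure,
// undo the changes already applied and restore the previous configuration.
final class ReloadOperation {

    void reload(ProxyConfig oldConfig, ProxyConfig newConfig) throws ReloadFailedException {
        Deque<String> applied = new ArrayDeque<>();
        try {
            for (String cluster : changedClusters(oldConfig, newConfig)) {
                removeCluster(cluster);          // drain connections, unbind ports
                addCluster(cluster, newConfig);  // bind ports, register gateways
                applied.push(cluster);
            }
        }
        catch (Exception failure) {
            // Roll back in reverse order so the proxy returns to the old config.
            // A failure inside this loop is the ROLLBACK_PARTIAL_FAILURE case
            // tracked by the status endpoint.
            while (!applied.isEmpty()) {
                String cluster = applied.pop();
                removeCluster(cluster);
                addCluster(cluster, oldConfig);
            }
            throw new ReloadFailedException("Rolled back to previous config", failure);
        }
    }

    // Placeholders for the real operations.
    private Iterable<String> changedClusters(ProxyConfig oldConfig, ProxyConfig newConfig) { return java.util.List.of(); }
    private void removeCluster(String name) { }
    private void addCluster(String name, ProxyConfig config) { }

    interface ProxyConfig { }
    static final class ReloadFailedException extends Exception {
        ReloadFailedException(String msg, Throwable cause) { super(msg, cause); }
    }
}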

Configuration Options

Management endpoint binding:

# proxy-config.yaml
admin:
  host: "localhost"  # Default: localhost only (bare metal)
  # host: "0.0.0.0"  # Kubernetes: bind to pod IP
  port: 9190
  tls:  # Optional: mTLS for operator communication
    keyStore: /path/to/keystore.jks
    trustStore: /path/to/truststore.jks

Benefits of this architecture:

  • Catches 90% of errors before any cluster goes down (validation phase)
  • Clear error messages before disruption
  • Same HTTP endpoints for Kubernetes and bare metal
  • File watching is optional, can be added as sidecar later
  • Security: localhost by default, configurable for Kubernetes

Kubernetes Integration Patterns

Management Service

Problem: Operator creates Services for Kafka traffic (ports 9292+) but not for the management port (9190).

Proposed: Create dedicated management Service for operator↔proxy communication:

apiVersion: v1
kind: Service
metadata:
  name: my-proxy-management
spec:
  type: ClusterIP  # Internal only
  selector:
    app.kubernetes.io/instance: minimal
    app.kubernetes.io/component: proxy
  ports:
  - name: management
    port: 9190
    targetPort: 9190

Benefits:

  • ✅ Automatic pod readiness handling (the Service only routes traffic to ready pods)
  • ✅ Stable DNS endpoint (my-proxy-management.ns.svc.cluster.local)
  • ✅ Survives pod restarts/rescheduling
  • ✅ Follows Kubernetes best practices (Services for stable endpoints)

Usage:

  • Validation: POST http://my-proxy-management:9190/admin/config/validate (one pod via Service)
  • Reload: Iterate over pods, POST directly to pod IPs (all pods must succeed)
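
A sketch of the operator-side fan-out implied by this usage; the Service name, port, and pod-IP discovery are assumptions taken from the proposal above, not existing operator code:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

// Hypothetical operator-side fan-out: validate once via the management Service,
// then reload every pod individually. Pod IP discovery is left abstract.
final class ReloadFanOut {
    private final HttpClient http = HttpClient.newHttpClient();

    boolean validateAndReload(String yaml, List<String> podIps) throws Exception {
        // 1. Validate once via the stable Service DNS name (any ready pod will do,
        //    because validation is deterministic).
        if (post("http://my-proxy-management:9190/admin/config/validate", yaml) != 200) {
            return false;
        }
        // 2. Apply to every pod; all of them must accept the new config.
        for (String podIp : podIps) {
            if (post("http://" + podIp + ":9190/admin/config/reload", yaml) != 200) {
                return false; // surface the failure; per-pod rollback already happened proxy-side
            }
        }
        return true;
    }

    private int post(String uri, String yaml) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(uri))
                .header("Content-Type", "application/yaml")
                .POST(HttpRequest.BodyPublishers.ofString(yaml))
                .build();
        return http.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
    }
}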

Recommendation: Add management Service pattern to Kubernetes deployment section of design.

Read-Only Filesystem Support

Problem: Kubernetes deployments use securityContext.readOnlyRootFilesystem: true as a security best practice. The current design persists config to disk after a successful reload, which fails on a read-only filesystem.

Proposed: Make config file persistence optional:

Deployment models:

  • Bare metal: Config file on disk, persist on successful reload
  • Kubernetes: Config in ConfigMap (operator-managed), no disk persistence

Recommendation: Document read-only filesystem support as a requirement for Kubernetes deployments.

Checksum-Based Change Detection

Problem: Operator needs to detect "config actually changed" vs "CRD reconciliation loop with no real change."

Proposed: Store SHA-256 hash of config YAML in KafkaProxy annotation:

apiVersion: kroxylicious.io/v1alpha1
kind: KafkaProxy
metadata:
  name: minimal
  annotations:
    kroxylicious.io/config-checksum: "a3f5b2c19e4d"  # SHA-256 hash
spec:
  # ... config ...

Operator logic:

String newChecksum = sha256(generateYaml(kafkaProxy));
String oldChecksum = kafkaProxy.getMetadata().getAnnotations().get("kroxylicious.io/config-checksum");

if (newChecksum.equals(oldChecksum)) {
    LOGGER.debug("Config unchanged, skipping reload");
    return;  // No-op, avoid unnecessary reload
}

// Config changed, trigger 2-phase reload
ValidationResult validation = validateViaManagementService(yaml);
if (validation.valid()) {
    reloadAllPods(yaml);
    kafkaProxy.getMetadata().getAnnotations().put("kroxylicious.io/config-checksum", newChecksum);
}
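
For completeness, the sha256(...) helper used above could be a thin wrapper over the JDK's MessageDigest:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

static String sha256(String yaml) throws Exception {
    // Hash the rendered YAML so identical configs always produce the same version string.
    byte[] digest = MessageDigest.getInstance("SHA-256").digest(yaml.getBytes(StandardCharsets.UTF_8));
    return HexFormat.of().formatHex(digest);
}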

Benefits:

  • ✅ Automatic no-op detection (reconciliation loop doesn't trigger unnecessary reloads)
  • ✅ Rollback detection (reverting config doesn't reload if already at that state)
  • ✅ O(1) comparison vs deep config diff

Recommendation: Add checksum-based change detection to operator integration section.

Additional Design Components

Configurable Drain Timeout

Problem: Hard-coded 30-second drain timeout is too short for Kafka consumers with long poll timeouts (default 5 minutes).

Proposed:

# proxy-config.yaml
admin:
  drainTimeoutSeconds: 300  # 5 minutes for graceful connection drain

Trade-off: Longer timeouts mean longer reload times, but fewer disrupted clients.
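
A rough sketch of what a configurable drain could look like, assuming (purely for illustration) that the proxy tracks each virtual cluster's frontend channels in a Netty ChannelGroup:

import io.netty.channel.group.ChannelGroup;
import java.time.Duration;
import java.time.Instant;

final class ConnectionDrainer {
    // Waits up to drainTimeout for all tracked channels to close on their own,
    // then force-closes whatever is left.
    static void drain(ChannelGroup clusterChannels, Duration drainTimeout) throws InterruptedException {
        Instant deadline = Instant.now().plus(drainTimeout);
        while (!clusterChannels.isEmpty() && Instant.now().isBefore(deadline)) {
            Thread.sleep(500); // poll; a production version would use channel close listeners
        }
        // Anything still open after the timeout is closed forcibly.
        clusterChannels.close().awaitUninterruptibly();
    }
}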

Recommendation: Add configurable drain timeout to design.

Observability and Status Reporting

Configuration Status Endpoint:

Separate configuration status from health checks (health is for liveness/readiness):

GET /admin/config/status
{
  "currentConfigVersion": "sha256:a3f5b2c19e4d...",
  "appliedAt": "2026-01-28T10:15:30Z",
  "lastReloadAttempt": {
    "timestamp": "2026-01-28T10:15:30Z",
    "status": "SUCCESS",
    "requestedVersion": "sha256:a3f5b2c19e4d...",
    "durationMs": 1234,
    "clustersModified": ["cluster-a"]
  },
  "lastValidationAttempt": {
    "timestamp": "2026-01-28T10:15:25Z",
    "status": "SUCCESS",
    "requestedVersion": "sha256:a3f5b2c19e4d..."
  }
}

// After reload failure with rollback failure:
{
  "currentConfigVersion": "sha256:abc123...",  // Previous version still running
  "appliedAt": "2026-01-28T09:00:00Z",
  "lastReloadAttempt": {
    "timestamp": "2026-01-28T10:20:00Z",
    "status": "ROLLBACK_PARTIAL_FAILURE",
    "requestedVersion": "sha256:newversion...",
    "rollbackState": {
      "successful": ["cluster-a"],
      "failed": {
        "cluster-b": "Failed to re-register gateway: port 9293 in use"
      }
    }
  }
}

Health endpoint stays focused on proxy health:

GET /admin/health
{
  "status": "UP",
  "checks": {
    "netty": "UP",
    "virtualClusters": "UP"
  }
}

Benefit: Clean separation - operators query /admin/config/status for reload state, /admin/health for liveness/readiness.

Recommendation: Add dedicated config status endpoint to design.

Metrics:

kroxylicious_config_reload_total{result="success|failure"} counter
kroxylicious_config_reload_duration_seconds histogram
kroxylicious_config_version_info{version="a3f5b2c19e4d"} gauge

Use cases:

  • Alerting on reload failures
  • Tracking reload duration trends
  • Capacity planning (reload frequency)
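
A sketch of how these metrics could be registered, assuming Micrometer is the metrics facade; the meter names mirror the list above:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

// Hypothetical metric registration mirroring the names proposed above.
final class ReloadMetrics {
    private final Counter reloadSuccess;
    private final Counter reloadFailure;
    private final Timer reloadDuration;

    ReloadMetrics(MeterRegistry registry, String configVersion) {
        reloadSuccess = Counter.builder("kroxylicious_config_reload_total")
                .tag("result", "success").register(registry);
        reloadFailure = Counter.builder("kroxylicious_config_reload_total")
                .tag("result", "failure").register(registry);
        reloadDuration = Timer.builder("kroxylicious_config_reload_duration_seconds")
                .register(registry);
        // Gauge carrying the current config version as a label; the value is a constant 1.
        Gauge.builder("kroxylicious_config_version_info", () -> 1)
                .tag("version", configVersion).register(registry);
    }

    void recordSuccess(long durationMs) {
        reloadSuccess.increment();
        reloadDuration.record(java.time.Duration.ofMillis(durationMs));
    }

    void recordFailure(long durationMs) {
        reloadFailure.increment();
        reloadDuration.record(java.time.Duration.ofMillis(durationMs));
    }
}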

Recommendation: Add metrics to observability section.

Error Handling and Recovery

Rollback Failure Handling:

Current design: Log "CRITICAL: system may be in inconsistent state"

Proposed: Track rollback state and expose it via the config status endpoint (see above).

Recovery path:

  1. Query /admin/config/status to see which clusters failed rollback
  2. Manual intervention:
    • Verify cluster state (is port bound? filter initialized?)
    • Either retry reload or manually fix state
  3. Operator automation (future):
    • Detect rollback failure from the config status endpoint
    • Attempt recovery (remove failed cluster, re-add from old config)

Recommendation: Document rollback failure recovery procedures.

Concurrent Reload Prevention:

  • Only one reload at a time (enforced via lock)
  • Concurrent requests fail fast with 409 Conflict
POST /admin/config/reload
{new config}

Response (409 Conflict):
{
  "error": "Reload already in progress",
  "inProgressSince": "2026-01-28T10:15:30Z"
}
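
A minimal sketch of the single-reload lock; how a false return maps to the 409 response depends on the management endpoint framework, so the types here are illustrative:

import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical guard ensuring only one reload runs at a time; a second request
// fails fast instead of queuing behind the first.
final class ReloadGuard {
    private final AtomicReference<Instant> inProgressSince = new AtomicReference<>();

    /** Returns true if the caller acquired the reload slot, false if a reload is already running. */
    boolean tryBegin() {
        return inProgressSince.compareAndSet(null, Instant.now());
    }

    void end() {
        inProgressSince.set(null);
    }

    /** Populated while a reload is running; used to build the 409 response body. */
    Instant inProgressSince() {
        return inProgressSince.get();
    }
}

The endpoint handler would call tryBegin() before starting a reload, map a false return to the 409 Conflict body shown above, and call end() in a finally block.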

Recommendation: Document concurrency model in API specification.

Design Document Structure

Suggest organizing the design document as follows. Note: This structure assumes the HTTP-first approach described above. If the community prefers the file watch approach, the structure would need to adjust accordingly (swap "HTTP Endpoints" with "File Watch" as primary, etc.).

1. Goals and Non-Goals

Goals:

  • Zero-restart configuration updates
  • Universal deployment model (bare metal, Kubernetes)
  • Operator-friendly integration
  • Clear error handling and rollback

Non-Goals:

  • Zero-downtime modification (brief downtime per cluster is acceptable)
  • Hot-swapping filters in active connections
  • Partial success / continue-on-failure

2. Architecture

2.1 Core: HTTP Management Endpoints

Required endpoints:

  • POST /admin/config/validate - Validate config without applying
  • POST /admin/config/reload - Apply validated config
  • GET /admin/config/status - Current config version, last operation status
  • GET /admin/health - Proxy health (liveness/readiness)

Security:

  • Default bind: localhost:9190 (bare metal)
  • Kubernetes bind: 0.0.0.0:9190 (pod IP)
  • Optional TLS/mTLS for authentication
  • NetworkPolicy to restrict access in Kubernetes

2.2 Trigger Mechanisms (Optional)

Direct HTTP (Kubernetes):

  • Operator calls endpoints directly
  • No file watching needed

File Watcher Sidecar (Bare Metal):

  • Separate process watches config file
  • Calls HTTP endpoints on change
  • Options: shell script, Go binary, Java WatchService
  • Decoupled from proxy process

2.3 Reload Mechanism

  • Remove→add pattern (architecturally necessary)
  • Sequential processing (simplicity > parallelism)
  • All-or-nothing rollback (operational simplicity - needs discussion)

2.4 Validation Strategy

  • Build models + initialize filters without port binding
  • Deterministic (same result on all pods)
  • Catches config errors, not runtime failures

3. Deployment Patterns

3.1 Bare Metal

  • HTTP endpoints on localhost:9190
  • Config file on disk (optional)
  • Persist config to disk on success (if writable filesystem)

3.2 Kubernetes with Operator

  • HTTP endpoints on 0.0.0.0:9190
  • Config in ConfigMap (operator-managed)
  • Management Service for validation (exposes port 9190)
  • Checksum-based change detection (avoid no-op reloads)
  • 2-phase commit (validate via Service → reload all pods)
  • Read-only filesystem support (no disk persistence)
  • Sidecar file watcher (optional) → calls HTTP endpoints

4. Failure Modes and Recovery

  • Filter initialization failure → rollback
  • Port binding failure → rollback
  • Rollback failure → tracked state, manual recovery
  • Concurrent reload → fail fast with 409

5. Observability

  • Logging throughout reload process
  • Metrics for reload operations

6. Future Enhancements

  • Granular endpoints (/reload/cluster/{name})
  • Canary rollout strategies
  • Blue-green at pod level (operator)

Questions for Design Discussion

  1. Should FilterFactory.initialize() be documented as validation-safe?

    • Must be idempotent (can be called multiple times)?
    • Should avoid side effects (don't connect to external services)?
    • Or allow filter authors to decide (validation calls real KMS if they want)?
  2. Rollback Strategy: All-or-Nothing vs Partial Success (Critical Design Decision)

    This requires community consensus before proceeding.

    Scenario: Config change affects cluster-a, cluster-b, cluster-c

    • cluster-a: modify succeeds ✅ (downtime: 2s)
    • cluster-b: modify fails ❌ (downtime: 30s)
    • cluster-c: modify succeeds ✅ (downtime: 2s)

    Option A: All-or-Nothing (Current POC)

    Result: Rollback cluster-a and cluster-c
    Final state: All clusters on OLD config
    Total downtime: cluster-a (4s), cluster-b (30s), cluster-c (4s)
    

    Pros:

    • ✅ Single source of truth (config file intent OR previous state, never mixed)
    • ✅ Predictable retry path (fix issue → retry → all move together)
    • ✅ No configuration drift (never "cluster-a on v2, cluster-b on v1")
    • ✅ Simple status model (one config version for entire proxy)
    • ✅ Follows declarative configuration philosophy (Kubernetes/GitOps)

    Cons:

    • ❌ Unnecessary downtime for successful clusters during rollback
    • ❌ Wastes successful work (cluster-a, cluster-c succeeded but rolled back)

    Option B: Partial Success / Continue-on-Failure

    Result: Keep cluster-a and cluster-c on new config
    Final state: cluster-a (NEW), cluster-b (OLD), cluster-c (NEW)
    Total downtime: cluster-a (2s), cluster-b (30s), cluster-c (2s)
    

    Pros:

    • ✅ Less total downtime (no rollback for successful clusters)
    • ✅ Preserves successful work

    Cons:

    • ❌ Configuration drift (reality doesn't match declared intent)
    • ❌ Complex status model (per-cluster versions: {a: "v2", b: "v1", c: "v2"})
    • ❌ Unclear retry path (should cluster-a reload again? How does operator know?)
    • ❌ Reconciliation complexity (which clusters already on target version?)
    • ❌ Requires granular reload endpoints (/reload/cluster/{name})
    • ❌ Confusing user experience ("Reload failed" but some clusters succeeded?)

    Operational Comparison:

    Aspect                       | All-or-Nothing                   | Partial Success
    -----------------------------|----------------------------------|------------------------------
    Source of truth              | Config OR previous state (clear) | Mixed state (confusing)
    Retry after fixing cluster-b | Simple (reload all)              | Complex (skip a,c or reload?)
    Status API                   | One version                      | Per-cluster versions
    Downtime on failure          | Higher (rollback)                | Lower (no rollback)
    Operator logic               | Simple                           | Complex reconciliation
    User understanding           | Clear                            | Confusing

    User Experience Example:

    All-or-Nothing:

    $ kubectl apply -f new-config.yaml
    Error: Config reload failed on cluster-b (filter init error)
    Status: All clusters on version abc123 (previous config)
    Action: Fix cluster-b config, retry apply
    

    Partial Success:

    $ kubectl apply -f new-config.yaml
    Error: Config reload failed on cluster-b (filter init error)
    Status: cluster-a (def456), cluster-b (abc123), cluster-c (def456)
    Question: Should I retry? Will cluster-a reload again?
    

    Questions for the community:

    • Which operational model do users prefer?
    • Is configuration drift acceptable as a trade-off for less downtime?
    • Should this be configurable, or should we pick one approach?
    • If configurable:
      admin:
        rollbackStrategy: ALL  # Default? Or FAILED_ONLY?
    • Do we need granular reload endpoints regardless of rollback strategy?

    Claude's recommendation: Start with all-or-nothing (simpler, and it matches the declarative config philosophy), gather operational feedback, and add partial success later if users request it. But this needs community buy-in, not just a maintainer decision.

  3. Should we define granular reload endpoints now or defer?

    • POST /admin/config/reload (full config, current)
    • POST /admin/config/reload/cluster/{name} (single cluster, future?)
  4. What should config version format be?

    • SHA-256 hash (deterministic, no clock dependency)
    • Timestamp-based (easier for humans to understand)
    • Operator-provided (e.g., ConfigMap resourceVersion)

Summary

The configuration reload design addresses a critical operational need. This feedback proposes HTTP endpoints with 2-phase commit (validate → reload) as the primary interface (alternative to the current file watch proposal) for the following reasons:

Why HTTP-first with validation:

  • Better Kubernetes integration (operator-friendly, read-only filesystem compatible)
  • Clear observability (HTTP responses vs file watch with no feedback)
  • Testability (programmatic testing vs file system manipulation)
  • Validation catches config errors before any cluster goes down
  • File watching can still be supported as a convenience layer that calls HTTP internally

Core components proposed:

  1. POST /admin/config/validate - Validates config without applying (deterministic, fast)
  2. POST /admin/config/reload - Applies validated config (with rollback on failure)
  3. Management Service - Kubernetes Service exposing port 9190 for operator access
  4. Checksum-based change detection - Avoid unnecessary reloads on no-op reconciliation
  5. Read-only filesystem support - Make disk persistence optional for Kubernetes

Key takeaway: The architectural constraints (channel state machine, draining requirement) mean the design correctly accepts brief downtime per cluster modification. This is not a limitation—it's the right trade-off for operational simplicity and safety.

Recommended next steps:

  1. Discuss HTTP vs file watch as primary mechanism - This is a fundamental design choice that needs community input
  2. Discuss rollback strategy - All-or-nothing vs partial success requires consensus
  3. Add validation endpoint and 2-phase commit to design
  4. Add Kubernetes integration patterns (management Service, checksum-based change detection)
  5. Document failure modes and recovery procedures
  6. Refine POC implementation (PR#3176) based on finalized design

Excellent work on the POC—it provides a solid foundation for whichever trigger mechanism the community prefers!
