[CAE-1072] Hitless Upgrades #3447

Open
wants to merge 19 commits into
base: ndyakov/CAE-1088-resp3-notification-handlers

ndyakov
Member

@ndyakov ndyakov commented Jul 25, 2025

Hitless Upgrades

Seamless Redis connection handoffs during topology changes without interrupting operations.

Quick Start

import (
    "github.com/redis/go-redis/v9"
    "github.com/redis/go-redis/v9/hitless"
)

opt := &redis.Options{
    Addr:     "localhost:6379",
    Protocol: 3, // RESP3 required
    HitlessUpgrades: &redis.HitlessUpgradeConfig{
        Mode: hitless.MaintNotificationsEnabled, // or MaintNotificationsAuto
    },
}
client := redis.NewClient(opt)

Modes

  • MaintNotificationsDisabled: Hitless upgrades are completely disabled
  • MaintNotificationsEnabled: Hitless upgrades are force-enabled (fails if the server doesn't support them)
  • MaintNotificationsAuto: Hitless upgrades are enabled if server supports it (default)

Configuration

import (
    "time"

    "github.com/redis/go-redis/v9/hitless"
)

Config: &hitless.Config{
    Mode:                       hitless.MaintNotificationsAuto, // Notification mode
    MaxHandoffRetries:          3,  // Retry failed handoffs
    HandoffTimeout:             15 * time.Second, // Handoff operation timeout
    RelaxedTimeout:             10 * time.Second, // Extended timeout during migrations
    PostHandoffRelaxedDuration: 20 * time.Second, // Keep relaxed timeout after handoff
    LogLevel:                   1,  // 0=errors, 1=warnings, 2=info, 3=debug
    MaxWorkers:                 15, // Concurrent handoff workers
    HandoffQueueSize:           50, // Handoff request queue size
}

Worker Scaling

  • Auto-calculated: min(10, PoolSize/3) - scales with pool size, capped at 10
  • Explicit values: max(10, set_value) - enforces minimum 10 workers
  • On-demand: Workers created when needed, cleaned up when idle
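
The two sizing rules above can be written out as a small function. This is a sketch of the formulas as stated in this description, not the PR's code: `resolveMaxWorkers` is a hypothetical name, and treating a value of zero or less as "not set" and using integer division for `PoolSize/3` are assumptions.

```go
package main

import "fmt"

// resolveMaxWorkers applies the worker-scaling rules from the description.
// configured <= 0 is treated as "not set" (an assumption for this sketch).
func resolveMaxWorkers(configured, poolSize int) int {
	if configured <= 0 {
		// Auto-calculated: min(10, PoolSize/3), using integer division.
		workers := poolSize / 3
		if workers > 10 {
			workers = 10
		}
		return workers
	}
	// Explicit value: max(10, configured), so at least 10 workers.
	if configured < 10 {
		return 10
	}
	return configured
}

func main() {
	fmt.Println(resolveMaxWorkers(0, 12))  // auto, small pool
	fmt.Println(resolveMaxWorkers(0, 90))  // auto, large pool hits the cap
	fmt.Println(resolveMaxWorkers(4, 90))  // explicit value below the minimum
	fmt.Println(resolveMaxWorkers(15, 90)) // explicit value above the minimum
}
```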

Queue Sizing

  • Auto-calculated: 10 × MaxWorkers, capped by pool size
  • Always capped: Queue size never exceeds pool size
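
The queue-sizing rules can be sketched the same way. `resolveQueueSize` is a hypothetical helper, not a function from this PR. It assumes that a value of zero or less means "not set" and reads "always capped" as applying to explicit values as well as auto-calculated ones.

```go
package main

import "fmt"

// resolveQueueSize applies the queue-sizing rules from the description.
// configured <= 0 is treated as "not set" (an assumption for this sketch).
func resolveQueueSize(configured, maxWorkers, poolSize int) int {
	size := configured
	if size <= 0 {
		size = 10 * maxWorkers // Auto-calculated: 10 x MaxWorkers
	}
	if size > poolSize {
		size = poolSize // Always capped: never exceeds the pool size
	}
	return size
}

func main() {
	fmt.Println(resolveQueueSize(0, 4, 12))    // auto value exceeds the pool, so it is capped
	fmt.Println(resolveQueueSize(0, 10, 200))  // auto value fits
	fmt.Println(resolveQueueSize(50, 15, 100)) // explicit value fits
	fmt.Println(resolveQueueSize(50, 15, 30))  // explicit value is capped
}
```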

Metrics Hook Example

example_hooks.go provides a metrics collection hook that demonstrates how to monitor hitless upgrade operations:

import "github.com/redis/go-redis/v9/hitless"

metricsHook := hitless.NewMetricsHook()
// Use with your monitoring system

The metrics hook tracks:

  • Handoff success/failure rates
  • Handoff duration
  • Queue depth
  • Worker utilization
  • Connection lifecycle events

Requirements

  • RESP3 Protocol: Required for push notifications

@ndyakov ndyakov force-pushed the ndyakov/CAE-1072-hitless-upgrades-2 branch 2 times, most recently from 49e5814 to 43aef14 Compare July 25, 2025 12:51
@ndyakov ndyakov changed the base branch from ndyakov/CAE-1088-resp3-notification-handlers to master July 25, 2025 12:52
@ndyakov ndyakov force-pushed the ndyakov/CAE-1072-hitless-upgrades-2 branch 21 times, most recently from 8608f93 to a39a23a Compare July 30, 2025 07:05
@ndyakov ndyakov force-pushed the ndyakov/CAE-1072-hitless-upgrades-2 branch 2 times, most recently from e88e673 to 4542e8f Compare August 4, 2025 13:01
@ndyakov ndyakov changed the base branch from master to ndyakov/CAE-1088-resp3-notification-handlers August 4, 2025 13:02
@ndyakov ndyakov force-pushed the ndyakov/CAE-1072-hitless-upgrades-2 branch 3 times, most recently from 9590c26 to 100c3d2 Compare August 4, 2025 13:57
ndyakov and others added 3 commits August 19, 2025 15:38
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@ndyakov ndyakov marked this pull request as ready for review August 19, 2025 13:16
@ndyakov ndyakov requested a review from Copilot August 19, 2025 13:16
@ndyakov ndyakov changed the title from "[WIP] hitless" to "[CAE-1072] Hitless Upgrades" Aug 19, 2025
ndyakov and others added 4 commits August 19, 2025 16:21
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@ndyakov ndyakov requested a review from Copilot August 19, 2025 13:38
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR implements hitless upgrades for Redis connections, enabling seamless connection handoffs during topology changes without interrupting operations. It adds comprehensive support for RESP3 push notifications to handle cluster migration events gracefully.

Key changes include:

  • Hitless upgrade manager and pool hooks for event-driven connection handoffs
  • Enhanced connection pool with relaxed timeout support and atomic state management
  • RESP3 push notification handling for cluster migration events (MOVING, MIGRATING, etc.)

Reviewed Changes

Copilot reviewed 46 out of 48 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| universal.go | Added HitlessUpgradeConfig support to universal options |
| redis.go | Integrated hitless manager and enhanced connection initialization |
| options.go | Added hitless configuration with endpoint auto-detection |
| internal/pool/conn.go | Enhanced connection with atomic handoff state and relaxed timeouts |
| internal/pool/pool.go | Added pool hooks system and improved connection lifecycle management |
| hitless/pool_hook.go | Implemented event-driven connection handoff processing |
| hitless/notification_handler.go | Added RESP3 push notification handlers for cluster events |

@ndyakov ndyakov force-pushed the ndyakov/CAE-1072-hitless-upgrades-2 branch from 4f069e2 to adee0a8 Compare August 19, 2025 13:50
}

// Use the base dialer to connect to the new endpoint
return ph.baseDialer(ctx, ph.network, net.JoinHostPort(host, port))
Collaborator

When MOVING points to a different host/FQDN or an IP, we reuse the original TLS config (including ServerName) while dialing the new endpoint. That can cause hostname verification mismatches (e.g., cert valid for old host, not new host/IP).

Member Author

Correct. Since no certificates are transmitted as part of the notification, I am assuming there would be a root-level TLS configuration that covers the new endpoint.

@ndyakov ndyakov force-pushed the ndyakov/CAE-1072-hitless-upgrades-2 branch from 226822f to 2e47e39 Compare August 21, 2025 13:02
@ndyakov ndyakov force-pushed the ndyakov/CAE-1072-hitless-upgrades-2 branch from 0baeb53 to b2228f4 Compare August 21, 2025 14:21
@ndyakov ndyakov force-pushed the ndyakov/CAE-1072-hitless-upgrades-2 branch from c908056 to bfca15a Compare August 22, 2025 16:48