
Ha/main#5988

Closed
nik-dev-ops wants to merge 2 commits into netbirdio:main from nik-dev-ops:ha/main


Conversation


nik-dev-ops commented Apr 24, 2026


Summary by CodeRabbit

Release Notes

  • New Features

    • High-availability (HA) mode with Redis-backed distributed coordination for Signal and Management servers
    • Cross-instance peer routing and message forwarding
    • Automatic failover between server instances with distributed registry and locking
    • Docker Compose-based HA testing environment
  • Documentation

    • Added comprehensive build, deployment, and testing guides
    • Updated main README with HA architecture and quick-start examples
    • Included rebase strategy documentation for maintaining HA fork
  • Tests

    • New integration test suite validating Signal and Management HA behavior, failover, and degradation scenarios
  • Chores

    • Dependency updates (github.com/rs/xid bumped to v1.6.0)

VPN Dev and others added 2 commits April 24, 2026 02:25
…anagement servers

- Signal server: Redis distributed registry and pub/sub for cross-instance peer routing and message forwarding
- Management server: Redis pub/sub for account updates, distributed locks (SET NX EX), ephemeral peer deadline management
- Traefik load balancer with health checks for automatic failover
- 14 integration tests validating HA behavior (7 signal + 7 management)
- Full WireGuard encrypted login+sync in failover tests
- Comprehensive documentation: README.md, docs/TESTING.md, docs/BUILD_DEPLOY.md, docs/REBASE_GUIDE.md
- All HA parameters externally configurable via env vars and YAML
- Docker Compose test environment with 2x signal + 2x management + Traefik + Redis
- Original upstream README preserved as original_readme.md
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ VPN Dev
❌ nik-dev-ops


VPN Dev seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 24, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: adfb3ba9-b865-4d4f-b46e-5257f9f1ed1a

📥 Commits

Reviewing files that changed from the base of the PR and between d6f08e4 and e168173.

⛔ Files ignored due to path filters (2)
  • go.sum is excluded by !**/*.sum
  • tests/integration/go.sum is excluded by !**/*.sum
📒 Files selected for processing (48)
  • .env.example
  • README.md
  • README_FORK.md
  • combined/cmd/config.go
  • combined/cmd/root.go
  • docker-compose.ha-test.yml
  • docs/BUILD_DEPLOY.md
  • docs/REBASE_GUIDE.md
  • docs/TESTING.md
  • go.mod
  • management/Dockerfile
  • management/cmd/management.go
  • management/internals/controllers/network_map/controller/controller.go
  • management/internals/controllers/network_map/update_channel/updatechannel.go
  • management/internals/modules/peers/ephemeral/manager/ephemeral.go
  • management/internals/server/boot.go
  • management/internals/server/config/config.go
  • management/internals/server/controllers.go
  • management/internals/server/server.go
  • management/internals/shared/grpc/loginfilter.go
  • management/internals/shared/grpc/server.go
  • management/internals/shared/grpc/token_mgr.go
  • management/internals/shared/grpc/token_mgr_test.go
  • management/server/distributed/config.go
  • management/server/distributed/lock.go
  • management/server/distributed/registry.go
  • management/server/management_proto_test.go
  • management/server/management_test.go
  • original_readme.md
  • shared/distributed/config.go
  • shared/distributed/redis.go
  • signal/cmd/root.go
  • signal/cmd/run.go
  • signal/metrics/app.go
  • signal/server/config.go
  • signal/server/signal.go
  • tests/integration/Dockerfile.agent
  • tests/integration/Dockerfile.test
  • tests/integration/README.md
  • tests/integration/config/management.json
  • tests/integration/go.mod
  • tests/integration/helper_test.go
  • tests/integration/management_ha_test.go
  • tests/integration/scripts/agent-setup.sh
  • tests/integration/scripts/build.sh
  • tests/integration/scripts/init-test-data.sh
  • tests/integration/scripts/run-tests.sh
  • tests/integration/signal_ha_test.go

📝 Walkthrough


This PR introduces a complete high-availability (HA) infrastructure for NetBird, enabling horizontal scaling of Signal and Management servers through Redis-backed distributed coordination. It adds configuration, deployment tooling, comprehensive integration tests, and documentation to support active-active clustering with automatic failover, cross-instance routing, and distributed locks.

Changes

Cohort / File(s) Summary
Shared HA Foundation
shared/distributed/config.go, shared/distributed/redis.go
Introduces core HA configuration (HAConfig) with Redis connection details, timeout/pool settings, and instance ID detection; provides Redis client wrapper with validation, health checking, and graceful shutdown.
Signal HA Implementation
signal/server/config.go, signal/server/signal.go, signal/cmd/root.go, signal/cmd/run.go, signal/metrics/app.go
Adds Signal-specific HA config with registry/channel keys and TTL settings; implements Redis registry lookups for cross-instance peer routing, distributed registry updates with heartbeat refresh, and new metrics for cross-instance message forwarding and registry hit/miss tracking.
Management HA Configuration & Initialization
management/server/distributed/config.go, management/internals/server/boot.go, management/internals/server/server.go, management/internals/server/config/config.go
Defines Management-specific HA config with key prefixes for peers/accounts/locks/logins/ephemeral; adds BaseServer Redis client accessor; initializes and stores Redis client during server start with proper cleanup on shutdown.
Management Distributed Primitives
management/server/distributed/lock.go, management/server/distributed/registry.go
Implements RedisLock for distributed mutual exclusion using SET NX EX with instance-ID-guarded deletion and TTL refresh; implements RedisRegistry for peer-to-instance mapping with per-peer TTL management.
Management Network Map & Controllers
management/internals/controllers/network_map/controller/controller.go, management/internals/controllers/network_map/update_channel/updatechannel.go, management/internals/server/controllers.go
Extends account peer updates to publish Redis pub/sub messages; adds handler for remote account updates received from Redis; adds subscription listener for cross-instance account update notifications.
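The cross-instance notification can be sketched with an in-memory bus standing in for Redis pub/sub. The envelope fields, the JSON encoding, and the rule of ignoring self-published messages are assumptions about how such a listener could work, not the PR's wire format:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// accountUpdate is a hypothetical message envelope.
type accountUpdate struct {
	AccountID string `json:"account_id"`
	Origin    string `json:"origin"` // instance ID of the publisher
}

// memBus stands in for Redis PUBLISH/SUBSCRIBE: every subscriber, including
// the publisher itself, receives every message.
type memBus struct{ subs []func([]byte) }

func (b *memBus) Subscribe(fn func([]byte)) { b.subs = append(b.subs, fn) }

func (b *memBus) Publish(msg []byte) {
	for _, fn := range b.subs {
		fn(msg)
	}
}

// instance recalculates network maps for its own peers, then announces the
// change so other instances refresh the peers connected to them.
type instance struct {
	id         string
	bus        *memBus
	recomputed []string // account IDs recalculated, recorded for the demo
}

func newInstance(id string, b *memBus) *instance {
	in := &instance{id: id, bus: b}
	b.Subscribe(in.onMessage)
	return in
}

func (in *instance) UpdateAccountPeers(accountID string) error {
	in.recomputed = append(in.recomputed, accountID) // local peers first
	payload, err := json.Marshal(accountUpdate{AccountID: accountID, Origin: in.id})
	if err != nil {
		return err
	}
	in.bus.Publish(payload)
	return nil
}

// onMessage skips updates this instance published itself, since its own
// peers were already updated before publishing.
func (in *instance) onMessage(payload []byte) {
	var u accountUpdate
	if err := json.Unmarshal(payload, &u); err != nil || u.Origin == in.id {
		return
	}
	in.recomputed = append(in.recomputed, u.AccountID)
}

func main() {
	b := &memBus{}
	m1, m2 := newInstance("mgmt-1", b), newInstance("mgmt-2", b)
	_ = m1.UpdateAccountPeers("acc-42")
	fmt.Println(m1.recomputed, m2.recomputed) // [acc-42] [acc-42]
}
```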
Management Ephemeral & Login
management/internals/modules/peers/ephemeral/manager/ephemeral.go, management/internals/shared/grpc/loginfilter.go
Adds Redis ZSET-backed ephemeral peer cleanup with background polling and expiry processing; extends login filter with Redis persistence for peer login state with TTL-based expiry.
Management gRPC & Locking
management/internals/shared/grpc/server.go, management/internals/shared/grpc/token_mgr.go, management/internals/shared/grpc/token_mgr_test.go
Introduces NoopLock implementation for backward compatibility; replaces per-peer in-memory mutexes with pluggable Lock interface; converts TimeBasedAuthSecretsManager to stateless on-demand credential generation.
Combined Server HA Config
combined/cmd/config.go, combined/cmd/root.go
Adds HA configuration field to SignalConfig struct; wires HA settings from config into Signal server creation.
Testing Infrastructure
docker-compose.ha-test.yml, tests/integration/Dockerfile.agent, tests/integration/Dockerfile.test, tests/integration/go.mod, tests/integration/config/management.json, tests/integration/README.md
Defines complete HA test environment with dual Signal/Management instances, Redis, PostgreSQL, Traefik, Coturn, relay, and test agents; provides integration test module configuration and test environment documentation.
Integration Test Helpers & Suites
tests/integration/helper_test.go, tests/integration/management_ha_test.go, tests/integration/signal_ha_test.go
Implements test helpers for Redis/gRPC clients, PAT generation, Docker control, and Postgres queries; provides comprehensive HA test suites validating cross-instance messaging, registry behavior, failover, distributed locks, health consistency, and policy propagation.
Integration Test Scripts
tests/integration/scripts/build.sh, tests/integration/scripts/init-test-data.sh, tests/integration/scripts/run-tests.sh, tests/integration/scripts/agent-setup.sh
Provides build script with CGO/build metadata injection; data initialization with idempotent setup key/PAT/peer creation; test runner with service health checks; agent setup for NetBird startup with environment variables.
Documentation & Deployment
README.md, README_FORK.md, docs/BUILD_DEPLOY.md, docs/TESTING.md, docs/REBASE_GUIDE.md, original_readme.md
Updates main README with HA architecture, Signal/Management behavior, file-by-file changes, and quick-start flow; adds fork-specific README; provides comprehensive build/deploy, integration testing, and rebase strategy guides.
Utilities & Dependencies
management/Dockerfile, management/cmd/management.go, go.mod
Updates management Dockerfile to include wget; adds HA config loading from environment variables in management startup; bumps github.com/rs/xid dependency to v1.6.0.

Sequence Diagram(s)

sequenceDiagram
    participant PeerA as Peer A<br/>(Signal Instance 1)
    participant Sig1 as Signal 1<br/>HA Server
    participant Redis as Redis<br/>Registry
    participant Sig2 as Signal 2<br/>HA Server
    participant PeerB as Peer B<br/>(Signal Instance 2)

    PeerA->>Sig1: Connect Stream (register)
    Sig1->>Redis: HSet peer_registry: peerA → instance1
    Sig1->>Redis: Expire peer_registry (TTL)
    
    PeerB->>Sig2: Connect Stream (register)
    Sig2->>Redis: HSet peer_registry: peerB → instance2
    Sig2->>Redis: Expire peer_registry (TTL)
    
    PeerA->>Sig1: Send message to peerB
    Sig1->>Redis: HGet peer_registry for peerB
    Redis-->>Sig1: instance2
    Sig1->>Redis: Publish to signal:instance2 channel
    
    Sig2->>Redis: Subscribe signal:instance2 channel
    Redis-->>Sig2: Receive message envelope
    Sig2->>PeerB: Route message
    PeerB-->>Sig2: Receive
sequenceDiagram
    participant Agent as Agent
    participant Mgmt1 as Management 1<br/>HA Instance
    participant Redis as Redis<br/>Registry & Pub/Sub
    participant Mgmt2 as Management 2<br/>HA Instance

    Agent->>Mgmt1: Login (acquire lock)
    Mgmt1->>Redis: SETNX lock:peerID with instanceID
    Redis-->>Mgmt1: OK
    
    Mgmt1->>Mgmt1: Recalculate network map
    Mgmt1->>Redis: Publish account:accountID update
    
    Redis-->>Mgmt2: Receive account update
    Mgmt2->>Mgmt2: Recalculate network map
    Mgmt2->>Redis: HSet peer_registry: agentID → instance1
    
    alt Instance 1 Fails
        Agent->>Redis: Query peer_registry for mgmt instance
        Redis-->>Agent: instance2
        Agent->>Mgmt2: Reconnect, sync
        Mgmt2->>Agent: Send updated network config
    end

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested Reviewers

  • pascal-fischer

Poem

🐰 Hopping through Redis with instances in flight,
Locks and registries keeping peers in sight,
When one server stumbles, another takes the lead,
High availability magic—that's what we need!
Scaling and syncing with heartbeat delight. 🚀


@sonarqubecloud

Quality Gate failed

Failed conditions
18 New issues
3 Security Hotspots
18 New Code Smells (required ≤ 0)

See analysis details on SonarQube Cloud




2 participants