diff --git a/.env.example b/.env.example
new file mode 100644
index 00000000000..f82ce95c744
--- /dev/null
+++ b/.env.example
@@ -0,0 +1,82 @@
+# NetBird HA Test Environment Configuration
+# Copy this file to .env and adjust values as needed
+# NOTHING is hardcoded -- all values come from this file
+
+# --- Domain Configuration ---
+NB_DOMAIN=nb-ha.local
+NB_SIGNAL_DOMAIN=signal.nb-ha.local
+NB_MGMT_DOMAIN=mgmt.nb-ha.local
+NB_RELAY_DOMAIN=relay.nb-ha.local
+NB_TURN_DOMAIN=turn.nb-ha.local
+NB_DASHBOARD_DOMAIN=dashboard.nb-ha.local
+
+# --- Redis Configuration ---
+NB_REDIS_ADDRESS=redis.nb-ha.local:6379
+NB_REDIS_PASSWORD=
+NB_REDIS_DB=0
+NB_REDIS_DIAL_TIMEOUT=5s
+NB_REDIS_READ_TIMEOUT=3s
+NB_REDIS_WRITE_TIMEOUT=3s
+NB_REDIS_POOL_SIZE=10
+
+# --- PostgreSQL Configuration ---
+NB_POSTGRES_HOST=postgres.nb-ha.local
+NB_POSTGRES_PORT=5432
+NB_POSTGRES_USER=netbird
+NB_POSTGRES_PASSWORD=netbird
+NB_POSTGRES_DB=netbird
+NB_POSTGRES_SSLMODE=disable
+
+# --- Relay Configuration ---
+NB_RELAY_SECRET=netbird-relay-secret-key-change-in-production
+NB_RELAY_LISTEN_ADDRESS=0.0.0.0:443
+NB_RELAY_EXPOSED_ADDRESS=relay.nb-ha.local:443
+
+# --- TURN Configuration ---
+NB_TURN_SECRET=netbird-turn-secret-key-change-in-production
+NB_TURN_REALM=nb-ha.local
+NB_TURN_PORT=3478
+
+# --- Shared HA Configuration (Signal + Management) ---
+NB_HA_ENABLED=true
+
+# --- Signal HA Configuration ---
+NB_SIGNAL_REGISTRY_KEY=nb:signal:registry
+NB_SIGNAL_CHANNEL_PREFIX=nb:signal:instance:
+NB_SIGNAL_PEER_TTL=60s
+NB_SIGNAL_HEARTBEAT_INTERVAL=30s
+NB_SIGNAL_SEND_TIMEOUT=10s
+
+# --- Management HA Configuration ---
+NB_MGMT_PEERS_REGISTRY_KEY=nb:mgmt:peers
+NB_MGMT_ACCOUNT_CHANNEL_PREFIX=nb:mgmt:account:
+NB_MGMT_LOCK_PREFIX=nb:mgmt:lock:
+NB_MGMT_LOGIN_FILTER_KEY=nb:mgmt:loginfilter
+NB_MGMT_EPHEMERAL_KEY=nb:mgmt:ephemeral
+NB_MGMT_PEER_TTL=60s
+NB_MGMT_HEARTBEAT_INTERVAL=30s
+NB_MGMT_LOCK_TTL=30s
+
+# --- Logging ---
+NB_LOG_LEVEL=debug
+NB_LOG_FILE=console
+
+# --- Dashboard ---
+NB_DASHBOARD_IMAGE=netbirdio/dashboard:latest
+
+# --- Ports (host mapping) ---
+NB_HOST_REDIS_PORT=6379
+NB_HOST_POSTGRES_PORT=5432
+NB_HOST_MGMT1_PORT=33073
+NB_HOST_MGMT2_PORT=33074
+NB_HOST_SIGNAL1_PORT=10000
+NB_HOST_SIGNAL2_PORT=10001
+NB_HOST_RELAY_PORT=443
+NB_HOST_TURN_PORT=3478
+NB_HOST_TURN_TLS_PORT=5349
+NB_HOST_DASHBOARD_PORT=8080
+NB_HOST_MGMT1_METRICS=9091
+NB_HOST_MGMT2_METRICS=9092
+NB_HOST_SIGNAL1_METRICS=9093
+NB_HOST_SIGNAL2_METRICS=9094
+NB_HOST_RELAY_METRICS=9095
diff --git a/README.md b/README.md
index dc84af2fd04..fe816a1058d 100644
--- a/README.md
+++ b/README.md
@@ -1,149 +1,315 @@
+# NetBird High Availability (HA) Fork
+
+**A horizontally-scalable, active-active fork of [netbirdio/netbird](https://github.com/netbirdio/netbird).**
+
+This fork adds Redis-based distributed state to enable multiple Signal and Management server instances to operate concurrently behind a load balancer. All changes are backward-compatible: when HA is disabled, the system behaves exactly like upstream NetBird.
+
+---
+
+## Table of Contents
+
+1. [Architecture Overview](#architecture-overview)
+2. [What Changed (File-by-File)](#what-changed-file-by-file)
+3. [Technologies Used](#technologies-used)
+4. [Key Design Decisions](#key-design-decisions)
+5. [Configuration Reference](#configuration-reference)
+6. [Quick Start](#quick-start)
+7. [Integration Tests](#integration-tests)
+8. [Build & Deploy](#build--deploy)
+9. [Maintaining After Upstream Updates](#maintaining-after-upstream-updates)
+10. [Project Structure](#project-structure)
+
+---
+
+## Architecture Overview
+
+```
+                         Traefik LB (localhost:8088)
+                                      |
+           +--------------------------+--------------------------+
+           |                          |                          |
+    +------v------+            +------v------+             +-----v-------+
+    |  signal-1   |            |  signal-2   |             |  dashboard  |
+    |   :10000    |            |   :10000    |             |    :80      |
+    +------+------+            +------+------+             +-------------+
+           |                          |
+           +------------+-------------+
+                        |
+               +--------v---------+
+               |      Redis       |
+               | nb:signal:registry |
+               | nb:signal:instance: |
+               +--------+---------+
+                        |
+               +--------v---------+
+               |    PostgreSQL    |
+               |  (shared state)  |
+               +--------+---------+
+                        |
+           +------------+-------------+
+           |                          |
+    +------v------+            +------v------+
+    |   mgmt-1    |            |   mgmt-2    |
+    |   :33073    |            |   :33073    |
+    +------+------+            +------+------+
+           |                          |
+           +------------+-------------+
+                        |
+               +--------v---------+
+               |      Redis       |
+               | nb:mgmt:account: |
+               | nb:mgmt:lock:    |
+               | nb:mgmt:ephemeral|
+               +------------------+
+```
+
+### Components
+
+| Component | Role | Count |
+|-----------|------|-------|
+| **Traefik** | Reverse proxy & load balancer for HTTP/gRPC | 1 |
+| **Signal Server** | WebRTC signaling, peer message relay | 2+ |
+| **Management Server** | Peer auth, network maps, policies | 2+ |
+| **Redis** | Distributed state, pub/sub, locks | 1 (or Sentinel/Cluster) |
+| **PostgreSQL** | Persistent account, peer, policy data | 1 |
+| **Relay** | Fallback peer relay (self-hosted) | 1 |
+| **coturn** | STUN/TURN for NAT traversal | 1 |
+| **Dashboard** | Web UI (Next.js via Traefik) | 1 |
+
+### How HA Works
+
+#### Signal Server HA
+- Each peer is registered in Redis under `nb:signal:registry` (HSET: peerPubKey -> instanceID)
+- Each signal instance subscribes to a Redis channel `nb:signal:instance:`
+- When a peer sends a message to another peer, the server looks up the recipient's instance in Redis
+- If the recipient is on a different instance, the message is forwarded via Redis pub/sub
+- Heartbeat goroutines refresh the Redis TTL every 30 seconds
+- If Redis is unavailable, signal degrades to local-only mode (no cross-instance routing)
+
+#### Management Server HA
+- **Account Updates**: When a management instance changes account state, it publishes to `nb:mgmt:account:` on Redis. All instances receive the event and push updates to connected peers.
+- **Distributed Locks**: Critical operations (peer registration, account creation) use Redis `SET NX EX` locks with TTL and heartbeat refresh.
+- **Peer Registry**: Maps peer -> management instance in Redis Hash with TTL.
+- **Login Filter**: Tracks in-progress logins in Redis Hash to prevent duplicate registration attempts.
+- **Ephemeral Peers**: Uses Redis ZSET with TTL deadlines; a background goroutine polls and cleans up expired entries.
+- **TURN/Relay Credentials**: Stateless credential refresh using HMAC (no in-memory timers), safe for any instance to generate.
+
+---
+
+## What Changed (File-by-File)
+
+### New Files
+
+| File | Purpose |
+|------|---------|
+| `shared/distributed/config.go` | `HAConfig` struct with env var bindings for all HA services |
+| `shared/distributed/redis.go` | Redis client wrapper with health checks and reconnection |
+| `management/server/distributed/config.go` | `ManagementHAConfig` extending HAConfig with mgmt-specific keys |
+| `management/server/distributed/lock.go` | Distributed lock implementation using `SET NX EX` + heartbeat |
+| `management/server/distributed/registry.go` | Peer-to-instance registry wrapper around Redis Hash |
+| `signal/server/config.go` | `SignalHAConfig` with signal-specific env vars |
+| `signal/metrics/app.go` | HA-specific metrics (cross-instance forwards, Redis errors) |
+| `.env.example` | All configuration values externalized |
+| `docker-compose.ha-test.yml` | Full test stack with Traefik, 2x signal, 2x mgmt, agents |
+| `tests/integration/**` | 14 integration tests + helper utilities |
+
+### Modified Files (Signal Server)
+
+| File | Change |
+|------|--------|
+| `signal/server/signal.go` | Added Redis registry, cross-instance pub/sub forwarding, heartbeat goroutines, graceful degradation when Redis unavailable |
+| `signal/cmd/run.go` | Parse HA CLI flags (`--ha-enabled`, `--ha-redis-address`) |
+| `signal/cmd/root.go` | Wire HA config into signal server initialization |
+| `signal/metrics/app.go` | Added cross-instance forward count, Redis error count, registry hit/miss metrics |
+
+### Modified Files (Management Server)
+
+| File | Change |
+|------|--------|
+| `management/internals/shared/grpc/server.go` | Added distributed peer locks (`NoopLock` fallback when HA disabled) |
+| `management/internals/shared/grpc/loginfilter.go` | Redis Hash + TTL for in-progress login tracking |
+| `management/internals/shared/grpc/token_mgr.go` | Stateless TURN/Relay credential refresh (removed in-memory timers) |
+| `management/internals/modules/peers/ephemeral/manager/ephemeral.go` | Redis ZSET for ephemeral peer deadlines with polling cleanup |
+| `management/internals/controllers/network_map/update_channel/updatechannel.go` | Account update pub/sub via Redis |
+| `management/internals/controllers/network_map/controller/controller.go` | Broadcast account updates to all connected peers |
+| `management/internals/server/server.go` | Wire Redis client into boot sequence |
+| `management/internals/server/boot.go` | Initialize Redis client and HA components |
+| `management/internals/server/controllers.go` | Pass Redis client to controllers |
+| `management/internals/server/config/config.go` | Added `HAConfig` field |
+| `management/cmd/management.go` | Parse HA flags from env vars |
+
+### Modified Files (Combined Mode)
+
+| File | Change |
+|------|--------|
+| `combined/cmd/root.go` | Pass HA config when running in combined mode |
+| `combined/cmd/config.go` | Wire HA config into combined server |
+
+### Modified Files (Test Environment)
+
+| File | Change |
+|------|--------|
+| `management/Dockerfile` | Added `wget` for healthchecks |
+| `tests/integration/config/management.json` | Self-hosted config with embedded IdP, STUN/TURN, relay |
+| `tests/integration/Dockerfile.test` | Full project copy + Docker CLI for container stop/start tests |
+| `tests/integration/Dockerfile.agent` | NetBird agent image for peer connectivity tests |
+
+---
+
+## Technologies Used
+
+| Technology | Version | Purpose |
+|------------|---------|---------|
+| Go | 1.25.5 | Primary language |
+| Redis | 7.x (via Docker) | Distributed state, pub/sub, locks |
+| PostgreSQL | 15+ (via Docker) | Persistent data store |
+| go-redis/v9 | 9.7.3 | Redis client library |
+| WireGuard | kernel module | VPN tunneling |
+| gRPC | 1.80.0 | Signal/Management RPC |
+| Traefik | v3.6 | Reverse proxy / load balancer |
+| Docker & Docker Compose | 29.x | Container orchestration |
+| coturn | latest | STUN/TURN server |
+| Next.js | latest (dashboard) | Web UI |
+
+---
+
+## Key Design Decisions
+
+1. **Redis-first approach**: Local memory is a cache; Redis is the source of truth for cross-instance routing.
+2. **Backward compatibility**: When `NB_HA_ENABLED=false` (or unset), the system uses `NoopLock` and nil Redis checks -- behavior is identical to upstream.
+3. **Env var auto-mapping**: Signal CLI flags are automatically populated from env vars via `setFlagsFromEnvVars()`.
+4. **Zero hardcoded values**: All URLs, endpoints, secrets are configurable via `.env` file.
+5. **Instance ID auto-detection**: Falls back from config -> env var -> hostname -> UUID.
+6. **Graceful degradation**: If Redis is unavailable, Signal continues in local-only mode; Management uses nil checks to skip HA features.
+7. **Traefik for same-origin**: Dashboard and embedded IdP are served on the same origin (`localhost:8088`) to avoid CORS issues.
+8. **Self-hosted everything**: No external dependencies -- STUN, TURN, relay, signal, management, dashboard all run in Docker.
+
+---
+
+## Configuration Reference
+
+All configuration is in `.env` (copy from `.env.example`):
-
-
-
-

- -

-

- - - - - - -
- - - - - - -
- - - -

-
- - -

- - Start using NetBird at netbird.io -
- See Documentation -
- Join our Slack channel or our Community forum -
- -
-
- - 🚀 We are hiring! Join us at careers.netbird.io - -
-
- - New: NetBird terraform provider - -

- -
-
-**NetBird combines a configuration-free peer-to-peer private network and a centralized access control system in a single platform, making it easy to create secure private networks for your organization or home.**
-
-**Connect.** NetBird creates a WireGuard-based overlay network that automatically connects your machines over an encrypted tunnel, leaving behind the hassle of opening ports, complex firewall rules, VPN gateways, and so forth.
-
-**Secure.** NetBird enables secure remote access by applying granular access policies while allowing you to manage them intuitively from a single place. Works universally on any infrastructure.
-
-### Open Source Network Security in a Single Platform
-
-https://github.com/user-attachments/assets/10cec749-bb56-4ab3-97af-4e38850108d2
-
-### Self-Host NetBird (Video)
-[![Watch the video](https://img.youtube.com/vi/bZAgpT6nzaQ/0.jpg)](https://youtu.be/bZAgpT6nzaQ)
-
-### Key features
-
-| Connectivity | Management | Security | Automation| Platforms |
-|----|----|----|----|----|
-| | | | | |
-| |