NATS cluster mode and test matrix#76

Open
poelzi wants to merge 16 commits into bluecatengineering:master from cluster21:nats-backend-and-tests
@poelzi poelzi commented Feb 27, 2026

Sorry, that's a big one (lots of test infrastructure code).

I have implemented a full NATS backend that allows running multiple dora servers on the same network in a high-availability (active-active) setup.

It also allows sending host-specific options via a NATS JetStream KV bucket. This will allow easy control in huge cluster environments.

I added a complete nix-based test framework that tests the standalone and NATS versions against an array of clients and creates a matrix report that can be used for long-term test reporting.

The NATS version is also exercised by 2 load generators: the kea tool and the new dhcp-loadtest.

  • The main reason for the custom tool is that I wanted to test that IP assignments are returned correctly under load, something the kea tool can't test.

It will be easy to add new DHCP clients to the test matrix after that.

The NATS version seems faster than the sqlite standalone version in the test VM.

Add opt-in clustered DHCP mode with NATS coordination config:

- T001: Add BackendMode enum (standalone/clustered) to wire config
  with default standalone and normalized accessors on DhcpConfig
- T002: Add NatsConfig, NatsSubjects, NatsSecurityMode structs
  with configurable subject templates and security mode selection
- T003: Add validate_cluster_config() enforcing required clustered
  fields (servers, contract_version, non-empty subjects) only when
  clustered mode is active; standalone validation path unchanged
- T004: Extend CLI config with --backend-mode, --instance-id, and
  --nats-servers runtime overrides for clustered operation
- T005: Update example.yaml and config_schema.json with clustered
  mode configuration examples and schema definitions
- T006: Add 16 new config regression tests covering legacy standalone
  parsing, clustered config validation (valid/invalid), custom
  subject overrides, security modes, and fixture files

Standalone mode remains default and behaviorally unchanged.
All 53 config tests and 7 dora-core tests pass.
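
As a rough illustration of T001 and T003 (not code from this PR; every type and field name below is an assumption derived from the commit message), a mode enum that defaults to standalone and a validator that only enforces clustered fields when clustered mode is active could look like this:

```rust
// Hypothetical sketch: names mirror the commit message, not the actual dora source.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum BackendMode {
    #[default]
    Standalone,
    Clustered,
}

#[derive(Debug, Clone, Default)]
pub struct NatsConfig {
    pub servers: Vec<String>,
    pub contract_version: String,
    pub subjects: Vec<String>,
}

#[derive(Debug, Clone, Default)]
pub struct DhcpConfig {
    pub backend_mode: BackendMode,
    pub nats: Option<NatsConfig>,
}

/// Enforces the clustered-only fields; standalone configs pass untouched.
pub fn validate_cluster_config(cfg: &DhcpConfig) -> Result<(), String> {
    if cfg.backend_mode == BackendMode::Standalone {
        return Ok(());
    }
    let nats = cfg
        .nats
        .as_ref()
        .ok_or_else(|| "clustered mode requires a nats section".to_string())?;
    if nats.servers.is_empty() {
        return Err("clustered mode requires at least one NATS server".to_string());
    }
    if nats.contract_version.trim().is_empty() {
        return Err("clustered mode requires a contract_version".to_string());
    }
    if nats.subjects.iter().any(|s| s.trim().is_empty()) {
        return Err("subjects must not be empty".to_string());
    }
    Ok(())
}
```

With this shape, a config that omits the backend mode behaves exactly like a legacy standalone config, which is the property the regression tests in T006 are described as checking.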
…ption coordination

Implement the NATS coordination library (libs/nats-coordination) with:

- T007: Crate scaffold with Cargo.toml, module layout, workspace wiring
- T008: Typed models and JSON codecs for LeaseRecord, HostOptionLookup
  request/response, LeaseSnapshot, CoordinationEvent matching AsyncAPI contract
- T009: Contract-versioned SubjectResolver with configurable templates,
  default prefix, and placeholder/empty-subject validation
- T010: NatsClient connection manager wrapping async-nats with optional
  auth modes (none/user_password/token/nkey/tls/creds_file), connection
  state observability, publish/request helpers with timeout
- T011: LeaseCoordinator with reserve/lease/release/probate/snapshot APIs,
  revision-aware conflict retry, and degraded-mode blocking
- T012: HostOptionClient with hit/miss/error outcome classification,
  correlation IDs, and bounded timeout (errors don't block DHCP)
- T013: 59 unit tests covering subject generation, codec round-trips,
  error classification, timeout/conflict retry, and degraded-mode behavior
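
To make T009 more concrete, here is a minimal sketch of a template-based subject resolver with placeholder and empty-subject validation. It is not the crate's code: the `{prefix}` and `{version}` placeholder syntax, the struct fields, and the error strings are all assumptions made for illustration.

```rust
// Hypothetical sketch: placeholder names and error strings are assumptions.
pub struct SubjectResolver {
    prefix: String,
    contract_version: String,
}

impl SubjectResolver {
    pub fn new(prefix: &str, contract_version: &str) -> Self {
        Self {
            prefix: prefix.to_string(),
            contract_version: contract_version.to_string(),
        }
    }

    /// Expands a subject template and rejects empty or partially resolved subjects.
    pub fn resolve(&self, template: &str) -> Result<String, String> {
        let subject = template
            .replace("{prefix}", &self.prefix)
            .replace("{version}", &self.contract_version);
        if subject.trim().is_empty() {
            return Err("subject must not be empty".to_string());
        }
        // Any brace left over means the template used an unknown placeholder.
        if subject.contains('{') || subject.contains('}') {
            return Err(format!("unresolved placeholder in subject: {subject}"));
        }
        Ok(subject)
    }
}
```

Rejecting leftover braces at resolution time means a misspelled placeholder fails at startup rather than silently publishing to an unintended subject.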
…ded mode, and metrics

- T014: Wire backend mode selection in bin/src/main.rs (standalone SQLite vs clustered NATS)
- T015: Refactor leases plugin with LeaseBackend trait, StandaloneBackend, ClusteredBackend
- T016: Strict uniqueness conflict handling with bounded retries
- T017: Degraded-mode: block new allocations on NATS loss, allow known-lease renewals
- T018: Post-outage reconciliation via snapshot refresh
- T019: 7 cluster operational metrics in dora-core/src/metrics.rs
- T020: Integration tests deferred (need NATS test harness from WP08)
…nd enrichment

- T021: New plugin crate plugins/host-option-sync/ with v4/v6 registration
- T022: Host identity resolution (client identifier first, MAC fallback, v6 DUID support)
- T023: Host-option lookup via nats-coordination with correlation IDs and timeout
- T024: Response enrichment with protocol/subnet applicability checks
- T025: Miss/error/timeout fallback behavior with observability events
- T026: Plugin wired into bin/src/main.rs for v4 and v6 pipelines
- T027: Unit tests for hit/miss/error/timeout and option injection
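
The identity resolution order in T022 (client identifier first, MAC fallback) can be sketched as follows. This is an illustration only; `HostIdentity` and `resolve_v4_identity` are hypothetical names, and treating an empty client identifier as absent is an assumption.

```rust
// Hypothetical sketch of "client identifier first, MAC fallback" for DHCPv4.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HostIdentity {
    ClientId(Vec<u8>),
    Mac([u8; 6]),
}

/// Prefers the client identifier option when present and non-empty,
/// otherwise falls back to the hardware address from the packet.
pub fn resolve_v4_identity(client_id: Option<&[u8]>, chaddr: [u8; 6]) -> HostIdentity {
    match client_id {
        Some(id) if !id.is_empty() => HostIdentity::ClientId(id.to_vec()),
        _ => HostIdentity::Mac(chaddr),
    }
}

/// Renders the identity as a lowercase hex string, e.g. for use in a lookup key.
pub fn identity_hex(identity: &HostIdentity) -> String {
    let bytes: &[u8] = match identity {
        HostIdentity::ClientId(id) => id,
        HostIdentity::Mac(mac) => mac,
    };
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}
```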
…plugin lazy_static

WP05 (T028-T034):
- Stateful DHCPv6 lease flow (allocate, renew, release, decline)
- DUID+IAID uniqueness key extraction and validation
- Multi-lease support per DUID when IAID differs
- DHCPv6 degraded-mode behavior matching v4 outage policy
- DHCPv6 cluster metrics and tests
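
For illustration, the DUID+IAID uniqueness key and the multi-lease-per-DUID behavior described above could be modeled like this. The type names are hypothetical and the only validation shown (rejecting an empty DUID) is an assumption, not the PR's actual rule set.

```rust
use std::collections::HashMap;
use std::net::Ipv6Addr;

// Hypothetical sketch: one lease per (DUID, IAID) pair.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct V6LeaseKey {
    duid: Vec<u8>,
    iaid: u32,
}

impl V6LeaseKey {
    /// Builds a key, rejecting an empty DUID (assumed validation rule).
    pub fn new(duid: &[u8], iaid: u32) -> Result<Self, String> {
        if duid.is_empty() {
            return Err("DUID must not be empty".to_string());
        }
        Ok(Self { duid: duid.to_vec(), iaid })
    }
}

#[derive(Default)]
pub struct V6LeaseTable {
    leases: HashMap<V6LeaseKey, Ipv6Addr>,
}

impl V6LeaseTable {
    /// Inserts or replaces the lease for this key, returning the previous address if any.
    pub fn insert(&mut self, key: V6LeaseKey, addr: Ipv6Addr) -> Option<Ipv6Addr> {
        self.leases.insert(key, addr)
    }

    pub fn len(&self) -> usize {
        self.leases.len()
    }
}
```

Because the IAID is part of the key, the same client DUID can hold several leases at once, while a repeated (DUID, IAID) pair replaces the existing entry.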

CHG-001 (metrics locality):
- Remove centralized cluster/host-option metrics from dora-core/src/metrics.rs
- Add plugins/leases/src/metrics.rs with lazy_static for all cluster v4/v6 metrics
- Add lazy_static metrics inline in plugins/host-option-sync/src/lib.rs
- Update bin/src/main.rs to reference leases::metrics::CLUSTER_COORDINATION_STATE
- Policy: each plugin owns its metrics with lazy initialization
…ntining

- Quarantine conflicted IPs via probation instead of retrying same address
- On conflict in reserve_first, loop to allocate a different IP
- Release locally reserved IP on coordination errors to prevent leaks
- Increase MAX_CONFLICT_RETRIES from 3 to 8
- Track conflict state to only increment resolved metric when appropriate
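
A minimal sketch of the quarantine-and-move-on loop described in this commit, under stated assumptions: the function signature, the `try_reserve` callback standing in for the NATS uniqueness check, and the in-memory quarantine set are all hypothetical. Only the `reserve_first` name and `MAX_CONFLICT_RETRIES = 8` come from the commit message.

```rust
use std::collections::HashSet;
use std::net::Ipv4Addr;

pub const MAX_CONFLICT_RETRIES: usize = 8;

/// Walks the pool and returns the first address the cluster accepts.
/// `try_reserve` is a stand-in for the coordination call and returns
/// `true` when the address is unique cluster-wide.
pub fn reserve_first<F>(
    pool: &[Ipv4Addr],
    quarantined: &mut HashSet<Ipv4Addr>,
    mut try_reserve: F,
) -> Option<Ipv4Addr>
where
    F: FnMut(Ipv4Addr) -> bool,
{
    let mut conflicts = 0;
    for &ip in pool {
        if quarantined.contains(&ip) {
            continue; // already on probation, do not retry the same address
        }
        if try_reserve(ip) {
            return Some(ip);
        }
        // Conflict: quarantine this address and move on to a different one.
        quarantined.insert(ip);
        conflicts += 1;
        if conflicts >= MAX_CONFLICT_RETRIES {
            break;
        }
    }
    None
}
```

Bounding the loop keeps a single request from scanning the whole pool when many addresses are contested.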

leshow (Collaborator) commented Feb 27, 2026

Hey! This is a lot of code that appears to make substantial changes to how dora works. It would have been better to propose the changes piecemeal in individual issues so some back and forth could take place. I'm assuming it was written with an LLM coding assistant? In any case, thanks for the contribution.

poelzi (Author) commented Feb 27, 2026

Absolutely, I dislike such big changes too.

It is not a fundamental change; it is an alternative backend as a plugin. Very few changes were made to the standalone code path.

Most of the code is actually a nix-based test framework I built for testing all versions with different DHCP clients and a load tester.
The new CI passes ensure the clients work and let us track regressions.
In the NATS case, 2 servers and 1 client run inside the test VM.

Part of it is also the DHCPv6 types that were missing.

Yes, I developed it mostly with spec-kitty, opus and codex, with multiple rounds of feedback and reviews. Codex did most of the planning, opus the implementation, both did review, and then I had my complaints 😆
These changes are a squashed rebase after multiple fixup commits.

We have the potential to remove ~2k LoC by using the dhcproto crate for the DHCP definitions. I would prefer that.

I could remove the memory backend and use sqlite in in-memory mode, or just keep it as it was. It doesn't matter, because the cluster lease is checked against the global db anyway.

The only thing that changed in the default path is that the server only reports healthy once the subsystems have started. Previously it reported healthy directly after start, even though the subsystems could have problems.

poelzi (Author) commented Feb 27, 2026

Ah, I see that the last refactoring moved things around incorrectly. I will shrink this down.

@bluecatengineering bluecatengineering deleted a comment from stappersg Feb 27, 2026
leshow (Collaborator) commented Feb 27, 2026

> It is not a fundamental change; it is an alternative backend as a plugin. Very few changes were made to the standalone code path.

I just meant that it is a substantial change to dora in the general sense, i.e. it adds a huge feature.

At present, dora trades performance for simplicity. I talk about that a little in the README. That's why there's no existing in-memory mode, but we do have an open issue for it. Stateful v6 should be separated as well.

A distributed mode is something I'm interested in, but I haven't evaluated the alternatives in a serious way. Can you provide some more info on why you chose NATS?

I'll have a look at the testing framework, a better test suite is something we could use. I've not really been satisfied with the network namespaces approach we use in the component tests.

poelzi force-pushed the nats-backend-and-tests branch from 03b8b6a to 5b49a35 on February 28, 2026 01:11

poelzi (Author) commented Mar 1, 2026

Sorry for the bad refactoring earlier; that was definitely too much for poor Codex's head. Opus 4.6 does a much better job at tasks like that.
There are no longer any changes to the old lease code. I added a generic interface for any form of distributed backend; NATS is one implementation of it.
I'm building next-generation cluster software, drawing on 25 years of experience seeing garbage clusters. I'm using NATS as the data plane because it scales and has perfect ACLs, job queues, routing, load balancing, a KV store and many other goodies you need. On top of that I'm using tempora as the workflow engine and a few other cool technologies. Everything is built via nix.

I like Rust software the most, which is why I'm choosing Rust tools. This server was the closest to what I need, so I implemented the rest. This project is 10 man-years of work and I'm building it alone in a few months. I'm more an architect than a coder, but I can smell bad code and can instruct an army of agents to build what I want.

My code got reviewed by opus 4.6 max, codex 4.6 xhigh, kimi 2.5 and GLM-5, and I looked over it myself. Every complaint got fixed. Every reviewer found something to complain about; even weird race conditions were found. My code is matrix tested, and 2 different load tests have proven better performance than the current code.

TBH, I don't care if this gets merged. Opus is so good at solving merge conflicts and rebases, and I'm using nix, so I don't mind forks. At some point I just get bored managing multiple integration branches. I will certainly fix the other findings in the code in my branch. I just find it better when projects stay together and things go upstream. The NATS backend is feature-gated.

stappersg (Contributor) commented Mar 1, 2026 via email

leshow (Collaborator) commented Mar 1, 2026

> My code is matrix tested, and 2 different load tests have proven better performance than the current code.

To be clear, this is in the README as an explicit non-goal because simplicity was preferred over "good enough" performance. #63 is open to provide an in-memory lease option.

With respect, I appreciate contributions but you've not provided an explanation of how things work. Say nodes hand out conflicting IPs, or there's a network partition preventing communication, or increased latency.

poelzi (Author) commented Mar 2, 2026

@jsilke the behavior is documented in docs/cluster.md
The server picks an IP with a ping check and then checks in NATS whether it is unique; in case of a conflict, a new one is picked.
If NATS is down, no new IPs are handed out; renewals are acked in degraded mode.
I think this is the most acceptable and safe behavior.
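
The degraded-mode policy described above can be summarized in a small decision function. This is only a sketch to make the rule explicit, not the code from the PR: `Request`, `Decision` and `decide` are hypothetical names, and restricting degraded-mode renewals to leases the node already knows about follows the "allow known-lease renewals" wording of the commit message.

```rust
use std::collections::HashSet;
use std::net::Ipv4Addr;

// Hypothetical sketch of the degraded-mode rule.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    /// NATS is reachable: run the normal coordinated flow.
    Proceed,
    /// NATS is down, but the lease is already known locally: ack the renewal.
    AckRenewal,
    /// NATS is down and uniqueness cannot be verified: do not answer.
    Block,
}

pub enum Request {
    NewAllocation,
    Renewal(Ipv4Addr),
}

pub fn decide(nats_connected: bool, known_leases: &HashSet<Ipv4Addr>, req: &Request) -> Decision {
    match (nats_connected, req) {
        (true, _) => Decision::Proceed,
        (false, Request::Renewal(ip)) if known_leases.contains(ip) => Decision::AckRenewal,
        (false, _) => Decision::Block,
    }
}
```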

How the different NATS subjects are used together with their format is also documented there.
I see the NATS backend as an optional, enterprise version that most people will not need.
