[DPE-9752] Add topology observer #50

Open

reneradoi wants to merge 25 commits into 9/edge from add-topology-observer

Conversation

@reneradoi reneradoi commented Apr 22, 2026

The PR adds a topology observer to the charm, inspired by the existing solution in Postgres. The observer checks whether the Valkey primary has changed and, if so, dispatches a Juju event to the charm. The event is used to update pod labels (on K8s) and client relation data (updated endpoints and read-only-endpoints).

Design:

  • the topology observer runs as a background process and continuously checks whether the primary has changed (see the sketch after this list)
  • the check queries the sentinel master with valkey-py, because valkey-glide does not support Sentinel connections and neither client supports pub/sub connections through Sentinel
  • after startup, the leader unit starts the observer
  • on config changes, secret changes, peer-relation changes (for TLS updates), or departure events, the observer is restarted
  • the observer writes a log file to /var/log/topology_observer.log and stores the CA certificate for client TLS in the host's default truststore in /etc/ssl/certs/
  • on unit removal, the observer is stopped
  • when a primary change is detected, the observer process dispatches a custom topology_changed event
  • this event is used to update pod labels and client relation data in ExternalClientEvents
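
For illustration, a minimal sketch of the observer loop (the PR's actual code lives in src/scripts/topology_observer.py). The primary name, polling interval, and the juju-exec dispatch call are assumptions borrowed from the Postgres observer pattern, not necessarily what this PR implements:

# Illustrative sketch only, not the PR's actual implementation.
# Primary name, polling interval, and the dispatch call are assumptions.
import subprocess
import time

from valkey import Sentinel

PRIMARY_NAME = "valkey-primary"  # hypothetical monitored master name
POLL_SECONDS = 5  # hypothetical polling interval


def observe(sentinel_hosts: list[tuple[str, int]], unit: str, charm_dir: str) -> None:
    sentinel = Sentinel(sentinel_hosts, socket_timeout=1)
    last_primary: tuple[str, int] | None = None
    while True:
        try:
            host, port = sentinel.discover_master(PRIMARY_NAME)
        except Exception:
            time.sleep(POLL_SECONDS)  # Sentinel not reachable yet, retry
            continue
        if last_primary is not None and (host, port) != last_primary:
            # Primary changed: dispatch the custom topology_changed event
            # to the charm, following the Postgres observer's pattern.
            subprocess.run(
                ["juju-exec", "-u", unit,
                 f"JUJU_DISPATCH_PATH=hooks/topology_changed {charm_dir}/dispatch"],
                check=False,
            )
        last_primary = (host, port)
        time.sleep(POLL_SECONDS)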

Side quest:
Bug fix for updating the passwords of Sentinel admin users. These updates were not written to the Sentinel ACL files, and Sentinel was not restarted to pick up the changes. This bug was found while testing the PR manually.
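
Roughly what the fix amounts to (ACL file path, rule format, and service name below are assumptions for illustration, not the charm's actual code):

# Illustrative sketch of the fix; path, ACL rule, and service name are assumptions.
import subprocess
from pathlib import Path


def update_sentinel_admin_password(username: str, password_hash: str) -> None:
    # Persist the new password (as a SHA-256 hash) to the Sentinel ACL file ...
    acl_file = Path("/etc/valkey/sentinel.acl")  # hypothetical path
    acl_file.write_text(f"user {username} on #{password_hash} allchannels +@all\n")
    # ... and restart Sentinel so it picks up the change.
    subprocess.run(["snap", "restart", "valkey.sentinel"], check=False)  # hypothetical service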

@reneradoi reneradoi changed the base branch from 9/edge to add-k8s-services April 22, 2026 09:43
@reneradoi reneradoi marked this pull request as ready for review April 22, 2026 11:24

@Mehdi-Bendriss Mehdi-Bendriss left a comment

Thanks René! Good work.
I left a few comments and questions, mainly around why we start/restart the observer in certain hooks and only under certain conditions on the current leader unit, plus the use of valkey-py vs. Glide.

Comment thread src/events/base_events.py
return

try:
self.charm.topology_manager.start_observer()

Contributor

why isn't this run at the beginning of _on_unit_fully_started? in general, why are we conditioning the start of this process on the health of the server?

Contributor Author

@reneradoi reneradoi Apr 23, 2026

Not necessarily on the health of the server, rather on the starting procedure having completed on the leader unit.

Two things to consider here:

  • I think there is no need to start an observer if the unit has not fully started yet, given that it is only started on the leader, which usually starts up first -> the observer would only run into connection errors anyway.
  • There is also no need to have the observer in place that early, as there are a lot of peer-relation-changed events during the startup process. The services and pod labels will only be created once the leader unit is up (here). And in the case of client relations: the relation data will not be published until the leader unit is up either.

Comment thread src/events/base_events.py
for lock in [StartLock(self.charm.state), RestartLock(self.charm.state)]:
lock.process()

if not self.charm.state.unit_server.is_active:

Contributor

similar question here, the health of the unit's service should not have an impact on the topology observer process start, or am I missing something?

Contributor Author

No, the health doesn't impact the observer, only that the starting procedure has completed (and that the unit is not in the process of being removed).

Contributor

For context: The is_active flag means the unit started (started flag is set) and is not being scaled down.
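
A rough sketch of those semantics (attribute and property names here are illustrative, not the charm's actual state API):

# Illustrative only: attribute and property names are assumptions.
@property
def is_active(self) -> bool:
    # The unit completed its start procedure and is not being scaled down.
    return self.started and not self.is_departing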

Comment thread src/events/base_events.py
return

try:
self.charm.topology_manager.restart_observer()

Contributor

why do we restart in this hook?

Contributor Author

When a unit leaves, the connection parameters (endpoints) of the observer should be updated so that it no longer connects to that unit.

)

try:
primary_name = sentinel.discover_master(PRIMARY_NAME)[0]

Contributor

I think we can stick to glide using the Sentinel ports, with something along the lines of:

from glide import GlideClient, GlideClientConfiguration, NodeAddress

# inside an async function:
sentinel_config = GlideClientConfiguration(
    addresses=[NodeAddress(host, port) for host, port in addresses],
    request_timeout=100,  # 0.1s in milliseconds
    # ... further options (TLS, credentials) elided
)

sentinel_client = await GlideClient.create(sentinel_config)

result = await sentinel_client.custom_command(["SENTINEL", "GET-MASTER-ADDR-BY-NAME", PRIMARY_NAME])
primary_name = result[0].decode()

Contributor Author

@reneradoi reneradoi Apr 23, 2026

Unfortunately not: Valkey Glide does not support Sentinel connections (see this issue), which I also confirmed by testing. It only supports connections to Valkey instances directly, be it in "standalone" mode (primary + replicas) or "cluster" mode.

It would only be possible to connect to Valkey itself and query the replication info or similar. But I don't think we should do this.

Once Valkey Glide has support for Sentinel, migration should not be too difficult though. There is actually work going on for this, but it is not yet available to us.

Contributor

Can't we use the cli for this?

Contributor Author

In general I prefer a proper Python client for this. Only Glide not yet supporting Sentinel led me to use valkey-py instead. It should be easier to migrate from valkey-py to Glide at some point than from the cli tool.

Contributor

I would be in favor of the cli as well; I don't like the idea of having another client as a dependency for a single operation.
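
For reference, a sketch of what the cli variant could look like from Python (host, port, and the monitored primary name are illustrative assumptions):

# Illustrative sketch of querying the primary via valkey-cli instead of a
# second Python client; host, port, and primary name are assumptions.
import subprocess


def discover_primary_cli(host: str, port: int = 26379,
                         primary_name: str = "valkey-primary") -> tuple[str, int]:
    result = subprocess.run(
        ["valkey-cli", "-h", host, "-p", str(port),
         "SENTINEL", "GET-MASTER-ADDR-BY-NAME", primary_name],
        capture_output=True, text=True, check=True,
    )
    # In non-interactive (raw) mode, valkey-cli prints the host and the
    # port on separate lines.
    primary_host, primary_port = result.stdout.split()
    return primary_host, int(primary_port)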

@skourta skourta left a comment


No major comments. I would prefer using the CLI over valkey-py since that's what we started the charm with.
I would also have preferred that we dispatch a general charm maintenance event that handles multiple cases, including this one.

@reneradoi reneradoi requested a review from skourta April 23, 2026 12:27
Base automatically changed from add-k8s-services to 9/edge April 23, 2026 16:11
# Conflicts:
#	poetry.lock
#	pyproject.toml
#	src/events/external_clients.py
#	src/managers/sentinel.py
#	tests/unit/conftest.py