[DPE-9752] Add topology observer #50

Open

reneradoi wants to merge 25 commits into 9/edge from add-topology-observer

Conversation

@reneradoi reneradoi commented Apr 22, 2026

The PR adds a topology observer to the charm, inspired by the existing solution in Postgres. The observer checks whether the Valkey primary has changed and, if so, dispatches a Juju event to the charm. The event is used to update pod labels (on K8s) and client relation data (updated endpoints and read-only-endpoints).

Design:

  • the topology observer runs as a background process and continuously checks whether the primary has changed (see the sketch after this list)
  • the check queries the sentinel master with valkey-py, because valkey-glide does not support Sentinel connections and neither client supports pub/sub connections through Sentinel
  • after startup, the leader unit starts the observer
  • on config changes, secret changes, peer-relation changes (for TLS updates), or departure events, the observer is restarted
  • the observer writes a log file to /var/log/topology_observer.log and stores the CA certificate for client TLS in the host's default truststore in /etc/ssl/certs/
  • on unit removal, the observer is stopped
  • when a primary change is detected, the observer process dispatches a custom topology_changed event
  • this event is used to update pod labels and client relation data in ExternalClientEvents
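
For illustration, a minimal sketch of the observer loop (the PR's actual code lives in src/scripts/topology_observer.py). The primary name, polling interval, and the juju-exec dispatch call are assumptions borrowed from the Postgres observer pattern, not necessarily what this PR implements:

# Illustrative sketch only, not the PR's actual implementation.
# Primary name, polling interval, and the dispatch call are assumptions.
import subprocess
import time

from valkey import Sentinel

PRIMARY_NAME = "valkey-primary"  # hypothetical monitored master name
POLL_SECONDS = 5  # hypothetical polling interval


def observe(sentinel_hosts: list[tuple[str, int]], unit: str, charm_dir: str) -> None:
    sentinel = Sentinel(sentinel_hosts, socket_timeout=1)
    last_primary: tuple[str, int] | None = None
    while True:
        try:
            host, port = sentinel.discover_master(PRIMARY_NAME)
        except Exception:
            time.sleep(POLL_SECONDS)  # Sentinel not reachable yet, retry
            continue
        if last_primary is not None and (host, port) != last_primary:
            # Primary changed: dispatch the custom topology_changed event
            # to the charm, following the Postgres observer's pattern.
            subprocess.run(
                ["juju-exec", "-u", unit,
                 f"JUJU_DISPATCH_PATH=hooks/topology_changed {charm_dir}/dispatch"],
                check=False,
            )
        last_primary = (host, port)
        time.sleep(POLL_SECONDS)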

Side quest:
Bug fix for updating the passwords of Sentinel admin users. These updates were not written to the Sentinel ACL files, and Sentinel was not restarted to pick up the changes. This bug was found while testing the PR manually.
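
Roughly what the fix amounts to (ACL file path, rule format, and service name below are assumptions for illustration, not the charm's actual code):

# Illustrative sketch of the fix; path, ACL rule, and service name are assumptions.
import subprocess
from pathlib import Path


def update_sentinel_admin_password(username: str, password_hash: str) -> None:
    # Persist the new password (as a SHA-256 hash) to the Sentinel ACL file ...
    acl_file = Path("/etc/valkey/sentinel.acl")  # hypothetical path
    acl_file.write_text(f"user {username} on #{password_hash} allchannels +@all\n")
    # ... and restart Sentinel so it picks up the change.
    subprocess.run(["snap", "restart", "valkey.sentinel"], check=False)  # hypothetical service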

@reneradoi reneradoi changed the base branch from 9/edge to add-k8s-services April 22, 2026 09:43
@reneradoi reneradoi marked this pull request as ready for review April 22, 2026 11:24

@Mehdi-Bendriss Mehdi-Bendriss left a comment

Thanks René! Good work.
I left a few comments and questions, mainly around why we start/restart the observer in certain hooks and only under certain conditions on the current leader unit, plus the use of valkey-py vs. Glide.

Comment thread src/events/base_events.py
return

try:
self.charm.topology_manager.start_observer()

Contributor

why isn't this run at the beginning of _on_unit_fully_started? in general, why are we conditioning the start of this process on the health of the server?

Contributor Author

@reneradoi reneradoi Apr 23, 2026

Not necessarily on the health of the server, rather on the starting procedure having completed on the leader unit.

Two things to consider here:

  • I think there is no need to start an observer if the unit has not fully started yet, given that it is only started on the leader, which usually starts up first -> the observer would only run into connection errors anyway.
  • There is also no need to have the observer in place that early, as there are a lot of peer-relation-changed events during the startup process. The services and pod labels will only be created once the leader unit is up (here). And in the case of client relations: the relation data will not be published until the leader unit is up either.

Comment thread src/events/base_events.py
for lock in [StartLock(self.charm.state), RestartLock(self.charm.state)]:
lock.process()

if not self.charm.state.unit_server.is_active:

Contributor

similar question here, the health of the unit's service should not have an impact on the topology observer process start, or am I missing something?

Contributor Author

No, the health doesn't impact the observer, only that the starting procedure has completed (and that the unit is not in the process of being removed).

Contributor

For context: The is_active flag means the unit started (started flag is set) and is not being scaled down.
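
A rough sketch of those semantics (attribute and property names here are illustrative, not the charm's actual state API):

# Illustrative only: attribute and property names are assumptions.
@property
def is_active(self) -> bool:
    # The unit completed its start procedure and is not being scaled down.
    return self.started and not self.is_departing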

Comment thread src/events/base_events.py
return

try:
self.charm.topology_manager.restart_observer()

Contributor

why do we restart in this hook?

Contributor Author

When a unit leaves, the connection parameters (endpoints) of the observer should be updated so that it no longer connects to that unit.

)

try:
primary_name = sentinel.discover_master(PRIMARY_NAME)[0]

Contributor

I think we can stick to glide using the Sentinel ports, with something along the lines of:

from glide import GlideClient, GlideClientConfiguration, NodeAddress

# inside an async function:
sentinel_config = GlideClientConfiguration(
    addresses=[NodeAddress(host, port) for host, port in addresses],
    request_timeout=100,  # 0.1s in milliseconds
    # ... further options (TLS, credentials) elided
)

sentinel_client = await GlideClient.create(sentinel_config)

result = await sentinel_client.custom_command(["SENTINEL", "GET-MASTER-ADDR-BY-NAME", PRIMARY_NAME])
primary_name = result[0].decode()

Contributor Author

@reneradoi reneradoi Apr 23, 2026

Unfortunately not: Valkey Glide does not support Sentinel connections (see this issue), which I also confirmed by testing. It only supports connections to Valkey instances directly, be it in "standalone" mode (primary + replicas) or "cluster" mode.

It would only be possible to connect to Valkey itself and query the replication info or similar. But I don't think we should do this.

Once Valkey Glide has support for Sentinel, migration should not be too difficult though. There is actually work going on for this, but it is not yet available to us.

Contributor

Can't we use the cli for this?

Contributor Author

In general I prefer a proper Python client for this. Only Glide not yet supporting Sentinel led me to use valkey-py instead. It should be easier to migrate from valkey-py to Glide at some point than from the cli tool.

Contributor

I would be in favor of the cli as well; I don't like the idea of having another client as a dependency for a single operation.
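
For reference, a sketch of what the cli variant could look like from Python (host, port, and the monitored primary name are illustrative assumptions):

# Illustrative sketch of querying the primary via valkey-cli instead of a
# second Python client; host, port, and primary name are assumptions.
import subprocess


def discover_primary_cli(host: str, port: int = 26379,
                         primary_name: str = "valkey-primary") -> tuple[str, int]:
    result = subprocess.run(
        ["valkey-cli", "-h", host, "-p", str(port),
         "SENTINEL", "GET-MASTER-ADDR-BY-NAME", primary_name],
        capture_output=True, text=True, check=True,
    )
    # In non-interactive (raw) mode, valkey-cli prints the host and the
    # port on separate lines.
    primary_host, primary_port = result.stdout.split()
    return primary_host, int(primary_port)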

@skourta skourta left a comment


No major comments. I would prefer using the CLI over valkey-py since that's what we started the charm with.
I would also have preferred that we dispatch a general charm maintenance event that handles multiple cases, including this one.

@reneradoi reneradoi requested a review from skourta April 23, 2026 12:27
Base automatically changed from add-k8s-services to 9/edge April 23, 2026 16:11
# Conflicts:
#	poetry.lock
#	pyproject.toml
#	src/events/external_clients.py
#	src/managers/sentinel.py
#	tests/unit/conftest.py