Conversation
…ogy-observer
# Conflicts:
#   src/workload_vm.py
Mehdi-Bendriss left a comment:
Thanks René! Good work.
I left a few comments and questions, mainly around why we start/restart in certain hooks and under certain conditions on the current leader unit, plus the use of valkey-py / Glide.
```python
return

try:
    self.charm.topology_manager.start_observer()
```
Why isn't this run at the beginning of _on_unit_fully_started? In general, why are we conditioning the start of this process on the health of the server?
Not necessarily on the health of the server, rather on the starting procedure having completed on the leader unit.
Two things to consider here:
- I think there is no need to start an observer if the unit has not fully started yet: the observer is only started on the leader, which usually starts up first, so it would only run into connection errors anyway.
- There is also no need to have the observer in place that early, as there are a lot of peer-relation-changed events during the startup process. The services and pod labels will only be created once the leader unit is up (here) anyway. And in the case of client relations: the relation data will not be published until the leader unit is up either.
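To make the gating concrete, here is a rough sketch of the guard being discussed. The handler name and the error handling are assumptions; only the is_active check and the start_observer call appear in the diff excerpts:

```python
import logging

logger = logging.getLogger(__name__)


class TopologyEvents:
    def _on_peer_relation_changed(self, event) -> None:
        # Hypothetical handler: do nothing until the unit has completed its
        # startup procedure; before that, the observer would only run into
        # connection errors.
        if not self.charm.state.unit_server.is_active:
            return
        try:
            self.charm.topology_manager.start_observer()
        except Exception:
            logger.exception("Failed to start the topology observer")
```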
```python
for lock in [StartLock(self.charm.state), RestartLock(self.charm.state)]:
    lock.process()

if not self.charm.state.unit_server.is_active:
```
Similar question here: the health of the unit's service should not have an impact on starting the topology observer process, or am I missing something?
No, the health doesn't impact the observer, only the completion of the starting procedure (and the unit not being in the process of being removed).
For context: The is_active flag means the unit started (started flag is set) and is not being scaled down.
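For illustration, a minimal sketch of how such a flag could be composed; the attribute names below are assumptions, not the charm's actual code:

```python
class UnitServer:
    """Sketch only; "started" and "departing" are hypothetical attributes."""

    @property
    def is_active(self) -> bool:
        # started: the unit completed its startup procedure (peer-databag flag).
        # departing: the unit is being removed during a scale-down.
        return self.relation_data.get("started") == "True" and not self.departing
```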
```python
return

try:
    self.charm.topology_manager.restart_observer()
```
Why do we restart in this hook?
When a unit leaves, the connection parameters (endpoints) of the observer should be updated so that it no longer connects to that unit.
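A rough sketch of what that restart could look like, assuming a hypothetical manager API where starting the observer re-reads the endpoints from the current peer state:

```python
class TopologyManager:
    def restart_observer(self) -> None:
        # Sketch: stop the running observer process and start a new one so it
        # picks up the current unit endpoints, excluding departing units.
        self.stop_observer()
        self.start_observer()
```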
```python
)

try:
    primary_name = sentinel.discover_master(PRIMARY_NAME)[0]
```
I think we can stick to glide using the Sentinel ports, with something along the lines of:

```python
sentinel_config = GlideClientConfiguration(
    addresses=[NodeAddress(host, port) for host, port in addresses],
    request_timeout=100,  # 0.1s in milliseconds
    ...
)
sentinel_client = await GlideClient.create(sentinel_config)
result = await sentinel_client.custom_command(
    ["SENTINEL", "GET-MASTER-ADDR-BY-NAME", PRIMARY_NAME]
)
primary_name = result[0].decode()
```
Unfortunately not: Valkey glide does not support Sentinel connections (see this issue), which I also confirmed by testing. It only supports connections to Valkey instances directly, be it in "standalone" mode (primary + replicas) or "cluster" mode.
It would only be possible to connect to Valkey itself and query the replication info or similar, but I don't think we should do that.
Once Valkey glide has support for Sentinel, the migration should not be too difficult though. There is actually work going on on this, but it is not yet available for us.
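For reference, the valkey-py route taken in the PR boils down to something like the following. The addresses, service name, and timeout are placeholders; discover_master returns a (host, port) tuple, of which the diff uses the host:

```python
from valkey.sentinel import Sentinel

# Placeholder Sentinel endpoints and monitored service name.
addresses = [("10.0.0.1", 26379), ("10.0.0.2", 26379)]
PRIMARY_NAME = "valkey-primary"

sentinel = Sentinel(addresses, socket_timeout=0.1)
host, port = sentinel.discover_master(PRIMARY_NAME)  # primary's address
```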
Can't we use the CLI for this?
In general I prefer a proper Python client for this; only glide not yet supporting Sentinel led me to use valkey-py instead. It should be easier to switch this to glide at some point than migrating from the CLI tool.
I would be in favor of the CLI as well; I don't like the idea of having another client as a dependency for a single operation.
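For comparison, the CLI route would shell out to valkey-cli with the standard Sentinel command. Host, port, and service name below are placeholders (26379 is Sentinel's default port):

```python
import subprocess

# SENTINEL GET-MASTER-ADDR-BY-NAME prints the primary's host and port on
# separate lines.
out = subprocess.run(
    ["valkey-cli", "-h", "10.0.0.1", "-p", "26379",
     "SENTINEL", "get-master-addr-by-name", "valkey-primary"],
    capture_output=True, text=True, check=True,
)
primary_host, primary_port = out.stdout.split()
```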
# Conflicts:
#   charmcraft.yaml
# Conflicts:
#   src/workload_vm.py
skourta left a comment:
No major comments. I would prefer using the CLI over valkey-py, since that's what we started the charm with.
I would have also preferred a general charm maintenance event being dispatched that handled multiple cases, including this one.
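As an illustration of that idea, a custom maintenance event in ops could be declared roughly like this; all names are hypothetical and not part of this PR:

```python
from ops.charm import CharmEvents
from ops.framework import EventBase, EventSource


class MaintenanceCheckEvent(EventBase):
    """Hypothetical catch-all event for periodic maintenance checks."""


class ValkeyCharmEvents(CharmEvents):
    # One dispatched event whose handler covers multiple maintenance cases,
    # topology changes being one of them.
    maintenance_check = EventSource(MaintenanceCheckEvent)
```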
# Conflicts:
#   src/managers/sentinel.py
# Conflicts:
#   poetry.lock
#   pyproject.toml
#   src/events/external_clients.py
#   src/managers/sentinel.py
#   tests/unit/conftest.py
The PR adds a topology observer to the charm, inspired by the existing solution in Postgres. The observer checks whether the primary of Valkey has changed and, if so, dispatches a Juju event to the charm. The event is used to update pod labels (on K8s) and client relation data (updated endpoints and read-only endpoints).
Design:
- The observer queries sentinel master with valkey-py, because valkey-glide does not support Sentinel connections and both clients do not support pubsub connections with Sentinel
- The observer logs to /var/log/topology_observer.log and stores the CA certificate for client TLS in the default truststore of the host in /etc/ssl/certs/
- On a change of the primary, a topology_changed event is dispatched to the charm and handled in ExternalClientEvents
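The observer loop itself is not shown in this thread; modeled on the Postgres cluster topology observer the description references, it could look roughly like this (the dispatch command, paths, and poll interval are assumptions):

```python
import subprocess
import time

from valkey.sentinel import Sentinel

# Assumed parameters; the real script would receive these from the charm.
addresses = [("10.0.0.1", 26379)]
PRIMARY_NAME = "valkey-primary"
UNIT_NAME = "valkey/0"
CHARM_DIR = "/var/lib/juju/agents/unit-valkey-0/charm"

sentinel = Sentinel(addresses, socket_timeout=0.1)
previous = None
while True:
    host, _port = sentinel.discover_master(PRIMARY_NAME)
    if previous is not None and host != previous:
        # Dispatch a custom topology_changed event to the charm, in the
        # style of the Postgres observer (command shape is an assumption).
        subprocess.run(
            ["juju-exec", "-u", UNIT_NAME,
             f"JUJU_DISPATCH_PATH=hooks/topology_changed {CHARM_DIR}/dispatch"],
            check=True,
        )
    previous = host
    time.sleep(30)
```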
Side quest:
Bug fix for updating passwords of Sentinel admin users: these updates were not written to the Sentinel ACL files, and Sentinel was not restarted to pick up the changes. The bug was found while testing the PR manually.