feat: add etcd lock and single API orchestration by patriciareinoso · Pull Request #419 · canonical/charmlibs

patriciareinoso · 2026-04-13T17:11:06Z

Summary

This PR introduces the etcd sync and async locking mechanisms for rolling ops, together with a single API that falls back to the peer-relation solution in case of etcd failure or unavailability.

Description

etcd async lock

A background process is spawned on lock request.
It performs the lock acquisition and triggers a lock_granted Juju custom hook.
The charm observes this event, executes the corresponding operation, and indicates when it has finished using etcd keys.
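
The async flow described here can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `FakeEtcd` is an in-memory stand-in for etcd, a thread stands in for the background process, a plain callback stands in for the Juju custom hook, and the key names `lock` and `done/<unit>` are hypothetical. The step where the worker waits for the completion key before releasing the lock is also an assumption.

```python
import threading
import time


class FakeEtcd:
    """In-memory stand-in for etcd (illustration only, not a real client)."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def put_if_absent(self, key, value):
        # Mimics a transaction that creates the key only when it does not exist.
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def put(self, key, value):
        with self._mutex:
            self._data[key] = value

    def get(self, key):
        with self._mutex:
            return self._data.get(key)

    def delete(self, key):
        with self._mutex:
            self._data.pop(key, None)


def run_worker(etcd, unit, dispatch_hook):
    """Background worker: acquire, notify the charm, wait for completion, release."""
    while not etcd.put_if_absent("lock", unit):  # hypothetical key name
        time.sleep(0.01)
    dispatch_hook("lock_granted", unit)  # a Juju custom hook in the PR
    while etcd.get(f"done/{unit}") != "true":  # hypothetical completion key
        time.sleep(0.01)
    etcd.delete("lock")


etcd = FakeEtcd()
executed = []


def on_hook(name, unit):
    """Charm side: run the operation, then report completion through etcd."""
    executed.append((unit, etcd.get("lock")))  # record who held the lock
    etcd.put(f"done/{unit}", "true")


threads = [
    threading.Thread(target=run_worker, args=(etcd, unit, on_hook))
    for unit in ("app/0", "app/1")
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Each unit runs its operation only while the lock key names it as the holder, and the lock key is gone once both workers finish.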

etcd sync lock

Implemented a distributed lock using etcd (lease + transactional lock key).
We use :sync as the owner of the lock to differentiate it from the lock that the background process is trying to acquire. This means that the sync lock may take priority over operations on the async lock.
Lock acquisition uses retries (via tenacity) on etcdctl commands.
The lock is tied to a lease and automatically released if the lease expires.
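
To make the lease-plus-transaction idea concrete, here is a small in-memory model. It is a sketch under stated assumptions, not the PR's implementation: `LeaseLockStore` and `acquire` are hypothetical names, the store imitates etcd instead of calling etcdctl, and the bounded retry loop only stands in for the role tenacity plays in the PR (there the retries wrap etcdctl commands, here they simply re-attempt the acquisition).

```python
import time


class LeaseLockStore:
    """In-memory model of a key-value store with leases (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._leases = {}  # lease_id -> expiry time
        self._keys = {}  # key -> (value, lease_id)
        self._next_lease = 1

    def grant_lease(self, ttl):
        lease_id = self._next_lease
        self._next_lease += 1
        self._leases[lease_id] = self._clock() + ttl
        return lease_id

    def _expire(self):
        # Drop expired leases and every key attached to them.
        now = self._clock()
        dead = {lid for lid, expiry in self._leases.items() if expiry <= now}
        for lid in dead:
            del self._leases[lid]
        self._keys = {k: v for k, v in self._keys.items() if v[1] not in dead}

    def create_if_absent(self, key, value, lease_id):
        """Transaction: write the key only if it does not exist yet."""
        self._expire()
        if lease_id not in self._leases:
            raise KeyError("lease expired or unknown")
        if key in self._keys:
            return False
        self._keys[key] = (value, lease_id)
        return True

    def owner(self, key):
        self._expire()
        entry = self._keys.get(key)
        return entry[0] if entry else None


def acquire(store, key, owner, ttl, attempts=3, wait=0.0):
    """Try to take the lock a bounded number of times; return the lease id or None."""
    lease_id = store.grant_lease(ttl)
    for _ in range(attempts):
        if store.create_if_absent(key, owner, lease_id):
            return lease_id
        time.sleep(wait)
    return None  # the unused lease simply expires in this model


now = [0.0]  # controllable clock so the example is deterministic
store = LeaseLockStore(clock=lambda: now[0])
first = acquire(store, "lock", "app/0:sync", ttl=10)
second = acquire(store, "lock", "app/1:sync", ttl=10)  # lock is held
now[0] = 11.0  # the first lease has expired, so the lock is released
third = acquire(store, "lock", "app/1:sync", ttl=10)
```

The second attempt fails while the first lease is alive, and succeeds once the lease has expired without any explicit release.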

common API

The RollingOpsManager is the public API for advanced rolling ops
It decides on which backend the operation will run: etcd or the peer-relation
If etcd operations fail (e.g. etcdctl errors, lease issues), the system falls back to the peer-relation-based implementation.
This ensures operations can continue even when etcd is not usable.
If there is any failure in the etcd background process, a Juju custom hook is triggered; this way the RollingOpsManager knows it needs to fall back to peer relations.
Every operation request is:
- written to etcd
- also duplicated into the peer relation databag
After execution, the operation state is updated in both places.
This ensures:
- no operations are lost during fallback
- scheduling can continue seamlessly regardless of backend
The system only switches back to etcd when the operation queue is empty.
Before resuming etcd usage, any remaining etcd queue state is cleaned up. This guarantees a clean slate and avoids inconsistencies.
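
The dual-write and fallback rules above can be illustrated with a toy model. Everything here is hypothetical (`InMemoryBackend`, `Manager`, `BackendError`, and the method names are not the RollingOpsManager API); the sketch only shows the ordering of the decisions: write to both backends, fall back on etcd errors, and return to etcd only once the queue is empty and stale etcd state has been cleared.

```python
class BackendError(Exception):
    """Raised by a backend that cannot be used (hypothetical)."""


class InMemoryBackend:
    """Toy queue standing in for etcd or the peer relation databag."""

    def __init__(self, name):
        self.name = name
        self.queue = []
        self.healthy = True

    def _check(self):
        if not self.healthy:
            raise BackendError(f"{self.name} unavailable")

    def enqueue(self, op):
        self._check()
        self.queue.append(op)

    def complete(self, op):
        self._check()
        self.queue.remove(op)

    def clear(self):
        self.queue.clear()


class Manager:
    """Chooses the backend; falls back to the peer relation on etcd errors."""

    def __init__(self, etcd, peer):
        self.etcd = etcd
        self.peer = peer
        self.active = "etcd"

    def request(self, op):
        self.peer.enqueue(op)  # always duplicated, so nothing is lost
        if self.active == "etcd":
            try:
                self.etcd.enqueue(op)
            except BackendError:
                self.active = "peer"

    def finish(self, op):
        self.peer.complete(op)
        if self.active == "etcd":
            try:
                self.etcd.complete(op)
            except BackendError:
                self.active = "peer"
        self._maybe_resume_etcd()

    def _maybe_resume_etcd(self):
        # Switch back only when the queue is empty, after cleaning etcd state.
        if self.active == "peer" and not self.peer.queue and self.etcd.healthy:
            self.etcd.clear()
            self.active = "etcd"


etcd, peer = InMemoryBackend("etcd"), InMemoryBackend("peer")
manager = Manager(etcd, peer)
manager.request("restart-1")
etcd.healthy = False
manager.request("restart-2")  # etcd write fails, manager falls back
state_after_failure = manager.active
etcd.healthy = True
manager.finish("restart-1")  # queue not empty yet, stay on peer
state_mid = manager.active
manager.finish("restart-2")  # queue empty, etcd cleaned and resumed
```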

Peer sync lock

The SyncLockBackend defines the interface that a charm author needs to implement in order to use sync locking with the peer-relation solution. This is needed because the peer-relation solution cannot guarantee mutual exclusion when tearing down.
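
As an illustration of what such an interface could look like, here is a sketch with an abstract base class. Only the class name `SyncLockBackend` comes from the PR; the `acquire`/`release` methods, their signatures, the `InProcessLock` example, and the `sync_lock` helper are assumptions made for this example.

```python
import threading
from abc import ABC, abstractmethod
from contextlib import contextmanager
from typing import Optional


class SyncLockBackend(ABC):
    """Hypothetical shape of the interface a charm author would implement."""

    @abstractmethod
    def acquire(self, timeout: Optional[float]) -> bool:
        """Return True if the lock was obtained within the timeout."""

    @abstractmethod
    def release(self) -> None:
        """Release a previously acquired lock."""


class InProcessLock(SyncLockBackend):
    """Toy implementation; a real charm would use a store shared by all units."""

    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self, timeout: Optional[float] = None) -> bool:
        return self._lock.acquire(timeout=-1 if timeout is None else timeout)

    def release(self) -> None:
        self._lock.release()


@contextmanager
def sync_lock(backend: SyncLockBackend, timeout: Optional[float] = None):
    """Hold the lock for the duration of a block and always release it."""
    if not backend.acquire(timeout):
        raise TimeoutError("could not acquire sync lock")
    try:
        yield
    finally:
        backend.release()


backend = InProcessLock()
with sync_lock(backend, timeout=1.0):
    second_attempt = backend.acquire(timeout=0.05)  # already held
released = backend.acquire(timeout=0.05)  # free again after the block
backend.release()
```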

* patch: etcd rolling ops version
* first working version
* fix format
* fix linting
* add tenacity to integration test
* remove unnecessary logs
* add dataplatform as reviewes
* rename and add integration tests
* linting and rebase
* first part of comments
* more comments answered
* more comments answered
* fix linting job
* fix UT
* mark tests as only k8s
* fix integration tests
* use charmlibs apt
* remove sans dns
* add dependencies to .toml
* add uv lock
* add wait in itnegration tests
* increate timeout
* increase log count
* unlimited debug-log
* comments review
* fix paths
* migrate v1
* fix integration tests
* fix integration tests
* add lock and integration tests
* unify operations
* add tenacity
* draft
* fallback implementation
* add sync lock and state
* feat: advanced rolling ops using etcd (canonical#364)

## Context

All the code on this PR is new

This implementation is based on [DA241 Spec](https://docs.google.com/document/d/1ez4h6vOOyHy5mu6xDblcBt8PPAtMe7MUp75MtgG1sns/edit?tab=t.0)

- The content of `charmlibs/advanced-rollingops/src/charmlibs/advanced_rollingops/_dp_interfaces_v1.py` belongs to another library that is currently being migrated to charmlibs so you can ignore it for now.

## Summary

This PR is the first part of the implementation of advanced rolling ops at cluster level.

This PR includes:
- Management of the client certificate used to connect to etcd
  - The leader unit creates a self-signed certificate with a lifetime of 50 years
  - Share the certificate with the other units using a peer relation
- Implementation of the integration with etcd
  - Share the mtls certificate
  - Observe the `resource_created` event
  - Observe the `endpoints_changed` event
- Management of an env file needed to connect to etcd via `etcdctl`

This PR does not implement the locking mechanism. In here we only test that we can connect to etcd from each unit.

## Current workflow:

1. The unit makes a request
2. A new background process is spawned
3. The background process dispatches a Juju hook
4. The unit observes that hook
5. The unit writes and reads a value in etcd
6. If the unit was able to connect to etcd, it executes the "restart" function.

This is a very simplified workflow to be able to test that the units from different apps can reach etcd.

## To do

- Implement the actual locking mechanism
- Figure out how to properly install etcdctl

* feat: migrate rollingops v1 from charm-rolling-ops repo (canonical#415)
* define syn lock backend
* fix merge
* clean up
* fix peer integration tests
* fix integration tests
* fix integration tests
* docstrings
* add update status handled and improve integration tests
* general cleanup

Gu1nness

Long review again, but I really love the way this is coming up!
It's mostly nitpicking and improving a bit the logging/stability.
Notes on the data modelling but I don't think we have time to rewrite the whole data modelling so I guess we'll go with that for now? Maybe except the from_string, to_string that really looks like hacking around json/poorly using json.

Nearly there!

Gu1nness · 2026-04-15T08:06:14Z

+"""Exceptions used in rollingops."""
+
+
+class RollingOpsError(Exception):


Note for the future if we want to improve all of our error frameworks: Each error should have a unique error code and a message so that we can log it and it would improve tracing and monitoring.

No need to do it now, just food for thought for the future.

Idea of implementation:

from pydantic import BaseModel, ConfigDict
from dataclasses import dataclass, field


class CommonExceptionModel(BaseModel):
    # Model config
    model_config = ConfigDict(from_attributes=True)

    # Attributes
    code: int
    message: str
    name: str


@dataclass
class CommonException(Exception):
    message: str
    code: int = 0
    name: str = field(init=False)

    def __post_init__(self):
        try:
            self.name = self.__class__.__name__
        except AttributeError:
            # For inheritance scenarios.
            pass
        super().__init__(self.message)

    def __str__(self):
        return str(self.message)

    def serialize(self):
        return CommonExceptionModel.model_validate(self).model_dump_json()

Nice, it could be implemented later

Gu1nness · 2026-04-16T12:08:18Z

@patriciareinoso About the data modelling comment, and after discussion with Mehdi, I am pasting a more detailed input here:

I think we're hacking around json and building a lot of complexity by re-inventing data serialization/deserialization instead of using proper tools to do that.

An example that we have is:

@dataclass
class Operation:
    …
    @classmethod
    def from_string(cls, data: str) -> Operation:
        obj = json.loads(data)
        return cls.from_dict(obj)

    def to_string(self) -> str:
        return json.dumps(self._to_dict(), separators=(',', ':'))


class OperationQueue:  # Beware, not even a dataclass
    def __init__(self, operations: list[Operation] | None = None):
        self.operations: list[Operation] = list(operations or [])

    def to_string(self):
        items = [op.to_string() for op in self.operations]
        return json.dumps(items, separators=(',', ':'))

    @classmethod
    def from_string(cls, data: str):
        try:
            items = json.loads(data)
        except:
            …
        if not isinstance(items, list) or not all(isinstance(s, str) for s in items):
            raise …
        operations = [Operation.from_string(s) for s in items]
        return cls(operations)

And that builds something weird.

Instead of having:

[
    {"callback_id": "…", "requested_at": "…"}, …
]

We have:

[
    "{\"callback_id\": \"…\", \"requested_at\": \"…\"}", …
]

This implies:

- Two JSON deserializations instead of one
- Non-standard format
- Custom parsing
- Risks in the future if all those nested quotes become re-escaped by something down the line
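
The difference between the two shapes is easy to reproduce with the standard `json` module. The field values below are made up for illustration; only the field names `callback_id` and `requested_at` appear in the comment above.

```python
import json

# Hypothetical operation payload; the values are invented for the example.
ops = [{"callback_id": "restart", "requested_at": "2026-04-13T17:11:06Z"}]

# Shape criticised above: each operation is dumped to a string first,
# then the list of strings is dumped again.
double_encoded = json.dumps(
    [json.dumps(op, separators=(",", ":")) for op in ops],
    separators=(",", ":"),
)

# Conventional shape: one dump of the nested structure.
nested = json.dumps(ops, separators=(",", ":"))

# Reading the double-encoded form back takes two passes.
decoded_once = json.loads(double_encoded)  # a list of str, not of dict
recovered = [json.loads(item) for item in decoded_once]

print(double_encoded)
print(nested)
```

Printing both strings shows the escaped inner quotes in the first form, which are the nested quotes the last bullet warns about.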

Possibilities to improve this:

- Improve from_string and to_string so they are less complex, using the power of json and dataclasses (i.e. build proper JSON of nested structures).
- (Better, IMHO) Use a dedicated library for data serialization and deserialization that automatically does the parsing and validation (because I haven't even talked about the manual validation that we end up doing with this implementation). One idea could be pydantic, which is already used in the background by Data interfaces V1, on which you already rely.
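
For reference, the first option could look roughly like this with only the standard library. This is a sketch, not the code that was merged (the PR moved to pydantic): the real `Operation` has more fields than the two shown in the comment, and the validation here is deliberately minimal.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Operation:
    # Only the two fields visible in the review comment; the rest is elided there.
    callback_id: str
    requested_at: str


@dataclass
class OperationQueue:
    operations: list[Operation] = field(default_factory=list)

    def to_string(self) -> str:
        # asdict() recurses into nested dataclasses, so one dump is enough.
        return json.dumps(asdict(self)["operations"], separators=(",", ":"))

    @classmethod
    def from_string(cls, data: str) -> "OperationQueue":
        items = json.loads(data)
        if not isinstance(items, list):
            raise ValueError("expected a JSON array of operations")
        # Operation(**item) raises TypeError on missing or unknown fields.
        return cls([Operation(**item) for item in items])


queue = OperationQueue([Operation("restart", "2026-04-13T17:11:06Z")])
encoded = queue.to_string()
decoded = OperationQueue.from_string(encoded)
```

The encoded value is a plain JSON array of objects, so a single `json.loads` call recovers the structure.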

Signed-off-by: Patricia Reinoso <patricia.reinoso@canonical.com>

patriciareinoso · 2026-04-17T15:34:34Z

@Gu1nness
Thank you for the review.

The main change is the use of pydantic for the Operation and OperationQueue classes.
These were the most critical since the stored object was incorrect.
This change also simplifies all the escaping when writing to etcd, so thanks for that catch.

As for the rest of the databag modelling I think it can wait because the fields are simple and the change on the logic should not change the content of it. So I'll try to do it later.

While testing I caught some issues, which I fixed: in the case of the rollingops_etcd_failed hook, we were killing the process before the lease was revoked and the lock was released. I also improved the update-status reconciliation.

I will review the lease process because, if the parent process dies, the lease process will not die. So that comment is not yet answered.
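
One general technique for tying a child's lifetime to its parent is a pipe: the child watches the read end and exits on end-of-file, which happens when the parent closes its end or dies. A later commit in this PR is titled "use pipes to end refresh lease process"; the sketch below shows the general idea only and is not that commit's code. It assumes a POSIX system, and the lease refresh is left as a comment.

```python
import subprocess
import sys

# Child program: loop until the pipe on stdin reaches end-of-file.
CHILD = r"""
import os
import select
import sys

fd = sys.stdin.fileno()
while True:
    ready, _, _ = select.select([fd], [], [], 0.05)
    if ready and os.read(fd, 1) == b"":
        break  # parent closed the pipe or died
    # A real worker would refresh the lease here.
sys.exit(0)
"""

proc = subprocess.Popen([sys.executable, "-c", CHILD], stdin=subprocess.PIPE)
alive_before = proc.poll() is None  # child keeps running while the pipe is open
proc.stdin.close()  # same effect on the child as the parent dying
exit_code = proc.wait(timeout=10)
```

No signal handling is needed for this to work: the operating system closes the parent's pipe end when the parent exits for any reason.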

Gu1nness

This is so much better now!
Congrats!

Gu1nness · 2026-04-17T16:04:35Z


-    _pid_field = WORKER_PID_FIELD
-    _log_filename = 'etcd_rollingops_worker'
+    _pid_field = 'etcd-rollingops-worker-pid'


Nit: Why did this one slip from a constant to a hardcoded string?

sinclert-canonical · 2026-04-21T10:40:18Z

👋🏻 Hey @patriciareinoso

I know we just spoke offline about the rollingops v0/v1 charmlib and its port to a standard Python package, which I assume this PR is doing. Please, let me know if that is not the case.

I wanted to share a possible improvement regarding log file locations, in the context of root-less Kubernetes charms (those where the workload container user does not have access to /var/lib, /var/log... unless explicitly granted). The improvement would be to allow the clients to specify the location for their log-file, instead of hard-coding them. I think it would make the library easier to integrate with that type of Kubernetes charms.
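
A sketch of what a client-configurable log location could look like. The name `setup_logging` is mentioned later in this thread, but the signature, the logger name, and the format string below are assumptions for illustration, not the library's API.

```python
import logging
import tempfile
from pathlib import Path


def setup_logging(log_file: Path, logger_name: str = "rollingops-worker") -> logging.Logger:
    """Hypothetical helper: log to a caller-chosen file instead of a fixed path."""
    log_file.parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_file)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


with tempfile.TemporaryDirectory() as tmp:
    # A root-less charm could pass any directory its user can write to.
    path = Path(tmp) / "logs" / "worker.log"
    log = setup_logging(path)
    log.info("lock requested")
    for handler in list(log.handlers):
        handler.close()  # flush and release the file before cleanup
        log.removeHandler(handler)
    content = path.read_text()
```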

Disclaimer:

I do not know in which review stage this PR is. Feel free to completely ignore my feedback if the current approach has already been agreed upon. The last thing I want is to delay the porting of the charmlib to a standard Python package.

patriciareinoso · 2026-04-22T09:47:47Z

+        etcd_relation_name: str,
+        cluster_id: str,


These 2 parameters should be optional.
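
A minimal sketch of optional parameters, under the assumption that leaving them out means only the peer-relation backend is used. The class `RollingOpsSettings`, the `peer_relation_name` parameter, and the `etcd_enabled` property are hypothetical; only `etcd_relation_name` and `cluster_id` come from the diff.

```python
from typing import Optional


class RollingOpsSettings:
    """Hypothetical holder for the constructor arguments discussed above."""

    def __init__(
        self,
        peer_relation_name: str,
        etcd_relation_name: Optional[str] = None,
        cluster_id: Optional[str] = None,
    ):
        self.peer_relation_name = peer_relation_name
        self.etcd_relation_name = etcd_relation_name
        self.cluster_id = cluster_id

    @property
    def etcd_enabled(self) -> bool:
        # Assumption: etcd is only considered when both values are provided.
        return self.etcd_relation_name is not None and self.cluster_id is not None


peer_only = RollingOpsSettings("restart")
with_etcd = RollingOpsSettings("restart", etcd_relation_name="etcd", cluster_id="c1")
```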

Mehdi-Bendriss

Thank you Patricia! This is massive and really good work.
I have 2 points that are not blocking the merge - as it targets a feature branch:

- Will you follow up with a next PR for patching setup_logging with a logfile path (and making the etcd relation name optional)?
- Can you fix the docs CI steps that are failing?

patriciareinoso · 2026-04-23T19:47:28Z

@Mehdi-Bendriss check #445

It answers both questions

patriciareinoso added 3 commits April 13, 2026 19:08

Merge branch 'DPE-9349-rolling-ops-maintenance' into DPE-9350-etcd-lock

3e74f5d

fix merge

c877e3d

patriciareinoso changed the title ~~feat: add etcd lock (#9)~~ feat: add etcd lock and single API orchestration Apr 13, 2026

patriciareinoso marked this pull request as ready for review April 13, 2026 17:47

patriciareinoso requested a review from a team as a code owner April 13, 2026 17:47

raise of failed transactions

f24a115

patriciareinoso requested review from Gu1nness and Mehdi-Bendriss April 15, 2026 07:44

Gu1nness reviewed Apr 15, 2026

Comment thread rollingops/src/charmlibs/rollingops/common/_base_worker.py Outdated

Gu1nness reviewed Apr 15, 2026

address review feedback

f4080be

Signed-off-by: Patricia Reinoso <patricia.reinoso@canonical.com>

patriciareinoso requested a review from Gu1nness April 17, 2026 15:34

Gu1nness approved these changes Apr 17, 2026

short uuid and subprocess attach to parent

9bf857b

patriciareinoso commented Apr 22, 2026

This was referenced Apr 22, 2026

[MISC] 8.0 - Bump pydantic library to v2 canonical/mysql-operators#260

Merged

[MISC] 8.4 - Bump pydantic library to v2 canonical/mysql-operators#261

Merged

use pipes to end refresh lease process

bd6b208

sinclert-canonical mentioned this pull request Apr 23, 2026

[MISC] 8.4 - Bump rolling-ops charmlib to v1 canonical/mysql-operators#255

Open

Mehdi-Bendriss approved these changes Apr 23, 2026

patriciareinoso merged commit 6affc8a into canonical:DPE-9349-rolling-ops-maintenance Apr 23, 2026
56 of 64 checks passed

4 participants
