Skip to content

feat: add etcd lock and single API orchestration#419

Merged
patriciareinoso merged 7 commits intocanonical:DPE-9349-rolling-ops-maintenancefrom
patriciareinoso:DPE-9350-etcd-lock
Apr 23, 2026
Merged

feat: add etcd lock and single API orchestration#419
patriciareinoso merged 7 commits intocanonical:DPE-9349-rolling-ops-maintenancefrom
patriciareinoso:DPE-9350-etcd-lock

Conversation

@patriciareinoso
Copy link
Copy Markdown

@patriciareinoso patriciareinoso commented Apr 13, 2026

Summary

This PR introduces the etcd sync and async locking mechanism for rolling ops together with a single API that fallbacks to the peer-relation solution in case of etcd failure or unavailability.

Description

etcd async lock

  • An background process is spawn on lock request
  • It performs the lock acquisition and triggers a lock_granted juju custom hook
  • The charm observes this event and executes the corresponding operation and indicates when it finished using etcd keys.

etcd sync lock

  • Implemented a distributed lock using etcd (lease + transactional lock key).
  • We use the :sync as the owner of the lock to differentiate from the lock trying to be acquired in the background process. Meaning that the sync lock may take priority over operations on the async lock.
  • Lock acquisition uses retries (via tenacity) on etcdctl commands.
  • The lock is tied to a lease and automatically released if the lease expires.

common API

  • The RollingOpsManager is the public API for advanced rolling ops
  • It decides on which backend the operation will run: etcd or the peer-relation
  • If etcd operations fail (e.g. etcdctl errors, lease issues), the system falls back to the peer-relation-based implementation.
  • This ensures operations can continue even when etcd is not usable.
  • If there is any failure on the etcd background process a juju custom hook is trigger, this way the RollingOpsManager knows it needs to fallback to peer-relations
  • Every operation request is:
    • written to etcd
    • also duplicated into the peer relation databag
  • After execution, the operation state is updated in both places.
  • This ensures:
    • no operations are lost during fallback
    • scheduling can continue seamlessly regardless of backend
  • The system only switches back to etcd when the operation queue is empty
  • Before resuming etcd usage any remaining etcd queue state is cleaned up. This guarantees a clean slate and avoids inconsistencies

Peer sync lock

The SyncLockBackend defines the interface that any charm author would need to implement in order to use sync locking in the context of the peer-relation solution given that using the peer-solutions we are not able to guarantee mutual exclusion when tearing down.

* patch: etcd rolling ops version

* first working version

* fix format

* fix linting

* add tenacity to integration test

* remove unnecessary logs

* add dataplatform as reviewes

* rename and add integration tests

* linting and rebase

* first part of comments

* more comments answered

* more comments answered

* fix linting job

* fix UT

* mark tests as only k8s

* fix integration tests

* use charmlibs apt

* remove sans dns

* add dependencies to .toml

* add uv lock

* add wait in itnegration tests

* increate timeout

* increase log count

* unlimited debug-log

* comments review

* fix paths

* migrate v1

* fix integration tests

* fix integration tests

* add lock and integration tests

* unify operations

* add tenacity

* draft

* fallback implementation

* add sync lock and state

* feat: advanced rolling ops using etcd (canonical#364)

## Context

All the code on this PR is new

This implementation is based on [DA241
Spec](https://docs.google.com/document/d/1ez4h6vOOyHy5mu6xDblcBt8PPAtMe7MUp75MtgG1sns/edit?tab=t.0)

- The content of
`charmlibs/advanced-rollingops/src/charmlibs/advanced_rollingops/_dp_interfaces_v1.py`
belongs to another library that is currently being migrated to charmlibs
so you can ignore it for now.

## Summary

This PR is the first part of the implementation of advanced rolling ops
at cluster level.

This PR includes:
- Management of the client certificate used to connect to etcd
- The leader unit creates a self-signed certificate with live of 50
years
  - Share the certificate with the other units using a peer relation
- Implementation of the integration with etcd
  - Share the mtls certificate
  - Observe the `resource_created` event 
  - Observe the `endpoints_changed` event
- Management of a env file needed to connecto etcd via `etcdctl`
 
This PR does not implement the locking mechanism. In here we only test
that we can connect to etcd from each unit.

## Current workflow:

1. The unit make a request
2. A new background process is spawn
3. The background process dispatches a Juju hook
4. The unit observes that hook
5. The unit writes and read a value in etcd
6. If the unit was able to connect to etcd, it executes the "restart"
function.

This is a very simplified workflow to be able to test that the units
from different apps can reach etcd.

## To do
- Implement the actual locking mechanism
- Figure out how to properly install etcdctl

* feat: migrate rollingops v1 from charm-rolling-ops repo (canonical#415)

* define syn lock backend

* fix merge

* clean up

* fix peer integration tests

* fix integration tests

* fix integration tests

* docstrings

* add update status handled and improve integration tests

* general cleanup
@patriciareinoso patriciareinoso changed the title feat: add etcd lock (#9) feat: add etcd lock and single API orchestration Apr 13, 2026
@patriciareinoso patriciareinoso marked this pull request as ready for review April 13, 2026 17:47
@patriciareinoso patriciareinoso requested a review from a team as a code owner April 13, 2026 17:47
Comment thread rollingops/src/charmlibs/rollingops/common/_base_worker.py Outdated
Copy link
Copy Markdown
Contributor

@Gu1nness Gu1nness left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long review again, but I really love the way this is coming up!
It's mostly nitpicking and improving a bit the logging/stability.
Notes on the data modelling but I don't think we have time to rewrite the whole data modelling so I guess we'll go with that for now? Maybe except the from_string, to_string that really looks like hacking around json/poorly using json.

Nearly there!

Comment thread rollingops/src/charmlibs/rollingops/common/_base_worker.py Outdated
Comment thread rollingops/src/charmlibs/rollingops/common/_base_worker.py
"""Exceptions used in rollingops."""


class RollingOpsError(Exception):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for the future if we want to improve all of our error frameworks: Each error should have a unique error code and a message so that we can log it and it would improve tracing and monitoring.

No need to do it now, just food for thoughts for the future.

Idea of implementation:

from pydantic import BaseModel, ConfigDict
from dataclasses import dataclass, field
class CommonExceptionModel(BaseModel):
	# Model config
	model_config = ConfigDict(from_attributes=True)

	# Attributes
    code: int
    message: str
    name: str


@dataclass
class CommonException(Exception):
    message: str
    code: int = 0
    name: str = field(init=False)

    def __post_init__(self):
        try:
            self.name = self.__class__.__name__
        except AttributeError:
			# For inheritance scenarios.
            pass
        super().__init__(self.message)

    def __str__(self):
        return str(self.message)

    def serialize(self):
            return CommonExceptionModel.model_validate(self).model_dump_json()

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, it could be implemented later

Comment thread rollingops/src/charmlibs/rollingops/common/_models.py Outdated
Comment thread rollingops/src/charmlibs/rollingops/common/_models.py Outdated
Comment thread rollingops/src/charmlibs/rollingops/etcd/_rollingops.py Outdated
Comment thread rollingops/src/charmlibs/rollingops/etcd/_rollingops.py Outdated
Comment thread rollingops/src/charmlibs/rollingops/etcd/_rollingops.py Outdated
Comment thread rollingops/src/charmlibs/rollingops/peer/_worker.py
Comment thread rollingops/tests/integration/charms/actions.yaml Outdated
@Gu1nness
Copy link
Copy Markdown
Contributor

@patriciareinoso About the data modelling comment, and after discussion with Mehdi I paste here a more detailed input:

I think we're hacking around json and building a lot of complexity by re-inventig data serialization/deserialization instead of using proper tools to do that.

An example that we have is:

@dataclass
class Operation:
    …
    def from_string(cls, data: str) -> Operation:
        obj = json.loads(data)
        return cls.from_dict(obj)

    def to_string() -> str:
        return json.dumps(self._to_dict(), separators=(',', ':'))
       

class OperationQueue: # Beware, not event a dataclass
    def __init__(self, operations: list[Operation] | None = None):
        self.operations: list[Operation] = list(operations or [])

    def to_string(self):
        items = [op.to_string() for op in self.operations]
        return json.dumps(items, separators=(',', ':'))

   def from_string(cls, data: str):
        try:
            items = json.loads(data)
        except:
            …
        if not isinstance(items, list) or not all(isinstance(s, str) for s in items):
            raiseoperations = [Operation.from_string(s) for s in items]
        return cls(operations)

And that builds something weird.

Instead of having:

[
    {"callback_id": "...", "requested_at": "…",}, …
]

We have:

[
    "{\"callback_id\": \"…\", \"requested_at\": \"…\"}", …
]

This implies:

  • two json deser instead of one
  • Non standard format
  • Custom parsing
  • Risks in the future if all those nested quotes become re-escaped by something downhill

Possibilities to improve this:

  1. Improve the from_string and to_string so they are are less complex, using the power of json and dataclasses (aka build proper json of nested structures).
  2. (better IMHO)Use dedicated library for data serialization and deserialization that automatically does the parsing and validation (because I haven't even talked about the manual validation that we end up doing with this implem). An idea could be pydantic which is already used in the background by Data interfaces V1 on which you rely already.

Signed-off-by: Patricia Reinoso <patricia.reinoso@canonical.com>
@patriciareinoso
Copy link
Copy Markdown
Author

patriciareinoso commented Apr 17, 2026

@Gu1nness
Thank you for the review.

The main change is the use of pydantic for the Operation and OperationQueue classes.
These were the most critical since the stored object was incorrect.
This change also simplify all the / escaping when writing to etcd so thanks for that catch.

As for the rest of the databag modelling I think it can wait because the fields are simple and the change on the logic should not change the content of it. So I'll try to do it later.

While testing I caught some issues which I fixed: we were killing the process before the lease was revoke and the lock was released in case of rollingops_etcd_failed hook. And I improved the update-status reconciliation.

I will review the lease process because if the parent process die, it will not die. So that comment is not yet answered.

Copy link
Copy Markdown
Contributor

@Gu1nness Gu1nness left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so much better now!
Congrats!

Comment thread rollingops/src/charmlibs/rollingops/common/_models.py
Comment thread rollingops/src/charmlibs/rollingops/common/_models.py
Comment thread rollingops/src/charmlibs/rollingops/etcd/_etcd.py
Comment thread rollingops/src/charmlibs/rollingops/etcd/_etcd.py

_pid_field = WORKER_PID_FIELD
_log_filename = 'etcd_rollingops_worker'
_pid_field = 'etcd-rollingops-worker-pid'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Why did this one slip to a hardcoded string from a constant ?

@sinclert-canonical
Copy link
Copy Markdown

👋🏻 Hey @patriciareinoso

I know we just spoke offline about the rollingops v0/v1 charmlib and its port to a standard Python package, which I assume this PR is doing. Please, let me know if that is not the case.

I wanted to share a possible improvement regarding log file locations, in the context of root-less Kubernetes charms (those where the workload container user does not have access to /var/lib, /var/log... unless explicitly granted). The improvement would be to allow the clients to specify the location for their log-file, instead of hard-coding them. I think it would make the library easier to integrate with that type of Kubernetes charms.


Disclaimer:

I do not know in which review stage this PR is. Feel free to completely ignore my feedback if the current approach has already been agreed upon. The last thing I want is to delay the porting of the charmlib to a standard Python package.

Comment on lines +70 to +71
etcd_relation_name: str,
cluster_id: str,
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 2 parameter should be optional.

Copy link
Copy Markdown

@Mehdi-Bendriss Mehdi-Bendriss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Patricia! This is massive and really good work.
I have 2 points that are not blocking the merge - as it targets a feature branch:

  • Will you follow with a next PR for patching setup_logging with a logfile path (and making the etcd relation name optional)?
  • Can you fix the docs CI steps that are failing?

@patriciareinoso patriciareinoso merged commit 6affc8a into canonical:DPE-9349-rolling-ops-maintenance Apr 23, 2026
56 of 64 checks passed
@patriciareinoso
Copy link
Copy Markdown
Author

@Mehdi-Bendriss check #445

It answers boths questions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants