feat: add etcd lock and single API orchestration (#419)
Conversation
Squashed commits:

* patch: etcd rolling ops version
* first working version
* fix format
* fix linting
* add tenacity to integration test
* remove unnecessary logs
* add dataplatform as reviewers
* rename and add integration tests
* linting and rebase
* first part of comments
* more comments answered
* more comments answered
* fix linting job
* fix UT
* mark tests as only k8s
* fix integration tests
* use charmlibs apt
* remove sans dns
* add dependencies to .toml
* add uv lock
* add wait in integration tests
* increase timeout
* increase log count
* unlimited debug-log
* comments review
* fix paths
* migrate v1
* fix integration tests
* fix integration tests
* add lock and integration tests
* unify operations
* add tenacity
* draft
* fallback implementation
* add sync lock and state
* feat: advanced rolling ops using etcd (canonical#364) (description below)
* feat: migrate rollingops v1 from charm-rolling-ops repo (canonical#415)
* define sync lock backend
* fix merge
* clean up
* fix peer integration tests
* fix integration tests
* fix integration tests
* docstrings
* add update status handler and improve integration tests
* general cleanup

Description of canonical#364:

## Context
All the code in this PR is new. This implementation is based on the [DA241 Spec](https://docs.google.com/document/d/1ez4h6vOOyHy5mu6xDblcBt8PPAtMe7MUp75MtgG1sns/edit?tab=t.0).

The content of `charmlibs/advanced-rollingops/src/charmlibs/advanced_rollingops/_dp_interfaces_v1.py` belongs to another library that is currently being migrated to charmlibs, so you can ignore it for now.

## Summary
This PR is the first part of the implementation of advanced rolling ops at cluster level. It includes:

- Management of the client certificate used to connect to etcd
  - The leader unit creates a self-signed certificate with a lifetime of 50 years
  - The certificate is shared with the other units using a peer relation
- Implementation of the integration with etcd
  - Share the mTLS certificate
  - Observe the `resource_created` event
  - Observe the `endpoints_changed` event
- Management of an env file needed to connect to etcd via `etcdctl`

This PR does not implement the locking mechanism; here we only test that we can connect to etcd from each unit.

## Current workflow
1. The unit makes a request.
2. A new background process is spawned.
3. The background process dispatches a Juju hook.
4. The unit observes that hook.
5. The unit writes and reads a value in etcd.
6. If the unit was able to connect to etcd, it executes the "restart" function.

This is a very simplified workflow, used to verify that units from different apps can reach etcd.

## To do
- Implement the actual locking mechanism
- Figure out how to properly install etcdctl
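Step 5 of the workflow relies on the env file that carries the `etcdctl` connection settings. As an illustration of that plumbing only (the parser and helper names here are hypothetical, not the charmlib's actual code), a minimal sketch:

```python
import os
import subprocess


def parse_env_file(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines (ignoring blanks and comments) into a dict
    suitable for use as a subprocess environment."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env


def etcdctl_put(env_file_text: str, key: str, value: str) -> None:
    """Shell out to etcdctl with the connection settings from the env file."""
    env = {**os.environ, **parse_env_file(env_file_text)}
    subprocess.run(["etcdctl", "put", key, value], env=env, check=True)
```

`etcdctl` reads its endpoints and TLS material from `ETCDCTL_*` environment variables, which is why an env file per unit is enough to configure the client.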
Gu1nness left a comment:
Long review again, but I really love the way this is coming up!
It's mostly nitpicking and improving the logging/stability a bit.
Notes on the data modelling, but I don't think we have time to rewrite the whole data modelling, so I guess we'll go with that for now? Maybe except the from_string/to_string, which really looks like hacking around JSON/poorly using JSON.
Nearly there!
| """Exceptions used in rollingops.""" | ||
|
|
||
|
|
||
| class RollingOpsError(Exception): |
Note for the future, if we want to improve all of our error frameworks: each error should have a unique error code and a message, so that we can log it; this would improve tracing and monitoring.
No need to do it now, just food for thought for the future.
Idea of implementation:
```python
from pydantic import BaseModel, ConfigDict
from dataclasses import dataclass, field


class CommonExceptionModel(BaseModel):
    # Model config
    model_config = ConfigDict(from_attributes=True)

    # Attributes
    code: int
    message: str
    name: str


@dataclass
class CommonException(Exception):
    message: str
    code: int = 0
    name: str = field(init=False)

    def __post_init__(self):
        try:
            self.name = self.__class__.__name__
        except AttributeError:
            # For inheritance scenarios.
            pass
        super().__init__(self.message)

    def __str__(self):
        return str(self.message)

    def serialize(self):
        return CommonExceptionModel.model_validate(self).model_dump_json()
```
Nice, it could be implemented later
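For reference, a dependency-free sketch of the same idea (class name and the code value `101` are arbitrary illustrations; the reviewer's version above additionally uses pydantic for serialization):

```python
import json
from dataclasses import dataclass, field


@dataclass
class CodedError(Exception):
    """Base error carrying a stable code alongside the message,
    so occurrences can be traced and monitored by code."""

    message: str
    code: int = 0
    name: str = field(init=False)

    def __post_init__(self):
        # Record the concrete subclass name for structured logging.
        self.name = self.__class__.__name__
        super().__init__(self.message)

    def serialize(self) -> str:
        return json.dumps({"code": self.code, "message": self.message, "name": self.name})


err = CodedError("failed to acquire lock", code=101)
```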
@patriciareinoso About the data modelling comment, and after discussion with Mehdi, I paste here a more detailed input: I think we are hacking around JSON and building a lot of complexity by re-inventing data serialization/deserialization instead of using proper tools for that. An example we have is:

```python
@dataclass
class Operation:
    …

    def from_string(cls, data: str) -> Operation:
        obj = json.loads(data)
        return cls.from_dict(obj)

    def to_string(self) -> str:
        return json.dumps(self._to_dict(), separators=(',', ':'))


class OperationQueue:  # Beware, not even a dataclass
    def __init__(self, operations: list[Operation] | None = None):
        self.operations: list[Operation] = list(operations or [])

    def to_string(self):
        items = [op.to_string() for op in self.operations]
        return json.dumps(items, separators=(',', ':'))

    def from_string(cls, data: str):
        try:
            items = json.loads(data)
        except:
            …
        if not isinstance(items, list) or not all(isinstance(s, str) for s in items):
            raise …
        operations = [Operation.from_string(s) for s in items]
        return cls(operations)
```

And that builds something weird. Instead of having: … We have: … This implies: …
Possibilities to improve this:
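The follow-up below notes that pydantic was adopted for this. As a minimal sketch of the round-trip that replaces the hand-rolled `from_string`/`to_string` pair (field names here are illustrative, not the library's actual schema):

```python
from pydantic import BaseModel


class Operation(BaseModel):
    name: str
    status: str = "pending"


class OperationQueue(BaseModel):
    operations: list[Operation] = []


# One serializer/validator for the whole tree; no per-class JSON hacks.
queue = OperationQueue(operations=[Operation(name="restart")])
data = queue.model_dump_json()
restored = OperationQueue.model_validate_json(data)
```

Validation (types, required fields, nested lists) comes for free from `model_validate_json`, instead of the manual `isinstance` checks in the snippet above.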
Signed-off-by: Patricia Reinoso <patricia.reinoso@canonical.com>
@Gu1nness The main change is the use of pydantic. As for the rest of the databag modelling, I think it can wait: the fields are simple, and the change in the logic should not change its content, so I'll try to do it later.

While testing I caught some issues which I fixed: we were killing the process before the lease was revoked and the lock released.

I will also review the lease process, because if the parent process dies, the lease will not die with it. So that comment is not yet answered.
Gu1nness left a comment:
This is so much better now!
Congrats!
```python
_pid_field = WORKER_PID_FIELD
_log_filename = 'etcd_rollingops_worker'
_pid_field = 'etcd-rollingops-worker-pid'
```
Nit: why did this one slip to a hardcoded string from a constant?
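The constant-based variant the reviewer is asking for could look like this sketch (the class name is illustrative; only the field names and values come from the diff above):

```python
# Shared module-level constant instead of a hardcoded string,
# so the field name cannot drift between code paths.
WORKER_PID_FIELD = "etcd-rollingops-worker-pid"


class EtcdRollingOpsWorker:
    _log_filename = "etcd_rollingops_worker"
    _pid_field = WORKER_PID_FIELD
```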
👋🏻 Hey @patriciareinoso, I know we just spoke offline about the rollingops v0/v1 charmlib and its port to a standard Python package, which I assume this PR is doing; please let me know if that is not the case. I wanted to share a possible improvement regarding log file locations, in the context of rootless Kubernetes charms (those where the workload container user does not have access to …).

Disclaimer: I do not know at which review stage this PR is. Feel free to completely ignore my feedback if the current approach has already been agreed upon. The last thing I want is to delay the porting of the charmlib to a standard Python package.
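One way to implement that suggestion (a sketch only; the function name and the preferred path are assumptions, not the charmlib's API) is to fall back to a user-writable directory when the usual log location is not writable, as happens in rootless containers:

```python
import os
import tempfile


def worker_log_path(filename: str = "etcd_rollingops_worker.log") -> str:
    """Prefer /var/log, but fall back to the temp dir when the current
    user cannot write there (e.g. rootless Kubernetes workloads)."""
    preferred = "/var/log"
    base = preferred if os.access(preferred, os.W_OK) else tempfile.gettempdir()
    return os.path.join(base, filename)
```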
```python
etcd_relation_name: str,
cluster_id: str,
```
These two parameters should be optional.
Mehdi-Bendriss left a comment:
Thank you Patricia! This is massive and really good work.
I have two points, neither blocking the merge since this targets a feature branch:

- Will you follow up with a next PR patching `setup_logging` with a logfile path (and making the etcd relation name optional)?
- Can you fix the docs CI steps that are failing?
6affc8a into canonical:DPE-9349-rolling-ops-maintenance
@Mehdi-Bendriss check #445. It answers both questions.
Summary
This PR introduces the etcd sync and async locking mechanisms for rolling ops, together with a single API that falls back to the peer-relation solution in case of etcd failure or unavailability.
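The fallback shape described above can be sketched as follows (the exception type, method names, and backend classes are illustrative stand-ins, not the library's actual API):

```python
class EtcdUnavailableError(Exception):
    """Raised by the etcd backend when its endpoints are unreachable (illustrative)."""


def acquire_lock(etcd_backend, peer_backend) -> bool:
    """Try the etcd lock first; on etcd failure, fall back to the
    peer-relation lock so rolling ops can still proceed."""
    try:
        return etcd_backend.acquire()
    except EtcdUnavailableError:
        return peer_backend.acquire()


class FailingEtcdBackend:
    """Stub standing in for an etcd backend whose cluster is down."""

    def acquire(self) -> bool:
        raise EtcdUnavailableError("no reachable endpoints")


class PeerLockBackend:
    """Stub standing in for the peer-relation fallback lock."""

    def acquire(self) -> bool:
        return True
```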
Description
etcd async lock
etcd sync lock
common API
`RollingOpsManager` is the public API for advanced rolling ops.

Peer sync lock
The `SyncLockBackend` defines the interface that any charm author needs to implement in order to use sync locking in the context of the peer-relation solution, given that with the peer solution we are not able to guarantee mutual exclusion when tearing down.
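Such an interface might be sketched as below (the method names and the in-memory example are assumptions for illustration, not the library's actual contract):

```python
from abc import ABC, abstractmethod


class SyncLockBackend(ABC):
    """Interface a charm author implements to use sync locking
    with the peer-relation solution."""

    @abstractmethod
    def acquire(self) -> bool:
        """Try to take the lock; return True on success."""

    @abstractmethod
    def release(self) -> None:
        """Give the lock back."""


class InMemorySyncLock(SyncLockBackend):
    """Toy single-process implementation, useful only to exercise the interface."""

    def __init__(self):
        self._held = False

    def acquire(self) -> bool:
        if self._held:
            return False
        self._held = True
        return True

    def release(self) -> None:
        self._held = False
```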