Conversation
Signed-off-by: Patricia Reinoso <patricia.reinoso@canonical.com>
@override
-def restart_charm_services(self, force: bool = False):
+def restart_charm_services(self, force: bool = False) -> OperationResult:
As discussed offline.
This callback should add guards on LDAP and vault state, because restarting without the appropriate state can break the charm.
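One way to picture these guards, as a minimal sketch only: `ldap_ready`, `vault_ready`, and the return values are hypothetical stand-ins, not the charm's real API.

```python
# Hypothetical guard pattern: skip the restart unless the dependent
# state (LDAP, vault) is ready. Restarting in a bad state can break
# the charm, so the callback bails out instead.

def restart_charm_services(ldap_ready, vault_ready, restart):
    """Only restart when LDAP and vault state allow it; otherwise skip."""
    if not (ldap_ready() and vault_ready()):
        return "skipped"  # state not ready: do not touch the workload
    restart()
    return "restarted"
```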
Gu1nness left a comment:
We need to be careful with the complex relations (sharding + cluster): they are heavily stateful, and the current implementation adds unnecessary delay.
We should take advantage of the asynchronous locking to do something better, where we don't have to exchange so much data because we can retry the critical path.
E.g. config-server <-> shard:
Config server sends the keyfile and requests the lock.
On lock for the config server: it tries to add the shard to the cluster; if that fails (auth not updated yet), it asks for a retry.
Shard:
Upon receiving the keyfile, it asks for two locks (restart with keyfile, restart PBM).
On lock for restart with keyfile: restart with the new config.
On lock for restart PBM: check whether the shard has been added to the cluster. If yes, start/restart PBM.
That way we rely on the asynchronicity to ensure we eventually end up in the correct state, and we can remove the `auth-updated` flag and the `shard-added-to-cluster` check.
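The config-server side of that retry-based flow could look roughly like this. All names (`add_shard`, `AuthNotReady`, `request_lock_retry`) are hypothetical stand-ins, not the real charm or rolling-ops API.

```python
# Sketch of "retry instead of flags": on lock grant the config-server
# just tries to add the shard; if the shard's auth is not ready yet,
# it requests the lock again rather than waiting on an auth-updated flag.

class AuthNotReady(Exception):
    """Raised while the shard has not yet restarted with the shared keyfile."""

def on_config_server_lock_granted(add_shard, request_lock_retry, shard_id):
    try:
        add_shard(shard_id)
        return "added"
    except AuthNotReady:
        request_lock_retry()  # eventually consistent: try again next grant
        return "retry-requested"
```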
msg = "Waiting for mongos to be restarted"
defer_event_with_info_log(logger, event, str(type(event)), str(msg))
return
self.manager._reconcile_after_mongos_restart()
I don't think we should defer here.
I would rather have a dedicated callback, or move the code around, so that we don't need to defer.
The reason:
- relation-changed => request lock
- lock granted => execute first the deferred event, which re-requests the lock, then does the restart
- We still haven't run the reconciliation, and we'll need to wait until later to run it.
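The dedicated-callback alternative can be sketched as follows: instead of deferring the event until after the restart, the reconciliation is registered as part of the lock callback, so it is guaranteed to run right after the restart. `RollingLock`, `restart_mongos`, and `reconcile` are illustrative names, not the real rolling-ops API.

```python
# Toy async lock: callbacks are queued when requested and executed
# only when the lock is granted.

class RollingLock:
    def __init__(self):
        self._queue = []
    def request(self, callback):
        self._queue.append(callback)
    def grant_next(self):
        if self._queue:
            self._queue.pop(0)()

def on_relation_changed(lock, restart_mongos, reconcile):
    # Queue restart + reconciliation together: no defer needed, and the
    # reconciliation always runs in the same dispatch as the restart.
    def restart_then_reconcile():
        restart_mongos()
        reconcile()
    lock.request(restart_then_reconcile)
```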
def is_waiting_for_rolling_restart(self) -> bool:
    """Returns whether Mongos has pending rolling operations."""
    return self.rollingops_manager.state.status == RollingOpsStatus.WAITING
Maybe this could be exposed by rolling ops as an interface? Something like `is_waiting_for_lock(callback_id)` could make sense.
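A hypothetical shape for that interface, with per-callback waiting state tracked inside the manager instead of callers inspecting its status enum. Only the `RollingOpsStatus.WAITING` name comes from the diff above; the rest is a sketch.

```python
from enum import Enum

class RollingOpsStatus(Enum):
    IDLE = "idle"
    WAITING = "waiting"

class RollingOpsManager:
    """Tracks pending lock requests per callback id."""
    def __init__(self):
        self._pending = {}
    def request(self, callback_id):
        self._pending[callback_id] = RollingOpsStatus.WAITING
    def grant(self, callback_id):
        self._pending[callback_id] = RollingOpsStatus.IDLE
    def is_waiting_for_lock(self, callback_id):
        # The interface suggested in the review: callers ask about their
        # own callback instead of peeking at the global state.
        return self._pending.get(callback_id) == RollingOpsStatus.WAITING
```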
if self.charm.unit.is_leader():
    self.sync_cluster_passwords(operator_password, backup_password)

self.update_member_auth(keyfile, tls_ca, external_tls_ca)
We're messing with the workflow here.
The workflow is:
- Update the keyfile on the filesystem.
- Restart.
- Wait for readiness.
- Set `auth-updated` so that the config-server can add the shard to the cluster (because we have restarted and now have the same keyfile as the config-server).
- The config-server adds us and sets a flag to indicate that the shard has been added to the cluster.
- The shard restarts PBM.
With the new workflow:
- Update the keyfile on the filesystem.
- Ask for a restart.
- Wait for readiness (we're waiting for the restart, so of course we're raising).
- On lock granted, we run the deferred event, which defers again. We restart. We don't send the `auth-updated` flag, so the config-server is still waiting. We wait until the next event for relation-changed to re-run (it's been deferred). Which means we need three iterations of the relation-changed event (including the two deferrals) before we can signal to the config-server to add us.
- The config-server adds us.
- We restart PBM.
So there is way more waiting, while the change should make this kind of scenario easier.
We need a dedicated callback for this one IMHO.
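A dedicated lock callback for this step might look roughly like the sketch below: everything around the keyfile restart (write the file, restart, wait, signal `auth-updated`) happens in one dispatch, so the config-server is notified immediately instead of after three relation-changed iterations. All helper names are hypothetical.

```python
# Sketch of a dedicated keyfile-restart callback, run when the rolling
# lock is granted. The auth-updated signal is sent only after a
# successful restart, since only then do we share the config-server's
# keyfile and can safely be added to the cluster.

def keyfile_restart_callback(write_keyfile, restart, wait_ready, set_auth_updated):
    write_keyfile()
    restart()
    if wait_ready():
        set_auth_updated()  # config-server can now add this shard
        return True
    return False  # not ready yet; the lock can be re-requested
```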
self.delete_certificates_from_workload(internal)
self.dependent.restart_charm_services(force=True)
Certificate deletion should happen inside the lock.
Otherwise, in the window between certificate deletion and the restart, if mongod restarts unexpectedly it will fail, which is a disruption of service.
This probably needs a dedicated callback, or we should improve `restart_charm_services` to do the cert setting/deletion in a systematic way (if there is nothing in the secret, remove the file if it exists; otherwise, write the file with the data from the secret).
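The "systematic" option could be sketched like this: the restart callback reconciles certificate files from the secret content itself, so writing and deleting always happen under the lock, right before the restart. Paths and helper names are illustrative, not the charm's real API.

```python
import os
from typing import Optional

def reconcile_cert_file(path: str, secret_value: Optional[str]) -> None:
    """Write the cert if the secret holds one; otherwise remove a stale file."""
    if secret_value:
        with open(path, "w") as f:
            f.write(secret_value)
    elif os.path.exists(path):
        os.remove(path)

def restart_charm_services(cert_specs, restart):
    # cert_specs: (path, secret_value) pairs, reconciled under the lock,
    # so there is no window where the file is gone but mongod still runs
    # with the old config.
    for path, value in cert_specs:
        reconcile_cert_file(path, value)
    restart()
```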
🏷️ Type of changes
📝 Description
Rolling Ops Integration for MongoDB
This PR introduces rolling-ops-based asynchronous restarts for the mongod and mongos components.
The main change is:
Before: components restarted immediately within hooks
Now: restarts are queued via rolling ops (async lock) and executed in a rolling fashion.
This ensures safer, ordered restarts across units.
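The before/now contrast can be illustrated with a toy model: hooks no longer restart inline; they enqueue a restart request, and a coordinator grants the lock one unit at a time, giving ordered, non-concurrent restarts. The names are illustrative, not the real rolling-ops library API.

```python
# Toy rolling restarter: units request a restart (hook side) and the
# coordinator grants the lock to one unit per grant, so restarts are
# ordered and never concurrent.

class RollingRestarter:
    def __init__(self):
        self.queue = []      # pending restart requests, FIFO
        self.restarted = []  # units restarted so far, in order
    def request_restart(self, unit):
        # Hook side: enqueue instead of restarting immediately.
        self.queue.append(unit)
    def grant_next(self):
        # Coordinator side: at most one unit restarts per grant.
        if self.queue:
            self.restarted.append(self.queue.pop(0))
```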
This PR adds the relation to etcd, but full etcd integration is not implemented yet.
LDAP
Former behavior:
New behavior:
Restart code sections
- `restart_when_ready`: `restart-if-ready` if leader, or `relation-changed` if follower
- `clean_ldap_credentials_and_uri` and `remove_ldap_certificates`: `relation-broken` or unavailable.

TLS
Former behavior
New behavior
Restart code sections
- `enable_certificates_for_unit`: triggered on certificate available
- `disable_certificates_for_unit`: triggered on tls `relation-broken`

SHARD MANAGER
Former behavior
On DB created:
New behavior
On DB created:
Restart code sections
- `update_member_auth`
- `update_pbm_certificate_in_trust_store`

CLUSTER (mongos)
Former behavior:
For mongos, on relation-changed:
New behavior:
For mongos, on relation-changed:
MongoDB
Former behavior:
On config-changed:
New behavior:
On config-changed:
Operator:
Former behavior:
New behavior:
Same behavior is kept, but using async restarts.
Restart code sections
- `remove_ca_cert_from_trust_store`

🧪 Manual testing steps
🔬 Automated testing steps
✅ Checklist