patch: test rollingops #299

Draft
patriciareinoso wants to merge 28 commits into 8/edge from DPE-9684-rolling-ops

Conversation

Contributor

@patriciareinoso patriciareinoso commented Apr 17, 2026

🏷️ Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Tooling and CI
  • Dependencies upgrade or change
  • Chores / refactoring

📝 Description

Rolling Ops Integration for MongoDB

This PR introduces rolling-ops based asynchronous restarts for mongod and mongos components.

The main change is:

Before: components restarted immediately within hooks
Now: restarts are queued via rolling ops (async lock) and executed in a rolling fashion.

This ensures safer, ordered restarts across units.
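The queued-restart flow can be sketched as a tiny, self-contained toy model (plain Python, no ops/Juju; `RollingRestartQueue` and its methods are illustrative names, not the rolling-ops library's actual API):

```python
from collections import deque
from enum import Enum, auto


class Status(Enum):
    IDLE = auto()
    WAITING = auto()  # restart requested, lock not yet granted


class RollingRestartQueue:
    """Toy model of the rolling-ops pattern: hooks only *request* a
    restart; the lock is granted to one unit at a time, so restarts
    roll through the cluster in order instead of all at once."""

    def __init__(self):
        self.pending = deque()
        self.status = {}

    def request_restart(self, unit):
        # Former behavior: restart immediately inside the hook.
        # New behavior: only enqueue; the restart runs later, under the lock.
        if self.status.get(unit) != Status.WAITING:
            self.pending.append(unit)
            self.status[unit] = Status.WAITING

    def grant_next(self, restart_callback):
        # Grant the lock to the first waiting unit and run its restart.
        if not self.pending:
            return None
        unit = self.pending.popleft()
        restart_callback(unit)
        self.status[unit] = Status.IDLE
        return unit


restarted = []
q = RollingRestartQueue()
for u in ("mongodb/0", "mongodb/1", "mongodb/2"):
    q.request_restart(u)
while q.grant_next(restarted.append):
    pass
# Units restart one at a time, in request order.
```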

This PR adds the relation to etcd, but full etcd integration is not implemented yet.

LDAP

Former behavior:

  • Immediate restart
  • LDAP statuses recomputed after restart

New behavior:

  • Trigger async restart via rolling ops
  • LDAP statuses are set immediately and remain unchanged
  • Restart applies to mongod component only
  • Restart-related status is applied to MongoD, not LDAP

Restart code sections

  1. restart_when_ready

    • Triggered by `restart-if-ready` if leader, or `relation-changed` if follower
  2. clean_ldap_credentials_and_uri and remove_ldap_certificates:

    • Triggered on LDAP relation-broken or unavailable.

TLS

Former behavior

  • Immediate restart
  • SHARD Manager / Mongos statuses recomputed after restart

New behavior

  • Trigger async restart
  • Recompute SHARD Manager / Mongos statuses immediately
  • Restart applies to MongoD / Mongos components
  • SHARD Manager statuses remain unchanged

Restart code sections

  1. enable_certificates_for_unit triggered on certificate available

  2. disable_certificates_for_unit : triggered on tls relation-broken

SHARD MANAGER

Former behavior
On DB created:

  • If keyfile changed and cluster auth uses keyfile -> immediate restart
  • If the PBM CA certificate does not exist but it exists in the trust store -> immediate restart
  • If mongod is not ready -> defer

New behavior
On DB created:

  • If keyfile changed and cluster auth uses keyfile -> async restart
  • If the PBM CA certificate does not exist but it exists in the trust store -> async restart
  • If a restart is pending or mongod is not ready -> defer
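The decision logic above can be written as a pure function for clarity (illustrative only; the argument names are not the charm's actual parameters):

```python
def on_database_created(keyfile_changed, auth_is_keyfile,
                        pbm_ca_missing_but_in_trust_store,
                        restart_pending, mongod_ready):
    """Sketch of the new shard-manager DB-created decisions."""
    actions = []
    if keyfile_changed and auth_is_keyfile:
        actions.append("request_async_restart")
    if pbm_ca_missing_but_in_trust_store:
        actions.append("request_async_restart")
    if restart_pending or not mongod_ready:
        actions.append("defer")
    return actions
```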

Restart code sections

  1. update_member_auth
  2. update_pbm_certificate_in_trust_store

CLUSTER (mongos)

Former behavior:
For mongos, on relation-changed:

  • If the config changed, keyfile changed or mongos is not running -> immediate restart
  • If mongos is not running -> defer

New behavior:
For mongos, on relation-changed:

  • If the config changed, keyfile changed or mongos is not running -> async restart
  • If a restart is pending -> defer
  • If mongos is not running -> defer
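The order of checks for mongos could be sketched like this (purely illustrative; `handle_mongos_relation_changed` and the state keys are hypothetical names, not the charm's API):

```python
def handle_mongos_relation_changed(state):
    """Sketch of the new mongos relation-changed decision order."""
    if (state.get("config_changed") or state.get("keyfile_changed")
            or not state.get("mongos_running")):
        # Async restart via rolling ops: only mark it as requested here.
        state["restart_pending"] = True
    if state.get("restart_pending"):
        return "defer"  # wait for the rolling-ops lock to be granted
    if not state.get("mongos_running"):
        return "defer"
    return "continue"
```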

MongoDB

Former behavior:
On config-changed:

  • Immediate restart (it is actually restarted if the IPs changed)
  • Continue configuration

New behavior:
On config-changed:

  • Async restart (it is actually restarted if the IPs changed)
  • Continue configuration

Operator:

Former behavior:

  • Trigger immediate restart on shard relation broken
  • Trigger immediate restart on s3/gcs relation broken

New behavior:
The same behavior is kept, but using async restarts.

Restart code sections

  1. remove_ca_cert_from_trust_store

🧪 Manual testing steps

1. juju deploy mongodb as a replica set 
2. Enable peer TLS
3. Enable client TLS
4. Disable client TLS
5. Disable peer TLS
6. Scale to 3 units
7. Enable client TLS
8. Enable peer TLS 
9. Scale to 5 units

🔬 Automated testing steps

✅ Checklist

  • My code follows the code style of this project.
  • I have added or updated any relevant documentation.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@patriciareinoso patriciareinoso changed the title from "feat: test rollingops" to "patch: test rollingops" Apr 21, 2026
Comment thread single_kernel_mongo/managers/mongodb_operator.py Outdated

@override
-def restart_charm_services(self, force: bool = False):
+def restart_charm_services(self, force: bool = False) -> OperationResult:
Contributor Author


As discussed offline.

This callback should add guards about LDAP and vault state because restarting if we do not have the appropriate state can break the charm.

Contributor

@Gu1nness Gu1nness left a comment


We need to be careful in the complex relations (sharding + cluster) as they are heavily stateful and the current implementation adds unnecessary delay.
We should take advantage of the asynchronous locking to do something better where we don't have to exchange so much data because we can retry the critical path.
Eg: config-server <-> shard

Config server sends keyfile and request lock
On lock of config server: tries to add shard to cluster, if it fails (auth not updated yet) it asks for retry

Shard:
Upon receiving keyfile, asks for two locks (restart with keyfile, restart PBM).
On lock for restart with keyfile: restart with new config
On lock for restart PBM: check if shard has been added to cluster. If yes, start/restart PBM.

That way we rely on the asynchronicity to ensure that we eventually end up in the correct state, and we can remove the `auth-updated` flag and the `shard-added-to-cluster` check.
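The retry-on-lock idea sketched above could look like this (hypothetical helper; `try_add_shard` stands in for whatever the config-server runs under the lock):

```python
def add_shard_with_retry(try_add_shard, max_attempts=5):
    """Sketch of the reviewer's suggestion: on each lock grant, the
    config-server simply retries adding the shard. If auth hasn't
    propagated yet, it asks for a retry instead of exchanging flags."""
    for attempt in range(1, max_attempts + 1):
        if try_add_shard():
            return attempt  # shard added on this attempt
        # Auth not updated yet: re-request the lock, try again on next grant.
    return None  # still not added after max_attempts
```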

Comment thread single_kernel_mongo/events/cluster.py Outdated
msg = "Waiting for mongos to be restarted"
defer_event_with_info_log(logger, event, str(type(event)), str(msg))
return
self.manager._reconcile_after_mongos_restart()
Contributor


I don't think we should defer here.
I would rather have a dedicated callback or move the code around so that we don't need to do that.
The reason:

  1. relation-changed => request lock
  2. lock granted => we first execute the deferred event, which re-requests the lock, then does the restart
  3. We still haven't run the reconciliation, and we'll need to wait until later to run it.

Comment on lines +1275 to +1277
def is_waiting_for_rolling_restart(self) -> bool:
"""Returns whether Mongos has pending rolling operations."""
return self.rollingops_manager.state.status == RollingOpsStatus.WAITING
Contributor


Maybe this could be exposed by rolling ops as an interface? It could make sense, something like
is_waiting_for_lock(callback_id)?

if self.charm.unit.is_leader():
    self.sync_cluster_passwords(operator_password, backup_password)

self.update_member_auth(keyfile, tls_ca, external_tls_ca)
Contributor


We're messing with the workflow here.
Workflow is:

  1. Update keyfile on filesystem.
  2. Restart.
  3. Wait for readiness.
  4. Set auth-updated so that the config-server can add the shard to the cluster (BECAUSE we have restarted and now have the same keyfile as the config-server).
  5. Config server adds us and sets a flag to indicate that shard has been added to cluster.
  6. Shard restarts PBM

With the new workflow:

  1. Update keyfile on filesystem
  2. Ask for restart
  3. Wait for readiness (we're waiting for the restart, so of course we're raising).
  4. On lock granted, we run the deferred event, which defers again. We restart. We don't send the auth-updated flag, so the config server is still waiting. We wait until the next event for relation-changed to re-run (it's been deferred). This means we need 3 iterations of the relation-changed event (including the 2 deferrals) before we can signal the config server to add us.
  5. Config server adds us.
  6. We restart PBM.

So there is way more waiting, while the async lock should make this kind of scenario easier.

We need a dedicated callback for this one IMHO.

Comment on lines 166 to -167
self.delete_certificates_from_workload(internal)
self.dependent.restart_charm_services(force=True)
Contributor


Certificate deletion should happen inside the lock.
Otherwise, if mongod restarts unexpectedly in the window between certificate deletion and restart, it will fail, which is a disruption of service.
This probably needs a dedicated callback, or we should improve restart_charm_services to do the cert setting / deletion in a systematic way (if there is nothing in the secret, remove the file if it exists; otherwise write the file with the data from the secret).
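The systematic set/delete idea could be sketched as a small reconcile helper (illustrative names only; this is not the charm's actual code):

```python
from pathlib import Path


def sync_certificate_file(cert_path, secret_value):
    """Sketch: reconcile the certificate file from the secret inside the
    restart callback, so deletion/creation and restart happen under the
    same rolling-ops lock."""
    cert_path = Path(cert_path)
    if secret_value is None:
        # Nothing in the secret: remove the file if it exists.
        if cert_path.exists():
            cert_path.unlink()
            return "removed"
        return "absent"
    # Otherwise write (or overwrite) the file with the secret's data.
    cert_path.write_text(secret_value)
    return "written"
```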
