patch: test rollingops #299

Draft
patriciareinoso wants to merge 28 commits into 8/edge from DPE-9684-rolling-ops

Conversation

Contributor

@patriciareinoso patriciareinoso commented Apr 17, 2026

🏷️ Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Tooling and CI
  • Dependencies upgrade or change
  • Chores / refactoring

📝 Description

Rolling Ops Integration for MongoDB

This PR introduces rolling-ops based asynchronous restarts for mongod and mongos components.

The main change is:

Before: components restarted immediately within hooks
Now: restarts are queued via rolling ops (async lock) and executed in a rolling fashion.

This ensures safer, ordered restarts across units.
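The queued-restart flow can be sketched as a tiny, self-contained toy model (plain Python, no ops/Juju; `RollingRestartQueue` and its methods are illustrative names, not the rolling-ops library's actual API):

```python
from collections import deque
from enum import Enum, auto


class Status(Enum):
    IDLE = auto()
    WAITING = auto()  # restart requested, lock not yet granted


class RollingRestartQueue:
    """Toy model of the rolling-ops pattern: hooks only *request* a
    restart; the lock is granted to one unit at a time, so restarts
    roll through the cluster in order instead of all at once."""

    def __init__(self):
        self.pending = deque()
        self.status = {}

    def request_restart(self, unit):
        # Former behavior: restart immediately inside the hook.
        # New behavior: only enqueue; the restart runs later, under the lock.
        if self.status.get(unit) != Status.WAITING:
            self.pending.append(unit)
            self.status[unit] = Status.WAITING

    def grant_next(self, restart_callback):
        # Grant the lock to the first waiting unit and run its restart.
        if not self.pending:
            return None
        unit = self.pending.popleft()
        restart_callback(unit)
        self.status[unit] = Status.IDLE
        return unit


restarted = []
q = RollingRestartQueue()
for u in ("mongodb/0", "mongodb/1", "mongodb/2"):
    q.request_restart(u)
while q.grant_next(restarted.append):
    pass
# Units restart one at a time, in request order.
```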

This PR adds the relation to etcd, but full etcd integration is not implemented yet.

LDAP

Former behavior:

  • Immediate restart
  • LDAP statuses recomputed after restart

New behavior:

  • Trigger async restart via rolling ops
  • LDAP statuses are set immediately and remain unchanged
  • Restart applies to mongod component only
  • Restart-related status is applied to MongoD, not LDAP

Restart code sections

  1. restart_when_ready

    • Triggered by `restart-if-ready` if leader, or `relation-changed` if follower
  2. clean_ldap_credentials_and_uri and remove_ldap_certificates:

    • Triggered on LDAP relation-broken or unavailable.

TLS

Former behavior

  • Immediate restart
  • SHARD Manager / Mongos statuses recomputed after restart

New behavior

  • Trigger async restart
  • Recompute SHARD Manager / Mongos statuses immediately
  • Restart applies to MongoD / Mongos components
  • SHARD Manager statuses remain unchanged

Restart code sections

  1. enable_certificates_for_unit triggered on certificate available

  2. disable_certificates_for_unit : triggered on tls relation-broken

SHARD MANAGER

Former behavior
On DB created:

  • If keyfile changed and cluster auth uses keyfile -> immediate restart
  • If the PBM CA certificate does not exist but it exists in the trust store -> immediate restart
  • If mongod is not ready -> defer

New behavior
On DB created:

  • If keyfile changed and cluster auth uses keyfile -> async restart
  • If the PBM CA certificate does not exist but it exists in the trust store -> async restart
  • If a restart is pending or mongod is not ready -> defer
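The decision logic above can be written as a pure function for clarity (illustrative only; the argument names are not the charm's actual parameters):

```python
def on_database_created(keyfile_changed, auth_is_keyfile,
                        pbm_ca_missing_but_in_trust_store,
                        restart_pending, mongod_ready):
    """Sketch of the new shard-manager DB-created decisions."""
    actions = []
    if keyfile_changed and auth_is_keyfile:
        actions.append("request_async_restart")
    if pbm_ca_missing_but_in_trust_store:
        actions.append("request_async_restart")
    if restart_pending or not mongod_ready:
        actions.append("defer")
    return actions
```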

Restart code sections

  1. update_member_auth
  2. update_pbm_certificate_in_trust_store

CLUSTER (mongos)

Former behavior:
For mongos, on relation-changed:

  • If the config changed, keyfile changed or mongos is not running -> immediate restart
  • If mongos is not running -> defer

New behavior:
For mongos, on relation-changed:

  • If the config changed, keyfile changed or mongos is not running -> async restart
  • If a restart is pending -> defer
  • If mongos is not running -> defer
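The order of checks for mongos could be sketched like this (purely illustrative; `handle_mongos_relation_changed` and the state keys are hypothetical names, not the charm's API):

```python
def handle_mongos_relation_changed(state):
    """Sketch of the new mongos relation-changed decision order."""
    if (state.get("config_changed") or state.get("keyfile_changed")
            or not state.get("mongos_running")):
        # Async restart via rolling ops: only mark it as requested here.
        state["restart_pending"] = True
    if state.get("restart_pending"):
        return "defer"  # wait for the rolling-ops lock to be granted
    if not state.get("mongos_running"):
        return "defer"
    return "continue"
```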

MongoDB

Former behavior:
On config-changed:

  • Immediate restart (it is actually restarted if the IPs changed)
  • Continue configuration

New behavior:
On config-changed:

  • Async restart (it is actually restarted if the IPs changed)
  • Continue configuration

Operator:

Former behavior:

  • Trigger immediate restart on shard relation broken
  • Trigger immediate restart on s3/gcs relation broken

New behavior:
The same behavior is kept, but using async restarts.

Restart code sections

  1. remove_ca_cert_from_trust_store

🧪 Manual testing steps

1. juju deploy mongodb as a replica set 
2. Enable peer TLS
3. Enable client TLS
4. Disable client TLS
5. Disable peer TLS
6. Scale to 3 units
7. Enable client TLS
8. Enable peer TLS 
9. Scale to 5 units

🔬 Automated testing steps

✅ Checklist

  • My code follows the code style of this project.
  • I have added or updated any relevant documentation.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@patriciareinoso patriciareinoso changed the title from "feat: test rollingops" to "patch: test rollingops" Apr 21, 2026
Comment thread single_kernel_mongo/managers/mongodb_operator.py Outdated

@override
-def restart_charm_services(self, force: bool = False):
+def restart_charm_services(self, force: bool = False) -> OperationResult:
Contributor Author


As discussed offline.

This callback should add guards about LDAP and vault state because restarting if we do not have the appropriate state can break the charm.

Contributor

@Gu1nness Gu1nness left a comment


We need to be careful in the complex relations (sharding + cluster) as they are heavily stateful and the current implementation adds unnecessary delay.
We should take advantage of the asynchronous locking to do something better where we don't have to exchange so much data because we can retry the critical path.
Eg: config-server <-> shard

Config server sends keyfile and request lock
On lock of config server: tries to add shard to cluster, if it fails (auth not updated yet) it asks for retry

Shard:
Upon receiving keyfile, asks for two locks (restart with keyfile, restart PBM).
On lock for restart with keyfile: restart with new config
On lock for restart PBM: check if shard has been added to cluster. If yes, start/restart PBM.

That way we rely on the asynchronicity to ensure that we eventually end up in the correct state, and we can remove the `auth-updated` flag and the `shard-added-to-cluster` check.
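The retry-on-lock idea sketched above could look like this (hypothetical helper; `try_add_shard` stands in for whatever the config-server runs under the lock):

```python
def add_shard_with_retry(try_add_shard, max_attempts=5):
    """Sketch of the reviewer's suggestion: on each lock grant, the
    config-server simply retries adding the shard. If auth hasn't
    propagated yet, it asks for a retry instead of exchanging flags."""
    for attempt in range(1, max_attempts + 1):
        if try_add_shard():
            return attempt  # shard added on this attempt
        # Auth not updated yet: re-request the lock, try again on next grant.
    return None  # still not added after max_attempts
```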

Comment thread single_kernel_mongo/events/cluster.py Outdated
msg = "Waiting for mongos to be restarted"
defer_event_with_info_log(logger, event, str(type(event)), str(msg))
return
self.manager._reconcile_after_mongos_restart()
Contributor


I don't think we should defer here.
I would rather have a dedicated callback or move the code around so that we don't need to do that.
The reason:

  1. relation-changed => request lock
  2. lock granted => we first execute the deferred event, which re-requests the lock, then does the restart
  3. We still haven't run the reconciliation, and we'll need to wait until later to run it.

Comment on lines +1275 to +1277
def is_waiting_for_rolling_restart(self) -> bool:
"""Returns whether Mongos has pending rolling operations."""
return self.rollingops_manager.state.status == RollingOpsStatus.WAITING
Contributor


Maybe this could be exposed by rolling ops as an interface? It could make sense, something like
is_waiting_for_lock(callback_id)?

if self.charm.unit.is_leader():
    self.sync_cluster_passwords(operator_password, backup_password)

self.update_member_auth(keyfile, tls_ca, external_tls_ca)
Contributor


We're messing with the workflow here.
Workflow is:

  1. Update keyfile on filesystem.
  2. Restart.
  3. Wait for readiness.
  4. Set auth-updated so that the config-server can add the shard to the cluster (BECAUSE we have restarted and now have the same keyfile as the config-server).
  5. Config server adds us and sets a flag to indicate that shard has been added to cluster.
  6. Shard restarts PBM

With the new workflow:

  1. Update keyfile on filesystem
  2. Ask for restart
  3. Wait for readiness (we're waiting for the restart, so of course we're raising).
  4. On lock granted, we run the deferred event, which defers again. We restart. We don't send the auth-updated flag, so the config server is still waiting. We wait until the next event for relation-changed to re-run (it's been deferred). This means we need 3 iterations of the relation-changed event (including the 2 deferrals) before we can signal the config server to add us.
  5. Config server adds us.
  6. We restart PBM.

So there is way more waiting, while the async lock should make this kind of scenario easier.

We need a dedicated callback for this one IMHO.

Comment on lines 166 to -167
self.delete_certificates_from_workload(internal)
self.dependent.restart_charm_services(force=True)
Contributor


Certificate deletion should happen inside the lock.
Otherwise, if mongod restarts unexpectedly in the window between certificate deletion and restart, it will fail, which is a disruption of service.
This probably needs a dedicated callback, or we should improve restart_charm_services to do the cert setting / deletion in a systematic way (if there is nothing in the secret, remove the file if it exists; otherwise write the file with the data from the secret).
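The systematic set/delete idea could be sketched as a small reconcile helper (illustrative names only; this is not the charm's actual code):

```python
from pathlib import Path


def sync_certificate_file(cert_path, secret_value):
    """Sketch: reconcile the certificate file from the secret inside the
    restart callback, so deletion/creation and restart happen under the
    same rolling-ops lock."""
    cert_path = Path(cert_path)
    if secret_value is None:
        # Nothing in the secret: remove the file if it exists.
        if cert_path.exists():
            cert_path.unlink()
            return "removed"
        return "absent"
    # Otherwise write (or overwrite) the file with the secret's data.
    cert_path.write_text(secret_value)
    return "written"
```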
