Skip to content

Conversation

MaciejKaras
Copy link
Collaborator

@MaciejKaras MaciejKaras commented Oct 6, 2025

Summary

During investigation of https://jira.mongodb.org/browse/CLOUDP-317886 we have found that the operator panics, when the cluster is removed from clusterSpecList. Initially we thought about allowing to do that, but later on I have discussed the issue with @lsierant and the better approach was chosen:

  • when the user wants to remove cluster that already has 0 members, we allow this
  • when the user wants to remove cluster that has more than 0 members, we block this, because it can lead to unexpected issues. This is described in the code comment:

// We cannot allow removing cluster specification if the cluster is not scaled down to zero.
// For example: we have 3 members in a cluster, and we try to remove the entire cluster spec. The operator is scaling members down one by one.
// We could remove one member successfully, but recreate other members with default configuration, rather the one that was used before.
// Removing cluster spec would remove all non-default cluster configuration i.e. priority, persistence, etc. and that can lead to unexpected issues.
if err := r.blockNonEmptyClusterSpecItemRemoval(); err != nil {
return r.updateStatus(ctx, sc, workflow.Failed(err), log)
}

...and in the changelog:

MultiClusterSharded: Block removing non-zero member cluster from MongoDB resource. This prevents from scaling down member cluster without current configuration available, which can lead to unexpected issues. Previously operator was crashing in that scenario, after the fix it will mark reconciliation as Failed with appropriate message. Example unsafe scenario that is now blocked:

  • User has 2 member clusters: main is used for application traffic, read-analytics is used for read-only analytics
  • main cluster has 7 voting members
  • read-analytics cluster has 3 non-voting members
  • User decides to remove read-analytics cluster, by removing the clusterSpecItem completely
  • Operator scales down members from read-analytics cluster one by one
  • Because the configuration does not have voting options specified anymore and by default priority is set to 1, the operator will remove one member, but the other two members will be reconfigured as voting members
  • replicaset contains now 9 voting members, which is not supported by MongoDB

Proof of Work

Passing CI + new unit tests for blockNonEmptyClusterSpecItemRemoval function.

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added changelog file?

Copy link

github-actions bot commented Oct 6, 2025

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.5.0 Release Notes

New Features

  • Improve automation agent certificate rotation: the agent now restarts automatically when its certificate is renewed, ensuring smooth operation without manual intervention and allowing seamless certificate updates without requiring manual Pod restarts.

Bug Fixes

  • MongoDBMultiCluster: fix resource stuck in Pending state if any clusterSpecList item has 0 members. After the fix, a value of 0 members is handled correctly, similarly to how it's done in the MongoDB resource.
  • MultiClusterSharded: Blocked removing non-zero member cluster from MongoDB resource. This prevents from scaling down member cluster without current configuration available, which could lead to unexpected issues.

@MaciejKaras MaciejKaras changed the title CLOUDP-317886 - fix: removing cluster from MC Sharded deployment causes a panic CLOUDP-317886 - block removing cluster from MC Sharded deployment Oct 6, 2025
require.NoError(t, err)
addKubernetesTlsResources(ctx, client, rs)
kubeClient, omConnectionFactory := mock.NewDefaultFakeClient(rs)
reconciler := newReplicaSetReconciler(ctx, kubeClient, nil, "", "", false, false, omConnectionFactory.GetConnectionFunc)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was previously using shardedClusterReconciler instead of replicaset reconciler

@MaciejKaras MaciejKaras marked this pull request as ready for review October 6, 2025 15:15
@MaciejKaras MaciejKaras requested review from vinilage and a team as code owners October 6, 2025 15:15
* User decides to remove `read-analytics` cluster, by removing the `clusterSpecItem` completely
* Operator scales down members from `read-analytics` cluster one by one
* Because the configuration does not have voting options specified anymore and by default `priority` is set to 1, the operator will remove one member, but the other two members will be reconfigured as voting members
* `replicaset` contains now 9 voting members, which is not [supported by MongoDB](https://www.mongodb.com/docs/manual/reference/limits/#mongodb-limit-Number-of-Voting-Members-of-a-Replica-Set)
Copy link
Contributor

@lsierant lsierant Oct 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice there is a code in the operator that limits the number of voting members to 7 and mark the rest as non voting. But it's still a valid case of losing configuration and resulting in a non deterministic configuration of voting members.

https://github.com/mongodb/mongodb-kubernetes/blob/master/controllers/om/deployment.go#L1206-L1218

Copy link
Contributor

@lsierant lsierant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great investigation and the fix! 👏 LGTM!

Copy link
Collaborator

@vinilage vinilage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

…from.md

Co-authored-by: Vivek Singh <vsingh.ggits.2010@gmail.com>
@MaciejKaras MaciejKaras enabled auto-merge (squash) October 8, 2025 14:32
@MaciejKaras MaciejKaras merged commit a90d540 into master Oct 8, 2025
8 of 9 checks passed
@MaciejKaras MaciejKaras deleted the maciejk/CLOUDP-317886 branch October 8, 2025 14:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants