Skip to content

Conversation

Julien-Ben
Copy link
Collaborator

@Julien-Ben Julien-Ben commented Oct 2, 2025

Summary

This test has been broken for a while. Because there were two functions with the name test_tls_is_disabled_and_scaled_up, only one of them was running everytime.

I think this scenario has low chance of existing in production, hence why we never had a ticket related to this bug.

On top of that it performed the update in two separate step, whereas to test this behaviour it should be done in one update.

I uncovered it as part of the larger refactoring of the controller. But to keep the PR scope reasonable, I extracted the related changes as they are self contained
The bug may exist in multi-cluster as well. This bug was first discovered in 2021: https://jira.mongodb.org/browse/CLOUDP-80768

The blocking mechanism will be implemented in a better way after the multi-cluster first refactor, since we will keep track of a global reconciler state, notably holding the target number of replicas for this reconciliation.

Proof of Work

In the commit (1d8847e) which just fixes the e2e test, it fails: evg task Showing that the reconciler had to be fixed as well.

The PR is correct if CI is green again.

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added changelog file?

Copy link

github-actions bot commented Oct 2, 2025

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.5.0 Release Notes

New Features

  • Improve automation agent certificate rotation: the agent now restarts automatically when its certificate is renewed, ensuring smooth operation without manual intervention and allowing seamless certificate updates without requiring manual Pod restarts.

Bug Fixes

  • MongoDBMultiCluster: fix resource stuck in Pending state if any clusterSpecList item has 0 members. After the fix, a value of 0 members is handled correctly, similarly to how it's done in the MongoDB resource.

@Julien-Ben Julien-Ben added the skip-changelog Use this label in Pull Request to not require new changelog entry file label Oct 2, 2025
@pytest.mark.e2e_disable_tls_scale_up
def test_tls_is_disabled_and_scaled_up(replica_set: MongoDB):
replica_set.load()
replica_set["spec"]["members"] = 5
Copy link
Collaborator Author

@Julien-Ben Julien-Ben Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue in the test is that it was doing the update in two steps (scale, and then disable tls), while the whole point is to change them at the same time. (on top of having a duplicate function name)

// Check if TLS is being disabled. If so, we need to lock replicas at the current member count
// to prevent scaling during the TLS disable operation. This decision is made once here and
// applied to both the StatefulSet and OM automation config.
tlsWillBeDisabled, err := checkIfTLSWillBeDisabled(conn, rs, log)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we just block this with validation and say that it's not possible to change both member count and disabling TLS at the same time?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, after discussing it in DM, let's do that instead of adding complexity to the reconcile loop. This is not a common use case anyway.

@Julien-Ben Julien-Ben mentioned this pull request Oct 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
skip-changelog Use this label in Pull Request to not require new changelog entry file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants