@nammn nammn commented Sep 2, 2025

Summary

This pull request introduces process ID persistence for MongoDB sharded cluster deployments, ensuring that process IDs for replica sets are correctly maintained across reconciliation cycles and during migration scenarios (such as project changes). This is achieved by storing process IDs in the deployment state and updating the controller logic and tests accordingly.

Process ID Persistence for Sharded Clusters:

  • Added a new ProcessIds field to the ShardedClusterDeploymentState struct to store process IDs for each replica set, enabling persistence across reconciliation cycles and project migrations.
  • Updated the buildReplicaSetFromProcesses function and its call sites to retrieve and use persisted process IDs from the deployment state when process IDs are missing (e.g., during migration); see the sketch after this list.
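
The following Go sketch illustrates the idea. The names ProcessIds, ShardedClusterDeploymentState, and buildReplicaSetFromProcesses come from this PR; the package name, map layout, and function signature are illustrative assumptions, not the actual implementation.

```go
package controllers

// ShardedClusterDeploymentState persists reconciliation state between cycles.
// The map layout (replica set name -> process name -> member id) is an
// assumption made for this sketch.
type ShardedClusterDeploymentState struct {
	// ... existing fields ...
	ProcessIds map[string]map[string]int `json:"processIds,omitempty"`
}

// buildReplicaSetFromProcesses (simplified): when Ops Manager has no member
// ids for the processes (e.g. right after moving the resource to a new
// project), fall back to the ids persisted in the deployment state so the
// replica set members keep their original _id values.
func buildReplicaSetFromProcesses(rsName string, processNames []string, currentIds map[string]int, state ShardedClusterDeploymentState) map[string]int {
	memberIds := map[string]int{}
	persisted := state.ProcessIds[rsName]
	for i, name := range processNames {
		if id, ok := currentIds[name]; ok {
			memberIds[name] = id // id already present in Ops Manager
		} else if id, ok := persisted[name]; ok {
			memberIds[name] = id // reuse the persisted id (migration case)
		} else {
			memberIds[name] = i // hypothetical default for brand-new processes
		}
	}
	return memberIds
}
```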

Controller Logic Enhancements:

  • Added logic in the reconciliation flow to save the final process IDs to the deployment state after each reconciliation, logging a warning if saving fails rather than failing the reconciliation (sketched below).
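
A hedged sketch of that save step, assuming a zap-style logger and a small state-store abstraction; the operator's real controller code and state-store API likely differ. It reuses the hypothetical ShardedClusterDeploymentState from the sketch above.

```go
package controllers

import (
	"context"

	"go.uber.org/zap"
)

// stateWriter abstracts the deployment state store; this interface exists only
// for the sketch and is not the operator's real API.
type stateWriter interface {
	WriteState(ctx context.Context, state *ShardedClusterDeploymentState) error
}

// persistProcessIds writes the final process ids into the deployment state at
// the end of a reconciliation. Saving is best-effort: a failure is logged as a
// warning but does not fail the reconciliation.
func persistProcessIds(ctx context.Context, store stateWriter, state *ShardedClusterDeploymentState, finalIds map[string]map[string]int, log *zap.SugaredLogger) {
	state.ProcessIds = finalIds
	if err := store.WriteState(ctx, state); err != nil {
		log.Warnf("failed to persist process ids to deployment state: %v", err)
	}
}
```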

Testing Improvements:

  • Added comprehensive unit tests in mongodbshardedcluster_controller_test.go to verify process ID persistence, retrieval, edge cases, and JSON serialization/deserialization for state store compatibility. These tests also cover the integration between the updated buildReplicaSetFromProcesses and the new process ID persistence logic; a round-trip sketch follows below.
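
A minimal round-trip test in the spirit of those additions; the exact helpers and assertions in mongodbshardedcluster_controller_test.go are assumptions, and testify is used here only for illustration.

```go
package controllers

import (
	"encoding/json"
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// TestProcessIdsJSONRoundTrip checks that ProcessIds survive the JSON
// serialization/deserialization used by the state store unchanged.
func TestProcessIdsJSONRoundTrip(t *testing.T) {
	original := ShardedClusterDeploymentState{
		ProcessIds: map[string]map[string]int{
			// non-sequential member ids on purpose, mirroring the migration scenario
			"my-cluster-0": {"my-cluster-0-0": 0, "my-cluster-0-1": 2},
		},
	}

	raw, err := json.Marshal(original)
	require.NoError(t, err)

	var restored ShardedClusterDeploymentState
	require.NoError(t, json.Unmarshal(raw, &restored))
	assert.Equal(t, original.ProcessIds, restored.ProcessIds)
}
```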

End-to-End (E2E) Test Enhancements:

  • Introduced new E2E tests in multi_cluster_sharded_scaling.py to verify that process IDs are preserved during project migration scenarios, including creating non-sequential member IDs and asserting that replica set member IDs remain unchanged after migration.

These changes collectively improve the robustness of sharded cluster management, particularly in scenarios involving cluster migration or changes to Ops Manager configuration.

Proof of Work

  • green ci
  • passing new test: Link

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added changelog file?


github-actions bot commented Sep 2, 2025

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.3.0 Release Notes

New Features

Multi-Architecture Support

We've added comprehensive multi-architecture support for the Kubernetes operator. This enhancement enables deployment on IBM Power (ppc64le) and IBM Z (s390x) architectures alongside the
existing x86_64 support. Core images (operator, agent, init containers, database, readiness probe) now support multiple architectures. IBM and ARM architectures are not supported for Ops Manager and the init-ops-manager image.

  • MongoDB Agent images have been migrated to new container repository: quay.io/mongodb/mongodb-agent.
    • The agents in the new repository support the x86-64, ARM64, s390x, and ppc64le architectures. More details are available in the public docs.
    • Operators running MCK >= 1.3.0 with the static architecture cannot use the agent images from the old container repository quay.io/mongodb/mongodb-agent-ubi.
  • quay.io/mongodb/mongodb-agent-ubi should no longer be used; it remains available only for backwards compatibility.

Bug Fixes

  • This change fixes the complex and difficult-to-maintain architecture for stateful set containers, which relied on an "agent matrix" mapping operator versions to agent versions and led to a very large number of images.
  • We solve this by shifting to a 3-container setup. The new design eliminates the need for the operator-version/agent-version matrix by adding one additional container that contains all required binaries. This architecture mirrors what we already do with the mongodb-database container.
  • Fixed an issue where the readiness probe reported the node as ready even when its authentication mechanism was not in sync with the other nodes, potentially causing premature restarts.
  • Fixed an issue where the MongoDB Agents did not adhere to the NO_PROXY environment variable configured on the operator.
  • Fixed an issue where moving a MongoDB sharded cluster resource to a new project (or a new OM instance) would leave the deployment in a failed state.

Other Changes

  • Optional permissions for PersistentVolumeClaim moved to a separate role. When managing the operator with Helm, permissions for PersistentVolumeClaim resources can be disabled by setting the operator.enablePVCResize value to false (true by default). Previously, when enabled, these permissions were part of the primary operator role; with this change they are defined in a separate role.
  • The subresourceEnabled Helm value was removed. This setting used to be true by default and made it possible to exclude subresource permissions from the operator role by specifying false as the value. We are removing this configuration option, so the operator roles always include subresource permissions. The setting was introduced as a temporary workaround for this OpenShift issue; the issue has since been resolved and the setting is no longer needed.
  • We have deliberately not published the container images for Ops Manager versions 7.0.16, 8.0.8, 8.0.9 and 8.0.10 due to a bug in Ops Manager which prevents MCK customers from upgrading their Ops Manager deployments to those versions.

@nammn nammn marked this pull request as ready for review September 3, 2025 08:24
@nammn nammn requested a review from a team as a code owner September 3, 2025 08:24
@nammn nammn marked this pull request as draft September 3, 2025 09:26

nammn commented Sep 3, 2025

this is blocked on: https://jira.mongodb.org/browse/CLOUDP-328217
