@nammn nammn commented Sep 2, 2025

Summary

This pull request introduces process ID persistence for MongoDB sharded cluster deployments, ensuring that process IDs for replica sets are correctly maintained across reconciliation cycles and during migration scenarios (such as project changes). This is achieved by storing process IDs in the deployment state and updating the controller logic and tests accordingly.

Process ID Persistence for Sharded Clusters:

  • Added a new ProcessIds field to the ShardedClusterDeploymentState struct to store process IDs for each replica set, enabling persistence across reconciliation cycles and project migrations.
  • Updated the buildReplicaSetFromProcesses function and its call sites to retrieve and use persisted process IDs from the deployment state when process IDs are missing (e.g., during migration); see the sketch after this list.
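
The following Go sketch illustrates the idea. The names ProcessIds, ShardedClusterDeploymentState, and buildReplicaSetFromProcesses come from this PR; the package name, map layout, and function signature are illustrative assumptions, not the actual implementation.

```go
package controllers

// ShardedClusterDeploymentState persists reconciliation state between cycles.
// The map layout (replica set name -> process name -> member id) is an
// assumption made for this sketch.
type ShardedClusterDeploymentState struct {
	// ... existing fields ...
	ProcessIds map[string]map[string]int `json:"processIds,omitempty"`
}

// buildReplicaSetFromProcesses (simplified): when Ops Manager has no member
// ids for the processes (e.g. right after moving the resource to a new
// project), fall back to the ids persisted in the deployment state so the
// replica set members keep their original _id values.
func buildReplicaSetFromProcesses(rsName string, processNames []string, currentIds map[string]int, state ShardedClusterDeploymentState) map[string]int {
	memberIds := map[string]int{}
	persisted := state.ProcessIds[rsName]
	for i, name := range processNames {
		if id, ok := currentIds[name]; ok {
			memberIds[name] = id // id already present in Ops Manager
		} else if id, ok := persisted[name]; ok {
			memberIds[name] = id // reuse the persisted id (migration case)
		} else {
			memberIds[name] = i // hypothetical default for brand-new processes
		}
	}
	return memberIds
}
```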

Controller Logic Enhancements:

  • Added logic in the reconciliation flow to save the final process IDs to the deployment state after each reconciliation, logging a warning if saving fails rather than failing the reconciliation (sketched below).
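
A hedged sketch of that save step, assuming a zap-style logger and a small state-store abstraction; the operator's real controller code and state-store API likely differ. It reuses the hypothetical ShardedClusterDeploymentState from the sketch above.

```go
package controllers

import (
	"context"

	"go.uber.org/zap"
)

// stateWriter abstracts the deployment state store; this interface exists only
// for the sketch and is not the operator's real API.
type stateWriter interface {
	WriteState(ctx context.Context, state *ShardedClusterDeploymentState) error
}

// persistProcessIds writes the final process ids into the deployment state at
// the end of a reconciliation. Saving is best-effort: a failure is logged as a
// warning but does not fail the reconciliation.
func persistProcessIds(ctx context.Context, store stateWriter, state *ShardedClusterDeploymentState, finalIds map[string]map[string]int, log *zap.SugaredLogger) {
	state.ProcessIds = finalIds
	if err := store.WriteState(ctx, state); err != nil {
		log.Warnf("failed to persist process ids to deployment state: %v", err)
	}
}
```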

Testing Improvements:

  • Added comprehensive unit tests in mongodbshardedcluster_controller_test.go to verify process ID persistence, retrieval, edge cases, and JSON serialization/deserialization for state store compatibility. These tests also cover the integration between the updated buildReplicaSetFromProcesses and the new process ID persistence logic; a round-trip sketch follows below.
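
A minimal round-trip test in the spirit of those additions; the exact helpers and assertions in mongodbshardedcluster_controller_test.go are assumptions, and testify is used here only for illustration.

```go
package controllers

import (
	"encoding/json"
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// TestProcessIdsJSONRoundTrip checks that ProcessIds survive the JSON
// serialization/deserialization used by the state store unchanged.
func TestProcessIdsJSONRoundTrip(t *testing.T) {
	original := ShardedClusterDeploymentState{
		ProcessIds: map[string]map[string]int{
			// non-sequential member ids on purpose, mirroring the migration scenario
			"my-cluster-0": {"my-cluster-0-0": 0, "my-cluster-0-1": 2},
		},
	}

	raw, err := json.Marshal(original)
	require.NoError(t, err)

	var restored ShardedClusterDeploymentState
	require.NoError(t, json.Unmarshal(raw, &restored))
	assert.Equal(t, original.ProcessIds, restored.ProcessIds)
}
```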

End-to-End (E2E) Test Enhancements:

  • Introduced new E2E tests in multi_cluster_sharded_scaling.py to verify that process IDs are preserved during project migration scenarios, including creating non-sequential member IDs and asserting that replica set member IDs remain unchanged after migration.

These changes collectively improve the robustness of sharded cluster management, particularly in scenarios involving cluster migration or changes to Ops Manager configuration.

Proof of Work

  • green ci
  • passing new test: Link

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added changelog file?


github-actions bot commented Sep 2, 2025

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.3.0 Release Notes

New Features

Multi-Architecture Support

We've added comprehensive multi-architecture support for the Kubernetes operator. This enhancement enables deployment on IBM Power (ppc64le) and IBM Z (s390x) architectures alongside the
existing x86_64 support. Core images (operator, agent, init containers, database, readiness probe) now support multiple architectures. IBM and ARM architectures are not supported for Ops Manager and the init-ops-manager image.

  • MongoDB Agent images have been migrated to new container repository: quay.io/mongodb/mongodb-agent.
    • The agents in the new repository support the x86-64, ARM64, s390x, and ppc64le architectures. More details are available in the public docs.
    • Operators running MCK >= 1.3.0 with the static architecture cannot use the agent images from the old container repository quay.io/mongodb/mongodb-agent-ubi.
  • quay.io/mongodb/mongodb-agent-ubi should no longer be used; it remains available only for backwards compatibility.

Bug Fixes

  • This change fixes the complex and difficult-to-maintain architecture for stateful set containers, which relied on an "agent matrix" mapping operator versions to agent versions and led to a very large number of images.
  • We solve this by shifting to a 3-container setup. The new design eliminates the need for the operator-version/agent-version matrix by adding one additional container that contains all required binaries. This architecture mirrors what we already do with the mongodb-database container.
  • Fixed an issue where the readiness probe reported the node as ready even when its authentication mechanism was not in sync with the other nodes, potentially causing premature restarts.
  • Fixed an issue where the MongoDB Agents did not adhere to the NO_PROXY environment variable configured on the operator.
  • Fixed an issue where moving a MongoDB sharded cluster resource to a new project (or a new OM instance) would leave the deployment in a failed state.

Other Changes

  • Optional permissions for PersistentVolumeClaim moved to a separate role. When managing the operator with Helm, permissions for PersistentVolumeClaim resources can be disabled by setting the operator.enablePVCResize value to false (true by default). Previously, when enabled, these permissions were part of the primary operator role; with this change they are defined in a separate role.
  • The subresourceEnabled Helm value was removed. This setting used to be true by default and made it possible to exclude subresource permissions from the operator role by specifying false as the value. We are removing this configuration option, so the operator roles always include subresource permissions. The setting was introduced as a temporary workaround for this OpenShift issue; the issue has since been resolved and the setting is no longer needed.
  • We have deliberately not published the container images for Ops Manager versions 7.0.16, 8.0.8, 8.0.9 and 8.0.10 due to a bug in Ops Manager which prevents MCK customers from upgrading their Ops Manager deployments to those versions.

@nammn nammn marked this pull request as ready for review September 3, 2025 08:24
@nammn nammn requested a review from a team as a code owner September 3, 2025 08:24
@nammn nammn marked this pull request as draft September 3, 2025 09:26

nammn commented Sep 3, 2025

this is blocked on: https://jira.mongodb.org/browse/CLOUDP-328217
