Update cluster runtime upgrade with expand health check feature #3885

deepukraju · 2025-09-23T10:36:10Z

Details of the expand health check feature are added to the Learn document section concepts-cluster-upgrade-runtime.

prmerger-automator · 2025-09-23T10:36:23Z

@deepukraju : Thanks for your contribution! The author(s) and reviewer(s) have been notified to review your proposed change.

learn-build-service-prod · 2025-09-23T10:38:12Z

Learn Build status updates of commit db537eb:

✅ Validation status: passed

| File | Status | Preview URL | Details |
| --- | --- | --- | --- |
| operator-nexus/concepts-cluster-upgrade-overview.md | ✅Succeeded | | |

For more details, please refer to the build report.

santhosh-kumar-cm · 2025-09-23T11:06:38Z

operator-nexus/concepts-cluster-upgrade-overview.md

+
+## Nexus tenant workload health check during cluster runtime upgrade
+
+During a runtime upgrade, the inventory readiness check is triggered to conduct workload health checks. The inventory readiness check feature is appliable for only rack by rack upgrade strategy. The platform feature "UpgradeInventoryChecks" controls the platform runtime upgrade outcome when the health check fails. When the feature is enabled, the upgrade pauses if there is an inventory readiness check failure after the compute rack upgrade. The upgrade can be continued using CCUVA. When the feature is disabled the inventory readiness failures are logged and upgrade continues to next stage.  By default the feature is disabled. 


Should we generalize the CCUVA term? Maybe say, "The upgrade can be continued when the customer executes the upgrade API."

Do not use the term CUVA, as customers do not know it.
State "runtime upgrade" or similar.

Updated the text to remove the CUVA reference and use "runtime upgrade" instead.

santhosh-kumar-cm · 2025-09-23T11:09:13Z

operator-nexus/concepts-cluster-upgrade-overview.md

+3. **Comparison Process** - Comparison of current workloads with snapshot taken during start of upgrade. Report comparison status.
+4. **Health Check Handling** - On success proceed to next upgrade stage. For failure, based on inventory readiness check feature is enable or disable its handled as below.
+
+| Upgrade Stage            | UpgradeInventoryChecks Enable       | UpgradeInventoryChecks Disable |


Can we make the stage column heading state clearly that this is the failure case? Something along the lines of "Failure at Upgrade Stage". I am looking for wording that tells the reader what happens if there is a failure in each stage.

The statement above the table already says that the table covers health check failures.
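For readers following the thread, the comparison step quoted above (step 3) can be pictured with a minimal sketch. This is not the platform's implementation: the function names, the dictionary shape, and the rule that only a resource that was healthy in the snapshot and is unhealthy or missing afterwards counts as a failure are all assumptions made for illustration.

```python
from typing import Dict, List


def take_snapshot(health_by_resource: Dict[str, bool]) -> Dict[str, bool]:
    """Record the health of each tenant resource at the start of the upgrade.

    Hypothetical helper: keys are resource names, values are True when healthy.
    """
    return dict(health_by_resource)


def compare_with_snapshot(
    snapshot: Dict[str, bool], current: Dict[str, bool]
) -> List[str]:
    """Return the resources that were healthy in the snapshot but are not now.

    Assumption: a resource that was already unhealthy before the upgrade is
    not reported, and a resource missing from `current` counts as unhealthy.
    """
    regressions = []
    for name, was_healthy in snapshot.items():
        if was_healthy and not current.get(name, False):
            regressions.append(name)
    return sorted(regressions)


if __name__ == "__main__":
    before = take_snapshot({"vm-a": True, "naks-b": True, "vm-c": False})
    after = {"vm-a": True, "naks-b": False}
    print(compare_with_snapshot(before, after))
```

An empty result would correspond to the "on success proceed to next upgrade stage" branch of step 4; a non-empty result would lead into the failure handling described by the table.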

JAC0BSMITH · 2025-09-23T14:51:23Z

operator-nexus/concepts-cluster-upgrade-overview.md

+
+## Nexus tenant workload health check during cluster runtime upgrade
+
+During a runtime upgrade, the inventory readiness check is triggered to conduct workload health checks. The inventory readiness check feature is appliable for only rack by rack upgrade strategy. The platform feature "UpgradeInventoryChecks" controls the platform runtime upgrade outcome when the health check fails. When the feature is enabled, the upgrade pauses if there is an inventory readiness check failure after the compute rack upgrade. The upgrade can be continued using CCUVA. When the feature is disabled the inventory readiness failures are logged and upgrade continues to next stage.  By default the feature is disabled. 


Do not use the term CUVA, as customers do not know it.
State "runtime upgrade" or similar.

matternst7258 · 2025-09-23T15:01:46Z

operator-nexus/concepts-cluster-upgrade-overview.md

+
+## Nexus tenant workload health check during cluster runtime upgrade
+
+During a runtime upgrade, the inventory readiness check is triggered to conduct workload health checks. The inventory readiness check feature is appliable for only rack by rack upgrade strategy. The platform feature "UpgradeInventoryChecks" controls the platform runtime upgrade outcome when the health check fails. When the feature is enabled, the upgrade pauses if there is an inventory readiness check failure after the compute rack upgrade. The upgrade can be continued using CCUVA. When the feature is disabled the inventory readiness failures are logged and upgrade continues to next stage.  By default the feature is disabled. 


we aren't really doing workload health checks. We are looking at the infrastructure of the tenant resources. I want to avoid indicating we check how their workloads are performing

specifically right now we're checking for Nexus VM / Nexus AKS health. would it make sense to explicitly reference those for clarity?

maybe something like "...triggered to conduct workload infrastructure availability" or "to conduct availability of the VM and NAKS health"

I have mentioned tenant workload (Nexus Kubernetes Cluster and Virtual Machine). This clarifies that "tenant workload" refers to the Nexus Kubernetes Cluster and Virtual Machine resources covered by the health checks.

matternst7258 · 2025-09-23T15:02:55Z

operator-nexus/concepts-cluster-upgrade-overview.md

+
+## Nexus tenant workload health check during cluster runtime upgrade
+
+During a runtime upgrade, the inventory readiness check is triggered to conduct workload health checks. The inventory readiness check feature is appliable for only rack by rack upgrade strategy. The platform feature "UpgradeInventoryChecks" controls the platform runtime upgrade outcome when the health check fails. When the feature is enabled, the upgrade pauses if there is an inventory readiness check failure after the compute rack upgrade. The upgrade can be continued using CCUVA. When the feature is disabled the inventory readiness failures are logged and upgrade continues to next stage.  By default the feature is disabled. 


"When the feature is disabled the inventory readiness failures are logged and upgrade continues to next stage."

Is this information exposed to customers? How can it be viewed?

I think logging is an internal mechanism that we probably don't want to inform customer about. I think in target-state we probably want to expose the information in the same form as we would if the feature were enabled (e.g. ARM properties, but TBD), but just don't block upgrades on it, and have it be informational only.

What does the user experience look like for customers? I'm trying to understand how the customer would know the workloads aren't healthy after the runtime upgrade. I don't think we need to specify why, but we need something to indicate that the workload inventory check failed.

The behavior is different for the compute node pool than for the KCP and management-plane servers. It's good to document this explicitly.
I will remove the log section.

matternst7258 · 2025-09-23T15:04:12Z

operator-nexus/concepts-cluster-upgrade-overview.md

+
+During a runtime upgrade, the inventory readiness check is triggered to conduct workload health checks. The inventory readiness check feature is appliable for only rack by rack upgrade strategy. The platform feature "UpgradeInventoryChecks" controls the platform runtime upgrade outcome when the health check fails. When the feature is enabled, the upgrade pauses if there is an inventory readiness check failure after the compute rack upgrade. The upgrade can be continued using CCUVA. When the feature is disabled the inventory readiness failures are logged and upgrade continues to next stage.  By default the feature is disabled. 
+
+The Inventory Readiness Check feature performs workload health check after control-plane, management-plane and compute servers are upgraded during platform runtime upgrade. It operates in snapshot and comparison modes and provides a mechanism to verify workload health state after different stages of platform runtime upgrade. the feature supports Nexus Kubernetes Cluster and Virtual Machine workloads.


There aren't workloads on the control plane. We need to be clear about what is being checked on these servers.

the checks at each phase (KCP, NMP, Compute Rack 1, ....) aren't actually checking for workloads running on those node pools. they are executing global checks, because e.g. upgrading kubernetes could cause issues with workloads running on computes even though those compute machines haven't been upgraded yet. how much of this detail do we want to include in the docs?

Since we specify these are workload inventory checks, I would only specify the compute node scope. It may be beneficial to reference the management node if you are checking CSNs.

These are health checks run after different stages of the upgrade. I have mentioned that.

matternst7258 · 2025-09-23T15:05:19Z

operator-nexus/concepts-cluster-upgrade-overview.md

+
+During a runtime upgrade, the inventory readiness check is triggered to conduct workload health checks. The inventory readiness check feature is appliable for only rack by rack upgrade strategy. The platform feature "UpgradeInventoryChecks" controls the platform runtime upgrade outcome when the health check fails. When the feature is enabled, the upgrade pauses if there is an inventory readiness check failure after the compute rack upgrade. The upgrade can be continued using CCUVA. When the feature is disabled the inventory readiness failures are logged and upgrade continues to next stage.  By default the feature is disabled. 
+
+The Inventory Readiness Check feature performs workload health check after control-plane, management-plane and compute servers are upgraded during platform runtime upgrade. It operates in snapshot and comparison modes and provides a mechanism to verify workload health state after different stages of platform runtime upgrade. the feature supports Nexus Kubernetes Cluster and Virtual Machine workloads.


What actions does the customer need to perform if the checks fail?

I have provided the link for this.

matternst7258 · 2025-09-23T15:06:01Z

operator-nexus/concepts-cluster-upgrade-overview.md

+
+## Nexus tenant workload health check during cluster runtime upgrade
+
+During a runtime upgrade, the inventory readiness check is triggered to conduct workload health checks. The inventory readiness check feature is appliable for only rack by rack upgrade strategy. The platform feature "UpgradeInventoryChecks" controls the platform runtime upgrade outcome when the health check fails. When the feature is enabled, the upgrade pauses if there is an inventory readiness check failure after the compute rack upgrade. The upgrade can be continued using CCUVA. When the feature is disabled the inventory readiness failures are logged and upgrade continues to next stage.  By default the feature is disabled. 


How is the feature turned on/off?

is there separate documentation produced for AFEC feature flags that we can link to?

I would recommend specifying that the functionality is enabled by a feature flag.

I have mentioned that the functionality is enabled by a feature flag.
We still need to provide a link for this.

v-dirichards · 2025-09-23T15:17:24Z

#label:"aq-pr-triaged"
@MicrosoftDocs/public-repo-pr-review-team

seaneagan · 2025-09-23T19:22:20Z

operator-nexus/concepts-cluster-upgrade-overview.md

+
+## Nexus tenant workload health check during cluster runtime upgrade
+
+During a runtime upgrade, the inventory readiness check is triggered to conduct workload health checks. The inventory readiness check feature is appliable for only rack by rack upgrade strategy. The platform feature "UpgradeInventoryChecks" controls the platform runtime upgrade outcome when the health check fails. When the feature is enabled, the upgrade pauses if there is an inventory readiness check failure after the compute rack upgrade. The upgrade can be continued using CCUVA. When the feature is disabled the inventory readiness failures are logged and upgrade continues to next stage.  By default the feature is disabled. 


inventory readiness check is triggered to conduct workload health checks

would be good to align on one term. i'm guessing "tenant workload health checks" may resonate the best based on the rest of the nexus documentation? i think "inventory readiness check" is the name of the implementation, which the customer shouldn't need to know about.

Change to tenant workload health checks

Updated the tenant workload health check process during cluster runtime upgrades, clarifying feature flag functionality and workflow steps.

learn-build-service-prod · 2025-09-25T04:47:48Z

Learn Build status updates of commit f592b78:

✅ Validation status: passed

| File | Status | Preview URL | Details |
| --- | --- | --- | --- |
| operator-nexus/concepts-cluster-upgrade-overview.md | ✅Succeeded | | |

For more details, please refer to the build report.

Clarified behavior of health checks during runtime upgrade when feature is disabled.

learn-build-service-prod · 2025-09-25T05:02:37Z

Learn Build status updates of commit 1541c9b:

✅ Validation status: passed

| File | Status | Preview URL | Details |
| --- | --- | --- | --- |
| operator-nexus/concepts-cluster-upgrade-overview.md | ✅Succeeded | | |

For more details, please refer to the build report.

seaneagan · 2025-10-02T15:21:35Z

operator-nexus/concepts-cluster-upgrade-overview.md

+
+## Nexus tenant workload health check during cluster runtime upgrade
+
+During a runtime upgrade, tenant workload (Nexus Kubernetes Cluster and Virtual Machine) health checks are performed for only rack by rack upgrade strategy. This functionality is feature flag enabled to control the cluster runtime upgrade outcome for health check failures. When the feature is enabled, the upgrade is paused if health check fails after the compute rack upgrade. The runtime upgrade can be resumed when customer executes the upgrade API [here](./howto-cluster-runtime-upgrade-with-pauserack-strategy.md). When the feature is disabled the upgrade continues to next stage even after the health check failure. By default the feature is disabled. 


do we want to reference the AFEC flag name here? (EnableUpgradeInventoryChecks) or are we relying on other documentation elsewhere for that?

This is added in TSG

seaneagan · 2025-10-02T15:23:20Z

operator-nexus/concepts-cluster-upgrade-overview.md

+| Initial Snapshot         | Upgrade failure                                                | Upgrade continues to next stage |
+| Control Plane Upgrade    | Upgrade failure                                                | Upgrade continues to next stage |
+| Management Plane Upgrade | Upgrade failure                                                | Upgrade continues to next stage |
+| Compute server Upgrade   | Upgrade paused, resumed when customer executes the upgrade API | Upgrade continues to next stage |
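The failure-handling table quoted above can be read as a small decision function. The sketch below only restates the table rows from this diff; the enum and function names are hypothetical and do not correspond to any Operator Nexus API, and the default of `feature_enabled=False` reflects the statement in the PR text that the feature is disabled by default.

```python
from enum import Enum


class Stage(Enum):
    # Stage names copied from the table in the diff.
    INITIAL_SNAPSHOT = "Initial Snapshot"
    CONTROL_PLANE = "Control Plane Upgrade"
    MANAGEMENT_PLANE = "Management Plane Upgrade"
    COMPUTE_SERVER = "Compute server Upgrade"


class Outcome(Enum):
    # Outcome wording copied from the table cells.
    FAIL = "Upgrade failure"
    PAUSE = "Upgrade paused, resumed when customer executes the upgrade API"
    CONTINUE = "Upgrade continues to next stage"


def outcome_on_health_check_failure(
    stage: Stage, feature_enabled: bool = False
) -> Outcome:
    """Map a failed health check to the upgrade outcome listed in the table.

    Hypothetical helper for illustration only.
    """
    if not feature_enabled:
        # Feature disabled: failures never block the upgrade.
        return Outcome.CONTINUE
    if stage is Stage.COMPUTE_SERVER:
        # Feature enabled, failure after a compute rack: pause for the customer.
        return Outcome.PAUSE
    # Feature enabled, failure at any earlier stage: the upgrade fails.
    return Outcome.FAIL


if __name__ == "__main__":
    for stage in Stage:
        enabled = outcome_on_health_check_failure(stage, feature_enabled=True)
        disabled = outcome_on_health_check_failure(stage)
        print(f"{stage.value}: enabled={enabled.name}, disabled={disabled.name}")
```

Writing it this way makes the asymmetry visible: with the feature enabled, only the compute server stage pauses, while the three earlier stages fail the upgrade outright.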


Update cluster runtime upgrade with expand health check feature

db537eb

The expand health check feature details are updated in learn document section concepts-cluster-upgrade-runtime.

prmerger-automator bot added the do-not-merge label Sep 23, 2025

prmerger-automator bot assigned matternst7258 Sep 23, 2025

prmerger-automator bot requested a review from matternst7258 September 23, 2025 10:36

prmerger-automator bot added azure-operator-nexus/svc Change sent to author labels Sep 23, 2025

santhosh-kumar-cm suggested changes Sep 23, 2025

JAC0BSMITH suggested changes Sep 23, 2025

matternst7258 reviewed Sep 23, 2025

prmerger-automator bot added the aq-pr-triaged Tracking label for the PR review team label Sep 23, 2025

seaneagan reviewed Sep 23, 2025

update with review comments

f592b78

Updated the tenant workload health check process during cluster runtime upgrades, clarifying feature flag functionality and workflow steps.

Update health check behavior description in upgrade overview

1541c9b

Clarified behavior of health checks during runtime upgrade when feature is disabled.

seaneagan reviewed Oct 2, 2025




| Compute server Upgrade | Upgrade paused, resumed when customer executes the upgrade API | Upgrade continues to next stage |
| Compute server Upgrade | Upgrade paused, resumed when customer executes the continue API | Upgrade continues to next stage |
