OCPEDGE-2215: Updated TNF EP to address some drift from original requirements. #1887

jaypoulz · 2025-11-06T15:28:25Z

Updated warning against baremetal platform including BMC block
Updated test section to note that we'll skip requirements criteria if no requirements are provided
Added a new block that explains the PacemakerCluster API, the status collector, and the health check controller

openshift-ci · 2025-11-06T15:29:07Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign joelanford for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

fonta-rh · 2025-11-06T15:31:25Z

enhancements/two-node-fencing/tnf.md

@@ -317,6 +317,9 @@ In the future, it may be possible to lower the privilege level of the TNF contro
 to run without root privileges. We are working with the RHEL-HA team to identify the specific set of commands that we use to narrow the scope of progress towards this goal. This remains a long-term
 objective for both teams.

+##### The PacemakerCluster Health Check
+See [Status Propogation with PacemakerCluster Health Check](#status-propogation-with-pacemakercluster-health-check)


nit: propagation (somewhere else in the doc too)

that word is hard. :)

fonta-rh · 2025-11-06T15:34:34Z

enhancements/two-node-fencing/tnf.md

+need to do so. An example of this would be if the cluster administrator rotated their BMC password without updating the fencing secret in the cluster. This would be caught by the pacemaker monitoring
+checks, but something in the cluster would need to aware of that to propogate that information to the user directly.
+
+To acheive this, we plan on using a pair of new controllers in CEO. The first is a status collector, which syncs every 30 seconds to gather that current state of pacemaker via `sudo pcs status xml`.


Nit, some spelling:
To achieve this, we plan on using two new controllers in CEO. The first is a status collector which syncs every 30 seconds to gather the current state of pacemaker via sudo pcs status xml

fonta-rh · 2025-11-06T15:37:16Z

enhancements/two-node-fencing/tnf.md

@@ -708,6 +665,35 @@ This collection of diagrams collects a series of scenarios where both nodes fail

 ![Diagrams of Multi-Node Failure Scenarios](etcd-flowchart-both-nodes-reboot-scenarios.svg)

+#### Status Propogation with PacemakerCluster Health Check
+An important goal of Two Node OpenShift with Fencing is ensuring that we always warn the user before a disaster event can occur that would require manually intervention if we have the information we


This sentence is really hard to parse. Maybe something like:
An important goal of Two Node OpenShift with Fencing is ensuring an early warning for potentially disastrous events (that would requiring manual intervention). To provide this warning, the information needs to be available within the cluster.

fonta-rh · 2025-11-06T15:39:02Z

enhancements/two-node-fencing/tnf.md

+- A list of nodes currently registered in pacemaker
+- A list of recent events recorded by the pacemaker resources
+- A list of recent fencing events performed by pacemaker
+- A dump of the full pacemaker XML. This is kept so that in the case that the XML API is changed in a way that breaks the other fields, we can quickly deliver a fix for the breakage that parses the


Wording: This is kept to be able to deliver a quick fix to the XML parsing code if the XML API is changed in such a way that other fields break

fonta-rh · 2025-11-06T15:39:24Z

enhancements/two-node-fencing/tnf.md

+- A dump of the full pacemaker XML. This is kept so that in the case that the XML API is changed in a way that breaks the other fields, we can quickly deliver a fix for the breakage that parses the
+  XML directly.
+
+Once the PacemakerCluster object is populated is it handled on the CEO side by a new pacemaker healthcheck controller. This controller evaluates the status of the report and creates events in CEO for the following things:


is it -> it is

fonta-rh · 2025-11-06T15:40:47Z

enhancements/two-node-fencing/tnf.md

+- Not all resources and nodes are in their expected / healthy state
+- The pacemakercluster status object is stale (hasn't been updated in the last 5 minutes)
+
+Overall these health checks are almost entirely informational. The only time they are used outside of this event creation or operator status is to ensure that the nodes recorded in pacemaker match the


nit: remove "this" in "this event"

JoelSpeed · 2025-11-06T17:25:19Z

enhancements/two-node-fencing/tnf.md

+- A list of nodes currently registered in pacemaker
+- A list of recent events recorded by the pacemaker resources
+- A list of recent fencing events performed by pacemaker
+- A dump of the full pacemaker XML. This is kept so that in the case that the XML API is changed in a way that breaks the other fields, we can quickly deliver a fix for the breakage that parses the


This doesn't feel like the right place to put this. This to me at least is not really a field a user would care about. Who is this API for?

JoelSpeed · 2025-11-06T17:26:15Z

enhancements/two-node-fencing/tnf.md

+- A dump of the full pacemaker XML. This is kept so that in the case that the XML API is changed in a way that breaks the other fields, we can quickly deliver a fix for the breakage that parses the
+  XML directly.
+
+Once the PacemakerCluster object is populated is it handled on the CEO side by a new pacemaker healthcheck controller. This controller evaluates the status of the report and creates events in CEO for the following things:


Why can the writer of PacemakerCluster not produce the events? Seems that component has all of the correct/relevant information to be able to write the events?

JoelSpeed · 2025-11-06T17:27:34Z

enhancements/two-node-fencing/tnf.md

+- Warnings for fencing events that have happened on the cluster
+
+More importantly, it also sets the CEO's status to degraded if one of the following conditions are true:
+- Not all resources and nodes are in their expected / healthy state


Is this correct for CEO to go degraded? I thought I saw kubelet listed? Wouldn't some other component be responsible for alerting when a kubelet on a control plane node is down? Doesn't really feel like a CEO issue to report?

Most other components that relying on multiple replicas will be degraded at the same time. The obvious one is API server. In fact, CEO already reports degraded when kubelet is down because it doesn't have all of the endpoints it thinks it's supposed to have (one per control-plane node).

The reason we include reporting the kubelet behavior in the pacemaker status is because pacemaker ensures that kubelet is started before etcd. That means, that for etcd to be healthy, kubelet must be healthy. We could ignore the state of the kubelet resource when reporting the state of pacemaker, but as I mentioned before, the etcd member controller is going to be reporting degraded anyway so it's just extra information that explains why pacemaker is unhealthy.

JoelSpeed · 2025-11-06T17:27:50Z

enhancements/two-node-fencing/tnf.md

+
+More importantly, it also sets the CEO's status to degraded if one of the following conditions are true:
+- Not all resources and nodes are in their expected / healthy state
+- The pacemakercluster status object is stale (hasn't been updated in the last 5 minutes)


Needs admin intervention in a fairly prompt manner?

We don't know for sure. We can only give admins instructions if we know the state of pacemaker. If we haven't received a status, this means that CEO's status collector cronjob has stopped posting them or what's being posted is being rejected by the API.

In either case, the cluster could be in a state where the cluster could fail without recovering automatically. The goal is to raise this in a way where the cluster admin knows that something could be wrong.

- Updated warning against baremetal platform including BMC block - Updated test section to note that we'll skip requirements criteria if no requirements are provided - Added a new block that explains the PacemakerCluster API, the status collector, and the health check controller

openshift-ci-robot · 2025-11-06T18:48:58Z

@jaypoulz: This pull request references OCPEDGE-2215 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Updated warning against baremetal platform including BMC block

Updated test section to note that we'll skip requirements criteria if no requirements are provided

Added a new block that explains the PacemakerCluster API, the status collector, and the health check controller

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2025-11-06T19:50:13Z

@jaypoulz: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/markdownlint	`3320626`	link	true	`/test markdownlint`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

clobrano · 2025-11-07T14:04:09Z

enhancements/two-node-fencing/tnf.md

+would be caught by the pacemaker monitoring checks, but something in the cluster needs to propagate that information to the user directly.
+
+To acheive this, we plan on using two new controllers in CEO. The first is a status collector which syncs every 30 seconds to gather that current state of pacemaker via `sudo pcs status xml`.
+This is parsed and populates a new status object called a `PacemakerCluster`, which is a singleton resource that created by CEO when the transition to an external etcd is completed.


Some typos in thi sentece, which could flow better like so:

This is parsed and populates by a new status object called a PacemakerCluster, ~~which is~~ a singleton resource ~~that~~ created by CEO when the transition to an external etcd is completed.

clobrano · 2025-11-07T14:09:16Z

enhancements/two-node-fencing/tnf.md

+- A list of recent fencing events performed by pacemaker
+- A dump of the full pacemaker XML. This is kept to be able to deliver a quick fix to the XML parsing code if the XML API is changed in such a way that other fields break.
+
+Once the PacemakerCluster object is populated it is handled on the CEO side by a new pacemaker healthcheck controller. This controller evaluates the status of the report and creates events in


This sentence seems a bit wordy. What about the following:

Once populated, the PacemakerCluster object is handled ...

openshift-ci bot requested review from adambkaplan and celebdor November 6, 2025 15:29

jaypoulz force-pushed the tnf branch from efc8808 to e9338b7 Compare November 6, 2025 15:36

fonta-rh reviewed Nov 6, 2025

View reviewed changes

jaypoulz force-pushed the tnf branch from e9338b7 to 058cd19 Compare November 6, 2025 16:06

JoelSpeed reviewed Nov 6, 2025

View reviewed changes

jaypoulz force-pushed the tnf branch from 058cd19 to 3320626 Compare November 6, 2025 18:47

jaypoulz changed the title ~~Updated TNF EP to address some drift from original requirements.~~ OCPEDGE-2215: Updated TNF EP to address some drift from original requirements. Nov 6, 2025

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 6, 2025

clobrano reviewed Nov 7, 2025

View reviewed changes

OCPEDGE-2215: Updated TNF EP to address some drift from original requirements. #1887

Are you sure you want to change the base?

OCPEDGE-2215: Updated TNF EP to address some drift from original requirements. #1887

Uh oh!

Conversation

jaypoulz commented Nov 6, 2025

Uh oh!

openshift-ci bot commented Nov 6, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

openshift-ci-robot commented Nov 6, 2025 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci bot commented Nov 6, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

openshift-ci-robot commented Nov 6, 2025 •

edited by openshift-ci bot

Loading