Commit 26f7821

OCPEDGE-2084: Add PacemakerStatus CRD for two-node fencing
Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Gated by DualReplica feature and managed by two-node-fencing component.
1 parent a2cb0c5 commit 26f7821

17 files changed: +1025 -0 lines changed

etcd/README.md

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# tnf.etcd.openshift.io API Group

This API group contains CRDs related to two-node fencing (TNF) for etcd cluster management.

## API Versions

### tnf/v1alpha1

Contains the `PacemakerStatus` custom resource for monitoring Pacemaker cluster health in dual-replica (two-node) deployments.

#### PacemakerStatus

- **Feature Gate**: `DualReplica`
- **Component**: `two-node-fencing`
- **Scope**: Cluster-scoped singleton resource named "cluster"

The `PacemakerStatus` resource provides visibility into the health and status of a Pacemaker-managed cluster. It is periodically updated by the cluster-etcd-operator's status collector running as a privileged CronJob.

**Status Fields:**
- `summary`: High-level cluster health metrics (quorum, node counts, resource counts)
- `nodes`: Detailed status of each node (online, standby)
- `resources`: Detailed status of each resource (role, active, node assignment)
- `nodeHistory`: Recent operation failures for troubleshooting
- `fencingHistory`: Recent fencing events
- `rawXML`: Full XML output from `pcs status xml` for debugging

**Usage:**
The cluster-etcd-operator healthcheck controller watches this resource and updates operator conditions based on the cluster state.
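
As a rough illustration of how a consumer might turn the `summary` fields into a health verdict, here is a minimal Go sketch. The `Summary` struct is a local stand-in for a subset of the status fields, and the `Health` levels and the rules inside `evaluate` are assumptions made for this example. They are not the cluster-etcd-operator's actual logic, which this commit does not include.

```go
package main

import "fmt"

// Summary is a local, illustrative mirror of a subset of status.summary.
// It is not the generated API type from this commit.
type Summary struct {
	HasQuorum      bool
	NodesOnline    int
	NodesTotal     int
	RecentFailures bool
	RecentFencing  bool
}

// Health is a hypothetical three-level verdict used only in this sketch.
type Health string

const (
	Healthy  Health = "Healthy"
	Warning  Health = "Warning"
	Degraded Health = "Degraded"
)

// evaluate applies assumed rules: lost quorum or an offline node is
// Degraded, recent failures or fencing events are a Warning, and
// anything else is Healthy.
func evaluate(s Summary) Health {
	switch {
	case !s.HasQuorum || s.NodesOnline < s.NodesTotal:
		return Degraded
	case s.RecentFailures || s.RecentFencing:
		return Warning
	default:
		return Healthy
	}
}

func main() {
	s := Summary{HasQuorum: true, NodesOnline: 2, NodesTotal: 2, RecentFencing: true}
	fmt.Println(evaluate(s)) // prints "Warning"
}
```

A real controller would presumably also check how old the status is before trusting it; that is left out here to keep the sketch short.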

etcd/install.go

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
package etcd

import (
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"

	tnfv1alpha1 "github.com/openshift/api/etcd/tnf/v1alpha1"
)

const (
	// GroupName is the API group shared by every version in this package
	GroupName = "tnf.etcd.openshift.io"
)

var (
	schemeBuilder = runtime.NewSchemeBuilder(tnfv1alpha1.Install)
	// Install is a function which adds every version of this group to a scheme
	Install = schemeBuilder.AddToScheme
)

// Resource returns a GroupResource qualified with this API group
func Resource(resource string) schema.GroupResource {
	return schema.GroupResource{Group: GroupName, Resource: resource}
}

// Kind returns a GroupKind qualified with this API group
func Kind(kind string) schema.GroupKind {
	return schema.GroupKind{Group: GroupName, Kind: kind}
}

etcd/tnf/v1alpha1/Makefile

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
.PHONY: verify-with-container
verify-with-container:
	$(MAKE) -f ../../Makefile $@

.PHONY: update-with-container
update-with-container:
	$(MAKE) -f ../../Makefile $@

etcd/tnf/v1alpha1/README.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
# tnf.etcd.openshift.io/v1alpha1

This API group contains types related to two-node fencing for etcd cluster management.

## PacemakerStatus

The `PacemakerStatus` CRD provides visibility into the health and status of Pacemaker-managed clusters in dual-replica (two-node) OpenShift deployments.

### Feature Gate

- **FeatureGate**: `DualReplica`
- **Component**: `two-node-fencing`

### Usage

The PacemakerStatus resource is a cluster-scoped, status-only singleton named "cluster". It is periodically updated by a privileged controller that runs `pcs status xml` and parses the output into structured fields for health checking.

### Fields

- **Summary**: High-level cluster state (quorum, node counts, resource counts, recent failures/fencing)
- **Nodes**: Detailed per-node status (online, standby)
- **Resources**: Detailed per-resource status (agent type, role, active state, node)
- **NodeHistory**: Recent operation history for troubleshooting
- **FencingHistory**: Recent fencing events
- **RawXML**: Complete XML output (for debugging only, max 262144 characters, about 256 KiB)

### Notes

The spec field is reserved but unused; all meaningful data is in the status subresource.
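
The "run `pcs status xml` and parse the output into structured fields" step can be sketched with Go's standard `encoding/xml` package. The XML sample below is a deliberately simplified, hypothetical fragment (a `nodes` element containing `node` entries with `name`, `online`, and `standby` attributes). The real `pcs status xml` output carries many more elements and attributes, and the collector's actual parsing code is not part of this commit. The names `parseNodes`, `NodeStatus`, and `sampleXML` are made up for the example.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// maxRawXMLLength matches the maxLength set on status.rawXML in the CRD.
const maxRawXMLLength = 262144

// pcsResult models only the part of the XML this sketch reads.
type pcsResult struct {
	Nodes []xmlNode `xml:"nodes>node"`
}

type xmlNode struct {
	Name    string `xml:"name,attr"`
	Online  bool   `xml:"online,attr"`
	Standby bool   `xml:"standby,attr"`
}

// NodeStatus is an illustrative stand-in for an entry in status.nodes.
type NodeStatus struct {
	Name    string
	Online  bool
	Standby bool
}

// parseNodes rejects oversized input, then extracts per-node status.
// The byte length is used as a conservative bound on the character count.
func parseNodes(raw string) ([]NodeStatus, error) {
	if len(raw) > maxRawXMLLength {
		return nil, fmt.Errorf("raw XML is %d bytes, limit is %d", len(raw), maxRawXMLLength)
	}
	var r pcsResult
	if err := xml.Unmarshal([]byte(raw), &r); err != nil {
		return nil, fmt.Errorf("parsing pcs XML: %w", err)
	}
	out := make([]NodeStatus, 0, len(r.Nodes))
	for _, n := range r.Nodes {
		out = append(out, NodeStatus{Name: n.Name, Online: n.Online, Standby: n.Standby})
	}
	return out, nil
}

// sampleXML is a hypothetical, trimmed-down input used for illustration.
const sampleXML = `<pacemaker-result>
  <nodes>
    <node name="master-0" online="true" standby="false"/>
    <node name="master-1" online="false" standby="false"/>
  </nodes>
</pacemaker-result>`

func main() {
	nodes, err := parseNodes(sampleXML)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, n := range nodes {
		fmt.Printf("%s online=%t standby=%t\n", n.Name, n.Online, n.Standby)
	}
}
```

Checking the size before parsing keeps an unexpectedly large command output from being stored in a field the API server would reject anyway.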

etcd/tnf/v1alpha1/doc.go

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
// +k8s:deepcopy-gen=package,register
// +k8s:defaulter-gen=TypeMeta
// +k8s:openapi-gen=true

// +groupName=tnf.etcd.openshift.io
package v1alpha1

etcd/tnf/v1alpha1/register.go

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

var (
	GroupName     = "tnf.etcd.openshift.io"
	GroupVersion  = schema.GroupVersion{Group: GroupName, Version: "v1alpha1"}
	schemeBuilder = runtime.NewSchemeBuilder(addKnownTypes)
	// Install is a function which adds this version to a scheme
	Install = schemeBuilder.AddToScheme

	// SchemeGroupVersion generated code relies on this name
	// Deprecated
	SchemeGroupVersion = GroupVersion
	// AddToScheme exists solely to keep the old generators creating valid code
	// DEPRECATED
	AddToScheme = schemeBuilder.AddToScheme
)

// Resource generated code relies on this being here, but it logically belongs to the group
// DEPRECATED
func Resource(resource string) schema.GroupResource {
	return schema.GroupResource{Group: GroupName, Resource: resource}
}

func addKnownTypes(scheme *runtime.Scheme) error {
	metav1.AddToGroupVersion(scheme, GroupVersion)

	scheme.AddKnownTypes(GroupVersion,
		&PacemakerStatus{},
		&PacemakerStatusList{},
	)

	return nil
}
Lines changed: 201 additions & 0 deletions
@@ -0,0 +1,201 @@
apiVersion: apiextensions.k8s.io/v1 # Hack because controller-gen complains if we don't have this
name: "PacemakerStatus"
crdName: pacemakerstatuses.tnf.etcd.openshift.io
featureGates:
- DualReplica
tests:
  onCreate:
  - name: Should be able to create a minimal PacemakerStatus
    initial: |
      apiVersion: tnf.etcd.openshift.io/v1alpha1
      kind: PacemakerStatus
      metadata:
        name: cluster
      spec: {}
    expected: |
      apiVersion: tnf.etcd.openshift.io/v1alpha1
      kind: PacemakerStatus
      metadata:
        name: cluster
      spec: {}
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    api-approved.openshift.io: https://github.com/openshift/api/pull/TBD
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    release.openshift.io/feature-set: DualReplica
  name: pacemakerstatuses.tnf.etcd.openshift.io
spec:
  group: tnf.etcd.openshift.io
  names:
    kind: PacemakerStatus
    listKind: PacemakerStatusList
    plural: pacemakerstatuses
    singular: pacemakerstatus
  scope: Cluster
  versions:
  - name: v1alpha1
    schema:
      openAPIV3Schema:
        description: 'PacemakerStatus represents the current state of the Pacemaker cluster as reported by the pcs status command'
        type: object
        properties:
          apiVersion:
            description: 'APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
            type: string
          kind:
            description: 'Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
            type: string
          metadata:
            type: object
          spec:
            description: spec is reserved for future use but is currently unused.
            type: object
            properties:
              nodeName:
                description: Reserved for future use
                type: string
          status:
            description: status contains the actual pacemaker cluster status information collected from the cluster.
            type: object
            properties:
              collectionError:
                description: collectionError contains any error encountered while collecting status
                type: string
              fencingHistory:
                description: fencingHistory provides recent fencing events
                type: array
                items:
                  type: object
                  required:
                  - action
                  - status
                  - target
                  properties:
                    action:
                      description: action is the fencing action performed (e.g., "reboot", "off", "on")
                      type: string
                    completed:
                      description: completed is the timestamp when the fencing completed
                      type: string
                      format: date-time
                    status:
                      description: status is the status of the fencing operation (e.g., "success", "failed")
                      type: string
                    target:
                      description: target is the node that was fenced
                      type: string
              lastUpdated:
                description: lastUpdated is the timestamp when this status was last updated
                type: string
                format: date-time
              nodeHistory:
                description: nodeHistory provides recent operation history for troubleshooting
                type: array
                items:
                  type: object
                  required:
                  - node
                  - operation
                  - rc
                  - resource
                  properties:
                    lastRCChange:
                      description: lastRCChange is the timestamp when the RC last changed
                      type: string
                      format: date-time
                    node:
                      description: node is the node where the operation occurred
                      type: string
                    operation:
                      description: operation is the operation that was performed (e.g., "monitor", "start", "stop")
                      type: string
                    rc:
                      description: rc is the return code from the operation
                      type: integer
                    rcText:
                      description: rcText is the human-readable return code text (e.g., "ok", "error", "not running")
                      type: string
                    resource:
                      description: resource is the resource that was operated on
                      type: string
              nodes:
                description: nodes provides detailed information about each node in the cluster
                type: array
                items:
                  type: object
                  required:
                  - name
                  - online
                  properties:
                    name:
                      description: name is the name of the node
                      type: string
                    online:
                      description: online indicates if the node is online
                      type: boolean
                    standby:
                      description: standby indicates if the node is in standby mode
                      type: boolean
              rawXML:
                description: rawXML contains the raw XML output from pcs status xml command.
                type: string
                maxLength: 262144
              resources:
                description: resources provides detailed information about each resource in the cluster
                type: array
                items:
                  type: object
                  required:
                  - name
                  properties:
                    active:
                      description: active indicates if the resource is active
                      type: boolean
                    name:
                      description: name is the name of the resource
                      type: string
                    node:
                      description: node is the node where the resource is running
                      type: string
                    resourceAgent:
                      description: resourceAgent is the resource agent type (e.g., "ocf:heartbeat:IPaddr2", "systemd:kubelet")
                      type: string
                    role:
                      description: role is the current role of the resource (e.g., "Started", "Stopped")
                      type: string
              summary:
                description: summary provides high-level counts and flags for the cluster state
                type: object
                properties:
                  hasQuorum:
                    description: hasQuorum indicates if the cluster has quorum
                    type: boolean
                  nodesOnline:
                    description: nodesOnline is the count of online nodes
                    type: integer
                  nodesTotal:
                    description: nodesTotal is the total count of configured nodes
                    type: integer
                  pacemakerdState:
                    description: pacemakerdState indicates if pacemaker is running
                    type: string
                  recentFailures:
                    description: recentFailures indicates if there are recent operation failures
                    type: boolean
                  recentFencing:
                    description: recentFencing indicates if there are recent fencing events
                    type: boolean
                  resourcesStarted:
                    description: resourcesStarted is the count of started resources
                    type: integer
                  resourcesTotal:
                    description: resourcesTotal is the total count of configured resources
                    type: integer
    served: true
    storage: true
    subresources:
      status: {}
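
To make the schema constraints above more concrete, the following Go sketch decodes a status payload and checks two of them client-side: the `required` fields on `fencingHistory` items and the `maxLength` on `rawXML`. The struct types are hand-written stand-ins with JSON tags taken from the schema, not the generated types from `types.go` (which this view does not show), and the sample JSON is invented for the example. The check also treats an empty string as missing, which is stricter than the schema's `required` keyword (presence only).

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

// maxRawXMLLength matches the maxLength on status.rawXML in the schema.
const maxRawXMLLength = 262144

// FencingEvent is an illustrative mirror of a fencingHistory item.
type FencingEvent struct {
	Target    string `json:"target"`
	Action    string `json:"action"`
	Status    string `json:"status"`
	Completed string `json:"completed,omitempty"`
}

// Status is an illustrative mirror of a subset of the status object.
type Status struct {
	FencingHistory []FencingEvent `json:"fencingHistory,omitempty"`
	RawXML         string         `json:"rawXML,omitempty"`
}

// decodeStatus unmarshals a JSON status payload into the sketch types.
func decodeStatus(data []byte) (Status, error) {
	var s Status
	err := json.Unmarshal(data, &s)
	return s, err
}

// validate returns one message per violated constraint.
func validate(s Status) []string {
	var problems []string
	if utf8.RuneCountInString(s.RawXML) > maxRawXMLLength {
		problems = append(problems, "rawXML exceeds maxLength")
	}
	for i, e := range s.FencingHistory {
		if e.Target == "" || e.Action == "" || e.Status == "" {
			problems = append(problems,
				fmt.Sprintf("fencingHistory[%d]: target, action and status are required", i))
		}
	}
	return problems
}

// sample is hypothetical data; the second event omits "action".
const sample = `{
  "fencingHistory": [
    {"target": "master-1", "action": "reboot", "status": "success"},
    {"target": "master-0", "status": "failed"}
  ]
}`

func main() {
	s, err := decodeStatus([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, p := range validate(s) {
		fmt.Println(p)
	}
}
```

The API server enforces these rules on write regardless; a client-side check like this mainly gives a writer a clearer error before the update is attempted.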
