Commit 29b9fec

OCPEDGE-2084: Add PacemakerStatus CRD for two-node fencing
Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Gated by DualReplica feature and managed by two-node-fencing component.
1 parent a2cb0c5 commit 29b9fec

17 files changed: +999 -0 lines changed

etcd/README.md

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# tnf.etcd.openshift.io API Group

This API group contains CRDs related to two-node fencing (TNF) for etcd cluster management.

## API Versions

### tnf/v1alpha1

Contains the `PacemakerStatus` custom resource for monitoring Pacemaker cluster health in dual-replica (two-node) deployments.

#### PacemakerStatus

- **Feature Gate**: `DualReplica`
- **Component**: `two-node-fencing`
- **Scope**: Cluster-scoped singleton resource named "cluster"

The `PacemakerStatus` resource provides visibility into the health and status of a Pacemaker-managed cluster. It is periodically updated by the cluster-etcd-operator's status collector running as a privileged CronJob.

**Status Fields:**
- `summary`: High-level cluster health metrics (quorum, node counts, resource counts)
- `nodes`: Detailed status of each node (online, standby)
- `resources`: Detailed status of each resource (role, active, node assignment)
- `nodeHistory`: Recent operation failures for troubleshooting
- `fencingHistory`: Recent fencing events
- `rawXML`: Full XML output from `pcs status xml` for debugging

**Usage:**
The cluster-etcd-operator healthcheck controller watches this resource and updates operator conditions based on the cluster state.
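
As a rough illustration of how a consumer might turn the `summary` fields into an operator condition, here is a minimal sketch. The `Summary` struct is a local stand-in that mirrors the summary field names, and `evaluate` with its rules is hypothetical; the actual healthcheck controller logic is not part of this commit and may differ.

```go
package main

import "fmt"

// Summary is a local stand-in mirroring the status.summary fields.
// It is NOT the generated API type from this commit.
type Summary struct {
	HasQuorum        bool
	NodesOnline      int
	NodesTotal       int
	ResourcesStarted int
	ResourcesTotal   int
	RecentFailures   bool
	RecentFencing    bool
}

// evaluate is a hypothetical health rule set: it reports the cluster as
// degraded when any check fails and lists a reason for each failed check.
func evaluate(s Summary) (degraded bool, reasons []string) {
	if !s.HasQuorum {
		reasons = append(reasons, "cluster has no quorum")
	}
	if s.NodesOnline < s.NodesTotal {
		reasons = append(reasons, fmt.Sprintf("%d of %d nodes online", s.NodesOnline, s.NodesTotal))
	}
	if s.ResourcesStarted < s.ResourcesTotal {
		reasons = append(reasons, fmt.Sprintf("%d of %d resources started", s.ResourcesStarted, s.ResourcesTotal))
	}
	if s.RecentFailures {
		reasons = append(reasons, "recent operation failures")
	}
	if s.RecentFencing {
		reasons = append(reasons, "recent fencing events")
	}
	return len(reasons) > 0, reasons
}

func main() {
	// One node offline and quorum lost: two reasons are reported.
	s := Summary{HasQuorum: false, NodesOnline: 1, NodesTotal: 2, ResourcesStarted: 3, ResourcesTotal: 3}
	degraded, reasons := evaluate(s)
	fmt.Println("degraded:", degraded)
	for _, r := range reasons {
		fmt.Println(" -", r)
	}
}
```

Returning every failed check, rather than stopping at the first, gives the condition message enough detail to troubleshoot from.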

etcd/install.go

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
package etcd

import (
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"

	tnfv1alpha1 "github.com/openshift/api/etcd/tnf/v1alpha1"
)

const (
	GroupName = "tnf.etcd.openshift.io"
)

var (
	schemeBuilder = runtime.NewSchemeBuilder(tnfv1alpha1.Install)
	// Install is a function which adds every version of this group to a scheme
	Install = schemeBuilder.AddToScheme
)

func Resource(resource string) schema.GroupResource {
	return schema.GroupResource{Group: GroupName, Resource: resource}
}

func Kind(kind string) schema.GroupKind {
	return schema.GroupKind{Group: GroupName, Kind: kind}
}
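
To show what the `Resource` helper yields in practice, the sketch below uses a local `GroupResource` type as a stand-in for `schema.GroupResource`, so that it runs without the apimachinery dependency. The `String` method imitates the upstream `<resource>.<group>` formatting; treat that as a simplified reproduction, not the library code itself.

```go
package main

import "fmt"

// GroupName matches the constant declared in etcd/install.go.
const GroupName = "tnf.etcd.openshift.io"

// GroupResource is a local stand-in for schema.GroupResource from
// k8s.io/apimachinery, reduced to what this example needs.
type GroupResource struct {
	Group    string
	Resource string
}

// String formats the pair as "<resource>.<group>", or just the resource
// when the group is empty.
func (gr GroupResource) String() string {
	if gr.Group == "" {
		return gr.Resource
	}
	return gr.Resource + "." + gr.Group
}

// Resource qualifies a resource name with this API group.
func Resource(resource string) GroupResource {
	return GroupResource{Group: GroupName, Resource: resource}
}

func main() {
	// prints "pacemakerstatuses.tnf.etcd.openshift.io"
	fmt.Println(Resource("pacemakerstatuses"))
}
```

The formatted value is the same string that the CRD manifest in this commit uses as `metadata.name`.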

etcd/tnf/v1alpha1/Makefile

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
.PHONY: verify-with-container
verify-with-container:
	$(MAKE) -f ../../Makefile $@

.PHONY: update-with-container
update-with-container:
	$(MAKE) -f ../../Makefile $@

etcd/tnf/v1alpha1/README.md

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
1+
# tnf.etcd.openshift.io/v1alpha1
2+
3+
This API group contains types related to two-node fencing for etcd cluster management.
4+
5+
## PacemakerStatus
6+
7+
The `PacemakerStatus` CRD provides visibility into the health and status of Pacemaker-managed clusters in dual-replica (two-node) OpenShift deployments.
8+
9+
### Feature Gate
10+
11+
- **FeatureGate**: `DualReplica`
12+
- **Component**: `two-node-fencing`
13+
14+
### Usage
15+
16+
The PacemakerStatus resource is a cluster-scoped, status-only singleton named "cluster". It is periodically updated by a privileged controller that runs `pcs status xml` and parses the output into structured fields for health checking.
17+
18+
### Fields
19+
20+
- **Summary**: High-level cluster state (quorum, node counts, resource counts, recent failures/fencing)
21+
- **Nodes**: Detailed per-node status (online, standby)
22+
- **Resources**: Detailed per-resource status (agent type, role, active state, node)
23+
- **NodeHistory**: Recent operation history for troubleshooting
24+
- **FencingHistory**: Recent fencing events
25+
- **RawXML**: Complete XML output (for debugging only, max 256KB)
26+
27+
### Notes
28+
29+
The spec field is reserved but unused - all meaningful data is in the status subresource.
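
The "parse the XML output into structured fields" step described under Usage could look roughly like the sketch below. The sample document is hand-written and simplified: its element and attribute names are assumptions for illustration and have not been checked against real `pcs status xml` output. The `parseStatus` function and the local `Summary` and `NodeStatus` types are likewise hypothetical stand-ins for the API types.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// sampleXML is a simplified, hand-written document. Its element and
// attribute names are assumptions, not verified `pcs status xml` output.
const sampleXML = `<pacemaker-result>
  <summary>
    <current_dc with_quorum="true"/>
    <nodes_configured number="2"/>
  </summary>
  <nodes>
    <node name="master-0" online="true" standby="false"/>
    <node name="master-1" online="false" standby="false"/>
  </nodes>
</pacemaker-result>`

// pcsResult maps the parts of the assumed XML layout that we read.
type pcsResult struct {
	Summary struct {
		CurrentDC struct {
			WithQuorum bool `xml:"with_quorum,attr"`
		} `xml:"current_dc"`
		NodesConfigured struct {
			Number int `xml:"number,attr"`
		} `xml:"nodes_configured"`
	} `xml:"summary"`
	Nodes []NodeStatus `xml:"nodes>node"`
}

// NodeStatus mirrors the per-node fields (name, online, standby).
type NodeStatus struct {
	Name    string `xml:"name,attr"`
	Online  bool   `xml:"online,attr"`
	Standby bool   `xml:"standby,attr"`
}

// Summary holds a subset of the summary fields.
type Summary struct {
	HasQuorum   bool
	NodesOnline int
	NodesTotal  int
}

// parseStatus decodes the XML and derives the summary counts from it.
func parseStatus(raw string) (Summary, []NodeStatus, error) {
	var res pcsResult
	if err := xml.Unmarshal([]byte(raw), &res); err != nil {
		return Summary{}, nil, fmt.Errorf("parsing status XML: %w", err)
	}
	s := Summary{
		HasQuorum:  res.Summary.CurrentDC.WithQuorum,
		NodesTotal: res.Summary.NodesConfigured.Number,
	}
	for _, n := range res.Nodes {
		if n.Online {
			s.NodesOnline++
		}
	}
	return s, res.Nodes, nil
}

func main() {
	summary, nodes, err := parseStatus(sampleXML)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("summary: %+v\n", summary)
	fmt.Printf("nodes: %+v\n", nodes)
}
```

When parsing fails, a collector following this pattern could record the error text in the `collectionError` status field instead of leaving stale data unexplained.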

etcd/tnf/v1alpha1/doc.go

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
// +k8s:deepcopy-gen=package,register
// +k8s:defaulter-gen=TypeMeta
// +k8s:openapi-gen=true

// +groupName=tnf.etcd.openshift.io
package v1alpha1

etcd/tnf/v1alpha1/register.go

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

var (
	GroupName     = "tnf.etcd.openshift.io"
	GroupVersion  = schema.GroupVersion{Group: GroupName, Version: "v1alpha1"}
	schemeBuilder = runtime.NewSchemeBuilder(addKnownTypes)
	// Install is a function which adds this version to a scheme
	Install = schemeBuilder.AddToScheme

	// SchemeGroupVersion generated code relies on this name
	// Deprecated
	SchemeGroupVersion = GroupVersion
	// AddToScheme exists solely to keep the old generators creating valid code
	// DEPRECATED
	AddToScheme = schemeBuilder.AddToScheme
)

// Resource generated code relies on this being here, but it logically belongs to the group
// DEPRECATED
func Resource(resource string) schema.GroupResource {
	return schema.GroupResource{Group: GroupName, Resource: resource}
}

func addKnownTypes(scheme *runtime.Scheme) error {
	metav1.AddToGroupVersion(scheme, GroupVersion)

	scheme.AddKnownTypes(GroupVersion,
		&PacemakerStatus{},
		&PacemakerStatusList{},
	)

	return nil
}
Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    api-approved.openshift.io: https://github.com/openshift/api/pull/TBD
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    release.openshift.io/feature-set: DualReplica
  name: pacemakerstatuses.tnf.etcd.openshift.io
spec:
  group: tnf.etcd.openshift.io
  names:
    kind: PacemakerStatus
    listKind: PacemakerStatusList
    plural: pacemakerstatuses
    singular: pacemakerstatus
  scope: Cluster
  versions:
  - name: v1alpha1
    schema:
      openAPIV3Schema:
        description: 'PacemakerStatus represents the current state of the Pacemaker cluster as reported by the pcs status command'
        type: object
        properties:
          apiVersion:
            description: 'APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
            type: string
          kind:
            description: 'Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
            type: string
          metadata:
            type: object
          spec:
            description: spec is reserved for future use but is currently unused.
            type: object
            properties:
              nodeName:
                description: Reserved for future use
                type: string
          status:
            description: status contains the actual pacemaker cluster status information collected from the cluster.
            type: object
            properties:
              collectionError:
                description: collectionError contains any error encountered while collecting status
                type: string
              fencingHistory:
                description: fencingHistory provides recent fencing events
                type: array
                items:
                  type: object
                  required:
                  - action
                  - status
                  - target
                  properties:
                    action:
                      description: action is the fencing action performed (e.g., "reboot", "off", "on")
                      type: string
                    completed:
                      description: completed is the timestamp when the fencing completed
                      type: string
                      format: date-time
                    status:
                      description: status is the status of the fencing operation (e.g., "success", "failed")
                      type: string
                    target:
                      description: target is the node that was fenced
                      type: string
              lastUpdated:
                description: lastUpdated is the timestamp when this status was last updated
                type: string
                format: date-time
              nodeHistory:
                description: nodeHistory provides recent operation history for troubleshooting
                type: array
                items:
                  type: object
                  required:
                  - node
                  - operation
                  - rc
                  - resource
                  properties:
                    lastRCChange:
                      description: lastRCChange is the timestamp when the RC last changed
                      type: string
                      format: date-time
                    node:
                      description: node is the node where the operation occurred
                      type: string
                    operation:
                      description: operation is the operation that was performed (e.g., "monitor", "start", "stop")
                      type: string
                    rc:
                      description: rc is the return code from the operation
                      type: integer
                    rcText:
                      description: rcText is the human-readable return code text (e.g., "ok", "error", "not running")
                      type: string
                    resource:
                      description: resource is the resource that was operated on
                      type: string
              nodes:
                description: nodes provides detailed information about each node in the cluster
                type: array
                items:
                  type: object
                  required:
                  - name
                  - online
                  properties:
                    name:
                      description: name is the name of the node
                      type: string
                    online:
                      description: online indicates if the node is online
                      type: boolean
                    standby:
                      description: standby indicates if the node is in standby mode
                      type: boolean
              rawXML:
                description: rawXML contains the raw XML output from pcs status xml command.
                type: string
                maxLength: 262144
              resources:
                description: resources provides detailed information about each resource in the cluster
                type: array
                items:
                  type: object
                  required:
                  - name
                  properties:
                    active:
                      description: active indicates if the resource is active
                      type: boolean
                    name:
                      description: name is the name of the resource
                      type: string
                    node:
                      description: node is the node where the resource is running
                      type: string
                    resourceAgent:
                      description: resourceAgent is the resource agent type (e.g., "ocf:heartbeat:IPaddr2", "systemd:kubelet")
                      type: string
                    role:
                      description: role is the current role of the resource (e.g., "Started", "Stopped")
                      type: string
              summary:
                description: summary provides high-level counts and flags for the cluster state
                type: object
                properties:
                  hasQuorum:
                    description: hasQuorum indicates if the cluster has quorum
                    type: boolean
                  nodesOnline:
                    description: nodesOnline is the count of online nodes
                    type: integer
                  nodesTotal:
                    description: nodesTotal is the total count of configured nodes
                    type: integer
                  pacemakerdState:
                    description: pacemakerdState indicates if pacemaker is running
                    type: string
                  recentFailures:
                    description: recentFailures indicates if there are recent operation failures
                    type: boolean
                  recentFencing:
                    description: recentFencing indicates if there are recent fencing events
                    type: boolean
                  resourcesStarted:
                    description: resourcesStarted is the count of started resources
                    type: integer
                  resourcesTotal:
                    description: resourcesTotal is the total count of configured resources
                    type: integer
    served: true
    storage: true
    subresources:
      status: {}
