Commit 2fb0282

OCPEDGE-2084: Add PacemakerStatus CRD for two-node fencing
Introduces etcd.openshift.io/v1alpha1 API group with a PacemakerCluster custom resource. This provides visibility into Pacemaker cluster health for Two Node Fencing etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. This API is not gated because it's only created by CEO once the transition to an ExternalEtcd has occurred.
1 parent 8691c30 commit 2fb0282

File tree

20 files changed

+2539
-39302
lines changed

etcd/.codegen.yaml

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+swaggerdocs:
+  commentPolicy: Warn

etcd/README.md

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+# etcd.openshift.io API Group
+
+This API group contains CRDs related to etcd cluster management. Specifically, it is used only for TNF (Two Node Fencing),
+to gather status updates from the node so that the cluster-admin is warned about unhealthy setups.
+
+## API Versions
+
+### v1alpha1
+
+Contains the `PacemakerCluster` custom resource for monitoring Pacemaker cluster health in TNF (Two Node Fencing) deployments.
+
+#### PacemakerCluster
+
+- **Feature Gate**: None - this CRD is gated by cluster-etcd-operator start-up. It will only be created once a TNF cluster has transitioned to external etcd.
+- **Component**: `two-node-fencing`
+- **Scope**: Cluster-scoped singleton resource named "cluster"
+- **Resource Path**: `pacemakerclusters.etcd.openshift.io`
+
+The `PacemakerCluster` resource provides visibility into the health and status of a Pacemaker-managed cluster. It is periodically updated by the cluster-etcd-operator's status collector running as a privileged CronJob.
+
+**Status Fields:**
+- `lastUpdated` (required): Timestamp when status was last collected - used to detect stale data
+- `summary`: High-level cluster health metrics
+  - `pacemakerDaemonState`: Running state (enum: `Running`, `KnownNotRunning`)
+  - `quorumStatus`: Quorum state (enum: `Quorate`, `NoQuorum`)
+  - `nodesOnline`, `nodesTotal`: Node counts (0-2)
+  - `resourcesStarted`, `resourcesTotal`: Resource counts (0-16)
+- `nodes`: Detailed status of each node (1-2 nodes)
+  - Name, IPv4/IPv6 addresses, online status (enum), mode (enum: `Active`, `Standby`)
+- `resources`: Detailed status of each resource (1-16 resources)
+  - Name, resource agent, role (enum: `Started`, `Stopped`), active status (enum), node assignment
+- `nodeHistory`: Recent operation failures for troubleshooting (up to 16 entries, last 5 minutes)
+- `fencingHistory`: Recent fencing events (up to 16 events, last 24 hours)
+  - Target node, action (enum: `reboot`, `off`, `on`), status (enum: `success`, `failed`, `pending`), completion timestamp
+- `collectionError`: Any errors encountered during status collection (max 2KB)
+- `rawXML`: Full XML output from `pcs status xml` for debugging (max 256KB)
+
+**Design Principles:**
+The API follows the "Act on Deterministic Information" principle:
+- All fields except `lastUpdated` are optional
+- Missing data indicates an unknown state, not an error
+- The operator only acts on definitive information
+- An unknown state preserves the last known health condition
+
+**Usage:**
+The cluster-etcd-operator healthcheck controller watches this resource and updates operator conditions based on the cluster state.

etcd/install.go

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,26 @@
1+
package etcd
2+
3+
import (
4+
"k8s.io/apimachinery/pkg/runtime"
5+
"k8s.io/apimachinery/pkg/runtime/schema"
6+
7+
v1alpha1 "github.com/openshift/api/etcd/v1alpha1"
8+
)
9+
10+
const (
11+
GroupName = "etcd.openshift.io"
12+
)
13+
14+
var (
15+
schemeBuilder = runtime.NewSchemeBuilder(v1alpha1.Install)
16+
// Install is a function which adds every version of this group to a scheme
17+
Install = schemeBuilder.AddToScheme
18+
)
19+
20+
func Resource(resource string) schema.GroupResource {
21+
return schema.GroupResource{Group: GroupName, Resource: resource}
22+
}
23+
24+
func Kind(kind string) schema.GroupKind {
25+
return schema.GroupKind{Group: GroupName, Kind: kind}
26+
}

etcd/v1alpha1/Makefile

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
.PHONY: verify-with-container
2+
verify-with-container:
3+
$(MAKE) -f ../../Makefile $@
4+
5+
.PHONY: update-with-container
6+
update-with-container:
7+
$(MAKE) -f ../../Makefile $@

etcd/v1alpha1/README.md

Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,43 @@
1+
# etcd.openshift.io/v1alpha1
2+
3+
This API group contains types related to two-node fencing for etcd cluster management.
4+
5+
## PacemakerCluster
6+
7+
The `PacemakerCluster` CRD provides visibility into the health and status of Pacemaker-managed clusters in dual-replica (two-node) OpenShift deployments.
8+
9+
### Feature Gate
10+
11+
- **Feature Gate**: None - this CRD is gated by cluster-etcd-operator start-up. It will only be created once a TNF cluster has transitioned to external etcd.
12+
- **Component**: `two-node-fencing`
13+
14+
### Usage
15+
16+
The PacemakerCluster resource is a cluster-scoped, status-only singleton named "cluster". It is periodically updated by a privileged controller that runs `pcs status xml` and parses the output into structured fields for health checking.
17+
18+
### Status Fields
19+
20+
- **LastUpdated** (required): Timestamp when status was last collected
21+
- **Summary**: High-level cluster state including:
22+
- `pacemakerDaemonState`: Running state of the pacemaker daemon (enum: `Running`, `KnownNotRunning`)
23+
- `quorumStatus`: Whether cluster has quorum (enum: `Quorate`, `NoQuorum`)
24+
- `nodesOnline`, `nodesTotal`: Node counts
25+
- `resourcesStarted`, `resourcesTotal`: Resource counts
26+
- **Nodes**: Detailed per-node status (name, IPv4/IPv6 addresses, online status, mode)
27+
- **Resources**: Detailed per-resource status (name, resource agent type, role enum, active status, node assignment)
28+
- **NodeHistory**: Recent operation history for troubleshooting (operation failures within last 5 minutes)
29+
- **FencingHistory**: Recent fencing events (events within last 24 hours)
30+
- **RawXML**: Complete XML output from `pcs status xml` (for debugging only, max 256KB)
31+
- **CollectionError**: Any errors encountered during status collection
32+
33+
### Design Principles
34+
35+
The API follows a "Design Principle: Act on Deterministic Information" approach:
36+
- Almost all fields are optional except `lastUpdated`
37+
- Missing data means "unknown" not "error"
38+
- The operator only transitions between PacemakerHealthy and PacemakerDegraded states based on deterministic information
39+
- When information is unavailable, the last known state is preserved
40+
41+
### Notes
42+
43+
The spec field is reserved but unused - all meaningful data is in the status subresource.
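
The design principles above describe a transition rule: move between `PacemakerHealthy` and `PacemakerDegraded` only on definitive information, and otherwise keep the last known state. A minimal sketch of that rule might look like the following. The `Condition` type, the trimmed-down `Summary` struct, and the `NextCondition` function are hypothetical, and the specific signals treated as definitive here are assumptions; the commit does not show the operator's real decision logic.

```go
package main

import "fmt"

// Condition is a hypothetical type for the two states named in the README.
type Condition string

const (
	PacemakerHealthy  Condition = "PacemakerHealthy"
	PacemakerDegraded Condition = "PacemakerDegraded"
)

// Summary is a hypothetical, reduced stand-in for the status summary.
// An empty string models an optional field that was not reported.
type Summary struct {
	PacemakerDaemonState string // "", "Running", or "KnownNotRunning"
	QuorumStatus         string // "", "Quorate", or "NoQuorum"
}

// NextCondition returns the condition to report given the last known
// condition and the latest summary. Which signals count as definitive
// is an assumption made for this sketch.
func NextCondition(last Condition, s *Summary) Condition {
	if s == nil {
		return last // nothing collected: unknown, so preserve the last state
	}
	if s.PacemakerDaemonState == "KnownNotRunning" || s.QuorumStatus == "NoQuorum" {
		return PacemakerDegraded // a definitive failure signal
	}
	if s.PacemakerDaemonState == "Running" && s.QuorumStatus == "Quorate" {
		return PacemakerHealthy // every required signal is definitively good
	}
	return last // partial data: unknown, so preserve the last state
}

func main() {
	fmt.Println(NextCondition(PacemakerHealthy, nil))
	fmt.Println(NextCondition(PacemakerHealthy, &Summary{QuorumStatus: "NoQuorum"}))
	fmt.Println(NextCondition(PacemakerDegraded, &Summary{PacemakerDaemonState: "Running"}))
}
```

In this sketch a missing field never causes a transition on its own, which reflects the "unknown, not error" rule: only an explicit `KnownNotRunning` or `NoQuorum` degrades, and only a fully positive summary restores health.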

etcd/v1alpha1/doc.go

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
// +k8s:deepcopy-gen=package,register
2+
// +k8s:defaulter-gen=TypeMeta
3+
// +k8s:openapi-gen=true
4+
// +openshift:featuregated-schema-gen=true
5+
6+
// +kubebuilder:validation:Optional
7+
// +groupName=etcd.openshift.io
8+
package v1alpha1

etcd/v1alpha1/register.go

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
1+
package v1alpha1
2+
3+
import (
4+
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
5+
"k8s.io/apimachinery/pkg/runtime"
6+
"k8s.io/apimachinery/pkg/runtime/schema"
7+
)
8+
9+
var (
10+
GroupName = "etcd.openshift.io"
11+
GroupVersion = schema.GroupVersion{Group: GroupName, Version: "v1alpha1"}
12+
schemeBuilder = runtime.NewSchemeBuilder(addKnownTypes)
13+
// Install is a function which adds this version to a scheme
14+
Install = schemeBuilder.AddToScheme
15+
16+
// SchemeGroupVersion generated code relies on this name
17+
// Deprecated
18+
SchemeGroupVersion = GroupVersion
19+
// AddToScheme exists solely to keep the old generators creating valid code
20+
// DEPRECATED
21+
AddToScheme = schemeBuilder.AddToScheme
22+
)
23+
24+
// Resource generated code relies on this being here, but it logically belongs to the group
25+
// DEPRECATED
26+
func Resource(resource string) schema.GroupResource {
27+
return schema.GroupResource{Group: GroupName, Resource: resource}
28+
}
29+
30+
func addKnownTypes(scheme *runtime.Scheme) error {
31+
metav1.AddToGroupVersion(scheme, GroupVersion)
32+
33+
scheme.AddKnownTypes(GroupVersion,
34+
&PacemakerCluster{},
35+
&PacemakerClusterList{},
36+
)
37+
38+
return nil
39+
}
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+apiVersion: apiextensions.k8s.io/v1 # Hack because controller-gen complains if we don't have this
+name: "PacemakerCluster"
+crdName: pacemakerclusters.etcd.openshift.io
+tests:
+  onCreate:
+    - name: Should be able to create a minimal PacemakerCluster
+      initial: |
+        apiVersion: etcd.openshift.io/v1alpha1
+        kind: PacemakerCluster
+        metadata:
+          name: cluster
+        spec: {}
+      expected: |
+        apiVersion: etcd.openshift.io/v1alpha1
+        kind: PacemakerCluster
+        metadata:
+          name: cluster
+        spec: {}
