Commit df97bb6

OCPEDGE-2084: Add PacemakerStatus CRD for two-node fencing
Introduces the etcd.openshift.io/v1alpha1 API group with a PacemakerCluster custom resource. This provides visibility into Pacemaker cluster health for Two Node Fencing (TNF) etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. This API is not explicitly gated because CEO creates it only once the transition to ExternalEtcd has occurred, so it is naturally gated by the TNF topology.
1 parent 243758d commit df97bb6

20 files changed: +3061 -0 lines changed

etcd/.codegen.yaml

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
swaggerdocs:
  commentPolicy: Warn

etcd/README.md

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
# etcd.openshift.io API Group

This API group contains CRDs related to etcd cluster management. It is currently used only for TNF (Two Node Fencing),
where it gathers status updates from the nodes so that the cluster admin is warned about unhealthy setups.

## API Versions

### v1alpha1

Contains the `PacemakerCluster` custom resource for monitoring Pacemaker cluster health in TNF (Two Node Fencing) deployments.

#### PacemakerCluster

- **Feature Gate**: None - this CRD is gated by cluster-etcd-operator start-up. It will only be created once a TNF cluster has transitioned to external etcd.
- **Component**: `two-node-fencing`
- **Scope**: Cluster-scoped singleton resource named "cluster"
- **Resource Path**: `pacemakerclusters.etcd.openshift.io`

The `PacemakerCluster` resource provides visibility into the health and status of a Pacemaker-managed cluster. It is periodically updated by the cluster-etcd-operator's status collector running as a privileged CronJob.

**Status Fields:**
- `lastUpdated` (required): Timestamp when status was last collected - used to detect stale data
- `summary`: High-level cluster health metrics
  - `pacemakerDaemonState`: Running state (enum: `Running`, `KnownNotRunning`)
  - `quorumStatus`: Quorum state (enum: `Quorate`, `NoQuorum`)
  - `nodesOnline`, `nodesTotal`: Node counts (0-2)
  - `resourcesStarted`, `resourcesTotal`: Resource counts (0-16)
- `nodes`: Detailed status of each node (1-2 nodes)
  - Name, IPv4/IPv6 addresses, online status (enum), mode (enum: `Active`, `Standby`)
- `resources`: Detailed status of each resource (1-16 resources)
  - Name, resource agent, role (enum: `Started`, `Stopped`), active status (enum), node assignment
- `nodeHistory`: Recent operation failures for troubleshooting (up to 16 entries, last 5 minutes)
- `fencingHistory`: Recent fencing events (up to 16 events, last 24 hours)
  - Target node, action (enum: `reboot`, `off`, `on`), status (enum: `success`, `failed`, `pending`), completion timestamp
- `collectionError`: Any errors encountered during status collection (max 2KB)
- `rawXML`: Full XML output from `pcs status xml` for debugging (max 256KB)

**Design Principles:**
The API follows "Act on Deterministic Information":
- All fields except `lastUpdated` are optional
- Missing data indicates unknown state, not error
- Operator only acts on definitive information
- Unknown state preserves the last known health condition

**Usage:**
The cluster-etcd-operator healthcheck controller watches this resource and updates operator conditions based on the cluster state.

etcd/install.go

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
package etcd

import (
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"

	v1alpha1 "github.com/openshift/api/etcd/v1alpha1"
)

const (
	GroupName = "etcd.openshift.io"
)

var (
	schemeBuilder = runtime.NewSchemeBuilder(v1alpha1.Install)
	// Install is a function which adds every version of this group to a scheme
	Install = schemeBuilder.AddToScheme
)

func Resource(resource string) schema.GroupResource {
	return schema.GroupResource{Group: GroupName, Resource: resource}
}

func Kind(kind string) schema.GroupKind {
	return schema.GroupKind{Group: GroupName, Kind: kind}
}

etcd/v1alpha1/Makefile

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
.PHONY: verify-with-container
verify-with-container:
	$(MAKE) -f ../../Makefile $@

.PHONY: update-with-container
update-with-container:
	$(MAKE) -f ../../Makefile $@

etcd/v1alpha1/README.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# etcd.openshift.io/v1alpha1

This API group contains types related to two-node fencing for etcd cluster management.

## PacemakerCluster

The `PacemakerCluster` CRD provides visibility into the health and status of Pacemaker-managed clusters in dual-replica (two-node) OpenShift deployments.

### Feature Gate

- **Feature Gate**: None - this CRD is gated by cluster-etcd-operator start-up. It will only be created once a TNF cluster has transitioned to external etcd.
- **Component**: `two-node-fencing`

### Usage

The PacemakerCluster resource is a cluster-scoped, status-only singleton named "cluster". It is periodically updated by a privileged controller that runs `pcs status xml` and parses the output into structured fields for health checking.

### Status Fields

- **LastUpdated** (required): Timestamp when status was last collected
- **Summary**: High-level cluster state including:
  - `pacemakerDaemonState`: Running state of the pacemaker daemon (enum: `Running`, `KnownNotRunning`)
  - `quorumStatus`: Whether cluster has quorum (enum: `Quorate`, `NoQuorum`)
  - `nodesOnline`, `nodesTotal`: Node counts
  - `resourcesStarted`, `resourcesTotal`: Resource counts
- **Nodes**: Detailed per-node status (name, IPv4/IPv6 addresses, online status, mode)
- **Resources**: Detailed per-resource status (name, resource agent type, role enum, active status, node assignment)
- **NodeHistory**: Recent operation history for troubleshooting (operation failures within last 5 minutes)
- **FencingHistory**: Recent fencing events (events within last 24 hours)
- **RawXML**: Complete XML output from `pcs status xml` (for debugging only, max 256KB)
- **CollectionError**: Any errors encountered during status collection

### Design Principles

The API follows the "Act on Deterministic Information" design principle:
- Almost all fields are optional except `lastUpdated`
- Missing data means "unknown" not "error"
- The operator only transitions between PacemakerHealthy and PacemakerDegraded states based on deterministic information
- When information is unavailable, the last known state is preserved
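
The "preserve the last known state" rule can be read as a small state machine. The sketch below illustrates that reading under stated assumptions and is not the operator's implementation: the `Observation` type and `nextCondition` function are hypothetical, and the choice of which enum values count as definitive signals is an assumption. Only the state names `PacemakerHealthy` and `PacemakerDegraded` and the enum values come from the text above.

```go
package main

import "fmt"

// Condition names follow the states mentioned in the README.
type Condition string

const (
	PacemakerHealthy  Condition = "PacemakerHealthy"
	PacemakerDegraded Condition = "PacemakerDegraded"
)

// Observation is a hypothetical view of the optional summary fields.
// An empty string means the field was missing, that is, unknown.
type Observation struct {
	DaemonState  string // "Running", "KnownNotRunning", or ""
	QuorumStatus string // "Quorate", "NoQuorum", or ""
}

// nextCondition changes state only on definitive information and otherwise
// returns the last known condition unchanged.
func nextCondition(last Condition, obs Observation) Condition {
	// Assumption: either definitive negative signal is enough to degrade.
	if obs.DaemonState == "KnownNotRunning" || obs.QuorumStatus == "NoQuorum" {
		return PacemakerDegraded
	}
	// Assumption: healthy requires both signals to be definitively positive.
	if obs.DaemonState == "Running" && obs.QuorumStatus == "Quorate" {
		return PacemakerHealthy
	}
	// Anything else is unknown, so the previous condition is preserved.
	return last
}

func main() {
	state := PacemakerHealthy
	state = nextCondition(state, Observation{}) // missing data: no change
	fmt.Println(state)
	state = nextCondition(state, Observation{DaemonState: "Running", QuorumStatus: "NoQuorum"})
	fmt.Println(state)
}
```

Treating an empty field as "unknown" rather than as a failure is what keeps a transient collection error from flipping the reported health.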

### Notes

The spec field is reserved but unused - all meaningful data is in the status subresource.

etcd/v1alpha1/doc.go

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
// +k8s:deepcopy-gen=package,register
// +k8s:defaulter-gen=TypeMeta
// +k8s:openapi-gen=true
// +openshift:featuregated-schema-gen=true

// +kubebuilder:validation:Optional
// +groupName=etcd.openshift.io
package v1alpha1

etcd/v1alpha1/register.go

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

var (
	GroupName     = "etcd.openshift.io"
	GroupVersion  = schema.GroupVersion{Group: GroupName, Version: "v1alpha1"}
	schemeBuilder = runtime.NewSchemeBuilder(addKnownTypes)
	// Install is a function which adds this version to a scheme
	Install = schemeBuilder.AddToScheme

	// SchemeGroupVersion generated code relies on this name
	// Deprecated
	SchemeGroupVersion = GroupVersion
	// AddToScheme exists solely to keep the old generators creating valid code
	// DEPRECATED
	AddToScheme = schemeBuilder.AddToScheme
)

// Resource generated code relies on this being here, but it logically belongs to the group
// DEPRECATED
func Resource(resource string) schema.GroupResource {
	return schema.GroupResource{Group: GroupName, Resource: resource}
}

func addKnownTypes(scheme *runtime.Scheme) error {
	metav1.AddToGroupVersion(scheme, GroupVersion)

	scheme.AddKnownTypes(GroupVersion,
		&PacemakerCluster{},
		&PacemakerClusterList{},
	)

	return nil
}
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
apiVersion: apiextensions.k8s.io/v1 # Hack because controller-gen complains if we don't have this
name: "PacemakerCluster"
crdName: pacemakerclusters.etcd.openshift.io
tests:
  onCreate:
    - name: Should be able to create a minimal PacemakerCluster
      initial: |
        apiVersion: etcd.openshift.io/v1alpha1
        kind: PacemakerCluster
        metadata:
          name: cluster
        spec: {}
      expected: |
        apiVersion: etcd.openshift.io/v1alpha1
        kind: PacemakerCluster
        metadata:
          name: cluster
        spec: {}
