You can configure when node repair actions stop by using either a percentage-based or a count-based threshold. *Note: You cannot use both percentage and count thresholds at the same time.*

==== CLI flags for thresholds

[,shell]
----
# Percentage-based threshold - repair stops when 20% of nodes are unhealthy
eksctl create cluster --enable-node-repair \
  --node-repair-max-unhealthy-percentage=20

# Count-based threshold - repair stops when 5 nodes are unhealthy
eksctl create cluster --enable-node-repair \
  --node-repair-max-unhealthy-count=5
----

==== Configuration file for thresholds

[,yaml]
----
managedNodeGroups:
  - name: threshold-ng
    nodeRepairConfig:
      enabled: true
      # Stop repair actions when 20% of nodes are unhealthy
      maxUnhealthyNodeThresholdPercentage: 20
      # Alternative: stop repair actions when 3 nodes are unhealthy
      # maxUnhealthyNodeThresholdCount: 3
      # Note: Cannot use both percentage and count thresholds simultaneously
----

=== Parallel Repair Limits

Control the maximum number of nodes that can be repaired in parallel. This gives you finer-grained control over the pace of node replacements. *Note: You cannot use both percentage and count limits at the same time.*

==== CLI flags for parallel limits

[,shell]
----
# Percentage-based parallel limit - repair at most 15% of unhealthy nodes in parallel
eksctl create cluster --enable-node-repair \
  --node-repair-max-parallel-percentage=15

# Count-based parallel limit - repair at most 2 unhealthy nodes in parallel
eksctl create cluster --enable-node-repair \
  --node-repair-max-parallel-count=2
----

==== Configuration file for parallel limits

[,yaml]
----
managedNodeGroups:
  - name: parallel-ng
    nodeRepairConfig:
      enabled: true
      # Repair at most 15% of unhealthy nodes in parallel
      maxParallelNodesRepairedPercentage: 15
      # Alternative: repair at most 2 unhealthy nodes in parallel
      # maxParallelNodesRepairedCount: 2
      # Note: Cannot use both percentage and count limits simultaneously
----

=== Custom Repair Overrides

Specify granular overrides for specific repair actions. Each override controls the repair action taken and the minimum wait time before a node becomes eligible for repair. *If you use overrides, you must specify all values for each override.*

[,yaml]
----
managedNodeGroups:
  - name: custom-repair-ng
    instanceType: g4dn.xlarge # GPU instances
    nodeRepairConfig:
      enabled: true
      maxUnhealthyNodeThresholdPercentage: 25
      maxParallelNodesRepairedCount: 1
      nodeRepairConfigOverrides:
        # Handle network issues with a restart after waiting
        - nodeMonitoringCondition: "NetworkNotReady"
          nodeUnhealthyReason: "InterfaceNotUp"
          minRepairWaitTimeMins: 20
          repairAction: "Restart"
----

== Complete Configuration Examples

For a comprehensive example with all configuration options, see link:https://github.com/eksctl-io/eksctl/blob/main/examples/44-node-repair.yaml[examples/44-node-repair.yaml].

=== Example 1: Basic repair with percentage thresholds

[,yaml]
----
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: basic-repair-cluster
  region: us-west-2

managedNodeGroups:
  - name: basic-ng
    instanceType: m5.large
    desiredCapacity: 3
    nodeRepairConfig:
      enabled: true
      maxUnhealthyNodeThresholdPercentage: 20
      maxParallelNodesRepairedPercentage: 15
----

=== Example 2: Conservative repair for critical workloads

[,yaml]
----
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: critical-workload-cluster
  region: us-west-2

managedNodeGroups:
  - name: critical-ng
    instanceType: c5.2xlarge
    desiredCapacity: 6
    nodeRepairConfig:
      enabled: true
      # Very conservative settings
      maxUnhealthyNodeThresholdPercentage: 10
      maxParallelNodesRepairedCount: 1
      nodeRepairConfigOverrides:
        # Wait longer before taking action on critical workloads
        - nodeMonitoringCondition: "NetworkNotReady"
          nodeUnhealthyReason: "InterfaceNotUp"
          minRepairWaitTimeMins: 45
          repairAction: "Restart"
----

=== Example 3: GPU workload with specialized repair

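Building on the override fields shown in the previous sections, a GPU-focused configuration might look like the following sketch. The GPU-specific `nodeMonitoringCondition` and `nodeUnhealthyReason` values below are illustrative placeholders, not confirmed identifiers; consult the linked examples/44-node-repair.yaml for the supported values.

[,yaml]
----
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: gpu-repair-cluster
  region: us-west-2

managedNodeGroups:
  - name: gpu-ng
    instanceType: g4dn.xlarge
    desiredCapacity: 4
    nodeRepairConfig:
      enabled: true
      maxUnhealthyNodeThresholdPercentage: 25
      maxParallelNodesRepairedCount: 1
      nodeRepairConfigOverrides:
        # Illustrative GPU override: condition, reason, and action values are placeholders
        - nodeMonitoringCondition: "AcceleratedHardwareNotReady"
          nodeUnhealthyReason: "GpuNotReady"
          minRepairWaitTimeMins: 10
          repairAction: "Terminate"
----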