Commit b0bc94d

sapphirew and rhaowang authored
docs: enhance node repair configuration documentation (#5)
Co-authored-by: Hao Wang <rhaowang@amazon.com>
1 parent cd7ea8c commit b0bc94d

docs/nodegroups/nodegroup-node-repair-config.adoc

Lines changed: 288 additions & 12 deletions
@@ -2,40 +2,42 @@
 
 [.topic]
 [#nodegroup-node-repair-config]
-= Support for Node Repair Config in EKS Managed Nodegroups
+= Support Node Repair Configuration for EKS Managed Nodegroups
 :info_doctype: section
 :info_titleabbrev: Node Repair Config
 
-EKS Managed Nodegroups now supports Node Repair, where the health of managed nodes are monitored,
-and unhealthy worker nodes are replaced or rebooted in response.
+EKS Managed Nodegroups supports Node Repair, where the health of managed nodes is monitored,
+and unhealthy worker nodes are replaced or rebooted in response. eksctl now provides comprehensive
+configuration options for fine-grained control over node repair behavior.
 
-== Creating a cluster a managed nodegroup with node repair enabled
+== Basic Node Repair Configuration
 
-To create a cluster with a managed nodegroup using node repair, pass the `--enable-node-repair` flag:
+=== Using CLI flags
+
+To create a cluster with a managed nodegroup using basic node repair, pass the `--enable-node-repair` flag:
 
 [,shell]
 ----
 eksctl create cluster --enable-node-repair
 ----
 
-To create a managed nodegroup using node repair on an existing cluster:
+To create a managed nodegroup with node repair on an existing cluster:
 
 [,shell]
 ----
 eksctl create nodegroup --cluster=<clusterName> --enable-node-repair
 ----
 
-To create a cluster with a managed nodegroup using node repair via a config file:
+=== Using configuration files
 
 [,yaml]
 ----
-# node-repair-nodegroup-cluster.yaml
----- 
+# basic-node-repair.yaml
 apiVersion: eksctl.io/v1alpha5
 kind: ClusterConfig
 
 metadata:
-  name: cluster-44
+  name: basic-node-repair-cluster
   region: us-west-2
 
 managedNodeGroups:
@@ -46,9 +48,283 @@ managedNodeGroups:
 
 [,shell]
 ----
-eksctl create cluster -f node-repair-nodegroup-cluster.yaml
+eksctl create cluster -f basic-node-repair.yaml
 ----
 
-== Further information
+== Enhanced Node Repair Configuration
+
+=== Threshold Configuration
+
+You can configure when node repair actions will stop using either percentage or count-based thresholds. *Note: You cannot use both percentage and count thresholds at the same time.*
+
+==== CLI flags for thresholds
+
+[,shell]
+----
+# Percentage-based threshold - repair stops when 20% of nodes are unhealthy
+eksctl create cluster --enable-node-repair \
+  --node-repair-max-unhealthy-percentage=20
+
+# Count-based threshold - repair stops when 5 nodes are unhealthy
+eksctl create cluster --enable-node-repair \
+  --node-repair-max-unhealthy-count=5
+----
+
+==== Configuration file for thresholds
+
+[,yaml]
+----
+managedNodeGroups:
+- name: threshold-ng
+  nodeRepairConfig:
+    enabled: true
+    # Stop repair actions when 20% of nodes are unhealthy
+    maxUnhealthyNodeThresholdPercentage: 20
+    # Alternative: stop repair actions when 3 nodes are unhealthy
+    # maxUnhealthyNodeThresholdCount: 3
+    # Note: Cannot use both percentage and count thresholds simultaneously
+----
+
+=== Parallel Repair Limits
+
+Control the maximum number of nodes that can be repaired concurrently or in parallel. This gives you finer-grained control over the pace of node replacements. *Note: You cannot use both percentage and count limits at the same time.*
+
+==== CLI flags for parallel limits
+
+[,shell]
+----
+# Percentage-based parallel limits - repair at most 15% of unhealthy nodes in parallel
+eksctl create cluster --enable-node-repair \
+  --node-repair-max-parallel-percentage=15
+
+# Count-based parallel limits - repair at most 2 unhealthy nodes in parallel
+eksctl create cluster --enable-node-repair \
+  --node-repair-max-parallel-count=2
+----
+
+==== Configuration file for parallel limits
+
+[,yaml]
+----
+managedNodeGroups:
+- name: parallel-ng
+  nodeRepairConfig:
+    enabled: true
+    # Repair at most 15% of unhealthy nodes in parallel
+    maxParallelNodesRepairedPercentage: 15
+    # Alternative: repair at most 2 unhealthy nodes in parallel
+    # maxParallelNodesRepairedCount: 2
+    # Note: Cannot use both percentage and count limits simultaneously
+----
+
+=== Custom Repair Overrides
+
+Specify granular overrides for specific repair actions. These overrides control the repair action and the repair delay time before a node is considered eligible for repair. *If you use this, you must specify all the values for each override.*
+
+[,yaml]
+----
+managedNodeGroups:
+- name: custom-repair-ng
+  instanceType: g4dn.xlarge # GPU instances
+  nodeRepairConfig:
+    enabled: true
+    maxUnhealthyNodeThresholdPercentage: 25
+    maxParallelNodesRepairedCount: 1
+    nodeRepairConfigOverrides:
+    # Handle GPU-related failures by terminating the node after a 10-minute wait
+    - nodeMonitoringCondition: "AcceleratedInstanceNotReady"
+      nodeUnhealthyReason: "NvidiaXID13Error"
+      minRepairWaitTimeMins: 10
+      repairAction: "Terminate"
+    # Handle network issues by restarting the node after a 20-minute wait
+    - nodeMonitoringCondition: "NetworkNotReady"
+      nodeUnhealthyReason: "InterfaceNotUp"
+      minRepairWaitTimeMins: 20
+      repairAction: "Restart"
+----
+
+== Complete Configuration Examples
+
+For a comprehensive example with all configuration options, see link:https://github.com/eksctl-io/eksctl/blob/main/examples/44-node-repair.yaml[examples/44-node-repair.yaml].
+
+=== Example 1: Basic repair with percentage thresholds
+
+[,yaml]
+----
+apiVersion: eksctl.io/v1alpha5
+kind: ClusterConfig
+
+metadata:
+  name: basic-repair-cluster
+  region: us-west-2
+
+managedNodeGroups:
+- name: basic-ng
+  instanceType: m5.large
+  desiredCapacity: 3
+  nodeRepairConfig:
+    enabled: true
+    maxUnhealthyNodeThresholdPercentage: 20
+    maxParallelNodesRepairedPercentage: 15
+----
+
+=== Example 2: Conservative repair for critical workloads
+
+[,yaml]
+----
+apiVersion: eksctl.io/v1alpha5
+kind: ClusterConfig
+
+metadata:
+  name: critical-workload-cluster
+  region: us-west-2
+
+managedNodeGroups:
+- name: critical-ng
+  instanceType: c5.2xlarge
+  desiredCapacity: 6
+  nodeRepairConfig:
+    enabled: true
+    # Very conservative settings
+    maxUnhealthyNodeThresholdPercentage: 10
+    maxParallelNodesRepairedCount: 1
+    nodeRepairConfigOverrides:
+    # Wait longer before taking action on critical workloads
+    - nodeMonitoringCondition: "NetworkNotReady"
+      nodeUnhealthyReason: "InterfaceNotUp"
+      minRepairWaitTimeMins: 45
+      repairAction: "Restart"
+----
+
+=== Example 3: GPU workload with specialized repair
+
+[,yaml]
+----
+apiVersion: eksctl.io/v1alpha5
+kind: ClusterConfig
+
+metadata:
+  name: gpu-workload-cluster
+  region: us-west-2
+
+managedNodeGroups:
+- name: gpu-ng
+  instanceType: g4dn.xlarge
+  desiredCapacity: 4
+  nodeRepairConfig:
+    enabled: true
+    maxUnhealthyNodeThresholdPercentage: 25
+    maxParallelNodesRepairedCount: 1
+    nodeRepairConfigOverrides:
+    # Terminate nodes with GPU failures after a short 5-minute wait
+    - nodeMonitoringCondition: "AcceleratedInstanceNotReady"
+      nodeUnhealthyReason: "NvidiaXID13Error"
+      minRepairWaitTimeMins: 5
+      repairAction: "Terminate"
+----
+
+== CLI Reference
+
+=== Node Repair Flags
+
+[cols="1,2,1"]
+|===
+|Flag |Description |Example
+
+|`--enable-node-repair`
+|Enable automatic node repair
+|`--enable-node-repair`
+
+|`--node-repair-max-unhealthy-percentage`
+|Percentage of unhealthy nodes above which repair actions stop
+|`--node-repair-max-unhealthy-percentage=20`
+
+|`--node-repair-max-unhealthy-count`
+|Count of unhealthy nodes above which repair actions stop
+|`--node-repair-max-unhealthy-count=5`
+
+|`--node-repair-max-parallel-percentage`
+|Maximum percentage of unhealthy nodes to repair in parallel
+|`--node-repair-max-parallel-percentage=15`
+
+|`--node-repair-max-parallel-count`
+|Maximum count of unhealthy nodes to repair in parallel
+|`--node-repair-max-parallel-count=2`
+|===
+
+*Note:* Node repair config overrides are only supported through YAML configuration files due to their complexity.
+
+== Configuration Reference
+
+=== nodeRepairConfig
 
+[cols="1,1,2,1,1"]
+|===
+|Field |Type |Description |Constraints |Example
+
+|`enabled`
+|boolean
+|Enable/disable node repair
+|-
+|`true`
+
+|`maxUnhealthyNodeThresholdPercentage`
+|integer
+|Percentage threshold of unhealthy nodes, above which node auto repair actions will stop
+|Cannot be used with `maxUnhealthyNodeThresholdCount`
+|`20`
+
+|`maxUnhealthyNodeThresholdCount`
+|integer
+|Count threshold of unhealthy nodes, above which node auto repair actions will stop
+|Cannot be used with `maxUnhealthyNodeThresholdPercentage`
+|`5`
+
+|`maxParallelNodesRepairedPercentage`
+|integer
+|Maximum percentage of unhealthy nodes that can be repaired concurrently or in parallel
+|Cannot be used with `maxParallelNodesRepairedCount`
+|`15`
+
+|`maxParallelNodesRepairedCount`
+|integer
+|Maximum count of unhealthy nodes that can be repaired concurrently or in parallel
+|Cannot be used with `maxParallelNodesRepairedPercentage`
+|`2`
+
+|`nodeRepairConfigOverrides`
+|array
+|Granular overrides for specific repair actions controlling repair action and delay time
+|All values must be specified for each override
+|See examples above
+|===
+
+=== nodeRepairConfigOverrides
+
+[cols="1,1,2,1"]
+|===
+|Field |Type |Description |Valid Values
+
+|`nodeMonitoringCondition`
+|string
+|Unhealthy condition reported by the node monitoring agent that this override applies to
+|`"AcceleratedInstanceNotReady"`, `"NetworkNotReady"`
+
+|`nodeUnhealthyReason`
+|string
+|Reason reported by the node monitoring agent that this override applies to
+|`"NvidiaXID13Error"`, `"InterfaceNotUp"`
+
+|`minRepairWaitTimeMins`
+|integer
+|Minimum time in minutes to wait before attempting to repair a node with the specified condition and reason
+|Any positive integer
+
+|`repairAction`
+|string
+|Repair action to take for nodes when all of the specified conditions are met
+|`"Terminate"`, `"Restart"`, `"NoAction"`
+|===
+
+== Further information
 * link:eks/latest/userguide/node-health.html["EKS Managed Nodegroup Node Health",type="documentation"]
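
The `nodeRepairConfigOverrides` reference added in this commit lists `"NoAction"` as a valid `repairAction`, but none of the committed examples use it. The following is a minimal sketch of such an override, assuming it takes the same shape as the `Terminate` and `Restart` entries. The nodegroup name `observe-only-ng` and the 30-minute wait are hypothetical, and the source does not say whether the wait time has any effect when the action is `NoAction`.

[,yaml]
----
managedNodeGroups:
- name: observe-only-ng            # hypothetical nodegroup name
  nodeRepairConfig:
    enabled: true
    nodeRepairConfigOverrides:
    # Assumption: suppress automatic repair for this condition/reason pair
    - nodeMonitoringCondition: "NetworkNotReady"
      nodeUnhealthyReason: "InterfaceNotUp"
      minRepairWaitTimeMins: 30    # illustrative value; every field of an override must be set
      repairAction: "NoAction"
----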

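The mutual-exclusion notes in this commit apply within each pair of settings: threshold percentage versus threshold count, and parallel percentage versus parallel count. Example 2 already combines a percentage threshold with a count-based parallel limit. The sketch below assumes the reverse combination is accepted in the same way; the nodegroup name `mixed-limits-ng` is hypothetical.

[,yaml]
----
managedNodeGroups:
- name: mixed-limits-ng            # hypothetical nodegroup name
  nodeRepairConfig:
    enabled: true
    # Threshold pair: set exactly one of the two fields
    maxUnhealthyNodeThresholdCount: 3
    # maxUnhealthyNodeThresholdPercentage: 20   # cannot be combined with the count above
    # Parallel-limit pair: set exactly one of the two fields
    maxParallelNodesRepairedPercentage: 15
    # maxParallelNodesRepairedCount: 2          # cannot be combined with the percentage above
----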