Commit 4c69ef9
bsalamat authored and zacharysarah committed

Update preemption document with the new improvements added in 1.9 (kubernetes#6505)
1 parent 2caaa89 commit 4c69ef9

File tree

1 file changed: +25 −38 lines changed

docs/concepts/configuration/pod-priority-preemption.md

Lines changed: 25 additions & 38 deletions
@@ -12,19 +12,15 @@ title: Pod Priority and Preemption
 [Pods](/docs/user-guide/pods) in Kubernetes 1.8 and later can have priority. Priority
 indicates the importance of a Pod relative to other Pods. When a Pod cannot be scheduled,
 the scheduler tries to preempt (evict) lower priority Pods to make scheduling of the
-pending Pod possible. In a future Kubernetes release, priority will also affect
-out-of-resource eviction ordering on the Node.
-
-**Note:** Preemption does not respect PodDisruptionBudget; see
-[the limitations section](#poddisruptionbudget-is-not-supported) for more details.
-{: .note}
+pending Pod possible. In Kubernetes 1.9 and later, priority also affects the scheduling
+order of Pods and out-of-resource eviction ordering on the Node.
 
 {% endcapture %}
 
 {% capture body %}
 
 ## How to use priority and preemption
-To use priority and preemption in Kubernetes 1.8, follow these steps:
+To use priority and preemption in Kubernetes 1.8 and later, follow these steps:
 
 1. Enable the feature.
 
@@ -135,6 +131,15 @@ spec:
   priorityClassName: high-priority
 ```
 
+### Effect of Pod priority on scheduling order
+
+In Kubernetes 1.9 and later, when Pod priority is enabled, the scheduler orders pending
+Pods by their priority, and a pending Pod is placed ahead of other pending Pods with
+lower priority in the scheduling queue. As a result, a higher priority Pod may
+be scheduled sooner than Pods with lower priority if its scheduling requirements
+are met. If such a Pod cannot be scheduled, the scheduler continues and tries to
+schedule other lower priority Pods.
+
 ## Preemption
 
 When Pods are created, they go to a queue and wait to be scheduled. The scheduler
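
For reference, the `high-priority` class named in the hunk above is a PriorityClass defined earlier in the document. A minimal sketch of such a class and a Pod that consumes it might look like the following; the `scheduling.k8s.io/v1alpha1` API group reflects the alpha feature in 1.8/1.9, and the name and `value` are illustrative, not prescriptive:

```yaml
# A PriorityClass gives a symbolic name to an integer priority value.
apiVersion: scheduling.k8s.io/v1alpha1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000          # higher value = higher priority
globalDefault: false
description: "Use for service Pods that must schedule ahead of others."
---
# A Pod opts in by naming the class; admission resolves
# priorityClassName to the integer value when the Pod is created.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  priorityClassName: high-priority
```
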
@@ -145,9 +150,9 @@ where removal of one or more Pods with lower priority than P would enable P to b
 on that Node. If such a Node is found, one or more lower priority Pods get
 deleted from the Node. After the Pods are gone, P can be scheduled on the Node.
 
-### Limitations of preemption (alpha version)
+### Limitations of preemption
 
-#### Starvation of preempting Pod
+#### Graceful termination of preemption victims
 
 When Pods are preempted, the victims get their
 [graceful termination period](https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods).
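
The hunk that follows recommends shortening the victims' graceful termination period to narrow the gap between preemption and the preemptor's scheduling. A minimal sketch of a low-priority Pod configured that way; the class name is a hypothetical PriorityClass, while `terminationGracePeriodSeconds` is the standard Pod field:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: low-priority-worker
spec:
  priorityClassName: low-priority    # hypothetical low-value PriorityClass
  # If this Pod is chosen as a preemption victim, it gets only 5 seconds
  # to exit, so the preempting Pod can be scheduled sooner.
  terminationGracePeriodSeconds: 5
  containers:
  - name: worker
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
```
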
@@ -156,33 +161,24 @@ killed. This graceful termination period creates a time gap between the point
 that the scheduler preempts Pods and the time when the pending Pod (P) can be
 scheduled on the Node (N). In the meantime, the scheduler keeps scheduling other
 pending Pods. As victims exit or get terminated, the scheduler tries to schedule
-Pods in the pending queue, and one or more of them may be considered and
-scheduled to N before the scheduler considers scheduling P on N. In such a case,
-it is likely that when all the victims exit, Pod P won't fit on Node N anymore.
-So, scheduler will have to preempt other Pods on Node N or another Node so that
-P can be scheduled. This scenario might be repeated again for the second and
-subsequent rounds of preemption, and P might not get scheduled for a while.
-This scenario can cause problems in various clusters, but is particularly
-problematic in clusters with a high Pod creation rate.
-
-We will address this problem in the beta version of Pod preemption. The solution
-we plan to implement is
-[provided here](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/pod-preemption.md#preemption-mechanics).
+Pods in the pending queue. Therefore, there is usually a time gap between the point
+that the scheduler preempts victims and the time that Pod P is scheduled. To
+minimize this gap, one can set the graceful termination period of lower priority Pods
+to zero or a small number.
 
-#### PodDisruptionBudget is not supported
+#### PodDisruptionBudget is supported, but not guaranteed!
 
 A [Pod Disruption Budget (PDB)](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/)
 allows application owners to limit the number of Pods of a replicated application that
-are down simultaneously from voluntary disruptions. However, the alpha version of
-preemption does not respect PDB when choosing preemption victims.
-We plan to add PDB support in beta, but even in beta, respecting PDB will be best
-effort. The Scheduler will try to find victims whose PDB won't be violated by preemption,
-but if no such victims are found, preemption will still happen, and lower priority Pods
-will be removed despite their PDBs being violated.
+are down simultaneously from voluntary disruptions. Kubernetes 1.9 supports PDB
+when preempting Pods, but respecting PDB is best effort. The scheduler tries to
+find victims whose PDBs are not violated by preemption, but if no such victims are
+found, preemption will still happen, and lower priority Pods will be removed
+despite their PDBs being violated.
 
 #### Inter-Pod affinity on lower-priority Pods
 
-In version 1.8, a Node is considered for preemption only when
+A Node is considered for preemption only when
 the answer to this question is yes: "If all the Pods with lower priority than
 the pending Pod are removed from the Node, can the pending Pod be scheduled on
 the Node?"
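
To make the PDB discussion above concrete, here is a minimal sketch of a PodDisruptionBudget; `policy/v1beta1` is the API group of that era, and the selector and threshold are illustrative:

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2            # the scheduler tries, best effort, to avoid
                             # preempting matching Pods below this count
  selector:
    matchLabels:
      app: my-app
```
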
@@ -200,15 +196,6 @@ lower-priority Pods. In this case, the scheduler does not preempt any Pods on th
 Node. Instead, it looks for another Node. The scheduler might find a suitable Node
 or it might not. There is no guarantee that the pending Pod can be scheduled.
 
-We might address this issue in future versions, but we don't have a clear plan yet.
-We will not consider it a blocker for Beta or GA. Part
-of the reason is that finding the set of lower-priority Pods that satisfy all
-inter-Pod affinity rules is computationally expensive, and adds substantial
-complexity to the preemption logic. Besides, even if preemption keeps the lower-priority
-Pods to satisfy inter-Pod affinity, the lower priority Pods might be preempted
-later by other Pods, which removes the benefits of having the complex logic of
-respecting inter-Pod affinity.
-
 Our recommended solution for this problem is to create inter-Pod affinity only towards
 equal or higher priority Pods.
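
As a rough illustration of that recommendation: affinity rules match labels, not priorities, so one way to follow the advice is to label Pods by priority tier and point `podAffinity` only at equal or higher tiers. The `priority-tier` label and the class names below are a hypothetical convention, not a Kubernetes feature:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: medium-priority-pod
  labels:
    priority-tier: medium            # hypothetical labeling convention
spec:
  priorityClassName: medium-priority # hypothetical PriorityClass
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: priority-tier
            operator: In
            values: ["medium", "high"]   # only equal or higher tiers
        topologyKey: kubernetes.io/hostname
  containers:
  - name: app
    image: nginx
```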