@@ -12,19 +12,15 @@ title: Pod Priority and Preemption
 [Pods](/docs/user-guide/pods) in Kubernetes 1.8 and later can have priority. Priority
 indicates the importance of a Pod relative to other Pods. When a Pod cannot be scheduled,
 the scheduler tries to preempt (evict) lower priority Pods to make scheduling of the
-pending Pod possible. In a future Kubernetes release, priority will also affect
-out-of-resource eviction ordering on the Node.
-
-**Note:** Preemption does not respect PodDisruptionBudget; see
-[the limitations section](#poddisruptionbudget-is-not-supported) for more details.
-{: .note}
+pending Pod possible. In Kubernetes 1.9 and later, priority also affects the scheduling
+order of Pods and out-of-resource eviction ordering on the Node.
 
 {% endcapture %}
 
 {% capture body %}
 
 ## How to use priority and preemption
-To use priority and preemption in Kubernetes 1.8, follow these steps:
+To use priority and preemption in Kubernetes 1.8 and later, follow these steps:
 
 1. Enable the feature.
 
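For step 1, a minimal sketch of the component flags involved, assuming the alpha `PodPriority` feature gate and the `scheduling.k8s.io/v1alpha1` API group used by Kubernetes 1.8 and 1.9; verify the flags against your release before relying on them:

```shell
# Enable Pod priority on the API server and expose the alpha PriorityClass API.
kube-apiserver --feature-gates=PodPriority=true \
  --runtime-config=scheduling.k8s.io/v1alpha1=true ...

# Enable the same feature gate on the scheduler so it can preempt by priority.
kube-scheduler --feature-gates=PodPriority=true ...
```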
@@ -135,6 +131,15 @@ spec:
   priorityClassName: high-priority
 ```
 
+### Effect of Pod priority on scheduling order
+
+In Kubernetes 1.9 and later, when Pod priority is enabled, the scheduler orders pending
+Pods by their priority and a pending Pod is placed ahead of other pending Pods with
+lower priority in the scheduling queue. As a result, a higher priority Pod may
+be scheduled sooner than Pods with lower priority if its scheduling requirements
+are met. If such a Pod cannot be scheduled, the scheduler continues and tries to
+schedule other lower priority Pods.
+
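As an illustration of this ordering, a sketch with two pending Pods; it assumes the `high-priority` PriorityClass shown above plus a hypothetical `low-priority` class with a smaller value. With priority enabled, `important-app` is placed ahead of `batch-job` in the scheduling queue:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: important-app
spec:
  containers:
  - name: app
    image: nginx
  priorityClassName: high-priority   # higher value, tried first
---
apiVersion: v1
kind: Pod
metadata:
  name: batch-job
spec:
  containers:
  - name: worker
    image: busybox
  priorityClassName: low-priority    # hypothetical class with a smaller value
```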
 ## Preemption
 
 When Pods are created, they go to a queue and wait to be scheduled. The scheduler
@@ -145,9 +150,9 @@ where removal of one or more Pods with lower priority than P would enable P to b
 on that Node. If such a Node is found, one or more lower priority Pods get
 deleted from the Node. After the Pods are gone, P can be scheduled on the Node.
 
-### Limitations of preemption (alpha version)
+### Limitations of preemption
 
-#### Starvation of preempting Pod
+#### Graceful termination of preemption victims
 
 When Pods are preempted, the victims get their
 [graceful termination period](https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods).
@@ -156,33 +161,24 @@ killed. This graceful termination period creates a time gap between the point
 that the scheduler preempts Pods and the time when the pending Pod (P) can be
 scheduled on the Node (N). In the meantime, the scheduler keeps scheduling other
 pending Pods. As victims exit or get terminated, the scheduler tries to schedule
-Pods in the pending queue, and one or more of them may be considered and
-scheduled to N before the scheduler considers scheduling P on N. In such a case,
-it is likely that when all the victims exit, Pod P won't fit on Node N anymore.
-So, scheduler will have to preempt other Pods on Node N or another Node so that
-P can be scheduled. This scenario might be repeated again for the second and
-subsequent rounds of preemption, and P might not get scheduled for a while.
-This scenario can cause problems in various clusters, but is particularly
-problematic in clusters with a high Pod creation rate.
-
-We will address this problem in the beta version of Pod preemption. The solution
-we plan to implement is
-[provided here](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/pod-preemption.md#preemption-mechanics).
+Pods in the pending queue. Therefore, there is usually a time gap between the point
+that the scheduler preempts victims and the time that Pod P is scheduled. To
+minimize this gap, you can set the graceful termination period of lower priority
+Pods to zero or a small number.
 
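For example, a workload that is expected to be a preemption victim can declare a short grace period; a sketch only, assuming a hypothetical `low-priority` class and a container that can safely exit immediately:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: low-priority-worker
spec:
  priorityClassName: low-priority      # hypothetical low-value PriorityClass
  terminationGracePeriodSeconds: 0     # exit immediately when preempted (default is 30)
  containers:
  - name: worker
    image: busybox
```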
-#### PodDisruptionBudget is not supported
+#### PodDisruptionBudget is supported, but not guaranteed!
 
 A [Pod Disruption Budget (PDB)](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/)
 allows application owners to limit the number of Pods of a replicated application that
-are down simultaneously from voluntary disruptions. However, the alpha version of
-preemption does not respect PDB when choosing preemption victims.
-We plan to add PDB support in beta, but even in beta, respecting PDB will be best
-effort. The Scheduler will try to find victims whose PDB won't be violated by preemption,
-but if no such victims are found, preemption will still happen, and lower priority Pods
-will be removed despite their PDBs being violated.
+are down simultaneously from voluntary disruptions. Kubernetes 1.9 supports PDB
+when preempting Pods, but respecting PDB is best effort. The scheduler tries to
+find victims whose PDBs are not violated by preemption, but if no such victims are
+found, preemption will still happen, and lower priority Pods will be removed
+despite their PDBs being violated.
 
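For reference, a typical PDB looks like the sketch below (names assumed); the scheduler prefers victims whose removal would not violate such a budget, but this is not a guarantee:

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2          # keep at least two matching Pods running
  selector:
    matchLabels:
      app: zookeeper
```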
 #### Inter-Pod affinity on lower-priority Pods
 
-In version 1.8, a Node is considered for preemption only when
+A Node is considered for preemption only when
 the answer to this question is yes: "If all the Pods with lower priority than
 the pending Pod are removed from the Node, can the pending Pod be scheduled on
 the Node?"
@@ -200,15 +196,6 @@ lower-priority Pods. In this case, the scheduler does not preempt any Pods on th
 Node. Instead, it looks for another Node. The scheduler might find a suitable Node
 or it might not. There is no guarantee that the pending Pod can be scheduled.
 
-We might address this issue in future versions, but we don't have a clear plan yet.
-We will not consider it a blocker for Beta or GA. Part
-of the reason is that finding the set of lower-priority Pods that satisfy all
-inter-Pod affinity rules is computationally expensive, and adds substantial
-complexity to the preemption logic. Besides, even if preemption keeps the lower-priority
-Pods to satisfy inter-Pod affinity, the lower priority Pods might be preempted
-later by other Pods, which removes the benefits of having the complex logic of
-respecting inter-Pod affinity.
-
 Our recommended solution for this problem is to create inter-Pod affinity only towards
 equal or higher priority Pods.
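A sketch of that recommendation, assuming the pending Pod must be co-located with Pods labeled `app: important-service`, which are themselves run at equal or higher priority (the names and labels here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-affinity
spec:
  priorityClassName: high-priority
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: important-service    # these Pods run at equal or higher priority
        topologyKey: kubernetes.io/hostname
  containers:
  - name: app
    image: nginx
```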
 