Motivation
Our batch workloads generate a significant number of defunct Pods (from completed, failed, or suspended Jobs) that persist in the cluster.
The current cleanup mechanisms are insufficient because they target only a few specific failure reasons (such as "Evicted"). We need a comprehensive policy for Job-related Pod lifecycle management.
Proposed Solution
Implement a garbage collection policy to automatically identify and delete Pods owned by Jobs that are no longer actively scheduling work.
Criteria for Deletion (Definition of "Unused Pod")
A Pod is eligible for deletion if it meets the Ownership Check AND at least one of the Job State Checks:
1. Ownership Check (Mandatory)
The Pod MUST have an ownerReference of kind Job.
2. Job State Checks (Any of these trigger deletion)
Job Completed: The Job has a type: Complete condition with status: True.
Job Failed (Limits): The Job has a type: Failed condition with status: True and reason BackoffLimitExceeded or DeadlineExceeded.
Job Suspended: The Job is suspended (.spec.suspend: true).
3. Pod Status Check (Original Scope, retained for completeness)
pod.Status.Phase is terminal (Succeeded or Failed). This includes evicted Pods, which have phase Failed and pod.Status.Reason set to Evicted.
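The criteria above can be sketched as a single predicate. This is a minimal Go sketch, not the pods module's actual code: the Pod and Job structs are hypothetical stand-ins that mirror only the relevant Kubernetes API fields (a real implementation would use the k8s.io/api core/v1 and batch/v1 types), and isUnusedPod, ownedByJob, jobInactive, and podTerminal are illustrative names. It also assumes the Pod Status Check keeps its original scope as an independent criterion, and that job is nil when the owning Job cannot be found.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the Kubernetes API fields used below.
type OwnerReference struct {
	Kind string
	Name string
}

type PodStatus struct {
	Phase  string // "Pending", "Running", "Succeeded", "Failed", "Unknown"
	Reason string // e.g. "Evicted"
}

type Pod struct {
	Name            string
	OwnerReferences []OwnerReference
	Status          PodStatus
}

type JobCondition struct {
	Type   string // "Complete" or "Failed"
	Status string // "True", "False", "Unknown"
	Reason string // e.g. "BackoffLimitExceeded", "DeadlineExceeded"
}

type Job struct {
	Name       string
	Suspend    bool // mirrors .spec.suspend
	Conditions []JobCondition
}

// ownedByJob implements the mandatory Ownership Check.
func ownedByJob(pod Pod) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind == "Job" {
			return true
		}
	}
	return false
}

// jobInactive implements the Job State Checks; any single one is sufficient.
func jobInactive(job Job) bool {
	if job.Suspend {
		return true
	}
	for _, c := range job.Conditions {
		if c.Status != "True" {
			continue
		}
		if c.Type == "Complete" {
			return true
		}
		if c.Type == "Failed" && (c.Reason == "BackoffLimitExceeded" || c.Reason == "DeadlineExceeded") {
			return true
		}
	}
	return false
}

// podTerminal implements the original-scope Pod Status Check.
// Evicted Pods have Phase "Failed" with Reason "Evicted", so they match here.
func podTerminal(pod Pod) bool {
	return pod.Status.Phase == "Succeeded" || pod.Status.Phase == "Failed"
}

// isUnusedPod combines the checks. ASSUMPTION: the Pod Status Check stays an
// independent criterion; job is nil if the owning Job could not be found.
func isUnusedPod(pod Pod, job *Job) bool {
	if podTerminal(pod) {
		return true
	}
	return ownedByJob(pod) && job != nil && jobInactive(*job)
}

func main() {
	owner := []OwnerReference{{Kind: "Job", Name: "nightly-report"}}
	active := &Job{Name: "nightly-report"}
	done := &Job{Name: "nightly-report", Conditions: []JobCondition{{Type: "Complete", Status: "True"}}}
	pod := Pod{Name: "nightly-report-x7k2p", OwnerReferences: owner, Status: PodStatus{Phase: "Running"}}

	fmt.Println(isUnusedPod(pod, active)) // false
	fmt.Println(isUnusedPod(pod, done))   // true
}
```

Keeping each check in its own function makes it straightforward to enable or disable individual criteria through configuration later.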
Configuration
The feature requires a configurable grace period (e.g., retentionSecondsAfterCompletion) that delays deletion long enough for logs to be extracted.
The feature should extend the existing pods module.