
Feature: Automated Garbage Collection for Defunct Job Pods  #526

@yonahd


Motivation

Our batch workloads generate a significant number of defunct Pods (from completed, failed, or suspended Jobs) that persist in the cluster.

The current cleanup mechanisms are insufficient, targeting only a few specific failure reasons (like "Evicted"). We need a comprehensive policy for Job-related Pod lifecycle management.

Proposed Solution

Implement a garbage collection policy that automatically identifies and deletes Pods owned by Jobs that are no longer actively scheduling work.

Criteria for Deletion (Definition of "Unused Pod")

A Pod is eligible for deletion if it passes the Ownership Check AND at least one of the Job State Checks:

1. Ownership Check (Mandatory)
The Pod MUST have an ownerReference of kind Job.

2. Job State Checks (any one of these triggers deletion)
Job Completed: The Job has a condition with type: Complete and status: True.
Job Failed (Limits): The Job has failed with reason BackoffLimitExceeded or DeadlineExceeded.
Job Suspended: The Job is explicitly suspended (.spec.suspend: true).

3. Pod Status Check (included for completeness; original scope)
pod.Status.Phase is a terminal state (Succeeded or Failed), including the specific case of pod.Status.Reason being Evicted.
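
The criteria above could be sketched in Go roughly as follows. This is only an illustration under stated assumptions: `OwnerRef`, `JobCondition`, `Job`, and `Pod` are hypothetical, simplified stand-ins for the corresponding Kubernetes API types, and the function names are invented for this example. The sketch implements the stated rule (ownership AND at least one Job state check); the pod status check is kept as a separate helper because the criteria do not say how it should combine with the other two.

```go
package main

// Hypothetical, simplified stand-ins for the Kubernetes API types.
// A real implementation would read the same fields from the actual
// Pod and Job objects returned by the API server.
type OwnerRef struct {
	Kind string
	Name string
}

type JobCondition struct {
	Type   string // "Complete" or "Failed"
	Status string // "True", "False", or "Unknown"
	Reason string // e.g. "BackoffLimitExceeded", "DeadlineExceeded"
}

type Job struct {
	Suspend    bool // mirrors .spec.suspend
	Conditions []JobCondition
}

type Pod struct {
	Owners []OwnerRef
	Phase  string // "Pending", "Running", "Succeeded", "Failed"
}

// ownedByJob is check 1: the Pod has an ownerReference of kind Job.
func ownedByJob(p Pod) bool {
	for _, o := range p.Owners {
		if o.Kind == "Job" {
			return true
		}
	}
	return false
}

// jobIsDefunct is check 2: the Job is completed, failed on a limit,
// or explicitly suspended.
func jobIsDefunct(j Job) bool {
	if j.Suspend {
		return true
	}
	for _, c := range j.Conditions {
		if c.Status != "True" {
			continue
		}
		if c.Type == "Complete" {
			return true
		}
		if c.Type == "Failed" &&
			(c.Reason == "BackoffLimitExceeded" || c.Reason == "DeadlineExceeded") {
			return true
		}
	}
	return false
}

// podIsTerminal is check 3: the Pod phase is Succeeded or Failed.
// Evicted Pods report phase Failed, so they are covered here as well.
func podIsTerminal(p Pod) bool {
	return p.Phase == "Succeeded" || p.Phase == "Failed"
}

// eligibleForDeletion combines checks 1 and 2 as stated in the criteria.
func eligibleForDeletion(p Pod, j Job) bool {
	return ownedByJob(p) && jobIsDefunct(j)
}
```

With these helpers, a Pod without a Job owner is never eligible, regardless of the Job passed in, and a Pod owned by a running, non-suspended Job is kept.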
Configuration

The feature requires a configurable grace period (e.g., retentionSecondsAfterCompletion) so that logs can be extracted before the Pods are deleted.

The feature should extend the existing pods module.
