feat(job-wait): Fix IndexError and add ApiException for better error handling #155

Open
haracejacob wants to merge 3 commits into rundeck-plugins:master from haracejacob:master

Conversation

@haracejacob

When job-wait is executed immediately after job-create, core_v1.list_namespaced_pod() sometimes returns an empty list, which causes an IndexError when accessing pod_list.items[0]. This is due to the delay between job creation and pod creation.

To fix this, the change throws ApiException(404) when an empty list is returned, so that the existing handler executes time.sleep() and retries. In an error situation, it raises TimeoutError.
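
A minimal sketch of the approach described above, not the actual job-wait.py code. The `ApiException` class is a local stand-in so the example runs without the Kubernetes client, and `first_pod_name`, `list_pod_names`, `retry_delay`, `sleep`, and `clock` are hypothetical names introduced for illustration.

```python
import logging
import time

log = logging.getLogger("kubernetes-wait-job")


class ApiException(Exception):
    """Local stand-in for the Kubernetes client's ApiException (only `status` is modeled)."""

    def __init__(self, status=None, reason=None):
        super().__init__(reason)
        self.status = status


def first_pod_name(list_pod_names, timeout=300, retry_delay=5,
                   sleep=time.sleep, clock=time.time):
    """Return the first pod name, treating an empty list like a 404 response.

    `list_pod_names` is a hypothetical callable that returns a list of pod names.
    """
    start_time = clock()
    while True:
        try:
            names = list_pod_names()
            if not names:
                # No pod exists yet: route into the retry path instead of indexing names[0].
                raise ApiException(status=404, reason="pod not created yet")
            return names[0]
        except ApiException as e:
            log.warning("Pod is not ready, status: %d", e.status)
            sleep(retry_delay)
            if timeout and clock() - start_time > timeout:
                raise TimeoutError("pod did not appear within %d seconds" % timeout)
```

Passing `sleep` and `clock` as parameters is only a convenience for trying the loop without real delays.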

AS-IS

Traceback (most recent call last):
  File "/home1/rundeck/libext/cache/kubernetes-2.0.10/job-wait.py", line 138, in <module>
    main()
  File "/home1/rundeck/libext/cache/kubernetes-2.0.10/job-wait.py", line 134, in main
    wait()
  File "/home1/rundeck/libext/cache/kubernetes-2.0.10/job-wait.py", line 60, in wait
    first_item = pod_list.items[0]
IndexError: list index out of range
Failed: NonZeroResultCode: Script result code was: 1

TO-BE

WARNING: kubernetes-wait-job: Pod is not ready, status: 404
INFO: kubernetes-wait-job: waiting for log

@mbranchnl

This issue still occurs; we solved it by manually patching job-wait.py:

  pod_list = core_v1.list_namespaced_pod(
      namespace,
      label_selector="job-name==" + name
  )

+++++++
  if not pod_list.items:
      log.warning("No pods found for the job yet, retry")
      time.sleep(5)
      # Handle this situation as needed
  else:
+++++++
      first_item = pod_list.items[0]

@fdevans
Contributor

fdevans commented Mar 6, 2026

Thank you for identifying and fixing this race condition! This is a real issue that affects customers, as confirmed by @mbranchnl's recent comment.

However, we'd like to request a cleaner implementation approach. The current solution of throwing ApiException(404) when the pod list is empty works, but it's semantically incorrect - an empty list isn't actually an API exception, it's just a timing issue where the pod hasn't been created yet.

We'd prefer an approach similar to what @mbranchnl suggested in their manual patch - explicitly handle the empty pod list case with a clear warning and retry logic:

pod_list = core_v1.list_namespaced_pod(
    namespace,
    label_selector="job-name==" + name
)

if not pod_list.items:
    log.warning("No pods found for job yet, waiting for pod creation")
    time.sleep(5)
    continue  # Continue the while loop to retry

first_item = pod_list.items[0]
pod_name = first_item.metadata.name

This makes the code more maintainable and clearly communicates what's happening (pod not created yet) versus masking it as an API error.
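
To try the suggested control flow without a cluster, the following sketch replaces the API call with a fake. `make_pod_list` and `wait_for_pod_name` are hypothetical helpers, and the `SimpleNamespace` objects only mimic the `items` and `metadata.name` attributes used in the snippet above. Like that snippet, this loop retries without any upper bound.

```python
import logging
import time
from types import SimpleNamespace

log = logging.getLogger("kubernetes-wait-job")


def make_pod_list(*names):
    """Build a stand-in for the pod list response (hypothetical helper)."""
    return SimpleNamespace(
        items=[SimpleNamespace(metadata=SimpleNamespace(name=n)) for n in names]
    )


def wait_for_pod_name(list_pods, retry_delay=5, sleep=time.sleep):
    """Poll until the job has at least one pod, then return that pod's name."""
    while True:
        pod_list = list_pods()
        if not pod_list.items:
            log.warning("No pods found for job yet, waiting for pod creation")
            sleep(retry_delay)
            continue  # List the pods again on the next iteration
        return pod_list.items[0].metadata.name


# Simulate the race: two empty responses before the pod shows up.
responses = iter([make_pod_list(), make_pod_list(), make_pod_list("demo-job-abc12")])
pod_name = wait_for_pod_name(lambda: next(responses), sleep=lambda seconds: None)
```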

Could you update the PR to use this approach instead? Also, please rebase on the latest master branch before updating.

Thanks again for the contribution!

@haracejacob
Author

@fdevans

Thank you for the review.
I have updated the retry logic based on your feedback and rebased the branch on the latest master.
I would appreciate it if you could take another look!

@fdevans
Contributor

fdevans commented Mar 9, 2026

Thank you for updating the PR with the cleaner approach! The implementation looks much better.

However, we found one issue: the timeout protection is bypassed in the new code path.

The problem:

The timeout check (lines 74-75) only runs inside the except ApiException block. Your new code uses continue before reaching that check, which means if pods never appear, the loop will run forever instead of timing out after 300 seconds.

The fix:

Please add the timeout check before the continue:

if not pod_list.items:
    log.warning("No pods found for job yet, waiting for pod creation")
    time.sleep(5)
    if timeout and time.time() - start_time > timeout:
        raise TimeoutError
    continue

This ensures the 5-minute timeout protection works for the empty pod list scenario.
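
As a rough check of this behavior, the sketch below runs the empty-list path against a fake clock so that no real waiting happens. The function name and the `list_pods`, `retry_delay`, `sleep`, and `clock` parameters are hypothetical and not taken from job-wait.py.

```python
import logging
import time
from types import SimpleNamespace

log = logging.getLogger("kubernetes-wait-job")


def wait_for_pod_name(list_pods, timeout=300, retry_delay=5,
                      sleep=time.sleep, clock=time.time):
    """Return the first pod's name, or raise TimeoutError if no pod ever appears."""
    start_time = clock()
    while True:
        pod_list = list_pods()
        if not pod_list.items:
            log.warning("No pods found for job yet, waiting for pod creation")
            sleep(retry_delay)
            # Check the deadline here, because `continue` skips everything below it.
            if timeout and clock() - start_time > timeout:
                raise TimeoutError("no pod created within %d seconds" % timeout)
            continue
        return pod_list.items[0].metadata.name


# Fake clock that advances 200 seconds per reading; the pod list stays empty forever.
fake_time = iter(range(0, 100000, 200))
try:
    wait_for_pod_name(lambda: SimpleNamespace(items=[]),
                      sleep=lambda seconds: None,
                      clock=lambda: next(fake_time))
    outcome = "pod found"
except TimeoutError:
    outcome = "timed out"
```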

Could you update the PR with this change?

Thanks!

fdevans requested a review from a team on March 9, 2026 22:32
@haracejacob
Author

haracejacob commented Mar 14, 2026

@fdevans

Thank you for catching that issue. You're right—the continue statement would have bypassed the timeout check, potentially leading to an infinite loop.

I've updated the PR to address this, but instead of adding a second timeout check, I moved the logic to the top of the while True loop. I believe this is a cleaner and more robust approach for a couple of reasons:

  • Centralized Protection: It ensures the timeout is checked on every single iteration, regardless of which code path (like continue) is taken later in the loop.
  • DRY (Don't Repeat Yourself): It avoids duplicating the timeout logic, making the code easier to maintain.
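
A sketch of what such a centralized check could look like, under the assumption that the loop has two retry paths (an API error and an empty pod list). The `ApiException` class is a local stand-in for the Kubernetes client's exception, and all function and parameter names are hypothetical.

```python
import time
from types import SimpleNamespace


class ApiException(Exception):
    """Local stand-in for the Kubernetes client's ApiException."""

    def __init__(self, status=None):
        super().__init__(status)
        self.status = status


def wait_for_pod_name(list_pods, timeout=300, retry_delay=5,
                      sleep=time.sleep, clock=time.time):
    """Return the first pod's name, with a single deadline check per iteration."""
    start_time = clock()
    while True:
        # One check at the top covers every path that later ends in `continue`.
        if timeout and clock() - start_time > timeout:
            raise TimeoutError("no ready pod within %d seconds" % timeout)
        try:
            pod_list = list_pods()
        except ApiException:
            sleep(retry_delay)
            continue
        if not pod_list.items:
            sleep(retry_delay)
            continue
        return pod_list.items[0].metadata.name


def always_404():
    """Fake lister whose API call fails on every attempt."""
    raise ApiException(status=404)


def always_empty():
    """Fake lister whose pod list never gets populated."""
    return SimpleNamespace(items=[])
```

With a fake clock, both `always_404` and `always_empty` end in `TimeoutError`, even though neither retry path contains its own deadline check.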

Please let me know if this centralized approach looks good to you!


3 participants