feat(job-wait): Fix IndexError and add ApiException for better error handling#155
feat(job-wait): Fix IndexError and add ApiException for better error handling#155haracejacob wants to merge 3 commits intorundeck-plugins:masterfrom
Conversation
|
This issue still occurs, whe solved it by mannualy patching job-wait.py |
|
Thank you for identifying and fixing this race condition! This is a real issue that affects customers, as confirmed by @mbranchnl's recent comment. However, we'd like to request a cleaner implementation approach. The current solution of throwing We'd prefer an approach similar to what @mbranchnl suggested in their manual patch - explicitly handle the empty pod list case with a clear warning and retry logic: pod_list = core_v1.list_namespaced_pod(
namespace,
label_selector="job-name==" + name
)
if not pod_list.items:
log.warning("No pods found for job yet, waiting for pod creation")
time.sleep(5)
continue # Continue the while loop to retry
first_item = pod_list.items[0]
pod_name = first_item.metadata.nameThis makes the code more maintainable and clearly communicates what's happening (pod not created yet) versus masking it as an API error. Could you update the PR to use this approach instead? Also, please rebase on the latest master branch before updating. Thanks again for the contribution! |
|
Thank you for the review. |
|
Thank you for updating the PR with the cleaner approach! The implementation looks much better. However, we found one issue: the timeout protection is bypassed in the new code path. The problem: The timeout check (line 74-75) only runs inside the The fix: Please add the timeout check before the if not pod_list.items:
log.warning("No pods found for job yet, waiting for pod creation")
time.sleep(5)
if timeout and time.time() - start_time > timeout:
raise TimeoutError
continueThis ensures the 5-minute timeout protection works for the empty pod list scenario. Could you update the PR with this change? Thanks! |
|
Thank you for catching that issue. You're right—the continue statement would have bypassed the timeout check, potentially leading to an infinite loop. I've updated the PR to address this, but instead of adding a second timeout check, I moved the logic to the top of the while True loop. I believe this is a cleaner and more robust approach for a couple of reasons:
Please let me know if this centralized approach looks good to you! |
When
job-waitis executed immediately afterjob-create, thecore_v1.list_namespaced_pod()function sometimes returns an empty array, it cause an IndexError when accessingpod_list.items[0]. This is due to the delay between job creation and pod creation.To fix this issue, Throwing
ApiException(404)when an empty array is returned to executetime.sleep(). And it raiseTimeoutErrorin an error situation.AS-IS
TO-BE