Fix timeout error after 4h and improve log resumption #173#178
Fix timeout error after 4h and improve log resumption #173#178PeekLeon wants to merge 1 commit intorundeck-plugins:masterfrom
Conversation
|
Hi there, I wanted to check in on this PR to see if there’s any chance to get it reviewed or merged. Thanks a lot for your time, and please let me know if there’s anything I can adjust to help move it forward. |
There was a problem hiding this comment.
Pull Request Overview
This PR fixes a timeout error that occurs after 4 hours by implementing periodic reconnections every 30 minutes and improving log resumption to avoid restarting from the beginning.
- Implements a 30-minute connection timeout with automatic reconnection
- Adds log line tracking to resume from the last processed line instead of restarting
- Introduces connection state management to control retry behavior during reconnections
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
|
|
||
|
|
||
| def wait(): | ||
| connection_max_time = 1800 # time in seconds |
There was a problem hiding this comment.
The magic number 1800 should be defined as a named constant (e.g., CONNECTION_TIMEOUT_SECONDS = 1800) to improve code maintainability and make the 30-minute timeout more explicit.
| connection_max_time = 1800 # time in seconds | |
| connection_max_time = CONNECTION_TIMEOUT_SECONDS |
|
|
||
| log.info("Waiting for job completion") | ||
| time.sleep(sleep) | ||
| if current_line_number > last_line_number or last_line_number == 0: |
There was a problem hiding this comment.
This condition will skip the first line when resuming (when last_line_number > 0). The condition should be 'current_line_number >= last_line_number' to avoid missing lines during log resumption.
| if current_line_number > last_line_number or last_line_number == 0: | |
| if current_line_number >= last_line_number: |
| last_line_number = current_line_number | ||
|
|
||
| current_line_number += 1 |
There was a problem hiding this comment.
The last_line_number is being set to current_line_number inside the if block, but current_line_number is incremented after this assignment. This will cause the next reconnection to skip one line. Move this assignment after the current_line_number increment or use current_line_number + 1.
| last_line_number = current_line_number | |
| current_line_number += 1 | |
| current_line_number += 1 | |
| last_line_number = current_line_number |
|
@PeekLeon Thanks for tackling issue #173! The 4-hour timeout is a real problem for long-running jobs, and we appreciate you working on a solution. Code ReviewThe approach of periodic reconnection is on the right track, but there are a few issues that need to be addressed: Issues to Fix1. Off-by-one error in log resumption (Line 98) The condition will skip the first line when resuming: if current_line_number > last_line_number or last_line_number == 0:Should be: if current_line_number >= last_line_number:2. Line tracking bug (Lines 102-104)
last_line_number = current_line_number
current_line_number += 1Should be: current_line_number += 1
last_line_number = current_line_number3. Magic number The CONNECTION_TIMEOUT_SECONDS = 1800 # 30 minutesSuggested ImprovementsConsider using Kubernetes API's built-in features instead of manual line counting, which is fragile: Option A: Use timestamps (more reliable) # Enable timestamps in logs
for line in w.stream(
core_v1.read_namespaced_pod_log,
name=pod_name,
namespace=namespace,
timestamps=True, # Get timestamps with each log line
since_time=last_timestamp # Resume from last timestamp on reconnect
):Option B: Use timeout_seconds on watch # Let the watch API handle timeout
w = watch.Watch()
for line in w.stream(
core_v1.read_namespaced_pod_log,
name=pod_name,
namespace=namespace,
timeout_seconds=1800 # Automatically reconnect after 30 minutes
):These approaches are more robust than counting lines, as timestamps don't get out of sync. Requests Before Merge1. Fix the bugs listed aboveThe off-by-one errors will cause missing log lines, which defeats the purpose of the fix. 2. Rebase on latest masterYour branch is behind master by many commits, including critical security updates. Please rebase: git fetch upstream
git rebase upstream/master
# Resolve any conflicts
git push --force-with-lease3. Consider the timestamp approachIf you're open to it, the timestamp-based approach would be more reliable than line counting. Happy to discuss the implementation if you'd like guidance. 4. Add documentationPlease add a brief note in the README or code comments explaining:
Let me know if you have questions or would like help with any of these changes. Thanks again for the contribution! |
Fix timeout error after 4 hours and improve log resumption
Description:
Environment for Testing