Fix timeout error after 4h and improve log resumption #173 (#178)

Open
PeekLeon wants to merge 1 commit into rundeck-plugins:master from PeekLeon:issue/173

Conversation

@PeekLeon

Fix timeout error after 4 hours and improve log resumption

Description:

  • Fixes the timeout error occurring after 4 hours.
  • Implements reconnection every 30 minutes.
  • Logs no longer restart from the beginning but resume from where they stopped.

Environment for Testing

  • Rundeck Version: 5.10.0
  • Plugin Version: 2.0.14

@PeekLeon
Author

Hi there,

I wanted to check in on this PR to see if there’s any chance to get it reviewed or merged.
Also, I was wondering if the project is still supported.

Thanks a lot for your time, and please let me know if there’s anything I can adjust to help move it forward.

fdevans requested a review from Copilot on August 26, 2025 18:46
Contributor

Copilot AI left a comment

Pull Request Overview

This PR fixes a timeout error that occurs after 4 hours by implementing periodic reconnections every 30 minutes and improving log resumption to avoid restarting from the beginning.

  • Implements a 30-minute connection timeout with automatic reconnection
  • Adds log line tracking to resume from the last processed line instead of restarting
  • Introduces connection state management to control retry behavior during reconnections
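As a rough illustration of the reconnection idea summarized above (not the code from this PR), the sketch below drops a stream once it has been open for a maximum time and opens a new one until the job is done. `follow_with_reconnect`, `open_stream`, and `is_finished` are hypothetical names, and the stream is any iterable of lines rather than a real Kubernetes log stream. It does not handle resumption: each new stream replays whatever `open_stream` returns.

```python
import time


def follow_with_reconnect(open_stream, is_finished, max_seconds, clock=time.monotonic):
    """Yield lines from successive streams, closing each one after max_seconds.

    open_stream: hypothetical callable returning an iterable of log lines.
    is_finished: hypothetical callable returning True once the job is done.
    """
    while True:
        deadline = clock() + max_seconds
        for line in open_stream():
            yield line
            if clock() >= deadline:
                # Leave this connection before any server-side limit is hit.
                break
        if is_finished():
            return
```

The clock is injectable so the loop can be exercised without waiting 30 minutes. A real implementation would also sleep between attempts when a stream ends early, to avoid a tight reconnect loop.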

Comment thread contents/job-wait.py


def wait():
    connection_max_time = 1800  # time in seconds

Copilot AI Aug 26, 2025

The magic number 1800 should be defined as a named constant (e.g., CONNECTION_TIMEOUT_SECONDS = 1800) to improve code maintainability and make the 30-minute timeout more explicit.

Suggested change
- connection_max_time = 1800 # time in seconds
+ connection_max_time = CONNECTION_TIMEOUT_SECONDS

Comment thread contents/job-wait.py

log.info("Waiting for job completion")
time.sleep(sleep)
if current_line_number > last_line_number or last_line_number == 0:

Copilot AI Aug 26, 2025

This condition will skip the first line when resuming (when last_line_number > 0). The condition should be 'current_line_number >= last_line_number' to avoid missing lines during log resumption.

Suggested change
- if current_line_number > last_line_number or last_line_number == 0:
+ if current_line_number >= last_line_number:

Comment thread contents/job-wait.py
Comment on lines +100 to +102
last_line_number = current_line_number

current_line_number += 1

Copilot AI Aug 26, 2025

The last_line_number is being set to current_line_number inside the if block, but current_line_number is incremented after this assignment. This will cause the next reconnection to skip one line. Move this assignment after the current_line_number increment or use current_line_number + 1.

Suggested change
- last_line_number = current_line_number
- current_line_number += 1
+ current_line_number += 1
+ last_line_number = current_line_number

@fdevans
Contributor

fdevans commented Mar 6, 2026

@PeekLeon Thanks for tackling issue #173! The 4-hour timeout is a real problem for long-running jobs, and we appreciate you working on a solution.

Code Review

The approach of periodic reconnection is on the right track, but there are a few issues that need to be addressed:

Issues to Fix

1. Off-by-one error in log resumption (Line 98)

The condition will skip the first line when resuming:

if current_line_number > last_line_number or last_line_number == 0:

Should be:

if current_line_number >= last_line_number:

2. Line tracking bug (Lines 102-104)

last_line_number is set before incrementing current_line_number, which will cause the next reconnection to skip a line:

last_line_number = current_line_number
current_line_number += 1

Should be:

current_line_number += 1
last_line_number = current_line_number

3. Magic number

The 1800 should be a named constant at the top of the file:

CONNECTION_TIMEOUT_SECONDS = 1800  # 30 minutes
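
One way to sanity-check the line-tracking changes in items 1 and 2 is to simulate them on an in-memory list. This is a minimal sketch, not the plugin's code: `replay_stream` is a hypothetical helper, and it assumes each reconnection replays the log from its first line. It uses the `>=` comparison together with `last_line_number = current_line_number + 1`, keeping that assignment inside the `if` so that skipped lines never move the marker backwards.

```python
def replay_stream(log_lines, last_line_number):
    """Simulate one connection that replays the log from its first line.

    Returns the lines emitted on this connection and the updated marker,
    which is the index of the next line that still has to be emitted.
    """
    emitted = []
    current_line_number = 0
    for line in log_lines:
        if current_line_number >= last_line_number:
            emitted.append(line)
            # Record the index of the next unseen line, not the current one.
            last_line_number = current_line_number + 1
        current_line_number += 1
    return emitted, last_line_number
```

Calling it once with two lines and a marker of 0, then again with a longer log and the returned marker, should emit only the lines that were not seen on the first connection.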

Suggested Improvements

Consider using Kubernetes API's built-in features instead of manual line counting, which is fragile:

Option A: Use timestamps (more reliable)

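# Enable timestamps in logs
# Assumes: from kubernetes import client, watch; core_v1 = client.CoreV1Api()
# Check that your client version accepts since_time; since_seconds is an alternative
w = watch.Watch()
for line in w.stream(
    core_v1.read_namespaced_pod_log,
    name=pod_name,
    namespace=namespace,
    timestamps=True,  # Get timestamps with each log line
    since_time=last_timestamp  # Resume from last timestamp on reconnect
):
    print(line)  # placeholder: emit the line and update last_timestamp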

Option B: Use timeout_seconds on watch

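# Let the watch API handle timeout
# Check that your client version accepts timeout_seconds for pod logs
w = watch.Watch()
for line in w.stream(
    core_v1.read_namespaced_pod_log,
    name=pod_name,
    namespace=namespace,
    timeout_seconds=1800  # End the stream after 30 minutes
):
    print(line)  # placeholder: emit the line
# When the stream ends, start a new one to reconnect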

These approaches are more robust than counting lines, as timestamps don't get out of sync.
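
To make the timestamp idea concrete without depending on a cluster, here is a small sketch of the filtering step only. It assumes each log line starts with a UTC timestamp ending in `Z`, followed by a space and the message, which is the shape produced when timestamps are enabled. `parse_log_line` and `new_lines` are hypothetical helper names, not part of the Kubernetes client.

```python
from datetime import datetime, timezone


def parse_log_line(raw_line):
    """Split 'TIMESTAMP message' into ((second, nanoseconds), message)."""
    stamp, _, message = raw_line.partition(" ")
    base, _, fraction = stamp.rstrip("Z").partition(".")
    second = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    # Pad or cut the fractional part to nanoseconds so keys compare correctly.
    nanos = int(fraction.ljust(9, "0")[:9]) if fraction else 0
    return (second, nanos), message


def new_lines(raw_lines, last_seen):
    """Yield (key, message) for lines strictly newer than last_seen."""
    for raw_line in raw_lines:
        key, message = parse_log_line(raw_line)
        if last_seen is None or key > last_seen:
            yield key, message
```

After each connection, the key of the last yielded line becomes `last_seen` for the next one. One caveat of the strict comparison: if several lines share the same timestamp and the connection drops between them, the remaining ones would be skipped, so a real implementation may need a tie-breaker.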

Requests Before Merge

1. Fix the bugs listed above

The off-by-one errors will cause missing log lines, which defeats the purpose of the fix.

2. Rebase on latest master

Your branch is behind master by many commits, including critical security updates. Please rebase:

git fetch upstream
git rebase upstream/master
# Resolve any conflicts
git push --force-with-lease

3. Consider the timestamp approach

If you're open to it, the timestamp-based approach would be more reliable than line counting. Happy to discuss the implementation if you'd like guidance.

4. Add documentation

Please add a brief note in the README or code comments explaining:

  • The 4-hour timeout issue
  • How the reconnection works
  • Any configuration options (if you make the timeout configurable)
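
If the timeout does become configurable, one possible shape is to read it from the environment and fall back to the 30-minute default. This is only a sketch: the variable name `RD_CONFIG_CONNECTION_TIMEOUT` and the function `connection_timeout` are hypothetical and would need to match whatever property the plugin actually defines.

```python
import os

CONNECTION_TIMEOUT_SECONDS = 1800  # 30 minutes, the value used in this PR


def connection_timeout(environ=None):
    """Return the reconnect interval in seconds, using the default on bad input."""
    environ = os.environ if environ is None else environ
    # Hypothetical variable name; adjust to the plugin's real property.
    raw = environ.get("RD_CONFIG_CONNECTION_TIMEOUT", "")
    try:
        value = int(raw)
    except ValueError:
        return CONNECTION_TIMEOUT_SECONDS
    return value if value > 0 else CONNECTION_TIMEOUT_SECONDS
```

Falling back silently keeps existing jobs working when the option is unset; logging a warning on invalid input would be a reasonable addition.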

Let me know if you have questions or would like help with any of these changes. Thanks again for the contribution!
