Feature Request: Add periodic checkpoint saving to prevent progress loss on interruption #937

@karthikbekalp

Describe the bug

When using deadline queue sync-output for automatic/incremental downloads, all progress is lost if the download is interrupted (network failure, user cancellation, system crash, or expired credentials). The checkpoint is saved only at the very end of the operation, so the next run restarts from the beginning rather than resuming from where it stopped.

For queues with many files or large outputs, this causes:

  1. Significant time wasted re-downloading files that had already completed
  2. Increased S3 transfer costs for customers
  3. Poor user experience during long-running sync operations

Expected Behaviour

Save the checkpoint periodically during the download process (e.g., every 60 seconds, or after a certain number of files have been downloaded) rather than only at completion. This would allow an interrupted download to resume from a recent checkpoint.
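
To illustrate the requested behaviour, here is a minimal sketch of time-based checkpointing. It is not the actual sync_output() code: PeriodicCheckpointer, download_all, the download_one callback, and the plain dict used in place of IncrementalDownloadState are all hypothetical names chosen for illustration, and the 60-second default simply mirrors the example interval above.

```python
import json
import os
import tempfile
import time
from typing import Callable, Dict, Iterable


class PeriodicCheckpointer:
    """Hypothetical helper: persist state at most once per interval_seconds."""

    def __init__(self, path: str, interval_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.path = path
        self.interval_seconds = interval_seconds
        self._clock = clock
        self._last_saved = clock()
        self.save_count = 0

    def save(self, state: Dict) -> None:
        # Write to a temp file in the same directory, then rename it into place,
        # so an interruption mid-write never leaves a truncated checkpoint.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as handle:
                json.dump(state, handle)
            os.replace(tmp_path, self.path)
        except BaseException:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise
        self._last_saved = self._clock()
        self.save_count += 1

    def maybe_save(self, state: Dict) -> bool:
        # Save only if the interval has elapsed since the last save.
        if self._clock() - self._last_saved >= self.interval_seconds:
            self.save(state)
            return True
        return False


def download_all(files: Iterable[str], download_one: Callable[[str], None],
                 checkpointer: PeriodicCheckpointer, state: Dict) -> None:
    # `state` is a plain dict standing in for the real checkpoint object.
    completed = set(state.get("completed", []))
    try:
        for name in files:
            if name in completed:
                continue  # already downloaded in a previous run
            download_one(name)
            completed.add(name)
            state["completed"] = sorted(completed)
            checkpointer.maybe_save(state)
    finally:
        # Final save on success, and also on Ctrl+C or an exception.
        state["completed"] = sorted(completed)
        checkpointer.save(state)
```

The finally block covers Ctrl+C and ordinary exceptions, but not a hard kill or system crash; in those cases the most recent periodic save bounds the lost work to roughly one interval. A count-based or byte-based trigger could replace or complement the time check in maybe_save.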

Current Behaviour

The checkpoint (IncrementalDownloadState) is saved only once, at the end of sync_output() in queue_group.py.

Reproduction Steps

  1. Submit a job with many output files (or multiple jobs) to a queue in the farm.
  2. Wait for some tasks to complete and produce outputs.
  3. Start the incremental download (deadline queue sync-output).
  4. While files are downloading, interrupt the process (Ctrl+C, network disconnect, or kill the process).

Environment

This is not environment specific.

Metadata


Labels

enhancement (New feature or request), job attachments (For an issue with job attachments)