BatchRunner: resumable checkpointing across job restarts #12

@korbonits

Description

Follow-up from the v0.6 scope trimming. Checkpointing was deferred from the initial BatchRunner implementation (v0.6 v1 is stateless JSONL → JSONL).

What v0.6 v1 has for free

  • Ray Core task retries (@ray.remote(max_retries=N)): a worker dying mid-task → automatic re-dispatch.
  • Ray Data lineage re-execution: a worker dying mid-batch → that batch is re-dispatched to another worker from lineage.

These cover failures within a single run.

What's missing

Resumable progress across process restarts: kill the driver and restart it, and Ray does not pick up where it left off. For long offline jobs (e.g., 10M-row SDXL inference over S3), this matters.

Design sketch

  • Write output rows as they complete to a per-shard file ({sink}/part-{shard_id}.jsonl) rather than a single monolithic sink file.
  • On run(), enumerate existing output shards; skip input shards whose output already exists and is complete (sentinel file, or row-count match).
  • Requires shard/partition semantics on both source and sink — clean for Parquet, extra bookkeeping for JSONL.
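The write-then-sentinel and skip-if-complete steps above could be sketched as follows (file layout and sentinel naming are assumptions for illustration, not the final design):

```python
import json
import os
import tempfile

def write_shard(sink_dir, shard_id, rows):
    """Write one output shard, then a sentinel marking it complete."""
    path = os.path.join(sink_dir, f"part-{shard_id}.jsonl")
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # The sentinel is created only after the shard is fully written, so a
    # crash mid-shard leaves no sentinel and the shard is redone on resume.
    with open(path + ".done", "w"):
        pass

def pending_shards(sink_dir, all_shard_ids):
    """On run(), keep only shards whose completion sentinel is missing."""
    return [s for s in all_shard_ids
            if not os.path.exists(
                os.path.join(sink_dir, f"part-{s}.jsonl.done"))]

# Demo: two of three shards completed before a (simulated) restart;
# only shard 2 remains to process.
sink = tempfile.mkdtemp()
write_shard(sink, 0, [{"row": 0}])
write_shard(sink, 1, [{"row": 1}])
remaining = pending_shards(sink, [0, 1, 2])
```

The row-count-match alternative mentioned above would replace the sentinel check with counting lines in each `part-{shard_id}.jsonl` against the input shard size, at the cost of a full read per shard.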

Prerequisites

  • S3 / Parquet sources and sinks likely land first; those are where checkpointing genuinely matters, since JSONL jobs are usually small enough to just re-run.
  • BatchSpec grows a checkpoint_dir field and a resume: bool = True flag.
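The BatchSpec additions might look like this (only `checkpoint_dir` and `resume` come from this issue; the `source`/`sink` fields are illustrative placeholders, not the real schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BatchSpec:
    # Placeholder existing fields, for illustration only.
    source: str
    sink: str
    # Proposed checkpointing fields from this issue:
    checkpoint_dir: Optional[str] = None  # where completion markers live
    resume: bool = True  # skip already-complete output shards on run()

spec = BatchSpec(source="s3://bucket/in/", sink="s3://bucket/out/")
```

Defaulting `resume=True` makes restarts safe by default; passing `resume=False` would force a clean re-run.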

Metadata


Labels

enhancement (New feature or request)
