Follow-up from v0.6 scope-trimming. Deferred from the initial BatchRunner implementation (v0.6 v1 is stateless JSONL → JSONL).
What v0.6 v1 has for free
- Ray Core task retries (
@ray.remote(max_retries=N)): a worker dying mid-task → automatic re-dispatch.
- Ray Data lineage re-execution: a worker dying mid-batch → that batch is re-dispatched to another worker from lineage.
These cover failures within a single run.
What's missing
Resumable progress across process restarts — kill the driver and restart, Ray does not pick up where it left off. For long offline jobs (e.g., 10M-row SDXL inference over S3), this matters.
Design sketch
- Write output rows as they complete to a per-shard file (
{sink}/part-{shard_id}.jsonl) rather than a single monolithic sink file.
- On
run(), enumerate existing output shards; skip input shards whose output already exists and is complete (sentinel file, or row-count match).
- Requires shard/partition semantics on both source and sink — clean for Parquet, extra bookkeeping for JSONL.
Prerequisites
- S3 / Parquet sources and sinks likely land first (those are where checkpointing genuinely matters — JSONL jobs are usually small enough to just re-run).
BatchSpec grows a checkpoint_dir field and a resume: bool = True flag.
References
Follow-up from v0.6 scope-trimming. Deferred from the initial
BatchRunnerimplementation (v0.6 v1 is stateless JSONL → JSONL).What v0.6 v1 has for free
@ray.remote(max_retries=N)): a worker dying mid-task → automatic re-dispatch.These cover failures within a single run.
What's missing
Resumable progress across process restarts — kill the driver and restart, Ray does not pick up where it left off. For long offline jobs (e.g., 10M-row SDXL inference over S3), this matters.
Design sketch
{sink}/part-{shard_id}.jsonl) rather than a single monolithic sink file.run(), enumerate existing output shards; skip input shards whose output already exists and is complete (sentinel file, or row-count match).Prerequisites
BatchSpecgrows acheckpoint_dirfield and aresume: bool = Trueflag.References