BatchRunner: resumable checkpointing across job restarts #12

@korbonits

Description

Follow-up from the v0.6 scope trimming. Checkpointing was deferred from the initial BatchRunner implementation (v0.6 v1 is stateless JSONL → JSONL).

What v0.6 v1 has for free

  • Ray Core task retries (@ray.remote(max_retries=N)): a worker dying mid-task → automatic re-dispatch.
  • Ray Data lineage re-execution: a worker dying mid-batch → that batch is re-dispatched to another worker from lineage.

These cover failures within a single run.

What's missing

Resumable progress across process restarts: kill the driver and restart it, and Ray does not pick up where it left off. For long offline jobs (e.g., 10M-row SDXL inference over S3), this matters.

Design sketch

  • Write output rows as they complete to a per-shard file ({sink}/part-{shard_id}.jsonl) rather than a single monolithic sink file.
  • On run(), enumerate existing output shards; skip input shards whose output already exists and is complete (sentinel file, or row-count match).
  • Requires shard/partition semantics on both source and sink — clean for Parquet, extra bookkeeping for JSONL.
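The write-then-sentinel and skip-if-complete steps above could be sketched as follows (file layout and sentinel naming are assumptions for illustration, not the final design):

```python
import json
import os
import tempfile

def write_shard(sink_dir, shard_id, rows):
    """Write one output shard, then a sentinel marking it complete."""
    path = os.path.join(sink_dir, f"part-{shard_id}.jsonl")
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # The sentinel is created only after the shard is fully written, so a
    # crash mid-shard leaves no sentinel and the shard is redone on resume.
    with open(path + ".done", "w"):
        pass

def pending_shards(sink_dir, all_shard_ids):
    """On run(), keep only shards whose completion sentinel is missing."""
    return [s for s in all_shard_ids
            if not os.path.exists(
                os.path.join(sink_dir, f"part-{s}.jsonl.done"))]

# Demo: two of three shards completed before a (simulated) restart;
# only shard 2 remains to process.
sink = tempfile.mkdtemp()
write_shard(sink, 0, [{"row": 0}])
write_shard(sink, 1, [{"row": 1}])
remaining = pending_shards(sink, [0, 1, 2])
```

The row-count-match alternative mentioned above would replace the sentinel check with counting lines in each `part-{shard_id}.jsonl` against the input shard size, at the cost of a full read per shard.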

Prerequisites

  • S3 / Parquet sources and sinks likely land first; those are where checkpointing genuinely matters, since JSONL jobs are usually small enough to just re-run.
  • BatchSpec grows a checkpoint_dir field and a resume: bool = True flag.
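The BatchSpec additions might look like this (only `checkpoint_dir` and `resume` come from this issue; the `source`/`sink` fields are illustrative placeholders, not the real schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BatchSpec:
    # Placeholder existing fields, for illustration only.
    source: str
    sink: str
    # Proposed checkpointing fields from this issue:
    checkpoint_dir: Optional[str] = None  # where completion markers live
    resume: bool = True  # skip already-complete output shards on run()

spec = BatchSpec(source="s3://bucket/in/", sink="s3://bucket/out/")
```

Defaulting `resume=True` makes restarts safe by default; passing `resume=False` would force a clean re-run.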

Metadata


Labels

enhancement (New feature or request)
