Checkpoint should save after each step, not just at end #27

@landigf

Description

Problem

The CheckpointManager saves state after each step completes, which is correct. But if a step itself is very long (e.g., a 10-minute meta-review) and the process is killed mid-step, we lose the partial work.

More importantly, in a crew pipeline with 8 sequential steps, if step 6 fails, we can resume from step 6 on the next run. But if the pipeline runner itself crashes (OOM, laptop sleep, network drop), the checkpoint may not have been flushed to disk.

Suggested fix

  1. Ensure CheckpointManager.save() writes atomically and calls fsync before returning, so the checkpoint is durable on disk before the next step begins
  2. Consider saving a "step started" marker before execution and "step completed" marker after, so on resume we can distinguish "step 6 never started" from "step 6 started but crashed"
  3. For crew pipelines specifically, consider a mode where each expert's output is saved to a separate file immediately on completion, so partial crew results are always recoverable
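Points 1 and 2 could look something like the sketch below. This is a minimal illustration, not the actual CheckpointManager API: the class shape, the `mark_step_started` / `mark_step_completed` names, and the JSON state format are all assumptions. The key ideas are the write-temp-file / fsync / atomic-rename pattern (so a crash mid-write never corrupts the existing checkpoint) and persisting a "started" marker before the step runs.

```python
import json
import os
import tempfile


class CheckpointManager:
    """Hypothetical durable checkpoint manager (names are illustrative)."""

    def __init__(self, path):
        self.path = path
        self.state = {"steps": {}}

    def save(self):
        # Write to a temp file in the same directory, fsync it, then
        # atomically rename over the checkpoint. A crash at any point
        # leaves either the old file or the new file, never a torn one.
        dirname = os.path.dirname(os.path.abspath(self.path)) or "."
        fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(self.state, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, self.path)
            # fsync the directory so the rename itself is durable (POSIX).
            dfd = os.open(dirname, os.O_RDONLY)
            try:
                os.fsync(dfd)
            finally:
                os.close(dfd)
        finally:
            if os.path.exists(tmp):
                os.unlink(tmp)

    def mark_step_started(self, step):
        # Flushed to disk *before* the step executes, so a mid-step crash
        # is distinguishable from "never started" on resume.
        self.state["steps"][step] = "started"
        self.save()

    def mark_step_completed(self, step):
        self.state["steps"][step] = "completed"
        self.save()

    def resume_point(self, ordered_steps):
        # First step that is not "completed". A lingering "started" entry
        # signals a mid-step crash that may need cleanup or a retry.
        for step in ordered_steps:
            if self.state["steps"].get(step) != "completed":
                return step
        return None
```

On resume, `resume_point` returning a step whose marker reads "started" is exactly the "step 6 started but crashed" case from point 2.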

Real-world context

During overnight IMC crew runs, two reviewers (imc-chair and methodology-expert) produced empty outputs due to timeout issues. The pipeline continued but the final meta-review was based on incomplete data. A more robust checkpoint system would have detected the empty outputs and retried or flagged them.
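The empty-output failure mode above could be guarded with a small wrapper around each expert invocation: retry on empty output, and flag the result rather than silently feeding it into the meta-review. `run_expert` and the retry count here are hypothetical, purely to illustrate the shape.

```python
def run_with_retry(run_expert, name, max_retries=2):
    """Run one expert; retry on empty output, and flag a persistent
    empty result instead of passing it downstream unnoticed.

    run_expert: callable taking the expert name, returning its output text
    (an assumed interface for this sketch).
    """
    for attempt in range(1 + max_retries):
        output = run_expert(name)
        if output and output.strip():
            return output, "ok"
        # Empty output, likely a timeout: retry before giving up.
    return "", "flagged_empty"
```

With this in place, the pipeline can pause or exclude a flagged reviewer instead of building the meta-review on incomplete data.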

Labels

enhancement, reliability
