Problem
The CheckpointManager saves state after each step completes, which is correct. But if a step itself is very long (e.g., a 10-minute meta-review) and the process is killed mid-step, we lose the partial work.
More importantly, in a crew pipeline with 8 sequential steps, if step 6 fails, we can resume from step 6 on the next run. But if the pipeline runner itself crashes (OOM, laptop sleep, network drop), the checkpoint may not have been flushed to disk.
Suggested fix
- Ensure CheckpointManager.save() is called with fsync to guarantee durability
- Consider saving a "step started" marker before execution and "step completed" marker after, so on resume we can distinguish "step 6 never started" from "step 6 started but crashed"
- For crew pipelines specifically, consider a mode where each expert's output is saved to a separate file immediately on completion, so partial crew results are always recoverable
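The three suggestions above could be sketched roughly as follows. This is a minimal illustration, not the existing CheckpointManager code: the helper names (`durable_write_json`, `mark_step`, `step_state`, `save_expert_output`) and the on-disk file layout are hypothetical, and it assumes checkpoints are JSON files in a local directory.

```python
import json
import os
import tempfile
from pathlib import Path


def durable_write_json(path: Path, payload: dict) -> None:
    """Hypothetical helper: write JSON atomically and fsync it to disk."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Temp file lives in the target directory so os.replace is an atomic rename.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f)
            f.flush()              # push Python's buffer to the OS
            os.fsync(f.fileno())   # ask the OS to push it to stable storage
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    # On POSIX, also fsync the directory so the rename itself is durable.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(path.parent, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


def mark_step(checkpoint_dir: Path, step: int, status: str) -> None:
    """Record a 'started' or 'completed' marker for a step (assumed layout)."""
    marker = checkpoint_dir / f"step_{step}.{status}.json"
    durable_write_json(marker, {"step": step, "status": status})


def step_state(checkpoint_dir: Path, step: int) -> str:
    """Classify a step on resume from the markers found on disk."""
    if (checkpoint_dir / f"step_{step}.completed.json").exists():
        return "completed"
    if (checkpoint_dir / f"step_{step}.started.json").exists():
        return "crashed"  # started but never finished
    return "never_started"


def save_expert_output(checkpoint_dir: Path, step: int, expert: str, output: str) -> Path:
    """Persist one expert's output to its own file as soon as it finishes."""
    path = checkpoint_dir / f"step_{step}" / f"{expert}.json"
    durable_write_json(path, {"expert": expert, "output": output})
    return path
```

A runner would call `mark_step(dir, n, "started")` before executing step n and `mark_step(dir, n, "completed")` after it, so a resume can tell "never started" from "started but crashed". Writing to a temp file and renaming means a crash mid-write leaves the previous checkpoint intact rather than a truncated one.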
Real-world context
During overnight IMC crew runs, two reviewers (imc-chair and methodology-expert) produced empty outputs due to timeout issues. The pipeline continued but the final meta-review was based on incomplete data. A more robust checkpoint system would have detected the empty outputs and retried or flagged them.
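As an illustration of the "retried or flagged" behaviour, here is a small sketch of an output check wrapped around a single expert call. The function name `run_with_output_check` and the callable-based interface are assumptions for the example, not part of the current pipeline.

```python
from typing import Callable, Tuple


def run_with_output_check(
    run_expert: Callable[[], str], max_attempts: int = 2
) -> Tuple[str, bool]:
    """Hypothetical wrapper: retry an expert whose output comes back empty.

    Returns (output, ok). ok is False when every attempt produced empty or
    whitespace-only output, so the caller can flag it instead of continuing.
    """
    output = ""
    for _ in range(max_attempts):
        output = run_expert()
        if output and output.strip():
            return output, True
    return output, False
```

With a check like this, the pipeline could refuse to start the meta-review (or mark it as based on incomplete data) whenever any reviewer returns `ok == False`.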
Labels
enhancement, reliability