Checkpoint should save after each step, not just at end

## Problem

The `CheckpointManager` saves state after each step completes, which is correct. However, if a single step runs for a long time (e.g., a 10-minute meta-review) and the process is killed mid-step, the partial work from that step is lost.

More importantly, in a crew pipeline with 8 sequential steps, a failure in step 6 lets us resume from step 6 on the next run. But if the pipeline runner itself crashes (OOM, laptop sleep, network drop), the checkpoint may not have been flushed to disk, so even steps that had already completed may not be recoverable.

## Suggested fix

1. Ensure `CheckpointManager.save()` flushes its writes and calls `fsync` on the checkpoint file before returning, to guarantee durability
2. Consider saving a "step started" marker before execution and "step completed" marker after, so on resume we can distinguish "step 6 never started" from "step 6 started but crashed"
3. For crew pipelines specifically, consider a mode where each expert's output is saved to a separate file immediately on completion, so partial crew results are always recoverable
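To make the three suggestions concrete, here is a rough sketch in Python. It does not reflect the actual `CheckpointManager` internals, which are not shown in this issue: the helper names (`durable_write_json`, `mark_step`, `step_state`, `save_expert_output`) and the file layout are hypothetical and only illustrate the idea.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def durable_write_json(path: Path, payload: dict) -> None:
    """Write payload to path via a temp file, fsync, then atomic rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # force file contents to disk
        os.replace(tmp_name, path)  # atomic when on the same filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    if os.name == "posix":
        # fsync the directory so the rename itself survives a crash
        dir_fd = os.open(path.parent, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


def _marker(checkpoint_dir: Path, step: int, status: str) -> Path:
    # Hypothetical layout: step_06.started.json, step_06.completed.json
    return checkpoint_dir / f"step_{step:02d}.{status}.json"


def mark_step(checkpoint_dir: Path, step: int, status: str,
              output: Optional[str] = None) -> None:
    """Record a 'started' or 'completed' marker for a step."""
    durable_write_json(_marker(checkpoint_dir, step, status),
                       {"step": step, "status": status, "output": output})


def step_state(checkpoint_dir: Path, step: int) -> str:
    """Distinguish the three cases that matter on resume."""
    if _marker(checkpoint_dir, step, "completed").exists():
        return "completed"
    if _marker(checkpoint_dir, step, "started").exists():
        return "started_but_crashed"
    return "never_started"


def save_expert_output(checkpoint_dir: Path, step: int,
                       expert: str, output: str) -> Path:
    """Persist one expert's result in its own file as soon as it finishes."""
    path = checkpoint_dir / f"step_{step:02d}" / f"{expert}.json"
    durable_write_json(path, {"expert": expert, "output": output})
    return path
```

A runner would call `mark_step(dir, n, "started")` before executing step `n` and `mark_step(dir, n, "completed", output)` afterwards. Writing to a temp file and renaming means a crash mid-write leaves either the old checkpoint or the new one, never a truncated file.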

## Real-world context

During overnight IMC crew runs, two reviewers (imc-chair and methodology-expert) produced empty outputs due to timeout issues. The pipeline continued but the final meta-review was based on incomplete data. A more robust checkpoint system would have detected the empty outputs and retried or flagged them.
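The detection described above could look roughly like the following. This is a sketch under assumptions: `run_expert` stands in for whatever callable actually invokes a reviewer, and `is_empty`, `run_with_retry`, and `collect_outputs` are hypothetical names, not existing pipeline functions.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def is_empty(output: Optional[str]) -> bool:
    """Treat None and whitespace-only text as an empty expert output."""
    return output is None or not output.strip()


def run_with_retry(name: str,
                   run_expert: Callable[[str], Optional[str]],
                   max_attempts: int = 2) -> Optional[str]:
    """Re-run an expert until it returns non-empty output or attempts run out."""
    for _ in range(max_attempts):
        output = run_expert(name)
        if not is_empty(output):
            return output
    return None  # caller should flag this expert as missing


def collect_outputs(names: Iterable[str],
                    run_expert: Callable[[str], Optional[str]],
                    max_attempts: int = 2
                    ) -> Tuple[Dict[str, str], List[str]]:
    """Return (usable outputs, experts that stayed empty after retries)."""
    outputs: Dict[str, str] = {}
    missing: List[str] = []
    for name in names:
        output = run_with_retry(name, run_expert, max_attempts)
        if output is None:
            missing.append(name)
        else:
            outputs[name] = output
    return outputs, missing
```

With something like this in place, the meta-review step could refuse to run, or at least note the gap in its output, whenever `missing` is non-empty.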

## Labels
enhancement, reliability
