Sandbox resume 409: suspended sandbox reports no checkpoint data after prior checkpoint-received events #1343

## Summary
A Coder Hub sandbox session failed to resume. The Hub surfaced a generic error, but a direct manual resume via the Agentuity CLI reproduced a platform-side `409` with a more specific message:

```text
Cannot resume sandbox: it is suspended but has no checkpoint data
```

This looks inconsistent with the sandbox lifecycle history, which shows prior evacuation/checkpoint activity for the same sandbox, including one event with a concrete checkpoint id. The expectation is that this sandbox should be resumable, or at minimum the platform state should be self-consistent.

## Affected IDs
- Hub session: `codesess_938b33a6bf93`
- Sandbox: `sbx_aa2c0c5d2c92b74e8f890ad57ca57f454ad8b52f05461d0f0ad23263e88c`
- Driver job: `job_d9f56dde7973877928a81271`

## What I checked
### Hub side
- Hub session row still exists and is `paused`.
- In-memory runtime is gone, which is expected after paused eviction.
- Hub sandbox tracker latched this error:
  - `Sandbox resume failed while platform status is suspended (HTTP 409 conflict). This looks like a platform checkpoint/resume failure.`
- Replay / DB activity shows the last persisted assistant reply completed cleanly; no new `user_prompt`, `turn_start`, task, or tool activity was recorded after that.
- This means the failed prompt never made it into a live turn; failure happened during wake/resume, before agent execution.

### Platform side
Current sandbox status:
- `suspended`

Manual resume via CLI:
```bash
agentuity cloud sandbox resume sbx_aa2c0c5d2c92b74e8f890ad57ca57f454ad8b52f05461d0f0ad23263e88c \
  --org-id org_2u8RgDTwcZWrZrZ3sZh24T5FCtz --json
```

Result:
```json
{
  "error": {
    "code": "API_ERROR",
    "message": "Cannot resume sandbox: it is suspended but has no checkpoint data",
    "exitCode": 14,
    "details": {
      "tag": "APIErrorResponse",
      "status": 409,
      "sessionId": "sess_c3233f64db553bfd2dfa6d9cc4fcb095"
    }
  }
}
```

Sandbox status remained `suspended` after the manual resume attempt.

## Relevant sandbox lifecycle timeline
All times UTC on 2026-04-03.

- `14:55:42Z` sandbox created / started
- `14:55:44Z` tracked driver job created and reported `running`
- `15:17:19Z` `lifecycle:suspended` + `evacuation:state-update(status=suspended, evacuation_phase=checkpoint-received)`
- `15:18:23Z` another suspend event emitted with:
  - `checkpoint_id=ckpt_fc6d480abf57f5f3`
  - `checkpoint_bucket=ago-d066e3-checkpoints`
  - `checkpoint_size=65519891`
- `15:19:21Z` `lifecycle:reconcile(previous_status=suspended)`
- `15:23:07Z` `lifecycle:resumed`
- `15:24:03Z` another evacuation/suspend sequence:
  - `lifecycle:suspended(phase=pre-suspend, suspension_reason=evacuation)`
  - `evacuation:state-update(status=suspended, evacuation_phase=checkpoint-received)`
  - another `lifecycle:suspended(phase=pre-suspend, suspension_reason=evacuation)`
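
To make the ordering easier to reason about, here is a minimal sketch that replays the timeline above and asks two questions: which event last carried a checkpoint id, and whether any checkpoint id was recorded after the most recent resume. The `Event` type, the helper names, and the `created` / `driver-job-running` labels are my own for illustration. Only the fields quoted in this issue are included, so the raw event stream may carry more than this shows.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    time: str                            # HH:MM:SS, UTC on 2026-04-03
    kind: str                            # event name as quoted in this issue
    checkpoint_id: Optional[str] = None  # set only where the issue quotes one


# Transcribed from the timeline above. "created" and "driver-job-running"
# are descriptive labels of mine, not platform event names.
TIMELINE: List[Event] = [
    Event("14:55:42", "created"),
    Event("14:55:44", "driver-job-running"),
    Event("15:17:19", "lifecycle:suspended"),
    Event("15:17:19", "evacuation:state-update"),
    Event("15:18:23", "lifecycle:suspended", checkpoint_id="ckpt_fc6d480abf57f5f3"),
    Event("15:19:21", "lifecycle:reconcile"),
    Event("15:23:07", "lifecycle:resumed"),
    Event("15:24:03", "lifecycle:suspended"),
    Event("15:24:03", "evacuation:state-update"),
    Event("15:24:03", "lifecycle:suspended"),
]


def latest_checkpoint(events: List[Event]) -> Optional[Event]:
    """Return the most recent event that carries a checkpoint id, if any."""
    for event in reversed(events):
        if event.checkpoint_id:
            return event
    return None


def checkpoint_ids_since_last_resume(events: List[Event]) -> List[str]:
    """Collect checkpoint ids recorded after the last lifecycle:resumed event."""
    last_resume = max(
        (i for i, e in enumerate(events) if e.kind == "lifecycle:resumed"),
        default=-1,
    )
    return [e.checkpoint_id for e in events[last_resume + 1:] if e.checkpoint_id]


newest = latest_checkpoint(TIMELINE)
print(newest.time, newest.checkpoint_id)
print(checkpoint_ids_since_last_resume(TIMELINE))
```

On the data as transcribed here, the only concrete checkpoint id comes from the `15:18:23Z` suspend, which predates the `15:23:07Z` resume, and the `15:24:03Z` sequence lists `checkpoint-received` without an id. I cannot tell from these events alone whether the later suspend actually produced a checkpoint object.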

## Additional suspicious signals
- The platform still reports the original tracked driver job as `running` with no completion or replacement job.
- A direct `GET /sandbox/:id` succeeds normally.
- A direct `GET /sandbox/checkpoints/:id?orgId=...` timed out after 12s with zero bytes returned.
  - I have not confirmed that this is related, but it is suspicious given the resume error.

## Why this looks wrong
The platform is simultaneously telling us:
- the sandbox is `suspended`
- earlier lifecycle events included `checkpoint-received`
- one suspend emitted a concrete checkpoint id
- a manual resume now fails because the sandbox allegedly has `no checkpoint data`

That combination should not happen in a healthy checkpoint/resume flow.
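
The contradiction can be written down as a small predicate. This is only a sketch of the invariant I would expect to hold, with a hypothetical function name and parameters. It is not how the platform models sandbox state.

```python
from typing import Iterable, Optional


def looks_inconsistent(
    status: str,
    evacuation_phases: Iterable[str],
    checkpoint_ids: Iterable[str],
    resume_error: Optional[str],
) -> bool:
    """True when a suspended sandbox has checkpoint evidence in its history
    but a resume attempt claims there is no checkpoint data."""
    has_checkpoint_evidence = (
        "checkpoint-received" in set(evacuation_phases) or any(checkpoint_ids)
    )
    claims_no_checkpoint = (
        resume_error is not None and "no checkpoint data" in resume_error
    )
    return status == "suspended" and has_checkpoint_evidence and claims_no_checkpoint


# Values observed for this sandbox, as reported in this issue.
observed = looks_inconsistent(
    status="suspended",
    evacuation_phases=["checkpoint-received", "checkpoint-received"],
    checkpoint_ids=["ckpt_fc6d480abf57f5f3"],
    resume_error="Cannot resume sandbox: it is suspended but has no checkpoint data",
)
print(observed)  # True
```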

## Expected behavior
One of these should be true:
1. the sandbox resumes successfully from its latest checkpoint, or
2. the sandbox transitions into a terminal/invalid state with a consistent reason and the checkpoint/event surfaces agree about the missing checkpoint, or
3. the resume endpoint returns a more precise failure mode tied to the actual checkpoint object that is missing/corrupt/unreadable.

## Notes
Coder Hub currently collapses any resume `409` that leaves the sandbox in `suspended` into a generic message, so the direct CLI/manual resume result above is the most useful raw signal I found.
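
For illustration, here is a minimal sketch of passing the platform message through instead of collapsing it. It parses the CLI `--json` error shown above. `describe_resume_failure` is a hypothetical helper, not existing Hub code, and it assumes the error shape seen in this one response.

```python
import json

# Error payload copied from the manual CLI resume attempt in this issue.
CLI_OUTPUT = """
{
  "error": {
    "code": "API_ERROR",
    "message": "Cannot resume sandbox: it is suspended but has no checkpoint data",
    "exitCode": 14,
    "details": {
      "tag": "APIErrorResponse",
      "status": 409,
      "sessionId": "sess_c3233f64db553bfd2dfa6d9cc4fcb095"
    }
  }
}
"""


def describe_resume_failure(raw: str) -> str:
    """Build a user-facing message that keeps the platform's own wording."""
    error = json.loads(raw).get("error", {})
    details = error.get("details", {})
    message = error.get("message", "unknown error")
    status = details.get("status")
    session_id = details.get("sessionId")

    parts = ["Sandbox resume failed"]
    if status is not None:
        parts.append(f" (HTTP {status})")
    parts.append(f": {message}")
    if session_id:
        parts.append(f" (sessionId={session_id})")
    return "".join(parts)


print(describe_resume_failure(CLI_OUTPUT))
```

Keeping the `sessionId` in the surfaced message would also make it easier to correlate a Hub-side failure with platform logs.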

