Sandbox resume 409: suspended sandbox reports no checkpoint data after prior checkpoint-received events #1343

@rblalock

Summary

A Coder Hub sandbox session failed to resume. The Hub surfaced a generic error, but a direct manual resume via the Agentuity CLI reproduced a platform-side 409 with a more specific message:

Cannot resume sandbox: it is suspended but has no checkpoint data

This looks inconsistent with the sandbox lifecycle history, which shows prior evacuation/checkpoint activity for the same sandbox, including one event with a concrete checkpoint id. The expectation is that this sandbox should be resumable, or at minimum the platform state should be self-consistent.

Affected IDs

  • Hub session: codesess_938b33a6bf93
  • Sandbox: sbx_aa2c0c5d2c92b74e8f890ad57ca57f454ad8b52f05461d0f0ad23263e88c
  • Driver job: job_d9f56dde7973877928a81271

What I checked

Hub side

  • Hub session row still exists and is paused.
  • In-memory runtime is gone, which is expected after paused eviction.
  • Hub sandbox tracker latched this error:
    • Sandbox resume failed while platform status is suspended (HTTP 409 conflict). This looks like a platform checkpoint/resume failure.
  • Replay / DB activity shows the last persisted assistant reply completed cleanly; no new user_prompt, turn_start, task, or tool activity was recorded after that.
  • This means the failed prompt never made it into a live turn; failure happened during wake/resume, before agent execution.

Platform side

Current sandbox status:

  • suspended

Manual resume via CLI:

agentuity cloud sandbox resume sbx_aa2c0c5d2c92b74e8f890ad57ca57f454ad8b52f05461d0f0ad23263e88c \
  --org-id org_2u8RgDTwcZWrZrZ3sZh24T5FCtz --json

Result:

{
  "error": {
    "code": "API_ERROR",
    "message": "Cannot resume sandbox: it is suspended but has no checkpoint data",
    "exitCode": 14,
    "details": {
      "tag": "APIErrorResponse",
      "status": 409,
      "sessionId": "sess_c3233f64db553bfd2dfa6d9cc4fcb095"
    }
  }
}

Sandbox status remained suspended after the manual resume attempt.
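
For tooling that wraps the CLI, the payload above can be told apart from other 409s by looking at `details.status` together with `message`. A minimal Python sketch, assuming the JSON shape shown above is stable; the function name and the returned labels are hypothetical and not part of the Agentuity CLI:

```python
import json


def classify_resume_error(raw: str) -> str:
    """Label a `sandbox resume --json` result (labels are hypothetical)."""
    error = json.loads(raw).get("error")
    if error is None:
        return "ok"
    status = error.get("details", {}).get("status")
    message = error.get("message", "")
    # Matches the message text seen in this issue; the platform may word
    # other checkpoint failures differently.
    if status == 409 and "no checkpoint data" in message:
        return "missing_checkpoint"
    if status == 409:
        return "conflict"
    return "other"


# Payload copied from the manual resume attempt above.
sample = """
{
  "error": {
    "code": "API_ERROR",
    "message": "Cannot resume sandbox: it is suspended but has no checkpoint data",
    "exitCode": 14,
    "details": {
      "tag": "APIErrorResponse",
      "status": 409,
      "sessionId": "sess_c3233f64db553bfd2dfa6d9cc4fcb095"
    }
  }
}
"""
print(classify_resume_error(sample))  # prints "missing_checkpoint"
```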

Relevant sandbox lifecycle timeline

All times UTC on 2026-04-03.

  • 14:55:42Z sandbox created / started
  • 14:55:44Z tracked driver job created and reported running
  • 15:17:19Z lifecycle:suspended + evacuation:state-update(status=suspended, evacuation_phase=checkpoint-received)
  • 15:18:23Z another suspend event emitted with:
    • checkpoint_id=ckpt_fc6d480abf57f5f3
    • checkpoint_bucket=ago-d066e3-checkpoints
    • checkpoint_size=65519891
  • 15:19:21Z lifecycle:reconcile(previous_status=suspended)
  • 15:23:07Z lifecycle:resumed
  • 15:24:03Z another evacuation/suspend sequence:
    • lifecycle:suspended(phase=pre-suspend, suspension_reason=evacuation)
    • evacuation:state-update(status=suspended, evacuation_phase=checkpoint-received)
    • another lifecycle:suspended(phase=pre-suspend, suspension_reason=evacuation)
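
One possible reading of the timeline can be checked mechanically: the only concrete checkpoint id was emitted before the 15:23:07Z resume, and nothing after that resume carries an id. The sketch below is illustrative only. The event dict layout is an assumption (not the platform's schema), and so is the rule that a checkpoint taken before a resume no longer counts for the current suspension.

```python
from typing import Optional

# Events transcribed from the timeline above. The dict layout is hypothetical.
EVENTS = [
    {"time": "15:17:19Z", "kind": "suspended", "phase": "checkpoint-received"},
    {"time": "15:18:23Z", "kind": "suspended", "checkpoint_id": "ckpt_fc6d480abf57f5f3"},
    {"time": "15:19:21Z", "kind": "reconcile"},
    {"time": "15:23:07Z", "kind": "resumed"},
    {"time": "15:24:03Z", "kind": "suspended", "phase": "pre-suspend"},
    {"time": "15:24:03Z", "kind": "state-update", "phase": "checkpoint-received"},
    {"time": "15:24:03Z", "kind": "suspended", "phase": "pre-suspend"},
]


def checkpoint_for_current_suspension(events: list) -> Optional[str]:
    """Return the checkpoint id recorded since the most recent resume, if any."""
    checkpoint = None
    for event in events:
        if event["kind"] == "resumed":
            # Assumption: a checkpoint taken before a resume is consumed by it.
            checkpoint = None
        elif event.get("checkpoint_id"):
            checkpoint = event["checkpoint_id"]
    return checkpoint


print(checkpoint_for_current_suspension(EVENTS[:3]))  # prints "ckpt_fc6d480abf57f5f3"
print(checkpoint_for_current_suspension(EVENTS))      # prints "None"
```

Under those assumptions, the second suspension reports checkpoint-received but never surfaces a checkpoint id, which would be consistent with the "no checkpoint data" error.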

Additional suspicious signals

  • The platform still reports the original tracked driver job as running with no completion or replacement job.
  • A direct GET /sandbox/:id succeeds normally.
  • A direct GET /sandbox/checkpoints/:id?orgId=... timed out after 12s with zero bytes returned.
    • I cannot confirm this is related, but it seems likely given the resume error.

Why this looks wrong

The platform is simultaneously telling us:

  • the sandbox is suspended
  • earlier lifecycle events included checkpoint-received
  • one suspend emitted a concrete checkpoint id
  • a manual resume now fails because the sandbox allegedly has no checkpoint data

That combination should not happen in a healthy checkpoint/resume flow.

Expected behavior

One of these should be true:

  1. the sandbox resumes successfully from its latest checkpoint, or
  2. the sandbox transitions into a terminal/invalid state with a consistent reason and the checkpoint/event surfaces agree about the missing checkpoint, or
  3. the resume endpoint returns a more precise failure mode tied to the actual checkpoint object that is missing/corrupt/unreadable.

Notes

Coder Hub currently collapses any resume 409 that leaves the sandbox in the suspended state into a generic message, so the direct CLI/manual resume result above is the most useful raw signal I found.
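
As a rough illustration of how the Hub could keep the raw platform message instead of dropping it, here is a small Python sketch. The function name, its parameters, and the output wording are all hypothetical; this is not how Coder Hub is implemented today.

```python
def describe_resume_failure(
    http_status: int, sandbox_status: str, platform_message: str = ""
) -> str:
    """Build a user-facing error that preserves the platform's own message."""
    generic = (
        f"Sandbox resume failed while platform status is {sandbox_status} "
        f"(HTTP {http_status})."
    )
    if platform_message:
        # Append the raw text so specific failure modes stay visible.
        return f"{generic} Platform message: {platform_message}"
    return generic


print(
    describe_resume_failure(
        409,
        "suspended",
        "Cannot resume sandbox: it is suspended but has no checkpoint data",
    )
)
```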
