
CLI channel deadlocks on stale session resume #25

@sc-moore

Description


Summary

The CLI channel deadlocks when persistSession: true attempts to resume a stale SDK session. The stale-session detection fires, but the fallback path does not recover: the agent becomes completely unresponsive on the CLI channel, replying with canned "still working" messages to every input indefinitely.

Environment

  • Phantom v0.18.2
  • Running in Docker (docker-compose.yaml, full stack)
  • macOS host, Docker Desktop 29.3.0

Steps to Reproduce

  1. Start Phantom via docker compose up -d
  2. Attach to CLI and have a conversation (creates a session in SQLite)
  3. Recreate the container: docker compose up -d --force-recreate phantom
  4. Attach to CLI and send a message
  5. Log shows: [runtime] Stale session detected, retrying without resume: cli:cli:local
  6. Agent responds with canned messages, never invokes tools, never processes input

Expected Behavior

The stale session fallback should discard the old SDK session and start a fresh one. The CLI channel should recover transparently.
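
The intended recovery path could look roughly like the sketch below. This is an illustration only, not Phantom's actual code: the function name, the StaleSessionError type, and the sdk client object are all hypothetical, and only the sessions table and sdk_session_id column names are taken from the schema described in Observations.

```python
import sqlite3

class StaleSessionError(Exception):
    """Raised when the SDK refuses to resume a no-longer-valid session."""

def resume_or_start_fresh(db: sqlite3.Connection, session_key: str, sdk) -> str:
    """Resume the stored SDK session, or discard it and start a new one.

    `sdk` is a hypothetical client with resume()/create() methods.
    """
    row = db.execute(
        "SELECT sdk_session_id FROM sessions WHERE session_key = ?",
        (session_key,),
    ).fetchone()
    if row and row[0]:
        try:
            return sdk.resume(row[0])
        except StaleSessionError:
            # The crucial step: actually discard the stale ID so nothing
            # downstream keeps retrying against it.
            db.execute(
                "UPDATE sessions SET sdk_session_id = NULL WHERE session_key = ?",
                (session_key,),
            )
            db.commit()
    # Fall through to a genuinely fresh session.
    new_id = sdk.create()
    db.execute(
        "INSERT INTO sessions (session_key, sdk_session_id) VALUES (?, ?) "
        "ON CONFLICT(session_key) DO UPDATE SET sdk_session_id = excluded.sdk_session_id",
        (session_key, new_id),
    )
    db.commit()
    return new_id
```

The observed behavior suggests the retry happens without the discard step, so the same stale ID keeps being picked up.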

Actual Behavior

The agent enters a deadlock. Every subsequent message on the CLI channel gets the same canned response. Quality judges score the session 0.05 (catastrophic). The trigger endpoint (/trigger) continues working normally with separate sessions, confirming the issue is isolated to CLI session resumption.

Workaround

Delete the stale session from SQLite and restart:

docker exec phantom sqlite3 /app/data/phantom.db \
  "DELETE FROM sessions WHERE session_key = 'cli:cli:local';"
docker compose up -d --force-recreate phantom

Impact

  • CLI becomes completely unusable until manual intervention
  • No self-recovery mechanism — the agent cannot fix this on its own
  • Evolution quality judges burn tokens scoring the stuck sessions 0.05 repeatedly
  • In our case, this contributed to ~$50 in wasted judge costs from repeated quality assessments of stuck sessions

Observations

  • The session table (sessions in data/phantom.db) retains the old sdk_session_id across container recreates because the SQLite database is on a persistent volume
  • The [runtime] Stale session detected, retrying without resume log message indicates the detection logic exists but the recovery path fails
  • The trigger endpoint is unaffected because it creates new session keys per request (trigger:<timestamp>)
  • This has reproduced 3 times in ~12 hours of operation
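
The asymmetry between the two channels is consistent with how their session keys are derived. A minimal sketch of the two key schemes, inferred from the log line and the trigger:&lt;timestamp&gt; observation above (the exact formats are assumptions):

```python
import time

def cli_session_key(channel: str = "cli", scope: str = "local") -> str:
    # Stable key: the same key is reused across container restarts,
    # so a stale sdk_session_id stored under it gets picked up again.
    return f"cli:{channel}:{scope}"

def trigger_session_key() -> str:
    # Per-request key: every /trigger call gets a fresh key, so there
    # is never a stale SDK session to resume.
    return f"trigger:{int(time.time() * 1000)}"
```

Any fix therefore only needs to target the stable-key path; per-request keys are immune by construction.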

Suggestion

Consider one or more of:

  • Invalidate all CLI sessions on startup (the SDK session IDs won't survive a container restart anyway)
  • Make the stale session fallback create a genuinely new session rather than retrying
  • Add a startup migration that clears sdk_session_id for sessions older than the current process start time
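
The third option could amount to a single UPDATE at boot. A hedged sketch, assuming the sessions table carries an updated_at Unix-timestamp column (the column name is a guess; adapt to the real schema):

```python
import sqlite3
import time

# Recorded once at boot; any session last touched before this moment
# cannot have a live SDK session in the new process.
PROCESS_START = time.time()

def clear_stale_sdk_sessions(db: sqlite3.Connection) -> int:
    """Null out sdk_session_id for rows last updated before process start.

    Returns the number of rows cleared, which is worth logging so the
    migration is visible at startup.
    """
    cur = db.execute(
        "UPDATE sessions SET sdk_session_id = NULL "
        "WHERE sdk_session_id IS NOT NULL AND updated_at < ?",
        (PROCESS_START,),
    )
    db.commit()
    return cur.rowcount
```

This keeps the session rows (and any conversation history keyed to them) intact while guaranteeing the runtime never attempts a resume against a pre-restart SDK session.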
