CLI channel deadlocks on stale session resume #25
Description
Summary
The CLI channel deadlocks when persistSession: true attempts to resume a stale SDK session. The stale session detection fires but the fallback does not recover — the agent becomes completely unresponsive on the CLI channel, responding with canned "still working" messages to every input indefinitely.
Environment
- Phantom v0.18.2
- Running in Docker (docker-compose.yaml, full stack)
- macOS host, Docker Desktop 29.3.0
Steps to Reproduce
- Start Phantom via `docker compose up -d`
- Attach to CLI and have a conversation (creates a session in SQLite)
- Recreate the container: `docker compose up -d --force-recreate phantom`
- Attach to CLI and send a message
- Log shows `[runtime] Stale session detected, retrying without resume: cli:cli:local`
- Agent responds with canned messages, never invokes tools, never processes input
Expected Behavior
The stale session fallback should discard the old SDK session and start a fresh one. The CLI channel should recover transparently.
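The expected recovery shape can be sketched in a few lines. Everything here is hypothetical — the names (`resume_or_start_fresh`, `StaleSessionError`, the injected callbacks) do not come from Phantom's code, which this issue does not show; the sketch only illustrates that the persisted id must be discarded before falling through to a fresh session, rather than retried.

```python
class StaleSessionError(Exception):
    """Hypothetical: raised when the SDK rejects a resume for an unknown session id."""


def resume_or_start_fresh(load_sdk_session_id, resume, start_fresh, clear_sdk_session_id):
    """Try to resume; on a stale session, discard the stored id and start fresh."""
    sdk_id = load_sdk_session_id()
    if sdk_id is not None:
        try:
            return resume(sdk_id)
        except StaleSessionError:
            # The key step: clear the persisted id so later attempts don't
            # keep retrying the same stale resume, then fall through.
            clear_sdk_session_id()
    return start_fresh()
```

The symptom described below (every message hitting the same canned response) is consistent with the `clear_sdk_session_id` step being missing, so each new message re-enters the same failed resume path.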
Actual Behavior
The agent enters a deadlock. Every subsequent message on the CLI channel gets the same canned response. Quality judges score the session 0.05 (catastrophic). The trigger endpoint (/trigger) continues working normally with separate sessions, confirming the issue is isolated to CLI session resumption.
Workaround
Delete the stale session from SQLite and restart:
docker exec phantom sqlite3 /app/data/phantom.db \
"DELETE FROM sessions WHERE session_key = 'cli:cli:local';"
docker compose up -d --force-recreate phantom
Impact
- CLI becomes completely unusable until manual intervention
- No self-recovery mechanism — the agent cannot fix this on its own
- Evolution quality judges burn tokens scoring the stuck sessions 0.05 repeatedly
- In our case, this contributed to ~$50 in wasted judge costs from repeated quality assessments of stuck sessions
Observations
- The `sessions` table in `data/phantom.db` retains the old `sdk_session_id` across container recreates because the SQLite database is on a persistent volume
- The `[runtime] Stale session detected, retrying without resume` log message indicates the detection logic exists but the recovery path fails
- The trigger endpoint is unaffected because it creates new session keys per request (`trigger:<timestamp>`)
- This has reproduced 3 times in ~12 hours of operation
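To confirm the first observation on a running deployment, the stored SDK session id can be inspected with the same container name, database path, and column names used in the workaround above (if your schema differs, adjust the query accordingly):

```shell
# Check whether the stale SDK session id survived the container recreate
docker exec phantom sqlite3 /app/data/phantom.db \
  "SELECT session_key, sdk_session_id FROM sessions WHERE session_key = 'cli:cli:local';"
```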
Suggestion
Consider one or more of:
- Invalidate all CLI sessions on startup (the SDK session IDs won't survive a container restart anyway)
- Make the stale session fallback create a genuinely new session rather than retrying
- Add a startup migration that clears `sdk_session_id` for sessions older than the current process start time
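The third suggestion could look something like the sketch below. The schema is an assumption: only the `sessions` table, `session_key`, and `sdk_session_id` names appear in this issue, and the `updated_at` timestamp column is invented for illustration.

```python
import sqlite3


def invalidate_stale_sdk_sessions(db_path: str, process_start: float) -> int:
    """On startup, NULL out sdk_session_id for rows last touched before this
    process started, so no resume is ever attempted against a dead SDK session.
    Assumes a sessions(session_key, sdk_session_id, updated_at) schema."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "UPDATE sessions SET sdk_session_id = NULL WHERE updated_at < ?",
            (process_start,),
        )
        conn.commit()
        return cur.rowcount  # number of sessions invalidated
    finally:
        conn.close()
```

Run once during boot, before any channel accepts input; since SDK session ids don't survive a container restart anyway, clearing them unconditionally at startup (suggestion 1) is a simpler variant of the same query.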