
fix(daemon): guard session resume when agent runtime changes #1905

Open
bzqzheng wants to merge 1 commit into multica-ai:main from bzqzheng:fix/bri-66-runtime-session-guard

Contributor

@bzqzheng bzqzheng commented Apr 29, 2026

Summary

  • Guard issue and chat session resume pointers by comparing the prior session's runtime_id with the claiming task's runtime_id. When they differ, the claim endpoint returns an empty PriorSessionID so the daemon starts a fresh session instead of trying to resume a cross-runtime session.
  • Add chat_session.runtime_id column (nullable UUID → agent_runtime.id) so chat resume pointers remain self-sufficient instead of relying only on task-row history. Backfilled from the most recent completed/failed task per chat session; legacy rows with NULL runtime_id fall back to the task-row lookup.
  • Preserve work_dir reuse even when PriorSessionID is cleared — working directories are runtime-portable and should survive migrations.
  • Comparison uses task.RuntimeID (not agent.current_runtime_id) so old-runtime tasks can still resume old-runtime sessions during migration windows.
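The guard described in the summary can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `PriorSession` and `ClaimResume` types, the `PriorWorkDir` field, and the `resumeFor` helper are hypothetical names chosen for the example. Only `PriorSessionID`, `RuntimeID`, and the work_dir behaviour come from the description above.

```go
package main

import "fmt"

// PriorSession is a hypothetical stand-in for the row returned by a
// "last session" lookup. The real type in the PR may differ.
type PriorSession struct {
	SessionID string
	RuntimeID string
	WorkDir   string
}

// ClaimResume is a hypothetical subset of the claim response.
type ClaimResume struct {
	PriorSessionID string
	PriorWorkDir   string
}

// resumeFor compares the prior session's runtime with the claiming task's
// runtime (not the agent's current runtime). On a mismatch the session ID
// is left empty so the daemon starts fresh, but the work dir is still
// returned because it is treated as runtime-portable.
func resumeFor(prior *PriorSession, taskRuntimeID string) ClaimResume {
	if prior == nil {
		return ClaimResume{}
	}
	out := ClaimResume{PriorWorkDir: prior.WorkDir}
	if prior.RuntimeID == taskRuntimeID {
		out.PriorSessionID = prior.SessionID
	}
	return out
}

func main() {
	prior := &PriorSession{SessionID: "sess-1", RuntimeID: "rt-old", WorkDir: "/work/issue-1"}
	fmt.Printf("same runtime:  %+v\n", resumeFor(prior, "rt-old"))
	fmt.Printf("other runtime: %+v\n", resumeFor(prior, "rt-new"))
}
```

Keying the comparison on the task's runtime means a task still queued for the old runtime would keep resuming the old session, which matches the migration-window behaviour the last bullet describes.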

What changed

  • handler/daemon.go, ClaimTaskByRuntime: skip PriorSessionID when prior.RuntimeID != task.RuntimeID for issue tasks; for chat tasks, skip when chat_session.runtime_id is NULL or doesn't match, then fall back to task-row lookup with the same runtime guard.
  • service/task.go, CompleteTask / FailTask: persist runtime_id into chat_session on task completion so the pointer stays current.
  • Migration 060_chat_session_runtime_id — adds chat_session.runtime_id with FK to agent_runtime, backfills from latest completed/failed task per session.
  • SQL queries (agent.sql, chat.sql) — GetLastTaskSession and GetLastChatTaskSession now return runtime_id. CreateChatSession auto-populates runtime_id from the agent. UpdateChatSessionSession accepts runtime_id.
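The chat-task path in ClaimTaskByRuntime has two steps (chat_session pointer first, task-row fallback second). A rough sketch of that ordering is below. The struct shapes and the `chatPriorSessionID` helper are assumptions made for illustration; the nullable column is modelled here as a `*string`, which may not match the generated types in the repository.

```go
package main

import "fmt"

// ChatSession models the chat_session pointer. RuntimeID is nil when the
// column is NULL (legacy rows that the backfill did not populate).
type ChatSession struct {
	SessionID string
	RuntimeID *string
}

// TaskSession models the result of a last-task lookup for the chat session.
type TaskSession struct {
	SessionID string
	RuntimeID string
}

// chatPriorSessionID returns the session to resume, or "" for a fresh start.
// Step 1: trust the chat_session pointer only if its runtime is known and
// matches the claiming task. Step 2: otherwise fall back to the task-row
// lookup, applying the same runtime guard.
func chatPriorSessionID(cs ChatSession, lastTask *TaskSession, taskRuntimeID string) string {
	if cs.SessionID != "" && cs.RuntimeID != nil && *cs.RuntimeID == taskRuntimeID {
		return cs.SessionID
	}
	if lastTask != nil && lastTask.SessionID != "" && lastTask.RuntimeID == taskRuntimeID {
		return lastTask.SessionID
	}
	return ""
}

func strPtr(s string) *string { return &s }

func main() {
	legacy := ChatSession{SessionID: "cs-sess"} // NULL runtime_id
	last := &TaskSession{SessionID: "task-sess", RuntimeID: "rt-a"}

	fmt.Printf("%q\n", chatPriorSessionID(legacy, last, "rt-a"))
	fmt.Printf("%q\n", chatPriorSessionID(legacy, last, "rt-b"))

	current := ChatSession{SessionID: "cs-sess", RuntimeID: strPtr("rt-a")}
	fmt.Printf("%q\n", chatPriorSessionID(current, last, "rt-a"))
}
```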

Test plan

  • go test ./internal/handler -run 'TestClaimTask_(IssuePriorSessionRuntimeGuard|ChatPriorSessionRuntimeGuard)$' — new tests for both issue and chat paths
  • go test ./internal/handler — all existing handler tests pass
  • go test ./... — clean


vercel Bot commented Apr 29, 2026

@bzqzheng is attempting to deploy a commit to the IndexLabs Team on Vercel.

A member of the Team first needs to authorize it.

@bzqzheng bzqzheng changed the title from "Fix runtime session resume guard" to "fix(daemon): guard session resume when agent runtime changes" Apr 29, 2026
@bzqzheng bzqzheng force-pushed the fix/bri-66-runtime-session-guard branch 2 times, most recently from 2b76ddd to 2b7b39f Compare April 29, 2026 19:44
@bzqzheng bzqzheng marked this pull request as ready for review April 29, 2026 19:50
Collaborator

@multica-eve multica-eve left a comment

Thanks for the fix. I found one blocker in the migration backfill.

server/migrations/060_chat_session_runtime_id.up.sql sets chat_session.runtime_id from the latest completed/failed task for the chat session, but it does not verify that this latest task is the task that produced the current chat_session.session_id. Existing rows can have a stale but non-null chat_session.session_id: the daemon pins agent_task_queue.session_id mid-run, and orphan recovery / retry paths can leave the task row with a newer session while the chat_session pointer still points at an older session. In that case this migration pairs the old chat_session session id with the newer task's runtime id. After migration, ClaimTaskByRuntime trusts cs.SessionID when cs.RuntimeID == task.RuntimeID, so it can still return a cross-runtime PriorSessionID, which is exactly the failure this PR is trying to prevent.

Please make the backfill preserve the session/runtime pairing, for example by selecting the latest task's session_id too and only setting runtime_id when latest.session_id = cs.session_id, or by deliberately updating both session_id and runtime_id from the same latest task if that is the desired data repair. I would also add a regression case for a stale chat_session.session_id plus a newer task-row session from a different runtime.
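To make the first suggested option concrete, here is an in-memory sketch of the pairing rule, written in Go rather than SQL. It is only a model of the condition "set runtime_id when latest.session_id = cs.session_id"; the row types and the `backfillRuntimeID` helper are hypothetical and do not reflect the actual migration file.

```go
package main

import "fmt"

// ChatSessionRow models a chat_session row before the backfill.
type ChatSessionRow struct {
	SessionID string
	RuntimeID *string // nil means NULL
}

// LatestTask models the most recent completed/failed task for the session.
type LatestTask struct {
	SessionID string
	RuntimeID string
}

// backfillRuntimeID sets cs.RuntimeID only when the latest task produced
// the session that chat_session currently points at. It reports whether
// the row was updated. A stale pointer stays NULL, so the claim path falls
// back to the guarded task-row lookup instead of trusting a bad pairing.
func backfillRuntimeID(cs *ChatSessionRow, latest *LatestTask) bool {
	if latest == nil || cs.SessionID == "" || latest.SessionID != cs.SessionID {
		return false
	}
	rt := latest.RuntimeID
	cs.RuntimeID = &rt
	return true
}

func main() {
	// Stale pointer: chat_session still references an older session while
	// the latest task row carries a newer session from another runtime.
	stale := &ChatSessionRow{SessionID: "sess-old"}
	updated := backfillRuntimeID(stale, &LatestTask{SessionID: "sess-new", RuntimeID: "rt-b"})
	fmt.Println(updated, stale.RuntimeID == nil)

	// Consistent pointer: the pairing is safe to record.
	ok := &ChatSessionRow{SessionID: "sess-1"}
	updated = backfillRuntimeID(ok, &LatestTask{SessionID: "sess-1", RuntimeID: "rt-a"})
	fmt.Println(updated, *ok.RuntimeID)
}
```

The stale case in `main` mirrors the regression scenario requested above: a stale chat_session.session_id plus a newer task-row session from a different runtime.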

I also tried the focused handler tests locally, but my local test DB is stale and missing the new chat_session.runtime_id column, so the chat test failed at fixture setup rather than in the PR logic. CI backend is already green on the PR.

@bzqzheng bzqzheng force-pushed the fix/bri-66-runtime-session-guard branch from 2b7b39f to fe86003 Compare April 30, 2026 12:31
Contributor Author

@multica-eve all feedback addressed — migration now pairs latest.session_id = cs.session_id to prevent cross-runtime backfill, plus a regression test for the stale-session scenario. Ready for re-review. Thanks for catching this.

@bzqzheng bzqzheng requested a review from multica-eve April 30, 2026 12:59
@bzqzheng bzqzheng force-pushed the fix/bri-66-runtime-session-guard branch from fe86003 to e7f92d2 Compare April 30, 2026 13:30