fix(daemon): guard session resume when agent runtime changes#1905
fix(daemon): guard session resume when agent runtime changes#1905bzqzheng wants to merge 1 commit intomultica-ai:mainfrom
Conversation
|
@bzqzheng is attempting to deploy a commit to the IndexLabs Team on Vercel. A member of the Team first needs to authorize it. |
2b76ddd to
2b7b39f
Compare
multica-eve
left a comment
There was a problem hiding this comment.
Thanks for the fix. I found one blocker in the migration backfill.
server/migrations/060_chat_session_runtime_id.up.sql sets chat_session.runtime_id from the latest completed/failed task for the chat session, but it does not verify that this latest task is the task that produced the current chat_session.session_id. Existing rows can have a stale but non-null chat_session.session_id: the daemon pins agent_task_queue.session_id mid-run, and orphan recovery / retry paths can leave the task row with a newer session while the chat_session pointer still points at an older session. In that case this migration pairs the old chat_session session id with the newer task's runtime id. After migration, ClaimTaskByRuntime trusts cs.SessionID when cs.RuntimeID == task.RuntimeID, so it can still return a cross-runtime PriorSessionID, which is exactly the failure this PR is trying to prevent.
Please make the backfill preserve the session/runtime pairing, for example by selecting the latest task's session_id too and only setting runtime_id when latest.session_id = cs.session_id, or by deliberately updating both session_id and runtime_id from the same latest task if that is the desired data repair. I would also add a regression case for a stale chat_session.session_id plus a newer task-row session from a different runtime.
I also tried the focused handler tests locally, but my local test DB is stale and missing the new chat_session.runtime_id column, so the chat test failed at fixture setup rather than in the PR logic. CI backend is already green on the PR.
2b7b39f to
fe86003
Compare
|
@multica-eve all feedback addressed — migration now pairs |
fe86003 to
e7f92d2
Compare
Summary
runtime_idwith the claiming task'sruntime_id. When they differ, the claim endpoint returns an emptyPriorSessionIDso the daemon starts a fresh session instead of trying to resume a cross-runtime session.chat_session.runtime_idcolumn (nullable UUID →agent_runtime.id) so chat resume pointers remain self-sufficient instead of relying only on task-row history. Backfilled from the most recent completed/failed task per chat session; legacy rows with NULLruntime_idfall back to the task-row lookup.work_dirreuse even whenPriorSessionIDis cleared — working directories are runtime-portable and should survive migrations.task.RuntimeID(notagent.current_runtime_id) so old-runtime tasks can still resume old-runtime sessions during migration windows.What changed
handler/daemon.go—ClaimTaskByRuntime: skipPriorSessionIDwhenprior.RuntimeID != task.RuntimeIDfor issue tasks; for chat tasks, skip whenchat_session.runtime_idis NULL or doesn't match, then fall back to task-row lookup with the same runtime guard.service/task.go—CompleteTask/FailTask: persistruntime_idintochat_sessionon task completion so the pointer stays current.060_chat_session_runtime_id— addschat_session.runtime_idwith FK toagent_runtime, backfills from latest completed/failed task per session.agent.sql,chat.sql) —GetLastTaskSessionandGetLastChatTaskSessionnow returnruntime_id.CreateChatSessionauto-populatesruntime_idfrom the agent.UpdateChatSessionSessionacceptsruntime_id.Test plan
go test ./internal/handler -run 'TestClaimTask_(IssuePriorSessionRuntimeGuard|ChatPriorSessionRuntimeGuard)$'— new tests for both issue and chat pathsgo test ./internal/handler— all existing handler tests passgo test ./...— clean