Background sessions on Mongo control plane (Main + Worker topology)#206
Background sessions on Mongo control plane (Main + Worker topology)#206akseljoonas wants to merge 15 commits intomainfrom
Conversation
Spec, trace, and consensus implementation plan for v1 of background-running ml-intern: sessions survive SSE drops, browser close, and Main Space restarts via a Main+Workers HF Space topology with MongoDB as the universal control plane. Lease+heartbeat ownership, configurable grace period, and Worker idle eviction. Frontend stays unchanged. Plan was reviewed by architect + critic to consensus over 2 iterations.
Extends MongoSessionStore with the building blocks of the lease+heartbeat
control plane:
- claim_lease / renew_lease / release_lease: atomic CAS via
findOneAndUpdate, with TTL semantics and idempotent release.
- enqueue_pending_submission / claim_pending_submission /
mark_submission_done / requeue_claimed_for: FIFO submission queue with
handover-safe requeue (preserves created_at across lease changes).
- change_stream_pending_submissions / change_stream_events: real-time
tailers with PyMongoError signalling for the polling fallback path.
- make_holder_id(mode): "{mode}:{hostname}:{uuid7-or-uuid4-suffix}".
MongoSessionStore.init() now runs an idempotent backfill on startup:
sessions with last_active_at within the last hour get an empty lease so
the next CAS succeeds; older sessions flip to runtime_state=idle (still
recoverable, not ended).
NoopSessionStore mirrors the new surface so the public CLI still works
without MongoDB.
Tests: 12 new in tests/test_session_persistence_lease.py covering CAS
exactly-once under concurrency, FIFO claim order, requeue ordering
preservation, recency-split backfill, idempotent re-init, and holder-id
format. Uses mongomock-motor (added to dev deps).
Wires the lease+heartbeat ownership model into SessionManager so each process picks a stable holder_id at startup, claims a Mongo lease before instantiating any runtime session, renews held leases on a TTL/3 cadence, and cleans up correctly when ownership transfers. Highlights: - __init__ reads MODE env (main|worker, defaults main on invalid with WARN), computes self.mode and self._holder_id via make_holder_id(). - start() launches _lease_heartbeat_loop (10 s = TTL/3). - _on_lease_lost calls requeue_claimed_for(holder_id), pops the session, cancels its task without awaiting, and logs WARN. This is the Step 1.5 requeue path so a Worker can pick up orphaned submissions. - create_session and ensure_session_loaded claim a lease before starting or restoring a session; ensure_session_loaded refuses to restore when another process holds the lease. - _run_session, shutdown_session, and delete_session release the lease on exit. - close() cancels and awaits the heartbeat task. Tests: 12 new in tests/test_session_manager_lease.py covering holder identity, claim wiring, heartbeat renewal, Noop fallthrough, lease-loss requeue+drop, release in _run_session finally, and close-time cancel. Adjusted unit/test_session_manager_persistence.py helper to seed the new __init__ attributes; 35 tests total pass. The submission-consume loop, EventBroadcaster, lifespan, and routes are unchanged (US-004 / US-005 / US-006 territory).
Rewrites the central session loop so user input, approvals, interrupts, and shutdowns flow through the durable pending_submissions collection instead of an in-process asyncio.Queue. This is the substrate change that lets a Worker pick up a session another process originally created. backend/session_manager.py: - Drop submission_queue field from AgentSession and its three construction sites; add last_submission_at for US-007 idle eviction. - _consume_submissions: change-stream-first consumer with a 500ms polling fallback when the deployment is not a replica set. - _drain_and_process: FIFO atomic claim loop. Handles op_type "interrupt" (session.cancel()) and "shutdown" (session.is_running= False) inline; everything else dispatches through process_submission. Marks each claimed submission done in finally so poison rows can't redeliver. - Replace _run_session's queue-poll loop with await _consume_submissions; the session_manager is the only thing reaching into pending_submissions. - Rewrite submit / submit_user_input / submit_approval / undo / compact / shutdown_session as thin enqueue calls via a shared _enqueue_or_false helper that checks in-memory presence OR store.load_session() so cross-process delivery works. - Rewrite interrupt with the holder branch: in-memory holder calls session.cancel() directly; non-holders enqueue op_type "interrupt" for the actual holder to consume. is_in_tool_call flag (Critic MAJOR #5): - agent/core/session.py: new boolean on Session. - agent/core/agent_loop.py: wraps both asyncio.gather tool batches (non-approval and post-approval) with True/False around the await. - agent/tools/research_tool.py: same wrapper around the sub-agent's tool_router.call_tool. Tests: - tests/test_session_manager_submissions.py (new, 18 tests): enqueue paths, holder fast-path interrupt, non-holder enqueued interrupt, FIFO drain order, inline interrupt/shutdown handling, poison- submission containment, store-disabled fallback, _build_operation. - Existing fixtures (test_session_manager_lease, test_messaging, test_session_manager_persistence) updated to drop the submission_queue kwarg and the now-removed positional argument to _run_session. 55 tests pass in the lease+submission verification suite (12 + 12 + 18 + 6 + 5 + 2 inline). Broader unit suite remains green at 248 (two pre-existing test_doom_loop failures verified unrelated via git stash).
Keeps EventBroadcaster as an opt-in read-side cache attached only when this process holds the session's lease. Non-holders tail Mongo change_stream_events with a 500 ms load_events_after polling fallback on PyMongoError or NotImplementedError. Writes still go through Mongo first via Session.send_event -> append_event (durability invariant verified, no change needed in agent/core/session.py). backend/session_manager.py: - AgentSession gains holder_id (set when lease is claimed). - SessionManager gains _subscriber_counts and _no_subscriber_since dicts plus _attach_subscriber / _detach_subscriber helpers; both SSE transports call them at attach + finally-detach. Used by the upcoming grace-period sweeper. backend/routes/agent.py: - _sse_response now takes (session_id, agent_session, *, replay_events, after_seq). Generator does the replay-then-live pattern in two branches: holder subscribes to in-process broadcaster; non-holder opens change_stream_events from last_seen_seq, with a 500 ms poll fallback. Keepalives preserved on both paths. - chat_sse no longer subscribes pre-submit (Mongo durability makes the early-subscribe-to-not-miss-events trick unnecessary). - subscribe_events updated to the new signature. Tests: tests/test_sse_holder_overlay.py (new, 7): holder fast path, non-holder change-stream, fallback-to-poll on PyMongoError, subscriber counter attach/detach on both paths, replay phase respects after_seq, terminal event in replay ends stream cleanly. 62 tests pass cumulative across the lease/submission/SSE suites.
Wires the three Spec migration triggers into the Main Space process:
1. Main shutdown (deploy): lifespan hook iterates self.sessions and
for any session in runtime_state="processing", calls
release_session_to_background(reason="main_shutdown") before
session_manager.close().
2. SSE-drop grace period: a 30 s sweeper task scans sessions held by
this process. When _no_subscriber_since[sid] is older than
GRACE_PERIOD_SECONDS (default 180) AND the session has work
(is_processing, is_in_tool_call, or pending submissions in Mongo),
it auto-releases. Idle-with-no-work sessions are left alone so an
open-and-walk-away tab doesn't get backgrounded for nothing.
3. Manual button: POST /api/session/{id}/background route reuses
_check_session_access and calls the same helper.
All three paths converge on release_session_to_background(sid, reason),
which:
- Emits a `migrating` session_event (durably persisted via append_event
so non-holder SSE readers see it on the next change-stream tick),
- Calls store.requeue_claimed_for(self._holder_id) so a Worker can
pick up our claimed-but-uncompleted submissions,
- Calls store.release_lease(sid, self._holder_id),
- Drops the session from self.sessions and cancels its agent task.
Tests: tests/test_lifespan_grace_sweep.py (new, 8) covering the helper,
the grace-sweep predicate matrix (subs present, grace+work, grace
without work), and the route. _bare_manager fixtures in two existing
test files updated to set manager._grace_sweep_task = None so close()
doesn't AttributeError.
70 tests pass cumulatively across the lease/submission/SSE/grace suites.
Implements the second deployable surface — Worker Spaces — and the session-eviction policy that keeps held sessions from accumulating forever in a Worker process. backend/start.sh: - New MODE=worker branch as first action, exec python -m worker. - MODE=main path is unchanged: existing port-conflict graceful exit-0 hack for HF Spaces dev mode preserved verbatim. backend/worker.py (new): three-line shim that calls asyncio.run(worker_loop()) at module entry. Uses `from main import worker_loop` to match the existing `WORKDIR=/app/backend` contract. backend/main.py: - worker_loop(): boots session_manager (which now starts heartbeat, grace sweep, AND idle eviction tasks), then runs _worker_claim_tick on a 1 s cadence with 2 s back-off on errors. Closes cleanly on cancellation. - _worker_claim_tick(): scans db.pending_submissions for status= "pending" docs, skips session_ids we already hold, and calls claim_dormant_session for the rest. Lease CAS handles contention. backend/session_manager.py: - Refactor: _rebuild_agent_session_from_store() extracted from ensure_session_loaded so claim_dormant_session can reuse it. - claim_dormant_session(session_id): for the worker_loop's claim path. Bypasses the user-ownership gate (process-level trust); the lease CAS still enforces one-holder-at-a-time. - _idle_eviction_loop: 60 s tick. Releases sessions held by us whose last_submission_at is older than IDLE_EVICTION_SECONDS (default 1800) AND not is_processing AND not is_in_tool_call AND no pending_submissions in Mongo. No migrating event - nobody is watching an idle session. start()/close() manage the task. - last_submission_at switched from asyncio.get_event_loop().time() to time.time() so the idle predicate uses the same wall clock as _no_subscriber_since. Tests: tests/test_worker_idle_eviction.py (new, 13): idle predicate matrix, claim_dormant_session bypass + lease-taken paths, claim-tick skips already-held sessions. _bare_manager() fixtures in three existing files updated to set manager._idle_eviction_task = None. 83 tests pass cumulatively.
Adds 11 observability log lines across the new control plane and a deployment-runbook section to AGENTS.md. No behavior change. backend/session_manager.py (10 log points): - lease_claim INFO at all three lease-claim sites (create_session, ensure_session_loaded, claim_dormant_session). - lease_release INFO at all three release sites (release_session_to_ background, _run_session finally, _idle_eviction_loop). - requeue_claimed INFO when count > 0 (_on_lease_lost, release_session_to_background). - migrating_emitted INFO in release_session_to_background after the send_event. - pending_submission_lag DEBUG in _drain_and_process; only logs when lag > 100 ms to avoid noise. backend/routes/agent.py (1 log point): - replay_event_count INFO at the end of the replay phase in _sse_response, on both the terminal-event and phase-complete paths. AGENTS.md: new "Background sessions deployment" section covers: - Two-tier topology (Main + Worker(s) sharing one Docker image, MODE env var differentiates). - Deploy ordering (Workers first, Main last; backfill runs in MongoSessionStore.init). - Blast-radius query to capture pre-deploy. - Env vars table (MODE, MONGODB_URI, GRACE_PERIOD_SECONDS, IDLE_EVICTION_SECONDS). - Local two-process stack: docker mongo replica set + Main + Worker. - Chaos test (docker pause/unpause Mongo) verifying change-stream resume token. - Observability grep patterns for the new log lines. 83 tests pass. Final story of the bg-sessions-mongo-control-plane refactor.
Five small cleanups identified during the post-Ralph anti-slop pass. No behavior change; 83 tests remain green. - agent/core/session_persistence.py: drop redundant `return None` from a -> None function (release_lease). - backend/session_manager.py: collapse the duplicate WARN+INFO log pair in _on_lease_lost into a single WARN; switch to %-style formatting consistent with the surrounding file. Drop the impossible `if not agent_session: return` guard in _run_session — the caller inserts the session before creating the task, so the dict lookup cannot miss; direct access now surfaces any future invariant break. - agent/core/agent_loop.py: remove two comments that just restated the next line of well-named code. - agent/tools/research_tool.py: drop the in-loop `import json as _json` shadow; the module-level json import is already in scope.
- .gitignore: add .omc/, .omx/, AGENTS.md, uv.lock so transient AI-tooling artifacts and locally-generated lockfiles aren't tracked. - docs/deployment.md (new): the background-sessions deployment runbook previously embedded in AGENTS.md, now a standalone doc that doesn't require AGENTS.md to be in the tree. - AGENTS.md: removed (was tracked); contents superseded by docs/deployment.md for the parts that belong in the repo. - uv.lock: removed; regenerated locally via `uv sync`. - .omc/specs, .omc/plans, .omc/prd.json, .omc/progress.txt: removed; these were design-pipeline artifacts. The PR description carries the durable summary.
Reverts the AGENTS.md and uv.lock changes from this branch so the PR diff is empty for both files. The deployment runbook stays in docs/deployment.md. - pyproject.toml: drop mongomock-motor>=0.0.36 from dev deps. The lease-control-plane tests already use pytest.importorskip, so they skip gracefully when the package is not installed; devs who want to run them install mongomock-motor manually. - .gitignore: drop AGENTS.md and uv.lock entries (no-ops on already- tracked files); keep .omc/ and .omx/ entries.
|
Claude finished @akseljoonas's task in 6m 36s —— View job PR Review: Background sessions on Mongo control planeTasks
2 P0, 3 P1 What I checked
P0 — Reconnect to Worker-held session returns 404 (acceptance drill 2 is broken)
This blocks every endpoint — including The unit tests in Acceptance drill 2 states: "A Worker claims the lease within 30 s; the user reconnects to a fresh Main and sees the turn continuing." This cannot happen because Main returns 404 for any session whose lease is currently held by a Worker. Fix direction: P0 —
|
…trol-plane # Conflicts: # agent/core/agent_loop.py # backend/session_manager.py
|
Claude finished @akseljoonas's task in 1m 47s —— View job PR Review: Background sessions on Mongo control planeTasks
2 P0 still open (unchanged from prior review), 3 P1 not yet addressed What I checked
P0 — Reconnect to Worker-held session returns 404 (unchanged)
if claimed is None:
logger.info(
f"Refusing restore of {session_id}: lease held by another process"
)
return None
P0 —
|
Five correctness fixes flagged by an automated PR review on the bg-sessions branch: P0 — reconnect to a Worker-held session returned 404 ensure_session_loaded called claim_lease and returned None when another holder owned the lease, which made _check_session_access raise HTTP 404 for every session in background mode. The non-holder SSE slow path was unreachable. Fix: when claim_lease fails because another process holds the lease, build a stub AgentSession with holder_id=<foreign> and explicit None for session/tool_router/ broadcaster, and return it without inserting into self.sessions. The SSE generator already branches on holder_id == session_manager._holder_id and falls through to the Mongo change-stream tail. Made AgentSession.session and AgentSession.tool_router Optional to support stubs. P0 — _on_lease_lost requeued submissions for ALL held sessions requeue_claimed_for(holder_id) ran update_many filtered only on holder, flipping every claimed submission this Main owned. A transient lease loss on session A could cause Worker double-execution of session B's in-flight submission. Fix: requeue_claimed_for now takes an optional session_id; _on_lease_lost passes it. The other call sites (release_session_to_background, lifespan shutdown sweep) keep the no-arg behavior on purpose. P1 — renew_lease swallowed PyMongoError as None A transient Mongo network blip looked identical to "lease stolen" and triggered _on_lease_lost. Fix: renew_lease no longer catches PyMongoError; the heartbeat loop wraps the call in try/except, logs a warning, and continues to the next session. Only renewed is None (CAS mismatch — real lease theft) calls _on_lease_lost. P1 — release_lease left holder_id intact A heartbeat tick that snapshotted session_ids before the release-lease pop could re-extend the lease 30 s into the future because the renew CAS still matched on holder_id. Fix: release_lease now sets lease.holder_id to null alongside lease.expires_at = now. P1 — shutdown_session held _lock during external I/O The lock was held while awaiting release_lease (Mongo I/O) and asyncio.wait_for(task, timeout=5.0). Fix: scope the lock to just the dict pop; release_lease and the task drain run unlocked. Tests: 9 new across tests/test_session_persistence_lease.py and tests/test_session_manager_lease.py covering each fix and the backward-compat path for requeue_claimed_for. 94 tests pass on the target suite (up from 85); 302 in the broader unit run.
PR Review: Background sessions on Mongo control plane
|
Removes the six new test files added across the bg-sessions stories (test_session_persistence_lease, test_session_manager_lease, test_session_manager_submissions, test_sse_holder_overlay, test_lifespan_grace_sweep, test_worker_idle_eviction) and restores tests/unit/test_messaging.py and tests/unit/test_session_manager_persistence.py to their state on main. The tests will land in a follow-up PR so this one stays scoped to the production refactor.
|
Claude finished @akseljoonas's task in 5m 21s —— View job Claude finished @akseljoonas's task —— View job PR Review: Background sessions on Mongo control planeTasks
1 P0 What I checked
Acknowledged fixes from prior reviewAll five prior findings are addressed:
P0 — Stub with
|
Wires the manual handoff path into the existing React UI. The backend
route POST /api/session/{id}/background already exists; this is purely
frontend.
Behavior:
- A moon IconButton appears in the composer when the Main process
holds the lease and the agent has work in flight (isProcessing or
pending approval). Click POSTs to the background route. Hidden once
the session is already backgrounded.
- The migrating SSE event is now a recognized EventType. The transport
emits finish-step + finish (mirroring the interrupted/shutdown
pattern) so the AI SDK closes the stream cleanly, then triggers
reconnectToStream so SSE switches from the holder fast-path to the
Mongo change-stream slow-path. The user keeps watching live.
- ActivityStatus gains a 'migrating' variant. ActivityStatusBar shows
"Running in background".
- SessionMeta gains an isBackgrounded flag, set from the migrating
event handler and (defensively) from the click handler optimistic
path. Sticky for the session lifetime.
- Sessions sidebar swaps the chat-bubble icon for a cloud icon when
isBackgrounded is true. The amber needsAttention pulse dot still
composes on top.
- The chat panel renders a muted-info banner ("Running in background
— you can close this tab and come back anytime") when the session
is backgrounded.
- Mount-time hydration tolerates the get_session_info 404 that occurs
for non-in-process sessions: when /messages succeeds but /session
returns 404, infer isBackgrounded=true instead of falling through to
markExpired. Reload-into-backgrounded-session now renders the banner
rather than the catch-up flow.
Errors on the POST currently fall back to console.error — there is no
toast helper in the existing codebase. v1.1 follow-up.
Two backend gaps documented in .omc/specs/deep-dive-what-do-we-need.md
remain as v1.1 work: release_session_to_background only handles
Main-held sessions; get_session_info doesn't consult Mongo for
foreign-held sessions. Neither blocks v1 of this button (button is
hidden for Worker-held sessions; reload tolerance covers the second).
10 files changed, ~80 LOC.
|
Claude finished @akseljoonas's task in 1m 39s —— View job PR Review: Background sessions on Mongo control planeTasks
1 P0 What I checked
Acknowledged fixes from prior reviewThe prior P0 finding from the last review round (the P0 —
|

Summary
v1 of "background-running ml-intern": agent sessions survive SSE drops, browser close, and Main Space restarts. Frontend is unchanged — backend-only refactor.
MODE=main|workerenv var differentiates. Worker Spaces have no public HTTP — they consume a Mongo-backed work queue.submission_queueis replaced;EventBroadcasteris kept as an opt-in read-side cache attached only when this process holds the lease (so foreground SSE skips the change-stream round-trip).findOneAndUpdateCAS onsessions.lease. TTL = 30 s, renew every 10 s. Lease loss triggersrequeue_claimed_for(holder_id)and drops the session locally so a Worker can pick up orphaned submissions.release_session_to_background(reason): SSE-drop grace period (default 180 s), manualPOST /api/session/{id}/background, and Main lifespan shutdown for active turns. Each emits amigratingsession_event the frontend can render as "reconnecting".not is_in_tool_call AND not is_processing AND no pending_submissions in Mongo.Acceptance drills (manual, post-deploy)
docker restart). A Worker claims the lease within 30 s; the user reconnects to a fresh Main and sees the turn continuing without aninterruptedevent.Both drills are runnable locally against
docker run mongo:7 --replSet rs0+ 2 processes (MODE=mainandMODE=worker). See the new "Background sessions deployment" section inAGENTS.md.Test plan
tests/test_session_persistence_lease.py(12) — atomic CAS, FIFO, requeue ordering preservation, idempotent backfilltests/test_session_manager_lease.py(12) — holder identity, claim wiring, heartbeat renewal + loss handling, lease release in finallytests/test_session_manager_submissions.py(18) — enqueue paths, holder/non-holder interrupt, FIFO drain, inline interrupt/shutdown handling, poison containmenttests/test_sse_holder_overlay.py(7) — holder fast path, non-holder change stream, polling fallback onPyMongoError, subscriber counter attach/detachtests/test_lifespan_grace_sweep.py(8) —release_session_to_backgroundevent emission, grace-sweep predicate matrix,/backgroundroute 200/404tests/test_worker_idle_eviction.py(13) — idle predicate matrix,claim_dormant_sessionuser-bypass + lease-taken paths, claim-tick skips already-heldtest_doom_loopfailures verified viagit stash)ruff checkclean on every file modified by this PR (pre-existing E402/F401/F841 inmain.pyandroutes/agent.pypredate this branch)db.sessions.count({runtime_state: "processing"})to capture blast-radiusDeployment notes (see
AGENTS.mdfor the full runbook)pending_submissionsagainst any session whose lease has expired, but won't see anything new until Main also rolls.MongoSessionStore.init()runs an idempotent backfill: sessions withlast_active_at > now-1hand no lease get an empty lease so the next CAS succeeds; older sessions flip toruntime_state: idle(still recoverable, neverended).release_session_to_background(reason="main_shutdown"); Workers pick them up within 30 s.Required env vars:
MODEmainworkerflips to the worker-loop entrypointMONGODB_URINoopSessionStore(CLI compatibility)GRACE_PERIOD_SECONDS180IDLE_EVICTION_SECONDS1800MONGODB_URImust point at a replica set (Atlas works out of the box) for change-stream support. Single-node Mongo falls back to 500 ms polling.Process & follow-ups
/deep-dive(8 interview rounds + 3-lane causal trace) and refined via/plan --consensus --direct(Architect + Critic over 2 iterations, both APPROVED). Spec, trace, and consensus plan are committed under.omc/specs/and.omc/plans/for posterity.delete_sessionrequeue ordering (race window for clean-delete UI)claim_dormant_sessionlease release on rebuild failure (TTL recovers in 30 s)pending_submissionspayload size guard (16 MB BSON ceiling)Risks
_lease_heartbeat_looprenewals on outage — heartbeat returnsNone,_on_lease_lostcleans up gracefully, clients reconnect.MAX_SESSIONS_PER_WORKER× N Workers. Scaling is operational (deploy more Workers); the lease design supports it without code changes.Out of scope (deferred)