feat(litebox): add pause/resume API for zero-CPU VM freezing #413
lilongen wants to merge 3 commits into boxlite-ai:main
Add pause() and resume() across all API layers (Rust core, REST, Python SDK)
to freeze/resume VMs via SIGSTOP/SIGCONT with guest filesystem quiesce.
Core implementation:
- pause(): FIFREEZE guest I/O → SIGSTOP shim (quiesce-then-freeze)
- resume(): SIGCONT shim → FITHAW guest I/O (resume-then-thaw)
- Both operations are idempotent (pause on Paused = no-op, etc.)
- State machine: Running ↔ Paused, Paused → Stopped
Safety and correctness:
- stop() sends SIGCONT before guest shutdown RPC on Paused boxes
(prevents 10s gRPC timeout on SIGSTOP'd process)
- exec/copy_into/copy_out reject Paused boxes with InvalidState
(shim can't handle gRPC while SIGSTOP'd)
- Health check skips gRPC pings during Paused state but verifies
process alive via kill(pid, 0) to detect death while paused
- with_quiesce_async preserves user-initiated Paused state
(clone/export/snapshot don't auto-resume user-paused boxes)
- Fix pre-existing deadlock: health check save_box used state.read()
while holding state.write() (parking_lot RwLock is not reentrant)
API surface:
- EventListener: on_box_paused/on_box_resumed callbacks
- AuditEventKind: BoxPaused/BoxResumed variants
- BoxStatus: can_pause()/can_resume()/is_paused() methods
- REST: POST /v1/default/boxes/{id}/pause and /resume
- Python SDK: box.pause() and box.resume() async methods
Tests: 12 new pause/resume unit tests, 2 new integration tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
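The deadlock fix mentioned above can be illustrated with a minimal sketch. It assumes the fix passes the state already held under the write guard to `save_box`, instead of taking a second read lock. `std::sync::RwLock` stands in for parking_lot here, and `Store`, `HealthCheck`, and `mark_stopped` are hypothetical names, not the PR's actual code.

```rust
use std::sync::{Mutex, RwLock};

/// Hypothetical slice of the box state.
#[derive(Debug, Clone, PartialEq)]
struct BoxState {
    status: String,
}

/// Hypothetical persistence layer; records every saved state.
struct Store {
    saved: Mutex<Vec<BoxState>>,
}

impl Store {
    fn save_box(&self, state: &BoxState) {
        self.saved.lock().unwrap().push(state.clone());
    }
}

struct HealthCheck {
    state: RwLock<BoxState>,
    store: Store,
}

impl HealthCheck {
    fn mark_stopped(&self) {
        let mut guard = self.state.write().unwrap();
        guard.status = "Stopped".to_string();
        // Deadlock-prone variant (do not do this while `guard` is alive):
        //     self.store.save_box(&self.state.read().unwrap());
        // Instead, reuse the state we already hold under the write guard.
        self.store.save_box(&guard);
    }
}
```

Passing `&guard` relies on deref coercion to `&BoxState`, so the save path never touches the lock again.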
…w fixes
Address code review findings for the pause/resume API:
- Fix TOCTOU: re-check shutdown_token after guest_quiesce() before SIGSTOP
- Replace force_status() with transition_to() + fallback for safety
- Log save_box failures with tracing::warn instead of silent discard
- Combine double state.read() in health check into single lock acquisition
- Add Paused → Stopping transition to state machine for completeness
Add missing test coverage:
- copy_into/copy_out rejected while paused (P1 gap)
- resume on stopped box returns error (P1 gap)
- Event listener multi-listener and box_id correctness tests
- State machine: Paused → Stopping transition, Paused cannot remove
Add Node.js SDK bindings:
- pause()/resume() in napi-rs (box_handle.rs)
- pause()/resume() in SimpleBox TypeScript wrapper with JSDoc
Add integration tests and Python example:
- 10 integration tests covering all pause/resume scenarios
- Python example with 4 demo functions
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…g, typed error matching
- Handle ESRCH in SIGSTOP/SIGCONT: if shim dies mid-pause, transition to
  Stopped instead of returning a confusing Internal error
- Handle stop() racing with pause(): if state transitions to
  Stopping/Stopped during pause, undo SIGSTOP and yield to stop() teardown
- Track quiesced flag on BoxState so with_quiesce_async knows whether
  guest I/O was frozen during an earlier pause(); warn when degrading to
  crash-consistent (SIGSTOP-only) snapshots
- CLI serve: pattern-match on BoxliteError variants instead of string
  matching for HTTP status classification
- Node.js SDK: fix duplicate JSDoc block on stop()
- Python SDK: add idempotency docstrings to pause()/resume()
- Python example: add try/except cleanup, use info.state.status
  (not info.state)
- Add unit tests for quiesced flag initialization and mark_stop clearing
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
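A minimal sketch of what variant-based HTTP status classification could look like. The `NotFound` variant and the specific status codes are assumptions for illustration; the real `BoxliteError` enum and the mapping in `serve/mod.rs` may differ.

```rust
/// Hypothetical subset of error variants; the real `BoxliteError` may differ.
#[derive(Debug)]
#[allow(dead_code)]
enum BoxliteError {
    InvalidState(String),
    NotFound(String),
    Internal(String),
}

/// Classify by variant, never by message text (status codes are assumptions).
fn http_status(err: &BoxliteError) -> u16 {
    match err {
        BoxliteError::InvalidState(_) => 409,
        BoxliteError::NotFound(_) => 404,
        BoxliteError::Internal(_) => 500,
    }
}
```

Matching on the variant means an internal error whose message happens to contain "not found" is still classified as a server error, which string matching could get wrong.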
Pull request overview
Adds a new pause/resume lifecycle to Boxlite’s LiteBox to support “zero-CPU” VM freezing (SIGSTOP/SIGCONT), with best-effort guest filesystem quiesce/thaw and full propagation through REST/CLI, events/audit, and SDKs.
Changes:
- Introduces `pause()`/`resume()` across the core runtime (`BoxBackend`, `LiteBox`, `BoxImpl`) and state machine (`BoxStatus::Paused`, transition rules, `quiesced` tracking).
- Exposes pause/resume via REST server routes + REST client backend, plus event listener callbacks and audit event kinds.
- Adds SDK bindings (Python + Node), an example script, and integration/unit tests for pause/resume behavior.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `boxlite/src/litebox/box_impl.rs` | Implements pause/resume via guest quiesce + SIGSTOP/SIGCONT, gates exec/copy while paused, updates quiesce bracket + health-check behavior. |
| `boxlite/src/litebox/state.rs` | Adds `Paused` state, transition rules, `can_pause()`/`can_resume()` predicates, and runtime-only `quiesced` flag + tests. |
| `boxlite/src/litebox/mod.rs` | Exposes `LiteBox::pause()` and `LiteBox::resume()` public API. |
| `boxlite/src/runtime/backend.rs` | Extends `BoxBackend` trait with pause/resume. |
| `boxlite/src/rest/litebox.rs` | Adds REST backend implementations calling `/boxes/:id/pause` and `/boxes/:id/resume`. |
| `boxlite-cli/src/commands/serve/mod.rs` | Adds pause/resume routes and improves error classification by matching `BoxliteError` variants. |
| `boxlite-cli/src/commands/serve/handlers/boxes.rs` | Adds HTTP handlers for pause/resume endpoints. |
| `boxlite/src/event_listener/listener.rs` | Adds `on_box_paused` / `on_box_resumed` callbacks. |
| `boxlite/src/event_listener/event.rs` | Adds `AuditEventKind::BoxPaused` / `BoxResumed`. |
| `boxlite/src/event_listener/audit_event_listener.rs` | Records pause/resume audit events + adds tests. |
| `boxlite/tests/pause_resume.rs` | New integration test suite covering pause/resume, idempotency, and operation gating. |
| `boxlite/tests/audit.rs` | Extends event/audit tests to cover pause/resume events. |
| `sdks/python/src/box_handle.rs` | Adds async Python bindings `pause()` / `resume()`. |
| `sdks/node/src/box_handle.rs` | Adds N-API bindings `pause()` / `resume()`. |
| `sdks/node/lib/simplebox.ts` | Adds `SimpleBox.pause()` / `SimpleBox.resume()` convenience methods with docs. |
| `examples/python/03_lifecycle/pause_and_resume.py` | Adds a runnable example demonstrating pause/resume flows and constraints. |
    // Phase 1: Freeze guest I/O (best-effort, 5s timeout)
    let frozen = self.guest_quiesce().await;
guest_quiesce() is treated as a boolean “quiesce succeeded”, but the guest RPC can legitimately return frozen_count = 0 (e.g., no writable/freezable mounts found). In that case frozen should be false so state.quiesced and logs don’t incorrectly claim FIFREEZE succeeded; consider propagating the count (or making guest_quiesce() return count > 0).
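One way to express the "count > 0" suggestion, as a sketch. `QuiesceReply` and its `frozen_count` field are hypothetical stand-ins for whatever the guest RPC actually returns.

```rust
/// Hypothetical reply from the guest quiesce RPC.
struct QuiesceReply {
    frozen_count: u32,
}

/// True only when the RPC succeeded AND at least one filesystem was frozen.
/// An RPC error or an empty freeze both count as "not quiesced".
fn quiesce_succeeded(reply: Result<QuiesceReply, String>) -> bool {
    matches!(reply, Ok(QuiesceReply { frozen_count }) if frozen_count > 0)
}
```

With this shape, `state.quiesced` would only be set when FIFREEZE actually froze something.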
            state.quiesced = false;
            if let Err(e) = self.runtime.box_manager.save_box(self.id(), &state) {
                tracing::warn!(box_id = %self.config.id, error = %e, "Failed to persist Running state");
            }
        }

        // Phase 2: Thaw guest I/O (best-effort)
        self.guest_thaw().await;
The PR description says guest thaw (FITHAW) should happen only if pause() successfully quiesced. Here state.quiesced is cleared before the thaw and guest_thaw() is called unconditionally, so a box paused without FIFREEZE will still attempt FITHAW on resume. Consider capturing the previous quiesced value before clearing it and only calling guest_thaw() when it was true (to align behavior and avoid unnecessary RPC/log noise).
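A sketch of the "capture before clearing" idea, under the assumption that `quiesced` is a plain `bool` on the state. The `log` vector replaces the real signal and RPC calls so the ordering is observable; none of these names are taken from the PR.

```rust
/// Hypothetical slice of box state; only the field relevant here.
struct BoxState {
    quiesced: bool,
}

/// Clear `quiesced` and report whether a thaw is needed.
fn take_quiesced(state: &mut BoxState) -> bool {
    // `mem::take` returns the old value and leaves `false` (bool's default).
    std::mem::take(&mut state.quiesced)
}

/// Resume-then-thaw, with the thaw gated on the captured flag.
fn resume(state: &mut BoxState, log: &mut Vec<&'static str>) {
    log.push("SIGCONT");
    let was_quiesced = take_quiesced(state);
    if was_quiesced {
        log.push("FITHAW");
    }
}
```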
    async fn pause(&self) -> BoxliteResult<()> {
        self.pause().await
    }

    async fn resume(&self) -> BoxliteResult<()> {
        self.resume().await
    }
In the BoxBackend impl, self.pause().await / self.resume().await rely on inherent-method resolution to avoid calling the trait methods recursively. This is easy to misread and becomes fragile if method names change; prefer an explicit UFCS call (e.g., BoxImpl::pause(self).await) to make it unambiguous.
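A small synchronous sketch of the resolution rule in question. The types mirror the names in the diff but are simplified stand-ins (no async, no `BoxliteResult`).

```rust
struct BoxImpl;

impl BoxImpl {
    /// Inherent method: does the real work.
    fn pause(&self) -> &'static str {
        "inherent"
    }
}

trait BoxBackend {
    fn pause(&self) -> String;
}

impl BoxBackend for BoxImpl {
    fn pause(&self) -> String {
        // `self.pause()` would also pick the inherent method, because inherent
        // methods take priority over trait methods in method-call resolution.
        // The path form makes that choice explicit and survives renames.
        format!("backend -> {}", BoxImpl::pause(self))
    }
}
```

If the inherent `pause` were ever removed or renamed, `self.pause()` inside the trait impl would silently become infinite recursion, whereas `BoxImpl::pause(self)` would still resolve to the trait method in that case too, so the explicit form mainly helps readers; a differently named inherent method (for example a hypothetical `pause_inner`) would turn the mistake into a compile error.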
    // Create a temp file to copy
    let tmp = std::env::temp_dir().join("boxlite-test-copy-pause");
    std::fs::write(&tmp, b"test").expect("write temp file");
Using a fixed filename under std::env::temp_dir() can make these tests flaky when run in parallel (collisions between concurrent test processes, or leftover files from prior runs). Prefer a per-test unique temp file/dir (e.g., tempfile::NamedTempFile/TempDir, or include a UUID/box id in the filename) and ensure cleanup runs even on failure.
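If adding the `tempfile` crate is not desirable, a std-only sketch along these lines gives per-test unique names and cleanup on failure. `TempFile` is a hypothetical helper; uniqueness here relies on process id plus an in-process counter, which is an assumption that holds for concurrently running test processes but is weaker than what `tempfile` provides.

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU64, Ordering};

/// Per-process counter so two tests in the same binary never share a name.
static COUNTER: AtomicU64 = AtomicU64::new(0);

/// Temp file that is deleted on drop, including during panic unwinding.
struct TempFile {
    path: PathBuf,
}

impl TempFile {
    fn create(prefix: &str, contents: &[u8]) -> std::io::Result<Self> {
        let n = COUNTER.fetch_add(1, Ordering::Relaxed);
        // Name = prefix + pid + counter: unique among live processes.
        let name = format!("{prefix}-{}-{n}", std::process::id());
        let path = std::env::temp_dir().join(name);
        fs::write(&path, contents)?;
        Ok(TempFile { path })
    }

    fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for TempFile {
    fn drop(&mut self) {
        // Best-effort cleanup; ignore errors such as "already removed".
        let _ = fs::remove_file(&self.path);
    }
}
```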
    // Exec should fail with InvalidState
    let cmd = BoxCommand::new("echo").args(["should-fail"]);
    let err = match litebox.exec(cmd).await {
        Err(e) => e,
        Ok(_) => panic!("exec should fail while paused"),
    };
    let msg = err.to_string();
    assert!(
        msg.contains("Paused") || msg.contains("paused") || msg.contains("InvalidState"),
        "Expected InvalidState/Paused error, got: {msg}"
    );
These assertions depend on err.to_string() containing specific substrings, which is brittle and can break on harmless wording changes. Since this is a Rust integration test, prefer matching on the concrete error variant (e.g., Err(BoxliteError::InvalidState(_))) and/or checking structured fields instead of string contents.
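A sketch of the variant-based assertion. The enum below is a hypothetical two-variant stand-in for `BoxliteError`, and `exec_while_paused` replaces the real `litebox.exec(cmd).await` call.

```rust
/// Hypothetical stand-in for the crate's error type.
#[derive(Debug)]
#[allow(dead_code)]
enum BoxliteError {
    InvalidState(String),
    Internal(String),
}

/// Stand-in for `litebox.exec(cmd)` on a paused box.
fn exec_while_paused() -> Result<(), BoxliteError> {
    Err(BoxliteError::InvalidState("cannot exec: box is paused".to_string()))
}

/// Assert on the variant; the message text is free to change.
fn assert_invalid_state(result: Result<(), BoxliteError>) {
    assert!(
        matches!(result, Err(BoxliteError::InvalidState(_))),
        "expected InvalidState, got: {result:?}"
    );
}
```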
    /// Check if pause() can be called from this state.
    /// Only Running boxes can be paused.
    pub fn can_pause(&self) -> bool {
        matches!(self, BoxStatus::Running)
    }

    /// Check if resume() can be called from this state.
    /// Only Paused boxes can be resumed.
can_pause()/can_resume() are documented as “can be called from this state”, but the public pause/resume APIs are explicitly idempotent (pause on Paused and resume on Running are no-ops). Either adjust these predicates (e.g., include Paused/Running respectively) or clarify in the doc comments that these are “meaningful transition allowed” checks rather than “API call will succeed” checks.
Suggested change
    - /// Check if pause() can be called from this state.
    - /// Only Running boxes can be paused.
    - pub fn can_pause(&self) -> bool {
    -     matches!(self, BoxStatus::Running)
    - }
    - /// Check if resume() can be called from this state.
    - /// Only Paused boxes can be resumed.
    + /// Check if calling `pause()` from this state would cause a meaningful
    + /// state transition (i.e., enter `Paused` from `Running`).
    + ///
    + /// Note: the public `pause()` API is idempotent and may be safely invoked
    + /// from `Paused` as a no-op. This predicate intentionally returns `true`
    + /// only when a transition `Running -> Paused` is allowed.
    + pub fn can_pause(&self) -> bool {
    +     matches!(self, BoxStatus::Running)
    + }
    + /// Check if calling `resume()` from this state would cause a meaningful
    + /// state transition (i.e., leave `Paused` and enter `Running`).
    + ///
    + /// Note: the public `resume()` API is idempotent and may be safely invoked
    + /// from `Running` as a no-op. This predicate intentionally returns `true`
    + /// only when a transition `Paused -> Running` is allowed.
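The distinction between "transition allowed" and "call will succeed" can be sketched as follows. The three-state enum is a simplification, and the free function `pause` is a hypothetical wrapper; in particular, returning an error when pausing a `Stopped` box is an assumption, not something the PR text states.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum BoxStatus {
    Running,
    Paused,
    Stopped,
}

impl BoxStatus {
    /// True only when `Running -> Paused` would actually happen.
    fn can_pause(&self) -> bool {
        matches!(self, BoxStatus::Running)
    }

    /// True only when `Paused -> Running` would actually happen.
    fn can_resume(&self) -> bool {
        matches!(self, BoxStatus::Paused)
    }
}

/// Idempotent wrapper: Ok(true) on a real transition, Ok(false) on a no-op.
fn pause(status: &mut BoxStatus) -> Result<bool, String> {
    if *status == BoxStatus::Paused {
        return Ok(false); // already paused: succeed without doing anything
    }
    if !status.can_pause() {
        return Err(format!("InvalidState: cannot pause from {status:?}"));
    }
    *status = BoxStatus::Paused;
    Ok(true)
}
```

Here `can_pause()` returns `false` for a paused box even though calling `pause()` on it succeeds, which is exactly the gap the revised doc comments describe.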
Summary
Add `pause()` and `resume()` methods to `LiteBox` that freeze/thaw a running VM using SIGSTOP/SIGCONT with optional guest filesystem quiesce (FIFREEZE/FITHAW), enabling zero-CPU idle states while preserving full memory and process state.
- […] `InvalidState` while paused

How It Works
The pause path:
1. […] (`FIFREEZE`): freezes guest filesystems for consistency (best-effort, 5s timeout)
2. `SIGSTOP` on shim process: freezes all vCPUs, virtio backends, and I/O threads atomically
3. […] `Paused`, track `quiesced` flag for later snapshot awareness

The resume path:
1. `SIGCONT` on shim process: unfreezes everything
2. `kill(pid, 0)` liveness check: detect if process died while paused
3. […] (`FITHAW`) if quiesced: only if FIFREEZE succeeded during pause
4. […] `Running`

Key Design Decisions
- `quiesced` flag (runtime-only, `#[serde(skip)]`): Tracks whether FIFREEZE succeeded so `with_quiesce_async` (export/snapshot) can skip redundant quiesce or warn about crash-consistent degradation
- […] `Stopped` instead of returning confusing errors
- […] `stop()` races with `pause()`, undo SIGSTOP and yield to stop's teardown path

Changes
Core Runtime (`boxlite/`)
- `litebox/box_impl.rs`: `pause()`, `resume()`, exec/copy guards, `with_quiesce_async` awareness, ESRCH + stop-race handling
- `litebox/state.rs`: `BoxStatus::Paused`, transition rules, `quiesced` field, `can_pause()`/`can_resume()` predicates
- `litebox/mod.rs`: Public `pause()`/`resume()` on `LiteBox` handle
- `runtime/backend.rs`: `pause()`/`resume()` on `BoxBackend` trait
- `rest/litebox.rs`: REST endpoints for pause/resume

Event System
- `event_listener/listener.rs`: `on_box_paused()`/`on_box_resumed()` callbacks
- `event_listener/event.rs`: `BoxPaused`/`BoxResumed` event kinds
- `event_listener/audit_event_listener.rs`: Records pause/resume events

CLI & HTTP
- `boxlite-cli/src/commands/serve/handlers/boxes.rs`: POST `/boxes/:id/pause`, `/boxes/:id/resume`
- `boxlite-cli/src/commands/serve/mod.rs`: Pattern-match error classification (replaces string matching)

SDKs
- `sdks/python/src/box_handle.rs`: `pause()`/`resume()` with idempotency docs
- `sdks/node/src/box_handle.rs`: `pause()`/`resume()` via napi-rs
- `sdks/node/lib/simplebox.ts`: `pause()`/`resume()` on SimpleBox

Tests
- `boxlite/tests/pause_resume.rs`: 10 integration tests (pause/resume, idempotency, exec/copy rejection, stop-from-paused, multi-cycle, error cases)
- `boxlite/src/litebox/state.rs`: 14 unit tests for state machine transitions + quiesced tracking
- `boxlite/src/event_listener/audit_event_listener.rs`: 2 event recording tests
- `boxlite/tests/audit.rs`: Integration test for pause/resume event emission

Example
- `examples/python/03_lifecycle/pause_and_resume.py`: 4 demos: basic pause/resume, exec-blocked-while-paused, multi-cycle, stop-from-paused

Test Plan
- […] (`cargo clippy -p boxlite --no-default-features --lib -- -D warnings`)
- […] (`cargo fmt -- --check`)

🤖 Generated with Claude Code