feat(litebox): add pause/resume API for zero-CPU VM freezing #413

Open
lilongen wants to merge 3 commits into boxlite-ai:main from lilongen:feat/pause-resume-api

@lilongen

Summary

Add pause() and resume() methods to LiteBox that freeze/thaw a running VM using SIGSTOP/SIGCONT with optional guest filesystem quiesce (FIFREEZE/FITHAW), enabling zero-CPU idle states while preserving full memory and process state.

  • pause(): guest I/O quiesce (best-effort FIFREEZE) → SIGSTOP → state → Paused
  • resume(): SIGCONT → liveness check → guest thaw (FITHAW if quiesced) → state → Running
  • Idempotent: pause on Paused = no-op, resume on Running = no-op
  • Exec/copy gated: operations rejected with InvalidState while paused
  • stop() from Paused: works directly (SIGCONT → teardown), no resume needed

How It Works

Running ──pause()──► Paused ──resume()──► Running
                       │
                       └──stop()──► Stopped
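
The diagram above, together with the idempotency rule from the summary, can be sketched as a small state machine. This is a minimal illustration only: `Status` and `Outcome` are hypothetical names, and the real `BoxStatus` has more variants and transition rules than shown here.

```rust
/// Hypothetical, simplified status enum (not the PR's `BoxStatus`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Running,
    Paused,
    Stopped,
}

/// Lets callers tell a real transition from an idempotent no-op.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Transitioned,
    NoOp,
}

impl Status {
    /// Running -> Paused; calling it again on Paused is a no-op.
    fn pause(&mut self) -> Result<Outcome, String> {
        match *self {
            Status::Running => {
                *self = Status::Paused;
                Ok(Outcome::Transitioned)
            }
            Status::Paused => Ok(Outcome::NoOp),
            Status::Stopped => Err("cannot pause a stopped box".to_string()),
        }
    }

    /// Paused -> Running; calling it on Running is a no-op.
    fn resume(&mut self) -> Result<Outcome, String> {
        match *self {
            Status::Paused => {
                *self = Status::Running;
                Ok(Outcome::Transitioned)
            }
            Status::Running => Ok(Outcome::NoOp),
            Status::Stopped => Err("cannot resume a stopped box".to_string()),
        }
    }

    /// stop() is accepted from both Running and Paused.
    fn stop(&mut self) -> Outcome {
        if *self == Status::Stopped {
            Outcome::NoOp
        } else {
            *self = Status::Stopped;
            Outcome::Transitioned
        }
    }
}
```

Returning an `Outcome` instead of `()` is a design choice of this sketch; it makes the no-op case observable in tests without inspecting logs.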

The pause path:

  1. Guest quiesce via gRPC (FIFREEZE) — freezes guest filesystems for consistency (best-effort, 5s timeout)
  2. SIGSTOP on shim process — freezes all vCPUs, virtio backends, and I/O threads atomically
  3. State transition to Paused, track quiesced flag for later snapshot awareness

The resume path:

  1. SIGCONT on shim process — unfreezes everything
  2. kill(pid, 0) liveness check — detect if process died while paused
  3. Guest thaw via gRPC (FITHAW) if quiesced — only if FIFREEZE succeeded during pause
  4. State transition back to Running
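
The ordering of the two paths can be sketched against a fake shim, so the sequence is checkable without a real VM. Everything here is an assumption made for illustration: the `Shim` trait, `VmBox`, and `FakeShim` are hypothetical, the code is synchronous, and calls from a non-matching state are simply ignored rather than reported as errors.

```rust
/// Hypothetical seam over the guest RPC and the shim process.
trait Shim {
    fn guest_freeze(&mut self) -> bool; // true if FIFREEZE succeeded
    fn guest_thaw(&mut self);
    fn sigstop(&mut self);
    fn sigcont(&mut self);
    fn is_alive(&self) -> bool; // stands in for kill(pid, 0)
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Running,
    Paused,
    Stopped,
}

struct VmBox<S: Shim> {
    shim: S,
    status: Status,
    quiesced: bool,
}

impl<S: Shim> VmBox<S> {
    fn pause(&mut self) {
        if self.status != Status::Running {
            return; // simplification: ignore calls from other states
        }
        self.quiesced = self.shim.guest_freeze(); // 1. best-effort quiesce
        self.shim.sigstop(); // 2. freeze the shim
        self.status = Status::Paused; // 3. record the new state
    }

    fn resume(&mut self) {
        if self.status != Status::Paused {
            return;
        }
        self.shim.sigcont(); // 1. unfreeze
        if !self.shim.is_alive() {
            // 2. the shim died while paused
            self.status = Status::Stopped;
            self.quiesced = false;
            return;
        }
        if self.quiesced {
            self.shim.guest_thaw(); // 3. thaw only if the freeze succeeded
        }
        self.quiesced = false;
        self.status = Status::Running; // 4. back to Running
    }
}

/// Recording fake: logs each operation in call order.
struct FakeShim {
    freeze_ok: bool,
    alive: bool,
    log: Vec<&'static str>,
}

impl Shim for FakeShim {
    fn guest_freeze(&mut self) -> bool {
        self.log.push("FIFREEZE");
        self.freeze_ok
    }
    fn guest_thaw(&mut self) {
        self.log.push("FITHAW");
    }
    fn sigstop(&mut self) {
        self.log.push("SIGSTOP");
    }
    fn sigcont(&mut self) {
        self.log.push("SIGCONT");
    }
    fn is_alive(&self) -> bool {
        self.alive
    }
}
```

With `freeze_ok: true`, a pause followed by a resume leaves the log as FIFREEZE, SIGSTOP, SIGCONT, FITHAW; with `freeze_ok: false`, the FITHAW entry is absent.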

Key Design Decisions

  1. SIGSTOP over cgroup freezer: Works on both macOS and Linux without cgroup setup; freezes the entire shim process tree atomically including all virtio backends
  2. Guest quiesce before SIGSTOP: FIFREEZE flushes dirty pages and freezes filesystems for point-in-time consistency, making paused state safe for snapshots/exports
  3. quiesced flag (runtime-only, #[serde(skip)]): Tracks whether FIFREEZE succeeded so with_quiesce_async (export/snapshot) can skip redundant quiesce or warn about crash-consistent degradation
  4. ESRCH handling: If shim dies between status check and signal, transitions to Stopped instead of returning confusing errors
  5. stop() race handling: If stop() races with pause(), undo SIGSTOP and yield to stop's teardown path
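
Decision 4 amounts to classifying the result of the signal call before deciding on a state transition. A minimal sketch, assuming the signal helper returns `std::io::Result<()>` carrying the OS error code; `SignalOutcome` and `classify_signal_result` are hypothetical names, not taken from the PR.

```rust
use std::io;

/// ESRCH ("no such process") is errno 3 on both Linux and macOS.
const ESRCH: i32 = 3;

#[derive(Debug, PartialEq, Eq)]
enum SignalOutcome {
    /// The signal was delivered; continue with the pause or resume path.
    Delivered,
    /// The shim no longer exists; the box should transition to Stopped.
    ShimGone,
    /// Any other failure is reported to the caller.
    Failed(String),
}

/// Maps the raw result of a SIGSTOP/SIGCONT attempt to an action.
fn classify_signal_result(result: io::Result<()>) -> SignalOutcome {
    match result {
        Ok(()) => SignalOutcome::Delivered,
        Err(e) if e.raw_os_error() == Some(ESRCH) => SignalOutcome::ShimGone,
        Err(e) => SignalOutcome::Failed(e.to_string()),
    }
}
```

Keeping the classification in a pure function means the ESRCH branch can be unit-tested with `io::Error::from_raw_os_error(3)` instead of killing a real process.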

Changes

Core Runtime (boxlite/)

  • litebox/box_impl.rs: pause(), resume(), exec/copy guards, with_quiesce_async awareness, ESRCH + stop-race handling
  • litebox/state.rs: BoxStatus::Paused, transition rules, quiesced field, can_pause()/can_resume() predicates
  • litebox/mod.rs: Public pause()/resume() on LiteBox handle
  • runtime/backend.rs: pause()/resume() on BoxBackend trait
  • rest/litebox.rs: REST endpoints for pause/resume

Event System

  • event_listener/listener.rs: on_box_paused()/on_box_resumed() callbacks
  • event_listener/event.rs: BoxPaused/BoxResumed event kinds
  • event_listener/audit_event_listener.rs: Records pause/resume events

CLI & HTTP

  • boxlite-cli/src/commands/serve/handlers/boxes.rs — POST /boxes/:id/pause, /boxes/:id/resume
  • boxlite-cli/src/commands/serve/mod.rs — Pattern-match error classification (replaces string matching)

SDKs

  • sdks/python/src/box_handle.rs: pause()/resume() with idempotency docs
  • sdks/node/src/box_handle.rs: pause()/resume() via napi-rs
  • sdks/node/lib/simplebox.ts: pause()/resume() on SimpleBox

Tests

  • boxlite/tests/pause_resume.rs — 10 integration tests (pause/resume, idempotency, exec/copy rejection, stop-from-paused, multi-cycle, error cases)
  • boxlite/src/litebox/state.rs — 14 unit tests for state machine transitions + quiesced tracking
  • boxlite/src/event_listener/audit_event_listener.rs — 2 event recording tests
  • boxlite/tests/audit.rs — Integration test for pause/resume event emission

Example

  • examples/python/03_lifecycle/pause_and_resume.py — 4 demos: basic pause/resume, exec-blocked-while-paused, multi-cycle, stop-from-paused

Test Plan

  • 608/608 unit tests pass (macOS)
  • 583 of 608 unit tests pass on Linux/Lima (the other 25 are pre-existing KVM failures unrelated to this PR)
  • Clippy clean (cargo clippy -p boxlite --no-default-features --lib -- -D warnings)
  • Format clean (cargo fmt -- --check)
  • 26 pause/resume specific tests pass (14 state machine + 2 audit + 10 integration)
  • All 47 state.rs tests pass (existing + new)

🤖 Generated with Claude Code

lile and others added 3 commits March 27, 2026 18:07
Add pause() and resume() across all API layers (Rust core, REST, Python SDK)
to freeze/resume VMs via SIGSTOP/SIGCONT with guest filesystem quiesce.

Core implementation:
- pause(): FIFREEZE guest I/O → SIGSTOP shim (quiesce-then-freeze)
- resume(): SIGCONT shim → FITHAW guest I/O (resume-then-thaw)
- Both operations are idempotent (pause on Paused = no-op, etc.)
- State machine: Running ↔ Paused, Paused → Stopped

Safety and correctness:
- stop() sends SIGCONT before guest shutdown RPC on Paused boxes
  (prevents 10s gRPC timeout on SIGSTOP'd process)
- exec/copy_into/copy_out reject Paused boxes with InvalidState
  (shim can't handle gRPC while SIGSTOP'd)
- Health check skips gRPC pings during Paused state but verifies
  process alive via kill(pid, 0) to detect death while paused
- with_quiesce_async preserves user-initiated Paused state
  (clone/export/snapshot don't auto-resume user-paused boxes)
- Fix pre-existing deadlock: health check save_box used state.read()
  while holding state.write() (parking_lot RwLock is not reentrant)
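
The deadlock described in the last item can be illustrated with the standard library's `RwLock`, which likewise must not be read-locked by a thread that already holds the write lock. This is a sketch under stated assumptions: `BoxState`, `save_box`, and `health_check` here are hypothetical stand-ins, and `try_read()` is used only to make the conflict observable without hanging.

```rust
use std::sync::{RwLock, TryLockError};

struct BoxState {
    status: &'static str,
}

/// Hypothetical persistence helper: it borrows the state it is given
/// instead of locking again, so it is safe to call under the write guard.
fn save_box(state: &BoxState) -> String {
    format!("saved:{}", state.status)
}

/// Returns whether a second read attempt was blocked, plus the saved record.
fn health_check(lock: &RwLock<BoxState>) -> (bool, String) {
    let mut guard = lock.write().unwrap();
    guard.status = "stopped";

    // A blocking `lock.read()` on this thread would never succeed while
    // `guard` is alive; `try_read()` reports the conflict instead.
    let read_blocked = matches!(lock.try_read(), Err(TryLockError::WouldBlock));

    // The fix pattern: pass the already-locked state by reference.
    let saved = save_box(&guard);
    (read_blocked, saved)
}
```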

API surface:
- EventListener: on_box_paused/on_box_resumed callbacks
- AuditEventKind: BoxPaused/BoxResumed variants
- BoxStatus: can_pause()/can_resume()/is_paused() methods
- REST: POST /v1/default/boxes/{id}/pause and /resume
- Python SDK: box.pause() and box.resume() async methods

Tests: 12 new pause/resume unit tests, 2 new integration tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…w fixes

Address code review findings for the pause/resume API:

- Fix TOCTOU: re-check shutdown_token after guest_quiesce() before SIGSTOP
- Replace force_status() with transition_to() + fallback for safety
- Log save_box failures with tracing::warn instead of silent discard
- Combine double state.read() in health check into single lock acquisition
- Add Paused → Stopping transition to state machine for completeness

Add missing test coverage:
- copy_into/copy_out rejected while paused (P1 gap)
- resume on stopped box returns error (P1 gap)
- Event listener multi-listener and box_id correctness tests
- State machine: Paused→Stopping transition, Paused cannot remove

Add Node.js SDK bindings:
- pause()/resume() in napi-rs (box_handle.rs)
- pause()/resume() in SimpleBox TypeScript wrapper with JSDoc

Add integration tests and Python example:
- 10 integration tests covering all pause/resume scenarios
- Python example with 4 demo functions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…g, typed error matching

- Handle ESRCH in SIGSTOP/SIGCONT: if shim dies mid-pause, transition to
  Stopped instead of returning a confusing Internal error
- Handle stop() racing with pause(): if state transitions to Stopping/Stopped
  during pause, undo SIGSTOP and yield to stop() teardown
- Track quiesced flag on BoxState so with_quiesce_async knows whether
  guest I/O was frozen during an earlier pause(); warn when degrading to
  crash-consistent (SIGSTOP-only) snapshots
- CLI serve: pattern-match on BoxliteError variants instead of string matching
  for HTTP status classification
- Node.js SDK: fix duplicate JSDoc block on stop()
- Python SDK: add idempotency docstrings to pause()/resume()
- Python example: add try/except cleanup, use info.state.status (not info.state)
- Add unit tests for quiesced flag initialization and mark_stop clearing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 30, 2026 02:13
Contributor

Copilot AI left a comment

Pull request overview

Adds a new pause/resume lifecycle to Boxlite’s LiteBox to support “zero-CPU” VM freezing (SIGSTOP/SIGCONT), with best-effort guest filesystem quiesce/thaw and full propagation through REST/CLI, events/audit, and SDKs.

Changes:

  • Introduces pause()/resume() across the core runtime (BoxBackend, LiteBox, BoxImpl) and state machine (BoxStatus::Paused, transition rules, quiesced tracking).
  • Exposes pause/resume via REST server routes + REST client backend, plus event listener callbacks and audit event kinds.
  • Adds SDK bindings (Python + Node), an example script, and integration/unit tests for pause/resume behavior.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| boxlite/src/litebox/box_impl.rs | Implements pause/resume via guest quiesce + SIGSTOP/SIGCONT, gates exec/copy while paused, updates quiesce bracket + health-check behavior. |
| boxlite/src/litebox/state.rs | Adds Paused state, transition rules, can_pause()/can_resume() predicates, and runtime-only quiesced flag + tests. |
| boxlite/src/litebox/mod.rs | Exposes LiteBox::pause() and LiteBox::resume() public API. |
| boxlite/src/runtime/backend.rs | Extends BoxBackend trait with pause/resume. |
| boxlite/src/rest/litebox.rs | Adds REST backend implementations calling /boxes/:id/pause and /boxes/:id/resume. |
| boxlite-cli/src/commands/serve/mod.rs | Adds pause/resume routes and improves error classification by matching BoxliteError variants. |
| boxlite-cli/src/commands/serve/handlers/boxes.rs | Adds HTTP handlers for pause/resume endpoints. |
| boxlite/src/event_listener/listener.rs | Adds on_box_paused / on_box_resumed callbacks. |
| boxlite/src/event_listener/event.rs | Adds AuditEventKind::BoxPaused / BoxResumed. |
| boxlite/src/event_listener/audit_event_listener.rs | Records pause/resume audit events + adds tests. |
| boxlite/tests/pause_resume.rs | New integration test suite covering pause/resume, idempotency, and operation gating. |
| boxlite/tests/audit.rs | Extends event/audit tests to cover pause/resume events. |
| sdks/python/src/box_handle.rs | Adds async Python bindings pause() / resume(). |
| sdks/node/src/box_handle.rs | Adds N-API bindings pause() / resume(). |
| sdks/node/lib/simplebox.ts | Adds SimpleBox.pause() / SimpleBox.resume() convenience methods with docs. |
| examples/python/03_lifecycle/pause_and_resume.py | Adds a runnable example demonstrating pause/resume flows and constraints. |

Comment on lines +362 to +364
// Phase 1: Freeze guest I/O (best-effort, 5s timeout)
let frozen = self.guest_quiesce().await;

Copilot AI Mar 30, 2026

guest_quiesce() is treated as a boolean “quiesce succeeded”, but the guest RPC can legitimately return frozen_count = 0 (e.g., no writable/freezable mounts found). In that case frozen should be false so state.quiesced and logs don’t incorrectly claim FIFREEZE succeeded; consider propagating the count (or making guest_quiesce() return count > 0).
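
One way to read this suggestion, as a minimal sketch: treat the quiesce as successful only when the RPC succeeded and at least one filesystem was actually frozen. `QuiesceReply` and its error type are hypothetical; only the `frozen_count` field name comes from the comment above.

```rust
/// Hypothetical shape of the guest quiesce reply.
struct QuiesceReply {
    frozen_count: u32,
}

/// True only if the RPC succeeded AND something was actually frozen.
/// An RPC error or a reply with zero frozen mounts both yield false.
fn quiesce_succeeded(reply: Result<QuiesceReply, String>) -> bool {
    matches!(reply, Ok(r) if r.frozen_count > 0)
}
```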

Copilot uses AI. Check for mistakes.
Comment on lines +536 to +543
    state.quiesced = false;
    if let Err(e) = self.runtime.box_manager.save_box(self.id(), &state) {
        tracing::warn!(box_id = %self.config.id, error = %e, "Failed to persist Running state");
    }
}

// Phase 2: Thaw guest I/O (best-effort)
self.guest_thaw().await;

Copilot AI Mar 30, 2026

The PR description says guest thaw (FITHAW) should happen only if pause() successfully quiesced. Here state.quiesced is cleared before the thaw and guest_thaw() is called unconditionally, so a box paused without FIFREEZE will still attempt FITHAW on resume. Consider capturing the previous quiesced value before clearing it and only calling guest_thaw() when it was true (to align behavior and avoid unnecessary RPC/log noise).
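
The capture-then-clear pattern suggested here could look roughly like the following. It is a synchronous sketch with hypothetical names (`BoxState`, `resume_thaw`), and the thaw RPC is modelled as a closure so the behaviour can be checked in isolation.

```rust
struct BoxState {
    quiesced: bool,
}

/// Clears the flag and thaws only if the earlier pause had frozen the guest.
fn resume_thaw(state: &mut BoxState, mut guest_thaw: impl FnMut()) {
    // `mem::take` returns the previous value and leaves `false` behind,
    // so the flag is read and cleared in a single step.
    let was_quiesced = std::mem::take(&mut state.quiesced);

    // (State would be persisted here, with `quiesced` already cleared.)

    if was_quiesced {
        guest_thaw();
    }
}
```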

Comment on lines +1435 to +1441
async fn pause(&self) -> BoxliteResult<()> {
    self.pause().await
}

async fn resume(&self) -> BoxliteResult<()> {
    self.resume().await
}

Copilot AI Mar 30, 2026

In the BoxBackend impl, self.pause().await / self.resume().await rely on inherent-method resolution to avoid calling the trait methods recursively. This is easy to misread and becomes fragile if method names change; prefer an explicit UFCS call (e.g., BoxImpl::pause(self).await) to make it unambiguous.
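
A small synchronous sketch of the point being made: when a type has an inherent method and a trait method with the same name, a path call on the concrete type resolves to the inherent one, which makes the delegation explicit. The types below are simplified stand-ins, not the PR's real `BoxImpl` or `BoxBackend`.

```rust
use std::cell::RefCell;

struct BoxImpl {
    log: RefCell<Vec<&'static str>>,
}

impl BoxImpl {
    /// Inherent method holding the real logic.
    fn pause(&self) -> Result<(), String> {
        self.log.borrow_mut().push("inherent pause");
        Ok(())
    }
}

trait BoxBackend {
    fn pause(&self) -> Result<(), String>;
}

impl BoxBackend for BoxImpl {
    fn pause(&self) -> Result<(), String> {
        // Path syntax on the concrete type selects the inherent method,
        // so this delegates instead of recursing into the trait method.
        BoxImpl::pause(self)
    }
}
```

Calling `BoxBackend::pause(&b)` goes through the trait and ends up in the inherent method exactly once.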

Comment on lines +248 to +250
// Create a temp file to copy
let tmp = std::env::temp_dir().join("boxlite-test-copy-pause");
std::fs::write(&tmp, b"test").expect("write temp file");

Copilot AI Mar 30, 2026

Using a fixed filename under std::env::temp_dir() can make these tests flaky when run in parallel (collisions between concurrent test processes, or leftover files from prior runs). Prefer a per-test unique temp file/dir (e.g., tempfile::NamedTempFile/TempDir, or include a UUID/box id in the filename) and ensure cleanup runs even on failure.
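
A standard-library-only sketch of the idea, for illustration: derive a per-process, per-call unique name and remove the file in `Drop` so cleanup also runs when a test panics and unwinds. `TempFile` is a hypothetical helper; the `tempfile` crate mentioned above is the more robust option, since this sketch does not guard against stale files left behind by an earlier process with the same PID.

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU64, Ordering};

/// Per-process counter so concurrent tests get distinct names.
static COUNTER: AtomicU64 = AtomicU64::new(0);

/// Temp file that is removed when the guard goes out of scope.
struct TempFile {
    path: PathBuf,
}

impl TempFile {
    fn create(prefix: &str, contents: &[u8]) -> std::io::Result<TempFile> {
        let n = COUNTER.fetch_add(1, Ordering::Relaxed);
        // The process id separates test processes; the counter separates
        // tests (and repeated calls) within one process.
        let name = format!("{}-{}-{}", prefix, std::process::id(), n);
        let path = std::env::temp_dir().join(name);
        fs::write(&path, contents)?;
        Ok(TempFile { path })
    }

    fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for TempFile {
    fn drop(&mut self) {
        // Best-effort cleanup; ignore errors such as "already removed".
        let _ = fs::remove_file(&self.path);
    }
}
```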

Comment on lines +77 to +87
// Exec should fail with InvalidState
let cmd = BoxCommand::new("echo").args(["should-fail"]);
let err = match litebox.exec(cmd).await {
    Err(e) => e,
    Ok(_) => panic!("exec should fail while paused"),
};
let msg = err.to_string();
assert!(
    msg.contains("Paused") || msg.contains("paused") || msg.contains("InvalidState"),
    "Expected InvalidState/Paused error, got: {msg}"
);

Copilot AI Mar 30, 2026

These assertions depend on err.to_string() containing specific substrings, which is brittle and can break on harmless wording changes. Since this is a Rust integration test, prefer matching on the concrete error variant (e.g., Err(BoxliteError::InvalidState(_))) and/or checking structured fields instead of string contents.
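
Matching on the variant rather than on the message might look like this. The enum below is a hypothetical two-variant stand-in for `BoxliteError` (the `String` payloads are an assumption), and `exec_while_paused` simulates the rejected call.

```rust
/// Hypothetical stand-in for the crate's error type.
#[derive(Debug)]
enum BoxliteError {
    InvalidState(String),
    Internal(String),
}

/// Simulates exec() being rejected on a paused box.
fn exec_while_paused() -> Result<(), BoxliteError> {
    Err(BoxliteError::InvalidState("box is paused".to_string()))
}

/// Structural check: independent of the wording of the error message.
fn is_invalid_state<T>(result: &Result<T, BoxliteError>) -> bool {
    matches!(result, Err(BoxliteError::InvalidState(_)))
}
```

In a test this would typically be written inline as `assert!(matches!(result, Err(BoxliteError::InvalidState(_))))`, so a change in message wording no longer breaks the assertion.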

Comment on lines +103 to +110
/// Check if pause() can be called from this state.
/// Only Running boxes can be paused.
pub fn can_pause(&self) -> bool {
    matches!(self, BoxStatus::Running)
}

/// Check if resume() can be called from this state.
/// Only Paused boxes can be resumed.

Copilot AI Mar 30, 2026

can_pause()/can_resume() are documented as “can be called from this state”, but the public pause/resume APIs are explicitly idempotent (pause on Paused and resume on Running are no-ops). Either adjust these predicates (e.g., include Paused/Running respectively) or clarify in the doc comments that these are “meaningful transition allowed” checks rather than “API call will succeed” checks.

Suggested change
- /// Check if pause() can be called from this state.
- /// Only Running boxes can be paused.
- pub fn can_pause(&self) -> bool {
-     matches!(self, BoxStatus::Running)
- }
- /// Check if resume() can be called from this state.
- /// Only Paused boxes can be resumed.
+ /// Check if calling `pause()` from this state would cause a meaningful
+ /// state transition (i.e., enter `Paused` from `Running`).
+ ///
+ /// Note: the public `pause()` API is idempotent and may be safely invoked
+ /// from `Paused` as a no-op. This predicate intentionally returns `true`
+ /// only when a transition `Running -> Paused` is allowed.
+ pub fn can_pause(&self) -> bool {
+     matches!(self, BoxStatus::Running)
+ }
+ /// Check if calling `resume()` from this state would cause a meaningful
+ /// state transition (i.e., leave `Paused` and enter `Running`).
+ ///
+ /// Note: the public `resume()` API is idempotent and may be safely invoked
+ /// from `Running` as a no-op. This predicate intentionally returns `true`
+ /// only when a transition `Paused -> Running` is allowed.


2 participants