feat: Box Snapshot API #205

@DorianZheng

Overview

Add snapshot, export, import, and clone functionality to BoxLite.

PR #217 by @joeyaflores provides a solid foundation (disk filename constants, auto-migration, resolve_stopped_box helper). However, the API architecture needs rework to match the design below; see Design Deviations from PR #217 at the bottom.

API Design (Final)

┌──────────────────────────────────────────────────────────────────────┐
│                       FINAL API (8 methods)                          │
├──────────────────────────────────────────────────────────────────────┤
│ LiteBox                              │ BoxliteRuntime                │
│ ─────────────────────────────────────│──────────────────────────────│
│ box.snapshot().create(name, opts)    │ runtime.import(archive, name) │
│ box.snapshot().list()                │                               │
│ box.snapshot().get(name)             │                               │
│ box.snapshot().remove(name)          │                               │
│ box.snapshot().restore(name)         │                               │
│ box.export(dest, opts)               │                               │
│ box.clone(name, opts)                │                               │
└──────────────────────────────────────────────────────────────────────┘

Ownership principle:

  • Has source box handle → box method (clone, export, snapshot.*)
  • No source box → runtime method (import creates from external archive)

Why sub-resource snapshot()?

  • Resolves @uran0sH's naming confusion (snapshot verb vs snapshots noun)
  • Clear CRUD semantics: create/list/get/remove/restore
  • Zero-cost in Rust (SnapshotHandle is just a &LiteBox reference)
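As a rough illustration of the zero-cost claim, the sketch below models the sub-resource pattern with a synchronous, in-memory stand-in. The `Mutex<BTreeMap>` store, the `u64` size payload, and `handle_is_pointer_sized()` are hypothetical and exist only for this example; the real methods are async and return `BoxliteResult`.

```rust
use std::collections::BTreeMap;
use std::mem::size_of;
use std::sync::Mutex;

/// Stand-in for the real box handle; the in-memory map is hypothetical.
pub struct LiteBox {
    snapshots: Mutex<BTreeMap<String, u64>>, // name -> size in bytes (illustrative)
}

/// Borrowing handle: it stores nothing but a reference to its box.
pub struct SnapshotHandle<'a> {
    litebox: &'a LiteBox,
}

impl LiteBox {
    pub fn new() -> Self {
        LiteBox { snapshots: Mutex::new(BTreeMap::new()) }
    }

    /// Accessor for the sub-resource; no allocation, no I/O.
    pub fn snapshot(&self) -> SnapshotHandle<'_> {
        SnapshotHandle { litebox: self }
    }
}

impl<'a> SnapshotHandle<'a> {
    pub fn create(&self, name: &str, size_bytes: u64) -> Result<(), String> {
        let mut map = self.litebox.snapshots.lock().unwrap();
        if map.contains_key(name) {
            return Err(format!("snapshot '{name}' already exists"));
        }
        map.insert(name.to_string(), size_bytes);
        Ok(())
    }

    pub fn list(&self) -> Vec<String> {
        let map = self.litebox.snapshots.lock().unwrap();
        map.keys().cloned().collect()
    }

    pub fn get(&self, name: &str) -> Option<u64> {
        let map = self.litebox.snapshots.lock().unwrap();
        map.get(name).copied()
    }

    pub fn remove(&self, name: &str) -> Result<(), String> {
        let mut map = self.litebox.snapshots.lock().unwrap();
        match map.remove(name) {
            Some(_) => Ok(()),
            None => Err(format!("snapshot '{name}' not found")),
        }
    }
}

/// True when the handle occupies exactly one reference.
pub fn handle_is_pointer_sized() -> bool {
    size_of::<SnapshotHandle<'static>>() == size_of::<&'static LiteBox>()
}
```

Note that `box` is a reserved word in Rust, so Rust callers would bind the handle to a name such as `litebox` and write `litebox.snapshot().create(...)`.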

Why clone on box, not runtime?

  • Box already has disk paths and config, so there is no need to pass a source name string
  • Prevents typos: a misspelled source name such as rt.clone("tset", ...) cannot occur with box.clone(...)
  • Fan-out is natural: [box.clone(f"w-{i}", opts) for i in range(10)]

Rust API

impl LiteBox {
    pub fn snapshot(&self) -> SnapshotHandle<'_>;
    pub async fn export(&self, dest: &Path, opts: ExportOptions) -> BoxliteResult<PathBuf>;
    pub async fn clone(&self, name: &str, opts: CloneOptions) -> BoxliteResult<LiteBox>;
}

pub struct SnapshotHandle<'a> { litebox: &'a LiteBox }

impl<'a> SnapshotHandle<'a> {
    pub async fn create(&self, name: &str, opts: SnapshotOptions) -> BoxliteResult<SnapshotInfo>;
    pub async fn list(&self) -> BoxliteResult<Vec<SnapshotInfo>>;
    pub async fn get(&self, name: &str) -> BoxliteResult<Option<SnapshotInfo>>;
    pub async fn remove(&self, name: &str) -> BoxliteResult<()>;
    pub async fn restore(&self, name: &str) -> BoxliteResult<()>;
}

impl BoxliteRuntime {
    pub async fn import(&self, archive: &Path, name: &str) -> BoxliteResult<LiteBox>;
}

Options (RocksDB-inspired)

All option structs implement the Default trait and expose builder methods returning &mut Self.

pub struct SnapshotOptions {
    quiesce: bool,                    // Default: true (FIFREEZE before snapshot)
    quiesce_timeout_secs: u64,        // Default: 30
    stop_on_quiesce_fail: bool,       // Default: true
}

pub struct ExportOptions {
    compress: bool,                   // Default: true (zstd)
    compression_level: i32,           // Default: 3 (zstd 1-22)
    include_metadata: bool,           // Default: true
}

pub struct CloneOptions {
    cow: bool,                        // Default: true (QCOW2 COW, ~1ms per clone)
    start_after_clone: bool,          // Default: false
    from_snapshot: Option<String>,    // Default: None (clone from specific snapshot)
}
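A minimal sketch of how one of these option structs could look with `Default` plus `&mut Self` builders, using `CloneOptions` as the example. The getter names (`is_cow`, `starts_after_clone`, `source_snapshot`) and the `fan_out_options()` helper are assumptions made for illustration, not part of the proposed API.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct CloneOptions {
    cow: bool,
    start_after_clone: bool,
    from_snapshot: Option<String>,
}

impl Default for CloneOptions {
    fn default() -> Self {
        // Defaults taken from the struct comments in this issue.
        CloneOptions { cow: true, start_after_clone: false, from_snapshot: None }
    }
}

impl CloneOptions {
    // Builder setters mutate in place and return &mut Self for chaining.
    pub fn cow(&mut self, enabled: bool) -> &mut Self {
        self.cow = enabled;
        self
    }

    pub fn start_after_clone(&mut self, enabled: bool) -> &mut Self {
        self.start_after_clone = enabled;
        self
    }

    pub fn from_snapshot(&mut self, name: &str) -> &mut Self {
        self.from_snapshot = Some(name.to_string());
        self
    }

    // Hypothetical read accessors, added only so the example can be checked.
    pub fn is_cow(&self) -> bool {
        self.cow
    }

    pub fn starts_after_clone(&self) -> bool {
        self.start_after_clone
    }

    pub fn source_snapshot(&self) -> Option<&str> {
        self.from_snapshot.as_deref()
    }
}

/// Typical call pattern with &mut Self builders: bind, configure, then pass.
pub fn fan_out_options() -> CloneOptions {
    let mut opts = CloneOptions::default();
    opts.from_snapshot("v1").start_after_clone(true);
    opts
}
```

One consequence worth keeping in mind: with `&mut Self` setters, a chained temporary such as `CloneOptions::default().from_snapshot("v1")` evaluates to `&mut CloneOptions`, so passing it to a parameter declared as `opts: CloneOptions` needs either a named binding as above or a trailing `.clone()`. Consuming builders (`self -> Self`) would avoid that, at the cost of departing from the RocksDB style.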

Why from_snapshot? (addresses @IANTHEREAL Q1)

  • Fan-out 10 workers from same snapshot: box.clone("w-1", CloneOptions::default().from_snapshot("v1"))
  • Without this: restore → clone → restore back (3 ops instead of 1)
  • 10 parallel COW clones ~3ms wall time
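To make the one-operation fan-out concrete, here is a small synchronous mock. `MockBox`, its string `state` label, `clone_box()`, and `fan_out()` are all hypothetical stand-ins; the mock does not touch disks and does not model whether clones inherit snapshots.

```rust
#[derive(Debug, Clone, Default)]
pub struct CloneOptions {
    pub from_snapshot: Option<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct MockBox {
    pub name: String,
    pub state: String,                    // label for the current disk state
    pub snapshots: Vec<(String, String)>, // (snapshot name, captured state)
}

impl MockBox {
    /// Clone from the current state, or from a named snapshot when requested.
    /// The source box is never modified, so no restore/restore-back is needed.
    pub fn clone_box(&self, name: &str, opts: &CloneOptions) -> Result<MockBox, String> {
        let state = match &opts.from_snapshot {
            None => self.state.clone(),
            Some(snap) => self
                .snapshots
                .iter()
                .find(|(n, _)| n == snap)
                .map(|(_, s)| s.clone())
                .ok_or_else(|| format!("snapshot '{snap}' not found"))?,
        };
        Ok(MockBox { name: name.to_string(), state, snapshots: Vec::new() })
    }
}

/// Create `count` workers named w-0, w-1, ... from the same snapshot.
pub fn fan_out(source: &MockBox, snapshot: &str, count: usize) -> Result<Vec<MockBox>, String> {
    let opts = CloneOptions { from_snapshot: Some(snapshot.to_string()) };
    (0..count)
        .map(|i| source.clone_box(&format!("w-{i}"), &opts))
        .collect()
}
```

The point of the sketch is that the source box keeps its current state while every worker starts from the snapshot's state.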

Types

pub struct SnapshotInfo {
    pub name: String,
    pub created_at: DateTime<Utc>,
    pub size_bytes: u64,              // Total (both disks)
    pub guest_disk_bytes: u64,
    pub container_disk_bytes: u64,
}

pub struct ArchiveManifest {
    pub version: u32,
    pub box_name: Option<String>,
    pub image: String,
    pub guest_disk_checksum: String,  // "sha256:..."
    pub container_disk_checksum: String,
    pub exported_at: DateTime<Utc>,
}
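Since `size_bytes` is documented as the total of both disks, a constructor can keep the three size fields consistent. The sketch below is one possible shape; `created_at_unix_secs` is a stand-in for `DateTime<Utc>` so the example needs no external crate, and `SnapshotInfo::new` is a hypothetical helper.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct SnapshotInfo {
    pub name: String,
    pub created_at_unix_secs: i64, // stand-in for DateTime<Utc>
    pub size_bytes: u64,           // total (both disks)
    pub guest_disk_bytes: u64,
    pub container_disk_bytes: u64,
}

impl SnapshotInfo {
    /// Derives the total from the two per-disk sizes.
    /// Returns None if the sum would overflow u64.
    pub fn new(
        name: &str,
        created_at_unix_secs: i64,
        guest_disk_bytes: u64,
        container_disk_bytes: u64,
    ) -> Option<Self> {
        let size_bytes = guest_disk_bytes.checked_add(container_disk_bytes)?;
        Some(SnapshotInfo {
            name: name.to_string(),
            created_at_unix_secs,
            size_bytes,
            guest_disk_bytes,
            container_disk_bytes,
        })
    }
}
```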

Key Design Decisions

  1. Two disks captured — Both guest-rootfs.qcow2 and disk.qcow2 (entire filesystem, @IANTHEREAL Q2)
  2. Disk-only state — No memory/CPU state, clean boot on restore
  3. External COW snapshots — Separate files per snapshot in snapshots/{name}/, not QCOW2 internal snapshots
  4. COW clone by default — Uses existing create_cow_child_disk(), no qemu-img dependency
  5. remove not delete — Consistent with existing runtime.remove()
  6. Naming reviewed against BoxLite conventions, RocksDB, containerd, libgit2, PostgreSQL, Kubernetes

Disk Layout

~/.boxlite/boxes/{box_id}/
├── guest-rootfs.qcow2          # Guest VM disk
├── disk.qcow2                  # Container disk
└── snapshots/{name}/
    ├── guest-rootfs.qcow2      # QCOW2 COW snapshot
    └── disk.qcow2              # QCOW2 COW snapshot
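The layout above maps onto a few path helpers. In this sketch the constant names, the `BoxLayout` struct, and the snapshot-name validation are assumptions for illustration; only `guest_rootfs_disk_path()` and the file names themselves come from this issue and PR #217.

```rust
use std::path::{Path, PathBuf};

// Hypothetical constant names; the file names are the ones in the layout.
pub const GUEST_ROOTFS_DISK: &str = "guest-rootfs.qcow2";
pub const CONTAINER_DISK: &str = "disk.qcow2";
pub const SNAPSHOTS_DIR: &str = "snapshots";

pub struct BoxLayout {
    box_dir: PathBuf,
}

impl BoxLayout {
    /// `boxes_root` would be something like ~/.boxlite/boxes (already expanded).
    pub fn new(boxes_root: &Path, box_id: &str) -> Self {
        BoxLayout { box_dir: boxes_root.join(box_id) }
    }

    pub fn guest_rootfs_disk_path(&self) -> PathBuf {
        self.box_dir.join(GUEST_ROOTFS_DISK)
    }

    pub fn container_disk_path(&self) -> PathBuf {
        self.box_dir.join(CONTAINER_DISK)
    }

    /// snapshots/{name}/ for a given snapshot. Names that could escape the
    /// snapshots directory are rejected (an assumed rule, not from the issue).
    pub fn snapshot_dir(&self, name: &str) -> Option<PathBuf> {
        let invalid = name.is_empty()
            || name == "."
            || name == ".."
            || name.contains('/')
            || name.contains('\\');
        if invalid {
            return None;
        }
        Some(self.box_dir.join(SNAPSHOTS_DIR).join(name))
    }

    /// (guest disk, container disk) paths inside a snapshot directory.
    pub fn snapshot_disks(&self, name: &str) -> Option<(PathBuf, PathBuf)> {
        let dir = self.snapshot_dir(name)?;
        Some((dir.join(GUEST_ROOTFS_DISK), dir.join(CONTAINER_DISK)))
    }
}
```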

Archive (.boxsnap):

archive.boxsnap (tar.zst)
├── manifest.json
├── guest-rootfs.qcow2          # Flattened (standalone)
└── disk.qcow2                  # Flattened (standalone)

Database Schema

CREATE TABLE IF NOT EXISTS box_snapshot (
    id TEXT PRIMARY KEY NOT NULL,
    box_id TEXT NOT NULL,
    name TEXT NOT NULL,
    created_at INTEGER NOT NULL,
    snapshot_dir TEXT NOT NULL,
    guest_disk_size_bytes INTEGER NOT NULL,
    container_disk_size_bytes INTEGER NOT NULL,
    size_bytes INTEGER NOT NULL DEFAULT 0,
    FOREIGN KEY (box_id) REFERENCES box_config(id) ON DELETE CASCADE,
    UNIQUE(box_id, name)
);

Thread Safety

pub enum BoxStatus {
    Stopped, Running, Snapshotting, Restoring, Exporting,
}

| Operation | Allowed From | Blocks Others |
|---|---|---|
| snapshot().create() | Stopped, Running | Yes |
| snapshot().list/get() | Any | No |
| snapshot().restore() | Stopped only | Yes |
| snapshot().remove() | Stopped only | Yes |
| export() | Stopped, Running | Yes |
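
The rules in the table can be written down as a small permission check. This is only a sketch: the `SnapshotOp` enum and the `allowed_from` / `blocks_others` method names are hypothetical, and the real implementation would also need to perform the status transition atomically.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BoxStatus {
    Stopped,
    Running,
    Snapshotting,
    Restoring,
    Exporting,
}

/// Hypothetical enumeration of the operations listed in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SnapshotOp {
    Create,
    List,
    Get,
    Restore,
    Remove,
    Export,
}

impl SnapshotOp {
    /// Whether the operation may start while the box is in `status`.
    pub fn allowed_from(self, status: BoxStatus) -> bool {
        match self {
            // create/export: Stopped or Running
            SnapshotOp::Create | SnapshotOp::Export => {
                matches!(status, BoxStatus::Stopped | BoxStatus::Running)
            }
            // list/get: read-only, allowed from any status
            SnapshotOp::List | SnapshotOp::Get => true,
            // restore/remove: Stopped only
            SnapshotOp::Restore | SnapshotOp::Remove => status == BoxStatus::Stopped,
        }
    }

    /// Whether other operations must wait while this one runs.
    pub fn blocks_others(self) -> bool {
        !matches!(self, SnapshotOp::List | SnapshotOp::Get)
    }
}
```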

Python Usage

from boxlite import Boxlite, BoxOptions, SnapshotOptions, ExportOptions, CloneOptions
import asyncio

async def main():
    async with Boxlite.default() as rt:
        box = await rt.create(BoxOptions(image='alpine'), name='test')

        # ── Snapshot CRUD ──
        await box.snapshot.create("v1")
        await box.snapshot.create("v2", SnapshotOptions(quiesce_timeout_secs=60))
        snaps = await box.snapshot.list()
        info  = await box.snapshot.get("v1")
        await box.snapshot.restore("v1")
        await box.snapshot.remove("v2")

        # ── Export / Import ──
        archive = await box.export("/tmp/backup.boxsnap")
        new_box = await rt.import_box("/tmp/backup.boxsnap", "restored")

        # ── Clone (current state) ──
        clone = await box.clone("my-clone")

        # ── Fan-out from snapshot (sub-second!) ──
        opts = CloneOptions(from_snapshot="v1")
        workers = await asyncio.gather(*[
            box.clone(f"worker-{i}", opts) for i in range(10)
        ])

asyncio.run(main())

Implementation Steps

| Priority | Module | Files |
|---|---|---|
| 1 | Types, Options, Handle | boxlite/src/snapshot/{mod,handle,options,types}.rs |
| 2 | Database & Disk | boxlite/src/db/{schema,snapshots}.rs, disk/qcow2.rs |
| 3 | LiteBox & Runtime | litebox/box_impl.rs, runtime/core.rs |
| 4 | Guest Quiesce | guest/src/quiesce.rs (FIFREEZE/FITHAW) |
| 5 | Python SDK | sdks/python/src/{snapshot,box_handle,runtime,options}.rs |

Reviewer Questions Addressed

| Question | Answer |
|---|---|
| @shayne-snap: Running box? | Quiesce (FIFREEZE), falls back to stop |
| @shayne-snap: After restore? | Box stays stopped, user calls start() |
| @shayne-snap: Delete cascade? | Yes, ON DELETE CASCADE |
| @IANTHEREAL: Clone from snapshot? | CloneOptions(from_snapshot="v1") |
| @IANTHEREAL: Snapshot boundary? | Entire filesystem (both disks) |
| @IANTHEREAL: Independent lifecycle? | Future enhancement (CASCADE for now) |
| @uran0sH: Naming confusion? | Sub-resource: box.snapshot().create/list/get/remove/restore |

Design Deviations from PR #217

PR #217 by @joeyaflores is a great first implementation. Here's what to adopt and what needs rework:

✅ Adopt from PR #217

  • Disk filename constants (disk/constants.rs)
  • Auto-migration support (v4→v5)
  • resolve_stopped_box() helper
  • guest_rootfs_disk_path() on layout
  • DB test patterns

🔧 Needs Rework

| PR #217 | Target Design | Why |
|---|---|---|
| All methods on BoxliteRuntime | Methods on LiteBox + sub-resource | Box handle already has context; avoids string-based lookup |
| Flat API (snapshot(), list_snapshots()) | Sub-resource box.snapshot().create/list/... | Resolves naming confusion, clean CRUD |
| No Options types | SnapshotOptions, ExportOptions, CloneOptions | Extensible without breaking changes |
| Full copy only (qemu-img convert) | COW clone default (create_cow_child_disk()) | 1000x perf: ~1ms vs ~seconds per clone |
| External qemu-img dependency | In-process QCOW2 operations | No new dependencies |
| QCOW2 internal snapshots | External COW files in snapshots/{name}/ | Better for clone-from-snapshot, size tracking |
| duplicate() | clone() | Industry standard; LiteBox doesn't impl Rust Clone |
| delete_snapshot() | remove() | Consistent with runtime.remove() |
| SnapshotRecord | SnapshotInfo | Consistent with BoxInfo pattern |
| snapshots table | box_snapshot table | Consistent with box_config, box_state |
| .boxlite extension | .boxsnap extension | Avoids conflict with project name |
| No compression | tar.zst with configurable level | Smaller archives |
| No checksums | SHA-256 checksums in manifest | Integrity verification |
