Skip to content

feat: Firecracker microVM runtime backend#16

Merged
rohitg00 merged 11 commits intomainfrom
feat/firecracker-runtime
Mar 11, 2026
Merged

feat: Firecracker microVM runtime backend#16
rohitg00 merged 11 commits intomainfrom
feat/firecracker-runtime

Conversation

@rohitg00
Copy link
Contributor

@rohitg00 rohitg00 commented Mar 11, 2026

Summary

  • Implements complete Firecracker microVM isolation backend behind #[cfg(feature = "firecracker")] feature gate
  • Adds lightweight guest agent binary (packages/guest-agent/) that runs inside each VM for host↔guest communication via VSOCK
  • Full SandboxRuntime trait implementation — all 29 async methods, enabling drop-in replacement for Docker runtime

Architecture

Host                          │  Firecracker VM
─────────────────────────────│──────────────────────
Worker (Rust)                 │  Linux Kernel
  ├─ fc_api (Unix socket) ───│→ Firecracker REST API
  ├─ fc_agent (VSOCK:52) ────│→ Guest Agent (:8052)
  ├─ fc_network (TAP+NAT)    │  /usr/local/bin/iii-guest-agent
  └─ fc_rootfs (OCI→ext4)    │  ext4 rootfs from OCI image

New Files (2,840 lines, 55 tests)

File Lines Tests Purpose
fc_types.rs 435 14 Firecracker API + guest agent serde types
fc_api.rs 262 8 Unix socket HTTP client for FC REST API
fc_agent.rs 302 4 VSOCK client for guest agent communication
fc_network.rs 222 5 TAP devices, subnet allocator, iptables NAT
fc_rootfs.rs 370 3 OCI image → ext4 rootfs conversion pipeline
firecracker.rs 768 10 Full SandboxRuntime impl (29 methods)
guest-agent/main.rs 596 11 Lightweight HTTP server for inside VMs

Key Design Decisions

  • VSOCK over SSH: No SSH daemon needed inside VMs — lighter, faster, more secure
  • Feature-gated: All FC code behind #[cfg(feature = "firecracker")] — zero impact on default Docker builds
  • Native snapshots: Uses Firecracker's built-in snapshot API for commit_sandbox
  • 253 VMs/host: Each VM gets a /30 subnet (172.16.{1-253}.0/30) with dedicated TAP interface
  • Minimal guest agent: std-only Rust binary (no async runtime) — serde + base64 deps only, optimized for size (LTO + strip)

Test plan

  • cargo clippy --features firecracker passes clean
  • cargo test --features firecracker — 55 new unit tests pass
  • cargo clippy (without feature) — no regressions
  • Guest agent: cargo clippy and cargo test pass
  • Integration test on Linux host with KVM + Firecracker binary
  • End-to-end: create sandbox → exec command → read/write files → stop → remove

Summary by CodeRabbit

  • New Features

    • Full Firecracker microVM backend: VM lifecycle, snapshots, networking (TAP/NAT, subnet allocator), metrics, and per-VM guest agent with terminal sessions, command execution, file I/O, and process listing.
    • OCI image → ext4 rootfs conversion with injected init for boots.
  • Chores

    • CI/release updated to build and publish guest-agent binaries and include them in images; container build includes guest-agent.
  • Tests

    • Extensive unit and integration tests and new end-to-end test script.
  • Docs

    • README and competitive analysis expanded to describe Firecracker backend and benchmarks.

Add complete Firecracker isolation backend behind #[cfg(feature = "firecracker")] feature gate:

- fc_types: Firecracker API serde types + guest agent request/response types (14 tests)
- fc_api: Unix socket HTTP client for Firecracker REST API (8 tests)
- fc_agent: VSOCK-based guest agent client for exec/file/stats operations (4 tests)
- fc_network: TAP device management, subnet allocator (/30 per VM), iptables NAT (5 tests)
- fc_rootfs: OCI image → ext4 rootfs conversion pipeline via bollard (3 tests)
- firecracker: Full SandboxRuntime trait impl — all 29 methods (10 tests)
- guest-agent: Lightweight HTTP server for inside VMs — std-only, thread-per-connection (11 tests)

Architecture: Each VM gets a dedicated CID (VSOCK), TAP interface, /30 subnet,
and runs the guest agent on port 8052 for host↔guest communication.
@coderabbitai
Copy link

coderabbitai bot commented Mar 11, 2026

Warning

Rate limit exceeded

@rohitg00 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 8 minutes and 54 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 393fc272-7292-4e65-bf41-41976f9a3148

📥 Commits

Reviewing files that changed from the base of the PR and between 6d35ff2 and 76468c8.

📒 Files selected for processing (9)
  • .github/workflows/ci.yml
  • README.md
  • docs/api-reference.md
  • docs/architecture.md
  • packages/guest-agent/src/main.rs
  • packages/worker/src/runtime/fc_agent.rs
  • packages/worker/src/runtime/fc_network.rs
  • packages/worker/src/runtime/fc_rootfs.rs
  • packages/worker/src/runtime/firecracker.rs
📝 Walkthrough

Walkthrough

Adds a Rust guest-agent binary and a full Firecracker backend: new clients, types, networking, rootfs tooling, init injection, terminal integration, CI/release changes, Docker packaging updates, helper scripts, and extensive tests enabling VM lifecycle and agent-backed sandbox operations.

Changes

Cohort / File(s) Summary
Guest Agent
packages/guest-agent/Cargo.toml, packages/guest-agent/src/main.rs
New iii-guest-agent crate: TCP HTTP-like JSON API for health, exec, file ops, stats, processes, and PTY terminal management; includes unit tests and release profile.
Firecracker Runtime Core
packages/worker/src/runtime/firecracker.rs, packages/worker/src/main.rs, packages/worker/src/runtime/mod.rs
Replaces stub runtime with feature-complete FirecrackerRuntime accepting FcConfig + Docker handle; VM state tracking, CID/subnet allocation, TAP/NAT orchestration, lifecycle actions, snapshot/commit flow, and integration with FcClient/AgentClient.
Types & Config
packages/worker/src/runtime/fc_types.rs
New public Firecracker and agent data models (FcConfig, BootSource, Drive, MachineConfig, Vsock, VmInstance, agent request/response/terminal types) with serde, Default/Display impls and tests.
Firecracker API Client
packages/worker/src/runtime/fc_api.rs
New FcClient for Unix-socket HTTP-like communication with Firecracker: raw request/response handling, status parsing, and typed API helpers (boot, machine, drives, network, vsock, control, snapshots).
Guest Agent Client
packages/worker/src/runtime/fc_agent.rs
New AgentClient communicating with guest-agent over UDS/vsock path: generic request sender and async methods for exec, detached exec, file I/O, list/search/info, stats, processes, health, and terminal operations.
Networking
packages/worker/src/runtime/fc_network.rs
SubnetAllocator (per-VM /30), MAC generator, TAP device creation/deletion, NAT/DNAT port-forward helpers, run_cmd helper and unit tests covering allocation, exhaustion, and IP/MAC logic.
Rootfs & Init
packages/worker/src/runtime/fc_rootfs.rs, packages/worker/src/runtime/fc_init.rs, scripts/build-init.sh
OCI image → ext4 rootfs pipeline with caching, layer extraction/merge, ext4 image creation; init injection helper and script to embed minimal init that starts guest agent and configures runtime dirs.
Terminal Integration
packages/worker/src/functions/terminal.rs
Runtime-conditional terminals: Docker or Firecracker-backed PTYs, backend-session routing, read/write/resize/close flows, and tests.
Worker Tests & Helpers
packages/worker/tests/*, packages/worker/tests/common/mod.rs, packages/worker/tests/firecracker_integration.rs
Extensive unit and guarded integration tests for allocator, MAC/name logic, FC types, and end-to-end Firecracker flows; test helpers and skip macro added.
Build & Release
.github/workflows/ci.yml, .github/workflows/release.yml, packages/worker/Dockerfile
CI job for guest-agent (clippy, musl build, artifact upload), release matrix for multiple arch artifacts, release artifact preparation/publish steps, and Dockerfile copying per-arch guest-agent binary.
Utility Scripts & Docs
scripts/fetch-kernel.sh, scripts/test-e2e.sh, README.md, docs/competitive-analysis.md
Kernel fetcher with checksum support, e2e test driver script, README updated with Firecracker backend docs and test counts, and new competitive analysis doc.

Sequence Diagram(s)

sequenceDiagram
    participant Worker
    participant Docker
    participant Network
    participant Firecracker
    participant GuestAgent

    Worker->>Docker: ensure_image(image)
    Docker-->>Worker: prepared rootfs

    Worker->>Network: allocate_cid() / allocate subnet
    Network-->>Worker: guest_cid, SubnetInfo

    Worker->>Network: create_tap_device(host_ip)
    Worker->>Network: setup_nat(guest_subnet)

    Worker->>Firecracker: start_firecracker_process(vm_id)
    Firecracker-->>Worker: socket ready

    Worker->>Firecracker: configure VM (boot, machine, drives, vsock)
    Firecracker-->>Worker: configured

    Worker->>Firecracker: start_instance()
    Firecracker-->>Worker: running

    Worker->>GuestAgent: health_check() / exec() / file ops / terminal
    GuestAgent-->>Worker: JSON responses / stream data
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 I hopped in with a config and a plan,
spun TAPs and subnets with careful hand,
stitched layers to ext4, set a tiny init,
chatted with guests through sockets and a writ,
carrots, PTYs, and binaries — sandbox is grand!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 49.83% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'feat: Firecracker microVM runtime backend' directly and clearly summarizes the primary change: implementation of a complete Firecracker microVM runtime backend as an alternative to Docker, which is the core objective of this substantial changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/firecracker-runtime

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 17

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/guest-agent/src/main.rs`:
- Around line 118-137: When handling req.detached in the block that calls
cmd.spawn(), avoid dropping the Child handle immediately (which causes zombies)
by spawning a background thread that takes ownership of the Child returned from
cmd.spawn() and calls child.wait() (or child.wait_with_output()/handle Result)
to reap the process and optionally log the exit status or error; keep the
existing behavior of returning the PID to the client after sending the response.
Modify the code around cmd.spawn() and the Ok(child) arm so it clones/reads
child.id() for the PID, sends the response, then moves the child into a
std::thread::spawn closure that calls child.wait() and logs the result (or
ignores the result after reaping).
- Around line 9-21: The guest-agent currently binds a TCP listener in main
(TcpListener::bind) and prints "iii-guest-agent listening on port {port}", but
the runtime connects via the Firecracker vsock/UDS path used in
packages/worker/src/runtime/fc_agent.rs; replace the TCP listener with a
vsock/Unix domain socket listener: change the binding logic in main to open the
same UDS/vsock socket path the worker expects (use UnixListener::bind or the
vsock crate equivalent instead of TcpListener::bind), read the socket path from
the same env/format used by fc_agent (the ..._52 path) and update the startup
log to reflect the socket path instead of listening on 0.0.0.0:{port} so the
runtime can reach the guest agent.

In `@packages/worker/src/main.rs`:
- Around line 47-55: The code is silently reinterpreting
config.firecracker_rootfs as a directory when building FcConfig
(rootfs_cache_dir) which can treat file paths as directories; update the config
shape and usage so the value is explicit and not misused: add distinct fields in
your configuration (e.g., firecracker_kernel_path and
firecracker_rootfs_cache_dir or similar) and use those when constructing
FcConfig in main.rs (replace uses of config.firecracker_rootfs with the new
firecracker_rootfs_cache_dir and ensure kernel_path is taken from the explicit
kernel config), or if you must keep the old single field, detect whether
config.firecracker_rootfs is a file and map it correctly (use parent dir for
cache vs. file path for rootfs) before assigning to FcConfig.rootfs_cache_dir;
update any references in FcConfig and runtime::fc_types::FcConfig to match the
new fields so defaults in fc_types.rs are not silently relied upon.

In `@packages/worker/src/runtime/fc_agent.rs`:
- Around line 25-67: send_request currently hardcodes "POST" and extract_body
drops the HTTP status, so GET routes like health_check never reach the guest and
non-2xx JSON bodies are treated as success; change send_request(signature) to
accept an HTTP method parameter (e.g., method: &str) and use it when formatting
the request_line instead of the hardcoded "POST" (refer to send_request and
request_line), update callers such as health_check to pass "GET", and modify
extract_body (or add a small response parser in send_request) to parse the
response status line, return or check the numeric status code, and treat non-2xx
responses as Err (including the response body or JSON error text in the error
message) before attempting serde_json::from_str.

In `@packages/worker/src/runtime/fc_api.rs`:
- Around line 40-85: The raw_request function lacks timeouts for socket
operations; wrap the UnixStream::connect, stream.write_all, stream.shutdown, and
stream.read_to_end calls in timed operations (e.g., tokio::time::timeout) inside
raw_request so each I/O has a bounded duration and returns a clear error on
timeout; pick sensible durations (or use a configurable timeout value) and map
timeout errors to a descriptive Err string (e.g., "Timed out
connecting/writing/reading/shutting down Firecracker socket") while preserving
existing map_err behavior for underlying I/O failures.

In `@packages/worker/src/runtime/fc_network.rs`:
- Around line 19-37: The allocator currently monotically increments next_subnet
in allocate() and release() only removes the mapping, so freed subnets are never
reused; change allocate() and release() to maintain a free pool: in allocate(),
first check if vm_id already exists in allocated and return its SubnetInfo;
otherwise if a free list (e.g., VecDeque free_subnets) is non-empty pop from it
to assign, else use next_subnet and increment; in release(), look up and remove
the vm_id from allocated and push the freed subnet_id onto the free_subnets
queue so it can be recycled; update struct fields to include free_subnets and
ensure you only push when the vm_id was actually allocated (reference allocate,
release, next_subnet, and allocated).
- Around line 74-117: The NAT rules hard-code the host uplink as "eth0" in
setup_nat and teardown_nat, causing failures on hosts with different interface
names; update the functions (setup_nat and teardown_nat) to accept an uplink
interface parameter (e.g., uplink_iface: &str) or read it from
configuration/environment and use that variable in all iptables invocations
replacing the literal "eth0", and update all callers to pass the correct uplink
interface (or load it from config) so the POSTROUTING and FORWARD rules match
the actual host egress device.

In `@packages/worker/src/runtime/fc_rootfs.rs`:
- Around line 29-37: The export pipeline is wrong because
export_image_to_tar(docker, image, &tar_path) produces a flattened container
filesystem, so after extract_tar(&tar_path, &extract_dir) there are no OCI
layers or manifest to merge; remove or skip the subsequent
extract_layers(&extract_dir) and any manifest/*.tar merging logic that expects
OCI layers (code referencing extract_layers, manifest.json processing, and
merged_dir) and instead treat extract_dir as the final rootfs (or change
export_image_to_tar to export OCI-style image layers if you intend to keep the
merge pipeline); update all code paths that reference merged_dir to use
extract_dir as the final rootfs when using export_image_to_tar.
- Around line 46-56: The code currently skips copying the guest agent when
agent_path doesn't exist, producing a broken VM; change the logic in
fc_rootfs.rs so that if agent_path.exists() is false you return an Err (fail the
build) with a clear message instead of silently continuing. Update the block
around agent_path/agent_dest (the copy using fs::create_dir_all and fs::copy
that writes usr/local/bin/iii-guest-agent) to error out when the source is
missing, and apply the same fix to the other occurrence noted (the similar block
at the 216-223 region) so both code paths fail-fast when the guest agent binary
is absent.
- Around line 13-18: The current logic uses a deterministic intermediate path
(rootfs_path and the shared _mount_tmp used by create_ext4) so concurrent runs
can clobber each other and leave half-written artifacts; change the build to
create a unique temporary workspace (e.g., use a UUID/PID-based temp dir or
tempfile::Builder) for the mount/build steps instead of _mount_tmp, build the
image there, write the ext4 to a temp file (e.g., <safe_name>.ext4.tmp), and
atomically publish by renaming the temp file to rootfs_path only after the build
completes successfully; update the code paths that call create_ext4 and the
existence checks (references: safe_name, rootfs_path, create_ext4, _mount_tmp)
to use the unique temp location and atomic rename to prevent races and
half-written rootfs (also apply same fix to the other blocks noted around lines
28-43 and 256-282).
- Around line 272-293: The code currently treats any non-zero cp exit as success
and may cache a partially populated rootfs; update the match on cp_result in
fc_rootfs.rs so that when cp returns Ok(o) but o.status is not success you
return an Err with the cp stderr (instead of Ok(())); keep the existing warning
log (tracing::warn!) if desired, but change the branch to propagate an error
(e.g., Err(format!("cp failed: {}", stderr))) so the caller won’t persist an
incomplete image; reference the cp_result variable and the Command::new("sh")
block that invokes cp, and ensure umount and fs::remove_dir still run as cleanup
before returning the Err.

In `@packages/worker/src/runtime/fc_types.rs`:
- Around line 199-213: The file-copy contract currently models file content as
UTF-8 Strings which breaks binary files; change the content fields in
AgentFileWriteRequest and AgentFileReadResponse from String to a binary-safe
type (Vec<u8>) and annotate with serde_bytes (e.g. #[serde(with =
"serde_bytes")]) so Serialize/Deserialize remain efficient; keep
AgentFileReadRequest.path, AgentFileWriteRequest.path and mode as-is and leave
size as u64, and update any consumers of copy_to_sandbox / copy_from_sandbox to
accept/produce Vec<u8> instead of String.

In `@packages/worker/src/runtime/firecracker.rs`:
- Around line 265-279: The Sandbox never records or exposes active port forwards
because VmInstance is constructed with port_mappings: HashMap::new() and nothing
populates it, so sandbox_port_bindings() always returns empty; fix by ensuring
port mappings are created and stored when ports are published/forwarded — update
the VM lifecycle code paths that create networking (e.g., where VmInstance is
created and where inbound forwarding is installed) to add entries into
VmInstance.port_mappings and persist them (mirror the structure expected by
sandbox_port_bindings()), and update the functions that install
NAT/iptables/tproxy forwarding to return the assigned host port and write that
into VmInstance.port_mappings (and any teardown paths must remove them). Target
symbols: VmInstance, port_mappings, sandbox_port_bindings(), and the VM
creation/networking routines that set up tap/veth and publish ports so active
bindings are discoverable by packages/worker/src/functions/port.rs.
- Around line 48-54: The socket_path and vsock_path functions (and other places
that build filesystem paths from VM identifiers at the mentioned locations)
currently join raw vm_id/name values directly into paths; fix by
validating/sanitizing identifiers before use: add a helper (e.g., sanitize_vm_id
or ensure_safe_basename) that rejects empty strings and any id containing path
separators, .. or other traversal characters (or enforce a strict whitelist
regex like alphanumeric, hyphen, underscore, dot), and use that helper in
socket_path, vsock_path and the rootfs/snapshot/volume path builders referenced
(lines ~238-239, ~525-526, ~628-636) so only a single safe basename is ever
joined into config.socket_dir or other managed dirs; return/propagate an error
when validation fails.
- Around line 230-263: The code currently leaks resources if any of
configure_vm(), client.start_instance(), or wait_for_agent() fail after creating
vm_rootfs, allocating a subnet and guest_cid, creating the TAP and NAT, and
spawning Firecracker; add a rollback/cleanup path that runs on error to undo
those steps: delete the copied vm_rootfs created by ensure_rootfs, release the
subnet back to subnet_allocator (use the same subnet value returned by
allocator.allocate), remove the TAP device via
super::fc_network::create_tap_device counterpart (e.g., delete_tap_device) and
undo NAT via super::fc_network::setup_nat counterpart (e.g., teardown_nat),
terminate the spawned Firecracker process started by start_firecracker_process
(using the returned pid or handle), and release guest_cid if applicable;
implement this by capturing the created resources (vm_rootfs path, subnet,
tap_name, pid, guest_cid) and invoking a cleanup async function or a drop/guard
on any Err from configure_vm, client.start_instance, or wait_for_agent so all
partially-created sandbox artifacts are removed on failure.
- Around line 525-576: The code currently writes every snapshot to
<snapshot_dir>/<vm_id>/ which causes overwrites and the inspect/remove helpers
reconstruct the path by splitting on '-' which breaks for hyphenated vm_ids; fix
commit_sandbox (where snapshot_id is created and snapshot files are written),
inspect_image_size, and remove_image to use a unique, immutable per-snapshot
path derived from the full snapshot id: (1) when creating the snapshot in
commit_sandbox, generate snapshot_id and create a dedicated directory for that
snapshot (e.g., snap_dir = self.config.snapshot_dir.join(snapshot_id_safe)) and
write snapshot.bin and mem.bin there instead of using vm_id; ensure
snapshot_id_safe encodes comment/vm_id safely (URL-encode or use a UUID suffix)
so it is a valid directory name; (2) in inspect_image_size and remove_image,
reconstruct the exact snapshot directory by taking
image_id.trim_start_matches("fc-snap-") (do NOT .split('-').next()) and using
that whole remainder as the directory name to locate snapshot.bin and mem.bin or
remove the directory; this preserves uniqueness and avoids truncating hyphenated
vm IDs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: af2a53c9-9085-491b-9e93-5d230099f6be

📥 Commits

Reviewing files that changed from the base of the PR and between 797a05e and a108595.

📒 Files selected for processing (10)
  • packages/guest-agent/Cargo.toml
  • packages/guest-agent/src/main.rs
  • packages/worker/src/main.rs
  • packages/worker/src/runtime/fc_agent.rs
  • packages/worker/src/runtime/fc_api.rs
  • packages/worker/src/runtime/fc_network.rs
  • packages/worker/src/runtime/fc_rootfs.rs
  • packages/worker/src/runtime/fc_types.rs
  • packages/worker/src/runtime/firecracker.rs
  • packages/worker/src/runtime/mod.rs

Comment on lines +9 to +21
fn main() {
let port = std::env::var("AGENT_PORT")
.unwrap_or_else(|_| "8052".to_string())
.parse::<u16>()
.unwrap_or(8052);

let listener = TcpListener::bind(format!("0.0.0.0:{port}"))
.unwrap_or_else(|e| {
eprintln!("Failed to bind on port {port}: {e}");
std::process::exit(1);
});

eprintln!("iii-guest-agent listening on port {port}");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Bind the agent to Firecracker VSOCK, not TCP.

packages/worker/src/runtime/fc_agent.rs:9-36 connects through the Firecracker vsock UDS path ..._52, but this server listens on 0.0.0.0:8052. With the current host/client pair, the runtime can never reach the guest agent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/guest-agent/src/main.rs` around lines 9 - 21, The guest-agent
currently binds a TCP listener in main (TcpListener::bind) and prints
"iii-guest-agent listening on port {port}", but the runtime connects via the
Firecracker vsock/UDS path used in packages/worker/src/runtime/fc_agent.rs;
replace the TCP listener with a vsock/Unix domain socket listener: change the
binding logic in main to open the same UDS/vsock socket path the worker expects
(use UnixListener::bind or the vsock crate equivalent instead of
TcpListener::bind), read the socket path from the same env/format used by
fc_agent (the ..._52 path) and update the startup log to reflect the socket path
instead of listening on 0.0.0.0:{port} so the runtime can reach the guest agent.

Comment on lines +47 to +55
use runtime::fc_types::FcConfig;
let dk = connect_docker();
let fc_config = FcConfig {
kernel_path: std::path::PathBuf::from(&config.firecracker_kernel),
rootfs_cache_dir: std::path::PathBuf::from(&config.firecracker_rootfs),
default_vcpus: config.firecracker_vcpus,
default_mem_mib: config.firecracker_mem_mib,
..Default::default()
};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't silently reinterpret firecracker_rootfs.

packages/worker/src/config.rs:1-45 still exposes a single firecracker_rootfs field, but this branch now uses it as rootfs_cache_dir and backfills the other FcConfig paths from hard-coded defaults in packages/worker/src/runtime/fc_types.rs:119-133. Existing deployments can end up treating a file path as a directory and looking for /usr/bin//opt//tmp assets that were never configured.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/main.rs` around lines 47 - 55, The code is silently
reinterpreting config.firecracker_rootfs as a directory when building FcConfig
(rootfs_cache_dir) which can treat file paths as directories; update the config
shape and usage so the value is explicit and not misused: add distinct fields in
your configuration (e.g., firecracker_kernel_path and
firecracker_rootfs_cache_dir or similar) and use those when constructing
FcConfig in main.rs (replace uses of config.firecracker_rootfs with the new
firecracker_rootfs_cache_dir and ensure kernel_path is taken from the explicit
kernel config), or if you must keep the old single field, detect whether
config.firecracker_rootfs is a file and map it correctly (use parent dir for
cache vs. file path for rootfs) before assigning to FcConfig.rootfs_cache_dir;
update any references in FcConfig and runtime::fc_types::FcConfig to match the
new fields so defaults in fc_types.rs are not silently relied upon.

Comment on lines +199 to +213
pub struct AgentFileWriteRequest {
pub path: String,
pub content: String,
pub mode: Option<u32>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AgentFileReadRequest {
pub path: String,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AgentFileReadResponse {
pub content: String,
pub size: u64,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Make the file-copy wire contract binary-safe.

These request/response types model file contents as String, but the runtime API that consumes them is byte-oriented (copy_to_sandbox/copy_from_sandbox). That means arbitrary files with non-UTF-8 bytes cannot be round-tripped reliably through this contract. Use an explicit binary representation here instead of plain text.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/fc_types.rs` around lines 199 - 213, The
file-copy contract currently models file content as UTF-8 Strings which breaks
binary files; change the content fields in AgentFileWriteRequest and
AgentFileReadResponse from String to a binary-safe type (Vec<u8>) and annotate
with serde_bytes (e.g. #[serde(with = "serde_bytes")]) so Serialize/Deserialize
remain efficient; keep AgentFileReadRequest.path, AgentFileWriteRequest.path and
mode as-is and leave size as u64, and update any consumers of copy_to_sandbox /
copy_from_sandbox to accept/produce Vec<u8> instead of String.

Comment on lines +265 to +279
let vm = VmInstance {
id: id.to_string(),
socket_path: self.socket_path(id),
rootfs_path: vm_rootfs,
pid: Some(pid),
guest_cid,
tap_name,
guest_ip: subnet.guest_ip.clone(),
host_ip: subnet.host_ip.clone(),
state: VmLifecycleState::Running,
vcpus,
mem_mib,
labels: HashMap::new(),
port_mappings: HashMap::new(),
};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

sandbox_port_bindings() can never report active mappings.

New VMs start with port_mappings: HashMap::new(), and nothing in this file ever populates that map or installs inbound forwarding for published ports. The method therefore always returns an empty map, while packages/worker/src/functions/port.rs:360-370 expects real bindings to surface active host ports.

Also applies to: 503-511

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/firecracker.rs` around lines 265 - 279, The
Sandbox never records or exposes active port forwards because VmInstance is
constructed with port_mappings: HashMap::new() and nothing populates it, so
sandbox_port_bindings() always returns empty; fix by ensuring port mappings are
created and stored when ports are published/forwarded — update the VM lifecycle
code paths that create networking (e.g., where VmInstance is created and where
inbound forwarding is installed) to add entries into VmInstance.port_mappings
and persist them (mirror the structure expected by sandbox_port_bindings()), and
update the functions that install NAT/iptables/tproxy forwarding to return the
assigned host port and write that into VmInstance.port_mappings (and any
teardown paths must remove them). Target symbols: VmInstance, port_mappings,
sandbox_port_bindings(), and the VM creation/networking routines that set up
tap/veth and publish ports so active bindings are discoverable by
packages/worker/src/functions/port.rs.

Comment on lines +525 to +576
let snap_dir = self.config.snapshot_dir.join(vm_id);
tokio::fs::create_dir_all(&snap_dir)
.await
.map_err(|e| format!("Failed to create snapshot dir: {e}"))?;

let snapshot_path = snap_dir.join("snapshot.bin");
let mem_path = snap_dir.join("mem.bin");

client
.create_snapshot(&SnapshotCreateParams {
snapshot_type: "Full".to_string(),
snapshot_path: snapshot_path.to_string_lossy().to_string(),
mem_file_path: mem_path.to_string_lossy().to_string(),
})
.await?;

client.resume_instance().await?;

let snapshot_id = format!("fc-snap-{}-{}", vm_id, comment.replace(' ', "-"));
tracing::info!(vm_id = %vm_id, snapshot_id = %snapshot_id, "VM snapshot created");

Ok(snapshot_id)
}

async fn inspect_image_size(&self, image_id: &str) -> Result<u64, String> {
if image_id.starts_with("fc-snap-") {
let snap_dir = self.config.snapshot_dir.join(
image_id.trim_start_matches("fc-snap-").split('-').next().unwrap_or("")
);
let snapshot_path = snap_dir.join("snapshot.bin");
let mem_path = snap_dir.join("mem.bin");

let mut total = 0u64;
if let Ok(meta) = tokio::fs::metadata(&snapshot_path).await {
total += meta.len();
}
if let Ok(meta) = tokio::fs::metadata(&mem_path).await {
total += meta.len();
}
return Ok(total);
}

async fn inspect_image_size(&self, _image_id: &str) -> Result<u64, String> {
Err("Firecracker runtime not yet implemented".to_string())
super::fc_rootfs::rootfs_size(&self.config.rootfs_cache_dir, image_id).await
}

async fn remove_image(&self, _image_id: &str) -> Result<(), String> {
Err("Firecracker runtime not yet implemented".to_string())
async fn remove_image(&self, image_id: &str) -> Result<(), String> {
if image_id.starts_with("fc-snap-") {
let snap_dir = self.config.snapshot_dir.join(
image_id.trim_start_matches("fc-snap-").split('-').next().unwrap_or("")
);
let _ = tokio::fs::remove_dir_all(&snap_dir).await;
return Ok(());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Snapshot IDs do not map to unique snapshot data.

commit_sandbox() always writes to <snapshot_dir>/<vm_id>/snapshot.bin and mem.bin, so a second commit overwrites the first even though it returns a different snapshot_id. Then inspect_image_size() and remove_image() recover the directory by splitting on -, which also truncates hyphenated VM IDs. Store each snapshot under its own immutable path and key lookup/removal by the full returned ID.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/firecracker.rs` around lines 525 - 576, The code
currently writes every snapshot to <snapshot_dir>/<vm_id>/ which causes
overwrites and the inspect/remove helpers reconstruct the path by splitting on
'-' which breaks for hyphenated vm_ids; fix commit_sandbox (where snapshot_id is
created and snapshot files are written), inspect_image_size, and remove_image to
use a unique, immutable per-snapshot path derived from the full snapshot id: (1)
when creating the snapshot in commit_sandbox, generate snapshot_id and create a
dedicated directory for that snapshot (e.g., snap_dir =
self.config.snapshot_dir.join(snapshot_id_safe)) and write snapshot.bin and
mem.bin there instead of using vm_id; ensure snapshot_id_safe encodes
comment/vm_id safely (URL-encode or use a UUID suffix) so it is a valid
directory name; (2) in inspect_image_size and remove_image, reconstruct the
exact snapshot directory by taking image_id.trim_start_matches("fc-snap-") (do
NOT .split('-').next()) and using that whole remainder as the directory name to
locate snapshot.bin and mem.bin or remove the directory; this preserves
uniqueness and avoids truncating hyphenated vm IDs.

Critical:
- Remove shell injection vectors in fc_rootfs.rs (replace sh -c with direct cp/tar args)
- Add 10MB Content-Length limit in guest agent to prevent memory exhaustion
- Fix PID 0 fallback in firecracker.rs (error instead of unwrap_or(0))
- Add SIGCHLD ignore in guest agent to prevent zombie processes

Important:
- Add socket timeouts (30s) to fc_api.rs raw_request
- Subnet allocator now reclaims released IDs via free list + idempotent allocation
- TAP device cleanup on partial creation failure
- Container cleanup on OCI export failure (deferred cleanup pattern)
- Cleanup rootfs/subnet/TAP/socket on create_sandbox failure
- Atomic rootfs creation via temp file + rename
- Propagate create_dir_all errors in extract_layers
- Propagate set_permissions errors in guest agent file write
- Validate HTTP status before parsing body in fc_agent.rs
- Error on malformed HTTP responses in fc_api.rs (was silent Ok(""))
- extract_status_code returns Result (was defaulting to 200)
- Snapshot ID uses __ delimiter (was - which conflicts with hyphenated vm_ids)

Tests: 385 pass (worker) + 11 pass (guest-agent), 2 new subnet reclamation tests
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/worker/src/runtime/firecracker.rs (1)

70-75: CID allocation is monotonic without recycling.

Unlike subnet allocation which now recycles IDs, allocate_cid() increments indefinitely. For long-running workers with many VM create/destroy cycles, consider recycling CIDs when VMs are removed, similar to the subnet allocator's free_list approach.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/firecracker.rs` around lines 70 - 75,
allocate_cid currently only increments next_cid (monotonic) and should be
changed to reuse freed IDs: add a free_list (e.g., VecDeque<u32> or HashSet)
alongside next_cid and next_cid.lock(), modify allocate_cid() to first check and
pop an ID from free_list before using and incrementing next_cid, add a public
async free_cid(&self, cid: u32) that pushes returned CIDs onto free_list, and
call free_cid from the VM teardown/remove path (where VMs are destroyed) so CIDs
are recycled similar to the subnet allocator’s free_list approach.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/worker/src/runtime/fc_agent.rs`:
- Line 9: AGENT_PORT is set to 52 but the guest agent listens on 8052; update
the constant AGENT_PORT in fc_agent.rs from 52 to 8052 (keep type u32) so the
runtime connects to the correct guest agent port.

---

Nitpick comments:
In `@packages/worker/src/runtime/firecracker.rs`:
- Around line 70-75: allocate_cid currently only increments next_cid (monotonic)
and should be changed to reuse freed IDs: add a free_list (e.g., VecDeque<u32>
or HashSet) alongside next_cid and next_cid.lock(), modify allocate_cid() to
first check and pop an ID from free_list before using and incrementing next_cid,
add a public async free_cid(&self, cid: u32) that pushes returned CIDs onto
free_list, and call free_cid from the VM teardown/remove path (where VMs are
destroyed) so CIDs are recycled similar to the subnet allocator’s free_list
approach.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a5433d90-5363-4e0e-8ad5-5108cc3f4a64

📥 Commits

Reviewing files that changed from the base of the PR and between a108595 and ea73ae2.

📒 Files selected for processing (7)
  • packages/guest-agent/Cargo.toml
  • packages/guest-agent/src/main.rs
  • packages/worker/src/runtime/fc_agent.rs
  • packages/worker/src/runtime/fc_api.rs
  • packages/worker/src/runtime/fc_network.rs
  • packages/worker/src/runtime/fc_rootfs.rs
  • packages/worker/src/runtime/firecracker.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/guest-agent/src/main.rs
  • packages/guest-agent/Cargo.toml

use super::fc_types::*;
use crate::types::{ExecResult, FileInfo, FileMetadata, SandboxMetrics};

const AGENT_PORT: u32 = 52;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the port used by the guest agent
rg -n "AGENT_PORT" packages/guest-agent/src/main.rs
rg -n '"8052"' packages/guest-agent/src/main.rs

Repository: iii-hq/sandbox

Length of output: 154


🏁 Script executed:

rg -n "AGENT_PORT" packages/worker/src/runtime/fc_agent.rs -A 3 -B 1

Repository: iii-hq/sandbox

Length of output: 684


🏁 Script executed:

# Check how vsock communication is set up
rg -n "vsock\|Vsock" packages/worker/src/runtime/fc_agent.rs -A 2 -B 2

Repository: iii-hq/sandbox

Length of output: 40


🏁 Script executed:

# Look for socket address/connection setup
rg -n "52|8052" packages/worker/src/runtime/fc_agent.rs

Repository: iii-hq/sandbox

Length of output: 87


🏁 Script executed:

rg -n "AGENT_PORT\|8052\|vsock" packages/guest-agent/src/main.rs -A 5 -B 2

Repository: iii-hq/sandbox

Length of output: 40


🏁 Script executed:

# Check if guest-agent source shows socket file naming
cat packages/guest-agent/src/main.rs | head -50

Repository: iii-hq/sandbox

Length of output: 1598


🏁 Script executed:

# Check if there are any comments or documentation about port selection
rg -n "52|8052" packages/worker/src/runtime/fc_agent.rs packages/guest-agent/src/main.rs -B 3 -A 1 --multiline

Repository: iii-hq/sandbox

Length of output: 856


🏁 Script executed:

# Search for any configuration files or documentation mentioning these ports
fd -t f "\.toml|\.md|\.txt|\.yaml|\.yml" | xargs rg -l "52|8052|AGENT_PORT" 2>/dev/null | head -20

Repository: iii-hq/sandbox

Length of output: 110


🏁 Script executed:

# Check if the scaling plan mentions port configuration
cat docs/plans/2026-03-10-scaling-plan.md | grep -A 10 -B 10 "52\|8052\|port\|AGENT_PORT"

Repository: iii-hq/sandbox

Length of output: 1124


🏁 Script executed:

# Check if there are any references linking fc_agent and guest-agent
rg -n "guest.*agent\|guest-agent" packages/worker/src/runtime/fc_agent.rs

Repository: iii-hq/sandbox

Length of output: 40


🏁 Script executed:

# Check if there's any documentation about the vsock communication setup
rg -n "vsock\|UDS\|unix.*socket" packages/worker/src/runtime/fc_agent.rs -B 2 -A 2

Repository: iii-hq/sandbox

Length of output: 40


🏁 Script executed:

# Look for any recent changes or comments about port 52 vs 8052
git log --oneline packages/worker/src/runtime/fc_agent.rs packages/guest-agent/src/main.rs 2>/dev/null | head -20

Repository: iii-hq/sandbox

Length of output: 122


🏁 Script executed:

# Check the recent fix commit
git show ea73ae2 --stat | head -50

Repository: iii-hq/sandbox

Length of output: 2835


🏁 Script executed:

# Look for the actual change related to ports in that commit
git show ea73ae2 packages/worker/src/runtime/fc_agent.rs | head -100

Repository: iii-hq/sandbox

Length of output: 4145


🏁 Script executed:

# Check how the vsock_uds_path is set up and if there's any mapping of ports
rg -n "vsock_uds_path\|AGENT_PORT" packages/worker/src/runtime/ -B 3 -A 3

Repository: iii-hq/sandbox

Length of output: 40


🏁 Script executed:

# Check if there's any firecracker configuration or setup that maps vsock ports
rg -n "52\|8052" packages/worker/src/runtime/ --type rust

Repository: iii-hq/sandbox

Length of output: 40


🏁 Script executed:

# Check how AgentClient is instantiated to understand the full picture
rg -n "AgentClient::new" packages/worker/src/runtime/

Repository: iii-hq/sandbox

Length of output: 324


🏁 Script executed:

# Check how AgentClient is instantiated with the vsock_uds_path
rg -n "AgentClient::new" packages/worker/src/runtime/firecracker.rs -B 5 -A 2

Repository: iii-hq/sandbox

Length of output: 669


🏁 Script executed:

# Look for vsock socket setup in firecracker
rg -n "vsock\|sock" packages/worker/src/runtime/firecracker.rs -B 2 -A 2 | head -100

Repository: iii-hq/sandbox

Length of output: 40


🏁 Script executed:

# Check vsock_path method to understand how the socket is set up
rg -n "vsock_path\|fn vsock" packages/worker/src/runtime/firecracker.rs -A 5

Repository: iii-hq/sandbox

Length of output: 40


🏁 Script executed:

# Look for any firecracker configuration that maps vsock ports to TCP
rg -n "8052\|guest.*port\|vsock.*port" packages/worker/src/runtime/firecracker.rs

Repository: iii-hq/sandbox

Length of output: 40


🏁 Script executed:

# Check if guest-agent is exposed via vsock somehow
rg -n "vsock\|guest.agent" packages/ --type rust | grep -i "port\|listen\|bind"

Repository: iii-hq/sandbox

Length of output: 40


🏁 Script executed:

# Check the full vsock_path method implementation
cat packages/worker/src/runtime/firecracker.rs | grep -A 20 "fn vsock_path"

Repository: iii-hq/sandbox

Length of output: 1595


🏁 Script executed:

# Look for any documentation or setup of vsock socket naming
rg -n "vsock\|uds" packages/worker/src/runtime/firecracker.rs -i

Repository: iii-hq/sandbox

Length of output: 40


🏁 Script executed:

# Check if there's a vsock proxy or bridge component
fd -t f "vsock\|proxy" packages/worker/src/runtime/

Repository: iii-hq/sandbox

Length of output: 40


🏁 Script executed:

# Search entire worker package for references to port 52 or the socket path construction
rg -n "format\|vsock_uds_path" packages/worker/src/ -A 1 | grep -B 1 -A 1 "52\|uds_path"

Repository: iii-hq/sandbox

Length of output: 40


🏁 Script executed:

# Search for vsock proxy or daemon components
fd -t f -i "proxy\|daemon\|bridge" packages/

Repository: iii-hq/sandbox

Length of output: 40


🏁 Script executed:

# Look for Dockerfile or scripts that might set up the guest agent
rg -n "AGENT_PORT\|8052\|guest-agent" images/ --type bash --type dockerfile

Repository: iii-hq/sandbox

Length of output: 86


🏁 Script executed:

# Check if there's a firecracker launch or vm start that might set up proxy
rg -n "vsock\|proxy\|listen" packages/worker/src/runtime/firecracker.rs -i

Repository: iii-hq/sandbox

Length of output: 40


🏁 Script executed:

# Search for any kernel module or helper process
rg -n "vsock\|vhost" packages/ --type rust -i

Repository: iii-hq/sandbox

Length of output: 40


Fix port mismatch with guest agent.

AGENT_PORT is hardcoded to 52, but the guest agent defaults to port 8052. Update the constant to align:

-const AGENT_PORT: u32 = 52;
+const AGENT_PORT: u32 = 8052;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const AGENT_PORT: u32 = 52;
const AGENT_PORT: u32 = 8052;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/fc_agent.rs` at line 9, AGENT_PORT is set to 52
but the guest agent listens on 8052; update the constant AGENT_PORT in
fc_agent.rs from 52 to 8052 (keep type u32) so the runtime connects to the
correct guest agent port.

- Guest agent: PTY terminal support (forkpty, 5 endpoints, 16 session limit)
- Terminal: runtime-agnostic — Docker exec or FC agent PTY via backend dispatch
- CI: guest-agent cross-compilation (x86_64/aarch64 musl), release artifacts
- Init system: kernel cmdline networking, graceful shutdown, auto-start agent
- Integration tests: 48 unit + 6 conditional E2E tests for FC backend
- Dockerfile: embeds guest-agent binary per TARGETARCH

460 tests passing (394 worker + 48 FC integration + 18 guest-agent)
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

♻️ Duplicate comments (9)
packages/worker/src/runtime/fc_rootfs.rs (4)

46-56: ⚠️ Potential issue | 🔴 Critical

Fail the rootfs build if the guest-agent binary is missing.

Skipping this copy publishes an ext4 that can never satisfy wait_for_agent() in packages/worker/src/runtime/firecracker.rs. That should abort before the cache entry is finalized.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/fc_rootfs.rs` around lines 46 - 56, The code
currently silently skips copying when agent_path doesn't exist; change it to
fail the rootfs build instead: in fc_rootfs.rs replace the conditional so that
if agent_path.exists() it copies as before, but otherwise return an Err
(propagate a formatted error like "Missing guest-agent binary at {agent_path}")
so the rootfs build aborts and prevents publishing an unusable image that can't
satisfy wait_for_agent() in firecracker.rs.

304-339: ⚠️ Potential issue | 🔴 Critical

Propagate non-zero cp exits from create_ext4().

A short copy here is still treated as success, so ENOSPC or permission failures can cache a partially populated ext4. Log the stderr if useful, but return an error after unmounting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/fc_rootfs.rs` around lines 304 - 339, The loop in
create_ext4 currently treats non-zero cp exits as warnings only and proceeds,
risking a partially populated ext4; change the handling in the match arm for
Ok(o) where !o.status.success() to set cp_error (including stderr text and the
src path) and break the loop so the function returns an error after the cleanup;
keep the existing unmount (Command::new("umount")) and fs::remove_dir calls
before returning, and reference cp_error, the Command::new("cp") invocation, and
the entries.next_entry() loop when making the change.

13-18: ⚠️ Potential issue | 🟠 Major

Same-image builds still race on shared temp paths.

The rootfs_path.exists() check does not serialize writers: concurrent ensure_rootfs() calls for the same image still share {safe_name}.tar, {safe_name}_extract, {safe_name}_merged, {safe_name}.ext4.tmp, and _mount_tmp. Two builders can still trample each other and publish a corrupted cache entry.

Also applies to: 28-44, 60-63, 288-291

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/fc_rootfs.rs` around lines 13 - 18, Concurrent
ensure_rootfs() callers race because the current exists() check doesn't
serialize writers; add a per-image exclusive lock (e.g., a lock file keyed by
safe_name) and acquire it before checking rootfs_path.exists() and before
creating/writing any of the temporary artifacts ({safe_name}.tar,
{safe_name}_extract, {safe_name}_merged, {safe_name}.ext4.tmp, _mount_tmp), then
release the lock after atomically renaming the temp .ext4.tmp to
{safe_name}.ext4; ensure the lock is held across the whole create/publish
sequence in ensure_rootfs() and that locks are released on error to avoid
deadlocks.

29-45: ⚠️ Potential issue | 🔴 Critical

The export/merge pipeline drops the extracted rootfs.

Docker/Bollard export_container() returns a flattened filesystem tar, not OCI layers. When manifest.json is absent, merge_layers() only looks for nested *.tar files, so merged_dir ends up empty instead of containing the extracted container contents.

Also applies to: 118-149, 166-170, 212-254

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/fc_rootfs.rs` around lines 29 - 45, The pipeline
assumes exported images contain OCI layer tars but Docker/Bollard export
produces a flattened rootfs, so after export_image_to_tar(...) ->
extract_tar(...)-> extract_layers(...) there may be no layer tars and
merge_layers(...) produces an empty merged_dir; change the flow so that after
extract_tar you detect whether any layer tars exist (update extract_layers to
return a list/count or add a helper like find_layer_tars), and if none are found
treat the extracted rootfs as the single layer by either copying/moving the
extracted_dir contents into merged_dir or invoking merge_layers with the
extracted_dir as the only source; update calls around extract_tar,
extract_layers, and merge_layers to handle this fallback for
export_container-style archives.
packages/worker/src/runtime/firecracker.rs (4)

77-113: ⚠️ Potential issue | 🟠 Major

Manage Firecracker children with a reapable handle, not optimistic state + raw PIDs.

After spawn() you drop tokio::process::Child, stop_sandbox() marks the VM stopped before TERM is observed, and remove_sandbox() only kills Running VMs. That combination can leave a setup failure, paused VM, or slow shutdown with a live/unreaped Firecracker process and no handle left to clean it up.

Also applies to: 219-233, 324-375

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/firecracker.rs` around lines 77 - 113,
start_firecracker_process currently spawns Firecracker and drops the
tokio::process::Child (using only the PID) which prevents proper reaping and
later cleanup; change this to return and persist a reapable Child handle (or
store it in the runtime's VM state) instead of relying on raw PIDs, ensure
spawn()'s Child is saved (e.g., in the struct that tracks VM state created by
start_firecracker_process), and update stop_sandbox and remove_sandbox to use
that stored Child handle to send TERM/kill and await its exit (or reap) rather
than marking state optimistically and killing by PID so slow shutdowns, start
failures, or paused VMs are reliably cleaned up.

228-241: ⚠️ Potential issue | 🟠 Major

sandbox_port_bindings() can never report active mappings.

New VMs always start with port_mappings: HashMap::new(), and nothing in this file ever records forwarded host ports. The getter later in the trait impl therefore only returns empty maps.

Also applies to: 533-541

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/firecracker.rs` around lines 228 - 241, The
sandbox_port_bindings getter is always empty because new VmInstance is
initialized with port_mappings: HashMap::new() and no code records forwarded
host ports; populate and persist that map when you create or modify port
forwards. Specifically, update the VmInstance construction (the VmInstance { ...
port_mappings: ... } block) to accept an initial mapping if available, then
instrument the code paths that set up host->guest port forwards (the function
that configures tap/nat/firewall rules) to insert entries into
VmInstance.port_mappings (and persist them alongside other VM state), and ensure
the trait impl method sandbox_port_bindings reads and returns this stored map
instead of always returning an empty HashMap.

48-54: ⚠️ Potential issue | 🔴 Critical

Validate IDs/names before joining them into host paths.

Raw values are later used for socket, rootfs, snapshot, and volume paths. An input containing / or .. can escape the managed directories and make create/remove operations touch arbitrary host files.

Also applies to: 293-295, 555-556, 658-667

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/firecracker.rs` around lines 48 - 54, The VM ID
and other name inputs are used to build host filesystem paths and must be
validated to prevent path traversal; add a shared validator (e.g., validate_id
or sanitize_vm_id) that rejects or normalizes any id containing path separators,
"." or ".." components, empty strings, or absolute paths and permits only a
narrow charset (e.g., alphanumerics, dashes, underscores, dots) and call it from
socket_path and vsock_path (and the other places noted: the
rootfs/snapshot/volume path builders around the other ranges) before joining
into PathBufs; return or propagate an error on invalid IDs rather than joining
them.

555-576: ⚠️ Potential issue | 🔴 Critical

Returned snapshot IDs still do not map to unique snapshot data.

Every commit for the same VM writes to <snapshot_dir>/<vm_id>/snapshot.bin and mem.bin, then inspect_image_size() / remove_image() recover that directory by trimming the comment back off the returned ID. A second snapshot overwrites the first even though it returns a new ID.

Also applies to: 579-605

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/firecracker.rs` around lines 555 - 576, The
current snapshot creation always writes to snap_dir.join(vm_id)/snapshot.bin and
mem.bin so later snapshots overwrite earlier ones even though snapshot_id is
unique; change create_snapshot (in the method that builds snapshot_path and
mem_path) to write each snapshot into a unique per-snapshot location (e.g.,
snap_dir.join(snapshot_id) or include a timestamp/uuid in the filenames) so the
on-disk paths incorporate snapshot_id instead of always using
snapshot.bin/mem.bin, and update any consumers (inspect_image_size() and
remove_image()) to use the full snapshot_id-based path (do not rely on trimming
the comment from snapshot_id) or persist a direct mapping from snapshot_id ->
filesystem path; update references to snapshot_path, mem_path, snapshot_id,
create_snapshot, inspect_image_size(), and remove_image() accordingly.
packages/worker/src/runtime/fc_types.rs (1)

198-213: ⚠️ Potential issue | 🟠 Major

Make the file-copy agent payloads binary-safe.

The runtime trait in packages/worker/src/runtime/mod.rs is byte-oriented, so String here still breaks non-UTF-8 round-trips. Use a binary-safe wire type for both request and response content fields.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/fc_types.rs` around lines 198 - 213,
AgentFileWriteRequest and AgentFileReadResponse use String for file payloads
which breaks non-UTF-8 round-trips; change the payload fields to a binary-safe
type. Update AgentFileWriteRequest.content and AgentFileReadResponse.content
from String to Vec<u8> (and optionally add #[serde(with = "serde_bytes")] on
those fields to preserve compact binary serialization), leaving path as String;
then update any code that constructs/consumes AgentFileWriteRequest or
AgentFileReadResponse to pass/expect raw bytes instead of UTF-8 strings. Ensure
the structs still derive Serialize/Deserialize after the change.
🧹 Nitpick comments (1)
packages/worker/tests/common/mod.rs (1)

75-86: Minor code duplication between fc_available() and fc_binary_path().

Both functions read the same environment variable with the same default. Consider extracting to reduce duplication:

♻️ Optional refactor
 pub fn fc_available() -> bool {
-    let fc_bin = std::env::var("FIRECRACKER_BIN")
-        .unwrap_or_else(|_| "/usr/bin/firecracker".to_string());
-    std::path::Path::new(&fc_bin).exists()
+    fc_binary_path().exists()
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/tests/common/mod.rs` around lines 75 - 86, Extract the
duplicated environment lookup into a single helper used by both functions:
create a function (e.g., fc_bin_from_env or fc_default_path) that reads
std::env::var("FIRECRACKER_BIN") with the "/usr/bin/firecracker" default and
returns a PathBuf or String, then update fc_available to call that helper and
check .exists() and update fc_binary_path to return the helper's PathBuf; update
signatures and imports as needed and ensure names fc_available and
fc_binary_path remain unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/guest-agent/src/main.rs`:
- Line 531: The call to libc::forkpty uses an unnecessary mutable reference for
the winsize parameter; change the third/fourth argument usage so that you pass a
shared reference (use &ws instead of &mut ws) and, if ws was declared as mutable
solely for this call, remove the mut from its binding; update the call site in
the forkpty invocation (master_fd, std::ptr::null_mut(), std::ptr::null_mut(),
ws) to use &ws and ensure the winsize variable and master_fd usage align with
that non-mutable reference.

In `@packages/worker/src/functions/terminal.rs`:
- Around line 284-304: The current Firecracker branch calls
agent.terminal_close(exec_id).await but proceeds to remove KV/session metadata
regardless of success; change it so you check the Result from
agent.terminal_close and only delete the session when terminal_close returns Ok.
Locate the Firecracker-specific block (the downcast via
rt.as_any().downcast_ref::<FirecrackerRuntime>(), vm_id computed from id,
exec_id extracted from session, and the call to fc_rt.agent_client(vm_id).await
-> agent.terminal_close(exec_id).await) and gate the subsequent session removal
on the terminal_close outcome (handle Err by logging and preserving the session
instead of deleting it).

In `@packages/worker/src/runtime/fc_agent.rs`:
- Around line 291-310: The AgentTerminalReadResponse deserialization fails when
the guest-agent omits eof; update the response struct to tolerate missing eof by
making the field optional/defaulted (e.g., change the eof field in
AgentTerminalReadResponse to have #[serde(default)] and keep it as bool or
Option<bool> with an explicit default false) so terminal_read (which calls
send_request("/terminal/read", ...)) can successfully decode responses that only
include data; alternatively, ensure the guest-agent actually includes "eof" in
its JSON responses if you prefer fixing the producer instead of the consumer.
- Around line 272-289: The terminal_write method is base64-encoding input before
sending, but the guest-agent (reads req.data.as_bytes()) expects plain text;
update terminal_write (and the AgentTerminalWriteRequest payload) to send the
raw text string instead of base64: convert the incoming data: &[u8] to a UTF-8
string (e.g., String::from_utf8 or from_utf8_lossy) and set req.data to that
plain string, then call send_request("/terminal/write", &req, DEFAULT_TIMEOUT)
as before so the guest-agent receives plain text matching its test expectations.

In `@packages/worker/src/runtime/fc_init.rs`:
- Around line 100-114: The PID 1 loop currently busy-waits because a plain wait
returns immediately when there are no children; update the shutdown/monitoring
loop to avoid spinning by waiting specifically for the guest agent and sleeping
when no children exist: after launching AGENT_BIN and setting AGENT_PID, use
targeted waits (e.g., wait "$AGENT_PID" or wait -n) and detect when wait returns
with no children, then sleep (e.g., sleep 1) before retrying; also handle agent
exit by logging and optionally restarting or breaking the loop. Modify the loop
that references AGENT_PID and the unconditional wait || true to perform
process-aware waiting and a short sleep when no child is present to prevent CPU
busy-looping.

In `@packages/worker/src/runtime/firecracker.rs`:
- Around line 552-571: After calling self.fc_client(vm_id).pause_instance(),
ensure the VM is always resumed even if snapshot creation fails: introduce a
scope guard or a try/finally-style cleanup that calls
client.resume_instance().await on all paths (success or error). Locate the
pause_instance()/create_snapshot()/resume_instance() sequence (symbols:
fc_client, pause_instance, create_snapshot, resume_instance,
SnapshotCreateParams, vm_id) and either wrap the snapshot logic in a block that
resumes in a finally/Err arm or implement a small Drop-based guard (e.g.,
ResumeOnDrop) that holds the client and invokes resume_instance().await if the
snapshot flow exits with an error, so the sandbox cannot remain paused.

In `@scripts/build-init.sh`:
- Around line 10-11: The script accepts --agent-path but never applies it to the
generated init script: capture the CLI arg into AGENT_PATH (or similar) and
substitute it into the heredoc instead of the hardcoded
AGENT_BIN="/usr/local/bin/iii-guest-agent" so the generated init uses the
provided path; update the printed help/usage lines (the printf for --agent-path)
to match the variable name and ensure any later references to AGENT_BIN or the
agent binary in the heredoc and after (e.g., where AGENT_BIN is defined or used)
use "${AGENT_PATH:-/usr/local/bin/iii-guest-agent}" to keep the default when not
provided.

---

Duplicate comments:
In `@packages/worker/src/runtime/fc_rootfs.rs`:
- Around line 46-56: The code currently silently skips copying when agent_path
doesn't exist; change it to fail the rootfs build instead: in fc_rootfs.rs
replace the conditional so that if agent_path.exists() it copies as before, but
otherwise return an Err (propagate a formatted error like "Missing guest-agent
binary at {agent_path}") so the rootfs build aborts and prevents publishing an
unusable image that can't satisfy wait_for_agent() in firecracker.rs.
- Around line 304-339: The loop in create_ext4 currently treats non-zero cp
exits as warnings only and proceeds, risking a partially populated ext4; change
the handling in the match arm for Ok(o) where !o.status.success() to set
cp_error (including stderr text and the src path) and break the loop so the
function returns an error after the cleanup; keep the existing unmount
(Command::new("umount")) and fs::remove_dir calls before returning, and
reference cp_error, the Command::new("cp") invocation, and the
entries.next_entry() loop when making the change.
- Around line 13-18: Concurrent ensure_rootfs() callers race because the current
exists() check doesn't serialize writers; add a per-image exclusive lock (e.g.,
a lock file keyed by safe_name) and acquire it before checking
rootfs_path.exists() and before creating/writing any of the temporary artifacts
({safe_name}.tar, {safe_name}_extract, {safe_name}_merged, {safe_name}.ext4.tmp,
_mount_tmp), then release the lock after atomically renaming the temp .ext4.tmp
to {safe_name}.ext4; ensure the lock is held across the whole create/publish
sequence in ensure_rootfs() and that locks are released on error to avoid
deadlocks.
- Around line 29-45: The pipeline assumes exported images contain OCI layer tars
but Docker/Bollard export produces a flattened rootfs, so after
export_image_to_tar(...) -> extract_tar(...)-> extract_layers(...) there may be
no layer tars and merge_layers(...) produces an empty merged_dir; change the
flow so that after extract_tar you detect whether any layer tars exist (update
extract_layers to return a list/count or add a helper like find_layer_tars), and
if none are found treat the extracted rootfs as the single layer by either
copying/moving the extracted_dir contents into merged_dir or invoking
merge_layers with the extracted_dir as the only source; update calls around
extract_tar, extract_layers, and merge_layers to handle this fallback for
export_container-style archives.

In `@packages/worker/src/runtime/fc_types.rs`:
- Around line 198-213: AgentFileWriteRequest and AgentFileReadResponse use
String for file payloads which breaks non-UTF-8 round-trips; change the payload
fields to a binary-safe type. Update AgentFileWriteRequest.content and
AgentFileReadResponse.content from String to Vec<u8> (and optionally add
#[serde(with = "serde_bytes")] on those fields to preserve compact binary
serialization), leaving path as String; then update any code that
constructs/consumes AgentFileWriteRequest or AgentFileReadResponse to
pass/expect raw bytes instead of UTF-8 strings. Ensure the structs still derive
Serialize/Deserialize after the change.

In `@packages/worker/src/runtime/firecracker.rs`:
- Around line 77-113: start_firecracker_process currently spawns Firecracker and
drops the tokio::process::Child (using only the PID) which prevents proper
reaping and later cleanup; change this to return and persist a reapable Child
handle (or store it in the runtime's VM state) instead of relying on raw PIDs,
ensure spawn()'s Child is saved (e.g., in the struct that tracks VM state
created by start_firecracker_process), and update stop_sandbox and
remove_sandbox to use that stored Child handle to send TERM/kill and await its
exit (or reap) rather than marking state optimistically and killing by PID so
slow shutdowns, start failures, or paused VMs are reliably cleaned up.
- Around line 228-241: The sandbox_port_bindings getter is always empty because
new VmInstance is initialized with port_mappings: HashMap::new() and no code
records forwarded host ports; populate and persist that map when you create or
modify port forwards. Specifically, update the VmInstance construction (the
VmInstance { ... port_mappings: ... } block) to accept an initial mapping if
available, then instrument the code paths that set up host->guest port forwards
(the function that configures tap/nat/firewall rules) to insert entries into
VmInstance.port_mappings (and persist them alongside other VM state), and ensure
the trait impl method sandbox_port_bindings reads and returns this stored map
instead of always returning an empty HashMap.
- Around line 48-54: The VM ID and other name inputs are used to build host
filesystem paths and must be validated to prevent path traversal; add a shared
validator (e.g., validate_id or sanitize_vm_id) that rejects or normalizes any
id containing path separators, "." or ".." components, empty strings, or
absolute paths and permits only a narrow charset (e.g., alphanumerics, dashes,
underscores, dots) and call it from socket_path and vsock_path (and the other
places noted: the rootfs/snapshot/volume path builders around the other ranges)
before joining into PathBufs; return or propagate an error on invalid IDs rather
than joining them.
- Around line 555-576: The current snapshot creation always writes to
snap_dir.join(vm_id)/snapshot.bin and mem.bin so later snapshots overwrite
earlier ones even though snapshot_id is unique; change create_snapshot (in the
method that builds snapshot_path and mem_path) to write each snapshot into a
unique per-snapshot location (e.g., snap_dir.join(snapshot_id) or include a
timestamp/uuid in the filenames) so the on-disk paths incorporate snapshot_id
instead of always using snapshot.bin/mem.bin, and update any consumers
(inspect_image_size() and remove_image()) to use the full snapshot_id-based path
(do not rely on trimming the comment from snapshot_id) or persist a direct
mapping from snapshot_id -> filesystem path; update references to snapshot_path,
mem_path, snapshot_id, create_snapshot, inspect_image_size(), and remove_image()
accordingly.

---

Nitpick comments:
In `@packages/worker/tests/common/mod.rs`:
- Around line 75-86: Extract the duplicated environment lookup into a single
helper used by both functions: create a function (e.g., fc_bin_from_env or
fc_default_path) that reads std::env::var("FIRECRACKER_BIN") with the
"/usr/bin/firecracker" default and returns a PathBuf or String, then update
fc_available to call that helper and check .exists() and update fc_binary_path
to return the helper's PathBuf; update signatures and imports as needed and
ensure names fc_available and fc_binary_path remain unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3b7bbb24-263b-403d-8f5c-cfd2f7cff7cd

📥 Commits

Reviewing files that changed from the base of the PR and between ea73ae2 and ece04a4.

📒 Files selected for processing (15)
  • .github/workflows/ci.yml
  • .github/workflows/release.yml
  • packages/guest-agent/src/main.rs
  • packages/worker/Dockerfile
  • packages/worker/src/functions/terminal.rs
  • packages/worker/src/runtime/fc_agent.rs
  • packages/worker/src/runtime/fc_init.rs
  • packages/worker/src/runtime/fc_rootfs.rs
  • packages/worker/src/runtime/fc_types.rs
  • packages/worker/src/runtime/firecracker.rs
  • packages/worker/src/runtime/mod.rs
  • packages/worker/tests/common/mod.rs
  • packages/worker/tests/firecracker_integration.rs
  • scripts/build-init.sh
  • scripts/fetch-kernel.sh

Comment on lines +284 to +304
#[cfg(feature = "firecracker")]
{
let backend_str = session
.get("backend")
.and_then(|v| v.as_str())
.unwrap_or("docker");

if backend_str == "firecracker" {
let exec_id = session
.get("execId")
.and_then(|v| v.as_str())
.unwrap_or("");

if let Some(fc_rt) = rt.as_any().downcast_ref::<FirecrackerRuntime>() {
let vm_id = id.strip_prefix("iii-sbx-").unwrap_or(id);
if let Ok(agent) = fc_rt.agent_client(vm_id).await {
let _ = agent.terminal_close(exec_id).await;
}
}
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't delete the session if Firecracker terminal close failed.

This path removes the KV/session metadata even when agent.terminal_close() fails. With the guest agent's 16-session cap, a transient vsock error here can leak PTYs until the VM is recycled.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/functions/terminal.rs` around lines 284 - 304, The
current Firecracker branch calls agent.terminal_close(exec_id).await but
proceeds to remove KV/session metadata regardless of success; change it so you
check the Result from agent.terminal_close and only delete the session when
terminal_close returns Ok. Locate the Firecracker-specific block (the downcast
via rt.as_any().downcast_ref::<FirecrackerRuntime>(), vm_id computed from id,
exec_id extracted from session, and the call to fc_rt.agent_client(vm_id).await
-> agent.terminal_close(exec_id).await) and gate the subsequent session removal
on the terminal_close outcome (handle Err by logging and preserving the session
instead of deleting it).

Comment on lines +272 to +289
pub async fn terminal_write(
&self,
session_id: &str,
data: &[u8],
) -> Result<(), String> {
let req = AgentTerminalWriteRequest {
session_id: session_id.to_string(),
data: base64::Engine::encode(
&base64::engine::general_purpose::STANDARD,
data,
),
};

let _: serde_json::Value = self
.send_request("/terminal/write", &req, DEFAULT_TIMEOUT)
.await?;
Ok(())
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify terminal_write handling in guest-agent
rg -n "TerminalWriteRequest|handle_terminal_write" packages/guest-agent/src/main.rs -A 10

Repository: iii-hq/sandbox

Length of output: 1997


🏁 Script executed:

# Verify the worker code actually uses base64 encoding currently
rg -n "base64::Engine::encode" packages/worker/src/runtime/fc_agent.rs -B 5 -A 2

Repository: iii-hq/sandbox

Length of output: 701


Terminal write sends base64-encoded text, but guest-agent expects plain text.

The terminal_write method base64-encodes the data before sending, but packages/guest-agent/src/main.rs:601 reads req.data.as_bytes() directly without decoding. This causes garbled terminal input (e.g., "ls\n" becomes "bHMK" written to the terminal). The guest-agent test at line 884 confirms the expected behavior is to receive plain text.

🔧 Proposed fix - send plain text
     pub async fn terminal_write(
         &self,
         session_id: &str,
         data: &[u8],
     ) -> Result<(), String> {
         let req = AgentTerminalWriteRequest {
             session_id: session_id.to_string(),
-            data: base64::Engine::encode(
-                &base64::engine::general_purpose::STANDARD,
-                data,
-            ),
+            data: String::from_utf8_lossy(data).to_string(),
         };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
pub async fn terminal_write(
&self,
session_id: &str,
data: &[u8],
) -> Result<(), String> {
let req = AgentTerminalWriteRequest {
session_id: session_id.to_string(),
data: base64::Engine::encode(
&base64::engine::general_purpose::STANDARD,
data,
),
};
let _: serde_json::Value = self
.send_request("/terminal/write", &req, DEFAULT_TIMEOUT)
.await?;
Ok(())
}
pub async fn terminal_write(
&self,
session_id: &str,
data: &[u8],
) -> Result<(), String> {
let req = AgentTerminalWriteRequest {
session_id: session_id.to_string(),
data: String::from_utf8_lossy(data).to_string(),
};
let _: serde_json::Value = self
.send_request("/terminal/write", &req, DEFAULT_TIMEOUT)
.await?;
Ok(())
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/fc_agent.rs` around lines 272 - 289, The
terminal_write method is base64-encoding input before sending, but the
guest-agent (reads req.data.as_bytes()) expects plain text; update
terminal_write (and the AgentTerminalWriteRequest payload) to send the raw text
string instead of base64: convert the incoming data: &[u8] to a UTF-8 string
(e.g., String::from_utf8 or from_utf8_lossy) and set req.data to that plain
string, then call send_request("/terminal/write", &req, DEFAULT_TIMEOUT) as
before so the guest-agent receives plain text matching its test expectations.

Comment on lines +291 to +310
pub async fn terminal_read(
&self,
session_id: &str,
) -> Result<(Vec<u8>, bool), String> {
let req = AgentTerminalReadRequest {
session_id: session_id.to_string(),
};

let resp: AgentTerminalReadResponse = self
.send_request("/terminal/read", &req, DEFAULT_TIMEOUT)
.await?;

let data = base64::Engine::decode(
&base64::engine::general_purpose::STANDARD,
&resp.data,
)
.map_err(|e| format!("Failed to decode terminal data: {e}"))?;

Ok((data, resp.eof))
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify terminal_read response structure in guest-agent
rg -n "handle_terminal_read" packages/guest-agent/src/main.rs -A 30

Repository: iii-hq/sandbox

Length of output: 2130


🏁 Script executed:

#!/bin/bash
# Find and inspect AgentTerminalReadResponse struct definition
rg -n "struct AgentTerminalReadResponse" packages/worker/src/ -A 10

Repository: iii-hq/sandbox

Length of output: 780


🏁 Script executed:

#!/bin/bash
# Get the complete struct definition with derives
sed -n '290,310p' packages/worker/src/runtime/fc_types.rs

Repository: iii-hq/sandbox

Length of output: 577


Terminal read response missing eof field causes deserialization failures.

AgentTerminalReadResponse requires an eof: bool field (with no default), but the guest-agent at packages/guest-agent/src/main.rs:674 only returns {"data": ...}. Deserialization will fail when the worker processes the response.

The guest-agent must include the eof field in its response, or the struct must make the field optional with #[serde(default)].

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/fc_agent.rs` around lines 291 - 310, The
AgentTerminalReadResponse deserialization fails when the guest-agent omits eof;
update the response struct to tolerate missing eof by making the field
optional/defaulted (e.g., change the eof field in AgentTerminalReadResponse to
have #[serde(default)] and keep it as bool or Option<bool> with an explicit
default false) so terminal_read (which calls send_request("/terminal/read",
...)) can successfully decode responses that only include data; alternatively,
ensure the guest-agent actually includes "eof" in its JSON responses if you
prefer fixing the producer instead of the consumer.

Comment on lines +100 to +114
AGENT_BIN="/usr/local/bin/iii-guest-agent"
if [ -x "$AGENT_BIN" ]; then
echo "Init: starting guest agent"
"$AGENT_BIN" &
AGENT_PID=$!
echo "Init: guest agent started (pid=$AGENT_PID)"
else
echo "Init: WARNING - guest agent not found at $AGENT_BIN" >&2
fi

echo "Init: system ready"

while true; do
wait || true
done
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

head -120 packages/worker/src/runtime/fc_init.rs | tail -30

Repository: iii-hq/sandbox

Length of output: 757


🏁 Script executed:

wc -l packages/worker/src/runtime/fc_init.rs

Repository: iii-hq/sandbox

Length of output: 100


🏁 Script executed:

cat -n packages/worker/src/runtime/fc_init.rs | head -30

Repository: iii-hq/sandbox

Length of output: 903


🏁 Script executed:

cat -n packages/worker/src/runtime/fc_init.rs | sed -n '70,130p'

Repository: iii-hq/sandbox

Length of output: 2086


🏁 Script executed:

cat -n packages/worker/src/runtime/fc_init.rs | sed -n '29,70p'

Repository: iii-hq/sandbox

Length of output: 1491


Avoid the no-child busy loop in PID 1.

When the guest agent is missing or exits, wait returns immediately and this loop spins forever consuming CPU. Add sleep and/or process monitoring to avoid busy-spinning when there are no child processes to wait on.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/fc_init.rs` around lines 100 - 114, The PID 1
loop currently busy-waits because a plain wait returns immediately when there
are no children; update the shutdown/monitoring loop to avoid spinning by
waiting specifically for the guest agent and sleeping when no children exist:
after launching AGENT_BIN and setting AGENT_PID, use targeted waits (e.g., wait
"$AGENT_PID" or wait -n) and detect when wait returns with no children, then
sleep (e.g., sleep 1) before retrying; also handle agent exit by logging and
optionally restarting or breaking the loop. Modify the loop that references
AGENT_PID and the unconditional wait || true to perform process-aware waiting
and a short sleep when no child is present to prevent CPU busy-looping.

Comment on lines +552 to +571
let client = self.fc_client(vm_id);
client.pause_instance().await?;

let snap_dir = self.config.snapshot_dir.join(vm_id);
tokio::fs::create_dir_all(&snap_dir)
.await
.map_err(|e| format!("Failed to create snapshot dir: {e}"))?;

let snapshot_path = snap_dir.join("snapshot.bin");
let mem_path = snap_dir.join("mem.bin");

client
.create_snapshot(&SnapshotCreateParams {
snapshot_type: "Full".to_string(),
snapshot_path: snapshot_path.to_string_lossy().to_string(),
mem_file_path: mem_path.to_string_lossy().to_string(),
})
.await?;

client.resume_instance().await?;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Always resume the VM when snapshot creation fails.

Any error after pause_instance() currently bubbles out with the sandbox still paused. Wrap the snapshot flow in a guard/finally so failures restore the running state before returning.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/worker/src/runtime/firecracker.rs` around lines 552 - 571, After
calling self.fc_client(vm_id).pause_instance(), ensure the VM is always resumed
even if snapshot creation fails: introduce a scope guard or a try/finally-style
cleanup that calls client.resume_instance().await on all paths (success or
error). Locate the pause_instance()/create_snapshot()/resume_instance() sequence
(symbols: fc_client, pause_instance, create_snapshot, resume_instance,
SnapshotCreateParams, vm_id) and either wrap the snapshot logic in a block that
resumes in a finally/Err arm or implement a small Drop-based guard (e.g.,
ResumeOnDrop) that holds the client and invokes resume_instance().await if the
snapshot flow exits with an error, so the sandbox cannot remain paused.

Comment on lines +10 to +11
printf " --agent-path P Path to guest agent binary inside rootfs\n"
printf " (default: /usr/local/bin/iii-guest-agent)\n"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

--agent-path argument is accepted but never used.

The script accepts --agent-path and prints it at the end, but the generated init script hardcodes AGENT_BIN="/usr/local/bin/iii-guest-agent" (line 159). Users providing a custom path will see it echoed but the init will still look for the agent at the default location.

Either remove the misleading option or substitute it into the heredoc:

🔧 Proposed fix to use the AGENT_PATH variable
-AGENT_BIN="/usr/local/bin/iii-guest-agent"
+AGENT_BIN="__AGENT_PATH_PLACEHOLDER__"
 if [ -x "$AGENT_BIN" ]; then

Then after the heredoc:

 chmod 755 "${ROOTFS_DIR}/sbin/init"
+sed -i "s|__AGENT_PATH_PLACEHOLDER__|${AGENT_PATH}|g" "${ROOTFS_DIR}/sbin/init"

Also applies to: 18-18, 159-159, 185-185

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/build-init.sh` around lines 10 - 11, The script accepts --agent-path
but never applies it to the generated init script: capture the CLI arg into
AGENT_PATH (or similar) and substitute it into the heredoc instead of the
hardcoded AGENT_BIN="/usr/local/bin/iii-guest-agent" so the generated init uses
the provided path; update the printed help/usage lines (the printf for
--agent-path) to match the variable name and ensure any later references to
AGENT_BIN or the agent binary in the heredoc and after (e.g., where AGENT_BIN is
defined or used) use "${AGENT_PATH:-/usr/local/bin/iii-guest-agent}" to keep the
default when not provided.

- README: benchmark table (200ms cold start, 59ms exec avg), competition
  bar chart, feature matrix vs 7 competitors, Firecracker architecture
  diagram, updated test counts (941 total)
- docs/competitive-analysis.md: deep dive on 15 platforms with real
  published numbers and sources
- scripts/test-e2e.sh: 26-test E2E suite covering lifecycle, exec,
  filesystem, env, git, pause/resume, parallel creates, throughput
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (2)
scripts/test-e2e.sh (2)

1-8: Consider documenting script dependencies.

The script requires curl, python3, and standard Unix utilities. Adding a brief comment at the top would help users identify missing dependencies quickly.

Suggested documentation
 #!/bin/bash
+# E2E test suite for iii-sandbox API
+# Dependencies: curl, python3, bash 4+
+# Usage: ./scripts/test-e2e.sh
+# Environment: III_AUTH_TOKEN (default: demo123)
 set -euo pipefail
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/test-e2e.sh` around lines 1 - 8, Add a short top-of-file comment in
scripts/test-e2e.sh (immediately after the shebang) that documents the script
dependencies (e.g., requires bash, curl, python3, and common Unix utilities like
sed/grep/awk) and any required environment variables (e.g.,
TOKEN/III_AUTH_TOKEN, API) so users can quickly identify missing tools before
running the script.

233-259: Consider cleaning up temporary files.

The parallel sandbox creation writes to /tmp/iii-para-*.json but these files are not cleaned up after the test. This is minor but could leave artifacts on repeated runs.

Proposed cleanup
 PARA_COUNT=$(echo "$PARA_IDS" | wc -w | tr -d ' ')
 if [ "$PARA_COUNT" -eq 3 ]; then
     PASS=$((PASS+1))
     echo "  $(green PASS) all 3 sandboxes created"
 else
     FAIL=$((FAIL+1))
     echo "  $(red FAIL) only $PARA_COUNT/3 created"
 fi
+rm -f /tmp/iii-para-*.json
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/test-e2e.sh` around lines 233 - 259, The temp files
/tmp/iii-para-*.json created during the parallel sandbox creation loop (see
variables PARA_START/PARA_END, PARA_IDS and the api POST that writes to
/tmp/iii-para-$i.json) are not removed; add cleanup to delete those files after
you've extracted IDs (or in a trap/cleanup function) so repeated runs don't
leave artifacts—remove /tmp/iii-para-1.json /tmp/iii-para-2.json
/tmp/iii-para-3.json (or pattern /tmp/iii-para-*.json) immediately after the
PARA_IDS loop or in a finally/trap block.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Around line 685-701: The fenced code block showing the benchmark table is
missing a language identifier which triggers MD040; update the opening fence
(the line with ```) that precedes the table (the block beginning with
"+-----------------------+----------+") to include a language token such as text
or console (e.g., change ``` to ```text) so the markdown linter recognizes the
code block language.
- Line 648: Update the README table row that currently reads "Firecracker |
Production | `bollard` + guest agent | `firecracker`" to accurately reflect
build/runtime structure: state that `bollard` is an unconditional dependency
(e.g., "bollard (always included)") and that the guest agent is a separate
binary (e.g., "guest agent binary"), and clarify that the `firecracker` feature
flag is empty and only gates runtime implementation code rather than adding
dependencies; keep the Production column and feature column but replace the
dependencies cell with wording like "bollard (always included) + guest agent
binary" or similar clear phrasing.
- Around line 656-675: The fenced ASCII diagram block starting with "Worker
(host)" is missing a language identifier; update the opening triple-backtick
fence to include a language token like text or ascii (e.g., change ``` to
```text) so the diagram renders correctly and satisfies the MD040 lint rule;
locate the diagram block by the "Worker (host)" header and modify its opening
fence accordingly.
- Around line 707-721: The fenced code block containing the latency/competitive
comparison chart in README.md lacks a language identifier and triggers MD040;
fix it by adding a language tag (e.g., use "text") to the opening
triple-backticks of that code block so it becomes ```text, leaving the chart
content unchanged; locate the block that starts with the ASCII chart (lines
showing "iii-sandbox (Docker) ..." through the timeline footer) and update the
opening fence accordingly.
- Around line 877-886: Update the README test-count table and any related text
to reflect the real test totals: change the "Rust SDK | 11 | 32" row to the
actual 99 tests, update "TypeScript (SDK/CLI/MCP) | 33 | ~293" to 379, correct
"Rust engine | 24 modules | 394" to show 76 (or split into 22 unit + 54
firecracker integration tests), and ensure the "Rust engine tests (333)"
reference elsewhere matches the corrected totals; adjust the table rows for
"Rust engine (FC)", "Python SDK", and "Guest agent" only if their counts differ,
and keep the table formatting and original row labels (e.g., the exact string
"Rust engine | 24 modules | 394" and "TypeScript (SDK/CLI/MCP) | 33 | ~293") so
changes are easy to locate.

In `@scripts/test-e2e.sh`:
- Around line 120-124: The mkdir invocation in scripts/test-e2e.sh is sending
the wrong payload key — change the api POST call to
"/sandbox/sandboxes/$SBX_ID/files/mkdir" to use {"paths":["/workspace/subdir"]}
(plural array) instead of {"path":"..."} and stop discarding its output (remove
the > /dev/null 2>&1) so failures are visible; update the mkdir call near the
api POST "/sandbox/sandboxes/$SBX_ID/files/mkdir" line and optionally assert its
success before proceeding to files/write and files/read.

---

Nitpick comments:
In `@scripts/test-e2e.sh`:
- Around line 1-8: Add a short top-of-file comment in scripts/test-e2e.sh
(immediately after the shebang) that documents the script dependencies (e.g.,
requires bash, curl, python3, and common Unix utilities like sed/grep/awk) and
any required environment variables (e.g., TOKEN/III_AUTH_TOKEN, API) so users
can quickly identify missing tools before running the script.
- Around line 233-259: The temp files /tmp/iii-para-*.json created during the
parallel sandbox creation loop (see variables PARA_START/PARA_END, PARA_IDS and
the api POST that writes to /tmp/iii-para-$i.json) are not removed; add cleanup
to delete those files after you've extracted IDs (or in a trap/cleanup function)
so repeated runs don't leave artifacts—remove /tmp/iii-para-1.json
/tmp/iii-para-2.json /tmp/iii-para-3.json (or pattern /tmp/iii-para-*.json)
immediately after the PARA_IDS loop or in a finally/trap block.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f9460ed8-4caf-4f15-b558-15d32d947fe6

📥 Commits

Reviewing files that changed from the base of the PR and between ece04a4 and 6d35ff2.

📒 Files selected for processing (3)
  • README.md
  • docs/competitive-analysis.md
  • scripts/test-e2e.sh

Comment on lines +120 to +124
# nested dir
api POST "/sandbox/sandboxes/$SBX_ID/files/mkdir" '{"path":"/workspace/subdir"}' > /dev/null 2>&1
api POST "/sandbox/sandboxes/$SBX_ID/files/write" '{"path":"/workspace/subdir/nested.txt","content":"nested"}' > /dev/null 2>&1
RESP=$(api POST "/sandbox/sandboxes/$SBX_ID/files/read" '{"path":"/workspace/subdir/nested.txt"}')
assert_contains "nested file read" "nested" "$RESP"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the mkdir endpoint payload schema
rg -n "mkdir" --type ts --type rust -A5 -B2 | head -60

Repository: iii-hq/sandbox

Length of output: 4716


🏁 Script executed:

#!/bin/bash
# Check the actual content of scripts/test-e2e.sh around lines 120-124
cat -n scripts/test-e2e.sh | sed -n '115,130p'

Repository: iii-hq/sandbox

Length of output: 978


🏁 Script executed:

#!/bin/bash
# Check how the Rust mkdir function extracts the paths parameter
rg -n "fs::mkdir" --type rust -A 20 packages/worker/src/functions/filesystem.rs

Repository: iii-hq/sandbox

Length of output: 1558


Payload mismatch in mkdir API call — uses "path" instead of "paths".

Line 121 uses {"path":"/workspace/subdir"}, but the API expects {"paths":["/workspace/subdir"]} (plural key with array value, as defined in the Rust implementation at packages/worker/src/functions/filesystem.rs:283). The call fails silently since output is discarded to /dev/null, making the issue harder to detect.

Proposed fix
-api POST "/sandbox/sandboxes/$SBX_ID/files/mkdir" '{"path":"/workspace/subdir"}' > /dev/null 2>&1
+api POST "/sandbox/sandboxes/$SBX_ID/files/mkdir" '{"paths":["/workspace/subdir"]}' > /dev/null 2>&1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# nested dir
api POST "/sandbox/sandboxes/$SBX_ID/files/mkdir" '{"path":"/workspace/subdir"}' > /dev/null 2>&1
api POST "/sandbox/sandboxes/$SBX_ID/files/write" '{"path":"/workspace/subdir/nested.txt","content":"nested"}' > /dev/null 2>&1
RESP=$(api POST "/sandbox/sandboxes/$SBX_ID/files/read" '{"path":"/workspace/subdir/nested.txt"}')
assert_contains "nested file read" "nested" "$RESP"
# nested dir
api POST "/sandbox/sandboxes/$SBX_ID/files/mkdir" '{"paths":["/workspace/subdir"]}' > /dev/null 2>&1
api POST "/sandbox/sandboxes/$SBX_ID/files/write" '{"path":"/workspace/subdir/nested.txt","content":"nested"}' > /dev/null 2>&1
RESP=$(api POST "/sandbox/sandboxes/$SBX_ID/files/read" '{"path":"/workspace/subdir/nested.txt"}')
assert_contains "nested file read" "nested" "$RESP"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/test-e2e.sh` around lines 120 - 124, The mkdir invocation in
scripts/test-e2e.sh is sending the wrong payload key — change the api POST call
to "/sandbox/sandboxes/$SBX_ID/files/mkdir" to use
{"paths":["/workspace/subdir"]} (plural array) instead of {"path":"..."} and
stop discarding its output (remove the > /dev/null 2>&1) so failures are
visible; update the mkdir call near the api POST
"/sandbox/sandboxes/$SBX_ID/files/mkdir" line and optionally assert its success
before proceeding to files/write and files/read.

Reduce README from 891 to ~230 lines. Fix broken ASCII diagrams
with plain-text that renders on GitHub. Move full API reference
to docs/api-reference.md and worker architecture to docs/architecture.md.
Replace plain-text ASCII with mermaid (renders natively on GitHub).
Add back cold start bar chart and feature comparison matrix.
1. fc_agent: send_request now accepts HTTP method param (was hardcoded POST)
2. firecracker: validate_vm_id rejects path traversal (/, \, ..)
3. fc_rootfs: missing guest-agent binary now fails the build
4. fc_rootfs: cp non-zero exit now returns Err (was silently continuing)
5. guest-agent: detached children reaped via background thread
6. guest-agent: timeout_ms now enforced with kill on deadline
7. fc_network: NAT uplink interface configurable via III_UPLINK_IFACE
8. fc_types: file content is already binary-safe (base64 on wire) — no change needed

443 tests passing, 0 failures.
Clippy 1.94+ flags &mut ws as unnecessary_mut_passed since forkpty
takes *mut winsize (raw pointer). Using std::ptr::addr_of_mut!(ws)
produces the raw pointer directly without creating a mutable reference.
Newer Ubuntu runners output "static-pie linked" instead of
"statically linked". Match either form with grep -E.
Also print the file output for easier debugging.
@rohitg00 rohitg00 merged commit d0cf9f6 into main Mar 11, 2026
5 checks passed
@rohitg00 rohitg00 deleted the feat/firecracker-runtime branch March 11, 2026 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant