feat: Firecracker microVM runtime backend#16
Conversation
Add complete Firecracker isolation backend behind #[cfg(feature = "firecracker")] feature gate: - fc_types: Firecracker API serde types + guest agent request/response types (14 tests) - fc_api: Unix socket HTTP client for Firecracker REST API (8 tests) - fc_agent: VSOCK-based guest agent client for exec/file/stats operations (4 tests) - fc_network: TAP device management, subnet allocator (/30 per VM), iptables NAT (5 tests) - fc_rootfs: OCI image → ext4 rootfs conversion pipeline via bollard (3 tests) - firecracker: Full SandboxRuntime trait impl — all 29 methods (10 tests) - guest-agent: Lightweight HTTP server for inside VMs — std-only, thread-per-connection (11 tests) Architecture: Each VM gets a dedicated CID (VSOCK), TAP interface, /30 subnet, and runs the guest agent on port 8052 for host↔guest communication.
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (9)
📝 WalkthroughWalkthroughAdds a Rust guest-agent binary and a full Firecracker backend: new clients, types, networking, rootfs tooling, init injection, terminal integration, CI/release changes, Docker packaging updates, helper scripts, and extensive tests enabling VM lifecycle and agent-backed sandbox operations. Changes
Sequence Diagram(s)sequenceDiagram
participant Worker
participant Docker
participant Network
participant Firecracker
participant GuestAgent
Worker->>Docker: ensure_image(image)
Docker-->>Worker: prepared rootfs
Worker->>Network: allocate_cid() / allocate subnet
Network-->>Worker: guest_cid, SubnetInfo
Worker->>Network: create_tap_device(host_ip)
Worker->>Network: setup_nat(guest_subnet)
Worker->>Firecracker: start_firecracker_process(vm_id)
Firecracker-->>Worker: socket ready
Worker->>Firecracker: configure VM (boot, machine, drives, vsock)
Firecracker-->>Worker: configured
Worker->>Firecracker: start_instance()
Firecracker-->>Worker: running
Worker->>GuestAgent: health_check() / exec() / file ops / terminal
GuestAgent-->>Worker: JSON responses / stream data
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 17
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/guest-agent/src/main.rs`:
- Around line 118-137: When handling req.detached in the block that calls
cmd.spawn(), avoid dropping the Child handle immediately (which causes zombies)
by spawning a background thread that takes ownership of the Child returned from
cmd.spawn() and calls child.wait() (or child.wait_with_output()/handle Result)
to reap the process and optionally log the exit status or error; keep the
existing behavior of returning the PID to the client after sending the response.
Modify the code around cmd.spawn() and the Ok(child) arm so it clones/reads
child.id() for the PID, sends the response, then moves the child into a
std::thread::spawn closure that calls child.wait() and logs the result (or
ignores the result after reaping).
- Around line 9-21: The guest-agent currently binds a TCP listener in main
(TcpListener::bind) and prints "iii-guest-agent listening on port {port}", but
the runtime connects via the Firecracker vsock/UDS path used in
packages/worker/src/runtime/fc_agent.rs; replace the TCP listener with a
vsock/Unix domain socket listener: change the binding logic in main to open the
same UDS/vsock socket path the worker expects (use UnixListener::bind or the
vsock crate equivalent instead of TcpListener::bind), read the socket path from
the same env/format used by fc_agent (the ..._52 path) and update the startup
log to reflect the socket path instead of listening on 0.0.0.0:{port} so the
runtime can reach the guest agent.
In `@packages/worker/src/main.rs`:
- Around line 47-55: The code is silently reinterpreting
config.firecracker_rootfs as a directory when building FcConfig
(rootfs_cache_dir) which can treat file paths as directories; update the config
shape and usage so the value is explicit and not misused: add distinct fields in
your configuration (e.g., firecracker_kernel_path and
firecracker_rootfs_cache_dir or similar) and use those when constructing
FcConfig in main.rs (replace uses of config.firecracker_rootfs with the new
firecracker_rootfs_cache_dir and ensure kernel_path is taken from the explicit
kernel config), or if you must keep the old single field, detect whether
config.firecracker_rootfs is a file and map it correctly (use parent dir for
cache vs. file path for rootfs) before assigning to FcConfig.rootfs_cache_dir;
update any references in FcConfig and runtime::fc_types::FcConfig to match the
new fields so defaults in fc_types.rs are not silently relied upon.
In `@packages/worker/src/runtime/fc_agent.rs`:
- Around line 25-67: send_request currently hardcodes "POST" and extract_body
drops the HTTP status, so GET routes like health_check never reach the guest and
non-2xx JSON bodies are treated as success; change send_request(signature) to
accept an HTTP method parameter (e.g., method: &str) and use it when formatting
the request_line instead of the hardcoded "POST" (refer to send_request and
request_line), update callers such as health_check to pass "GET", and modify
extract_body (or add a small response parser in send_request) to parse the
response status line, return or check the numeric status code, and treat non-2xx
responses as Err (including the response body or JSON error text in the error
message) before attempting serde_json::from_str.
In `@packages/worker/src/runtime/fc_api.rs`:
- Around line 40-85: The raw_request function lacks timeouts for socket
operations; wrap the UnixStream::connect, stream.write_all, stream.shutdown, and
stream.read_to_end calls in timed operations (e.g., tokio::time::timeout) inside
raw_request so each I/O has a bounded duration and returns a clear error on
timeout; pick sensible durations (or use a configurable timeout value) and map
timeout errors to a descriptive Err string (e.g., "Timed out
connecting/writing/reading/shutting down Firecracker socket") while preserving
existing map_err behavior for underlying I/O failures.
In `@packages/worker/src/runtime/fc_network.rs`:
- Around line 19-37: The allocator currently monotically increments next_subnet
in allocate() and release() only removes the mapping, so freed subnets are never
reused; change allocate() and release() to maintain a free pool: in allocate(),
first check if vm_id already exists in allocated and return its SubnetInfo;
otherwise if a free list (e.g., VecDeque free_subnets) is non-empty pop from it
to assign, else use next_subnet and increment; in release(), look up and remove
the vm_id from allocated and push the freed subnet_id onto the free_subnets
queue so it can be recycled; update struct fields to include free_subnets and
ensure you only push when the vm_id was actually allocated (reference allocate,
release, next_subnet, and allocated).
- Around line 74-117: The NAT rules hard-code the host uplink as "eth0" in
setup_nat and teardown_nat, causing failures on hosts with different interface
names; update the functions (setup_nat and teardown_nat) to accept an uplink
interface parameter (e.g., uplink_iface: &str) or read it from
configuration/environment and use that variable in all iptables invocations
replacing the literal "eth0", and update all callers to pass the correct uplink
interface (or load it from config) so the POSTROUTING and FORWARD rules match
the actual host egress device.
In `@packages/worker/src/runtime/fc_rootfs.rs`:
- Around line 29-37: The export pipeline is wrong because
export_image_to_tar(docker, image, &tar_path) produces a flattened container
filesystem, so after extract_tar(&tar_path, &extract_dir) there are no OCI
layers or manifest to merge; remove or skip the subsequent
extract_layers(&extract_dir) and any manifest/*.tar merging logic that expects
OCI layers (code referencing extract_layers, manifest.json processing, and
merged_dir) and instead treat extract_dir as the final rootfs (or change
export_image_to_tar to export OCI-style image layers if you intend to keep the
merge pipeline); update all code paths that reference merged_dir to use
extract_dir as the final rootfs when using export_image_to_tar.
- Around line 46-56: The code currently skips copying the guest agent when
agent_path doesn't exist, producing a broken VM; change the logic in
fc_rootfs.rs so that if agent_path.exists() is false you return an Err (fail the
build) with a clear message instead of silently continuing. Update the block
around agent_path/agent_dest (the copy using fs::create_dir_all and fs::copy
that writes usr/local/bin/iii-guest-agent) to error out when the source is
missing, and apply the same fix to the other occurrence noted (the similar block
at the 216-223 region) so both code paths fail-fast when the guest agent binary
is absent.
- Around line 13-18: The current logic uses a deterministic intermediate path
(rootfs_path and the shared _mount_tmp used by create_ext4) so concurrent runs
can clobber each other and leave half-written artifacts; change the build to
create a unique temporary workspace (e.g., use a UUID/PID-based temp dir or
tempfile::Builder) for the mount/build steps instead of _mount_tmp, build the
image there, write the ext4 to a temp file (e.g., <safe_name>.ext4.tmp), and
atomically publish by renaming the temp file to rootfs_path only after the build
completes successfully; update the code paths that call create_ext4 and the
existence checks (references: safe_name, rootfs_path, create_ext4, _mount_tmp)
to use the unique temp location and atomic rename to prevent races and
half-written rootfs (also apply same fix to the other blocks noted around lines
28-43 and 256-282).
- Around line 272-293: The code currently treats any non-zero cp exit as success
and may cache a partially populated rootfs; update the match on cp_result in
fc_rootfs.rs so that when cp returns Ok(o) but o.status is not success you
return an Err with the cp stderr (instead of Ok(())); keep the existing warning
log (tracing::warn!) if desired, but change the branch to propagate an error
(e.g., Err(format!("cp failed: {}", stderr))) so the caller won’t persist an
incomplete image; reference the cp_result variable and the Command::new("sh")
block that invokes cp, and ensure umount and fs::remove_dir still run as cleanup
before returning the Err.
In `@packages/worker/src/runtime/fc_types.rs`:
- Around line 199-213: The file-copy contract currently models file content as
UTF-8 Strings which breaks binary files; change the content fields in
AgentFileWriteRequest and AgentFileReadResponse from String to a binary-safe
type (Vec<u8>) and annotate with serde_bytes (e.g. #[serde(with =
"serde_bytes")]) so Serialize/Deserialize remain efficient; keep
AgentFileReadRequest.path, AgentFileWriteRequest.path and mode as-is and leave
size as u64, and update any consumers of copy_to_sandbox / copy_from_sandbox to
accept/produce Vec<u8> instead of String.
In `@packages/worker/src/runtime/firecracker.rs`:
- Around line 265-279: The Sandbox never records or exposes active port forwards
because VmInstance is constructed with port_mappings: HashMap::new() and nothing
populates it, so sandbox_port_bindings() always returns empty; fix by ensuring
port mappings are created and stored when ports are published/forwarded — update
the VM lifecycle code paths that create networking (e.g., where VmInstance is
created and where inbound forwarding is installed) to add entries into
VmInstance.port_mappings and persist them (mirror the structure expected by
sandbox_port_bindings()), and update the functions that install
NAT/iptables/tproxy forwarding to return the assigned host port and write that
into VmInstance.port_mappings (and any teardown paths must remove them). Target
symbols: VmInstance, port_mappings, sandbox_port_bindings(), and the VM
creation/networking routines that set up tap/veth and publish ports so active
bindings are discoverable by packages/worker/src/functions/port.rs.
- Around line 48-54: The socket_path and vsock_path functions (and other places
that build filesystem paths from VM identifiers at the mentioned locations)
currently join raw vm_id/name values directly into paths; fix by
validating/sanitizing identifiers before use: add a helper (e.g., sanitize_vm_id
or ensure_safe_basename) that rejects empty strings and any id containing path
separators, .. or other traversal characters (or enforce a strict whitelist
regex like alphanumeric, hyphen, underscore, dot), and use that helper in
socket_path, vsock_path and the rootfs/snapshot/volume path builders referenced
(lines ~238-239, ~525-526, ~628-636) so only a single safe basename is ever
joined into config.socket_dir or other managed dirs; return/propagate an error
when validation fails.
- Around line 230-263: The code currently leaks resources if any of
configure_vm(), client.start_instance(), or wait_for_agent() fail after creating
vm_rootfs, allocating a subnet and guest_cid, creating the TAP and NAT, and
spawning Firecracker; add a rollback/cleanup path that runs on error to undo
those steps: delete the copied vm_rootfs created by ensure_rootfs, release the
subnet back to subnet_allocator (use the same subnet value returned by
allocator.allocate), remove the TAP device via
super::fc_network::create_tap_device counterpart (e.g., delete_tap_device) and
undo NAT via super::fc_network::setup_nat counterpart (e.g., teardown_nat),
terminate the spawned Firecracker process started by start_firecracker_process
(using the returned pid or handle), and release guest_cid if applicable;
implement this by capturing the created resources (vm_rootfs path, subnet,
tap_name, pid, guest_cid) and invoking a cleanup async function or a drop/guard
on any Err from configure_vm, client.start_instance, or wait_for_agent so all
partially-created sandbox artifacts are removed on failure.
- Around line 525-576: The code currently writes every snapshot to
<snapshot_dir>/<vm_id>/ which causes overwrites and the inspect/remove helpers
reconstruct the path by splitting on '-' which breaks for hyphenated vm_ids; fix
commit_sandbox (where snapshot_id is created and snapshot files are written),
inspect_image_size, and remove_image to use a unique, immutable per-snapshot
path derived from the full snapshot id: (1) when creating the snapshot in
commit_sandbox, generate snapshot_id and create a dedicated directory for that
snapshot (e.g., snap_dir = self.config.snapshot_dir.join(snapshot_id_safe)) and
write snapshot.bin and mem.bin there instead of using vm_id; ensure
snapshot_id_safe encodes comment/vm_id safely (URL-encode or use a UUID suffix)
so it is a valid directory name; (2) in inspect_image_size and remove_image,
reconstruct the exact snapshot directory by taking
image_id.trim_start_matches("fc-snap-") (do NOT .split('-').next()) and using
that whole remainder as the directory name to locate snapshot.bin and mem.bin or
remove the directory; this preserves uniqueness and avoids truncating hyphenated
vm IDs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: af2a53c9-9085-491b-9e93-5d230099f6be
📒 Files selected for processing (10)
packages/guest-agent/Cargo.tomlpackages/guest-agent/src/main.rspackages/worker/src/main.rspackages/worker/src/runtime/fc_agent.rspackages/worker/src/runtime/fc_api.rspackages/worker/src/runtime/fc_network.rspackages/worker/src/runtime/fc_rootfs.rspackages/worker/src/runtime/fc_types.rspackages/worker/src/runtime/firecracker.rspackages/worker/src/runtime/mod.rs
| fn main() { | ||
| let port = std::env::var("AGENT_PORT") | ||
| .unwrap_or_else(|_| "8052".to_string()) | ||
| .parse::<u16>() | ||
| .unwrap_or(8052); | ||
|
|
||
| let listener = TcpListener::bind(format!("0.0.0.0:{port}")) | ||
| .unwrap_or_else(|e| { | ||
| eprintln!("Failed to bind on port {port}: {e}"); | ||
| std::process::exit(1); | ||
| }); | ||
|
|
||
| eprintln!("iii-guest-agent listening on port {port}"); |
There was a problem hiding this comment.
Bind the agent to Firecracker VSOCK, not TCP.
packages/worker/src/runtime/fc_agent.rs:9-36 connects through the Firecracker vsock UDS path ..._52, but this server listens on 0.0.0.0:8052. With the current host/client pair, the runtime can never reach the guest agent.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/guest-agent/src/main.rs` around lines 9 - 21, The guest-agent
currently binds a TCP listener in main (TcpListener::bind) and prints
"iii-guest-agent listening on port {port}", but the runtime connects via the
Firecracker vsock/UDS path used in packages/worker/src/runtime/fc_agent.rs;
replace the TCP listener with a vsock/Unix domain socket listener: change the
binding logic in main to open the same UDS/vsock socket path the worker expects
(use UnixListener::bind or the vsock crate equivalent instead of
TcpListener::bind), read the socket path from the same env/format used by
fc_agent (the ..._52 path) and update the startup log to reflect the socket path
instead of listening on 0.0.0.0:{port} so the runtime can reach the guest agent.
| use runtime::fc_types::FcConfig; | ||
| let dk = connect_docker(); | ||
| let fc_config = FcConfig { | ||
| kernel_path: std::path::PathBuf::from(&config.firecracker_kernel), | ||
| rootfs_cache_dir: std::path::PathBuf::from(&config.firecracker_rootfs), | ||
| default_vcpus: config.firecracker_vcpus, | ||
| default_mem_mib: config.firecracker_mem_mib, | ||
| ..Default::default() | ||
| }; |
There was a problem hiding this comment.
Don't silently reinterpret firecracker_rootfs.
packages/worker/src/config.rs:1-45 still exposes a single firecracker_rootfs field, but this branch now uses it as rootfs_cache_dir and backfills the other FcConfig paths from hard-coded defaults in packages/worker/src/runtime/fc_types.rs:119-133. Existing deployments can end up treating a file path as a directory and looking for /usr/bin//opt//tmp assets that were never configured.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/worker/src/main.rs` around lines 47 - 55, The code is silently
reinterpreting config.firecracker_rootfs as a directory when building FcConfig
(rootfs_cache_dir) which can treat file paths as directories; update the config
shape and usage so the value is explicit and not misused: add distinct fields in
your configuration (e.g., firecracker_kernel_path and
firecracker_rootfs_cache_dir or similar) and use those when constructing
FcConfig in main.rs (replace uses of config.firecracker_rootfs with the new
firecracker_rootfs_cache_dir and ensure kernel_path is taken from the explicit
kernel config), or if you must keep the old single field, detect whether
config.firecracker_rootfs is a file and map it correctly (use parent dir for
cache vs. file path for rootfs) before assigning to FcConfig.rootfs_cache_dir;
update any references in FcConfig and runtime::fc_types::FcConfig to match the
new fields so defaults in fc_types.rs are not silently relied upon.
| pub struct AgentFileWriteRequest { | ||
| pub path: String, | ||
| pub content: String, | ||
| pub mode: Option<u32>, | ||
| } | ||
|
|
||
| #[derive(Debug, Clone, Serialize, Deserialize)] | ||
| pub struct AgentFileReadRequest { | ||
| pub path: String, | ||
| } | ||
|
|
||
| #[derive(Debug, Clone, Serialize, Deserialize)] | ||
| pub struct AgentFileReadResponse { | ||
| pub content: String, | ||
| pub size: u64, |
There was a problem hiding this comment.
Make the file-copy wire contract binary-safe.
These request/response types model file contents as String, but the runtime API that consumes them is byte-oriented (copy_to_sandbox/copy_from_sandbox). That means arbitrary files with non-UTF-8 bytes cannot be round-tripped reliably through this contract. Use an explicit binary representation here instead of plain text.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/worker/src/runtime/fc_types.rs` around lines 199 - 213, The
file-copy contract currently models file content as UTF-8 Strings which breaks
binary files; change the content fields in AgentFileWriteRequest and
AgentFileReadResponse from String to a binary-safe type (Vec<u8>) and annotate
with serde_bytes (e.g. #[serde(with = "serde_bytes")]) so Serialize/Deserialize
remain efficient; keep AgentFileReadRequest.path, AgentFileWriteRequest.path and
mode as-is and leave size as u64, and update any consumers of copy_to_sandbox /
copy_from_sandbox to accept/produce Vec<u8> instead of String.
| let vm = VmInstance { | ||
| id: id.to_string(), | ||
| socket_path: self.socket_path(id), | ||
| rootfs_path: vm_rootfs, | ||
| pid: Some(pid), | ||
| guest_cid, | ||
| tap_name, | ||
| guest_ip: subnet.guest_ip.clone(), | ||
| host_ip: subnet.host_ip.clone(), | ||
| state: VmLifecycleState::Running, | ||
| vcpus, | ||
| mem_mib, | ||
| labels: HashMap::new(), | ||
| port_mappings: HashMap::new(), | ||
| }; |
There was a problem hiding this comment.
sandbox_port_bindings() can never report active mappings.
New VMs start with port_mappings: HashMap::new(), and nothing in this file ever populates that map or installs inbound forwarding for published ports. The method therefore always returns an empty map, while packages/worker/src/functions/port.rs:360-370 expects real bindings to surface active host ports.
Also applies to: 503-511
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/worker/src/runtime/firecracker.rs` around lines 265 - 279, The
Sandbox never records or exposes active port forwards because VmInstance is
constructed with port_mappings: HashMap::new() and nothing populates it, so
sandbox_port_bindings() always returns empty; fix by ensuring port mappings are
created and stored when ports are published/forwarded — update the VM lifecycle
code paths that create networking (e.g., where VmInstance is created and where
inbound forwarding is installed) to add entries into VmInstance.port_mappings
and persist them (mirror the structure expected by sandbox_port_bindings()), and
update the functions that install NAT/iptables/tproxy forwarding to return the
assigned host port and write that into VmInstance.port_mappings (and any
teardown paths must remove them). Target symbols: VmInstance, port_mappings,
sandbox_port_bindings(), and the VM creation/networking routines that set up
tap/veth and publish ports so active bindings are discoverable by
packages/worker/src/functions/port.rs.
| let snap_dir = self.config.snapshot_dir.join(vm_id); | ||
| tokio::fs::create_dir_all(&snap_dir) | ||
| .await | ||
| .map_err(|e| format!("Failed to create snapshot dir: {e}"))?; | ||
|
|
||
| let snapshot_path = snap_dir.join("snapshot.bin"); | ||
| let mem_path = snap_dir.join("mem.bin"); | ||
|
|
||
| client | ||
| .create_snapshot(&SnapshotCreateParams { | ||
| snapshot_type: "Full".to_string(), | ||
| snapshot_path: snapshot_path.to_string_lossy().to_string(), | ||
| mem_file_path: mem_path.to_string_lossy().to_string(), | ||
| }) | ||
| .await?; | ||
|
|
||
| client.resume_instance().await?; | ||
|
|
||
| let snapshot_id = format!("fc-snap-{}-{}", vm_id, comment.replace(' ', "-")); | ||
| tracing::info!(vm_id = %vm_id, snapshot_id = %snapshot_id, "VM snapshot created"); | ||
|
|
||
| Ok(snapshot_id) | ||
| } | ||
|
|
||
| async fn inspect_image_size(&self, image_id: &str) -> Result<u64, String> { | ||
| if image_id.starts_with("fc-snap-") { | ||
| let snap_dir = self.config.snapshot_dir.join( | ||
| image_id.trim_start_matches("fc-snap-").split('-').next().unwrap_or("") | ||
| ); | ||
| let snapshot_path = snap_dir.join("snapshot.bin"); | ||
| let mem_path = snap_dir.join("mem.bin"); | ||
|
|
||
| let mut total = 0u64; | ||
| if let Ok(meta) = tokio::fs::metadata(&snapshot_path).await { | ||
| total += meta.len(); | ||
| } | ||
| if let Ok(meta) = tokio::fs::metadata(&mem_path).await { | ||
| total += meta.len(); | ||
| } | ||
| return Ok(total); | ||
| } | ||
|
|
||
| async fn inspect_image_size(&self, _image_id: &str) -> Result<u64, String> { | ||
| Err("Firecracker runtime not yet implemented".to_string()) | ||
| super::fc_rootfs::rootfs_size(&self.config.rootfs_cache_dir, image_id).await | ||
| } | ||
|
|
||
| async fn remove_image(&self, _image_id: &str) -> Result<(), String> { | ||
| Err("Firecracker runtime not yet implemented".to_string()) | ||
| async fn remove_image(&self, image_id: &str) -> Result<(), String> { | ||
| if image_id.starts_with("fc-snap-") { | ||
| let snap_dir = self.config.snapshot_dir.join( | ||
| image_id.trim_start_matches("fc-snap-").split('-').next().unwrap_or("") | ||
| ); | ||
| let _ = tokio::fs::remove_dir_all(&snap_dir).await; | ||
| return Ok(()); |
There was a problem hiding this comment.
Snapshot IDs do not map to unique snapshot data.
commit_sandbox() always writes to <snapshot_dir>/<vm_id>/snapshot.bin and mem.bin, so a second commit overwrites the first even though it returns a different snapshot_id. Then inspect_image_size() and remove_image() recover the directory by splitting on -, which also truncates hyphenated VM IDs. Store each snapshot under its own immutable path and key lookup/removal by the full returned ID.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/worker/src/runtime/firecracker.rs` around lines 525 - 576, The code
currently writes every snapshot to <snapshot_dir>/<vm_id>/ which causes
overwrites and the inspect/remove helpers reconstruct the path by splitting on
'-' which breaks for hyphenated vm_ids; fix commit_sandbox (where snapshot_id is
created and snapshot files are written), inspect_image_size, and remove_image to
use a unique, immutable per-snapshot path derived from the full snapshot id: (1)
when creating the snapshot in commit_sandbox, generate snapshot_id and create a
dedicated directory for that snapshot (e.g., snap_dir =
self.config.snapshot_dir.join(snapshot_id_safe)) and write snapshot.bin and
mem.bin there instead of using vm_id; ensure snapshot_id_safe encodes
comment/vm_id safely (URL-encode or use a UUID suffix) so it is a valid
directory name; (2) in inspect_image_size and remove_image, reconstruct the
exact snapshot directory by taking image_id.trim_start_matches("fc-snap-") (do
NOT .split('-').next()) and using that whole remainder as the directory name to
locate snapshot.bin and mem.bin or remove the directory; this preserves
uniqueness and avoids truncating hyphenated vm IDs.
Critical:
- Remove shell injection vectors in fc_rootfs.rs (replace sh -c with direct cp/tar args)
- Add 10MB Content-Length limit in guest agent to prevent memory exhaustion
- Fix PID 0 fallback in firecracker.rs (error instead of unwrap_or(0))
- Add SIGCHLD ignore in guest agent to prevent zombie processes
Important:
- Add socket timeouts (30s) to fc_api.rs raw_request
- Subnet allocator now reclaims released IDs via free list + idempotent allocation
- TAP device cleanup on partial creation failure
- Container cleanup on OCI export failure (deferred cleanup pattern)
- Cleanup rootfs/subnet/TAP/socket on create_sandbox failure
- Atomic rootfs creation via temp file + rename
- Propagate create_dir_all errors in extract_layers
- Propagate set_permissions errors in guest agent file write
- Validate HTTP status before parsing body in fc_agent.rs
- Error on malformed HTTP responses in fc_api.rs (was silent Ok(""))
- extract_status_code returns Result (was defaulting to 200)
- Snapshot ID uses __ delimiter (was - which conflicts with hyphenated vm_ids)
Tests: 385 pass (worker) + 11 pass (guest-agent), 2 new subnet reclamation tests
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
packages/worker/src/runtime/firecracker.rs (1)
70-75: CID allocation is monotonic without recycling.Unlike subnet allocation which now recycles IDs,
allocate_cid()increments indefinitely. For long-running workers with many VM create/destroy cycles, consider recycling CIDs when VMs are removed, similar to the subnet allocator'sfree_listapproach.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/worker/src/runtime/firecracker.rs` around lines 70 - 75, allocate_cid currently only increments next_cid (monotonic) and should be changed to reuse freed IDs: add a free_list (e.g., VecDeque<u32> or HashSet) alongside next_cid and next_cid.lock(), modify allocate_cid() to first check and pop an ID from free_list before using and incrementing next_cid, add a public async free_cid(&self, cid: u32) that pushes returned CIDs onto free_list, and call free_cid from the VM teardown/remove path (where VMs are destroyed) so CIDs are recycled similar to the subnet allocator’s free_list approach.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/worker/src/runtime/fc_agent.rs`:
- Line 9: AGENT_PORT is set to 52 but the guest agent listens on 8052; update
the constant AGENT_PORT in fc_agent.rs from 52 to 8052 (keep type u32) so the
runtime connects to the correct guest agent port.
---
Nitpick comments:
In `@packages/worker/src/runtime/firecracker.rs`:
- Around line 70-75: allocate_cid currently only increments next_cid (monotonic)
and should be changed to reuse freed IDs: add a free_list (e.g., VecDeque<u32>
or HashSet) alongside next_cid and next_cid.lock(), modify allocate_cid() to
first check and pop an ID from free_list before using and incrementing next_cid,
add a public async free_cid(&self, cid: u32) that pushes returned CIDs onto
free_list, and call free_cid from the VM teardown/remove path (where VMs are
destroyed) so CIDs are recycled similar to the subnet allocator’s free_list
approach.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a5433d90-5363-4e0e-8ad5-5108cc3f4a64
📒 Files selected for processing (7)
packages/guest-agent/Cargo.tomlpackages/guest-agent/src/main.rspackages/worker/src/runtime/fc_agent.rspackages/worker/src/runtime/fc_api.rspackages/worker/src/runtime/fc_network.rspackages/worker/src/runtime/fc_rootfs.rspackages/worker/src/runtime/firecracker.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- packages/guest-agent/src/main.rs
- packages/guest-agent/Cargo.toml
| use super::fc_types::*; | ||
| use crate::types::{ExecResult, FileInfo, FileMetadata, SandboxMetrics}; | ||
|
|
||
| const AGENT_PORT: u32 = 52; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the port used by the guest agent
rg -n "AGENT_PORT" packages/guest-agent/src/main.rs
rg -n '"8052"' packages/guest-agent/src/main.rsRepository: iii-hq/sandbox
Length of output: 154
🏁 Script executed:
rg -n "AGENT_PORT" packages/worker/src/runtime/fc_agent.rs -A 3 -B 1Repository: iii-hq/sandbox
Length of output: 684
🏁 Script executed:
# Check how vsock communication is set up
rg -n "vsock\|Vsock" packages/worker/src/runtime/fc_agent.rs -A 2 -B 2Repository: iii-hq/sandbox
Length of output: 40
🏁 Script executed:
# Look for socket address/connection setup
rg -n "52|8052" packages/worker/src/runtime/fc_agent.rsRepository: iii-hq/sandbox
Length of output: 87
🏁 Script executed:
rg -n "AGENT_PORT\|8052\|vsock" packages/guest-agent/src/main.rs -A 5 -B 2Repository: iii-hq/sandbox
Length of output: 40
🏁 Script executed:
# Check if guest-agent source shows socket file naming
cat packages/guest-agent/src/main.rs | head -50Repository: iii-hq/sandbox
Length of output: 1598
🏁 Script executed:
# Check if there are any comments or documentation about port selection
rg -n "52|8052" packages/worker/src/runtime/fc_agent.rs packages/guest-agent/src/main.rs -B 3 -A 1 --multilineRepository: iii-hq/sandbox
Length of output: 856
🏁 Script executed:
# Search for any configuration files or documentation mentioning these ports
fd -t f "\.toml|\.md|\.txt|\.yaml|\.yml" | xargs rg -l "52|8052|AGENT_PORT" 2>/dev/null | head -20Repository: iii-hq/sandbox
Length of output: 110
🏁 Script executed:
# Check if the scaling plan mentions port configuration
cat docs/plans/2026-03-10-scaling-plan.md | grep -A 10 -B 10 "52\|8052\|port\|AGENT_PORT"Repository: iii-hq/sandbox
Length of output: 1124
🏁 Script executed:
# Check if there are any references linking fc_agent and guest-agent
rg -n "guest.*agent\|guest-agent" packages/worker/src/runtime/fc_agent.rsRepository: iii-hq/sandbox
Length of output: 40
🏁 Script executed:
# Check if there's any documentation about the vsock communication setup
rg -n "vsock\|UDS\|unix.*socket" packages/worker/src/runtime/fc_agent.rs -B 2 -A 2Repository: iii-hq/sandbox
Length of output: 40
🏁 Script executed:
# Look for any recent changes or comments about port 52 vs 8052
git log --oneline packages/worker/src/runtime/fc_agent.rs packages/guest-agent/src/main.rs 2>/dev/null | head -20Repository: iii-hq/sandbox
Length of output: 122
🏁 Script executed:
# Check the recent fix commit
git show ea73ae2 --stat | head -50Repository: iii-hq/sandbox
Length of output: 2835
🏁 Script executed:
# Look for the actual change related to ports in that commit
git show ea73ae2 packages/worker/src/runtime/fc_agent.rs | head -100Repository: iii-hq/sandbox
Length of output: 4145
🏁 Script executed:
# Check how the vsock_uds_path is set up and if there's any mapping of ports
rg -n "vsock_uds_path\|AGENT_PORT" packages/worker/src/runtime/ -B 3 -A 3Repository: iii-hq/sandbox
Length of output: 40
🏁 Script executed:
# Check if there's any firecracker configuration or setup that maps vsock ports
rg -n "52\|8052" packages/worker/src/runtime/ --type rustRepository: iii-hq/sandbox
Length of output: 40
🏁 Script executed:
# Check how AgentClient is instantiated to understand the full picture
rg -n "AgentClient::new" packages/worker/src/runtime/Repository: iii-hq/sandbox
Length of output: 324
🏁 Script executed:
# Check how AgentClient is instantiated with the vsock_uds_path
rg -n "AgentClient::new" packages/worker/src/runtime/firecracker.rs -B 5 -A 2Repository: iii-hq/sandbox
Length of output: 669
🏁 Script executed:
# Look for vsock socket setup in firecracker
rg -n "vsock\|sock" packages/worker/src/runtime/firecracker.rs -B 2 -A 2 | head -100Repository: iii-hq/sandbox
Length of output: 40
🏁 Script executed:
# Check vsock_path method to understand how the socket is set up
rg -n "vsock_path\|fn vsock" packages/worker/src/runtime/firecracker.rs -A 5Repository: iii-hq/sandbox
Length of output: 40
🏁 Script executed:
# Look for any firecracker configuration that maps vsock ports to TCP
rg -n "8052\|guest.*port\|vsock.*port" packages/worker/src/runtime/firecracker.rsRepository: iii-hq/sandbox
Length of output: 40
🏁 Script executed:
# Check if guest-agent is exposed via vsock somehow
rg -n "vsock\|guest.agent" packages/ --type rust | grep -i "port\|listen\|bind"Repository: iii-hq/sandbox
Length of output: 40
🏁 Script executed:
# Check the full vsock_path method implementation
cat packages/worker/src/runtime/firecracker.rs | grep -A 20 "fn vsock_path"Repository: iii-hq/sandbox
Length of output: 1595
🏁 Script executed:
# Look for any documentation or setup of vsock socket naming
rg -n "vsock\|uds" packages/worker/src/runtime/firecracker.rs -iRepository: iii-hq/sandbox
Length of output: 40
🏁 Script executed:
# Check if there's a vsock proxy or bridge component
fd -t f "vsock\|proxy" packages/worker/src/runtime/Repository: iii-hq/sandbox
Length of output: 40
🏁 Script executed:
# Search entire worker package for references to port 52 or the socket path construction
rg -n "format\|vsock_uds_path" packages/worker/src/ -A 1 | grep -B 1 -A 1 "52\|uds_path"Repository: iii-hq/sandbox
Length of output: 40
🏁 Script executed:
# Search for vsock proxy or daemon components
fd -t f -i "proxy\|daemon\|bridge" packages/Repository: iii-hq/sandbox
Length of output: 40
🏁 Script executed:
# Look for Dockerfile or scripts that might set up the guest agent
rg -n "AGENT_PORT\|8052\|guest-agent" images/ --type bash --type dockerfileRepository: iii-hq/sandbox
Length of output: 86
🏁 Script executed:
# Check if there's a firecracker launch or vm start that might set up proxy
rg -n "vsock\|proxy\|listen" packages/worker/src/runtime/firecracker.rs -iRepository: iii-hq/sandbox
Length of output: 40
🏁 Script executed:
# Search for any kernel module or helper process
rg -n "vsock\|vhost" packages/ --type rust -iRepository: iii-hq/sandbox
Length of output: 40
Fix port mismatch with guest agent.
AGENT_PORT is hardcoded to 52, but the guest agent defaults to port 8052. Update the constant to align:
-const AGENT_PORT: u32 = 52;
+const AGENT_PORT: u32 = 8052;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const AGENT_PORT: u32 = 52; | |
| const AGENT_PORT: u32 = 8052; |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/worker/src/runtime/fc_agent.rs` at line 9, AGENT_PORT is set to 52
but the guest agent listens on 8052; update the constant AGENT_PORT in
fc_agent.rs from 52 to 8052 (keep type u32) so the runtime connects to the
correct guest agent port.
- Guest agent: PTY terminal support (forkpty, 5 endpoints, 16 session limit) - Terminal: runtime-agnostic — Docker exec or FC agent PTY via backend dispatch - CI: guest-agent cross-compilation (x86_64/aarch64 musl), release artifacts - Init system: kernel cmdline networking, graceful shutdown, auto-start agent - Integration tests: 48 unit + 6 conditional E2E tests for FC backend - Dockerfile: embeds guest-agent binary per TARGETARCH 460 tests passing (394 worker + 48 FC integration + 18 guest-agent)
There was a problem hiding this comment.
Actionable comments posted: 7
♻️ Duplicate comments (9)
packages/worker/src/runtime/fc_rootfs.rs (4)
46-56:⚠️ Potential issue | 🔴 CriticalFail the rootfs build if the guest-agent binary is missing.
Skipping this copy publishes an ext4 that can never satisfy
wait_for_agent()inpackages/worker/src/runtime/firecracker.rs. That should abort before the cache entry is finalized.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/worker/src/runtime/fc_rootfs.rs` around lines 46 - 56, The code currently silently skips copying when agent_path doesn't exist; change it to fail the rootfs build instead: in fc_rootfs.rs replace the conditional so that if agent_path.exists() it copies as before, but otherwise return an Err (propagate a formatted error like "Missing guest-agent binary at {agent_path}") so the rootfs build aborts and prevents publishing an unusable image that can't satisfy wait_for_agent() in firecracker.rs.
304-339:⚠️ Potential issue | 🔴 CriticalPropagate non-zero
cpexits fromcreate_ext4().A short copy here is still treated as success, so ENOSPC or permission failures can cache a partially populated ext4. Log the stderr if useful, but return an error after unmounting.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/worker/src/runtime/fc_rootfs.rs` around lines 304 - 339, The loop in create_ext4 currently treats non-zero cp exits as warnings only and proceeds, risking a partially populated ext4; change the handling in the match arm for Ok(o) where !o.status.success() to set cp_error (including stderr text and the src path) and break the loop so the function returns an error after the cleanup; keep the existing unmount (Command::new("umount")) and fs::remove_dir calls before returning, and reference cp_error, the Command::new("cp") invocation, and the entries.next_entry() loop when making the change.
13-18:⚠️ Potential issue | 🟠 MajorSame-image builds still race on shared temp paths.
The
rootfs_path.exists()check does not serialize writers: concurrentensure_rootfs()calls for the same image still share{safe_name}.tar,{safe_name}_extract,{safe_name}_merged,{safe_name}.ext4.tmp, and_mount_tmp. Two builders can still trample each other and publish a corrupted cache entry.Also applies to: 28-44, 60-63, 288-291
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/worker/src/runtime/fc_rootfs.rs` around lines 13 - 18, Concurrent ensure_rootfs() callers race because the current exists() check doesn't serialize writers; add a per-image exclusive lock (e.g., a lock file keyed by safe_name) and acquire it before checking rootfs_path.exists() and before creating/writing any of the temporary artifacts ({safe_name}.tar, {safe_name}_extract, {safe_name}_merged, {safe_name}.ext4.tmp, _mount_tmp), then release the lock after atomically renaming the temp .ext4.tmp to {safe_name}.ext4; ensure the lock is held across the whole create/publish sequence in ensure_rootfs() and that locks are released on error to avoid deadlocks.
29-45:⚠️ Potential issue | 🔴 CriticalThe export/merge pipeline drops the extracted rootfs.
Docker/Bollard
export_container()returns a flattened filesystem tar, not OCI layers. Whenmanifest.jsonis absent,merge_layers()only looks for nested*.tarfiles, somerged_dirends up empty instead of containing the extracted container contents.Also applies to: 118-149, 166-170, 212-254
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/worker/src/runtime/fc_rootfs.rs` around lines 29 - 45, The pipeline assumes exported images contain OCI layer tars but Docker/Bollard export produces a flattened rootfs, so after export_image_to_tar(...) -> extract_tar(...)-> extract_layers(...) there may be no layer tars and merge_layers(...) produces an empty merged_dir; change the flow so that after extract_tar you detect whether any layer tars exist (update extract_layers to return a list/count or add a helper like find_layer_tars), and if none are found treat the extracted rootfs as the single layer by either copying/moving the extracted_dir contents into merged_dir or invoking merge_layers with the extracted_dir as the only source; update calls around extract_tar, extract_layers, and merge_layers to handle this fallback for export_container-style archives.packages/worker/src/runtime/firecracker.rs (4)
77-113:⚠️ Potential issue | 🟠 MajorManage Firecracker children with a reapable handle, not optimistic state + raw PIDs.
After
spawn()you droptokio::process::Child,stop_sandbox()marks the VM stopped before TERM is observed, andremove_sandbox()only killsRunningVMs. That combination can leave a setup failure, paused VM, or slow shutdown with a live/unreaped Firecracker process and no handle left to clean it up.Also applies to: 219-233, 324-375
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/worker/src/runtime/firecracker.rs` around lines 77 - 113, start_firecracker_process currently spawns Firecracker and drops the tokio::process::Child (using only the PID) which prevents proper reaping and later cleanup; change this to return and persist a reapable Child handle (or store it in the runtime's VM state) instead of relying on raw PIDs, ensure spawn()'s Child is saved (e.g., in the struct that tracks VM state created by start_firecracker_process), and update stop_sandbox and remove_sandbox to use that stored Child handle to send TERM/kill and await its exit (or reap) rather than marking state optimistically and killing by PID so slow shutdowns, start failures, or paused VMs are reliably cleaned up.
228-241:⚠️ Potential issue | 🟠 Major
sandbox_port_bindings()can never report active mappings.New VMs always start with
port_mappings: HashMap::new(), and nothing in this file ever records forwarded host ports. The getter later in the trait impl therefore only returns empty maps.Also applies to: 533-541
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/worker/src/runtime/firecracker.rs` around lines 228 - 241, The sandbox_port_bindings getter is always empty because new VmInstance is initialized with port_mappings: HashMap::new() and no code records forwarded host ports; populate and persist that map when you create or modify port forwards. Specifically, update the VmInstance construction (the VmInstance { ... port_mappings: ... } block) to accept an initial mapping if available, then instrument the code paths that set up host->guest port forwards (the function that configures tap/nat/firewall rules) to insert entries into VmInstance.port_mappings (and persist them alongside other VM state), and ensure the trait impl method sandbox_port_bindings reads and returns this stored map instead of always returning an empty HashMap.
48-54:⚠️ Potential issue | 🔴 CriticalValidate IDs/names before joining them into host paths.
Raw values are later used for socket, rootfs, snapshot, and volume paths. An input containing
/or..can escape the managed directories and make create/remove operations touch arbitrary host files.Also applies to: 293-295, 555-556, 658-667
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/worker/src/runtime/firecracker.rs` around lines 48 - 54, The VM ID and other name inputs are used to build host filesystem paths and must be validated to prevent path traversal; add a shared validator (e.g., validate_id or sanitize_vm_id) that rejects or normalizes any id containing path separators, "." or ".." components, empty strings, or absolute paths and permits only a narrow charset (e.g., alphanumerics, dashes, underscores, dots) and call it from socket_path and vsock_path (and the other places noted: the rootfs/snapshot/volume path builders around the other ranges) before joining into PathBufs; return or propagate an error on invalid IDs rather than joining them.
555-576:⚠️ Potential issue | 🔴 CriticalReturned snapshot IDs still do not map to unique snapshot data.
Every commit for the same VM writes to
<snapshot_dir>/<vm_id>/snapshot.binandmem.bin, theninspect_image_size()/remove_image()recover that directory by trimming the comment back off the returned ID. A second snapshot overwrites the first even though it returns a new ID.Also applies to: 579-605
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/worker/src/runtime/firecracker.rs` around lines 555 - 576, The current snapshot creation always writes to snap_dir.join(vm_id)/snapshot.bin and mem.bin so later snapshots overwrite earlier ones even though snapshot_id is unique; change create_snapshot (in the method that builds snapshot_path and mem_path) to write each snapshot into a unique per-snapshot location (e.g., snap_dir.join(snapshot_id) or include a timestamp/uuid in the filenames) so the on-disk paths incorporate snapshot_id instead of always using snapshot.bin/mem.bin, and update any consumers (inspect_image_size() and remove_image()) to use the full snapshot_id-based path (do not rely on trimming the comment from snapshot_id) or persist a direct mapping from snapshot_id -> filesystem path; update references to snapshot_path, mem_path, snapshot_id, create_snapshot, inspect_image_size(), and remove_image() accordingly.packages/worker/src/runtime/fc_types.rs (1)
198-213:⚠️ Potential issue | 🟠 MajorMake the file-copy agent payloads binary-safe.
The runtime trait in
packages/worker/src/runtime/mod.rsis byte-oriented, soStringhere still breaks non-UTF-8 round-trips. Use a binary-safe wire type for both request and response content fields.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/worker/src/runtime/fc_types.rs` around lines 198 - 213, AgentFileWriteRequest and AgentFileReadResponse use String for file payloads which breaks non-UTF-8 round-trips; change the payload fields to a binary-safe type. Update AgentFileWriteRequest.content and AgentFileReadResponse.content from String to Vec<u8> (and optionally add #[serde(with = "serde_bytes")] on those fields to preserve compact binary serialization), leaving path as String; then update any code that constructs/consumes AgentFileWriteRequest or AgentFileReadResponse to pass/expect raw bytes instead of UTF-8 strings. Ensure the structs still derive Serialize/Deserialize after the change.
🧹 Nitpick comments (1)
packages/worker/tests/common/mod.rs (1)
75-86: Minor code duplication betweenfc_available()andfc_binary_path().Both functions read the same environment variable with the same default. Consider extracting to reduce duplication:
♻️ Optional refactor
pub fn fc_available() -> bool { - let fc_bin = std::env::var("FIRECRACKER_BIN") - .unwrap_or_else(|_| "/usr/bin/firecracker".to_string()); - std::path::Path::new(&fc_bin).exists() + fc_binary_path().exists() }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/worker/tests/common/mod.rs` around lines 75 - 86, Extract the duplicated environment lookup into a single helper used by both functions: create a function (e.g., fc_bin_from_env or fc_default_path) that reads std::env::var("FIRECRACKER_BIN") with the "/usr/bin/firecracker" default and returns a PathBuf or String, then update fc_available to call that helper and check .exists() and update fc_binary_path to return the helper's PathBuf; update signatures and imports as needed and ensure names fc_available and fc_binary_path remain unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/guest-agent/src/main.rs`:
- Line 531: The call to libc::forkpty uses an unnecessary mutable reference for
the winsize parameter; change the third/fourth argument usage so that you pass a
shared reference (use &ws instead of &mut ws) and, if ws was declared as mutable
solely for this call, remove the mut from its binding; update the call site in
the forkpty invocation (master_fd, std::ptr::null_mut(), std::ptr::null_mut(),
ws) to use &ws and ensure the winsize variable and master_fd usage align with
that non-mutable reference.
In `@packages/worker/src/functions/terminal.rs`:
- Around line 284-304: The current Firecracker branch calls
agent.terminal_close(exec_id).await but proceeds to remove KV/session metadata
regardless of success; change it so you check the Result from
agent.terminal_close and only delete the session when terminal_close returns Ok.
Locate the Firecracker-specific block (the downcast via
rt.as_any().downcast_ref::<FirecrackerRuntime>(), vm_id computed from id,
exec_id extracted from session, and the call to fc_rt.agent_client(vm_id).await
-> agent.terminal_close(exec_id).await) and gate the subsequent session removal
on the terminal_close outcome (handle Err by logging and preserving the session
instead of deleting it).
In `@packages/worker/src/runtime/fc_agent.rs`:
- Around line 291-310: The AgentTerminalReadResponse deserialization fails when
the guest-agent omits eof; update the response struct to tolerate missing eof by
making the field optional/defaulted (e.g., change the eof field in
AgentTerminalReadResponse to have #[serde(default)] and keep it as bool or
Option<bool> with an explicit default false) so terminal_read (which calls
send_request("/terminal/read", ...)) can successfully decode responses that only
include data; alternatively, ensure the guest-agent actually includes "eof" in
its JSON responses if you prefer fixing the producer instead of the consumer.
- Around line 272-289: The terminal_write method is base64-encoding input before
sending, but the guest-agent (reads req.data.as_bytes()) expects plain text;
update terminal_write (and the AgentTerminalWriteRequest payload) to send the
raw text string instead of base64: convert the incoming data: &[u8] to a UTF-8
string (e.g., String::from_utf8 or from_utf8_lossy) and set req.data to that
plain string, then call send_request("/terminal/write", &req, DEFAULT_TIMEOUT)
as before so the guest-agent receives plain text matching its test expectations.
In `@packages/worker/src/runtime/fc_init.rs`:
- Around line 100-114: The PID 1 loop currently busy-waits because a plain wait
returns immediately when there are no children; update the shutdown/monitoring
loop to avoid spinning by waiting specifically for the guest agent and sleeping
when no children exist: after launching AGENT_BIN and setting AGENT_PID, use
targeted waits (e.g., wait "$AGENT_PID" or wait -n) and detect when wait returns
with no children, then sleep (e.g., sleep 1) before retrying; also handle agent
exit by logging and optionally restarting or breaking the loop. Modify the loop
that references AGENT_PID and the unconditional wait || true to perform
process-aware waiting and a short sleep when no child is present to prevent CPU
busy-looping.
In `@packages/worker/src/runtime/firecracker.rs`:
- Around line 552-571: After calling self.fc_client(vm_id).pause_instance(),
ensure the VM is always resumed even if snapshot creation fails: introduce a
scope guard or a try/finally-style cleanup that calls
client.resume_instance().await on all paths (success or error). Locate the
pause_instance()/create_snapshot()/resume_instance() sequence (symbols:
fc_client, pause_instance, create_snapshot, resume_instance,
SnapshotCreateParams, vm_id) and either wrap the snapshot logic in a block that
resumes in a finally/Err arm or implement a small Drop-based guard (e.g.,
ResumeOnDrop) that holds the client and invokes resume_instance().await if the
snapshot flow exits with an error, so the sandbox cannot remain paused.
In `@scripts/build-init.sh`:
- Around line 10-11: The script accepts --agent-path but never applies it to the
generated init script: capture the CLI arg into AGENT_PATH (or similar) and
substitute it into the heredoc instead of the hardcoded
AGENT_BIN="/usr/local/bin/iii-guest-agent" so the generated init uses the
provided path; update the printed help/usage lines (the printf for --agent-path)
to match the variable name and ensure any later references to AGENT_BIN or the
agent binary in the heredoc and after (e.g., where AGENT_BIN is defined or used)
use "${AGENT_PATH:-/usr/local/bin/iii-guest-agent}" to keep the default when not
provided.
---
Duplicate comments:
In `@packages/worker/src/runtime/fc_rootfs.rs`:
- Around line 46-56: The code currently silently skips copying when agent_path
doesn't exist; change it to fail the rootfs build instead: in fc_rootfs.rs
replace the conditional so that if agent_path.exists() it copies as before, but
otherwise return an Err (propagate a formatted error like "Missing guest-agent
binary at {agent_path}") so the rootfs build aborts and prevents publishing an
unusable image that can't satisfy wait_for_agent() in firecracker.rs.
- Around line 304-339: The loop in create_ext4 currently treats non-zero cp
exits as warnings only and proceeds, risking a partially populated ext4; change
the handling in the match arm for Ok(o) where !o.status.success() to set
cp_error (including stderr text and the src path) and break the loop so the
function returns an error after the cleanup; keep the existing unmount
(Command::new("umount")) and fs::remove_dir calls before returning, and
reference cp_error, the Command::new("cp") invocation, and the
entries.next_entry() loop when making the change.
- Around line 13-18: Concurrent ensure_rootfs() callers race because the current
exists() check doesn't serialize writers; add a per-image exclusive lock (e.g.,
a lock file keyed by safe_name) and acquire it before checking
rootfs_path.exists() and before creating/writing any of the temporary artifacts
({safe_name}.tar, {safe_name}_extract, {safe_name}_merged, {safe_name}.ext4.tmp,
_mount_tmp), then release the lock after atomically renaming the temp .ext4.tmp
to {safe_name}.ext4; ensure the lock is held across the whole create/publish
sequence in ensure_rootfs() and that locks are released on error to avoid
deadlocks.
- Around line 29-45: The pipeline assumes exported images contain OCI layer tars
but Docker/Bollard export produces a flattened rootfs, so after
export_image_to_tar(...) -> extract_tar(...)-> extract_layers(...) there may be
no layer tars and merge_layers(...) produces an empty merged_dir; change the
flow so that after extract_tar you detect whether any layer tars exist (update
extract_layers to return a list/count or add a helper like find_layer_tars), and
if none are found treat the extracted rootfs as the single layer by either
copying/moving the extracted_dir contents into merged_dir or invoking
merge_layers with the extracted_dir as the only source; update calls around
extract_tar, extract_layers, and merge_layers to handle this fallback for
export_container-style archives.
In `@packages/worker/src/runtime/fc_types.rs`:
- Around line 198-213: AgentFileWriteRequest and AgentFileReadResponse use
String for file payloads which breaks non-UTF-8 round-trips; change the payload
fields to a binary-safe type. Update AgentFileWriteRequest.content and
AgentFileReadResponse.content from String to Vec<u8> (and optionally add
#[serde(with = "serde_bytes")] on those fields to preserve compact binary
serialization), leaving path as String; then update any code that
constructs/consumes AgentFileWriteRequest or AgentFileReadResponse to
pass/expect raw bytes instead of UTF-8 strings. Ensure the structs still derive
Serialize/Deserialize after the change.
In `@packages/worker/src/runtime/firecracker.rs`:
- Around line 77-113: start_firecracker_process currently spawns Firecracker and
drops the tokio::process::Child (using only the PID) which prevents proper
reaping and later cleanup; change this to return and persist a reapable Child
handle (or store it in the runtime's VM state) instead of relying on raw PIDs,
ensure spawn()'s Child is saved (e.g., in the struct that tracks VM state
created by start_firecracker_process), and update stop_sandbox and
remove_sandbox to use that stored Child handle to send TERM/kill and await its
exit (or reap) rather than marking state optimistically and killing by PID so
slow shutdowns, start failures, or paused VMs are reliably cleaned up.
- Around line 228-241: The sandbox_port_bindings getter is always empty because
new VmInstance is initialized with port_mappings: HashMap::new() and no code
records forwarded host ports; populate and persist that map when you create or
modify port forwards. Specifically, update the VmInstance construction (the
VmInstance { ... port_mappings: ... } block) to accept an initial mapping if
available, then instrument the code paths that set up host->guest port forwards
(the function that configures tap/nat/firewall rules) to insert entries into
VmInstance.port_mappings (and persist them alongside other VM state), and ensure
the trait impl method sandbox_port_bindings reads and returns this stored map
instead of always returning an empty HashMap.
- Around line 48-54: The VM ID and other name inputs are used to build host
filesystem paths and must be validated to prevent path traversal; add a shared
validator (e.g., validate_id or sanitize_vm_id) that rejects or normalizes any
id containing path separators, "." or ".." components, empty strings, or
absolute paths and permits only a narrow charset (e.g., alphanumerics, dashes,
underscores, dots) and call it from socket_path and vsock_path (and the other
places noted: the rootfs/snapshot/volume path builders around the other ranges)
before joining into PathBufs; return or propagate an error on invalid IDs rather
than joining them.
- Around line 555-576: The current snapshot creation always writes to
snap_dir.join(vm_id)/snapshot.bin and mem.bin so later snapshots overwrite
earlier ones even though snapshot_id is unique; change create_snapshot (in the
method that builds snapshot_path and mem_path) to write each snapshot into a
unique per-snapshot location (e.g., snap_dir.join(snapshot_id) or include a
timestamp/uuid in the filenames) so the on-disk paths incorporate snapshot_id
instead of always using snapshot.bin/mem.bin, and update any consumers
(inspect_image_size() and remove_image()) to use the full snapshot_id-based path
(do not rely on trimming the comment from snapshot_id) or persist a direct
mapping from snapshot_id -> filesystem path; update references to snapshot_path,
mem_path, snapshot_id, create_snapshot, inspect_image_size(), and remove_image()
accordingly.
---
Nitpick comments:
In `@packages/worker/tests/common/mod.rs`:
- Around line 75-86: Extract the duplicated environment lookup into a single
helper used by both functions: create a function (e.g., fc_bin_from_env or
fc_default_path) that reads std::env::var("FIRECRACKER_BIN") with the
"/usr/bin/firecracker" default and returns a PathBuf or String, then update
fc_available to call that helper and check .exists() and update fc_binary_path
to return the helper's PathBuf; update signatures and imports as needed and
ensure names fc_available and fc_binary_path remain unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3b7bbb24-263b-403d-8f5c-cfd2f7cff7cd
📒 Files selected for processing (15)
.github/workflows/ci.yml.github/workflows/release.ymlpackages/guest-agent/src/main.rspackages/worker/Dockerfilepackages/worker/src/functions/terminal.rspackages/worker/src/runtime/fc_agent.rspackages/worker/src/runtime/fc_init.rspackages/worker/src/runtime/fc_rootfs.rspackages/worker/src/runtime/fc_types.rspackages/worker/src/runtime/firecracker.rspackages/worker/src/runtime/mod.rspackages/worker/tests/common/mod.rspackages/worker/tests/firecracker_integration.rsscripts/build-init.shscripts/fetch-kernel.sh
| #[cfg(feature = "firecracker")] | ||
| { | ||
| let backend_str = session | ||
| .get("backend") | ||
| .and_then(|v| v.as_str()) | ||
| .unwrap_or("docker"); | ||
|
|
||
| if backend_str == "firecracker" { | ||
| let exec_id = session | ||
| .get("execId") | ||
| .and_then(|v| v.as_str()) | ||
| .unwrap_or(""); | ||
|
|
||
| if let Some(fc_rt) = rt.as_any().downcast_ref::<FirecrackerRuntime>() { | ||
| let vm_id = id.strip_prefix("iii-sbx-").unwrap_or(id); | ||
| if let Ok(agent) = fc_rt.agent_client(vm_id).await { | ||
| let _ = agent.terminal_close(exec_id).await; | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Don't delete the session if Firecracker terminal close failed.
This path removes the KV/session metadata even when agent.terminal_close() fails. With the guest agent's 16-session cap, a transient vsock error here can leak PTYs until the VM is recycled.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/worker/src/functions/terminal.rs` around lines 284 - 304, The
current Firecracker branch calls agent.terminal_close(exec_id).await but
proceeds to remove KV/session metadata regardless of success; change it so you
check the Result from agent.terminal_close and only delete the session when
terminal_close returns Ok. Locate the Firecracker-specific block (the downcast
via rt.as_any().downcast_ref::<FirecrackerRuntime>(), vm_id computed from id,
exec_id extracted from session, and the call to fc_rt.agent_client(vm_id).await
-> agent.terminal_close(exec_id).await) and gate the subsequent session removal
on the terminal_close outcome (handle Err by logging and preserving the session
instead of deleting it).
| pub async fn terminal_write( | ||
| &self, | ||
| session_id: &str, | ||
| data: &[u8], | ||
| ) -> Result<(), String> { | ||
| let req = AgentTerminalWriteRequest { | ||
| session_id: session_id.to_string(), | ||
| data: base64::Engine::encode( | ||
| &base64::engine::general_purpose::STANDARD, | ||
| data, | ||
| ), | ||
| }; | ||
|
|
||
| let _: serde_json::Value = self | ||
| .send_request("/terminal/write", &req, DEFAULT_TIMEOUT) | ||
| .await?; | ||
| Ok(()) | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify terminal_write handling in guest-agent
rg -n "TerminalWriteRequest|handle_terminal_write" packages/guest-agent/src/main.rs -A 10Repository: iii-hq/sandbox
Length of output: 1997
🏁 Script executed:
# Verify the worker code actually uses base64 encoding currently
rg -n "base64::Engine::encode" packages/worker/src/runtime/fc_agent.rs -B 5 -A 2Repository: iii-hq/sandbox
Length of output: 701
Terminal write sends base64-encoded text, but guest-agent expects plain text.
The terminal_write method base64-encodes the data before sending, but packages/guest-agent/src/main.rs:601 reads req.data.as_bytes() directly without decoding. This causes garbled terminal input (e.g., "ls\n" becomes "bHMK" written to the terminal). The guest-agent test at line 884 confirms the expected behavior is to receive plain text.
🔧 Proposed fix - send plain text
pub async fn terminal_write(
&self,
session_id: &str,
data: &[u8],
) -> Result<(), String> {
let req = AgentTerminalWriteRequest {
session_id: session_id.to_string(),
- data: base64::Engine::encode(
- &base64::engine::general_purpose::STANDARD,
- data,
- ),
+ data: String::from_utf8_lossy(data).to_string(),
};📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| pub async fn terminal_write( | |
| &self, | |
| session_id: &str, | |
| data: &[u8], | |
| ) -> Result<(), String> { | |
| let req = AgentTerminalWriteRequest { | |
| session_id: session_id.to_string(), | |
| data: base64::Engine::encode( | |
| &base64::engine::general_purpose::STANDARD, | |
| data, | |
| ), | |
| }; | |
| let _: serde_json::Value = self | |
| .send_request("/terminal/write", &req, DEFAULT_TIMEOUT) | |
| .await?; | |
| Ok(()) | |
| } | |
| pub async fn terminal_write( | |
| &self, | |
| session_id: &str, | |
| data: &[u8], | |
| ) -> Result<(), String> { | |
| let req = AgentTerminalWriteRequest { | |
| session_id: session_id.to_string(), | |
| data: String::from_utf8_lossy(data).to_string(), | |
| }; | |
| let _: serde_json::Value = self | |
| .send_request("/terminal/write", &req, DEFAULT_TIMEOUT) | |
| .await?; | |
| Ok(()) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/worker/src/runtime/fc_agent.rs` around lines 272 - 289, The
terminal_write method is base64-encoding input before sending, but the
guest-agent (reads req.data.as_bytes()) expects plain text; update
terminal_write (and the AgentTerminalWriteRequest payload) to send the raw text
string instead of base64: convert the incoming data: &[u8] to a UTF-8 string
(e.g., String::from_utf8 or from_utf8_lossy) and set req.data to that plain
string, then call send_request("/terminal/write", &req, DEFAULT_TIMEOUT) as
before so the guest-agent receives plain text matching its test expectations.
| pub async fn terminal_read( | ||
| &self, | ||
| session_id: &str, | ||
| ) -> Result<(Vec<u8>, bool), String> { | ||
| let req = AgentTerminalReadRequest { | ||
| session_id: session_id.to_string(), | ||
| }; | ||
|
|
||
| let resp: AgentTerminalReadResponse = self | ||
| .send_request("/terminal/read", &req, DEFAULT_TIMEOUT) | ||
| .await?; | ||
|
|
||
| let data = base64::Engine::decode( | ||
| &base64::engine::general_purpose::STANDARD, | ||
| &resp.data, | ||
| ) | ||
| .map_err(|e| format!("Failed to decode terminal data: {e}"))?; | ||
|
|
||
| Ok((data, resp.eof)) | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify terminal_read response structure in guest-agent
rg -n "handle_terminal_read" packages/guest-agent/src/main.rs -A 30Repository: iii-hq/sandbox
Length of output: 2130
🏁 Script executed:
#!/bin/bash
# Find and inspect AgentTerminalReadResponse struct definition
rg -n "struct AgentTerminalReadResponse" packages/worker/src/ -A 10Repository: iii-hq/sandbox
Length of output: 780
🏁 Script executed:
#!/bin/bash
# Get the complete struct definition with derives
sed -n '290,310p' packages/worker/src/runtime/fc_types.rsRepository: iii-hq/sandbox
Length of output: 577
Terminal read response missing eof field causes deserialization failures.
AgentTerminalReadResponse requires an eof: bool field (with no default), but the guest-agent at packages/guest-agent/src/main.rs:674 only returns {"data": ...}. Deserialization will fail when the worker processes the response.
The guest-agent must include the eof field in its response, or the struct must make the field optional with #[serde(default)].
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/worker/src/runtime/fc_agent.rs` around lines 291 - 310, The
AgentTerminalReadResponse deserialization fails when the guest-agent omits eof;
update the response struct to tolerate missing eof by making the field
optional/defaulted (e.g., change the eof field in AgentTerminalReadResponse to
have #[serde(default)] and keep it as bool or Option<bool> with an explicit
default false) so terminal_read (which calls send_request("/terminal/read",
...)) can successfully decode responses that only include data; alternatively,
ensure the guest-agent actually includes "eof" in its JSON responses if you
prefer fixing the producer instead of the consumer.
| AGENT_BIN="/usr/local/bin/iii-guest-agent" | ||
| if [ -x "$AGENT_BIN" ]; then | ||
| echo "Init: starting guest agent" | ||
| "$AGENT_BIN" & | ||
| AGENT_PID=$! | ||
| echo "Init: guest agent started (pid=$AGENT_PID)" | ||
| else | ||
| echo "Init: WARNING - guest agent not found at $AGENT_BIN" >&2 | ||
| fi | ||
|
|
||
| echo "Init: system ready" | ||
|
|
||
| while true; do | ||
| wait || true | ||
| done |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
head -120 packages/worker/src/runtime/fc_init.rs | tail -30Repository: iii-hq/sandbox
Length of output: 757
🏁 Script executed:
wc -l packages/worker/src/runtime/fc_init.rsRepository: iii-hq/sandbox
Length of output: 100
🏁 Script executed:
cat -n packages/worker/src/runtime/fc_init.rs | head -30Repository: iii-hq/sandbox
Length of output: 903
🏁 Script executed:
cat -n packages/worker/src/runtime/fc_init.rs | sed -n '70,130p'Repository: iii-hq/sandbox
Length of output: 2086
🏁 Script executed:
cat -n packages/worker/src/runtime/fc_init.rs | sed -n '29,70p'Repository: iii-hq/sandbox
Length of output: 1491
Avoid the no-child busy loop in PID 1.
When the guest agent is missing or exits, wait returns immediately and this loop spins forever consuming CPU. Add sleep and/or process monitoring to avoid busy-spinning when there are no child processes to wait on.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/worker/src/runtime/fc_init.rs` around lines 100 - 114, The PID 1
loop currently busy-waits because a plain wait returns immediately when there
are no children; update the shutdown/monitoring loop to avoid spinning by
waiting specifically for the guest agent and sleeping when no children exist:
after launching AGENT_BIN and setting AGENT_PID, use targeted waits (e.g., wait
"$AGENT_PID" or wait -n) and detect when wait returns with no children, then
sleep (e.g., sleep 1) before retrying; also handle agent exit by logging and
optionally restarting or breaking the loop. Modify the loop that references
AGENT_PID and the unconditional wait || true to perform process-aware waiting
and a short sleep when no child is present to prevent CPU busy-looping.
| let client = self.fc_client(vm_id); | ||
| client.pause_instance().await?; | ||
|
|
||
| let snap_dir = self.config.snapshot_dir.join(vm_id); | ||
| tokio::fs::create_dir_all(&snap_dir) | ||
| .await | ||
| .map_err(|e| format!("Failed to create snapshot dir: {e}"))?; | ||
|
|
||
| let snapshot_path = snap_dir.join("snapshot.bin"); | ||
| let mem_path = snap_dir.join("mem.bin"); | ||
|
|
||
| client | ||
| .create_snapshot(&SnapshotCreateParams { | ||
| snapshot_type: "Full".to_string(), | ||
| snapshot_path: snapshot_path.to_string_lossy().to_string(), | ||
| mem_file_path: mem_path.to_string_lossy().to_string(), | ||
| }) | ||
| .await?; | ||
|
|
||
| client.resume_instance().await?; |
There was a problem hiding this comment.
Always resume the VM when snapshot creation fails.
Any error after pause_instance() currently bubbles out with the sandbox still paused. Wrap the snapshot flow in a guard/finally so failures restore the running state before returning.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/worker/src/runtime/firecracker.rs` around lines 552 - 571, After
calling self.fc_client(vm_id).pause_instance(), ensure the VM is always resumed
even if snapshot creation fails: introduce a scope guard or a try/finally-style
cleanup that calls client.resume_instance().await on all paths (success or
error). Locate the pause_instance()/create_snapshot()/resume_instance() sequence
(symbols: fc_client, pause_instance, create_snapshot, resume_instance,
SnapshotCreateParams, vm_id) and either wrap the snapshot logic in a block that
resumes in a finally/Err arm or implement a small Drop-based guard (e.g.,
ResumeOnDrop) that holds the client and invokes resume_instance().await if the
snapshot flow exits with an error, so the sandbox cannot remain paused.
| printf " --agent-path P Path to guest agent binary inside rootfs\n" | ||
| printf " (default: /usr/local/bin/iii-guest-agent)\n" |
There was a problem hiding this comment.
--agent-path argument is accepted but never used.
The script accepts --agent-path and prints it at the end, but the generated init script hardcodes AGENT_BIN="/usr/local/bin/iii-guest-agent" (line 159). Users providing a custom path will see it echoed but the init will still look for the agent at the default location.
Either remove the misleading option or substitute it into the heredoc:
🔧 Proposed fix to use the AGENT_PATH variable
-AGENT_BIN="/usr/local/bin/iii-guest-agent"
+AGENT_BIN="__AGENT_PATH_PLACEHOLDER__"
if [ -x "$AGENT_BIN" ]; thenThen after the heredoc:
chmod 755 "${ROOTFS_DIR}/sbin/init"
+sed -i "s|__AGENT_PATH_PLACEHOLDER__|${AGENT_PATH}|g" "${ROOTFS_DIR}/sbin/init"Also applies to: 18-18, 159-159, 185-185
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/build-init.sh` around lines 10 - 11, The script accepts --agent-path
but never applies it to the generated init script: capture the CLI arg into
AGENT_PATH (or similar) and substitute it into the heredoc instead of the
hardcoded AGENT_BIN="/usr/local/bin/iii-guest-agent" so the generated init uses
the provided path; update the printed help/usage lines (the printf for
--agent-path) to match the variable name and ensure any later references to
AGENT_BIN or the agent binary in the heredoc and after (e.g., where AGENT_BIN is
defined or used) use "${AGENT_PATH:-/usr/local/bin/iii-guest-agent}" to keep the
default when not provided.
- README: benchmark table (200ms cold start, 59ms exec avg), competition bar chart, feature matrix vs 7 competitors, Firecracker architecture diagram, updated test counts (941 total) - docs/competitive-analysis.md: deep dive on 15 platforms with real published numbers and sources - scripts/test-e2e.sh: 26-test E2E suite covering lifecycle, exec, filesystem, env, git, pause/resume, parallel creates, throughput
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (2)
scripts/test-e2e.sh (2)
1-8: Consider documenting script dependencies.The script requires
curl,python3, and standard Unix utilities. Adding a brief comment at the top would help users identify missing dependencies quickly.Suggested documentation
#!/bin/bash +# E2E test suite for iii-sandbox API +# Dependencies: curl, python3, bash 4+ +# Usage: ./scripts/test-e2e.sh +# Environment: III_AUTH_TOKEN (default: demo123) set -euo pipefail🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/test-e2e.sh` around lines 1 - 8, Add a short top-of-file comment in scripts/test-e2e.sh (immediately after the shebang) that documents the script dependencies (e.g., requires bash, curl, python3, and common Unix utilities like sed/grep/awk) and any required environment variables (e.g., TOKEN/III_AUTH_TOKEN, API) so users can quickly identify missing tools before running the script.
233-259: Consider cleaning up temporary files.The parallel sandbox creation writes to
/tmp/iii-para-*.jsonbut these files are not cleaned up after the test. This is minor but could leave artifacts on repeated runs.Proposed cleanup
PARA_COUNT=$(echo "$PARA_IDS" | wc -w | tr -d ' ') if [ "$PARA_COUNT" -eq 3 ]; then PASS=$((PASS+1)) echo " $(green PASS) all 3 sandboxes created" else FAIL=$((FAIL+1)) echo " $(red FAIL) only $PARA_COUNT/3 created" fi +rm -f /tmp/iii-para-*.json🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/test-e2e.sh` around lines 233 - 259, The temp files /tmp/iii-para-*.json created during the parallel sandbox creation loop (see variables PARA_START/PARA_END, PARA_IDS and the api POST that writes to /tmp/iii-para-$i.json) are not removed; add cleanup to delete those files after you've extracted IDs (or in a trap/cleanup function) so repeated runs don't leave artifacts—remove /tmp/iii-para-1.json /tmp/iii-para-2.json /tmp/iii-para-3.json (or pattern /tmp/iii-para-*.json) immediately after the PARA_IDS loop or in a finally/trap block.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@README.md`:
- Around line 685-701: The fenced code block showing the benchmark table is
missing a language identifier which triggers MD040; update the opening fence
(the line with ```) that precedes the table (the block beginning with
"+-----------------------+----------+") to include a language token such as text
or console (e.g., change ``` to ```text) so the markdown linter recognizes the
code block language.
- Line 648: Update the README table row that currently reads "Firecracker |
Production | `bollard` + guest agent | `firecracker`" to accurately reflect
build/runtime structure: state that `bollard` is an unconditional dependency
(e.g., "bollard (always included)") and that the guest agent is a separate
binary (e.g., "guest agent binary"), and clarify that the `firecracker` feature
flag is empty and only gates runtime implementation code rather than adding
dependencies; keep the Production column and feature column but replace the
dependencies cell with wording like "bollard (always included) + guest agent
binary" or similar clear phrasing.
- Around line 656-675: The fenced ASCII diagram block starting with "Worker
(host)" is missing a language identifier; update the opening triple-backtick
fence to include a language token like text or ascii (e.g., change ``` to
```text) so the diagram renders correctly and satisfies the MD040 lint rule;
locate the diagram block by the "Worker (host)" header and modify its opening
fence accordingly.
- Around line 707-721: The fenced code block containing the latency/competitive
comparison chart in README.md lacks a language identifier and triggers MD040;
fix it by adding a language tag (e.g., use "text") to the opening
triple-backticks of that code block so it becomes ```text, leaving the chart
content unchanged; locate the block that starts with the ASCII chart (lines
showing "iii-sandbox (Docker) ..." through the timeline footer) and update the
opening fence accordingly.
- Around line 877-886: Update the README test-count table and any related text
to reflect the real test totals: change the "Rust SDK | 11 | 32" row to the
actual 99 tests, update "TypeScript (SDK/CLI/MCP) | 33 | ~293" to 379, correct
"Rust engine | 24 modules | 394" to show 76 (or split into 22 unit + 54
firecracker integration tests), and ensure the "Rust engine tests (333)"
reference elsewhere matches the corrected totals; adjust the table rows for
"Rust engine (FC)", "Python SDK", and "Guest agent" only if their counts differ,
and keep the table formatting and original row labels (e.g., the exact string
"Rust engine | 24 modules | 394" and "TypeScript (SDK/CLI/MCP) | 33 | ~293") so
changes are easy to locate.
In `@scripts/test-e2e.sh`:
- Around line 120-124: The mkdir invocation in scripts/test-e2e.sh is sending
the wrong payload key — change the api POST call to
"/sandbox/sandboxes/$SBX_ID/files/mkdir" to use {"paths":["/workspace/subdir"]}
(plural array) instead of {"path":"..."} and stop discarding its output (remove
the > /dev/null 2>&1) so failures are visible; update the mkdir call near the
api POST "/sandbox/sandboxes/$SBX_ID/files/mkdir" line and optionally assert its
success before proceeding to files/write and files/read.
---
Nitpick comments:
In `@scripts/test-e2e.sh`:
- Around line 1-8: Add a short top-of-file comment in scripts/test-e2e.sh
(immediately after the shebang) that documents the script dependencies (e.g.,
requires bash, curl, python3, and common Unix utilities like sed/grep/awk) and
any required environment variables (e.g., TOKEN/III_AUTH_TOKEN, API) so users
can quickly identify missing tools before running the script.
- Around line 233-259: The temp files /tmp/iii-para-*.json created during the
parallel sandbox creation loop (see variables PARA_START/PARA_END, PARA_IDS and
the api POST that writes to /tmp/iii-para-$i.json) are not removed; add cleanup
to delete those files after you've extracted IDs (or in a trap/cleanup function)
so repeated runs don't leave artifacts—remove /tmp/iii-para-1.json
/tmp/iii-para-2.json /tmp/iii-para-3.json (or pattern /tmp/iii-para-*.json)
immediately after the PARA_IDS loop or in a finally/trap block.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f9460ed8-4caf-4f15-b558-15d32d947fe6
📒 Files selected for processing (3)
README.mddocs/competitive-analysis.mdscripts/test-e2e.sh
| # nested dir | ||
| api POST "/sandbox/sandboxes/$SBX_ID/files/mkdir" '{"path":"/workspace/subdir"}' > /dev/null 2>&1 | ||
| api POST "/sandbox/sandboxes/$SBX_ID/files/write" '{"path":"/workspace/subdir/nested.txt","content":"nested"}' > /dev/null 2>&1 | ||
| RESP=$(api POST "/sandbox/sandboxes/$SBX_ID/files/read" '{"path":"/workspace/subdir/nested.txt"}') | ||
| assert_contains "nested file read" "nested" "$RESP" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the mkdir endpoint payload schema
rg -n "mkdir" --type ts --type rust -A5 -B2 | head -60Repository: iii-hq/sandbox
Length of output: 4716
🏁 Script executed:
#!/bin/bash
# Check the actual content of scripts/test-e2e.sh around lines 120-124
cat -n scripts/test-e2e.sh | sed -n '115,130p'Repository: iii-hq/sandbox
Length of output: 978
🏁 Script executed:
#!/bin/bash
# Check how the Rust mkdir function extracts the paths parameter
rg -n "fs::mkdir" --type rust -A 20 packages/worker/src/functions/filesystem.rsRepository: iii-hq/sandbox
Length of output: 1558
Payload mismatch in mkdir API call — uses "path" instead of "paths".
Line 121 uses {"path":"/workspace/subdir"}, but the API expects {"paths":["/workspace/subdir"]} (plural key with array value, as defined in the Rust implementation at packages/worker/src/functions/filesystem.rs:283). The call fails silently since output is discarded to /dev/null, making the issue harder to detect.
Proposed fix
-api POST "/sandbox/sandboxes/$SBX_ID/files/mkdir" '{"path":"/workspace/subdir"}' > /dev/null 2>&1
+api POST "/sandbox/sandboxes/$SBX_ID/files/mkdir" '{"paths":["/workspace/subdir"]}' > /dev/null 2>&1📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # nested dir | |
| api POST "/sandbox/sandboxes/$SBX_ID/files/mkdir" '{"path":"/workspace/subdir"}' > /dev/null 2>&1 | |
| api POST "/sandbox/sandboxes/$SBX_ID/files/write" '{"path":"/workspace/subdir/nested.txt","content":"nested"}' > /dev/null 2>&1 | |
| RESP=$(api POST "/sandbox/sandboxes/$SBX_ID/files/read" '{"path":"/workspace/subdir/nested.txt"}') | |
| assert_contains "nested file read" "nested" "$RESP" | |
| # nested dir | |
| api POST "/sandbox/sandboxes/$SBX_ID/files/mkdir" '{"paths":["/workspace/subdir"]}' > /dev/null 2>&1 | |
| api POST "/sandbox/sandboxes/$SBX_ID/files/write" '{"path":"/workspace/subdir/nested.txt","content":"nested"}' > /dev/null 2>&1 | |
| RESP=$(api POST "/sandbox/sandboxes/$SBX_ID/files/read" '{"path":"/workspace/subdir/nested.txt"}') | |
| assert_contains "nested file read" "nested" "$RESP" |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/test-e2e.sh` around lines 120 - 124, The mkdir invocation in
scripts/test-e2e.sh is sending the wrong payload key — change the api POST call
to "/sandbox/sandboxes/$SBX_ID/files/mkdir" to use
{"paths":["/workspace/subdir"]} (plural array) instead of {"path":"..."} and
stop discarding its output (remove the > /dev/null 2>&1) so failures are
visible; update the mkdir call near the api POST
"/sandbox/sandboxes/$SBX_ID/files/mkdir" line and optionally assert its success
before proceeding to files/write and files/read.
Reduce README from 891 to ~230 lines. Fix broken ASCII diagrams with plain-text that renders on GitHub. Move full API reference to docs/api-reference.md and worker architecture to docs/architecture.md.
Replace plain-text ASCII with mermaid (renders natively on GitHub). Add back cold start bar chart and feature comparison matrix.
1. fc_agent: send_request now accepts HTTP method param (was hardcoded POST) 2. firecracker: validate_vm_id rejects path traversal (/, \, ..) 3. fc_rootfs: missing guest-agent binary now fails the build 4. fc_rootfs: cp non-zero exit now returns Err (was silently continuing) 5. guest-agent: detached children reaped via background thread 6. guest-agent: timeout_ms now enforced with kill on deadline 7. fc_network: NAT uplink interface configurable via III_UPLINK_IFACE 8. fc_types: file content is already binary-safe (base64 on wire) — no change needed 443 tests passing, 0 failures.
Clippy 1.94+ flags &mut ws as unnecessary_mut_passed since forkpty takes *mut winsize (raw pointer). Using std::ptr::addr_of_mut!(ws) produces the raw pointer directly without creating a mutable reference.
Newer Ubuntu runners output "static-pie linked" instead of "statically linked". Match either form with grep -E. Also print the file output for easier debugging.
Summary
#[cfg(feature = "firecracker")]feature gatepackages/guest-agent/) that runs inside each VM for host↔guest communication via VSOCKSandboxRuntimetrait implementation — all 29 async methods, enabling drop-in replacement for Docker runtimeArchitecture
New Files (2,840 lines, 55 tests)
fc_types.rsfc_api.rsfc_agent.rsfc_network.rsfc_rootfs.rsfirecracker.rsguest-agent/main.rsKey Design Decisions
#[cfg(feature = "firecracker")]— zero impact on default Docker buildscommit_sandboxTest plan
cargo clippy --features firecrackerpasses cleancargo test --features firecracker— 55 new unit tests passcargo clippy(without feature) — no regressionscargo clippyandcargo testpassSummary by CodeRabbit
New Features
Chores
Tests
Docs