feat(gateway): add libkrun microVM gateway for hardware-isolated cluster bootstrap#76

Closed
drew wants to merge 4 commits into main from cluster-as-vm/dn

drew (Collaborator) commented Mar 3, 2026

This is just a prototype; it is not intended to be merged.

Summary

Introduces a new navigator gateway command group that uses libkrun to launch hardware-isolated microVMs via Apple Hypervisor.framework (macOS ARM64) or KVM (Linux), replacing Docker+k3s containers with a lighter-weight VM-based approach.

  • nav gateway run — ad-hoc microVM execution with a user-provided rootfs
  • nav gateway cluster — boots k3s inside a microVM with gvproxy networking, automatic rootfs extraction from Docker images, health checking, and kubeconfig extraction
  • navigator-gateway crate — safe Rust wrappers over ~15 libkrun C FFI functions with RAII, builder pattern, and support for virtio-fs, console redirect, TSI control, and gvproxy virtio-net networking
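The RAII and builder-pattern combination mentioned for the navigator-gateway crate can be illustrated with a minimal sketch. Everything here is hypothetical: `ContextBuilder`, `Context`, `BuildError`, and the `stub_*` functions are illustration-only names, and the stubs stand in for the libkrun C FFI so the example compiles without the library. It is not the crate's actual `KrunContextBuilder`.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Stand-ins for the C FFI (assumption: a real wrapper would call libkrun here).
static NEXT_ID: AtomicU32 = AtomicU32::new(0);
static LIVE: AtomicU32 = AtomicU32::new(0);

fn stub_create_ctx() -> u32 {
    LIVE.fetch_add(1, Ordering::SeqCst);
    NEXT_ID.fetch_add(1, Ordering::SeqCst)
}

fn stub_free_ctx(_id: u32) {
    LIVE.fetch_sub(1, Ordering::SeqCst);
}

/// Number of contexts currently allocated through the stubs.
pub fn live_contexts() -> u32 {
    LIVE.load(Ordering::SeqCst)
}

/// Owns one context id; dropping the value releases it exactly once (RAII).
pub struct Context {
    id: u32,
    pub vcpus: u8,
    pub ram_mib: u32,
    pub rootfs: String,
}

impl Drop for Context {
    fn drop(&mut self) {
        stub_free_ctx(self.id);
    }
}

#[derive(Debug, PartialEq)]
pub enum BuildError {
    MissingRootfs,
    ZeroVcpus,
}

/// Collects settings first; nothing is allocated until `build()` succeeds.
pub struct ContextBuilder {
    vcpus: u8,
    ram_mib: u32,
    rootfs: Option<String>,
}

impl ContextBuilder {
    pub fn new() -> Self {
        Self { vcpus: 1, ram_mib: 512, rootfs: None }
    }

    pub fn vcpus(mut self, n: u8) -> Self {
        self.vcpus = n;
        self
    }

    pub fn ram_mib(mut self, n: u32) -> Self {
        self.ram_mib = n;
        self
    }

    pub fn rootfs(mut self, path: &str) -> Self {
        self.rootfs = Some(path.to_string());
        self
    }

    pub fn build(self) -> Result<Context, BuildError> {
        // Validate before allocating so that error paths have nothing to free.
        if self.vcpus == 0 {
            return Err(BuildError::ZeroVcpus);
        }
        let rootfs = self.rootfs.ok_or(BuildError::MissingRootfs)?;
        Ok(Context { id: stub_create_ctx(), vcpus: self.vcpus, ram_mib: self.ram_mib, rootfs })
    }
}
```

Validating in `build()` before allocating keeps the error paths free of cleanup code, and the `Drop` impl means callers cannot forget to release the context.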

Key design decisions

  • gvproxy networking instead of TSI (Transparent Socket Impersonation) — TSI intercepts all guest connect() syscalls, which breaks k3s's internal localhost connections
  • Noop CNI plugin — libkrunfw kernel lacks bridge/netfilter modules; a shell script CNI delegates to host-local IPAM only, enough for node Ready + pod scheduling
  • tmpfs k3s data dir — SQLite locking issues on virtio-fs; state is ephemeral (acceptable for dev clusters)
  • Fork model — fork_start() boots VM in child process; parent polls /readyz, extracts kubeconfig from virtio-fs rootfs, blocks on waitpid
  • Boot retry — vm-init.sh retries k3s up to 3 times to handle transient kine SQLite race condition on tmpfs

New files

| Path | Description |
| --- | --- |
| crates/navigator-gateway/ | New crate: FFI bindings, KrunContextBuilder, error types |
| crates/navigator-cli/build.rs | rpath flags for libkrun/libkrunfw |
| deploy/gateway/vm-init.sh | Guest bootstrap: DHCP networking, noop CNI, k3s retry wrapper |
| crates/navigator-gateway/entitlements.plist | macOS hypervisor entitlement for codesigning |

Prerequisites

brew tap slp/krun && brew install libkrun   # libkrun + libkrunfw
# gvproxy from Podman at /opt/podman/bin/gvproxy

Usage

# Boot a cluster
nav gateway cluster --kube-port 6444

# Ad-hoc microVM
nav gateway run --rootfs ./my-rootfs /bin/echo "Hello from microVM"

drew added 3 commits March 3, 2026 10:26
…ter bootstrap

Introduce the navigator-gateway crate with safe Rust wrappers over the
libkrun C FFI, enabling lightweight microVM execution via Apple
Hypervisor.framework (macOS) or KVM (Linux).

Key components:
- navigator-gateway crate: KrunContextBuilder with RAII, ~15 FFI
  bindings, support for virtio-fs, console redirect, TSI control,
  and gvproxy virtio-net networking
- nav gateway run: ad-hoc microVM execution (direct enter model)
- nav gateway cluster: boots k3s inside a microVM with gvproxy
  networking, automatic rootfs extraction from Docker images, and
  port forwarding via gvproxy HTTP API
- vm-init.sh: guest bootstrap script with DHCP networking, noop CNI
  plugin (kernel lacks bridge module), and tmpfs-backed k3s data dir
- macOS codesigning with com.apple.security.hypervisor entitlement
  via auto-signing in scripts/bin/nav
- navigator-cli build.rs for libkrun/libkrunfw rpath resolution
…ot retry

After fork_start(), the parent now:
- Polls https://localhost:<kube_port>/readyz with 2s intervals (120s timeout)
- Checks child PID is alive between polls (fast-fail on VM crash)
- Reads kubeconfig from rootfs via virtio-fs (host-visible)
- Rewrites server URL and cluster name, stores in standard location
- Prints kubectl usage instructions when ready
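The polling steps above can be sketched as a small loop that takes the readiness probe and the liveness check as closures. This is a minimal sketch, not the PR's code: `wait_until_ready` and `WaitOutcome` are hypothetical names, and the actual HTTPS request to /readyz is left to the caller.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum WaitOutcome {
    Ready,
    ChildExited,
    TimedOut,
}

/// Polls `probe_ready` until it returns true, the child dies, or `timeout` elapses.
/// The liveness check runs before every probe so that a crashed VM fails fast.
pub fn wait_until_ready<P, A>(
    mut probe_ready: P,
    mut child_alive: A,
    interval: Duration,
    timeout: Duration,
) -> WaitOutcome
where
    P: FnMut() -> bool,
    A: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        if !child_alive() {
            return WaitOutcome::ChildExited;
        }
        if probe_ready() {
            return WaitOutcome::Ready;
        }
        if Instant::now() >= deadline {
            return WaitOutcome::TimedOut;
        }
        // The final sleep can overshoot the deadline by at most one interval.
        sleep(interval);
    }
}
```

With the values from the commit message, a caller would pass `Duration::from_secs(2)` as the interval and `Duration::from_secs(120)` as the timeout.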

Always forwards kube API port (ephemeral if --kube-port not specified)
to enable health checking. Adds --name flag for kubeconfig context naming.

Boot reliability: vm-init.sh now retries k3s up to 3 times with cleanup
between attempts, handling the transient kine SQLite race condition on
tmpfs that occasionally crashes k3s on first boot.

Exports rewrite_kubeconfig and store_kubeconfig from navigator-bootstrap
for use by the gateway cluster command.
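As a rough illustration of the server-URL rewrite step, the sketch below replaces the value of every `server:` line in a kubeconfig while preserving indentation. It is a hypothetical stand-in, not the `rewrite_kubeconfig` exported from navigator-bootstrap: the function name is invented, it works line by line rather than through a YAML parser, and it does not rename the cluster.

```rust
/// Replaces the value of each `server:` entry, keeping the line's indentation.
/// Note: the result has no trailing newline, because `lines()` drops it.
pub fn rewrite_server_url(kubeconfig: &str, new_url: &str) -> String {
    kubeconfig
        .lines()
        .map(|line| {
            let trimmed = line.trim_start();
            if trimmed.starts_with("server:") {
                // Everything before the trimmed text is the original indentation.
                let indent = &line[..line.len() - trimmed.len()];
                format!("{indent}server: {new_url}")
            } else {
                line.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

A YAML-aware rewrite would be more robust; the line-based form is only meant to show what the step does.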
…er export leak

is_pid_alive() now uses waitpid(WNOHANG) instead of kill(pid, 0).
kill(pid, 0) returns success for zombie processes, so the health check loop
would never detect an early child exit — it would spin for the full
120s timeout instead of failing fast.
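The difference matters because a zombie still occupies a process-table entry until its parent reaps it, so a signal-0 probe keeps succeeding. A minimal sketch of the reaping approach follows, using the standard library's `Child::try_wait`, which collects the exit status without blocking. This substitutes `std::process::Child` for the raw PID that fork_start() returns, and the helper names are hypothetical.

```rust
use std::io;
use std::process::{Child, Command};
use std::thread::sleep;
use std::time::Duration;

/// Ok(true) while the child is still running, Ok(false) once it has exited.
/// `try_wait` never blocks and reaps the child if it is already a zombie.
pub fn is_child_alive(child: &mut Child) -> io::Result<bool> {
    Ok(child.try_wait()?.is_none())
}

/// Polls up to `max_polls` times; returns Ok(true) as soon as the child has exited.
pub fn poll_until_exit(child: &mut Child, interval: Duration, max_polls: u32) -> io::Result<bool> {
    for _ in 0..max_polls {
        if !is_child_alive(child)? {
            return Ok(true);
        }
        sleep(interval);
    }
    Ok(false)
}

/// Spawns a shell that exits immediately with `code` (assumes `sh` is on PATH).
pub fn spawn_short_lived(code: i32) -> io::Result<Child> {
    Command::new("sh").arg("-c").arg(format!("exit {code}")).spawn()
}
```

Once `try_wait` has observed the exit, the status is cached on the `Child`, so later calls keep returning it.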

Also fixes a resource leak in extract_rootfs_from_docker(): the docker
export child process was never waited on, leaving a zombie and missing
its exit status check.
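The general shape of that fix is to drain the child's output, always reap the child, and treat a non-zero exit status as an error. A minimal sketch follows; `capture_checked` is a hypothetical helper, it buffers output in memory rather than streaming it, and it is not the PR's `extract_rootfs_from_docker()`.

```rust
use std::io::{self, Read};
use std::process::{Command, Stdio};

/// Runs `program`, returns its stdout, and fails if it exits unsuccessfully.
pub fn capture_checked(program: &str, args: &[&str]) -> io::Result<Vec<u8>> {
    let mut child = Command::new(program)
        .args(args)
        .stdout(Stdio::piped())
        .spawn()?;

    let mut stdout = child.stdout.take().expect("stdout was configured as piped");
    let mut out = Vec::new();

    // Read to EOF before waiting: waiting first can deadlock once the pipe fills.
    let read_result = stdout.read_to_end(&mut out);

    // Always reap the child, even if the read failed, so no zombie is left behind.
    let status = child.wait()?;
    read_result?;

    if !status.success() {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("{program} exited with {status}"),
        ));
    }
    Ok(out)
}
```

Checking the status matters because a producer that fails partway through can still have written plausible-looking partial output.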

Fixes doc comment on net_gvproxy() that incorrectly stated guest IP
as .2 (actual: .3 with gvproxy v0.8.6). Removes misleading explicit
drop of c_env_strings in build().
drew self-assigned this Mar 3, 2026
…e log

Root cause of readyz timeout: gvproxy DHCP assigns guest IPs
nondeterministically (.2 or .3 depending on timing), but port
forwarding was hardcoded to .3. When the guest got .2, the health
check polled the wrong IP for 120s.

Fix: vm-init.sh now assigns 192.168.127.2 statically instead of using
DHCP. The CLI guest_ip constant matches. This eliminates the race.

Also fixes libkrun VMM warnings (virtio-fs passthrough symlink errors)
appearing on the parent terminal — fork_start() now redirects the
child's stderr to the console log file via dup2(). The warnings are
from macOS symlink resolution limits on Kubernetes ConfigMap mounts
and are non-fatal.

vm-init.sh is now always updated in the cached rootfs on every boot
(not just during initial extraction), so networking fixes take effect
without requiring rootfs re-extraction.
drew closed this Mar 4, 2026