openshell-server is the gateway -- the central control plane for a cluster. It exposes two gRPC services (OpenShell and Inference) and HTTP endpoints on a single multiplexed port, manages sandbox lifecycle through Kubernetes CRDs, persists state in SQLite or Postgres, and provides SSH tunneling into sandbox pods. The gateway coordinates all interactions between clients, the Kubernetes cluster, and the persistence layer.
The following diagram shows the major components inside the gateway process and their relationships.
graph TD
Client["gRPC / HTTP Client"]
TCP["TCP Listener"]
TLS["TLS Acceptor<br/>(optional)"]
MUX["MultiplexedService"]
GRPC_ROUTER["GrpcRouter"]
NAV["OpenShellServer<br/>(OpenShell service)"]
INF["InferenceServer<br/>(Inference service)"]
HTTP["HTTP Router<br/>(Axum)"]
HEALTH["Health Endpoints"]
SSH_TUNNEL["SSH Tunnel<br/>(/connect/ssh)"]
STORE["Store<br/>(SQLite / Postgres)"]
K8S["Kubernetes API"]
WATCHER["Sandbox Watcher"]
EVENT_TAILER["Kube Event Tailer"]
WATCH_BUS["SandboxWatchBus"]
LOG_BUS["TracingLogBus"]
PLAT_BUS["PlatformEventBus"]
INDEX["SandboxIndex"]
Client --> TCP
TCP --> TLS
TLS --> MUX
MUX -->|"content-type: application/grpc"| GRPC_ROUTER
MUX -->|"other"| HTTP
GRPC_ROUTER -->|"/openshell.inference.v1.Inference/*"| INF
GRPC_ROUTER -->|"all other paths"| NAV
HTTP --> HEALTH
HTTP --> SSH_TUNNEL
NAV --> STORE
NAV --> K8S
INF --> STORE
SSH_TUNNEL --> STORE
SSH_TUNNEL --> K8S
WATCHER --> K8S
WATCHER --> STORE
WATCHER --> WATCH_BUS
WATCHER --> INDEX
EVENT_TAILER --> K8S
EVENT_TAILER --> PLAT_BUS
EVENT_TAILER --> INDEX
LOG_BUS --> PLAT_BUS
| Module | File | Purpose |
|---|---|---|
| Entry point | `crates/openshell-server/src/main.rs` | CLI argument parsing, config assembly, tracing setup, calls `run_server` |
| Gateway runtime | `crates/openshell-server/src/lib.rs` | `ServerState` struct, `run_server()` accept loop |
| Protocol mux | `crates/openshell-server/src/multiplex.rs` | `MultiplexService`, `MultiplexedService`, `GrpcRouter`, `BoxBody` |
| gRPC: OpenShell | `crates/openshell-server/src/grpc.rs` | `OpenShellService` -- sandbox CRUD, provider CRUD, watch, exec, SSH sessions, policy delivery |
| gRPC: Inference | `crates/openshell-server/src/inference.rs` | `InferenceService` -- cluster inference config (set/get) and sandbox inference bundle delivery |
| HTTP | `crates/openshell-server/src/http.rs` | Health endpoints, merged with SSH tunnel router |
| Browser auth | `crates/openshell-server/src/auth.rs` | Cloudflare browser login relay at `/auth/connect` |
| SSH tunnel | `crates/openshell-server/src/ssh_tunnel.rs` | HTTP CONNECT handler at `/connect/ssh` |
| WS tunnel | `crates/openshell-server/src/ws_tunnel.rs` | WebSocket tunnel handler at `/_ws_tunnel` for Cloudflare-fronted clients |
| TLS | `crates/openshell-server/src/tls.rs` | `TlsAcceptor` wrapping rustls with ALPN |
| Persistence | `crates/openshell-server/src/persistence/mod.rs` | `Store` enum (SQLite/Postgres), generic object CRUD, protobuf codec |
| Persistence: SQLite | `crates/openshell-server/src/persistence/sqlite.rs` | `SqliteStore` with sqlx |
| Persistence: Postgres | `crates/openshell-server/src/persistence/postgres.rs` | `PostgresStore` with sqlx |
| Sandbox K8s | `crates/openshell-server/src/sandbox/mod.rs` | `SandboxClient`, CRD creation/deletion, Kubernetes watcher, phase derivation |
| Sandbox index | `crates/openshell-server/src/sandbox_index.rs` | `SandboxIndex` -- in-memory name/pod-to-id correlation |
| Watch bus | `crates/openshell-server/src/sandbox_watch.rs` | `SandboxWatchBus`, `PlatformEventBus`, Kubernetes event tailer |
| Tracing bus | `crates/openshell-server/src/tracing_bus.rs` | `TracingLogBus` -- captures tracing events keyed by `sandbox_id` |

Proto definitions consumed by the gateway:
| Proto file | Package | Defines |
|---|---|---|
| `proto/openshell.proto` | `openshell.v1` | OpenShell service, sandbox/provider/SSH/watch messages |
| `proto/inference.proto` | `openshell.inference.v1` | Inference service: `SetClusterInference`, `GetClusterInference`, `GetInferenceBundle` |
| `proto/datamodel.proto` | `openshell.datamodel.v1` | `Sandbox`, `SandboxSpec`, `SandboxStatus`, `Provider`, `SandboxPhase` |
| `proto/sandbox.proto` | `openshell.sandbox.v1` | `SandboxPolicy`, `NetworkPolicyRule`, `SettingValue`, `EffectiveSetting`, `SettingScope`, `PolicySource`, `GetSandboxSettingsRequest/Response`, `GetGatewaySettingsRequest/Response` |

The gateway boots in main() (crates/openshell-server/src/main.rs) and proceeds through these steps:
- Install rustls crypto provider -- `aws_lc_rs::default_provider().install_default()`.
- Parse CLI arguments -- `Args::parse()` via `clap`. Every flag has a corresponding environment variable (see Configuration).
- Initialize tracing -- Creates a `TracingLogBus` and installs a tracing subscriber that writes to stdout and publishes log events keyed by `sandbox_id` into the bus.
- Build `Config` -- Assembles an `openshell_core::Config` from the parsed arguments.
- Call `run_server()` (`crates/openshell-server/src/lib.rs`):
  - Connect to the persistence store (`Store::connect`), which auto-detects SQLite vs Postgres from the URL prefix and runs migrations.
  - Create `SandboxClient` (initializes a `kube::Client` from in-cluster or kubeconfig).
  - Build `ServerState` (shared via `Arc<ServerState>` across all handlers).
  - Spawn background tasks:
    - `spawn_sandbox_watcher` -- watches Kubernetes Sandbox CRDs and syncs state to the store.
    - `spawn_kube_event_tailer` -- watches Kubernetes Events in the sandbox namespace and publishes them to the `PlatformEventBus`.
  - Create `MultiplexService`.
  - Bind `TcpListener` on `config.bind_address`.
  - Optionally create `TlsAcceptor` from cert/key files.
  - Enter the accept loop: for each connection, spawn a tokio task that optionally performs a TLS handshake, then calls `MultiplexService::serve()`.

All configuration is via CLI flags with environment variable fallbacks. The --db-url flag is the only required argument.
| Flag | Env Var | Default | Description |
|---|---|---|---|
| `--port` | `OPENSHELL_SERVER_PORT` | `8080` | TCP listen port (binds `0.0.0.0`) |
| `--log-level` | `OPENSHELL_LOG_LEVEL` | `info` | Tracing log level filter |
| `--tls-cert` | `OPENSHELL_TLS_CERT` | None | Path to PEM certificate file |
| `--tls-key` | `OPENSHELL_TLS_KEY` | None | Path to PEM private key file |
| `--tls-client-ca` | `OPENSHELL_TLS_CLIENT_CA` | None | Path to PEM CA cert for mTLS client verification |
| `--disable-tls` | `OPENSHELL_DISABLE_TLS` | `false` | Listen on plaintext HTTP behind a trusted reverse proxy or tunnel |
| `--disable-gateway-auth` | `OPENSHELL_DISABLE_GATEWAY_AUTH` | `false` | Keep TLS enabled but allow no-certificate clients and rely on application-layer auth |
| `--client-tls-secret-name` | `OPENSHELL_CLIENT_TLS_SECRET_NAME` | None | K8s secret name to mount into sandbox pods for mTLS |
| `--db-url` | `OPENSHELL_DB_URL` | required | Database URL (`sqlite:...` or `postgres://...`). The Helm chart defaults to `sqlite:/var/openshell/openshell.db` (persistent volume). In-memory SQLite (`sqlite::memory:?cache=shared`) works for ephemeral/test environments but data is lost on restart. |
| `--sandbox-namespace` | `OPENSHELL_SANDBOX_NAMESPACE` | `default` | Kubernetes namespace for sandbox CRDs |
| `--sandbox-image` | `OPENSHELL_SANDBOX_IMAGE` | None | Default container image for sandbox pods |
| `--grpc-endpoint` | `OPENSHELL_GRPC_ENDPOINT` | None | gRPC endpoint reachable from within the cluster (for sandbox callbacks) |
| `--ssh-gateway-host` | `OPENSHELL_SSH_GATEWAY_HOST` | `127.0.0.1` | Public hostname returned in SSH session responses |
| `--ssh-gateway-port` | `OPENSHELL_SSH_GATEWAY_PORT` | `8080` | Public port returned in SSH session responses |
| `--ssh-connect-path` | `OPENSHELL_SSH_CONNECT_PATH` | `/connect/ssh` | HTTP path for SSH CONNECT/upgrade |
| `--sandbox-ssh-port` | `OPENSHELL_SANDBOX_SSH_PORT` | `2222` | SSH listen port inside sandbox pods |
| `--ssh-handshake-secret` | `OPENSHELL_SSH_HANDSHAKE_SECRET` | None | Shared HMAC-SHA256 secret for gateway-to-sandbox handshake |
| `--ssh-handshake-skew-secs` | `OPENSHELL_SSH_HANDSHAKE_SKEW_SECS` | `300` | Allowed clock skew (seconds) for SSH handshake timestamps |

All handlers share an Arc<ServerState> (crates/openshell-server/src/lib.rs):
pub struct ServerState {
pub config: Config,
pub store: Arc<Store>,
pub sandbox_client: SandboxClient,
pub sandbox_index: SandboxIndex,
pub sandbox_watch_bus: SandboxWatchBus,
pub tracing_log_bus: TracingLogBus,
pub ssh_connections_by_token: Mutex<HashMap<String, u32>>,
pub ssh_connections_by_sandbox: Mutex<HashMap<String, u32>>,
pub settings_mutex: tokio::sync::Mutex<()>,
}

- `store` -- persistence backend (SQLite or Postgres) for all object types.
- `sandbox_client` -- Kubernetes client scoped to the sandbox namespace; creates/deletes CRDs and resolves pod IPs.
- `sandbox_index` -- in-memory bidirectional index mapping sandbox names and agent pod names to sandbox IDs. Used by the event tailer to correlate Kubernetes events.
- `sandbox_watch_bus` -- `broadcast`-based notification bus keyed by sandbox ID. Producers call `notify(&id)` when the persisted sandbox record changes; consumers in `WatchSandbox` streams receive `()` signals and re-read the record.
- `tracing_log_bus` -- captures `tracing` events that include a `sandbox_id` field and republishes them as `SandboxLogLine` messages. Maintains a per-sandbox tail buffer (default 200 entries). Also contains a nested `PlatformEventBus` for Kubernetes events.
- `settings_mutex` -- serializes settings mutations (global and sandbox) to prevent read-modify-write races. Held for the duration of any setting set/delete or global policy set/delete operation. See Gateway Settings Channel.

All traffic (gRPC and HTTP) shares a single TCP port. Multiplexing happens at the request level, not the connection level.
MultiplexService::serve() (crates/openshell-server/src/multiplex.rs) creates per-connection service instances:
- Each accepted TCP stream (optionally TLS-wrapped) is passed to `hyper_util::server::conn::auto::Builder`, which auto-negotiates HTTP/1.1 or HTTP/2.
- The builder calls `serve_connection_with_upgrades()`, which supports HTTP upgrades (needed for the SSH tunnel's CONNECT method).
- For each request, `MultiplexedService` inspects the `content-type` header:
  - Starts with `application/grpc` -- routes to `GrpcRouter`.
  - Anything else -- routes to the Axum HTTP router.

GrpcRouter (crates/openshell-server/src/multiplex.rs) further routes gRPC requests by URI path prefix:
- Paths starting with `/openshell.inference.v1.Inference/` go to `InferenceServer`.
- All other gRPC paths go to `OpenShellServer`.

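The two routing decisions above (content type first, then gRPC path prefix) can be sketched as a single pure function. This is a simplified illustration, not the gateway's code: `Target` and `route` are hypothetical names, and the real services operate on Hyper request types rather than plain strings.

```rust
/// Where a request ends up after both routing steps (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Target {
    InferenceGrpc,
    OpenShellGrpc,
    Http,
}

/// Path prefix for the Inference service, as given in the routing rules.
const INFERENCE_PREFIX: &str = "/openshell.inference.v1.Inference/";

/// Classify a request from its `content-type` header value and URI path.
/// Hypothetical helper for illustration only.
pub fn route(content_type: Option<&str>, path: &str) -> Target {
    // Step 1: anything whose content type does not start with
    // "application/grpc" goes to the HTTP router.
    let is_grpc = content_type
        .map(|ct| ct.starts_with("application/grpc"))
        .unwrap_or(false);
    if !is_grpc {
        return Target::Http;
    }
    // Step 2: gRPC requests are split by URI path prefix.
    if path.starts_with(INFERENCE_PREFIX) {
        Target::InferenceGrpc
    } else {
        Target::OpenShellGrpc
    }
}
```

Because the check is a prefix match, content types such as `application/grpc+proto` are also treated as gRPC in this sketch.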
The gRPC and HTTP handlers produce different response body types. `MultiplexedService` normalizes them through a custom `BoxBody` wrapper (an `UnsyncBoxBody<Bytes, Box<dyn Error>>`) so that Hyper receives a uniform response type.
When TLS is enabled (crates/openshell-server/src/tls.rs):
- `TlsAcceptor::from_files()` loads PEM certificates and keys via `rustls_pemfile`, builds a `rustls::ServerConfig`, and configures ALPN to advertise `h2` and `http/1.1`.
- When a client CA path is provided (`--tls-client-ca`), the server enforces mutual TLS using `WebPkiClientVerifier` by default. In Cloudflare-fronted deployments, `--disable-gateway-auth` keeps TLS enabled but allows no-certificate clients so the edge can forward a JWT instead.
- `--disable-tls` removes gateway-side TLS entirely and serves plaintext HTTP behind a trusted reverse proxy or tunnel.
- Supports PKCS#1, PKCS#8, and SEC1 private key formats.
- The TLS handshake happens before the stream reaches Hyper's auto builder, so ALPN negotiation and HTTP version detection work together transparently.
- Certificates are generated at cluster bootstrap time by the `openshell-bootstrap` crate using `rcgen`, not by a Helm Job. The bootstrap reconciles three K8s secrets: `openshell-server-tls` (server cert+key), `openshell-server-client-ca` (CA cert), and `openshell-client-tls` (client cert+key+CA, shared by CLI and sandbox pods).
- Certificate lifetime: Certificates use `rcgen` defaults (effectively never expire), which is appropriate for an internal dev-cluster PKI where certs are ephemeral to the cluster's lifetime.
- Redeploy behavior: On redeploy, existing cluster TLS secrets are loaded and reused if they are complete and valid PEM. If secrets are missing, incomplete, or malformed, fresh PKI is generated. If rotation occurs and the openshell workload is already running, the bootstrap performs a rollout restart and waits for completion before persisting CLI-side credentials.

Defined in proto/openshell.proto, implemented in crates/openshell-server/src/grpc.rs as OpenShellService.
| RPC | Description | Key behavior |
|---|---|---|
| `Health` | Returns service status and version | Always returns `HEALTHY` with `CARGO_PKG_VERSION` |
| `CreateSandbox` | Create a new sandbox | Validates spec and policy, validates provider names exist (fail-fast), persists to store, creates Kubernetes CRD. On K8s 409 conflict or error, rolls back the store record and index entry. |
| `GetSandbox` | Fetch sandbox by name | Looks up by name via `store.get_message_by_name()` |
| `ListSandboxes` | List sandboxes | Paginated (default limit 100), decodes protobuf payloads from store records |
| `DeleteSandbox` | Delete sandbox by name | Sets phase to `Deleting`, persists, notifies watch bus, then deletes the Kubernetes CRD. Cleans up store if the CRD was already gone. |
| `WatchSandbox` | Stream sandbox updates | Server-streaming RPC. See Watch Sandbox Stream below. |
| `ExecSandbox` | Execute command in sandbox | Server-streaming RPC. See Remote Exec via SSH below. |

| RPC | Description |
|---|---|
| `CreateSshSession` | Creates a session token for a `Ready` sandbox. Persists an `SshSession` record and returns gateway connection details (host, port, scheme, connect path). |
| `RevokeSshSession` | Marks a session as revoked by setting `session.revoked = true` in the store. |

Full CRUD for Provider objects, which store typed credentials (e.g., API keys for Claude, GitLab tokens).
| RPC | Description |
|---|---|
| `CreateProvider` | Creates a provider. Requires `type` field; auto-generates a 6-char name if not provided. Rejects duplicates by name. |
| `GetProvider` | Fetches a provider by name. |
| `ListProviders` | Paginated list (default limit 100). |
| `UpdateProvider` | Updates an existing provider by name. Preserves the stored `id` and `name`; replaces `type`, `credentials`, and `config`. |
| `DeleteProvider` | Deletes a provider by name. Returns `deleted: true/false`. |

These RPCs are called by sandbox pods at startup and during runtime polling.
| RPC | Description |
|---|---|
| `GetSandboxSettings` | Returns effective sandbox config looked up by sandbox ID: policy payload, policy metadata (version, hash, source, `global_policy_version`), merged effective settings, and a `config_revision` fingerprint for change detection. Two-tier resolution: registered keys start unset, sandbox values overlay, global values override. The reserved `policy` key in global settings can override the sandbox's own policy. When a global policy is active, `policy_source` is `GLOBAL` and `global_policy_version` carries the active revision number. See Gateway Settings Channel. |
| `GetGatewaySettings` | Returns gateway-global settings only (excluding the reserved `policy` key). Returns registered keys with empty values when unconfigured, and a monotonic `settings_revision`. |
| `GetSandboxProviderEnvironment` | Resolves provider credentials into environment variables for a sandbox. Iterates the sandbox's `spec.providers` list, fetches each `Provider`, and collects credential key-value pairs. First provider wins on duplicate keys. Skips credential keys that do not match `^[A-Za-z_][A-Za-z0-9_]*$`. |

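The merge rules described for `GetSandboxProviderEnvironment` (first provider wins, invalid keys skipped) can be sketched as follows. The function names and the use of plain maps in place of `Provider` records are assumptions made for illustration; the key check is a hand-written equivalent of `^[A-Za-z_][A-Za-z0-9_]*$` so the example needs no regex crate.

```rust
use std::collections::BTreeMap;

/// Hand-written equivalent of `^[A-Za-z_][A-Za-z0-9_]*$`.
pub fn is_valid_env_key(key: &str) -> bool {
    let mut chars = key.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return false, // empty key or invalid first character
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Merge credential maps in provider order (hypothetical helper).
/// Each slice element stands in for one provider's credentials.
pub fn merge_provider_env(
    providers: &[BTreeMap<String, String>],
) -> BTreeMap<String, String> {
    let mut env = BTreeMap::new();
    for credentials in providers {
        for (key, value) in credentials {
            if !is_valid_env_key(key) {
                continue; // skip keys that are not valid env var names
            }
            // `or_insert_with` keeps an existing value: first provider wins.
            env.entry(key.clone()).or_insert_with(|| value.clone());
        }
    }
    env
}
```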
These RPCs support the sandbox-initiated policy recommendation pipeline. The sandbox generates proposals via its mechanistic mapper and submits them; the gateway validates, persists, and manages the approval workflow. See architecture/policy-advisor.md for the full pipeline design.
| RPC | Description |
|---|---|
| `SubmitPolicyAnalysis` | Receives pre-formed `PolicyChunk` proposals from a sandbox. Validates each chunk, persists via upsert on `(sandbox_id, host, port, binary)` dedup key, notifies watch bus. |
| `GetDraftPolicy` | Returns all draft chunks for a sandbox with current draft version. |
| `ApproveDraftChunk` | Approves a pending or rejected chunk. Merges the proposed rule into the active policy (appends binary to existing rule or inserts new rule). Blocked when a global policy is active -- returns `FailedPrecondition`. |
| `RejectDraftChunk` | Rejects a pending chunk or revokes an approved chunk. If revoking, removes the binary from the active policy rule. Rejection of pending chunks is always allowed. Revoking approved chunks is blocked when a global policy is active -- returns `FailedPrecondition`. |
| `ApproveAllDraftChunks` | Bulk approves all pending chunks for a sandbox. Blocked when a global policy is active -- returns `FailedPrecondition`. |
| `EditDraftChunk` | Updates the proposed rule on a pending chunk. |
| `GetDraftHistory` | Returns all chunks (including rejected) for audit trail. |

Defined in proto/inference.proto, implemented in crates/openshell-server/src/inference.rs as InferenceService.
The gateway acts as the control plane for inference configuration. It stores a single managed cluster inference route (named inference.local) and delivers resolved route bundles to sandbox pods. The gateway does not execute inference requests -- sandboxes connect directly to inference backends using the credentials and endpoints provided in the bundle.
The gateway manages a single cluster-wide inference route that maps to a provider record. When set, the route stores only a provider_name and model_id reference. At bundle resolution time, the gateway looks up the referenced provider and derives the endpoint URL, API key, protocols, and provider type from it. This late-binding design means provider credential rotations are automatically reflected in the next bundle fetch without updating the route itself.
| RPC | Description |
|---|---|
| `SetClusterInference` | Configures the cluster inference route. Validates `provider_name` and `model_id` are non-empty, verifies the named provider exists and has a supported type for inference (openai, anthropic, nvidia), validates the provider has a usable API key, then upserts the `inference.local` route record. Increments a monotonic `version` on each update. Returns the configured `provider_name`, `model_id`, and `version`. |
| `GetClusterInference` | Returns the current cluster inference configuration (`provider_name`, `model_id`, `version`). Returns `NotFound` if no cluster inference is configured, or `FailedPrecondition` if the stored route has empty provider/model metadata. |
| `GetInferenceBundle` | Returns the resolved inference route bundle for sandbox consumption. See Route Bundle Delivery below. |

The GetInferenceBundle RPC resolves the managed cluster route into a GetInferenceBundleResponse containing fully materialized route data that sandboxes can use directly.
The trait method delegates to resolve_inference_bundle(store) (crates/openshell-server/src/inference.rs), which takes &Store instead of &self. This extraction decouples bundle resolution from ServerState, enabling direct unit testing against an in-memory SQLite store without constructing a full server.
The GetInferenceBundleResponse includes:
- `routes` -- a list of `ResolvedRoute` messages containing base URL, model ID, API key, protocols, and provider type. Currently contains zero or one routes (the managed cluster route).
- `revision` -- a hex-encoded hash computed from route contents. Sandboxes compare this value to detect when their route set has changed.
- `generated_at_ms` -- epoch milliseconds when the bundle was assembled.

Managed route resolution in resolve_managed_cluster_route() (crates/openshell-server/src/inference.rs):
- Load the managed route by name (`inference.local`).
- Skip (return `None`) if the route does not exist, has no spec, or is disabled.
- Validate that `provider_name` and `model_id` are non-empty.
- Fetch the referenced provider record from the store.
- Resolve the provider into a `ResolvedProviderRoute` via `resolve_provider_route()`:
  - Look up the `InferenceProviderProfile` for the provider's type via `openshell_core::inference::profile_for()`. Supported types: `openai`, `anthropic`, `nvidia`.
  - Search the provider's credentials map for an API key using the profile's preferred key name (e.g., `OPENAI_API_KEY`), falling back to the first non-empty credential in sorted key order.
  - Resolve the base URL from the provider's config map using the profile-specific key (e.g., `OPENAI_BASE_URL`), falling back to the profile's default URL.
  - Derive protocols from the profile (e.g., `openai_chat_completions`, `openai_completions`, `openai_responses`, `model_discovery` for OpenAI-compatible providers).
- Return a `ResolvedRoute` with the fully materialized endpoint, credentials, and protocols.

The ClusterInferenceConfig stored in the database contains only provider_name and model_id. All other fields (endpoint, credentials, protocols, auth style) are resolved from the provider record at bundle generation time via build_cluster_inference_config().
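The credential and base URL fallbacks used during provider resolution can be illustrated with two small lookups. This is a sketch under assumptions: `resolve_api_key` and `resolve_base_url` are hypothetical names, provider credentials and config are modeled as `BTreeMap`s (whose iteration order is sorted by key), and treating an empty preferred credential as missing is a guess about the real behavior.

```rust
use std::collections::BTreeMap;

/// Pick the API key: the profile's preferred key if present and non-empty
/// (assumption), otherwise the first non-empty credential in sorted key order.
pub fn resolve_api_key<'a>(
    credentials: &'a BTreeMap<String, String>,
    preferred_key: &str,
) -> Option<&'a str> {
    if let Some(value) = credentials.get(preferred_key) {
        if !value.is_empty() {
            return Some(value.as_str());
        }
    }
    // BTreeMap iterates in ascending key order, which gives the
    // "sorted key order" fallback directly.
    credentials
        .values()
        .map(String::as_str)
        .find(|value| !value.is_empty())
}

/// Pick the base URL: the provider config entry under the profile-specific
/// key, otherwise the profile's default URL.
pub fn resolve_base_url<'a>(
    config: &'a BTreeMap<String, String>,
    url_key: &str,
    default_url: &'a str,
) -> &'a str {
    config.get(url_key).map(String::as_str).unwrap_or(default_url)
}
```

Because both lookups run at bundle generation time, a rotated credential stored on the provider record is picked up on the next fetch without touching the route.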
The HTTP router (crates/openshell-server/src/http.rs) merges two sub-routers:
| Path | Method | Response |
|---|---|---|
| `/health` | GET | `200 OK` (empty body) |
| `/healthz` | GET | `200 OK` (empty body) -- Kubernetes liveness probe |
| `/readyz` | GET | `200 OK` with JSON `{"status": "healthy", "version": "<version>"}` -- Kubernetes readiness probe |

| Path | Method | Response |
|---|---|---|
| `/connect/ssh` | CONNECT | Upgrades the connection to a bidirectional TCP tunnel to a sandbox pod's SSH port |

See SSH Tunnel Gateway for details.
| Path | Method | Response |
|---|---|---|
| `/auth/connect` | GET | Browser login relay page that reads `CF_Authorization` and POSTs it back to the CLI localhost callback |
| `/_ws_tunnel` | GET upgrade | WebSocket tunnel that bridges bytes directly into `MultiplexedService` over an in-memory duplex stream |

The WatchSandbox RPC (crates/openshell-server/src/grpc.rs) provides a multiplexed server-streaming response that can include sandbox status snapshots, gateway log lines, and platform events.
The WatchSandboxRequest controls what the stream includes:
- `follow_status` -- subscribe to `SandboxWatchBus` notifications and re-read the sandbox record on each change.
- `follow_logs` -- subscribe to `TracingLogBus` for gateway log lines correlated by `sandbox_id`.
- `follow_events` -- subscribe to `PlatformEventBus` for Kubernetes events correlated to the sandbox.
- `log_tail_lines` -- replay the last N log lines before following (default 200).
- `stop_on_terminal` -- end the stream when the sandbox reaches the `Ready` phase. Note: `Error` phase does not stop the stream because it may be transient (e.g., `ReconcilerError`).

- Subscribe to all requested buses before reading the initial snapshot (prevents missed notifications).
- Send the current sandbox record as the first event.
- If `stop_on_terminal` is set and the sandbox is already `Ready`, end the stream immediately.
- Replay tail logs if `follow_logs` is enabled.
- Enter a `tokio::select!` loop listening on up to three broadcast receivers:
  - Status updates: re-read the sandbox from the store, send the snapshot, check for terminal phase.
  - Log lines: forward `SandboxStreamEvent::Log` messages.
  - Platform events: forward `SandboxStreamEvent::Event` messages.

graph LR
SW["spawn_sandbox_watcher"]
ET["spawn_kube_event_tailer"]
TL["SandboxLogLayer<br/>(tracing layer)"]
WB["SandboxWatchBus<br/>(broadcast per ID)"]
LB["TracingLogBus<br/>(broadcast per ID + tail buffer)"]
PB["PlatformEventBus<br/>(broadcast per ID)"]
WS["WatchSandbox stream"]
SW -->|"notify(id)"| WB
TL -->|"publish(id, log_event)"| LB
ET -->|"publish(id, platform_event)"| PB
WB -->|"subscribe(id)"| WS
LB -->|"subscribe(id)"| WS
PB -->|"subscribe(id)"| WS
All buses use tokio::sync::broadcast channels keyed by sandbox ID. Buffer sizes:
- `SandboxWatchBus`: 128 (signals only, no payload -- just `()`)
- `TracingLogBus`: 1024 (full `SandboxStreamEvent` payloads)
- `PlatformEventBus`: 1024 (full `SandboxStreamEvent` payloads)

Broadcast lag is translated to Status::resource_exhausted via broadcast_to_status().
Cleanup: Each bus exposes a remove(sandbox_id) method that drops the broadcast sender (closing active receivers with RecvError::Closed) and frees internal map entries. Cleanup is wired into both the handle_deleted reconciler (Kubernetes watcher) and the delete_sandbox gRPC handler to prevent unbounded memory growth from accumulated entries for deleted sandboxes.
Validation: WatchSandbox validates that the sandbox exists before subscribing to any bus, preventing entries from being created for non-existent IDs. PushSandboxLogs validates sandbox existence once on the first batch of the stream.
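Alongside its broadcast channel, the `TracingLogBus` keeps a bounded per-sandbox tail so that `log_tail_lines` can be replayed to new watchers. A minimal sketch of such a bounded tail is shown below. `TailBuffer` is a hypothetical type that stores plain strings; the real bus stores log events and its internal data structure is not described here.

```rust
use std::collections::VecDeque;

/// Bounded buffer that keeps only the most recent `capacity` lines.
/// Hypothetical stand-in for the per-sandbox tail kept by the log bus.
pub struct TailBuffer {
    capacity: usize,
    lines: VecDeque<String>,
}

impl TailBuffer {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity,
            lines: VecDeque::with_capacity(capacity),
        }
    }

    /// Append a line, evicting the oldest one when the buffer is full.
    pub fn push(&mut self, line: String) {
        if self.capacity == 0 {
            return;
        }
        if self.lines.len() == self.capacity {
            self.lines.pop_front();
        }
        self.lines.push_back(line);
    }

    /// Return up to the last `n` lines, oldest first, for replay.
    pub fn tail(&self, n: usize) -> Vec<&str> {
        let skip = self.lines.len().saturating_sub(n);
        self.lines.iter().skip(skip).map(String::as_str).collect()
    }
}
```

Dropping the buffer together with the broadcast sender in `remove(sandbox_id)` is what keeps memory bounded once a sandbox is deleted.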
The ExecSandbox RPC (crates/openshell-server/src/grpc.rs) executes a command inside a sandbox pod over SSH and streams stdout/stderr/exit back to the client.
- Validate request: `sandbox_id`, `command`, and environment key format (`^[A-Za-z_][A-Za-z0-9_]*$`).
- Verify sandbox exists and is in `Ready` phase.
- Resolve target: prefer agent pod IP (via `sandbox_client.agent_pod_ip()`), fall back to Kubernetes service DNS (`<name>.<namespace>.svc.cluster.local`).
- Build the remote command string: sort environment variables, shell-escape all values, prepend `cd <workdir> &&` if `workdir` is set.
- Start a single-use SSH proxy: binds an ephemeral local TCP port, accepts one connection, performs the NSSH1 handshake with the sandbox, and bidirectionally copies data.
- Connect via `russh`: establishes an SSH connection through the local proxy, authenticates with `none` auth as user `sandbox`, opens a session channel, and executes the command.
- Stream `ExecSandboxStdout`, `ExecSandboxStderr` chunks as they arrive, then send `ExecSandboxExit` with the exit code.
- On timeout (if `timeout_seconds > 0`), send exit code 124 (matching the `timeout(1)` convention).

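The command-building step can be sketched as below. The exact string layout the gateway produces is not specified here, so the `VAR='value'` prefix form, the escaping of each command argument, and the names `shell_escape` and `build_remote_command` are all assumptions for illustration.

```rust
use std::collections::BTreeMap;

/// POSIX single-quote escaping: wrap the value in single quotes and
/// rewrite each embedded `'` as `'\''`.
pub fn shell_escape(value: &str) -> String {
    format!("'{}'", value.replace('\'', "'\\''"))
}

/// Build one shell command line (hypothetical helper).
/// `env` is a BTreeMap, so variables are emitted in sorted key order.
pub fn build_remote_command(
    command: &[&str],
    env: &BTreeMap<String, String>,
    workdir: Option<&str>,
) -> String {
    let mut parts: Vec<String> = Vec::new();
    // Environment assignments first, with escaped values.
    for (key, value) in env {
        parts.push(format!("{}={}", key, shell_escape(value)));
    }
    // Then the command and its arguments, each escaped.
    for arg in command {
        parts.push(shell_escape(arg));
    }
    let line = parts.join(" ");
    // Prepend the directory change only when a workdir is given.
    match workdir {
        Some(dir) => format!("cd {} && {}", shell_escape(dir), line),
        None => line,
    }
}
```

Environment keys are not escaped in this sketch because they are validated against `^[A-Za-z_][A-Za-z0-9_]*$` before the command is built.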
The single-use SSH proxy and the SSH tunnel endpoint both use the same handshake:
NSSH1 <token> <timestamp> <nonce> <hmac_signature>\n
- `token` -- session token or a one-time UUID.
- `timestamp` -- Unix epoch seconds.
- `nonce` -- UUID v4.
- `hmac_signature` -- `HMAC-SHA256(secret, "{token}|{timestamp}|{nonce}")`, hex-encoded.
- Expected response: `OK\n` from the sandbox.

The ssh_handshake_skew_secs configuration controls how much clock skew is tolerated.
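The preface framing and the skew check can be sketched with the standard library alone. The HMAC computation itself is omitted because it needs a crypto crate; `Preface`, `signing_payload`, `format_preface`, `parse_preface`, and `within_skew` are hypothetical names, and the symmetric skew comparison is an assumption about how the sandbox validates timestamps.

```rust
/// Parsed fields of an NSSH1 preface line (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct Preface {
    pub token: String,
    pub timestamp: i64,
    pub nonce: String,
    pub signature: String,
}

/// The string that is fed to HMAC-SHA256: "{token}|{timestamp}|{nonce}".
pub fn signing_payload(token: &str, timestamp: i64, nonce: &str) -> String {
    format!("{}|{}|{}", token, timestamp, nonce)
}

/// Render the newline-terminated preface line.
pub fn format_preface(p: &Preface) -> String {
    format!("NSSH1 {} {} {} {}\n", p.token, p.timestamp, p.nonce, p.signature)
}

/// Parse a preface line; returns None on any malformed input.
pub fn parse_preface(line: &str) -> Option<Preface> {
    let mut fields = line.trim_end_matches('\n').split(' ');
    if fields.next()? != "NSSH1" {
        return None;
    }
    let token = fields.next()?.to_string();
    let timestamp: i64 = fields.next()?.parse().ok()?;
    let nonce = fields.next()?.to_string();
    let signature = fields.next()?.to_string();
    if fields.next().is_some() {
        return None; // trailing fields are rejected
    }
    Some(Preface { token, timestamp, nonce, signature })
}

/// Accept timestamps within `skew_secs` of `now_secs` in either direction
/// (assumed symmetric check).
pub fn within_skew(now_secs: i64, timestamp: i64, skew_secs: u64) -> bool {
    now_secs.abs_diff(timestamp) <= skew_secs
}
```

With the default of 300 seconds, a preface stamped up to five minutes before or after the receiver's clock would pass the skew check in this sketch.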
The SSH tunnel endpoint (crates/openshell-server/src/ssh_tunnel.rs) allows external SSH clients to reach sandbox pods through the gateway using HTTP CONNECT upgrades.
- Client sends `CONNECT /connect/ssh` with headers `x-sandbox-id` and `x-sandbox-token`.
- Handler validates the method is CONNECT, extracts headers.
- Fetches the `SshSession` from the store by token; rejects if revoked or if `sandbox_id` does not match.
- Fetches the `Sandbox`; rejects if not in `Ready` phase.
- Resolves the connect target: agent pod IP if available, otherwise Kubernetes service DNS.
- Returns `200 OK`, then upgrades the connection via `hyper::upgrade::on()`.
- In a spawned task: connects to the sandbox's SSH port, performs the NSSH1 handshake, then bidirectionally copies bytes between the upgraded HTTP connection and the sandbox TCP stream.
- On completion, gracefully shuts down the write-half of the upgraded connection for clean EOF handling.

The Store enum (crates/openshell-server/src/persistence/mod.rs) dispatches to either SqliteStore or PostgresStore based on the database URL prefix:
- `sqlite:*` -- uses `sqlx::SqlitePool` (1 connection for in-memory, 5 for file-based).
- `postgres://` or `postgresql://` -- uses `sqlx::PgPool` (max 10 connections).

Both backends auto-run migrations on connect from crates/openshell-server/migrations/{sqlite,postgres}/.
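The prefix-based backend selection and the pool sizes listed above can be sketched as two pure functions. `Backend`, `detect_backend`, and `max_connections` are hypothetical names, and detecting an in-memory SQLite URL by the substring `:memory:` is an assumption; the actual `Store::connect` logic may differ.

```rust
/// Which persistence backend a database URL selects (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Backend {
    Sqlite,
    Postgres,
}

/// Choose the backend from the URL prefix; None for unsupported schemes.
pub fn detect_backend(url: &str) -> Option<Backend> {
    if url.starts_with("sqlite:") {
        Some(Backend::Sqlite)
    } else if url.starts_with("postgres://") || url.starts_with("postgresql://") {
        Some(Backend::Postgres)
    } else {
        None
    }
}

/// Pool size per backend: 1 for in-memory SQLite, 5 for file-based SQLite,
/// 10 for Postgres.
pub fn max_connections(url: &str) -> Option<u32> {
    match detect_backend(url)? {
        // Assumption: an in-memory database is recognized by ":memory:".
        Backend::Sqlite if url.contains(":memory:") => Some(1),
        Backend::Sqlite => Some(5),
        Backend::Postgres => Some(10),
    }
}
```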
A single objects table stores all object types:
CREATE TABLE objects (
object_type TEXT NOT NULL,
id TEXT NOT NULL,
name TEXT NOT NULL,
payload BLOB NOT NULL,
created_at_ms INTEGER NOT NULL,
updated_at_ms INTEGER NOT NULL,
PRIMARY KEY (id),
UNIQUE (object_type, name)
);

Objects are identified by `(object_type, id)` with a unique constraint on `(object_type, name)`. The `payload` column stores protobuf-encoded bytes.

| Object type string | Proto message / format | Traits implemented | Notes |
|---|---|---|---|
| `"sandbox"` | `Sandbox` | `ObjectType`, `ObjectId`, `ObjectName` | |
| `"provider"` | `Provider` | `ObjectType`, `ObjectId`, `ObjectName` | |
| `"ssh_session"` | `SshSession` | `ObjectType`, `ObjectId`, `ObjectName` | |
| `"inference_route"` | `InferenceRoute` | `ObjectType`, `ObjectId`, `ObjectName` | |
| `"gateway_settings"` | JSON `StoredSettings` | Generic `put`/`get` | Singleton, `id="global"`. Contains the reserved `policy` key for global policy delivery. |
| `"sandbox_settings"` | JSON `StoredSettings` | Generic `put`/`get` | Per-sandbox, `id="settings:{sandbox_uuid}"` |

The sandbox_policies table stores versioned policy revisions for both sandbox-scoped and global policies. Global revisions use the sentinel sandbox_id = "__global__". See Gateway Settings Channel for schema details.
The Store provides typed helpers that leverage trait bounds:
- `put_message<T: Message + ObjectType + ObjectId + ObjectName>(&self, msg: &T)` -- encodes to protobuf bytes and upserts.
- `get_message<T: Message + Default + ObjectType>(&self, id: &str)` -- fetches by ID, decodes protobuf.
- `get_message_by_name<T: Message + Default + ObjectType>(&self, name: &str)` -- fetches by name, decodes protobuf.

The generate_name() function produces random 6-character lowercase alphabetic strings for auto-naming objects.
The gateway runs as a Kubernetes StatefulSet with a volumeClaimTemplate that provisions a 1Gi ReadWriteOnce PersistentVolumeClaim mounted at /var/openshell. On k3s clusters this uses the built-in local-path-provisioner StorageClass (the cluster default). The SQLite database file at /var/openshell/openshell.db survives pod restarts and rescheduling.
The Helm chart template is at deploy/helm/openshell/templates/statefulset.yaml.
- Put: Performs an upsert (`INSERT ... ON CONFLICT (id) DO UPDATE ...`). Both `created_at_ms` and `updated_at_ms` are set to the current timestamp in the `VALUES` clause, but the `ON CONFLICT` update only writes `payload` and `updated_at_ms` -- so `created_at_ms` is preserved after the initial insert.
- Get / Delete: Operate by primary key (`id`), filtered by `object_type`.
- List: Pages by `limit` + `offset` with deterministic ordering: `ORDER BY created_at_ms ASC, name ASC`. The secondary sort on `name` prevents unstable ordering when rows share the same millisecond timestamp.

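The upsert and list semantics can be modeled in memory to make the timestamp and ordering behavior concrete. This sketch is not the SQL implementation: `MemStore` and `Row` are hypothetical, timestamps are passed in explicitly, and the `UNIQUE (object_type, name)` constraint is not enforced.

```rust
use std::collections::HashMap;

/// One row of the objects table (hypothetical in-memory mirror).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Row {
    pub object_type: String,
    pub id: String,
    pub name: String,
    pub payload: Vec<u8>,
    pub created_at_ms: i64,
    pub updated_at_ms: i64,
}

/// In-memory model keyed by `id`, mirroring `PRIMARY KEY (id)`.
#[derive(Default)]
pub struct MemStore {
    rows: HashMap<String, Row>,
}

impl MemStore {
    /// Upsert: on conflict only `payload` and `updated_at_ms` change,
    /// so `created_at_ms` keeps the value from the first insert.
    pub fn put(&mut self, object_type: &str, id: &str, name: &str, payload: &[u8], now_ms: i64) {
        self.rows
            .entry(id.to_string())
            .and_modify(|row| {
                row.payload = payload.to_vec();
                row.updated_at_ms = now_ms;
            })
            .or_insert_with(|| Row {
                object_type: object_type.to_string(),
                id: id.to_string(),
                name: name.to_string(),
                payload: payload.to_vec(),
                created_at_ms: now_ms,
                updated_at_ms: now_ms,
            });
    }

    /// List with `ORDER BY created_at_ms ASC, name ASC`, then offset/limit.
    pub fn list(&self, object_type: &str, limit: usize, offset: usize) -> Vec<&Row> {
        let mut rows: Vec<&Row> = self
            .rows
            .values()
            .filter(|row| row.object_type == object_type)
            .collect();
        rows.sort_by(|a, b| (a.created_at_ms, &a.name).cmp(&(b.created_at_ms, &b.name)));
        rows.into_iter().skip(offset).take(limit).collect()
    }
}
```

In the test for this sketch, two rows created in the same millisecond are returned in name order, which is the case the secondary sort key exists to handle.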
SandboxClient (crates/openshell-server/src/sandbox/mod.rs) manages agents.x-k8s.io/v1alpha1/Sandbox CRDs.
- Create: Translates a `Sandbox` proto into a Kubernetes `DynamicObject` with labels (`openshell.ai/sandbox-id`, `openshell.ai/managed-by: openshell`) and a spec that includes the pod template, environment variables, and gateway-required env vars (`OPENSHELL_SANDBOX_ID`, `OPENSHELL_ENDPOINT`, `OPENSHELL_SSH_LISTEN_ADDR`, etc.).
- Delete: Calls the Kubernetes API to delete the CRD by name. Returns `false` if already gone (404).
- Pod IP resolution: `agent_pod_ip()` fetches the agent pod and reads `status.podIP`.

spawn_sandbox_watcher() (crates/openshell-server/src/sandbox/mod.rs) runs a Kubernetes watcher on Sandbox CRDs and processes three event types:
- Applied: Extracts the sandbox ID from labels (or falls back to name prefix stripping), reads the CRD status, derives the phase, and upserts the sandbox record in the store. Notifies the watch bus.
- Deleted: Removes the sandbox record from the store and the index. Notifies the watch bus.
- Restarted: Re-processes all objects (full resync).
derive_phase() maps Kubernetes condition state to SandboxPhase:
| Condition | Phase |
|---|---|
| `deletionTimestamp` is set | `Deleting` |
| Ready condition `status=True` | `Ready` |
| Ready condition `status=False`, terminal reason | `Error` |
| Ready condition `status=False`, transient reason | `Provisioning` |
| No conditions or no status | Provisioning (if status exists) / Unknown (if no status) |
Transient reasons (will retry, stay in Provisioning): ReconcilerError, DependenciesNotReady.
All other Ready=False reasons are treated as terminal failures (Error phase).
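The phase table can be sketched as a pure function. The input types here (`CrdView`, `ReadyCondition`) are hypothetical simplifications of the CRD status, the check order follows the table from top to bottom, and mapping a Ready condition whose status is neither `True` nor `False` to `Provisioning` is an assumption.

```rust
/// Sandbox phases used by this sketch (subset of `SandboxPhase`).
#[derive(Debug, PartialEq, Eq)]
pub enum Phase {
    Unknown,
    Provisioning,
    Ready,
    Error,
    Deleting,
}

/// Simplified view of the CRD's Ready condition (hypothetical type).
pub struct ReadyCondition<'a> {
    pub status: &'a str,
    pub reason: &'a str,
}

/// Simplified view of the fields the derivation looks at (hypothetical type).
pub struct CrdView<'a> {
    pub deletion_timestamp_set: bool,
    pub has_status: bool,
    pub ready: Option<ReadyCondition<'a>>,
}

/// Reasons that keep the sandbox in Provisioning because a retry is expected.
const TRANSIENT_REASONS: [&str; 2] = ["ReconcilerError", "DependenciesNotReady"];

pub fn derive_phase(crd: &CrdView<'_>) -> Phase {
    if crd.deletion_timestamp_set {
        return Phase::Deleting;
    }
    if !crd.has_status {
        return Phase::Unknown;
    }
    match &crd.ready {
        Some(cond) if cond.status == "True" => Phase::Ready,
        Some(cond) if cond.status == "False" => {
            if TRANSIENT_REASONS.iter().any(|r| *r == cond.reason) {
                Phase::Provisioning
            } else {
                Phase::Error // every other Ready=False reason is terminal
            }
        }
        // Status present but no usable Ready condition yet.
        _ => Phase::Provisioning,
    }
}
```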
spawn_kube_event_tailer() (crates/openshell-server/src/sandbox_watch.rs) watches all Kubernetes Event objects in the sandbox namespace and correlates them to sandbox IDs using SandboxIndex:
- Events involving `kind: Sandbox` are correlated by sandbox name.
- Events involving `kind: Pod` are correlated by agent pod name.
- Other event kinds are ignored.

Matched events are published to the PlatformEventBus as SandboxStreamEvent::Event payloads.
SandboxIndex (crates/openshell-server/src/sandbox_index.rs) maintains two in-memory maps protected by an RwLock:
- `sandbox_name_to_id: HashMap<String, String>`
- `agent_pod_to_id: HashMap<String, String>`

Updated by the sandbox watcher on every Applied event and by gRPC handlers during sandbox creation. Used by the event tailer to map Kubernetes event objects back to sandbox IDs.
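A minimal sketch of such an index, including the kind-based lookup the event tailer performs, is shown below. It uses `std::sync::RwLock` and invented method names (`upsert`, `remove`, `id_for_event`); the real `SandboxIndex` API and lock type may differ.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// The two lookup maps, guarded together by one lock.
#[derive(Default)]
struct Maps {
    sandbox_name_to_id: HashMap<String, String>,
    agent_pod_to_id: HashMap<String, String>,
}

/// Hypothetical stand-in for the gateway's sandbox index.
#[derive(Default)]
pub struct SandboxIndex {
    inner: RwLock<Maps>,
}

impl SandboxIndex {
    /// Record or refresh the names that map to a sandbox ID.
    pub fn upsert(&self, id: &str, sandbox_name: &str, agent_pod: Option<&str>) {
        let mut maps = self.inner.write().unwrap();
        maps.sandbox_name_to_id
            .insert(sandbox_name.to_string(), id.to_string());
        if let Some(pod) = agent_pod {
            maps.agent_pod_to_id.insert(pod.to_string(), id.to_string());
        }
    }

    /// Drop every entry that points at the given sandbox ID.
    pub fn remove(&self, id: &str) {
        let mut maps = self.inner.write().unwrap();
        maps.sandbox_name_to_id.retain(|_, v| v.as_str() != id);
        maps.agent_pod_to_id.retain(|_, v| v.as_str() != id);
    }

    /// Correlate a Kubernetes event's involved object to a sandbox ID.
    /// Only `Sandbox` and `Pod` kinds are considered; others are ignored.
    pub fn id_for_event(&self, kind: &str, object_name: &str) -> Option<String> {
        let maps = self.inner.read().unwrap();
        match kind {
            "Sandbox" => maps.sandbox_name_to_id.get(object_name).cloned(),
            "Pod" => maps.agent_pod_to_id.get(object_name).cloned(),
            _ => None,
        }
    }
}
```

A read-write lock fits this access pattern because lookups from the event tailer are frequent while updates only happen on sandbox creation, `Applied` events, and deletion.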
- gRPC errors: All gRPC handlers return `tonic::Status` with appropriate codes:
  - `InvalidArgument` for missing/malformed fields
  - `NotFound` for nonexistent objects
  - `AlreadyExists` for duplicate creation
  - `FailedPrecondition` for state violations (e.g., exec on non-Ready sandbox, missing provider)
  - `Internal` for store/decode/Kubernetes failures
  - `PermissionDenied` for policy violations
  - `ResourceExhausted` for broadcast lag (missed messages)
  - `Cancelled` for closed broadcast channels
- HTTP errors: The SSH tunnel handler returns HTTP status codes directly (`401`, `404`, `405`, `412`, `500`, `502`).
- Connection errors: Logged at `error` level but do not crash the gateway. TLS handshake failures and individual connection errors are caught and logged per-connection.
- Background task errors: The sandbox watcher and event tailer log warnings for individual processing failures but continue running. If the watcher stream ends, it logs a warning and the task exits (no automatic restart).

- Sandbox Architecture -- sandbox-side policy enforcement, proxy, and isolation details
- Gateway Settings Channel -- runtime settings channel, two-tier resolution, CLI/TUI commands
- Inference Routing -- end-to-end inference interception flow, sandbox-side proxy logic, and route resolution
- Container Management -- how sandbox container images are built and configured
- Sandbox Connect -- client-side SSH connection flow
- Providers -- provider credential management and injection