A continuation of k8s-1-to-4, focusing on GPU workloads, LLM inference, custom operators, and admission webhooks.
- P5: Kind + NVIDIA GPU Operator + CUDA Pod (
nvidia-smi) - P6: vLLM Deployment + Service + autoscaling
- P7:
InferenceDeploymentCRD + Operator (Kubebuilder) managing vLLM fleets - P8: Mutating/validating webhooks + simulated cluster upgrade
A10 No filesystem attachment Any region
This repo assumes a single Lambda spot GPU VM running Kind locally (not a managed Lambda Kubernetes cluster).
Quickstart: scripts/lambda-labs-quickstart.sh (or scripts/p5.sh).
Everything from k8s-1-to-4 plus:
- Kind (Kubernetes in Docker, multi-node)
- Helm 3
- NVIDIA GPU on host with recent drivers (for real GPU tests)
- kubebuilder (for Project 7)
- Go 1.22+ (for operator + webhooks)
- cert-manager (installed into the kind cluster for webhooks in P8)
Note: Projects 5–8 are designed to run on Kind instead of Minikube, to better mirror “real” multi-node clusters and GPU scheduling.
Goal: Run a GPU-enabled Pod on a local Kind cluster using the NVIDIA GPU Operator, and expose a fake ML endpoint.
High level:
gpu-kind.yaml: Multi-node Kind cluster config with one GPU-capable worker.scripts/p5.sh:kind create cluster --config k8s/05-gpu/gpu-kind.yaml- Install NVIDIA GPU Operator via Helm.
- Deploy a Pod using
nvidia/cudaimage requestingnvidia.com/gpu: 1. - Run
nvidia-smito prove GPU is wired up.
- Extend the app with a fake ML endpoint:
- Simple HTTP server (Node or Python) with a
/inferroute. - Uses a tiny torch stub (CPU tensor ops only) to mimic an inference pipeline.
- Simple HTTP server (Node or Python) with a
See:
k8s/05-gpu/gpu-kind.yamlk8s/05-gpu/gpu-pod.yamlscripts/p5.shml-stub/— fake ML app (Flask + PyTorch stub,POST /infer)blog/p5-gpu-pods-on-kind.md
Goal: Serve an LLM with vLLM on Kubernetes, with a basic autoscaling story.
High level:
k8s/06-vllm/vllm-pvc.yaml: PVC for the Hugging Face cache/model.k8s/06-vllm/vllm-deployment.yaml: vLLM server, GPU-enabled, exposes port8000.k8s/06-vllm/vllm-service.yaml: ClusterIPServicefor vLLM.k8s/06-vllm/vllm-hpa.yaml: HPA scaling on CPU (or placeholder for queue length via Prometheus).scripts/p6.sh:- Assumes Project 5’s GPU cluster (or another GPU-enabled cluster) is up.
- Applies vLLM manifests.
- Waits for model warmup.
- Sends
/generatetest requests.
See:
k8s/06-vllm/*.yamlscripts/p6.shblog/p6-vllm-on-k8s.md
Sources:
- vLLM docs (
https://docs.vllm.ai), especially Kubernetes and production stack sections.
Goal: Build a Kubebuilder operator that manages vLLM Deployments via a custom resource: InferenceDeployment.
High level:
api/v1alpha1/inferencedeployment_types.go: CRD Go type.controllers/inferencedeployment_controller.go: Reconciler that:- Watches
InferenceDeploymentresources. - Creates/updates a GPU
Deployment+Servicefor vLLM. - Optionally manages an HPA.
- Watches
config/crd/bases/ml.example.com_inferencedeployments.yaml: Generated CRD YAML.examples/07-operator/inferencedeployment-sample.yaml: Example CR instance.scripts/p7.sh:- Runs
make docker-buildandmake deploy(kubebuilder defaults). - Applies the sample
InferenceDeployment.
- Runs
See:
k8s/07-operator/(CR + example)blog/p7-operator-for-ai-fleets.md
Sources:
- Kubebuilder Book (
https://book.kubebuilder.io/) - Sample controllers (
https://github.com/kubernetes-sigs/kubebuilder-declarative-pattern)
Goal: Add mutating + validating webhooks for GPU Pods, and simulate a basic cluster upgrade flow.
High level:
- Go webhook server:
- Mutating webhook: injects a logging sidecar into AI Pods.
- Validating webhook: rejects Pods requesting more than 1 GPU.
k8s/08-webhooks/:- Deployment + Service for webhook server.
ValidatingWebhookConfiguration+MutatingWebhookConfiguration.Issuer/Certificateresources for cert-manager.
scripts/p8.sh:- Installs cert-manager (if not present).
- Deploys webhook stack.
- Demonstrates:
- A Pod that gets a sidecar injected.
- A Pod with
nvidia.com/gpu: 2being rejected.
- Sketches a Kind control-plane upgrade using
kind+kubeadmdocs (no destructive steps).
See:
k8s/08-webhooks/*.yamlblog/p8-webhooks-and-upgrades.md
Sources:
- Kubernetes admission webhooks docs.
- cert-manager docs.
- Kind cluster lifecycle / upgrade docs.
Each project has a blog outline in blog/:
- P5:
blog/p5-gpu-pods-on-kind.md— “GPU Pods on Local K8s: AI Ready in 10 min.” - P6:
blog/p6-vllm-on-k8s.md— “vLLM on K8s: Production Inference Fleet.” - P7:
blog/p7-operator-for-ai-fleets.md— “Operator in 1hr: Reconcile AI Fleets.” - P8:
blog/p8-webhooks-and-upgrades.md— “Webhooks + Upgrades: Secure AI Clusters.”
These outlines are structured so you can turn each into a full tutorial post.