GitHub - steven4354/k8s-gpu-stack: Progressive K8s projects: GPU workloads, vLLM serving, custom operators, webhooks

k8s-5-to-8 (GPU + AI)

A continuation of k8s-1-to-4, focusing on GPU workloads, LLM inference, custom operators, and admission webhooks.

P5: Kind + NVIDIA GPU Operator + CUDA Pod (nvidia-smi)
P6: vLLM Deployment + Service + autoscaling
P7: InferenceDeployment CRD + Operator (Kubebuilder) managing vLLM fleets
P8: Mutating/validating webhooks + simulated cluster upgrade

Lambda Labs

A10 No filesystem attachment Any region

This repo assumes a single Lambda spot GPU VM running Kind locally (not a managed Lambda Kubernetes cluster). Quickstart: scripts/lambda-labs-quickstart.sh (or scripts/p5.sh).

System requirements

Everything from k8s-1-to-4 plus:

Kind (Kubernetes in Docker, multi-node)
Helm 3
NVIDIA GPU on host with recent drivers (for real GPU tests)
kubebuilder (for Project 7)
Go 1.22+ (for operator + webhooks)
cert-manager (installed into the kind cluster for webhooks in P8)

Note: Projects 5–8 are designed to run on Kind instead of Minikube, to better mirror “real” multi-node clusters and GPU scheduling.

Project 5: GPU Workloads (Inference Stub)

Goal: Run a GPU-enabled Pod on a local Kind cluster using the NVIDIA GPU Operator, and expose a fake ML endpoint.

High level:

gpu-kind.yaml: Multi-node Kind cluster config with one GPU-capable worker.
scripts/p5.sh:
- kind create cluster --config k8s/05-gpu/gpu-kind.yaml
- Install NVIDIA GPU Operator via Helm.
- Deploy a Pod using nvidia/cuda image requesting nvidia.com/gpu: 1.
- Run nvidia-smi to prove GPU is wired up.
Extend the app with a fake ML endpoint:
- Simple HTTP server (Node or Python) with a /infer route.
- Uses a tiny torch stub (CPU tensor ops only) to mimic an inference pipeline.

See:

k8s/05-gpu/gpu-kind.yaml
k8s/05-gpu/gpu-pod.yaml
scripts/p5.sh
ml-stub/ — fake ML app (Flask + PyTorch stub, POST /infer)
blog/p5-gpu-pods-on-kind.md

Project 6: vLLM Inference Serving

Goal: Serve an LLM with vLLM on Kubernetes, with a basic autoscaling story.

High level:

k8s/06-vllm/vllm-pvc.yaml: PVC for the Hugging Face cache/model.
k8s/06-vllm/vllm-deployment.yaml: vLLM server, GPU-enabled, exposes port 8000.
k8s/06-vllm/vllm-service.yaml: ClusterIP Service for vLLM.
k8s/06-vllm/vllm-hpa.yaml: HPA scaling on CPU (or placeholder for queue length via Prometheus).
scripts/p6.sh:
- Assumes Project 5’s GPU cluster (or another GPU-enabled cluster) is up.
- Applies vLLM manifests.
- Waits for model warmup.
- Sends /generate test requests.

See:

k8s/06-vllm/*.yaml
scripts/p6.sh
blog/p6-vllm-on-k8s.md

Sources:

vLLM docs (https://docs.vllm.ai), especially Kubernetes and production stack sections.

Project 7: Custom Operator for Model Deploys

Goal: Build a Kubebuilder operator that manages vLLM Deployments via a custom resource: InferenceDeployment.

High level:

api/v1alpha1/inferencedeployment_types.go: CRD Go type.
controllers/inferencedeployment_controller.go: Reconciler that:
- Watches InferenceDeployment resources.
- Creates/updates a GPU Deployment + Service for vLLM.
- Optionally manages an HPA.
config/crd/bases/ml.example.com_inferencedeployments.yaml: Generated CRD YAML.
examples/07-operator/inferencedeployment-sample.yaml: Example CR instance.
scripts/p7.sh:
- Runs make docker-build and make deploy (kubebuilder defaults).
- Applies the sample InferenceDeployment.

See:

k8s/07-operator/ (CR + example)
blog/p7-operator-for-ai-fleets.md

Sources:

Kubebuilder Book (https://book.kubebuilder.io/)
Sample controllers (https://github.com/kubernetes-sigs/kubebuilder-declarative-pattern)

Project 8: Webhooks & Cluster Lifecycle

Goal: Add mutating + validating webhooks for GPU Pods, and simulate a basic cluster upgrade flow.

High level:

Go webhook server:
- Mutating webhook: injects a logging sidecar into AI Pods.
- Validating webhook: rejects Pods requesting more than 1 GPU.
k8s/08-webhooks/:
- Deployment + Service for webhook server.
- ValidatingWebhookConfiguration + MutatingWebhookConfiguration.
- Issuer / Certificate resources for cert-manager.
scripts/p8.sh:
- Installs cert-manager (if not present).
- Deploys webhook stack.
- Demonstrates:
  - A Pod that gets a sidecar injected.
  - A Pod with nvidia.com/gpu: 2 being rejected.
- Sketches a Kind control-plane upgrade using kind + kubeadm docs (no destructive steps).

See:

k8s/08-webhooks/*.yaml
blog/p8-webhooks-and-upgrades.md

Sources:

Kubernetes admission webhooks docs.
cert-manager docs.
Kind cluster lifecycle / upgrade docs.

Blog outlines

Each project has a blog outline in blog/:

P5: blog/p5-gpu-pods-on-kind.md — “GPU Pods on Local K8s: AI Ready in 10 min.”
P6: blog/p6-vllm-on-k8s.md — “vLLM on K8s: Production Inference Fleet.”
P7: blog/p7-operator-for-ai-fleets.md — “Operator in 1hr: Reconcile AI Fleets.”
P8: blog/p8-webhooks-and-upgrades.md — “Webhooks + Upgrades: Secure AI Clusters.”

These outlines are structured so you can turn each into a full tutorial post.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

k8s-5-to-8 (GPU + AI)

Lambda Labs

System requirements

Project 5: GPU Workloads (Inference Stub)

Project 6: vLLM Inference Serving

Project 7: Custom Operator for Model Deploys

Project 8: Webhooks & Cluster Lifecycle

Blog outlines

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
blog		blog
k8s		k8s
ml-stub		ml-stub
operator		operator
scripts		scripts
webhook		webhook
.gitignore		.gitignore
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

k8s-5-to-8 (GPU + AI)

Lambda Labs

System requirements

Project 5: GPU Workloads (Inference Stub)

Project 6: vLLM Inference Serving

Project 7: Custom Operator for Model Deploys

Project 8: Webhooks & Cluster Lifecycle

Blog outlines

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages