Complete operational knowledge for deploying, managing, and operating production Kubernetes clusters using Talos Linux. An agentic stack that turns AI agents into Talos experts.
This repository is a structured knowledge base — not code. It teaches AI agents (Claude, Gemini, GPT, etc.) how to operate Kubernetes clusters on Talos Linux across the full lifecycle: from initial infrastructure provisioning through production operations, upgrades, and disaster recovery.
When an agent reads the CLAUDE.md entry point, it gains:
- Deep understanding of Talos's immutable, API-driven architecture
- Step-by-step procedures with exact `talosctl` and `kubectl` commands
- Decision frameworks for choosing CNI, storage, ingress, and other components
- Safety guardrails that prevent destructive operations without approval
- Symptom-based troubleshooting decision trees
| Area | What's Covered |
|---|---|
| Deployment targets | Bare metal, Proxmox, VMware, libvirt, AWS, GCP, Azure, Hetzner, DigitalOcean |
| CNI | Cilium (7 install methods), Flannel, Calico, Multus |
| Storage | Rook-Ceph, Longhorn, OpenEBS, Local Path, NFS, Cloud CSI (EBS/PD/Azure Disk) |
| Platform | Flux, ArgoCD, NGINX/Traefik/Cilium Gateway API, cert-manager, Prometheus, Loki, OpenTelemetry, Kyverno/OPA, Sealed Secrets/ESO/Vault, Cilium Mesh/Istio/Linkerd |
| Operations | Health checks, node scaling, Talos + K8s upgrades, etcd backup/DR, CA rotation |
| Topologies | Single-node dev, 3 CP + N workers, 5 CP + N workers, multi-site |
| Talos versions | 1.8.x, 1.9.x with version-specific known issues |
Pull this stack into your project:

    agentic-stacks init ./my-cluster --name my-cluster --namespace my-org --from kubernetes-talos

Or clone directly:

    git clone https://github.com/agentic-stacks/kubernetes-talos.git .stacks/kubernetes-talos

Then point your agent to `CLAUDE.md` (or `.stacks/kubernetes-talos/CLAUDE.md` if using the stacks workflow). The agent uses the routing table in that file to navigate to the right skill for any task.
Browse the skills directly:
- New to Talos? Start with `skills/foundation/concepts`
- Building a cluster? Follow the new cluster workflow
- Choosing components? Check the decision guides
- Something broken? Jump to troubleshooting
| Skill | Description |
|---|---|
| `foundation/concepts` | Talos architecture, immutable OS model, API-driven operations |
| `foundation/machine-config` | Config generation, patching, secrets management, system extensions |
| `foundation/infrastructure` | Platform-specific provisioning guides for 9 platforms |

| Skill | Description |
|---|---|
| `deploy/bootstrap` | Cluster creation, `talosctl bootstrap`, kubeconfig retrieval |
| `deploy/networking` | CNI selection, comparison, and installation (Cilium, Flannel, Calico, Multus) |
| `deploy/storage` | CSI selection, comparison, and installation (6 options) |

| Skill | Description |
|---|---|
| `platform/gitops` | Flux and ArgoCD bootstrap, repo structure patterns |
| `platform/ingress` | NGINX, Traefik, Cilium Gateway API, cert-manager |
| `platform/observability` | Prometheus, Loki, OpenTelemetry, Talos-native metrics |
| `platform/security` | Pod security, secrets management, RBAC, network policy |
| `platform/service-mesh` | Cilium mesh, Istio, Linkerd |

| Skill | Description |
|---|---|
| `operations/health-check` | Cluster validation procedures and health report format |
| `operations/scaling` | Adding and removing nodes, topology changes |
| `operations/upgrades` | Talos OS, Kubernetes, and component rolling upgrades |
| `operations/backup-restore` | etcd backup, Velero, disaster recovery procedures |
| `operations/certificate-mgmt` | Talos PKI, CA rotation, expiry monitoring |

| Skill | Description |
|---|---|
| `diagnose/troubleshooting` | Symptom-based decision trees for 8 common scenarios |

| Skill | Description |
|---|---|
| `reference/known-issues` | Version-specific bugs and workarounds |
| `reference/compatibility` | Talos/K8s/CNI/CSI compatibility matrices |
| `reference/decision-guides` | Trade-off analyses for CNI, CSI, topology, HA, GitOps |

For a new cluster, work through the skills in this order:

    foundation/concepts → foundation/machine-config → foundation/infrastructure
      → deploy/bootstrap → deploy/networking → deploy/storage
      → platform/* (as needed) → operations/health-check

For an existing cluster, jump directly to the relevant `operations/`, `diagnose/`, or `platform/` skill.
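The new-cluster ordering above can be turned into a quick checklist. The following is a minimal sketch, not part of the stack itself: it assumes the stack was cloned to `.stacks/kubernetes-talos` and that each skill keeps its entry point at `skills/<skill>/README.md`. The function name `list_new_cluster_skills` and the `STACK_DIR` variable are hypothetical.

```shell
#!/bin/sh
# Sketch: print the README entry point for each skill in the new-cluster order.
# STACK_DIR is an assumption: the clone location used elsewhere in this README.
STACK_DIR="${STACK_DIR:-.stacks/kubernetes-talos}"

# Skill order copied from the workflow above.
# platform/* is left out because those skills are "as needed".
NEW_CLUSTER_SKILLS="foundation/concepts foundation/machine-config foundation/infrastructure deploy/bootstrap deploy/networking deploy/storage operations/health-check"

# Print one numbered line per skill, marking whether its README.md exists.
list_new_cluster_skills() {
  step=1
  for skill in $NEW_CLUSTER_SKILLS; do
    readme="$STACK_DIR/skills/$skill/README.md"
    if [ -f "$readme" ]; then status="ok"; else status="missing"; fi
    printf '%d. %s [%s]\n' "$step" "$readme" "$status"
    step=$((step + 1))
  done
}
```

Calling `list_new_cluster_skills` from the project root prints the seven entry points in reading order, which makes a missing or misplaced stack checkout easy to spot.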
| Tool | Purpose |
|---|---|
| `talosctl` | Talos API client (version must match target Talos version) |
| `kubectl` | Kubernetes CLI |
| `helm` | Helm package manager |
| `flux` | Flux CLI (optional, for GitOps with Flux) |
| `argocd` | ArgoCD CLI (optional, for GitOps with ArgoCD) |

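Before starting, it can help to confirm that the tools in the table are on `PATH`. The sketch below is illustrative only; `check_tools` is a hypothetical helper, not something the stack ships. It checks presence only and does not verify that the `talosctl` version matches the target Talos version.

```shell
#!/bin/sh
# Sketch: report which CLI tools are available on PATH.
# Prints one "ok" or "missing" line per tool.
# Returns 0 if every tool was found, 1 if at least one is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf 'ok       %s\n' "$tool"
    else
      printf 'missing  %s\n' "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (flux and argocd are optional, so check them separately if needed):
#   check_tools talosctl kubectl helm
```

Once `talosctl` is found, `talosctl version --client` prints the client version so it can be compared by hand against the Talos release running on the nodes.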
When using this stack, your operator project should look like:

    my-cluster/
    ├── CLAUDE.md                  # Points to .stacks/kubernetes-talos/
    ├── stacks.lock
    ├── .stacks/
    │   └── kubernetes-talos/      # This stack
    ├── controlplane.yaml          # Generated machine config
    ├── controlplane.yaml.orig     # Stock config for diffing
    ├── worker.yaml
    ├── worker.yaml.orig
    ├── secrets.yaml               # Cluster secrets (keep secure)
    ├── patches/                   # Per-role and per-node patches
    ├── talosconfig                # Talos client config
    ├── kubeconfig                 # K8s client config
    ├── manifests/                 # Platform component manifests
    └── scripts/                   # Operational scripts
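The `.orig` files in the layout above exist so that a patched config can be compared against the stock one. How the stack itself produces them is not stated here, so the following is only a sketch under that assumption; `snapshot_stock_config` and `show_config_drift` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: keep a pristine copy of a generated machine config and show drift.

# Copy <file> to <file>.orig once; never overwrite an existing snapshot.
snapshot_stock_config() {
  cfg="$1"
  if [ ! -f "$cfg.orig" ]; then
    cp "$cfg" "$cfg.orig"
  fi
}

# Print a unified diff of the stock config against the current one.
# Returns diff's own exit code: 0 if identical, 1 if they differ.
show_config_drift() {
  cfg="$1"
  diff -u "$cfg.orig" "$cfg"
}
```

A typical sequence would be to call `snapshot_stock_config controlplane.yaml` right after generating the config, apply patches, and then run `show_config_drift controlplane.yaml` to review exactly what changed before applying it to a node.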
This stack follows the agentic-stacks format. Each skill is a directory under skills/ with a README.md entry point and optional sub-files for specific topics.
To add or update content:
- Follow the existing writing style (imperative headings, exact commands, decision trees)
- Verify commands against the official Talos docs
- Add version-specific notes where behavior differs between Talos releases
- Update `stack.yaml` if adding new skills
MIT