Real topology viewer capture: a kind-style Helm workload graph with Service, config, identity/RBAC, PVC, PV, StorageClass, CSI driver, provisioner, Node, and Event evidence.
kubernetes-ontology is a read-only Kubernetes topology service for
diagnostics, graph exploration, and AI-agent workflows.
It builds an in-memory ontology graph from Kubernetes objects, keeps the graph fresh with informers or polling, and exposes stable CLI and HTTP queries for entities, relations, neighbors, and diagnostic subgraphs.
The open-source MVP is intentionally lightweight:
- no controller or mutating webhook for the workloads being observed
- no runtime writes to observed Kubernetes resources
- no persistent database requirement
- no external graph backend requirement
- no CRD installation requirement
For the standard server and client workflow, start with QUICKSTART.md.
Kubernetes troubleshooting usually starts with scattered object reads:
kubectl get pod, then owner references, then services, events, PVCs, RBAC,
webhooks, CSI drivers, and controller pods.
This project turns those object reads into a graph:
- pods, workloads, services, nodes, storage, RBAC, events, images, webhooks, Helm releases, and Helm charts become typed entities
- Kubernetes references and inferred dependencies become typed relations
- diagnostic queries return a focused subgraph instead of a flat object dump
- AI agents can ask stable read-only questions without crawling the cluster from scratch every time
Core entity kinds in the graph include Pod, Workload, PVC, PV, StorageClass, CSIDriver, HelmRelease, and HelmChart.
- full bootstrap snapshot from the Kubernetes API
- long-running daemon with runtime status
- informer-first continuous refresh with polling fallback
- bounded CLI observe mode
- category-aware change planning
- scoped graph mutation for common update categories
Current narrow strategies:
service-narrow, event-narrow, storage-narrow, identity/security-narrow, pod-narrow, and workload-narrow.
Unsupported categories fall back to a full rebuild.
The graph can recover and correlate:
- recursive owner chains, including Pod -> ReplicaSet -> Deployment
- custom workload resources configured from CRDs, such as Kruise ASTS or Redis clusters
- display-only controller ownership rules for controller pods that Kubernetes does not expose through owner references
- service selector matches
- pod to node placement
- pod to Secret, ConfigMap, ServiceAccount, image, PVC, PV, StorageClass, and CSI driver paths
- ServiceAccount to RoleBinding and ClusterRoleBinding evidence
- Kubernetes Event and admission webhook evidence
- PV CSI metadata
- Helm release and chart provenance from standard Helm labels and annotations
CSI storage topology follows PVC -> PV/StorageClass -> CSIDriver. Component
correlation is configured with csiComponentRules; driver-specific controller
and node-agent inference is not enabled unless a matching rule is configured.
Recovered evidence can include relations such as:
provisioned_by_csi_driver, implemented_by_csi_controller, implemented_by_csi_node_agent, managed_by_csi_controller, and served_by_csi_node_agent.
Resources labeled with standard Helm metadata produce HelmRelease and
HelmChart nodes. The graph adds managed_by_helm_release and
installs_chart edges with label_evidence provenance and confidence scores.
These are ownership hints from labels, not exact manifest membership.
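For example, the standard Helm metadata on a managed resource can be inspected directly; the Deployment name here is a placeholder, and which of these exact keys the collector evaluates is an assumption:
# Show the standard Helm ownership labels and annotations on one resource
kubectl get deployment my-release-web -n default -o json | jq '{
  managedBy: .metadata.labels["app.kubernetes.io/managed-by"],
  chart: .metadata.labels["helm.sh/chart"],
  releaseName: .metadata.annotations["meta.helm.sh/release-name"],
  releaseNamespace: .metadata.annotations["meta.helm.sh/release-namespace"]
}'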
The Incident Context Pack recipe flags in this section are available in
v0.1.6 and newer release archives.
When a user only says "helm upgrade failed" and does not have the Helm CLI
output, kubernetes-ontology can still diagnose the current cluster state for
that release:
kubernetes-ontology \
--server "http://127.0.0.1:18080" \
--diagnose-helm-release \
--namespace default \
--name my-release
The response expands the probable release-owned resources and chart evidence. It also marks the missing Helm-side evidence explicitly:
- helm_cli_output_not_observed: template, values, repository, client, hook, and --atomic rollback errors are outside current Kubernetes object state.
- helm_manifest_evidence_not_collected: default Helm ownership is label and annotation evidence, not exact release manifest membership.
For rollout failures that reached the cluster, follow the release graph into
the affected Workload or Pod diagnostic. For render/client failures, ask the
user to paste the helm upgrade stderr or helm status/history output.
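If the user does have Helm CLI access, the standard commands that recover that missing evidence are (release name and namespace are placeholders):
helm status my-release -n default      # last recorded release status and notes
helm history my-release -n default     # revision history, including failed upgrades
helm get values my-release -n default  # user-supplied values recorded for the release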
Incident Context Pack v1 adds an optional recipe label for this workflow:
kubernetes-ontology \
--server "http://127.0.0.1:18080" \
--entry-kind Pod \
--namespace default \
--name bad-pod \
--recipe helm-upgrade-runtime-failure
The checked-in sample at samples/helm-upgrade-failure/ can be opened in the
viewer without a live cluster and demonstrates ranked evidence, freshness,
budget metadata, Helm caveats, and clickable evidence references.
This repository provides a Codex-style skill:
skills/kubernetes-ontology-access.
Install it directly from GitHub when you want an AI agent to guide the whole
onboarding flow instead of reading the docs manually. Users do not need to
clone this repository before installing the skill.
npx skills add https://github.com/Colvin-Y/kubernetes-ontology/tree/main/skills/kubernetes-ontology-access -g --agent codex
You can also install from the repository root and select the skill by name:
npx skills add Colvin-Y/kubernetes-ontology -s kubernetes-ontology-access -g --agent codex
Skill marketplace links intentionally point at the default branch so agents get the latest onboarding instructions. Use tagged releases for runtime binaries, container images, and Helm chart versions.
Restart Codex after installing the skill, then ask for a guided setup, for example:
Use the kubernetes-ontology-access skill to onboard my cluster with Helm,
install the CLI, run a Pod diagnostic query, and open the viewer path.
The skill connects the three intended access modes:
- AI-agent automatic troubleshooting with daemon-backed diagnostic subgraphs.
- CLI queries for status, entity resolution, relations, neighbors, expansion, and Pod/Workload diagnosis.
- Human visual inspection through the topology viewer and exported graph JSON.
Agent implementers should also read AI_CONTRACT.md for the diagnostic subgraph contract and safe downstream reasoning rules.
kubernetes-ontology is read-only with respect to the Kubernetes resources it
observes.
At runtime, the daemon does not:
- create, patch, update, or delete observed Kubernetes resources
- write annotations or status fields
- install CRDs or controllers for observed workloads
- mutate RBAC policy in the observed cluster
There are three deployment modes:
- Source/local mode uses your kubeconfig and performs read-only Kubernetes API calls.
- Release binary mode uses the published archive to run kubernetes-ontologyd on your workstation or a bastion host. It creates no Kubernetes resources and only needs network access from that host to the Kubernetes API server.
- Helm mode installs this project's own Deployment, Service, ServiceAccount,
ConfigMap, and read-only RBAC so the daemon and viewer can run in-cluster.
That install-time footprint is expected. The granted RBAC is limited to
get, list, and watch for collected resources. Secret reads are enabled by default so Secret nodes and uses_secret edges can be collected; set rbac.readSecrets=false to disable them.
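If you want to confirm that scope after an install, impersonating the chart's ServiceAccount with kubectl auth can-i is a quick check; the ServiceAccount name below assumes the default release naming and may differ in your cluster:
kubectl auth can-i list pods --as=system:serviceaccount:kubernetes-ontology:kubernetes-ontology    # expect: yes
kubectl auth can-i delete pods --as=system:serviceaccount:kubernetes-ontology:kubernetes-ontology  # expect: no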
The HTTP API is intended for local or controlled environments, not public multi-tenant exposure.
Use this path when the target cluster is private, air-gapped, or cannot pull
the published GHCR image. The release archive includes the server
kubernetes-ontologyd, the CLI client kubernetes-ontology, and the optional
viewer kubernetes-ontology-viewer.
export KO_VERSION=v0.1.6
curl -LO "https://github.com/Colvin-Y/kubernetes-ontology/releases/download/${KO_VERSION}/kubernetes-ontology_${KO_VERSION}_linux_amd64.tar.gz"
tar -xzf "kubernetes-ontology_${KO_VERSION}_linux_amd64.tar.gz"
cd "kubernetes-ontology_${KO_VERSION}_linux_amd64"Create kubernetes-ontology.yaml with a kubeconfig path and collection scope:
kubeconfig: /absolute/path/to/kubeconfig.yaml
cluster: your-logical-cluster
contextNamespaces:
- default
- kube-system
server:
  addr: 127.0.0.1:18080
  bootstrapTimeout: 2m
  streamMode: informer
Start the server:
./kubernetes-ontologyd --config ./kubernetes-ontology.yaml
Query it from another terminal:
./kubernetes-ontology --server "http://127.0.0.1:18080" --status
This mode starts only host-local processes. Stop foreground server or viewer
processes with Ctrl-C; if you background them, store the PID and kill it
when the diagnostic session ends.
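A minimal sketch of that pattern, reusing the config file created above:
./kubernetes-ontologyd --config ./kubernetes-ontology.yaml > ko.log 2>&1 &
echo $! > ko.pid                        # remember the daemon PID
./kubernetes-ontology --server "http://127.0.0.1:18080" --status
kill "$(cat ko.pid)" && rm -f ko.pid    # stop the daemon when the session ends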
Use this path when you want to run the server in Kubernetes without compiling
from source and cluster nodes can pull the configured image. For private
clusters, mirror ghcr.io/colvin-y/kubernetes-ontology to an internal registry
and set KO_IMAGE to that mirror, or use the release binary path above.
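One way to mirror the image, assuming registry.internal.example stands in for your private registry; note that docker pull/tag/push copies only the local architecture, so use a manifest-aware tool such as crane or skopeo if you need the full multi-architecture image:
docker pull ghcr.io/colvin-y/kubernetes-ontology:v0.1.6
docker tag ghcr.io/colvin-y/kubernetes-ontology:v0.1.6 registry.internal.example/kubernetes-ontology:v0.1.6
docker push registry.internal.example/kubernetes-ontology:v0.1.6
# then point KO_IMAGE below at registry.internal.example/kubernetes-ontology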
export KO_VERSION=v0.1.6
export KO_IMAGE=ghcr.io/colvin-y/kubernetes-ontology
helm upgrade --install kubernetes-ontology ./charts/kubernetes-ontology \
--namespace kubernetes-ontology \
--create-namespace \
--set image.repository="${KO_IMAGE}" \
--set image.tag="${KO_VERSION}" \
--set cluster="your-logical-cluster" \
--set contextNamespaces='{default,kube-system}'
Expose the server locally:
kubectl -n kubernetes-ontology port-forward svc/kubernetes-ontology 18080:18080
Download the kubernetes-ontology CLI from
GitHub Releases,
or set KO_VERSION to the release tag you want to install, then query the
server:
kubernetes-ontology --server "http://127.0.0.1:18080" --status
The Helm chart creates the project Deployment, Service, ServiceAccount, ConfigMap, and read-only RBAC required to run in-cluster. It also deploys the topology viewer by default:
kubectl -n kubernetes-ontology port-forward svc/kubernetes-ontology-viewer 8765:8765
Open http://127.0.0.1:8765.
Stop short-lived kubectl port-forward processes with Ctrl-C. Remove the
in-cluster footprint with:
helm uninstall kubernetes-ontology --namespace kubernetes-ontology
Use this path for local development or when you want to run the daemon from your workstation.
make build
cp local/kubernetes-ontology.yaml.example local/kubernetes-ontology.yaml
Edit local/kubernetes-ontology.yaml, then start the daemon:
make serve
In another terminal:
make status-server
make list-entities-server ENTITY_KIND=Pod NAMESPACE=default LIMIT=20
See QUICKSTART.md for the full walkthrough.
YAML config is the recommended way to keep cluster-specific settings:
kubeconfig: /absolute/path/to/kubeconfig.yaml
cluster: your-logical-cluster
namespace: default
contextNamespaces:
- default
- kube-system
server:
  addr: 127.0.0.1:18080
  url: http://127.0.0.1:18080
  bootstrapTimeout: 2m
  streamMode: informer
  pollInterval: 5s
Custom workload resources and display-only controller rules are optional:
workloadResources:
- group: apps.kruise.io
  version: v1beta1
  resource: statefulsets
  kind: StatefulSet
  namespaced: true
controllerRules:
- apiVersion: apps.kruise.io/*
  kind: "*"
  namespace: kruise-system
  controllerPodPrefixes:
  - kruise-controller-manager
  nodeDaemonPodPrefixes:
  - kruise-daemon
csiComponentRules:
- driver: diskplugin.csi.alibabacloud.com
  namespace: kube-system
  controllerPodPrefixes:
  - csi-provisioner-
  nodeAgentPodPrefixes:
  - csi-plugin-
If a configured custom resource is not installed in the cluster, the daemon logs the missing resource and skips that informer. This is expected on a clean kind cluster that does not have OpenKruise, Redis operators, or similar CRDs installed.
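To check whether a configured group is present before starting the daemon, list its served resources; the group below matches the example rule above, and empty output means that informer will be skipped:
kubectl api-resources --api-group=apps.kruise.io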
More detail: local/README.md.
Query daemon status:
./bin/kubernetes-ontology --server "http://127.0.0.1:18080" --status
Resolve a pod entity:
./bin/kubernetes-ontology \
--server "http://127.0.0.1:18080" \
--resolve-entity \
--entity-kind Pod \
--namespace default \
--name my-pod
Diagnose a pod:
./bin/kubernetes-ontology \
--server "http://127.0.0.1:18080" \
--diagnose-pod \
--namespace default \
--name my-pod \
--max-nodes 200 \
--max-edges 400
Diagnose a Helm release after a failed upgrade:
Requires v0.1.6 or newer.
./bin/kubernetes-ontology \
--server "http://127.0.0.1:18080" \
--diagnose-helm-release \
--namespace default \
--name my-release
Diagnostic responses include additive schemaVersion, recipe, lanes,
partial, warnings, budgets, rankedEvidence, degradedSources, and
conflicts fields. Agents should use
those fields to distinguish bounded evidence from complete cluster truth.
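For example, an agent can read just the bounding metadata before the evidence itself; this sketch assumes the documented field names appear at the top level of the JSON returned by the diagnostic endpoint:
curl -s "http://127.0.0.1:18080/diagnostic/pod?namespace=default&name=my-pod&maxNodes=200&maxEdges=400" \
  | jq '{schemaVersion, recipe, partial, warnings, budgets, degradedSources, conflicts}'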
Expand one graph node:
./bin/kubernetes-ontology \
--server "http://127.0.0.1:18080" \
--expand-entity \
--entity-id 'your/entityGlobalId' \
--expand-depth 1 \
--limit 100
List filtered relations:
./bin/kubernetes-ontology \
--server "http://127.0.0.1:18080" \
--list-filtered-relations \
--from 'your/entityGlobalId' \
--relation-kind scheduled_on \
--limit 50
For machine-readable server query failures:
./bin/kubernetes-ontology \
--server "http://127.0.0.1:18080" \
--machine-errors \
--resolve-entity \
--entity-kind Pod \
--namespace default \
--name missing-pod
The daemon exposes the current in-memory ontology database over HTTP:
- GET /healthz
- GET /status
- GET /entity?entityGlobalId=...
- GET /entity?kind=Pod&namespace=default&name=my-pod
- GET /entities?kind=Pod&namespace=default&limit=50
- GET /relations?from=...&kind=scheduled_on
- GET /neighbors?entityGlobalId=...&direction=out
- GET /expand?entityGlobalId=...&depth=1
- GET /diagnostic?kind=Pod&namespace=default&name=my-pod&recipe=pod-incident
- GET /diagnostic/pod?namespace=default&name=my-pod&maxNodes=200&maxEdges=400
- GET /diagnostic/workload?namespace=default&name=my-deployment
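A couple of quick smoke tests against a local or port-forwarded server (jq is optional and only pretty-prints the JSON):
curl -s http://127.0.0.1:18080/healthz
curl -s "http://127.0.0.1:18080/entities?kind=Pod&namespace=default&limit=5" | jq .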
Graph and list responses include additive freshness metadata when daemon
runtime status is available. Error responses include code, message,
status, retryable, and source alongside the historical error string.
Diagnostic responses additionally include explicit partial/budget metadata and
ranked evidence for downstream agents.
The repository includes a local topology viewer:
- kubernetes-ontology-viewer, a release binary with embedded static assets
- tools/visualize/server.py, a development server
- tools/visualize/index.html, the browser UI
Start the daemon first:
make serve
Start the viewer:
make visualize
Open http://127.0.0.1:8765.
The viewer can load live topology, query focused diagnostic graphs, expand and collapse nodes, filter by node or relation metadata, inspect provenance, and export the visible subgraph as JSON. Focused diagnostic graphs show a Diagnostic Signals panel with budget truncation, warnings, conflicts, degraded sources, and ranked evidence before lower-level explanation text.
Core layers:
- internal/collect/k8s: read-only Kubernetes collection, informers, and polling fallback
- internal/runtime: bootstrap, lifecycle, status, and stream application
- internal/ontology: entity and relation storage abstraction
- internal/server: HTTP API for status, ontology queries, and diagnostics
- internal/reconcile: full rebuild and scoped mutation reconcilers
- internal/graph: graph builder, kernel, and index
- internal/query: query facade
- internal/service/diagnostic: diagnostic subgraph query implementation
- tools/visualize: local graph viewer
Owner-chain recovery prefers controller owner references, resolves by UID first,
falls back to namespace/kind/name, guards against cycles, and supports deeper
chains beyond Pod -> ReplicaSet -> Deployment.
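The same chain can be traced by hand when sanity-checking what the builder recovered; the pod name is a placeholder and this sketch follows only the first controller owner at each hop:
# Pod -> controlling ReplicaSet
rs=$(kubectl get pod my-pod -n default -o json \
  | jq -r '.metadata.ownerReferences[] | select(.controller==true) | .name')
# ReplicaSet -> controlling Deployment
kubectl get replicaset "$rs" -n default -o json \
  | jq -r '.metadata.ownerReferences[] | select(.controller==true) | "\(.kind)/\(.name)"'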
Build:
make build
Run tests:
make test
make test runs:
go test -p 1 ./...
After code changes that touch the daemon or viewer, use the fixed local verification flow:
make verify
make serve
make visualize
make live-check NAMESPACE=default NAME=my-pod
Tagged releases publish:
- per-platform archives containing kubernetes-ontology, kubernetes-ontologyd, kubernetes-ontology-viewer, Quickstart docs, release notes, and a local config example
- a packaged Helm chart archive, for example kubernetes-ontology-0.1.6.tgz
- a multi-architecture image at ghcr.io/colvin-y/kubernetes-ontology:<tag>
- SemVer aliases without the leading v, plus latest
See docs/release.md for the release checklist. See CHANGELOG.md for release notes.
The agent skill is published from the default branch rather than from release
archives, so marketplace pages should link to the live repository path:
skills/kubernetes-ontology-access.
- Graph state is in memory only.
- HTTP auth and TLS are not implemented yet.
- Persistent graph backends and external graph adapters are outside the open-source MVP.
- RBAC topology is represented for ServiceAccount subjects and binding objects; it is not a full permission reasoning engine.
- Evidence ranking currently starts with returned Event evidence and will grow into richer signal ranking over time.
- Runtime RDF/OWL materialization is not implemented.
- Extend informer and scoped-reconcile coverage for more topology categories.
- Add HTTP auth/TLS and longer daemon soak tests.
- Improve diagnostic evidence ranking for downstream AI agents.
- Broaden RBAC interpretation without turning the MVP into a full authorization engine.
- Keep persistent stores and external graph adapters as post-MVP research.
- QUICKSTART.md: full setup and query walkthrough
- README.zh-CN.md: Chinese overview and usage notes
- AI_CONTRACT.md: contract for AI-agent consumers
- skills/kubernetes-ontology-access: project-local skill for Helm, CLI, AI-agent, and viewer onboarding
- docs/design/README.md: design document index
- docs/ontology/README.md: ontology notes
- docs/release.md: release checklist
- CONTRIBUTING.md: contribution workflow and validation
- SECURITY.md: supported versions, reporting, and safety boundaries
- CHANGELOG.md: release notes
Licensed under the Apache License, Version 2.0. See LICENSE.
