# Kubernetes Intelligence Engine
A deterministic-first, multi-agent analysis system for Kubernetes infrastructure assessment using LangChain, LangGraph, and Ollama.
KubeSentinel analyzes Kubernetes clusters to identify reliability, security, and cost issues. It combines:
- Deterministic rules (200+ built-in signals)
- AI agents (cost, security, reliability analyzers)
- Query-aware routing (understands what you're asking)
- Safe remediation (kubectl approval gates, audit logs)
- Slack integration (real-time cluster monitoring)
This is not a chatbot. It's a graph-based reasoning system that produces actionable intelligence.
## Prerequisites

- Python 3.11+
- `uv` package manager
- Ollama with `llama3.1:8b-instruct-q8_0`
- Kubernetes cluster with kubeconfig or in-cluster auth
## Quick Start

```bash
make install
# or: uv sync
```

```bash
# Full cluster analysis
uv run kubesentinel scan

# Focus on cost
uv run kubesentinel scan --query "reduce costs"

# Focus on security
uv run kubesentinel scan --query "security audit"

# CI mode (exit 0 for A/B/C, exit 1 for D/F)
uv run kubesentinel scan --ci
```

Output:

- `report.md` - full analysis with findings and recommendations
- Risk grade: A (healthy) through F (critical)
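The CI gating rule above (grades A-C pass, D/F fail) can be sketched as a small helper. `ci_exit_code` is an illustrative name, not part of the actual CLI:

```python
def ci_exit_code(grade: str) -> int:
    """Map a risk grade to a CI exit status: 0 for A/B/C, 1 for D/F."""
    return 0 if grade.upper() in {"A", "B", "C"} else 1
```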
## Documentation

Comprehensive documentation is organized into five files:
| Document | Purpose |
|---|---|
| ARCHITECTURE.md | System design, pipeline, modules, design principles |
| IMPLEMENTATION.md | Implementation details, algorithms, code structure |
| OPERATIONS_AND_USAGE.md | Installation, configuration, Slack setup, troubleshooting |
| REPORTS_AND_RESULTS.md | Test results, performance metrics, safety validation |
| README.md | Quick start, overview, key features |
## Key Features

### Deterministic-First Analysis

- 200+ built-in signals for reliability, security, cost, and architecture
- Rules-based checks run first; the LLM provides deeper insights when needed
- No hallucinations in numeric contexts (costs, replicas, etc.)
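As a flavor of what one deterministic signal might look like, here is a hypothetical rule that flags containers without resource limits; the field names mirror the Kubernetes API object shape, but this is a sketch, not KubeSentinel's actual signal code:

```python
def missing_limits_signal(deployment: dict) -> list[str]:
    """Return a finding for each container that declares no resource limits."""
    findings = []
    containers = (
        deployment.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    for c in containers:
        if not c.get("resources", {}).get("limits"):
            name = deployment["metadata"]["name"]
            findings.append(f"{name}/{c['name']}: no resource limits")
    return findings
```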
### Query-Aware Routing

- Understands what you're asking ("reduce costs" vs "security audit")
- Routes to the appropriate analysis agents automatically
- Falls back to a default fan-out for generic queries
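A minimal sketch of query-aware routing, assuming simple keyword matching (the real planner is LLM-backed); the agent names and keyword table here are illustrative:

```python
# Keyword table: agent name -> trigger substrings (illustrative, not the real planner)
ROUTES = {
    "cost": ("cost", "spend", "expensive", "waste"),
    "security": ("security", "audit", "cve", "privileged"),
    "reliability": ("crash", "pending", "restart", "outage"),
}

def route_query(query: str) -> list[str]:
    """Pick agents whose keywords appear in the query; generic queries fan out to all."""
    q = query.lower()
    agents = [agent for agent, keywords in ROUTES.items()
              if any(k in q for k in keywords)]
    return agents or list(ROUTES)  # default fallback for generic queries
```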
### Safe Remediation

- Approval gates for dangerous kubectl commands
- Audit trail for all executions
- Clear distinction between diagnostics and remediation
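The approval-gate idea can be sketched as follows. The verb allowlist and the in-memory audit log are assumptions for illustration; the actual policy lives in the remediation module:

```python
# Read-only kubectl verbs that run without approval (illustrative list)
READ_ONLY_VERBS = {"get", "describe", "logs", "top", "explain"}

audit_log: list[dict] = []  # every execution is recorded here

def requires_approval(kubectl_args: list[str]) -> bool:
    """Anything that is not a read-only verb must be approved first."""
    verb = kubectl_args[0] if kubectl_args else ""
    return verb not in READ_ONLY_VERBS

def execute(kubectl_args: list[str], approved: bool = False) -> str:
    """Gate mutating commands behind approval and audit every execution."""
    if requires_approval(kubectl_args) and not approved:
        return "BLOCKED: approval required"
    audit_log.append({"cmd": kubectl_args, "approved": approved})
    return "EXECUTED"
```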
### Slack Integration

- Interactive analysis directly in Slack
- Cache-aware follow-up questions (instant responses)
- Safe kubectl command execution with approval dialogs
- Rich formatting with risk scores and actionable findings
### Quality

- 73+ test scenarios covering all features
- 100% test pass rate
- Type-safe (clean mypy validation)
- Production-grade error handling
## Slack Setup

KubeSentinel can be accessed directly from Slack without any HTTP endpoint or ngrok.

1. **Create a Slack app:**
   - Go to api.slack.com/apps
   - Click "Create New App" → "From scratch"
   - Name: `KubeSentinel`
   - Workspace: your workspace

2. **Enable Socket Mode:**
   - Go to Settings → Socket Mode
   - Toggle "Enable Socket Mode" ON
   - Generate an App-Level Token (`xapp-...`)
   - Copy this to `SLACK_APP_TOKEN` in `.env`

3. **Configure OAuth & Permissions:**
   - Go to OAuth & Permissions
   - Under "Bot Token Scopes", add: `chat:write`, `app_mentions:read`, `channels:history`, `im:history`
   - Copy the Bot Token (`xoxb-...`) to `SLACK_BOT_TOKEN` in `.env`

4. **Subscribe to Events:**
   - Go to Event Subscriptions
   - Toggle "Enable Events" ON
   - Under "Subscribe to bot events", add: `app_mention`, `message.im`
   - Save

5. **Set local credentials:**

   ```bash
   cp .env.example .env
   # Edit .env and paste your Bot Token and App Token
   ```

6. **Run the bot:**

   ```bash
   uv run kubesentinel-slack
   # or: uv run python -m kubesentinel.integrations.slack_bot
   ```
## Using the Bot

Once the bot is running, you can ask KubeSentinel questions in Slack.

Mention the bot in a channel:

```
@kubesentinel why are pods pending
```

Or send it a direct message:

```
why are pods pending
```

The bot responds in a thread with:

- Risk score (0-100) and grade (A-F)
- Strategic summary
- Top findings (reliability, cost, security)

Example reply:

```
KubeSentinel Analysis
Risk Score: 63/100 (C) 🟡 Medium

Summary:
Cluster has 2 pods in pending state due to node resource constraints.
Redis deployment lacks resource limits.

Top Findings:
• Unschedulable pods due to insufficient CPU
• Missing resource limits on deployments
• CrashLoopBackOff in media-frontend
```
### Token Safety

- Never commit `.env` with real tokens. `.env` is in `.gitignore`; keep it local only.
- Rotate tokens immediately if exposed: go to api.slack.com/apps → your app → regenerate tokens.
## Reports

KubeSentinel generates `report.md` with the following sections:

- Architecture Report - cluster topology, orphan services, single-replica deployments
- Cost Optimization Report - over-provisioning, missing limits, waste
- Security Audit - privileged containers, `:latest` tags, vulnerabilities
- Reliability Risk Score - weighted risk assessment (0-100, grades A-F)
- Strategic AI Analysis - executive summary with prioritized recommendations
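The 0-100 score maps to letter grades A-F. The thresholds below are assumptions for illustration, chosen so that a score of 63 yields grade C, consistent with the sample Slack reply; the real risk model may use different bands:

```python
def risk_grade(score: int) -> str:
    """Map a 0-100 risk score to a letter grade (thresholds are illustrative)."""
    bands = [(20, "A"), (40, "B"), (65, "C"), (80, "D")]
    for upper, grade in bands:
        if score < upper:
            return grade
    return "F"
```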
## Development

```bash
make test
# or with pytest directly
uv run pytest kubesentinel/tests/ -v

make lint       # lint
make typecheck  # type-check
make clean      # clean build artifacts
```

## Project Structure

```
kubesentinel/
├── __init__.py
├── models.py          # State contract (TypedDict)
├── cluster.py         # Cluster snapshot node
├── graph_builder.py   # Dependency graph node
├── signals.py         # Signal engine node
├── risk.py            # Risk model node
├── tools.py           # Deterministic tools for agents
├── agents.py          # Planner, agent nodes, synthesizer
├── runtime.py         # LangGraph orchestration
├── reporting.py       # Markdown report builder
├── main.py            # Typer CLI
├── prompts/           # Agent system prompts
│   ├── planner.txt
│   ├── failure_agent.txt
│   ├── cost_agent.txt
│   ├── security_agent.txt
│   └── synthesizer.txt
└── tests/             # Unit tests
    ├── test_signals.py
    ├── test_risk.py
    └── test_graph.py
```
## Design Principles

### Deterministic Boundary

All cluster inspection is deterministic. The LLM:
- NEVER connects to the cluster
- NEVER receives full raw cluster JSON
- NEVER mutates the cluster
LLMs only see:
- Slim snapshots
- Signal summaries
- Graph summaries
- Structured tool outputs
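A slim snapshot can be sketched as a projection that keeps only the fields the LLM needs. The exact field selection here is an assumption; the real projection lives in the cluster snapshot node:

```python
def slim_pod(pod: dict) -> dict:
    """Project a raw pod object down to the few fields the LLM ever sees."""
    return {
        "name": pod["metadata"]["name"],
        "namespace": pod["metadata"].get("namespace", "default"),
        "phase": pod.get("status", {}).get("phase", "Unknown"),
        "restarts": sum(
            cs.get("restartCount", 0)
            for cs in pod.get("status", {}).get("containerStatuses", [])
        ),
    }
```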
### Typed State

All execution state is stored in a typed schema (`InfraState`).
- No hidden memory
- State flows node-to-node
- Full checkpointing support
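As a rough sketch of what such a state contract might look like (the actual fields are defined in `kubesentinel/models.py` and may differ):

```python
from typing import TypedDict

class InfraState(TypedDict, total=False):
    """Illustrative state contract; field names are assumptions."""
    snapshot: dict        # slim cluster snapshot
    graph: dict           # dependency graph summary
    signals: list[dict]   # deterministic signal hits
    risk_score: int       # 0-100
    risk_grade: str       # A-F
    findings: dict        # per-agent findings
    report: str           # rendered markdown report
```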
### Explicit Orchestration

Uses LangGraph's `StateGraph` for explicit delegation:

```
scan_cluster → build_graph → generate_signals → compute_risk
    → planner → [agents] → synthesizer → END
```

Delegation = graph traversal.
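The node-to-node state flow above can be sketched as plain functions that each take the state and return an updated copy; in the real system this wiring is done by LangGraph's `StateGraph`, so this is a dependency-free stand-in with stubbed node bodies:

```python
# Each node: state in, updated state out (bodies are stubs for illustration)
def scan_cluster(state: dict) -> dict:
    return {**state, "snapshot": {"pods": 12}}

def generate_signals(state: dict) -> dict:
    return {**state, "signals": ["missing-limits"]}

def compute_risk(state: dict) -> dict:
    return {**state, "risk_score": 63}

PIPELINE = [scan_cluster, generate_signals, compute_risk]

def run(state: dict) -> dict:
    """Traverse the pipeline: delegation is just graph (here: list) traversal."""
    for node in PIPELINE:
        state = node(state)
    return state
```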
### Bounded State

To prevent unbounded growth:
- Max 1000 pods
- Max 200 deployments
- Max 200 services
- Max 200 signals
- Max 50 findings per agent
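These caps can be applied with a simple clipping pass before collections enter the state; a minimal sketch, assuming the snapshot is a dict of lists:

```python
# Truncation caps from the list above (per-agent findings handled elsewhere)
CAPS = {"pods": 1000, "deployments": 200, "services": 200, "signals": 200}

def truncate(snapshot: dict) -> dict:
    """Clip each capped collection; uncapped keys pass through unchanged."""
    return {
        key: values[: CAPS.get(key, len(values))]
        for key, values in snapshot.items()
    }
```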
## Status

- ✅ End-to-end execution works
- ✅ DeepAgent graph delegates properly
- ✅ Risk score computed
- ✅ Reports generated
- ✅ Memory checkpointing works
- ✅ < 20 files
- ✅ ~1000 LOC
## Out of Scope

This is an MVP. The following are explicitly out of scope:

- ❌ Auto-remediation
- ❌ UI/Dashboard
- ❌ RAG/Vector databases
- ❌ Telemetry/Observability
- ❌ Fine-tuning
- ❌ Additional agent types
- ❌ Docker/Kubernetes deployment (dev-only for MVP)
## License

MIT License - see the LICENSE file for details.

## Contributing

This is an MVP implementation following strict requirements.
Feature additions beyond scope will not be accepted.
For bugs or improvements within scope, please open an issue.

Built with LangChain, LangGraph, and Ollama.