# AI DevOps Cloud Monitor

This project provides an AI-assisted DevOps monitoring stack. A FastAPI agent receives Alertmanager webhooks, summarizes context via Google Gemini, optionally auto-remediates Kubernetes workloads (scale/restart), and posts summaries to Telegram. It includes Helm charts, raw Kubernetes manifests, and Grafana dashboards.
## Features

- FastAPI AI agent with `/healthz`, `/readyz`, and `/alert` endpoints
- Gemini-based analysis with model auto-discovery and a configurable `GEMINI_MODEL`
- Auto-remediation helpers to scale/restart Kubernetes deployments
- Telegram notifications (optional)
- Helm chart (`helm/ai-monitor`) and raw manifests (`kubernetes/`)
- Grafana dashboards and Prometheus datasource templates (`grafana/`)
## Architecture

- Alertmanager → HTTP POST to the agent's `/alert` endpoint
- Agent → summarize with Gemini → decide action → optional Kubernetes change → Telegram notification
- Actions use the Kubernetes Python client first, falling back to `kubectl` if needed
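The "decide action" step can be sketched as a small pure function. The label names (`alertname`, `deployment`), action names, and mapping below are hypothetical illustrations, not the repository's actual schema — the real agent consults Gemini before choosing an action:

```python
def decide_action(alert: dict, default_replicas: int = 4) -> dict:
    """Map one Alertmanager alert to a remediation decision (illustrative)."""
    labels = alert.get("labels", {})
    target = labels.get("deployment", "cpu-app")
    name = labels.get("alertname", "")

    if name == "HighCPUUsage":
        # CPU pressure -> scale the deployment out
        return {"action": "scale", "deployment": target, "replicas": default_replicas}
    if name == "CrashLoopBackOff":
        # Crashing pods -> rolling restart
        return {"action": "restart", "deployment": target}
    # Unknown alert -> notify only, no cluster change
    return {"action": "none", "deployment": target}
```

For example, `decide_action({"labels": {"alertname": "HighCPUUsage"}})` yields a scale decision with the default replica count of 4, mirroring `DEFAULT_SCALE_REPLICAS` below.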
## Prerequisites

- Python 3.10+ recommended (3.12 preferred). 3.9 works but prints legacy warnings.
- `kubectl` configured if you want real remediation.
- Optional: Gemini API key (`GOOGLE_API_KEY` or `GEMINI_API_KEY`), Telegram bot token/chat ID.
## Environment Variables

- `GOOGLE_API_KEY` or `GEMINI_API_KEY`: Gemini access (omit for demo mode)
- `GEMINI_MODEL` (optional): e.g. `gemini-2.5-flash` (auto-discovery is built in)
- `TELEGRAM_TOKEN`, `TELEGRAM_CHAT_ID` (optional)
- `KUBECONFIG`: path to kubeconfig (defaults to standard locations)
- `DEFAULT_SCALE_REPLICAS` (int, default 4)
- `AUTO_REMEDIATE` (true/false, default true)
- `DEMO_MODE` (true/false, default false)
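As a sketch, reading these variables with their documented defaults might look like the following. The function name and dict layout are illustrative, not the actual code in `src/`:

```python
import os

def load_config(env=None):
    """Collect agent settings from environment variables,
    applying the defaults documented above (illustrative)."""
    env = dict(os.environ) if env is None else env

    def as_bool(value, default):
        # Accept common truthy spellings; fall back to the documented default
        if value is None:
            return default
        return value.strip().lower() in {"1", "true", "yes", "on"}

    return {
        # Either key name grants Gemini access; None -> demo-style operation
        "api_key": env.get("GOOGLE_API_KEY") or env.get("GEMINI_API_KEY"),
        "model": env.get("GEMINI_MODEL"),  # None -> rely on auto-discovery
        "telegram_token": env.get("TELEGRAM_TOKEN"),
        "telegram_chat_id": env.get("TELEGRAM_CHAT_ID"),
        "kubeconfig": env.get("KUBECONFIG"),  # None -> standard locations
        "default_scale_replicas": int(env.get("DEFAULT_SCALE_REPLICAS", "4")),
        "auto_remediate": as_bool(env.get("AUTO_REMEDIATE"), True),
        "demo_mode": as_bool(env.get("DEMO_MODE"), False),
    }
```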
## Quick Start

- Create a venv and install dependencies:

```bash
python3 -m venv .venv && source .venv/bin/activate
pip install -U pip setuptools wheel
pip install -r requirements.txt
```

- Create `.env` (optional for demo):

```bash
cp .env.example .env   # if provided; otherwise create it with the keys you need
# minimally, for real mode:
# GOOGLE_API_KEY=...
# TELEGRAM_TOKEN=...
# TELEGRAM_CHAT_ID=...
```

- Run in real mode (recommended on Python 3.10+):

```bash
python3 -m uvicorn src.ai_agent:app --host 0.0.0.0 --port 8080
```

- Or run in demo mode (no external dependencies):

```bash
DEMO_MODE=true python3 -m uvicorn src.ai_agent:app --host 0.0.0.0 --port 8080
```

## Smoke Test

- Start the agent (real or demo, as above)
- Send a sample Alertmanager webhook:

```bash
curl -s -X POST http://localhost:8080/alert \
  -H 'Content-Type: application/json' \
  --data @demo/sample-alert.json | jq .
```

- Expected results:
  - The response contains `{"status": "processed", "action": ...}`
  - Logs show the Gemini analysis; in real mode, a current model is auto-selected
  - If Kubernetes is reachable and `cpu-app` exists, a scale/restart action may occur
  - If Telegram is configured, a message is sent; otherwise it is skipped gracefully
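If you prefer driving the smoke test from Python rather than curl, a minimal sender could look like this. The payload mirrors the Alertmanager v4 webhook format, but the specific labels are illustrative — the exact contents of `demo/sample-alert.json` may differ:

```python
import json
from urllib import request

# Alertmanager v4-style webhook body (illustrative labels; compare with
# demo/sample-alert.json in this repo)
SAMPLE_ALERT = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "HighCPUUsage",
                "deployment": "cpu-app",
                "severity": "critical",
            },
            "annotations": {"summary": "CPU usage above threshold"},
        }
    ],
}

def send_alert(url="http://localhost:8080/alert", payload=SAMPLE_ALERT):
    """POST the sample alert to the agent and return its JSON response."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```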
## Docker

Build and run (real mode):

```bash
docker build -t ai-devops-cloud-monitor:local .
docker run --env-file .env -p 8080:8080 ai-devops-cloud-monitor:local
```

Build and run (demo mode):

```bash
docker build -t ai-devops-cloud-monitor:demo .
docker run -e DEMO_MODE=true -p 8080:8080 ai-devops-cloud-monitor:demo
```

## Kubernetes

```bash
kubectl apply -f kubernetes/
# or minimally deploy only the sample app if you just want to test remediation
kubectl apply -f kubernetes/sample-app-deployment.yaml
# If running the agent locally against a cluster, ensure kubectl works and KUBECONFIG is set
export KUBECONFIG="$HOME/.kube/config"
kubectl get deploy cpu-app -n default
```

## Deploy the Sample Workload (Step by Step)

```bash
# 1) Point kubectl at your cluster
export KUBECONFIG="$HOME/.kube/config"
kubectl config current-context

# 2) Ensure the target namespace exists (default is fine)
kubectl get ns

# 3) Deploy the sample cpu-app workload
kubectl apply -f kubernetes/sample-app-deployment.yaml

# 4) Wait for the rollout to complete
kubectl rollout status deploy/cpu-app -n default
kubectl get deploy cpu-app -n default

# 5) (Optional) Inspect pods
kubectl get pods -l app=cpu-app -n default
kubectl logs -l app=cpu-app -n default --tail=100

# 6) (Optional) Generate some load to trigger CPU usage (adjust as needed)
# kubectl run loadgen --rm -it --image=busybox --restart=Never -- sh -c \
#   'i=0; while [ $i -lt 200000 ]; do i=$((i+1)); done; echo done'

# 7) Verify the agent scales cpu-app after an alert (send the sample alert to the agent)
curl -s -X POST http://localhost:8080/alert \
  -H 'Content-Type: application/json' \
  --data @demo/sample-alert.json | jq .

# 8) Confirm replicas changed (if the decided action was to scale)
kubectl get deploy cpu-app -n default -o jsonpath='{.spec.replicas}{"\n"}'
```

## Helm

```bash
helm upgrade --install ai-monitor helm/ai-monitor
```

## Grafana

- Import the dashboard JSON from `grafana/dashboards/`
- Configure Prometheus using `grafana/datasources/prometheus-datasource.yaml` (or your platform's preferred method)
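For reference, a Grafana provisioning file for a Prometheus datasource generally follows this shape. The URL below is a placeholder assumption — use your cluster's Prometheus service address, and compare with the file shipped in `grafana/datasources/`:

```yaml
# Grafana datasource provisioning (generic shape; placeholder URL)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-server.monitoring.svc.cluster.local:9090
    isDefault: true
```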
## Troubleshooting

- Port already in use: `lsof -i :8080`, then `kill -9 <PID>`
- Python 3.9 warnings: upgrade to Python 3.10+ to remove the legacy messages
- Gemini model 404: set `GEMINI_MODEL` to a model enabled for your key (e.g., `gemini-2.5-flash`), or rely on auto-discovery
- Kubernetes 404 "deployment not found": deploy the sample `cpu-app` or adjust the target name/namespace
- No kube config: set `KUBECONFIG=$HOME/.kube/config` or run in-cluster
- Disable auto-remediation while testing: `AUTO_REMEDIATE=false`
## Project Structure

- `.env.example` (if provided)
- `requirements.txt`
- `Dockerfile`
- `helm/ai-monitor/` (chart and templates)
- `kubernetes/` (agent, service, alert rules, Alertmanager config, sample app)
- `src/` (FastAPI app and utils)
- `grafana/` (dashboards and datasource)
## License

MIT