Skip to content

AI-powered SRE agent that automatically detects, diagnoses, and remediates production issues using Claude Haiku and runbook automation. Features multi-service Docker architecture, structured logging, and cost-optimized incident response.

Notifications You must be signed in to change notification settings

praniketkw/Auto-SRE-Agent

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Auto-SRE-Agent

Project Overview

This project simulates a Site Reliability Engineering (SRE) agent. It demonstrates how an automated system can detect and diagnose issues within a software application.

The system consists of two main parts:

  1. Simulator: A component that intentionally introduces faults, such as database crashes or memory leaks, into a dummy application.
  2. SRE Agent: An intelligent agent that monitors logs and metrics to identify these faults and generate an incident report.

Project Structure

Simulator (simulator/)

Contains the code for the simulated application and the fault injection mechanism.

  • chaos_monkey.py: The script that runs the application and injects errors.
  • scenarios.py: Defines the different error scenarios, such as database connection failures or memory leaks.

Agent (sre_agent/)

Contains the logic for the SRE agent.

  • main.py: The entry point for the agent.
  • agent.py: The core logic that gathers data and coordinates analysis.
  • llm.py: A simulated Large Language Model client. In a production environment, this would connect to an external AI service. Here, it uses keyword matching to demonstrate functionality without API keys.

Data (var/)

Stores runtime data and output files.

  • logs/app.log: Application logs generated by the simulator.
  • metrics.json: System metrics like CPU and memory usage.
  • report.md: The incident report generated by the agent.

Usage

1. Setup

First, install the required dependencies:

pip install -r requirements.txt

You will also need an Anthropic API Key.

  1. Copy the example environment file:
    cp .env.example .env
  2. Open .env and paste your API key:
    ANTHROPIC_API_KEY=sk-ant-...
    

2. Run the Application (The Patient)

Start the web server in one terminal:

uvicorn demo_app.main:app --port 8000

3. Run the Agent (The Doctor)

Start the SRE agent in a new terminal window:

python3 -m sre_agent.main

The agent will start monitoring http://localhost:8000/health.

4. Cause Trouble

In a third terminal (or the same one), trigger a crash:

curl -X POST http://localhost:8000/simulate/crash

Watch the Agent terminal! You should see it:

  1. Detect the 500 Error.
  2. Investigate logs.
  3. Restart the server automatically.

About

AI-powered SRE agent that automatically detects, diagnoses, and remediates production issues using Claude Haiku and runbook automation. Features multi-service Docker architecture, structured logging, and cost-optimized incident response.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages