Auto-SRE-Agent

Project Overview

This project simulates a Site Reliability Engineering (SRE) agent. It demonstrates how an automated system can detect and diagnose issues within a software application.

The system consists of two main parts:

Simulator: A component that intentionally introduces faults, such as database crashes or memory leaks, into a dummy application.
SRE Agent: An intelligent agent that monitors logs and metrics to identify these faults and generate an incident report.

Project Structure

Simulator (simulator/)

Contains the code for the simulated application and the fault injection mechanism.

chaos_monkey.py: The script that runs the application and injects errors.
scenarios.py: Defines the different error scenarios, such as database connection failures or memory leaks.
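
To illustrate how error scenarios like these might be declared, here is a minimal sketch. The `Scenario` dataclass, its fields, the log messages, and `pick_scenario` are hypothetical names chosen for this example; the actual contents of scenarios.py may look different.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical description of one injectable fault."""
    name: str
    log_line: str          # message the simulator would write to the app log
    memory_growth_mb: int  # memory added per tick while the fault is active


# Hypothetical scenario table; the names mirror the faults mentioned above.
SCENARIOS = {
    "db_connection_failure": Scenario(
        name="db_connection_failure",
        log_line="ERROR database connection refused",
        memory_growth_mb=0,
    ),
    "memory_leak": Scenario(
        name="memory_leak",
        log_line="WARNING memory usage increasing",
        memory_growth_mb=50,
    ),
}


def pick_scenario(rng: random.Random) -> Scenario:
    """Choose one scenario at random using the supplied generator."""
    ordered = sorted(SCENARIOS.values(), key=lambda s: s.name)
    return rng.choice(ordered)
```

Passing a seeded `random.Random` makes a chaos run reproducible, which helps when comparing agent reports across runs.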

Agent (sre_agent/)

Contains the logic for the SRE agent.

main.py: The entry point for the agent.
agent.py: The core logic that gathers data and coordinates analysis.
llm.py: A simulated Large Language Model client. In a production environment, this would connect to an external AI service. Here, it uses keyword matching to demonstrate functionality without API keys.
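
The keyword-matching idea can be sketched roughly as follows. The keyword table, the diagnosis strings, and the `diagnose` function are assumptions made for illustration, not the actual code in llm.py.

```python
# Hypothetical keyword table: substring to look for, then diagnosis to report.
KEYWORD_DIAGNOSES = [
    ("connection refused", "Database connection failure"),
    ("out of memory", "Memory leak or memory exhaustion"),
    ("timeout", "Upstream service timeout"),
]


def diagnose(log_text: str) -> str:
    """Return the first diagnosis whose keyword appears in the log text."""
    lowered = log_text.lower()  # match case-insensitively
    for keyword, diagnosis in KEYWORD_DIAGNOSES:
        if keyword in lowered:
            return diagnosis
    return "Unknown issue"
```

Because the table is checked in order, entries listed earlier take priority when a log contains several keywords.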

Data (var/)

Stores runtime data and output files.

logs/app.log: Application logs generated by the simulator.
metrics.json: System metrics like CPU and memory usage.
report.md: The incident report generated by the agent.
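
As a rough sketch of how these files could fit together, the example below reads a metrics file and writes a short report. The metric keys (`cpu_percent`, `memory_percent`), the 90% limit, and the report layout are assumptions; the real schema of metrics.json is not documented here.

```python
import json
from pathlib import Path


def build_report(metrics: dict, memory_limit_percent: float = 90.0) -> str:
    """Turn a metrics dictionary into a short plain-text report."""
    # Assumed keys; missing values default to 0.0.
    cpu = float(metrics.get("cpu_percent", 0.0))
    memory = float(metrics.get("memory_percent", 0.0))
    if memory > memory_limit_percent:
        finding = "Memory usage above limit"
    else:
        finding = "No issue detected"
    return (
        "Incident Report\n\n"
        f"CPU usage: {cpu:.1f}%\n"
        f"Memory usage: {memory:.1f}%\n"
        f"Finding: {finding}\n"
    )


def write_report(metrics_path: Path, report_path: Path) -> str:
    """Read the metrics file, build the report, and save it to disk."""
    metrics = json.loads(metrics_path.read_text())
    report = build_report(metrics)
    report_path.write_text(report)
    return report
```

With the layout above, a call might look like `write_report(Path("var/metrics.json"), Path("var/report.md"))`.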

Usage

1. Setup

First, install the required dependencies:

pip install -r requirements.txt

You will also need an Anthropic API key.

Copy the example environment file:
```
cp .env.example .env
```
Open .env and paste your API key:
```
ANTHROPIC_API_KEY=sk-ant-...
```
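
For reference, one way a Python program could pick up this key is sketched below. `parse_env` and `get_api_key` are hypothetical helpers written for this example; the project may load the file differently, for instance through a dotenv library.

```python
import os
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def get_api_key(env_text: str = "") -> str:
    """Prefer the process environment, then fall back to the .env text."""
    key = os.environ.get("ANTHROPIC_API_KEY") or parse_env(env_text).get(
        "ANTHROPIC_API_KEY", ""
    )
    if not key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set")
    return key
```

Failing early with a clear error is usually easier to debug than a rejected request later on.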

2. Run the Application (The Patient)

Start the web server in one terminal:

uvicorn demo_app.main:app --port 8000

3. Run the Agent (The Doctor)

Start the SRE agent in a new terminal window:

python3 -m sre_agent.main

The agent will start monitoring http://localhost:8000/health.
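
A health check of this kind can be sketched with the standard library alone. `fetch_status` and `is_healthy` are hypothetical helpers, and treating only 2xx responses as healthy is an assumption of this sketch rather than the agent's documented behavior.

```python
import urllib.error
import urllib.request
from typing import Optional

HEALTH_URL = "http://localhost:8000/health"  # endpoint named above


def fetch_status(url: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status code, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered with a 4xx or 5xx status
    except (urllib.error.URLError, OSError):
        return None      # connection refused, DNS failure, or timeout


def is_healthy(status: Optional[int]) -> bool:
    """Treat only 2xx responses as healthy (an assumption of this sketch)."""
    return status is not None and 200 <= status < 300
```

A monitoring loop would call `is_healthy(fetch_status(HEALTH_URL))` at a fixed interval and hand off to the diagnosis step when it returns False.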

4. Cause Trouble

In a third terminal (the first two are occupied by the server and the agent), trigger a crash:

curl -X POST http://localhost:8000/simulate/crash

Watch the Agent terminal! You should see it:

Detect the 500 error.
Investigate logs.
Restart the server automatically.
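
The three steps above can be sketched as a single cycle. `handle_check` and its callback parameters are hypothetical; the log reader and restart action are passed in as functions so the example stays self-contained and does not claim to mirror the agent's real implementation.

```python
from typing import Callable, List, Optional


def handle_check(
    status: Optional[int],
    read_logs: Callable[[], List[str]],
    restart: Callable[[], None],
) -> List[str]:
    """Run one detect, investigate, restart cycle and return the actions taken."""
    actions: List[str] = []
    if status is not None and status < 500:
        return actions  # server responded without a server error; nothing to do
    actions.append(f"detected status {status}")
    # Investigate: keep only the log lines that mention an error.
    errors = [line for line in read_logs() if "ERROR" in line]
    actions.append(f"found {len(errors)} error line(s)")
    restart()
    actions.append("restarted server")
    return actions
```

Injecting `read_logs` and `restart` keeps the decision logic testable without a running server.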
