This project simulates a Site Reliability Engineering (SRE) agent. It demonstrates how an automated system can detect and diagnose issues within a software application.
The system consists of two main parts:
- Simulator: A component that intentionally introduces faults, such as database crashes or memory leaks, into a dummy application.
- SRE Agent: An intelligent agent that monitors logs and metrics to identify these faults and generate an incident report.
Contains the code for the simulated application and the fault injection mechanism.
- chaos_monkey.py: The script that runs the application and injects errors.
- scenarios.py: Defines the different error scenarios, such as database connection failures or memory leaks.
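As a rough illustration of how fault scenarios like these could be organized, the sketch below registers each scenario under a name and has it emit a simulated log line. The `Scenario` class, the `SCENARIOS` registry, the `run_scenario` helper, and the log messages are all hypothetical and are not taken from `scenarios.py`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Scenario:
    """A named fault plus the function that simulates it (hypothetical structure)."""
    name: str
    description: str
    inject: Callable[[List[str]], None]  # appends simulated log lines


def db_connection_failure(log: List[str]) -> None:
    # Simulate the log line a failed database connection might produce.
    log.append("ERROR Database connection refused")


def memory_leak(log: List[str]) -> None:
    # Simulate a warning emitted as memory usage keeps growing.
    log.append("WARNING Memory usage at 95%")


# Registry keyed by scenario name so a runner can pick one by name.
SCENARIOS: Dict[str, Scenario] = {
    s.name: s
    for s in (
        Scenario("db_failure", "Database connection failure", db_connection_failure),
        Scenario("memory_leak", "Gradual memory leak", memory_leak),
    )
}


def run_scenario(name: str) -> List[str]:
    """Inject the named fault and return the log lines it produced."""
    log: List[str] = []
    SCENARIOS[name].inject(log)
    return log
```

Keeping scenarios in a registry means a new fault can be added without touching the code that runs them.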
Contains the logic for the SRE agent.
- main.py: The entry point for the agent.
- agent.py: The core logic that gathers data and coordinates analysis.
- llm.py: A simulated Large Language Model client. In a production environment, this would connect to an external AI service. Here, it uses keyword matching to demonstrate functionality without API keys.
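The keyword-matching approach described for `llm.py` can be sketched roughly as follows. This is a minimal illustration under assumptions, not the project's code: the `RULES` table, the `diagnose` function, and the specific keywords are hypothetical.

```python
from typing import List, Tuple

# Hypothetical keyword-to-diagnosis rules; checked in order, first match wins.
RULES: List[Tuple[Tuple[str, ...], str]] = [
    (("connection refused", "database"), "Database connection failure"),
    (("memoryerror", "out of memory", "memory usage"), "Memory leak"),
]


def diagnose(log_text: str) -> str:
    """Return the diagnosis of the first rule whose keyword appears in the logs."""
    lowered = log_text.lower()  # match case-insensitively
    for keywords, diagnosis in RULES:
        if any(keyword in lowered for keyword in keywords):
            return diagnosis
    return "Unknown fault"
```

For example, `diagnose("ERROR Database connection refused")` returns `"Database connection failure"`, and text that matches no rule falls through to `"Unknown fault"`.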
Stores runtime data and output files.
- logs/app.log: Application logs generated by the simulator.
- metrics.json: System metrics like CPU and memory usage.
- report.md: The incident report generated by the agent.
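One possible way to turn these files into a report is sketched below. It assumes `metrics.json` holds a flat JSON object of metric names and values, which the project may not actually guarantee; the `build_report` and `write_report` helpers and the report layout are hypothetical.

```python
import json
from pathlib import Path


def build_report(data_dir: Path, tail: int = 5) -> str:
    """Combine the metrics and the last few log lines into Markdown."""
    metrics = json.loads((data_dir / "metrics.json").read_text())
    log_lines = (data_dir / "logs" / "app.log").read_text().splitlines()

    lines = ["# Incident Report", "", "## Metrics"]
    # Assumed schema: a flat JSON object of metric name -> value.
    lines += [f"- {name}: {value}" for name, value in sorted(metrics.items())]
    lines += ["", "## Recent log lines"]
    lines += [f"- {line}" for line in log_lines[-tail:]]
    return "\n".join(lines) + "\n"


def write_report(data_dir: Path) -> Path:
    """Write report.md next to the data it was built from and return its path."""
    report_path = data_dir / "report.md"
    report_path.write_text(build_report(data_dir))
    return report_path
```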
First, install the required dependencies:
pip install -r requirements.txt
You will also need an Anthropic API key.
- Copy the example environment file:
cp .env.example .env
- Open .env and paste your API key:
ANTHROPIC_API_KEY=sk-ant-...
Start the web server in one terminal:
uvicorn demo_app.main:app --port 8000
Start the SRE agent in a new terminal window:
python3 -m sre_agent.main
The agent will start monitoring http://localhost:8000/health.
In a third terminal (or the same one), trigger a crash:
curl -X POST http://localhost:8000/simulate/crash
Watch the agent terminal! You should see it:
- Detect the 500 Error.
- Investigate logs.
- Restart the server automatically.
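The detect, investigate, restart sequence above might look something like the following single pass. This is only a sketch under assumptions: `handle_health_check` is a hypothetical name, and log reading and restarting are passed in as callables because the README does not say how the agent performs them.

```python
from typing import Callable, List


def handle_health_check(
    status_code: int,
    read_logs: Callable[[], List[str]],
    restart: Callable[[], None],
) -> List[str]:
    """Run one detect -> investigate -> restart pass and return the actions taken."""
    actions: List[str] = []
    if status_code < 500:
        return actions  # no server error: nothing to do

    # Detect: the health endpoint answered with a server error.
    actions.append(f"detected HTTP {status_code}")

    # Investigate: keep only the error lines as evidence.
    errors = [line for line in read_logs() if "ERROR" in line]
    actions.append(f"found {len(errors)} error line(s)")

    # Remediate: how the restart happens is left to the caller.
    restart()
    actions.append("restarted server")
    return actions
```

In a running agent, `status_code` would come from polling http://localhost:8000/health, and passing the side effects in as callables keeps this step easy to exercise without a live server.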