
Commit fe2c50e

feat(alerting): implement proactive alerting with Alertmanager (#14)
* feat(alerting): integrate alertmanager service into docker-compose
* feat(alerting): configure Prometheus to manage and forward alerts
* feat(testing): implement UI-driven alert test harness
* feat(alerting): define Prometheus rules and complete alerting pipeline
* docs(readme): create detailed documentation for Phase 6
1 parent 76e7f85 commit fe2c50e

File tree: 13 files changed (+443, -63 lines)

.dockerignore

Lines changed: 7 additions & 1 deletion
@@ -13,4 +13,10 @@ build.log
 .gitignore

 # Ignore local environment files (the real one should never be in the image)
-.env
+.env
+
+# Ignore observability stack configs not needed in the app image
+config/
+
+# Ignore persistent data volumes
+data/

.gitignore

Lines changed: 4 additions & 3 deletions
@@ -1,5 +1,3 @@
-# File: .gitignore
-
 ### Java / Maven ###
 .project
 .settings/
@@ -53,4 +51,7 @@ reports/
 dependency-check-report.*

 # Local environment variables
-.env
+.env
+
+# Persistent data volumes
+/data/

README.md

Lines changed: 105 additions & 21 deletions
@@ -1,38 +1,122 @@
 # Spring Boot Security & Observability Lab

-This repository is an advanced, hands-on lab demonstrating the architectural evolution of a modern Java application. We will build a system from the ground up, starting with a secure monolith and progressively refactoring it into a fully observable, distributed system using cloud-native best practices.
+This repository is a hands-on lab designed to demonstrate the architectural evolution of a modern Java application. We will build a system from the ground up, starting with a secure monolith and progressively refactoring it into a fully observable, distributed system using cloud-native best practices.

 ---

-## Workshop Guide: The Evolutionary Phases
+## Lab Progress: Phase 6 - Proactive Alerting with Alertmanager

-This lab is structured in distinct, self-contained phases. The `main` branch always represents the latest completed phase. To explore a previous phase's code and detailed documentation, use the links below.
+The `main` branch currently represents the completed state of **Phase 6**.

-| Phase | Description & Key Concepts | Code & Docs (at tag) | Key Pull Requests |
-|:---|:---|:---|:---|
-| **1. The Secure Monolith** | A standalone service that issues and validates its own JWTs. Concepts: `AuthenticationManager`, custom `JwtAuthenticationFilter`, `jjwt` library, and a foundational CI pipeline. | [`v1.0-secure-monolith`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v1.0-secure-monolith) | [#2](https://github.com/apenlor/spring-boot-security-observability-lab/pull/2), [#3](https://github.com/apenlor/spring-boot-security-observability-lab/pull/3), [#4](https://github.com/apenlor/spring-boot-security-observability-lab/pull/4) |
-| **2. Observing the Monolith** | The service is containerized and orchestrated via `docker-compose`. Concepts: Micrometer, Prometheus, Grafana, custom metrics, and automated dashboard provisioning. | [`v2.0-observable-monolith`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v2.0-observable-monolith) | [#6](https://github.com/apenlor/spring-boot-security-observability-lab/pull/6) |
-| **3. Evolving to Federated Identity** | The system is refactored into a multi-service architecture with an external IdP. Concepts: Keycloak, OIDC, OAuth2 Client (`web-client`) vs. Resource Server, Traefik reverse proxy, service-to-service security. | [`v3.0-federated-identity`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v3.0-federated-identity) | [#8](https://github.com/apenlor/spring-boot-security-observability-lab/pull/8) |
-| **4. Tracing a Distributed System** | Services are instrumented with the OpenTelemetry agent to generate traces. Concepts: Tempo, agent-based instrumentation, W3C Trace Context, Service Graphs, and a hybrid PUSH/PULL metrics architecture. | [`v4.0-distributed-tracing`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v4.0-distributed-tracing) | [#10](https://github.com/apenlor/spring-boot-security-observability-lab/pull/10) |
-| **5. Correlated Logs & Access Auditing** | The three pillars of observability are complete (metrics, traces, logs). Alloy is the unified collection agent. Concepts: Loki, Grafana Alloy, Docker service discovery, structured JSON logs, AOP-based auditing, trace-to-log correlation, and detailed audit metrics. | [`v5.0-correlated-logs-auditing`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v5.0-correlated-logs-auditing) | [#12](https://github.com/apenlor/spring-boot-security-observability-lab/pull/12) |
-| **6. Proactive Alerting** | _Upcoming..._ | - | - |
-| **7. Continuous Security Integration** | _Upcoming..._ | - | - |
-| **8. Advanced Secret Management** | _Upcoming..._ | - | - |
+* **Git Tag for this Phase:** `v6.0-proactive-alerting`
+
+### Objective
+
+The goal of this phase was to transition our monitoring strategy from passive (dashboards) to **proactive**. We have integrated the Prometheus Alertmanager into our stack to create a system that can automatically detect and route notifications about problems, without requiring a human to be watching a screen. This demonstrates the completion of a production-grade monitoring feedback loop.
+
+### Key Concepts Demonstrated
+
+* **Prometheus Alerting Pipeline:** Understanding the distinct roles of Prometheus (which evaluates rules and generates alerts) and Alertmanager (which receives, de-duplicates, groups, and routes alerts).
+* **Declarative Alerting Rules:** Defining alerting conditions as code using PromQL expressions in a version-controlled YAML file.
+* **Alerting on Technical & Security Metrics:** Creating two distinct types of alerts:
+    1. A **technical alert** (`ApiServerErrorRateHigh`) that fires on infrastructure-level signals like a spike in 5xx server errors.
+    2. A **security alert** (`UnauthorizedAdminAccessSpike`) that fires on application-level signals, such as an abnormal rate of `4xx` errors on a privileged endpoint.
+* **Alert Lifecycle:** Observing the full lifecycle of an alert: `Inactive` -> `Pending` -> `Firing` -> `Resolved`.
+* **UI-Driven Test Harness:** Building a dedicated "Alerting Test Panel" in our web application to reliably trigger alert conditions on demand, proving the entire pipeline works end-to-end.
+
+### Architecture Overview
+
+Phase 6 introduces Alertmanager and connects it to our existing Prometheus instance. The data flow for alerting is now a core part of our observability stack.
+
+```mermaid
+graph TD
+    subgraph "Application Services"
+        RS[Resource Server]
+        WC[Web Client]
+    end
+
+    subgraph "Observability Stack"
+        Prom[Prometheus] -->|1. Scrapes Metrics| RS
+        Prom -->|1. Scrapes Metrics| WC
+
+        subgraph "Alerting Pipeline"
+            Rules[alerts.yml] -->|2. Evaluates| Prom
+            Prom -->|3. Sends Firing Alerts| AM[Alertmanager]
+        end
+
+        G[Grafana]
+    end
+
+    subgraph "Operators / External Systems"
+        AM -->|4. Routes Notifications| Notif[Email, Slack, etc.]
+        Ops[Operator] -->|Views & Manages Alerts| AM
+        Ops -->|Views Dashboards| G
+    end
+```
+
+1. **[Prometheus](config/prometheus/prometheus.yml):** Its role is expanded. It is now configured to load a [rule file](config/prometheus/alerts.yml) and to send any alerts that become "Firing" to the Alertmanager service. The `--web.external-url` flag is set to ensure backlinks are generated with a browser-resolvable hostname.
+2. **[Alertmanager](config/alertmanager/alertmanager.yml):** The new central hub for all alerts. It receives alerts from Prometheus, groups them to reduce noise, and would (in a production setup) route them to configured receivers. For this lab, we use a "null" receiver.
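The Prometheus side of this wiring is compact. A minimal sketch of the relevant `prometheus.yml` fragment is shown below; the service address `alertmanager:9093` and the mount path `/etc/prometheus/alerts.yml` are assumptions, and the committed `config/prometheus/prometheus.yml` remains the source of truth.

```yaml
# Sketch of the alerting-related fragment of config/prometheus/prometheus.yml.
# Assumptions: the Alertmanager service is named "alertmanager" in docker-compose,
# listens on its default port 9093, and alerts.yml is mounted under /etc/prometheus/.
rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
```

With this in place, Prometheus evaluates the rules in `alerts.yml` on its regular evaluation interval and pushes any firing alerts to Alertmanager, which handles grouping and routing as described above.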

 ---

-## How to Follow This Lab
+### Key Configuration Details

-1. **Start with the `main` branch** to see the latest state of the project.
-2. To go back in time, use the **"Code & Docs" link** for a specific phase. This will show you the `README.md` for that phase, which contains the specific instructions and examples for that version of the code.
-3. To understand the *"why"* behind the changes, review the **Key Pull Requests** for each phase.
+#### 1. Prometheus Alert Rules
+
+The core of this phase is the [alerts.yml](config/prometheus/alerts.yml) file. We have defined two rules that are specifically tailored for our application and optimized for a lab environment with short `for` durations for rapid testing.
+
+* **`ApiServerErrorRateHigh`:** This rule fires when the rate of `5xx` status codes from the `resource-server` exceeds 0 for a continuous period. It is designed to be triggered by our `ChaosController`.
+* **`UnauthorizedAdminAccessSpike`:** This security-focused rule fires when the rate of `4xx` status codes on the specific `/api/secure/admin` endpoint exceeds 0. This is more robust than checking for just `403`, as it captures any client-side error on this privileged endpoint, signaling a potential issue.
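Both rules intentionally trade realism for fast feedback: the `> 0` threshold and the 15-second `for` duration make them easy to trigger from the test panel. A production-leaning variant of the first rule might look like the sketch below; the 5-minute window, the 0.5 errors/sec threshold, and the longer `for` duration are illustrative assumptions, not values from this commit.

```yaml
# Illustrative sketch only: a stricter, production-leaning variant of ApiServerErrorRateHigh.
# The window, threshold, and duration below are assumptions; the committed alerts.yml
# deliberately uses "> 0" and "for: 15s" so the lab pipeline can be demonstrated quickly.
- alert: ApiServerErrorRateHigh
  expr: sum(rate(http_server_requests_seconds_count{status=~"5..", job="resource-server"}[5m])) by (job) > 0.5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High 5xx server error rate on job '{{ $labels.job }}'."
```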
+
+#### 2. UI-Driven Test Harness
+
+To validate the entire alerting pipeline, we implemented a dedicated "Alerting Test Panel" in the `web-client`.
+* The `ChaosController` in the `resource-server` was enhanced with a guaranteed-failure endpoint (`/api/chaos/error`).
+* The `WebController` in the `web-client` was updated with two new `POST` endpoints that call the backend to generate `5xx` and `4xx` errors.
+
+---
+
+## Local Development & Quick Start
+
+The prerequisites and setup are the same as in previous phases.
+
+1. **Configure Local Hostnames (One-Time Setup, if not already done):**
+   Edit your local `hosts` file to add:
+   ```
+   127.0.0.1 keycloak.local
+   ```
+2. **Create and Configure Your Environment File:**
+   ```bash
+   cp .env.example .env
+   # ...then edit .env to add your WEB_CLIENT_SECRET from Keycloak.
+   ```
+3. **Build and run the entire stack:**
+   ```bash
+   docker-compose up --build -d
+   ```
+4. **Access the Services:**
+   * **Web Client Application:** [http://localhost:8082](http://localhost:8082) (Login with `lab-user`/`lab-user` or `lab-admin`/`lab-admin`)
+   * **Keycloak Admin Console:** [http://keycloak.local](http://keycloak.local) (Login with `admin`/`admin`)
+   * **Prometheus UI:** [http://localhost:9090](http://localhost:9090)
+   * **Alertmanager UI:** [http://localhost:9093](http://localhost:9093)
+   * **Grafana UI:** [http://localhost:3000](http://localhost:3000)
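The Alertmanager UI listed above is served by the new Alertmanager container that this commit adds to the Compose stack. A minimal sketch of what that service definition might look like is shown below; the service name, image tag, and mount paths are assumptions, and the committed `docker-compose.yml` is authoritative.

```yaml
# Sketch of an Alertmanager service entry for docker-compose.yml (assumed names/paths).
alertmanager:
  image: prom/alertmanager:latest
  command:
    - '--config.file=/etc/alertmanager/alertmanager.yml'
  volumes:
    - ./config/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
  ports:
    - "9093:9093"  # matches the Alertmanager UI URL listed above
```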

 ---

-## Running the Project
+## Validating the New Alerting Features
+
+1. **Confirm Rules are Loaded:**
+   * Navigate to the Prometheus UI's "Alerts" tab ([http://localhost:9090/alerts](http://localhost:9090/alerts)).
+   * Verify that both new alerts are present and in the green "Inactive" state.

-To run the application and see usage examples for the **current phase**, please refer to the detailed instructions in its tagged `README.md` file.
+2. **Trigger the Alerts via the UI:**
+   * Log in to the Web Client as **`lab-user` / `lab-user`**.
+   * In the "Alerting Test Panel", repeatedly click the buttons to generate `403` and `5xx` errors.
+   * Watch the Prometheus Alerts UI. The alerts will transition from `Inactive` to `Pending` (yellow) and then to `Firing` (red).
+   * Once firing, the alerts will appear in the Alertmanager UI.

-**[>> Go to instructions for the current phase: `v5.0-correlated-logs-auditing` <<](https://github.com/apenlor/spring-boot-security-observability-lab/blob/v5.0-correlated-logs-auditing/docs/phase-5-readme.md#local-development--quick-start)**
+#### Stop the Environment

-As the lab progresses, this link will always be updated to point to the latest completed phase.
+```bash
+docker-compose down -v
+```
config/alertmanager/alertmanager.yml

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+global:
+  # --- Example: Global Slack API URL ---
+  # slack_api_url: '<your-global-slack-webhook-url>'
+
+# The root route on which each incoming alert enters
+route:
+  # For this lab, we use a 'null' receiver to prevent any actual notifications from being sent
+  receiver: 'null-receiver'
+
+  # --- Example: notification settings ---
+  # group_by: ['alertname', 'job']
+  # group_wait: 30s
+  # group_interval: 5m
+  # repeat_interval: 1h
+
+  # --- Example: Sub-route ---
+  # The sub-routes. Alertmanager matches alerts against sub-routes recursively.
+  # routes:
+  #   - receiver: 'critical-alerts-webhook'
+  #     # When an alert has a label 'severity' that is 'critical'...
+  #     matchers:
+  #       - severity = critical
+  #     # ...it will be sent to the 'critical-alerts-webhook' receiver instead of the parent's.
+  #     continue: true # Set to true if you want it to also match subsequent sibling routes.
+
+# A list of receivers that define how notifications are sent
+receivers:
+  # This receiver is a blackhole, which is a perfect option for the lab.
+  - name: 'null-receiver'
+
+  # --- Example: Generic Webhook Receiver ---
+  # - name: 'critical-alerts-webhook'
+  #   webhook_configs:
+  #     # URL to send POST requests to (e.g., a custom integration, PagerDuty, etc.).
+  #     - url: 'http://some-webhook-receiver:8080/notifications'
+  #       send_resolved: true
+
+  # --- Example: Slack Receiver ---
+  # - name: 'slack-notifications'
+  #   slack_configs:
+  #     # If a global "slack_api_url" is not defined, it must be specified here.
+  #     # This also allows overriding the global URL for a specific receiver.
+  #     # api_url: '<your-channel-specific-slack-webhook-url>'
+  #     - channel: '#alerts'
+  #       send_resolved: true
+  #       # You can customize the message title, text, etc. using templates.
+  #       # title: '{{ .CommonLabels.alertname }} - {{ .Status | toUpper }}'
+  #       # text: '{{ .CommonAnnotations.summary }}'

config/prometheus/alerts.yml

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+groups:
+  - name: application_alerts
+    rules:
+      # ===================================================================
+      # RULE 1: Technical Alert - High 5xx Server Error Rate
+      # ===================================================================
+      - alert: ApiServerErrorRateHigh
+        # This expression calculates the per-second rate of 5xx server errors
+        # originating specifically from our backend resource-server.
+        # It's designed to be triggered by the ChaosController's /api/chaos/error endpoint.
+        # Lower rate (>0) to simplify testing.
+        expr: sum(rate(http_server_requests_seconds_count{status=~"5..", job="resource-server"}[1m])) by (job) > 0
+
+        # Lab Setting: A short "for" duration for rapid testing. In a real-world
+        # scenario, this would be longer to avoid noise from transient issues.
+        for: 15s
+
+        labels:
+          severity: critical
+
+        annotations:
+          summary: "High 5xx server error rate on job '{{ $labels.job }}'."
+          description: "The API service '{{ $labels.job }}' is experiencing a high rate of 5xx server errors. Current value is {{ $value | printf \"%.2f\" }} errors/sec."
+          runbook_url: "https://internal-wiki.example.com/security/runbooks/api-error-rate"
+
+      # ===================================================================
+      # RULE 2: Security Alert - High 4xx Client Error Rate (Admin Endpoint)
+      # ===================================================================
+      - alert: UnauthorizedAdminAccessSpike
+        # This expression monitors for a spike in client errors (4xx) specifically on the
+        # privileged /api/secure/admin endpoint. Any client error here is a significant security signal,
+        # indicating misconfigured clients or malicious probing. Lower rate (>0) to simplify testing.
+        expr: sum(rate(http_server_requests_seconds_count{uri="/api/secure/admin", status=~"4..", job="resource-server"}[1m])) by (job) > 0
+
+        # Also set low for demonstration purposes.
+        for: 15s
+
+        labels:
+          severity: warning
+
+        annotations:
+          summary: "Spike in unauthorized access attempts to admin endpoint."
+          description: "A high rate of failed attempts (4xx errors) to access the admin endpoint on '{{ $labels.job }}' has been detected. The current rate is {{ $value | printf \"%.2f\" }} req/sec."
+          runbook_url: "https://internal-wiki.example.com/security/runbooks/unauthorized-admin-access"
