|
1 | 1 | # Spring Boot Security & Observability Lab |
2 | 2 |
|
3 | | -This repository is an advanced, hands-on lab demonstrating the architectural evolution of a modern Java application. We will build a system from the ground up, starting with a secure monolith and progressively refactoring it into a fully observable, distributed system using cloud-native best practices. |
| 3 | +This repository is a hands-on lab designed to demonstrate the architectural evolution of a modern Java application. We will build a system from the ground up, starting with a secure monolith and progressively refactoring it into a fully observable, distributed system using cloud-native best practices. |
4 | 4 |
|
5 | 5 | --- |
6 | 6 |
|
7 | | -## Workshop Guide: The Evolutionary Phases |
| 7 | +## Lab Progress: Phase 6 - Proactive Alerting with Alertmanager |
8 | 8 |
|
9 | | -This lab is structured in distinct, self-contained phases. The `main` branch always represents the latest completed phase. To explore a previous phase's code and detailed documentation, use the links below. |
| 9 | +The `main` branch currently represents the completed state of **Phase 6**. |
10 | 10 |
|
11 | | -| Phase | Description & Key Concepts | Code & Docs (at tag) | Key Pull Requests | |
12 | | -|:-----------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
13 | | -| **1. The Secure Monolith** | A standalone service that issues and validates its own JWTs. Concepts: `AuthenticationManager`, custom `JwtAuthenticationFilter`, `jjwt` library, and a foundational CI pipeline. | [`v1.0-secure-monolith`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v1.0-secure-monolith) | [#2](https://github.com/apenlor/spring-boot-security-observability-lab/pull/2), [#3](https://github.com/apenlor/spring-boot-security-observability-lab/pull/3), [#4](https://github.com/apenlor/spring-boot-security-observability-lab/pull/4) | |
14 | | -| **2. Observing the Monolith** | The service is containerized and orchestrated via `docker-compose`. Concepts: Micrometer, Prometheus, Grafana, custom metrics, and automated dashboard provisioning. | [`v2.0-observable-monolith`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v2.0-observable-monolith) | [#6](https://github.com/apenlor/spring-boot-security-observability-lab/pull/6) | |
15 | | -| **3. Evolving to Federated Identity** | The system is refactored into a multi-service architecture with an external IdP. Concepts: Keycloak, OIDC, OAuth2 Client (`web-client`) vs. Resource Server, Traefik reverse proxy, service-to-service security. | [`v3.0-federated-identity`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v3.0-federated-identity) | [#8](https://github.com/apenlor/spring-boot-security-observability-lab/pull/8) | |
16 | | -| **4. Tracing a Distributed System** | Services are instrumented with the OpenTelemetry agent to generate traces. Concepts: Tempo, agent-based instrumentation, W3C Trace Context, Service Graphs, and a hybrid PUSH/PULL metrics architecture. | [`v4.0-distributed-tracing`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v4.0-distributed-tracing) | [#10](https://github.com/apenlor/spring-boot-security-observability-lab/pull/10) | |
17 | | -| **5. Correlated Logs & Access Auditing** | The three pillars of observability are complete (metrics, traces, logs). Alloy is the unified collection agent. Concepts: Loki, Grafana Alloy, Docker service discovery, structured JSON logs, AOP-based auditing, trace-to-log correlation, and detailed audit metrics. | [`v5.0-correlated-logs-auditing`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v5.0-correlated-logs-auditing) | [#12](https://github.com/apenlor/spring-boot-security-observability-lab/pull/12) | |
18 | | -| **6. Proactive Alerting** | _Upcoming..._ | - | - | |
19 | | -| **7. Continuous Security Integration** | _Upcoming..._ | - | - | |
20 | | -| **8. Advanced Secret Management** | _Upcoming..._ | - | - | |
| 11 | +* **Git Tag for this Phase:** `v6.0-proactive-alerting` |
| 12 | + |
| 13 | +### Objective |
| 14 | + |
| 15 | +The goal of this phase was to move our monitoring strategy from passive (dashboards) to **proactive**. We integrated Prometheus Alertmanager into the stack so that the system detects problems and routes notifications automatically, without requiring a human to watch a screen. This completes a production-grade monitoring feedback loop.
| 16 | + |
| 17 | +### Key Concepts Demonstrated |
| 18 | + |
| 19 | +* **Prometheus Alerting Pipeline:** Understanding the distinct roles of Prometheus (which evaluates rules and generates alerts) and Alertmanager (which receives, de-duplicates, groups, and routes alerts). |
| 20 | +* **Declarative Alerting Rules:** Defining alerting conditions as code using PromQL expressions in a version-controlled YAML file. |
| 21 | +* **Alerting on Technical & Security Metrics:** Creating two distinct types of alerts: |
| 22 | + 1. A **technical alert** (`ApiServerErrorRateHigh`) that fires on infrastructure-level signals, such as a spike in `5xx` server errors.
| 23 | + 2. A **security alert** (`UnauthorizedAdminAccessSpike`) that fires on application-level signals, such as an abnormal rate of `4xx` errors on a privileged endpoint. |
| 24 | +* **Alert Lifecycle:** Observing the full lifecycle of an alert: `Inactive` -> `Pending` -> `Firing` -> `Resolved`. |
| 25 | +* **UI-Driven Test Harness:** Building a dedicated "Alerting Test Panel" in our web application to reliably trigger alert conditions on demand, proving the entire pipeline works end-to-end. |
| 26 | + |
| 27 | +### Architecture Overview |
| 28 | + |
| 29 | +Phase 6 introduces Alertmanager and connects it to our existing Prometheus instance. The data flow for alerting is now a core part of our observability stack. |
| 30 | + |
| 31 | +```mermaid |
| 32 | +graph TD |
| 33 | + subgraph "Application Services" |
| 34 | + RS[Resource Server] |
| 35 | + WC[Web Client] |
| 36 | + end |
| 37 | +
|
| 38 | + subgraph "Observability Stack" |
| 39 | + Prom[Prometheus] -->|1. Scrapes Metrics| RS |
| 40 | + Prom -->|1. Scrapes Metrics| WC |
| 41 | + |
| 42 | + subgraph "Alerting Pipeline" |
| 43 | + Rules[alerts.yml] -->|2. Evaluates| Prom |
| 44 | + Prom -->|3. Sends Firing Alerts| AM[Alertmanager] |
| 45 | + end |
| 46 | +
|
| 47 | + G[Grafana] |
| 48 | + end |
| 49 | + |
| 50 | + subgraph "Operators / External Systems" |
| 51 | + AM -->|4. Routes Notifications| Notif[Email, Slack, etc.] |
| 52 | + Ops[Operator] -->|Views & Manages Alerts| AM |
| 53 | + Ops -->|Views Dashboards| G |
| 54 | + end |
| 55 | +``` |
| 56 | + |
| 57 | +1. **[Prometheus](config/prometheus/prometheus.yml):** Its role is expanded. It is now configured to load a [rule file](config/prometheus/alerts.yml) and to send any alerts that become "Firing" to the Alertmanager service. The `--web.external-url` flag is set to ensure backlinks are generated with a browser-resolvable hostname. |
| 58 | +2. **[Alertmanager](config/alertmanager/alertmanager.yml):** The new central hub for all alerts. It receives alerts from Prometheus and groups them to reduce noise. In a production setup it would route them to configured receivers; for this lab we use a "null" receiver, so alerts are received and grouped but no notification is sent.
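
The wiring described in the two items above can be sketched as a pair of configuration fragments. This is a minimal illustration, not the repository's actual files: the container path `/etc/prometheus/alerts.yml`, the `alertmanager:9093` target, the grouping key, and the timing values are all assumptions.

```yaml
# --- sketch of config/prometheus/prometheus.yml (alerting-related keys only) ---
rule_files:
  - /etc/prometheus/alerts.yml        # assumed mount path inside the container

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # assumed compose service name; 9093 is the Alertmanager port
---
# --- sketch of config/alertmanager/alertmanager.yml ---
route:
  receiver: "null"            # every alert falls through to the no-op receiver
  group_by: ["alertname"]     # assumed grouping key
  group_wait: 10s             # assumed timings, kept short for a lab
  group_interval: 1m
  repeat_interval: 1h

receivers:
  - name: "null"              # no integrations configured, so nothing is sent
```

A receiver with no integrations is a common way to exercise grouping and routing without delivering notifications; swapping in a real receiver later only requires changing the `receivers` entry and the route that points to it.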
21 | 59 |
|
22 | 60 | --- |
23 | 61 |
|
24 | | -## How to Follow This Lab |
| 62 | +### Key Configuration Details |
25 | 63 |
|
26 | | -1. **Start with the `main` branch** to see the latest state of the project. |
27 | | -2. To go back in time, use the **"Code & Docs" link** for a specific phase. This will show you the `README.md` for that phase, which contains the specific instructions and examples for that version of the code. |
28 | | -3. To understand the *"why"* behind the changes, review the **Key Pull Requests** for each phase. |
| 64 | +#### 1. Prometheus Alert Rules |
| 65 | + |
| 66 | +The core of this phase is the [alerts.yml](config/prometheus/alerts.yml) file. We have defined two rules that are specifically tailored for our application and optimized for a lab environment with short `for` durations for rapid testing. |
| 67 | + |
| 68 | +* **`ApiServerErrorRateHigh`:** This rule fires when the rate of `5xx` status codes from the `resource-server` stays above 0 for the rule's `for` duration. It is designed to be triggered by our `ChaosController`.
| 69 | +* **`UnauthorizedAdminAccessSpike`:** This security-focused rule fires when the rate of `4xx` status codes on the specific `/api/secure/admin` endpoint exceeds 0. This is more robust than checking for just `403` as it captures any client-side error on this privileged endpoint, signaling a potential issue. |
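
As a rough sketch of what two such rules could look like, the fragment below uses the Micrometer default metric `http_server_requests_seconds_count` with its `status` and `uri` labels. The group name, the `job` label value, the `1m` rate window, the `15s` `for` duration, and the severity labels are assumptions for illustration; the real `alerts.yml` may differ.

```yaml
groups:
  - name: lab-alerts                 # hypothetical group name
    rules:
      - alert: ApiServerErrorRateHigh
        # Any 5xx response from the resource server within the last minute.
        expr: sum(rate(http_server_requests_seconds_count{job="resource-server", status=~"5.."}[1m])) > 0
        for: 15s                     # assumed short duration for rapid lab testing
        labels:
          severity: critical         # assumed label
        annotations:
          summary: "resource-server is returning 5xx responses"

      - alert: UnauthorizedAdminAccessSpike
        # Any 4xx response on the privileged admin endpoint within the last minute.
        expr: sum(rate(http_server_requests_seconds_count{uri="/api/secure/admin", status=~"4.."}[1m])) > 0
        for: 15s                     # assumed
        labels:
          severity: warning          # assumed label
        annotations:
          summary: "4xx responses on /api/secure/admin"
```

While an expression holds but its `for` duration has not yet elapsed, the alert shows as `Pending`; once the duration is met it moves to `Firing` and is sent to Alertmanager.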
| 70 | + |
| 71 | +#### 2. UI-Driven Test Harness |
| 72 | + |
| 73 | +To validate the entire alerting pipeline, we implemented a dedicated "Alerting Test Panel" in the `web-client`. |
| 74 | +* The `ChaosController` in the `resource-server` was enhanced with a guaranteed-failure endpoint (`/api/chaos/error`). |
| 75 | +* The `WebController` in the `web-client` was updated with two new `POST` endpoints that call the backend to generate `5xx` and `4xx` errors. |
| 76 | + |
| 77 | +--- |
| 78 | + |
| 79 | +## Local Development & Quick Start |
| 80 | + |
| 81 | +The prerequisites and setup are the same as in previous phases. |
| 82 | + |
| 83 | +1. **Configure Local Hostnames (One-Time Setup, if not already done):** |
| 84 | + Edit your local `hosts` file to add: |
| 85 | + ``` |
| 86 | + 127.0.0.1 keycloak.local |
| 87 | + ``` |
| 88 | +2. **Create and Configure Your Environment File:** |
| 89 | + ```bash |
| 90 | + cp .env.example .env |
| 91 | + # ...then edit .env to add your WEB_CLIENT_SECRET from Keycloak. |
| 92 | + ``` |
| 93 | +3. **Build and run the entire stack:** |
| 94 | + ```bash |
| 95 | + docker-compose up --build -d |
| 96 | + ``` |
| 97 | +4. **Access the Services:** |
| 98 | + * **Web Client Application:** [http://localhost:8082](http://localhost:8082) (Login with `lab-user`/`lab-user` or `lab-admin`/`lab-admin`) |
| 99 | + * **Keycloak Admin Console:** [http://keycloak.local](http://keycloak.local) (Login with `admin`/`admin`) |
| 100 | + * **Prometheus UI:** [http://localhost:9090](http://localhost:9090) |
| 101 | + * **Alertmanager UI:** [http://localhost:9093](http://localhost:9093) |
| 102 | + * **Grafana UI:** [http://localhost:3000](http://localhost:3000) |
29 | 103 |
|
30 | 104 | --- |
31 | 105 |
|
32 | | -## Running the Project |
| 106 | +## Validating the New Alerting Features |
| 107 | +
|
| 108 | +1. **Confirm Rules are Loaded:** |
| 109 | + * Navigate to the Prometheus UI's "Alerts" tab ([http://localhost:9090/alerts](http://localhost:9090/alerts)). |
| 110 | + * Verify that both new alerts are present and in the green "Inactive" state. |
33 | 111 |
|
34 | | -To run the application and see usage examples for the **current phase**, please refer to the detailed instructions in its tagged `README.md` file. |
| 112 | +2. **Trigger the Alerts via the UI:** |
| 113 | + * Log in to the Web Client as **`lab-user` / `lab-user`**. |
| 114 | + * In the "Alerting Test Panel", repeatedly click the buttons to generate `403` and `5xx` errors. |
| 115 | + * Watch the Prometheus Alerts UI. The alerts will transition from `Inactive` to `Pending` (yellow) and then to `Firing` (red). |
| 116 | + * Once firing, the alerts will appear in the Alertmanager UI. |
35 | 117 |
|
36 | | -**[>> Go to instructions for the current phase: `v5.0-correlated-logs-auditing` <<](https://github.com/apenlor/spring-boot-security-observability-lab/blob/v5.0-correlated-logs-auditing/docs/phase-5-readme.md#local-development--quick-start)** |
| 118 | +### Stop the Environment
37 | 119 |
|
38 | | -As the lab progresses, this link will always be updated to point to the latest completed phase. |
| 120 | +```bash |
| 121 | +docker-compose down -v |
| 122 | +``` |