
Commit fe2c50e

feat(alerting): implement proactive alerting with Alertmanager (#14)
* feat(alerting): integrate alertmanager service into docker-compose
* feat(alerting): configure Prometheus to manage and forward alerts
* feat(testing): implement UI-driven alert test harness
* feat(alerting): define Prometheus rules and complete alerting pipeline
* docs(readme): create detailed documentation for Phase 6
1 parent 76e7f85 commit fe2c50e

File tree: 13 files changed (+443, -63 lines)

.dockerignore

Lines changed: 7 additions & 1 deletion
@@ -13,4 +13,10 @@ build.log
 .gitignore

 # Ignore local environment files (the real one should never be in the image)
-.env
+.env
+
+# Ignore observability stack configs not needed in the app image
+config/
+
+# Ignore persistent data volumes
+data/

.gitignore

Lines changed: 4 additions & 3 deletions
@@ -1,5 +1,3 @@
-# File: .gitignore
-
 ### Java / Maven ###
 .project
 .settings/
@@ -53,4 +51,7 @@ reports/
 dependency-check-report.*

 # Local environment variables
-.env
+.env
+
+# Persistent data volumes
+/data/

README.md

Lines changed: 105 additions & 21 deletions
@@ -1,38 +1,122 @@
 # Spring Boot Security & Observability Lab

-This repository is an advanced, hands-on lab demonstrating the architectural evolution of a modern Java application. We will build a system from the ground up, starting with a secure monolith and progressively refactoring it into a fully observable, distributed system using cloud-native best practices.
+This repository is a hands-on lab designed to demonstrate the architectural evolution of a modern Java application. We will build a system from the ground up, starting with a secure monolith and progressively refactoring it into a fully observable, distributed system using cloud-native best practices.

 ---

-## Workshop Guide: The Evolutionary Phases
+## Lab Progress: Phase 6 - Proactive Alerting with Alertmanager

-This lab is structured in distinct, self-contained phases. The `main` branch always represents the latest completed phase. To explore a previous phase's code and detailed documentation, use the links below.
+The `main` branch currently represents the completed state of **Phase 6**.

-| Phase | Description & Key Concepts | Code & Docs (at tag) | Key Pull Requests |
-|:---|:---|:---|:---|
-| **1. The Secure Monolith** | A standalone service that issues and validates its own JWTs. Concepts: `AuthenticationManager`, custom `JwtAuthenticationFilter`, `jjwt` library, and a foundational CI pipeline. | [`v1.0-secure-monolith`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v1.0-secure-monolith) | [#2](https://github.com/apenlor/spring-boot-security-observability-lab/pull/2), [#3](https://github.com/apenlor/spring-boot-security-observability-lab/pull/3), [#4](https://github.com/apenlor/spring-boot-security-observability-lab/pull/4) |
-| **2. Observing the Monolith** | The service is containerized and orchestrated via `docker-compose`. Concepts: Micrometer, Prometheus, Grafana, custom metrics, and automated dashboard provisioning. | [`v2.0-observable-monolith`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v2.0-observable-monolith) | [#6](https://github.com/apenlor/spring-boot-security-observability-lab/pull/6) |
-| **3. Evolving to Federated Identity** | The system is refactored into a multi-service architecture with an external IdP. Concepts: Keycloak, OIDC, OAuth2 Client (`web-client`) vs. Resource Server, Traefik reverse proxy, service-to-service security. | [`v3.0-federated-identity`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v3.0-federated-identity) | [#8](https://github.com/apenlor/spring-boot-security-observability-lab/pull/8) |
-| **4. Tracing a Distributed System** | Services are instrumented with the OpenTelemetry agent to generate traces. Concepts: Tempo, agent-based instrumentation, W3C Trace Context, Service Graphs, and a hybrid PUSH/PULL metrics architecture. | [`v4.0-distributed-tracing`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v4.0-distributed-tracing) | [#10](https://github.com/apenlor/spring-boot-security-observability-lab/pull/10) |
-| **5. Correlated Logs & Access Auditing** | The three pillars of observability are complete (metrics, traces, logs). Alloy is the unified collection agent. Concepts: Loki, Grafana Alloy, Docker service discovery, structured JSON logs, AOP-based auditing, trace-to-log correlation, and detailed audit metrics. | [`v5.0-correlated-logs-auditing`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v5.0-correlated-logs-auditing) | [#12](https://github.com/apenlor/spring-boot-security-observability-lab/pull/12) |
-| **6. Proactive Alerting** | _Upcoming..._ | - | - |
-| **7. Continuous Security Integration** | _Upcoming..._ | - | - |
-| **8. Advanced Secret Management** | _Upcoming..._ | - | - |
+* **Git Tag for this Phase:** `v6.0-proactive-alerting`
+
+### Objective
+
+The goal of this phase was to transition our monitoring strategy from passive (dashboards) to **proactive**. We have integrated the Prometheus Alertmanager into our stack to create a system that can automatically detect and route notifications about problems, without requiring a human to be watching a screen. This demonstrates the completion of a production-grade monitoring feedback loop.
+
+### Key Concepts Demonstrated
+
+* **Prometheus Alerting Pipeline:** Understanding the distinct roles of Prometheus (which evaluates rules and generates alerts) and Alertmanager (which receives, de-duplicates, groups, and routes alerts).
+* **Declarative Alerting Rules:** Defining alerting conditions as code using PromQL expressions in a version-controlled YAML file.
+* **Alerting on Technical & Security Metrics:** Creating two distinct types of alerts:
+    1. A **technical alert** (`ApiServerErrorRateHigh`) that fires on infrastructure-level signals like a spike in 5xx server errors.
+    2. A **security alert** (`UnauthorizedAdminAccessSpike`) that fires on application-level signals, such as an abnormal rate of `4xx` errors on a privileged endpoint.
+* **Alert Lifecycle:** Observing the full lifecycle of an alert: `Inactive` -> `Pending` -> `Firing` -> `Resolved`.
+* **UI-Driven Test Harness:** Building a dedicated "Alerting Test Panel" in our web application to reliably trigger alert conditions on demand, proving the entire pipeline works end-to-end.
+
+### Architecture Overview
+
+Phase 6 introduces Alertmanager and connects it to our existing Prometheus instance. The data flow for alerting is now a core part of our observability stack.
+
+```mermaid
+graph TD
+    subgraph "Application Services"
+        RS[Resource Server]
+        WC[Web Client]
+    end
+
+    subgraph "Observability Stack"
+        Prom[Prometheus] -->|1. Scrapes Metrics| RS
+        Prom -->|1. Scrapes Metrics| WC
+
+        subgraph "Alerting Pipeline"
+            Rules[alerts.yml] -->|2. Evaluates| Prom
+            Prom -->|3. Sends Firing Alerts| AM[Alertmanager]
+        end
+
+        G[Grafana]
+    end
+
+    subgraph "Operators / External Systems"
+        AM -->|4. Routes Notifications| Notif[Email, Slack, etc.]
+        Ops[Operator] -->|Views & Manages Alerts| AM
+        Ops -->|Views Dashboards| G
+    end
+```
+
+1. **[Prometheus](config/prometheus/prometheus.yml):** Its role is expanded. It is now configured to load a [rule file](config/prometheus/alerts.yml) and to send any alerts that become "Firing" to the Alertmanager service. The `--web.external-url` flag is set to ensure backlinks are generated with a browser-resolvable hostname.
+2. **[Alertmanager](config/alertmanager/alertmanager.yml):** The new central hub for all alerts. It receives alerts from Prometheus, groups them to reduce noise, and would (in a production setup) route them to configured receivers. For this lab, we use a "null" receiver.
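The Prometheus side of this wiring is compact. A minimal sketch of the relevant `prometheus.yml` fragment is shown below; the service address `alertmanager:9093` and the mount path `/etc/prometheus/alerts.yml` are assumptions, and the committed `config/prometheus/prometheus.yml` remains the source of truth.

```yaml
# Sketch of the alerting-related fragment of config/prometheus/prometheus.yml.
# Assumptions: the Alertmanager service is named "alertmanager" in docker-compose,
# listens on its default port 9093, and alerts.yml is mounted under /etc/prometheus/.
rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
```

With this in place, Prometheus evaluates the rules in `alerts.yml` on its regular evaluation interval and pushes any firing alerts to Alertmanager, which handles grouping and routing as described above.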

 ---

-## How to Follow This Lab
+### Key Configuration Details

-1. **Start with the `main` branch** to see the latest state of the project.
-2. To go back in time, use the **"Code & Docs" link** for a specific phase. This will show you the `README.md` for that phase, which contains the specific instructions and examples for that version of the code.
-3. To understand the *"why"* behind the changes, review the **Key Pull Requests** for each phase.
+#### 1. Prometheus Alert Rules
+
+The core of this phase is the [alerts.yml](config/prometheus/alerts.yml) file. We have defined two rules that are specifically tailored for our application and optimized for a lab environment with short `for` durations for rapid testing.
+
+* **`ApiServerErrorRateHigh`:** This rule fires when the rate of `5xx` status codes from the `resource-server` exceeds 0 for a continuous period. It is designed to be triggered by our `ChaosController`.
+* **`UnauthorizedAdminAccessSpike`:** This security-focused rule fires when the rate of `4xx` status codes on the specific `/api/secure/admin` endpoint exceeds 0. This is more robust than checking for just `403`, as it captures any client-side error on this privileged endpoint, signaling a potential issue.
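Both rules intentionally trade realism for fast feedback: the `> 0` threshold and the 15-second `for` duration make them easy to trigger from the test panel. A production-leaning variant of the first rule might look like the sketch below; the 5-minute window, the 0.5 errors/sec threshold, and the longer `for` duration are illustrative assumptions, not values from this commit.

```yaml
# Illustrative sketch only: a stricter, production-leaning variant of ApiServerErrorRateHigh.
# The window, threshold, and duration below are assumptions; the committed alerts.yml
# deliberately uses "> 0" and "for: 15s" so the lab pipeline can be demonstrated quickly.
- alert: ApiServerErrorRateHigh
  expr: sum(rate(http_server_requests_seconds_count{status=~"5..", job="resource-server"}[5m])) by (job) > 0.5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High 5xx server error rate on job '{{ $labels.job }}'."
```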
+
+#### 2. UI-Driven Test Harness
+
+To validate the entire alerting pipeline, we implemented a dedicated "Alerting Test Panel" in the `web-client`.
+* The `ChaosController` in the `resource-server` was enhanced with a guaranteed-failure endpoint (`/api/chaos/error`).
+* The `WebController` in the `web-client` was updated with two new `POST` endpoints that call the backend to generate `5xx` and `4xx` errors.
+
+---
+
+## Local Development & Quick Start
+
+The prerequisites and setup are the same as in previous phases.
+
+1. **Configure Local Hostnames (One-Time Setup, if not already done):**
+   Edit your local `hosts` file to add:
+   ```
+   127.0.0.1 keycloak.local
+   ```
+2. **Create and Configure Your Environment File:**
+   ```bash
+   cp .env.example .env
+   # ...then edit .env to add your WEB_CLIENT_SECRET from Keycloak.
+   ```
+3. **Build and run the entire stack:**
+   ```bash
+   docker-compose up --build -d
+   ```
+4. **Access the Services:**
+   * **Web Client Application:** [http://localhost:8082](http://localhost:8082) (Login with `lab-user`/`lab-user` or `lab-admin`/`lab-admin`)
+   * **Keycloak Admin Console:** [http://keycloak.local](http://keycloak.local) (Login with `admin`/`admin`)
+   * **Prometheus UI:** [http://localhost:9090](http://localhost:9090)
+   * **Alertmanager UI:** [http://localhost:9093](http://localhost:9093)
+   * **Grafana UI:** [http://localhost:3000](http://localhost:3000)
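The Alertmanager UI listed above is served by the new Alertmanager container that this commit adds to the Compose stack. A minimal sketch of what that service definition might look like is shown below; the service name, image tag, and mount paths are assumptions, and the committed `docker-compose.yml` is authoritative.

```yaml
# Sketch of an Alertmanager service entry for docker-compose.yml (assumed names/paths).
alertmanager:
  image: prom/alertmanager:latest
  command:
    - '--config.file=/etc/alertmanager/alertmanager.yml'
  volumes:
    - ./config/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
  ports:
    - "9093:9093"  # matches the Alertmanager UI URL listed above
```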

 ---

-## Running the Project
+## Validating the New Alerting Features
+
+1. **Confirm Rules are Loaded:**
+   * Navigate to the Prometheus UI's "Alerts" tab ([http://localhost:9090/alerts](http://localhost:9090/alerts)).
+   * Verify that both new alerts are present and in the green "Inactive" state.

-To run the application and see usage examples for the **current phase**, please refer to the detailed instructions in its tagged `README.md` file.
+2. **Trigger the Alerts via the UI:**
+   * Log in to the Web Client as **`lab-user` / `lab-user`**.
+   * In the "Alerting Test Panel", repeatedly click the buttons to generate `403` and `5xx` errors.
+   * Watch the Prometheus Alerts UI. The alerts will transition from `Inactive` to `Pending` (yellow) and then to `Firing` (red).
+   * Once firing, the alerts will appear in the Alertmanager UI.

-**[>> Go to instructions for the current phase: `v5.0-correlated-logs-auditing` <<](https://github.com/apenlor/spring-boot-security-observability-lab/blob/v5.0-correlated-logs-auditing/docs/phase-5-readme.md#local-development--quick-start)**
+#### Stop the Environment

-As the lab progresses, this link will always be updated to point to the latest completed phase.
+```bash
+docker-compose down -v
+```
config/alertmanager/alertmanager.yml

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+global:
+  # --- Example: Global Slack API URL ---
+  # slack_api_url: '<your-global-slack-webhook-url>'
+
+# The root route on which each incoming alert enters
+route:
+  # For this lab, we use a 'null' receiver to prevent any actual notifications from being sent
+  receiver: 'null-receiver'
+
+  # --- Example: notification settings ---
+  # group_by: ['alertname', 'job']
+  # group_wait: 30s
+  # group_interval: 5m
+  # repeat_interval: 1h
+
+  # --- Example: Sub-route ---
+  # The sub-routes. Alertmanager matches alerts against sub-routes recursively.
+  # routes:
+  #   - receiver: 'critical-alerts-webhook'
+  #     # When an alert has a label 'severity' that is 'critical'...
+  #     matchers:
+  #       - severity = critical
+  #     # ...it will be sent to the 'critical-alerts-webhook' receiver instead of the parent's.
+  #     continue: true # Set to true if you want it to also match subsequent sibling routes.
+
+# A list of receivers that define how notifications are sent
+receivers:
+  # This receiver is a blackhole, which is a perfect option for the lab.
+  - name: 'null-receiver'
+
+  # --- Example: Generic Webhook Receiver ---
+  # - name: 'critical-alerts-webhook'
+  #   webhook_configs:
+  #     # URL to send POST requests to (e.g., a custom integration, PagerDuty, etc.).
+  #     - url: 'http://some-webhook-receiver:8080/notifications'
+  #       send_resolved: true
+
+  # --- Example: Slack Receiver ---
+  # - name: 'slack-notifications'
+  #   slack_configs:
+  #     # If a global "slack_api_url" is not defined, it must be specified here.
+  #     # This also allows overriding the global URL for a specific receiver.
+  #     # api_url: '<your-channel-specific-slack-webhook-url>'
+  #     - channel: '#alerts'
+  #       send_resolved: true
+  #       # You can customize the message title, text, etc. using templates.
+  #       # title: '{{ .CommonLabels.alertname }} - {{ .Status | toUpper }}'
+  #       # text: '{{ .CommonAnnotations.summary }}'

config/prometheus/alerts.yml

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+groups:
+  - name: application_alerts
+    rules:
+      # ===================================================================
+      # RULE 1: Technical Alert - High 5xx Server Error Rate
+      # ===================================================================
+      - alert: ApiServerErrorRateHigh
+        # This expression calculates the per-second rate of 5xx server errors
+        # originating specifically from our backend resource-server.
+        # It's designed to be triggered by the ChaosController's /api/chaos/error endpoint.
+        # Lower rate (>0) to simplify testing.
+        expr: sum(rate(http_server_requests_seconds_count{status=~"5..", job="resource-server"}[1m])) by (job) > 0
+
+        # Lab Setting: A short "for" duration for rapid testing. In a real-world
+        # scenario, this would be longer to avoid noise from transient issues.
+        for: 15s
+
+        labels:
+          severity: critical
+
+        annotations:
+          summary: "High 5xx server error rate on job '{{ $labels.job }}'."
+          description: "The API service '{{ $labels.job }}' is experiencing a high rate of 5xx server errors. Current value is {{ $value | printf \"%.2f\" }} errors/sec."
+          runbook_url: "https://internal-wiki.example.com/security/runbooks/api-error-rate"
+
+      # ===================================================================
+      # RULE 2: Security Alert - High 4xx Client Error Rate (Admin Endpoint)
+      # ===================================================================
+      - alert: UnauthorizedAdminAccessSpike
+        # This expression monitors for a spike in client errors (4xx) specifically on the
+        # privileged /api/secure/admin endpoint. Any client error here is a significant security signal,
+        # indicating misconfigured clients or malicious probing. Lower rate (>0) to simplify testing.
+        expr: sum(rate(http_server_requests_seconds_count{uri="/api/secure/admin", status=~"4..", job="resource-server"}[1m])) by (job) > 0
+
+        # Also set low for demonstration purposes.
+        for: 15s
+
+        labels:
+          severity: warning
+
+        annotations:
+          summary: "Spike in unauthorized access attempts to admin endpoint."
+          description: "A high rate of failed attempts (4xx errors) to access the admin endpoint on '{{ $labels.job }}' has been detected. The current rate is {{ $value | printf \"%.2f\" }} req/sec."
+          runbook_url: "https://internal-wiki.example.com/security/runbooks/unauthorized-admin-access"
