From a509f2a1a6bbac2e70c904f7833100ebd8763f1a Mon Sep 17 00:00:00 2001
From: Sina
Date: Tue, 21 Apr 2026 10:34:49 -0400
Subject: [PATCH 1/7] feat: coordinated workers explanation doc

---
 .../architecture/coordinated-workers.md | 107 ++++++++++++++++++
 docs/explanation/architecture/index.md | 9 ++
 2 files changed, 116 insertions(+)
 create mode 100644 docs/explanation/architecture/coordinated-workers.md

diff --git a/docs/explanation/architecture/coordinated-workers.md b/docs/explanation/architecture/coordinated-workers.md
new file mode 100644
index 00000000..84714670
--- /dev/null
+++ b/docs/explanation/architecture/coordinated-workers.md
@@ -0,0 +1,107 @@
+---
+myst:
+  html_meta:
+    description: "Explanation of the coordinator-worker pattern used in COS HA for Loki, Mimir, and Tempo: how it works, its components, and why it was chosen."
+---
+
+# Coordinated workers
+
+## Overview
+
+The telemetry backend components in the COS stack (namely Loki, Mimir, and Tempo) are built on top of a single executable that can operate in two modes:
+
+- Monolithic mode: a single process runs all internal services. This is the mode that is used in COS Lite, where Prometheus and Loki each run as one service.
+- Microservices mode: multiple processes each run a subset of services, known as *roles*.
+
+The coordinated workers pattern is the design that COS HA (sometimes referred to as simply COS) uses to deploy these components in microservices mode, providing the high-availability (HA) topology of COS.
+
+## Roles and meta-roles
+
+In microservices mode, each process runs one or more *roles*. How roles are distributed varies by product:
+
+- Mimir: each process runs an arbitrary subset of roles, or one of several predefined subsets.
+- Loki: each process runs one of three predefined role subsets.
+- Tempo: each process runs exactly one role.
+
+Predefined subsets of roles are called *meta-roles*. A deployment is considered *consistent* when all roles required for the product to function are covered by at least one running process.
+
+## The coordinator-worker pattern
+
+Each COS HA component is made up of exactly two charms:
+
+1. A coordinator charm, which acts as the single entrypoint for all communication with the cluster. It runs an nginx reverse proxy to route and load-balance requests across workers, verifies that the cluster is consistent (i.e. all required roles are deployed), and owns all rule files and dashboards. It also handles integration with the rest of COS. As a result, individual workers do not need to be related to other charms directly. The coordinator, based on its relations and current config options, determines the necessary workload config file that the workers must run and forwards it to them over relation data.
+
+1. A worker charm, which runs one or more roles as configured by the operator via a config option. Multiple worker applications can be deployed with different roles to compose a full cluster. This charm runs the appropriate workload container based on the config file it receives from its coordinator.
+
+```{mermaid}
+graph LR
+
+subgraph mimir
+mimir-write ---|mimir-cluster| coordinator
+mimir-read ---|mimir-cluster| coordinator
+mimir-backend ---|mimir-cluster| coordinator
+end
+
+subgraph cos
+coordinator ---|alertmanager_dispatch| prometheus
+coordinator ---|grafana_dashboards| grafana
+coordinator ---|ingress| Traefik
+end
+```
+
+Worker roles are set via a config option on the worker application. Boolean per-role config variables (e.g. `role-ingester`, `role-querier`) are used rather than a free-text field, so that Juju can validate the input and prevent misconfiguration from typos.
+
+## Why a coordinator?
+
+The coordinator solves several problems that would otherwise require complex distributed coordination among workers:
+
+- Single entrypoint: all traffic flows through the coordinator's nginx reverse proxy, which load-balances across workers of the same role.
+- Cleaner Juju topology: integrations with other charms (e.g. S3, alerting rules, dashboards) are established once on the coordinator, which then distributes the relevant configuration to workers.
+- Consistency checking: the coordinator can verify that the cluster has all required roles covered before marking the deployment as ready, without requiring workers to cross-relate with each other.
+
+## Status and health checks
+
+Both the coordinator and the worker report their health through Juju's `collect-status` event, which aggregates multiple status conditions and surfaces the most critical one.
+
+### Worker health
+
+Each worker tracks the health of its workload process through a pebble readiness check. When a readiness check endpoint is configured, the worker periodically GETs it and interprets the response:
+
+- A `"ready"` response body (with any 2xx status code) means the worker is `up`.
+- A 2xx response with any other body means the worker is still `starting` (e.g. waiting for peer workers to become available).
+- An HTTP error or connection failure means the worker is `down`.
+
+This three-state model (`starting` / `up` / `down`) is used internally by the worker to decide whether a restart is needed and to populate the Juju unit status.
+
+When collecting unit status, the worker checks conditions in order of priority and reports the first applicable one:
+
+| Condition | Status |
+|---|---|
+| Pebble container not reachable | `waiting` |
+| No relation to a coordinator | `blocked` |
+| Cluster relation not ready | `waiting` |
+| No config published by coordinator | `waiting` |
+| No roles assigned | `blocked` |
+| Workload starting or down | `waiting` |
+| All roles running | `active` |
+
+### Coordinator health
+
+The coordinator distinguishes between two levels of cluster completeness:
+
+- Coherent (minimal deployment): all roles in the `minimal_deployment` set are covered by at least one worker. Below this threshold the cluster cannot function and the coordinator reports `blocked`.
+- Recommended (robust deployment): all roles in the `recommended_deployment` set are covered with the suggested number of units. Below this threshold the cluster can function but is degraded, and the coordinator reports a `waiting` or degraded status.
+
+The coordinator also blocks or waits on its own prerequisites:
+
+| Condition | Status |
+|---|---|
+| No workers related | `blocked` |
+| Cluster not coherent (missing required roles) | `blocked` |
+| Cluster not at recommended deployment | `waiting` (degraded) |
+| No S3 relation | `blocked` |
+| S3 relation not ready | `waiting` |
+| All checks pass | `active` |
+
+Because the coordinator owns the workload config and forwards it to workers over the cluster relation, a worker that has not yet received its config will remain in `waiting` until the coordinator becomes ready and publishes it.
+
diff --git a/docs/explanation/architecture/index.md b/docs/explanation/architecture/index.md
index e98b8d10..39433fcf 100644
--- a/docs/explanation/architecture/index.md
+++ b/docs/explanation/architecture/index.md
@@ -42,3 +42,12 @@ The available stack configurations and how telemetry moves through them.
 Telemetry Flow
 ```
+
+## Coordinated workers
+The coordinated workers pattern and the role it plays in COS HA.
+
+```{toctree}
+:maxdepth: 1
+
+Coordinated Workers
+```
\ No newline at end of file

From 88d533a88b5f6362d5f7355812dbc1293fd03c61 Mon Sep 17 00:00:00 2001
From: Sina
Date: Thu, 23 Apr 2026 16:43:10 -0400
Subject: [PATCH 2/7] feat: coordinated workers explanation doc

---
 .../architecture/coordinated-workers.md | 2 +
 .../coordinated-workers-meta-roles.md | 90 +++++++++++++++++++
 2 files changed, 92 insertions(+)
 create mode 100644 docs/reference/coordinated-workers-meta-roles.md

diff --git a/docs/explanation/architecture/coordinated-workers.md b/docs/explanation/architecture/coordinated-workers.md
index 84714670..5ac0cc92 100644
--- a/docs/explanation/architecture/coordinated-workers.md
+++ b/docs/explanation/architecture/coordinated-workers.md
@@ -105,3 +105,5 @@ The coordinator also blocks or waits on its own prerequisites:
 Because the coordinator owns the workload config and forwards it to workers over the cluster relation, a worker that has not yet received its config will remain in `waiting` until the coordinator becomes ready and publishes it.
+## References
+- [Meta roles used in COS](/reference/coodinated-workers)
\ No newline at end of file
diff --git a/docs/reference/coordinated-workers-meta-roles.md b/docs/reference/coordinated-workers-meta-roles.md
new file mode 100644
index 00000000..11cd725b
--- /dev/null
+++ b/docs/reference/coordinated-workers-meta-roles.md
@@ -0,0 +1,90 @@
+---
+myst:
+  html_meta:
+    description: "Reference for the worker roles and meta-roles available in the Mimir, Loki, and Tempo coordinated worker deployments in COS HA."
+---
+
+# Coordinated worker roles and meta-roles
+
+In COS HA, each telemetry backend (Mimir, Loki, Tempo) is deployed using the [coordinator-worker pattern](../explanation/architecture/coordinated-workers.md). Workers are configured with one or more *roles*, which determine which internal services the workload process will run.
+
+*Meta-roles* are named shortcuts that expand to a predefined set of roles. They allow operators to deploy common groupings of roles without having to enumerate each one individually. For example, assigning `role-write=true` to a Mimir worker is equivalent to enabling both the `distributor` and `ingester` roles.
+
+A deployment is considered:
+- **Consistent** (or coherent) when all roles in the *minimal deployment* set are covered by at least one worker unit.
+- **Recommended** when all roles in the *recommended deployment* set are covered with the suggested number of units.
+
+---
+
+## Mimir
+
+### Meta-roles
+
+| Meta-role | Constituent roles |
+|---|---|
+| `read` | `querier`, `query-frontend` |
+| `write` | `distributor`, `ingester` |
+| `backend` | `alertmanager`, `compactor`, `overrides-exporter`, `query-scheduler`, `ruler`, `store-gateway` |
+| `all` | `compactor`, `distributor`, `ingester`, `querier`, `query-frontend`, `ruler`, `store-gateway` |
+
+### Roles
+
+| Role | Part of meta-role | Min. deployment | Recommended units |
+|---|---|:---:|:---:|
+| `compactor` | `backend`, `all` | yes | 1 |
+| `distributor` | `write`, `all` | yes | 1 |
+| `ingester` | `write`, `all` | yes | 3 |
+| `querier` | `read`, `all` | yes | 2 |
+| `query-frontend` | `read`, `all` | yes | 1 |
+| `store-gateway` | `backend`, `all` | yes | 1 |
+| `ruler` | `backend`, `all` | yes | 1 |
+| `alertmanager` | `backend` | no | — |
+| `overrides-exporter` | `backend` | no | — |
+| `query-scheduler` | `backend` | no | — |
+| `flusher` | — | no | — |
+
+---
+
+## Loki
+
+Loki's microservices mode uses three top-level roles (`read`, `write`, `backend`) as the primary unit of deployment. These roles are themselves the building blocks for the `all` meta-role.
+
+### Meta-roles
+
+| Meta-role | Constituent roles |
+|---|---|
+| `all` | `read`, `write`, `backend` |
+
+### Roles
+
+| Role | Part of meta-role | Min. deployment | Recommended units |
+|---|---|:---:|:---:|
+| `read` | `all` | yes | 3 |
+| `write` | `all` | yes | 3 |
+| `backend` | `all` | yes | 3 |
+
+---
+
+## Tempo
+
+### Meta-roles
+
+| Meta-role | Constituent roles |
+|---|---|
+| `all` | `querier`, `query-frontend`, `ingester`, `distributor`, `compactor`, `metrics-generator` |
+
+### Roles
+
+| Role | Part of meta-role | Min. deployment | Recommended units |
+|---|---|:---:|:---:|
+| `querier` | `all` | yes | 1 |
+| `query-frontend` | `all` | yes | 1 |
+| `ingester` | `all` | yes | 3 |
+| `distributor` | `all` | yes | 1 |
+| `compactor` | `all` | yes | 1 |
+| `metrics-generator` | `all` | no | 1 |
+
+## References
+- [Mimir components](https://grafana.com/docs/mimir/latest/references/architecture/components/)
+- [Loki components](https://grafana.com/docs/loki/latest/get-started/components/)
+- [Tempo components](https://grafana.com/docs/tempo/latest/introduction/architecture/)
\ No newline at end of file

From bb0788aebc223d78e9cee31d514fa80cc00b92e8 Mon Sep 17 00:00:00 2001
From: Sina
Date: Thu, 23 Apr 2026 16:52:23 -0400
Subject: [PATCH 3/7] fix: spelling

---
 docs/.custom_wordlist.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/.custom_wordlist.txt b/docs/.custom_wordlist.txt
index 13d72561..4ee5b0c2 100644
--- a/docs/.custom_wordlist.txt
+++ b/docs/.custom_wordlist.txt
@@ -102,6 +102,7 @@ microservice
 microservices
 Mimir
 Minio
+misconfiguration
 Multipass
 MyST
 Nginx

From eaf146a6b1ac791af7b277edf8982fa4fbd8d173 Mon Sep 17 00:00:00 2001
From: Sina
Date: Fri, 24 Apr 2026 10:21:08 -0400
Subject: [PATCH 4/7] fix: link

---
 docs/explanation/architecture/coordinated-workers.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/explanation/architecture/coordinated-workers.md b/docs/explanation/architecture/coordinated-workers.md
index 5ac0cc92..3318aa9d 100644
--- a/docs/explanation/architecture/coordinated-workers.md
+++ b/docs/explanation/architecture/coordinated-workers.md
@@ -106,4 +106,4 @@ The coordinator also blocks or waits on its own prerequisites:
 Because the coordinator owns the workload config and forwards it to workers over the cluster relation, a worker that has not yet received its config will remain in `waiting` until the coordinator becomes ready and publishes it.
 ## References
-- [Meta roles used in COS](/reference/coodinated-workers)
\ No newline at end of file
+- [Meta roles used in COS](/reference/coodinated-workers.md)
\ No newline at end of file

From 9af5b668f54f6eef3cb0ed33bdc40c5722567892 Mon Sep 17 00:00:00 2001
From: Sina
Date: Fri, 24 Apr 2026 10:24:12 -0400
Subject: [PATCH 5/7] fix: link

---
 docs/explanation/architecture/coordinated-workers.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/explanation/architecture/coordinated-workers.md b/docs/explanation/architecture/coordinated-workers.md
index 3318aa9d..78103f80 100644
--- a/docs/explanation/architecture/coordinated-workers.md
+++ b/docs/explanation/architecture/coordinated-workers.md
@@ -106,4 +106,4 @@ The coordinator also blocks or waits on its own prerequisites:
 Because the coordinator owns the workload config and forwards it to workers over the cluster relation, a worker that has not yet received its config will remain in `waiting` until the coordinator becomes ready and publishes it.
 ## References
-- [Meta roles used in COS](/reference/coodinated-workers.md)
\ No newline at end of file
+- [Meta roles used in COS](/reference/coodinated-workers-meta-roles.md)
\ No newline at end of file

From 8e76736f138843e1c44dd5108af2785c7f71f266 Mon Sep 17 00:00:00 2001
From: Sina
Date: Fri, 24 Apr 2026 10:29:42 -0400
Subject: [PATCH 6/7] fix: use sentence case

---
 docs/explanation/architecture/coordinated-workers.md | 2 +-
 docs/explanation/architecture/index.md | 2 +-
 docs/reference/index.md | 10 ++++++++++
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/docs/explanation/architecture/coordinated-workers.md b/docs/explanation/architecture/coordinated-workers.md
index 78103f80..ca7291ea 100644
--- a/docs/explanation/architecture/coordinated-workers.md
+++ b/docs/explanation/architecture/coordinated-workers.md
@@ -43,7 +43,7 @@ mimir-backend ---|mimir-cluster| coordinator
 end
 subgraph cos
-coordinator ---|alertmanager_dispatch| prometheus
+coordinator ---|alertmanager_dispatch| alertmanager
 coordinator ---|grafana_dashboards| grafana
 coordinator ---|ingress| Traefik
 end
diff --git a/docs/explanation/architecture/index.md b/docs/explanation/architecture/index.md
index b1d79d2e..7ce9bff8 100644
--- a/docs/explanation/architecture/index.md
+++ b/docs/explanation/architecture/index.md
@@ -49,5 +49,5 @@ The coordinated workers pattern and the role it plays in COS HA.
 ```{toctree}
 :maxdepth: 1
-Coordinated Workers
+Coordinated workers
 ```
\ No newline at end of file
diff --git a/docs/reference/index.md b/docs/reference/index.md
index 4b7f231a..ab181648 100644
--- a/docs/reference/index.md
+++ b/docs/reference/index.md
@@ -80,3 +80,13 @@ over time.
 Lifecycle
 ```
+
+## Coordinated workers meta roles
+
+Meta roles used in the coordinated workers
+
+```{toctree}
+:maxdepth: 1
+
+Coordinated workers meta roles
+```
\ No newline at end of file

From 2654582e31edc50ccdc01d5d76af68905db331d8 Mon Sep 17 00:00:00 2001
From: Sina
Date: Fri, 24 Apr 2026 10:31:58 -0400
Subject: [PATCH 7/7] fix: typo

---
 docs/explanation/architecture/coordinated-workers.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/explanation/architecture/coordinated-workers.md b/docs/explanation/architecture/coordinated-workers.md
index ca7291ea..f504a2b8 100644
--- a/docs/explanation/architecture/coordinated-workers.md
+++ b/docs/explanation/architecture/coordinated-workers.md
@@ -106,4 +106,4 @@ The coordinator also blocks or waits on its own prerequisites:
 Because the coordinator owns the workload config and forwards it to workers over the cluster relation, a worker that has not yet received its config will remain in `waiting` until the coordinator becomes ready and publishes it.
 ## References
-- [Meta roles used in COS](/reference/coodinated-workers-meta-roles.md)
\ No newline at end of file
+- [Meta roles used in COS](/reference/coordinated-workers-meta-roles.md)
\ No newline at end of file
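
Note on the "Worker health" section added in `docs/explanation/architecture/coordinated-workers.md`: the three-state readiness rules it describes can be sketched as a small pure function. This is an illustrative sketch only, not the charm's actual code. The names `WorkerHealth` and `classify_readiness` are hypothetical, and three details are assumptions that the doc does not state: a connection failure is represented as `None`, surrounding whitespace in the body is ignored, and every non-2xx status counts as an HTTP error.

```python
from enum import Enum
from typing import Optional


class WorkerHealth(Enum):
    """The three worker states named in the doc."""

    STARTING = "starting"
    UP = "up"
    DOWN = "down"


def classify_readiness(status_code: Optional[int], body: str = "") -> WorkerHealth:
    """Map one readiness-check result to a worker state (hypothetical helper).

    status_code is None when the connection itself failed (assumption).
    """
    # HTTP error or connection failure: the worker is down.
    # Assumption: anything outside 2xx is treated as an error.
    if status_code is None or not 200 <= status_code < 300:
        return WorkerHealth.DOWN
    # 2xx with a "ready" body: the worker is up.
    # Assumption: surrounding whitespace in the body is ignored.
    if body.strip() == "ready":
        return WorkerHealth.UP
    # 2xx with any other body: the workload is still starting.
    return WorkerHealth.STARTING
```

Keeping the classification free of I/O means the same function can back both the restart decision and the unit status, and it can be unit-tested without a running workload.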