
doc: add cluster manager reference architecture#1209

Draft
minaelee wants to merge 1 commit into canonical:main from minaelee:cluster-manager-architecture

Conversation

@minaelee
Contributor

@minaelee minaelee commented Feb 2, 2026

Add reference architecture documentation for MicroCloud Cluster Manager.

@github-actions github-actions bot added the Documentation Documentation needs updating label Feb 2, 2026
@minaelee minaelee force-pushed the cluster-manager-architecture branch from 84f08c8 to 2496821 Compare February 4, 2026 15:43
Contributor

@edlerd edlerd left a comment


Excellent start. I have many thoughts and comments below. We can have a chat if you like, to clarify the open issues.


The MicroCloud Cluster Manager is a centralized tool that provides an overview of MicroCloud deployments. In its initial implementation, it provides an overview of resource usage and availability for all clusters. Future implementations will include centralized cluster management capabilities.

Cluster Manager stores the data from registered clusters in Postgres and Prometheus databases. This data can be displayed in the Cluster Manager UI, which also links to Grafana dashboards for each MicroCloud.
Contributor

> which also links to Grafana dashboards for each MicroCloud

This is a possible extension. By default, the COS stack is not available. So a user will deploy cluster manager and get the manager UI without links to Grafana.

Contributor Author

Does this update work, or would you prefer we did not mention Grafana at all?

This data can be displayed in the Cluster Manager UI, which can be extended to link to Grafana dashboards for each MicroCloud.

Note: This information is from https://github.com/canonical/microcloud-cluster-manager/blob/main/ARCHITECTURE.md and likely should be updated there as well.

Comment on lines +19 to +22
```{figure} ../images/cluster_manager_architecture.png
:alt: A diagram of Cluster Manager architecture
:align: center
```
Contributor

This diagram is from an earlier development environment. It is mostly correct, but some things have slightly changed.

Contributor Author

Is there an updated diagram, or can you let me know what has changed and I can update it?

That static external IP acts as the gateway to route user traffic to the appropriate Kubernetes load balancers.

TCP load balancers
: Two TCP load balancer services distribute traffic to the Management API and Cluster Connector deployments without terminating TLS. Instead, TLS termination is handled directly within each deployment application. This approach is particularly crucial for the Cluster Connector deployment, as it relies on mutual TLS (mTLS) authentication for secure communication.
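The reason for passing TLS through to the application can be sketched briefly: mutual TLS only works if the application itself performs the handshake and can demand a client certificate. A minimal Python `ssl` sketch (the function name and file-path parameters are illustrative assumptions, not Cluster Manager code):

```python
import ssl

def make_mtls_server_context(certfile=None, keyfile=None, client_ca=None):
    # Because the load balancer does not terminate TLS, the application
    # owns the handshake and can require a client certificate (mTLS).
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED  # reject clients without a valid certificate
    if certfile:
        ctx.load_cert_chain(certfile, keyfile)       # the server's own identity
    if client_ca:
        ctx.load_verify_locations(cafile=client_ca)  # CA trusted to sign client certs
    return ctx
```

If TLS were instead terminated at the load balancer, the application would only see a plain stream and could not verify the client's certificate itself.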
Contributor

We are using a single Traefik instance that is dealing with the incoming requests, no two load balancers anymore.

Contributor Author

Fixed to:

A TCP load balancer (using a Traefik instance) distributes traffic to the Management API and Cluster Connector deployments without terminating TLS.

Comment on lines +33 to +34
Certificate manager
: Manages TLS/SSL certificates for secure communication within the Kubernetes cluster. It stores secrets in Kubernetes to be used by various components. The certificates are used by both the Management API and Cluster Connector deployments for HTTPS encryption.
Contributor

We now rely on a charm that implements the certificates interface to provide certificates. This can be the self-signed-certificates charm, as suggested in the readme. We do not rely on the certificate manager k8s app anymore.

Contributor Author

Should the Certificate manager section in lines 33-34 above be removed entirely?

Comment on lines +39 to +40
Persistent Volume (PV) and Persistent Volume Claim (PVC)
: The Persistent Volume is the storage resource provisioned for the Postgres deployment. The Persistent Volume Claim is the request for storage by the Postgres deployment to ensure data persistence.
Contributor

We rely on the canonical Postgres charm. How that charm does persistent storage is outside our control.

Contributor Author

What information should we provide in this section instead, or should we remove it entirely?

(ref-cluster-manager-architecture-management-ui)=
### UI

The Management API deployment handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens, as well as approve or reject join requests.
Contributor

Suggested change
The Management API deployment handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens, as well as approve or reject join requests.
The Management API deployment handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens.

Contributor

We can expand here: we serve warnings and high-level metric insights, as well as a list of all registered clusters.

Contributor Author

The Management API handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens. The UI also serves warnings and metric insights on a high level.

I added this, but "on a high level" could bear more explanation. Do you mean through optional extension with Grafana, or something more/else?

- mTLS authentication check against the matched certificate
- Store and overwrite data in the `remote_cluster_details` table

To avoid overwhelming the Cluster Connector deployment, the status endpoints are rate limited. The response sent to the originating cluster includes a delay period (in seconds) that must pass before the next status signal request.
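To make the delay-period mechanism concrete, here is a minimal client-side sketch in Python. The class name, the `delay` response field, and the callable signatures are illustrative assumptions, not the Cluster Manager wire format:

```python
import time

class RateLimitedClient:
    """Honor the per-endpoint delay period returned by the server.

    `send` is any callable that performs the actual request and returns a
    dict-like response containing an assumed "delay" field (seconds).
    """

    def __init__(self, send, clock=time.monotonic, sleep=time.sleep):
        self._send = send
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # endpoint -> earliest time for the next request

    def request(self, endpoint, payload=None):
        wait = self._next_allowed.get(endpoint, 0) - self._clock()
        if wait > 0:
            self._sleep(wait)  # back off until the server-mandated delay passes
        response = self._send(endpoint, payload)
        # The response includes a delay (in seconds) before the next request.
        self._next_allowed[endpoint] = self._clock() + response.get("delay", 0)
        return response
```

The `clock` and `sleep` parameters are injectable only so the behavior is easy to test; a real client would use the defaults.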
Contributor

All endpoints are rate limited, not just this one.

Contributor Author

@minaelee minaelee Feb 4, 2026

Updated to:

(ref-cluster-manager-architecture-rate-limited)=
## Rate limited endpoints

To avoid overwhelming the Cluster Manager, all its endpoints are rate limited. When any endpoint receives a request from a cluster, the response from Cluster Manager includes a delay period (in seconds) that must pass before the next request to that endpoint.

Or did you mean all endpoints for the Cluster Connector deployment only?

Also: do you want to change the term "Cluster Connector deployment" to "Cluster Connector" (like with "Management API deployment" to "Management API") or does it make sense to keep the word "deployment" here?

Signed-off-by: Minae Lee <minae.lee@canonical.com>
@minaelee minaelee force-pushed the cluster-manager-architecture branch from 2496821 to a2c04aa Compare February 4, 2026 23:36