
doc: add cluster manager reference architecture#1209

Draft
minaelee wants to merge 1 commit into canonical:main from minaelee:cluster-manager-architecture

Conversation

@minaelee
Contributor

@minaelee minaelee commented Feb 2, 2026

Add reference architecture documentation for MicroCloud Cluster Manager.

@github-actions github-actions bot added the Documentation Documentation needs updating label Feb 2, 2026
@minaelee minaelee force-pushed the cluster-manager-architecture branch from 84f08c8 to 2496821 Compare February 4, 2026 15:43
Contributor

@edlerd edlerd left a comment


Excellent start. I have many thoughts and comments below. We can have a chat if you like, to clarify the open issues.


The MicroCloud Cluster Manager is a centralized tool that provides an overview of MicroCloud deployments. In its initial implementation, it provides an overview of resource usage and availability for all clusters. Future implementations will include centralized cluster management capabilities.

Cluster Manager stores the data from registered clusters in Postgres and Prometheus databases. This data can be displayed in the Cluster Manager UI, which also links to Grafana dashboards for each MicroCloud.
Contributor

> which also links to Grafana dashboards for each MicroCloud

This is a possible extension. By default, the COS stack is not available. So a user will deploy cluster manager and get the manager UI without links to Grafana.

Contributor Author

Does this update work, or would you prefer we did not mention Grafana at all?

This data can be displayed in the Cluster Manager UI, which can be extended to link to Grafana dashboards for each MicroCloud.

Note: This information is from https://github.com/canonical/microcloud-cluster-manager/blob/main/ARCHITECTURE.md and likely should be updated there as well.

Comment on lines +19 to +22
```{figure} ../images/cluster_manager_architecture.png
:alt: A diagram of Cluster Manager architecture
:align: center
```
Contributor

This diagram is from an earlier development environment. It is mostly correct, but some things have slightly changed.

Contributor Author

Is there an updated diagram, or can you let me know what has changed and I can update it?

That static external IP acts as the gateway to route user traffic to the appropriate Kubernetes load balancers.

TCP load balancers
: Two TCP load balancer services distribute traffic to the Management API and Cluster Connector deployments without terminating TLS. Instead, TLS termination is handled directly within each deployment application. This approach is particularly crucial for the Cluster Connector deployment, as it relies on mutual TLS (mTLS) authentication for secure communication.
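The reason for passing TLS through to the application can be sketched briefly: mutual TLS only works if the application itself performs the handshake and can demand a client certificate. A minimal Python `ssl` sketch (the function name and file-path parameters are illustrative assumptions, not Cluster Manager code):

```python
import ssl

def make_mtls_server_context(certfile=None, keyfile=None, client_ca=None):
    # Because the load balancer does not terminate TLS, the application
    # owns the handshake and can require a client certificate (mTLS).
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED  # reject clients without a valid certificate
    if certfile:
        ctx.load_cert_chain(certfile, keyfile)       # the server's own identity
    if client_ca:
        ctx.load_verify_locations(cafile=client_ca)  # CA trusted to sign client certs
    return ctx
```

If TLS were instead terminated at the load balancer, the application would only see a plain stream and could not verify the client's certificate itself.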
Contributor

We are using a single Traefik instance that is dealing with the incoming requests, no two load balancers anymore.

Contributor Author

Fixed to:

A TCP load balancer (using a Traefik instance) distributes traffic to the Management API and Cluster Connector deployments without terminating TLS.

Comment on lines +33 to +34
Certificate manager
: Manages TLS/SSL certificates for secure communication within the Kubernetes cluster. It stores secrets in Kubernetes to be used by various components. The certificates are used by both the Management API and Cluster Connector deployments for HTTPS encryption.
Contributor

We now rely on a charm that implements the certificates interface to provide certificates. This can be the self-signed-certificates charm, as suggested in the readme. We do not rely on the certificate manager k8s app anymore.

Contributor Author

Should the Certificate manager section in lines 33-34 above be removed entirely?

Comment on lines +39 to +40
Persistent Volume (PV) and Persistent Volume Claim (PVC)
: The Persistent Volume is the storage resource provisioned for the Postgres deployment. The Persistent Volume Claim is the request for storage by the Postgres deployment to ensure data persistence.
Contributor

We rely on the canonical Postgres charm. How that charm does persistent storage is outside our control.

Contributor Author

What information should we provide in this section instead, or should we remove it entirely?

(ref-cluster-manager-architecture-management-ui)=
### UI

The Management API deployment handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens, as well as approve or reject join requests.
Contributor

Suggested change
The Management API deployment handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens, as well as approve or reject join requests.
The Management API deployment handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens.

Contributor

We can expand here: we serve warnings and high-level metric insights, as well as a list of all registered clusters.

Contributor Author

The Management API handles serving static assets for the UI. Users access information about clusters through the UI. Through it, users can create remote cluster join tokens and view information about existing tokens. The UI also serves warnings and metric insights on a high level.

I added this, but "on a high level" could bear more explanation. Do you mean through optional extension with Grafana, or something more/else?

- mTLS authentication check against the matched certificate
- Store and overwrite data in the `remote_cluster_details` table

To avoid overwhelming the Cluster Connector deployment, the status endpoints are rate limited. The response sent to the originating cluster includes a delay period (in seconds) that must pass before the next status signal request.
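To make the delay-period mechanism concrete, here is a minimal client-side sketch in Python. The class name, the `delay` response field, and the callable signatures are illustrative assumptions, not the Cluster Manager wire format:

```python
import time

class RateLimitedClient:
    """Honor the per-endpoint delay period returned by the server.

    `send` is any callable that performs the actual request and returns a
    dict-like response containing an assumed "delay" field (seconds).
    """

    def __init__(self, send, clock=time.monotonic, sleep=time.sleep):
        self._send = send
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # endpoint -> earliest time for the next request

    def request(self, endpoint, payload=None):
        wait = self._next_allowed.get(endpoint, 0) - self._clock()
        if wait > 0:
            self._sleep(wait)  # back off until the server-mandated delay passes
        response = self._send(endpoint, payload)
        # The response includes a delay (in seconds) before the next request.
        self._next_allowed[endpoint] = self._clock() + response.get("delay", 0)
        return response
```

The `clock` and `sleep` parameters are injectable only so the behavior is easy to test; a real client would use the defaults.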
Contributor

All endpoints are rate limited, not just this one.

Contributor Author

@minaelee minaelee Feb 4, 2026

Updated to:

(ref-cluster-manager-architecture-rate-limited)=
## Rate limited endpoints

To avoid overwhelming the Cluster Manager, all its endpoints are rate limited. When any endpoint receives a request from a cluster, the response from Cluster Manager includes a delay period (in seconds) that must pass before the next request to that endpoint.

Or did you mean all endpoints for the Cluster Connector deployment only?

Also: do you want to change the term "Cluster Connector deployment" to "Cluster Connector" (like with "Management API deployment" to "Management API") or does it make sense to keep the word "deployment" here?

Signed-off-by: Minae Lee <minae.lee@canonical.com>
@minaelee minaelee force-pushed the cluster-manager-architecture branch from 2496821 to a2c04aa Compare February 4, 2026 23:36