doc: add cluster manager reference architecture #1209
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,150 @@ | ||
| --- | ||
| myst: | ||
| html_meta: | ||
| description: Reference architecture for MicroCloud Cluster Manager, a Kubernetes-based web application for viewing and managing multiple MicroCloud deployments. | ||
| --- | ||
|
|
||
| (ref-cluster-manager-architecture)= | ||
| # Architecture of the MicroCloud Cluster Manager | ||
|
|
||
| The MicroCloud Cluster Manager is a centralized tool that provides an overview of MicroCloud deployments. In its initial implementation, it provides an overview of resource usage and availability for all clusters. Future implementations will include centralized cluster management capabilities. | ||
|
|
||
| Cluster Manager stores the data from registered clusters in Postgres and Prometheus databases. This data can be displayed in the Cluster Manager UI, which can be extended to link to Grafana dashboards for each MicroCloud. | ||
|
|
||
| (ref-cluster-manager-architecture-overview)= | ||
| ## Architecture overview | ||
|
|
||
| Cluster Manager is a distributed web application that runs inside a Kubernetes cluster to achieve high availability. The diagram below illustrates its system architecture: | ||
|
|
||
| ```{figure} ../images/cluster_manager_architecture.png | ||
| :alt: A diagram of Cluster Manager architecture | ||
| :align: center | ||
| ``` | ||
|
|
||
| Inside the Kubernetes cluster are the following system components: | ||
|
|
||
| DNS and static external IP | ||
| : The Domain Name System (DNS) must be set up by the user to resolve their domain names to a static external IP. | ||
| That static external IP acts as the gateway to route user traffic to the appropriate Kubernetes load balancers. | ||
|
|
||
| TCP load balancers | ||
| : A TCP load balancer (using a [Traefik](https://traefik.io/) instance) distributes traffic to the Management API and Cluster Connector deployments without terminating TLS. Instead, TLS termination is handled directly within each deployment application. This approach is particularly crucial for the Cluster Connector deployment, as it relies on mutual TLS (mTLS) authentication for secure communication. | ||
|
|
||
| Certificate manager | ||
| : Manages TLS/SSL certificates for secure communication within the Kubernetes cluster. It stores secrets in Kubernetes to be used by various components. The certificates are used by both the Management API and Cluster Connector deployments for HTTPS encryption. | ||
|
Comment on lines +33 to +34
Contributor
We now rely on a charm that implements the certificates interface to provide certificates. This can be the
Contributor
Author
Should the Certificate manager section in lines 33-34 above be removed entirely?
Contributor
I think we can remove it, yes. |
||
|
|
||
| Postgres deployment | ||
| : A PostgreSQL database deployed within the Kubernetes cluster. It provides persistent storage for system data. Both the Management API and Cluster Connector communicate with the Postgres database for CRUD operations. | ||
|
|
||
| Persistent Volume (PV) and Persistent Volume Claim (PVC) | ||
| : The Persistent Volume is the storage resource provisioned for the Postgres deployment. The Persistent Volume Claim is the request for storage by the Postgres deployment to ensure data persistence. | ||
|
Comment on lines +39 to +40
Contributor
We rely on the canonical Postgres charm. How that charm does persistent storage is outside our control.
Contributor
Author
What information should we provide in this section instead, or should we remove it entirely?
Contributor
I think we can remove it, yes. |
||
|
|
||
| {ref}`ref-cluster-manager-architecture-management` | ||
| : Responsible for serving the static UI assets and exposing API endpoints for the UI to communicate with the Cluster Manager backend. Requests from the UI are authenticated using OIDC. | ||
|
|
||
| {ref}`ref-cluster-manager-architecture-connector` | ||
| : Responsible for handling requests from MicroCloud clusters, authenticated using mTLS. | ||
|
|
||
| (ref-cluster-manager-architecture-management)= | ||
| ## Management API | ||
|
|
||
| The Management API handles local operations in Cluster Manager, including: | ||
|
|
||
| - Listing active MicroCloud clusters | ||
| - Creating cluster join tokens | ||
| - Approving join requests from clusters | ||
| - Serving the UI's static assets | ||
|
|
||
| (ref-cluster-manager-architecture-management-ingress)= | ||
| ### Management API ingress | ||
|
|
||
| The Management API configures and runs an HTTPS server to make API endpoints available. Traffic to the server passes through a TCP load balancer. | ||
|
|
||
| (ref-cluster-manager-architecture-management-oidc)= | ||
| ### OIDC authentication | ||
|
|
||
| The Management API is secured with OIDC authentication, which is configured through the [`microcloud-cluster-manager-k8s`](https://charmhub.io/microcloud-cluster-manager-k8s) charm. The charm provides the OIDC information to the Kubernetes cluster and its `configMap`. | ||
|
|
||
| The OIDC flow is similar to the {ref}`approach implemented in LXD <lxd:authentication-openid>`: | ||
| - A user initiates the login flow from the UI. This makes a request to the `/oidc/login` endpoint, which redirects the user to the identity provider's authentication screen. At this stage, a callback endpoint (`*/oidc/callback`) is set in the redirect request. | ||
| - The user then enters their credentials to authenticate with the identity provider. | ||
| Upon successful authentication, the identity provider sends a request to the callback endpoint `*/oidc/callback` set in the first step. | ||
| - The request includes an authorization code. The callback endpoint uses this code to initiate the token exchange process with the identity provider and acquire the ID, access, and refresh tokens for the authenticated user. | ||
| - These tokens are set in the session cookie and the user is redirected to the base route of the UI. | ||
| - Subsequent requests use the session cookie to validate authentication. | ||
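The first step of the flow above can be sketched in Python. This is a minimal illustration of a standard OIDC authorization code request, not the Cluster Manager implementation: the function name, the `state` handling, and the example URLs are assumptions. Only the `/oidc/callback` path comes from the text above.

```python
import secrets
from urllib.parse import urlencode


def build_login_redirect(authorization_endpoint: str, client_id: str, base_url: str):
    """Return (redirect_url, state) for the start of an authorization code flow.

    Hypothetical helper: the parameter names follow the OIDC specification,
    but how Cluster Manager builds this request is not stated in this document.
    """
    # Opaque random value that lets the callback be tied to this login attempt.
    state = secrets.token_urlsafe(16)
    query = urlencode(
        {
            "response_type": "code",  # ask for an authorization code
            "client_id": client_id,
            "redirect_uri": f"{base_url}/oidc/callback",  # callback set in the redirect
            "scope": "openid",
            "state": state,
        }
    )
    return f"{authorization_endpoint}?{query}", state
```

The identity provider later calls the `redirect_uri` with the authorization code, which the callback endpoint exchanges for tokens.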
|
|
||
| (ref-cluster-manager-architecture-management-registering)= | ||
| ### Registering clusters | ||
|
|
||
| To register a MicroCloud cluster with a Cluster Manager, the user first creates a remote cluster join token in the Manager. This token is base64-encoded and includes the following information: | ||
|
|
||
| | Key | Description | | ||
| | ------------| ----------- | | ||
| | secret | A secret to be used by a MicroCloud cluster for creating a [HMAC](https://en.wikipedia.org/wiki/HMAC) signature for the join request. | | ||
| | expires_at | Expiry date for the remote cluster join token. | | ||
| | address | The address at which the MicroCloud Cluster Manager is reachable. This address can be a domain name or external static IP. | | ||
| | server_name | This is unique and stored for reference purposes in Cluster Manager to map which cluster the token belongs to. | | ||
| | fingerprint | The public key from the Cluster Connector deployment certificate (secret in Kubernetes cluster). Used to establish mTLS between Cluster Manager and the cluster. | | ||
|
|
||
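As a rough illustration of the token structure in the table above, the following Python sketch decodes a token and checks its expiry. It assumes the base64 payload is a JSON object and that `expires_at` is an ISO 8601 timestamp with a UTC offset. Neither detail is stated in this document, and the function names are hypothetical.

```python
import base64
import json
from datetime import datetime, timezone

EXPECTED_KEYS = {"secret", "expires_at", "address", "server_name", "fingerprint"}


def decode_join_token(token: str) -> dict:
    """Decode a base64 token, assuming (hypothetically) a JSON object payload."""
    payload = json.loads(base64.b64decode(token))
    missing = EXPECTED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"token is missing keys: {sorted(missing)}")
    return payload


def is_expired(payload: dict, now: datetime = None) -> bool:
    """Assumes expires_at is an ISO 8601 timestamp that includes a UTC offset."""
    now = now or datetime.now(timezone.utc)
    return datetime.fromisoformat(payload["expires_at"]) <= now
```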
| On a member of the registering cluster, the token is used in the command `microcloud cluster-manager join <token>`. The join request is sent to the Cluster Manager for approval. The request payload includes the registering cluster's name and cluster certificate. | ||
|
|
||
| The registering cluster begins sending periodic heartbeats along with status updates to the Cluster Manager. These heartbeats and updates begin as soon as a join request response is received from the Manager. | ||
|
|
||
| The cluster starts sending heartbeats before the join request is approved because communication is always initiated by the cluster. The Cluster Manager can only receive and respond to data from the cluster; it cannot initiate a message to tell the cluster when the join request is approved. For details about this, see {ref}`ref-cluster-manager-architecture-connector-ingress`. | ||
|
|
||
| Once the Cluster Manager receives a join request, it tries to match the cluster name in the payload to an entry in its `remote_cluster_tokens` table. If it finds a match, it uses the corresponding token secret stored in that table to verify the join request HMAC signature. The validity of the remote cluster join token is also checked against its expiry date. | ||
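The matching and verification described above might look like the following Python sketch. It assumes HMAC-SHA256 over the raw request body with a hex-encoded signature, and it uses a plain dictionary as a stand-in for the `remote_cluster_tokens` table. The hash algorithm, the encoding, and all function names are assumptions for illustration.

```python
import hashlib
import hmac
from datetime import datetime


def sign_join_request(secret: bytes, body: bytes) -> str:
    """Cluster side: sign the request body (SHA-256 and hex are assumptions)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def validate_join_request(tokens: dict, name: str, body: bytes,
                          signature: str, now: datetime) -> bool:
    """Manager side: match by cluster name, check expiry, verify the HMAC."""
    entry = tokens.get(name)  # stand-in for a remote_cluster_tokens lookup
    if entry is None:
        return False
    if entry["expires_at"] <= now:
        return False
    expected = sign_join_request(entry["secret"], body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)
```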
|
|
||
| If a valid match is found, the matched token is removed and the cluster is registered with the following information: | ||
|
|
||
| - `name` - extracted from the token | ||
| - `cluster_certificate` - received from the join request from the registering cluster. This is the MicroCloud cluster certificate for establishing mutual TLS authentication with the Cluster Manager. | ||
|
|
||
| A corresponding entry in the `remote_clusters` table is created, adding the following information: | ||
|
|
||
| - `status` - `ACTIVE` | ||
|
|
||
| Once a cluster has been successfully registered, Cluster Manager begins storing status update data sent by the cluster. | ||
|
|
||
| (ref-cluster-manager-architecture-management-ui)= | ||
| ### UI | ||
|
|
||
| The Management API serves the static assets for the UI. Users access information about clusters through the UI, where they can also create remote cluster join tokens and view information about existing tokens. The UI also presents high-level warnings and metric insights. | ||
|
|
||
| (ref-cluster-manager-architecture-connector)= | ||
| ## Cluster Connector deployment | ||
|
|
||
| The Cluster Connector deployment handles operations between MicroCloud clusters and the Cluster Manager. | ||
|
|
||
| (ref-cluster-manager-architecture-connector-ingress)= | ||
| ### Cluster Connector deployment ingress | ||
|
|
||
| We expect each cluster to be able to reach the Cluster Manager. However, we do not expect the Manager to be able to reach each remote cluster directly. This is because clusters might not be exposed on an internet-facing IP, or they might run behind a firewall or NAT. Therefore, operations consist of ingress traffic only. | ||
|
|
||
| (ref-cluster-manager-architecture-connector-mtls)= | ||
| ### mTLS authentication | ||
|
|
||
| During its initial join request, each cluster presents a dedicated certificate to the Cluster Manager. The Manager uses this certificate to authenticate all subsequent requests from the cluster, using mTLS. For efficiency, these certificates are cached after the first authenticated request. | ||
|
|
||
| Due to the mTLS requirement, the TCP load balancer passes through TLS traffic and the Cluster Connector terminates TLS itself. | ||
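The caching mentioned above can be pictured with a small Python sketch. It assumes a certificate is identified by the SHA-256 digest of its DER encoding and that a database lookup maps a fingerprint to a cluster ID. The class, the `load_from_db` callable, and the fingerprint format are hypothetical.

```python
import hashlib


def certificate_fingerprint(der_bytes: bytes) -> str:
    """SHA-256 over the DER-encoded certificate (an assumed fingerprint format)."""
    return hashlib.sha256(der_bytes).hexdigest()


class CertificateCache:
    """Remembers fingerprint -> cluster ID after the first successful lookup."""

    def __init__(self, load_from_db):
        # Hypothetical callable: takes a fingerprint, returns a cluster ID or None.
        self._load = load_from_db
        self._cache = {}

    def cluster_id(self, der_bytes: bytes):
        fingerprint = certificate_fingerprint(der_bytes)
        if fingerprint not in self._cache:
            found = self._load(fingerprint)
            if found is None:
                return None  # unknown certificates are never cached
            self._cache[fingerprint] = found
        return self._cache[fingerprint]
```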
|
|
||
| (ref-cluster-manager-architecture-connector-heartbeats)= | ||
| ### Heartbeats | ||
|
|
||
| The `db-leader` of each connected MicroCloud cluster sends periodic heartbeats to the Cluster Manager, along with data about resource usage and availability. A heartbeat update includes the following information: | ||
|
|
||
| - Cluster-level details including: | ||
| - Number of cluster-wide instances and distribution of instance status (such as how many instances are stopped or started) | ||
| - Number of cluster members and distribution of member status (number of members online, number of members in error status, and so on) | ||
| - CPU, memory, and disk utilization for each cluster, as aggregated totals across all cluster members | ||
| - MicroCloud certificate within the request context for mTLS authentication. The certificate fingerprint is used to look up the cluster ID in a cache for updating cluster details. | ||
|
|
||
| For each heartbeat it receives, the Cluster Manager performs the following tasks: | ||
|
|
||
| - Match the status update request by certificate fingerprint, and check that the cluster exists and is marked as active. | ||
| - Perform the mTLS authentication check against the matched certificate. | ||
| - Store the data in the `remote_cluster_details` table, overwriting the previous entry. | ||
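The heartbeat handling above can be sketched in Python with dictionaries standing in for the database tables. The field names in `Heartbeat` and the shape of the `clusters` mapping are assumptions. The mTLS check itself happens at the TLS layer and is omitted here.

```python
from dataclasses import asdict, dataclass


@dataclass
class Heartbeat:
    """Hypothetical payload shape based on the details listed above."""
    instance_status: dict  # for example {"started": 4, "stopped": 1}
    member_status: dict    # for example {"online": 3}
    cpu_utilization: float
    memory_utilization: float
    disk_utilization: float


def handle_heartbeat(fingerprint: str, heartbeat: Heartbeat,
                     clusters: dict, details: dict) -> bool:
    """Match by fingerprint, require an ACTIVE cluster, then overwrite details."""
    cluster = clusters.get(fingerprint)
    if cluster is None or cluster["status"] != "ACTIVE":
        return False
    # Stand-in for remote_cluster_details: one entry per cluster, replaced each time.
    details[cluster["id"]] = asdict(heartbeat)
    return True
```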
|
|
||
| (ref-cluster-manager-architecture-rate-limited)= | ||
| ## Rate-limited endpoints | ||
|
|
||
| To avoid overwhelming the Cluster Manager, all its endpoints are rate limited. When any endpoint receives a request from a cluster, the response from Cluster Manager includes a delay period (in seconds) that must pass before the next request to that endpoint. | ||
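From the cluster's side, honoring the delay could look like the following Python sketch. It assumes the client tracks the delay per endpoint and passes in the current time explicitly. The class and method names are hypothetical, and how the delay is encoded in the response is not described in this document.

```python
class RequestScheduler:
    """Tracks, per endpoint, the earliest time the next request may be sent."""

    def __init__(self):
        self._next_allowed = {}

    def record_response(self, endpoint: str, now: float, delay_seconds: float) -> None:
        # The delay comes from the Cluster Manager response for this endpoint.
        self._next_allowed[endpoint] = now + delay_seconds

    def can_send(self, endpoint: str, now: float) -> bool:
        # Endpoints that were never contacted have no delay to respect.
        return now >= self._next_allowed.get(endpoint, 0.0)
```

Passing `now` in as an argument, for example from `time.monotonic()`, keeps the logic easy to test without sleeping.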
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,10 +1,30 @@ | ||
| --- | ||
| myst: | ||
| html_meta: | ||
| description: Reference guides for MicroCloud such as deployment requirements, releases and snaps details, and the architecture of the MicroCloud Cluster Manager. | ||
| --- | ||
|
|
||
| (reference)= | ||
| # Reference | ||
|
|
||
| The reference material in this section provides technical information about MicroCloud. | ||
|
|
||
| (ref-microcloud)= | ||
| ## MicroCloud | ||
|
|
||
| Find out about requirements for a MicroCloud deployment, as well as information about its release cycles, release types, and snaps. | ||
|
|
||
| ```{toctree} | ||
| :maxdepth: 1 | ||
|
|
||
| MicroCloud requirements </reference/requirements> | ||
| /reference/releases-snaps | ||
| ``` | ||
|
|
||
| ## MicroCloud Cluster Manager | ||
|
|
||
| Learn about the MicroCloud Cluster Manager's architecture and how it connects multiple clusters to a centralized UI tool. | ||
|
|
||
| ```{toctree} | ||
| :maxdepth: 1 | ||
| Cluster Manager architecture </reference/cluster-manager-architecture> | ||
| ``` |
This diagram is from an earlier development environment. It is mostly correct, but some things have slightly changed.
Is there an updated diagram, or can you let me know what has changed and I can update it?
We don't have an updated diagram yet. Things that have changed:
I can create a task for myself to create an updated diagram.