1 change: 1 addition & 0 deletions .github/prompts/review-docs.prompt.md
@@ -13,6 +13,7 @@ Review the documentation for clarity, completeness, and accuracy.
- H1 titles under `docs/how-to` should start with "How to".
- Section headers and index entries should all use sentence case (not title case).
- Known product names should be capitalized consistently throughout the documentation.
- Spelling should follow US English conventions.

## Context

22 changes: 21 additions & 1 deletion docs/how-to/configure-and-tune/evaluate-telemetry-volume.md
@@ -11,6 +11,21 @@ In order to correctly size the VM(s) needed for COS, you need to know how much t


## Metrics rate

### Manual evaluation
Find the metrics endpoint manifest for each observed workload. If it is not documented,
count the non-comment lines served on the metrics endpoint manually,
for example:

```bash
curl -sf localhost:8080/metrics | grep -v "^# " | wc -l
```

This gives you the number of time series that will be created for the workload, per unit.
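
The same count can be taken offline from a saved scrape and scaled to a whole application. The following is a minimal Python sketch and is not part of the original guide: `count_series` and `estimate_samples_per_minute` are hypothetical helper names, the sample payload is invented, and the 60-second scrape interval is an assumed value that you should replace with your own.

```python
def count_series(exposition_text: str) -> int:
    """Count sample lines in Prometheus text exposition output.

    Similar to `grep -v "^# " | wc -l`, but also skips blank lines.
    """
    return sum(
        1
        for line in exposition_text.splitlines()
        if line.strip() and not line.startswith("#")
    )


def estimate_samples_per_minute(series_per_unit: int, units: int,
                                scrape_interval_s: float = 60.0) -> float:
    """Rough ingestion rate, assuming one sample per series per scrape."""
    return series_per_unit * units * (60.0 / scrape_interval_s)


# Invented example payload, standing in for the output of the metrics endpoint.
sample = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200"} 1027
http_requests_total{code="500"} 3
process_cpu_seconds_total 12.5
"""

series = count_series(sample)
print(series)                                        # series per unit
print(estimate_samples_per_minute(series, units=3))  # samples/minute for 3 units
```

Multiplying by the number of units reflects the "per unit" wording above; the per-minute figure is only as good as the assumed scrape interval.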

Another option is to deploy a temporary pilot Prometheus charm.

### With charmed Prometheus
Have your deployment send all metrics to Prometheus (or Mimir) and inspect the 48-hour plot for `count({__name__=~".+"})`.
The raw data can also be obtained by querying the Prometheus `query` endpoint directly:

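For scripted checks, the same instant query can also be prepared and parsed from Python. This is a minimal sketch under stated assumptions: `build_query_url` and `parse_series_count` are hypothetical helpers, `http://localhost:9090` is a placeholder address, and the example response body is invented to match the documented shape of an instant-vector result from the Prometheus HTTP API (`/api/v1/query`).

```python
import json
from urllib.parse import urlencode

QUERY = 'count({__name__=~".+"})'


def build_query_url(base_url: str, query: str = QUERY) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def parse_series_count(response_body: str) -> int:
    """Extract the single value from an instant-vector response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    # Each result entry carries value = [unix_timestamp, "value-as-string"].
    _, value = payload["data"]["result"][0]["value"]
    return int(float(value))


# Placeholder address (assumption); include any path prefix your server requires.
print(build_query_url("http://localhost:9090"))

# Invented response body in the shape the API returns for an instant vector.
example = (
    '{"status":"success","data":{"resultType":"vector",'
    '"result":[{"metric":{},"value":[1700000000,"1234"]}]}}'
)
print(parse_series_count(example))
```

The sketch deliberately stops short of sending the request, so it runs without a live server; fetch the URL with any HTTP client and pass the body to `parse_series_count`.
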
@@ -36,8 +51,13 @@ load[load generator] ---|db| postgresql
postgresql ---|metrics-endpoint| prometheus
```


## Logs rate
### Manual evaluation
The most reliable way to evaluate the logging rate of a workload is with load tests.

Another option is to deploy temporary pilot Loki and Prometheus charms.

### With charmed Loki and Prometheus
Have your deployment send all logs to Loki, and inspect the 48-hour plot for `loki_distributor_*_received_total`:

```
17 changes: 12 additions & 5 deletions docs/how-to/deploy-and-manage/index.md
@@ -14,14 +14,10 @@ These guides cover deploying, upgrading, managing, and securing access to COS.

See our [tutorials](/tutorial/index) for guidance on deploying COS.

## Upgrades

Move between COS revisions with confidence.

```{toctree}
:maxdepth: 1

Cross-track upgrade instructions <upgrade>
Install <install>
```

## Secure access
@@ -34,3 +30,14 @@ Protect and expose COS endpoints for production traffic.
Configure TLS encryption <configure-tls-encryption>
Configure ingress <configure-granular-ingress>
```

## Upgrades

Move between COS revisions with confidence.

```{toctree}
:maxdepth: 1

Cross-track upgrade instructions <upgrade>
```

83 changes: 83 additions & 0 deletions docs/how-to/deploy-and-manage/install.md
@@ -0,0 +1,83 @@
---
myst:
html_meta:
description: "Install the Canonical Observability Stack: preparation checklist covering sizing, networking, storage, and deployment options."
---

# How to install COS

## Preparation

Before deploying COS or COS Lite, work through the items below.

### COS flavor

The [flavor of COS](/explanation/overview/what-is-cos) to install depends on your use case.
If you want to install on edge devices, COS Lite is likely the right choice; otherwise,
you should probably go with "full" COS.

```{mermaid}
graph LR

subgraph env["Monitored environment"]
opentelemetry-collector
end

subgraph k8s["K8s cluster"]
COS
end

subgraph pc["Public cloud"]
cos-alerter["COS Alerter"]
end

subgraph storage["Storage cluster"]
S3
end

opentelemetry-collector ---|telemetry| COS
COS --- S3
COS --- cos-alerter
```

### Kubernetes cluster

Deploy COS on a high-availability Kubernetes cluster with at least 3 control plane nodes.

### Sizing

Use the [sizing guide](/reference/system-requirements) to determine the minimum hardware for your deployment.
If you don't yet know how much telemetry your workloads generate, start with [How to evaluate telemetry volume](/how-to/configure-and-tune/evaluate-telemetry-volume).

Follow the [storage best practices](/reference/storage) to set up a distributed storage backend with a replication factor of 3.
Do **not** use `hostPath` storage in production.

Review comment (Contributor): Perhaps a brief explanation about "why not to use hostPath"?


### Configure networking

Review the [networking best practices](/reference/networking) and ensure:

- A load balancer (for example, MetalLB) is available to give Traefik a stable IP.
- Egress is open for Charmhub, the Juju OCI registry, and Snapcraft.
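
The egress requirement can be spot-checked from a cluster node before deploying. The following is a minimal sketch, not an official preflight tool: `check_egress` is a hypothetical helper, and the host names and port 443 in `EGRESS_TARGETS` are assumptions that should be verified against the networking reference for your environment.

```python
import socket
from typing import Callable, Dict, Iterable, Tuple

# Assumed endpoints for Charmhub, the Juju OCI registry, and Snapcraft.
# Verify these against your own networking requirements before relying on them.
EGRESS_TARGETS = [
    ("api.charmhub.io", 443),
    ("registry.jujucharms.com", 443),
    ("api.snapcraft.io", 443),
]


def check_egress(
    targets: Iterable[Tuple[str, int]],
    connect: Callable = socket.create_connection,
    timeout: float = 5.0,
) -> Dict[str, bool]:
    """Try a TCP connection to each target and report whether it succeeded."""
    results: Dict[str, bool] = {}
    for host, port in targets:
        key = f"{host}:{port}"
        try:
            conn = connect((host, port), timeout=timeout)
            conn.close()
            results[key] = True
        except OSError:
            # Covers DNS failures, refused connections, and timeouts.
            results[key] = False
    return results


# Example (performs real network connections when uncommented):
# for target, ok in check_egress(EGRESS_TARGETS).items():
#     print(f"{target}: {'reachable' if ok else 'BLOCKED'}")
```

A successful TCP connection only shows that the port is reachable; it does not validate TLS interception, proxies, or the load balancer configuration.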

### Plan for TLS

Production deployments should use TLS.
See [How to configure TLS encryption](/how-to/deploy-and-manage/configure-tls-encryption) for the available modes and what you need to prepare (for example, an external certificates provider).

### Authentication and authorization
Only the Grafana and Traefik charms support auth.
For exposing Grafana publicly, use two Traefik charms, one for internal connections, and another for external access, which will provide ingress to Grafana.

Review comment (Contributor, on lines +68 to +69): One minor suggestion:

Suggested change, current text:
Only the Grafana and Traefik charms support auth.
For exposing Grafana publicly, use two Traefik charms, one for internal connections, and another for external access, which will provide ingress to Grafana.

Suggested change, proposed text:
Only the Grafana and Traefik charms support authentication.
To expose Grafana publicly, deploy two Traefik charms: one for internal connections and another for external access to provide ingress.


### Dedicated Juju controller and model

You should bootstrap a dedicated Juju controller and model just for COS.

## Create Terraform plan

```hcl
module ... { ... }
```

Review comment (Contributor Author): @MichaelThamm are we ready to put something here now?

## Deploy COS Alerter

COS Alerter is a watchdog service for COS that you should deploy on a physically separate cloud.