feat(observability): add runner VM hostmetrics Grafana dashboard #187

Draft
cbartz wants to merge 3 commits into main from feat/hostmetrics-grafana-dashboard

cbartz commented Apr 23, 2026

Summary

Adds a read-only Grafana dashboard for runner VM host-level metrics, served via cos-configuration-k8s over the grafana-dashboard relation. Grafana provisions these dashboards from the filesystem, which makes them immutable: they cannot be edited by any user, regardless of role.

Changes

  • runner_grafana_dashboards/runner_vm_hostmetrics.json: new dashboard covering CPU, memory, disk I/O, filesystem, network traffic and load averages
    • Template variables: github_job_id (filter by workflow run) and instance (filter by hostname)
    • Metric names follow the OpenTelemetry hostmetrics receiver Prometheus convention
    • editable: false, plus an __inputs datasource declaration
  • README.md: documents the repo layout and the observability dashboard delivery mechanism
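To keep the read-only guarantee from regressing, a CI check along these lines could load the dashboard JSON and confirm both properties listed above. This is a minimal sketch: the helper names `check_dashboard` and `check_file` are hypothetical, and it assumes the `__inputs` entries follow Grafana's dashboard export format (`type: "datasource"`, `pluginId: "prometheus"`).

```python
import json
from typing import Any


def check_dashboard(dashboard: dict[str, Any]) -> list[str]:
    """Return a list of problems with a dashboard definition (empty if OK).

    Hypothetical CI helper: checks that the dashboard is marked read-only
    and declares at least one Prometheus datasource input.
    """
    problems: list[str] = []

    # The dashboard itself should be flagged as non-editable.
    if dashboard.get("editable") is not False:
        problems.append("dashboard must set editable: false")

    # __inputs declares which datasource the dashboard expects to be bound to.
    inputs = dashboard.get("__inputs", [])
    has_prometheus_input = any(
        item.get("type") == "datasource" and item.get("pluginId") == "prometheus"
        for item in inputs
    )
    if not has_prometheus_input:
        problems.append("missing __inputs entry for a Prometheus datasource")

    return problems


def check_file(path: str) -> list[str]:
    """Load a dashboard JSON file from disk and check it."""
    with open(path, encoding="utf-8") as f:
        return check_dashboard(json.load(f))
```

For example, `check_file("runner_grafana_dashboards/runner_vm_hostmetrics.json")` should return an empty list if the dashboard is read-only and declares its datasource input.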

Notes

  • The github_job_id label is expected to be set as a resource attribute by the otelcol pipeline; confirm the exact label name once that pipeline is wired up
  • The matching Terraform change (deploying cos-configuration-k8s) is on the feat/runner-hostmetrics-cos-configuration branch of platform-engineering-deployments

Closes / relates to: ISD-5152

cbartz added 3 commits April 23, 2026 14:21
Adds a read-only Grafana dashboard (editable: false) for runner VM
host-level metrics to be served via cos-configuration-k8s using the
grafana-dashboard relation, which provisions it as an immutable
filesystem dashboard in Grafana.

The dashboard covers:
- CPU utilisation by state and load averages
- Memory usage by state
- Disk I/O throughput and operations
- Filesystem usage % by mount point
- Network traffic, errors and drops

Template variables:
- github_job_id: filter by GitHub Actions workflow run job ID
- instance: filter by runner hostname

Metric names follow the OpenTelemetry hostmetrics receiver Prometheus
convention (e.g. system_cpu_time_seconds_total). The github_job_id
label is expected to be set as a resource attribute by the otelcol
pipeline collecting metrics from the runner VMs.

Related: ISD-5152
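Panels typically apply template variables as regex label matchers (`=~`), so multi-value and "All" selections, which Grafana interpolates as regexes, still match. The sketch below shows how such a query could be assembled for the CPU-by-state panel. It assumes each label name equals its variable name; the helper `filtered` is hypothetical, and `$__rate_interval` is Grafana's built-in rate-window variable.

```python
def filtered(metric: str, labels: list[str]) -> str:
    """Build a PromQL selector that filters `metric` by dashboard variables.

    Each label is regex-matched against the Grafana variable of the same
    name, e.g. github_job_id=~"$github_job_id".
    """
    matchers = ", ".join(f'{label}=~"${label}"' for label in labels)
    return f"{metric}{{{matchers}}}"


# Variables as defined in this commit (later renamed to github_job / github_runner).
LABELS = ["github_job_id", "instance"]

# CPU seconds per second for each state, summed across CPUs and selected runners.
cpu_by_state = (
    "sum by (state) (rate("
    + filtered("system_cpu_time_seconds_total", LABELS)
    + "[$__rate_interval]))"
)
```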

Rename grafana_dashboards/ to runner_grafana_dashboards/ to make the
purpose explicit at the repo root level (runner VM host metrics, not
charm workload metrics).

Update README with:
- Repository layout overview
- Observability section explaining the cos-configuration-k8s delivery
  mechanism and the immutability guarantee
- Table of conventions for where dashboards live and what
  grafana_dashboards_path value to use in Terraform

Replace github_job_id with github_job and instance with github_runner
to match the actual attribute labels set by the pre-job OTel config
(see canonical/github-runner-operator#781).

Add github_repository and github_workflow template variables so the
dashboard can be filtered the same way as the existing PS6 hostmetrics
dashboard.
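With the renamed labels, the dashboard's templating section could define one query variable per label using the Prometheus datasource's `label_values(metric, label)` function, each filtered by the variables before it. This is a minimal sketch that generates those entries; the helper `query_variable`, the anchor metric, the `${DS_PROMETHEUS}` datasource reference and the drill-down order are assumptions, not taken from the actual dashboard JSON.

```python
import json
from typing import Any

METRIC = "system_cpu_time_seconds_total"  # assumed anchor metric for label lookups
# Assumed drill-down order: each variable is filtered by the ones before it.
VARIABLES = ["github_repository", "github_workflow", "github_job", "github_runner"]


def query_variable(name: str, parents: list[str]) -> dict[str, Any]:
    """Return a Grafana query-variable entry whose values depend on `parents`."""
    matchers = ", ".join(f'{p}=~"${p}"' for p in parents)
    selector = f"{METRIC}{{{matchers}}}" if matchers else METRIC
    return {
        "name": name,
        "type": "query",
        "datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"},
        "query": f"label_values({selector}, {name})",
        "includeAll": True,
        "multi": True,
        "refresh": 2,  # re-query label values when the time range changes
    }


templating = {
    "list": [query_variable(v, VARIABLES[:i]) for i, v in enumerate(VARIABLES)]
}
print(json.dumps(templating, indent=2))
```

Chaining the variables this way narrows each drop-down to values that exist under the current selection, which mirrors filtering by repository, then workflow, then job, then runner.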