Self-hosted OTLP observability stack via Tailscale (grafana/otel-lgtm) #131

@gerchowl

What

Stand up a free, self-hostable observability dashboard for mat-vis
bakes/derives without a Dagger Cloud / paid subscription.

Design

Single container: `grafana/otel-lgtm` (OpenTelemetry Collector + Loki +
Grafana + Tempo + Prometheus; the image ships Prometheus as its metrics
store rather than Mimir). Exposed only via Tailscale, so:

  • No public endpoint.
  • Auth handled by tailnet identity.
  • Works from laptops + GitHub runners identically.

Host

Small always-on host that runs the OTLP collector. Options:

  • Home Raspberry Pi (cheapest, 24/7, already on tailnet).
  • Hetzner / Fly.io free tier (remote, stateless, restartable).

Setup (one-shot)

```bash
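# 4318 = OTLP/HTTP ingest, 3000 = Grafana UI.
# Fully qualified image name so podman does not need to resolve a short name.
podman run -d --name otel-lgtm \
  --restart=always \
  -p 4318:4318 -p 3000:3000 \
  -v otel-lgtm-data:/data \
  docker.io/grafana/otel-lgtm:latest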
```

Expose Grafana via Tailscale serve:

```bash
tailscale serve --bg --https=3000 http://localhost:3000
```

CI plumbing

  • `.github/workflows/*.yml` gains a `tailscale/github-action@v2`
    step at the top of any job that wants to emit telemetry.
  • `OTEL_EXPORTER_OTLP_ENDPOINT` pointed at the tailnet hostname on
    the OTLP/HTTP port, e.g. `http://<tailnet-hostname>:4318`.
  • Runner authenticates via an ephemeral auth key stored in
    `secrets.TS_AUTHKEY`.
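
A preflight check along these lines could run right after the Tailscale step, so a job fails fast (or skips telemetry) when the collector is not reachable over the tailnet. This is only a sketch: the helper names `otlp_target` and `is_reachable` are made up for illustration, and it checks TCP reachability only, not that the collector accepts spans.

```python
import socket
from urllib.parse import urlsplit


def otlp_target(endpoint):
    """Split an OTLP/HTTP endpoint URL into (host, port)."""
    parts = urlsplit(endpoint)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not an http(s) URL: {endpoint!r}")
    default_port = 443 if parts.scheme == "https" else 80
    return parts.hostname, parts.port or default_port


def is_reachable(endpoint, timeout=3.0):
    """Return True if a TCP connection to the endpoint succeeds."""
    host, port = otlp_target(endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

In a workflow this would be called with the value of `OTEL_EXPORTER_OTLP_ENDPOINT`; whether an unreachable collector should fail the job or just disable tracing is left open here.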

Baker plumbing

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import (
    OTLPSpanExporter,
)
```

Wrap `_stream_transform_into_tar` in a span, with child spans per shard
if the sharding from #129 lands. Tracing is a no-op when
`OTEL_EXPORTER_OTLP_ENDPOINT` is unset (the default for non-CI,
non-instrumented runs).
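
One way the no-op behaviour could look, as a rough sketch rather than the final baker code. The guard has to be explicit, because the OTLP/HTTP exporter falls back to `http://localhost:4318` when no endpoint is configured. The names `configure_tracing`, `span`, and `bake`, the service name `mat-vis-baker`, and the span and attribute names are all assumptions for illustration; `bake` only stands in for the real call site around `_stream_transform_into_tar`.

```python
import contextlib
import os

_tracer = None  # set by configure_tracing() when an endpoint is configured


def configure_tracing():
    """Install an OTLP/HTTP exporter, or do nothing if no endpoint is set."""
    global _tracer
    if not os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT"):
        _tracer = None
        return False
    # Imported lazily so un-instrumented runs do not need the SDK installed.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import (
        OTLPSpanExporter,
    )
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # Service name is an assumption; pick whatever the dashboard filters on.
    resource = Resource.create({"service.name": "mat-vis-baker"})
    provider = TracerProvider(resource=resource)
    # With no arguments the exporter reads OTEL_EXPORTER_OTLP_ENDPOINT itself.
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)
    _tracer = trace.get_tracer("mat_vis_baker")
    return True


def span(name, **attributes):
    """Return a real span context manager, or a no-op when tracing is off."""
    if _tracer is None:
        return contextlib.nullcontext()
    return _tracer.start_as_current_span(name, attributes=attributes)


def bake(shards):
    """Hypothetical stand-in for the code around _stream_transform_into_tar."""
    done = []
    with span("stream_transform_into_tar", shard_count=len(shards)):
        for index, shard in enumerate(shards):
            with span("shard", index=index):  # child span per shard
                done.append(shard)  # real per-shard work would go here
    return done
```

Keeping the OpenTelemetry imports inside `configure_tracing` means local runs without the env var never touch the SDK, at the cost of hiding the dependency from static import checks.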

Dashboard (committed)

`docs/observability/dashboard.json` — a Grafana dashboard that
visualises:

  • Active runs, per-source, per-tier.
  • Per-shard timing + throughput.
  • Sliding-window failure rate (the gate from ADR-0009).
  • Terminal-gate status (pushed vs refused).
  • First-error strings.
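
For the sliding-window panel, the quantity being plotted could be computed roughly as below. This is not ADR-0009's definition: the count-based window, the window size of 50, the 0.2 threshold, and the class name are all placeholders chosen for illustration.

```python
from collections import deque


class SlidingFailureRate:
    """Failure rate over the most recent `window` outcomes."""

    def __init__(self, window=50):
        # deque with maxlen drops the oldest outcome automatically.
        self._outcomes = deque(maxlen=window)  # True = failure

    def record(self, failed):
        self._outcomes.append(bool(failed))

    @property
    def rate(self):
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def tripped(self, threshold=0.2):
        """True when the current rate exceeds the (assumed) gate threshold."""
        return self.rate > threshold
```

If the real gate is time-based rather than count-based, the same shape works with timestamps stored alongside each outcome and old entries pruned on `record`.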

Acceptance criteria

  • `podman run` recipe works end-to-end on a Pi.
  • A `derive.yml` CI run shows up live in Grafana within ~5 s of
    emitting its first span.
  • Local `mat-vis-baker hf-derive ...` with the env var set shows
    up in the same dashboard.
  • Zero subscription fees.
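
To check the first two criteria by hand, a single span can be pushed at the collector without installing the OpenTelemetry SDK, using the OTLP/HTTP JSON encoding. A minimal sketch, assuming the collector in the container accepts JSON on `/v1/traces`; the function names and the `mat-vis-baker-smoke` service name are made up for the example.

```python
import json
import secrets
import time
import urllib.request


def build_trace_payload(span_name, service_name="mat-vis-baker-smoke"):
    """Build a minimal OTLP/JSON body containing a single span."""
    end = time.time_ns()
    start = end - 1_000_000  # pretend the span lasted 1 ms
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "smoke-test"},
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(start),
                    "endTimeUnixNano": str(end),
                }],
            }],
        }],
    }


def send_smoke_span(endpoint, span_name="smoke-test"):
    """POST one span to <endpoint>/v1/traces and return the HTTP status."""
    request = urllib.request.Request(
        endpoint.rstrip("/") + "/v1/traces",
        data=json.dumps(build_trace_payload(span_name)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Calling `send_smoke_span` with the same value used for `OTEL_EXPORTER_OTLP_ENDPOINT` and then searching Tempo in Grafana for the service name gives a quick end-to-end check of the Pi recipe.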

Out of scope

  • Replacing Dagger Cloud for teams that want hosted UIs — just an
    alternative.
  • Long-term retention (Tempo defaults to 72 h; fine for debugging).

Refs #129

Labels: area:ci (CI/CD, GitHub Actions, workflows)