What
Stand up a free, self-hostable observability dashboard for mat-vis
bakes/derives without a Dagger Cloud / paid subscription.
Design
Single container: `grafana/otel-lgtm` (OpenTelemetry Collector + Loki +
Grafana + Tempo + Prometheus). Exposed only via Tailscale, so:
- No public endpoint.
- Auth handled by tailnet identity.
- Works from laptops + GitHub runners identically.
Host
Small always-on host that runs the OTLP collector. Options:
- Home Raspberry Pi (cheapest, 24/7, already on tailnet).
- Hetzner / Fly.io free tier (remote, stateless, restartable).
Setup (one-shot)
```bash
podman run -d --name otel-lgtm \
  --restart=always \
  -p 4318:4318 -p 3000:3000 \
  -v otel-lgtm-data:/data \
  grafana/otel-lgtm:latest
```
Expose Grafana via Tailscale serve:
```bash
tailscale serve --bg --https=3000 http://localhost:3000
```
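A quick way to confirm the container is up before wiring CI to it is to poll Grafana's `/api/health` endpoint, which returns a small JSON document containing `"database": "ok"` when healthy. The sketch below uses only the standard library; the function names and the example hostname are illustrative, not part of the repo.

```python
import http.client
import json
import urllib.request


def is_healthy_payload(body: bytes) -> bool:
    """Return True if a /api/health response body reports a healthy database."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON (or not decodable): treat as unhealthy.
        return False
    return isinstance(payload, dict) and payload.get("database") == "ok"


def grafana_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if Grafana at base_url answers /api/health with a healthy payload."""
    url = base_url.rstrip("/") + "/api/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy_payload(resp.read())
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or HTTP error status.
        return False
```

For example, `grafana_healthy("https://otel-pi.example.ts.net:3000")` (hypothetical tailnet name) should return `True` once `tailscale serve` is active and the caller is on the tailnet.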
CI plumbing
- `.github/workflows/*.yml` gains a `tailscale/github-action@v2`
step at the top of any job that wants to emit telemetry.
- `OTEL_EXPORTER_OTLP_ENDPOINT` pointed at the tailnet hostname on port 4318 (OTLP/HTTP).
- Runner authenticates via an ephemeral auth key stored in
`secrets.TS_AUTHKEY`.
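To make the endpoint setting concrete: with OTLP over HTTP, `OTEL_EXPORTER_OTLP_ENDPOINT` is a base URL, and the exporter appends `/v1/traces` for trace data. A small preflight helper like the one below could catch a malformed value early in a job; the helper name and the example hostname are assumptions for illustration.

```python
from urllib.parse import urlsplit


def describe_otlp_endpoint(base: str) -> dict:
    """Validate a base OTLP/HTTP endpoint and show where traces will be sent."""
    parts = urlsplit(base)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not an http(s) URL with a host: {base!r}")
    default_port = 443 if parts.scheme == "https" else 80
    return {
        "host": parts.hostname,
        "port": parts.port or default_port,
        # Per-signal path appended to the base endpoint for traces.
        "traces_url": base.rstrip("/") + "/v1/traces",
    }
```

With a hypothetical value of `http://otel-pi.example.ts.net:4318`, spans would be posted to `http://otel-pi.example.ts.net:4318/v1/traces`.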
Baker plumbing
```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import (
    OTLPSpanExporter,
)
```
Span around `_stream_transform_into_tar`, plus child spans per shard
if the #129 sharding lands. No-op when
`OTEL_EXPORTER_OTLP_ENDPOINT` is unset (the default for non-CI,
non-instrumented runs).
Dashboard (committed)
`docs/observability/dashboard.json` — a Grafana dashboard that
visualises:
- Active runs, per-source, per-tier.
- Per-shard timing + throughput.
- Sliding-window failure rate (the gate from ADR-0009).
- Terminal-gate status (pushed vs refused).
- First-error strings.
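For reference, a sliding-window failure rate can be computed over the last N outcomes as sketched below. This is a generic count-based window; the actual gate in ADR-0009 may define its window and threshold differently, and the class name is hypothetical.

```python
from collections import deque


class SlidingFailureRate:
    """Track the failure fraction over the most recent `window` outcomes."""

    def __init__(self, window: int):
        if window <= 0:
            raise ValueError("window must be positive")
        # deque with maxlen drops the oldest outcome automatically.
        self._outcomes = deque(maxlen=window)

    def record(self, ok: bool) -> None:
        """Record one outcome: True for success, False for failure."""
        self._outcomes.append(bool(ok))

    @property
    def rate(self) -> float:
        """Failures divided by outcomes currently in the window (0.0 if empty)."""
        if not self._outcomes:
            return 0.0
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes)
```

A dashboard panel would typically compute the equivalent from span or metric data in a query; the class only illustrates the arithmetic the panel is meant to show.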
Acceptance criteria
- `podman run` recipe works end-to-end on a Pi.
- A `derive.yml` CI run shows up live in Grafana within ~5 s of
emitting its first span.
- Local `mat-vis-baker hf-derive ...` with the env var set shows
up in the same dashboard.
- Zero subscription fees.
Out of scope
- Replacing Dagger Cloud for teams that want hosted UIs — just an
alternative.
- Long-term retention (Tempo's stock default block retention is 14 days, which is fine for debugging).
Refs #129