Self-hosted OTLP observability stack via Tailscale (grafana/otel-lgtm)

## What

Stand up a free, self-hostable observability dashboard for mat-vis
bakes/derives without a Dagger Cloud / paid subscription.

## Design

Single container: `grafana/otel-lgtm`, which bundles an OpenTelemetry
Collector with Loki, Grafana, Tempo, and Prometheus (the image uses
Prometheus for metrics rather than Mimir). Exposed only via Tailscale, so:

- No public endpoint.
- Auth handled by tailnet identity.
- Works from laptops + GitHub runners identically.

### Host

Small always-on host that runs the OTLP collector. Options:

- Home Raspberry Pi (cheapest, 24/7, already on tailnet).
- Hetzner / Fly.io free tier (remote, stateless, restartable).

### Setup (one-shot)

```bash
podman run -d --name otel-lgtm \
  --restart=always \
  -p 4318:4318 -p 3000:3000 \
  -v otel-lgtm-data:/data \
  grafana/otel-lgtm:latest
```

Expose Grafana via Tailscale serve:

```bash
tailscale serve --bg --https=3000 http://localhost:3000
```

### CI plumbing

- `.github/workflows/*.yml` gains a `tailscale/github-action@v2`
  step at the top of any job that wants to emit telemetry.
- `OTEL_EXPORTER_OTLP_ENDPOINT` points at the collector's tailnet
  hostname on port 4318 (OTLP/HTTP).
- The runner authenticates via an ephemeral auth key stored in
  `secrets.TS_AUTHKEY`.
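
To illustrate what the endpoint setting implies: under the OTLP/HTTP convention, exporters append a per-signal path (`/v1/traces`, `/v1/metrics`, `/v1/logs`) to the base endpoint. The sketch below derives those URLs. The helper name `otlp_signal_url` and the hostname `otel-host` are hypothetical and stand in for whatever the tailnet host ends up being called.

```python
# Hypothetical helper: derive the per-signal OTLP/HTTP URL from a base endpoint.
SIGNALS = ("traces", "metrics", "logs")


def otlp_signal_url(base_endpoint: str, signal: str) -> str:
    """Return base_endpoint + /v1/<signal>, tolerating a trailing slash."""
    if signal not in SIGNALS:
        raise ValueError(f"unknown OTLP signal: {signal!r}")
    return f"{base_endpoint.rstrip('/')}/v1/{signal}"


# "otel-host" is a placeholder tailnet hostname, not a real machine name.
print(otlp_signal_url("http://otel-host:4318", "traces"))
# prints: http://otel-host:4318/v1/traces
```

This can help when debugging a runner that cannot reach the collector, since the full URL is what the exporter actually posts to.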

### Baker plumbing

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import (
    OTLPSpanExporter,
)
```

Wrap `_stream_transform_into_tar` in a span, with child spans per
shard if #129 sharding lands. Instrumentation is a no-op when
`OTEL_EXPORTER_OTLP_ENDPOINT` is unset (the default for non-CI,
non-instrumented runs).
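
A minimal sketch of the no-op behaviour, assuming the OpenTelemetry SDK and the OTLP/HTTP exporter packages are installed whenever the endpoint is set. The function name `make_span_factory`, the service name `mat-vis-baker`, and the span names are illustrative assumptions, not the baker's actual code.

```python
import contextlib
import os
from typing import Callable, ContextManager, Mapping


def make_span_factory(
    environ: Mapping[str, str] = os.environ,
) -> Callable[[str], ContextManager]:
    """Return span(name) -> context manager; a no-op if no endpoint is set."""
    endpoint = environ.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if not endpoint:
        # Telemetry disabled: every span is an empty context manager.
        return lambda name: contextlib.nullcontext()

    # Imported lazily so non-instrumented runs do not need the packages.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import (
        OTLPSpanExporter,
    )
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # "mat-vis-baker" as service.name is an assumption for this sketch.
    provider = TracerProvider(
        resource=Resource.create({"service.name": "mat-vis-baker"})
    )
    # An explicit endpoint argument must include the traces path.
    exporter = OTLPSpanExporter(endpoint=endpoint.rstrip("/") + "/v1/traces")
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    return trace.get_tracer("mat_vis_baker").start_as_current_span


if __name__ == "__main__":
    span = make_span_factory()
    with span("stream_transform_into_tar"):
        for shard in range(2):  # placeholder for real shard iteration
            with span(f"shard-{shard}"):
                pass
```

Returning a factory keeps call sites identical in both modes, so the transform code needs no conditionals around its `with` blocks.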

## Dashboard (committed)

`docs/observability/dashboard.json` is a Grafana dashboard that
visualises:

- Active runs, per-source, per-tier.
- Per-shard timing + throughput.
- Sliding-window failure rate (the gate from ADR-0009).
- Terminal-gate status (pushed vs refused).
- First-error strings.
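
For readers unfamiliar with the sliding-window panel, here is one way such a rate could be computed. This is only an illustration: the class name `SlidingFailureRate`, the count-based window, and its size are assumptions, and the actual gate definition lives in ADR-0009.

```python
from collections import deque


class SlidingFailureRate:
    """Failure fraction over the most recent `window` outcomes (hypothetical)."""

    def __init__(self, window: int = 50) -> None:
        # deque(maxlen=...) silently drops the oldest outcome when full.
        self._outcomes = deque(maxlen=window)

    def record(self, ok: bool) -> None:
        self._outcomes.append(ok)

    @property
    def rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return self._outcomes.count(False) / len(self._outcomes)


gate = SlidingFailureRate(window=4)
for ok in (True, False, False, True):
    gate.record(ok)
print(gate.rate)  # prints: 0.5
```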

## Acceptance criteria

- `podman run` recipe works end-to-end on a Pi.
- A `derive.yml` CI run shows up live in Grafana within ~5 s of
  emitting its first span.
- Local `mat-vis-baker hf-derive ...` with the env var set shows
  up in the same dashboard.
- Zero subscription fees.

## Out of scope

- Replacing Dagger Cloud for teams that want hosted UIs; this is
  just an alternative.
- Long-term retention (Tempo defaults to 72 h; fine for debugging).

Refs #129
