Operational improvements: labels, checkpoints, structured errors, production dashboard #67

Merged
vnvo merged 12 commits into main from better-ops
Apr 2, 2026

@vnvo vnvo commented Apr 2, 2026

What

Operational improvements for production fleet management at scale.

Pipeline metadata:

  • Labels + annotations on pipeline metadata (labels: {env: prod, team: platform})
  • GET /pipelines?label=env:prod - filter by label with AND logic
  • deltaforge_pipeline_info{pipeline, tenant} gauge for Grafana joins
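
The label filter above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `LabelSelector` type and the `parse_selector` and `matches_all` functions are hypothetical names, and it assumes each selector arrives as a separate string in `key:value` or key-only form (how multiple selectors are passed in the query string is not stated here).

```rust
use std::collections::HashMap;

/// One parsed `label` selector (hypothetical type, not taken from the PR).
#[derive(Debug, PartialEq)]
pub enum LabelSelector {
    /// `env:prod` matches only when label `env` has the value `prod`.
    KeyValue(String, String),
    /// `env` matches whenever label `env` is present, whatever its value.
    KeyOnly(String),
}

/// Parse a raw selector, splitting on the first `:` only.
pub fn parse_selector(raw: &str) -> LabelSelector {
    match raw.split_once(':') {
        Some((key, value)) => LabelSelector::KeyValue(key.to_string(), value.to_string()),
        None => LabelSelector::KeyOnly(raw.to_string()),
    }
}

/// AND logic: a pipeline matches only if every selector matches its labels.
/// An empty selector list matches every pipeline.
pub fn matches_all(labels: &HashMap<String, String>, selectors: &[LabelSelector]) -> bool {
    selectors.iter().all(|selector| match selector {
        LabelSelector::KeyValue(key, value) => labels.get(key) == Some(value),
        LabelSelector::KeyOnly(key) => labels.contains_key(key),
    })
}
```

With labels `{env: prod, team: platform}`, the selectors `env:prod` and `team` together match, while `env:prod` combined with `team:data` does not, because every selector has to hold.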

Enriched APIs:

  • GET /pipelines/{name} includes ops field: uptime, DLQ count, per-sink checkpoints
  • GET /pipelines/{name}/checkpoints - per-sink positions with age
  • GET /health returns JSON with status + failed_pipelines (was plain text)
  • GET /log-level - current RUST_LOG value
  • POST /validate - dry-run config validation without creating pipeline
  • Structured error responses: {"code": "PIPELINE_NOT_FOUND", "message": "..."}
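
To illustrate the structured error shape, here is a rough sketch of an error type that renders the `{"code": ..., "message": ...}` body. Only the `PIPELINE_NOT_FOUND` code and the `ApiResult` name come from this PR; the `ApiError` enum, the `INVALID_CONFIG` variant, the message wording, and the hand-written JSON rendering are assumptions made to keep the example dependency-free.

```rust
/// Hypothetical error type; only the PIPELINE_NOT_FOUND code is from the PR.
#[derive(Debug)]
pub enum ApiError {
    PipelineNotFound(String),
    /// Assumed variant, shown only to illustrate a second code.
    InvalidConfig(String),
}

/// Assumed shape of a shared result alias for handlers.
pub type ApiResult<T> = Result<T, ApiError>;

impl ApiError {
    /// Stable, machine-readable code that automation can match on.
    pub fn code(&self) -> &'static str {
        match self {
            ApiError::PipelineNotFound(_) => "PIPELINE_NOT_FOUND",
            ApiError::InvalidConfig(_) => "INVALID_CONFIG",
        }
    }

    /// Human-readable detail; the wording here is illustrative.
    pub fn message(&self) -> String {
        match self {
            ApiError::PipelineNotFound(name) => format!("pipeline '{}' not found", name),
            ApiError::InvalidConfig(reason) => format!("invalid config: {}", reason),
        }
    }

    /// Render the response body as a JSON object with `code` and `message`.
    pub fn to_json(&self) -> String {
        format!(
            "{{\"code\": \"{}\", \"message\": \"{}\"}}",
            self.code(),
            escape_json(&self.message())
        )
    }
}

/// Escape a string for use inside a JSON string literal.
fn escape_json(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}
```

Keeping the code separate from the message lets clients branch on a fixed string while the message text stays free to change.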

Per-table lag:

  • deltaforge_source_table_lag_seconds{pipeline, table} gauge
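
One way the per-table gauge value could be derived from a batch is sketched below. The PR does not define how lag is computed, so this assumes lag is the current time minus the newest source timestamp seen for each table; the `Event` struct, its fields, and `table_lag_seconds` are hypothetical names, and the metrics call that would publish the values is left out.

```rust
use std::collections::HashMap;

/// Hypothetical minimal event: the table it belongs to and the source
/// commit timestamp in milliseconds since the Unix epoch.
pub struct Event {
    pub table: String,
    pub source_ts_ms: u64,
}

/// Compute one lag value per table for a batch.
/// Assumption: lag = now - newest source timestamp seen for that table.
pub fn table_lag_seconds(batch: &[Event], now_ms: u64) -> HashMap<String, f64> {
    // Track the newest source timestamp per table.
    let mut newest: HashMap<&str, u64> = HashMap::new();
    for event in batch {
        let entry = newest
            .entry(event.table.as_str())
            .or_insert(event.source_ts_ms);
        *entry = (*entry).max(event.source_ts_ms);
    }

    // Convert to seconds; saturating_sub clamps clock skew to zero lag.
    newest
        .into_iter()
        .map(|(table, ts)| (table.to_string(), now_ms.saturating_sub(ts) as f64 / 1000.0))
        .collect()
}
```

Each entry of the returned map would correspond to one `{pipeline, table}` series of the gauge, with the pipeline label supplied by the caller.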

Grafana dashboard rebuilt for 300+ pipelines:

  • Fleet overview: aggregate totals (events/s, data/s, max lag, DLQ, errors)
  • Top-N panels: top 10 laggiest, throughput, DLQ backlogs
  • Tenant variable, per-table lag, all DLQ + EOS panels
  • Table legends with sortable values

Documentation:

  • API reference updated with all new endpoints
  • Correctness test matrix in guarantees.md
  • Observability page updated with Grafana dashboard link

Why

Operators managing hundreds of pipelines need: label-based filtering, one-call status with all operational data, structured errors for automation, top-N dashboards instead of 300 unreadable series, and config validation before deployment.

Testing

  • cargo test --workspace --lib - all tests pass
  • cargo clippy --all-targets --all-features -- -D warnings - clean
  • mdbook build docs/ - builds
  • Health endpoint test updated for JSON response

Checklist

  • Tests pass (cargo test)
  • Code formatted (cargo fmt)
  • Clippy clean
  • Docs updated (API reference, observability, guarantees)

vnvo added 12 commits April 2, 2026 22:45
Pipeline metadata: labels + annotations (HashMap<String, String>).
Labels enable filtering via GET /pipelines?label=env:prod (AND logic).

Checkpoint inspection: GET /pipelines/{name}/checkpoints returns
per-sink checkpoint positions with age.

Per-table replication lag: deltaforge_source_table_lag_seconds{table}
gauge emitted per table in each batch alongside pipeline-level lag.

Correctness test matrix added to guarantees.md: maps every guarantee
to its test with status (exists/planned).
Metadata: labels (HashMap) and annotations (HashMap) with serde(default).
GET /pipelines?label=env:prod - AND logic, key:value or key-only filter.
Labels enable Grafana variables, operator selection, fleet management.
GET /pipelines/{name}: ops field with DLQ count, per-sink checkpoints.
GET /health: returns JSON with status + failed_pipelines list (was plain text).
GET /pipelines/{name}/checkpoints: per-sink positions and ages.
deltaforge_pipeline_info{pipeline,tenant} gauge for Grafana joins.
Rebuilt dashboard optimized for fleet operations:
- Fleet overview: aggregate totals (events/s, data/s, max lag, DLQ, errors)
- Top-N panels: top 10 laggiest, throughput, DLQ backlogs (readable at scale)
- Tenant variable for filtering by tenant label
- Per-table lag (top 10 tables)
- All DLQ metrics: entries, events/s, saturation, overflow
- All EOS metrics: checkpoint status, commit rate, txn commits/aborts
- Table legends with sortable values (replaces unreadable list legends)
- Collapsed sections for batching and infrastructure (reduce noise)
All error responses now return {"code": "PIPELINE_NOT_FOUND", "message": "..."}.
Shared ApiResult type across pipelines, schemas, and sensing modules.
Pipeline uptime: started_at tracked in runtime, exposed as ops.uptime_seconds.
GET /log-level: returns current RUST_LOG value.
POST /validate: dry-run config validation without creating pipeline.
GET /health: returns JSON with status + failed_pipelines list.
Added: DLQ (peek/count/ack/purge), checkpoint inspection, label filtering,
log-level, config validation. Updated: health (JSON response), get pipeline
(ops field), error responses (structured codes).
@vnvo vnvo merged commit 0536459 into main Apr 2, 2026
3 of 4 checks passed
@vnvo vnvo deleted the better-ops branch April 2, 2026 21:44