Operational improvements: labels, checkpoints, structured errors, production dashboard by vnvo · Pull Request #67 · vnvo/deltaforge

vnvo · 2026-04-02T21:39:04Z

What

Operational improvements for production fleet management at scale.

Pipeline metadata:

Labels + annotations on pipeline metadata (labels: {env: prod, team: platform})
GET /pipelines?label=env:prod - filter by label with AND logic
deltaforge_pipeline_info{pipeline, tenant} gauge for Grafana joins
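
The label filter semantics described above (AND logic, with either key:value or key-only selectors) could be sketched roughly as follows. This is an illustration only, not the code from this PR: the `LabelSelector` type and the `parse_selector` and `matches_all` functions are hypothetical names, and how multiple selectors arrive in the query string is left out.

```rust
use std::collections::HashMap;

/// One parsed selector: `key:value`, or key-only when `value` is `None`.
/// Hypothetical type, introduced for illustration.
#[derive(Debug, PartialEq)]
pub struct LabelSelector {
    pub key: String,
    pub value: Option<String>,
}

/// Parse `env:prod` into key + value, or `env` into a key-only selector.
pub fn parse_selector(raw: &str) -> LabelSelector {
    match raw.split_once(':') {
        Some((k, v)) => LabelSelector {
            key: k.to_string(),
            value: Some(v.to_string()),
        },
        None => LabelSelector {
            key: raw.to_string(),
            value: None,
        },
    }
}

/// AND logic: a pipeline matches only if every selector matches its labels.
pub fn matches_all(labels: &HashMap<String, String>, selectors: &[LabelSelector]) -> bool {
    selectors
        .iter()
        .all(|sel| match (labels.get(&sel.key), &sel.value) {
            // key:value selector: the label must exist with exactly this value.
            (Some(actual), Some(expected)) => actual == expected,
            // key-only selector: the label only has to exist.
            (Some(_), None) => true,
            // Label missing entirely: no match.
            (None, _) => false,
        })
}
```

With labels {env: prod, team: platform}, the selectors env:prod and team both match, while adding team:data makes the whole filter fail.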

Enriched APIs:

GET /pipelines/{name} includes ops field: uptime, DLQ count, per-sink checkpoints
GET /pipelines/{name}/checkpoints — per-sink positions with age
GET /health returns JSON with status + failed_pipelines (was plain text)
GET /log-level - current RUST_LOG value
POST /validate - dry-run config validation without creating pipeline
Structured error responses: {"code": "PIPELINE_NOT_FOUND", "message": "..."}
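
For automation, the useful property of the structured error shape is a stable code next to a free-form message. A minimal sketch under stated assumptions: only the PIPELINE_NOT_FOUND code and the {"code", "message"} shape come from this PR. The `ErrorCode` enum, the `INVALID_CONFIG` variant, `find_pipeline`, and the hand-rolled JSON encoding are hypothetical, and the `ApiResult` alias shown here is a guess at its shape.

```rust
/// Error codes for illustration. Only PIPELINE_NOT_FOUND appears in the PR text;
/// INVALID_CONFIG is an assumed example.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ErrorCode {
    PipelineNotFound,
    InvalidConfig,
}

impl ErrorCode {
    pub fn as_str(self) -> &'static str {
        match self {
            ErrorCode::PipelineNotFound => "PIPELINE_NOT_FOUND",
            ErrorCode::InvalidConfig => "INVALID_CONFIG",
        }
    }
}

#[derive(Debug)]
pub struct ApiError {
    pub code: ErrorCode,
    pub message: String,
}

/// Assumed shape of a shared result alias.
pub type ApiResult<T> = Result<T, ApiError>;

/// Escape the characters that would break a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl ApiError {
    /// Render the error as {"code": "...", "message": "..."}.
    pub fn to_json(&self) -> String {
        format!(
            "{{\"code\": \"{}\", \"message\": \"{}\"}}",
            self.code.as_str(),
            escape_json(&self.message)
        )
    }
}

/// Hypothetical lookup that returns a structured error instead of a bare string.
pub fn find_pipeline(known: &[&str], name: &str) -> ApiResult<usize> {
    known.iter().position(|p| *p == name).ok_or_else(|| ApiError {
        code: ErrorCode::PipelineNotFound,
        message: format!("pipeline '{}' not found", name),
    })
}
```

A client can then branch on the code field and treat the message as human-readable detail only.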

Per-table lag:

deltaforge_source_table_lag_seconds{pipeline, table} gauge
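
One way to derive a per-table lag value from a batch is sketched below. The `Event` struct, its `source_ts_ms` field, and the choice of the newest event per table as the reference point are assumptions made for illustration; the PR text only states that a gauge is emitted per table in each batch.

```rust
use std::collections::HashMap;

/// Hypothetical change event: the table it belongs to and the source
/// commit timestamp in milliseconds since the epoch.
pub struct Event {
    pub table: String,
    pub source_ts_ms: u64,
}

/// For each table in the batch, lag = now - newest source timestamp, in seconds.
/// Using the newest event per table is an assumption of this sketch.
pub fn table_lag_seconds(batch: &[Event], now_ms: u64) -> HashMap<String, f64> {
    let mut newest: HashMap<&str, u64> = HashMap::new();
    for ev in batch {
        let entry = newest.entry(ev.table.as_str()).or_insert(ev.source_ts_ms);
        *entry = (*entry).max(ev.source_ts_ms);
    }
    newest
        .into_iter()
        // saturating_sub guards against clock skew producing a negative lag.
        .map(|(table, ts)| (table.to_string(), now_ms.saturating_sub(ts) as f64 / 1000.0))
        .collect()
}
```

Each entry in the returned map would correspond to one gauge sample labelled with the table name.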

Grafana dashboard rebuilt for 300+ pipelines:

Fleet overview: aggregate totals (events/s, data/s, max lag, DLQ, errors)
Top-N panels: top 10 laggiest, throughput, DLQ backlogs
Tenant variable, per-table lag, all DLQ + EOS panels
Table legends with sortable values

Documentation:

API reference updated with all new endpoints
Correctness test matrix in guarantees.md
Observability page updated with Grafana dashboard link

Why

Operators managing hundreds of pipelines need: label-based filtering, one-call status with all operational data, structured errors for automation, top-N dashboards instead of 300 unreadable series, and config validation before deployment.

Testing

cargo test --workspace --lib - all tests pass
cargo clippy --all-targets --all-features -- -D warnings - clean
mdbook build docs/ - builds
Health endpoint test updated for JSON response

Checklist

Tests pass (cargo test)
Code formatted (cargo fmt)
Clippy clean
Docs updated (API reference, observability, guarantees)

Pipeline metadata: labels + annotations (HashMap<String, String>). Labels enable filtering via GET /pipelines?label=env:prod (AND logic).
Checkpoint inspection: GET /pipelines/{name}/checkpoints returns per-sink checkpoint positions with age.
Per-table replication lag: deltaforge_source_table_lag_seconds{table} gauge emitted per table in each batch alongside pipeline-level lag.
Correctness test matrix added to guarantees.md, mapping every guarantee to its test with status (exists/planned).

Metadata: labels (HashMap) and annotations (HashMap) with serde(default). GET /pipelines?label=env:prod - AND logic, key:value or key-only filter. Labels enable Grafana variables, operator selection, fleet management.

GET /pipelines/{name}: ops field with DLQ count, per-sink checkpoints. GET /health: returns JSON with status + failed_pipelines list (was plain text). GET /pipelines/{name}/checkpoints: per-sink positions and ages. deltaforge_pipeline_info{pipeline,tenant} gauge for Grafana joins.
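
The JSON health response pairs an overall status with the list of failed pipelines. A rough sketch of how such a report could be assembled follows. The `PipelineState` enum, the `HealthReport` struct, and the "healthy"/"unhealthy" status strings are assumptions; the PR text only names the status and failed_pipelines fields.

```rust
/// Hypothetical pipeline states, for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PipelineState {
    Running,
    Paused,
    Failed,
}

/// Mirrors the two fields named in the PR: status + failed_pipelines.
#[derive(Debug, PartialEq)]
pub struct HealthReport {
    pub status: &'static str,
    pub failed_pipelines: Vec<String>,
}

/// Collect failed pipelines; the status strings are assumed values.
pub fn health(pipelines: &[(&str, PipelineState)]) -> HealthReport {
    let failed_pipelines: Vec<String> = pipelines
        .iter()
        .filter(|(_, state)| *state == PipelineState::Failed)
        .map(|(name, _)| name.to_string())
        .collect();
    let status = if failed_pipelines.is_empty() {
        "healthy"
    } else {
        "unhealthy"
    };
    HealthReport {
        status,
        failed_pipelines,
    }
}
```

Compared with a plain-text response, this lets a probe or script see which pipelines failed without a second call.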

Rebuilt dashboard optimized for fleet operations:

Fleet overview: aggregate totals (events/s, data/s, max lag, DLQ, errors)
Top-N panels: top 10 laggiest, throughput, DLQ backlogs (readable at scale)
Tenant variable for filtering by tenant label
Per-table lag (top 10 tables)
All DLQ metrics: entries, events/s, saturation, overflow
All EOS metrics: checkpoint status, commit rate, txn commits/aborts
Table legends with sortable values (replaces unreadable list legends)
Collapsed sections for batching and infrastructure (reduce noise)

All error responses now return {"code": "PIPELINE_NOT_FOUND", "message": "..."}. Shared ApiResult type across pipelines, schemas, and sensing modules.

Pipeline uptime: started_at tracked in runtime, exposed as ops.uptime_seconds. GET /log-level: returns current RUST_LOG value. POST /validate: dry-run config validation without creating pipeline. GET /health: returns JSON with status + failed_pipelines list.
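
Tracking started_at and exposing it as uptime seconds could look roughly like this. The `PipelineRuntime` struct and its methods are hypothetical; only the started_at and uptime_seconds names come from the PR text, and the use of a monotonic `Instant` is an assumption.

```rust
use std::time::Instant;

/// Hypothetical runtime handle that remembers when the pipeline started.
pub struct PipelineRuntime {
    started_at: Instant,
}

impl PipelineRuntime {
    /// Record the start time explicitly (handy for tests).
    pub fn start_at(started_at: Instant) -> Self {
        Self { started_at }
    }

    /// Whole seconds between start and `now`; zero if `now` is earlier.
    pub fn uptime_seconds_at(&self, now: Instant) -> u64 {
        now.saturating_duration_since(self.started_at).as_secs()
    }

    /// Uptime as of the current monotonic clock reading.
    pub fn uptime_seconds(&self) -> u64 {
        self.uptime_seconds_at(Instant::now())
    }
}
```

A monotonic clock keeps the reported uptime from jumping when the wall clock is adjusted.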

Added: DLQ (peek/count/ack/purge), checkpoint inspection, label filtering, log-level, config validation. Updated: health (JSON response), get pipeline (ops field), error responses (structured codes).

vnvo added 12 commits April 2, 2026 22:45

feat: per-table replication lag metric (deltaforge_source_table_lag_seconds)

48cfdfd

docs: correctness test matrix in guarantees page

9e901a1

docs: update observability page - mark implemented metrics, add DLQ + per-table lag

dccd63d

feat: pipeline labels, annotations, and REST filtering

9742c4d

feat: structured API errors with code + message JSON

13a74f8

docs: update API reference with all new endpoints

83526ef

formatting fixes

e3322b8

clippy fixes

6406b46

vnvo merged commit 0536459 into main Apr 2, 2026
3 of 4 checks passed

vnvo deleted the better-ops branch April 2, 2026 21:44

vnvo merged 12 commits into main from better-ops
