
Feature 028: Add Grafana observability stack for local development#33

Merged
ovation22 merged 3 commits into main from feature/028-grafana-observability-stack on Feb 3, 2026

Conversation


ovation22 (Owner) commented on Feb 2, 2026

Summary

Implements a comprehensive local development observability stack using Grafana, Prometheus, and OpenTelemetry Collector. All six microservices (API, Admin, Breeding, Feeding, Training, Racing) now export telemetry through a centralized OTLP Collector, enabling real-time monitoring, performance analysis, and troubleshooting during development.

Implementation Overview

1. Core Observability Infrastructure

Added containerized observability stack:

  • OpenTelemetry Collector (ports 4317/4318/8889) - Central telemetry hub receiving OTLP from all services
  • Prometheus (port 9090) - Metrics storage and querying
  • Grafana (port 3000) - Visualization with pre-configured ASP.NET Core dashboards

Architecture pattern:

Services → OTLP → Collector → Prometheus → Grafana

All containers use persistent lifetime to preserve historical data across restarts.
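As an illustrative sketch only (not the exact `TripleDerby.AppHost/Program.cs` from this PR), a persistent Grafana container in a .NET Aspire AppHost might look like this; the resource name, image tag, and endpoint wiring are assumptions:

```csharp
// Illustrative sketch: registering Grafana as a persistent container so its
// data volume survives restarts of the AppHost.
var builder = DistributedApplication.CreateBuilder(args);

builder.AddContainer("grafana", "grafana/grafana")
    .WithLifetime(ContainerLifetime.Persistent)     // keep historical data across runs
    .WithHttpEndpoint(port: 3000, targetPort: 3000); // port 3000 per this PR

builder.Build().Run();
```

The same `WithLifetime(ContainerLifetime.Persistent)` call would apply to the Prometheus and OTLP Collector resources to preserve metrics history.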

2. Grafana Dashboard Provisioning

Pre-configured dashboards:

  • Dashboard 19924: ASP.NET Core application-wide metrics

    • Request duration percentiles (p50, p75, p90, p95, p98, p99, p99.9)
    • Error rates (4XX, 5XX)
    • Active connections and requests
    • Request volume by protocol
    • Top 10 requested endpoints
    • Top 10 exception endpoints
  • Dashboard 19925: ASP.NET Core endpoint details

    • Per-route latency analysis
    • Per-route error rates
    • Filterable by HTTP method and route pattern

Configuration:

  • Auto-provisioned Prometheus datasource
  • Anonymous access enabled for local dev (no login required)
  • Dashboards organized in "ASP.NET Core" folder
  • 10-second auto-refresh
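A minimal sketch of what the auto-provisioned datasource file might contain (illustrative; the actual `grafana/provisioning/datasources/datasources.yaml` in the PR may differ, and the `prometheus` hostname is an assumption about container networking):

```yaml
# Sketch: Grafana datasource provisioning pointing at the Prometheus container.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # port 9090 per this PR
    isDefault: true
```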

3. Service Integration

All services now export metrics via OTLP:

  • API service (HTTP endpoints)
  • Admin/Web service (HTTP endpoints)
  • Breeding worker service (background processing)
  • Feeding worker service (background processing)
  • Training worker service (background processing)
  • Racing worker service (background processing)

Metrics namespacing:

  • All metrics prefixed with triple_derby_ to avoid naming conflicts
  • Service identification via exported_job label
  • Consistent labeling across the stack

4. Comprehensive Documentation

Created docs/OBSERVABILITY.md covering:

Architecture and Flow:

  • OTLP Collector hub pattern explanation
  • Telemetry flow diagrams (Mermaid)
  • Container startup dependencies

Available Dashboards:

  • Dashboard descriptions and key metrics
  • How to navigate and interpret visualizations

Prometheus Query Examples:

  • Basic service queries (listing services, request rates)
  • HTTP service queries (API, Admin) with latency percentiles, error tracking, endpoint analysis
  • Worker service queries (runtime metrics, HttpClient metrics for outbound calls)
  • Error tracking (4XX/5XX rates, unhandled exceptions)
  • Cross-service comparisons
  • System resource metrics (GC, thread pool, memory)
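Two hedged examples in the spirit of the documented queries; the exact metric names depend on the `triple_derby_` prefix plus the OpenTelemetry ASP.NET Core conventions, so verify them against your local Prometheus before relying on them:

```promql
# p95 request latency for the API service over 5 minutes (metric name assumed).
histogram_quantile(0.95,
  sum by (le) (
    rate(triple_derby_http_server_request_duration_seconds_bucket{exported_job="api"}[5m])
  ))

# 5XX error rate for the Admin service (label names assumed).
sum(rate(triple_derby_http_server_request_duration_seconds_count{exported_job="admin", http_response_status_code=~"5.."}[5m]))
```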

Service Reference Table:
Shows which services expose HTTP metrics vs runtime-only metrics

Troubleshooting:

  • No metrics appearing in Prometheus
  • Grafana dashboard shows "No Data"
  • Services not starting
  • Performance impact considerations

Production Considerations:

  • Managed services recommendations
  • Security hardening requirements
  • Scaling patterns

5. OTLP Collector Configuration

Metrics pipeline:

  • Receives OTLP on ports 4317 (gRPC) and 4318 (HTTP)
  • Batch processor with 10-second timeout
  • Prometheus exporter on port 8889
  • Namespace prefix: triple_derby_
  • Constant label: app: triple-derby
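Assembled from the settings above, the collector pipeline could be sketched like this (illustrative, not the exact `otel-collector-config.yaml` shipped in the PR):

```yaml
# Sketch: OTLP in on 4317/4318, batch with 10s timeout, Prometheus exporter
# on 8889 with the triple_derby namespace and app constant label.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 10s
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: triple_derby
    const_labels:
      app: triple-derby
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```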

Scraping configuration:

  • Prometheus scrapes collector every 15 seconds
  • 15-day retention (default)
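The corresponding scrape configuration could be sketched as follows (illustrative; the target hostname is an assumption about the container name in this PR):

```yaml
# Sketch of prometheus.yml: scrape the collector's Prometheus exporter
# (port 8889) every 15 seconds under the job name "otel-collector".
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: otel-collector
    static_configs:
      - targets: ["otel-collector:8889"]
```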

Key Technical Details

Metric Label Usage (Critical)

When using the OTLP Collector pattern:

  • Filter by exported_job to select specific services (api, admin, breeding, feeding, training, racing)
  • The job label will always be "otel-collector" because that's what Prometheus scrapes
  • This distinction is documented and all example queries use the correct label
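To make the distinction concrete (the metric name here is an assumed runtime metric; the label behavior follows from the scrape setup above):

```promql
# Correct: select the breeding worker via exported_job.
triple_derby_process_runtime_dotnet_gc_heap_size_bytes{exported_job="breeding"}

# Incorrect: job is always "otel-collector", so this matches nothing.
triple_derby_process_runtime_dotnet_gc_heap_size_bytes{job="breeding"}
```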

Service Type Differences

HTTP Services (API, Admin):

  • Export HTTP request metrics (request counts, durations, status codes)
  • Export runtime metrics (GC, thread pool, assemblies)
  • Available in Grafana dashboards

Worker Services (Breeding, Feeding, Training, Racing):

  • Export runtime metrics only (no HTTP endpoints)
  • Can export HttpClient metrics for outbound HTTP calls
  • Would require custom instrumentation for Service Bus message processing metrics
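A hypothetical sketch of such custom instrumentation using the .NET Meter API (`System.Diagnostics.Metrics`); the meter and counter names are illustrative, not taken from this PR:

```csharp
// Hypothetical sketch: counting processed Service Bus messages in a worker
// so the count flows through OTLP to Prometheus like the built-in metrics.
using System.Diagnostics.Metrics;

public static class BreedingMetrics
{
    private static readonly Meter Meter = new("TripleDerby.Breeding");
    private static readonly Counter<long> MessagesProcessed =
        Meter.CreateCounter<long>("breeding.messages.processed");

    public static void RecordProcessed() => MessagesProcessed.Add(1);
}
```

The meter name would also need to be registered with the service's OpenTelemetry metrics configuration (e.g. via `AddMeter`) for it to be exported.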

Files Added/Modified

AppHost Configuration:

  • TripleDerby.AppHost/Program.cs - Added container resources and service OTLP configuration
  • TripleDerby.AppHost/otel-collector-config.yaml - OTLP Collector pipeline configuration
  • TripleDerby.AppHost/prometheus.yml - Prometheus scrape configuration

Grafana Configuration:

  • TripleDerby.AppHost/grafana/provisioning/datasources/datasources.yaml - Prometheus datasource
  • TripleDerby.AppHost/grafana/provisioning/dashboards/dashboards.yaml - Dashboard auto-provisioning
  • TripleDerby.AppHost/grafana/dashboards/aspnetcore-19924.json - ASP.NET Core dashboard
  • TripleDerby.AppHost/grafana/dashboards/aspnetcore-endpoint-19925.json - Endpoint details dashboard

Documentation:

  • docs/OBSERVABILITY.md - Comprehensive observability guide (250+ lines)
  • docs/features/028-grafana-observability-stack.md - Feature specification
  • docs/implementation/028-grafana-observability-stack-implementation-plan.md - Implementation plan

Developer Experience

Single command startup:

```
dotnet run --project TripleDerby.AppHost
```

Immediate access to:

  • Grafana at http://localhost:3000 (anonymous access, no login required)
  • Prometheus at http://localhost:9090

Real-time visibility:

  • All services emit metrics automatically (no code changes required)
  • Metrics appear in Grafana within seconds
  • Historical data preserved across restarts

Validation

All acceptance criteria met:

  • All containers start successfully with correct dependencies
  • Metrics from all 6 services visible in Grafana dashboard 19924
  • Per-endpoint metrics available in dashboard 19925
  • Prometheus queries work with correct exported_job label usage
  • Documentation enables immediate developer onboarding
  • Existing Aspire dashboard functionality preserved
  • Container startup completes within 30 seconds

Next Steps (Future Enhancements)

Custom Service Bus Metrics:
Documentation includes examples of how to add custom instrumentation for message processing metrics in worker services using OpenTelemetry's Meter API.

Alerting:
The foundation is in place for adding Prometheus Alertmanager or Grafana alerts in a future feature.

Trace Storage:
Currently using Aspire dashboard for traces. Could add Tempo for long-term trace storage if needed.

Impact

This observability stack provides immediate visibility into application performance during local development, enabling:

  • Performance optimization through latency percentile analysis
  • Error detection and troubleshooting through error rate tracking
  • Capacity planning through active connection and request monitoring
  • Endpoint-level performance analysis
  • Cross-service performance comparison
  • Historical trend analysis with persistent data storage

Implements local development observability with:
- Prometheus for metrics storage and querying (port 9090)
- OpenTelemetry Collector for OTLP ingestion and routing (ports 4317, 4318, 8889)
- Grafana with provisioned ASP.NET Core dashboards (port 3000)

The API service now exports metrics via OTLP to the collector, which exposes
them to Prometheus. Grafana includes pre-configured dashboards for request
duration, error rates, active connections, and endpoint performance metrics.

All metrics are prefixed with 'triple_derby_' namespace for clarity.
Extended OTLP telemetry export to all microservices:
- Web/Admin
- Breeding service
- Feeding service
- Training service
- Racing service

All services now export metrics to the OTLP Collector for centralized
observability. Metrics will appear in Grafana dashboards alongside the API
service metrics.

Added comprehensive observability documentation (docs/OBSERVABILITY.md)
covering architecture, accessing tools, available dashboards, metrics
reference, and troubleshooting guide.
…ation

- Fix metric label usage: use exported_job instead of job for OTLP Collector pattern
- Add query examples for HTTP services (API, Admin) including latency percentiles, error tracking, and endpoint analysis
- Add query examples for worker services (breeding, feeding, training, racing) with runtime and HttpClient metrics
- Document custom instrumentation approach for Service Bus message processing metrics
- Add service reference table showing which services have HTTP vs runtime-only metrics
- Organize queries by category: basic service queries, worker services, latency, errors, endpoints, cross-service comparisons, and system resources
ovation22 changed the title from "Add comprehensive Prometheus query examples to observability documentation" to "Feature 028: Add Grafana observability stack for local development" on Feb 2, 2026

github-actions bot commented Feb 3, 2026

Test Results

723 tests ±0, 723 ✅ ±0, 7s ⏱️ −1s
1 suite ±0, 0 💤 ±0
1 file ±0, 0 ❌ ±0

Results for commit 67358cd. ± Comparison against base commit c1ac668.

ovation22 merged commit 7901eb5 into main on Feb 3, 2026 (2 of 4 checks passed).
ovation22 deleted the feature/028-grafana-observability-stack branch on February 3, 2026 at 00:08.