
Feature 028: Add Grafana observability stack for local development#33

Merged
ovation22 merged 3 commits into main from feature/028-grafana-observability-stack on Feb 3, 2026

Conversation


ovation22 (Owner) commented on Feb 2, 2026

Summary

Implements a comprehensive local development observability stack using Grafana, Prometheus, and OpenTelemetry Collector. All six microservices (API, Admin, Breeding, Feeding, Training, Racing) now export telemetry through a centralized OTLP Collector, enabling real-time monitoring, performance analysis, and troubleshooting during development.

Implementation Overview

1. Core Observability Infrastructure

Added containerized observability stack:

  • OpenTelemetry Collector (ports 4317/4318/8889) - Central telemetry hub receiving OTLP from all services
  • Prometheus (port 9090) - Metrics storage and querying
  • Grafana (port 3000) - Visualization with pre-configured ASP.NET Core dashboards

Architecture pattern:

Services → OTLP → Collector → Prometheus → Grafana

All containers use persistent lifetime to preserve historical data across restarts.
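As an illustrative sketch only (not the exact `TripleDerby.AppHost/Program.cs` from this PR), a persistent Grafana container in a .NET Aspire AppHost might look like this; the resource name, image tag, and endpoint wiring are assumptions:

```csharp
// Illustrative sketch: registering Grafana as a persistent container so its
// data volume survives restarts of the AppHost.
var builder = DistributedApplication.CreateBuilder(args);

builder.AddContainer("grafana", "grafana/grafana")
    .WithLifetime(ContainerLifetime.Persistent)     // keep historical data across runs
    .WithHttpEndpoint(port: 3000, targetPort: 3000); // port 3000 per this PR

builder.Build().Run();
```

The same `WithLifetime(ContainerLifetime.Persistent)` call would apply to the Prometheus and OTLP Collector resources to preserve metrics history.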

2. Grafana Dashboard Provisioning

Pre-configured dashboards:

  • Dashboard 19924: ASP.NET Core application-wide metrics

    • Request duration percentiles (p50, p75, p90, p95, p98, p99, p99.9)
    • Error rates (4XX, 5XX)
    • Active connections and requests
    • Request volume by protocol
    • Top 10 requested endpoints
    • Top 10 exception endpoints
  • Dashboard 19925: ASP.NET Core endpoint details

    • Per-route latency analysis
    • Per-route error rates
    • Filterable by HTTP method and route pattern

Configuration:

  • Auto-provisioned Prometheus datasource
  • Anonymous access enabled for local dev (no login required)
  • Dashboards organized in "ASP.NET Core" folder
  • 10-second auto-refresh
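A minimal sketch of what the auto-provisioned datasource file might contain (illustrative; the actual `grafana/provisioning/datasources/datasources.yaml` in the PR may differ, and the `prometheus` hostname is an assumption about container networking):

```yaml
# Sketch: Grafana datasource provisioning pointing at the Prometheus container.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # port 9090 per this PR
    isDefault: true
```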

3. Service Integration

All services now export metrics via OTLP:

  • API service (HTTP endpoints)
  • Admin/Web service (HTTP endpoints)
  • Breeding worker service (background processing)
  • Feeding worker service (background processing)
  • Training worker service (background processing)
  • Racing worker service (background processing)

Metrics namespacing:

  • All metrics prefixed with triple_derby_ to avoid naming conflicts
  • Service identification via exported_job label
  • Consistent labeling across the stack

4. Comprehensive Documentation

Created docs/OBSERVABILITY.md covering:

Architecture and Flow:

  • OTLP Collector hub pattern explanation
  • Telemetry flow diagrams (Mermaid)
  • Container startup dependencies

Available Dashboards:

  • Dashboard descriptions and key metrics
  • How to navigate and interpret visualizations

Prometheus Query Examples:

  • Basic service queries (listing services, request rates)
  • HTTP service queries (API, Admin) with latency percentiles, error tracking, endpoint analysis
  • Worker service queries (runtime metrics, HttpClient metrics for outbound calls)
  • Error tracking (4XX/5XX rates, unhandled exceptions)
  • Cross-service comparisons
  • System resource metrics (GC, thread pool, memory)
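Two hedged examples in the spirit of the documented queries; the exact metric names depend on the `triple_derby_` prefix plus the OpenTelemetry ASP.NET Core conventions, so verify them against your local Prometheus before relying on them:

```promql
# p95 request latency for the API service over 5 minutes (metric name assumed).
histogram_quantile(0.95,
  sum by (le) (
    rate(triple_derby_http_server_request_duration_seconds_bucket{exported_job="api"}[5m])
  ))

# 5XX error rate for the Admin service (label names assumed).
sum(rate(triple_derby_http_server_request_duration_seconds_count{exported_job="admin", http_response_status_code=~"5.."}[5m]))
```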

Service Reference Table:
Shows which services expose HTTP metrics vs runtime-only metrics

Troubleshooting:

  • No metrics appearing in Prometheus
  • Grafana dashboard shows "No Data"
  • Services not starting
  • Performance impact considerations

Production Considerations:

  • Managed services recommendations
  • Security hardening requirements
  • Scaling patterns

5. OTLP Collector Configuration

Metrics pipeline:

  • Receives OTLP on ports 4317 (gRPC) and 4318 (HTTP)
  • Batch processor with 10-second timeout
  • Prometheus exporter on port 8889
  • Namespace prefix: triple_derby_
  • Constant label: app: triple-derby
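Assembled from the settings above, the collector pipeline could be sketched like this (illustrative, not the exact `otel-collector-config.yaml` shipped in the PR):

```yaml
# Sketch: OTLP in on 4317/4318, batch with 10s timeout, Prometheus exporter
# on 8889 with the triple_derby namespace and app constant label.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 10s
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: triple_derby
    const_labels:
      app: triple-derby
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```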

Scraping configuration:

  • Prometheus scrapes collector every 15 seconds
  • 15-day retention (default)
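The corresponding scrape configuration could be sketched as follows (illustrative; the target hostname is an assumption about the container name in this PR):

```yaml
# Sketch of prometheus.yml: scrape the collector's Prometheus exporter
# (port 8889) every 15 seconds under the job name "otel-collector".
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: otel-collector
    static_configs:
      - targets: ["otel-collector:8889"]
```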

Key Technical Details

Metric Label Usage (Critical)

When using the OTLP Collector pattern:

  • Filter by exported_job to select specific services (api, admin, breeding, feeding, training, racing)
  • The job label will always be "otel-collector" because that's what Prometheus scrapes
  • This distinction is documented and all example queries use the correct label
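To make the distinction concrete (the metric name here is an assumed runtime metric; the label behavior follows from the scrape setup above):

```promql
# Correct: select the breeding worker via exported_job.
triple_derby_process_runtime_dotnet_gc_heap_size_bytes{exported_job="breeding"}

# Incorrect: job is always "otel-collector", so this matches nothing.
triple_derby_process_runtime_dotnet_gc_heap_size_bytes{job="breeding"}
```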

Service Type Differences

HTTP Services (API, Admin):

  • Export HTTP request metrics (request counts, durations, status codes)
  • Export runtime metrics (GC, thread pool, assemblies)
  • Available in Grafana dashboards

Worker Services (Breeding, Feeding, Training, Racing):

  • Export runtime metrics only (no HTTP endpoints)
  • Can export HttpClient metrics for outbound HTTP calls
  • Would require custom instrumentation for Service Bus message processing metrics
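A hypothetical sketch of such custom instrumentation using the .NET Meter API (`System.Diagnostics.Metrics`); the meter and counter names are illustrative, not taken from this PR:

```csharp
// Hypothetical sketch: counting processed Service Bus messages in a worker
// so the count flows through OTLP to Prometheus like the built-in metrics.
using System.Diagnostics.Metrics;

public static class BreedingMetrics
{
    private static readonly Meter Meter = new("TripleDerby.Breeding");
    private static readonly Counter<long> MessagesProcessed =
        Meter.CreateCounter<long>("breeding.messages.processed");

    public static void RecordProcessed() => MessagesProcessed.Add(1);
}
```

The meter name would also need to be registered with the service's OpenTelemetry metrics configuration (e.g. via `AddMeter`) for it to be exported.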

Files Added/Modified

AppHost Configuration:

  • TripleDerby.AppHost/Program.cs - Added container resources and service OTLP configuration
  • TripleDerby.AppHost/otel-collector-config.yaml - OTLP Collector pipeline configuration
  • TripleDerby.AppHost/prometheus.yml - Prometheus scrape configuration

Grafana Configuration:

  • TripleDerby.AppHost/grafana/provisioning/datasources/datasources.yaml - Prometheus datasource
  • TripleDerby.AppHost/grafana/provisioning/dashboards/dashboards.yaml - Dashboard auto-provisioning
  • TripleDerby.AppHost/grafana/dashboards/aspnetcore-19924.json - ASP.NET Core dashboard
  • TripleDerby.AppHost/grafana/dashboards/aspnetcore-endpoint-19925.json - Endpoint details dashboard

Documentation:

  • docs/OBSERVABILITY.md - Comprehensive observability guide (250+ lines)
  • docs/features/028-grafana-observability-stack.md - Feature specification
  • docs/implementation/028-grafana-observability-stack-implementation-plan.md - Implementation plan

Developer Experience

Single command startup:

```
dotnet run --project TripleDerby.AppHost
```

Immediate access to:

  • Grafana at http://localhost:3000 (anonymous access, no login required)
  • Prometheus at http://localhost:9090

Real-time visibility:

  • All services emit metrics automatically (no code changes required)
  • Metrics appear in Grafana within seconds
  • Historical data preserved across restarts

Validation

All acceptance criteria met:

  • All containers start successfully with correct dependencies
  • Metrics from all 6 services visible in Grafana dashboard 19924
  • Per-endpoint metrics available in dashboard 19925
  • Prometheus queries work with correct exported_job label usage
  • Documentation enables immediate developer onboarding
  • Existing Aspire dashboard functionality preserved
  • Container startup completes within 30 seconds

Next Steps (Future Enhancements)

Custom Service Bus Metrics:
Documentation includes examples of how to add custom instrumentation for message processing metrics in worker services using OpenTelemetry's Meter API.

Alerting:
The foundation is in place for adding Prometheus Alertmanager or Grafana alerts in a future feature.

Trace Storage:
Currently using Aspire dashboard for traces. Could add Tempo for long-term trace storage if needed.

Impact

This observability stack provides immediate visibility into application performance during local development, enabling:

  • Performance optimization through latency percentile analysis
  • Error detection and troubleshooting through error rate tracking
  • Capacity planning through active connection and request monitoring
  • Endpoint-level performance analysis
  • Cross-service performance comparison
  • Historical trend analysis with persistent data storage

Implements local development observability with:
- Prometheus for metrics storage and querying (port 9090)
- OpenTelemetry Collector for OTLP ingestion and routing (ports 4317, 4318, 8889)
- Grafana with provisioned ASP.NET Core dashboards (port 3000)

The API service now exports metrics via OTLP to the collector, which exposes
them to Prometheus. Grafana includes pre-configured dashboards for request
duration, error rates, active connections, and endpoint performance metrics.

All metrics are prefixed with 'triple_derby_' namespace for clarity.
Extended OTLP telemetry export to all microservices:
- Web/Admin
- Breeding service
- Feeding service
- Training service
- Racing service

All services now export metrics to the OTLP Collector for centralized
observability. Metrics will appear in Grafana dashboards alongside the API
service metrics.

Added comprehensive observability documentation (docs/OBSERVABILITY.md)
covering architecture, accessing tools, available dashboards, metrics
reference, and troubleshooting guide.
…ation

- Fix metric label usage: use exported_job instead of job for OTLP Collector pattern
- Add query examples for HTTP services (API, Admin) including latency percentiles, error tracking, and endpoint analysis
- Add query examples for worker services (breeding, feeding, training, racing) with runtime and HttpClient metrics
- Document custom instrumentation approach for Service Bus message processing metrics
- Add service reference table showing which services have HTTP vs runtime-only metrics
- Organize queries by category: basic service queries, worker services, latency, errors, endpoints, cross-service comparisons, and system resources
ovation22 changed the title from "Add comprehensive Prometheus query examples to observability documentation" to "Feature 028: Add Grafana observability stack for local development" on Feb 2, 2026

github-actions bot commented Feb 3, 2026

Test Results

723 tests ±0, 723 ✅ ±0, 7s ⏱️ −1s
1 suite ±0, 0 💤 ±0
1 file ±0, 0 ❌ ±0

Results for commit 67358cd. ± Comparison against base commit c1ac668.

ovation22 merged commit 7901eb5 into main on Feb 3, 2026 (2 of 4 checks passed).
ovation22 deleted the feature/028-grafana-observability-stack branch on February 3, 2026 at 00:08.