Feature 028: Add Grafana observability stack for local development #33
Merged
Implements local development observability with:
- Prometheus for metrics storage and querying (port 9090)
- OpenTelemetry Collector for OTLP ingestion and routing (ports 4317, 4318, 8889)
- Grafana with provisioned ASP.NET Core dashboards (port 3000)
The API service now exports metrics via OTLP to the collector, which exposes them to Prometheus. Grafana includes pre-configured dashboards for request duration, error rates, active connections, and endpoint performance metrics. All metrics are prefixed with the 'triple_derby_' namespace for clarity.
Extended OTLP telemetry export to all microservices:
- Web/Admin
- Breeding service
- Feeding service
- Training service
- Racing service
All services now export metrics to the OTLP Collector for centralized observability. Metrics will appear in Grafana dashboards alongside the API service metrics. Added comprehensive observability documentation (docs/OBSERVABILITY.md) covering architecture, accessing tools, available dashboards, metrics reference, and troubleshooting guide.
…ation
- Fix metric label usage: use exported_job instead of job for OTLP Collector pattern
- Add query examples for HTTP services (API, Admin) including latency percentiles, error tracking, and endpoint analysis
- Add query examples for worker services (breeding, feeding, training, racing) with runtime and HttpClient metrics
- Document custom instrumentation approach for Service Bus message processing metrics
- Add service reference table showing which services have HTTP vs runtime-only metrics
- Organize queries by category: basic service queries, worker services, latency, errors, endpoints, cross-service comparisons, and system resources
Summary
Implements a comprehensive local development observability stack using Grafana, Prometheus, and OpenTelemetry Collector. All six microservices (API, Admin, Breeding, Feeding, Training, Racing) now export telemetry through a centralized OTLP Collector, enabling real-time monitoring, performance analysis, and troubleshooting during development.
Implementation Overview
1. Core Observability Infrastructure
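Added containerized observability stack:
- Prometheus for metrics storage and querying (port 9090)
- OpenTelemetry Collector for OTLP ingestion and routing (ports 4317, 4318, 8889)
- Grafana with provisioned ASP.NET Core dashboards (port 3000)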
Architecture pattern:
All containers use persistent lifetime to preserve historical data across restarts.
2. Grafana Dashboard Provisioning
Pre-configured dashboards:
Dashboard 19924: ASP.NET Core application-wide metrics
Dashboard 19925: ASP.NET Core endpoint details
Configuration:
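As a rough illustration of what file-based Grafana provisioning of this kind usually looks like, the sketch below shows a datasource file and a dashboard provider file. It is not copied from this PR: the `http://prometheus:9090` hostname, the provider name, and the `/var/lib/grafana/dashboards` mount path are assumptions and would need to match how the AppHost wires the containers.

```yaml
# --- grafana/provisioning/datasources/datasources.yaml (sketch) ---
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Assumption: Prometheus is reachable from Grafana under this hostname.
    url: http://prometheus:9090
    isDefault: true
---
# --- grafana/provisioning/dashboards/dashboards.yaml (sketch) ---
apiVersion: 1
providers:
  - name: default            # hypothetical provider name
    type: file
    disableDeletion: false
    options:
      # Assumption: the dashboard JSON files (19924, 19925) are mounted here.
      path: /var/lib/grafana/dashboards
```

Grafana reads both files at startup, so the Prometheus datasource and the two dashboards are available without manual setup.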
3. Service Integration
All services now export metrics via OTLP:
- API
- Admin
- Breeding
- Feeding
- Training
- Racing
Metrics namespacing:
- `triple_derby_` prefix to avoid naming conflicts
- `exported_job` label
4. Comprehensive Documentation
Created docs/OBSERVABILITY.md covering:
Architecture and Flow:
Available Dashboards:
Prometheus Query Examples:
Service Reference Table:
Shows which services expose HTTP metrics vs runtime-only metrics
Troubleshooting:
Production Considerations:
5. OTLP Collector Configuration
Metrics pipeline:
- `triple_derby_`
- `app: triple-derby`
Scraping configuration:
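The pipeline and scrape setup described in this section can be sketched as follows. This is a minimal example under stated assumptions, not the contents of the PR's files: the ports, the `triple_derby` namespace, and the `app: triple-derby` label come from the description above, while the `batch` processor, the 15s scrape interval, and the `otel-collector` hostname are assumptions.

```yaml
# --- otel-collector-config.yaml (sketch) ---
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # OTLP over gRPC
      http:
        endpoint: 0.0.0.0:4318   # OTLP over HTTP
processors:
  batch: {}                      # assumption: default batching
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889       # scraped by Prometheus
    namespace: triple_derby      # yields the triple_derby_ metric prefix
    const_labels:
      app: triple-derby
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
---
# --- prometheus.yml (sketch) ---
global:
  scrape_interval: 15s           # assumption
scrape_configs:
  - job_name: otel-collector     # becomes the job label on every series
    static_configs:
      # Assumption: the collector is reachable under this hostname.
      - targets: ["otel-collector:8889"]
```

Because Prometheus scrapes a single target, the `job_name` here is what ends up in the `job` label for every service's metrics.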
Key Technical Details
Metric Label Usage (Critical)
When using the OTLP Collector pattern:
- Use `exported_job` to select specific services (api, admin, breeding, feeding, training, racing)
- The `job` label will always be "otel-collector" because that is what Prometheus scrapes
Service Type Differences
HTTP Services (API, Admin):
Worker Services (Breeding, Feeding, Training, Racing):
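To make the label guidance above concrete, here is a hypothetical Prometheus rules file with two recording rules. It is not part of this PR. The metric name `triple_derby_http_server_request_duration_seconds_count` is an assumption about how the ASP.NET Core request-duration histogram appears after the collector applies the namespace, so check the actual name in Prometheus before relying on it.

```yaml
# Hypothetical rules file, for illustration only.
groups:
  - name: triple_derby_label_examples
    rules:
      # Request rate for the API service only: filter on exported_job, not job.
      - record: triple_derby:api_http_requests:rate5m
        expr: >
          sum(rate(triple_derby_http_server_request_duration_seconds_count{exported_job="api"}[5m]))
      # All series arrive with job="otel-collector"; grouping by exported_job
      # shows how many series each service (HTTP or worker) contributes.
      - record: triple_derby:series_by_service:count
        expr: count by (exported_job) ({job="otel-collector"})
```

The same `expr` values can be pasted directly into the Prometheus or Grafana query editor. Filtering on `job="api"` instead would return nothing, since `job` only ever holds the collector's scrape job name.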
Files Added/Modified
AppHost Configuration:
- `TripleDerby.AppHost/Program.cs` - Added container resources and service OTLP configuration
- `TripleDerby.AppHost/otel-collector-config.yaml` - OTLP Collector pipeline configuration
- `TripleDerby.AppHost/prometheus.yml` - Prometheus scrape configuration
Grafana Configuration:
- `TripleDerby.AppHost/grafana/provisioning/datasources/datasources.yaml` - Prometheus datasource
- `TripleDerby.AppHost/grafana/provisioning/dashboards/dashboards.yaml` - Dashboard auto-provisioning
- `TripleDerby.AppHost/grafana/dashboards/aspnetcore-19924.json` - ASP.NET Core dashboard
- `TripleDerby.AppHost/grafana/dashboards/aspnetcore-endpoint-19925.json` - Endpoint details dashboard
Documentation:
- `docs/OBSERVABILITY.md` - Comprehensive observability guide (250+ lines)
- `docs/features/028-grafana-observability-stack.md` - Feature specification
- `docs/implementation/028-grafana-observability-stack-implementation-plan.md` - Implementation plan
Developer Experience
Single command startup:
Immediate access to:
Real-time visibility:
Validation
All acceptance criteria met:
- `exported_job` label usage
Next Steps (Future Enhancements)
Custom Service Bus Metrics:
Documentation includes examples of how to add custom instrumentation for message processing metrics in worker services using OpenTelemetry's Meter API.
Alerting:
The foundation is in place for adding Prometheus Alertmanager or Grafana alerts in a future feature.
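For a sense of what such a follow-up might involve, the sketch below is one possible Prometheus alerting rule for the API error ratio. Nothing here exists in this PR: the alert name, the 5% threshold, and the `for` duration are arbitrary, and the metric name and the `http_response_status_code` label are assumptions about how the ASP.NET Core metrics are exposed through the collector.

```yaml
# Hypothetical alerting rule, for illustration only.
groups:
  - name: triple_derby_alerts_sketch
    rules:
      - alert: ApiHighErrorRate
        # Share of API requests answered with a 5xx status over 5 minutes.
        expr: |
          sum(rate(triple_derby_http_server_request_duration_seconds_count{exported_job="api", http_response_status_code=~"5.."}[5m]))
            /
          sum(rate(triple_derby_http_server_request_duration_seconds_count{exported_job="api"}[5m]))
            > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API 5xx ratio above 5% for 5 minutes"
```

A rule like this would also need a `rule_files` entry in `prometheus.yml`, plus Alertmanager if notifications are wanted.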
Trace Storage:
Traces currently go to the Aspire dashboard. Tempo could be added for long-term trace storage if needed.
Impact
This observability stack provides immediate visibility into application performance during local development, enabling: