Add OpenTelemetry Distributed Tracing for Observability
📋 Summary
Implement OpenTelemetry distributed tracing to provide comprehensive observability for the Maglev API server. This enables detailed performance monitoring, request flow visualization, and error tracking across all endpoints.
🎯 Problem Statement
Currently, the Maglev API lacks detailed observability into:
- Request execution paths and timing
- Performance bottlenecks in handler functions
- Database query latency
- Parallel query execution visualization (especially for PR #469, "feat: parallelize independent DB queries in handler functions (#468)")
- Error propagation and failure points
- End-to-end request duration breakdown
Without distributed tracing, debugging performance issues and understanding system behavior in production requires extensive logging and manual correlation.
💡 Proposed Solution
Add OpenTelemetry instrumentation to provide:
- Automatic HTTP request tracing via middleware
- Handler-level spans for critical endpoints
- Database query tracing for performance analysis
- Parallel query visualization to validate optimizations
- Error tracking with automatic span status updates
- Context propagation for distributed tracing support
🏗️ Implementation Plan
Phase 1: Core Infrastructure ✅
- Add OpenTelemetry dependencies to `go.mod`
- Create `internal/restapi/tracing.go` for initialization
- Create `internal/restapi/tracing_middleware.go` for HTTP instrumentation
- Integrate tracer lifecycle in `cmd/api/app.go`
- Apply middleware to all routes in `internal/restapi/routes.go`
Phase 2: Handler Instrumentation ✅
- Instrument `arrivalAndDepartureForStopHandler`
- Instrument `stopsForRouteHandler`
- Instrument `tripDetailsHandler`
- Add child spans for parallel database queries
- Add error recording with proper status codes
- Add custom attributes (stop IDs, route IDs, batch sizes)
Phase 3: Testing ✅
- Create integration tests in `tracing_middleware_test.go`
- Verify middleware doesn't break existing functionality
- Test error handling in traces
Phase 4: Future Enhancements (Optional)
- Add database-level instrumentation with `otelsql`
- Instrument GTFS real-time feed fetching
- Instrument GTFS static data loading
- Replace stdout exporter with OTLP for production
- Add remaining handler functions
- Configure sampling strategy for production
📊 Technical Details
Files Modified
New Files
- `internal/restapi/tracing.go` - Tracer initialization and configuration
- `internal/restapi/tracing_middleware.go` - HTTP request auto-instrumentation
- `internal/restapi/tracing_middleware_test.go` - Integration tests
Modified Files
- `go.mod` - Added OpenTelemetry dependencies
- `cmd/api/app.go` - Tracer lifecycle management in `Run()` function
- `internal/restapi/routes.go` - Applied tracing middleware to handler chain
- `internal/restapi/arrival_and_departure_for_stop_handler.go` - Added spans
- `internal/restapi/stops_for_route_handler.go` - Added spans with parallel query tracking
- `internal/restapi/trip_details_handler.go` - Added spans with error handling
Dependencies Added
go.opentelemetry.io/otel v1.37.0
go.opentelemetry.io/otel/trace v1.37.0
go.opentelemetry.io/otel/sdk v1.37.0
go.opentelemetry.io/otel/exporters/stdout/stdouttrace v1.37.0
Architecture
Request Flow:
1. Request arrives → TracingMiddleware creates root span
2. Context with span passed to handler
3. Handler creates child span with custom attributes
4. Database queries create child spans (parallel execution visible)
5. Response built and returned
6. Spans closed with duration/status recorded
7. Trace exported (stdout/Jaeger/DataDog)
Example Trace Output
📊 HTTP Request: /api/where/stops-for-route/25_100238 (145ms)
├─ stopsForRouteHandler (142ms)
│ ├─ GetAgency (2ms)
│ ├─ GetActiveServiceIDsForDate (5ms)
│ ├─ buildStopsList (118ms)
│ │ ├─ GetStopsByIDs (58ms) ← parallel
│ │ └─ GetRouteIDsForStops (60ms) ← parallel
│ └─ response formatting (8ms)
└─ compression (3ms)
Attributes:
- route.id: "100238"
- route.agency_id: "25"
- stop_count: 42
- http.status_code: 200
✅ Benefits
Performance Insights
- Identify bottlenecks: See exact timing for each operation
- Validate optimizations: Visual proof that parallel queries execute concurrently (PR #469, "feat: parallelize independent DB queries in handler functions (#468)")
- Database query analysis: Track slow queries and N+1 problems
- Handler comparison: Compare performance across different endpoints
Operational Excellence
- Error tracking: Automatic capture of errors with context
- Production debugging: Trace requests from frontend to database
- Distributed tracing: Track requests across multiple services (future microservices)
- SLA monitoring: Measure and track API response times
Developer Experience
- Local debugging: Pretty-printed traces in development
- Visual representation: Use Jaeger UI to explore traces
- Context preservation: Spans maintain parent-child relationships
- Zero overhead: Minimal performance impact when tracing is disabled
🔧 Configuration
Development Mode (Current)
- Exporter: `stdout` (pretty-printed JSON)
- Sampling: 100% (all traces captured)
- Output: Console
Production Mode (Future)
// Replace stdout exporter in tracing.go with:
import "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
exporter, err := otlptracehttp.New(ctx,
otlptracehttp.WithEndpoint("jaeger:4318"),
otlptracehttp.WithInsecure(),
)
Environment Variables (Future Enhancement)
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318
OTEL_SERVICE_NAME=maglev
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1  # 10% sampling in production
🧪 Testing
Manual Testing
# Start the server
make run
# Make a request
curl "http://localhost:4000/api/where/stops-for-route/25_100238?key=TEST"
# Check console for trace output (pretty-printed JSON)
Integration Tests
go test -tags sqlite_fts5 -run TestTracingMiddleware ./internal/restapi/
With Jaeger (Optional)
# Start Jaeger
docker run -d --name jaeger \
-p 16686:16686 \
-p 4318:4318 \
jaegertracing/all-in-one:latest
# Update exporter in tracing.go to OTLP
# Restart server
# View traces at http://localhost:16686
📈 Success Metrics
- ✅ All HTTP requests automatically traced
- ✅ Handler-level spans for critical endpoints
- ✅ Parallel query execution visible in traces
- ✅ Error spans properly marked with status codes
- ✅ Zero test failures
- ✅ Graceful degradation if tracing initialization fails
📝 Notes
Design Decisions
- Middleware-first approach: Automatic instrumentation for all endpoints without code changes
- Manual handler spans: Explicit spans for detailed visibility into critical paths
- Stdout exporter: Easy debugging in development, replaceable in production
- 100% sampling: Capture everything in development, tune for production
- Graceful degradation: Server continues if tracing fails to initialize
Future Improvements
- Sampling strategy: Implement adaptive sampling for production (reduce overhead)
- Database instrumentation: Use `otelsql` wrapper for automatic query tracing
- Metrics correlation: Link spans to Prometheus metrics
- Custom exporters: Support for multiple backends (Jaeger, DataDog, New Relic)
- Custom exporters: Support for multiple backends (Jaeger, DataDog, New Relic)
- Baggage propagation: Carry request-scoped data through trace context
- Span events: Add timeline events for significant points within spans
🎓 References
- OpenTelemetry Go Documentation
- OpenTelemetry Semantic Conventions
- Jaeger Documentation
- W3C Trace Context
🏷️ Labels
enhancement, observability, performance, developer-experience, priority: high
Implementation Status: ✅ COMPLETED
All core functionality has been implemented and tested. The tracing system is ready for use in development and can be extended to production with configuration changes.