
# Add OpenTelemetry Distributed Tracing for Observability #484

@sreenivasivbieb


📋 Summary

Implement OpenTelemetry distributed tracing to provide comprehensive observability for the Maglev API server. This enables detailed performance monitoring, request flow visualization, and error tracking across all endpoints.

🎯 Problem Statement

Currently, the Maglev API lacks detailed observability into how requests are handled. Without distributed tracing, debugging performance issues and understanding system behavior in production require extensive logging and manual correlation.

💡 Proposed Solution

Add OpenTelemetry instrumentation to provide:

  1. Automatic HTTP request tracing via middleware
  2. Handler-level spans for critical endpoints
  3. Database query tracing for performance analysis
  4. Parallel query visualization to validate optimizations
  5. Error tracking with automatic span status updates
  6. Context propagation for distributed tracing support

🏗️ Implementation Plan

Phase 1: Core Infrastructure ✅

  • Add OpenTelemetry dependencies to go.mod
  • Create internal/restapi/tracing.go for initialization
  • Create internal/restapi/tracing_middleware.go for HTTP instrumentation
  • Integrate tracer lifecycle in cmd/api/app.go
  • Apply middleware to all routes in internal/restapi/routes.go

Phase 2: Handler Instrumentation ✅

  • Instrument arrivalAndDepartureForStopHandler
  • Instrument stopsForRouteHandler
  • Instrument tripDetailsHandler
  • Add child spans for parallel database queries
  • Add error recording with proper status codes
  • Add custom attributes (stop IDs, route IDs, batch sizes)

Phase 3: Testing ✅

  • Create integration tests in tracing_middleware_test.go
  • Verify middleware doesn't break existing functionality
  • Test error handling in traces

Phase 4: Future Enhancements (Optional)

  • Add database-level instrumentation with otelsql
  • Instrument GTFS real-time feed fetching
  • Instrument GTFS static data loading
  • Replace stdout exporter with OTLP for production
  • Add remaining handler functions
  • Configure sampling strategy for production

📊 Technical Details

Files Modified

New Files

  • internal/restapi/tracing.go - Tracer initialization and configuration
  • internal/restapi/tracing_middleware.go - HTTP request auto-instrumentation
  • internal/restapi/tracing_middleware_test.go - Integration tests

Modified Files

  • go.mod - Added OpenTelemetry dependencies
  • cmd/api/app.go - Tracer lifecycle management in Run() function
  • internal/restapi/routes.go - Applied tracing middleware to handler chain
  • internal/restapi/arrival_and_departure_for_stop_handler.go - Added spans
  • internal/restapi/stops_for_route_handler.go - Added spans with parallel query tracking
  • internal/restapi/trip_details_handler.go - Added spans with error handling

Dependencies Added

```
go.opentelemetry.io/otel v1.37.0
go.opentelemetry.io/otel/trace v1.37.0
go.opentelemetry.io/otel/sdk v1.37.0
go.opentelemetry.io/otel/exporters/stdout/stdouttrace v1.37.0
```

Architecture

Request Flow:
1. Request arrives → TracingMiddleware creates root span
2. Context with span passed to handler
3. Handler creates child span with custom attributes
4. Database queries create child spans (parallel execution visible)
5. Response built and returned
6. Spans closed with duration/status recorded
7. Trace exported (stdout/Jaeger/DataDog)

Example Trace Output

```
📊 HTTP Request: /api/where/stops-for-route/25_100238 (145ms)
  ├─ stopsForRouteHandler (142ms)
  │   ├─ GetAgency (2ms)
  │   ├─ GetActiveServiceIDsForDate (5ms)
  │   ├─ buildStopsList (118ms)
  │   │   ├─ GetStopsByIDs (58ms) ← parallel
  │   │   └─ GetRouteIDsForStops (60ms) ← parallel
  │   └─ response formatting (8ms)
  └─ compression (3ms)

Attributes:
  - route.id: "100238"
  - route.agency_id: "25"
  - stop_count: 42
  - http.status_code: 200
```

✅ Benefits

Performance Insights

Operational Excellence

  • Error tracking: Automatic capture of errors with context
  • Production debugging: Trace requests from frontend to database
  • Distributed tracing: Track requests across multiple services (future microservices)
  • SLA monitoring: Measure and track API response times

Developer Experience

  • Local debugging: Pretty-printed traces in development
  • Visual representation: Use Jaeger UI to explore traces
  • Context preservation: Spans maintain parent-child relationships
  • Low overhead: Minimal performance impact when tracing is disabled

🔧 Configuration

Development Mode (Current)

  • Exporter: stdout (pretty-printed JSON)
  • Sampling: 100% (all traces captured)
  • Output: Console

Production Mode (Future)

```go
// Replace the stdout exporter in tracing.go with:
import "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"

exporter, err := otlptracehttp.New(ctx,
    otlptracehttp.WithEndpoint("jaeger:4318"),
    otlptracehttp.WithInsecure(),
)
```

Environment Variables (Future Enhancement)

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318
OTEL_SERVICE_NAME=maglev
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1  # 10% sampling in production
```

🧪 Testing

Manual Testing

```bash
# Start the server
make run

# Make a request
curl "http://localhost:4000/api/where/stops-for-route/25_100238?key=TEST"

# Check console for trace output (pretty-printed JSON)
```

Integration Tests

```bash
go test -tags sqlite_fts5 -run TestTracingMiddleware ./internal/restapi/
```

With Jaeger (Optional)

```bash
# Start Jaeger
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

# Update exporter in tracing.go to OTLP
# Restart server
# View traces at http://localhost:16686
```

📈 Success Metrics

  • ✅ All HTTP requests automatically traced
  • ✅ Handler-level spans for critical endpoints
  • ✅ Parallel query execution visible in traces
  • ✅ Error spans properly marked with status codes
  • ✅ Zero test failures
  • ✅ Graceful degradation if tracing initialization fails

📝 Notes

Design Decisions

  1. Middleware-first approach: Automatic instrumentation for all endpoints without code changes
  2. Manual handler spans: Explicit spans for detailed visibility into critical paths
  3. Stdout exporter: Easy debugging in development, replaceable in production
  4. 100% sampling: Capture everything in development, tune for production
  5. Graceful degradation: Server continues if tracing fails to initialize

Future Improvements

  1. Sampling strategy: Implement adaptive sampling for production (reduce overhead)
  2. Database instrumentation: Use otelsql wrapper for automatic query tracing
  3. Metrics correlation: Link spans to Prometheus metrics
  4. Custom exporters: Support for multiple backends (Jaeger, DataDog, New Relic)
  5. Baggage propagation: Carry request-scoped data through trace context
  6. Span events: Add timeline events for significant points within spans


🏷️ Labels

  • enhancement
  • observability
  • performance
  • developer-experience
  • priority: high

Implementation Status: ✅ COMPLETED

All core functionality has been implemented and tested. The tracing system is ready for use in development and can be extended to production with configuration changes.
