Add OpenTelemetry Distributed Tracing for Observability
📋 Summary
Implement OpenTelemetry distributed tracing to provide comprehensive observability for the Maglev API server. This enables detailed performance monitoring, request flow visualization, and error tracking across all endpoints.
🎯 Problem Statement
Currently, the Maglev API lacks detailed observability into:
- Request execution paths and timing
- Performance bottlenecks in handler functions
- Database query latency
- Parallel query execution visualization (especially for PR #469, "feat: parallelize independent DB queries in handler functions (#468)")
- Error propagation and failure points
- End-to-end request duration breakdown
Without distributed tracing, debugging performance issues and understanding system behavior in production requires extensive logging and manual correlation.
💡 Proposed Solution
Add OpenTelemetry instrumentation to provide:
- Automatic HTTP request tracing via middleware
- Handler-level spans for critical endpoints
- Database query tracing for performance analysis
- Parallel query visualization to validate optimizations
- Error tracking with automatic span status updates
- Context propagation for distributed tracing support
🏗️ Implementation Plan
Phase 1: Core Infrastructure ✅
- Add OpenTelemetry dependencies to `go.mod`
- Create `internal/restapi/tracing.go` for initialization
- Create `internal/restapi/tracing_middleware.go` for HTTP instrumentation
- Integrate tracer lifecycle in `cmd/api/app.go`
- Apply middleware to all routes in `internal/restapi/routes.go`
Phase 2: Handler Instrumentation ✅
- Instrument `arrivalAndDepartureForStopHandler`
- Instrument `stopsForRouteHandler`
- Instrument `tripDetailsHandler`
- Add child spans for parallel database queries
- Add error recording with proper status codes
- Add custom attributes (stop IDs, route IDs, batch sizes)
Phase 3: Testing ✅
- Create integration tests in `tracing_middleware_test.go`
- Verify middleware doesn't break existing functionality
- Test error handling in traces
Phase 4: Future Enhancements (Optional)
- Add database-level instrumentation with `otelsql`
- Instrument GTFS real-time feed fetching
- Instrument GTFS static data loading
- Replace stdout exporter with OTLP for production
- Add remaining handler functions
- Configure sampling strategy for production
📊 Technical Details
Files Modified
New Files
- `internal/restapi/tracing.go` - Tracer initialization and configuration
- `internal/restapi/tracing_middleware.go` - HTTP request auto-instrumentation
- `internal/restapi/tracing_middleware_test.go` - Integration tests
Modified Files
- `go.mod` - Added OpenTelemetry dependencies
- `cmd/api/app.go` - Tracer lifecycle management in `Run()` function
- `internal/restapi/routes.go` - Applied tracing middleware to handler chain
- `internal/restapi/arrival_and_departure_for_stop_handler.go` - Added spans
- `internal/restapi/stops_for_route_handler.go` - Added spans with parallel query tracking
- `internal/restapi/trip_details_handler.go` - Added spans with error handling
Dependencies Added
go.opentelemetry.io/otel v1.37.0
go.opentelemetry.io/otel/trace v1.37.0
go.opentelemetry.io/otel/sdk v1.37.0
go.opentelemetry.io/otel/exporters/stdout/stdouttrace v1.37.0
Architecture
Request Flow:
1. Request arrives → TracingMiddleware creates root span
2. Context with span passed to handler
3. Handler creates child span with custom attributes
4. Database queries create child spans (parallel execution visible)
5. Response built and returned
6. Spans closed with duration/status recorded
7. Trace exported (stdout/Jaeger/DataDog)
Example Trace Output
📊 HTTP Request: /api/where/stops-for-route/25_100238 (145ms)
├─ stopsForRouteHandler (142ms)
│ ├─ GetAgency (2ms)
│ ├─ GetActiveServiceIDsForDate (5ms)
│ ├─ buildStopsList (118ms)
│ │ ├─ GetStopsByIDs (58ms) ← parallel
│ │ └─ GetRouteIDsForStops (60ms) ← parallel
│ └─ response formatting (8ms)
└─ compression (3ms)
Attributes:
- route.id: "100238"
- route.agency_id: "25"
- stop_count: 42
- http.status_code: 200
✅ Benefits
Performance Insights
- Identify bottlenecks: See exact timing for each operation
- Validate optimizations: Visual proof that parallel queries execute concurrently (PR #469, "feat: parallelize independent DB queries in handler functions (#468)")
- Database query analysis: Track slow queries and N+1 problems
- Handler comparison: Compare performance across different endpoints
Operational Excellence
- Error tracking: Automatic capture of errors with context
- Production debugging: Trace requests from frontend to database
- Distributed tracing: Track requests across multiple services (future microservices)
- SLA monitoring: Measure and track API response times
Developer Experience
- Local debugging: Pretty-printed traces in development
- Visual representation: Use Jaeger UI to explore traces
- Context preservation: Spans maintain parent-child relationships
- Zero overhead: Minimal performance impact when tracing is disabled
🔧 Configuration
Development Mode (Current)
- Exporter: `stdout` (pretty-printed JSON)
- Sampling: 100% (all traces captured)
- Output: Console
Production Mode (Future)
// Replace stdout exporter in tracing.go with:
import "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
exporter, err := otlptracehttp.New(ctx,
otlptracehttp.WithEndpoint("jaeger:4318"),
otlptracehttp.WithInsecure(),
)
Environment Variables (Future Enhancement)
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318
OTEL_SERVICE_NAME=maglev
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1  # 10% sampling in production
🧪 Testing
Manual Testing
# Start the server
make run
# Make a request
curl "http://localhost:4000/api/where/stops-for-route/25_100238?key=TEST"
# Check console for trace output (pretty-printed JSON)
Integration Tests
go test -tags sqlite_fts5 -run TestTracingMiddleware ./internal/restapi/
With Jaeger (Optional)
# Start Jaeger
docker run -d --name jaeger \
-p 16686:16686 \
-p 4318:4318 \
jaegertracing/all-in-one:latest
# Update exporter in tracing.go to OTLP
# Restart server
# View traces at http://localhost:16686
📈 Success Metrics
- ✅ All HTTP requests automatically traced
- ✅ Handler-level spans for critical endpoints
- ✅ Parallel query execution visible in traces
- ✅ Error spans properly marked with status codes
- ✅ Zero test failures
- ✅ Graceful degradation if tracing initialization fails
📝 Notes
Design Decisions
- Middleware-first approach: Automatic instrumentation for all endpoints without code changes
- Manual handler spans: Explicit spans for detailed visibility into critical paths
- Stdout exporter: Easy debugging in development, replaceable in production
- 100% sampling: Capture everything in development, tune for production
- Graceful degradation: Server continues if tracing fails to initialize
Future Improvements
- Sampling strategy: Implement adaptive sampling for production (reduce overhead)
- Database instrumentation: Use `otelsql` wrapper for automatic query tracing
- Metrics correlation: Link spans to Prometheus metrics
- Custom exporters: Support for multiple backends (Jaeger, DataDog, New Relic)
- Custom exporters: Support for multiple backends (Jaeger, DataDog, New Relic)
- Baggage propagation: Carry request-scoped data through trace context
- Span events: Add timeline events for significant points within spans
🎓 References
- OpenTelemetry Go Documentation
- OpenTelemetry Semantic Conventions
- Jaeger Documentation
- W3C Trace Context
🏷️ Labels
enhancement, observability, performance, developer-experience, priority: high
Implementation Status: ✅ COMPLETED
All core functionality has been implemented and tested. The tracing system is ready for use in development and can be extended to production with configuration changes.