## 📋 Task Description
Implement comprehensive observability for StarGate including structured logging, Prometheus metrics integration, Grafana dashboards, and distributed tracing with OpenTelemetry. Enable production-grade monitoring, alerting, and troubleshooting capabilities.
## 🎯 Objectives
- Implement structured logging with Serilog
- Add correlation IDs for request tracking
- Integrate Prometheus metrics
- Create custom application metrics
- Set up Prometheus in Kubernetes
- Create Grafana dashboards
- Implement distributed tracing with OpenTelemetry
- Configure trace exporters (Jaeger/Zipkin)
- Add performance monitoring
- Create alerting rules
- Document observability practices
- Test monitoring stack locally
## 🎯 TECHNICAL-ANALYSIS.md Coverage
This issue covers Phase 11 - Sprint 11.2: Observability:
✅ Add structured logging - Serilog with JSON formatting and enrichers
✅ Integrate with Prometheus - Metrics collection and exposition
✅ Create Grafana dashboards - Visualization and monitoring
✅ Setup distributed tracing - OpenTelemetry with Jaeger/Zipkin
## 📦 Deliverables
1. Add Structured Logging Dependencies
Update src/StarGate.Server/StarGate.Server.csproj:
<ItemGroup>
<!-- Structured Logging -->
<PackageReference Include="Serilog.AspNetCore" Version="8.0.1" />
<PackageReference Include="Serilog.Enrichers.Environment" Version="3.0.0" />
<PackageReference Include="Serilog.Enrichers.Process" Version="3.0.0" />
<PackageReference Include="Serilog.Enrichers.Thread" Version="4.0.0" />
<PackageReference Include="Serilog.Sinks.Console" Version="5.0.1" />
<PackageReference Include="Serilog.Sinks.File" Version="5.0.0" />
<!-- Metrics -->
<PackageReference Include="prometheus-net.AspNetCore" Version="8.2.1" />
<!-- Distributed Tracing -->
<PackageReference Include="OpenTelemetry.Exporter.Console" Version="1.7.0" />
<PackageReference Include="OpenTelemetry.Exporter.Jaeger" Version="1.5.1" />
<PackageReference Include="OpenTelemetry.Extensions.Hosting" Version="1.7.0" />
<PackageReference Include="OpenTelemetry.Instrumentation.AspNetCore" Version="1.7.1" />
<PackageReference Include="OpenTelemetry.Instrumentation.Http" Version="1.7.1" />
</ItemGroup>
2. Configure Structured Logging
Update src/StarGate.Server/Program.cs:
using Serilog;
using Serilog.Events;
// Configure Serilog
Log.Logger = new LoggerConfiguration()
.MinimumLevel.Information()
.MinimumLevel.Override("Microsoft.AspNetCore", LogEventLevel.Warning)
.Enrich.FromLogContext()
.Enrich.WithMachineName()
.Enrich.WithProcessId()
.Enrich.WithThreadId()
.Enrich.WithProperty("Application", "StarGate")
.WriteTo.Console(new Serilog.Formatting.Compact.CompactJsonFormatter())
.WriteTo.File(
new Serilog.Formatting.Compact.CompactJsonFormatter(),
"logs/stargate-.log",
rollingInterval: RollingInterval.Day,
retainedFileCountLimit: 7)
.CreateLogger();
try
{
Log.Information("Starting StarGate Server");
var builder = WebApplication.CreateBuilder(args);
// Use Serilog for logging
builder.Host.UseSerilog();
// ... rest of configuration
var app = builder.Build();
// Add Serilog request logging
app.UseSerilogRequestLogging(options =>
{
options.MessageTemplate =
"HTTP {RequestMethod} {RequestPath} responded {StatusCode} in {Elapsed:0.0000} ms";
options.EnrichDiagnosticContext = (diagnosticContext, httpContext) =>
{
diagnosticContext.Set("RequestHost", httpContext.Request.Host.Value);
diagnosticContext.Set("UserAgent", httpContext.Request.Headers["User-Agent"].ToString());
diagnosticContext.Set("ClientIP", httpContext.Connection.RemoteIpAddress?.ToString());
};
});
app.Run();
}
catch (Exception ex)
{
Log.Fatal(ex, "Application start-up failed");
}
finally
{
Log.CloseAndFlush();
}
3. Add Correlation ID Middleware
Create src/StarGate.Server/Middleware/CorrelationIdMiddleware.cs:
namespace StarGate.Server.Middleware;
using Microsoft.Extensions.Primitives;
using Serilog.Context;
public class CorrelationIdMiddleware
{
private const string CorrelationIdHeader = "X-Correlation-ID";
private readonly RequestDelegate _next;
public CorrelationIdMiddleware(RequestDelegate next)
{
_next = next;
}
public async Task InvokeAsync(HttpContext context)
{
var correlationId = GetOrCreateCorrelationId(context);
// Add to response headers
context.Response.OnStarting(() =>
{
context.Response.Headers.Append(CorrelationIdHeader, correlationId);
return Task.CompletedTask;
});
// Add to logging context
using (LogContext.PushProperty("CorrelationId", correlationId))
{
await _next(context);
}
}
private static string GetOrCreateCorrelationId(HttpContext context)
{
if (context.Request.Headers.TryGetValue(CorrelationIdHeader, out StringValues correlationId))
{
return correlationId.FirstOrDefault() ?? Guid.NewGuid().ToString();
}
return Guid.NewGuid().ToString();
}
}
// Extension method
public static class CorrelationIdMiddlewareExtensions
{
public static IApplicationBuilder UseCorrelationId(this IApplicationBuilder builder)
{
return builder.UseMiddleware<CorrelationIdMiddleware>();
}
}
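Register in Program.cs, before app.UseSerilogRequestLogging() so that the request-completion log event also carries the CorrelationId property:
app.UseCorrelationId();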
4. Integrate Prometheus Metrics
Update src/StarGate.Server/Program.cs:
using Prometheus;
// Collect HTTP request metrics; call after UseRouting() so route labels are populated
app.UseHttpMetrics();
// Expose the /metrics endpoint through endpoint routing
app.MapMetrics();
// Alternative to MapMetrics(): app.UseMetricServer(); (middleware-based; use one or the other, not both)
Create src/StarGate.Core/Metrics/ApplicationMetrics.cs:
namespace StarGate.Core.Metrics;
using Prometheus;
public static class ApplicationMetrics
{
// Process metrics
public static readonly Counter ProcessesCreated = Metrics.CreateCounter(
"stargate_processes_created_total",
"Total number of processes created",
new CounterConfiguration
{
LabelNames = new[] { "client_id", "process_type" }
});
public static readonly Counter ProcessesCompleted = Metrics.CreateCounter(
"stargate_processes_completed_total",
"Total number of processes completed successfully",
new CounterConfiguration
{
LabelNames = new[] { "client_id", "process_type" }
});
public static readonly Counter ProcessesFailed = Metrics.CreateCounter(
"stargate_processes_failed_total",
"Total number of processes failed",
new CounterConfiguration
{
LabelNames = new[] { "client_id", "process_type", "error_type" }
});
public static readonly Gauge ProcessesInProgress = Metrics.CreateGauge(
"stargate_processes_in_progress",
"Number of processes currently in progress",
new GaugeConfiguration
{
LabelNames = new[] { "process_type" }
});
public static readonly Histogram ProcessDuration = Metrics.CreateHistogram(
"stargate_process_duration_seconds",
"Duration of process execution in seconds",
new HistogramConfiguration
{
LabelNames = new[] { "process_type", "status" },
Buckets = Histogram.ExponentialBuckets(0.1, 2, 10) // 10 buckets: 0.1s, 0.2s, ... up to 51.2s
});
// Retry metrics
public static readonly Counter RetryAttempts = Metrics.CreateCounter(
"stargate_retry_attempts_total",
"Total number of retry attempts",
new CounterConfiguration
{
LabelNames = new[] { "operation", "attempt_number" }
});
// Circuit breaker metrics
public static readonly Gauge CircuitBreakerState = Metrics.CreateGauge(
"stargate_circuit_breaker_state",
"Circuit breaker state (0=Closed, 1=Open, 2=HalfOpen)",
new GaugeConfiguration
{
LabelNames = new[] { "circuit_name" }
});
// Queue metrics
public static readonly Gauge QueueDepth = Metrics.CreateGauge(
"stargate_queue_depth",
"Number of messages in queue",
new GaugeConfiguration
{
LabelNames = new[] { "queue_name" }
});
}
Instrument ProcessService:
public async Task<Process> CreateProcessAsync(CreateProcessCommand command)
{
// Increment counter
ApplicationMetrics.ProcessesCreated
.WithLabels(command.ClientId, command.ProcessType)
.Inc();
// Track in progress
ApplicationMetrics.ProcessesInProgress
.WithLabels(command.ProcessType)
.Inc();
try
{
var stopwatch = System.Diagnostics.Stopwatch.StartNew();
var process = await CreateProcessInternalAsync(command);
stopwatch.Stop();
// Record duration
ApplicationMetrics.ProcessDuration
.WithLabels(command.ProcessType, "created")
.Observe(stopwatch.Elapsed.TotalSeconds);
return process;
}
finally
{
ApplicationMetrics.ProcessesInProgress
.WithLabels(command.ProcessType)
.Dec();
}
}
5. Set Up OpenTelemetry Tracing
Update src/StarGate.Server/Program.cs:
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;
builder.Services.AddOpenTelemetry()
.WithTracing(tracerProviderBuilder =>
{
tracerProviderBuilder
.SetResourceBuilder(ResourceBuilder.CreateDefault()
.AddService("StarGate", serviceVersion: "1.0.0"))
.AddAspNetCoreInstrumentation(options =>
{
options.RecordException = true;
options.EnrichWithHttpRequest = (activity, httpRequest) =>
{
// Tag values should be primitives or strings, so convert the IPAddress explicitly
activity.SetTag("client.address", httpRequest.HttpContext.Connection.RemoteIpAddress?.ToString());
};
options.EnrichWithHttpResponse = (activity, httpResponse) =>
{
activity.SetTag("http.response.status_code", httpResponse.StatusCode);
};
})
.AddHttpClientInstrumentation()
.AddSource("StarGate.*")
.AddConsoleExporter()
.AddJaegerExporter(options =>
{
options.AgentHost = builder.Configuration["Jaeger:AgentHost"] ?? "localhost";
options.AgentPort = int.Parse(builder.Configuration["Jaeger:AgentPort"] ?? "6831");
});
});
Create ActivitySource:
namespace StarGate.Core.Tracing;
using System.Diagnostics;
public static class StarGateActivitySource
{
public static readonly ActivitySource Source = new ActivitySource("StarGate.Core");
}
Instrument methods:
public async Task<Process> HandleProcessAsync(Guid processId)
{
using var activity = StarGateActivitySource.Source.StartActivity("HandleProcess");
activity?.SetTag("process.id", processId);
// process.type is not available from the method arguments; tag it once the process has been loaded
try
{
var result = await ExecuteHandlerAsync(processId);
activity?.SetTag("process.status", "completed");
return result;
}
catch (Exception ex)
{
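activity?.SetTag("process.status", "failed");
activity?.SetTag("error", true);
// ActivityStatusCode lives in System.Diagnostics; RecordException is an extension from OpenTelemetry.Trace
activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
activity?.RecordException(ex);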
throw;
}
}
6. Create Prometheus Deployment for Kubernetes
Create k8s/monitoring/prometheus-config.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: stargate
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'stargate'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - stargate
        relabel_configs:
          # Keep only pods annotated with prometheus.io/scrape: "true"
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          # Rewrite the target to <pod address>:<annotated port>.
          # __address__ must be a source label, because the regex matches "address;port".
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          # Use the annotated metrics path when one is set
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: stargate
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: data
              mountPath: /prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-config
        - name: data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: stargate
spec:
  ports:
    - port: 9090
      targetPort: 9090
  selector:
    app: prometheus
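The scrape job above only keeps pods that carry the `prometheus.io/scrape`, `prometheus.io/port`, and (optionally) `prometheus.io/path` annotations, and nothing in this issue sets them yet. One possible way to add them is a patch on the application's pod template. The sketch below only builds the patch document; the deployment name `stargate-server` and port `8080` used in the usage note are assumptions taken from the commands elsewhere in this issue, and the helper name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a strategic-merge patch that adds the scrape
# annotations read by the relabel rules to a Deployment's pod template.
scrape_annotations_patch() {
  local port="$1"            # port serving the metrics endpoint
  local path="${2:-/metrics}" # metrics path, defaults to /metrics
  printf '{"spec":{"template":{"metadata":{"annotations":{"prometheus.io/scrape":"true","prometheus.io/port":"%s","prometheus.io/path":"%s"}}}}}\n' \
    "$port" "$path"
}
```

It could then be applied with something like `kubectl patch deployment stargate-server -n stargate --patch "$(scrape_annotations_patch 8080)"`. Declaring the same annotations directly in the application's Deployment manifest is probably the better long-term home for them.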
7. Create Grafana Deployment
Create k8s/monitoring/grafana-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: stargate
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:latest
          ports:
            - containerPort: 3000
          env:
            - name: GF_SECURITY_ADMIN_USER
              value: admin
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: admin
          volumeMounts:
            - name: grafana-storage
              mountPath: /var/lib/grafana
      volumes:
        - name: grafana-storage
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: stargate
spec:
  type: LoadBalancer
  ports:
    - port: 3000
      targetPort: 3000
  selector:
    app: grafana
8. Create Grafana Dashboards
Create k8s/monitoring/grafana-dashboards.yaml (ConfigMap) with pre-configured dashboards:
Dashboard: StarGate Overview
- Total processes (counter)
- Success rate (percentage)
- Active processes (gauge)
- Request rate (graph)
- Error rate (graph)
- Process duration (histogram)
Dashboard: Process Details
- Processes by type (pie chart)
- Processes by client (bar chart)
- Retry attempts (graph)
- Circuit breaker states (gauge)
Dashboard: Infrastructure
- CPU usage
- Memory usage
- Pod count
- Request latency (p50, p95, p99)
9. Create Jaeger Deployment
Create k8s/monitoring/jaeger-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: stargate
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          ports:
            - containerPort: 5775
              protocol: UDP
            - containerPort: 6831
              protocol: UDP
            - containerPort: 6832
              protocol: UDP
            - containerPort: 5778
            - containerPort: 16686
            - containerPort: 14268
          env:
            - name: COLLECTOR_ZIPKIN_HTTP_PORT
              value: "9411"
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: stargate
spec:
  type: LoadBalancer
  ports:
    - port: 5775
      targetPort: 5775
      protocol: UDP
      name: zipkin-compact
    - port: 6831
      targetPort: 6831
      protocol: UDP
      name: jaeger-compact
    - port: 6832
      targetPort: 6832
      protocol: UDP
      name: jaeger-binary
    - port: 5778
      targetPort: 5778
      name: config
    - port: 16686
      targetPort: 16686
      name: ui
    - port: 14268
      targetPort: 14268
      name: collector
  selector:
    app: jaeger
10. Create Monitoring Deployment Script
Create scripts/deploy-monitoring.sh:
#!/bin/bash
set -euo pipefail
echo "Deploying monitoring stack..."
# "kubectl wait ... pod -l ..." can fail with "no matching resources" when it runs
# before the pod object exists, so wait on each Deployment rollout instead.
# Deploy Prometheus
echo "Deploying Prometheus..."
kubectl apply -f k8s/monitoring/prometheus-config.yaml
kubectl rollout status deployment/prometheus -n stargate --timeout=60s
# Deploy Grafana
echo "Deploying Grafana..."
kubectl apply -f k8s/monitoring/grafana-deployment.yaml
kubectl rollout status deployment/grafana -n stargate --timeout=60s
# Deploy Jaeger
echo "Deploying Jaeger..."
kubectl apply -f k8s/monitoring/jaeger-deployment.yaml
kubectl rollout status deployment/jaeger -n stargate --timeout=60s
echo ""
echo "Monitoring stack deployed!"
echo ""
echo "Access URLs:"
echo " Prometheus: kubectl port-forward svc/prometheus 9090:9090 -n stargate"
echo " Grafana: kubectl port-forward svc/grafana 3000:3000 -n stargate"
echo " Jaeger UI: kubectl port-forward svc/jaeger 16686:16686 -n stargate"
11. Create Documentation
Create docs/OBSERVABILITY.md:
# Observability Guide - StarGate
## Overview
StarGate implements comprehensive observability through:
- **Structured Logging**: JSON logs with correlation IDs
- **Metrics**: Prometheus metrics for monitoring
- **Tracing**: Distributed tracing with OpenTelemetry/Jaeger
- **Dashboards**: Grafana dashboards for visualization
## Logging
### Structured Logs
All logs are written as compact JSON, one event per line (pretty-printed here for readability). The level field `@l` is omitted for Information events and present for every other level:
```json
{
  "@t": "2026-02-18T14:30:00.123Z",
  "@mt": "Process {ProcessId} completed successfully",
  "ProcessId": "123e4567-e89b-12d3-a456-426614174000",
  "CorrelationId": "abc-123-def",
  "SourceContext": "StarGate.Core.Services.ProcessService"
}
```
### Correlation IDs
Every request gets a correlation ID:
- Automatically generated or taken from the `X-Correlation-ID` header
- Propagated through the entire request chain
- Included in all log entries
- Returned in response headers
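The last bullet is easy to check from the command line. The helper below is a minimal sketch (the function name is hypothetical) that reads raw HTTP response headers on stdin and prints the echoed correlation ID. Header names are matched case-insensitively because servers may lowercase them.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the X-Correlation-ID value found in raw HTTP
# response headers read from stdin (prints nothing if the header is absent).
correlation_id_from_headers() {
  awk -F': *' '
    tolower($1) == "x-correlation-id" {
      sub(/\r$/, "", $2)   # header lines end in CRLF; drop the CR
      print $2
      exit
    }
  '
}
```

Assuming the API is reachable on `localhost:8080` as in the testing instructions, `curl -s -D - -o /dev/null -H 'X-Correlation-ID: demo-42' http://localhost:8080/api/processes | correlation_id_from_headers` should print `demo-42` if the middleware is registered.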
### Log Levels
- Debug: Detailed diagnostic information
- Information: General application flow
- Warning: Abnormal but handled situations
- Error: Error events that still allow app to continue
- Fatal: Critical errors causing shutdown
## Metrics
### Application Metrics
Process Metrics:
- `stargate_processes_created_total` - Total processes created
- `stargate_processes_completed_total` - Successful completions
- `stargate_processes_failed_total` - Failed processes
- `stargate_processes_in_progress` - Currently active
- `stargate_process_duration_seconds` - Execution duration
Resilience Metrics:
- `stargate_retry_attempts_total` - Retry attempts
- `stargate_circuit_breaker_state` - Circuit breaker states
Infrastructure Metrics:
- `stargate_queue_depth` - Message queue depth
- HTTP metrics (via prometheus-net.AspNetCore)
- .NET runtime metrics
### Querying Metrics
Process Success Rate:
rate(stargate_processes_completed_total[5m]) /
rate(stargate_processes_created_total[5m])
P95 Process Duration:
histogram_quantile(0.95,
  rate(stargate_process_duration_seconds_bucket[5m])
)
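Before Prometheus is scraping the service, a rough version of the success-rate figure can be computed from a single `/metrics` scrape. The sketch below is an assumption-laden smoke test, not a replacement for the PromQL query: it sums every labelled series of a counter and reports the lifetime ratio (completed divided by created) rather than a 5-minute rate. It also assumes each sample line ends with the value (no trailing timestamp). Function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sum all series of one counter from Prometheus text exposition format on stdin.
sum_counter() {
  local name="$1"
  awk -v name="$name" '
    /^#/ { next }                        # skip HELP and TYPE lines
    {
      metric = $1
      sub(/\{.*/, "", metric)            # strip the label set, if any
      if (metric == name) total += $NF   # the last field is the sample value
    }
    END { printf "%d\n", total }
  '
}

# Print completed/created as a two-decimal ratio for a saved scrape file.
success_ratio() {
  local file="$1" created completed
  created=$(sum_counter stargate_processes_created_total < "$file")
  completed=$(sum_counter stargate_processes_completed_total < "$file")
  if [ "$created" -eq 0 ]; then
    echo "n/a"                           # nothing created yet, ratio undefined
    return
  fi
  awk -v c="$completed" -v t="$created" 'BEGIN { printf "%.2f\n", c / t }'
}
```

For example, after `curl -s http://localhost:8080/metrics > scrape.txt` (URL taken from the testing instructions), `success_ratio scrape.txt` prints the ratio for everything created since the process started.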
## Distributed Tracing
### Trace Structure
HTTP Request
│
├── CreateProcess (API)
│ ├── ValidateRequest
│ ├── PublishToQueue
│ └── SaveToDatabase
│
└── ProcessMessage (Worker)
├── LoadProcess
├── ExecuteHandler
│ ├── BusinessLogic
│ └── ExternalAPI (if any)
└── UpdateStatus
### Trace Tags
- `process.id` - Process identifier
- `process.type` - Type of process
- `client.id` - Client identifier
- `error` - Boolean, set to true if an error occurred
- `http.response.status_code` - HTTP status (the tag name set in the tracing configuration)
## Dashboards
### StarGate Overview Dashboard
Key Metrics:
- Total Processes (24h)
- Success Rate (%)
- Active Processes
- Request Rate (req/s)
- Error Rate (%)
- P95 Latency (ms)
Graphs:
- Processes Over Time
- Success vs Failure Rate
- Process Duration Distribution
- Top Clients by Volume
### Process Details Dashboard
Panels:
- Processes by Type (Pie)
- Processes by Client (Bar)
- Retry Attempts (Time Series)
- Circuit Breaker States (Gauge)
- Queue Depth (Graph)
### Infrastructure Dashboard
Panels:
- CPU Usage (%)
- Memory Usage (MB)
- Pod Count
- Request Latency (p50, p95, p99)
- HTTP Status Codes
- Database Connection Pool
## Alerting
### Alert Rules
High Error Rate:
- alert: HighErrorRate
  expr: rate(stargate_processes_failed_total[5m]) > 0.1
  for: 5m
  annotations:
    summary: "High error rate detected"
Circuit Breaker Open:
- alert: CircuitBreakerOpen
  expr: stargate_circuit_breaker_state == 1
  for: 2m
  annotations:
    summary: "Circuit breaker is open"
High Queue Depth:
- alert: HighQueueDepth
  expr: stargate_queue_depth > 1000
  for: 10m
  annotations:
    summary: "Queue depth exceeds threshold"
## Accessing Monitoring Tools
### Local Development
# Port forward Prometheus
kubectl port-forward svc/prometheus 9090:9090 -n stargate
open http://localhost:9090
# Port forward Grafana
kubectl port-forward svc/grafana 3000:3000 -n stargate
open http://localhost:3000
# Default: admin/admin
# Port forward Jaeger
kubectl port-forward svc/jaeger 16686:16686 -n stargate
open http://localhost:16686
### Production
Use Ingress for external access with proper authentication.
## Troubleshooting
### Find Logs for Request
# Using correlation ID
kubectl logs -n stargate deployment/stargate-server | \
grep "correlation-id-here"
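A plain `grep` on the ID also matches longer IDs that share the same prefix (`test-1` matches `test-10`, `test-100`, and so on). A slightly stricter filter, sketched below under the assumption that logs are emitted by the compact JSON formatter (no whitespace between key and value), matches the whole `CorrelationId` field instead. The function name is hypothetical, and IDs that need JSON escaping are not handled.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep only compact-JSON log lines on stdin whose
# CorrelationId field equals the given ID exactly.
logs_for_correlation_id() {
  local id="$1"
  # -F treats the pattern as a fixed string; the closing quote anchors the end of the ID
  grep -F "\"CorrelationId\":\"${id}\""
}
```

Usage: `kubectl logs -n stargate deployment/stargate-server | logs_for_correlation_id test-1`.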
### Find Slow Requests
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket[5m])
) > 1.0
### Trace Failed Process
- Open Jaeger UI
- Search for service: StarGate
- Filter by tag: `error=true`
- View trace details
## Best Practices
- Always include correlation ID in external calls
- Log at appropriate levels (avoid debug in production)
- Use structured logging (not string interpolation)
- Add custom metrics for business KPIs
- Create traces for critical paths
- Set up alerts for SLOs
- Review dashboards regularly
- Correlate metrics, logs, and traces
## ✅ Acceptance Criteria
- [ ] Serilog integrated with JSON formatting
- [ ] Correlation ID middleware implemented
- [ ] Prometheus metrics integrated
- [ ] Custom application metrics implemented
- [ ] OpenTelemetry tracing configured
- [ ] Jaeger exporter configured
- [ ] ProcessService instrumented with metrics and traces
- [ ] Prometheus deployed to Kubernetes
- [ ] Grafana deployed with datasource configured
- [ ] Jaeger deployed for tracing
- [ ] Grafana dashboards created (Overview, Details, Infrastructure)
- [ ] Monitoring deployment script created
- [ ] All services emit structured logs
- [ ] Metrics endpoint (/metrics) accessible
- [ ] Traces visible in Jaeger UI
- [ ] Dashboards display real data
- [ ] OBSERVABILITY.md documentation complete
- [ ] Code follows CODING-CONVENTIONS.md
## 📝 Testing Instructions
```bash
# Deploy application
./scripts/k8s-deploy.sh minikube
# Deploy monitoring stack
./scripts/deploy-monitoring.sh
# Access Grafana
kubectl port-forward svc/grafana 3000:3000 -n stargate
open http://localhost:3000
# Login: admin/admin
# Add Prometheus datasource: http://prometheus:9090
# Import dashboards from k8s/monitoring/grafana-dashboards.yaml
# Access Prometheus
kubectl port-forward svc/prometheus 9090:9090 -n stargate
open http://localhost:9090
# Query: stargate_processes_created_total
# Access Jaeger
kubectl port-forward svc/jaeger 16686:16686 -n stargate
open http://localhost:16686
# Generate traffic (assumes the StarGate API is reachable on localhost:8080, e.g. through a port-forward)
for i in {1..100}; do
curl -X POST http://localhost:8080/api/processes \
-H "Content-Type: application/json" \
-H "X-Correlation-ID: test-$i" \
-d '{
"clientId": "test-client",
"processType": "order",
"clientProcessId": "test-'$i'",
"metadata": {"orderId": "order-'$i'"}
}'
done
# View logs with correlation ID (match the whole JSON field so "test-1" does not also match "test-10")
kubectl logs -n stargate deployment/stargate-server | grep -F '"CorrelationId":"test-1"'
# Check metrics
curl http://localhost:8080/metrics | grep stargate_processes
# View traces in Jaeger
# 1. Select service: StarGate
# 2. Click "Find Traces"
# 3. Click on a trace to see details
# View Grafana dashboard
# 1. Go to Dashboards
# 2. Select "StarGate Overview"
# 3. Verify metrics are displayed
```
## 📚 References
## 🏷️ Labels
phase-11 production-readiness sprint-11.2 observability logging metrics tracing prometheus grafana
## ⏱️ Estimated Effort
14-18 hours
## 🔗 Dependencies
## 🔗 Related Issues
Part of Phase 11: Production Readiness - Sprint 11.2: Observability