The Code Executor includes comprehensive metrics and monitoring capabilities:

- Real-time Status Endpoint (`/status`) - JSON-formatted metrics
- Prometheus Metrics (`/metrics`) - Standard Prometheus format for scraping
- Performance Tracking - Execution times, queue wait times, percentiles
- Job Metrics - Submissions, completions, success rates by language
- System Health - Redis connectivity, memory usage, error rates
### `GET /status`

Real-time metrics in JSON format with a human-readable summary.

Response:

```json
{
  "timestamp": "2026-01-31T08:30:45.123Z",
  "uptime": {
    "seconds": 3600,
    "human": "1h"
  },
  "jobs": {
    "submitted": 150,
    "completed": 145,
    "accepted": 140,
    "failed": 3,
    "timeout": 2,
    "compile_error": 0,
    "success_rate": "96.55%",
    "jobs_per_second": "0.04",
    "by_language": {
      "python": {"submitted": 100, "completed": 97},
      "c": {"submitted": 50, "completed": 48}
    }
  },
  "execution": {
    "total_time_ms": 45000,
    "count": 145,
    "average_ms": 310,
    "min_ms": 50,
    "max_ms": 1800,
    "p50_ms": 250,
    "p95_ms": 800,
    "p99_ms": 1500,
    "by_language": {
      "python": {
        "total_ms": 32000,
        "count": 97,
        "average_ms": 330,
        "min_ms": 60,
        "max_ms": 1800
      },
      "c": {
        "total_ms": 13000,
        "count": 48,
        "average_ms": 270,
        "min_ms": 50,
        "max_ms": 900
      }
    }
  },
  "queue": {
    "current_size": 2,
    "max_size": 15,
    "total_dequeued": 145,
    "average_wait_time_ms": 120,
    "p50_wait_ms": 100,
    "p95_wait_ms": 300,
    "p99_wait_ms": 500
  },
  "workers": {
    "error_count": 2,
    "last_error": "2026-01-31T08:25:15.000Z"
  },
  "system": {
    "redis_connected": true,
    "memory_mb": 125,
    "error_rate": "1.33%",
    "total_requests": 150
  }
}
```

### `GET /metrics`

Prometheus-format metrics for scraping with Prometheus, Grafana, etc.
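The `p50_ms`, `p95_ms`, and `p99_ms` fields in the `execution` block can be derived from recorded execution-time samples. A minimal nearest-rank sketch — the service's actual percentile implementation is not shown in this document, so treat the function below as illustrative:

```javascript
// Nearest-rank percentile over recorded execution times (ms).
// Illustrative only; the real /status implementation may use
// interpolation or a streaming estimator instead.
function percentile(samples, p) {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Hypothetical sample set for demonstration
const times = [50, 120, 250, 250, 310, 400, 800, 1500, 1800];
const summary = {
  p50_ms: percentile(times, 50),
  p95_ms: percentile(times, 95),
  p99_ms: percentile(times, 99),
};
```

Nearest-rank is the simplest choice; it always returns an observed sample rather than an interpolated value, which is usually fine for dashboard summaries.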
Example output:

```text
# HELP code_executor_jobs_submitted Total jobs submitted
# TYPE code_executor_jobs_submitted counter
code_executor_jobs_submitted 150 1675255445123

# HELP code_executor_jobs_completed Total jobs completed
# TYPE code_executor_jobs_completed counter
code_executor_jobs_completed 145 1675255445123

# HELP code_executor_execution_time_ms Execution time in milliseconds
# TYPE code_executor_execution_time_ms gauge
code_executor_execution_time_ms{quantile="avg"} 310 1675255445123
code_executor_execution_time_ms{quantile="p95"} 800 1675255445123
code_executor_execution_time_ms{quantile="p99"} 1500 1675255445123

# ... many more metrics
```
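Each counter in the output above follows the same three-line exposition pattern (`# HELP`, `# TYPE`, then `name value timestamp`). A sketch of how such a line group could be serialized — the function name and structure are illustrative, not the service's actual serializer:

```javascript
// Serialize one counter into Prometheus exposition format.
// Illustrative sketch; not the service's real /metrics code.
function renderCounter(name, help, value, timestampMs) {
  return [
    `# HELP ${name} ${help}`,
    `# TYPE ${name} counter`,
    `${name} ${value} ${timestampMs}`,
  ].join('\n');
}

const body = renderCounter(
  'code_executor_jobs_submitted',
  'Total jobs submitted',
  150,
  1675255445123
);
```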
### Monitoring Stack

`docker-compose.monitoring.yml`:

```yaml
version: '3.8'

services:
  code-executor:
    build: .
    ports:
      - "4000:4000"
    environment:
      - WORKERS=2
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
    networks:
      - monitoring

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge
```

`prometheus.yml`:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'code-executor'
    static_configs:
      - targets: ['code-executor:4000']
    metrics_path: '/metrics'
    scrape_interval: 5s
```

```bash
# Start everything
docker-compose -f docker-compose.monitoring.yml up -d

# Access services:
# Code Executor: http://localhost:4000
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (user: admin, pass: admin)
```

### Key Metrics to Watch

- Success Rate: `code_executor_success_rate` - Target: >99%
- Jobs/Second: `code_executor_jobs_per_second` - Throughput indicator
- Timeout Rate: Count of `TIME_LIMIT_EXCEEDED` - Should be <1%
- Execution Time P95: `code_executor_execution_time_ms{quantile="p95"}` - Target: <500ms
- Execution Time P99: `code_executor_execution_time_ms{quantile="p99"}` - Target: <1000ms
- Queue Wait P95: `code_executor_queue_wait_time_ms{quantile="p95"}` - Target: <200ms
- Redis Connection: `code_executor_redis_connected` - Should be 1
- Memory Usage: `code_executor_memory_mb` - Monitor for leaks
- Error Rate: `code_executor_error_count` - Should be 0 or very low
- Queue Size: `code_executor_queue_size` - Monitor peaks
- Max Queue Size: `code_executor_queue_max_size` - Capacity planning
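The rate targets above are percentages. In the sample `/status` payload, the derived values are consistent with `accepted / completed` for the success rate and `error_count / total_requests` for the error rate; the formulas below are assumptions inferred from those example numbers, not the service's actual code:

```javascript
// Derived rates, inferred from the sample /status payload:
// success_rate = accepted / completed, error_rate = error_count / total_requests.
// Both definitions are assumptions that happen to match the example numbers.
const pct = (num, den) =>
  den === 0 ? '0.00%' : ((num / den) * 100).toFixed(2) + '%';

const successRate = pct(140, 145); // jobs.accepted / jobs.completed
const errorRate = pct(2, 150);     // workers.error_count / system.total_requests
```

With the sample counters this yields the "96.55%" and "1.33%" shown earlier.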
### Grafana Dashboard Setup

- Go to Grafana: http://localhost:3000
- Dashboard → Import
- ID: Use custom dashboard JSON (see below)
Panel 1: Success Rate Over Time

```promql
code_executor_success_rate
```

Panel 2: Jobs Per Second

```promql
rate(code_executor_jobs_completed[1m])
```

Panel 3: Execution Time Distribution (avg, p95, p99 as separate series)

```promql
code_executor_execution_time_ms{quantile="avg"}
code_executor_execution_time_ms{quantile="p95"}
code_executor_execution_time_ms{quantile="p99"}
```

Panel 4: Queue Depth Over Time

```promql
code_executor_queue_size
```

Panel 5: Error Count

```promql
rate(code_executor_error_count[5m])
```

Panel 6: Performance by Language

```promql
avg by (language) (code_executor_execution_time_ms{quantile="avg"})
```
### Alerting Rules

`prometheus-alerts.yml`:

```yaml
groups:
  - name: code-executor
    rules:
      - alert: HighErrorRate
        expr: code_executor_error_count > 10
        for: 5m
        annotations:
          summary: "High error rate detected ({{ $value }} errors)"

      - alert: HighQueueDepth
        expr: code_executor_queue_size > 100
        for: 10m
        annotations:
          summary: "Queue depth too high ({{ $value }} jobs waiting)"

      - alert: SlowExecution
        expr: code_executor_execution_time_ms{quantile="p95"} > 1000
        for: 5m
        annotations:
          summary: "P95 execution time exceeds 1s ({{ $value }}ms)"

      - alert: RedisDown
        expr: code_executor_redis_connected == 0
        for: 1m
        annotations:
          summary: "Redis connection lost"

      - alert: HighMemoryUsage
        expr: code_executor_memory_mb > 500
        for: 5m
        annotations:
          summary: "Memory usage high ({{ $value }}MB)"
```

### Metrics Reference

Job counters:

- `code_executor_jobs_submitted` - Total jobs submitted
- `code_executor_jobs_completed` - Total jobs completed
- `code_executor_jobs_accepted` - Jobs with successful exit
- `code_executor_jobs_failed` - Jobs with runtime errors
- `code_executor_jobs_timeout` - Jobs that timed out
- `code_executor_jobs_compile_error` - Compilation failures
Execution timings:

- `code_executor_execution_time_ms` - Execution time with quantiles:
  - `quantile="avg"` - Average
  - `quantile="min"` - Minimum
  - `quantile="max"` - Maximum
  - `quantile="p50"` - 50th percentile
  - `quantile="p95"` - 95th percentile
  - `quantile="p99"` - 99th percentile
Queue metrics:

- `code_executor_queue_size` - Current queue depth
- `code_executor_queue_max_size` - Peak queue depth
- `code_executor_queue_wait_time_ms` - Queue wait time with quantiles
System metrics:

- `code_executor_uptime_seconds` - Server uptime
- `code_executor_success_rate` - Percentage of successful jobs
- `code_executor_jobs_per_second` - Throughput
- `code_executor_memory_mb` - Heap memory usage
- `code_executor_redis_connected` - Redis connection status (1 = connected, 0 = down)
- `code_executor_error_count` - Total worker errors
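Some of these gauges are derived values. With the sample payload's 145 completed jobs over 3600 seconds of uptime, the throughput comes out to "0.04" and the human-readable uptime to "1h". The formatting below is an assumption that reproduces the sample values, not the service's actual code:

```javascript
// Derived system metrics, reproducing the sample /status values
// (145 completed jobs, 3600 s uptime). Formatting is assumed.
const jobsPerSecond = (completed, uptimeSec) =>
  (completed / uptimeSec).toFixed(2);

function humanUptime(totalSec) {
  const h = Math.floor(totalSec / 3600);
  const m = Math.floor((totalSec % 3600) / 60);
  const s = Math.floor(totalSec % 60);
  // Omit zero-valued components, e.g. 3600 -> "1h"
  const parts = [h && `${h}h`, m && `${m}m`, s && `${s}s`].filter(Boolean);
  return parts.join(' ') || '0s';
}
```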
### Best Practices

- Set Baselines: Run for 24+ hours and establish normal patterns
- Alert Thresholds: Base them on your baselines; typical starting points:
  - Error rate > 1%
  - Queue depth > 50% of worker capacity
  - P95 execution time > 2x baseline
- Regular Reviews: Check dashboards daily, review trends weekly
- Retention Policy: Keep Prometheus data for 30+ days
- Documentation: Document alerting rules and their meanings
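The baseline-relative thresholds above can be encoded as a simple check. Everything here (field names, the capacity parameter, the `checkThresholds` helper) is illustrative; wire it to your own metrics source and baselines:

```javascript
// Evaluate the suggested alert thresholds against current metrics.
// Field names and parameters are illustrative, not the service's API.
function checkThresholds(m, { workerCapacity, baselineP95Ms }) {
  const alerts = [];
  if (m.errorCount / m.totalRequests > 0.01) alerts.push('error rate > 1%');
  if (m.queueSize > workerCapacity * 0.5) alerts.push('queue depth > 50% of capacity');
  if (m.p95Ms > 2 * baselineP95Ms) alerts.push('p95 > 2x baseline');
  return alerts;
}

// Using the sample payload's numbers with an assumed 2-worker deployment:
const alerts = checkThresholds(
  { errorCount: 2, totalRequests: 150, queueSize: 2, p95Ms: 800 },
  { workerCapacity: 2, baselineP95Ms: 800 }
);
```

Note that even the healthy-looking sample payload trips the 1% error-rate rule (2 errors in 150 requests), which is why thresholds should come from your own baselines rather than defaults.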
### External Integrations

DataDog:

```bash
# Forward Prometheus metrics to DataDog
docker run -d \
  -e DD_API_KEY=<your-key> \
  -e DD_DOCKER_LABELS_AS_TAGS='true' \
  -p 9090:9090 \
  datadog/agent:7
```

New Relic:

```javascript
// Use the custom events API
import newrelic from 'newrelic';

// In metrics collection:
newrelic.recordCustomEvent('JobCompletion', {
  status: result.status,
  language: job.language,
  duration: executionTime
});
```

AWS CloudWatch:

```javascript
import AWS from 'aws-sdk';

const cloudwatch = new AWS.CloudWatch();

// Push metrics periodically
cloudwatch.putMetricData({
  Namespace: 'CodeExecutor',
  MetricData: [
    {
      MetricName: 'SuccessRate',
      Value: metrics.getSuccessRate(),
      Unit: 'Percent'
    }
  ]
}).promise();
```

### Troubleshooting

Metrics not appearing:

- Check that the `/metrics` endpoint returns data: `curl localhost:4000/metrics`
- Check that Prometheus is scraping the target: http://localhost:9090/targets
- Verify the scrape interval in `prometheus.yml`

Gaps or resets in metrics:

- Check whether the server was restarted (in-memory metrics reset on restart)
- Verify Prometheus storage isn't full
- Check network connectivity between services

High memory usage:

- The performance histogram grows unbounded
- Solution: Increase the sampling interval or add memory limits
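One way to apply the memory-limit suggestion is to cap the number of retained samples so old measurements are discarded. The class name and capacity below are illustrative, not part of the service:

```javascript
// Bounded sample buffer: keeps percentile tracking from growing
// without bound by retaining only the most recent N samples.
class BoundedSamples {
  constructor(capacity) {
    this.capacity = capacity;
    this.samples = [];
  }
  push(value) {
    this.samples.push(value);
    if (this.samples.length > this.capacity) {
      this.samples.shift(); // drop the oldest sample
    }
  }
}

const buf = new BoundedSamples(1000);
for (let i = 0; i < 2500; i++) buf.push(i);
// buf now holds only the most recent 1000 samples
```

A ring buffer or a streaming quantile estimator (e.g. t-digest) would avoid the O(n) `shift()`, but this sketch shows the bound itself.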
### Next Steps

- Set up Grafana dashboards for your team
- Configure alerting rules based on SLOs
- Export metrics to an external monitoring system if needed
- Create runbooks for alert responses
- Monitor the monitors - ensure Prometheus and Grafana themselves stay healthy