feat: add distributed tracing to tracker and worker #292
Conversation
# Conflicts:
#	services/tracker/src/tracker/utils.py
2b2652b was a broad ruff format pass that touched 21 files. Keeping the formatting applied to logfire-related code (utils.py, config.py, tracker_stack.py, worker_stack.py) but dropping the unrelated churn in migrations, test files, cli/main.py, and logging/config.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
```python
    additional_span_processors=[_ContextVarSpanProcessor()],
)

logfire.instrument_httpx()
```
🚩 logfire.instrument_httpx() called globally in configure_logfire
logfire.instrument_httpx() at services/tracker/src/tracker/tracing.py:49 monkey-patches httpx.Client and httpx.AsyncClient globally. This means ALL httpx clients created after this call are instrumented, including the Slack webhook client in notifications.py:138 and any BenchmarkServiceClient httpx usage. For the worker, configure_logfire is called during WORKER_STARTUP event (config.py:76), which runs before any tasks are processed, so all benchmark service clients created per-task will be instrumented. This is a broad-scope side effect worth being aware of — if any httpx client handles sensitive data in headers, those would appear in traces.
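One way to narrow the blast radius, sketched below on the assumption that `logfire.instrument_httpx` is passed an explicit client rather than patching globally (client names here are hypothetical):

```python
# Sketch: scope httpx instrumentation to one client instead of the global
# monkey-patch, so sensitive clients (e.g. the Slack webhook client) stay
# uninstrumented and their headers never reach traces.
import httpx
import logfire

logfire.configure(send_to_logfire=False)

traced_client = httpx.Client()
logfire.instrument_httpx(traced_client)  # instruments only this instance

untraced_client = httpx.Client()  # untouched by instrumentation
```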
JarettForzano left a comment
Instead of using context managers, I think we should make decorators that wrap around the functions. It's easier to maintain and helps clean up the code by enforcing that we divide it into functions.
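For illustration, a sketch of that shape using logfire's built-in `@logfire.instrument` decorator (the function name is hypothetical):

```python
# Sketch: the decorator opens a span around the whole function, replacing an
# inline `with logfire.span(...)` block inside a larger body.
import logfire

@logfire.instrument("setup_task")  # span named after the step
def setup_task(task_id: str) -> None:
    ...  # the entire function body runs inside the span
```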
```python
task.status = TaskStatus.BUILDING
task_session.commit()

environment = os.environ.get("ENVIRONMENT", "development")
```
I think this should live inside the config file.
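A sketch of that move, assuming a module-level constant in the tracker's `config.py`:

```python
# config.py (sketch): read ENVIRONMENT once so every consumer shares one value.
import os

ENVIRONMENT: str = os.environ.get("ENVIRONMENT", "development")
```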
```python
try:
    sentry_sdk.init(
        dsn=dsn,
        environment=os.environ.get("ENVIRONMENT", "development"),
```
Pull from config, just since it's used in more than one location.
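A sketch of pulling it from config here too (the import path is an assumption):

```python
# Sketch: sentry setup imports the shared constant instead of re-reading
# os.environ at each call site.
import sentry_sdk
from tracker.config import ENVIRONMENT  # hypothetical import path

def init_sentry(dsn: str) -> None:
    sentry_sdk.init(dsn=dsn, environment=ENVIRONMENT)
```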
```python
# Sentry's LoggingIntegration (enabled in init_sentry with sentry_logs_level=INFO)
# attaches to the root logger and ships records to Sentry Logs without needing a
# dictConfig handler entry.
```
It just uses the logs that we have built in?
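Roughly, yes. A sketch of the wiring the comment describes, using the `LoggingIntegration` settings from this PR (the DSN is a placeholder):

```python
# Sketch: LoggingIntegration hooks the stdlib root logger, so existing
# logger.info(...) calls ship as Sentry Logs with no dictConfig changes.
import logging

import sentry_sdk
from sentry_sdk.integrations.logging import LoggingIntegration

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder
    enable_logs=True,
    integrations=[
        LoggingIntegration(
            level=None,        # no breadcrumbs from log records
            event_level=None,  # no auto error events from log records
            sentry_logs_level=logging.INFO,  # ship INFO+ to Sentry Logs
        )
    ],
)

logging.getLogger(__name__).info("shows up in Sentry Logs")
```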
I think we may want to instrument these with logfire so that we can capture the sandbox id. We should also use logfire.exception on the warnings so that we make sure we are capturing those.
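A sketch of what that could look like (the function and attribute names are assumptions):

```python
# Sketch: carry sandbox_id as a span attribute and record caught failures via
# logfire.exception(), which logs at error level with the active traceback.
import logfire

def kill_pty_session(sandbox_id: str) -> None:
    with logfire.span("kill_pty_session", sandbox_id=sandbox_id):
        try:
            ...  # actual PTY teardown call
        except Exception:
            logfire.exception("failed to kill pty session", sandbox_id=sandbox_id)
```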
JarettForzano left a comment
Can merge after requested changes are made / responded to
Summary
Instruments the tracker API and Taskiq worker with OpenTelemetry, giving end-to-end visibility into benchmark execution from HTTP request through worker task completion. Trace context propagates across the Redis/Taskiq boundary so a single trace covers the full benchmark lifecycle. Traces and logs ship to Sentry.
Architecture
- `logfire` package (spans + FastAPI/httpx auto-instrumentation). `send_to_logfire=False` — we don't ship to Logfire cloud; logfire is only used as an ergonomic OTel wrapper.
- `SentrySpanProcessor`, with `SPAN_MAX_TIME_OPEN_MINUTES` bumped to 240 so long-running parents like `process_benchmark`/`process_task` survive (the default 10m drops them).
- `LoggingIntegration(sentry_logs_level=INFO)`. Stdlib `logger.info`/`warning`/`error` appear alongside spans in the Sentry UI.
- `traceparent`/`tracestate` + Baggage + Sentry's `sentry-trace`/`baggage`, so both non-Sentry peers (Daytona, benchmark_service) and Sentry peers see the trace.
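A condensed sketch of that wiring, assuming sentry-sdk's OTel integration points; the exact setup in `tracing.py` may differ:

```python
# Sketch: logfire as the OTel ergonomics layer, Sentry as the backend.
import logfire
import sentry_sdk
from opentelemetry import propagate
from opentelemetry.baggage.propagation import W3CBaggagePropagator
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from sentry_sdk.integrations.opentelemetry import SentryPropagator, SentrySpanProcessor

sentry_sdk.init(dsn="https://example@o0.ingest.sentry.io/0", instrumenter="otel")

logfire.configure(
    send_to_logfire=False,  # no Logfire cloud export; spans flow to Sentry
    additional_span_processors=[SentrySpanProcessor()],
)

# Composite propagator: W3C headers for non-Sentry peers, sentry-trace/baggage
# for Sentry peers, both emitted on every outbound call.
propagate.set_global_textmap(
    CompositePropagator(
        [TraceContextTextMapPropagator(), W3CBaggagePropagator(), SentryPropagator()]
    )
)
```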
Changes
Tracing config
- `tracing.py` with `configure_tracing(service_name)` — wires up the OTel SDK, Sentry span processor, composite propagator, and a `_ContextVarSpanProcessor` that attaches `request_id`/`benchmark_id`/`task_id` contextvars to every span (mirrors the existing `sentry._before_send` pattern for events).
- `/health` excluded from request tracing.
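A sketch of the span-processor idea; attribute names come from the description, the implementation details are assumed:

```python
# Sketch: copy request-scoped contextvars onto every span at start time
# (shown for benchmark_id; request_id and task_id follow the same pattern).
from contextvars import ContextVar
from typing import Optional

from opentelemetry.context import Context
from opentelemetry.sdk.trace import Span, SpanProcessor

benchmark_id_var: ContextVar[Optional[str]] = ContextVar("benchmark_id", default=None)

class _ContextVarSpanProcessor(SpanProcessor):
    def on_start(self, span: Span, parent_context: Optional[Context] = None) -> None:
        benchmark_id = benchmark_id_var.get()
        if benchmark_id is not None:
            span.set_attribute("benchmark_id", benchmark_id)
```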
Cross-boundary propagation
- `/start-benchmark` and `/retry-or-resume-benchmark` `inject()` the current trace context into Taskiq labels before kicking.
- `TracingContextMiddleware` (Taskiq) extracts trace context from labels on the worker side, so the kicked `process_benchmark` run shows up as a child of the originating HTTP request span.
- `DAYTONA_SANDBOX_OTEL_EXTRA_LABELS` env var so the sandbox's internal OTel telemetry is filterable by `benchmark_id`/`task_id`/`environment` (Daytona's OTLP export is account-level, so environment tags matter for separating dev/staging/prod).
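A sketch of the label-based handoff; the Taskiq API shapes here are assumptions:

```python
# Sketch: serialize trace context into Taskiq labels on the tracker side and
# restore it on the worker side before the task body runs.
from opentelemetry import context as otel_context
from opentelemetry.propagate import extract, inject

# Tracker side, just before kicking the task:
labels: dict[str, str] = {}
inject(labels)  # writes traceparent/tracestate (plus baggage, sentry-trace)
# await process_benchmark.kicker().with_labels(**labels).kiq(benchmark_id)

# Worker side, inside the middleware, before task execution:
def restore_context(labels: dict[str, str]):
    token = otel_context.attach(extract(labels))
    return token  # detach with otel_context.detach(token) after the task
```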
Manual spans
- `process_benchmark` and `process_task` get top-level spans.
- Spans around `upload_agent`, `setup_task`, `run_agent`, `evaluate_instance`, `final_score`, `upload_results`, `create_log_group`, `upload_to_s3`, `send_notification`, `pty_disconnect`, `kill_pty_session`.
- `logfire.exception()` at the three `sentry_sdk.capture_exception()` sites (`process_benchmark`, `process_task`, `TrackedTask.run`).
Sentry config
- `init_sentry` switched to `INSTRUMENTER.OTEL` and `enable_logs=True`.
- `LoggingIntegration(level=None, event_level=None, sentry_logs_level=INFO)` — breadcrumb capture and auto-error events disabled (spans carry context; we `capture_exception` explicitly).
Infra
- `LOGFIRE_TOKEN` secret removed from `TrackerStack`/`WorkerStack` (no longer needed).
- `SENTRY_DSN`/`SENTRY_RELEASE`/`ENVIRONMENT` passed through from the host env for local testing.
Follow-ups
- Instrument `benchmark_service` with OTel + Sentry if we want visibility into its internals — right now it appears only as outbound httpx spans from the tracker side, so you can see how long a call took but not what it was doing.
Type of Change
Testing
Checklist