
feat: add distributed tracing to tracker and worker#292

Merged
BradenBug merged 21 commits into dev from bw/logfire-tracing on May 5, 2026
Conversation

BradenBug (Contributor) commented Apr 17, 2026

Summary

Instruments the tracker API and Taskiq worker with OpenTelemetry, giving end-to-end visibility into benchmark execution from HTTP request through worker task completion. Trace context propagates across the Redis/Taskiq boundary so a single trace covers the full benchmark lifecycle. Traces and logs ship to Sentry.


Architecture

  • OTel SDK via the logfire package (spans + FastAPI/httpx auto-instrumentation). send_to_logfire=False — we don't ship to Logfire cloud, logfire is only used as an ergonomic OTel wrapper.
  • Traces → Sentry via SentrySpanProcessor, with SPAN_MAX_TIME_OPEN_MINUTES bumped to 240 so long-running parents like process_benchmark / process_task survive (default 10m drops them).
  • Logs → Sentry Logs via LoggingIntegration(sentry_logs_level=INFO). Stdlib logger.info/warning/error appear alongside spans in the Sentry UI.
  • Propagation via a composite propagator: W3C traceparent/tracestate + Baggage + Sentry's sentry-trace/baggage, so both non-Sentry peers (Daytona, benchmark_service) and Sentry peers see the trace.
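The composite propagator described above can be wired roughly like this. This is a sketch rather than the exact PR code, and it assumes sentry-sdk's OpenTelemetry integration is installed:

```python
from opentelemetry import propagate
from opentelemetry.baggage.propagation import W3CBaggagePropagator
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from sentry_sdk.integrations.opentelemetry import SentryPropagator

propagate.set_global_textmap(
    CompositePropagator([
        TraceContextTextMapPropagator(),  # W3C traceparent/tracestate
        W3CBaggagePropagator(),           # W3C baggage
        SentryPropagator(),               # sentry-trace + baggage for Sentry peers
    ])
)
```

With this in place, every inject()/extract() call writes and reads all three header families, so non-Sentry peers and Sentry both pick up the same trace.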

Changes

Tracing config

  • New tracing.py with configure_tracing(service_name) — wires up the OTel SDK, Sentry span processor, composite propagator, and a _ContextVarSpanProcessor that attaches request_id/benchmark_id/task_id contextvars to every span (mirrors the existing sentry._before_send pattern for events).
  • Auto-instrument FastAPI and httpx; SQLAlchemy and Redis intentionally excluded (high-volume, low-signal Taskiq plumbing).
  • /health excluded from request tracing.

Cross-boundary propagation

  • /start-benchmark and /retry-or-resume-benchmark inject() the current trace context into Taskiq labels before kicking.
  • New TracingContextMiddleware (Taskiq) extracts trace context from labels on the worker side, so the kicked process_benchmark run shows up as a child of the originating HTTP request span.
  • Sandbox-side: pass DAYTONA_SANDBOX_OTEL_EXTRA_LABELS env var so the sandbox's internal OTel telemetry is filterable by benchmark_id/task_id/environment (Daytona's OTLP export is account-level, so environment tags matter for separating dev/staging/prod).
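The real code uses opentelemetry.propagate.inject()/extract() with the Taskiq labels dict as the carrier. As a pure-Python illustration of what actually lands in the labels, here is the W3C traceparent format being written and parsed (the helper names are hypothetical):

```python
def inject_traceparent(trace_id: int, span_id: int, labels: dict, sampled: bool = True) -> None:
    """Write a W3C traceparent entry into a carrier dict (e.g. Taskiq labels)."""
    flags = "01" if sampled else "00"
    # version-traceid(32 hex)-parentid(16 hex)-flags(2 hex)
    labels["traceparent"] = f"00-{trace_id:032x}-{span_id:016x}-{flags}"

def extract_traceparent(labels: dict) -> tuple[int, int, bool]:
    """Parse traceparent back out on the worker side."""
    _version, trace_id, span_id, flags = labels["traceparent"].split("-")
    return int(trace_id, 16), int(span_id, 16), flags == "01"
```

On the worker, TracingContextMiddleware does the extract half before the task body runs, so process_benchmark starts as a child of the originating HTTP request span.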

Manual spans

  • process_benchmark and process_task get top-level spans.
  • Sub-op spans: upload_agent, setup_task, run_agent, evaluate_instance, final_score, upload_results, create_log_group, upload_to_s3, send_notification, pty_disconnect, kill_pty_session.
  • Caught exceptions recorded via logfire.exception() at the three sentry_sdk.capture_exception() sites (process_benchmark, process_task, TrackedTask.run).
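The span nesting above follows logfire's context-manager API; a sketch of the pattern (not the actual task body) looks like:

```python
import logfire

def process_task(task_id: str) -> None:
    # Top-level span for the whole task; sub-op spans nest inside it.
    with logfire.span("process_task", task_id=task_id):
        with logfire.span("setup_task"):
            ...  # provision sandbox, upload agent, etc.
        try:
            with logfire.span("run_agent"):
                ...
        except Exception:
            # Mirrors the PR's pattern: record the caught exception on the
            # trace at the existing sentry_sdk.capture_exception() sites.
            logfire.exception("run_agent failed")
            raise
```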

Sentry config

  • init_sentry switched to INSTRUMENTER.OTEL and enable_logs=True.
  • LoggingIntegration(level=None, event_level=None, sentry_logs_level=INFO) — breadcrumb capture and auto-error events disabled (spans carry context; we capture_exception explicitly).
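Roughly, the init described above (a sketch based on the bullet points; exact keyword names depend on the sentry-sdk version in use):

```python
import logging

import sentry_sdk
from sentry_sdk.integrations.logging import LoggingIntegration

sentry_sdk.init(
    dsn=dsn,
    instrumenter="otel",  # INSTRUMENTER.OTEL: spans come from the OTel SDK
    enable_logs=True,
    integrations=[
        LoggingIntegration(
            level=None,        # no breadcrumbs from stdlib logging
            event_level=None,  # no auto error events; we capture_exception explicitly
            sentry_logs_level=logging.INFO,  # ship INFO+ records to Sentry Logs
        ),
    ],
)
```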

Infra

  • Removed LOGFIRE_TOKEN secret from TrackerStack / WorkerStack (no longer needed).
  • docker-compose passes SENTRY_DSN / SENTRY_RELEASE / ENVIRONMENT through from host env for local testing.

Follow-ups

  • Instrument benchmark_service with OTel + Sentry if we want visibility into its internals — right now it appears only as outbound httpx spans from the tracker side, so you can see how long a call took but not what it was doing.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Refactor
  • Docs / config

Testing

  • Added/updated unit tests
  • Manually tested

Checklist

  • Self-reviewed the diff
  • No debug/dead code left in
  • Docs updated if needed

BradenBug and others added 2 commits April 17, 2026 17:08
# Conflicts:
#	services/tracker/src/tracker/utils.py
2b2652b was a broad ruff format pass that touched 21 files. Keeping
the formatting applied to logfire-related code (utils.py, config.py,
tracker_stack.py, worker_stack.py) but dropping the unrelated churn
in migrations, test files, cli/main.py, and logging/config.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
additional_span_processors=[_ContextVarSpanProcessor()],
)

logfire.instrument_httpx()
devin-ai-integration (bot) commented Apr 22, 2026


🚩 logfire.instrument_httpx() called globally in configure_logfire

logfire.instrument_httpx() at services/tracker/src/tracker/tracing.py:49 monkey-patches httpx.Client and httpx.AsyncClient globally. This means ALL httpx clients created after this call are instrumented, including the Slack webhook client in notifications.py:138 and any BenchmarkServiceClient httpx usage. For the worker, configure_logfire is called during WORKER_STARTUP event (config.py:76), which runs before any tasks are processed, so all benchmark service clients created per-task will be instrumented. This is a broad-scope side effect worth being aware of — if any httpx client handles sensitive data in headers, those would appear in traces.


@BradenBug BradenBug marked this pull request as ready for review April 22, 2026 22:13
@BradenBug BradenBug marked this pull request as draft April 23, 2026 00:48
@BradenBug BradenBug changed the title feat: add logfire distributed tracing to tracker and worker feat: add distributed tracing to tracker and worker Apr 24, 2026
@BradenBug BradenBug marked this pull request as ready for review April 24, 2026 19:45
@BradenBug BradenBug marked this pull request as draft April 24, 2026 19:45
@BradenBug BradenBug marked this pull request as ready for review April 24, 2026 19:54
JarettForzano (Collaborator) left a comment


Instead of using context managers I think we should make decorators that wrap the functions. It's easier to maintain and helps clean up the code by enforcing that we divide it into functions.

Comment thread services/tracker/src/tracker/config.py Outdated
Comment thread services/tracker/docker-compose.yml
@BradenBug BradenBug requested a review from JarettForzano April 27, 2026 18:03
Comment thread services/tracker/src/tracker/utils.py Outdated
task.status = TaskStatus.BUILDING
task_session.commit()

environment = os.environ.get("ENVIRONMENT", "development")
Collaborator:


I think this should live inside the config file

Comment thread services/tracker/src/tracker/sentry.py Outdated
try:
sentry_sdk.init(
dsn=dsn,
environment=os.environ.get("ENVIRONMENT", "development"),
Collaborator:


Pull from config, since it's used in more than one location

BradenBug (author):


sure can do

Comment on lines +83 to +85
# Sentry's LoggingIntegration (enabled in init_sentry with sentry_logs_level=INFO)
# attaches to the root logger and ships records to Sentry Logs without needing a
# dictConfig handler entry.
Collaborator:


it just uses the logs that we have built in?

BradenBug (author):


Yep

Collaborator:


I think we may want to use instrument with logfire on these so that we can capture the sandbox id. Should also use logfire.exception on the warnings so that we can make sure we are capturing those.

BradenBug (author):


yuup

JarettForzano (Collaborator) left a comment


Can merge after requested changes are made / responded to

BradenBug merged commit 175cf53 into dev on May 5, 2026
7 of 8 checks passed