Tracker service v0.5: OpenTelemetry observability — SDK, metrics instrumentation & load testing#1060
Conversation
Pull request overview
Introduces a first-pass OpenTelemetry SDK integration into the Tracker service, enabling configurable metrics and tracing export (OTLP/gRPC) while keeping no-op providers when disabled to minimize overhead.
Changes:
- Adds `Telemetry` lifecycle manager to initialize/shutdown global OTel metrics and tracing providers.
- Extends tracker configuration and env var overrides for OTLP endpoint, metrics/tracing enablement, and export intervals.
- Wires telemetry init/shutdown into tracker `main`, and adds unit tests and build system dependencies.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tracker/src/telemetry.cpp | Implements OTel SDK init/shutdown and global provider wiring for metrics + tracing. |
| tracker/inc/telemetry.hpp | Declares the Telemetry lifecycle manager API and state flags. |
| tracker/src/main.cpp | Calls Telemetry::init() after logger init and Telemetry::shutdown() during shutdown. |
| tracker/inc/config_loader.hpp | Adds OtlpConfig, MetricsConfig, TracingConfig and JSON pointer constants. |
| tracker/src/config_loader.cpp | Parses observability + OTLP config and applies new env var overrides. |
| tracker/inc/env_vars.hpp | Adds env var names for OTLP endpoint and metrics/tracing controls. |
| tracker/config/tracker.json | Extends example config with metrics/tracing blocks. |
| tracker/conanfile.txt | Adds opentelemetry-cpp/1.18.0 and enables OTLP gRPC exporter options. |
| tracker/CMakeLists.txt | Finds/links OpenTelemetry + gRPC and adds telemetry source file. |
| tracker/test/unit/CMakeLists.txt | Links unit tests against OpenTelemetry + gRPC and builds telemetry implementation into tests. |
| tracker/test/unit/telemetry_test.cpp | Adds unit tests covering enabled/disabled/init/shutdown paths. |
- Replace 3x `static_cast` with `dynamic_cast` + nullptr guard in Telemetry init/shutdown for safe provider downcasting
- Fix contradictory thread-safety doc in telemetry.hpp
- Add @throws documentation for double-init behavior
…rumentation & load testing

Signed-off-by: Józef Daniecki <jozef.daniecki@intel.com>
…missions and licensing clarity
tdorauintc left a comment
Comments starting with [future] do not block the PR.
I suggest adding a short load-test description to the README, including how the test parameters can be set.
```cpp
init_flag.~once_flag();
new (&init_flag) std::once_flag();
```
Better to use `std::destroy_at` and `std::construct_at` here.
```cpp
opentelemetry::nostd::unique_ptr<metrics_api::Histogram<double>>* hist = nullptr;
if (std::strcmp(metric_name, kMetricStageParse) == 0) {
  hist = &stage_parse_histogram;
} else if (std::strcmp(metric_name, kMetricStageBuffer) == 0) {
  hist = &stage_buffer_histogram;
} else if (std::strcmp(metric_name, kMetricStageQueue) == 0) {
  hist = &stage_queue_histogram;
} else if (std::strcmp(metric_name, kMetricStageTransform) == 0) {
  hist = &stage_transform_histogram;
} else if (std::strcmp(metric_name, kMetricStageTrack) == 0) {
  hist = &stage_track_histogram;
} else if (std::strcmp(metric_name, kMetricStagePublish) == 0) {
  hist = &stage_publish_histogram;
}

if (hist && *hist) {
  (*hist)->Record(ms, opentelemetry::common::KeyValueIterableView<MetricAttributes>(attrs),
                  opentelemetry::context::Context{});
}
```
[nit] We could do it without pointers to unique pointers:
Suggested change:

```diff
-opentelemetry::nostd::unique_ptr<metrics_api::Histogram<double>>* hist = nullptr;
-if (std::strcmp(metric_name, kMetricStageParse) == 0) {
-  hist = &stage_parse_histogram;
-} else if (std::strcmp(metric_name, kMetricStageBuffer) == 0) {
-  hist = &stage_buffer_histogram;
-} else if (std::strcmp(metric_name, kMetricStageQueue) == 0) {
-  hist = &stage_queue_histogram;
-} else if (std::strcmp(metric_name, kMetricStageTransform) == 0) {
-  hist = &stage_transform_histogram;
-} else if (std::strcmp(metric_name, kMetricStageTrack) == 0) {
-  hist = &stage_track_histogram;
-} else if (std::strcmp(metric_name, kMetricStagePublish) == 0) {
-  hist = &stage_publish_histogram;
-}
-if (hist && *hist) {
-  (*hist)->Record(ms, opentelemetry::common::KeyValueIterableView<MetricAttributes>(attrs),
-                  opentelemetry::context::Context{});
-}
+auto get_histogram = [](const char* metric_name) -> metrics_api::Histogram<double>& {
+  if (std::strcmp(metric_name, kMetricStageParse) == 0) {
+    return *stage_parse_histogram;
+  } else if (std::strcmp(metric_name, kMetricStageBuffer) == 0) {
+    return *stage_buffer_histogram;
+  } else if (std::strcmp(metric_name, kMetricStageQueue) == 0) {
+    return *stage_queue_histogram;
+  } else if (std::strcmp(metric_name, kMetricStageTransform) == 0) {
+    return *stage_transform_histogram;
+  } else if (std::strcmp(metric_name, kMetricStageTrack) == 0) {
+    return *stage_track_histogram;
+  } else if (std::strcmp(metric_name, kMetricStagePublish) == 0) {
+    return *stage_publish_histogram;
+  }
+  throw std::invalid_argument("Invalid metric name: " + std::string(metric_name));
+};
+metrics_api::Histogram<double>& hist = get_histogram(metric_name);
+hist.Record(ms, opentelemetry::common::KeyValueIterableView<MetricAttributes>(attrs),
+            opentelemetry::context::Context{});
```
```cpp
void MessageHandler::handleCameraMessage(const std::string& topic, const std::string& payload) {
  ObservabilityContext obs_ctx;
  obs_ctx.receive_time = std::chrono::steady_clock::now();
```
[nit, future] We could make these calls inline members, e.g. `obs_ctx.captureReceiveTime();`
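A hypothetical sketch of that shape; the field and method names are illustrative, mirroring the diff above (the real `ObservabilityContext` carries more fields):

```cpp
#include <chrono>
#include <string>

// Illustrative-only ObservabilityContext with timestamp capture folded
// into inline members, as the comment above suggests.
struct ObservabilityContext {
    std::chrono::steady_clock::time_point receive_time{};
    std::string camera_id;

    void captureReceiveTime() { receive_time = std::chrono::steady_clock::now(); }
};
```

The call site then reads `obs_ctx.captureReceiveTime();` instead of assigning `std::chrono::steady_clock::now()` to the field directly, keeping the clock choice in one place.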
```cpp
// Propagate earliest batch's observability context to chunk level
if (!chunk.camera_batches.empty()) {
  chunk.obs_ctx = chunk.camera_batches.front().obs_ctx;
  chunk.obs_ctx.dispatch_time = std::chrono::steady_clock::now();
}
```
If more than one camera batch exists, only the earliest batch's camera_id is propagated, and we lose information about the other camera_ids received and parsed since the previous chunk. As a result, finalize() will tag the tracker.stage.{parse,buffer} latency histograms with only one camera_id label.
I therefore suggest invalidating chunk.obs_ctx.camera_id here (e.g. setting it to the empty string) and not using the camera_id attribute for stage latencies.
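A sketch of the suggested invalidation, using minimal stand-ins for the types in the diff above (illustrative only; the real structs have more fields):

```cpp
#include <chrono>
#include <string>
#include <vector>

// Minimal stand-ins mirroring the diff above.
struct ObservabilityContext {
    std::string camera_id;
    std::chrono::steady_clock::time_point dispatch_time{};
};
struct CameraBatch { ObservabilityContext obs_ctx; };
struct Chunk {
    std::vector<CameraBatch> camera_batches;
    ObservabilityContext obs_ctx;
};

// Propagate the earliest batch's context, but blank camera_id so the
// chunk-level stage latencies are not mislabeled with a single camera.
void propagate(Chunk& chunk) {
    if (!chunk.camera_batches.empty()) {
        chunk.obs_ctx = chunk.camera_batches.front().obs_ctx;
        chunk.obs_ctx.camera_id.clear();  // invalidate: chunk spans multiple cameras
        chunk.obs_ctx.dispatch_time = std::chrono::steady_clock::now();
    }
}
```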
```cpp
batch.obs_ctx = std::move(obs_ctx);
batch.obs_ctx.buffer_time = std::chrono::steady_clock::now();
batch.obs_ctx.category = category;
```
Since we do this multiple times in a loop, we need to copy obs_ctx instead of moving it; after the first iteration, the later iterations would otherwise read from a moved-from object, which is left in an unspecified state.
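A sketch of the fix, copying per iteration; the type and field names are stand-ins mirroring the surrounding diff:

```cpp
#include <string>
#include <vector>

// Minimal stand-ins for the types in the diff above (illustrative only).
struct ObservabilityContext {
    std::string camera_id;
    std::string category;
};
struct Batch { ObservabilityContext obs_ctx; };

// obs_ctx is read on every iteration, so each batch takes a copy.
// Using std::move here would leave obs_ctx in an unspecified state
// after the first iteration.
void stamp_batches(std::vector<Batch>& batches,
                   const ObservabilityContext& obs_ctx,
                   const std::string& category) {
    for (auto& batch : batches) {
        batch.obs_ctx = obs_ctx;           // copy, not std::move
        batch.obs_ctx.category = category;
    }
}
```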
```yaml
build:
  context: .
  dockerfile: Dockerfile.k6
```
[future] I get the following error on my machine:
```
failed to solve: DeadlineExceeded: grafana/xk6:1.3.5@sha256:a6ec20c88f5a1087ed97861ddc3b1803a9148436d11f3e6253c7459d6d5781ce: failed to resolve source metadata for docker.io/grafana/xk6:1.3.5@sha256:a6ec20c88f5a1087ed97861ddc3b1803a9148436d11f3e6253c7459d6d5781ce: failed to do request: Head "https://registry-1.docker.io/v2/grafana/xk6/manifests/sha256:a6ec20c88f5a1087ed97861ddc3b1803a9148436d11f3e6253c7459d6d5781ce": dial tcp 18.213.134.106:443: i/o timeout
```
It would be good to debug this and address the issue in the documentation (it may be a platform configuration problem).
Description
Adds OpenTelemetry observability to the Tracker Service for production monitoring and performance validation.
Key Changes:
Why: Enables monitoring real-time performance, validating SLIs under load, and debugging performance issues in production.
Load Test Result