
Tracker service v0.5: OpenTelemetry observability — SDK, metrics instrumentation & load testing#1060

Open
jdanieck wants to merge 4 commits into main from tracker-service-v0.5.0

Conversation

Contributor

@jdanieck jdanieck commented Feb 19, 2026

Description

Adds OpenTelemetry observability to the Tracker Service for production monitoring and performance validation.

Key Changes:

  • OpenTelemetry SDK integration with OTLP/gRPC metrics export
  • Instrumented metrics: MQTT pipeline (messages, drops, latency), tracking state (active tracks), per-stage latency breakdown
  • k6-based load testing with automated SLI validation (< 0.1% drop rate)
  • Environment variable configuration for all observability settings

Why: Enables monitoring real-time performance, validating SLIs under load, and debugging performance issues in production.

Load Test Result

Hardware: Intel(R) Core(TM) Ultra 9 285H (16 cores), 62.1 GB, kernel 6.17.0-14-generic
  Load:     4 cam × 15 FPS × 300 obj = 60 msg/s for 1m (60s)

  KPI                            Actual       Threshold  Result
  -------------------------------------------------------------
  Dropped messages              0.0000%       < 0.1000%    PASS
  Active tracks                     300          == 300    PASS
  Throughput                 59.5 msg/s   >= 57.0 msg/s      OK
  Latency p50                  63.51 ms       < 66.7 ms      OK
  Latency p99                  99.91 ms      < 133.3 ms      OK

  Messages: 3572 received, 0 dropped

  Stage Latency (informational):
    Stage                    p50 (ms)    p99 (ms)
    ----------------------------------------------
    Parse                        0.74        4.59
    Buffer                       0.50        0.99
    Queue wait                  38.35       74.27
    Transform                    5.10        9.94
    Track                       17.56       32.41
    Publish                      0.54        3.48
    ----------------------------------------------
    End-to-end                  63.51       99.91

@jdanieck jdanieck self-assigned this Feb 19, 2026
@jdanieck jdanieck changed the title feat(tracker): add OpenTelemetry SDK foundation WIP: feat(tracker): add OpenTelemetry SDK foundation Feb 19, 2026
@jdanieck jdanieck changed the title WIP: feat(tracker): add OpenTelemetry SDK foundation [WIP] feat(tracker): add OpenTelemetry SDK foundation Feb 19, 2026
@jdanieck jdanieck changed the base branch from main to tracker-service-v0.4.1 February 19, 2026 15:32
@jdanieck jdanieck requested a review from Copilot February 19, 2026 15:33
Contributor

Copilot AI left a comment


Pull request overview

Introduces a first-pass OpenTelemetry SDK integration into the Tracker service, enabling configurable metrics and tracing export (OTLP/gRPC) while keeping no-op providers when disabled to minimize overhead.

Changes:

  • Adds Telemetry lifecycle manager to initialize/shutdown global OTel metrics and tracing providers.
  • Extends tracker configuration + env var overrides for OTLP endpoint and metrics/tracing enablement + export intervals.
  • Wires telemetry init/shutdown into tracker main, and adds unit tests + build system dependencies.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tracker/src/telemetry.cpp Implements OTel SDK init/shutdown and global provider wiring for metrics + tracing.
tracker/inc/telemetry.hpp Declares the Telemetry lifecycle manager API and state flags.
tracker/src/main.cpp Calls Telemetry::init() after logger init and Telemetry::shutdown() during shutdown.
tracker/inc/config_loader.hpp Adds OtlpConfig, MetricsConfig, TracingConfig and JSON pointer constants.
tracker/src/config_loader.cpp Parses observability + OTLP config and applies new env var overrides.
tracker/inc/env_vars.hpp Adds env var names for OTLP endpoint and metrics/tracing controls.
tracker/config/tracker.json Extends example config with metrics/tracing blocks.
tracker/conanfile.txt Adds opentelemetry-cpp/1.18.0 and enables OTLP gRPC exporter options.
tracker/CMakeLists.txt Finds/links OpenTelemetry + gRPC and adds telemetry source file.
tracker/test/unit/CMakeLists.txt Links unit tests against OpenTelemetry + gRPC and builds telemetry implementation into tests.
tracker/test/unit/telemetry_test.cpp Adds unit tests covering enabled/disabled/init/shutdown paths.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.

@jdanieck jdanieck changed the title [WIP] feat(tracker): add OpenTelemetry SDK foundation [WIP] Tracker service v0.5.0: OpenTelemetry SDK foundation Feb 20, 2026
@jdanieck jdanieck force-pushed the tracker-service-v0.4.1 branch from b2eadbf to a229957 Compare February 23, 2026 08:52
Base automatically changed from tracker-service-v0.4.1 to tracker-service-v0.4.0 February 23, 2026 08:58
@jdanieck jdanieck force-pushed the tracker-service-v0.5.0 branch from 21727fe to 78f8039 Compare February 23, 2026 09:18
jdanieck added a commit that referenced this pull request Feb 23, 2026
- Replace 3x static_cast with dynamic_cast + nullptr guard in
  Telemetry init/shutdown for safe provider downcasting
- Fix contradictory thread-safety doc in telemetry.hpp
- Add @throws documentation for double-init behavior
@jdanieck jdanieck changed the title [WIP] Tracker service v0.5.0: OpenTelemetry SDK foundation feat(tracker): OpenTelemetry observability (Phase 1-2: SDK + metrics instrumentation) Feb 23, 2026
@jdanieck jdanieck changed the title feat(tracker): OpenTelemetry observability (Phase 1-2: SDK + metrics instrumentation) Tracker service v0.5: metrics Feb 23, 2026
@jdanieck jdanieck changed the title Tracker service v0.5: metrics feat(tracker): OpenTelemetry observability — SDK, metrics instrumentation & load testing Feb 23, 2026
@jdanieck jdanieck changed the title feat(tracker): OpenTelemetry observability — SDK, metrics instrumentation & load testing Tracker service v0.5: OpenTelemetry observability — SDK, metrics instrumentation & load testing Feb 23, 2026
Base automatically changed from tracker-service-v0.4.0 to main February 25, 2026 11:21
…rumentation & load testing

Signed-off-by: Józef Daniecki <jozef.daniecki@intel.com>
@jdanieck jdanieck force-pushed the tracker-service-v0.5.0 branch from 9a76d71 to b6ddbdf Compare February 25, 2026 11:34
@jdanieck jdanieck marked this pull request as ready for review February 25, 2026 11:54
Contributor

@tdorauintc tdorauintc left a comment


Comments starting with [future] do not block the PR.


I suggest adding a short load-test description to the README, including how the test parameters can be set.

Comment on lines +171 to +172
init_flag.~once_flag();
new (&init_flag) std::once_flag();

Better to use std::destroy_at and std::construct_at here.

Comment on lines +109 to +127
opentelemetry::nostd::unique_ptr<metrics_api::Histogram<double>>* hist = nullptr;
if (std::strcmp(metric_name, kMetricStageParse) == 0) {
    hist = &stage_parse_histogram;
} else if (std::strcmp(metric_name, kMetricStageBuffer) == 0) {
    hist = &stage_buffer_histogram;
} else if (std::strcmp(metric_name, kMetricStageQueue) == 0) {
    hist = &stage_queue_histogram;
} else if (std::strcmp(metric_name, kMetricStageTransform) == 0) {
    hist = &stage_transform_histogram;
} else if (std::strcmp(metric_name, kMetricStageTrack) == 0) {
    hist = &stage_track_histogram;
} else if (std::strcmp(metric_name, kMetricStagePublish) == 0) {
    hist = &stage_publish_histogram;
}

if (hist && *hist) {
    (*hist)->Record(ms, opentelemetry::common::KeyValueIterableView<MetricAttributes>(attrs),
                    opentelemetry::context::Context{});
}

[nit] We could do it without pointers to unique pointers:

Suggested change — replace:

    opentelemetry::nostd::unique_ptr<metrics_api::Histogram<double>>* hist = nullptr;
    if (std::strcmp(metric_name, kMetricStageParse) == 0) {
        hist = &stage_parse_histogram;
    } else if (std::strcmp(metric_name, kMetricStageBuffer) == 0) {
        hist = &stage_buffer_histogram;
    } else if (std::strcmp(metric_name, kMetricStageQueue) == 0) {
        hist = &stage_queue_histogram;
    } else if (std::strcmp(metric_name, kMetricStageTransform) == 0) {
        hist = &stage_transform_histogram;
    } else if (std::strcmp(metric_name, kMetricStageTrack) == 0) {
        hist = &stage_track_histogram;
    } else if (std::strcmp(metric_name, kMetricStagePublish) == 0) {
        hist = &stage_publish_histogram;
    }
    if (hist && *hist) {
        (*hist)->Record(ms, opentelemetry::common::KeyValueIterableView<MetricAttributes>(attrs),
                        opentelemetry::context::Context{});
    }

with:

    auto get_histogram = [](const char* metric_name) -> metrics_api::Histogram<double>& {
        if (std::strcmp(metric_name, kMetricStageParse) == 0) {
            return *stage_parse_histogram;
        } else if (std::strcmp(metric_name, kMetricStageBuffer) == 0) {
            return *stage_buffer_histogram;
        } else if (std::strcmp(metric_name, kMetricStageQueue) == 0) {
            return *stage_queue_histogram;
        } else if (std::strcmp(metric_name, kMetricStageTransform) == 0) {
            return *stage_transform_histogram;
        } else if (std::strcmp(metric_name, kMetricStageTrack) == 0) {
            return *stage_track_histogram;
        } else if (std::strcmp(metric_name, kMetricStagePublish) == 0) {
            return *stage_publish_histogram;
        }
        throw std::invalid_argument("Invalid metric name: " + std::string(metric_name));
    };
    metrics_api::Histogram<double>& hist = get_histogram(metric_name);
    hist.Record(ms, opentelemetry::common::KeyValueIterableView<MetricAttributes>(attrs),
                opentelemetry::context::Context{});


void MessageHandler::handleCameraMessage(const std::string& topic, const std::string& payload) {
    ObservabilityContext obs_ctx;
    obs_ctx.receive_time = std::chrono::steady_clock::now();

[nit, future] We could make these calls inline member functions:

obs_ctx.captureReceiveTime();

Comment on lines +191 to +195
// Propagate earliest batch's observability context to chunk level
if (!chunk.camera_batches.empty()) {
    chunk.obs_ctx = chunk.camera_batches.front().obs_ctx;
    chunk.obs_ctx.dispatch_time = std::chrono::steady_clock::now();
}

If more than one camera batch exists, only the earliest camera_id is propagated, and we lose information about the other camera_ids that were received and parsed since the previous chunk. As a result, finalize() will tag the tracker.stage.{parse,buffer} latency histograms with only one camera_id label.

Therefore I suggest invalidating chunk.obs_ctx.camera_id here (e.g. setting it to an empty string) and not using the camera_id attribute for stage latencies.

Comment on lines +302 to +304
batch.obs_ctx = std::move(obs_ctx);
batch.obs_ctx.buffer_time = std::chrono::steady_clock::now();
batch.obs_ctx.category = category;

Since we do this multiple times in a loop, we need to copy obs_ctx instead of moving it. After the first iteration the moved-from object is in a valid but unspecified state, so reusing it would propagate garbage.

Comment on lines +94 to +96
build:
  context: .
  dockerfile: Dockerfile.k6

[future] I get the following error on my machine:

failed to solve: DeadlineExceeded: grafana/xk6:1.3.5@sha256:a6ec20c88f5a1087ed97861ddc3b1803a9148436d11f3e6253c7459d6d5781ce: failed to resolve source metadata for docker.io/grafana/xk6:1.3.5@sha256:a6ec20c88f5a1087ed97861ddc3b1803a9148436d11f3e6253c7459d6d5781ce: failed to do request: Head "https://registry-1.docker.io/v2/grafana/xk6/manifests/sha256:a6ec20c88f5a1087ed97861ddc3b1803a9148436d11f3e6253c7459d6d5781ce": dial tcp 18.213.134.106:443: i/o timeout

It would be good to debug this and address the issue in the documentation (it may be a platform-configuration problem).
