refactor: consolidate Observability files (e.g. OTEL docker-compose, md files) #4173

keivenchang · 2025-11-07T01:54:55Z

Overview:

This PR consolidates Dynamo's observability infrastructure to provide a more consistent and easier-to-follow experience. All observability documentation now follows a uniform structure (Overview → Environment Variables → Getting Started → Details), making it simple to find what you need whether you're setting up metrics, tracing, logging, or health checks.

The refactored structure separates concerns clearly: docker-compose.yml handles only core infrastructure (NATS & etcd), while docker-observability.yml provides the complete observability stack. All observability configurations are now centralized under deploy/observability/, eliminating the previous scattered structure across deploy/metrics/, deploy/logging/, and deploy/tracing/.

Documentation improvements include:

Consistent structure: Every observability doc follows the same pattern with Environment Variables tables and Getting Started sections
Single entry point: docs/observability/README.md serves as the unified gateway to all observability topics
Practical examples: Each guide now includes single-GPU Getting Started examples for quick testing
Clear separation: Prometheus/Grafana guide focuses on demo setup, while detailed metrics reference lives in metrics.md
Developer resources: New Metrics Developer Guide for creating custom metrics in Rust/Python

Details:

Refactored services into a core docker-compose.yml (NATS & etcd only) and a separate docker-observability.yml (Prometheus, Grafana, Tempo, exporters)
Consolidated observability configs under deploy/observability/ (previously scattered across metrics/, logging/, tracing/)
Reorganized Kubernetes-specific observability configs under deploy/observability/k8s/
Standardized all observability docs with consistent sections: Overview, Environment Variables, Getting Started, and detailed reference
Created new docs/observability/README.md as unified entry point with navigation table
Refactored Prometheus/Grafana guide to focus on single-machine demos (removed detailed metrics explanations, now in metrics.md)
Added Metrics Developer Guide for creating custom metrics in Rust/Python
Enhanced all docs with Environment Variables tables for easy reference
Added practical Getting Started sections with single-GPU examples for quick testing
Enhanced tracing docs with x-request-id correlation guidance for easier debugging
Switched the OTEL_EXPORT_ENABLED checks to the env_is_truthy utility so OTLP configuration consistently accepts truthy values (true/1/on/yes)

Where should the reviewer start?

Start with docs/observability/README.md, which serves as the entry point to all observability documentation. Note that metrics.md, tracing.md, health-checks.md, and logging.md all follow the same structure: Overview → Environment Variables → Getting Started → Details. This uniform pattern makes it easy to find configuration options (always in a table) and practical examples (always in a Getting Started section).

Then check deploy/observability/ to see how all observability configs are now centralized in one location instead of being scattered across multiple directories.

BEFORE:
=======

deploy/
├── docker-compose.yml (NATS + etcd + Prometheus + Grafana + exporters + monitoring network)
├── metrics/
│   ├── grafana-datasources.yml
│   ├── prometheus.yml
│   ├── grafana_dashboards/
│   │   ├── grafana-dashboard-providers.yml
│   │   ├── grafana-dcgm-metrics.json
│   │   ├── grafana-dynamo-dashboard.json
│   │   └── grafana-kvbm-dashboard.json
│   └── k8s/
│       ├── README.md
│       ├── frontend-podmonitor.yaml
│       ├── planner-podmonitor.yaml
│       ├── worker-podmonitor.yaml
│       └── grafana-dynamo-dashboard-configmap.yaml
├── logging/
│   ├── README.md
│   ├── grafana/
│   │   ├── dashboard.json
│   │   ├── logging-dashboard.yaml
│   │   └── loki-datasource.yaml
│   └── values/
│       ├── alloy-values.yaml
│       └── loki-values.yaml
└── tracing/
    ├── docker-compose.yml (Tempo + Grafana) [DELETED]
    ├── README.md
    ├── trace.png
    ├── tempo.yaml
    └── grafana/provisioning/datasources/
        └── tempo.yaml

docs/observability/
├── (no README.md)
├── metrics.md (basic)
├── health-checks.md (basic)
├── logging.md (basic)
├── prometheus-grafana.md (detailed)
└── (no tracing docs here)


AFTER (with +/- line counts):
==============================

deploy/
├── docker-compose.yml -120
├── docker-observability.yml +137
└── observability/
    ├── tempo-datasource.yml -1/+1
    ├── grafana_dashboards/
    │   └── grafana-dynamo-dashboard.json -1/+1
    └── k8s/
        ├── grafana-dynamo-dashboard-configmap.yaml -1/+1
        └── logging/
            └── README.md -1/+1

docs/observability/
├── README.md +33
├── metrics.md +112
├── metrics-developer-guide.md +450
├── health-checks.md +24
├── logging.md +1
├── tracing.md +60
└── prometheus-grafana.md -325

docs/kubernetes/observability/
├── logging.md -6/+6
└── metrics.md -3/+1

lib/runtime/src/logging.rs -2/+2

README.md +15

lib/runtime/examples/metrics_python/README.md -150/+4

DELETED:
deploy/tracing/docker-compose.yml -35

Related Issues:

Relates to DIS-980

/coderabbit profile chill

…bility.yml

- Moved metrics (Prometheus, Grafana, DCGM, NATS exporter) and tracing (Tempo) into single docker-observability.yml
- Simplified docker-compose.yml to only include core infrastructure (NATS, etcd)
- Reorganized observability files: deploy/metrics/* and deploy/tracing/* -> deploy/observability/*
- Updated documentation: deploy/tracing/README.md -> docs/observability/tracing.md
- Unified Grafana configuration to support both Prometheus and Tempo datasources
- Single observability stack now runs on unified 'server' network for better integration

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

- Move deploy/logging to deploy/observability/k8s/logging for better organization
- Move trace.png to docs/observability/ to be alongside tracing.md
- Fix vllm lazy import of kvbm to avoid Tokio runtime initialization issues
- Add log level documentation explaining DEBUG vs INFO for trace visibility
- Update all references to reflect new paths
- Clarify OTEL environment variable defaults and behavior

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

- Create docs/observability/README.md as central hub
- Split metrics-developer-guide.md from prometheus-grafana.md
- Standardize all docs: Overview, Environment Variables, Getting Started
- Update env variable parsing to accept truthy values (true/1/on/yes)
- Consolidate prometheus-grafana.md as quick start guide
- Improve metrics.md as reference document
- Clarify tracing requirements and overlap with logging
- Fix double space and grammatical issues

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

coderabbitai · 2025-11-07T02:01:19Z

Walkthrough

The changes reorganize observability infrastructure by separating observability services into a standalone compose file, consolidating and restructuring Kubernetes observability configurations, updating environment variable handling for OTEL exports to support truthy values, expanding observability documentation with guides for metrics, tracing, logging, and health checks, and streamlining Docker Compose setup instructions.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Docker Compose Observability Reorganization `README.md`, `deploy/docker-compose.yml`, `deploy/docker-observability.yml`, `deploy/tracing/docker-compose.yml` | Extracted observability services (Prometheus, Grafana, Tempo, DCGM exporter, NATS exporter) from main docker-compose.yml into dedicated deploy/docker-observability.yml; removed tracing services from deploy/tracing/docker-compose.yml; updated README with observability stack setup instructions using bash-specific code block. |
| Grafana Dashboard Configuration Updates `deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json`, `deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml`, `docs/kubernetes/observability/metrics.md` | Renamed Grafana dashboard title to "Dynamo Dashboard (generic)"; simplified Kubernetes metrics documentation to use single kubectl apply command with updated path. |
| Kubernetes Observability Path Restructuring `deploy/observability/k8s/logging/README.md`, `docs/kubernetes/observability/logging.md`, `docs/kubernetes/observability/metrics.md` | Updated relative path references from deploy/logging/* to deploy/observability/k8s/logging/*; adjusted Helm and configuration artifact paths for Loki, Alloy, and Grafana provisioning. |
| Observability Documentation Suite `docs/observability/README.md`, `docs/observability/health-checks.md`, `docs/observability/logging.md`, `docs/observability/metrics.md`, `docs/observability/metrics-developer-guide.md`, `docs/observability/prometheus-grafana.md`, `docs/observability/tracing.md` | Added comprehensive observability guides with environment variables, getting started sections, metric categories, runtime hierarchy, and practical examples; restructured tracing and logging documentation; condensed Prometheus/Grafana guide to single-machine demo setup; added new metrics developer guide for custom metrics creation across Rust and Python. |
| Tempo Datasource & Python Examples `deploy/observability/tempo-datasource.yml`, `lib/bindings/python/examples/metrics/README.md` | Changed Tempo datasource isDefault from true to false; pared down Python metrics example documentation to concise "Running Examples" section with link to developer guide. |
| Environment Variable Handling Refactor `lib/bindings/python/rust/lib.rs`, `lib/runtime/src/logging.rs` | Updated OTEL_EXPORT_ENABLED checks from string equality "1" to truthy value support using env_is_truthy() utility; updated corresponding warning messages and documentation. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Docker observability file: New comprehensive service stack with multiple dependencies—verify image versions, port mappings, volume configurations, and network assignments are correct
Documentation scope: Extensive rewrites across 7+ observability docs with environmental variable tables and new sections—confirm consistency of variable names, defaults, and examples across all guides
Path migrations: Multiple files updated with new kubernetes observability paths—verify all paths are correctly updated and no dangling references remain
Environment variable refactoring: Rust code now checks for truthy values instead of the exact string "1"; ensure all call sites use the new env_is_truthy() utility and that backward compatibility is maintained
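
The path-migration check in the list above can be approximated with a small script. This is a minimal sketch, not part of the PR: the function name `find_stale_references` and the set of scanned file suffixes are assumptions made for illustration, while the three old directory prefixes come from the PR description.

```python
from pathlib import Path

# Old locations that the PR description says were consolidated
# under deploy/observability/.
STALE_PREFIXES = ("deploy/metrics/", "deploy/logging/", "deploy/tracing/")

# Assumption: only these text-like file types are worth scanning.
TEXT_SUFFIXES = {".md", ".yml", ".yaml", ".sh", ".py", ".rs"}


def find_stale_references(root, prefixes=STALE_PREFIXES):
    """Return (relative path, line number, line text) for each line that mentions an old path."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if any(prefix in line for prefix in prefixes):
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Run from the repository root; prints one line per dangling reference.
    for file_name, lineno, line in find_stale_references("."):
        print(f"{file_name}:{lineno}: {line}")
```

A plain substring match keeps the sketch short; it will also flag intentional mentions of the old paths (for example in a changelog), so hits still need a manual look.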

Poem

🐰 Hop hop, the observability stack does gleam,
Separated services fulfilling every dream,
Metrics, traces, and logs now neatly arranged,
Truthiness and defaults—everything changed!
From chaos to clarity, what a fine sight,
Documentation and dashboards now burning so bright! ✨

Pre-merge checks

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Description check | ✅ Passed | The pull request description is comprehensive and well-structured, following the template with clear Overview, Details, Where to start, and Related Issues sections. |
| Title check | ✅ Passed | The title accurately reflects the main objective: consolidating observability files (docker-compose, markdown, configs) from scattered directories into organized deploy/observability and docs/observability structures. |


coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (6)

docs/kubernetes/observability/logging.md (1)

49-49: Path updates are correct; fix minor formatting at line 144.

All referenced configuration files exist in the new centralized paths under deploy/observability/k8s/logging/. The path structure is consistent across Loki values, Alloy values, and Grafana configuration files (lines 49, 63, 113, 116).

Minor correction needed at line 144: change "component type (e.g frontend, worker, etc)" to "component type (e.g., frontend, worker, etc.)" to follow American English style guidelines.

deploy/docker-observability.yml (1)

90-91: Minor: Tempo config file uses YAML extension, verify naming consistency.

The GPU reservation is correctly configured with count: all to reserve all GPUs on the host. Line 92 references ./observability/tempo.yaml (YAML extension). Verify this filename is consistent with your configuration management. Most tools use .yaml or .yml interchangeably, but consistency across the repo is preferred.

docs/observability/metrics.md (2)

62-74: Add language identifier to Prometheus exposition format code block.

Fenced code blocks with a language identifier provide the best readability and syntax highlighting. The code block showing Prometheus exposition format (lines 62-74) lacks a language identifier.

-```
+```text
 # HELP dynamo_component_requests_total Total requests processed
 # TYPE dynamo_component_requests_total counter

Alternatively, if Prometheus format highlighting is available, use prometheus as the language identifier.

172-194: Add language identifier to code blocks showing timeline and concurrency examples.

Lines 172 and 182 contain fenced code blocks without language identifiers. The timeline ASCII diagram (lines 182-194) should use text to preserve formatting without attempting syntax highlighting.

-```
+```text
 Timeline:    0, 1, ...
 Client ────> Frontend:8000 ...

docs/observability/tracing.md (2)

27-76: Clarify docker compose file path for consistency.

The guide uses docker compose -f docker-observability.yml (lines 33, 46, 162). While this works from the deploy/ directory, ensure the current working directory is clear in all instructions. The commands at lines 32-33 show cd deploy before running the compose command, which is good. However, line 162 in the "Stop Services" section does not show the cd deploy step.

Add consistent context at section 6 (lines 156-163) to clarify working directory:

 ### 6. Stop Services
 
 When done, stop the observability stack:
 
 ```bash
+cd deploy
 docker compose -f docker-observability.yml down
 ```

90-116: Disaggregated deployment script reference needs clarification.

Lines 78-116 provide a manual script for disaggregated deployment but reference modifying `disagg.sh`. The comment at line 90 states "You may need to modify `disagg.sh`" but the script provided (lines 92-116) is shown as a complete replacement example.

Clarify whether:
1. `disagg.sh` should be modified before running, or
2. The provided script should be used as a reference/replacement


Add a note clarifying the intent:

```diff
 **Note:** You may need to modify `disagg.sh` to export the tracing environment
 variables before starting each component:

+**Option A:** Modify your existing `disagg.sh` to add the following lines at the top, or
+**Option B:** Use the template below as a complete `disagg.sh` replacement:
```

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f509493 and 0b71cf8.

⛔ Files ignored due to path filters (1)

docs/observability/trace.png is excluded by !**/*.png

📒 Files selected for processing (20)

README.md (1 hunks)
deploy/docker-compose.yml (1 hunks)
deploy/docker-observability.yml (1 hunks)
deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json (1 hunks)
deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml (1 hunks)
deploy/observability/k8s/logging/README.md (1 hunks)
deploy/observability/tempo-datasource.yml (1 hunks)
deploy/tracing/docker-compose.yml (0 hunks)
docs/kubernetes/observability/logging.md (4 hunks)
docs/kubernetes/observability/metrics.md (1 hunks)
docs/observability/README.md (1 hunks)
docs/observability/health-checks.md (1 hunks)
docs/observability/logging.md (5 hunks)
docs/observability/metrics-developer-guide.md (1 hunks)
docs/observability/metrics.md (2 hunks)
docs/observability/prometheus-grafana.md (1 hunks)
docs/observability/tracing.md (4 hunks)
lib/bindings/python/examples/metrics/README.md (1 hunks)
lib/bindings/python/rust/lib.rs (2 hunks)
lib/runtime/src/logging.rs (1 hunks)

💤 Files with no reviewable changes (1)

deploy/tracing/docker-compose.yml

🧰 Additional context used

🧠 Learnings (11)

📓 Common learnings

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.

📚 Learning: 2025-09-16T00:26:43.641Z

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.

Applied to files:

docs/observability/README.md
docs/observability/prometheus-grafana.md
docs/observability/metrics-developer-guide.md
deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml
docs/observability/metrics.md
lib/bindings/python/examples/metrics/README.md

📚 Learning: 2025-07-14T21:25:56.930Z

Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Applied to files:

lib/bindings/python/rust/lib.rs

📚 Learning: 2025-07-18T16:04:31.771Z

Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

Applied to files:

docs/kubernetes/observability/logging.md

📚 Learning: 2025-09-16T00:26:43.641Z

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The ai-dynamo/dynamo team uses _total as a semantic unit suffix across all metric types (including gauges like INFLIGHT_REQUESTS_TOTAL) for internal consistency, as evidenced by patterns in prometheus_names.rs. This is a deliberate architectural choice to prioritize uniform naming conventions over strict Prometheus conventions that reserve _total only for counters.

Applied to files:

docs/observability/prometheus-grafana.md
docs/observability/metrics.md

📚 Learning: 2025-09-16T00:21:44.912Z

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: deploy/metrics/README.md:43-43
Timestamp: 2025-09-16T00:21:44.912Z
Learning: Graham (grahamking) has provided guidance in PR 2914 that metrics should end with _count or _total to indicate units, but this needs to be clarified whether it applies to all metric types or just counters, as Prometheus conventions differ between counters (should have _total) and gauges (should not have _total).

Applied to files:

docs/observability/metrics-developer-guide.md

📚 Learning: 2025-09-24T19:06:57.156Z

Learnt from: ryan-lempka
Repo: ai-dynamo/dynamo PR: 3062
File: lib/llm/src/audit/sink.rs:15-27
Timestamp: 2025-09-24T19:06:57.156Z
Learning: In the audit logging system, full request/response logging requires both DYN_AUDIT_ENABLED=1 environment variable and explicit store=true in the request. Without store=true, only usage statistics are logged (UsageOnly mode). The stderr sink is the initial implementation with plans for additional sinks in the future.

Applied to files:

docs/observability/logging.md

📚 Learning: 2025-06-04T13:09:53.416Z

Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The `DYN_DEPLOYMENT_CONFIG` environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.

Applied to files:

docs/observability/health-checks.md

📚 Learning: 2025-06-05T01:46:15.509Z

Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.

Applied to files:

docs/observability/health-checks.md

📚 Learning: 2025-07-25T22:34:11.384Z

Learnt from: nnshah1
Repo: ai-dynamo/dynamo PR: 2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.

Applied to files:

docs/observability/health-checks.md

📚 Learning: 2025-09-16T00:27:43.992Z

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:75-79
Timestamp: 2025-09-16T00:27:43.992Z
Learning: In the ai-dynamo/dynamo codebase, the project uses "_total" suffix for all Prometheus metrics including gauges like inflight_requests, which differs from standard Prometheus conventions. The constant work_handler::INFLIGHT_REQUESTS does not exist - only work_handler::INFLIGHT_REQUESTS_TOTAL exists and should be used for the inflight requests gauge metric.

Applied to files:

docs/observability/metrics.md

🧬 Code graph analysis (2)

lib/runtime/src/logging.rs (1)

lib/runtime/src/config.rs (1)

env_is_truthy (422-427)

lib/bindings/python/rust/lib.rs (1)

lib/runtime/src/config.rs (1)

env_is_truthy (422-427)

🪛 GitHub Check: Check for broken markdown links

lib/bindings/python/examples/metrics/README.md

[failure] 10-10:
Broken link: Metrics Developer Guide - Python Section - View: https://github.com/ai-dynamo/dynamo/blob/HEAD/lib/bindings/python/examples/metrics/README.md?plain=1#L10

🪛 LanguageTool

docs/kubernetes/observability/logging.md

[uncategorized] ~144-~144: The abbreviation “e.g.” (= for example) requires two periods.
Context: ...loyment, namespace, and component type (e.g frontend, worker, etc).

(E_G)

[style] ~144-~144: In American English, abbreviations like “etc.” require a period.
Context: ...d component type (e.g frontend, worker, etc).

(ETC_PERIOD)

docs/observability/prometheus-grafana.md

[duplication] ~28-~28: Possible typo: you repeated a word.
Context: ...tes Install these on your machine: - Docker - [Docker Compose](https://docs.docker.com/compos...

(ENGLISH_WORD_REPEAT_RULE)

docs/observability/metrics-developer-guide.md

[uncategorized] ~36-~36: Loose punctuation mark.
Context: ...Methods - .metrics().create_counter(): Create a counter metric - `.metrics().c...

(UNLIKELY_OPENING_PUNCTUATION)

docs/observability/metrics.md

[uncategorized] ~102-~102: Loose punctuation mark.
Context: ... - dynamo_component_inflight_requests: Requests currently being processed (gau...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~113-~113: Loose punctuation mark.
Context: ...dynamo_component_kvstats_active_blocks: Number of active KV cache blocks curren...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~126-~126: Loose punctuation mark.
Context: ...unctionality: - dynamo_preprocessor_*: Metrics specific to preprocessor compon...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~132-~132: Loose punctuation mark.
Context: ...: - dynamo_frontend_inflight_requests: Inflight requests (gauge) - `dynamo_fro...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~150-~150: Loose punctuation mark.
Context: ... dynamo_frontend_model_total_kv_blocks: Total KV blocks available for a worker ...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~157-~157: Loose punctuation mark.
Context: ...- dynamo_frontend_model_context_length: Maximum context length for a worker ser...

(UNLIKELY_OPENING_PUNCTUATION)

🪛 markdownlint-cli2 (0.18.1)

docs/observability/metrics.md

62-62: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

172-172: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

182-182: Fenced code blocks should have a language specified

(MD040, fenced-code-language)
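
For readers who want to reproduce the MD040 findings above without running markdownlint, the check can be roughly approximated in a few lines. This is a simplified sketch, not the markdownlint implementation: the function name `find_unlabeled_fences` is hypothetical, and the fence detection covers only the common cases rather than the full CommonMark rules.

```python
import re

# A fence line: up to three leading spaces, then a run of 3+ backticks or
# tildes, followed by an optional info string (the language identifier).
FENCE_RE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")


def find_unlabeled_fences(markdown_text):
    """Return 1-based line numbers of opening fences that have no language identifier."""
    unlabeled = []
    open_fence = None  # (fence character, fence length) while inside a code block
    for lineno, line in enumerate(markdown_text.splitlines(), start=1):
        match = FENCE_RE.match(line)
        if match is None:
            continue
        marker, info = match.group(1), match.group(2).strip()
        if open_fence is None:
            # Opening fence: remember it and record it if it carries no language.
            open_fence = (marker[0], len(marker))
            if not info:
                unlabeled.append(lineno)
        elif marker[0] == open_fence[0] and len(marker) >= open_fence[1] and not info:
            # Closing fence: same character, at least as long, no info string.
            open_fence = None
    return unlabeled


if __name__ == "__main__":
    fence = "`" * 3  # built at runtime so this example stays a single code block
    sample = "\n".join([fence, "# HELP ...", fence, "", fence + "text", "Timeline", fence])
    print(find_unlabeled_fences(sample))
```

Tracking the open fence is what keeps closing fences from being reported as unlabeled openers.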

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)

GitHub Check: trtllm (arm64)
GitHub Check: operator (amd64)
GitHub Check: sglang (arm64)
GitHub Check: sglang (amd64)
GitHub Check: trtllm (amd64)
GitHub Check: vllm (amd64)
GitHub Check: vllm (arm64)
GitHub Check: operator (arm64)
GitHub Check: Build and Test - dynamo
GitHub Check: tests (launch/dynamo-run)
GitHub Check: clippy (launch/dynamo-run)
GitHub Check: tests (lib/bindings/python)
GitHub Check: tests (lib/runtime/examples)
GitHub Check: clippy (.)
GitHub Check: clippy (lib/bindings/python)
GitHub Check: tests (.)

🔇 Additional comments (27)

lib/runtime/src/logging.rs (1)

147-150: LGTM! Truthy value support improves usability.

The change from exact "1" matching to env_is_truthy() improves the developer experience by accepting common truthy values (1, true, on, yes). The documentation comment accurately reflects the new behavior.
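
To make the behavior change concrete, here is a minimal Python sketch of truthy-value parsing. It is not the Rust `env_is_truthy` utility from lib/runtime/src/config.rs: the helper name `env_is_truthy_sketch`, the case-insensitive comparison, and the whitespace trimming are assumptions for illustration. Only the accepted values (1, true, on, yes) are taken from the text above.

```python
import os

# Values the review text lists as truthy.
TRUTHY_VALUES = {"1", "true", "on", "yes"}


def env_is_truthy_sketch(name, environ=os.environ):
    """Return True when the variable is set to one of the truthy values.

    Hypothetical helper: case-insensitive matching and whitespace trimming
    are assumptions, not confirmed behavior of the Rust utility.
    """
    raw = environ.get(name)
    if raw is None:
        return False
    return raw.strip().lower() in TRUTHY_VALUES


if __name__ == "__main__":
    # A plain dict stands in for the process environment in this demo.
    sample_env = {"OTEL_EXPORT_ENABLED": "true"}
    print(env_is_truthy_sketch("OTEL_EXPORT_ENABLED", sample_env))
```

Under these assumptions the previously required value "1" is still accepted, which is the backward-compatibility property the review asks to preserve.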

lib/bindings/python/rust/lib.rs (2)

128-134: LGTM! Consistent truthy value support.

The change to use env_is_truthy() and the updated warning message are consistent with the changes in lib/runtime/src/logging.rs. The warning text correctly reflects that any truthy value (not just "=1") will trigger the deferred initialization.

449-453: LGTM! Proper deferred initialization.

The truthy check here correctly defers logging initialization until the Tokio runtime is available, which is required for the OTEL exporter. This mirrors the pattern in the _core module initialization.

docs/observability/metrics-developer-guide.md (1)

1-270: Excellent comprehensive developer guide!

This is a well-structured and thorough metrics developer guide that covers:

Clear getting started instructions with environment variables

Both Rust and Python API usage with practical examples

Vector metrics with labels

Advanced features (custom buckets, constant labels)

Update patterns and examples

The document effectively consolidates metrics guidance and provides a strong foundation for developers working with Dynamo metrics.

docs/observability/health-checks.md (2)

14-24: Well-documented environment variables.

The new environment variables table provides clear, comprehensive information with descriptions, defaults, and examples. This follows a consistent format with other observability documentation.

25-48: Helpful Getting Started section.

The new Getting Started section provides clear, practical examples for enabling and testing health checks on a single GPU. This aligns well with the PR's objective to improve observability documentation and provide better getting-started guidance.

docs/observability/prometheus-grafana.md (3)

1-21: Improved overview and environment variables documentation.

The updated overview clearly focuses on single-machine demo setup, and the environment variables table follows the consistent format used across other observability documentation.

22-84: Excellent Getting Started guide.

The new Getting Started section provides a comprehensive, step-by-step guide that covers:

Prerequisites and installation

Starting infrastructure and observability services

Launching Dynamo components with proper configuration

Testing with example requests

Accessing web interfaces with helpful notes about remote access

This is very user-friendly and practical.

88-167: Well-structured topology and configuration guidance.

The topology diagram provides a clear visualization of service relationships, and the configuration/troubleshooting sections offer practical guidance. The reference to the Metrics Developer Guide is a good addition for users who want to create custom metrics.

docs/observability/logging.md (3)

27-52: Improved environment variables documentation and Getting Started section.

The restructured environment variables table follows the consistent format used across observability documentation, and the new Getting Started section provides clear, practical examples for enabling JSONL logging.

102-141: Clear explanation of OpenTelemetry logging integration.

The renamed section "Logging of OpenTelemetry Tracing" better reflects its purpose, and the clarification about the overlap with distributed tracing documentation is helpful. The note distinguishing between seeing trace context in logs versus exporting to a trace backend is particularly useful.

142-258: Comprehensive trace and span information examples.

The updated section title accurately reflects the content, and the detailed examples effectively demonstrate how trace context appears in logs. The examples showing custom request IDs with the x-request-id header are particularly practical for correlating traces with application-level request tracking.
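
As a rough illustration of the x-request-id correlation mentioned above, the sketch below builds (but does not send) an HTTP request carrying a caller-chosen request ID. The helper name `build_traced_request`, the `/v1/chat/completions` path, the payload, and the choice of a UUID as the ID are assumptions for this example. The port 8000 frontend appears elsewhere in this review; the exact header handling on the server side is described in the logging and tracing docs, not here.

```python
import json
import urllib.request
import uuid


def build_traced_request(base_url, payload, request_id=None):
    """Build a POST request that carries an x-request-id header.

    Hypothetical helper: returns the request together with the ID so the
    caller can search logs and traces for the same value afterwards.
    """
    request_id = request_id or str(uuid.uuid4())
    req = urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-request-id": request_id},
        method="POST",
    )
    return req, request_id


if __name__ == "__main__":
    # "example-model" is a placeholder, not a model named in this PR.
    req, request_id = build_traced_request(
        "http://localhost:8000", {"model": "example-model", "messages": []}
    )
    print("x-request-id to search for in logs and traces:", request_id)
    # urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```

Generating the ID on the client side means the same value is known before the request is sent, so it can be attached to client logs as well.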

lib/bindings/python/examples/metrics/README.md (1)

10-10: Link is valid; no changes needed.

Verification confirms the documentation link is functional. The file docs/observability/metrics-developer-guide.md exists and contains the Python Metrics API section referenced by the anchor. The static analysis flag appears to be a false positive.

deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml (1)

1005-1005: Dashboard title update is consistent with observability reorganization.

The title change aligns with the PR's generic naming convention and matches the corresponding update in deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json. This is a safe cosmetic change with no impact on dashboard functionality or metrics queries.

deploy/observability/tempo-datasource.yml (1)

12-12: Tempo datasource correctly set as non-default.

Setting isDefault: false appropriately makes Prometheus the default datasource while allowing Tempo queries when explicitly selected. This is the correct configuration for a multi-datasource observability setup where metrics are primary and tracing is supplementary.

deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json (1)

1023-1023: Dashboard title matches configmap and supports generic naming convention.

The title update is consistent with the parallel change in the ConfigMap and maintains the semantic meaning of the dashboard while supporting the new "generic" naming introduced in this PR. All dashboard panels and queries remain intact.

deploy/observability/k8s/logging/README.md (1)

3-3: Relative path correctly updated for directory relocation.

The path has been properly adjusted to reflect the directory restructuring from deploy/logging/ to deploy/observability/k8s/logging/. The new path ../../../../docs/kubernetes/observability/logging.md correctly navigates from the new file location to the documentation.
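The path arithmetic behind this check can be sketched in a few lines. The block below is an illustrative, std-only helper (`resolve_relative` is a hypothetical name, not something in the repository) that resolves a relative link lexically from the file that contains it, without touching the filesystem:

```rust
use std::path::{Component, Path, PathBuf};

/// Lexically resolve `link` against the directory that contains `from_file`.
/// Hypothetical helper for illustration; it never touches the filesystem.
/// A link that climbs above the starting root is silently clamped at the root.
fn resolve_relative(from_file: &str, link: &str) -> PathBuf {
    // Start from the directory holding the file that contains the link.
    let mut out = Path::new(from_file)
        .parent()
        .map(Path::to_path_buf)
        .unwrap_or_default();
    for part in Path::new(link).components() {
        match part {
            // ".." removes the last directory collected so far.
            Component::ParentDir => {
                out.pop();
            }
            // "." leaves the path unchanged.
            Component::CurDir => {}
            // Anything else is appended as-is.
            other => out.push(other.as_os_str()),
        }
    }
    out
}
```

Calling `resolve_relative("deploy/observability/k8s/logging/README.md", "../../../../docs/kubernetes/observability/logging.md")` walks up four directories to the repository root and then down into `docs/kubernetes/observability/logging.md`, which is the target the comment describes.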

docs/kubernetes/observability/metrics.md (1)

130-132: Documentation simplified and clarified.

Replacing the multi-step pushd/popd pattern with a single direct kubectl apply command improves clarity and reduces cognitive load. The simplified instruction is easier to follow and less error-prone for users.

README.md (2)

104-107: Code block syntax highlighting improved.

Specifying bash as the code fence language enables proper syntax highlighting for users reading the README, improving clarity and professional appearance.

109-117: New observability section is clear and well-positioned.

The optional observability stack section appropriately documents the separate deployment path and clearly communicates:

What services are included (Prometheus, Grafana, Tempo, metrics exporters)

How to deploy (docker compose -f deploy/docker-observability.yml)

How to access it (Grafana credentials and port)

Marking this as optional correctly reflects that it's not required for basic Dynamo operation, improving documentation clarity for new users. Credentials (dynamo/dynamo) and port (3000) match the PR objectives.

docs/observability/README.md (1)

1-32: Observability documentation hub verified and complete.

The README effectively centralizes observability configuration with well-organized tables, shared variable annotations (†), and appropriate separation between user and developer guides. All referenced guide files have been confirmed to exist.

deploy/docker-observability.yml (2)

106-127: Clarify Grafana security configuration.

The Grafana service has several security-related settings that should be documented or reviewed:

user: root (Line 90 for Tempo is reasonable for file permissions, but Line 106-127 uses default user) — consider whether root privileges are necessary for Grafana

GF_SECURITY_DISABLE_INITIAL_ADMIN_CREATION=false (Line 125) — contradicts the intent of the other security flags; this will NOT disable admin creation despite the name

Default credentials dynamo/dynamo are acceptable for local development but should be rotated in production

Consider adding a comment to clarify these are development defaults and should not be used in production, or add safeguards for production deployments.

64-66: All referenced configuration files are present.

Verification confirms that all required configuration files and directories referenced in the compose file exist in the deploy/observability/ directory. The observability stack is properly configured and ready for use.

docs/observability/metrics.md (2)

29-54: Environment variable names and defaults are correct.

Verification confirms DYN_SYSTEM_ENABLED and DYN_SYSTEM_PORT match the codebase exactly. The system metrics server is disabled by default and enabled when DYN_SYSTEM_ENABLED=true. The example port 8081 is consistent with actual usage throughout the codebase.

96-120: Fix documented metric names to match implementation.

Two documentation errors found:

Line 106: dynamo_component_system_uptime_seconds should be dynamo_component_uptime_seconds (remove "system_" prefix)

Lines 110-113: KVStats metrics are missing the dynamo_component_ prefix in documentation. The actual metric names are:

kvstats_active_blocks (not dynamo_component_kvstats_active_blocks)

kvstats_total_blocks (not dynamo_component_kvstats_total_blocks)

kvstats_gpu_cache_usage_percent (not dynamo_component_kvstats_gpu_cache_usage_percent)

kvstats_gpu_prefix_cache_hit_rate (not dynamo_component_kvstats_gpu_prefix_cache_hit_rate)

The work handler metrics (lines 102-105) are correctly named in the documentation.

⛔ Skipped due to learnings

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:75-79
Timestamp: 2025-09-16T00:27:43.992Z
Learning: In the ai-dynamo/dynamo codebase, the project uses "_total" suffix for all Prometheus metrics including gauges like inflight_requests, which differs from standard Prometheus conventions. The constant work_handler::INFLIGHT_REQUESTS does not exist - only work_handler::INFLIGHT_REQUESTS_TOTAL exists and should be used for the inflight requests gauge metric.

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.

docs/observability/tracing.md (2)

18-25: All environment variable defaults verified and confirmed correct:

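| Variable | Documentation Default | Implementation Default | Status |
|----------|-----------------------|------------------------|--------|
| `DYN_LOGGING_JSONL` | `false` | `env_is_truthy()` → `false` | ✓ |
| `OTEL_EXPORT_ENABLED` | `false` | `env_is_truthy()` → `false` | ✓ |
| `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` | `http://localhost:4317` | `http://localhost:4317` | ✓ |
| `OTEL_SERVICE_NAME` | `dynamo` | `dynamo` | ✓ |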

Tempo OTLP gRPC port 4317 confirmed in deploy/docker-observability.yml (line 96).

56-58: No issues found—environment variable syntax is correct.

The OTEL_EXPORT_ENABLED=true syntax in docs/observability/tracing.md is valid. The application's is_truthy function (lib/config/src/lib.rs:20-22, lib/runtime/src/config.rs:396-398) explicitly accepts "true" as a truthy value via pattern matching: matches!(val.to_lowercase().as_str(), "1" | "true" | "on" | "yes"). The example aligns with the documented acceptable values listed in the code comment.
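A minimal sketch of the truthy check quoted above, to make the accepted values concrete. The `matches!` expression is taken from the review text; the function signature and the `env_flag` wrapper are assumptions for illustration and may not match the actual code in lib/config or lib/runtime:

```rust
/// Case-insensitive truthy check, using the pattern quoted in the review.
/// The signature is illustrative; the real function may differ.
fn is_truthy(val: &str) -> bool {
    matches!(val.to_lowercase().as_str(), "1" | "true" | "on" | "yes")
}

/// Hypothetical wrapper: an unset variable counts as false, which matches
/// the `false` defaults listed for DYN_LOGGING_JSONL and OTEL_EXPORT_ENABLED.
fn env_flag(value: Option<&str>) -> bool {
    value.map(is_truthy).unwrap_or(false)
}
```

Under this sketch, `OTEL_EXPORT_ENABLED=true`, `=1`, `=on`, and `=YES` all enable export, while `=0`, `=false`, an empty string, or an unset variable leave it disabled. Note that the sketch does not trim whitespace, so a value such as `" true"` would not count as truthy here.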

coderabbitai · 2025-11-07T02:01:22Z

deploy/docker-compose.yml

+# Bare minimum infrastructure services for Dynamo.
+# For observability (metrics, tracing, dashboards), use docker-compose-observability.yml


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify the observability compose file name.

The comment references docker-compose-observability.yml, but the AI summary mentions docker-observability.yml. Please verify the correct filename.

🏁 Script executed:

    #!/bin/bash
    # Description: Verify the observability compose file name
    if [ -f "deploy/docker-observability.yml" ]; then
      echo "✓ Found: deploy/docker-observability.yml"
    elif [ -f "deploy/docker-compose-observability.yml" ]; then
      echo "✓ Found: deploy/docker-compose-observability.yml"
    else
      echo "✗ Neither deploy/docker-observability.yml nor deploy/docker-compose-observability.yml found"
    fi

Length of output: 40

Update comment to reference correct observability compose file name.

The comment at lines 4-5 references docker-compose-observability.yml, but the actual file is docker-observability.yml. Update the comment to reference the correct filename:

# For observability (metrics, tracing, dashboards), use docker-observability.yml

🤖 Prompt for AI Agents

In deploy/docker-compose.yml around lines 4 to 5, the inline comment references the wrong observability compose filename; update the comment to reference docker-observability.yml instead of docker-compose-observability.yml so the comment reads: "# For observability (metrics, tracing, dashboards), use docker-observability.yml".

@keivenchang Thank you for the quick fix! 🎉

🧠 Learnings used

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.

docs/kubernetes/observability/logging.md

docs/observability/tracing.md

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

keivenchang · 2025-11-07T02:08:56Z

deploy/tracing/docker-compose.yml

@@ -1,35 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.


This is now subsumed in the deploy/docker-observability.yml file (plus other stuff).

deploy/docker-compose.yml

keivenchang · 2025-11-07T02:10:29Z

deploy/docker-observability.yml

@@ -0,0 +1,137 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.


This new file is a merge of the old deploy/tracing/docker-compose.yml and the old deploy/docker-compose.yml, containing only observability related services.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

- Fix docker-compose.yml comment to reference correct filename (docker-observability.yml not docker-compose-observability.yml)
- Fix abbreviation formatting in kubernetes logging docs (e.g., etc.)

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

Correct relative path to metrics-developer-guide.md (needs 6 levels up, not 5)

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

deploy/observability/k8s/logging/grafana/dashboard.json

nnshah1 · 2025-11-07T16:39:58Z

deploy/observability/k8s/logging/README.md

@rmccorm4, @athreesh - do we want to keep the README.md files or just remove them?

For this PR let's keep them as-is; we will revisit after the cleanup. As a developer I like the pointer to the docs, since I reach for the code folders before the docs folders.

nnshah1 · 2025-11-07T16:48:30Z

docs/observability/health-checks.md

@tzulingk we need to update this for the canary health check

nnshah1 · 2025-11-07T16:58:13Z

docs/observability/logging.md

+| Variable | Description | Default | Example |
+|----------|-------------|---------|---------|
+| `DYN_LOGGING_JSONL` | Enable JSONL logging format | `false` | `true` |
+| `DYN_LOG` | Log level: `info` or `debug` | `info` | `debug` |


I think we lost some info here about being able to change the log level per target.

You are right, added back.

nnshah1 · 2025-11-07T16:58:27Z

docs/observability/logging.md

+|----------|-------------|---------|---------|
+| `DYN_LOGGING_JSONL` | Enable JSONL logging format | `false` | `true` |
+| `DYN_LOG` | Log level: `info` or `debug` | `info` | `debug` |
+| `DYN_LOG_USE_LOCAL_TZ` | Use local timezone for timestamps | `false` | `true` |


we should indicate the default is UTC

docs/observability/logging.md

docs/observability/tracing.md

lib/bindings/python/examples/metrics/README.md

nnshah1

The only thing that needs changing is some tweaking of the logging doc to reduce confusion with OpenTelemetry. Hope that makes sense. Otherwise LGTM, great work!

- Remove redundant 'grafana-' prefix from dashboard filenames
- Remove redundant '-dashboard' suffix from JSON files
- Simplify dcgm-metrics.json copyright to 2-line SPDX format
- Update all documentation references to new filenames
- Add detailed DYN_LOG per-target syntax to observability docs
- Clarify DYN_LOG_USE_LOCAL_TZ default timezone (UTC)
- Add Dynamo logging variables table to K8s logging docs

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

- Centralize docker-compose instructions in README.md
- Add topology diagram and configuration files to README.md
- Standardize all docs to use 'Getting Started Quickly' heading
- Remove duplicate docker-compose commands from individual guides
- All guides now reference README.md for observability stack setup
- Update DYN_LOG documentation with per-target syntax examples
- Clarify DYN_LOG_USE_LOCAL_TZ default timezone (UTC)
- Add Dynamo logging variables to K8s logging docs
- Remove lib/bindings/python/examples/metrics/README.md
- Improve consistency across all observability documentation

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

…-compose-files

Resolved conflicts:
- docs/observability/health-checks.md: Removed duplicate environment variables table
- docs/observability/logging.md: Kept simplified 'OTLP export enabled' message
- docs/observability/prometheus-grafana.md: Kept link to metrics-developer-guide.md
- lib/bindings/python/examples/metrics/README.md: Accepted deletion

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

Add wait_for_service_ready() call to ensure the HTTP service is fully started and listening before sending requests. Without this, the test could fail with 404 errors if requests arrive before the server is ready.

This race condition has existed since the test was written, but became more visible after recent changes (e.g. KeyValueStoreManager refactor in Oct 2025). The wait_for_service_ready() helper was added in July 2025 for HTTP disconnect tests but the original test_http_service was never updated to use it. This follows the pattern used by other tests in the same file.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
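For readers unfamiliar with this kind of fix, the general technique is to poll the port until something accepts connections before sending the first request. The block below is a blocking, std-only sketch of that idea; `wait_until_listening` is a hypothetical name, and the real `wait_for_service_ready()` helper is async and its implementation is not shown in this thread, so it may work differently (for example, by probing an HTTP endpoint).

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `addr` until a TCP connection succeeds or `timeout` elapses.
/// Returns true if the port accepted a connection in time.
/// Hypothetical helper for illustration only.
fn wait_until_listening(addr: SocketAddr, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        // A successful connect means a listener is bound and accepting.
        if TcpStream::connect_timeout(&addr, Duration::from_millis(100)).is_ok() {
            return true;
        }
        // Back off briefly so the loop does not spin while the server starts.
        sleep(Duration::from_millis(20));
    }
    false
}
```

A test would call this right after spawning the server task and fail fast if it returns false, instead of sending requests into a port that may not be bound yet.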

- Update logging.md to clarify that trace/span information uses OpenTelemetry format/libraries but doesn't require an OpenTelemetry backend (Tempo/Jaeger) - Standardize copyright headers to 2-line SPDX format across observability docs - Remove full Apache license text from logging.md - Add missing copyright header to prometheus-grafana.md Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

rmccorm4 · 2025-11-08T05:09:43Z

lib/llm/tests/http-service.rs

    let task = tokio::spawn(async move { service.run(token.clone()).await });

+    // Wait for the service to be ready before proceeding
+    wait_for_service_ready(port).await;


was this flaky from a race condition or something before?

I saw flaky behavior from this test in CI +1

rmccorm4 · 2025-11-08T05:12:48Z

docs/observability/metrics.md

-**Component Metrics**: The core Dynamo backend system automatically exposes metrics with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework. These include request counts, processing times, byte transfers, and system uptime metrics. See [prometheus-grafana.md](prometheus-grafana.md#available-metrics) for the complete list of component metrics.
+| Variable | Description | Default | Example |
+|----------|-------------|---------|---------|
+| `DYN_SYSTEM_ENABLED` | Enable system metrics/health server | `false` | `true` |


Note: any straggling references to DYN_SYSTEM_ENABLED in this PR will need to be removed in your part 2 or a part 3 follow-up.

mohammedabdulwahhab · 2025-11-08T17:26:50Z

lib/runtime/src/logging.rs

 }

-/// Check if OTLP trace exporting is enabled (set OTEL_EXPORT_ENABLED=1 to enable)
+/// Check if OTLP trace exporting is enabled (set OTEL_EXPORT_ENABLED to a truthy value: 1, true, on, yes)


mohammedabdulwahhab · 2025-11-08T17:38:58Z

@keivenchang Just took a tour through your branch and it looks awesome.

Here are some quick fixes I found:

- Your refactoring breaks some of the links here: https://github.com/ai-dynamo/dynamo/tree/keivenchang/DIS-980__consolidate-OTEL-docker-compose-files/deploy#additional-resources. The logging and monitoring links return 404s.
- We can remove the pod-monitor resources under the k8s folder; they are no longer needed.
- Similar to https://github.com/ai-dynamo/dynamo/blob/keivenchang/DIS-980__consolidate-OTEL-docker-compose-files/deploy/observability/k8s/README.md, should we add a README at the root of the observability folder that points to the docs? This would help users who land in the observability folder before the docs.

keivenchang added 3 commits November 7, 2025 01:45

keivenchang self-assigned this Nov 7, 2025

keivenchang requested review from a team as code owners November 7, 2025 01:54

pull-request-size bot added the size/XXL label Nov 7, 2025

coderabbitai bot reviewed Nov 7, 2025

Add new observability docs to toctree

9cd7cf1

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

keivenchang commented Nov 7, 2025

keivenchang added 2 commits November 7, 2025 02:16

Remove Optional Observability Stack section from README

8a80eee

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

copy-pr-bot bot temporarily deployed to GITLAB November 7, 2025 02:27 Inactive

keivenchang changed the title ~~Keivenchang/dis 980 consolidate otel docker compose files~~ refactor: consolidate OTEL docker compose files Nov 7, 2025

github-actions bot added the refactor label Nov 7, 2025

keivenchang changed the title ~~refactor: consolidate OTEL docker compose files~~ refactor: consolidate Observability files (e.g. OTEL docker-compose, md files) Nov 7, 2025

keivenchang added 2 commits November 6, 2025 18:31

Fix broken link anchor in Python metrics README

ef2e529

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

Fix broken link in Python metrics examples README

4c1e061

Correct relative path to metrics-developer-guide.md (needs 6 levels up, not 5)

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>