Skip to content

Conversation

@keivenchang
Copy link
Contributor

@keivenchang keivenchang commented Nov 7, 2025

Overview:

This PR consolidates Dynamo's observability infrastructure to provide a more consistent and easier-to-follow experience. All observability documentation now follows a uniform structure (Overview → Environment Variables → Getting Started → Details), making it simple to find what you need whether you're setting up metrics, tracing, logging, or health checks.

The refactored structure separates concerns clearly: docker-compose.yml handles only core infrastructure (NATS & etcd), while docker-observability.yml provides the complete observability stack. All observability configurations are now centralized under deploy/observability/, eliminating the previous scattered structure across deploy/metrics/, deploy/logging/, and deploy/tracing/.

Documentation improvements include:

  • Consistent structure: Every observability doc follows the same pattern with Environment Variables tables and Getting Started sections
  • Single entry point: docs/observability/README.md serves as the unified gateway to all observability topics
  • Practical examples: Each guide now includes single-GPU Getting Started examples for quick testing
  • Clear separation: Prometheus/Grafana guide focuses on demo setup, while detailed metrics reference lives in metrics.md
  • Developer resources: New Metrics Developer Guide for creating custom metrics in Rust/Python

Details:

  • Refactor services into separate docker-compose.yml (NATS & etcd only) and docker-observability.yml (Prometheus, Grafana, Tempo, exporters)
  • Consolidated observability configs under deploy/observability/ (previously scattered across metrics/, logging/, tracing/)
  • Reorganized Kubernetes-specific observability configs under deploy/observability/k8s/
  • Standardized all observability docs with consistent sections: Overview, Environment Variables, Getting Started, and detailed reference
  • Created new docs/observability/README.md as unified entry point with navigation table
  • Refactored Prometheus/Grafana guide to focus on single-machine demos (removed detailed metrics explanations, now in metrics.md)
  • Added Metrics Developer Guide for creating custom metrics in Rust/Python
  • Enhanced all docs with Environment Variables tables for easy reference
  • Added practical Getting Started sections with single-GPU examples for quick testing
  • Enhanced tracing docs with x-request-id correlation guidance for easier debugging
  • Updated env_is_truthy utility usage for OTLP configuration consistency

Where should the reviewer start?

Review docs/observability/README.md, which serves as the entry point to all observability documentation. Notice how metrics.md, tracing.md, health-checks.md, and logging.md all follow the same consistent structure: Overview → Environment Variables → Getting Started → Details. This uniform pattern makes it easy to quickly find configuration options (always in a table) and get started with practical examples (always in a Getting Started section).

Then check deploy/observability/ to see how all observability configs are now centralized in one location instead of being scattered across multiple directories.

BEFORE:
=======

deploy/
├── docker-compose.yml (NATS + etcd + Prometheus + Grafana + exporters + monitoring network)
├── metrics/
│   ├── grafana-datasources.yml
│   ├── prometheus.yml
│   ├── grafana_dashboards/
│   │   ├── grafana-dashboard-providers.yml
│   │   ├── grafana-dcgm-metrics.json
│   │   ├── grafana-dynamo-dashboard.json
│   │   └── grafana-kvbm-dashboard.json
│   └── k8s/
│       ├── README.md
│       ├── frontend-podmonitor.yaml
│       ├── planner-podmonitor.yaml
│       ├── worker-podmonitor.yaml
│       └── grafana-dynamo-dashboard-configmap.yaml
├── logging/
│   ├── README.md
│   ├── grafana/
│   │   ├── dashboard.json
│   │   ├── logging-dashboard.yaml
│   │   └── loki-datasource.yaml
│   └── values/
│       ├── alloy-values.yaml
│       └── loki-values.yaml
└── tracing/
    ├── docker-compose.yml (Tempo + Grafana) [DELETED]
    ├── README.md
    ├── trace.png
    ├── tempo.yaml
    └── grafana/provisioning/datasources/
        └── tempo.yaml

docs/observability/
├── (no README.md)
├── metrics.md (basic)
├── health-checks.md (basic)
├── logging.md (basic)
├── prometheus-grafana.md (detailed)
└── (no tracing docs here)


AFTER (with +/- line counts):
==============================

deploy/
├── docker-compose.yml -120
├── docker-observability.yml +137
└── observability/
    ├── tempo-datasource.yml -1/+1
    ├── grafana_dashboards/
    │   └── grafana-dynamo-dashboard.json -1/+1
    └── k8s/
        ├── grafana-dynamo-dashboard-configmap.yaml -1/+1
        └── logging/
            └── README.md -1/+1

docs/observability/
├── README.md +33
├── metrics.md +112
├── metrics-developer-guide.md +450
├── health-checks.md +24
├── logging.md +1
├── tracing.md +60
└── prometheus-grafana.md -325

docs/kubernetes/observability/
├── logging.md -6/+6
└── metrics.md -3/+1

lib/runtime/src/logging.rs -2/+2

README.md +15

lib/runtime/examples/metrics_python/README.md -150/+4

DELETED:
deploy/tracing/docker-compose.yml -35

Related Issues:

Relates to DIS-980

/coderabbit profile chill

…bility.yml

- Moved metrics (Prometheus, Grafana, DCGM, NATS exporter) and tracing (Tempo) into single docker-observability.yml
- Simplified docker-compose.yml to only include core infrastructure (NATS, etcd)
- Reorganized observability files: deploy/metrics/* and deploy/tracing/* -> deploy/observability/*
- Updated documentation: deploy/tracing/README.md -> docs/observability/tracing.md
- Unified Grafana configuration to support both Prometheus and Tempo datasources
- Single observability stack now runs on unified 'server' network for better integration

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
- Move deploy/logging to deploy/observability/k8s/logging for better organization
- Move trace.png to docs/observability/ to be alongside tracing.md
- Fix vllm lazy import of kvbm to avoid Tokio runtime initialization issues
- Add log level documentation explaining DEBUG vs INFO for trace visibility
- Update all references to reflect new paths
- Clarify OTEL environment variable defaults and behavior

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
- Create docs/observability/README.md as central hub
- Split metrics-developer-guide.md from prometheus-grafana.md
- Standardize all docs: Overview, Environment Variables, Getting Started
- Update env variable parsing to accept truthy values (true/1/on/yes)
- Consolidate prometheus-grafana.md as quick start guide
- Improve metrics.md as reference document
- Clarify tracing requirements and overlap with logging
- Fix double space and grammatical issues

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
@keivenchang keivenchang self-assigned this Nov 7, 2025
@keivenchang keivenchang requested review from a team as code owners November 7, 2025 01:54
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 7, 2025

Walkthrough

The changes reorganize observability infrastructure by separating observability services into a standalone compose file, consolidating and restructuring Kubernetes observability configurations, updating environment variable handling for OTEL exports to support truthy values, expanding observability documentation with guides for metrics, tracing, logging, and health checks, and streamlining Docker Compose setup instructions.

Changes

Cohort / File(s) Summary
Docker Compose Observability Reorganization
README.md, deploy/docker-compose.yml, deploy/docker-observability.yml, deploy/tracing/docker-compose.yml
Extracted observability services (Prometheus, Grafana, Tempo, DCGM exporter, NATS exporter) from main docker-compose.yml into dedicated deploy/docker-observability.yml; removed tracing services from deploy/tracing/docker-compose.yml; updated README with observability stack setup instructions using bash-specific code block.
Grafana Dashboard Configuration Updates
deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json, deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml, docs/kubernetes/observability/metrics.md
Renamed Grafana dashboard title to "Dynamo Dashboard (generic)"; simplified Kubernetes metrics documentation to use single kubectl apply command with updated path.
Kubernetes Observability Path Restructuring
deploy/observability/k8s/logging/README.md, docs/kubernetes/observability/logging.md, docs/kubernetes/observability/metrics.md
Updated relative path references from deploy/logging/* to deploy/observability/k8s/logging/*; adjusted Helm and configuration artifact paths for Loki, Alloy, and Grafana provisioning.
Observability Documentation Suite
docs/observability/README.md, docs/observability/health-checks.md, docs/observability/logging.md, docs/observability/metrics.md, docs/observability/metrics-developer-guide.md, docs/observability/prometheus-grafana.md, docs/observability/tracing.md
Added comprehensive observability guides with environment variables, getting started sections, metric categories, runtime hierarchy, and practical examples; restructured tracing and logging documentation; condensed Prometheus/Grafana guide to single-machine demo setup; added new metrics developer guide for custom metrics creation across Rust and Python.
Tempo Datasource & Python Examples
deploy/observability/tempo-datasource.yml, lib/bindings/python/examples/metrics/README.md
Changed Tempo datasource isDefault from true to false; pared down Python metrics example documentation to concise "Running Examples" section with link to developer guide.
Environment Variable Handling Refactor
lib/bindings/python/rust/lib.rs, lib/runtime/src/logging.rs
Updated OTEL_EXPORT_ENABLED checks from string equality "1" to truthy value support using env_is_truthy() utility; updated corresponding warning messages and documentation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Docker observability file: New comprehensive service stack with multiple dependencies—verify image versions, port mappings, volume configurations, and network assignments are correct
  • Documentation scope: Extensive rewrites across 7+ observability docs with environmental variable tables and new sections—confirm consistency of variable names, defaults, and examples across all guides
  • Path migrations: Multiple files updated with new kubernetes observability paths—verify all paths are correctly updated and no dangling references remain
  • Environment variable refactoring: Changes from numeric "1" to truthy values across Rust code—ensure all call sites properly use the new env_is_truthy() utility and backward compatibility is maintained

Poem

🐰 Hop hop, the observability stack does gleam,
Separated services fulfilling every dream,
Metrics, traces, and logs now neatly arranged,
Truthiness and defaults—everything changed!
From chaos to clarity, what a fine sight,
Documentation and dashboards now burning so bright!

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check ✅ Passed The pull request description is comprehensive and well-structured, following the template with clear Overview, Details, Where to start, and Related Issues sections.
Title check ✅ Passed The title accurately reflects the main objective: consolidating observability files (docker-compose, markdown, configs) from scattered directories into organized deploy/observability and docs/observability structures.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (6)
docs/kubernetes/observability/logging.md (1)

49-49: Path updates are correct; fix minor formatting at line 144.

All referenced configuration files exist in the new centralized paths under deploy/observability/k8s/logging/. The path structure is consistent across Loki values, Alloy values, and Grafana configuration files (lines 49, 63, 113, 116).

Minor correction needed at line 144: change "component type (e.g frontend, worker, etc)" to "component type (e.g., frontend, worker, etc.)" to follow American English style guidelines.

deploy/docker-observability.yml (1)

90-91: Minor: Tempo config file uses YAML extension, verify naming consistency.

The GPU reservation is correctly configured with count: all to reserve all GPUs on the host. Line 92 references ./observability/tempo.yaml (YAML extension). Verify this filename is consistent with your configuration management. Most tools use .yaml or .yml interchangeably, but consistency across the repo is preferred.

docs/observability/metrics.md (2)

62-74: Add language identifier to Prometheus exposition format code block.

Fenced code blocks with a language identifier provide the best readability and syntax highlighting. The code block showing Prometheus exposition format (lines 62-74) lacks a language identifier.

-```
+```text
 # HELP dynamo_component_requests_total Total requests processed
 # TYPE dynamo_component_requests_total counter

Alternatively, if Prometheus format highlighting is available, use prometheus as the language identifier.


172-194: Add language identifier to code blocks showing timeline and concurrency examples.

Lines 172 and 182 contain fenced code blocks without language identifiers. The timeline ASCII diagram (lines 182-194) should use text to preserve formatting without attempting syntax highlighting.

-```
+```text
 Timeline:    0, 1, ...
 Client ────> Frontend:8000 ...
docs/observability/tracing.md (2)

27-76: Clarify docker compose file path for consistency.

The guide uses docker compose -f docker-observability.yml (lines 33, 46, 162). While this works from the deploy/ directory, ensure the current working directory is clear in all instructions. The commands at lines 32-33 show cd deploy before running the compose command, which is good. However, line 162 in the "Stop Services" section does not show the cd deploy step.

Add consistent context at section 6 (lines 156-163) to clarify working directory:

 ### 6. Stop Services
 
 When done, stop the observability stack:
 
 ```bash
+cd deploy
 docker compose -f docker-observability.yml down

---

`90-116`: **Disaggregated deployment script reference needs clarification.**

Lines 78-116 provide a manual script for disaggregated deployment but reference modifying `disagg.sh`. The comment at line 90 states "You may need to modify `disagg.sh`" but the script provided (lines 92-116) is shown as a complete replacement example.

Clarify whether:
1. `disagg.sh` should be modified before running, or
2. The provided script should be used as a reference/replacement


Add a note clarifying the intent:

```diff
 **Note:** You may need to modify `disagg.sh` to export the tracing environment
 variables before starting each component:

+**Option A:** Modify your existing `disagg.sh` to add the following lines at the top, or
+**Option B:** Use the template below as a complete `disagg.sh` replacement:
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f509493 and 0b71cf8.

⛔ Files ignored due to path filters (1)
  • docs/observability/trace.png is excluded by !**/*.png
📒 Files selected for processing (20)
  • README.md (1 hunks)
  • deploy/docker-compose.yml (1 hunks)
  • deploy/docker-observability.yml (1 hunks)
  • deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json (1 hunks)
  • deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml (1 hunks)
  • deploy/observability/k8s/logging/README.md (1 hunks)
  • deploy/observability/tempo-datasource.yml (1 hunks)
  • deploy/tracing/docker-compose.yml (0 hunks)
  • docs/kubernetes/observability/logging.md (4 hunks)
  • docs/kubernetes/observability/metrics.md (1 hunks)
  • docs/observability/README.md (1 hunks)
  • docs/observability/health-checks.md (1 hunks)
  • docs/observability/logging.md (5 hunks)
  • docs/observability/metrics-developer-guide.md (1 hunks)
  • docs/observability/metrics.md (2 hunks)
  • docs/observability/prometheus-grafana.md (1 hunks)
  • docs/observability/tracing.md (4 hunks)
  • lib/bindings/python/examples/metrics/README.md (1 hunks)
  • lib/bindings/python/rust/lib.rs (2 hunks)
  • lib/runtime/src/logging.rs (1 hunks)
💤 Files with no reviewable changes (1)
  • deploy/tracing/docker-compose.yml
🧰 Additional context used
🧠 Learnings (11)
📓 Common learnings
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.

Applied to files:

  • docs/observability/README.md
  • docs/observability/prometheus-grafana.md
  • docs/observability/metrics-developer-guide.md
  • deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml
  • docs/observability/metrics.md
  • lib/bindings/python/examples/metrics/README.md
📚 Learning: 2025-07-14T21:25:56.930Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Applied to files:

  • lib/bindings/python/rust/lib.rs
📚 Learning: 2025-07-18T16:04:31.771Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

Applied to files:

  • docs/kubernetes/observability/logging.md
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The ai-dynamo/dynamo team uses _total as a semantic unit suffix across all metric types (including gauges like INFLIGHT_REQUESTS_TOTAL) for internal consistency, as evidenced by patterns in prometheus_names.rs. This is a deliberate architectural choice to prioritize uniform naming conventions over strict Prometheus conventions that reserve _total only for counters.

Applied to files:

  • docs/observability/prometheus-grafana.md
  • docs/observability/metrics.md
📚 Learning: 2025-09-16T00:21:44.912Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: deploy/metrics/README.md:43-43
Timestamp: 2025-09-16T00:21:44.912Z
Learning: Graham (grahamking) has provided guidance in PR 2914 that metrics should end with _count or _total to indicate units, but this needs to be clarified whether it applies to all metric types or just counters, as Prometheus conventions differ between counters (should have _total) and gauges (should not have _total).

Applied to files:

  • docs/observability/metrics-developer-guide.md
📚 Learning: 2025-09-24T19:06:57.156Z
Learnt from: ryan-lempka
Repo: ai-dynamo/dynamo PR: 3062
File: lib/llm/src/audit/sink.rs:15-27
Timestamp: 2025-09-24T19:06:57.156Z
Learning: In the audit logging system, full request/response logging requires both DYN_AUDIT_ENABLED=1 environment variable and explicit store=true in the request. Without store=true, only usage statistics are logged (UsageOnly mode). The stderr sink is the initial implementation with plans for additional sinks in the future.

Applied to files:

  • docs/observability/logging.md
📚 Learning: 2025-06-04T13:09:53.416Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The `DYN_DEPLOYMENT_CONFIG` environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.

Applied to files:

  • docs/observability/health-checks.md
📚 Learning: 2025-06-05T01:46:15.509Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.

Applied to files:

  • docs/observability/health-checks.md
📚 Learning: 2025-07-25T22:34:11.384Z
Learnt from: nnshah1
Repo: ai-dynamo/dynamo PR: 2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.

Applied to files:

  • docs/observability/health-checks.md
📚 Learning: 2025-09-16T00:27:43.992Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:75-79
Timestamp: 2025-09-16T00:27:43.992Z
Learning: In the ai-dynamo/dynamo codebase, the project uses "_total" suffix for all Prometheus metrics including gauges like inflight_requests, which differs from standard Prometheus conventions. The constant work_handler::INFLIGHT_REQUESTS does not exist - only work_handler::INFLIGHT_REQUESTS_TOTAL exists and should be used for the inflight requests gauge metric.

Applied to files:

  • docs/observability/metrics.md
🧬 Code graph analysis (2)
lib/runtime/src/logging.rs (1)
lib/runtime/src/config.rs (1)
  • env_is_truthy (422-427)
lib/bindings/python/rust/lib.rs (1)
lib/runtime/src/config.rs (1)
  • env_is_truthy (422-427)
🪛 GitHub Check: Check for broken markdown links
lib/bindings/python/examples/metrics/README.md

[failure] 10-10:
Broken link: Metrics Developer Guide - Python Section - View: https://github.com/ai-dynamo/dynamo/blob/HEAD/lib/bindings/python/examples/metrics/README.md?plain=1#L10

🪛 LanguageTool
docs/kubernetes/observability/logging.md

[uncategorized] ~144-~144: The abbreviation “e.g.” (= for example) requires two periods.
Context: ...loyment, namespace, and component type (e.g frontend, worker, etc).

(E_G)


[style] ~144-~144: In American English, abbreviations like “etc.” require a period.
Context: ...d component type (e.g frontend, worker, etc).

(ETC_PERIOD)

docs/observability/prometheus-grafana.md

[duplication] ~28-~28: Possible typo: you repeated a word.
Context: ...tes Install these on your machine: - Docker - [Docker Compose](https://docs.docker.com/compos...

(ENGLISH_WORD_REPEAT_RULE)

docs/observability/metrics-developer-guide.md

[uncategorized] ~36-~36: Loose punctuation mark.
Context: ...Methods - .metrics().create_counter(): Create a counter metric - `.metrics().c...

(UNLIKELY_OPENING_PUNCTUATION)

docs/observability/metrics.md

[uncategorized] ~102-~102: Loose punctuation mark.
Context: ... - dynamo_component_inflight_requests: Requests currently being processed (gau...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~113-~113: Loose punctuation mark.
Context: ...dynamo_component_kvstats_active_blocks: Number of active KV cache blocks curren...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~126-~126: Loose punctuation mark.
Context: ...unctionality: - dynamo_preprocessor_*: Metrics specific to preprocessor compon...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~132-~132: Loose punctuation mark.
Context: ...: - dynamo_frontend_inflight_requests: Inflight requests (gauge) - `dynamo_fro...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~150-~150: Loose punctuation mark.
Context: ... dynamo_frontend_model_total_kv_blocks: Total KV blocks available for a worker ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~157-~157: Loose punctuation mark.
Context: ...- dynamo_frontend_model_context_length: Maximum context length for a worker ser...

(UNLIKELY_OPENING_PUNCTUATION)

🪛 markdownlint-cli2 (0.18.1)
docs/observability/metrics.md

62-62: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


172-172: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


182-182: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: clippy (.)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: tests (.)
🔇 Additional comments (27)
lib/runtime/src/logging.rs (1)

147-150: LGTM! Truthy value support improves usability.

The change from exact "1" matching to env_is_truthy() improves the developer experience by accepting common truthy values (1, true, on, yes). The documentation comment accurately reflects the new behavior.

lib/bindings/python/rust/lib.rs (2)

128-134: LGTM! Consistent truthy value support.

The change to use env_is_truthy() and the updated warning message are consistent with the changes in lib/runtime/src/logging.rs. The warning text correctly reflects that any truthy value (not just "=1") will trigger the deferred initialization.


449-453: LGTM! Proper deferred initialization.

The truthy check here correctly defers logging initialization until the Tokio runtime is available, which is required for the OTEL exporter. This mirrors the pattern in the _core module initialization.

docs/observability/metrics-developer-guide.md (1)

1-270: Excellent comprehensive developer guide!

This is a well-structured and thorough metrics developer guide that covers:

  • Clear getting started instructions with environment variables
  • Both Rust and Python API usage with practical examples
  • Vector metrics with labels
  • Advanced features (custom buckets, constant labels)
  • Update patterns and examples

The document effectively consolidates metrics guidance and provides a strong foundation for developers working with Dynamo metrics.

docs/observability/health-checks.md (2)

14-24: Well-documented environment variables.

The new environment variables table provides clear, comprehensive information with descriptions, defaults, and examples. This follows a consistent format with other observability documentation.


25-48: Helpful Getting Started section.

The new Getting Started section provides clear, practical examples for enabling and testing health checks on a single GPU. This aligns well with the PR's objective to improve observability documentation and provide better getting-started guidance.

docs/observability/prometheus-grafana.md (3)

1-21: Improved overview and environment variables documentation.

The updated overview clearly focuses on single-machine demo setup, and the environment variables table follows the consistent format used across other observability documentation.


22-84: Excellent Getting Started guide.

The new Getting Started section provides a comprehensive, step-by-step guide that covers:

  • Prerequisites and installation
  • Starting infrastructure and observability services
  • Launching Dynamo components with proper configuration
  • Testing with example requests
  • Accessing web interfaces with helpful notes about remote access

This is very user-friendly and practical.


88-167: Well-structured topology and configuration guidance.

The topology diagram provides a clear visualization of service relationships, and the configuration/troubleshooting sections offer practical guidance. The reference to the Metrics Developer Guide is a good addition for users who want to create custom metrics.

docs/observability/logging.md (3)

27-52: Improved environment variables documentation and Getting Started section.

The restructured environment variables table follows the consistent format used across observability documentation, and the new Getting Started section provides clear, practical examples for enabling JSONL logging.


102-141: Clear explanation of OpenTelemetry logging integration.

The renamed section "Logging of OpenTelemetry Tracing" better reflects its purpose, and the clarification about the overlap with distributed tracing documentation is helpful. The note distinguishing between seeing trace context in logs versus exporting to a trace backend is particularly useful.


142-258: Comprehensive trace and span information examples.

The updated section title accurately reflects the content, and the detailed examples effectively demonstrate how trace context appears in logs. The examples showing custom request IDs with the x-request-id header are particularly practical for correlating traces with application-level request tracking.

lib/bindings/python/examples/metrics/README.md (1)

10-10: Link is valid; no changes needed.

Verification confirms the documentation link is functional. The file docs/observability/metrics-developer-guide.md exists and contains the Python Metrics API section referenced by the anchor. The static analysis flag appears to be a false positive.

deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml (1)

1005-1005: Dashboard title update is consistent with observability reorganization.

The title change aligns with the PR's generic naming convention and matches the corresponding update in deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json. This is a safe cosmetic change with no impact on dashboard functionality or metrics queries.

deploy/observability/tempo-datasource.yml (1)

12-12: Tempo datasource correctly set as non-default.

Setting isDefault: false appropriately makes Prometheus the default datasource while allowing Tempo queries when explicitly selected. This is the correct configuration for a multi-datasource observability setup where metrics are primary and tracing is supplementary.

deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json (1)

1023-1023: Dashboard title matches configmap and supports generic naming convention.

The title update is consistent with the parallel change in the ConfigMap and maintains the semantic meaning of the dashboard while supporting the new "generic" naming introduced in this PR. All dashboard panels and queries remain intact.

deploy/observability/k8s/logging/README.md (1)

3-3: Relative path correctly updated for directory relocation.

The path has been properly adjusted to reflect the directory restructuring from deploy/logging/ to deploy/observability/k8s/logging/. The new path ../../../../docs/kubernetes/observability/logging.md correctly navigates from the new file location to the documentation.

docs/kubernetes/observability/metrics.md (1)

130-132: Documentation simplified and clarified.

Replacing the multi-step pushd/popd pattern with a single direct kubectl apply command improves clarity and reduces cognitive load. The simplified instruction is easier to follow and less error-prone for users.

README.md (2)

104-107: Code block syntax highlighting improved.

Specifying bash as the code fence language enables proper syntax highlighting for users reading the README, improving clarity and professional appearance.


109-117: New observability section is clear and well-positioned.

The optional observability stack section appropriately documents the separate deployment path and clearly communicates:

  • What services are included (Prometheus, Grafana, Tempo, metrics exporters)
  • How to deploy (docker compose -f deploy/docker-observability.yml)
  • How to access it (Grafana credentials and port)

Marking this as optional correctly reflects that it's not required for basic Dynamo operation, improving documentation clarity for new users. Credentials (dynamo/dynamo) and port (3000) match the PR objectives.

docs/observability/README.md (1)

1-32: Observability documentation hub verified and complete.

The README effectively centralizes observability configuration with well-organized tables, shared variable annotations (†), and appropriate separation between user and developer guides. All referenced guide files have been confirmed to exist.

deploy/docker-observability.yml (2)

106-127: Clarify Grafana security configuration.

The Grafana service has several security-related settings that should be documented or reviewed:

  • user: root (Line 90 for Tempo is reasonable for file permissions, but Line 106-127 uses default user) — consider whether root privileges are necessary for Grafana
  • GF_SECURITY_DISABLE_INITIAL_ADMIN_CREATION=false (Line 125) — contradicts the intent of the other security flags; this will NOT disable admin creation despite the name
  • Default credentials dynamo/dynamo are acceptable for local development but should be rotated in production

Consider adding a comment to clarify these are development defaults and should not be used in production, or add safeguards for production deployments.


64-66: All referenced configuration files are present.

Verification confirms that all required configuration files and directories referenced in the compose file exist in the deploy/observability/ directory. The observability stack is properly configured and ready for use.

docs/observability/metrics.md (2)

29-54: Environment variable names and defaults are correct. Verification confirms DYN_SYSTEM_ENABLED and DYN_SYSTEM_PORT match the codebase exactly. The system metrics server is disabled by default and enabled when DYN_SYSTEM_ENABLED=true. The example port 8081 is consistent with actual usage throughout the codebase.


96-120: Fix documented metric names to match implementation.

Two documentation errors found:

  1. Line 106: dynamo_component_system_uptime_seconds should be dynamo_component_uptime_seconds (remove "system_" prefix)

  2. Lines 110-113: KVStats metrics are missing the dynamo_component_ prefix in documentation. The actual metric names are:

    • kvstats_active_blocks (not dynamo_component_kvstats_active_blocks)
    • kvstats_total_blocks (not dynamo_component_kvstats_total_blocks)
    • kvstats_gpu_cache_usage_percent (not dynamo_component_kvstats_gpu_cache_usage_percent)
    • kvstats_gpu_prefix_cache_hit_rate (not dynamo_component_kvstats_gpu_prefix_cache_hit_rate)

The work handler metrics (lines 102-105) are correctly named in the documentation.

⛔ Skipped due to learnings
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:75-79
Timestamp: 2025-09-16T00:27:43.992Z
Learning: In the ai-dynamo/dynamo codebase, the project uses "_total" suffix for all Prometheus metrics including gauges like inflight_requests, which differs from standard Prometheus conventions. The constant work_handler::INFLIGHT_REQUESTS does not exist - only work_handler::INFLIGHT_REQUESTS_TOTAL exists and should be used for the inflight requests gauge metric.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.
docs/observability/tracing.md (2)

18-25: All environment variable defaults verified and confirmed correct:

Variable Documentation Default Implementation Default Status
DYN_LOGGING_JSONL false env_is_truthy() → false
OTEL_EXPORT_ENABLED false env_is_truthy() → false
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT http://localhost:4317 http://localhost:4317
OTEL_SERVICE_NAME dynamo dynamo

Tempo OTLP gRPC port 4317 confirmed in deploy/docker-observability.yml (line 96).


56-58: No issues found—environment variable syntax is correct.

The OTEL_EXPORT_ENABLED=true syntax in docs/observability/tracing.md is valid. The application's is_truthy function (lib/config/src/lib.rs:20-22, lib/runtime/src/config.rs:396-398) explicitly accepts "true" as a truthy value via pattern matching: matches!(val.to_lowercase().as_str(), "1" | "true" | "on" | "yes"). The example aligns with the documented acceptable values listed in the code comment.

Comment on lines 4 to 5
# Bare minimum infrastructure services for Dynamo.
# For observability (metrics, tracing, dashboards), use docker-compose-observability.yml
Copy link
Contributor

@coderabbitai coderabbitai bot Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify the observability compose file name.

The comment references docker-compose-observability.yml, but the AI summary mentions docker-observability.yml. Please verify the correct filename.


🏁 Script executed:

#!/bin/bash
# Description: Verify the observability compose file name

if [ -f "deploy/docker-observability.yml" ]; then
    echo "✓ Found: deploy/docker-observability.yml"
elif [ -f "deploy/docker-compose-observability.yml" ]; then
    echo "✓ Found: deploy/docker-compose-observability.yml"
else
    echo "✗ Neither deploy/docker-observability.yml nor deploy/docker-compose-observability.yml found"
fi

Length of output: 40


Update comment to reference correct observability compose file name.

The comment at lines 4-5 references docker-compose-observability.yml, but the actual file is docker-observability.yml. Update the comment to reference the correct filename:

# For observability (metrics, tracing, dashboards), use docker-observability.yml
🤖 Prompt for AI Agents
In deploy/docker-compose.yml around lines 4 to 5, the inline comment references
the wrong observability compose filename; update the comment to reference
docker-observability.yml instead of docker-compose-observability.yml so the
comment reads: "# For observability (metrics, tracing, dashboards), use
docker-observability.yml".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keivenchang Thank you for the quick fix! 🎉


🧠 Learnings used
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.

If you found this review helpful, would you consider giving us a shout-out on X?

Thank you for using CodeRabbit!

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
@@ -1,35 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now subsumed in the deploy/docker-observability.yml file (plus other stuff).

@@ -0,0 +1,137 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new file is a merge of the old deploy/tracing/docker-compose.yml and the old deploy/docker-compose.yml, containing only observability related services.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
- Fix docker-compose.yml comment to reference correct filename (docker-observability.yml not docker-compose-observability.yml)
- Fix abbreviation formatting in kubernetes logging docs (e.g., etc.)

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
@keivenchang keivenchang changed the title Keivenchang/dis 980 consolidate otel docker compose files refactor: consolidate OTEL docker compose files Nov 7, 2025
@keivenchang keivenchang changed the title refactor: consolidate OTEL docker compose files refactor: consolidate Observability files (e.g. OTEL docker-compose, md files) Nov 7, 2025
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Correct relative path to metrics-developer-guide.md (needs 6 levels up, not 5)

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rmccorm4 , @athreesh - do we want to keep the readme.md s or just remove them?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for this PR let's keep as-is, will revisit after the cleanup. As a developer I like the pointer to the docs since I reach for the code folders before the docs folders.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tzulingk we need to update this for the canary health check

| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `DYN_LOGGING_JSONL` | Enable JSONL logging format | `false` | `true` |
| `DYN_LOG` | Log level: `info` or `debug` | `info` | `debug` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we lost some info here in being able to change logging by target -

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, added back.

|----------|-------------|---------|---------|
| `DYN_LOGGING_JSONL` | Enable JSONL logging format | `false` | `true` |
| `DYN_LOG` | Log level: `info` or `debug` | `info` | `debug` |
| `DYN_LOG_USE_LOCAL_TZ` | Use local timezone for timestamps | `false` | `true` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should indicate the default is UTC

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@nnshah1 nnshah1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only thing needs changing is some tweaking of the logging doc to reduce confusion with OpenTelemetry - hope that makes sense - otherwise LGTM - great work!

- Remove redundant 'grafana-' prefix from dashboard filenames
- Remove redundant '-dashboard' suffix from JSON files
- Simplify dcgm-metrics.json copyright to 2-line SPDX format
- Update all documentation references to new filenames
- Add detailed DYN_LOG per-target syntax to observability docs
- Clarify DYN_LOG_USE_LOCAL_TZ default timezone (UTC)
- Add Dynamo logging variables table to K8s logging docs

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
- Centralize docker-compose instructions in README.md
- Add topology diagram and configuration files to README.md
- Standardize all docs to use 'Getting Started Quickly' heading
- Remove duplicate docker-compose commands from individual guides
- All guides now reference README.md for observability stack setup
- Update DYN_LOG documentation with per-target syntax examples
- Clarify DYN_LOG_USE_LOCAL_TZ default timezone (UTC)
- Add Dynamo logging variables to K8s logging docs
- Remove lib/bindings/python/examples/metrics/README.md
- Improve consistency across all observability documentation

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
…-compose-files

Resolved conflicts:
- docs/observability/health-checks.md: Removed duplicate environment variables table
- docs/observability/logging.md: Kept simplified 'OTLP export enabled' message
- docs/observability/prometheus-grafana.md: Kept link to metrics-developer-guide.md
- lib/bindings/python/examples/metrics/README.md: Accepted deletion

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Add wait_for_service_ready() call to ensure the HTTP service is fully
started and listening before sending requests. Without this, the test
could fail with 404 errors if requests arrive before the server is ready.

This race condition has existed since the test was written, but became
more visible after recent changes (e.g. KeyValueStoreManager refactor in
Oct 2025). The wait_for_service_ready() helper was added in July 2025
for HTTP disconnect tests but the original test_http_service was never
updated to use it.

This follows the pattern used by other tests in the same file.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
- Update logging.md to clarify that trace/span information uses OpenTelemetry
  format/libraries but doesn't require an OpenTelemetry backend (Tempo/Jaeger)
- Standardize copyright headers to 2-line SPDX format across observability docs
- Remove full Apache license text from logging.md
- Add missing copyright header to prometheus-grafana.md

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
let task = tokio::spawn(async move { service.run(token.clone()).await });

// Wait for the service to be ready before proceeding
wait_for_service_ready(port).await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this flaky from a race condition or something before?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw flaky behavior from this test in CI +1

**Component Metrics**: The core Dynamo backend system automatically exposes metrics with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework. These include request counts, processing times, byte transfers, and system uptime metrics. See [prometheus-grafana.md](prometheus-grafana.md#available-metrics) for the complete list of component metrics.
| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `DYN_SYSTEM_ENABLED` | Enable system metrics/health server | `false` | `true` |
Copy link
Contributor

@rmccorm4 rmccorm4 Nov 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note any straggling references to DYN_SYSTEM_ENABLED in this PR will need to be removed in your part2 or a part3 follow up

}

/// Check if OTLP trace exporting is enabled (set OTEL_EXPORT_ENABLED=1 to enable)
/// Check if OTLP trace exporting is enabled (set OTEL_EXPORT_ENABLED to a truthy value: 1, true, on, yes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@mohammedabdulwahhab
Copy link
Contributor

mohammedabdulwahhab commented Nov 8, 2025

@keivenchang Just took a tour through your branch and it looks awesome.

Here are some quick fixes I found:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants