-
Notifications
You must be signed in to change notification settings - Fork 676
refactor: consolidate Observability files (e.g. OTEL docker-compose, md files) #4173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
refactor: consolidate Observability files (e.g. OTEL docker-compose, md files) #4173
Conversation
…bility.yml - Moved metrics (Prometheus, Grafana, DCGM, NATS exporter) and tracing (Tempo) into single docker-observability.yml - Simplified docker-compose.yml to only include core infrastructure (NATS, etcd) - Reorganized observability files: deploy/metrics/* and deploy/tracing/* -> deploy/observability/* - Updated documentation: deploy/tracing/README.md -> docs/observability/tracing.md - Unified Grafana configuration to support both Prometheus and Tempo datasources - Single observability stack now runs on unified 'server' network for better integration Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
- Move deploy/logging to deploy/observability/k8s/logging for better organization - Move trace.png to docs/observability/ to be alongside tracing.md - Fix vllm lazy import of kvbm to avoid Tokio runtime initialization issues - Add log level documentation explaining DEBUG vs INFO for trace visibility - Update all references to reflect new paths - Clarify OTEL environment variable defaults and behavior Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
- Create docs/observability/README.md as central hub - Split metrics-developer-guide.md from prometheus-grafana.md - Standardize all docs: Overview, Environment Variables, Getting Started - Update env variable parsing to accept truthy values (true/1/on/yes) - Consolidate prometheus-grafana.md as quick start guide - Improve metrics.md as reference document - Clarify tracing requirements and overlap with logging - Fix double space and grammatical issues Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
WalkthroughThe changes reorganize observability infrastructure by separating observability services into a standalone compose file, consolidating and restructuring Kubernetes observability configurations, updating environment variable handling for OTEL exports to support truthy values, expanding observability documentation with guides for metrics, tracing, logging, and health checks, and streamlining Docker Compose setup instructions. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Poem
Pre-merge checks✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (6)
docs/kubernetes/observability/logging.md (1)
49-49: Path updates are correct; fix minor formatting at line 144.All referenced configuration files exist in the new centralized paths under
deploy/observability/k8s/logging/. The path structure is consistent across Loki values, Alloy values, and Grafana configuration files (lines 49, 63, 113, 116).Minor correction needed at line 144: change
"component type (e.g frontend, worker, etc)"to"component type (e.g., frontend, worker, etc.)"to follow American English style guidelines.deploy/docker-observability.yml (1)
90-91: Minor: Tempo config file uses YAML extension, verify naming consistency.The GPU reservation is correctly configured with
count: allto reserve all GPUs on the host. Line 92 references./observability/tempo.yaml(YAML extension). Verify this filename is consistent with your configuration management. Most tools use.yamlor.ymlinterchangeably, but consistency across the repo is preferred.docs/observability/metrics.md (2)
62-74: Add language identifier to Prometheus exposition format code block.Fenced code blocks with a language identifier provide the best readability and syntax highlighting. The code block showing Prometheus exposition format (lines 62-74) lacks a language identifier.
-``` +```text # HELP dynamo_component_requests_total Total requests processed # TYPE dynamo_component_requests_total counterAlternatively, if Prometheus format highlighting is available, use
prometheusas the language identifier.
172-194: Add language identifier to code blocks showing timeline and concurrency examples.Lines 172 and 182 contain fenced code blocks without language identifiers. The timeline ASCII diagram (lines 182-194) should use
textto preserve formatting without attempting syntax highlighting.-``` +```text Timeline: 0, 1, ... Client ────> Frontend:8000 ...docs/observability/tracing.md (2)
27-76: Clarify docker compose file path for consistency.The guide uses
docker compose -f docker-observability.yml(lines 33, 46, 162). While this works from thedeploy/directory, ensure the current working directory is clear in all instructions. The commands at lines 32-33 showcd deploybefore running the compose command, which is good. However, line 162 in the "Stop Services" section does not show thecd deploystep.Add consistent context at section 6 (lines 156-163) to clarify working directory:
### 6. Stop Services When done, stop the observability stack: ```bash +cd deploy docker compose -f docker-observability.yml down--- `90-116`: **Disaggregated deployment script reference needs clarification.** Lines 78-116 provide a manual script for disaggregated deployment but reference modifying `disagg.sh`. The comment at line 90 states "You may need to modify `disagg.sh`" but the script provided (lines 92-116) is shown as a complete replacement example. Clarify whether: 1. `disagg.sh` should be modified before running, or 2. The provided script should be used as a reference/replacement Add a note clarifying the intent: ```diff **Note:** You may need to modify `disagg.sh` to export the tracing environment variables before starting each component: +**Option A:** Modify your existing `disagg.sh` to add the following lines at the top, or +**Option B:** Use the template below as a complete `disagg.sh` replacement:
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
docs/observability/trace.pngis excluded by!**/*.png
📒 Files selected for processing (20)
README.md(1 hunks)deploy/docker-compose.yml(1 hunks)deploy/docker-observability.yml(1 hunks)deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json(1 hunks)deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml(1 hunks)deploy/observability/k8s/logging/README.md(1 hunks)deploy/observability/tempo-datasource.yml(1 hunks)deploy/tracing/docker-compose.yml(0 hunks)docs/kubernetes/observability/logging.md(4 hunks)docs/kubernetes/observability/metrics.md(1 hunks)docs/observability/README.md(1 hunks)docs/observability/health-checks.md(1 hunks)docs/observability/logging.md(5 hunks)docs/observability/metrics-developer-guide.md(1 hunks)docs/observability/metrics.md(2 hunks)docs/observability/prometheus-grafana.md(1 hunks)docs/observability/tracing.md(4 hunks)lib/bindings/python/examples/metrics/README.md(1 hunks)lib/bindings/python/rust/lib.rs(2 hunks)lib/runtime/src/logging.rs(1 hunks)
💤 Files with no reviewable changes (1)
- deploy/tracing/docker-compose.yml
🧰 Additional context used
🧠 Learnings (11)
📓 Common learnings
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.
Applied to files:
docs/observability/README.mddocs/observability/prometheus-grafana.mddocs/observability/metrics-developer-guide.mddeploy/observability/k8s/grafana-dynamo-dashboard-configmap.yamldocs/observability/metrics.mdlib/bindings/python/examples/metrics/README.md
📚 Learning: 2025-07-14T21:25:56.930Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
Applied to files:
lib/bindings/python/rust/lib.rs
📚 Learning: 2025-07-18T16:04:31.771Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.
Applied to files:
docs/kubernetes/observability/logging.md
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The ai-dynamo/dynamo team uses _total as a semantic unit suffix across all metric types (including gauges like INFLIGHT_REQUESTS_TOTAL) for internal consistency, as evidenced by patterns in prometheus_names.rs. This is a deliberate architectural choice to prioritize uniform naming conventions over strict Prometheus conventions that reserve _total only for counters.
Applied to files:
docs/observability/prometheus-grafana.mddocs/observability/metrics.md
📚 Learning: 2025-09-16T00:21:44.912Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: deploy/metrics/README.md:43-43
Timestamp: 2025-09-16T00:21:44.912Z
Learning: Graham (grahamking) has provided guidance in PR 2914 that metrics should end with _count or _total to indicate units, but this needs to be clarified whether it applies to all metric types or just counters, as Prometheus conventions differ between counters (should have _total) and gauges (should not have _total).
Applied to files:
docs/observability/metrics-developer-guide.md
📚 Learning: 2025-09-24T19:06:57.156Z
Learnt from: ryan-lempka
Repo: ai-dynamo/dynamo PR: 3062
File: lib/llm/src/audit/sink.rs:15-27
Timestamp: 2025-09-24T19:06:57.156Z
Learning: In the audit logging system, full request/response logging requires both DYN_AUDIT_ENABLED=1 environment variable and explicit store=true in the request. Without store=true, only usage statistics are logged (UsageOnly mode). The stderr sink is the initial implementation with plans for additional sinks in the future.
Applied to files:
docs/observability/logging.md
📚 Learning: 2025-06-04T13:09:53.416Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The `DYN_DEPLOYMENT_CONFIG` environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.
Applied to files:
docs/observability/health-checks.md
📚 Learning: 2025-06-05T01:46:15.509Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.
Applied to files:
docs/observability/health-checks.md
📚 Learning: 2025-07-25T22:34:11.384Z
Learnt from: nnshah1
Repo: ai-dynamo/dynamo PR: 2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.
Applied to files:
docs/observability/health-checks.md
📚 Learning: 2025-09-16T00:27:43.992Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:75-79
Timestamp: 2025-09-16T00:27:43.992Z
Learning: In the ai-dynamo/dynamo codebase, the project uses "_total" suffix for all Prometheus metrics including gauges like inflight_requests, which differs from standard Prometheus conventions. The constant work_handler::INFLIGHT_REQUESTS does not exist - only work_handler::INFLIGHT_REQUESTS_TOTAL exists and should be used for the inflight requests gauge metric.
Applied to files:
docs/observability/metrics.md
🧬 Code graph analysis (2)
lib/runtime/src/logging.rs (1)
lib/runtime/src/config.rs (1)
env_is_truthy(422-427)
lib/bindings/python/rust/lib.rs (1)
lib/runtime/src/config.rs (1)
env_is_truthy(422-427)
🪛 GitHub Check: Check for broken markdown links
lib/bindings/python/examples/metrics/README.md
[failure] 10-10:
Broken link: Metrics Developer Guide - Python Section - View: https://github.com/ai-dynamo/dynamo/blob/HEAD/lib/bindings/python/examples/metrics/README.md?plain=1#L10
🪛 LanguageTool
docs/kubernetes/observability/logging.md
[uncategorized] ~144-~144: The abbreviation “e.g.” (= for example) requires two periods.
Context: ...loyment, namespace, and component type (e.g frontend, worker, etc).
(E_G)
[style] ~144-~144: In American English, abbreviations like “etc.” require a period.
Context: ...d component type (e.g frontend, worker, etc).
(ETC_PERIOD)
docs/observability/prometheus-grafana.md
[duplication] ~28-~28: Possible typo: you repeated a word.
Context: ...tes Install these on your machine: - Docker - [Docker Compose](https://docs.docker.com/compos...
(ENGLISH_WORD_REPEAT_RULE)
docs/observability/metrics-developer-guide.md
[uncategorized] ~36-~36: Loose punctuation mark.
Context: ...Methods - .metrics().create_counter(): Create a counter metric - `.metrics().c...
(UNLIKELY_OPENING_PUNCTUATION)
docs/observability/metrics.md
[uncategorized] ~102-~102: Loose punctuation mark.
Context: ... - dynamo_component_inflight_requests: Requests currently being processed (gau...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~113-~113: Loose punctuation mark.
Context: ...dynamo_component_kvstats_active_blocks: Number of active KV cache blocks curren...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~126-~126: Loose punctuation mark.
Context: ...unctionality: - dynamo_preprocessor_*: Metrics specific to preprocessor compon...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~132-~132: Loose punctuation mark.
Context: ...: - dynamo_frontend_inflight_requests: Inflight requests (gauge) - `dynamo_fro...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~150-~150: Loose punctuation mark.
Context: ... dynamo_frontend_model_total_kv_blocks: Total KV blocks available for a worker ...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~157-~157: Loose punctuation mark.
Context: ...- dynamo_frontend_model_context_length: Maximum context length for a worker ser...
(UNLIKELY_OPENING_PUNCTUATION)
🪛 markdownlint-cli2 (0.18.1)
docs/observability/metrics.md
62-62: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
172-172: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
182-182: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: trtllm (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: sglang (amd64)
- GitHub Check: trtllm (amd64)
- GitHub Check: vllm (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: operator (arm64)
- GitHub Check: Build and Test - dynamo
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: clippy (.)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: tests (.)
🔇 Additional comments (27)
lib/runtime/src/logging.rs (1)
147-150: LGTM! Truthy value support improves usability.The change from exact "1" matching to
env_is_truthy()improves the developer experience by accepting common truthy values (1, true, on, yes). The documentation comment accurately reflects the new behavior.lib/bindings/python/rust/lib.rs (2)
128-134: LGTM! Consistent truthy value support.The change to use
env_is_truthy()and the updated warning message are consistent with the changes inlib/runtime/src/logging.rs. The warning text correctly reflects that any truthy value (not just "=1") will trigger the deferred initialization.
449-453: LGTM! Proper deferred initialization.The truthy check here correctly defers logging initialization until the Tokio runtime is available, which is required for the OTEL exporter. This mirrors the pattern in the
_coremodule initialization.docs/observability/metrics-developer-guide.md (1)
1-270: Excellent comprehensive developer guide!This is a well-structured and thorough metrics developer guide that covers:
- Clear getting started instructions with environment variables
- Both Rust and Python API usage with practical examples
- Vector metrics with labels
- Advanced features (custom buckets, constant labels)
- Update patterns and examples
The document effectively consolidates metrics guidance and provides a strong foundation for developers working with Dynamo metrics.
docs/observability/health-checks.md (2)
14-24: Well-documented environment variables.The new environment variables table provides clear, comprehensive information with descriptions, defaults, and examples. This follows a consistent format with other observability documentation.
25-48: Helpful Getting Started section.The new Getting Started section provides clear, practical examples for enabling and testing health checks on a single GPU. This aligns well with the PR's objective to improve observability documentation and provide better getting-started guidance.
docs/observability/prometheus-grafana.md (3)
1-21: Improved overview and environment variables documentation.The updated overview clearly focuses on single-machine demo setup, and the environment variables table follows the consistent format used across other observability documentation.
22-84: Excellent Getting Started guide.The new Getting Started section provides a comprehensive, step-by-step guide that covers:
- Prerequisites and installation
- Starting infrastructure and observability services
- Launching Dynamo components with proper configuration
- Testing with example requests
- Accessing web interfaces with helpful notes about remote access
This is very user-friendly and practical.
88-167: Well-structured topology and configuration guidance.The topology diagram provides a clear visualization of service relationships, and the configuration/troubleshooting sections offer practical guidance. The reference to the Metrics Developer Guide is a good addition for users who want to create custom metrics.
docs/observability/logging.md (3)
27-52: Improved environment variables documentation and Getting Started section.The restructured environment variables table follows the consistent format used across observability documentation, and the new Getting Started section provides clear, practical examples for enabling JSONL logging.
102-141: Clear explanation of OpenTelemetry logging integration.The renamed section "Logging of OpenTelemetry Tracing" better reflects its purpose, and the clarification about the overlap with distributed tracing documentation is helpful. The note distinguishing between seeing trace context in logs versus exporting to a trace backend is particularly useful.
142-258: Comprehensive trace and span information examples.The updated section title accurately reflects the content, and the detailed examples effectively demonstrate how trace context appears in logs. The examples showing custom request IDs with the
x-request-idheader are particularly practical for correlating traces with application-level request tracking.lib/bindings/python/examples/metrics/README.md (1)
10-10: Link is valid; no changes needed.Verification confirms the documentation link is functional. The file
docs/observability/metrics-developer-guide.mdexists and contains the Python Metrics API section referenced by the anchor. The static analysis flag appears to be a false positive.deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml (1)
1005-1005: Dashboard title update is consistent with observability reorganization.The title change aligns with the PR's generic naming convention and matches the corresponding update in
deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json. This is a safe cosmetic change with no impact on dashboard functionality or metrics queries.deploy/observability/tempo-datasource.yml (1)
12-12: Tempo datasource correctly set as non-default.Setting
isDefault: falseappropriately makes Prometheus the default datasource while allowing Tempo queries when explicitly selected. This is the correct configuration for a multi-datasource observability setup where metrics are primary and tracing is supplementary.deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json (1)
1023-1023: Dashboard title matches configmap and supports generic naming convention.The title update is consistent with the parallel change in the ConfigMap and maintains the semantic meaning of the dashboard while supporting the new "generic" naming introduced in this PR. All dashboard panels and queries remain intact.
deploy/observability/k8s/logging/README.md (1)
3-3: Relative path correctly updated for directory relocation.The path has been properly adjusted to reflect the directory restructuring from
deploy/logging/todeploy/observability/k8s/logging/. The new path../../../../docs/kubernetes/observability/logging.mdcorrectly navigates from the new file location to the documentation.docs/kubernetes/observability/metrics.md (1)
130-132: Documentation simplified and clarified.Replacing the multi-step
pushd/popdpattern with a single directkubectl applycommand improves clarity and reduces cognitive load. The simplified instruction is easier to follow and less error-prone for users.README.md (2)
104-107: Code block syntax highlighting improved.Specifying
bashas the code fence language enables proper syntax highlighting for users reading the README, improving clarity and professional appearance.
109-117: New observability section is clear and well-positioned.The optional observability stack section appropriately documents the separate deployment path and clearly communicates:
- What services are included (Prometheus, Grafana, Tempo, metrics exporters)
- How to deploy (
docker compose -f deploy/docker-observability.yml)- How to access it (Grafana credentials and port)
Marking this as optional correctly reflects that it's not required for basic Dynamo operation, improving documentation clarity for new users. Credentials (dynamo/dynamo) and port (3000) match the PR objectives.
docs/observability/README.md (1)
1-32: Observability documentation hub verified and complete.The README effectively centralizes observability configuration with well-organized tables, shared variable annotations (†), and appropriate separation between user and developer guides. All referenced guide files have been confirmed to exist.
deploy/docker-observability.yml (2)
106-127: Clarify Grafana security configuration.The Grafana service has several security-related settings that should be documented or reviewed:
user: root(Line 90 for Tempo is reasonable for file permissions, but Line 106-127 uses default user) — consider whether root privileges are necessary for GrafanaGF_SECURITY_DISABLE_INITIAL_ADMIN_CREATION=false(Line 125) — contradicts the intent of the other security flags; this will NOT disable admin creation despite the name- Default credentials
dynamo/dynamoare acceptable for local development but should be rotated in productionConsider adding a comment to clarify these are development defaults and should not be used in production, or add safeguards for production deployments.
64-66: All referenced configuration files are present.Verification confirms that all required configuration files and directories referenced in the compose file exist in the
deploy/observability/directory. The observability stack is properly configured and ready for use.docs/observability/metrics.md (2)
29-54: Environment variable names and defaults are correct. Verification confirmsDYN_SYSTEM_ENABLEDandDYN_SYSTEM_PORTmatch the codebase exactly. The system metrics server is disabled by default and enabled whenDYN_SYSTEM_ENABLED=true. The example port 8081 is consistent with actual usage throughout the codebase.
96-120: Fix documented metric names to match implementation.Two documentation errors found:
Line 106:
dynamo_component_system_uptime_secondsshould bedynamo_component_uptime_seconds(remove "system_" prefix)Lines 110-113: KVStats metrics are missing the
dynamo_component_prefix in documentation. The actual metric names are:
kvstats_active_blocks(notdynamo_component_kvstats_active_blocks)kvstats_total_blocks(notdynamo_component_kvstats_total_blocks)kvstats_gpu_cache_usage_percent(notdynamo_component_kvstats_gpu_cache_usage_percent)kvstats_gpu_prefix_cache_hit_rate(notdynamo_component_kvstats_gpu_prefix_cache_hit_rate)The work handler metrics (lines 102-105) are correctly named in the documentation.
⛔ Skipped due to learnings
Learnt from: keivenchang Repo: ai-dynamo/dynamo PR: 3035 File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:75-79 Timestamp: 2025-09-16T00:27:43.992Z Learning: In the ai-dynamo/dynamo codebase, the project uses "_total" suffix for all Prometheus metrics including gauges like inflight_requests, which differs from standard Prometheus conventions. The constant work_handler::INFLIGHT_REQUESTS does not exist - only work_handler::INFLIGHT_REQUESTS_TOTAL exists and should be used for the inflight requests gauge metric.Learnt from: keivenchang Repo: ai-dynamo/dynamo PR: 3035 File: lib/runtime/examples/system_metrics/README.md:65-65 Timestamp: 2025-09-16T00:26:43.641Z Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.docs/observability/tracing.md (2)
18-25: All environment variable defaults verified and confirmed correct:
Variable Documentation Default Implementation Default Status DYN_LOGGING_JSONLfalseenv_is_truthy() → false ✓ OTEL_EXPORT_ENABLEDfalseenv_is_truthy() → false ✓ OTEL_EXPORTER_OTLP_TRACES_ENDPOINThttp://localhost:4317http://localhost:4317✓ OTEL_SERVICE_NAMEdynamodynamo✓ Tempo OTLP gRPC port 4317 confirmed in
deploy/docker-observability.yml(line 96).
56-58: No issues found—environment variable syntax is correct.The
OTEL_EXPORT_ENABLED=truesyntax in docs/observability/tracing.md is valid. The application'sis_truthyfunction (lib/config/src/lib.rs:20-22, lib/runtime/src/config.rs:396-398) explicitly accepts "true" as a truthy value via pattern matching:matches!(val.to_lowercase().as_str(), "1" | "true" | "on" | "yes"). The example aligns with the documented acceptable values listed in the code comment.
deploy/docker-compose.yml
Outdated
| # Bare minimum infrastructure services for Dynamo. | ||
| # For observability (metrics, tracing, dashboards), use docker-compose-observability.yml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Verify the observability compose file name.
The comment references docker-compose-observability.yml, but the AI summary mentions docker-observability.yml. Please verify the correct filename.
🏁 Script executed:
#!/bin/bash
# Description: Verify the observability compose file name
if [ -f "deploy/docker-observability.yml" ]; then
echo "✓ Found: deploy/docker-observability.yml"
elif [ -f "deploy/docker-compose-observability.yml" ]; then
echo "✓ Found: deploy/docker-compose-observability.yml"
else
echo "✗ Neither deploy/docker-observability.yml nor deploy/docker-compose-observability.yml found"
fiLength of output: 40
Update comment to reference correct observability compose file name.
The comment at lines 4-5 references docker-compose-observability.yml, but the actual file is docker-observability.yml. Update the comment to reference the correct filename:
# For observability (metrics, tracing, dashboards), use docker-observability.yml
🤖 Prompt for AI Agents
In deploy/docker-compose.yml around lines 4 to 5, the inline comment references
the wrong observability compose filename; update the comment to reference
docker-observability.yml instead of docker-compose-observability.yml so the
comment reads: "# For observability (metrics, tracing, dashboards), use
docker-observability.yml".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@keivenchang Thank you for the quick fix! 🎉
🧠 Learnings used
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.
If you found this review helpful, would you consider giving us a shout-out on X?
Thank you for using CodeRabbit!
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
| @@ -1,35 +0,0 @@ | |||
| # SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now subsumed in the deploy/docker-observability.yml file (plus other stuff).
| @@ -0,0 +1,137 @@ | |||
| # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new file is a merge of the old deploy/tracing/docker-compose.yml and the old deploy/docker-compose.yml, containing only observability related services.
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
- Fix docker-compose.yml comment to reference correct filename (docker-observability.yml not docker-compose-observability.yml) - Fix abbreviation formatting in kubernetes logging docs (e.g., etc.) Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Correct relative path to metrics-developer-guide.md (needs 6 levels up, not 5) Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for this PR let's keep as-is, will revisit after the cleanup. As a developer I like the pointer to the docs since I reach for the code folders before the docs folders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tzulingk we need to update this for the canary health check
docs/observability/logging.md
Outdated
| | Variable | Description | Default | Example | | ||
| |----------|-------------|---------|---------| | ||
| | `DYN_LOGGING_JSONL` | Enable JSONL logging format | `false` | `true` | | ||
| | `DYN_LOG` | Log level: `info` or `debug` | `info` | `debug` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we lost some info here in being able to change logging by target -
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, added back.
docs/observability/logging.md
Outdated
| |----------|-------------|---------|---------| | ||
| | `DYN_LOGGING_JSONL` | Enable JSONL logging format | `false` | `true` | | ||
| | `DYN_LOG` | Log level: `info` or `debug` | `info` | `debug` | | ||
| | `DYN_LOG_USE_LOCAL_TZ` | Use local timezone for timestamps | `false` | `true` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should indicate the default is UTC
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
nnshah1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only thing needs changing is some tweaking of the logging doc to reduce confusion with OpenTelemetry - hope that makes sense - otherwise LGTM - great work!
- Remove redundant 'grafana-' prefix from dashboard filenames - Remove redundant '-dashboard' suffix from JSON files - Simplify dcgm-metrics.json copyright to 2-line SPDX format - Update all documentation references to new filenames - Add detailed DYN_LOG per-target syntax to observability docs - Clarify DYN_LOG_USE_LOCAL_TZ default timezone (UTC) - Add Dynamo logging variables table to K8s logging docs Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
- Centralize docker-compose instructions in README.md - Add topology diagram and configuration files to README.md - Standardize all docs to use 'Getting Started Quickly' heading - Remove duplicate docker-compose commands from individual guides - All guides now reference README.md for observability stack setup - Update DYN_LOG documentation with per-target syntax examples - Clarify DYN_LOG_USE_LOCAL_TZ default timezone (UTC) - Add Dynamo logging variables to K8s logging docs - Remove lib/bindings/python/examples/metrics/README.md - Improve consistency across all observability documentation Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
…-compose-files Resolved conflicts: - docs/observability/health-checks.md: Removed duplicate environment variables table - docs/observability/logging.md: Kept simplified 'OTLP export enabled' message - docs/observability/prometheus-grafana.md: Kept link to metrics-developer-guide.md - lib/bindings/python/examples/metrics/README.md: Accepted deletion Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
Add wait_for_service_ready() call to ensure the HTTP service is fully started and listening before sending requests. Without this, the test could fail with 404 errors if requests arrive before the server is ready. This race condition has existed since the test was written, but became more visible after recent changes (e.g. KeyValueStoreManager refactor in Oct 2025). The wait_for_service_ready() helper was added in July 2025 for HTTP disconnect tests but the original test_http_service was never updated to use it. This follows the pattern used by other tests in the same file. Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
- Update logging.md to clarify that trace/span information uses OpenTelemetry format/libraries but doesn't require an OpenTelemetry backend (Tempo/Jaeger) - Standardize copyright headers to 2-line SPDX format across observability docs - Remove full Apache license text from logging.md - Add missing copyright header to prometheus-grafana.md Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
| let task = tokio::spawn(async move { service.run(token.clone()).await }); | ||
|
|
||
| // Wait for the service to be ready before proceeding | ||
| wait_for_service_ready(port).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was this flaky from a race condition or something before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw flaky behavior from this test in CI +1
| **Component Metrics**: The core Dynamo backend system automatically exposes metrics with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework. These include request counts, processing times, byte transfers, and system uptime metrics. See [prometheus-grafana.md](prometheus-grafana.md#available-metrics) for the complete list of component metrics. | ||
| | Variable | Description | Default | Example | | ||
| |----------|-------------|---------|---------| | ||
| | `DYN_SYSTEM_ENABLED` | Enable system metrics/health server | `false` | `true` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note any straggling references to DYN_SYSTEM_ENABLED in this PR will need to be removed in your part2 or a part3 follow up
| } | ||
|
|
||
| /// Check if OTLP trace exporting is enabled (set OTEL_EXPORT_ENABLED=1 to enable) | ||
| /// Check if OTLP trace exporting is enabled (set OTEL_EXPORT_ENABLED to a truthy value: 1, true, on, yes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
|
@keivenchang Just took a tour through your branch and it looks awesome. Here are some quick fixes I found:
|
Overview:
This PR consolidates Dynamo's observability infrastructure to provide a more consistent and easier-to-follow experience. All observability documentation now follows a uniform structure (Overview → Environment Variables → Getting Started → Details), making it simple to find what you need whether you're setting up metrics, tracing, logging, or health checks.
The refactored structure separates concerns clearly:
docker-compose.ymlhandles only core infrastructure (NATS & etcd), whiledocker-observability.ymlprovides the complete observability stack. All observability configurations are now centralized underdeploy/observability/, eliminating the previous scattered structure acrossdeploy/metrics/,deploy/logging/, anddeploy/tracing/.Documentation improvements include:
docs/observability/README.mdserves as the unified gateway to all observability topicsmetrics.mdDetails:
docker-compose.yml(NATS & etcd only) anddocker-observability.yml(Prometheus, Grafana, Tempo, exporters)deploy/observability/(previously scattered acrossmetrics/,logging/,tracing/)deploy/observability/k8s/docs/observability/README.mdas unified entry point with navigation tablemetrics.md)x-request-idcorrelation guidance for easier debuggingenv_is_truthyutility usage for OTLP configuration consistencyWhere should the reviewer start?
Review
docs/observability/README.md, which serves as the entry point to all observability documentation. Notice howmetrics.md,tracing.md,health-checks.md, andlogging.mdall follow the same consistent structure: Overview → Environment Variables → Getting Started → Details. This uniform pattern makes it easy to quickly find configuration options (always in a table) and get started with practical examples (always in a Getting Started section).Then check
deploy/observability/to see how all observability configs are now centralized in one location instead of being scattered across multiple directories.Related Issues:
Relates to DIS-980
/coderabbit profile chill