daveoy
Contributor

@daveoy daveoy commented Sep 26, 2025

should close #1008

  • replace opencensus with opentelemetry in: prometheus exporter, stackdriver exporter and metrics abstraction utilities used by other components
  • fake metrics replaced to preserve problem metrics manager stub used for testing

to note:

  • daemon and exporter initialization have been reordered since problemmetrics global manager initialization order matters (we need to register all exporters with the global manager before starting the problem daemon) -- this has been run on a few clusters in my env and does not affect node-problem-detector
  • make test is passing
  • make lint is passing
  • i have no way of testing the stackdriver / google compute engine metrics refactor, i hope the e2e test provides this?
  • systemstatsmonitor, healthchecker etc are not in use in my environment and i don't know if the tests are enough to ensure they're still working too (some common components, i.e. the metrics_int64 and float64 representations might be at play)
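
The ordering constraint in the first note can be sketched roughly as follows. This is only an illustration using the standard library: the `manager` type, `register`, `startDaemons`, and the exporter names are hypothetical stand-ins, not the actual problemmetrics API, and rejecting a late registration is an assumption made here to make the constraint visible.

```go
package main

import (
	"fmt"
	"sync"
)

// exporter is a hypothetical stand-in for a metrics exporter.
type exporter interface {
	Name() string
}

type namedExporter string

func (n namedExporter) Name() string { return string(n) }

// manager is a hypothetical global metrics manager. It is not the real
// problemmetrics type; it only illustrates the ordering constraint.
type manager struct {
	mu        sync.Mutex
	exporters []string
	started   bool
}

var (
	globalManager *manager
	initOnce      sync.Once
)

// global returns the process-wide manager, creating it exactly once.
func global() *manager {
	initOnce.Do(func() { globalManager = &manager{} })
	return globalManager
}

// register attaches an exporter and fails once daemons are running.
func (m *manager) register(e exporter) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.started {
		return fmt.Errorf("exporter %q registered after daemons started", e.Name())
	}
	m.exporters = append(m.exporters, e.Name())
	return nil
}

// startDaemons marks the point after which metrics begin to flow and
// returns a copy of the exporters registered so far.
func (m *manager) startDaemons() []string {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.started = true
	return append([]string(nil), m.exporters...)
}

func main() {
	m := global()
	for _, e := range []exporter{namedExporter("prometheus"), namedExporter("stackdriver")} {
		if err := m.register(e); err != nil {
			panic(err)
		}
	}
	fmt.Println("daemons started with exporters:", m.startDaemons())
	// A late registration is rejected instead of silently missing metrics.
	fmt.Println("late registration error:", m.register(namedExporter("late")))
}
```

Making the late registration an explicit error is one possible design choice; it turns an ordering mistake into a visible failure at startup.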

* feat(exporters): replace opencensus with opentelemetry

* fix(stackdriver_exporter): use metrics util constants

* chore:  use global meter and init

* feat(exporters): add resource info, use global meter

* chore: update deps

* fix: use a gauge directly, narrow scope info

* cleanup: deps

* fix: try initOnce pattern for global problem metrics

* fix(problemmetrics): init is being called too late

* fix(gauge): remove _ratio suffix

* fix: clean up after rebase

* fix: cleanup after rebase

* fix: cleanup after rebase

* fix: cleanup after rebase

* fix: remove nodename

* fix(test): remove instance id check

* fix(test): add attr values

* fix(test): unexported resources

* fix(test): initialize global problem metrics manager

* fix(lint): make the linter happy
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 26, 2025
@k8s-ci-robot
Contributor

Hi @daveoy. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: daveoy
Once this PR has been reviewed and has the lgtm label, please assign random-liu for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 26, 2025
@daveoy
Contributor Author

daveoy commented Sep 26, 2025

@wangzhen127 can i get an /ok-to-test please

@hakman
Member

hakman commented Sep 26, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 26, 2025
@daveoy
Contributor Author

daveoy commented Sep 26, 2025

e2e test is complaining about the presence of otel's default go_ namespaced metrics -- we can configure them off; they are new/default with otel.

@daveoy
Contributor Author

daveoy commented Sep 27, 2025

@hakman @wangzhen127 looks like we're ready for review over here

be gentle

@daveoy
Contributor Author

daveoy commented Sep 29, 2025

Can we get a review?

@hakman
Member

hakman commented Sep 30, 2025

@daveoy Sorry, but I am not very familiar with this area. I can give it a try, but probably @wangzhen127 or someone from Google has better context.
CC @SergeyKanzhelev

@wangzhen127
Member

Thanks for uploading this PR! Sorry for the delay. Got busy in the past week. Will try to get to this within this week.

@SergeyKanzhelev
Member

@dashpole can you help review or suggest somebody?

@SergeyKanzhelev
Member

  • i have no way of testing the stackdriver / google compute engine metrics refactor, i hope the e2e test provides this?

I would think we should start testing OpenTelemetry metrics with the OpenTelemetry collector. It should be reasonably easy to integrate into e2e. @dashpole are there any good examples where the OTel collector is integrated in e2e tests?

We may also eliminate the Stackdriver portion since the OTel protocol must be reasonably universal

@@ -0,0 +1,52 @@
/*
Copyright 2019 The Kubernetes Authors All rights reserved.
Member

@SergeyKanzhelev SergeyKanzhelev Oct 3, 2025

unrelated: we may need to eliminate the year completely

go.opentelemetry.io/otel/trace v1.36.0 // indirect
go.opentelemetry.io/otel/trace v1.37.0 // indirect
go.yaml.in/yaml/v2 v2.4.2 // indirect
go.yaml.in/yaml/v3 v3.0.4 // indirect
Member

I thought one of these would be eliminated once opencensus was removed. Are there more dependencies on the older one?

Member

maybe I was thinking about the gopkg.in/yaml

@wangzhen127
Member

Adding @rkolchmeyer as well

@dashpole
Contributor

dashpole commented Oct 3, 2025

Not sure if you've seen https://opentelemetry.io/docs/specs/otel/compatibility/opencensus/#migration-path, but that might be helpful.

@dashpole
Contributor

dashpole commented Oct 3, 2025

You can test OTLP in an integration test similar to this, but for metrics: https://github.com/kubernetes/kubernetes/blob/9624c3dcdc9a9e1286e8ca32d07d231e69ed2f0c/test/integration/apiserver/tracing/tracing_test.go#L354. Not sure if integrating the OTel collector will be easier or not.

@dashpole
Contributor

dashpole commented Oct 3, 2025

For unit tests, consider using go.opentelemetry.io/otel/sdk/metric/metricdata/metricdatatest. The test can set up an SDK with a ManualReader, manually call Collect, and then make assertions on the resulting metric data.

@dashpole
Contributor

dashpole commented Oct 3, 2025

If you need to write integration tests for google cloud monitoring, you can do something similar to https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/blob/7130d1aded77c51f613a9d949f166d119cd595e9/internal/cloudmock/metrics.go

}

// MeterName is the standard meter name used across the application
const MeterName = "node-problem-detector"
Contributor

We usually recommend that the meter name is the package name where metrics are defined. But given this is used as an implementation of your metric interface, that might not make sense.

@dashpole
Contributor

dashpole commented Oct 3, 2025

Overall, this looks correct to me. With better testing of the outputs (ideally done as a separate PR prior to this migration PR), including prometheus and GCM, you could do this migration with higher confidence that it isn't breaking anything.

@daveoy
Contributor Author

daveoy commented Oct 3, 2025

thank you all for your comments, keep them coming!

i should mention that this is running in my environment and the problemdaemons and metrics (problem_counter and problem_gauge) are operating as expected.

i'll see if i can mock the Google cloud monitoring stuff for higher confidence on the stackdriver side.

@rkolchmeyer rkolchmeyer left a comment

It looks like there are issues on GCE:

rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: timeSeries[57] (metric.type="compute.googleapis.com/guest/disk/operation_bytes_count", metric.labels={"direction": "write", "device_name": "sda8", "service_name": "node-problem-detector"}): unrecognized metric labels [service_name];

and that error repeats for a bunch of other metrics. I guess this PR adds a new "service_name" label that Cloud Monitoring doesn't expect. Is that label necessary for all opentelemetry clients, or can node-problem-detector avoid sending that on GCE?

Record(labelValues map[string]string, value int64) error
}

// NewInt64Metric create a Int64Metric metric, returns nil when viewName is empty.

It looks like the "returns nil when viewName is empty" behavior wasn't retained - the program exits in this situation in the new implementation. Would it make sense to keep this behavior? Same for the float64 metric.
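
For reference, the behavior described by the doc comment could look roughly like this. It is a stdlib-only sketch: `int64Metric`, `newInt64Metric`, and the metric name used in `main` are hypothetical stand-ins, not the actual code under review.

```go
package main

import "fmt"

// int64Metric is a hypothetical stand-in for the Int64Metric wrapper.
type int64Metric struct {
	name  string
	total int64
}

// Record adds value to a running total. labelValues is accepted only to
// mirror the quoted interface and is ignored in this sketch.
func (m *int64Metric) Record(labelValues map[string]string, value int64) error {
	m.total += value
	return nil
}

// newInt64Metric returns (nil, nil) when viewName is empty, so that a
// disabled metric is treated as "nothing to record" rather than as a
// fatal error.
func newInt64Metric(viewName string) (*int64Metric, error) {
	if viewName == "" {
		return nil, nil
	}
	return &int64Metric{name: viewName}, nil
}

func main() {
	// "disk_io_time" is an illustrative name; "" models a disabled metric.
	for _, name := range []string{"disk_io_time", ""} {
		m, err := newInt64Metric(name)
		if err != nil {
			panic(err)
		}
		if m == nil {
			fmt.Printf("metric %q disabled, skipping\n", name)
			continue
		}
		_ = m.Record(map[string]string{"device_name": "sda8"}, 1)
		fmt.Printf("metric %q recorded, total=%d\n", m.name, m.total)
	}
}
```

With this shape, callers have to check for a nil metric before recording, which keeps an unconfigured metric from terminating the program.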

@dashpole
Contributor

dashpole commented Oct 3, 2025

You will want to add the WithFilteredResourceAttributes(NoAttributes()) option to the google cloud exporter to fix #1141 (review)
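
To illustrate why such a filter would remove the rejected label, here is a stdlib-only sketch of the idea. It assumes, based on the error quoted above, that the exporter copies accepted resource attributes onto each time series as labels; `attrFilter`, `serviceAttributes`, `noAttributes`, and `metricLabels` are hypothetical names for illustration and not the exporter's API.

```go
package main

import (
	"fmt"
	"strings"
)

// attrFilter reports whether a resource attribute should be copied onto
// each metric as a label. Hypothetical stand-in for an attribute filter.
type attrFilter func(key, value string) bool

// serviceAttributes keeps a non-empty service.name (assumed default-like behavior).
func serviceAttributes(key, value string) bool {
	return key == "service.name" && value != ""
}

// noAttributes rejects every resource attribute.
func noAttributes(string, string) bool { return false }

// metricLabels merges a metric's own labels with the resource attributes
// accepted by keep, rewriting dots to underscores in the copied keys.
func metricLabels(own, resource map[string]string, keep attrFilter) map[string]string {
	out := make(map[string]string, len(own))
	for k, v := range own {
		out[k] = v
	}
	for k, v := range resource {
		if keep(k, v) {
			out[strings.ReplaceAll(k, ".", "_")] = v
		}
	}
	return out
}

func main() {
	own := map[string]string{"direction": "write", "device_name": "sda8"}
	resource := map[string]string{"service.name": "node-problem-detector"}

	// With the service filter, the extra service_name label appears.
	fmt.Println(metricLabels(own, resource, serviceAttributes))
	// With the reject-all filter, only the metric's own labels remain.
	fmt.Println(metricLabels(own, resource, noAttributes))
}
```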

@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 7, 2025
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Migrate OpenCensus to OpenTelemetry
7 participants