Skip to content

[feat][monitor] PIP-223: Add metrics for all rest endpoints.#21772

Open
dao-jun wants to merge 17 commits intoapache:masterfrom
dao-jun:dev/pip-223
Open

[feat][monitor] PIP-223: Add metrics for all rest endpoints.#21772
dao-jun wants to merge 17 commits intoapache:masterfrom
dao-jun:dev/pip-223

Conversation

@dao-jun
Copy link
Member

@dao-jun dao-jun commented Dec 20, 2023

PIP: #18560

Motivation

#18560

Modifications

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: dao-jun#6

@github-actions github-actions bot added the doc-required Your PR changes impact docs and you will update later. label Dec 20, 2023
@dao-jun dao-jun self-assigned this Dec 20, 2023
@dao-jun
Copy link
Member Author

dao-jun commented Dec 20, 2023

refers to: #18836

@dao-jun dao-jun changed the title [feat][monitoring] PIP-223: Add metrics for all rest endpoints. [feat][monitor] PIP-223: Add metrics for all rest endpoints. Dec 20, 2023
@Technoboy- Technoboy- modified the milestones: 3.2.0, 3.3.0 Dec 22, 2023
@dao-jun dao-jun closed this Feb 29, 2024
@dao-jun dao-jun reopened this Feb 29, 2024
@github-actions github-actions bot removed the type/PIP label Feb 29, 2024
@dao-jun dao-jun requested review from asafm and lhotari February 29, 2024 13:24
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 86.79245% with 7 lines in your changes are missing coverage. Please review.

Project coverage is 73.59%. Comparing base (bbc6224) to head (e139404).
Report is 19 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #21772      +/-   ##
============================================
+ Coverage     73.57%   73.59%   +0.01%     
+ Complexity    32624    32153     -471     
============================================
  Files          1877     1878       +1     
  Lines        139502   139603     +101     
  Branches      15299    15321      +22     
============================================
+ Hits         102638   102737      +99     
+ Misses        28908    28878      -30     
- Partials       7956     7988      +32     
Flag Coverage Δ
inttests 26.43% <3.77%> (+1.85%) ⬆️
systests 24.31% <3.77%> (-0.02%) ⬇️
unittests 72.84% <86.79%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...org/apache/pulsar/broker/ServiceConfiguration.java 99.39% <100.00%> (+<0.01%) ⬆️
.../java/org/apache/pulsar/broker/web/WebService.java 94.21% <100.00%> (+0.06%) ⬆️
...e/pulsar/broker/web/RestEndpointMetricsFilter.java 85.71% <85.71%> (ø)

... and 74 files with indirect coverage changes

Copy link
Contributor

@asafm asafm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for implementing this.
Please note that we have already started implementing PIP-264, which is Open Telemetry. We have added the infrastructure needed to define metrics, and we are underway of adding the first metrics.
Once we finish the first PR, can we ping you to add those metrics using OTel? Once that PR we fill finally know the naming conventions to use.

admin.namespaces().deleteNamespace("test/test");
admin.tenants().deleteTenant("test");

ByteArrayOutputStream output = new ByteArrayOutputStream();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Save yourself this code and the intimate knowledge of how it is implemented:
Create a client like this against the local broker.

            prometheusMetricsClient = new PrometheusMetricsClient("127.0.0.1", pulsar.getListenPortHTTP().get());

Get the Metrics:

            Metrics metrics = prometheusMetricsClient.getMetrics();

and assert example:

            Metric backlogAgeMetric =
                    metrics.findSingleMetricByNameAndLabels("pulsar_storage_backlog_age_seconds",
                            Pair.of("topic", topic1));
            assertThat(backlogAgeMetric.tags).containsExactly(
                    entry("cluster", CLUSTER_NAME),
                    entry("namespace", namespace),
                    entry("topic", topic1));
            assertThat((long) backlogAgeMetric.value).isCloseTo(expectedMessageAgeSeconds, within(2L));

Add the method you lack at Metrics

private RestEndpointMetricsFilter(PulsarService pulsar) {
PulsarBrokerOpenTelemetry telemetry = pulsar.getOpenTelemetry();
Meter meter = telemetry.getMeter();
latency = meter.histogramBuilder("pulsar_broker_rest_endpoint_latency")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please look at #22058 to understand how to do:
Instrument name, description, unit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still relevant :)

@dao-jun
Copy link
Member Author

dao-jun commented Mar 12, 2024

@dragosvictor Could you please take a look that why my test keeps failing?

private RestEndpointMetricsFilter(PulsarService pulsar) {
PulsarBrokerOpenTelemetry telemetry = pulsar.getOpenTelemetry();
Meter meter = telemetry.getMeter();
latency = meter.histogramBuilder("pulsar_broker_rest_endpoint_latency")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still relevant :)

Meter meter = openTelemetry.getMeter();
latency = meter.histogramBuilder("pulsar_broker_rest_endpoint_latency")
.setDescription("Latency of REST endpoints in Pulsar broker")
.setUnit("ms")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.build();
}

private static volatile RestEndpointMetricsFilter instance;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it need to be static? Can't you just created one instance of the filter and register it?

UriRoutingContext info = (UriRoutingContext) req.getUriInfo();
attrs = getRequestAttributes(info, statusCode);
} catch (Throwable ex) {
attrs = Attributes.of(PATH, "UNKNOWN", METHOD, req.getMethod(), CODE, (long) statusCode);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about that yet.


Object o = req.getProperty(REQUEST_START_TIME);
if (o instanceof Long start) {
long duration = System.currentTimeMillis() - start;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read online now, and from from what I can gather, it's not recommended to use System.currentTimeMillis(). It's a native call to obtain the wall clock time. NTP for example can sync the clock and change. Not 100% sure but leap second may also change it. See: https://develotters.com/posts/how-not-to-measure-elapsed-time/

I think it is safer to use System.nanoTime()

UriRoutingContext info = (UriRoutingContext) req.getUriInfo();
attrs = getRequestAttributes(info, statusCode);
} catch (Throwable ex) {
attrs = Attributes.of(PATH, "UNKNOWN", METHOD, req.getMethod(), CODE, (long) statusCode);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's a good practice. For example OOME will vanish like that.
I think if we don't have any specific exception in mind, no point in guarding this part.

if (CollectionUtils.isEmpty(templates)) {
return Attributes.of(PATH, "UNKNOWN", METHOD, httpMethod, CODE, statusCode);
}
UriTemplate[] arr = templates.toArray(new UriTemplate[0]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you specifically need to convert this to an array? Can't you just iterate over the list using get(i)?

UriTemplate[] arr = templates.toArray(new UriTemplate[0]);
int idx = arr.length - 1;
StringBuilder builder = new StringBuilder();
for (; idx >= 0; idx--) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe for (int idx = arr.length - 1; ...?

admin.tenants().deleteTenant("test");

Collection<MetricData> metricDatas = pulsarTestContext.getOpenTelemetryMetricReader().collectAllMetrics();
log.info("Metrics size: {}", metricDatas.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be removed at this stage


Collection<MetricData> metricDatas = pulsarTestContext.getOpenTelemetryMetricReader().collectAllMetrics();
log.info("Metrics size: {}", metricDatas.size());
Optional<MetricData> optional = metricDatas.stream().peek(m -> log.info("metric name: {}", m.getName()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/broker area/metrics doc-required Your PR changes impact docs and you will update later. ready-to-test release/4.0.9

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants