Skip to content

[feat] [broker] Add broker health check status into prometheus metrics#10

Open
vineeth1995 wants to merge 1372 commits intomasterfrom
healthcheck
Open

[feat] [broker] Add broker health check status into prometheus metrics#10
vineeth1995 wants to merge 1372 commits intomasterfrom
healthcheck

Conversation

@vineeth1995
Copy link
Owner

@vineeth1995 vineeth1995 commented Apr 17, 2023

Motivation

To add broker health check into prometheus metric.

Modifications

Schedule a job at 1 minute time interval which calls healthCheck API on broker and updates the pulsar stats based on the broker health.

Verifying this change

Unit test cases were added to verify this change.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Copy link

@rdhabalia rdhabalia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's create an issue first before creating this PR.
We want at least 10 entries for you in your contribution list to start vote for committer
https://github.com/apache/pulsar/pulls?q=is%3Apr+assignee%3Avineeth1995

conf/broker.conf Outdated

includeHealthCheckInMetrics=true
healthCheckFrequencyInSeconds=60
healthCheckInitialDelayInSecs=60

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead, all 3 config variables just keep one variable: healthCheckMetricsUpdateTimeInSeconds=-1. here -1 means disable. and please add config documentation in config.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

import org.apache.pulsar.common.naming.SystemTopicNames;
import org.apache.pulsar.common.naming.TopicDomain;
import org.apache.pulsar.common.naming.TopicName;
import org.apache.pulsar.common.naming.*;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having * is not good. please update with the required imports.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

.numThreads(1)
.build();

this.healthChecker = OrderedScheduler.newSchedulerBuilder()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initialize executor only if healthCheckMetricsUpdateTimeInSeconds >0 . so, just call method

void initializeHealthChecker() {
if(healthCheckMetricsUpdateTimeInSeconds > 0){
healthChecker = new ...
healthChecker.scheduleAtFixedRate(this::checkHealth,
                    healthCheckMetricsUpdateTimeInSeconds, healthCheckMetricsUpdateTimeInSeconds, TimeUnit.SECONDS);
}
}

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

updateRates();
}

protected void startHealthChecker() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaced by initializeHealthChecker(..)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

private final LongAdder connectionCreateFailCount;
private final LongAdder connectionTotalClosedCount;
private final LongAdder connectionActive;
private volatile Long healthCheckStatus;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default value -1 means unknown
1 = success
0 = fail

so, if no one enables this task then value should be -1 unknown.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

@vineeth1995 vineeth1995 force-pushed the healthcheck branch 3 times, most recently from 89dcf3c to 3f9b152 Compare April 19, 2023 21:31
minValue = -1,
doc = "HealthCheck update frequency in seconds. Disable health check with value -1 (Default value 60)"
)
private int healthCheckMetricsUpdateTimeInSeconds = 60;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default should be disabled -1

.numThreads(1)
.build();
int healthCheckFreqInSecs = config.getHealthCheckMetricsUpdateTimeInSeconds();
int healthCheckInitialDelayInSecs = config.getHealthCheckMetricsUpdateTimeInSeconds();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

healthCheckFreqInSecs and healthCheckInitialDelayInSecs are same variable. we should use only 1 healthCheckFreqInSecs. please remove duplicate.

Copy link

@rdhabalia rdhabalia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

}

public void checkHealth() {
BrokersBase.internalRunHealthCheck(TopicVersion.V2, pulsar(), null).thenAccept(__ -> {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import static import org.apache.pulsar.broker.admin.impl.BrokersBase.internalRunHealthCheck;

internalRunHealthCheck(TopicVersion.V2, pulsar(), null)...

}

public void recordHealthCheckStatusSuccess() {
System.out.println("recording health check success");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove System.out.println

private final LongAdder connectionCreateFailCount;
private final LongAdder connectionTotalClosedCount;
private final LongAdder connectionActive;
private volatile Long healthCheckStatus;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we please add comment:
// 1=success, 0=failure, -1=unknown

@vineeth1995 vineeth1995 force-pushed the healthcheck branch 3 times, most recently from f74cf57 to 8216576 Compare April 24, 2023 20:19
@github-actions
Copy link

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label May 25, 2023
coderzc and others added 17 commits June 17, 2024 11:25
…, LeastLongTermMessageRate, ModularLoadManagerImpl. (apache#22889)

Implementation PR: apache#22888

### Motivation

Initially, we introduce `loadBalancerCPUResourceWeight`, `loadBalancerBandwidthInResourceWeight`, `loadBalancerBandwidthOutResourceWeight`, `loadBalancerMemoryResourceWeight`, `loadBalancerDirectMemoryResourceWeight` in `ThresholdShedder` to control the resource weight for different resources when calculating the load of the broker. 

Then we let it work for `LeastResourceUsageWithWeight` for better bundle placement policy.

But apache#19559 and apache#21168 have point out that the actual load of the broker is not related to the memory usage and direct memory usage, thus we have changed the default value of `loadBalancerMemoryResourceWeight`, `loadBalancerDirectMemoryResourceWeight` to 0.0.


There are still some places where memory usage and direct memory usage are used to calculate the load of the broker, such as `OverloadShedder`, `LeastLongTermMessageRate`, `ModularLoadManagerImpl`. We should let the resource weight work for these places so that we can set the resource weight to 0.0 to avoid the impact of memory usage and direct memory usage on the load of the broker.


### Modifications

- Let resource weight work for `OverloadShedder`, `LeastLongTermMessageRate`, `ModularLoadManagerImpl`.
### Motivation

Those bundles that are filtered when try to unload them should not be included in the indicator.

### Modifications

Increment the metric only when the bundle are unloaded.
…, LeastLongTermMessageRate, ModularLoadManagerImpl. (apache#22888)
Demogorgon314 and others added 25 commits September 29, 2024 15:13
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matteo Merli <mmerli@apache.org>
@vineeth1995 vineeth1995 force-pushed the healthcheck branch 3 times, most recently from 50bb381 to 29f77e9 Compare October 4, 2024 23:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.