[improve][broker]add active status into cursor stats by HQebupt · Pull Request #1 · HQebupt/pulsar

HQebupt · 2022-09-29T07:23:44Z

Motivation

The active state indicates whether a cursor is active. It affects the cache hit rate, because inactive cursors will be evicted from the entry cache. Exposing this state helps to troubleshoot a low cache hit rate.
It is also useful when choosing a suitable value for managedLedgerCursorBackloggedThreshold in ServiceConfiguration:

# Configure the threshold (in number of entries) from where a cursor should be considered 'backlogged'
# and thus should be set as inactive.
managedLedgerCursorBackloggedThreshold=1000
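
As a rough illustration of the rule stated in the configuration comment, the sketch below classifies a cursor by the size of its backlog. The function name `is_cursor_active` and the strict less-than comparison are assumptions made for this example. The broker's actual check may differ at the boundary and may take other conditions into account, such as whether any consumer is connected.

```python
# Hypothetical helper: mirrors only the rule stated in the config comment,
# not the broker's real implementation.
BACKLOGGED_THRESHOLD = 1000  # value of managedLedgerCursorBackloggedThreshold


def is_cursor_active(backlog_entries: int, threshold: int = BACKLOGGED_THRESHOLD) -> bool:
    """Return True while the cursor's backlog is below the threshold."""
    return backlog_entries < threshold


if __name__ == "__main__":
    for backlog in (15, 999, 1000, 10_298_746):
        state = "active" if is_cursor_active(backlog) else "inactive (backlogged)"
        print(f"backlog={backlog}: {state}")
```

Raising the threshold keeps slower cursors active for longer, at the cost of retaining more entries in the cache.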

Modifications

Add the active status to the cursor stats.
Getting the internalStats of a partitioned topic then returns data like the following sample, where each cursor carries an "active" field.

{
    "entriesAddedCounter": 6,
    "numberOfEntries": 10302152,
    "totalSize": 63786554161,
    "currentLedgerEntries": 6,
    "currentLedgerSize": 31182,
    "lastLedgerCreatedTimestamp": "2022-09-29T14:48:52.825+08:00",
    "waitingCursorsCount": 0,
    "pendingAddEntriesCount": 0,
    "lastConfirmedEntry": "1359228:5",
    "lastIndex": 1165055506,
    "state": "LedgerOpened",
    "ledgers": [
        {
            "ledgerId": 1356881,
            "entries": 50000,
            "size": 309705120,
            "offloaded": false,
            "underReplicated": false
        }
    ],
    "cursors": {
        "cg_test_map_cache1": {
            "markDeletePosition": "1356881:3399",
            "readPosition": "1356881:3400",
            "waitingReadOp": false,
            "pendingReadOps": 0,
            "messagesConsumedCounter": -10298746,
            "cursorLedger": -1,
            "cursorLedgerLastEntry": -1,
            "individuallyDeletedMessages": "[]",
            "lastLedgerSwitchTimestamp": "2022-09-29T14:48:52.859+08:00",
            "state": "NoLedger",
            "numberOfEntriesSinceFirstNotAckedMessage": 1,
            "totalNonContiguousDeletedMessagesRange": 0,
            "subscriptionHavePendingRead": false,
            "subscriptionHavePendingReplayRead": false,
            "active": false,
            "properties": {
                "index": 846056487
            }
        },
        "cg_test_map_cache2": {
            "markDeletePosition": "1359208:79",
            "readPosition": "1359228:6",
            "waitingReadOp": false,
            "pendingReadOps": 0,
            "messagesConsumedCounter": -8,
            "cursorLedger": 1359234,
            "cursorLedgerLastEntry": 1,
            "individuallyDeletedMessages": "[]",
            "lastLedgerSwitchTimestamp": "2022-09-29T14:48:52.859+08:00",
            "state": "Open",
            "numberOfEntriesSinceFirstNotAckedMessage": 15,
            "totalNonContiguousDeletedMessagesRange": 0,
            "subscriptionHavePendingRead": false,
            "subscriptionHavePendingReplayRead": false,
            "active": true,
            "properties": {
                "index": 1165055008
            }
        }
    },
    "schemaLedgers": [],
    "compactedLedger": {
        "ledgerId": -1,
        "entries": -1,
        "size": -1,
        "offloaded": false,
        "underReplicated": false
    }
}
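
To find the cursors that may be lowering the cache hit rate, the new field can be filtered programmatically. The following is a minimal sketch, assuming the internal stats JSON has already been fetched (for example with `pulsar-admin topics stats-internal`) and has the shape shown above. The helper name `inactive_cursors` and the trimmed sample are illustrative only.

```python
import json
from typing import List

# Trimmed copy of the sample above; only the fields used here are kept.
SAMPLE = """
{
  "cursors": {
    "cg_test_map_cache1": {"readPosition": "1356881:3400", "active": false},
    "cg_test_map_cache2": {"readPosition": "1359228:6", "active": true}
  }
}
"""


def inactive_cursors(stats: dict) -> List[str]:
    """Return the sorted names of cursors whose "active" flag is false."""
    cursors = stats.get("cursors", {})
    # Stats from a broker without this change have no "active" field;
    # such cursors are skipped rather than guessed at.
    return sorted(name for name, c in cursors.items() if c.get("active") is False)


if __name__ == "__main__":
    print(inactive_cursors(json.loads(SAMPLE)))
```

With the sample data, only `cg_test_map_cache1` is reported, which matches its large distance from the last confirmed entry.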

Verifying this change

Make sure that the change passes the CI checks.

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

Dependencies (does it add or upgrade a dependency): (no)
The public API: (no)
The schema: (no)
The default values of configurations: (no)
The wire protocol: (no)
The rest endpoints: (no)
The admin cli options: (no)
Anything that affects deployment: (no)

Documentation

Check the box below or label this PR directly.

Need to update docs?

doc-required
doc-not-needed
doc
doc-complete

Matching PR in forked repository

PR in forked repository: #1


Fixes apache#16782

### Motivation

As we all know, Bundle split has 3 algorithms:

- range_equally_divide
- topic_count_equally_divide
- specified_positions_divide

However, none of these algorithms can divide bundles according to flow or qps, which may cause bundles to be split multiple times.


Fixes apache#17392

### Motivation

All timers in `ProducerImpl` are `std::shared_ptr` objects that can be reset with `nullptr` in `ProducerImpl::cancelTimers`. It could lead to null pointer access in some cases. See apache#17392 (comment) for the analysis.

Generally it's not necessary to hold a nullable pointer to the timer. However, to resolve the cyclic reference issue, apache#5246 reset the shared pointer to reduce the reference count manually. It's not a good solution because we have to perform null check for timers everywhere. The null check still has some race condition issue like:

Thread 1:

```c++
if (timer) {  // [1] timer is not nullptr
    timer->async_wait(/* ... */);  // [3] timer is null now, see [2] below
}
```

Thread 2:

```c++
timer.reset();  // [2]
```

The best solution is to capture `weak_ptr` in timer's callback and call `lock()` to check if the referenced object is still valid.

### Modifications

- Change the type of `sendTimer_` and `batchTimer_` to `deadline_timer`, not a `shared_ptr`.
- Use `PeriodicTask` instead of the `deadline_timer` for token refresh.
- Migrate `weak_from_this()` method from C++17 and capture `weak_from_this()` instead of `shared_from_this()` in callbacks.

### Verifying this change

Run the `testResendViaSendCallback` for many times and we can see it won't fail after this patch.

```bash
./tests/main --gtest_filter='BasicEndToEndTest.testResendViaSendCallback' --gtest_repeat=30
```


* Sync recent changes from apache#17030, apache#17039, apache#16315, and apache#17057
* fix apache#17119
* minor updates
* add link of release notes to navigation
* fix
* update release process as per PIP-190
* minor fix
* minor fix
* Update release-process.md


### Motivation

When a Pulsar topic is unloaded from a broker, certain metrics related to that topic will appear to remain active for the broker for 5 minutes. This is confusing for troubleshooting because it makes the topic appear to be owned by multiple brokers for a short period of time. See below for a way to reproduce this behavior.

In order to solve this "zombie" metric problem, I propose we remove the timestamps that get exported with each Prometheus metric served by the broker.

### Analysis

Since we introduced Prometheus metrics in apache#294, we have exported a timestamp along with most metrics. This is an optional, valid part of the spec defined [here](https://prometheus.io/docs/instrumenting/exposition_formats/#comments-help-text-and-type-information). However, after our adoption of Prometheus metrics, the Prometheus project released version 2.0 with a significant improvement to its concept of staleness. In short, before 2.0, a metric that was in the last scrape but not the next one (this often happens for topics that are unloaded) will essentially inherit the most recent value for the last 5 minute window. If there isn't one in the past 5 minutes, the metric becomes "stale" and isn't reported. Starting in 2.0, there was new logic to consider a value stale the very first time that it is not reported in a scrape. Importantly, this new behavior is only available if you do not export timestamps with metrics, as documented here: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness. We want to use the new behavior because it gives better insight into all topic metrics, which are subject to move between brokers at any time.

This presentation https://www.youtube.com/watch?v=GcTzd2CLH7I and slide deck https://promcon.io/2017-munich/slides/staleness-in-prometheus-2-0.pdf document the feature in detail. This blog post was also helpful: https://www.robustperception.io/staleness-and-promql/.

Additional motivation comes from mailing list threads like this one https://groups.google.com/g/prometheus-users/c/8OFAwp1OEcY. It says:

> Note, however, that adding timestamps is an extremely niche use case. Most of the users who think the need it should actually not do it.
>
> The main usecases within that tiny niche are federation and mirroring the data from another monitoring system.

The Prometheus Go client also indicates a similar motivation: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#NewMetricWithTimestamp.

The OpenMetrics project also recommends against exporting timestamps: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#exposing-timestamps.

As such, I think we are not a niche use case, and we should not add timestamps to our metrics.

### Reproducing the problem

1. Run any 2.x version of Prometheus (I used 2.31.0) along with the following scrape config:

   ```yaml
   - job_name: broker
     honor_timestamps: true
     scrape_interval: 30s
     scrape_timeout: 10s
     metrics_path: /metrics
     scheme: http
     follow_redirects: true
     static_configs:
       - targets: ["localhost:8080"]
   ```
2. Start pulsar standalone on the same machine. I used a recently compiled version of master.
3. Publish messages to a topic.
4. Observe `pulsar_in_messages_total` metric for the topic in the prometheus UI (localhost:9090)
5. Stop the producer.
6. Unload the topic from the broker.
7. Optionally, `curl` the metrics endpoint to verify that the topic’s `pulsar_in_messages_total` metric is no longer reported.
8. Watch the metrics get reported in prometheus for 5 additional minutes.

When you set `honor_timestamps: false`, the metric stops getting reported right after the topic is unloaded, which is the desired behavior.

### Modifications

* Remove all timestamps from metrics
* Fix affected tests and test files (some of those tests were in the proxy and the function worker, but no code was changed for those modules)

### Verifying this change

This change is accompanied by updated tests.

### Does this pull request potentially affect one of the following parts:

This is technically a breaking change to the metrics, though I would consider it a bug fix at this point. I will discuss it on the mailing list to ensure it gets proper visibility.

Given how frequently Pulsar changes which metrics are exposed between each scrape, I think this is an important fix that should be cherry picked to older release branches. Technically, we can avoid cherry picking this change if we advise users to set `honor_timestamps: false`. However, I think it is better to just remove them.

### Documentation

- [x] `doc-not-needed`


…pulsar-perf-producer (apache#17381)

* [improve][cli] Add option to disable batching in pulsar-testclient

Co-authored-by: Vineeth <vineeth.polamreddy@verizonmedia.com>
Co-authored-by: Rajan Dhabalia <rdhabalia@apache.org>


… been published after the topic gets activated on a broker (apache#16618)

* Skip creating a replication snapshot if no messages have been published
* Adapt test to new behavior where replication snapshots happen only when there are new messages

* Call cleanup method in finally block to ensure it's not skipped
* Clear invocations for the mocks that are left around without cleanup
* Cleanup PulsarService and PulsarAdmin mocks/spies in MockedPulsarServiceBaseTest
* Don't record invocations at all for PulsarService and PulsarAdmin in MockedPulsarServiceBaseTest
* Don't record invocations for spies by default
* Simplify reseting mocks
* Fix PersistentTopicTest
* Fix TokenExpirationProducerConsumerTest
* Fix SimpleLoadManagerImplTest
* Fix FilterEntryTest


…7209)

Fixes apache#17186

### Motivation

There are some cases in which it is useful to be able to include current position of the message when reset of cursor was made.

### Modifications

* Support inclusive seek in c++ consumers.
* Add a unit test to verify.

* fix: delete sqlite files after jdbc connection closed

  This closes apache#17713.

  Signed-off-by: tison <wander4096@gmail.com>
* uses isolated db file

  Signed-off-by: tison <wander4096@gmail.com>
* Revert "uses isolated db file"

  This reverts commit 295db3c.
* close in order

  Signed-off-by: tison <wander4096@gmail.com>
* strong order guarantee

  Signed-off-by: tison <wander4096@gmail.com>
* factor out defer logic to avoid further bugs

  Signed-off-by: tison <wander4096@gmail.com>
* Revert "factor out defer logic to avoid further bugs"

  This reverts commit f7f4634.
* Revert "strong order guarantee"

  This reverts commit 747086f.
* use awaitTermination

  Signed-off-by: tison <wander4096@gmail.com>

Signed-off-by: tison <wander4096@gmail.com>


…on time (apache#17790)

Fixes

- apache#17623
- apache#17637

### Motivation

Manually release resources, including `consumer`, `producer`, `pulsar client`, `transaction`, and `topic`. This saves `setup` and `cleanup` time before and after each method.

### Modifications

- Manually release resources instead of calling `cleanup` & `setup` each method
- remove useless method `markDeletePositionCheck`
- `Integer.valueOf(int)` instead of `new Integer(int)`, because `new Integer(int)` is deprecated

### Matching PR in forked repository

PR in forked repository:

- poorbarcode#10

Fixes apache#17785

### Motivation

The `failureMap` needs to be cleared after each unit test runs.

### Modifications

Clear `failureMap` after each unit test, and run `setup()`/`cleanup()` only once to reduce execution time.

### Matching PR in forked repository

PR in forked repository: coderzc#6

…ache#17252)

- fixes issue with stats where timestamps might be inconsistent because of visibility issues
- fields should be volatile to ensure visibility of updated values in a consistent manner
- in replication, the lastDataMessagePublishedTimestamp field in PersistentTopic might be inconsistent unless volatile is used

…he#16891)

* add reader config doc
* update to the versioned doc
* Update site2/docs/io-debezium-source.md

  Co-authored-by: momo-jun <60642177+momo-jun@users.noreply.github.com>
* Update site2/docs/io-debezium-source.md

  Co-authored-by: momo-jun <60642177+momo-jun@users.noreply.github.com>
* revert changes to 2.10.1 and 2.9.3

Co-authored-by: momo-jun <60642177+momo-jun@users.noreply.github.com>


AnonHxy · 2022-09-30T06:54:39Z

/pulsarbot run-failure-checks

HQebupt · 2022-10-01T01:52:26Z

/pulsarbot run-failure-checks

github-actions · 2022-11-19T02:58:21Z

The pr had no activity for 30 days, mark with Stale label.

tisonkun and others added 30 commits September 5, 2022 12:09

[fix][sec] bump snakeyaml to 1.31 fix CVE-2022-25857 (apache#17457)

3ea478a

[improve][broker] Using TopicName instead of String as the parameter …

a49fbf6

…for `getTopic`. (apache#17416)

[improve][sec] suppress CVE-2021-3563 of openstack-keystone-2.5.0 (ap…

0e4e88b

…ache#17458)

Revert "Enable Log4j2 async loggers (apache#15188)" (apache#17474)

35d3d5c

This reverts commit 1e9a1f0.

[ci] move back PersistentStreamingDispatcherBlockConsumerTest to flak…

5137905

…y suite (apache#17476)

[fix][doc] Add more information for producer_request_hold backlog pol…

afbf72e

…icy (apache#17320)

[fix][license] Update the jna version in the presto license file (apa…

f77dec4

…che#17459) --- *Motivation* We update the jna version in this [PR](apache#17262). We should update the version in presto license file as well.

[doc][monitoring][txn] add doc for transaction metrics (apache#15218)

2701e51

[fix][license] Update the log4j version in the presto license file (a…

3de99ee

…pache#17462)

remove useless else block (apache#17122)

f4a8260

[fix][flaky-test] Fix RawReaderTest.testFlowControlBatch (apache#17369)

cfe95dd

[improve][broker] Improve backlogQuota endpoint to pure async. (apa…

ff4dc08

…che#17383)

[fix][flaky-test]BatchMessageWithBatchIndexLevelTest.testBatchMessage…

427e129

…Ack (apache#17436)

Generate doc for 2.7.5 (apache#17448)

bf264c8

Update swagger file for v2.7.5 (apache#17444)

b81d0a5

[fix][broker]After the broker is restarted, the cache dynamic configu…

a7f1a56

…ration is invalid (apache#17035)

[feat][python] Add basic authentication (apache#17482)

eae90f6

[fix][doc] Add verifying the client cert (apache#17486)

edfc5cb

Signed-off-by: Zixuan Liu <nodeces@gmail.com> Signed-off-by: Zixuan Liu <nodeces@gmail.com>

[improve][doc] Improve golang client doc (apache#17455)

757035b

[improve][docs] Get started locally (apache#17475)

ae7c941

[improve][pulsar-client-tools] Updated set retention time description…

6c02186

… and added test cases including default time (apache#16130)

remove unnecessary parameters(reusefuture) and related logic (apache#…

f453e0a

…17378) Co-authored-by: huangzegui <huangzegui@didiglobal.com>

[improve][cli] Pulsar shell: fix custom commands autocompletion (apac…

1d907b7

…he#17479)

[improve][broker] Improve cursor.getNumberOfEntries if isUnackedRange…

09edcce

…sOpenCacheSetEnabled=true (apache#17465)

Add ownerBroker field to topic stats (apache#17136)

eab2bb5

coderzc and others added 24 commits September 27, 2022 06:50

[fix][doc] Fix maxNumberOfRejectedRequestPerConnection doc (apache#17821

26281d5

) * Fix maxNumberOfRejectedRequestPerConnection doc * fix doc in 2.8.x docs

[fix][metrics]wrong metrics text generated when label_cluster specifi…

518cdcd

…ed (apache#17704) * [fix][metrics]wrong metrics text generated when label_cluster specified * improve logic branch * mark test group

[Pulsar-init] Support cluster init using proxy url and protocol (apac…

6528a91

…he#17844)

docs: Updating Python installation section (apache#17796)

5e42e4d

[fix][cli] Quit PerformanceConsumer after receiving numMessages messa…

59ce90c

…ges (apache#17750)

docs: add developers-landing page to sidebars (apache#17780)

91f747f

Signed-off-by: tison <wander4096@gmail.com> Signed-off-by: tison <wander4096@gmail.com>

[improve][pulsar-io-kafka] Add option to copy Kafka headers to Pulsar…

b89c145

… properties (apache#17829)

[improve][common] Make Bookkeeper metadata options configurable (apac…

6dd38a4

…he#17834) - use Bookkeeper defaults by setting BK_METADATA_OPTIONS=none

[fix][flaky-test]Delete PersistentSubscriptionTest.testCanAcknowledge…

5f59f8b

…AndCommitForTransaction (apache#17845) * scenario is already covered by PendingAckPersistentTest

[cleanup][broker][Modernizer] fix violations in pulsar-broker (apache…

8aef1bf

…#17691) Co-authored-by: Marvin Cai <cai19930303@gmail.com>

[improve][broker] Add a message to a NullPointerException created in …

dfd4882

…ManagedLedgerImpl (apache#17293) - a NPE with no description is confusing

[improve][test] Add integration test for websocket (apache#17843)

0678b82

[improve][ML] Print log when delete empty ledger. (apache#17859)

716f5e2

Fix NPE when ResourceGroupService execute scheduled task. (apache#17840)

62d900f

[improve][doc] Add a limitation for key_shared subscription type (apa…

048ccae

…che#15709)

[improve][broker]add active status into cursor stats

cf72151

HQebupt mentioned this pull request Sep 29, 2022

[improve][broker]add active status into cursor stats apache/pulsar#17884

Merged

5 tasks

HQebupt force-pushed the master branch from 98e9089 to 67361e8 Compare October 19, 2022 09:16

github-actions bot added the Stale label Nov 19, 2022


[improve][broker]add active status into cursor stats #1
HQebupt wants to merge 1003 commits into master from addActive2CurosrStat


20 participants
