Skip to content

[improve][broker]add active status into cursor stats#1

Open
HQebupt wants to merge 1003 commits intomasterfrom
addActive2CurosrStat
Open

[improve][broker]add active status into cursor stats#1
HQebupt wants to merge 1003 commits intomasterfrom
addActive2CurosrStat

Conversation

@HQebupt
Copy link
Owner

@HQebupt HQebupt commented Sep 29, 2022

Motivation

The active state indicates whether a cursor is active. And it affects the cache hit rate, because inactive cursors will be evicted from the entry cache. This active state helps to troubleshoot issues with low cache hit rate.
Also it is meaningful for configuring a suitable values for managedLedgerCursorBackloggedThreshold in ServiceConfiguration

# Configure the threshold (in number of entries) from where a cursor should be considered 'backlogged'
# and thus should be set as inactive.
managedLedgerCursorBackloggedThreshold=1000

Modifications

Add active status into cursor stats.
It will get the following sample data when getting the internalStats of a partitioned topic.

{
    "entriesAddedCounter": 6,
    "numberOfEntries": 10302152,
    "totalSize": 63786554161,
    "currentLedgerEntries": 6,
    "currentLedgerSize": 31182,
    "lastLedgerCreatedTimestamp": "2022-09-29T14:48:52.825+08:00",
    "waitingCursorsCount": 0,
    "pendingAddEntriesCount": 0,
    "lastConfirmedEntry": "1359228:5",
    "lastIndex": 1165055506,
    "state": "LedgerOpened",
    "ledgers": [
        {
            "ledgerId": 1356881,
            "entries": 50000,
            "size": 309705120,
            "offloaded": false,
            "underReplicated": false
        }
    ],
    "cursors": {
        "cg_test_map_cache1": {
            "markDeletePosition": "1356881:3399",
            "readPosition": "1356881:3400",
            "waitingReadOp": false,
            "pendingReadOps": 0,
            "messagesConsumedCounter": -10298746,
            "cursorLedger": -1,
            "cursorLedgerLastEntry": -1,
            "individuallyDeletedMessages": "[]",
            "lastLedgerSwitchTimestamp": "2022-09-29T14:48:52.859+08:00",
            "state": "NoLedger",
            "numberOfEntriesSinceFirstNotAckedMessage": 1,
            "totalNonContiguousDeletedMessagesRange": 0,
            "subscriptionHavePendingRead": false,
            "subscriptionHavePendingReplayRead": false,
            "active": false,
            "properties": {
                "index": 846056487
            }
        },
        "cg_test_map_cache2": {
            "markDeletePosition": "1359208:79",
            "readPosition": "1359228:6",
            "waitingReadOp": false,
            "pendingReadOps": 0,
            "messagesConsumedCounter": -8,
            "cursorLedger": 1359234,
            "cursorLedgerLastEntry": 1,
            "individuallyDeletedMessages": "[]",
            "lastLedgerSwitchTimestamp": "2022-09-29T14:48:52.859+08:00",
            "state": "Open",
            "numberOfEntriesSinceFirstNotAckedMessage": 15,
            "totalNonContiguousDeletedMessagesRange": 0,
            "subscriptionHavePendingRead": false,
            "subscriptionHavePendingReplayRead": false,
            "active": true,
            "properties": {
                "index": 1165055008
            }
        }
    },
    "schemaLedgers": [],
    "compactedLedger": {
        "ledgerId": -1,
        "entries": -1,
        "size": -1,
        "offloaded": false,
        "underReplicated": false
    }
}

Verifying this change

  • Make sure that the change passes the CI checks.

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)
  • The rest endpoints: (no)
  • The admin cli options: (no)
  • Anything that affects deployment: (no)

Documentation

Check the box below or label this PR directly.

Need to update docs?

  • doc-required
  • doc-not-needed
  • doc
  • doc-complete

Matching PR in forked repository

PR in forked repository: #1

tisonkun and others added 30 commits September 5, 2022 12:09
Fixes apache#16782

### Motivation
As we all know, Bundle split has 3 algorithms:

- range_equally_divide
- topic_count_equally_divide
- specified_positions_divide

However, none of these algorithms can divide bundles according to flow or qps, which may cause bundles to be split multiple times.
…che#17459)

---

*Motivation*
We update the jna version in this [PR](apache#17262).
We should update the version in presto license file as well.
Fixes apache#17392

### Motivation

All timers in `ProducerImpl` are `std::shared_ptr` objects that can be
reset with `nullptr` in `ProducerImpl::cancelTimers`. It could lead to
null pointer access in some cases.

See
apache#17392 (comment)
for the analysis.

Generally it's not necessary to hold a nullable pointer to the timer.
However, to resolve the cyclic reference issue, apache#5246 reset the shared
pointer to reduce the reference count manually. It's not a good solution
because we have to perform null check for timers everywhere. The null
check still has some race condition issue like:

Thread 1:

```c++
if (timer) {  // [1] timer is not nullptr
    timer->async_wait(/* ... */);  // [3] timer is null now, see [2] below
}
```

Thread 2:

```c++
timer.reset();  // [2]
```

The best solution is to capture `weak_ptr` in timer's callback and call
`lock()` to check if the referenced object is still valid.

### Modifications
- Change the type of `sendTimer_` and `batchTimer_` to `deadline_timer`,
  not a `shared_ptr`.
- Use `PeriodicTask` instead of the `deadline_timer` for token refresh.
- Migrate `weak_from_this()` method from C++17 and capture
  `weak_from_this()` instead of `shared_from_this()` in callbacks.

### Verifying this change

Run the `testResendViaSendCallback` for many times and we can see it
won't fail after this patch.

```bash
./tests/main --gtest_filter='BasicEndToEndTest.testResendViaSendCallback' --gtest_repeat=30
```
* Sync recent changes from apache#17030, apache#17039, apache#16315, and apache#17057

* fix apache#17119

* minor updates

* add link of release notes to navigation

* fix

* update release process as per PIP-190

* minor fix

* minor fix

* Update release-process.md
Signed-off-by: Zixuan Liu <nodeces@gmail.com>

Signed-off-by: Zixuan Liu <nodeces@gmail.com>
### Motivation

When a Pulsar topic is unloaded from a broker, certain metrics related to that topic will appear to remain active for the broker for 5 minutes. This is confusing for troubleshooting because it makes the topic appear to be owned by multiple brokers for a short period of time. See below for a way to reproduce this behavior.

In order to solve this "zombie" metric problem, I propose we remove the timestamps that get exported with each Prometheus metric served by the broker.

### Analysis

Since we introduced Prometheus metrics in apache#294, we have exported a timestamp along with most metrics. This is an optional, valid part of the spec defined [here](https://prometheus.io/docs/instrumenting/exposition_formats/#comments-help-text-and-type-information). However, after our adoption of Prometheus metrics, the Prometheus project released version 2.0 with a significant improvement to its concept of staleness. In short, before 2.0, a metric that was in the last scrape but not the next one (this often happens for topics that are unloaded) will essentially inherit the most recent value for the last 5 minute window. If there isn't one in the past 5 minutes, the metric becomes "stale" and isn't reported. Starting in 2.0, there was new logic to consider a value stale the very first time that it is not reported in a scrape. Importantly, this new behavior is only available if you do not export timestamps with metrics, as documented here: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness. We want to use the new behavior because it gives better insight into all topic metrics, which are subject to move between brokers at any time.

This presentation https://www.youtube.com/watch?v=GcTzd2CLH7I and slide deck https://promcon.io/2017-munich/slides/staleness-in-prometheus-2-0.pdf document the feature in detail. This blog post was also helpful: https://www.robustperception.io/staleness-and-promql/.

Additional motivation comes from mailing list threads like this one https://groups.google.com/g/prometheus-users/c/8OFAwp1OEcY. It says:

> Note, however, that adding timestamps is an extremely niche use
case. Most of the users who think the need it should actually not do
it.
>
> The main usecases within that tiny niche are federation and mirroring
the data from another monitoring system.

The Prometheus Go client also indicates a similar motivation: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#NewMetricWithTimestamp.

The OpenMetrics project also recommends against exporting timestamps: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#exposing-timestamps.

As such, I think we are not a niche use case, and we should not add timestamps to our metrics.

### Reproducing the problem

1. Run any 2.x version of Prometheus (I used 2.31.0) along with the following scrape config:
```yaml
  - job_name: broker
    honor_timestamps: true
    scrape_interval: 30s
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: http
    follow_redirects: true
    static_configs:
      - targets: ["localhost:8080"]
```
2. Start pulsar standalone on the same machine. I used a recently compiled version of master.
3. Publish messages to a topic.
4. Observe `pulsar_in_messages_total` metric for the topic in the prometheus UI (localhost:9090)
5. Stop the producer.
6. Unload the topic from the broker.
7. Optionally, `curl` the metrics endpoint to verify that the topic’s `pulsar_in_messages_total` metric is no longer reported.
8. Watch the metrics get reported in prometheus for 5 additional minutes.

When you set `honor_timestamps: false`, the metric stops getting reported right after the topic is unloaded, which is the desired behavior.

### Modifications

* Remove all timestamps from metrics
* Fix affected tests and test files (some of those tests were in the proxy and the function worker, but no code was changed for those modules)

### Verifying this change

This change is accompanied by updated tests.

### Does this pull request potentially affect one of the following parts:

This is technically a breaking change to the metrics, though I would consider it a bug fix at this point. I will discuss it on the mailing list to ensure it gets proper visibility.

Given how frequently Pulsar changes which metrics are exposed between each scrape, I think this is an important fix that should be cherry picked to older release branches. Technically, we can avoid cherry picking this change if we advise users to set `honor_timestamps: false`. However, I think it is better to just remove them.

### Documentation
- [x] `doc-not-needed`
…17378)

Co-authored-by: huangzegui <huangzegui@didiglobal.com>
…pulsar-perf-producer (apache#17381)

* [improve][cli] Add option to disable batching in pulsar-testclient

* [improve][cli] Add option to disable batching in pulsar-testclient

* [improve][cli] Add option to disable batching in pulsar-testclient

* [improve][cli] Add option to disable batching in pulsar-testclient

* [improve][cli] Add option to disable batching in pulsar-testclient

* [improve][cli] Add option to disable batching in pulsar-testclient

Co-authored-by: Vineeth <vineeth.polamreddy@verizonmedia.com>
Co-authored-by: Rajan Dhabalia <rdhabalia@apache.org>
coderzc and others added 24 commits September 27, 2022 06:50
)

* Fix maxNumberOfRejectedRequestPerConnection doc

* fix doc in 2.8.x docs
…ed (apache#17704)

* [fix][metrics]wrong metrics text generated when label_cluster specified

* improve logic branch

* mark test group
Signed-off-by: tison <wander4096@gmail.com>

Signed-off-by: tison <wander4096@gmail.com>
… been published after the topic gets activated on a broker (apache#16618)

* Skip creating a replication snapshot if no messages have been published

* Adapt test to new behavior where replication snapshots happen only when there are new messages
* Call cleanup method in finally block to ensure it's not skipped

* Clear invocations for the mocks that are left around without cleanup

* Cleanup PulsarService and PulsarAdmin mocks/spies in MockedPulsarServiceBaseTest

* Don't record invocations at all for PulsarService and PulsarAdmin in MockedPulsarServiceBaseTest

* Don't record invocations for spies by default

* Simplify reseting mocks

* Fix PersistentTopicTest

* Fix TokenExpirationProducerConsumerTest

* Fix SimpleLoadManagerImplTest

* Fix FilterEntryTest
…he#17834)

- use Bookkeeper defaults by setting BK_METADATA_OPTIONS=none
…7209)

Fixes apache#17186

### Motivation

There are some cases in which it is useful to be able to include current
position of the message when reset of cursor was made.

### Modifications

* Support inclusive seek in c++ consumers.
* Add a unit test to verify.
* fix: delete sqlite files after jdbc connection closed

This closes apache#17713.

Signed-off-by: tison <wander4096@gmail.com>

* uses isolated db file

Signed-off-by: tison <wander4096@gmail.com>

* Revert "uses isolated db file"

This reverts commit 295db3c.

* close in order

Signed-off-by: tison <wander4096@gmail.com>

* strong order guarantee

Signed-off-by: tison <wander4096@gmail.com>

* factor out defer logic to avoid further bugs

Signed-off-by: tison <wander4096@gmail.com>

* Revert "factor out defer logic to avoid further bugs"

This reverts commit f7f4634.

* Revert "strong order guarantee"

This reverts commit 747086f.

* use awaitTermination

Signed-off-by: tison <wander4096@gmail.com>

Signed-off-by: tison <wander4096@gmail.com>
…AndCommitForTransaction (apache#17845)

* scenario is already covered by PendingAckPersistentTest
…ManagedLedgerImpl (apache#17293)

- a NPE with no description is confusing
…on time (apache#17790)

Fixes
- apache#17623
- apache#17637

### Motivation

Manually release resources, including `consumer`, `producer`, `pulsar client`, `transaction`, and `topic`. This saves `setup` and `cleanup` time before and after each method. 

### Modifications

- Manually release resources instead of calling `cleanup` & `setup` each method
- remove useless method `markDeletePositionCheck`
- `Integer.valueOf(int)` instead of `new Integer(int)`, because `new Integer(int)` is deprecated

### Matching PR in forked repository

PR in forked repository: 

- poorbarcode#10
Fixes apache#17785

### Motivation

The `failureMap` need to be clear after run per unit test.

### Modifications

Clear `failureMap` after run per unit test, and only run once `setup()`/`cleanup()` to reduce execution time.

### Matching PR in forked repository

PR in forked repository: coderzc#6
…ache#17252)

- fixes issue with stats where timestamps might be inconsistent because of visibility issues
  - fields should be volatile to ensure visibility of updated values in a consistent manner

- in replication, the lastDataMessagePublishedTimestamp field in PersistentTopic might be inconsistent
  unless volatile is used
…he#16891)

* add reader config doc

* update to the versioned doc

* Update site2/docs/io-debezium-source.md

Co-authored-by: momo-jun <60642177+momo-jun@users.noreply.github.com>

* Update site2/docs/io-debezium-source.md

Co-authored-by: momo-jun <60642177+momo-jun@users.noreply.github.com>

* revert changes to 2.10.1 and 2.9.3

Co-authored-by: momo-jun <60642177+momo-jun@users.noreply.github.com>
@AnonHxy
Copy link

AnonHxy commented Sep 30, 2022

/pulsarbot run-failure-checks

1 similar comment
@HQebupt
Copy link
Owner Author

HQebupt commented Oct 1, 2022

/pulsarbot run-failure-checks

@github-actions
Copy link

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Nov 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.