Fix the completionObjects leak problem. #4285
Merged
hezhangjian merged 2 commits into apache:master on Jul 9, 2024
wenbingshen
reviewed
Apr 12, 2024
        }
    } else {
        nettyOpLogger.registerFailedEvent(MathUtils.elapsedNanos(startTime), TimeUnit.NANOSECONDS);
        errorOut(key);
Member
In #4278, PerChannelBookieClient#addEntryTimeoutNanos is enabled, so 5 seconds (the default timeout) after bk1 died, step 9 triggers the PendingAddOp timeout, which causes the issue in #4278.
I have a question: since a default thread checks for timed-out requests every 5 seconds and removes each timed-out CompletionKey from completionObjects, won't the timeout detection task remove the key even if the write fails?
Member
Author
Sure, the timeout detection will remove the key.
hangc0276
approved these changes
May 29, 2024
# Conflicts: # bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/PerChannelBookieClient.java
zymap
approved these changes
Jul 8, 2024
hezhangjian
approved these changes
Jul 9, 2024
Member
Nice Catch!
Ghatage
pushed a commit
to sijie/bookkeeper
that referenced
this pull request
Jul 12, 2024
hangc0276
pushed a commit
that referenced
this pull request
Aug 5, 2024
(cherry picked from commit 772b162)
hangc0276
pushed a commit
that referenced
this pull request
Aug 5, 2024
(cherry picked from commit 772b162)
lhotari
pushed a commit
that referenced
this pull request
Apr 17, 2025
(cherry picked from commit 772b162)
Fixes #4278. See #4278 for the context.
When a connection breaks, PerChannelBookieClient#channelInactive is triggered.
bookkeeper/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/PerChannelBookieClient.java
Lines 1292 to 1312 in 8516d0a
Point 1. line_1303: drain all completionObjects and complete each one with BookieHandleNotAvailableException.
Point 2. line_1310: set PerChannelBookieClient#channel = null.
When PerChannelBookieClient#addEntry is called:
bookkeeper/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/PerChannelBookieClient.java
Lines 776 to 854 in 8516d0a
Point 3. line_840: put the completionKeyValue into completionObjects.
Point 4. line_852: if the channel is not null, invoke writeAndFlush.
There is a race condition between PerChannelBookieClient#channelInactive and PerChannelBookieClient#addEntry.
The problematic timeline is:
Point 1 -> Point 3 -> Point 4 -> Point 2.
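To make the ordering concrete, here is a minimal, single-threaded sketch that replays the four points in the racy order. It is a simplified model, not the real PerChannelBookieClient: the class RaceTimelineSketch and all of its method names are hypothetical, and the map and channel fields are only stand-ins for completionObjects and PerChannelBookieClient#channel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the race; not the real PerChannelBookieClient.
final class RaceTimelineSketch {
    // Stand-in for completionObjects: completion key -> pending request label.
    final Map<Long, String> completionObjects = new ConcurrentHashMap<>();
    // Stand-in for PerChannelBookieClient#channel; null means "no connection".
    volatile Object channel = new Object();
    final List<String> log = new ArrayList<>();

    // Point 1: drain every pending completion and fail it.
    void drainCompletions() {
        for (Long key : new ArrayList<>(completionObjects.keySet())) {
            completionObjects.remove(key);
            log.add("failed key " + key + " (bookie handle not available)");
        }
    }

    // Point 2: forget the channel.
    void clearChannel() {
        channel = null;
    }

    // Point 3: register the completion for a new add request.
    void registerCompletion(long key) {
        completionObjects.put(key, "add-" + key);
    }

    // Point 4: attempt the write only if the channel is still set.
    boolean writeIfConnected(long key) {
        if (channel == null) {
            return false;
        }
        log.add("writeAndFlush attempted for key " + key);
        return true;
    }

    // Replays Point 1 -> Point 3 -> Point 4 -> Point 2 and returns how many
    // keys are still registered afterwards.
    static int runTimeline() {
        RaceTimelineSketch client = new RaceTimelineSketch();
        client.registerCompletion(1L); // an older in-flight request
        client.drainCompletions();     // Point 1: key 1 is drained
        client.registerCompletion(2L); // Point 3: key 2 arrives after the drain
        client.writeIfConnected(2L);   // Point 4: channel not yet null, write attempted
        client.clearChannel();         // Point 2
        return client.completionObjects.size();
    }

    public static void main(String[] args) {
        System.out.println("keys left after the race: " + runTimeline());
    }
}
```

In this model, key 2 is registered after the drain has already run, so nothing in channelInactive will ever complete it; whether it is cleaned up depends entirely on what happens to the write.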
In this timeline, the AddRequest is written and flushed to the netty channel. BookKeeper has a weakness in PerChannelBookieClient#writeAndFlush.
bookkeeper/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/PerChannelBookieClient.java
Lines 1176 to 1279 in 8516d0a
At line_1230, we define a promise for the netty write and flush. If the write and flush fails, we only record the metrics at line_1211; we do not remove the completionKey from completionObjects, so the completionKey leaks in completionObjects.
If PerChannelBookieClient#addEntryTimeoutNanos is disabled, the timeoutCheck won't run, so the completionKey stays in completionObjects forever.
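The role of the timeout check can be illustrated with a small sketch. This is not the actual timeoutCheck implementation: TimeoutSweepSketch, sweep, and remainingAfterSweep are hypothetical names, and treating a non-positive timeout as "disabled" is an assumption made only for this example.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Hypothetical model of a periodic timeout sweep over pending completions.
final class TimeoutSweepSketch {
    // Stand-in for completionObjects: key -> registration time in nanoseconds.
    final Map<Long, Long> completionObjects = new ConcurrentHashMap<>();
    // Assumption for this sketch: a value <= 0 means the timeout is disabled.
    final long addEntryTimeoutNanos;

    TimeoutSweepSketch(long addEntryTimeoutNanos) {
        this.addEntryTimeoutNanos = addEntryTimeoutNanos;
    }

    // Removes every key older than the timeout; does nothing when disabled.
    int sweep(long nowNanos) {
        if (addEntryTimeoutNanos <= 0) {
            return 0; // disabled: leaked keys are never reclaimed here
        }
        int removed = 0;
        Iterator<Map.Entry<Long, Long>> it = completionObjects.entrySet().iterator();
        while (it.hasNext()) {
            if (nowNanos - it.next().getValue() >= addEntryTimeoutNanos) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    // Registers one key at time 0, sweeps at t = 10 s, and reports what is left.
    static int remainingAfterSweep(long timeoutNanos) {
        TimeoutSweepSketch client = new TimeoutSweepSketch(timeoutNanos);
        client.completionObjects.put(1L, 0L);
        client.sweep(TimeUnit.SECONDS.toNanos(10));
        return client.completionObjects.size();
    }
}
```

With a 5-second timeout the leaked key is eventually swept, which is the late timeout observed in #4278; with the timeout disabled in this model, the key is never removed.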
In #4278, PerChannelBookieClient#addEntryTimeoutNanos is enabled, so 5 seconds (the default timeout) after bk1 died, step 9 triggers the PendingAddOp timeout, which causes the issue in #4278.
So, as long as we remove the completionKey from completionObjects when the write and flush of the AddRequest fails, the problem is solved.
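The idea of the fix can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: it uses CompletableFuture in place of the netty write promise so that it runs with the JDK alone, and WriteFailureCleanupSketch, its addEntry signature, the removeOnFailure flag, and leakedKeys are hypothetical. Only the name errorOut(key) comes from the reviewed diff; its body here is a guess at the intent (remove the key and fail the completion).

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model: a failed write either leaks or releases its completion key.
final class WriteFailureCleanupSketch {
    // Stand-in for completionObjects: key -> completion awaited by the caller.
    final Map<Long, CompletableFuture<Void>> completionObjects = new ConcurrentHashMap<>();
    int failedWrites = 0;

    // Registers the completion, then listens on the simulated write promise.
    CompletableFuture<Void> addEntry(long key,
                                     CompletableFuture<Void> writePromise,
                                     boolean removeOnFailure) {
        CompletableFuture<Void> completion = new CompletableFuture<>();
        completionObjects.put(key, completion);
        writePromise.whenComplete((ok, err) -> {
            if (err != null) {
                failedWrites++; // stands in for recording the failure metric
                if (removeOnFailure) {
                    errorOut(key, err); // the cleanup step this PR argues for
                }
            }
        });
        return completion;
    }

    // Assumed semantics: drop the key and fail the pending completion.
    void errorOut(long key, Throwable cause) {
        CompletableFuture<Void> completion = completionObjects.remove(key);
        if (completion != null) {
            completion.completeExceptionally(cause);
        }
    }

    // Fails one write and returns how many keys remain registered.
    static int leakedKeys(boolean removeOnFailure) {
        WriteFailureCleanupSketch client = new WriteFailureCleanupSketch();
        CompletableFuture<Void> writePromise = new CompletableFuture<>();
        client.addEntry(1L, writePromise, removeOnFailure);
        writePromise.completeExceptionally(new IOException("channel closed"));
        return client.completionObjects.size();
    }
}
```

In this model, leakedKeys(false) leaves one key behind, matching the leak described above, while leakedKeys(true) leaves none, and the caller's completion fails immediately instead of waiting for a timeout.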