@xyuanlu
Contributor

@xyuanlu xyuanlu commented Aug 7, 2025

Issues

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    Fix a race condition in metaclient leader election.
    When the client gets an expired event, it needs to re-register all user-registered listeners and its own re-election listeners before recreating the leader node.
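
    The ordering described above can be illustrated with a minimal, self-contained sketch. It does not use the Helix MetaClient API: `FakeStore`, `recover`, and the client name `client-1` are hypothetical, and the sketch assumes that expiry drops both the subscriptions and the leader entry. It only shows why a listener registered after the leader node is recreated can miss the corresponding change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ExpiryRecoverySketch {
    /** Hypothetical in-memory stand-in for a store session; not the Helix MetaClient API. */
    static class FakeStore {
        private final List<Consumer<String>> listeners = new ArrayList<>();
        private String leader; // null means no leader entry exists

        void subscribe(Consumer<String> listener) {
            listeners.add(listener);
        }

        // Assumption: expiry drops every subscription and the leader entry.
        void expire() {
            listeners.clear();
            leader = null;
        }

        // Creating the entry notifies only the listeners subscribed at that moment.
        void createLeader(String name) {
            leader = name;
            for (Consumer<String> l : new ArrayList<>(listeners)) {
                l.accept(name);
            }
        }

        String getLeader() {
            return leader;
        }
    }

    /** Returns the leader-change events a listener observes during recovery. */
    static List<String> recover(boolean listenersFirst) {
        FakeStore store = new FakeStore();
        List<String> seen = new ArrayList<>();
        Consumer<String> listener = seen::add;
        store.subscribe(listener);
        store.expire();
        if (listenersFirst) {
            store.subscribe(listener);      // re-register, then recreate the leader entry
            store.createLeader("client-1");
        } else {
            store.createLeader("client-1"); // the change fires with nobody listening
            store.subscribe(listener);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("listeners first: " + recover(true));
        System.out.println("leader first:    " + recover(false));
    }
}
```

    In this toy model, only the listeners-first order delivers the leadership change to the listener.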

(Write a concise description including what, why, how)

Tests

  • The following tests are written for this issue:
  • new test added
  • testClientLeadershipChangeListenersAfterExpire

(List the names of added unit/integration tests)

  • The following is the result of the "mvn test" command on the appropriate module:

(If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@xyuanlu xyuanlu force-pushed the fixLeaderElection branch from 4a93ea7 to f756e5e Compare August 7, 2025 19:08
@xyuanlu xyuanlu marked this pull request as ready for review August 7, 2025 21:58
Contributor

@junkaixue junkaixue left a comment

Please give a comprehensive description in the PR of what is being fixed.

@xyuanlu xyuanlu force-pushed the fixLeaderElection branch from f756e5e to 72802ec Compare August 8, 2025 02:55
@xyuanlu xyuanlu force-pushed the fixLeaderElection branch from 72802ec to bd39805 Compare August 8, 2025 02:59
@xyuanlu xyuanlu force-pushed the fixLeaderElection branch from 47d4f57 to eb4a28f Compare August 8, 2025 04:13
private void subscribeAndTryCreateLeaderEntry(String leaderPath) {
  _metaClient.subscribeDataChange(leaderPath + LEADER_ENTRY_KEY, _reElectListener, false);
  _leaderGroups.add(leaderPath + LEADER_ENTRY_KEY);
  registerAllListeners();

@LZD-PratyushBhatt LZD-PratyushBhatt Aug 11, 2025

Hey @xyuanlu thanks for this!
I have a question here.
What happens if listener re-subscription succeeds but participant node creation fails? What I mean is: currently, the flow would re-subscribe all listeners successfully; then, if it fails to create the participant node (due to network issues, a missing parent path, or anything else), it would still proceed to attempt leader node creation and touch the leader node. Could this result in a participant becoming leader without being a valid member of the participant pool, leading to an invalid leader state that other participants don't recognize as legitimate?

Contributor Author

In handleConnectStateChanged, we recreate the participant node, re-subscribe listeners, and touch the leader node. I think that if recreating the participant node fails at line 425 with anything other than a nodeAlreadyExist error, it would throw the exception and exit before registering listeners and touching the leader node.
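
A rough sketch of that control flow, to make the early exit concrete. This is not the Helix source: `ReconnectOrderSketch`, `handleReconnect`, and both exception classes are hypothetical stand-ins, and the failure is injected through a parameter purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ReconnectOrderSketch {
    /** Hypothetical stand-ins for store errors; not Helix exception classes. */
    static class NodeExistsException extends RuntimeException {}
    static class StoreException extends RuntimeException {}

    /**
     * Runs the three recovery steps in order and records which ones completed.
     * participantCreateError simulates the outcome of recreating the participant
     * node: null means success, otherwise the given exception is thrown.
     */
    static List<String> handleReconnect(RuntimeException participantCreateError) {
        List<String> steps = new ArrayList<>();
        try {
            if (participantCreateError != null) {
                throw participantCreateError;
            }
            steps.add("createParticipant");
        } catch (NodeExistsException e) {
            // An already-existing participant node is tolerated; recovery continues.
            steps.add("participantAlreadyExists");
        }
        // Any other exception has propagated by now, so these steps never run
        // when the participant node could not be recreated.
        steps.add("registerListeners");
        steps.add("touchLeaderNode");
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(handleReconnect(null));
        System.out.println(handleReconnect(new NodeExistsException()));
        try {
            handleReconnect(new StoreException());
        } catch (StoreException e) {
            System.out.println("aborted before listeners and leader node");
        }
    }
}
```

Under these assumptions, a participant whose node could not be recreated never reaches the leader-node step, which is the property the question above asks about.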

That makes sense, Thanks!

@xyuanlu
Contributor Author

xyuanlu commented Aug 11, 2025

This change is ready to be merged. Approved by @junkaixue
Commit message "Fix a race condition in metaclient leader election."

@xyuanlu xyuanlu merged commit 62f0e9c into apache:master Aug 11, 2025
1 of 2 checks passed

3 participants