[SPARK-54312][CORE] Avoid repeatedly scheduling tasks for SendHeartbeat/WorkDirClean in standalone worker #53054
…up after registratiion
cc @Ngone51 @LuciferYang @dongjoon-hyun can you please take a look? Thanks
Ngone51
left a comment
Good catch! LGTM.
  private var registerMasterFutures: Array[JFuture[_]] = null
  private var registrationRetryTimer: Option[JScheduledFuture[_]] = None
  private[worker] var heartbeatTask: Option[JScheduledFuture[_]] = None
The private[worker] qualifier seems to exist only so that the field can be accessed from test cases, right? Since the current WorkerSuite already mixes in PrivateMethodTester, can we use invokePrivate for testing instead?
Thanks, updated.
  cleanupThreadExecutor.shutdownNow()
  metricsSystem.report()
  cancelLastRegistrationRetry()
  heartbeatTask.foreach(_.cancel(true))
The handleRegisterResponse is a synchronized code block. Don't the operations on heartbeatTask and workDirCleanupTask within onStop also require synchronized protection?
The synchronized block was introduced by #9138 to avoid some race conditions in a very early implementation that used async callbacks.
It looks like it is no longer necessary.
Worker is a ThreadSafeRpcEndpoint already. The synchronized protection seems to be unnecessary today.
OK, so can we remove that unnecessary synchronized in a separate PR?
I will create a separate task to revisit the synchronized usage here.
What changes were proposed in this pull request?
Currently, the worker schedules the tasks that forward SendHeartbeat and WorkDirCleanup in handleRegisterResponse. Because worker registration can happen multiple times (for example, after a heartbeat timeout or a disconnect from the master), these tasks end up being scheduled multiple times.
To fix the issue:
- Add heartbeatTask and workDirCleanupTask in the worker to tell whether these tasks have been scheduled.
- heartbeatTask and workDirCleanupTask are initialized after the first registration, and scheduling is skipped in later registrations.
- Cancel heartbeatTask and workDirCleanupTask when the worker stops.
Why are the changes needed?
Fix the issue of repeatedly scheduling SendHeartbeat/WorkDirClean tasks after worker registration.
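For illustration, the schedule-once guard described under "What changes were proposed" could look roughly like the sketch below. This is not the patch itself: MiniWorker, handleRegistered, scheduleCount and the no-op sendHeartbeat are hypothetical names invented for the example, and the sketch assumes all calls arrive on one thread at a time, so it uses no locking.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, ScheduledFuture, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for the worker; not Spark's actual Worker class.
// Assumes handleRegistered/onStop are never called concurrently.
class MiniWorker(scheduler: ScheduledExecutorService, heartbeatMillis: Long) {
  // None until the first successful registration schedules the task.
  private var heartbeatTask: Option[ScheduledFuture[_]] = None

  // Counts how many times a periodic task was actually scheduled (demo only).
  val scheduleCount = new AtomicInteger(0)

  // Called on every (re-)registration; schedules the heartbeat at most once.
  def handleRegistered(): Unit = {
    if (heartbeatTask.isEmpty) {
      val task = scheduler.scheduleAtFixedRate(
        new Runnable { def run(): Unit = sendHeartbeat() },
        0L, heartbeatMillis, TimeUnit.MILLISECONDS)
      heartbeatTask = Some(task)
      scheduleCount.incrementAndGet()
    }
  }

  // Cancels the periodic task, if one was ever scheduled.
  def onStop(): Unit = heartbeatTask.foreach(_.cancel(true))

  // Placeholder; a real worker would send a message to the master here.
  private def sendHeartbeat(): Unit = ()
}

object MiniWorkerDemo {
  def main(args: Array[String]): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val worker = new MiniWorker(scheduler, 1000L)
    worker.handleRegistered()
    worker.handleRegistered() // simulated re-registration: no second task
    println(worker.scheduleCount.get()) // prints 1
    worker.onStop()
    scheduler.shutdownNow()
  }
}
```

The same pattern would apply to workDirCleanupTask. Keeping the future in an Option serves both as the "already scheduled" flag and as the handle needed for cancellation on stop.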
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests added.
Was this patch authored or co-authored using generative AI tooling?
No