[YUNIKORN-3119] Add Metrics for Monitoring Applications and Nodes Attempted in Each Scheduling Cycle #1041
base: master
Conversation
Codecov Report ❌

Additional details and impacted files:

    @@            Coverage Diff             @@
    ##           master    #1041      +/-   ##
    ==========================================
    - Coverage   81.84%   81.76%    -0.08%
    ==========================================
      Files         103      103
      Lines       13608    13688       +80
    ==========================================
    + Hits        11137    11192       +55
    - Misses       2209     2235       +26
    + Partials      262      261        -1
Force-pushed from 6bdbaee to e002116 (Compare)
pbacsko: @mitdesai please rebase this PR. The unit test failure is unrelated.
Force-pushed from e002116 to 0a89255 (Compare)
mitdesai: Thanks @pbacsko, I have rebased with master.
manirajv06 left a comment:
Do we need to include other types of allocations in the scheduling cycle metrics? Placeholder (PH), Reservation, etc.
pbacsko left a comment:
I definitely think this solution needs some re-work. The current approach is a bit hard to understand.
1. Code should not be communicating through metrics. E.g. in `Queue.tryAllocate()`, we call `GetTryNode()` twice to get a difference. Why not just return this from `app.tryAllocate()`? That would be much simpler. Then instead of calling `Inc()` every time from the app, you can just call `Add()` with the number of nodes which was tried.
2. We're storing transient information specifically in the root queue, which involves constant queue walking. It's not the speed that bothers me, but it's just weird. This information is not specific to a queue. Similarly to #1, this data (number of apps tried) should propagate back to a higher-level caller which does the necessary processing. You can easily add this to `AllocationResult` and record the metrics in `PartitionContext.tryAllocate()` after `pc.root.TryAllocate(...)` returns, as sketched below.
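A minimal, self-contained Go sketch of the flow described in point 2: the counters ride back to the caller inside the result, and the metrics are updated once per cycle with `Add()` instead of per-node `Inc()`. Apart from the `AllocationResult` name, every type and method here is a simplified stand-in, not the actual yunikorn-core API.

```go
package main

import "fmt"

// Hypothetical result shape: the per-cycle counters travel with the
// allocation outcome instead of being written into queue state.
type AllocationResult struct {
	NodesTried int64 // nodes examined while trying to place the request
	AppsTried  int64 // applications examined in this cycle
}

type application struct{ candidateNodes []string }

// tryAllocate returns how many nodes it examined; the caller
// aggregates, so no shared metric is touched per node.
func (a *application) tryAllocate() *AllocationResult {
	tried := int64(0)
	for range a.candidateNodes {
		tried++
		// ... a real scheduler would attempt a placement here ...
	}
	return &AllocationResult{NodesTried: tried, AppsTried: 1}
}

// schedulerMetrics stands in for SchedulerMetrics; Add is called once
// per cycle with the aggregated totals.
type schedulerMetrics struct{ nodesTried, appsTried int64 }

func (m *schedulerMetrics) Add(r *AllocationResult) {
	m.nodesTried += r.NodesTried
	m.appsTried += r.AppsTried
}

func main() {
	apps := []*application{
		{candidateNodes: []string{"node-1", "node-2"}},
		{candidateNodes: []string{"node-1"}},
	}
	m := &schedulerMetrics{}
	// PartitionContext.tryAllocate() would do this aggregation after
	// pc.root.TryAllocate(...) returns.
	for _, app := range apps {
		m.Add(app.tryAllocate())
	}
	fmt.Printf("apps tried: %d, nodes tried: %d\n", m.appsTried, m.nodesTried)
}
```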
pkg/scheduler/partition.go (Outdated)

    // Reset the tryAllocate call counter at the beginning of each scheduling cycle
    pc.root.ResetApplicationsTried()
    pc.root.ResetNodesTried()
This code should not be here. Do the reset in a separate method and call that from `ClusterContext.schedule()`. The intent is much clearer that way.
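A short sketch of what that refactoring could look like: a single intention-revealing reset method, invoked once at the start of the cycle from `schedule()`. The type and field names are assumptions for illustration, not the real yunikorn-core structs.

```go
package main

// Hypothetical per-cycle counters; names are illustrative only.
type partitionContext struct {
	applicationsTried int64
	nodesTried        int64
}

// resetCycleCounters groups the resets so the call site reads as one
// clear step instead of two unrelated calls inside allocation code.
func (pc *partitionContext) resetCycleCounters() {
	pc.applicationsTried = 0
	pc.nodesTried = 0
}

type clusterContext struct{ partitions []*partitionContext }

// schedule is where ClusterContext would trigger the reset, once at
// the beginning of each scheduling cycle.
func (cc *clusterContext) schedule() {
	for _, pc := range cc.partitions {
		pc.resetCycleCounters()
		// ... pc.tryAllocate() and the rest of the cycle ...
	}
}

func main() {
	cc := &clusterContext{partitions: []*partitionContext{{}}}
	cc.schedule()
}
```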
pkg/scheduler/objects/queue.go (Outdated)

        zap.Stringer("resultType", result.ResultType),
    -   zap.Stringer("allocation", result.Request))
    +   zap.Stringer("allocation", result.Request),
    +   zap.Int64("applicationsTried:", sq.GetApplicationsTried()),
Nit: the ":" in the field name is not needed.
pkg/scheduler/objects/queue.go (Outdated)

    applicationsTried int64 // number of applications tried per scheduling cycle
    nodesTried        int64 // number of nodes tried per scheduling cycle
These two fields are not related to a queue; they are per-scheduling-cycle information. We must return this to the caller in `PartitionContext` and record the metrics there.
Update summary:
- NodesTried and ApplicationsTried are tracked in the result structure
- the local applicationsTried counter increments for each application tried; the total count is returned with the result
- add an applications-tried counter field to SchedulerMetrics
- the partition context records both NodesTried and ApplicationsTried
- added reset calls in ClusterContext.schedule()
- fixed linting issues
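For the SchedulerMetrics counter item, here is a hedged sketch of what such counters could look like with the Prometheus Go client, which yunikorn-core's metrics are built on. The metric names, label layout, and `AddTried*` method names are assumptions for illustration, not the actual SchedulerMetrics API.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical slice of SchedulerMetrics: one counter per tracked item.
type schedulerMetrics struct {
	applicationsTried prometheus.Counter
	nodesTried        prometheus.Counter
}

func newSchedulerMetrics() *schedulerMetrics {
	m := &schedulerMetrics{
		applicationsTried: prometheus.NewCounter(prometheus.CounterOpts{
			Namespace: "yunikorn",
			Subsystem: "scheduler",
			Name:      "applications_tried_total", // illustrative name
			Help:      "Applications attempted across scheduling cycles",
		}),
		nodesTried: prometheus.NewCounter(prometheus.CounterOpts{
			Namespace: "yunikorn",
			Subsystem: "scheduler",
			Name:      "nodes_tried_total", // illustrative name
			Help:      "Nodes attempted across scheduling cycles",
		}),
	}
	prometheus.MustRegister(m.applicationsTried, m.nodesTried)
	return m
}

// AddTriedApplications records the per-cycle total in a single Add()
// call, matching the aggregation done in the partition context.
func (m *schedulerMetrics) AddTriedApplications(n int64) {
	m.applicationsTried.Add(float64(n))
}

func (m *schedulerMetrics) AddTriedNodes(n int64) {
	m.nodesTried.Add(float64(n))
}

func main() {
	m := newSchedulerMetrics()
	m.AddTriedApplications(3)
	m.AddTriedNodes(12)
	fmt.Println("counters registered and updated")
}
```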
Force-pushed from 0a89255 to d46128a (Compare)
What is this PR for?
Added additional metrics for monitoring the nodes and applications attempted during each scheduling cycle.
What type of PR is it?
Todos
What is the Jira issue?
Jira https://issues.apache.org/jira/browse/YUNIKORN-3119
How should this be tested?
Screenshots (if appropriate)
Questions: