[YUNIKORN-3119] Add Metrics for Monitoring Applications and Nodes Attempted in Each Scheduling Cycle #1041

mitdesai · 2025-10-30T15:39:52Z

What is this PR for?

Adds metrics for monitoring the number of nodes and applications attempted during each scheduling cycle.

What type of PR is it?

Todos

- Task

What is the Jira issue?

Jira https://issues.apache.org/jira/browse/YUNIKORN-3119

How should this be tested?

Screenshots (if appropriate)

Questions:

- The license files need an update.
- There are breaking changes for older versions.
- It needs documentation.

codecov · 2025-10-31T01:38:46Z

Codecov Report

❌ Patch coverage is 66.96429% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.76%. Comparing base (cefba88) to head (d46128a).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/metrics/scheduler.go | 60.97% | 16 Missing ⚠️ |
| pkg/scheduler/objects/application.go | 80.35% | 11 Missing ⚠️ |
| pkg/scheduler/objects/queue.go | 11.11% | 8 Missing ⚠️ |
| pkg/scheduler/context.go | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1041      +/-   ##
==========================================
- Coverage   81.84%   81.76%   -0.08%     
==========================================
  Files         103      103              
  Lines       13608    13688      +80     
==========================================
+ Hits        11137    11192      +55     
- Misses       2209     2235      +26     
+ Partials      262      261       -1

pbacsko · 2025-11-07T16:52:36Z

@mitdesai please rebase this PR. The unit test failure is unrelated.

mitdesai · 2025-11-07T19:34:06Z

Thanks @pbacsko, I have rebased on master.

manirajv06

Do we need to include other types of allocations in the scheduling cycle counts, such as placeholders (PH), reservations, etc.?

pbacsko

I definitely think this solution needs some rework. The current approach is a bit hard to understand.

1. Code should not communicate through metrics. For example, in Queue.tryAllocate() we call GetTryNode() twice to get a difference. Why not just return this from app.tryAllocate()? That would be much simpler. Then, instead of calling Inc() every time from the app, you can call Add() once with the number of nodes that were tried.
2. We're storing transient information specifically in the root queue, which involves constant queue walking. It's not the speed that bothers me; it's just weird. This information is not specific to a queue. As with #1, this data (the number of apps tried) should propagate back to a higher-level caller, which does the necessary processing. You can easily add this to AllocationResult and record the metrics in PartitionContext.tryAllocate() after pc.root.TryAllocate(...) returns.
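To illustrate point 1, here is a minimal sketch of returning the node count from the application instead of reading a metric back. The `counter`, `node`, and `application` types below are simplified, hypothetical stand-ins and not YuniKorn's actual types; only the idea of returning the count and calling `Add()` once comes from the review comment.

```go
package main

import "fmt"

// counter is a hypothetical stand-in for a metrics counter; only Add is modeled.
type counter struct{ total int }

func (c *counter) Add(n int) { c.total += n }

// node and application are simplified placeholders, not YuniKorn's types.
type node struct {
	name string
	free int
}

type application struct {
	id  string
	ask int
}

// tryAllocate walks the nodes in order and reports how many it examined,
// so the caller never has to read a metric back to learn the count.
func (a *application) tryAllocate(nodes []*node) (allocated bool, nodesTried int) {
	for _, n := range nodes {
		nodesTried++
		if n.free >= a.ask {
			n.free -= a.ask
			return true, nodesTried
		}
	}
	return false, nodesTried
}

func main() {
	nodes := []*node{{name: "n1", free: 1}, {name: "n2", free: 2}, {name: "n3", free: 8}}
	app := &application{id: "app-1", ask: 4}
	tried := &counter{}

	ok, n := app.tryAllocate(nodes)
	tried.Add(n) // one Add with the returned count instead of one Inc per node
	fmt.Println(ok, n, tried.total)
}
```

With this shape the caller needs no before/after reads of the metric, and the metric is written in exactly one place.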

pbacsko · 2025-11-11T12:55:49Z

pkg/scheduler/partition.go

+
+	// Reset the tryAllocate call counter at the beginning of each scheduling cycle
+	pc.root.ResetApplicationsTried()
+	pc.root.ResetNodesTried()
+


This code should not be here. Reset in a separate method and call that from ClusterContext.schedule(). The intent is much clearer that way.
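A rough sketch of what this could look like, assuming a dedicated reset helper that the cycle entry point calls before any allocation is attempted. The names `cycleStats`, `resetCycleStats`, and the simplified `clusterContext` and `partition` types are hypothetical and do not mirror the real YuniKorn structures.

```go
package main

import "fmt"

// cycleStats is a hypothetical holder for values that only live for one cycle.
type cycleStats struct {
	applicationsTried int64
	nodesTried        int64
}

func (s *cycleStats) reset() { *s = cycleStats{} }

type partition struct {
	name  string
	stats cycleStats
}

// tryAllocate stands in for the real allocation attempt; it only updates the stats.
func (p *partition) tryAllocate(apps, nodes int64) {
	p.stats.applicationsTried += apps
	p.stats.nodesTried += nodes
}

type clusterContext struct{ partitions []*partition }

// resetCycleStats is the separate, clearly named reset step.
func (cc *clusterContext) resetCycleStats() {
	for _, p := range cc.partitions {
		p.stats.reset()
	}
}

// schedule runs one cycle: reset first, then attempt allocations.
func (cc *clusterContext) schedule(apps, nodes int64) {
	cc.resetCycleStats()
	for _, p := range cc.partitions {
		p.tryAllocate(apps, nodes)
	}
}

func main() {
	cc := &clusterContext{partitions: []*partition{{name: "default"}}}
	cc.schedule(2, 5)
	cc.schedule(1, 3)
	// The second cycle's values replace the first cycle's values instead of adding to them.
	fmt.Println(cc.partitions[0].stats.applicationsTried, cc.partitions[0].stats.nodesTried)
}
```

Keeping the reset in the cycle entry point makes the "once per cycle" intent visible where the cycle is defined, rather than inside the allocation path.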

pbacsko · 2025-11-11T13:02:40Z

pkg/scheduler/objects/queue.go

 					zap.Stringer("resultType", result.ResultType),
-					zap.Stringer("allocation", result.Request))
+					zap.Stringer("allocation", result.Request),
+					zap.Int64("applicationsTried:", sq.GetApplicationsTried()),


Nit: ":" not needed

pbacsko · 2025-11-11T13:03:58Z

pkg/scheduler/objects/queue.go

+	applicationsTried                  int64 // number of applications tried per scheduling cycle
+	nodesTried                         int64 // number of nodes tried per scheduling cycle


These two fields are not related to a queue; they are per-scheduling-cycle information. We must return this to the caller in PartitionContext and record the metrics there.

- NodesTried and ApplicationsTried are tracked in the result structure
- Local applicationsTried counter increments for each application tried; returns a total count when returning the result
- Add application tried counter field to SchedulerMetrics
- Partition context records both NodesTried and ApplicationsTried
- Added reset calls in ClusterContext.schedule()
- Fixed linting issues
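The flow described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: `allocationResult`, `schedulerMetrics`, `app`, and the free functions are hypothetical simplifications, and the sketch always returns a result value so the counts survive even when nothing is allocated.

```go
package main

import "fmt"

// allocationResult carries the per-cycle counts back to the caller (hypothetical shape).
type allocationResult struct {
	Allocated         bool
	NodesTried        int
	ApplicationsTried int
}

// schedulerMetrics is a stand-in for the real metrics struct; plain ints replace counters.
type schedulerMetrics struct {
	nodesTried        int
	applicationsTried int
}

// app is a toy application: fitsAfter is the number of nodes examined
// before a fit is found, and 0 means the application never fits.
type app struct{ fitsAfter int }

func (a app) tryAllocate(nodeCount int) allocationResult {
	if a.fitsAfter == 0 || a.fitsAfter > nodeCount {
		return allocationResult{NodesTried: nodeCount}
	}
	return allocationResult{Allocated: true, NodesTried: a.fitsAfter}
}

// queueTryAllocate keeps a local count of applications tried and returns the totals.
func queueTryAllocate(apps []app, nodeCount int) allocationResult {
	total := allocationResult{}
	for _, a := range apps {
		total.ApplicationsTried++
		r := a.tryAllocate(nodeCount)
		total.NodesTried += r.NodesTried
		if r.Allocated {
			total.Allocated = true
			break
		}
	}
	return total
}

// partitionTryAllocate is the single place where the metrics are recorded.
func partitionTryAllocate(m *schedulerMetrics, apps []app, nodeCount int) allocationResult {
	r := queueTryAllocate(apps, nodeCount)
	m.applicationsTried += r.ApplicationsTried
	m.nodesTried += r.NodesTried
	return r
}

func main() {
	m := &schedulerMetrics{}
	apps := []app{{fitsAfter: 0}, {fitsAfter: 2}, {fitsAfter: 1}}
	r := partitionTryAllocate(m, apps, 3)
	fmt.Println(r.Allocated, r.ApplicationsTried, r.NodesTried)
}
```

In this sketch the queue holds no cycle-specific fields; the counts exist only in the returned result and in the metrics recorded by the partition-level caller.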

wilfred-s assigned mitdesai Oct 31, 2025

mitdesai force-pushed the YUNIKORN-3119 branch from 6bdbaee to e002116 Compare October 31, 2025 02:56

pbacsko self-requested a review November 7, 2025 16:51

mitdesai force-pushed the YUNIKORN-3119 branch from e002116 to 0a89255 Compare November 7, 2025 19:33

manirajv06 reviewed Nov 11, 2025

pbacsko requested changes Nov 11, 2025

mitdesai added 3 commits November 29, 2025 09:42

Rebased with master

83de6f9

Add code-coverage

bce7a1d

mitdesai force-pushed the YUNIKORN-3119 branch from 0a89255 to d46128a Compare November 29, 2025 17:42

pbacsko requested a review from wilfred-s December 11, 2025 18:07
