@mitdesai
Contributor

What is this PR for?

Adds metrics for monitoring the number of nodes and applications tried during each scheduling cycle.

What type of PR is it?

  • - Bug Fix
  • - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

Todos

  • - Task

What is the Jira issue?

Jira https://issues.apache.org/jira/browse/YUNIKORN-3119

How should this be tested?

Screenshots (if appropriate)

Questions:

  • - The license files need an update.
  • - There are breaking changes for older versions.
  • - It needs documentation.

@codecov

codecov bot commented Oct 31, 2025

Codecov Report

❌ Patch coverage is 66.96429% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.76%. Comparing base (cefba88) to head (d46128a).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/metrics/scheduler.go | 60.97% | 16 Missing ⚠️ |
| pkg/scheduler/objects/application.go | 80.35% | 11 Missing ⚠️ |
| pkg/scheduler/objects/queue.go | 11.11% | 8 Missing ⚠️ |
| pkg/scheduler/context.go | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1041      +/-   ##
==========================================
- Coverage   81.84%   81.76%   -0.08%     
==========================================
  Files         103      103              
  Lines       13608    13688      +80     
==========================================
+ Hits        11137    11192      +55     
- Misses       2209     2235      +26     
+ Partials      262      261       -1     

@pbacsko pbacsko self-requested a review November 7, 2025 16:51
@pbacsko
Contributor

pbacsko commented Nov 7, 2025

@mitdesai please rebase this PR. The unit test failure is unrelated.

@mitdesai
Contributor Author

mitdesai commented Nov 7, 2025

Thanks @pbacsko, I have rebased onto master.

Contributor

@manirajv06 manirajv06 left a comment

Do we need to include other types of allocations in scheduling cycles, e.g. PH, reservation, etc.?

Contributor

@pbacsko pbacsko left a comment

I definitely think this solution needs some rework. The current approach is a bit hard to understand.

  1. Code should not communicate through metrics. E.g., in Queue.tryAllocate() we call GetTryNode() twice to get a difference. Why not just return this from app.tryAllocate()? That would be much simpler. Then, instead of calling Inc() every time from the app, you can just call Add() with the number of nodes that were tried.

  2. We're storing transient information specifically in the root queue, which involves constant queue walking. It's not the speed that bothers me, but it's just weird. This information is not specific to a queue. Similarly to #1, this data (number of apps tried) should propagate back to a higher-level caller which does the necessary processing. You can easily add this to AllocationResult and record the metrics in PartitionContext.tryAllocate() after pc.root.TryAllocate(...) returns.
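
The two points above can be sketched as follows. This is only a minimal illustration, not the actual YuniKorn code: the `counter` type stands in for a real metrics counter, and `allocationResult`, `app`, `queueTryAllocate` and `partitionTryAllocate` are hypothetical, simplified names. The idea it shows is that each level returns its counts to the caller, and only the partition-level function touches the metrics, with a single `Add()` per cycle.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// counter is a stand-in for a monotonically increasing metric
// (hypothetical; not the real scheduler metrics type).
type counter struct{ v atomic.Int64 }

func (c *counter) Add(n int64)  { c.v.Add(n) }
func (c *counter) Value() int64 { return c.v.Load() }

// allocationResult is a hypothetical, trimmed-down result that carries
// the per-cycle counts back to the caller.
type allocationResult struct {
	Allocated         bool
	NodesTried        int64
	ApplicationsTried int64
}

// app is a hypothetical application that fits on any node with enough free capacity.
type app struct{ ask int }

// tryAllocate walks the nodes in order and reports how many it looked at.
func (a app) tryAllocate(freePerNode []int) allocationResult {
	res := allocationResult{ApplicationsTried: 1}
	for _, free := range freePerNode {
		res.NodesTried++
		if free >= a.ask {
			res.Allocated = true
			break
		}
	}
	return res
}

// queueTryAllocate tries the apps in order and sums their counts in a
// local variable; it stores nothing on the queue and reads no metrics.
func queueTryAllocate(apps []app, freePerNode []int) allocationResult {
	var total allocationResult
	for _, a := range apps {
		r := a.tryAllocate(freePerNode)
		total.NodesTried += r.NodesTried
		total.ApplicationsTried += r.ApplicationsTried
		if r.Allocated {
			total.Allocated = true
			break
		}
	}
	return total
}

// partitionTryAllocate is the only place that records metrics.
func partitionTryAllocate(apps []app, freePerNode []int, nodesTried, appsTried *counter) allocationResult {
	res := queueTryAllocate(apps, freePerNode)
	nodesTried.Add(res.NodesTried)
	appsTried.Add(res.ApplicationsTried)
	return res
}

func main() {
	var nodesTried, appsTried counter
	apps := []app{{ask: 10}, {ask: 2}}
	free := []int{1, 4, 8}
	res := partitionTryAllocate(apps, free, &nodesTried, &appsTried)
	fmt.Println(res.Allocated, nodesTried.Value(), appsTried.Value()) // prints: true 5 2
}
```

The first app fits nowhere and tries all three nodes; the second fits on the second node, so five nodes and two applications are counted in total.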

Comment on lines 856 to 860

// Reset the tryAllocate call counter at the beginning of each scheduling cycle
pc.root.ResetApplicationsTried()
pc.root.ResetNodesTried()

Contributor

This code should not be here. Reset in a separate method and call that from ClusterContext.schedule(). The intent is much clearer that way.
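
A rough sketch of that suggestion, under stated assumptions: `cluster`, `partition`, `cycleStats` and `resetCycleStats` are hypothetical stand-ins, not the real ClusterContext or PartitionContext types. The reset lives in its own small method and the top-level scheduling loop calls it explicitly at the start of every cycle.

```go
package main

import "fmt"

// cycleStats holds hypothetical per-scheduling-cycle counters.
type cycleStats struct {
	applicationsTried int64
	nodesTried        int64
}

// partition is a simplified stand-in for a partition context.
type partition struct {
	name  string
	stats cycleStats
}

// resetCycleStats clears the counters; keeping this in its own method
// makes the "start of a new cycle" intent explicit at the call site.
func (p *partition) resetCycleStats() { p.stats = cycleStats{} }

// tryAllocate is a stand-in that just records the counts it is given.
func (p *partition) tryAllocate(apps, nodes int64) {
	p.stats.applicationsTried += apps
	p.stats.nodesTried += nodes
}

// cluster is a simplified stand-in for the cluster-level context.
type cluster struct{ partitions []*partition }

// schedule runs one scheduling cycle over all partitions and resets the
// per-cycle counters before each partition is processed.
func (c *cluster) schedule(apps, nodes int64) {
	for _, p := range c.partitions {
		p.resetCycleStats()
		p.tryAllocate(apps, nodes)
	}
}

func main() {
	c := &cluster{partitions: []*partition{{name: "default"}}}
	c.schedule(2, 5)
	c.schedule(1, 3)
	p := c.partitions[0]
	// Only the counts from the most recent cycle remain.
	fmt.Println(p.stats.applicationsTried, p.stats.nodesTried) // prints: 1 3
}
```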

  zap.Stringer("resultType", result.ResultType),
- zap.Stringer("allocation", result.Request))
+ zap.Stringer("allocation", result.Request),
+ zap.Int64("applicationsTried:", sq.GetApplicationsTried()),
Contributor

Nit: the trailing ":" in the key is not needed.

Comment on lines 95 to 96
applicationsTried int64 // number of applications tried per scheduling cycle
nodesTried int64 // number of nodes tried per scheduling cycle
Contributor

These two fields are not related to a queue; they are per-scheduling-cycle information. We must return this to the caller in PartitionContext and record the metrics there.

  - NodesTried and ApplicationsTried are tracked in the result structure
  - a local applicationsTried counter increments for each application tried; the total count is returned with the result
  - added an applications tried counter field to SchedulerMetrics
  - the partition context records both NodesTried and ApplicationsTried
  - added reset calls in ClusterContext.schedule()
  - fixed linting issues