@jeffgao01 jeffgao01 commented Dec 12, 2025

What is this PR for?

This PR addresses https://issues.apache.org/jira/browse/YUNIKORN-3128

  • Define a BindFailed task event type
  • Roll back the allocation when binding volumes to the pod or binding the pod to the node fails
  • Reset the task state to New so that the task can be retried
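
For readers following along, here is a minimal, self-contained sketch of the flow described in the list above. It is not the code in this PR and it does not use the k8shim state machine. `Task`, `TaskState`, `Binder`, `onBindFailed`, and `flakyBinder` are all hypothetical names chosen for illustration. The only point it tries to show is that a failure in either bind step clears the node assignment and puts the task back in the New state, so that a later scheduling cycle can pick it up again.

```go
package main

import (
	"errors"
	"fmt"
)

// TaskState is a hypothetical stand-in for the shim's task states.
type TaskState string

const (
	StateNew       TaskState = "New"
	StateAllocated TaskState = "Allocated"
	StateBound     TaskState = "Bound"
)

// Task is a simplified, hypothetical task; the real type carries much more.
type Task struct {
	ID       string
	State    TaskState
	NodeName string // node picked by the scheduler, empty when unallocated
	Retries  int    // number of bind failures seen so far
}

// Binder abstracts the two bind steps so that a failure can be injected.
type Binder interface {
	BindVolumes(t *Task) error
	BindPod(t *Task) error
}

// Bind runs both bind steps. If either one fails, it takes the
// "BindFailed" path and returns the wrapped error.
func (t *Task) Bind(b Binder) error {
	if t.State != StateAllocated {
		return fmt.Errorf("task %s: cannot bind from state %s", t.ID, t.State)
	}
	if err := b.BindVolumes(t); err != nil {
		t.onBindFailed()
		return fmt.Errorf("bind volumes: %w", err)
	}
	if err := b.BindPod(t); err != nil {
		t.onBindFailed()
		return fmt.Errorf("bind pod: %w", err)
	}
	t.State = StateBound
	return nil
}

// onBindFailed rolls the task back: the node assignment is dropped
// and the state returns to New so the task can be scheduled again.
func (t *Task) onBindFailed() {
	t.NodeName = ""
	t.State = StateNew
	t.Retries++
}

// flakyBinder is a test double whose pod bind can be made to fail.
type flakyBinder struct{ failPod bool }

func (f flakyBinder) BindVolumes(*Task) error { return nil }

func (f flakyBinder) BindPod(*Task) error {
	if f.failPod {
		return errors.New("node unreachable")
	}
	return nil
}

func main() {
	task := &Task{ID: "task-1", State: StateAllocated, NodeName: "node-a"}
	err := task.Bind(flakyBinder{failPod: true})
	fmt.Println("error:", err)
	fmt.Println("state:", task.State, "node cleared:", task.NodeName == "")
}
```

Note that this sketch only resets state held by the task itself. Whatever the scheduler core has recorded for the allocation is outside its scope.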

What type of PR is it?

  • - Bug Fix
  • - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

Todos

  • - Task

What is the Jira issue?

How should this be tested?

  • unit test

Screenshots (if appropriate)

Questions:

  • - The license files need an update.
  • - There are breaking changes for older versions.
  • - It needs documentation.

@wilfred-s
Contributor

@jeffgao01 please run `make lint` and `make test` on your PR and fix any issues you find.

@wilfred-s
Contributor

Looking at the change, I think we're missing the part that communicates the failure back to the core. The core has already assigned a node to the allocation, and that also needs to be undone. Without that, the core still thinks the allocation is assigned to a node and will never retry it.

}

func (task *Task) beforeTaskBindFail() {
	task.releaseAllocation()
Author

> Looking at the change, I think we're missing the part that communicates the failure back to the core. The core has already assigned a node to the allocation, and that also needs to be undone. Without that, the core still thinks the allocation is assigned to a node and will never retry it.

Actually, the current change rolls back the allocation as a whole.
But after syncing with Wilfred, a more fine-grained approach would be to roll back the allocation only partially: far enough that it can be retried, while the existing metadata/history is retained.
