Regarding the setting of the job's unfit status. #7

@Moxixis

I noticed that in Paella, when a non-memory job cannot get enough GPU resources, it is marked as resource-unfit if num_outstanding_kernels_ < max_num_outstanding_kernels_.

However, both the block start and block finish methods include logic to clear the job's unfit flag. In the block start method, once all blocks have been launched, it checks the job's unfit flag; if the job is marked as resource-unfit, it clears the flag and decrements num_outstanding_kernels_. In the block finish method, once all blocks have completed, it checks the unfit flag again and, if it is still set, clears it and decrements num_outstanding_kernels_.

I'm trying to understand the rationale behind this behavior. In particular, why is the unfit flag set in the first place, given that in the schedule_job method one job still gets to run even after a job has been marked as unfit?

Also, the logic that clears the unfit flag and decrements num_outstanding_kernels_ appears in both the block start and block finish methods. If the block start method has already done this, wouldn't the block finish method simply skip it?
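
To make the question concrete, here is a minimal C++ sketch of the behavior as I understand it. It is not Paella's actual code: the `Job` and `Scheduler` types, their field names, and the placeholder limit of 2 are hypothetical, and I am assuming that num_outstanding_kernels_ is incremented at the moment a job is marked unfit.

```cpp
#include <cassert>

// Hypothetical, simplified model of the logic described in this issue.
// None of these types or names are taken from Paella's source.
struct Job {
  int total_blocks = 0;
  int started_blocks = 0;
  int finished_blocks = 0;
  bool resource_unfit = false;
};

struct Scheduler {
  int num_outstanding_kernels = 0;
  int max_num_outstanding_kernels = 2;  // placeholder value, not Paella's

  // Called when a non-memory job does not fit on the GPU.
  // Assumption: the counter is incremented when the flag is set.
  void mark_unfit_if_allowed(Job& job) {
    if (!job.resource_unfit &&
        num_outstanding_kernels < max_num_outstanding_kernels) {
      job.resource_unfit = true;
      ++num_outstanding_kernels;
    }
  }

  // Shared by both callbacks: clears the flag at most once per marking.
  void clear_unfit(Job& job) {
    if (job.resource_unfit) {
      job.resource_unfit = false;
      --num_outstanding_kernels;
      assert(num_outstanding_kernels >= 0);
    }
  }

  // Once every block has been launched, clear the flag if it is set.
  void on_block_start(Job& job) {
    ++job.started_blocks;
    if (job.started_blocks == job.total_blocks) clear_unfit(job);
  }

  // Once every block has completed, check the flag again.
  void on_block_finish(Job& job) {
    ++job.finished_blocks;
    if (job.finished_blocks == job.total_blocks) clear_unfit(job);
  }
};
```

In this model, when all block starts are observed before the last block finish, the flag is already clear by the time the finish path checks it, so the second check does nothing. That is the case I would like to confirm against the real implementation.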

My other question is about max_num_outstanding_kernels_: what reference or guideline is used to choose its value?


Labels

question: Further information is requested
