I noticed that in Paella, for non-memory jobs where GPU resources are insufficient, the job is marked as resource-unfit when num_outstanding_kernels_ < max_num_outstanding_kernels_.
However, both the block start and block finish methods include logic to clear the job's unfit flag. In the block start method, once all blocks have launched, it checks the job's unfit flag; if the job is marked as resource-unfit, it clears the flag and decrements num_outstanding_kernels_. In the block finish method, once all blocks have completed, it checks the job's unfit flag again, clears it, and also decrements num_outstanding_kernels_.
I'm trying to understand the rationale behind this behavior, in particular why the unfit flag is set in the first place, since in the schedule_job method, even after marking a job as unfit, one job still gets to run.
Also, the logic that clears the unfit flag and decrements num_outstanding_kernels_ is repeated in both the block start and block finish methods. If the block start method has already done this, wouldn't the block finish method simply skip it?
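To make the question concrete, here is a minimal sketch of how I currently understand the logic. It is not Paella's actual code: Job, Scheduler, mark_unfit, clear_unfit, and the two handler names are hypothetical stand-ins, and I am assuming that marking a job as unfit is what increments num_outstanding_kernels_.

```cpp
// Hypothetical stand-ins for illustration; not Paella's real types.
struct Job {
    bool unfit = false;  // resource-unfit flag
};

struct Scheduler {
    int num_outstanding_kernels_ = 0;

    // Assumption: marking a job unfit also counts it as outstanding.
    void mark_unfit(Job& job) {
        job.unfit = true;
        ++num_outstanding_kernels_;
    }

    // Cleanup as I read it in both handlers: it only acts while the flag is set.
    void clear_unfit(Job& job) {
        if (job.unfit) {
            job.unfit = false;
            --num_outstanding_kernels_;
        }
    }

    // Called once all blocks of the job have launched.
    void on_all_blocks_started(Job& job) { clear_unfit(job); }

    // Called once all blocks of the job have completed.
    void on_all_blocks_finished(Job& job) { clear_unfit(job); }
};
```

If this reading is right, the call in the finish handler is always a no-op after the start handler has run, so the counter is decremented only once per job. That is why I am unsure what the second check is for.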
My other question is about max_num_outstanding_kernels_: what reference or guideline is used to determine its value?