I wonder if I might be doing something wrong, but it appears that all of my model evaluations are occurring in serial when I believe they should be occurring in parallel.
If I define my allocation_queue.sh as:
./hq alloc add slurm --time-limit 20m \
--idle-timeout 5m \
--backlog 1 \
--workers-per-alloc 4 \
--max-worker-count 4 \
--cpus=4 \
-- \
...
# other account/partition definitions
and job.sh as:
#HQ --cpus=1
#HQ --time-request=3m
#HQ --time-limit=5m
...
I appear to get 4 workers when running the load-balancer. I then submit some model evaluations using a client which uses the QMCPy UM-Bridge wrapper:
...
integrand = qp.integrand.UMBridgeWrapper(
    true_measure=true_measure, model=model, config=config, parallel=4
)
...
Some jobs are seen to start on the load-balancer, but after a while workers 2, 3 and 4 are all closed (exactly at the idle timeout). The evaluations complete slowly and at regular intervals, as if they were running one at a time. I SSHed into each node while all workers were running and found with top that only the first node (i.e. worker 1) had any activity on it. I see the same behaviour with different-sized allocations.
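One way to tell serial from parallel evaluation is to time a batch of concurrent requests against a single request. The following is only a sketch: `evaluate` is a hypothetical stand-in for one model evaluation (it just sleeps), not the actual UM-Bridge call, and the thread pool is an assumption about how concurrent requests could be issued from the client side.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def evaluate(x, delay=0.2):
    # Hypothetical stand-in for a single model evaluation.
    # Replace the body with the real request to the load-balancer.
    time.sleep(delay)
    return 2 * x


def timed_batch(inputs, max_workers):
    # Submit all inputs at once and measure the wall-clock time for the batch.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(evaluate, inputs))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    inputs = [1, 2, 3, 4]
    _, t_serial = timed_batch(inputs, max_workers=1)
    _, t_parallel = timed_batch(inputs, max_workers=4)
    # If the batch with 4 concurrent requests takes about as long as the
    # one-at-a-time batch, the evaluations are effectively serialized.
    print(f"one at a time: {t_serial:.2f} s, 4 concurrent: {t_parallel:.2f} s")
```

If the two timings come out roughly equal against the real model, the requests are being serialized somewhere between the client and the workers.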
Does anyone have any ideas why this might not be evaluating in parallel, and what I might have done wrong? Thanks!