Parallelism not working with HPC load-balancer

I wonder if I might be doing something wrong, but it appears that all of my model evaluations are occurring in serial when I believe they should be occurring in parallel.

If I define my `allocation_queue.sh` as:
```
./hq alloc add slurm --time-limit 20m \
                   --idle-timeout 5m \
                   --backlog 1 \
                   --workers-per-alloc 4 \
                   --max-worker-count 4 \
                   --cpus=4 \
                   -- \
...
# other account/partition definitions
```

and `job.sh` as:
```
#HQ --cpus=1
#HQ --time-request=3m
#HQ --time-limit=5m
...
```

I appear to get 4 workers when running the `load-balancer`. I then submit some model evaluations using a client which uses the QMCPy UM-Bridge wrapper:
```
...
integrand = qp.integrand.UMBridgeWrapper(
    true_measure=true_measure, model=model, config=config, parallel=4
)
...
```

Some jobs are seen to start on the load-balancer, but after a while workers 2, 3 and 4 are all closed (at exactly the idle timeout time). The evaluations are slow and regular. I ssh-ed into each node when all workers were running and found with `top` that only the first node (i.e. worker 1) had any activity on it. I seem to get the same behaviour with different-sized allocations.

Does anyone have any ideas why this might not be evaluating in parallel, and what I might have done wrong? Thanks!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parallelism not working with HPC load-balancer #89

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Parallelism not working with HPC load-balancer #89

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions