This issue might not really be related to pestpp itself; it may be more about my bash scripting or our cluster setup. I'm using a cluster with 16 nodes, each carrying 28 cores. I can run pestpp-gsa in parallel using the master/worker setup on one node with the SLURM script below:
#!/bin/bash
#SBATCH -n 1 # total number of tasks requested
#SBATCH --cpus-per-task=1 # cpus to allocate per task
#SBATCH -p shortq # queue (partition) -- defq, eduq, gpuq.
#SBATCH -t 12:00:00 # run time (hh:mm:ss) - 12.0 hours in this case.
cd /home/hdashti/scratch/ED_BSU/old_ed2/26jan17/ED/working_morris_ws/master
pestpp-gsa gsa_karun /h :4004 &   # start the master, listening on port 4004
MASTER_PID=$!
cd /home/hdashti/scratch/ED_BSU/old_ed2/26jan17/ED/working_morris_ws
# launch 20 workers, one per wrk* directory, each connecting back to the master
parallel -i bash -c "cd {} ; pestpp-gsa gsa_karun /h 127.0.0.1:4004" -- wrk1 wrk2 wrk3 wrk4 wrk5 wrk6 wrk7 wrk8 wrk9 wrk10 wrk11 wrk12 wrk13 wrk14 wrk15 wrk16 wrk17 wrk18 wrk19 wrk20
kill ${MASTER_PID}   # shut down the master once all workers have finished
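(In case the parallel line is unfamiliar: that looks like moreutils parallel syntax. The same launch can be read as a plain bash loop -- a minimal sketch of what the invocation above does, assuming the same wrk1 ... wrk20 directory layout:
cd /home/hdashti/scratch/ED_BSU/old_ed2/26jan17/ED/working_morris_ws
for d in wrk{1..20}; do
    (cd "$d" && pestpp-gsa gsa_karun /h 127.0.0.1:4004) &   # one worker per directory
done
wait   # returns once all 20 workers have exited
)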
The script above, which uses 20 cores of one node, works fine. I then tried to use multiple nodes with more workers via the following script:
#!/bin/bash
#SBATCH -N 4
#SBATCH --tasks-per-node=28
#SBATCH -p defq
#SBATCH -t 120:00:00
ulimit -u 9999
ulimit -s unlimited
ulimit -v unlimited
cd /home/hdashti/scratch/ED_BSU/old_ed2/26jan17/ED/working_morris_ws/master
pestpp-gsa gsa_karun /h :4004 &   # start the master, listening on port 4004
MASTER_PID=$!
LEADER=$SLURMD_NODENAME                                    # node running the master
NODELIST=($(scontrol show hostname $SLURM_JOB_NODELIST))   # all nodes in the allocation
FOLDERS=($(seq 1 112))
for i in $(seq 0 111); do
    # round-robin the 112 workers across the 4 allocated nodes
    ssh -f ${NODELIST[$(echo "$i % 4" | bc)]} "cd /home/hdashti/scratch/ED_BSU/old_ed2/26jan17/ED/working_morris_ws/wrk${FOLDERS[$i]} ; nohup pestpp-gsa gsa_karun /h ${LEADER}:4004 > worker.log &"
done
wait ${MASTER_PID}
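(For comparison -- this is a rough sketch I have not tested, not the script I actually ran -- the per-worker ssh calls could instead be srun job steps, so SLURM itself places each worker inside the allocation:
for i in $(seq 1 112); do
    # one single-task job step per worker; srun picks a node in the allocation,
    # and --exclusive keeps steps from sharing the same CPUs
    srun -N1 -n1 --exclusive bash -c "cd /home/hdashti/scratch/ED_BSU/old_ed2/26jan17/ED/working_morris_ws/wrk${i} && pestpp-gsa gsa_karun /h ${LEADER}:4004 > worker.log" &
done
wait
)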
Although I'm now using 112 cores, the pestpp run takes much longer to finish. Did anyone else run into the same problem, or am I missing something here? I'm posting this here because I'm not sure whether it's a pestpp problem or our cluster setup. Thanks