MONC on ARCHER2 #32

---
Debug commands I'm using on ARCHER2 (for my own reference). Run MONC inside gdb4hpc:

$> gdb4hpc
gdb all> launch --args="--config=tests/straka_short.mcf --checkpoint_file=checkpoint_files/straka_dump.nc" --launcher-args="--partition=standard --qos=standard --tasks-per-node=2 --exclusive --export=all" $monc{2} ./build/bin/monc_driver.exe

---
Currently I'm stuck with an issue with a call to MPI_Alltoallv. The build itself succeeds:

earlcd@uan01:/work/ta009/ta009/earlcd/git-repos/monc> fcm make -f fcm-make/monc-cray-cray.cfg
[init] make # 2020-12-15T15:15:52Z
[info] FCM 2019.05.0 (/home2/home/ta009/ta009/earlcd/fcm-2019.09.0)
[init] make config-parse # 2020-12-15T15:15:52Z
[info] config-file=/lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc/fcm-make/monc-cray-cray.cfg
[info] config-file= - /lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc/fcm-make/comp-cray-2107.cfg
[info] config-file= - /lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc/fcm-make/env-cray.cfg
[info] config-file= - /lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc/fcm-make/monc-build.cfg
[done] make config-parse # 0.0s
[init] make dest-init # 2020-12-15T15:15:52Z
[info] dest=earlcd@uan01:/lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc
[info] mode=incremental
[done] make dest-init # 0.0s
[init] make extract # 2020-12-15T15:15:52Z
[info] location monc: 0: /lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc
[info] dest: 381 [U unchanged]
[info] source: 381 [U from base]
[done] make extract # 0.4s
[init] make preprocess # 2020-12-15T15:15:53Z
[info] sources: total=381, analysed=0, elapsed-time=0.2s, total-time=0.0s
[info] target-tree-analysis: elapsed-time=0.0s
[info] install targets: modified=0, unchanged=8, failed=0, total-time=0.0s
[info] process targets: modified=0, unchanged=172, failed=0, total-time=0.0s
[info] TOTAL targets: modified=0, unchanged=180, failed=0, elapsed-time=0.2s
[done] make preprocess # 0.8s
[init] make build # 2020-12-15T15:15:54Z
[info] sources: total=381, analysed=0, elapsed-time=0.1s, total-time=0.0s
[info] target-tree-analysis: elapsed-time=0.1s
[info] compile targets: modified=120, unchanged=3, failed=0, total-time=176.7s
[info] compile+ targets: modified=112, unchanged=7, failed=0, total-time=0.5s
[info] link targets: modified=1, unchanged=0, failed=0, total-time=0.5s
[info] TOTAL targets: modified=233, unchanged=10, failed=0, elapsed-time=178.1s
[done] make build # 178.3s
[done] make # 179.6s
earlcd@uan01:/work/ta009/ta009/earlcd/git-repos/monc> sbatch utils/archer2/submonc.slurm
Submitted batch job 59769
earlcd@uan01:/work/ta009/ta009/earlcd/git-repos/monc> cat slurm-59769.out
Unloading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env-profile
Loading cpe-cray
Loading cce/10.0.4
Loading craype/2.7.2
Loading craype-x86-rome
Loading libfabric/1.11.0.0.233
Loading craype-network-ofi
Loading cray-dsmml/0.1.2
Loading perftools-base/20.10.0
Loading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta
Loading cray-mpich/8.0.16
Loading cray-libsci/20.10.1.2
Loading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
Currently Loaded Modulefiles:
1) cpe-cray
2) cce/10.0.4(default)
3) craype/2.7.2(default)
4) craype-x86-rome
5) libfabric/1.11.0.0.233(default)
6) craype-network-ofi
7) cray-dsmml/0.1.2(default)
8) perftools-base/20.10.0(default)
9) xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta(default)
10) cray-mpich/8.0.16(default)
11) cray-libsci/20.10.1.2(default)
12) /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
13) epcc-job-env
14) cray-netcdf/4.7.4.2(default)
15) cray-fftw/3.3.8.8(default)
16) cray-hdf5/1.12.0.2(default)
MPICH ERROR [Rank 1] [job id 59769.0] [Tue Dec 15 15:22:06 2020] [unknown] [nid001139] - Abort(403275522) (rank 1 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x9183600, scnts=0x45becc0, sdispls=0x47f7a00, MPI_DOUBLE_PRECISION, rbuf=0x92888c0, rcnts=0x47f6540, rdispls=0x47f4040, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1207723264
aborting job:
Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x9183600, scnts=0x45becc0, sdispls=0x47f7a00, MPI_DOUBLE_PRECISION, rbuf=0x92888c0, rcnts=0x47f6540, rdispls=0x47f4040, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1207723264
[INFO] MONC running with 1 processes, 1 IO server(s)
[WARN] No enabled configuration for component ideal_squall therefore disabling this
[WARN] No enabled configuration for component kid_testcase therefore disabling this
[WARN] Run order callback for component tank_experiments at stage initialisation not specified
[WARN] Run order callback for component tank_experiments at stage finalisation not specified
[WARN] Defaulting to one dimension decomposition due to solution size too small
[INFO] Decomposed 1 processes via 'OneDim' into z=1 y=1 x=1
[INFO] 3D system; z=65, y=512, x=2
srun: error: nid001139: task 1: Exited with exit code 255
srun: Terminating job step 59769.0
slurmstepd: error: *** STEP 59769.0 ON nid001139 CANCELLED AT 2020-12-15T15:22:06 ***
srun: error: nid001139: task 0: Terminated
srun: Force Terminated job step 59769.0

---

I've tried compiling with

---
Hi @leifdenby: since these are MPI issues, I thought the changes that Chris applied for ARC4 may be worth exploring, in case you have not done so yet.

---
Great idea @sjboeing! I'll give this a try.

---

Unfortunately the fixes introduced for ARC4 don't appear to have fixed the issue @sjboeing. But I have an idea of what the issue might be. I'll put my testing in separate comments below.

---
compiling with cray fortran compiler and running (optimised)

compiling:

earlcd@uan01:~/work/monc> module restore PrgEnv-cray
Unloading cray-hdf5/1.12.0.2
Unloading cray-fftw/3.3.8.8
Unloading cray-netcdf/4.7.4.2
Unloading /usr/local/share/epcc-module/epcc-module-loader
Warning: Unloading the epcc-setup-env module will stop many
modules being available on the system. If you do this by
accident, you can recover the situation with the command:
module load /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
Unloading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
Unloading bolt/0.7
Unloading cray-libsci/20.10.1.2
Unloading cray-mpich/8.0.16
Unloading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta
Unloading perftools-base/20.10.0
WARNING: Did not unuse /opt/cray/pe/perftools/20.10.0/modulefiles
Unloading cray-dsmml/0.1.2
Unloading craype-network-ofi
Unloading libfabric/1.11.0.0.233
Unloading craype-x86-rome
Unloading craype/2.7.2
Unloading gcc/10.1.0
Unloading cpe-gnu
Loading cpe-cray
Loading cce/10.0.4
Loading craype/2.7.2
Loading craype-x86-rome
Loading libfabric/1.11.0.0.233
Loading craype-network-ofi
Loading cray-dsmml/0.1.2
Loading perftools-base/20.10.0
Loading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta
Loading cray-mpich/8.0.16
Loading cray-libsci/20.10.1.2
Loading bolt/0.7
Loading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
Loading /usr/local/share/epcc-module/epcc-module-loader
earlcd@uan01:~/work/monc> module load cray-hdf5 cray-netcdf cray-fftw
earlcd@uan01:~/work/monc> fcm make -f fcm-make/monc-cray-cray.cfg
[init] make # 2021-01-26T12:16:55Z
[info] FCM 2019.05.0 (/home1/home/n02/n02/earlcd/fcm-2019.09.0)
[init] make config-parse # 2021-01-26T12:16:55Z
[info] config-file=/lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/monc-cray-cray.cfg
[info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/comp-cray-2107.cfg
[info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/env-cray.cfg
[info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/monc-build.cfg
[done] make config-parse # 0.1s
[init] make dest-init # 2021-01-26T12:16:55Z
[info] dest=earlcd@uan01:/lus/cls01095/work/n02/n02/earlcd/monc
[info] mode=incremental
[done] make dest-init # 0.1s
[init] make extract # 2021-01-26T12:16:55Z
[info] location monc: 0: /lus/cls01095/work/n02/n02/earlcd/monc
[info] dest: 381 [U unchanged]
[info] source: 381 [U from base]
[done] make extract # 7.7s
[init] make preprocess # 2021-01-26T12:17:03Z
[info] sources: total=381, analysed=180, elapsed-time=0.2s, total-time=0.1s
[info] target-tree-analysis: elapsed-time=0.0s
[info] install targets: modified=8, unchanged=0, failed=0, total-time=0.1s
[info] process targets: modified=172, unchanged=0, failed=0, total-time=14.1s
[info] TOTAL targets: modified=180, unchanged=0, failed=0, elapsed-time=14.3s
[done] make preprocess # 14.7s
[init] make build # 2021-01-26T12:17:17Z
[info] sources: total=381, analysed=381, elapsed-time=1.5s, total-time=1.4s
[info] target-tree-analysis: elapsed-time=0.3s
[info] compile targets: modified=123, unchanged=0, failed=0, total-time=209.8s
[info] compile+ targets: modified=119, unchanged=0, failed=0, total-time=1.3s
[info] link targets: modified=1, unchanged=0, failed=0, total-time=2.0s
[info] TOTAL targets: modified=243, unchanged=0, failed=0, elapsed-time=213.7s
[done] make build # 215.4s
[done] make # 238.0s

running MONC:

earlcd@uan01:~/work/monc> sbatch utils/archer2/submonc.slurm
Submitted batch job 77481

output of SLURM log:

earlcd@uan01:~/work/monc> cat slurm-77481.out
Unloading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env-profile
Loading cpe-cray
Loading cce/10.0.4
Loading craype/2.7.2
Loading craype-x86-rome
Loading libfabric/1.11.0.0.233
Loading craype-network-ofi
Loading cray-dsmml/0.1.2
Loading perftools-base/20.10.0
Loading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta
Loading cray-mpich/8.0.16
Loading cray-libsci/20.10.1.2
Loading bolt/0.7
Loading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
Loading epcc-job-env
Loading requirement: bolt/0.7
Currently Loaded Modulefiles:
1) cpe-cray
2) cce/10.0.4(default)
3) craype/2.7.2(default)
4) craype-x86-rome
5) libfabric/1.11.0.0.233(default)
6) craype-network-ofi
7) cray-dsmml/0.1.2(default)
8) perftools-base/20.10.0(default)
9) xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta(default)
10) cray-mpich/8.0.16(default)
11) cray-libsci/20.10.1.2(default)
12) bolt/0.7
13) /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
14) epcc-job-env
[INFO] MONC running with 4 processes, 1 IO server(s)
[WARN] No enabled configuration for component ideal_squall therefore disabling this
[WARN] No enabled configuration for component kid_testcase therefore disabling this
[WARN] Run order callback for component tank_experiments at stage initialisation not specified
[WARN] Run order callback for component tank_experiments at stage finalisation not specified
[WARN] Defaulting to one dimension decomposition due to solution size too small
[INFO] Decomposed 4 processes via 'OneDim' into z=1 y=4 x=1
[INFO] 3D system; z=65, y=512, x=2
MPICH ERROR [Rank 1] [job id 77481.0] [Tue Jan 26 12:22:49 2021] [unknown] [nid001037] - Abort(403275522) (rank 1 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5cd2600, scnts=0x47fa200, sdispls=0x47f8d40, MPI_DOUBLE_PRECISION, rbuf=0x5d16bc0, rcnts=0x47f3400, rdispls=0x47f0200, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -404335872
aborting job:
Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5cd2600, scnts=0x47fa200, sdispls=0x47f8d40, MPI_DOUBLE_PRECISION, rbuf=0x5d16bc0, rcnts=0x47f3400, rdispls=0x47f0200, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -404335872
MPICH ERROR [Rank 2] [job id 77481.0] [Tue Jan 26 12:22:49 2021] [unknown] [nid001037] - Abort(622338) (rank 2 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5c6c940, scnts=0x45fda80, sdispls=0x45fd180, MPI_DOUBLE_PRECISION, rbuf=0x5cacf40, rcnts=0x45fc400, rdispls=0x46299c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1539026176
aborting job:
Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5c6c940, scnts=0x45fda80, sdispls=0x45fd180, MPI_DOUBLE_PRECISION, rbuf=0x5cacf40, rcnts=0x45fc400, rdispls=0x46299c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1539026176
MPICH ERROR [Rank 4] [job id 77481.0] [Tue Jan 26 12:22:49 2021] [unknown] [nid001037] - Abort(336166658) (rank 4 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5c3f440, scnts=0x45fda80, sdispls=0x45fd180, MPI_DOUBLE_PRECISION, rbuf=0x5c7fc80, rcnts=0x45fc400, rdispls=0x46299c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1178053888
aborting job:
Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5c3f440, scnts=0x45fda80, sdispls=0x45fd180, MPI_DOUBLE_PRECISION, rbuf=0x5c7fc80, rcnts=0x45fc400, rdispls=0x46299c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1178053888
srun: error: nid001037: tasks 1-2,4: Exited with exit code 255
srun: Terminating job step 77481.0
slurmstepd: error: *** STEP 77481.0 ON nid001037 CANCELLED AT 2021-01-26T12:22:49 ***
srun: error: nid001037: tasks 0,3: Terminated
srun: Force Terminated job step 77481.0

This run-time error suggests to me that the routine calculating the size of the buffer used for the MPI communication is doing the calculation incorrectly. I should also note that compiling with debug flags (using
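As an aside, one classic mechanism that produces negative MPI counts is default 4-byte integer arithmetic overflowing during a size calculation. A minimal sketch of how that looks (my own illustration with made-up dimensions, not MONC source):

```fortran
! Illustration only: a count computed as a product of sizes in default
! 4-byte INTEGER arithmetic exceeds huge(0) and (on most compilers)
! wraps to a negative value.
program count_overflow_demo
  use iso_fortran_env, only : int32, int64
  implicit none
  integer(int32) :: nx, ny, nz, count32
  integer(int64) :: count64

  nx = 2048; ny = 2048; nz = 512      ! made-up dimensions
  count32 = nx*ny*nz                  ! 2**31 overflows; typically wraps to -2147483648
  count64 = int(nx, int64)*ny*nz     ! widening first keeps the result correct
  print *, 'count32 =', count32
  print *, 'count64 =', count64      ! 2147483648
end program count_overflow_demo
```

That said, the garbage-looking values in the log above (e.g. -919842048) could equally come from uninitialised or corrupted memory rather than a clean overflow, so this is only one hypothesis.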
---
compiling with GNU fortran compiler and running

compiling:

earlcd@uan01:~/work/monc> module restore PrgEnv-gnu
Unloading /usr/local/share/epcc-module/epcc-module-loader
Warning: Unloading the epcc-setup-env module will stop many
modules being available on the system. If you do this by
accident, you can recover the situation with the command:
module load /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
Unloading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
Unloading bolt/0.7
Unloading cray-libsci/20.10.1.2
Unloading cray-mpich/8.0.16
Unloading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta
Unloading perftools-base/20.10.0
WARNING: Did not unuse /opt/cray/pe/perftools/20.10.0/modulefiles
Unloading cray-dsmml/0.1.2
Unloading craype-network-ofi
Unloading libfabric/1.11.0.0.233
Unloading craype-x86-rome
Unloading craype/2.7.2
Unloading cce/10.0.4
Unloading cpe-cray
Loading cpe-gnu
Loading gcc/10.1.0
Loading craype/2.7.2
Loading craype-x86-rome
Loading libfabric/1.11.0.0.233
Loading craype-network-ofi
Loading cray-dsmml/0.1.2
Loading perftools-base/20.10.0
Loading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta
Loading cray-mpich/8.0.16
Loading cray-libsci/20.10.1.2
Loading bolt/0.7
Loading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
Loading /usr/local/share/epcc-module/epcc-module-loader
earlcd@uan01:~/work/monc> module load cray-hdf5 cray-netcdf cray-fftw
earlcd@uan01:~/work/monc> ftn --version
GNU Fortran (GCC) 10.1.0 20200507 (Cray Inc.)
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
earlcd@uan01:~/work/monc> fcm make -f fcm-make/monc-cray-gnu.cfg
[init] make # 2021-01-26T12:30:49Z
[info] FCM 2019.05.0 (/home1/home/n02/n02/earlcd/fcm-2019.09.0)
[init] make config-parse # 2021-01-26T12:30:49Z
[info] config-file=/lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/monc-cray-gnu.cfg
[info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/comp-gnu-4.4.7.cfg
[info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/env-cray.cfg
[info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/monc-build.cfg
[done] make config-parse # 0.1s
[init] make dest-init # 2021-01-26T12:30:49Z
[info] dest=earlcd@uan01:/lus/cls01095/work/n02/n02/earlcd/monc
[info] mode=incremental
[done] make dest-init # 0.1s
[init] make extract # 2021-01-26T12:30:49Z
[info] location monc: 0: /lus/cls01095/work/n02/n02/earlcd/monc
[info] dest: 381 [U unchanged]
[info] source: 381 [U from base]
[done] make extract # 0.8s
[init] make preprocess # 2021-01-26T12:30:50Z
[info] sources: total=381, analysed=0, elapsed-time=1.2s, total-time=0.0s
[info] target-tree-analysis: elapsed-time=0.0s
[info] install targets: modified=0, unchanged=8, failed=0, total-time=0.0s
[info] process targets: modified=0, unchanged=172, failed=0, total-time=0.0s
[info] TOTAL targets: modified=0, unchanged=180, failed=0, elapsed-time=2.2s
[done] make preprocess # 5.6s
[init] make build # 2021-01-26T12:30:56Z
[info] sources: total=381, analysed=0, elapsed-time=0.1s, total-time=0.0s
[info] target-tree-analysis: elapsed-time=0.2s
[FAIL] ftn -oo/conditional_diagnostics_whole_mod.o -c -I./include -I/opt/cray/pe/netcdf/4.7.4.2/GNU/9.1/include -I/opt/cray/pe/fftw/3.3.8.8/x86_rome/lib/../include -O3 -J . /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/components/conditional_diagnostics_whole/src/conditional_diagnostics_whole.F90 # rc=1
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/components/conditional_diagnostics_whole/src/conditional_diagnostics_whole.F90:80:22:
[FAIL]
[FAIL] 77 | call mpi_reduce(MPI_IN_PLACE , CondDiags_tot, ncond*2*ndiag*current_state%local_grid%size(Z_INDEX), &
[FAIL] | 2
[FAIL] ......
[FAIL] 80 | call mpi_reduce(CondDiags_tot, CondDiags_tot, ncond*2*ndiag*current_state%local_grid%size(Z_INDEX), &
[FAIL] | 1
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (REAL(8)/INTEGER(4)).
[FAIL] compile 0.1 ! conditional_diagnostics_whole_mod.o <- monc/components/conditional_diagnostics_whole/src/conditional_diagnostics_whole.F90
[FAIL] ftn -oo/iterativesolver_mod.o -c -I./include -I/opt/cray/pe/netcdf/4.7.4.2/GNU/9.1/include -I/opt/cray/pe/fftw/3.3.8.8/x86_rome/lib/../include -O3 -J . /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/components/iterativesolver/src/iterativesolver.F90 # rc=1
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/components/iterativesolver/src/iterativesolver.F90:540:23:
[FAIL]
[FAIL] 540 | call mpi_allreduce(local_sum, global_sum, 3, PRECISION_TYPE, MPI_SUM, current_state%parallel%monc_communicator, ierr)
[FAIL] | 1
[FAIL] ......
[FAIL] 600 | call mpi_allreduce(current_state%local_divmax, current_state%global_divmax, 1, PRECISION_TYPE, MPI_MAX, &
[FAIL] | 2
[FAIL] Error: Rank mismatch between actual argument at (1) and actual argument at (2) (scalar and rank-1)
[FAIL] compile 0.1 ! iterativesolver_mod.o <- monc/components/iterativesolver/src/iterativesolver.F90
[FAIL] ftn -oo/monc_mod.o -c -I./include -I/opt/cray/pe/netcdf/4.7.4.2/GNU/9.1/include -I/opt/cray/pe/fftw/3.3.8.8/x86_rome/lib/../include -O3 -J . -frecursive /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/model_core/src/monc.F90 # rc=1
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/model_core/src/monc.F90:207:60:
[FAIL]
[FAIL] 207 | call mpi_barrier(state%parallel%monc_communicator, ierr)
[FAIL] | 1
[FAIL] Error: More actual than formal arguments in procedure call at (1)
[FAIL] compile 0.1 ! monc_mod.o <- monc/model_core/src/monc.F90
[FAIL] ftn -oo/io_server_client_mod.o -c -I./include -I/opt/cray/pe/netcdf/4.7.4.2/GNU/9.1/include -I/opt/cray/pe/fftw/3.3.8.8/x86_rome/lib/../include -O3 -J . -frecursive /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90 # rc=1
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:168:25:
[FAIL]
[FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr)
[FAIL] | 2
[FAIL] ......
[FAIL] 168 | call mpi_get_address(basic_type%dimensions, num_addr, ierr)
[FAIL] | 1
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:173:25:
[FAIL]
[FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr)
[FAIL] | 2
[FAIL] ......
[FAIL] 173 | call mpi_get_address(basic_type%dim_sizes, num_addr, ierr)
[FAIL] | 1
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:124:25:
[FAIL]
[FAIL] 124 | call mpi_get_address(basic_type, base_addr, ierr)
[FAIL] | 1
[FAIL] ......
[FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr)
[FAIL] | 2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (TYPE(field_description_type)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:129:25:
[FAIL]
[FAIL] 129 | call mpi_get_address(basic_type%field_name, num_addr, ierr)
[FAIL] | 1
[FAIL] ......
[FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr)
[FAIL] | 2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (CHARACTER(150)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:134:25:
[FAIL]
[FAIL] 134 | call mpi_get_address(basic_type%field_type, num_addr, ierr)
[FAIL] | 1
[FAIL] ......
[FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr)
[FAIL] | 2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:139:25:
[FAIL]
[FAIL] 139 | call mpi_get_address(basic_type%data_type, num_addr, ierr)
[FAIL] | 1
[FAIL] ......
[FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr)
[FAIL] | 2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:144:25:
[FAIL]
[FAIL] 144 | call mpi_get_address(basic_type%optional, num_addr, ierr)
[FAIL] | 1
[FAIL] ......
[FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr)
[FAIL] | 2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (LOGICAL(4)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:92:25:
[FAIL]
[FAIL] 92 | call mpi_get_address(basic_type, base_addr, ierr)
[FAIL] | 1
[FAIL] ......
[FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr)
[FAIL] | 2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (TYPE(definition_description_type)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:97:25:
[FAIL]
[FAIL] 97 | call mpi_get_address(basic_type%send_on_terminate, num_addr, ierr)
[FAIL] | 1
[FAIL] ......
[FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr)
[FAIL] | 2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (LOGICAL(4)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:102:25:
[FAIL]
[FAIL] 102 | call mpi_get_address(basic_type%number_fields, num_addr, ierr)
[FAIL] | 1
[FAIL] ......
[FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr)
[FAIL] | 2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:107:25:
[FAIL]
[FAIL] 107 | call mpi_get_address(basic_type%frequency, num_addr, ierr)
[FAIL] | 1
[FAIL] ......
[FAIL] 163 | call mpi_get_address(basic_type, base_addr, ierr)
[FAIL] | 2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)).
[FAIL] compile 0.1 ! io_server_client_mod.o <- monc/io/src/ioclient.F90
[info] compile targets: modified=0, unchanged=91, failed=4, total-time=0.3s
[info] compile+ targets: modified=0, unchanged=88, failed=0, total-time=0.0s
[info] TOTAL targets: modified=0, unchanged=179, failed=8, elapsed-time=2.0s
[FAIL] ! conditional_diagnostics_whole_mod.mod: depends on failed target: conditional_diagnostics_whole_mod.o
[FAIL] ! conditional_diagnostics_whole_mod.o: update task failed
[FAIL] ! io_server_client_mod.mod: depends on failed target: io_server_client_mod.o
[FAIL] ! io_server_client_mod.o: update task failed
[FAIL] ! iterativesolver_mod.mod: depends on failed target: iterativesolver_mod.o
[FAIL] ! iterativesolver_mod.o: update task failed
[FAIL] ! monc_mod.mod : depends on failed target: monc_mod.o
[FAIL] ! monc_mod.o : update task failed
[FAIL] make build # 2.3s
[FAIL] make # 8.9s

does not compile

With the GNU compiler MONC fails to compile. All the errors are related to incorrect datatypes being passed to MPI-related subroutines (as far as I can see). I think these are bugs, and fixing them may resolve the issue we are having at runtime. It is possible that the GNU compiler is simply stricter here and catches these bugs at compile time. Thoughts @sjboeing?
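For context, the class of failure gfortran 10 is reporting can be reproduced with a tiny standalone program (my own sketch, not MONC code): with the old implicit interfaces from mpif.h, calling the same MPI routine with different actual argument types in one file is now a hard error by default.

```fortran
! Minimal reproducer (assumed, not from MONC): gfortran 10 rejects
! inconsistent actual-argument types across calls to a procedure with
! an implicit interface, which is exactly how legacy MPI code passes
! buffers of arbitrary type.
program gcc10_mismatch_demo
  implicit none
  include 'mpif.h'
  integer :: ierr
  integer :: ibuf
  real(kind=8) :: rbuf
  call mpi_init(ierr)
  call mpi_bcast(ibuf, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
  ! gfortran 10: "Type mismatch between actual argument at (1) and
  ! actual argument at (2) (REAL(8)/INTEGER(4))" on the next line
  call mpi_bcast(rbuf, 1, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)
  call mpi_finalize(ierr)
end program gcc10_mismatch_demo
```

So these may not be newly introduced bugs so much as long-standing usage that older compilers silently accepted; GCC 10 simply tightened the default checking.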
---
Digging a little further, I've added some print statements. It appears that the counts are calculated incorrectly (I'm compiling with Cray Fortran again here):

earlcd@uan01:~/work/monc> cat slurm-77539.out
Unloading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env-profile
Loading cpe-cray
Loading cce/10.0.4
Loading craype/2.7.2
Loading craype-x86-rome
Loading libfabric/1.11.0.0.233
Loading craype-network-ofi
Loading cray-dsmml/0.1.2
Loading perftools-base/20.10.0
Loading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta
Loading cray-mpich/8.0.16
Loading cray-libsci/20.10.1.2
Loading bolt/0.7
Loading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
Loading epcc-job-env
Loading requirement: bolt/0.7
Currently Loaded Modulefiles:
1) cpe-cray
2) cce/10.0.4(default)
3) craype/2.7.2(default)
4) craype-x86-rome
5) libfabric/1.11.0.0.233(default)
6) craype-network-ofi
7) cray-dsmml/0.1.2(default)
8) perftools-base/20.10.0(default)
9) xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta(default)
10) cray-mpich/8.0.16(default)
11) cray-libsci/20.10.1.2(default)
12) bolt/0.7
13) /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
14) epcc-job-env
MPICH ERROR [Rank 2] [job id 77539.0] [Tue Jan 26 13:07:48 2021] [unknown] [nid001199] - Abort(1007255298) (rank 2 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5ce63c0, scnts=0x45fba80, sdispls=0x45fb180, MPI_DOUBLE_PRECISION, rbuf=0x5d26a00, rcnts=0x45fa400, rdispls=0x46279c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -919842048
aborting job:
Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5ce63c0, scnts=0x45fba80, sdispls=0x45fb180, MPI_DOUBLE_PRECISION, rbuf=0x5d26a00, rcnts=0x45fa400, rdispls=0x46279c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -919842048
debug send_sizes 4352, 3*4096
debug recv_sizes 4*4096
debug send_sizes 16448
debug recv_sizes 16448
debug send_sizes 32896
debug recv_sizes -919842048
MPICH ERROR [Rank 1] [job id 77539.0] [Tue Jan 26 13:07:48 2021] [unknown] [nid001199] - Abort(1007255298) (rank 1 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5ca16c0, scnts=0x47f8200, sdispls=0x47f6d40, MPI_DOUBLE_PRECISION, rbuf=0x5ce5cc0, rcnts=0x47f1400, rdispls=0x47ee200, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1300475136

I've gotten as far as identifying that determine_offsets_from_size (also in the compiled MONC docs) is in charge of computing these offsets. I think the next step will be to work out why this subroutine isn't calculating positive values (as it should).
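In the meantime, a defensive check just before the collective might localise the fault better than the opaque MPICH abort does. A rough sketch of the kind of print-and-check I mean (hypothetical helper, not actual pencilfft.F90 code):

```fortran
! Hypothetical debugging helper (not MONC source): report any negative
! entry in the count arrays before they reach mpi_alltoallv, so the bad
! value is tied to a rank and peer index instead of an opaque abort.
subroutine check_counts(send_counts, recv_counts, my_rank)
  implicit none
  integer, intent(in) :: send_counts(:), recv_counts(:)
  integer, intent(in) :: my_rank
  integer :: i

  do i = 1, size(send_counts)
    if (send_counts(i) < 0 .or. recv_counts(i) < 0) then
      print *, 'rank', my_rank, 'peer', i, &
               'send=', send_counts(i), 'recv=', recv_counts(i)
    end if
  end do
end subroutine check_counts
```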
---

Hi @leif, this looks like it is not really trivial. Two things you may try are:

---
Thanks for the suggestions @sjboeing, I've tried both and the issue persists.

with `enable_io_server=.false.`:

aborting job:
Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x570ab40, scnts=0x4611440, sdispls=0x45fa740, MPI_DOUBLE_PRECISION, rbuf=0x573f040, rcnts=0x45f9e40, rdispls=0x45f90c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1112427776
debug send_sizes 5*2652
debug recv_sizes 2*2678, 3*2652
debug send_sizes 13364
debug recv_sizes 13364
debug send_sizes 26728
debug recv_sizes -1090014464
MPICH ERROR [Rank 4] [job id 78435.0] [Tue Jan 26 21:33:32 2021] [unknown] [nid001010] - Abort(940146434) (rank 4 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x570a080, scnts=0x4611440, sdispls=0x45fa740, MPI_DOUBLE_PRECISION, rbuf=0x573e580, rcnts=0x45f9e40, rdispls=0x45f90c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1090014464

and using the `testcases/shallow_convection/bomex.mcf` configuration:

aborting job:
Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x9961800, scnts=0x4606440, sdispls=0x4605b40, MPI_DOUBLE_PRECISION, rbuf=0x9a03480, rcnts=0x47c1c80, rdispls=0x47bf780, datatype=MPI_DOUBLE_PRECISION, comm=comm=0xc4000000) failed
PMPI_Alltoallv(351): Negative count, value is -746590336
debug send_sizes 2*38912
debug recv_sizes 2*38912
debug send_sizes 2*40128
debug recv_sizes 2*40128
debug send_sizes 2*41382
debug recv_sizes -2011828352, 0
MPICH ERROR [Rank 4] [job id 78459.0] [Tue Jan 26 22:11:40 2021] [unknown] [nid001010] - Abort(134840066) (rank 4 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x99602c0, scnts=0x4606440, sdispls=0x4605b40, MPI_DOUBLE_PRECISION, rbuf=0x9a01f40, rcnts=0x47c1c80, rdispls=0x47bf780, datatype=MPI_DOUBLE_PRECISION, comm=comm=0x84000007) failed
PMPI_Alltoallv(351): Negative count, value is -2011828352

---
Hi @leifdenby Leif, just wondering what choice of moncs_per_io is set. Is this a version that still uses FFTW rather than the new MOSRS HEAD that uses FFTE? The message seemed to indicate fft_pencil in earlier output.

---
I just re-read some of your compile woes. Note that GCC 10 does not like the fact that MPI calls can use any data type, so we have to apply `-fallow-argument-mismatch`; I found that back in November and the ARCHER2 team added it to the documentation on building: https://docs.archer2.ac.uk/user-guide/dev-environment/
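If it helps, one place this could plausibly be wired in is the GNU compiler config used by the build above (a sketch only; the variable name and existing flag set in fcm-make/comp-gnu-4.4.7.cfg are assumptions on my part):

```
# Hypothetical fragment of an fcm-make compiler config: append the
# GCC 10 workaround flag to the Fortran compile flags.
$fflags_extra = -fallow-argument-mismatch
build.prop{fc.flags} = -O3 $fflags_extra
```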
---
Did you solve this yet? I see the only ref to mpi_alltoallv appears to be in components/fftsolver/src/pencilfft.F90

---

I haven't, no 😢 The farthest I've gotten is producing a branch (see #38) which contains all the commits that Adrian has made on MOSRS where he is working on ARCHER2 fixes. As you know, this branch includes a lot of changes and as it stands also reverses changes Chris recently made for ARC4. I am going to try to cherry-pick just the first four commits and see if that helps with running on ARCHER2.

Thank you for suggesting this. I'll give it a try by adding that compilation flag.

Thank you for suggesting this. I'm not quite sure how to work this out. If you check my run log above you'll see:

My

Does that sound reasonable or am I doing something obviously stupid?

---
The job looks poorly specified. If you want to run a total of 4 MPI tasks (i.e. 1 IO and 3 MONC) then tasks-per-node should also be 4, but then you might choose to spread them out, unless this is just a really basic job and you are happy for all tasks to sit in the same NUMA region. When doing a proper job, consider cpus-per-task if you have fewer than 128 tasks on one node — something along the lines of the sketch below.
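A hedged illustration of that layout (placeholders throughout — this is not the actual utils/archer2/submonc.slurm, and the directive values are invented for a single ARCHER2 node):

```bash
#!/bin/bash
#SBATCH --partition=standard
#SBATCH --qos=standard
#SBATCH --nodes=1
#SBATCH --ntasks=4            # 3 MONC computation tasks + 1 IO server
#SBATCH --ntasks-per-node=4   # keep all four tasks on the one node
#SBATCH --cpus-per-task=32    # spread them across the 128 cores

srun ./build/bin/monc_driver.exe --config=tests/straka_short.mcf
```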
---
Per Ralph's notes, and in agreement with @MarkUoLeeds, it seems that the Leeds branch works with gnu 9.3.0 but not the default gnu 10. Having my module load order as

seems the best option for compiling with gnu using

Still trying out the cray compiler so I can't comment on that, and I'm waiting on the test job to run, but thought I'd mention this.

---
Thanks for the help here. I think we can close this now that @cemac-ccs is working on a pull request for ARCHER2: #45

---
This is work-in-progress to get MONC compiling and running on ARCHER2