
Conversation

@leifdenby
Collaborator

This is a work-in-progress to get MONC compiling and running on ARCHER2.

@leifdenby leifdenby self-assigned this Dec 15, 2020
@leifdenby
Collaborator Author

Debug commands I'm using on ARCHER2 (for my own reference):

Run MONC inside gdb4hpc:

$> gdb4hpc
gdb all> launch --args="--config=tests/straka_short.mcf --checkpoint_file=checkpoint_files/straka_dump.nc" --launcher-args="--partition=standard --qos=standard --tasks-per-node=2 --exclusive --export=all" $monc{2} ./build/bin/monc_driver.exe
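(If I have the gdb4hpc syntax right, $monc{2} defines a process set named monc spanning 2 ranks, which later commands can address as a whole or via subsets such as $monc{0}.)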

@leifdenby
Collaborator Author

Currently I'm stuck on an issue with a call to MPI_Alltoallv:

earlcd@uan01:/work/ta009/ta009/earlcd/git-repos/monc> fcm make -f fcm-make/monc-cray-cray.cfg
[init] make                # 2020-12-15T15:15:52Z
[info] FCM 2019.05.0 (/home2/home/ta009/ta009/earlcd/fcm-2019.09.0)
[init] make config-parse   # 2020-12-15T15:15:52Z
[info] config-file=/lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc/fcm-make/monc-cray-cray.cfg
[info] config-file= - /lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc/fcm-make/comp-cray-2107.cfg
[info] config-file= - /lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc/fcm-make/env-cray.cfg
[info] config-file= - /lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc/fcm-make/monc-build.cfg
[done] make config-parse   # 0.0s
[init] make dest-init      # 2020-12-15T15:15:52Z
[info] dest=earlcd@uan01:/lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc
[info] mode=incremental
[done] make dest-init      # 0.0s
[init] make extract        # 2020-12-15T15:15:52Z
[info] location  monc: 0: /lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc
[info]   dest:  381 [U unchanged]
[info] source:  381 [U from base]
[done] make extract        # 0.4s
[init] make preprocess     # 2020-12-15T15:15:53Z
[info] sources: total=381, analysed=0, elapsed-time=0.2s, total-time=0.0s
[info] target-tree-analysis: elapsed-time=0.0s
[info] install   targets: modified=0, unchanged=8, failed=0, total-time=0.0s
[info] process   targets: modified=0, unchanged=172, failed=0, total-time=0.0s
[info] TOTAL     targets: modified=0, unchanged=180, failed=0, elapsed-time=0.2s
[done] make preprocess     # 0.8s
[init] make build          # 2020-12-15T15:15:54Z
[info] sources: total=381, analysed=0, elapsed-time=0.1s, total-time=0.0s
[info] target-tree-analysis: elapsed-time=0.1s
[info] compile   targets: modified=120, unchanged=3, failed=0, total-time=176.7s
[info] compile+  targets: modified=112, unchanged=7, failed=0, total-time=0.5s
[info] link      targets: modified=1, unchanged=0, failed=0, total-time=0.5s
[info] TOTAL     targets: modified=233, unchanged=10, failed=0, elapsed-time=178.1s
[done] make build          # 178.3s
[done] make                # 179.6s
earlcd@uan01:/work/ta009/ta009/earlcd/git-repos/monc>
earlcd@uan01:/work/ta009/ta009/earlcd/git-repos/monc> sbatch utils/archer2/submonc.slurm
Submitted batch job 59769
earlcd@uan01:/work/ta009/ta009/earlcd/git-repos/monc> cat slurm-59769.out
Unloading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env-profile
Loading cpe-cray
Loading cce/10.0.4
Loading craype/2.7.2
Loading craype-x86-rome
Loading libfabric/1.11.0.0.233
Loading craype-network-ofi
Loading cray-dsmml/0.1.2
Loading perftools-base/20.10.0
Loading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta
Loading cray-mpich/8.0.16
Loading cray-libsci/20.10.1.2
Loading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
Currently Loaded Modulefiles:
 1) cpe-cray
 2) cce/10.0.4(default)
 3) craype/2.7.2(default)
 4) craype-x86-rome
 5) libfabric/1.11.0.0.233(default)
 6) craype-network-ofi
 7) cray-dsmml/0.1.2(default)
 8) perftools-base/20.10.0(default)
 9) xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta(default)
10) cray-mpich/8.0.16(default)
11) cray-libsci/20.10.1.2(default)
12) /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
13) epcc-job-env
14) cray-netcdf/4.7.4.2(default)
15) cray-fftw/3.3.8.8(default)
16) cray-hdf5/1.12.0.2(default)
MPICH ERROR [Rank 1] [job id 59769.0] [Tue Dec 15 15:22:06 2020] [unknown] [nid001139] - Abort(403275522) (rank 1 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x9183600, scnts=0x45becc0, sdispls=0x47f7a00, MPI_DOUBLE_PRECISION, rbuf=0x92888c0, rcnts=0x47f6540, rdispls=0x47f4040, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1207723264

aborting job:
Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x9183600, scnts=0x45becc0, sdispls=0x47f7a00, MPI_DOUBLE_PRECISION, rbuf=0x92888c0, rcnts=0x47f6540, rdispls=0x47f4040, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1207723264
[INFO] MONC running with 1 processes, 1 IO server(s)
[WARN] No enabled configuration for component ideal_squall therefore disabling this
[WARN] No enabled configuration for component kid_testcase therefore disabling this
[WARN] Run order callback for component tank_experiments at stage initialisation not specified
[WARN] Run order callback for component tank_experiments at stage finalisation not specified
[WARN] Defaulting to one dimension decomposition due to solution size too small
[INFO] Decomposed 1 processes via 'OneDim' into z=1 y=1 x=1
[INFO] 3D system; z=65, y=512, x=2
srun: error: nid001139: task 1: Exited with exit code 255
srun: Terminating job step 59769.0
slurmstepd: error: *** STEP 59769.0 ON nid001139 CANCELLED AT 2020-12-15T15:22:06 ***
srun: error: nid001139: task 0: Terminated
srun: Force Terminated job step 59769.0

@leifdenby
Collaborator Author

I've tried compiling with fcm-make/comp-cray-2107-debug.cfg and using gdb4hpc to identify the issue. Within gdb4hpc I'm stuck since I don't get any output when trying to print local variables:

dbg all> launch --args="--config=tests/straka_short.mcf --checkpoint_file=checkpoint_files/straka_dump.nc" --launcher-args="--partition=standard --qos=standard --tasks-per-node=2 --exclusive --export=all" $monc{2} ./build/bin/monc_driver.exe
Starting application, please wait...
Creating MRNet communication network...
Waiting for debug servers to attach to MRNet communications network...
Timeout in 400 seconds. Please wait for the attach to complete.
Number of dbgsrvs connected: [1];  Timeout Counter: [0]
Number of dbgsrvs connected: [1];  Timeout Counter: [1]
Number of dbgsrvs connected: [2];  Timeout Counter: [0]
Finalizing setup...
Launch complete.
monc{0..1}: Initial breakpoint, monc_driver at /lus/cls01095/work/ta009/ta009/earlcd/git-repos/monc/preprocess/src/monc/monc_driver.F90:16
dbg all> break pencilfft.F90:360
...
dbg all> print source_data
monc{0}: *** The application is running
dbg all> print size(source_data)
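(Presumably the `*** The application is running` message means the ranks never actually stopped at the breakpoint, so something like the following would be needed first — an untested sketch:)

dbg all> continue
...
dbg all> print source_data(1)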

@sjboeing
Contributor

Hi @leifdenby: since these are MPI issues, I thought the changes that Chris applied for ARC4 may be worth exploring, in case you have not done so yet.

@leifdenby
Collaborator Author

Hi @leifdenby: since these are MPI issues, I thought the changes that Chris applied for ARC4 may be worth exploring, in case you have not done so yet.

Great idea @sjboeing! I'll give this a try

@leifdenby
Collaborator Author

Unfortunately the fixes introduced for ARC4 don't appear to have fixed the issue @sjboeing. But I have an idea of what the issue might be. I'll put my testing in separate comments below.

@leifdenby
Collaborator Author

leifdenby commented Jan 26, 2021

Compiling (optimised) with the Cray Fortran compiler and running

compiling
earlcd@uan01:~/work/monc> module restore PrgEnv-cray
Unloading cray-hdf5/1.12.0.2
Unloading cray-fftw/3.3.8.8
Unloading cray-netcdf/4.7.4.2
Unloading /usr/local/share/epcc-module/epcc-module-loader

Warning: Unloading the epcc-setup-env module will stop many
modules being available on the system. If you do this by
accident, you can recover the situation with the command:

        module load /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env

Unloading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
Unloading bolt/0.7
Unloading cray-libsci/20.10.1.2
Unloading cray-mpich/8.0.16
Unloading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta

Unloading perftools-base/20.10.0
  WARNING: Did not unuse /opt/cray/pe/perftools/20.10.0/modulefiles

Unloading cray-dsmml/0.1.2
Unloading craype-network-ofi
Unloading libfabric/1.11.0.0.233
Unloading craype-x86-rome
Unloading craype/2.7.2
Unloading gcc/10.1.0
Unloading cpe-gnu
Loading cpe-cray
Loading cce/10.0.4
Loading craype/2.7.2
Loading craype-x86-rome
Loading libfabric/1.11.0.0.233
Loading craype-network-ofi
Loading cray-dsmml/0.1.2
Loading perftools-base/20.10.0
Loading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta
Loading cray-mpich/8.0.16
Loading cray-libsci/20.10.1.2
Loading bolt/0.7
Loading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
Loading /usr/local/share/epcc-module/epcc-module-loader
earlcd@uan01:~/work/monc> module load cray-hdf5 cray-netcdf cray-fftw
earlcd@uan01:~/work/monc> fcm make -f fcm-make/monc-cray-cray.cfg 
[init] make                # 2021-01-26T12:16:55Z
[info] FCM 2019.05.0 (/home1/home/n02/n02/earlcd/fcm-2019.09.0)
[init] make config-parse   # 2021-01-26T12:16:55Z
[info] config-file=/lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/monc-cray-cray.cfg
[info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/comp-cray-2107.cfg
[info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/env-cray.cfg
[info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/monc-build.cfg
[done] make config-parse   # 0.1s
[init] make dest-init      # 2021-01-26T12:16:55Z
[info] dest=earlcd@uan01:/lus/cls01095/work/n02/n02/earlcd/monc
[info] mode=incremental
[done] make dest-init      # 0.1s
[init] make extract        # 2021-01-26T12:16:55Z
[info] location  monc: 0: /lus/cls01095/work/n02/n02/earlcd/monc
[info]   dest:  381 [U unchanged]
[info] source:  381 [U from base]
[done] make extract        # 7.7s
[init] make preprocess     # 2021-01-26T12:17:03Z
[info] sources: total=381, analysed=180, elapsed-time=0.2s, total-time=0.1s
[info] target-tree-analysis: elapsed-time=0.0s
[info] install   targets: modified=8, unchanged=0, failed=0, total-time=0.1s
[info] process   targets: modified=172, unchanged=0, failed=0, total-time=14.1s
[info] TOTAL     targets: modified=180, unchanged=0, failed=0, elapsed-time=14.3s
[done] make preprocess     # 14.7s
[init] make build          # 2021-01-26T12:17:17Z
[info] sources: total=381, analysed=381, elapsed-time=1.5s, total-time=1.4s
[info] target-tree-analysis: elapsed-time=0.3s
[info] compile   targets: modified=123, unchanged=0, failed=0, total-time=209.8s
[info] compile+  targets: modified=119, unchanged=0, failed=0, total-time=1.3s
[info] link      targets: modified=1, unchanged=0, failed=0, total-time=2.0s
[info] TOTAL     targets: modified=243, unchanged=0, failed=0, elapsed-time=213.7s
[done] make build          # 215.4s
[done] make                # 238.0s
running MONC
earlcd@uan01:~/work/monc> sbatch utils/archer2/submonc.slurm 
Submitted batch job 77481
output of SLURM log
earlcd@uan01:~/work/monc> cat slurm-77481.out 
Unloading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env-profile
Loading cpe-cray
Loading cce/10.0.4
Loading craype/2.7.2
Loading craype-x86-rome
Loading libfabric/1.11.0.0.233
Loading craype-network-ofi
Loading cray-dsmml/0.1.2
Loading perftools-base/20.10.0
Loading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta
Loading cray-mpich/8.0.16
Loading cray-libsci/20.10.1.2
Loading bolt/0.7
Loading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env

Loading epcc-job-env
  Loading requirement: bolt/0.7
Currently Loaded Modulefiles:
 1) cpe-cray                                                         
 2) cce/10.0.4(default)                                              
 3) craype/2.7.2(default)                                            
 4) craype-x86-rome                                                  
 5) libfabric/1.11.0.0.233(default)                                  
 6) craype-network-ofi                                               
 7) cray-dsmml/0.1.2(default)                                        
 8) perftools-base/20.10.0(default)                                  
 9) xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta(default)               
10) cray-mpich/8.0.16(default)                                       
11) cray-libsci/20.10.1.2(default)                                   
12) bolt/0.7                                                         
13) /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env  
14) epcc-job-env                                                     
[INFO] MONC running with 4 processes, 1 IO server(s)
[WARN] No enabled configuration for component ideal_squall therefore disabling this
[WARN] No enabled configuration for component kid_testcase therefore disabling this
[WARN] Run order callback for component tank_experiments at stage initialisation not specified
[WARN] Run order callback for component tank_experiments at stage finalisation not specified
[WARN] Defaulting to one dimension decomposition due to solution size too small
[INFO] Decomposed 4 processes via 'OneDim' into z=1 y=4 x=1
[INFO] 3D system; z=65, y=512, x=2
MPICH ERROR [Rank 1] [job id 77481.0] [Tue Jan 26 12:22:49 2021] [unknown] [nid001037] - Abort(403275522) (rank 1 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5cd2600, scnts=0x47fa200, sdispls=0x47f8d40, MPI_DOUBLE_PRECISION, rbuf=0x5d16bc0, rcnts=0x47f3400, rdispls=0x47f0200, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -404335872

aborting job:
Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5cd2600, scnts=0x47fa200, sdispls=0x47f8d40, MPI_DOUBLE_PRECISION, rbuf=0x5d16bc0, rcnts=0x47f3400, rdispls=0x47f0200, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -404335872
MPICH ERROR [Rank 2] [job id 77481.0] [Tue Jan 26 12:22:49 2021] [unknown] [nid001037] - Abort(622338) (rank 2 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5c6c940, scnts=0x45fda80, sdispls=0x45fd180, MPI_DOUBLE_PRECISION, rbuf=0x5cacf40, rcnts=0x45fc400, rdispls=0x46299c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1539026176

aborting job:
Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5c6c940, scnts=0x45fda80, sdispls=0x45fd180, MPI_DOUBLE_PRECISION, rbuf=0x5cacf40, rcnts=0x45fc400, rdispls=0x46299c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1539026176
MPICH ERROR [Rank 4] [job id 77481.0] [Tue Jan 26 12:22:49 2021] [unknown] [nid001037] - Abort(336166658) (rank 4 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5c3f440, scnts=0x45fda80, sdispls=0x45fd180, MPI_DOUBLE_PRECISION, rbuf=0x5c7fc80, rcnts=0x45fc400, rdispls=0x46299c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1178053888

aborting job:
Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5c3f440, scnts=0x45fda80, sdispls=0x45fd180, MPI_DOUBLE_PRECISION, rbuf=0x5c7fc80, rcnts=0x45fc400, rdispls=0x46299c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1178053888
srun: error: nid001037: tasks 1-2,4: Exited with exit code 255
srun: Terminating job step 77481.0
slurmstepd: error: *** STEP 77481.0 ON nid001037 CANCELLED AT 2021-01-26T12:22:49 ***
srun: error: nid001037: tasks 0,3: Terminated
srun: Force Terminated job step 77481.0

This run-time error suggests to me that the routine calculating the sizes of the buffers used in the MPI communication is doing the calculation incorrectly.
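To confirm this, a minimal guard could be dropped in just before the MPI_Alltoallv call in pencilfft.F90, along these lines (variable names here are my assumptions, not the actual MONC identifiers):

! sketch only: validate the count arrays before MPI_Alltoallv
do i = 1, size(recv_counts)
  if (send_counts(i) < 0 .or. recv_counts(i) < 0) then
    print *, "negative count at entry", i, ":", send_counts(i), recv_counts(i)
  end if
end do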

I should also note that when compiling with debug flags (using fcm-make/monc-cray-cray-debug.cfg, which does compile) the SLURM job simply aborts (it is not clear from the log why).

@leifdenby
Collaborator Author

Compiling with the GNU Fortran compiler and running

compiling
earlcd@uan01:~/work/monc> module restore PrgEnv-gnu
Unloading /usr/local/share/epcc-module/epcc-module-loader

Warning: Unloading the epcc-setup-env module will stop many
modules being available on the system. If you do this by
accident, you can recover the situation with the command:

        module load /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env

Unloading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
Unloading bolt/0.7
Unloading cray-libsci/20.10.1.2
Unloading cray-mpich/8.0.16
Unloading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta

Unloading perftools-base/20.10.0
  WARNING: Did not unuse /opt/cray/pe/perftools/20.10.0/modulefiles

Unloading cray-dsmml/0.1.2
Unloading craype-network-ofi
Unloading libfabric/1.11.0.0.233
Unloading craype-x86-rome
Unloading craype/2.7.2
Unloading cce/10.0.4
Unloading cpe-cray
Loading cpe-gnu
Loading gcc/10.1.0
Loading craype/2.7.2
Loading craype-x86-rome
Loading libfabric/1.11.0.0.233
Loading craype-network-ofi
Loading cray-dsmml/0.1.2
Loading perftools-base/20.10.0
Loading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta
Loading cray-mpich/8.0.16
Loading cray-libsci/20.10.1.2
Loading bolt/0.7
Loading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
Loading /usr/local/share/epcc-module/epcc-module-loader
earlcd@uan01:~/work/monc> module load cray-hdf5 cray-netcdf cray-fftw
earlcd@uan01:~/work/monc> ftn --version
GNU Fortran (GCC) 10.1.0 20200507 (Cray Inc.)
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

earlcd@uan01:~/work/monc> fcm make -f fcm-make/monc-cray-gnu.cfg 
[init] make                # 2021-01-26T12:30:49Z
[info] FCM 2019.05.0 (/home1/home/n02/n02/earlcd/fcm-2019.09.0)
[init] make config-parse   # 2021-01-26T12:30:49Z
[info] config-file=/lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/monc-cray-gnu.cfg
[info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/comp-gnu-4.4.7.cfg
[info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/env-cray.cfg
[info] config-file= - /lus/cls01095/work/n02/n02/earlcd/monc/fcm-make/monc-build.cfg
[done] make config-parse   # 0.1s
[init] make dest-init      # 2021-01-26T12:30:49Z
[info] dest=earlcd@uan01:/lus/cls01095/work/n02/n02/earlcd/monc
[info] mode=incremental
[done] make dest-init      # 0.1s
[init] make extract        # 2021-01-26T12:30:49Z
[info] location  monc: 0: /lus/cls01095/work/n02/n02/earlcd/monc
[info]   dest:  381 [U unchanged]
[info] source:  381 [U from base]
[done] make extract        # 0.8s
[init] make preprocess     # 2021-01-26T12:30:50Z
[info] sources: total=381, analysed=0, elapsed-time=1.2s, total-time=0.0s
[info] target-tree-analysis: elapsed-time=0.0s
[info] install   targets: modified=0, unchanged=8, failed=0, total-time=0.0s
[info] process   targets: modified=0, unchanged=172, failed=0, total-time=0.0s
[info] TOTAL     targets: modified=0, unchanged=180, failed=0, elapsed-time=2.2s
[done] make preprocess     # 5.6s
[init] make build          # 2021-01-26T12:30:56Z
[info] sources: total=381, analysed=0, elapsed-time=0.1s, total-time=0.0s
[info] target-tree-analysis: elapsed-time=0.2s
[FAIL] ftn -oo/conditional_diagnostics_whole_mod.o -c -I./include -I/opt/cray/pe/netcdf/4.7.4.2/GNU/9.1/include -I/opt/cray/pe/fftw/3.3.8.8/x86_rome/lib/../include -O3 -J . /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/components/conditional_diagnostics_whole/src/conditional_diagnostics_whole.F90 # rc=1
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/components/conditional_diagnostics_whole/src/conditional_diagnostics_whole.F90:80:22:
[FAIL] 
[FAIL]    77 |       call mpi_reduce(MPI_IN_PLACE , CondDiags_tot, ncond*2*ndiag*current_state%local_grid%size(Z_INDEX), &
[FAIL]       |                      2
[FAIL] ......
[FAIL]    80 |       call mpi_reduce(CondDiags_tot, CondDiags_tot, ncond*2*ndiag*current_state%local_grid%size(Z_INDEX), &
[FAIL]       |                      1
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (REAL(8)/INTEGER(4)).
[FAIL] compile    0.1 ! conditional_diagnostics_whole_mod.o <- monc/components/conditional_diagnostics_whole/src/conditional_diagnostics_whole.F90
[FAIL] ftn -oo/iterativesolver_mod.o -c -I./include -I/opt/cray/pe/netcdf/4.7.4.2/GNU/9.1/include -I/opt/cray/pe/fftw/3.3.8.8/x86_rome/lib/../include -O3 -J . /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/components/iterativesolver/src/iterativesolver.F90 # rc=1
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/components/iterativesolver/src/iterativesolver.F90:540:23:
[FAIL] 
[FAIL]   540 |     call mpi_allreduce(local_sum, global_sum, 3, PRECISION_TYPE, MPI_SUM, current_state%parallel%monc_communicator, ierr)
[FAIL]       |                       1
[FAIL] ......
[FAIL]   600 |     call mpi_allreduce(current_state%local_divmax, current_state%global_divmax, 1, PRECISION_TYPE, MPI_MAX, &
[FAIL]       |                       2
[FAIL] Error: Rank mismatch between actual argument at (1) and actual argument at (2) (scalar and rank-1)
[FAIL] compile    0.1 ! iterativesolver_mod.o <- monc/components/iterativesolver/src/iterativesolver.F90
[FAIL] ftn -oo/monc_mod.o -c -I./include -I/opt/cray/pe/netcdf/4.7.4.2/GNU/9.1/include -I/opt/cray/pe/fftw/3.3.8.8/x86_rome/lib/../include -O3 -J . -frecursive /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/model_core/src/monc.F90 # rc=1
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/model_core/src/monc.F90:207:60:
[FAIL] 
[FAIL]   207 |     call mpi_barrier(state%parallel%monc_communicator, ierr)
[FAIL]       |                                                            1
[FAIL] Error: More actual than formal arguments in procedure call at (1)
[FAIL] compile    0.1 ! monc_mod.o           <- monc/model_core/src/monc.F90
[FAIL] ftn -oo/io_server_client_mod.o -c -I./include -I/opt/cray/pe/netcdf/4.7.4.2/GNU/9.1/include -I/opt/cray/pe/fftw/3.3.8.8/x86_rome/lib/../include -O3 -J . -frecursive /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90 # rc=1
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:168:25:
[FAIL] 
[FAIL]   163 |     call mpi_get_address(basic_type, base_addr, ierr)
[FAIL]       |                         2
[FAIL] ......
[FAIL]   168 |     call mpi_get_address(basic_type%dimensions, num_addr, ierr)
[FAIL]       |                         1
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:173:25:
[FAIL] 
[FAIL]   163 |     call mpi_get_address(basic_type, base_addr, ierr)
[FAIL]       |                         2
[FAIL] ......
[FAIL]   173 |     call mpi_get_address(basic_type%dim_sizes, num_addr, ierr)
[FAIL]       |                         1
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:124:25:
[FAIL] 
[FAIL]   124 |     call mpi_get_address(basic_type, base_addr, ierr)
[FAIL]       |                         1
[FAIL] ......
[FAIL]   163 |     call mpi_get_address(basic_type, base_addr, ierr)
[FAIL]       |                         2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (TYPE(field_description_type)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:129:25:
[FAIL] 
[FAIL]   129 |     call mpi_get_address(basic_type%field_name, num_addr, ierr)
[FAIL]       |                         1
[FAIL] ......
[FAIL]   163 |     call mpi_get_address(basic_type, base_addr, ierr)
[FAIL]       |                         2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (CHARACTER(150)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:134:25:
[FAIL] 
[FAIL]   134 |     call mpi_get_address(basic_type%field_type, num_addr, ierr)
[FAIL]       |                         1
[FAIL] ......
[FAIL]   163 |     call mpi_get_address(basic_type, base_addr, ierr)
[FAIL]       |                         2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:139:25:
[FAIL] 
[FAIL]   139 |     call mpi_get_address(basic_type%data_type, num_addr, ierr)
[FAIL]       |                         1
[FAIL] ......
[FAIL]   163 |     call mpi_get_address(basic_type, base_addr, ierr)
[FAIL]       |                         2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:144:25:
[FAIL] 
[FAIL]   144 |     call mpi_get_address(basic_type%optional, num_addr, ierr)
[FAIL]       |                         1
[FAIL] ......
[FAIL]   163 |     call mpi_get_address(basic_type, base_addr, ierr)
[FAIL]       |                         2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (LOGICAL(4)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:92:25:
[FAIL] 
[FAIL]    92 |     call mpi_get_address(basic_type, base_addr, ierr)
[FAIL]       |                         1
[FAIL] ......
[FAIL]   163 |     call mpi_get_address(basic_type, base_addr, ierr)
[FAIL]       |                         2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (TYPE(definition_description_type)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:97:25:
[FAIL] 
[FAIL]    97 |     call mpi_get_address(basic_type%send_on_terminate, num_addr, ierr)
[FAIL]       |                         1
[FAIL] ......
[FAIL]   163 |     call mpi_get_address(basic_type, base_addr, ierr)
[FAIL]       |                         2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (LOGICAL(4)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:102:25:
[FAIL] 
[FAIL]   102 |     call mpi_get_address(basic_type%number_fields, num_addr, ierr)
[FAIL]       |                         1
[FAIL] ......
[FAIL]   163 |     call mpi_get_address(basic_type, base_addr, ierr)
[FAIL]       |                         2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)).
[FAIL] /lus/cls01095/work/n02/n02/earlcd/monc/preprocess/src/monc/io/src/ioclient.F90:107:25:
[FAIL] 
[FAIL]   107 |     call mpi_get_address(basic_type%frequency, num_addr, ierr)
[FAIL]       |                         1
[FAIL] ......
[FAIL]   163 |     call mpi_get_address(basic_type, base_addr, ierr)
[FAIL]       |                         2
[FAIL] Error: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/TYPE(data_sizing_description_type)).
[FAIL] compile    0.1 ! io_server_client_mod.o <- monc/io/src/ioclient.F90
[info] compile   targets: modified=0, unchanged=91, failed=4, total-time=0.3s
[info] compile+  targets: modified=0, unchanged=88, failed=0, total-time=0.0s
[info] TOTAL     targets: modified=0, unchanged=179, failed=8, elapsed-time=2.0s
[FAIL] ! conditional_diagnostics_whole_mod.mod: depends on failed target: conditional_diagnostics_whole_mod.o
[FAIL] ! conditional_diagnostics_whole_mod.o: update task failed
[FAIL] ! io_server_client_mod.mod: depends on failed target: io_server_client_mod.o
[FAIL] ! io_server_client_mod.o: update task failed
[FAIL] ! iterativesolver_mod.mod: depends on failed target: iterativesolver_mod.o
[FAIL] ! iterativesolver_mod.o: update task failed
[FAIL] ! monc_mod.mod        : depends on failed target: monc_mod.o
[FAIL] ! monc_mod.o          : update task failed
[FAIL] make build          # 2.3s
[FAIL] make                # 8.9s

does not compile

With the GNU compiler MONC fails to compile. All the errors are related to incorrect datatypes being passed to MPI-related subroutines (as far as I can see). I think these are bugs, and fixing them may resolve the issue we are having at runtime. It is possible that the GNU compiler is simply stricter here and catches these bugs at compile-time.
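For example, the first failure above is the classic MPI_IN_PLACE reduction pattern, which is legal MPI but which GCC 10 rejects under the implicit mpif.h interfaces because the two call sites pass different actual-argument types. Schematically (a sketch using the names from the error output, with the count abbreviated to n):

! MPI_IN_PLACE is an INTEGER constant in mpif.h, hence the
! REAL(8)/INTEGER(4) mismatch GCC reports between the two calls
if (my_rank == 0) then
  call mpi_reduce(MPI_IN_PLACE, CondDiags_tot, n, MPI_DOUBLE_PRECISION, MPI_SUM, 0, comm, ierr)
else
  call mpi_reduce(CondDiags_tot, CondDiags_tot, n, MPI_DOUBLE_PRECISION, MPI_SUM, 0, comm, ierr)
end if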

Thoughts, @sjboeing?

@leifdenby
Collaborator Author

Digging a little further, I've added some print statements. It appears that the counts passed to MPI_Alltoallv are calculated incorrectly (I'm compiling with Cray Fortran again here).
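(The statements are of this form — a sketch with assumed array names; the `3*4096`-style entries in the output are list-directed output compressing repeated values:)

print *, "debug send_sizes", send_sizes
print *, "debug recv_sizes", recv_sizes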

earlcd@uan01:~/work/monc> cat slurm-77539.out 
Unloading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env-profile
Loading cpe-cray
Loading cce/10.0.4
Loading craype/2.7.2
Loading craype-x86-rome
Loading libfabric/1.11.0.0.233
Loading craype-network-ofi
Loading cray-dsmml/0.1.2
Loading perftools-base/20.10.0
Loading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta
Loading cray-mpich/8.0.16
Loading cray-libsci/20.10.1.2
Loading bolt/0.7
Loading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env

Loading epcc-job-env
  Loading requirement: bolt/0.7
Currently Loaded Modulefiles:
 1) cpe-cray                                                         
 2) cce/10.0.4(default)                                              
 3) craype/2.7.2(default)                                            
 4) craype-x86-rome                                                  
 5) libfabric/1.11.0.0.233(default)                                  
 6) craype-network-ofi                                               
 7) cray-dsmml/0.1.2(default)                                        
 8) perftools-base/20.10.0(default)                                  
 9) xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta(default)               
10) cray-mpich/8.0.16(default)                                       
11) cray-libsci/20.10.1.2(default)                                   
12) bolt/0.7                                                         
13) /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env  
14) epcc-job-env                                                     
MPICH ERROR [Rank 2] [job id 77539.0] [Tue Jan 26 13:07:48 2021] [unknown] [nid001199] - Abort(1007255298) (rank 2 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5ce63c0, scnts=0x45fba80, sdispls=0x45fb180, MPI_DOUBLE_PRECISION, rbuf=0x5d26a00, rcnts=0x45fa400, rdispls=0x46279c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -919842048

aborting job:
Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5ce63c0, scnts=0x45fba80, sdispls=0x45fb180, MPI_DOUBLE_PRECISION, rbuf=0x5d26a00, rcnts=0x45fa400, rdispls=0x46279c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -919842048
 debug send_sizes 4352,  3*4096
 debug recv_sizes 4*4096
 debug send_sizes 16448
 debug recv_sizes 16448
 debug send_sizes 32896
 debug recv_sizes -919842048
MPICH ERROR [Rank 1] [job id 77539.0] [Tue Jan 26 13:07:48 2021] [unknown] [nid001199] - Abort(1007255298) (rank 1 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x5ca16c0, scnts=0x47f8200, sdispls=0x47f6d40, MPI_DOUBLE_PRECISION, rbuf=0x5ce5cc0, rcnts=0x47f1400, rdispls=0x47ee200, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1300475136

I've gotten as far as identifying that determine_offsets_from_size (also described in the compiled MONC docs) is responsible for computing these offsets. I think the next step will be to work out why this subroutine isn't calculating positive values (as it should).
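For reference, the usual shape of such a routine is a cumulative sum over the per-rank sizes, something like the sketch below (a guess at the logic, not the actual determine_offsets_from_size code). A garbage negative value of the magnitude seen above usually points at an uninitialised or out-of-bounds sizes array rather than at the arithmetic itself:

! sketch: offsets as a cumulative sum of per-rank sizes (assumed names)
offsets(1) = 0
do i = 2, number_of_processes
  offsets(i) = offsets(i-1) + sizes(i-1)
end do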

@sjboeing
Contributor

Hi @leifdenby, this looks like it is not really trivial. Two things you may try are:

  1. Disable the ioserver component, just to see if the issue has to do with it (just use enable_io_server=.false.?).
  2. Run a BOMEX case. I think the straka case is effectively 2D, and may not be as well accommodated as runs on a 3D domain.

@leifdenby
Collaborator Author

leifdenby commented Jan 26, 2021

Thanks for the suggestions @sjboeing, I've tried both and the issue persists.

with `enable_io_server=.false.`
aborting job:
Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x570ab40, scnts=0x4611440, sdispls=0x45fa740, MPI_DOUBLE_PRECISION, rbuf=0x573f040, rcnts=0x45f9e40, rdispls=0x45f90c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1112427776
 debug send_sizes 5*2652
 debug recv_sizes 2*2678,  3*2652
 debug send_sizes 13364
 debug recv_sizes 13364
 debug send_sizes 26728
 debug recv_sizes -1090014464
MPICH ERROR [Rank 4] [job id 78435.0] [Tue Jan 26 21:33:32 2021] [unknown] [nid001010] - Abort(940146434) (rank 4 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x570a080, scnts=0x4611440, sdispls=0x45fa740, MPI_DOUBLE_PRECISION, rbuf=0x573e580, rcnts=0x45f9e40, rdispls=0x45f90c0, datatype=MPI_DOUBLE_PRECISION, comm=MPI_COMM_SELF) failed
PMPI_Alltoallv(351): Negative count, value is -1090014464

and

using the `testcases/shallow_convection/bomex.mcf` configuration
aborting job:
Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x9961800, scnts=0x4606440, sdispls=0x4605b40, MPI_DOUBLE_PRECISION, rbuf=0x9a03480, rcnts=0x47c1c80, rdispls=0x47bf780, datatype=MPI_DOUBLE_PRECISION, comm=comm=0xc4000000) failed
PMPI_Alltoallv(351): Negative count, value is -746590336
 debug send_sizes 2*38912
 debug recv_sizes 2*38912
 debug send_sizes 2*40128
 debug recv_sizes 2*40128
 debug send_sizes 2*41382
 debug recv_sizes -2011828352,  0
MPICH ERROR [Rank 4] [job id 78459.0] [Tue Jan 26 22:11:40 2021] [unknown] [nid001010] - Abort(134840066) (rank 4 in comm 0): Fatal error in PMPI_Alltoallv: Invalid count, error stack:
PMPI_Alltoallv(410): MPI_Alltoallv(sbuf=0x99602c0, scnts=0x4606440, sdispls=0x4605b40, MPI_DOUBLE_PRECISION, rbuf=0x9a01f40, rcnts=0x47c1c80, rdispls=0x47bf780, datatype=MPI_DOUBLE_PRECISION, comm=comm=0x84000007) failed
PMPI_Alltoallv(351): Negative count, value is -2011828352

@MarkUoLeeds
Collaborator

Hi @leifdenby, just wondering what choice of moncs_per_io is set. Is this a version that still uses FFTW rather than the new MOSRS HEAD that uses FFTE? The earlier output seemed to indicate fft_pencil.
Note ARCHER2 has 128 cores per node and 8 NUMA regions per node, so it is best to have one IO server per NUMA region: that is 15 moncs per IO server. [Sorry if you already knew that; I did not look at your case.] Alternatively use 63 moncs per IO server so that there is one IO server per socket.
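In other words:

128 cores/node / 8 NUMA regions = 16 cores per region -> moncs_per_io=15 (15 MONCs + 1 IO server fill a region)
128 cores/node / 2 sockets = 64 cores per socket -> moncs_per_io=63 (63 MONCs + 1 IO server fill a socket)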

@MarkUoLeeds
Collaborator

I just re-read some of your compile woes. NOTE: GCC 10 does not like the fact that MPI calls can use any data type, so we have to apply "-fallow-argument-mismatch"; I found that back in November and the ARCHER2 team added it to the documentation on building: https://docs.archer2.ac.uk/user-guide/dev-environment/
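In an fcm-make config this would presumably be appended to the Fortran compiler flags, e.g. (a sketch; the exact property line in MONC's comp-gnu config may differ):

# in fcm-make/comp-gnu-4.4.7.cfg (sketch, assumed flags line)
build.prop{fc.flags} = -O3 -fallow-argument-mismatch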

@MarkUoLeeds
Collaborator

Did you solve this yet? I see the only ref to mpi_alltoallv appears to be in components/fftsolver/src/pencilfft.F90
I was (am) working with the MOSRS code at r8166 where Adrian seems to make most of his branches start.
Perhaps I should turn my attention to this repo.

@leifdenby
Collaborator Author

Did you solve this yet? I see the only ref to mpi_alltoallv appears to be in components/fftsolver/src/pencilfft.F90
I was (am) working with the MOSRS code at r8166 where Adrian seems to make most of his branches start.
Perhaps I should turn my attention to this repo.

I haven't, no 😢 The farthest I've gotten is producing a branch (see #38) which contains all the commits that Adrian has made on MOSRS while working on ARCHER2 fixes. As you know, this branch includes a lot of changes and, as it stands, also reverses the changes Chris recently made for ARC4.

I am going to try and cherry-pick just the first four commits and see if that helps with running on ARCHER2.
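(i.e. something like the following, with placeholder refs rather than the actual SHAs:)

# sketch: apply just the first four of Adrian's commits
git cherry-pick <first-commit>^..<fourth-commit>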

I just re-read some of your compile woes. NOTE: GCC 10 does not like the fact that MPI calls can use any data type, so we have to apply "-fallow-argument-mismatch"; I found that back in November and the ARCHER2 team added it to the documentation on building: https://docs.archer2.ac.uk/user-guide/dev-environment/

Thank you for suggesting this. I'll try adding that compilation flag.

just wondering what choice of moncs_per_io is set. Is this a version that still uses FFTW rather than the new MOSRS HEAD that uses FFTE? The earlier output seemed to indicate fft_pencil.
Note ARCHER2 has 128 cores per node and 8 NUMA regions per node, so it is best to have one IO server per NUMA region: that is 15 moncs per IO server. [Sorry if you already knew that; I did not look at your case.] Alternatively use 63 moncs per IO server so that there is one IO server per socket.

Thank you for suggesting this. I'm not quite sure how to work this out. If you check my run log above you'll see:

[INFO] MONC running with 4 processes, 1 IO server(s)

My moncs_per_io=3 (https://github.com/leifdenby/monc/blob/archer2-compilation/tests/straka_short.mcf#L38) and I think I'm requesting 5 cores in my job (https://github.com/leifdenby/monc/blob/archer2-compilation/utils/archer2/submonc.slurm#L7)

Does that sound reasonable or am I doing something obviously stupid?

@MarkUoLeeds
Collaborator

The job looks poorly specified. If you want to run a total of 4 MPI tasks (i.e. 1 IO and 3 MONC) then tasks-per-node should also be 4, but then you might choose to spread them out, unless this is just a really basic job and you are happy for all tasks to sit in the same NUMA region. When setting up a proper job, consider cpus-per-task if you have fewer than 128 tasks on one node.
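A sketch of what that might look like in the SLURM header (illustrative values, not taken from the actual submonc.slurm):

#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --tasks-per-node=4
#SBATCH --cpus-per-task=32   # spread the 4 tasks across the node's NUMA regions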

@cemac-ccs
Collaborator

Per Ralph's notes, and in agreement with @MarkUoLeeds, it seems that the Leeds branch works with GNU 9.3.0 but not the default GNU 10. With my module load order as

export PATH=$PATH:/work/y07/shared/umshared/bin
export PATH=$PATH:/work/y07/shared/umshared/software/bin
. mosrs-setup-gpg-agent

module restore PrgEnv-cray
module load cpe-gnu
module load gcc/9.3.0
module load cray-netcdf-hdf5parallel
module load cray-hdf5-parallel
module load cray-fftw/3.3.8.7
module load petsc/3.13.3

this seems the best option for compiling with GNU using fcm make -j4 -f fcm-make/monc-cray-gnu.cfg. The PATH additions pull in the installed version of fcm and allow you to cache your MOSRS password with the . mosrs-setup-gpg-agent command (needed when getting CASIM and SOCRATES from MOSRS).

Still trying out the Cray compiler so I can't comment on that, and I'm waiting on the test job to run, but I thought I'd mention this.

@leifdenby
Collaborator Author

Thanks for the help here. I think we can close this now that @cemac-ccs is working on a pull request for ARCHER2: #45

@leifdenby leifdenby closed this Apr 23, 2021