LAMMPS parallel MD simulations crash soon after starting [multi-node + CUDA-aware MPI] #161

@turbosonics

I'm trying to run a series of high-temperature (1500 K to 3000 K) 1 atm NPT and high-temperature NVT simulations with a 1 fs timestep, using the pre-trained SevenNet model (July 2024). But the parallel simulations are too unstable.

The exact same simulation runs with the serial version without any issues: no crashes, no errors. But if I submit the same job with e3gnn/parallel on 2 GPUs, the MD becomes very unstable and crashes soon after it begins.

I just compiled SevenNet and LAMMPS with SevenNet, so both are version 0.10.3. The OpenMPI on our local cluster is CUDA-aware, and I used CUDA 11.8 with prebuilt PyTorch 2.4.0.

The geometry is not that large: it contains just 8400 atoms (it is just silica), and the crash is not an OOM.

The very first crash I hit with high-temperature 1 atm NPT was:

ERROR on proc 0: Too many neighbor bins (src/nbin_standard.cpp:213)
Last command: run             100000

The log file printed only the information for step 0 and then the run crashed immediately.

I have the following lines in my LAMMPS input script:

neighbor        1 bin
neigh_modify    every 10 delay 0 check no

When I remove those lines and resubmit the same job, I get:

ERROR: Non-numeric pressure - simulation unstable (src/fix_nh.cpp:1049)
Last command: run             100000

Then I switched to NVT at the same temperature and got:

ERROR: Pair e3gnn requires consecutive atom IDs (src/pair_e3gnn_parallel.cpp:204)
Last command: run             100000
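
One thing that can be ruled out quickly is the data file itself. The snippet below is only a minimal sketch: it assumes the structure is read from a standard LAMMPS data file with an `Atoms` section whose first column is the atom ID, and it assumes that "consecutive" means the IDs cover 1..N with no gaps or duplicates. The function names are hypothetical helpers, not part of SevenNet or LAMMPS.

```python
def parse_atom_ids(data_text):
    """Return the atom IDs found in the Atoms section of a LAMMPS data file.

    Assumes the usual layout: an 'Atoms' header line, a blank line,
    one row per atom starting with the integer atom ID, then a blank line.
    """
    ids = []
    in_atoms = False
    seen_row = False
    for raw in data_text.splitlines():
        # Drop trailing comments such as "Atoms # atomic"
        line = raw.split("#", 1)[0].strip()
        if not in_atoms:
            if raw.strip().startswith("Atoms"):
                in_atoms = True
            continue
        if not line:
            if seen_row:
                break  # blank line after the atom rows ends the section
            continue   # blank line between the header and the first row
        first = line.split()[0]
        if not first.isdigit():
            break      # reached the next section header
        ids.append(int(first))
        seen_row = True
    return ids


def ids_are_consecutive(ids):
    """True if the IDs are exactly 1..N in any order (no gaps, no duplicates)."""
    return sorted(ids) == list(range(1, len(ids) + 1))
```

To use it, read the data file into a string, for example `open("silica.data").read()` (a hypothetical file name), and pass the result of `parse_atom_ids` to `ids_are_consecutive`. If this returns True while the serial run also works, the ID error in the parallel run is less likely to come from the input file.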

I tried a 0.5 fs timestep, but the non-numeric error appeared. Yet the serial run with 1 fs runs smoothly. All crashes happened within 30 seconds of submitting the job.

I just want to know where the problem comes from. The pre-trained model and my LAMMPS input script should be fine, because the serial MD on 1 GPU has been running well for more than 12 hours. So this would be a problem with SevenNet parallel, my installation, or the local cluster...

Has anyone tried high-temperature NPT and NVT with LAMMPS SevenNet on 2+ GPUs for systems of 8k to 10k atoms or more? Has anyone faced a similar problem?
