Since every shared memory allocation in MPI-PR opens a memory-mapped file for each rank, and each rank needs to know the file descriptors of all the ranks on the same node, you end up with (procs per node)^2 file descriptors opened for every shared memory allocation.
Since 128-core hardware is becoming commonplace and 128*128 = 16K descriptors are then needed per allocation, a run with a few hundred live allocations already needs millions of descriptors per node; we have already seen reports of Global Arrays runs that required raising the kernel limit /proc/sys/fs/file-max to values of O(10^6)-O(10^7).
https://groups.google.com/g/nwchem-forum/c/Q-qvcHP9vP4
nwchemgit/nwchem#338
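For illustration, here is a minimal, hypothetical C/MPI sketch of the pattern described above. It is not GA's actual code path; the shm names, sizes, and the direct use of shm_open are assumptions made for the example. Each on-node rank creates one segment per allocation and opens every peer's segment, so the per-node fd count grows as ppn^2 per allocation.

```c
/* Hypothetical sketch (not GA's actual code) of the fd pattern:
 * for a single shared-memory allocation, every rank on a node
 * creates one POSIX shm file and then opens the file of every
 * other on-node rank, so one allocation costs ppn fds per rank
 * and ppn*ppn fds per node. */
#include <fcntl.h>
#include <mpi.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);
    int me, ppn;
    MPI_Comm_rank(node, &me);
    MPI_Comm_size(node, &ppn);

    /* Each rank backs its slice of the allocation with its own file. */
    char mine[64], peer[64];
    snprintf(mine, sizeof mine, "/demo_alloc0_rank%d", me);
    int fd = shm_open(mine, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 1 << 20);
    MPI_Barrier(node);

    /* ...and opens every other rank's file to map its slice, keeping
     * all ppn descriptors open for the lifetime of the allocation. */
    for (int r = 0; r < ppn; r++) {
        if (r == me) continue;
        snprintf(peer, sizeof peer, "/demo_alloc0_rank%d", r);
        int pfd = shm_open(peer, O_RDWR, 0600);
        (void)pfd; /* would be mmap'ed here; left open on purpose */
    }

    MPI_Barrier(node);
    shm_unlink(mine);
    close(fd);
    MPI_Finalize();
    return 0;
}
```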
Can we try to address this from the GA side?
Possible solutions that come to mind (I have no idea about their feasibility):
- Disable shared memory
- Split the physical node into smaller "virtual nodes" (possibly aligned with NUMA or socket domains); see the sketch after this list
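A hypothetical sketch of the second option, assuming a plain modular split of the on-node communicator. The `VNODE_SIZE` constant and the `make_virtual_node` helper are made up for the example; a real implementation would presumably align the groups with NUMA or socket domains (e.g. via hwloc) rather than splitting by rank order.

```c
/* Hypothetical sketch: carve the on-node communicator into smaller
 * "virtual nodes" of at most VNODE_SIZE ranks each, so shared-memory
 * fd sharing is confined to each sub-group. */
#include <mpi.h>

#define VNODE_SIZE 32  /* assumed target ranks per virtual node */

/* Returns a communicator covering this rank's virtual node. */
MPI_Comm make_virtual_node(MPI_Comm world) {
    MPI_Comm node, vnode;
    int node_rank;

    /* All ranks sharing physical memory land in 'node'. */
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);
    MPI_Comm_rank(node, &node_rank);

    /* Split into consecutive blocks of VNODE_SIZE ranks; fds would
     * then scale as VNODE_SIZE^2 per group instead of ppn^2. */
    MPI_Comm_split(node, node_rank / VNODE_SIZE, node_rank, &vnode);

    MPI_Comm_free(&node);
    return vnode;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Comm vnode = make_virtual_node(MPI_COMM_WORLD);
    /* ...allocate shared memory within 'vnode' instead of the full node... */
    MPI_Comm_free(&vnode);
    MPI_Finalize();
    return 0;
}
```

With VNODE_SIZE = 32 on a 128-core node, the per-node fd count per allocation would drop from 128^2 = 16384 to 4 * 32^2 = 4096, at the cost of shared memory no longer spanning the whole node.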