
Explicit communication inside models running with domain decomposition #82

@Luthaf

Description


When running metatomic models inside simulation engines with domain decomposition (currently mainly LAMMPS, more to come), it would be useful to give the model access to an explicit communication API. This would enable at least two things:

  • using the communication for message passing in message-passing models (MPNN), in turn reducing the amount of work duplicated from one domain to another and allowing the model to run with a smaller interaction cutoff.
  • using the communication to enable domain decomposition for long-range models.

For the MPNN use case, MACE already does something similar in the ML-IAP interface of LAMMPS, using LAMMPS primitives to send/receive node features to/from other domains: https://github.com/ACEsuit/mace/blob/42cd495b9c8e155441c405fef842a8ba5107c2f8/mace/tools/utils.py#L152-L167.

We should do something similar, but not limited to LAMMPS, providing models with the same set of communication tools regardless of the simulation engine.

How it could work

metatomic would provide a set of custom TorchScript operations (name to be decided; communicate is used here) that the model can call. The engine would then give metatomic a set of corresponding function pointers that implement these operations using the engine's communication primitives.

model                                                               engine

communicate(...)    ==>    metatomic shim     ==>     communicate_impl(...)
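
As a rough illustration, here is a minimal Python-level sketch of that indirection. The names (communicate, register_communicate_impl) and the pure-Python registry are placeholders only; the real thing would be a custom TorchScript operator whose implementation the engine swaps in.

from typing import Callable, Optional

import torch

# function pointer provided by the engine at load time; stays None when the
# engine does not do domain decomposition
_COMMUNICATE_IMPL: Optional[Callable[[torch.Tensor, torch.Tensor], None]] = None


def register_communicate_impl(impl: Callable[[torch.Tensor, torch.Tensor], None]) -> None:
    """Called once by the engine to plug in its own communication primitive."""
    global _COMMUNICATE_IMPL
    _COMMUNICATE_IMPL = impl


def communicate(features: torch.Tensor, ghost_features: torch.Tensor) -> None:
    """Exchange per-atom data across domains.

    Sends the locally-owned rows of `features` to the domains that see them as
    ghosts, and fills `ghost_features` with the data received for our own
    ghosts. Without a registered implementation this is a no-op.
    """
    if _COMMUNICATE_IMPL is not None:
        _COMMUNICATE_IMPL(features, ghost_features)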

Taking inspiration from ML-IAP, the API could look something like this:

# features is `n_atoms x n_features`

pre_allocated = torch.zeros_like(features)
communicate(features, pre_allocated)

# we receive the information about our ghosts (i.e. atoms owned by other domains) in the `pre_allocated` array
# other domains get the information about their ghosts from `features`

If the engine does not support domain decomposition, the communicate function would do nothing.
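
To make the MPNN use case more concrete, here is a hedged sketch of how a message-passing model could interleave communicate calls between interaction layers. The function name, the layer interface, and the neighbor-list layout are made up for illustration, and it assumes the communicate signature sketched above (locally-owned atoms first, ghosts appended after them).

import torch

def mpnn_forward(
    local_features: torch.Tensor,  # n_local x n_features, atoms owned by this domain
    n_ghosts: int,                 # number of ghost atoms received from other domains
    neighbors: torch.Tensor,       # (n_pairs, 2) indices; column 1 may point into the ghosts
    layers: torch.nn.ModuleList,   # each hypothetical layer maps 2 * n_features -> n_features
) -> torch.Tensor:
    features = local_features
    for layer in layers:
        # refresh the ghost features before every message-passing step
        ghosts = features.new_zeros(n_ghosts, features.shape[1])
        communicate(features, ghosts)

        everything = torch.cat([features, ghosts], dim=0)
        messages = everything[neighbors[:, 1]]

        # accumulate messages on the locally-owned centers only
        summed = torch.zeros_like(features)
        summed.index_add_(0, neighbors[:, 0], messages)

        features = layer(torch.cat([features, summed], dim=1))
    return features

Each domain only ever computes features for the atoms it owns; everything it needs to know about its ghosts arrives through communicate, which is what would allow running with a smaller interaction cutoff.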

We should be able to make everything work during backward propagation, for the calculation of forces and other gradients. In LAMMPS, communicate would map to forward_comm in the forward pass and to reverse_comm during backward propagation (https://docs.lammps.org/Developer_comm_ops.html)
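
As a sketch only (again with made-up names), that pairing could be expressed as a custom autograd function: forward uses the forward communication, and backward uses a hypothetical reverse_communicate that sums the gradients accumulated on ghost atoms back onto the domain that owns them.

import torch

class CommunicateOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, features: torch.Tensor, ghosts: torch.Tensor) -> torch.Tensor:
        ctx.n_local = features.shape[0]
        ctx.mark_dirty(ghosts)
        # forward_comm: our local features go out, our ghost rows get filled in
        communicate(features, ghosts)
        return ghosts

    @staticmethod
    def backward(ctx, grad_ghosts: torch.Tensor):
        # reverse_comm: gradients written on ghost atoms belong to atoms owned
        # by other domains; send them back and sum them onto the owners
        grad_features = grad_ghosts.new_zeros(ctx.n_local, grad_ghosts.shape[1])
        reverse_communicate(grad_ghosts, grad_features)  # hypothetical reverse op
        return grad_features, None

The model would then call CommunicateOp.apply(features, pre_allocated) instead of communicate directly, so that the reverse communication happens automatically when forces are computed.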

Unresolved questions

  • Can this API be used for long-range models as well, or do we need more?
  • Can such an API perform direct GPU-to-GPU communication?
    • when running in LAMMPS-Kokkos
    • when running in LAMMPS without Kokkos, but still running the model on GPU
    • when running in GROMACS/OpenMM
  • Can we implement this kind of interface in other simulation engines?
