I’m using HvdAllToAllEmbedding in a multi-CPU distributed training setup with the following environment:
- TFRA 0.7.2
- TensorFlow 2.15.1
- Horovod 0.28.1
In the description of HvdAllToAllEmbedding, it says:
“In the backward training, the gradient of sparse parameters will be distributed to other cards in the default modulo operator way through AllToAll communication between cards. After each card obtains its own sparse gradient, the sparse optimizer will be executed to complete the optimization of large-scale sparse features.”
However, in the source code, I only see:
- In the forward `embedding_lookup`, HvdAllToAllEmbedding uses `self.hvd.alltoall` to exchange data between workers.
- In `dynamic_embedding_optimizer.py`, only the gradients of the dense model variables are aggregated.
I cannot find any code that aggregates the gradients belonging to HvdAllToAllEmbedding during backward propagation.
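For reference, here is my understanding of the modulo-based sharding the documentation describes. This is only an illustrative sketch (the function name `shard_by_modulo` and the plain-Python bucketing are my own, not TFRA code): each worker buckets its local sparse gradient keys by `key % world_size`, and an AllToAll exchange would then deliver each bucket to the worker that owns those keys.

```python
# Hypothetical sketch (not TFRA source): bucket sparse keys by worker
# using the modulo rule, as a stand-in for the pre-AllToAll partitioning.

def shard_by_modulo(keys, world_size):
    """Return one bucket of keys per worker, assigned by key % world_size."""
    buckets = [[] for _ in range(world_size)]
    for k in keys:
        buckets[k % world_size].append(k)
    return buckets

# Example: 4 workers, feature ids seen in one card's local batch.
local_keys = [3, 8, 5, 12, 7, 4]
buckets = shard_by_modulo(local_keys, 4)
print(buckets)  # [[8, 12, 4], [5], [], [3, 7]]
```

If this is roughly what happens, I would expect a matching `hvd.alltoall` call on the gradient values in the backward pass, which is the part I cannot locate.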
So I’m confused:
In multi-CPU distributed training, are the embedding gradients actually aggregated during the backward pass?
- If so, how are they aggregated?
- Is the aggregation synchronous or asynchronous?
- Is it a sum or a mean?
Thanks for clarifying!