I’m using HvdAllToAllEmbedding in a multi-CPU distributed training setup with the following environment:
- TFRA 0.7.2
- TensorFlow 2.15.1
- Horovod 0.28.1
In the description of HvdAllToAllEmbedding, it says:
“In the backward training, the gradient of sparse parameters will be distributed to other cards in the default modulo operator way through AllToAll communication between cards. After each card obtains its own sparse gradient, the sparse optimizer will be executed to complete the optimization of large-scale sparse features.”
However, in the source code, I only see:
- In the forward `embedding_lookup`, HvdAllToAllEmbedding uses `self.hvd.alltoall` to exchange data between workers.
- In `dynamic_embedding_optimizer.py`, only the gradients of the dense model variables are aggregated.
I cannot find any code that aggregates the gradients belonging to HvdAllToAllEmbedding during backward propagation.
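For reference, here is my understanding of the modulo-based sharding the documentation describes. This is only an illustrative sketch (the function name `shard_by_modulo` and the plain-Python bucketing are my own, not TFRA code): each worker buckets its local sparse gradient keys by `key % world_size`, and an AllToAll exchange would then deliver each bucket to the worker that owns those keys.

```python
# Hypothetical sketch (not TFRA source): bucket sparse keys by worker
# using the modulo rule, as a stand-in for the pre-AllToAll partitioning.

def shard_by_modulo(keys, world_size):
    """Return one bucket of keys per worker, assigned by key % world_size."""
    buckets = [[] for _ in range(world_size)]
    for k in keys:
        buckets[k % world_size].append(k)
    return buckets

# Example: 4 workers, feature ids seen in one card's local batch.
local_keys = [3, 8, 5, 12, 7, 4]
buckets = shard_by_modulo(local_keys, 4)
print(buckets)  # [[8, 12, 4], [5], [], [3, 7]]
```

If this is roughly what happens, I would expect a matching `hvd.alltoall` call on the gradient values in the backward pass, which is the part I cannot locate.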
So I’m confused:
In multi-CPU distributed training, are the embedding gradients actually aggregated during the backward pass?
- If so, how are they aggregated?
- Is the aggregation synchronous or asynchronous?
- Is it a sum or a mean?
Thanks for clarifying!