Question about Gradient Aggregation in HvdAllToAllEmbedding during Multi-CPU Distributed Training #500

@Rebecca-rr

Description

I’m using HvdAllToAllEmbedding in a multi-CPU distributed training setup with the following environment:

  • TFRA 0.7.2
  • TensorFlow 2.15.1
  • Horovod 0.28.1

In the description of HvdAllToAllEmbedding, it says:
“In the backward training, the gradient of sparse parameters will be distributed to other cards in the default modulo operator way through AllToAll communication between cards. After each card obtains its own sparse gradient, the sparse optimizer will be executed to complete the optimization of large-scale sparse features.”
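To make the quoted description concrete, here is a minimal sketch (not TFRA's actual implementation) of the modulo-based AllToAll gradient routing it describes: each worker buckets its local sparse gradients by destination rank (`feature_id % world_size`), the buckets are exchanged AllToAll-style, and each owner sums what it receives. Plain Python lists stand in for `hvd.alltoall`, and the function name `route_sparse_grads` is made up for illustration.

```python
# Hedged sketch of modulo-routed sparse-gradient exchange; plain Python
# lists simulate Horovod's alltoall communication.
from collections import defaultdict

def route_sparse_grads(per_worker_grads, world_size):
    """per_worker_grads: list (one entry per worker) of {feature_id: grad}.
    Returns, per worker, the summed gradients for the ids that worker owns
    (owner rank = feature_id % world_size)."""
    # Step 1: each worker buckets its local gradients by destination rank.
    send_buckets = [
        [defaultdict(float) for _ in range(world_size)]
        for _ in range(world_size)
    ]
    for rank, grads in enumerate(per_worker_grads):
        for fid, g in grads.items():
            send_buckets[rank][fid % world_size][fid] += g
    # Step 2: "alltoall" -- each rank receives the bucket addressed to it
    # from every peer, then sums the gradients per feature id.
    owned = []
    for dst in range(world_size):
        acc = defaultdict(float)
        for src in range(world_size):
            for fid, g in send_buckets[src][dst].items():
                acc[fid] += g
        owned.append(dict(acc))
    return owned

# Two workers; id 3 appears on both, so its gradients are summed on rank 1.
grads = [{0: 1.0, 3: 0.5}, {3: 0.5, 2: 2.0}]
print(route_sparse_grads(grads, world_size=2))
# → [{0: 1.0, 2: 2.0}, {3: 1.0}]
```

Note that in this sketch duplicate ids are reduced by summation; whether TFRA sums or averages is exactly the question asked below.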

However, in the source code, I only see:

  • In the forward embedding_lookup, HvdAllToAllEmbedding uses self.hvd.alltoall for aggregation.
  • In dynamic_embedding_optimizer.py, only the model variable gradients are aggregated.
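For reference, the forward path mentioned in the first bullet can be sketched roughly as follows (an assumption about the general pattern, not TFRA source): each worker partitions its lookup ids by `id % world_size`, exchanges the ids AllToAll-style, answers requests from its local shard, and exchanges the embeddings back. The name `lookup_alltoall` and the shard layout are illustrative only, with lists standing in for `self.hvd.alltoall`.

```python
# Hedged sketch of an AllToAll embedding lookup; nested lists simulate
# the two alltoall exchanges (ids out, embeddings back).
def lookup_alltoall(query_ids_per_worker, shards, world_size):
    """query_ids_per_worker: list of id lists, one per worker.
    shards[r]: {feature_id: embedding} for ids with id % world_size == r.
    Returns each worker's embeddings in its original query order."""
    # Step 1: route each query id to its owner rank (id % world_size).
    requests = [[[] for _ in range(world_size)] for _ in range(world_size)]
    for rank, ids in enumerate(query_ids_per_worker):
        for fid in ids:
            requests[rank][fid % world_size].append(fid)
    # Step 2: "alltoall" the ids; each owner answers from its local shard.
    replies = [[[shards[dst][fid] for fid in requests[src][dst]]
                for dst in range(world_size)]
               for src in range(world_size)]
    # Step 3: reassemble each worker's results in original query order.
    out = []
    for rank, ids in enumerate(query_ids_per_worker):
        cursors = [0] * world_size
        row = []
        for fid in ids:
            dst = fid % world_size
            row.append(replies[rank][dst][cursors[dst]])
            cursors[dst] += 1
        out.append(row)
    return out

# Two workers, one-dimensional embeddings for brevity.
shards = [{0: [0.1], 2: [0.2]}, {1: [0.3], 3: [0.4]}]
print(lookup_alltoall([[0, 3], [2, 1]], shards, world_size=2))
# → [[[0.1], [0.4]], [[0.2], [0.3]]]
```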

I do not see any aggregation of the gradients corresponding to HvdAllToAllEmbedding during backward propagation.
So I’m confused: in multi-CPU distributed training, are the embedding gradients actually aggregated during the backward pass?

  • If they are aggregated, in what way are they aggregated?
  • Is it synchronous or asynchronous?
  • Is it a sum or mean operation?

Thanks for clarifying!
