We need a systematic approach to verify whether a scheduled model with tensor parallelism still produces the same results (i.e., outputs and gradients). It should check the following:
- Whether `.shard` and `.sync` are correctly specified to maintain shape correctness. For example, sharding the weight of a linear layer by its output feature dimension results in partitioned outputs. In this case, we either need an all-gather right after the linear, or the next linear must shard its weight by the input feature dimension.
- Whether `.shard` and `.sync` are correctly specified to maintain functional correctness. For example, sharding the weight of a linear layer by its input feature dimension results in outputs with the same shape but partial sums. In this case, an all-reduce is required.
- Whether the random seed of each dropout is properly configured. If the input tensor of a dropout is not partitioned (i.e., it is a replica or a partial sum), the random seed on each device should be the same, because all devices are supposed to perform the same redundant computation. On the other hand, if the input tensor is partitioned (i.e., its shape on each device is divided by the TP group size), the random seed on each device should be different, to avoid repeated dropout patterns that hurt convergence.
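The shape and functional correctness rules above can be checked numerically on a single process by simulating the shards with array slices. The sketch below (plain NumPy, not the actual `.shard`/`.sync` API) shows that a column-parallel linear followed by a row-parallel linear needs an all-reduce, while an all-gather after the first linear is the alternative fix. The dropout-seed rule is illustrated at the end with explicit RNG seeds; all names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # batch of 4, 8 features
w1 = rng.standard_normal((8, 6))   # linear1: 8 -> 6
w2 = rng.standard_normal((6, 8))   # linear2: 6 -> 8
tp = 2                             # simulated TP group size

# Reference: unsharded two-layer MLP (no bias/activation for brevity).
ref = (x @ w1) @ w2

# linear1 sharded by output features (column-parallel): the activation
# is partitioned. Sharding linear2 by input features (row-parallel)
# keeps shapes legal but yields partial sums, so an all-reduce is needed.
w1_shards = np.split(w1, tp, axis=1)
w2_shards = np.split(w2, tp, axis=0)
partials = [(x @ a) @ b for a, b in zip(w1_shards, w2_shards)]
out = sum(partials)                # "all-reduce" of the partial sums
assert np.allclose(out, ref)

# Alternative: "all-gather" the partitioned activation, then run the
# next linear unsharded.
act = np.concatenate([x @ a for a in w1_shards], axis=1)
assert np.allclose(act @ w2, ref)

# Dropout seeding: replicated inputs need identical masks on every
# device (shared seed); partitioned inputs need different seeds.
masks_same = [np.random.default_rng(42).random(8) < 0.5 for _ in range(tp)]
assert (masks_same[0] == masks_same[1]).all()
masks_diff = [np.random.default_rng(42 + r).random(8) < 0.5 for r in range(tp)]
```

A verifier could run exactly this kind of comparison against the unsharded module to validate a schedule end to end.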
We use this issue to discuss possible solutions and track progress. At first glance, a compiler approach that performs type inference on static graphs seems promising: we could use TorchDynamo or LazyTensor to capture a static graph and apply the analysis to it. However, this approach won't work if the module cannot be captured as a static graph due to coding-style limitations.
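To make the type-inference idea concrete, here is a minimal sketch of propagating distribution states ("replicated", "partitioned", "partial") through a captured op sequence and flagging missing syncs. The op encoding and state lattice are hypothetical, not the actual slapo or TorchDynamo representation; a real pass would walk an FX graph instead of a list.

```python
# Toy analysis: each op maps an input distribution state to an output
# state, or raises if the schedule is inconsistent.
def propagate(state, op):
    if op == ("linear", "shard_out"):   # column-parallel weight
        if state != "replicated":
            raise ValueError(f"shard_out needs replicated input, got {state}")
        return "partitioned"
    if op == ("linear", "shard_in"):    # row-parallel weight
        if state != "partitioned":
            raise ValueError(f"shard_in needs partitioned input, got {state}")
        return "partial"
    if op == ("sync", "all_gather"):    # partitioned -> replicated
        return "replicated" if state == "partitioned" else state
    if op == ("sync", "all_reduce"):    # partial -> replicated
        return "replicated" if state == "partial" else state
    raise ValueError(f"unknown op {op}")

def check(ops):
    """Run the whole sequence; return the final distribution state."""
    state = "replicated"
    for op in ops:
        state = propagate(state, op)
    return state

# Megatron-style MLP: column-parallel, row-parallel, then all-reduce.
assert check([("linear", "shard_out"),
              ("linear", "shard_in"),
              ("sync", "all_reduce")]) == "replicated"
```

A buggy schedule, e.g. two consecutive `shard_out` linears with no all-gather in between, raises immediately, which is exactly the kind of error this issue asks the verifier to surface.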