We need a systematic approach to verify whether a scheduled model with tensor parallelism still produces the same results (i.e., outputs and gradients). It should check the following:
- Whether `.shard` and `.sync` are correctly specified to maintain shape correctness. For example, sharding the weight of a linear layer by its output feature dimension results in partitioned outputs. In this case, we either need an all-gather right after the linear, or the next linear must shard its weight by the input feature dimension.
- Whether `.shard` and `.sync` are correctly specified to maintain functional correctness. For example, sharding the weight of a linear layer by its input feature dimension results in outputs with the same shape but partial sums. In this case, an all-reduce is required.
- Whether the random seed of each dropout is properly configured. If the input tensor of a dropout is not partitioned (i.e., it is a replica or a partial sum), the random seed on each device should be the same, because all devices are supposed to perform the same redundant computation. On the other hand, if the input tensor is partitioned (i.e., its shape on each device is divided by the TP group size), the random seed on each device should be different, to avoid repeated dropout patterns that hurt convergence.
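The shape and functional correctness rules above can be checked numerically on a single process by simulating the shards with array slices. The sketch below (plain NumPy, not the actual `.shard`/`.sync` API) shows that a column-parallel linear followed by a row-parallel linear needs an all-reduce, while an all-gather after the first linear is the alternative fix. The dropout-seed rule is illustrated at the end with explicit RNG seeds; all names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # batch of 4, 8 features
w1 = rng.standard_normal((8, 6))   # linear1: 8 -> 6
w2 = rng.standard_normal((6, 8))   # linear2: 6 -> 8
tp = 2                             # simulated TP group size

# Reference: unsharded two-layer MLP (no bias/activation for brevity).
ref = (x @ w1) @ w2

# linear1 sharded by output features (column-parallel): the activation
# is partitioned. Sharding linear2 by input features (row-parallel)
# keeps shapes legal but yields partial sums, so an all-reduce is needed.
w1_shards = np.split(w1, tp, axis=1)
w2_shards = np.split(w2, tp, axis=0)
partials = [(x @ a) @ b for a, b in zip(w1_shards, w2_shards)]
out = sum(partials)                # "all-reduce" of the partial sums
assert np.allclose(out, ref)

# Alternative: "all-gather" the partitioned activation, then run the
# next linear unsharded.
act = np.concatenate([x @ a for a in w1_shards], axis=1)
assert np.allclose(act @ w2, ref)

# Dropout seeding: replicated inputs need identical masks on every
# device (shared seed); partitioned inputs need different seeds.
masks_same = [np.random.default_rng(42).random(8) < 0.5 for _ in range(tp)]
assert (masks_same[0] == masks_same[1]).all()
masks_diff = [np.random.default_rng(42 + r).random(8) < 0.5 for r in range(tp)]
```

A verifier could run exactly this kind of comparison against the unsharded module to validate a schedule end to end.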
We use this issue to discuss possible solutions and track progress. At first glance, a compiler approach that performs type inference on static graphs seems promising: we could use TorchDynamo or LazyTensor to capture a static graph and apply the analysis to it. However, this approach won't work if the module cannot be captured as a static graph due to coding-style limitations.
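To make the type-inference idea concrete, here is a minimal sketch of propagating distribution states ("replicated", "partitioned", "partial") through a captured op sequence and flagging missing syncs. The op encoding and state lattice are hypothetical, not the actual slapo or TorchDynamo representation; a real pass would walk an FX graph instead of a list.

```python
# Toy analysis: each op maps an input distribution state to an output
# state, or raises if the schedule is inconsistent.
def propagate(state, op):
    if op == ("linear", "shard_out"):   # column-parallel weight
        if state != "replicated":
            raise ValueError(f"shard_out needs replicated input, got {state}")
        return "partitioned"
    if op == ("linear", "shard_in"):    # row-parallel weight
        if state != "partitioned":
            raise ValueError(f"shard_in needs partitioned input, got {state}")
        return "partial"
    if op == ("sync", "all_gather"):    # partitioned -> replicated
        return "replicated" if state == "partitioned" else state
    if op == ("sync", "all_reduce"):    # partial -> replicated
        return "replicated" if state == "partial" else state
    raise ValueError(f"unknown op {op}")

def check(ops):
    """Run the whole sequence; return the final distribution state."""
    state = "replicated"
    for op in ops:
        state = propagate(state, op)
    return state

# Megatron-style MLP: column-parallel, row-parallel, then all-reduce.
assert check([("linear", "shard_out"),
              ("linear", "shard_in"),
              ("sync", "all_reduce")]) == "replicated"
```

A buggy schedule, e.g. two consecutive `shard_out` linears with no all-gather in between, raises immediately, which is exactly the kind of error this issue asks the verifier to surface.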