Enquiries about Parameter Sharding #98

@keneoneth

Hello there, I have been reading your research paper "Decoupled Model Schedule for Deep Learning Training". In particular, part (3) on tensor parallelism mentions that "Since the output tensor only holds partial results after sharding, we need to conduct all_reduce to aggregate outputs from different device". May I know which part of the source code in this repository performs the all_reduce operation? I am currently looking at build.py, and I believe the code captured below handles the sharded parameters and splits a parameter when a shard is added by the user, but I cannot find where the all_reduce operation is done. Any help would be appreciated. Thank you.

    # Only keep the partition for this device for sharded params.
    tp_rank = sch.rank
    cnt_shard = 0
    for param_name, param in sch.mod.named_parameters(recurse=False):
        is_found = False
        for idx, new_size in enumerate(new_param_shapes[param_name]):
            if new_size != param.shape[idx]:
                assert not is_found, "Cannot have two sharded dimensions!"
                sharded_size = new_size
                axis = idx
                is_found = True
        if is_found:
            cnt_shard += 1
            sharded_param = param.detach().split(sharded_size, dim=axis)[tp_rank]
            sharded_param = sharded_param.contiguous()
            new_param = nn.Parameter(sharded_param)
            sch.mod.register_parameter(param_name, new_param)
            transfor_param_tags(sch, param, new_param)
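For context on what I am asking about, here is a minimal single-process sketch (my own illustration, not code from this repository) of why the aggregation step is needed: when a weight is split along its input (reduction) dimension, as in the `param.detach().split(...)[tp_rank]` logic above, each rank's matmul yields only a partial sum, and the partials must be summed across ranks. In a real run that sum would be a collective such as `torch.distributed.all_reduce`; here I simulate the ranks and the reduction with NumPy:

```python
import numpy as np

# Hypothetical sketch: shard a linear layer's weight along its input
# (reduction) dimension across two simulated ranks, compute per-rank
# partial outputs, then aggregate them (the role of all_reduce).
rng = np.random.default_rng(0)
world_size = 2
x = rng.standard_normal((4, 8))   # activations: batch 4, feature dim 8
w = rng.standard_normal((8, 6))   # full weight: 8 -> 6

# Each simulated rank keeps one slice of x and w, mirroring
# param.detach().split(sharded_size, dim=axis)[tp_rank] above.
x_shards = np.split(x, world_size, axis=1)
w_shards = np.split(w, world_size, axis=0)

# Each rank produces only a partial result of the matmul...
partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]

# ...so the partials must be summed across ranks. In an actual
# tensor-parallel run this sum is what all_reduce performs on every
# rank's output tensor; here sum() stands in for the collective.
output = sum(partials)

assert np.allclose(output, x @ w)
```

This is only to clarify my question about which part of the codebase inserts that aggregation.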
