Hello there, I have been reading your research paper "Decoupled Model Schedule for Deep Learning Training". In particular, in part (3), tensor parallelism, it is mentioned that "Since the output tensor only holds partial results after sharding, we need to conduct all_reduce to aggregate outputs from different device". May I know which part of the source code in this repository performs the all_reduce operation? I am currently looking at build.py, and I believe the code captured below handles the sharded parameters, splitting each parameter when the user adds a shard. However, I am not exactly sure where the all_reduce operation is performed. Any help would be appreciated. Thank you.
```python
# Only keep the partition for this device for sharded params.
tp_rank = sch.rank
cnt_shard = 0
for param_name, param in sch.mod.named_parameters(recurse=False):
    is_found = False
    for idx, new_size in enumerate(new_param_shapes[param_name]):
        if new_size != param.shape[idx]:
            assert not is_found, "Cannot have two sharded dimensions!"
            sharded_size = new_size
            axis = idx
            is_found = True
    if is_found:
        cnt_shard += 1
        sharded_param = param.detach().split(sharded_size, dim=axis)[tp_rank]
        sharded_param = sharded_param.contiguous()
        new_param = nn.Parameter(sharded_param)
        sch.mod.register_parameter(param_name, new_param)
        transfor_param_tags(sch, param, new_param)
```
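For context, my understanding of why the all_reduce is needed is as in the minimal sketch below. This is not code from this repository, just my own NumPy illustration: when a linear layer's weight is sharded along the contraction (input) dimension, each rank produces only a partial output, and summing the partials (which is what `all_reduce` with SUM does across devices) recovers the full result.

```python
import numpy as np

# Hypothetical illustration of tensor parallelism (not code from this repo).
np.random.seed(0)
x = np.random.randn(4, 8)  # batch of inputs, hidden size 8
w = np.random.randn(8, 6)  # full weight matrix of a linear layer

world_size = 2
# Each "device" keeps one shard of the weight along the contraction axis,
# mirroring param.detach().split(sharded_size, dim=axis)[tp_rank] above.
w_shards = np.split(w, world_size, axis=0)
x_shards = np.split(x, world_size, axis=1)

# Each rank computes a *partial* output from its local shard...
partials = [x_s @ w_s for x_s, w_s in zip(x_shards, w_shards)]

# ...and the element-wise sum stands in for all_reduce(SUM) on the output.
out = sum(partials)

assert np.allclose(out, x @ w)  # matches the unsharded computation
```

So my question is essentially where in the runtime this summation across devices is triggered, since the snippet above from build.py only seems to do the parameter splitting.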