Conversation
Note that CP support for SFT was recently added to NeMo. We need a NeMo commit at least as recent as 8c921dc19a905d8b5a0f90f6e2a34607c2e0660d.
terrykong left a comment:
TODO:
- potentially rebase after https://github.com/NVIDIA/NeMo-Aligner/actions/runs/12149989736/job/33882060795?pr=405#step:5:743
- the presubmit didn't pass for SFT on that PR, so if it isn't resolved by Thursday, let's just merge this as-is (assuming CI passes), since #405 should be able to rebase on this easily
```yaml
sync_batch_comm: False
megatron_amp_O2: False
encoder_seq_length: 4096 # the sequence length of the encoder model; it will be overwritten by the loaded GPT model
transformer_engine: True
```
For my education, why do we need to specify this now?
I don't actually think it's necessary since TE is enabled by default; I just wanted to make explicit that we were using TE. But I will remove it.
```python
pad_seq_length_to_mult = 16
if model_cfg is not None:
    pad_seq_length_to_mult = (
        8 * model_cfg.get("tensor_model_parallel_size", 1) if model_cfg.get("sequence_parallel", False) else 16
    )
    pad_seq_length_to_mult *= model_cfg.get("context_parallel_size", 1)
```
According to that fp8 comment above, should this be:
```diff
-pad_seq_length_to_mult = 16
-if model_cfg is not None:
-    pad_seq_length_to_mult = (
-        8 * model_cfg.get("tensor_model_parallel_size", 1) if model_cfg.get("sequence_parallel", False) else 16
-    )
-    pad_seq_length_to_mult *= model_cfg.get("context_parallel_size", 1)
+pad_seq_length_to_mult = 16
+if model_cfg is not None:
+    if model_cfg.get("sequence_parallel", False):
+        pad_seq_length_to_mult = math.lcm(pad_seq_length_to_mult, model_cfg.get("tensor_model_parallel_size", 1))
+    pad_seq_length_to_mult *= model_cfg.get("context_parallel_size", 1)
```
From the comment, it sounds like if someone is doing fp8 SFT with TP=1 and sets sequence_parallel, then the padding would be too small.
That chunk was taken directly from here: https://github.com/NVIDIA/NeMo/blob/b847bf75c371931e4f17ea426741c1d023afa0c0/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py#L262-L268, but the code does seem to contradict the comment. I'll follow up with the TE team
From the TE team: "when SP=True, the dimensions are flipped so the sequence dimension is first. So we only need to make sure it's divisible by 8 after the TP split to comply with TE's expectations."
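For reference, here is a minimal sketch of how the padding multiple works out under that rule (illustrative only; the function and argument names are my own, not NeMo code):

```python
def pad_multiple(tp_size: int, cp_size: int, sequence_parallel: bool) -> int:
    # With sequence parallelism the sequence dimension is split across TP ranks first,
    # so each rank's shard only needs to stay divisible by 8 for TE -> pad to 8 * TP.
    # Without sequence parallelism, padding the full sequence to a multiple of 16 is enough.
    mult = 8 * tp_size if sequence_parallel else 16
    # Context parallelism splits the sequence again, so scale by the CP size.
    return mult * cp_size

print(pad_multiple(tp_size=1, cp_size=1, sequence_parallel=True))   # 8  (the TP=1 + SP case raised above)
print(pad_multiple(tp_size=2, cp_size=2, sequence_parallel=True))   # 32
print(pad_multiple(tp_size=4, cp_size=1, sequence_parallel=False))  # 16
```

So the TP=1 + sequence_parallel case pads to 8, which per the TE team is sufficient because the padding requirement applies to each rank's shard, not the full sequence.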
```diff
 fwd_bwd_function = get_forward_backward_func()
-fwd_loss_fn = self.get_forward_output_and_loss_func(forward_only)
+fwd_loss_fn = self.get_forward_output_and_loss_func(forward_only, tuning=True)
```
The tuning flag controls the keys that are returned in the batch: https://github.com/NVIDIA/NeMo/blob/b847bf75c371931e4f17ea426741c1d023afa0c0/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py#L1211-L1222. If tuning=False, we don't return the keys that are necessary for sequence packing (and thus CP) in TE.
Note also that tuning is set to True in NeMo's SFT: https://github.com/NVIDIA/NeMo/blob/b847bf75c371931e4f17ea426741c1d023afa0c0/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py#L407
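As a rough illustration of the pattern (this is not the NeMo implementation; the helper and the key names below are assumptions for the sketch), the flag gates which keys survive into the per-microbatch dict:

```python
from typing import Any, Dict

def select_batch_keys(batch: Dict[str, Any], tuning: bool) -> Dict[str, Any]:
    # Keys every forward pass needs.
    base_keys = {"tokens", "labels", "loss_mask", "attention_mask", "position_ids"}
    # Packed-sequence metadata is only forwarded when tuning=True; without it,
    # TE cannot recover the per-sequence boundaries that sequence packing (and CP) rely on.
    packed_keys = {"cu_seqlens", "max_seqlen"} if tuning else set()
    return {k: v for k, v in batch.items() if k in base_keys | packed_keys}
```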
Closing in favor of #430. @terrykong I'll address your comments there.