While adding LoRA GEMM support to PaddleFormers, two issues on the fleet side were found and fixed:

1. With EP enabled, grouped_gemm_experts is missing LoRA gradients.
Cause: when `moe_use_fusion_node=true`, the model takes the `fusion_moe_forward` branch, but the fleet side does not handle LoRA gradients there.
Fix: added a `_lora_weight_grad` function and the corresponding handling to `ExpertsGroupGemmContiguousNode`.

2. During LoRA training, Layer 0 of the model never participates in training.
Cause: with recompute enabled, layer 0 incorrectly inherits the `stop_gradient` state of `embed_tokens`.
Fix: ported the analogous fix from `pp_model.py` in formers to the fleet side.
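For context, the math such a `_lora_weight_grad` hook has to supply can be sketched as follows. This is a hypothetical pure-Python illustration, not the actual fleet implementation: for a LoRA layer `y = x @ (W + A @ B)`, the backward pass must produce gradients for the adapter factors `A` and `B` in addition to the base weight, and this is exactly what the fused grouped-GEMM branch was missing.

```python
def matmul(p, q):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def lora_weight_grad(x, grad_out, lora_a, lora_b):
    """Illustrative LoRA weight gradients (names are hypothetical).

    For y = x @ (W + A @ B) with A: [in, r] and B: [r, out], the chain
    rule gives:
        dA = x^T @ dy @ B^T
        dB = (x @ A)^T @ dy
    """
    xt_dy = matmul(transpose(x), grad_out)                   # [in, out]
    grad_a = matmul(xt_dy, transpose(lora_b))                # [in, r]
    grad_b = matmul(transpose(matmul(x, lora_a)), grad_out)  # [r, out]
    return grad_a, grad_b
```

With `x = [[1.0, 0.0]]`, `grad_out = [[1.0]]`, `A = [[1.0], [0.0]]`, and `B = [[2.0]]`, this returns `dA = [[2.0], [0.0]]` and `dB = [[1.0]]`, matching the formulas above.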
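The second bug can be illustrated with a toy model (this is a conceptual sketch, not fleet or pp_model.py code; all names are hypothetical). Under LoRA, the embedding weights are frozen, so the embedding output carries `stop_gradient=True`; if the recompute wrapper lets layer 0's output inherit that flag from its input, layer 0 never receives gradients even though its LoRA adapters are trainable.

```python
class Tensor:
    """Minimal stand-in for a framework tensor with a stop_gradient flag."""
    def __init__(self, stop_gradient=False):
        self.stop_gradient = stop_gradient

def embed_tokens(ids):
    # Embedding weights are frozen under LoRA, so the embedding output
    # carries stop_gradient=True.
    return Tensor(stop_gradient=True)

def buggy_recompute_forward(hidden):
    # Buggy behaviour: the layer output blindly inherits the input's
    # flag, so layer 0 is treated as frozen and skipped in backward.
    return Tensor(stop_gradient=hidden.stop_gradient)

def fixed_recompute_forward(hidden, layer_has_trainable_params=True):
    # Fixed behaviour (in the spirit of the pp_model.py fix): if the
    # layer itself holds trainable (LoRA) parameters, its output must
    # not inherit stop_gradient=True from a frozen input.
    out = Tensor(stop_gradient=hidden.stop_gradient)
    if layer_has_trainable_params:
        out.stop_gradient = False
    return out

hidden = embed_tokens([0, 1, 2])
print(buggy_recompute_forward(hidden).stop_gradient)   # True: layer 0 frozen
print(fixed_recompute_forward(hidden).stop_gradient)   # False: layer 0 trains
```

The key point is that trainability must be decided from the layer's own parameters, not inferred from the `stop_gradient` state of the (frozen) embedding output flowing into it.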