Merged
1 change: 1 addition & 0 deletions docs/source/BestPractices/Qwen3_5-Best-Practice.md
@@ -311,6 +311,7 @@ Tips for training Qwen3.5 with Megatron-SWIFT:
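- Full-parameter training: see [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/mcore_full.sh)
- MTP training: "mcore-bridge>=1.1.0" supports multimodal MTP training (for now this requires installing the [main branch](https://github.com/modelscope/mcore-bridge/pull/14)); please install the corresponding version.
- TP restriction lifted: using "megatron-core>=0.16" removes the `num_query_groups` restriction on TP.
- CP support: "mcore-bridge>=1.1.0" supports CP training for GDN (for now this requires installing the [main branch](https://github.com/modelscope/mcore-bridge/pull/16)); the megatron-core dev branch must also be installed.
- By default, `GatedDeltaNet` uses the Megatron implementation, which requires "megatron-core>=0.16" (ms-swift>=4.1.0; earlier versions default to the transformers implementation). Set the environment variable `USE_MCORE_GDN=0` to switch to the transformers implementation; the transformers implementation supports neither packing nor TP for GDN.
- padding_free/packing support: packing can improve training speed. See [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/packing.sh)
- apply_wd_to_qk_layernorm: apply weight decay to the qk layernorm. Defaults to False.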
1 change: 1 addition & 0 deletions docs/source_en/BestPractices/Qwen3_5-Best-Practice.md
@@ -309,6 +309,7 @@ Tips for training Qwen3.5 with Megatron-SWIFT:
- Full parameter training: Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/mcore_full.sh).
- Regarding MTP training: `mcore-bridge>=1.1.0` supports multimodal MTP training (currently requires installing the [main branch](https://github.com/modelscope/mcore-bridge/pull/14)). Please install the corresponding version.
- TP Limitation Removed: Using `megatron-core>=0.16` removes the `num_query_groups` limitation on TP.
- CP support: `mcore-bridge>=1.1.0` supports CP training for GDN (currently requires installing the [main branch](https://github.com/modelscope/mcore-bridge/pull/16)). The megatron-core dev branch must also be installed.
- By default, `GatedDeltaNet` uses the Megatron implementation, which requires `megatron-core>=0.16` (ms-swift>=4.1.0; earlier versions default to the transformers implementation). Set the environment variable `USE_MCORE_GDN=0` to switch to the transformers implementation. Note that the transformers implementation supports neither packing nor TP for GDN.
- Support for padding_free/packing: Packing can improve training speed. Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/packing.sh).
- apply_wd_to_qk_layernorm: Apply weight decay to qk layernorm. Default is False.
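
The version floors and the `USE_MCORE_GDN` switch in the tips above can be sanity-checked before launching a job. The following is a minimal sketch, not part of ms-swift: the helper names (`parse_version`, `meets_minimum`, `check_environment`, `gdn_backend`) are hypothetical, the distribution names `megatron-core`, `mcore-bridge`, and `ms-swift` are assumed to match what is installed, and the way ms-swift itself parses `USE_MCORE_GDN` is assumed rather than taken from its source. The comparison looks only at numeric release components, so a dev or main-branch build such as `1.1.0.dev0` is treated as satisfying `>=1.1.0`.

```python
import os
import re
from importlib import metadata

# Minimum versions quoted in the tips above (distribution names are assumptions).
MINIMUMS = {"megatron-core": "0.16", "mcore-bridge": "1.1.0", "ms-swift": "4.1.0"}


def parse_version(text):
    """Return the leading numeric release components, e.g. '0.16.0rc1' -> (0, 16, 0)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings by numeric release components only."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(installed=None):
    """Map each package to True/False (meets minimum) or None (not installed).

    Pass a {name: version} dict to check hypothetical versions; by default the
    versions are read from the current Python environment.
    """
    report = {}
    for name, minimum in MINIMUMS.items():
        if installed is not None:
            version = installed.get(name)
        else:
            try:
                version = metadata.version(name)
            except metadata.PackageNotFoundError:
                version = None
        report[name] = None if version is None else meets_minimum(version, minimum)
    return report


def gdn_backend(env=None):
    """Assumed reading of the switch: '0' selects transformers, anything else Megatron."""
    env = os.environ if env is None else env
    return "transformers" if env.get("USE_MCORE_GDN", "1") == "0" else "megatron"


if __name__ == "__main__":
    print(check_environment())
    print("GatedDeltaNet backend:", gdn_backend())
```

If `gdn_backend()` reports `transformers`, the tips above imply that packing and TP for GDN should not be enabled in the same run.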