diff --git a/docs/source/BestPractices/Qwen3_5-Best-Practice.md b/docs/source/BestPractices/Qwen3_5-Best-Practice.md
index 2329bf8bbb..d6824c1176 100644
--- a/docs/source/BestPractices/Qwen3_5-Best-Practice.md
+++ b/docs/source/BestPractices/Qwen3_5-Best-Practice.md
@@ -311,6 +311,7 @@ Megatron-SWIFT训练Qwen3.5的提示:
 - 全参数训练:参考[这个例子](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/mcore_full.sh)。
 - 关于MTP训练:"mcore-bridge>=1.1.0"支持了多模态MTP的训练(暂时需安装[main分支](https://github.com/modelscope/mcore-bridge/pull/14)),请安装对应版本。
 - TP 限制解除:使用 "megatron-core>=0.16" 可解除 TP 受到的 `num_query_groups` 限制。
+- CP支持:"mcore-bridge>=1.1.0"支持了GDN的CP训练(暂时需安装[main分支](https://github.com/modelscope/mcore-bridge/pull/16)),此外需安装megatron-core dev分支。
 - 默认 `GatedDeltaNet` 使用 Megatron 实现,需使用 "megatron-core>=0.16"(ms-swift>=4.1.0,之前版本默认使用transformers实现)。设置环境变量 `USE_MCORE_GDN=0`可切换至 transformers 实现,transformers实现不支持packing和GDN的TP。
 - padding_free/packing的支持:packing可以提升训练速度。参考[这个例子](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/packing.sh)。
 - apply_wd_to_qk_layernorm:对 qk layernorm 应用权重衰减。默认为False。
diff --git a/docs/source_en/BestPractices/Qwen3_5-Best-Practice.md b/docs/source_en/BestPractices/Qwen3_5-Best-Practice.md
index 3d00899dfd..f309f36f41 100644
--- a/docs/source_en/BestPractices/Qwen3_5-Best-Practice.md
+++ b/docs/source_en/BestPractices/Qwen3_5-Best-Practice.md
@@ -309,6 +309,7 @@ Tips for training Qwen3.5 with Megatron-SWIFT:
 - Full parameter training: Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/mcore_full.sh).
 - Regarding MTP training: `mcore-bridge>=1.1.0` supports multimodal MTP training (currently requires installing the [main branch](https://github.com/modelscope/mcore-bridge/pull/14)). Please install the corresponding version.
 - TP Limitation Removed: Using `megatron-core>=0.16` removes the `num_query_groups` limitation on TP.
+- CP support: "mcore-bridge>=1.1.0" supports CP training for GDN (currently requires installing the [main branch](https://github.com/modelscope/mcore-bridge/pull/16)). Additionally, the megatron-core dev branch needs to be installed.
 - By default, `GatedDeltaNet` uses the Megatron implementation, which requires "megatron-core>=0.16" (ms-swift>=4.1.0; previous versions defaulted to the transformers implementation). Set the environment variable `USE_MCORE_GDN=0` to switch to the transformers implementation. Note that the transformers implementation does not support packing and GDN's TP.
 - Support for padding_free/packing: Packing can improve training speed. Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/packing.sh).
 - apply_wd_to_qk_layernorm: Apply weight decay to qk layernorm. Default is False.