[model] Support GLM-4.7-Flash and remove redundant param --sglang-enable-draft-weights-cpu-backup #566
guapisolo wants to merge 30 commits into radixark:main
Summary of Changes
Hello @guapisolo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request primarily focuses on extending the system's capabilities to support the GLM-4.7-Flash model. It involves integrating the new model into the existing conversion and bridging infrastructure, along with refining the speculative decoding configurations for better efficiency. The changes also include the removal of an obsolete configuration flag, contributing to a cleaner and more up-to-date codebase.
Code Review
This pull request adds support for the GLM-4.7-Flash model. This includes adding a new model bridge that reuses the DeepseekV3 logic, a new model configuration script, and a run script for training. The changes also involve cleaning up an obsolete speculative decoding flag (--sglang-enable-draft-weights-cpu-backup) from various scripts. My review focuses on the new files. I've found a potential bug in the model name check and several areas for improvement in the new run script to reduce hardcoded values and redundant commands, enhancing its maintainability and portability. Overall, the changes look good and logically implement support for the new model.
      # TODO optimize code details
      def _convert_to_hf_core(args, model_name, name, param):
    -     if "glm4moe" in model_name:
    +     if "glm4moelite" in model_name or "deepseekv3" in model_name:
There appears to be a typo in the model name check. The model is registered as "glm4_moe_lite" in miles_plugins/mbridge/glm4moe_lite.py, but the check here uses "glm4moelite". This mismatch will likely cause the condition to fail and prevent the correct model conversion from being used. Please correct the name to match the registered one.
Suggested change:
    - if "glm4moelite" in model_name or "deepseekv3" in model_name:
    + if "glm4_moe_lite" in model_name or "deepseekv3" in model_name:
    sleep 3
    pkill -9 ray
    pkill -9 python
scripts/run-glm4.7-30B-A3B.sh (Outdated)
    export no_proxy="127.0.0.1,${MASTER_ADDR}"
    ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats

    for WORKER_IP in $(awk '{print $1}' /root/mpi_rack_hostfile); do
The path to the host file /root/mpi_rack_hostfile is hardcoded. This makes the script less portable. It's better to use an environment variable with a default value to allow for easier configuration in different environments.
Suggested change:
    - for WORKER_IP in $(awk '{print $1}' /root/mpi_rack_hostfile); do
    + for WORKER_IP in $(awk '{print $1}' "${MPI_RACK_HOSTFILE:-/root/mpi_rack_hostfile}"); do
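The `${VAR:-default}` expansion used in the suggestion falls back to the default when the variable is unset or empty. A minimal sketch of how the hostfile lookup could be isolated, assuming a hostfile whose first whitespace-separated column is the worker IP; `list_worker_ips` is a hypothetical helper for illustration and is not part of the PR:

```shell
#!/usr/bin/env bash
# Sketch only: list_worker_ips is a hypothetical helper, not part of the PR.
# MPI_RACK_HOSTFILE is the variable name proposed in the review suggestion.
list_worker_ips() {
    # Use the caller's path if set and non-empty, else the original default.
    local hostfile="${MPI_RACK_HOSTFILE:-/root/mpi_rack_hostfile}"
    if [ ! -r "${hostfile}" ]; then
        echo "hostfile not readable: ${hostfile}" >&2
        return 1
    fi
    # Print the first column of every non-empty line.
    awk 'NF {print $1}' "${hostfile}"
}
```

With this in place, the loop could read `for WORKER_IP in $(list_worker_ips); do`, and a different cluster would only need `MPI_RACK_HOSTFILE=/path/to/hosts` exported before launching the script.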
scripts/run-glm4.7-30B-A3B.sh (Outdated)
        continue
    fi
    echo "Starting Ray worker on ${WORKER_IP}"
    ssh root@"${WORKER_IP}" \
scripts/run-glm4.7-30B-A3B.sh (Outdated)
    "GLOO_SOCKET_IFNAME": "${MLP_SOCKET_IFNAME}",
    "TP_SOCKET_IFNAME": "${MLP_SOCKET_IFNAME}",
    "MASTER_ADDR": "${MLP_WORKER_0_HOST}",
    "PYTHONPATH": "/root/Megatron-LM/",
The PYTHONPATH is hardcoded to /root/Megatron-LM/. This limits the script's portability. Please consider making this configurable via an environment variable to allow the script to run in different setups.
Suggested change:
    - "PYTHONPATH": "/root/Megatron-LM/",
    + "PYTHONPATH": "${MEGATRON_LM_PATH:-/root/Megatron-LM/}",
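Because the JSON in the script is built by shell interpolation, the fallback resolves before the JSON is handed to Ray. A rough sketch of that step, assuming the entries sit under Ray's `env_vars` runtime-environment key (the surrounding structure is not visible in the diff); `build_runtime_env_json` is a hypothetical helper and is not part of the PR:

```shell
#!/usr/bin/env bash
# Sketch only: build_runtime_env_json is a hypothetical helper, not part of the PR.
# MEGATRON_LM_PATH is the variable name proposed in the review suggestion.
build_runtime_env_json() {
    # Use the caller's path if set and non-empty, else the original default.
    local megatron_path="${MEGATRON_LM_PATH:-/root/Megatron-LM/}"
    # Assumes the path contains no double quotes or backslashes,
    # since it is pasted into the JSON string without escaping.
    printf '{"env_vars": {"PYTHONPATH": "%s"}}\n' "${megatron_path}"
}
```

Running the script with `MEGATRON_LM_PATH=/opt/Megatron-LM/` set would then change only the emitted `PYTHONPATH` value, leaving the default behaviour untouched for existing setups.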
Replace the old env-var-driven RoutingReplay with a cleaner BaseReplayManager / RoutingReplayManager pattern. This removes os.environ usage for replay stage control in favor of direct manager state, generalizes fill_routing_replay into _fill_replay_data, and extracts layer registration logic into replay_utils. Co-authored-by: fzyzcjy <ch271828n@outlook.com> Co-authored-by: Yueming Yuan <yy28@illinois.edu>
ac05af4 to 2b40a7a