[model] Support GLM-4.7-Flash and remove redundant param --sglang-enable-draft-weights-cpu-backup#566

Open
guapisolo wants to merge 30 commits into radixark:main from guapisolo:auto/20260205235101

Conversation

@guapisolo
Collaborator

@guapisolo guapisolo commented Feb 5, 2026

  1. Support GLM-4.7-Flash, with a fix in mbridge.
  2. Remove the outdated --sglang-enable-draft-weights-cpu-backup param.

@guapisolo guapisolo requested a review from maocheng23 as a code owner February 5, 2026 23:56
@gemini-code-assist
Contributor

Summary of Changes

Hello @guapisolo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on extending the system's capabilities to support the GLM-4.7-Flash model. It involves integrating the new model into the existing conversion and bridging infrastructure, along with refining the speculative decoding configurations for better efficiency. The changes also include the removal of an obsolete configuration flag, contributing to a cleaner and more up-to-date codebase.

Highlights

  • GLM-4.7-Flash Model Support: Introduced comprehensive support for the GLM-4.7-Flash model, including its specific model configuration and a dedicated run script.
  • Model Bridging Integration: Added a new GLM4MoELiteBridge to handle GLM-4.7-Flash and DeepSeek-V3 models, integrating them into the existing model conversion framework.
  • Speculative Decoding Parameter Updates: Adjusted speculative decoding parameters in various run scripts, increasing num-steps and num-draft-tokens for improved performance.
  • Removed CPU Backup Flag: The --sglang-enable-draft-weights-cpu-backup flag has been removed from documentation and run scripts, indicating a streamlining or deprecation of this feature.
Changelog
  • docs/en/advanced/speculative-decoding.md
    • Removed the --sglang-enable-draft-weights-cpu-backup flag from the speculative decoding documentation.
  • miles/backends/megatron_utils/megatron_to_hf/__init__.py
    • Modified the _convert_to_hf_core function to include glm4moelite and deepseekv3 model names, mapping them to the convert_deepseekv3_to_hf function.
    • Reordered the conditional checks for model names in the conversion logic.
  • miles_plugins/mbridge/__init__.py
    • Imported GLM4MoELiteBridge.
    • Added GLM4MoELiteBridge to the __all__ export list.
  • miles_plugins/mbridge/glm4moe_lite.py
    • Added a new file defining GLM4MoELiteBridge which inherits from DeepseekV3Bridge and is registered as glm4_moe_lite.
  • scripts/models/glm4.7-30B-A3B.sh
    • Added a new script defining model arguments for GLM-4.7-Flash, including MoE parameters, hidden sizes, attention heads, and other configurations.
  • scripts/run-glm4.5-355B-A32B.sh
    • Updated --sglang-speculative-num-steps from 1 to 2.
    • Updated --sglang-speculative-num-draft-tokens from 2 to 3.
    • Removed the --sglang-enable-draft-weights-cpu-backup flag.
    • Changed the save-debug-rollout-data path from an absolute path to a relative path.
  • scripts/run-glm4.7-30B-A3B.sh
    • Added a new comprehensive run script for GLM-4.7-Flash, including Ray setup, checkpoint, rollout, evaluation, performance, optimizer, SGLang, and miscellaneous arguments.
  • scripts/run-qwen3-next-80B-A3B.sh
    • Removed the --sglang-enable-draft-weights-cpu-backup flag.
Activity
  • The pull request was opened by guapisolo with the title "[feat] Support GLM-4.7-Flash".

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for the GLM-4.7-Flash model. This includes adding a new model bridge that reuses the DeepseekV3 logic, a new model configuration script, and a run script for training. The changes also involve cleaning up an obsolete speculative decoding flag (--sglang-enable-draft-weights-cpu-backup) from various scripts. My review focuses on the new files. I've found a potential bug in the model name check and several areas for improvement in the new run script to reduce hardcoded values and redundant commands, enhancing its maintainability and portability. Overall, the changes look good and logically implement support for the new model.

# TODO optimize code details
def _convert_to_hf_core(args, model_name, name, param):
    if "glm4moe" in model_name:
    if "glm4moelite" in model_name or "deepseekv3" in model_name:

high

There appears to be a typo in the model name check. The model is registered as "glm4_moe_lite" in miles_plugins/mbridge/glm4moe_lite.py, but the check here uses "glm4moelite". This mismatch will likely cause the condition to fail and prevent the correct model conversion from being used. Please correct the name to match the registered one.

Suggested change
if "glm4moelite" in model_name or "deepseekv3" in model_name:
if "glm4_moe_lite" in model_name or "deepseekv3" in model_name:
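The mismatch is easy to verify in isolation — a minimal sketch, using only the two names from the review above (the underscores in the registered name break the typo'd substring check):

```python
# The bridge is registered as "glm4_moe_lite" in glm4moe_lite.py; the
# condition under review tests for the substring "glm4moelite" instead.
registered = "glm4_moe_lite"

# Typo'd check: the underscores mean this substring never occurs.
print("glm4moelite" in registered)    # False

# Corrected check, matching the registered name.
print("glm4_moe_lite" in registered)  # True
```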

Comment on lines +9 to +11
sleep 3
pkill -9 ray
pkill -9 python

medium

This block of commands is redundant. The pkill commands on lines 10-11 are repeats of lines 7-8. A single set of pkill commands after ray stop --force should be sufficient to clean up processes. The extra sleep and pkills add clutter and can be removed.

export no_proxy="127.0.0.1,${MASTER_ADDR}"
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats

for WORKER_IP in $(awk '{print $1}' /root/mpi_rack_hostfile); do

medium

The path to the host file /root/mpi_rack_hostfile is hardcoded. This makes the script less portable. It's better to use an environment variable with a default value to allow for easier configuration in different environments.

Suggested change
for WORKER_IP in $(awk '{print $1}' /root/mpi_rack_hostfile); do
for WORKER_IP in $(awk '{print $1}' "${MPI_RACK_HOSTFILE:-/root/mpi_rack_hostfile}"); do
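For reference, the shell expansion `${MPI_RACK_HOSTFILE:-/root/mpi_rack_hostfile}` (use the variable if set, otherwise fall back to the default) has a direct Python analogue — a sketch for illustration only, reusing the variable name from the suggestion:

```python
import os

# Mirrors ${MPI_RACK_HOSTFILE:-/root/mpi_rack_hostfile}: take the environment
# variable when it is set, otherwise fall back to the hardcoded default path.
hostfile = os.environ.get("MPI_RACK_HOSTFILE", "/root/mpi_rack_hostfile")
print(hostfile)
```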

continue
fi
echo "Starting Ray worker on ${WORKER_IP}"
ssh root@"${WORKER_IP}" \

medium

The root user is hardcoded for the SSH connection. This is not ideal for security and flexibility. Consider parameterizing the username with an environment variable.

Suggested change
ssh root@"${WORKER_IP}" \
ssh "${REMOTE_USER:-root}"@"${WORKER_IP}" \

"GLOO_SOCKET_IFNAME": "${MLP_SOCKET_IFNAME}",
"TP_SOCKET_IFNAME": "${MLP_SOCKET_IFNAME}",
"MASTER_ADDR": "${MLP_WORKER_0_HOST}",
"PYTHONPATH": "/root/Megatron-LM/",

medium

The PYTHONPATH is hardcoded to /root/Megatron-LM/. This limits the script's portability. Please consider making this configurable via an environment variable to allow the script to run in different setups.

Suggested change
"PYTHONPATH": "/root/Megatron-LM/",
"PYTHONPATH": "${MEGATRON_LM_PATH:-/root/Megatron-LM/}",

@guapisolo guapisolo changed the title [feat] Support GLM-4.7-Flash [feat] Support GLM-4.7-Flash and remove redundant param --sglang-enable-draft-weights-cpu-backup Feb 6, 2026
@guapisolo guapisolo changed the title [feat] Support GLM-4.7-Flash and remove redundant param --sglang-enable-draft-weights-cpu-backup [model] Support GLM-4.7-Flash and remove redundant param --sglang-enable-draft-weights-cpu-backup Feb 12, 2026
Replace the old env-var-driven RoutingReplay with a cleaner
BaseReplayManager / RoutingReplayManager pattern. This removes
os.environ usage for replay stage control in favor of direct
manager state, generalizes fill_routing_replay into _fill_replay_data,
and extracts layer registration logic into replay_utils.

Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: Yueming Yuan <yy28@illinois.edu>


3 participants