@GeYuhong commented on Sep 2, 2025

Description

This PR adapts Transformer Engine for activation offloading, a new feature in Megatron-LM (NVIDIA/Megatron-LM#1752).

Activation offloading selects the inputs of specific modules (such as core_attn, qkv_linear, and router_fc1), offloads them to the CPU during the forward pass, and reloads them onto the GPU during the backward pass.

When offloading modules that contain weights (nn.Parameter), the attributes attached to those weights (such as main_grad and grad_added_to_main_grad) are stripped by torch. This feature therefore modifies the basic TE modules (such as grouped_linear.py and layernorm_linear.py) to preserve these attributes.
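
Below is a minimal, hypothetical sketch of the problem (not TE's actual code): Python attributes attached to a tensor live only on that Python object, so they do not survive the CPU/GPU round-trip used by offloading unless they are explicitly saved and re-attached.

```python
import torch

# Hypothetical illustration: custom attributes attached to a tensor (as
# Megatron-LM does with main_grad on weights) are stored on that Python
# object only, so the new tensors produced by a device round-trip lack them.
weight = torch.nn.Parameter(torch.randn(4, 4))
weight.main_grad = torch.zeros_like(weight)          # attribute set by the framework
weight.grad_added_to_main_grad = False

cpu_copy = weight.detach().to("cpu")                 # offload (forward pass)
restored = cpu_copy.to(weight.device)                # reload (backward pass)

print(hasattr(restored, "main_grad"))                # False
print(hasattr(restored, "grad_added_to_main_grad"))  # False
```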

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • In grouped_linear.py, linear.py, and layernorm_linear.py, derive the offload_activation flag from whether the input tensor carries the offloading_activation attribute.
  • In the same files, save the grad_added_to_main_grad attribute during the forward pass and restore it during the backward pass (see the sketch after this list).
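
A minimal sketch of the save-and-restore pattern described above, using a hypothetical autograd function (the class name and the exact attribute handling are assumptions for illustration, not TE's implementation):

```python
import torch

class _OffloadAwareLinear(torch.autograd.Function):
    """Hypothetical sketch: stash weight/input attributes on ctx in forward()
    and re-attach them in backward(), so offloading cannot drop them."""

    @staticmethod
    def forward(ctx, inp, weight):
        # Derive the offload flag from an attribute the caller may have set
        # on the input tensor (attribute name assumed for illustration).
        ctx.offload_activation = getattr(inp, "offloading_activation", False)
        # Preserve the weight attribute that would otherwise be stripped.
        ctx.grad_added_to_main_grad = getattr(weight, "grad_added_to_main_grad", False)
        ctx.save_for_backward(inp, weight)
        return inp @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        inp, weight = ctx.saved_tensors
        # Re-attach the preserved attribute before any gradient-accumulation
        # logic that depends on it runs.
        weight.grad_added_to_main_grad = ctx.grad_added_to_main_grad
        return grad_out @ weight, grad_out.t() @ inp
```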

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Hongbin Liu and others added 5 commits September 18, 2025 07:00
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
…tion

Hongbinl/adapt for offload activation