Megatron-LM specific release v2.8 rocm cherrypicks #498

Open
sudhu2k wants to merge 2 commits into release_v2.8_rocm from sudhu/release_v2.8_rocm_cherrypicks

@sudhu2k sudhu2k commented Mar 19, 2026

Description

These cherry-picks fix some issues seen in Megatron-LM unit test runs.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

sudhu2k and others added 2 commits March 17, 2026 18:30
* Update max_fp8 value based on is_fp8_fnuz check in utils.py

* Fixed and added test_cast_master_weights_to_fp8 to ci

Addressed Reviews

* Update copyright information.

---------

Co-authored-by: Veera Gopu <veerarajasekharreddy.gopu@amd.com>
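
For context on the max_fp8 change above: the FNUZ FP8 encodings have a different largest finite E4M3 value than the OCP encodings (240 instead of 448), so a scale computed with the wrong maximum can overflow or waste range. The sketch below only illustrates that selection. The helper names and the boolean parameter are hypothetical and are not the code in utils.py, which is not shown in this thread.

```python
# Largest finite magnitude of each FP8 encoding (properties of the formats).
FP8_MAX = {
    ("e4m3", False): 448.0,    # OCP E4M3FN
    ("e4m3", True): 240.0,     # E4M3FNUZ
    ("e5m2", False): 57344.0,  # OCP E5M2
    ("e5m2", True): 57344.0,   # E5M2FNUZ
}


def max_fp8(fmt: str, is_fp8_fnuz: bool) -> float:
    """Return the largest finite value for the given format (hypothetical helper)."""
    return FP8_MAX[(fmt, is_fp8_fnuz)]


def compute_scale(amax: float, fmt: str, is_fp8_fnuz: bool) -> float:
    """Scale that maps amax onto the largest representable FP8 value."""
    if amax <= 0.0:
        # No usable statistics: fall back to an identity scale.
        return 1.0
    return max_fp8(fmt, is_fp8_fnuz) / amax
```

With amax = 480, the E4M3 scale would be 0.5 under FNUZ; using the OCP maximum of 448 instead would give a larger scale and push the largest inputs past 240.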
* Update permutation.py



* Update permutation.py



* Update transformer_engine/pytorch/triton/permutation.py



* Update transformer_engine/pytorch/triton/permutation.py



---------

Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
configure_omp_threads 8
run_default_fa 1 test_fused_optimizer.py
run_default_fa 3 test_sanity_import.py
run_default_fa 3 distributed/test_cast_master_weights_to_fp8.py
Collaborator

It requires the test hotfix, doesn't it?

Contributor Author

This one didn't require the hotfix.

TE2.10 required the hotfix:

# ROCm: Use executable as-is; do not resolve() or a venv symlink may point to system
# Python which does not have torch/site-packages.
python_exe = pathlib.Path(sys.executable)
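
A small illustration of why resolve() matters here, assuming a POSIX system where symlinks can be created. The function and file names are made up for the demo. It shows only that Path.resolve() follows a symlink to its target, which is how a venv interpreter path could end up pointing at the system Python.

```python
import pathlib
import sys
import tempfile


def launcher_python(resolve: bool = False) -> pathlib.Path:
    """Return the interpreter path, optionally following symlinks."""
    exe = pathlib.Path(sys.executable)
    return exe.resolve() if resolve else exe


def demo_symlink_resolution() -> tuple:
    """Create venv_python -> system_python and compare the two path forms."""
    with tempfile.TemporaryDirectory() as tmp:
        target = pathlib.Path(tmp) / "system_python"
        target.write_text("")  # placeholder file standing in for an interpreter
        link = pathlib.Path(tmp) / "venv_python"
        link.symlink_to(target)
        # The unresolved path keeps the venv name; resolve() returns the target.
        return link.name, link.resolve().name
```

Calling launcher_python() with the default resolve=False mirrors the hotfix: the path is passed on exactly as the running interpreter reports it.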

TE2.8 implements the test differently:

def _run_test(quantization):
    test_path = TEST_ROOT / "run_cast_master_weights_to_fp8.py"
    test_cmd = LAUNCH_CMD + [str(test_path)] + ["--quantization", quantization]
    result = subprocess.run(test_cmd, env=os.environ, check=False)
    assert result.returncode == 0


@pytest.mark.parametrize("quantization", ["fp8", "fp8_cs", "fp8_block"])
def test_cast_master_weights_to_fp8(quantization):
    if quantization in ("fp8", "fp8_cs") and not fp8_available:
        pytest.skip(reason_for_no_fp8)
    if quantization == "fp8_block" and not fp8_block_scaling_available:
        pytest.skip(reason_for_no_fp8_block_scaling)
    _run_test(quantization)
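
The launch pattern in the TE2.8 test can be tried in isolation with the sketch below. LAUNCH_CMD here is a hypothetical stand-in that runs an inline Python snippet, not the real launcher. Because check=False is passed, subprocess.run does not raise on a non-zero exit, so the test's own assert on returncode is what marks the case as failed.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for the real LAUNCH_CMD: run an inline snippet
# with the current interpreter.
LAUNCH_CMD = [sys.executable, "-c"]


def run_child(code: str) -> int:
    """Run a child process and return its exit code without raising."""
    result = subprocess.run(LAUNCH_CMD + [code], env=os.environ, check=False)
    return result.returncode
```

A child that exits with status 0 passes the returncode check, and any other status surfaces as an assertion failure in the parent test.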

Collaborator

@ipanfilo ipanfilo left a comment

Run CI level 3

Contributor Author

sudhu2k commented Mar 20, 2026

Run CI level 3

MGPU/SGPU tests passed at level 3.
The example test failed on MI325 with a flaky Hugging Face rate-limit error:

  File "/opt/venv/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 3010, in dataset_info
    hf_raise_for_status(r)
  File "/opt/venv/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 880, in hf_raise_for_status
    raise _format(HfHubHTTPError, message, response) from e
huggingface_hub.errors.HfHubHTTPError: (Request ID: Root=1-69bcf50c-102ad0297e9cea061658671b;35076563-ba85-4f92-a0a7-3f465e541e0a)

429 Too Many Requests: you have reached your 'api' rate limit.

https://github.com/ROCm/TransformerEngine/actions/runs/23328907485/job/67856033151

@sudhu2k sudhu2k requested a review from ipanfilo March 20, 2026 20:03