Skip to content

Fix decoding with ngpu-lm when training (#13994) #13995

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

hoangtran9122
Copy link
Contributor

Important

The Update branch button must only be pressed in very rare occassions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Fix illegal CUDA memory error when evaluating using the decoding methods with NGPU-LM while training ASR models.

Collection: asr

Changelog

  • Add WithOptionalCudaGraphs to the new decoding classes with ngpu-lm and add the required methods to enable/disable CUDA graph while training.

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Copy link
Contributor

github-actions bot commented Jul 9, 2025

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Jul 9, 2025
@ahmadki
Copy link
Member

ahmadki commented Jul 9, 2025

@nithinraok The PR looks good, can you take a look if this is something we want to merge ?

@hoangtran9122: Do you mind signing the commit and fixing the linting errors ?

@nithinraok nithinraok requested a review from artbataev July 9, 2025 16:41
@github-actions github-actions bot removed the stale label Jul 10, 2025
@artbataev
Copy link
Collaborator

@hoangtran9122 please update manually to the latest main branch (to fix "isort and black formatting") and sign off the commit

artbataev
artbataev previously approved these changes Jul 10, 2025
Copy link
Collaborator

@artbataev artbataev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!
Please fix the issues with CI.

@hoangtran9122
Copy link
Contributor Author

Yes sure, I'll sign fix it

Signed-off-by: Hoang Tran <hoang.tch@namitech.io>
@hoangtran9122 hoangtran9122 force-pushed the fix-decoding-with-ngpu-lm-on-training branch from 4ecde2d to c7be903 Compare July 11, 2025 06:45
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Jul 11, 2025
update the latest main branch

Signed-off-by: Hoang Tran <hoang.tch@namitech.io>
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Jul 11, 2025
@hoangtran9122 hoangtran9122 force-pushed the fix-decoding-with-ngpu-lm-on-training branch from 7087c91 to 5c143f5 Compare July 11, 2025 08:04
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Jul 11, 2025
@hoangtran9122 hoangtran9122 force-pushed the fix-decoding-with-ngpu-lm-on-training branch from 5c143f5 to 3879d54 Compare July 11, 2025 08:10
@github-actions github-actions bot added the CI label Jul 11, 2025
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Jul 11, 2025
@hoangtran9122
Copy link
Contributor Author

I think have some difficulties to set the NEMO_REFORMAT_TOKEN for the commit @artbataev

@artbataev
Copy link
Collaborator

artbataev commented Jul 11, 2025

@hoangtran9122 The problem is with the PR permissions. After reformatting the code, the changes should be pushed back to your (PR) branch, but you disallowed edits.

There are 2 options how to fix it:

  1. allow edits by maintainers (in PR settings - usually enabled by default when you create the PR) https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork#enabling-repository-maintainer-permissions-on-existing-pull-requests

  2. Install isort and black from requirements_test.txt, reformat each file manually (black . will reformat nothing, need to explicitly pass each file), then push to your branch ("isort & black" task will reformat nothing - no need to push changes)

@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Jul 14, 2025
Signed-off-by: Hoang Tran <hoang.tch@namitech.io>
@hoangtran9122 hoangtran9122 force-pushed the fix-decoding-with-ngpu-lm-on-training branch from ff9f69c to cb86955 Compare July 14, 2025 04:25
@github-actions github-actions bot removed the CI label Jul 14, 2025
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Jul 14, 2025
@artbataev artbataev merged commit 826b9f7 into NVIDIA:main Jul 15, 2025
130 checks passed
AmirHussein96 pushed a commit to AmirHussein96/NeMo that referenced this pull request Jul 23, 2025
* Fix decoding with ngpu-lm when training (NVIDIA#13994)

Signed-off-by: Hoang Tran <hoang.tch@namitech.io>

* code_format

Signed-off-by: Hoang Tran <hoang.tch@namitech.io>

---------

Signed-off-by: Hoang Tran <hoang.tch@namitech.io>
Signed-off-by: Amir Hussein <amhussein@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants