TRT-LLM installation tool #3829
base: main
Conversation
    tensorrt_fused_nccl_all_gather_op,
    tensorrt_fused_nccl_reduce_scatter_op,
)
if load_tensorrt_llm_for_nccl():
I would like to use the enabled features system for this rather than a standalone function.
Unsupported:
- Windows platforms
- Jetson/Orin/Xavier (aarch64 architecture + 'tegra' in platform release)
Thor is also not supported by TRT-LLM, right?
Yeah, Thor and SBSA should support NCCL, but I am not aware of TRT-LLM support. I will include Thor in the list of unsupported platforms.
A follow-up question: what about SBSA? I see on the TRT-LLM page that Blackwell is supported, but that does not imply SBSA support, right? (It can be supported on B200, which is non-SBSA, vs. GB200, which is SBSA.)
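As an aside, the platform gate being discussed could look roughly like the sketch below. The function name matches the one mentioned later in this thread, but the body here is illustrative, not the PR's actual code:

```python
import platform


def is_platform_supported_for_trtllm() -> bool:
    """Illustrative platform gate for TRT-LLM NCCL plugins.

    Rejects Windows and Tegra-based aarch64 (Jetson/Orin/Xavier);
    the Thor/SBSA cases debated above are not modeled here.
    """
    if platform.system() == "Windows":
        return False
    machine = platform.machine()
    release = platform.release()
    # Jetson/Orin/Xavier kernels report "tegra" in the release string
    if machine == "aarch64" and "tegra" in release:
        return False
    return True
```

On a typical Linux x86_64 host this returns True; on a Jetson device the `tegra` substring in the kernel release makes it return False.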
py/torch_tensorrt/dynamo/utils.py
Outdated
if machine == "aarch64" and "tegra" in release:
    logger.info(
        "TensorRT-LLM plugins for NCCL backend are not supported on Jetson/Orin/Xavier (Tegra) devices."
Edit the error message here to include Thor.
py/torch_tensorrt/dynamo/utils.py
Outdated
try:
    cuda_version = torch.version.cuda  # e.g., "12.4" or "13.0"
    if cuda_version is None:
        logger.warning("No CUDA runtime detected — TRT-LLM plugins unavailable.")
This is somewhat misleading, because the actual error is that the PyTorch install does not support CUDA.
Also, if that is the case, would this be an error? What invokes this function? Should the user be able to continue running? Would they be under the assumption that TRT-LLM plugins would be available?
Yes, I will change the error message.
In that case the CUDA runtime is not available, but I assume we would hit an error before reaching this point. With respect to this function, we won't be able to verify whether CUDA is 12.x or 13.x. Should I remove this check altogether?
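One way to make the message less misleading is to parse `torch.version.cuda` through a helper that names the real condition (a CPU-only PyTorch build). A sketch; the helper name and the log wording are assumptions, not the PR's code:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def cuda_major_version(cuda_version: Optional[str]) -> Optional[int]:
    """Parse a torch.version.cuda-style string ("12.4", "13.0").

    Returns None when PyTorch was built without CUDA support, in which
    case torch.version.cuda itself is None on CPU-only builds.
    """
    if cuda_version is None:
        logger.warning(
            "This PyTorch build does not support CUDA; "
            "TRT-LLM plugins are unavailable."
        )
        return None
    major, _minor = map(int, cuda_version.split("."))
    return major
```

With this split out, the caller can simply check `cuda_major_version(torch.version.cuda) == 12` and let the helper handle the CPU-only case.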
py/torch_tensorrt/dynamo/utils.py
Outdated
major, minor = map(int, cuda_version.split("."))
if major != 12:
    logger.warning("CUDA 13 is not supported for TRT-LLM plugins.")
Not currently supported. Same comment as above, though: this seems to me to be at least a log error, but the question is whether we should kill the process. If the program will not run as intended, we should; otherwise it is still an error, but we can continue.
I will change the error message to add "currently".
It's more that this function will then return False: load_tensorrt_llm_for_nccl() calls is_platform_supported_for_trtllm(), which will return False, and the converter will be unsupported.
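The call chain described above can be sketched as follows (the bodies are stubs standing in for the real platform/CUDA checks and plugin loading, not the PR's implementation):

```python
def is_platform_supported_for_trtllm() -> bool:
    # Stub for the platform/CUDA checks discussed in this thread;
    # always passes here so the gating path is visible.
    return True


def load_tensorrt_llm_for_nccl() -> bool:
    # Gate plugin loading on the platform check: on an unsupported
    # platform this returns False, and the fused NCCL converters are
    # then registered as unsupported rather than killing the process.
    if not is_platform_supported_for_trtllm():
        return False
    # ... actual TRT-LLM plugin library loading would happen here ...
    return True
```

This matches the "log and continue" behavior the reviewer is asking about: nothing raises, the converter registration simply sees False.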
    return False


def load_tensorrt_llm_for_nccl() -> bool:
This function should be in the enabled features system, and it should register the feature for other parts of the library to query against.
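A sketch of the pattern being suggested (the class and field names here are hypothetical; torch_tensorrt's real enabled-features system may be organized differently):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSet:
    """Illustrative feature registry: probed once, queried everywhere."""

    trtllm_for_nccl: bool


def _probe_trtllm() -> bool:
    # Placeholder probe; the real check would run the platform/CUDA
    # tests and attempt to load the plugin library once.
    return False


def detect_features() -> FeatureSet:
    # Runs at import time so the rest of the library can query the
    # result instead of calling a standalone function repeatedly.
    return FeatureSet(trtllm_for_nccl=_probe_trtllm())


ENABLED_FEATURES = detect_features()

# Elsewhere in the library:
if ENABLED_FEATURES.trtllm_for_nccl:
    pass  # register the fused NCCL converters
```

The advantage over a standalone function is that the probe runs once and every converter registration site reads the same cached answer.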
I will make this change.
Yes, you should be able to run it on GB200; I think there is just no Thor distribution of TRT-LLM for now.
Force-pushed commit "…ing. Pending - check support on Thor and sbsa" from 8dd657c to cee5c7a.
Across runs the wheel is removed, while the .so file is retained.
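That retention behavior suggests a cache check along these lines: extract the shared library from the wheel once, then discard the wheel and reuse the .so on later runs. A minimal sketch; the function name, cache layout, and library filename are assumptions, not the PR's actual code:

```python
from pathlib import Path
from typing import Optional


def cached_plugin_lib(cache_dir: Path, lib_name: str) -> Optional[Path]:
    """Return the previously extracted shared library, if any.

    On the first run the installer would extract lib_name from the
    downloaded wheel into cache_dir and delete the wheel; subsequent
    runs hit this cache and skip the download entirely.
    """
    candidate = cache_dir / lib_name
    return candidate if candidate.is_file() else None
```

A hit means no network access and no wheel on disk, which matches the observed behavior (wheel removed, .so retained).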