LLM optimizations of Torch TensorRT #3738
cehongwang started this conversation in Ideas
Torch-TensorRT Feature Ideas and Improvements
1. Investigation and Optimization of Graph Break Overhead
Currently, the multi-backend feature in Torch-TensorRT is experimentally enabled, but preliminary benchmarking results are not satisfactory. Specifically, significant overhead occurs during graph breaks, i.e. when execution switches between backends, an issue that is not notable in comparable frameworks such as Autodeploy.
Objective:
Investigate the root causes of the observed overhead during backend switching and implement optimizations to significantly reduce it. Achieving this could allow Torch-TensorRT to match or exceed the performance of competing frameworks like Autodeploy when using similar backend combinations.
Minimizing this overhead has broader implications beyond Large Language Models (LLMs). Currently, unsupported operations that fall back to PyTorch eager mode can introduce latencies worse than running PyTorch eager alone, discouraging users from adopting Torch-TensorRT. Reducing graph break overhead would ensure that Torch-TensorRT consistently delivers better latency than PyTorch eager mode, solidifying its value proposition.
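For concreteness, here is a minimal sketch of how a graph break can be reproduced for benchmarking with the dynamo front end: forcing a single op to stay in PyTorch via torch_executed_ops splits execution into TensorRT engines with an eager segment in between. The toy module and the choice of aten::relu as the forced-eager op are illustrative assumptions, not part of the proposal.

```python
# Hedged sketch: provoke a TensorRT -> PyTorch eager -> TensorRT boundary
# (a graph break) so its overhead can be timed against pure eager execution.
import torch
import torch_tensorrt


class ToyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(256, 256)
        self.fc2 = torch.nn.Linear(256, 256)

    def forward(self, x):
        x = torch.relu(self.fc1(x))  # relu will be kept in eager below
        return self.fc2(x)


model = ToyBlock().eval().cuda()
inputs = [torch.randn(8, 256).cuda()]

# Forcing aten::relu to stay in PyTorch splits the graph into two TensorRT
# engines with an eager segment in between, i.e. a graph break.
trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    torch_executed_ops={"torch.ops.aten.relu.default"},
    min_block_size=1,
)

with torch.no_grad():
    out = trt_model(*inputs)  # timing this vs. eager exposes the break overhead
```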
Follow-up features:
a. Kernel Optimization: Automatically select the best-performing kernels across multiple backends to maximize execution efficiency.
b. Further multi-backend enhancements to improve flexibility and performance.
2. LLM Quantization for Consumer and Edge Devices
Quantization is a critical advantage offered by TensorRT, particularly beneficial for deploying larger models to consumer or edge devices with limited computational resources.
Objective:
Enable efficient quantization techniques (FP8, FP4) within Torch-TensorRT, initially targeting popular models such as Qwen and Deepseek. This will substantially reduce model size and boost inference speed.
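As a rough illustration of the intended workflow (not a committed API), a post-training FP8 flow could combine NVIDIA ModelOpt fake quantization with Torch-TensorRT compilation roughly as below; the toy model, calibration loop, and the FP8_DEFAULT_CFG config name are stand-ins for the eventual Qwen/Deepseek recipes.

```python
# Hedged sketch of an FP8 post-training-quantization flow with ModelOpt +
# Torch-TensorRT. The tiny model and random calibration data are placeholders.
import torch
import torch_tensorrt
import modelopt.torch.quantization as mtq
from modelopt.torch.quantization.utils import export_torch_mode

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
).eval().cuda()
example_input = torch.randn(4, 512).cuda()


def calibrate_loop(m):
    # Run a few representative batches so ModelOpt can collect activation
    # ranges (random data here as a stand-in for real calibration samples).
    with torch.no_grad():
        for _ in range(8):
            m(torch.randn(4, 512).cuda())


# Insert FP8 fake-quantization nodes and calibrate them.
mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=calibrate_loop)

# Export the quantized graph and compile it with FP8 kernels enabled.
with torch.no_grad(), export_torch_mode():
    exported = torch.export.export(model, (example_input,))
    trt_model = torch_tensorrt.dynamo.compile(
        exported,
        inputs=[example_input],
        enabled_precisions={torch.float8_e4m3fn},
    )
    out = trt_model(example_input)
```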
Expected Outcomes:
a. Enable deployment of larger and more sophisticated models on resource-constrained edge and consumer hardware.
b. Increase overall accessibility and practicality of advanced LLM capabilities in edge scenarios.
3. (Optional) Integration of LLM SDK with Torch-TensorRT
Considering the rising importance of embedded AI applications, integrating LLM SDK as a dedicated backend could significantly extend Torch-TensorRT’s capabilities in autonomous driving and robotics scenarios.
Objective:
Evaluate the potential benefits of incorporating LLM SDK into Torch-TensorRT. Depending on its demonstrated value and practical contribution, consider integrating LLM SDK as an officially supported backend.
Potential Benefits:
a. Specialized optimization for driving and automotive-specific models.
b. Enhanced performance and efficiency tailored to automotive inference tasks.