LLM optimizations of Torch TensorRT #3738
cehongwang started this conversation in Ideas
Torch-TensorRT Feature Ideas and Improvements
1. Investigation and Optimization of Graph Break Overhead
Currently, the multi-backend feature in Torch-TensorRT is experimentally enabled, but preliminary benchmarking results are not satisfactory. Specifically, significant overhead occurs during graph breaks, i.e. when execution switches between backends, an issue that is not notable in comparable frameworks such as Autodeploy.
Objective:
Investigate the root causes of the observed overhead during backend switching and implement optimizations to significantly reduce it. Achieving this could allow Torch-TensorRT to match or exceed the performance of competing frameworks like Autodeploy when using similar backend combinations.
Minimizing this overhead has broader implications beyond Large Language Models (LLMs). Currently, unsupported operations that fall back to PyTorch eager mode can introduce latencies worse than running PyTorch eager alone, discouraging users from adopting Torch-TensorRT. Reducing graph break overhead would ensure that Torch-TensorRT consistently delivers better latency than PyTorch eager mode, solidifying its value proposition.
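For concreteness, here is a minimal sketch of how a graph break can be reproduced for benchmarking with the dynamo front end: forcing a single op to stay in PyTorch via torch_executed_ops splits execution into TensorRT engines with an eager segment in between. The toy module and the choice of aten::relu as the forced-eager op are illustrative assumptions, not part of the proposal.

```python
# Hedged sketch: provoke a TensorRT -> PyTorch eager -> TensorRT boundary
# (a graph break) so its overhead can be timed against pure eager execution.
import torch
import torch_tensorrt


class ToyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(256, 256)
        self.fc2 = torch.nn.Linear(256, 256)

    def forward(self, x):
        x = torch.relu(self.fc1(x))  # relu will be kept in eager below
        return self.fc2(x)


model = ToyBlock().eval().cuda()
inputs = [torch.randn(8, 256).cuda()]

# Forcing aten::relu to stay in PyTorch splits the graph into two TensorRT
# engines with an eager segment in between, i.e. a graph break.
trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    torch_executed_ops={"torch.ops.aten.relu.default"},
    min_block_size=1,
)

with torch.no_grad():
    out = trt_model(*inputs)  # timing this vs. eager exposes the break overhead
```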
Follow-up features:
a. Kernel Optimization: Automatically select the best-performing kernels across multiple backends to maximize execution efficiency.
b. Further multi-backend enhancements to improve flexibility and performance.
2. LLM Quantization for Consumer and Edge Devices
Quantization is a critical advantage offered by TensorRT, particularly beneficial for deploying larger models to consumer or edge devices with limited computational resources.
Objective:
Enable efficient quantization techniques (FP8, FP4) within Torch-TensorRT, initially targeting popular models such as Qwen and Deepseek. This will substantially reduce model size and boost inference speed.
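As a rough illustration of the intended workflow (not a committed API), a post-training FP8 flow could combine NVIDIA ModelOpt fake quantization with Torch-TensorRT compilation roughly as below; the toy model, calibration loop, and the FP8_DEFAULT_CFG config name are stand-ins for the eventual Qwen/Deepseek recipes.

```python
# Hedged sketch of an FP8 post-training-quantization flow with ModelOpt +
# Torch-TensorRT. The tiny model and random calibration data are placeholders.
import torch
import torch_tensorrt
import modelopt.torch.quantization as mtq
from modelopt.torch.quantization.utils import export_torch_mode

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
).eval().cuda()
example_input = torch.randn(4, 512).cuda()


def calibrate_loop(m):
    # Run a few representative batches so ModelOpt can collect activation
    # ranges (random data here as a stand-in for real calibration samples).
    with torch.no_grad():
        for _ in range(8):
            m(torch.randn(4, 512).cuda())


# Insert FP8 fake-quantization nodes and calibrate them.
mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=calibrate_loop)

# Export the quantized graph and compile it with FP8 kernels enabled.
with torch.no_grad(), export_torch_mode():
    exported = torch.export.export(model, (example_input,))
    trt_model = torch_tensorrt.dynamo.compile(
        exported,
        inputs=[example_input],
        enabled_precisions={torch.float8_e4m3fn},
    )
    out = trt_model(example_input)
```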
Expected Outcomes:
a. Enable deployment of larger and more sophisticated models on resource-constrained edge and consumer hardware.
b. Increase overall accessibility and practicality of advanced LLM capabilities in edge scenarios.
3. (Optional) Integration of LLM SDK with Torch-TensorRT
Considering the rising importance of embedded AI applications, integrating LLM SDK as a dedicated backend could significantly extend Torch-TensorRT’s capabilities in autonomous driving and robotics scenarios.
Objective:
Evaluate the potential benefits of incorporating LLM SDK into Torch-TensorRT. Depending on its demonstrated value and practical contribution, consider integrating LLM SDK as an officially supported backend.
Potential Benefits:
a. Specialized optimization for driving and automotive-specific models.
b. Enhanced performance and efficiency tailored to automotive inference tasks.