[Quantization] Add TRT-ModelOpt as a Backend #11173
Conversation
@sayakpaul, would you mind giving this a quick look and sharing suggestions?
Thanks for getting started on this. I guess there is a problem here: NVIDIA/TensorRT-Model-Optimizer#165? Additionally, the API should have a TRTConfig in place of just a dict being the quantization config.
I think the problem has been fixed in the newest release; I just need to bump it up in the diffusers requirements. Also, we can do the following for passing a config class.
By TRTConfig, did you mean including the config classes from ModelOptimizer here?
We use namings like BitsAndBytesConfig for the other quantization backends. So, in this case, we should be using NVIDIAModelOptConfig.
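For illustration, this is roughly what the dedicated config class looks like in use, mirroring the dict keys from the benchmarking snippet shared later in this thread (treat the argument names as a sketch of the proposed API, not its final form):

```python
from diffusers.quantizers.quantization_config import NVIDIAModelOptConfig

# Instead of a raw dict like {"quant_type": "FP8", "quant_method": "modelopt"},
# the settings are wrapped in a dedicated config class and passed as `quantization_config`.
quant_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")
```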
Alright, let's try with the latest fixes then.
The newer version wasn't backward compatible, hence the issues; I have fixed it. Related to naming, the package name is nvidia-modelopt (imported as modelopt).
Doesn't it have any reliance on TensorRT?
No, it doesn't. We can use TRT to compile the quantized model.
Could you elaborate on what you mean by this?
This looks nice. Could you demonstrate some memory savings and any speedups when using modelopt, please? We can then add tests, docs, etc.
Yeah, so for quantizing the model we don't use TensorRT, but once the model is quantized we can compile it using TensorRT.
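For context, a minimal sketch of what that optional "compile with TensorRT afterwards" step could look like. This uses the separate torch-tensorrt package and its torch.compile backend, which is not part of this PR; the checkpoint and quant settings simply mirror the snippet later in this thread.

```python
import torch
import torch_tensorrt  # assumed installed; registers TensorRT backends for torch.compile

from diffusers import SanaTransformer2DModel
from diffusers.quantizers.quantization_config import NVIDIAModelOptConfig

# Step 1: quantize on the fly with modelopt (no TensorRT involved at this stage).
model = SanaTransformer2DModel.from_pretrained(
    "Efficient-Large-Model/Sana_600M_1024px_diffusers",
    subfolder="transformer",
    quantization_config=NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt"),
    torch_dtype=torch.bfloat16,
).to("cuda")

# Step 2 (optional): compile the quantized transformer with the TensorRT backend.
model = torch.compile(model, backend="tensorrt")
```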
💾 Model & Inference Memory (in MB)
Following is the code:

```python
import torch
from tqdm import tqdm

from diffusers import SanaTransformer2DModel, SD3Transformer2DModel, FluxTransformer2DModel
from diffusers.quantizers.quantization_config import NVIDIAModelOptConfig

checkpoint = "Efficient-Large-Model/Sana_600M_1024px_diffusers"
model_cls = SanaTransformer2DModel
# checkpoint = "stabilityai/stable-diffusion-3-medium-diffusers"
# model_cls = SD3Transformer2DModel
# checkpoint = "black-forest-labs/FLUX.1-dev"
# model_cls = FluxTransformer2DModel

# Dummy inputs for the active model; the commented alternatives correspond to the other checkpoints above.
input = lambda: (torch.randn((2, 32, 32, 32), dtype=torch.bfloat16).to('cuda'), torch.randn((2, 10, 300, 2304), dtype=torch.bfloat16).to('cuda'), torch.Tensor([0, 0]).to('cuda'))
# input = lambda: (torch.randn((1, 16, 96, 96), dtype=torch.bfloat16).to('cuda'), torch.randn((1, 300, 4096), dtype=torch.bfloat16).to('cuda'), torch.randn((1, 2048), dtype=torch.bfloat16).to('cuda'), torch.Tensor([0]).to('cuda'))
# input = lambda: (torch.randn((1, 1024, 64), dtype=torch.bfloat16).to('cuda'), torch.randn((1, 300, 4096), dtype=torch.bfloat16).to('cuda'), torch.randn((1, 768), dtype=torch.bfloat16).to('cuda'), torch.Tensor([0]).to('cuda'), torch.randn((300, 3)).to('cuda'), torch.randn((1024, 3)).to('cuda'), torch.Tensor([0]).to('cuda'))

quant_config_fp8 = {"quant_type": "FP8", "quant_method": "modelopt"}
quant_config_int4 = {"quant_type": "INT4", "quant_method": "modelopt", "block_quantize": 128, "channel_quantize": -1}
quant_config_nvfp4 = {"quant_type": "NVFP4", "quant_method": "modelopt", "block_quantize": 128, "channel_quantize": -1, "modules_to_not_convert": ["conv"]}


def test_quantization(config, checkpoint, model_cls):
    # Quantize the transformer on the fly while loading it and report its memory footprint.
    quant_config = NVIDIAModelOptConfig(**config)
    print(quant_config.get_config_from_quant_type())
    quant_model = model_cls.from_pretrained(
        checkpoint,
        subfolder="transformer",
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
        device_map="balanced",
    ).to('cuda')
    print(f"Quant {config['quant_type']} Model Memory Footprint: ", quant_model.get_memory_footprint() / 1e6)
    return quant_model


def test_quant_inference(model, input, iter=10):
    # Average the peak allocated CUDA memory over `iter` no-grad forward passes.
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    inference_memory = 0
    for _ in tqdm(range(iter)):
        with torch.no_grad():
            output = model(*input())
        inference_memory += torch.cuda.max_memory_allocated()
    inference_memory /= iter
    print("Inference Memory: ", inference_memory / 1e6)


test_quant_inference(test_quantization(quant_config_fp8, checkpoint, model_cls), input)
# test_quant_inference(test_quantization(quant_config_int4, checkpoint, model_cls), input)
# test_quant_inference(test_quantization(quant_config_nvfp4, checkpoint, model_cls), input)
# test_quant_inference(model_cls.from_pretrained(checkpoint, subfolder="transformer", torch_dtype=torch.bfloat16).to('cuda'), input)
```

Speed Ups

There is no significant speedup between the different quantizations because internally modelopt still uses high-precision arithmetic (float32). Sorry for being a bit late on this. @sayakpaul, let me know the next steps!
@ishan-modi let us know if this is ready to be reviewed.
@sayakpaul, I think it is ready for preliminary review; on-the-fly quantization works fine. But loading pre-quantized models errors out, and that will be fixed in the next release here (early May) by the NVIDIA team. @jingyu-ml, just so that you are in the loop.
Looking good so far!
Could you also demonstrate some memory and timing numbers with the modelopt toolkit and some visual results?
No need, just saw #11173 (comment). But it doesn't measure the inference memory, which is usually done via torch.cuda.max_memory_allocated(). Could we also see those numbers? Would it be possible to make it clear in the PR description that:
on-the-fly quantization works fine, but loading pre-quantized models errors out and will be fixed in the next release NVIDIA/TensorRT-Model-Optimizer#185 (early May) by the NVIDIA team.
@jingyu-ml is it expected to not see any speedups in latency?
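For reference, a minimal sketch of the peak-memory measurement referred to above. This is plain PyTorch and not specific to this PR; the helper name is made up for illustration.

```python
import torch

def peak_inference_memory_mb(model, make_inputs, iters=10):
    """Peak CUDA memory allocated across `iters` no-grad forward passes, in MB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()  # reset the counter so only inference allocations are captured
    with torch.no_grad():
        for _ in range(iters):
            model(*make_inputs())
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1e6
```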
Thanks for this! Just some nits. It could be nice to add this quantization scheme to transformers after this gets merged!
@ishan-modi just a quick question. Do we know if modelopt supports SVDQuant?
@sayakpaul, yes, modelopt does support SVDQuant, but in this integration we support only the quantization types covered above.
That's fine. I wanted to ask because I think if we can support SVDQuant, that could be a nice follow-up.
Will merge after @DN6 has had a chance to review.
@ishan-modi can we also include a note in the docs that just performing the conversion step with modelopt does not, by itself, provide speedups?
@realAsma @jingyu-ml after this PR is merged, we could plan writing a post/guide on how to take a modelopt-quantized diffusers model further with TensorRT.
Excellent work @ishan-modi 👍🏽 Thank you 🙏🏽
@ishan-modi can we fix the remaining CI problems and then we should be good to go.
@sayakpaul, should be fixed now.
Congratulations on shipping this thing, @ishan-modi! Thank you! Let's maybe now focus on the following things to maximize the potential impact:
Happy to help.
What does this PR do?
WIP, aimed at adding a new backend for quantization #11032. For now, this PR only works for on-the-fly quantization. Loading pre-quantized models errors out and is to be fixed by the NVIDIA team in the next release (early May).
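A minimal on-the-fly usage sketch, condensed from the benchmarking snippet in the conversation above. The pipeline wiring is illustrative (only the transformer is quantized here); the checkpoint and quant settings are just examples.

```python
import torch
from diffusers import SanaPipeline, SanaTransformer2DModel
from diffusers.quantizers.quantization_config import NVIDIAModelOptConfig

ckpt = "Efficient-Large-Model/Sana_600M_1024px_diffusers"

# Quantize the transformer on the fly while loading it.
transformer = SanaTransformer2DModel.from_pretrained(
    ckpt,
    subfolder="transformer",
    quantization_config=NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt"),
    torch_dtype=torch.bfloat16,
)

# Plug the quantized transformer into the full pipeline.
pipe = SanaPipeline.from_pretrained(ckpt, transformer=transformer, torch_dtype=torch.bfloat16).to("cuda")
image = pipe("a photo of an astronaut riding a horse on the moon").images[0]
```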
Depends on

- this to support latest diffusers
- this to enable INT8 quantization
- this to enable NF4 quantization

Code
Following is a discussion on speedups while using real_quant with the NVIDIA team here.