Merged

Changes from all commits (57 commits)
- d7ca877: initial commit (ishan-modi, Mar 30, 2025)
- a016c56: update (ishan-modi, Mar 31, 2025)
- eb73ab0: updates (ishan-modi, Apr 1, 2025)
- 7fdb79e: update (ishan-modi, Apr 8, 2025)
- a83bb98: update (ishan-modi, Apr 8, 2025)
- 9d9f0b9: update (ishan-modi, Apr 10, 2025)
- 7b09750: update (ishan-modi, Apr 21, 2025)
- 4fe06ee: Merge branch 'main' into add-trtquant-backend (ishan-modi, Apr 23, 2025)
- 71d8a7e: update (ishan-modi, Apr 23, 2025)
- 6c74c69: update (ishan-modi, Apr 24, 2025)
- 10fb9fe: Merge branch 'main' into add-trtquant-backend (sayakpaul, Apr 29, 2025)
- 6c65138: addressed PR comments (ishan-modi, Apr 29, 2025)
- 4b32567: Merge remote-tracking branch 'origin/add-trtquant-backend' into add-t… (ishan-modi, Apr 29, 2025)
- 915dbf0: update (ishan-modi, Apr 30, 2025)
- 3336a08: Merge branch 'main' into add-trtquant-backend (sayakpaul, May 1, 2025)
- 1c470f2: Merge branch 'main' into add-trtquant-backend (sayakpaul, May 2, 2025)
- f823a2c: addressed PR comments (ishan-modi, May 6, 2025)
- e78841e: Merge branch 'add-trtquant-backend' of https://github.com/ishan-modi/… (ishan-modi, May 6, 2025)
- 8f88f29: update (ishan-modi, May 6, 2025)
- 212603f: update (ishan-modi, May 6, 2025)
- 24f1bcb: update (ishan-modi, May 6, 2025)
- 65097f1: update (ishan-modi, May 6, 2025)
- 97f94ae: update (ishan-modi, May 6, 2025)
- 752544f: update (ishan-modi, May 9, 2025)
- 415901f: Merge branch 'main' into add-trtquant-backend (ishan-modi, May 29, 2025)
- 482fe78: updates (ishan-modi, Jul 21, 2025)
- 488282f: Merge branch 'main' into add-trtquant-backend (ishan-modi, Jul 21, 2025)
- 88259c9: Merge branch 'huggingface:main' into add-trtquant-backend (ishan-modi, Aug 3, 2025)
- e51be6a: Merge branch 'main' into add-trtquant-backend (ishan-modi, Aug 15, 2025)
- d48835d: update (ishan-modi, Aug 16, 2025)
- 5c4a4ea: Merge branch 'main' into add-trtquant-backend (ishan-modi, Aug 16, 2025)
- 670202d: update (ishan-modi, Aug 16, 2025)
- 6dd903f: Merge branch 'add-trtquant-backend' of https://github.com/ishan-modi/… (ishan-modi, Aug 16, 2025)
- 3f672d3: Merge branch 'main' into add-trtquant-backend (sayakpaul, Aug 20, 2025)
- 64d018c: Merge branch 'main' into add-trtquant-backend (sayakpaul, Aug 20, 2025)
- 395e75b: addressed PR comments (ishan-modi, Aug 22, 2025)
- 9034661: Merge branch 'main' into add-trtquant-backend (sayakpaul, Aug 22, 2025)
- bbbc840: updates (ishan-modi, Aug 22, 2025)
- 2076783: Merge branch 'add-trtquant-backend' of https://github.com/ishan-modi/… (ishan-modi, Aug 22, 2025)
- c53d251: code formatting (ishan-modi, Aug 22, 2025)
- 1ddcc9c: update (ishan-modi, Aug 22, 2025)
- 5df6926: addressed PR comments (ishan-modi, Aug 22, 2025)
- 8439f01: Merge branch 'main' into add-trtquant-backend (ishan-modi, Aug 22, 2025)
- b96da23: Merge branch 'main' into add-trtquant-backend (ishan-modi, Aug 26, 2025)
- 0bf90b0: addressed PR comments (ishan-modi, Aug 26, 2025)
- b097f0f: Merge branch 'add-trtquant-backend' of https://github.com/ishan-modi/… (ishan-modi, Aug 26, 2025)
- cf054d2: addressed PR comments (ishan-modi, Aug 26, 2025)
- 0828f50: Merge branch 'main' into add-trtquant-backend (sayakpaul, Aug 27, 2025)
- 031298d: Merge branch 'main' into add-trtquant-backend (sayakpaul, Aug 27, 2025)
- f345325: Merge branch 'main' into add-trtquant-backend (sayakpaul, Aug 30, 2025)
- dd39595: addressed PR comments (ishan-modi, Sep 1, 2025)
- d66709b: Merge branch 'add-trtquant-backend' of https://github.com/ishan-modi/… (ishan-modi, Sep 1, 2025)
- 81f4785: Merge branch 'main' into add-trtquant-backend (sayakpaul, Sep 1, 2025)
- 8f60186: fix docs and dependencies (ishan-modi, Sep 1, 2025)
- 8daf21d: Merge branch 'add-trtquant-backend' of https://github.com/ishan-modi/… (ishan-modi, Sep 1, 2025)
- 1a8806f: fixed dependency test (ishan-modi, Sep 1, 2025)
- cb4e44b: Merge branch 'main' into add-trtquant-backend (sayakpaul, Sep 3, 2025)
3 changes: 3 additions & 0 deletions .github/workflows/nightly_tests.yml
@@ -340,6 +340,9 @@ jobs:
- backend: "optimum_quanto"
test_location: "quanto"
additional_deps: []
- backend: "nvidia_modelopt"
test_location: "modelopt"
additional_deps: []
runs-on:
group: aws-g6e-xlarge-plus
container:
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -188,6 +188,8 @@
title: torchao
- local: quantization/quanto
title: quanto
- local: quantization/modelopt
title: NVIDIA ModelOpt

- title: Model accelerators and hardware
isExpanded: false
141 changes: 141 additions & 0 deletions docs/source/en/quantization/modelopt.md
@@ -0,0 +1,141 @@
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# NVIDIA ModelOpt

[NVIDIA-ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, and speculative decoding. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.

Before you begin, make sure you have `nvidia_modelopt` installed.

```bash
pip install -U "nvidia_modelopt[hf]"
```

Quantize a model by passing [`NVIDIAModelOptConfig`] to [`~ModelMixin.from_pretrained`] (you can also load pre-quantized models). This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.

The example below only quantizes the weights to FP8.

```python
import torch
from diffusers import AutoModel, SanaPipeline, NVIDIAModelOptConfig

model_id = "Efficient-Large-Model/Sana_600M_1024px_diffusers"
dtype = torch.bfloat16

quantization_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")
transformer = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
)
pipe = SanaPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=dtype,
)
Comment on lines +33 to +44 (review thread)

Member: Let's prefer using `PipelineQuantizationConfig`.

Contributor (Author): I have kept it similar to all the other quantization docs (quanto, torchao, etc.); those docs use the backend-specific quant config, so can we keep it consistent with them for now?
pipe.to("cuda")

print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
).images[0]
image.save("output.png")
```

> **Note:**
>
> The quantization methods in NVIDIA-ModelOpt are designed to reduce the memory footprint of model weights using various QAT (Quantization-Aware Training) and PTQ (Post-Training Quantization) techniques while maintaining model performance. However, the actual performance gain during inference depends on the deployment framework (e.g., TRT-LLM, TensorRT) and the specific hardware configuration.
>
> More details can be found [here](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples).

## NVIDIAModelOptConfig

The `NVIDIAModelOptConfig` class accepts the following parameters (a combined example follows the list):
- `quant_type`: A string naming one of the quantization types listed below.
- `modules_to_not_convert`: A list of full or partial module names that should not be quantized. For example, to skip quantizing the [`SD3Transformer2DModel`]'s `pos_embed` projection, specify `modules_to_not_convert=["pos_embed.proj.weight"]`.
- `disable_conv_quantization`: A boolean that, when set to `True`, disables quantization for all convolutional layers in the model. This is useful because channel and block quantization (used with INT4, NF4, and NVFP4) generally do not work well for convolutional layers. To disable quantization only for specific convolutional layers, use `modules_to_not_convert` instead.
- `algorithm`: The algorithm used to determine quantization scales; defaults to `"max"`. See the ModelOpt documentation for other algorithms and details.
- `forward_loop`: A forward-loop function used to calibrate activations during quantization. If not provided, static scales are computed from the weights only.
- `kwargs`: A dict of extra keyword arguments passed to the underlying quantization method selected by `quant_type`.
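
A minimal sketch combining several of these parameters is shown below; apart from `quant_type` and `quant_method`, the values are illustrative and should be checked against the `NVIDIAModelOptConfig` docstring.

```python
from diffusers import NVIDIAModelOptConfig

# Sketch only: combines the parameters described above for illustration.
quantization_config = NVIDIAModelOptConfig(
    quant_type="FP8",
    quant_method="modelopt",
    # skip quantization for the positional-embedding projection (example module from above)
    modules_to_not_convert=["pos_embed.proj.weight"],
    # skip all convolutional layers in one go
    disable_conv_quantization=True,
    # scale-determination algorithm; "max" is the default
    algorithm="max",
)
```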

## Supported quantization types

ModelOpt supports weight-only, per-channel, and block quantization for INT8, FP8, INT4, NF4, and NVFP4. These methods are designed to reduce the memory footprint of the model weights while maintaining inference performance.

Weight-only quantization stores the model weights in a specific low-bit data type but performs computation with a higher-precision data type, like `bfloat16`. This lowers the memory requirements from model weights but retains the memory peaks for activation computation.

The quantization methods supported are as follows:

| **Quantization Type** | **Supported Schemes** | **Required Kwargs** | **Additional Notes** |
|-----------------------|-----------------------|---------------------|----------------------|
| **INT8** | `int8 weight only`, `int8 channel quantization`, `int8 block quantization` | `quant_type`, `quant_type + channel_quantize`, `quant_type + channel_quantize + block_quantize` | |
| **FP8** | `fp8 weight only`, `fp8 channel quantization`, `fp8 block quantization` | `quant_type`, `quant_type + channel_quantize`, `quant_type + channel_quantize + block_quantize` | |
| **INT4** | `int4 weight only`, `int4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | Only `channel_quantize = -1` is supported for now |
| **NF4** | `nf4 weight only`, `nf4 double block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize + scale_channel_quantize + scale_block_quantize` | Only `channel_quantize = -1` and `scale_channel_quantize = -1` are supported for now |
| **NVFP4** | `nvfp4 weight only`, `nvfp4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | Only `channel_quantize = -1` is supported for now |


Refer to the [official modelopt documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/) for a deeper understanding of the available quantization methods and the exhaustive list of configuration options.
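
For illustration, the sketch below maps the table onto a config for INT4 block quantization; it assumes `channel_quantize` and `block_quantize` are accepted as keyword arguments (as the table suggests) and uses a block size of 128 purely as an example.

```python
from diffusers import NVIDIAModelOptConfig

# Hypothetical INT4 block-quantization config following the table above; the
# block size (128) is an assumed example value.
int4_block_config = NVIDIAModelOptConfig(
    quant_type="INT4",
    quant_method="modelopt",
    channel_quantize=-1,   # per the table, -1 is the only supported value for INT4 today
    block_quantize=128,    # block size for block-wise weight quantization
)
```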

## Serializing and Deserializing quantized models

To serialize a quantized model in a given dtype, first load the model with the desired quantization dtype and then save it using the [`~ModelMixin.save_pretrained`] method.

```python
import torch
from diffusers import AutoModel, NVIDIAModelOptConfig
from modelopt.torch.opt import enable_huggingface_checkpointing

enable_huggingface_checkpointing()

model_id = "Efficient-Large-Model/Sana_600M_1024px_diffusers"
quant_config_fp8 = {"quant_type": "FP8", "quant_method": "modelopt"}
quant_config_fp8 = NVIDIAModelOptConfig(**quant_config_fp8)
model = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quant_config_fp8,
    torch_dtype=torch.bfloat16,
)
model.save_pretrained('path/to/sana_fp8', safe_serialization=False)
```

To load a serialized quantized model, use the [`~ModelMixin.from_pretrained`] method.

```python
import torch
from diffusers import AutoModel, NVIDIAModelOptConfig, SanaPipeline
from modelopt.torch.opt import enable_huggingface_checkpointing

enable_huggingface_checkpointing()

quantization_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")
transformer = AutoModel.from_pretrained(
    "path/to/sana_fp8",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_600M_1024px_diffusers",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")
prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
).images[0]
image.save("output.png")
```
2 changes: 2 additions & 0 deletions setup.py
@@ -132,6 +132,7 @@
"gguf>=0.10.0",
"torchao>=0.7.0",
"bitsandbytes>=0.43.3",
"nvidia_modelopt[hf]>=0.33.1",
"regex!=2019.12.17",
"requests",
"tensorboard",
@@ -244,6 +245,7 @@ def run(self):
extras["gguf"] = deps_list("gguf", "accelerate")
extras["optimum_quanto"] = deps_list("optimum_quanto", "accelerate")
extras["torchao"] = deps_list("torchao", "accelerate")
extras["nvidia_modelopt"] = deps_list("nvidia_modelopt[hf]")

if os.name == "nt": # windows
extras["flax"] = [] # jax is not supported on windows
21 changes: 21 additions & 0 deletions src/diffusers/__init__.py
@@ -13,6 +13,7 @@
is_k_diffusion_available,
is_librosa_available,
is_note_seq_available,
is_nvidia_modelopt_available,
is_onnx_available,
is_opencv_available,
is_optimum_quanto_available,
@@ -111,6 +112,18 @@
else:
_import_structure["quantizers.quantization_config"].append("QuantoConfig")

try:
if not is_torch_available() and not is_accelerate_available() and not is_nvidia_modelopt_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from .utils import dummy_nvidia_modelopt_objects

_import_structure["utils.dummy_nvidia_modelopt_objects"] = [
name for name in dir(dummy_nvidia_modelopt_objects) if not name.startswith("_")
]
else:
_import_structure["quantizers.quantization_config"].append("NVIDIAModelOptConfig")

try:
if not is_onnx_available():
raise OptionalDependencyNotAvailable()
@@ -795,6 +808,14 @@
else:
from .quantizers.quantization_config import QuantoConfig

try:
if not is_nvidia_modelopt_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from .utils.dummy_nvidia_modelopt_objects import *
else:
from .quantizers.quantization_config import NVIDIAModelOptConfig

try:
if not is_onnx_available():
raise OptionalDependencyNotAvailable()
1 change: 1 addition & 0 deletions src/diffusers/dependency_versions_table.py
@@ -39,6 +39,7 @@
"gguf": "gguf>=0.10.0",
"torchao": "torchao>=0.7.0",
"bitsandbytes": "bitsandbytes>=0.43.3",
"nvidia_modelopt[hf]": "nvidia_modelopt[hf]>=0.33.1",
"regex": "regex!=2019.12.17",
"requests": "requests",
"tensorboard": "tensorboard",
7 changes: 7 additions & 0 deletions src/diffusers/quantizers/auto.py
@@ -21,9 +21,11 @@

from .bitsandbytes import BnB4BitDiffusersQuantizer, BnB8BitDiffusersQuantizer
from .gguf import GGUFQuantizer
from .modelopt import NVIDIAModelOptQuantizer
from .quantization_config import (
BitsAndBytesConfig,
GGUFQuantizationConfig,
NVIDIAModelOptConfig,
QuantizationConfigMixin,
QuantizationMethod,
QuantoConfig,
@@ -39,6 +41,7 @@
"gguf": GGUFQuantizer,
"quanto": QuantoQuantizer,
"torchao": TorchAoHfQuantizer,
"modelopt": NVIDIAModelOptQuantizer,
}

AUTO_QUANTIZATION_CONFIG_MAPPING = {
@@ -47,6 +50,7 @@
"gguf": GGUFQuantizationConfig,
"quanto": QuantoConfig,
"torchao": TorchAoConfig,
"modelopt": NVIDIAModelOptConfig,
}


@@ -137,6 +141,9 @@ def merge_quantization_configs(
if isinstance(quantization_config, dict):
quantization_config = cls.from_dict(quantization_config)

if isinstance(quantization_config, NVIDIAModelOptConfig):
quantization_config.check_model_patching()

if warning_msg != "":
warnings.warn(warning_msg)

1 change: 1 addition & 0 deletions src/diffusers/quantizers/modelopt/__init__.py
@@ -0,0 +1 @@
from .modelopt_quantizer import NVIDIAModelOptQuantizer