
Conversation

@willg-nv commented on Dec 17, 2025:

What does this PR do?

Type of change: new feature

Overview: This PR integrates the automated QDQ placement tool into ModelOpt. It is part 3 of 4 and contains the following changes:

  1. Implements QDQAutotuner and the autotuner CLI interface.
  2. Implements benchmark classes to measure the end-to-end (E2E) inference time of QDQ models.
  3. Adds unit tests for the QDQ autotuner and its configuration.

Part 1: #701
Part 2: #702
Part 3: #703
Part 4: #704

Usage

python -m modelopt.onnx.quantization.autotune --model model.onnx

Testing

Implemented unit tests for QDQAutotuner and Config classes.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: Yes
  • Did you add or update any necessary documentation?: No, documentation will be added in part 4.
  • Did you update Changelog?: No, the changelog will be updated in part 4.

Additional Information

@willg-nv requested a review from a team as a code owner on December 17, 2025 06:56.
@copy-pr-bot (bot) commented on Dec 17, 2025:

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@willg-nv (Author) commented:

@vishalpandya1990 could you help me review this PR? thanks!

@vishalpandya1990 (Contributor) commented:

> @vishalpandya1990 could you help me review this PR? thanks!

Sorry for the delay. Added Ajinkya for review.

@willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part3 branch from 3454bba to 4b9d789 on December 31, 2025 02:09.
"--output",
"-o",
type=str,
default=DEFAULT_OUTPUT_DIR,
@gcunhase (Contributor) commented on Jan 8, 2026:

Can we update this behavior to match the ONNX quantization and Autocast workflows?

In those, if output_path is not given, the resulting model is saved next to the input model with a name extension. For example, the quantized model.onnx is saved as model.quant.onnx and the converted model is saved as model.fp16.onnx.

See more details in:
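
For illustration, a minimal sketch of that default behavior; the helper name and the ".autotuned" suffix are assumptions, not the actual ModelOpt utility:

    import os

    def default_output_path(onnx_path: str, suffix: str = "autotuned") -> str:
        # Derive an output path next to the input model,
        # e.g. "model.onnx" -> "model.autotuned.onnx" (suffix is illustrative).
        root, ext = os.path.splitext(onnx_path)
        return f"{root}.{suffix}{ext}"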

Contributor:

Suggestion: rename this as output_path to match other ONNX workflows.

Author:

FYI: I plan to create another PR to integrate the autotuner as a sub-command of modelopt.onnx.quantization. Users could then either 1) run the autotuner directly, or 2) autotune based on a PTQ model. After that, I think cli.py could be removed.

Contributor:

In the meantime, could we move cli.py to __main__.py and remove cli.py? Then the workflow would match quantization and offer autotune as a standalone feature, separate from quantization, for debugging purposes.
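
For illustration, a minimal sketch of such a __main__.py; it assumes the current CLI logic is relocated into this module (names are illustrative, not the actual layout):

    # modelopt/onnx/quantization/autotune/__main__.py (sketch)
    # `python -m modelopt.onnx.quantization.autotune` executes this module directly,
    # so the argument parsing currently in cli.py could live here unchanged.

    def main() -> None:
        ...  # existing cli.py argument parsing and autotune invocation

    if __name__ == "__main__":
        main()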

from modelopt.onnx.quantization.autotune.common import PatternCache, RegionType


def create_simple_conv_model():
Contributor:

Can we move this to tests/_test_utils/onnx/lib_test_models.py? Alternatively, NonSimplifiedModel or build_resnet_block could be used here instead?

@ajrasane WDYT?
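
If it helps, a rough sketch of what a shared Conv->Relu helper in lib_test_models.py could look like (shapes, names, and opset are illustrative and may differ from the helper in this PR):

    import numpy as np
    import onnx
    from onnx import TensorProto, helper, numpy_helper

    def create_simple_conv_model() -> onnx.ModelProto:
        # Tiny Conv -> Relu graph with a single weight initializer.
        x = helper.make_tensor_value_info("input", TensorProto.FLOAT, [1, 3, 32, 32])
        y = helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, 8, 32, 32])
        w = numpy_helper.from_array(
            np.random.randn(8, 3, 3, 3).astype(np.float32), name="conv_weight"
        )
        conv = helper.make_node(
            "Conv", ["input", "conv_weight"], ["conv_out"], name="conv", pads=[1, 1, 1, 1]
        )
        relu = helper.make_node("Relu", ["conv_out"], ["output"], name="relu")
        graph = helper.make_graph([conv, relu], "simple_conv", [x], [y], initializer=[w])
        return helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])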

Contributor:

I agree, let's keep this in a central place.

@willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part3 branch 2 times, most recently from 20ae533 to 99d3c0d on January 12, 2026 03:11.
@willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part3 branch 4 times, most recently from 202b3e2 to 8674964 on January 21, 2026 02:26.
Signed-off-by: Will Guo <willg@nvidia.com>
@willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part3 branch from 8674964 to b348350 on January 23, 2026 07:50.
ResolvedInsertionPoint,
)
from .region_pattern import RegionPattern
from .region_search import CombinedRegionSearch
@gcunhase (Contributor) commented on Jan 23, 2026:

Missing:

from .autotuner import QDQAutotuner, QDQAutotunerBase

"PatternCache",
"PatternSchemes",
"Region",
"RegionError",
Contributor:

Missing implementation of RegionError
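
A minimal definition would suffice here; this sketch assumes RegionError is meant to be the package's region-related exception type:

    class RegionError(Exception):
        """Raised when a region is malformed or a region operation cannot be completed."""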


for region_idx, region in enumerate(regions):
logger.info(
f"Region {region_idx + 1}/{len(regions)} (ID={region.id}, level={region.get_level()})"
Contributor:

Error:

modelopt/onnx/quantization/autotune/autotuner.py:193: error: "Region" has no attribute "get_level"  [attr-defined]
modelopt/onnx/quantization/autotune/autotuner.py:194: error: "Region" has no attribute "get_size"  [attr-defined]
modelopt/onnx/quantization/autotune/workflows.py:335: error: "Region" has no attribute "get_level"  [attr-defined]
modelopt/onnx/quantization/autotune/workflows.py:384: error: "Region" has no attribute "get_level"  [attr-defined]
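
One possible fix, assuming Region already stores its level and size as attributes (the attribute names are assumptions); alternatively, the call sites could read those attributes directly:

    class Region:
        ...  # existing fields and methods

        def get_level(self) -> int:
            # Assumes a `level` attribute set during region construction.
            return self.level

        def get_size(self) -> int:
            # Assumes a `size` attribute, e.g. the number of nodes in the region.
            return self.size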

else:
print("✓ QDQAutotuner (no schemes to test tracking)")
else:
self.skipTest("No regions discovered")
Contributor:

Can you also add a function to test the full autotuner workflow? ONNX -> InsertionPoints -> Quantized ONNX. Thanks.
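
For reference, a rough sketch of such an end-to-end test, reusing only calls already exercised in this file (create_simple_conv_model, initialize, generate, export_onnx); the final assertion may need a model with applicable schemes, per the export discussion below:

    def test_full_autotune_workflow(self):
        # ONNX -> insertion points -> quantized ONNX, end to end.
        import os
        import tempfile

        import onnx

        model = create_simple_conv_model()
        autotuner = QDQAutotuner(model)
        autotuner.initialize(self._create_test_config())

        # Propose at least one Q/DQ placement scheme before exporting.
        scheme_idx = autotuner.generate()
        assert isinstance(scheme_idx, int)

        with tempfile.TemporaryDirectory() as tmpdir:
            output_path = os.path.join(tmpdir, "model.quant.onnx")
            autotuner.export_onnx(output_path, insert_qdq=True)

            exported = onnx.load(output_path)
            qdq_nodes = [
                n
                for n in exported.graph.node
                if n.op_type in ("QuantizeLinear", "DequantizeLinear")
            ]
            assert qdq_nodes, "expected Q/DQ nodes in the exported model"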


if needs_fp8_conversion:
logger.debug("Converting INT8 to FP8")
model = int8_to_fp8(model)
Contributor:

Is this conversion function needed or can we insert Q/DQ nodes already at the correct precision?

Contributor:

Tried the following test code:

    def test_export_quantized_model(self):
        """Test exporting quantized model with Q/DQ."""
        model = create_simple_conv_model()
        autotuner = QDQAutotuner(model)
        config = self._create_test_config()
        autotuner.initialize(config)

        with open("/tmp/autotuner_model.quant.onnx", "w") as f:  # tempfile.NamedTemporaryFile(suffix=".onnx", delete=False) as f:
            output_path = f.name

        try:
            # Export baseline without Q/DQ insertion
            autotuner.export_onnx(output_path, insert_qdq=True)

            # Verify file was created
            assert os.path.exists(output_path)

            # Verify it's a valid ONNX model
            exported_model = onnx.load(output_path)
            assert exported_model is not None

            # Verify that it contains Q/DQ nodes
            qdq_nodes = [n for n in exported_model.graph.node if n.op_type in ["QuantizeLinear", "DequantizeLinear"]]
            assert qdq_nodes, "Q/DQ nodes not found in quantized model"

            print("✓ QDQAutotuner export quantized model")
        finally:
            print()
            # if os.path.exists(output_path):
            #     os.unlink(output_path)

But the simple Conv->Relu model didn't get quantized. Is this expected?

[modelopt][onnx] - DEBUG - Region 0 (level 0)
[modelopt][onnx] - DEBUG -   → Pattern signature: Conv->Relu
[modelopt][onnx] - DEBUG -   → No scheme available, skipping
[modelopt][onnx] - DEBUG - Matched 0/1 regions, total 0 unique insertion points
[modelopt][onnx] - DEBUG - Inserting 0 Q/DQ pairs into graph
[modelopt][onnx] - DEBUG - Serializing to ONNX format
[modelopt][onnx] - INFO - Exported INT8 model with 0 Q/DQ pairs  → /tmp/autotuner_model.quant.onnx
✓ QDQAutotuner export quantized model

scheme_idx = autotuner.generate()

# Should return a valid index (>= 0) or -1 if no more unique schemes
assert isinstance(scheme_idx, int)
Contributor:

What's the expected scheme_idx for create_simple_conv_model()? Please update this assert accordingly. Thanks.
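
For example, if the simple Conv->Relu model ends up with no applicable schemes (as the export log in the earlier comment suggests), the check could be tightened along these lines; the exact expected value is an assumption that needs confirming:

    scheme_idx = autotuner.generate()

    assert isinstance(scheme_idx, int)
    # Per the comment above: a valid index (>= 0) or -1 when no more unique schemes exist.
    # Replace with the confirmed expected value for create_simple_conv_model().
    assert scheme_idx >= -1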

@gcunhase (Contributor) commented:

Can we add a test file for workflows.py and potentially benchmark.py?

cli.py can be moved to __main__.py if the plan is for it to be available as a standalone feature and/or for debugging purposes. Unit tests might also need to be added for that.

# TensorRT Benchmark
trt_group = parser.add_argument_group("TensorRT Benchmark")
trt_group.add_argument(
"--use_trtexec",
@gcunhase (Contributor) commented on Jan 27, 2026:

The following CLI call fails to benchmark/quantize the model (it uses TensorRTPyBenchmark):

$ python -m modelopt.onnx.quantization.autotune --onnx_path=conv_relu.onnx

Error:

[modelopt][onnx] - ERROR - Benchmark instance not initialized
[modelopt][onnx] - INFO - Results: 3.73 ms → failed (invalid measurement)

This failure happens because pycuda was not installed. After installing that dependency, no error is thrown but the model is not quantized.

  • @ajrasane should we create another optional_dep in setup.py with autotune's dependencies?
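
Whichever way the dependency is packaged, a guarded import with an actionable message would make this failure clearer than "Benchmark instance not initialized"; a rough sketch:

    try:
        import pycuda.autoinit  # noqa: F401
        import pycuda.driver as cuda  # noqa: F401
    except ImportError as err:
        raise RuntimeError(
            "TensorRTPyBenchmark requires pycuda. Install it (e.g. `pip install pycuda`) "
            "or rerun with --use_trtexec."
        ) from err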

@gcunhase (Contributor) commented on Jan 27, 2026:

If --use_trtexec is used, autotune does not fail but also doesn't generate a quantized model.

This is due to Latency being used as a measurement instead of GPU Compute Time.

Contributor:

If it is just pycuda, we can probably just include this in the modelopt onnx dependencies. But if we have more dependencies, then it would be better to create a new section in setup.py with autotune dependencies.
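
If a dedicated section is preferred, a sketch of the setup.py entry (the extra's name and contents are illustrative):

    # setup.py (sketch): optional dependencies for the autotune feature
    extras_require = {
        # ... existing extras ...
        "onnx-autotune": [
            "pycuda",  # required by TensorRTPyBenchmark
        ],
    }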

Contributor:

@willg-nv how should we approach the tensorrt / trtexec requirements for autotune? Are we just adding a disclaimer for the user in the README or adding that in setup.py?

Comment on lines +298 to +302
# "[I] Latency: min = X ms, max = Y ms, mean = Z ms, median = W ms, ..."
output = result.stdout

# Look for median latency in the main "[I] Latency:" line
pattern = r"\[I\]\s+Latency:.*?median\s*=\s*([\d.]+)\s*ms"
@gcunhase (Contributor) commented on Jan 27, 2026:

For measurements equivalent to TensorRTPyBenchmark in TrtExecBenchmark:

Suggested change:
-    # "[I] Latency: min = X ms, max = Y ms, mean = Z ms, median = W ms, ..."
-    output = result.stdout
-    # Look for median latency in the main "[I] Latency:" line
-    pattern = r"\[I\]\s+Latency:.*?median\s*=\s*([\d.]+)\s*ms"
+    # "[I] GPU Compute Time: min = X ms, max = Y ms, mean = Z ms, median = W ms, ..."
+    output = result.stdout
+    # Look for median GPU Compute Time in the main "[I] GPU Compute Time:" line
+    pattern = r"\[I\]\s+GPU Compute Time:.*?median\s*=\s*([\d.]+)\s*ms"
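
For reference, a quick sanity check of the suggested pattern against a representative trtexec log line (the sample numbers are made up):

    import re

    pattern = r"\[I\]\s+GPU Compute Time:.*?median\s*=\s*([\d.]+)\s*ms"
    sample = "[I] GPU Compute Time: min = 1.10 ms, max = 2.05 ms, mean = 1.32 ms, median = 1.28 ms"

    match = re.search(pattern, sample)
    assert match is not None
    median_ms = float(match.group(1))  # 1.28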

@gcunhase (Contributor) commented:

Can we add a test file for workflows.py and potentially benchmark.py?

cli.py can be moved to __main__.py if the plan is for it to be available as a standalone feature and/or for debugging purposes. Unit tests might also need to be added for that.

Suggestion for test_workflows.py: test_workflows.py

output_shape: tuple | None,
output_dtype: np.dtype,
quant_dtype: np.dtype,
quant_type: str,
@gcunhase (Contributor) commented on Jan 28, 2026:

Is this being used somewhere?
