Integrate Automated QDQ placement tool - Part 3 #703
base: main
Conversation
@vishalpandya1990 could you help me review this PR? thanks!

Sorry for the delay. Added Ajinkya for review.
Force-pushed from 3454bba to 4b9d789
| "--output", | ||
| "-o", | ||
| type=str, | ||
| default=DEFAULT_OUTPUT_DIR, |
Can we update this behavior to match the ONNX quantization and Autocast workflows? There, if output_path is not given, the resulting model is saved in the same directory as the input model with a name extension: for example, the quantized model.onnx is saved as model.quant.onnx, and the converted model is saved as model.fp16.onnx.
See more details in:
    if not output_path:
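A minimal sketch of the suggested default, assuming a hypothetical helper name (not an existing ModelOpt API); the suffix convention mirrors model.quant.onnx / model.fp16.onnx:

    # Hypothetical helper: derive the output path from the input path
    # when --output is omitted, e.g. model.onnx -> model.quant.onnx.
    import os

    def derive_default_output_path(onnx_path: str, suffix: str = "quant") -> str:
        base, ext = os.path.splitext(onnx_path)
        return f"{base}.{suffix}{ext}"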
Suggestion: rename this to output_path to match other ONNX workflows.
FYI: I plan to create another PR to integrate the autotuner as a sub-command of modelopt.onnx.quantization. Users could then 1) run the autotuner directly, or 2) autotune based on a PTQ model. After that, I think cli.py could be removed.
In the meantime, could we move the contents of cli.py to __main__.py and remove cli.py? Then the workflow would match quantization and offer autotune as a standalone feature, separate from quantization, for debugging purposes. A sketch of that layout follows.
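One possible shape for that move, assuming cli.py currently exposes parser construction and a main() entry point (names assumed, not confirmed against this PR):

    # Hypothetical __main__.py sketch: the argument parser and main() would
    # move here from cli.py, keeping
    # `python -m modelopt.onnx.quantization.autotune` working.
    import sys

    def main(argv: list[str] | None = None) -> int:
        # ... argument parsing and autotune invocation moved from cli.py ...
        return 0

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1:]))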
    from modelopt.onnx.quantization.autotune.common import PatternCache, RegionType


    def create_simple_conv_model():
Can we move this to tests/_test_utils/onnx/lib_test_models.py? Alternatively, could NonSimplifiedModel or build_resnet_block be used here instead? (See the sketch after this comment.)
@ajrasane WDYT?
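For context, a minimal Conv->Relu builder of the kind being moved might look like this; shapes and names are illustrative, not the PR's actual helper:

    # Illustrative sketch of a centralized test-model builder.
    import numpy as np
    import onnx
    from onnx import TensorProto, helper

    def create_simple_conv_model() -> onnx.ModelProto:
        """Build a minimal Conv -> Relu graph for autotuner tests."""
        x = helper.make_tensor_value_info("input", TensorProto.FLOAT, [1, 3, 32, 32])
        y = helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, 8, 32, 32])
        w = helper.make_tensor(
            "conv_w", TensorProto.FLOAT, [8, 3, 3, 3],
            np.random.rand(8, 3, 3, 3).astype(np.float32).flatten().tolist(),
        )
        conv = helper.make_node("Conv", ["input", "conv_w"], ["conv_out"],
                                pads=[1, 1, 1, 1], name="conv0")
        relu = helper.make_node("Relu", ["conv_out"], ["output"], name="relu0")
        graph = helper.make_graph([conv, relu], "simple_conv", [x], [y], [w])
        return helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])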
I agree, let's keep this in a central place.
Force-pushed from 20ae533 to 99d3c0d
Force-pushed from 202b3e2 to 8674964
Signed-off-by: Will Guo <willg@nvidia.com>
Force-pushed from 8674964 to b348350
        ResolvedInsertionPoint,
    )
    from .region_pattern import RegionPattern
    from .region_search import CombinedRegionSearch
Missing:

    from .autotuner import QDQAutotuner, QDQAutotunerBase

    "PatternCache",
    "PatternSchemes",
    "Region",
    "RegionError",
Missing implementation of RegionError
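A minimal sketch of what the missing class might look like, assuming RegionError is meant to be a plain exception type exported from common.py alongside Region:

    # Assumed shape; adjust the base class and fields to match how callers raise it.
    class RegionError(Exception):
        """Raised when a Region is malformed or a region operation fails."""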
    for region_idx, region in enumerate(regions):
        logger.info(
            f"Region {region_idx + 1}/{len(regions)} (ID={region.id}, level={region.get_level()})"
Error:

    modelopt/onnx/quantization/autotune/autotuner.py:193: error: "Region" has no attribute "get_level" [attr-defined]
    modelopt/onnx/quantization/autotune/autotuner.py:194: error: "Region" has no attribute "get_size" [attr-defined]
    modelopt/onnx/quantization/autotune/workflows.py:335: error: "Region" has no attribute "get_level" [attr-defined]
    modelopt/onnx/quantization/autotune/workflows.py:384: error: "Region" has no attribute "get_level" [attr-defined]
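One possible fix, assuming Region already stores its level and member nodes internally (the attribute names below are guesses):

    # Hedged sketch: add the accessors mypy expects to Region in common.py.
    class Region:
        def __init__(self, region_id: int, level: int, node_names: list[str]):
            self.id = region_id
            self._level = level            # assumed internal field
            self._node_names = node_names  # assumed internal field

        def get_level(self) -> int:
            """Nesting depth of this region in the region hierarchy."""
            return self._level

        def get_size(self) -> int:
            """Number of ONNX nodes contained in this region."""
            return len(self._node_names)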
| print("✓ QDQAutotuner (no schemes to test tracking)") | ||
| else: | ||
| self.skipTest("No regions discovered") |
Can you also add a function to test the full autotuner workflow? ONNX -> InsertionPoints -> Quantized ONNX. Thanks.
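A rough sketch of such a test, reusing the API calls exercised elsewhere in this PR (initialize/generate/export_onnx); the temp-path handling and the assumption that Conv->Relu yields at least one scheme would need to be confirmed:

    # Hedged sketch; not verified against the actual autotuner API surface.
    def test_full_autotune_workflow(self):
        """ONNX -> InsertionPoints -> Quantized ONNX."""
        import os
        import tempfile

        import onnx

        model = create_simple_conv_model()
        autotuner = QDQAutotuner(model)
        autotuner.initialize(self._create_test_config())

        # Pick a Q/DQ insertion scheme for the discovered regions.
        scheme_idx = autotuner.generate()
        assert scheme_idx >= 0, "expected at least one scheme for Conv->Relu"

        with tempfile.TemporaryDirectory() as tmp_dir:
            output_path = os.path.join(tmp_dir, "model.quant.onnx")
            autotuner.export_onnx(output_path, insert_qdq=True)
            exported = onnx.load(output_path)
            qdq = [
                n for n in exported.graph.node
                if n.op_type in ("QuantizeLinear", "DequantizeLinear")
            ]
            assert qdq, "Q/DQ nodes not found in quantized model"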
    if needs_fp8_conversion:
        logger.debug("Converting INT8 to FP8")
        model = int8_to_fp8(model)
Is this conversion function needed or can we insert Q/DQ nodes already at the correct precision?
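For reference, ONNX opset 19+ can express FP8 Q/DQ directly, since QuantizeLinear's output type follows the zero-point tensor's dtype; a hedged sketch, assuming an onnx version with float8 helper support (tensor names are illustrative):

    # Sketch only: a Q/DQ pair emitted at FP8 (E4M3) precision, rather than
    # inserting INT8 Q/DQ and converting afterwards.
    from onnx import TensorProto, helper

    scale = helper.make_tensor("act_scale", TensorProto.FLOAT, [], [0.02])
    zero_point = helper.make_tensor("act_zp", TensorProto.FLOAT8E4M3FN, [], [0.0])

    q_node = helper.make_node(
        "QuantizeLinear", ["act", "act_scale", "act_zp"], ["act_q"], name="act_q_node"
    )
    dq_node = helper.make_node(
        "DequantizeLinear", ["act_q", "act_scale", "act_zp"], ["act_dq"], name="act_dq_node"
    )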
Tried the following test code:
    def test_export_quantized_model(self):
        """Test exporting quantized model with Q/DQ."""
        model = create_simple_conv_model()
        autotuner = QDQAutotuner(model)
        config = self._create_test_config()
        autotuner.initialize(config)
        with open("/tmp/autotuner_model.quant.onnx", "w") as f:  # tempfile.NamedTemporaryFile(suffix=".onnx", delete=False) as f:
            output_path = f.name
        try:
            # Export with Q/DQ insertion enabled
            autotuner.export_onnx(output_path, insert_qdq=True)
            # Verify file was created
            assert os.path.exists(output_path)
            # Verify it's a valid ONNX model
            exported_model = onnx.load(output_path)
            assert exported_model is not None
            # Verify that it contains Q/DQ nodes
            qdq_nodes = [
                n for n in exported_model.graph.node
                if n.op_type in ["QuantizeLinear", "DequantizeLinear"]
            ]
            assert qdq_nodes, "Q/DQ nodes not found in quantized model"
            print("✓ QDQAutotuner export quantized model")
        finally:
            print()
            # if os.path.exists(output_path):
            #     os.unlink(output_path)

But the simple Conv->Relu model didn't get quantized. Is this expected?
    [modelopt][onnx] - DEBUG - Region 0 (level 0)
    [modelopt][onnx] - DEBUG - → Pattern signature: Conv->Relu
    [modelopt][onnx] - DEBUG - → No scheme available, skipping
    [modelopt][onnx] - DEBUG - Matched 0/1 regions, total 0 unique insertion points
    [modelopt][onnx] - DEBUG - Inserting 0 Q/DQ pairs into graph
    [modelopt][onnx] - DEBUG - Serializing to ONNX format
    [modelopt][onnx] - INFO - Exported INT8 model with 0 Q/DQ pairs → /tmp/autotuner_model.quant.onnx
    ✓ QDQAutotuner export quantized model

    scheme_idx = autotuner.generate()

    # Should return a valid index (>= 0) or -1 if no more unique schemes
    assert isinstance(scheme_idx, int)
What's the expected scheme_idx for create_simple_conv_model()? Please update this assert accordingly. Thanks.
Can we add a test file for …
    # TensorRT Benchmark
    trt_group = parser.add_argument_group("TensorRT Benchmark")
    trt_group.add_argument(
        "--use_trtexec",
The following CLI fails to benchmark / quantize the model (this uses TensorRTPyBenchmark):

    $ python -m modelopt.onnx.quantization.autotune --onnx_path=conv_relu.onnx

Error:

    [modelopt][onnx] - ERROR - Benchmark instance not initialized
    [modelopt][onnx] - INFO - Results: 3.73 ms → failed (invalid measurement)

This failure happens because pycuda was not installed. After installing that dependency, no error is thrown, but the model is still not quantized.
@ajrasane should we create another optional_dep in setup.py with autotune's dependencies? (A sketch follows.)
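A hypothetical shape for that section, assuming pycuda is the only extra runtime dependency; the exact extras name and pins would follow the repo's existing setup.py conventions:

    # Sketch of an extras_require entry; the package list is an assumption.
    extras_require = {
        "onnx-autotune": [
            "pycuda",  # needed by TensorRTPyBenchmark
        ],
    }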
If --use_trtexec is used, autotune does not fail but also doesn't generate a quantized model.
This is due to Latency being used as a measurement instead of GPU Compute Time.
If it is just pycuda, we can probably just include this in the modelopt onnx dependencies. But if we have more dependencies, then it would be better to create a new section in setup.py with autotune dependencies.
@willg-nv how should we approach the tensorrt / trtexec requirements for autotune? Are we just adding a disclaimer for the user in the README or adding that in setup.py?
| # "[I] Latency: min = X ms, max = Y ms, mean = Z ms, median = W ms, ..." | ||
| output = result.stdout | ||
|
|
||
| # Look for median latency in the main "[I] Latency:" line | ||
| pattern = r"\[I\]\s+Latency:.*?median\s*=\s*([\d.]+)\s*ms" |
For measurements equivalent to TensorRTPyBenchmark in TrtExecBenchmark, replace:

    # "[I] Latency: min = X ms, max = Y ms, mean = Z ms, median = W ms, ..."
    output = result.stdout
    # Look for median latency in the main "[I] Latency:" line
    pattern = r"\[I\]\s+Latency:.*?median\s*=\s*([\d.]+)\s*ms"

with:

    # "[I] GPU Compute Time: min = X ms, max = Y ms, mean = Z ms, median = W ms, ..."
    output = result.stdout
    # Look for median GPU Compute Time in the main "[I] GPU Compute Time:" line
    pattern = r"\[I\]\s+GPU Compute Time:.*?median\s*=\s*([\d.]+)\s*ms"
    output_shape: tuple | None,
    output_dtype: np.dtype,
    quant_dtype: np.dtype,
    quant_type: str,
Is this being used somewhere?
What does this PR do?
Type of change: new feature
Overview: This PR integrates the automated QDQ placement tool into ModelOpt. It is part 3 of 4 of the change:
Part 1: #701
Part 2: #702
Part 3: #703
Part 4: #704
Usage
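Based on the CLI exercised in the review thread, a minimal invocation presumably looks like the following (flags such as --use_trtexec are available for benchmarking; the model name is illustrative):

    $ python -m modelopt.onnx.quantization.autotune --onnx_path=model.onnx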
Testing
Implemented unit tests for QDQAutotuner and Config classes.
Before your PR is "Ready for review"
Additional Information