Skip to content

Conversation

@willg-nv
Copy link

@willg-nv willg-nv commented Dec 17, 2025

What does this PR do?

Type of change: new feature

Overview: This PR integrate automated Q/DQ placement tool to ModelOpt. This PR is 2/4 parts of the cahnges.

Part 1: #701
Part 2: #702
Part 3: #703
Part 4: #704

This PR contains the following changes:

  1. Implement RegionPattern to represent the topology structure of Regions. InsertionPoints are also defined on RegionPattern. Regions with same pattern are optimized at the same time
  2. Implement RegionSearch class to divide ONNX graph into small regions
  3. RegionSearch python file also provides an entry point to print out the region structures.
  4. Unit tests for new classse.

Usage

python -m modelopt.onnx.quantization.autotune.region_search --model model.onnx --verbose

Example output:

    ├─ Region 212 (Level 0, Type: COMPOSITE)
    │  ├─ Direct nodes: 0
    │  ├─ Total nodes (recursive): 9
    │  ├─ Children: 1
    │  ├─ Inputs: 3 tensors
    │  │    - xxx
    │  │    - xxx
    │  │    - xxx
    │  └─ Outputs: 1 tensors
    │       - xxx
    │
    │  Child regions:
    │
      ├─ Region 209 (Level 2, Type: LEAF) 
      │  ├─ Direct nodes: 9
      │  ├─ Total nodes (recursive): 9
      │  ├─ Children: 0
      │  ├─ Inputs: 11 tensors
      │  │    - xxx

Testing

Implemented unit tests for new classes. All unit tests could get pass locally.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: Yes
  • Did you add or update any necessary documentation?: No, document change will be in part 4.
  • Did you update Changelog?: No. Change log will be included in part 4.

Additional Information

Summary by CodeRabbit

  • New Features

    • Added comprehensive ONNX quantization autotuning framework with pattern-based Quantize/Dequantize insertion optimization for improved TensorRT latency.
    • Added region inspection tool to analyze graph structure and quantizable patterns.
    • Added pattern caching and matching to identify and reuse structurally similar regions.
  • Chores

    • Added utility functions for graph analysis and symmetric operation detection.
    • Added extensive unit test coverage for autotuning components.

✏️ Tip: You can customize this high-level summary in your review settings.

@willg-nv willg-nv requested a review from a team as a code owner December 17, 2025 06:29
@willg-nv willg-nv requested a review from ajrasane December 17, 2025 06:29
@copy-pr-bot
Copy link

copy-pr-bot bot commented Dec 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@willg-nv willg-nv changed the title Dev willg integrate auto qdq placement part2 Integrate Automated QDQ placement tool - Part 2 Dec 17, 2025
@willg-nv
Copy link
Author

Hi @ajrasane , could you help me review this PR, thanks!

@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part2 branch from 3f7ff31 to d3a6765 Compare December 31, 2025 02:16
@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part2 branch 2 times, most recently from 616285d to c95939a Compare January 8, 2026 08:35
@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part2 branch 3 times, most recently from 4468ca2 to bc87ca7 Compare January 9, 2026 05:02
@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part2 branch 2 times, most recently from 6f809d7 to a933107 Compare January 15, 2026 09:27
@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part2 branch 4 times, most recently from 8c8e685 to dc3ef86 Compare January 16, 2026 07:32
@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part2 branch 5 times, most recently from 3de8e6f to 4d11c3d Compare January 23, 2026 03:35
Signed-off-by: Will Guo <willg@nvidia.com>
@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part2 branch from 4d11c3d to 29f08db Compare January 26, 2026 05:57
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 26, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

  • 🔍 Trigger a full review
📝 Walkthrough

Walkthrough

This PR introduces a comprehensive Q/DQ autotuning framework for ONNX quantization. It adds hierarchical region discovery, pattern-based region grouping, insertion point scheme generation, caching mechanisms, and supporting utilities for analyzing ONNX graphs and identifying quantized tensors. The framework enables systematic exploration of different Q/DQ placement strategies optimized for TensorRT latency.

Changes

Cohort / File(s) Summary
ONNX Autotuning Core Framework
modelopt/onnx/quantization/autotune/common.py
modelopt/onnx/quantization/autotune/region_pattern.py
modelopt/onnx/quantization/autotune/region_search.py
Foundational data structures (Region, InsertionScheme, PatternSchemes, PatternCache, Config, exception hierarchy); region pattern signature generation with matching and insertion scheme utilities; hierarchical region discovery via two-phase search (bottom-up partitioning + top-down refinement). Dense logic with graph traversal, signature computation, and pattern matching.
Autotuning Package Interface
modelopt/onnx/quantization/autotune/__init__.py
Package-level re-exports of 16+ public symbols (error types, region classes, insertion points, pattern utilities, config). Establishes public API surface for the autotuning framework.
Region Analysis & Inspection
modelopt/onnx/quantization/autotune/region_inspect.py
CLI and programmatic API for inspecting region search results; supports model loading, region discovery, filtering, and detailed logging with coverage statistics.
Quantization Utilities
modelopt/onnx/quantization/graph_utils.py
modelopt/onnx/quantization/qdq_utils.py
New utilities: get_tensor_consumer_node_indices() for building consumer mappings; get_quantized_tensors() for identifying Q/DQ-affected tensors in models.
ONNX Operations
modelopt/onnx/op_types.py
New get_symmetric_ops() function returning set of commutative/symmetric ONNX operations.
Unit Tests
tests/unit/onnx/quantization/autotune/test_pattern_cache.py
tests/unit/onnx/quantization/autotune/test_region_pattern.py
tests/unit/onnx/quantization/autotune/test_region_search.py
Comprehensive test suites for PatternCache (serialization, merging, scheme selection), RegionPattern (matching, tree formatting, symmetry handling), and region search classes (partitioning, composite building, print tree utilities).

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client Code
    participant CRS as CombinedRegionSearch
    participant Partitioner as RegionPartitioner
    participant Builder as TopDownRegionBuilder
    participant Pattern as RegionPattern
    participant Cache as PatternCache
    participant Config as Config

    Client->>Config: create Config(max_seq_size, thresholds)
    Config-->>Client: config instance

    Client->>CRS: create CombinedRegionSearch(graph, config)
    activate CRS
    
    Client->>CRS: search_regions()
    activate Partitioner
    Partitioner->>Partitioner: partition_graph() - Phase 1
    Note over Partitioner: Bottom-up: identify leaf regions<br/>from divergence/convergence
    Partitioner-->>CRS: initial regions
    deactivate Partitioner

    activate Builder
    Builder->>Builder: build_composite_region() - Phase 2
    Note over Builder: Top-down: create hierarchy,<br/>split sequences, merge converged
    Builder-->>CRS: refined region tree
    deactivate Builder
    
    CRS-->>Client: list[Region]
    deactivate CRS

    Client->>Pattern: RegionPattern.from_region(region, graph)
    activate Pattern
    Pattern->>Pattern: compute_structural_signature(graph)
    Pattern-->>Client: RegionPattern
    deactivate Pattern

    Client->>Cache: PatternCache()
    activate Cache
    Cache-->>Client: cache instance
    deactivate Cache

    Client->>Cache: add_pattern_from_region(region, graph)
    Cache->>Pattern: (implicitly uses from_region)
    Note over Cache: Store pattern with<br/>insertion schemes
    Cache-->>Client: void

    Client->>Cache: get_pattern_schemes(pattern_signature)
    Cache-->>Client: PatternSchemes | None

    Note over Client,Cache: Workflow: Search → Pattern → Cache → Config
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly identifies the primary change: a new automated QDQ placement tool being integrated. It is part 2 of a multi-part feature, directly aligned with the substantial additions to the codebase (Region pattern matching, region search, insertion points, and comprehensive configuration framework).
Docstring Coverage ✅ Passed Docstring coverage is 97.40% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🤖 Fix all issues with AI agents
In `@modelopt/onnx/quantization/autotune/__init__.py`:
- Around line 78-95: The __all__ list in modelopt.onnx.quantization.autotune
includes "RegionError" which does not exist; remove "RegionError" from the
__all__ declaration so the public export list only references actual symbols
(e.g., keep entries like "Region", "RegionType", "RegionPattern" but delete
"RegionError"). Verify the __all__ variable (named __all__) is updated
accordingly and no other code expects a RegionError symbol.

In `@modelopt/onnx/quantization/autotune/common.py`:
- Around line 556-575: The current similarity-pruning loop around
self.minimum_distance can leave multiple near-duplicates because it breaks out
after removing only the first similar existing_scheme; update the logic in the
block handling filtered_schemes so that when a new scheme is compared you
continue scanning all existing entries (removing any worse existing schemes
whose distance < self.minimum_distance) before deciding to append or skip the
new scheme. Concretely, in the loop that iterates sorted_schemes, replace the
single-break behavior with logic that either (a) iterates over a copy of
filtered_schemes and removes all existing_scheme where
scheme.distance(existing_scheme) < self.minimum_distance and scheme.latency_ms <
existing_scheme.latency_ms, or (b) after removing an existing_scheme, continue
checking the remaining filtered_schemes (do not break) and set a flag to skip
appending only if any remaining existing_scheme is better; use the existing
symbols filtered_schemes, scheme.distance(...), and latency_ms to implement
this.
- Around line 314-320: The repository is missing the
modelopt.onnx.quantization.autotune.insertion_points module which defines
NodeInputInsertionPoint, ChildRegionInputInsertionPoint, and
RegionOutputInsertionPoint, causing imports in common.py, region_pattern.py,
region_inspect.py and the package __init__.py to fail; implement a new
insertion_points module that declares these three classes (with the expected
attributes and equality/hash behavior used by InsertionScheme.distance and the
rest of the autotune code), export them from the autotune package __init__, and
ensure the classes provide the iterable/identifier properties used in set()
operations so existing code (e.g., InsertionScheme.distance) can compute
symmetric differences without further changes.

In `@modelopt/onnx/quantization/autotune/region_inspect.py`:
- Around line 140-142: The function currently computes a filtered list named
all_regions but returns the original regions variable, leading callers to
receive unfiltered data; modify the function to return all_regions instead of
regions (replace the final "return regions" with "return all_regions") so
callers get the filtered/child-trimmed set, and update any surrounding logging
(e.g., the logger.debug line) or docstring if you intentionally want to keep the
original behavior.
- Around line 91-94: The loop in region_inspect.py iterates
region.get_children() and calls region.remove_child(child), which mutates the
children collection; change the iteration to traverse a stable copy (e.g.,
iterate over list(region.get_children()) or similar) so removals don't affect
the ongoing iteration, leaving the logic using has_quantizable_operations(child,
graph) and region.remove_child(child) unchanged.

In `@modelopt/onnx/quantization/autotune/region_search.py`:
- Around line 451-458: The code removes start_node_idx from visited_nodes but
never adds it back, leaving the divergent start node unassigned; change the
logic to explicitly include start_node_idx into the region (call
_append_node_to_region(start_node_idx) and
self.visited_nodes.add(start_node_idx)) before processing the rest, and replace
visited_nodes.remove(start_node_idx) with visited_nodes.discard(start_node_idx)
to avoid KeyError; keep the existing checks for
_is_node_divergent(converge_node_idx) and the subsequent
_append_node_to_region(converge_node_idx) and
_build_sequence_from_node(converge_node_idx,
max_nodes=MAX_PROBE_STEPS_AFTER_CONVERGE).

In `@modelopt/onnx/quantization/graph_utils.py`:
- Around line 305-319: The function get_tensor_consumer_node_indices mishandles
inputs when graph is an onnx.GraphProto because node.input yields string names,
not tensor objects; update the implementation to handle both cases by extracting
the tensor name with something like name = tensor if isinstance(tensor, str)
else tensor.name (or use getattr(tensor, "name", tensor)), and use that name as
the key when populating tensor_consumer_map; also update the docstring to
mention that the function accepts both onnx.GraphProto and onnx-graphsurgeon
gs.Graph.

In `@modelopt/onnx/quantization/qdq_utils.py`:
- Around line 1041-1078: The docstring for get_quantized_tensors incorrectly
says it identifies QuantizeLinear nodes while the implementation scans
DequantizeLinear; update the docstring text (description, Args/Returns and any
examples) to state that it identifies DequantizeLinear nodes and that the
returned set contains tensor names that are inputs to DequantizeLinear nodes
(i.e., tensors being dequantized), and adjust any example wording that mentions
QuantizeLinear accordingly so docstring matches the actual behavior of
get_quantized_tensors.
🧹 Nitpick comments (8)
tests/unit/onnx/quantization/autotune/test_pattern_cache.py (1)

27-28: Consider removing sys.path manipulation.

The sys.path.insert is typically unnecessary when the package is properly installed (e.g., via pip install -e .). This pattern can cause import issues in different environments and makes tests less portable. If tests run via pytest with proper package installation, this line can be removed.

tests/unit/onnx/quantization/autotune/test_region_pattern.py (2)

29-30: Consider removing sys.path manipulation.

Same as in other test files - this pattern is unnecessary with proper package installation.


310-314: Consider removing or guarding debug print statements.

These print statements produce output during test runs which can clutter test output. Consider either removing them or guarding with a verbosity flag if the visual output is needed for debugging.

Suggested change
-        print("\n" + "=" * 60)
-        print("Region Tree Structure:")
-        print("=" * 60)
-        print(tree_output)
-        print("=" * 60)
+        # Uncomment for debugging:
+        # print("\n" + "=" * 60)
+        # print("Region Tree Structure:")
+        # print("=" * 60)
+        # print(tree_output)
+        # print("=" * 60)
tests/unit/onnx/quantization/autotune/test_region_search.py (2)

28-29: Consider removing sys.path manipulation.

Same pattern as other test files - unnecessary with proper package installation.


413-417: Consider removing debug print statements.

These print statements produce output during test runs. Consider removing or commenting them out to keep test output clean.

Suggested change
         result = output.getvalue()
-        print("\n" + "=" * 60)
-        print("Region Tree Structure:")
-        print("=" * 60)
-        print(result)
-        print("=" * 60)
 
         assert "Region" in result
modelopt/onnx/quantization/autotune/region_pattern.py (3)

204-214: Repeated list(graph.nodes) conversion in recursive function.

list(graph.nodes) is called on every recursive invocation of _compute_signature_recursive. This creates a new list each time, which is inefficient for deep hierarchies. Consider passing the nodes list as a parameter or computing it once at the entry point.

Proposed refactor
+    `@classmethod`
+    def from_region(cls, region: Region, graph: gs.Graph) -> "RegionPattern":
+        """Compute a structural pattern for a region."""
+        nodes_list = list(graph.nodes)
+        signature_str = cls._compute_signature_recursive(region, graph, nodes_list)
+        total_size = len(region.get_region_nodes_and_descendants())
+        return cls(signature_str, total_size)

     `@staticmethod`
-    def _compute_signature_recursive(region: Region, graph: gs.Graph) -> str:
+    def _compute_signature_recursive(
+        region: Region, graph: gs.Graph, nodes_list: list | None = None
+    ) -> str:
         """Recursively compute structural signature for a region."""
-        nodes_list = list(graph.nodes)
+        if nodes_list is None:
+            nodes_list = list(graph.nodes)
         node_indices_set = set(region.get_nodes())

Then update the recursive call on line 225:

         child_sigs = "+".join(
-            [RegionPattern._compute_signature_recursive(child, graph) for child in sorted_children]
+            [RegionPattern._compute_signature_recursive(child, graph, nodes_list) for child in sorted_children]
         )

166-169: Using assert for runtime validation may be stripped in optimized mode.

The assert statement on line 169 validates pattern matching, but asserts are stripped when Python runs with -O flag. For production validation that should always run, consider raising an explicit exception.

Proposed fix
     def get_full_insertion_scheme(self, region: Region, graph: gs.Graph) -> InsertionScheme:
         """Get all possible insertion points for a region in a single InsertionScheme."""
         region_pattern = RegionPattern.from_region(region, graph)
-        assert self == region_pattern, "Region pattern mismatch"
+        if self != region_pattern:
+            raise ValueError(
+                f"Region pattern mismatch: expected {self.signature}, got {region_pattern.signature}"
+            )

224-231: Minor redundancy in signature building.

When there are no node_ops, line 231 rebuilds the joined string with '+'.join(child_sigs) even though child_sigs was already joined on line 224. This is a minor inefficiency.

Proposed fix
-        child_sigs = "+".join(
-            [RegionPattern._compute_signature_recursive(child, graph) for child in sorted_children]
-        )
+        child_sig_list = [
+            RegionPattern._compute_signature_recursive(child, graph) for child in sorted_children
+        ]
+        child_sigs = "+".join(child_sig_list)

         if node_ops:
             node_sig = "->".join(node_ops)
             return f"COMPOSITE({node_sig}|{child_sigs})"
-        return f"COMPOSITE({'+'.join(child_sigs)})"
+        return f"COMPOSITE({child_sigs})"

@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part2 branch 2 times, most recently from 4e0f4e1 to af26a85 Compare January 26, 2026 06:49
Signed-off-by: Will Guo <willg@nvidia.com>
@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part2 branch from af26a85 to 559d12c Compare January 26, 2026 07:00
gcunhase added a commit to gcunhase/TensorRT-Model-Optimizer that referenced this pull request Jan 26, 2026
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part2 branch from e7ad869 to f630313 Compare January 30, 2026 05:42
Signed-off-by: Will Guo <willg@nvidia.com>
@willg-nv willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part2 branch from f630313 to 4afccc4 Compare January 30, 2026 05:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants