milestone(qnn): Qnn AOT #624
…onsistent quantization across models, enhancing accuracy in AOT runs.
📝 Walkthrough

Adds explicit QNN backend and allocator shutdown/lifecycle management, updates uint8 per-tensor symmetric quantization zero_point expectations from 0 to 128 across examples and parsers, enforces ordered destructor/cleanup for AOT runtime components, and adjusts token-generator KV/attention-mask initialization.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor App
    participant Context
    participant QNNBackend
    participant QNNAllocator
    participant DynLib as "QNN HTP/System libs\n(dl handles)"
    Note over App,Context: Shutdown sequence (high-level)
    App->>Context: shutdownContext()
    Context->>Context: isQnnAvailable? true
    Context->>Context: shutdownBackend(kQNN)
    Context->>QNNBackend: request shutdown/destruction
    QNNBackend->>QNNAllocator: shutdown()
    QNNAllocator->>QNNAllocator: deregister memory entries
    QNNAllocator->>QNNAllocator: free rpcmem allocations
    QNNAllocator->>DynLib: dlclose(libCdspHandle_) (nullify)
    QNNBackend->>DynLib: dlclose(qnnHtpLibHandle_) / dlclose(qnnSystemLibHandle_)
    QNNBackend->>QNNBackend: clear models, perf shutdown
    Context-->>App: shutdown complete
```
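The ordering in the diagram can be sketched in a few lines; the class and method names below are illustrative stand-ins, not the actual mllm/QNN API:

```python
# Hypothetical sketch of the teardown ordering above: allocator shutdown
# happens first (while the backend is still alive), model teardown next,
# and dynamic-library handles are closed last. Names are illustrative only.
shutdown_log = []

class FakeAllocator:
    def shutdown(self):
        # deregister memory entries + free rpcmem allocations
        shutdown_log.append("allocator.shutdown")

class FakeBackend:
    def __init__(self):
        self.allocator = FakeAllocator()
        self.models = ["model0"]

    def destroy(self):
        self.allocator.shutdown()            # 1. needs QNN still alive
        self.models.clear()                  # 2. tensor frees are now no-ops
        shutdown_log.append("models.cleared")
        shutdown_log.append("dlclose.libs")  # 3. close dl handles last

backend = FakeBackend()
backend.destroy()
```

If the dlclose happened before the allocator shutdown, the rpcmem free calls would target an unloaded library, which is exactly the ordering hazard the diagram encodes.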
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In `@examples/qwen3_qnn_aot/modeling_qwen_qnn_aot_sha.hpp`:
- Around line 358-366: The K/V quantization paths for Qwen3 differ from
Qwen2/Llama—K uses key_states_per_head[h].to(kUInt8PerTensorSym) then
ptq::QDQ_KV, and V uses ptq::QDQ(...) then .to(kUInt8PerTensorSym) then
ptq::QDQ_KV—while other SHA implementations convert via kFloat32 first; either
make this explicit or align the flows: update the K and V flows (the code around
key_states_per_head[h], ptq::QDQ, and ptq::QDQ_KV) to include the intermediate
.to(kFloat32) conversion like the other models, or add a concise comment above
these snippets explaining why Qwen3 intentionally omits kFloat32 (e.g., expected
input dtype/performance reason) so reviewers understand the divergence.
In `@mllm/backends/qnn/aot/visitor/CastType.cpp`:
- Line 67: Update the inline comment in CastType.cpp to use the required TODO
prefix format: replace the existing comment "// TODO, Cast usage is error." with
the guideline-compliant "// TODO: Cast usage is error." so the TODO uses the
exact "TODO:" prefix (look for the comment near the CastType visitor
implementation in CastType.cpp).
- Around line 64-65: Update the TODO comment in CastType.cpp from "// TODO, Cast
usage is error." to "// TODO: Cast usage is error." to match required TODO
format; then ensure the int→int branch that sets qnn_op_type = "Convert" (where
isInt(input_dtype) && isInt(target_dtype) is true) correctly handles quantized
integer tensors by detecting per-tensor/per-channel quantization (via the same
checks used by isInt or any quantization helpers) and either perform a
requantize/recalibration step instead of a blind Convert or insert
dequantize→convert→requantize logic so differing scale/zero-point parameters are
reconciled before/after Convert.
In `@mllm/backends/qnn/QNNAllocator.cpp`:
- Around line 16-35: The constructors of QNNAllocator set rpcmem_alloc,
rpcmem_free, and rpcmem_to_fd via dlsym without checking for nullptr; update
both QNNAllocator() and QNNAllocator(QNN_INTERFACE_VER_TYPE, void*) to validate
each dlsym result (rpcmem_alloc, rpcmem_free, rpcmem_to_fd) after assignment,
and if any is nullptr call MLLM_ERROR_EXIT (or MLLM_RT_ASSERT if preferred) with
a clear message that includes dlerror() context; reference the libCdspHandle_
loading logic and use the same error-handling pattern used for dlopen to fail
fast when symbols are missing.
In `@mllm/backends/qnn/QNNBackend.cpp`:
- Around line 80-121: The destructor QNNBackend::~QNNBackend calls
runtime_->qnnInterface.contextFree(context_, nullptr) unconditionally but
context_ may be null if createContext()/loadContext() were never called; guard
the call by checking context_ (and that runtime_ is non-null) before invoking
contextFree so you don't pass a null handle into the QNN API, and ensure you
still set context_ = nullptr and reset runtime_ afterwards.
In `@mllm/backends/qnn/QNNUtils.hpp`:
- Around line 46-49: Header QNNUtils.hpp declares std::pair<bool, void*> for
functions loadQNNSymbol and loadQNNSystemSymbol but doesn't include <utility>;
add an explicit `#include <utility>` to this header (near other includes) so the
std::pair type is guaranteed to be available and avoid relying on transitive
includes.
In `@pymllm/backends/qualcomm/transformers/core/qdq.py`:
- Around line 30-34: In the symmetric-quantization branch of the QDQ code
(around the attributes self.quant_min, self.quant_max, bits handling), replace
the incorrect dtype torch.uint8 with torch.qint8 to match PyTorch
FakeQuantize/MinMaxObserver signed symmetric semantics; update the comment to
correctly describe symmetric range as centered at zero (e.g., [-128, 127])
instead of [0,255]; and replace the assert bits == 8 with an explicit check that
raises a ValueError if bits is not 8 (so validation remains at runtime even
under -O). Ensure references to self.dtype, bits, and the surrounding symmetric
quantization logic (the block setting self.quant_min/self.quant_max) are updated
accordingly.
🧹 Nitpick comments (5)
mllm/backends/qnn/aot_rt/TokenGenerator.cpp (1)
112-115: Improve readability with named constants for sentinel values and decode-phase size.

The buffer initialization is correct: `initAttentionMask` fully clears the attention mask via `std::fill_n(attention_mask, ar_len * config_.context_len, ...)` before populating it. However, the purpose of the `-1` sentinel in `attention_map` and the hard-coded `1` for the decode-phase batch size are non-obvious at the call site. Per coding guidelines, use named constants to clarify intent:

🔧 Suggested readability refactor

```diff
-  std::vector<int32_t> attention_map(1);
-  std::iota(attention_map.begin(), attention_map.end(), -1);
-  kv_manager_->initAttentionMask(input_tensors_[2].ptr<uint16_t>(), attention_map, 1, current_pos);
+  constexpr int32_t kDecodeTokenCount = 1;
+  constexpr int32_t kAttentionMapSentinel = -1;  // Signals "attend to past tokens only" in initAttentionMask
+  std::vector<int32_t> attention_map(kDecodeTokenCount);
+  std::iota(attention_map.begin(), attention_map.end(), kAttentionMapSentinel);
+  kv_manager_->initAttentionMask(input_tensors_[2].ptr<uint16_t>(), attention_map, kDecodeTokenCount, current_pos);
```

Also applies to: 132-133
mllm/engine/Context.hpp (1)
33-33: Add a brief doc comment for the new public API.A one-line description (including that it’s a no-op if absent) will clarify expectations for callers.
📝 Suggested change
```diff
-  void shutdownBackend(const DeviceTypes& device);
+  // Idempotently remove backend for device; no-op if absent.
+  void shutdownBackend(const DeviceTypes& device);
```

As per coding guidelines: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
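The "no-op if absent" contract suggested for the doc comment is simple to express; a minimal sketch, assuming backends live in a device-keyed map (an assumption for illustration, not the verified mllm layout):

```python
# Sketch: idempotent shutdownBackend. The dict registry and device keys
# are hypothetical; only the no-op-if-absent semantics are the point.
backends = {"qnn": "qnn-backend", "cpu": "cpu-backend"}

def shutdown_backend(device: str) -> bool:
    """Remove the backend for `device`; returns False (no-op) if absent."""
    return backends.pop(device, None) is not None

first_call = shutdown_backend("qnn")   # removes the entry
second_call = shutdown_backend("qnn")  # safe repeat, nothing to do
```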
mllm/backends/qnn/aot_rt/QnnAOTModule.hpp (1)
15-18: Destructor clear is redundant; use a defaulted destructor.

`output_tensors_` is a `std::vector<Tensor>` that will be automatically destroyed when the object is destroyed. Calling `clear()` explicitly does not change C++ member destruction order (members are destroyed in reverse declaration order, and `output_tensors_` is the last member, so it destroys first). The explicit `clear()` adds unnecessary work without any benefit.

```diff
-  ~QnnAOTModule() {
-    // Clear output tensors to ensure proper cleanup order
-    output_tensors_.clear();
-  }
+  ~QnnAOTModule() = default;
```

mllm/backends/qnn/aot/QnnWrappersAPI.cpp (1)
141-145: Use a named constant for the symmetric uint8 zero-point.

`-128` is a magic number; naming it keeps intent clear and prevents drift if the expected zero-point changes. Consider extracting it as a named constant, e.g. `constexpr int32_t kSymmetricUint8ZeroPoint = 128;`, and then using `-kSymmetricUint8ZeroPoint` in the offset assignment. This improves code maintainability and aligns with the coding guideline to use named constants instead of magic numbers.

mllm/backends/qnn/aot_rt/KVCacheManager.hpp (1)
27-40: Destructor can be defaulted; explicit cleanup is redundant.

No base classes exist, and `std::vector` destructors automatically destroy all contained `KVCache<T>` elements and their `shared_ptr` members in the correct order. Manual `reset()` calls in the destructor body do not change the member-destruction sequence and add no benefit. If ordering with external systems truly requires early cleanup, implement a dedicated `shutdown()` method instead.
```diff
   // K: De-quantize and re-quantize to int8
-  auto k_h = key_states_per_head[h].to(kFloat32);
-  k_h = k_h.to(kUInt8PerTensorSym);
+  auto k_h = key_states_per_head[h].to(kUInt8PerTensorSym);
   k_h = ptq::QDQ_KV(this, k_h, "k_cast_to_int8_qdq_h" + h_str);
   k_h = k_h.transpose(2, 3);  // [B, 1, D, S]

   // V: Quantize to int16 then int8
   auto v_h = ptq::QDQ(this, value_states_per_head[h], "v_cast_to_int16_qdq_h" + h_str);
-  v_h = v_h.to(kFloat32);
   v_h = v_h.to(kUInt8PerTensorSym);
   v_h = ptq::QDQ_KV(this, v_h, "v_cast_to_int8_qdq_h" + h_str);
```
🏁 Verification scripts:

```bash
#!/bin/bash
# Verify the K/V quantization patterns across SHA files
echo "=== Qwen3 SHA (this file) ==="
rg -n "key_states_per_head\[h\]\.to\(" examples/qwen3_qnn_aot/modeling_qwen_qnn_aot_sha.hpp
echo ""
echo "=== Qwen2 SHA ==="
rg -n "key_states_per_head\[h\]\.to\(" examples/qwen2_qnn_aot/modeling_qwen2_qnn_aot_sha.hpp
echo ""
echo "=== Llama SHA ==="
rg -n "key_states_per_head\[h\]\.to\(" examples/llama_qnn_aot/modeling_llama_qnn_aot_sha.hpp

# Get context around the K quantization in each file
echo "=== Qwen3 SHA - K quantization context (lines 355-370) ==="
sed -n '355,370p' examples/qwen3_qnn_aot/modeling_qwen_qnn_aot_sha.hpp | cat -n
echo ""
echo "=== Qwen2 SHA - K quantization context (lines 343-358) ==="
sed -n '343,358p' examples/qwen2_qnn_aot/modeling_qwen2_qnn_aot_sha.hpp | cat -n
echo ""
echo "=== Llama SHA - K quantization context (lines 343-358) ==="
sed -n '343,358p' examples/llama_qnn_aot/modeling_llama_qnn_aot_sha.hpp | cat -n
```
K/V quantization paths differ from other SHA variants—consider clarifying the design choice.
Both K and V quantization paths in this file skip the intermediate kFloat32 conversion step that is present in Qwen2 and Llama SHA implementations:
Qwen3 (this file):
- K: `key_states_per_head[h].to(kUInt8PerTensorSym)` → `ptq::QDQ_KV(...)`
- V: `ptq::QDQ(...)` → `.to(kUInt8PerTensorSym)` → `ptq::QDQ_KV(...)`

Qwen2/Llama:
- K: `key_states_per_head[h].to(kFloat32)` → `.to(kUInt8PerTensorSym)` → `ptq::QDQ_KV(...)`
- V: `ptq::QDQ(...)` → `.to(kFloat32)` → `.to(kUInt8PerTensorSym)` → `ptq::QDQ_KV(...)`
If this divergence is intentional (e.g., due to different input data types), add a comment explaining why. Otherwise, align the quantization path with the other models to ensure consistency.
🤖 Prompt for AI Agents
In `@examples/qwen3_qnn_aot/modeling_qwen_qnn_aot_sha.hpp` around lines 358 - 366,
The K/V quantization paths for Qwen3 differ from Qwen2/Llama—K uses
key_states_per_head[h].to(kUInt8PerTensorSym) then ptq::QDQ_KV, and V uses
ptq::QDQ(...) then .to(kUInt8PerTensorSym) then ptq::QDQ_KV—while other SHA
implementations convert via kFloat32 first; either make this explicit or align
the flows: update the K and V flows (the code around key_states_per_head[h],
ptq::QDQ, and ptq::QDQ_KV) to include the intermediate .to(kFloat32) conversion
like the other models, or add a concise comment above these snippets explaining
why Qwen3 intentionally omits kFloat32 (e.g., expected input dtype/performance
reason) so reviewers understand the divergence.
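For readers less familiar with `kUInt8PerTensorSym`: per-tensor symmetric quantization stored in uint8 places the zero-point at 128, which is the 0 to 128 change this PR makes throughout. A pure-Python sketch of that convention (illustrative, not the mllm implementation):

```python
# uint8 per-tensor symmetric quantization: real 0.0 maps to code 128,
# and the representable range is symmetric around zero.
UINT8_SYM_ZERO_POINT = 128  # assumed convention, matching the PR's 0 -> 128 change

def quantize_u8_sym(x: float, scale: float) -> int:
    q = round(x / scale) + UINT8_SYM_ZERO_POINT
    return max(0, min(255, q))  # clamp to the uint8 range

def dequantize_u8_sym(q: int, scale: float) -> float:
    return (q - UINT8_SYM_ZERO_POINT) * scale

code_zero = quantize_u8_sym(0.0, 0.5)  # real zero lands on the zero-point
code_one = quantize_u8_sym(1.0, 0.5)   # two scale-steps above the zero-point
```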
```cpp
} else if (isInt(input_dtype) && isInt(target_dtype)) {
  qnn_op_type = "Convert";
```
🏁 Verification scripts:

```bash
find . -name "CastType.cpp" -type f
cat -n mllm/backends/qnn/aot/visitor/CastType.cpp | sed -n '55,75p'
wc -l mllm/backends/qnn/aot/visitor/CastType.cpp
cat -n mllm/backends/qnn/aot/visitor/CastType.cpp | head -n 86
grep -r "isInt\|isFloat" mllm/backends/qnn/ --include="*.cpp" --include="*.h" -B 2 -A 2 | head -n 50
grep -r "quantiz\|quant_param\|scale\|zero_point" mllm/backends/qnn/aot/visitor/ --include="*.cpp" --include="*.h"
grep -r "Convert" mllm/backends/qnn/ --include="*.cpp" --include="*.h" -B 2 -A 2 | head -n 40
grep -A 10 "captureQnnAOTNodeTensor" mllm/backends/qnn/aot/visitor/CastType.cpp
```
Fix TODO comment format violation at line 67.
Change `// TODO, Cast usage is error.` to `// TODO: Cast usage is error.` to comply with the required TODO format (must include a colon). Additionally, verify that the int→int "Convert" path at line 65 handles quantized integer-to-integer conversions correctly; the `isInt()` function includes per-tensor and per-channel quantized types, so "Convert" may be called on tensors with differing scale/zero-point parameters.
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot/visitor/CastType.cpp` around lines 64 - 65, Update the
TODO comment in CastType.cpp from "// TODO, Cast usage is error." to "// TODO:
Cast usage is error." to match required TODO format; then ensure the int→int
branch that sets qnn_op_type = "Convert" (where isInt(input_dtype) &&
isInt(target_dtype) is true) correctly handles quantized integer tensors by
detecting per-tensor/per-channel quantization (via the same checks used by isInt
or any quantization helpers) and either perform a requantize/recalibration step
instead of a blind Convert or insert dequantize→convert→requantize logic so
differing scale/zero-point parameters are reconciled before/after Convert.
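The dequantize → convert → requantize flow the prompt describes reduces to mapping codes through the real domain; a small sketch with made-up scales and zero-points (nothing here is the QNN "Convert" op itself):

```python
# Sketch: reconcile differing (scale, zero_point) pairs by round-tripping
# through real values, which a blind integer Convert would skip.
def requantize(q: int, scale1: float, zp1: int, scale2: float, zp2: int,
               qmin: int = 0, qmax: int = 255) -> int:
    real = (q - zp1) * scale1        # dequantize under the source params
    q2 = round(real / scale2) + zp2  # requantize under the target params
    return max(qmin, min(qmax, q2))  # clamp to the target integer range

# Code 138 under (scale=0.05, zp=128) represents 0.5; under
# (scale=0.1, zp=128) the same real value needs code 133.
q_out = requantize(138, 0.05, 128, 0.1, 128)
```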
```cpp
} else if (isInt(input_dtype) && isInt(target_dtype)) {
  qnn_op_type = "Convert";
} else if (input_dtype == kFloat32 && target_dtype == kFloat16) {
  // TODO, Cast usage is error.
```
Fix TODO comment format to comply with guidelines.
The TODO must use the exact `TODO:` prefix.
✅ Suggested fix

```diff
- // TODO, Cast usage is error.
+ // TODO: Cast usage is incorrect; verify the proper QNN op for f32->f16.
```

As per coding guidelines, TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```diff
- // TODO, Cast usage is error.
+ // TODO: Cast usage is incorrect; verify the proper QNN op for f32->f16.
```
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot/visitor/CastType.cpp` at line 67, Update the inline
comment in CastType.cpp to use the required TODO prefix format: replace the
existing comment "// TODO, Cast usage is error." with the guideline-compliant
"// TODO: Cast usage is error." so the TODO uses the exact "TODO:" prefix (look
for the comment near the CastType visitor implementation in CastType.cpp).
```diff
 QNNAllocator::QNNAllocator() {
-  void* libCdspHandle = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
-  if (nullptr == libCdspHandle) { MLLM_ERROR_EXIT(1, "dlopen libcdsprpc.so failed"); }
+  libCdspHandle_ = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
+  if (nullptr == libCdspHandle_) { MLLM_ERROR_EXIT(1, "dlopen libcdsprpc.so failed"); }

-  rpcmem_alloc = (RpcMemAllocFn_t)dlsym(libCdspHandle, "rpcmem_alloc");
-  rpcmem_free = (RpcMemFreeFn_t)dlsym(libCdspHandle, "rpcmem_free");
-  rpcmem_to_fd = (RpcMemToFdFn_t)dlsym(libCdspHandle, "rpcmem_to_fd");
+  rpcmem_alloc = (RpcMemAllocFn_t)dlsym(libCdspHandle_, "rpcmem_alloc");
+  rpcmem_free = (RpcMemFreeFn_t)dlsym(libCdspHandle_, "rpcmem_free");
+  rpcmem_to_fd = (RpcMemToFdFn_t)dlsym(libCdspHandle_, "rpcmem_to_fd");
 }

 QNNAllocator::QNNAllocator(QNN_INTERFACE_VER_TYPE qnnInterface, void* context)
     : qnnInterface_(qnnInterface), context_(context) {
   MLLM_RT_ASSERT(context_ != nullptr);

-  void* libCdspHandle = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
-  if (nullptr == libCdspHandle) { MLLM_ERROR_EXIT(1, "dlopen libcdsprpc.so failed"); }
+  libCdspHandle_ = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
+  if (nullptr == libCdspHandle_) { MLLM_ERROR_EXIT(1, "dlopen libcdsprpc.so failed"); }

-  rpcmem_alloc = (RpcMemAllocFn_t)dlsym(libCdspHandle, "rpcmem_alloc");
-  rpcmem_free = (RpcMemFreeFn_t)dlsym(libCdspHandle, "rpcmem_free");
-  rpcmem_to_fd = (RpcMemToFdFn_t)dlsym(libCdspHandle, "rpcmem_to_fd");
+  rpcmem_alloc = (RpcMemAllocFn_t)dlsym(libCdspHandle_, "rpcmem_alloc");
+  rpcmem_free = (RpcMemFreeFn_t)dlsym(libCdspHandle_, "rpcmem_free");
+  rpcmem_to_fd = (RpcMemToFdFn_t)dlsym(libCdspHandle_, "rpcmem_to_fd");
 }
```
🏁 Verification scripts:

```bash
cat -n mllm/backends/qnn/QNNAllocator.cpp | head -80
rg "MLLM_ERROR_EXIT" mllm/utils/ -A 2 | head -20
rg "define MLLM_ERROR_EXIT" mllm/utils/Log.hpp -A 5
rg "dlerror" mllm/ -A 2
rg "MLLM_ERROR_EXIT.*ExitCode" mllm/ -B 1 | head -30
rg "MLLM_ERROR_EXIT\(1" mllm/ -B 1 -A 1
cat -n mllm/backends/qnn/QNNAllocator.cpp | sed -n '49,77p'
rg "rpcmem_alloc|rpcmem_free|rpcmem_to_fd" mllm/backends/qnn/QNNAllocator.hpp -B 1 -A 1
rg "RpcMemAllocFn_t|RpcMemFreeFn_t|RpcMemToFdFn_t" mllm/backends/qnn/QNNAllocator.hpp -B 2 -A 2
```
Add dlsym failure checks before using the `rpcmem_*` pointers.

The `rpcmem_alloc`, `rpcmem_free`, and `rpcmem_to_fd` function pointers are loaded without validation in both constructors (lines 20–22 and 32–34). If any symbol fails to resolve, `dlsym` returns `nullptr` and subsequent calls crash. Apply the same error-handling pattern used for `dlopen` to the `dlsym` results in both constructors:
🔧 Suggested fix

```diff
   rpcmem_alloc = (RpcMemAllocFn_t)dlsym(libCdspHandle_, "rpcmem_alloc");
+  if (!rpcmem_alloc) { MLLM_ERROR_EXIT(1, "dlsym rpcmem_alloc failed: {}", dlerror()); }
   rpcmem_free = (RpcMemFreeFn_t)dlsym(libCdspHandle_, "rpcmem_free");
+  if (!rpcmem_free) { MLLM_ERROR_EXIT(1, "dlsym rpcmem_free failed: {}", dlerror()); }
   rpcmem_to_fd = (RpcMemToFdFn_t)dlsym(libCdspHandle_, "rpcmem_to_fd");
+  if (!rpcmem_to_fd) { MLLM_ERROR_EXIT(1, "dlsym rpcmem_to_fd failed: {}", dlerror()); }
```

🤖 Prompt for AI Agents
In `@mllm/backends/qnn/QNNAllocator.cpp` around lines 16 - 35, The constructors of
QNNAllocator set rpcmem_alloc, rpcmem_free, and rpcmem_to_fd via dlsym without
checking for nullptr; update both QNNAllocator() and
QNNAllocator(QNN_INTERFACE_VER_TYPE, void*) to validate each dlsym result
(rpcmem_alloc, rpcmem_free, rpcmem_to_fd) after assignment, and if any is
nullptr call MLLM_ERROR_EXIT (or MLLM_RT_ASSERT if preferred) with a clear
message that includes dlerror() context; reference the libCdspHandle_ loading
logic and use the same error-handling pattern used for dlopen to fail fast when
symbols are missing.
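The fail-fast symbol-loading pattern generalizes beyond C; as an illustration, the same idea in Python's ctypes (the missing symbol name below is deliberately fictitious):

```python
import ctypes
import ctypes.util

# Sketch: resolve every required symbol eagerly and abort on the first
# miss, mirroring the suggested dlsym checks after dlopen succeeds.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

def must_resolve(lib, name: str):
    fn = getattr(lib, name, None)  # None plays the role of dlsym returning NULL
    if fn is None:
        raise RuntimeError(f"dlsym {name} failed")  # analogous to MLLM_ERROR_EXIT
    return fn

printf = must_resolve(libc, "printf")  # a symbol libc actually exports
try:
    must_resolve(libc, "rpcmem_alloc_not_really_here")  # fictitious symbol
    missing_detected = False
except RuntimeError:
    missing_detected = True
```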
```cpp
QNNBackend::~QNNBackend() {
  // Cleanup order is critical - we hold all QNN library handles to control unload order:
  // 1. Allocator shutdown (memDeRegister + rpcmem_free) - needs QNN alive
  // 2. Clear models - tensor destructors try to free but allocator is shut down
  // 3. Perf cleanup - needs QNN HTP infrastructure alive
  // 4. Runtime cleanup - frees QNN backend/device handles
  // 5. Allocator reset - dlcloses libcdsprpc.so (held by allocator)
  // 6. Close QNN libraries - libQnnSystem.so first, then libQnnHtp.so

  // 1. Properly shutdown allocator while QNN is still alive
  //    This calls memDeRegister and rpcmem_free safely
  if (allocator_) {
    auto* qnnAllocator = dynamic_cast<QNNAllocator*>(allocator_.get());
    if (qnnAllocator) { qnnAllocator->shutdown(); }
  }

  // 2. Clear models - tensor destructors will call free() but they're now no-ops
  qnnModels_.clear();
  qnnModelIndexMap_.clear();

  // 3. Cleanup perf while QNN HTP infrastructure is still alive
  if (perf_) { perf_->shutdown(); }
  perf_.reset();

  // 4. Cleanup runtime - frees QNN backend/device handles
  runtime_->qnnInterface.contextFree(context_, nullptr);
  context_ = nullptr;
  runtime_.reset();

  // 5. Reset allocator - will dlclose libcdsprpc.so since shutdown() was already called
  allocator_.reset();

  // 6. Close QNN libraries in reverse order of dependency
  if (qnnSystemLibHandle_) {
    dlclose(qnnSystemLibHandle_);
    qnnSystemLibHandle_ = nullptr;
  }
  if (qnnHtpLibHandle_) {
    dlclose(qnnHtpLibHandle_);
    qnnHtpLibHandle_ = nullptr;
  }
}
```
🏁 Verification scripts:

```bash
# First, find the QNNBackend header file
fd -t f "QNNBackend" --extension h --extension hpp

# Read the QNNBackend.h header to understand member initialization
cat -n mllm/backends/qnn/QNNBackend.h

# Now find the constructor implementation to see when context_ and runtime_ are initialized
rg "QNNBackend::" mllm/backends/qnn/QNNBackend.cpp -A 5 | head -100

# Check the full constructor to understand initialization order
sed -n '1,80p' mllm/backends/qnn/QNNBackend.cpp | tail -50

# Read the QNNBackend header file (it's .hpp not .h)
cat -n mllm/backends/qnn/QNNBackend.hpp | head -150

# Look for context_ initialization in the header
rg "context_" mllm/backends/qnn/QNNBackend.hpp -B 2 -A 2

# Check if context_ has a default initialization (might be in member initialization list)
rg "Qnn_.*_t context" mllm/backends/qnn/ -A 2 -B 2

# Search for all usages of contextFree to understand expected behavior
rg "contextFree" mllm/backends/qnn/ -B 3 -A 3

# Check if MLLM_ERROR_EXIT actually terminates the program
rg "MLLM_ERROR_EXIT" mllm/utils/Log.hpp -B 2 -A 5

# Check for exception handling or early returns in QNNBackend constructor
sed -n '1,75p' mllm/backends/qnn/QNNBackend.cpp

# Verify if createContext or loadContext must be called before using the backend
rg "createContext|loadContext" mllm/backends/qnn/ -B 2 -A 2 | head -50

# Check if there are any ways runtime_ could be null or partially initialized in destructor
rg "reset\(\)" mllm/backends/qnn/QNNBackend.cpp

# Verify that context is indeed nullptr when not explicitly created
rg "context_.*=" mllm/backends/qnn/QNNBackend.cpp | head -20

# Check how createContext and loadContext set context_
rg "context_.*=" mllm/backends/qnn/QNNRuntime.cpp -B 3 -A 3 | head -40

# Find all QNN files to understand the structure better
fd -t f "QNN" mllm/backends/qnn/ --extension cpp --extension hpp | head -20

# Check if there are any other uses of contextFree or similar patterns
rg "contextFree" mllm/backends/qnn/ -B 5 -A 5

# Look at how runtime_ is constructed to confirm it's always valid at destructor
rg "QNNRuntime::create" mllm/backends/qnn/ -B 2 -A 10

# Check if there's a null check pattern used elsewhere for similar operations
rg "if.*runtime_\|if.*context_" mllm/backends/qnn/QNNBackend.cpp -B 1 -A 1
```
Guard `contextFree` when `context_` is null.

The destructor can run without a context being created if neither `createContext()` nor `loadContext()` is called. Passing a null context handle to QNN may be invalid. Add a guard to safely handle this case.
Suggested fix

```diff
-  runtime_->qnnInterface.contextFree(context_, nullptr);
-  context_ = nullptr;
+  if (context_) {
+    runtime_->qnnInterface.contextFree(context_, nullptr);
+    context_ = nullptr;
+  }
```

As per coding guidelines: Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/QNNBackend.cpp` around lines 80 - 121, The destructor
QNNBackend::~QNNBackend calls runtime_->qnnInterface.contextFree(context_,
nullptr) unconditionally but context_ may be null if
createContext()/loadContext() were never called; guard the call by checking
context_ (and that runtime_ is non-null) before invoking contextFree so you
don't pass a null handle into the QNN API, and ensure you still set context_ =
nullptr and reset runtime_ afterwards.
```cpp
// Load QNN symbols and return library handle for lifecycle management
// Returns {success, libHandle} - caller owns the handle and must dlclose it
std::pair<bool, void*> loadQNNSymbol();
std::pair<bool, void*> loadQNNSystemSymbol();
```
🏁 Verification scripts:

```bash
# Check if <utility> is included and verify std::pair usage
rg -n '#include <utility>' mllm/backends/qnn/QNNUtils.hpp || echo "No <utility> found"
rg -n 'std::pair' mllm/backends/qnn/QNNUtils.hpp || echo "No std::pair found"

# Read the beginning of the file to see all includes
head -60 mllm/backends/qnn/QNNUtils.hpp
```
Add a `<utility>` include for the `std::pair` declarations.

This header uses `std::pair<bool, void*>` at lines 48–49 but does not explicitly include `<utility>`. While compilation may succeed due to transitive includes, this is fragile and can break on stricter toolchains.
🛠️ Proposed fix

```diff
 #include <memory>
 #include <string>
 #include <vector>
+#include <utility>
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```cpp
#include <memory>
#include <string>
#include <vector>
#include <utility>

// Load QNN symbols and return library handle for lifecycle management
// Returns {success, libHandle} - caller owns the handle and must dlclose it
std::pair<bool, void*> loadQNNSymbol();
std::pair<bool, void*> loadQNNSystemSymbol();
```
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/QNNUtils.hpp` around lines 46 - 49, Header QNNUtils.hpp
declares std::pair<bool, void*> for functions loadQNNSymbol and
loadQNNSystemSymbol but doesn't include <utility>; add an explicit
`#include <utility>` to this header (near other includes) so the std::pair type
is guaranteed to be available and avoid relying on transitive includes.
```python
# NOTE: If left empty: with uint8 and symmetric quantization, the observer will use [0, 255] as the range. And 128 as the zero_point.
self.quant_min = None
self.quant_max = None
assert bits == 8, "Symmetric quantization is only supported for 8-bit"
self.dtype = torch.uint8
```
🏁 Verification scripts:

```bash
# Read the file to understand context
head -60 pymllm/backends/qualcomm/transformers/core/qdq.py

# Search for torch.ao.quantization usage in the codebase
rg "torch\.ao\.quantization|FakeQuantize|MinMaxObserver" -A 3 -B 3 --type py

# Check for torch.quint8 or torch.qint8 usage
rg "torch\.qu?int8|torch\.uint8" --type py -B 2 -A 2
```
🌐 Web query:
PyTorch torch.ao.quantization FakeQuantize MinMaxObserver dtype torch.uint8 symmetric quantization
💡 Result:
In PyTorch eager/QAT quantization, FakeQuantize / MinMaxObserver do not accept torch.uint8. Their dtype is limited to torch.qint8 (signed) or torch.quint8 (unsigned quantized). [1], [2]
Also, symmetric quantization is a qscheme (e.g. torch.per_tensor_symmetric), and PyTorch defines symmetric vs affine schemes separately from dtype. [3]
What to do

1) If you want true symmetric 8-bit quantization (centered at 0), use signed int8:

```python
import torch
from torch.ao.quantization import FakeQuantize
from torch.ao.quantization.observer import MinMaxObserver

fq = FakeQuantize.with_args(
    observer=MinMaxObserver,
    dtype=torch.qint8,
    qscheme=torch.per_tensor_symmetric,
    quant_min=-128,
    quant_max=127,
)()
```

This matches the usual meaning of "symmetric" (zero_point = 0).
2) If you specifically want an unsigned 8-bit quantized dtype, use torch.quint8 (not torch.uint8). Note: with unsigned 8-bit, you typically end up with affine (asymmetric) quantization (nonzero zero_point), not true symmetry around 0:

```python
fq = FakeQuantize.with_args(
    observer=MinMaxObserver,
    dtype=torch.quint8,
    qscheme=torch.per_tensor_affine,
    quant_min=0,
    quant_max=255,
)()
```

Key point: if your goal is "symmetric quantization," pick torch.qint8 + torch.per_tensor_symmetric; if your goal is "store as 8-bit unsigned," pick torch.quint8 but expect affine behavior. [1], [2], [3]
Sources:
[1] PyTorch FakeQuantize docs (dtype allowed values)
[2] PyTorch MinMaxObserver docs (dtype allowed values)
[3] PyTorch quantization API reference (supported qscheme incl. symmetric)
Replace `torch.uint8` with `torch.qint8` and replace `assert` with explicit validation.

The symmetric quantization path has three issues:

- Invalid dtype: `torch.uint8` is not supported by PyTorch's `FakeQuantize`/`MinMaxObserver`. Valid dtypes are `torch.qint8` (signed) or `torch.quint8` (unsigned quantized). For true symmetric quantization (zero_point = 0), use `torch.qint8`.
- Misleading comment and semantics: the comment describes asymmetric behavior (range [0, 255], zero_point = 128), which contradicts the symmetric quantization claim. True symmetric quantization centers at 0 with range [-128, 127].
- Unreliable input validation: `assert bits == 8` can be stripped with Python's `-O` optimization flag. Use explicit error handling instead.
✅ Corrected symmetric quantization setup

```diff
- self.quant_min = None
- self.quant_max = None
- assert bits == 8, "Symmetric quantization is only supported for 8-bit"
- self.dtype = torch.uint8
+ if bits != 8:
+     raise ValueError("Symmetric quantization is only supported for 8-bit")
+ self.quant_min = -128
+ self.quant_max = 127
+ self.dtype = torch.qint8
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 # NOTE: If left empty: with uint8 and symmetric quantization, the observer will use [0, 255] as the range. And 128 as the zero_point.
-self.quant_min = None
-self.quant_max = None
-assert bits == 8, "Symmetric quantization is only supported for 8-bit"
-self.dtype = torch.uint8
+if bits != 8:
+    raise ValueError("Symmetric quantization is only supported for 8-bit")
+self.quant_min = -128
+self.quant_max = 127
+self.dtype = torch.qint8
```
🤖 Prompt for AI Agents
In `@pymllm/backends/qualcomm/transformers/core/qdq.py` around lines 30-34: in
the symmetric-quantization branch of the QDQ code (around the attributes
self.quant_min, self.quant_max, bits handling), replace the incorrect dtype
torch.uint8 with torch.qint8 to match PyTorch FakeQuantize/MinMaxObserver signed
symmetric semantics; update the comment to correctly describe symmetric range as
centered at zero (e.g., [-128, 127]) instead of [0,255]; and replace the assert
bits == 8 with an explicit check that raises a ValueError if bits is not 8 (so
validation remains at runtime even under -O). Ensure references to self.dtype,
bits, and the surrounding symmetric quantization logic (the block setting
self.quant_min/self.quant_max) are updated accordingly.
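The zero_point = 128 convention that the NOTE describes (symmetric scale, values stored as uint8) can be sketched in plain Python arithmetic. This is an illustration of the numeric mapping only; the helper names are hypothetical and not part of the project's API:

```python
def quantize_uint8_symmetric(x, scale):
    """Map a float to uint8 with a symmetric scale and zero_point = 128."""
    q = round(x / scale) + 128
    return max(0, min(255, q))  # clamp to the uint8 range [0, 255]

def dequantize_uint8_symmetric(q, scale):
    """Recover the float value; the zero_point is subtracted back out."""
    return (q - 128) * scale

scale = 0.1
assert quantize_uint8_symmetric(0.0, scale) == 128    # zero lands on the zero_point
assert quantize_uint8_symmetric(-12.8, scale) == 0    # most negative representable value
assert dequantize_uint8_symmetric(128, scale) == 0.0
```

This is why the examples and parsers in this PR expect zero_point 128 rather than 0 for uint8 per-tensor symmetric tensors: the symmetric range [-128, 127] is shifted into [0, 255] by the fixed offset.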
…s for better compatibility and consistency in tensor caching.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
mllm/compile/ir/Node.cpp (1)
272-280: ⚠️ Potential issue | 🟡 Minor — Remove unused push/pop region methods or clarify their purpose.

These methods are defined but never called anywhere in the codebase. The actual region management uses `resetRegion()` and the `IRWriterGuard` pattern instead. Additionally, the push/pop methods have inconsistent semantics: `push()` immediately sets `cur_insert_region_` to the new region, but `pop()` sets it to the top of the stack before popping. If these methods are intended for future use, add a comment explaining the intent; otherwise, remove them to avoid maintaining dead code.
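The save/restore discipline that a guard pattern like `IRWriterGuard` implies can be sketched as a Python context manager. This is a hedged analogy, not the actual mllm C++ API — `Context` and `writer_guard` are hypothetical names:

```python
from contextlib import contextmanager

class Context:
    """Minimal stand-in for an IR context with a current insertion region."""
    def __init__(self):
        self.cur_insert_region = None

@contextmanager
def writer_guard(ctx, region):
    # Save the current insertion region on entry and restore it on exit,
    # so nested writers cannot leave the context pointing at a stale region.
    saved = ctx.cur_insert_region
    ctx.cur_insert_region = region
    try:
        yield ctx
    finally:
        ctx.cur_insert_region = saved

ctx = Context()
with writer_guard(ctx, "outer"):
    with writer_guard(ctx, "inner"):
        assert ctx.cur_insert_region == "inner"
    assert ctx.cur_insert_region == "outer"   # restored after the inner guard
assert ctx.cur_insert_region is None          # fully unwound
```

Scoped restoration like this is exactly what unmatched `push()`/`pop()` calls cannot guarantee, which is why the guard-based path is preferred.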
🧹 Nitpick comments (1)
mllm/compile/ir/Node.cpp (1)
268-268: Direct map access may insert null entries for missing keys. Using `cached_inputs_outputs_[uuid]` when the key doesn't exist will insert a default-constructed `val_ptr_t` (null) into the map, potentially corrupting the cache state. Consider using `find()` or `at()` with proper handling, or document that callers must always call `isCacheInputOutputTensor()` first.

🔧 Proposed safer implementation

```diff
-val_ptr_t IRContext::getCacheInputOutputTensor(size_t uuid) { return cached_inputs_outputs_[uuid]; }
+val_ptr_t IRContext::getCacheInputOutputTensor(size_t uuid) {
+  auto it = cached_inputs_outputs_.find(uuid);
+  return it != cached_inputs_outputs_.end() ? it->second : nullptr;
+}
```
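The hazard of `operator[]`-style lookup is easy to reproduce in Python with `collections.defaultdict`, which likewise inserts a default entry on a missing-key lookup. This is an illustrative analogy for the C++ behavior above, not the project's code:

```python
from collections import defaultdict

# defaultdict mirrors std::map::operator[]: a plain lookup of a
# missing key silently inserts a default-constructed (None) entry.
cache = defaultdict(lambda: None)
_ = cache[42]
assert len(cache) == 1      # the lookup alone grew the cache

# dict.get mirrors the find()-based fix: a missing key returns None
# and the container is left untouched.
safe_cache = {}
assert safe_cache.get(42) is None
assert len(safe_cache) == 0
```

The same trade-off applies in C++: `find()` keeps the lookup read-only, while `operator[]` is a read-modify operation in disguise.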
SM8650, chunk=32
Qwen3 1.7B, prefill 640 tokens, decode 128 tokens. w4a16kv8.
Prefill: 758.28 tps
Decode: 25.84 tps
SM8650, chunk=32
Qwen2.5 3B, prefill 256 tokens, decode 128 tokens. w4a16kv8.
Prefill: 512.75 tps
Decode: 18.29 tps
Qwen2.5 3B, prefill 512 tokens, decode 256 tokens. w4a16kv8.
Prefill: 545.62 tps
Decode: 18.11 tps
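As a sanity check on the throughput figures above, end-to-end latency can be back-computed from the per-phase rates. This is simple arithmetic, assuming the prefill and decode phases run back to back with no inter-phase overhead:

```python
def end_to_end_seconds(prefill_tokens, prefill_tps, decode_tokens, decode_tps):
    # total wall time = prefill time + decode time
    return prefill_tokens / prefill_tps + decode_tokens / decode_tps

# Qwen3 1.7B on SM8650: 640 prefill tokens at 758.28 tps,
# 128 decode tokens at 25.84 tps
t = end_to_end_seconds(640, 758.28, 128, 25.84)
print(f"{t:.2f} s")  # roughly 5.80 s, dominated by the decode phase
```

Note how decode dominates even at a 5:1 token ratio, which is why the decode tps is the number to watch for interactive use.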
Summary by CodeRabbit
Bug Fixes
New Features