
mileston(qnn): Qnn AOT #624

Merged
chenghuaWang merged 2 commits into UbiquitousLearning:main from chenghuaWang:wch-main
Feb 1, 2026

Conversation


@chenghuaWang chenghuaWang commented Jan 31, 2026

SM8650, chunk=32

Qwen3 1.7B, prefill 640 tokens, decode 128 tokens. w4a16kv8.

Prefill: 758.28 tps
Decode: 25.84 tps


SM8650, chunk=32

Qwen2.5 3B, prefill 256 tokens, decode 128 tokens. w4a16kv8.

Prefill: 512.75 tps
Decode: 18.29 tps

Qwen2.5 3B, prefill 512 tokens, decode 256 tokens. w4a16kv8.

Prefill: 545.62 tps
Decode: 18.11 tps

Summary by CodeRabbit

  • Bug Fixes

    • Improved resource cleanup and shutdown sequencing to prevent memory leaks and use-after-free issues
    • Updated quantization handling to expect unsigned 8-bit symmetric zero-point (128), improving quantized model correctness
    • Removed stale zero-point metadata from KV cache and improved KV de/quant paths
    • Fixed attention mask initialization and updates during generation
  • New Features

    • Added explicit backend shutdown capability for better lifecycle control


…onsistent quantization across models, enhancing accuracy in AOT runs.

coderabbitai bot commented Jan 31, 2026

📝 Walkthrough

Walkthrough

Adds explicit QNN backend and allocator shutdown/lifecycle management, updates uint8 per-tensor symmetric quantization zero_point expectations from 0→128 across examples and parsers, enforces ordered destructor/cleanup for AOT runtime components, and adjusts token-generator KV/attention-mask initialization.

Changes

Cohort / File(s) Summary
Quantization Zero-Point Updates
examples/llama_qnn_aot/modeling_llama_qnn_aot.hpp, examples/llama_qnn_aot/modeling_llama_qnn_aot_sha.hpp, examples/qwen2_qnn_aot/modeling_qwen2_qnn_aot.hpp, examples/qwen2_qnn_aot/modeling_qwen2_qnn_aot_sha.hpp, examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp, examples/qwen3_qnn_aot/modeling_qwen_qnn_aot_sha.hpp
Changed QDQ_KV assertion and handling for kUInt8PerTensorSym zero_point from 0 to 128 (per-tensor symmetric uint8 now expects zp==128).
KV Cache & Example Adjustments
examples/qwen3_qnn_aot/compile_sha.cpp, examples/qwen3_qnn_aot/...
Removed zero_point attachments for past_key/past_value in compile setup; some KV per-head cast paths adjusted to direct uint8 per-tensor handling.
QNN AOT Quant Param Parsing
mllm/backends/qnn/aot/QnnWrappersAPI.cpp
For SymPerTensor quantization, changed scale offset handling (offset -128) and added assertion requiring quant_to_type == kUInt8.
Python Quantization Observer
pymllm/backends/qualcomm/transformers/core/qdq.py
ActivationQDQ symmetric branch now enforces 8-bit, uses torch.uint8 and expects observer-provided [0,255] range with zero_point 128 (quant_min/quant_max set to None).
QNN Utils: library loading
mllm/backends/qnn/QNNUtils.hpp, mllm/backends/qnn/QNNUtils.cpp
Changed loadQNNSymbol/loadQNNSystemSymbol return types from bool to std::pair<bool, void*>, returning success plus the library handle.
QNN Allocator: lifecycle & unloading
mllm/backends/qnn/QNNAllocator.hpp, mllm/backends/qnn/QNNAllocator.cpp
Introduced libCdspHandle_, isShutdown_, rpcmem fn ptr initialization, added shutdown()/releaseAllResources()/markShutdown(), reworked free() sequencing, and defined destructor to unload library.
QNN Backend & Perf: lifecycle
mllm/backends/qnn/QNNBackend.hpp, mllm/backends/qnn/QNNBackend.cpp
Added ~QNNBackend() and strict shutdown order; added QNNPerf::shutdown(), renamed internal members, and ensure libraries/handles are unloaded in controlled order.
AOT runtime destructors & cleanup ordering
mllm/backends/qnn/aot_rt/KVCacheManager.hpp, mllm/backends/qnn/aot_rt/PromptProcessor.hpp, mllm/backends/qnn/aot_rt/QnnAOTModule.hpp, mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp, mllm/backends/qnn/aot_rt/TokenGenerator.hpp
Replaced defaulted destructors with explicit destructors that clear module output tensors and reset members in a safe order to avoid use-after-free/double-free.
TokenGenerator: attention-mask & IO sizing
mllm/backends/qnn/aot_rt/TokenGenerator.cpp, mllm/backends/qnn/aot_rt/TokenGenerator.hpp
Adjusted IO reserve size, reduced K cache width by 1, added attention-mask init (initAttentionMask) and update calls, and added numeric include.
KVCacheManager: comments
mllm/backends/qnn/aot_rt/KVCacheManager.cpp
Added inline comments annotating rearrangement dimensional orders.
Context/backend shutdown integration
mllm/engine/Context.hpp, mllm/engine/Context.cpp, mllm/mllm.cpp
Added Context::shutdownBackend(const DeviceTypes&) and call in global shutdown to explicitly remove/shutdown QNN backend when available.
IR cache key type change
mllm/compile/ir/Node.hpp, mllm/compile/ir/Node.cpp
Switched IRContext cached input/output UUID keys and APIs from uint32_t to size_t and adjusted push-region insertion to set the current insert region.
CastType visitor
mllm/backends/qnn/aot/visitor/CastType.cpp
Added integer→integer cast mapping to qnn op type Convert; kept float32→float16 Cast path (TODO noted).
LLMQuantRecipePass: Slice handling
mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp
Ensure SliceOp first input has a quant_recipe; if missing, generate and attach a simple quant spec before propagation.
Other minor cleanup
mllm/backends/qnn/aot_rt/*, mllm/backends/qnn/*
Various destructor/cleanup ordering and minor log/message/name updates to reflect lifecycle management and safer teardown.
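The zero-point convention at the heart of these changes can be sketched standalone (the helpers below are illustrative only, not mllm APIs): with uint8 per-tensor symmetric quantization, real zero maps to quantized 128, and dequantization applies a -128 offset, matching the parser-side SymPerTensor change described above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative sketch only -- these helpers are not mllm APIs.
// uint8 per-tensor symmetric quantization with the zero point fixed at 128.
constexpr int32_t kZeroPoint = 128;

uint8_t quantizeSymU8(float x, float scale) {
  // Real zero quantizes to exactly 128; results clamp to [0, 255].
  const int32_t q = static_cast<int32_t>(std::lround(x / scale)) + kZeroPoint;
  return static_cast<uint8_t>(std::clamp(q, 0, 255));
}

float dequantizeSymU8(uint8_t q, float scale) {
  // The -128 offset mirrors the SymPerTensor scale-offset handling.
  return static_cast<float>(static_cast<int32_t>(q) - kZeroPoint) * scale;
}
```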

Sequence Diagram(s)

sequenceDiagram
    actor App
    participant Context
    participant QNNBackend
    participant QNNAllocator
    participant DynLib as "QNN HTP/System libs\n(dl handles)"
    Note over App,Context: Shutdown sequence (high-level)
    App->>Context: shutdownContext()
    Context->>Context: isQnnAvailable? true
    Context->>Context: shutdownBackend(kQNN)
    Context->>QNNBackend: request shutdown/destruction
    QNNBackend->>QNNAllocator: shutdown()
    QNNAllocator->>QNNAllocator: deregister memory entries
    QNNAllocator->>QNNAllocator: free rpcmem allocations
    QNNAllocator->>DynLib: dlclose(libCdspHandle_) (nullify)
    QNNBackend->>DynLib: dlclose(qnnHtpLibHandle_) / dlclose(qnnSystemLibHandle_)
    QNNBackend->>QNNBackend: clear models, perf shutdown
    Context-->>App: shutdown complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • liang1232018
  • oreomaker

Poem

🐇 I nibble at handles, tidy each lib,
Zero-point hops to one-two-eight, no fib,
Destructors align in a careful row,
KV masks wake, attention starts to grow,
Hooray — resources freed, tidy and spry!

🚥 Pre-merge checks | ❌ 3
❌ Failed checks (2 warnings, 1 inconclusive)
  • Description check — ⚠️ Warning: the description contains only benchmark results, without explaining what changes were made, why, or how they address the objectives. Resolution: add a clear summary of the changes, the technical improvements, and how the benchmarks validate the implementation.
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 8.93%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Title check — ❓ Inconclusive: the title 'mileston(qnn): Qnn AOT' is vague and contains a typo ('mileston' instead of 'milestone'), making the primary change unclear. Resolution: clarify the title to describe the main change (e.g., 'Add QNN AOT runtime support') and fix the typo.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🤖 Fix all issues with AI agents
In `@examples/qwen3_qnn_aot/modeling_qwen_qnn_aot_sha.hpp`:
- Around line 358-366: The K/V quantization paths for Qwen3 differ from
Qwen2/Llama—K uses key_states_per_head[h].to(kUInt8PerTensorSym) then
ptq::QDQ_KV, and V uses ptq::QDQ(...) then .to(kUInt8PerTensorSym) then
ptq::QDQ_KV—while other SHA implementations convert via kFloat32 first; either
make this explicit or align the flows: update the K and V flows (the code around
key_states_per_head[h], ptq::QDQ, and ptq::QDQ_KV) to include the intermediate
.to(kFloat32) conversion like the other models, or add a concise comment above
these snippets explaining why Qwen3 intentionally omits kFloat32 (e.g., expected
input dtype/performance reason) so reviewers understand the divergence.

In `@mllm/backends/qnn/aot/visitor/CastType.cpp`:
- Line 67: Update the inline comment in CastType.cpp to use the required TODO
prefix format: replace the existing comment "// TODO, Cast usage is error." with
the guideline-compliant "// TODO: Cast usage is error." so the TODO uses the
exact "TODO:" prefix (look for the comment near the CastType visitor
implementation in CastType.cpp).
- Around line 64-65: Update the TODO comment in CastType.cpp from "// TODO, Cast
usage is error." to "// TODO: Cast usage is error." to match required TODO
format; then ensure the int→int branch that sets qnn_op_type = "Convert" (where
isInt(input_dtype) && isInt(target_dtype) is true) correctly handles quantized
integer tensors by detecting per-tensor/per-channel quantization (via the same
checks used by isInt or any quantization helpers) and either perform a
requantize/recalibration step instead of a blind Convert or insert
dequantize→convert→requantize logic so differing scale/zero-point parameters are
reconciled before/after Convert.

In `@mllm/backends/qnn/QNNAllocator.cpp`:
- Around line 16-35: The constructors of QNNAllocator set rpcmem_alloc,
rpcmem_free, and rpcmem_to_fd via dlsym without checking for nullptr; update
both QNNAllocator() and QNNAllocator(QNN_INTERFACE_VER_TYPE, void*) to validate
each dlsym result (rpcmem_alloc, rpcmem_free, rpcmem_to_fd) after assignment,
and if any is nullptr call MLLM_ERROR_EXIT (or MLLM_RT_ASSERT if preferred) with
a clear message that includes dlerror() context; reference the libCdspHandle_
loading logic and use the same error-handling pattern used for dlopen to fail
fast when symbols are missing.

In `@mllm/backends/qnn/QNNBackend.cpp`:
- Around line 80-121: The destructor QNNBackend::~QNNBackend calls
runtime_->qnnInterface.contextFree(context_, nullptr) unconditionally but
context_ may be null if createContext()/loadContext() were never called; guard
the call by checking context_ (and that runtime_ is non-null) before invoking
contextFree so you don't pass a null handle into the QNN API, and ensure you
still set context_ = nullptr and reset runtime_ afterwards.

In `@mllm/backends/qnn/QNNUtils.hpp`:
- Around line 46-49: Header QNNUtils.hpp declares std::pair<bool, void*> for
functions loadQNNSymbol and loadQNNSystemSymbol but doesn't include <utility>;
add an explicit `#include` <utility> to this header (near other includes) so the
std::pair type is guaranteed to be available and avoid relying on transitive
includes.

In `@pymllm/backends/qualcomm/transformers/core/qdq.py`:
- Around line 30-34: In the symmetric-quantization branch of the QDQ code
(around the attributes self.quant_min, self.quant_max, bits handling), replace
the incorrect dtype torch.uint8 with torch.qint8 to match PyTorch
FakeQuantize/MinMaxObserver signed symmetric semantics; update the comment to
correctly describe symmetric range as centered at zero (e.g., [-128, 127])
instead of [0,255]; and replace the assert bits == 8 with an explicit check that
raises a ValueError if bits is not 8 (so validation remains at runtime even
under -O). Ensure references to self.dtype, bits, and the surrounding symmetric
quantization logic (the block setting self.quant_min/self.quant_max) are updated
accordingly.
🧹 Nitpick comments (5)
mllm/backends/qnn/aot_rt/TokenGenerator.cpp (1)

112-115: Improve readability with named constants for sentinel values and decode phase size.

The buffer initialization is correct—initAttentionMask fully clears the attention mask via std::fill_n(attention_mask, ar_len * config_.context_len, ...) before populating it. However, the purpose of the −1 sentinel in attention_map and the hard-coded 1 for decode-phase batch size are non-obvious at the call site. Per coding guidelines, use named constants to clarify intent:

🔧 Suggested readability refactor
-  std::vector<int32_t> attention_map(1);
-  std::iota(attention_map.begin(), attention_map.end(), -1);
-  kv_manager_->initAttentionMask(input_tensors_[2].ptr<uint16_t>(), attention_map, 1, current_pos);
+  constexpr int32_t kDecodeTokenCount = 1;
+  constexpr int32_t kAttentionMapSentinel = -1;  // Signals "attend to past tokens only" in initAttentionMask
+  std::vector<int32_t> attention_map(kDecodeTokenCount);
+  std::iota(attention_map.begin(), attention_map.end(), kAttentionMapSentinel);
+  kv_manager_->initAttentionMask(input_tensors_[2].ptr<uint16_t>(), attention_map, kDecodeTokenCount, current_pos);

Also applies to: 132-133

mllm/engine/Context.hpp (1)

33-33: Add a brief doc comment for the new public API.

A one-line description (including that it’s a no-op if absent) will clarify expectations for callers.

📝 Suggested change
-  void shutdownBackend(const DeviceTypes& device);
+  // Idempotently remove backend for device; no-op if absent.
+  void shutdownBackend(const DeviceTypes& device);

As per coding guidelines: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.

mllm/backends/qnn/aot_rt/QnnAOTModule.hpp (1)

15-18: Destructor clear is redundant—use a defaulted destructor.

output_tensors_ is a std::vector<Tensor> that is automatically destroyed with the object. Calling clear() explicitly does not change C++ member destruction order (members are destroyed in reverse declaration order, and output_tensors_ is the last declared member, so it is destroyed first). The explicit clear() adds unnecessary work without any benefit.

-  ~QnnAOTModule() {
-    // Clear output tensors to ensure proper cleanup order
-    output_tensors_.clear();
-  }
+  ~QnnAOTModule() = default;
mllm/backends/qnn/aot/QnnWrappersAPI.cpp (1)

141-145: Use a named constant for the symmetric uint8 zero-point.

-128 is a magic number; naming it keeps intent clear and prevents drift if the expected zero-point changes.

Consider extracting it as a named constant, e.g., constexpr int32_t kSymmetricUint8ZeroPoint = 128;, and then using -kSymmetricUint8ZeroPoint in the offset assignment. This improves code maintainability and aligns with the coding guideline to use named constants instead of magic numbers.
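A minimal sketch of what that refactor could look like (the constant name and the struct around it are illustrative, not code from the PR):

```cpp
#include <cstdint>

// Illustrative only: named constant replacing the magic number -128.
constexpr int32_t kSymmetricUint8ZeroPoint = 128;

// QNN-style quant params store the zero point as a negative offset,
// per the SymPerTensor "offset -128" handling described above.
struct SymPerTensorParams {
  float scale;
  int32_t offset;
};

SymPerTensorParams makeSymPerTensorParams(float scale) {
  return {scale, -kSymmetricUint8ZeroPoint};
}
```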

mllm/backends/qnn/aot_rt/KVCacheManager.hpp (1)

27-40: Destructor can be defaulted—explicit cleanup is redundant.

No base classes exist, and std::vector destructors automatically destroy all contained KVCache<T> elements and their shared_ptr members in the correct order. Manual reset() calls during the destructor body do not change the member-destruction sequence and add no benefit. If ordering with external systems truly requires early cleanup, implement a dedicated shutdown() method instead.
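The reverse-declaration-order behavior this relies on is easy to check with a small standalone demo (all names below are hypothetical):

```cpp
#include <string>
#include <utility>
#include <vector>

// Demonstrates that non-static data members are destroyed in reverse
// declaration order, so an explicit clear()/reset() in the destructor
// body cannot change when a member's own destructor runs.
static std::vector<std::string> destruction_log;

struct Tracked {
  std::string name;
  explicit Tracked(std::string n) : name(std::move(n)) {}
  ~Tracked() { destruction_log.push_back(name); }
};

struct Holder {
  Tracked first{"first"};  // declared first -> destroyed last
  Tracked last{"last"};    // declared last  -> destroyed first
  ~Holder() = default;     // a defaulted destructor is sufficient
};
```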

Comment on lines 358 to 366
   // K: De-quantize and re-quantize to int8
-  auto k_h = key_states_per_head[h].to(kFloat32);
-  k_h = k_h.to(kUInt8PerTensorSym);
+  auto k_h = key_states_per_head[h].to(kUInt8PerTensorSym);
   k_h = ptq::QDQ_KV(this, k_h, "k_cast_to_int8_qdq_h" + h_str);
   k_h = k_h.transpose(2, 3); // [B, 1, D, S]

   // V: Quantize to int16 then int8
   auto v_h = ptq::QDQ(this, value_states_per_head[h], "v_cast_to_int16_qdq_h" + h_str);
-  v_h = v_h.to(kFloat32);
   v_h = v_h.to(kUInt8PerTensorSym);
   v_h = ptq::QDQ_KV(this, v_h, "v_cast_to_int8_qdq_h" + h_str);

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the K/V quantization patterns across SHA files
echo "=== Qwen3 SHA (this file) ==="
rg -n "key_states_per_head\[h\]\.to\(" examples/qwen3_qnn_aot/modeling_qwen_qnn_aot_sha.hpp

echo ""
echo "=== Qwen2 SHA ==="
rg -n "key_states_per_head\[h\]\.to\(" examples/qwen2_qnn_aot/modeling_qwen2_qnn_aot_sha.hpp

echo ""
echo "=== Llama SHA ==="
rg -n "key_states_per_head\[h\]\.to\(" examples/llama_qnn_aot/modeling_llama_qnn_aot_sha.hpp

Repository: UbiquitousLearning/mllm

Length of output: 318


🏁 Script executed:

#!/bin/bash
# Get context around the K quantization in each file
echo "=== Qwen3 SHA - K quantization context (lines 355-370) ==="
sed -n '355,370p' examples/qwen3_qnn_aot/modeling_qwen_qnn_aot_sha.hpp | cat -n

echo ""
echo "=== Qwen2 SHA - K quantization context (lines 343-358) ==="
sed -n '343,358p' examples/qwen2_qnn_aot/modeling_qwen2_qnn_aot_sha.hpp | cat -n

echo ""
echo "=== Llama SHA - K quantization context (lines 343-358) ==="
sed -n '343,358p' examples/llama_qnn_aot/modeling_llama_qnn_aot_sha.hpp | cat -n

Repository: UbiquitousLearning/mllm

Length of output: 2573


K/V quantization paths differ from other SHA variants—consider clarifying the design choice.

Both K and V quantization paths in this file skip the intermediate kFloat32 conversion step that is present in Qwen2 and Llama SHA implementations:

Qwen3 (this file):

  • K: key_states_per_head[h].to(kUInt8PerTensorSym) → ptq::QDQ_KV(...)
  • V: ptq::QDQ(...) → .to(kUInt8PerTensorSym) → ptq::QDQ_KV(...)

Qwen2/Llama:

  • K: key_states_per_head[h].to(kFloat32).to(kUInt8PerTensorSym) → ptq::QDQ_KV(...)
  • V: ptq::QDQ(...) → .to(kFloat32).to(kUInt8PerTensorSym) → ptq::QDQ_KV(...)

If this divergence is intentional (e.g., due to different input data types), add a comment explaining why. Otherwise, align the quantization path with the other models to ensure consistency.

🤖 Prompt for AI Agents
In `@examples/qwen3_qnn_aot/modeling_qwen_qnn_aot_sha.hpp` around lines 358 - 366,
The K/V quantization paths for Qwen3 differ from Qwen2/Llama—K uses
key_states_per_head[h].to(kUInt8PerTensorSym) then ptq::QDQ_KV, and V uses
ptq::QDQ(...) then .to(kUInt8PerTensorSym) then ptq::QDQ_KV—while other SHA
implementations convert via kFloat32 first; either make this explicit or align
the flows: update the K and V flows (the code around key_states_per_head[h],
ptq::QDQ, and ptq::QDQ_KV) to include the intermediate .to(kFloat32) conversion
like the other models, or add a concise comment above these snippets explaining
why Qwen3 intentionally omits kFloat32 (e.g., expected input dtype/performance
reason) so reviewers understand the divergence.

Comment on lines +64 to +65
} else if (isInt(input_dtype) && isInt(target_dtype)) {
qnn_op_type = "Convert";

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find . -name "CastType.cpp" -type f

Repository: UbiquitousLearning/mllm

Length of output: 111


🏁 Script executed:

cat -n mllm/backends/qnn/aot/visitor/CastType.cpp | sed -n '55,75p'

Repository: UbiquitousLearning/mllm

Length of output: 993


🏁 Script executed:

wc -l mllm/backends/qnn/aot/visitor/CastType.cpp

Repository: UbiquitousLearning/mllm

Length of output: 112


🏁 Script executed:

cat -n mllm/backends/qnn/aot/visitor/CastType.cpp | head -n 86

Repository: UbiquitousLearning/mllm

Length of output: 4164


🏁 Script executed:

grep -r "isInt\|isFloat" mllm/backends/qnn/ --include="*.cpp" --include="*.h" -B 2 -A 2 | head -n 50

Repository: UbiquitousLearning/mllm

Length of output: 1491


🏁 Script executed:

grep -r "quantiz\|quant_param\|scale\|zero_point" mllm/backends/qnn/aot/visitor/ --include="*.cpp" --include="*.h"

Repository: UbiquitousLearning/mllm

Length of output: 749


🏁 Script executed:

grep -r "Convert" mllm/backends/qnn/ --include="*.cpp" --include="*.h" -B 2 -A 2 | head -n 40

Repository: UbiquitousLearning/mllm

Length of output: 4099


🏁 Script executed:

grep -A 10 "captureQnnAOTNodeTensor" mllm/backends/qnn/aot/visitor/CastType.cpp

Repository: UbiquitousLearning/mllm

Length of output: 427


Fix TODO comment format violation at line 67.

Change // TODO, Cast usage is error. to // TODO: Cast usage is error. to comply with the required TODO format (must include colon). Additionally, verify that the int→int "Convert" path at line 65 handles quantized integer-to-integer conversions correctly; the isInt() function includes per-tensor and per-channel quantized types, so "Convert" may be called on tensors with differing scale/zero-point parameters.

🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot/visitor/CastType.cpp` around lines 64 - 65, Update the
TODO comment in CastType.cpp from "// TODO, Cast usage is error." to "// TODO:
Cast usage is error." to match required TODO format; then ensure the int→int
branch that sets qnn_op_type = "Convert" (where isInt(input_dtype) &&
isInt(target_dtype) is true) correctly handles quantized integer tensors by
detecting per-tensor/per-channel quantization (via the same checks used by isInt
or any quantization helpers) and either perform a requantize/recalibration step
instead of a blind Convert or insert dequantize→convert→requantize logic so
differing scale/zero-point parameters are reconciled before/after Convert.

} else if (isInt(input_dtype) && isInt(target_dtype)) {
qnn_op_type = "Convert";
} else if (input_dtype == kFloat32 && target_dtype == kFloat16) {
// TODO, Cast usage is error.

⚠️ Potential issue | 🟠 Major

Fix TODO comment format to comply with guidelines.

The TODO must use the exact TODO: prefix.

✅ Suggested fix
-    // TODO, Cast usage is error.
+    // TODO: Cast usage is incorrect; verify the proper QNN op for f32->f16.

As per coding guidelines, TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// TODO, Cast usage is error.
// TODO: Cast usage is incorrect; verify the proper QNN op for f32->f16.
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot/visitor/CastType.cpp` at line 67, Update the inline
comment in CastType.cpp to use the required TODO prefix format: replace the
existing comment "// TODO, Cast usage is error." with the guideline-compliant
"// TODO: Cast usage is error." so the TODO uses the exact "TODO:" prefix (look
for the comment near the CastType visitor implementation in CastType.cpp).

Comment on lines 16 to +35
 QNNAllocator::QNNAllocator() {
-  void* libCdspHandle = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
-  if (nullptr == libCdspHandle) { MLLM_ERROR_EXIT(1, "dlopen libcdsprpc.so failed"); }
+  libCdspHandle_ = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
+  if (nullptr == libCdspHandle_) { MLLM_ERROR_EXIT(1, "dlopen libcdsprpc.so failed"); }

-  rpcmem_alloc = (RpcMemAllocFn_t)dlsym(libCdspHandle, "rpcmem_alloc");
-  rpcmem_free = (RpcMemFreeFn_t)dlsym(libCdspHandle, "rpcmem_free");
-  rpcmem_to_fd = (RpcMemToFdFn_t)dlsym(libCdspHandle, "rpcmem_to_fd");
+  rpcmem_alloc = (RpcMemAllocFn_t)dlsym(libCdspHandle_, "rpcmem_alloc");
+  rpcmem_free = (RpcMemFreeFn_t)dlsym(libCdspHandle_, "rpcmem_free");
+  rpcmem_to_fd = (RpcMemToFdFn_t)dlsym(libCdspHandle_, "rpcmem_to_fd");
 }

 QNNAllocator::QNNAllocator(QNN_INTERFACE_VER_TYPE qnnInterface, void* context)
     : qnnInterface_(qnnInterface), context_(context) {
   MLLM_RT_ASSERT(context_ != nullptr);

-  void* libCdspHandle = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
-  if (nullptr == libCdspHandle) { MLLM_ERROR_EXIT(1, "dlopen libcdsprpc.so failed"); }
+  libCdspHandle_ = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
+  if (nullptr == libCdspHandle_) { MLLM_ERROR_EXIT(1, "dlopen libcdsprpc.so failed"); }

-  rpcmem_alloc = (RpcMemAllocFn_t)dlsym(libCdspHandle, "rpcmem_alloc");
-  rpcmem_free = (RpcMemFreeFn_t)dlsym(libCdspHandle, "rpcmem_free");
-  rpcmem_to_fd = (RpcMemToFdFn_t)dlsym(libCdspHandle, "rpcmem_to_fd");
+  rpcmem_alloc = (RpcMemAllocFn_t)dlsym(libCdspHandle_, "rpcmem_alloc");
+  rpcmem_free = (RpcMemFreeFn_t)dlsym(libCdspHandle_, "rpcmem_free");
+  rpcmem_to_fd = (RpcMemToFdFn_t)dlsym(libCdspHandle_, "rpcmem_to_fd");
 }

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n mllm/backends/qnn/QNNAllocator.cpp | head -80

Repository: UbiquitousLearning/mllm

Length of output: 3359


🏁 Script executed:

rg "MLLM_ERROR_EXIT" mllm/utils/ -A 2 | head -20

Repository: UbiquitousLearning/mllm

Length of output: 1230


🏁 Script executed:

rg "define MLLM_ERROR_EXIT" mllm/utils/Log.hpp -A 5

Repository: UbiquitousLearning/mllm

Length of output: 540


🏁 Script executed:

rg "dlerror" mllm/ -A 2

Repository: UbiquitousLearning/mllm

Length of output: 2315


🏁 Script executed:

rg "MLLM_ERROR_EXIT.*ExitCode" mllm/ -B 1 | head -30

Repository: UbiquitousLearning/mllm

Length of output: 1691


🏁 Script executed:

rg "MLLM_ERROR_EXIT\(1" mllm/ -B 1 -A 1

Repository: UbiquitousLearning/mllm

Length of output: 1988


🏁 Script executed:

cat -n mllm/backends/qnn/QNNAllocator.cpp | sed -n '49,77p'

Repository: UbiquitousLearning/mllm

Length of output: 1224


🏁 Script executed:

rg "rpcmem_alloc|rpcmem_free|rpcmem_to_fd" mllm/backends/qnn/QNNAllocator.hpp -B 1 -A 1

Repository: UbiquitousLearning/mllm

Length of output: 436


🏁 Script executed:

rg "RpcMemAllocFn_t|RpcMemFreeFn_t|RpcMemToFdFn_t" mllm/backends/qnn/QNNAllocator.hpp -B 2 -A 2

Repository: UbiquitousLearning/mllm

Length of output: 443


Add dlsym failure checks before using rpcmem_* pointers.

The rpcmem_alloc, rpcmem_free, and rpcmem_to_fd function pointers are loaded without validation in both constructors (lines 20–22 and 32–34). If any symbol fails to resolve, dlsym returns nullptr, and subsequent calls crash. Apply the dlopen pattern to dlsym results in both constructors:

🔧 Suggested fix
   rpcmem_alloc = (RpcMemAllocFn_t)dlsym(libCdspHandle_, "rpcmem_alloc");
+  if (!rpcmem_alloc) { MLLM_ERROR_EXIT(1, "dlsym rpcmem_alloc failed: {}", dlerror()); }
   rpcmem_free = (RpcMemFreeFn_t)dlsym(libCdspHandle_, "rpcmem_free");
+  if (!rpcmem_free) { MLLM_ERROR_EXIT(1, "dlsym rpcmem_free failed: {}", dlerror()); }
   rpcmem_to_fd = (RpcMemToFdFn_t)dlsym(libCdspHandle_, "rpcmem_to_fd");
+  if (!rpcmem_to_fd) { MLLM_ERROR_EXIT(1, "dlsym rpcmem_to_fd failed: {}", dlerror()); }
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/QNNAllocator.cpp` around lines 16 - 35, The constructors of
QNNAllocator set rpcmem_alloc, rpcmem_free, and rpcmem_to_fd via dlsym without
checking for nullptr; update both QNNAllocator() and
QNNAllocator(QNN_INTERFACE_VER_TYPE, void*) to validate each dlsym result
(rpcmem_alloc, rpcmem_free, rpcmem_to_fd) after assignment, and if any is
nullptr call MLLM_ERROR_EXIT (or MLLM_RT_ASSERT if preferred) with a clear
message that includes dlerror() context; reference the libCdspHandle_ loading
logic and use the same error-handling pattern used for dlopen to fail fast when
symbols are missing.

Comment on lines +80 to +121
QNNBackend::~QNNBackend() {
// Cleanup order is critical - we hold all QNN library handles to control unload order:
// 1. Allocator shutdown (memDeRegister + rpcmem_free) - needs QNN alive
// 2. Clear models - tensor destructors try to free but allocator is shut down
// 3. Perf cleanup - needs QNN HTP infrastructure alive
// 4. Runtime cleanup - frees QNN backend/device handles
// 5. Allocator reset - dlcloses libcdsprpc.so (held by allocator)
// 6. Close QNN libraries - libQnnSystem.so first, then libQnnHtp.so

// 1. Properly shutdown allocator while QNN is still alive
// This calls memDeRegister and rpcmem_free safely
if (allocator_) {
auto* qnnAllocator = dynamic_cast<QNNAllocator*>(allocator_.get());
if (qnnAllocator) { qnnAllocator->shutdown(); }
}

// 2. Clear models - tensor destructors will call free() but they're now no-ops
qnnModels_.clear();
qnnModelIndexMap_.clear();

// 3. Cleanup perf while QNN HTP infrastructure is still alive
if (perf_) { perf_->shutdown(); }
perf_.reset();

// 4. Cleanup runtime - frees QNN backend/device handles
runtime_->qnnInterface.contextFree(context_, nullptr);
context_ = nullptr;
runtime_.reset();

// 5. Reset allocator - will dlclose libcdsprpc.so since shutdown() was already called
allocator_.reset();

// 6. Close QNN libraries in reverse order of dependency
if (qnnSystemLibHandle_) {
dlclose(qnnSystemLibHandle_);
qnnSystemLibHandle_ = nullptr;
}
if (qnnHtpLibHandle_) {
dlclose(qnnHtpLibHandle_);
qnnHtpLibHandle_ = nullptr;
}
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, find the QNNBackend header file
fd -t f "QNNBackend" --extension h --extension hpp

Repository: UbiquitousLearning/mllm

Length of output: 99


🏁 Scripts executed (repository: UbiquitousLearning/mllm):

```shell
# Read the QNNBackend header to understand member initialization
cat -n mllm/backends/qnn/QNNBackend.h

# Find the constructor implementation to see when context_ and runtime_ are initialized
rg "QNNBackend::" mllm/backends/qnn/QNNBackend.cpp -A 5 | head -100

# Check the full constructor to understand initialization order
sed -n '1,80p' mllm/backends/qnn/QNNBackend.cpp | tail -50

# Read the QNNBackend header file (it's .hpp, not .h)
cat -n mllm/backends/qnn/QNNBackend.hpp | head -150

# Look for context_ initialization in the header
rg "context_" mllm/backends/qnn/QNNBackend.hpp -B 2 -A 2

# Check if context_ has a default initialization (might be in member initialization list)
rg "Qnn_.*_t context" mllm/backends/qnn/ -A 2 -B 2

# Search for all usages of contextFree to understand expected behavior
rg "contextFree" mllm/backends/qnn/ -B 3 -A 3

# Check if MLLM_ERROR_EXIT actually terminates the program
rg "MLLM_ERROR_EXIT" mllm/utils/Log.hpp -B 2 -A 5

# Check for exception handling or early returns in QNNBackend constructor
sed -n '1,75p' mllm/backends/qnn/QNNBackend.cpp

# Verify if createContext or loadContext must be called before using the backend
rg "createContext|loadContext" mllm/backends/qnn/ -B 2 -A 2 | head -50

# Check if there are any ways runtime_ could be null or partially initialized in destructor
rg "reset\(\)" mllm/backends/qnn/QNNBackend.cpp

# Verify that context_ is indeed nullptr when not explicitly created
rg "context_.*=" mllm/backends/qnn/QNNBackend.cpp | head -20

# Check how createContext and loadContext set context_
rg "context_.*=" mllm/backends/qnn/QNNRuntime.cpp -B 3 -A 3 | head -40

# Find all QNN files to understand the structure better
fd -t f "QNN" mllm/backends/qnn/ --extension cpp --extension hpp | head -20

# Check other uses of contextFree and similar patterns
rg "contextFree" mllm/backends/qnn/ -B 5 -A 5

# Look at how runtime_ is constructed to confirm it's always valid at destructor
rg "QNNRuntime::create" mllm/backends/qnn/ -B 2 -A 10

# Check if there's a null-check pattern used elsewhere for similar operations
rg "if.*runtime_\|if.*context_" mllm/backends/qnn/QNNBackend.cpp -B 1 -A 1
```
Guard contextFree when context_ is null.

The destructor can run without a context being created if neither createContext() nor loadContext() is called. Passing a null context handle to QNN may be invalid. Add a guard to safely handle this case.

Suggested fix

```diff
-  runtime_->qnnInterface.contextFree(context_, nullptr);
-  context_ = nullptr;
+  if (context_) {
+    runtime_->qnnInterface.contextFree(context_, nullptr);
+    context_ = nullptr;
+  }
```

As per coding guidelines: Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).

🤖 Prompt for AI Agents
In `@mllm/backends/qnn/QNNBackend.cpp` around lines 80 - 121, The destructor
QNNBackend::~QNNBackend calls runtime_->qnnInterface.contextFree(context_,
nullptr) unconditionally but context_ may be null if
createContext()/loadContext() were never called; guard the call by checking
context_ (and that runtime_ is non-null) before invoking contextFree so you
don't pass a null handle into the QNN API, and ensure you still set context_ =
nullptr and reset runtime_ afterwards.
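The guarded-cleanup pattern the review suggests can be sketched in isolation. The types below (`FakeQnnInterface`, `BackendSketch`) are hypothetical stand-ins for illustration, not the real QNN API; the point is only that the destructor skips `contextFree` when no context was ever created:

```cpp
#include <cassert>

// Stand-in for the QNN interface table; counts frees instead of calling QNN.
struct FakeQnnInterface {
  int free_calls = 0;
  void contextFree(void* ctx) {
    (void)ctx;
    ++free_calls;  // in real QNN, a null handle here may be invalid
  }
};

// Stand-in for QNNBackend: context_ stays null unless createContext()/loadContext() ran.
struct BackendSketch {
  FakeQnnInterface* iface = nullptr;
  void* context = nullptr;

  ~BackendSketch() {
    // Guard: only free a context that was actually created.
    if (iface && context) {
      iface->contextFree(context);
      context = nullptr;
    }
  }
};
```

With the guard in place, a backend that never created a context destructs without touching the interface, while a backend with a live context frees it exactly once.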

Comment on lines +46 to +49

```cpp
// Load QNN symbols and return library handle for lifecycle management
// Returns {success, libHandle} - caller owns the handle and must dlclose it
std::pair<bool, void*> loadQNNSymbol();
std::pair<bool, void*> loadQNNSystemSymbol();
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Scripts executed:

```shell
# Check if <utility> is included and verify std::pair usage
rg -n '#include <utility>' mllm/backends/qnn/QNNUtils.hpp || echo "No <utility> found"
rg -n 'std::pair' mllm/backends/qnn/QNNUtils.hpp || echo "No std::pair found"

# Read the beginning of the file to see all includes
head -60 mllm/backends/qnn/QNNUtils.hpp
```

Add a `<utility>` include for the `std::pair` declarations.

This header uses `std::pair<bool, void*>` at lines 48–49 but does not explicitly include `<utility>`. While compilation may succeed through transitive includes, this is fragile and can break on stricter toolchains.

🛠️ Proposed fix

```diff
 #include <memory>
 #include <string>
 #include <vector>
+#include <utility>
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/QNNUtils.hpp` around lines 46 - 49, QNNUtils.hpp declares std::pair<bool, void*> return types for loadQNNSymbol and loadQNNSystemSymbol but doesn't include <utility>; add an explicit `#include <utility>` to this header (near the other includes) so the std::pair type is guaranteed to be available instead of relying on transitive includes.

Comment on lines +30 to +34

```python
# NOTE: If left empty: with uint8 and symmetric quantization, the observer will use [0, 255] as the range. And 128 as the zero_point.
self.quant_min = None
self.quant_max = None
assert bits == 8, "Symmetric quantization is only supported for 8-bit"
self.dtype = torch.uint8
```

⚠️ Potential issue | 🔴 Critical
🧩 Analysis chain

🏁 Scripts executed:

```shell
# Read the file to understand context
head -60 pymllm/backends/qualcomm/transformers/core/qdq.py

# Search for torch.ao.quantization usage in the codebase
rg "torch\.ao\.quantization|FakeQuantize|MinMaxObserver" -A 3 -B 3 --type py

# Check for torch.quint8 or torch.qint8 usage
rg "torch\.qu?int8|torch\.uint8" --type py -B 2 -A 2
```


🌐 Web query:

PyTorch torch.ao.quantization FakeQuantize MinMaxObserver dtype torch.uint8 symmetric quantization

💡 Result:

In PyTorch eager/QAT quantization, FakeQuantize / MinMaxObserver do not accept torch.uint8. Their dtype is limited to torch.qint8 (signed) or torch.quint8 (unsigned quantized). [1], [2]
Also, symmetric quantization is a qscheme (e.g. torch.per_tensor_symmetric), and PyTorch defines symmetric vs affine schemes separately from dtype. [3]

What to do

1) If you want true symmetric 8-bit quantization (centered at 0)

Use signed int8:

```python
import torch
from torch.ao.quantization import FakeQuantize
from torch.ao.quantization.observer import MinMaxObserver

fq = FakeQuantize.with_args(
    observer=MinMaxObserver,
    dtype=torch.qint8,
    qscheme=torch.per_tensor_symmetric,
    quant_min=-128,
    quant_max=127,
)()
```

This matches the usual meaning of “symmetric” (zero_point = 0).

2) If you specifically want an unsigned 8-bit quantized dtype

Use torch.quint8 (not torch.uint8). Note: with unsigned 8-bit, you typically end up with affine (asymmetric) quantization (nonzero zero_point), not true symmetry around 0:

```python
fq = FakeQuantize.with_args(
    observer=MinMaxObserver,
    dtype=torch.quint8,
    qscheme=torch.per_tensor_affine,
    quant_min=0,
    quant_max=255,
)()
```

Key point: if your goal is “symmetric quantization,” pick torch.qint8 + torch.per_tensor_symmetric; if your goal is “store as 8-bit unsigned,” pick torch.quint8 but expect affine behavior. [1], [2], [3]

Sources:
[1] PyTorch FakeQuantize docs (dtype allowed values)
[2] PyTorch MinMaxObserver docs (dtype allowed values)
[3] PyTorch quantization API reference (supported qscheme incl. symmetric)


Replace torch.uint8 with torch.qint8 and replace the assert with explicit validation.

The symmetric quantization path has three issues:

1. Invalid dtype: torch.uint8 is not supported by PyTorch's FakeQuantize/MinMaxObserver. Valid dtypes are torch.qint8 (signed) or torch.quint8 (unsigned quantized). For true symmetric quantization (zero_point=0), use torch.qint8.

2. Misleading comment and semantics: The comment describes asymmetric behavior (range [0, 255], zero_point=128), which contradicts the symmetric quantization claim. True symmetric quantization centers at 0 with range [-128, 127].

3. Unreliable input validation: `assert bits == 8` can be stripped by Python's -O optimization flag. Use explicit error handling instead.

✅ Corrected symmetric quantization setup

```diff
-            self.quant_min = None
-            self.quant_max = None
-            assert bits == 8, "Symmetric quantization is only supported for 8-bit"
-            self.dtype = torch.uint8
+            if bits != 8:
+                raise ValueError("Symmetric quantization is only supported for 8-bit")
+            self.quant_min = -128
+            self.quant_max = 127
+            self.dtype = torch.qint8
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
 # NOTE: If left empty: with uint8 and symmetric quantization, the observer will use [0, 255] as the range. And 128 as the zero_point.
-self.quant_min = None
-self.quant_max = None
-assert bits == 8, "Symmetric quantization is only supported for 8-bit"
-self.dtype = torch.uint8
+if bits != 8:
+    raise ValueError("Symmetric quantization is only supported for 8-bit")
+self.quant_min = -128
+self.quant_max = 127
+self.dtype = torch.qint8
```
🤖 Prompt for AI Agents
In `@pymllm/backends/qualcomm/transformers/core/qdq.py` around lines 30 - 34, In
the symmetric-quantization branch of the QDQ code (around the attributes
self.quant_min, self.quant_max, bits handling), replace the incorrect dtype
torch.uint8 with torch.qint8 to match PyTorch FakeQuantize/MinMaxObserver signed
symmetric semantics; update the comment to correctly describe symmetric range as
centered at zero (e.g., [-128, 127]) instead of [0,255]; and replace the assert
bits == 8 with an explicit check that raises a ValueError if bits is not 8 (so
validation remains at runtime even under -O). Ensure references to self.dtype,
bits, and the surrounding symmetric quantization logic (the block setting
self.quant_min/self.quant_max) are updated accordingly.
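For intuition, the arithmetic behind the uint8-with-zero_point-128 convention (the value the updated AOT parsers now expect) can be sketched independently of PyTorch. Function names here are illustrative, not taken from the codebase: real value 0.0f maps to the stored value 128, and the representable range is [-128·scale, 127·scale].

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Fixed zero point for "symmetric" quantization stored in unsigned 8-bit.
constexpr int kZeroPoint = 128;

// Quantize: round, shift by the zero point, clamp into uint8 storage.
uint8_t quantize_u8(float x, float scale) {
  long q = std::lround(x / scale) + kZeroPoint;
  return static_cast<uint8_t>(std::clamp(q, 0L, 255L));
}

// Dequantize: undo the shift, then rescale.
float dequantize_u8(uint8_t q, float scale) {
  return static_cast<float>(static_cast<int>(q) - kZeroPoint) * scale;
}
```

This makes the round trip explicit: a stored byte of 128 always decodes to exactly 0.0f, which is why a parser that checks the zero point against 128 (rather than 0) matches this storage format.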

…s for better compatibility and consistency in tensor caching.
@UbiquitousLearning UbiquitousLearning left a comment

LGTM

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
mllm/compile/ir/Node.cpp (1)

272-280: ⚠️ Potential issue | 🟡 Minor

Remove unused push/pop region methods or clarify their purpose.

These methods are defined but never called anywhere in the codebase. The actual region management uses resetRegion() and the IRWriterGuard pattern instead. Additionally, the push/pop methods have inconsistent semantics: push() immediately sets cur_insert_region_ to the new region, but pop() sets it to the top of the stack before popping. If these methods are intended for future use, add a comment explaining the intent; otherwise, remove them to avoid maintaining dead code.

🧹 Nitpick comments (1)
mllm/compile/ir/Node.cpp (1)

268-268: Direct map access may insert null entries for missing keys.

Using cached_inputs_outputs_[uuid] when the key doesn't exist will insert a default-constructed val_ptr_t (null) into the map, potentially corrupting the cache state. Consider using find() or at() with proper handling, or document that callers must always call isCacheInputOutputTensor() first.

🔧 Proposed safer implementation

```diff
-val_ptr_t IRContext::getCacheInputOutputTensor(size_t uuid) { return cached_inputs_outputs_[uuid]; }
+val_ptr_t IRContext::getCacheInputOutputTensor(size_t uuid) {
+  auto it = cached_inputs_outputs_.find(uuid);
+  return it != cached_inputs_outputs_.end() ? it->second : nullptr;
+}
```
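The hazard the nitpick describes is easy to reproduce in isolation. A small sketch (the aliases mirror the `val_ptr_t` naming from the comment but are otherwise illustrative):

```cpp
#include <cstddef>
#include <memory>
#include <unordered_map>

using val_ptr_t = std::shared_ptr<int>;
using Cache = std::unordered_map<std::size_t, val_ptr_t>;

// operator[] default-constructs (and inserts) a null entry on a cache miss.
val_ptr_t get_unsafe(Cache& c, std::size_t uuid) { return c[uuid]; }

// find() leaves the map untouched on a miss and returns nullptr instead.
val_ptr_t get_safe(const Cache& c, std::size_t uuid) {
  auto it = c.find(uuid);
  return it != c.end() ? it->second : nullptr;
}
```

Both lookups return null for a missing key, but only the `operator[]` version mutates the cache as a side effect, which is the state-corruption risk flagged above.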

@chenghuaWang chenghuaWang merged commit 081f408 into UbiquitousLearning:main Feb 1, 2026
4 checks passed