milestone(qnn): Qnn AOT #624
…onsistent quantization across models, enhancing accuracy in AOT runs.
📝 Walkthrough

Adds explicit QNN backend and allocator shutdown/lifecycle management, updates uint8 per-tensor symmetric quantization zero_point expectations from 0 to 128 across examples and parsers, enforces ordered destructor/cleanup for AOT runtime components, and adjusts token-generator KV/attention-mask initialization.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor App
    participant Context
    participant QNNBackend
    participant QNNAllocator
    participant DynLib as "QNN HTP/System libs\n(dl handles)"
    Note over App,Context: Shutdown sequence (high-level)
    App->>Context: shutdownContext()
    Context->>Context: isQnnAvailable? true
    Context->>Context: shutdownBackend(kQNN)
    Context->>QNNBackend: request shutdown/destruction
    QNNBackend->>QNNAllocator: shutdown()
    QNNAllocator->>QNNAllocator: deregister memory entries
    QNNAllocator->>QNNAllocator: free rpcmem allocations
    QNNAllocator->>DynLib: dlclose(libCdspHandle_) (nullify)
    QNNBackend->>DynLib: dlclose(qnnHtpLibHandle_) / dlclose(qnnSystemLibHandle_)
    QNNBackend->>QNNBackend: clear models, perf shutdown
    Context-->>App: shutdown complete
```
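The ordering in the diagram can be sketched in a few lines; the class and method names below are illustrative stand-ins, not the actual mllm/QNN API:

```python
# Hypothetical sketch of the teardown ordering above: allocator shutdown
# happens first (while the backend is still alive), model teardown next,
# and dynamic-library handles are closed last. Names are illustrative only.
shutdown_log = []

class FakeAllocator:
    def shutdown(self):
        # deregister memory entries + free rpcmem allocations
        shutdown_log.append("allocator.shutdown")

class FakeBackend:
    def __init__(self):
        self.allocator = FakeAllocator()
        self.models = ["model0"]

    def destroy(self):
        self.allocator.shutdown()            # 1. needs QNN still alive
        self.models.clear()                  # 2. tensor frees are now no-ops
        shutdown_log.append("models.cleared")
        shutdown_log.append("dlclose.libs")  # 3. close dl handles last

backend = FakeBackend()
backend.destroy()
```

If the dlclose happened before the allocator shutdown, the rpcmem free calls would target an unloaded library, which is exactly the ordering hazard the diagram encodes.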
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In `@examples/qwen3_qnn_aot/modeling_qwen_qnn_aot_sha.hpp`:
- Around line 358-366: The K/V quantization paths for Qwen3 differ from
Qwen2/Llama—K uses key_states_per_head[h].to(kUInt8PerTensorSym) then
ptq::QDQ_KV, and V uses ptq::QDQ(...) then .to(kUInt8PerTensorSym) then
ptq::QDQ_KV—while other SHA implementations convert via kFloat32 first; either
make this explicit or align the flows: update the K and V flows (the code around
key_states_per_head[h], ptq::QDQ, and ptq::QDQ_KV) to include the intermediate
.to(kFloat32) conversion like the other models, or add a concise comment above
these snippets explaining why Qwen3 intentionally omits kFloat32 (e.g., expected
input dtype/performance reason) so reviewers understand the divergence.
In `@mllm/backends/qnn/aot/visitor/CastType.cpp`:
- Line 67: Update the inline comment in CastType.cpp to use the required TODO
prefix format: replace the existing comment "// TODO, Cast usage is error." with
the guideline-compliant "// TODO: Cast usage is error." so the TODO uses the
exact "TODO:" prefix (look for the comment near the CastType visitor
implementation in CastType.cpp).
- Around line 64-65: Update the TODO comment in CastType.cpp from "// TODO, Cast
usage is error." to "// TODO: Cast usage is error." to match required TODO
format; then ensure the int→int branch that sets qnn_op_type = "Convert" (where
isInt(input_dtype) && isInt(target_dtype) is true) correctly handles quantized
integer tensors by detecting per-tensor/per-channel quantization (via the same
checks used by isInt or any quantization helpers) and either perform a
requantize/recalibration step instead of a blind Convert or insert
dequantize→convert→requantize logic so differing scale/zero-point parameters are
reconciled before/after Convert.
In `@mllm/backends/qnn/QNNAllocator.cpp`:
- Around line 16-35: The constructors of QNNAllocator set rpcmem_alloc,
rpcmem_free, and rpcmem_to_fd via dlsym without checking for nullptr; update
both QNNAllocator() and QNNAllocator(QNN_INTERFACE_VER_TYPE, void*) to validate
each dlsym result (rpcmem_alloc, rpcmem_free, rpcmem_to_fd) after assignment,
and if any is nullptr call MLLM_ERROR_EXIT (or MLLM_RT_ASSERT if preferred) with
a clear message that includes dlerror() context; reference the libCdspHandle_
loading logic and use the same error-handling pattern used for dlopen to fail
fast when symbols are missing.
In `@mllm/backends/qnn/QNNBackend.cpp`:
- Around line 80-121: The destructor QNNBackend::~QNNBackend calls
runtime_->qnnInterface.contextFree(context_, nullptr) unconditionally but
context_ may be null if createContext()/loadContext() were never called; guard
the call by checking context_ (and that runtime_ is non-null) before invoking
contextFree so you don't pass a null handle into the QNN API, and ensure you
still set context_ = nullptr and reset runtime_ afterwards.
In `@mllm/backends/qnn/QNNUtils.hpp`:
- Around line 46-49: Header QNNUtils.hpp declares std::pair<bool, void*> for
functions loadQNNSymbol and loadQNNSystemSymbol but doesn't include <utility>;
add an explicit `#include <utility>` to this header (near other includes) so the
std::pair type is guaranteed to be available and avoid relying on transitive
includes.
In `@pymllm/backends/qualcomm/transformers/core/qdq.py`:
- Around line 30-34: In the symmetric-quantization branch of the QDQ code
(around the attributes self.quant_min, self.quant_max, bits handling), replace
the incorrect dtype torch.uint8 with torch.qint8 to match PyTorch
FakeQuantize/MinMaxObserver signed symmetric semantics; update the comment to
correctly describe symmetric range as centered at zero (e.g., [-128, 127])
instead of [0,255]; and replace the assert bits == 8 with an explicit check that
raises a ValueError if bits is not 8 (so validation remains at runtime even
under -O). Ensure references to self.dtype, bits, and the surrounding symmetric
quantization logic (the block setting self.quant_min/self.quant_max) are updated
accordingly.
🧹 Nitpick comments (5)
mllm/backends/qnn/aot_rt/TokenGenerator.cpp (1)
112-115: Improve readability with named constants for sentinel values and decode-phase size.

The buffer initialization is correct: `initAttentionMask` fully clears the attention mask via `std::fill_n(attention_mask, ar_len * config_.context_len, ...)` before populating it. However, the purpose of the `-1` sentinel in `attention_map` and the hard-coded `1` for the decode-phase batch size are non-obvious at the call site. Per coding guidelines, use named constants to clarify intent:

🔧 Suggested readability refactor

```diff
-  std::vector<int32_t> attention_map(1);
-  std::iota(attention_map.begin(), attention_map.end(), -1);
-  kv_manager_->initAttentionMask(input_tensors_[2].ptr<uint16_t>(), attention_map, 1, current_pos);
+  constexpr int32_t kDecodeTokenCount = 1;
+  constexpr int32_t kAttentionMapSentinel = -1;  // Signals "attend to past tokens only" in initAttentionMask
+  std::vector<int32_t> attention_map(kDecodeTokenCount);
+  std::iota(attention_map.begin(), attention_map.end(), kAttentionMapSentinel);
+  kv_manager_->initAttentionMask(input_tensors_[2].ptr<uint16_t>(), attention_map, kDecodeTokenCount, current_pos);
```

Also applies to: 132-133
mllm/engine/Context.hpp (1)
33-33: Add a brief doc comment for the new public API.A one-line description (including that it’s a no-op if absent) will clarify expectations for callers.
📝 Suggested change
```diff
-  void shutdownBackend(const DeviceTypes& device);
+  // Idempotently remove backend for device; no-op if absent.
+  void shutdownBackend(const DeviceTypes& device);
```

As per coding guidelines: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
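The "no-op if absent" contract suggested for the doc comment is simple to express; a minimal sketch, assuming backends live in a device-keyed map (an assumption for illustration, not the verified mllm layout):

```python
# Sketch: idempotent shutdownBackend. The dict registry and device keys
# are hypothetical; only the no-op-if-absent semantics are the point.
backends = {"qnn": "qnn-backend", "cpu": "cpu-backend"}

def shutdown_backend(device: str) -> bool:
    """Remove the backend for `device`; returns False (no-op) if absent."""
    return backends.pop(device, None) is not None

first_call = shutdown_backend("qnn")   # removes the entry
second_call = shutdown_backend("qnn")  # safe repeat, nothing to do
```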
mllm/backends/qnn/aot_rt/QnnAOTModule.hpp (1)
15-18: Destructor clear is redundant; use a defaulted destructor.

`output_tensors_` is a `std::vector<Tensor>` that will be automatically destroyed when the object is destroyed. Calling `clear()` explicitly does not change C++ member destruction order (members are destroyed in reverse declaration order, and `output_tensors_` is the last member, so it destroys first). The explicit `clear()` adds unnecessary work without any benefit.

```diff
-  ~QnnAOTModule() {
-    // Clear output tensors to ensure proper cleanup order
-    output_tensors_.clear();
-  }
+  ~QnnAOTModule() = default;
```

mllm/backends/qnn/aot/QnnWrappersAPI.cpp (1)
141-145: Use a named constant for the symmetric uint8 zero-point.

`-128` is a magic number; naming it keeps intent clear and prevents drift if the expected zero-point changes. Consider extracting it as a named constant, e.g. `constexpr int32_t kSymmetricUint8ZeroPoint = 128;`, and then using `-kSymmetricUint8ZeroPoint` in the offset assignment. This improves code maintainability and aligns with the coding guideline to use named constants instead of magic numbers.

mllm/backends/qnn/aot_rt/KVCacheManager.hpp (1)
27-40: Destructor can be defaulted; explicit cleanup is redundant.

No base classes exist, and `std::vector` destructors automatically destroy all contained `KVCache<T>` elements and their `shared_ptr` members in the correct order. Manual `reset()` calls in the destructor body do not change the member-destruction sequence and add no benefit. If ordering with external systems truly requires early cleanup, implement a dedicated `shutdown()` method instead.
```diff
   // K: De-quantize and re-quantize to int8
-  auto k_h = key_states_per_head[h].to(kFloat32);
-  k_h = k_h.to(kUInt8PerTensorSym);
+  auto k_h = key_states_per_head[h].to(kUInt8PerTensorSym);
   k_h = ptq::QDQ_KV(this, k_h, "k_cast_to_int8_qdq_h" + h_str);
   k_h = k_h.transpose(2, 3);  // [B, 1, D, S]

   // V: Quantize to int16 then int8
   auto v_h = ptq::QDQ(this, value_states_per_head[h], "v_cast_to_int16_qdq_h" + h_str);
-  v_h = v_h.to(kFloat32);
   v_h = v_h.to(kUInt8PerTensorSym);
   v_h = ptq::QDQ_KV(this, v_h, "v_cast_to_int8_qdq_h" + h_str);
```
🏁 Verification scripts:

```bash
#!/bin/bash
# Verify the K/V quantization patterns across SHA files
echo "=== Qwen3 SHA (this file) ==="
rg -n "key_states_per_head\[h\]\.to\(" examples/qwen3_qnn_aot/modeling_qwen_qnn_aot_sha.hpp
echo ""
echo "=== Qwen2 SHA ==="
rg -n "key_states_per_head\[h\]\.to\(" examples/qwen2_qnn_aot/modeling_qwen2_qnn_aot_sha.hpp
echo ""
echo "=== Llama SHA ==="
rg -n "key_states_per_head\[h\]\.to\(" examples/llama_qnn_aot/modeling_llama_qnn_aot_sha.hpp

# Get context around the K quantization in each file
echo "=== Qwen3 SHA - K quantization context (lines 355-370) ==="
sed -n '355,370p' examples/qwen3_qnn_aot/modeling_qwen_qnn_aot_sha.hpp | cat -n
echo ""
echo "=== Qwen2 SHA - K quantization context (lines 343-358) ==="
sed -n '343,358p' examples/qwen2_qnn_aot/modeling_qwen2_qnn_aot_sha.hpp | cat -n
echo ""
echo "=== Llama SHA - K quantization context (lines 343-358) ==="
sed -n '343,358p' examples/llama_qnn_aot/modeling_llama_qnn_aot_sha.hpp | cat -n
```
K/V quantization paths differ from other SHA variants—consider clarifying the design choice.
Both K and V quantization paths in this file skip the intermediate kFloat32 conversion step that is present in Qwen2 and Llama SHA implementations:
Qwen3 (this file):
- K: `key_states_per_head[h].to(kUInt8PerTensorSym)` → `ptq::QDQ_KV(...)`
- V: `ptq::QDQ(...)` → `.to(kUInt8PerTensorSym)` → `ptq::QDQ_KV(...)`

Qwen2/Llama:
- K: `key_states_per_head[h].to(kFloat32)` → `.to(kUInt8PerTensorSym)` → `ptq::QDQ_KV(...)`
- V: `ptq::QDQ(...)` → `.to(kFloat32)` → `.to(kUInt8PerTensorSym)` → `ptq::QDQ_KV(...)`
If this divergence is intentional (e.g., due to different input data types), add a comment explaining why. Otherwise, align the quantization path with the other models to ensure consistency.
🤖 Prompt for AI Agents
In `@examples/qwen3_qnn_aot/modeling_qwen_qnn_aot_sha.hpp` around lines 358 - 366,
The K/V quantization paths for Qwen3 differ from Qwen2/Llama—K uses
key_states_per_head[h].to(kUInt8PerTensorSym) then ptq::QDQ_KV, and V uses
ptq::QDQ(...) then .to(kUInt8PerTensorSym) then ptq::QDQ_KV—while other SHA
implementations convert via kFloat32 first; either make this explicit or align
the flows: update the K and V flows (the code around key_states_per_head[h],
ptq::QDQ, and ptq::QDQ_KV) to include the intermediate .to(kFloat32) conversion
like the other models, or add a concise comment above these snippets explaining
why Qwen3 intentionally omits kFloat32 (e.g., expected input dtype/performance
reason) so reviewers understand the divergence.
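For readers less familiar with `kUInt8PerTensorSym`: per-tensor symmetric quantization stored in uint8 places the zero-point at 128, which is the 0 to 128 change this PR makes throughout. A pure-Python sketch of that convention (illustrative, not the mllm implementation):

```python
# uint8 per-tensor symmetric quantization: real 0.0 maps to code 128,
# and the representable range is symmetric around zero.
UINT8_SYM_ZERO_POINT = 128  # assumed convention, matching the PR's 0 -> 128 change

def quantize_u8_sym(x: float, scale: float) -> int:
    q = round(x / scale) + UINT8_SYM_ZERO_POINT
    return max(0, min(255, q))  # clamp to the uint8 range

def dequantize_u8_sym(q: int, scale: float) -> float:
    return (q - UINT8_SYM_ZERO_POINT) * scale

code_zero = quantize_u8_sym(0.0, 0.5)  # real zero lands on the zero-point
code_one = quantize_u8_sym(1.0, 0.5)   # two scale-steps above the zero-point
```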
```cpp
} else if (isInt(input_dtype) && isInt(target_dtype)) {
  qnn_op_type = "Convert";
```
🏁 Verification scripts:

```bash
find . -name "CastType.cpp" -type f
cat -n mllm/backends/qnn/aot/visitor/CastType.cpp | sed -n '55,75p'
wc -l mllm/backends/qnn/aot/visitor/CastType.cpp
cat -n mllm/backends/qnn/aot/visitor/CastType.cpp | head -n 86
grep -r "isInt\|isFloat" mllm/backends/qnn/ --include="*.cpp" --include="*.h" -B 2 -A 2 | head -n 50
grep -r "quantiz\|quant_param\|scale\|zero_point" mllm/backends/qnn/aot/visitor/ --include="*.cpp" --include="*.h"
grep -r "Convert" mllm/backends/qnn/ --include="*.cpp" --include="*.h" -B 2 -A 2 | head -n 40
grep -A 10 "captureQnnAOTNodeTensor" mllm/backends/qnn/aot/visitor/CastType.cpp
```
Fix TODO comment format violation at line 67.
Change `// TODO, Cast usage is error.` to `// TODO: Cast usage is error.` to comply with the required TODO format (must include a colon). Additionally, verify that the int→int "Convert" path at line 65 handles quantized integer-to-integer conversions correctly; the `isInt()` function includes per-tensor and per-channel quantized types, so "Convert" may be called on tensors with differing scale/zero-point parameters.
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot/visitor/CastType.cpp` around lines 64 - 65, Update the
TODO comment in CastType.cpp from "// TODO, Cast usage is error." to "// TODO:
Cast usage is error." to match required TODO format; then ensure the int→int
branch that sets qnn_op_type = "Convert" (where isInt(input_dtype) &&
isInt(target_dtype) is true) correctly handles quantized integer tensors by
detecting per-tensor/per-channel quantization (via the same checks used by isInt
or any quantization helpers) and either perform a requantize/recalibration step
instead of a blind Convert or insert dequantize→convert→requantize logic so
differing scale/zero-point parameters are reconciled before/after Convert.
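The dequantize → convert → requantize flow the prompt describes reduces to mapping codes through the real domain; a small sketch with made-up scales and zero-points (nothing here is the QNN "Convert" op itself):

```python
# Sketch: reconcile differing (scale, zero_point) pairs by round-tripping
# through real values, which a blind integer Convert would skip.
def requantize(q: int, scale1: float, zp1: int, scale2: float, zp2: int,
               qmin: int = 0, qmax: int = 255) -> int:
    real = (q - zp1) * scale1        # dequantize under the source params
    q2 = round(real / scale2) + zp2  # requantize under the target params
    return max(qmin, min(qmax, q2))  # clamp to the target integer range

# Code 138 under (scale=0.05, zp=128) represents 0.5; under
# (scale=0.1, zp=128) the same real value needs code 133.
q_out = requantize(138, 0.05, 128, 0.1, 128)
```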
```cpp
} else if (isInt(input_dtype) && isInt(target_dtype)) {
  qnn_op_type = "Convert";
} else if (input_dtype == kFloat32 && target_dtype == kFloat16) {
  // TODO, Cast usage is error.
```
Fix TODO comment format to comply with guidelines.
The TODO must use the exact `TODO:` prefix.
✅ Suggested fix

```diff
- // TODO, Cast usage is error.
+ // TODO: Cast usage is incorrect; verify the proper QNN op for f32->f16.
```

As per coding guidelines, TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```diff
- // TODO, Cast usage is error.
+ // TODO: Cast usage is incorrect; verify the proper QNN op for f32->f16.
```
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot/visitor/CastType.cpp` at line 67, Update the inline
comment in CastType.cpp to use the required TODO prefix format: replace the
existing comment "// TODO, Cast usage is error." with the guideline-compliant
"// TODO: Cast usage is error." so the TODO uses the exact "TODO:" prefix (look
for the comment near the CastType visitor implementation in CastType.cpp).
```diff
 QNNAllocator::QNNAllocator() {
-  void* libCdspHandle = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
-  if (nullptr == libCdspHandle) { MLLM_ERROR_EXIT(1, "dlopen libcdsprpc.so failed"); }
+  libCdspHandle_ = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
+  if (nullptr == libCdspHandle_) { MLLM_ERROR_EXIT(1, "dlopen libcdsprpc.so failed"); }

-  rpcmem_alloc = (RpcMemAllocFn_t)dlsym(libCdspHandle, "rpcmem_alloc");
-  rpcmem_free = (RpcMemFreeFn_t)dlsym(libCdspHandle, "rpcmem_free");
-  rpcmem_to_fd = (RpcMemToFdFn_t)dlsym(libCdspHandle, "rpcmem_to_fd");
+  rpcmem_alloc = (RpcMemAllocFn_t)dlsym(libCdspHandle_, "rpcmem_alloc");
+  rpcmem_free = (RpcMemFreeFn_t)dlsym(libCdspHandle_, "rpcmem_free");
+  rpcmem_to_fd = (RpcMemToFdFn_t)dlsym(libCdspHandle_, "rpcmem_to_fd");
 }

 QNNAllocator::QNNAllocator(QNN_INTERFACE_VER_TYPE qnnInterface, void* context)
     : qnnInterface_(qnnInterface), context_(context) {
   MLLM_RT_ASSERT(context_ != nullptr);

-  void* libCdspHandle = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
-  if (nullptr == libCdspHandle) { MLLM_ERROR_EXIT(1, "dlopen libcdsprpc.so failed"); }
+  libCdspHandle_ = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
+  if (nullptr == libCdspHandle_) { MLLM_ERROR_EXIT(1, "dlopen libcdsprpc.so failed"); }

-  rpcmem_alloc = (RpcMemAllocFn_t)dlsym(libCdspHandle, "rpcmem_alloc");
-  rpcmem_free = (RpcMemFreeFn_t)dlsym(libCdspHandle, "rpcmem_free");
-  rpcmem_to_fd = (RpcMemToFdFn_t)dlsym(libCdspHandle, "rpcmem_to_fd");
+  rpcmem_alloc = (RpcMemAllocFn_t)dlsym(libCdspHandle_, "rpcmem_alloc");
+  rpcmem_free = (RpcMemFreeFn_t)dlsym(libCdspHandle_, "rpcmem_free");
+  rpcmem_to_fd = (RpcMemToFdFn_t)dlsym(libCdspHandle_, "rpcmem_to_fd");
 }
```
🏁 Verification scripts:

```bash
cat -n mllm/backends/qnn/QNNAllocator.cpp | head -80
rg "MLLM_ERROR_EXIT" mllm/utils/ -A 2 | head -20
rg "define MLLM_ERROR_EXIT" mllm/utils/Log.hpp -A 5
rg "dlerror" mllm/ -A 2
rg "MLLM_ERROR_EXIT.*ExitCode" mllm/ -B 1 | head -30
rg "MLLM_ERROR_EXIT\(1" mllm/ -B 1 -A 1
cat -n mllm/backends/qnn/QNNAllocator.cpp | sed -n '49,77p'
rg "rpcmem_alloc|rpcmem_free|rpcmem_to_fd" mllm/backends/qnn/QNNAllocator.hpp -B 1 -A 1
rg "RpcMemAllocFn_t|RpcMemFreeFn_t|RpcMemToFdFn_t" mllm/backends/qnn/QNNAllocator.hpp -B 2 -A 2
```
Add dlsym failure checks before using the `rpcmem_*` pointers.

The `rpcmem_alloc`, `rpcmem_free`, and `rpcmem_to_fd` function pointers are loaded without validation in both constructors (lines 20–22 and 32–34). If any symbol fails to resolve, `dlsym` returns `nullptr` and subsequent calls crash. Apply the same error-handling pattern used for `dlopen` to the `dlsym` results in both constructors:
🔧 Suggested fix

```diff
   rpcmem_alloc = (RpcMemAllocFn_t)dlsym(libCdspHandle_, "rpcmem_alloc");
+  if (!rpcmem_alloc) { MLLM_ERROR_EXIT(1, "dlsym rpcmem_alloc failed: {}", dlerror()); }
   rpcmem_free = (RpcMemFreeFn_t)dlsym(libCdspHandle_, "rpcmem_free");
+  if (!rpcmem_free) { MLLM_ERROR_EXIT(1, "dlsym rpcmem_free failed: {}", dlerror()); }
   rpcmem_to_fd = (RpcMemToFdFn_t)dlsym(libCdspHandle_, "rpcmem_to_fd");
+  if (!rpcmem_to_fd) { MLLM_ERROR_EXIT(1, "dlsym rpcmem_to_fd failed: {}", dlerror()); }
```

🤖 Prompt for AI Agents
In `@mllm/backends/qnn/QNNAllocator.cpp` around lines 16 - 35, The constructors of
QNNAllocator set rpcmem_alloc, rpcmem_free, and rpcmem_to_fd via dlsym without
checking for nullptr; update both QNNAllocator() and
QNNAllocator(QNN_INTERFACE_VER_TYPE, void*) to validate each dlsym result
(rpcmem_alloc, rpcmem_free, rpcmem_to_fd) after assignment, and if any is
nullptr call MLLM_ERROR_EXIT (or MLLM_RT_ASSERT if preferred) with a clear
message that includes dlerror() context; reference the libCdspHandle_ loading
logic and use the same error-handling pattern used for dlopen to fail fast when
symbols are missing.
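The fail-fast symbol-loading pattern generalizes beyond C; as an illustration, the same idea in Python's ctypes (the missing symbol name below is deliberately fictitious):

```python
import ctypes
import ctypes.util

# Sketch: resolve every required symbol eagerly and abort on the first
# miss, mirroring the suggested dlsym checks after dlopen succeeds.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

def must_resolve(lib, name: str):
    fn = getattr(lib, name, None)  # None plays the role of dlsym returning NULL
    if fn is None:
        raise RuntimeError(f"dlsym {name} failed")  # analogous to MLLM_ERROR_EXIT
    return fn

printf = must_resolve(libc, "printf")  # a symbol libc actually exports
try:
    must_resolve(libc, "rpcmem_alloc_not_really_here")  # fictitious symbol
    missing_detected = False
except RuntimeError:
    missing_detected = True
```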
```cpp
QNNBackend::~QNNBackend() {
  // Cleanup order is critical - we hold all QNN library handles to control unload order:
  // 1. Allocator shutdown (memDeRegister + rpcmem_free) - needs QNN alive
  // 2. Clear models - tensor destructors try to free but allocator is shut down
  // 3. Perf cleanup - needs QNN HTP infrastructure alive
  // 4. Runtime cleanup - frees QNN backend/device handles
  // 5. Allocator reset - dlcloses libcdsprpc.so (held by allocator)
  // 6. Close QNN libraries - libQnnSystem.so first, then libQnnHtp.so

  // 1. Properly shutdown allocator while QNN is still alive
  //    This calls memDeRegister and rpcmem_free safely
  if (allocator_) {
    auto* qnnAllocator = dynamic_cast<QNNAllocator*>(allocator_.get());
    if (qnnAllocator) { qnnAllocator->shutdown(); }
  }

  // 2. Clear models - tensor destructors will call free() but they're now no-ops
  qnnModels_.clear();
  qnnModelIndexMap_.clear();

  // 3. Cleanup perf while QNN HTP infrastructure is still alive
  if (perf_) { perf_->shutdown(); }
  perf_.reset();

  // 4. Cleanup runtime - frees QNN backend/device handles
  runtime_->qnnInterface.contextFree(context_, nullptr);
  context_ = nullptr;
  runtime_.reset();

  // 5. Reset allocator - will dlclose libcdsprpc.so since shutdown() was already called
  allocator_.reset();

  // 6. Close QNN libraries in reverse order of dependency
  if (qnnSystemLibHandle_) {
    dlclose(qnnSystemLibHandle_);
    qnnSystemLibHandle_ = nullptr;
  }
  if (qnnHtpLibHandle_) {
    dlclose(qnnHtpLibHandle_);
    qnnHtpLibHandle_ = nullptr;
  }
}
```
🏁 Verification scripts:

```bash
# First, find the QNNBackend header file
fd -t f "QNNBackend" --extension h --extension hpp

# Read the QNNBackend.h header to understand member initialization
cat -n mllm/backends/qnn/QNNBackend.h

# Now find the constructor implementation to see when context_ and runtime_ are initialized
rg "QNNBackend::" mllm/backends/qnn/QNNBackend.cpp -A 5 | head -100

# Check the full constructor to understand initialization order
sed -n '1,80p' mllm/backends/qnn/QNNBackend.cpp | tail -50

# Read the QNNBackend header file (it's .hpp not .h)
cat -n mllm/backends/qnn/QNNBackend.hpp | head -150

# Look for context_ initialization in the header
rg "context_" mllm/backends/qnn/QNNBackend.hpp -B 2 -A 2

# Check if context_ has a default initialization (might be in member initialization list)
rg "Qnn_.*_t context" mllm/backends/qnn/ -A 2 -B 2

# Search for all usages of contextFree to understand expected behavior
rg "contextFree" mllm/backends/qnn/ -B 3 -A 3

# Check if MLLM_ERROR_EXIT actually terminates the program
rg "MLLM_ERROR_EXIT" mllm/utils/Log.hpp -B 2 -A 5

# Check for exception handling or early returns in QNNBackend constructor
sed -n '1,75p' mllm/backends/qnn/QNNBackend.cpp

# Verify if createContext or loadContext must be called before using the backend
rg "createContext|loadContext" mllm/backends/qnn/ -B 2 -A 2 | head -50

# Check if there are any ways runtime_ could be null or partially initialized in destructor
rg "reset\(\)" mllm/backends/qnn/QNNBackend.cpp

# Verify that context is indeed nullptr when not explicitly created
rg "context_.*=" mllm/backends/qnn/QNNBackend.cpp | head -20

# Check how createContext and loadContext set context_
rg "context_.*=" mllm/backends/qnn/QNNRuntime.cpp -B 3 -A 3 | head -40

# Find all QNN files to understand the structure better
fd -t f "QNN" mllm/backends/qnn/ --extension cpp --extension hpp | head -20

# Check if there are any other uses of contextFree or similar patterns
rg "contextFree" mllm/backends/qnn/ -B 5 -A 5

# Look at how runtime_ is constructed to confirm it's always valid at destructor
rg "QNNRuntime::create" mllm/backends/qnn/ -B 2 -A 10

# Check if there's a null check pattern used elsewhere for similar operations
rg "if.*runtime_\|if.*context_" mllm/backends/qnn/QNNBackend.cpp -B 1 -A 1
```
Guard `contextFree` when `context_` is null.

The destructor can run without a context being created if neither `createContext()` nor `loadContext()` is called. Passing a null context handle to QNN may be invalid. Add a guard to safely handle this case.
Suggested fix

```diff
-  runtime_->qnnInterface.contextFree(context_, nullptr);
-  context_ = nullptr;
+  if (context_) {
+    runtime_->qnnInterface.contextFree(context_, nullptr);
+    context_ = nullptr;
+  }
```

As per coding guidelines: Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/QNNBackend.cpp` around lines 80 - 121, The destructor
QNNBackend::~QNNBackend calls runtime_->qnnInterface.contextFree(context_,
nullptr) unconditionally but context_ may be null if
createContext()/loadContext() were never called; guard the call by checking
context_ (and that runtime_ is non-null) before invoking contextFree so you
don't pass a null handle into the QNN API, and ensure you still set context_ =
nullptr and reset runtime_ afterwards.
```cpp
// Load QNN symbols and return library handle for lifecycle management
// Returns {success, libHandle} - caller owns the handle and must dlclose it
std::pair<bool, void*> loadQNNSymbol();
std::pair<bool, void*> loadQNNSystemSymbol();
```
🏁 Verification scripts:

```bash
# Check if <utility> is included and verify std::pair usage
rg -n '#include <utility>' mllm/backends/qnn/QNNUtils.hpp || echo "No <utility> found"
rg -n 'std::pair' mllm/backends/qnn/QNNUtils.hpp || echo "No std::pair found"

# Read the beginning of the file to see all includes
head -60 mllm/backends/qnn/QNNUtils.hpp
```
Add a `<utility>` include for the `std::pair` declarations.

This header uses `std::pair<bool, void*>` at lines 48–49 but does not explicitly include `<utility>`. While compilation may succeed due to transitive includes, this is fragile and can break on stricter toolchains.
🛠️ Proposed fix

```diff
 #include <memory>
 #include <string>
 #include <vector>
+#include <utility>
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```cpp
#include <memory>
#include <string>
#include <vector>
#include <utility>

// Load QNN symbols and return library handle for lifecycle management
// Returns {success, libHandle} - caller owns the handle and must dlclose it
std::pair<bool, void*> loadQNNSymbol();
std::pair<bool, void*> loadQNNSystemSymbol();
```
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/QNNUtils.hpp` around lines 46 - 49, Header QNNUtils.hpp
declares std::pair<bool, void*> for functions loadQNNSymbol and
loadQNNSystemSymbol but doesn't include <utility>; add an explicit
`#include <utility>` to this header (near other includes) so the std::pair type
is guaranteed to be available and avoid relying on transitive includes.
```python
# NOTE: If left empty: with uint8 and symmetric quantization, the observer will use [0, 255] as the range. And 128 as the zero_point.
self.quant_min = None
self.quant_max = None
assert bits == 8, "Symmetric quantization is only supported for 8-bit"
self.dtype = torch.uint8
```
🏁 Verification scripts:

```bash
# Read the file to understand context
head -60 pymllm/backends/qualcomm/transformers/core/qdq.py

# Search for torch.ao.quantization usage in the codebase
rg "torch\.ao\.quantization|FakeQuantize|MinMaxObserver" -A 3 -B 3 --type py

# Check for torch.quint8 or torch.qint8 usage
rg "torch\.qu?int8|torch\.uint8" --type py -B 2 -A 2
```
🌐 Web query:
PyTorch torch.ao.quantization FakeQuantize MinMaxObserver dtype torch.uint8 symmetric quantization
💡 Result:
In PyTorch eager/QAT quantization, FakeQuantize / MinMaxObserver do not accept torch.uint8. Their dtype is limited to torch.qint8 (signed) or torch.quint8 (unsigned quantized). [1], [2]
Also, symmetric quantization is a qscheme (e.g. torch.per_tensor_symmetric), and PyTorch defines symmetric vs affine schemes separately from dtype. [3]
What to do

1) If you want true symmetric 8-bit quantization (centered at 0), use signed int8:

```python
import torch
from torch.ao.quantization import FakeQuantize
from torch.ao.quantization.observer import MinMaxObserver

fq = FakeQuantize.with_args(
    observer=MinMaxObserver,
    dtype=torch.qint8,
    qscheme=torch.per_tensor_symmetric,
    quant_min=-128,
    quant_max=127,
)()
```

This matches the usual meaning of "symmetric" (zero_point = 0).
2) If you specifically want an unsigned 8-bit quantized dtype, use torch.quint8 (not torch.uint8). Note: with unsigned 8-bit, you typically end up with affine (asymmetric) quantization (nonzero zero_point), not true symmetry around 0:

```python
fq = FakeQuantize.with_args(
    observer=MinMaxObserver,
    dtype=torch.quint8,
    qscheme=torch.per_tensor_affine,
    quant_min=0,
    quant_max=255,
)()
```

Key point: if your goal is "symmetric quantization," pick torch.qint8 + torch.per_tensor_symmetric; if your goal is "store as 8-bit unsigned," pick torch.quint8 but expect affine behavior. [1], [2], [3]
Sources:
[1] PyTorch FakeQuantize docs (dtype allowed values)
[2] PyTorch MinMaxObserver docs (dtype allowed values)
[3] PyTorch quantization API reference (supported qscheme incl. symmetric)
Replace `torch.uint8` with `torch.qint8` and replace `assert` with explicit validation.

The symmetric quantization path has three issues:

- Invalid dtype: `torch.uint8` is not supported by PyTorch's `FakeQuantize`/`MinMaxObserver`. Valid dtypes are `torch.qint8` (signed) or `torch.quint8` (unsigned quantized). For true symmetric quantization (zero_point = 0), use `torch.qint8`.
- Misleading comment and semantics: the comment describes asymmetric behavior (range [0, 255], zero_point = 128), which contradicts the symmetric quantization claim. True symmetric quantization centers at 0 with range [-128, 127].
- Unreliable input validation: `assert bits == 8` can be stripped with Python's `-O` optimization flag. Use explicit error handling instead.
✅ Corrected symmetric quantization setup

```diff
- self.quant_min = None
- self.quant_max = None
- assert bits == 8, "Symmetric quantization is only supported for 8-bit"
- self.dtype = torch.uint8
+ if bits != 8:
+     raise ValueError("Symmetric quantization is only supported for 8-bit")
+ self.quant_min = -128
+ self.quant_max = 127
+ self.dtype = torch.qint8
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 # NOTE: If left empty: with uint8 and symmetric quantization, the observer will use [0, 255] as the range. And 128 as the zero_point.
-self.quant_min = None
-self.quant_max = None
-assert bits == 8, "Symmetric quantization is only supported for 8-bit"
-self.dtype = torch.uint8
+if bits != 8:
+    raise ValueError("Symmetric quantization is only supported for 8-bit")
+self.quant_min = -128
+self.quant_max = 127
+self.dtype = torch.qint8
```
🤖 Prompt for AI Agents
In `@pymllm/backends/qualcomm/transformers/core/qdq.py` around lines 30-34: in
the symmetric-quantization branch of the QDQ code (around the attributes
self.quant_min, self.quant_max, bits handling), replace the incorrect dtype
torch.uint8 with torch.qint8 to match PyTorch FakeQuantize/MinMaxObserver signed
symmetric semantics; update the comment to correctly describe symmetric range as
centered at zero (e.g., [-128, 127]) instead of [0,255]; and replace the assert
bits == 8 with an explicit check that raises a ValueError if bits is not 8 (so
validation remains at runtime even under -O). Ensure references to self.dtype,
bits, and the surrounding symmetric quantization logic (the block setting
self.quant_min/self.quant_max) are updated accordingly.
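The zero_point = 128 convention that the NOTE describes (symmetric scale, values stored as uint8) can be sketched in plain Python arithmetic. This is an illustration of the numeric mapping only; the helper names are hypothetical and not part of the project's API:

```python
def quantize_uint8_symmetric(x, scale):
    """Map a float to uint8 with a symmetric scale and zero_point = 128."""
    q = round(x / scale) + 128
    return max(0, min(255, q))  # clamp to the uint8 range [0, 255]

def dequantize_uint8_symmetric(q, scale):
    """Recover the float value; the zero_point is subtracted back out."""
    return (q - 128) * scale

scale = 0.1
assert quantize_uint8_symmetric(0.0, scale) == 128    # zero lands on the zero_point
assert quantize_uint8_symmetric(-12.8, scale) == 0    # most negative representable value
assert dequantize_uint8_symmetric(128, scale) == 0.0
```

This is why the examples and parsers in this PR expect zero_point 128 rather than 0 for uint8 per-tensor symmetric tensors: the symmetric range [-128, 127] is shifted into [0, 255] by the fixed offset.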
…s for better compatibility and consistency in tensor caching.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
mllm/compile/ir/Node.cpp (1)
272-280: ⚠️ Potential issue | 🟡 Minor — Remove unused push/pop region methods or clarify their purpose.

These methods are defined but never called anywhere in the codebase. The actual region management uses `resetRegion()` and the `IRWriterGuard` pattern instead. Additionally, the push/pop methods have inconsistent semantics: `push()` immediately sets `cur_insert_region_` to the new region, but `pop()` sets it to the top of the stack before popping. If these methods are intended for future use, add a comment explaining the intent; otherwise, remove them to avoid maintaining dead code.
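The save/restore discipline that a guard pattern like `IRWriterGuard` implies can be sketched as a Python context manager. This is a hedged analogy, not the actual mllm C++ API — `Context` and `writer_guard` are hypothetical names:

```python
from contextlib import contextmanager

class Context:
    """Minimal stand-in for an IR context with a current insertion region."""
    def __init__(self):
        self.cur_insert_region = None

@contextmanager
def writer_guard(ctx, region):
    # Save the current insertion region on entry and restore it on exit,
    # so nested writers cannot leave the context pointing at a stale region.
    saved = ctx.cur_insert_region
    ctx.cur_insert_region = region
    try:
        yield ctx
    finally:
        ctx.cur_insert_region = saved

ctx = Context()
with writer_guard(ctx, "outer"):
    with writer_guard(ctx, "inner"):
        assert ctx.cur_insert_region == "inner"
    assert ctx.cur_insert_region == "outer"   # restored after the inner guard
assert ctx.cur_insert_region is None          # fully unwound
```

Scoped restoration like this is exactly what unmatched `push()`/`pop()` calls cannot guarantee, which is why the guard-based path is preferred.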
🧹 Nitpick comments (1)
mllm/compile/ir/Node.cpp (1)
268-268: Direct map access may insert null entries for missing keys. Using `cached_inputs_outputs_[uuid]` when the key doesn't exist will insert a default-constructed `val_ptr_t` (null) into the map, potentially corrupting the cache state. Consider using `find()` or `at()` with proper handling, or document that callers must always call `isCacheInputOutputTensor()` first.

🔧 Proposed safer implementation

```diff
-val_ptr_t IRContext::getCacheInputOutputTensor(size_t uuid) { return cached_inputs_outputs_[uuid]; }
+val_ptr_t IRContext::getCacheInputOutputTensor(size_t uuid) {
+  auto it = cached_inputs_outputs_.find(uuid);
+  return it != cached_inputs_outputs_.end() ? it->second : nullptr;
+}
```
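The hazard of `operator[]`-style lookup is easy to reproduce in Python with `collections.defaultdict`, which likewise inserts a default entry on a missing-key lookup. This is an illustrative analogy for the C++ behavior above, not the project's code:

```python
from collections import defaultdict

# defaultdict mirrors std::map::operator[]: a plain lookup of a
# missing key silently inserts a default-constructed (None) entry.
cache = defaultdict(lambda: None)
_ = cache[42]
assert len(cache) == 1      # the lookup alone grew the cache

# dict.get mirrors the find()-based fix: a missing key returns None
# and the container is left untouched.
safe_cache = {}
assert safe_cache.get(42) is None
assert len(safe_cache) == 0
```

The same trade-off applies in C++: `find()` keeps the lookup read-only, while `operator[]` is a read-modify operation in disguise.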
SM8650, chunk=32
Qwen3 1.7B, prefill 640 tokens, decode 128 tokens. w4a16kv8.
Prefill: 758.28 tps
Decode: 25.84 tps
SM8650, chunk=32
Qwen2.5 3B, prefill 256 tokens, decode 128 tokens. w4a16kv8.
Prefill: 512.75 tps
Decode: 18.29 tps
Qwen2.5 3B, prefill 512 tokens, decode 256 tokens. w4a16kv8.
Prefill: 545.62 tps
Decode: 18.11 tps
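As a sanity check on the throughput figures above, end-to-end latency can be back-computed from the per-phase rates. This is simple arithmetic, assuming the prefill and decode phases run back to back with no inter-phase overhead:

```python
def end_to_end_seconds(prefill_tokens, prefill_tps, decode_tokens, decode_tps):
    # total wall time = prefill time + decode time
    return prefill_tokens / prefill_tps + decode_tokens / decode_tps

# Qwen3 1.7B on SM8650: 640 prefill tokens at 758.28 tps,
# 128 decode tokens at 25.84 tps
t = end_to_end_seconds(640, 758.28, 128, 25.84)
print(f"{t:.2f} s")  # roughly 5.80 s, dominated by the decode phase
```

Note how decode dominates even at a 5:1 token ratio, which is why the decode tps is the number to watch for interactive use.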
Summary by CodeRabbit
Bug Fixes
New Features