doc(qualcomm): Qnn aot #625
Conversation
…ities

- Remove qnnOutputNameToIndex_ map and related functionality that was tracking QNN output order, since it's no longer needed
- Remove expectedOutputOrder_ and associated getter/setter methods from the QNNModel class
- Remove the logic in QNNGraphBuildPass that recorded the MLLM expected output order from ReturnOp
- Add new debug utility functions for printing dequantized UInt16 tensor data, with support for multi-dimensional arrays
- Fix an integer casting issue in the PromptProcessor prefill method
📝 Walkthrough

Adds QNN AOT execution documentation and propagates a configurable QNN AOT environment path to multiple example compilers; removes the QNN output-order APIs and ReturnOp-based ordering; adds dequantized-tensor debug helpers; conditions a device assertion for AOT-on-X86; removes the make_slice sugar and updates slicing call sites; and removes a small debug print in an AOT runner.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Dev as Developer (host)
    participant Tools as QNN Toolchain (offline compiler)
    participant Device as Target Device (Hexagon NPU)
    Dev->>Dev: Quantize model -> export (.mir/.mllm)
    Dev->>Tools: Run compilation (compile.cpp / mllm-convertor)
    Tools-->>Dev: Produce QNN context binary (.bin / qnn ctx)
    Dev->>Device: Deploy binary + QNN libs/packages
    Device->>Device: Load context, initialize runtime
    Device->>Device: Run inference (prefill / decode hybrid flow)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
examples/qwen3_qnn_aot/compile_sha.cpp (1)

172-173: ⚠️ Potential issue | 🟠 Major

Missing `zero_point` attachments for KV cache tensors in the seq=1 block.

In the seq=1 block, only `scale` is attached to the `past_key` and `past_value` tensors. The `zero_point` attachments are missing, unlike the seq=32 block (lines 118-122), which attaches both `scale` and `zero_point`. This inconsistency could cause issues if the QNN backend expects both quantization parameters.

🐛 Proposed fix to add missing zero_point attachments

```diff
  trace_inputs[past_key_name].attach("scale", params->pull("model.layers." + std::to_string(i) + ".self_attn.k_cast_to_int8_qdq.fake_quant.scale").impl(), true);
+ trace_inputs[past_key_name].attach("zero_point", params->pull("model.layers." + std::to_string(i) + ".self_attn.k_cast_to_int8_qdq.fake_quant.zero_point").impl(), true);
+ trace_inputs[past_value_name].attach("scale", params->pull("model.layers." + std::to_string(i) + ".self_attn.v_cast_to_int8_qdq.fake_quant.scale").impl(), true);
+ trace_inputs[past_value_name].attach("zero_point", params->pull("model.layers." + std::to_string(i) + ".self_attn.v_cast_to_int8_qdq.fake_quant.zero_point").impl(), true);
```

examples/qwen2_qnn_aot/compile_sha.cpp (1)
48-48: ⚠️ Potential issue | 🔴 Critical

Wrong config class: `Qwen3Config` used in the qwen2 compile script.

This file compiles Qwen2 models (as evidenced by `qwen2::sha::prepareParametersForSHA` on line 63 and `qwen2::sha::Qwen2ForCausalLM_SHA` on line 67), but it uses `mllm::models::qwen3::Qwen3Config` instead of `mllm::models::qwen2::Qwen2Config`. This mismatch could cause incorrect model configuration.

🐛 Proposed fix

```diff
- auto model_cfg = mllm::models::qwen3::Qwen3Config(model_cfg_path.get());
+ auto model_cfg = mllm::models::qwen2::Qwen2Config(model_cfg_path.get());
```

examples/qwen2_qnn_aot/compile.cpp (1)
40-41: ⚠️ Potential issue | 🔴 Critical

Wrong config class: `Qwen3Config` used with `Qwen2ForCausalLM`.

The config on line 40 uses `mllm::models::qwen3::Qwen3Config`, but the model on line 41 is `mllm::models::qwen2::Qwen2ForCausalLM`. This config/model namespace mismatch could lead to incorrect model initialization.

🐛 Proposed fix

```diff
- auto model_cfg = mllm::models::qwen3::Qwen3Config(model_cfg_path.get());
+ auto model_cfg = mllm::models::qwen2::Qwen2Config(model_cfg_path.get());
  auto model = mllm::models::qwen2::Qwen2ForCausalLM(model_cfg);
```
🤖 Fix all issues with AI agents
In `@docs/qnn_backend/aot_execute.rst`:
- Line 81: The command invocation is inconsistent: one place calls `python
task.py ...` while this line uses `task.py tasks/build_android_qnn.yaml`; make
them consistent by either prefixing this line with the Python interpreter
(change to `python task.py tasks/build_android_qnn.yaml`) or update both places
to rely on a shebang and mark `task.py` executable—adjust the docs so all
references to `task.py` use the same invocation style (referencing `task.py` and
the earlier `python task.py` usage to locate and update the entries).
In `@mllm/backends/qnn/aot_rt/PromptProcessor.cpp`:
- Line 151: The index computation uses a 32-bit cast before modulo which can
overflow: change the expression so you compute ((int64_t)num_tokens %
(int64_t)config_.ar_len) (or equivalent 64-bit modulo) and only cast the final
bounded result to int when indexing into output_tensors_ (affecting the line
that produces logits); additionally add input validation at the start of the
prefill function to check prompt_tokens.size() <= config_.context_len and handle
the error (return or throw) if the check fails to ensure callers cannot pass
oversized prompts.
🧹 Nitpick comments (4)
docs/qnn_backend/aot_execute.rst (2)
34-35: Consider clarifying the script's purpose.

The script name `train.py` might be confusing since it is used for model quantization and export rather than training. Consider adding a brief comment explaining that this script performs quantization and export, or noting that the name is historical.

📝 Suggested clarification

```diff
  First, we need to run a Python script on the host to quantize the model and export it as a ``.safetensors`` file.

  .. code-block:: shell

     cd ./pymllm/backends/qualcomm/transformers/qwen3
+    # Note: train.py performs model quantization and export, not training
     python train.py --model_path "/your/qwen3/model/path/" --max_length 2048 --num_samples 128 --output_dir "/path/to/output"
```
90-92: Consider clarifying the multi-line shell command.

The adb shell command spans multiple lines with line breaks in the documentation, which might make it harder for users to copy and execute correctly. Consider formatting it as a single line or using explicit line continuation.

📝 Alternative formatting options

Option 1: Single-line command

```diff
  # Execute on the device
- adb shell "cd /data/local/tmp && export LD_LIBRARY_PATH=. &&
- ./mllm-qwen3-aot-runner -m qwen3-1.7B-lpbq-sha.bin
- -t qwen3-tokenizer.json -c config_1.7B.json --ar_len 32"
+ adb shell "cd /data/local/tmp && export LD_LIBRARY_PATH=. && ./mllm-qwen3-aot-runner -m qwen3-1.7B-lpbq-sha.bin -t qwen3-tokenizer.json -c config_1.7B.json --ar_len 32"
```

Option 2: Proper line continuation with backslashes

```diff
  # Execute on the device
  adb shell "cd /data/local/tmp && export LD_LIBRARY_PATH=. && \
- ./mllm-qwen3-aot-runner -m qwen3-1.7B-lpbq-sha.bin
- -t qwen3-tokenizer.json -c config_1.7B.json --ar_len 32"
+ ./mllm-qwen3-aot-runner -m qwen3-1.7B-lpbq-sha.bin \
+ -t qwen3-tokenizer.json -c config_1.7B.json --ar_len 32"
```

mllm/backends/qnn/QNNUtils.hpp (2)
305-313: Avoid relying on transitive includes for `printf`.

This header now uses `printf` but doesn't include `<cstdio>`, which can break on stricter toolchains. Please include `<cstdio>` and preferably use `std::printf` to avoid relying on the global namespace.

🛠️ Suggested update

```diff
+#include <cstdio>
 #include <memory>
 #include <string>
 #include <vector>
@@
-      printf("%.4f", fval);
+      std::printf("%.4f", fval);
@@
-    printf("[");
+    std::printf("[");
@@
-        printf(", ");
+        std::printf(", ");
@@
-        if (!is_last_dim) printf("\n");
+        if (!is_last_dim) std::printf("\n");
@@
-      printf(",\n...\n");
+      std::printf(",\n...\n");
@@
-      printf(", ..., ");
+      std::printf(", ..., ");
@@
-    printf("]");
+    std::printf("]");
@@
-  printf("Dequantized Tensor (scale=%f, offset=%d):\n", scale, offset);
+  std::printf("Dequantized Tensor (scale=%f, offset=%d):\n", scale, offset);
@@
-      printf("Not supported type");
+      std::printf("Not supported type");
@@
-  printf("\n");
+  std::printf("\n");
```

Based on learnings: Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.
319-349: Replace magic numbers with named constants.

The `20` / `2` values control output truncation and should be named for clarity and easy tuning.

♻️ Suggested update

```diff
-  int max_elements_per_dim = 20;
+  constexpr int kMaxElementsPerDim = 20;
   bool is_last_dim = (dim == (int)shape.size() - 1);
-  if (dim_size <= max_elements_per_dim) {
+  if (dim_size <= kMaxElementsPerDim) {
@@
-    const int SHOW_ELEMENTS = max_elements_per_dim / 2;
-    for (int32_t i = 0; i < SHOW_ELEMENTS; ++i) {
+    constexpr int kShowElements = kMaxElementsPerDim / 2;
+    for (int32_t i = 0; i < kShowElements; ++i) {
@@
-    for (int32_t i = dim_size - SHOW_ELEMENTS; i < dim_size; ++i) {
+    for (int32_t i = dim_size - kShowElements; i < dim_size; ++i) {
```

Based on learnings: Use named constants instead of magic numbers.
mllm/backends/qnn/aot_rt/PromptProcessor.cpp

```diff
- auto logits = output_tensors_[0].to(kCPU).squeeze(0)[{kAll, (num_tokens + config_.ar_len - 1) % config_.ar_len, kAll}];
+ auto logits = output_tensors_[0].to(kCPU).squeeze(0)[{kAll, ((int)num_tokens + config_.ar_len - 1) % config_.ar_len, kAll}];
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# First, locate and examine the PromptProcessor.cpp file
find . -name "PromptProcessor.cpp" -type f
```

Repository: UbiquitousLearning/mllm
Length of output: 113

🏁 Script executed:

```shell
#!/bin/bash
# Read the relevant section of PromptProcessor.cpp around line 151
if [ -f "mllm/backends/qnn/aot_rt/PromptProcessor.cpp" ]; then
  sed -n '140,165p' mllm/backends/qnn/aot_rt/PromptProcessor.cpp
fi
```

Repository: UbiquitousLearning/mllm
Length of output: 843

🏁 Script executed:

```shell
#!/bin/bash
# Check the type definitions and structure of the class to understand num_tokens and config_
if [ -f "mllm/backends/qnn/aot_rt/PromptProcessor.cpp" ]; then
  head -100 mllm/backends/qnn/aot_rt/PromptProcessor.cpp | grep -E "include|class|struct|using"
fi
```

Repository: UbiquitousLearning/mllm
Length of output: 301

🏁 Script executed:

```shell
#!/bin/bash
# Search for the PromptProcessor class definition and relevant config structure
find . -name "PromptProcessor.h" -o -name "PromptProcessor.hpp" 2>/dev/null | head -5
```

Repository: UbiquitousLearning/mllm
Length of output: 113

🏁 Script executed:

```shell
#!/bin/bash
# Search for context_len, ar_len, and num_tokens type definitions and constraints
rg -n -A 3 'context_len|ar_len' --type=cpp --type=cc --type=hpp --type=h mllm/backends/qnn/aot_rt/ | head -50
```

Repository: UbiquitousLearning/mllm
Length of output: 93

🏁 Script executed:

```shell
#!/bin/bash
# Search for prefill method/call sites to see how num_tokens is passed and bounded
rg -n 'prefill' --type=cpp --type=cc --type=hpp --type=h mllm/backends/qnn/aot_rt/ -A 5
```

Repository: UbiquitousLearning/mllm
Length of output: 93

🏁 Script executed:

```shell
#!/bin/bash
# Read the PromptProcessor.hpp header file
cat -n mllm/backends/qnn/aot_rt/PromptProcessor.hpp
```

Repository: UbiquitousLearning/mllm
Length of output: 1799

🏁 Script executed:

```shell
#!/bin/bash
# Search for ar_len and context_len definitions without problematic file type filters
grep -rn 'ar_len\|context_len' mllm/backends/qnn/aot_rt/ --include="*.h" --include="*.hpp" --include="*.cpp"
```

Repository: UbiquitousLearning/mllm
Length of output: 12499

🏁 Script executed:

```shell
#!/bin/bash
# Search for prefill method signature
grep -rn 'prefill' mllm/backends/qnn/aot_rt/ --include="*.h" --include="*.hpp" --include="*.cpp" -A 5
```

Repository: UbiquitousLearning/mllm
Length of output: 5898

🏁 Script executed:

```shell
#!/bin/bash
# Look for the function signature that contains line 151 to understand num_tokens type
grep -n -B 20 'auto logits = output_tensors_\[0\]' mllm/backends/qnn/aot_rt/PromptProcessor.cpp | head -40
```

Repository: UbiquitousLearning/mllm
Length of output: 1046
Compute the modulo in 64-bit space before casting to int.

The cast `(int)num_tokens` occurs before the modulo operation, which could truncate `num_tokens` if it exceeds `INT_MAX`, yielding an incorrect index. Instead, compute the modulo in 64-bit space and cast only the bounded result (guaranteed to fit in an `int`):

Suggested fix

```diff
- auto logits = output_tensors_[0].to(kCPU).squeeze(0)[{kAll, ((int)num_tokens + config_.ar_len - 1) % config_.ar_len, kAll}];
+ const int idx = static_cast<int>(
+     (num_tokens + static_cast<int64_t>(config_.ar_len) - 1) % config_.ar_len);
+ auto logits = output_tensors_[0].to(kCPU).squeeze(0)[{kAll, idx, kAll}];
```

Additionally, consider validating that `prompt_tokens.size()` does not exceed `config_.context_len` at the start of the prefill function, per the coding guidelines for public API input validation.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```cpp
const int idx = static_cast<int>(
    (num_tokens + static_cast<int64_t>(config_.ar_len) - 1) % config_.ar_len);
auto logits = output_tensors_[0].to(kCPU).squeeze(0)[{kAll, idx, kAll}];
```
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot_rt/PromptProcessor.cpp` at line 151, The index
computation uses a 32-bit cast before modulo which can overflow: change the
expression so you compute ((int64_t)num_tokens % (int64_t)config_.ar_len) (or
equivalent 64-bit modulo) and only cast the final bounded result to int when
indexing into output_tensors_ (affecting the line that produces logits);
additionally add input validation at the start of the prefill function to check
prompt_tokens.size() <= config_.context_len and handle the error (return or
throw) if the check fails to ensure callers cannot pass oversized prompts.
docs/qnn_backend/aot_execute.rst (Outdated)

```rst
.. code-block:: shell

   # Cross-compile the aot_run program for the target device (e.g., Android)
   task.py tasks/build_android_qnn.yaml
```
Suggested change:

```shell
python task.py tasks/build_android_qnn.yaml
```

Please check the Guidelines for Contributing.
Summary by CodeRabbit

- Documentation
- New Features
- Bug Fixes
- Chores
Chores