
doc(qualcomm): Qnn aot #625

Merged
chenghuaWang merged 13 commits into UbiquitousLearning:main from oreomaker:qnn-aot
Feb 2, 2026

Conversation

@oreomaker
Collaborator

@oreomaker oreomaker commented Feb 1, 2026

Please check Guidelines for Contributing.

Summary by CodeRabbit

  • Documentation

    • Added a comprehensive QNN AOT guide covering quantization, offline compilation, deployment, hybrid execution (prefill/decode), and environment setup notes.
  • New Features

    • Made the QNN AOT environment path configurable via a command-line option in compilation examples.
  • Bug Fixes

    • Removed noisy debug output from the AOT runner for cleaner runtime logs.
  • Chores

    • Clarified Hexagon tool version guidance in build/setup docs.

…ities

- Remove qnnOutputNameToIndex_ map and related functionality that was
  tracking QNN output order since it's no longer needed
- Remove expectedOutputOrder_ and associated getter/setter methods
  from QNNModel class
- Remove the logic in QNNGraphBuildPass that was recording MLLM
  expected output order from ReturnOp
- Add new debug utility functions for printing dequantized UInt16
  tensor data with support for multi-dimensional arrays
- Fix integer casting issue in PromptProcessor prefill method
@coderabbitai
Contributor

coderabbitai bot commented Feb 1, 2026

📝 Walkthrough

Walkthrough

Adds QNN AOT execution documentation and propagates a configurable QNN AOT environment path to multiple example compilers; removes QNN output-order APIs and ReturnOp-based ordering, adds dequantized-tensor debug helpers, conditions a device assertion for AOT-on-X86, removes make_slice sugar and updates slicing callsites, and a small debug-print removal in an AOT runner.

Changes

Cohort / File(s) Summary
QNN Documentation
docs/qnn_backend/aot_execute.rst, docs/qnn_backend/index.rst
Adds AOT execution guide for fully integer-quantized LLMs on Hexagon (quantize/export → offline compile → on-device run). Updates toctree to reference new doc.
AOT Example CLI
examples/llama_qnn_aot/compile.cpp, examples/llama_qnn_aot/compile_sha.cpp, examples/qwen2_qnn_aot/compile.cpp, examples/qwen2_qnn_aot/compile_sha.cpp, examples/qwen3_qnn_aot/compile.cpp, examples/qwen3_qnn_aot/compile_sha.cpp
Introduce a `-qnn_env` command-line option to make the QNN AOT environment path configurable.
AOT Runtime Example
examples/qwen3_qnn_aot/aot_run.cpp
Removed debug printing of input tensor sequence shape/data prior to Runner creation.
Build / CMake
mllm/CMakeLists.txt
Removed -Wno-comma-subscript suppression and added conditional compile-definition MLLM_QUALCOMM_QNN_AOT_ON_X86_ENABLE before including QNN backend.
QNN Model API & Graph Pass
mllm/backends/qnn/QNNModel.hpp, mllm/backends/qnn/QNNModel.cpp, mllm/backends/qnn/passes/QNNGraphBuildPass.cpp
Removed expected-output-order API and internal name→index mapping; removed ReturnOp-driven output-order extraction from graph build pass.
QNN Utilities & Debugging
mllm/backends/qnn/QNNUtils.hpp, mllm/backends/qnn/QNNUtils.cpp
Added inline helpers to print dequantized uint16 tensors; made a device assertion conditional on AOT-on-X86, and marked wrappers as already-allocated when input tensor storage is provided.
AOT Runtime Fix
mllm/backends/qnn/aot_rt/PromptProcessor.cpp
Added explicit cast for index calculation used in logits selection (fix type for modulo indexing).
Slice API Removal & Callsite Updates
mllm/core/SlicePrimitives.hpp, mllm/models/minicpm_o2_6/modeling_resampler.hpp, mllm/models/minicpm_o2_6/streaming_generation.cpp
Removed SliceIndicesBuilder and make_slice helpers; updated callsites to use direct slice/index syntax (e.g., x[{...}] and .slice({...})).
Docs: Build Guidance
docs/qnn_backend/setup_env.rst
Note added advising users to update Hexagon tools version in Makefile if compilation fails.

Sequence Diagram(s)

sequenceDiagram
    participant Dev as Developer (host)
    participant Tools as QNN Toolchain (offline compiler)
    participant Device as Target Device (Hexagon NPU)

    Dev->>Dev: Quantize model -> export (.mir/.mllm)
    Dev->>Tools: Run compilation (compile.cpp / mllm-convertor)
    Tools-->>Dev: Produce QNN context binary (.bin / qnn ctx)
    Dev->>Device: Deploy binary + QNN libs/packages
    Device->>Device: Load context, initialize runtime
    Device->>Device: Run inference (prefill / decode hybrid flow)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Suggested reviewers

  • xumengwei
  • liang1232018
  • yirongjie

Poem

🐰 I hopped through code and docs in spring,
Paths made flexible so compiles sing,
Slices trimmed, prints tucked away,
Dequantized bits now bright as day,
A tiny hop for cleaner build-ing! 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning: The PR description only contains the repository template boilerplate, without any substantive implementation details, objectives, or rationale for the changes made. Resolution: add a detailed description explaining the documentation additions (the aot_execute.rst guide), why a configurable QNN AOT environment path is important, and how the code refactoring (removing SliceIndicesBuilder, updating tensor indexing) relates to the overall objectives.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 12.50%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed: The title 'doc(qualcomm): Qnn aot' clearly indicates documentation changes for the Qualcomm QNN AOT backend, which aligns with the primary changes (the new aot_execute.rst documentation and the setup_env.rst updates).


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
examples/qwen3_qnn_aot/compile_sha.cpp (1)

172-173: ⚠️ Potential issue | 🟠 Major

Missing zero_point attachments for KV cache tensors in seq=1 block.

In the seq=1 block, only scale is attached to past_key and past_value tensors. The zero_point attachments are missing, unlike the seq=32 block (lines 118-122) which attaches both scale and zero_point. This inconsistency could cause issues if the QNN backend expects quantization parameters.

🐛 Proposed fix to add missing zero_point attachments
     trace_inputs[past_key_name].attach("scale", params->pull("model.layers." + std::to_string(i) + ".self_attn.k_cast_to_int8_qdq.fake_quant.scale").impl(), true);
+    trace_inputs[past_key_name].attach("zero_point", params->pull("model.layers." + std::to_string(i) + ".self_attn.k_cast_to_int8_qdq.fake_quant.zero_point").impl(), true);
+
     trace_inputs[past_value_name].attach("scale", params->pull("model.layers." + std::to_string(i) + ".self_attn.v_cast_to_int8_qdq.fake_quant.scale").impl(), true);
+    trace_inputs[past_value_name].attach("zero_point", params->pull("model.layers." + std::to_string(i) + ".self_attn.v_cast_to_int8_qdq.fake_quant.zero_point").impl(), true);
examples/qwen2_qnn_aot/compile_sha.cpp (1)

48-48: ⚠️ Potential issue | 🔴 Critical

Wrong config class: Qwen3Config used in qwen2 compile script.

This file is for compiling Qwen2 models (as evidenced by qwen2::sha::prepareParametersForSHA on line 63 and qwen2::sha::Qwen2ForCausalLM_SHA on line 67), but it uses mllm::models::qwen3::Qwen3Config instead of mllm::models::qwen2::Qwen2Config. This mismatch could cause incorrect model configuration.

🐛 Proposed fix
-  auto model_cfg = mllm::models::qwen3::Qwen3Config(model_cfg_path.get());
+  auto model_cfg = mllm::models::qwen2::Qwen2Config(model_cfg_path.get());
examples/qwen2_qnn_aot/compile.cpp (1)

40-41: ⚠️ Potential issue | 🔴 Critical

Wrong config class: Qwen3Config used with Qwen2ForCausalLM.

The config on line 40 uses mllm::models::qwen3::Qwen3Config, but the model on line 41 is mllm::models::qwen2::Qwen2ForCausalLM. This config/model namespace mismatch could lead to incorrect model initialization.

🐛 Proposed fix
-  auto model_cfg = mllm::models::qwen3::Qwen3Config(model_cfg_path.get());
+  auto model_cfg = mllm::models::qwen2::Qwen2Config(model_cfg_path.get());
   auto model = mllm::models::qwen2::Qwen2ForCausalLM(model_cfg);
🤖 Fix all issues with AI agents
In `@docs/qnn_backend/aot_execute.rst`:
- Line 81: The command invocation is inconsistent: one place calls `python
task.py ...` while this line uses `task.py tasks/build_android_qnn.yaml`; make
them consistent by either prefixing this line with the Python interpreter
(change to `python task.py tasks/build_android_qnn.yaml`) or update both places
to rely on a shebang and mark `task.py` executable—adjust the docs so all
references to `task.py` use the same invocation style (referencing `task.py` and
the earlier `python task.py` usage to locate and update the entries).

In `@mllm/backends/qnn/aot_rt/PromptProcessor.cpp`:
- Line 151: The index computation uses a 32-bit cast before modulo which can
overflow: change the expression so you compute ((int64_t)num_tokens %
(int64_t)config_.ar_len) (or equivalent 64-bit modulo) and only cast the final
bounded result to int when indexing into output_tensors_ (affecting the line
that produces logits); additionally add input validation at the start of the
prefill function to check prompt_tokens.size() <= config_.context_len and handle
the error (return or throw) if the check fails to ensure callers cannot pass
oversized prompts.
🧹 Nitpick comments (4)
docs/qnn_backend/aot_execute.rst (2)

34-35: Consider clarifying the script's purpose.

The script name train.py might be confusing since it's being used for model quantization and export rather than training. Consider adding a brief comment explaining that this script performs quantization and export, or noting that the name is historical.

📝 Suggested clarification
    First, we need to run a Python script on the host to quantize the model and export it as a ``.safetensors`` file.
 
    .. code-block:: shell
 
       cd ./pymllm/backends/qualcomm/transformers/qwen3
+      # Note: train.py performs model quantization and export, not training
       python train.py --model_path "/your/qwen3/model/path/" --max_length 2048 --num_samples 128 --output_dir "/path/to/output"

90-92: Consider clarifying the multi-line shell command.

The adb shell command spans multiple lines with line breaks in the documentation, which might make it harder for users to copy and execute correctly. Consider formatting it as a single line or using explicit line continuation.

📝 Alternative formatting options

Option 1: Single-line command

       # Execute on the device
-      adb shell "cd /data/local/tmp && export LD_LIBRARY_PATH=. &&
-      ./mllm-qwen3-aot-runner -m qwen3-1.7B-lpbq-sha.bin
-      -t qwen3-tokenizer.json -c config_1.7B.json --ar_len 32"
+      adb shell "cd /data/local/tmp && export LD_LIBRARY_PATH=. && ./mllm-qwen3-aot-runner -m qwen3-1.7B-lpbq-sha.bin -t qwen3-tokenizer.json -c config_1.7B.json --ar_len 32"

Option 2: Proper line continuation with backslashes

       # Execute on the device
       adb shell "cd /data/local/tmp && export LD_LIBRARY_PATH=. && \
-      ./mllm-qwen3-aot-runner -m qwen3-1.7B-lpbq-sha.bin
-      -t qwen3-tokenizer.json -c config_1.7B.json --ar_len 32"
+          ./mllm-qwen3-aot-runner -m qwen3-1.7B-lpbq-sha.bin \
+          -t qwen3-tokenizer.json -c config_1.7B.json --ar_len 32"
mllm/backends/qnn/QNNUtils.hpp (2)

305-313: Avoid relying on transitive includes for printf.

This header now uses printf but doesn’t include <cstdio>, which can break on stricter toolchains. Please include <cstdio> and preferably use std::printf to avoid global-namespace reliance.

🛠️ Suggested update
+#include <cstdio>
 #include <memory>
 #include <string>
 #include <vector>
@@
-    printf("%.4f", fval);
+    std::printf("%.4f", fval);
-  printf("[");
+  std::printf("[");
-        printf(", ");
+        std::printf(", ");
-        if (!is_last_dim) printf("\n");
+        if (!is_last_dim) std::printf("\n");
-      printf(",\n...\n");
+      std::printf(",\n...\n");
-      printf(", ..., ");
+      std::printf(", ..., ");
-  printf("]");
+  std::printf("]");
-  printf("Dequantized Tensor (scale=%f, offset=%d):\n", scale, offset);
+  std::printf("Dequantized Tensor (scale=%f, offset=%d):\n", scale, offset);
-    printf("Not supported type");
+    std::printf("Not supported type");
-  printf("\n");
+  std::printf("\n");

Based on learnings: Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.


319-349: Replace magic numbers with named constants.

The 20//2 values control output truncation and should be named for clarity and easy tuning.

♻️ Suggested update
-  int max_elements_per_dim = 20;
+  constexpr int kMaxElementsPerDim = 20;
   bool is_last_dim = (dim == (int)shape.size() - 1);

-  if (dim_size <= max_elements_per_dim) {
+  if (dim_size <= kMaxElementsPerDim) {
@@
-    const int SHOW_ELEMENTS = max_elements_per_dim / 2;
-    for (int32_t i = 0; i < SHOW_ELEMENTS; ++i) {
+    constexpr int kShowElements = kMaxElementsPerDim / 2;
+    for (int32_t i = 0; i < kShowElements; ++i) {
@@
-    for (int32_t i = dim_size - SHOW_ELEMENTS; i < dim_size; ++i) {
+    for (int32_t i = dim_size - kShowElements; i < dim_size; ++i) {

Based on learnings: Use named constants instead of magic numbers.

-auto logits = output_tensors_[0].to(kCPU).squeeze(0)[{kAll, (num_tokens + config_.ar_len - 1) % config_.ar_len, kAll}];
+auto logits = output_tensors_[0].to(kCPU).squeeze(0)[{kAll, ((int)num_tokens + config_.ar_len - 1) % config_.ar_len, kAll}];
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, locate and examine the PromptProcessor.cpp file
find . -name "PromptProcessor.cpp" -type f

Repository: UbiquitousLearning/mllm

Length of output: 113


🏁 Script executed:

#!/bin/bash
# Read the relevant section of PromptProcessor.cpp around line 151
if [ -f "mllm/backends/qnn/aot_rt/PromptProcessor.cpp" ]; then
  sed -n '140,165p' mllm/backends/qnn/aot_rt/PromptProcessor.cpp
fi

Repository: UbiquitousLearning/mllm

Length of output: 843


🏁 Script executed:

#!/bin/bash
# Check the type definitions and structure of the class to understand num_tokens and config_
if [ -f "mllm/backends/qnn/aot_rt/PromptProcessor.cpp" ]; then
  head -100 mllm/backends/qnn/aot_rt/PromptProcessor.cpp | grep -E "include|class|struct|using"
fi

Repository: UbiquitousLearning/mllm

Length of output: 301


🏁 Script executed:

#!/bin/bash
# Search for the PromptProcessor class definition and relevant config structure
find . -name "PromptProcessor.h" -o -name "PromptProcessor.hpp" 2>/dev/null | head -5

Repository: UbiquitousLearning/mllm

Length of output: 113


🏁 Script executed:

#!/bin/bash
# Search for context_len, ar_len, and num_tokens type definitions and constraints
rg -n -A 3 'context_len|ar_len' --type=cpp --type=cc --type=hpp --type=h mllm/backends/qnn/aot_rt/ | head -50

Repository: UbiquitousLearning/mllm

Length of output: 93


🏁 Script executed:

#!/bin/bash
# Search for prefill method/call sites to see how num_tokens is passed and bounded
rg -n 'prefill' --type=cpp --type=cc --type=hpp --type=h mllm/backends/qnn/aot_rt/ -A 5

Repository: UbiquitousLearning/mllm

Length of output: 93


🏁 Script executed:

#!/bin/bash
# Read the PromptProcessor.hpp header file
cat -n mllm/backends/qnn/aot_rt/PromptProcessor.hpp

Repository: UbiquitousLearning/mllm

Length of output: 1799


🏁 Script executed:

#!/bin/bash
# Search for ar_len and context_len definitions without problematic file type filters
grep -rn 'ar_len\|context_len' mllm/backends/qnn/aot_rt/ --include="*.h" --include="*.hpp" --include="*.cpp"

Repository: UbiquitousLearning/mllm

Length of output: 12499


🏁 Script executed:

#!/bin/bash
# Search for prefill method signature
grep -rn 'prefill' mllm/backends/qnn/aot_rt/ --include="*.h" --include="*.hpp" --include="*.cpp" -A 5

Repository: UbiquitousLearning/mllm

Length of output: 5898


🏁 Script executed:

#!/bin/bash
# Look for the function signature that contains line 151 to understand num_tokens type
grep -n -B 20 'auto logits = output_tensors_\[0\]' mllm/backends/qnn/aot_rt/PromptProcessor.cpp | head -40

Repository: UbiquitousLearning/mllm

Length of output: 1046


Compute modulo in 64-bit space before casting to int.

The cast (int)num_tokens occurs before the modulo operation, which could truncate num_tokens if it exceeds INT_MAX, yielding an incorrect index. Instead, compute the modulo in 64-bit space and cast only the bounded result (guaranteed to fit in int):

Suggested fix
-  auto logits = output_tensors_[0].to(kCPU).squeeze(0)[{kAll, ((int)num_tokens + config_.ar_len - 1) % config_.ar_len, kAll}];
+  const int idx = static_cast<int>(
+      (num_tokens + static_cast<int64_t>(config_.ar_len) - 1) % config_.ar_len);
+  auto logits = output_tensors_[0].to(kCPU).squeeze(0)[{kAll, idx, kAll}];

Additionally, consider validating that prompt_tokens.size() does not exceed config_.context_len at the start of the prefill function, per coding guidelines for public API input validation.


@chenghuaWang chenghuaWang changed the title from "Qnn aot" to "doc(qualcomm): Qnn aot" on Feb 2, 2026
.. code-block:: shell

# Cross-compile the aot_run program for the target device (e.g., Android)
task.py tasks/build_android_qnn.yaml
Collaborator

python task.py tasks/build_android_qnn.yaml

Collaborator

@chenghuaWang chenghuaWang left a comment

LGTM

@chenghuaWang chenghuaWang merged commit 6240180 into UbiquitousLearning:main Feb 2, 2026
5 checks passed
