
feat(ascend): Implement Ascend backend ops (Linear, RMSNorm, SiLU)#615

Closed
lywbarca wants to merge 4 commits into UbiquitousLearning:main from lywbarca:lyw/ascend-backend

Conversation

@lywbarca lywbarca commented Jan 28, 2026

Please check Guidelines for Contributing.

Summary by CodeRabbit

  • New Features

    • Ascend backend: added SiLU, Linear, MatMul, RMSNorm, Softmax, View, Sub, and Mul operators
    • New public functional APIs: linear and rms_norm
    • Example: ascend_add_demo with build/run helper
  • Improvements

    • More robust Ascend device initialization and per-device metadata handling
    • Unified tensor conversion, workspace management, and FP16 execution paths
  • Tests

    • Expanded FP16 Ascend kernel tests: elementwise (Sub/Mul), SiLU, Linear (±bias), RMSNorm, Softmax, and attention suites
  • Chores

    • Removed Qualcomm QNN AOT integration and related FFI/Python bindings and build options



coderabbitai bot commented Jan 28, 2026

📝 Walkthrough

Adds multiple Ascend backend operators (SiLU, Linear, RMSNorm, View, MatMul, Sub, Mul), ATB tensor helpers, device init/enumeration adjustments, nn::functional wrappers (linear, rms_norm), extensive FP16 Ascend kernel tests, and removes Qualcomm QNN AOT sources/FFI; registers new operator factories in the Ascend backend.

Changes

  • Backend registration (mllm/backends/ascend/AscendBackend.cpp)
    Includes and registers the new Ascend operator factories (SiLU, Linear, RMSNorm, View, MatMul, Sub, Mul).
  • Common utilities & device init (mllm/backends/ascend/AscendCommon.hpp, mllm/backends/ascend/AscendCommon.cpp)
    Adds fillAtbTensor and FP16 validation; moves ACL init earlier, removes the final finalize after enumeration, and implements per-device enumeration with robust failure handling.
  • Elementwise ops: add/sub/mul (mllm/backends/ascend/ops/AscendElewiseOps.hpp, .../AscendElewiseOps.cpp)
    Adds AscendSubOp/AscendMulOp declarations and implements ATB-based Sub/Mul with broadcasting support; updates Add to the ATB flow and unified workspace handling.
  • Linear / MatMul / RMSNorm / SiLU / Softmax / View operators (mllm/backends/ascend/ops/AscendLinearOp.*, AscendMatMulOp.*, AscendRMSNormOp.*, AscendSiLUOp.*, AscendSoftmaxOp.*, AscendViewOp.* and corresponding *.hpp)
    New Ascend-specific operator implementations and factories using ATB: input/shape/dtype validation, VariantPack setup, workspace queries, AscendMemoryManager allocation, and execute/sync/cleanup.
  • Functional API (mllm/nn/Functional.hpp, mllm/nn/Functional.cpp)
    Adds nn::functional::linear() and nn::functional::rmsNorm() wrappers that invoke the backend ops (optional bias handling, epsilon/unit-offset).
  • Tests: Ascend FP16 kernels (tests/ascend/*: AscendSiLUKernelTest.hpp, AscendLinearKernelTest.hpp, AscendRMSNormKernelTest.hpp, AscendSoftmaxKernelTest.hpp, AscendAttentionKernelTest.hpp, AscendKernelTest.hpp, KernelTest.cpp, ...)
    Adds extensive FP16 kernel tests for the new ops (SiLU, Linear with and without bias, RMSNorm, Softmax, Sub, Mul, attention variants), integrates Ascend init in the test runner, and compares results against CPU references.
  • Build / Packaging / Examples (CMakeLists.txt, mllm/CMakeLists.txt, pyproject.toml, tasks/*, examples/ascend_add_demo/*, mllm/ffi/*, pymllm/backends/*)
    Enables Ascend backend build flags and adds an example script and README; removes the Qualcomm QNN AOT build option, deletes QNN AOT sources, FFI headers, and registrations plus related Python/FFI wrappers; adjusts build dependencies.
  • Large removals: QNN AOT (mllm/backends/qnn/aot/QnnWrappersAPI.cpp, .../QnnWrappersAPI.hpp, mllm/ffi/qualcomm/QnnAOT.*, mllm/backends/qnn/aot_rt/README.md)
    Deletes the Qualcomm QNN AOT API/FFI implementation and headers (many public API declarations removed), plus related FFI/Python bindings and tests.

Sequence Diagram(s)

sequenceDiagram
    participant Caller as User / Frontend
    participant Op as AscendOp
    participant ATB as ATB Runtime
    participant MemMgr as AscendMemoryManager
    participant Stream as Device Stream

    Caller->>Op: forward(inputs, outputs)
    Op->>Op: validate dtypes/shapes, build params
    Op->>ATB: CreateOperation(params)
    ATB-->>Op: operation handle / status
    Op->>Op: fillAtbTensor(inputs/outputs)
    Op->>ATB: Setup(VariantPack)
    ATB-->>Op: workspaceSize
    Op->>MemMgr: Allocate(workspaceSize)
    MemMgr-->>Op: workspacePtr
    Op->>ATB: Execute(opHandle, VariantPack, workspacePtr, Stream)
    ATB->>Stream: schedule kernels
    Stream-->>Op: synchronize / complete
    Op->>MemMgr: Free(workspacePtr)
    Op->>ATB: DestroyOperation
    Op-->>Caller: outputs ready
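The lifecycle in the diagram can be sketched host-side as follows. This is an illustrative Python mock, not the actual ATB/ACL C++ API; every name here (create_operation, setup, execute, the allocator) is a hypothetical stand-in. The point it shows is the ordering the diagram prescribes, with cleanup (free workspace, destroy op) guaranteed on every path via try/finally.

```python
# Mock of the ATB operator lifecycle from the sequence diagram.
# All names are hypothetical stand-ins for the real ATB/ACL calls.

class MockAtbRuntime:
    def create_operation(self, params):
        return {"params": params, "alive": True}

    def setup(self, op, variant_pack):
        # Real ATB's Setup() reports the required workspace size.
        return 1024  # fake size in bytes

    def execute(self, op, variant_pack, workspace):
        assert op["alive"] and workspace is not None
        return "outputs-ready"

    def destroy_operation(self, op):
        op["alive"] = False


class MockAllocator:
    def __init__(self):
        self.live = 0  # outstanding allocations, for leak checking

    def allocate(self, size):
        self.live += 1
        return bytearray(size)

    def free(self, block):
        self.live -= 1


def run_op(atb, params, variant_pack, allocator):
    """Create -> Setup -> allocate workspace -> Execute -> free -> destroy."""
    op = atb.create_operation(params)
    workspace = None
    try:
        size = atb.setup(op, variant_pack)
        if size > 0:
            workspace = allocator.allocate(size)
        return atb.execute(op, variant_pack, workspace)
    finally:
        # Cleanup must run even when Setup/Execute fail.
        if workspace is not None:
            allocator.free(workspace)
        atb.destroy_operation(op)
```

The try/finally is the part worth imitating: it keeps the free/destroy pair from being skipped on error paths, which is exactly the class of leak the inline review comments flag in the C++ operators.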

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • liang1232018
  • chenghuaWang
  • oreomaker

Poem

🐰 I hopped into code with a tiny cheer,
New ops and tests sprang lively here.
SiLU, Linear, MatMul take the stage,
FP16 streams now run the page.
Heap of tests—time to munch and peer! 🥕

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 warnings

❌ Failed checks (2 warnings)

  • Description check ⚠️ Warning: the PR description has no substantive content beyond the contribution template header and guideline reference; it lacks an explanation of the changes, rationale, testing, and impact. Resolution: add a detailed description covering what was implemented, why these operations were added, how they integrate with the Ascend backend, what testing was performed, and any breaking changes or migration steps required.
  • Docstring Coverage ⚠️ Warning: docstring coverage is 22.62%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Title check ✅ Passed: the PR title is clear and specific, accurately describing the main additions of Ascend backend operations (Linear, RMSNorm, SiLU) to the framework.






@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/ascend/KernelTest.cpp (1)

96-104: Remove trailing whitespace on blank lines.

The blank lines around context/backend initialization include trailing spaces; trim them to satisfy the formatting rules.

✂️ Trim blank lines
  testing::InitGoogleTest(&argc, argv);
-  
+
  // Initialize context
  mllm::initializeContext();

  // Initialize Ascend backend
  mllm::initAscendBackend();
-  
+
  auto ret = RUN_ALL_TESTS();
As per coding guidelines: No line may end with trailing whitespace.
🤖 Fix all issues with AI agents
In `@mllm/backends/ascend/AscendCommon.cpp`:
- Around line 220-225: fillAtbTensor currently assumes FP16 because
fillAtbTensorDesc hardcodes FP16, so add an explicit dtype check at the top of
fillAtbTensor: call t.dtype() and compare it against the FP16 enum value used by
fillAtbTensorDesc (the same FP16 symbol), and if it does not match, fail fast
(e.g., log an error via your logger and return/throw) to avoid descriptor/data
mismatches; only proceed to call fillAtbTensorDesc, set atb_tensor.deviceData
and atb_tensor.dataSize when the dtype check passes.

In `@mllm/backends/ascend/ops/AscendLinearOp.cpp`:
- Around line 54-92: Before calling fillAtbTensor for inputs/weight/bias in the
AscendLinearOp dispatch, validate that the input tensors are FP16 and that bias
(if present) is 2D with shape [1, out_channels]; specifically check the input
dtype via the input Tensor's dtype field and error out (use MLLM_ERROR_EXIT or
return an error) if dtype != ACL_FLOAT16, and verify bias_ptr->shape.size()==2
and bias_ptr->shape[0]==1 and bias_ptr->shape[1]==weight_out_channels (derive
out_channels from weight_ptr shape) before calling fillAtbTensor for
atb_x/atb_weight/atb_bias so ATB never receives non-FP16 data or a 1D bias; keep
these checks adjacent to the existing fillAtbTensor calls and reference the
symbols fillAtbTensor, weight_ptr, bias_ptr, and MLLM_ERROR_EXIT.

In `@mllm/backends/ascend/ops/AscendMatMulOp.cpp`:
- Around line 30-38: Several declarations and assignments in AscendMatMulOp.cpp
(the const auto& A, const auto& B, auto& C lines and the LinearParam block where
linearParam.transposeA, transposeB, and linearParam.hasBias are set) have
trailing whitespace at line ends; remove all trailing spaces from these lines so
each line ends immediately after the last token (e.g., "const auto& A =
inputs[0];", "linearParam.hasBias = false;") to comply with the
no-trailing-whitespace guideline.

In `@mllm/backends/ascend/ops/AscendSiLUOp.cpp`:
- Around line 30-56: AscendSiLUOp currently only checks x/y dtype equality but
not that they are FP16; before calling fillAtbTensorDesc on x and y in
AscendSiLUOp (where fillAtbTensorDesc hardcodes ACL_FLOAT16), add an explicit
dtype validation that x.dtype() (and therefore y.dtype()) equals the FP16 enum
used in this codebase (e.g., kFloat16/ACL_FLOAT16) and return/NYI or
MLLM_ERROR_EXIT with a clear message if not; reference checks should be placed
just prior to the fillAtbTensorDesc(x, atb_x.desc) / fillAtbTensorDesc(y,
atb_y.desc) calls.

In `@tests/ascend/AscendRMSNormKernelTest.hpp`:
- Around line 20-26: Add a validation in RMSNormFloat16Test to ensure norm_size
matches the last dimension of input_shape before proceeding: check that
input_shape is non-empty and that norm_size == input_shape.back(), and if not,
fail the test early (e.g., return false or assert) with a clear message so the
subsequent reference computation using input_shape and norm_size cannot access
out-of-bounds memory; place this check at the start of the loop over test_cases
(reference symbols: RMSNormFloat16Test, input_shape, norm_size).
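For context, a CPU-side reference for the FP16 RMSNorm comparison might look like the sketch below. It assumes the usual formulation y = x / sqrt(mean(x²) + eps) · w over the last dimension; the guard at the top is the validation the comment asks for. The function and parameter names are illustrative, not the repo's actual symbols.

```python
import math

def rms_norm_ref(x, weight, norm_size, epsilon=1e-5):
    """Reference RMSNorm over the last dimension of a 2D list-of-lists.

    Validates norm_size against the last dimension before touching any
    data, as the review suggests, so a misconfigured test case fails
    fast instead of reading out of bounds.
    """
    if not x or norm_size != len(x[0]):
        raise ValueError("norm_size must match the last dimension of input")
    out = []
    for row in x:
        mean_sq = sum(v * v for v in row) / norm_size
        inv_rms = 1.0 / math.sqrt(mean_sq + epsilon)
        out.append([v * inv_rms * w for v, w in zip(row, weight)])
    return out
```

With the guard in place, a tuple like ([B, S, 128], 64, 1e-5) raises immediately rather than corrupting the reference computation.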
🧹 Nitpick comments (12)
mllm/backends/ascend/AscendCommon.hpp (1)

44-46: Clarify fillAtbTensor expectations in the header comment.

This helper is public and reused across ops; please document expected device (Ascend), dtype (currently FP16 via the descriptor helper), and any contiguity requirements so callers don't pass incompatible tensors.
As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

mllm/nn/Functional.hpp (1)

20-24: Add brief doc comments for the new functional APIs.

These are public entry points; please document expected shapes, optional bias semantics, epsilon meaning, and add_unit_offset behavior.
As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

Also applies to: 136-137

mllm/backends/ascend/ops/AscendViewOp.hpp (1)

11-23: Add brief class/method docs for AscendViewOp.

A short comment describing that this op is metadata‑only (view/reshape) and the expected input/output semantics would help callers.
As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

mllm/backends/ascend/ops/AscendRMSNormOp.cpp (2)

28-28: Remove or uncomment the commented-out assertion.

Commented-out code should either be removed or properly enabled. If the assertion is needed for input validation, uncomment it; otherwise, remove the line to keep the codebase clean.

♻️ Proposed fix
-  //MLLM_RT_ASSERT(inputs.size() == 1 || inputs.size() == 2, "AscendRMSNormOp expects 1 or 2 inputs");
+  MLLM_RT_ASSERT(inputs.size() == 1 || inputs.size() == 2, "AscendRMSNormOp expects 1 or 2 inputs");

35-35: Redundant variable weight_for_atb.

This variable is just a const reference alias to weight and adds no value. Consider using weight directly in fillAtbTensor.

♻️ Proposed fix
-  const Tensor& weight_for_atb = weight;
...
-  fillAtbTensor(weight_for_atb, atb_weight);
+  fillAtbTensor(weight, atb_weight);
mllm/backends/ascend/ops/AscendMatMulOp.hpp (1)

20-25: Consider adding final to AscendMatMulOpFactory for consistency.

AscendLinearOpFactory is marked final (line 21 in AscendLinearOp.hpp), but AscendMatMulOpFactory is not. For consistency across the Ascend operator factories, consider adding final.

♻️ Proposed fix
-class AscendMatMulOpFactory : public TypedOpFactory<OpTypes::kMatMul, aops::MatMulOpOptions> {
+class AscendMatMulOpFactory final : public TypedOpFactory<OpTypes::kMatMul, aops::MatMulOpOptions> {
tests/ascend/AscendRMSNormKernelTest.hpp (1)

74-77: Consider logging which test case failed for easier debugging.

When the comparison fails, the test returns false without indicating which test case or what the mismatch was. Adding diagnostic output would help identify failures during test runs.

♻️ Proposed fix
       auto result = mllm::test::allClose(y_cpu, ref_cpu, 1e-2f, 1e-2f);
       if (!result.is_close) {
+        std::cerr << "RMSNormFloat16Test failed for shape with norm_size=" << norm_size
+                  << ", epsilon=" << epsilon << std::endl;
         return false;
       }
tests/ascend/AscendLinearKernelTest.hpp (1)

73-83: LinearFloat16Test currently validates MatMul, not Linear.

This path uses nn::functional::matmul with weight [in, out], so the no‑bias AscendLinearOp path isn’t exercised. Consider adding a dedicated no‑bias nn::functional::linear test (weight [out, in]) or renaming this test to MatMul to reflect coverage.

As per coding guidelines: Suggest adding unit tests for untested complex logic or edge cases.
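The distinction the comment draws comes down to weight layout. A plain-Python sketch of the two conventions (all names illustrative, not the repo's): matmul takes a weight laid out [in, out], while the usual linear convention is [out, in] with y = x·Wᵀ + b, so a test driving matmul never touches the linear path.

```python
def matmul_ref(x, w_in_out):
    """y = x @ W with W laid out [in, out] (the path the current test hits)."""
    rows, inner, cols = len(x), len(w_in_out), len(w_in_out[0])
    assert all(len(r) == inner for r in x), "inner dims must agree"
    return [[sum(x[i][k] * w_in_out[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def linear_ref(x, w_out_in, bias=None):
    """y = x @ W^T (+ b) with W laid out [out, in], the usual linear-layer layout."""
    out = [[sum(xi * wi for xi, wi in zip(row, w_row)) for w_row in w_out_in]
           for row in x]
    if bias is not None:
        out = [[v + b for v, b in zip(row, bias)] for row in out]
    return out
```

Since linear_ref(x, W) equals matmul_ref(x, Wᵀ), a test built on the [in, out] layout exercises only the matmul identity; a dedicated no-bias linear test needs the transposed weight.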

tests/ascend/AscendSiLUKernelTest.hpp (3)

19-20: Add a brief docstring for the public method.

Per coding guidelines, public methods should have comments explaining purpose, parameters, and return value.

📝 Suggested documentation
+  /**
+   * `@brief` Tests SiLU (Sigmoid Linear Unit) operation on Ascend backend with FP16 precision.
+   * `@param` shapes Vector of tensor shapes to test.
+   * `@return` true if all shapes pass the allClose comparison, false otherwise.
+   */
   bool SiLUFloat16Test(const std::vector<mllm::Tensor::shape_t>& shapes) {

22-22: Use const auto& for loop variable.

Since shapes is a const reference and shape is not modified, use const auto& for correctness and clarity.

♻️ Proposed fix
-    for (auto& shape : shapes) {
+    for (const auto& shape : shapes) {

60-63: Consider logging diagnostic information on test failure.

When the test fails, it returns false without any diagnostic output. Adding a log message with the failing shape and mismatch details would aid debugging.

📝 Suggested improvement
       auto result = mllm::test::allClose(y_cpu, ref_cpu, 1e-2f, 1e-2f);
       if (!result.is_close) {
+        // Consider logging: shape, max_diff, index of max_diff, etc.
         return false;
       }
mllm/backends/ascend/ops/AscendSiLUOp.hpp (1)

12-18: Add class documentation.

Per coding guidelines, public classes should have comments explaining their purpose.

📝 Suggested documentation
+/**
+ * `@brief` Ascend backend implementation of the SiLU (Sigmoid Linear Unit) operator.
+ *
+ * Computes SiLU(x) = x * sigmoid(x) using ATB operations on Ascend NPU.
+ */
 class AscendSiLUOp final : public aops::SiLUOp {
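The formula in the suggested docstring is easy to pin down with a reference. Below is a hedged sketch, not repo code: silu_ref is the x·sigmoid(x) definition, and all_close mirrors the spirit of the allClose comparison the tests use with FP16-scale tolerances (rtol = atol = 1e-2); both names are hypothetical.

```python
import math

def silu_ref(xs):
    """SiLU(x) = x * sigmoid(x), the function the SiLU op is meant to compute."""
    return [x / (1.0 + math.exp(-x)) for x in xs]

def all_close(a, b, rtol=1e-2, atol=1e-2):
    """Elementwise tolerance check: |x - y| <= atol + rtol * |y|."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))
```

The loose 1e-2 tolerances reflect FP16 accumulation error; a tighter bound would produce spurious failures against a float32 CPU reference.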

Comment on lines +20 to +26
bool RMSNormFloat16Test(const std::vector<std::tuple<mllm::Tensor::shape_t, int, float>>& test_cases) {
using namespace mllm; // NOLINT
for (auto& test_case : test_cases) {
auto input_shape = std::get<0>(test_case);
int norm_size = std::get<1>(test_case);
float epsilon = std::get<2>(test_case);


⚠️ Potential issue | 🟡 Minor

Add validation that norm_size matches the last dimension of input_shape.

The test assumes norm_size equals input_shape.back(), but there's no explicit validation. If a test case is misconfigured with mismatched values, this could cause out-of-bounds access in the reference computation loop.

🛡️ Proposed fix
       int norm_size = std::get<1>(test_case);
       float epsilon = std::get<2>(test_case);
+
+      // Validate that norm_size matches the last dimension
+      MLLM_RT_ASSERT(input_shape.back() == static_cast<size_t>(norm_size),
+                     "norm_size must match the last dimension of input_shape");
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

bool RMSNormFloat16Test(const std::vector<std::tuple<mllm::Tensor::shape_t, int, float>>& test_cases) {
  using namespace mllm; // NOLINT
  for (auto& test_case : test_cases) {
    auto input_shape = std::get<0>(test_case);
    int norm_size = std::get<1>(test_case);
    float epsilon = std::get<2>(test_case);

    // Validate that norm_size matches the last dimension
    MLLM_RT_ASSERT(input_shape.back() == static_cast<size_t>(norm_size),
                   "norm_size must match the last dimension of input_shape");

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@mllm/backends/ascend/ops/AscendLinearOp.cpp`:
- Around line 101-105: The code leaks resources when erroring after creating the
ATB operation or allocating workspace: ensure the created op and any allocated
workspace are freed before calling MLLM_ERROR_EXIT. Add a small cleanup helper
(e.g., cleanupAndExit) or a scope guard that, given the local op and workspace
variables, frees the workspace and destroys the operation (call
atb::DestroyOperation(op) or the appropriate atb destroy API) and then performs
the error exit; invoke this helper on failures in Setup, Execute, and any
post-allocation error paths (referencing op, atb::CreateOperation, Setup,
Execute, workspace, and MLLM_ERROR_EXIT).
- Around line 134-146: The code casts workspaceSize (uint64_t) to uint32_t when
calling mem_mgr.allocateBlock, which can truncate sizes >= 4GB; update
AscendLinearOp.cpp to validate workspaceSize before casting: after
op->Setup(...) and before mem_mgr.allocateBlock, check if workspaceSize == 0
(keep current path) or if workspaceSize > static_cast<uint64_t>(UINT32_MAX) then
fail fast (log/Mllm error exit) or use a mem_mgr API that accepts 64-bit sizes;
ensure allocateBlock/getBlockPtr are called with a safe uint32_t size only after
this check and reference the existing symbols workspaceSize,
mem_mgr.allocateBlock, mem_mgr.getBlockPtr, and the error path used after
op->Setup.
🧹 Nitpick comments (3)
tests/ascend/AscendRMSNormKernelTest.hpp (1)

79-81: Consider adding diagnostic logging for test progress.

Unlike LinearFloat16Test which logs the test shape and PASSED/FAILED status for each case, this test returns silently. Adding similar logging would improve debuggability and consistency across the test suite.

♻️ Suggested improvement
+      std::cout << "[RMSNormTest] Testing shape=[";
+      for (size_t i = 0; i < input_shape.size(); ++i) {
+        std::cout << input_shape[i] << (i < input_shape.size() - 1 ? ", " : "");
+      }
+      std::cout << "], norm_size=" << norm_size << ", epsilon=" << epsilon << std::endl;
+
       // 3. Move inputs to Ascend and run RMSNorm
       auto x_ascend = x_cpu.to(kAscend);
       auto weight_ascend = weight_cpu.to(kAscend);
       auto result = mllm::test::allClose(y_cpu, ref_cpu, 1e-2f, 1e-2f);
       if (!result.is_close) {
+        std::cout << "[RMSNormTest] FAILED!" << std::endl;
         return false;
       }
+      std::cout << "[RMSNormTest] PASSED" << std::endl;
     }
mllm/backends/ascend/ops/AscendSiLUOp.cpp (1)

26-113: Prefer fillAtbTensor to avoid duplicated wiring.

You already added a helper that fills desc + payload; using it here reduces drift and keeps dtype checks centralized.

♻️ Proposed refactor
   atb::Tensor atb_x;
   atb::Tensor atb_y;

-  fillAtbTensorDesc(x, atb_x.desc);
-  fillAtbTensorDesc(y, atb_y.desc);
-
-  atb_x.deviceData = reinterpret_cast<uint8_t*>(x.ptr<void>());
-  atb_x.dataSize = x.bytes();
-  atb_y.deviceData = reinterpret_cast<uint8_t*>(y.ptr<void>());
-  atb_y.dataSize = y.bytes();
+  fillAtbTensor(x, atb_x);
+  fillAtbTensor(y, atb_y);
mllm/backends/ascend/AscendCommon.cpp (1)

234-290: Preserve device state during enumeration to avoid side effects.

The constructor leaves the last device active after the enumeration loop. If AscendDeviceMetaInfo::instance() is called after user code has selected a device, this will unexpectedly change the active device. Save and restore the current device before and after the loop.

The concern about aclInit is already addressed by the singleton pattern—it is only called once on first access.

+  // Preserve current device to avoid side effects
+  int32_t prev_device = -1;
+  if (aclrtGetDevice(&prev_device) != ACL_SUCCESS) {
+    prev_device = -1;
+  }
   // Collect info for each device
   for (uint32_t i = 0; i < device_count; ++i) {
     AscendDeviceInfo info;
     info.id = i;
     info.name = "Ascend Device " + std::to_string(i);

     // Set device to query its properties
     ret = aclrtSetDevice(i);
     if (ret == ACL_SUCCESS) {
       // Get memory information
       size_t free_mem = 0, total_mem = 0;
       ret = aclrtGetMemInfo(ACL_HBM_MEM, &free_mem, &total_mem);
       if (ret == ACL_SUCCESS) {
         info.total_memory = total_mem;
         info.free_memory = free_mem;
       } else {
         info.total_memory = 0;
         info.free_memory = 0;
       }

       // SOC version - platform specific, set to unknown for now
       info.soc_version = "Unknown";
     } else {
       info.total_memory = 0;
       info.free_memory = 0;
       info.soc_version = "Unknown";
     }

     devices.push_back(info);
   }
+  if (prev_device >= 0) {
+    (void)aclrtSetDevice(prev_device);
+  }

Comment on lines +101 to +105
atb::Operation* op = nullptr;
auto st = atb::CreateOperation(linearParam, &op);
if (st != atb::NO_ERROR || op == nullptr) {
MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB CreateOperation(Linear) failed, status={}", static_cast<int>(st));
}
⚠️ Potential issue | 🟡 Minor

Resource leak on error paths.

If Setup or Execute fails, op is not destroyed before MLLM_ERROR_EXIT. Similarly, if Execute fails after workspace allocation, the workspace is not freed. While MLLM_ERROR_EXIT may terminate the process, proper cleanup ensures robustness if error handling changes in the future.

Suggested approach using scope guard or manual cleanup
   atb::Operation* op = nullptr;
   auto st = atb::CreateOperation(linearParam, &op);
   if (st != atb::NO_ERROR || op == nullptr) {
     MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB CreateOperation(Linear) failed, status={}", static_cast<int>(st));
   }
+
+  // Helper lambda for cleanup before error exit
+  auto cleanupAndExit = [&](const char* msg, int status) {
+    if (workspace_block_id != -1) {
+      getAscendMemoryManager().freeBlock(workspace_block_id);
+    }
+    atb::DestroyOperation(op);
+    MLLM_ERROR_EXIT(ExitCode::kAscendError, msg, status);
+  };

Then replace the error exits after operation creation with calls to cleanupAndExit.

Also applies to: 136-138, 153-155


Comment on lines +134 to +146
uint64_t workspaceSize = 0;
st = op->Setup(vp, workspaceSize, atb_ctx);
if (st != atb::NO_ERROR) {
MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB LinearOp Setup failed, status={}", static_cast<int>(st));
}

void* workspace = nullptr;
int workspace_block_id = -1;
if (workspaceSize > 0) {
auto& mem_mgr = getAscendMemoryManager();
mem_mgr.allocateBlock(static_cast<uint32_t>(workspaceSize), workspace_block_id);
mem_mgr.getBlockPtr(workspace_block_id, workspace);
}
⚠️ Potential issue | 🟡 Minor

Potential integer overflow when casting workspace size.

workspaceSize is uint64_t but is cast to uint32_t when calling allocateBlock. If the workspace exceeds 4GB, this will silently truncate and cause memory corruption or crashes.

Suggested fix
   if (workspaceSize > 0) {
+    if (workspaceSize > static_cast<uint64_t>(UINT32_MAX)) {
+      atb::DestroyOperation(op);
+      MLLM_ERROR_EXIT(ExitCode::kAscendError,
+                      "ATB LinearOp workspace size {} exceeds uint32_t max", workspaceSize);
+    }
     auto& mem_mgr = getAscendMemoryManager();
     mem_mgr.allocateBlock(static_cast<uint32_t>(workspaceSize), workspace_block_id);
     mem_mgr.getBlockPtr(workspace_block_id, workspace);
   }
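The truncation hazard is easy to demonstrate. A small sketch (names hypothetical; the C++ fix above is the authoritative version) showing how a uint64-to-uint32 narrowing silently wraps, versus the fail-fast guard the review asks for:

```python
UINT32_MAX = 0xFFFFFFFF

def to_uint32_unchecked(size):
    # Mimics static_cast<uint32_t>: keeps only the low 32 bits, silently.
    return size & UINT32_MAX

def to_uint32_checked(size):
    # The guard the review asks for: fail fast instead of truncating.
    if size > UINT32_MAX:
        raise OverflowError(f"workspace size {size} exceeds uint32_t max")
    return size
```

A workspace request just past 4 GiB would be truncated to a few bytes by the unchecked cast, turning an allocation-size bug into downstream memory corruption; the checked version surfaces it at the call site.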

Tensor silu_(const Tensor& x);

Tensor rms_norm(const Tensor& x, const Tensor& weight, float epsilon = 1e-5f, bool add_unit_offset = false);


Naming suggestion: rmsNorm

regOpFactory<AscendSiLUOpFactory>();
regOpFactory<AscendLinearOpFactory>();
regOpFactory<AscendRMSNormOpFactory>();
regOpFactory<AscendViewOpFactory>();


Suggestion: register these in a single variadic call: regOpFactory<AscendSiLUOpFactory, AscendLinearOpFactory, AscendRMSNormOpFactory>

return Context::instance().buildOpAndSubmitTask(OpTypes::kSiLU, opt, {x})[0];
}

Tensor rms_norm(const Tensor& x, const Tensor& weight, float epsilon, bool add_unit_offset) {


Naming suggestion: rmsNorm

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tasks/build_android.yaml (1)

2-15: ⚠️ Potential issue | 🟠 Major

Add cmake_toolchain_file to match other Android build configurations.

The cmake_toolchain_file is missing from this configuration, while all other Android build configurations (build_android_debug.yaml, build_android_opencl.yaml, build_android_qnn.yaml, build_sdk_android.yaml, etc.) include cmake_toolchain_file: "$ANDROID_NDK_PATH/build/cmake/android.toolchain.cmake". Without the NDK toolchain file, CMake may not correctly configure the Android cross-compilation environment.

Add the missing line:

cmake_toolchain_file: "$ANDROID_NDK_PATH/build/cmake/android.toolchain.cmake"
pymllm/nn/_layers.py (1)

8-46: ⚠️ Potential issue | 🔴 Critical

Fix base-class rename to avoid NameError and class shadowing.

Line 8 renames _Layer to Linear, but Line 46 still defines class Linear(_Layer). This makes _Layer undefined at import time and overwrites the first Linear definition. Restore the base-class name (or update all subclasses consistently).

🛠️ Proposed fix
-class Linear:
+class _Layer:
🤖 Fix all issues with AI agents
In `@examples/ascend_add_demo/build_and_run.sh`:
- Around line 52-69: The script currently relies on set -e and then checks $?
after running "make ascend_add_demo -j$(nproc)", which makes the else branch
unreachable; change the build invocation to use the conditional form "if make
ascend_add_demo -j\"$(nproc)\"; then ... else ..." so the success and failure
branches run correctly, and move the demo execution and its own conditional
similarly into the then/else structure (i.e., replace the standalone make line
plus the subsequent "if [ $? -eq 0 ]" block with an if wrapper around the make
command and quote "$(nproc)"); do the same for running
"./examples/ascend_add_demo/ascend_add_demo" so failure prints the custom error
and exits.

In `@examples/ascend_add_demo/README.md`:
- Around line 16-18: Replace the hard-coded absolute path in the README code
example with a portable placeholder or relative path so demos are reproducible;
specifically update the code block invoking the example to use either
"<repo_root>/examples/ascend_add_demo" or a relative path like
"./examples/ascend_add_demo" and keep the same ./build_and_run.sh invocation
(reference: README.md code block showing cd
/home/HwHiAiUser/mLLM/examples/ascend_add_demo and ./build_and_run.sh).
- Around line 54-74: The markdown fenced output block in the README lacks a
language identifier causing markdownlint MD040; update the triple-backtick fence
that wraps the demo output (the block beginning with "=== Ascend Add Op Demo
===") to include a language specifier (e.g., change the opening fence from ```
to ```text) so the code fence is ```text ... ``` and linting passes; no other
content changes needed.

In `@mllm/backends/ascend/ops/AscendElewiseOps.cpp`:
- Around line 88-94: The file contains blank lines with trailing whitespace near
the ATB AddOp error handling and subsequent code (around the MLLM_ERROR_EXIT
call, syncGlobalAtbStream(), and the workspace_block_id check); remove the
trailing spaces on those blank lines so no line ends with whitespace (trim the
whitespace after the blank lines around MLLM_ERROR_EXIT, syncGlobalAtbStream,
and the block handling around workspace_block_id).

In `@mllm/backends/ascend/ops/AscendElewiseOps.hpp`:
- Around line 27-55: Add Doxygen-style API docs for the public classes and
methods: document AscendSubOp and AscendMulOp (purpose, when to use), their
constructors (parameters: const aops::SubOpOptions& / aops::MulOpOptions), and
the overridden methods setup(const std::vector<Tensor>& inputs,
std::vector<Tensor>& outputs) and forward(const std::vector<Tensor>& inputs,
std::vector<Tensor>& outputs) (describe expected inputs, outputs, side effects,
error conditions). Also document the factory classes AscendSubOpFactory and
AscendMulOpFactory (what they produce and when to use createOpImpl). Keep
comments brief, follow Doxygen tags (`@brief`, `@param`, `@return`, `@throws` if
applicable) and place them above each class and the setup/forward method
declarations.

In `@mllm/backends/ascend/ops/AscendMatMulOp.cpp`:
- Around line 60-74: When handling the 4D-A reshape path in AscendMatMulOp.cpp
(where is_4d is true and atb_a_shape/atb_b_shape/atb_c_shape are built),
validate that b_shape and c_shape are also 4D before indexing [0..3]; add
explicit checks (e.g., assert or throw) that b_shape.size() == 4 and
c_shape.size() == 4 immediately inside the is_4d branch before computing
batch_heads_b/batch_heads_c and accessing b_shape[2..3]/c_shape[2..3], and
return or raise a clear error if they are not 4D to prevent out-of-bounds
access.

In `@mllm/backends/ascend/ops/AscendSoftmaxOp.cpp`:
- Around line 55-63: The code adjusts options_.axis to a positive index but
never validates it; before pushing to softmaxParam.axes (and before creating the
ATB op), compute int rank = static_cast<int>(x.rank()); verify that
options_.axis is within [-rank, rank-1], convert negative axis via axis += rank,
then assert or return an error if the resulting axis is not in [0, rank-1]; only
after this valid check call
softmaxParam.axes.push_back(static_cast<int64_t>(axis)) so the ATB op cannot be
created with an out-of-range axis.

In `@pyproject.toml`:
- Around line 2-4: Update the build-system requires list to pin apache-tvm-ffi
to the runtime version by changing the unpinned entry in the requires = [...]
array to "apache-tvm-ffi==0.1.0b4" so the build-time dependency matches the
runtime pin; locate the requires list in pyproject.toml and replace the plain
"apache-tvm-ffi" token with the exact version string.

In `@tasks/build_android.yaml`:
- Line 8: Add the missing cmake_toolchain_file key to the Android build task so
the NDK toolchain is used: insert cmake_toolchain_file:
"$ANDROID_NDK_PATH/build/cmake/android.toolchain.cmake" into the Android task
(the same place where cmake args like "-DMLLM_BUILD_ASCEND_BACKEND=ON" are
defined) and do the same change in the ARM Ascend Android task; ensure the key
is at the same indentation level as the other top-level task properties so the
YAML remains valid.

In `@tests/ascend/AscendSoftmaxKernelTest.hpp`:
- Around line 60-109: The index computation for the middle-axis case (0 <
pos_axis < ndim-1) in the three loops erroneously multiplies by shape[pos_axis]
and can step past the buffer; update all index calculations used to read/write
x_ptr and r_ptr so that the middle-axis branch uses outer * strides[pos_axis -
1] + i * strides[pos_axis] + inner (instead of the current outer *
strides[pos_axis - 1] * shape[pos_axis] + ...), and factor this into a small
helper (e.g., an indexer function used by the max_val, exp_vals, and result
loops) to avoid duplication and ensure consistency across the loops that use
pos_axis, strides, shape, outer_size, inner_size, axis_size, MLLM_FP16_TO_FP32
and MLLM_FP32_TO_FP16.
🧹 Nitpick comments (1)
mllm/nn/Functional.hpp (1)

23-24: Add brief API docs for the new functional ops.

These are public API declarations; a short Doxygen-style comment clarifying parameters/returns (bias optional, epsilon/add_unit_offset semantics) will make the header clearer.

✍️ Example doc additions
+// Linear: y = x * weight^T + bias (optional).
+// bias can be an empty Tensor().
 Tensor linear(const Tensor& x, const Tensor& weight, const Tensor& bias = Tensor());

+// RMSNorm: y = x * weight / sqrt(mean(x^2) + epsilon).
+// add_unit_offset adds 1.0 to weight before scaling when true.
 Tensor rmsNorm(const Tensor& x, const Tensor& weight, float epsilon = 1e-5f, bool add_unit_offset = false);
As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

Also applies to: 136-137
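For the doc comments above, the documented semantics can be cross-checked with a pure-Python sketch (reference math only, not the mllm API):

```python
import math

def linear(x, weight, bias=None):
    """y = x . weight^T + bias; weight is [out, in], bias is optional."""
    y = [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]
    if bias is not None:
        y = [yi + bi for yi, bi in zip(y, bias)]
    return y

def rms_norm(x, weight, epsilon=1e-5, add_unit_offset=False):
    """y = x * w / sqrt(mean(x^2) + epsilon); add_unit_offset adds
    1.0 to each weight before scaling."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + epsilon)
    w = [wi + 1.0 for wi in weight] if add_unit_offset else list(weight)
    return [xi * wi / rms for xi, wi in zip(x, w)]

print(linear([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], bias=[0.5, 0.5]))  # [1.5, 2.5]
print([round(v, 4) for v in rms_norm([3.0, 4.0], [1.0, 1.0])])
```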

Comment on lines +52 to +69

```bash
# 编译
echo -e "\n${YELLOW}开始编译...${NC}"
make ascend_add_demo -j$(nproc)

# 检查编译结果
if [ $? -eq 0 ]; then
    echo -e "\n${GREEN}✓ 编译成功!${NC}\n"

    # 运行
    echo -e "${YELLOW}运行 demo...${NC}\n"
    ./examples/ascend_add_demo/ascend_add_demo

    if [ $? -eq 0 ]; then
        echo -e "\n${GREEN}✓ Demo 运行成功!${NC}"
    else
        echo -e "\n${RED}✗ Demo 运行失败${NC}"
        exit 1
    fi
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the entire script to check for set -e
cat examples/ascend_add_demo/build_and_run.sh

Repository: UbiquitousLearning/mllm

Length of output: 1667


Fix error handling: custom failure messages won't execute with set -e.

With set -e, standalone commands like make ascend_add_demo -j$(nproc) cause the script to exit immediately on failure, before the if [ $? -eq 0 ] check runs. The custom error messages in the else blocks are therefore unreachable. Use if cmd; then ... else ... fi to preserve error handling (also quote $(nproc) to avoid word-splitting issues).
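The short-circuit is easy to verify outside the repo; a small Python-driven sketch, assuming a POSIX `sh` is available on the host:

```python
import subprocess

# Under `set -e`, a failing command aborts the script before any
# `if [ $? -eq 0 ]` check runs, so neither branch ever executes.
script = """
set -e
false                  # stand-in for a failing `make`
if [ $? -eq 0 ]; then
    echo build-ok
else
    echo build-failed  # unreachable: the shell already exited
fi
"""
result = subprocess.run(["sh", "-c", script], capture_output=True, text=True)
print(result.returncode, repr(result.stdout))  # 1 '' -- neither branch ran
```

Wrapping the command in `if cmd; then … fi` suppresses the errexit behavior for that command, which is why the suggested rewrite preserves the custom error messages.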

🔧 Suggested update
 # 编译
 echo -e "\n${YELLOW}开始编译...${NC}"
-make ascend_add_demo -j$(nproc)
-
-# 检查编译结果
-if [ $? -eq 0 ]; then
+if make ascend_add_demo -j"$(nproc)"; then
     echo -e "\n${GREEN}✓ 编译成功!${NC}\n"
     
     # 运行
     echo -e "${YELLOW}运行 demo...${NC}\n"
-    ./examples/ascend_add_demo/ascend_add_demo
-    
-    if [ $? -eq 0 ]; then
+    if ./examples/ascend_add_demo/ascend_add_demo; then
         echo -e "\n${GREEN}✓ Demo 运行成功!${NC}"
     else
         echo -e "\n${RED}✗ Demo 运行失败${NC}"
         exit 1
     fi
 else
     echo -e "\n${RED}✗ 编译失败${NC}"
     exit 1
 fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

```bash
# 编译
echo -e "\n${YELLOW}开始编译...${NC}"
make ascend_add_demo -j$(nproc)

# 检查编译结果
if [ $? -eq 0 ]; then
    echo -e "\n${GREEN}✓ 编译成功!${NC}\n"

    # 运行
    echo -e "${YELLOW}运行 demo...${NC}\n"
    ./examples/ascend_add_demo/ascend_add_demo

    if [ $? -eq 0 ]; then
        echo -e "\n${GREEN}✓ Demo 运行成功!${NC}"
    else
        echo -e "\n${RED}✗ Demo 运行失败${NC}"
        exit 1
    fi
```

After:

```bash
# 编译
echo -e "\n${YELLOW}开始编译...${NC}"
if make ascend_add_demo -j"$(nproc)"; then
    echo -e "\n${GREEN}✓ 编译成功!${NC}\n"

    # 运行
    echo -e "${YELLOW}运行 demo...${NC}\n"
    if ./examples/ascend_add_demo/ascend_add_demo; then
        echo -e "\n${GREEN}✓ Demo 运行成功!${NC}"
    else
        echo -e "\n${RED}✗ Demo 运行失败${NC}"
        exit 1
    fi
else
    echo -e "\n${RED}✗ 编译失败${NC}"
    exit 1
fi
```
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 54-54: Quote this to prevent word splitting.

(SC2046)

🤖 Prompt for AI Agents
In `@examples/ascend_add_demo/build_and_run.sh` around lines 52 - 69, The script
currently relies on set -e and then checks $? after running "make
ascend_add_demo -j$(nproc)", which makes the else branch unreachable; change the
build invocation to use the conditional form "if make ascend_add_demo
-j\"$(nproc)\"; then ... else ..." so the success and failure branches run
correctly, and move the demo execution and its own conditional similarly into
the then/else structure (i.e., replace the standalone make line plus the
subsequent "if [ $? -eq 0 ]" block with an if wrapper around the make command
and quote "$(nproc)"); do the same for running
"./examples/ascend_add_demo/ascend_add_demo" so failure prints the custom error
and exits.

Comment on lines +16 to +18

```bash
cd /home/HwHiAiUser/mLLM/examples/ascend_add_demo
./build_and_run.sh
```

⚠️ Potential issue | 🟡 Minor

Avoid hard-coded absolute paths in the demo instructions.

Using an environment-specific path makes the steps non-portable. Consider a placeholder such as <repo_root>/examples/ascend_add_demo.

✏️ Suggested edit
-cd /home/HwHiAiUser/mLLM/examples/ascend_add_demo
+cd <repo_root>/examples/ascend_add_demo
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-cd /home/HwHiAiUser/mLLM/examples/ascend_add_demo
+cd <repo_root>/examples/ascend_add_demo
 ./build_and_run.sh
```
🤖 Prompt for AI Agents
In `@examples/ascend_add_demo/README.md` around lines 16 - 18, Replace the
hard-coded absolute path in the README code example with a portable placeholder
or relative path so demos are reproducible; specifically update the code block
invoking the example to use either "<repo_root>/examples/ascend_add_demo" or a
relative path like "./examples/ascend_add_demo" and keep the same
./build_and_run.sh invocation (reference: README.md code block showing cd
/home/HwHiAiUser/mLLM/examples/ascend_add_demo and ./build_and_run.sh).

Comment on lines +54 to +74
```
=== Ascend Add Op Demo ===
1. Initializing Ascend backend...
✓ Ascend backend initialized

2. Creating input tensors...
Input x shape: [2, 3]
Input y shape: [2, 3]

3. Transferring tensors to Ascend device...
✓ Tensors transferred to Ascend

4. Executing Add operation on Ascend...
✓ Add operation completed

5. Transferring result back to CPU and verifying...
Expected result: [11, 22, 33, 44, 55, 66]
Actual result: [11, 22, 33, 44, 55, 66]

✓ Test PASSED! All values match expected results.
```

⚠️ Potential issue | 🟡 Minor

Specify a language for the expected-output code fence.

markdownlint MD040 requires a language; text is fine for output.

✅ Suggested fix
-```
+```text
 === Ascend Add Op Demo ===
 ...
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 54-54: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@examples/ascend_add_demo/README.md` around lines 54 - 74, The markdown fenced
output block in the README lacks a language identifier causing markdownlint
MD040; update the triple-backtick fence that wraps the demo output (the block
beginning with "=== Ascend Add Op Demo ===") to include a language specifier
(e.g., change the opening fence from ``` to ```text) so the code fence is
```text ... ``` and linting passes; no other content changes needed.

Comment on lines +88 to +94

```cpp
    MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB AddOp Execute failed, status={}", static_cast<int>(st));
  }

  syncGlobalAtbStream();

  if (workspace_block_id != -1) {
```

⚠️ Potential issue | 🔴 Critical

Remove trailing whitespace (policy violation).

There are blank lines with trailing spaces that violate the repository’s no‑trailing‑whitespace rule.

✂️ Trim trailing spaces
-  
-  syncGlobalAtbStream();
+  syncGlobalAtbStream();
@@
-  
-  syncGlobalAtbStream();
+  syncGlobalAtbStream();
As per coding guidelines: No line may end with trailing whitespace.

Also applies to: 176-181

🤖 Prompt for AI Agents
In `@mllm/backends/ascend/ops/AscendElewiseOps.cpp` around lines 88 - 94, The file
contains blank lines with trailing whitespace near the ATB AddOp error handling
and subsequent code (around the MLLM_ERROR_EXIT call, syncGlobalAtbStream(), and
the workspace_block_id check); remove the trailing spaces on those blank lines
so no line ends with whitespace (trim the whitespace after the blank lines
around MLLM_ERROR_EXIT, syncGlobalAtbStream, and the block handling around
workspace_block_id).

Comment on lines +27 to +55

```cpp
class AscendSubOp final : public aops::SubOp {
 public:
  explicit AscendSubOp(const aops::SubOpOptions& options);

  void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
  void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
};

class AscendSubOpFactory final : public TypedOpFactory<OpTypes::kSub, aops::SubOpOptions> {
 public:
  std::shared_ptr<BaseOp> createOpImpl(const aops::SubOpOptions& options) override {
    return std::make_shared<AscendSubOp>(options);
  }
};

class AscendMulOp final : public aops::MulOp {
 public:
  explicit AscendMulOp(const aops::MulOpOptions& options);

  void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
  void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
};

class AscendMulOpFactory final : public TypedOpFactory<OpTypes::kMul, aops::MulOpOptions> {
 public:
  std::shared_ptr<BaseOp> createOpImpl(const aops::MulOpOptions& options) override {
    return std::make_shared<AscendMulOp>(options);
  }
};
```

⚠️ Potential issue | 🟡 Minor

Add API docs for the new AscendSub/AscendMul ops and factories.

These are public declarations in a header but lack purpose/parameter/behavior docs. Please add brief comments (Doxygen-style is fine) for the new classes and their setup/forward methods.
As per coding guidelines, {mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.

💡 Suggested doc-comment additions
+/// Ascend backend elementwise subtraction op.
+/// \param options Op configuration.
 class AscendSubOp final : public aops::SubOp {
  public:
   explicit AscendSubOp(const aops::SubOpOptions& options);

+  /// Prepare output tensors for the given inputs.
   void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
+  /// Execute elementwise subtraction.
   void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
 };

+/// Factory for AscendSubOp.
 class AscendSubOpFactory final : public TypedOpFactory<OpTypes::kSub, aops::SubOpOptions> {
  public:
   std::shared_ptr<BaseOp> createOpImpl(const aops::SubOpOptions& options) override {
     return std::make_shared<AscendSubOp>(options);
   }
 };

+/// Ascend backend elementwise multiplication op.
+/// \param options Op configuration.
 class AscendMulOp final : public aops::MulOp {
  public:
   explicit AscendMulOp(const aops::MulOpOptions& options);

+  /// Prepare output tensors for the given inputs.
   void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
+  /// Execute elementwise multiplication.
   void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
 };

+/// Factory for AscendMulOp.
 class AscendMulOpFactory final : public TypedOpFactory<OpTypes::kMul, aops::MulOpOptions> {
  public:
   std::shared_ptr<BaseOp> createOpImpl(const aops::MulOpOptions& options) override {
     return std::make_shared<AscendMulOp>(options);
   }
 };
🤖 Prompt for AI Agents
In `@mllm/backends/ascend/ops/AscendElewiseOps.hpp` around lines 27 - 55, Add
Doxygen-style API docs for the public classes and methods: document AscendSubOp
and AscendMulOp (purpose, when to use), their constructors (parameters: const
aops::SubOpOptions& / aops::MulOpOptions), and the overridden methods
setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) and
forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs)
(describe expected inputs, outputs, side effects, error conditions). Also
document the factory classes AscendSubOpFactory and AscendMulOpFactory (what
they produce and when to use createOpImpl). Keep comments brief, follow Doxygen
tags (`@brief`, `@param`, `@return`, `@throws` if applicable) and place them above each
class and the setup/forward method declarations.

Comment on lines +60 to +74

```cpp
bool is_4d = (a_shape.size() == 4);

// Prepare shapes for ATB
std::vector<int64_t> atb_a_shape, atb_b_shape, atb_c_shape;

if (is_4d) {
  // Reshape [B, H, S, D] -> [B*H, S, D]
  int64_t batch_heads_a = static_cast<int64_t>(a_shape[0]) * static_cast<int64_t>(a_shape[1]);
  int64_t batch_heads_b = static_cast<int64_t>(b_shape[0]) * static_cast<int64_t>(b_shape[1]);
  int64_t batch_heads_c = static_cast<int64_t>(c_shape[0]) * static_cast<int64_t>(c_shape[1]);

  atb_a_shape = {batch_heads_a, static_cast<int64_t>(a_shape[2]), static_cast<int64_t>(a_shape[3])};
  atb_b_shape = {batch_heads_b, static_cast<int64_t>(b_shape[2]), static_cast<int64_t>(b_shape[3])};
  atb_c_shape = {batch_heads_c, static_cast<int64_t>(c_shape[2]), static_cast<int64_t>(c_shape[3])};
} else {
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "AscendMatMulOp.cpp" -type f

Repository: UbiquitousLearning/mllm

Length of output: 112


🏁 Script executed:

cat -n mllm/backends/ascend/ops/AscendMatMulOp.cpp | head -100

Repository: UbiquitousLearning/mllm

Length of output: 4489


🏁 Script executed:

# Check what assertion macros are available in the codebase
rg "MLLM_RT_ASSERT" --type cpp -m 1 -A 2

Repository: UbiquitousLearning/mllm

Length of output: 33484


🏁 Script executed:

# Check if there's any validation of B/C ranks elsewhere or documentation about expected shapes
cat -n mllm/backends/ascend/ops/AscendMatMulOp.cpp | sed -n '40,120p'

Repository: UbiquitousLearning/mllm

Length of output: 3790


🏁 Script executed:

# Look for the header file to understand the function interface better
find . -name "AscendMatMulOp.hpp" -type f

Repository: UbiquitousLearning/mllm

Length of output: 112


🏁 Script executed:

# Check how this operation is typically called - look for test files or usage
find . -name "*MatMul*Test*" -o -name "*matmul*test*" | grep -i ascend

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Verify MLLM_RT_ASSERT_EQ macro definition
cat mllm/utils/Common.hpp | grep -A 5 "MLLM_RT_ASSERT_EQ"

Repository: UbiquitousLearning/mllm

Length of output: 569


🏁 Script executed:

# Check how AscendMatMulOp::forward is called
rg "AscendMatMulOp" --type cpp -A 2 | head -50

Repository: UbiquitousLearning/mllm

Length of output: 2853


🏁 Script executed:

# Look for setup() to see if rank is validated during initialization
cat -n mllm/backends/ascend/ops/AscendMatMulOp.cpp | sed -n '22,25p'

Repository: UbiquitousLearning/mllm

Length of output: 225


🏁 Script executed:

# Check the MatMulOp base class to see if there's rank validation
find . -name "MatMulOp.hpp" -o -name "MatMulOp.cpp" | grep -v ascend | head -5

Repository: UbiquitousLearning/mllm

Length of output: 250


🏁 Script executed:

# Check base MatMulOp for any rank validation
cat -n mllm/core/aops/MatMulOp.cpp | head -150

Repository: UbiquitousLearning/mllm

Length of output: 2790


🏁 Script executed:

# Check for any shape inference or validation in the base class
cat mllm/core/aops/MatMulOp.hpp

Repository: UbiquitousLearning/mllm

Length of output: 2355


🏁 Script executed:

# Search for how forward() is called generically in the execution pipeline
rg "->forward\(" --type cpp | head -10

Repository: UbiquitousLearning/mllm

Length of output: 471


🏁 Script executed:

# Look for where forward() is called in the execution pipeline
rg "forward" --type cpp mllm/core/BaseOp.hpp | head -20

Repository: UbiquitousLearning/mllm

Length of output: 325


🏁 Script executed:

# Check if reshape() is always called before forward()
rg "reshape\(" --type cpp mllm/core/BaseOp.cpp | head -20

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

cat -n mllm/core/BaseOp.hpp | grep -A 10 "forward\|reshape" | head -40

Repository: UbiquitousLearning/mllm

Length of output: 1460


🏁 Script executed:

# Let me check if there's any matmul shape validation during reshape or setup that would prevent mismatched ranks
cat -n mllm/core/aops/MatMulOp.cpp | sed -n '27,64p'

Repository: UbiquitousLearning/mllm

Length of output: 1612


🏁 Script executed:

# Check if BaseOp enforces any constraints during setup
cat -n mllm/core/BaseOp.cpp | sed -n '1,50p'

Repository: UbiquitousLearning/mllm

Length of output: 996


Validate B and C ranks before 4D reshape indexing to prevent out-of-bounds access.

When is_4d is true (A is 4D), the code unconditionally indexes b_shape[0..3] and c_shape[0..3] at lines 68-69 and 72-73 without checking that B and C are also 4D. If callers pass 4D A with 3D B or C, this causes out-of-bounds access. Add explicit rank assertions before the reshape path:

  if (is_4d) {
+    MLLM_RT_ASSERT_EQ(b_shape.size(), 4);
+    MLLM_RT_ASSERT_EQ(c_shape.size(), 4);
     // Reshape [B, H, S, D] -> [B*H, S, D]
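The guarded collapse can be sketched in Python (a hypothetical helper illustrating the check, not the mllm code itself):

```python
def collapse_4d_for_matmul(shape):
    """Collapse [B, H, S, D] into [B*H, S, D] for a batched matmul,
    failing loudly on non-4D input instead of indexing past the end."""
    if len(shape) != 4:
        raise ValueError(f"expected a rank-4 shape, got rank {len(shape)}")
    b, h, s, d = shape
    return [b * h, s, d]

print(collapse_4d_for_matmul([2, 8, 16, 64]))  # [16, 16, 64]

try:
    collapse_4d_for_matmul([16, 16, 64])       # a 3D B passed alongside a 4D A
except ValueError as e:
    print(e)                                   # rejected before any out-of-range access
```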
🤖 Prompt for AI Agents
In `@mllm/backends/ascend/ops/AscendMatMulOp.cpp` around lines 60 - 74, When
handling the 4D-A reshape path in AscendMatMulOp.cpp (where is_4d is true and
atb_a_shape/atb_b_shape/atb_c_shape are built), validate that b_shape and
c_shape are also 4D before indexing [0..3]; add explicit checks (e.g., assert or
throw) that b_shape.size() == 4 and c_shape.size() == 4 immediately inside the
is_4d branch before computing batch_heads_b/batch_heads_c and accessing
b_shape[2..3]/c_shape[2..3], and return or raise a clear error if they are not
4D to prevent out-of-bounds access.

Comment on lines +55 to +63

```cpp
// Convert axis to positive index if negative
int axis = options_.axis;
if (axis < 0) {
  axis = static_cast<int>(x.rank()) + axis;
}

// ATB expects axes as SVector<int64_t>
softmaxParam.axes.push_back(static_cast<int64_t>(axis));
```

⚠️ Potential issue | 🟠 Major

Validate axis bounds before creating the ATB op.

If options_.axis is outside [-rank, rank-1], the adjusted value can still be invalid and the ATB op may fail or read out of range. Add an explicit range check.

✅ Suggested bounds check
   int axis = options_.axis;
   if (axis < 0) {
     axis = static_cast<int>(x.rank()) + axis;
   }
+  if (axis < 0 || axis >= static_cast<int>(x.rank())) {
+    MLLM_ERROR_EXIT(ExitCode::kAscendError,
+                    "AscendSoftmaxOp: axis {} out of range for rank {}",
+                    axis, x.rank());
+  }
As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
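The same normalization-plus-validation logic can be sketched in Python (a hypothetical helper, not the mllm API):

```python
def normalize_axis(axis, rank):
    """Map an axis in [-rank, rank-1] to [0, rank-1]; reject anything
    outside that range before it can reach the ATB operator."""
    if not -rank <= axis < rank:
        raise ValueError(f"axis {axis} out of range for rank {rank}")
    return axis + rank if axis < 0 else axis

print(normalize_axis(-1, 4))  # 3
print(normalize_axis(2, 4))   # 2

try:
    normalize_axis(4, 4)      # invalid, just like axis = -5 would be
except ValueError as e:
    print(e)
```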
🤖 Prompt for AI Agents
In `@mllm/backends/ascend/ops/AscendSoftmaxOp.cpp` around lines 55 - 63, The code
adjusts options_.axis to a positive index but never validates it; before pushing
to softmaxParam.axes (and before creating the ATB op), compute int rank =
static_cast<int>(x.rank()); verify that options_.axis is within [-rank, rank-1],
convert negative axis via axis += rank, then assert or return an error if the
resulting axis is not in [0, rank-1]; only after this valid check call
softmaxParam.axes.push_back(static_cast<int64_t>(axis)) so the ATB op cannot be
created with an out-of-range axis.

Comment on lines 2 to 4

```diff
 requires = [
-  "scikit-build-core>=0.11.0", "apache-tvm-ffi"
+  "scikit-build-core==0.10.0", "apache-tvm-ffi"
 ]
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find . -name "pyproject.toml" -type f

Repository: UbiquitousLearning/mllm

Length of output: 83


🏁 Script executed:

cat -n pyproject.toml

Repository: UbiquitousLearning/mllm

Length of output: 2137


🌐 Web query:

scikit-build-core 0.10.0 release notes wheel.py-api minimum-version support

💡 Result:

For scikit-build-core 0.10.0 (released as part of the 0.10 line), the relevant “release notes” items are:

  • Opt-in “auto minimum-version”: 0.10.0 added an automatic way to set tool.scikit-build.minimum-version from your build-system.requires, so you can single-source the lower bound (e.g. requires = ["scikit-build-core>=0.10"] plus minimum-version = "build-system.requires"). (iscinumpy.dev)
  • Wheel python-API tagging via wheel.py-api (feature existed by 0.10.x and is documented): you can set wheel tags for Limited API / Stable ABI builds, e.g. wheel.py-api = "cp38". scikit-build-core will only target ABI3 when the running Python is >= the version you set; otherwise it builds a normal CPython-extension wheel. You can also use wheel.py-api = "py3" for “pythonless” (generic) tags when you’re not tied to CPython’s API. (scikit-build-core.readthedocs.io)



Pin apache-tvm-ffi in build-system to match runtime version.

apache-tvm-ffi is unpinned in the build-system requires while the runtime dependency pins it to 0.1.0b4. Unpinned build dependencies can cause build/runtime mismatches and non-reproducible builds. Pin it consistently.

🔧 Suggested adjustment
 requires = [
-  "scikit-build-core==0.10.0", "apache-tvm-ffi"
+  "scikit-build-core==0.10.0",
+  "apache-tvm-ffi==0.1.0b4",
 ]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
 requires = [
-  "scikit-build-core==0.10.0", "apache-tvm-ffi"
+  "scikit-build-core==0.10.0",
+  "apache-tvm-ffi==0.1.0b4",
 ]
```
🤖 Prompt for AI Agents
In `@pyproject.toml` around lines 2 - 4, Update the build-system requires list to
pin apache-tvm-ffi to the runtime version by changing the unpinned entry in the
requires = [...] array to "apache-tvm-ffi==0.1.0b4" so the build-time dependency
matches the runtime pin; locate the requires list in pyproject.toml and replace
the plain "apache-tvm-ffi" token with the exact version string.

```yaml
cmake_extra_args:
  - "-DMLLM_CROSS_COMPILE=ON"
  - "-DMLLM_BUILD_ARM_BACKEND=ON"
  - "-DMLLM_BUILD_ASCEND_BACKEND=ON"
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for MLLM_BUILD_ASCEND usage in CMake files
echo "=== MLLM_BUILD_ASCEND in CMake files ==="
rg "MLLM_BUILD_ASCEND" --type cmake -C 3

echo -e "\n=== Ascend backend directory structure ==="
fd -t d "ascend" src/

echo -e "\n=== Android references in Ascend backend ==="
if [ -d "src/backends/ascend" ]; then
  rg -i "android" src/backends/ascend/ -C 2 || echo "No Android references found in Ascend backend"
else
  echo "Ascend backend directory not found at expected location"
fi

echo -e "\n=== Other build task files ==="
fd -e yaml . tasks/ --type f | head -20

Repository: UbiquitousLearning/mllm

Length of output: 2134


🏁 Script executed:

#!/bin/bash
# Check build_android.yaml for context
echo "=== Current build_android.yaml content ==="
cat -n tasks/build_android.yaml

echo -e "\n=== Look for cmake_toolchain_file in other build configs ==="
rg "cmake_toolchain_file" tasks/ -C 2

Repository: UbiquitousLearning/mllm

Length of output: 3656


🏁 Script executed:

#!/bin/bash
# Search for Ascend-related build configurations across the repo
echo "=== Ascend mentions in build files ==="
rg -i "ascend" tasks/ -C 2

echo -e "\n=== CMakeLists.txt main configuration ==="
if [ -f "CMakeLists.txt" ]; then
  echo "File exists, checking size..."
  wc -l CMakeLists.txt
fi

Repository: UbiquitousLearning/mllm

Length of output: 2914


🏁 Script executed:

#!/bin/bash
echo "=== build_arm_ascend.yaml full content ==="
cat -n tasks/build_arm_ascend.yaml

Repository: UbiquitousLearning/mllm

Length of output: 973


Add missing cmake_toolchain_file for Android NDK cross-compilation.

The cmake_toolchain_file is missing from this configuration, which is required for Android cross-compilation. All other Android build configurations include:

cmake_toolchain_file: "$ANDROID_NDK_PATH/build/cmake/android.toolchain.cmake"

Without it, the Android NDK toolchain won't be properly configured, breaking the build. Additionally, the same issue exists in tasks/build_arm_ascend.yaml.

🤖 Prompt for AI Agents
In `@tasks/build_android.yaml` at line 8, Add the missing cmake_toolchain_file key
to the Android build task so the NDK toolchain is used: insert
cmake_toolchain_file: "$ANDROID_NDK_PATH/build/cmake/android.toolchain.cmake"
into the Android task (the same place where cmake args like
"-DMLLM_BUILD_ASCEND_BACKEND=ON" are defined) and do the same change in the ARM
Ascend Android task; ensure the key is at the same indentation level as the
other top-level task properties so the YAML remains valid.

Comment on lines +60 to +109

```cpp
// Compute softmax for each slice along the axis
for (size_t outer = 0; outer < outer_size; ++outer) {
  for (size_t inner = 0; inner < inner_size; ++inner) {
    // Find max value for numerical stability
    float max_val = -std::numeric_limits<float>::infinity();
    for (size_t i = 0; i < axis_size; ++i) {
      size_t idx = outer * strides[pos_axis > 0 ? pos_axis - 1 : 0] * shape[pos_axis] +
                   i * (pos_axis < ndim - 1 ? strides[pos_axis] : 1) + inner;
      if (pos_axis == 0) {
        idx = i * strides[0] + inner;
      } else if (pos_axis == ndim - 1) {
        idx = outer * axis_size + i;
      } else {
        idx = outer * strides[pos_axis - 1] * shape[pos_axis] + i * strides[pos_axis] + inner;
      }
      float val = MLLM_FP16_TO_FP32(x_ptr[idx]);
      max_val = std::max(max_val, val);
    }

    // Compute exp(x - max) and sum
    float sum_exp = 0.0f;
    std::vector<float> exp_vals(axis_size);
    for (size_t i = 0; i < axis_size; ++i) {
      size_t idx = outer * strides[pos_axis > 0 ? pos_axis - 1 : 0] * shape[pos_axis] +
                   i * (pos_axis < ndim - 1 ? strides[pos_axis] : 1) + inner;
      if (pos_axis == 0) {
        idx = i * strides[0] + inner;
      } else if (pos_axis == ndim - 1) {
        idx = outer * axis_size + i;
      } else {
        idx = outer * strides[pos_axis - 1] * shape[pos_axis] + i * strides[pos_axis] + inner;
      }
      float val = MLLM_FP16_TO_FP32(x_ptr[idx]);
      exp_vals[i] = std::exp(val - max_val);
      sum_exp += exp_vals[i];
    }

    // Compute softmax and store result
    for (size_t i = 0; i < axis_size; ++i) {
      size_t idx = outer * strides[pos_axis > 0 ? pos_axis - 1 : 0] * shape[pos_axis] +
                   i * (pos_axis < ndim - 1 ? strides[pos_axis] : 1) + inner;
      if (pos_axis == 0) {
        idx = i * strides[0] + inner;
      } else if (pos_axis == ndim - 1) {
        idx = outer * axis_size + i;
      } else {
        idx = outer * strides[pos_axis - 1] * shape[pos_axis] + i * strides[pos_axis] + inner;
      }
      float result = exp_vals[i] / sum_exp;
      r_ptr[idx] = MLLM_FP32_TO_FP16(result);
```

⚠️ Potential issue | 🟠 Major

Fix index calculation for middle axes in the CPU reference.

For 0 < axis < ndim-1, the index formula multiplies by shape[pos_axis] twice, which can step past the buffer and corrupt the reference. Use outer * strides[pos_axis - 1] + i * strides[pos_axis] + inner (ideally via a small helper) in all three loops.

🛠️ Suggested fix with a helper indexer
           size_t inner_size = 1;
           for (int i = pos_axis + 1; i < ndim; ++i) {
             inner_size *= shape[i];
           }

+          auto idx_at = [&](size_t outer_idx, size_t axis_idx, size_t inner_idx) -> size_t {
+            if (pos_axis == 0) {
+              return axis_idx * strides[0] + inner_idx;
+            }
+            if (pos_axis == ndim - 1) {
+              return outer_idx * axis_size + axis_idx;
+            }
+            return outer_idx * strides[pos_axis - 1] + axis_idx * strides[pos_axis] + inner_idx;
+          };
+
           // Compute softmax for each slice along the axis
           for (size_t outer = 0; outer < outer_size; ++outer) {
             for (size_t inner = 0; inner < inner_size; ++inner) {
               // Find max value for numerical stability
               float max_val = -std::numeric_limits<float>::infinity();
               for (size_t i = 0; i < axis_size; ++i) {
-                size_t idx = outer * strides[pos_axis > 0 ? pos_axis - 1 : 0] * shape[pos_axis] +
-                             i * (pos_axis < ndim - 1 ? strides[pos_axis] : 1) + inner;
-                if (pos_axis == 0) {
-                  idx = i * strides[0] + inner;
-                } else if (pos_axis == ndim - 1) {
-                  idx = outer * axis_size + i;
-                } else {
-                  idx = outer * strides[pos_axis - 1] * shape[pos_axis] + i * strides[pos_axis] + inner;
-                }
+                size_t idx = idx_at(outer, i, inner);
                 float val = MLLM_FP16_TO_FP32(x_ptr[idx]);
                 max_val = std::max(max_val, val);
               }

               // Compute exp(x - max) and sum
               float sum_exp = 0.0f;
               std::vector<float> exp_vals(axis_size);
               for (size_t i = 0; i < axis_size; ++i) {
-                size_t idx = outer * strides[pos_axis > 0 ? pos_axis - 1 : 0] * shape[pos_axis] +
-                             i * (pos_axis < ndim - 1 ? strides[pos_axis] : 1) + inner;
-                if (pos_axis == 0) {
-                  idx = i * strides[0] + inner;
-                } else if (pos_axis == ndim - 1) {
-                  idx = outer * axis_size + i;
-                } else {
-                  idx = outer * strides[pos_axis - 1] * shape[pos_axis] + i * strides[pos_axis] + inner;
-                }
+                size_t idx = idx_at(outer, i, inner);
                 float val = MLLM_FP16_TO_FP32(x_ptr[idx]);
                 exp_vals[i] = std::exp(val - max_val);
                 sum_exp += exp_vals[i];
               }

               // Compute softmax and store result
               for (size_t i = 0; i < axis_size; ++i) {
-                size_t idx = outer * strides[pos_axis > 0 ? pos_axis - 1 : 0] * shape[pos_axis] +
-                             i * (pos_axis < ndim - 1 ? strides[pos_axis] : 1) + inner;
-                if (pos_axis == 0) {
-                  idx = i * strides[0] + inner;
-                } else if (pos_axis == ndim - 1) {
-                  idx = outer * axis_size + i;
-                } else {
-                  idx = outer * strides[pos_axis - 1] * shape[pos_axis] + i * strides[pos_axis] + inner;
-                }
+                size_t idx = idx_at(outer, i, inner);
                 float result = exp_vals[i] / sum_exp;
                 r_ptr[idx] = MLLM_FP32_TO_FP16(result);
               }
             }
           }
🤖 Prompt for AI Agents
In `@tests/ascend/AscendSoftmaxKernelTest.hpp` around lines 60 - 109, The index
computation for the middle-axis case (0 < pos_axis < ndim-1) in the three loops
erroneously multiplies by shape[pos_axis] and can step past the buffer; update
all index calculations used to read/write x_ptr and r_ptr so that the
middle-axis branch uses outer * strides[pos_axis - 1] + i * strides[pos_axis] +
inner (instead of the current outer * strides[pos_axis - 1] * shape[pos_axis] +
...), and factor this into a small helper (e.g., an indexer function used by the
max_val, exp_vals, and result loops) to avoid duplication and ensure consistency
across the loops that use pos_axis, strides, shape, outer_size, inner_size,
axis_size, MLLM_FP16_TO_FP32 and MLLM_FP32_TO_FP16.

@lywbarca lywbarca closed this Jan 30, 2026