feat(ascend): Implement Ascend backend ops (Linear, RMSNorm, SiLU) #615
lywbarca wants to merge 4 commits into UbiquitousLearning:main from
Conversation
📝 Walkthrough

Adds multiple Ascend backend operators (SiLU, Linear, RMSNorm, View, MatMul, Sub, Mul), ATB tensor helpers, device init/enumeration adjustments, nn::functional wrappers (linear, rms_norm), extensive FP16 Ascend kernel tests, and removes Qualcomm QNN AOT sources/FFI; registers new operator factories in the Ascend backend.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller as User / Frontend
    participant Op as AscendOp
    participant ATB as ATB Runtime
    participant MemMgr as AscendMemoryManager
    participant Stream as Device Stream
    Caller->>Op: forward(inputs, outputs)
    Op->>Op: validate dtypes/shapes, build params
    Op->>ATB: CreateOperation(params)
    ATB-->>Op: operation handle / status
    Op->>Op: fillAtbTensor(inputs/outputs)
    Op->>ATB: Setup(VariantPack)
    ATB-->>Op: workspaceSize
    Op->>MemMgr: Allocate(workspaceSize)
    MemMgr-->>Op: workspacePtr
    Op->>ATB: Execute(opHandle, VariantPack, workspacePtr, Stream)
    ATB->>Stream: schedule kernels
    Stream-->>Op: synchronize / complete
    Op->>MemMgr: Free(workspacePtr)
    Op->>ATB: DestroyOperation
    Op-->>Caller: outputs ready
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/ascend/KernelTest.cpp (1)
96-104: Remove trailing whitespace on blank lines.

The blank lines around context/backend initialization include trailing spaces; trim them to satisfy the formatting rules. As per coding guidelines: No line may end with trailing whitespace.

✂️ Trim blank lines

```diff
 testing::InitGoogleTest(&argc, argv);
-
+
 // Initialize context
 mllm::initializeContext();

 // Initialize Ascend backend
 mllm::initAscendBackend();
-
+
 auto ret = RUN_ALL_TESTS();
```
🤖 Fix all issues with AI agents
In `@mllm/backends/ascend/AscendCommon.cpp`:
- Around line 220-225: fillAtbTensor currently assumes FP16 because
fillAtbTensorDesc hardcodes FP16, so add an explicit dtype check at the top of
fillAtbTensor: call t.dtype() and compare it against the FP16 enum value used by
fillAtbTensorDesc (the same FP16 symbol), and if it does not match, fail fast
(e.g., log an error via your logger and return/throw) to avoid descriptor/data
mismatches; only proceed to call fillAtbTensorDesc, set atb_tensor.deviceData
and atb_tensor.dataSize when the dtype check passes.
In `@mllm/backends/ascend/ops/AscendLinearOp.cpp`:
- Around line 54-92: Before calling fillAtbTensor for inputs/weight/bias in the
AscendLinearOp dispatch, validate that the input tensors are FP16 and that bias
(if present) is 2D with shape [1, out_channels]; specifically check the input
dtype via the input Tensor's dtype field and error out (use MLLM_ERROR_EXIT or
return an error) if dtype != ACL_FLOAT16, and verify bias_ptr->shape.size()==2
and bias_ptr->shape[0]==1 and bias_ptr->shape[1]==weight_out_channels (derive
out_channels from weight_ptr shape) before calling fillAtbTensor for
atb_x/atb_weight/atb_bias so ATB never receives non-FP16 data or a 1D bias; keep
these checks adjacent to the existing fillAtbTensor calls and reference the
symbols fillAtbTensor, weight_ptr, bias_ptr, and MLLM_ERROR_EXIT.
In `@mllm/backends/ascend/ops/AscendMatMulOp.cpp`:
- Around line 30-38: Several declarations and assignments in AscendMatMulOp.cpp
(the const auto& A, const auto& B, auto& C lines and the LinearParam block where
linearParam.transposeA, transposeB, and linearParam.hasBias are set) have
trailing whitespace at line ends; remove all trailing spaces from these lines so
each line ends immediately after the last token (e.g., "const auto& A =
inputs[0];", "linearParam.hasBias = false;") to comply with the
no-trailing-whitespace guideline.
In `@mllm/backends/ascend/ops/AscendSiLUOp.cpp`:
- Around line 30-56: AscendSiLUOp currently only checks x/y dtype equality but
not that they are FP16; before calling fillAtbTensorDesc on x and y in
AscendSiLUOp (where fillAtbTensorDesc hardcodes ACL_FLOAT16), add an explicit
dtype validation that x.dtype() (and therefore y.dtype()) equals the FP16 enum
used in this codebase (e.g., kFloat16/ACL_FLOAT16) and return/NYI or
MLLM_ERROR_EXIT with a clear message if not; reference checks should be placed
just prior to the fillAtbTensorDesc(x, atb_x.desc) / fillAtbTensorDesc(y,
atb_y.desc) calls.
In `@tests/ascend/AscendRMSNormKernelTest.hpp`:
- Around line 20-26: Add a validation in RMSNormFloat16Test to ensure norm_size
matches the last dimension of input_shape before proceeding: check that
input_shape is non-empty and that norm_size == input_shape.back(), and if not,
fail the test early (e.g., return false or assert) with a clear message so the
subsequent reference computation using input_shape and norm_size cannot access
out-of-bounds memory; place this check at the start of the loop over test_cases
(reference symbols: RMSNormFloat16Test, input_shape, norm_size).
🧹 Nitpick comments (12)
mllm/backends/ascend/AscendCommon.hpp (1)
44-46: Clarify `fillAtbTensor` expectations in the header comment.

This helper is public and reused across ops; please document expected device (Ascend), dtype (currently FP16 via the descriptor helper), and any contiguity requirements so callers don't pass incompatible tensors.
As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

mllm/nn/Functional.hpp (1)
20-24: Add brief doc comments for the new functional APIs.

These are public entry points; please document expected shapes, optional bias semantics, epsilon meaning, and `add_unit_offset` behavior.
As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

Also applies to: 136-137
mllm/backends/ascend/ops/AscendViewOp.hpp (1)
11-23: Add brief class/method docs for AscendViewOp.

A short comment describing that this op is metadata-only (view/reshape) and the expected input/output semantics would help callers.

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

mllm/backends/ascend/ops/AscendRMSNormOp.cpp (2)
28-28: Remove or uncomment the commented-out assertion.

Commented-out code should either be removed or properly enabled. If the assertion is needed for input validation, uncomment it; otherwise, remove the line to keep the codebase clean.

♻️ Proposed fix

```diff
-  //MLLM_RT_ASSERT(inputs.size() == 1 || inputs.size() == 2, "AscendRMSNormOp expects 1 or 2 inputs");
+  MLLM_RT_ASSERT(inputs.size() == 1 || inputs.size() == 2, "AscendRMSNormOp expects 1 or 2 inputs");
```
35-35: Redundant variable `weight_for_atb`.

This variable is just a const reference alias to `weight` and adds no value. Consider using `weight` directly in `fillAtbTensor`.

♻️ Proposed fix

```diff
-  const Tensor& weight_for_atb = weight;
 ...
-  fillAtbTensor(weight_for_atb, atb_weight);
+  fillAtbTensor(weight, atb_weight);
```

mllm/backends/ascend/ops/AscendMatMulOp.hpp (1)
20-25: Consider adding `final` to `AscendMatMulOpFactory` for consistency.

`AscendLinearOpFactory` is marked `final` (line 21 in AscendLinearOp.hpp), but `AscendMatMulOpFactory` is not. For consistency across the Ascend operator factories, consider adding `final`.

♻️ Proposed fix

```diff
-class AscendMatMulOpFactory : public TypedOpFactory<OpTypes::kMatMul, aops::MatMulOpOptions> {
+class AscendMatMulOpFactory final : public TypedOpFactory<OpTypes::kMatMul, aops::MatMulOpOptions> {
```

tests/ascend/AscendRMSNormKernelTest.hpp (1)
74-77: Consider logging which test case failed for easier debugging.

When the comparison fails, the test returns `false` without indicating which test case or what the mismatch was. Adding diagnostic output would help identify failures during test runs.

♻️ Proposed fix

```diff
 auto result = mllm::test::allClose(y_cpu, ref_cpu, 1e-2f, 1e-2f);
 if (!result.is_close) {
+  std::cerr << "RMSNormFloat16Test failed for shape with norm_size=" << norm_size
+            << ", epsilon=" << epsilon << std::endl;
   return false;
 }
```
73-83: LinearFloat16Test currently validates MatMul, not Linear.

This path uses nn::functional::matmul with weight [in, out], so the no-bias AscendLinearOp path isn't exercised. Consider adding a dedicated no-bias nn::functional::linear test (weight [out, in]) or renaming this test to MatMul to reflect coverage.

As per coding guidelines: Suggest adding unit tests for untested complex logic or edge cases.
tests/ascend/AscendSiLUKernelTest.hpp (3)
19-20: Add a brief docstring for the public method.

Per coding guidelines, public methods should have comments explaining purpose, parameters, and return value.

📝 Suggested documentation

```diff
+/**
+ * @brief Tests SiLU (Sigmoid Linear Unit) operation on Ascend backend with FP16 precision.
+ * @param shapes Vector of tensor shapes to test.
+ * @return true if all shapes pass the allClose comparison, false otherwise.
+ */
 bool SiLUFloat16Test(const std::vector<mllm::Tensor::shape_t>& shapes) {
```
22-22: Use `const auto&` for the loop variable.

Since `shapes` is a const reference and `shape` is not modified, use `const auto&` for correctness and clarity.

♻️ Proposed fix

```diff
-  for (auto& shape : shapes) {
+  for (const auto& shape : shapes) {
```
60-63: Consider logging diagnostic information on test failure.

When the test fails, it returns `false` without any diagnostic output. Adding a log message with the failing shape and mismatch details would aid debugging.

📝 Suggested improvement

```diff
 auto result = mllm::test::allClose(y_cpu, ref_cpu, 1e-2f, 1e-2f);
 if (!result.is_close) {
+  // Consider logging: shape, max_diff, index of max_diff, etc.
   return false;
 }
```

mllm/backends/ascend/ops/AscendSiLUOp.hpp (1)
12-18: Add class documentation.

Per coding guidelines, public classes should have comments explaining their purpose.

📝 Suggested documentation

```diff
+/**
+ * @brief Ascend backend implementation of the SiLU (Sigmoid Linear Unit) operator.
+ *
+ * Computes SiLU(x) = x * sigmoid(x) using ATB operations on Ascend NPU.
+ */
 class AscendSiLUOp final : public aops::SiLUOp {
```
```cpp
bool RMSNormFloat16Test(const std::vector<std::tuple<mllm::Tensor::shape_t, int, float>>& test_cases) {
  using namespace mllm;  // NOLINT
  for (auto& test_case : test_cases) {
    auto input_shape = std::get<0>(test_case);
    int norm_size = std::get<1>(test_case);
    float epsilon = std::get<2>(test_case);
```
Add validation that norm_size matches the last dimension of input_shape.
The test assumes norm_size equals input_shape.back(), but there's no explicit validation. If a test case is misconfigured with mismatched values, this could cause out-of-bounds access in the reference computation loop.
🛡️ Proposed fix

```diff
 int norm_size = std::get<1>(test_case);
 float epsilon = std::get<2>(test_case);
+
+// Validate that norm_size matches the last dimension
+MLLM_RT_ASSERT(input_shape.back() == static_cast<size_t>(norm_size),
+               "norm_size must match the last dimension of input_shape");
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
bool RMSNormFloat16Test(const std::vector<std::tuple<mllm::Tensor::shape_t, int, float>>& test_cases) {
  using namespace mllm;  // NOLINT
  for (auto& test_case : test_cases) {
    auto input_shape = std::get<0>(test_case);
    int norm_size = std::get<1>(test_case);
    float epsilon = std::get<2>(test_case);

    // Validate that norm_size matches the last dimension
    MLLM_RT_ASSERT(input_shape.back() == static_cast<size_t>(norm_size),
                   "norm_size must match the last dimension of input_shape");
```
🤖 Prompt for AI Agents
In `@tests/ascend/AscendRMSNormKernelTest.hpp` around lines 20 - 26, Add a
validation in RMSNormFloat16Test to ensure norm_size matches the last dimension
of input_shape before proceeding: check that input_shape is non-empty and that
norm_size == input_shape.back(), and if not, fail the test early (e.g., return
false or assert) with a clear message so the subsequent reference computation
using input_shape and norm_size cannot access out-of-bounds memory; place this
check at the start of the loop over test_cases (reference symbols:
RMSNormFloat16Test, input_shape, norm_size).
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@mllm/backends/ascend/ops/AscendLinearOp.cpp`:
- Around line 101-105: The code leaks resources when erroring after creating the
ATB operation or allocating workspace: ensure the created op and any allocated
workspace are freed before calling MLLM_ERROR_EXIT. Add a small cleanup helper
(e.g., cleanupAndExit) or a scope guard that, given the local op and workspace
variables, frees the workspace and destroys the operation (call
atb::DestroyOperation(op) or the appropriate atb destroy API) and then performs
the error exit; invoke this helper on failures in Setup, Execute, and any
post-allocation error paths (referencing op, atb::CreateOperation, Setup,
Execute, workspace, and MLLM_ERROR_EXIT).
- Around line 134-146: The code casts workspaceSize (uint64_t) to uint32_t when
calling mem_mgr.allocateBlock, which can truncate sizes >= 4GB; update
AscendLinearOp.cpp to validate workspaceSize before casting: after
op->Setup(...) and before mem_mgr.allocateBlock, check if workspaceSize == 0
(keep current path) or if workspaceSize > static_cast<uint64_t>(UINT32_MAX) then
fail fast (log/Mllm error exit) or use a mem_mgr API that accepts 64-bit sizes;
ensure allocateBlock/getBlockPtr are called with a safe uint32_t size only after
this check and reference the existing symbols workspaceSize,
mem_mgr.allocateBlock, mem_mgr.getBlockPtr, and the error path used after
op->Setup.
🧹 Nitpick comments (3)
tests/ascend/AscendRMSNormKernelTest.hpp (1)
79-81: Consider adding diagnostic logging for test progress.

Unlike `LinearFloat16Test`, which logs the test shape and PASSED/FAILED status for each case, this test returns silently. Adding similar logging would improve debuggability and consistency across the test suite.

♻️ Suggested improvement

```diff
+  std::cout << "[RMSNormTest] Testing shape=[";
+  for (size_t i = 0; i < input_shape.size(); ++i) {
+    std::cout << input_shape[i] << (i < input_shape.size() - 1 ? ", " : "");
+  }
+  std::cout << "], norm_size=" << norm_size << ", epsilon=" << epsilon << std::endl;
+
   // 3. Move inputs to Ascend and run RMSNorm
   auto x_ascend = x_cpu.to(kAscend);
   auto weight_ascend = weight_cpu.to(kAscend);

   auto result = mllm::test::allClose(y_cpu, ref_cpu, 1e-2f, 1e-2f);
   if (!result.is_close) {
+    std::cout << "[RMSNormTest] FAILED!" << std::endl;
     return false;
   }
+  std::cout << "[RMSNormTest] PASSED" << std::endl;
 }
```

mllm/backends/ascend/ops/AscendSiLUOp.cpp (1)
26-113: Prefer `fillAtbTensor` to avoid duplicated wiring.

You already added a helper that fills desc + payload; using it here reduces drift and keeps dtype checks centralized.

♻️ Proposed refactor

```diff
 atb::Tensor atb_x;
 atb::Tensor atb_y;
-fillAtbTensorDesc(x, atb_x.desc);
-fillAtbTensorDesc(y, atb_y.desc);
-
-atb_x.deviceData = reinterpret_cast<uint8_t*>(x.ptr<void>());
-atb_x.dataSize = x.bytes();
-atb_y.deviceData = reinterpret_cast<uint8_t*>(y.ptr<void>());
-atb_y.dataSize = y.bytes();
+fillAtbTensor(x, atb_x);
+fillAtbTensor(y, atb_y);
```

mllm/backends/ascend/AscendCommon.cpp (1)
234-290: Preserve device state during enumeration to avoid side effects.

The constructor leaves the last device active after the enumeration loop. If `AscendDeviceMetaInfo::instance()` is called after user code has selected a device, this will unexpectedly change the active device. Save and restore the current device before and after the loop.

The concern about `aclInit` is already addressed by the singleton pattern; it is only called once on first access.

```diff
+// Preserve current device to avoid side effects
+int32_t prev_device = -1;
+if (aclrtGetDevice(&prev_device) != ACL_SUCCESS) {
+  prev_device = -1;
+}
 // Collect info for each device
 for (uint32_t i = 0; i < device_count; ++i) {
   AscendDeviceInfo info;
   info.id = i;
   info.name = "Ascend Device " + std::to_string(i);
   // Set device to query its properties
   ret = aclrtSetDevice(i);
   if (ret == ACL_SUCCESS) {
     // Get memory information
     size_t free_mem = 0, total_mem = 0;
     ret = aclrtGetMemInfo(ACL_HBM_MEM, &free_mem, &total_mem);
     if (ret == ACL_SUCCESS) {
       info.total_memory = total_mem;
       info.free_memory = free_mem;
     } else {
       info.total_memory = 0;
       info.free_memory = 0;
     }
     // SOC version - platform specific, set to unknown for now
     info.soc_version = "Unknown";
   } else {
     info.total_memory = 0;
     info.free_memory = 0;
     info.soc_version = "Unknown";
   }
   devices.push_back(info);
 }
+if (prev_device >= 0) {
+  (void)aclrtSetDevice(prev_device);
+}
```
```cpp
atb::Operation* op = nullptr;
auto st = atb::CreateOperation(linearParam, &op);
if (st != atb::NO_ERROR || op == nullptr) {
  MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB CreateOperation(Linear) failed, status={}", static_cast<int>(st));
}
```
Resource leak on error paths.
If Setup or Execute fails, op is not destroyed before MLLM_ERROR_EXIT. Similarly, if Execute fails after workspace allocation, the workspace is not freed. While MLLM_ERROR_EXIT may terminate the process, proper cleanup ensures robustness if error handling changes in the future.
Suggested approach using scope guard or manual cleanup

```diff
 atb::Operation* op = nullptr;
 auto st = atb::CreateOperation(linearParam, &op);
 if (st != atb::NO_ERROR || op == nullptr) {
   MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB CreateOperation(Linear) failed, status={}", static_cast<int>(st));
 }
+
+// Helper lambda for cleanup before error exit
+auto cleanupAndExit = [&](const char* msg, int status) {
+  if (workspace_block_id != -1) {
+    getAscendMemoryManager().freeBlock(workspace_block_id);
+  }
+  atb::DestroyOperation(op);
+  MLLM_ERROR_EXIT(ExitCode::kAscendError, msg, status);
+};
```

Then replace the error exits after operation creation with calls to cleanupAndExit.
Also applies to: 136-138, 153-155
🤖 Prompt for AI Agents
In `@mllm/backends/ascend/ops/AscendLinearOp.cpp` around lines 101 - 105, The code
leaks resources when erroring after creating the ATB operation or allocating
workspace: ensure the created op and any allocated workspace are freed before
calling MLLM_ERROR_EXIT. Add a small cleanup helper (e.g., cleanupAndExit) or a
scope guard that, given the local op and workspace variables, frees the
workspace and destroys the operation (call atb::DestroyOperation(op) or the
appropriate atb destroy API) and then performs the error exit; invoke this
helper on failures in Setup, Execute, and any post-allocation error paths
(referencing op, atb::CreateOperation, Setup, Execute, workspace, and
MLLM_ERROR_EXIT).
```cpp
uint64_t workspaceSize = 0;
st = op->Setup(vp, workspaceSize, atb_ctx);
if (st != atb::NO_ERROR) {
  MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB LinearOp Setup failed, status={}", static_cast<int>(st));
}

void* workspace = nullptr;
int workspace_block_id = -1;
if (workspaceSize > 0) {
  auto& mem_mgr = getAscendMemoryManager();
  mem_mgr.allocateBlock(static_cast<uint32_t>(workspaceSize), workspace_block_id);
  mem_mgr.getBlockPtr(workspace_block_id, workspace);
}
```
Potential integer overflow when casting workspace size.
workspaceSize is uint64_t but is cast to uint32_t when calling allocateBlock. If the workspace exceeds 4GB, this will silently truncate and cause memory corruption or crashes.
Suggested fix

```diff
 if (workspaceSize > 0) {
+  if (workspaceSize > static_cast<uint64_t>(UINT32_MAX)) {
+    atb::DestroyOperation(op);
+    MLLM_ERROR_EXIT(ExitCode::kAscendError,
+                    "ATB LinearOp workspace size {} exceeds uint32_t max", workspaceSize);
+  }
   auto& mem_mgr = getAscendMemoryManager();
   mem_mgr.allocateBlock(static_cast<uint32_t>(workspaceSize), workspace_block_id);
   mem_mgr.getBlockPtr(workspace_block_id, workspace);
 }
```
}🤖 Prompt for AI Agents
In `@mllm/backends/ascend/ops/AscendLinearOp.cpp` around lines 134 - 146, The code
casts workspaceSize (uint64_t) to uint32_t when calling mem_mgr.allocateBlock,
which can truncate sizes >= 4GB; update AscendLinearOp.cpp to validate
workspaceSize before casting: after op->Setup(...) and before
mem_mgr.allocateBlock, check if workspaceSize == 0 (keep current path) or if
workspaceSize > static_cast<uint64_t>(UINT32_MAX) then fail fast (log/Mllm error
exit) or use a mem_mgr API that accepts 64-bit sizes; ensure
allocateBlock/getBlockPtr are called with a safe uint32_t size only after this
check and reference the existing symbols workspaceSize, mem_mgr.allocateBlock,
mem_mgr.getBlockPtr, and the error path used after op->Setup.
```cpp
Tensor silu_(const Tensor& x);

Tensor rms_norm(const Tensor& x, const Tensor& weight, float epsilon = 1e-5f, bool add_unit_offset = false);
```

```cpp
regOpFactory<AscendSiLUOpFactory>();
regOpFactory<AscendLinearOpFactory>();
regOpFactory<AscendRMSNormOpFactory>();
regOpFactory<AscendViewOpFactory>();
```
regOpFactory<AscendSiLUOpFactory, AscendLinearOpFactory, AscendRMSNormOpFactory>
mllm/nn/Functional.cpp (Outdated)
```cpp
  return Context::instance().buildOpAndSubmitTask(OpTypes::kSiLU, opt, {x})[0];
}

Tensor rms_norm(const Tensor& x, const Tensor& weight, float epsilon, bool add_unit_offset) {
```
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tasks/build_android.yaml (1)
2-15: ⚠️ Potential issue | 🟠 Major

Add `cmake_toolchain_file` to match other Android build configurations.

The `cmake_toolchain_file` is missing from this configuration, while all other Android build configurations (build_android_debug.yaml, build_android_opencl.yaml, build_android_qnn.yaml, build_sdk_android.yaml, etc.) include `cmake_toolchain_file: "$ANDROID_NDK_PATH/build/cmake/android.toolchain.cmake"`. Without the NDK toolchain file, CMake may not correctly configure the Android cross-compilation environment.

Add the missing line:

```yaml
cmake_toolchain_file: "$ANDROID_NDK_PATH/build/cmake/android.toolchain.cmake"
```

pymllm/nn/_layers.py (1)
8-46: ⚠️ Potential issue | 🔴 Critical

Fix base-class rename to avoid NameError and class shadowing.

Line 8 renames `_Layer` to `Linear`, but Line 46 still defines `class Linear(_Layer)`. This makes `_Layer` undefined at import time and overwrites the first `Linear` definition. Restore the base-class name (or update all subclasses consistently).

🛠️ Proposed fix

```diff
-class Linear:
+class _Layer:
```
🤖 Fix all issues with AI agents
In `@examples/ascend_add_demo/build_and_run.sh`:
- Around line 52-69: The script currently relies on set -e and then checks $?
after running "make ascend_add_demo -j$(nproc)", which makes the else branch
unreachable; change the build invocation to use the conditional form "if make
ascend_add_demo -j\"$(nproc)\"; then ... else ..." so the success and failure
branches run correctly, and move the demo execution and its own conditional
similarly into the then/else structure (i.e., replace the standalone make line
plus the subsequent "if [ $? -eq 0 ]" block with an if wrapper around the make
command and quote "$(nproc)"); do the same for running
"./examples/ascend_add_demo/ascend_add_demo" so failure prints the custom error
and exits.
In `@examples/ascend_add_demo/README.md`:
- Around line 16-18: Replace the hard-coded absolute path in the README code
example with a portable placeholder or relative path so demos are reproducible;
specifically update the code block invoking the example to use either
"<repo_root>/examples/ascend_add_demo" or a relative path like
"./examples/ascend_add_demo" and keep the same ./build_and_run.sh invocation
(reference: README.md code block showing cd
/home/HwHiAiUser/mLLM/examples/ascend_add_demo and ./build_and_run.sh).
- Around line 54-74: The markdown fenced output block in the README lacks a
language identifier causing markdownlint MD040; update the triple-backtick fence
that wraps the demo output (the block beginning with "=== Ascend Add Op Demo
===") to include a language specifier (e.g., change the opening fence from ```
to ```text) so the code fence is ```text ... ``` and linting passes; no other
content changes needed.
In `@mllm/backends/ascend/ops/AscendElewiseOps.cpp`:
- Around line 88-94: The file contains blank lines with trailing whitespace near
the ATB AddOp error handling and subsequent code (around the MLLM_ERROR_EXIT
call, syncGlobalAtbStream(), and the workspace_block_id check); remove the
trailing spaces on those blank lines so no line ends with whitespace (trim the
whitespace after the blank lines around MLLM_ERROR_EXIT, syncGlobalAtbStream,
and the block handling around workspace_block_id).
In `@mllm/backends/ascend/ops/AscendElewiseOps.hpp`:
- Around line 27-55: Add Doxygen-style API docs for the public classes and
methods: document AscendSubOp and AscendMulOp (purpose, when to use), their
constructors (parameters: const aops::SubOpOptions& / aops::MulOpOptions), and
the overridden methods setup(const std::vector<Tensor>& inputs,
std::vector<Tensor>& outputs) and forward(const std::vector<Tensor>& inputs,
std::vector<Tensor>& outputs) (describe expected inputs, outputs, side effects,
error conditions). Also document the factory classes AscendSubOpFactory and
AscendMulOpFactory (what they produce and when to use createOpImpl). Keep
comments brief, follow Doxygen tags (`@brief`, `@param`, `@return`, `@throws` if
applicable) and place them above each class and the setup/forward method
declarations.
In `@mllm/backends/ascend/ops/AscendMatMulOp.cpp`:
- Around line 60-74: When handling the 4D-A reshape path in AscendMatMulOp.cpp
(where is_4d is true and atb_a_shape/atb_b_shape/atb_c_shape are built),
validate that b_shape and c_shape are also 4D before indexing [0..3]; add
explicit checks (e.g., assert or throw) that b_shape.size() == 4 and
c_shape.size() == 4 immediately inside the is_4d branch before computing
batch_heads_b/batch_heads_c and accessing b_shape[2..3]/c_shape[2..3], and
return or raise a clear error if they are not 4D to prevent out-of-bounds
access.
In `@mllm/backends/ascend/ops/AscendSoftmaxOp.cpp`:
- Around line 55-63: The code adjusts options_.axis to a positive index but
never validates it; before pushing to softmaxParam.axes (and before creating the
ATB op), compute int rank = static_cast<int>(x.rank()); verify that
options_.axis is within [-rank, rank-1], convert negative axis via axis += rank,
then assert or return an error if the resulting axis is not in [0, rank-1]; only
after this valid check call
softmaxParam.axes.push_back(static_cast<int64_t>(axis)) so the ATB op cannot be
created with an out-of-range axis.
In `@pyproject.toml`:
- Around line 2-4: Update the build-system requires list to pin apache-tvm-ffi
to the runtime version by changing the unpinned entry in the requires = [...]
array to "apache-tvm-ffi==0.1.0b4" so the build-time dependency matches the
runtime pin; locate the requires list in pyproject.toml and replace the plain
"apache-tvm-ffi" token with the exact version string.
In `@tasks/build_android.yaml`:
- Line 8: Add the missing cmake_toolchain_file key to the Android build task so
the NDK toolchain is used: insert cmake_toolchain_file:
"$ANDROID_NDK_PATH/build/cmake/android.toolchain.cmake" into the Android task
(the same place where cmake args like "-DMLLM_BUILD_ASCEND_BACKEND=ON" are
defined) and do the same change in the ARM Ascend Android task; ensure the key
is at the same indentation level as the other top-level task properties so the
YAML remains valid.
In `@tests/ascend/AscendSoftmaxKernelTest.hpp`:
- Around line 60-109: The index computation for the middle-axis case (0 <
pos_axis < ndim-1) in the three loops erroneously multiplies by shape[pos_axis]
and can step past the buffer; update all index calculations used to read/write
x_ptr and r_ptr so that the middle-axis branch uses outer * strides[pos_axis -
1] + i * strides[pos_axis] + inner (instead of the current outer *
strides[pos_axis - 1] * shape[pos_axis] + ...), and factor this into a small
helper (e.g., an indexer function used by the max_val, exp_vals, and result
loops) to avoid duplication and ensure consistency across the loops that use
pos_axis, strides, shape, outer_size, inner_size, axis_size, MLLM_FP16_TO_FP32
and MLLM_FP32_TO_FP16.
🧹 Nitpick comments (1)
mllm/nn/Functional.hpp (1)
23-24: Add brief API docs for the new functional ops.

These are public API declarations; a short Doxygen-style comment clarifying parameters/returns (bias optional, epsilon/add_unit_offset semantics) will make the header clearer.

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

✍️ Example doc additions

```diff
+// Linear: y = x * weight^T + bias (optional).
+// bias can be an empty Tensor().
 Tensor linear(const Tensor& x, const Tensor& weight, const Tensor& bias = Tensor());
+// RMSNorm: y = x * weight / sqrt(mean(x^2) + epsilon).
+// add_unit_offset adds 1.0 to weight before scaling when true.
 Tensor rmsNorm(const Tensor& x, const Tensor& weight, float epsilon = 1e-5f, bool add_unit_offset = false);
```

Also applies to: 136-137
```sh
# 编译
echo -e "\n${YELLOW}开始编译...${NC}"
make ascend_add_demo -j$(nproc)

# 检查编译结果
if [ $? -eq 0 ]; then
    echo -e "\n${GREEN}✓ 编译成功!${NC}\n"

    # 运行
    echo -e "${YELLOW}运行 demo...${NC}\n"
    ./examples/ascend_add_demo/ascend_add_demo

    if [ $? -eq 0 ]; then
        echo -e "\n${GREEN}✓ Demo 运行成功!${NC}"
    else
        echo -e "\n${RED}✗ Demo 运行失败${NC}"
        exit 1
    fi
```
🧩 Analysis chain
🏁 Script executed:

```sh
# First, locate and read the entire script to check for set -e
cat examples/ascend_add_demo/build_and_run.sh
```

Repository: UbiquitousLearning/mllm
Length of output: 1667
Fix error handling: custom failure messages won't execute with set -e.
With set -e, standalone commands like make ascend_add_demo -j$(nproc) cause the script to exit immediately on failure, before the if [ $? -eq 0 ] check runs. The custom error messages in the else blocks are therefore unreachable. Use if cmd; then ... else ... fi to preserve error handling (also quote $(nproc) to avoid word-splitting issues).
🔧 Suggested update
```diff
 # 编译
 echo -e "\n${YELLOW}开始编译...${NC}"
-make ascend_add_demo -j$(nproc)
-
-# 检查编译结果
-if [ $? -eq 0 ]; then
+if make ascend_add_demo -j"$(nproc)"; then
     echo -e "\n${GREEN}✓ 编译成功!${NC}\n"
     # 运行
     echo -e "${YELLOW}运行 demo...${NC}\n"
-    ./examples/ascend_add_demo/ascend_add_demo
-
-    if [ $? -eq 0 ]; then
+    if ./examples/ascend_add_demo/ascend_add_demo; then
         echo -e "\n${GREEN}✓ Demo 运行成功!${NC}"
     else
         echo -e "\n${RED}✗ Demo 运行失败${NC}"
         exit 1
     fi
+else
+    echo -e "\n${RED}✗ 编译失败${NC}"
+    exit 1
 fi
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```sh
# 编译
echo -e "\n${YELLOW}开始编译...${NC}"
if make ascend_add_demo -j"$(nproc)"; then
    echo -e "\n${GREEN}✓ 编译成功!${NC}\n"
    # 运行
    echo -e "${YELLOW}运行 demo...${NC}\n"
    if ./examples/ascend_add_demo/ascend_add_demo; then
        echo -e "\n${GREEN}✓ Demo 运行成功!${NC}"
    else
        echo -e "\n${RED}✗ Demo 运行失败${NC}"
        exit 1
    fi
else
    echo -e "\n${RED}✗ 编译失败${NC}"
    exit 1
fi
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 54-54: Quote this to prevent word splitting.
(SC2046)
🤖 Prompt for AI Agents
In `@examples/ascend_add_demo/build_and_run.sh` around lines 52 - 69, The script
currently relies on set -e and then checks $? after running "make
ascend_add_demo -j$(nproc)", which makes the else branch unreachable; change the
build invocation to use the conditional form "if make ascend_add_demo
-j\"$(nproc)\"; then ... else ..." so the success and failure branches run
correctly, and move the demo execution and its own conditional similarly into
the then/else structure (i.e., replace the standalone make line plus the
subsequent "if [ $? -eq 0 ]" block with an if wrapper around the make command
and quote "$(nproc)"); do the same for running
"./examples/ascend_add_demo/ascend_add_demo" so failure prints the custom error
and exits.
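The unreachable-else behavior described above is easy to reproduce in isolation. This small Python sketch (illustrative, not part of the repo) drives `sh -c` with and without the `if`-wrapper under `set -e`:

```python
import subprocess

# Under `set -e`, a command tested by `if` does not trigger errexit,
# so the else branch runs and the script keeps going.
wrapped = "set -e\nif false; then echo built; else echo build failed; fi\necho still running\n"
r1 = subprocess.run(["sh", "-c", wrapped], capture_output=True, text=True)
print(r1.stdout)        # "build failed\nstill running\n"

# A bare failing command aborts immediately: any later `$?` check never runs.
bare = "set -e\nfalse\necho never reached\n"
r2 = subprocess.run(["sh", "-c", bare], capture_output=True, text=True)
print(r2.returncode)    # non-zero, and stdout stays empty
```

This is why the review moves the command into the `if` condition rather than checking `$?` afterward: the conditional context suppresses errexit for that one command.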
```bash
cd /home/HwHiAiUser/mLLM/examples/ascend_add_demo
./build_and_run.sh
```
Avoid hard-coded absolute paths in the demo instructions.
Using an environment-specific path makes the steps non-portable. Consider a placeholder such as <repo_root>/examples/ascend_add_demo.
✏️ Suggested edit
```diff
-cd /home/HwHiAiUser/mLLM/examples/ascend_add_demo
+cd <repo_root>/examples/ascend_add_demo
```
📝 Committable suggestion
```bash
cd <repo_root>/examples/ascend_add_demo
./build_and_run.sh
```
🤖 Prompt for AI Agents
In `@examples/ascend_add_demo/README.md` around lines 16 - 18, Replace the
hard-coded absolute path in the README code example with a portable placeholder
or relative path so demos are reproducible; specifically update the code block
invoking the example to use either "<repo_root>/examples/ascend_add_demo" or a
relative path like "./examples/ascend_add_demo" and keep the same
./build_and_run.sh invocation (reference: README.md code block showing cd
/home/HwHiAiUser/mLLM/examples/ascend_add_demo and ./build_and_run.sh).
```
=== Ascend Add Op Demo ===
1. Initializing Ascend backend...
✓ Ascend backend initialized

2. Creating input tensors...
   Input x shape: [2, 3]
   Input y shape: [2, 3]

3. Transferring tensors to Ascend device...
✓ Tensors transferred to Ascend

4. Executing Add operation on Ascend...
✓ Add operation completed

5. Transferring result back to CPU and verifying...
   Expected result: [11, 22, 33, 44, 55, 66]
   Actual result:   [11, 22, 33, 44, 55, 66]

✓ Test PASSED! All values match expected results.
```
Specify a language for the expected-output code fence.
markdownlint MD040 requires a language; text is fine for output.
✅ Suggested fix
````diff
-```
+```text
 === Ascend Add Op Demo ===
 ...
-```
+```
````
+```🧰 Tools
🪛 markdownlint-cli2 (0.20.0)
[warning] 54-54: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In `@examples/ascend_add_demo/README.md` around lines 54 - 74, The markdown fenced
output block in the README lacks a language identifier causing markdownlint
MD040; update the triple-backtick fence that wraps the demo output (the block
beginning with "=== Ascend Add Op Demo ===") to include a language specifier
(e.g., change the opening fence from ``` to ```text) so the code fence is
```text ... ``` and linting passes; no other content changes needed.
```cpp
    MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB AddOp Execute failed, status={}", static_cast<int>(st));
  }

  syncGlobalAtbStream();

  if (workspace_block_id != -1) {
```
Remove trailing whitespace (policy violation).
There are blank lines with trailing spaces that violate the repository’s no‑trailing‑whitespace rule.
✂️ Trim trailing spaces
```diff
-
-  syncGlobalAtbStream();
+  syncGlobalAtbStream();
@@
-
-  syncGlobalAtbStream();
+  syncGlobalAtbStream();
```
Also applies to: 176-181
🤖 Prompt for AI Agents
In `@mllm/backends/ascend/ops/AscendElewiseOps.cpp` around lines 88 - 94, The file
contains blank lines with trailing whitespace near the ATB AddOp error handling
and subsequent code (around the MLLM_ERROR_EXIT call, syncGlobalAtbStream(), and
the workspace_block_id check); remove the trailing spaces on those blank lines
so no line ends with whitespace (trim the whitespace after the blank lines
around MLLM_ERROR_EXIT, syncGlobalAtbStream, and the block handling around
workspace_block_id).
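As an aside, this class of violation is mechanically detectable. A generic regex sketch (not a project tool; the sample lines are hypothetical) that flags and trims trailing whitespace:

```python
import re

# Hypothetical file contents: lines 2 and 3 end in whitespace.
lines = ["  }", "   ", "  syncGlobalAtbStream();  ", "  if (workspace_block_id != -1) {"]

# 1-based line numbers of lines ending in spaces or tabs.
offenders = [i + 1 for i, ln in enumerate(lines) if re.search(r"[ \t]+$", ln)]
print(offenders)        # [2, 3]

# Trimming is a one-liner per line.
cleaned = [re.sub(r"[ \t]+$", "", ln) for ln in lines]
assert not any(re.search(r"[ \t]+$", ln) for ln in cleaned)
```

Wiring the same pattern into a pre-commit hook would keep this policy violation out of future diffs.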
```cpp
class AscendSubOp final : public aops::SubOp {
 public:
  explicit AscendSubOp(const aops::SubOpOptions& options);

  void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
  void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
};

class AscendSubOpFactory final : public TypedOpFactory<OpTypes::kSub, aops::SubOpOptions> {
 public:
  std::shared_ptr<BaseOp> createOpImpl(const aops::SubOpOptions& options) override {
    return std::make_shared<AscendSubOp>(options);
  }
};

class AscendMulOp final : public aops::MulOp {
 public:
  explicit AscendMulOp(const aops::MulOpOptions& options);

  void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
  void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
};

class AscendMulOpFactory final : public TypedOpFactory<OpTypes::kMul, aops::MulOpOptions> {
 public:
  std::shared_ptr<BaseOp> createOpImpl(const aops::MulOpOptions& options) override {
    return std::make_shared<AscendMulOp>(options);
  }
};
```
Add API docs for the new AscendSub/AscendMul ops and factories.
These are public declarations in a header but lack purpose/parameter/behavior docs. Please add brief comments (Doxygen-style is fine) for the new classes and their setup/forward methods.
As per coding guidelines, {mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
💡 Suggested doc-comment additions
```diff
+/// Ascend backend elementwise subtraction op.
+/// \param options Op configuration.
 class AscendSubOp final : public aops::SubOp {
  public:
   explicit AscendSubOp(const aops::SubOpOptions& options);

+  /// Prepare output tensors for the given inputs.
   void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
+  /// Execute elementwise subtraction.
   void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
 };

+/// Factory for AscendSubOp.
 class AscendSubOpFactory final : public TypedOpFactory<OpTypes::kSub, aops::SubOpOptions> {
  public:
   std::shared_ptr<BaseOp> createOpImpl(const aops::SubOpOptions& options) override {
     return std::make_shared<AscendSubOp>(options);
   }
 };

+/// Ascend backend elementwise multiplication op.
+/// \param options Op configuration.
 class AscendMulOp final : public aops::MulOp {
  public:
   explicit AscendMulOp(const aops::MulOpOptions& options);

+  /// Prepare output tensors for the given inputs.
   void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
+  /// Execute elementwise multiplication.
   void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
 };

+/// Factory for AscendMulOp.
 class AscendMulOpFactory final : public TypedOpFactory<OpTypes::kMul, aops::MulOpOptions> {
  public:
   std::shared_ptr<BaseOp> createOpImpl(const aops::MulOpOptions& options) override {
     return std::make_shared<AscendMulOp>(options);
   }
 };
```
🤖 Prompt for AI Agents
In `@mllm/backends/ascend/ops/AscendElewiseOps.hpp` around lines 27 - 55, Add
Doxygen-style API docs for the public classes and methods: document AscendSubOp
and AscendMulOp (purpose, when to use), their constructors (parameters: const
aops::SubOpOptions& / aops::MulOpOptions), and the overridden methods
setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) and
forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs)
(describe expected inputs, outputs, side effects, error conditions). Also
document the factory classes AscendSubOpFactory and AscendMulOpFactory (what
they produce and when to use createOpImpl). Keep comments brief, follow Doxygen
tags (`@brief`, `@param`, `@return`, `@throws` if applicable) and place them above each
class and the setup/forward method declarations.
```cpp
bool is_4d = (a_shape.size() == 4);

// Prepare shapes for ATB
std::vector<int64_t> atb_a_shape, atb_b_shape, atb_c_shape;

if (is_4d) {
  // Reshape [B, H, S, D] -> [B*H, S, D]
  int64_t batch_heads_a = static_cast<int64_t>(a_shape[0]) * static_cast<int64_t>(a_shape[1]);
  int64_t batch_heads_b = static_cast<int64_t>(b_shape[0]) * static_cast<int64_t>(b_shape[1]);
  int64_t batch_heads_c = static_cast<int64_t>(c_shape[0]) * static_cast<int64_t>(c_shape[1]);

  atb_a_shape = {batch_heads_a, static_cast<int64_t>(a_shape[2]), static_cast<int64_t>(a_shape[3])};
  atb_b_shape = {batch_heads_b, static_cast<int64_t>(b_shape[2]), static_cast<int64_t>(b_shape[3])};
  atb_c_shape = {batch_heads_c, static_cast<int64_t>(c_shape[2]), static_cast<int64_t>(c_shape[3])};
} else {
```
🧩 Analysis chain
🏁 Script executed:
```sh
find . -name "AscendMatMulOp.cpp" -type f
```
Repository: UbiquitousLearning/mllm
Length of output: 112
🏁 Script executed:
```sh
cat -n mllm/backends/ascend/ops/AscendMatMulOp.cpp | head -100
```
Repository: UbiquitousLearning/mllm
Length of output: 4489
🏁 Script executed:
```sh
# Check what assertion macros are available in the codebase
rg "MLLM_RT_ASSERT" --type cpp -m 1 -A 2
```
Repository: UbiquitousLearning/mllm
Length of output: 33484
🏁 Script executed:
```sh
# Check if there's any validation of B/C ranks elsewhere or documentation about expected shapes
cat -n mllm/backends/ascend/ops/AscendMatMulOp.cpp | sed -n '40,120p'
```
Repository: UbiquitousLearning/mllm
Length of output: 3790
🏁 Script executed:
```sh
# Look for the header file to understand the function interface better
find . -name "AscendMatMulOp.hpp" -type f
```
Repository: UbiquitousLearning/mllm
Length of output: 112
🏁 Script executed:
```sh
# Check how this operation is typically called - look for test files or usage
find . -name "*MatMul*Test*" -o -name "*matmul*test*" | grep -i ascend
```
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
```sh
# Verify MLLM_RT_ASSERT_EQ macro definition
cat mllm/utils/Common.hpp | grep -A 5 "MLLM_RT_ASSERT_EQ"
```
Repository: UbiquitousLearning/mllm
Length of output: 569
🏁 Script executed:
```sh
# Check how AscendMatMulOp::forward is called
rg "AscendMatMulOp" --type cpp -A 2 | head -50
```
Repository: UbiquitousLearning/mllm
Length of output: 2853
🏁 Script executed:
```sh
# Look for setup() to see if rank is validated during initialization
cat -n mllm/backends/ascend/ops/AscendMatMulOp.cpp | sed -n '22,25p'
```
Repository: UbiquitousLearning/mllm
Length of output: 225
🏁 Script executed:
```sh
# Check the MatMulOp base class to see if there's rank validation
find . -name "MatMulOp.hpp" -o -name "MatMulOp.cpp" | grep -v ascend | head -5
```
Repository: UbiquitousLearning/mllm
Length of output: 250
🏁 Script executed:
```sh
# Check base MatMulOp for any rank validation
cat -n mllm/core/aops/MatMulOp.cpp | head -150
```
Repository: UbiquitousLearning/mllm
Length of output: 2790
🏁 Script executed:
```sh
# Check for any shape inference or validation in the base class
cat mllm/core/aops/MatMulOp.hpp
```
Repository: UbiquitousLearning/mllm
Length of output: 2355
🏁 Script executed:
```sh
# Search for how forward() is called generically in the execution pipeline
rg "->forward\(" --type cpp | head -10
```
Repository: UbiquitousLearning/mllm
Length of output: 471
🏁 Script executed:
```sh
# Look for where forward() is called in the execution pipeline
rg "forward" --type cpp mllm/core/BaseOp.hpp | head -20
```
Repository: UbiquitousLearning/mllm
Length of output: 325
🏁 Script executed:
```sh
# Check if reshape() is always called before forward()
rg "reshape\(" --type cpp mllm/core/BaseOp.cpp | head -20
```
Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
```sh
cat -n mllm/core/BaseOp.hpp | grep -A 10 "forward\|reshape" | head -40
```
Repository: UbiquitousLearning/mllm
Length of output: 1460
🏁 Script executed:
```sh
# Let me check if there's any matmul shape validation during reshape or setup that would prevent mismatched ranks
cat -n mllm/core/aops/MatMulOp.cpp | sed -n '27,64p'
```
Repository: UbiquitousLearning/mllm
Length of output: 1612
🏁 Script executed:
```sh
# Check if BaseOp enforces any constraints during setup
cat -n mllm/core/BaseOp.cpp | sed -n '1,50p'
```
Repository: UbiquitousLearning/mllm
Length of output: 996
Validate B and C ranks before 4D reshape indexing to prevent out-of-bounds access.
When is_4d is true (A is 4D), the code unconditionally indexes b_shape[0..3] and c_shape[0..3] at lines 68-69 and 72-73 without checking that B and C are also 4D. If callers pass a 4D A with a 3D B or C, this causes out-of-bounds access. Add explicit rank assertions before the reshape path:
```diff
 if (is_4d) {
+  MLLM_RT_ASSERT_EQ(b_shape.size(), 4);
+  MLLM_RT_ASSERT_EQ(c_shape.size(), 4);
   // Reshape [B, H, S, D] -> [B*H, S, D]
```
🤖 Prompt for AI Agents
In `@mllm/backends/ascend/ops/AscendMatMulOp.cpp` around lines 60 - 74, When
handling the 4D-A reshape path in AscendMatMulOp.cpp (where is_4d is true and
atb_a_shape/atb_b_shape/atb_c_shape are built), validate that b_shape and
c_shape are also 4D before indexing [0..3]; add explicit checks (e.g., assert or
throw) that b_shape.size() == 4 and c_shape.size() == 4 immediately inside the
is_4d branch before computing batch_heads_b/batch_heads_c and accessing
b_shape[2..3]/c_shape[2..3], and return or raise a clear error if they are not
4D to prevent out-of-bounds access.
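The batch-flattening trick the op relies on — and the rank precondition the review asks for — can be sanity-checked in a few lines of plain Python (a standalone sketch with illustrative names, independent of the mllm codebase):

```python
import math

def flatten_batch_heads(shape):
    """Collapse a rank-4 [B, H, S, D] shape into [B*H, S, D], validating rank first."""
    assert len(shape) == 4, f"expected rank-4 shape, got rank {len(shape)}"
    b, h, s, d = shape
    return [b * h, s, d]

a_shape = [2, 8, 16, 64]                      # [B, H, S, D]
atb_a_shape = flatten_batch_heads(a_shape)
print(atb_a_shape)                            # [16, 16, 64]

# The reshape preserves the element count, so it can be a pure view.
assert math.prod(a_shape) == math.prod(atb_a_shape)

# A rank-3 input must be rejected up front, not indexed out of range.
try:
    flatten_batch_heads([16, 16, 64])
    rejected = False
except AssertionError:
    rejected = True
```

Checking the rank before unpacking is the Python analogue of the `MLLM_RT_ASSERT_EQ` guards suggested above: the failure becomes a clear precondition error instead of undefined behavior.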
```cpp
// Convert axis to positive index if negative
int axis = options_.axis;
if (axis < 0) {
  axis = static_cast<int>(x.rank()) + axis;
}

// ATB expects axes as SVector<int64_t>
softmaxParam.axes.push_back(static_cast<int64_t>(axis));
```
Validate axis bounds before creating the ATB op.
If options_.axis is outside [-rank, rank-1], the adjusted value can still be invalid and the ATB op may fail or read out of range. Add an explicit range check.
✅ Suggested bounds check
```diff
 int axis = options_.axis;
 if (axis < 0) {
   axis = static_cast<int>(x.rank()) + axis;
 }
+if (axis < 0 || axis >= static_cast<int>(x.rank())) {
+  MLLM_ERROR_EXIT(ExitCode::kAscendError,
+                  "AscendSoftmaxOp: axis {} out of range for rank {}",
+                  axis, x.rank());
+}
```
🤖 Prompt for AI Agents
In `@mllm/backends/ascend/ops/AscendSoftmaxOp.cpp` around lines 55 - 63, The code
adjusts options_.axis to a positive index but never validates it; before pushing
to softmaxParam.axes (and before creating the ATB op), compute int rank =
static_cast<int>(x.rank()); verify that options_.axis is within [-rank, rank-1],
convert negative axis via axis += rank, then assert or return an error if the
resulting axis is not in [0, rank-1]; only after this valid check call
softmaxParam.axes.push_back(static_cast<int64_t>(axis)) so the ATB op cannot be
created with an out-of-range axis.
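The normalize-then-validate pattern the review suggests is easy to illustrate outside C++; this hypothetical Python helper mirrors the same logic:

```python
def normalize_axis(axis: int, rank: int) -> int:
    """Map a possibly-negative axis into [0, rank), or fail loudly."""
    if axis < 0:
        axis += rank            # e.g. axis=-1, rank=4 -> 3
    if not 0 <= axis < rank:
        raise ValueError(f"axis {axis} out of range for rank {rank}")
    return axis

print(normalize_axis(-1, 4))    # 3
print(normalize_axis(2, 4))     # 2
```

The key point is the second check: adding `rank` once does not rescue an axis like `-5` for a rank-4 tensor, so the post-normalization range test is what actually keeps an invalid axis out of the ATB op.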
```diff
 requires = [
-    "scikit-build-core>=0.11.0", "apache-tvm-ffi"
+    "scikit-build-core==0.10.0", "apache-tvm-ffi"
 ]
```
🧩 Analysis chain
🏁 Script executed:
```sh
find . -name "pyproject.toml" -type f
```
Repository: UbiquitousLearning/mllm
Length of output: 83
🏁 Script executed:
```sh
cat -n pyproject.toml
```
Repository: UbiquitousLearning/mllm
Length of output: 2137
🌐 Web query:
scikit-build-core 0.10.0 release notes wheel.py-api minimum-version support
💡 Result:
For scikit-build-core 0.10.0 (released as part of the 0.10 line), the relevant release-notes items are:
- Opt-in "auto minimum-version": 0.10.0 added an automatic way to set `tool.scikit-build.minimum-version` from your `build-system.requires`, so you can single-source the lower bound (e.g. `requires = ["scikit-build-core>=0.10"]` plus `minimum-version = "build-system.requires"`). (iscinumpy.dev)
- Wheel python-API tagging via `wheel.py-api` (feature existed by 0.10.x and is documented): you can set wheel tags for Limited API / Stable ABI builds, e.g. `wheel.py-api = "cp38"`. scikit-build-core will only target ABI3 when the running Python is >= the version you set; otherwise it builds a normal CPython-extension wheel. You can also use `wheel.py-api = "py3"` for "pythonless" (generic) tags when you're not tied to CPython's API. (scikit-build-core.readthedocs.io)
Citations:
- 1: https://iscinumpy.dev/post/scikit-build-core-0-10/
- 2: https://scikit-build-core.readthedocs.io/en/latest/configuration/index.html?utm_source=openai
Pin apache-tvm-ffi in build-system to match runtime version.
apache-tvm-ffi is unpinned in the build-system requires while the runtime dependency pins it to 0.1.0b4. Unpinned build dependencies can cause build/runtime mismatches and non-reproducible builds. Pin it consistently.
🔧 Suggested adjustment
```diff
 requires = [
-    "scikit-build-core==0.10.0", "apache-tvm-ffi"
+    "scikit-build-core==0.10.0",
+    "apache-tvm-ffi==0.1.0b4",
 ]
```
📝 Committable suggestion
```toml
requires = [
    "scikit-build-core==0.10.0",
    "apache-tvm-ffi==0.1.0b4",
]
```
🤖 Prompt for AI Agents
In `@pyproject.toml` around lines 2 - 4, Update the build-system requires list to
pin apache-tvm-ffi to the runtime version by changing the unpinned entry in the
requires = [...] array to "apache-tvm-ffi==0.1.0b4" so the build-time dependency
matches the runtime pin; locate the requires list in pyproject.toml and replace
the plain "apache-tvm-ffi" token with the exact version string.
```yaml
cmake_extra_args:
  - "-DMLLM_CROSS_COMPILE=ON"
  - "-DMLLM_BUILD_ARM_BACKEND=ON"
  - "-DMLLM_BUILD_ASCEND_BACKEND=ON"
```
🧩 Analysis chain
🏁 Script executed:
```sh
#!/bin/bash
# Search for MLLM_BUILD_ASCEND usage in CMake files
echo "=== MLLM_BUILD_ASCEND in CMake files ==="
rg "MLLM_BUILD_ASCEND" --type cmake -C 3
echo -e "\n=== Ascend backend directory structure ==="
fd -t d "ascend" src/
echo -e "\n=== Android references in Ascend backend ==="
if [ -d "src/backends/ascend" ]; then
    rg -i "android" src/backends/ascend/ -C 2 || echo "No Android references found in Ascend backend"
else
    echo "Ascend backend directory not found at expected location"
fi
echo -e "\n=== Other build task files ==="
fd -e yaml . tasks/ --type f | head -20
```
Repository: UbiquitousLearning/mllm
Length of output: 2134
🏁 Script executed:
```sh
#!/bin/bash
# Check build_android.yaml for context
echo "=== Current build_android.yaml content ==="
cat -n tasks/build_android.yaml
echo -e "\n=== Look for cmake_toolchain_file in other build configs ==="
rg "cmake_toolchain_file" tasks/ -C 2
```
Repository: UbiquitousLearning/mllm
Length of output: 3656
🏁 Script executed:
```sh
#!/bin/bash
# Search for Ascend-related build configurations across the repo
echo "=== Ascend mentions in build files ==="
rg -i "ascend" tasks/ -C 2
echo -e "\n=== CMakeLists.txt main configuration ==="
if [ -f "CMakeLists.txt" ]; then
    echo "File exists, checking size..."
    wc -l CMakeLists.txt
fi
```
Repository: UbiquitousLearning/mllm
Length of output: 2914
🏁 Script executed:
```sh
echo "=== build_arm_ascend.yaml full content ==="
cat -n tasks/build_arm_ascend.yaml
```
Repository: UbiquitousLearning/mllm
Length of output: 973
Add missing cmake_toolchain_file for Android NDK cross-compilation.
The cmake_toolchain_file is missing from this configuration, which is required for Android cross-compilation. All other Android build configurations include:
```yaml
cmake_toolchain_file: "$ANDROID_NDK_PATH/build/cmake/android.toolchain.cmake"
```
Without it, the Android NDK toolchain won't be properly configured, breaking the build. Additionally, the same issue exists in tasks/build_arm_ascend.yaml.
🤖 Prompt for AI Agents
In `@tasks/build_android.yaml` at line 8, Add the missing cmake_toolchain_file key
to the Android build task so the NDK toolchain is used: insert
cmake_toolchain_file: "$ANDROID_NDK_PATH/build/cmake/android.toolchain.cmake"
into the Android task (the same place where cmake args like
"-DMLLM_BUILD_ASCEND_BACKEND=ON" are defined) and do the same change in the ARM
Ascend Android task; ensure the key is at the same indentation level as the
other top-level task properties so the YAML remains valid.
```cpp
// Compute softmax for each slice along the axis
for (size_t outer = 0; outer < outer_size; ++outer) {
  for (size_t inner = 0; inner < inner_size; ++inner) {
    // Find max value for numerical stability
    float max_val = -std::numeric_limits<float>::infinity();
    for (size_t i = 0; i < axis_size; ++i) {
      size_t idx = outer * strides[pos_axis > 0 ? pos_axis - 1 : 0] * shape[pos_axis] +
                   i * (pos_axis < ndim - 1 ? strides[pos_axis] : 1) + inner;
      if (pos_axis == 0) {
        idx = i * strides[0] + inner;
      } else if (pos_axis == ndim - 1) {
        idx = outer * axis_size + i;
      } else {
        idx = outer * strides[pos_axis - 1] * shape[pos_axis] + i * strides[pos_axis] + inner;
      }
      float val = MLLM_FP16_TO_FP32(x_ptr[idx]);
      max_val = std::max(max_val, val);
    }

    // Compute exp(x - max) and sum
    float sum_exp = 0.0f;
    std::vector<float> exp_vals(axis_size);
    for (size_t i = 0; i < axis_size; ++i) {
      size_t idx = outer * strides[pos_axis > 0 ? pos_axis - 1 : 0] * shape[pos_axis] +
                   i * (pos_axis < ndim - 1 ? strides[pos_axis] : 1) + inner;
      if (pos_axis == 0) {
        idx = i * strides[0] + inner;
      } else if (pos_axis == ndim - 1) {
        idx = outer * axis_size + i;
      } else {
        idx = outer * strides[pos_axis - 1] * shape[pos_axis] + i * strides[pos_axis] + inner;
      }
      float val = MLLM_FP16_TO_FP32(x_ptr[idx]);
      exp_vals[i] = std::exp(val - max_val);
      sum_exp += exp_vals[i];
    }

    // Compute softmax and store result
    for (size_t i = 0; i < axis_size; ++i) {
      size_t idx = outer * strides[pos_axis > 0 ? pos_axis - 1 : 0] * shape[pos_axis] +
                   i * (pos_axis < ndim - 1 ? strides[pos_axis] : 1) + inner;
      if (pos_axis == 0) {
        idx = i * strides[0] + inner;
      } else if (pos_axis == ndim - 1) {
        idx = outer * axis_size + i;
      } else {
        idx = outer * strides[pos_axis - 1] * shape[pos_axis] + i * strides[pos_axis] + inner;
      }
      float result = exp_vals[i] / sum_exp;
      r_ptr[idx] = MLLM_FP32_TO_FP16(result);
```
Fix index calculation for middle axes in the CPU reference.
For 0 < axis < ndim-1, the index formula multiplies by shape[pos_axis] twice, which can step past the buffer and corrupt the reference. Use outer * strides[pos_axis - 1] + i * strides[pos_axis] + inner (ideally via a small helper) in all three loops.
🛠️ Suggested fix with a helper indexer
```diff
   size_t inner_size = 1;
   for (int i = pos_axis + 1; i < ndim; ++i) {
     inner_size *= shape[i];
   }
+  auto idx_at = [&](size_t outer_idx, size_t axis_idx, size_t inner_idx) -> size_t {
+    if (pos_axis == 0) {
+      return axis_idx * strides[0] + inner_idx;
+    }
+    if (pos_axis == ndim - 1) {
+      return outer_idx * axis_size + axis_idx;
+    }
+    return outer_idx * strides[pos_axis - 1] + axis_idx * strides[pos_axis] + inner_idx;
+  };
+
   // Compute softmax for each slice along the axis
   for (size_t outer = 0; outer < outer_size; ++outer) {
     for (size_t inner = 0; inner < inner_size; ++inner) {
       // Find max value for numerical stability
       float max_val = -std::numeric_limits<float>::infinity();
       for (size_t i = 0; i < axis_size; ++i) {
-        size_t idx = outer * strides[pos_axis > 0 ? pos_axis - 1 : 0] * shape[pos_axis] +
-                     i * (pos_axis < ndim - 1 ? strides[pos_axis] : 1) + inner;
-        if (pos_axis == 0) {
-          idx = i * strides[0] + inner;
-        } else if (pos_axis == ndim - 1) {
-          idx = outer * axis_size + i;
-        } else {
-          idx = outer * strides[pos_axis - 1] * shape[pos_axis] + i * strides[pos_axis] + inner;
-        }
+        size_t idx = idx_at(outer, i, inner);
         float val = MLLM_FP16_TO_FP32(x_ptr[idx]);
         max_val = std::max(max_val, val);
       }
       // Compute exp(x - max) and sum
       float sum_exp = 0.0f;
       std::vector<float> exp_vals(axis_size);
       for (size_t i = 0; i < axis_size; ++i) {
-        size_t idx = outer * strides[pos_axis > 0 ? pos_axis - 1 : 0] * shape[pos_axis] +
-                     i * (pos_axis < ndim - 1 ? strides[pos_axis] : 1) + inner;
-        if (pos_axis == 0) {
-          idx = i * strides[0] + inner;
-        } else if (pos_axis == ndim - 1) {
-          idx = outer * axis_size + i;
-        } else {
-          idx = outer * strides[pos_axis - 1] * shape[pos_axis] + i * strides[pos_axis] + inner;
-        }
+        size_t idx = idx_at(outer, i, inner);
         float val = MLLM_FP16_TO_FP32(x_ptr[idx]);
         exp_vals[i] = std::exp(val - max_val);
         sum_exp += exp_vals[i];
       }
       // Compute softmax and store result
       for (size_t i = 0; i < axis_size; ++i) {
-        size_t idx = outer * strides[pos_axis > 0 ? pos_axis - 1 : 0] * shape[pos_axis] +
-                     i * (pos_axis < ndim - 1 ? strides[pos_axis] : 1) + inner;
-        if (pos_axis == 0) {
-          idx = i * strides[0] + inner;
-        } else if (pos_axis == ndim - 1) {
-          idx = outer * axis_size + i;
-        } else {
-          idx = outer * strides[pos_axis - 1] * shape[pos_axis] + i * strides[pos_axis] + inner;
-        }
+        size_t idx = idx_at(outer, i, inner);
         float result = exp_vals[i] / sum_exp;
         r_ptr[idx] = MLLM_FP32_TO_FP16(result);
       }
     }
   }
```
In `@tests/ascend/AscendSoftmaxKernelTest.hpp` around lines 60 - 109, The index
computation for the middle-axis case (0 < pos_axis < ndim-1) in the three loops
erroneously multiplies by shape[pos_axis] and can step past the buffer; update
all index calculations used to read/write x_ptr and r_ptr so that the
middle-axis branch uses outer * strides[pos_axis - 1] + i * strides[pos_axis] +
inner (instead of the current outer * strides[pos_axis - 1] * shape[pos_axis] +
...), and factor this into a small helper (e.g., an indexer function used by the
max_val, exp_vals, and result loops) to avoid duplication and ensure consistency
across the loops that use pos_axis, strides, shape, outer_size, inner_size,
axis_size, MLLM_FP16_TO_FP32 and MLLM_FP32_TO_FP16.
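The off-by-a-factor bug flagged above is easy to demonstrate numerically. This standalone Python sketch (no mllm code involved) checks that the corrected middle-axis formula enumerates every element of a contiguous buffer exactly once, while the buggy variant steps past the end:

```python
import math

shape = [2, 3, 4]            # middle axis: pos_axis = 1
pos_axis = 1
# Row-major strides: strides[i] = product of shape[i+1:]
strides = [math.prod(shape[i + 1:]) for i in range(len(shape))]  # [12, 4, 1]

outer_size = math.prod(shape[:pos_axis])        # 2
axis_size = shape[pos_axis]                     # 3
inner_size = math.prod(shape[pos_axis + 1:])    # 4

# Corrected formula: outer * strides[pos_axis-1] + i * strides[pos_axis] + inner
good = {o * strides[pos_axis - 1] + i * strides[pos_axis] + inner
        for o in range(outer_size) for i in range(axis_size)
        for inner in range(inner_size)}

# Buggy formula multiplies by shape[pos_axis] once too often.
bad = {o * strides[pos_axis - 1] * shape[pos_axis] + i * strides[pos_axis] + inner
       for o in range(outer_size) for i in range(axis_size)
       for inner in range(inner_size)}

total = math.prod(shape)
print(len(good) == total and max(good) == total - 1)   # perfect coverage
print(max(bad) >= total)                               # indexes out of bounds
```

Since `strides[pos_axis - 1]` already equals `shape[pos_axis] * strides[pos_axis]` for a contiguous layout, the extra `* shape[pos_axis]` in the buggy branch inflates the outer step by exactly that factor.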
Please check Guidelines for Contributing.
Summary by CodeRabbit
New Features
Improvements
Tests
Chores