
feat(Ascend): Add some new Ascend Ops #621

Merged
chenghuaWang merged 4 commits into UbiquitousLearning:main from lywbarca:ascend
Jan 31, 2026

Conversation


@lywbarca lywbarca commented Jan 30, 2026

Please check Guidelines for Contributing.

Summary by CodeRabbit

  • New Features

    • Ascend backend: added support for Sub, Mul, SiLU, Linear, RMSNorm, View, MatMul and Softmax operations.
    • Public functional APIs: added linear(...) and rmsNorm(...) convenience functions.
  • Tests

    • Added extensive FP16 test suites for all new Ascend operations, including attention-related tests (scaled dot-product, multi-head, GQA).



coderabbitai bot commented Jan 30, 2026

📝 Walkthrough


Extends the Ascend backend with multiple new ops (SiLU, Linear, RMSNorm, View, MatMul, Softmax, Sub, Mul), adds a tensor-to-ATB filler utility, adjusts ACL init flow during device enumeration, registers factories for the new ops, exposes functional wrappers (linear, rmsNorm), and adds extensive FP16 tests.

Changes

Cohort: Backend core & util
Files: mllm/backends/ascend/AscendBackend.cpp, mllm/backends/ascend/AscendCommon.hpp, mllm/backends/ascend/AscendCommon.cpp
Summary: Added fillAtbTensor() declaration/implementation; moved/centralized ACL init during device enumeration; updated op factory registration to include 10 factories.

Cohort: Elementwise ops
Files: mllm/backends/ascend/ops/AscendElewiseOps.hpp, mllm/backends/ascend/ops/AscendElewiseOps.cpp
Summary: Removed strict no-broadcast assertion from Add; added AscendSubOp and AscendMulOp implementations and factories.

Cohort: SiLU, Softmax, View
Files: mllm/backends/ascend/ops/AscendSiLUOp.hpp, mllm/backends/ascend/ops/AscendSiLUOp.cpp, mllm/backends/ascend/ops/AscendSoftmaxOp.hpp, mllm/backends/ascend/ops/AscendSoftmaxOp.cpp, mllm/backends/ascend/ops/AscendViewOp.hpp, mllm/backends/ascend/ops/AscendViewOp.cpp
Summary: New Ascend-specific implementations and factories for SiLU, Softmax (axis normalization, FP16), and View (metadata-only delegation).

Cohort: Linear & MatMul
Files: mllm/backends/ascend/ops/AscendLinearOp.hpp, mllm/backends/ascend/ops/AscendLinearOp.cpp, mllm/backends/ascend/ops/AscendMatMulOp.hpp, mllm/backends/ascend/ops/AscendMatMulOp.cpp
Summary: Added AscendLinearOp with redirect reshape, bias validation, ATB linear integration and workspace management; added AscendMatMulOp supporting 2D/3D/4D (4D→3D reshape) batched matmul with ATB.

Cohort: RMSNorm
Files: mllm/backends/ascend/ops/AscendRMSNormOp.hpp, mllm/backends/ascend/ops/AscendRMSNormOp.cpp
Summary: New AscendRMSNormOp with epsilon handling, optional weight, ATB RMSNorm integration, workspace allocation and stream sync.

Cohort: Functional API
Files: mllm/nn/Functional.hpp, mllm/nn/Functional.cpp
Summary: Added linear() wrapper (optional bias) and rmsNorm() wrapper routing through Context builder.

Cohort: Build config
Files: tasks/build_arm_ascend.yaml
Summary: Updated CMake build directory path from build-arm-ascend to build-ascend.

Cohort: Tests
Files: tests/ascend/AscendKernelTest.hpp, tests/ascend/AscendLinearKernelTest.hpp, tests/ascend/AscendRMSNormKernelTest.hpp, tests/ascend/AscendSiLUKernelTest.hpp, tests/ascend/AscendSoftmaxKernelTest.hpp, tests/ascend/AscendAttentionKernelTest.hpp, tests/ascend/KernelTest.cpp
Summary: Comprehensive FP16 test suites added for Sub/Mul, SiLU, Linear (with/without bias), RMSNorm, Softmax, and attention variants (scaled dot-product, MHA, GQA); include CPU reference implementations and allClose checks.

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant Op as Ascend Op
    participant ATB as ATB Library
    participant Mem as Ascend Memory Manager
    participant Stream as ATB Stream

    App->>Op: forward(inputs, outputs)
    activate Op
    Op->>Op: validate dtype/shape
    Op->>ATB: create operation (e.g., Linear/MatMul/Softmax)
    ATB-->>Op: op handle
    Op->>Op: fill ATB tensors (fillAtbTensor)
    opt workspace needed
        Op->>Mem: allocate(workspace_size)
        Mem-->>Op: workspace_ptr
    end
    Op->>ATB: Setup(VariantPack)
    ATB-->>Op: setup OK
    Op->>ATB: Execute(VariantPack)
    ATB->>Stream: schedule compute
    Stream-->>ATB: done
    ATB-->>Op: execute OK
    Op->>Stream: synchronize
    Stream-->>Op: sync complete
    opt workspace allocated
        Op->>Mem: free(workspace_ptr)
    end
    Op->>ATB: destroy operation
    Op-->>App: forward complete
    deactivate Op

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • xumengwei
  • liang1232018
  • oreomaker

Poem

🐇 A rabbit's note on Ascend's new hop:

Ten ops now bound to hop and play,
SiLU, Linear, Softmax lead the way.
MatMul, RMSNorm join the run —
Sub and Mul leap, tests say "well done!"
ATB streams hum; the bunny says hooray! 🎉

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning. The PR description is entirely missing substantive content; it contains only the contribution template header and a link to guidelines without any explanation of the changes, rationale, or implementation details. Resolution: complete the PR description with details about the new operations being added, implementation approach, testing strategy, and any relevant context or breaking changes.
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 23.46%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed. The PR title 'feat(Ascend): Add some new Ascend Ops' clearly and specifically describes the main change — adding new Ascend operators to the backend. It uses a conventional format with scope and subject, making it easily scannable.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 15

🤖 Fix all issues with AI agents
In `@mllm/backends/ascend/AscendCommon.cpp`:
- Around line 220-225: Add defensive runtime validation at the start of
fillAtbTensor: verify the input Tensor t has FP16 dtype and is on the Ascend
device (e.g., check t.dtype() == DType::FP16 and t.device/type indicates
Ascend), and ensure t.ptr<void>() is non-null; if any check fails, throw or
return a clear error (e.g., throw std::invalid_argument with a message
identifying fillAtbTensor), then proceed to call fillAtbTensorDesc(t,
atb_tensor.desc) and set atb_tensor.deviceData =
reinterpret_cast<uint8_t*>(t.ptr<void>()) and atb_tensor.dataSize = t.bytes();
this adds defense-in-depth for callers like RMSNormOp that may pass invalid
tensors.

In `@mllm/backends/ascend/ops/AscendElewiseOps.cpp`:
- Around line 168-172: AscendSubOp and AscendMulOp are truncating uint64_t
workspaceSize to uint32_t when calling getAscendMemoryManager().allocateBlock;
add the same guard used in AscendMatMulOp to check if workspaceSize > UINT32_MAX
(or otherwise unsafe to cast) before allocating, and handle the error path
(log/throw or set an error status) instead of silently truncating; update the
allocation calls to cast only after the guard and reference workspace_block_id,
getAscendMemoryManager(), and allocateBlock to locate the change.

In `@mllm/backends/ascend/ops/AscendLinearOp.cpp`:
- Around line 22-29: AscendLinearOp::reshape uses inputs[1] (weight) when
options().isRedirect() without validating inputs size; add a guard at the top of
AscendLinearOp::reshape to verify inputs.size() >= 2 before accessing inputs[1],
and handle the error path (e.g. log/throw or set outputs appropriately) so the
function does not read out-of-bounds; ensure the same validation covers
accessing inputs[0] as well to avoid further OOB access.
- Around line 38-56: The function AscendLinearOp::forward dereferences
outputs[0] without confirming outputs has at least one element; add a check
using MLLM_RT_ASSERT (or similar) to assert outputs.size() == 1 (or
outputs.size() >= 1 as appropriate for this op) before using outputs[0], and
fail early with a clear message; update any existing precondition comments to
reflect the requirement so callers and tests are consistent with the new
assertion.

In `@mllm/backends/ascend/ops/AscendLinearOp.hpp`:
- Around line 12-25: Add brief Doxygen-style API comments for the public classes
and methods: document AscendLinearOp (purpose as Ascend backend implementation
of aops::LinearOp), its constructor AscendLinearOp(const aops::LinearOpOptions&
options) (describe options parameter), and the public overrides setup, forward,
and reshape (describe inputs, outputs and expected behavior/side-effects). Also
document AscendLinearOpFactory (purpose to create AscendLinearOp instances) and
its createOpImpl method (describe returned shared_ptr and parameter). Keep
comments concise, mention parameter names and return type and any thrown
conditions.

In `@mllm/backends/ascend/ops/AscendMatMulOp.cpp`:
- Around line 120-126: The code currently casts workspaceSize (uint64_t) to
uint32_t when calling getAscendMemoryManager().allocateBlock, risking truncation
for sizes > UINT32_MAX; update the block-allocation logic in AscendMatMulOp.cpp
to first check workspaceSize against UINT32_MAX and handle overflow explicitly
(e.g., log error/throw or return a failure status) instead of silently
static_casting, then only call
mem_mgr.allocateBlock(static_cast<uint32_t>(workspaceSize), workspace_block_id)
when safe; keep references to workspaceSize, getAscendMemoryManager(),
allocateBlock(...), workspace_block_id, and getBlockPtr(...) so the check and
error path are added around the existing allocation calls.
- Around line 29-42: The helper fillAtbTensorWithShape currently hardcodes
desc.dtype = ACL_FLOAT16 which breaks non-FP16 tensors; update
fillAtbTensorWithShape to derive the ACL dtype from the input Tensor t (or call
the existing fillAtbTensorDesc helper used by AscendAddOp/AscendSoftmaxOp) and
set desc.dtype accordingly, then recompute atb_tensor.dataSize and assign
deviceData as before; also add validation to throw/log an error if t's dtype
cannot be mapped to a supported ACL dtype so unsupported types fail fast instead
of producing silent errors.
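The dtype-derivation fix can be sketched with stand-in enums — the real mllm and ACL type enums are not reproduced here, so treat every name below as hypothetical:

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins for the mllm tensor dtype and the ACL dtype.
enum class MllmDType { kFloat16, kFloat32, kInt8 };
enum class AclDType { kFloat16, kFloat, kInt8 };

// Map the tensor's dtype instead of hardcoding FP16; unsupported
// types fail fast rather than producing silent errors.
AclDType toAclDType(MllmDType d) {
  switch (d) {
    case MllmDType::kFloat16: return AclDType::kFloat16;
    case MllmDType::kFloat32: return AclDType::kFloat;
    case MllmDType::kInt8:    return AclDType::kInt8;
  }
  throw std::invalid_argument("fillAtbTensorWithShape: unsupported dtype");
}
```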

In `@mllm/backends/ascend/ops/AscendRMSNormOp.cpp`:
- Around line 27-33: Reinstate input-count validation at the top of
AscendRMSNormOp::forward to ensure inputs.size() is 1 or 2 before any indexing;
replace the commented-out check with an active assertion (e.g., MLLM_RT_ASSERT
or similar used in this file) that yields a clear error message like
"AscendRMSNormOp expects 1 or 2 inputs" so inputs[0] and the conditional access
of inputs[1] are safe.

In `@mllm/backends/ascend/ops/AscendRMSNormOp.hpp`:
- Around line 12-24: Add concise API documentation comments for the public
classes and methods: document the purpose of the AscendRMSNormOp class, its
constructor AscendRMSNormOp(const aops::RMSNormOpOptions& options), and its
public methods setup(const std::vector<Tensor>& inputs, std::vector<Tensor>&
outputs) and forward(const std::vector<Tensor>& inputs, std::vector<Tensor>&
outputs) (describe expected inputs, outputs, and behaviors); also document the
AscendRMSNormOpFactory class and its createOpImpl(const aops::RMSNormOpOptions&
options) method (describe what it returns and when to use it). Keep comments
brief, use Doxygen-style or single-line doc comments above each public
declaration, and include parameter/return descriptions for the constructor and
createOpImpl where applicable.

In `@mllm/backends/ascend/ops/AscendSiLUOp.hpp`:
- Around line 12-24: Add concise API doc comments for the public classes and
their constructor/methods: document AscendSiLUOp (purpose: Ascend backend SiLU
activation operator), its constructor AscendSiLUOp(const aops::SiLUOpOptions&
options) (describe options and ownership), and its public methods setup(const
std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) and forward(const
std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) (describe expected
inputs, outputs, and possible error conditions). Also add a brief comment for
AscendSiLUOpFactory explaining it creates AscendSiLUOp instances for
OpTypes::kSiLU via createOpImpl. Keep comments short and in Doxygen-style or
single-line form above each declaration.

In `@mllm/backends/ascend/ops/AscendSoftmaxOp.cpp`:
- Around line 55-63: After converting options_.axis to a positive index in
AscendSoftmaxOp.cpp, validate that the resulting axis is within [0, x.rank()-1];
if x.rank() is zero or axis is out of range, return or throw a clear error
(e.g., std::invalid_argument or the project’s status/error reporting mechanism)
before pushing to softmaxParam.axes. Update the block that sets axis (uses
options_.axis and x.rank()) to perform this bounds check and produce a
descriptive error mentioning the invalid axis and the input rank instead of
allowing an out-of-range axis to be used.
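The normalization-plus-bounds check described above can be sketched in isolation (an illustrative helper, not the actual AscendSoftmaxOp code):

```cpp
#include <cassert>
#include <stdexcept>

// Convert a possibly negative axis to a positive index and verify it
// lies within [0, rank - 1] before it is handed to softmaxParam.axes.
int normalizeSoftmaxAxis(int axis, int rank) {
  if (rank <= 0) throw std::invalid_argument("softmax: input rank must be positive");
  const int pos = axis < 0 ? axis + rank : axis;
  if (pos < 0 || pos >= rank) throw std::invalid_argument("softmax: axis out of range for rank");
  return pos;
}
```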

In `@mllm/nn/Functional.hpp`:
- Line 23: Add short API doc comments to the new public functional wrappers
(e.g., the function linear(const Tensor& x, const Tensor& weight, const Tensor&
bias = Tensor()) and the other functional wrappers around line 136) that state
the function purpose, describe each parameter (clearly mark optional parameters
like bias, epsilon, add_unit_offset), specify the return value (what Tensor
shape/semantics to expect), and outline error behavior (e.g., mismatched shapes,
null/empty tensors). Keep each doc comment brief (one to three sentences) and
attach it immediately above the function declaration so tools and readers can
discover the API easily.

In `@tests/ascend/AscendAttentionKernelTest.hpp`:
- Around line 224-239: The causal mask generation in the block that creates
mask_cpu when use_mask is true uses offset = S_kv - S_q and will produce
incorrect masking when S_kv < S_q (negative offset) causing entire rows to
become -inf; update the logic in the mask_cpu creation (the loops that write
mask_ptr) to skip creating the upper-triangular mask when S_kv < S_q (e.g.,
guard early and leave mask as zeros or compute only when S_kv >= S_q), and apply
the same guard to the GQA mask block referenced at the GQA mask generation
(mirror the check around its mask write loops) so no rows are fully masked when
S_kv < S_q.
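To see the failure mode concretely, this standalone check (illustrative, not test-suite code) reproduces how a negative offset fully masks early rows:

```cpp
#include <cassert>

// With offset = S_kv - S_q, row i of the causal mask is fully masked
// whenever every column j satisfies j > i + offset, i.e. i + offset < 0.
bool anyRowFullyMasked(int S_q, int S_kv) {
  const int offset = S_kv - S_q;
  for (int i = 0; i < S_q; ++i) {
    bool all_masked = true;
    for (int j = 0; j < S_kv; ++j) {
      if (j <= i + offset) { all_masked = false; break; }
    }
    if (all_masked) return true;
  }
  return false;
}
```

For S_q = 8, S_kv = 4 the first four rows come out fully masked, which is exactly the all -inf / NaN scenario the suggested guard prevents.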

In `@tests/ascend/AscendLinearKernelTest.hpp`:
- Around line 24-90: Add a guard at the start of LinearFloat16Test’s per-case
loop to validate input_shape is non-empty and that input_shape.back() ==
in_channels (and that in_channels>0 and out_channels>0) to prevent out-of-bounds
reads when computing the CPU reference; if the check fails, log a clear message
(using the existing "[LinearTest]" prefix) and return false (or skip the case as
desired) so malformed test cases do not cause invalid memory access in the loops
that use x_ptr and w_ptr.

In `@tests/ascend/AscendSoftmaxKernelTest.hpp`:
- Around line 41-110: The index math for middle-axis softmax is incorrect (it
multiplies by shape[pos_axis] twice); replace the repeated complex branches
computing idx inside the max/exp/result loops with a single flattened-index
expression using outer, axis_size and inner_size: compute idx = (outer *
axis_size + i) * inner_size + inner (use this in the three places where idx is
computed), instead of the current conditional expressions that use strides and
shape[pos_axis]; this fixes OOB and ensures correct indexing for 0 < pos_axis <
ndim - 1 when using outer_size, axis_size and inner_size.
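The suggested flattened-index expression can be sanity-checked in isolation; for a tensor viewed as [outer, axis, inner] it is:

```cpp
#include <cassert>

// Element (outer, i, inner) of a contiguous [outer_size, axis_size,
// inner_size] view lives at a single flattened offset.
int flatIndex(int outer, int i, int inner, int axis_size, int inner_size) {
  return (outer * axis_size + i) * inner_size + inner;
}
```

Using this one expression in the max, exp, and result loops avoids the double multiplication by shape[pos_axis] and the out-of-bounds reads it causes.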
🧹 Nitpick comments (5)
mllm/backends/ascend/ops/AscendElewiseOps.hpp (1)

27-56: Consider adding brief documentation for the public API classes.

The coding guidelines recommend that public APIs have clear docstrings or comments. While the existing AscendAddOp also lacks documentation (so this is consistent), adding a brief comment describing each op's purpose would improve maintainability.

📝 Optional: Add brief class documentation
+/// Ascend backend implementation of element-wise subtraction.
 class AscendSubOp final : public aops::SubOp {
+/// Ascend backend implementation of element-wise multiplication.
 class AscendMulOp final : public aops::MulOp {
mllm/backends/ascend/AscendCommon.hpp (1)

44-45: Document parameter/ownership expectations for fillAtbTensor.

The current comment is brief but doesn’t describe parameter expectations or ownership/lifetime, which is important for a public API.

📝 Suggested doc comment expansion
-// Setup ATB Tensor with correct dataSize calculated by ATB Utils
+// Setup ATB Tensor with correct dataSize calculated by ATB Utils.
+// `@param` t         Source tensor whose storage is referenced.
+// `@param` atb_tensor ATB tensor to populate (metadata + deviceData).
+// `@note` Caller must ensure `t` remains alive while `atb_tensor` is in use.
 void fillAtbTensor(const Tensor& t, atb::Tensor& atb_tensor);

As per coding guidelines, ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.

mllm/backends/ascend/ops/AscendSoftmaxOp.hpp (1)

12-24: Add brief API docs for AscendSoftmaxOp and its factory.
These new public types are introduced without any class-level documentation; a short purpose/usage comment would satisfy the API doc requirement.

✍️ Suggested doc comments
+// Ascend backend implementation of Softmax.
 class AscendSoftmaxOp final : public aops::SoftmaxOp {
@@
-class AscendSoftmaxOpFactory final : public TypedOpFactory<OpTypes::kSoftmax, aops::SoftmaxOpOptions> {
+// Factory for creating AscendSoftmaxOp instances.
+class AscendSoftmaxOpFactory final : public TypedOpFactory<OpTypes::kSoftmax, aops::SoftmaxOpOptions> {
As per coding guidelines, Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
mllm/backends/ascend/ops/AscendMatMulOp.hpp (1)

12-23: Add brief API docs for AscendMatMulOp and its factory.
These new public types are introduced without any class-level documentation; a short purpose/usage comment would satisfy the API doc requirement.

✍️ Suggested doc comments
+// Ascend backend implementation of MatMul.
 class AscendMatMulOp final : public aops::MatMulOp {
@@
-class AscendMatMulOpFactory : public TypedOpFactory<OpTypes::kMatMul, aops::MatMulOpOptions> {
+// Factory for creating AscendMatMulOp instances.
+class AscendMatMulOpFactory : public TypedOpFactory<OpTypes::kMatMul, aops::MatMulOpOptions> {
As per coding guidelines, Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
mllm/backends/ascend/ops/AscendElewiseOps.cpp (1)

106-189: LGTM, but significant code duplication with AscendAddOp.

The AscendSubOp implementation is correct and follows the established ATB workflow pattern. However, this forward function is nearly identical to AscendAddOp::forward (lines 26-104), differing only in:

  • elewiseType enum value
  • Error message strings

Consider extracting the common logic into a shared helper function parameterized by the elewise type:

Refactor suggestion
namespace {
void executeElewiseOp(
    atb::infer::ElewiseParam::ElewiseType type,
    const char* opName,
    const std::vector<Tensor>& inputs,
    std::vector<Tensor>& outputs) {
  // Common implementation here...
}
}  // namespace

void AscendAddOp::forward(...) {
  executeElewiseOp(atb::infer::ElewiseParam::ELEWISE_ADD, "AddOp", inputs, outputs);
}

void AscendSubOp::forward(...) {
  executeElewiseOp(atb::infer::ElewiseParam::ELEWISE_SUB, "SubOp", inputs, outputs);
}

Comment on lines +220 to +225
void fillAtbTensor(const Tensor& t, atb::Tensor& atb_tensor) {
  fillAtbTensorDesc(t, atb_tensor.desc);
  atb_tensor.deviceData = reinterpret_cast<uint8_t*>(t.ptr<void>());
  // Use MLLM tensor's actual bytes as dataSize to match allocated memory
  atb_tensor.dataSize = t.bytes();
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n mllm/backends/ascend/AscendCommon.cpp | sed -n '215,230p'



🏁 Script executed:

# Search for fillAtbTensorDesc definition to understand what it does
rg -n "fillAtbTensorDesc" mllm/backends/ascend/ -A 5



🏁 Script executed:

# Find all call sites of fillAtbTensor
rg -n "fillAtbTensor\(" mllm/backends/ascend/ -B 2 -A 2



🏁 Script executed:

# Check if there's a Tensor class definition to understand dtype() and device() methods
rg -n "class Tensor" -t h -t hpp



🏁 Script executed:

# Search for Tensor class definition
rg -n "class Tensor" -t c -t h --max-count=5



🏁 Script executed:

# Search for MLLM_RT_ASSERT macro definition
rg -n "MLLM_RT_ASSERT" -t c -t h --max-count=5



🏁 Script executed:

# Check if Tensor has device() and dtype() methods
rg -n "\.device\(\)|\.dtype\(\)" mllm/backends/ascend/ -B 1 -A 1 | head -30



🏁 Script executed:

# Look for Tensor definition more broadly
fd "Tensor" mllm --type f -e h | head -5



🏁 Script executed:

# Check Tensor class definition for device() and dtype() methods
cat -n mllm/core/Tensor.hpp | sed -n '53,150p'



🏁 Script executed:

# Look at one of the call sites more closely to see if there's validation before fillAtbTensor
cat -n mllm/backends/ascend/ops/AscendSoftmaxOp.cpp | sed -n '30,85p'



🏁 Script executed:

# Check AscendRMSNormOp for dtype/device validation before fillAtbTensor
cat -n mllm/backends/ascend/ops/AscendRMSNormOp.cpp | sed -n '50,70p'



🏁 Script executed:

# Check AscendLinearOp for dtype validation before fillAtbTensor
cat -n mllm/backends/ascend/ops/AscendLinearOp.cpp | sed -n '100,125p'



🏁 Script executed:

# Check AscendRMSNormOp for any validation
cat -n mllm/backends/ascend/ops/AscendRMSNormOp.cpp | sed -n '1,70p'



🏁 Script executed:

# Search for dtype() and device() method definitions in Tensor class
cat -n mllm/core/Tensor.hpp | grep -A 1 "dtype()\|device()" | head -20



🏁 Script executed:

# Check AscendLinearOp more thoroughly for dtype validation
cat -n mllm/backends/ascend/ops/AscendLinearOp.cpp | sed -n '1,100p'



🏁 Script executed:

# Check what MLLM_TYPE_F16 is and its relationship with kFloat16
rg -n "MLLM_TYPE_F16|kFloat16" mllm/core/DataTypes.hpp -A 1 | head -20



Add input validation to fillAtbTensor.

fillAtbTensor hardcodes FP16 in the descriptor and assumes a device pointer without validating the input tensor. While some call sites (SoftmaxOp, LinearOp) validate dtype before calling this function, others (RMSNormOp) do not. Add runtime checks to ensure tensors are FP16 and on the Ascend device, providing defense-in-depth against invalid tensor usage.

Suggested validation
 void fillAtbTensor(const Tensor& t, atb::Tensor& atb_tensor) {
+  MLLM_RT_ASSERT(t.device() == kAscend);
+  MLLM_RT_ASSERT(t.dtype() == MLLM_TYPE_F16);
+  MLLM_RT_ASSERT(!t.isNil());
   fillAtbTensorDesc(t, atb_tensor.desc);
-  atb_tensor.deviceData = reinterpret_cast<uint8_t*>(t.ptr<void>());
+  auto* ptr = t.ptr<void>();
+  MLLM_RT_ASSERT(ptr != nullptr);
+  atb_tensor.deviceData = reinterpret_cast<uint8_t*>(ptr);
   // Use MLLM tensor's actual bytes as dataSize to match allocated memory
   atb_tensor.dataSize = t.bytes();
 }

Comment on lines +168 to +172
if (workspaceSize > 0) {
  auto& mem_mgr = getAscendMemoryManager();
  mem_mgr.allocateBlock(static_cast<uint32_t>(workspaceSize), workspace_block_id);
  mem_mgr.getBlockPtr(workspace_block_id, workspace);
}

⚠️ Potential issue | 🟡 Minor

Same workspaceSize truncation concern as in AscendMatMulOp.

Both AscendSubOp and AscendMulOp cast uint64_t workspaceSize to uint32_t when calling allocateBlock. Consider adding the same guard suggested for AscendMatMulOp for consistency across all Ascend ops.

Also applies to: 253-257


Comment on lines +12 to +25
class AscendLinearOp final : public aops::LinearOp {
 public:
  explicit AscendLinearOp(const aops::LinearOpOptions& options);

  void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
  void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
  void reshape(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
};

class AscendLinearOpFactory final : public TypedOpFactory<OpTypes::kLinear, aops::LinearOpOptions> {
 public:
  std::shared_ptr<BaseOp> createOpImpl(const aops::LinearOpOptions& options) override {
    return std::make_shared<AscendLinearOp>(options);
  }

⚠️ Potential issue | 🟡 Minor

Add brief API docs for the public Ascend Linear types.

These public declarations should document purpose and parameters to meet the API comment requirement.

📝 Suggested doc comments
 class AscendLinearOp final : public aops::LinearOp {
  public:
+  /// Create an Ascend Linear op.
+  /// `@param` options Linear op configuration options.
   explicit AscendLinearOp(const aops::LinearOpOptions& options);
 
+  /// Prepare output metadata for the operation.
   void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
+  /// Execute Linear on inputs and write to outputs.
   void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
+  /// Reshape output tensors as needed for the operation.
   void reshape(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
 };
 
+/// Factory for AscendLinearOp.
 class AscendLinearOpFactory final : public TypedOpFactory<OpTypes::kLinear, aops::LinearOpOptions> {

As per coding guidelines, ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.


Tensor matmul(const Tensor& A, const Tensor& B, bool transpose_A = false, bool transpose_B = false,
              aops::MatMulOpType type = aops::MatMulOpType::kDefault);

Tensor linear(const Tensor& x, const Tensor& weight, const Tensor& bias = Tensor());

⚠️ Potential issue | 🟡 Minor

Add brief API docs for the new functional wrappers.

These are new public APIs; please add short doc comments describing purpose, parameters (including optional bias/epsilon/add_unit_offset), return value, and error behavior.

✏️ Suggested doc comment scaffold
+/// Linear: y = x * weight (+ bias).
+/// `@param` x Input tensor.
+/// `@param` weight Weight matrix.
+/// `@param` bias Optional bias (nil => no bias).
+/// `@return` Output tensor.
 Tensor linear(const Tensor& x, const Tensor& weight, const Tensor& bias = Tensor());
+/// RMSNorm: y = x * weight / sqrt(mean(x^2) + epsilon).
+/// `@param` x Input tensor.
+/// `@param` weight Scale tensor.
+/// `@param` epsilon Numerical stability term.
+/// `@param` add_unit_offset Whether to add unit offset.
+/// `@return` Output tensor.
 Tensor rmsNorm(const Tensor& x, const Tensor& weight, float epsilon = 1e-5f, bool add_unit_offset = false);
As per coding guidelines: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.

Also applies to: 136-136


Comment on lines +224 to +239
// 2. Create causal mask if needed
// Causal mask: mask[i, j] = 0 if j <= i, else -inf (large negative value)
Tensor mask_cpu;
if (use_mask) {
  mask_cpu = Tensor::zeros({1, 1, S_q, S_kv}, kFloat16, kCPU);
  auto* mask_ptr = mask_cpu.ptr<mllm_fp16_t>();

  // Fill causal mask: upper triangular part is masked (-inf)
  for (int32_t i = 0; i < S_q; ++i) {
    for (int32_t j = 0; j < S_kv; ++j) {
      int32_t offset = S_kv - S_q;
      if (j > i + offset) {
        mask_ptr[i * S_kv + j] = MLLM_FP32_TO_FP16(-10000.0f);
      }
    }
  }

⚠️ Potential issue | 🟡 Minor

Guard causal mask generation against S_kv < S_q.

With offset = S_kv - S_q, a negative offset can mask entire rows, leaving every score in those rows at -inf and producing NaNs after softmax. Add a guard when use_mask is true (and mirror the same guard in the GQA mask block at lines 339–352).

✅ Suggested guard
       if (use_mask) {
+        MLLM_RT_ASSERT_EQ(S_kv >= S_q, true);
         mask_cpu = Tensor::zeros({1, 1, S_q, S_kv}, kFloat16, kCPU);
         auto* mask_ptr = mask_cpu.ptr<mllm_fp16_t>();
🤖 Prompt for AI Agents
In `@tests/ascend/AscendAttentionKernelTest.hpp` around lines 224 - 239, The
causal mask generation in the block that creates mask_cpu when use_mask is true
uses offset = S_kv - S_q and will produce incorrect masking when S_kv < S_q
(negative offset) causing entire rows to become -inf; update the logic in the
mask_cpu creation (the loops that write mask_ptr) to skip creating the
upper-triangular mask when S_kv < S_q (e.g., guard early and leave mask as zeros
or compute only when S_kv >= S_q), and apply the same guard to the GQA mask
block referenced at the GQA mask generation (mirror the check around its mask
write loops) so no rows are fully masked when S_kv < S_q.

Comment on lines +24 to +90
bool LinearFloat16Test(const std::vector<std::tuple<mllm::Tensor::shape_t, int, int>>& test_cases) {
using namespace mllm; // NOLINT
for (auto& test_case : test_cases) {
auto input_shape = std::get<0>(test_case);
int in_channels = std::get<1>(test_case);
int out_channels = std::get<2>(test_case);

std::cout << "[LinearTest] Testing shape=[";
for (size_t i = 0; i < input_shape.size(); ++i) {
std::cout << input_shape[i] << (i < input_shape.size() - 1 ? ", " : "");
}
std::cout << "], in=" << in_channels << ", out=" << out_channels << std::endl;

// 1. Construct random FP16 inputs on CPU
// x: [M, K] where K = in_channels
Tensor x_cpu = Tensor::random(input_shape, -1, 1, kFloat16, kCPU);

// Weight shape for ATB: [K, N] where K=in_channels, N=out_channels
Tensor weight_cpu = Tensor::random({in_channels, out_channels}, -0.5, 0.5, kFloat16, kCPU);

// 2. Compute reference result on CPU
// y = x @ weight, where x is [M, K], weight is [K, N], output is [M, N]
auto output_shape = input_shape;
output_shape[output_shape.size() - 1] = out_channels;
Tensor ref_cpu = Tensor::zeros(output_shape, kFloat16, kCPU);

{
auto* x_ptr = x_cpu.ptr<mllm_fp16_t>();
auto* w_ptr = weight_cpu.ptr<mllm_fp16_t>();
auto* r_ptr = ref_cpu.ptr<mllm_fp16_t>();

size_t batch_size = 1;
for (size_t i = 0; i < input_shape.size() - 1; ++i) {
batch_size *= input_shape[i];
}

for (size_t b = 0; b < batch_size; ++b) {
for (int o = 0; o < out_channels; ++o) {
float sum = 0.0f;
for (int i = 0; i < in_channels; ++i) {
float x_val = MLLM_FP16_TO_FP32(x_ptr[b * in_channels + i]);
float w_val = MLLM_FP16_TO_FP32(w_ptr[i * out_channels + o]); // weight is [K, N]
sum += x_val * w_val;
}
r_ptr[b * out_channels + o] = MLLM_FP32_TO_FP16(sum);
}
}
}

// 3. Move inputs to Ascend and run Linear via matmul
auto x_ascend = x_cpu.to(kAscend);
auto weight_ascend = weight_cpu.to(kAscend);

// Use matmul: y = x @ weight
auto y_ascend = nn::functional::matmul(x_ascend, weight_ascend, false, false);

// 4. Move result back to CPU and compare with reference
auto y_cpu = y_ascend.to(kCPU);
auto result = mllm::test::allClose(y_cpu, ref_cpu, 1e-2f, 1e-2f);
if (!result.is_close) {
std::cout << "[LinearTest] FAILED!" << std::endl;
return false;
}
std::cout << "[LinearTest] PASSED" << std::endl;
}
return true;
}

⚠️ Potential issue | 🟡 Minor

Guard test cases against shape/channel mismatches.
The reference math indexes using in_channels and assumes input_shape.back() == in_channels; add an early check to avoid OOB access on malformed cases.

🧪 Suggested guard
       int in_channels = std::get<1>(test_case);
       int out_channels = std::get<2>(test_case);
+
+      if (input_shape.empty() || input_shape.back() != static_cast<size_t>(in_channels)) {
+        std::cout << "[LinearTest] Invalid test case: last dim != in_channels" << std::endl;
+        return false;
+      }
@@
       int in_channels = std::get<1>(test_case);
       int out_channels = std::get<2>(test_case);
+
+      if (input_shape.empty() || input_shape.back() != static_cast<size_t>(in_channels)) {
+        std::cout << "[LinearWithBiasTest] Invalid test case: last dim != in_channels" << std::endl;
+        return false;
+      }

Also applies to: 93-163

🤖 Prompt for AI Agents
In `@tests/ascend/AscendLinearKernelTest.hpp` around lines 24 - 90, Add a guard at
the start of LinearFloat16Test’s per-case loop to validate input_shape is
non-empty and that input_shape.back() == in_channels (and that in_channels>0 and
out_channels>0) to prevent out-of-bounds reads when computing the CPU reference;
if the check fails, log a clear message (using the existing "[LinearTest]"
prefix) and return false (or skip the case as desired) so malformed test cases
do not cause invalid memory access in the loops that use x_ptr and w_ptr.
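A minimal sketch of that per-case guard, using a plain `std::vector<size_t>` as a stand-in for `mllm::Tensor::shape_t`:

```cpp
#include <cstddef>
#include <vector>

// Validate a Linear test tuple before any raw-pointer indexing: the shape
// must be non-empty, both channel counts positive, and the last dimension
// must equal in_channels (the contraction axis of the reference matmul).
bool isValidLinearCase(const std::vector<size_t>& input_shape,
                       int in_channels, int out_channels) {
  if (input_shape.empty() || in_channels <= 0 || out_channels <= 0) {
    return false;
  }
  return input_shape.back() == static_cast<size_t>(in_channels);
}
```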


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@mllm/backends/ascend/ops/AscendLinearOp.cpp`:
- Around line 103-167: The Setup and Execute failure paths leak resources:
ensure the ATB operation and any allocated workspace block are cleaned up before
calling MLLM_ERROR_EXIT; specifically, when st != atb::NO_ERROR after op->Setup
or op->Execute, call atb::DestroyOperation(op) and if workspace_block_id != -1
call getAscendMemoryManager().freeBlock(workspace_block_id) (and null out
workspace if needed) before invoking MLLM_ERROR_EXIT — or wrap op and workspace
allocation in an RAII/scope-guard so atb::DestroyOperation(op) and
mem_mgr.freeBlock(...) run on all exit paths.

In `@mllm/backends/ascend/ops/AscendRMSNormOp.cpp`:
- Around line 31-43: The fallback to weight_ needs explicit validation: before
calling fillAtbTensor(...) validate that weight (the variable resolved in the
snippet as weight and weight_for_atb) is initialized/non-empty when
inputs.size()==1, that its dtype matches x.dtype(), and that its length equals
the expected size (typically x.size(x.dim()-1) or the last dimension length) —
if any check fails, return the same NYI/error path used for x/y mismatches;
perform these checks immediately after resolving weight and before using
weight_for_atb.
- Around line 1-18: In forward(), add validation for the weight tensor whenever
it is used (both when inputs.size() == 1 using member weight_ and when
inputs.size() == 2 using the second input): ensure the weight is 1D, its length
equals x.shape().back() (the last dimension of input x), and its dtype equals
x.dtype(); return or error if these checks fail. Replace the raw atb::Operation*
op with an RAII wrapper (e.g., std::unique_ptr<atb::Operation, custom_deleter>
or a small scope-guard class) so the operation is always destroyed via its
destroy function on all exit paths (including where MLLM_ERROR_EXIT is called)
to guarantee proper cleanup in forward().

In `@mllm/backends/ascend/ops/AscendSoftmaxOp.cpp`:
- Around line 94-107: The code reads workspaceSize (uint64_t) from op->Setup
then passes it to mem_mgr.allocateBlock as a uint32_t, risking truncation for
sizes > UINT32_MAX; add an explicit guard after op->Setup to check if
workspaceSize > UINT32_MAX and handle it (return/log an error via
MLLM_ERROR_EXIT or switch to a 64-bit allocation path), avoiding calling
mem_mgr.allocateBlock/getBlockPtr with a truncated size and leaving
workspace_block_id invalid; reference symbols: workspaceSize, op->Setup,
allocateBlock, getBlockPtr, workspace_block_id.

In `@tests/ascend/AscendSoftmaxKernelTest.hpp`:
- Around line 34-47: The code computes pos_axis and then indexes shape[pos_axis]
without validating axis or ndim; add a guard at the top of the block (before
computing pos_axis or accessing shape) to ensure ndim > 0 and axis is within
[-ndim, ndim-1], and fail fast if not (e.g., assert/EXPECT/throw
std::invalid_argument) so that variables pos_axis, outer_size, and axis_size are
only computed when shape[pos_axis] is safe to access.
🧹 Nitpick comments (2)
tests/ascend/AscendSoftmaxKernelTest.hpp (1)

67-75: Reuse exp_vals to avoid per-slice allocations.
Allocating exp_vals inside the inner loop adds avoidable overhead for large shapes. Reuse a single buffer per (shape, axis).

♻️ Suggested refactor
-          for (size_t outer = 0; outer < outer_size; ++outer) {
+          std::vector<float> exp_vals(axis_size);
+          for (size_t outer = 0; outer < outer_size; ++outer) {
             for (size_t inner = 0; inner < inner_size; ++inner) {
               auto idx_at = [&](size_t i) -> size_t {
                 return (outer * axis_size + i) * inner_size + inner;
               };
@@
-              std::vector<float> exp_vals(axis_size);
               for (size_t i = 0; i < axis_size; ++i) {
                 size_t idx = idx_at(i);
                 float val = MLLM_FP16_TO_FP32(x_ptr[idx]);
                 exp_vals[i] = std::exp(val - max_val);
                 sum_exp += exp_vals[i];
               }

As per coding guidelines, Avoid unnecessary object creation in loops or hot paths.
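The hoisted-buffer refactor can be sketched on plain floats (illustrative names; the real test operates on `mllm_fp16_t` data):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax over a middle axis of a flattened
// [outer, axis, inner] buffer. exp_vals is allocated once and reused by
// every (outer, inner) slice instead of inside the inner loop.
void softmaxAxis(std::vector<float>& data, size_t outer, size_t axis,
                 size_t inner) {
  std::vector<float> exp_vals(axis);  // single allocation, reused per slice
  for (size_t o = 0; o < outer; ++o) {
    for (size_t in = 0; in < inner; ++in) {
      auto at = [&](size_t i) { return (o * axis + i) * inner + in; };
      float max_val = data[at(0)];
      for (size_t i = 1; i < axis; ++i) max_val = std::max(max_val, data[at(i)]);
      float sum = 0.0f;
      for (size_t i = 0; i < axis; ++i) {
        exp_vals[i] = std::exp(data[at(i)] - max_val);
        sum += exp_vals[i];
      }
      for (size_t i = 0; i < axis; ++i) data[at(i)] = exp_vals[i] / sum;
    }
  }
}
```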

mllm/backends/ascend/ops/AscendRMSNormOp.cpp (1)

49-104: Consider RAII guards for ATB op/workspace cleanup on early exit.

If MLLM_ERROR_EXIT behavior ever changes from process-exit to exception/return, op and workspace could leak. A small RAII guard keeps cleanup correct and future-proof.

♻️ Suggested RAII pattern
-  atb::Operation* op = nullptr;
-  auto st = atb::CreateOperation(rmsNormParam, &op);
+  atb::Operation* op = nullptr;
+  auto st = atb::CreateOperation(rmsNormParam, &op);
   if (st != atb::NO_ERROR || op == nullptr) {
     MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB CreateOperation(RMS_NORM) failed, status={}", static_cast<int>(st));
   }
+  auto op_guard = std::unique_ptr<atb::Operation, decltype(&atb::DestroyOperation)>(op, &atb::DestroyOperation);

   atb::Context* atb_ctx = getGlobalAtbContext();

@@
-  void* workspace = nullptr;
-  int workspace_block_id = -1;
+  void* workspace = nullptr;
+  int workspace_block_id = -1;
+  auto& mem_mgr = getAscendMemoryManager();
+  struct WorkspaceGuard {
+    AscendMemoryManager* mgr;
+    int id;
+    ~WorkspaceGuard() { if (mgr && id != -1) mgr->freeBlock(id); }
+  } ws_guard{&mem_mgr, -1};

   if (workspaceSize > 0) {
-    auto& mem_mgr = getAscendMemoryManager();
     mem_mgr.allocateBlock(static_cast<uint32_t>(workspaceSize), workspace_block_id);
     mem_mgr.getBlockPtr(workspace_block_id, workspace);
+    ws_guard.id = workspace_block_id;
   }

@@
-  if (workspace_block_id != -1) {
-    auto& mem_mgr = getAscendMemoryManager();
-    mem_mgr.freeBlock(workspace_block_id);
-  }
-
-  atb::DestroyOperation(op);
+  // op and workspace are released by guards

As per coding guidelines, Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).

Comment on lines +103 to +167
atb::Operation* op = nullptr;
auto st = atb::CreateOperation(linearParam, &op);
if (st != atb::NO_ERROR || op == nullptr) {
MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB CreateOperation(Linear) failed, status={}", static_cast<int>(st));
}

atb::Context* atb_ctx = getGlobalAtbContext();

atb::Tensor atb_x;
atb::Tensor atb_weight;
atb::Tensor atb_y;
atb::Tensor atb_bias;

fillAtbTensor(x, atb_x);
fillAtbTensor(*weight_ptr, atb_weight);
fillAtbTensor(y, atb_y);

atb::SVector<atb::Tensor> inTensors;
atb::SVector<atb::Tensor> outTensors;
inTensors.push_back(atb_x);
inTensors.push_back(atb_weight);

if (bias_ptr != nullptr) {
fillAtbTensor(*bias_ptr, atb_bias);
inTensors.push_back(atb_bias);
}

outTensors.push_back(atb_y);

atb::VariantPack vp;
vp.inTensors = inTensors;
vp.outTensors = outTensors;

uint64_t workspaceSize = 0;
st = op->Setup(vp, workspaceSize, atb_ctx);
if (st != atb::NO_ERROR) {
MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB LinearOp Setup failed, status={}", static_cast<int>(st));
}

void* workspace = nullptr;
int workspace_block_id = -1;
if (workspaceSize > 0) {
auto& mem_mgr = getAscendMemoryManager();
mem_mgr.allocateBlock(static_cast<uint32_t>(workspaceSize), workspace_block_id);
mem_mgr.getBlockPtr(workspace_block_id, workspace);
}

{
ASCEND_TIME_SCOPE("AscendLinearOp::forward");
st = op->Execute(vp, reinterpret_cast<uint8_t*>(workspace), workspaceSize, atb_ctx);
}

if (st != atb::NO_ERROR) {
MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB LinearOp Execute failed, status={}", static_cast<int>(st));
}

syncGlobalAtbStream();

if (workspace_block_id != -1) {
auto& mem_mgr = getAscendMemoryManager();
mem_mgr.freeBlock(workspace_block_id);
}

atb::DestroyOperation(op);
}

⚠️ Potential issue | 🟠 Major

Resource leak on error paths: ATB operation and workspace not cleaned up.

If op->Setup fails (line 138-140) or op->Execute fails (line 155-157), MLLM_ERROR_EXIT is called without first destroying the ATB operation or freeing the workspace memory. This leaks resources on error paths.

🛡️ Suggested fix using cleanup before exit
   st = op->Setup(vp, workspaceSize, atb_ctx);
   if (st != atb::NO_ERROR) {
+    atb::DestroyOperation(op);
     MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB LinearOp Setup failed, status={}", static_cast<int>(st));
   }

   void* workspace = nullptr;
   int workspace_block_id = -1;
   if (workspaceSize > 0) {
     auto& mem_mgr = getAscendMemoryManager();
     mem_mgr.allocateBlock(static_cast<uint32_t>(workspaceSize), workspace_block_id);
     mem_mgr.getBlockPtr(workspace_block_id, workspace);
   }

   {
     ASCEND_TIME_SCOPE("AscendLinearOp::forward");
     st = op->Execute(vp, reinterpret_cast<uint8_t*>(workspace), workspaceSize, atb_ctx);
   }

   if (st != atb::NO_ERROR) {
+    if (workspace_block_id != -1) {
+      getAscendMemoryManager().freeBlock(workspace_block_id);
+    }
+    atb::DestroyOperation(op);
     MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB LinearOp Execute failed, status={}", static_cast<int>(st));
   }

Alternatively, consider using an RAII wrapper or a scope guard to ensure cleanup on all exit paths.

As per coding guidelines: Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).

🤖 Prompt for AI Agents
In `@mllm/backends/ascend/ops/AscendLinearOp.cpp` around lines 103 - 167, The
Setup and Execute failure paths leak resources: ensure the ATB operation and any
allocated workspace block are cleaned up before calling MLLM_ERROR_EXIT;
specifically, when st != atb::NO_ERROR after op->Setup or op->Execute, call
atb::DestroyOperation(op) and if workspace_block_id != -1 call
getAscendMemoryManager().freeBlock(workspace_block_id) (and null out workspace
if needed) before invoking MLLM_ERROR_EXIT — or wrap op and workspace allocation
in an RAII/scope-guard so atb::DestroyOperation(op) and mem_mgr.freeBlock(...)
run on all exit paths.
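A minimal scope-guard sketch of the RAII alternative mentioned above; in the real op the guard body would call `atb::DestroyOperation(op)` and `mem_mgr.freeBlock(workspace_block_id)`, while here a counter stands in for those calls so the sketch stays self-contained:

```cpp
#include <utility>

// Generic scope guard: runs its callback on every exit path unless
// dismissed, guaranteeing cleanup even on early error returns.
template <typename F>
class ScopeGuard {
 public:
  explicit ScopeGuard(F f) : f_(std::move(f)) {}
  ~ScopeGuard() {
    if (armed_) f_();
  }
  void dismiss() { armed_ = false; }
  ScopeGuard(const ScopeGuard&) = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;

 private:
  F f_;
  bool armed_ = true;
};

// Stand-in for an op that may fail after acquiring resources; cleanups
// counts how many times the guard fired.
int runWithGuard(bool fail_early, int& cleanups) {
  ScopeGuard guard([&] { ++cleanups; });  // e.g. DestroyOperation + freeBlock
  if (fail_early) return -1;  // error path: guard still fires
  return 0;                   // success path: guard fires here too
}
```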

Comment on lines +1 to +18
// Copyright (c) MLLM Team.
// Licensed under the MIT License.

#include "mllm/backends/ascend/ops/AscendRMSNormOp.hpp"

#include <acl/acl.h>
#include <iostream>
#include <atb/atb_infer.h>
#include <atb/types.h>
#include <atb/utils.h>
#include <atb/infer_op_params.h>

#include "mllm/utils/Common.hpp"
#include "mllm/core/DataTypes.hpp"
#include "mllm/core/Tensor.hpp"
#include "mllm/backends/ascend/memory/AscendMemoryManager.hpp"
#include "mllm/backends/ascend/AscendCommon.hpp"


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
from pathlib import Path

path = Path("mllm/backends/ascend/ops/AscendRMSNormOp.cpp")
data = path.read_bytes()
errors = []

if data.startswith(b"\xef\xbb\xbf"):
    errors.append("UTF-8 BOM detected")

bad = [b for b in data if (b < 0x20 and b not in (0x09, 0x0A)) or (0x7F <= b <= 0x9F)]
if bad:
    errors.append(f"Disallowed control codes: {sorted(set(bad))}")

if b"\r" in data:
    errors.append("CR characters detected (not LF-only)")

lines = data.split(b"\n")
tw = [i + 1 for i, l in enumerate(lines) if l.rstrip(b" \t") != l]
if tw:
    errors.append(f"Trailing whitespace on lines: {tw[:10]}")

print("OK" if not errors else "FAIL: " + "; ".join(errors))
PY

Repository: UbiquitousLearning/mllm

Length of output: 69


🏁 Script executed:

cat -n mllm/backends/ascend/ops/AscendRMSNormOp.cpp

Repository: UbiquitousLearning/mllm

Length of output: 4094


Encoding compliant; add weight tensor validation and use RAII for operation cleanup.

Encoding and whitespace constraints verified as compliant.

However, the forward() method has two issues:

  1. Missing weight tensor validation: When inputs.size() == 1, the code uses weight_ member variable without validating its shape or dtype compatibility with input x. Similarly, when inputs.size() == 2, the weight tensor is not validated. Per guidelines, inputs to critical functions must be validated. Add checks for weight shape (must be 1D and match x's last dimension) and dtype (must match x's dtype).

  2. Resource cleanup not following RAII: The atb::Operation* op is created at line 50 but only destroyed at line 103 in success paths. If MLLM_ERROR_EXIT is called at lines 52, 78, or 93, the operation is not properly destroyed. Wrap the operation in a smart pointer (e.g., custom deleter or RAII wrapper) to ensure cleanup even during early exit or exceptions.

🧰 Tools
🪛 Clang (14.0.6)

[error] 4-4: 'mllm/backends/ascend/ops/AscendRMSNormOp.hpp' file not found

(clang-diagnostic-error)

🤖 Prompt for AI Agents
In `@mllm/backends/ascend/ops/AscendRMSNormOp.cpp` around lines 1 - 18, In
forward(), add validation for the weight tensor whenever it is used (both when
inputs.size() == 1 using member weight_ and when inputs.size() == 2 using the
second input): ensure the weight is 1D, its length equals x.shape().back() (the
last dimension of input x), and its dtype equals x.dtype(); return or error if
these checks fail. Replace the raw atb::Operation* op with an RAII wrapper
(e.g., std::unique_ptr<atb::Operation, custom_deleter> or a small scope-guard
class) so the operation is always destroyed via its destroy function on all exit
paths (including where MLLM_ERROR_EXIT is called) to guarantee proper cleanup in
forward().

Comment on lines +31 to +43
const auto& x = inputs[0];
const auto& weight = (inputs.size() == 2) ? inputs[1] : weight_;
auto& y = outputs[0];

const Tensor& weight_for_atb = weight;

if (x.dtype() != y.dtype()) {
NYI("AscendRMSNormOp currently requires x/y have same dtype");
}
if (x.numel() != y.numel()) {
NYI("AscendRMSNormOp requires x/y have same numel");
}


⚠️ Potential issue | 🟠 Major

Add explicit validation for weight when using the fallback path.

Line 32 falls back to weight_ when only one input is provided, but there’s no guard that weight_ is initialized and compatible with x (dtype and expected length). This can lead to ATB failures or incorrect results. Please add checks before fillAtbTensor(...) to ensure the weight is present and matches the expected size (typically the last dimension of x) and dtype.

As per coding guidelines, Validate inputs for public APIs and critical internal functions.

🤖 Prompt for AI Agents
In `@mllm/backends/ascend/ops/AscendRMSNormOp.cpp` around lines 31 - 43, The
fallback to weight_ needs explicit validation: before calling fillAtbTensor(...)
validate that weight (the variable resolved in the snippet as weight and
weight_for_atb) is initialized/non-empty when inputs.size()==1, that its dtype
matches x.dtype(), and that its length equals the expected size (typically
x.size(x.dim()-1) or the last dimension length) — if any check fails, return the
same NYI/error path used for x/y mismatches; perform these checks immediately
after resolving weight and before using weight_for_atb.

Comment on lines +94 to +107
uint64_t workspaceSize = 0;
st = op->Setup(vp, workspaceSize, atb_ctx);
if (st != atb::NO_ERROR) {
MLLM_ERROR_EXIT(ExitCode::kAscendError,
"ATB SoftmaxOp Setup failed, status={}",
static_cast<int>(st));
}

void* workspace = nullptr;
int workspace_block_id = -1;
if (workspaceSize > 0) {
auto& mem_mgr = getAscendMemoryManager();
mem_mgr.allocateBlock(static_cast<uint32_t>(workspaceSize), workspace_block_id);
mem_mgr.getBlockPtr(workspace_block_id, workspace);

⚠️ Potential issue | 🟠 Major

Guard against workspace size truncation.
workspaceSize is uint64_t, but allocation truncates to uint32_t. If ATB returns >4 GB, you’ll under-allocate and risk overwrite when Execute uses the full size. Add an explicit guard (or upgrade the allocator to 64-bit).

🛠️ Proposed fix
 #include <atb/types.h>
 #include <atb/utils.h>
 #include <atb/infer_op_params.h>
+#include <limits>
@@
   if (workspaceSize > 0) {
+    if (workspaceSize > std::numeric_limits<uint32_t>::max()) {
+      MLLM_ERROR_EXIT(ExitCode::kAscendError,
+                      "AscendSoftmaxOp: workspaceSize {} exceeds 32-bit limit",
+                      workspaceSize);
+    }
     auto& mem_mgr = getAscendMemoryManager();
     mem_mgr.allocateBlock(static_cast<uint32_t>(workspaceSize), workspace_block_id);
     mem_mgr.getBlockPtr(workspace_block_id, workspace);
   }
🤖 Prompt for AI Agents
In `@mllm/backends/ascend/ops/AscendSoftmaxOp.cpp` around lines 94 - 107, The code
reads workspaceSize (uint64_t) from op->Setup then passes it to
mem_mgr.allocateBlock as a uint32_t, risking truncation for sizes > UINT32_MAX;
add an explicit guard after op->Setup to check if workspaceSize > UINT32_MAX and
handle it (return/log an error via MLLM_ERROR_EXIT or switch to a 64-bit
allocation path), avoiding calling mem_mgr.allocateBlock/getBlockPtr with a
truncated size and leaving workspace_block_id invalid; reference symbols:
workspaceSize, op->Setup, allocateBlock, getBlockPtr, workspace_block_id.
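The narrowing check itself is tiny; a standalone sketch (the real code would call MLLM_ERROR_EXIT instead of returning false):

```cpp
#include <cstdint>
#include <limits>

// True iff a 64-bit workspace size fits the 32-bit allocator interface,
// i.e. the static_cast<uint32_t> narrowing would be lossless.
bool fitsAllocator32(uint64_t workspace_size) {
  return workspace_size <= std::numeric_limits<uint32_t>::max();
}
```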

Comment on lines +34 to +47
// Convert axis to positive index
int ndim = static_cast<int>(shape.size());
int pos_axis = axis;
if (pos_axis < 0) {
pos_axis = ndim + pos_axis;
}

size_t outer_size = 1;
for (int i = 0; i < pos_axis; ++i) {
outer_size *= shape[i];
}

size_t axis_size = shape[pos_axis];


⚠️ Potential issue | 🟡 Minor

Validate axis range before indexing shape[pos_axis].
If axis is out of range (e.g., axis >= ndim or axis < -ndim) or ndim == 0, shape[pos_axis] is OOB and the test will crash. Add a guard and fail fast.

🛠️ Proposed fix
           int ndim = static_cast<int>(shape.size());
           int pos_axis = axis;
           if (pos_axis < 0) {
             pos_axis = ndim + pos_axis;
           }
+          if (ndim == 0 || pos_axis < 0 || pos_axis >= ndim) {
+            return false;
+          }

As per coding guidelines, Validate inputs for public APIs and critical internal functions.

🤖 Prompt for AI Agents
In `@tests/ascend/AscendSoftmaxKernelTest.hpp` around lines 34 - 47, The code
computes pos_axis and then indexes shape[pos_axis] without validating axis or
ndim; add a guard at the top of the block (before computing pos_axis or
accessing shape) to ensure ndim > 0 and axis is within [-ndim, ndim-1], and fail
fast if not (e.g., assert/EXPECT/throw std::invalid_argument) so that variables
pos_axis, outer_size, and axis_size are only computed when shape[pos_axis] is
safe to access.
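The axis normalization with a fail-fast guard can be sketched as one helper (illustrative; the test could instead assert or throw):

```cpp
// Normalize a possibly-negative axis into [0, ndim) or return -1 when the
// axis is out of range (or ndim is 0), so callers can bail out before
// indexing shape[pos_axis].
int normalizeAxis(int axis, int ndim) {
  if (ndim <= 0) return -1;
  const int pos = (axis < 0) ? ndim + axis : axis;
  return (pos >= 0 && pos < ndim) ? pos : -1;
}
```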

@chenghuaWang chenghuaWang changed the title Add some new Ascend Ops feat(Ascend): Add some new Ascend Ops Jan 31, 2026

@chenghuaWang chenghuaWang left a comment


LGTM

@chenghuaWang chenghuaWang merged commit b77f0af into UbiquitousLearning:main Jan 31, 2026
4 checks passed