
feat(qwen3, cpu): add support for Qwen3 model on x86 architecture#561

Merged
chenghuaWang merged 8 commits into UbiquitousLearning:main from HayzelHan:feat/x86-qwen3
Dec 13, 2025

Conversation

@HayzelHan (Contributor) commented Dec 5, 2025

  • Update Qwen3 model implementation with FlashAttention2 support and enable compilation options
  • Fix rmsnorm and softmax operators and refine quantization parameters
  • Add support for elementwise operator
  • Add NYI messages for unsupported operations

Summary by CodeRabbit

  • New Features

    • Enabled additional example models (Qwen3, Qwen2 variants, LLaMA, DeepSeek OCR).
    • Added runtime-dispatched FP32 elementwise operations (vector and scalar) with x86 support.
  • Improvements

    • Added x86 build/test workflow.
    • Optimized elementwise and softmax/rmsnorm memory access for unaligned loads/stores and x86 paths.
    • Introduced x86-specific execution paths and NYI fallbacks where applicable.
  • Documentation

    • Added comprehensive x86 CPU backend guide.


@coderabbitai bot (Contributor) commented Dec 5, 2025

Walkthrough

Introduces HWY-based dynamic dispatch and x86-focused FP32 element-wise kernels (vector and scalar), adds kernel dispatch headers/impl, switches some x86 memory accesses to unaligned variants, updates quantize FP16 type, gates several ops by architecture, enables example builds, and adds an x86 CI workflow.
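The dispatch flow described above can be sketched in heavily simplified form as a function-pointer dispatcher. The names `EwAddFn` and `select_ew_add` below are hypothetical illustrations; Highway's real mechanism re-includes the kernel header via foreach_target.h and resolves the best compiled target with HWY_EXPORT/HWY_DYNAMIC_DISPATCH, which this sketch does not reproduce.

```cpp
#include <cstddef>

// Hypothetical sketch of runtime kernel dispatch, loosely mirroring the
// call_elewise_*_fp32 wrappers described above. A plain function pointer
// stands in for Highway's target table.
using EwAddFn = void (*)(float* out, const float* x, const float* y, size_t n);

namespace scalar {
// Portable fallback: one element per iteration.
inline void ew_add_fp32(float* out, const float* x, const float* y, size_t n) {
  for (size_t i = 0; i < n; ++i) out[i] = x[i] + y[i];
}
}  // namespace scalar

// Select an implementation once; a real dispatcher would probe CPU features
// (SSE4/AVX2/AVX-512) and pick the widest supported kernel.
inline EwAddFn select_ew_add() {
  return &scalar::ew_add_fp32;  // assumption: always use the scalar fallback here
}

// Public wrapper analogous in spirit to call_elewise_add_fp32.
inline void call_elewise_add_fp32(float* out, const float* x, const float* y, size_t n) {
  static const EwAddFn fn = select_ew_add();  // resolved on first call
  fn(out, x, y, n);
}
```

The key property this models is that callers pay the selection cost once and every later call goes straight through the cached pointer.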

Changes

Cohort / File(s) Summary
Build System / Examples
examples/CMakeLists.txt, examples/qwen3/main.cpp
Uncommented several example subdirs and switched Qwen3 include from modeling_qwen3.hpp to modeling_qwen3_fa2.hpp.
Kernel Dispatch API
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp, mllm/backends/cpu/kernels/common/kernel_dispatch.cpp
New HWY dynamic-dispatch header and implementation exposing/exporting FP32 elementwise vector and scalar dispatch wrappers (call_elewise_*_fp32) in mllm::cpu::common.
Element-wise Implementations
mllm/backends/cpu/kernels/common/elewise-inl.hpp
Renamed internal routines; added FP32 out-parameter elementwise functions, scalar right-hand-side implementations, and operator functors; added FP32 scalar wrappers.
Kernel Includes
mllm/backends/cpu/kernels/Kernels.hpp
Added includes for kernel_dispatch.hpp in platform-free and other include regions.
Elewise Dispatch/Usage
mllm/backends/cpu/ops/ElewiseOps.cpp
Added x86/x86_64 branches preferring HWY-accelerated FP32 elementwise and scalar paths, with ARM as fallback; NYI stubs for several ops on x86.
Kernel Integration
mllm/backends/cpu/kernels/common/*
Integrated elewise-inl.hpp via the new dispatch mechanism; influenced kernel compilation per-target.
x86 Memory Access
mllm/backends/cpu/kernels/x86/rmsnorm.cpp, mllm/backends/cpu/kernels/x86/softmax.cpp
Replaced aligned Load/Store intrinsics with unaligned variants (LoadU/StoreU).
Quantize Type Update
mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp
Changed lookup_fp16_to_fp32 parameter from uint16_t to mllm_fp16_t.
Linear & Transpose Ops
mllm/backends/cpu/ops/LinearOp.cpp, mllm/backends/cpu/ops/TransposeOp.cpp
Added architecture guards: x86 emits NYI for some impl types/variants; ARM paths preserved.
CI/CD
.github/workflows/build-x86.yml
New GitHub Actions workflow to test Linux x86_64 compilation using tasks/build_x86.yaml.
Docs
docs/cpu_backend/x86/index.rst
Added comprehensive documentation for the X86 CPU backend (overview, build/run, performance).

Sequence Diagram(s)

sequenceDiagram
    participant Op as ElewiseOps (caller)
    participant Disp as kernel_dispatch (call_elewise_*_fp32)
    participant Impl as HWY target impl
    participant Mem as Memory/input buffers

    Op->>Disp: call_elewise_add_fp32(out, x, y, n)
    Disp->>Impl: dispatch to selected HWY target (per-target include)
    Impl->>Mem: LoadU x/y segments
    Impl->>Impl: vectorized compute (add / scalar op)
    Impl->>Mem: StoreU results to out
    Disp-->>Op: return

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Areas needing extra attention:
    • elewise-inl.hpp: scalar operator correctness, tail handling, and API changes (out-parameter semantics).
    • kernel_dispatch.*: HWY per-target re-inclusion pattern and exported symbol correctness.
    • Call sites across the codebase: ensure all callers use new call_elewise_*_fp32 signatures.
    • Architecture guards in ElewiseOps.cpp, LinearOp.cpp, TransposeOp.cpp and NYI placement.
    • Softmax/rmsnorm LoadU/StoreU changes for correctness on unaligned inputs.

Possibly related PRs

Suggested reviewers

  • xumengwei
  • liang1232018
  • yirongjie
  • oreomaker

Poem

🐇 Hop, hop — a highway made of code,

Vectors dance where scalars strode,
Dispatch calls, the kernels hum,
x86 joins, the rabbits drum,
Build and run — the SIMD road.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required 80.00% threshold. Run @coderabbitai generate docstrings to improve coverage.
  • Linked Issues — ❓ Inconclusive: PR 561 aligns with adding x86 architecture support, but no explicit issue links or tracking references are documented. Link relevant issues or the epic tracking x86 support to improve traceability and context.
✅ Passed checks (3 passed)
  • Title — ✅ Passed: the title accurately reflects the main feature addition (Qwen3 model support on x86 architecture), the primary focus of this changeset.
  • Description — ✅ Passed: the description provides a clear bullet-point summary of key changes (Qwen3 with FlashAttention2, operator fixes, elementwise support, NYI messages), but lacks the structured template format recommended in the repository guidelines.
  • Out of Scope Changes — ✅ Passed: changes are well-scoped; Qwen3 model updates, operator fixes, elementwise support, and x86 compilation enablement align with the stated PR objectives.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
mllm/backends/cpu/ops/TransposeOp.cpp (3)

26-53: Align x86 behaviour and int64 handling across transpose special-cases

The new x86/x86_64 branches for kFloat32/kFloat16 in CASE 1–3 correctly surface unsupported patterns via NYI, while ARM keeps using the optimized kernels. However, for kInt64 in these cases there is still only an ARM #if and no explicit x86 handling. On an x86 build, those kInt64 cases effectively become:

case kInt64: {
  // no call
  break;
}

which silently turns the transpose into a no-op for int64 tensors instead of failing fast.

To avoid silent data corruption when an int64 transpose is accidentally used on x86, consider adding explicit x86 branches (or wrapping the whole case kInt64 in the same #if/#elif structure) so that x86 either:

  • Calls a generic non-ARM fallback, or
  • Emits a clear NYI("Transpose op(...) int64 not supported in x86").

This same pattern applies to CASE 1 (HW→WH), CASE 2 (BSHD→BHSD) and CASE 3 (last-two-dims) kInt64 branches.

Also applies to: 55-85, 87-121


87-91: Fix batch accumulation bug in CASE 3 (last-dims transpose)

In CASE 3 you compute batch as:

int batch = 0;
for (int i = 0; i < input_shape.size() - 2; i++) { batch *= input_shape[i]; }

Because batch starts at 0 and is only multiplied, batch is always 0, so the transpose_last_dims_* kernels will never process any batch element. This makes the entire CASE 3 path a no-op for all dtypes/architectures.

Initialize batch to 1 instead so it correctly accumulates the product of the leading dimensions, e.g.:

-    int batch = 0;
-    for (int i = 0; i < input_shape.size() - 2; i++) { batch *= input_shape[i]; }
+    int batch = 1;
+    for (int i = 0; i < input_shape.size() - 2; i++) { batch *= input_shape[i]; }

This is independent of the existing FIXME about (dim0 + dim1 == 1) and should be fixed regardless.
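The bug and its fix come down to seeding the product with the multiplicative identity. A tiny standalone helper (leading_dims_product is a hypothetical name for illustration) makes the behavior easy to verify:

```cpp
#include <cstddef>
#include <vector>

// Minimal illustration of the CASE 3 batch computation. Seeded with 1,
// the leading dimensions accumulate correctly; the buggy version seeded
// with 0 would always return 0 and skip every batch element.
inline int leading_dims_product(const std::vector<int>& shape) {
  int batch = 1;  // must start at 1, not 0
  for (size_t i = 0; i + 2 < shape.size(); ++i) batch *= shape[i];
  return batch;
}
```

For a shape {2, 3, 4, 5} the last two dims are transposed, so the batch count is 2 * 3 = 6; for a plain 2-D shape the product over zero leading dims is 1.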


123-153: General permute CASE 4 is a silent no-op on x86; add explicit x86 handling / NYI

In CASE 4 (general permute), the dtype branches are guarded only by ARM macros, for example:

case kFloat32: {
#if defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
  arm::permute_fp32(...);
#endif
  break;
}

On x86/x86_64, the #if body is compiled out, so the code effectively just hits break; with no kernel call and no NYI, meaning the operation becomes a silent no-op and leaves output unpermuted. That’s more dangerous than the explicit NYI behaviour you added in CASE 1–3.

To align behaviour and avoid silent incorrect results on x86, consider restructuring these branches similar to CASE 1–3, e.g.:

       case kFloat32: {
-#if defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
-        arm::permute_fp32(input.ptr<mllm_fp32_t>(), output.ptr<mllm_fp32_t>(), input_shape.data(), permute_axis.data(),
-                          permute_axis.size());
-#endif
+ #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
+        NYI("Transpose op(general permute) fp32 not supported in x86");
+ #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
+        arm::permute_fp32(input.ptr<mllm_fp32_t>(), output.ptr<mllm_fp32_t>(), input_shape.data(), permute_axis.data(),
+                          permute_axis.size());
+ #endif
         break;
       }
       case kFloat16: {
-#if defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
-        arm::permute_fp16(input.ptr<mllm_fp16_t>(), output.ptr<mllm_fp16_t>(), input_shape.data(), permute_axis.data(),
-                          permute_axis.size());
-#endif
+ #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
+        NYI("Transpose op(general permute) fp16 not supported in x86");
+ #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
+        arm::permute_fp16(input.ptr<mllm_fp16_t>(), output.ptr<mllm_fp16_t>(), input_shape.data(), permute_axis.data(),
+                          permute_axis.size());
+ #endif
         break;
       }
       case kInt64: {
-#if defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
-        arm::permute_generic<mllm_int64_t>(input.ptr<mllm_int64_t>(), output.ptr<mllm_int64_t>(), input_shape.data(),
-                                           permute_axis.data(), permute_axis.size());
-#endif
+ #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
+        NYI("Transpose op(general permute) int64 not supported in x86");
+ #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
+        arm::permute_generic<mllm_int64_t>(input.ptr<mllm_int64_t>(), output.ptr<mllm_int64_t>(), input_shape.data(),
+                                           permute_axis.data(), permute_axis.size());
+ #endif
         break;
       }

(or route x86 to a generic non-ARM permute implementation if/when one exists). This keeps x86 semantics explicit and consistent with the earlier transpose special-cases.

🧹 Nitpick comments (2)
mllm/backends/cpu/ops/ElewiseOps.cpp (1)

143-157: Consider adding a fallback for unsupported platforms.

The code has branches for x86 and ARM but no fallback else clause. On platforms that are neither x86 nor ARM (e.g., RISC-V, WebAssembly), these operations will silently do nothing, leading to incorrect results without any warning.

Consider adding a fallback path or NYI message:

 #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
         cpu::arm::ew_add_fp32(output.ptr<mllm_fp32_t>(), input0.ptr<mllm_fp32_t>(), input1.ptr<mllm_fp32_t>(), output.numel(),
                               options_.getThreads());
+#else
+        NYI("AddOp not supported on this platform.");
 #endif
mllm/backends/cpu/kernels/common/elewise-inl.hpp (1)

113-139: Consider reusing existing operator structs to eliminate duplication.

The scalar operator structs (AddScalarOp, SubScalarOp, MulScalarOp, DivScalarOp) are identical to the non-scalar operators (AddOp, SubOp, MulOp, DivOp) defined at lines 39-65. Since the operators work on vector types and don't distinguish between scalar-broadcast and element-wise operations, you could directly reuse the existing operators.

For example, update the scalar functions to reuse existing operators:

-template<typename T>
-HWY_NOINLINE HWY_MAYBE_UNUSED void element_wise_add_scalar(T* out, const T* x, const T* y, size_t n) {
-  __elementwise_scalar(out, x, y, n, AddScalarOp{});
-}
+template<typename T>
+HWY_NOINLINE HWY_MAYBE_UNUSED void element_wise_add_scalar(T* out, const T* x, const T* y, size_t n) {
+  __elementwise_scalar(out, x, y, n, AddOp{});
+}

Then remove the duplicate scalar operator structs (lines 113-139).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e917a70 and 2eb1586.

📒 Files selected for processing (11)
  • examples/CMakeLists.txt (1 hunks)
  • examples/qwen3/main.cpp (1 hunks)
  • mllm/backends/cpu/kernels/Kernels.hpp (1 hunks)
  • mllm/backends/cpu/kernels/common/elewise-inl.hpp (1 hunks)
  • mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp (1 hunks)
  • mllm/backends/cpu/kernels/x86/rmsnorm.cpp (1 hunks)
  • mllm/backends/cpu/kernels/x86/softmax.cpp (3 hunks)
  • mllm/backends/cpu/ops/ElewiseOps.cpp (11 hunks)
  • mllm/backends/cpu/ops/LinearOp.cpp (1 hunks)
  • mllm/backends/cpu/ops/TransposeOp.cpp (3 hunks)
  • mllm/models/qwen3/modeling_qwen3_fa2.hpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
{mllm,mllm-cli,pymllm}/**/*

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).

Files:

  • mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp
  • mllm/backends/cpu/ops/ElewiseOps.cpp
  • mllm/backends/cpu/kernels/Kernels.hpp
  • mllm/models/qwen3/modeling_qwen3_fa2.hpp
  • mllm/backends/cpu/kernels/x86/rmsnorm.cpp
  • mllm/backends/cpu/kernels/common/elewise-inl.hpp
  • mllm/backends/cpu/ops/LinearOp.cpp
  • mllm/backends/cpu/ops/TransposeOp.cpp
  • mllm/backends/cpu/kernels/x86/softmax.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.

Files:

  • mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp
  • mllm/backends/cpu/ops/ElewiseOps.cpp
  • mllm/backends/cpu/kernels/Kernels.hpp
  • mllm/models/qwen3/modeling_qwen3_fa2.hpp
  • mllm/backends/cpu/kernels/x86/rmsnorm.cpp
  • mllm/backends/cpu/kernels/common/elewise-inl.hpp
  • mllm/backends/cpu/ops/LinearOp.cpp
  • mllm/backends/cpu/ops/TransposeOp.cpp
  • mllm/backends/cpu/kernels/x86/softmax.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).

Files:

  • mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp
  • mllm/backends/cpu/ops/ElewiseOps.cpp
  • mllm/backends/cpu/kernels/Kernels.hpp
  • mllm/models/qwen3/modeling_qwen3_fa2.hpp
  • mllm/backends/cpu/kernels/x86/rmsnorm.cpp
  • mllm/backends/cpu/kernels/common/elewise-inl.hpp
  • mllm/backends/cpu/ops/LinearOp.cpp
  • mllm/backends/cpu/ops/TransposeOp.cpp
  • mllm/backends/cpu/kernels/x86/softmax.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.

Files:

  • mllm/backends/cpu/ops/ElewiseOps.cpp
  • mllm/backends/cpu/kernels/x86/rmsnorm.cpp
  • mllm/backends/cpu/ops/LinearOp.cpp
  • mllm/backends/cpu/ops/TransposeOp.cpp
  • mllm/backends/cpu/kernels/x86/softmax.cpp
🧬 Code graph analysis (3)
mllm/backends/cpu/ops/ElewiseOps.cpp (2)
mllm/backends/cpu/kernels/common/elewise-inl.hpp (8)
  • element_wise_add (68-68)
  • element_wise_add_scalar (142-142)
  • element_wise_sub (73-73)
  • element_wise_sub_scalar (147-147)
  • element_wise_mul (78-78)
  • element_wise_mul_scalar (152-152)
  • element_wise_div (83-83)
  • element_wise_div_scalar (157-157)
mllm/backends/cpu/kernels/arm/elementwise.hpp (8)
  • ew_add_fp32 (262-263)
  • ew_add_fp32_scalar (324-325)
  • ew_sub_fp32 (265-266)
  • ew_sub_fp32_scalar (327-328)
  • ew_mul_fp32 (268-269)
  • ew_mul_fp32_scalar (330-331)
  • ew_div_fp32 (271-272)
  • ew_div_fp32_scalar (333-334)
mllm/backends/cpu/ops/TransposeOp.cpp (1)
mllm/backends/cpu/kernels/arm/transpose.cpp (6)
  • transpose_hw_wh_fp32 (13-46)
  • transpose_hw_wh_fp32 (13-13)
  • transpose_bshd_bhsd_fp32 (48-73)
  • transpose_bshd_bhsd_fp32 (48-49)
  • transpose_last_dims_fp32 (160-200)
  • transpose_last_dims_fp32 (160-161)
mllm/backends/cpu/kernels/x86/softmax.cpp (1)
mllm/backends/cpu/kernels/x86/math.hpp (1)
  • vexpq_fast_f32 (19-19)
🪛 Clang (14.0.6)
mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp

[error] 114-114: parameter name 'f' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)

mllm/backends/cpu/kernels/common/elewise-inl.hpp

[error] lines 68, 73, 78, 83, 142, 147, 152, 157 (element_wise_add/sub/mul/div and their _scalar variants): variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const (cppcoreguidelines-avoid-non-const-global-variables); invalid case style for variable 'HWY_MAYBE_UNUSED' (readability-identifier-naming); 2 adjacent parameters of similar type ('const int *') are easily swapped by mistake (bugprone-easily-swappable-parameters); parameter names 'x' and 'y' are too short, expected at least 3 characters (readability-identifier-length)

[error] line 92: declaration uses identifier '__elementwise_scalar', which is a reserved identifier (bugprone-reserved-identifier)

[error] lines 93-95, 98, 107, 108: variables 'd', 'N', 'idx', 'sVec', 'vx', 'result' are not initialized (cppcoreguidelines-init-variables); names 'd', 'N', 'vx' are too short, expected at least 3 characters (readability-identifier-length); 'N' and 'sVec' use invalid case style (readability-identifier-naming)

🔇 Additional comments (13)
mllm/backends/cpu/ops/LinearOp.cpp (1)

293-309: Architecture-guarded NYI for x86 is acceptable, but verify default fallback behavior.

The explicit NYI for x86 is a reasonable approach to mark unsupported implementations. However, note that when MLLM_USE_BLAS is not defined and K < 4 (lines 119-124), kDefault auto-selects kMllmBlas, which will hit this NYI on x86 at runtime.

Consider whether x86 without BLAS should default to kGGUF regardless of K size to avoid the unexpected NYI:

 #else
     if (K >= 4) {
       impl_type = aops::LinearImplTypes::kGGUF;
-    } else
-    // All fallback to mllm blas
-    {
+    } else {
+#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
+      impl_type = aops::LinearImplTypes::kGGUF;  // x86 lacks kMllmBlas support
+#else
       impl_type = aops::LinearImplTypes::kMllmBlas;
+#endif
     }
 #endif

Alternatively, if hitting NYI for small-K x86 scenarios is intentional as a known limitation, this is fine as-is.

examples/CMakeLists.txt (1)

8-9: LGTM!

Enabling qwen3 and qwen3_service build targets aligns with the PR objective to add Qwen3 model support on x86 architecture.

mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp (1)

114-127: Signature update looks correct.

The parameter type change from uint16_t to mllm_fp16_t improves type consistency with the FP16 abstraction used elsewhere in the codebase.

Note: The lazy initialization of table_f32_f16 (lines 115-122) has a pre-existing race condition if called from multiple threads simultaneously. Consider using std::call_once or static initialization if thread safety becomes a concern.
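A minimal sketch of the std::call_once approach suggested above, assuming a placeholder conversion (the real table would be filled from ggml's actual FP16-to-FP32 conversion, which is stubbed out here for illustration):

```cpp
#include <array>
#include <cstdint>
#include <mutex>

namespace demo {
// 65536-entry FP16 bit-pattern -> FP32 lookup table (requires C++17 for
// inline variables).
inline std::array<float, 1 << 16> table_f32_f16;
inline std::once_flag table_once;

// Placeholder conversion for illustration only; not ggml's real half->float.
inline float fp16_bits_to_fp32_stub(uint16_t bits) {
  return static_cast<float>(bits);
}

// Thread-safe lazy initialization: std::call_once guarantees the table is
// filled exactly once, even under concurrent first calls.
inline float lookup_fp16_to_fp32(uint16_t h) {
  std::call_once(table_once, [] {
    for (uint32_t i = 0; i < table_f32_f16.size(); ++i) {
      table_f32_f16[i] = fp16_bits_to_fp32_stub(static_cast<uint16_t>(i));
    }
  });
  return table_f32_f16[h];
}
}  // namespace demo
```

The trade-off versus an unguarded lazy check is one atomic fast-path test per call; static local initialization of the whole table would achieve the same guarantee with zero per-call cost if eager filling at first use is acceptable.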

examples/qwen3/main.cpp (1)

4-4: LGTM!

Header updated to use the FlashAttention2 variant (modeling_qwen3_fa2.hpp), aligning with the PR objective to add FA2 support for Qwen3.

mllm/backends/cpu/kernels/Kernels.hpp (1)

40-40: LGTM!

Adding the elewise-inl.hpp include for non-ARM platforms enables the HWY-accelerated element-wise operations used by the x86 path in ElewiseOps.cpp.

mllm/backends/cpu/kernels/x86/softmax.cpp (1)

38-65: LGTM!

Switching to LoadU/StoreU for unaligned memory access is the correct approach when alignment of input/output buffers cannot be guaranteed. This improves robustness and prevents potential segfaults on x86 when data is not aligned to SIMD register boundaries.
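For context, the pattern can be shown with SSE intrinsics (x86-only; add_four_unaligned is an illustrative helper, not code from this PR): the unaligned variants _mm_loadu_ps/_mm_storeu_ps accept any float pointer, whereas the aligned _mm_load_ps/_mm_store_ps require 16-byte alignment and may fault otherwise.

```cpp
#include <immintrin.h>  // SSE intrinsics; x86/x86_64 only

// Adds four floats from two possibly misaligned sources into a possibly
// misaligned destination. Using the *u* variants makes this safe regardless
// of buffer alignment, at a small (often negligible on modern CPUs) cost.
inline void add_four_unaligned(float* out, const float* x, const float* y) {
  __m128 vx = _mm_loadu_ps(x);  // no alignment requirement
  __m128 vy = _mm_loadu_ps(y);
  _mm_storeu_ps(out, _mm_add_ps(vx, vy));
}
```

On recent x86 microarchitectures, unaligned loads on actually aligned data run at essentially the same speed as aligned loads, so defaulting to LoadU/StoreU mainly buys robustness.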

mllm/backends/cpu/kernels/x86/rmsnorm.cpp (1)

30-45: LGTM!

Consistent with the softmax changes, switching to unaligned LoadU/StoreU operations ensures robustness for input buffers that may not be aligned to SIMD boundaries. Both the add_unit_offset branches are updated consistently.

mllm/models/qwen3/modeling_qwen3_fa2.hpp (1)

262-266: Verify model weight compatibility with the new registration name.

The registration name changed from "lm_head_out" to "lm_head". This is a more conventional name, but it's a breaking change for any existing model weight files that use the old key name.

Ensure that model weight files for Qwen3 use "lm_head" as the weight key, or that the model loader handles this mapping appropriately. If existing weights use "lm_head_out", they will fail to load correctly with this change.

mllm/backends/cpu/ops/ElewiseOps.cpp (2)

143-149: LGTM!

The x86 HWY-accelerated element-wise add implementation correctly uses the element_wise_add function from elewise-inl.hpp with the proper out-of-place parameter ordering (out, x, y, n).


847-851: Good use of NYI for unsupported operations.

Explicitly marking AbsOp (and similar ops like LogOp, ExpOp, ClipOp, SinOp, CosOp) as not-yet-implemented on x86 is better than silent failure. These can be tracked and implemented in future work.

mllm/backends/cpu/kernels/common/elewise-inl.hpp (3)

91-111: LGTM! Scalar element-wise implementation follows Highway patterns.

The implementation correctly broadcasts the scalar value to a vector and applies the operation efficiently using Highway SIMD primitives. The tail handling with LoadN/StoreN properly handles non-multiple-of-lane counts.
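The broadcast-plus-tail pattern can be sketched in plain C++ without Highway. Here kLanes stands in for Lanes(d), the splatted local plays the role of Set(d, y[0]), and the remainder loop substitutes for LoadN/StoreN; all names are illustrative.

```cpp
#include <cstddef>

constexpr size_t kLanes = 4;  // stand-in for the target's SIMD lane count

// Elementwise add with a scalar right-hand side: y points at a single value
// that is conceptually broadcast across every lane.
inline void elementwise_add_scalar_sketch(float* out, const float* x,
                                          const float* y, size_t n) {
  const float s = y[0];  // broadcast once, outside the loop
  size_t i = 0;
  for (; i + kLanes <= n; i += kLanes) {        // full "vectors"
    for (size_t l = 0; l < kLanes; ++l) out[i + l] = x[i + l] + s;
  }
  for (; i < n; ++i) out[i] = x[i] + s;         // tail (LoadN/StoreN analogue)
}
```

The tail loop is what makes arbitrary n safe; masked LoadN/StoreN in Highway achieve the same without scalar fallback code.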


142-159: LGTM! Scalar wrapper functions correctly delegate to the scalar implementation.

The four scalar element-wise wrapper functions follow the established pattern and correctly delegate to __elementwise_scalar with the appropriate operator.


68-85: Verify all call sites are updated for the new out-of-place API signature.

This change converts the element-wise operation functions from in-place to explicit out-of-place semantics, with out now as the first parameter. Confirm that all callers throughout the codebase have been updated to pass arguments in the new order: element_wise_add(out, x, y, n) instead of the previous signature. This is a breaking change that affects all usage sites.
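A minimal sketch of the new call-site shape (signatures assumed from the review, not copied from the codebase; the body is a scalar stand-in for the vectorized kernel):

```cpp
#include <cassert>
#include <cstddef>

// New out-of-place convention: the destination buffer comes first,
// followed by the two inputs and the element count.
void element_wise_add(float* out, const float* x, const float* y, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = x[i] + y[i];
}

// A call site updated for the new order: out first, then x, y, n.
void caller_example(float* out, const float* a, const float* b, std::size_t n) {
  element_wise_add(out, a, b, n);
}
```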

Collaborator


Please use lm_head_out for BC (backward compatibility). The Arm CPU path uses lm_head_out.

int out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;

#if defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
cpu::common::HWY_NAMESPACE::element_wise_div(out + out_offset, a + a_offset, b + b_offset, vector_size);
Collaborator

Is cpu::common::HWY_NAMESPACE the right way to use Highway?

IIRC, we should include the Highway operator header file multiple times and use Highway's dispatcher to dispatch each op.

#add_subdirectory(llama)
add_subdirectory(minicpm_o)
add_subdirectory(minicpm4)
#add_subdirectory(qwen3)
Collaborator

Please uncomment all targets. I think this was introduced by @KKkai0315.

@chenghuaWang
Collaborator

Can you add a CI/CD test for the X86 backend? Just referencing the macOS test scripts is fine.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
mllm/backends/cpu/kernels/common/elewise-inl.hpp (1)

15-34: Reserved identifier usage: __elementwise.

Identifiers starting with double underscore (__) are reserved in C++. Rename to avoid potential conflicts with compiler/library internals.

 template<typename T, typename Op>
-HWY_INLINE void __elementwise(const T* HWY_RESTRICT x, const T* HWY_RESTRICT y, T* HWY_RESTRICT out, size_t count, Op&& op) {
+HWY_INLINE void elementwise_impl(const T* HWY_RESTRICT x, const T* HWY_RESTRICT y, T* HWY_RESTRICT out, size_t count, Op&& op) {

Also update the call sites (lines 65, 69, 73, 77).

🧹 Nitpick comments (2)
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (1)

67-75: Remove or document commented-out GELU code.

Commented-out code without explanation adds noise. Either remove it or add a TODO comment explaining why it's preserved and when it will be implemented.

-//===----------------------------------------------------------------------===//
-// GELU
-//===----------------------------------------------------------------------===//
-// HWY_EXPORT(gelu_fp32);
-// 
-// HWY_DLLEXPORT void call_gelu_fp32(mllm_fp32_t* out, const mllm_fp32_t* in, size_t n) {
-//   HWY_DYNAMIC_DISPATCH(gelu_fp32)(out, in, n);
-// }
+// TODO: Add GELU dispatch when elewise-inl.hpp implements gelu_fp32
mllm/backends/cpu/kernels/common/elewise-inl.hpp (1)

106-132: Duplicate functors: *ScalarOp are identical to *Op.

AddScalarOp, SubScalarOp, MulScalarOp, and DivScalarOp have identical implementations to their non-scalar counterparts. Reuse the existing functors to reduce duplication.

-struct AddScalarOp {
-  template<class D, class V>
-  HWY_INLINE V operator()(D d, V a, V b) const {
-    return hn::Add(a, b);
-  }
-};
-
-struct SubScalarOp {
-  template<class D, class V>
-  HWY_INLINE V operator()(D d, V a, V b) const {
-    return hn::Sub(a, b);
-  }
-};
-
-struct MulScalarOp {
-  template<class D, class V>
-  HWY_INLINE V operator()(D d, V a, V b) const {
-    return hn::Mul(a, b);
-  }
-};
-
-struct DivScalarOp {
-  template<class D, class V>
-  HWY_INLINE V operator()(D d, V a, V b) const {
-    return hn::Div(a, b);
-  }
-};
-
 HWY_NOINLINE HWY_MAYBE_UNUSED void elewise_add_scalar_fp32(mllm_fp32_t* out, const mllm_fp32_t* x, const mllm_fp32_t y, size_t n) {
-  __elementwise_scalar(out, x, y, n, AddScalarOp{});
+  elementwise_scalar_impl(out, x, y, n, AddOp{});
 }
 
 HWY_NOINLINE HWY_MAYBE_UNUSED void elewise_sub_scalar_fp32(mllm_fp32_t* out, const mllm_fp32_t* x, const mllm_fp32_t y, size_t n) {
-  __elementwise_scalar(out, x, y, n, SubScalarOp{});
+  elementwise_scalar_impl(out, x, y, n, SubOp{});
 }
 
 HWY_NOINLINE HWY_MAYBE_UNUSED void elewise_mul_scalar_fp32(mllm_fp32_t* out, const mllm_fp32_t* x, const mllm_fp32_t y, size_t n) {
-  __elementwise_scalar(out, x, y, n, MulScalarOp{});
+  elementwise_scalar_impl(out, x, y, n, MulOp{});
 }
 
 HWY_NOINLINE HWY_MAYBE_UNUSED void elewise_div_scalar_fp32(mllm_fp32_t* out, const mllm_fp32_t* x, const mllm_fp32_t y, size_t n) {
-  __elementwise_scalar(out, x, y, n, DivScalarOp{});
+  elementwise_scalar_impl(out, x, y, n, DivOp{});
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2eb1586 and 8514471.

📒 Files selected for processing (7)
  • .github/workflows/build-x86.yml (1 hunks)
  • examples/CMakeLists.txt (1 hunks)
  • mllm/backends/cpu/kernels/Kernels.hpp (1 hunks)
  • mllm/backends/cpu/kernels/common/elewise-inl.hpp (2 hunks)
  • mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (1 hunks)
  • mllm/backends/cpu/kernels/common/kernel_dispatch.hpp (1 hunks)
  • mllm/backends/cpu/ops/ElewiseOps.cpp (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/CMakeLists.txt
  • mllm/backends/cpu/kernels/Kernels.hpp
🧰 Additional context used
📓 Path-based instructions (4)
{mllm,mllm-cli,pymllm}/**/*

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).

Files:

  • mllm/backends/cpu/kernels/common/kernel_dispatch.hpp
  • mllm/backends/cpu/ops/ElewiseOps.cpp
  • mllm/backends/cpu/kernels/common/elewise-inl.hpp
  • mllm/backends/cpu/kernels/common/kernel_dispatch.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.

Files:

  • mllm/backends/cpu/kernels/common/kernel_dispatch.hpp
  • mllm/backends/cpu/ops/ElewiseOps.cpp
  • mllm/backends/cpu/kernels/common/elewise-inl.hpp
  • mllm/backends/cpu/kernels/common/kernel_dispatch.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).

Files:

  • mllm/backends/cpu/kernels/common/kernel_dispatch.hpp
  • mllm/backends/cpu/ops/ElewiseOps.cpp
  • mllm/backends/cpu/kernels/common/elewise-inl.hpp
  • mllm/backends/cpu/kernels/common/kernel_dispatch.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.

Files:

  • mllm/backends/cpu/ops/ElewiseOps.cpp
  • mllm/backends/cpu/kernels/common/kernel_dispatch.cpp
🧬 Code graph analysis (3)
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp (2)
mllm/backends/cpu/kernels/common/elewise-inl.hpp (1)
  • void (16-34)
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (8)
  • void (35-37)
  • void (39-41)
  • void (43-45)
  • void (47-49)
  • void (51-53)
  • void (55-57)
  • void (59-61)
  • void (63-65)
mllm/backends/cpu/kernels/common/elewise-inl.hpp (1)
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (8)
  • void (35-37)
  • void (39-41)
  • void (43-45)
  • void (47-49)
  • void (51-53)
  • void (55-57)
  • void (59-61)
  • void (63-65)
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (1)
mllm/backends/cpu/kernels/common/elewise-inl.hpp (9)
  • elewise_add_fp32 (64-64)
  • elewise_sub_fp32 (68-68)
  • elewise_mul_fp32 (72-72)
  • elewise_div_fp32 (76-76)
  • elewise_add_scalar_fp32 (134-134)
  • elewise_sub_scalar_fp32 (138-138)
  • elewise_mul_scalar_fp32 (142-142)
  • elewise_div_scalar_fp32 (146-146)
  • void (16-34)
🪛 actionlint (1.7.9)
.github/workflows/build-x86.yml

39-39: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🪛 Clang (14.0.6)
mllm/backends/cpu/kernels/common/elewise-inl.hpp

[error] 64-64: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 64-64: invalid case style for variable 'HWY_MAYBE_UNUSED'

(readability-identifier-naming,-warnings-as-errors)


[error] 64-64: 2 adjacent parameters of 'elewise_add_fp32' of similar type ('const int *') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 64-64: parameter name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 64-64: parameter name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 68-68: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 68-68: invalid case style for variable 'HWY_MAYBE_UNUSED'

(readability-identifier-naming,-warnings-as-errors)


[error] 68-68: 2 adjacent parameters of 'elewise_sub_fp32' of similar type ('const int *') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 68-68: parameter name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 68-68: parameter name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 72-72: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 72-72: invalid case style for variable 'HWY_MAYBE_UNUSED'

(readability-identifier-naming,-warnings-as-errors)


[error] 72-72: 2 adjacent parameters of 'elewise_mul_fp32' of similar type ('const int *') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 72-72: parameter name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 72-72: parameter name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 76-76: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 76-76: invalid case style for variable 'HWY_MAYBE_UNUSED'

(readability-identifier-naming,-warnings-as-errors)


[error] 76-76: 2 adjacent parameters of 'elewise_div_fp32' of similar type ('const int *') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 76-76: parameter name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 76-76: parameter name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 85-85: declaration uses identifier '__elementwise_scalar', which is a reserved identifier

(bugprone-reserved-identifier,-warnings-as-errors)


[error] 86-86: variable 'd' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 86-86: variable name 'd' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 87-87: variable 'N' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 87-87: variable name 'N' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 87-87: invalid case style for variable 'N'

(readability-identifier-naming,-warnings-as-errors)


[error] 88-88: variable 'idx' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 91-91: variable 'sVec' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 91-91: invalid case style for variable 'sVec'

(readability-identifier-naming,-warnings-as-errors)


[error] 100-100: variable 'vx' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 100-100: variable name 'vx' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 101-101: variable 'result' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 134-134: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 134-134: invalid case style for variable 'HWY_MAYBE_UNUSED'

(readability-identifier-naming,-warnings-as-errors)


[error] 134-134: parameter name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 134-134: parameter name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 138-138: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 138-138: invalid case style for variable 'HWY_MAYBE_UNUSED'

(readability-identifier-naming,-warnings-as-errors)


[error] 138-138: parameter name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 138-138: parameter name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 142-142: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 142-142: invalid case style for variable 'HWY_MAYBE_UNUSED'

(readability-identifier-naming,-warnings-as-errors)


[error] 142-142: parameter name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 142-142: parameter name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 146-146: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 146-146: invalid case style for variable 'HWY_MAYBE_UNUSED'

(readability-identifier-naming,-warnings-as-errors)


[error] 146-146: parameter name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 146-146: parameter name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)

mllm/backends/cpu/kernels/common/kernel_dispatch.cpp

[error] 12-12: macro 'HWY_TARGET_INCLUDE' used to declare a constant; consider using a 'constexpr' constant

(cppcoreguidelines-macro-usage,-warnings-as-errors)

🔇 Additional comments (7)
.github/workflows/build-x86.yml (1)

1-51: Workflow structure and design are solid.

The CI/CD workflow properly addresses the reviewer's request for X86 backend compilation testing. The path-based triggering, repository guard, and build steps are well-configured. Python 3.12 with pip caching is appropriate for build speed.

mllm/backends/cpu/ops/ElewiseOps.cpp (3)

143-149: x86 HWY dispatch integration looks correct.

The dynamic dispatch pattern using cpu::common::call_elewise_*_fp32 is appropriate for Highway's runtime dispatch mechanism. This addresses the previous review comment about proper Highway usage.


144-148: x86 implementation lacks thread count parameter.

The ARM implementations accept options_.getThreads() for multi-threading, but the x86 HWY-based implementations don't expose threading control. Verify this is intentional or consider adding threading support for parity.


847-851: NYI placeholders for x86 are appropriate.

Clear indication that these operations need x86 implementations. Consider adding TODO comments with tracking issue references if these are planned for near-term implementation.

mllm/backends/cpu/kernels/common/kernel_dispatch.hpp (1)

1-28: Clean header structure for HWY dispatch declarations.

The header is well-organized with appropriate includes, namespace usage, and function declarations. The separation of vectorized and scalar variants is clear.

mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (1)

7-15: HWY dynamic dispatch pattern is correctly implemented.

The HWY_TARGET_INCLUDE and foreach_target.h pattern is the standard Highway approach for runtime CPU feature detection and dispatch. The static analysis warning about preferring constexpr is a false positive—this macro pattern is required by Highway's design.
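A simplified model of what this dispatch achieves (a sketch with an assumed feature probe, not Highway's actual macro machinery): the best implementation for the running CPU is resolved once at first call and cached, so subsequent calls jump straight to the chosen kernel.

```cpp
#include <cassert>
#include <cstddef>

using AddFn = void (*)(float*, const float*, const float*, std::size_t);

// Portable baseline kernel.
void add_generic(float* out, const float* x, const float* y, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = x[i] + y[i];
}

// In real code this would be the AVX2/AVX-512 build of the same kernel;
// here it shares the scalar body for illustration.
void add_fast(float* out, const float* x, const float* y, std::size_t n) {
  add_generic(out, x, y, n);
}

bool cpu_has_avx2() { return true; }  // placeholder feature probe (assumption)

AddFn resolve_add() { return cpu_has_avx2() ? add_fast : add_generic; }

void call_add(float* out, const float* x, const float* y, std::size_t n) {
  static AddFn fn = resolve_add();  // resolved once, cached thereafter
  fn(out, x, y, n);
}
```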

mllm/backends/cpu/kernels/common/elewise-inl.hpp (1)

64-78: FP32 elementwise implementations look correct.

The implementations correctly use the HWY vectorized operations with proper function signatures matching the dispatch declarations.

Comment on lines +162 to +173
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
// Process each batch separately
for (int batch = 0; batch < batch_dims; ++batch) {
// Each batch processes broadcast_naive_loops iterations of vector_size elements
for (int l = 0; l < broadcast_naive_loops; ++l) {
int a_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;
int b_offset = batch * vector_size; // b doesn't broadcast over loops dimension
int out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;

#if defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
cpu::common::call_elewise_add_fp32(out + out_offset, a + a_offset, b + b_offset, vector_size);
}
}
Contributor

⚠️ Potential issue | 🟠 Major

Potential integer overflow in offset calculations.

When batch, broadcast_naive_loops, and vector_size are large, the expression batch * broadcast_naive_loops * vector_size can overflow int. Consider using size_t or int64_t for offset calculations.

-        for (int batch = 0; batch < batch_dims; ++batch) {
+        for (int64_t batch = 0; batch < batch_dims; ++batch) {
           // Each batch processes broadcast_naive_loops iterations of vector_size elements
-          for (int l = 0; l < broadcast_naive_loops; ++l) {
-            int a_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;
-            int b_offset = batch * vector_size;  // b doesn't broadcast over loops dimension
-            int out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;
+          for (int64_t l = 0; l < broadcast_naive_loops; ++l) {
+            size_t a_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;
+            size_t b_offset = batch * vector_size;  // b doesn't broadcast over loops dimension
+            size_t out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;

Note: This same issue exists in the ARM paths and in all other broadcast_naive loops (Sub, Mul, Div ops).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
// Process each batch separately
for (int batch = 0; batch < batch_dims; ++batch) {
// Each batch processes broadcast_naive_loops iterations of vector_size elements
for (int l = 0; l < broadcast_naive_loops; ++l) {
int a_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;
int b_offset = batch * vector_size; // b doesn't broadcast over loops dimension
int out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;
#if defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
cpu::common::call_elewise_add_fp32(out + out_offset, a + a_offset, b + b_offset, vector_size);
}
}
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
// Process each batch separately
for (int64_t batch = 0; batch < batch_dims; ++batch) {
// Each batch processes broadcast_naive_loops iterations of vector_size elements
for (int64_t l = 0; l < broadcast_naive_loops; ++l) {
size_t a_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;
size_t b_offset = batch * vector_size; // b doesn't broadcast over loops dimension
size_t out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;
cpu::common::call_elewise_add_fp32(out + out_offset, a + a_offset, b + b_offset, vector_size);
}
}
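The hazard can be shown in isolation (a sketch with illustrative dimensions, not the project's loop): the product batch * loops * vector_size can exceed INT_MAX for large tensors, at which point a 32-bit signed multiply is undefined behavior, whereas widening the operands first keeps the offset exact.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Offsets computed in int64_t: all operands are widened before the
// multiply, so no intermediate product can wrap at 32 bits.
int64_t offset64(int64_t batch, int64_t loops, int64_t vec) {
  return batch * loops * vec;
}
```

For example, 70000 * 70000 already exceeds INT32_MAX, so any offset built from dimensions of that magnitude must be widened before multiplying.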

- Updated qwen3 model runtime using `modeling_qwen3_fa2.hpp`
- Enabled qwen3 compilation options
- Fixed rmsnorm and softmax operators
- Fixed parameters for quantize
- Added `NYI` messages for some unsupported ops
- Added support for common CPU elementwise operators (fp32)
- Replace cpu::common::HWY_NAMESPACE with proper Highway dynamic dispatcher
- Create general kernel_dispatch.hpp/cpp with HWY_DYNAMIC_DISPATCH pattern
- Use lm_head_out in `modeling_qwen3_fa2.hpp`
- Uncomment all compilation targets in `examples/CMakeLists.txt`
- Add GitHub Actions workflow file for Linux (Ubuntu) x86_64 build
- Fix python version in `.github/workflows/build-x86.yml`
- Rename function name to avoid using reserved identifiers
- Add #else fallback for unsupported architectures
- Fix potential integer overflow in offset calculations
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
mllm/backends/cpu/ops/TransposeOp.cpp (1)

87-100: batch is computed incorrectly (always 0).
int batch = 0; then multiplying keeps it at 0, so arm::transpose_last_dims_* will run with batch == 0. This looks like a functional bug.

-  int batch = 0;
+  int batch = 1;
   for (int i = 0; i < input_shape.size() - 2; i++) { batch *= input_shape[i]; }
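The fix boils down to using the multiplicative identity as the accumulator seed; a minimal stand-alone version of the corrected fold (shape handling assumed):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Folds all leading (non-matrix) dimensions into a batch count. The bug
// was seeding the accumulator with 0, which zeroes every product; the
// seed must be 1, the multiplicative identity.
int leading_batch(const std::vector<int>& shape) {
  int batch = 1;  // was 0 in the buggy code
  for (std::size_t i = 0; i + 2 < shape.size(); ++i) batch *= shape[i];
  return batch;
}
```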
mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp (1)

114-127: Fix data race in FP16 lookup table initialization—avoid std::call_once in hot path.

The table_f32_f16_init flag has a data race: multiple threads calling lookup_fp16_to_fp32() during parallel inference can race on the flag without synchronization. Since MLLM_FP16_TO_FP32 is invoked in hot loops (vec_dot, quantization, GEMM kernels), this is a critical issue.

Do not use std::call_once for the fix—it adds atomic fences and potential mutex overhead on every call, which is unacceptable in hot paths. Instead, use one of:

  • Eager initialization: initialize the table at program or model load time (best for inference servers).
  • Atomic fast-path + mutex slow-path: load std::atomic<bool> with acquire semantics; only acquire mutex on miss.
  • Thread-local initialization: per-thread tables (less efficient, avoid if possible).

Also add static_assert(sizeof(mllm_fp16_t) == sizeof(uint16_t), "...") to guard the memcpy assumption. Rename parameter f to fp16 for clarity.
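The suggested atomic fast-path + mutex slow-path can be sketched as follows (assumed names and a trivial stand-in table, not the project's actual FP16 conversion): the hot path is a single acquire-load and branch, and the mutex is only taken by threads that race on first use.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

namespace {
float g_table[256];  // illustrative stand-in for the FP16->FP32 table
std::atomic<bool> g_table_ready{false};
std::mutex g_table_mutex;

// Slow path: at most one thread fills the table; the relaxed re-check
// under the lock makes late arrivals skip re-initialization.
void init_table_slow() {
  std::lock_guard<std::mutex> lock(g_table_mutex);
  if (!g_table_ready.load(std::memory_order_relaxed)) {
    for (int i = 0; i < 256; ++i) g_table[i] = static_cast<float>(i);
    g_table_ready.store(true, std::memory_order_release);  // publish table
  }
}
}  // namespace

// Hot path: one acquire-load; after initialization, no atomic RMW or
// lock is ever taken again.
float lookup(uint8_t idx) {
  if (!g_table_ready.load(std::memory_order_acquire)) init_table_slow();
  return g_table[idx];
}
```

The release store pairs with the acquire load, so any thread that observes the flag as true also observes the fully initialized table.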

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8514471 and bd046b9.

📒 Files selected for processing (14)
  • .github/workflows/build-x86.yml (1 hunks)
  • docs/cpu_backend/x86/index.rst (1 hunks)
  • examples/CMakeLists.txt (1 hunks)
  • examples/qwen3/main.cpp (1 hunks)
  • mllm/backends/cpu/kernels/Kernels.hpp (1 hunks)
  • mllm/backends/cpu/kernels/common/elewise-inl.hpp (3 hunks)
  • mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp (1 hunks)
  • mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (1 hunks)
  • mllm/backends/cpu/kernels/common/kernel_dispatch.hpp (1 hunks)
  • mllm/backends/cpu/kernels/x86/rmsnorm.cpp (1 hunks)
  • mllm/backends/cpu/kernels/x86/softmax.cpp (3 hunks)
  • mllm/backends/cpu/ops/ElewiseOps.cpp (11 hunks)
  • mllm/backends/cpu/ops/LinearOp.cpp (1 hunks)
  • mllm/backends/cpu/ops/TransposeOp.cpp (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • mllm/backends/cpu/ops/LinearOp.cpp
  • .github/workflows/build-x86.yml
  • mllm/backends/cpu/kernels/common/kernel_dispatch.hpp
  • mllm/backends/cpu/kernels/Kernels.hpp
  • examples/CMakeLists.txt
🧰 Additional context used
📓 Path-based instructions (4)
{mllm,mllm-cli,pymllm}/**/*

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).

Files:

  • mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp
  • mllm/backends/cpu/ops/TransposeOp.cpp
  • mllm/backends/cpu/ops/ElewiseOps.cpp
  • mllm/backends/cpu/kernels/x86/rmsnorm.cpp
  • mllm/backends/cpu/kernels/x86/softmax.cpp
  • mllm/backends/cpu/kernels/common/elewise-inl.hpp
  • mllm/backends/cpu/kernels/common/kernel_dispatch.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.

Files:

  • mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp
  • mllm/backends/cpu/ops/TransposeOp.cpp
  • mllm/backends/cpu/ops/ElewiseOps.cpp
  • mllm/backends/cpu/kernels/x86/rmsnorm.cpp
  • mllm/backends/cpu/kernels/x86/softmax.cpp
  • mllm/backends/cpu/kernels/common/elewise-inl.hpp
  • mllm/backends/cpu/kernels/common/kernel_dispatch.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).

Files:

  • mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp
  • mllm/backends/cpu/ops/TransposeOp.cpp
  • mllm/backends/cpu/ops/ElewiseOps.cpp
  • mllm/backends/cpu/kernels/x86/rmsnorm.cpp
  • mllm/backends/cpu/kernels/x86/softmax.cpp
  • mllm/backends/cpu/kernels/common/elewise-inl.hpp
  • mllm/backends/cpu/kernels/common/kernel_dispatch.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.

Files:

  • mllm/backends/cpu/ops/TransposeOp.cpp
  • mllm/backends/cpu/ops/ElewiseOps.cpp
  • mllm/backends/cpu/kernels/x86/rmsnorm.cpp
  • mllm/backends/cpu/kernels/x86/softmax.cpp
  • mllm/backends/cpu/kernels/common/kernel_dispatch.cpp
🧠 Learnings (2)
📚 Learning: 2025-11-25T07:26:06.575Z
Learnt from: CR
Repo: UbiquitousLearning/mllm PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-25T07:26:06.575Z
Learning: Applies to {mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi} : Suggest adding unit tests for untested complex logic or edge cases.

Applied to files:

  • mllm/backends/cpu/ops/ElewiseOps.cpp
📚 Learning: 2025-11-25T07:26:06.575Z
Learnt from: CR
Repo: UbiquitousLearning/mllm PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-25T07:26:06.575Z
Learning: Applies to {mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh} : Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.

Applied to files:

  • mllm/backends/cpu/ops/ElewiseOps.cpp
🧬 Code graph analysis (5)
mllm/backends/cpu/ops/TransposeOp.cpp (2)
mllm/backends/cpu/kernels/arm/transpose.hpp (3)
  • transpose_hw_wh_fp32 (16-16)
  • transpose_bshd_bhsd_fp32 (18-19)
  • transpose_last_dims_fp32 (21-22)
mllm/backends/cpu/kernels/arm/transpose.cpp (6)
  • transpose_hw_wh_fp32 (13-46)
  • transpose_hw_wh_fp32 (13-13)
  • transpose_bshd_bhsd_fp32 (48-73)
  • transpose_bshd_bhsd_fp32 (48-49)
  • transpose_last_dims_fp32 (160-200)
  • transpose_last_dims_fp32 (160-161)
mllm/backends/cpu/ops/ElewiseOps.cpp (2)
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp (8)
  • call_elewise_add_fp32 (20-20)
  • call_elewise_add_scalar_fp32 (28-28)
  • call_elewise_sub_fp32 (21-21)
  • call_elewise_sub_scalar_fp32 (29-29)
  • call_elewise_mul_fp32 (22-22)
  • call_elewise_mul_scalar_fp32 (30-30)
  • call_elewise_div_fp32 (23-23)
  • call_elewise_div_scalar_fp32 (31-31)
mllm/backends/cpu/kernels/arm/elementwise.hpp (8)
  • ew_add_fp32 (262-263)
  • ew_add_fp32_scalar (324-325)
  • ew_sub_fp32 (265-266)
  • ew_sub_fp32_scalar (327-328)
  • ew_mul_fp32 (268-269)
  • ew_mul_fp32_scalar (330-331)
  • ew_div_fp32 (271-272)
  • ew_div_fp32_scalar (333-334)
mllm/backends/cpu/kernels/x86/softmax.cpp (1)
mllm/backends/cpu/kernels/x86/math.hpp (1)
  • vexpq_fast_f32 (19-19)
mllm/backends/cpu/kernels/common/elewise-inl.hpp (2)
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (8)
  • void (36-38)
  • void (40-42)
  • void (44-46)
  • void (48-50)
  • void (52-54)
  • void (56-58)
  • void (60-62)
  • void (64-66)
mllm/backends/cpu/kernels/common/gelu-inl.hpp (2)
  • HWY_MAYBE_UNUSED (49-69)
  • HWY_MAYBE_UNUSED (71-75)
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (1)
mllm/backends/cpu/kernels/common/elewise-inl.hpp (9)
  • elewise_add_fp32 (64-64)
  • elewise_sub_fp32 (68-68)
  • elewise_mul_fp32 (72-72)
  • elewise_div_fp32 (76-76)
  • elewise_add_scalar_fp32 (134-134)
  • elewise_sub_scalar_fp32 (138-138)
  • elewise_mul_scalar_fp32 (142-142)
  • elewise_div_scalar_fp32 (146-146)
  • void (16-34)
🪛 Clang (14.0.6)
mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp

[error] 114-114: parameter name 'f' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)

mllm/backends/cpu/kernels/common/elewise-inl.hpp

[error] 64-64: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 64-64: invalid case style for variable 'HWY_MAYBE_UNUSED'

(readability-identifier-naming,-warnings-as-errors)


[error] 64-64: 2 adjacent parameters of 'elewise_add_fp32' of similar type ('const int *') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 64-64: parameter name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 64-64: parameter name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 68-68: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 68-68: invalid case style for variable 'HWY_MAYBE_UNUSED'

(readability-identifier-naming,-warnings-as-errors)


[error] 68-68: 2 adjacent parameters of 'elewise_sub_fp32' of similar type ('const int *') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 68-68: parameter name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 68-68: parameter name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 72-72: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 72-72: invalid case style for variable 'HWY_MAYBE_UNUSED'

(readability-identifier-naming,-warnings-as-errors)


[error] 72-72: 2 adjacent parameters of 'elewise_mul_fp32' of similar type ('const int *') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 72-72: parameter name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 72-72: parameter name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 76-76: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 76-76: invalid case style for variable 'HWY_MAYBE_UNUSED'

(readability-identifier-naming,-warnings-as-errors)


[error] 76-76: 2 adjacent parameters of 'elewise_div_fp32' of similar type ('const int *') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 76-76: parameter name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 76-76: parameter name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 86-86: variable 'd' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 86-86: variable name 'd' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 87-87: variable 'N' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 87-87: variable name 'N' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 87-87: invalid case style for variable 'N'

(readability-identifier-naming,-warnings-as-errors)


[error] 88-88: variable 'idx' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 91-91: variable 'sVec' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 91-91: invalid case style for variable 'sVec'

(readability-identifier-naming,-warnings-as-errors)


[error] 100-100: variable 'vx' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 100-100: variable name 'vx' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 101-101: variable 'result' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 134-134: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 134-134: invalid case style for variable 'HWY_MAYBE_UNUSED'

(readability-identifier-naming,-warnings-as-errors)


[error] 134-134: parameter name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 134-134: parameter name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 138-138: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 138-138: invalid case style for variable 'HWY_MAYBE_UNUSED'

(readability-identifier-naming,-warnings-as-errors)


[error] 138-138: parameter name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 138-138: parameter name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 142-142: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 142-142: invalid case style for variable 'HWY_MAYBE_UNUSED'

(readability-identifier-naming,-warnings-as-errors)


[error] 142-142: parameter name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 142-142: parameter name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 146-146: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 146-146: invalid case style for variable 'HWY_MAYBE_UNUSED'

(readability-identifier-naming,-warnings-as-errors)


[error] 146-146: parameter name 'x' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 146-146: parameter name 'y' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)

mllm/backends/cpu/kernels/common/kernel_dispatch.cpp

[error] 13-13: macro 'HWY_TARGET_INCLUDE' used to declare a constant; consider using a 'constexpr' constant

(cppcoreguidelines-macro-usage,-warnings-as-errors)

🔇 Additional comments (7)
examples/qwen3/main.cpp (1)

4-4: The header switch to modeling_qwen3_fa2.hpp is correct and safe. The FA2 implementation maintains full API compatibility with the original—both have identical public interfaces (constructor, forward() method, kvCache() method). The only difference is internal: the FA2 variant passes true to the StaticCache constructor's use_fa2 parameter (vs. false in the original), which doesn't affect the client code in this example file.

mllm/backends/cpu/kernels/x86/softmax.cpp (1)

37-66: Good switch to unaligned HWY memory ops (safer, same math).
Using hn::LoadU/StoreU here removes any hidden alignment requirement on X/Y without changing the reduction / exp / scale logic.

mllm/backends/cpu/kernels/x86/rmsnorm.cpp (1)

26-46: LGTM: LoadU/StoreU is the right default unless you can prove alignment.
This should prevent misaligned access issues without affecting results.
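The comments above can be illustrated with a minimal, portable sketch. Highway's `hn::LoadU`/`hn::StoreU` tolerate any address, whereas aligned variants may fault; the `memcpy`-based helpers below are the scalar analogue of that unaligned-safe access pattern (these helper names are illustrative, not mllm's API):

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Portable analogue of hn::LoadU / hn::StoreU: memcpy-based access is
// well-defined at any alignment. Hardware-aligned loads (e.g. _mm_load_ps)
// require the pointer to be a multiple of the vector width and may fault.
inline float load_unaligned_f32(const unsigned char* p) {
  float v;
  std::memcpy(&v, p, sizeof(v));  // defined behavior at any address
  return v;
}

inline void store_unaligned_f32(unsigned char* p, float v) {
  std::memcpy(p, &v, sizeof(v));
}
```

The `+ 1` offset in a round-trip test deliberately produces a misaligned address, which is exactly the case the unaligned ops make safe.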

mllm/backends/cpu/ops/ElewiseOps.cpp (1)

142-162: Nice: explicit #else NYI(...) prevents silent output corruption on unsupported arch.
This addresses the “do nothing and still return” failure mode for fp32 elementwise ops.

Also applies to: 325-345, 507-527, 689-709

mllm/backends/cpu/kernels/common/elewise-inl.hpp (2)

15-148: Elementwise + scalar FP32 kernels look correct; nice fix vs reserved identifier.
Vector + tail handling is coherent (LoadU/StoreU + LoadN/StoreN), and the scalar RHS path is implemented consistently.
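The "vector body + tail" shape praised here can be sketched in plain C++. This is a scalar stand-in for the kernel structure (full-width chunks via LoadU/StoreU, then a LoadN/StoreN remainder); `kLanes` is a hypothetical lane count standing in for what Highway would report via `Lanes(d)`:

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 8;  // hypothetical SIMD lane count

// Sketch of the loop shape in elewise-inl.hpp: process full "vector-width"
// chunks first, then the n % kLanes tail elements.
void ew_add_sketch(float* out, const float* x, const float* y, std::size_t n) {
  std::size_t i = 0;
  // Full-width body: one LoadU/Add/StoreU per chunk in the real kernel.
  for (; i + kLanes <= n; i += kLanes) {
    for (std::size_t j = 0; j < kLanes; ++j) { out[i + j] = x[i + j] + y[i + j]; }
  }
  // Tail: remaining elements, handled with LoadN/StoreN in the real kernel.
  for (; i < n; ++i) { out[i] = x[i] + y[i]; }
}
```

A length that is not a multiple of the lane count (e.g. 11 with `kLanes = 8`) exercises both the body and the tail path.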


1-10: No ODR risk in elewise-inl.hpp — proper Highway dynamic dispatch pattern. This file is correctly included only in kernel_dispatch.cpp as part of Highway's multi-target code generation mechanism. The functions are defined within HWY_NAMESPACE blocks that are re-instantiated with different namespace values for each enabled target (via foreach_target.h). HWY_EXPORT and HWY_DYNAMIC_DISPATCH macros properly handle symbol resolution across target variants. No ODR violation occurs.

mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (1)

4-80: Pattern is correct and production-ready.

The dynamic dispatch scaffold properly implements Google Highway 1.3.0's multi-target dispatch mechanism: foreach_target.h re-includes the source for each enabled target, HWY_NAMESPACE wraps target-specific implementations, and HWY_DYNAMIC_DISPATCH() selects the right variant at runtime. Architecture guards are correctly applied—x86 code is isolated from ARM with no competing definitions. All implementation functions are present in elewise-inl.hpp, and call sites in ElewiseOps.cpp properly branch between x86 (cpu::common) and ARM (cpu::arm) codepaths.
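For readers unfamiliar with Highway, the dispatch mechanism described above can be approximated with plain function pointers. This is only a conceptual analogue of `HWY_EXPORT`/`HWY_DYNAMIC_DISPATCH` (in the real mechanism, `foreach_target.h` generates the per-target namespaces and the feature check is cached); the namespaces and names below are illustrative:

```cpp
#include <cassert>

// Each "target" compiles its own copy of the kernel.
namespace avx2 { inline float mul1(float a, float b) { return a * b; } }
namespace sse4 { inline float mul1(float a, float b) { return a * b; } }

using Mul1Fn = float (*)(float, float);

// Analogue of HWY_DYNAMIC_DISPATCH: pick the best variant at runtime from
// CPU feature detection; Highway resolves this once and caches the choice.
inline Mul1Fn resolve_mul1(bool has_avx2) {
  return has_avx2 ? &avx2::mul1 : &sse4::mul1;
}
```

Both variants compute the same result; only the instruction set used to compute it differs, which is why the dispatch can be selected late without changing semantics.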

Comment on lines +107 to +116
3. Download pre-converted models from **our HuggingFace organization** (recommended on x86):

Due to current compatibility issues with the mllm-converter on x86 architecture, we recommend downloading pre-converted quantized models from our HuggingFace organization `mllmTeam <https://huggingface.co/UbiquitousLearning>`_:

Example command:

   .. code-block:: bash

      wget https://huggingface.co/mllmTeam/qwen-3-0.6b-mllm/blob/main/qwen-3-0.6b-q4_k.mllm


⚠️ Potential issue | 🟠 Major

Fix the Hugging Face download command (blob URL won’t fetch the model file).
Use a raw/resolve URL so wget downloads the artifact, not an HTML page.

-   wget https://huggingface.co/mllmTeam/qwen-3-0.6b-mllm/blob/main/qwen-3-0.6b-q4_k.mllm
+   wget -O qwen-3-0.6b-q4_k.mllm https://huggingface.co/mllmTeam/qwen-3-0.6b-mllm/resolve/main/qwen-3-0.6b-q4_k.mllm
🤖 Prompt for AI Agents
In docs/cpu_backend/x86/index.rst around lines 107 to 116, the wget example uses
a Hugging Face "blob" HTML URL which will download an HTML page instead of the
model artifact; update the example to point to the repository file raw/resolve
endpoint (e.g. use the Hugging Face raw or /resolve/main file URL or append
?raw=true) so wget retrieves the actual .mllm file instead of an HTML page.
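The fix is a one-token URL rewrite: Hugging Face serves the raw artifact from `/resolve/<rev>/<path>`, while `/blob/<rev>/<path>` is the HTML file viewer. A minimal sketch of the transformation (helper name is illustrative):

```cpp
#include <cassert>
#include <string>

// Rewrite a Hugging Face "blob" (HTML viewer) URL into the "resolve"
// (raw file download) URL; returns the input unchanged if no /blob/ segment.
std::string blob_to_resolve(std::string url) {
  const std::string from = "/blob/";
  const std::string to = "/resolve/";
  const auto pos = url.find(from);
  if (pos != std::string::npos) { url.replace(pos, from.size(), to); }
  return url;
}
```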

Comment on lines +149 to 151
#else
NYI("AddOp not supported on this architecture.");
#endif

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

files=(
  "mllm/backends/cpu/ops/ElewiseOps.cpp"
  "mllm/backends/cpu/ops/TransposeOp.cpp"
  "mllm/backends/cpu/kernels/x86/softmax.cpp"
  "mllm/backends/cpu/kernels/x86/rmsnorm.cpp"
  "mllm/backends/cpu/kernels/common/kernel_dispatch.cpp"
  "mllm/backends/cpu/kernels/common/elewise-inl.hpp"
  "mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp"
)

python - "${files[@]}" <<'PY'
import sys, pathlib, re

BAD_BYTES = set(list(range(0x00,0x09)) + list(range(0x0B,0x0D)) + list(range(0x0E,0x20)) + [0x7F] + list(range(0x80,0xA0)))
files = sys.argv[1:]
fail = False

for p in files:
    b = pathlib.Path(p).read_bytes()
    if b.startswith(b"\xef\xbb\xbf"):
        print(f"[FAIL] BOM found: {p}")
        fail = True
    bad = [i for i, bb in enumerate(b) if bb in BAD_BYTES]
    if bad:
        print(f"[FAIL] Control code bytes found: {p} (first at byte offset {bad[0]})")
        fail = True
    if b"\r\n" in b or b"\r" in b:
        print(f"[FAIL] CRLF/CR found (expected LF only): {p}")
        fail = True

    text = b.decode("utf-8", errors="strict")
    for i, line in enumerate(text.split("\n"), 1):
        if re.search(r"[ \t]+$", line):
            print(f"[FAIL] Trailing whitespace: {p}:{i}")
            fail = True

if fail:
    sys.exit(1)
print("[OK] No BOM/control-codes/CRLF/trailing-whitespace in checked files.")
PY

Repository: UbiquitousLearning/mllm

Length of output: 255


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# First, check if the file exists
if [ ! -f "mllm/backends/cpu/ops/ElewiseOps.cpp" ]; then
    echo "[FAIL] File not found: mllm/backends/cpu/ops/ElewiseOps.cpp"
    exit 1
fi

# Check for trailing whitespace at the specified lines
echo "Checking lines 149-151, 159-161, 191-193 for trailing whitespace:"

# Get line count first
total_lines=$(wc -l < "mllm/backends/cpu/ops/ElewiseOps.cpp")
echo "Total lines in file: $total_lines"

# Check specific line ranges
for line_num in 149 150 151 159 160 161 191 192 193; do
    if [ "$line_num" -le "$total_lines" ]; then
        line_content=$(sed -n "${line_num}p" "mllm/backends/cpu/ops/ElewiseOps.cpp")
        # Check if line ends with space or tab
        if [[ "$line_content" =~ [[:space:]]$ ]]; then
            echo "[FAIL] Line $line_num has trailing whitespace: $(sed -n "${line_num}p" "mllm/backends/cpu/ops/ElewiseOps.cpp" | cat -A)"
        else
            echo "[OK] Line $line_num: no trailing whitespace"
        fi
    fi
done

# Also do a comprehensive check for any trailing whitespace in the entire file
echo ""
echo "Comprehensive check for trailing whitespace in entire file:"
trailing_lines=$(grep -n '[[:space:]]$' "mllm/backends/cpu/ops/ElewiseOps.cpp" || true)
if [ -n "$trailing_lines" ]; then
    echo "[FAIL] Lines with trailing whitespace found:"
    echo "$trailing_lines"
else
    echo "[OK] No trailing whitespace found in file"
fi

Repository: UbiquitousLearning/mllm

Length of output: 2141


Remove trailing whitespace (repo guideline violation).

Lines 150, 160, and 192 contain trailing spaces after the NYI() calls. Files under mllm/** must not have trailing whitespace at line ends.

-        NYI("AddOp not supported on this architecture.");                              
+        NYI("AddOp not supported on this architecture.");
🤖 Prompt for AI Agents
In mllm/backends/cpu/ops/ElewiseOps.cpp around lines 149-151, 158-162 and
190-194, there are trailing spaces after the NYI("...") calls (specifically
noted at lines 150, 160 and 192); remove the trailing whitespace at the end of
those lines so the NYI calls end immediately before the line break, and ensure
no other trailing spaces remain in the file (run your editor's "trim trailing
whitespace" or a linter/formatter to verify).

Comment on lines +166 to 193
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
        // Process each batch separately
        for (int batch = 0; batch < batch_dims; ++batch) {
          // Each batch processes broadcast_naive_loops iterations of vector_size elements
          for (int l = 0; l < broadcast_naive_loops; ++l) {
            size_t a_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;
            size_t b_offset = batch * vector_size;  // b doesn't broadcast over loops dimension
            size_t out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;

            cpu::common::call_elewise_add_fp32(out + out_offset, a + a_offset, b + b_offset, vector_size);
          }
        }
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
        // Process each batch separately
        for (int batch = 0; batch < batch_dims; ++batch) {
          // Each batch processes broadcast_naive_loops iterations of vector_size elements
          for (int l = 0; l < broadcast_naive_loops; ++l) {
            size_t a_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;
            size_t b_offset = batch * vector_size;  // b doesn't broadcast over loops dimension
            size_t out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;

            cpu::arm::ew_add_fp32(out + out_offset, a + a_offset, b + b_offset, vector_size, options_.getThreads());
          }
        }
#else
        NYI("AddOp not supported on this architecture.");
#endif

⚠️ Potential issue | 🟠 Major

size_t offsets still risk overflow due to int intermediate math.
batch * broadcast_naive_loops * vector_size is computed as int then converted to size_t. Cast inputs before multiplying (apply similarly in Add/Sub/Mul/Div).

-            size_t a_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;
-            size_t b_offset = batch * vector_size;  // b doesn't broadcast over loops dimension
-            size_t out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;
+            const size_t batch_u = static_cast<size_t>(batch);
+            const size_t loops_u = static_cast<size_t>(broadcast_naive_loops);
+            const size_t vec_u = static_cast<size_t>(vector_size);
+            const size_t l_u = static_cast<size_t>(l);
+            const size_t a_offset = batch_u * loops_u * vec_u + l_u * vec_u;
+            const size_t b_offset = batch_u * vec_u;  // b doesn't broadcast over loops dimension
+            const size_t out_offset = batch_u * loops_u * vec_u + l_u * vec_u;

Also applies to: 349-376, 531-558, 713-740

🤖 Prompt for AI Agents
In mllm/backends/cpu/ops/ElewiseOps.cpp around lines 166 to 193 (and similarly
for ranges 349-376, 531-558, 713-740), the offsets are computed using int
arithmetic then assigned to size_t which can overflow; cast the loop indices or
operands to size_t before performing multiplications (e.g., use (size_t)batch,
(size_t)broadcast_naive_loops, (size_t)vector_size) so all intermediate math is
done in size_t, and apply the same fix to the corresponding Add/Sub/Mul/Div
blocks to ensure safe offset calculation across architectures.

Comment on lines +30 to 43
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
      NYI("Transpose op(HW -> WH) fp32 not supported in x86");
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
      arm::transpose_hw_wh_fp32(input.ptr<mllm_fp32_t>(), output.ptr<mllm_fp32_t>(), input_shape[0], input_shape[1]);
#endif
      break;
    }
    case kFloat16: {
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
      NYI("Transpose op(HW -> WH) fp16 not supported in x86");
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
      arm::transpose_hw_wh_fp16(input.ptr<mllm_fp16_t>(), output.ptr<mllm_fp16_t>(), input_shape[0], input_shape[1]);
#endif
      break;

⚠️ Potential issue | 🟠 Major

Add #else NYI(...) to avoid silent no-op on non-{x86,arm}.
Right now, if someone builds on another arch (e.g., WASM/RISC-V), these branches can fall through without writing output. Recommend making the unsupported case explicit.

 #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
   NYI("Transpose op(HW -> WH) fp32 not supported in x86");
 #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
   arm::transpose_hw_wh_fp32(...);
+#else
+  NYI("Transpose op(HW -> WH) fp32 not supported on this architecture");
 #endif

Also applies to: 59-74, 95-110

🤖 Prompt for AI Agents
In mllm/backends/cpu/ops/TransposeOp.cpp around lines 30-43 (also apply same
change to 59-74 and 95-110), the architecture-specific #if/#elif blocks
currently omit an #else path so building on non-x86/non-ARM targets results in a
silent no-op; add an #else branch that calls NYI("Transpose op(...) not
supported on this arch") (with the appropriate dtype/shape message) before the
#endif to make unsupported architectures explicit and ensure the code
reports/notifies instead of falling through.
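The recommended pattern can be sketched in isolation. This is not mllm's code: the compiler-builtin arch macros and the `nyi()` helper below stand in for `MLLM_HOST_ARCH_*` and the project's `NYI` macro, and the string return is just a probe for which branch compiled in:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in for mllm's NYI macro: fail loudly instead of silently skipping.
[[noreturn]] inline void nyi(const std::string& what) {
  throw std::runtime_error("NYI: " + what);
}

// Every arch #if chain ends in an #else so an unsupported target (e.g.
// WASM, RISC-V) reports an error rather than falling through as a no-op.
inline const char* transpose_path() {
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__)
  return "x86";  // would call the x86 kernel (or NYI for now)
#elif defined(__aarch64__) || defined(__arm__)
  return "arm";  // would call arm::transpose_hw_wh_fp32
#else
  nyi("Transpose op(HW -> WH) fp32 not supported on this architecture");
#endif
}
```

On an unsupported target the branch throws instead of returning with the output buffer untouched, which is precisely the silent-corruption mode the review flags.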

@chenghuaWang chenghuaWang left a comment

LGTM

@chenghuaWang chenghuaWang merged commit 0c945f3 into UbiquitousLearning:main Dec 13, 2025
5 checks passed
@HayzelHan HayzelHan deleted the feat/x86-qwen3 branch December 14, 2025 01:57