feat(qwen3, cpu): add support for Qwen3 model on x86 architecture#561
feat(qwen3, cpu): add support for Qwen3 model on x86 architecture#561chenghuaWang merged 8 commits intoUbiquitousLearning:mainfrom
Conversation
WalkthroughIntroduces HWY-based dynamic dispatch and x86-focused FP32 element-wise kernels (vector and scalar), adds kernel dispatch headers/impl, switches some x86 memory accesses to unaligned variants, updates quantize FP16 type, gates several ops by architecture, enables example builds, and adds an x86 CI workflow. Changes
Sequence Diagram(s)sequenceDiagram
participant Op as ElewiseOps (caller)
participant Disp as kernel_dispatch (call_elewise_*_fp32)
participant Impl as HWY target impl
participant Mem as Memory/input buffers
Op->>Disp: call_elewise_add_fp32(out, x, y, n)
Disp->>Impl: dispatch to selected HWY target (per-target include)
Impl->>Mem: LoadU x/y segments
Impl->>Impl: vectorized compute (add / scalar op)
Impl->>Mem: StoreU results to out
Disp-->>Op: return
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
mllm/backends/cpu/ops/TransposeOp.cpp (3)
26-53: Align x86 behaviour and int64 handling across transpose special-casesThe new x86/x86_64 branches for
kFloat32/kFloat16in CASE 1–3 correctly surface unsupported patterns viaNYI, while ARM keeps using the optimized kernels. However, forkInt64in these cases there is still only an ARM#ifand no explicit x86 handling. On an x86 build, thosekInt64cases effectively become:case kInt64: { // no call break; }which silently turns the transpose into a no-op for int64 tensors instead of failing fast.
To avoid silent data corruption when an int64 transpose is accidentally used on x86, consider adding explicit x86 branches (or wrapping the whole
case kInt64in the same#if/#elifstructure) so that x86 either:
- Calls a generic non-ARM fallback, or
- Emits a clear
NYI("Transpose op(...) int64 not supported in x86").This same pattern applies to CASE 1 (HW→WH), CASE 2 (BSHD→BHSD) and CASE 3 (last-two-dims)
kInt64branches.Also applies to: 55-85, 87-121
87-91: Fixbatchaccumulation bug in CASE 3 (last-dims transpose)In CASE 3 you compute
batchas:int batch = 0; for (int i = 0; i < input_shape.size() - 2; i++) { batch *= input_shape[i]; }Because
batchstarts at 0 and is only multiplied,batchis always 0, so thetranspose_last_dims_*kernels will never process any batch element. This makes the entire CASE 3 path a no-op for all dtypes/architectures.Initialize
batchto 1 instead so it correctly accumulates the product of the leading dimensions, e.g.:- int batch = 0; - for (int i = 0; i < input_shape.size() - 2; i++) { batch *= input_shape[i]; } + int batch = 1; + for (int i = 0; i < input_shape.size() - 2; i++) { batch *= input_shape[i]; }This is independent of the existing
FIXMEabout(dim0 + dim1 == 1)and should be fixed regardless.
123-153: General permute CASE 4 is a silent no-op on x86; add explicit x86 handling / NYIIn CASE 4 (general permute), the dtype branches are guarded only by ARM macros, for example:
case kFloat32: { #if defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) arm::permute_fp32(...); #endif break; }On x86/x86_64, the
#ifbody is compiled out, so the code effectively just hitsbreak;with no kernel call and noNYI, meaning the operation becomes a silent no-op and leavesoutputunpermuted. That’s more dangerous than the explicitNYIbehaviour you added in CASE 1–3.To align behaviour and avoid silent incorrect results on x86, consider restructuring these branches similar to CASE 1–3, e.g.:
case kFloat32: { -#if defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) - arm::permute_fp32(input.ptr<mllm_fp32_t>(), output.ptr<mllm_fp32_t>(), input_shape.data(), permute_axis.data(), - permute_axis.size()); -#endif + #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86) + NYI("Transpose op(general permute) fp32 not supported in x86"); + #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) + arm::permute_fp32(input.ptr<mllm_fp32_t>(), output.ptr<mllm_fp32_t>(), input_shape.data(), permute_axis.data(), + permute_axis.size()); + #endif break; } case kFloat16: { -#if defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) - arm::permute_fp16(input.ptr<mllm_fp16_t>(), output.ptr<mllm_fp16_t>(), input_shape.data(), permute_axis.data(), - permute_axis.size()); -#endif + #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86) + NYI("Transpose op(general permute) fp16 not supported in x86"); + #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) + arm::permute_fp16(input.ptr<mllm_fp16_t>(), output.ptr<mllm_fp16_t>(), input_shape.data(), permute_axis.data(), + permute_axis.size()); + #endif break; } case kInt64: { -#if defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) - arm::permute_generic<mllm_int64_t>(input.ptr<mllm_int64_t>(), output.ptr<mllm_int64_t>(), input_shape.data(), - permute_axis.data(), permute_axis.size()); -#endif + #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86) + NYI("Transpose op(general permute) int64 not supported in x86"); + #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) + arm::permute_generic<mllm_int64_t>(input.ptr<mllm_int64_t>(), output.ptr<mllm_int64_t>(), input_shape.data(), + permute_axis.data(), permute_axis.size()); + #endif break; }(or route x86 to a generic non-ARM permute implementation if/when one exists). This keeps x86 semantics explicit and consistent with the earlier transpose special-cases.
🧹 Nitpick comments (2)
mllm/backends/cpu/ops/ElewiseOps.cpp (1)
143-157: Consider adding a fallback for unsupported platforms.The code has branches for x86 and ARM but no fallback
elseclause. On platforms that are neither x86 nor ARM (e.g., RISC-V, WebAssembly), these operations will silently do nothing, leading to incorrect results without any warning.Consider adding a fallback path or NYI message:
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) cpu::arm::ew_add_fp32(output.ptr<mllm_fp32_t>(), input0.ptr<mllm_fp32_t>(), input1.ptr<mllm_fp32_t>(), output.numel(), options_.getThreads()); +#else + NYI("AddOp not supported on this platform."); #endifmllm/backends/cpu/kernels/common/elewise-inl.hpp (1)
113-139: Consider reusing existing operator structs to eliminate duplication.The scalar operator structs (
AddScalarOp,SubScalarOp,MulScalarOp,DivScalarOp) are identical to the non-scalar operators (AddOp,SubOp,MulOp,DivOp) defined at lines 39-65. Since the operators work on vector types and don't distinguish between scalar-broadcast and element-wise operations, you could directly reuse the existing operators.For example, update the scalar functions to reuse existing operators:
-template<typename T> -HWY_NOINLINE HWY_MAYBE_UNUSED void element_wise_add_scalar(T* out, const T* x, const T* y, size_t n) { - __elementwise_scalar(out, x, y, n, AddScalarOp{}); -} +template<typename T> +HWY_NOINLINE HWY_MAYBE_UNUSED void element_wise_add_scalar(T* out, const T* x, const T* y, size_t n) { + __elementwise_scalar(out, x, y, n, AddOp{}); +}Then remove the duplicate scalar operator structs (lines 113-139).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
examples/CMakeLists.txt(1 hunks)examples/qwen3/main.cpp(1 hunks)mllm/backends/cpu/kernels/Kernels.hpp(1 hunks)mllm/backends/cpu/kernels/common/elewise-inl.hpp(1 hunks)mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp(1 hunks)mllm/backends/cpu/kernels/x86/rmsnorm.cpp(1 hunks)mllm/backends/cpu/kernels/x86/softmax.cpp(3 hunks)mllm/backends/cpu/ops/ElewiseOps.cpp(11 hunks)mllm/backends/cpu/ops/LinearOp.cpp(1 hunks)mllm/backends/cpu/ops/TransposeOp.cpp(3 hunks)mllm/models/qwen3/modeling_qwen3_fa2.hpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
{mllm,mllm-cli,pymllm}/**/*
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).
Files:
mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hppmllm/backends/cpu/ops/ElewiseOps.cppmllm/backends/cpu/kernels/Kernels.hppmllm/models/qwen3/modeling_qwen3_fa2.hppmllm/backends/cpu/kernels/x86/rmsnorm.cppmllm/backends/cpu/kernels/common/elewise-inl.hppmllm/backends/cpu/ops/LinearOp.cppmllm/backends/cpu/ops/TransposeOp.cppmllm/backends/cpu/kernels/x86/softmax.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.
Files:
mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hppmllm/backends/cpu/ops/ElewiseOps.cppmllm/backends/cpu/kernels/Kernels.hppmllm/models/qwen3/modeling_qwen3_fa2.hppmllm/backends/cpu/kernels/x86/rmsnorm.cppmllm/backends/cpu/kernels/common/elewise-inl.hppmllm/backends/cpu/ops/LinearOp.cppmllm/backends/cpu/ops/TransposeOp.cppmllm/backends/cpu/kernels/x86/softmax.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).
Files:
mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hppmllm/backends/cpu/ops/ElewiseOps.cppmllm/backends/cpu/kernels/Kernels.hppmllm/models/qwen3/modeling_qwen3_fa2.hppmllm/backends/cpu/kernels/x86/rmsnorm.cppmllm/backends/cpu/kernels/common/elewise-inl.hppmllm/backends/cpu/ops/LinearOp.cppmllm/backends/cpu/ops/TransposeOp.cppmllm/backends/cpu/kernels/x86/softmax.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.
Files:
mllm/backends/cpu/ops/ElewiseOps.cppmllm/backends/cpu/kernels/x86/rmsnorm.cppmllm/backends/cpu/ops/LinearOp.cppmllm/backends/cpu/ops/TransposeOp.cppmllm/backends/cpu/kernels/x86/softmax.cpp
🧬 Code graph analysis (3)
mllm/backends/cpu/ops/ElewiseOps.cpp (2)
mllm/backends/cpu/kernels/common/elewise-inl.hpp (8)
element_wise_add(68-68)element_wise_add_scalar(142-142)element_wise_sub(73-73)element_wise_sub_scalar(147-147)element_wise_mul(78-78)element_wise_mul_scalar(152-152)element_wise_div(83-83)element_wise_div_scalar(157-157)mllm/backends/cpu/kernels/arm/elementwise.hpp (8)
ew_add_fp32(262-263)ew_add_fp32_scalar(324-325)ew_sub_fp32(265-266)ew_sub_fp32_scalar(327-328)ew_mul_fp32(268-269)ew_mul_fp32_scalar(330-331)ew_div_fp32(271-272)ew_div_fp32_scalar(333-334)
mllm/backends/cpu/ops/TransposeOp.cpp (1)
mllm/backends/cpu/kernels/arm/transpose.cpp (6)
transpose_hw_wh_fp32(13-46)transpose_hw_wh_fp32(13-13)transpose_bshd_bhsd_fp32(48-73)transpose_bshd_bhsd_fp32(48-49)transpose_last_dims_fp32(160-200)transpose_last_dims_fp32(160-161)
mllm/backends/cpu/kernels/x86/softmax.cpp (1)
mllm/backends/cpu/kernels/x86/math.hpp (1)
vexpq_fast_f32(19-19)
🪛 Clang (14.0.6)
mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp
[error] 114-114: parameter name 'f' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
mllm/backends/cpu/kernels/common/elewise-inl.hpp
[error] 68-68: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 68-68: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 68-68: 2 adjacent parameters of 'element_wise_add' of similar type ('const int *') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 68-68: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 68-68: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 73-73: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 73-73: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 73-73: 2 adjacent parameters of 'element_wise_sub' of similar type ('const int *') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 73-73: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 73-73: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 78-78: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 78-78: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 78-78: 2 adjacent parameters of 'element_wise_mul' of similar type ('const int *') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 78-78: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 78-78: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 83-83: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 83-83: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 83-83: 2 adjacent parameters of 'element_wise_div' of similar type ('const int *') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 83-83: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 83-83: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 92-92: declaration uses identifier '__elementwise_scalar', which is a reserved identifier
(bugprone-reserved-identifier,-warnings-as-errors)
[error] 93-93: variable 'd' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 93-93: variable name 'd' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 94-94: variable 'N' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 94-94: variable name 'N' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 94-94: invalid case style for variable 'N'
(readability-identifier-naming,-warnings-as-errors)
[error] 95-95: variable 'idx' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 98-98: variable 'sVec' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 98-98: invalid case style for variable 'sVec'
(readability-identifier-naming,-warnings-as-errors)
[error] 107-107: variable 'vx' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 107-107: variable name 'vx' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 108-108: variable 'result' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 142-142: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 142-142: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 142-142: 2 adjacent parameters of 'element_wise_add_scalar' of similar type ('const int *') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 142-142: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 142-142: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 147-147: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 147-147: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 147-147: 2 adjacent parameters of 'element_wise_sub_scalar' of similar type ('const int *') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 147-147: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 147-147: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 152-152: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 152-152: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 152-152: 2 adjacent parameters of 'element_wise_mul_scalar' of similar type ('const int *') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 152-152: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 152-152: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 157-157: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 157-157: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 157-157: 2 adjacent parameters of 'element_wise_div_scalar' of similar type ('const int *') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 157-157: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 157-157: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
🔇 Additional comments (13)
mllm/backends/cpu/ops/LinearOp.cpp (1)
293-309: Architecture-guarded NYI for x86 is acceptable, but verify default fallback behavior.The explicit NYI for x86 is a reasonable approach to mark unsupported implementations. However, note that when
MLLM_USE_BLASis not defined andK < 4(lines 119-124),kDefaultauto-selectskMllmBlas, which will hit this NYI on x86 at runtime.Consider whether x86 without BLAS should default to
kGGUFregardless ofKsize to avoid the unexpected NYI:#else if (K >= 4) { impl_type = aops::LinearImplTypes::kGGUF; - } else - // All fallback to mllm blas - { + } else { +#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86) + impl_type = aops::LinearImplTypes::kGGUF; // x86 lacks kMllmBlas support +#else impl_type = aops::LinearImplTypes::kMllmBlas; +#endif } #endifAlternatively, if hitting NYI for small-K x86 scenarios is intentional as a known limitation, this is fine as-is.
examples/CMakeLists.txt (1)
8-9: LGTM!Enabling qwen3 and qwen3_service build targets aligns with the PR objective to add Qwen3 model support on x86 architecture.
mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp (1)
114-127: Signature update looks correct.The parameter type change from
uint16_ttomllm_fp16_timproves type consistency with the FP16 abstraction used elsewhere in the codebase.Note: The lazy initialization of
table_f32_f16(lines 115-122) has a pre-existing race condition if called from multiple threads simultaneously. Consider usingstd::call_onceor static initialization if thread safety becomes a concern.examples/qwen3/main.cpp (1)
4-4: LGTM!Header updated to use the FlashAttention2 variant (
modeling_qwen3_fa2.hpp), aligning with the PR objective to add FA2 support for Qwen3.mllm/backends/cpu/kernels/Kernels.hpp (1)
40-40: LGTM!Adding the
elewise-inl.hppinclude for non-ARM platforms enables the HWY-accelerated element-wise operations used by the x86 path inElewiseOps.cpp.mllm/backends/cpu/kernels/x86/softmax.cpp (1)
38-65: LGTM!Switching to
LoadU/StoreUfor unaligned memory access is the correct approach when alignment of input/output buffers cannot be guaranteed. This improves robustness and prevents potential segfaults on x86 when data is not aligned to SIMD register boundaries.mllm/backends/cpu/kernels/x86/rmsnorm.cpp (1)
30-45: LGTM!Consistent with the softmax changes, switching to unaligned
LoadU/StoreUoperations ensures robustness for input buffers that may not be aligned to SIMD boundaries. Both theadd_unit_offsetbranches are updated consistently.mllm/models/qwen3/modeling_qwen3_fa2.hpp (1)
262-266: Verify model weight compatibility with the new registration name.The registration name changed from
"lm_head_out"to"lm_head". This is a more conventional name, but it's a breaking change for any existing model weight files that use the old key name.Ensure that model weight files for Qwen3 use
"lm_head"as the weight key, or that the model loader handles this mapping appropriately. If existing weights use"lm_head_out", they will fail to load correctly with this change.mllm/backends/cpu/ops/ElewiseOps.cpp (2)
143-149: LGTM!The x86 HWY-accelerated element-wise add implementation correctly uses the
element_wise_addfunction fromelewise-inl.hppwith the proper out-of-place parameter ordering(out, x, y, n).
847-851: Good use of NYI for unsupported operations.Explicitly marking AbsOp (and similar ops like LogOp, ExpOp, ClipOp, SinOp, CosOp) as not-yet-implemented on x86 is better than silent failure. These can be tracked and implemented in future work.
mllm/backends/cpu/kernels/common/elewise-inl.hpp (3)
91-111: LGTM! Scalar element-wise implementation follows Highway patterns.The implementation correctly broadcasts the scalar value to a vector and applies the operation efficiently using Highway SIMD primitives. The tail handling with
LoadN/StoreNproperly handles non-multiple-of-lane counts.
142-159: LGTM! Scalar wrapper functions correctly delegate to the scalar implementation.The four scalar element-wise wrapper functions follow the established pattern and correctly delegate to
__elementwise_scalarwith the appropriate operator.
68-85: Verify all call sites are updated for the new out-of-place API signature.This change converts the element-wise operation functions from in-place to explicit out-of-place semantics, with
outnow as the first parameter. Confirm that all callers throughout the codebase have been updated to pass arguments in the new order:element_wise_add(out, x, y, n)instead of the previous signature. This is a breaking change that affects all usage sites.
There was a problem hiding this comment.
pls using lm_head_out for BC. Arm CPU use lm_head_out.
mllm/backends/cpu/ops/ElewiseOps.cpp
Outdated
| int out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size; | ||
|
|
||
| #if defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) | ||
| cpu::common::HWY_NAMESPACE::element_wise_div(out + out_offset, a + a_offset, b + b_offset, vector_size); |
There was a problem hiding this comment.
Is cpu::common::HWY_NAMESPACE the right way to use highway?
IIRC, We should include head file of highway operator for multiple times and use highway's dispatcher to dispatch one op.
| #add_subdirectory(llama) | ||
| add_subdirectory(minicpm_o) | ||
| add_subdirectory(minicpm4) | ||
| #add_subdirectory(qwen3) |
There was a problem hiding this comment.
Pls uncomment all target. I think this is involved by @KKkai0315 .
|
Can you add a CI/CD test for the X86 backend? Just referencing the macOS test scripts is fine. |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
mllm/backends/cpu/kernels/common/elewise-inl.hpp (1)
15-34: Reserved identifier usage:__elementwise.Identifiers starting with double underscore (
__) are reserved in C++. Rename to avoid potential conflicts with compiler/library internals.template<typename T, typename Op> -HWY_INLINE void __elementwise(const T* HWY_RESTRICT x, const T* HWY_RESTRICT y, T* HWY_RESTRICT out, size_t count, Op&& op) { +HWY_INLINE void elementwise_impl(const T* HWY_RESTRICT x, const T* HWY_RESTRICT y, T* HWY_RESTRICT out, size_t count, Op&& op) {Also update the call sites (lines 65, 69, 73, 77).
🧹 Nitpick comments (2)
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (1)
67-75: Remove or document commented-out GELU code.Commented-out code without explanation adds noise. Either remove it or add a TODO comment explaining why it's preserved and when it will be implemented.
-//===----------------------------------------------------------------------===// -// GELU -//===----------------------------------------------------------------------===// -// HWY_EXPORT(gelu_fp32); -// -// HWY_DLLEXPORT void call_gelu_fp32(mllm_fp32_t* out, const mllm_fp32_t* in, size_t n) { -// HWY_DYNAMIC_DISPATCH(gelu_fp32)(out, in, n); -// } +// TODO: Add GELU dispatch when elewise-inl.hpp implements gelu_fp32mllm/backends/cpu/kernels/common/elewise-inl.hpp (1)
106-132: Duplicate functors:*ScalarOpare identical to*Op.
AddScalarOp,SubScalarOp,MulScalarOp, andDivScalarOphave identical implementations to their non-scalar counterparts. Reuse the existing functors to reduce duplication.-struct AddScalarOp { - template<class D, class V> - HWY_INLINE V operator()(D d, V a, V b) const { - return hn::Add(a, b); - } -}; - -struct SubScalarOp { - template<class D, class V> - HWY_INLINE V operator()(D d, V a, V b) const { - return hn::Sub(a, b); - } -}; - -struct MulScalarOp { - template<class D, class V> - HWY_INLINE V operator()(D d, V a, V b) const { - return hn::Mul(a, b); - } -}; - -struct DivScalarOp { - template<class D, class V> - HWY_INLINE V operator()(D d, V a, V b) const { - return hn::Div(a, b); - } -}; - HWY_NOINLINE HWY_MAYBE_UNUSED void elewise_add_scalar_fp32(mllm_fp32_t* out, const mllm_fp32_t* x, const mllm_fp32_t y, size_t n) { - __elementwise_scalar(out, x, y, n, AddScalarOp{}); + elementwise_scalar_impl(out, x, y, n, AddOp{}); } HWY_NOINLINE HWY_MAYBE_UNUSED void elewise_sub_scalar_fp32(mllm_fp32_t* out, const mllm_fp32_t* x, const mllm_fp32_t y, size_t n) { - __elementwise_scalar(out, x, y, n, SubScalarOp{}); + elementwise_scalar_impl(out, x, y, n, SubOp{}); } HWY_NOINLINE HWY_MAYBE_UNUSED void elewise_mul_scalar_fp32(mllm_fp32_t* out, const mllm_fp32_t* x, const mllm_fp32_t y, size_t n) { - __elementwise_scalar(out, x, y, n, MulScalarOp{}); + elementwise_scalar_impl(out, x, y, n, MulOp{}); } HWY_NOINLINE HWY_MAYBE_UNUSED void elewise_div_scalar_fp32(mllm_fp32_t* out, const mllm_fp32_t* x, const mllm_fp32_t y, size_t n) { - __elementwise_scalar(out, x, y, n, DivScalarOp{}); + elementwise_scalar_impl(out, x, y, n, DivOp{}); }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
.github/workflows/build-x86.yml(1 hunks)examples/CMakeLists.txt(1 hunks)mllm/backends/cpu/kernels/Kernels.hpp(1 hunks)mllm/backends/cpu/kernels/common/elewise-inl.hpp(2 hunks)mllm/backends/cpu/kernels/common/kernel_dispatch.cpp(1 hunks)mllm/backends/cpu/kernels/common/kernel_dispatch.hpp(1 hunks)mllm/backends/cpu/ops/ElewiseOps.cpp(11 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- examples/CMakeLists.txt
- mllm/backends/cpu/kernels/Kernels.hpp
🧰 Additional context used
📓 Path-based instructions (4)
{mllm,mllm-cli,pymllm}/**/*
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).
Files:
mllm/backends/cpu/kernels/common/kernel_dispatch.hppmllm/backends/cpu/ops/ElewiseOps.cppmllm/backends/cpu/kernels/common/elewise-inl.hppmllm/backends/cpu/kernels/common/kernel_dispatch.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.
Files:
mllm/backends/cpu/kernels/common/kernel_dispatch.hppmllm/backends/cpu/ops/ElewiseOps.cppmllm/backends/cpu/kernels/common/elewise-inl.hppmllm/backends/cpu/kernels/common/kernel_dispatch.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).
Files:
mllm/backends/cpu/kernels/common/kernel_dispatch.hppmllm/backends/cpu/ops/ElewiseOps.cppmllm/backends/cpu/kernels/common/elewise-inl.hppmllm/backends/cpu/kernels/common/kernel_dispatch.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.
Files:
mllm/backends/cpu/ops/ElewiseOps.cppmllm/backends/cpu/kernels/common/kernel_dispatch.cpp
🧬 Code graph analysis (3)
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp (2)
mllm/backends/cpu/kernels/common/elewise-inl.hpp (1)
void(16-34)mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (8)
void(35-37)void(39-41)void(43-45)void(47-49)void(51-53)void(55-57)void(59-61)void(63-65)
mllm/backends/cpu/kernels/common/elewise-inl.hpp (1)
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (8)
void(35-37)void(39-41)void(43-45)void(47-49)void(51-53)void(55-57)void(59-61)void(63-65)
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (1)
mllm/backends/cpu/kernels/common/elewise-inl.hpp (9)
elewise_add_fp32(64-64)elewise_sub_fp32(68-68)elewise_mul_fp32(72-72)elewise_div_fp32(76-76)elewise_add_scalar_fp32(134-134)elewise_sub_scalar_fp32(138-138)elewise_mul_scalar_fp32(142-142)elewise_div_scalar_fp32(146-146)void(16-34)
🪛 actionlint (1.7.9)
.github/workflows/build-x86.yml
39-39: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🪛 Clang (14.0.6)
mllm/backends/cpu/kernels/common/elewise-inl.hpp
[error] 64-64: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 64-64: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 64-64: 2 adjacent parameters of 'elewise_add_fp32' of similar type ('const int *') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 64-64: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 64-64: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 68-68: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 68-68: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 68-68: 2 adjacent parameters of 'elewise_sub_fp32' of similar type ('const int *') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 68-68: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 68-68: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 72-72: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 72-72: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 72-72: 2 adjacent parameters of 'elewise_mul_fp32' of similar type ('const int *') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 72-72: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 72-72: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 76-76: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 76-76: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 76-76: 2 adjacent parameters of 'elewise_div_fp32' of similar type ('const int *') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 76-76: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 76-76: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 85-85: declaration uses identifier '__elementwise_scalar', which is a reserved identifier
(bugprone-reserved-identifier,-warnings-as-errors)
[error] 86-86: variable 'd' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 86-86: variable name 'd' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 87-87: variable 'N' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 87-87: variable name 'N' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 87-87: invalid case style for variable 'N'
(readability-identifier-naming,-warnings-as-errors)
[error] 88-88: variable 'idx' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 91-91: variable 'sVec' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 91-91: invalid case style for variable 'sVec'
(readability-identifier-naming,-warnings-as-errors)
[error] 100-100: variable 'vx' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 100-100: variable name 'vx' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 101-101: variable 'result' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 134-134: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 134-134: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 134-134: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 134-134: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 138-138: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 138-138: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 138-138: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 138-138: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 142-142: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 142-142: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 142-142: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 142-142: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 146-146: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 146-146: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 146-146: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 146-146: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp
[error] 12-12: macro 'HWY_TARGET_INCLUDE' used to declare a constant; consider using a 'constexpr' constant
(cppcoreguidelines-macro-usage,-warnings-as-errors)
🔇 Additional comments (7)
.github/workflows/build-x86.yml (1)
1-51: Workflow structure and design are solid.The CI/CD workflow properly addresses the reviewer's request for X86 backend compilation testing. The path-based triggering, repository guard, and build steps are well-configured. Python 3.12 with pip caching is appropriate for build speed.
mllm/backends/cpu/ops/ElewiseOps.cpp (3)
143-149: x86 HWY dispatch integration looks correct.The dynamic dispatch pattern using
cpu::common::call_elewise_*_fp32is appropriate for Highway's runtime dispatch mechanism. This addresses the previous review comment about proper Highway usage.
144-148: x86 implementation lacks thread count parameter.The ARM implementations accept
options_.getThreads()for multi-threading, but the x86 HWY-based implementations don't expose threading control. Verify this is intentional or consider adding threading support for parity.
847-851: NYI placeholders for x86 are appropriate.Clear indication that these operations need x86 implementations. Consider adding TODO comments with tracking issue references if these are planned for near-term implementation.
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp (1)
1-28: Clean header structure for HWY dispatch declarations.The header is well-organized with appropriate includes, namespace usage, and function declarations. The separation of vectorized and scalar variants is clear.
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (1)
7-15: HWY dynamic dispatch pattern is correctly implemented.The
HWY_TARGET_INCLUDEandforeach_target.hpattern is the standard Highway approach for runtime CPU feature detection and dispatch. The static analysis warning about preferringconstexpris a false positive—this macro pattern is required by Highway's design.mllm/backends/cpu/kernels/common/elewise-inl.hpp (1)
64-78: FP32 elementwise implementations look correct.The implementations correctly use the HWY vectorized operations with proper function signatures matching the dispatch declarations.
| #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86) | ||
| // Process each batch separately | ||
| for (int batch = 0; batch < batch_dims; ++batch) { | ||
| // Each batch processes broadcast_naive_loops iterations of vector_size elements | ||
| for (int l = 0; l < broadcast_naive_loops; ++l) { | ||
| int a_offset = batch * broadcast_naive_loops * vector_size + l * vector_size; | ||
| int b_offset = batch * vector_size; // b doesn't broadcast over loops dimension | ||
| int out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size; | ||
|
|
||
| #if defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) | ||
| cpu::common::call_elewise_add_fp32(out + out_offset, a + a_offset, b + b_offset, vector_size); | ||
| } | ||
| } |
There was a problem hiding this comment.
Potential integer overflow in offset calculations.
When batch, broadcast_naive_loops, and vector_size are large, the expression batch * broadcast_naive_loops * vector_size can overflow int. Consider using size_t or int64_t for offset calculations.
- for (int batch = 0; batch < batch_dims; ++batch) {
+ for (int64_t batch = 0; batch < batch_dims; ++batch) {
// Each batch processes broadcast_naive_loops iterations of vector_size elements
- for (int l = 0; l < broadcast_naive_loops; ++l) {
- int a_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;
- int b_offset = batch * vector_size; // b doesn't broadcast over loops dimension
- int out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;
+ for (int64_t l = 0; l < broadcast_naive_loops; ++l) {
+ size_t a_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;
+ size_t b_offset = batch * vector_size; // b doesn't broadcast over loops dimension
+ size_t out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;Note: This same issue exists in the ARM paths and in all other broadcast_naive loops (Sub, Mul, Div ops).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86) | |
| // Process each batch separately | |
| for (int batch = 0; batch < batch_dims; ++batch) { | |
| // Each batch processes broadcast_naive_loops iterations of vector_size elements | |
| for (int l = 0; l < broadcast_naive_loops; ++l) { | |
| int a_offset = batch * broadcast_naive_loops * vector_size + l * vector_size; | |
| int b_offset = batch * vector_size; // b doesn't broadcast over loops dimension | |
| int out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size; | |
| #if defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) | |
| cpu::common::call_elewise_add_fp32(out + out_offset, a + a_offset, b + b_offset, vector_size); | |
| } | |
| } | |
| #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86) | |
| // Process each batch separately | |
| for (int64_t batch = 0; batch < batch_dims; ++batch) { | |
| // Each batch processes broadcast_naive_loops iterations of vector_size elements | |
| for (int64_t l = 0; l < broadcast_naive_loops; ++l) { | |
| size_t a_offset = batch * broadcast_naive_loops * vector_size + l * vector_size; | |
| size_t b_offset = batch * vector_size; // b doesn't broadcast over loops dimension | |
| size_t out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size; | |
| cpu::common::call_elewise_add_fp32(out + out_offset, a + a_offset, b + b_offset, vector_size); | |
| } | |
| } |
- Updated qwen3 model runtime using `modeling_qwen3_fa2.hpp` - Enabled qwen3 compilation options - Fixed rmsnorm and softmax operators - Fixed parameters for quantize - Added `NYI` messages for some unsupported ops
Added support for common CPU elementwise operators (fp32)
- Replace cpu::common::HWY_NAMESPACE with proper Highway dynamic dispatcher - Create general kernel_dispatch.hpp/cpp with HWY_DYNAMIC_DISPATCH pattern
- Use lm_head_out in `modeling_qwen3_fa2.hpp` - Uncomment all compilation targets in `examples/CMakeLists.txt`
- Add GitHub Actions workflow file for Linux (Ubuntu) x86_64 build
- Fix python version in `.github/workflows/build-x86.yml` - Rename function name to avoid using reserved identifiers - Add #else fallback for unsupported architectures - Fix potential integer overflow in offset calculations
0a6466f to
bd046b9
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
mllm/backends/cpu/ops/TransposeOp.cpp (1)
87-100:batchis computed incorrectly (always 0).
int batch = 0;then multiplying keeps it at 0, soarm::transpose_last_dims_*will run withbatch == 0. This looks like a functional bug.- int batch = 0; + int batch = 1; for (int i = 0; i < input_shape.size() - 2; i++) { batch *= input_shape[i]; }mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp (1)
114-127: Fix data race in FP16 lookup table initialization—avoid std::call_once in hot path.The
table_f32_f16_initflag has a data race: multiple threads callinglookup_fp16_to_fp32()during parallel inference can race on the flag without synchronization. SinceMLLM_FP16_TO_FP32is invoked in hot loops (vec_dot, quantization, GEMM kernels), this is a critical issue.Do not use
std::call_oncefor the fix—it adds atomic fences and potential mutex overhead on every call, which is unacceptable in hot paths. Instead, use one of:
- Eager initialization: initialize the table at program or model load time (best for inference servers).
- Atomic fast-path + mutex slow-path: load
std::atomic<bool>with acquire semantics; only acquire mutex on miss.- Thread-local initialization: per-thread tables (less efficient, avoid if possible).
Also add
static_assert(sizeof(mllm_fp16_t) == sizeof(uint16_t), "...")to guard thememcpyassumption. Rename parameterftofp16for clarity.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
.github/workflows/build-x86.yml(1 hunks)docs/cpu_backend/x86/index.rst(1 hunks)examples/CMakeLists.txt(1 hunks)examples/qwen3/main.cpp(1 hunks)mllm/backends/cpu/kernels/Kernels.hpp(1 hunks)mllm/backends/cpu/kernels/common/elewise-inl.hpp(3 hunks)mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp(1 hunks)mllm/backends/cpu/kernels/common/kernel_dispatch.cpp(1 hunks)mllm/backends/cpu/kernels/common/kernel_dispatch.hpp(1 hunks)mllm/backends/cpu/kernels/x86/rmsnorm.cpp(1 hunks)mllm/backends/cpu/kernels/x86/softmax.cpp(3 hunks)mllm/backends/cpu/ops/ElewiseOps.cpp(11 hunks)mllm/backends/cpu/ops/LinearOp.cpp(1 hunks)mllm/backends/cpu/ops/TransposeOp.cpp(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- mllm/backends/cpu/ops/LinearOp.cpp
- .github/workflows/build-x86.yml
- mllm/backends/cpu/kernels/common/kernel_dispatch.hpp
- mllm/backends/cpu/kernels/Kernels.hpp
- examples/CMakeLists.txt
🧰 Additional context used
📓 Path-based instructions (4)
{mllm,mllm-cli,pymllm}/**/*
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*: Files must not contain C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, or DEL 0x7F. Horizontal tab (0x09) and line feed (0x0A) are explicitly allowed.
All files must be encoded in UTF-8 without BOM.
Any violation of character set (Rule 1) or encoding (Rule 2) requirements must cause the review to fail.
No line may end with trailing whitespace.
Use Unix line endings (LF).
File and directory names must consist only of printable Unicode characters, excluding C0 control codes 0x00–0x08, 0x0B–0x0C, 0x0E–0x1F, C1 control codes 0x7F–0x9F, and DEL 0x7F.
Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml.
Optional license headers, if present, must comply with character set rules (no C0/C1 control codes except tab and line feed).
Files:
mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hppmllm/backends/cpu/ops/TransposeOp.cppmllm/backends/cpu/ops/ElewiseOps.cppmllm/backends/cpu/kernels/x86/rmsnorm.cppmllm/backends/cpu/kernels/x86/softmax.cppmllm/backends/cpu/kernels/common/elewise-inl.hppmllm/backends/cpu/kernels/common/kernel_dispatch.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh}: TODO and FIXME comments must be written as 'TODO:' or 'FIXME:' followed by UTF-8 text that adheres to character set rules.
Encourage consistent coding style and patterns with the existing codebase.
Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.
Files:
mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hppmllm/backends/cpu/ops/TransposeOp.cppmllm/backends/cpu/ops/ElewiseOps.cppmllm/backends/cpu/kernels/x86/rmsnorm.cppmllm/backends/cpu/kernels/x86/softmax.cppmllm/backends/cpu/kernels/common/elewise-inl.hppmllm/backends/cpu/kernels/common/kernel_dispatch.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi}: Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors.
Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python, Google C++ Style Guide for C++).
Files:
mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hppmllm/backends/cpu/ops/TransposeOp.cppmllm/backends/cpu/ops/ElewiseOps.cppmllm/backends/cpu/kernels/x86/rmsnorm.cppmllm/backends/cpu/kernels/x86/softmax.cppmllm/backends/cpu/kernels/common/elewise-inl.hppmllm/backends/cpu/kernels/common/kernel_dispatch.cpp
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi}: Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity (e.g., O(n^2) or worse) when feasible.
Avoid unnecessary object creation in loops or hot paths.
Check for proper error handling and resource cleanup (e.g., using try-finally, context managers, or RAII).
Ensure functions that can fail return appropriate error codes or raise exceptions.
Validate inputs for public APIs and critical internal functions.
Add comments for complex algorithms or non-obvious logic.
Identify potential security issues (e.g., buffer overflows, injection risks, insecure temporary files) and recommend using secure alternatives (e.g., parameterized queries, secure random generators).
Suggest adding unit tests for untested complex logic or edge cases.
Ensure code is testable by avoiding global state and using dependency injection.
Flag overly complex functions (e.g., high cyclomatic complexity) and suggest breaking them down.
Use named constants instead of magic numbers.
Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors, avoiding sensitive data exposure.
Files:
mllm/backends/cpu/ops/TransposeOp.cppmllm/backends/cpu/ops/ElewiseOps.cppmllm/backends/cpu/kernels/x86/rmsnorm.cppmllm/backends/cpu/kernels/x86/softmax.cppmllm/backends/cpu/kernels/common/kernel_dispatch.cpp
🧠 Learnings (2)
📚 Learning: 2025-11-25T07:26:06.575Z
Learnt from: CR
Repo: UbiquitousLearning/mllm PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-25T07:26:06.575Z
Learning: Applies to {mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,py,pyi} : Suggest adding unit tests for untested complex logic or edge cases.
Applied to files:
mllm/backends/cpu/ops/ElewiseOps.cpp
📚 Learning: 2025-11-25T07:26:06.575Z
Learnt from: CR
Repo: UbiquitousLearning/mllm PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-25T07:26:06.575Z
Learning: Applies to {mllm,mllm-cli,pymllm}/**/*.{c,cc,cpp,cxx,h,hh,hpp,py,pyi,sh} : Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific.
Applied to files:
mllm/backends/cpu/ops/ElewiseOps.cpp
🧬 Code graph analysis (5)
mllm/backends/cpu/ops/TransposeOp.cpp (2)
mllm/backends/cpu/kernels/arm/transpose.hpp (3)
transpose_hw_wh_fp32(16-16)transpose_bshd_bhsd_fp32(18-19)transpose_last_dims_fp32(21-22)mllm/backends/cpu/kernels/arm/transpose.cpp (6)
transpose_hw_wh_fp32(13-46)transpose_hw_wh_fp32(13-13)transpose_bshd_bhsd_fp32(48-73)transpose_bshd_bhsd_fp32(48-49)transpose_last_dims_fp32(160-200)transpose_last_dims_fp32(160-161)
mllm/backends/cpu/ops/ElewiseOps.cpp (2)
mllm/backends/cpu/kernels/common/kernel_dispatch.hpp (8)
call_elewise_add_fp32(20-20)call_elewise_add_scalar_fp32(28-28)call_elewise_sub_fp32(21-21)call_elewise_sub_scalar_fp32(29-29)call_elewise_mul_fp32(22-22)call_elewise_mul_scalar_fp32(30-30)call_elewise_div_fp32(23-23)call_elewise_div_scalar_fp32(31-31)mllm/backends/cpu/kernels/arm/elementwise.hpp (8)
ew_add_fp32(262-263)ew_add_fp32_scalar(324-325)ew_sub_fp32(265-266)ew_sub_fp32_scalar(327-328)ew_mul_fp32(268-269)ew_mul_fp32_scalar(330-331)ew_div_fp32(271-272)ew_div_fp32_scalar(333-334)
mllm/backends/cpu/kernels/x86/softmax.cpp (1)
mllm/backends/cpu/kernels/x86/math.hpp (1)
vexpq_fast_f32(19-19)
mllm/backends/cpu/kernels/common/elewise-inl.hpp (2)
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (8)
void(36-38)void(40-42)void(44-46)void(48-50)void(52-54)void(56-58)void(60-62)void(64-66)mllm/backends/cpu/kernels/common/gelu-inl.hpp (2)
HWY_MAYBE_UNUSED(49-69)HWY_MAYBE_UNUSED(71-75)
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (1)
mllm/backends/cpu/kernels/common/elewise-inl.hpp (9)
elewise_add_fp32(64-64)elewise_sub_fp32(68-68)elewise_mul_fp32(72-72)elewise_div_fp32(76-76)elewise_add_scalar_fp32(134-134)elewise_sub_scalar_fp32(138-138)elewise_mul_scalar_fp32(142-142)elewise_div_scalar_fp32(146-146)void(16-34)
🪛 Clang (14.0.6)
mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp
[error] 114-114: parameter name 'f' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
mllm/backends/cpu/kernels/common/elewise-inl.hpp
[error] 64-64: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 64-64: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 64-64: 2 adjacent parameters of 'elewise_add_fp32' of similar type ('const int *') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 64-64: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 64-64: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 68-68: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 68-68: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 68-68: 2 adjacent parameters of 'elewise_sub_fp32' of similar type ('const int *') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 68-68: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 68-68: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 72-72: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 72-72: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 72-72: 2 adjacent parameters of 'elewise_mul_fp32' of similar type ('const int *') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 72-72: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 72-72: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 76-76: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 76-76: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 76-76: 2 adjacent parameters of 'elewise_div_fp32' of similar type ('const int *') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 76-76: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 76-76: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 86-86: variable 'd' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 86-86: variable name 'd' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 87-87: variable 'N' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 87-87: variable name 'N' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 87-87: invalid case style for variable 'N'
(readability-identifier-naming,-warnings-as-errors)
[error] 88-88: variable 'idx' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 91-91: variable 'sVec' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 91-91: invalid case style for variable 'sVec'
(readability-identifier-naming,-warnings-as-errors)
[error] 100-100: variable 'vx' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 100-100: variable name 'vx' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 101-101: variable 'result' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 134-134: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 134-134: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 134-134: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 134-134: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 138-138: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 138-138: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 138-138: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 138-138: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 142-142: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 142-142: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 142-142: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 142-142: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 146-146: variable 'HWY_MAYBE_UNUSED' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 146-146: invalid case style for variable 'HWY_MAYBE_UNUSED'
(readability-identifier-naming,-warnings-as-errors)
[error] 146-146: parameter name 'x' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 146-146: parameter name 'y' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
mllm/backends/cpu/kernels/common/kernel_dispatch.cpp
[error] 13-13: macro 'HWY_TARGET_INCLUDE' used to declare a constant; consider using a 'constexpr' constant
(cppcoreguidelines-macro-usage,-warnings-as-errors)
🔇 Additional comments (7)
examples/qwen3/main.cpp (1)
4-4: The header switch tomodeling_qwen3_fa2.hppis correct and safe. The FA2 implementation maintains full API compatibility with the original—both have identical public interfaces (constructor,forward()method,kvCache()method). The only difference is internal: the FA2 variant passestrueto theStaticCacheconstructor'suse_fa2parameter (vs.falsein the original), which doesn't affect the client code in this example file.mllm/backends/cpu/kernels/x86/softmax.cpp (1)
37-66: Good switch to unaligned HWY memory ops (safer, same math).
Usinghn::LoadU/StoreUhere removes any hidden alignment requirement onX/Ywithout changing the reduction / exp / scale logic.mllm/backends/cpu/kernels/x86/rmsnorm.cpp (1)
26-46: LGTM:LoadU/StoreUis the right default unless you can prove alignment.
This should prevent misaligned access issues without affecting results.mllm/backends/cpu/ops/ElewiseOps.cpp (1)
142-162: Nice: explicit#else NYI(...)prevents silent output corruption on unsupported arch.
This addresses the “do nothing and still return” failure mode for fp32 elementwise ops.Also applies to: 325-345, 507-527, 689-709
mllm/backends/cpu/kernels/common/elewise-inl.hpp (2)
15-148: Elementwise + scalar FP32 kernels look correct; nice fix vs reserved identifier.
Vector + tail handling is coherent (LoadU/StoreU+LoadN/StoreN), and the scalar RHS path is implemented consistently.
1-10: No ODR risk inelewise-inl.hpp— proper Highway dynamic dispatch pattern. This file is correctly included only inkernel_dispatch.cppas part of Highway's multi-target code generation mechanism. The functions are defined withinHWY_NAMESPACEblocks that are re-instantiated with different namespace values for each enabled target (viaforeach_target.h).HWY_EXPORTandHWY_DYNAMIC_DISPATCHmacros properly handle symbol resolution across target variants. No ODR violation occurs.mllm/backends/cpu/kernels/common/kernel_dispatch.cpp (1)
4-80: Pattern is correct and production-ready.The dynamic dispatch scaffold properly implements Google Highway 1.3.0's multi-target dispatch mechanism:
foreach_target.hre-includes the source for each enabled target,HWY_NAMESPACEwraps target-specific implementations, andHWY_DYNAMIC_DISPATCH()selects the right variant at runtime. Architecture guards are correctly applied—x86 code is isolated from ARM with no competing definitions. All implementation functions are present inelewise-inl.hpp, and call sites inElewiseOps.cppproperly branch between x86 (cpu::common) and ARM (cpu::arm) codepaths.
| 3. Download pre-converted models from **our HuggingFace organization** (recommended on x86): | ||
|
|
||
| Due to current compatibility issues with the mllm-converter on x86 architecture, we recommend downloading pre-converted quantized models from our HuggingFace organization `mllmTeam <https://huggingface.co/UbiquitousLearning>`_: | ||
|
|
||
| Example command: | ||
|
|
||
| .. code-block:: bash | ||
|
|
||
| wget https://huggingface.co/mllmTeam/qwen-3-0.6b-mllm/blob/main/qwen-3-0.6b-q4_k.mllm | ||
|
|
There was a problem hiding this comment.
Fix the Hugging Face download command (blob URL won’t fetch the model file).
Use a raw/resolve URL so wget downloads the artifact, not an HTML page.
- wget https://huggingface.co/mllmTeam/qwen-3-0.6b-mllm/blob/main/qwen-3-0.6b-q4_k.mllm
+ wget -O qwen-3-0.6b-q4_k.mllm https://huggingface.co/mllmTeam/qwen-3-0.6b-mllm/resolve/main/qwen-3-0.6b-q4_k.mllm🤖 Prompt for AI Agents
In docs/cpu_backend/x86/index.rst around lines 107 to 116, the wget example uses
a Hugging Face "blob" HTML URL which will download an HTML page instead of the
model artifact; update the example to point to the repository file raw/resolve
endpoint (e.g. use the Hugging Face raw or /resolve/main file URL or append
?raw=true) so wget retrieves the actual .mllm file instead of an HTML page.
| #else | ||
| NYI("AddOp not supported on this architecture."); | ||
| #endif |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
files=(
"mllm/backends/cpu/ops/ElewiseOps.cpp"
"mllm/backends/cpu/ops/TransposeOp.cpp"
"mllm/backends/cpu/kernels/x86/softmax.cpp"
"mllm/backends/cpu/kernels/x86/rmsnorm.cpp"
"mllm/backends/cpu/kernels/common/kernel_dispatch.cpp"
"mllm/backends/cpu/kernels/common/elewise-inl.hpp"
"mllm/backends/cpu/kernels/common/ggml/quantize/quantize.hpp"
)
python - <<'PY'
import sys, pathlib, re
BAD_BYTES = set(list(range(0x00,0x09)) + list(range(0x0B,0x0D)) + list(range(0x0E,0x20)) + [0x7F] + list(range(0x80,0xA0)))
files = sys.argv[1:]
fail = False
for p in files:
b = pathlib.Path(p).read_bytes()
if b.startswith(b"\xef\xbb\xbf"):
print(f"[FAIL] BOM found: {p}")
fail = True
bad = [i for i, bb in enumerate(b) if bb in BAD_BYTES]
if bad:
print(f"[FAIL] Control code bytes found: {p} (first at byte offset {bad[0]})")
fail = True
if b"\r\n" in b or b"\r" in b:
print(f"[FAIL] CRLF/CR found (expected LF only): {p}")
fail = True
text = b.decode("utf-8", errors="strict")
for i, line in enumerate(text.split("\n"), 1):
if re.search(r"[ \t]+$", line):
print(f"[FAIL] Trailing whitespace: {p}:{i}")
fail = True
if fail:
sys.exit(1)
print("[OK] No BOM/control-codes/CRLF/trailing-whitespace in checked files.")
PY "${files[@]}"Repository: UbiquitousLearning/mllm
Length of output: 255
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# First, check if the file exists
if [ ! -f "mllm/backends/cpu/ops/ElewiseOps.cpp" ]; then
echo "[FAIL] File not found: mllm/backends/cpu/ops/ElewiseOps.cpp"
exit 1
fi
# Check for trailing whitespace at the specified lines
echo "Checking lines 149-151, 159-161, 191-193 for trailing whitespace:"
# Get line count first
total_lines=$(wc -l < "mllm/backends/cpu/ops/ElewiseOps.cpp")
echo "Total lines in file: $total_lines"
# Check specific line ranges
for line_num in 149 150 151 159 160 161 191 192 193; do
if [ "$line_num" -le "$total_lines" ]; then
line_content=$(sed -n "${line_num}p" "mllm/backends/cpu/ops/ElewiseOps.cpp")
# Check if line ends with space or tab
if [[ "$line_content" =~ [[:space:]]$ ]]; then
echo "[FAIL] Line $line_num has trailing whitespace: $(sed -n "${line_num}p" "mllm/backends/cpu/ops/ElewiseOps.cpp" | cat -A)"
else
echo "[OK] Line $line_num: no trailing whitespace"
fi
fi
done
# Also do a comprehensive check for any trailing whitespace in the entire file
echo ""
echo "Comprehensive check for trailing whitespace in entire file:"
trailing_lines=$(grep -n '[[:space:]]$' "mllm/backends/cpu/ops/ElewiseOps.cpp" || true)
if [ -n "$trailing_lines" ]; then
echo "[FAIL] Lines with trailing whitespace found:"
echo "$trailing_lines"
else
echo "[OK] No trailing whitespace found in file"
fiRepository: UbiquitousLearning/mllm
Length of output: 2141
Remove trailing whitespace (repo guideline violation).
Lines 150, 160, and 192 contain trailing spaces after the NYI() calls. Files under mllm/** must not have trailing whitespace at line ends.
- NYI("AddOp not supported on this architecture.");
+ NYI("AddOp not supported on this architecture.");🤖 Prompt for AI Agents
In mllm/backends/cpu/ops/ElewiseOps.cpp around lines 149-151, 158-162 and
190-194, there are trailing spaces after the NYI("...") calls (specifically
noted at lines 150, 160 and 192); remove the trailing whitespace at the end of
those lines so the NYI calls end immediately before the line break, and ensure
no other trailing spaces remain in the file (run your editor's "trim trailing
whitespace" or a linter/formatter to verify).
| #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86) | ||
| // Process each batch separately | ||
| for (int batch = 0; batch < batch_dims; ++batch) { | ||
| // Each batch processes broadcast_naive_loops iterations of vector_size elements | ||
| for (int l = 0; l < broadcast_naive_loops; ++l) { | ||
| size_t a_offset = batch * broadcast_naive_loops * vector_size + l * vector_size; | ||
| size_t b_offset = batch * vector_size; // b doesn't broadcast over loops dimension | ||
| size_t out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size; | ||
|
|
||
| #if defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) | ||
| cpu::common::call_elewise_add_fp32(out + out_offset, a + a_offset, b + b_offset, vector_size); | ||
| } | ||
| } | ||
|
|
||
| #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) | ||
| // Process each batch separately | ||
| for (int batch = 0; batch < batch_dims; ++batch) { | ||
| // Each batch processes broadcast_naive_loops iterations of vector_size elements | ||
| for (int l = 0; l < broadcast_naive_loops; ++l) { | ||
| int a_offset = batch * broadcast_naive_loops * vector_size + l * vector_size; | ||
| int b_offset = batch * vector_size; // b doesn't broadcast over loops dimension | ||
| int out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size; | ||
| size_t a_offset = batch * broadcast_naive_loops * vector_size + l * vector_size; | ||
| size_t b_offset = batch * vector_size; // b doesn't broadcast over loops dimension | ||
| size_t out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size; | ||
|
|
||
| cpu::arm::ew_add_fp32(out + out_offset, a + a_offset, b + b_offset, vector_size, options_.getThreads()); | ||
| } | ||
| } | ||
| #else | ||
| NYI("AddOp not supported on this architecture."); | ||
| #endif |
There was a problem hiding this comment.
size_t offsets still risk overflow due to int intermediate math.
batch * broadcast_naive_loops * vector_size is computed as int then converted to size_t. Cast inputs before multiplying (apply similarly in Add/Sub/Mul/Div).
- size_t a_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;
- size_t b_offset = batch * vector_size; // b doesn't broadcast over loops dimension
- size_t out_offset = batch * broadcast_naive_loops * vector_size + l * vector_size;
+ const size_t batch_u = static_cast<size_t>(batch);
+ const size_t loops_u = static_cast<size_t>(broadcast_naive_loops);
+ const size_t vec_u = static_cast<size_t>(vector_size);
+ const size_t l_u = static_cast<size_t>(l);
+ const size_t a_offset = batch_u * loops_u * vec_u + l_u * vec_u;
+ const size_t b_offset = batch_u * vec_u; // b doesn't broadcast over loops dimension
+ const size_t out_offset = batch_u * loops_u * vec_u + l_u * vec_u;Also applies to: 349-376, 531-558, 713-740
🤖 Prompt for AI Agents
In mllm/backends/cpu/ops/ElewiseOps.cpp around lines 166 to 193 (and similarly
for ranges 349-376, 531-558, 713-740), the offsets are computed using int
arithmetic then assigned to size_t which can overflow; cast the loop indices or
operands to size_t before performing multiplications (e.g., use (size_t)batch,
(size_t)broadcast_naive_loops, (size_t)vector_size) so all intermediate math is
done in size_t, and apply the same fix to the corresponding Add/Sub/Mul/Div
blocks to ensure safe offset calculation across architectures.
| #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86) | ||
| NYI("Transpose op(HW -> WH) fp32 not supported in x86"); | ||
| #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) | ||
| arm::transpose_hw_wh_fp32(input.ptr<mllm_fp32_t>(), output.ptr<mllm_fp32_t>(), input_shape[0], input_shape[1]); | ||
| #endif | ||
| break; | ||
| } | ||
| case kFloat16: { | ||
| #if defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) | ||
| #if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86) | ||
| NYI("Transpose op(HW -> WH) fp16 not supported in x86"); | ||
| #elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM) | ||
| arm::transpose_hw_wh_fp16(input.ptr<mllm_fp16_t>(), output.ptr<mllm_fp16_t>(), input_shape[0], input_shape[1]); | ||
| #endif | ||
| break; |
There was a problem hiding this comment.
Add #else NYI(...) to avoid silent no-op on non-{x86,arm}.
Right now, if someone builds on another arch (e.g., WASM/RISC-V), these branches can fall through without writing output. Recommend making the unsupported case explicit.
#if defined(MLLM_HOST_ARCH_X86_64) || defined(MLLM_HOST_ARCH_X86)
NYI("Transpose op(HW -> WH) fp32 not supported in x86");
#elif defined(MLLM_HOST_ARCH_ARM64) || defined(MLLM_HOST_ARCH_ARM)
arm::transpose_hw_wh_fp32(...);
+#else
+ NYI("Transpose op(HW -> WH) fp32 not supported on this architecture");
#endifAlso applies to: 59-74, 95-110
🤖 Prompt for AI Agents
In mllm/backends/cpu/ops/TransposeOp.cpp around lines 30-43 (also apply same
change to 59-74 and 95-110), the architecture-specific #if/#elif blocks
currently omit an #else path so building on non-x86/non-ARM targets results in a
silent no-op; add an #else branch that calls NYI("Transpose op(...) not
supported on this arch") (with the appropriate dtype/shape message) before the
#endif to make unsupported architectures explicit and ensure the code
reports/notifies instead of falling through.
rmsnormandsoftmaxoperators and refine quantization parametersNYImessages for unsupported operationsSummary by CodeRabbit
New Features
Improvements
Documentation
✏️ Tip: You can customize this high-level summary in your review settings.