
[WIP] feat(codegen): orchestration tensor APIs, scalar min/max, and if/view codegen #204

Open
zhusy54 wants to merge 4 commits into hw-native-sys:main from zhusy54:pa-orchest

Conversation


@zhusy54 zhusy54 commented Feb 14, 2026

Summary

Refactor orchestration codegen to use shape/dtype-based tensor APIs, add scalar min/max support to the DSL, and implement codegen for IfOp and ViewOp.

Changes

Orchestration Codegen Refactor

  • Replace size-based make_tensor(bytes) / make_tensor_external(ptr, size) with shape/dtype-based APIs: make_tensor(shapes, ndim, dtype) / make_tensor_external(ptr, shapes, ndim, dtype), providing the runtime with richer tensor metadata.
  • Remove outer/inner scope classification logic (TaskRecord, scope analysis, re-indentation). Scope management is now handled structurally by ForStmt and IfStmt visitors which emit PTO2_SCOPE(rt) blocks directly.
  • Emit intermediate tensor declarations inline at the point of task submission rather than hoisting to a separate section.
  • Consolidate non_task_code_ stream into the main code_ stream — all generated code now flows through a single output path.
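
The shape/dtype-based emission described above can be sketched as a small string-formatting helper. The function names and exact call shape below are illustrative assumptions, not the project's actual C++ helpers (the real GenerateMakeTensorExternal lives in orchestration_codegen.cpp):

```python
# Hypothetical sketch of shape/dtype-based tensor-call emission; names are
# illustrative only, mirroring make_tensor(shapes, ndim, dtype) and
# make_tensor_external(ptr, shapes, ndim, dtype).
def emit_make_tensor_external(ptr_name, shapes, dtype):
    """Format a make_tensor_external(ptr, {shapes}, ndim, dtype) call string."""
    shape_list = ", ".join(str(s) for s in shapes)
    return (f"make_tensor_external({ptr_name}, "
            f"{{{shape_list}}}, {len(shapes)}, DataType::{dtype})")

def emit_make_tensor(shapes, dtype):
    """Format a make_tensor({shapes}, ndim, dtype) call string."""
    shape_list = ", ".join(str(s) for s in shapes)
    return f"make_tensor({{{shape_list}}}, {len(shapes)}, DataType::{dtype})"
```

Compared with the old size-based calls, the runtime now receives the full shape and element type rather than a raw byte count.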

Scalar Min/Max Operations

  • Add MakeMin / MakeMax factory functions in scalar_expr.h with automatic type promotion.
  • Expose as ir.min_() / ir.max_() in Python bindings and type stubs.
  • Add parser dispatch: pl.min(scalar, scalar) and pl.max(scalar, scalar) now route to scalar IR ops via _parse_scalar_op(), while pl.min(tile, axis=...) continues to work as tile reduction.
  • Add @overload type signatures for pl.min / pl.max in block_ops.py to satisfy pyright.
  • Add codegen support for Min/Max expressions in GenerateExprString (emits std::min()/std::max()).
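
The automatic type promotion mentioned for MakeMin/MakeMax can be sketched as follows. The rank table and node shape here are hypothetical stand-ins, not the project's actual promotion rules:

```python
# Hedged sketch of automatic type promotion for scalar min/max, in the spirit
# of MakeMin/MakeMax; the dtype ranks are illustrative assumptions.
_RANK = {"INT32": 0, "INT64": 1, "FP16": 2, "FP32": 3}

def promote(lhs_dtype, rhs_dtype):
    """Pick the higher-ranked dtype as the result type."""
    return lhs_dtype if _RANK[lhs_dtype] >= _RANK[rhs_dtype] else rhs_dtype

def make_min(lhs, lhs_dtype, rhs, rhs_dtype):
    """Build a min node whose result dtype is the promoted operand dtype."""
    result_dtype = promote(lhs_dtype, rhs_dtype)
    return {"op": "min", "args": (lhs, rhs), "dtype": result_dtype}
```

With promotion handled at node-construction time, downstream codegen can emit std::min()/std::max() without worrying about mixed operand types.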

New Codegen: IfOp and ViewOp

  • IfStmt codegen: emits if/else blocks with PTO2_SCOPE(rt) wrapping, pre-declares return variables, and handles both branches.
  • tensor.view codegen: emits input.view({shapes}, {offsets}) for tensor slicing.

Shared Utilities

  • Extract DataTypeToString() into dtype.h as a shared utility, replacing the duplicated DataTypeToPythonString() in python_printer.cpp.
  • Add GetRuntimeDataTypeString() and GetExternalTensorName() as virtual methods on CodegenBase for subclass customization.
  • Extend GenerateExprString with support for FloorMod, comparison ops (Eq/Ne/Lt/Le/Gt/Ge), Neg, ConstFloat, ConstBool.
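
The kind of recursive expression printing GenerateExprString performs can be sketched like this. The node encoding is hypothetical, not the project's IR; only the emitted C++ operator spellings follow the description above:

```python
# Hedged sketch of recursive expression-to-string generation; nodes are
# (op, *args) tuples, a stand-in for the real IR expression classes.
def expr_to_string(node):
    op, args = node[0], node[1:]
    binops = {"add": "+", "floormod": "%", "eq": "==", "lt": "<"}
    if op == "const":
        # ConstBool prints as C++ true/false, other constants verbatim
        return str(args[0]).lower() if isinstance(args[0], bool) else str(args[0])
    if op == "neg":
        return f"(-{expr_to_string(args[0])})"
    if op in ("min", "max"):
        return f"std::{op}({expr_to_string(args[0])}, {expr_to_string(args[1])})"
    if op in binops:
        return f"({expr_to_string(args[0])} {binops[op]} {expr_to_string(args[1])})"
    raise ValueError(f"unsupported op: {op}")
```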

Type Stubs

  • Add missing scalar expression stubs: eq, ne, lt, le, gt, ge, neg, truediv, floordiv, mod, min_, max_.
  • Make span parameter optional (default = ...) across all scalar op stubs.

Tests

  • Update orchestration codegen tests to match new shape/dtype-based output format.
  • Add test_scalar_ops.py for IR-level scalar min/max node creation and type promotion.
  • Add test_scalar_dispatch.py for parser-level pl.min/pl.max scalar dispatch.

… codegen

- Add GetRuntimeDataTypeString for runtime-compatible dtype name mapping
- Refactor tensor operations to use shape/dtype-based APIs instead of raw strings
- Add codegen support for IfOp and ViewOp in orchestration layer
- Extract common codegen helpers (EmitTensorDecl, EmitFunctionCall, etc.)
- Add scalar min/max operations (pl.min, pl.max) for orchestration
- Update tests to cover new codegen patterns and scalar ops
@gemini-code-assist

Summary of Changes

Hello @zhusy54, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly.

This pull request significantly enhances the code generation capabilities by modernizing tensor API usage, introducing support for fundamental scalar operations, and enabling control flow and tensor view operations within the generated code. The changes streamline the orchestration codegen, making it more robust and expressive by moving to shape/dtype-based tensor handling and integrating new IR constructs like IfOp and ViewOp. This improves the flexibility and maintainability of the generated C++ code, allowing for more complex computations to be expressed and optimized.

Highlights

  • Orchestration Codegen Refactor: Refactored orchestration codegen to utilize shape/dtype-based tensor APIs (make_tensor, make_tensor_external), replacing size-based allocation. Eliminated outer/inner scope classification logic, with scope management now handled structurally by ForStmt and IfStmt visitors emitting PTO2_SCOPE(rt) blocks. Intermediate tensor declarations are now emitted inline at task submission, and the non_task_code_ stream has been consolidated into the main code_ stream for a unified output path.
  • Scalar Min/Max Operations: Added MakeMin and MakeMax factory functions in scalar_expr.h with automatic type promotion. These are exposed as ir.min_() and ir.max_() in Python bindings. The parser now dispatches pl.min(scalar, scalar) and pl.max(scalar, scalar) to scalar IR operations, while retaining tile reduction functionality. @overload type signatures were added for pl.min/pl.max, and codegen support for Min/Max expressions now emits std::min()/std::max().
  • New Codegen for IfOp and ViewOp: Implemented codegen for IfStmt, which emits if/else blocks wrapped with PTO2_SCOPE(rt), pre-declares return variables, and handles both branches. Codegen for tensor.view was added to emit input.view({shapes}, {offsets}) for tensor slicing operations.
  • Shared Utilities and Expression Support: Extracted DataTypeToString() into dtype.h as a shared utility. Added virtual methods GetRuntimeDataTypeString() and GetExternalTensorName() to CodegenBase for subclass customization. GenerateExprString was extended to support FloorMod, comparison operators (Eq, Ne, Lt, Le, Gt, Ge), Neg, ConstFloat, and ConstBool.
  • Type Stubs and Testing: Added missing scalar expression type stubs for various operators and made the span parameter optional across all scalar op stubs. Updated orchestration codegen tests to align with the new shape/dtype-based output format and added new test files (test_scalar_ops.py, test_scalar_dispatch.py) for scalar min/max IR node creation, type promotion, and parser dispatch.
Changelog
  • include/pypto/codegen/codegen_base.h
    • Added virtual methods GetExternalTensorName and GetRuntimeDataTypeString to CodegenBase.
  • include/pypto/core/dtype.h
    • Added DataTypeToString utility function for converting DataType to its string representation.
  • include/pypto/ir/scalar_expr.h
    • Added MakeMin and MakeMax factory functions for scalar minimum and maximum operations, including type promotion.
  • python/bindings/modules/ir.cpp
    • Exposed min_ and max_ functions to the Python ir module bindings.
  • python/pypto/language/op/block_ops.py
    • Imported overload from typing.
    • Added @overload decorators for max and min functions to support both Tile and Scalar arguments.
    • Modified max and min functions to accept Tile | Scalar for the first argument and int | Scalar for the second, enabling scalar dispatch.
  • python/pypto/language/parser/ast_parser.py
    • Added _SCALAR_BINARY_OPS dictionary to map unified op names to IR scalar expression functions for min and max.
    • Removed abs from _UNIFIED_OPS list.
    • Removed max and min from _BLOCK_OPS list.
    • Added abs to _UNIFIED_OPS list.
    • Modified _parse_unified_op to dispatch to _parse_scalar_op if the first argument is a ScalarType.
    • Added _parse_scalar_op method to handle parsing of scalar binary and unary operations.
  • python/pypto/pypto_core/ir.pyi
    • Made span parameter optional (default = ...) for add, sub, mul, pow, cast, bit_and, bit_or, bit_xor, bit_shift_left, bit_shift_right, and bit_not.
    • Added type stubs for truediv, floordiv, mod, eq, ne, lt, le, gt, ge, neg, min_, and max_ with optional span parameters.
  • src/codegen/codegen_base.cpp
    • Extended GenerateExprString to handle FloorMod, Min, Max, Eq, Ne, Lt, Le, Gt, Ge, ConstFloat, ConstBool, and Neg expressions.
    • Implemented GetRuntimeDataTypeString to return the fully qualified DataType enum name for C++ codegen.
  • src/codegen/orchestration/orchestration_codegen.cpp
    • Removed DataTypeToPTO2Enum comment, indicating usage of DataTypeToString.
    • Added GenerateMakeTensorExternal helper function to create make_tensor_external calls with shape array, ndim, and dtype.
    • Removed TaskRecord struct and related GetTaskRecords and GetNonTaskCode methods.
    • Added PTO2_SCOPE(rt) block wrapping for ForStmt bodies.
    • Implemented VisitStmt_ for IfStmt to generate if/else blocks with PTO2_SCOPE(rt) and pre-declare return variables.
    • Modified GetExternalTensorName to be const override.
    • Removed conditional tensor.create handling and non_task_code_ stream, consolidating all generated code into code_.
    • Removed task_input_tensors and task_output_tensors tracking from VisitStmt_ for Call operations.
    • Modified task submission logic to emit make_tensor declarations for intermediate output tensors inline.
    • Removed non_task_code_ member from OrchestrationStmtCodegen.
    • Removed task_records_ member from OrchestrationStmtCodegen.
    • Removed logic for identifying intermediate tensors and separating outer/inner tasks, simplifying the overall codegen flow.
    • Updated external tensor generation to use shape/dtype-based make_tensor_external calls.
  • src/codegen/tensor_op_codegen.cpp
    • Updated REGISTER_ORCHESTRATION_OP(tensor_create) to generate shape/dtype-based make_tensor calls instead of size-based.
    • Added REGISTER_ORCHESTRATION_OP(tensor_view) to generate input.view({shapes}, {offsets}) for tensor slicing.
  • src/ir/transforms/python_printer.cpp
    • Removed DataTypeToPythonString helper function.
    • Updated IRPythonPrinter::Print for ScalarType, TensorType, and TileType to use DataTypeToString from dtype.h.
    • Updated IRPythonPrinter::VisitExpr_ for Call and Cast operations to use DataTypeToString.
  • tests/ut/codegen/test_orchestration_codegen.py
    • Updated orch_basic test to assert shape/dtype-based make_tensor_external and make_tensor calls, and removed PTO2_SCOPE assertion for intermediate tensors.
    • Updated orch_vector test to reflect shape/dtype-based tensor creation and removal of explicit PTO2_SCOPE for intermediate tensors.
    • Updated test_tuple_mid to assert DataType::FLOAT32 instead of byte size and removed PTO2_SCOPE assertion.
    • Updated orch_four_tuple test to assert shape-based tensor creation and removed PTO2_SCOPE assertion.
    • Updated orch_create test to assert shape/dtype-based make_tensor call.
    • Updated orch_inplace test to assert shape/dtype-based make_tensor_external calls.
    • Added test_for_loop_with_view to verify codegen for for loops with dynamic bounds, tensor.view with dynamic offsets, and kernel calls inside loops.
    • Added test_if_statement to verify codegen for if/else statements with conditional scalar values and PTO2_SCOPE wrapping.
  • tests/ut/ir/expressions/test_scalar_ops.py
    • Added new test file to verify ir.min_ and ir.max_ creation, including type promotion.
  • tests/ut/language/parser/test_scalar_dispatch.py
    • Added new test file to verify that pl.min and pl.max correctly dispatch to scalar IR operations when called with scalar arguments, including literal values, while ensuring tile reduction still functions.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant and valuable refactoring to the orchestration codegen, switching to shape/dtype-based tensor APIs and simplifying scope management. The addition of scalar min/max operations and codegen for IfOp/ViewOp are also great enhancements. My review focuses on improving maintainability by reducing code duplication, fixing a potential runtime error in the parser, and ensuring the Python-level API for min/max is robust.

ir_func = getattr(ir, ir_func_name)
return ir_func(lhs, rhs, call_span)

if op_name in self._SCALAR_UNARY_OPS:

critical

The code checks if op_name in self._SCALAR_UNARY_OPS:, but the class attribute _SCALAR_UNARY_OPS does not appear to be defined in this file or diff. Because the lookup goes through self, this will raise an AttributeError at runtime if this code path is ever reached. You should define _SCALAR_UNARY_OPS alongside _SCALAR_BINARY_OPS.
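
A minimal sketch of the suggested fix, with hypothetical table entries standing in for the real ir.* constructors:

```python
# Hypothetical sketch of defining both dispatch tables side by side; the real
# tables map unified op names to ir scalar-expression constructors.
class ScalarOpParser:
    _SCALAR_BINARY_OPS = {"min": min, "max": max}   # stand-ins for ir.min_/ir.max_
    _SCALAR_UNARY_OPS = {"neg": lambda x: -x}       # stand-in for ir.neg

    def parse_scalar_op(self, op_name, *args):
        if op_name in self._SCALAR_BINARY_OPS:
            return self._SCALAR_BINARY_OPS[op_name](*args)
        # safe now that the table exists as a class attribute
        if op_name in self._SCALAR_UNARY_OPS:
            return self._SCALAR_UNARY_OPS[op_name](*args)
        raise ValueError(f"unsupported scalar op: {op_name}")
```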

Comment on lines +349 to +370
inline std::string DataTypeToString(const DataType& dtype) {
if (dtype == DataType::BOOL) return "BOOL";
if (dtype == DataType::INT4) return "INT4";
if (dtype == DataType::INT8) return "INT8";
if (dtype == DataType::INT16) return "INT16";
if (dtype == DataType::INT32) return "INT32";
if (dtype == DataType::INT64) return "INT64";
if (dtype == DataType::UINT4) return "UINT4";
if (dtype == DataType::UINT8) return "UINT8";
if (dtype == DataType::UINT16) return "UINT16";
if (dtype == DataType::UINT32) return "UINT32";
if (dtype == DataType::UINT64) return "UINT64";
if (dtype == DataType::FP4) return "FP4";
if (dtype == DataType::FP8E4M3FN) return "FP8E4M3FN";
if (dtype == DataType::FP8E5M2) return "FP8E5M2";
if (dtype == DataType::FP16) return "FP16";
if (dtype == DataType::FP32) return "FP32";
if (dtype == DataType::BF16) return "BFLOAT16";
if (dtype == DataType::HF4) return "HF4";
if (dtype == DataType::HF8) return "HF8";
return "UnknownType";
}

medium

The DataTypeToString function is implemented as a long chain of if statements. While correct, this can be slightly less efficient and harder to maintain than a switch statement on the underlying type code. Using dtype.Code() with a switch would be a more idiomatic and performant C++ pattern for this kind of dispatch.

inline std::string DataTypeToString(const DataType& dtype) {
  switch (dtype.Code()) {
    case DataType::kBoolCode: return "BOOL";
    case DataType::kInt4Code: return "INT4";
    case DataType::kInt8Code: return "INT8";
    case DataType::kInt16Code: return "INT16";
    case DataType::kInt32Code: return "INT32";
    case DataType::kInt64Code: return "INT64";
    case DataType::kUInt4Code: return "UINT4";
    case DataType::kUInt8Code: return "UINT8";
    case DataType::kUInt16Code: return "UINT16";
    case DataType::kUInt32Code: return "UINT32";
    case DataType::kUInt64Code: return "UINT64";
    case DataType::kFp4Code: return "FP4";
    case DataType::kFp8e4m3fnCode: return "FP8E4M3FN";
    case DataType::kFp8e5m2Code: return "FP8E5M2";
    case DataType::kFp16Code: return "FP16";
    case DataType::kFp32Code: return "FP32";
    case DataType::kBf16Code: return "BFLOAT16";
    case DataType::kHf4Code: return "HF4";
    case DataType::kHf8Code: return "HF8";
    default: return "UnknownType";
  }
}
References
  1. When mapping enums to strings, prefer using an explicit map (like a switch statement) over a long chain of if-else if statements for robustness and future extensibility.

Comment on lines +763 to 798
def max(tile: Tile | Scalar, axis: int | Scalar = 0, keepdim: bool = False) -> Tile | Scalar:
    """Max reduction along specified axis, or scalar max of two values.

    Args:
        tile: Input tile or first scalar operand
        axis: Reduction axis (for tiles) or second scalar operand
        keepdim: Whether to keep the reduced dimension as 1 (tiles only)

    Returns:
        Tile or Scalar wrapping the max operation
    """
    call_expr = _ir_ops.max(tile.unwrap(), axis, keepdim)  # type: ignore[reportArgumentType]
    return Tile(expr=call_expr)


@overload
def min(tile: Tile, axis: int, keepdim: bool = False) -> Tile: ...


@overload
def min(tile: Scalar, axis: Scalar | int, keepdim: bool = False) -> Scalar: ...


def min(tile: Tile | Scalar, axis: int | Scalar = 0, keepdim: bool = False) -> Tile | Scalar:
    """Min reduction along specified axis, or scalar min of two values.

    Args:
        tile: Input tile or first scalar operand
        axis: Reduction axis (for tiles) or second scalar operand
        keepdim: Whether to keep the reduced dimension as 1 (tiles only)

    Returns:
        Tile or Scalar wrapping the min operation
    """
    call_expr = _ir_ops.min(tile.unwrap(), axis, keepdim)  # type: ignore[reportArgumentType]
    return Tile(expr=call_expr)

medium

The implementations for max and min are overloaded to support both Tile and Scalar types, which is great for the DSL's usability. However, the current implementation only handles the Tile reduction case by calling _ir_ops.max (the block operation). If these functions are called directly with Scalar arguments (i.e., not inside a @pl.function that gets parsed), they will fail at runtime.

While the parser correctly dispatches to the scalar IR ops, making the Python implementation robust for direct calls would improve consistency and prevent unexpected errors for users who might use these functions outside a parsed context. A runtime type check on the tile argument could be used to dispatch to the correct IR operation (ir.max_ for scalars).
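
The proposed runtime dispatch can be sketched as below. The Tile/Scalar classes and reduction logic are hypothetical stand-ins for the DSL wrappers and ir ops, not the project's actual implementation:

```python
# Sketch of the runtime type check the review proposes: scalars route to the
# scalar op, tiles keep reduction semantics. All names are illustrative.
class Scalar:
    def __init__(self, value):
        self.value = value

class Tile:
    def __init__(self, rows):
        self.rows = rows

def pl_max(tile, axis=0, keepdim=False):
    if isinstance(tile, Scalar):
        # scalar path: `axis` carries the second operand (stand-in for ir.max_)
        other = axis.value if isinstance(axis, Scalar) else axis
        return Scalar(max(tile.value, other))
    # tile path: column-wise reduction, sketched for axis=0 (keepdim ignored)
    return [max(col) for col in zip(*tile.rows)]
```

With this shape, calling pl_max directly with Scalar arguments outside a parsed @pl.function no longer fails at runtime.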

raise InvalidOperationError(
f"Operation '{op_name}' is not supported for scalar arguments",
span=call_span,
hint="Supported scalar ops: min, max",

medium

The error hint for unsupported scalar operations is hardcoded to min, max. Since the logic also supports unary operations via _SCALAR_UNARY_OPS, the hint should be updated to dynamically include all supported scalar operations. This will provide more accurate feedback to the user.

Suggested change
hint="Supported scalar ops: min, max",
hint=f"Supported scalar ops: {', '.join(list(self._SCALAR_BINARY_OPS.keys()) + list(getattr(self, '_SCALAR_UNARY_OPS', {}).keys()))}",

Comment on lines +101 to +112
std::string CodegenBase::GetRuntimeDataTypeString(const DataType& dtype) const {
if (dtype == DataType::FP16) return "DataType::FLOAT16";
if (dtype == DataType::FP32) return "DataType::FLOAT32";
if (dtype == DataType::INT32) return "DataType::INT32";
if (dtype == DataType::INT16) return "DataType::INT16";
if (dtype == DataType::INT8) return "DataType::INT8";
if (dtype == DataType::UINT8) return "DataType::UINT8";
if (dtype == DataType::BF16) return "DataType::BFLOAT16";
if (dtype == DataType::INT64) return "DataType::INT64";
if (dtype == DataType::UINT64) return "DataType::UINT64";
return "DataType::UNKNOWN";
}

medium

The implementation of GetRuntimeDataTypeString duplicates logic from the new DataTypeToString function and is incomplete, as it doesn't cover all data types. This could lead to maintenance issues and bugs when new types are added.

To improve maintainability and correctness, this function should be refactored to leverage DataTypeToString for the general case, while only handling the special mappings where the runtime enum name differs from the canonical name (e.g., FP16 -> FLOAT16).

std::string CodegenBase::GetRuntimeDataTypeString(const DataType& dtype) const {
  // Handle special cases where runtime name differs from canonical enum name
  if (dtype == DataType::FP16) return "DataType::FLOAT16";
  if (dtype == DataType::FP32) return "DataType::FLOAT32";

  std::string s = DataTypeToString(dtype);
  if (s == "UnknownType") {
    return "DataType::UNKNOWN";
  }
  return "DataType::" + s;
}
References
  1. To improve maintainability and correctness in enum-to-string mapping, leverage existing explicit mapping functions and handle special cases, rather than duplicating logic, for robustness and future extensibility.

Comment on lines 529 to 553
auto saved = current_return_var_names_;
current_return_var_names_.clear();
for (const auto& rv : if_stmt->return_vars_) {
  current_return_var_names_.push_back(rv->name_);
}
VisitStmt(if_stmt->then_body_);
current_return_var_names_ = saved;

indent_ -= 4;
code_ << Indent() << "}\n";
indent_ -= 4;

if (if_stmt->else_body_.has_value()) {
  code_ << Indent() << "} else {\n";
  indent_ += 4;
  code_ << Indent() << "PTO2_SCOPE(rt) {\n";
  indent_ += 4;

  auto saved2 = current_return_var_names_;
  current_return_var_names_.clear();
  for (const auto& rv : if_stmt->return_vars_) {
    current_return_var_names_.push_back(rv->name_);
  }
  VisitStmt(*if_stmt->else_body_);
  current_return_var_names_ = saved2;

medium

The logic for setting up current_return_var_names_ before visiting the then and else bodies is duplicated. This code block could be extracted into a helper function or, for a more robust solution, a RAII class could be used to manage the saving and restoring of current_return_var_names_. This would make the code cleaner and less prone to errors if the logic needs to be updated in the future.
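
The save/restore pattern the comment describes is language-agnostic; here is a Python context-manager sketch of it (the C++ version would be an RAII guard class). The state layout is hypothetical:

```python
# Sketch of scoped save/restore for a mutable member, analogous to an RAII
# guard over current_return_var_names_; `state` is an illustrative stand-in.
from contextlib import contextmanager

@contextmanager
def scoped_return_vars(state, new_names):
    """Temporarily replace state['return_var_names'], restoring on exit."""
    saved = state["return_var_names"]
    state["return_var_names"] = list(new_names)
    try:
        yield
    finally:
        # restored even if visiting the branch body raises
        state["return_var_names"] = saved
```

Both the then and else branches could then share one guarded block instead of duplicating the save/clear/restore sequence.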

@zhusy54 zhusy54 changed the title feat(codegen): orchestration tensor APIs, scalar min/max, and if/view codegen [WIP] feat(codegen): orchestration tensor APIs, scalar min/max, and if/view codegen Feb 14, 2026

coderabbitai bot commented Feb 24, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

📝 Walkthrough

Adds min/max scalar operators to the IR layer, enhances parser to dispatch scalar operations, improves code generation infrastructure with virtual methods for naming and type handling, refactors orchestration code generation to track SSA base names, and introduces a paged attention example demonstrating orchestration workflow via PyPTO DSL.

Changes

Cohort / File(s) Summary
New Scalar Operations
include/pypto/ir/scalar_expr.h, python/bindings/modules/ir.cpp
Introduces MakeMin and MakeMax binary expression constructors with automatic operand type promotion.
Type Utilities
include/pypto/core/dtype.h, src/ir/transforms/python_printer.cpp
Adds DataTypeToString utility function and migrates python printer from custom DataTypeToPythonString helper to use the new utility directly.
Parser Scalar Dispatch
python/pypto/language/parser/ast_parser.py, python/pypto/language/op/block_ops.py
Extends parser with scalar-dispatch path for min/max operations; adds overloads in block_ops.py supporting both Tile and Scalar variants.
Type Stubs & Python Bindings
python/pypto/pypto_core/ir.pyi
Expands operator surface with min_, max_, comparison operators (eq, ne, lt, le, gt, ge), negation, and introduces span parameter defaults across arithmetic/bitwise operators.
Codegen Base Infrastructure
include/pypto/codegen/codegen_base.h, src/codegen/codegen_base.cpp
Adds virtual methods GetExternalTensorName and GetRuntimeDataTypeString; converts TryGetVarName and GenerateExprString from static to virtual instance methods; extends expression string generation for new operators (Min, Max, comparisons, etc.).
Orchestration Codegen Refactoring
src/codegen/orchestration/orchestration_codegen.cpp
Major refactor introducing SSA base-name resolution for variables, inout tensor tracking across tuple-returning calls, per-call tuple element mapping, and inline external tensor generation; replaces TaskRecord and scope-based handling with streamlined SSA-aware code emission.
Tensor Op Codegen
src/codegen/tensor_op_codegen.cpp
Updates tensor_create to use shape arrays and runtime DataType; refactors tensor_read index computation with linear indexing and optional intermediate variables; adds tensor_view operator for shape/offset-based tensor views.
CCE Backend & Codegen
src/backend/910B_CCE/backend_910b_cce_ops.cpp, src/codegen/cce/cce_codegen.cpp
Guards col_offset calculation against missing second offset in block load/store ops; changes ForStmt loop variable from int64_t to uint64_t.
Paged Attention Example
examples/ir_parser/paged_attention_example.py
New example demonstrating paged attention orchestration via PyPTO DSL with QK/PV matmul kernels, softmax preparation, online updates, and IR code generation pipeline.
Test Coverage
tests/ut/ir/expressions/test_scalar_ops.py, tests/ut/language/parser/test_scalar_dispatch.py, tests/ut/codegen/test_cce_codegen.py, tests/ut/codegen/test_orchestration_codegen.py
Adds unit tests for min/max scalar operations, scalar dispatch behavior in parser, updates CCE loop variable type expectations, and extensively refactors orchestration codegen test expectations for SSA-aware generation and shape-based tensor construction.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • Hzfengsy

Poem

🐰 New min and max are hopping in,
SSA names make code so clean,
Paged attention now takes flight,
With shapes and views shining bright,
The orchestration dance looks right!

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 warning

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 60.63%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2)

  • Title check ✅ Passed: The title '[WIP] feat(codegen): orchestration tensor APIs, scalar min/max, and if/view codegen' clearly summarizes the main changes in the PR.
  • Description check ✅ Passed: The description is comprehensive and directly related to the changeset, covering orchestration refactoring, scalar operations, new codegen support, and tests.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 17

🧹 Nitpick comments (1)
examples/ir_parser/paged_attention_example.py (1)

204-214: Inconsistent # type: ignore suppression across kernel call sites.

Line 210 explicitly suppresses type errors with # type: ignore[reportArgumentType], but lines 204 and 214 (which call kernel_qk_matmul and kernel_pv_matmul with mismatched shapes/types just the same) have no such annotation. Add matching suppression comments for consistency.

♻️ Proposed fix
-                    sij = self.kernel_qk_matmul(qi, kj, sij)
+                    sij = self.kernel_qk_matmul(qi, kj, sij)  # type: ignore[reportArgumentType]
-                    oi_tmp = self.kernel_pv_matmul(pij, vj, oi_tmp)
+                    oi_tmp = self.kernel_pv_matmul(pij, vj, oi_tmp)  # type: ignore[reportArgumentType]
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/ir_parser/paged_attention_example.py`:
- Around line 105-107: The kernel function signature includes unused parameters
mi, li, and oi_tmp which trigger ARG002; update the kernel (the function that
declares mi, li, oi_tmp in its signature) to either reference them with a short
explanatory inline comment or add a linter suppression (e.g., append “# noqa:
ARG002” to the parameter list or to the function definition line) to indicate
these are intentionally unused in this simplified example and silence the static
analyzer.
- Around line 112-113: The kernel signature for the paging kernel declares
parameters is_first and is_last as pl.Scalar[pl.INT64] but the orchestration
sets/passes them as pl.Scalar[pl.UINT64]; change the kernel parameter types to
pl.Scalar[pl.UINT64] to match the orchestration (or vice versa if you prefer
signed everywhere) so the types are consistent—update the kernel function
signature where is_first and is_last are declared to use pl.UINT64 and ensure
any related annotations/reference to those symbols (is_first, is_last, kernel
signature) reflect the same UINT64 type.
- Line 167: The variable block_num is assigned via pl.tensor.read(config, [5])
in paged_attention_example.py but never used (Ruff F841); remove the unused
assignment or replace its usage: either delete the line "block_num:
pl.Scalar[pl.UINT64] = pl.tensor.read(config, [5])" if it's vestigial, or use
the value where intended (e.g., pass block_num to the function that needs the
block index or incorporate it into the paging logic). Locate the assignment by
the symbol block_num and the call pl.tensor.read and either remove it or wire it
into the downstream code that expects the block number.
- Around line 132-144: The function kernel_init_inplace has a shape mismatch:
parameter oi is annotated as pl.Tensor[[16, 16], pl.FP32] but the function
returns oi as the first element with return type pl.Tensor[[16, 128], pl.FP32]
and the call site passes an oi of shape [q_tile, head_dim] (16,128). Fix this by
updating the oi parameter annotation in kernel_init_inplace to pl.Tensor[[16,
128], pl.FP32] so it matches the declared return type and the call site; verify
the other parameters (li, mi) and the return tuple remain unchanged.
- Around line 87-99: The local variable annotation for out in kernel_pv_matmul
is incorrect: change out's declared shape from pl.Tensor[[16, 128], pl.FP32] to
pl.Tensor[[16, 16], pl.FP32] so it matches the output parameter and the function
return type; ensure the pl.l0c_store call remains the same but that out's
annotation and any uses reflect the [16, 16] shape.

In `@include/pypto/core/dtype.h`:
- Around line 336-369: DataTypeToString currently maps DataType::BF16 to
"BFLOAT16", producing invalid symbols; change the BF16 branch in
DataTypeToString to return "BF16" instead of "BFLOAT16" and adjust the comment
examples that show how callers compose names (the Python printer and C++ codegen
examples) so they reflect the canonical enum-style "BF16" suffix; keep all other
mappings unchanged and ensure DataTypeToString still returns "UnknownType" for
unknown values.
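The fixed mapping can be sketched in Python (the member names and surrounding dtype set here are illustrative, not the actual DataType enum):

```python
def datatype_to_string(dtype: str) -> str:
    # Canonical enum-style suffixes used when composing symbol names;
    # BF16 must map to "BF16" -- "BFLOAT16" produced invalid symbols.
    names = {
        "FP16": "FP16",
        "FP32": "FP32",
        "BF16": "BF16",
        "INT64": "INT64",
        "UINT64": "UINT64",
    }
    return names.get(dtype, "UnknownType")
```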

In `@python/pypto/language/op/block_ops.py`:
- Around line 755-775: The implementations of max and min always call
_ir_ops.max/_ir_ops.min and return a Tile even when overloads expect Scalar
results; update both functions (max and min) to dispatch based on input types:
if tile is a Scalar, call ir.max_ / ir.min_ from pypto.pypto_core.ir (import
them) with the two scalar operands and return a Scalar; if tile is a Tile,
validate that axis is an int (raise TypeError if not), call
_ir_ops.max/_ir_ops.min with tile.unwrap(), axis, keepdim and return a Tile;
ensure imports for ir.max_/ir.min_ are added and no Tile is created for scalar
paths.
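The intended dispatch can be sketched as follows (Scalar and Tile here are minimal stand-ins, and the scalar branch stands in for ir.max_ / ir.min_):

```python
class Scalar:
    def __init__(self, value):
        self.value = value

class Tile:
    def __init__(self, values):
        self.values = values

    def unwrap(self):
        return self.values

def pl_max(x, other_or_axis, keepdim=False):
    if isinstance(x, Scalar):
        # Scalar path: route to the scalar IR op and return a Scalar --
        # no Tile is ever created here.
        return Scalar(max(x.value, other_or_axis.value))
    if not isinstance(other_or_axis, int):
        raise TypeError("axis must be an int for tile reduction")
    # Tile path: axis reduction (stands in for _ir_ops.max).
    return Tile([max(x.unwrap())])
```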

In `@python/pypto/language/parser/ast_parser.py`:
- Around line 1518-1551: In _parse_scalar_op, enforce exact arity and reject
keywords: for binary ops (self._SCALAR_BINARY_OPS) require exactly 2 positional
args and no keywords on the ast.Call (check len(call.args) == 2 and not
call.keywords), and for unary ops (self._SCALAR_UNARY_OPS) require exactly 1
positional arg and no keywords (len(call.args) == 1 and not call.keywords); when
the checks fail raise InvalidOperationError with a clear message and
span=call_span (same pattern used currently), then continue to parse via
self.parse_expression and call the IR constructor retrieved via getattr(ir,
ir_func_name).
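A Python sketch of the arity and keyword validation (the op sets and error type are illustrative, not the parser's actual names):

```python
SCALAR_BINARY_OPS = {"min_", "max_"}   # illustrative op sets
SCALAR_UNARY_OPS = {"abs_", "neg_"}

def check_scalar_op_call(func_name, num_positional, num_keywords):
    # Binary scalar ops take exactly 2 positional args, unary ops exactly 1;
    # keyword arguments are rejected in both cases.
    if func_name in SCALAR_BINARY_OPS:
        if num_positional != 2 or num_keywords:
            raise ValueError(f"{func_name} expects exactly 2 positional arguments and no keywords")
    elif func_name in SCALAR_UNARY_OPS:
        if num_positional != 1 or num_keywords:
            raise ValueError(f"{func_name} expects exactly 1 positional argument and no keywords")
```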

In `@src/backend/910B_CCE/backend_910b_cce_ops.cpp`:
- Around line 131-133: The access to offsets_tuple->elements_[0] is unguarded
and can crash for zero-length tuples; add the same size guard used for
col_offset before reading elements_[0] in all three functions
(MakeBlockStoreCodegenCCE, MakeBlockL0CStoreCodegenCCE and the third identical
block), e.g., CHECK(offsets_tuple->elements_.size() > 0) (or equivalent) and
only call codegen.GetExprAsCode(offsets_tuple->elements_[0]) after that check so
both row_offset and col_offset are protected for 0D/1D tuples.

In `@src/codegen/cce/cce_codegen.cpp`:
- Around line 452-453: The generated loop uses an unsigned uint64_t for the loop
index (see emitter_.EmitLine call that constructs the for with loop_var_name,
start, stop, step), which will wrap for negative/descending ranges; change the
emitted loop index type to a signed type (e.g., int64_t) and ensure comparisons
and increments use the same signed type and expressions (loop_var_name, start,
stop, step) so negative values behave correctly; alternatively, if you choose to
keep unsigned, add a precondition/assertion or runtime check before code
generation that start/stop/step are non-negative and refuse generation
otherwise—update any places that assume unsigned loop semantics to use the
signed index or the new precondition.
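The wrap hazard is easy to demonstrate by reinterpreting negative values as uint64, which is what the emitted loop index would hold:

```python
def as_u64(value: int) -> int:
    # Reinterpret a signed value as the uint64 the emitted loop index would see.
    return value & ((1 << 64) - 1)

# A descending range with step = -1 becomes a step of 2**64 - 1 under unsigned
# arithmetic, and a start of -1 wraps to a huge positive index, so a loop like
# `for (uint64_t i = start; i < stop; i += step)` never behaves as intended.
```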

In `@src/codegen/orchestration/orchestration_codegen.cpp`:
- Around line 44-73: GetSSABaseName is over-eagerly treating any trailing
"_<digits>" as SSA and will collapse real user names like "foo_1"; change the
heuristic to require an explicit SSA marker (e.g., "_ssa_" / "_iter_ssa_")
instead of plain underscores before digits: update GetSSABaseName to check for
"_iter_ssa_" and "_ssa_" (and only strip those markers plus trailing digits),
and update the other analogous spots called out in the review (the code blocks
at ~357-360, 491-495, 714-720, 879-883) so SSA generation/consumption uses the
new reserved marker consistently rather than stripping any "_<digits>" suffix.
Ensure callers that emit SSA names are changed to emit the reserved "_ssa_" form
so original user identifiers ending in "_<digits>" are preserved.
- Around line 420-435: The emitted shape array uses a zero-length declaration
when ndim==0, which is invalid C++; update GenerateMakeTensorExternal and the
inline intermediate tensor creation that emits "uint64_t <name>_shapes[ndim]" to
allocate at least one element: compute a shape storage length like shape_len =
std::max<size_t>(1, ndim) and declare "uint64_t <var_name>_shapes[shape_len]"
(or equivalent), fill only the first ndim entries (no-op when ndim==0), but
continue to pass the actual ndim to make_tensor_external (and any runtime call)
so the runtime still sees a 0-rank tensor; adjust uses of ndim in the
initializer/loop accordingly and reference GenerateMakeTensorExternal, the
generated "<var_name>_shapes" array, and the make_tensor_external call to locate
changes.
- Around line 671-675: The current dedup branch that returns when (op_name ==
"tensor.create" && declared_vars_.count(result_var)) silently drops subsequent
tensor.create allocations after SSA collapse; update the logic in
orchestration_codegen.cpp to handle repeated tensor.create for the same
result_var by either (A) emitting an assignment statement (e.g., result_var =
make_tensor(...)) instead of returning when declared_vars_ contains result_var,
or (B) enforce/transform to a unique allocation name before codegen; locate the
branch checking op_name == "tensor.create", declared_vars_, and result_var and
replace the early return with code that generates an assignment to result_var
(or performs a unique rename) so each create results in a proper allocation or
re-assignment.
- Around line 732-739: The scalar-arg path in the loop over call->args_
(TryGetVarName, As<ScalarType>, make_scalar_param) emits
make_scalar_param(var_name) unconditionally, which skips dtype-aware packing for
float scalars; update the branch to inspect the ScalarType's dtype and, for
float (and any other types that require special packing), emit
make_scalar_param(float_to_u64(var_name)) (or the appropriate packing function
for that dtype) while preserving the existing behavior for integer/bool scalars;
adjust the logic around As<ScalarType>(arg->GetType()) and the params push to
choose the packed expression based on scalar->dtype() so scalar variables are
packed identically to scalar constants.
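The marker-based heuristic suggested for GetSSABaseName might look like this in Python (the regex form is an assumption; the C++ version would do the equivalent string checks):

```python
import re

_SSA_SUFFIX = re.compile(r"_(?:iter_)?ssa_\d+$")

def get_ssa_base_name(name: str) -> str:
    # Strip only the reserved "_ssa_<n>" / "_iter_ssa_<n>" markers;
    # a plain "_<digits>" suffix like "foo_1" is a real user name and is kept.
    return _SSA_SUFFIX.sub("", name)
```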
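The dtype-aware scalar packing could likewise be sketched as string emission (the dtype names and double-width bit-cast are assumptions; the runtime's float_to_u64 may reinterpret narrower float bits):

```python
import struct

def float_to_u64(value: float) -> int:
    # Bit-cast a double into a uint64 parameter slot.
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    return bits

def make_scalar_param_expr(var_name: str, dtype: str) -> str:
    # Float scalars get the bit-pattern packing; integer/bool scalars pass
    # through unchanged, matching how scalar constants are already packed.
    if dtype in ("FP16", "FP32", "FP64", "BF16"):
        return f"make_scalar_param(float_to_u64({var_name}))"
    return f"make_scalar_param({var_name})"
```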

In `@src/codegen/tensor_op_codegen.cpp`:
- Around line 99-125: The generated index code must handle rank-0 (empty
indices) and validate indices vs shape length: before building idx_expr check if
indices.empty() and set idx_expr to "0" (or generate a direct scalar load) to
avoid emitting an empty assignment; also check indices.size() == shape.size()
(or emit a runtime/compile-time assertion or throw std::runtime_error) and fail
fast if they differ to avoid silently wrong linear index calculation. Update the
block that builds idx_oss / idx_expr (references: indices, shape,
codegen.GenerateExprString, idx_expr, result_var, ptr_expr, cpp_type) to perform
these validations and then proceed with the existing simple/complex-index
emission logic.
- Around line 55-70: The tensor.create code emits a zero-length C array and
tensor.read emits an empty index for rank-0 tensors; change the shapes array
allocation to use at least one element (e.g., size = std::max<size_t>(1, ndim))
and when ndim==0 emit a single dummy dimension value of 1 in the initializer for
result_var_shapes, and in tensor.read ensure the computed index (idx_var) is "0"
for ndim==0 instead of an empty expression; update the logic around
result_var_shapes, the make_tensor call, and the index computation (referencing
result_var, result_var_shapes, CalculateTensorSizeExpr and the idx_var/index
expression generation) to handle ndim==0 accordingly.
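Both fixes, the non-zero-length shapes array and the rank-0 index, can be sketched as emission helpers (the emitted layout and the row-major index scheme are illustrative assumptions):

```python
def emit_shapes_decl(result_var: str, shape: list) -> str:
    # Allocate at least one element so rank 0 never yields a zero-length
    # C array; the true ndim is still what gets passed to make_tensor.
    ndim = len(shape)
    storage = shape if ndim > 0 else [1]  # single dummy dimension for rank 0
    dims = ", ".join(str(d) for d in storage)
    return f"uint64_t {result_var}_shapes[{max(1, ndim)}] = {{{dims}}};"

def linear_index_expr(indices: list, shape: list) -> str:
    # Rank-0 reads use index "0"; otherwise the index count must match the rank.
    if not indices:
        return "0"
    if len(indices) != len(shape):
        raise ValueError("index count does not match tensor rank")
    expr = indices[0]
    for idx, dim in zip(indices[1:], shape[1:]):
        expr = f"({expr}) * {dim} + {idx}"
    return expr
```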

In `@tests/ut/codegen/test_orchestration_codegen.py`:
- Around line 930-935: The parameter `flag` in function kernel_process is unused
and triggers ARG002; rename it to `_flag` (or `_flag: pl.Scalar[pl.INT64]`) or
annotate/ignore it with a noqa to mark it intentional. Locate the kernel_process
definition and update the parameter name or add the noqa inline so the linter no
longer reports ARG002 while preserving the function signature semantics.

---

Nitpick comments:
In `@examples/ir_parser/paged_attention_example.py`:
- Around line 204-214: The calls to the low-level kernels are inconsistent in
type-ignore suppression: kernel_softmax_prepare already has "# type:
ignore[reportArgumentType]" but kernel_qk_matmul and kernel_pv_matmul do not,
causing linter/type-check noise; add the same "# type:
ignore[reportArgumentType]" suppression to the kernel_qk_matmul(...) and
kernel_pv_matmul(...) call sites so all three kernel invocations
(kernel_qk_matmul, kernel_softmax_prepare, kernel_pv_matmul) consistently
suppress the reported argument-type errors.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d1dc1fe and 2b9f923.

📒 Files selected for processing (18)
  • examples/ir_parser/paged_attention_example.py
  • include/pypto/codegen/codegen_base.h
  • include/pypto/core/dtype.h
  • include/pypto/ir/scalar_expr.h
  • python/bindings/modules/ir.cpp
  • python/pypto/language/op/block_ops.py
  • python/pypto/language/parser/ast_parser.py
  • python/pypto/pypto_core/ir.pyi
  • src/backend/910B_CCE/backend_910b_cce_ops.cpp
  • src/codegen/cce/cce_codegen.cpp
  • src/codegen/codegen_base.cpp
  • src/codegen/orchestration/orchestration_codegen.cpp
  • src/codegen/tensor_op_codegen.cpp
  • src/ir/transforms/python_printer.cpp
  • tests/ut/codegen/test_cce_codegen.py
  • tests/ut/codegen/test_orchestration_codegen.py
  • tests/ut/ir/expressions/test_scalar_ops.py
  • tests/ut/language/parser/test_scalar_dispatch.py

Comment on lines +87 to +99
@pl.function(type=pl.FunctionType.InCore)
def kernel_pv_matmul(
self,
pij: pl.Tensor[[16, 128], pl.BF16],
vj: pl.Tensor[[128, 128], pl.BF16],
output: pl.Tensor[[16, 16], pl.FP32],
) -> pl.Tensor[[16, 16], pl.FP32]:
"""PV matmul: oi_tmp = pij @ vj (CUBE)."""
p_tile: pl.Tile[[16, 16], pl.BF16] = pl.load(pij, [0, 0], [16, 16])
v_tile: pl.Tile[[16, 16], pl.BF16] = pl.load(vj, [0, 0], [16, 16])
result: pl.Tile[[16, 16], pl.FP32] = pl.matmul(p_tile, v_tile)
out: pl.Tensor[[16, 128], pl.FP32] = pl.l0c_store(result, [0, 0], [16, 16], output)
return out

⚠️ Potential issue | 🟡 Minor

out local annotation [16, 128] contradicts the output parameter and return type [16, 16].

Line 98 annotates out as pl.Tensor[[16, 128], pl.FP32], but both the output parameter (line 92) and the return type (line 93) are [16, 16]. The annotation should be [16, 16] to be consistent.

🐛 Proposed fix
-        out: pl.Tensor[[16, 128], pl.FP32] = pl.l0c_store(result, [0, 0], [16, 16], output)
+        out: pl.Tensor[[16, 16], pl.FP32] = pl.l0c_store(result, [0, 0], [16, 16], output)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/ir_parser/paged_attention_example.py` around lines 87 - 99, The
local variable annotation for out in kernel_pv_matmul is incorrect: change out's
declared shape from pl.Tensor[[16, 128], pl.FP32] to pl.Tensor[[16, 16],
pl.FP32] so it matches the output parameter and the function return type; ensure
the pl.l0c_store call remains the same but that out's annotation and any uses
reflect the [16, 16] shape.

Comment on lines +105 to +107
mi: pl.Tensor[[16], pl.FP32],
li: pl.Tensor[[16], pl.FP32],
oi_tmp: pl.Tensor[[16, 128], pl.FP32],

⚠️ Potential issue | 🟡 Minor

Unused kernel parameters mi, li, oi_tmp flagged by static analysis.

These parameters are accepted by the kernel but never referenced in the body (Ruff ARG002). Since the docstring notes that this is a simplified example body, consider adding a brief inline comment or suppression (e.g., # noqa: ARG002) to signal intent and silence the linter.

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 105-105: Unused method argument: mi

(ARG002)


[warning] 106-106: Unused method argument: li

(ARG002)


[warning] 107-107: Unused method argument: oi_tmp

(ARG002)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/ir_parser/paged_attention_example.py` around lines 105 - 107, The
kernel function signature includes unused parameters mi, li, and oi_tmp which
trigger ARG002; update the kernel (the function that declares mi, li, oi_tmp in
its signature) to either reference them with a short explanatory inline comment
or add a linter suppression (e.g., append “# noqa: ARG002” to the parameter list
or to the function definition line) to indicate these are intentionally unused
in this simplified example and silence the static analyzer.

Comment on lines +112 to +113
is_first: pl.Scalar[pl.INT64],
is_last: pl.Scalar[pl.INT64],

⚠️ Potential issue | 🟡 Minor

is_first/is_last scalar type mismatches between kernel signature (INT64) and orchestration usage (UINT64).

The kernel declares is_first and is_last as pl.Scalar[pl.INT64] (lines 112–113), but the orchestration assigns and passes them as pl.Scalar[pl.UINT64] (lines 218 and 222). Pick one type consistently; since pl.yield_(1) naturally produces an unsigned value here, prefer pl.UINT64 in the kernel signature.

🐛 Proposed fix
-        is_first: pl.Scalar[pl.INT64],
-        is_last: pl.Scalar[pl.INT64],
+        is_first: pl.Scalar[pl.UINT64],
+        is_last: pl.Scalar[pl.UINT64],
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 112-112: Unused method argument: is_first

(ARG002)


[warning] 113-113: Unused method argument: is_last

(ARG002)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/ir_parser/paged_attention_example.py` around lines 112 - 113, The
kernel signature for the paging kernel declares parameters is_first and is_last
as pl.Scalar[pl.INT64] but the orchestration sets/passes them as
pl.Scalar[pl.UINT64]; change the kernel parameter types to pl.Scalar[pl.UINT64]
to match the orchestration (or vice versa if you prefer signed everywhere) so
the types are consistent—update the kernel function signature where is_first and
is_last are declared to use pl.UINT64 and ensure any related
annotations/reference to those symbols (is_first, is_last, kernel signature)
reflect the same UINT64 type.

Comment on lines +132 to +144
@pl.function(type=pl.FunctionType.InCore)
def kernel_init_inplace(
self,
oi: pl.Tensor[[16, 16], pl.FP32],
li: pl.Tensor[[16], pl.FP32],
mi: pl.Tensor[[16], pl.FP32],
) -> tuple[
pl.Tensor[[16, 128], pl.FP32],
pl.Tensor[[16], pl.FP32],
pl.Tensor[[16], pl.FP32],
]:
"""Initialize inplace accumulators to zero (VECTOR)."""
return oi, li, mi

⚠️ Potential issue | 🟡 Minor

oi parameter shape conflicts with the return type annotation and call site.

oi is declared [16, 16] (line 135) but the first return type is [16, 128] (line 139). Since the body directly returns oi, these must match. The call site (line 185) also passes an oi created as [q_tile, head_dim] (i.e., [16, 128]), confirming the parameter annotation should be [16, 128].

🐛 Proposed fix
-        oi: pl.Tensor[[16, 16], pl.FP32],
+        oi: pl.Tensor[[16, 128], pl.FP32],
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/ir_parser/paged_attention_example.py` around lines 132 - 144, The
function kernel_init_inplace has a shape mismatch: parameter oi is annotated as
pl.Tensor[[16, 16], pl.FP32] but the function returns oi as the first element
with return type pl.Tensor[[16, 128], pl.FP32] and the call site passes an oi of
shape [q_tile, head_dim] (16,128). Fix this by updating the oi parameter
annotation in kernel_init_inplace to pl.Tensor[[16, 128], pl.FP32] so it matches
the declared return type and the call site; verify the other parameters (li, mi)
and the return tuple remain unchanged.

num_heads: pl.Scalar[pl.UINT64] = pl.tensor.read(config, [1])
head_dim: pl.Scalar[pl.UINT64] = pl.tensor.read(config, [3])
block_size: pl.Scalar[pl.UINT64] = pl.tensor.read(config, [4])
block_num: pl.Scalar[pl.UINT64] = pl.tensor.read(config, [5])

⚠️ Potential issue | 🟡 Minor

block_num is assigned but never used — Ruff F841.

🐛 Proposed fix
-        block_num: pl.Scalar[pl.UINT64] = pl.tensor.read(config, [5])
🧰 Tools
🪛 Ruff (0.15.2)

[error] 167-167: Local variable block_num is assigned to but never used

Remove assignment to unused variable block_num

(F841)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/ir_parser/paged_attention_example.py` at line 167, The variable
block_num is assigned via pl.tensor.read(config, [5]) in
paged_attention_example.py but never used (Ruff F841); remove the unused
assignment or replace its usage: either delete the line "block_num:
pl.Scalar[pl.UINT64] = pl.tensor.read(config, [5])" if it's vestigial, or use
the value where intended (e.g., pass block_num to the function that needs the
block index or incorporate it into the paging logic). Locate the assignment by
the symbol block_num and the call pl.tensor.read and either remove it or wire it
into the downstream code that expects the block number.

Comment on lines +671 to 675
// Dedup: skip if this tensor was already declared (SSA name collapse)
if (op_name == "tensor.create" && declared_vars_.count(result_var)) {
return;
}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file and the specific lines mentioned
wc -l src/codegen/orchestration/orchestration_codegen.cpp

Repository: hw-native-sys/pypto

Length of output: 118


🏁 Script executed:

# Read the context around lines 671-675
sed -n '650,700p' src/codegen/orchestration/orchestration_codegen.cpp | cat -n

Repository: hw-native-sys/pypto

Length of output: 2104


🏁 Script executed:

# Search for declared_vars_ to understand its purpose
rg "declared_vars_" src/codegen/orchestration/orchestration_codegen.cpp -B 2 -A 2

Repository: hw-native-sys/pypto

Length of output: 981


🏁 Script executed:

# Search for tensor.create to understand how it's used
rg "tensor\.create" src/codegen/orchestration/orchestration_codegen.cpp -B 3 -A 3

Repository: hw-native-sys/pypto

Length of output: 1480


🏁 Script executed:

# Search for SSA name collapse context
rg "SSA" src/codegen/orchestration/orchestration_codegen.cpp -B 3 -A 3

Repository: hw-native-sys/pypto

Length of output: 8456


🏁 Script executed:

# Look for how result_var is computed and passed to GenerateTensorOpCode
rg "GenerateTensorOpCode" src/codegen/orchestration/orchestration_codegen.cpp -B 5 -A 2

Repository: hw-native-sys/pypto

Length of output: 936


🏁 Script executed:

# Search for GetSSABaseName to understand name collapse
rg "GetSSABaseName" src/codegen/orchestration/orchestration_codegen.cpp -B 2 -A 2

Repository: hw-native-sys/pypto

Length of output: 5239


🏁 Script executed:

# Look at the VisitStmt for more context on tensor.create handling
sed -n '750,850p' src/codegen/orchestration/orchestration_codegen.cpp | cat -n

Repository: hw-native-sys/pypto

Length of output: 5250


🏁 Script executed:

# Look for tests related to tensor.create dedup
find . -type f -name "*.py" -o -name "*.txt" -o -name "*.md" | grep -E "(test|spec)" | head -20

Repository: hw-native-sys/pypto

Length of output: 785


🏁 Script executed:

# Search for any comments or documentation about the dedup logic
rg "tensor.create" --type=cpp -B 5 -A 5 | grep -A 10 -B 10 "dedup\|SSA"

Repository: hw-native-sys/pypto

Length of output: 2715


🏁 Script executed:

# Check if there are scenarios where the same base name gets assigned in multiple branches
rg "tensor\.create" --type=cpp --context=10 | head -100

Repository: hw-native-sys/pypto

Length of output: 7266


🏁 Script executed:

# Look at the VisitStmt for assignments to understand the flow better
sed -n '870,950p' src/codegen/orchestration/orchestration_codegen.cpp | cat -n

Repository: hw-native-sys/pypto

Length of output: 3784


🏁 Script executed:

# Let's look at the test file for orchestration codegen
cat tests/ut/codegen/test_orchestration_codegen.py | head -100

Repository: hw-native-sys/pypto

Length of output: 3997


🏁 Script executed:

# Check if there are scenarios with multiple tensor.create calls in tests
rg "tensor\.create" tests/ --type=py -B 3 -A 3

Repository: hw-native-sys/pypto

Length of output: 5790


🏁 Script executed:

# Let's look at the actual codegen for tensor.create to see what it generates
sed -n '1,100p' src/codegen/tensor_op_codegen.cpp | cat -n

Repository: hw-native-sys/pypto

Length of output: 4533


🏁 Script executed:

# Find the REGISTER_ORCHESTRATION_OP for tensor.create to see what it generates
rg "REGISTER_ORCHESTRATION_OP.*tensor_create" -A 30 src/codegen/tensor_op_codegen.cpp

Repository: hw-native-sys/pypto

Length of output: 1394


🏁 Script executed:

# Search for test cases with multiple tensor.create in a single function
rg "tensor.create" tests/ut/codegen/test_orchestration_codegen.py -A 20 | head -80

Repository: hw-native-sys/pypto

Length of output: 1879


🏁 Script executed:

# Check if there's any documentation about the dedup intent
rg "dedup\|SSA.*collapse" src/codegen/orchestration/ -B 3 -A 3

Repository: hw-native-sys/pypto

Length of output: 45


🏁 Script executed:

# Look for scenarios in loops or conditionals where the same base var might be reassigned
sed -n '730,900p' src/codegen/orchestration/orchestration_codegen.cpp | cat -n

Repository: hw-native-sys/pypto

Length of output: 8531


🏁 Script executed:

# Let's look for loop or conditional scenarios where the same base name might be reused
sed -n '500,700p' src/codegen/orchestration/orchestration_codegen.cpp | cat -n

Repository: hw-native-sys/pypto

Length of output: 8376


🏁 Script executed:

# Check the full VisitStmt method
sed -n '860,950p' src/codegen/orchestration/orchestration_codegen.cpp | cat -n

Repository: hw-native-sys/pypto

Length of output: 4276


🏁 Script executed:

# Let's check if there are any examples or comments about loop-based tensor reallocation
rg "tensor.create" tests/ --type=py -B 5 -A 5 | grep -A 10 -B 10 "loop\|for\|while" | head -60

Repository: hw-native-sys/pypto

Length of output: 4667


🏁 Script executed:

# Check if the comment mentions anything about the intent
sed -n '600,650p' src/codegen/orchestration/orchestration_codegen.cpp | cat -n

Repository: hw-native-sys/pypto

Length of output: 2051


🏁 Script executed:

# Let's check if there are tests with tensor.create in loops to understand the expected behavior
rg "tensor\.create" tests/ut/codegen/ --type=py -A 30 | grep -B 5 -A 20 "for\|ForStmt"

Repository: hw-native-sys/pypto

Length of output: 4425


🏁 Script executed:

# Check if declared_vars_ is scoped to control flow regions or global
rg "declared_vars_" src/codegen/orchestration/orchestration_codegen.cpp -B 2 -A 2

Repository: hw-native-sys/pypto

Length of output: 981


Don't silently skip tensor.create when the same base name appears multiple times after SSA collapse.

In loops and conditionals, multiple tensor.create operations can map to the same base name after SSA name collapse (e.g., mi_0 and mi_1 both collapse to mi). The current dedup check skips codegen for all but the first occurrence, causing subsequent allocations to be dropped and the old tensor to be silently reused. Either emit an assignment (mi = make_tensor(...)) for each create, or enforce that result variable names are unique per allocation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/codegen/orchestration/orchestration_codegen.cpp` around lines 671 - 675,
The current dedup branch that returns when (op_name == "tensor.create" &&
declared_vars_.count(result_var)) silently drops subsequent tensor.create
allocations after SSA collapse; update the logic in orchestration_codegen.cpp to
handle repeated tensor.create for the same result_var by either (A) emitting an
assignment statement (e.g., result_var = make_tensor(...)) instead of returning
when declared_vars_ contains result_var, or (B) enforce/transform to a unique
allocation name before codegen; locate the branch checking op_name ==
"tensor.create", declared_vars_, and result_var and replace the early return
with code that generates an assignment to result_var (or performs a unique
rename) so each create results in a proper allocation or re-assignment.
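Option (A), declare on first use and assign thereafter, can be sketched like this (the PTOTensor type name is hypothetical):

```python
def emit_tensor_create(result_var: str, declared_vars: set, make_expr: str) -> str:
    # First occurrence declares the tensor; later occurrences (after SSA
    # name collapse) re-assign instead of being silently dropped, so every
    # tensor.create still produces a fresh allocation.
    if result_var in declared_vars:
        return f"{result_var} = {make_expr};"
    declared_vars.add(result_var)
    return f"PTOTensor {result_var} = {make_expr};"
```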

Comment on lines 732 to +739
// Input args
for (const auto& arg : call->args_) {
std::string var_name = TryGetVarName(arg);
if (!var_name.empty()) {
// Check if this is a scalar variable (not a tensor) -> make_scalar_param
if (As<ScalarType>(arg->GetType())) {
params.push_back({"make_scalar_param", var_name});
continue;

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain (condensed): scripted checks inspected src/codegen/orchestration/orchestration_codegen.cpp around lines 600-900, confirmed that float_to_u64() exists and is applied to float constants, traced TryGetVarName() and the As&lt;ScalarType&gt; branch that emits make_scalar_param, searched for PTOParam and make_scalar_param definitions across the repo and headers, and reviewed tests/ut/codegen/test_orchestration_codegen.py (including test_if_statement) for scalar-parameter coverage. No dtype-aware packing was found on the scalar-variable path.
Scalar variables need dtype-aware packing for make_scalar_param.

The scalar variable path at lines 732-739 emits make_scalar_param(var_name) directly, creating an inconsistency with how scalar constants are handled. For float constants, the code correctly uses float_to_u64() conversion, but scalar float variables bypass this. If var_name refers to a float type variable, it should be packed identically to float constants.

Suggested fix
-        if (As<ScalarType>(arg->GetType())) {
-          params.push_back({"make_scalar_param", var_name});
+        if (auto scalar_type = As<ScalarType>(arg->GetType())) {
+          std::string cpp_type = scalar_type->dtype_.ToCTypeString();
+          if (cpp_type == "float") {
+            params.push_back({"make_scalar_param", "float_to_u64(" + var_name + ")"});
+          } else {
+            params.push_back({"make_scalar_param", var_name});
+          }
           continue;
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/codegen/orchestration/orchestration_codegen.cpp` around lines 732 - 739,
The scalar-arg path in the loop over call->args_ (TryGetVarName, As<ScalarType>,
make_scalar_param) emits make_scalar_param(var_name) unconditionally, which
skips dtype-aware packing for float scalars; update the branch to inspect the
ScalarType's dtype and, for float (and any other types that require special
packing), emit make_scalar_param(float_to_u64(var_name)) (or the appropriate
packing function for that dtype) while preserving the existing behavior for
integer/bool scalars; adjust the logic around As<ScalarType>(arg->GetType()) and
the params push to choose the packed expression based on scalar->dtype() so
scalar variables are packed identically to scalar constants.
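The bit-level packing the fix relies on can be illustrated outside the codebase. This is a minimal Python sketch of what a helper like `float_to_u64` plausibly does (the name comes from the review; the exact semantics in the repo are an assumption): the 32-bit IEEE-754 bit pattern is reinterpreted as an integer and zero-extended to 64 bits, with no numeric conversion.

```python
import struct

def float_to_u64(value: float) -> int:
    # Reinterpret the 32-bit float's bit pattern as an unsigned integer,
    # zero-extended to 64 bits -- the value is NOT numerically converted.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return bits

def u64_to_float(bits: int) -> float:
    # Inverse: recover the float from the low 32 bits of the word.
    (value,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))
    return value

# 1.0f has the IEEE-754 single-precision bit pattern 0x3F800000.
assert float_to_u64(1.0) == 0x3F800000
assert u64_to_float(float_to_u64(2.5)) == 2.5
```

If the runtime unpacks scalar parameters this way, passing a raw float variable (implicitly converted to an integer) instead of its bit pattern would silently corrupt the value, which is why variables must go through the same packing as constants.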

Comment on lines +55 to +70
```diff
   // tensor.create -> uint64_t var_shapes[N] = {...}; Tensor var = make_tensor(var_shapes, N, DataType::XX);
   auto result_type = As<TensorType>(op->GetType());
   CHECK(result_type) << "tensor.create must return TensorType";

   std::string result_var = codegen.GetCurrentResultTarget();
-  std::string size_expr = CalculateTensorSizeExpr(result_type, codegen);
+  size_t ndim = result_type->shape_.size();

   std::ostringstream oss;
-  oss << "Tensor " << result_var << " = make_tensor(" << size_expr << ");";
+  oss << "uint64_t " << result_var << "_shapes[" << ndim << "] = {";
+  for (size_t i = 0; i < ndim; ++i) {
+    if (i > 0) oss << ", ";
+    oss << codegen.GenerateExprString(result_type->shape_[i]);
+  }
+  oss << "};\n";
+  oss << "Tensor " << result_var << " = make_tensor(" << result_var << "_shapes, " << ndim << ", "
+      << codegen.GetRuntimeDataTypeString(result_type->dtype_) << ");";
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain (condensed): scripted checks located src/codegen/tensor_op_codegen.cpp, read the tensor.create and tensor.read emitters (roughly lines 26-126) including CalculateTensorSizeExpr, searched include/ and the rest of the repo for make_tensor declarations and for any existing rank-0/zero-length-array handling (none found), checked the C++ standard configured in CMakeLists.txt, and confirmed rank-0 tensor coverage (test_scalar_shape_dimensions) in tests/ut/codegen/test_orchestration_codegen.py.

Fix zero‑rank tensor handling in tensor.create and tensor.read code generation.

The current implementation emits invalid C++ for rank‑0 tensors. When shape_ is empty, line 63 generates uint64_t <name>_shapes[0] = {}, which violates C++17 (zero‑length arrays are non‑standard). Additionally, tensor.read at lines 99‑125 produces an empty index expression size_t idx_var = ; for rank‑0 tensors, causing a syntax error.

Allocate a minimum of 1 element for the shapes array and provide a dummy value when the shape is empty, following the pattern used in CalculateTensorSizeExpr (lines 41‑44). The codebase has test coverage for rank‑0 tensors (test_scalar_shape_dimensions), so this must be supported.

🔧 Suggested fix for tensor.create
-  size_t ndim = result_type->shape_.size();
+  size_t ndim = result_type->shape_.size();
+  size_t shape_len = (ndim == 0) ? 1 : ndim;

   std::ostringstream oss;
-  oss << "uint64_t " << result_var << "_shapes[" << ndim << "] = {";
-  for (size_t i = 0; i < ndim; ++i) {
-    if (i > 0) oss << ", ";
-    oss << codegen.GenerateExprString(result_type->shape_[i]);
-  }
+  oss << "uint64_t " << result_var << "_shapes[" << shape_len << "] = {";
+  if (ndim == 0) {
+    oss << "1";
+  } else {
+    for (size_t i = 0; i < ndim; ++i) {
+      if (i > 0) oss << ", ";
+      oss << codegen.GenerateExprString(result_type->shape_[i]);
+    }
+  }
   oss << "};\n";
   oss << "Tensor " << result_var << " = make_tensor(" << result_var << "_shapes, " << ndim << ", "
       << codegen.GetRuntimeDataTypeString(result_type->dtype_) << ");";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/codegen/tensor_op_codegen.cpp` around lines 55 - 70, The tensor.create
code emits a zero-length C array and tensor.read emits an empty index for rank-0
tensors; change the shapes array allocation to use at least one element (e.g.,
size = std::max<size_t>(1, ndim)) and when ndim==0 emit a single dummy dimension
value of 1 in the initializer for result_var_shapes, and in tensor.read ensure
the computed index (idx_var) is "0" for ndim==0 instead of an empty expression;
update the logic around result_var_shapes, the make_tensor call, and the index
computation (referencing result_var, result_var_shapes, CalculateTensorSizeExpr
and the idx_var/index expression generation) to handle ndim==0 accordingly.
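The shape-array sizing rule the fix proposes can be sketched as a tiny string generator. The helper name and exact emitted text are illustrative, not the repo's emitter; the point is that a rank-0 tensor still gets a one-element C array holding a dummy dimension of 1 (since C++17 forbids zero-length arrays), while the `ndim` passed to `make_tensor` stays 0 so the runtime sees the true rank.

```python
def emit_shapes_array(name, shape_exprs):
    # C++17 forbids zero-length arrays, so allocate at least one element;
    # a rank-0 tensor gets a single dummy dimension of 1.
    ndim = len(shape_exprs)
    shape_len = max(1, ndim)
    init = ", ".join(shape_exprs) if ndim else "1"
    return f"uint64_t {name}_shapes[{shape_len}] = {{{init}}};"

assert emit_shapes_array("t0", []) == "uint64_t t0_shapes[1] = {1};"
assert emit_shapes_array("t1", ["16", "16"]) == "uint64_t t1_shapes[2] = {16, 16};"
```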

Comment on lines +99 to +125
```diff
   // Build linear index expression
+  std::ostringstream idx_oss;
   for (size_t i = 0; i < indices.size(); ++i) {
-    if (i > 0) oss << " + ";
-    oss << codegen.GenerateExprString(indices[i]);
+    if (i > 0) idx_oss << " + ";
+    idx_oss << codegen.GenerateExprString(indices[i]);
     for (size_t j = i + 1; j < shape.size(); ++j) {
-      oss << " * " << codegen.GenerateExprString(shape[j]);
+      idx_oss << " * " << codegen.GenerateExprString(shape[j]);
     }
   }
-  oss << ";\n";
-  oss << cpp_type << " " << result_var << " = static_cast<" << cpp_type << "*>(" << ptr_expr << ")[idx_"
-      << result_var << "];";
+  std::string idx_expr = idx_oss.str();
+
+  // Check if the index expression is a simple constant (all digits)
+  bool is_simple = !idx_expr.empty() && std::all_of(idx_expr.begin(), idx_expr.end(), ::isdigit);
+
+  std::ostringstream oss;
+  if (is_simple) {
+    // Inline constant index directly
+    oss << cpp_type << " " << result_var << " = static_cast<" << cpp_type << "*>(" << ptr_expr << ")["
+        << idx_expr << "];";
+  } else {
+    // Use intermediate variable for complex index expressions
+    oss << "size_t idx_" << result_var << " = " << idx_expr << ";\n";
+    oss << cpp_type << " " << result_var << " = static_cast<" << cpp_type << "*>(" << ptr_expr << ")[idx_"
+        << result_var << "];";
+  }

   return oss.str();
```

⚠️ Potential issue | 🟠 Major

Handle empty index tuples and validate rank.
If indices is empty (rank‑0 tensor), idx_expr becomes empty and the generated size_t idx_... = ; is invalid. Also, mismatched indices.size() vs shape.size() will silently compute a wrong linear index. Please add validation and a scalar fallback.

🔧 Suggested fix
   const auto& indices = indices_tuple->elements_;
   const auto& shape = input_type->shape_;
+  CHECK(indices.size() == shape.size())
+      << "tensor.read indices size must match tensor rank";

-  // Build linear index expression
-  std::ostringstream idx_oss;
-  for (size_t i = 0; i < indices.size(); ++i) {
-    if (i > 0) idx_oss << " + ";
-    idx_oss << codegen.GenerateExprString(indices[i]);
-    for (size_t j = i + 1; j < shape.size(); ++j) {
-      idx_oss << " * " << codegen.GenerateExprString(shape[j]);
-    }
-  }
-  std::string idx_expr = idx_oss.str();
+  std::string idx_expr;
+  if (indices.empty()) {
+    idx_expr = "0";
+  } else {
+    std::ostringstream idx_oss;
+    for (size_t i = 0; i < indices.size(); ++i) {
+      if (i > 0) idx_oss << " + ";
+      idx_oss << codegen.GenerateExprString(indices[i]);
+      for (size_t j = i + 1; j < shape.size(); ++j) {
+        idx_oss << " * " << codegen.GenerateExprString(shape[j]);
+      }
+    }
+    idx_expr = idx_oss.str();
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/codegen/tensor_op_codegen.cpp` around lines 99 - 125, The generated index
code must handle rank-0 (empty indices) and validate indices vs shape length:
before building idx_expr check if indices.empty() and set idx_expr to "0" (or
generate a direct scalar load) to avoid emitting an empty assignment; also check
indices.size() == shape.size() (or emit a runtime/compile-time assertion or
throw std::runtime_error) and fail fast if they differ to avoid silently wrong
linear index calculation. Update the block that builds idx_oss / idx_expr
(references: indices, shape, codegen.GenerateExprString, idx_expr, result_var,
ptr_expr, cpp_type) to perform these validations and then proceed with the
existing simple/complex-index emission logic.
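The row-major index arithmetic this codegen emits, together with the empty-indices fallback and rank validation the fix asks for, can be modeled directly. This is a sketch of the arithmetic, not the generator itself: stride for axis i is the product of all later extents, and a rank-0 tensor reads its single element at offset 0.

```python
def linear_index(indices, shape):
    # Validate rank up front instead of silently computing a wrong offset.
    if len(indices) != len(shape):
        raise ValueError("indices size must match tensor rank")
    if not indices:
        return 0  # rank-0 tensor: single element at offset 0
    idx = 0
    for i, index in enumerate(indices):
        stride = 1
        for extent in shape[i + 1:]:
            stride *= extent  # row-major: stride = product of trailing extents
        idx += index * stride
    return idx

assert linear_index([], []) == 0
assert linear_index([2, 3], [4, 8]) == 2 * 8 + 3  # row-major offset 19
```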

Comment on lines +930 to +935
def kernel_process(
self,
a: pl.Tensor[[16, 16], pl.FP32],
flag: pl.Scalar[pl.INT64],
output: pl.Tensor[[16, 16], pl.FP32],
) -> pl.Tensor[[16, 16], pl.FP32]:

⚠️ Potential issue | 🟡 Minor

Avoid unused parameter in kernel_process.
flag is unused and will trigger ARG002. Consider prefixing with _ or adding a noqa.

🔧 Suggested fix
-            def kernel_process(
-                self,
-                a: pl.Tensor[[16, 16], pl.FP32],
-                flag: pl.Scalar[pl.INT64],
-                output: pl.Tensor[[16, 16], pl.FP32],
-            ) -> pl.Tensor[[16, 16], pl.FP32]:
+            def kernel_process(
+                self,
+                a: pl.Tensor[[16, 16], pl.FP32],
+                _flag: pl.Scalar[pl.INT64],
+                output: pl.Tensor[[16, 16], pl.FP32],
+            ) -> pl.Tensor[[16, 16], pl.FP32]:
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 933-933: Unused method argument: flag

(ARG002)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/ut/codegen/test_orchestration_codegen.py` around lines 930 - 935, The
parameter `flag` in function kernel_process is unused and triggers ARG002;
rename it to `_flag` (or `_flag: pl.Scalar[pl.INT64]`) or annotate/ignore it
with a noqa to mark it intentional. Locate the kernel_process definition and
update the parameter name or add the noqa inline so the linter no longer reports
ARG002 while preserving the function signature semantics.

…n codegen

Multiple tuple-returning kernel calls (e.g. kernel_init_inplace,
kernel_softmax_prepare, kernel_online_update) shared each other's
output params due to GetSSABaseName() collapsing all _tuple_tmp_N
names to _tuple_tmp. Use per-call unique keys with Call* pointer
coordination to isolate each call's tuple elements.
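The collision described in the commit message can be reproduced in miniature: keying tuple temporaries by their SSA base name collapses distinct calls into one map entry, while adding a per-call discriminator (here Python's `id()` stands in for the `Call*` pointer) keeps them apart. All names below are illustrative, not the repo's actual data structures.

```python
import re

def ssa_base_name(name):
    # "_tuple_tmp_3" -> "_tuple_tmp": strips the SSA suffix, which is
    # exactly what made distinct tuple-returning calls share one slot.
    return re.sub(r"_\d+$", "", name)

class Call:
    def __init__(self, result):
        self.result = result

calls = [Call("_tuple_tmp_0"), Call("_tuple_tmp_1")]

# Buggy keying: both calls collapse onto the same entry.
buggy = {ssa_base_name(c.result): c for c in calls}
assert len(buggy) == 1

# Fixed keying: base name plus per-call object identity stays unique.
fixed = {(ssa_base_name(c.result), id(c)): c for c in calls}
assert len(fixed) == 2
```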
