
[WIP] feat(codegen): orchestration tensor APIs, scalar min/max, and if/view codegen #204

Open
zhusy54 wants to merge 4 commits into hw-native-sys:main from zhusy54:pa-orchest

Conversation


@zhusy54 zhusy54 commented Feb 14, 2026

Summary

Refactor orchestration codegen to use shape/dtype-based tensor APIs, add scalar min/max support to the DSL, and implement codegen for IfOp and ViewOp.

Changes

Orchestration Codegen Refactor

  • Replace size-based make_tensor(bytes) / make_tensor_external(ptr, size) with shape/dtype-based APIs: make_tensor(shapes, ndim, dtype) / make_tensor_external(ptr, shapes, ndim, dtype), providing the runtime with richer tensor metadata.
  • Remove outer/inner scope classification logic (TaskRecord, scope analysis, re-indentation). Scope management is now handled structurally by ForStmt and IfStmt visitors which emit PTO2_SCOPE(rt) blocks directly.
  • Emit intermediate tensor declarations inline at the point of task submission rather than hoisting to a separate section.
  • Consolidate non_task_code_ stream into the main code_ stream — all generated code now flows through a single output path.
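
The shape/dtype-based emission described above can be sketched as a small string-formatting helper. The function names and exact call shape below are illustrative assumptions, not the project's actual C++ helpers (the real GenerateMakeTensorExternal lives in orchestration_codegen.cpp):

```python
# Hypothetical sketch of shape/dtype-based tensor-call emission; names are
# illustrative only, mirroring make_tensor(shapes, ndim, dtype) and
# make_tensor_external(ptr, shapes, ndim, dtype).
def emit_make_tensor_external(ptr_name, shapes, dtype):
    """Format a make_tensor_external(ptr, {shapes}, ndim, dtype) call string."""
    shape_list = ", ".join(str(s) for s in shapes)
    return (f"make_tensor_external({ptr_name}, "
            f"{{{shape_list}}}, {len(shapes)}, DataType::{dtype})")

def emit_make_tensor(shapes, dtype):
    """Format a make_tensor({shapes}, ndim, dtype) call string."""
    shape_list = ", ".join(str(s) for s in shapes)
    return f"make_tensor({{{shape_list}}}, {len(shapes)}, DataType::{dtype})"
```

Compared with the old size-based calls, the runtime now receives the full shape and element type rather than a raw byte count.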

Scalar Min/Max Operations

  • Add MakeMin / MakeMax factory functions in scalar_expr.h with automatic type promotion.
  • Expose as ir.min_() / ir.max_() in Python bindings and type stubs.
  • Add parser dispatch: pl.min(scalar, scalar) and pl.max(scalar, scalar) now route to scalar IR ops via _parse_scalar_op(), while pl.min(tile, axis=...) continues to work as tile reduction.
  • Add @overload type signatures for pl.min / pl.max in block_ops.py to satisfy pyright.
  • Add codegen support for Min/Max expressions in GenerateExprString (emits std::min()/std::max()).
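
The automatic type promotion mentioned for MakeMin/MakeMax can be sketched as follows. The rank table and node shape here are hypothetical stand-ins, not the project's actual promotion rules:

```python
# Hedged sketch of automatic type promotion for scalar min/max, in the spirit
# of MakeMin/MakeMax; the dtype ranks are illustrative assumptions.
_RANK = {"INT32": 0, "INT64": 1, "FP16": 2, "FP32": 3}

def promote(lhs_dtype, rhs_dtype):
    """Pick the higher-ranked dtype as the result type."""
    return lhs_dtype if _RANK[lhs_dtype] >= _RANK[rhs_dtype] else rhs_dtype

def make_min(lhs, lhs_dtype, rhs, rhs_dtype):
    """Build a min node whose result dtype is the promoted operand dtype."""
    result_dtype = promote(lhs_dtype, rhs_dtype)
    return {"op": "min", "args": (lhs, rhs), "dtype": result_dtype}
```

With promotion handled at node-construction time, downstream codegen can emit std::min()/std::max() without worrying about mixed operand types.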

New Codegen: IfOp and ViewOp

  • IfStmt codegen: emits if/else blocks with PTO2_SCOPE(rt) wrapping, pre-declares return variables, and handles both branches.
  • tensor.view codegen: emits input.view({shapes}, {offsets}) for tensor slicing.

Shared Utilities

  • Extract DataTypeToString() into dtype.h as a shared utility, replacing the duplicated DataTypeToPythonString() in python_printer.cpp.
  • Add GetRuntimeDataTypeString() and GetExternalTensorName() as virtual methods on CodegenBase for subclass customization.
  • Extend GenerateExprString with support for FloorMod, comparison ops (Eq/Ne/Lt/Le/Gt/Ge), Neg, ConstFloat, ConstBool.
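
The kind of recursive expression printing GenerateExprString performs can be sketched like this. The node encoding is hypothetical, not the project's IR; only the emitted C++ operator spellings follow the description above:

```python
# Hedged sketch of recursive expression-to-string generation; nodes are
# (op, *args) tuples, a stand-in for the real IR expression classes.
def expr_to_string(node):
    op, args = node[0], node[1:]
    binops = {"add": "+", "floormod": "%", "eq": "==", "lt": "<"}
    if op == "const":
        # ConstBool prints as C++ true/false, other constants verbatim
        return str(args[0]).lower() if isinstance(args[0], bool) else str(args[0])
    if op == "neg":
        return f"(-{expr_to_string(args[0])})"
    if op in ("min", "max"):
        return f"std::{op}({expr_to_string(args[0])}, {expr_to_string(args[1])})"
    if op in binops:
        return f"({expr_to_string(args[0])} {binops[op]} {expr_to_string(args[1])})"
    raise ValueError(f"unsupported op: {op}")
```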

Type Stubs

  • Add missing scalar expression stubs: eq, ne, lt, le, gt, ge, neg, truediv, floordiv, mod, min_, max_.
  • Make span parameter optional (default = ...) across all scalar op stubs.

Tests

  • Update orchestration codegen tests to match new shape/dtype-based output format.
  • Add test_scalar_ops.py for IR-level scalar min/max node creation and type promotion.
  • Add test_scalar_dispatch.py for parser-level pl.min/pl.max scalar dispatch.

… codegen

- Add GetRuntimeDataTypeString for runtime-compatible dtype name mapping
- Refactor tensor operations to use shape/dtype-based APIs instead of raw strings
- Add codegen support for IfOp and ViewOp in orchestration layer
- Extract common codegen helpers (EmitTensorDecl, EmitFunctionCall, etc.)
- Add scalar min/max operations (pl.min, pl.max) for orchestration
- Update tests to cover new codegen patterns and scalar ops
@gemini-code-assist

Summary of Changes

Hello @zhusy54, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly.

This pull request significantly enhances the code generation capabilities by modernizing tensor API usage, introducing support for fundamental scalar operations, and enabling control flow and tensor view operations within the generated code. The changes streamline the orchestration codegen, making it more robust and expressive by moving to shape/dtype-based tensor handling and integrating new IR constructs like IfOp and ViewOp. This improves the flexibility and maintainability of the generated C++ code, allowing for more complex computations to be expressed and optimized.

Highlights

  • Orchestration Codegen Refactor: Refactored orchestration codegen to utilize shape/dtype-based tensor APIs (make_tensor, make_tensor_external), replacing size-based allocation. Eliminated outer/inner scope classification logic, with scope management now handled structurally by ForStmt and IfStmt visitors emitting PTO2_SCOPE(rt) blocks. Intermediate tensor declarations are now emitted inline at task submission, and the non_task_code_ stream has been consolidated into the main code_ stream for a unified output path.
  • Scalar Min/Max Operations: Added MakeMin and MakeMax factory functions in scalar_expr.h with automatic type promotion. These are exposed as ir.min_() and ir.max_() in Python bindings. The parser now dispatches pl.min(scalar, scalar) and pl.max(scalar, scalar) to scalar IR operations, while retaining tile reduction functionality. @overload type signatures were added for pl.min/pl.max, and codegen support for Min/Max expressions now emits std::min()/std::max().
  • New Codegen for IfOp and ViewOp: Implemented codegen for IfStmt, which emits if/else blocks wrapped with PTO2_SCOPE(rt), pre-declares return variables, and handles both branches. Codegen for tensor.view was added to emit input.view({shapes}, {offsets}) for tensor slicing operations.
  • Shared Utilities and Expression Support: Extracted DataTypeToString() into dtype.h as a shared utility. Added virtual methods GetRuntimeDataTypeString() and GetExternalTensorName() to CodegenBase for subclass customization. GenerateExprString was extended to support FloorMod, comparison operators (Eq, Ne, Lt, Le, Gt, Ge), Neg, ConstFloat, and ConstBool.
  • Type Stubs and Testing: Added missing scalar expression type stubs for various operators and made the span parameter optional across all scalar op stubs. Updated orchestration codegen tests to align with the new shape/dtype-based output format and added new test files (test_scalar_ops.py, test_scalar_dispatch.py) for scalar min/max IR node creation, type promotion, and parser dispatch.
Changelog
  • include/pypto/codegen/codegen_base.h
    • Added virtual methods GetExternalTensorName and GetRuntimeDataTypeString to CodegenBase.
  • include/pypto/core/dtype.h
    • Added DataTypeToString utility function for converting DataType to its string representation.
  • include/pypto/ir/scalar_expr.h
    • Added MakeMin and MakeMax factory functions for scalar minimum and maximum operations, including type promotion.
  • python/bindings/modules/ir.cpp
    • Exposed min_ and max_ functions to the Python ir module bindings.
  • python/pypto/language/op/block_ops.py
    • Imported overload from typing.
    • Added @overload decorators for max and min functions to support both Tile and Scalar arguments.
    • Modified max and min functions to accept Tile | Scalar for the first argument and int | Scalar for the second, enabling scalar dispatch.
  • python/pypto/language/parser/ast_parser.py
    • Added _SCALAR_BINARY_OPS dictionary to map unified op names to IR scalar expression functions for min and max.
    • Removed abs from _UNIFIED_OPS list.
    • Removed max and min from _BLOCK_OPS list.
    • Added abs to _UNIFIED_OPS list.
    • Modified _parse_unified_op to dispatch to _parse_scalar_op if the first argument is a ScalarType.
    • Added _parse_scalar_op method to handle parsing of scalar binary and unary operations.
  • python/pypto/pypto_core/ir.pyi
    • Made span parameter optional (default = ...) for add, sub, mul, pow, cast, bit_and, bit_or, bit_xor, bit_shift_left, bit_shift_right, and bit_not.
    • Added type stubs for truediv, floordiv, mod, eq, ne, lt, le, gt, ge, neg, min_, and max_ with optional span parameters.
  • src/codegen/codegen_base.cpp
    • Extended GenerateExprString to handle FloorMod, Min, Max, Eq, Ne, Lt, Le, Gt, Ge, ConstFloat, ConstBool, and Neg expressions.
    • Implemented GetRuntimeDataTypeString to return the fully qualified DataType enum name for C++ codegen.
  • src/codegen/orchestration/orchestration_codegen.cpp
    • Removed DataTypeToPTO2Enum comment, indicating usage of DataTypeToString.
    • Added GenerateMakeTensorExternal helper function to create make_tensor_external calls with shape array, ndim, and dtype.
    • Removed TaskRecord struct and related GetTaskRecords and GetNonTaskCode methods.
    • Added PTO2_SCOPE(rt) block wrapping for ForStmt bodies.
    • Implemented VisitStmt_ for IfStmt to generate if/else blocks with PTO2_SCOPE(rt) and pre-declare return variables.
    • Modified GetExternalTensorName to be const override.
    • Removed conditional tensor.create handling and non_task_code_ stream, consolidating all generated code into code_.
    • Removed task_input_tensors and task_output_tensors tracking from VisitStmt_ for Call operations.
    • Modified task submission logic to emit make_tensor declarations for intermediate output tensors inline.
    • Removed non_task_code_ member from OrchestrationStmtCodegen.
    • Removed task_records_ member from OrchestrationStmtCodegen.
    • Removed logic for identifying intermediate tensors and separating outer/inner tasks, simplifying the overall codegen flow.
    • Updated external tensor generation to use shape/dtype-based make_tensor_external calls.
  • src/codegen/tensor_op_codegen.cpp
    • Updated REGISTER_ORCHESTRATION_OP(tensor_create) to generate shape/dtype-based make_tensor calls instead of size-based.
    • Added REGISTER_ORCHESTRATION_OP(tensor_view) to generate input.view({shapes}, {offsets}) for tensor slicing.
  • src/ir/transforms/python_printer.cpp
    • Removed DataTypeToPythonString helper function.
    • Updated IRPythonPrinter::Print for ScalarType, TensorType, and TileType to use DataTypeToString from dtype.h.
    • Updated IRPythonPrinter::VisitExpr_ for Call and Cast operations to use DataTypeToString.
  • tests/ut/codegen/test_orchestration_codegen.py
    • Updated orch_basic test to assert shape/dtype-based make_tensor_external and make_tensor calls, and removed PTO2_SCOPE assertion for intermediate tensors.
    • Updated orch_vector test to reflect shape/dtype-based tensor creation and removal of explicit PTO2_SCOPE for intermediate tensors.
    • Updated test_tuple_mid to assert DataType::FLOAT32 instead of byte size and removed PTO2_SCOPE assertion.
    • Updated orch_four_tuple test to assert shape-based tensor creation and removed PTO2_SCOPE assertion.
    • Updated orch_create test to assert shape/dtype-based make_tensor call.
    • Updated orch_inplace test to assert shape/dtype-based make_tensor_external calls.
    • Added test_for_loop_with_view to verify codegen for for loops with dynamic bounds, tensor.view with dynamic offsets, and kernel calls inside loops.
    • Added test_if_statement to verify codegen for if/else statements with conditional scalar values and PTO2_SCOPE wrapping.
  • tests/ut/ir/expressions/test_scalar_ops.py
    • Added new test file to verify ir.min_ and ir.max_ creation, including type promotion.
  • tests/ut/language/parser/test_scalar_dispatch.py
    • Added new test file to verify that pl.min and pl.max correctly dispatch to scalar IR operations when called with scalar arguments, including literal values, while ensuring tile reduction still functions.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant and valuable refactoring to the orchestration codegen, switching to shape/dtype-based tensor APIs and simplifying scope management. The addition of scalar min/max operations and codegen for IfOp/ViewOp are also great enhancements. My review focuses on improving maintainability by reducing code duplication, fixing a potential runtime error in the parser, and ensuring the Python-level API for min/max is robust.

ir_func = getattr(ir, ir_func_name)
return ir_func(lhs, rhs, call_span)

if op_name in self._SCALAR_UNARY_OPS:

critical

The code checks if op_name in self._SCALAR_UNARY_OPS:, but the class attribute _SCALAR_UNARY_OPS does not appear to be defined in this file or diff. Because the lookup goes through self, this will raise an AttributeError at runtime if this code path is ever reached. You should define _SCALAR_UNARY_OPS alongside _SCALAR_BINARY_OPS.
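
A minimal sketch of the suggested fix, with hypothetical table entries standing in for the real ir.* constructors:

```python
# Hypothetical sketch of defining both dispatch tables side by side; the real
# tables map unified op names to ir scalar-expression constructors.
class ScalarOpParser:
    _SCALAR_BINARY_OPS = {"min": min, "max": max}   # stand-ins for ir.min_/ir.max_
    _SCALAR_UNARY_OPS = {"neg": lambda x: -x}       # stand-in for ir.neg

    def parse_scalar_op(self, op_name, *args):
        if op_name in self._SCALAR_BINARY_OPS:
            return self._SCALAR_BINARY_OPS[op_name](*args)
        # safe now that the table exists as a class attribute
        if op_name in self._SCALAR_UNARY_OPS:
            return self._SCALAR_UNARY_OPS[op_name](*args)
        raise ValueError(f"unsupported scalar op: {op_name}")
```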

Comment on lines +349 to +370
inline std::string DataTypeToString(const DataType& dtype) {
if (dtype == DataType::BOOL) return "BOOL";
if (dtype == DataType::INT4) return "INT4";
if (dtype == DataType::INT8) return "INT8";
if (dtype == DataType::INT16) return "INT16";
if (dtype == DataType::INT32) return "INT32";
if (dtype == DataType::INT64) return "INT64";
if (dtype == DataType::UINT4) return "UINT4";
if (dtype == DataType::UINT8) return "UINT8";
if (dtype == DataType::UINT16) return "UINT16";
if (dtype == DataType::UINT32) return "UINT32";
if (dtype == DataType::UINT64) return "UINT64";
if (dtype == DataType::FP4) return "FP4";
if (dtype == DataType::FP8E4M3FN) return "FP8E4M3FN";
if (dtype == DataType::FP8E5M2) return "FP8E5M2";
if (dtype == DataType::FP16) return "FP16";
if (dtype == DataType::FP32) return "FP32";
if (dtype == DataType::BF16) return "BFLOAT16";
if (dtype == DataType::HF4) return "HF4";
if (dtype == DataType::HF8) return "HF8";
return "UnknownType";
}

medium

The DataTypeToString function is implemented as a long chain of if statements. While correct, this can be slightly less efficient and harder to maintain than a switch statement on the underlying type code. Using dtype.Code() with a switch would be a more idiomatic and performant C++ pattern for this kind of dispatch.

inline std::string DataTypeToString(const DataType& dtype) {
  switch (dtype.Code()) {
    case DataType::kBoolCode: return "BOOL";
    case DataType::kInt4Code: return "INT4";
    case DataType::kInt8Code: return "INT8";
    case DataType::kInt16Code: return "INT16";
    case DataType::kInt32Code: return "INT32";
    case DataType::kInt64Code: return "INT64";
    case DataType::kUInt4Code: return "UINT4";
    case DataType::kUInt8Code: return "UINT8";
    case DataType::kUInt16Code: return "UINT16";
    case DataType::kUInt32Code: return "UINT32";
    case DataType::kUInt64Code: return "UINT64";
    case DataType::kFp4Code: return "FP4";
    case DataType::kFp8e4m3fnCode: return "FP8E4M3FN";
    case DataType::kFp8e5m2Code: return "FP8E5M2";
    case DataType::kFp16Code: return "FP16";
    case DataType::kFp32Code: return "FP32";
    case DataType::kBf16Code: return "BFLOAT16";
    case DataType::kHf4Code: return "HF4";
    case DataType::kHf8Code: return "HF8";
    default: return "UnknownType";
  }
}
References
  1. When mapping enums to strings, prefer using an explicit map (like a switch statement) over a long chain of if-else if statements for robustness and future extensibility.

Comment on lines +763 to 798
def max(tile: Tile | Scalar, axis: int | Scalar = 0, keepdim: bool = False) -> Tile | Scalar:
    """Max reduction along specified axis, or scalar max of two values.

    Args:
        tile: Input tile or first scalar operand
        axis: Reduction axis (for tiles) or second scalar operand
        keepdim: Whether to keep the reduced dimension as 1 (tiles only)

    Returns:
        Tile or Scalar wrapping the max operation
    """
    call_expr = _ir_ops.max(tile.unwrap(), axis, keepdim)  # type: ignore[reportArgumentType]
    return Tile(expr=call_expr)


@overload
def min(tile: Tile, axis: int, keepdim: bool = False) -> Tile: ...


@overload
def min(tile: Scalar, axis: Scalar | int, keepdim: bool = False) -> Scalar: ...


def min(tile: Tile | Scalar, axis: int | Scalar = 0, keepdim: bool = False) -> Tile | Scalar:
    """Min reduction along specified axis, or scalar min of two values.

    Args:
        tile: Input tile or first scalar operand
        axis: Reduction axis (for tiles) or second scalar operand
        keepdim: Whether to keep the reduced dimension as 1 (tiles only)

    Returns:
        Tile or Scalar wrapping the min operation
    """
    call_expr = _ir_ops.min(tile.unwrap(), axis, keepdim)  # type: ignore[reportArgumentType]
    return Tile(expr=call_expr)

medium

The implementations for max and min are overloaded to support both Tile and Scalar types, which is great for the DSL's usability. However, the current implementation only handles the Tile reduction case by calling _ir_ops.max (the block operation). If these functions are called directly with Scalar arguments (i.e., not inside a @pl.function that gets parsed), they will fail at runtime.

While the parser correctly dispatches to the scalar IR ops, making the Python implementation robust for direct calls would improve consistency and prevent unexpected errors for users who might use these functions outside a parsed context. A runtime type check on the tile argument could be used to dispatch to the correct IR operation (ir.max_ for scalars).
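
The proposed runtime dispatch can be sketched as below. The Tile/Scalar classes and reduction logic are hypothetical stand-ins for the DSL wrappers and ir ops, not the project's actual implementation:

```python
# Sketch of the runtime type check the review proposes: scalars route to the
# scalar op, tiles keep reduction semantics. All names are illustrative.
class Scalar:
    def __init__(self, value):
        self.value = value

class Tile:
    def __init__(self, rows):
        self.rows = rows

def pl_max(tile, axis=0, keepdim=False):
    if isinstance(tile, Scalar):
        # scalar path: `axis` carries the second operand (stand-in for ir.max_)
        other = axis.value if isinstance(axis, Scalar) else axis
        return Scalar(max(tile.value, other))
    # tile path: column-wise reduction, sketched for axis=0 (keepdim ignored)
    return [max(col) for col in zip(*tile.rows)]
```

With this shape, calling pl_max directly with Scalar arguments outside a parsed @pl.function no longer fails at runtime.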

raise InvalidOperationError(
f"Operation '{op_name}' is not supported for scalar arguments",
span=call_span,
hint="Supported scalar ops: min, max",

medium

The error hint for unsupported scalar operations is hardcoded to min, max. Since the logic also supports unary operations via _SCALAR_UNARY_OPS, the hint should be updated to dynamically include all supported scalar operations. This will provide more accurate feedback to the user.

Suggested change
hint="Supported scalar ops: min, max",
hint=f"Supported scalar ops: {', '.join(list(self._SCALAR_BINARY_OPS.keys()) + list(getattr(self, '_SCALAR_UNARY_OPS', {}).keys()))}",

Comment on lines +101 to +112
std::string CodegenBase::GetRuntimeDataTypeString(const DataType& dtype) const {
if (dtype == DataType::FP16) return "DataType::FLOAT16";
if (dtype == DataType::FP32) return "DataType::FLOAT32";
if (dtype == DataType::INT32) return "DataType::INT32";
if (dtype == DataType::INT16) return "DataType::INT16";
if (dtype == DataType::INT8) return "DataType::INT8";
if (dtype == DataType::UINT8) return "DataType::UINT8";
if (dtype == DataType::BF16) return "DataType::BFLOAT16";
if (dtype == DataType::INT64) return "DataType::INT64";
if (dtype == DataType::UINT64) return "DataType::UINT64";
return "DataType::UNKNOWN";
}

medium

The implementation of GetRuntimeDataTypeString duplicates logic from the new DataTypeToString function and is incomplete, as it doesn't cover all data types. This could lead to maintenance issues and bugs when new types are added.

To improve maintainability and correctness, this function should be refactored to leverage DataTypeToString for the general case, while only handling the special mappings where the runtime enum name differs from the canonical name (e.g., FP16 -> FLOAT16).

std::string CodegenBase::GetRuntimeDataTypeString(const DataType& dtype) const {
  // Handle special cases where runtime name differs from canonical enum name
  if (dtype == DataType::FP16) return "DataType::FLOAT16";
  if (dtype == DataType::FP32) return "DataType::FLOAT32";

  std::string s = DataTypeToString(dtype);
  if (s == "UnknownType") {
    return "DataType::UNKNOWN";
  }
  return "DataType::" + s;
}
References
  1. To improve maintainability and correctness in enum-to-string mapping, leverage existing explicit mapping functions and handle special cases, rather than duplicating logic, for robustness and future extensibility.

Comment on lines 529 to 553
auto saved = current_return_var_names_;
current_return_var_names_.clear();
for (const auto& rv : if_stmt->return_vars_) {
  current_return_var_names_.push_back(rv->name_);
}
VisitStmt(if_stmt->then_body_);
current_return_var_names_ = saved;

indent_ -= 4;
code_ << Indent() << "}\n";
indent_ -= 4;

if (if_stmt->else_body_.has_value()) {
  code_ << Indent() << "} else {\n";
  indent_ += 4;
  code_ << Indent() << "PTO2_SCOPE(rt) {\n";
  indent_ += 4;

  auto saved2 = current_return_var_names_;
  current_return_var_names_.clear();
  for (const auto& rv : if_stmt->return_vars_) {
    current_return_var_names_.push_back(rv->name_);
  }
  VisitStmt(*if_stmt->else_body_);
  current_return_var_names_ = saved2;

medium

The logic for setting up current_return_var_names_ before visiting the then and else bodies is duplicated. This code block could be extracted into a helper function or, for a more robust solution, a RAII class could be used to manage the saving and restoring of current_return_var_names_. This would make the code cleaner and less prone to errors if the logic needs to be updated in the future.
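
The save/restore pattern the comment describes is language-agnostic; here is a Python context-manager sketch of it (the C++ version would be an RAII guard class). The state layout is hypothetical:

```python
# Sketch of scoped save/restore for a mutable member, analogous to an RAII
# guard over current_return_var_names_; `state` is an illustrative stand-in.
from contextlib import contextmanager

@contextmanager
def scoped_return_vars(state, new_names):
    """Temporarily replace state['return_var_names'], restoring on exit."""
    saved = state["return_var_names"]
    state["return_var_names"] = list(new_names)
    try:
        yield
    finally:
        # restored even if visiting the branch body raises
        state["return_var_names"] = saved
```

Both the then and else branches could then share one guarded block instead of duplicating the save/clear/restore sequence.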

@zhusy54 zhusy54 changed the title feat(codegen): orchestration tensor APIs, scalar min/max, and if/view codegen [WIP] feat(codegen): orchestration tensor APIs, scalar min/max, and if/view codegen Feb 14, 2026

coderabbitai bot commented Feb 24, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

📝 Walkthrough

Adds min/max scalar operators to the IR layer, enhances parser to dispatch scalar operations, improves code generation infrastructure with virtual methods for naming and type handling, refactors orchestration code generation to track SSA base names, and introduces a paged attention example demonstrating orchestration workflow via PyPTO DSL.

Changes

Cohort / File(s) Summary
New Scalar Operations
include/pypto/ir/scalar_expr.h, python/bindings/modules/ir.cpp
Introduces MakeMin and MakeMax binary expression constructors with automatic operand type promotion.
Type Utilities
include/pypto/core/dtype.h, src/ir/transforms/python_printer.cpp
Adds DataTypeToString utility function and migrates python printer from custom DataTypeToPythonString helper to use the new utility directly.
Parser Scalar Dispatch
python/pypto/language/parser/ast_parser.py, python/pypto/language/op/block_ops.py
Extends parser with scalar-dispatch path for min/max operations; adds overloads in block_ops.py supporting both Tile and Scalar variants.
Type Stubs & Python Bindings
python/pypto/pypto_core/ir.pyi
Expands operator surface with min_, max_, comparison operators (eq, ne, lt, le, gt, ge), negation, and introduces span parameter defaults across arithmetic/bitwise operators.
Codegen Base Infrastructure
include/pypto/codegen/codegen_base.h, src/codegen/codegen_base.cpp
Adds virtual methods GetExternalTensorName and GetRuntimeDataTypeString; converts TryGetVarName and GenerateExprString from static to virtual instance methods; extends expression string generation for new operators (Min, Max, comparisons, etc.).
Orchestration Codegen Refactoring
src/codegen/orchestration/orchestration_codegen.cpp
Major refactor introducing SSA base-name resolution for variables, inout tensor tracking across tuple-returning calls, per-call tuple element mapping, and inline external tensor generation; replaces TaskRecord and scope-based handling with streamlined SSA-aware code emission.
Tensor Op Codegen
src/codegen/tensor_op_codegen.cpp
Updates tensor_create to use shape arrays and runtime DataType; refactors tensor_read index computation with linear indexing and optional intermediate variables; adds tensor_view operator for shape/offset-based tensor views.
CCE Backend & Codegen
src/backend/910B_CCE/backend_910b_cce_ops.cpp, src/codegen/cce/cce_codegen.cpp
Guards col_offset calculation against missing second offset in block load/store ops; changes ForStmt loop variable from int64_t to uint64_t.
Paged Attention Example
examples/ir_parser/paged_attention_example.py
New example demonstrating paged attention orchestration via PyPTO DSL with QK/PV matmul kernels, softmax preparation, online updates, and IR code generation pipeline.
Test Coverage
tests/ut/ir/expressions/test_scalar_ops.py, tests/ut/language/parser/test_scalar_dispatch.py, tests/ut/codegen/test_cce_codegen.py, tests/ut/codegen/test_orchestration_codegen.py
Adds unit tests for min/max scalar operations, scalar dispatch behavior in parser, updates CCE loop variable type expectations, and extensively refactors orchestration codegen test expectations for SSA-aware generation and shape-based tensor construction.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • Hzfengsy

Poem

🐰 New min and max are hopping in,
SSA names make code so clean,
Paged attention now takes flight,
With shapes and views shining bright,
The orchestration dance looks right!

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 warning

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 60.63%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2)

  • Title check ✅ Passed: The title '[WIP] feat(codegen): orchestration tensor APIs, scalar min/max, and if/view codegen' clearly summarizes the main changes in the PR.
  • Description check ✅ Passed: The description is comprehensive and directly related to the changeset, covering orchestration refactoring, scalar operations, new codegen support, and tests.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 17

🧹 Nitpick comments (1)
examples/ir_parser/paged_attention_example.py (1)

204-214: Inconsistent # type: ignore suppression across kernel call sites.

Line 210 explicitly suppresses type errors with # type: ignore[reportArgumentType], but lines 204 and 214 (which call kernel_qk_matmul and kernel_pv_matmul with mismatched shapes/types just the same) have no such annotation. Add matching suppression comments for consistency.

♻️ Proposed fix
-                    sij = self.kernel_qk_matmul(qi, kj, sij)
+                    sij = self.kernel_qk_matmul(qi, kj, sij)  # type: ignore[reportArgumentType]
-                    oi_tmp = self.kernel_pv_matmul(pij, vj, oi_tmp)
+                    oi_tmp = self.kernel_pv_matmul(pij, vj, oi_tmp)  # type: ignore[reportArgumentType]
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/ir_parser/paged_attention_example.py`:
- Around line 105-107: The kernel function signature includes unused parameters
mi, li, and oi_tmp which trigger ARG002; update the kernel (the function that
declares mi, li, oi_tmp in its signature) to either reference them with a short
explanatory inline comment or add a linter suppression (e.g., append “# noqa:
ARG002” to the parameter list or to the function definition line) to indicate
these are intentionally unused in this simplified example and silence the static
analyzer.
- Around line 112-113: The kernel signature for the paging kernel declares
parameters is_first and is_last as pl.Scalar[pl.INT64] but the orchestration
sets/passes them as pl.Scalar[pl.UINT64]; change the kernel parameter types to
pl.Scalar[pl.UINT64] to match the orchestration (or vice versa if you prefer
signed everywhere) so the types are consistent—update the kernel function
signature where is_first and is_last are declared to use pl.UINT64 and ensure
any related annotations/reference to those symbols (is_first, is_last, kernel
signature) reflect the same UINT64 type.
- Line 167: The variable block_num is assigned via pl.tensor.read(config, [5])
in paged_attention_example.py but never used (Ruff F841); remove the unused
assignment or replace its usage: either delete the line "block_num:
pl.Scalar[pl.UINT64] = pl.tensor.read(config, [5])" if it's vestigial, or use
the value where intended (e.g., pass block_num to the function that needs the
block index or incorporate it into the paging logic). Locate the assignment by
the symbol block_num and the call pl.tensor.read and either remove it or wire it
into the downstream code that expects the block number.
- Around line 132-144: The function kernel_init_inplace has a shape mismatch:
parameter oi is annotated as pl.Tensor[[16, 16], pl.FP32] but the function
returns oi as the first element with return type pl.Tensor[[16, 128], pl.FP32]
and the call site passes an oi of shape [q_tile, head_dim] (16,128). Fix this by
updating the oi parameter annotation in kernel_init_inplace to pl.Tensor[[16,
128], pl.FP32] so it matches the declared return type and the call site; verify
the other parameters (li, mi) and the return tuple remain unchanged.
- Around line 87-99: The local variable annotation for out in kernel_pv_matmul
is incorrect: change out's declared shape from pl.Tensor[[16, 128], pl.FP32] to
pl.Tensor[[16, 16], pl.FP32] so it matches the output parameter and the function
return type; ensure the pl.l0c_store call remains the same but that out's
annotation and any uses reflect the [16, 16] shape.

In `@include/pypto/core/dtype.h`:
- Around line 336-369: DataTypeToString currently maps DataType::BF16 to
"BFLOAT16", producing invalid symbols; change the BF16 branch in
DataTypeToString to return "BF16" instead of "BFLOAT16" and adjust the comment
examples that show how callers compose names (the Python printer and C++ codegen
examples) so they reflect the canonical enum-style "BF16" suffix; keep all other
mappings unchanged and ensure DataTypeToString still returns "UnknownType" for
unknown values.
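The fixed mapping can be sketched in Python (the member names and surrounding dtype set here are illustrative, not the actual DataType enum):

```python
def datatype_to_string(dtype: str) -> str:
    # Canonical enum-style suffixes used when composing symbol names;
    # BF16 must map to "BF16" -- "BFLOAT16" produced invalid symbols.
    names = {
        "FP16": "FP16",
        "FP32": "FP32",
        "BF16": "BF16",
        "INT64": "INT64",
        "UINT64": "UINT64",
    }
    return names.get(dtype, "UnknownType")
```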

In `@python/pypto/language/op/block_ops.py`:
- Around line 755-775: The implementations of max and min always call
_ir_ops.max/_ir_ops.min and return a Tile even when overloads expect Scalar
results; update both functions (max and min) to dispatch based on input types:
if tile is a Scalar, call ir.max_ / ir.min_ from pypto.pypto_core.ir (import
them) with the two scalar operands and return a Scalar; if tile is a Tile,
validate that axis is an int (raise TypeError if not), call
_ir_ops.max/_ir_ops.min with tile.unwrap(), axis, keepdim and return a Tile;
ensure imports for ir.max_/ir.min_ are added and no Tile is created for scalar
paths.
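The intended dispatch can be sketched as follows (Scalar and Tile here are minimal stand-ins, and the scalar branch stands in for ir.max_ / ir.min_):

```python
class Scalar:
    def __init__(self, value):
        self.value = value

class Tile:
    def __init__(self, values):
        self.values = values

    def unwrap(self):
        return self.values

def pl_max(x, other_or_axis, keepdim=False):
    if isinstance(x, Scalar):
        # Scalar path: route to the scalar IR op and return a Scalar --
        # no Tile is ever created here.
        return Scalar(max(x.value, other_or_axis.value))
    if not isinstance(other_or_axis, int):
        raise TypeError("axis must be an int for tile reduction")
    # Tile path: axis reduction (stands in for _ir_ops.max).
    return Tile([max(x.unwrap())])
```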

In `@python/pypto/language/parser/ast_parser.py`:
- Around line 1518-1551: In _parse_scalar_op, enforce exact arity and reject
keywords: for binary ops (self._SCALAR_BINARY_OPS) require exactly 2 positional
args and no keywords on the ast.Call (check len(call.args) == 2 and not
call.keywords), and for unary ops (self._SCALAR_UNARY_OPS) require exactly 1
positional arg and no keywords (len(call.args) == 1 and not call.keywords); when
the checks fail raise InvalidOperationError with a clear message and
span=call_span (same pattern used currently), then continue to parse via
self.parse_expression and call the IR constructor retrieved via getattr(ir,
ir_func_name).
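A Python sketch of the arity and keyword validation (the op sets and error type are illustrative, not the parser's actual names):

```python
SCALAR_BINARY_OPS = {"min_", "max_"}   # illustrative op sets
SCALAR_UNARY_OPS = {"abs_", "neg_"}

def check_scalar_op_call(func_name, num_positional, num_keywords):
    # Binary scalar ops take exactly 2 positional args, unary ops exactly 1;
    # keyword arguments are rejected in both cases.
    if func_name in SCALAR_BINARY_OPS:
        if num_positional != 2 or num_keywords:
            raise ValueError(f"{func_name} expects exactly 2 positional arguments and no keywords")
    elif func_name in SCALAR_UNARY_OPS:
        if num_positional != 1 or num_keywords:
            raise ValueError(f"{func_name} expects exactly 1 positional argument and no keywords")
```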

In `@src/backend/910B_CCE/backend_910b_cce_ops.cpp`:
- Around line 131-133: The access to offsets_tuple->elements_[0] is unguarded
and can crash for zero-length tuples; add the same size guard used for
col_offset before reading elements_[0] in all three functions
(MakeBlockStoreCodegenCCE, MakeBlockL0CStoreCodegenCCE and the third identical
block), e.g., CHECK(offsets_tuple->elements_.size() > 0) (or equivalent) and
only call codegen.GetExprAsCode(offsets_tuple->elements_[0]) after that check so
both row_offset and col_offset are protected for 0D/1D tuples.

In `@src/codegen/cce/cce_codegen.cpp`:
- Around line 452-453: The generated loop uses an unsigned uint64_t for the loop
index (see emitter_.EmitLine call that constructs the for with loop_var_name,
start, stop, step), which will wrap for negative/descending ranges; change the
emitted loop index type to a signed type (e.g., int64_t) and ensure comparisons
and increments use the same signed type and expressions (loop_var_name, start,
stop, step) so negative values behave correctly; alternatively, if you choose to
keep unsigned, add a precondition/assertion or runtime check before code
generation that start/stop/step are non-negative and refuse generation
otherwise—update any places that assume unsigned loop semantics to use the
signed index or the new precondition.
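The wrap hazard is easy to demonstrate by reinterpreting negative values as uint64, which is what the emitted loop index would hold:

```python
def as_u64(value: int) -> int:
    # Reinterpret a signed value as the uint64 the emitted loop index would see.
    return value & ((1 << 64) - 1)

# A descending range with step = -1 becomes a step of 2**64 - 1 under unsigned
# arithmetic, and a start of -1 wraps to a huge positive index, so a loop like
# `for (uint64_t i = start; i < stop; i += step)` never behaves as intended.
```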

In `@src/codegen/orchestration/orchestration_codegen.cpp`:
- Around line 44-73: GetSSABaseName is over-eagerly treating any trailing
"_<digits>" as SSA and will collapse real user names like "foo_1"; change the
heuristic to require an explicit SSA marker (e.g., "_ssa_" / "_iter_ssa_")
instead of plain underscores before digits: update GetSSABaseName to check for
"_iter_ssa_" and "_ssa_" (and only strip those markers plus trailing digits),
and update the other analogous spots called out in the review (the code blocks
at ~357-360, 491-495, 714-720, 879-883) so SSA generation/consumption uses the
new reserved marker consistently rather than stripping any "_<digits>" suffix.
Ensure callers that emit SSA names are changed to emit the reserved "_ssa_" form
so original user identifiers ending in "_<digits>" are preserved.
- Around line 420-435: The emitted shape array uses a zero-length declaration
when ndim==0, which is invalid C++; update GenerateMakeTensorExternal and the
inline intermediate tensor creation that emits "uint64_t <name>_shapes[ndim]" to
allocate at least one element: compute a shape storage length like shape_len =
std::max<size_t>(1, ndim) and declare "uint64_t <var_name>_shapes[shape_len]"
(or equivalent), fill only the first ndim entries (no-op when ndim==0), but
continue to pass the actual ndim to make_tensor_external (and any runtime call)
so the runtime still sees a 0-rank tensor; adjust uses of ndim in the
initializer/loop accordingly and reference GenerateMakeTensorExternal, the
generated "<var_name>_shapes" array, and the make_tensor_external call to locate
changes.
- Around line 671-675: The current dedup branch that returns when (op_name ==
"tensor.create" && declared_vars_.count(result_var)) silently drops subsequent
tensor.create allocations after SSA collapse; update the logic in
orchestration_codegen.cpp to handle repeated tensor.create for the same
result_var by either (A) emitting an assignment statement (e.g., result_var =
make_tensor(...)) instead of returning when declared_vars_ contains result_var,
or (B) enforce/transform to a unique allocation name before codegen; locate the
branch checking op_name == "tensor.create", declared_vars_, and result_var and
replace the early return with code that generates an assignment to result_var
(or performs a unique rename) so each create results in a proper allocation or
re-assignment.
- Around line 732-739: The scalar-arg path in the loop over call->args_
(TryGetVarName, As<ScalarType>, make_scalar_param) emits
make_scalar_param(var_name) unconditionally, which skips dtype-aware packing for
float scalars; update the branch to inspect the ScalarType's dtype and, for
float (and any other types that require special packing), emit
make_scalar_param(float_to_u64(var_name)) (or the appropriate packing function
for that dtype) while preserving the existing behavior for integer/bool scalars;
adjust the logic around As<ScalarType>(arg->GetType()) and the params push to
choose the packed expression based on scalar->dtype() so scalar variables are
packed identically to scalar constants.
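The marker-based heuristic suggested for GetSSABaseName might look like this in Python (the regex form is an assumption; the C++ version would do the equivalent string checks):

```python
import re

_SSA_SUFFIX = re.compile(r"_(?:iter_)?ssa_\d+$")

def get_ssa_base_name(name: str) -> str:
    # Strip only the reserved "_ssa_<n>" / "_iter_ssa_<n>" markers;
    # a plain "_<digits>" suffix like "foo_1" is a real user name and is kept.
    return _SSA_SUFFIX.sub("", name)
```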
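The dtype-aware scalar packing could likewise be sketched as string emission (the dtype names and double-width bit-cast are assumptions; the runtime's float_to_u64 may reinterpret narrower float bits):

```python
import struct

def float_to_u64(value: float) -> int:
    # Bit-cast a double into a uint64 parameter slot.
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    return bits

def make_scalar_param_expr(var_name: str, dtype: str) -> str:
    # Float scalars get the bit-pattern packing; integer/bool scalars pass
    # through unchanged, matching how scalar constants are already packed.
    if dtype in ("FP16", "FP32", "FP64", "BF16"):
        return f"make_scalar_param(float_to_u64({var_name}))"
    return f"make_scalar_param({var_name})"
```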

In `@src/codegen/tensor_op_codegen.cpp`:
- Around line 99-125: The generated index code must handle rank-0 (empty
indices) and validate indices vs shape length: before building idx_expr check if
indices.empty() and set idx_expr to "0" (or generate a direct scalar load) to
avoid emitting an empty assignment; also check indices.size() == shape.size()
(or emit a runtime/compile-time assertion or throw std::runtime_error) and fail
fast if they differ to avoid silently wrong linear index calculation. Update the
block that builds idx_oss / idx_expr (references: indices, shape,
codegen.GenerateExprString, idx_expr, result_var, ptr_expr, cpp_type) to perform
these validations and then proceed with the existing simple/complex-index
emission logic.
- Around line 55-70: The tensor.create code emits a zero-length C array and
tensor.read emits an empty index for rank-0 tensors; change the shapes array
allocation to use at least one element (e.g., size = std::max<size_t>(1, ndim))
and when ndim==0 emit a single dummy dimension value of 1 in the initializer for
result_var_shapes, and in tensor.read ensure the computed index (idx_var) is "0"
for ndim==0 instead of an empty expression; update the logic around
result_var_shapes, the make_tensor call, and the index computation (referencing
result_var, result_var_shapes, CalculateTensorSizeExpr and the idx_var/index
expression generation) to handle ndim==0 accordingly.
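Both fixes, the non-zero-length shapes array and the rank-0 index, can be sketched as emission helpers (the emitted layout and the row-major index scheme are illustrative assumptions):

```python
def emit_shapes_decl(result_var: str, shape: list) -> str:
    # Allocate at least one element so rank 0 never yields a zero-length
    # C array; the true ndim is still what gets passed to make_tensor.
    ndim = len(shape)
    storage = shape if ndim > 0 else [1]  # single dummy dimension for rank 0
    dims = ", ".join(str(d) for d in storage)
    return f"uint64_t {result_var}_shapes[{max(1, ndim)}] = {{{dims}}};"

def linear_index_expr(indices: list, shape: list) -> str:
    # Rank-0 reads use index "0"; otherwise the index count must match the rank.
    if not indices:
        return "0"
    if len(indices) != len(shape):
        raise ValueError("index count does not match tensor rank")
    expr = indices[0]
    for idx, dim in zip(indices[1:], shape[1:]):
        expr = f"({expr}) * {dim} + {idx}"
    return expr
```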

In `@tests/ut/codegen/test_orchestration_codegen.py`:
- Around line 930-935: The parameter `flag` in function kernel_process is unused
and triggers ARG002; rename it to `_flag` (or `_flag: pl.Scalar[pl.INT64]`) or
annotate/ignore it with a noqa to mark it intentional. Locate the kernel_process
definition and update the parameter name or add the noqa inline so the linter no
longer reports ARG002 while preserving the function signature semantics.

---

Nitpick comments:
In `@examples/ir_parser/paged_attention_example.py`:
- Around line 204-214: The calls to the low-level kernels are inconsistent in
type-ignore suppression: kernel_softmax_prepare already has "# type:
ignore[reportArgumentType]" but kernel_qk_matmul and kernel_pv_matmul do not,
causing linter/type-check noise; add the same "# type:
ignore[reportArgumentType]" suppression to the kernel_qk_matmul(...) and
kernel_pv_matmul(...) call sites so all three kernel invocations
(kernel_qk_matmul, kernel_softmax_prepare, kernel_pv_matmul) consistently
suppress the reported argument-type errors.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d1dc1fe and 2b9f923.

📒 Files selected for processing (18)
  • examples/ir_parser/paged_attention_example.py
  • include/pypto/codegen/codegen_base.h
  • include/pypto/core/dtype.h
  • include/pypto/ir/scalar_expr.h
  • python/bindings/modules/ir.cpp
  • python/pypto/language/op/block_ops.py
  • python/pypto/language/parser/ast_parser.py
  • python/pypto/pypto_core/ir.pyi
  • src/backend/910B_CCE/backend_910b_cce_ops.cpp
  • src/codegen/cce/cce_codegen.cpp
  • src/codegen/codegen_base.cpp
  • src/codegen/orchestration/orchestration_codegen.cpp
  • src/codegen/tensor_op_codegen.cpp
  • src/ir/transforms/python_printer.cpp
  • tests/ut/codegen/test_cce_codegen.py
  • tests/ut/codegen/test_orchestration_codegen.py
  • tests/ut/ir/expressions/test_scalar_ops.py
  • tests/ut/language/parser/test_scalar_dispatch.py

Comment on lines +87 to +99
@pl.function(type=pl.FunctionType.InCore)
def kernel_pv_matmul(
self,
pij: pl.Tensor[[16, 128], pl.BF16],
vj: pl.Tensor[[128, 128], pl.BF16],
output: pl.Tensor[[16, 16], pl.FP32],
) -> pl.Tensor[[16, 16], pl.FP32]:
"""PV matmul: oi_tmp = pij @ vj (CUBE)."""
p_tile: pl.Tile[[16, 16], pl.BF16] = pl.load(pij, [0, 0], [16, 16])
v_tile: pl.Tile[[16, 16], pl.BF16] = pl.load(vj, [0, 0], [16, 16])
result: pl.Tile[[16, 16], pl.FP32] = pl.matmul(p_tile, v_tile)
out: pl.Tensor[[16, 128], pl.FP32] = pl.l0c_store(result, [0, 0], [16, 16], output)
return out

⚠️ Potential issue | 🟡 Minor

out local annotation [16, 128] contradicts the output parameter and return type [16, 16].

Line 98 annotates out as pl.Tensor[[16, 128], pl.FP32], but both the output parameter (line 92) and the return type (line 93) are [16, 16]. The annotation should be [16, 16] to be consistent.

🐛 Proposed fix
-        out: pl.Tensor[[16, 128], pl.FP32] = pl.l0c_store(result, [0, 0], [16, 16], output)
+        out: pl.Tensor[[16, 16], pl.FP32] = pl.l0c_store(result, [0, 0], [16, 16], output)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/ir_parser/paged_attention_example.py` around lines 87 - 99, The
local variable annotation for out in kernel_pv_matmul is incorrect: change out's
declared shape from pl.Tensor[[16, 128], pl.FP32] to pl.Tensor[[16, 16],
pl.FP32] so it matches the output parameter and the function return type; ensure
the pl.l0c_store call remains the same but that out's annotation and any uses
reflect the [16, 16] shape.

Comment on lines +105 to +107
mi: pl.Tensor[[16], pl.FP32],
li: pl.Tensor[[16], pl.FP32],
oi_tmp: pl.Tensor[[16, 128], pl.FP32],

⚠️ Potential issue | 🟡 Minor

Unused kernel parameters mi, li, oi_tmp flagged by static analysis.

These parameters are accepted by the kernel but never referenced in the body (Ruff ARG002). Since the docstring notes that this is a simplified example body, consider adding a brief inline comment or suppression (e.g., # noqa: ARG002) to signal intent and silence the linter.

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 105-105: Unused method argument: mi

(ARG002)


[warning] 106-106: Unused method argument: li

(ARG002)


[warning] 107-107: Unused method argument: oi_tmp

(ARG002)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/ir_parser/paged_attention_example.py` around lines 105 - 107, The
kernel function signature includes unused parameters mi, li, and oi_tmp which
trigger ARG002; update the kernel (the function that declares mi, li, oi_tmp in
its signature) to either reference them with a short explanatory inline comment
or add a linter suppression (e.g., append “# noqa: ARG002” to the parameter list
or to the function definition line) to indicate these are intentionally unused
in this simplified example and silence the static analyzer.

Comment on lines +112 to +113
is_first: pl.Scalar[pl.INT64],
is_last: pl.Scalar[pl.INT64],

⚠️ Potential issue | 🟡 Minor

is_first/is_last scalar type mismatches between kernel signature (INT64) and orchestration usage (UINT64).

The kernel declares is_first and is_last as pl.Scalar[pl.INT64] (lines 112–113), but the orchestration assigns and passes them as pl.Scalar[pl.UINT64] (lines 218 and 222). Pick one type consistently; since pl.yield_(1) naturally produces an unsigned value here, prefer pl.UINT64 in the kernel signature.

🐛 Proposed fix
-        is_first: pl.Scalar[pl.INT64],
-        is_last: pl.Scalar[pl.INT64],
+        is_first: pl.Scalar[pl.UINT64],
+        is_last: pl.Scalar[pl.UINT64],
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 112-112: Unused method argument: is_first

(ARG002)


[warning] 113-113: Unused method argument: is_last

(ARG002)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/ir_parser/paged_attention_example.py` around lines 112 - 113, The
kernel signature for the paging kernel declares parameters is_first and is_last
as pl.Scalar[pl.INT64] but the orchestration sets/passes them as
pl.Scalar[pl.UINT64]; change the kernel parameter types to pl.Scalar[pl.UINT64]
to match the orchestration (or vice versa if you prefer signed everywhere) so
the types are consistent—update the kernel function signature where is_first and
is_last are declared to use pl.UINT64 and ensure any related
annotations/reference to those symbols (is_first, is_last, kernel signature)
reflect the same UINT64 type.

Comment on lines +132 to +144
@pl.function(type=pl.FunctionType.InCore)
def kernel_init_inplace(
self,
oi: pl.Tensor[[16, 16], pl.FP32],
li: pl.Tensor[[16], pl.FP32],
mi: pl.Tensor[[16], pl.FP32],
) -> tuple[
pl.Tensor[[16, 128], pl.FP32],
pl.Tensor[[16], pl.FP32],
pl.Tensor[[16], pl.FP32],
]:
"""Initialize inplace accumulators to zero (VECTOR)."""
return oi, li, mi

⚠️ Potential issue | 🟡 Minor

oi parameter shape conflicts with the return type annotation and call site.

oi is declared [16, 16] (line 135) but the first return type is [16, 128] (line 139). Since the body directly returns oi, these must match. The call site (line 185) also passes an oi created as [q_tile, head_dim] (i.e., [16, 128]), confirming the parameter annotation should be [16, 128].

🐛 Proposed fix
-        oi: pl.Tensor[[16, 16], pl.FP32],
+        oi: pl.Tensor[[16, 128], pl.FP32],
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/ir_parser/paged_attention_example.py` around lines 132 - 144, The
function kernel_init_inplace has a shape mismatch: parameter oi is annotated as
pl.Tensor[[16, 16], pl.FP32] but the function returns oi as the first element
with return type pl.Tensor[[16, 128], pl.FP32] and the call site passes an oi of
shape [q_tile, head_dim] (16,128). Fix this by updating the oi parameter
annotation in kernel_init_inplace to pl.Tensor[[16, 128], pl.FP32] so it matches
the declared return type and the call site; verify the other parameters (li, mi)
and the return tuple remain unchanged.

num_heads: pl.Scalar[pl.UINT64] = pl.tensor.read(config, [1])
head_dim: pl.Scalar[pl.UINT64] = pl.tensor.read(config, [3])
block_size: pl.Scalar[pl.UINT64] = pl.tensor.read(config, [4])
block_num: pl.Scalar[pl.UINT64] = pl.tensor.read(config, [5])

⚠️ Potential issue | 🟡 Minor

block_num is assigned but never used — Ruff F841.

🐛 Proposed fix
-        block_num: pl.Scalar[pl.UINT64] = pl.tensor.read(config, [5])
🧰 Tools
🪛 Ruff (0.15.2)

[error] 167-167: Local variable block_num is assigned to but never used

Remove assignment to unused variable block_num

(F841)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/ir_parser/paged_attention_example.py` at line 167, The variable
block_num is assigned via pl.tensor.read(config, [5]) in
paged_attention_example.py but never used (Ruff F841); remove the unused
assignment or replace its usage: either delete the line "block_num:
pl.Scalar[pl.UINT64] = pl.tensor.read(config, [5])" if it's vestigial, or use
the value where intended (e.g., pass block_num to the function that needs the
block index or incorporate it into the paging logic). Locate the assignment by
the symbol block_num and the call pl.tensor.read and either remove it or wire it
into the downstream code that expects the block number.

Comment on lines +671 to 675
// Dedup: skip if this tensor was already declared (SSA name collapse)
if (op_name == "tensor.create" && declared_vars_.count(result_var)) {
return;
}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file and the specific lines mentioned
wc -l src/codegen/orchestration/orchestration_codegen.cpp

Repository: hw-native-sys/pypto

Length of output: 118


🏁 Script executed:

# Read the context around lines 671-675
sed -n '650,700p' src/codegen/orchestration/orchestration_codegen.cpp | cat -n

Repository: hw-native-sys/pypto

Length of output: 2104


🏁 Script executed:

# Search for declared_vars_ to understand its purpose
rg "declared_vars_" src/codegen/orchestration/orchestration_codegen.cpp -B 2 -A 2

Repository: hw-native-sys/pypto

Length of output: 981


🏁 Script executed:

# Search for tensor.create to understand how it's used
rg "tensor\.create" src/codegen/orchestration/orchestration_codegen.cpp -B 3 -A 3

Repository: hw-native-sys/pypto

Length of output: 1480


🏁 Script executed:

# Search for SSA name collapse context
rg "SSA" src/codegen/orchestration/orchestration_codegen.cpp -B 3 -A 3

Repository: hw-native-sys/pypto

Length of output: 8456


🏁 Script executed:

# Look for how result_var is computed and passed to GenerateTensorOpCode
rg "GenerateTensorOpCode" src/codegen/orchestration/orchestration_codegen.cpp -B 5 -A 2

Repository: hw-native-sys/pypto

Length of output: 936


🏁 Script executed:

# Search for GetSSABaseName to understand name collapse
rg "GetSSABaseName" src/codegen/orchestration/orchestration_codegen.cpp -B 2 -A 2

Repository: hw-native-sys/pypto

Length of output: 5239


🏁 Script executed:

# Look at the VisitStmt for more context on tensor.create handling
sed -n '750,850p' src/codegen/orchestration/orchestration_codegen.cpp | cat -n

Repository: hw-native-sys/pypto

Length of output: 5250


🏁 Script executed:

# Look for tests related to tensor.create dedup
find . -type f -name "*.py" -o -name "*.txt" -o -name "*.md" | grep -E "(test|spec)" | head -20

Repository: hw-native-sys/pypto

Length of output: 785


🏁 Script executed:

# Search for any comments or documentation about the dedup logic
rg "tensor.create" --type=cpp -B 5 -A 5 | grep -A 10 -B 10 "dedup\|SSA"

Repository: hw-native-sys/pypto

Length of output: 2715


🏁 Script executed:

# Check if there are scenarios where the same base name gets assigned in multiple branches
rg "tensor\.create" --type=cpp --context=10 | head -100

Repository: hw-native-sys/pypto

Length of output: 7266


🏁 Script executed:

# Look at the VisitStmt for assignments to understand the flow better
sed -n '870,950p' src/codegen/orchestration/orchestration_codegen.cpp | cat -n

Repository: hw-native-sys/pypto

Length of output: 3784


🏁 Script executed:

# Let's look at the test file for orchestration codegen
cat tests/ut/codegen/test_orchestration_codegen.py | head -100

Repository: hw-native-sys/pypto

Length of output: 3997


🏁 Script executed:

# Check if there are scenarios with multiple tensor.create calls in tests
rg "tensor\.create" tests/ --type=py -B 3 -A 3

Repository: hw-native-sys/pypto

Length of output: 5790


🏁 Script executed:

# Let's look at the actual codegen for tensor.create to see what it generates
sed -n '1,100p' src/codegen/tensor_op_codegen.cpp | cat -n

Repository: hw-native-sys/pypto

Length of output: 4533


🏁 Script executed:

# Find the REGISTER_ORCHESTRATION_OP for tensor.create to see what it generates
rg "REGISTER_ORCHESTRATION_OP.*tensor_create" -A 30 src/codegen/tensor_op_codegen.cpp

Repository: hw-native-sys/pypto

Length of output: 1394


🏁 Script executed:

# Search for test cases with multiple tensor.create in a single function
rg "tensor.create" tests/ut/codegen/test_orchestration_codegen.py -A 20 | head -80

Repository: hw-native-sys/pypto

Length of output: 1879


🏁 Script executed:

# Check if there's any documentation about the dedup intent
rg "dedup\|SSA.*collapse" src/codegen/orchestration/ -B 3 -A 3

Repository: hw-native-sys/pypto

Length of output: 45


🏁 Script executed:

# Look for scenarios in loops or conditionals where the same base var might be reassigned
sed -n '730,900p' src/codegen/orchestration/orchestration_codegen.cpp | cat -n

Repository: hw-native-sys/pypto

Length of output: 8531


🏁 Script executed:

# Let's look for loop or conditional scenarios where the same base name might be reused
sed -n '500,700p' src/codegen/orchestration/orchestration_codegen.cpp | cat -n

Repository: hw-native-sys/pypto

Length of output: 8376


🏁 Script executed:

# Check the full VisitStmt method
sed -n '860,950p' src/codegen/orchestration/orchestration_codegen.cpp | cat -n

Repository: hw-native-sys/pypto

Length of output: 4276


🏁 Script executed:

# Let's check if there are any examples or comments about loop-based tensor reallocation
rg "tensor.create" tests/ --type=py -B 5 -A 5 | grep -A 10 -B 10 "loop\|for\|while" | head -60

Repository: hw-native-sys/pypto

Length of output: 4667


🏁 Script executed:

# Check if the comment mentions anything about the intent
sed -n '600,650p' src/codegen/orchestration/orchestration_codegen.cpp | cat -n

Repository: hw-native-sys/pypto

Length of output: 2051


🏁 Script executed:

# Let's check if there are tests with tensor.create in loops to understand the expected behavior
rg "tensor\.create" tests/ut/codegen/ --type=py -A 30 | grep -B 5 -A 20 "for\|ForStmt"

Repository: hw-native-sys/pypto

Length of output: 4425


🏁 Script executed:

# Check if declared_vars_ is scoped to control flow regions or global
rg "declared_vars_" src/codegen/orchestration/orchestration_codegen.cpp -B 2 -A 2

Repository: hw-native-sys/pypto

Length of output: 981


Don't silently skip tensor.create when the same base name appears multiple times after SSA collapse.

In loops and conditionals, multiple tensor.create operations can map to the same base name after SSA name collapse (e.g., mi_0 and mi_1 both collapse to mi). The current dedup check skips codegen for all but the first occurrence, causing subsequent allocations to be dropped and the old tensor to be silently reused. Either emit an assignment (mi = make_tensor(...)) for each create, or enforce that result variable names are unique per allocation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/codegen/orchestration/orchestration_codegen.cpp` around lines 671 - 675,
The current dedup branch that returns when (op_name == "tensor.create" &&
declared_vars_.count(result_var)) silently drops subsequent tensor.create
allocations after SSA collapse; update the logic in orchestration_codegen.cpp to
handle repeated tensor.create for the same result_var by either (A) emitting an
assignment statement (e.g., result_var = make_tensor(...)) instead of returning
when declared_vars_ contains result_var, or (B) enforce/transform to a unique
allocation name before codegen; locate the branch checking op_name ==
"tensor.create", declared_vars_, and result_var and replace the early return
with code that generates an assignment to result_var (or performs a unique
rename) so each create results in a proper allocation or re-assignment.
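Option (A), declare on first use and assign thereafter, can be sketched like this (the PTOTensor type name is hypothetical):

```python
def emit_tensor_create(result_var: str, declared_vars: set, make_expr: str) -> str:
    # First occurrence declares the tensor; later occurrences (after SSA
    # name collapse) re-assign instead of being silently dropped, so every
    # tensor.create still produces a fresh allocation.
    if result_var in declared_vars:
        return f"{result_var} = {make_expr};"
    declared_vars.add(result_var)
    return f"PTOTensor {result_var} = {make_expr};"
```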

Comment on lines 732 to +739
// Input args
for (const auto& arg : call->args_) {
std::string var_name = TryGetVarName(arg);
if (!var_name.empty()) {
// Check if this is a scalar variable (not a tensor) -> make_scalar_param
if (As<ScalarType>(arg->GetType())) {
params.push_back({"make_scalar_param", var_name});
continue;

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain (condensed): scripted checks inspected src/codegen/orchestration/orchestration_codegen.cpp around lines 600-900, confirmed that float_to_u64() exists and is applied to float constants, traced TryGetVarName() and the As&lt;ScalarType&gt; branch that emits make_scalar_param, searched for PTOParam and make_scalar_param definitions across the repo and headers, and reviewed tests/ut/codegen/test_orchestration_codegen.py (including test_if_statement) for scalar-parameter coverage. No dtype-aware packing was found on the scalar-variable path.
Scalar variables need dtype-aware packing for make_scalar_param.

The scalar variable path at lines 732-739 emits make_scalar_param(var_name) directly, creating an inconsistency with how scalar constants are handled. For float constants, the code correctly uses float_to_u64() conversion, but scalar float variables bypass this. If var_name refers to a float type variable, it should be packed identically to float constants.

Suggested fix
-        if (As<ScalarType>(arg->GetType())) {
-          params.push_back({"make_scalar_param", var_name});
+        if (auto scalar_type = As<ScalarType>(arg->GetType())) {
+          std::string cpp_type = scalar_type->dtype_.ToCTypeString();
+          if (cpp_type == "float") {
+            params.push_back({"make_scalar_param", "float_to_u64(" + var_name + ")"});
+          } else {
+            params.push_back({"make_scalar_param", var_name});
+          }
           continue;
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/codegen/orchestration/orchestration_codegen.cpp` around lines 732 - 739,
The scalar-arg path in the loop over call->args_ (TryGetVarName, As<ScalarType>,
make_scalar_param) emits make_scalar_param(var_name) unconditionally, which
skips dtype-aware packing for float scalars; update the branch to inspect the
ScalarType's dtype and, for float (and any other types that require special
packing), emit make_scalar_param(float_to_u64(var_name)) (or the appropriate
packing function for that dtype) while preserving the existing behavior for
integer/bool scalars; adjust the logic around As<ScalarType>(arg->GetType()) and
the params push to choose the packed expression based on scalar->dtype() so
scalar variables are packed identically to scalar constants.
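The bit-level packing the fix relies on can be illustrated outside the codebase. This is a minimal Python sketch of what a helper like `float_to_u64` plausibly does (the name comes from the review; the exact semantics in the repo are an assumption): the 32-bit IEEE-754 bit pattern is reinterpreted as an integer and zero-extended to 64 bits, with no numeric conversion.

```python
import struct

def float_to_u64(value: float) -> int:
    # Reinterpret the 32-bit float's bit pattern as an unsigned integer,
    # zero-extended to 64 bits -- the value is NOT numerically converted.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return bits

def u64_to_float(bits: int) -> float:
    # Inverse: recover the float from the low 32 bits of the word.
    (value,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))
    return value

# 1.0f has the IEEE-754 single-precision bit pattern 0x3F800000.
assert float_to_u64(1.0) == 0x3F800000
assert u64_to_float(float_to_u64(2.5)) == 2.5
```

If the runtime unpacks scalar parameters this way, passing a raw float variable (implicitly converted to an integer) instead of its bit pattern would silently corrupt the value, which is why variables must go through the same packing as constants.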

Comment on lines +55 to +70
```diff
   // tensor.create -> uint64_t var_shapes[N] = {...}; Tensor var = make_tensor(var_shapes, N, DataType::XX);
   auto result_type = As<TensorType>(op->GetType());
   CHECK(result_type) << "tensor.create must return TensorType";

   std::string result_var = codegen.GetCurrentResultTarget();
-  std::string size_expr = CalculateTensorSizeExpr(result_type, codegen);
+  size_t ndim = result_type->shape_.size();

   std::ostringstream oss;
-  oss << "Tensor " << result_var << " = make_tensor(" << size_expr << ");";
+  oss << "uint64_t " << result_var << "_shapes[" << ndim << "] = {";
+  for (size_t i = 0; i < ndim; ++i) {
+    if (i > 0) oss << ", ";
+    oss << codegen.GenerateExprString(result_type->shape_[i]);
+  }
+  oss << "};\n";
+  oss << "Tensor " << result_var << " = make_tensor(" << result_var << "_shapes, " << ndim << ", "
+      << codegen.GetRuntimeDataTypeString(result_type->dtype_) << ");";
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain (condensed): scripted checks located src/codegen/tensor_op_codegen.cpp, read the tensor.create and tensor.read emitters (roughly lines 26-126) including CalculateTensorSizeExpr, searched include/ and the rest of the repo for make_tensor declarations and for any existing rank-0/zero-length-array handling (none found), checked the C++ standard configured in CMakeLists.txt, and confirmed rank-0 tensor coverage (test_scalar_shape_dimensions) in tests/ut/codegen/test_orchestration_codegen.py.

Fix zero‑rank tensor handling in tensor.create and tensor.read code generation.

The current implementation emits invalid C++ for rank‑0 tensors. When shape_ is empty, line 63 generates uint64_t <name>_shapes[0] = {}, which violates C++17 (zero‑length arrays are non‑standard). Additionally, tensor.read at lines 99‑125 produces an empty index expression size_t idx_var = ; for rank‑0 tensors, causing a syntax error.

Allocate a minimum of 1 element for the shapes array and provide a dummy value when the shape is empty, following the pattern used in CalculateTensorSizeExpr (lines 41‑44). The codebase has test coverage for rank‑0 tensors (test_scalar_shape_dimensions), so this must be supported.

🔧 Suggested fix for tensor.create
-  size_t ndim = result_type->shape_.size();
+  size_t ndim = result_type->shape_.size();
+  size_t shape_len = (ndim == 0) ? 1 : ndim;

   std::ostringstream oss;
-  oss << "uint64_t " << result_var << "_shapes[" << ndim << "] = {";
-  for (size_t i = 0; i < ndim; ++i) {
-    if (i > 0) oss << ", ";
-    oss << codegen.GenerateExprString(result_type->shape_[i]);
-  }
+  oss << "uint64_t " << result_var << "_shapes[" << shape_len << "] = {";
+  if (ndim == 0) {
+    oss << "1";
+  } else {
+    for (size_t i = 0; i < ndim; ++i) {
+      if (i > 0) oss << ", ";
+      oss << codegen.GenerateExprString(result_type->shape_[i]);
+    }
+  }
   oss << "};\n";
   oss << "Tensor " << result_var << " = make_tensor(" << result_var << "_shapes, " << ndim << ", "
       << codegen.GetRuntimeDataTypeString(result_type->dtype_) << ");";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/codegen/tensor_op_codegen.cpp` around lines 55 - 70, The tensor.create
code emits a zero-length C array and tensor.read emits an empty index for rank-0
tensors; change the shapes array allocation to use at least one element (e.g.,
size = std::max<size_t>(1, ndim)) and when ndim==0 emit a single dummy dimension
value of 1 in the initializer for result_var_shapes, and in tensor.read ensure
the computed index (idx_var) is "0" for ndim==0 instead of an empty expression;
update the logic around result_var_shapes, the make_tensor call, and the index
computation (referencing result_var, result_var_shapes, CalculateTensorSizeExpr
and the idx_var/index expression generation) to handle ndim==0 accordingly.
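The shape-array sizing rule the fix proposes can be sketched as a tiny string generator. The helper name and exact emitted text are illustrative, not the repo's emitter; the point is that a rank-0 tensor still gets a one-element C array holding a dummy dimension of 1 (since C++17 forbids zero-length arrays), while the `ndim` passed to `make_tensor` stays 0 so the runtime sees the true rank.

```python
def emit_shapes_array(name, shape_exprs):
    # C++17 forbids zero-length arrays, so allocate at least one element;
    # a rank-0 tensor gets a single dummy dimension of 1.
    ndim = len(shape_exprs)
    shape_len = max(1, ndim)
    init = ", ".join(shape_exprs) if ndim else "1"
    return f"uint64_t {name}_shapes[{shape_len}] = {{{init}}};"

assert emit_shapes_array("t0", []) == "uint64_t t0_shapes[1] = {1};"
assert emit_shapes_array("t1", ["16", "16"]) == "uint64_t t1_shapes[2] = {16, 16};"
```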

Comment on lines +99 to +125
```diff
   // Build linear index expression
+  std::ostringstream idx_oss;
   for (size_t i = 0; i < indices.size(); ++i) {
-    if (i > 0) oss << " + ";
-    oss << codegen.GenerateExprString(indices[i]);
+    if (i > 0) idx_oss << " + ";
+    idx_oss << codegen.GenerateExprString(indices[i]);
     for (size_t j = i + 1; j < shape.size(); ++j) {
-      oss << " * " << codegen.GenerateExprString(shape[j]);
+      idx_oss << " * " << codegen.GenerateExprString(shape[j]);
     }
   }
-  oss << ";\n";
-  oss << cpp_type << " " << result_var << " = static_cast<" << cpp_type << "*>(" << ptr_expr << ")[idx_"
-      << result_var << "];";
+  std::string idx_expr = idx_oss.str();
+
+  // Check if the index expression is a simple constant (all digits)
+  bool is_simple = !idx_expr.empty() && std::all_of(idx_expr.begin(), idx_expr.end(), ::isdigit);
+
+  std::ostringstream oss;
+  if (is_simple) {
+    // Inline constant index directly
+    oss << cpp_type << " " << result_var << " = static_cast<" << cpp_type << "*>(" << ptr_expr << ")["
+        << idx_expr << "];";
+  } else {
+    // Use intermediate variable for complex index expressions
+    oss << "size_t idx_" << result_var << " = " << idx_expr << ";\n";
+    oss << cpp_type << " " << result_var << " = static_cast<" << cpp_type << "*>(" << ptr_expr << ")[idx_"
+        << result_var << "];";
+  }

   return oss.str();
```

⚠️ Potential issue | 🟠 Major

Handle empty index tuples and validate rank.
If indices is empty (rank‑0 tensor), idx_expr becomes empty and the generated size_t idx_... = ; is invalid. Also, mismatched indices.size() vs shape.size() will silently compute a wrong linear index. Please add validation and a scalar fallback.

🔧 Suggested fix
   const auto& indices = indices_tuple->elements_;
   const auto& shape = input_type->shape_;
+  CHECK(indices.size() == shape.size())
+      << "tensor.read indices size must match tensor rank";

-  // Build linear index expression
-  std::ostringstream idx_oss;
-  for (size_t i = 0; i < indices.size(); ++i) {
-    if (i > 0) idx_oss << " + ";
-    idx_oss << codegen.GenerateExprString(indices[i]);
-    for (size_t j = i + 1; j < shape.size(); ++j) {
-      idx_oss << " * " << codegen.GenerateExprString(shape[j]);
-    }
-  }
-  std::string idx_expr = idx_oss.str();
+  std::string idx_expr;
+  if (indices.empty()) {
+    idx_expr = "0";
+  } else {
+    std::ostringstream idx_oss;
+    for (size_t i = 0; i < indices.size(); ++i) {
+      if (i > 0) idx_oss << " + ";
+      idx_oss << codegen.GenerateExprString(indices[i]);
+      for (size_t j = i + 1; j < shape.size(); ++j) {
+        idx_oss << " * " << codegen.GenerateExprString(shape[j]);
+      }
+    }
+    idx_expr = idx_oss.str();
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/codegen/tensor_op_codegen.cpp` around lines 99 - 125, The generated index
code must handle rank-0 (empty indices) and validate indices vs shape length:
before building idx_expr check if indices.empty() and set idx_expr to "0" (or
generate a direct scalar load) to avoid emitting an empty assignment; also check
indices.size() == shape.size() (or emit a runtime/compile-time assertion or
throw std::runtime_error) and fail fast if they differ to avoid silently wrong
linear index calculation. Update the block that builds idx_oss / idx_expr
(references: indices, shape, codegen.GenerateExprString, idx_expr, result_var,
ptr_expr, cpp_type) to perform these validations and then proceed with the
existing simple/complex-index emission logic.
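The row-major index arithmetic this codegen emits, together with the empty-indices fallback and rank validation the fix asks for, can be modeled directly. This is a sketch of the arithmetic, not the generator itself: stride for axis i is the product of all later extents, and a rank-0 tensor reads its single element at offset 0.

```python
def linear_index(indices, shape):
    # Validate rank up front instead of silently computing a wrong offset.
    if len(indices) != len(shape):
        raise ValueError("indices size must match tensor rank")
    if not indices:
        return 0  # rank-0 tensor: single element at offset 0
    idx = 0
    for i, index in enumerate(indices):
        stride = 1
        for extent in shape[i + 1:]:
            stride *= extent  # row-major: stride = product of trailing extents
        idx += index * stride
    return idx

assert linear_index([], []) == 0
assert linear_index([2, 3], [4, 8]) == 2 * 8 + 3  # row-major offset 19
```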

Comment on lines +930 to +935
def kernel_process(
self,
a: pl.Tensor[[16, 16], pl.FP32],
flag: pl.Scalar[pl.INT64],
output: pl.Tensor[[16, 16], pl.FP32],
) -> pl.Tensor[[16, 16], pl.FP32]:

⚠️ Potential issue | 🟡 Minor

Avoid unused parameter in kernel_process.
flag is unused and will trigger ARG002. Consider prefixing with _ or adding a noqa.

🔧 Suggested fix
-            def kernel_process(
-                self,
-                a: pl.Tensor[[16, 16], pl.FP32],
-                flag: pl.Scalar[pl.INT64],
-                output: pl.Tensor[[16, 16], pl.FP32],
-            ) -> pl.Tensor[[16, 16], pl.FP32]:
+            def kernel_process(
+                self,
+                a: pl.Tensor[[16, 16], pl.FP32],
+                _flag: pl.Scalar[pl.INT64],
+                output: pl.Tensor[[16, 16], pl.FP32],
+            ) -> pl.Tensor[[16, 16], pl.FP32]:
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 933-933: Unused method argument: flag

(ARG002)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/ut/codegen/test_orchestration_codegen.py` around lines 930 - 935, The
parameter `flag` in function kernel_process is unused and triggers ARG002;
rename it to `_flag` (or `_flag: pl.Scalar[pl.INT64]`) or annotate/ignore it
with a noqa to mark it intentional. Locate the kernel_process definition and
update the parameter name or add the noqa inline so the linter no longer reports
ARG002 while preserving the function signature semantics.

…n codegen

Multiple tuple-returning kernel calls (e.g. kernel_init_inplace,
kernel_softmax_prepare, kernel_online_update) shared each other's
output params due to GetSSABaseName() collapsing all _tuple_tmp_N
names to _tuple_tmp. Use per-call unique keys with Call* pointer
coordination to isolate each call's tuple elements.
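The collision described in the commit message can be reproduced in miniature: keying tuple temporaries by their SSA base name collapses distinct calls into one map entry, while adding a per-call discriminator (here Python's `id()` stands in for the `Call*` pointer) keeps them apart. All names below are illustrative, not the repo's actual data structures.

```python
import re

def ssa_base_name(name):
    # "_tuple_tmp_3" -> "_tuple_tmp": strips the SSA suffix, which is
    # exactly what made distinct tuple-returning calls share one slot.
    return re.sub(r"_\d+$", "", name)

class Call:
    def __init__(self, result):
        self.result = result

calls = [Call("_tuple_tmp_0"), Call("_tuple_tmp_1")]

# Buggy keying: both calls collapse onto the same entry.
buggy = {ssa_base_name(c.result): c for c in calls}
assert len(buggy) == 1

# Fixed keying: base name plus per-call object identity stays unique.
fixed = {(ssa_base_name(c.result), id(c)): c for c in calls}
assert len(fixed) == 2
```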
