test(st): Add Paged Attention system tests for four PTO kernels (#257)

Merged

lyfne123 merged 1 commit into hw-native-sys:main from Crystal-wzy:main
Feb 25, 2026

Conversation

@Crystal-wzy (Contributor)

Add system-level tests for the Paged Attention algorithm using the PTOTestCase framework, covering four kernels:

  • QKMatmulTestCase: sij = qi @ kj_t via L1 -> L0A/L0B -> L0C memory flow
  • SoftmaxPrepareTestCase: scale -> row_max -> row_expand_sub -> exp ->
    row_sum pipeline; scale passed via FP32 config tensor
  • PVMatmulTestCase: oi_new = pij @ vj with the same L0C path
  • OnlineUpdateTestCase: online normalizer update with is_first/is_last flags covering all four runtime branches; PyTorch golden validation
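The online normalizer update in the last bullet follows the standard online-softmax (flash-attention style) recurrence. As a sketch of the golden-reference math only: the function name and NumPy formulation below are illustrative, not the PR's actual PyTorch code.

```python
import numpy as np

def online_update(mi, li, oi, mij, lij, oi_new, is_first, is_last):
    """One online-softmax update step (illustrative, not the PR's code).

    mi, li, oi: running row-max, row-sum, and unnormalized output.
    mij, lij, oi_new: the current block's row-max, row-sum, and pij @ vj.
    """
    if is_first:
        # First block: just copy the block statistics.
        mi, li, oi = mij, lij, oi_new
    else:
        m_new = np.maximum(mi, mij)
        alpha = np.exp(mi - m_new)    # rescale the running accumulators
        beta = np.exp(mij - m_new)    # rescale the current block
        li = alpha * li + beta * lij
        oi = alpha[:, None] * oi + beta[:, None] * oi_new
        mi = m_new
    # Normalize only on the last block.
    dst = oi / li[:, None] if is_last else None
    return mi, li, oi, dst
```

Feeding split score blocks through the four (is_first, is_last) combinations reproduces the full softmax-attention output, which is what a PyTorch golden validation would compare against.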

@gemini-code-assist (Contributor)

Summary of Changes

Hello @Crystal-wzy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a suite of system-level tests for the Paged Attention algorithm's core kernels. These tests leverage the PTOTestCase framework to ensure the correctness and expected memory behavior of the QK Matmul, Softmax Prepare, PV Matmul, and Online Update operations, crucial components for efficient attention mechanisms.

Highlights

  • QK Matmul Kernel Test: Added a system test for the QK Matmul kernel, which computes sij = qi @ kj_t, verifying its memory flow from GM to L1, L0A/L0B, L0C, and back to GM.
  • Softmax Prepare Kernel Test: Implemented a system test for the Softmax Prepare kernel, covering the pipeline of scaling attention scores, computing row maximums, applying the exponential, and summing rows, with the scale factor passed via an FP32 config tensor.
  • PV Matmul Kernel Test: Introduced a system test for the PV Matmul kernel, which calculates oi_new = pij @ vj, ensuring correct memory handling through L1, L0A/L0B, L0C, and GM.
  • Online Update Kernel Test: Provided a comprehensive system test for the Online Update kernel, validating its behavior across all four combinations of is_first and is_last flags, including initial copying, full online updates, and normalization, with PyTorch golden validation.
Changelog
  • tests/st/codegen/test_paged_attention.py
    • Added new system tests for Paged Attention kernels, including QK Matmul, Softmax Prepare, PV Matmul, and Online Update, using the PTOTestCase framework.

@coderabbitai

coderabbitai bot commented Feb 25, 2026

📝 Walkthrough

A new test module for Paged Attention kernels is introduced with four PTOTestCase-based test classes validating QK matmul, softmax preparation, PV matmul, and online update kernels. The TestPagedAttentionKernels class orchestrates parameterized tests with configuration-driven path selection and result verification.

Changes

Cohort: Paged Attention Test Suite
File(s): tests/st/codegen/test_paged_attention.py
Five test case classes added: QKMatmulTestCase, SoftmaxPrepareTestCase, PVMatmulTestCase, OnlineUpdateTestCase, and TestPagedAttentionKernels orchestrator. Each implements tensor specifications, kernel program generation, and expected result computation with configuration-driven parameters (num_heads, head_dim). Covers single-block and multi-block edge cases across all four kernel paths.
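The described test-case shape (tensor specifications, program generation, expected-result computation) follows a common harness pattern. Since PTOTestCase internals are not shown in this PR, the base class and `run` wiring below are a hypothetical, framework-agnostic sketch using NumPy:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TensorSpec:
    name: str
    shape: list
    is_output: bool = False

class KernelTestCase:
    """Hypothetical sketch of the define/run/verify cycle."""

    def define_tensors(self):
        raise NotImplementedError

    def get_program(self):
        raise NotImplementedError

    def compute_expected(self, inputs):
        raise NotImplementedError

    def run(self, rng=None):
        rng = rng or np.random.default_rng(0)
        specs = self.define_tensors()
        # Randomize inputs; outputs are produced by the program.
        inputs = {s.name: rng.standard_normal(s.shape).astype(np.float32)
                  for s in specs if not s.is_output}
        actual = self.get_program()(inputs)
        for name, ref in self.compute_expected(inputs).items():
            np.testing.assert_allclose(actual[name], ref, rtol=1e-5, atol=1e-5)
        return actual

class QKMatmulCase(KernelTestCase):
    def __init__(self, num_heads=16, head_dim=16):
        self.num_heads, self.head_dim = num_heads, head_dim

    def define_tensors(self):
        return [TensorSpec("qi", [self.num_heads, self.head_dim]),
                TensorSpec("kj_t", [self.head_dim, self.num_heads]),
                TensorSpec("sij", [self.num_heads, self.num_heads], is_output=True)]

    def get_program(self):
        # Stand-in for the compiled kernel: sij = qi @ kj_t.
        return lambda t: {"sij": t["qi"] @ t["kj_t"]}

    def compute_expected(self, t):
        return {"sij": t["qi"] @ t["kj_t"]}
```

In the real suite the program is a generated PTO kernel rather than a lambda, but the verification flow is the same.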

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested reviewers

  • Hzfengsy

Poem

🐰 Tests bloom like carrots in the attention garden,
QK multiplies, softmax prepares the way,
PV matmuls dance, online updates stay true,
Each kernel validated, no bugs get pardon,
Paged attention kernels in test-driven array! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 3.13%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title clearly and specifically describes the main change: adding system tests for Paged Attention kernels using the PTO framework.
  • Description check ✅ Passed: The description is directly related to the changeset, detailing the four test cases and their implementation specifics.


@gemini-code-assist bot left a comment

Code Review

This pull request adds a comprehensive set of system tests for four Paged Attention kernels using the PTOTestCase framework. The tests cover QK matmul, softmax preparation, PV matmul, and the online update logic. My review found a critical bug in the SoftmaxPrepareTestCase where a tensor type mismatch would lead to incorrect test results. I've also suggested improvements for maintainability by parameterizing the test programs instead of using hardcoded dimensions, and for improving type safety in the OnlineUpdateTestCase by resolving a type mismatch for boolean flags.

def orchestrator(
    self,
    sij: pl.Tensor[[16, 16], pl.FP32],
    config: pl.Tensor[[1], pl.INT32],

critical

The config tensor is defined with DataType.FP32 but the orchestrator function signature types it as pl.Tensor[[1], pl.INT32]. This type mismatch will cause pl.tensor.read to misinterpret the floating-point scale value as an integer, leading to incorrect calculations in the kernel and a test failure. The type in the orchestrator signature should be pl.FP32 to match the tensor definition.

Suggested change:

-    config: pl.Tensor[[1], pl.INT32],
+    config: pl.Tensor[[1], pl.FP32],
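The runtime effect of this mismatch can be demonstrated with a NumPy bit reinterpretation. This is a sketch: the 0.25 scale value is an assumed example (1/sqrt(head_dim) for head_dim=16), not taken from the test.

```python
import numpy as np

# The scale factor lives in an FP32 config tensor.
scale = np.array([0.25], dtype=np.float32)

# Reading it through an INT32 signature reinterprets the raw IEEE-754
# bits rather than converting the value:
as_int = scale.view(np.int32)
# 0.25 encodes as 0x3E800000, which read as an int32 is 1048576000,
# nothing like the intended scale.
```

Any kernel arithmetic using that integer would be wildly wrong, which is why the suggested fix aligns the signature with the tensor's actual dtype.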

TensorSpec("sij", [self.num_heads, self.num_heads], DataType.FP32, is_output=True), # attention score output: [num_heads, num_heads]
]

def get_program(self) -> Any:

medium

The get_program method uses hardcoded dimensions (e.g., 16) for tensor shapes and operations, while __init__ and define_tensors are parameterized. This makes the tests brittle and hard to extend via pytest.mark.parametrize. The program definition should be made dynamic by using instance attributes like self.num_heads and self.head_dim. This would involve capturing these values and using them in the nested pl.program definition. This issue is present in all PTOTestCase subclasses in this file (QKMatmulTestCase, SoftmaxPrepareTestCase, PVMatmulTestCase, OnlineUpdateTestCase).
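The closure-capture pattern this comment suggests can be sketched in plain Python. Since the pl API is not shown here, this MatmulCase class is a hypothetical stand-in using NumPy matmul:

```python
import numpy as np

class MatmulCase:
    """Hypothetical stand-in showing how a nested program picks up
    parameterized sizes via closure capture instead of literal 16s."""

    def __init__(self, num_heads=16, head_dim=16):
        self.num_heads = num_heads
        self.head_dim = head_dim

    def get_program(self):
        # Capture instance attributes as locals; the nested function
        # closes over them, so shapes follow the constructor arguments.
        n, d = self.num_heads, self.head_dim

        def program(qi, kj_t):
            assert qi.shape == (n, d) and kj_t.shape == (d, n)
            return qi @ kj_t  # sij = qi @ kj_t

        return program
```

With this shape, `pytest.mark.parametrize` over (num_heads, head_dim) can exercise multiple sizes without touching the program body.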

Comment on lines +303 to +304
is_first: pl.Scalar[pl.BOOL],
is_last: pl.Scalar[pl.BOOL],

medium

There is a type mismatch for the is_first and is_last flags. They are read as pl.Scalar[pl.INT64] in the orchestrator, but the online_update InCore function expects pl.Scalar[pl.BOOL]. While this may work due to implicit casting, it's better for type safety and clarity to have consistent types. To resolve this, please change the InCore function to accept pl.Scalar[pl.INT64] to match the type read from the config tensor.

Suggested change:

-    is_first: pl.Scalar[pl.BOOL],
-    is_last: pl.Scalar[pl.BOOL],
+    is_first: pl.Scalar[pl.INT64],
+    is_last: pl.Scalar[pl.INT64],

@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/st/codegen/test_paged_attention.py (2)

49-52: Hardcoded shapes in get_program() contradict parameterized constructor.

The constructor accepts num_heads and head_dim as parameters, but get_program() hardcodes all tensor dimensions as literal 16. The same pattern is repeated in all four test case classes. If a caller passes different dimensions, define_tensors would create mismatched shapes vs. the program.

Consider either:

  • Removing the constructor parameters and documenting the fixed 16×16 constraint, or
  • Adding a guard (e.g., assert self.num_heads == 16 and self.head_dim == 16) to fail fast on unsupported sizes.

Also applies to: 64-89


449-484: Missing if __name__ == "__main__" entry point.

Per repo convention, test files should include a pytest.main entry point for standalone execution:

if __name__ == "__main__":
    pytest.main([__file__, "-v"])

Based on learnings: "In Python tests, when adding or maintaining pytest.main entry points, keep the existing pattern pytest.main([__file__, "-v"]) and do not wrap the call in raise SystemExit()."


ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a64ab1 and 539f4c0.

📒 Files selected for processing (1)
  • tests/st/codegen/test_paged_attention.py

def orchestrator(
    self,
    sij: pl.Tensor[[16, 16], pl.FP32],
    config: pl.Tensor[[1], pl.INT32],

⚠️ Potential issue | 🔴 Critical

Type mismatch: config declared as pl.INT32 but defined as DataType.FP32.

define_tensors (line 118) creates the config tensor with DataType.FP32 and initializes it with a float scale value, but the orchestrator signature here declares it as pl.INT32. This will cause the scale factor to be misinterpreted at runtime (float bits read as integer).

🐛 Proposed fix

     @pl.function(type=pl.FunctionType.Orchestration)
     def orchestrator(
         self,
         sij: pl.Tensor[[16, 16], pl.FP32],
-        config: pl.Tensor[[1], pl.INT32],
+        config: pl.Tensor[[1], pl.FP32],
     ) -> tuple[

Comment on lines +320 to +328
if is_first:
    # First block: copy mij->mi, lij->li, oi_new->oi
    mi_out = pl.store(mij_tile, [0, 0], [16, 1], mi)
    li_out = pl.store(lij_tile, [0, 0], [16, 1], li)
    oi_out = pl.store(oi_new_tile, [0, 0], [16, 16], oi)
    if is_last:
        # Single block: normalize dst = oi_new / lij
        dst_tile = pl.row_expand_div(oi_new_tile, lij_tile)
        dst_out = pl.store(dst_tile, [0, 0], [16, 16], dst)

⚠️ Potential issue | 🔴 Critical

dst_out is never assigned when is_first=True, is_last=False.

When the kernel takes the is_first=True branch but is_last=False, lines 322-324 assign mi_out, li_out, oi_out — but dst_out is never set. The return statement at line 375 (return mi_out, li_out, oi_out, dst_out) will then reference an unbound name.

The compute_expected at line 427-428 zeros dst for this case, so the intent is clear. The InCore function should also store zeros (or a no-op sentinel) to dst in this branch:

🐛 Proposed fix — write zeros to dst in the first-but-not-last path
                 if is_first:
                     # First block: copy mij->mi, lij->li, oi_new->oi
                     mi_out = pl.store(mij_tile, [0, 0], [16, 1], mi)
                     li_out = pl.store(lij_tile, [0, 0], [16, 1], li)
                     oi_out = pl.store(oi_new_tile, [0, 0], [16, 16], oi)
                     if is_last:
                         # Single block: normalize dst = oi_new / lij
                         dst_tile = pl.row_expand_div(oi_new_tile, lij_tile)
                         dst_out = pl.store(dst_tile, [0, 0], [16, 16], dst)
+                    else:
+                        # First but not last: kernel does not need dst; zero it
+                        zero_tile = pl.block.full([16, 16], dtype=pl.FP32, value=0.0)
+                        dst_out = pl.store(zero_tile, [0, 0], [16, 16], dst)
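The underlying hazard is ordinary Python control flow: a name assigned in only some branches is unbound on any path that skips it. A minimal plain-Python reproduction (strings stand in for the pl.store results; in a traced kernel DSL the error may surface at trace time instead of at call time):

```python
def update(is_first, is_last):
    # Mirrors the flagged branch structure.
    if is_first:
        out = "copy block stats"
        if is_last:
            dst = "normalized output"
    else:
        out = "online update"
        dst = "normalized output" if is_last else "running accumulator"
    return out, dst  # dst is unbound when is_first=True and is_last=False

try:
    update(True, False)
except UnboundLocalError:
    print("dst was never assigned on the first-but-not-last path")
```

Adding an `else` that writes a defined value (zeros, per the proposed fix) makes every path assign `dst` before the return.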

lyfne123 merged commit 1895b7d into hw-native-sys:main on Feb 25, 2026, with 6 checks passed.