test(st): Add Paged Attention system tests for four PTO kernels #257
lyfne123 merged 1 commit into hw-native-sys:main from
Conversation
Summary of Changes

Hello @Crystal-wzy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a suite of system-level tests for the Paged Attention algorithm's core kernels. The tests use the PTOTestCase framework to verify the correctness and expected memory behavior of the QK Matmul, Softmax Prepare, PV Matmul, and Online Update operations, all crucial components of an efficient attention mechanism.
📝 Walkthrough

A new test module for Paged Attention kernels is introduced, with four PTOTestCase-based test classes validating the QK matmul, softmax preparation, PV matmul, and online update kernels. The TestPagedAttentionKernels class orchestrates parameterized tests with configuration-driven path selection and result verification.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Code Review
This pull request adds a comprehensive set of system tests for four Paged Attention kernels using the PTOTestCase framework. The tests cover QK matmul, softmax preparation, PV matmul, and the online update logic. My review found a critical bug in the SoftmaxPrepareTestCase where a tensor type mismatch would lead to incorrect test results. I've also suggested maintainability improvements (parameterizing the test programs instead of hardcoding dimensions) and a type-safety fix in the OnlineUpdateTestCase (resolving a type mismatch on the boolean flags).
def orchestrator(
    self,
    sij: pl.Tensor[[16, 16], pl.FP32],
    config: pl.Tensor[[1], pl.INT32],
The config tensor is defined with DataType.FP32 but the orchestrator function signature types it as pl.Tensor[[1], pl.INT32]. This type mismatch will cause pl.tensor.read to misinterpret the floating-point scale value as an integer, leading to incorrect calculations in the kernel and a test failure. The type in the orchestrator signature should be pl.FP32 to match the tensor definition.
-   config: pl.Tensor[[1], pl.INT32],
+   config: pl.Tensor[[1], pl.FP32],
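For intuition, here is a small, self-contained illustration (using numpy, not the pl framework) of why the wrong element type corrupts the value: if the FP32 bytes are reinterpreted as INT32 (an assumption about how pl.tensor.read behaves), the scale comes back as a meaningless large integer; a plain conversion would instead truncate it toward zero. Either way the kernel computes with a wrong scale.

```python
# Illustrative only (numpy, not the pl framework): FP32 bytes read back as INT32.
import numpy as np

scale = np.float32(0.125)        # float scale value like the one define_tensors stores
as_int = scale.view(np.int32)    # same 4 bytes, interpreted as a 32-bit integer
print(scale, int(as_int))        # 0.125 1040187392
```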
| TensorSpec("sij", [self.num_heads, self.num_heads], DataType.FP32, is_output=True), # attention score output: [num_heads, num_heads] | ||
| ] | ||
|
|
||
| def get_program(self) -> Any: |
The get_program method uses hardcoded dimensions (e.g., 16) for tensor shapes and operations, while __init__ and define_tensors are parameterized. This makes the tests brittle and hard to extend via pytest.mark.parametrize. The program definition should be made dynamic by using instance attributes like self.num_heads and self.head_dim. This would involve capturing these values and using them in the nested pl.program definition. This issue is present in all PTOTestCase subclasses in this file (QKMatmulTestCase, SoftmaxPrepareTestCase, PVMatmulTestCase, OnlineUpdateTestCase).
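A minimal sketch of the capture-then-use pattern described above, written as plain Python so it stays self-contained (the pl types and program decorator are deliberately omitted; class and attribute names are placeholders):

```python
# Sketch only: bind instance attributes to locals before the nested definition,
# then use those locals wherever the current tests write a literal 16,
# e.g. pl.Tensor[[n, d], pl.FP32] instead of pl.Tensor[[16, 16], pl.FP32].
class SketchCase:
    def __init__(self, num_heads: int = 16, head_dim: int = 16) -> None:
        self.num_heads = num_heads
        self.head_dim = head_dim

    def get_program(self):
        n, d = self.num_heads, self.head_dim   # captured for the nested scope

        class Program:                         # stands in for the nested pl.program class
            q_shape = [n, d]                   # was hardcoded [16, 16]
            kt_shape = [d, n]                  # was hardcoded [16, 16]

        return Program
```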
    is_first: pl.Scalar[pl.BOOL],
    is_last: pl.Scalar[pl.BOOL],
There is a type mismatch for the is_first and is_last flags. They are read as pl.Scalar[pl.INT64] in the orchestrator, but the online_update InCore function expects pl.Scalar[pl.BOOL]. While this may work due to implicit casting, it's better for type safety and clarity to have consistent types. To resolve this, please change the InCore function to accept pl.Scalar[pl.INT64] to match the type read from the config tensor.
-   is_first: pl.Scalar[pl.BOOL],
-   is_last: pl.Scalar[pl.BOOL],
+   is_first: pl.Scalar[pl.INT64],
+   is_last: pl.Scalar[pl.INT64],
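As a plain-numpy illustration of the point (the pl casting semantics are an assumption here): values pulled out of an INT64 config tensor arrive as integers, so typing the InCore parameters as INT64 describes what is actually passed rather than leaning on an implicit int-to-bool cast.

```python
# Illustrative only: flags read from an int64 buffer are integers, not bools.
import numpy as np

config = np.array([1, 0], dtype=np.int64)   # e.g. [is_first, is_last]
is_first, is_last = config[0], config[1]
print(type(is_first).__name__, bool(is_first), bool(is_last))  # int64 True False
```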
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/st/codegen/test_paged_attention.py (2)
49-52: Hardcoded shapes in get_program() contradict the parameterized constructor.

The constructor accepts num_heads and head_dim as parameters, but get_program() hardcodes all tensor dimensions as the literal 16. The same pattern is repeated in all four test case classes. If a caller passes different dimensions, define_tensors would create shapes that mismatch the program. Consider either:
- Removing the constructor parameters and documenting the fixed 16×16 constraint, or
- Adding a guard (e.g., assert self.num_heads == 16 and self.head_dim == 16) to fail fast on unsupported sizes; a minimal sketch of this option follows below.

Also applies to: 64-89
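A minimal sketch of the guard option, assuming the PTOTestCase base class (a placeholder stands in for it here so the snippet runs on its own):

```python
# Sketch only: reject unsupported sizes up front instead of failing later
# with mismatched shapes. PTOTestCase is replaced by a placeholder base.
class PTOTestCase:        # placeholder for the real framework base class
    pass

class QKMatmulTestCase(PTOTestCase):
    def __init__(self, num_heads: int = 16, head_dim: int = 16) -> None:
        # get_program() hardcodes 16x16 tiles, so fail fast on anything else.
        assert num_heads == 16 and head_dim == 16, "only 16x16 is supported"
        self.num_heads = num_heads
        self.head_dim = head_dim
```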
449-484: Missing if __name__ == "__main__" entry point.

Per repo convention, test files should include a pytest.main entry point for standalone execution, as shown below.

Based on learnings: "In Python tests, when adding or maintaining pytest.main entry points, keep the existing pattern pytest.main([__file__, "-v"]) and do not wrap the call in raise SystemExit()."
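For reference, the quoted entry point as it would sit at the bottom of the test file:

```python
import pytest

if __name__ == "__main__":
    pytest.main([__file__, "-v"])
```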
pytest.main([__file__, "-v"])and do not wrap the call inraise SystemExit()."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/st/codegen/test_paged_attention.py` around lines 449 - 484, Add the missing standalone test entry point at the end of the file so the TestPagedAttentionKernels tests can be run directly; specifically, append an if __name__ == "__main__" guard that calls pytest.main([__file__, "-v"]) (do not wrap in raise SystemExit). Place this after the TestPagedAttentionKernels class (which contains test_qk_matmul, test_softmax_prepare, test_pv_matmul, test_online_update) so running the file directly executes the pytest runner with verbose output.
def orchestrator(
    self,
    sij: pl.Tensor[[16, 16], pl.FP32],
    config: pl.Tensor[[1], pl.INT32],
Type mismatch: config declared as pl.INT32 but defined as DataType.FP32.
define_tensors (line 118) creates the config tensor with DataType.FP32 and initializes it with a float scale value, but the orchestrator signature here declares it as pl.INT32. This will cause the scale factor to be misinterpreted at runtime (float bits read as integer).
🐛 Proposed fix

 @pl.function(type=pl.FunctionType.Orchestration)
 def orchestrator(
     self,
     sij: pl.Tensor[[16, 16], pl.FP32],
-    config: pl.Tensor[[1], pl.INT32],
+    config: pl.Tensor[[1], pl.FP32],
 ) -> tuple[
if is_first:
    # First block: copy mij->mi, lij->li, oi_new->oi
    mi_out = pl.store(mij_tile, [0, 0], [16, 1], mi)
    li_out = pl.store(lij_tile, [0, 0], [16, 1], li)
    oi_out = pl.store(oi_new_tile, [0, 0], [16, 16], oi)
    if is_last:
        # Single block: normalize dst = oi_new / lij
        dst_tile = pl.row_expand_div(oi_new_tile, lij_tile)
        dst_out = pl.store(dst_tile, [0, 0], [16, 16], dst)
dst_out is never assigned when is_first=True, is_last=False.
When the kernel takes the is_first=True branch but is_last=False, lines 322-324 assign mi_out, li_out, oi_out — but dst_out is never set. The return statement at line 375 (return mi_out, li_out, oi_out, dst_out) will then reference an unbound name.
The compute_expected at line 427-428 zeros dst for this case, so the intent is clear. The InCore function should also store zeros (or a no-op sentinel) to dst in this branch:
🐛 Proposed fix — write zeros to dst in the first-but-not-last path

 if is_first:
     # First block: copy mij->mi, lij->li, oi_new->oi
     mi_out = pl.store(mij_tile, [0, 0], [16, 1], mi)
     li_out = pl.store(lij_tile, [0, 0], [16, 1], li)
     oi_out = pl.store(oi_new_tile, [0, 0], [16, 16], oi)
     if is_last:
         # Single block: normalize dst = oi_new / lij
         dst_tile = pl.row_expand_div(oi_new_tile, lij_tile)
         dst_out = pl.store(dst_tile, [0, 0], [16, 16], dst)
+    else:
+        # First but not last: kernel does not need dst; zero it
+        zero_tile = pl.block.full([16, 16], dtype=pl.FP32, value=0.0)
+        dst_out = pl.store(zero_tile, [0, 0], [16, 16], dst)
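The control-flow bug is easy to reproduce in plain Python, independent of the pl DSL (only the is_first branch is modeled; names are illustrative):

```python
# Minimal repro of the same bug class: a name bound only in a nested branch
# is unbound on the path that skips that branch.
def online_update(is_first: bool, is_last: bool):
    if is_first:
        mi_out = "mi"
        if is_last:
            dst_out = "dst"
    return mi_out, dst_out

online_update(True, False)   # raises UnboundLocalError for dst_out
```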
Add system-level tests for the Paged Attention algorithm using the PTOTestCase framework, covering four kernels:
- QKMatmulTestCase: sij = qi @ kj_t via L1 -> L0A/L0B -> L0C memory flow
- SoftmaxPrepareTestCase: scale -> row_max -> row_expand_sub -> exp -> row_sum pipeline; scale passed via FP32 config tensor
- PVMatmulTestCase: oi_new = pij @ vj with the same L0C path
- OnlineUpdateTestCase: online normalizer update with is_first/is_last flags covering all four runtime branches; PyTorch golden validation
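For readers unfamiliar with the term, the "online normalizer update" these flags guard is usually the FlashAttention-style running-softmax recurrence sketched below; this is the textbook form, not necessarily the exact expression used by the kernel or its compute_expected golden.

```latex
% Textbook online-softmax recurrence (assumed, not confirmed by this PR page):
% running max m_i, running normalizer l_i, running output o_i per row.
\begin{aligned}
m_i' &= \max(m_i,\ m_{ij}) \\
l_i' &= e^{\,m_i - m_i'}\, l_i + e^{\,m_{ij} - m_i'}\, l_{ij} \\
o_i' &= e^{\,m_i - m_i'}\, o_i + e^{\,m_{ij} - m_i'}\, o^{\text{new}}_i \\
\mathrm{dst} &= o_i' / l_i' \qquad \text{(only when is\_last)}
\end{aligned}
```

On the first block (is_first) the running state is simply initialized, m_i = m_ij, l_i = l_ij, o_i = oi_new, which matches the branch shown in the review comment above.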