aws-neuron · shaojiex-aws · Apr 27, 2026
diff --git a/kernelgen/.claude/skills/build_nkipykernelgen/SKILL.md b/kernelgen/.claude/skills/build_nkipykernelgen/SKILL.md
@@ -0,0 +1,23 @@
+---
+name: build_nkipykernelgen
+description: Rebuild NKIPyKernelGen (C++ passes and Python package)
+user-invocable: true
+---
+
+## Usage
+
+`/build_nkipykernelgen`
+
+## Instructions
+
+Run the build script. Invoke it with `bash` (not `sh`): the script has a `#!/bin/bash` shebang and is written for bash. Use a timeout of 300000 ms (5 minutes).
+
+```bash
+bash .claude/skills/build_nkipykernelgen/scripts/build.sh
+```
+
+Note: Run this from the NKIPyKernelGen repo root.
+
+## Important
+
+`pip install -e .` builds BOTH the C++ passes (nkipy-opt binary) AND the Python package in one step. There is NO need to run cmake separately: the `pyproject.toml` build system handles the full C++ compilation via cmake internally.
diff --git a/kernelgen/.claude/skills/build_nkipykernelgen/scripts/build.sh b/kernelgen/.claude/skills/build_nkipykernelgen/scripts/build.sh
@@ -0,0 +1,12 @@
+#!/bin/bash
+# Rebuild NKIPyKernelGen (C++ passes and Python package).
+set -eo pipefail  # pipefail: without it, a failed `pip install` is masked by the exit status of `tail`
+
+# Derive repo root from script location: scripts/ -> build_nkipykernelgen/ -> skills/ -> .claude/ -> repo root
+REPO_ROOT="$(cd "$(dirname "$0")/../../../.." && pwd)"
+
+cd "$REPO_ROOT"
+
+echo "=== Rebuilding NKIPyKernelGen ==="
+pip install -e . 2>&1 | tail -5
+echo "=== Build complete ==="
diff --git a/kernelgen/.claude/skills/debug_nisa_ir/SKILL.md b/kernelgen/.claude/skills/debug_nisa_ir/SKILL.md
@@ -0,0 +1,121 @@
+---
+name: debug_nisa_ir
+description: Debug NISA MLIR that fails BIRSim. Creates a debug case under tests/debug/ with buggy.mlir, kernel.py, iterative fixes, and a README proposing compiler pass changes.
+user-invocable: true
+---
+
+## Usage
+
+`/debug_nisa_ir <bug_name> [kernel.py path] [buggy NISA MLIR path or inline]`
+
+- `bug_name`: Short snake_case name for the debug case (e.g., `rope_partition_oob`)
+- `kernel.py path`: Path to the Python source that was fed into `nkipy_opt`. If omitted, ask the user.
+- `buggy NISA MLIR`: Path to the `.mlir` file that `nkipy_opt` produced, or the user may paste it inline. If omitted, ask the user.
+
+## Instructions
+
+You are debugging a NISA-level MLIR kernel that `nkipy_opt` generated but that fails BIRSim verification or produces incorrect numerical results. Follow this systematic workflow.
+
+### Step 1: Set up the debug case directory
+
+Create `tests/debug/<bug_name>/` with:
+
+```
+tests/debug/<bug_name>/
+  kernel.py       # Copy of the input Python kernel
+  buggy.mlir      # The failing NISA MLIR from nkipy_opt
+  README.md       # Will be populated in Step 6
+```
+
+Copy the user-provided `kernel.py` and `buggy.mlir` into this directory. Ensure `kernel.py` contains a function whose name matches the `sym_name` in the MLIR (this is required by `run_sim.py`).
+
+### Step 2: Reproduce the failure
+
+Run the buggy MLIR through BIRSim:
+
+```bash
+cd tests/debug && source ./run.sh <bug_name>/buggy.mlir
+```
+
+Record the exact error output. Common failure modes:
+- **BIR verification error**: `Invalid access of N partitions starting at partition M` or `Access pattern out of bounds`
+- **BIRSim runtime error**: `NCC_ISIM*` errors (e.g., uninitialized PSUM read)
+- **Numerical mismatch**: `SIMULATION FAILED (max_diff=...)` -- BIRSim runs but output doesn't match kernel.py
+
+### Step 3: Analyze the bug
+
+Read the MLIR carefully and identify the root cause. Common patterns:
+
+1. **Multi-partition SBUF with vector engine**: `tensor_tensor_arith` (engine=vector) reading from a loop-indexed partition of a multi-partition SBUF tensor. The vector engine processes all 128 partitions simultaneously and cannot address partition N selectively.
+
+2. **Wrong reshape/transpose lowering**: Column-by-column transposes that conflate head and head_dim dimensions. Often manifests as `<128|2>` tile on a dim of size 2 (OOB), or silent numerical corruption.
+
+3. **Missing accumulate flags**: Matmul K-loops without `psum_accumulate_flags`, causing PSUM overwrite instead of accumulate.
+
+4. **SBUF OOM**: Too many live SBUF tensors. Check if intermediates can be fused or freed earlier.
+
+Focus on understanding:
+- Which MLIR lines are problematic (cite line numbers)
+- What the pass *intended* to generate vs what it actually generated
+- Why the hardware rejects it (BIR rules violated)
+
+### Step 4: Create iterative fixes
+
+For each fix attempt, create a new MLIR file:
+
+```
+fix_<number>_<what_was_fixed>.mlir
+```
+
+For example:
+- `fix_01_fuse_rope_elementwise.mlir`
+- `fix_02_reshape_head_granularity.mlir`
+
+Edit the MLIR by hand to correct the identified issue. Then run:
+
+```bash
+cd tests/debug && source ./run.sh <bug_name>/fix_01_<description>.mlir
+```
+
+If it still fails, analyze the new error, create another fix file, and iterate. Keep each attempt as a separate file so the progression is visible.
+
+### Step 5: Verify the final fix
+
+The last `fix_*.mlir` should produce:
+
+```
+BIRSim PASSED
+SIMULATION PASSED
+```
+
+Confirm that the numerical output matches `kernel.py` within tolerance (atol=1e-2, rtol=1e-2).
+
+### Step 6: Write the README
+
+Create `tests/debug/<bug_name>/README.md` documenting:
+
+1. **Overview**: One paragraph summarizing what `buggy.mlir` is (which kernel, what it does) and what goes wrong.
+
+2. **How to reproduce**: The exact `source ../run.sh` commands for buggy and fixed versions.
+
+3. **Bug analysis**: For each bug found:
+   - **Symptom**: The exact error message
+   - **Location in MLIR**: Line numbers and what the code does
+   - **What happens**: Why the hardware rejects it or produces wrong results
+   - **Fix**: What was changed in the MLIR (with code snippets)
+
+4. **Root cause summary**: Table mapping each bug to the compiler pass responsible and whether it causes a compilation error or silent corruption.
+
+5. **Proposed compiler pass fixes**: For each bug, describe:
+   - Which pass to fix (e.g., `simplify-linalg`, `linalg-to-nisa`, tiling)
+   - The root cause *in the pass* (not just the MLIR symptom)
+   - A concrete proposed change (pseudocode or description of the algorithm change)
+
+Use the format from existing debug cases (see `tests/debug/qwen3_layer/README.md` for reference).
+
+### Tips
+
+- The debug harness (`run.sh` / `run_sim.py`) automatically sets up the NKI environment, generates random inputs (seed=42), compiles to NEFF with BIRSim, and compares against `kernel.py`.
+- Artifacts (NEFF, BIR) are written to `artifacts_<stem>/` next to each MLIR file (git-ignored).
+- When editing MLIR, keep changes minimal and targeted. Change only the ops/loops related to the bug.
+- If you're unsure which pass generated a problematic pattern, check the pass pipeline in `nkipy_opt` or ask the user.
diff --git a/kernelgen/.claude/skills/run_nkipykernelgen_tests/SKILL.md b/kernelgen/.claude/skills/run_nkipykernelgen_tests/SKILL.md
@@ -0,0 +1,28 @@
+---
+name: run_nkipykernelgen_tests
+description: Run NKIPyKernelGen tests (without rebuilding)
+user-invocable: true
+---
+
+## Usage
+
+`/run_nkipykernelgen_tests [scope]`
+
+Where `scope` is: `all` (default), `passes`, `e2e`, or a specific path relative to `tests/`, like `passes/infer_layout` or `e2e/nkipy_tests`.
+
+## Instructions
+
+1. Run the script at `.claude/skills/run_nkipykernelgen_tests/scripts/run_tests.sh` with the requested scope as the argument. Invoke it with `bash` (not `sh`) since it relies on bash-only features such as `PIPESTATUS`. Use a timeout of 600000 ms (10 minutes).
+
+```bash
+bash .claude/skills/run_nkipykernelgen_tests/scripts/run_tests.sh <scope>
+```
+
+Note: Run this from the NKIPyKernelGen repo root.
+
+2. The script saves full test output to `/tmp/nkipykernelgen_test_results.txt`. After the script finishes, use the Read tool to read that file for the complete results. This avoids context window issues with long test output.
+
+3. When reporting results, summarize:
+   - Total passed/failed/xfailed/xpassed/skipped counts
+   - List any unexpected failures (FAILED, not XFAIL)
+   - Note any XPASS (unexpected passes) that indicate xfail markers should be removed
diff --git a/kernelgen/.claude/skills/run_nkipykernelgen_tests/scripts/run_tests.sh b/kernelgen/.claude/skills/run_nkipykernelgen_tests/scripts/run_tests.sh
@@ -0,0 +1,36 @@
+#!/bin/bash
+# Run NKIPyKernelGen tests with proper environment setup.
+# Usage: run_tests.sh [scope]
+#   scope: all (default), passes, e2e, or a specific path like passes/infer_layout
+
+SCOPE="${1:-all}"
+RESULTS_FILE="/tmp/nkipykernelgen_test_results.txt"
+
+# Derive repo root from script location: scripts/ -> run_nkipykernelgen_tests/ -> skills/ -> .claude/ -> repo root
+REPO_ROOT="$(cd "$(dirname "$0")/../../../.." && pwd)"
+
+cd "$REPO_ROOT" || exit 1
+
+# Run tests, capturing full output to file
+echo "=== Running tests (scope: $SCOPE) ==="
+echo "Results will be saved to: $RESULTS_FILE"
+
+case "$SCOPE" in
+  all)
+    python -m pytest tests/ -v --tb=short 2>&1 | tee "$RESULTS_FILE"
+    ;;
+  passes)
+    python -m pytest tests/passes/ -v --tb=short 2>&1 | tee "$RESULTS_FILE"
+    ;;
+  e2e)
+    python -m pytest tests/e2e/ -v --tb=short 2>&1 | tee "$RESULTS_FILE"
+    ;;
+  *)
+    python -m pytest "tests/$SCOPE" -v --tb=short 2>&1 | tee "$RESULTS_FILE"
+    ;;
+esac
+EXIT_CODE=${PIPESTATUS[0]}
+
+echo ""
+echo "=== Full results saved to: $RESULTS_FILE ==="
+exit $EXIT_CODE
diff --git a/kernelgen/.gitignore b/kernelgen/.gitignore
@@ -0,0 +1,47 @@
+# Override parent nkipy/.gitignore's `lib/` rule so MLIR C++ sources in
+# mlir/lib/ are tracked (the parent rule is aimed at Python venv lib/ dirs).
+!mlir/lib/
+!mlir/lib/**
+
+# Python
+__pycache__/
+*.py[cod]
+*.so
+
+# Distribution / packaging
+build/
+dist/
+*.egg-info/
+.eggs/
+*.whl
+
+# Built MLIR bindings (generated during build)
+nkipy_kernelgen/_mlir/
+
+# Virtual environments
+venv/
+.env
+
+# Testing
+.pytest_cache/
+.coverage
+tests/**/outputs/
+tests/**/artifacts/
+
+# IDE
+.vscode/
+.idea/
+
+# OS
+.DS_Store
+Thumbs.db
+
+# Logs
+*.log
+
+# LLVM lit test outputs
+.lit_test_times.txt
+Output/
+
+# Compiler Explorer (cloned repo)
+compiler_explorer/compiler-explorer/