Add challenge 86: Paged KV-Cache Attention (Medium) #225
Open
claude wants to merge 2 commits into main from add-challenge-86-paged-attention
@@ -0,0 +1,201 @@
<p>
Implement decode-phase attention over a <strong>paged KV cache</strong>. In LLM serving systems (e.g., vLLM),
the key and value tensors for each sequence are stored in fixed-size memory <em>blocks</em> (pages) that
may be scattered non-contiguously across a shared GPU memory pool. A <code>block_table</code> maps each
sequence's logical block indices to physical block indices in the cache pool. Given a single query vector
per sequence (one new token being generated), compute the attention output by gathering the relevant
K/V blocks via the block table and computing scaled dot-product attention over the full context.
</p>
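<p>
To make the address mapping concrete, here is a minimal indexing sketch (illustrative Python/NumPy code, not part of
the required solution; the helper name <code>gather_key</code> is hypothetical and the array shapes follow the
definitions given below):
</p>
<pre><code>def gather_key(K_cache, block_table, s, t, h, block_size):
    # Hypothetical helper (illustration only): fetch the key vector for logical
    # token t of sequence s, head h, from a paged cache laid out as
    # (num_blocks, block_size, num_heads, head_dim).
    physical_block = block_table[s, t // block_size]  # logical block -> physical block in the pool
    offset = t % block_size                           # position of the token inside that block
    return K_cache[physical_block, offset, h, :]      # (head_dim,) key vector
</code></pre>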
| <svg width="620" height="300" viewBox="0 0 620 300" xmlns="http://www.w3.org/2000/svg" | ||
| style="display:block; margin:20px auto; font-family:monospace;"> | ||
| <rect width="620" height="300" fill="#222" rx="8"/> | ||
| <defs> | ||
| <marker id="ab" markerWidth="7" markerHeight="7" refX="5" refY="3.5" orient="auto"> | ||
| <path d="M0,0 L0,7 L7,3.5 z" fill="#4a8abf"/> | ||
| </marker> | ||
| <marker id="ag" markerWidth="7" markerHeight="7" refX="5" refY="3.5" orient="auto"> | ||
| <path d="M0,0 L0,7 L7,3.5 z" fill="#52b788"/> | ||
| </marker> | ||
| </defs> | ||
|
|
||
| <!-- ============================================================ --> | ||
| <!-- TOP: block_table as a proper table --> | ||
| <!-- ============================================================ --> | ||
| <text x="16" y="18" fill="#666" font-size="10">block_table</text> | ||
|
|
||
| <!-- Column headers --> | ||
| <text x="168" y="18" fill="#888" font-size="9" text-anchor="middle">blk 0</text> | ||
| <text x="218" y="18" fill="#888" font-size="9" text-anchor="middle">blk 1</text> | ||
| <text x="268" y="18" fill="#888" font-size="9" text-anchor="middle">blk 2</text> | ||
|
|
||
| <!-- Seq 0 row --> | ||
| <text x="100" y="40" fill="#4a8abf" font-size="10" text-anchor="end">seq 0</text> | ||
| <rect x="114" y="26" width="48" height="22" rx="3" fill="#2a3a4a" stroke="#4a8abf" stroke-width="1.5"/> | ||
| <text x="138" y="42" text-anchor="middle" fill="#4a8abf" font-size="11" font-weight="bold">3</text> | ||
| <rect x="164" y="26" width="48" height="22" rx="3" fill="#2a3a4a" stroke="#4a8abf" stroke-width="1.5"/> | ||
| <text x="188" y="42" text-anchor="middle" fill="#4a8abf" font-size="11" font-weight="bold">7</text> | ||
| <rect x="214" y="26" width="48" height="22" rx="3" fill="#2a2a2a" stroke="#555" stroke-width="1"/> | ||
| <text x="238" y="42" text-anchor="middle" fill="#555" font-size="10">—</text> | ||
|
|
||
| <!-- Seq 1 row --> | ||
| <text x="100" y="68" fill="#52b788" font-size="10" text-anchor="end">seq 1</text> | ||
| <rect x="114" y="54" width="48" height="22" rx="3" fill="#2a4a38" stroke="#52b788" stroke-width="1.5"/> | ||
| <text x="138" y="70" text-anchor="middle" fill="#52b788" font-size="11" font-weight="bold">1</text> | ||
| <rect x="164" y="54" width="48" height="22" rx="3" fill="#2a4a38" stroke="#52b788" stroke-width="1.5"/> | ||
| <text x="188" y="70" text-anchor="middle" fill="#52b788" font-size="11" font-weight="bold">5</text> | ||
| <rect x="214" y="54" width="48" height="22" rx="3" fill="#2a4a38" stroke="#52b788" stroke-width="1.5"/> | ||
| <text x="238" y="70" text-anchor="middle" fill="#52b788" font-size="11" font-weight="bold">9</text> | ||
|
|
||
| <text x="190" y="92" fill="#888" font-size="9" text-anchor="middle">values = physical block indices in pool ↓</text> | ||
|
|
||
| <!-- ============================================================ --> | ||
| <!-- MIDDLE: KV cache pool as a horizontal memory strip --> | ||
| <!-- ============================================================ --> | ||
| <text x="16" y="118" fill="#666" font-size="10">K_cache / V_cache pool (GPU memory)</text> | ||
|
|
||
| <!-- 10 blocks in a horizontal strip --> | ||
| <!-- blk 0 (unused) --> | ||
| <rect x="16" y="128" width="54" height="36" rx="3" fill="#2a2a2a" stroke="#555" stroke-width="1"/> | ||
| <text x="43" y="151" text-anchor="middle" fill="#666" font-size="9">blk 0</text> | ||
|
|
||
| <!-- blk 1 (seq 1) --> | ||
| <rect x="74" y="128" width="54" height="36" rx="3" fill="#2a4a38" stroke="#52b788" stroke-width="1.5"/> | ||
| <text x="101" y="148" text-anchor="middle" fill="#52b788" font-size="9">blk 1</text> | ||
| <text x="101" y="159" text-anchor="middle" fill="#52b788" font-size="7">seq1.0</text> | ||
|
|
||
| <!-- blk 2 (unused) --> | ||
| <rect x="132" y="128" width="54" height="36" rx="3" fill="#2a2a2a" stroke="#555" stroke-width="1"/> | ||
| <text x="159" y="151" text-anchor="middle" fill="#666" font-size="9">blk 2</text> | ||
|
|
||
| <!-- blk 3 (seq 0) --> | ||
| <rect x="190" y="128" width="54" height="36" rx="3" fill="#2a3a4a" stroke="#4a8abf" stroke-width="1.5"/> | ||
| <text x="217" y="148" text-anchor="middle" fill="#4a8abf" font-size="9">blk 3</text> | ||
| <text x="217" y="159" text-anchor="middle" fill="#4a8abf" font-size="7">seq0.0</text> | ||
|
|
||
| <!-- blk 4 (unused) --> | ||
| <rect x="248" y="128" width="54" height="36" rx="3" fill="#2a2a2a" stroke="#555" stroke-width="1"/> | ||
| <text x="275" y="151" text-anchor="middle" fill="#666" font-size="9">blk 4</text> | ||
|
|
||
| <!-- blk 5 (seq 1) --> | ||
| <rect x="306" y="128" width="54" height="36" rx="3" fill="#2a4a38" stroke="#52b788" stroke-width="1.5"/> | ||
| <text x="333" y="148" text-anchor="middle" fill="#52b788" font-size="9">blk 5</text> | ||
| <text x="333" y="159" text-anchor="middle" fill="#52b788" font-size="7">seq1.1</text> | ||
|
|
||
| <!-- blk 6 (unused) --> | ||
| <rect x="364" y="128" width="54" height="36" rx="3" fill="#2a2a2a" stroke="#555" stroke-width="1"/> | ||
| <text x="391" y="151" text-anchor="middle" fill="#666" font-size="9">blk 6</text> | ||
|
|
||
| <!-- blk 7 (seq 0) --> | ||
| <rect x="422" y="128" width="54" height="36" rx="3" fill="#2a3a4a" stroke="#4a8abf" stroke-width="1.5"/> | ||
| <text x="449" y="148" text-anchor="middle" fill="#4a8abf" font-size="9">blk 7</text> | ||
| <text x="449" y="159" text-anchor="middle" fill="#4a8abf" font-size="7">seq0.1</text> | ||
|
|
||
| <!-- blk 8 (unused) --> | ||
| <rect x="480" y="128" width="54" height="36" rx="3" fill="#2a2a2a" stroke="#555" stroke-width="1"/> | ||
| <text x="507" y="151" text-anchor="middle" fill="#666" font-size="9">blk 8</text> | ||
|
|
||
| <!-- blk 9 (seq 1) --> | ||
| <rect x="538" y="128" width="54" height="36" rx="3" fill="#2a4a38" stroke="#52b788" stroke-width="1.5"/> | ||
| <text x="565" y="148" text-anchor="middle" fill="#52b788" font-size="9">blk 9</text> | ||
| <text x="565" y="159" text-anchor="middle" fill="#52b788" font-size="7">seq1.2</text> | ||
|
|
||
| <!-- ============================================================ --> | ||
| <!-- BOTTOM: Attention computation --> | ||
| <!-- ============================================================ --> | ||
| <rect x="16" y="186" width="588" height="100" rx="5" fill="#1a1a1a" stroke="#666" stroke-width="1"/> | ||
| <text x="310" y="208" text-anchor="middle" fill="#ccc" font-size="11" font-weight="bold">Decode Attention (per sequence s, per head h)</text> | ||
|
|
||
| <text x="36" y="232" fill="#ffcc66" font-size="10">1.</text> | ||
| <text x="56" y="232" fill="#aaa" font-size="10">Gather K, V: token t is at pool[ block_table[s, t/B] ], offset t%B</text> | ||
|
|
||
| <text x="36" y="254" fill="#ffcc66" font-size="10">2.</text> | ||
| <text x="56" y="254" fill="#aaa" font-size="10">scores[t] = Q[s,h] · K[s,h,t] / √head_dim for t = 0 .. context_lens[s]-1</text> | ||
|
|
||
| <text x="36" y="276" fill="#ffcc66" font-size="10">3.</text> | ||
| <text x="56" y="276" fill="#aaa" font-size="10">output[s,h] = ∑_t softmax(scores)[t] · V[s,h,t]</text> | ||
| </svg> | ||
|
|
||
<h2>Implementation Requirements</h2>
<p>
Implement the function <code>solve(Q, K_cache, V_cache, block_table, context_lens, output, batch_size, num_heads, head_dim, block_size, max_blocks_per_seq)</code>
that computes paged decode-phase attention:
</p>
<ul>
<li><code>Q</code>: query tensor of shape <code>(batch_size, num_heads, head_dim)</code>, dtype <code>float32</code> — one query per sequence</li>
<li><code>K_cache</code>: paged key cache of shape <code>(num_blocks, block_size, num_heads, head_dim)</code>, dtype <code>float32</code></li>
<li><code>V_cache</code>: paged value cache of shape <code>(num_blocks, block_size, num_heads, head_dim)</code>, dtype <code>float32</code></li>
<li><code>block_table</code>: physical block indices of shape <code>(batch_size, max_blocks_per_seq)</code>, dtype <code>int32</code></li>
<li><code>context_lens</code>: number of valid KV tokens per sequence, shape <code>(batch_size,)</code>, dtype <code>int32</code></li>
<li><code>output</code>: result of shape <code>(batch_size, num_heads, head_dim)</code>, dtype <code>float32</code></li>
</ul>
<p>
For each sequence \(s\) and each attention head \(h\), compute:
</p>
<ol>
<li>
Gather the \(\text{context_lens}[s]\) key and value vectors from the paged cache using
\(\text{block_table}[s]\). Token at logical position \(t\) lives in physical block
\(\text{block_table}[s,\;\lfloor t / B \rfloor]\) at offset \(t \bmod B\) within that block,
where \(B = \text{block_size}\).
</li>
<li>
Compute scaled dot-product attention:
\[\text{scores}[t] = \frac{Q[s, h] \cdot K[s, h, t]}{\sqrt{\text{head_dim}}}\]
</li>
<li>
Apply softmax over all \(\text{context_lens}[s]\) positions to get attention weights.
</li>
<li>
Compute: \(\displaystyle \text{output}[s, h] = \sum_{t} \text{softmax}(\text{scores})[t] \cdot V[s, h, t]\)
</li>
</ol>
<p>
Do not use external libraries beyond the framework you select. Keep the function signature unchanged.
Write results directly into <code>output</code>.
</p>
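<p>
The steps above can be summarized in a minimal host-side reference sketch (assuming NumPy arrays with the shapes
listed earlier; the function name <code>solve_reference</code> is hypothetical, and this sketch is written for
clarity, not as a performant GPU implementation):
</p>
<pre><code>import numpy as np

def solve_reference(Q, K_cache, V_cache, block_table, context_lens, output,
                    batch_size, num_heads, head_dim, block_size, max_blocks_per_seq):
    """Hypothetical NumPy reference for paged decode attention (illustration only)."""
    scale = 1.0 / np.sqrt(head_dim)
    for s in range(batch_size):
        L = int(context_lens[s])
        idx = np.arange(L)                                # logical token positions 0..L-1
        phys = block_table[s, idx // block_size]          # (L,) physical block ids
        off = idx % block_size                            # (L,) offsets within each block
        K = K_cache[phys, off]                            # (L, num_heads, head_dim) gathered keys
        V = V_cache[phys, off]                            # (L, num_heads, head_dim) gathered values
        for h in range(num_heads):
            scores = (K[:, h, :] @ Q[s, h]) * scale       # (L,) scaled dot products
            scores -= scores.max()                        # numerically stable softmax
            w = np.exp(scores)
            w /= w.sum()
            output[s, h] = w @ V[:, h, :]                 # (head_dim,) weighted sum of values
</code></pre>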
<h2>Example</h2>
<p>
Input: <code>batch_size</code> = 1, <code>num_heads</code> = 1, <code>head_dim</code> = 4,
<code>block_size</code> = 2, <code>context_lens</code> = [2], <code>block_table</code> = [[0]]
</p>
<p>
\(Q[0, 0] = \begin{bmatrix} 1.0 & 1.0 & 0.0 & 0.0 \end{bmatrix}\)
</p>
<p>
Keys gathered from block 0 (2 tokens):
\[
K_0 = \begin{bmatrix} 1.0 & 0.0 & 0.0 & 0.0 \end{bmatrix}, \quad
K_1 = \begin{bmatrix} 0.0 & 1.0 & 0.0 & 0.0 \end{bmatrix}
\]
Values gathered from block 0:
\[
V_0 = \begin{bmatrix} 2.0 & 0.0 & 0.0 & 0.0 \end{bmatrix}, \quad
V_1 = \begin{bmatrix} 0.0 & 4.0 & 0.0 & 0.0 \end{bmatrix}
\]
</p>
<p>
Scores (before softmax):
\[
s_0 = \frac{Q \cdot K_0}{\sqrt{4}} = \frac{1}{2} = 0.5, \quad
s_1 = \frac{Q \cdot K_1}{\sqrt{4}} = \frac{1}{2} = 0.5
\]
Attention weights: \(\text{softmax}([0.5, 0.5]) = [0.5, 0.5]\)
\[
\text{output}[0, 0] = 0.5 \cdot V_0 + 0.5 \cdot V_1 =
\begin{bmatrix} 1.0 & 2.0 & 0.0 & 0.0 \end{bmatrix}
\]
</p>
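<p>
As a usage check, the worked example can be reproduced numerically (this reuses the hypothetical
<code>solve_reference</code> sketch above; values and shapes are taken directly from the example):
</p>
<pre><code>import numpy as np

# One sequence, one head, head_dim = 4, block_size = 2, two context tokens in physical block 0.
Q = np.array([[[1.0, 1.0, 0.0, 0.0]]], dtype=np.float32)       # (1, 1, 4)
K_cache = np.zeros((1, 2, 1, 4), dtype=np.float32)             # (num_blocks, block_size, heads, dim)
V_cache = np.zeros((1, 2, 1, 4), dtype=np.float32)
K_cache[0, 0, 0] = [1.0, 0.0, 0.0, 0.0]
K_cache[0, 1, 0] = [0.0, 1.0, 0.0, 0.0]
V_cache[0, 0, 0] = [2.0, 0.0, 0.0, 0.0]
V_cache[0, 1, 0] = [0.0, 4.0, 0.0, 0.0]
block_table = np.array([[0]], dtype=np.int32)
context_lens = np.array([2], dtype=np.int32)
output = np.zeros((1, 1, 4), dtype=np.float32)

solve_reference(Q, K_cache, V_cache, block_table, context_lens, output,
                batch_size=1, num_heads=1, head_dim=4, block_size=2, max_blocks_per_seq=1)
print(output[0, 0])  # expected: [1. 2. 0. 0.]
</code></pre>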
<h2>Constraints</h2>
<ul>
<li>1 ≤ <code>batch_size</code> ≤ 32</li>
<li>1 ≤ <code>num_heads</code> ≤ 64</li>
<li>1 ≤ <code>head_dim</code> ≤ 256; <code>head_dim</code> is a multiple of 8</li>
<li>1 ≤ <code>block_size</code> ≤ 64; <code>block_size</code> is a power of 2</li>
<li>1 ≤ <code>context_lens[s]</code> ≤ 8,192 for all sequences <code>s</code></li>
<li>All input tensors are on the GPU and in <code>float32</code> (except <code>block_table</code> and <code>context_lens</code>, which are <code>int32</code>)</li>
<li><code>block_table[s, i]</code> is a valid index into the first dimension of <code>K_cache</code> for all <code>i < ceil(context_lens[s] / block_size)</code></li>
<li>Performance is measured with <code>batch_size</code> = 8, <code>num_heads</code> = 32, <code>head_dim</code> = 128, <code>block_size</code> = 16, and <code>context_lens[s]</code> = 2,048 for all sequences</li>
</ul>
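<p>
For scale, inputs matching the stated benchmark configuration could be constructed roughly as follows (a hypothetical
setup sketch; the sequential block-table layout is just one valid mapping, and any other valid mapping would work):
</p>
<pre><code>import numpy as np

batch_size, num_heads, head_dim, block_size = 8, 32, 128, 16
context_len = 2048
blocks_per_seq = context_len // block_size            # 128 logical blocks per sequence
num_blocks = batch_size * blocks_per_seq              # 1024 physical blocks in the pool

rng = np.random.default_rng(0)
Q = rng.standard_normal((batch_size, num_heads, head_dim), dtype=np.float32)
K_cache = rng.standard_normal((num_blocks, block_size, num_heads, head_dim), dtype=np.float32)
V_cache = rng.standard_normal((num_blocks, block_size, num_heads, head_dim), dtype=np.float32)
# Assign each sequence a contiguous run of physical blocks.
block_table = np.arange(num_blocks, dtype=np.int32).reshape(batch_size, blocks_per_seq)
context_lens = np.full(batch_size, context_len, dtype=np.int32)
output = np.zeros((batch_size, num_heads, head_dim), dtype=np.float32)
</code></pre>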
Don't really have to say this in the implementation requirements. @claude change this to match the format of the other challenges' implementation requirements.