
Add challenge 84: SwiGLU MLP Block (Medium) #221

Open
claude[bot] wants to merge 2 commits into main from add-challenge-84-swiglu-mlp-block

Conversation

Contributor

claude[bot] commented Mar 19, 2026

Summary

  • Adds challenge 84: SwiGLU MLP Block (Medium difficulty)
  • Implements the feedforward network from LLaMA, Mistral, Gemma, and most modern LLMs: output = (SiLU(x × W_gate) ⊙ (x × W_up)) × W_down
  • Distinct from the existing SwiGLU challenge (#54), which implements only the element-wise gate silu(x1) * x2; this challenge includes all three matrix multiplications and is a self-contained inference building block
  • Validated on NVIDIA Tesla T4 via run_challenge.py
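For reference, the block's math can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the repo's reference_impl (which operates on GPU tensors and asserts shape/dtype/device):

```python
import numpy as np

def silu(v):
    # SiLU / swish: v * sigmoid(v)
    return v / (1.0 + np.exp(-v))

def swiglu_mlp(x, w_gate, w_up, w_down):
    # x: [M, d_model]; w_gate, w_up: [d_model, d_ffn]; w_down: [d_ffn, d_model]
    gate = silu(x @ w_gate)      # [M, d_ffn]
    up = x @ w_up                # [M, d_ffn]
    return (gate * up) @ w_down  # [M, d_model]
```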

Why this challenge is interesting

Solvers must implement three chained matrix multiplications with a gated nonlinearity in between. The GPU programming challenges are:

  • Parallelizing across the two independent gate/up projections
  • Managing the [M, d_ffn] intermediate tensors efficiently (memory bandwidth)
  • Optionally fusing the elementwise SiLU + multiply into one of the matmuls to reduce memory round-trips

The performance test uses LLaMA-3 8B dimensions: M = 512, d_model = 4096, d_ffn = 14336.

Checklist

challenge.html

  • Starts with <p> (problem description)
  • <h2> sections: Implementation Requirements, Example, Constraints
  • First example matches generate_example_test() (identity-like matrices, output ≈ [[0.7311, 0], [0, 0.7311]])
  • Examples use <pre> consistently (1D/sequential data)
  • Constraints includes performance test size bullet: M = 512, d_model = 4,096, d_ffn = 14,336
  • SVG visualization included (dark theme, shows gate/up/SiLU/multiply/down dataflow)
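The expected first-example output is easy to sanity-check by hand: with identity-like inputs and weights, each diagonal entry is SiLU(1) = 1/(1 + e^-1) ≈ 0.7311 and the off-diagonals vanish. A quick NumPy check, assuming 2×2 identity matrices throughout (the exact generate_example_test() setup may differ):

```python
import numpy as np

silu = lambda v: v / (1.0 + np.exp(-v))

x = np.eye(2)  # identity-like input
w = np.eye(2)  # identity weights used for gate, up, and down
out = (silu(x @ w) * (x @ w)) @ w
# diagonal entries: silu(1) = 1/(1 + e**-1) ≈ 0.7311
# off-diagonal entries: silu(0) * 0 = 0
```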

challenge.py

  • class Challenge inherits ChallengeBase
  • __init__ with name, atol, rtol, num_gpus, access_tier
  • reference_impl has assertions on shape, dtype, device
  • All 6 methods present
  • generate_functional_test returns 10 cases: edge (1, 2 rows), zero, power-of-2 (16×32, 64×64), non-power-of-2 (30, 100, 255), realistic (128×256, 256×512)
  • generate_performance_test fits 5× in 16GB VRAM (weights ≈ 670MB × 5 = 3.4GB)
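The VRAM figure checks out arithmetically, assuming float32 weights (the dtype here is an assumption, not stated in the PR):

```python
d_model, d_ffn = 4096, 14336
bytes_per_elem = 4  # assuming float32 weights
# three weight matrices: W_gate, W_up (d_model x d_ffn) and W_down (d_ffn x d_model)
per_instance = 3 * d_model * d_ffn * bytes_per_elem
print(per_instance / 2**20)      # 672.0 MiB per instance (~670 MB)
print(5 * per_instance / 2**30)  # 3.28125 GiB for five instances
```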

Starter files

  • All 6 present: .cu, .pytorch.py, .triton.py, .jax.py, .cute.py, .mojo
  • Exactly 1 parameter description comment per file
  • CUDA/Mojo: "are device pointers" (medium, no parenthetical)
  • Python frameworks: "are tensors on the GPU"; JAX also has # return output tensor directly
  • Starters compile but produce no output

General

  • Directory: 84_swiglu_mlp_block
  • Linting passes: pre-commit run --all-files

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix overlapping text in SVG: separate gate/up branch labels, add tensor
shape annotations at each stage, color-code converging arrows. Convert
example from <pre> to LaTeX bmatrix with proper math notation for each
intermediate computation step.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@shxjames
Contributor

[Screenshots attached: 2026-03-26 at 21:49:51 and 21:49:43]
