Nax Refactor #3271

Merged
jagrit06 merged 3 commits into main from nax-refactor on Mar 18, 2026

@jagrit06
Member

Proposed changes

  • Refactoring NAX to not rely on the intermediate sub-tile

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

@jagrit06
Member Author

This PR has no major performance implications.

Attention benchmark numbers, for example:
Before:

==================================
  B,   qsl,   ksl, hdim, n_qh, n_kvh, tp,   dtype, causal, t_unfs,  t_fus, tflops, speed_up
  1,  1024,  1024,   64,   32,    32,  0, float16,      0,  0.159,  0.038, 14.299,    4.147
  1,  2048,  2048,   64,   32,    32,  0, float16,      0,  0.611,  0.142, 15.436,    4.290
  1,  4096,  4096,   64,   32,     8,  0, float16,      0,  2.453,  0.552, 15.921,    4.439
  1,   512,  4096,   64,   32,    32,  0, float16,      0,  0.313,  0.074, 14.774,    4.204
  1,  1024,  1024,  128,   32,    32,  0, float16,      0,  0.186,  0.087, 12.571,    2.123
  1,  2048,  2048,  128,   32,     8,  0, float16,      0,  0.719,  0.325, 13.519,    2.209
  1,  4096,  4096,  128,   16,     2,  0, float16,      0,  1.426,  0.645, 13.646,    2.212
  1,  4096,  4096,  128,   32,     8,  0, float16,      0,  2.865,  1.295, 13.586,    2.213
  1,   512,  4096,  128,   32,     8,  0, float16,      0,  0.359,  0.171, 12.871,    2.098

==================================
  B,   qsl,   ksl, hdim, n_qh, n_kvh, tp,   dtype, causal, t_unfs,  t_fus, tflops, speed_up
  1,  1024,  1024,   64,   32,    32,  0, float16,      1,  0.232,  0.022, 12.455,    10.499
  1,  2048,  2048,   64,   32,    32,  0, float16,      1,  0.955,  0.079, 13.918,    12.091
  1,  4096,  4096,   64,   32,     8,  0, float16,      1,  3.552,  0.280, 15.719,    12.696
  1,   512,  4096,   64,   32,    32,  0, float16,      1,  0.456,  0.070,  7.838,    6.504
  1,  1024,  1024,  128,   32,    32,  0, float16,      1,  0.258,  0.053, 10.402,    4.889
  1,  2048,  2048,  128,   32,     8,  0, float16,      1,  0.995,  0.179, 12.309,    5.571
  1,  4096,  4096,  128,   16,     2,  0, float16,      1,  1.979,  0.345, 12.760,    5.741
  1,  4096,  4096,  128,   32,     8,  0, float16,      1,  3.962,  0.684, 12.855,    5.790
  1,   512,  4096,  128,   32,     8,  0, float16,      1,  0.497,  0.168,  6.535,    2.955

After:

==================================
  B,   qsl,   ksl, hdim, n_qh, n_kvh, tp,   dtype, causal, t_unfs,  t_fus, tflops, speed_up
  1,  1024,  1024,   64,   32,    32,  0, float16,      0,  0.155,  0.039, 14.263,    4.019
  1,  2048,  2048,   64,   32,    32,  0, float16,      0,  0.606,  0.142, 15.473,    4.261
  1,  4096,  4096,   64,   32,     8,  0, float16,      0,  2.393,  0.552, 15.938,    4.336
  1,   512,  4096,   64,   32,    32,  0, float16,      0,  0.300,  0.074, 14.864,    4.061
  1,  1024,  1024,  128,   32,    32,  0, float16,      0,  0.183,  0.089, 12.412,    2.067
  1,  2048,  2048,  128,   32,     8,  0, float16,      0,  0.714,  0.331, 13.305,    2.160
  1,  4096,  4096,  128,   16,     2,  0, float16,      0,  1.392,  0.653, 13.462,    2.131
  1,  4096,  4096,  128,   32,     8,  0, float16,      0,  2.808,  1.311, 13.423,    2.143
  1,   512,  4096,  128,   32,     8,  0, float16,      0,  0.350,  0.172, 12.776,    2.036

==================================
  B,   qsl,   ksl, hdim, n_qh, n_kvh, tp,   dtype, causal, t_unfs,  t_fus, tflops, speed_up
  1,  1024,  1024,   64,   32,    32,  0, float16,      1,  0.227,  0.022, 12.252,    10.132
  1,  2048,  2048,   64,   32,    32,  0, float16,      1,  0.872,  0.076, 14.432,    11.446
  1,  4096,  4096,   64,   32,     8,  0, float16,      1,  3.452,  0.281, 15.670,    12.298
  1,   512,  4096,   64,   32,    32,  0, float16,      1,  0.441,  0.071,  7.716,    6.195
  1,  1024,  1024,  128,   32,    32,  0, float16,      1,  0.253,  0.053, 10.379,    4.774
  1,  2048,  2048,  128,   32,     8,  0, float16,      1,  0.983,  0.179, 12.275,    5.490
  1,  4096,  4096,  128,   16,     2,  0, float16,      1,  1.933,  0.346, 12.707,    5.585
  1,  4096,  4096,  128,   32,     8,  0, float16,      1,  3.882,  0.685, 12.834,    5.664
  1,   512,  4096,  128,   32,     8,  0, float16,      1,  0.487,  0.168,  6.548,    2.899

Member

@angeloskath angeloskath left a comment

Awesome!

I may be reading too much into the tea leaves, but it looks like there is a small but consistent improvement for unfused and a tiny but fairly consistent regression for fused 🤷‍♂️
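
One way to check that reading against the posted numbers is to compare the `t_unfs` and `t_fus` columns row by row. The sketch below is only an illustration: the values are copied from the Before/After tables in this thread, the helper name `summarize` is hypothetical, and because the timings are rounded to three decimals, small differences sit at the resolution limit of the data.

```python
# Timings copied from the Before/After tables as (t_unfs, t_fus) pairs,
# in row order: the nine non-causal rows first, then the nine causal rows.
before = [
    (0.159, 0.038), (0.611, 0.142), (2.453, 0.552), (0.313, 0.074), (0.186, 0.087),
    (0.719, 0.325), (1.426, 0.645), (2.865, 1.295), (0.359, 0.171),
    (0.232, 0.022), (0.955, 0.079), (3.552, 0.280), (0.456, 0.070), (0.258, 0.053),
    (0.995, 0.179), (1.979, 0.345), (3.962, 0.684), (0.497, 0.168),
]
after = [
    (0.155, 0.039), (0.606, 0.142), (2.393, 0.552), (0.300, 0.074), (0.183, 0.089),
    (0.714, 0.331), (1.392, 0.653), (2.808, 1.311), (0.350, 0.172),
    (0.227, 0.022), (0.872, 0.076), (3.452, 0.281), (0.441, 0.071), (0.253, 0.053),
    (0.983, 0.179), (1.933, 0.346), (3.882, 0.685), (0.487, 0.168),
]


def summarize(column):
    """Return (faster, slower, equal, mean relative change) for one timing column.

    column 0 is t_unfs, column 1 is t_fus. A negative mean change means the
    "After" timings are lower (faster) on average.
    """
    pairs = [(b[column], a[column]) for b, a in zip(before, after)]
    faster = sum(new < old for old, new in pairs)
    slower = sum(new > old for old, new in pairs)
    equal = len(pairs) - faster - slower
    mean_change = sum((new - old) / old for old, new in pairs) / len(pairs)
    return faster, slower, equal, mean_change


unfused = summarize(0)
fused = summarize(1)

for name, (faster, slower, equal, mean_change) in (("unfused", unfused), ("fused", fused)):
    print(f"{name}: {faster} faster, {slower} slower, {equal} equal, "
          f"mean change {mean_change:+.2%}")
```

On these rounded values, the unfused time drops in every row, while the fused time is unchanged or marginally higher in most rows, which is consistent with the observation above.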

@jagrit06 jagrit06 merged commit b41b349 into main Mar 18, 2026
16 checks passed
@jagrit06 jagrit06 deleted the nax-refactor branch March 18, 2026 17:26
