Releases · flashinfer-ai/flashinfer
v0.2.9rc2
What's Changed
- Reduce the JIT compilation time of gen_gemm_sm100_module by @jinyangyuan-nvidia in #1251
- fix: correctly pass k_scale and v_scale to run() in forward_return_lse (#1023) by @vlev02 in #1254
- Made AR output optional + aesthetic changes by @nvmbreughe in #1265
- init: add fp8 GEMM using cudnn backend by @ttyio in #1264
- Feature/sm100 low latency nvfp4 kernels by @azhurkevich in #1214
- CI: install `nvidia-nvshmem-cu12` by @EmilienM in #1262
- feat: enable trtllm-gen mla MTP by @yyihuang in #1258
- Add trtllm-gen attention mha kernel with FP8 Q/K/V and FP8 output by @weireweire in #1242
- add trtllm-gen context attention by @IwakuraRein in #1239
- feat: add masked deepgemm support and benchmarking by @cyx-6 in #1266
- Add missing import in `comm/__init__.py` by @joker-eph in #1275
- hotfix: fix deepgemm artifactory hash by @cyx-6 in #1278
- Unify groupwise fp8 GEMM test by @cyx-6 in #1281
- fix: update trtllm-gen fmha benchmark by @yyihuang in #1280
- fix multiCtasKvScratchPtr misalignment issue (new one) by @nvpohanh in #1286
- Fix install folder regression, and JIT-vs-AOT differences by @directhex in #1279
- Add shuffle matrix flag by @aleozlx in #1272
- Convert scale_factor from scalar to Tensor in trt_allreduce_fusion by @ilmarkov in #1284 (see the sketch after this list)
- patch error handling by @aleozlx in #1293
- Bug fix: guard fp8 e8m0 and e2m1 compile by @Edenzzzz in #1287
- refactor: Improved metainfo for trtllm-gen fmha by @cyx-6 in #1292
- add mm_fp4 using cudnn backend by @ttyio in #1288
- fix: minor errors in cubin loader by @yyihuang in #1295
- perf: use lightweight API to query device property by @azhurkevich in #1298
- refactor: refactor trtllm-gen attention kernel integration code by @yzh119 in #1289
- Remove FAST_BUILD FLAG for MOE by @wenscarl in #1291
- bugfix: ensure graph is captured and executed on the same stream to avoid rep… by @elfiegg in #1303
- minor: fixes and cleanup for trtllm-gen mha by @yyihuang in #1302
- [Feature] SM level profiler by @Edenzzzz in #1305
- Heuristics + testing unification + CUDA Graphs by @azhurkevich in #1306
- Update cutlass fp4 moe kernels by @wenscarl in #1294
- Fix the bug of the kernel-selection heuristic in trtllm-gen by @PerkzZheng in #1307
- test qkvo quantization scales not equal to 1 by @weireweire in #1314
- [fix] fix integer overflow in FA2 customized_mask & add buffer overflow warning. by @happierpig in #1290
- Addition of flashinfer_benchmark.py for benchmarking routines by @bkryu in #1323
- minor: update devcontainer by @yyihuang in #1329
- Fix redundant argument in TrtllmGenDecodeModule by @IwakuraRein in #1326
- Optimizations for TRTLLM MNNVL Allreduce by @timlee0212 in #1321
- add torch float4_e2m1fn_x2 check for cudnn fp4 backend by @ttyio in #1333
- only add cudnn dependency for x86 platform by @ttyio in #1332
- Make Fp8 MoE routing_bias optional by @aleozlx in #1319
- feat: Add weight layout option for trtllm-gen fused moe by @aleozlx in #1297
- [Fix] remove torch 2.8 requirement for FP4 GEMM by @elfiegg in #1334
- Bug fix: fix duplicate launch in POD by @Edenzzzz in #1267
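A minimal sketch of the scalar-to-Tensor conversion from #1284 above; the variable names are illustrative and the real trt_allreduce_fusion signature is not reproduced here:

```python
import torch

# Illustrative only: #1284 wraps a Python float scale factor in a
# 0-dim CUDA tensor so the fused all-reduce kernel can read it on-device.
scale_factor = 0.5
scale_tensor = torch.tensor(scale_factor, dtype=torch.float32, device="cuda")
assert scale_tensor.dim() == 0  # a 0-dim tensor, not a Python scalar
```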
New Contributors
- @vlev02 made their first contribution in #1254
- @ttyio made their first contribution in #1264
- @azhurkevich made their first contribution in #1214
- @weireweire made their first contribution in #1242
- @IwakuraRein made their first contribution in #1239
- @nvpohanh made their first contribution in #1286
- @directhex made their first contribution in #1279
- @ilmarkov made their first contribution in #1284
- @elfiegg made their first contribution in #1303
- @PerkzZheng made their first contribution in #1307
- @bkryu made their first contribution in #1323
- @timlee0212 made their first contribution in #1321
Full Changelog: v0.2.8...v0.2.9rc2
v0.2.9rc1
What's Changed
- Reduce the JIT compilation time of gen_gemm_sm100_module by @jinyangyuan-nvidia in #1251
- fix: correctly pass k_scale and v_scale to run() in forward_return_lse (#1023) by @vlev02 in #1254
- Made AR output optional + aesthetic changes by @nvmbreughe in #1265
- init: add fp8 GEMM using cudnn backend by @ttyio in #1264
- Feature/sm100 low latency nvfp4 kernels by @azhurkevich in #1214
- CI: install `nvidia-nvshmem-cu12` by @EmilienM in #1262
- feat: enable trtllm-gen mla MTP by @yyihuang in #1258
- Add trtllm-gen attention mha kernel with FP8 Q/K/V and FP8 output by @weireweire in #1242
- add trtllm-gen context attention by @IwakuraRein in #1239
- feat: add masked deepgemm support and benchmarking by @cyx-6 in #1266
- Add missing import in `comm/__init__.py` by @joker-eph in #1275
- hotfix: fix deepgemm artifactory hash by @cyx-6 in #1278
- Unify groupwise fp8 GEMM test by @cyx-6 in #1281
- fix: update trtllm-gen fmha benchmark by @yyihuang in #1280
- fix multiCtasKvScratchPtr misalignment issue (new one) by @nvpohanh in #1286
- Fix install folder regression, and JIT-vs-AOT differences by @directhex in #1279
- Add shuffle matrix flag by @aleozlx in #1272
- Convert scale_factor from scalar to Tensor in trt_allreduce_fusion by @ilmarkov in #1284
- patch error handling by @aleozlx in #1293
- Bug fix: guard fp8 e8m0 and e2m1 compile by @Edenzzzz in #1287
- refactor: Improved metainfo for trtllm-gen fmha by @cyx-6 in #1292
- add mm_fp4 using cudnn backend by @ttyio in #1288
- fix: minor errors in cubin loader by @yyihuang in #1295
- perf: use lightweight API to query device property by @azhurkevich in #1298
- refactor: refactor trtllm-gen attention kernel integration code by @yzh119 in #1289
- Remove FAST_BUILD FLAG for MOE by @wenscarl in #1291
- bugfix: ensure graph is captured and executed on the same stream to avoid rep… by @elfiegg in #1303
- minor: fixes and cleanup for trtllm-gen mha by @yyihuang in #1302
- [Feature] SM level profiler by @Edenzzzz in #1305
- Heuristics + testing unification + CUDA Graphs by @azhurkevich in #1306
- Update cutlass fp4 moe kernels by @wenscarl in #1294
New Contributors
- @vlev02 made their first contribution in #1254
- @ttyio made their first contribution in #1264
- @azhurkevich made their first contribution in #1214
- @weireweire made their first contribution in #1242
- @IwakuraRein made their first contribution in #1239
- @nvpohanh made their first contribution in #1286
- @directhex made their first contribution in #1279
- @ilmarkov made their first contribution in #1284
- @elfiegg made their first contribution in #1303
Full Changelog: v0.2.8...v0.2.9rc1
v0.2.8
What's Changed
- [fix] fix BatchAttention CTA_TILE_KV mask issue by @happierpig in #1206
- feat: enable and update all-reduce fused quantization by @yyihuang in #1164
- Fix the issue with auxiliary kernel launch and grid dim calculation by @Anerudhan in #1208
- Fix test_groupwise_scaled_gemm_fp8.py by @jinyangyuan-nvidia in #1211
- [TVM] Remove `enable_pdl` from TVM binding interface by @MasterJH5574 in #1217
- misc: minor adds in readme by @yyihuang in #1218
- bugfix: fix blackwell fmha hanging issue for empty kv_len by @yzh119 in #1198
- update trtllm-gen decode attention kernel launcher by @wenscarl in #1189
- Handle allocation cutlass fused MoE output to caller by @wenscarl in #1225
- Fix missing hash in the cudnn cubin path by @Anerudhan in #1227
- bugfix: add logits processor to pyproject.toml by @yzh119 in #1224
- fix: add trtllm-allreduce-fusion api notes and fix memory error by @yyihuang in #1229
- feat: Add non-causal cudnn prefill kernels by @Anerudhan in #1230
- minor: update oneshot handling, add params notes by @yyihuang in #1232
- Enable cudnn decode and add tests for the cudnn decode kernel by @Anerudhan in #1221
- docker: add cuda-python to CI docker image by @yzh119 in #1233
- bugfix: Fix building without `get_requires*()` invocation by @mgorny in #1226
- bugfix: support uint8_t for vec_t class template by @chenyang78 in #1234
- feat: trtllm-gen fp8 moe kernels by @aleozlx in #1212
- Patch fp8 cubin availability by @aleozlx in #1240
- [comm] TRT-LLM's Multi-Node NVLink All-Reduce Kernel by @nvmbreughe in #1213
- feat: Support MXFP8 x MXFP4 CUTLASS grouped GEMM by @jinyangyuan-nvidia in #1241
- feat: add trtllm-gen mla cubin by @yyihuang in #1222
- Add DeepGEMM kernels by @cyx-6 in #1209
- Remove sm100+ requirement for trtllm allreduce kernels by @yzh119 in #1249
- Defer mpi import for comm module by @yzh119 in #1250
- feat: support environment variable overrides for NVSHMEM paths and linker flags by @EmilienM in #1253 (see the sketch after this list)
- release: bump version to v0.2.8 by @yzh119 in #1257
- TRT-LLM's Multi-Node NVLink AR + fused RMSNorm kernel by @nvmbreughe in #1255
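A hedged sketch of the NVSHMEM overrides from #1253 above. The variable names below are assumptions inferred from the PR title, not verified against the diff:

```python
import os

# Assumed variable names -- check PR #1253 for the authoritative ones.
os.environ["FLASHINFER_NVSHMEM_INCLUDE_DIR"] = "/opt/nvshmem/include"
os.environ["FLASHINFER_NVSHMEM_LIB_DIR"] = "/opt/nvshmem/lib"

# Overrides must be in place before flashinfer's JIT machinery first runs.
import flashinfer  # noqa: E402
```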
New Contributors
- @jinyangyuan-nvidia made their first contribution in #1211
- @mgorny made their first contribution in #1226
- @chenyang78 made their first contribution in #1234
- @aleozlx made their first contribution in #1212
- @nvmbreughe made their first contribution in #1213
- @EmilienM made their first contribution in #1253
Full Changelog: v0.2.7.post1...v0.2.8
v0.2.8rc1
What's Changed
- [fix] fix BatchAttention CTA_TILE_KV mask issue by @happierpig in #1206
- feat: enable and update all-reduce fused quantization by @yyihuang in #1164
- Fix the issue with auxiliary kernel launch and grid dim calculation by @Anerudhan in #1208
- Fix test_groupwise_scaled_gemm_fp8.py by @jinyangyuan-nvidia in #1211
- [TVM] Remove `enable_pdl` from TVM binding interface by @MasterJH5574 in #1217
- misc: minor adds in readme by @yyihuang in #1218
- bugfix: fix blackwell fmha hanging issue for empty kv_len by @yzh119 in #1198
- update trtllm-gen decode attention kernel launcher by @wenscarl in #1189
- Handle allocation cutlass fused MoE output to caller by @wenscarl in #1225
- Fix missing hash in the cudnn cubin path by @Anerudhan in #1227
- bugfix: add logits processor to pyproject.toml by @yzh119 in #1224
- fix: add trtllm-allreduce-fusion api notes and fix memory error by @yyihuang in #1229
- feat: Add non-causal cudnn prefill kernels by @Anerudhan in #1230
- minor: update oneshot handling, add params notes by @yyihuang in #1232
- Enable cudnn decode and add tests for the cudnn decode kernel by @Anerudhan in #1221
- docker: add cuda-python to CI docker image by @yzh119 in #1233
- bugfix: Fix building without `get_requires*()` invocation by @mgorny in #1226
- bugfix: support uint8_t for vec_t class template by @chenyang78 in #1234
New Contributors
- @jinyangyuan-nvidia made their first contribution in #1211
- @mgorny made their first contribution in #1226
- @chenyang78 made their first contribution in #1234
Full Changelog: v0.2.7.post1...v0.2.8rc1
v0.2.7.post1
What's Changed
- [feat] optimize persistent batch attention perf. by @happierpig in #1200
- Feature/cudnn dynamic cubin by @Anerudhan in #1187
- Fix flashinfer.comm module missing by @BBuf in #1203
- chore: bump flashinfer v0.2.7.post1 by @zhyncs in #1205
New Contributors
- @Anerudhan made their first contribution in #1187
- @BBuf made their first contribution in #1203
Full Changelog: v0.2.7...v0.2.7.post1
v0.2.7
What's Changed
- ci: Update images for self-hosted ARM64 runner by @yongwww in #1128
- Fix pointer dtype bug in rope by @Edenzzzz in #1129
- feat: update and test create_ipc_buffer by @yyihuang in #1130
- misc: update runllm widget by @yzh119 in #1132
- misc: correct runllm widget (again) by @MasterJH5574 in #1133
- [Feature] Support PDL for batch Prefill and Decode by @Edenzzzz in #1117
- fix: negative zero by type trait --> binary value by @yyihuang in #1136
- fix: sync after create_workspace by @yyihuang in #1138
- refactor: use functools.cache instead of global dict for caching modules by @yzh119 in #1135
- [feat] add unified batch attention w/ correctness tests. by @happierpig in #1137
- Fix FA2 and FA3 multi-item scoring and cuda illegal memory access error by @arde171 in #1140
- feat: Add support for FLASHINFER_EXTRA_LDFLAGS environment variable by @jennifgcrl in #1144 (see the sketch after this list)
- misc: remove sync between persistent runners and use packed_causal_kv_end for SM90Plan by @Edenzzzz in #1146
- [fix] fix precision errors when applying causal mask on Qwen-2.5 series models by @happierpig in #1148
- ci: Install mpi4py by @yongwww in #1149
- feat: add trtllm moe_allreduce_fusion by @yyihuang in #1108
- feat: add trtllm all-reduce fusion by @yyihuang in #1131
- Add more logging to TRTLLM-GEN debug trace (NFC) by @joker-eph in #1158
- feat: update non-fused moe by @yyihuang in #1161
- Add fp4 quantization swizzling tests by @wenscarl in #1157
- refactor: communication module by @yyihuang in #1162
- feat: add finalize_moe_allreduce from trtllm by @yyihuang in #1159
- feat: experimental support of green ctx by @yzh119 in #1163
- feat: Fused temperature online softmax kernel by @xslingcn in #1153
- MNNVL MoE All-to-All Support by @cyx-6 in #1134
- feat: nvshmem python bindings by @yzh119 in #1160
- Fix missing symbols in trtllm_utils.so by @tiran in #1168
- feat: logits processor fusion rule for temperature softmax by @xslingcn in #1170
- Expose fp4 blockscale swizzling kernel by @wenscarl in #1176
- add nvshmem sum_reduce for mnnvl allreduce by @Amir-19 in #1152
- bugfix: softmax NaN results caused by large -inf masks by @xslingcn in #1178
- [CI] Update is_last_build by @yongwww in #1183
- [feat] support block sparse attention w/ variable block sizes and head-wise sparse patterns by @happierpig in #1177
- bugfix: fix invalid blackwell fmha unittests by @yzh119 in #1181
- feat: support green ctx creation by a list of SM counts by @Conless in #1190
- fix: trtllm_comm module aot arch issues by @yyihuang in #1196
- bugfix: fix broken docs build by adding missing dependencies by @Conless in #1197
- chore: bump v0.2.7 by @zhyncs in #1199
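A minimal sketch of the FLASHINFER_EXTRA_LDFLAGS variable from #1144 above, assuming the value is a single space-separated flag string:

```python
import os

# Extra flags appended when flashinfer JIT-links its extensions.
# The space-separated-string format is an assumption; see PR #1144.
os.environ["FLASHINFER_EXTRA_LDFLAGS"] = (
    "-L/usr/local/cuda/lib64 -Wl,-rpath,/usr/local/cuda/lib64"
)

import flashinfer  # noqa: E402 -- set the variable before first import
```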
New Contributors
- @jennifgcrl made their first contribution in #1144
- @tiran made their first contribution in #1168
- @Amir-19 made their first contribution in #1152
- @Conless made their first contribution in #1190
Full Changelog: v0.2.6.post1...v0.2.7
v0.2.6.post1
What's Changed
- [CI] Add x86_64 tag for x86 self-hosted runner by @yongwww in #1126
- hotfix: fix installation script behavior by @yzh119 in #1125
Full Changelog: v0.2.6...v0.2.6.post1
v0.2.6
What's Changed
- ci: select 2_28 manylinux builder for new torch+cuda versions by @yzh119 in #1000
- misc: update README.md by @yzh119 in #1003
- bugfix: Fix illegal memory access due to custom mask ptr by @yongchaoding in #1008
- misc: fix kv-layout doc references by @Edenzzzz in #1009
- misc: more benchmark scripts in Python by @yzh119 in #1010
- misc: fix instrument code for mla profiler by @yzh119 in #1014
- bugfix: import wrapper of mla decode by @dhy2000 in #1013
- feat: update decode attention APIs by @yzh119 in #1007
- doc: use latest protobuf for profiler by @xslingcn in #1021
- feat: SM-constraint Communication Kernels by @yyihuang in #994
- feat: ragged tensor padding kernel for blackwell kernel alignment by @yzh119 in #1025
- bugfix: fix custom mask not being reset after converting custom mask into causal or non-causal by @yongchaoding in #1028
- fix: add zero init for KV tiled copy by @happierpig in #1029
- [NVIDIA] Add Cutlass MLA backend by @kaixih in #1031
- Add workflow to build aarch64 wheel by @yongwww in #1036
- Non-blocking host-to-device copy in the ragged prefill wrapper by @nandor in #1040
- fix: remove default ubuntu user in Lunar/Noble by @rickyfeng0119 in #1042
- feat: Softmax free sampling by @kf-zhang in #1035
- feat: add functional per-head FP8 quantization for FA3 by @happierpig in #1033
- add multi-item scoring by @arde171 in #1015
- [nvidia] cutlass fp8 blockwise/groupwise gemm support by @cyx-6 in #1045
- [nvidia] cutlass fp8 groupwise grouped gemm support by @cyx-6 in #1047
- fix: top_k_mask_logits hangs on -inf inputs by @xslingcn in #1050 (see the sketch after this list)
- Benchmark: POD vs batched prefill by @Edenzzzz in #1052
- [nvidia] initial support for blackwell kernels by @yzh119 in #1039
- Fix KV chunking for POD. by @AKKamath in #1054
- bugfix: temporally disable split-kv in blackwell mla by @yzh119 in #1055
- bugfix: remove device allocation by @yzh119 in #1056
- Parameterize prefix mask call (needed by POD-Attention) by @AKKamath in #1059
- bugfix: move `cum_m` calculation inside kernels by @yzh119 in #1060
- misc: add pull request template by @yzh119 in #1062
- bugfix: Cast build paths to str before setuputils Extension by @farnasirim in #1058
- Add PyTorch 2.7.0 build by @huydhn in #1063
- bugfix: adding lse output to blackwell fmha kernels by @yzh119 in #1071
- bugfix: follow user-specified sm_scale for blackwell cutlass fmha by @yzh119 in #1072
- misc: jit: Introduce JitSpec and Generate ninja file by @abcdabcd987 in #1065
- fix: fix a typo in docs by @acelyc111 in #1077
- misc: jit: Deprecate `load_cuda_ops()` by @abcdabcd987 in #1066
- misc: jit: fix missing _get_glibcxx_abi_build_flags by @abcdabcd987 in #1080
- misc: jit: Refactor gen JitSpec out of get_xxx_module by @abcdabcd987 in #1069
- misc: jit: Replace parallel_load_modules() with build_jit_specs() by @abcdabcd987 in #1070
- misc: jit: Import jit_env as a module by @abcdabcd987 in #1073
- misc: aot: Add script to build all AOT ops by @abcdabcd987 in #1067
- misc: aot: Refactor AOT packaging by @abcdabcd987 in #1075
- misc: aot: Remove has_prebuilt_ops by @abcdabcd987 in #1076
- ci: upgrade docker ci image by @yzh119 in #1082
- bugfix: fix custom allreduce compilation in AOT mode by @yzh119 in #1083
- perf: accelerate blackwell grouped gemm by @yzh119 in #1086
- misc: update pull request template by @yzh119 in #1088
- Fix Cutlass grouped GEMM stride by @cyx-6 in #1081
- bugfix: fix fp8 attention kernels aot compilation issue by @yzh119 in #1087
- comm: refactor and initialize `flashinfer.comm` module by @yzh119 in #1089
- misc: cleanup by @b8zhong in #1092
- misc: followup by @b8zhong in #1093
- [nvidia] Add Blackwell FMHA decode kernel from TRT-LLM by @joker-eph in #1051
- bugfix: fix ninja generation rule for non-cuda input by @yzh119 in #1097
- jit: Update TVM JIT binding with the latest FFI refactor by @MasterJH5574 in #1100
- SM100 Groupwise GeMM K-Major Scale Supports by @cyx-6 in #1102
- misc: aot: Add platform tag to wheel by @abcdabcd987 in #1105
- feat: composable logits processor by @xslingcn in #1099
- feat: add trtllm all-reduce (non-MoE) by @yyihuang in #1096
- bugfix: host-precomputed plan function for blackwell fmha by @yzh119 in #1106
- doc: fix LogitsPipe example by @xslingcn in #1110
- bugfix: bugfix for blackwell mla split-k by @yzh119 in #1109
- Add CUTLASS fused moe kernels from TensorRT-LLM. by @wenscarl in #1113
- fix: initialize lamport buffer only once after creating new workspace by @yyihuang in #1111
- hotfix: fix the blackwell fmha stream by @yzh119 in #1116
- fix head_dim not defined if sm_scale is not None by @majian4work in #1119
- doc: add Ask-AI widget by @xslingcn in #1121
- bugfix: Fix test and output shape of fp4 quantize by @wenscarl in #1114
- misc: update slack link by @yzh119 in #1120
- release: bump version to v0.2.6 by @yzh119 in #1122
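A sketch of the hang case fixed by #1050 above: a logits row that is entirely -inf. Module and function names come from the entry itself; the exact signature is an assumption:

```python
import torch
import flashinfer

# Before #1050 an all--inf row could make top_k_mask_logits hang;
# after the fix it returns a masked tensor instead.
logits = torch.full((1, 128), float("-inf"), device="cuda")
masked = flashinfer.sampling.top_k_mask_logits(logits, top_k=10)
print(masked.shape)  # (1, 128)
```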
New Contributors
- @yongchaoding made their first contribution in #1008
- @Edenzzzz made their first contribution in #1009
- @dhy2000 made their first contribution in #1013
- @kaixih made their first contribution in #1031
- @yongwww made their first contribution in #1036
- @rickyfeng0119 made their first contribution in #1042
- @kf-zhang made their first contribution in #1035
- @arde171 made their first contribution in #1015
- @farnasirim made their first contribution in #1058
- @huydhn made their first contribution in #1063
- @acelyc111 made their first contribution in #1077
- @b8zhong made their first contribution in #1092
- @joker-eph made their first contribution in #1051
- @wenscarl made their first contribution in #1113
- @majian4work made their first contribution in #1119
Full Changelog: v0.2.5...v0.2.6
v0.2.5
What's Changed
- Fix compilation with FP16_QK_REDUCTION enabled. by @diptorupd in #962
- misc: Use environment variable to control JIT verbose flag by @yzh119 in #981
- Triton `rms_norm` kernels by @nandor in #983
- Allow passing workspace base directory via environment variable by @jsuchome in #973 (see the sketch after this list)
- [CHORE] Rename `output_emitted_token_num` -> `output_emitted_draft_token_num` by @jon-chuang in #977
- ci: switch to on-demand instances if spot instance is interrupted by @yzh119 in #987
- misc: update devcontainer by @yzh119 in #986
- ci: add torch 2.6+cu126 wheel by @yzh119 in #985
- misc: fix devcontainer conda path by @yzh119 in #989
- perf: prefetch page indices for mla kernel by @yzh119 in #991
- SM-constraint-GEMM by triton persistent kernel by @yyihuang in #982
- 3rdparty: upgrade cutlass to 3.9 by @yzh119 in #997
- perf: add `-DNDEBUG` compilation flag by @yzh119 in #998
- release: bump version to v0.2.5 by @yzh119 in #999
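Both environment-variable entries above (#981 for JIT verbosity, #973 for the workspace base directory) are read at import time; a hedged sketch with assumed variable names:

```python
import os

# Assumed names inferred from the PR titles; verify against the diffs
# of #981 (verbose flag) and #973 (workspace base directory).
os.environ["FLASHINFER_JIT_VERBOSE"] = "1"
os.environ["FLASHINFER_WORKSPACE_BASE"] = "/tmp/flashinfer-workspace"

import flashinfer  # noqa: E402
```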
New Contributors
- @jsuchome made their first contribution in #973
- @jon-chuang made their first contribution in #977
- @yyihuang made their first contribution in #982
Full Changelog: v0.2.4...v0.2.5
v0.2.4
What's Changed
- typo: fix pdl terminology by @yzh119 in #933
- Fix "specutate" typo by @markmc in #934
- typo: fix target_probs docs after uniform_samples removal by @markmc in #935
- typo: remove another uniform samples leftover by @markmc in #937
- Fix/precommit issues by @diptorupd in #931
- ci: setup Jenkins by @yzh119 in #874
- bugfix: fix include header name conflict by @yzh119 in #939
- fix: Fix MLA TVM binding for the latest changes by @MasterJH5574 in #940
- feat - support mla kvcache store by @baowendin in #888
- Add POD-Attention to FlashInfer by @AKKamath in #858
- bugfix: fix potential issues of FA3 template loading nans for PageAttention by @yzh119 in #945
- fix - fix bug when irrelevant seq has NaN data by @baowendin in #942
- misc: add ci-badge, update blog list by @yzh119 in #948
- bugfix: Fix missing PyModuleDef field initializers by @sampan26 in #946
- fix: fix pod-attention compilation time by @yzh119 in #954
- bugfix: bugfix to #949 by @yzh119 in #951
- misc: Temporarily disable POD from AOT wheels by @abcdabcd987 in #956
- ci: improve jenkins by @yzh119 in #943
- Fix compilation on cuda 12.2 by @goliaro in #961
- doc: remove misleading docstring about `non_blocking` by @yzh119 in #966
- perf: reduce torch.library dispatch overhead by @yzh119 in #968
- [TVM] Added tvm binding for sampling kernel by @annanyapr in #958
- perf: Fix python API overhead when CUDAGraph is not enabled by @yzh119 in #969
- Fix POD JIT bugs by @AKKamath in #971
- benchmark: add sampling.renorm benchmarks by @xslingcn in #970
- perf: dual pivot top-p/top-k renorm by @xslingcn in #974 (see the sketch after this list)
- perf: Use 2WG pipeline design for MLA implementation on Hopper by @yzh119 in #952
- release: bump version to v0.2.4 by @yzh119 in #980
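A sketch of the renorm path benchmarked in #970 and optimized in #974 above; names follow flashinfer.sampling's public API, with the exact signature treated as an assumption:

```python
import torch
import flashinfer

# Renormalize so only the top-p probability mass survives; #974 speeds
# this up with a dual-pivot search over the probability threshold.
probs = torch.softmax(torch.randn(4, 32000, device="cuda"), dim=-1)
renormed = flashinfer.sampling.top_p_renorm_probs(probs, top_p=0.9)
assert torch.allclose(renormed.sum(-1), torch.ones(4, device="cuda"), atol=1e-3)
```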
New Contributors
- @markmc made their first contribution in #934
- @diptorupd made their first contribution in #931
- @AKKamath made their first contribution in #858
- @sampan26 made their first contribution in #946
- @goliaro made their first contribution in #961
- @annanyapr made their first contribution in #958
Full Changelog: v0.2.3...v0.2.4