[FEAT] Add bf16 gemm #1974

@vadiklyutiy

Description

FI has fp8 and fp4 GEMM implementations, but no bf16 one.

The original issue was found in vLLM and is described in vllm-project/vllm#27173.

In short, torch.nn.functional.linear is not optimal for small batch sizes. The Torch team said that they just call cuBLAS.

It makes sense to support bf16 GEMM and tune across cuBLAS, CUTLASS, cuDNN, and the internal FI implementation, as is done for the fp8 and fp4 cases.
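The tuning approach described above can be sketched as a shape-keyed autotuner: time each candidate backend for a given GEMM shape, then cache and reuse the winner. This is an illustrative sketch only; the backend names, the `GemmAutotuner` class, and the list-based reference kernel are assumptions for demonstration, not FlashInfer's actual API or implementation.

```python
import time
from typing import Callable, Dict, Tuple

def benchmark(fn: Callable, args, iters: int = 10) -> float:
    """Return the average wall-clock time of fn(*args) over `iters` runs."""
    fn(*args)  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

class GemmAutotuner:
    """Pick the fastest backend per (M, N, K) shape and cache the choice.

    Hypothetical sketch: a real implementation would register cuBLAS,
    CUTLASS, cuDNN, and internal-kernel callables as backends.
    """
    def __init__(self, backends: Dict[str, Callable]):
        self.backends = backends
        self.cache: Dict[Tuple[int, int, int], str] = {}

    def __call__(self, a, b):
        m, k = len(a), len(a[0])
        n = len(b[0])
        key = (m, n, k)
        if key not in self.cache:
            # First call for this shape: benchmark every backend once.
            timings = {name: benchmark(fn, (a, b))
                       for name, fn in self.backends.items()}
            self.cache[key] = min(timings, key=timings.get)
        return self.backends[self.cache[key]](a, b)

def naive_matmul(a, b):
    """Reference GEMM on nested lists (stand-in for a real kernel)."""
    n = len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(n)] for i in range(len(a))]

# Two placeholder backends; in practice each would be a distinct kernel.
tuner = GemmAutotuner({"cublas_like": naive_matmul,
                       "cutlass_like": naive_matmul})
out = tuner([[1.0, 2.0]], [[3.0], [4.0]])  # 1x2 @ 2x1
```

Subsequent calls with the same (M, N, K) skip benchmarking entirely and dispatch straight to the cached backend, which is what makes this pattern cheap for the small-batch decode shapes the issue is about.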

Performance results and the measurement script are in vllm-project/vllm#27173.
