CUDA Sigmoid: Implement Optimized Sigmoid Kernels for Float32 and Float16 #7

@debashishc

Description

Implement the following optimized sigmoid kernels for float32 and float16, including vectorized variants and PyTorch bindings, for improved performance.

  • sigmoid_f32_kernel: Standard sigmoid for the float32 data type, one element per thread.
  • sigmoid_f32x4_kernel: Vectorized float32 sigmoid, processing 4 elements per thread via float4 loads/stores.
  • sigmoid_f16_kernel: Standard sigmoid for float16 (half precision).
  • sigmoid_f16x2_kernel: Vectorized float16 sigmoid, processing 2 elements per thread via half2.
  • sigmoid_f16x8_kernel: Unpacked float16 variant, processing 8 elements per thread.
  • sigmoid_f16x8_pack_kernel: Packed version of sigmoid_f16x8_kernel, using wide packed loads/stores for more efficient memory access.
  • PyTorch bindings: Expose the above kernels to PyTorch.
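As a rough illustration of the first two items, here is a minimal sketch of what the scalar and float4-vectorized float32 kernels could look like. This is an assumption about the intended implementation, not the issue's actual code; the vectorized kernel additionally assumes N is divisible by 4 and the pointers are 16-byte aligned.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Scalar float32 sigmoid: one element per thread.
__global__ void sigmoid_f32_kernel(const float* x, float* y, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N) {
    y[idx] = 1.0f / (1.0f + expf(-x[idx]));
  }
}

// Vectorized float32 sigmoid: each thread loads and stores one float4,
// processing 4 elements. Assumes N % 4 == 0 and 16-byte-aligned pointers.
__global__ void sigmoid_f32x4_kernel(const float* x, float* y, int N) {
  int idx = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
  if (idx < N) {
    float4 v = *reinterpret_cast<const float4*>(x + idx);
    float4 r;
    r.x = 1.0f / (1.0f + expf(-v.x));
    r.y = 1.0f / (1.0f + expf(-v.y));
    r.z = 1.0f / (1.0f + expf(-v.z));
    r.w = 1.0f / (1.0f + expf(-v.w));
    *reinterpret_cast<float4*>(y + idx) = r;
  }
}
```

The half-precision variants would follow the same pattern, substituting `half`/`half2` arithmetic (e.g. `h2exp` from `cuda_fp16.h`) for the float math.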

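For the bindings item, a hypothetical sketch using the standard torch C++ extension API is shown below; the launcher name, signature, and launch configuration are assumptions for illustration, not the issue's code.

```cpp
// Hypothetical binding sketch (torch C++ extension API).
#include <torch/extension.h>

// Declared in the .cu file; see the kernel sketch above.
__global__ void sigmoid_f32_kernel(const float* x, float* y, int N);

// Launcher: maps a contiguous float32 tensor onto the scalar kernel.
// Block size 256 is an assumed, typical choice.
torch::Tensor sigmoid_f32(torch::Tensor x) {
  auto y = torch::empty_like(x);
  const int N = x.numel();
  const int threads = 256;
  const int blocks = (N + threads - 1) / threads;
  sigmoid_f32_kernel<<<blocks, threads>>>(
      x.data_ptr<float>(), y.data_ptr<float>(), N);
  return y;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("sigmoid_f32", &sigmoid_f32, "sigmoid f32 (CUDA)");
}
```

Each of the other kernels would get an analogous launcher and `m.def` entry, typically built with `torch.utils.cpp_extension.load` on the Python side.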
Metadata

Labels: enhancement (New feature or request)