Description
I would like to ask whether there have been any efforts to introduce a `cblas_sgemm_strided_batched` API in OpenBLAS.
A strided batched API would make it possible to parallelize along the batch dimension in scenarios where the core computation is fundamentally two-dimensional, and the batch dimension effectively introduces a third axis.
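To make the request concrete, here is a minimal plain-C sketch of the semantics I have in mind. The function name `sgemm_strided_batched_ref`, its argument order, and the row-major, no-transpose restriction are all assumptions for illustration; none of this is an existing OpenBLAS interface. Each batch entry `p` reads its matrices at `base + p * stride` and is independent of the others, which is what makes the batch axis a natural place to parallelize.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference semantics (not an OpenBLAS function).
 * Row-major, no transposes. For every p in [0, batch):
 *   C_p = alpha * A_p * B_p + beta * C_p
 * with A_p = a + p * stride_a (m x k), B_p = b + p * stride_b (k x n),
 * and C_p = c + p * stride_c (m x n). */
void sgemm_strided_batched_ref(int m, int n, int k, float alpha,
                               const float *a, int lda, ptrdiff_t stride_a,
                               const float *b, int ldb, ptrdiff_t stride_b,
                               float beta,
                               float *c, int ldc, ptrdiff_t stride_c,
                               int batch)
{
    assert(lda >= k && ldb >= n && ldc >= n);

    /* Batch entries do not depend on each other, so the outer loop can be
     * split across threads. The pragma is ignored without -fopenmp. */
    #pragma omp parallel for
    for (int p = 0; p < batch; ++p) {
        const float *ap = a + p * stride_a;
        const float *bp = b + p * stride_b;
        float *cp = c + p * stride_c;

        for (int i = 0; i < m; ++i) {
            for (int j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (int l = 0; l < k; ++l)
                    acc += ap[i * lda + l] * bp[l * ldb + j];
                /* As in BLAS, beta == 0 means C is not read. */
                cp[i * ldc + j] = (beta == 0.0f)
                                      ? alpha * acc
                                      : alpha * acc + beta * cp[i * ldc + j];
            }
        }
    }
}
```

A real implementation would of course dispatch to the optimized GEMM kernels instead of the naive triple loop; the sketch only pins down the indexing.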
At the moment, I am working around this by placing an OpenMP `parallel for` loop around the batch dimension. However, this requires setting the OpenBLAS thread count to 1 and the OpenMP thread count to 4 (instead of simply setting the OpenBLAS thread count to 4), which feels like an unnecessary and somewhat hacky configuration.
An official API for strided batched GEMM would simplify this workflow significantly and provide a cleaner, more robust solution.