[Feature Request] Fallback for custom kernels

We have a specific use case: we need to export MLX callables on Linux, where no Metal GPU is available. Currently, this prevents us from using any operation that relies on a custom kernel (e.g. grid sample), because MLX then complains that no Metal GPU was found.

Ideally, we would like to keep the fast kernel in the exported callable, so that we can export on a machine without a Metal GPU and still execute with the kernel on a machine that has one.
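
To make the requested behaviour concrete, here is a minimal pure-Python sketch of the semantics we have in mind. It does not use MLX at all: `KernelWithFallback`, `gpu_available`, `scale_fast` and `scale_fallback` are hypothetical names made up for illustration, not existing MLX APIs. The point is only that the choice between the kernel and the fallback is deferred until the callable runs, so building (or exporting) it does not need a GPU.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class KernelWithFallback:
    """Hypothetical wrapper (not an MLX API): a fast kernel plus a portable fallback."""

    fast: Callable[[Vector], Vector]       # stands in for a custom Metal kernel
    fallback: Callable[[Vector], Vector]   # stands in for the same op built from regular ops
    gpu_available: Callable[[], bool]      # stands in for a Metal availability check

    def __call__(self, x: Vector) -> Vector:
        # The decision is made at call time, not when the object is constructed,
        # so constructing it on a machine without a GPU is harmless.
        if self.gpu_available():
            return self.fast(x)
        return self.fallback(x)


def scale_fast(x: Vector) -> Vector:
    # Placeholder for the custom-kernel path.
    return [2.0 * v for v in x]


def scale_fallback(x: Vector) -> Vector:
    # Placeholder for the portable path; it must produce the same result.
    return [v + v for v in x]


# "Export" side (e.g. Linux): no GPU, so calling it takes the fallback path.
on_linux = KernelWithFallback(scale_fast, scale_fallback, gpu_available=lambda: False)

# "Execution" side (e.g. macOS): a GPU is present, so the fast path is used.
on_mac = KernelWithFallback(scale_fast, scale_fallback, gpu_available=lambda: True)

print(on_linux([1.0, 2.0]))
print(on_mac([1.0, 2.0]))
```

In MLX terms, the fallback would only be needed while tracing or exporting without a Metal device, and the exported callable would still dispatch to the custom kernel when loaded on a Metal GPU. How that maps onto the actual export machinery is an open question for the maintainers.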