The current grCUDA prototype adds a synchronization barrier after every kernel execution (cudaDeviceSynchronize()). In CUDA, kernels execute asynchronously with respect to host code and to kernels or memory operations in other streams.
- Implement asynchronous but non-deferred execution.
- Track read and write dependencies in DeviceArray and automatically insert synchronization points.
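
The dependency tracking described in the last item could look roughly like the following. This is a minimal host-side simulation, not grCUDA code: `TrackedArray`, `Scheduler`, `launch`, and `host_read` are hypothetical names introduced only for illustration, and no GPU calls are made. Each array remembers which in-flight kernel last wrote it and which kernels are still reading it, so a new launch only waits on kernels it actually conflicts with, and the host only synchronizes when it touches an array with a pending write.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class TrackedArray:
    """Hypothetical stand-in for a DeviceArray that records pending kernels."""
    name: str
    pending_writer: Optional[int] = None               # kernel id still writing
    pending_readers: Set[int] = field(default_factory=set)  # kernel ids still reading


class Scheduler:
    """Hypothetical scheduler: assigns kernel ids and computes dependencies."""

    def __init__(self):
        self.next_id = 0

    def launch(self, reads, writes):
        """Record an asynchronous launch; return (kernel id, ids it must wait for)."""
        kid = self.next_id
        self.next_id += 1
        deps = set()
        for a in reads:
            if a.pending_writer is not None:
                deps.add(a.pending_writer)       # read-after-write
        for a in writes:
            if a.pending_writer is not None:
                deps.add(a.pending_writer)       # write-after-write
            deps |= a.pending_readers            # write-after-read
        # Update bookkeeping only after all dependencies are collected.
        for a in reads:
            a.pending_readers.add(kid)
        for a in writes:
            a.pending_writer = kid
            a.pending_readers = set()
        return kid, deps

    def host_read(self, a):
        """Return True if the host had to synchronize before reading `a`."""
        if a.pending_writer is None:
            return False
        # A real implementation would wait on the writer's stream or event here
        # and retire that kernel from every other array; omitted in this sketch.
        a.pending_writer = None
        return True


sched = Scheduler()
x, y = TrackedArray("x"), TrackedArray("y")
k0, d0 = sched.launch(reads=[], writes=[x])   # no dependencies
k1, d1 = sched.launch(reads=[x], writes=[y])  # waits for k0
k2, d2 = sched.launch(reads=[x], writes=[])   # waits for k0, independent of k1
```

In this sketch `k1` and `k2` both depend only on `k0`, so they could run concurrently in separate streams, whereas a barrier after every kernel would serialize them.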