Update blog: add ONNX Runtime + CUDA to the TinyCNN benchmark post

cmdr2 · cmdr2 · commit 5db8dc799c9c · 2025-10-07T15:36:00.000+05:30
diff --git a/content/blog/2025-08-25-1756113601.md b/content/blog/2025-08-25-1756113601.md
@@ -36,7 +36,7 @@ class TinyCNN(nn.Module):
 
 I ran this on a NVIDIA 4060 8 GB (Laptop) for 10K iterations, on Windows and WSL-with-Ubuntu, with float32 data.
 
-I ported this model to plain torch, torch.compile, TensorRT, TensorRT RTX, plain CUDA (fused operation), plain Vulkan (fused operation), ggml + CUDA, and ggml + Vulkan.
+I ported this model to plain torch, torch.compile, TensorRT, TensorRT RTX, plain CUDA (fused operation), plain Vulkan (fused operation), ggml + CUDA, ggml + Vulkan, and ONNX Runtime + CUDA.
 
 I've included the performance numbers below, but they shouldn't be taken very seriously since the model is too small to paint a true picture (in terms of computation complexity and data size). The intent is to verify that the different test setups are working somewhat sanely.
 
@@ -46,6 +46,7 @@ For 10k iterations:
 | 1.6s | plain torch | Ubuntu Linux (WSL) |
 | 1.6s | TensorRT | Windows |
 | 1.6s | fused CUDA kernel | Windows |
+| 1.6s | ONNX Runtime + CUDA | Windows |
 | 1.7s | TensorRT RTX | Windows |
 | 1.9s | plain torch | Windows |
 | 2.3s | ggml + CUDA | Windows |