---
title: "Post from Aug 25, 2025"
date: 2025-08-25T09:20:01
slug: "1756113601"
tags:
  - tensorRT
  - torch
  - easydiffusion
  - ggml
  - cuda
  - vulkan
---

Experimented with TensorRT-RTX (a new library offered by NVIDIA).

The first step was a tiny toy model, just to get the build and test setup working.

The reference model in PyTorch:
```py
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, stride=1, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(8, 4)  # 4-class toy output

    def forward(self, x):
        x = self.relu(self.conv(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)
```

I ran this on an NVIDIA RTX 4060 Laptop GPU (8 GB) for 10K iterations, on Windows and on WSL with Ubuntu, with float32 data.

I ported this model to plain torch, torch.compile, TensorRT, TensorRT RTX, a fused plain-CUDA kernel, a fused plain-Vulkan shader, ggml + CUDA, and ggml + Vulkan.

I've included the performance numbers below, but they shouldn't be taken very seriously: the model is too small (in computational complexity and data size) to paint a true picture. The intent is just to verify that the different test setups are working sanely.

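The post doesn't show the timing harness, but a plain-torch loop for these measurements would look roughly like this. Synchronizing before reading the clock matters, since CUDA kernel launches are asynchronous (the model stand-in and input size are assumptions):

```python
import time

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(  # stand-in for TinyCNN
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(1), nn.Linear(8, 4),
).to(device).eval()
x = torch.randn(1, 3, 64, 64, device=device)

iters = 10_000
with torch.no_grad():
    for _ in range(100):  # warm-up
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - t0

print(f"{elapsed:.2f}s for {iters} iterations")
```

The CUDA, Vulkan, and ggml ports would use the same warm-up-then-time structure, with each backend's own fence or synchronize call in place of `torch.cuda.synchronize()`.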
| Time for 10K iterations | Implementation | Environment |
| --- | --- | --- |
| 1.6s | plain torch | Ubuntu Linux (WSL) |
| 1.9s | plain torch | Windows |
| 2.6s | torch.compile() with Triton | Ubuntu Linux (WSL) |
| 1.7s | TensorRT RTX | Windows |
| 1.6s | TensorRT | Windows |
| 1.6s | fused CUDA kernel | Windows |
| 5.1s | fused Vulkan shader | Windows |
| 2.3s | ggml + CUDA | Windows |
| 5.3s | ggml + Vulkan | Windows |

It's interesting that `torch.compile()` was slower than plain torch on Ubuntu Linux (WSL), and that plain torch came in pretty close to TensorRT and TensorRT RTX on Windows.

Maybe the model (and data) is too small? I'll pick a more representative model next: the UNet of a Stable Diffusion 1.5 model.