---
title: "Post from Aug 25, 2025"
date: 2025-08-25T09:20:01
slug: "1756113601"
tags:
  - tensorRT
  - torch
  - easydiffusion
  - ggml
  - cuda
  - vulkan
---

Experimented with TensorRT-RTX (a new library offered by NVIDIA).

The first step was a tiny toy model, just to get the build and test setup working.

The reference model in PyTorch:
```py
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, stride=1, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(8, 4)  # 4-class toy output

    def forward(self, x):
        x = self.relu(self.conv(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)
```

I ran this on an NVIDIA RTX 4060 Laptop GPU (8 GB) for 10K iterations, on Windows and on WSL with Ubuntu, with float32 data.

I ported this model to plain torch, torch.compile, TensorRT, TensorRT RTX, a fused plain-CUDA kernel, a fused plain-Vulkan shader, ggml + CUDA, and ggml + Vulkan.

I've included the performance numbers below, but they shouldn't be taken very seriously: the model is too small (in computational complexity and data size) to paint a true picture. The intent is just to verify that the different test setups are working sanely.

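The post doesn't show the timing harness, but a plain-torch loop for these measurements would look roughly like this. Synchronizing before reading the clock matters, since CUDA kernel launches are asynchronous (the model stand-in and input size are assumptions):

```python
import time

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(  # stand-in for TinyCNN
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(1), nn.Linear(8, 4),
).to(device).eval()
x = torch.randn(1, 3, 64, 64, device=device)

iters = 10_000
with torch.no_grad():
    for _ in range(100):  # warm-up
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - t0

print(f"{elapsed:.2f}s for {iters} iterations")
```

The CUDA, Vulkan, and ggml ports would use the same warm-up-then-time structure, with each backend's own fence or synchronize call in place of `torch.cuda.synchronize()`.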
| Time for 10K iterations | Implementation | Environment |
| --- | --- | --- |
| 1.6s | plain torch | Ubuntu Linux (WSL) |
| 1.9s | plain torch | Windows |
| 2.6s | torch.compile() with Triton | Ubuntu Linux (WSL) |
| 1.7s | TensorRT RTX | Windows |
| 1.6s | TensorRT | Windows |
| 1.6s | fused CUDA kernel | Windows |
| 5.1s | fused Vulkan shader | Windows |
| 2.3s | ggml + CUDA | Windows |
| 5.3s | ggml + Vulkan | Windows |

It's interesting that `torch.compile()` was slower than plain torch on Ubuntu Linux (WSL), and that plain torch came in pretty close to TensorRT and TensorRT RTX on Windows.

Maybe the model (and data) is too small? I'll pick a more representative model next: the UNet of a Stable Diffusion 1.5 model.