---
title: "Post from Aug 25, 2025"
date: 2025-08-25T09:20:01
slug: "1756113601"
tags:
- tensorRT
- torch
- easydiffusion
- ggml
- cuda
- vulkan
---

Experimented with TensorRT-RTX (a new library offered by NVIDIA).
The first step was a tiny toy model, just to get the build and test setup working.
The reference model in PyTorch:
```py
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, stride=1, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(8, 4)  # 4-class toy output

    def forward(self, x):
        x = self.relu(self.conv(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)
```
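A quick smoke test for the toy model (my sketch, not from the post): feed a random batch through the same layer stack and confirm the output shape is `(batch, 4)`. Here the class is rewritten as an `nn.Sequential` so the snippet is self-contained.

```python
import torch
import torch.nn as nn

# Same layers as TinyCNN above, as an nn.Sequential for a self-contained
# smoke test (nn.Flatten replaces the .flatten(1) call in forward()).
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, stride=1, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(8, 4),
)

x = torch.randn(2, 3, 32, 32)  # batch of 2 random 32x32 RGB "images"
y = model(x)
print(y.shape)                 # torch.Size([2, 4])
```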
I ran this on an NVIDIA 4060 8 GB (Laptop) GPU for 10K iterations, on both Windows and WSL with Ubuntu, with float32 data.
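The exact benchmark script isn't shown here, but the timing loop looked roughly like this (a sketch, with hypothetical names): warm up first, then synchronize the GPU before and after the timed region so queued kernels are actually counted.

```python
import time
import torch
import torch.nn as nn

# Rough sketch of a benchmark loop (an assumption, not the post's script):
# warm up, then synchronize around the timed region so asynchronously
# queued CUDA work is included in the measurement.
def bench(model, x, iters, warmup=100):
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
        return time.perf_counter() - start

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).to(device)
x = torch.randn(1, 3, 32, 32, device=device)
elapsed = bench(model, x, iters=100)  # the post used 10K iterations
```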
I ported this model to plain torch, `torch.compile()`, TensorRT, TensorRT RTX, plain CUDA (fused operation), plain Vulkan (fused operation), ggml + CUDA, and ggml + Vulkan.
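The "fused operation" variants collapse conv → ReLU → global average pool → linear into a single pass. As a functional reference for what those kernels compute (my reconstruction in torch, not the actual CUDA/Vulkan code):

```python
import torch
import torch.nn.functional as F

# One-pass reference for the fused kernels: conv + ReLU + global average
# pool + linear. Weight shapes follow the TinyCNN layers (8 conv filters,
# 4 output classes).
def fused_forward(x, w_conv, b_conv, w_fc, b_fc):
    x = F.relu(F.conv2d(x, w_conv, b_conv, stride=1, padding=1))
    x = x.mean(dim=(2, 3))        # AdaptiveAvgPool2d((1, 1)) + flatten(1)
    return x @ w_fc.t() + b_fc    # nn.Linear

x = torch.randn(1, 3, 32, 32)
w_conv, b_conv = torch.randn(8, 3, 3, 3), torch.randn(8)
w_fc, b_fc = torch.randn(4, 8), torch.randn(4)
out = fused_forward(x, w_conv, b_conv, w_fc, b_fc)
```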
I've included the performance numbers below, but they shouldn't be taken too seriously: the model is too small (in computational complexity and data size) to paint a true picture. The intent is just to verify that the different test setups are working somewhat sanely.
| Time for 10K iterations | Implementation | Environment |
| --- | --- | --- |
| 1.6s | plain torch | Ubuntu Linux (WSL) |
| 1.9s | plain torch | Windows |
| 2.6s | torch.compile() with Triton | Ubuntu Linux (WSL) |
| 1.7s | TensorRT RTX | Windows |
| 1.6s | TensorRT | Windows |
| 1.6s | fused CUDA kernel | Windows |
| 5.1s | fused Vulkan shader | Windows |
| 2.3s | ggml + CUDA | Windows |
| 5.3s | ggml + Vulkan | Windows |

It's interesting that `torch.compile()` was slower than plain torch on both Windows and Ubuntu Linux (WSL). And plain torch was pretty close to TensorRT and TensorRT RTX on Windows.
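For the `torch.compile()` row, the invocation is presumably just a wrapper around the module (my sketch below, not the post's script). Compilation is lazy: Triton code generation happens on the first call, and guard checks run on every subsequent call, which is per-call overhead a model this tiny may not amortize.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())

# torch.compile wraps the module lazily: Triton codegen runs on the first
# call, and guard checks run on every call after that. For a tiny model,
# that fixed per-call cost can outweigh any kernel-level savings.
compiled = torch.compile(model)
```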
Maybe the model (and data) is too small? I'll pick a more representative model next: the UNet of a Stable Diffusion 1.5 model.
