Skip to content

Commit f1978e5

Browse files
committed
Update README with TornadoVM inference benchmarks and notes
1 parent b8b5ece commit f1978e5

File tree

1 file changed

+23
-0
lines changed

1 file changed

+23
-0
lines changed

README.md

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,29 @@ Previous intergration of TornadoVM and Llama2 it can be found in <a href="https:
3131

3232
-----------
3333

34+
### TornadoVM-Accelerated Inference Performance and Optimization Status
35+
36+
This table shows inference performance across different hardware and quantization options.
37+
38+
| Hardware | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct | Llama-3.2-3B-Instruct | Optimizations |
39+
|:------------:|:---------------------:|:---------------------:|:---------------------:|:-------------:|
40+
| | **Q8_0** | **Q4_0** | **Q4_0** | **Support** |
41+
| **NVIDIA GPUs** | | | | |
42+
| RTX 3070 | 42.3 tokens/s | 78.6 tokens/s | 22.1 tokens/s ||
43+
| RTX 4090 | 96.7 tokens/s | 158.2 tokens/s | 52.9 tokens/s ||
44+
| RTX 5090 | 156.8 tokens/s | 243.5 tokens/s | 84.7 tokens/s ||
45+
| H100 | 178.3 tokens/s | 289.7 tokens/s | 102.5 tokens/s ||
46+
| **Apple Silicon** | | | | |
47+
| M3 Pro | 18.4 tokens/s | 35.7 tokens/s | 11.6 tokens/s ||
48+
| M4 Pro | 28.9 tokens/s | 52.3 tokens/s | 17.2 tokens/s ||
49+
| **AMD GPUs** | | | | |
50+
| Radeon RX | (WIP) | (WIP) | (WIP) ||
51+
52+
> **Note**: ✅ indicates hardware with optimized kernels for maximum performance.
53+
> Benchmark details: Settings used include context length of 4096, batch size 1, and default parameters.
54+
55+
-----------
56+
3457
### ✅ Current Features
3558

3659
- **TornadoVM-accelerated Llama 3 inference** with pure Java

0 commit comments

Comments
 (0)