The previous integration of TornadoVM and Llama2 can be found in <a href="https:
-----------

### TornadoVM-Accelerated Inference Performance and Optimization Status

This table shows inference performance across different hardware and quantization options.

| Hardware | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct | Llama-3.2-3B-Instruct | Optimizations |
| :------------:| :---------------------:| :---------------------:| :---------------------:| :-------------:|
| | **Q8_0** | **Q4_0** | **Q4_0** | **Support** |
| **NVIDIA GPUs** | | | | |
| RTX 3070 | 42.3 tokens/s | 78.6 tokens/s | 22.1 tokens/s | ✅ |
| RTX 4090 | 96.7 tokens/s | 158.2 tokens/s | 52.9 tokens/s | ✅ |
| RTX 5090 | 156.8 tokens/s | 243.5 tokens/s | 84.7 tokens/s | ✅ |
| H100 | 178.3 tokens/s | 289.7 tokens/s | 102.5 tokens/s | ✅ |
| **Apple Silicon** | | | | |
| M3 Pro | 18.4 tokens/s | 35.7 tokens/s | 11.6 tokens/s | ❌ |
| M4 Pro | 28.9 tokens/s | 52.3 tokens/s | 17.2 tokens/s | ❌ |
| **AMD GPUs** | | | | |
| Radeon RX | (WIP) | (WIP) | (WIP) | ❌ |
> **Note**: ✅ indicates hardware with optimized kernels for maximum performance.
> Benchmark settings: context length 4096, batch size 1, default generation parameters.
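For reference, a tokens/s figure like those in the table is simply the number of generated tokens divided by wall-clock decode time. The sketch below is a hypothetical illustration (class and method names are not part of this project) of how such a measurement can be computed:

```java
// Hypothetical sketch: deriving a tokens/s throughput figure
// from a token count and an elapsed time, as reported above.
public class ThroughputSketch {

    /** Tokens generated per second, given elapsed time in nanoseconds. */
    public static double tokensPerSecond(int generatedTokens, long elapsedNanos) {
        return generatedTokens / (elapsedNanos / 1_000_000_000.0);
    }

    public static void main(String[] args) {
        // Example: 512 tokens generated in 12.1 seconds of decoding.
        double tps = tokensPerSecond(512, 12_100_000_000L);
        System.out.printf("%.1f tokens/s%n", tps);
    }
}
```

Note that decode throughput is usually measured after the prompt-processing (prefill) phase, so prefill latency does not inflate or deflate the figure.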

-----------

### ✅ Current Features

- **TornadoVM-accelerated Llama 3 inference** with pure Java