-
Notifications
You must be signed in to change notification settings - Fork 45
Open
Description
NDArrayMatrixMultiplyA16 does not contain simd async copy instructions, although the kernel for A14 does. Starting with AGX3 (A15), there are some new instructions used for GEMM and Conv. I haven't checked whether they're accessible from __asm (SIMD futures are not).
; Function Attrs: nounwind memory(write)
declare void @llvm.agx3.store.with.emask.global.i16.v2i16(ptr addrspace(1), <2 x i16>, i16, i16, i16) #7
; Function Attrs: nounwind memory(write)
declare void @llvm.agx3.store.with.emask.global.i32.v2i32(ptr addrspace(1), <2 x i32>, i16, i16, i16) #7
; Function Attrs: nounwind speculatable memory(none)
declare i16 @llvm.agx3.edgecheck(i32, i32, i32) #8
; Function Attrs: nounwind memory(read)
declare <2 x i16> @llvm.agx3.load.with.emask.global.v2i16.i16(ptr addrspace(1), i16, i16, i16) #10
; Function Attrs: nounwind memory(read)
declare <2 x i32> @llvm.agx3.load.with.emask.global.v2i32.i32(ptr addrspace(1), i16, i16, i16) #10
; Function Attrs: nounwind memory(read)
declare <1 x i16> @llvm.agx3.load.with.emask.global.v1i16.i16(ptr addrspace(1), i16, i16, i16) #10
; Function Attrs: nounwind memory(read)
declare <1 x i32> @llvm.agx3.load.with.emask.global.v1i32.i32(ptr addrspace(1), i16, i16, i16) #10
; Function Attrs: nounwind memory(read)
declare <4 x i16> @llvm.agx3.load.with.emask.global.v4i16.i16(ptr addrspace(1), i16, i16, i16) #10
; Function Attrs: nounwind memory(read)
declare <4 x i32> @llvm.agx3.load.with.emask.global.v4i32.i32(ptr addrspace(1), i16, i16, i16) #10Furthermore, unlike A14/M1, at least A16 can access 65536 bytes of registers from a single SIMD-group. That is more than physically possible.
%31 = alloca [16 x [16 x %"struct.metal::simdgroup_matrix"]], align 256
call void @llvm.lifetime.end.p0(i64 65536, ptr nonnull %292) #14Luckily, SIMD futures run correctly and performantly on A15/A16. I do worry that this MPS kernel is referencing their unreleased A16 ray tracing GPU (or the in-development M3), which might remove support for SIMD futures.
Source: https://gist.github.com/philipturner/939d4ffda26e66f10a142c82d8d498e9
Results (A15)
GEMM dimensions: 256x256x256
2023-06-09 12:16:16.984966-0400 SIMDFuturesA15[32193:1233147] Metal GPU Frame Capture Enabled
2023-06-09 12:16:16.985621-0400 SIMDFuturesA15[32193:1233147] Metal API Validation Enabled
Metal FlashAttention: 'f16'
GFLOPS: 269
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 48
MFA vs MPS Euclidean Distance: 30.969585
GEMM dimensions: 512x512x512
Metal FlashAttention: 'f16'
GFLOPS: 465
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 249
MFA vs MPS Euclidean Distance: 209.93332
GEMM dimensions: 768x768x768
Metal FlashAttention: 'f16'
GFLOPS: 948
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 749
MFA vs MPS Euclidean Distance: 894.5818
GEMM dimensions: 1024x1024x1024
Metal FlashAttention: 'f32'
GFLOPS: 1215
Metal Performance Shaders: 'f32'
GFLOPS: 1353
MFA vs MPS Euclidean Distance: 0.0
Metal FlashAttention: 'f16'
GFLOPS: 1184
Metal Performance Shaders: 'f16'
GFLOPS: 1265
MFA vs MPS Euclidean Distance: 2008.9558
Metal FlashAttention: 'f16'
GFLOPS: 1262
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 1194
MFA vs MPS Euclidean Distance: 2009.3207
GEMM dimensions: 1280x1280x1280
Metal FlashAttention: 'f16'
GFLOPS: 1618
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 1972
MFA vs MPS Euclidean Distance: 5536.6294
GEMM dimensions: 1536x1536x1536
Metal FlashAttention: 'f16'
GFLOPS: 1611
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 2177
MFA vs MPS Euclidean Distance: 10459.943
GEMM dimensions: 1792x1792x1792
Metal FlashAttention: 'f16'
GFLOPS: 1622
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 2355
MFA vs MPS Euclidean Distance: 16553.246
GEMM dimensions: 2048x2048x2048
Metal FlashAttention: 'f32'
GFLOPS: 1397
Metal Performance Shaders: 'f32'
GFLOPS: 1326
MFA vs MPS Euclidean Distance: 0.0
Metal FlashAttention: 'f16'
GFLOPS: 1624
Metal Performance Shaders: 'f16'
GFLOPS: 1303
MFA vs MPS Euclidean Distance: 24126.244
Metal FlashAttention: 'f16'
GFLOPS: 1624
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 2405
MFA vs MPS Euclidean Distance: 24127.777
GEMM dimensions: 4096x4096x4096
Metal FlashAttention: 'f16'
GFLOPS: 1594
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 2941
MFA vs MPS Euclidean Distance: 305298.94
Results (A16)
GEMM dimensions: 256x256x256
2023-06-09 08:53:08.328381-0700 TestUI[48119:10608193] Metal GPU Frame Capture Enabled
2023-06-09 08:53:08.328473-0700 TestUI[48119:10608193] Metal API Validation Enabled
2023-06-09 08:53:09.303047-0700 TestUI[48119:10608193] fopen failed for data file: errno = 2 (No such file or directory)
2023-06-09 08:53:09.303155-0700 TestUI[48119:10608193] Errors found! Invalidating cache...
2023-06-09 08:53:09.346694-0700 TestUI[48119:10608193] fopen failed for data file: errno = 2 (No such file or directory)
2023-06-09 08:53:09.346763-0700 TestUI[48119:10608193] Errors found! Invalidating cache...
Metal FlashAttention: 'f16'
GFLOPS: 632
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 68
MFA vs MPS Euclidean Distance: 30.969585
GEMM dimensions: 512x512x512
Metal FlashAttention: 'f16'
GFLOPS: 378
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 387
MFA vs MPS Euclidean Distance: 209.93332
GEMM dimensions: 768x768x768
Metal FlashAttention: 'f16'
GFLOPS: 957
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 1087
MFA vs MPS Euclidean Distance: 894.5818
GEMM dimensions: 1024x1024x1024
Metal FlashAttention: 'f32'
GFLOPS: 1163
Metal Performance Shaders: 'f32'
GFLOPS: 1070
MFA vs MPS Euclidean Distance: 0.0
Metal FlashAttention: 'f16'
GFLOPS: 1130
Metal Performance Shaders: 'f16'
GFLOPS: 1273
MFA vs MPS Euclidean Distance: 2008.9558
Metal FlashAttention: 'f16'
GFLOPS: 1401
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 1949
MFA vs MPS Euclidean Distance: 2009.3207
GEMM dimensions: 1280x1280x1280
Metal FlashAttention: 'f16'
GFLOPS: 1606
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 2799
MFA vs MPS Euclidean Distance: 5536.6294
GEMM dimensions: 1536x1536x1536
Metal FlashAttention: 'f16'
GFLOPS: 1610
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 3306
MFA vs MPS Euclidean Distance: 10459.943
GEMM dimensions: 1792x1792x1792
Metal FlashAttention: 'f16'
GFLOPS: 1622
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 3159
MFA vs MPS Euclidean Distance: 16553.246
GEMM dimensions: 2048x2048x2048
Metal FlashAttention: 'f32'
GFLOPS: 1474
Metal Performance Shaders: 'f32'
GFLOPS: 1379
MFA vs MPS Euclidean Distance: 0.0
Metal FlashAttention: 'f16'
GFLOPS: 1623
Metal Performance Shaders: 'f16'
GFLOPS: 1305
MFA vs MPS Euclidean Distance: 24126.244
Metal FlashAttention: 'f16'
GFLOPS: 1622
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 3484
MFA vs MPS Euclidean Distance: 24127.777
GEMM dimensions: 4096x4096x4096
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels