This repository:
- Establishes a terse syntax for describing a broad range of CRC32 implementations.
- Contains a code generator for turning terse descriptions into executable C code.
- Contains benchmarking apparatus for CRC32 implementations, along with a range of benchmark results.
- Presents some novel CRC32C implementations which are faster than the previous state of the art.
  - Apple M1: 9.6% faster (77.55 -> 85.05 GB/s)
  - Cascade Lake: 25% faster (25.32 -> 31.55 GB/s)
  - Ice Lake: 56% faster (41.00 -> 63.98 GB/s)
  - Sapphire Rapids: 19% faster (81.68 -> 97.30 GB/s)
Running make will generate, compile, and benchmark a handful of different implementations.
Alternatively, run make generate and then use ./generate to generate a particular implementation.
| Terse syntax | gcc equivalent | Example CPUs |
|---|---|---|
| -i neon | aarch64 -march=armv8-a+crypto+crc | Apple M1, GCP Tau T2A (Ampere Altra Arm) |
| -i neon_eor3 | aarch64 -march=armv8.2-a+crypto+sha3 | Apple M1 |
| -i sse | x86_64 -msse4.2 -mpclmul | Any Intel or AMD CPU from the last 10 years |
| -i avx512 | x86_64 -mavx512f -mavx512vl | Intel Skylake X and newer, AMD Zen 4 |
| -i avx512_vpclmulqdq | x86_64 -mavx512f -mavx512vl -mvpclmulqdq | Intel Ice Lake and newer, AMD Zen 4 |
For CRC32, the key feature required of any instruction set is a vector carryless multiplication instruction, e.g.
pclmulqdq on x86_64 or pmull on aarch64. Three-way exclusive-or is also useful, e.g. vpternlogq on x86_64 or
eor3 on aarch64.
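As an illustration (a minimal sketch, not code from this repository), a single 16-byte fold step built from these primitives using SSE intrinsics might look like this; the folding constants in k depend on the chosen polynomial and are left unspecified:

```c
#include <immintrin.h>   /* requires -msse4.2 -mpclmul */

static inline __m128i fold_16(__m128i acc, __m128i data, __m128i k)
{
    __m128i lo = _mm_clmulepi64_si128(acc, k, 0x00); /* low half of acc times low half of k */
    __m128i hi = _mm_clmulepi64_si128(acc, k, 0x11); /* high half of acc times high half of k */
    /* With vpternlogq (AVX-512) or eor3 (aarch64), these two XORs become one operation. */
    return _mm_xor_si128(_mm_xor_si128(lo, hi), data);
}
```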
| Terse syntax | Polynomial | Hardware acceleration | Example applications |
|---|---|---|---|
| -p crc32 | 0x04C11DB7 | aarch64 | Ethernet, SATA, zlib, png |
| -p crc32c | 0x1EDC6F41 | aarch64, x86_64 | iSCSI, Btrfs, ext4 |
| -p crc32k | 0x741B8CD7 | no | |
| -p crc32q | 0x814141AB | no | AIXM |
| -p 0x87654321 | 0x87654321 | no | none |
Most new applications should probably use -p crc32c. A large number of existing applications use -p crc32. If neither of
those suits your application, you can also specify an arbitrary polynomial.
Some instruction sets contain scalar instructions for accelerating particular CRC32 polynomials. Notably, aarch64 has
scalar acceleration for -p crc32 and -p crc32c, whereas x86_64 only has scalar acceleration for -p crc32c.
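For illustration, these scalar steps are exposed as standard compiler intrinsics (a hedged sketch, not this repository's generated code): aarch64 exposes one step of either polynomial, whereas x86_64 only exposes crc32c.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_acle.h>      /* requires -march=...+crc */
uint32_t step_crc32(uint32_t crc, uint64_t data)  { return __crc32d(crc, data); }
uint32_t step_crc32c(uint32_t crc, uint64_t data) { return __crc32cd(crc, data); }
#elif defined(__x86_64__)
#include <nmmintrin.h>     /* requires -msse4.2 */
uint32_t step_crc32c(uint32_t crc, uint64_t data) { return (uint32_t)_mm_crc32_u64(crc, data); }
#endif
```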
Six different parameters control the generated algorithm:
- Number of vector accumulators to use (16 or 64 bytes each).
- Number of vector loads to perform per iteration (defaults to number of vector accumulators, and must be a multiple of the number of vector accumulators).
- Number of scalar accumulators to use (4 bytes each).
- Number of scalar 8-byte loads to perform per iteration (defaults to number of scalar accumulators, and must be a multiple of the number of scalar accumulators).
- The outer loop size, if any.
- Whether inner loop termination is length-based or pointer-based.
The number of vector accumulators is specified by the letter v, followed by a decimal number. If the number of vector loads per iteration is non-default, this is followed by the letter x, followed by the ratio of loads to accumulators as a decimal number. For example, v4 denotes four vector accumulators and four vector loads per iteration, and v3x2 denotes three vector accumulators and six vector loads per iteration.
The number of scalar accumulators is specified in a similar way, but with s instead of v. For example, s3 denotes three scalar accumulators and three scalar loads per iteration, and s5x3 denotes five scalar accumulators and 15 scalar loads per iteration.
An outer loop size (in bytes) can be specified by the letter k, followed by a decimal number. Using an outer loop slightly decreases the complexity of the inner loop (because the inner loop trip count becomes constant), but makes the implementation less well suited to input sizes that aren't a multiple of the outer loop size.
The inner loop condition can be based on the remaining length (the default), or based on pointer comparisons (specify the letter e). Length-based is easier for humans to comprehend. Pointer-based can make the inner loop slightly simpler, at the expense of slightly more logic outside the loop.
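To make the accumulator/load-ratio notation concrete, here is a hedged sketch (not actual generator output) of the shape of an s3x2-style inner loop for -p crc32c on x86_64: three scalar accumulators, each performing two 8-byte loads per iteration from its own stripe of the block. Recombining the three partial CRCs afterwards requires carryless multiplication by precomputed constants, which is omitted here.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <nmmintrin.h>   /* _mm_crc32_u64, requires -msse4.2 */

static uint64_t load64(const uint8_t* p) { uint64_t v; memcpy(&v, p, 8); return v; }

/* Each of p0, p1, p2 points at one stripe of the block; c0, c1, c2 are the
 * three scalar accumulators. Each iteration issues two loads per accumulator. */
static void s3x2_inner(const uint8_t* p0, const uint8_t* p1, const uint8_t* p2,
                       size_t iters, uint32_t* c0, uint32_t* c1, uint32_t* c2)
{
    for (size_t i = 0; i < iters; ++i) {
        *c0 = (uint32_t)_mm_crc32_u64(*c0, load64(p0));
        *c1 = (uint32_t)_mm_crc32_u64(*c1, load64(p1));
        *c2 = (uint32_t)_mm_crc32_u64(*c2, load64(p2));
        *c0 = (uint32_t)_mm_crc32_u64(*c0, load64(p0 + 8));
        *c1 = (uint32_t)_mm_crc32_u64(*c1, load64(p1 + 8));
        *c2 = (uint32_t)_mm_crc32_u64(*c2, load64(p2 + 8));
        p0 += 16; p1 += 16; p2 += 16;
    }
}
```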
The various parameters are concatenated to give the algorithm string, for example:
- -a v4 to use four vector registers.
- -a v4x2s3x5k4096e to use four vector registers (8 vector loads per iteration), three scalar registers (15 scalar loads per iteration), an outer block size of 4096 bytes, and pointer-based loop termination.
Multiple sets of parameters can be separated by an _ character, for example -a v4_v1 uses a v4 algorithm for as long as possible, then switches to v1 for any remaining bytes. If there are still any remaining bytes, then an implicit _s1 is appended to deal with them.
| Implementation | Speed | Our equivalent | Speed |
|---|---|---|---|
| Chromium scalar | 25.03 GB/s | -i neon -p crc32 -a s3k95760_s3 | 25.50 GB/s |
| Chromium vector | 26.57 GB/s | -i neon -p crc32 -a v4_v1 | 34.03 GB/s |
| dougallj Faster CRC32 | 77.55 GB/s | -i neon -p crc32 -a v12e_v1 | 77.69 GB/s |
| fusion-inspired | N/A | -i neon_eor3 -p crc32 -a v9s3x2e_s3 | 85.05 GB/s |
To reproduce these measurements:
$ make bench autobench
$ make -C third_party chromium_neon_crc32_s3k95760_s3.dylib chromium_neon_crc32_v4_v1.dylib dougallj_neon_crc32_v12e_v1.dylib
$ ./bench -r 40 third_party/chromium_neon_crc32_s3k95760_s3.dylib third_party/chromium_neon_crc32_v4_v1.dylib third_party/dougallj_neon_crc32_v12e_v1.dylib
$ ./autobench -r 40 -i neon -p crc32 -a s3k95760_s3,v4_v1,v12e_v1 -i neon_eor3 -a v9s3x2e_s3
Rich microarchitecture details are available, though the key details for CRC32 distill down to the following table:
| Primitive | Uops | Load | Scalar | Vector | Bytes | Score | Latency |
|---|---|---|---|---|---|---|---|
| ldr, pmull+eor, pmull2+eor | 5/8 | 1/3 or 0.5/3 | | 2/4 | 16 | 25.6 | 6 |
| ldr, pmull, pmull2, eor3 | 4/8 | 1/3 or 0.5/3 | | 3/4 | 16 | 21.3 | 5 |
| ldr, pmull, pmull2, eor, eor | 5/8 | 1/3 or 0.5/3 | | 4/4 | 16 | 16.0 | 7 |
| ldr, crc32x | 2/8 | 1/3 or 0.5/3 | 1/1 | | 8 | 8.0 | 3 |
Score is computed as Bytes / max(Uops, Load, Scalar, Vector), and gives the optimal bytes per cycle that we can
hope to achieve with the corresponding primitive. The compiler might fuse two consecutive ldr instructions into a
single ldp instruction, which reduces Load from 1/3 to 0.5/3, but does not decrease Uops. The CPU can
fuse eor into an immediately preceding pmull or pmull2, which decreases the demand on the vector execution
units, but such fused instructions still count as two for Uops. Chromium's scalar implementation is a decent
application of the ldr, crc32x primitive, with three scalar accumulators sufficient to match the latency of 3.
Chromium's vector implementation is a poor application of the ldr, pmull, pmull2, eor, eor primitive, as the four
vector accumulators are insufficient to match the latency of 7. Our v4_v1 does better by using the
ldr, pmull+eor, pmull2+eor primitive, but the four vector accumulators are still insufficient to match the latency
of 6. Dougallj's implementation is a good application of the ldr, pmull+eor, pmull2+eor primitive, using 12 vector
accumulators to match the latency. The fun twist is the ldr, pmull, pmull2, eor3 primitive: it scores worse than
ldr, pmull+eor, pmull2+eor, so if just one primitive is used, it will lose out to ldr, pmull+eor, pmull2+eor. However,
it is less demanding on Uops, so a blend of ldr, pmull, pmull2, eor3 with ldr, crc32x can come out ahead. The
v9s3x2e_s3 implementation blends nine copies of ldr, pmull, pmull2, eor3 with six copies of ldr, crc32x, which gives
Uops = 48/8, Scalar = 6/1, Vector = 27/4, Bytes = 192, and thus a score of 28.4, putting it just ahead of
ldr, pmull+eor, pmull2+eor.
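To make the arithmetic above easy to check, here is a small self-contained helper (illustrative only, not part of the benchmark apparatus) that reproduces the Score column and the 28.4 figure for the v9s3x2e_s3 blend; the Load figure for the blend (15/3) is derived from its 9 vector plus 6 scalar loads per iteration.

```c
#include <stdio.h>

/* Score = Bytes / max(Uops, Load, Scalar, Vector), as defined above. */
static double score(double bytes, double uops, double load, double scalar, double vector)
{
    double m = uops;
    if (load   > m) m = load;
    if (scalar > m) m = scalar;
    if (vector > m) m = vector;
    return bytes / m;
}

int main(void)
{
    /* ldr, pmull, pmull2, eor3: Uops 4/8, Load 1/3, no scalar, Vector 3/4, 16 bytes. */
    printf("%.1f\n", score(16.0, 4.0/8, 1.0/3, 0.0, 3.0/4));        /* prints 21.3 */
    /* v9s3x2e_s3 blend: Uops 48/8, Load 15/3, Scalar 6/1, Vector 27/4, 192 bytes. */
    printf("%.1f\n", score(192.0, 48.0/8, 15.0/3, 6.0/1, 27.0/4));  /* prints 28.4 */
    return 0;
}
```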
| Implementation | Speed | Our equivalent | Speed |
|---|---|---|---|
| Chromium scalar | 22.68 GB/s | -i neon -p crc32 -a s3k95760_s3 | 23.57 GB/s |
| Chromium vector | 16.91 GB/s | -i neon -p crc32 -a v4_v1 | 20.94 GB/s |
| dougallj Faster CRC32 | 17.41 GB/s | -i neon -p crc32 -a v12e_v1 | 21.81 GB/s |
| fusion-inspired | N/A | -i neon -p crc32 -a v3s4x2e_v2 | 35.87 GB/s |
To reproduce these measurements:
$ make bench autobench
$ make -C third_party chromium_neon_crc32_s3k95760_s3.so chromium_neon_crc32_v4_v1.so dougallj_neon_crc32_v12e_v1.so
$ ./bench -r 40 third_party/chromium_neon_crc32_s3k95760_s3.so third_party/chromium_neon_crc32_v4_v1.so third_party/dougallj_neon_crc32_v12e_v1.so
$ ./autobench -r 40 -i neon -p crc32 -a s3k95760_s3,v4_v1,v12e_v1,v3s4x2e_v2
The CPU datasheet says:
- Four-wide superscalar aggressive out-of-order execution CPU
- Dual full-width (128-bit wide) SIMD execution pipes

Meanwhile, microarchitecture measurements suggest that:
- crc32x/crc32cx have latency 2, throughput 1
- pmull has latency 2, throughput 1
- vector eor has latency 2, throughput 0.5
These latency and throughput numbers are consistent with Chromium scalar being faster than Chromium vector. The optimal fusion also involves more bytes being processed by the scalar pipes (64 bytes per iteration) than the vector pipes (48 bytes per iteration).
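The pmull and eor numbers above apply to fold steps of roughly the following shape (a hedged sketch using ACLE/NEON intrinsics, not code from this repository; the folding constants in k depend on the polynomial and are left unspecified). With the neon_eor3 instruction set, the two final XORs collapse into a single veor3q_u64.

```c
#include <arm_neon.h>   /* requires -march=armv8-a+crypto */

static inline uint64x2_t fold_16(uint64x2_t acc, uint64x2_t data, poly64x2_t k)
{
    poly128_t lo = vmull_p64((poly64_t)vgetq_lane_u64(acc, 0), vgetq_lane_p64(k, 0));
    poly128_t hi = vmull_high_p64(vreinterpretq_p64_u64(acc), k);
    uint64x2_t t = veorq_u64(vreinterpretq_u64_p128(lo), vreinterpretq_u64_p128(hi));
    return veorq_u64(t, data);  /* with +sha3: veor3q_u64(lo, hi, data) */
}
```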
Relevant /proc/cpuinfo of test machine:
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
stepping : 7
microcode : 0x5003006
cpu MHz : 2199.998
cache size : 16384 KB
| Implementation | Speed | Our equivalent | Speed |
|---|---|---|---|
| Chromium vector | 15.87 GB/s | -i sse -p crc32 -a v4_v1 | 16.48 GB/s |
| crc32_4k_three_way | 15.04 GB/s | -i sse -p crc32c -a s3k4096e | 15.11 GB/s |
| crc32_4k_pclmulqdq | 15.78 GB/s | -i sse -p crc32c -a v4k4096e | 15.63 GB/s |
| retuned crc32_4k_pclmulqdq | N/A | -i sse -p crc32c -a v4e | 16.03 GB/s |
| AVX512 crc32_4k_pclmulqdq | N/A | -i avx512 -p crc32c -a v4e | 17.36 GB/s |
| crc32_4k_fusion | 25.32 GB/s | -i sse -p crc32c -a v4s3x3k4096e | 25.42 GB/s |
| retuned crc32_4k_fusion | N/A | -i sse -p crc32c -a v8s3x3 | 28.55 GB/s |
| AVX512 crc32_4k_fusion | N/A | -i avx512 -p crc32c -a v9s3x4e | 31.55 GB/s |
To reproduce these measurements:
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -i sse -p crc32c -a s3k4096e,v4k4096e,v4e -i avx512 -a v4e -i sse -a v4s3x3k4096e,v8s3x3 -i avx512 -a v9s3x4e
Note that Cascade Lake lacks VPCLMULQDQ, so the only useful part of AVX512 is VPTERNLOGQ; hence AVX512 gives only a very limited performance increase here.
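For illustration, the three-way XOR in question is a single instruction; a minimal sketch using the standard intrinsic (the 128-bit form shown here assumes AVX512F and AVX512VL):

```c
#include <immintrin.h>   /* requires -mavx512f -mavx512vl */

/* Immediate 0x96 selects a ^ b ^ c. */
static inline __m128i xor3(__m128i a, __m128i b, __m128i c)
{
    return _mm_ternarylogic_epi64(a, b, c, 0x96);
}
```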
Relevant /proc/cpuinfo of test machine:
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) CPU @ 2.60GHz
stepping : 6
microcode : 0xffffffff
cpu MHz : 2600.028
cache size : 55296 KB
| Implementation | Speed | Our equivalent | Speed |
|---|---|---|---|
| Chromium vector | 24.08 GB/s | -i sse -p crc32 -a v4_v1 | 24.02 GB/s |
| Chromium vector AVX512 | 41.00 GB/s | -i avx512_vpclmulqdq -p crc32 -a v4_v1 | 47.31 GB/s |
| crc32_4k_three_way | 19.68 GB/s | -i sse -p crc32c -a s3k4096e | 19.80 GB/s |
| retuned crc32_4k_three_way | N/A | -i sse -p crc32c -a s3 | 22.50 GB/s |
| crc32_4k_pclmulqdq | 22.80 GB/s | -i sse -p crc32c -a v4k4096e | 22.80 GB/s |
| retuned crc32_4k_pclmulqdq | N/A | -i sse -p crc32c -a v4e | 24.05 GB/s |
| AVX512 crc32_4k_pclmulqdq | N/A | -i avx512_vpclmulqdq -p crc32c -a v4x2e | 47.50 GB/s |
| crc32_4k_fusion | 35.12 GB/s | -i sse -p crc32c -a v4s3x3k4096e | 34.72 GB/s |
| retuned crc32_4k_fusion | N/A | -i sse -p crc32c -a v7s3x3 | 42.58 GB/s |
| AVX512 crc32_4k_fusion | N/A | -i avx512_vpclmulqdq -p crc32c -a v4s5x3 | 63.98 GB/s |
To reproduce these measurements:
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so chromium_avx512_vpclmulqdq_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/chromium_avx512_vpclmulqdq_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -i avx512_vpclmulqdq -p crc32 -a v4_v1 -i sse -p crc32c -a s3k4096e,s3,v4k4096e,v4e -i avx512_vpclmulqdq -a v4e -i sse -a v4s3x3k4096e,v7s3x3 -i avx512_vpclmulqdq -a v4s5x3
Ice Lake introduces VPCLMULQDQ, but VPCLMULQDQ on 64-byte registers is 1*p05+2*p5 (versus regular PCLMULQDQ on 16-byte registers being 1*p5), so going from 16-byte to 64-byte registers only gives a ~2x speedup. Chromium vector AVX512 leaves some performance on the table due to not using VPTERNLOGQ.
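For illustration, a 64-byte fold step built on these instructions might look like the following hedged sketch (not code from this repository); VPCLMULQDQ performs four independent 128-bit carryless multiplies per instruction, and VPTERNLOGQ supplies the three-way XOR. The folding constants in k are left unspecified.

```c
#include <immintrin.h>   /* requires -mavx512f -mvpclmulqdq */

static inline __m512i fold_64(__m512i acc, __m512i data, __m512i k)
{
    __m512i lo = _mm512_clmulepi64_epi128(acc, k, 0x00);
    __m512i hi = _mm512_clmulepi64_epi128(acc, k, 0x11);
    return _mm512_ternarylogic_epi64(lo, hi, data, 0x96); /* lo ^ hi ^ data */
}
```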
Relevant /proc/cpuinfo of test machine:
vendor_id : GenuineIntel
cpu family : 6
model : 143
model name : Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
stepping : 8
microcode : 0xffffffff
cpu MHz : 2699.998
cache size : 107520 KB
| Implementation | Speed | Our equivalent | Speed |
|---|---|---|---|
| Chromium vector | 23.42 GB/s | -i sse -p crc32 -a v4_v1 | 23.31 GB/s |
| Chromium vector AVX512 | 65.20 GB/s | -i avx512_vpclmulqdq -p crc32 -a v4_v1 | 94.00 GB/s |
| (with aligned inputs) | 81.68 GB/s | -i avx512_vpclmulqdq -p crc32 -a v4_v1 | 94.00 GB/s |
| crc32_4k_three_way | 17.45 GB/s | -i sse -p crc32c -a s3k4096e | 17.59 GB/s |
| retuned crc32_4k_three_way | N/A | -i sse -p crc32c -a s3 | 23.84 GB/s |
| crc32_4k_pclmulqdq | 22.79 GB/s | -i sse -p crc32c -a v4k4096e | 22.79 GB/s |
| retuned crc32_4k_pclmulqdq | N/A | -i sse -p crc32c -a v4e | 23.29 GB/s |
| AVX512 crc32_4k_pclmulqdq | N/A | -i avx512_vpclmulqdq -p crc32c -a v4e | 94.84 GB/s |
| crc32_4k_fusion | 33.87 GB/s | -i sse -p crc32c -a v4s3x3k4096e | 34.68 GB/s |
| retuned crc32_4k_fusion | N/A | -i sse -p crc32c -a v8s3x3 | 36.98 GB/s |
| AVX512 crc32_4k_fusion | N/A | -i avx512_vpclmulqdq -p crc32c -a v3s1_s3 | 97.30 GB/s |
To reproduce these measurements:
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so chromium_avx512_vpclmulqdq_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/chromium_avx512_vpclmulqdq_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -i avx512_vpclmulqdq -p crc32 -a v4_v1 -i sse -p crc32c -a s3k4096e,s3,v4k4096e,v4e -i avx512_vpclmulqdq -a v4e -i sse -a v4s3x3k4096e,v8s3x3 -i avx512_vpclmulqdq -a v3s1_s3
Sapphire Rapids improves VPCLMULQDQ on 64-byte registers to 1*p5, meaning that AVX512 implementations can approach 4x the performance of implementations using 16-byte vectors (and slightly exceed 4x when also gaining VPTERNLOGQ). One catch is that we now observe a 20% performance penalty for unaligned 64-byte loads, which the Chromium code doesn't take care to handle.
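One way an implementation might avoid that penalty is to peel off a misaligned prefix so that the main loop only performs aligned 64-byte loads. A hedged sketch follows; crc32c_avx512_aligned is a hypothetical kernel name, not this repository's API.

```c
#include <stdint.h>
#include <stddef.h>
#include <nmmintrin.h>   /* _mm_crc32_u8, requires -msse4.2 */

/* Hypothetical main kernel, assumed to want 64-byte-aligned p. */
uint32_t crc32c_avx512_aligned(uint32_t crc, const uint8_t* p, size_t n);

/* Byte-at-a-time scalar CRC32C for the short misaligned prefix. */
static uint32_t crc32c_prefix(uint32_t crc, const uint8_t* p, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        crc = (uint32_t)_mm_crc32_u8(crc, p[i]);
    return crc;
}

uint32_t crc32c(uint32_t crc, const uint8_t* p, size_t n)
{
    size_t mis = (uintptr_t)p & 63;
    if (mis != 0 && n >= 64) {
        size_t head = 64 - mis;   /* bytes until the next 64-byte boundary */
        crc = crc32c_prefix(crc, p, head);
        p += head;
        n -= head;
    }
    return crc32c_avx512_aligned(crc, p, n);
}
```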
Relevant /proc/cpuinfo of test machine:
vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD EPYC 7B12
stepping : 0
microcode : 0xffffffff
cpu MHz : 2249.998
cache size : 512 KB
| Implementation | Speed | Our equivalent | Speed |
|---|---|---|---|
| Chromium vector | 12.89 GB/s | -i sse -p crc32 -a v4_v1 | 12.84 GB/s |
| crc32_4k_three_way | 23.52 GB/s | -i sse -p crc32c -a s3k4096e | 23.08 GB/s |
| crc32_4k_pclmulqdq | 12.86 GB/s | -i sse -p crc32c -a v4k4096e | 12.89 GB/s |
| crc32_4k_fusion | 23.84 GB/s | -i sse -p crc32c -a v4s3x3k4096e | 22.62 GB/s |
| retuned crc32_4k_fusion | N/A | -i sse -p crc32c -a v1s3x3 | 31.16 GB/s |
To reproduce these measurements:
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -p crc32c -a s3k4096e,v4k4096e,v4s3x3k4096e,v1s3x3
Relevant /proc/cpuinfo of test machine:
vendor_id : AuthenticAMD
cpu family : 25
model : 1
model name : AMD EPYC 7B13
stepping : 0
microcode : 0xffffffff
cpu MHz : 2450.000
cache size : 512 KB
| Implementation | Speed | Our equivalent | Speed |
|---|---|---|---|
| Chromium vector | 12.82 GB/s | -i sse -p crc32 -a v4_v1 | 12.85 GB/s |
| crc32_4k_three_way | 22.75 GB/s | -i sse -p crc32c -a s3k4096e | 23.55 GB/s |
| crc32_4k_pclmulqdq | 12.76 GB/s | -i sse -p crc32c -a v4k4096e | 12.86 GB/s |
| crc32_4k_fusion | 23.41 GB/s | -i sse -p crc32c -a v4s3x3k4096e | 22.64 GB/s |
| retuned crc32_4k_fusion | N/A | -i sse -p crc32c -a v1s4x2 | 31.76 GB/s |
To reproduce these measurements:
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -p crc32c -a s3k4096e,v4k4096e,v4s3x3k4096e,v1s4x2
Relevant /proc/cpuinfo of test machine:
vendor_id : AuthenticAMD
cpu family : 25
model : 17
model name : AMD EPYC 9B14
stepping : 1
microcode : 0xffffffff
cpu MHz : 2599.998
cache size : 1024 KB
| Implementation | Speed | Our equivalent | Speed |
|---|---|---|---|
| Chromium vector | 13.73 GB/s | -i sse -p crc32 -a v4_v1 | 13.72 GB/s |
| Chromium vector AVX512 | 51.70 GB/s | -i avx512_vpclmulqdq -p crc32 -a v4_v1 | 54.00 GB/s |
| crc32_4k_three_way | 26.27 GB/s | -i sse -p crc32c -a s3k4096e | 26.56 GB/s |
| retuned crc32_4k_three_way | N/A | -i sse -p crc32c -a s3 | 27.32 GB/s |
| crc32_4k_pclmulqdq | 13.73 GB/s | -i sse -p crc32c -a v4k4096e | 13.74 GB/s |
| AVX512 crc32_4k_pclmulqdq | N/A | -i avx512_vpclmulqdq -p crc32c -a v3x2 | 54.77 GB/s |
| crc32_4k_fusion | 28.47 GB/s | -i sse -p crc32c -a v4s3x3k4096e | 28.56 GB/s |
| retuned crc32_4k_fusion | N/A | -i sse -p crc32c -a v1s3x2 | 36.20 GB/s |
| AVX512 crc32_4k_fusion | N/A | -i avx512_vpclmulqdq -p crc32c -a v3s2x4 | 71.95 GB/s |
To reproduce these measurements:
$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so chromium_avx512_vpclmulqdq_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/chromium_avx512_vpclmulqdq_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -i avx512_vpclmulqdq -p crc32 -a v4_v1 -i sse -p crc32c -a s3k4096e,s3,v4k4096e -i avx512_vpclmulqdq -a v3x2 -i sse -a v4s3x3k4096e,v1s3x2 -i avx512_vpclmulqdq -a v3s2x4