[PWCI] "net: optimize raw checksum computation" #615
__rte_raw_cksum uses a loop with memcpy on each iteration.
GCC 15+ is able to vectorize the loop but Clang 18.1 is not.
Replacing the memcpy with unaligned_uint16_t pointer access enables
both GCC and Clang to vectorize with SSE/AVX/AVX-512.
This patch adds comprehensive fuzz testing and updates the performance
test to measure the optimization impact.
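The loop change described above can be sketched as follows. This is a minimal stand-in, not the actual DPDK code: the real `__rte_raw_cksum` lives in `lib/net/rte_cksum.h`, and `unaligned_uint16_t` is DPDK's 1-byte-aligned 16-bit type; names here are simplified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for DPDK's unaligned_uint16_t: a 16-bit type the compiler
 * may load from any address. */
typedef uint16_t u16_una __attribute__((aligned(1), may_alias));

/* Before: memcpy per iteration. GCC 15+ vectorizes this, Clang 18.1 does not. */
static uint32_t cksum_memcpy(const void *buf, size_t len, uint32_t sum)
{
	const uint8_t *p = buf;
	size_t i;

	for (i = 0; i + 2 <= len; i += 2) {
		uint16_t v;
		memcpy(&v, p + i, sizeof(v));
		sum += v;
	}
	if (len & 1)
		sum += p[len - 1]; /* odd trailing byte (little-endian view) */
	return sum;
}

/* After: unaligned 16-bit loads. Both GCC and Clang vectorize this loop. */
static uint32_t cksum_unaligned(const void *buf, size_t len, uint32_t sum)
{
	const u16_una *u16 = (const u16_una *)buf;
	const uint8_t *p = buf;
	size_t i;

	for (i = 0; i < len / 2; i++)
		sum += u16[i];
	if (len & 1)
		sum += p[len - 1];
	return sum;
}
```

Both variants must return identical sums for any input; that equivalence is exactly what the new fuzz test checks against the old implementation.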
Performance results from cksum_perf_autotest on Intel Xeon
(Cascade Lake, AVX-512) built with Clang 18.1 (TSC cycles/byte):
| Block size | Before | After | Improvement |
|-----------:|-------:|------:|:------------|
|        100 |   0.40 |  0.24 | ~40%        |
|       1500 |   0.50 |  0.06 | ~8x         |
|       9000 |   0.49 |  0.06 | ~8x         |
Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
Signed-off-by: 0-day Robot <robot@bytheb.org>
The optimized __rte_raw_cksum() uses unaligned_uint16_t pointer access,
which triggers UBSAN alignment warnings even though the access is safe
due to the unaligned type definition. Add an __rte_no_ubsan_alignment
attribute to suppress these false-positive warnings while preserving
other UBSAN checks.

Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
Signed-off-by: 0-day Robot <robot@bytheb.org>
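As a rough sketch of what such a suppression attribute looks like (illustrative names; DPDK's actual macro is defined in rte_common.h and gated on compiler detection):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative version of an "allow unaligned access" attribute: tell
 * UBSAN not to flag alignment in a function that deliberately performs
 * unaligned loads through a 1-byte-aligned type. */
#if defined(__clang__) || defined(__GNUC__)
#define no_ubsan_alignment __attribute__((no_sanitize("alignment")))
#else
#define no_ubsan_alignment
#endif

typedef uint16_t u16_una __attribute__((aligned(1), may_alias));

/* Reads a 16-bit value from any address; safe because of the aligned(1)
 * typedef, but UBSAN would warn here without the suppression attribute. */
no_ubsan_alignment
static uint16_t load16_any(const void *p)
{
	return *(const u16_una *)p;
}
```

Because the attribute targets only the alignment sanitizer, other UBSAN checks (overflow, bounds, etc.) stay active in the annotated function.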
GCC has a bug where it incorrectly elides struct initialization in inline
functions when strict aliasing is enabled (-O2/-O3/-Os), causing reads from
uninitialized memory. This affects both designated initializers and manual
field assignment. Add an RTE_FORCE_INIT_BARRIER macro that uses an asm
volatile memory barrier to prevent the compiler from incorrectly optimizing
away struct initialization. Apply the workaround to pseudo-header checksum
functions in rte_ip4.h, rte_ip6.h, the hinic driver, and the mlx5 driver.

Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
Signed-off-by: 0-day Robot <robot@bytheb.org>
Reviewer's Guide

Optimizes the raw checksum implementation for better performance while addressing sanitizer/optimizer issues, adds fuzz and extended performance tests for checksum correctness and coverage, and introduces a macro-based barrier to work around a GCC struct initialization bug in several network paths.

Class diagram for checksum helpers and GCC barrier macro integration

```mermaid
classDiagram
    class EAL_Common {
        <<header>>
        +__rte_no_ubsan_alignment
        +RTE_FORCE_INIT_BARRIER(var)
    }
    class Net_Cksum {
        <<header>>
        +uint32_t __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
    }
    class IPv4_Phdr {
        <<header>>
        +uint16_t rte_ipv4_phdr_cksum(const struct rte_ipv4_hdr *ipv4_hdr, uint64_t ol_flags)
        +psd_hdr local_struct
    }
    class IPv6_Phdr {
        <<header>>
        +uint16_t rte_ipv6_phdr_cksum(const struct rte_ipv6_hdr *ipv6_hdr, uint64_t ol_flags)
        +psd_hdr local_struct
    }
    class Hinic_Tx {
        <<driver>>
        +uint16_t hinic_ipv4_phdr_cksum(const struct rte_ipv4_hdr *ipv4_hdr, uint64_t ol_flags)
        +uint16_t hinic_ipv6_phdr_cksum(const struct rte_ipv6_hdr *ipv6_hdr, uint64_t ol_flags)
        +psd_hdr local_struct
    }
    class Mlx5_Flow {
        <<driver>>
        +__flow_encap_decap_resource_register(struct rte_eth_dev *dev, struct mlx5_flow_dv_encap_decap_resource *resource, struct rte_flow_error *error)
        +encap_decap_key local_struct
    }
    EAL_Common <.. Net_Cksum : uses
    EAL_Common <.. IPv4_Phdr : uses RTE_FORCE_INIT_BARRIER
    EAL_Common <.. IPv6_Phdr : uses RTE_FORCE_INIT_BARRIER
    EAL_Common <.. Hinic_Tx : uses RTE_FORCE_INIT_BARRIER
    EAL_Common <.. Mlx5_Flow : uses RTE_FORCE_INIT_BARRIER
    Net_Cksum <.. IPv4_Phdr : calls __rte_raw_cksum
    Net_Cksum <.. IPv6_Phdr : calls __rte_raw_cksum
    Net_Cksum <.. Hinic_Tx : calls __rte_raw_cksum
    IPv4_Phdr <.. Hinic_Tx : calls rte_ipv4_phdr_cksum
    IPv6_Phdr <.. Hinic_Tx : calls rte_ipv6_phdr_cksum
```
Hey - I've found 2 issues, and left some high level feedback:
- The fuzz helper `init_random_buffer()` calls `rte_rand()` once per byte, which can be quite expensive at 64K lengths and 1000+ iterations; consider filling the buffer in wider (e.g., 64-bit) chunks and only handling a tail to keep the test fast enough for regular runs.
- The fuzz tests only exercise unaligned buffers at a fixed +1 offset; if you want broader coverage of the new unaligned 16-bit load path, you could vary the misalignment (e.g., offsets 1-7) to catch alignment-sensitive issues more reliably.
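The first suggestion can be sketched like this, assuming a `rte_rand()`-style 64-bit generator (mocked here with a seeded splitmix64-style stand-in so the sketch is self-contained):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for rte_rand(): a simple seeded splitmix64-style generator. */
static uint64_t rng_state = 42;
static uint64_t mock_rand64(void)
{
	uint64_t z = (rng_state += 0x9e3779b97f4a7c15ull);
	z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ull;
	z = (z ^ (z >> 27)) * 0x94d049bb133111ebull;
	return z ^ (z >> 31);
}

/* Fill the buffer 8 bytes at a time instead of one RNG call per byte;
 * only the final tail (len % 8 bytes) needs special handling. */
static void init_random_buffer_fast(uint8_t *buf, size_t len)
{
	size_t i = 0;

	for (; i + 8 <= len; i += 8) {
		uint64_t r = mock_rand64();
		memcpy(buf + i, &r, 8); /* memcpy: no alignment assumptions */
	}
	if (i < len) {
		uint64_t r = mock_rand64();
		memcpy(buf + i, &r, len - i);
	}
}
```

This cuts the RNG call count by roughly 8x for large buffers while still touching every byte.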
## Individual Comments
### Comment 1
<location> `app/test/test_cksum_fuzz.c:96` </location>
<code_context>
+ return TEST_FAILED;
+ }
+
+ buf = aligned ? data : (data + 1);
+
+ init_random_buffer(buf, len);
</code_context>
<issue_to_address>
**suggestion (testing):** Consider testing multiple misalignment offsets instead of only +1.
Since the unaligned case always uses `data + 1`, it only covers 1‑byte misalignment. Because `__rte_raw_cksum` now reads unaligned 16‑bit words, it would be useful to iterate over several offsets (e.g., `data + 1`, `data + 2`, `data + 3`, or similar) when `aligned == false` to increase coverage of different alignment patterns without significantly increasing runtime.
Suggested implementation:
```c
const size_t max_misalignment = 3;
alloc_size = aligned ? len : len + max_misalignment;
```
```c
if (aligned) {
buf = data;
init_random_buffer(buf, len);
sum_ref = __rte_raw_cksum_reference(buf, len, initial_sum);
sum_opt = __rte_raw_cksum(buf, len, initial_sum);
if (sum_ref != sum_opt) {
printf("MISMATCH at len=%zu aligned='%s' initial_sum=0x%08x ref=0x%08x opt=0x%08x\n",
len, "aligned",
initial_sum, sum_ref, sum_opt);
rte_hexdump(stdout, "failing buffer", buf, len);
rte_free(data);
return TEST_FAILED;
}
} else {
for (size_t misalignment = 1; misalignment <= max_misalignment; misalignment++) {
buf = data + misalignment;
init_random_buffer(buf, len);
sum_ref = __rte_raw_cksum_reference(buf, len, initial_sum);
sum_opt = __rte_raw_cksum(buf, len, initial_sum);
if (sum_ref != sum_opt) {
printf("MISMATCH at len=%zu aligned='%s' misalignment=%zu initial_sum=0x%08x ref=0x%08x opt=0x%08x\n",
len, "unaligned", misalignment,
initial_sum, sum_ref, sum_opt);
rte_hexdump(stdout, "failing buffer", buf, len);
rte_free(data);
return TEST_FAILED;
}
}
}
```
</issue_to_address>
### Comment 2
<location> `app/test/test_cksum_fuzz.c:162-174` </location>
<code_context>
+
+ printf("Testing edge case lengths...\n");
+
+ for (i = 0; i < RTE_DIM(edge_lengths); i++) {
+ /* Test with zero initial sum */
+ rc = test_cksum_fuzz_length(edge_lengths[i], 0);
+ if (rc != TEST_SUCCESS)
+ return rc;
+
+ /* Test with random initial sum */
+ rc = test_cksum_fuzz_length(edge_lengths[i], get_initial_sum(true));
+ if (rc != TEST_SUCCESS)
</code_context>
<issue_to_address>
**suggestion (testing):** Add explicit tests for extreme initial_sum values to stress carry/overflow behavior.
In `test_cksum_fuzz_edge_cases`, each length is only tested with `initial_sum = 0` and one random value. Please add a few fixed extreme initial sums (e.g., `0xFFFF0000`, `0xFFFFFFFF`, or other high-bit patterns) to make the tests deterministic and to better cover worst‑case carry/overflow behavior in the optimized implementation.
```suggestion
printf("Testing edge case lengths...\n");
for (i = 0; i < RTE_DIM(edge_lengths); i++) {
/* Test with zero initial sum */
rc = test_cksum_fuzz_length(edge_lengths[i], 0);
if (rc != TEST_SUCCESS)
return rc;
/* Test with extreme deterministic initial sums to stress carry/overflow */
rc = test_cksum_fuzz_length(edge_lengths[i], 0xFFFF0000u);
if (rc != TEST_SUCCESS)
return rc;
rc = test_cksum_fuzz_length(edge_lengths[i], 0xFFFFFFFFu);
if (rc != TEST_SUCCESS)
return rc;
rc = test_cksum_fuzz_length(edge_lengths[i], 0x80000000u);
if (rc != TEST_SUCCESS)
return rc;
/* Test with random initial sum */
rc = test_cksum_fuzz_length(edge_lengths[i], get_initial_sum(true));
if (rc != TEST_SUCCESS)
return rc;
}
```
</issue_to_address>
Actionable comments posted: 0
🧹 Nitpick comments (3)
lib/net/rte_ip6.h (1)

575-575: Consider documenting the GCC strict aliasing workaround.

The `RTE_FORCE_INIT_BARRIER` macro prevents GCC from incorrectly optimizing away writes to the local `psd_hdr` struct when strict aliasing optimizations are enabled. Adding a code comment explaining:

- why the barrier is necessary (GCC may assume the struct fields are not accessed due to strict aliasing rules in inlined functions), and
- when this workaround can be removed (if/when minimum GCC version requirements are updated to exclude affected versions)

would help maintainers understand the purpose of this pattern, which is used in multiple locations (rte_ip4.h, rte_ip6.h, and driver code).
app/test/test_cksum_fuzz.c (2)

63-67: Optional: Remove redundant mask.

The `& 0xFFFFFFFF` mask is redundant since the return type `uint32_t` will implicitly truncate the `uint64_t` from `rte_rand()`. The current code works correctly but could be simplified.

♻️ Suggested simplification

```diff
 static inline uint32_t
 get_initial_sum(bool random_initial_sum)
 {
-	return random_initial_sum ? (rte_rand() & 0xFFFFFFFF) : 0;
+	return random_initial_sum ? rte_rand() : 0;
 }
```
204-240: Well-structured test orchestration and registration.

The main test function provides clear progression through test phases with informative output, and the REGISTER_FAST_TEST integration properly exposes the fuzz test to the DPDK test framework.

Optional enhancement: Consider making `DEFAULT_ITERATIONS` configurable via an environment variable (e.g., `RTE_TEST_CKSUM_FUZZ_ITERATIONS`) to allow users to trade off test coverage versus execution time during development and CI.
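That enhancement could look roughly like the following; note that `RTE_TEST_CKSUM_FUZZ_ITERATIONS` is the reviewer's suggested name, not an existing DPDK variable:

```c
#include <assert.h>
#include <stdlib.h>

#define DEFAULT_ITERATIONS 1000

/* Read the iteration count from the environment, falling back to the
 * compile-time default when unset, empty, or not a positive integer. */
static unsigned long fuzz_iterations(void)
{
	const char *s = getenv("RTE_TEST_CKSUM_FUZZ_ITERATIONS");
	char *end;
	unsigned long v;

	if (s == NULL || *s == '\0')
		return DEFAULT_ITERATIONS;
	v = strtoul(s, &end, 10);
	if (*end != '\0' || v == 0)
		return DEFAULT_ITERATIONS;
	return v;
}
```

Falling back silently on malformed input keeps CI runs robust; a variant could instead fail loudly to catch typos in the variable's value.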
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)

- app/test/meson.build
- app/test/test_cksum_fuzz.c
- app/test/test_cksum_perf.c
- drivers/net/hinic/hinic_pmd_tx.c
- drivers/net/mlx5/mlx5_flow_dv.c
- lib/eal/include/rte_common.h
- lib/net/rte_cksum.h
- lib/net/rte_ip4.h
- lib/net/rte_ip6.h
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Sourcery review
🔇 Additional comments (12)
app/test/test_cksum_perf.c (1)
18-18: LGTM! Expanded test coverage for larger packet sizes.

The addition of larger block sizes (9000, 9001 for jumbo frames and 65536, 65537 for maximum IP packet sizes) appropriately extends the performance test coverage. The continued pattern of testing both even and odd sizes ensures alignment edge cases are covered, which is particularly valuable for validating the vectorization improvements mentioned in the PR objectives.

app/test/meson.build (1)

41-41: Test file exists and build entry is correctly integrated.

The new test file `test_cksum_fuzz.c` is present in the repository, and the build configuration entry at line 41 follows the established pattern with the correct dependency specification and proper alphabetical ordering.

drivers/net/mlx5/mlx5_flow_dv.c (1)
4448-4449: The current usage of `RTE_FORCE_INIT_BARRIER(encap_decap_key)` is correct and requires no changes.

The macro accepts any lvalue via its "+m" constraint, and all established usage patterns in the codebase (rte_ip4.h, rte_ip6.h, hinic_pmd_tx.c) pass the entire initialized struct or union, not individual fields. The barrier on the union is appropriate because it signals to the compiler that the entire object, which is being type-punned, may be accessed, effectively preventing unintended optimizations across the initialization and subsequent field access.
drivers/net/hinic/hinic_pmd_tx.c (1)

728-728: LGTM: Initialization barriers correctly applied.

The barriers are consistently placed after pseudo-header initialization and before checksum computation, matching the pattern in the core library functions (`rte_ipv4_phdr_cksum` and `rte_ipv6_phdr_cksum`).

Also applies to: 747-747
lib/net/rte_ip4.h (1)

241-241: LGTM: Barrier placement is correct.

The initialization barrier is properly placed after the `psd_hdr` structure is fully populated and before it's passed to `rte_raw_cksum()`, ensuring all fields are visible to the checksum function.

lib/eal/include/rte_common.h (2)
549-556: LGTM: UBSAN suppression macro is correctly implemented.

The `__rte_no_ubsan_alignment` macro appropriately uses `no_sanitize("alignment")` for GCC/Clang to suppress false-positive alignment warnings in checksum code paths that intentionally use unaligned access with `unaligned_uint16_t`.
558-570: Clarify GCC-only gating and consider extending to Clang.

The `asm volatile` barrier implementation is correct and prevents the strict aliasing optimization issue. However, the macro is only enabled for GCC (line 564), yet both GCC and Clang support `asm volatile("" ::: "memory")` as a compiler barrier, and both can exhibit similar strict aliasing optimizations that elide uninitialized struct writes. The codebase already uses conditional compilation for both compilers in other places (e.g., line 552). Either:

- extend the macro to Clang with `#if defined(RTE_CC_GCC) || defined(RTE_CC_CLANG)`, or
- document why Clang is intentionally excluded (e.g., if testing confirmed Clang does not exhibit this issue in practice).

Additionally, consider adding a reference to the specific GCC bug or affected version(s) in the comment for future maintainers.
lib/net/rte_cksum.h (1)

42-50: All concerns are properly addressed in the implementation.

The optimization safely handles unaligned access:

- `unaligned_uint16_t` is defined with `__rte_aligned(1)` on strict-alignment architectures, explicitly marking intentional unaligned access.
- `__rte_no_ubsan_alignment` suppresses UB sanitizer alignment checks as intended for this deliberate optimization.
- Comprehensive fuzz tests in `app/test/test_cksum_fuzz.c` validate correctness across edge cases (0-1025 bytes), random lengths up to 64K, both aligned and unaligned buffers, and varying initial sums, comparing against the original reference implementation.

app/test/test_cksum_fuzz.c (4)
1-16: LGTM! Clean includes and proper licensing.

The headers are well-organized and include all necessary DPDK primitives for fuzz testing (random data generation, memory allocation, checksum computation, and debug output).

23-52: Excellent reference implementation and test parameters.

The constants provide good coverage (1000 iterations, up to 64K buffers), and the reference implementation correctly preserves the original DPDK v23.11 memcpy-based approach with proper handling of odd-length buffers. This establishes a reliable baseline for validating the optimization.

72-114: Robust test implementation with good defensive coding.

The function properly handles edge cases (zero-length buffers, allocation failures), tests both aligned and unaligned buffers, and provides excellent debugging output via hexdump on mismatches. Memory management is correct on all paths.

119-202: Comprehensive test coverage with excellent edge case selection.

The test suite systematically covers:
- Critical boundaries (power-of-2, SIMD widths, MTU 1500, GRO 64K)
- Both aligned and unaligned buffers
- Zero and random initial sum values
- Random fuzzing to catch unexpected corner cases
This thorough approach should effectively validate the optimization against the reference implementation.
Auto-submission for "http://patchwork.dpdk.org/project/dpdk/list/?series=37006"
Summary by Sourcery
Optimize raw checksum computation and harden it against compiler sanitizer and optimization issues while extending checksum test coverage, including fuzzing and larger payload sizes.