Ship 8 Tranche 7b: high-bit 4:4:4 RGBA u8 SIMD #30
Merged
Pull request overview
This PR completes the u8 RGBA SIMD wiring for high-bit-depth 4:4:4 YUV/Pn formats by replacing the prior stub RGBA dispatchers with real per-arch SIMD routing and adding per-backend SIMD↔scalar equivalence tests.
Changes:
- Wire the 8 high-bit 4:4:4 u8 RGBA dispatchers in `src/row/mod.rs` to per-arch SIMD backends (NEON/SSE4.1/AVX2/AVX-512/simd128), with scalar fallback when `use_simd == false` or unavailable.
- Refactor each SIMD backend’s existing u8 RGB kernels into a shared const-`ALPHA` implementation plus thin RGB/RGBA wrappers for the 4:4:4 families.
- Add per-backend u8 RGBA equivalence tests to byte-pin SIMD output against the scalar reference across matrices/ranges and tail widths.
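The const-`ALPHA` refactor in the second bullet can be pictured with a small scalar sketch. Everything here is hypothetical: the function names and the placeholder gray "math" are illustrative assumptions, not the PR's kernels, which run the YUV→RGB matrix in SIMD. Only the shape (one template, two thin wrappers, a store that branches on `ALPHA`) is meant to carry over.

```rust
/// Hypothetical scalar stand-in for one shared kernel. `ALPHA` picks the
/// output layout at compile time: 3 bytes (RGB) or 4 bytes (RGBA) per pixel.
fn luma_to_rgb_or_rgba_row<const ALPHA: bool>(luma: &[u8], out: &mut [u8]) {
    let bpp = if ALPHA { 4 } else { 3 };
    assert!(out.len() >= luma.len() * bpp, "output row too short");
    for (px, &y) in out.chunks_exact_mut(bpp).zip(luma) {
        // Placeholder per-pixel math (gray); it does not depend on ALPHA.
        px[..3].copy_from_slice(&[y, y, y]);
        // Only the store differs: RGBA appends an opaque alpha byte.
        if ALPHA {
            px[3] = 0xFF;
        }
    }
}

/// Thin RGB wrapper over the shared template.
pub fn luma_to_rgb_row(luma: &[u8], out: &mut [u8]) {
    luma_to_rgb_or_rgba_row::<false>(luma, out)
}

/// Thin RGBA wrapper over the shared template.
pub fn luma_to_rgba_row(luma: &[u8], out: &mut [u8]) {
    luma_to_rgb_or_rgba_row::<true>(luma, out)
}
```

Because `ALPHA` is a const generic, each wrapper monomorphizes to its own copy of the loop, so the `if ALPHA` branch should cost nothing at runtime.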
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/row/mod.rs | Replaces the u8 RGBA stub dispatchers with real cfg_select! per-arch SIMD routing (scalar fallback preserved). |
| src/row/arch/neon.rs | Adds RGBA wrappers and shared const-ALPHA 4:4:4 u8 kernel path for NEON. |
| src/row/arch/neon/tests.rs | Adds NEON u8 RGBA SIMD↔scalar equivalence tests for all 4:4:4 kernel families. |
| src/row/arch/x86_sse41.rs | Adds RGBA wrappers and shared const-ALPHA 4:4:4 u8 kernel path for SSE4.1. |
| src/row/arch/x86_sse41/tests.rs | Adds SSE4.1 u8 RGBA SIMD↔scalar equivalence tests with runtime feature guards. |
| src/row/arch/x86_avx2.rs | Adds RGBA wrappers and shared const-ALPHA 4:4:4 u8 kernel path for AVX2. |
| src/row/arch/x86_avx2/tests.rs | Adds AVX2 u8 RGBA SIMD↔scalar equivalence tests with runtime feature guards. |
| src/row/arch/x86_avx512.rs | Adds RGBA wrappers and shared const-ALPHA 4:4:4 u8 kernel path for AVX-512BW. |
| src/row/arch/x86_avx512/tests.rs | Adds AVX-512 u8 RGBA SIMD↔scalar equivalence tests with runtime feature guards. |
| src/row/arch/wasm_simd128.rs | Adds RGBA wrappers and shared const-ALPHA 4:4:4 u8 kernel path for wasm simd128. |
| src/row/arch/wasm_simd128/tests.rs | Adds wasm simd128 u8 RGBA SIMD↔scalar equivalence tests (module-level simd128 gating). |
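
The x86 test rows above mention runtime feature guards. A minimal sketch of that test shape follows, assuming a hypothetical `luma_to_rgba_row_*` kernel pair; the AVX2 function simply delegates to the scalar code so the example stays short and runs anywhere, whereas a real backend would use intrinsics. The point is the early return when the CPU lacks the feature, followed by a byte-for-byte comparison against the scalar reference.

```rust
/// Hypothetical scalar reference kernel: luma → opaque gray RGBA.
fn luma_to_rgba_row_scalar(luma: &[u8], rgba: &mut [u8]) {
    for (px, &y) in rgba.chunks_exact_mut(4).zip(luma) {
        px.copy_from_slice(&[y, y, y, 0xFF]);
    }
}

/// Hypothetical stand-in for an AVX2 kernel (delegates to scalar here).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn luma_to_rgba_row_avx2(luma: &[u8], rgba: &mut [u8]) {
    luma_to_rgba_row_scalar(luma, rgba)
}

/// Deterministic pseudo-random input (simple LCG) so failures reproduce.
fn pseudo_random_row(width: usize, mut state: u32) -> Vec<u8> {
    (0..width)
        .map(|_| {
            state = state.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
            (state >> 24) as u8
        })
        .collect()
}

/// Returns `None` when the check is skipped (feature missing), otherwise
/// `Some(true)` if the backend output is byte-identical to scalar.
#[cfg(target_arch = "x86_64")]
pub fn avx2_rgba_matches_scalar(width: usize) -> Option<bool> {
    // Early-return guard: never execute AVX2 code on a CPU without it.
    if !std::arch::is_x86_feature_detected!("avx2") {
        return None;
    }
    let luma = pseudo_random_row(width, 0x00C0_FFEE);
    let mut want = vec![0u8; width * 4];
    let mut got = vec![0u8; width * 4];
    luma_to_rgba_row_scalar(&luma, &mut want);
    // SAFETY: AVX2 support was verified just above.
    unsafe { luma_to_rgba_row_avx2(&luma, &mut got) };
    Some(want == got)
}

/// On other architectures the check is always skipped.
#[cfg(not(target_arch = "x86_64"))]
pub fn avx2_rgba_matches_scalar(_width: usize) -> Option<bool> {
    None
}
```

A test would call this for several widths, including ones that are not a multiple of the vector width (for example 1, 17, 33, 1920), so the scalar tail path is exercised as well.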
The 4:4:4 high-bit YUV planar SIMD docs claimed `BITS ∈ {10, 12, 14}`
across all 5 backends, but the const-assert in every implementation
accepts `BITS == 9 || 10 || 12 || 14` and the `yuv444p9_to_rgba_row`
public dispatcher (added in PR #29) instantiates the kernel with
`<9>`. The doc string was stale from before BITS=9 was added in Ship 6b.
Updates both the const-generic bound (`{10, 12, 14}` → `{9, 10, 12, 14}`)
and the prose bit-list (`10/12/14-bit` → `9/10/12/14-bit`) on every
4:4:4 planar SIMD doc — covers the u8 RGB, u8 RGBA (added in this PR),
and u16 RGB siblings across NEON, SSE4.1, AVX2, AVX-512, and wasm
simd128. 23 lines updated total.
Addresses Copilot review comments on PR #30. Also retroactively fixes
the matching drift on the u16 RGB and pre-existing u8 RGB docs that
Copilot didn't explicitly flag but had identical wording.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
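
To make the doc/assert drift described in this commit message concrete, here is a small sketch of a const-generic kernel whose doc comment and compile-time assert agree on `BITS ∈ {9, 10, 12, 14}`. The helper type `SupportedBits` and the function `narrow_row_to_u8` are hypothetical names used only for illustration; the repository's kernels may express the assert differently.

```rust
/// Compile-time guard for the supported bit depths (hypothetical helper).
struct SupportedBits<const BITS: u32>;

impl<const BITS: u32> SupportedBits<BITS> {
    const CHECK: () = assert!(
        BITS == 9 || BITS == 10 || BITS == 12 || BITS == 14,
        "BITS must be one of 9, 10, 12, 14"
    );
}

/// Hypothetical kernel: narrows `BITS`-bit samples to 8 bits.
///
/// `BITS` ∈ {9, 10, 12, 14}, matching the assert above.
pub fn narrow_row_to_u8<const BITS: u32>(src: &[u16], dst: &mut [u8]) {
    // Referencing the constant forces the check for this instantiation.
    let () = SupportedBits::<BITS>::CHECK;
    // Keep only the low BITS bits, then drop the extra precision.
    let mask = (1u16 << BITS) - 1;
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = ((s & mask) >> (BITS - 8)) as u8;
    }
}
```

With this shape, instantiating `narrow_row_to_u8::<11>` fails to build, while `::<9>` is accepted, so a doc string that still says `{10, 12, 14}` is out of step with what the compiler actually allows.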
Summary
Wires u8 RGBA SIMD across all 5 backends for high-bit-depth 4:4:4 YUV (`Yuv444p9/10/12/14/16`, `P410/P412/P416`) and replaces the 8 stub dispatchers landed in PR #29 with real `cfg_select!` per-arch routes. Mirrors PR #25 (Tranche 5a) exactly, which did the same for 4:2:0.
The companion u16 RGBA SIMD work + sinker integration lands in Tranche 7c.
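
As a rough illustration of what replacing a stub dispatcher with a per-arch route means, the sketch below shows the control flow only. It makes several assumptions: plain `#[cfg]` blocks are used in place of the `cfg_select!` macro named above, the function names are hypothetical, only an SSE4.1 route is shown, and the "SIMD" function delegates to the scalar code instead of using intrinsics.

```rust
/// Scalar reference (hypothetical): packed RGB → RGBA with opaque alpha.
fn rgb_to_rgba_row_scalar(rgb: &[u8], rgba: &mut [u8]) {
    for (dst, src) in rgba.chunks_exact_mut(4).zip(rgb.chunks_exact(3)) {
        dst[..3].copy_from_slice(src);
        dst[3] = 0xFF;
    }
}

/// Stand-in for an SSE4.1 backend kernel (delegates to scalar here).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn rgb_to_rgba_row_sse41(rgb: &[u8], rgba: &mut [u8]) {
    rgb_to_rgba_row_scalar(rgb, rgba)
}

/// Dispatcher: try the per-arch route when allowed, else fall back.
pub fn rgb_to_rgba_row(rgb: &[u8], rgba: &mut [u8], use_simd: bool) {
    if use_simd {
        #[cfg(target_arch = "x86_64")]
        {
            if std::arch::is_x86_feature_detected!("sse4.1") {
                // SAFETY: SSE4.1 support was just verified at runtime.
                unsafe { rgb_to_rgba_row_sse41(rgb, rgba) };
                return;
            }
        }
    }
    // `use_simd == false`, unsupported CPU, or other architecture.
    rgb_to_rgba_row_scalar(rgb, rgba);
}
```

Calling `rgb_to_rgba_row(&rgb, &mut out, false)` always takes the scalar path, which is the behaviour a stub dispatcher had for every input.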
Changes
SIMD (5 backends × 4 kernel families = 20 kernel refactors)
Each backend's existing u8 RGB kernel becomes a thin wrapper over a const-ALPHA template, alongside a new RGBA wrapper:
| Shared template | RGB wrapper | RGBA wrapper |
|---|---|---|
| `yuv_444p_n_to_rgb_or_rgba_row<BITS, ALPHA>` | `yuv_444p_n_to_rgb_row<BITS>` | `yuv_444p_n_to_rgba_row<BITS>` |
| `yuv_444p16_to_rgb_or_rgba_row<ALPHA>` | `yuv_444p16_to_rgb_row` | `yuv_444p16_to_rgba_row` |
| `p_n_444_to_rgb_or_rgba_row<BITS, ALPHA>` | `p_n_444_to_rgb_row<BITS>` | `p_n_444_to_rgba_row<BITS>` |
| `p_n_444_16_to_rgb_or_rgba_row<ALPHA>` | `p_n_444_16_to_rgb_row` | `p_n_444_16_to_rgba_row` |

Only the per-iteration store and the scalar tail dispatch branch on `ALPHA`; per-pixel math is unchanged. Alpha = `0xFF` for all u8 RGBA paths. Per-arch alpha splat: `vdupq_n_u8(0xFF)` (NEON) / `_mm_set1_epi8(-1)` (SSE4.1) / `_mm256_set1_epi8(-1)` (AVX2) / `_mm512_set1_epi8(-1)` (AVX-512) / `u8x16_splat(0xFF)` (wasm).
RGBA store helpers (`vst4q_u8`, `write_rgba_16/32/64`) are reused verbatim from PR #25's 4:2:0 work, so no new helpers are needed.
The 4:4:4 kernel structure is simpler than 4:2:0: chroma is 1:1 with Y so there's no horizontal duplication step, no chroma-pair `while`-loop split, no `_lo`/`_hi` half pairs at the store. The const-ALPHA refactor was therefore mechanical: only the store branch and the tail dispatch needed `if ALPHA { ... } else { ... }`.
Dispatcher wiring (8 u8 RGBA dispatchers in `src/row/mod.rs`)
Replace the 8 `let _ = use_simd; // SIMD per-arch routes land in Ship 8 Tranche 7b.` stubs (landed in PR #29) with the standard `cfg_select!` per-arch route block, mirroring the existing high-bit RGB dispatchers:
- `yuv444p9_to_rgba_row`, `yuv444p10_to_rgba_row`, `yuv444p12_to_rgba_row`, `yuv444p14_to_rgba_row` (BITS-generic planar)
- `yuv444p16_to_rgba_row` (16-bit dedicated planar)
- `p410_to_rgba_row`, `p412_to_rgba_row` (BITS-generic Pn)
- `p416_to_rgba_row` (16-bit dedicated Pn)

`use_simd = false` still forces scalar. The 8 u16 RGBA dispatchers still route to scalar; those land in 7c.
Per-backend RGBA equivalence tests (~30 new tests)
6 tests per backend × 5 backends, mirroring PR #25's structure. Each backend covers all 4 kernel families across narrow + tail + 1920 widths and the full ColorMatrix × range cross-product:
- `<backend>_yuv444p_n_rgba_matches_scalar_all_bits` (BITS=9/10/12/14)
- `<backend>_yuv444p_n_rgba_matches_scalar_tail_and_widths`
- `<backend>_pn_444_rgba_matches_scalar_all_bits` (BITS=10/12)
- `<backend>_pn_444_rgba_matches_scalar_tail_and_widths`
- `<backend>_yuv444p16_rgba_matches_scalar_all_matrices`
- `<backend>_p416_rgba_matches_scalar_all_matrices`

All 18 new x86 `#[test]` functions (6 SSE4.1 + 6 AVX2 + 6 AVX-512) include `is_x86_feature_detected!` early-return guards, per the Tranche 5a CI fallout (without them, ASAN hits `SIGILL` and Miri reports UB on runners lacking the feature). NEON tests use `#[cfg_attr(miri, ignore = "...")]`. Wasm tests are module-level cfg-gated by `target_feature = "simd128"`.
Test plan
- `cargo test --lib`: 519 pass on aarch64-darwin (host); was 513 → +6 NEON-side RGBA tests. The other 24 (x86 + wasm gate-guarded) fire on their respective CI runners.
- `cargo check --tests --lib` clean across host, x86_64-unknown-freebsd, wasm32-unknown-unknown
- `RUSTFLAGS="-Dwarnings" cargo clippy --lib --tests` clean
- No `dead_code` warnings: every new `*_to_rgba_row` wrapper is consumed by its dispatcher

Codex adversarial review
Verdict: approve. No material findings.
Out of scope (deferred to follow-up)
- Sinker integration (`MixedSinker<Yuv444p9..16>`, `<P410/P412/P416>`, `<Yuv440p10/12>`): Tranche 7c

🤖 Generated with Claude Code