Ship 8b‑1b: Yuva444p10 u8 RGBA SIMD across all 5 backends #33
Merged
Pull request overview
This PR updates the YUVA 4:4:4 (currently Yuva444p10) → RGBA u8 conversion path to use per-architecture SIMD implementations (NEON / SSE4.1 / AVX2 / AVX-512BW / wasm simd128) when use_simd is enabled, and removes/updates now-stale “scalar-only” staging notes.
Changes:
- Wire `row::yuva444p10_to_rgba_row` to runtime-dispatch into new arch-specific `yuv_444p_n_to_rgba_with_alpha_src_row::<10>` SIMD wrappers (with scalar fallback).
- Extend existing 4:4:4 SIMD kernels (NEON/x86/wasm) to optionally source per-pixel alpha from an input alpha plane (via a new const-generic `ALPHA_SRC` path).
- Add SIMD-vs-scalar equivalence tests for the new YUVA 4:4:4 u8 RGBA alpha-source path across multiple backends.
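As a rough illustration of the runtime-dispatch-with-scalar-fallback shape described above, here is a minimal Rust sketch. It is not the crate's dispatcher. The names `alpha10_to_u8_row`, `alpha10_to_u8_sse41`, and `alpha10_to_u8_scalar` are hypothetical, and only the alpha channel is handled. The SSE4.1 function is a placeholder that reuses the scalar body, and plain `#[cfg]` plus `is_x86_feature_detected!` stand in for the PR's `cfg_select!` route block.

```rust
/// Scalar reference: 10-bit alpha stored in u16 -> 8-bit alpha.
fn alpha10_to_u8_scalar(a_src: &[u16], out: &mut [u8]) {
    for (dst, &a) in out.iter_mut().zip(a_src) {
        // Mask to the low 10 bits, then drop the 2 least significant bits.
        *dst = ((a & 0x03FF) >> 2) as u8;
    }
}

/// Placeholder for an SSE4.1 kernel (assumption: a real kernel would use
/// intrinsics; this stand-in reuses the scalar body to keep the sketch short).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn alpha10_to_u8_sse41(a_src: &[u16], out: &mut [u8]) {
    alpha10_to_u8_scalar(a_src, out);
}

/// Hypothetical dispatcher: picks the SIMD route only when it was requested
/// and the CPU reports the feature at runtime; otherwise uses scalar code.
pub fn alpha10_to_u8_row(a_src: &[u16], out: &mut [u8], use_simd: bool) {
    assert!(out.len() >= a_src.len(), "output row too short");

    #[cfg(target_arch = "x86_64")]
    {
        if use_simd && std::arch::is_x86_feature_detected!("sse4.1") {
            // SAFETY: SSE4.1 availability was checked on the line above.
            unsafe { alpha10_to_u8_sse41(a_src, out) };
            return;
        }
    }

    // Keeps `use_simd` "used" on targets without a SIMD route.
    let _ = use_simd;
    alpha10_to_u8_scalar(a_src, out);
}
```

Whichever route is taken, the output should be byte-identical, which is the property the equivalence tests in this PR check per backend.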
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/sinker/mixed/yuva_4_4_4.rs | Removes outdated “scalar-only” performance note from YUVA sinker RGBA wiring docs. |
| src/row/mod.rs | Updates YUVA 4:4:4 dispatcher docs and adds per-arch SIMD dispatch for yuva444p10_to_rgba_row. |
| src/row/arch/x86_sse41.rs | Adds SSE4.1 alpha-source RGBA wrapper and integrates ALPHA_SRC into the shared 4:4:4 kernel. |
| src/row/arch/x86_sse41/tests.rs | Adds SSE4.1 equivalence tests for YUVA 4:4:4 → RGBA u8 (alpha sourced). |
| src/row/arch/x86_avx2.rs | Adds AVX2 alpha-source RGBA wrapper and integrates ALPHA_SRC into the shared 4:4:4 kernel. |
| src/row/arch/x86_avx2/tests.rs | Adds AVX2 equivalence tests for YUVA 4:4:4 → RGBA u8 (alpha sourced). |
| src/row/arch/x86_avx512.rs | Adds AVX-512BW alpha-source RGBA wrapper and integrates ALPHA_SRC into the shared 4:4:4 kernel. |
| src/row/arch/x86_avx512/tests.rs | Adds AVX-512BW equivalence tests for YUVA 4:4:4 → RGBA u8 (alpha sourced). |
| src/row/arch/wasm_simd128.rs | Adds wasm simd128 alpha-source RGBA wrapper and integrates ALPHA_SRC into the shared 4:4:4 kernel. |
| src/row/arch/wasm_simd128/tests.rs | Adds wasm simd128 equivalence tests for YUVA 4:4:4 → RGBA u8 (alpha sourced). |
| src/row/arch/neon.rs | Adds NEON alpha-source RGBA wrapper and integrates ALPHA_SRC into the shared 4:4:4 kernel. |
| src/row/arch/neon/tests.rs | Adds NEON equivalence tests for YUVA 4:4:4 → RGBA u8 (alpha sourced). |
Summary
Wires u8 RGBA SIMD across all 5 backends (NEON, SSE4.1, AVX2, AVX-512, wasm simd128) for the alpha-preserving `Yuva444p10` path. Replaces the `yuva444p10_to_rgba_row` dispatcher stub from PR #32 with a real `cfg_select!` per-arch route block. Mirrors PR #25 (Tranche 5a) and PR #30 (Tranche 7b) structurally.

The companion u16 RGBA SIMD work lands in Ship 8b‑1c.
Changes
Strategy B refactor: extend u8 4:4:4 const-ALPHA template with `ALPHA_SRC`

Each backend's existing `yuv_444p_n_to_rgb_or_rgba_row<BITS, ALPHA>` u8 template grows a third const-generic, `ALPHA_SRC: bool`, plus an `a_src: Option<&[u16]>` parameter. The per-pixel store now branches on three combinations:

| `ALPHA` | `ALPHA_SRC` | Alpha byte written |
|---|---|---|
| `false` | `false` | none (RGB output) |
| `true` | `false` | `0xFF` (existing path, unchanged) |
| `true` | `true` | `(a_src[x] & bits_mask) >> (BITS - 8)` from the source plane |

Const-asserted `!ALPHA_SRC || ALPHA` mirrors the scalar template's guard (PR #32 review fix #2 hardening pattern). The existing `yuv_444p_n_to_rgb_row<BITS>` and `yuv_444p_n_to_rgba_row<BITS>` public wrappers keep their existing signatures unchanged and pass `ALPHA_SRC = false, None` internally.

Each backend gets a new `pub(crate) unsafe fn yuv_444p_n_to_rgba_with_alpha_src_row<const BITS: u32>(y, u, v, a_src, rgba_out, width, matrix, full_range)` wrapper that the dispatcher's `cfg_select!` calls.

Per-backend SIMD alpha load + depth-convert
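For orientation, the three `ALPHA` / `ALPHA_SRC` store combinations and the mask-then-shift depth conversion can be written as a small scalar sketch. This is an illustration under stated assumptions, not the crate's template: `store_row` and `AlphaGuard` are hypothetical names, `rgb` stands in for colour already converted from Y/U/V, and the guard uses an associated-constant assertion, which may differ from how the PR expresses its const assert.

```rust
/// Compile-time guard: sourcing alpha only makes sense when alpha is written.
struct AlphaGuard<const ALPHA: bool, const ALPHA_SRC: bool>;

impl<const ALPHA: bool, const ALPHA_SRC: bool> AlphaGuard<ALPHA, ALPHA_SRC> {
    const OK: () = assert!(!ALPHA_SRC || ALPHA, "ALPHA_SRC requires ALPHA");
}

/// Hypothetical scalar store: packs RGB (3 bytes) or RGBA (4 bytes) per pixel.
fn store_row<const BITS: u32, const ALPHA: bool, const ALPHA_SRC: bool>(
    rgb: &[[u8; 3]],
    a_src: Option<&[u16]>,
    out: &mut [u8],
) {
    // Referencing the constant forces the assertion for this instantiation.
    let () = AlphaGuard::<ALPHA, ALPHA_SRC>::OK;

    let bpp = if ALPHA { 4 } else { 3 };
    let bits_mask = ((1u32 << BITS) - 1) as u16;
    assert!(out.len() >= rgb.len() * bpp, "output row too short");

    for (x, px) in rgb.iter().enumerate() {
        let dst = &mut out[x * bpp..(x + 1) * bpp];
        dst[..3].copy_from_slice(px);
        if ALPHA {
            dst[3] = if ALPHA_SRC {
                // Source-plane alpha: mask to BITS, then shift down to 8 bits.
                let a = a_src.expect("ALPHA_SRC = true needs an alpha plane");
                ((a[x] & bits_mask) >> (BITS - 8)) as u8
            } else {
                // Opaque alpha, as in the pre-existing RGBA path.
                0xFF
            };
        }
    }
}
```

With `BITS = 10` the shift is 2, so a full-scale source alpha of 1023 maps to 255. Instantiating the invalid combination, for example `store_row::<10, false, true>`, is rejected at build time by the guard rather than at runtime.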
The kernel pulls source-alpha at the same width-per-pixel cadence as Y/U/V (4:4:4 = 1:1, no chroma duplication), AND-masks with `bits_mask::<BITS>()`, right-shifts by `BITS - 8` (= 2 for `BITS = 10`), and packs to u8, feeding the result into the alpha lane of the existing `write_rgba_*` store helper:

| Backend | Alpha load + depth-convert chain |
|---|---|
| NEON | `vld1q_u16` × 2 → `vandq_u16` → `vshlq_u16` (variable shift) → `vqmovn_u16` × 2 + `vcombine_u8` → `vst4q_u8` |
| SSE4.1 | `_mm_loadu_si128` × 2 → `_mm_and_si128` → `_mm_srl_epi16` → `_mm_packus_epi16` → `write_rgba_16` |
| AVX2 | `_mm256_loadu_si256` × 2 → `_mm256_and_si256` → `_mm256_srl_epi16` → `narrow_u8x32` (reuses pack-fixup permute) → `write_rgba_32` |
| AVX-512 | `_mm512_loadu_si512` × 2 → `_mm512_and_si512` → `_mm512_srl_epi16` → `narrow_u8x64` (reuses `pack_fixup`) → `write_rgba_64` |
| wasm simd128 | `v128_load` × 2 → `v128_and` → `u16x8_shr` → `u8x16_narrow_i16x8` → `write_rgba_16` |

Per-backend gotcha: variable shift count. `BITS - 8` isn't const-evaluable as a literal-immediate shift count for `vshrq_n_u16` / `_mm{,256,512}_srli_epi16::<IMM8>`, so each backend uses the variable-count shift sibling (`vshlq_u16` with negative count for NEON, `_mm{,256,512}_srl_epi16` with `_mm_cvtsi32_si128` for x86; wasm's `u16x8_shr` already takes a runtime u32). Same hardening pattern that exists elsewhere in the crate.

Dispatcher wiring (`src/row/mod.rs`)

`yuva444p10_to_rgba_row`'s `let _ = use_simd;` stub from PR #32 replaced with the standard `cfg_select!` per-arch route block. The "⚠ Scalar-only as of Ship 8b‑1a" doc warning is dropped from the u8 dispatcher; section header drops the "prep" qualifier on the u8 path. The u16 dispatcher (`yuva444p10_to_rgba_u16_row`) stays scalar-only until Ship 8b‑1c.

Sinker doc cleanup (`src/sinker/mixed/yuva_4_4_4.rs`)

`MixedSinker<Yuva444p10>::with_rgba`'s "Performance note (Ship 8b‑1a)" warning dropped now that u8 has SIMD coverage. The `with_rgba_u16` builder keeps its warning until 8b‑1c lands.

Per-backend SIMD equivalence tests (~25)
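A SIMD-vs-scalar equivalence check of the kind this PR adds might be shaped like the sketch below. It covers the alpha channel only and is not taken from the crate: `alpha_scalar`, `alpha_sse41`, and `simd_matches_scalar` are hypothetical names, and the generator is a simple linear congruential stand-in for whatever pseudo-random pattern the real tests use. The kernel does use the variable-count `_mm_srl_epi16` with a count built by `_mm_cvtsi32_si128`, and the check returns early when SSE4.1 is not detected.

```rust
/// Scalar reference: mask to BITS, then shift down to 8 bits.
fn alpha_scalar<const BITS: u32>(a_src: &[u16], out: &mut [u8]) {
    let mask = ((1u32 << BITS) - 1) as u16;
    for (dst, &a) in out.iter_mut().zip(a_src) {
        *dst = ((a & mask) >> (BITS - 8)) as u8;
    }
}

/// Illustrative SSE4.1-gated kernel: 16 pixels per iteration, scalar tail.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn alpha_sse41<const BITS: u32>(a_src: &[u16], out: &mut [u8]) {
    use std::arch::x86_64::*;
    assert!(out.len() >= a_src.len(), "output row too short");
    let mask = _mm_set1_epi16(((1u32 << BITS) - 1) as i16);
    // Variable-count shift: the count is carried in a vector register.
    let shift = _mm_cvtsi32_si128((BITS - 8) as i32);
    let mut x = 0;
    while x + 16 <= a_src.len() {
        let p = a_src.as_ptr().add(x) as *const __m128i;
        let lo = _mm_srl_epi16(_mm_and_si128(_mm_loadu_si128(p), mask), shift);
        let hi = _mm_srl_epi16(_mm_and_si128(_mm_loadu_si128(p.add(1)), mask), shift);
        // Values are <= 255 after the shift, so the saturating pack is exact.
        let packed = _mm_packus_epi16(lo, hi);
        _mm_storeu_si128(out.as_mut_ptr().add(x) as *mut __m128i, packed);
        x += 16;
    }
    // Remaining pixels (width not a multiple of 16) fall through to scalar.
    alpha_scalar::<BITS>(&a_src[x..], &mut out[x..]);
}

/// Returns true when SIMD and scalar agree, or when the check is skipped
/// because the feature (or architecture) is unavailable.
fn simd_matches_scalar(width: usize) -> bool {
    // Non-solid alpha, including stray bits above bit 9, from a small LCG.
    let mut state = 0x1234_5678u32;
    let a_src: Vec<u16> = (0..width)
        .map(|_| {
            state = state.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
            (state >> 16) as u16
        })
        .collect();
    let mut want = vec![0u8; width];
    alpha_scalar::<10>(&a_src, &mut want);

    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("sse4.1") {
            let mut got = vec![0u8; width];
            // SAFETY: SSE4.1 availability was checked on the line above.
            unsafe { alpha_sse41::<10>(&a_src, &mut got) };
            return got == want;
        }
    }
    true
}
```

Calling `simd_matches_scalar` with a block-aligned width such as 16 and a tail width such as 17 exercises both the vector loop and the scalar fallthrough.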
5 tests per backend × 5 backends mirroring PR #30's structure. Each backend covers:

- `<backend>_yuva444p10_rgba_matches_scalar_all_matrices_<width>`: all 6 `ColorMatrix` × full + limited range × natural block width.
- `<backend>_yuva444p10_rgba_matches_scalar_widths`: natural width + tail widths {17, 31, 47, 63, 1920, 1922} to exercise the scalar-tail fallthrough.
- `<backend>_yuva444p10_rgba_matches_scalar_random_alpha`: pseudo-random alpha pattern (not solid) to catch SIMD lane-order corruption.

Each test calls the SIMD wrapper directly via `unsafe { arch::<bk>::yuv_444p_n_to_rgba_with_alpha_src_row::<10>(...) }` so all 5 backends are exercised regardless of which CI runner is running. All 15 new x86 tests include `is_x86_feature_detected!` early-return guards (per PR #25 CI fallout: without them, ASAN gets `SIGILL` and Miri reports UB on runners lacking the feature). NEON tests carry `#[cfg_attr(miri, ignore = "...")]`. Wasm is module-level cfg-gated.

Test plan
- `cargo test --lib`: 583 pass on aarch64-darwin (host); was 578 → +5 NEON-side tests run. The 20 x86/wasm tests are gate-guarded for their CI runners.
- `cargo check --tests --lib` clean across host, x86_64-unknown-freebsd, wasm32-unknown-unknown
- `RUSTFLAGS="-Dwarnings" cargo clippy --lib --tests` clean
- No `dead_code` warnings: every new `yuv_444p_n_to_rgba_with_alpha_src_row` wrapper is consumed by the dispatcher

Codex adversarial review
Verdict: approve. No material findings. Per Codex: "The SIMD alpha-source paths appear bounds-guarded by the public dispatcher and match the existing scalar alpha masking/shift contract."
Out of scope (deferred to follow-up)
- `Yuva420p*`, `Yuva422p*`, other `Yuva444p*` variants → Ship 8b‑2 onward, mass-applying the established Strategy B template

🤖 Generated with Claude Code