Ship 8b‑1c: Yuva444p10 u16 RGBA SIMD across all 5 backends #34
Merged
Pull request overview
Wires SIMD-dispatched native-depth u16 RGBA conversion for Yuva444p10 (alpha sourced from the A plane) across all supported backends, completing the Yuva444p10 SIMD “vertical slice”.
Changes:
- Replaces the `yuva444p10_to_rgba_u16_row` dispatcher stub with a real `cfg_select!` per-arch dispatch to SIMD wrappers, with scalar fallback when `use_simd = false` or no backend is available.
- Extends each backend’s high-bit 4:4:4 `u16` kernel template with `ALPHA_SRC` + `a_src`, adding a SIMD alpha-load path (masked to `BITS`) and a scalar-tail fallback.
- Adds backend-specific SIMD-vs-scalar equivalence tests for YUVA444p10 → RGBA `u16` (including random alpha patterns), and removes now-stale “scalar-only” performance warnings.
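A minimal sketch of the dispatch shape described in the changes above, assuming runtime feature detection on x86_64 and a plain scalar fallback everywhere else. It uses ordinary `#[cfg]` attributes instead of `cfg_select!`, and `alpha_row`, `alpha_row_scalar`, `alpha_row_sse41`, and `MASK10` are hypothetical names that only copy and mask an alpha plane; the real dispatcher and kernels in `src/row/mod.rs` may be structured differently.

```rust
/// Low 10 bits set: the native-depth alpha range for a 10-bit format.
const MASK10: u16 = (1 << 10) - 1;

/// Scalar fallback: copy source alpha, masked to 10 bits.
fn alpha_row_scalar(a_src: &[u16], out: &mut [u16]) {
    for (o, &a) in out.iter_mut().zip(a_src) {
        *o = a & MASK10;
    }
}

/// Hypothetical SSE4.1 path: 8 u16 lanes per iteration, then a scalar tail.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn alpha_row_sse41(a_src: &[u16], out: &mut [u16]) {
    use std::arch::x86_64::{
        __m128i, _mm_and_si128, _mm_loadu_si128, _mm_set1_epi16, _mm_storeu_si128,
    };
    let n = a_src.len().min(out.len());
    let mut x = 0;
    // SAFETY: every 16-byte load/store stays within the first `n` elements.
    unsafe {
        let mask_v = _mm_set1_epi16(MASK10 as i16);
        while x + 8 <= n {
            let a = _mm_loadu_si128(a_src.as_ptr().add(x) as *const __m128i);
            let masked = _mm_and_si128(a, mask_v);
            _mm_storeu_si128(out.as_mut_ptr().add(x) as *mut __m128i, masked);
            x += 8;
        }
    }
    // Remaining (< 8) elements fall through to the scalar path.
    alpha_row_scalar(&a_src[x..n], &mut out[x..n]);
}

/// Dispatcher: SIMD when requested and available, scalar otherwise.
pub fn alpha_row(a_src: &[u16], out: &mut [u16], use_simd: bool) {
    if use_simd {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("sse4.1") {
                // SAFETY: SSE4.1 support was just verified at runtime.
                unsafe { alpha_row_sse41(a_src, out) };
                return;
            }
        }
    }
    alpha_row_scalar(a_src, out);
}
```

Calling `alpha_row(&src, &mut out, true)` and `alpha_row(&src, &mut out, false)` on the same input should produce identical buffers on any host, which is the property the PR's equivalence tests check for the real kernels.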
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/sinker/mixed/yuva_4_4_4.rs | Removes outdated scalar-only perf note for the u16 alpha-source path. |
| src/row/mod.rs | Implements per-arch SIMD dispatch for yuva444p10_to_rgba_u16_row and updates related docs/comments. |
| src/row/arch/neon.rs | Adds ALPHA_SRC support + alpha-plane SIMD load/store for native-depth u16 RGBA. |
| src/row/arch/neon/tests.rs | Adds NEON equivalence tests for YUVA444p10 → RGBA u16 with source alpha. |
| src/row/arch/x86_sse41.rs | Adds ALPHA_SRC support + alpha-plane SIMD load/store for SSE4.1 native-depth u16 RGBA. |
| src/row/arch/x86_sse41/tests.rs | Adds SSE4.1 equivalence tests for YUVA444p10 → RGBA u16 with source alpha (feature-detected). |
| src/row/arch/x86_avx2.rs | Adds ALPHA_SRC support + alpha-plane SIMD load/store for AVX2 native-depth u16 RGBA. |
| src/row/arch/x86_avx2/tests.rs | Adds AVX2 equivalence tests for YUVA444p10 → RGBA u16 with source alpha (feature-detected). |
| src/row/arch/x86_avx512.rs | Adds ALPHA_SRC support + alpha-plane SIMD load/store for AVX-512 native-depth u16 RGBA. |
| src/row/arch/x86_avx512/tests.rs | Adds AVX-512 equivalence tests for YUVA444p10 → RGBA u16 with source alpha (feature-detected). |
| src/row/arch/wasm_simd128.rs | Adds ALPHA_SRC support + alpha-plane SIMD load/store for wasm simd128 native-depth u16 RGBA. |
| src/row/arch/wasm_simd128/tests.rs | Adds wasm simd128 equivalence tests for YUVA444p10 → RGBA u16 with source alpha (cfg-gated). |
Summary
Wires u16 RGBA SIMD across all 5 backends (NEON, SSE4.1, AVX2, AVX-512, wasm simd128) for the alpha-preserving `Yuva444p10` u16 path. Replaces the `yuva444p10_to_rgba_u16_row` dispatcher stub from PR #32 with a real `cfg_select!` per-arch route block. Mirrors PR #33 (Ship 8b‑1b) structurally: same `ALPHA_SRC` const-generic refactor, applied to the u16 path instead of u8.

Closes the entire Yuva444p10 vertical slice. Subsequent PRs (8b‑2 onward) mass-apply the same Strategy B template to other Yuva format families.
Changes
Strategy B refactor: extend u16 4:4:4 const-ALPHA template with `ALPHA_SRC`

Each backend's existing `yuv_444p_n_to_rgb_or_rgba_u16_row<BITS, ALPHA>` u16 template grows a third const-generic, `ALPHA_SRC: bool`, plus an `a_src: Option<&[u16]>` parameter. The per-pixel store now branches on three combinations:

| `ALPHA` | `ALPHA_SRC` | Alpha store |
|---|---|---|
| `false` | `false` | None (RGB output, no alpha written) |
| `true` | `false` | `(1 << BITS) - 1` u16 (existing path; opaque max at native depth) |
| `true` | `true` | `a_src[x] & bits_mask::<BITS>()` from the source plane (no shift: both source and output are at native bit depth) |

Const-asserted `!ALPHA_SRC || ALPHA` mirrors the scalar template's guard. Existing `yuv_444p_n_to_rgb_u16_row<BITS>` and `yuv_444p_n_to_rgba_u16_row<BITS>` public wrappers keep their signatures unchanged and pass `ALPHA_SRC = false, None` internally.

Each backend gets a new `pub(crate) unsafe fn yuv_444p_n_to_rgba_u16_with_alpha_src_row<const BITS: u32>(y, u, v, a_src, rgba_out, width, matrix, full_range)` wrapper that the dispatcher's `cfg_select!` calls.

Per-backend SIMD alpha load
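As a rough scalar model of the `ALPHA` / `ALPHA_SRC` store logic and the native-depth alpha mask, the sketch below shows one way the three combinations and the const assertion could fit together. `store_row` and its `rgb` input are hypothetical stand-ins (the real kernels compute RGB from Y/U/V with SIMD), and the body of `bits_mask` is an assumption based on its use in this PR.

```rust
/// Mask with the low `BITS` bits set (0x03FF for BITS = 10). Assumed definition.
const fn bits_mask<const BITS: u32>() -> u16 {
    ((1u32 << BITS) - 1) as u16
}

/// Hypothetical scalar store. `rgb` holds already-converted pixels;
/// `out` receives 3 (RGB) or 4 (RGBA) u16 values per pixel.
fn store_row<const BITS: u32, const ALPHA: bool, const ALPHA_SRC: bool>(
    rgb: &[[u16; 3]],
    a_src: Option<&[u16]>,
    out: &mut [u16],
) {
    // Compile-time guard: source alpha is only valid when alpha is written.
    const { assert!(!ALPHA_SRC || ALPHA) };

    let channels = if ALPHA { 4 } else { 3 };
    let alpha: &[u16] = if ALPHA_SRC {
        a_src.expect("ALPHA_SRC = true requires an alpha plane")
    } else {
        &[]
    };

    for (x, (px, dst)) in rgb.iter().zip(out.chunks_exact_mut(channels)).enumerate() {
        dst[..3].copy_from_slice(px);
        if ALPHA {
            dst[3] = if ALPHA_SRC {
                // Native depth in, native depth out: mask only, no shift.
                alpha[x] & bits_mask::<BITS>()
            } else {
                // Opaque maximum at native depth.
                bits_mask::<BITS>()
            };
        }
    }
}
```

With `BITS = 10`, `store_row::<10, true, true>` writes masked source alpha, `store_row::<10, true, false>` writes 1023, and `store_row::<10, false, false>` writes RGB only. Instantiating `store_row::<10, false, true>` fails to compile because of the const assertion.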
The kernel pulls source-alpha at the same width-per-pixel cadence as Y/U/V (4:4:4 = 1:1, no chroma duplication), AND-masks with `bits_mask::<BITS>()`, and feeds the result directly into the alpha lane of the existing `write_rgba_u16_*` / `vst4q_u16` store helper:

| Backend | Alpha load |
|---|---|
| NEON | `vandq_u16(vld1q_u16(...), mask_v)` |
| SSE4.1 | `_mm_and_si128(_mm_loadu_si128(...), mask_v)` |
| AVX2 | `_mm256_loadu_si256` + `_mm256_and_si256`; split into 2× `__m128i` halves via `_mm256_castsi256_si128` / `_mm256_extracti128_si256::<1>` for the two `write_rgba_u16_8` halves |
| AVX-512 | `_mm512_loadu_si512` + `_mm512_and_si512`; split into 4× `__m128i` quarters via `_mm512_extracti32x4_epi32::<0..3>`, fed straight into `write_quarter_rgba` |
| wasm simd128 | `v128_and(v128_load(...), mask_v)` |

No depth conversion: both source alpha and output alpha share the same native bit depth (BITS=10 for Yuva444p10 → u16 alpha in `[0, 1023]`). The u8 path's `>> (BITS - 8)` step (PR #33) is absent here.

No permute fixup needed: the existing `vst4q_u16` / `write_rgba_u16_8` / `write_quarter_rgba` helpers accept the alpha vector in their existing 4th-channel slot. AVX2 / AVX-512 reuse the same `mask_v` constant the Y/U/V loads already use.

Dispatcher wiring (`src/row/mod.rs`)

`yuva444p10_to_rgba_u16_row`'s `let _ = use_simd;` stub from PR #32 replaced with the standard `cfg_select!` per-arch route block. The `# ⚠ Scalar-only as of Ship 8b‑1a` doc warning dropped from the dispatcher; section header drops the remaining "prep" qualifier, since both u8 and u16 paths now have SIMD coverage.

Sinker doc cleanup (`src/sinker/mixed/yuva_4_4_4.rs`)

`MixedSinker<Yuva444p10>::with_rgba_u16`'s "Performance note (Ship 8b‑1a)" warning paragraph dropped now that u16 has SIMD coverage. PR #33 already dropped the same warning from the u8 builder.

Per-backend SIMD u16 equivalence tests (~25)
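The equivalence tests in this section compare each SIMD wrapper against the scalar reference. A self-contained sketch of that pattern follows, under loud assumptions: `rgba_blocked` is a hypothetical stand-in for a SIMD backend (plain Rust in 8-pixel blocks with a scalar tail, no intrinsics), the kernels only interleave planes instead of doing YUV conversion, and `pseudo_random_plane` is an illustrative xorshift generator, not the PR's test helper.

```rust
const MASK10: u16 = (1 << 10) - 1;

/// Scalar reference: interleave R, G, B and masked source alpha.
fn rgba_scalar(r: &[u16], g: &[u16], b: &[u16], a: &[u16], out: &mut [u16]) {
    for (x, px) in out.chunks_exact_mut(4).enumerate().take(a.len()) {
        px.copy_from_slice(&[r[x], g[x], b[x], a[x] & MASK10]);
    }
}

/// Stand-in for a SIMD backend: 8-pixel blocks, then the scalar tail.
fn rgba_blocked(r: &[u16], g: &[u16], b: &[u16], a: &[u16], out: &mut [u16]) {
    let main = a.len() - a.len() % 8;
    for base in (0..main).step_by(8) {
        for lane in 0..8 {
            let x = base + lane;
            out[4 * x] = r[x];
            out[4 * x + 1] = g[x];
            out[4 * x + 2] = b[x];
            out[4 * x + 3] = a[x] & MASK10;
        }
    }
    // Widths that are not a multiple of 8 exercise this fallthrough.
    rgba_scalar(&r[main..], &g[main..], &b[main..], &a[main..], &mut out[4 * main..]);
}

/// Non-solid test data from a xorshift32 generator (seed must be non-zero).
fn pseudo_random_plane(len: usize, seed: u32) -> Vec<u16> {
    let mut s = seed;
    (0..len)
        .map(|_| {
            s ^= s << 13;
            s ^= s >> 17;
            s ^= s << 5;
            s as u16
        })
        .collect()
}

/// True when the blocked kernel reproduces the scalar output exactly.
fn matches_scalar(width: usize) -> bool {
    let r = pseudo_random_plane(width, 1);
    let g = pseudo_random_plane(width, 2);
    let b = pseudo_random_plane(width, 3);
    let a = pseudo_random_plane(width, 4);
    let mut expected = vec![0u16; 4 * width];
    let mut actual = vec![0u16; 4 * width];
    rgba_scalar(&r, &g, &b, &a, &mut expected);
    rgba_blocked(&r, &g, &b, &a, &mut actual);
    expected == actual
}
```

Running `matches_scalar` over a natural width plus the tail widths {17, 31, 47, 63, 1920, 1922} covers both the blocked loop and the scalar tail; a varying alpha plane means a lane-order mistake would change the output instead of being hidden by identical values.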
5 tests per backend × 5 backends, mirroring PR #33's u8 test patterns. Each backend covers:

- `<backend>_yuva444p10_rgba_u16_matches_scalar_all_matrices_<width>`: all 6 `ColorMatrix` × full + limited range × natural block width.
- `<backend>_yuva444p10_rgba_u16_matches_scalar_widths`: natural width + tail widths {17, 31, 47, 63, 1920, 1922} forcing scalar-tail fallthrough.
- `<backend>_yuva444p10_rgba_u16_matches_scalar_random_alpha`: pseudo-random alpha pattern (not solid) to catch SIMD lane-order corruption.

Each test calls the SIMD wrapper directly via `unsafe { arch::<bk>::yuv_444p_n_to_rgba_u16_with_alpha_src_row::<10>(...) }` so all 5 backends are exercised regardless of which CI runner is running. All 15 new x86 tests include `is_x86_feature_detected!` early-return guards (per PR #25 CI fallout: without them, ASAN gets `SIGILL` and Miri reports UB on runners lacking the feature). NEON tests carry `#[cfg_attr(miri, ignore = "...")]`. Wasm is module-level cfg-gated.

Test plan
- `cargo test --lib`: 588 pass on aarch64-darwin (host); was 583 → +5 NEON-side tests run. The 20 x86/wasm tests are gate-guarded for their CI runners.
- `cargo check --tests --lib` clean across host, x86_64-unknown-freebsd, wasm32-unknown-unknown
- `RUSTFLAGS="-Dwarnings" cargo clippy --lib --tests` clean
- `dead_code` warnings: none, since every new `yuv_444p_n_to_rgba_u16_with_alpha_src_row` wrapper is consumed by the dispatcher

Codex adversarial review
Not run: Codex hit the OpenAI usage rate limit (retry window 2026-04-28 02:10 AM). The structural pattern is identical to PR #33 (Ship 8b‑1b u8 SIMD), which Codex approved with no findings. A re-run is available on request once the rate limit clears.
Closes Ship 8b‑1: Yuva444p10 vertical slice
After this PR, every Yuva444p10 SIMD path is wired:
Out of scope (deferred to follow-up)
- `Yuva420p*`, `Yuva422p*`, and other `Yuva444p*` variants (such as 8-bit / 9-bit / 16-bit) → Ship 8b‑2 onward, mass-applying the established Strategy B template.

🤖 Generated with Claude Code