Ship 8b‑1c: Yuva444p10 u16 RGBA SIMD across all 5 backends #34

Merged
al8n merged 2 commits into main from feat/ship8b-yuva444p10-u16-simd
Apr 27, 2026

@uqio uqio commented Apr 27, 2026

Summary

Wires u16 RGBA SIMD across all 5 backends (NEON, SSE4.1, AVX2, AVX-512, wasm simd128) for the alpha-preserving Yuva444p10 u16 path. Replaces the yuva444p10_to_rgba_u16_row dispatcher stub from PR #32 with a real cfg_select! per-arch route block. Mirrors PR #33 (Ship 8b‑1b) structurally — same ALPHA_SRC const-generic refactor, applied to the u16 path instead of u8.

Closes the entire Yuva444p10 vertical slice. Subsequent PRs (8b‑2 onward) mass-apply the same Strategy B template to other Yuva format families.

Changes

Strategy B refactor — extend u16 4:4:4 const-ALPHA template with ALPHA_SRC

Each backend's existing yuv_444p_n_to_rgb_or_rgba_u16_row<BITS, ALPHA> u16 template grows a third const-generic, ALPHA_SRC: bool, plus an a_src: Option<&[u16]> parameter. The per-pixel store now branches on three combinations:

| ALPHA | ALPHA_SRC | Per-pixel alpha behavior |
|---|---|---|
| false | false | RGB-only (no alpha lane in store) |
| true | false | RGBA, alpha = (1 << BITS) - 1 u16 (existing path; opaque max at native depth) |
| true | true | RGBA, alpha = a_src[x] & bits_mask::<BITS>() from the source plane (no shift; both source and output are at native bit depth) |

Const-asserted !ALPHA_SRC || ALPHA mirrors the scalar template's guard. Existing yuv_444p_n_to_rgb_u16_row<BITS> and yuv_444p_n_to_rgba_u16_row<BITS> public wrappers keep their signatures unchanged and pass ALPHA_SRC = false, None internally.
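The three-way branch and its compile-time guard can be sketched in scalar form. This is only an illustration of the store stage, not the crate's kernel: `store_row`, `AlphaGuard`, and the pre-converted `rgb` input are hypothetical, and `bits_mask` is modeled on the helper name mentioned above. The guard uses an associated const as one possible way to express the `!ALPHA_SRC || ALPHA` assertion.

```rust
/// Keeps only the low `BITS` bits of a sample (modeled on `bits_mask::<BITS>()`).
const fn bits_mask<const BITS: u32>() -> u16 {
    ((1u32 << BITS) - 1) as u16
}

/// Hypothetical compile-time guard: source alpha requires an alpha lane.
struct AlphaGuard<const ALPHA: bool, const ALPHA_SRC: bool>;

impl<const ALPHA: bool, const ALPHA_SRC: bool> AlphaGuard<ALPHA, ALPHA_SRC> {
    const OK: () = assert!(!ALPHA_SRC || ALPHA, "ALPHA_SRC requires ALPHA");
}

/// Illustrative store stage only: `rgb` holds already-converted pixels.
fn store_row<const BITS: u32, const ALPHA: bool, const ALPHA_SRC: bool>(
    rgb: &[[u16; 3]],
    a_src: Option<&[u16]>,
    out: &mut [u16],
) {
    // Referencing the const forces the assertion for this instantiation.
    let () = AlphaGuard::<ALPHA, ALPHA_SRC>::OK;

    let channels = if ALPHA { 4 } else { 3 };
    for (x, (px, dst)) in rgb.iter().zip(out.chunks_exact_mut(channels)).enumerate() {
        dst[..3].copy_from_slice(px);
        if ALPHA {
            dst[3] = if ALPHA_SRC {
                // Source alpha, masked to native depth, no shift.
                a_src.expect("ALPHA_SRC requires an alpha plane")[x] & bits_mask::<BITS>()
            } else {
                // Opaque maximum at native depth.
                bits_mask::<BITS>()
            };
        }
    }
}
```

Instantiating `store_row::<10, false, true>` would fail to compile, which is the property the const assertion is meant to provide.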

Each backend gets a new pub(crate) unsafe fn yuv_444p_n_to_rgba_u16_with_alpha_src_row<const BITS: u32>(y, u, v, a_src, rgba_out, width, matrix, full_range) wrapper that the dispatcher's cfg_select! calls.

Per-backend SIMD alpha load

The kernel pulls source-alpha at the same width-per-pixel cadence as Y/U/V (4:4:4 = 1:1, no chroma duplication), AND-masks with bits_mask::<BITS>(), and feeds the result directly into the alpha lane of the existing write_rgba_u16_* / vst4q_u16 store helper:

| Backend | Block | Alpha load |
|---|---|---|
| NEON | 8 px/iter | vandq_u16(vld1q_u16(...), mask_v) |
| SSE4.1 | 8 px/iter | _mm_and_si128(_mm_loadu_si128(...), mask_v) |
| AVX2 | 16 px/iter | _mm256_loadu_si256 + _mm256_and_si256; split into 2× __m128i halves via _mm256_castsi256_si128 / _mm256_extracti128_si256::<1> for the two write_rgba_u16_8 halves |
| AVX-512 | 32 px/iter | _mm512_loadu_si512 + _mm512_and_si512; split into 4× __m128i quarters via _mm512_extracti32x4_epi32::<0..3>, fed straight into write_quarter_rgba |
| wasm simd128 | 8 px/iter | v128_and(v128_load(...), mask_v) |

No depth conversion — both source alpha and output alpha share the same native bit depth (BITS=10 for Yuva444p10 → u16 alpha in [0, 1023]). The u8 path's >> (BITS - 8) step (PR #33) is absent here.
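The difference between the two alpha paths comes down to one shift. A minimal sketch, assuming the u8 path masks before shifting (the description only states the `>> (BITS - 8)` step); `alpha_u8` and `alpha_u16` are hypothetical names used for illustration:

```rust
/// Keeps only the low `BITS` bits of a sample (modeled on `bits_mask::<BITS>()`).
const fn bits_mask<const BITS: u32>() -> u16 {
    ((1u32 << BITS) - 1) as u16
}

/// u8 output: narrow native-depth alpha to 8 bits by dropping the low bits.
/// Masking first is an assumption of this sketch.
fn alpha_u8<const BITS: u32>(a: u16) -> u8 {
    ((a & bits_mask::<BITS>()) >> (BITS - 8)) as u8
}

/// u16 output: alpha stays at native depth, so only the mask is applied.
fn alpha_u16<const BITS: u32>(a: u16) -> u16 {
    a & bits_mask::<BITS>()
}
```

For BITS = 10, a fully opaque source sample of 1023 becomes 255 on the u8 path and stays 1023 on the u16 path.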

No permute fixup needed — the existing vst4q_u16 / write_rgba_u16_8 / write_quarter_rgba helpers accept the alpha vector in their existing 4th-channel slot. AVX2 / AVX-512 reuse the same mask_v constant the Y/U/V loads already use.
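The load, mask, and store cadence can be modeled portably without intrinsics. The sketch below is not any backend's kernel: `LANES`, `and_mask`, and `interleave_rgba_u16` are hypothetical, an 8-element array stands in for one 128-bit vector of u16, and the color channels are assumed to be already converted. It shows the alpha plane being consumed at the same 1:1 rate as the color lanes, followed by a scalar tail for the remaining pixels.

```rust
/// Models one 128-bit vector of u16 lanes (8 px per iteration).
const LANES: usize = 8;

/// Hypothetical stand-in for a vector AND against a broadcast mask.
fn and_mask(v: [u16; LANES], mask: u16) -> [u16; LANES] {
    v.map(|a| a & mask)
}

/// Interleaves planar R/G/B/A into packed RGBA u16, masking alpha to `BITS`.
fn interleave_rgba_u16<const BITS: u32>(
    r: &[u16],
    g: &[u16],
    b: &[u16],
    a_src: &[u16],
    out: &mut [u16],
) {
    let mask = ((1u32 << BITS) - 1) as u16;
    let width = r.len();
    assert!(g.len() == width && b.len() == width && a_src.len() == width);
    assert!(out.len() >= width * 4);

    let mut x = 0;
    while x + LANES <= width {
        // "Vector" alpha load plus mask, one alpha sample per pixel.
        let mut a = [0u16; LANES];
        a.copy_from_slice(&a_src[x..x + LANES]);
        let a = and_mask(a, mask);
        // Interleaved store: alpha goes into the existing 4th-channel slot.
        for i in 0..LANES {
            let o = (x + i) * 4;
            out[o] = r[x + i];
            out[o + 1] = g[x + i];
            out[o + 2] = b[x + i];
            out[o + 3] = a[i];
        }
        x += LANES;
    }

    // Scalar tail for widths that are not a multiple of LANES.
    for i in x..width {
        let o = i * 4;
        out[o] = r[i];
        out[o + 1] = g[i];
        out[o + 2] = b[i];
        out[o + 3] = a_src[i] & mask;
    }
}
```

A wider backend would follow the same shape with a larger block, splitting the masked alpha vector into 8-lane pieces before the store.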

Dispatcher wiring (src/row/mod.rs)

yuva444p10_to_rgba_u16_row's let _ = use_simd; stub from PR #32 replaced with the standard cfg_select! per-arch route block. The # ⚠ Scalar-only as of Ship 8b‑1a doc warning dropped from the dispatcher; section header drops the remaining "prep" qualifier — both u8 and u16 paths now have SIMD coverage.
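The routing idea can be sketched as follows. The real dispatcher uses `cfg_select!`; this sketch substitutes `#[cfg]`-gated functions so it builds on stable Rust, and it returns a tag instead of calling a kernel. The `Backend` enum, `best_simd`, `select_backend`, the priority order, and the exact CPU features checked (for example `avx512bw`) are assumptions, not taken from the PR.

```rust
/// Hypothetical backend tag, for illustration only.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Scalar,
    Neon,
    Sse41,
    Avx2,
    Avx512,
    WasmSimd128,
}

/// x86: choose the widest feature set the running CPU reports.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn best_simd() -> Option<Backend> {
    if is_x86_feature_detected!("avx512bw") {
        Some(Backend::Avx512)
    } else if is_x86_feature_detected!("avx2") {
        Some(Backend::Avx2)
    } else if is_x86_feature_detected!("sse4.1") {
        Some(Backend::Sse41)
    } else {
        None
    }
}

/// aarch64: NEON is part of the baseline.
#[cfg(target_arch = "aarch64")]
fn best_simd() -> Option<Backend> {
    Some(Backend::Neon)
}

/// wasm32: simd128 is a compile-time target feature.
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
fn best_simd() -> Option<Backend> {
    Some(Backend::WasmSimd128)
}

/// Every other target has no SIMD backend.
#[cfg(not(any(
    target_arch = "x86",
    target_arch = "x86_64",
    target_arch = "aarch64",
    all(target_arch = "wasm32", target_feature = "simd128")
)))]
fn best_simd() -> Option<Backend> {
    None
}

/// Scalar when SIMD is disabled or no backend is available.
fn select_backend(use_simd: bool) -> Backend {
    if use_simd {
        best_simd().unwrap_or(Backend::Scalar)
    } else {
        Backend::Scalar
    }
}
```

Passing `use_simd = false` always selects the scalar path, which matches the fallback behavior described for the dispatcher.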

Sinker doc cleanup (src/sinker/mixed/yuva_4_4_4.rs)

MixedSinker<Yuva444p10>::with_rgba_u16's "Performance note (Ship 8b‑1a)" warning paragraph dropped now that u16 has SIMD coverage. PR #33 already dropped the same warning from the u8 builder.

Per-backend SIMD u16 equivalence tests (~25)

5 tests per backend × 5 backends mirroring PR #33's u8 test patterns. Each backend covers:

  • <backend>_yuva444p10_rgba_u16_matches_scalar_all_matrices_<width> — all 6 ColorMatrix × full + limited range × natural block width.
  • <backend>_yuva444p10_rgba_u16_matches_scalar_widths — natural width + tail widths {17, 31, 47, 63, 1920, 1922} forcing scalar-tail fallthrough.
  • <backend>_yuva444p10_rgba_u16_matches_scalar_random_alpha — pseudo-random alpha pattern (not solid) to catch SIMD lane-order corruption.

Each test calls the SIMD wrapper directly via unsafe { arch::<bk>::yuv_444p_n_to_rgba_u16_with_alpha_src_row::<10>(...) } so all 5 backends are exercised regardless of which CI runner is running. All 15 new x86 tests include is_x86_feature_detected! early-return guards (per PR #25 CI fallout: without them, ASAN gets SIGILL and Miri reports UB on runners lacking the feature). NEON tests carry #[cfg_attr(miri, ignore = "...")]. Wasm is module-level cfg-gated.
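The shape of these equivalence checks can be sketched with portable stand-ins. Nothing below is the crate's test code: `pseudo_random_plane`, `alpha_scalar`, `alpha_chunked`, and `check_widths` are hypothetical, the chunked function only imitates a SIMD kernel (8-wide body plus scalar tail), and only the alpha lane is compared. In the real x86 tests, an `is_x86_feature_detected!` early return would come before the unsafe wrapper call.

```rust
/// Deterministic pseudo-random samples limited to `bits` bits (simple LCG).
fn pseudo_random_plane(width: usize, seed: u64, bits: u32) -> Vec<u16> {
    let mask = ((1u32 << bits) - 1) as u16;
    let mut state = seed;
    (0..width)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            ((state >> 33) as u16) & mask
        })
        .collect()
}

/// Reference path: mask one sample at a time.
fn alpha_scalar(a_src: &[u16], out: &mut [u16], mask: u16) {
    for (o, &a) in out.iter_mut().zip(a_src) {
        *o = a & mask;
    }
}

/// Stand-in for a SIMD kernel: 8-wide body, then scalar tail.
fn alpha_chunked(a_src: &[u16], out: &mut [u16], mask: u16) {
    let body = a_src.len() / 8 * 8;
    for (o, a) in out[..body]
        .chunks_exact_mut(8)
        .zip(a_src[..body].chunks_exact(8))
    {
        for i in 0..8 {
            o[i] = a[i] & mask;
        }
    }
    alpha_scalar(&a_src[body..], &mut out[body..], mask);
}

/// True when both paths agree for every width in the list.
fn check_widths(widths: &[usize]) -> bool {
    widths.iter().all(|&w| {
        let a = pseudo_random_plane(w, 0x5EED, 10);
        let mut got = vec![0u16; w];
        let mut want = vec![0u16; w];
        alpha_chunked(&a, &mut got, 1023);
        alpha_scalar(&a, &mut want, 1023);
        got == want
    })
}
```

Widths such as 17 or 1922 are not multiples of the block size, so they exercise both the blocked body and the tail in a single run.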

Test plan

  • cargo test --lib: 588 pass on aarch64-darwin (host); was 583 → +5 NEON-side tests run. The 20 x86/wasm tests are gate-guarded for their CI runners.
  • cargo check --tests --lib clean across host, x86_64-unknown-freebsd, wasm32-unknown-unknown
  • RUSTFLAGS="-Dwarnings" cargo clippy --lib --tests clean
  • Zero dead_code warnings — every new yuv_444p_n_to_rgba_u16_with_alpha_src_row wrapper is consumed by the dispatcher

Codex adversarial review

Not run — Codex hit OpenAI usage rate limit (retry window 2026-04-28 02:10 AM). The structural pattern is identical to PR #33 (Ship 8b‑1b u8 SIMD) which Codex approved with no findings. Re-run available on request once the rate limit clears.

Closes Ship 8b‑1: Yuva444p10 vertical slice

After this PR, every Yuva444p10 SIMD path is wired: the u8 RGBA path (PR #33) and the u16 RGBA path (this PR).

Out of scope (deferred to follow-up)

  • Other Yuva format families (Yuva420p*, Yuva422p*, other Yuva444p* variants like 8-bit / 9-bit / 16-bit) → Ship 8b‑2 onward, mass-applying the established Strategy B template.

🤖 Generated with Claude Code

@al8n al8n changed the title from "update" to "Ship 8b‑1c: Yuva444p10 u16 RGBA SIMD across all 5 backends" Apr 27, 2026
@al8n al8n requested a review from Copilot April 27, 2026 11:09

Copilot AI left a comment

Pull request overview

Wires SIMD-dispatched native-depth u16 RGBA conversion for Yuva444p10 (alpha sourced from the A plane) across all supported backends, completing the Yuva444p10 SIMD “vertical slice”.

Changes:

  • Replaces the yuva444p10_to_rgba_u16_row dispatcher stub with a real cfg_select! per-arch dispatch to SIMD wrappers, with scalar fallback when use_simd = false or no backend is available.
  • Extends each backend’s high-bit 4:4:4 u16 kernel template with ALPHA_SRC + a_src, adding a SIMD alpha-load path (masked to BITS) and a scalar-tail fallback.
  • Adds backend-specific SIMD-vs-scalar equivalence tests for YUVA444p10 → RGBA u16 (including random alpha patterns), and removes now-stale “scalar-only” performance warnings.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| src/sinker/mixed/yuva_4_4_4.rs | Removes outdated scalar-only perf note for the u16 alpha-source path. |
| src/row/mod.rs | Implements per-arch SIMD dispatch for yuva444p10_to_rgba_u16_row and updates related docs/comments. |
| src/row/arch/neon.rs | Adds ALPHA_SRC support + alpha-plane SIMD load/store for native-depth u16 RGBA. |
| src/row/arch/neon/tests.rs | Adds NEON equivalence tests for YUVA444p10 → RGBA u16 with source alpha. |
| src/row/arch/x86_sse41.rs | Adds ALPHA_SRC support + alpha-plane SIMD load/store for SSE4.1 native-depth u16 RGBA. |
| src/row/arch/x86_sse41/tests.rs | Adds SSE4.1 equivalence tests for YUVA444p10 → RGBA u16 with source alpha (feature-detected). |
| src/row/arch/x86_avx2.rs | Adds ALPHA_SRC support + alpha-plane SIMD load/store for AVX2 native-depth u16 RGBA. |
| src/row/arch/x86_avx2/tests.rs | Adds AVX2 equivalence tests for YUVA444p10 → RGBA u16 with source alpha (feature-detected). |
| src/row/arch/x86_avx512.rs | Adds ALPHA_SRC support + alpha-plane SIMD load/store for AVX-512 native-depth u16 RGBA. |
| src/row/arch/x86_avx512/tests.rs | Adds AVX-512 equivalence tests for YUVA444p10 → RGBA u16 with source alpha (feature-detected). |
| src/row/arch/wasm_simd128.rs | Adds ALPHA_SRC support + alpha-plane SIMD load/store for wasm simd128 native-depth u16 RGBA. |
| src/row/arch/wasm_simd128/tests.rs | Adds wasm simd128 equivalence tests for YUVA444p10 → RGBA u16 with source alpha (cfg-gated). |

@al8n al8n merged commit bdf910e into main Apr 27, 2026
43 checks passed
@al8n al8n deleted the feat/ship8b-yuva444p10-u16-simd branch April 27, 2026 11:45