Ship 8b‑1b: Yuva444p10 u8 RGBA SIMD across all 5 backends#33

Merged
uqio merged 1 commit into main from feat/ship8b-yuva444p10-u8-simd
Apr 27, 2026


@uqio uqio commented Apr 27, 2026

Summary

Wires u8 RGBA SIMD across all 5 backends (NEON, SSE4.1, AVX2, AVX-512, wasm simd128) for the alpha-preserving Yuva444p10 path. Replaces the yuva444p10_to_rgba_row dispatcher stub from PR #32 with a real cfg_select! per-arch route block. Mirrors PR #25 (Tranche 5a) and PR #30 (Tranche 7b) structurally.

The companion u16 RGBA SIMD work lands in Ship 8b‑1c.

Changes

Strategy B refactor — extend u8 4:4:4 const-ALPHA template with ALPHA_SRC

Each backend's existing yuv_444p_n_to_rgb_or_rgba_row<BITS, ALPHA> u8 template grows a third const-generic, ALPHA_SRC: bool, plus an a_src: Option<&[u16]> parameter. The per-pixel store now branches on three combinations:

| ALPHA | ALPHA_SRC | Per-pixel alpha behavior |
|---|---|---|
| false | false | RGB-only (no alpha lane in store) |
| true | false | RGBA, alpha = 0xFF (existing path, unchanged) |
| true | true | RGBA, alpha = (a_src[x] & bits_mask) >> (BITS - 8) from the source plane |

A const assertion of !ALPHA_SRC || ALPHA rejects the fourth combination (ALPHA = false, ALPHA_SRC = true) at compile time, mirroring the scalar template's guard (the hardening pattern from PR #32 review fix #2). The existing yuv_444p_n_to_rgb_row<BITS> and yuv_444p_n_to_rgba_row<BITS> public wrappers keep their signatures unchanged and pass ALPHA_SRC = false and a_src = None internally.

Each backend gets a new pub(crate) unsafe fn yuv_444p_n_to_rgba_with_alpha_src_row<const BITS: u32>(y, u, v, a_src, rgba_out, width, matrix, full_range) wrapper that the dispatcher's cfg_select! calls.
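The three store modes and the compile-time guard can be sketched in scalar Rust. This is a minimal illustration, not the crate's template: store_row and its rgb input are hypothetical, and bits_mask is written here only from its description in this PR (the real helper may differ).

```rust
/// Mask keeping the low `BITS` bits of a sample. Hypothetical stand-in for
/// the `bits_mask::<BITS>()` helper named in the PR text.
const fn bits_mask<const BITS: u32>() -> u16 {
    ((1u32 << BITS) - 1) as u16
}

/// Stores one row of already-converted 8-bit R/G/B triples, choosing the
/// alpha behavior at compile time. `store_row` is an illustrative name.
fn store_row<const BITS: u32, const ALPHA: bool, const ALPHA_SRC: bool>(
    rgb: &[[u8; 3]],
    a_src: Option<&[u16]>,
    out: &mut [u8],
) {
    // Compile-time guard: sourcing alpha is only valid when an alpha lane
    // is written at all.
    const { assert!(!ALPHA_SRC || ALPHA) };

    let bpp = if ALPHA { 4 } else { 3 };
    for (x, px) in rgb.iter().enumerate() {
        let o = &mut out[x * bpp..x * bpp + bpp];
        o[..3].copy_from_slice(px);
        if ALPHA {
            o[3] = if ALPHA_SRC {
                let a = a_src.expect("ALPHA_SRC requires an alpha plane")[x];
                // Mask to the valid bit depth, then drop the low bits.
                ((a & bits_mask::<BITS>()) >> (BITS - 8)) as u8
            } else {
                0xFF // opaque alpha, the pre-existing RGBA path
            };
        }
    }
}
```

Because ALPHA and ALPHA_SRC are const generics, each instantiation compiles down to a single store path; instantiating store_row::<10, false, true> fails to compile rather than failing at runtime.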

Per-backend SIMD alpha load + depth-convert

The kernel pulls source-alpha at the same width-per-pixel cadence as Y/U/V (4:4:4 = 1:1, no chroma duplication), AND-masks with bits_mask::<BITS>(), right-shifts by BITS - 8 (= 2 for BITS = 10), and packs to u8 — feeding the result into the alpha lane of the existing write_rgba_* store helper:

| Backend | Block | Path |
|---|---|---|
| NEON | 16 px | vld1q_u16 × 2 → vandq_u16 → vshlq_u16 (variable shift) → vqmovn_u16 × 2 + vcombine_u8 → vst4q_u8 |
| SSE4.1 | 16 px | _mm_loadu_si128 × 2 → _mm_and_si128 → _mm_srl_epi16 → _mm_packus_epi16 → write_rgba_16 |
| AVX2 | 32 px | _mm256_loadu_si256 × 2 → _mm256_and_si256 → _mm256_srl_epi16 → narrow_u8x32 (reuses pack-fixup permute) → write_rgba_32 |
| AVX-512 | 64 px | _mm512_loadu_si512 × 2 → _mm512_and_si512 → _mm512_srl_epi16 → narrow_u8x64 (reuses pack_fixup) → write_rgba_64 |
| wasm simd128 | 16 px | v128_load × 2 → v128_and → u16x8_shr → u8x16_narrow_i16x8 → write_rgba_16 |
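As a scalar model of what every row in the table computes per lane, the sketch below masks, shifts, and narrows one 16-pixel block. alpha_lane and alpha_block16 are illustrative names, not functions from the crate. It also shows why the mask matters: after masking to BITS bits and shifting by BITS - 8, the value always fits in 8 bits, so the saturating narrow in the SIMD paths never actually clamps.

```rust
/// Scalar model of the per-lane alpha conversion: mask, shift, narrow.
fn alpha_lane<const BITS: u32>(a: u16) -> u8 {
    let mask = ((1u32 << BITS) - 1) as u16;
    let v = (a & mask) >> (BITS - 8);
    // `v` is at most 0xFF here, so narrowing cannot lose information.
    u8::try_from(v).expect("masked and shifted alpha fits in u8")
}

/// Converts one 16-pixel block, the block size listed for NEON, SSE4.1
/// and wasm simd128.
fn alpha_block16<const BITS: u32>(a: &[u16; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (o, &s) in out.iter_mut().zip(a.iter()) {
        *o = alpha_lane::<BITS>(s);
    }
    out
}
```

For BITS = 10, an out-of-range sample such as 0x400 maps to 0 because bit 10 is masked off; without the mask it would shift to 256 and a saturating narrow would turn it into 255.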

Per-backend gotcha — variable shift count. BITS - 8 isn't const-evaluable as a literal-immediate shift count for vshrq_n_u16 / _mm{,256,512}_srli_epi16::<IMM8>, so each backend uses the variable-count shift sibling (vshlq_u16 with negative count for NEON, _mm{,256,512}_srl_epi16 with _mm_cvtsi32_si128 for x86; wasm's u16x8_shr already takes a runtime u32). Same hardening pattern that exists elsewhere in the crate.
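A minimal sketch of the x86 variable-count shift, assuming an x86_64 host (where SSE2 is baseline) and falling back to plain scalar code elsewhere. shr_u16x8 and alpha_shift are hypothetical helpers written for this example; only the intrinsics come from the PR text.

```rust
/// Shifts eight u16 lanes right by a count known only at runtime.
#[cfg(target_arch = "x86_64")]
fn shr_u16x8(lanes: [u16; 8], count: u32) -> [u16; 8] {
    use std::arch::x86_64::*;
    let mut out = [0u16; 8];
    // SAFETY: SSE2 is part of the x86_64 baseline, and both arrays span
    // exactly 16 bytes, matching the unaligned 128-bit load and store.
    unsafe {
        let v = _mm_loadu_si128(lanes.as_ptr() as *const __m128i);
        // The shift count travels in the low 64 bits of a vector register.
        let c = _mm_cvtsi32_si128(count as i32);
        let r = _mm_srl_epi16(v, c);
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, r);
    }
    out
}

/// Scalar fallback so the sketch also builds on other targets.
#[cfg(not(target_arch = "x86_64"))]
fn shr_u16x8(lanes: [u16; 8], count: u32) -> [u16; 8] {
    lanes.map(|v| v >> count)
}

/// The count is derived from a const generic, but is passed as a runtime
/// value instead of an immediate operand.
fn alpha_shift<const BITS: u32>(lanes: [u16; 8]) -> [u16; 8] {
    shr_u16x8(lanes, BITS - 8)
}
```

The same shape applies to the 256-bit and 512-bit variants, which also take the count as a 128-bit vector.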

Dispatcher wiring (src/row/mod.rs)

yuva444p10_to_rgba_row's let _ = use_simd; stub from PR #32 replaced with the standard cfg_select! per-arch route block. The "⚠ Scalar-only as of Ship 8b‑1a" doc warning is dropped from the u8 dispatcher; section header drops the "prep" qualifier on the u8 path. The u16 dispatcher (yuva444p10_to_rgba_u16_row) stays scalar-only — Ship 8b‑1c.
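The routing idea can be sketched as below. This is an assumption-laden illustration: it uses plain #[cfg] blocks and cfg! rather than the crate's cfg_select! block, the Route enum and pick_route are hypothetical, and the probe order (widest x86 feature first) is a guess rather than something the PR states.

```rust
/// Which path a row call ends up on (hypothetical, for illustration only).
#[allow(dead_code)]
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Route {
    Avx512,
    Avx2,
    Sse41,
    Neon,
    WasmSimd128,
    Scalar,
}

/// Picks a backend for the current target and CPU, falling back to scalar.
fn pick_route(use_simd: bool) -> Route {
    if !use_simd {
        return Route::Scalar;
    }
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime CPU feature detection, widest vectors first (assumed order).
        if std::is_x86_feature_detected!("avx512bw") {
            return Route::Avx512;
        }
        if std::is_x86_feature_detected!("avx2") {
            return Route::Avx2;
        }
        if std::is_x86_feature_detected!("sse4.1") {
            return Route::Sse41;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Route::Neon;
        }
    }
    // wasm has no runtime detection; the feature is fixed at compile time.
    if cfg!(all(target_arch = "wasm32", target_feature = "simd128")) {
        return Route::WasmSimd128;
    }
    Route::Scalar
}
```

In a real dispatcher each arm would call the matching yuv_444p_n_to_rgba_with_alpha_src_row::<10> wrapper instead of returning an enum value.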

Sinker doc cleanup (src/sinker/mixed/yuva_4_4_4.rs)

MixedSinker<Yuva444p10>::with_rgba's "Performance note (Ship 8b‑1a)" warning dropped now that u8 has SIMD coverage. The with_rgba_u16 builder keeps its warning until 8b‑1c lands.

Per-backend SIMD equivalence tests (~25)

5 tests per backend × 5 backends mirroring PR #30's structure. Each backend covers:

  • <backend>_yuva444p10_rgba_matches_scalar_all_matrices_<width> — all 6 ColorMatrix × full + limited range × natural block width.
  • <backend>_yuva444p10_rgba_matches_scalar_widths — natural width + tail widths {17, 31, 47, 63, 1920, 1922} to exercise the scalar-tail fallthrough.
  • <backend>_yuva444p10_rgba_matches_scalar_random_alpha — pseudo-random alpha pattern (not solid) to catch SIMD lane-order corruption.

Each test calls the SIMD wrapper directly via unsafe { arch::<bk>::yuv_444p_n_to_rgba_with_alpha_src_row::<10>(...) }, so each backend's kernel is exercised on any CI runner that supports it, independent of which route the dispatcher would pick there. All 15 new x86 tests include is_x86_feature_detected! early-return guards (per PR #25 CI fallout: without them, ASAN gets SIGILL and Miri reports UB on runners lacking the feature). NEON tests carry #[cfg_attr(miri, ignore = "...")]. Wasm is module-level cfg-gated.
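The shape of such an equivalence check, reduced to the alpha lane only, might look like the following sketch. Everything here is hypothetical: blocked_alpha is a scalar stand-in for a 16-pixel SIMD kernel, the LCG constants are an arbitrary choice, and the real tests compare full RGBA rows across matrices and ranges.

```rust
/// Deterministic pseudo-random 10-bit samples from a small LCG.
fn pseudo_random_plane(width: usize, seed: u32) -> Vec<u16> {
    let mut state = seed;
    (0..width)
        .map(|_| {
            state = state.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
            ((state >> 16) & 0x3FF) as u16
        })
        .collect()
}

/// Scalar reference for the alpha lane at BITS = 10.
fn scalar_alpha(a: &[u16], out: &mut [u8]) {
    for (o, &s) in out.iter_mut().zip(a) {
        *o = ((s & 0x3FF) >> 2) as u8;
    }
}

/// Stand-in for a SIMD kernel: whole 16-pixel blocks, then a scalar tail.
fn blocked_alpha(a: &[u16], out: &mut [u8]) {
    let main = a.len() / 16 * 16;
    for (ob, ab) in out[..main].chunks_exact_mut(16).zip(a[..main].chunks_exact(16)) {
        for i in 0..16 {
            ob[i] = ((ab[i] & 0x3FF) >> 2) as u8;
        }
    }
    scalar_alpha(&a[main..], &mut out[main..]);
}

/// Returns true when both paths agree for the given row width.
fn check_equivalence(width: usize) -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        // Early return mirrors the guard described above: skip instead of
        // executing instructions the CPU lacks.
        if !std::is_x86_feature_detected!("sse4.1") {
            return true;
        }
    }
    let a = pseudo_random_plane(width, 0x00C0_FFEE);
    let mut want = vec![0u8; width];
    let mut got = vec![0u8; width];
    scalar_alpha(&a, &mut want);
    blocked_alpha(&a, &mut got);
    want == got
}
```

Widths that are not a multiple of the block size, such as 17 or 1922, are the ones that reach the scalar tail.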

Test plan

  • cargo test --lib: 583 pass on aarch64-darwin (host); was 578 → +5 NEON-side tests run. The 20 x86/wasm tests are gate-guarded for their CI runners.
  • cargo check --tests --lib clean across host, x86_64-unknown-freebsd, wasm32-unknown-unknown
  • RUSTFLAGS="-Dwarnings" cargo clippy --lib --tests clean
  • Zero dead_code warnings — every new yuv_444p_n_to_rgba_with_alpha_src_row wrapper is consumed by the dispatcher

Codex adversarial review

Verdict: approve. No material findings. Per Codex: "The SIMD alpha-source paths appear bounds-guarded by the public dispatcher and match the existing scalar alpha masking/shift contract."

Out of scope (deferred to follow-up)

  • u16 RGBA SIMD across all 5 backends → Ship 8b‑1c
  • Other Yuva format families (Yuva420p*, Yuva422p*, other Yuva444p* variants) → Ship 8b‑2 onward, mass-applying the established Strategy B template

🤖 Generated with Claude Code

Copilot AI left a comment

Pull request overview

This PR updates the YUVA 4:4:4 (currently Yuva444p10) → RGBA u8 conversion path to use per-architecture SIMD implementations (NEON / SSE4.1 / AVX2 / AVX-512BW / wasm simd128) when use_simd is enabled, and removes/updates now-stale “scalar-only” staging notes.

Changes:

  • Wire row::yuva444p10_to_rgba_row to runtime-dispatch into new arch-specific yuv_444p_n_to_rgba_with_alpha_src_row::<10> SIMD wrappers (with scalar fallback).
  • Extend existing 4:4:4 SIMD kernels (NEON/x86/wasm) to optionally source per-pixel alpha from an input alpha plane (via a new const-generic ALPHA_SRC path).
  • Add SIMD-vs-scalar equivalence tests for the new YUVA 4:4:4 u8 RGBA alpha-source path across multiple backends.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

Summary per file:

| File | Description |
|---|---|
| src/sinker/mixed/yuva_4_4_4.rs | Removes outdated “scalar-only” performance note from YUVA sinker RGBA wiring docs. |
| src/row/mod.rs | Updates YUVA 4:4:4 dispatcher docs and adds per-arch SIMD dispatch for yuva444p10_to_rgba_row. |
| src/row/arch/x86_sse41.rs | Adds SSE4.1 alpha-source RGBA wrapper and integrates ALPHA_SRC into the shared 4:4:4 kernel. |
| src/row/arch/x86_sse41/tests.rs | Adds SSE4.1 equivalence tests for YUVA 4:4:4 → RGBA u8 (alpha sourced). |
| src/row/arch/x86_avx2.rs | Adds AVX2 alpha-source RGBA wrapper and integrates ALPHA_SRC into the shared 4:4:4 kernel. |
| src/row/arch/x86_avx2/tests.rs | Adds AVX2 equivalence tests for YUVA 4:4:4 → RGBA u8 (alpha sourced). |
| src/row/arch/x86_avx512.rs | Adds AVX-512BW alpha-source RGBA wrapper and integrates ALPHA_SRC into the shared 4:4:4 kernel. |
| src/row/arch/x86_avx512/tests.rs | Adds AVX-512BW equivalence tests for YUVA 4:4:4 → RGBA u8 (alpha sourced). |
| src/row/arch/wasm_simd128.rs | Adds wasm simd128 alpha-source RGBA wrapper and integrates ALPHA_SRC into the shared 4:4:4 kernel. |
| src/row/arch/wasm_simd128/tests.rs | Adds wasm simd128 equivalence tests for YUVA 4:4:4 → RGBA u8 (alpha sourced). |
| src/row/arch/neon.rs | Adds NEON alpha-source RGBA wrapper and integrates ALPHA_SRC into the shared 4:4:4 kernel. |
| src/row/arch/neon/tests.rs | Adds NEON equivalence tests for YUVA 4:4:4 → RGBA u8 (alpha sourced). |

@al8n al8n changed the title from "update" to "Ship 8b‑1b: Yuva444p10 u8 RGBA SIMD across all 5 backends" Apr 27, 2026
@uqio uqio merged commit 5e9c13d into main Apr 27, 2026
44 of 47 checks passed