Ship 8b‑1b: Yuva444p10 u8 RGBA SIMD across all 5 backends by uqio · Pull Request #33 · Findit-AI/colconv

uqio · 2026-04-27T10:29:43Z

Summary

Wires u8 RGBA SIMD across all 5 backends (NEON, SSE4.1, AVX2, AVX-512, wasm simd128) for the alpha-preserving Yuva444p10 path. Replaces the yuva444p10_to_rgba_row dispatcher stub from PR #32 with a real cfg_select! per-arch route block. Mirrors PR #25 (Tranche 5a) and PR #30 (Tranche 7b) structurally.

The companion u16 RGBA SIMD work lands in Ship 8b‑1c.

Changes

Strategy B refactor — extend u8 4:4:4 const-ALPHA template with `ALPHA_SRC`

Each backend's existing yuv_444p_n_to_rgb_or_rgba_row<BITS, ALPHA> u8 template grows a third const-generic, ALPHA_SRC: bool, plus an a_src: Option<&[u16]> parameter. The per-pixel store now branches on three combinations:

| `ALPHA` | `ALPHA_SRC` | Per-pixel alpha behavior |
| --- | --- | --- |
| false | false | RGB-only (no alpha lane in store) |
| true | false | RGBA, alpha = `0xFF` (existing path, unchanged) |
| true | true | RGBA, alpha = `(a_src[x] & bits_mask) >> (BITS - 8)` from the source plane |

Const-asserted !ALPHA_SRC || ALPHA mirrors the scalar template's guard (PR #32 review fix #2 hardening pattern). The existing yuv_444p_n_to_rgb_row<BITS> and yuv_444p_n_to_rgba_row<BITS> public wrappers keep their existing signatures unchanged and pass ALPHA_SRC = false, None internally.

Each backend gets a new pub(crate) unsafe fn yuv_444p_n_to_rgba_with_alpha_src_row<const BITS: u32>(y, u, v, a_src, rgba_out, width, matrix, full_range) wrapper that the dispatcher's cfg_select! calls.
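As a rough illustration of the three combinations in the table, here is a minimal scalar sketch. `store_px` and `AlphaGuard` are hypothetical names invented for this example, not the crate's kernels, and the guard is written as an associated const, which may differ from how the crate spells its const assertion.

```rust
/// Hypothetical compile-time guard: sourcing alpha requires an alpha lane.
struct AlphaGuard<const ALPHA: bool, const ALPHA_SRC: bool>;

impl<const ALPHA: bool, const ALPHA_SRC: bool> AlphaGuard<ALPHA, ALPHA_SRC> {
    // Evaluated per instantiation; an invalid combination fails to compile.
    const OK: () = assert!(!ALPHA_SRC || ALPHA);
}

/// Hypothetical scalar stand-in for the per-pixel store of the SIMD template.
/// Assumes 8 < BITS < 16.
fn store_px<const BITS: u32, const ALPHA: bool, const ALPHA_SRC: bool>(
    rgb: [u8; 3],
    a_src: Option<u16>,
    out: &mut Vec<u8>,
) {
    // Force evaluation of the guard for this instantiation.
    let () = AlphaGuard::<ALPHA, ALPHA_SRC>::OK;

    out.extend_from_slice(&rgb);
    if ALPHA {
        let a = if ALPHA_SRC {
            // Mask to the valid bit depth, then drop the low bits to reach u8.
            let mask = ((1u32 << BITS) - 1) as u16;
            let src = a_src.expect("a_src is required when ALPHA_SRC is true");
            ((src & mask) >> (BITS - 8)) as u8
        } else {
            // Existing opaque-alpha path.
            0xFF
        };
        out.push(a);
    }
}
```

With this sketch, `store_px::<10, true, true>(rgb, Some(512), &mut out)` appends alpha 128, while `store_px::<10, false, true>` would be rejected at compile time by the guard.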

Per-backend SIMD alpha load + depth-convert

The kernel pulls source-alpha at the same width-per-pixel cadence as Y/U/V (4:4:4 = 1:1, no chroma duplication), AND-masks with bits_mask::<BITS>(), right-shifts by BITS - 8 (= 2 for BITS = 10), and packs to u8 — feeding the result into the alpha lane of the existing write_rgba_* store helper:

| Backend | Block | Path |
| --- | --- | --- |
| NEON | 16 px | `vld1q_u16` × 2 → `vandq_u16` → `vshlq_u16` (variable shift) → `vqmovn_u16` × 2 + `vcombine_u8` → `vst4q_u8` |
| SSE4.1 | 16 px | `_mm_loadu_si128` × 2 → `_mm_and_si128` → `_mm_srl_epi16` → `_mm_packus_epi16` → `write_rgba_16` |
| AVX2 | 32 px | `_mm256_loadu_si256` × 2 → `_mm256_and_si256` → `_mm256_srl_epi16` → `narrow_u8x32` (reuses pack-fixup permute) → `write_rgba_32` |
| AVX-512 | 64 px | `_mm512_loadu_si512` × 2 → `_mm512_and_si512` → `_mm512_srl_epi16` → `narrow_u8x64` (reuses `pack_fixup`) → `write_rgba_64` |
| wasm simd128 | 16 px | `v128_load` × 2 → `v128_and` → `u16x8_shr` → `u8x16_narrow_i16x8` → `write_rgba_16` |
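The mask, shift, and narrow steps that every row of the table performs can be written out in scalar form. This is only a sketch of the arithmetic: `bits_mask` is named in the PR, but its body here is an assumption, and `alpha_row_to_u8` is a hypothetical helper.

```rust
/// Assumed definition: the low `BITS` bits set (valid for BITS < 16).
const fn bits_mask<const BITS: u32>() -> u16 {
    ((1u32 << BITS) - 1) as u16
}

/// Hypothetical scalar reference: one alpha sample per pixel (4:4:4 cadence),
/// masked to the valid depth and shifted down to 8 bits.
fn alpha_row_to_u8<const BITS: u32>(a_src: &[u16], out: &mut [u8]) {
    for (dst, &a) in out.iter_mut().zip(a_src) {
        *dst = ((a & bits_mask::<BITS>()) >> (BITS - 8)) as u8;
    }
}
```

For `BITS = 10` the shift is 2, so 1023 maps to 255 and 512 maps to 128; stray bits above bit 9 are discarded by the mask before the shift.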

Per-backend gotcha: variable shift count. BITS - 8 is derived from a const generic, so it can't be passed as the literal immediate that vshrq_n_u16 / _mm{,256,512}_srli_epi16::<IMM8> expect. Each backend therefore uses the variable-count sibling: vshlq_u16 with a negative count on NEON, and _mm{,256,512}_srl_epi16 with a count built by _mm_cvtsi32_si128 on x86; wasm's u16x8_shr already takes a runtime u32. The same hardening pattern already exists elsewhere in the crate.
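A minimal sketch of the x86 variant of this pattern follows. `alpha16_to_u8` is a hypothetical helper, not the crate's kernel. The intrinsics it uses are all SSE2, which is part of the x86_64 baseline, so the sketch needs no runtime feature detection, and it falls back to scalar code on other targets.

```rust
/// Hypothetical helper: converts 16 alpha samples of depth `BITS` to u8.
/// Assumes 8 < BITS < 16.
fn alpha16_to_u8<const BITS: u32>(a_src: &[u16; 16]) -> [u8; 16] {
    let mask = ((1u32 << BITS) - 1) as u16;
    let mut out = [0u8; 16];

    #[cfg(target_arch = "x86_64")]
    {
        use std::arch::x86_64::*;
        // SAFETY: both loads read 8 u16 from a 16-element array, and the
        // store writes exactly 16 bytes into `out`.
        unsafe {
            let m = _mm_set1_epi16(mask as i16);
            // The shift count is a runtime value held in a vector register,
            // so `BITS - 8` does not have to be an immediate.
            let count = _mm_cvtsi32_si128((BITS - 8) as i32);
            let lo = _mm_loadu_si128(a_src.as_ptr() as *const __m128i);
            let hi = _mm_loadu_si128(a_src.as_ptr().add(8) as *const __m128i);
            let lo = _mm_srl_epi16(_mm_and_si128(lo, m), count);
            let hi = _mm_srl_epi16(_mm_and_si128(hi, m), count);
            // Values are at most 255 after the shift, so saturation never triggers.
            let packed = _mm_packus_epi16(lo, hi);
            _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, packed);
        }
    }

    #[cfg(not(target_arch = "x86_64"))]
    {
        // Portable fallback with the same arithmetic.
        for (dst, &a) in out.iter_mut().zip(a_src.iter()) {
            *dst = ((a & mask) >> (BITS - 8)) as u8;
        }
    }

    out
}
```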

Dispatcher wiring (`src/row/mod.rs`)

yuva444p10_to_rgba_row's let _ = use_simd; stub from PR #32 replaced with the standard cfg_select! per-arch route block. The "⚠ Scalar-only as of Ship 8b‑1a" doc warning is dropped from the u8 dispatcher; section header drops the "prep" qualifier on the u8 path. The u16 dispatcher (yuva444p10_to_rgba_u16_row) stays scalar-only — Ship 8b‑1c.
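The shape of such a per-arch route block can be sketched with plain `#[cfg]` attributes instead of `cfg_select!`. `pick_backend` is a hypothetical function that only reports which route would be taken; the priority order (widest x86 vector first) is an assumption, not something this PR states.

```rust
/// Hypothetical route selector. Returns the name of the backend a dispatcher
/// shaped like this would pick; a real dispatcher would call the matching
/// unsafe wrapper instead of returning a string.
#[allow(unreachable_code)]
fn pick_backend(use_simd: bool) -> &'static str {
    if !use_simd {
        return "scalar";
    }

    #[cfg(target_arch = "x86_64")]
    {
        // Runtime detection: assumed order is widest vectors first.
        if std::is_x86_feature_detected!("avx512bw") {
            return "avx512";
        }
        if std::is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if std::is_x86_feature_detected!("sse4.1") {
            return "sse4.1";
        }
    }

    #[cfg(target_arch = "aarch64")]
    {
        // NEON is part of the aarch64 baseline.
        return "neon";
    }

    #[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
    {
        // wasm has no runtime detection; the feature is fixed at compile time.
        return "simd128";
    }

    // Fallthrough when no SIMD route applies.
    "scalar"
}
```

`pick_backend(false)` always yields `"scalar"`, which corresponds to the scalar fallback the dispatcher keeps for `use_simd = false`.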

Sinker doc cleanup (`src/sinker/mixed/yuva_4_4_4.rs`)

MixedSinker<Yuva444p10>::with_rgba's "Performance note (Ship 8b‑1a)" warning dropped now that u8 has SIMD coverage. The with_rgba_u16 builder keeps its warning until 8b‑1c lands.

Per-backend SIMD equivalence tests (~25)

5 tests per backend × 5 backends mirroring PR #30's structure. Each backend covers:

<backend>_yuva444p10_rgba_matches_scalar_all_matrices_<width> — all 6 ColorMatrix × full + limited range × natural block width.
<backend>_yuva444p10_rgba_matches_scalar_widths — natural width + tail widths {17, 31, 47, 63, 1920, 1922} to exercise the scalar-tail fallthrough.
<backend>_yuva444p10_rgba_matches_scalar_random_alpha — pseudo-random alpha pattern (not solid) to catch SIMD lane-order corruption.

Each test calls the SIMD wrapper directly via unsafe { arch::<bk>::yuv_444p_n_to_rgba_with_alpha_src_row::<10>(...) } so all 5 backends are exercised regardless of which CI runner is running. All 15 new x86 tests include is_x86_feature_detected! early-return guards (per PR #25 CI fallout: without them, ASAN gets SIGILL and Miri reports UB on runners lacking the feature). NEON tests carry #[cfg_attr(miri, ignore = "...")]. Wasm is module-level cfg-gated.
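The overall shape of these equivalence checks can be sketched as follows. Everything here is hypothetical: `blocked_alpha` is a portable stand-in for a SIMD wrapper (16-pixel blocks plus a scalar tail), the LCG constants are illustrative, and only the alpha lane is compared rather than full RGBA output.

```rust
/// Deterministic pseudo-random 10-bit alpha plane (simple LCG).
fn pseudo_random_alpha(width: usize, seed: u32) -> Vec<u16> {
    let mut s = seed;
    (0..width)
        .map(|_| {
            s = s.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
            ((s >> 16) & 0x3FF) as u16
        })
        .collect()
}

/// Scalar reference for 10-bit alpha to u8.
fn scalar_alpha(a: &[u16], out: &mut [u8]) {
    for (d, &s) in out.iter_mut().zip(a) {
        *d = ((s & 0x3FF) >> 2) as u8;
    }
}

/// Stand-in for a SIMD wrapper: whole 16-pixel blocks, then the scalar tail.
fn blocked_alpha(a: &[u16], out: &mut [u8]) {
    let main = a.len() / 16 * 16;
    for (src, dst) in a[..main].chunks_exact(16).zip(out[..main].chunks_exact_mut(16)) {
        for i in 0..16 {
            dst[i] = ((src[i] & 0x3FF) >> 2) as u8;
        }
    }
    scalar_alpha(&a[main..], &mut out[main..]);
}

/// Returns true when the blocked path matches scalar for the given width.
fn matches_scalar(width: usize) -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        // Early-return guard: skip instead of faulting when the feature is missing.
        if !std::is_x86_feature_detected!("sse4.1") {
            return true;
        }
    }
    let a = pseudo_random_alpha(width, 0xC01C_0417);
    let mut expected = vec![0u8; width];
    let mut actual = vec![0u8; width];
    scalar_alpha(&a, &mut expected);
    blocked_alpha(&a, &mut actual);
    expected == actual
}
```

Running `matches_scalar` over widths such as 17, 31, 47, 63, 1920, and 1922 exercises both the block loop and the tail, and a non-solid alpha pattern makes a lane-order mistake visible as a mismatch.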

Test plan

cargo test --lib: 583 pass on aarch64-darwin (host); was 578 → +5 NEON-side tests run. The 20 x86/wasm tests are gate-guarded for their CI runners.
cargo check --tests --lib clean across host, x86_64-unknown-freebsd, wasm32-unknown-unknown
RUSTFLAGS="-Dwarnings" cargo clippy --lib --tests clean
Zero dead_code warnings — every new yuv_444p_n_to_rgba_with_alpha_src_row wrapper is consumed by the dispatcher

Codex adversarial review

Verdict: approve. No material findings. Per Codex: "The SIMD alpha-source paths appear bounds-guarded by the public dispatcher and match the existing scalar alpha masking/shift contract."

Out of scope (deferred to follow-up)

u16 RGBA SIMD across all 5 backends → Ship 8b‑1c
Other Yuva format families (Yuva420p*, Yuva422p*, other Yuva444p* variants) → Ship 8b‑2 onward, mass-applying the established Strategy B template

🤖 Generated with Claude Code

Copilot

Pull request overview

This PR updates the YUVA 4:4:4 (currently Yuva444p10) → RGBA u8 conversion path to use per-architecture SIMD implementations (NEON / SSE4.1 / AVX2 / AVX-512BW / wasm simd128) when use_simd is enabled, and removes/updates now-stale “scalar-only” staging notes.

Changes:

Wire row::yuva444p10_to_rgba_row to runtime-dispatch into new arch-specific yuv_444p_n_to_rgba_with_alpha_src_row::<10> SIMD wrappers (with scalar fallback).
Extend existing 4:4:4 SIMD kernels (NEON/x86/wasm) to optionally source per-pixel alpha from an input alpha plane (via a new const-generic ALPHA_SRC path).
Add SIMD-vs-scalar equivalence tests for the new YUVA 4:4:4 u8 RGBA alpha-source path across multiple backends.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.


| File | Description |
| --- | --- |
| src/sinker/mixed/yuva_4_4_4.rs | Removes outdated “scalar-only” performance note from YUVA sinker RGBA wiring docs. |
| src/row/mod.rs | Updates YUVA 4:4:4 dispatcher docs and adds per-arch SIMD dispatch for `yuva444p10_to_rgba_row`. |
| src/row/arch/x86_sse41.rs | Adds SSE4.1 alpha-source RGBA wrapper and integrates `ALPHA_SRC` into the shared 4:4:4 kernel. |
| src/row/arch/x86_sse41/tests.rs | Adds SSE4.1 equivalence tests for YUVA 4:4:4 → RGBA u8 (alpha sourced). |
| src/row/arch/x86_avx2.rs | Adds AVX2 alpha-source RGBA wrapper and integrates `ALPHA_SRC` into the shared 4:4:4 kernel. |
| src/row/arch/x86_avx2/tests.rs | Adds AVX2 equivalence tests for YUVA 4:4:4 → RGBA u8 (alpha sourced). |
| src/row/arch/x86_avx512.rs | Adds AVX-512BW alpha-source RGBA wrapper and integrates `ALPHA_SRC` into the shared 4:4:4 kernel. |
| src/row/arch/x86_avx512/tests.rs | Adds AVX-512BW equivalence tests for YUVA 4:4:4 → RGBA u8 (alpha sourced). |
| src/row/arch/wasm_simd128.rs | Adds wasm simd128 alpha-source RGBA wrapper and integrates `ALPHA_SRC` into the shared 4:4:4 kernel. |
| src/row/arch/wasm_simd128/tests.rs | Adds wasm simd128 equivalence tests for YUVA 4:4:4 → RGBA u8 (alpha sourced). |
| src/row/arch/neon.rs | Adds NEON alpha-source RGBA wrapper and integrates `ALPHA_SRC` into the shared 4:4:4 kernel. |
| src/row/arch/neon/tests.rs | Adds NEON equivalence tests for YUVA 4:4:4 → RGBA u8 (alpha sourced). |


update

be54e4c

al8n requested a review from Copilot April 27, 2026 10:30

Copilot started reviewing on behalf of al8n April 27, 2026 10:31

Copilot AI reviewed Apr 27, 2026


al8n changed the title ~~update~~ Ship 8b‑1b: Yuva444p10 u8 RGBA SIMD across all 5 backends Apr 27, 2026

uqio merged commit 5e9c13d into main Apr 27, 2026
44 of 47 checks passed

al8n mentioned this pull request Apr 27, 2026

Ship 8b‑1c: Yuva444p10 u16 RGBA SIMD across all 5 backends #34

Merged

4 tasks


uqio merged 1 commit into main from feat/ship8b-yuva444p10-u8-simd
