Ship 8b‑1c: Yuva444p10 u16 RGBA SIMD across all 5 backends by uqio · Pull Request #34 · Findit-AI/colconv

uqio · 2026-04-27T11:06:52Z

Summary

Wires u16 RGBA SIMD across all 5 backends (NEON, SSE4.1, AVX2, AVX-512, wasm simd128) for the alpha-preserving Yuva444p10 u16 path. Replaces the yuva444p10_to_rgba_u16_row dispatcher stub from PR #32 with a real cfg_select! per-arch route block. Mirrors PR #33 (Ship 8b‑1b) structurally — same ALPHA_SRC const-generic refactor, applied to the u16 path instead of u8.

Closes the entire Yuva444p10 vertical slice. Subsequent PRs (8b‑2 onward) mass-apply the same Strategy B template to other Yuva format families.

Changes

Strategy B refactor — extend u16 4:4:4 const-ALPHA template with `ALPHA_SRC`

Each backend's existing yuv_444p_n_to_rgb_or_rgba_u16_row<BITS, ALPHA> u16 template grows a third const-generic, ALPHA_SRC: bool, plus an a_src: Option<&[u16]> parameter. The per-pixel store now branches on three combinations:

`ALPHA`	`ALPHA_SRC`	Per-pixel alpha behavior
false	false	RGB-only (no alpha lane in store)
true	false	RGBA, alpha = `(1 << BITS) - 1` u16 (existing path; opaque max at native depth)
true	true	RGBA, alpha = `a_src[x] & bits_mask::<BITS>()` from the source plane (no shift — both source and output are at native bit depth)

A const assertion of !ALPHA_SRC || ALPHA (source alpha is only valid when an alpha lane is written) mirrors the scalar template's guard. The existing yuv_444p_n_to_rgb_u16_row<BITS> and yuv_444p_n_to_rgba_u16_row<BITS> public wrappers keep their signatures unchanged and pass ALPHA_SRC = false and a_src = None internally.

Each backend gets a new pub(crate) unsafe fn yuv_444p_n_to_rgba_u16_with_alpha_src_row<const BITS: u32>(y, u, v, a_src, rgba_out, width, matrix, full_range) wrapper that the dispatcher's cfg_select! calls.
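To make the three combinations concrete, here is a minimal scalar sketch of the store stage only. It is not the crate's kernel: `store_row` and its `rgb` input are hypothetical stand-ins (the colour conversion is assumed to have already happened), and only `bits_mask`, `ALPHA`, `ALPHA_SRC`, and `a_src` follow names used in this PR. The inline `const { ... }` assertion assumes Rust 1.79 or newer.

```rust
/// All-ones mask for a BITS-wide sample, e.g. 0x03FF for BITS = 10.
const fn bits_mask<const BITS: u32>() -> u16 {
    ((1u32 << BITS) - 1) as u16
}

/// Hypothetical store stage: `rgb` holds already-converted pixels, so only
/// the alpha routing selected by the two const generics is shown.
fn store_row<const BITS: u32, const ALPHA: bool, const ALPHA_SRC: bool>(
    rgb: &[[u16; 3]],
    a_src: Option<&[u16]>,
    out: &mut [u16],
) {
    // Source alpha only makes sense when an alpha lane is written at all;
    // an invalid combination fails to compile instead of failing at runtime.
    const { assert!(!ALPHA_SRC || ALPHA) };

    let channels = if ALPHA { 4 } else { 3 };
    for (x, (px, dst)) in rgb.iter().zip(out.chunks_exact_mut(channels)).enumerate() {
        dst[..3].copy_from_slice(px);
        if ALPHA_SRC {
            // RGBA with alpha taken from the source plane, masked to BITS.
            let a = a_src.expect("ALPHA_SRC requires an alpha plane");
            dst[3] = a[x] & bits_mask::<BITS>();
        } else if ALPHA {
            // RGBA with opaque alpha at native depth.
            dst[3] = bits_mask::<BITS>();
        }
        // ALPHA == false: RGB only, no fourth lane is written.
    }
}
```

Because both flags are const generics, each instantiation compiles down to a single straight-line store path; the branches above are resolved at monomorphization time.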

Per-backend SIMD alpha load

The kernel pulls source-alpha at the same width-per-pixel cadence as Y/U/V (4:4:4 = 1:1, no chroma duplication), AND-masks with bits_mask::<BITS>(), and feeds the result directly into the alpha lane of the existing write_rgba_u16_* / vst4q_u16 store helper:

Backend	Block	Alpha load
NEON	8 px/iter	`vandq_u16(vld1q_u16(...), mask_v)`
SSE4.1	8 px/iter	`_mm_and_si128(_mm_loadu_si128(...), mask_v)`
AVX2	16 px/iter	`_mm256_loadu_si256` + `_mm256_and_si256`; split into 2× `__m128i` halves via `_mm256_castsi256_si128` / `_mm256_extracti128_si256::<1>` for the two `write_rgba_u16_8` halves
AVX-512	32 px/iter	`_mm512_loadu_si512` + `_mm512_and_si512`; split into 4× `__m128i` quarters via `_mm512_extracti32x4_epi32::<0..3>`, fed straight into `write_quarter_rgba`
wasm simd128	8 px/iter	`v128_and(v128_load(...), mask_v)`

No depth conversion — both source alpha and output alpha share the same native bit depth (BITS=10 for Yuva444p10 → u16 alpha in [0, 1023]). The u8 path's >> (BITS - 8) step (PR #33) is absent here.
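As a small worked example of that difference, the following sketch contrasts the two alpha treatments for a single sample. The helper names `alpha_native` and `alpha_to_u8` are made up for illustration, and masking before the shift in the u8 helper is an assumption rather than something this PR states.

```rust
/// u16 path: mask the source alpha to BITS and keep it at native depth.
fn alpha_native<const BITS: u32>(a: u16) -> u16 {
    a & (((1u32 << BITS) - 1) as u16)
}

/// u8 path (assumed shape): mask, then drop the low `BITS - 8` bits so the
/// value fits in a u8.
fn alpha_to_u8<const BITS: u32>(a: u16) -> u8 {
    (alpha_native::<BITS>(a) >> (BITS - 8)) as u8
}
```

For BITS = 10, a fully opaque source sample of 1023 stays 1023 on the u16 path and becomes 255 on the u8 path; stray bits above bit 9 are discarded by the mask in both cases.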

No permute fixup needed — the existing vst4q_u16 / write_rgba_u16_8 / write_quarter_rgba helpers accept the alpha vector in their existing 4th-channel slot. AVX2 / AVX-512 reuse the same mask_v constant the Y/U/V loads already use.
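The 8 px/iter load-and-mask step can be sketched as below. This is an illustration, not the crate's code: `load_masked_alpha8` and `mask_alpha_row` are hypothetical names, the x86_64 variant uses the SSE2 forms of the same load/AND intrinsics listed for SSE4.1, and other targets get a scalar fallback so the sketch compiles everywhere. The interleaved RGBA store is left out.

```rust
/// Load 8 alpha samples and AND them with `mask` (x86_64: SSE2 intrinsics).
#[cfg(target_arch = "x86_64")]
fn load_masked_alpha8(a: &[u16], mask: u16) -> [u16; 8] {
    use core::arch::x86_64::*;
    assert!(a.len() >= 8);
    let mut out = [0u16; 8];
    // SAFETY: SSE2 is part of the x86_64 baseline; `a` has at least 16
    // readable bytes, `out` has 16 writable bytes, and the unaligned
    // load/store variants have no alignment requirement.
    unsafe {
        let mask_v = _mm_set1_epi16(mask as i16);
        let a_v = _mm_and_si128(_mm_loadu_si128(a.as_ptr().cast()), mask_v);
        _mm_storeu_si128(out.as_mut_ptr().cast(), a_v);
    }
    out
}

/// Portable fallback with the same contract for non-x86_64 targets.
#[cfg(not(target_arch = "x86_64"))]
fn load_masked_alpha8(a: &[u16], mask: u16) -> [u16; 8] {
    assert!(a.len() >= 8);
    let mut out = [0u16; 8];
    for (o, &s) in out.iter_mut().zip(a) {
        *o = s & mask;
    }
    out
}

/// Mask a whole alpha row: 8 samples per iteration, scalar tail for the rest.
fn mask_alpha_row(a: &[u16], mask: u16, out: &mut [u16]) {
    assert_eq!(a.len(), out.len());
    let mut src = a.chunks_exact(8);
    let mut dst = out.chunks_exact_mut(8);
    for (s, d) in (&mut src).zip(&mut dst) {
        d.copy_from_slice(&load_masked_alpha8(s, mask));
    }
    // Fewer than 8 samples remain: finish them one at a time.
    for (d, &s) in dst.into_remainder().iter_mut().zip(src.remainder()) {
        *d = s & mask;
    }
}
```

In the actual kernels the masked vector is not written back to a plain buffer; it is passed as the fourth channel to the existing interleaving store helper.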

Dispatcher wiring (`src/row/mod.rs`)

The `let _ = use_simd;` stub that PR #32 left in yuva444p10_to_rgba_u16_row is replaced with the standard cfg_select! per-arch route block. The "# ⚠ Scalar-only as of Ship 8b‑1a" doc warning is dropped from the dispatcher, and the section header drops the remaining "prep" qualifier, since both the u8 and u16 paths now have SIMD coverage.
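The general shape of such a route block is sketched below under several assumptions: it uses plain `#[cfg]` attributes and runtime feature detection instead of cfg_select!, it wires only one backend (AVX2), and the kernel only masks alpha to 10 bits rather than converting YUVA to RGBA. All function names here are hypothetical.

```rust
/// Scalar fallback: mask each source alpha sample to 10 bits.
fn alpha_row_scalar(a: &[u16], out: &mut [u16]) {
    for (o, &s) in out.iter_mut().zip(a) {
        *o = s & 0x03FF;
    }
}

/// Hypothetical AVX2 kernel: 16 samples per iteration, scalar for the tail.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn alpha_row_avx2(a: &[u16], out: &mut [u16]) {
    use core::arch::x86_64::*;
    let n = a.len().min(out.len()) / 16 * 16;
    // SAFETY: the caller guarantees AVX2 is available; every access stays
    // within the first `n` elements of both slices, and the unaligned
    // load/store variants have no alignment requirement.
    unsafe {
        let mask_v = _mm256_set1_epi16(0x03FF);
        for i in (0..n).step_by(16) {
            let v = _mm256_loadu_si256(a.as_ptr().add(i).cast());
            _mm256_storeu_si256(out.as_mut_ptr().add(i).cast(), _mm256_and_si256(v, mask_v));
        }
    }
    alpha_row_scalar(&a[n..], &mut out[n..]);
}

/// Dispatcher: take the SIMD route when requested and available, else scalar.
pub fn alpha_row(a: &[u16], out: &mut [u16], use_simd: bool) {
    assert_eq!(a.len(), out.len());
    if use_simd {
        #[cfg(target_arch = "x86_64")]
        {
            if std::arch::is_x86_feature_detected!("avx2") {
                // SAFETY: AVX2 availability was verified on the line above.
                unsafe { alpha_row_avx2(a, out) };
                return;
            }
        }
    }
    alpha_row_scalar(a, out);
}
```

With `use_simd = false`, or on a machine without a matching backend, the call falls through to the scalar path, which is the behaviour the stub provided unconditionally.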

Sinker doc cleanup (`src/sinker/mixed/yuva_4_4_4.rs`)

The "Performance note (Ship 8b‑1a)" warning paragraph on MixedSinker<Yuva444p10>::with_rgba_u16 is dropped now that the u16 path has SIMD coverage. PR #33 already dropped the same warning from the u8 builder.

Per-backend SIMD u16 equivalence tests (~25)

5 tests per backend × 5 backends mirroring PR #33's u8 test patterns. Each backend covers:

<backend>_yuva444p10_rgba_u16_matches_scalar_all_matrices_<width> — all 6 ColorMatrix × full + limited range × natural block width.
<backend>_yuva444p10_rgba_u16_matches_scalar_widths — natural width + tail widths {17, 31, 47, 63, 1920, 1922} forcing scalar-tail fallthrough.
<backend>_yuva444p10_rgba_u16_matches_scalar_random_alpha — pseudo-random alpha pattern (not solid) to catch SIMD lane-order corruption.

Each test calls the SIMD wrapper directly via unsafe { arch::<bk>::yuv_444p_n_to_rgba_u16_with_alpha_src_row::<10>(...) } so all 5 backends are exercised regardless of which CI runner is running. All 15 new x86 tests include is_x86_feature_detected! early-return guards (per the PR #25 CI fallout: without them, ASAN gets SIGILL and Miri reports UB on runners lacking the feature). NEON tests carry #[cfg_attr(miri, ignore = "...")]. Wasm tests are cfg-gated at module level.
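The test pattern described above (non-solid alpha, widths that leave a tail, element-wise comparison against a scalar reference) can be sketched as follows. Everything here is a stand-in: `blocked_kernel` imitates an 8-wide kernel in portable code, the LCG constants are a common textbook choice rather than the crate's generator, and the real tests compare full RGBA output, not just masked alpha.

```rust
/// Small deterministic LCG so failures are reproducible. Values span the full
/// u16 range, so bits above the 10-bit range are exercised as well.
fn pseudo_random_u16(buf: &mut [u16], seed: u32) {
    let mut state = seed;
    for v in buf {
        state = state.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
        *v = (state >> 16) as u16;
    }
}

/// Scalar reference: one sample at a time.
fn scalar_ref(a: &[u16]) -> Vec<u16> {
    a.iter().map(|&s| s & 0x03FF).collect()
}

/// Stand-in for the kernel under test: 8-wide blocks plus a scalar tail.
fn blocked_kernel(a: &[u16]) -> Vec<u16> {
    let mut out = Vec::with_capacity(a.len());
    let mut blocks = a.chunks_exact(8);
    for block in &mut blocks {
        let mut lanes = [0u16; 8];
        for (l, &s) in lanes.iter_mut().zip(block) {
            *l = s & 0x03FF;
        }
        out.extend_from_slice(&lanes);
    }
    out.extend(blocks.remainder().iter().map(|&s| s & 0x03FF));
    out
}

/// Returns the first mismatching (width, index), or None if all widths agree.
fn check_widths(widths: &[usize]) -> Option<(usize, usize)> {
    // A real x86 test would start with an early return such as
    // `if !std::arch::is_x86_feature_detected!("avx2") { return; }`.
    for &w in widths {
        let mut a = vec![0u16; w];
        pseudo_random_u16(&mut a, w as u32);
        let (want, got) = (scalar_ref(&a), blocked_kernel(&a));
        if let Some(i) = (0..w).find(|&i| want[i] != got[i]) {
            return Some((w, i));
        }
    }
    None
}
```

Reporting the first mismatching width and index, rather than a bare pass/fail, makes it easier to tell a main-loop lane-order bug from a tail-handling bug.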

Test plan

cargo test --lib: 588 pass on aarch64-darwin (host); was 583 → +5 NEON-side tests run. The 20 x86/wasm tests are gate-guarded for their CI runners.
cargo check --tests --lib clean across host, x86_64-unknown-freebsd, wasm32-unknown-unknown
RUSTFLAGS="-Dwarnings" cargo clippy --lib --tests clean
Zero dead_code warnings — every new yuv_444p_n_to_rgba_u16_with_alpha_src_row wrapper is consumed by the dispatcher

Codex adversarial review

Not run — Codex hit OpenAI usage rate limit (retry window 2026-04-28 02:10 AM). The structural pattern is identical to PR #33 (Ship 8b‑1b u8 SIMD) which Codex approved with no findings. Re-run available on request once the rate limit clears.

Closes Ship 8b‑1: Yuva444p10 vertical slice

After this PR, every Yuva444p10 SIMD path is wired:

8b‑1a (PR #32, "feat: ship8b yuva444p10 scalar"): scalar prep covering Frame, walker, dispatchers, sinker integration, and scalar tests.
8b‑1b (PR #33, "Ship 8b‑1b: Yuva444p10 u8 RGBA SIMD across all 5 backends"): u8 RGBA SIMD across all 5 backends.
8b‑1c (this PR): u16 RGBA SIMD across all 5 backends.

Out of scope (deferred to follow-up)

Other Yuva format families (Yuva420p*, Yuva422p*, other Yuva444p* variants like 8-bit / 9-bit / 16-bit) → Ship 8b‑2 onward, mass-applying the established Strategy B template.

🤖 Generated with Claude Code

Copilot

Pull request overview

Wires SIMD-dispatched native-depth u16 RGBA conversion for Yuva444p10 (alpha sourced from the A plane) across all supported backends, completing the Yuva444p10 SIMD “vertical slice”.

Changes:

Replaces the yuva444p10_to_rgba_u16_row dispatcher stub with a real cfg_select! per-arch dispatch to SIMD wrappers, with scalar fallback when use_simd = false or no backend is available.
Extends each backend’s high-bit 4:4:4 u16 kernel template with ALPHA_SRC + a_src, adding a SIMD alpha-load path (masked to BITS) and a scalar-tail fallback.
Adds backend-specific SIMD-vs-scalar equivalence tests for YUVA444p10 → RGBA u16 (including random alpha patterns), and removes now-stale “scalar-only” performance warnings.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

File	Description
src/sinker/mixed/yuva_4_4_4.rs	Removes outdated scalar-only perf note for the u16 alpha-source path.
src/row/mod.rs	Implements per-arch SIMD dispatch for `yuva444p10_to_rgba_u16_row` and updates related docs/comments.
src/row/arch/neon.rs	Adds `ALPHA_SRC` support + alpha-plane SIMD load/store for native-depth `u16` RGBA.
src/row/arch/neon/tests.rs	Adds NEON equivalence tests for YUVA444p10 → RGBA `u16` with source alpha.
src/row/arch/x86_sse41.rs	Adds `ALPHA_SRC` support + alpha-plane SIMD load/store for SSE4.1 native-depth `u16` RGBA.
src/row/arch/x86_sse41/tests.rs	Adds SSE4.1 equivalence tests for YUVA444p10 → RGBA `u16` with source alpha (feature-detected).
src/row/arch/x86_avx2.rs	Adds `ALPHA_SRC` support + alpha-plane SIMD load/store for AVX2 native-depth `u16` RGBA.
src/row/arch/x86_avx2/tests.rs	Adds AVX2 equivalence tests for YUVA444p10 → RGBA `u16` with source alpha (feature-detected).
src/row/arch/x86_avx512.rs	Adds `ALPHA_SRC` support + alpha-plane SIMD load/store for AVX-512 native-depth `u16` RGBA.
src/row/arch/x86_avx512/tests.rs	Adds AVX-512 equivalence tests for YUVA444p10 → RGBA `u16` with source alpha (feature-detected).
src/row/arch/wasm_simd128.rs	Adds `ALPHA_SRC` support + alpha-plane SIMD load/store for wasm simd128 native-depth `u16` RGBA.
src/row/arch/wasm_simd128/tests.rs	Adds wasm simd128 equivalence tests for YUVA444p10 → RGBA `u16` with source alpha (cfg-gated).

update

ea749e5

al8n changed the title ~~update~~ Ship 8b‑1c: Yuva444p10 u16 RGBA SIMD across all 5 backends Apr 27, 2026

al8n requested a review from Copilot April 27, 2026 11:09

Copilot started reviewing on behalf of al8n April 27, 2026 11:10

Copilot AI reviewed Apr 27, 2026

update

2e6dc8c

al8n merged commit bdf910e into main Apr 27, 2026
43 checks passed

al8n deleted the feat/ship8b-yuva444p10-u16-simd branch April 27, 2026 11:45
