Ship 8 Tranche 7b: high-bit 4:4:4 RGBA u8 SIMD by uqio · Pull Request #30 · Findit-AI/colconv

uqio · 2026-04-27T03:24:10Z

Summary

Wires u8 RGBA SIMD across all 5 backends for high-bit-depth 4:4:4 YUV (Yuv444p9/10/12/14/16, P410/P412/P416) and replaces the 8 stub dispatchers landed in PR #29 with real cfg_select! per-arch routes. Mirrors PR #25 (Tranche 5a) exactly, which did the same for 4:2:0.

The companion u16 RGBA SIMD work + sinker integration lands in Tranche 7c.

Changes

SIMD (5 backends × 4 kernel families = 20 kernel refactors)

Each backend's existing u8 RGB kernel becomes a thin wrapper over a const-ALPHA template, alongside a new RGBA wrapper:

| Family | Const-ALPHA template | RGB wrapper | RGBA wrapper |
| --- | --- | --- | --- |
| Yuv444p_n (BITS-generic) | `yuv_444p_n_to_rgb_or_rgba_row<BITS, ALPHA>` | `yuv_444p_n_to_rgb_row<BITS>` | `yuv_444p_n_to_rgba_row<BITS>` |
| Yuv444p16 (16-bit dedicated) | `yuv_444p16_to_rgb_or_rgba_row<ALPHA>` | `yuv_444p16_to_rgb_row` | `yuv_444p16_to_rgba_row` |
| P_n_444 (BITS-generic) | `p_n_444_to_rgb_or_rgba_row<BITS, ALPHA>` | `p_n_444_to_rgb_row<BITS>` | `p_n_444_to_rgba_row<BITS>` |
| P_n_444_16 (P416) | `p_n_444_16_to_rgb_or_rgba_row<ALPHA>` | `p_n_444_16_to_rgb_row` | `p_n_444_16_to_rgba_row` |

Only the per-iteration store and the scalar tail dispatch branch on `ALPHA`; the per-pixel math is unchanged. Alpha is `0xFF` (fully opaque) for all u8 RGBA paths. Per-arch alpha splat: `vdupq_n_u8(0xFF)` (NEON) / `_mm_set1_epi8(-1)` (SSE4.1) / `_mm256_set1_epi8(-1)` (AVX2) / `_mm512_set1_epi8(-1)` (AVX-512) / `u8x16_splat(0xFF)` (wasm). The x86 intrinsics take a signed `i8`, so `0xFF` is written as `-1`.
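As a rough scalar illustration of the const-`ALPHA` wrapper pattern, the sketch below shares one template between an RGB and an RGBA entry point. All names here (`pixel_to_rgb`, `row_to_rgb_or_rgba`, `row_to_rgb`, `row_to_rgba`) are hypothetical, and the per-pixel math is a gray-only placeholder, not the crate's color-matrix conversion.

```rust
/// Placeholder per-pixel math (an assumption for this sketch): narrows a
/// BITS-wide luma sample to 8 bits and emits it as gray. The real kernels
/// also consume U and V and apply a color matrix.
fn pixel_to_rgb<const BITS: u32>(y: u16) -> [u8; 3] {
    let g = (y >> (BITS - 8)) as u8;
    [g, g, g]
}

/// Shared template: the conversion is identical for both outputs, and only
/// the store branches on the compile-time ALPHA flag.
fn row_to_rgb_or_rgba<const BITS: u32, const ALPHA: bool>(y: &[u16], out: &mut [u8]) {
    let bpp = if ALPHA { 4 } else { 3 };
    assert!(out.len() >= y.len() * bpp, "output row too short");
    for (i, &luma) in y.iter().enumerate() {
        let [r, g, b] = pixel_to_rgb::<BITS>(luma);
        let px = &mut out[i * bpp..(i + 1) * bpp];
        if ALPHA {
            // RGBA store: append an opaque alpha byte.
            px.copy_from_slice(&[r, g, b, 0xFF]);
        } else {
            // RGB store: three bytes per pixel.
            px.copy_from_slice(&[r, g, b]);
        }
    }
}

/// Thin RGB wrapper over the template.
pub fn row_to_rgb<const BITS: u32>(y: &[u16], out: &mut [u8]) {
    row_to_rgb_or_rgba::<BITS, false>(y, out)
}

/// Thin RGBA wrapper over the template.
pub fn row_to_rgba<const BITS: u32>(y: &[u16], out: &mut [u8]) {
    row_to_rgb_or_rgba::<BITS, true>(y, out)
}
```

Because `ALPHA` is a const generic, each wrapper monomorphizes to a version where the untaken store branch can be eliminated at compile time.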

RGBA store helpers (vst4q_u8, write_rgba_16/32/64) are reused verbatim from PR #25's 4:2:0 work — no new helpers needed.
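For readers unfamiliar with interleaving stores, here is a scalar model of what a 16-pixel RGBA store does: four planar lanes go in, and one interleaved `R G B A R G B A ...` run comes out. The function name `write_rgba_16_model` is hypothetical and this is not the crate's helper, only a sketch of the memory layout such a helper is expected to produce.

```rust
/// Scalar model of a 16-pixel interleaving RGBA store (hypothetical name).
/// Inputs are four planar 16-byte lanes; the output is 64 bytes laid out as
/// R0 G0 B0 A0 R1 G1 B1 A1 ... R15 G15 B15 A15.
fn write_rgba_16_model(
    r: &[u8; 16],
    g: &[u8; 16],
    b: &[u8; 16],
    a: &[u8; 16],
    out: &mut [u8; 64],
) {
    for i in 0..16 {
        out[4 * i] = r[i];
        out[4 * i + 1] = g[i];
        out[4 * i + 2] = b[i];
        out[4 * i + 3] = a[i];
    }
}
```

For the u8 RGBA paths the alpha lane is a constant splat of `0xFF`, so only the R, G and B lanes vary per pixel.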

The 4:4:4 kernel structure is simpler than 4:2:0: chroma is 1:1 with Y, so there is no horizontal duplication step, no chroma-pair while-loop split, and no `_lo`/`_hi` half pairs at the store. The const-`ALPHA` refactor was therefore mechanical: only the store branch and the tail dispatch needed `if ALPHA { ... } else { ... }`.
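The difference in chroma handling can be sketched in scalar form. The two helpers below are hypothetical and only show which chroma sample each luma position reads along a row; they assume nearest-neighbour duplication for the subsampled case.

```rust
/// 4:4:4: one chroma sample per luma sample, so the row is used as-is.
fn chroma_for_row_444(u: &[u16], width: usize) -> Vec<u16> {
    u[..width].to_vec()
}

/// 4:2:0 (horizontal part only): each chroma sample covers two luma
/// positions, so it has to be duplicated across the pair.
fn chroma_for_row_420(u: &[u16], width: usize) -> Vec<u16> {
    (0..width).map(|x| u[x / 2]).collect()
}
```

With 4:4:4 input the duplication step disappears, which is why the kernels need no pair-wise loop split.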

Dispatcher wiring (8 u8 RGBA dispatchers in `src/row/mod.rs`)

Replace the 8 `let _ = use_simd; // SIMD per-arch routes land in Ship 8 Tranche 7b.` stubs (landed in PR #29) with the standard `cfg_select!` per-arch route block, mirroring the existing high-bit RGB dispatchers:

yuv444p9_to_rgba_row, yuv444p10_to_rgba_row, yuv444p12_to_rgba_row, yuv444p14_to_rgba_row (BITS-generic planar)
yuv444p16_to_rgba_row (16-bit dedicated planar)
p410_to_rgba_row, p412_to_rgba_row (BITS-generic Pn)
p416_to_rgba_row (16-bit dedicated Pn)

use_simd = false still forces scalar. The 8 u16 RGBA dispatchers still route to scalar — those land in 7c.
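The routing behaviour can be sketched as follows. This is a simplified stand-in that uses plain `#[cfg]` blocks and the standard runtime feature-detection macros rather than `cfg_select!`, and it only covers two of the five backends; `Route` and `pick_route` are hypothetical names, not items from `src/row/mod.rs`.

```rust
/// Which implementation a dispatcher would call (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Route {
    Scalar,
    Sse41,
    Neon,
}

/// Picks a route: `use_simd == false` always forces scalar; otherwise the
/// first backend available on this target and CPU wins, with scalar as the
/// fallback.
pub fn pick_route(use_simd: bool) -> Route {
    if !use_simd {
        return Route::Scalar;
    }
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("sse4.1") {
            return Route::Sse41;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Route::Neon;
        }
    }
    Route::Scalar
}
```

A real dispatcher would call the selected kernel directly instead of returning an enum; the enum just makes the decision observable in a test.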

Per-backend RGBA equivalence tests (~30 new tests)

6 tests per backend × 5 backends, mirroring PR #25's structure. Each backend covers all 4 kernel families across narrow + tail + 1920 widths and the full ColorMatrix × range cross-product:

<backend>_yuv444p_n_rgba_matches_scalar_all_bits (BITS=9/10/12/14)
<backend>_yuv444p_n_rgba_matches_scalar_tail_and_widths
<backend>_pn_444_rgba_matches_scalar_all_bits (BITS=10/12)
<backend>_pn_444_rgba_matches_scalar_tail_and_widths
<backend>_yuv444p16_rgba_matches_scalar_all_matrices
<backend>_p416_rgba_matches_scalar_all_matrices

All 18 new x86 `#[test]` functions (6 SSE4.1 + 6 AVX2 + 6 AVX-512) include `is_x86_feature_detected!` early-return guards, per the Tranche 5a CI fallout (without them, ASAN runs hit SIGILL and Miri reports UB on runners lacking the feature). NEON tests use `#[cfg_attr(miri, ignore = "...")]`. Wasm tests are module-level cfg-gated by `target_feature = "simd128"`.
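A minimal sketch of the guard-then-compare shape these tests take, under loud assumptions: `scalar_rgba_row` and `blocked_rgba_row` are hypothetical stand-ins (the "fast" path is a 16-pixel blocked loop with a scalar tail, not a real SIMD kernel), and the pixel math is a gray placeholder. In the crate the check would carry `#[test]`; it is a plain function here so it can be called directly.

```rust
/// Reference path: gray placeholder math plus opaque alpha.
fn scalar_rgba_row(y: &[u16], out: &mut [u8]) {
    for (i, &s) in y.iter().enumerate() {
        let g = (s >> 2) as u8; // 10-bit to 8-bit narrowing
        out[4 * i..4 * i + 4].copy_from_slice(&[g, g, g, 0xFF]);
    }
}

/// Stand-in for a SIMD kernel: whole 16-pixel blocks, then a scalar tail.
fn blocked_rgba_row(y: &[u16], out: &mut [u8]) {
    let main = y.len() / 16 * 16;
    let (y_main, y_tail) = y.split_at(main);
    let (out_main, out_tail) = out.split_at_mut(main * 4);
    for (src, dst) in y_main.chunks_exact(16).zip(out_main.chunks_exact_mut(64)) {
        for (i, &s) in src.iter().enumerate() {
            let g = (s >> 2) as u8;
            dst[4 * i] = g;
            dst[4 * i + 1] = g;
            dst[4 * i + 2] = g;
            dst[4 * i + 3] = 0xFF;
        }
    }
    scalar_rgba_row(y_tail, out_tail);
}

/// Returns false if skipped by the feature guard, true if all widths matched.
fn rgba_matches_scalar_tail_and_widths() -> bool {
    // Early-return guard: never reach feature-gated code on a CPU without it.
    #[cfg(target_arch = "x86_64")]
    {
        if !std::arch::is_x86_feature_detected!("sse4.1") {
            return false;
        }
    }
    // Narrow, tail-only, block-aligned, block-plus-tail and full-HD widths.
    for &width in &[1usize, 15, 16, 17, 33, 1920] {
        let y: Vec<u16> = (0..width).map(|i| (i * 37 % 1024) as u16).collect();
        let mut expected = vec![0u8; width * 4];
        let mut actual = vec![0u8; width * 4];
        scalar_rgba_row(&y, &mut expected);
        blocked_rgba_row(&y, &mut actual);
        assert_eq!(actual, expected, "mismatch at width {width}");
    }
    true
}
```

Widths just below and above a block boundary exercise the hand-off between the vector loop and the scalar tail, which is where off-by-one stores tend to show up.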

Test plan

cargo test --lib: 519 pass on aarch64-darwin (host); was 513 → +6 NEON-side RGBA tests. The other 24 (x86 + wasm gate-guarded) fire on their respective CI runners.
cargo check --tests --lib clean across host, x86_64-unknown-freebsd, wasm32-unknown-unknown
RUSTFLAGS="-Dwarnings" cargo clippy --lib --tests clean
Zero dead_code warnings — every new *_to_rgba_row wrapper is consumed by its dispatcher

Codex adversarial review

Verdict: approve. No material findings.

Out of scope (deferred to follow-up)

u16 RGBA SIMD across all 5 backends (Tranche 7c)
Sinker integration (MixedSinker<Yuv444p9..16>, <P410/P412/P416>, <Yuv440p10/12>) — Tranche 7c

🤖 Generated with Claude Code

Copilot

Pull request overview

This PR completes the u8 RGBA SIMD wiring for high-bit-depth 4:4:4 YUV/Pn formats by replacing the prior stub RGBA dispatchers with real per-arch SIMD routing and adding per-backend SIMD↔scalar equivalence tests.

Changes:

Wire the 8 high-bit 4:4:4 u8 RGBA dispatchers in src/row/mod.rs to per-arch SIMD backends (NEON/SSE4.1/AVX2/AVX-512/simd128), with scalar fallback when use_simd == false or when no SIMD backend is available.
Refactor each SIMD backend’s existing u8 RGB kernels into a shared const-ALPHA implementation plus thin RGB/RGBA wrappers for the 4:4:4 families.
Add per-backend u8 RGBA equivalence tests to byte-pin SIMD output against the scalar reference across matrices/ranges and tail widths.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/row/mod.rs | Replaces the u8 RGBA stub dispatchers with real `cfg_select!` per-arch SIMD routing (scalar fallback preserved). |
| src/row/arch/neon.rs | Adds RGBA wrappers and shared const-`ALPHA` 4:4:4 u8 kernel path for NEON. |
| src/row/arch/neon/tests.rs | Adds NEON u8 RGBA SIMD↔scalar equivalence tests for all 4:4:4 kernel families. |
| src/row/arch/x86_sse41.rs | Adds RGBA wrappers and shared const-`ALPHA` 4:4:4 u8 kernel path for SSE4.1. |
| src/row/arch/x86_sse41/tests.rs | Adds SSE4.1 u8 RGBA SIMD↔scalar equivalence tests with runtime feature guards. |
| src/row/arch/x86_avx2.rs | Adds RGBA wrappers and shared const-`ALPHA` 4:4:4 u8 kernel path for AVX2. |
| src/row/arch/x86_avx2/tests.rs | Adds AVX2 u8 RGBA SIMD↔scalar equivalence tests with runtime feature guards. |
| src/row/arch/x86_avx512.rs | Adds RGBA wrappers and shared const-`ALPHA` 4:4:4 u8 kernel path for AVX-512BW. |
| src/row/arch/x86_avx512/tests.rs | Adds AVX-512 u8 RGBA SIMD↔scalar equivalence tests with runtime feature guards. |
| src/row/arch/wasm_simd128.rs | Adds RGBA wrappers and shared const-`ALPHA` 4:4:4 u8 kernel path for wasm simd128. |
| src/row/arch/wasm_simd128/tests.rs | Adds wasm simd128 u8 RGBA SIMD↔scalar equivalence tests (module-level simd128 gating). |

The 4:4:4 high-bit YUV planar SIMD docs claimed `BITS ∈ {10, 12, 14}` across all 5 backends, but the const-assert in every implementation accepts `BITS == 9 || 10 || 12 || 14` and the `yuv444p9_to_rgba_row` public dispatcher (added in PR #29) instantiates the kernel with `<9>`. The doc string was stale from before BITS=9 was added in Ship 6b.

Updates both the const-generic bound (`{10, 12, 14}` → `{9, 10, 12, 14}`) and the prose bit-list (`10/12/14-bit` → `9/10/12/14-bit`) on every 4:4:4 planar SIMD doc. This covers the u8 RGB, u8 RGBA (added in this PR), and u16 RGB siblings across NEON, SSE4.1, AVX2, AVX-512, and wasm simd128. 23 lines updated total.

Addresses Copilot review comments on PR #30. Also retroactively fixes the matching drift on the u16 RGB and pre-existing u8 RGB docs that Copilot didn't explicitly flag but had identical wording.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
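To make the bound concrete, here is one way such a compile-time check on `BITS` can be written in Rust. This is a sketch of the general technique, not the crate's code: `AssertBits` and `max_sample` are hypothetical names, and the crate's own const-assert may be spelled differently.

```rust
/// Carrier type for a compile-time check on BITS (hypothetical helper).
struct AssertBits<const BITS: u32>;

impl<const BITS: u32> AssertBits<BITS> {
    /// Evaluated per instantiation; an unsupported BITS fails the build.
    const OK: () = assert!(BITS == 9 || BITS == 10 || BITS == 12 || BITS == 14);
}

/// Largest valid sample value for a supported bit depth.
fn max_sample<const BITS: u32>() -> u16 {
    // Referencing the constant forces the check for this BITS value.
    let () = AssertBits::<BITS>::OK;
    ((1u32 << BITS) - 1) as u16
}
```

With this shape, `max_sample::<9>()` compiles, while an instantiation such as `max_sample::<11>()` is rejected at build time, so a doc comment listing only `{10, 12, 14}` would understate what the code accepts.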

update

4a79d1e

al8n changed the title ~~update~~ Ship 8 Tranche 7b: high-bit 4:4:4 RGBA u8 SIMD Apr 27, 2026

al8n requested a review from Copilot April 27, 2026 03:36

Copilot started reviewing on behalf of al8n April 27, 2026 03:37

Copilot AI reviewed Apr 27, 2026

Comment thread src/row/arch/x86_sse41.rs

Comment thread src/row/arch/neon.rs

Comment thread src/row/arch/x86_avx2.rs Outdated

Comment thread src/row/arch/x86_avx512.rs Outdated

Comment thread src/row/arch/wasm_simd128.rs Outdated

uqio merged commit eedbe1e into main Apr 27, 2026
43 checks passed

uqio deleted the feat/ship8-rgba-high-bit-444-u8-simd branch April 27, 2026 04:18

uqio merged 2 commits into main from feat/ship8-rgba-high-bit-444-u8-simd
