
Ship 8 Tranche 7b: high-bit 4:4:4 RGBA u8 SIMD #30

Merged
uqio merged 2 commits into main from feat/ship8-rgba-high-bit-444-u8-simd
Apr 27, 2026

@uqio uqio commented Apr 27, 2026

Summary

Wires u8 RGBA SIMD across all 5 backends for high-bit-depth 4:4:4 YUV (Yuv444p9/10/12/14/16, P410/P412/P416) and replaces the 8 stub dispatchers landed in PR #29 with real cfg_select! per-arch routes. Mirrors PR #25 (Tranche 5a) exactly, which did the same for 4:2:0.

The companion u16 RGBA SIMD work + sinker integration lands in Tranche 7c.

Changes

SIMD (5 backends × 4 kernel families = 20 kernel refactors)

Each backend's existing u8 RGB kernel becomes a thin wrapper over a const-ALPHA template, alongside a new RGBA wrapper:

| Family | Const-ALPHA template | RGB wrapper | RGBA wrapper |
| --- | --- | --- | --- |
| Yuv444p_n (BITS-generic) | `yuv_444p_n_to_rgb_or_rgba_row<BITS, ALPHA>` | `yuv_444p_n_to_rgb_row<BITS>` | `yuv_444p_n_to_rgba_row<BITS>` |
| Yuv444p16 (16-bit dedicated) | `yuv_444p16_to_rgb_or_rgba_row<ALPHA>` | `yuv_444p16_to_rgb_row` | `yuv_444p16_to_rgba_row` |
| P_n_444 (BITS-generic) | `p_n_444_to_rgb_or_rgba_row<BITS, ALPHA>` | `p_n_444_to_rgb_row<BITS>` | `p_n_444_to_rgba_row<BITS>` |
| P_n_444_16 (P416) | `p_n_444_16_to_rgb_or_rgba_row<ALPHA>` | `p_n_444_16_to_rgb_row` | `p_n_444_16_to_rgba_row` |

Only the per-iteration store and the scalar tail dispatch branch on ALPHA; per-pixel math is unchanged. Alpha = 0xFF for all u8 RGBA paths. Per-arch alpha splat: vdupq_n_u8(0xFF) (NEON) / _mm_set1_epi8(-1) (SSE4.1) / _mm256_set1_epi8(-1) (AVX2) / _mm512_set1_epi8(-1) (AVX-512) / u8x16_splat(0xFF) (wasm).
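The x86 splats pass `-1` because `_mm_set1_epi8` takes a signed `i8`, and `-1i8` has the same bit pattern as `0xFF`. The following is a minimal sketch of that equivalence, not code from this PR: `opaque_alpha_lanes` is a hypothetical helper, and it uses only the SSE2 baseline of x86_64 so it needs no runtime feature check.

```rust
/// Hypothetical helper: returns the 16 bytes of an all-opaque alpha vector.
#[cfg(target_arch = "x86_64")]
pub fn opaque_alpha_lanes() -> [u8; 16] {
    use core::arch::x86_64::{__m128i, _mm_set1_epi8, _mm_storeu_si128};
    let mut out = [0u8; 16];
    // SAFETY: SSE2 is part of the x86_64 baseline, and `out` provides
    // 16 writable bytes for the unaligned store.
    unsafe {
        // -1 as i8 is the bit pattern 0xFF in every lane.
        let alpha = _mm_set1_epi8(-1);
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, alpha);
    }
    out
}

/// Portable stand-in for targets without the x86_64 intrinsics.
#[cfg(not(target_arch = "x86_64"))]
pub fn opaque_alpha_lanes() -> [u8; 16] {
    [0xFF; 16]
}
```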

RGBA store helpers (vst4q_u8, write_rgba_16/32/64) are reused verbatim from PR #25's 4:2:0 work — no new helpers needed.

The 4:4:4 kernel structure is simpler than 4:2:0: chroma is 1:1 with Y so there's no horizontal duplication step, no chroma-pair while-loop split, no _lo/_hi half pairs at the store. The const-ALPHA refactor was therefore mechanical — only the store branch and the tail dispatch needed if ALPHA { ... } else { ... }.
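The const-ALPHA pattern can be illustrated with a scalar sketch. This is an assumption-laden toy rather than the PR's kernels: the `gray_*` names are hypothetical, and replicating luma into R, G and B stands in for the real YUV math. What it shows is the shape described above: one template, a branch on `ALPHA` only at the store, and two thin wrappers.

```rust
/// Hypothetical shared template: 3 bytes per pixel when ALPHA is false,
/// 4 bytes per pixel (with opaque alpha) when ALPHA is true.
fn gray_to_rgb_or_rgba_row<const ALPHA: bool>(luma: &[u8], out: &mut [u8]) {
    let bpp = if ALPHA { 4 } else { 3 };
    assert!(out.len() >= luma.len() * bpp, "output row too short");
    for (px, &y) in out.chunks_exact_mut(bpp).zip(luma) {
        // Per-pixel math is identical for both instantiations.
        px[0] = y;
        px[1] = y;
        px[2] = y;
        // Only the store differs; the branch is resolved at compile time.
        if ALPHA {
            px[3] = 0xFF;
        }
    }
}

/// Thin RGB wrapper over the template.
pub fn gray_to_rgb_row(luma: &[u8], out: &mut [u8]) {
    gray_to_rgb_or_rgba_row::<false>(luma, out)
}

/// Thin RGBA wrapper over the template.
pub fn gray_to_rgba_row(luma: &[u8], out: &mut [u8]) {
    gray_to_rgb_or_rgba_row::<true>(luma, out)
}
```

Because `ALPHA` is a const generic, each wrapper monomorphizes to its own copy of the loop, so the RGB path should not pay for the RGBA store.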

Dispatcher wiring (8 u8 RGBA dispatchers in src/row/mod.rs)

Replace the 8 `let _ = use_simd; // SIMD per-arch routes land in Ship 8 Tranche 7b.` stubs (landed in PR #29) with the standard `cfg_select!` per-arch route block, mirroring the existing high-bit RGB dispatchers:

  • yuv444p9_to_rgba_row, yuv444p10_to_rgba_row, yuv444p12_to_rgba_row, yuv444p14_to_rgba_row (BITS-generic planar)
  • yuv444p16_to_rgba_row (16-bit dedicated planar)
  • p410_to_rgba_row, p412_to_rgba_row (BITS-generic Pn)
  • p416_to_rgba_row (16-bit dedicated Pn)

use_simd = false still forces scalar. The 8 u16 RGBA dispatchers still route to scalar — those land in 7c.
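The routing contract (SIMD when requested and available, scalar otherwise) can be sketched as follows. This is not the crate's dispatcher: it uses plain `#[cfg]` attributes instead of `cfg_select!`, covers only one backend, and every name in it (`fill_alpha_row`, `sse41_fill_alpha`, the returned route label) is hypothetical. The "SIMD" function is a stand-in with a scalar body.

```rust
/// Scalar reference path: sets alpha to opaque for every RGBA pixel.
fn scalar_fill_alpha(rgba: &mut [u8]) {
    for px in rgba.chunks_exact_mut(4) {
        px[3] = 0xFF;
    }
}

/// Stand-in for an SSE4.1 kernel; a real one would use intrinsics.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn sse41_fill_alpha(rgba: &mut [u8]) {
    for px in rgba.chunks_exact_mut(4) {
        px[3] = 0xFF;
    }
}

/// Hypothetical dispatcher. Returns the route taken, purely for illustration.
pub fn fill_alpha_row(rgba: &mut [u8], use_simd: bool) -> &'static str {
    if use_simd {
        #[cfg(target_arch = "x86_64")]
        {
            if std::is_x86_feature_detected!("sse4.1") {
                // SAFETY: the runtime check above confirms SSE4.1 is present.
                unsafe { sse41_fill_alpha(rgba) };
                return "sse4.1";
            }
        }
    }
    // `use_simd == false`, or no supported backend: fall through to scalar.
    scalar_fill_alpha(rgba);
    "scalar"
}
```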

Per-backend RGBA equivalence tests (~30 new tests)

6 tests per backend × 5 backends, mirroring PR #25's structure. Each backend covers all 4 kernel families across narrow + tail + 1920 widths and the full ColorMatrix × range cross-product:

  • <backend>_yuv444p_n_rgba_matches_scalar_all_bits (BITS=9/10/12/14)
  • <backend>_yuv444p_n_rgba_matches_scalar_tail_and_widths
  • <backend>_pn_444_rgba_matches_scalar_all_bits (BITS=10/12)
  • <backend>_pn_444_rgba_matches_scalar_tail_and_widths
  • <backend>_yuv444p16_rgba_matches_scalar_all_matrices
  • <backend>_p416_rgba_matches_scalar_all_matrices

All 18 new x86 #[test] functions (6 SSE4.1 + 6 AVX2 + 6 AVX-512) include is_x86_feature_detected! early-return guards, following the Tranche 5a CI fallout: without them, ASAN hits SIGILL and Miri reports UB on runners lacking the feature. NEON tests use #[cfg_attr(miri, ignore = "...")]. Wasm tests are cfg-gated at module level on target_feature = "simd128".
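A minimal sketch of the early-return guard on an x86 equivalence test, under loud assumptions: `scalar_opaque_row` and `avx2_opaque_row` are hypothetical stand-ins (the "AVX2" function simply calls the scalar one), and the widths are illustrative rather than the PR's actual test matrix.

```rust
/// Scalar reference: `width` black pixels with opaque alpha.
pub fn scalar_opaque_row(width: usize) -> Vec<u8> {
    let mut out = vec![0u8; width * 4];
    for px in out.chunks_exact_mut(4) {
        px[3] = 0xFF;
    }
    out
}

/// Stand-in for an AVX2 kernel; must only run on CPUs with AVX2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
pub unsafe fn avx2_opaque_row(width: usize) -> Vec<u8> {
    scalar_opaque_row(width)
}

#[cfg(all(test, target_arch = "x86_64"))]
mod tests {
    use super::*;

    #[test]
    fn avx2_rgba_matches_scalar_tail_and_widths() {
        // Early return: never execute AVX2 code on a runner without it.
        if !std::is_x86_feature_detected!("avx2") {
            return;
        }
        for width in [1usize, 15, 16, 17, 1920] {
            // SAFETY: AVX2 availability was checked above.
            let simd = unsafe { avx2_opaque_row(width) };
            assert_eq!(simd, scalar_opaque_row(width), "width {width}");
        }
    }
}
```

With the guard in place the test passes vacuously on machines lacking the feature instead of faulting, which is the trade-off the guard accepts.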

Test plan

  • cargo test --lib: 519 pass on aarch64-darwin (host); was 513 → +6 NEON-side RGBA tests. The other 24 (x86 + wasm gate-guarded) fire on their respective CI runners.
  • cargo check --tests --lib clean across host, x86_64-unknown-freebsd, wasm32-unknown-unknown
  • RUSTFLAGS="-Dwarnings" cargo clippy --lib --tests clean
  • Zero dead_code warnings — every new *_to_rgba_row wrapper is consumed by its dispatcher

Codex adversarial review

Verdict: approve. No material findings.

Out of scope (deferred to follow-up)

  • u16 RGBA SIMD across all 5 backends (Tranche 7c)
  • Sinker integration (MixedSinker<Yuv444p9..16>, <P410/P412/P416>, <Yuv440p10/12>) — Tranche 7c

🤖 Generated with Claude Code

@al8n al8n changed the title from "update" to "Ship 8 Tranche 7b: high-bit 4:4:4 RGBA u8 SIMD" Apr 27, 2026
@al8n al8n requested a review from Copilot April 27, 2026 03:36

Copilot AI left a comment

Pull request overview

This PR completes the u8 RGBA SIMD wiring for high-bit-depth 4:4:4 YUV/Pn formats by replacing the prior stub RGBA dispatchers with real per-arch SIMD routing and adding per-backend SIMD↔scalar equivalence tests.

Changes:

  • Wire the 8 high-bit 4:4:4 u8 RGBA dispatchers in src/row/mod.rs to per-arch SIMD backends (NEON/SSE4.1/AVX2/AVX-512/simd128), with scalar fallback when use_simd == false or unavailable.
  • Refactor each SIMD backend’s existing u8 RGB kernels into a shared const-ALPHA implementation plus thin RGB/RGBA wrappers for the 4:4:4 families.
  • Add per-backend u8 RGBA equivalence tests to byte-pin SIMD output against the scalar reference across matrices/ranges and tail widths.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/row/mod.rs | Replaces the u8 RGBA stub dispatchers with real cfg_select! per-arch SIMD routing (scalar fallback preserved). |
| src/row/arch/neon.rs | Adds RGBA wrappers and shared const-ALPHA 4:4:4 u8 kernel path for NEON. |
| src/row/arch/neon/tests.rs | Adds NEON u8 RGBA SIMD↔scalar equivalence tests for all 4:4:4 kernel families. |
| src/row/arch/x86_sse41.rs | Adds RGBA wrappers and shared const-ALPHA 4:4:4 u8 kernel path for SSE4.1. |
| src/row/arch/x86_sse41/tests.rs | Adds SSE4.1 u8 RGBA SIMD↔scalar equivalence tests with runtime feature guards. |
| src/row/arch/x86_avx2.rs | Adds RGBA wrappers and shared const-ALPHA 4:4:4 u8 kernel path for AVX2. |
| src/row/arch/x86_avx2/tests.rs | Adds AVX2 u8 RGBA SIMD↔scalar equivalence tests with runtime feature guards. |
| src/row/arch/x86_avx512.rs | Adds RGBA wrappers and shared const-ALPHA 4:4:4 u8 kernel path for AVX-512BW. |
| src/row/arch/x86_avx512/tests.rs | Adds AVX-512 u8 RGBA SIMD↔scalar equivalence tests with runtime feature guards. |
| src/row/arch/wasm_simd128.rs | Adds RGBA wrappers and shared const-ALPHA 4:4:4 u8 kernel path for wasm simd128. |
| src/row/arch/wasm_simd128/tests.rs | Adds wasm simd128 u8 RGBA SIMD↔scalar equivalence tests (module-level simd128 gating). |

Comment thread src/row/arch/x86_sse41.rs
Comment thread src/row/arch/neon.rs
Comment thread src/row/arch/x86_avx2.rs Outdated
Comment thread src/row/arch/x86_avx512.rs Outdated
Comment thread src/row/arch/wasm_simd128.rs Outdated
The 4:4:4 high-bit YUV planar SIMD docs claimed `BITS ∈ {10, 12, 14}`
across all 5 backends, but the const-assert in every implementation
accepts `BITS == 9 || 10 || 12 || 14` and the `yuv444p9_to_rgba_row`
public dispatcher (added in PR #29) instantiates the kernel with
`<9>`. The doc string was stale from before BITS=9 was added in Ship 6b.
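For readers unfamiliar with the pattern, a compile-time bound on a const generic like the one described here can be sketched as below. This is an illustration under assumptions, not the crate's code: `AssertBits` and `mask_sample` are hypothetical names, and masking a sample to `BITS` bits is a stand-in for the real kernel body.

```rust
/// Hypothetical carrier type for a compile-time check on BITS.
struct AssertBits<const BITS: u32>;

impl<const BITS: u32> AssertBits<BITS> {
    /// Evaluated once per instantiation; fails the build for other values.
    const OK: () = assert!(
        BITS == 9 || BITS == 10 || BITS == 12 || BITS == 14,
        "BITS must be one of 9, 10, 12, 14"
    );
}

/// Stand-in kernel body: keeps only the low BITS bits of a sample.
pub fn mask_sample<const BITS: u32>(sample: u16) -> u16 {
    // Referencing the constant forces the assertion for this BITS.
    let () = AssertBits::<BITS>::OK;
    sample & ((1u16 << BITS) - 1)
}
```

Instantiating `mask_sample::<9>` compiles, while something like `mask_sample::<11>` would be rejected at build time, which is why a doc comment listing only `{10, 12, 14}` drifts silently: the assertion, not the prose, is what the compiler checks.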

Updates both the const-generic bound (`{10, 12, 14}` → `{9, 10, 12, 14}`)
and the prose bit-list (`10/12/14-bit` → `9/10/12/14-bit`) on every
4:4:4 planar SIMD doc — covers the u8 RGB, u8 RGBA (added in this PR),
and u16 RGB siblings across NEON, SSE4.1, AVX2, AVX-512, and wasm
simd128. 23 lines updated total.

Addresses Copilot review comments on PR #30. Also retroactively fixes
the matching drift on the u16 RGB and pre-existing u8 RGB docs that
Copilot didn't explicitly flag but had identical wording.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@uqio uqio merged commit eedbe1e into main Apr 27, 2026
43 checks passed
@uqio uqio deleted the feat/ship8-rgba-high-bit-444-u8-simd branch April 27, 2026 04:18