Ship 8 Tranche 5b: high-bit 4:2:0 RGBA u16 SIMD + sinker integration#26
Pull request overview
Extends the “Ship 8” RGBA pipeline to high-bit-depth 4:2:0 formats by adding native-depth u16 RGBA support end-to-end (MixedSinker wiring + row dispatchers + per-arch SIMD implementations), plus targeted correctness/equivalence tests.
Changes:
- Add `with_rgba`/`with_rgba_u16` (and setters) and Strategy-A RGB→RGBA fanout wiring for high-bit-depth 4:2:0 `MixedSinker` formats (Yuv420p9/10/12/14/16, P010/P012/P016).
- Enable SIMD dispatch for native-depth `u16` RGBA row conversions across NEON, SSE4.1, AVX2, AVX-512, and wasm simd128 backends; add a shared x86 `u16` RGBA interleave writer.
- Add new tests covering MixedSinker RGBA behavior (subset) and per-arch SIMD equivalence vs scalar for native-depth `u16` RGBA kernels.
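The PR description presents each kernel as one shared template parameterized on `ALPHA`, with thin RGB and RGBA wrappers around it. The following is a minimal scalar sketch of that shape only. The `gray_*` names and the placeholder per-pixel math are hypothetical and are not taken from the PR; the compile-time `BITS` guard and the opaque alpha value `(1 << BITS) - 1` follow the description.

```rust
/// Hypothetical shared core: writes 3 (RGB) or 4 (RGBA) u16 channels per pixel.
/// The per-pixel value is a placeholder, not the real YUV-to-RGB conversion.
fn gray_to_rgb_or_rgba_u16_row<const BITS: u32, const ALPHA: bool>(
    luma: &[u16],
    out: &mut [u16],
) {
    // Compile-time guard on the supported bit depths.
    const { assert!(BITS == 9 || BITS == 10 || BITS == 12 || BITS == 14) };

    // Opaque alpha (and channel maximum) at native depth.
    let max = ((1u32 << BITS) - 1) as u16;
    let bpp = if ALPHA { 4 } else { 3 };
    assert!(out.len() >= luma.len() * bpp);

    for (px, &y) in out.chunks_exact_mut(bpp).zip(luma) {
        let v = y.min(max);
        px[0] = v;
        px[1] = v;
        px[2] = v;
        // Only the store differs between the RGB and RGBA variants.
        if ALPHA {
            px[3] = max;
        }
    }
}

/// Thin RGB wrapper over the shared core.
pub fn gray_to_rgb_u16_row<const BITS: u32>(luma: &[u16], out: &mut [u16]) {
    gray_to_rgb_or_rgba_u16_row::<BITS, false>(luma, out);
}

/// Thin RGBA wrapper over the shared core.
pub fn gray_to_rgba_u16_row<const BITS: u32>(luma: &[u16], out: &mut [u16]) {
    gray_to_rgb_or_rgba_u16_row::<BITS, true>(luma, out);
}
```

Because `ALPHA` is a const generic, the `if ALPHA` branches are resolved per instantiation, so the RGB wrapper carries no alpha store.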
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/sinker/mixed/tests.rs | Adds MixedSinker-level RGBA tests for selected high-bit 4:2:0 formats (incl. alpha correctness + Strategy A equivalence checks). |
| src/sinker/mixed/subsampled_4_2_0_high_bit.rs | Wires RGBA/RGBA-u16 buffers into high-bit 4:2:0 sinkers and implements Strategy-A fanout/fast-path logic. |
| src/sinker/mixed/planar_8bit.rs | Updates compile-fail doc example to reference a still-unwired format for RGBA. |
| src/sinker/mixed/mod.rs | Adds rgba_u16_plane_row_slice helper for slicing u16 RGBA rows safely. |
| src/row/mod.rs | Exposes expand_rgb_u16_to_rgba_u16_row and updates high-bit 4:2:0 u16 RGBA dispatchers to actually SIMD-dispatch. |
| src/row/arch/x86_common.rs | Adds write_rgba_u16_8 helper to interleave/store packed RGBA-u16 for 8 pixels on x86. |
| src/row/arch/x86_sse41.rs | Implements SSE4.1 native-depth u16 RGBA kernels via shared RGB/RGBA core. |
| src/row/arch/x86_sse41/tests.rs | Adds SSE4.1 equivalence tests for native-depth u16 RGBA kernels. |
| src/row/arch/x86_avx2.rs | Implements AVX2 native-depth u16 RGBA kernels via shared RGB/RGBA core. |
| src/row/arch/x86_avx2/tests.rs | Adds AVX2 equivalence tests for native-depth u16 RGBA kernels. |
| src/row/arch/x86_avx512.rs | Implements AVX-512 native-depth u16 RGBA kernels (incl. RGBA stores) via shared RGB/RGBA core. |
| src/row/arch/x86_avx512/tests.rs | Adds AVX-512 equivalence tests for native-depth u16 RGBA kernels. |
| src/row/arch/wasm_simd128.rs | Implements wasm simd128 native-depth u16 RGBA kernels via shared RGB/RGBA core + RGBA store helper. |
| src/row/arch/wasm_simd128/tests.rs | Adds wasm simd128 equivalence tests for native-depth u16 RGBA kernels. |
| src/row/arch/neon.rs | Implements NEON native-depth u16 RGBA kernels via shared RGB/RGBA core. |
| src/row/arch/neon/tests.rs | Adds NEON equivalence tests for native-depth u16 RGBA kernels. |
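The file summary mentions a Strategy-A RGB→RGBA fanout and an `expand_rgb_u16_to_rgba_u16_row` helper, but this page does not show either. As a rough sketch, assuming the fanout means "convert to RGB once, then expand to RGBA with a constant opaque alpha", a scalar expansion could look like the following. The function name, signature, and `alpha` parameter are hypothetical and may differ from the real helper.

```rust
/// Hypothetical sketch: expands packed RGB u16 pixels to packed RGBA u16 pixels,
/// filling the alpha channel with a caller-supplied constant.
pub fn expand_rgb_to_rgba_u16_sketch(rgb: &[u16], rgba: &mut [u16], width: usize, alpha: u16) {
    assert!(rgb.len() >= width * 3);
    assert!(rgba.len() >= width * 4);

    for (src, dst) in rgb
        .chunks_exact(3)
        .zip(rgba.chunks_exact_mut(4))
        .take(width)
    {
        dst[..3].copy_from_slice(src); // copy R, G, B unchanged
        dst[3] = alpha; // constant opaque alpha, e.g. (1 << BITS) - 1
    }
}
```

Under that assumption, a caller that wants both RGB and RGBA output for a row would run the YUV conversion once and derive the RGBA row from the RGB row.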
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.
/// Routes through the dedicated 16-bit u16-output scalar kernel
/// (`scalar::yuv_420p16_to_rgba_u16_row`) — uses i64 chroma multiply
/// for the wider `coeff × u_d` product at 16 → 16-bit scaling. SIMD
/// per-arch routes land in the follow-up Ship 8 Tranche 5b PR.
/// for the wider `coeff × u_d` product at 16 → 16-bit scaling.
/// `use_simd = false` forces the scalar reference path.
The doc comment claims this dispatcher "routes through the dedicated ... scalar kernel", but the implementation now conditionally dispatches to per-arch SIMD backends when use_simd is true. Please update the docs to describe SIMD-first dispatch with scalar fallback (and keep the note that use_simd = false forces the scalar reference).
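For illustration, the SIMD-first dispatch with scalar fallback that this comment asks the docs to describe could be sketched as below. The `clamp10_*` functions are hypothetical stand-ins, not the PR's kernels, and plain `#[cfg]` with runtime feature detection is used here as an assumption in place of the `cfg_select!` route block the PR mentions.

```rust
/// Scalar reference: placeholder per-element work (clamp to the 10-bit maximum).
fn clamp10_row_scalar(src: &[u16], dst: &mut [u16]) {
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = s.min(1023);
    }
}

/// Stand-in for a per-arch kernel. A real backend would use intrinsics;
/// this body is plain Rust compiled with SSE4.1 enabled.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn clamp10_row_sse41(src: &[u16], dst: &mut [u16]) {
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = s.min(1023);
    }
}

/// Dispatches to the best available backend for the current target and
/// falls back to the scalar kernel. `use_simd = false` forces the scalar
/// reference path.
pub fn clamp10_row(src: &[u16], dst: &mut [u16], use_simd: bool) {
    assert!(dst.len() >= src.len());
    if use_simd {
        #[cfg(target_arch = "x86_64")]
        {
            if std::arch::is_x86_feature_detected!("sse4.1") {
                // SAFETY: SSE4.1 support was checked at runtime just above.
                unsafe { clamp10_row_sse41(src, dst) };
                return;
            }
        }
    }
    clamp10_row_scalar(src, dst);
}
```

Calling the dispatcher with `use_simd = true` and `use_simd = false` on the same input and comparing the results is the same idea as the per-arch equivalence tests listed in the file summary.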
@@ -3518,8 +3856,8 @@ pub fn p016_to_rgba_row(
/// `0xFFFF`.
///
/// Routes through the dedicated 16-bit u16-output P016 scalar kernel
The docs say this dispatcher routes through the scalar kernel, but the implementation now attempts SIMD backends when use_simd is true and falls back to scalar otherwise. Please update the comment to reflect SIMD-first dispatch + scalar fallback (while keeping the note that use_simd = false forces scalar).
/// Routes through the dedicated 16-bit u16-output P016 scalar kernel
/// Dispatches to the best available backend for the current target and
/// falls back to the dedicated 16-bit u16-output P016 scalar kernel
Summary
Adds u16 RGBA SIMD across all 5 backends for high-bit 4:2:0 YUV (`yuv420p9/10/12/14/16`, `p010/p012/p016`), wires them into the 8 high-bit u16 RGBA dispatchers in `src/row/mod.rs`, and lands sinker-level integration: `with_rgba` (u8) + `with_rgba_u16` (u16) builders on all 8 high-bit 4:2:0 `MixedSinker` impls. Closes the Ship 8 high-bit 4:2:0 RGBA work begun in PR #24 (scalar prep) and PR #25 (5a: u8 RGBA SIMD).
Changes
SIMD u16 RGBA (5 backends × 4 kernel families = 20 kernel refactors)
- `*_to_rgb_u16_row<BITS>` becomes a thin wrapper over `*_to_rgb_or_rgba_u16_row<BITS, ALPHA>`, alongside a new `*_to_rgba_u16_row<BITS>` wrapper.
- Kernel families:
  - `yuv_420p_n_to_rgb_or_rgba_u16_row<BITS={9,10,12,14}, ALPHA>`
  - `p_n_to_rgb_or_rgba_u16_row<BITS={10,12}, ALPHA>` (P016 has its own family)
  - `yuv_420p16_to_rgb_or_rgba_u16_row<ALPHA>`
  - `p16_to_rgb_or_rgba_u16_row<ALPHA>`
- The store (`vst3q_u16` vs `vst4q_u16`, `write_rgb_u16_8` vs new `write_rgba_u16_8`, etc.) and scalar tail dispatch branch on `ALPHA`; per-pixel math is unchanged.
- Alpha is `(1 << BITS) - 1` for BITS-generic kernels and `0xFFFF` for 16-bit kernels, matching the scalar references.
- `const { assert!(BITS == ...) }` retained on every shared template; added to `p_n_to_rgb_or_rgba_u16_row` (the prior `p_n_to_rgb_u16_row` was missing the guard).
- New store helpers: `write_rgba_u16_8` (NEON via `vst4q_u16`, x86 SSE2-superset via two-stage unpack, wasm via `i16x8_shuffle`), `write_rgba_u16_32` + `write_quarter_rgba` (AVX-512).
Dispatcher wiring (8 u16 RGBA dispatchers in `src/row/mod.rs`)
- `yuv420p9/10/12/14/16_to_rgba_u16_row` and `p010/p012/p016_to_rgba_u16_row` now use the standard `cfg_select!` per-arch route block, mirroring the 5a u8 RGBA dispatchers. `use_simd = false` still forces scalar.
Sinker integration (`src/sinker/mixed/subsampled_4_2_0_high_bit.rs`)
Tests (~36 new)
Test plan
Follow-ups (out of scope)
🤖 Generated with Claude Code