Ship 8 Tranche 5b: high-bit 4:2:0 RGBA u16 SIMD + sinker integration#26

Merged
uqio merged 3 commits into main from feat/ship8-rgba-high-bit-420-u16-simd on Apr 26, 2026

Conversation

@uqio uqio commented Apr 26, 2026

Summary

Adds u16 RGBA SIMD across all 5 backends for high-bit 4:2:0 YUV (`yuv420p9/10/12/14/16`, `p010/p012/p016`), wires them into the 8 high-bit u16 RGBA dispatchers in `src/row/mod.rs`, and lands sinker-level integration: `with_rgba` (u8) + `with_rgba_u16` (u16) builders on all 8 high-bit 4:2:0 `MixedSinker` impls. Closes the Ship 8 high-bit 4:2:0 RGBA work begun in PR #24 (scalar prep) and PR #25 (5a: u8 RGBA SIMD).

Changes

SIMD u16 RGBA (5 backends × 4 kernel families = 20 kernel refactors)

  • Each backend's `*_to_rgb_u16_row<BITS>` becomes a thin wrapper over `*_to_rgb_or_rgba_u16_row<BITS, ALPHA>`, alongside a new `*_to_rgba_u16_row<BITS>` wrapper. Kernel families:
    • planar BITS-generic: `yuv_420p_n_to_rgb_or_rgba_u16_row<BITS={9,10,12,14}, ALPHA>`
    • semi-planar BITS-generic: `p_n_to_rgb_or_rgba_u16_row<BITS={10,12}, ALPHA>` (P016 has its own family)
    • 16-bit planar: `yuv_420p16_to_rgb_or_rgba_u16_row<ALPHA>`
    • 16-bit semi-planar: `p16_to_rgb_or_rgba_u16_row<ALPHA>`
  • Only the store (`vst3q_u16` vs `vst4q_u16`, `write_rgb_u16_8` vs new `write_rgba_u16_8`, etc.) and scalar tail dispatch branch on `ALPHA`; per-pixel math is unchanged.
  • Alpha contract: `(1 << BITS) - 1` for BITS-generic kernels, `0xFFFF` for 16-bit kernels, matching the scalar references.
  • Compile-time `const { assert!(BITS == ...) }` retained on every shared template; added to `p_n_to_rgb_or_rgba_u16_row` (the prior `p_n_to_rgb_u16_row` was missing the guard).
  • New per-backend RGBA u16 store helpers: `write_rgba_u16_8` (NEON via `vst4q_u16`, x86 SSE2-superset via two-stage unpack, wasm via `i16x8_shuffle`), `write_rgba_u16_32` + `write_quarter_rgba` (AVX-512).
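
The wrapper-over-shared-template shape described in this list can be sketched in scalar Rust. This is an illustrative sketch only, not the crate's kernel: `gray_to_rgb_or_rgba_u16_row` and its two wrappers are hypothetical names, and the per-pixel math is reduced to a gray copy so that the `ALPHA` store branch and the `(1 << BITS) - 1` alpha stay visible. It assumes a toolchain with inline `const` blocks (Rust 1.79 or later).

```rust
/// Hypothetical shared template: the per-pixel value is computed once,
/// and only the store step branches on `ALPHA`.
fn gray_to_rgb_or_rgba_u16_row<const BITS: u32, const ALPHA: bool>(
    y: &[u16],
    out: &mut [u16],
) {
    // Compile-time guard, in the spirit of the PR's `const { assert!(..) }`.
    const { assert!(BITS == 9 || BITS == 10 || BITS == 12 || BITS == 14) };

    let max = ((1u32 << BITS) - 1) as u16; // opaque alpha at native depth
    let bpp = if ALPHA { 4 } else { 3 };
    assert!(out.len() >= y.len() * bpp, "output row too short");

    for (px, &luma) in out.chunks_exact_mut(bpp).zip(y) {
        let v = luma.min(max); // placeholder for the real YUV -> RGB math
        px[0] = v;
        px[1] = v;
        px[2] = v;
        if ALPHA {
            px[3] = max;
        }
    }
}

/// Thin RGB wrapper (hypothetical name).
pub fn gray_to_rgb_u16_row<const BITS: u32>(y: &[u16], out: &mut [u16]) {
    gray_to_rgb_or_rgba_u16_row::<BITS, false>(y, out);
}

/// Thin RGBA wrapper (hypothetical name).
pub fn gray_to_rgba_u16_row<const BITS: u32>(y: &[u16], out: &mut [u16]) {
    gray_to_rgb_or_rgba_u16_row::<BITS, true>(y, out);
}
```

Because `ALPHA` is a const generic, each wrapper monomorphizes to a copy in which the `if ALPHA` branch is resolved at compile time, which is presumably why the PR can share one body without a per-pixel runtime branch.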

Dispatcher wiring (8 u16 RGBA dispatchers in `src/row/mod.rs`)

  • Replace the prior `let _ = use_simd; // ... Tranche 5b` stubs in yuv420p9/10/12/14/16_to_rgba_u16_row and p010/p012/p016_to_rgba_u16_row with the standard cfg_select! per-arch route block, mirroring the 5a u8 RGBA dispatchers. `use_simd = false` still forces scalar.
  • Section header + per-dispatcher doc comments updated to remove `Tranche 5b` placeholder language.
  • `expand_rgb_u16_to_rgba_u16_row` re-exported from `src/row/scalar.rs` for sinker-side Strategy A consumers.
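
A minimal sketch of the SIMD-first, scalar-fallback routing with a `use_simd` override. It is written under assumptions: plain `#[cfg]` attributes and `is_x86_feature_detected!` stand in for the PR's `cfg_select!` route block, all function names are hypothetical, and the "SIMD" backend simply reuses the scalar loop under `#[target_feature]` rather than using intrinsics.

```rust
/// Scalar reference: set the alpha lane of every RGBA u16 pixel.
pub fn fill_alpha_scalar(rgba: &mut [u16], alpha: u16) {
    for px in rgba.chunks_exact_mut(4) {
        px[3] = alpha;
    }
}

/// Stand-in for a per-arch kernel (hypothetical). A real backend would use
/// intrinsics; this one only compiles the scalar loop with SSE4.1 enabled.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn fill_alpha_sse41(rgba: &mut [u16], alpha: u16) {
    fill_alpha_scalar(rgba, alpha);
}

/// Hypothetical dispatcher: try SIMD when allowed, otherwise fall back to
/// scalar. `use_simd = false` always forces the scalar reference path.
pub fn fill_alpha_row(rgba: &mut [u16], alpha: u16, use_simd: bool) {
    if use_simd {
        #[cfg(target_arch = "x86_64")]
        {
            if std::arch::is_x86_feature_detected!("sse4.1") {
                // SAFETY: SSE4.1 support was confirmed at runtime just above.
                unsafe { fill_alpha_sse41(rgba, alpha) };
                return;
            }
        }
    }
    fill_alpha_scalar(rgba, alpha);
}
```

On targets without a matching backend the `if use_simd` body is empty after `cfg` stripping, so every call lands on the scalar path.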

Sinker integration (`src/sinker/mixed/subsampled_4_2_0_high_bit.rs`)

  • All 8 high-bit 4:2:0 `MixedSinker` impls gain 4 new builder methods each (32 new methods total): `with_rgba` / `set_rgba` (u8) and `with_rgba_u16` / `set_rgba_u16` (u16).
  • Each format's `PixelSink::process` restructured to consume the new buffers via Strategy A combine:
    • u16 path: rgba_u16-only routes directly through `*_to_rgba_u16_row`; rgb_u16+rgba_u16 runs the RGB kernel once and fans out via `expand_rgb_u16_to_rgba_u16_row` (cheap per-pixel pad with depth-aware alpha).
    • u8 path: same shape. rgba-only goes direct; rgb+rgba (or hsv+rgba) uses the existing scratch + `expand_rgb_to_rgba_row` fan-out from PR #20 (feat(sinker): Ship 8, Nv24/Nv42 RGBA + Strategy A RGB→RGBA fan-out).
  • New helper `rgba_u16_plane_row_slice` in `src/sinker/mixed/mod.rs` mirrors the existing `rgba_plane_row_slice` (u8) — used in 16 call sites across the 8 formats.
  • The `compile_fail` doctest in `planar_8bit.rs` that demonstrates "attaching RGBA to a sink that doesn't write it is rejected" was using `Yuv420p10` as its negative example; now updated to `Yuv422p10` (4:2:2 high-bit, still genuinely lacks `with_rgba`).
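
The Strategy A combine on the u16 path (convert to RGB once, then pad to RGBA) might look roughly like the sketch below. The names `expand_rgb_u16_to_rgba_u16_sketch` and `process_row_sketch` and their signatures are assumptions made for illustration; the crate's actual `expand_rgb_u16_to_rgba_u16_row` signature is not shown in this PR description.

```rust
/// Pads packed RGB u16 pixels to RGBA u16 with a depth-aware opaque alpha.
/// Hypothetical stand-in for the crate's fan-out helper.
pub fn expand_rgb_u16_to_rgba_u16_sketch<const BITS: u32>(rgb: &[u16], rgba: &mut [u16]) {
    assert!((1..=16).contains(&BITS));
    assert_eq!(rgb.len() % 3, 0, "RGB row must hold whole pixels");
    let width = rgb.len() / 3;
    assert!(rgba.len() >= width * 4, "RGBA row too short");

    // (1 << BITS) - 1 yields 0xFFFF when BITS == 16.
    let alpha = ((1u32 << BITS) - 1) as u16;
    for (src, dst) in rgb.chunks_exact(3).zip(rgba.chunks_exact_mut(4)) {
        dst[..3].copy_from_slice(src);
        dst[3] = alpha;
    }
}

/// One row of the combine: run the (caller-supplied) RGB kernel once,
/// then derive the RGBA row from its output instead of converting twice.
pub fn process_row_sketch<const BITS: u32>(
    convert_rgb: impl FnOnce(&mut [u16]),
    rgb_out: &mut [u16],
    rgba_out: &mut [u16],
) {
    convert_rgb(rgb_out);
    expand_rgb_u16_to_rgba_u16_sketch::<BITS>(rgb_out, rgba_out);
}
```

The pad is a copy plus one constant store per pixel, so when both buffers are attached it is much cheaper than running the YUV conversion a second time.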

Tests (~36 new)

  • 28 row-level RGBA u16 equivalence tests (6 per backend × 5 backends, modulo NEON which is 6 too): each covers all 4 kernel families across narrow + tail + 1920 widths and the full matrix × range cross-product.
  • 8 sinker integration tests: `Yuv420p10` covers the BITS-generic planar path (rgba u8/u16 gray-to-gray, Strategy A combine for both depths, buffer-too-short err for both); `P010` covers the BITS-generic Pn path; `Yuv420p16` covers the 16-bit dedicated kernel.
  • All new x86 `#[test]` functions include `is_x86_feature_detected!` early-return guards (per the 5a CI fallout — without them, ASAN sanitizer gets SIGILL and Miri reports UB).
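
The guarded equivalence check mentioned in the last bullet can be sketched as follows. Everything here is hypothetical: the kernel is a trivial stand-in, and the check returns `Option<bool>` (with `None` meaning "skipped") so it can be called outside a test harness; in a real `#[test]` the guard would simply `return` early.

```rust
/// Scalar reference used as ground truth (placeholder math).
pub fn scale_row_scalar(src: &[u16], dst: &mut [u16]) {
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = s.wrapping_mul(3);
    }
}

/// Stand-in AVX2 backend: same loop, compiled with AVX2 enabled.
/// Running it on a CPU without AVX2 could fault, hence the guard below.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn scale_row_avx2(src: &[u16], dst: &mut [u16]) {
    scale_row_scalar(src, dst);
}

/// Returns `None` when the check is skipped (no AVX2, or not x86_64),
/// otherwise whether the SIMD output equals the scalar output.
pub fn avx2_matches_scalar(width: usize) -> Option<bool> {
    #[cfg(target_arch = "x86_64")]
    {
        // Guard: only enter AVX2 code after runtime detection succeeds.
        if std::arch::is_x86_feature_detected!("avx2") {
            let src: Vec<u16> = (0..width as u32).map(|i| (i * 37 % 1024) as u16).collect();
            let mut want = vec![0u16; width];
            let mut got = vec![0u16; width];
            scale_row_scalar(&src, &mut want);
            // SAFETY: AVX2 availability was checked just above.
            unsafe { scale_row_avx2(&src, &mut got) };
            return Some(want == got);
        }
    }
    let _ = width;
    None
}
```

A width such as `1920 + 5` exercises both the full-vector body and a scalar tail in a real kernel, matching the narrow + tail + 1920 coverage listed above.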

Test plan

  • `cargo test --lib` on host (aarch64-darwin / NEON path): 499 pass, 0 fail
  • `cargo check --tests --lib` clean across host / x86_64-unknown-freebsd / wasm32-unknown-unknown
  • `RUSTFLAGS="-Dwarnings" cargo clippy --lib --tests` clean
  • `cargo test --doc` clean (the updated `compile_fail` example correctly rejects `Yuv422p10`)
  • CI: ASAN sanitizer on x86_64-linux (should pass — guards in place)
  • CI: Miri on x86_64-linux (should pass — guards in place)
  • On-device equivalence run for AVX2 / AVX-512 / SSE4.1 hardware (deferred to CI)

Follow-ups (out of scope)

  • 4:2:2 (`Yuv422p9/10/12/14/16`, `P210/P212/P216`) and 4:4:4 (`Yuv444p9/10/12/14/16`, `P410/P412/P416`) high-bit sinkers still lack `with_rgba` / `with_rgba_u16` — symmetric gaps closable in a future tranche.
  • Cleanup PR to split inline `mod tests` blocks out of large source files (per the `project_colconv_cleanup_split_tests` memory note).

🤖 Generated with Claude Code

Copilot AI left a comment

Pull request overview

Extends the “Ship 8” RGBA pipeline to high-bit-depth 4:2:0 formats by adding native-depth u16 RGBA support end-to-end (MixedSinker wiring + row dispatchers + per-arch SIMD implementations), plus targeted correctness/equivalence tests.

Changes:

  • Add with_rgba/with_rgba_u16 (and setters) and Strategy-A RGB→RGBA fanout wiring for high-bit-depth 4:2:0 MixedSinker formats (Yuv420p9/10/12/14/16, P010/P012/P016).
  • Enable SIMD dispatch for native-depth u16 RGBA row conversions across NEON, SSE4.1, AVX2, AVX-512, and wasm simd128 backends; add a shared x86 u16 RGBA interleave writer.
  • Add new tests covering MixedSinker RGBA behavior (subset) and per-arch SIMD equivalence vs scalar for native-depth u16 RGBA kernels.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/sinker/mixed/tests.rs | Adds MixedSinker-level RGBA tests for selected high-bit 4:2:0 formats (incl. alpha correctness + Strategy A equivalence checks). |
| src/sinker/mixed/subsampled_4_2_0_high_bit.rs | Wires RGBA/RGBA-u16 buffers into high-bit 4:2:0 sinkers and implements Strategy-A fanout/fast-path logic. |
| src/sinker/mixed/planar_8bit.rs | Updates compile-fail doc example to reference a still-unwired format for RGBA. |
| src/sinker/mixed/mod.rs | Adds rgba_u16_plane_row_slice helper for slicing u16 RGBA rows safely. |
| src/row/mod.rs | Exposes expand_rgb_u16_to_rgba_u16_row and updates high-bit 4:2:0 u16 RGBA dispatchers to actually SIMD-dispatch. |
| src/row/arch/x86_common.rs | Adds write_rgba_u16_8 helper to interleave/store packed RGBA-u16 for 8 pixels on x86. |
| src/row/arch/x86_sse41.rs | Implements SSE4.1 native-depth u16 RGBA kernels via shared RGB/RGBA core. |
| src/row/arch/x86_sse41/tests.rs | Adds SSE4.1 equivalence tests for native-depth u16 RGBA kernels. |
| src/row/arch/x86_avx2.rs | Implements AVX2 native-depth u16 RGBA kernels via shared RGB/RGBA core. |
| src/row/arch/x86_avx2/tests.rs | Adds AVX2 equivalence tests for native-depth u16 RGBA kernels. |
| src/row/arch/x86_avx512.rs | Implements AVX-512 native-depth u16 RGBA kernels (incl. RGBA stores) via shared RGB/RGBA core. |
| src/row/arch/x86_avx512/tests.rs | Adds AVX-512 equivalence tests for native-depth u16 RGBA kernels. |
| src/row/arch/wasm_simd128.rs | Implements wasm simd128 native-depth u16 RGBA kernels via shared RGB/RGBA core + RGBA store helper. |
| src/row/arch/wasm_simd128/tests.rs | Adds wasm simd128 equivalence tests for native-depth u16 RGBA kernels. |
| src/row/arch/neon.rs | Implements NEON native-depth u16 RGBA kernels via shared RGB/RGBA core. |
| src/row/arch/neon/tests.rs | Adds NEON equivalence tests for native-depth u16 RGBA kernels. |

Comment thread src/sinker/mixed/mod.rs Outdated
@al8n changed the title from "Feat/ship8 rgba high bit 420 u16 simd" to "Ship 8 Tranche 5b: high-bit 4:2:0 RGBA u16 SIMD + sinker integration" on Apr 26, 2026
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.


Comment thread src/row/mod.rs
Comment on lines 3718 to +3721
/// Routes through the dedicated 16-bit u16-output scalar kernel
/// (`scalar::yuv_420p16_to_rgba_u16_row`) — uses i64 chroma multiply
-/// for the wider `coeff × u_d` product at 16 → 16-bit scaling. SIMD
-/// per-arch routes land in the follow-up Ship 8 Tranche 5b PR.
+/// for the wider `coeff × u_d` product at 16 → 16-bit scaling.
+/// `use_simd = false` forces the scalar reference path.

Copilot AI Apr 26, 2026

The doc comment claims this dispatcher "routes through the dedicated ... scalar kernel", but the implementation now conditionally dispatches to per-arch SIMD backends when use_simd is true. Please update the docs to describe SIMD-first dispatch with scalar fallback (and keep the note that use_simd = false forces the scalar reference).

Comment thread src/row/mod.rs
@@ -3518,8 +3856,8 @@ pub fn p016_to_rgba_row(
/// `0xFFFF`.
///
/// Routes through the dedicated 16-bit u16-output P016 scalar kernel

Copilot AI Apr 26, 2026

The docs say this dispatcher routes through the scalar kernel, but the implementation now attempts SIMD backends when use_simd is true and falls back to scalar otherwise. Please update the comment to reflect SIMD-first dispatch + scalar fallback (while keeping the note that use_simd = false forces scalar).

Suggested change
-/// Routes through the dedicated 16-bit u16-output P016 scalar kernel
+/// Dispatches to the best available backend for the current target and
+/// falls back to the dedicated 16-bit u16-output P016 scalar kernel

@uqio uqio merged commit cc09cbb into main Apr 26, 2026
47 checks passed
@uqio uqio deleted the feat/ship8-rgba-high-bit-420-u16-simd branch April 26, 2026 14:01