feat(sinker): Ship 8 — Yuv420p RGBA output via const-ALPHA template #16

Merged: uqio merged 5 commits into main from feat/ship8-rgba-yuv420 on Apr 26, 2026

@uqio uqio commented Apr 26, 2026

First PR of Ship 8 (sink-side with_rgba / with_rgba_u16 for every existing source format). Lands the template-validating slice: end-to-end RGBA support for Yuv420p only — public API, all five SIMD backends, MixedSinker integration, per-backend equivalence tests. The remaining ~30 kernel families (NV12, NV21, Yuv422p, Yuv444p, all bit-depth variants, P010 / P016 / P210 / P410, etc.) inherit the same const-generic-ALPHA template via mechanical mass-apply in follow-up PRs — see What's deferred below.

Scope

Sink-side only (the high-value half per docs/color-conversion-functions.md § Ship 8). YUVA source frames + per-pixel alpha pass-through are deferred to a follow-up PR. Default alpha is 0xFF (opaque) for sources that don't carry an alpha plane — every YUV format shipped today.

use colconv::{
    frame::Yuv420pFrame,
    sinker::MixedSinker,
    yuv::{Yuv420p, yuv420p_to},
    ColorMatrix,
};

let frame = Yuv420pFrame::new(&yp, &up, &vp, w, h, w, w / 2, w / 2);
let mut rgba = vec![0u8; (w * h * 4) as usize];
let mut sinker = MixedSinker::<Yuv420p>::new(w as usize, h as usize)
    .with_rgba(&mut rgba)?;

yuv420p_to(&frame, /*full_range=*/ true, ColorMatrix::Bt709, &mut sinker)?;
// Each pixel: rgba[4*i..4*i+3] = R,G,B; rgba[4*i+3] = 0xFF.

Design: const-generic ALPHA: bool kernel template

The natural alternatives:

  1. Compose-and-expand: kernel writes 3-byte RGB to scratch, then a second pass expands to 4-byte RGBA with constant alpha. Zero new SIMD code, but ~7W bytes/row of memory traffic vs. native's 4W → roughly 1.5–2× slower for the RGBA-only path.
  2. Fully forked native: kernel forked into RGB and RGBA copies per backend. Best speed, ~5–10× LOC duplication across 30+ kernels.
  3. Const-generic ALPHA (this PR): kernel body monomorphized on <const ALPHA: bool>; the only delta is the final per-block store (vst3q_u8 → vst4q_u8 on NEON; new write_rgba_* shuffle helpers on x86 / wasm, parallel to the existing write_rgb_*). Same speed as native, ~1/5th the LOC. Compiler DCE eliminates the if ALPHA branch and the unused alpha-vector splat at each call site.

All five backends keep R / G / B in separate vectors at store time, so adding alpha is purely a 4th vector + different shuffle/store — no re-interleave.
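To make option 3 concrete, here is a minimal scalar sketch of the const-generic pattern. It is not the PR's kernel: `pack_rgb_or_rgba_row`, `pack_rgb_row` and `pack_rgba_row` are hypothetical names, and the YUV → RGB math is left out so that only the store-time difference is visible.

```rust
/// Hypothetical stand-in for the shared kernel body. It packs already
/// converted R, G, B planes into RGB (ALPHA = false) or RGBA (ALPHA = true).
fn pack_rgb_or_rgba_row<const ALPHA: bool>(r: &[u8], g: &[u8], b: &[u8], out: &mut [u8]) {
    // Bytes per pixel is a compile-time constant in each instantiation.
    let bpp = if ALPHA { 4 } else { 3 };
    assert!(g.len() == r.len() && b.len() == r.len());
    assert!(out.len() >= bpp * r.len());
    for (i, px) in out.chunks_exact_mut(bpp).take(r.len()).enumerate() {
        px[0] = r[i];
        px[1] = g[i];
        px[2] = b[i];
        // Constant condition: the branch disappears in the `false` copy.
        if ALPHA {
            px[3] = 0xFF;
        }
    }
}

/// Thin public wrappers, one per monomorphized copy.
pub fn pack_rgb_row(r: &[u8], g: &[u8], b: &[u8], out: &mut [u8]) {
    pack_rgb_or_rgba_row::<false>(r, g, b, out)
}

pub fn pack_rgba_row(r: &[u8], g: &[u8], b: &[u8], out: &mut [u8]) {
    pack_rgb_or_rgba_row::<true>(r, g, b, out)
}
```

Both wrappers share one source body, so a fix to the conversion logic lands in the RGB and RGBA paths at once.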

What's in this PR

Public API (colconv::sinker)

  • MixedSinker<Yuv420p>::with_rgba(&mut [u8]) and set_rgba — attaches a packed 32-bit RGBA buffer (width × height × 4 bytes). Format-specific impl block (not on the generic MixedSinker<F>), so attaching RGBA to a sink whose process doesn't write it is a compile error rather than a silent stale-buffer bug. Future formats add their own impl blocks as RGBA support lands; a compile_fail doctest pins this property.
  • MixedSinker::produces_rgba() / produces_rgba_u16() — generic queries (always false for sinks without a write path; symmetrical with the existing produces_rgb_u16()).
  • MixedSinkerError::RgbaBufferTooShort { expected, actual } and RgbaU16BufferTooShort — new error variants.
  • row::yuv_420_to_rgba_row(...) — public dispatcher mirroring yuv_420_to_rgb_row(...). Same contract; rgba_out.len() >= 4 * width.
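The buffer-size contract above (width × height × 4 bytes, rejected at attach time) can be sketched as follows. `SinkError` and `check_rgba_len` are hypothetical stand-ins, not the crate's `MixedSinkerError` or its internal check; only the `{ expected, actual }` shape is taken from the description.

```rust
/// Hypothetical error type mirroring the `{ expected, actual }` shape.
#[derive(Debug, PartialEq, Eq)]
pub enum SinkError {
    RgbaBufferTooShort { expected: usize, actual: usize },
}

/// Validate that `buf` can hold a packed RGBA frame of `width` x `height`.
pub fn check_rgba_len(width: usize, height: usize, buf: &[u8]) -> Result<(), SinkError> {
    // Checked arithmetic so an absurd frame size cannot wrap around.
    let expected = width
        .checked_mul(height)
        .and_then(|n| n.checked_mul(4))
        .expect("frame size overflows usize");
    if buf.len() < expected {
        Err(SinkError::RgbaBufferTooShort { expected, actual: buf.len() })
    } else {
        Ok(())
    }
}
```

Checking once when the buffer is attached keeps the per-row hot path free of length errors.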

Per-backend kernels (src/row/)

| File | What's added |
| --- | --- |
| scalar.rs | yuv_420_to_rgba_row + shared yuv_420_to_rgb_or_rgba_row::<const ALPHA: bool> template |
| arch/x86_common.rs | write_rgba_16 (4 shuffle masks × 4 stores per 16 pixels = 64 bytes) |
| arch/x86_sse41.rs | yuv_420_to_rgba_row + shared template; uses write_rgba_16 |
| arch/x86_avx2.rs | yuv_420_to_rgba_row + write_rgba_32 (two halves through write_rgba_16) |
| arch/x86_avx512.rs | yuv_420_to_rgba_row + write_rgba_64 (four quarters through write_rgba_16) |
| arch/neon.rs | yuv_420_to_rgba_row + shared template; uses native vst4q_u8 |
| arch/wasm_simd128.rs | yuv_420_to_rgba_row + new write_rgba_16 (4 swizzle masks × 4 stores) |
| mod.rs | rgba_row_bytes(width) helper + public yuv_420_to_rgba_row dispatcher |

The 4-byte stride aligns cleanly with the 16-byte register width on every backend, so the RGBA shuffle masks are simpler than the 3-byte RGB pattern (no channel "split across blocks" at the 16-byte boundary). NEON's vst4q_u8 is a native 4-channel store — zero shuffle overhead vs. the 3-channel vst3q_u8.
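A portable scalar model of the 16-pixel RGBA store may help show why the 4-byte stride is convenient. `interleave_rgba_16` is a hypothetical name; the real write_rgba_16 helpers use shuffle intrinsics, which this sketch does not attempt to reproduce.

```rust
/// Scalar model of a 16-pixel RGBA store: four planar 16-byte vectors in,
/// 64 interleaved bytes out, organized as four 16-byte "stores".
pub fn interleave_rgba_16(r: &[u8; 16], g: &[u8; 16], b: &[u8; 16], a: &[u8; 16]) -> [u8; 64] {
    let mut out = [0u8; 64];
    for store in 0..4 {
        for lane in 0..16 {
            // Each 16-byte store holds exactly four whole pixels, so no
            // pixel straddles a store boundary (unlike 3-byte RGB).
            let px = store * 4 + lane / 4;
            out[store * 16 + lane] = match lane % 4 {
                0 => r[px],
                1 => g[px],
                2 => b[px],
                _ => a[px],
            };
        }
    }
    out
}
```

With 3-byte RGB the same 16 pixels occupy 48 bytes, and pixel 5 starts at byte 15, so its channels are split across two stores.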

MixedSinker integration

MixedSinker<Yuv420p>::process runs the RGBA path as an independent kernel call rather than composing it from the RGB output. Because the kernel is monomorphized on ALPHA, the ::<true> (RGBA) and ::<false> (RGB) instantiations of the shared template (yuv_420_to_rgb_or_rgba_row in scalar.rs) are separate functions generated from one source body, each fully inlined at its call site with the unused branch eliminated.

SIMD coverage

| Kernel | NEON | SSE4.1 | AVX2 | AVX-512 | wasm simd128 |
| --- | --- | --- | --- | --- | --- |
| yuv_420_to_rgba_row | yes | yes | yes | yes | yes |

Block sizes per iteration mirror the existing RGB kernels: NEON / SSE4.1 / wasm = 16 px; AVX2 = 32 px; AVX-512 = 64 px.
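As a small illustration of these block sizes, the split between whole blocks and leftover pixels can be computed as below. `split_blocks` is a hypothetical helper, and the assumption that leftover pixels go to a scalar tail is mine, not stated in the PR.

```rust
/// Split a row of `width` pixels into the part covered by whole
/// `block`-pixel iterations and the leftover tail.
pub fn split_blocks(width: usize, block: usize) -> (usize, usize) {
    assert!(block > 0, "block size must be non-zero");
    let covered = width / block * block;
    (covered, width - covered)
}
```

For the width of 1922 used in the tests, all three block sizes cover 1920 pixels and leave a 2-pixel tail.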

Tests

Format-level (5 tests in src/sinker/mixed.rs):

  • rgba_only_converts_gray_to_gray_with_opaque_alpha — gray YUV → gray RGB + alpha = 0xFF.
  • rgba_alpha_is_opaque_for_arbitrary_color — alpha is 0xFF for non-gray content.
  • with_rgb_and_with_rgba_produce_byte_identical_rgb_bytes — cross-format invariant: alpha is the only difference between with_rgb and with_rgba outputs.
  • rgba_with_simd_false_matches_with_simd_true — SIMD ≡ scalar parity across 8 widths covering all backend block sizes (16 / 32 / 64) plus tails.
  • rgba_buffer_too_short_returns_err — short buffer rejected at with_rgba call.
  • yuv_420_to_rgba_simd_matches_scalar_with_random_yuv: pseudo-random per-pixel YUV (width 1922, all 4 matrices × both ranges) catches lane-order corruption that solid-color tests would miss.
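The shape of such a parity test can be sketched without any SIMD at all. Everything here is hypothetical: `blocked_rgba` merely stands in for a backend that processes whole blocks and then a tail, and the LCG replaces whatever pseudo-random source the real tests use.

```rust
/// Deterministic pseudo-random fill (simple LCG, Numerical Recipes constants).
fn lcg_fill(buf: &mut [u8], seed: &mut u32) {
    for b in buf.iter_mut() {
        *seed = seed.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
        *b = (*seed >> 24) as u8;
    }
}

/// Scalar reference: one pixel at a time, opaque alpha.
fn scalar_rgba(r: &[u8], g: &[u8], b: &[u8], out: &mut [u8]) {
    for i in 0..r.len() {
        out[4 * i..4 * i + 4].copy_from_slice(&[r[i], g[i], b[i], 0xFF]);
    }
}

/// Stand-in for a SIMD backend: whole BLOCK-pixel chunks, then a tail.
fn blocked_rgba<const BLOCK: usize>(r: &[u8], g: &[u8], b: &[u8], out: &mut [u8]) {
    let main = r.len() / BLOCK * BLOCK;
    for start in (0..main).step_by(BLOCK) {
        let end = start + BLOCK;
        scalar_rgba(&r[start..end], &g[start..end], &b[start..end], &mut out[4 * start..4 * end]);
    }
    scalar_rgba(&r[main..], &g[main..], &b[main..], &mut out[4 * main..]);
}

/// Byte-for-byte comparison of both paths on pseudo-random input.
pub fn parity_holds<const BLOCK: usize>(width: usize) -> bool {
    let mut seed = 0x1234_5678u32;
    let (mut r, mut g, mut b) = (vec![0u8; width], vec![0u8; width], vec![0u8; width]);
    lcg_fill(&mut r, &mut seed);
    lcg_fill(&mut g, &mut seed);
    lcg_fill(&mut b, &mut seed);
    let mut expected = vec![0u8; 4 * width];
    let mut actual = vec![0u8; 4 * width];
    scalar_rgba(&r, &g, &b, &mut expected);
    blocked_rgba::<BLOCK>(&r, &g, &b, &mut actual);
    expected == actual
}
```

Random per-pixel values matter because a lane-order bug leaves solid-color rows unchanged; only varying data exposes a swapped lane.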

Per-backend equivalence (18 tests in src/row/arch/*.rs):

  • 4 tests per backend (NEON, SSE4.1, AVX2, AVX-512) + 2 tests for wasm = 18 total. Each calls its backend's unsafe yuv_420_to_rgba_row directly under runtime feature detection, comparing against scalar::yuv_420_to_rgba_row byte-for-byte. Bypasses the dispatcher so every backend gets exercised on every CI runner regardless of which tier the dispatcher would pick.
  • Mismatch diagnostics report (byte, pixel, channel R/G/B/A, width, matrix, full_range, scalar_value, simd_value) — first divergence is locally diagnosable.
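A first-divergence report of this kind can be derived from the byte offset alone, since packed RGBA fixes pixel = byte / 4 and channel = byte % 4. The sketch below uses hypothetical names (`Mismatch`, `first_rgba_mismatch`) and omits the width, matrix and range fields that the real diagnostics carry.

```rust
/// Hypothetical diagnostic record for the first differing byte.
#[derive(Debug, PartialEq, Eq)]
pub struct Mismatch {
    pub byte: usize,
    pub pixel: usize,
    pub channel: char,
    pub scalar_value: u8,
    pub simd_value: u8,
}

/// Locate the first byte where two packed RGBA rows differ.
pub fn first_rgba_mismatch(scalar: &[u8], simd: &[u8]) -> Option<Mismatch> {
    assert_eq!(scalar.len(), simd.len(), "rows must have equal length");
    scalar
        .iter()
        .zip(simd)
        .position(|(s, v)| s != v)
        .map(|byte| Mismatch {
            byte,
            pixel: byte / 4,
            channel: ['R', 'G', 'B', 'A'][byte % 4],
            scalar_value: scalar[byte],
            simd_value: simd[byte],
        })
}
```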

Compile-fail doctest in MixedSinker<Yuv420p>::with_rgba proves attaching RGBA to a non-Yuv420p sink (e.g. MixedSinker::<Nv12>::new(...).with_rgba(...)) is a compile error.
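The compile-time guarantee comes from placing the method on a format-specific impl block. A minimal sketch of that pattern, with a hypothetical `Sinker` type and marker structs rather than the crate's own definitions:

```rust
use std::marker::PhantomData;

/// Hypothetical format markers.
pub struct Yuv420p;
pub struct Nv12;

/// Hypothetical sink, generic over the source format.
pub struct Sinker<'a, F> {
    rgba: Option<&'a mut [u8]>,
    _format: PhantomData<F>,
}

// Generic impl: available for every format.
impl<'a, F> Sinker<'a, F> {
    pub fn new() -> Self {
        Sinker { rgba: None, _format: PhantomData }
    }

    pub fn produces_rgba(&self) -> bool {
        self.rgba.is_some()
    }
}

// Format-specific impl: only Yuv420p sinks can attach an RGBA buffer.
impl<'a> Sinker<'a, Yuv420p> {
    pub fn with_rgba(mut self, buf: &'a mut [u8]) -> Self {
        self.rgba = Some(buf);
        self
    }
}

// The following line would not compile, because `with_rgba` is not
// defined for `Sinker<'_, Nv12>`:
// let s = Sinker::<Nv12>::new().with_rgba(&mut [0u8; 16]);
```

A `compile_fail` doctest then keeps the rejected call in the docs, so the property cannot regress silently when a generic method is added later.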

Local results (aarch64 macOS): 429 lib tests + 1 doctest pass.

CI matrix coverage (already wired in .github/workflows/):

  • test-sde-avx512 (Intel SDE Ice Lake) — AVX-512 path.
  • test jobs on AMD EPYC ubuntu-latest / macOS aarch64 — AVX2 / NEON.
  • coverage.yml per-tier matrix with --cfg colconv_disable_avx512 / --cfg colconv_disable_avx2 / --cfg colconv_force_scalar — exercises SSE4.1 + scalar fallback paths.
  • cross job — wasm32-wasip1 test suite exercises the wasm RGBA tests.
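For readers unfamiliar with tiered dispatch, the runtime selection these jobs exercise looks roughly like the sketch below. `pick_tier` and its `force_scalar` parameter are hypothetical; the crate uses --cfg flags instead of a runtime flag, and the AVX-512 and wasm tiers are omitted here.

```rust
/// Pick the best available tier at runtime (hypothetical helper).
/// `force_scalar` stands in for a build-time switch such as
/// `--cfg colconv_force_scalar`.
pub fn pick_tier(force_scalar: bool) -> &'static str {
    if force_scalar {
        return "scalar";
    }
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if std::arch::is_x86_feature_detected!("sse4.1") {
            return "sse4.1";
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return "neon";
        }
    }
    "scalar"
}
```

Because a dispatcher like this always picks the highest tier on a given runner, lower tiers only get coverage when they are forced, which is what the per-tier matrix is for.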

What's deferred

Out of scope for this PR (follow-up work):

  1. Mass-apply the const-ALPHA template to the remaining ~30 kernel families. The template is mechanical; each format ships in its own bite-sized PR. Order roughly: 4:2:0 family (NV12 / NV21) → 4:2:2 (Yuv422p / NV16) → 4:4:4 (Yuv444p / NV24 / NV42) → high-bit-depth (Yuv420p9/10/12/14/16, P010/P012/P016, Yuv422p_n, Yuv444p_n, P210/P212/P216, P410/P412/P416, Bayer / Bayer16).
  2. with_rgba_u16 — native-depth u16 RGBA output for high-bit-depth sources. Error variant + format-specific impl pattern is already in place; just needs wiring per format.
  3. YUVA source frame types (Yuv420pAFrame etc.) — when a source carries an alpha plane, the kernel copies it through to the RGBA buffer instead of writing the opaque default. Doc roadmap calls this the "source-side half" of Ship 8.
  4. Bench yuv_420_to_rgba vs. yuv_420_to_rgb — confirm the const-ALPHA path doesn't regress the RGB baseline and measure the RGBA throughput delta. Mechanically straightforward; deferred to keep this PR focused.
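Items 2 and 3 both come down to what goes into the fourth channel. A minimal sketch under stated assumptions: `opaque_alpha` and `fill_alpha` are hypothetical names, BITS is assumed to be between 1 and 16, and the source alpha plane is assumed to be full resolution.

```rust
/// Opaque alpha for a BITS-deep sample: (1 << BITS) - 1.
pub const fn opaque_alpha<const BITS: u32>() -> u16 {
    // Computed in u32 so BITS = 16 does not overflow the shift.
    ((1u32 << BITS) - 1) as u16
}

/// Write the alpha channel of a packed u16 RGBA row: copy the source
/// alpha when present, otherwise use the opaque default.
pub fn fill_alpha<const BITS: u32>(src_alpha: Option<&[u16]>, rgba: &mut [u16]) {
    for (i, px) in rgba.chunks_exact_mut(4).enumerate() {
        px[3] = match src_alpha {
            Some(a) => a[i],
            None => opaque_alpha::<BITS>(),
        };
    }
}
```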

Test plan

  • CI green on test, test-sde-avx512, cross, coverage, clippy, build jobs.
  • Manually verify a Yuv420p → RGBA pipeline end-to-end with a real frame (gray patch, full red patch).
  • Confirm the compile_fail doctest still rejects MixedSinker::<Nv12>::new(...).with_rgba(...).
  • cargo doc --lib --no-deps clean (no new doc warnings vs. main).

🤖 Generated with Claude Code

@al8n changed the title from "Feat/ship8 rgba yuv420" to "feat(sinker): Ship 8 — Yuv420p RGBA output via const-ALPHA template" on Apr 26, 2026
@al8n requested a review from Copilot on April 26, 2026 02:46

Copilot AI left a comment

Pull request overview

Adds end-to-end 8-bit RGBA output for the Yuv420p pipeline by extending MixedSinker with an RGBA buffer path and introducing a const-generic <const ALPHA: bool> kernel template that’s implemented across scalar + all SIMD backends.

Changes:

  • Add MixedSinker<Yuv420p>::with_rgba / set_rgba, new MixedSinkerError variants, and a process() RGBA kernel path.
  • Add public row::yuv_420_to_rgba_row dispatcher plus scalar + SIMD backend implementations using the shared const-generic template.
  • Add format-level and per-backend scalar-equivalence tests for RGBA output.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/sinker/mixed.rs | Adds RGBA buffer plumbing for MixedSinker<Yuv420p>, new error variants, and RGBA-specific tests. |
| src/row/mod.rs | Exposes yuv_420_to_rgba_row dispatcher and adds rgba_row_bytes() helper. |
| src/row/scalar.rs | Introduces yuv_420_to_rgba_row and shared scalar <const ALPHA: bool> template. |
| src/row/arch/x86_common.rs | Adds write_rgba_16 helper for x86 shuffle-based RGBA stores. |
| src/row/arch/x86_sse41.rs | Implements SSE4.1 RGBA row kernel and equivalence tests using write_rgba_16. |
| src/row/arch/x86_avx2.rs | Implements AVX2 RGBA row kernel + write_rgba_32 and equivalence tests. |
| src/row/arch/x86_avx512.rs | Implements AVX-512 RGBA row kernel + write_rgba_64 and equivalence tests. |
| src/row/arch/neon.rs | Implements NEON RGBA row kernel via vst4q_u8 and equivalence tests. |
| src/row/arch/wasm_simd128.rs | Implements wasm simd128 RGBA row kernel + write_rgba_16 and equivalence tests. |

Comment thread src/sinker/mixed.rs Outdated
Comment on lines +155 to +160
/// `u16` RGBA buffer attached via [`MixedSinker::with_rgba_u16`] /
/// [`MixedSinker::set_rgba_u16`] is shorter than `width × height × 4`
/// `u16` elements. Only high‑bit‑depth source impls write into this
/// buffer; the fourth `u16` per pixel is alpha — opaque
/// (`(1 << BITS) - 1`) by default when the source has no alpha
/// plane.
Copilot AI Apr 26, 2026

The intra-doc links in this variant’s docs point to MixedSinker::with_rgba_u16 / set_rgba_u16, but those methods don’t exist anywhere in the crate yet. This will trigger rustdoc::broken_intra_doc_links warnings (and may fail CI if docs are built with warnings-as-errors). Consider removing the links for now (leave them as code-formatted text), or introducing the corresponding APIs in the appropriate format-specific impl blocks before linking to them.
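The first option suggested here, plain code spans instead of links, would look roughly like this. The enum name `SinkError` is a hypothetical stand-in and the doc text is abbreviated from the quoted lines.

```rust
/// Hypothetical error enum illustrating the doc-comment change.
#[derive(Debug, PartialEq, Eq)]
pub enum SinkError {
    /// `u16` RGBA buffer attached via `MixedSinker::with_rgba_u16` /
    /// `MixedSinker::set_rgba_u16` is shorter than `width × height × 4`
    /// `u16` elements.
    ///
    /// The method names are plain code spans (backticks only, no square
    /// brackets), so rustdoc does not try to resolve them as links while
    /// the methods do not exist yet.
    RgbaU16BufferTooShort { expected: usize, actual: usize },
}
```

Once the methods land in a format-specific impl block, the brackets can be restored to turn the spans back into intra-doc links.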

codecov Bot commented Apr 26, 2026

Codecov Report

❌ Patch coverage is 83.02583% with 46 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/row/mod.rs | 63.63% | 12 Missing ⚠️ |
| src/row/arch/neon.rs | 77.41% | 7 Missing ⚠️ |
| src/row/arch/x86_avx512.rs | 86.00% | 7 Missing ⚠️ |
| src/row/arch/x86_avx2.rs | 85.00% | 6 Missing ⚠️ |
| src/row/arch/x86_sse41.rs | 79.31% | 6 Missing ⚠️ |
| src/row/arch/x86_common.rs | 86.20% | 4 Missing ⚠️ |
| src/sinker/mixed.rs | 89.18% | 4 Missing ⚠️ |

@uqio uqio merged commit 9d54c46 into main Apr 26, 2026
66 checks passed
@uqio uqio deleted the feat/ship8-rgba-yuv420 branch April 26, 2026 03:17
@al8n al8n restored the feat/ship8-rgba-yuv420 branch April 26, 2026 03:54
@al8n al8n deleted the feat/ship8-rgba-yuv420 branch April 26, 2026 04:17
uqio added a commit that referenced this pull request Apr 26, 2026
Tranche 2 of Ship 8 sink-side RGBA. Mass-applies the const-generic-ALPHA template proven by PR #16 to the **4:2:0 semi-planar family** — NV12 (UV-ordered, FFmpeg's `nv12`, the most common HW-decoder output across CUDA / NVDEC / VideoToolbox / VAAPI / Android MediaCodec / QSV) and NV21 (VU-ordered, Android camera default).
uqio added a commit that referenced this pull request Apr 26, 2026
#18)

Tranche 3 of Ship 8 sink-side RGBA. **Wiring-only PR** — no new SIMD code. The 4:2:2 formats reuse the 4:2:0 kernels from tranches 1 + 2: `Yuv422p` calls `yuv_420_to_rgba_row` (already shipped in PR #16), and `Nv16` calls `nv12_to_rgba_row` (already shipped in PR #17). 4:2:2's per-row contract is identical to 4:2:0's — half-width chroma, one pair per Y pair — so the same kernel handles both with no changes.
uqio pushed a commit that referenced this pull request Apr 26, 2026
Tranche 4a of Ship 8 sink-side RGBA. Refactors the Yuv444p planar 4:4:4 kernel family across all 6 backends (scalar + NEON + SSE4.1 + AVX2 + AVX-512 + wasm simd128) using the const-generic-ALPHA template established by PR #16 (Yuv420p) and extended in PR #17 (NV12/NV21).