feat(sinker): Ship 8 — Yuv420p RGBA output via const-ALPHA template by uqio · Pull Request #16 · Findit-AI/colconv

uqio · 2026-04-26T02:41:49Z

First PR of Ship 8 (sink-side with_rgba / with_rgba_u16 for every existing source format). Lands the template-validating slice: end-to-end RGBA support for Yuv420p only — public API, all five SIMD backends, MixedSinker integration, per-backend equivalence tests. The remaining ~30 kernel families (NV12, NV21, Yuv422p, Yuv444p, all bit-depth variants, P010 / P016 / P210 / P410, etc.) inherit the same const-generic-ALPHA template via mechanical mass-apply in follow-up PRs — see What's deferred below.

Scope

Sink-side only (the high-value half per docs/color-conversion-functions.md § Ship 8). YUVA source frames + per-pixel alpha pass-through are deferred to a follow-up PR. Default alpha is 0xFF (opaque) for sources that don't carry an alpha plane — every YUV format shipped today.

use colconv::{
    frame::Yuv420pFrame,
    sinker::MixedSinker,
    yuv::{Yuv420p, yuv420p_to},
    ColorMatrix,
};

let frame = Yuv420pFrame::new(&yp, &up, &vp, w, h, w, w / 2, w / 2);
let mut rgba = vec![0u8; (w * h * 4) as usize];
let mut sinker = MixedSinker::<Yuv420p>::new(w as usize, h as usize)
    .with_rgba(&mut rgba)?;

yuv420p_to(&frame, /*full_range=*/ true, ColorMatrix::Bt709, &mut sinker)?;
// Each pixel: rgba[4*i..4*i+3] = R,G,B; rgba[4*i+3] = 0xFF.

Design: const-generic `ALPHA: bool` kernel template

The natural alternatives:

Compose-and-expand — kernel writes 3-byte RGB to scratch, then a second pass expands to 4-byte RGBA with constant alpha. Zero new SIMD code, but ~7W bytes/row of memory traffic vs. native's 4W → roughly 1.5–2× slower for the RGBA-only path.
Fully forked native — kernel forked into RGB and RGBA copies per backend. Best speed, ~5–10× LOC duplication across 30+ kernels.
Const-generic ALPHA (this PR) — kernel body monomorphized on <const ALPHA: bool>; the only delta is the final per-block store (vst3q_u8 ↔ vst4q_u8 on NEON; new write_rgba_* shuffle helpers on x86 / wasm parallel to the existing write_rgb_*). Same speed as native, ~1/5th the LOC. Compiler DCE eliminates the if ALPHA branch and the unused alpha-vector splat at each call site.

All five backends keep R / G / B in separate vectors at store time, so adding alpha is purely a 4th vector + different shuffle/store — no re-interleave.
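The const-generic option can be illustrated with a minimal scalar sketch. This is not the crate's kernel: the signatures are simplified (no range or matrix parameters), the full-range BT.601-style Q8 coefficients are chosen only for illustration, and the function bodies are assumptions. The point it shows is that the RGB and RGBA variants share one body and differ only in the per-pixel store.

```rust
/// Hypothetical scalar kernel shared by both outputs; `ALPHA` selects the store.
/// The full-range BT.601-style Q8 math is illustrative only.
fn yuv_420_to_rgb_or_rgba_row<const ALPHA: bool>(
    y: &[u8],
    u_half: &[u8],
    v_half: &[u8],
    out: &mut [u8],
    width: usize,
) {
    let bpp = if ALPHA { 4 } else { 3 };
    assert!(width % 2 == 0, "this sketch assumes an even row width");
    assert!(y.len() >= width && u_half.len() >= width / 2 && v_half.len() >= width / 2);
    assert!(out.len() >= bpp * width);

    for x in 0..width {
        let luma = y[x] as i32;
        // One chroma pair is shared by two horizontally adjacent luma samples.
        let d = u_half[x / 2] as i32 - 128;
        let e = v_half[x / 2] as i32 - 128;
        let r = luma + ((359 * e + 128) >> 8);
        let g = luma - ((88 * d + 183 * e + 128) >> 8);
        let b = luma + ((454 * d + 128) >> 8);

        let px = &mut out[bpp * x..bpp * (x + 1)];
        px[0] = r.clamp(0, 255) as u8;
        px[1] = g.clamp(0, 255) as u8;
        px[2] = b.clamp(0, 255) as u8;
        // `ALPHA` is a compile-time constant, so this branch is removed
        // entirely from the `<false>` instantiation.
        if ALPHA {
            px[3] = 0xFF;
        }
    }
}

/// Thin wrapper: 3 bytes per pixel.
pub fn yuv_420_to_rgb_row(y: &[u8], u: &[u8], v: &[u8], rgb_out: &mut [u8], width: usize) {
    yuv_420_to_rgb_or_rgba_row::<false>(y, u, v, rgb_out, width)
}

/// Thin wrapper: 4 bytes per pixel with opaque alpha.
pub fn yuv_420_to_rgba_row(y: &[u8], u: &[u8], v: &[u8], rgba_out: &mut [u8], width: usize) {
    yuv_420_to_rgb_or_rgba_row::<true>(y, u, v, rgba_out, width)
}
```

Calling both wrappers on the same row should yield identical R, G, B bytes, with 0xFF in every fourth byte of the RGBA output.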

What's in this PR

Public API (`colconv::sinker`)

MixedSinker<Yuv420p>::with_rgba(&mut [u8]) and set_rgba — attaches a packed 32-bit RGBA buffer (width × height × 4 bytes). Format-specific impl block (not on the generic MixedSinker<F>), so attaching RGBA to a sink whose process doesn't write it is a compile error rather than a silent stale-buffer bug. Future formats add their own impl blocks as RGBA support lands; a compile_fail doctest pins this property.
MixedSinker::produces_rgba() / produces_rgba_u16() — generic queries (always false for sinks without a write path; symmetrical with the existing produces_rgb_u16()).
MixedSinkerError::RgbaBufferTooShort { expected, actual } and RgbaU16BufferTooShort — new error variants.
row::yuv_420_to_rgba_row(...) — public dispatcher mirroring yuv_420_to_rgb_row(...). Same contract; rgba_out.len() >= 4 * width.
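The format-specific impl block and the length check can be sketched as follows. `Sink`, `SinkError`, and the marker structs are hypothetical stand-ins rather than the crate's real types; only the variant name `RgbaBufferTooShort { expected, actual }` and the `width × height × 4` rule come from the description above.

```rust
use std::marker::PhantomData;

// Hypothetical stand-ins for the crate's source-format markers.
pub struct Yuv420p;
pub struct Nv12;

#[derive(Debug, PartialEq, Eq)]
pub enum SinkError {
    RgbaBufferTooShort { expected: usize, actual: usize },
}

/// Hypothetical sink, generic over the source format `F`.
pub struct Sink<'a, F> {
    width: usize,
    height: usize,
    rgba: Option<&'a mut [u8]>,
    _fmt: PhantomData<F>,
}

impl<'a, F> Sink<'a, F> {
    pub fn new(width: usize, height: usize) -> Self {
        Sink { width, height, rgba: None, _fmt: PhantomData }
    }

    /// Generic query: false for any sink that has no RGBA buffer attached.
    pub fn produces_rgba(&self) -> bool {
        self.rgba.is_some()
    }
}

// `with_rgba` exists only on the Yuv420p sink, so
// `Sink::<Nv12>::new(w, h).with_rgba(..)` fails to compile.
impl<'a> Sink<'a, Yuv420p> {
    pub fn with_rgba(mut self, buf: &'a mut [u8]) -> Result<Self, SinkError> {
        let expected = self.width * self.height * 4;
        if buf.len() < expected {
            return Err(SinkError::RgbaBufferTooShort { expected, actual: buf.len() });
        }
        self.rgba = Some(buf);
        Ok(self)
    }
}
```

Under these assumptions a 4 × 2 sink needs at least 32 bytes, and a 31-byte buffer is rejected when `with_rgba` is called rather than later in `process`.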

Per-backend kernels (`src/row/`)

| File | What's added |
| --- | --- |
| `scalar.rs` | `yuv_420_to_rgba_row` + shared `yuv_420_to_rgb_or_rgba_row::<const ALPHA: bool>` template |
| `arch/x86_common.rs` | `write_rgba_16` (4 shuffle masks × 4 stores per 16 pixels = 64 bytes) |
| `arch/x86_sse41.rs` | `yuv_420_to_rgba_row` + shared template; uses `write_rgba_16` |
| `arch/x86_avx2.rs` | `yuv_420_to_rgba_row` + `write_rgba_32` (two halves through `write_rgba_16`) |
| `arch/x86_avx512.rs` | `yuv_420_to_rgba_row` + `write_rgba_64` (four quarters through `write_rgba_16`) |
| `arch/neon.rs` | `yuv_420_to_rgba_row` + shared template; uses native `vst4q_u8` |
| `arch/wasm_simd128.rs` | `yuv_420_to_rgba_row` + new `write_rgba_16` (4 swizzle masks × 4 stores) |
| `mod.rs` | `rgba_row_bytes(width)` helper + public `yuv_420_to_rgba_row` dispatcher |

The 4-byte stride aligns cleanly with the 16-byte register width on every backend, so the RGBA shuffle masks are simpler than the 3-byte RGB pattern (no channel "split across blocks" at the 16-byte boundary). NEON's vst4q_u8 is a native 4-channel store — zero shuffle overhead vs. the 3-channel vst3q_u8.
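The alignment argument can be checked with a portable model of the shuffle-and-store step. This is a sketch of one plausible mask arrangement, not the PR's `write_rgba_16`: `shuffle16` imitates a pshufb-style byte shuffle in plain Rust, and `write_rgba_16_model` is a hypothetical name. Each 16-byte output block holds exactly four whole pixels, so no pixel straddles a block boundary.

```rust
/// Portable model of a pshufb-style shuffle: `out[j] = src[mask[j]]`,
/// or 0 when the mask byte has its high bit set.
fn shuffle16(src: &[u8; 16], mask: &[u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for j in 0..16 {
        if (mask[j] & 0x80) == 0 {
            out[j] = src[(mask[j] & 0x0F) as usize];
        }
    }
    out
}

/// Hypothetical model of an RGBA store for 16 pixels (64 output bytes).
fn write_rgba_16_model(
    r: &[u8; 16],
    g: &[u8; 16],
    b: &[u8; 16],
    a: &[u8; 16],
    out: &mut [u8],
) {
    assert!(out.len() >= 64);
    let chans = [r, g, b, a];
    for block in 0..4 {
        // Output block `block` covers pixels 4*block .. 4*block + 4.
        let mut acc = [0u8; 16];
        for c in 0..4 {
            // Channel `c` lands at byte offsets c, c + 4, c + 8, c + 12.
            let mut mask = [0x80u8; 16];
            for p in 0..4 {
                mask[4 * p + c] = (4 * block + p) as u8;
            }
            let part = shuffle16(chans[c], &mask);
            for j in 0..16 {
                acc[j] |= part[j];
            }
        }
        // One 16-byte store per block, four stores in total.
        out[16 * block..16 * block + 16].copy_from_slice(&acc);
    }
}
```

With a 3-byte stride the same approach would need masks in which a pixel's channels fall on both sides of a 16-byte boundary, which is the complication the RGBA path avoids.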

MixedSinker integration

MixedSinker<Yuv420p>::process runs the RGBA path as an independent kernel call (not compose). Because the shared template is monomorphized on ALPHA, yuv_420_to_rgb_or_rgba_row::<true> (behind yuv_420_to_rgba_row) and yuv_420_to_rgb_or_rgba_row::<false> (behind yuv_420_to_rgb_row) are separate functions sharing one source body; both are fully inlined at their respective call sites with the unused branch eliminated.

SIMD coverage

| Kernel | NEON | SSE4.1 | AVX2 | AVX-512 | wasm simd128 |
| --- | --- | --- | --- | --- | --- |
| `yuv_420_to_rgba_row` | ✅ | ✅ | ✅ | ✅ | ✅ |

Block sizes per iteration mirror the existing RGB kernels: NEON / SSE4.1 / wasm = 16 px; AVX2 = 32 px; AVX-512 = 64 px.
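How a fixed block size interacts with an arbitrary row width can be sketched with a small driver. `process_row` and its callback parameters are hypothetical; the sketch only assumes that full blocks go to a SIMD kernel and the leftover pixels go to a scalar tail, which is the usual shape of such loops.

```rust
/// Hypothetical row driver: call `simd_block(x)` for every full `BLOCK`-pixel
/// chunk starting at pixel `x`, then `scalar_tail(x, len)` for the remainder.
fn process_row<const BLOCK: usize>(
    width: usize,
    mut simd_block: impl FnMut(usize),
    mut scalar_tail: impl FnMut(usize, usize),
) {
    let mut x = 0;
    while x + BLOCK <= width {
        simd_block(x);
        x += BLOCK;
    }
    if x < width {
        // BLOCK is even for all listed backends, so the tail starts on an
        // even pixel and chroma pairs stay aligned.
        scalar_tail(x, width - x);
    }
}
```

For a width of 1922, the 16-, 32-, and 64-pixel block sizes all cover the first 1920 pixels and leave a 2-pixel tail, so a single width of that kind exercises both the block loop and the tail on every backend.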

Tests

Format-level (5 tests in src/sinker/mixed.rs):

rgba_only_converts_gray_to_gray_with_opaque_alpha — gray YUV → gray RGB + alpha = 0xFF.
rgba_alpha_is_opaque_for_arbitrary_color — alpha is 0xFF for non-gray content.
with_rgb_and_with_rgba_produce_byte_identical_rgb_bytes — cross-format invariant: alpha is the only difference between with_rgb and with_rgba outputs.
rgba_with_simd_false_matches_with_simd_true — SIMD ≡ scalar parity across 8 widths covering all backend block sizes (16 / 32 / 64) plus tails.
rgba_buffer_too_short_returns_err — short buffer rejected at with_rgba call.
yuv_420_to_rgba_simd_matches_scalar_with_random_yuv — pseudo-random per-pixel YUV (width 1922, all 4 matrices × both ranges) catches lane-order corruption that solid-color tests would miss.

Per-backend equivalence (18 tests in src/row/arch/*.rs):

4 tests per backend (NEON, SSE4.1, AVX2, AVX-512) + 2 tests for wasm = 18 total. Each calls its backend's unsafe yuv_420_to_rgba_row directly under runtime feature detection, comparing against scalar::yuv_420_to_rgba_row byte-for-byte. Bypasses the dispatcher so every backend gets exercised on every CI runner regardless of which tier the dispatcher would pick.
Mismatch diagnostics report (byte, pixel, channel R/G/B/A, width, matrix, full_range, scalar_value, simd_value) — first divergence is locally diagnosable.
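A byte-for-byte comparison with localized diagnostics might look like the sketch below. The helper names and the `Mismatch` struct are hypothetical, and the LCG is only a dependency-free way to get per-pixel variation; the PR's tests additionally report width, matrix, and range, which are omitted here.

```rust
/// Fill `buf` with pseudo-random bytes from a small LCG
/// (Numerical Recipes constants), so no external crate is needed.
fn fill_pseudo_random(buf: &mut [u8], seed: u32) {
    let mut s = seed;
    for b in buf {
        s = s.wrapping_mul(1664525).wrapping_add(1013904223);
        *b = (s >> 24) as u8;
    }
}

/// Hypothetical diagnostic record for the first diverging RGBA byte.
#[derive(Debug, PartialEq, Eq)]
struct Mismatch {
    byte: usize,
    pixel: usize,
    channel: char,
    expected: u8,
    actual: u8,
}

/// Compare two packed RGBA buffers and describe the first difference, if any.
fn first_mismatch(expected: &[u8], actual: &[u8]) -> Option<Mismatch> {
    assert_eq!(expected.len(), actual.len());
    expected
        .iter()
        .zip(actual)
        .position(|(e, a)| e != a)
        .map(|byte| Mismatch {
            byte,
            pixel: byte / 4,
            channel: ['R', 'G', 'B', 'A'][byte % 4],
            expected: expected[byte],
            actual: actual[byte],
        })
}
```

In a real equivalence test the two buffers would be the scalar and SIMD outputs for the same pseudo-random input row, and a `Some(..)` result would be turned into a panic message.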

Compile-fail doctest in MixedSinker<Yuv420p>::with_rgba proves attaching RGBA to a non-Yuv420p sink (e.g. MixedSinker::<Nv12>::new(...).with_rgba(...)) is a compile error.

Local results (aarch64 macOS): 429 lib tests + 1 doctest pass.

CI matrix coverage (already wired in .github/workflows/):

test-sde-avx512 (Intel SDE Ice Lake) — AVX-512 path.
test jobs on AMD EPYC ubuntu-latest / macOS aarch64 — AVX2 / NEON.
coverage.yml per-tier matrix with --cfg colconv_disable_avx512 / --cfg colconv_disable_avx2 / --cfg colconv_force_scalar — exercises SSE4.1 + scalar fallback paths.
cross job — wasm32-wasip1 test suite exercises the wasm RGBA tests.
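The per-tier coverage relies on the dispatcher honoring the `--cfg` flags listed above. A minimal sketch of such a dispatcher follows; `Backend` and `pick_backend` are hypothetical names, the tier order is an assumption, and `avx512bw` is only a guess at which AVX-512 feature the real dispatcher tests.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Scalar,
    Neon,
    Sse41,
    Avx2,
    Avx512,
    WasmSimd128,
}

/// Hypothetical tier selection. The `colconv_*` cfg names are the ones passed
/// by the CI jobs; everything else here is an assumption.
#[allow(unexpected_cfgs, unreachable_code)]
pub fn pick_backend() -> Backend {
    if cfg!(colconv_force_scalar) {
        return Backend::Scalar;
    }

    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // Highest tier first; each disable flag drops the dispatcher one tier.
        if !cfg!(colconv_disable_avx512) && std::is_x86_feature_detected!("avx512bw") {
            return Backend::Avx512;
        }
        if !cfg!(colconv_disable_avx2) && std::is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
        if std::is_x86_feature_detected!("sse4.1") {
            return Backend::Sse41;
        }
    }

    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Backend::Neon;
        }
    }

    #[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
    {
        return Backend::WasmSimd128;
    }

    Backend::Scalar
}
```

Building with `RUSTFLAGS="--cfg colconv_disable_avx512"` on an AVX-512 machine would then make this sketch fall through to the AVX2 tier, which is how a single runner can exercise lower tiers.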

What's deferred

Out of scope for this PR (follow-up work):

Mass-apply the const-ALPHA template to the remaining ~30 kernel families. The template is mechanical; each format ships in its own bite-sized PR. Order roughly: 4:2:0 family (NV12 / NV21) → 4:2:2 (Yuv422p / NV16) → 4:4:4 (Yuv444p / NV24 / NV42) → high-bit-depth (Yuv420p9/10/12/14/16, P010/P012/P016, Yuv422p_n, Yuv444p_n, P210/P212/P216, P410/P412/P416, Bayer / Bayer16).
with_rgba_u16 — native-depth u16 RGBA output for high-bit-depth sources. Error variant + format-specific impl pattern is already in place; just needs wiring per format.
YUVA source frame types (Yuv420pAFrame etc.) — when a source carries an alpha plane, the kernel copies it through to the RGBA buffer instead of writing the opaque default. Doc roadmap calls this the "source-side half" of Ship 8.
Bench yuv_420_to_rgba vs. yuv_420_to_rgb — confirm the const-ALPHA path doesn't regress the RGB baseline and measure the RGBA throughput delta. Mechanically straightforward; deferred to keep this PR focused.
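The deferred alpha behavior can be sketched in scalar form. Both functions are hypothetical: `store_rgba_row` shows pass-through of a source alpha plane versus the opaque default, and `opaque_alpha_u16` computes the `(1 << BITS) - 1` value that the `RgbaU16BufferTooShort` docs describe as the default for `u16` output.

```rust
/// Hypothetical store step. `src_alpha` is `Some` for a YUVA-style source
/// and `None` for sources without an alpha plane.
fn store_rgba_row(rgb: &[[u8; 3]], src_alpha: Option<&[u8]>, out: &mut [u8]) {
    assert!(out.len() >= 4 * rgb.len());
    if let Some(a) = src_alpha {
        assert!(a.len() >= rgb.len());
    }
    for (i, px) in rgb.iter().enumerate() {
        out[4 * i..4 * i + 3].copy_from_slice(px);
        // Copy the source alpha through, or fall back to opaque.
        out[4 * i + 3] = src_alpha.map_or(0xFF, |a| a[i]);
    }
}

/// Opaque alpha for a `bits`-deep sample stored in a `u16` (valid for 1..=16).
const fn opaque_alpha_u16(bits: u32) -> u16 {
    ((1u32 << bits) - 1) as u16
}
```

For a 10-bit source the opaque value is 1023, and for 16-bit it is 65535.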

Test plan

CI green on test, test-sde-avx512, cross, coverage, clippy, build jobs.
Manually verify a Yuv420p → RGBA pipeline end-to-end with a real frame (gray patch, full red patch).
Confirm the compile_fail doctest still rejects MixedSinker::<Nv12>::new(...).with_rgba(...).
cargo doc --lib --no-deps clean (no new doc warnings vs. main).

🤖 Generated with Claude Code

Copilot

Pull request overview

Adds end-to-end 8-bit RGBA output for the Yuv420p pipeline by extending MixedSinker with an RGBA buffer path and introducing a const-generic <const ALPHA: bool> kernel template that’s implemented across scalar + all SIMD backends.

Changes:

Add MixedSinker<Yuv420p>::with_rgba / set_rgba, new MixedSinkerError variants, and a process() RGBA kernel path.
Add public row::yuv_420_to_rgba_row dispatcher plus scalar + SIMD backend implementations using the shared const-generic template.
Add format-level and per-backend scalar-equivalence tests for RGBA output.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.


| File | Description |
| --- | --- |
| `src/sinker/mixed.rs` | Adds RGBA buffer plumbing for `MixedSinker<Yuv420p>`, new error variants, and RGBA-specific tests. |
| `src/row/mod.rs` | Exposes `yuv_420_to_rgba_row` dispatcher and adds `rgba_row_bytes()` helper. |
| `src/row/scalar.rs` | Introduces `yuv_420_to_rgba_row` and shared scalar `<const ALPHA: bool>` template. |
| `src/row/arch/x86_common.rs` | Adds `write_rgba_16` helper for x86 shuffle-based RGBA stores. |
| `src/row/arch/x86_sse41.rs` | Implements SSE4.1 RGBA row kernel and equivalence tests using `write_rgba_16`. |
| `src/row/arch/x86_avx2.rs` | Implements AVX2 RGBA row kernel + `write_rgba_32` and equivalence tests. |
| `src/row/arch/x86_avx512.rs` | Implements AVX-512 RGBA row kernel + `write_rgba_64` and equivalence tests. |
| `src/row/arch/neon.rs` | Implements NEON RGBA row kernel via `vst4q_u8` and equivalence tests. |
| `src/row/arch/wasm_simd128.rs` | Implements wasm simd128 RGBA row kernel + `write_rgba_16` and equivalence tests. |


Copilot · 2026-04-26T02:54:18Z

+  /// `u16` RGBA buffer attached via [`MixedSinker::with_rgba_u16`] /
+  /// [`MixedSinker::set_rgba_u16`] is shorter than `width × height × 4`
+  /// `u16` elements. Only high‑bit‑depth source impls write into this
+  /// buffer; the fourth `u16` per pixel is alpha — opaque
+  /// (`(1 << BITS) - 1`) by default when the source has no alpha
+  /// plane.


The intra-doc links in this variant’s docs point to MixedSinker::with_rgba_u16 / set_rgba_u16, but those methods don’t exist anywhere in the crate yet. This will trigger rustdoc::broken_intra_doc_links warnings (and may fail CI if docs are built with warnings-as-errors). Consider removing the links for now (leave them as code-formatted text), or introducing the corresponding APIs in the appropriate format-specific impl blocks before linking to them.

codecov · 2026-04-26T02:54:45Z

Codecov Report

❌ Patch coverage is 83.02583% with 46 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/row/mod.rs | 63.63% | 12 Missing ⚠️ |
| src/row/arch/neon.rs | 77.41% | 7 Missing ⚠️ |
| src/row/arch/x86_avx512.rs | 86.00% | 7 Missing ⚠️ |
| src/row/arch/x86_avx2.rs | 85.00% | 6 Missing ⚠️ |
| src/row/arch/x86_sse41.rs | 79.31% | 6 Missing ⚠️ |
| src/row/arch/x86_common.rs | 86.20% | 4 Missing ⚠️ |
| src/sinker/mixed.rs | 89.18% | 4 Missing ⚠️ |


Tranche 2 of Ship 8 sink-side RGBA. Mass-applies the const-generic-ALPHA template proven by PR #16 to the **4:2:0 semi-planar family** — NV12 (UV-ordered, FFmpeg's `nv12`, the most common HW-decoder output across CUDA / NVDEC / VideoToolbox / VAAPI / Android MediaCodec / QSV) and NV21 (VU-ordered, Android camera default).

#18) Tranche 3 of Ship 8 sink-side RGBA. **Wiring-only PR** — no new SIMD code. The 4:2:2 formats reuse the 4:2:0 kernels from tranches 1 + 2: `Yuv422p` calls `yuv_420_to_rgba_row` (already shipped in PR #16), and `Nv16` calls `nv12_to_rgba_row` (already shipped in PR #17). 4:2:2's per-row contract is identical to 4:2:0's — half-width chroma, one pair per Y pair — so the same kernel handles both with no changes.
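Why a 4:2:0 row kernel can serve 4:2:2 unchanged can be sketched with a frame walker. `ChromaLayout`, `chroma_row_for`, and `walk_frame` are hypothetical names; the sketch assumes the only difference between the two layouts is which chroma row is paired with each luma row, while the per-row inputs (full-width luma, half-width chroma) are the same.

```rust
/// Hypothetical layout tag: both layouts have half-width chroma rows.
#[derive(Clone, Copy)]
enum ChromaLayout {
    Yuv420, // chroma is also vertically subsampled by 2
    Yuv422, // chroma has full vertical resolution
}

/// Which chroma row feeds luma row `y_row`.
fn chroma_row_for(layout: ChromaLayout, y_row: usize) -> usize {
    match layout {
        ChromaLayout::Yuv420 => y_row / 2,
        ChromaLayout::Yuv422 => y_row,
    }
}

/// Walk a frame, handing (luma row, chroma row) pairs to the same row kernel.
fn walk_frame(layout: ChromaLayout, height: usize, mut row_kernel: impl FnMut(usize, usize)) {
    for y in 0..height {
        row_kernel(y, chroma_row_for(layout, y));
    }
}
```

In this model the row kernel never learns which layout it is serving, which matches the wiring-only nature of the change described above.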

Tranche 4a of Ship 8 sink-side RGBA. Refactors the Yuv444p planar 4:4:4 kernel family across all 6 backends (scalar + NEON + SSE4.1 + AVX2 + AVX-512 + wasm simd128) using the const-generic-ALPHA template established by PR #16 (Yuv420p) and extended in PR #17 (NV12/NV21).

uqio added 4 commits April 26, 2026 13:46

update

9bd491b

update

aa62bff

update

36fee26

update

c6b0526

al8n changed the title ~~Feat/ship8 rgba yuv420~~ feat(sinker): Ship 8 — Yuv420p RGBA output via const-ALPHA template Apr 26, 2026

al8n requested a review from Copilot April 26, 2026 02:46

Copilot started reviewing on behalf of al8n April 26, 2026 02:50

Copilot AI reviewed Apr 26, 2026


update

be02452

uqio merged commit 9d54c46 into main Apr 26, 2026
66 checks passed

uqio deleted the feat/ship8-rgba-yuv420 branch April 26, 2026 03:17

al8n restored the feat/ship8-rgba-yuv420 branch April 26, 2026 03:54

al8n mentioned this pull request Apr 26, 2026

feat(sinker): Ship 8 — NV12 / NV21 RGBA via const-ALPHA template #17

Merged

4 tasks

al8n deleted the feat/ship8-rgba-yuv420 branch April 26, 2026 04:17

al8n mentioned this pull request Apr 26, 2026

feat(sinker): Ship 8 — Yuv422p / NV16 RGBA (kernel reuse, no new SIMD) #18

Merged

3 tasks

al8n mentioned this pull request Apr 26, 2026

feat(sinker): Ship 8 — Yuv444p RGBA via const-ALPHA template #19

Merged

4 tasks

This was referenced Apr 26, 2026

feat(sinker): Ship 8 — Nv24/Nv42 RGBA + Strategy A RGB→RGBA fan-out #20

Merged

feat(sinker): Ship 8 — Yuv440p RGBA wiring (reuses Yuv444p kernels) #22

Merged

feat(row): Ship 8 — high-bit 4:2:0 RGBA scalar (SIMD lands in 5a/5b) #24

Merged

uqio merged 5 commits into main from feat/ship8-rgba-yuv420

uqio commented Apr 26, 2026 • edited by al8n


2 participants
