126 changes: 126 additions & 0 deletions CHANGELOG.md
@@ -154,6 +154,132 @@ scheduled as a dedicated follow-up PR (`feat/bayer-simd`).
end-to-end "all three channels at MAX_COEFFICIENT, all pixels
255" stays inside the `u32` accumulator and clamps to 255.

## Ship 8b — source-side YUVA (alpha-preserving RGBA output)

The follow-up to Ship 8: source-side alpha. Where Ship 8 padded the
output alpha lane to `0xFF` / `(1 << BITS) - 1` regardless of source,
Ship 8b adds **YUVA source types** that carry an alpha plane through
to the RGBA output. The first vertical slice ships `Yuva444p10`
(ProRes 4444 + α territory — the highest-value VFX format from the
Format Share table § 2a-1 row 10).

### Strategy B (forked kernels) over Strategy A (separate splice)

Two implementation strategies were considered:

- **Strategy A** (deferred) — run the existing RGBA kernel (alpha =
opaque), then a second-pass helper reads source alpha + overwrites
the alpha byte. Memory traffic 6W per pixel; ~50 LOC + 1 helper.
- **Strategy B** (adopted) — extend each kernel's const-`ALPHA`
template with a third `ALPHA_SRC: bool` generic. Source-alpha is
loaded inside the kernel, masked, and stored straight into the
alpha lane in the same pass. Memory traffic 5W per pixel (single
  pass); ~3,000 LOC across 30+ kernels for a ~10% perf win (within
  L1-cache-noise territory) in the alpha-present case; see the
  traffic accounting after this list.
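
One way to read the 6W / 5W figures (word-sized memory touches per
pixel; this breakdown is inferred from the numbers above, not spelled
out in the PR):

```text
Strategy A:  pass 1: read Y,U,V (3W) + write RGBA (1W)      = 4W
             pass 2: read A (1W) + rewrite alpha lane (1W)  = 2W   → 6W/pixel
Strategy B:  one pass: read Y,U,V,A (4W) + write RGBA (1W)         → 5W/pixel
```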

Strategy B was picked for best alpha-present throughput on the
high-bandwidth 4:4:4 + α format that motivated the work. Existing
`*_to_rgb_*` and `*_to_rgba_*` public wrappers are backward-compat
shims passing `ALPHA_SRC = false` and `None` to the templates — zero
overhead when alpha-source is off; existing call sites compile
unchanged.

### Vertical slice 1: `Yuva444p10` (3 PRs)

The first format follows the same staging pattern as Ship 8 high-bit
tranches (5/6/7): scalar prep first (call-site stable), then u8 SIMD,
then u16 SIMD.

| # | Tranche | Status |
|---|---|---|
| 1 | scalar prep + Frame + walker + dispatchers + sinker integration | ✅ shipped (PR #32) — `Yuva444pFrame16<BITS=10>`, `Yuva444p10Frame` alias, `yuva444p10_to` walker, `MixedSinker<Yuva444p10>`, scalar tests |
| 1b | u8 RGBA SIMD across all 5 backends | ✅ shipped (PR #33) |
| 1c | u16 RGBA SIMD across all 5 backends | ✅ shipped (PR #34) |

### Surface added

- **`Yuva444pFrame16<'a, const BITS: u32>`** — mirrors `Yuv444pFrame16`
with an extra `a` slice + `a_stride`. Const-asserted `BITS == 10`
in this slice; other bit depths land in subsequent vertical slices.
`try_new` validates dimensions + plane lengths; `try_new_checked`
additionally validates every active sample range.
- **`Yuva444p10Frame<'a>`** type alias.
- **`Yuva444p10`** marker + `Yuva444p10Row<'a>` (carries `a` slice)
+ `Yuva444p10Sink` trait + `yuva444p10_to` walker.
- **`MixedSinker<Yuva444p10>`** with `with_rgba` / `set_rgba` (u8) +
`with_rgba_u16` / `set_rgba_u16` (u16) per-format builders, plus
`with_rgb` / `with_rgb_u16` / `with_luma` / `with_hsv` alpha-drop
paths that reuse the `Yuv444p10` row dispatchers verbatim.
- **Public dispatchers** in `colconv::row`: `yuva444p10_to_rgba_row`
and `yuva444p10_to_rgba_u16_row` — same SIMD-via-`use_simd` shape
  as `yuv444p10_to_rgba_*` (usage sketch below).
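
A hypothetical end-to-end sketch of this surface. Only the type and
function names come from the list above; the `new()` constructor,
argument order, `ColorMatrix::Bt709`, and the boxed error type are
assumptions:

```rust
// Hypothetical sketch: signatures are assumptions, not the crate's API.
fn decode_yuva444p10(
    y: &[u16], u: &[u16], v: &[u16], a: &[u16],
    stride: usize, width: usize, height: usize,
    rgba_out: &mut [u16],
) -> Result<(), Box<dyn std::error::Error>> {
    // try_new validates dimensions + plane lengths; try_new_checked
    // would additionally range-check every active sample.
    let frame = Yuva444p10Frame::try_new(
        width, height, y, stride, u, stride, v, stride, a, stride,
    )?;
    // u16 RGBA at native 10-bit depth; alpha flows through from `a`.
    let mut sink = MixedSinker::<Yuva444p10>::new()
        .with_rgba_u16(rgba_out, width * 4);
    yuva444p10_to(&frame, &mut sink, ColorMatrix::Bt709, false)?;
    Ok(())
}
```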

### Strategy B template extension

The four 4:4:4 const-`ALPHA` templates gained the third
`ALPHA_SRC: bool` generic in this slice (only the BITS-generic
planar variant is in scope; other 4:4:4 variants land later):

- `scalar::yuv_444p_n_to_rgb_or_rgba_row<BITS, ALPHA, ALPHA_SRC>` (u8)
- `scalar::yuv_444p_n_to_rgb_or_rgba_u16_row<BITS, ALPHA, ALPHA_SRC>` (u16)
- Same SIMD templates × 5 backends (NEON / SSE4.1 / AVX2 / AVX-512 /
wasm simd128) — refactor in PRs #33 (u8) and #34 (u16).

The per-pixel store branches on the three legal combinations (the
fourth, `ALPHA = false` with `ALPHA_SRC = true`, is rejected at
compile time):

| `ALPHA` | `ALPHA_SRC` | Per-pixel alpha |
|---|---|---|
| false | false | RGB-only (no alpha lane) |
| true | false | RGBA, alpha = `0xFF` u8 / `(1 << BITS) - 1` u16 (existing path) |
| true | true | RGBA, alpha = `(a_src[x] & bits_mask::<BITS>())` from source plane; depth-converted via `>> (BITS - 8)` for u8 output, native depth for u16 output |

`!ALPHA_SRC || ALPHA` is const-asserted at the top of every template.
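
A scalar sketch of the u8-output branch, assuming an already-computed
`(r, g, b)` triple and inlining `bits_mask::<BITS>()` as
`(1 << BITS) - 1`; the real kernels also thread matrix and range
parameters:

```rust
fn store_px<const BITS: u32, const ALPHA: bool, const ALPHA_SRC: bool>(
    r: u8, g: u8, b: u8,
    a_src: Option<&[u16]>, x: usize, out: &mut [u8],
) {
    // Source alpha needs an alpha lane to land in.
    const { assert!(!ALPHA_SRC || ALPHA) };
    if ALPHA {
        let a = if ALPHA_SRC {
            // Mask to the declared bit depth first, then depth-convert
            // BITS -> 8 for u8 output (the u16 path skips the shift).
            let masked = a_src.unwrap()[x] & ((1u16 << BITS) - 1);
            (masked >> (BITS - 8)) as u8
        } else {
            0xFF // opaque constant: the existing Ship 8 path
        };
        out[x * 4..x * 4 + 4].copy_from_slice(&[r, g, b, a]);
    } else {
        out[x * 3..x * 3 + 3].copy_from_slice(&[r, g, b]);
    }
}
```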

### Hardenings (Codex review fixes)

- **Source alpha is masked with `bits_mask::<BITS>()` before depth
conversion** — `Yuva444p10Frame::try_new` accepts unchecked u16
samples; without masking an overrange `1024` at BITS=10 would shift
to `256` and cast to u8 zero, silently turning over-range alpha
into transparent output. Same masking pattern that Y/U/V already
  use. Pinned by 2 regression tests at the sinker layer; a worked
  instance follows this list.
- **`MixedSinker<Yuva444p10>` wires alpha-drop paths** for `with_rgb`
/ `with_rgb_u16` / `with_luma` / `with_hsv` (declared on the
generic `MixedSinker<F>` impl) — initial implementation only wrote
RGBA buffers, leaving the others as silent stale-buffer bugs.
Pinned by 4 cross-format byte-equivalence tests against
`MixedSinker<Yuv444p10>`.
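
A worked instance of the masking convention at `BITS = 10` (a minimal
sketch; the exact failure mode the regression tests pin is the one
described in the first bullet):

```rust
// try_new accepts unchecked samples, so 1024 (one past the 10-bit
// max of 1023) can reach the kernels.
let a_over = 1024u16;
let mask = (1u16 << 10) - 1; // bits_mask::<10>() == 0x03FF
// Masking before any depth conversion wraps the sample into range,
// matching the convention Y/U/V already use, so scalar and SIMD
// paths agree and u16 output never carries an over-range alpha.
assert_eq!(a_over & mask, 0);
assert_eq!(((a_over & mask) >> (10 - 8)) as u8, 0); // u8 output path
```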

### Tests

- **Per-backend SIMD equivalence tests**: 30 per backend × 5 backends
for `Yuva444p10` (5 u8 added in PR #33 + 5 u16 added in PR #34).
Solid-alpha + random-alpha + tail-width coverage. All x86 tests
  carry `is_x86_feature_detected!` early-return guards (sketched
  after this list).
- **Sinker integration tests**: 17 (PR #32 added 7 covering alpha
  pass-through / opacity contracts / buffer-too-short error paths;
  the PR #32 review fixes added 7 more covering alpha-drop paths +
  the Strategy A combine, and 2 covering overrange-alpha masking).
- **Test count growth**: 578 → 588 on aarch64-darwin host (583 after
PR #33, 588 after PR #34); +5 NEON tests run at each tranche; the
+20 x86/wasm tests fire on their respective CI runners.
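
A sketch of the early-return guard shape those x86 tests use (test
name and body are placeholders):

```rust
#[cfg(target_arch = "x86_64")]
#[test]
fn avx2_yuva444p10_rgba_matches_scalar() {
    if !is_x86_feature_detected!("avx2") {
        return; // runner without AVX2: skip instead of failing
    }
    // Feature confirmed: safe to invoke the AVX2 kernel (an unsafe
    // fn) and compare its output byte-for-byte against the scalar
    // reference row.
}
```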

### Notes

- **Source-side YUVA + Ship 8 sinks are now end-to-end for the format**:
with `Yuva444p10Frame` source and `MixedSinker<Yuva444p10>` sink,
the alpha plane flows through to `with_rgba` / `with_rgba_u16`
output. `with_rgb` / `with_rgb_u16` / `with_luma` / `with_hsv`
are alpha-drop (reuse `Yuv444p10` row kernels).
- **Subsequent vertical slices (Ship 8b‑2 onward)** will mass-apply
the established Strategy B template to other Yuva format families:
`Yuva420p*` (4:2:0 with α — `yuva420p`, `yuva420p9/10/16`),
`Yuva422p*` (4:2:2 with α — `yuva422p`, `yuva422p9/10/16`), and
the remaining `Yuva444p*` variants (8-bit, 9-bit, 16-bit). The
template's third generic + per-backend wrapper pattern is now
proven; subsequent slices reuse it mechanically.

## Ship 8 — alpha + RGBA output (`with_rgba` / `with_rgba_u16`)

Adds packed RGBA output across the YUV format inventory. Every YUV
110 changes: 97 additions & 13 deletions src/row/arch/neon.rs
@@ -983,7 +983,8 @@ pub(crate) unsafe fn yuv_444p_n_to_rgb_or_rgba_row<
/// NEON YUV 4:4:4 planar high-bit-depth → **native-depth u16** RGB.
/// Const-generic over `BITS ∈ {9, 10, 12, 14}`.
///
/// Thin wrapper over [`yuv_444p_n_to_rgb_or_rgba_u16_row`] with `ALPHA = false`.
/// Thin wrapper over [`yuv_444p_n_to_rgb_or_rgba_u16_row`] with
/// `ALPHA = false, ALPHA_SRC = false`.
///
/// # Safety
///
@@ -1001,15 +1002,18 @@ pub(crate) unsafe fn yuv_444p_n_to_rgb_u16_row<const BITS: u32>(
) {
// SAFETY: caller obligations forwarded to the shared impl.
unsafe {
yuv_444p_n_to_rgb_or_rgba_u16_row::<BITS, false>(y, u, v, rgb_out, width, matrix, full_range);
yuv_444p_n_to_rgb_or_rgba_u16_row::<BITS, false, false>(
y, u, v, rgb_out, width, matrix, full_range, None,
);
}
}

/// NEON sibling of [`yuv_444p_n_to_rgba_row`] for native-depth `u16`
/// output. Alpha samples are `(1 << BITS) - 1` (opaque maximum at the
/// input bit depth) — matches `scalar::yuv_444p_n_to_rgba_u16_row`.
///
/// Thin wrapper over [`yuv_444p_n_to_rgb_or_rgba_u16_row`] with `ALPHA = true`.
/// Thin wrapper over [`yuv_444p_n_to_rgb_or_rgba_u16_row`] with
/// `ALPHA = true, ALPHA_SRC = false`.
///
/// # Safety
///
@@ -1028,41 +1032,102 @@ pub(crate) unsafe fn yuv_444p_n_to_rgba_u16_row<const BITS: u32>(
) {
// SAFETY: caller obligations forwarded to the shared impl.
unsafe {
yuv_444p_n_to_rgb_or_rgba_u16_row::<BITS, true>(y, u, v, rgba_out, width, matrix, full_range);
yuv_444p_n_to_rgb_or_rgba_u16_row::<BITS, true, false>(
y, u, v, rgba_out, width, matrix, full_range, None,
);
}
}

/// Shared NEON high-bit YUV 4:4:4 → native-depth `u16` kernel.
/// `ALPHA = false` writes RGB triples via `vst3q_u16`; `ALPHA = true`
/// writes RGBA quads via `vst4q_u16` with constant alpha
/// `(1 << BITS) - 1`.
/// NEON YUVA 4:4:4 planar high-bit-depth → **native-depth `u16`**
/// packed RGBA with the per-pixel alpha element **sourced from
/// `a_src`** (already at the source's native bit depth — no depth
/// conversion) instead of being the opaque maximum `(1 << BITS) - 1`.
/// Same numerical contract as [`yuv_444p_n_to_rgba_u16_row`] for R/G/B.
///
/// Thin wrapper over [`yuv_444p_n_to_rgb_or_rgba_u16_row`] with
/// `ALPHA = true, ALPHA_SRC = true`.
///
/// # Safety
///
/// Same as [`yuv_444p_n_to_rgba_u16_row`] plus `a_src.len() >= width`.
#[inline]
#[target_feature(enable = "neon")]
#[allow(clippy::too_many_arguments)]
pub(crate) unsafe fn yuv_444p_n_to_rgba_u16_with_alpha_src_row<const BITS: u32>(
y: &[u16],
u: &[u16],
v: &[u16],
a_src: &[u16],
rgba_out: &mut [u16],
width: usize,
matrix: ColorMatrix,
full_range: bool,
) {
// SAFETY: caller obligations forwarded to the shared impl.
unsafe {
yuv_444p_n_to_rgb_or_rgba_u16_row::<BITS, true, true>(
y,
u,
v,
rgba_out,
width,
matrix,
full_range,
Some(a_src),
);
}
}

/// Shared NEON high-bit YUV 4:4:4 → native-depth `u16` kernel for
/// [`yuv_444p_n_to_rgb_u16_row`] (`ALPHA = false, ALPHA_SRC = false`,
/// `vst3q_u16`), [`yuv_444p_n_to_rgba_u16_row`] (`ALPHA = true,
/// ALPHA_SRC = false`, `vst4q_u16` with constant alpha
/// `(1 << BITS) - 1`) and [`yuv_444p_n_to_rgba_u16_with_alpha_src_row`]
/// (`ALPHA = true, ALPHA_SRC = true`, `vst4q_u16` with the alpha lane
/// loaded from `a_src` and masked to native bit depth — no shift since
/// both the source alpha and the u16 output element are at the same
/// native bit depth).
///
/// # Safety
///
/// 1. **NEON must be available.**
/// 2. `y.len() >= width`, `u.len() >= width`, `v.len() >= width`,
/// `out.len() >= width * if ALPHA { 4 } else { 3 }`.
/// 3. `BITS` ∈ `{9, 10, 12, 14}`.
/// 3. When `ALPHA_SRC = true`: `a_src` must be `Some(_)` and
/// `a_src.unwrap().len() >= width`.
/// 4. `BITS` ∈ `{9, 10, 12, 14}`.
#[inline]
#[target_feature(enable = "neon")]
pub(crate) unsafe fn yuv_444p_n_to_rgb_or_rgba_u16_row<const BITS: u32, const ALPHA: bool>(
#[allow(clippy::too_many_arguments)]
pub(crate) unsafe fn yuv_444p_n_to_rgb_or_rgba_u16_row<
const BITS: u32,
const ALPHA: bool,
const ALPHA_SRC: bool,
>(
y: &[u16],
u: &[u16],
v: &[u16],
out: &mut [u16],
width: usize,
matrix: ColorMatrix,
full_range: bool,
a_src: Option<&[u16]>,
) {
// Compile-time guard — `out_max = ((1 << BITS) - 1) as i16` below
// silently wraps to -1 at BITS=16, corrupting the u16 clamp. The
// dedicated 16-bit u16-output path is `yuv_444p16_to_rgb_u16_row`.
const { assert!(BITS == 9 || BITS == 10 || BITS == 12 || BITS == 14) };
// Source alpha requires RGBA output — there is no 3 bpp store with
// alpha to put it in.
const { assert!(!ALPHA_SRC || ALPHA) };
let bpp: usize = if ALPHA { 4 } else { 3 };
debug_assert!(y.len() >= width);
debug_assert!(u.len() >= width);
debug_assert!(v.len() >= width);
debug_assert!(out.len() >= width * bpp);
if ALPHA_SRC {
debug_assert!(a_src.as_ref().is_some_and(|s| s.len() >= width));
}

let coeffs = scalar::Coefficients::for_matrix(matrix);
let (y_off, y_scale, c_scale) = scalar::range_params_n::<BITS, BITS>(full_range);
Expand Down Expand Up @@ -1140,8 +1205,21 @@ pub(crate) unsafe fn yuv_444p_n_to_rgb_or_rgba_u16_row<const BITS: u32, const AL
let b_hi = clamp_u16_max(vqaddq_s16(y_scaled_hi, b_chroma_hi), zero_v, max_v);

if ALPHA {
let rgba_lo = uint16x8x4_t(r_lo, g_lo, b_lo, alpha_u16);
let rgba_hi = uint16x8x4_t(r_hi, g_hi, b_hi, alpha_u16);
let (a_lo_v, a_hi_v) = if ALPHA_SRC {
// SAFETY (const-checked): ALPHA_SRC = true implies the
// wrapper passed Some(_), validated by debug_assert above.
// No depth conversion — both source alpha and u16 output are
// at the same native bit depth (BITS), so just mask off any
// over-range bits to match the scalar reference.
let a_ptr = a_src.as_ref().unwrap_unchecked().as_ptr();
let lo = vandq_u16(vld1q_u16(a_ptr.add(x)), mask_v);
let hi = vandq_u16(vld1q_u16(a_ptr.add(x + 8)), mask_v);
(lo, hi)
} else {
(alpha_u16, alpha_u16)
};
let rgba_lo = uint16x8x4_t(r_lo, g_lo, b_lo, a_lo_v);
let rgba_hi = uint16x8x4_t(r_hi, g_hi, b_hi, a_hi_v);
vst4q_u16(out.as_mut_ptr().add(x * 4), rgba_lo);
vst4q_u16(out.as_mut_ptr().add(x * 4 + 32), rgba_hi);
} else {
@@ -1160,7 +1238,13 @@ pub(crate) unsafe fn yuv_444p_n_to_rgb_or_rgba_u16_row<const BITS: u32, const AL
let tail_v = &v[x..width];
let tail_out = &mut out[x * bpp..width * bpp];
let tail_w = width - x;
if ALPHA {
if ALPHA_SRC {
// SAFETY (const-checked): ALPHA_SRC = true implies Some(_).
let tail_a = &a_src.as_ref().unwrap_unchecked()[x..width];
scalar::yuv_444p_n_to_rgba_u16_with_alpha_src_row::<BITS>(
tail_y, tail_u, tail_v, tail_a, tail_out, tail_w, matrix, full_range,
);
} else if ALPHA {
scalar::yuv_444p_n_to_rgba_u16_row::<BITS>(
tail_y, tail_u, tail_v, tail_out, tail_w, matrix, full_range,
);