Blit Rect Pipeline by taj-p · Pull Request #1456 · linebender/vello

taj-p · 2026-02-18T20:21:58Z

Intent

This PR supercharges Vello Hybrid's fill_rect method with a fast path for texture copies. Instead of running through the sparse strip pipeline, we upload the quad to the GPU directly. The cases for which we can do this are listed below:

https://github.com/taj-p/vello/blob/386e4bbb3d8b33d68fa8496837de9d0a343f71bd/sparse_strips/vello_hybrid/src/scene.rs#L490-L577

Performance

The performance is very good for scenes that consist of many images. Once we have glyph caching, I believe this will be the fastest way to render from the glyph atlas. This is likely the "speed of light" for a scene that consists of non-blended / non-clipped images and text.

More perf benches

I was a bit worried because that prior benchmark uses gl.finish() as the "stop time" between runs, so I also compared against a pixel readback to "force" a full GPU flush and saw the following results. These are still very promising.

Design

The design is pretty straightforward.

If we have a valid fill_rect that doesn't overlap any pending [fill|stroke]_path, then batch it into a blit rect pipeline pass.
If the fill_rect does overlap, then flush the strips and start a new blit rect batch.
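
As a rough illustration of the two rules above, here is a minimal sketch. It is not the PR's code: `Rect`, `Pass`, `Batcher`, and their methods are hypothetical names. The ordering inside one round (blits submitted before the strips they were batched alongside) is an assumption of the sketch.

```rust
/// Axis-aligned rectangle in pixels (hypothetical helper, not a Vello type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

impl Rect {
    fn new(x0: f32, y0: f32, x1: f32, y1: f32) -> Self {
        Self { x0, y0, x1, y1 }
    }

    /// True if the interiors intersect; rects that only share an edge do not overlap.
    fn overlaps(&self, o: &Rect) -> bool {
        self.x0 < o.x1 && o.x0 < self.x1 && self.y0 < o.y1 && o.y0 < self.y1
    }
}

/// A submitted GPU pass, carrying the number of draws it contains.
#[derive(Debug, PartialEq)]
enum Pass {
    Blits(usize),
    Strips(usize),
}

#[derive(Default)]
struct Batcher {
    passes: Vec<Pass>,
    pending_blits: usize,
    pending_path_bounds: Vec<Rect>,
}

impl Batcher {
    /// Record a fill/stroke path by its bounding box; it stays pending until a flush.
    fn fill_path(&mut self, bounds: Rect) {
        self.pending_path_bounds.push(bounds);
    }

    /// Batch the rect as a blit, flushing first if it overlaps any pending path.
    fn fill_rect(&mut self, rect: Rect) {
        if self.pending_path_bounds.iter().any(|p| p.overlaps(&rect)) {
            self.flush();
        }
        self.pending_blits += 1;
    }

    /// Emit the pending blit batch (if any), then the pending strips (if any).
    fn flush(&mut self) {
        if self.pending_blits > 0 {
            self.passes.push(Pass::Blits(self.pending_blits));
            self.pending_blits = 0;
        }
        if !self.pending_path_bounds.is_empty() {
            self.passes.push(Pass::Strips(self.pending_path_bounds.len()));
            self.pending_path_bounds.clear();
        }
    }
}
```

In this sketch, a rect that misses every pending path joins the current blit batch for free, while a rect that hits one costs a pipeline switch.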

The "overlap" regions of the strips are tracked via a SIMD-enabled DirtyRects struct. Note that pop_layer dirties the wide tiles that it covered in its layer.
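
A scalar sketch of the idea, for readers who want to picture it. The real DirtyRects is SIMD-enabled; this version is deliberately not. `PixelRect`, the method names, and the wide-tile dimensions are assumptions made for illustration.

```rust
/// Assumed wide-tile dimensions for this sketch (not taken from the PR).
const WIDE_TILE_W: u32 = 256;
const WIDE_TILE_H: u32 = 4;

/// Integer pixel rectangle, half-open on the right and bottom (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct PixelRect {
    x0: u32,
    y0: u32,
    x1: u32,
    y1: u32,
}

impl PixelRect {
    fn new(x0: u32, y0: u32, x1: u32, y1: u32) -> Self {
        Self { x0, y0, x1, y1 }
    }

    fn overlaps(&self, o: &PixelRect) -> bool {
        self.x0 < o.x1 && o.x0 < self.x1 && self.y0 < o.y1 && o.y0 < self.y1
    }

    /// Grow the rect outward so it covers whole wide tiles.
    fn snapped_to_wide_tiles(&self) -> PixelRect {
        let down = |v: u32, n: u32| v / n * n;
        let up = |v: u32, n: u32| (v + n - 1) / n * n;
        PixelRect::new(
            down(self.x0, WIDE_TILE_W),
            down(self.y0, WIDE_TILE_H),
            up(self.x1, WIDE_TILE_W),
            up(self.y1, WIDE_TILE_H),
        )
    }
}

/// Regions touched by pending strip work since the last flush.
#[derive(Default)]
struct DirtyRects {
    rects: Vec<PixelRect>,
}

impl DirtyRects {
    /// Mark the bounds of a pending path as dirty.
    fn mark(&mut self, r: PixelRect) {
        self.rects.push(r);
    }

    /// Mark every wide tile a popped layer covered as dirty.
    fn mark_layer(&mut self, layer_bounds: PixelRect) {
        self.rects.push(layer_bounds.snapped_to_wide_tiles());
    }

    /// A blit rect may be batched only if this returns false.
    fn overlaps(&self, r: &PixelRect) -> bool {
        self.rects.iter().any(|d| d.overlaps(r))
    }

    /// Called after the strips are flushed.
    fn clear(&mut self) {
        self.rects.clear();
    }
}
```

The linear scan in `overlaps` is the part a SIMD implementation would be expected to speed up, by testing several stored rects per instruction.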

We saw a significant performance improvement from computing path bounding boxes by iterating over the line_buf instead of calling path.bounding_box().
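
A minimal sketch of what such a pass over the flattened lines could look like. `Point`, `Line`, and `line_buf_bounds` are stand-in names for illustration; the actual types and field layout in Vello may differ.

```rust
/// Stand-in for a flattened line segment in the line buffer (hypothetical layout).
#[derive(Clone, Copy)]
struct Point {
    x: f32,
    y: f32,
}

#[derive(Clone, Copy)]
struct Line {
    p0: Point,
    p1: Point,
}

impl Line {
    fn new(x0: f32, y0: f32, x1: f32, y1: f32) -> Self {
        Self {
            p0: Point { x: x0, y: y0 },
            p1: Point { x: x1, y: y1 },
        }
    }
}

/// Returns [min_x, min_y, max_x, max_y] over all endpoints, or None for an empty buffer.
fn line_buf_bounds(line_buf: &[Line]) -> Option<[f32; 4]> {
    let first = line_buf.first()?.p0;
    let mut b = [first.x, first.y, first.x, first.y];
    for line in line_buf {
        for p in [line.p0, line.p1] {
            b[0] = b[0].min(p.x);
            b[1] = b[1].min(p.y);
            b[2] = b[2].max(p.x);
            b[3] = b[3].max(p.y);
        }
    }
    Some(b)
}
```

Since the lines are straight segments, a min/max over their endpoints is an exact bound for the flattened geometry, and the pass is a single linear scan.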

Degenerate cases

The blit rect pipeline currently uses a naive batching mechanism (see Design above). It is possible to build a degenerate scene with this batching mechanism that causes flip-flop pipeline switching between blit and strips.

There are some scenes that are not optimisable. Consider the below scene:

This scene consists of rows of overlapping bordered rects. Each rect must be drawn before its stroke (i.e. border), which in turn must be drawn before the next rect, and so on. See the ASCII visualisation below:

Card 0:  [===== blit_0 =====]
         [===== stroke_0 ===]
Card 1:       [===== blit_1 =====]     ← overlaps stroke_0 (pipeline switch)
              [===== stroke_1 ===]
Card 2:            [===== blit_2 =====] ← overlaps stroke_1 (pipeline switch)
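
To make the flip-flop concrete, the sketch below counts forced flushes for a one-dimensional version of the card scene drawn above. It models only the overlap rule from the Design section. `Op`, `count_flushes`, and `card_scene` are hypothetical names, and the numbers it produces are switch counts, not measured timings.

```rust
/// A draw op covering the x-interval [x0, x1) (1D stand-in for a rect).
#[derive(Clone, Copy)]
enum Op {
    Blit(f32, f32),
    Stroke(f32, f32),
}

/// Count how often a blit overlaps a pending stroke and so forces a flush.
fn count_flushes(ops: &[Op]) -> usize {
    let mut pending: Vec<(f32, f32)> = Vec::new();
    let mut flushes = 0;
    for op in ops {
        match *op {
            Op::Stroke(x0, x1) => pending.push((x0, x1)),
            Op::Blit(x0, x1) => {
                if pending.iter().any(|&(s0, s1)| x0 < s1 && s0 < x1) {
                    flushes += 1;
                    pending.clear();
                }
            }
        }
    }
    flushes
}

/// Build `n` bordered cards, each shifted right by `step` and `width` wide.
fn card_scene(n: usize, step: f32, width: f32) -> Vec<Op> {
    let mut ops = Vec::with_capacity(2 * n);
    for i in 0..n {
        let x0 = i as f32 * step;
        ops.push(Op::Blit(x0, x0 + width));
        ops.push(Op::Stroke(x0, x0 + width));
    }
    ops
}
```

With `step < width`, every card after the first overlaps the previous border, so `n` cards force `n - 1` flushes. With `step >= width`, nothing overlaps and the whole scene stays in a single batch.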

To get a feel for this impact, see the below regressions of this type of scene:

We could consider adding a runtime "opt out" of batching when a batch consists of only 1 blit, or similar. But, at least for my use case, I would be surprised to find scenes with 100s of overlapping bordered images with strokes in between. (Note that overlapping images are fine, but 100s of contiguously overlapping image/path pairs are not.)

If we're concerned about this, I propose one of the following options (to be decided by reviewers):

Merge as-is into Vello (and remove once sparse strip pipeline is good enough)
Merge after introducing runtime opt-out mechanism for small blit batches
Review, but merge into fork
Add additional documentation to expect_only_default_blending
?? Maybe you can think of an even better idea! 🙏

Note that I dabbled in other batching mechanisms, but I think, for now, this might be good enough to at least "test the waters" of blit rect.

Testing

We run some handcrafted tests with and without blit batching.
Every snapshot has been modified to run on both Vello Hybrid with and without the blit rect pipeline.

LaurenzV

Left some comments. Overall, this does unfortunately add a lot of complexity, but if it gives us the performance improvements we need, it's probably worth it. One request though: While the microbenchmarks you made show some clear wins, I would still like to see how this fares when actually being used with our workloads. Have you already tried integrating this, and if so do you maybe have a screencast that shows the performance difference before/after with our workloads? Would be interesting to see, I think. 🙂

LaurenzV · 2026-02-20T08:02:53Z

sparse_strips/vello_sparse_tests/tests/mix.rs

 }

-#[vello_test(cpu_u8_tolerance = 1, hybrid_tolerance = 1)]
+#[vello_test(cpu_u8_tolerance = 1, hybrid_tolerance = 1, uses_blends)]


Do you think it might make sense to automatically assume any test with "mix" or "compose" in the name uses blends? Then we don't have to annotate those ourselves.

LaurenzV · 2026-02-20T08:18:34Z

sparse_strips/vello_sparse_tests/tests/batching.rs

+        for pixel in pixmap.data_mut() {
+            *pixel = vello_common::peniko::color::PremulRgba8 {
+                r: 0,
+                g: 128,
+                b: 255,
+                a: 255,
+            };
+        }


Could probably shorten this to pixmap.data_mut().fill(pixel)?
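
For reference, a small sketch of that shortening. It assumes `pixmap.data_mut()` yields a mutable slice of a `Clone` pixel type; the local `PremulRgba8` below is a stand-in for the real one so that the example is self-contained.

```rust
/// Local stand-in for the premultiplied RGBA8 pixel type used by the pixmap.
#[derive(Clone, Copy, Debug, PartialEq)]
struct PremulRgba8 {
    r: u8,
    g: u8,
    b: u8,
    a: u8,
}

/// Set every pixel in `data` to `pixel` using the standard library's slice `fill`.
fn fill_solid(data: &mut [PremulRgba8], pixel: PremulRgba8) {
    data.fill(pixel);
}
```

`fill` is a standard slice method that requires only `Clone` on the element type, so it replaces the explicit loop without changing behaviour.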

LaurenzV · 2026-02-20T08:20:55Z

sparse_strips/vello_sparse_tests/tests/batching.rs

+    }
+
+    /// Serialise concurrent GPU tests to avoid wgpu segfaults.
+    static GPU_MUTEX: Mutex<()> = Mutex::new(());


Since this is a separate mutex from the one used by the other vello_hybrid tests, could it not still happen that one blit test and a normal test run concurrently?

Ah, I guess tests from different modules don't run at the same time, right?

LaurenzV · 2026-02-20T08:36:46Z

sparse_strips/vello_sparse_shaders/shaders/blit_rects.wgsl

+
+@fragment
+fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
+    return textureSample(atlas_texture_array, atlas_sampler, in.uv, in.atlas_index);


So if bilinear filtering is enabled, this will use the GPU-native image sampler? Won't this cause issues at the border if a different image is in the same image atlas?
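
To illustrate the concern with some arithmetic: under bilinear filtering, a sample taken exactly on a sub-image's edge blends the texel on each side of that edge. The sketch below shows which texels are read, along with one common mitigation: insetting the UV range by half a texel. Whether the PR does anything like this is not stated here; both function names are hypothetical.

```rust
/// The two texel columns a bilinear sample at normalised coordinate `u` reads from,
/// for a texture `size` texels wide (texel centres sit at (i + 0.5) / size).
fn bilinear_texels(u: f32, size: u32) -> (i64, i64) {
    let t = u * size as f32 - 0.5;
    let lo = t.floor() as i64;
    (lo, lo + 1)
}

/// UV range for the atlas sub-image covering texels [x0, x1), inset by half a texel
/// so that the nearest-to-edge samples land exactly on texel centres.
fn inset_uv(x0: u32, x1: u32, atlas_size: u32) -> (f32, f32) {
    let s = atlas_size as f32;
    ((x0 as f32 + 0.5) / s, (x1 as f32 - 0.5) / s)
}
```

For a sub-image at texels [64, 128) in a 256-wide atlas, sampling at `u = 64 / 256` reads texels 63 and 64, so the neighbouring image bleeds in. Sampling at the inset lower bound reads texel 64 with full weight.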

LaurenzV · 2026-02-20T08:46:56Z

sparse_strips/vello_hybrid/src/render/wgpu.rs

+                if blits.is_empty() {
+                    continue;
+                }


Haven't looked into it yet, but is there no way of early-optimizing this? i.e. not creating this batch in the first place (or merging it with the next one) if there are no blits.

LaurenzV · 2026-02-20T11:35:34Z