Blit Rect Pipeline #1456

Open
taj-p wants to merge 33 commits into linebender:main from taj-p:tajp/blitFastPath
taj-p (Contributor) commented Feb 18, 2026

Intent

This PR supercharges Vello Hybrid's fill_rect method with a fast path for texture copies. Instead of running through the sparse strip pipeline, we upload the quad to the GPU directly. The cases for which we can do this are listed below:

https://github.com/taj-p/vello/blob/386e4bbb3d8b33d68fa8496837de9d0a343f71bd/sparse_strips/vello_hybrid/src/scene.rs#L490-L577

Performance

The performance is very good for scenes that consist of many images. Once we have glyph caching, I believe this will be the fastest way to render from the glyph atlas. This is likely the "speed of light" for a scene that consists of non-blended / non-clipped images and text.

[image: benchmark results]
More perf benches

I was a bit worried because the prior benchmark uses gl.finish() as the "stop time" between runs, so I also compared against a pixel readback to "force" a full GPU flush and saw the following results. These are still very promising.

[image: benchmark results with pixel readback]

Design

The design is pretty straightforward.

  • If we have a valid fill_rect that doesn't overlap any pending [fill|stroke]_path, then batch it into a blit rect pipeline pass.
  • If the fill_rect does overlap, then flush the strips and start a new blit rect batch.

The "overlap" regions of the strips are tracked via a SIMD-enabled DirtyRects struct. Note that pop_layer dirties the wide tiles that it covered in its layer.
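
The two rules above can be sketched roughly as follows. This is a minimal model, not the PR's code: Rect, Pass and Batcher are hypothetical names, a plain Vec stands in for the SIMD-enabled DirtyRects, and it assumes each blit batch is submitted before the strips that were pending alongside it.

```rust
/// Axis-aligned pixel rectangle; x1 and y1 are exclusive.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    x0: u16,
    y0: u16,
    x1: u16,
    y1: u16,
}

impl Rect {
    fn overlaps(&self, o: &Rect) -> bool {
        self.x0 < o.x1 && o.x0 < self.x1 && self.y0 < o.y1 && o.y0 < self.y1
    }
}

/// One GPU pass, in submission order (hypothetical model).
#[derive(Debug, PartialEq)]
enum Pass {
    Blits(Vec<Rect>),
    Strips(usize), // number of paths rendered through the strip pipeline
}

#[derive(Default)]
struct Batcher {
    dirty: Vec<Rect>,     // bounding boxes of pending fill/stroke paths
    pending_paths: usize, // paths waiting for the next strip pass
    blits: Vec<Rect>,     // current blit batch
    passes: Vec<Pass>,
}

impl Batcher {
    /// A fill_path or stroke_path: stays pending and dirties its bounding box.
    fn draw_path(&mut self, bbox: Rect) {
        self.dirty.push(bbox);
        self.pending_paths += 1;
    }

    /// A blit-eligible fill_rect: joins the batch unless it overlaps pending strips.
    fn fill_rect(&mut self, rect: Rect) {
        if self.dirty.iter().any(|d| d.overlaps(&rect)) {
            self.flush(); // overlap: emit what we have and start a new batch
        }
        self.blits.push(rect);
    }

    /// Emits the current blit batch, then the pending strips, and resets state.
    fn flush(&mut self) {
        if !self.blits.is_empty() {
            self.passes.push(Pass::Blits(std::mem::take(&mut self.blits)));
        }
        if self.pending_paths > 0 {
            self.passes.push(Pass::Strips(self.pending_paths));
            self.pending_paths = 0;
        }
        self.dirty.clear();
    }
}
```

In this model a non-overlapping fill_rect can safely be drawn ahead of the pending paths, because nothing it covers has been touched by them; only an overlapping one forces a flush to preserve draw order.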

We saw significant performance improvement by calculating path bounding boxes by iterating over the line_buf instead of performing path.bounding_box().
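
A minimal sketch of what a bounding box over flattened lines can look like. The Line layout here is an assumption for illustration (the real line_buf element type is not shown in this thread), and line_buf_bbox is a hypothetical name. Since the lines are already flattened, the box is just a min/max over endpoints, with no curve extrema to solve for.

```rust
/// Hypothetical flattened line segment; the real line_buf layout may differ.
#[derive(Clone, Copy)]
struct Line {
    p0: [f32; 2],
    p1: [f32; 2],
}

/// Returns [x0, y0, x1, y1] covering every endpoint, or None for an empty buffer.
fn line_buf_bbox(lines: &[Line]) -> Option<[f32; 4]> {
    let first = lines.first()?;
    let mut bbox = [first.p0[0], first.p0[1], first.p0[0], first.p0[1]];
    for line in lines {
        for p in [line.p0, line.p1] {
            bbox[0] = bbox[0].min(p[0]);
            bbox[1] = bbox[1].min(p[1]);
            bbox[2] = bbox[2].max(p[0]);
            bbox[3] = bbox[3].max(p[1]);
        }
    }
    Some(bbox)
}
```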

Degenerate cases

The blit rect pipeline currently uses a naive batching mechanism (see Design). It is possible to build a degenerate scene with this batching mechanism that causes flip-flop pipeline switching between blit and strips.

There are some scenes that are not optimisable. Consider the below scene:

[image: the degenerate scene]

This scene consists of rows of overlapping bordered rects. Each rect must be drawn before its stroke (i.e. border), which in turn must be drawn before the next rect, and so on. See the ASCII visualisation below:

Card 0:  [===== blit_0 =====]
         [===== stroke_0 ===]
Card 1:       [===== blit_1 =====]     ← overlaps stroke_0 (pipeline switch)
              [===== stroke_1 ===]
Card 2:            [===== blit_2 =====] ← overlaps stroke_1 (pipeline switch)
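
To make the cost concrete, here is a small one-dimensional model of the diagram above. It is a sketch under the same assumptions as the Design section (flush on overlap, one blit pass plus one strip pass per flush); passes_for_cards and its parameters are hypothetical.

```rust
/// Counts render passes for `n` cards laid out left to right.
/// Each card is a blit followed by a stroke over the same horizontal span
/// [i * stride, i * stride + width). Cards overlap when stride < width.
fn passes_for_cards(n: u32, stride: u32, width: u32) -> u32 {
    let mut passes = 0;
    let mut dirty: Vec<(u32, u32)> = Vec::new(); // spans of pending strokes
    let mut batch_len = 0; // blits in the current batch

    for i in 0..n {
        let span = (i * stride, i * stride + width);
        // The blit overlaps a pending stroke: flush blits, then strips.
        if dirty.iter().any(|d| d.0 < span.1 && span.0 < d.1) {
            passes += 2;
            dirty.clear();
            batch_len = 0;
        }
        batch_len += 1; // the card's blit joins the batch
        dirty.push(span); // the card's stroke becomes pending
    }

    // Final flush of whatever is left.
    if batch_len > 0 {
        passes += 1;
    }
    if !dirty.is_empty() {
        passes += 1;
    }
    passes
}
```

With overlapping cards, every card costs two passes in this model, whereas the same number of non-overlapping cards collapses into one blit pass and one strip pass.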

To get a feel for this impact, see the below regressions of this type of scene:

[image: benchmark regressions for this scene]

We could consider adding a runtime "opt out" of batching when the batch consists of only 1 blit, or similar. But, at least for my use case, I would be surprised to find scenes with 100s of overlapping bordered images with strokes in between. (Note that overlapping images are fine, but 100s of contiguously overlapping image/path pairs are not.)

If we're concerned about this, I propose one of the following (to be approved by reviewers):

  1. Merge as-is into Vello (and remove once sparse strip pipeline is good enough)
  2. Merge after introducing a runtime opt-out mechanism for small blit batches
  3. Review, but merge into fork
  4. Add additional documentation to expect_only_default_blending
  5. ?? Maybe you can think of an even better idea! 🙏

Note that I dabbled in other batching mechanisms, but I think, for now, this might be good enough to at least "test the waters" of blit rect.

Testing

  • We run some handcrafted tests with and without blit batching.
  • Every snapshot has been modified to run on both Vello Hybrid with and without the blit rect pipeline.

taj-p changed the title from "[WIP]: Blit Rect Fast Path Pipeline" to "[WIP]: Blit Rect Pipeline" on Feb 18, 2026
taj-p changed the title from "[WIP]: Blit Rect Pipeline" to "Blit Rect Pipeline" on Feb 19, 2026
taj-p requested review from LaurenzV and grebmeg and removed request for LaurenzV and grebmeg on February 19, 2026 20:31
LaurenzV (Collaborator) left a comment

Left some comments. Overall, this does unfortunately add a lot of complexity, but if it gives us the performance improvements we need, it's probably worth it. One request though: While the microbenchmarks you made show some clear wins, I would still like to see how this fares when actually being used with our workloads. Have you already tried integrating this, and if so do you maybe have a screencast that shows the performance difference before/after with our workloads? Would be interesting to see, I think. 🙂

}

- #[vello_test(cpu_u8_tolerance = 1, hybrid_tolerance = 1)]
+ #[vello_test(cpu_u8_tolerance = 1, hybrid_tolerance = 1, uses_blends)]
Collaborator

Do you think it might make sense to automatically assume any test with "mix" or "compose" in the name uses blends? Then we don't have to annotate those ourselves.
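
A minimal sketch of the name-based check suggested here. name_implies_blends is a hypothetical helper; how the vello_test macro would call it is not shown in this thread. Plain substring matching is an assumption and could also match unrelated names that happen to contain "mix".

```rust
/// Returns true when a test name suggests that the test exercises blending.
fn name_implies_blends(test_name: &str) -> bool {
    ["mix", "compose"].iter().any(|kw| test_name.contains(kw))
}
```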

Comment on lines +30 to +37
for pixel in pixmap.data_mut() {
    *pixel = vello_common::peniko::color::PremulRgba8 {
        r: 0,
        g: 128,
        b: 255,
        a: 255,
    };
}
Collaborator

Could probably shorten this to pixmap.data_mut().fill(pixel)?
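
For illustration, the shortened form on a plain slice. This sketch uses a local stand-in for PremulRgba8 and assumes data_mut() yields a mutable slice of a Clone pixel type, which is what slice::fill requires.

```rust
/// Local stand-in for peniko's PremulRgba8, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
struct PremulRgba8 {
    r: u8,
    g: u8,
    b: u8,
    a: u8,
}

/// Sets every pixel to the same colour with a single slice::fill call.
fn fill_solid(data: &mut [PremulRgba8], pixel: PremulRgba8) {
    data.fill(pixel);
}
```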

}

/// Serialise concurrent GPU tests to avoid wgpu segfaults.
static GPU_MUTEX: Mutex<()> = Mutex::new(());
Collaborator

Since this is a separate mutex from the one used by the other vello_hybrid tests, could it not still happen that one blit test and a normal test run concurrently?

Collaborator

Ah, I guess tests from different modules don't run at the same time, right?
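
Some context on this question: Rust's default test harness runs the tests of one test binary on parallel threads regardless of which module they are in, while cargo test runs separate test binaries one after another. So two distinct mutexes only rule out concurrent GPU use if the two groups of tests end up in different binaries. If they share a binary, one shared lock would be needed. A minimal sketch, where gpu_lock is a hypothetical helper:

```rust
use std::sync::{Mutex, MutexGuard};

/// Single lock shared by every GPU test in the binary.
static GPU_MUTEX: Mutex<()> = Mutex::new(());

/// Acquires the GPU lock. A poisoned lock (from a test that panicked while
/// holding it) is still usable here because the guarded value is just ().
fn gpu_lock() -> MutexGuard<'static, ()> {
    GPU_MUTEX.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}
```

Each GPU test would then hold the guard for its whole body, for example with let _guard = gpu_lock(); as its first statement.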


@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
    return textureSample(atlas_texture_array, atlas_sampler, in.uv, in.atlas_index);
Collaborator

So if bilinear filtering is enabled, this will use the GPU-native image sampler? Won't this cause issues at the border if a different image is in the same image atlas?
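
One common guard against this kind of bleed is to clamp the sample position to the texel centres of the image's own atlas rectangle, so the bilinear footprint never reaches a neighbouring image. Whether this PR does that, pads the atlas, or handles it some other way is not shown in this thread. The sketch below is written in Rust for illustration (the same arithmetic would live in the shader), and clamp_to_texel_centres is a hypothetical name.

```rust
/// Clamps a sample position (in atlas pixels) to the texel centres of
/// `rect` = [x0, y0, x1, y1], so a bilinear tap stays inside the image.
fn clamp_to_texel_centres(p: [f32; 2], rect: [f32; 4]) -> [f32; 2] {
    let clamp_axis = |v: f32, lo: f32, hi: f32| {
        let lo = lo + 0.5;
        // Guard against images narrower than one texel, where hi < lo.
        let hi = (hi - 0.5).max(lo);
        v.clamp(lo, hi)
    };
    [
        clamp_axis(p[0], rect[0], rect[2]),
        clamp_axis(p[1], rect[1], rect[3]),
    ]
}
```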

Comment on lines +197 to +199
if blits.is_empty() {
    continue;
}
Collaborator

Haven't looked into it yet, but is there no way of early-optimizing this? i.e. not creating this batch in the first place (or merging it with the next one) if there are no blits.

mask: Option<Mask>,
filter: Option<Filter>,
) {
    let blend_mode = blend_mode.unwrap_or(Self::DEFAULT_BLEND_MODE);
Collaborator

Could use unwrap_or_default here
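
A minimal sketch of that suggestion. It only behaves the same as the current code if the blend mode's Default impl matches DEFAULT_BLEND_MODE, which is assumed here and would need checking against the real type. The BlendMode enum below is a local stand-in.

```rust
/// Local stand-in for the real blend mode type, for illustration only.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
enum BlendMode {
    #[default]
    SrcOver,
    Multiply,
}

const DEFAULT_BLEND_MODE: BlendMode = BlendMode::SrcOver;

/// Falls back to BlendMode::default() when no blend mode is given.
fn resolve(blend_mode: Option<BlendMode>) -> BlendMode {
    blend_mode.unwrap_or_default()
}
```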

Comment on lines +704 to +709
if self.render_hints.blit_rect_pipeline_enabled() {
    assert!(
        blend_mode == Self::DEFAULT_BLEND_MODE,
        "blit rect pipeline only supports default blending"
    );
}
Collaborator

I think we have that same code above, maybe add a method like self.assert_blend_mode?
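
A sketch of the suggested helper, using simplified stand-ins: Scene here holds a plain bool instead of render_hints, and BlendMode is a local enum. Only the assertion and its message are taken from the snippet above.

```rust
/// Local stand-in for the real blend mode type, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum BlendMode {
    SrcOver,
    Multiply,
}

/// Simplified stand-in for the scene; the real one queries render_hints.
struct Scene {
    blit_rect_pipeline_enabled: bool,
}

impl Scene {
    const DEFAULT_BLEND_MODE: BlendMode = BlendMode::SrcOver;

    /// Shared check for every entry point that accepts a blend mode.
    fn assert_blend_mode(&self, blend_mode: BlendMode) {
        if self.blit_rect_pipeline_enabled {
            assert!(
                blend_mode == Self::DEFAULT_BLEND_MODE,
                "blit rect pipeline only supports default blending"
            );
        }
    }
}
```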

let layer_bbox = self.wide.pop_layer(&mut self.render_graph);
if self.render_hints.blit_rect_pipeline_enabled() && !layer_bbox.is_inverted() {
    // Push the dirty rect for the layer to the dirty rects list.
    let [x0, y0, x1, y1] = layer_bbox.pixel_bounds();
Collaborator

You probably are aware of this, but this won't represent the real layer bbox, but instead the one snapped to the wide tile coordinates.
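
To illustrate the point, snapping a pixel bbox outward to a tile grid always yields a superset of the real bbox, so the resulting dirty rect is conservative: it can cause extra flushes but never a missed overlap. The sketch below takes the tile size as parameters; the 256 by 4 wide tile size used in the usage note is an assumption, and snap_to_wide_tiles is a hypothetical name.

```rust
/// Expands [x0, y0, x1, y1] outward to multiples of the tile size.
fn snap_to_wide_tiles(bbox: [u32; 4], tile_w: u32, tile_h: u32) -> [u32; 4] {
    [
        bbox[0] / tile_w * tile_w,         // round x0 down
        bbox[1] / tile_h * tile_h,         // round y0 down
        bbox[2].div_ceil(tile_w) * tile_w, // round x1 up
        bbox[3].div_ceil(tile_h) * tile_h, // round y1 up
    ]
}
```

For example, with 256 by 4 tiles, a layer covering [10, 5, 300, 9] dirties [0, 4, 512, 12].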

) {
    self.enter_strip_mode();
    if self.render_hints.blit_rect_pipeline_enabled() {
        self.push_dirty_viewport();
Collaborator

Is this because we don't care about recordings for now?

);

// Process 2 rects (of 4 u16 values each) per iteration.
for chunk in data.chunks_exact(8) {
Collaborator

This does mean that each time we draw a new shape, the amount of time needed to perform that check for each new rectangle increases linearly, right? :( Maybe we should impose some limit and give up if it's too many?
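
One way such a limit could work, sketched under assumptions: once the list grows past a cap, collapse it into a single union rect. This keeps each overlap check bounded at the cost of false positives, which means more flushes. CappedDirtyRects and its cap are hypothetical, and the sketch uses a scalar scan rather than the SIMD loop in the PR.

```rust
/// [x0, y0, x1, y1] with exclusive x1 and y1.
type Rect = [u16; 4];

/// Dirty rect list that collapses to one union rect once it exceeds `cap`.
struct CappedDirtyRects {
    rects: Vec<Rect>,
    cap: usize,
}

impl CappedDirtyRects {
    fn new(cap: usize) -> Self {
        Self { rects: Vec::new(), cap: cap.max(1) }
    }

    fn push(&mut self, r: Rect) {
        self.rects.push(r);
        if self.rects.len() > self.cap {
            // Give up on precision: keep only the bounding union.
            let union = self.rects.iter().fold(r, |u, d| {
                [u[0].min(d[0]), u[1].min(d[1]), u[2].max(d[2]), u[3].max(d[3])]
            });
            self.rects.clear();
            self.rects.push(union);
        }
    }

    /// Linear scan, bounded by `cap` entries.
    fn overlaps(&self, r: Rect) -> bool {
        self.rects
            .iter()
            .any(|d| d[0] < r[2] && r[0] < d[2] && d[1] < r[3] && r[1] < d[3])
    }
}
```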
