Conversation
LaurenzV
left a comment
There was a problem hiding this comment.
Left some comments. Overall, this does unfortunately add a lot of complexity, but if it gives us the performance improvements we need, it's probably worth it. One request though: While the microbenchmarks you made show some clear wins, I would still like to see how this fares when actually being used with our workloads. Have you already tried integrating this, and if so do you maybe have a screencast that shows the performance difference before/after with our workloads? Would be interesting to see, I think. 🙂
| } | ||
|
|
||
| #[vello_test(cpu_u8_tolerance = 1, hybrid_tolerance = 1)] | ||
| #[vello_test(cpu_u8_tolerance = 1, hybrid_tolerance = 1, uses_blends)] |
There was a problem hiding this comment.
Do you think it might make sense to automatically assume any test with "mix" or "compose" in the name uses blends? Then we dont have to annotate those ourselves.
| for pixel in pixmap.data_mut() { | ||
| *pixel = vello_common::peniko::color::PremulRgba8 { | ||
| r: 0, | ||
| g: 128, | ||
| b: 255, | ||
| a: 255, | ||
| }; | ||
| } |
There was a problem hiding this comment.
Could probably shorten this to pixmap.data_mut().fill(pixel)?
| } | ||
|
|
||
| /// Serialise concurrent GPU tests to avoid wgpu segfaults. | ||
| static GPU_MUTEX: Mutex<()> = Mutex::new(()); |
There was a problem hiding this comment.
Since this is a separate mutex used by the other vello_hybrid test, could it not still happen that one blit test and a normal test run concurrently?
There was a problem hiding this comment.
Ah, I guess tests from different modules don't run at the same time, right?
|
|
||
| @fragment | ||
| fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> { | ||
| return textureSample(atlas_texture_array, atlas_sampler, in.uv, in.atlas_index); |
There was a problem hiding this comment.
So if bilinear filtering is enabled, this will use the GPU-native image sampler? Won't this cause issues at the border if a different image is in the same image atlas?
| if blits.is_empty() { | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Haven't looked into it yet, but is there no way of early-optimizing this? i.e. not creating this batch in the first place (or merging it with the next one) if there are no blits.
| mask: Option<Mask>, | ||
| filter: Option<Filter>, | ||
| ) { | ||
| let blend_mode = blend_mode.unwrap_or(Self::DEFAULT_BLEND_MODE); |
There was a problem hiding this comment.
COuld use unwrap_or_default here
| if self.render_hints.blit_rect_pipeline_enabled() { | ||
| assert!( | ||
| blend_mode == Self::DEFAULT_BLEND_MODE, | ||
| "blit rect pipeline only supports default blending" | ||
| ); | ||
| } |
There was a problem hiding this comment.
I think we have that same code above, maybe add a method like self.assert_blend_mode?
| let layer_bbox = self.wide.pop_layer(&mut self.render_graph); | ||
| if self.render_hints.blit_rect_pipeline_enabled() && !layer_bbox.is_inverted() { | ||
| // Push the dirty rect for the layer to the dirty rects list. | ||
| let [x0, y0, x1, y1] = layer_bbox.pixel_bounds(); |
There was a problem hiding this comment.
You probably are aware of this, but this won't represent the real layer bbox, but instead the one snapped to the wide tile coordinates.
| ) { | ||
| self.enter_strip_mode(); | ||
| if self.render_hints.blit_rect_pipeline_enabled() { | ||
| self.push_dirty_viewport(); |
There was a problem hiding this comment.
Is this because we don't care about recordings for now?
| ); | ||
|
|
||
| // Process 2 rects (of 4 u16 values each) per iteration. | ||
| for chunk in data.chunks_exact(8) { |
There was a problem hiding this comment.
This does mean that each time we draw a new shape, the amount of time needed to perform that check for each new rectangle increases linearly, right? :( Maybe we should impose some limit and give up if it's too many?
Intent
This PR supercharges Vello Hybrid's
fill_rectmethod with a fast path for texture copies. Instead of running through the sparse strip pipeline, we upload the quad to the GPU directly. The cases for which we can do this are listed below:https://github.com/taj-p/vello/blob/386e4bbb3d8b33d68fa8496837de9d0a343f71bd/sparse_strips/vello_hybrid/src/scene.rs#L490-L577
Performance
The performance is very good for scenes that consist of many images. Once we have glyph caching, I believe this will be the fastest way to render from the glyph atlas. This is likely the "speed of light" for a scene that consists of non-blended / non-clipped images and text.
More perf benches
I was a bit worried because in that prior benchmark we use
gl.finish()as the "stop time" between runs. I also compared against a pixel read back to "force" a full GPU flush and saw the following results. These are still very promising.Design
The design is pretty straightforward.
fill_rectthat doesn't overlap any pending[fill|stroke]_path, then batch it into a blit rect pipeline pass.fill_rectdoes overlap, then flush the strips and start a new blit rect batch.The "overlap" regions of the strips are tracked via a SIMD-enabled
DirtyRectsstruct. Note thatpop_layerdirties the wide tiles that it covered in its layer.We saw significant performance improvement by calculating path bounding boxes by iterating over the
line_bufinstead of performingpath.bounding_box().Degenerate cases
The blit rect pipeline currently uses a naive batching mechanism (see ## Design). It is possible to build a degenerate scene with this batching mechanism that causes flip-flop pipeline switching between blit and strips.
There are some scenes that are not optimisable. Consider the below scene:
This scene sees rows of overlapping bordered rects. You must draw the rect before the next stroke (i.e. border) before the next rect and so on. See below ASCII visualisation:
To get a feel for this impact, see the below regressions of this type of scene:
We could consider adding runtime "opt out" of batching when the batch only consists of 1 blit or similar. But, at least for my use case, I would be surprised to find scenes with 100s of overlapping bordered images with strokes in between. (Note that overlapping images are fine, but 100s contiguously overlapping image/path are not).
If we're concerned about this, I think I propose (to be approved by reviewers):
expect_only_default_blendingNote that I dabbled in other batching mechanisms, but I think, for now, this might be good enough to at least "test the waters" of blit rect.
Testing