Some image decoding can be slow. PNG is slow, and Crunch is a Russian doll of DXT-compressed data itself compressed with LZMA: we don't decode the DXT data, but we do decode the LZMA, and that isn't fast. Other codecs like WebP or JPEG are likely much faster than PNG, but aren't cheap either.
With our current architecture, parallelizing image loading isn't easy. In particular, I guess we cannot parallelize the upload of the images to the GPU.
Our code is very sequential: we parse a shader stage, we load an image from the VFS, we decompress what we have to decompress, we upload the data, and then we jump to the next shader stage.
While working on parallelizing some model code, we experimented with OpenMP:
That experiment showed that a simple loop that reads input from one array and writes output to another array at the same index is very easy to parallelize, just by adding a pragma. The code builds and runs with or without OpenMP support, and the parallelization is very efficient once OpenMP support is there.
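For reference, here is a minimal sketch of that pattern (not code from our tree): the pragma is simply ignored by compilers built without OpenMP support, so the loop still runs, just sequentially.

```cpp
#include <vector>

// Minimal sketch of the parallelizable pattern: each iteration reads
// input[i] and writes output[i] only, so nothing is shared between threads.
void transform(const std::vector<int>& input, std::vector<int>& output)
{
    output.resize(input.size());

    #pragma omp parallel for
    for (int i = 0; i < (int)input.size(); i++)
    {
        output[i] = input[i] * 2;
    }
}
```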
So, to pick the low-hanging fruit, we should look for code that already matches that pattern, or that can be transformed to match it.
Something we can parallelize is the decoding of images: put all the image names in an array as input, provide an array of memory areas suitable to receive pixmap data as output, write the loop, annotate it with an OpenMP pragma, and that's done.
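A rough sketch of what that could look like; `Pixmap`, `DecodeImage` and `DecodeStageImages` are hypothetical names for illustration, not existing engine code:

```cpp
#include <string>
#include <vector>

// Hypothetical pixmap type standing in for whatever the engine uses.
struct Pixmap
{
    std::vector<unsigned char> pixels;
    int width = 0;
    int height = 0;
};

// Stub for the sketch; the real function would dispatch on the codec
// (PNG, Crunch, WebP, JPEG…) and must not make any GL call.
Pixmap DecodeImage(const std::string& path)
{
    (void)path;
    return Pixmap{};
}

void DecodeStageImages(const std::vector<std::string>& imagePaths,
                       std::vector<Pixmap>& pixmaps)
{
    pixmaps.resize(imagePaths.size());

    // Input and output are indexed the same way and iterations are
    // independent, so one pragma is enough to parallelize the decoding.
    #pragma omp parallel for
    for (int i = 0; i < (int)imagePaths.size(); i++)
    {
        pixmaps[i] = DecodeImage(imagePaths[i]);
    }
}
```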
For now we can't really do that for all images from a shader, or for all images from a map, because we iterate shader stages one by one, and once a shader stage is parsed we don't decode images anymore; it wouldn't be easy to change that.
What looks easier to achieve is to parallelize the loading of all images from a single shader stage. The good news is that we implemented pre-collapsed shaders a long time ago now:
For now we load all textures from a single stage one by one, as soon as we encounter one.
But then, to be able to configure image upload based on the blend function, which requires configuring image loading after the whole shader stage is parsed, I implemented a prototype of delayed pre-collapsed image loading:
This delayed pre-collapsed image loading already works. Every time a texture such as a diffuse map or a normal map is found in a shader stage, its path is added to an array of texture paths, and once the whole shader stage is parsed, all the images are loaded in a loop.
If we split the code that decodes an image from the code that uploads it, we can probably split that loop in two: one loop decoding the images and one loop uploading them, meaning we would be able to parallelize the decoding of the images.
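Reusing the hypothetical `Pixmap` and `DecodeImage` from the sketch above, the split could look like this; `UploadImage` is also a hypothetical name, standing for the GL upload path that has to stay on the main thread:

```cpp
// Hypothetical stand-in for the GL upload; must run on the thread
// that owns the GL context.
void UploadImage(const Pixmap& pixmap);

void LoadStageImages(const std::vector<std::string>& imagePaths)
{
    std::vector<Pixmap> pixmaps(imagePaths.size());

    // Pass 1: decode all images of the stage in parallel (CPU only).
    #pragma omp parallel for
    for (int i = 0; i < (int)imagePaths.size(); i++)
    {
        pixmaps[i] = DecodeImage(imagePaths[i]);
    }

    // Pass 2: upload the decoded images sequentially on the main thread.
    for (const Pixmap& pixmap : pixmaps)
    {
        UploadImage(pixmap);
    }
}
```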
If we can do that, then a pre-collapsed stage with, for example, 5 textures (a diffuse map, a normal map, a height map, a specular map and a glow map) could decode all 5 textures at once, 1 texture per thread, in parallel, reducing the decoding time from the sum of the 5 individual decode times to the longest single decode time of the 5.
Then we would upload all the decoded images from that stage, sequentially.