Conversation

@raphlinus (Contributor)

This is a draft, I'm still working on it. I'll likely create another subdirectory for tests and move this there, as overwriting the main hello example is not very good form. But I'm doing it for expedience.

The version at tip of tree as I write this (87e5b20) works well on AMD 5700 XT. In fact, it works very well - I'm seeing 36.4 billion elements/s, which is excellent. It's within a sliver of a compute shader that just copies input to output, and looking at GPU counters suggests that memory bandwidth is pretty well saturated.
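As a rough sanity check on that throughput figure (assuming 4-byte elements and one read plus one write per element; the 448 GB/s figure is the RX 5700 XT's theoretical peak, and real traffic may be somewhat higher once partition state is counted):

```rust
fn main() {
    // Measured throughput from the run described above.
    let elements_per_sec: f64 = 36.4e9;
    // Assumption: u32 elements, each read once and written once.
    let bytes_per_element = 4.0 * 2.0;
    let gb_per_sec = elements_per_sec * bytes_per_element / 1e9;
    // Rough theoretical peak for an RX 5700 XT (14 Gbps GDDR6, 256-bit bus).
    let peak_gb_per_sec = 448.0;
    println!(
        "effective bandwidth: {:.1} GB/s ({:.0}% of theoretical peak)",
        gb_per_sec,
        100.0 * gb_per_sec / peak_gb_per_sec
    );
}
```

That works out to about 291 GB/s of useful traffic, which is in the same ballpark as what a plain copy shader achieves in practice.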

This version also makes some progress on each spin, so does not depend on strong forward progress guarantees from the GPU.
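For readers unfamiliar with decoupled look-back, here is a minimal CPU model of the basic scheme (not the progress-robust variant described above): each "workgroup" scans one partition, publishes flag and value in a single atomic word, and spins on predecessors until it finds an inclusive prefix. The constants and function names are hypothetical, and preemptive OS threads are what make the spin safe here, unlike on a GPU.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Flag packed into the high bits of one 64-bit word, with the 32-bit
// partition sum in the low bits, so one atomic store publishes both.
const AGGREGATE: u64 = 1 << 32;
const INCLUSIVE: u64 = 2 << 32;

// Inclusive prefix sum with one thread per partition of size `part`
// (input length is assumed to be a multiple of `part`).
fn decoupled_scan(input: &[u32], part: usize) -> Vec<u32> {
    let n_parts = input.len() / part;
    let state: Vec<AtomicU64> = (0..n_parts).map(|_| AtomicU64::new(0)).collect();
    let mut output = vec![0u32; input.len()];
    thread::scope(|scope| {
        for (i, (chunk_in, chunk_out)) in
            input.chunks(part).zip(output.chunks_mut(part)).enumerate()
        {
            let state = &state;
            scope.spawn(move || {
                let aggregate: u32 = chunk_in.iter().sum();
                if i == 0 {
                    state[0].store(INCLUSIVE | aggregate as u64, Ordering::SeqCst);
                } else {
                    state[i].store(AGGREGATE | aggregate as u64, Ordering::SeqCst);
                }
                // Look back: spin until each predecessor has published,
                // accumulating aggregates until an inclusive prefix appears.
                let mut exclusive = 0u32;
                let mut j = i;
                while j > 0 {
                    j -= 1;
                    let s = loop {
                        let s = state[j].load(Ordering::SeqCst);
                        if s != 0 { break s; }
                        std::hint::spin_loop();
                    };
                    exclusive += s as u32; // low 32 bits hold the value
                    if s & INCLUSIVE != 0 {
                        break;
                    }
                }
                if i != 0 {
                    state[i].store(
                        INCLUSIVE | (exclusive + aggregate) as u64,
                        Ordering::SeqCst,
                    );
                }
                // Local inclusive scan, offset by the exclusive prefix.
                let mut running = exclusive;
                for (o, &x) in chunk_out.iter_mut().zip(chunk_in) {
                    running += x;
                    *o = running;
                }
            });
        }
    });
    output
}

fn main() {
    let input: Vec<u32> = (1..=32).collect();
    let result = decoupled_scan(&input, 4);
    // Inclusive prefix sum of 1..=n at index k is (k+1)(k+2)/2.
    assert!(result
        .iter()
        .enumerate()
        .all(|(k, &v)| v == (k as u32 + 1) * (k as u32 + 2) / 2));
    println!("ok: {:?}", &result[..8]);
}
```

The spin here blocks until a specific predecessor publishes, which is exactly where a GPU without forward progress guarantees can wedge; the version in the PR avoids depending on that by doing useful work on each spin iteration.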

That said, I am employing the atomicOr workaround for the atomic bugs I'm seeing, otherwise I get both incorrect results and hangs (try N_DATA = 1 << 17 for a nice mix of the two). I will probably work on a simplified version of the test to exercise the atomic problems without bringing in all of the complexity of full prefix sum.
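For context, the workaround replaces atomic loads with an atomicOr against zero: ORing with zero leaves the value unchanged but routes the read through the atomic read-modify-write path, which appears to sidestep whatever is going wrong with the plain atomic load (the exact failure was still being diagnosed). In Rust's atomics the same trick looks like this; a sketch of the idea, not the WGSL itself:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

fn main() {
    let flag = AtomicU32::new(42);

    // Ordinary atomic load.
    let a = flag.load(Ordering::Acquire);

    // The workaround: fetch_or with 0 reads the same value, but as an
    // atomic read-modify-write rather than a plain atomic load.
    let b = flag.fetch_or(0, Ordering::Acquire);

    assert_eq!(a, b);
    println!("load = {a}, fetch_or(0) = {b}");
}
```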

Sorta works but deadlocks on larger inputs.
Still doesn't fix deadlocks tho :/
Still WIP
Fastest results on AMD at workgroup size 1024. Note: this includes the
atomicOr workaround for correctness.

Also note, not all targets will support a workgroup of this size; in a
shipping version we'd need to query device limits and select the size at runtime.
Do a small sequential scan at the leaf of the hierarchy. That amortizes
both the workgroup-scope tree reduction and the (still sequential)
decoupled look-back over a larger number of inputs.
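The amortization can be sketched on the CPU: each thread first scans SEQ elements serially, only the per-thread totals go through the workgroup-scope scan, so the expensive phases run once per SEQ inputs rather than once per input. A hypothetical single-threaded model (WG and SEQ are illustrative, not the shader's constants):

```rust
// Hypothetical model of a workgroup scan with a sequential leaf phase.
const WG: usize = 8;  // workgroup size (1024 in the real shader)
const SEQ: usize = 4; // elements scanned sequentially per thread

fn workgroup_scan(input: &[u32; WG * SEQ]) -> [u32; WG * SEQ] {
    // Phase 1: each "thread" scans its SEQ elements sequentially.
    let mut out = [0u32; WG * SEQ];
    let mut totals = [0u32; WG];
    for t in 0..WG {
        let mut running = 0u32;
        for k in 0..SEQ {
            running += input[t * SEQ + k];
            out[t * SEQ + k] = running;
        }
        totals[t] = running;
    }
    // Phase 2: exclusive scan of the WG per-thread totals. This is the
    // part a real shader does as a log-depth tree in workgroup memory,
    // and it now touches WG values instead of WG * SEQ.
    let mut offset = 0u32;
    let mut offsets = [0u32; WG];
    for t in 0..WG {
        offsets[t] = offset;
        offset += totals[t];
    }
    // Phase 3: each thread adds its offset back onto its SEQ results.
    for t in 0..WG {
        for k in 0..SEQ {
            out[t * SEQ + k] += offsets[t];
        }
    }
    out
}

fn main() {
    let input = [1u32; WG * SEQ];
    let out = workgroup_scan(&input);
    // Inclusive scan of all-ones is 1, 2, 3, ...
    assert!(out.iter().enumerate().all(|(i, &v)| v == i as u32 + 1));
    println!("{:?}", &out[..8]);
}
```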

Note: this falls short of a real performance evaluation because there's
no attempt to warm up the GPU clock. But it's valid as a very rough
swag.
Better for performance analysis
Performance measurement requires keeping the GPU busy. That means not
copying results back to the CPU and doing verification there.
Naga will accept ordinary loads and stores to atomic types, but tint
will not.