FlatMap: add method for batched GPU construction #1610
Conversation
kennyweiss left a comment
Thanks @publixsubfan !
Could you please post some performance results?
Please don't forget to update the RELEASE-NOTES.
src/axom/core/FlatMap.hpp (Outdated)

```diff
@@ -721,7 +745,7 @@ FlatMap<KeyType, ValueType, Hash>::FlatMap(IndexType bucket_count, Allocator all
   // TODO: we should add a countl_zero overload for 64-bit integers
   {
     std::int32_t numGroups = std::ceil((bucket_count + 1) / (double)BucketsPerGroup);
-    m_numGroups2 = 31 - (axom::utilities::countl_zero(numGroups));
+    m_numGroups2 = 32 - (axom::utilities::countl_zero(numGroups));
```
This change seems subtle/hard-won. Do we have a unit test targeting this line?
Yeah, I added a new test here: https://github.com/LLNL/axom/blob/d64fa39c4229f7c21b1c246cab78ef9ca0284dc2/src/axom/core/tests/core_flatmap.hpp#L117-L129
```cpp
AXOM_HOST_DEVICE bool tryLock()
{
  int still_locked = 0;
```
Any chance the axom atomics can be used/updated to handle/help with this logic?
(Mostly b/c that could harden the axom atomics. If you think this is a one-off and not useful elsewhere, it's fine as is)
I think adding this to the axom atomics would be dependent on support from within RAJA for atomics with memory ordering. Otherwise the logic to implement that might get a little nasty.
IIRC, RAJA default atomics don't support memory ordering. RAJA can be configured to use desul atomics, which do support memory ordering. Unfortunately, we only support using those through the original RAJA atomic interface and so we only provide a default we define: https://github.com/LLNL/RAJA/blob/develop/include/RAJA/policy/desul/atomic.hpp#L22.
We should revisit whether we want to switch to desul atomics by default in RAJA. I think the last time we discussed this, there were still some cases where RAJA atomics were faster than desul. If we did switch to desul by default (which is what Kokkos uses), then we could support the full desul interface.
@publixsubfan let me know if you think we should go this route.
Maybe we could play around with a partial desul default? Something like "default for ordered atomics, but use the original backend for unordered"
I did have a PR for the ordered atomics here: llnl/RAJA#1616, if we wanted to try and clean that up.
Thanks -- since this is somewhat of a one-off and it's not super easy to consolidate it into axom::atomics, I think it's fine as is.
```cpp
for(int i = 0; i < NUM_ELEMS; i++)
{
  auto key = this->getKey(i);
  auto value = this->getValue(i * 10.0 + 5.0);
```
Would it make sense to have a test that has repeated value entries?
I'd expect it to be handled properly, but might be good to add a test for it anyway.
@kennyweiss -- here are some performance graphs for construction and querying. These were "scaled" to be node-to-node comparisons, meaning we multiplied the ATS-2/ATS-4 numbers for each run by 4 to account for the 4 sockets. For CTS-2, we ran 2 MPI ranks with



Summary
Adds a method `FlatMap::create()` for constructing a flat hash map over a batch of corresponding key-value pairs. `ExecSpace` is used to specify whether the batched construction happens on the CPU, GPU, or over multiple threads via OpenMP. The passed-in `allocator` argument must be accessible from the specified execution space.

Also adds a benchmark example for `FlatMap`, which tests the performance of both batched insertion and lookups.