Proposal load/store masked #1162
Conversation
(force-pushed from 1124f01 to 53da643)
Interesting. Before going into the details, what are the memory effects of a masked load in terms of memory read and value stored? Are the masked elements set to 0? To an undefined value? AVX etc. seem to set the value to zero. I'm not sold on the common implementation, which looks quite heavy in scalar operations. I can see that we can't do a plain load followed by an and, because it could lead to accesses to unallocated memory. If the mask were constant, we could statically optimize some common patterns, but with a dynamic mask as you propose...
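To make the safety concern concrete, here is a minimal sketch with raw AVX intrinsics (the function names are illustrative, not part of this PR): a full-width load followed by a blend can fault, while a true masked load cannot.

```cpp
#include <immintrin.h>

// UNSAFE: reads all 32 bytes regardless of the mask, which can fault
// if `ptr` points near the end of a mapped page.
inline __m256 load_then_and(const float* ptr, __m256 mask) noexcept
{
    __m256 v = _mm256_loadu_ps(ptr); // may touch unallocated memory
    return _mm256_and_ps(v, mask);
}

// SAFE: AVX maskload only accesses the selected lanes and zeroes the rest.
inline __m256 load_masked(const float* ptr, __m256i mask) noexcept
{
    return _mm256_maskload_ps(ptr, mask);
}
```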
Some thoughts:
In general, I use these operations when I want to vectorize an inner loop whose length is not a multiple of the SIMD width. This is a small inner loop nested in a loop executed many times. Depending on the operations, padding is sometimes slower than masking.
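For illustration, the shape of such a loop might look like the sketch below; `make_tail_mask`, `load_masked`, and `store_masked` are hypothetical stand-ins for the API under discussion, not its actual spelling.

```cpp
#include <cstddef>

template <class Batch, class T>
void scale(T* data, std::size_t n, T factor)
{
    constexpr std::size_t width = Batch::size;
    std::size_t i = 0;
    for (; i + width <= n; i += width) // full-width main loop
    {
        auto v = Batch::load_unaligned(data + i);
        (v * factor).store_unaligned(data + i);
    }
    if (i < n) // masked tail handling the n % width leftover elements
    {
        auto m = make_tail_mask<Batch>(n - i);    // hypothetical helper
        auto v = load_masked<Batch>(data + i, m); // hypothetical API
        store_masked(data + i, v * factor, m);    // hypothetical API
    }
}
```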
(force-pushed from 2ae15bf to 510c335)
This is starting to take shape now. I still need to clean up, but this is a first working implementation. I extended the mask operations in batch_bool to mimic std::bit so that I could reuse the code for all masked operations. I feel that having those live outside the class in detail:: is not ideal. They are also useful to the user, I think.
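The kind of helper being described might look like the following sketch (written in C++14 constexpr style for brevity; the PR's actual names and placement may differ):

```cpp
#include <cstddef>

namespace detail
{
    // Number of set lanes, analogous to std::popcount in <bit>.
    template <std::size_t N>
    constexpr std::size_t popcount(const bool (&mask)[N])
    {
        std::size_t c = 0;
        for (std::size_t i = 0; i < N; ++i)
            c += mask[i] ? 1 : 0;
        return c;
    }

    // Length of the leading run of set lanes, analogous to std::countr_one.
    // A mask is a pure prefix mask iff countr_one(mask) == popcount(mask).
    template <std::size_t N>
    constexpr std::size_t countr_one(const bool (&mask)[N])
    {
        std::size_t c = 0;
        while (c < N && mask[c])
            ++c;
        return c;
    }
}
```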
(force-pushed from fd0e777 to ecf4291)
@serge-sans-paille do you have any clue why mingw32 fails? I need to do a final pass to make the code coherent (I am mixing reinterpret_casts and C-style casts). After that I think it will be ready for review.
(force-pushed from ecf4291 to e88369f)
}
else
{
    // Fallback to runtime path for non-prefix/suffix masks
Unless the compiler optimizes those out, the dynamic load is going to perform several checks that will always be false (none / all).
I'm still not convinced of the relevance of dynamic mask support. Maybe we could start small with "just" static mask support and add a dynamic version later if it turns out to be needed? But maybe you have a specific case in mind?
It is just that, in my mind, I can offer a public runtime mask API; then the compile-time mask optimizes as much as it can and, if there is no alternative, calls the runtime mask. Same as swizzle. So I was not planning on optimizing the runtime path at all: we just get a masked load/store where possible, or the common implementation, since SSE does not support masking.
PS:
This can be optimized in the future by using AVX on SSE, or AVX512VL, but I will open a discussion once this is merged.
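The dispatch being described could be sketched as follows (written with C++17 folds and `if constexpr` for brevity; the batch_bool_constant signature and the `load_masked_common` fallback name are illustrative, not the PR's actual internals):

```cpp
template <class T, class A, bool... Values>
batch<T, A> load_masked(const T* mem, batch_bool_constant<T, A, Values...> mask)
{
    constexpr bool any = (Values || ...);
    constexpr bool all = (Values && ...);
    if constexpr (!any)
        return batch<T, A>(T(0));                // nothing to read
    else if constexpr (all)
        return batch<T, A>::load_unaligned(mem); // plain full-width load
    else
        return load_masked_common(mem, mask);    // generic fallback (hypothetical)
}
```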
I am cleaning this up to only offer compile-time masks for now; the runtime version can be a separate PR.
(force-pushed from e535375 to 4c457ae)
I don't think the error is due to anything related to this PR. But I like the looks of it now. Ready for review!
(force-pushed from 4c457ae to ec2b824)
(force-pushed from ec2b824 to 128a080)
(force-pushed from 128a080 to b2a0b28)
constexpr std::array<bool, size> mask { Values... };

for (std::size_t i = 0; i < size; ++i)
    buffer[i] = mask[i] ? static_cast<T_out>(mem[i]) : T_out(0);
Question: do we want a defined value in non-masked slots, or an undefined value? I guess it depends on how the various archs behave...
NEON does not support masked loads natively anyway; I guess zeroing is OK.
AFAIK x86 defaults to 0, so I kept it consistent.
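For reference, a small example of that x86 behaviour (Intel documents VMASKMOV as zeroing the masked-off lanes):

```cpp
#include <immintrin.h>

inline __m256 load_first_three(const float* p) noexcept
{
    // A lane is loaded when the MSB of its mask element is set;
    // all other lanes are set to 0.0f by the hardware.
    const __m256i mask = _mm256_setr_epi32(-1, -1, -1, 0, 0, 0, 0, 0);
    return _mm256_maskload_ps(p, mask); // {p[0], p[1], p[2], 0, 0, 0, 0, 0}
}
```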
{
    constexpr std::size_t size = batch<T_out, A>::size;
    alignas(A::alignment()) std::array<T_out, size> buffer {};
    constexpr std::array<bool, size> mask { Values... };
It seems to me that operator[] is not constexpr for std::array in C++11. Let's use a plain C array then?
Yes, I'll replace it. It is only constexpr from C++14/17.
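For the record, the difference the comment refers to: indexing a plain constexpr C array is already a constant expression in C++11, while std::array::operator[] only became constexpr later.

```cpp
constexpr bool mask[] = { true, false, true, false };
static_assert(mask[0], "indexable in a constant expression in C++11");
// std::array<bool, 4>{...}[0] in the same position would require C++14.
```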
include/xsimd/arch/xsimd_avx.hpp
Outdated
    return _mm256_insertf128_pd(_mm256_castpd128_pd256(low), high, 1);
}
template <class T>
XSIMD_INLINE batch<T, sse4_2> lower_half(batch<T, avx> const& self) noexcept
This seems recursive. Don't you want to use self.data?
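What the reviewer suggests would, roughly, extract the low 128 bits of the underlying register instead of re-invoking lower_half; a sketch for the float case (the PR's actual fix may differ):

```cpp
XSIMD_INLINE batch<float, sse4_2> lower_half(batch<float, avx> const& self) noexcept
{
    // Reinterpret the low 128 bits of the AVX register; no shuffle is emitted.
    return _mm256_castps256_ps128(self.data);
}
```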
// Convenience helpers for half splits (require even size and appropriate target arch)
template <class A2>
static constexpr auto lower_half() noexcept
I'm not a big fan of this; batch_constant sizes are tied to an architecture...
Maybe we could put some of the functions below as free functions somewhere?
I am not against moving splice outside the class. I implemented it here because it avoids passing the current architecture. It can go in xsimd::common somewhere, maybe not common memory, as swizzle or other APIs might use this in the future.
It could stay in the same file, just as free functions?
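A self-contained sketch of the free-function idea (C++14, with types simplified to a plain bool-pack holder; the real code would return a batch_bool_constant on the half-width architecture deduced from the argument):

```cpp
#include <cstddef>
#include <utility>

template <bool... Values>
struct mask_pack
{
};

template <bool... Values, std::size_t... Is>
constexpr auto lower_half_impl(mask_pack<Values...>, std::index_sequence<Is...>)
{
    constexpr bool values[] = { Values... };
    return mask_pack<values[Is]...> {};
}

// Keeps the first sizeof...(Values) / 2 lanes of the mask.
template <bool... Values>
constexpr auto lower_half(mask_pack<Values...> m)
{
    return lower_half_impl(m, std::make_index_sequence<sizeof...(Values) / 2> {});
}

// e.g. lower_half(mask_pack<true, true, false, false>{}) yields mask_pack<true, true>{}
```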
1. Adds a new masked API with compile-time masks (store_masked and load_masked)
2. General use case optimization
3. New tests
4. x86 kernels
5. Adds new APIs to batch_bool_constant for convenience, resembling #include <bit>
6. Tests the new APIs
(force-pushed from b2a0b28 to 3f3acf5)
@DiamonDinoia we plan to do a release in the forthcoming weeks. Do you want this PR to be part of it?

Only if it does not delay the release by much. I would like to have a release by mid-November; this gives me two weeks to update my downstream projects for an end-of-year release. PS: I am planning on addressing the swizzle issue after this PR.
Dear @serge-sans-paille,
I made a rough implementation of masked load/store. Before I flesh it out, can I have some early feedback?
Thanks,
Marco