Conversation
…od performance gain.
There is an answer on Stack Overflow that could be worth trying: https://stackoverflow.com/questions/17761154/sse-reduction-of-float-vector If the input array is potentially large, it's worth having a scalar loop at the start, too, that runs 0-3 times until the input is aligned on a 16B boundary for the SSE loop. Then you won't have loads that cross cache/page lines slowing down your loop. And it can use ADDPS with a memory operand, which can potentially micro-fuse, reducing overhead. Also, you could get 2 or 4 dependency chains going, by using multiple accumulators, so your loop could sustain 1 vector FP add per cycle, instead of 1 per (latency of ADDPS = 3). – Peter Cordes Jul 5 '15 at 14:57
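To make the advice above concrete, here is a minimal sketch of that approach: a scalar prologue that runs until the pointer is 16-byte aligned, then an SSE loop with two independent accumulators so back-to-back ADDPS latency is hidden. The function name `sum_sse` is hypothetical, not from the project.

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of the comment's advice, not the project's code. */
float sum_sse(const float *a, size_t n) {
    float sum = 0.0f;
    size_t i = 0;

    /* Scalar prologue: runs 0-3 times until a+i is 16B-aligned,
       so the vector loop can use aligned loads. */
    while (i < n && ((uintptr_t)(a + i) & 15) != 0)
        sum += a[i++];

    /* Two independent accumulators = two dependency chains,
       hiding the 3-cycle ADDPS latency. */
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        acc0 = _mm_add_ps(acc0, _mm_load_ps(a + i));     /* aligned load */
        acc1 = _mm_add_ps(acc1, _mm_load_ps(a + i + 4));
    }
    __m128 acc = _mm_add_ps(acc0, acc1);

    /* Horizontal sum of the 4 lanes via two shuffle+add steps. */
    __m128 shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1));
    acc  = _mm_add_ps(acc, shuf);
    shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(1, 0, 3, 2));
    acc  = _mm_add_ps(acc, shuf);
    sum += _mm_cvtss_f32(acc);

    /* Scalar epilogue for the remaining < 8 elements. */
    for (; i < n; i++)
        sum += a[i];
    return sum;
}
```

Unrolling further to four accumulators follows the same pattern; the prologue also handles inputs that start misaligned, since the vector loop only begins once alignment is reached.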
It seems there is still something to do, because there is not much difference between using AVX (__m256) and AVX-512 (__m512). Here are the first 10 outputs of nn-benchmark using "-march=native" on this machine: Makefile:
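For reference, widening the reduction from SSE to AVX mostly just changes the register width, which is one reason the AVX and AVX-512 numbers can end up close when memory bandwidth, not ALU throughput, is the bottleneck. A minimal sketch of a 256-bit sum (the function name `sum_avx` and the use of unaligned loads are assumptions, not the project's code):

```c
#include <immintrin.h>  /* AVX + SSE3 intrinsics */
#include <stddef.h>

/* Hypothetical sketch; the target attribute lets this compile without
   passing -mavx globally (gcc/clang). */
__attribute__((target("avx")))
float sum_avx(const float *a, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;

    /* 8 floats per iteration; unaligned load keeps the sketch simple. */
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(a + i));

    /* Reduce 8 lanes: fold the high 128 bits into the low 128,
       then two horizontal adds collapse the remaining 4 lanes. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float sum = _mm_cvtss_f32(s);

    /* Scalar tail for the remaining < 8 elements. */
    for (; i < n; i++)
        sum += a[i];
    return sum;
}
```

An AVX-512 variant would again only widen the registers (__m512, 16 floats per step), so if the benchmark is bandwidth-bound the extra width buys little.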
Here are the nn-benchmark results on a low-end CPU: ./nn-benchmark-generic ./nn-benchmark-sse
And on a Nexus 5:
It seems that gcc can't auto-vectorize this, so using SIMD here gives us a good performance gain.