performance of gather/scatter with different types of indexes

Mon May 15 17:26:44 CEST 2017

On 15.05.2017 at 09:56, Matthias Kretz wrote:
>
> my guess:
> ST::IndexType is an alias for SimdArray<int, ?>. So, it's typically passed as
> one or two SIMD registers to the gather/scatter functions. However, I have not
> implemented gathers with the existing AVX2 intrinsics yet, so you get the
> scalar fallback implementation in all cases. Meaning, it has to read the
> scalar elements of the SimdArray. Thus, it has to do the same as for the
> TinyVector, except that the TinyVector is possibly easier to optimize for the
> compiler since the scalars are already passed around in memory or even general
> purpose registers.

This must be what happens. I wasn't aware that the AVX2
gather/scatter implementation is still missing.

But I reckon the AVX gathers/scatters are there? I compiled with -mavx,
tried both versions and could not detect a significant difference. My
gathers/scatters, though, access quite widely spread-out memory
locations. Maybe the code is so memory-bound that the speed difference
between the intrinsics and the scalar fallback implementation is not
really visible, because it accounts for only a small portion of the
execution time.

> If you are interested in optimizing gathers I'd be happy to help you with
> resolving https://github.com/VcDevel/Vc/issues/32.

I should, really, because my code uses gathers and scatters all the time.

Do I need a GitHub account to actually see the issue? I can't see
anything but the issue page with the title and some sort of history
which I can't access.

Can you point me to the relevant bit of code so that I can get a feel
for what would be required? If it's implemented for AVX, maybe I can
just take that as a template and adapt it.

Kay