Always operating on just the first two elements of a 4-sized vector

Thu Mar 6 11:22:50 CET 2014

Hi all,

I was wondering whether it is possible to use only the first two
elements of a vector of size 4, for example when using AVX with doubles.
The obvious answer would be to use a mask; however, this results in
degraded performance compared to SSE.

On many CPU architectures, division appears to take longer when operating
on 4 elements than when operating on just 2. For example, on Haswell DIVPD
has a latency of 10-20 cycles, while the 256-bit VDIVPD takes 19-35 (from
http://www.agner.org/optimize/instruction_tables.pdf). Things are similar
for square root.

Therefore, in this case one would want to emit SSE instructions instead of
AVX ones. A question might be "why not utilize all 4 elements?"
Unfortunately, the application is such that only 2 elements can be filled
at a time.
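
To make the idea concrete, here is a minimal sketch that bypasses the
vector type entirely and uses the 128-bit intrinsics from <emmintrin.h> for
the two-element division and square root. The helper name div_sqrt2 is
hypothetical and not part of Vc. The assumption is that, even with AVX
enabled (e.g. -mavx), compilers such as GCC and Clang keep these operations
128 bits wide (typically as VEX-encoded instructions on xmm registers)
rather than widening them to 256 bits. Whether this recovers the SSE timing
would have to be measured. A scalar fallback is included so the sketch also
compiles on targets without SSE2.

```cpp
#include <cmath>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2: __m128d, _mm_div_pd, _mm_sqrt_pd
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper: out[i] = sqrt(a[i] / b[i]) for i = 0, 1.
// Only two doubles are ever read or written.
inline void div_sqrt2(const double* a, const double* b, double* out) {
#ifdef SKETCH_HAVE_SSE2
    __m128d va = _mm_loadu_pd(a);     // unaligned load of a[0], a[1]
    __m128d vb = _mm_loadu_pd(b);     // unaligned load of b[0], b[1]
    __m128d q  = _mm_div_pd(va, vb);  // 128-bit packed divide
    _mm_storeu_pd(out, _mm_sqrt_pd(q));
#else
    // Portable fallback for targets without SSE2.
    for (int i = 0; i < 2; ++i) {
        out[i] = std::sqrt(a[i] / b[i]);
    }
#endif
}
```

Called as div_sqrt2(a, b, out) with three arrays of two doubles each, this
leaves the rest of the translation unit free to use 256-bit vectors.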

The difference in performance can be large. This results in the
paradoxical situation in which my code, which only uses 2 doubles of each
vector, is 50% faster when compiled with SSE than with AVX.

On the other hand, other vectorized code in the same binary might use the
larger vectors and benefit from AVX, so I don't want to resort to
compiling with SSE only.

Is there any way around this?

Cheers,
Georgios