std::vector vs Vc::Memory

Wed Mar 5 08:49:52 CET 2014

Hi Tijskens,

On Tuesday 04 March 2014 17:57:25 Tijskens Engelbert wrote:
> // scalar loop using std::vector
>     for( int i=0; i<ne; ++i ) {
>         x[i] -= 1.0;
>     }
> // vector loop using std::vector
>     for( int i=0; i<ne; i+=Vc::float_v::Size )
>     {
>         Vc::float_v vx( &x[i] );
>         vx -= 1.0;
>         vx.store( &x[i] );
>     }

Just a note in case you didn't know yet: the aligned load/store here only 
works because you created a std::vector of a Vc SIMD vector type. If you have 
a struct S { float_v x; }; and use vector<S>, the aligned load/store is not 
guaranteed to access aligned pointers. Take a look at Vc::Allocator 
for structs.
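
To illustrate what an aligning allocator does for such a struct, here is a 
minimal sketch that compiles without Vc. It is not Vc::Allocator itself: the 
names S (with alignas(32) standing in for an AVX float_v member), 
AlignedAllocator and is_aligned are made up for this example, and C++17 is 
assumed for the aligned operator new. With Vc the container would presumably 
be declared as std::vector<S, Vc::Allocator<S>>; please check the Vc 
documentation for the exact usage.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical stand-in for a struct with a SIMD member: alignas(32) mimics
// the 32-byte alignment of an AVX register.
struct alignas(32) S {
    float x[8];
};

// Minimal allocator that requests storage aligned to alignof(T).
// Illustration only, NOT the Vc::Allocator implementation.
template <typename T>
struct AlignedAllocator {
    using value_type = T;

    AlignedAllocator() = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U> &) noexcept {}

    T *allocate(std::size_t n) {
        // C++17 aligned operator new; throws std::bad_alloc on failure
        void *p = ::operator new(n * sizeof(T), std::align_val_t(alignof(T)));
        return static_cast<T *>(p);
    }
    void deallocate(T *p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t(alignof(T)));
    }
};

// The allocator is stateless, so all instances compare equal.
template <typename T, typename U>
bool operator==(const AlignedAllocator<T> &, const AlignedAllocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const AlignedAllocator<T> &, const AlignedAllocator<U> &) { return false; }

// True if p is a multiple of `alignment` bytes.
inline bool is_aligned(const void *p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A std::vector<S, AlignedAllocator<S>> then hands out element addresses that 
pass is_aligned(&v[i], alignof(S)), which is the property an aligned 
load/store relies on.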

> // vector loop using Vc::Memory instead of std::vector
>     Vc::Memory<Vc::float_v,ne> Vx;
> [...]
>     for( int i=0; i<nv; ++i ) {
>         Vx.vector(i) -= one;
>     }
> When I time these loops I get the following results:
> scalar loop using std::vector: 2162 cycles/repetition,
>     9.4e-07 seconds/repetition, 1x speedup, 2.3 GHz, 100 repetitions.
> vector loop using std::vector: 357 cycles/repetition,
>     1.6e-07 seconds/repetition, 6.04x speedup, 2.24 GHz, 100 repetitions.
> vector loop using Vc::Memory: 288 cycles/repetition,
>     1.2e-07 seconds/repetition, 7.49x speedup, 2.4 GHz, 100 repetitions.

Questions:
* 1024 entries may be too few iterations to hide the offset you get from the 
timing methods themselves. Even the rdtsc instruction makes no guarantee 
about its overhead, AFAIK. What exactly are you measuring?
* You report a frequency. Is that the quotient of cycles and seconds? I 
recommend that you instead disable turbo mode and power management (see the 
attached script for how I do that).
* What CPU did you use?
* What compiler and compiler flags did you use?
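
One way to check the first point is to time an empty callable with the same 
harness and compare that to the time of the real loop. The following is a 
minimal sketch using std::chrono instead of rdtsc; min_time_ns and 
timer_overhead_ns are names invented for this example and are not part of Vc 
or of the benchmark code under discussion.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Smallest observed wall-clock time in nanoseconds of f() over
// `repetitions` runs. Taking the minimum reduces noise from interrupts
// and frequency changes.
template <typename F>
std::int64_t min_time_ns(F &&f, int repetitions) {
    assert(repetitions > 0);
    using clock = std::chrono::steady_clock;
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < repetitions; ++r) {
        const auto t0 = clock::now();
        f();
        const auto t1 = clock::now();
        const std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        if (ns < best) best = ns;
    }
    return best;
}

// Cost of the measurement itself: time a callable that does nothing.
inline std::int64_t timer_overhead_ns(int repetitions = 1000) {
    return min_time_ns([] {}, repetitions);
}
```

If timer_overhead_ns() is not small compared to the roughly 1e-07 seconds 
reported per repetition, the loop should run over more data or be repeated 
inside the timed region.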

Answers:
* Your speedups are already quite good. In my experience, compilers don't 
optimize the code you showed that well (I'm used to having to unroll 
manually).
* Do I understand correctly that the scalar loop takes 2162 cycles to subtract 
1.0 from 1024 floats? Optimal would be ~1024 cycles then... (~256 for the AVX 
loops - see next point)
* Note that Intel SandyBridge CPUs can only do one SSE vector store per cycle 
(i.e. two cycles for one AVX store). So if you're looking for an 8-fold 
speedup with SandyBridge you need to look for a problem with fewer stores. 
IIRC the store bandwidth is doubled on Haswell. I don't recall the figure for 
IvyBridge right now.
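
The cycle estimates in the last two points follow from counting stores only. 
A small sketch of that arithmetic, where store_bound_cycles is a helper made 
up for this example and the per-store costs are the SandyBridge figures 
stated above (one cycle per scalar or SSE store, two per AVX store):

```cpp
#include <cassert>

// Lower bound on the cycle count of a loop limited only by store throughput.
//   elements         : number of floats written
//   floats_per_store : 1 (scalar), 4 (SSE), 8 (AVX)
//   cycles_per_store : 1 for scalar/SSE, 2 for AVX (SandyBridge, as stated above)
constexpr int store_bound_cycles(int elements, int floats_per_store,
                                 int cycles_per_store) {
    assert(floats_per_store > 0);
    // round up to whole stores, then multiply by the cost of one store
    return (elements + floats_per_store - 1) / floats_per_store * cycles_per_store;
}

static_assert(store_bound_cycles(1024, 1, 1) == 1024, "scalar bound");
static_assert(store_bound_cycles(1024, 4, 1) == 256, "SSE bound");
static_assert(store_bound_cycles(1024, 8, 2) == 256, "AVX bound");
```

Under these assumptions the best possible ratio between an optimal scalar 
loop and an optimal AVX loop is 1024 / 256 = 4, and the measured 288 cycles 
for the Vc::Memory loop are 1.125 times the 256-cycle bound.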

> is there a way to improve the vector loop using std::vector? By the way if i
> write the second loop as // vector loop using std::vector
>     for( int i=0; i<ne; i+=Vc::float_v::Size )
>     {
>         Vc::float_v vx( &x[i] );
>         Vx.vector(i) -= one;
>         vx.store( &x[i] );
>     }
> things get even worse, the speedup being only 4.2x roughly.

Hard to say what the compiler does here. But I'd guess you get the speedup 
back if you unroll.
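
For reference, manual unrolling by four looks roughly like the sketch below. 
Plain floats stand in for Vc::float_v so that the example compiles without 
Vc; with Vc each of the four statements would be one vector load, subtract 
and store, and the step would be 4 * Vc::float_v::Size. The function name 
subtract_unrolled4 is invented for this example, and whether unrolling pays 
off has to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Subtract `value` from every element, manually unrolled by four.
void subtract_unrolled4(std::vector<float> &x, float value) {
    const std::size_t n = x.size();
    const std::size_t n4 = n - n % 4;  // largest multiple of 4 not exceeding n
    std::size_t i = 0;
    for (; i < n4; i += 4) {
        // four independent load/subtract chains per iteration
        const float a = x[i + 0] - value;
        const float b = x[i + 1] - value;
        const float c = x[i + 2] - value;
        const float d = x[i + 3] - value;
        x[i + 0] = a;
        x[i + 1] = b;
        x[i + 2] = c;
        x[i + 3] = d;
    }
    // remainder loop for sizes that are not a multiple of four
    for (; i < n; ++i) {
        x[i] -= value;
    }
}
```

The independent temporaries give the compiler and the CPU several operations 
to overlap per iteration instead of one dependent load/modify/store chain.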

Cheers,
  Matthias

-- 
─────────────────────────────────────────────────────────────
 Dipl.-Phys. Matthias Kretz

 Web:   http://compeng.uni-frankfurt.de/?mkretz

 SIMD easy and portable: http://compeng.uni-frankfurt.de/?vc
─────────────────────────────────────────────────────────────
-------------- next part --------------
A non-text attachment was scrubbed...
Name: benchmarking.sh
Type: application/x-shellscript
Size: 696 bytes
Desc: not available
URL: <http://compeng.uni-frankfurt.de/pipermail/vc/attachments/20140305/ad4bc2da/attachment.bin>