std::vector vs Vc::Memory

Fri Mar 7 15:13:01 CET 2014

GCC 4.8.1 results:

std::vector<float> vector : 296 cycles/repetition,
1.061e-07 seconds/repetition, 1x speedup,
2.79 GHz, 1000000 repetitions.
float * vector : 312 cycles/repetition,
1.119e-07 seconds/repetition, 0.948x speedup,
2.79 GHz, 1000000 repetitions.
Vc::vector<float> -one from std::vector : 288 cycles/repetition,
1.031e-07 seconds/repetition, 1.03x speedup,
2.79 GHz, 1000000 repetitions.
Vc::vector<float> -one from plain array : 289 cycles/repetition,
1.038e-07 seconds/repetition, 1.02x speedup,
2.79 GHz, 1000000 repetitions.
Vc::vector with -1 : 297 cycles/repetition,
1.063e-07 seconds/repetition, 0.997x speedup,
2.79 GHz, 1000000 repetitions.
Vc::vector with -1 and plain array : 297 cycles/repetition,
1.066e-07 seconds/repetition, 0.995x speedup,
2.79 GHz, 1000000 repetitions.
Vc::memory : 293 cycles/repetition,
1.051e-07 seconds/repetition, 1.01x speedup,
2.79 GHz, 1000000 repetitions.
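
The columns in these listings appear to be related by simple arithmetic. The sketch below assumes (the benchmark output does not state it) that seconds/repetition is the cycle count divided by the clock frequency, and that the speedup is the cycle count of the first row (std::vector<float>) divided by the cycle count of the row in question. The function names are hypothetical and not taken from the benchmark or from timer.h.

```cpp
#include <cassert>
#include <cmath>

// Assumed relation: seconds per repetition = cycles / clock frequency (Hz).
double seconds_per_repetition(double cycles, double hz) {
    assert(hz > 0.0);
    return cycles / hz;
}

// Assumed relation: speedup = baseline cycles / variant cycles,
// with the std::vector<float> row taken as the baseline.
double speedup(double baseline_cycles, double variant_cycles) {
    assert(variant_cycles > 0.0);
    return baseline_cycles / variant_cycles;
}
```

With the -O3 numbers, seconds_per_repetition(296, 2.79e9) is about 1.061e-07 and speedup(296, 288) is about 1.03, which matches the first and third rows. Small deviations in other rows are to be expected, since the printed cycle counts are rounded.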

CXXFLAGS := -O3 -fabi-version=0 -march=native -funroll-loops -std=c++11

With -O2 instead of -O3 I get:

std::vector<float> vector : 1564 cycles/repetition,
5.601e-07 seconds/repetition, 1x speedup,
2.79 GHz, 1000000 repetitions.
float * vector : 1572 cycles/repetition,
5.628e-07 seconds/repetition, 0.995x speedup,
2.79 GHz, 1000000 repetitions.
Vc::vector<float> -one from std::vector : 301 cycles/repetition,
1.078e-07 seconds/repetition, 5.19x speedup,
2.79 GHz, 1000000 repetitions.
Vc::vector<float> -one from plain array : 306 cycles/repetition,
1.096e-07 seconds/repetition, 5.11x speedup,
2.79 GHz, 1000000 repetitions.
Vc::vector with -1 : 304 cycles/repetition,
1.09e-07 seconds/repetition, 5.14x speedup,
2.79 GHz, 1000000 repetitions.
Vc::vector with -1 and plain array : 301 cycles/repetition,
1.079e-07 seconds/repetition, 5.19x speedup,
2.79 GHz, 1000000 repetitions.
Vc::memory : 266 cycles/repetition,
9.527e-08 seconds/repetition, 5.88x speedup,
2.79 GHz, 1000000 repetitions.

And additionally without -funroll-loops:

std::vector<float> vector : 2127 cycles/repetition,
7.616e-07 seconds/repetition, 1x speedup,
2.79 GHz, 1000000 repetitions.
float * vector : 2128 cycles/repetition,
7.619e-07 seconds/repetition, 1x speedup,
2.79 GHz, 1000000 repetitions.
Vc::vector<float> -one from std::vector : 353 cycles/repetition,
1.266e-07 seconds/repetition, 6.01x speedup,
2.79 GHz, 1000000 repetitions.
Vc::vector<float> -one from plain array : 347 cycles/repetition,
1.243e-07 seconds/repetition, 6.13x speedup,
2.79 GHz, 1000000 repetitions.
Vc::vector with -1 : 317 cycles/repetition,
1.138e-07 seconds/repetition, 6.69x speedup,
2.79 GHz, 1000000 repetitions.
Vc::vector with -1 and plain array : 342 cycles/repetition,
1.226e-07 seconds/repetition, 6.21x speedup,
2.79 GHz, 1000000 repetitions.
Vc::memory : 275 cycles/repetition,
9.872e-08 seconds/repetition, 7.72x speedup,
2.79 GHz, 1000000 repetitions.
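
To illustrate what loop unrolling means for the scalar rows, here is a minimal sketch of the same "subtract one" loop written once plainly and once unrolled by hand, four elements per iteration. It is not the benchmark code and not what GCC emits for -funroll-loops. The function names and the unroll factor of 4 are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Plain scalar loop over n floats.
void subtract_one(float* x, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        x[i] -= 1.0f;
    }
}

// The same work, unrolled by hand: four elements per iteration,
// followed by a remainder loop for the last n % 4 elements.
void subtract_one_unrolled4(float* x, std::size_t n) {
    assert(x != nullptr || n == 0);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        x[i]     -= 1.0f;
        x[i + 1] -= 1.0f;
        x[i + 2] -= 1.0f;
        x[i + 3] -= 1.0f;
    }
    for (; i < n; ++i) {
        x[i] -= 1.0f;
    }
}
```

Both functions produce identical results. Whether the unrolled form is faster depends on the compiler and its flags, so it has to be measured.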

How did you get this to run anyway? Vc::Allocator only looks at the alignof of
the type used in the container. Since that is 4 for float, std::vector got
memory aligned to 16 bytes, not 32. I had to modify Vc::Allocator to get
correct alignment.
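
The alignment point can be sketched with a small standalone allocator. This is not Vc::Allocator and not the modification mentioned above. It is a hypothetical AlignedAllocator that takes the alignment as an explicit template parameter instead of deriving it from alignof(T), and it uses only standard C++11 (over-allocating with malloc and rounding the address up). The names AlignedAllocator and is_aligned are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical allocator: returns storage aligned to Alignment bytes,
// independent of alignof(T).
template <typename T, std::size_t Alignment>
struct AlignedAllocator {
    static_assert((Alignment & (Alignment - 1)) == 0, "Alignment must be a power of two");
    static_assert(Alignment >= alignof(void*), "Alignment must cover a stored pointer");

    using value_type = T;

    // Needed explicitly because of the non-type template parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    AlignedAllocator() = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        // Over-allocate: room to round up, plus one slot to remember the raw pointer.
        const std::size_t bytes = n * sizeof(T) + Alignment + sizeof(void*);
        void* raw = std::malloc(bytes);
        if (!raw) throw std::bad_alloc();
        const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
        const std::uintptr_t aligned =
            (start + Alignment - 1) & ~static_cast<std::uintptr_t>(Alignment - 1);
        assert(aligned % Alignment == 0);
        reinterpret_cast<void**>(aligned)[-1] = raw;  // stored just before the block
        return reinterpret_cast<T*>(aligned);
    }

    void deallocate(T* p, std::size_t) noexcept {
        std::free(reinterpret_cast<void**>(p)[-1]);  // free the original malloc pointer
    }
};

template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// True if p is a multiple of alignment bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A std::vector<float, AlignedAllocator<float, 32>> then has data() on a 32-byte boundary, which is_aligned(x.data(), 32) can confirm. With the default allocator the same check may or may not pass, depending on what the underlying malloc returns.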

Cheers,
  Matthias

On Friday 07 March 2014 14:11:18 Sandro Wenzel wrote:
> Dear Tijskens,
> 
> I am writing back to confirm that I now see the same behavior as you. I
> am attaching a slightly modified code that puts the individual tests in
> separate functions (in order to look at the assembly and to enable binary
> instrumentation analysis).
> 
> I also added the same tests using plain C-like arrays. Those seem to give
> good "Vc" performance immediately:
> 
> @@@ test0 - begin @@@
> std::vector<float> vector : 2087 cycles/repetition, 6.14e-07
> seconds/repetition, 1x speedup, 3.4 GHz, 1000000 repetitions.
> float * vector : 287 cycles/repetition, 8.45e-08 seconds/repetition, 7.27x
> speedup, 3.4 GHz, 1000000 repetitions.
> Vc::vector<float> -one from std::vector : 635 cycles/repetition, 1.869e-07
> seconds/repetition, 3.29x speedup, 3.4 GHz, 1000000 repetitions.
> Vc::vector<float> -one from plain array : 319 cycles/repetition, 9.397e-08
> seconds/repetition, 6.53x speedup, 3.4 GHz, 1000000 repetitions.
> Vc::vector with -1 : 410 cycles/repetition, 1.207e-07 seconds/repetition,
> 5.09x speedup, 3.4 GHz, 1000000 repetitions.
> Vc::vector with -1 and plain array : 317 cycles/repetition, 9.345e-08
> seconds/repetition, 6.57x speedup, 3.4 GHz, 1000000 repetitions.
> Vc::memory : 310 cycles/repetition, 9.144e-08 seconds/repetition, 6.72x
> speedup, 3.4 GHz, 1000000 repetitions.
> @@@ test0 - done @@@
> 
> 
> 
> My conclusion is that std::vector is to be avoided (and anyway I still
> had issues with alignment). Note also that the compiler autovectorization
> is better than any other solution here (probably because it also
> unrolls).
> 
> 
> I compiled like this:
> 
> icc -mavx -I ./ -O2 -I ${VCROOT}/include testmodif.cpp -o foo.x -std=c++11
> -L ${VCROOT}/lib -lVc -fabi-version=6
> 
> 
> Best
> 
> Sandro
> 
> 
> 
> 
> 2014-03-06 9:42 GMT+01:00 Tijskens Engelbert:
> > Dear Sandro,
> > 
> > The attachments contain the main file test.cpp and the included timer.h.
> > I included some unrolling tests as Matthias suggested for the scalar case.
> > That helps indeed. I didn't check the SIMD case so far.
> > Kindest regards,
> > Bert
> > 
> > Sandro Wenzel wrote:
> >  
> >  Dear Tijskens,
> >  
> >  I was intrigued by your observations and tried to reproduce them, but I
> >  failed. Actually, I feel, like Matthias, that measuring such a short,
> >  minimalistic code section is really tough.
> >  
> >  Would you be able to share your benchmark code and the way you compile
> >  it so that I can have a more thorough look?
> > 
> >  Best
> >  
> >  Sandro
> > 
> > 2014-03-04 18:57 GMT+01:00 Tijskens Engelbert:
> > 
> > Dear all,
> > 
> > I am trying to figure out how to use std::vector<float> efficiently in
> > combination with Vc (to have dynamic arrays and performance).
> > 
> >     std::vector<float> x(1024);
> >     for( int i=0; i<ne; ++i ) { // initialize
> >         x[i] = 1.0;
> >     }
> > 
> > // scalar loop using std::vector
> > 
> >     for( int i=0; i<ne; ++i ) {
> >         x[i] -= 1.0;
> >     }
> > 
> > // vector loop using std::vector
> > 
> >     for( int i=0; i<ne; i+=Vc::float_v::Size )
> >     {
> >         Vc::float_v vx( &x[i] );
> >         vx -= 1.0;
> >         vx.store( &x[i] );
> >     }
> > 
> > // vector loop using Vc::Memory instead of std::vector
> > 
> >     Vc::Memory<Vc::float_v,ne> Vx;
> >     for( int i=0; i<ne; ++i ) { // initialize
> >         Vx[i] = 1.0;
> >     }
> >     Vc::float_v one(1.);
> >     ET_TIME_THIS
> >      ( "Vc::Memory<Vc::float_v,ne>  vector",
> >         for( int i=0; i<nv; ++i ) {
> >             Vx.vector(i) -= one;
> >         }
> >      );
> > 
> > When I time these loops I get the following results:
> > 
> > scalar loop using std::vector : 2162 cycles/repetition, 9.4e-07
> > seconds/repetition, 1   x speedup, 2.3  GHz, 100 repetitions.
> > vector loop using std::vector :  357 cycles/repetition, 1.6e-07
> > seconds/repetition, 6.04x speedup, 2.24 GHz, 100 repetitions.
> > vector loop using Vc::Memory  :  288 cycles/repetition, 1.2e-07
> > seconds/repetition, 7.49x speedup, 2.4  GHz, 100 repetitions.
> > 
> > Is there a way to improve the vector loop using std::vector? By the way,
> > if I write the second loop as
> > 
> > // vector loop using std::vector
> > 
> >     for( int i=0; i<ne; i+=Vc::float_v::Size )
> >     {
> >         Vc::float_v vx( &x[i] );
> >         Vx.vector(i) -= one;
> >         vx.store( &x[i] );
> >     }
> > 
> > things get even worse, with a speedup of only about 4.2x.
> > 
> > _______________________________________________
> > Vc mailing list
> > https://compeng.uni-frankfurt.de/mailman/listinfo/vc
> > 
> >  --
> > 
> > Dr. Sandro Wenzel
> > PH / SFT
> > CERN

-- 
─────────────────────────────────────────────────────────────
 Dipl.-Phys. Matthias Kretz

 Web:   http://compeng.uni-frankfurt.de/?mkretz

 SIMD easy and portable: http://compeng.uni-frankfurt.de/?vc
─────────────────────────────────────────────────────────────