std::vector vs Vc::Memory

Mon Mar 10 09:43:35 CET 2014

Interesting, I am learning a lot!
Looking back, I guess I was just lucky with the alignment. It was my understanding that Vc::Allocator automatically adjusts the allocation to the vector width one is compiling for, but this is apparently incorrect.
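
A quick way to see which alignment a container actually delivers is to test the address returned by data(). The sketch below is plain C++11 and does not use Vc; the helper names is_aligned and print_alignment are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// True if p is aligned to `alignment` bytes.
// Precondition: alignment is a non-zero power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Prints what the element type requires and what the buffer happens to provide.
inline void print_alignment(const std::vector<float>& x) {
    std::printf("alignof(float)  = %zu\n", alignof(float));
    std::printf("16-byte aligned : %d\n", is_aligned(x.data(), 16) ? 1 : 0);
    std::printf("32-byte aligned : %d\n", is_aligned(x.data(), 32) ? 1 : 0);
}
```

Calling print_alignment(x) on a std::vector<float> x shows whether the buffer is usable for aligned 32-byte (AVX) loads; with the default allocator this is not guaranteed, so a positive answer on one run can simply be luck.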
Finally, I am still a bit stuck on my question of how to allow for dynamic arrays that are compliant with SIMD processing. Ideally, SIMD processing should be done in low-level routines, whereas the organisation of the computations is done at a higher level, which should be unaware of SIMD requirements (except for aligned allocation, which should be hidden from the high-level developer).
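
One possible way to hide aligned allocation from high-level code is a custom allocator behind a type alias. The following is a minimal sketch in plain C++11, not Vc's own allocator: AlignedAllocator, simd_vector, subtract_one and the 32-byte alignment (enough for AVX) are all assumptions chosen for illustration, and the inner loop is a scalar stand-in for a Vc::float_v load, subtract and store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

// Minimal allocator returning memory aligned to Alignment bytes.
template <typename T, std::size_t Alignment>
struct AlignedAllocator {
    static_assert((Alignment & (Alignment - 1)) == 0 && Alignment >= alignof(void*),
                  "Alignment must be a power of two and at least alignof(void*)");
    using value_type = T;

    // Explicit rebind is needed because of the non-type template parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    AlignedAllocator() noexcept = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        const std::size_t extra = Alignment + sizeof(void*);
        if (n > (static_cast<std::size_t>(-1) - extra) / sizeof(T)) throw std::bad_alloc();
        // Over-allocate so an aligned block plus a back-pointer always fits.
        void* raw = std::malloc(n * sizeof(T) + extra);
        if (!raw) throw std::bad_alloc();
        const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
        const std::uintptr_t aligned =
            (start + Alignment - 1) & ~static_cast<std::uintptr_t>(Alignment - 1);
        // Store the pointer returned by malloc just below the aligned block.
        reinterpret_cast<void**>(aligned)[-1] = raw;
        return reinterpret_cast<T*>(aligned);
    }

    void deallocate(T* p, std::size_t) noexcept {
        std::free(reinterpret_cast<void**>(p)[-1]);
    }
};

template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// What high-level code sees: an ordinary dynamic array type.
template <typename T>
using simd_vector = std::vector<T, AlignedAllocator<T, 32>>;

// Low-level routine: full chunks of W elements first, then a scalar tail.
inline void subtract_one(float* x, std::size_t n) {
    constexpr std::size_t W = 8;  // assumed SIMD width (floats per AVX register)
    std::size_t i = 0;
    for (; i + W <= n; i += W)
        for (std::size_t k = 0; k < W; ++k) x[i + k] -= 1.0f;  // stand-in for a SIMD op
    for (; i < n; ++i) x[i] -= 1.0f;
}
```

High-level code would then write simd_vector<float> x(n, 1.0f); subtract_one(x.data(), x.size()); and never mention alignment. The scalar tail means n does not have to be a multiple of the SIMD width.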

kindest regards
bert

Matthias Kretz wrote:

> GCC 4.8.1 results:
> 
> std::vector<float> vector : 296 cycles/repetition,
> 1.061e-07 seconds/repetition, 1x speedup,
> 2.79 GHz, 1000000 repetitions.
> float * vector : 312 cycles/repetition,
> 1.119e-07 seconds/repetition, 0.948x speedup,
> 2.79 GHz, 1000000 repetitions.
> Vc::vector<float> -one from std::vector : 288 cycles/repetition,
> 1.031e-07 seconds/repetition, 1.03x speedup,
> 2.79 GHz, 1000000 repetitions.
> Vc::vector<float> -one from plain array : 289 cycles/repetition,
> 1.038e-07 seconds/repetition, 1.02x speedup,
> 2.79 GHz, 1000000 repetitions.
> Vc::vector with -1 : 297 cycles/repetition,
> 1.063e-07 seconds/repetition, 0.997x speedup,
> 2.79 GHz, 1000000 repetitions.
> Vc::vector with -1 and plain array : 297 cycles/repetition,
> 1.066e-07 seconds/repetition, 0.995x speedup,
> 2.79 GHz, 1000000 repetitions.
> Vc::memory : 293 cycles/repetition,
> 1.051e-07 seconds/repetition, 1.01x speedup,
> 2.79 GHz, 1000000 repetitions.
> 
> CXXFLAGS := -O3 -fabi-version=0 -march=native -funroll-loops -std=c++11
> 
> With -O2 instead of -O3 I get:
> 
> std::vector<float> vector : 1564 cycles/repetition,
> 5.601e-07 seconds/repetition, 1x speedup,
> 2.79 GHz, 1000000 repetitions.
> float * vector : 1572 cycles/repetition,
> 5.628e-07 seconds/repetition, 0.995x speedup,
> 2.79 GHz, 1000000 repetitions.
> Vc::vector<float> -one from std::vector : 301 cycles/repetition,
> 1.078e-07 seconds/repetition, 5.19x speedup,
> 2.79 GHz, 1000000 repetitions.
> Vc::vector<float> -one from plain array : 306 cycles/repetition,
> 1.096e-07 seconds/repetition, 5.11x speedup,
> 2.79 GHz, 1000000 repetitions.
> Vc::vector with -1 : 304 cycles/repetition,
> 1.09e-07 seconds/repetition, 5.14x speedup,
> 2.79 GHz, 1000000 repetitions.
> Vc::vector with -1 and plain array : 301 cycles/repetition,
> 1.079e-07 seconds/repetition, 5.19x speedup,
> 2.79 GHz, 1000000 repetitions.
> Vc::memory : 266 cycles/repetition,
> 9.527e-08 seconds/repetition, 5.88x speedup,
> 2.79 GHz, 1000000 repetitions.
> 
> And additionally without -funroll-loops:
> 
> std::vector<float> vector : 2127 cycles/repetition,
> 7.616e-07 seconds/repetition, 1x speedup,
> 2.79 GHz, 1000000 repetitions.
> float * vector : 2128 cycles/repetition,
> 7.619e-07 seconds/repetition, 1x speedup,
> 2.79 GHz, 1000000 repetitions.
> Vc::vector<float> -one from std::vector : 353 cycles/repetition,
> 1.266e-07 seconds/repetition, 6.01x speedup,
> 2.79 GHz, 1000000 repetitions.
> Vc::vector<float> -one from plain array : 347 cycles/repetition,
> 1.243e-07 seconds/repetition, 6.13x speedup,
> 2.79 GHz, 1000000 repetitions.
> Vc::vector with -1 : 317 cycles/repetition,
> 1.138e-07 seconds/repetition, 6.69x speedup,
> 2.79 GHz, 1000000 repetitions.
> Vc::vector with -1 and plain array : 342 cycles/repetition,
> 1.226e-07 seconds/repetition, 6.21x speedup,
> 2.79 GHz, 1000000 repetitions.
> Vc::memory : 275 cycles/repetition,
> 9.872e-08 seconds/repetition, 7.72x speedup,
> 2.79 GHz, 1000000 repetitions.
> 
> How did you get this to run anyway? Vc::Allocator only looks at the alignof of 
> the type used in the container. Since that's 4 for float, std::vector got
> memory aligned to 16 bytes, not 32. I had to modify Vc::Allocator to get
> correct alignment.
> 
> Cheers,
>  Matthias
> 
> On Friday 07 March 2014 14:11:18 Sandro Wenzel wrote:
>> Dear Tijskens,
>> 
>> I am writing back to confirm that I now get the same observations as you. I
>> am attaching slightly modified code that puts the individual tests in
>> functions (in order to look at the assembly and to enable binary
>> instrumentation analysis).
>> 
>> I also added the same tests using plain C-like arrays. Those seem to give
>> good "Vc" performance immediately:
>> 
>> @@@ test0 - begin @@@
>> std::vector<float> vector : 2087 cycles/repetition, 6.14e-07
>> seconds/repetition, 1x speedup, 3.4 GHz, 1000000 repetitions.
>> float * vector : 287 cycles/repetition, 8.45e-08 seconds/repetition, 7.27x
>> speedup, 3.4 GHz, 1000000 repetitions.
>> Vc::vector<float> -one from std::vector : 635 cycles/repetition, 1.869e-07
>> seconds/repetition, 3.29x speedup, 3.4 GHz, 1000000 repetitions.
>> Vc::vector<float> -one from plain array : 319 cycles/repetition, 9.397e-08
>> seconds/repetition, 6.53x speedup, 3.4 GHz, 1000000 repetitions.
>> Vc::vector with -1 : 410 cycles/repetition, 1.207e-07 seconds/repetition,
>> 5.09x speedup, 3.4 GHz, 1000000 repetitions.
>> Vc::vector with -1 and plain array : 317 cycles/repetition, 9.345e-08
>> seconds/repetition, 6.57x speedup, 3.4 GHz, 1000000 repetitions.
>> Vc::memory : 310 cycles/repetition, 9.144e-08 seconds/repetition, 6.72x
>> speedup, 3.4 GHz, 1000000 repetitions.
>> @@@ test0 - done @@@
>> 
>> 
>> 
>> My conclusion is that std::vector is to be avoided (and anyway I still
>> had issues with alignment). Note also that the compiler autovectorization
>> is better than any other solution here (probably because it also
>> unrolls).
>> 
>> 
>> I compiled like this:
>> 
>> icc -mavx -I ./ -O2 -I ${VCROOT}/include testmodif.cpp -o foo.x -std=c++11
>> -L ${VCROOT}/lib -lVc -fabi-version=6
>> 
>> 
>> Best
>> 
>> Sandro
>> 
>> 
>> 
>> 
>> 2014-03-06 9:42 GMT+01:00 Tijskens Engelbert:
>>> Dear Sandro,
>>> 
>>> The attachments contain the main file test.cpp and the included timer.h.
>>> I included some unrolling tests as Matthias suggested for the scalar case.
>>> That helps indeed. I didn't check the SIMD case so far.
>>> kindest regards,
>>> bert
>>> 
>>> Sandro Wenzel wrote:
>>> 
>>> Dear Tijskens,
>>> 
>>> I was intrigued by your observations and tried to reproduce them, but I
>>> failed. Actually, I feel like Matthias that measuring such a short,
>>> minimalistic code section is really tough.
>>> 
>>> Would you be able to share your benchmark code and the way you compile
>>> it such that I can have a more thorough look?
>>> 
>>> Best
>>> 
>>> Sandro
>>> 
>>> 2014-03-04 18:57 GMT+01:00 Tijskens Engelbert:
>>> 
>>> Dear all,
>>> 
>>> I am trying to figure out how to use std::vector<float> efficiently in
>>> combination with Vc (to have dynamic arrays and performance).
>>> 
>>>    std::vector<float> x(1024);
>>>    for( int i=0; i<ne; ++i ) { // initialize
>>>        x[i] = 1.0;
>>>    }
>>> 
>>> // scalar loop using std::vector
>>>    for( int i=0; i<ne; ++i ) {
>>>        x[i] -= 1.0;
>>>    }
>>> 
>>> // vector loop using std::vector
>>>    for( int i=0; i<ne; i+=Vc::float_v::Size )
>>>    {
>>>        Vc::float_v vx( &x[i] );
>>>        vx -= 1.0;
>>>        vx.store( &x[i] );
>>>    }
>>> 
>>> // vector loop using Vc::Memory instead of std::vector
>>>    Vc::Memory<Vc::float_v,ne> Vx;
>>>    for( int i=0; i<ne; ++i ) { // initialize
>>>        Vx[i] = 1.0;
>>>    }
>>>    Vc::float_v one(1.);
>>>    ET_TIME_THIS
>>>     ( "Vc::Memory<Vc::float_v,ne>  vector",
>>>        for( int i=0; i<nv; ++i ) {
>>>            Vx.vector(i) -= one;
>>>        }
>>>     )
>>> 
>>> 
>>> When I time these loops I get the following results:
>>> 
>>> scalar loop using std::vector : 2162 cycles/repetition, 9.4e-07
>>> seconds/repetition, 1   x speedup, 2.3  GHz, 100 repetitions.
>>> vector loop using std::vector :  357 cycles/repetition, 1.6e-07
>>> seconds/repetition, 6.04x speedup, 2.24 GHz, 100 repetitions.
>>> vector loop using Vc::Memory  :  288 cycles/repetition, 1.2e-07
>>> seconds/repetition, 7.49x speedup, 2.4  GHz, 100 repetitions.
>>> 
>>> Is there a way to improve the vector loop using std::vector? By the way,
>>> if I write the second loop as
>>> 
>>> // vector loop using std::vector
>>>    for( int i=0; i<ne; i+=Vc::float_v::Size )
>>>    {
>>>        Vc::float_v vx( &x[i] );
>>>        Vx.vector(i) -= one;
>>>        vx.store( &x[i] );
>>>    }
>>> 
>>> things get even worse, the speedup being only roughly 4.2x.
>>> 
>>> _______________________________________________
>>> Vc mailing list
>>> https://compeng.uni-frankfurt.de/mailman/listinfo/vc
>>> 
>>> --
>>> 
>>> Dr. Sandro Wenzel
>>> PH / SFT
>>> CERN
> 
> -- 
> ─────────────────────────────────────────────────────────────
> Dipl.-Phys. Matthias Kretz
> 
> Web:   http://compeng.uni-frankfurt.de/?mkretz
> 
> SIMD easy and portable: http://compeng.uni-frankfurt.de/?vc
> ─────────────────────────────────────────────────────────────