FIAS . Impressum . Privacy

help getting started with outer-loop-parallel, inner-loop-vector use of Vc

Andrew Corrigan
Wed Oct 18 23:32:54 CEST 2017


I would really like to use Vc to implement operations within my code in an outer-loop-parallel, inner-loop-vector fashion. I am a complete newbie at vectorization, and would tremendously appreciate any help getting started.

Typically, I parallelize (#pragma omp parallel for) operations performed on each element (say, std::array<double,64>) of a large std::vector<std::array<double,64>>. Since automatic vectorization over these outer loops (#pragma omp parallel for simd) seems hopeless, I would instead like to vectorize the operations on each element within each iteration of the outer loop. The size of these compile-time-sized arrays is arbitrary, however, so I cannot assume a length of 4, 8, 16, etc.:

using T = std::array<double, N>; // N is an arbitrary but compile-time constant that might be in the tens or hundreds
  
I sketch what I am trying to do below. Instead of std::array<double, 64> and a hand-written inner loop, I am hoping to use a vectorization-aware analogue of std::array from Vc, with pre-defined operations, so that I can write the whole loop body as just: a[i] = b[i] + c[i]. I see there is/was SimdArray, but when I pull the latest version I see that it is in the attic. I would really appreciate any guidance on how to go about achieving this. Or am I going about this in entirely the wrong way?

Thank you for any help getting started using Vc.

- Andrew



#include <utility>
#include <memory>
#include <array>
#include <cstddef>

int main(int argc, char** argv)
{
    using S = double;
    using Size = std::size_t;   // index type used by both loops below
    constexpr Size N = 64;      // in general, a compile-time known number that is typically O(10), but might be O(100)
    using T = std::array<S,N>;

    const Size n = 1000000;   // in general, a run-time known number that can be arbitrarily large

    auto a_ = std::make_unique<T[]>(n);
    auto b_ = std::make_unique<T[]>(n);
    auto c_ = std::make_unique<T[]>(n);

    auto a = a_.get();
    auto b = b_.get();
    auto c = c_.get();

#pragma omp parallel for
    for(auto i = Size(0); i < n; ++i)
    {
#if 0    // this is what I’d like to do:

        a[i] = b[i] + c[i];   // vectorized sum + assign over 64 elements within each T — (or maybe 120 elements, or maybe 6 elements depending how N is defined at compile-time)

#else // this is my best attempt at an implementation using OpenMP directives; I’m not sure it even works

        auto b_i = b[i];
        auto c_i = c[i];

        T a_i;

        // inner-loop vector
#pragma omp simd
        for(auto j = Size(0); j < N; ++j)
        {
            a_i[j] = b_i[j] + c_i[j];
        }

        a[i] = a_i;    // store back to memory: does this copy even vectorize???
#endif
    }

    return 0;
}
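
For comparison, here is a minimal sketch of the a[i] = b[i] + c[i] form without Vc. It assumes a hypothetical wrapper, VecArray, around std::array whose operator+ contains the #pragma omp simd inner loop, and a hypothetical helper, add_all, for the parallel outer loop. Neither name comes from Vc or from the code above, and whether the compiler actually vectorizes the inner loop or the final copy still depends on the compiler and its flags.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical wrapper (not a Vc type): a fixed-size array of N values of
// type S with an element-wise operator+.
template <typename S, std::size_t N>
struct VecArray {
    std::array<S, N> data;

    S& operator[](std::size_t j) { return data[j]; }
    const S& operator[](std::size_t j) const { return data[j]; }

    // Inner-loop vector: the simd pragma is only a request to the compiler
    // and is ignored when OpenMP support is not enabled.
    friend VecArray operator+(const VecArray& x, const VecArray& y) {
        VecArray r;
        #pragma omp simd
        for (std::size_t j = 0; j < N; ++j) {
            r.data[j] = x.data[j] + y.data[j];
        }
        return r;
    }
};

// Hypothetical helper: outer-loop parallel over the n elements.
// Assumes a, b and c all have the same size.
template <typename S, std::size_t N>
void add_all(std::vector<VecArray<S, N>>& a,
             const std::vector<VecArray<S, N>>& b,
             const std::vector<VecArray<S, N>>& c) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(a.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        a[i] = b[i] + c[i];   // the whole loop body
    }
}
```

With std::vector<VecArray<double, 64>> a(n), b(n), c(n), the call add_all(a, b, c) performs the same sum as the loop above. The signed loop index is a deliberate choice so that the outer loop is also accepted by older OpenMP implementations.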



