help getting started with outer-loop-parallel, inner-loop-vector use of Vc
Andrew Corrigan
Wed Oct 18 23:32:54 CEST 2017
I would really like to use Vc to implement operations within my code in an outer-loop-parallel, inner-loop-vector fashion. I am a complete newbie at vectorization, and would appreciate tremendously any help getting started.
Typically, I parallelize (#pragma omp parallel for) operations performed for each element (say std::array<double,64>) of a large std::vector<std::array<double,64>>. Since automatic vectorization over these outer loops (#pragma omp parallel for simd) seems hopeless, I would instead like to target the operations on each element, within each iteration of the outer loop, for vectorization. The size of these compile-time-sized arrays is arbitrary, however, so I cannot assume a length of 4, 8, 16, etc.:
using T = std::array<double, N>; // N is an arbitrary but compile-time constant that might be in the tens or hundreds
I sketch what I am trying to do below. Instead of std::array<double, 64> and the explicit inner loop, I would like to use Vc: I am hoping for a vectorization-aware analogue of std::array, with pre-defined operations, so that I can write the whole loop body as just: a[i] = b[i] + c[i]. I see there is/was SimdArray, but when I pull the latest version I see that it is in the attic. I would really appreciate any guidance on how to go about achieving this. Or am I going about this entirely the wrong way?
Thank you for any help getting started using Vc.
- Andrew
#include <utility>
#include <memory>
#include <array>
#include <cstddef>
using Size = std::size_t; // index type used by both loops
constexpr Size N = 64; // in general, a compile-time known number that is typically O(10), but might be O(100)
int main(int argc, char** argv)
{
using S = double;
using T = std::array<S,N>;
Size n = 1000000; // in general, a run-time known number that can be arbitrarily large
auto a_ = std::make_unique<T[]>(n);
auto b_ = std::make_unique<T[]>(n);
auto c_ = std::make_unique<T[]>(n);
auto a = a_.get();
auto b = b_.get();
auto c = c_.get();
#pragma omp parallel for
for(auto i = Size(0); i < n; ++i)
{
#if 0 // this is what I’d like to do:
a[i] = b[i] + c[i]; // vectorized sum + assign over 64 elements within each T — (or maybe 120 elements, or maybe 6 elements depending how N is defined at compile-time)
#else // this is my best attempt at an implementation using OpenMP directives, though I’m not sure it even works
auto b_i = b[i];
auto c_i = c[i];
T a_i;
// inner-loop vector
#pragma omp simd
for(auto j = Size(0); j < N; ++j)
{
a_i[j] = b_i[j] + c_i[j];
}
a[i] = a_i; // store back to memory: does this copy even vectorize???
#endif
}
return 0;
}
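For reference, the a[i] = b[i] + c[i] form asked about above can be sketched without Vc. This is only a minimal sketch under stated assumptions, not Vc's API: VecArray and add_all are hypothetical names introduced here for illustration, and whether the inner loop actually vectorizes depends on the compiler and on flags such as -fopenmp or -fopenmp-simd with GCC/Clang. Without those flags the pragmas are ignored and the code still runs serially.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-size wrapper (NOT part of Vc): provides an element-wise
// operator+ whose inner loop is marked for SIMD. N may be any compile-time
// constant (6, 64, 120, ...); it need not be a multiple of the vector width.
template <typename S, std::size_t N>
struct VecArray {
    std::array<S, N> data;

    S& operator[](std::size_t j) { return data[j]; }
    const S& operator[](std::size_t j) const { return data[j]; }

    friend VecArray operator+(const VecArray& x, const VecArray& y) {
        VecArray r;
        // inner-loop vector
        #pragma omp simd
        for (std::size_t j = 0; j < N; ++j) {
            r.data[j] = x.data[j] + y.data[j];
        }
        return r;
    }
};

// Hypothetical helper: outer loop parallel over the run-time-sized container,
// inner loop vector inside operator+.
template <typename S, std::size_t N>
void add_all(std::vector<VecArray<S, N>>& a,
             const std::vector<VecArray<S, N>>& b,
             const std::vector<VecArray<S, N>>& c) {
    assert(b.size() == a.size() && c.size() == a.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(a.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        a[i] = b[i] + c[i];
    }
}
```

A signed loop index is used in the outer loop because older OpenMP implementations only accept signed integer loop variables in a parallel for. Whether the temporary returned by operator+ is elided or copied is up to the compiler, so inspecting the generated assembly or the vectorization report is the only way to confirm what happens.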