From nicholasferguson@wingarch.com Fri Nov 3 13:31:23 2017
From: nicholasferguson@wingarch.com (nicholas ferguson)
Date: Fri, 3 Nov 2017 08:31:23 -0400
Subject: help getting started with outer-loop-parallel, inner-loop-vector use of Vc
In-Reply-To:
References:
Message-ID: <001a01d3549f$aa221550$fe663ff0$@com>

Besides all the literature, I found Compiler Explorer useful (I use a local install). www.godbolt.com explained that a version 1.0 (gcc 7) is

g++ -std=c++11 -O3 -c -S -masm=intel -o - | c++filt | grep -vE '^\s+\.'

Then you run it again with different optimization levels: -O2, and none.

-----Original Message-----
From: vc-devel-bounces@compeng.uni-frankfurt.de [mailto:vc-devel-bounces@compeng.uni-frankfurt.de] On Behalf Of Andrew Corrigan
Sent: Wednesday, October 18, 2017 5:33 PM
To: vc-devel@compeng.uni-frankfurt.de
Subject: help getting started with outer-loop-parallel, inner-loop-vector use of Vc

I would really like to use Vc to implement operations within my code in an outer-loop-parallel, inner-loop-vector fashion. I am a complete newbie at vectorization, and would tremendously appreciate any help getting started.

Typically, I parallelize (#pragma omp parallel for) operations performed for each element (say std::array<S, N>) of a large std::vector<std::array<S, N>>. Since automatic vectorization over these outer loops (#pragma omp parallel for simd) seems hopeless, I would instead like to target the operations on each element, within each iteration of the outer loop, for vectorization. The size of these compile-time-sized arrays is arbitrary, however, so I cannot assume a length of 4, 8, 16, etc.:

using T = std::array<S, N>; // N is an arbitrary but compile-time constant that might be in the tens or hundreds

I sketch what I am trying to do below, but I would like to use Vc instead of std::array. Instead of writing out the entire inner loop, I am hoping to make use of a vectorization-aware analogue of std::array with pre-defined operations, so that I can write the whole loop body as just: a[i] = b[i] + c[i].
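The kind of type described above could be sketched roughly as follows. This is only a minimal sketch in plain C++ with OpenMP hints, not Vc: VecArray and add_all are hypothetical names, and whether the inner loop actually vectorizes depends on the compiler and flags (for example -O3 together with -fopenmp or -fopenmp-simd on gcc).

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-size array with an element-wise operator+.
// Not part of Vc; it only illustrates the interface being asked for.
template <typename S, std::size_t N>
struct VecArray {
    std::array<S, N> data{};

    S& operator[](std::size_t j) { return data[j]; }
    const S& operator[](std::size_t j) const { return data[j]; }

    friend VecArray operator+(const VecArray& x, const VecArray& y) {
        VecArray r;
        // Inner-loop vector: the trip count N is a compile-time constant,
        // and the pragma is only a hint that the loop is safe to vectorize.
        #pragma omp simd
        for (std::size_t j = 0; j < N; ++j) {
            r.data[j] = x.data[j] + y.data[j];
        }
        return r;
    }
};

// Outer-loop parallel: one element-wise sum per iteration.
template <typename S, std::size_t N>
void add_all(VecArray<S, N>* a, const VecArray<S, N>* b,
             const VecArray<S, N>* c, std::size_t n) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = b[i] + c[i];
    }
}
```

With a type like this the outer loop body reduces to a[i] = b[i] + c[i]. Without any OpenMP flag the pragmas are ignored and the code still runs serially, so the generated assembly is the place to check for packed instructions.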
I see there is/was SimdArray, but when I pull the latest version I see that it is in the attic. I would really appreciate any guidance on how to go about achieving this. Or am I going about this in entirely the wrong way?

Thank you for any help getting started using Vc.

- Andrew

#include <array>
#include <cstddef>
#include <memory>

int main(int argc, char** argv)
{
    using S = double;
    using Size = std::size_t;
    constexpr Size N = 64; // in general, a compile-time known number that is typically O(10), but might be O(100)
    using T = std::array<S, N>;

    auto n = Size(1000000); // in general, a run-time known number that can be arbitrarily large

    auto a_ = std::make_unique<T[]>(n);
    auto b_ = std::make_unique<T[]>(n);
    auto c_ = std::make_unique<T[]>(n);

    auto a = a_.get();
    auto b = b_.get();
    auto c = c_.get();

    // outer-loop parallel
    #pragma omp parallel for
    for(auto i = Size(0); i < n; ++i)
    {
#if 0
        // this is what I'd like to do:
        a[i] = b[i] + c[i]; // vectorized sum + assign over 64 elements within each T (or maybe 120 elements, or maybe 6 elements, depending on how N is defined at compile time)
#else
        // this is my best attempt at an implementation using OpenMP directives, and I'm not sure that it even works
        auto b_i = b[i];
        auto c_i = c[i];
        T a_i;

        // inner-loop vector
        #pragma omp simd
        for(auto j = Size(0); j < N; ++j)
        {
            a_i[j] = b_i[j] + c_i[j];
        }

        a[i] = a_i; // store back to memory: does this copy even vectorize?
#endif
    }

    return 0;
}

_______________________________________________
Vc-devel mailing list
Vc-devel@compeng.uni-frankfurt.de
https://compeng.uni-frankfurt.de/mailman/listinfo/vc-devel