2012-05-25 Computers

Automatic vectorization

Just the other day I saw a video presenting a new feature in Visual Studio 2012 called auto-vectorization. This made me curious about the feature and whether it was available in the tools I have access to, e.g. g++.

The short answer is yes: if you use optimization level 3, compilation flag -O3, you get it. The situation seems to be the same for Visual Studio 2012; you get it automatically. So that is fine, and end of story if you like. If you want some more details, please read on.

How does this optimization work?

The idea is to use the CPU registers introduced with technologies such as MMX or SSE. These vector registers can hold multiple scalar values, and a single instruction can operate on all of them at once. That is exactly the situation in a loop going over an array.

int a[SIZE], b[SIZE], c[SIZE];
...
for (int i=0; i<SIZE; i++)
{
   a[i] = b[i] + c[i];
}

In the code fragment above the arithmetic in the loop can be vectorized, meaning that the additions for several indices are performed in parallel using the vector registers. It is up to the compiler to analyse the code and figure out that the optimization can be applied.
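
To get a feel for what the compiler produces behind the scenes, here is a hand-written sketch of the same loop using SSE2 intrinsics. This is only an illustration: the function name add_arrays is my own, and the sketch assumes the element count is a multiple of four, whereas the compiler also emits a scalar tail loop for any leftover elements.

#include <emmintrin.h> // SSE2 intrinsics

// Sketch only: assumes n is a multiple of 4.
void add_arrays(const int* b, const int* c, int* a, int n)
{
   for (int i = 0; i < n; i += 4)
   {
      __m128i vb = _mm_loadu_si128((const __m128i*)&b[i]); // load 4 ints from b
      __m128i vc = _mm_loadu_si128((const __m128i*)&c[i]); // load 4 ints from c
      __m128i va = _mm_add_epi32(vb, vc);                  // 4 additions in one instruction
      _mm_storeu_si128((__m128i*)&a[i], va);               // store 4 results into a
   }
}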

How to use it

With gcc this optimization is on by default from optimization level 3, compilation flag -O3. You can also turn it on explicitly with the flag -ftree-vectorize. Note however that you also need to turn on the use of the vector registers; on my x86 machine I need to add the compiler flag -msse2.
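
As a concrete example, either of these invocations enables the vectorizer. The file name vectorize.cpp is just a placeholder for the test program further down, and -std=c++0x is needed because that program uses <chrono> and auto:

g++ -std=c++0x -O3 -msse2 vectorize.cpp -o vectorize
g++ -std=c++0x -O2 -ftree-vectorize -msse2 vectorize.cpp -o vectorize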

There is however another useful flag, -ftree-vectorizer-verbose. This makes the compiler tell you whether it has found loops that it is able to optimize. This is good since compiler optimization is otherwise a black box: you either have to run the program and measure it, or analyse the assembly code, to find out whether the optimization took place. A verbose message is a great help in that situation, so you know you are on the right track.
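
The flag takes a numeric level that controls how much the vectorizer reports, so a full command line could look like this (again with the placeholder file name):

g++ -std=c++0x -O3 -msse2 -ftree-vectorizer-verbose=2 vectorize.cpp -o vectorize

At level 1 the vectorizer notes each loop it managed to vectorize; higher levels also report the loops it had to skip.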

Will my program run faster?

This must be tested, of course. So I came up with the following test program.

// Vectorization

#include <chrono>
#include <iostream>

using namespace std;

const int SIZE=2048;
int a[SIZE], b[SIZE], c[SIZE];

void foo ()
{
   // Repeat many times to get a measurable execution time.
   for (int j=0; j<1000000; j++)
   {
      // Reset the input and output arrays.
      for (int i=0; i<SIZE; i++)
      {
         a[i] = 0;
         b[i] = 10;
         c[i] = 100;
      }

      // The loop we expect the compiler to vectorize.
      for (int i=0; i<SIZE; i++)
      {
         a[i] = b[i] + c[i];
      }
   }
}

int main()
{
   auto start = chrono::steady_clock::now();

   foo();

   auto diff = chrono::steady_clock::now() - start;
   auto ms = chrono::duration_cast<chrono::milliseconds>(diff);
   cout << "It took " << ms.count() << " ms" << endl;
}

In order to get a measurable execution time I had to run the loop one million times. The optimized version ran in 1122 milliseconds while the unoptimized program ran in 4424 milliseconds. That is a speedup of roughly a factor of four (4424 / 1122 ≈ 3.9). This matches the hardware: on the current machine a vector register is 128 bits and an int is 32 bits, so a vector register can hold four integers at once.

Summary

Vectorization is an optimization technique that a smart compiler can apply automatically. It is also easy to understand how it works. Programs that benefit from it use loops to manipulate data.

The optimization uses mechanisms that are on the CPU, within a single core. It does illustrate, though, how you can speed up execution by using parallelism, and it helps in understanding the features of C++ AMP (Accelerated Massive Parallelism), which I hope to be able to show more about in the future.