Optimize the code with vectorization

Describes how to iteratively optimize the vectorization example code with vectorization, and profile it with Linaro MAP.

Prerequisites

Procedure

  1. Begin with the first failed vectorization attempt. According to the compiler remark annotation in Fig. 15, the compiler is unable to identify the array bounds so vectorization was not possible.

    ../../_images/map_vectorization_without_opt1.png

    Fig. 15 Compiler remark for a failed vectorization attempt due to unknown array bounds

    Assist the compiler by clarifying the bounds:

    for (int i = 0; i < num/2; i++)
    {
        arr2[2*i] = 3.0f * arr1[i];
        arr2[2*i+1] = 3.0f * arr1[i];
    }
    
  2. Compile the application by running make, and profile with Linaro MAP:

    make -f vectorization.makefile
    map ./vectorization
    

    Notice in Fig. 16 that the vectorization attempt is now successful, and that the runtime of the program has been reduced.

    ../../_images/map_vectorization_with_opt1.png

    Fig. 16 Compiler remark for a successful vectorization attempt after providing array bounds

  3. Next focus on the failed vectorization attempt at the nested loop. The compiler has deemed vectorization to be non-beneficial due to the non-sequential memory accesses in this loop (see Fig. 17).

    ../../_images/map_vectorization_without_opt2.png

    Fig. 17 Compiler remark for a failed vectorization attempt due to non-sequential memory accesses

    Optimize the performance of this loop by interchanging the loop conditions.

    for (int j = 0; j < cols; j++)
    {
        for (int i = 0; i < rows; i++)
        {
            z[j * cols + i] = x[j * cols + i] - y[j * cols + i];
        }
    }
    

    Compile and profile the application again. Notice that the vectorization attempt is now successful in Fig. 18, and that the runtime of the program has been reduced.

    ../../_images/map_vectorization_with_opt2.png

    Fig. 18 Compiler remark for a successful vectorization attempt after interchanging loop conditions

  4. Now focus on the final failed vectorization attempt (Fig. 19). This loop contains a loop-carried dependency, i.e. the result of one iteration is dependent on another.

    ../../_images/map_vectorization_without_opt3.png

    Fig. 19 Compiler remark for a failed vectorization attempt due to a loop-carried dependency

    It cannot be fully vectorized, but it can be distributed into vectorizable and unvectorizable parts as follows:

    for (int i = 1; i < size; i++)
    {
        x[i] = x[i - 1] * y[i - 1] - z[i];
        y[i] = 2.0f * y[i - 1];
    }
    
    for (int i = 1; i < size; i++)
    {
        z[i] = x[i] + y[i];
    }
    

    Compile and profile the application again. Although the compiler has vectorized the second distributed loop in Fig. 20, depending on your system the runtime of the program may have increased due to the introduction of another loop.

    ../../_images/map_vectorization_with_opt3.png

    Fig. 20 Compiler remarks showing both a failed and a successful vectorization attempt after distributing the loop

    This shows that acting upon all failed compiler optimization remarks may not always be beneficial. Revert the last change to optimize the runtime of the program if that is the case.