Optimize the code with Linaro MAP
Describes how to profile and optimize the mmult example code using Linaro Performance Reports and Linaro MAP. Linaro Performance Reports can identify a high rate of memory accesses, and Linaro MAP can identify the time-consuming loops in the example code.
Prerequisites
You must install all the necessary tools as described in Software requirements.
You must complete the instructions in Compile and Run mmult, Fix the bug with Linaro DDT, and Analyze the behavior with Linaro Performance Reports.
Ensure the code has been compiled with the -g debugging flag.
Procedure
To profile the code with multiple processes and the 3072x3072 test case, use map --profile mpirun. For example:

map --profile mpirun -n 8 ./mmult_c 3072

or

map --profile mpirun -n 8 ./mmult_f 3072

or

map --profile mpirun -n 8 python3 ./mmult.py -s 3072
If express launch is not supported for your MPI environment, run map --profile:

map --profile -n 8 ./mmult_c 3072

or

map --profile -n 8 python3 ./mmult.py -s 3072
The --profile option runs the profiler in non-interactive mode. When the execution terminates, Linaro MAP creates a profile file (.map):

mmult_8p_1n_YYYY-MM-DD_HH-MM.map

YYYY-MM-DD_HH-MM is a timestamp of when the report was created.

To view the results, run the interactive mode:

map mmult_8p_1n_YYYY-MM-DD_HH-MM.map
Linaro MAP starts and displays the main profiler window. See MAP user interface.
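If you want to inspect the profile data outside the GUI, recent versions of Linaro MAP can also export it to JSON; run map --help to confirm that the --export option is available in your installation. The output file name below is only an example:

map --export=mmult_profile.json mmult_8p_1n_YYYY-MM-DD_HH-MM.map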
Depending on your system configuration, the details might vary in your results. The profiler indicates that most of the time is spent in one line of the mmult function (or, when using the Python version, in the corresponding calls in the C or F90 version).

In C:

res += A[i*sz+k]*B[k*sz+j];

In F90:

res=A(k,i)*B(j,k)+res
Select this line of code. The CPU breakdown window appears on the right and shows the following results (Fig. 13):
Fig. 13 Linaro MAP line breakdown without optimized memory accesses
The results indicate inefficient memory accesses: the loop nest performs strided accesses to array B. In addition, a dependency on the intermediate result res prevents the compiler from vectorizing the loop properly.
Note

On non-x86 architectures, the CPU breakdown is not available. To visualize the high number of cycles per instruction, L2 (or L3) cache misses, and stalled back-end cycles while the mmult function executes, use the CPU instructions metric graphs instead, selecting them from the menu.

In C, replace the following code:
for(int i=0; i<sz/nslices; i++) {
  for(int j=0; j<sz; j++) {
    double res = 0.0;
    for(int k=0; k<sz; k++) {
      res += A[i*sz+k]*B[k*sz+j];
    }
    C[i*sz+j] += res;
  }
}
with:
for(int i=0; i<sz/nslices; i++) {
  for(int k=0; k<sz; k++) {
    for(int j=0; j<sz; j++) {
      C[i*sz+j] += A[i*sz+k]*B[k*sz+j];
    }
  }
}
and in Fortran replace:
do i=0,sz/nslices-1
  do j=0,sz-1
    res=0.0
    do k=0,sz-1
      res=A(k,i)*B(j,k)+res
    end do
    C(j,i)=res+C(j,i)
  end do
end do
with:
do i=0,sz/nslices-1
  do k=0,sz-1
    do j=0,sz-1
      C(j,i)=A(k,i)*B(j,k)+C(j,i)
    end do
  end do
end do
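With this interchange the innermost loop runs over j, so both C and B are accessed with unit stride, and the scalar reduction on res disappears, which allows the compiler to vectorize the loop. You can verify this independently of MAP by requesting a vectorization report from the compiler. For example, with GCC (other compilers have equivalents, such as Clang's -Rpass=loop-vectorize); the source file name mmult.c is an assumption based on the executable names used in this tutorial, and you may need to use mpicc instead so that the MPI headers are found:

gcc -Ofast -g -fopt-info-vec -c mmult.c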
Remove the previous executable, recompile, and run Linaro MAP again:
make -f mmult.makefile clean
make -f mmult.makefile
map --profile -n 8 ./mmult_c 3072
The profiling results show significant performance improvement because of the optimization (Fig. 14).
Fig. 14 Linaro MAP line breakdown with optimized memory accesses
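To quantify the improvement outside of MAP, you can also rerun Linaro Performance Reports on the optimized binary and compare the result with the report generated in Analyze the behavior with Linaro Performance Reports:

perf-report mpirun -n 8 ./mmult_c 3072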
Next Steps
To go further and use an optimized version of the matrix multiplication:
In the C version, call CBLAS instead of mmult:

#include <cblas.h>
...
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            sz/nproc, sz, sz, 1.0, mat_a, sz,
            mat_b, sz, 1.0, mat_c, sz);

In the F90 version, call BLAS instead of mmult:

call DGEMM('N','N', sz, sz/nproc, sz, 1.0D0, &
           mat_b, sz, &
           mat_a, sz, 1.0D0, &
           mat_c, sz)
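For reference, here is the CBLAS call from the C version again, annotated with the meaning of each parameter in the standard cblas_dgemm signature (the operation performed is C = alpha*A*B + beta*C):

cblas_dgemm(CblasRowMajor,   /* arrays are stored row-major            */
            CblasNoTrans,    /* A is not transposed                    */
            CblasNoTrans,    /* B is not transposed                    */
            sz/nproc,        /* M: rows of A and C (this rank's slice) */
            sz,              /* N: columns of B and C                  */
            sz,              /* K: columns of A, rows of B             */
            1.0, mat_a, sz,  /* alpha, A, leading dimension of A       */
            mat_b, sz,       /* B, leading dimension of B              */
            1.0, mat_c, sz); /* beta, C, leading dimension of C        */

Because beta is 1.0, the call accumulates into mat_c, matching the += behavior of the original loop.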
Make sure you edit mmult.makefile to include the BLAS header and link to your BLAS library, for instance with OpenBLAS:
CFLAGS = -Ofast -g -I/opt/openblas/include
LFLAGS = -L/opt/openblas/lib -lopenblas
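For example, if the C version is compiled with mpicc from a source file named mmult.c (an assumption based on the executable names used in this tutorial), the full compile line would look similar to:

mpicc -Ofast -g -I/opt/openblas/include mmult.c -o mmult_c -L/opt/openblas/lib -lopenblas

Adjust both paths to match where OpenBLAS is installed on your system.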
In the Python version, run the SciPy DGEMM kernel with the following command:
mpirun -n 8 python3 ./mmult.py -k Py -s 3072