Optimize the code with Linaro MAP

Describes how to profile and optimize the mmult example code using Linaro Performance Reports and Linaro MAP. Linaro Performance Reports can reveal that the code is dominated by memory accesses, and Linaro MAP can identify the time-consuming loops in the example code.

Procedure

  1. To profile the code with multiple processes and the 3072x3072 test case, prefix your usual mpirun command with map --profile. For example:

    map --profile mpirun -n 8 ./mmult_c 3072
    

    or

    map --profile mpirun -n 8 ./mmult_f 3072
    

    or

    map --profile mpirun -n 8 python3 ./mmult.py -s 3072
    

    If express launch is not supported for your MPI environment, omit mpirun and pass the process count directly to map --profile:

    map --profile -n 8 ./mmult_c 3072
    

    or

    map --profile -n 8 python3 ./mmult.py -s 3072
    

    The --profile option runs the profiler in non-interactive mode. When the execution terminates, a profile file (.map) is created by Linaro MAP:

    mmult_8p_1n_YYYY-MM-DD_HH-MM.map
    

    where YYYY-MM-DD_HH-MM is the date and time at which the profile was created.

  2. To view the results, open the profile in interactive mode:

    map mmult_8p_1n_YYYY-MM-DD_HH-MM.map
    

    Linaro MAP starts and displays the main profiler window. See MAP user interface.

    Depending on your system configuration, the details of your results might vary. The profiler indicates that most of the time is spent in one line of the mmult function (or, when using the Python version, in the corresponding calls in the C or F90 version):

    In C:

    res += A[i*sz+k]*B[k*sz+j];
    

    In F90:

    res=A(k,i)*B(j,k)+res
    

    Select this line of code. The CPU breakdown window appears on the right and shows the following results (Fig. 13):

    Fig. 13 Linaro MAP line breakdown without optimized memory accesses

    The results indicate inefficient memory accesses. Because k is the innermost loop, the loop nest reads array B with a stride of sz elements, touching a different cache line on every iteration. In addition, the reduction into res carries a dependency between iterations that prevents the compiler from vectorizing the loop properly. A standalone sketch contrasting the two loop orders follows the code in step 3.

    Note

    On non-x86 architectures, the CPU breakdown is not available. To visualize the high number of cycles per instruction, L2 (or L3) cache misses, and stalled back-end cycles while the mmult function executes, use the CPU instructions metric graphs instead, by selecting Metrics ‣ Preset: CPU instructions from the menu.

  3. In C, replace the following code:

    for(int i=0; i<sz/nslices; i++)
    {
        for(int j=0; j<sz; j++)
        {
            double res = 0.0;
            for(int k=0; k<sz; k++)
            {
                res += A[i*sz+k]*B[k*sz+j];
            }
            C[i*sz+j] += res;
        }
    }
    

    with:

    for(int i=0; i<sz/nslices; i++)
    {
        for(int k=0; k<sz; k++)
        {
            for(int j=0; j<sz; j++)
            {
                C[i*sz+j] += A[i*sz+k]*B[k*sz+j];
            }
        }
    }
    

    and in Fortran replace:

    do i=0,sz/nslices-1
      do j=0,sz-1
        res=0.0
        do k=0,sz-1
          res=A(k,i)*B(j,k)+res
        end do
        C(j,i)=res+C(j,i)
      end do
    end do
    

    with:

    do i=0,sz/nslices-1
      do k=0,sz-1
        do j=0,sz-1
          C(j,i)=A(k,i)*B(j,k)+C(j,i)
        end do
      end do
    end do
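
    To see the effect of the interchange outside of MPI and MAP, you can time the two loop orders with a small standalone program. The following sketch is hypothetical: mmult_ijk, mmult_ikj, SZ, and the timing harness are illustrative names and are not part of the example sources.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SZ 512   /* illustrative size; the MPI example uses 3072 split across ranks */

    typedef void (*kernel_t)(const double *, const double *, double *);

    /* Original order: the inner loop reads B with stride SZ, and the
     * reduction into res serializes the iterations. */
    static void mmult_ijk(const double *A, const double *B, double *C)
    {
        for (int i = 0; i < SZ; i++)
            for (int j = 0; j < SZ; j++) {
                double res = 0.0;
                for (int k = 0; k < SZ; k++)
                    res += A[i*SZ + k] * B[k*SZ + j];
                C[i*SZ + j] += res;
            }
    }

    /* Interchanged order: the inner loop walks B and C with stride 1,
     * which is cache friendly and easy for the compiler to vectorize. */
    static void mmult_ikj(const double *A, const double *B, double *C)
    {
        for (int i = 0; i < SZ; i++)
            for (int k = 0; k < SZ; k++)
                for (int j = 0; j < SZ; j++)
                    C[i*SZ + j] += A[i*SZ + k] * B[k*SZ + j];
    }

    static double timed(kernel_t kernel, const double *A, const double *B, double *C)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        kernel(A, B, C);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    int main(void)
    {
        double *A = calloc((size_t)SZ * SZ, sizeof *A);
        double *B = calloc((size_t)SZ * SZ, sizeof *B);
        double *C = calloc((size_t)SZ * SZ, sizeof *C);
        if (!A || !B || !C) return 1;
        printf("ijk: %.3f s\n", timed(mmult_ijk, A, B, C));
        printf("ikj: %.3f s\n", timed(mmult_ikj, A, B, C));
        printf("C[0] = %g\n", C[0]);   /* keep the stores observable */
        free(A); free(B); free(C);
        return 0;
    }

    Compiled with optimization (for example, gcc -O3), the ikj version typically runs several times faster, for the reasons visible in the MAP line breakdowns above.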
    
  4. Remove the previous executable, recompile, and run Linaro MAP again (shown here for the C version; adjust the command for the Fortran or Python version):

    make -f mmult.makefile clean
    make -f mmult.makefile
    map --profile -n 8 ./mmult_c 3072
    

    The profiling results show a significant performance improvement as a result of the optimization (Fig. 14).

    Fig. 14 Linaro MAP line breakdown with optimized memory accesses

Next Steps

To go further and use an optimized version of the matrix multiplication:

  • In the C version, call CBLAS instead of mmult (a self-contained, annotated sketch of this call follows the makefile note below):

    #include <cblas.h>
    ...
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, sz/nproc, sz, sz,
                1.0, mat_a, sz, mat_b, sz, 1.0, mat_c, sz);
    
  • In the F90 version, call BLAS instead of mmult. Note that mat_b is passed before mat_a: Fortran stores arrays in column-major order, so the arrays hold the transposed data, and computing B*A here is equivalent to the row-major A*B in the C version:

    call DGEMM('N','N', sz, sz/nproc, sz, 1.0D0, &
               mat_b, sz, &
               mat_a, sz, 1.0D0, &
               mat_c, sz)
    

Make sure you edit mmult.makefile to include the BLAS header and link to your BLAS library, for instance with OpenBLAS:

CFLAGS = -Ofast -g -I/opt/openblas/include
LFLAGS = -L/opt/openblas/lib -lopenblas
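
Because the DGEMM argument lists are dense, the following self-contained sketch annotates each argument of the C call. It is a hypothetical harness: the sizes, allocations, and main function are illustrative, and only the cblas_dgemm call itself mirrors the snippet above.

#include <cblas.h>
#include <stdlib.h>

int main(void)
{
    int sz = 3072, nproc = 8;        /* illustrative sizes */
    int rows = sz / nproc;           /* rows of A and C held by one process */
    double *mat_a = calloc((size_t)rows * sz, sizeof(double)); /* rows x sz slice */
    double *mat_b = calloc((size_t)sz * sz, sizeof(double));   /* full sz x sz    */
    double *mat_c = calloc((size_t)rows * sz, sizeof(double)); /* rows x sz slice */
    if (!mat_a || !mat_b || !mat_c) return 1;

    /* C := 1.0 * A * B + 1.0 * C, on this process's slice of rows */
    cblas_dgemm(CblasRowMajor,    /* arrays use C row-major layout    */
                CblasNoTrans,     /* use A as stored                  */
                CblasNoTrans,     /* use B as stored                  */
                rows,             /* M: rows of A and C               */
                sz,               /* N: columns of B and C            */
                sz,               /* K: columns of A, rows of B       */
                1.0, mat_a, sz,   /* alpha, A, leading dimension of A */
                mat_b, sz,        /* B, leading dimension of B        */
                1.0, mat_c, sz);  /* beta, C, leading dimension of C  */

    free(mat_a); free(mat_b); free(mat_c);
    return 0;
}

Build it with the CFLAGS and LFLAGS shown above, for example: gcc -Ofast -g -I/opt/openblas/include sketch.c -L/opt/openblas/lib -lopenblas.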

In the Python version, run the SciPy DGEMM kernel with the following command:

mpirun -n 8 python3 ./mmult.py -k Py -s 3072