Optimize the code with Linaro MAP
Describes how to profile and optimize the mmult example code using Linaro Performance Reports and Linaro MAP. Linaro Performance Reports can identify a high rate of memory accesses, and Linaro MAP can identify the time-consuming loops in the example code.
Prerequisites
You must install all the necessary tools as described in Software requirements.
You must complete the instructions in Compile and Run mmult, Fix the bug with Linaro DDT, and Analyze the behavior with Linaro Performance Reports.
Ensure the code has been compiled with the -g debugging flag.
Procedure
To profile the code with multiple processes and the 3072x3072 test case, use map --profile mpirun. For example:

map --profile mpirun -n 8 ./mmult_c 3072

or

map --profile mpirun -n 8 ./mmult_f 3072

or

map --profile mpirun -n 8 python3 ./mmult.py -s 3072
If express launch is not supported for your MPI environment, run map --profile:

map --profile -n 8 ./mmult_c 3072

or

map --profile -n 8 python3 ./mmult.py -s 3072
The --profile option runs the profiler in non-interactive mode. When the execution terminates, Linaro MAP creates a profile file (.map):

mmult_8p_1n_YYYY-MM-DD_HH-MM.map

YYYY-MM-DD_HH-MM is a timestamp of when the report was created.

To view the results, run the interactive mode:

map mmult_8p_1n_YYYY-MM-DD_HH-MM.map
Linaro MAP starts and displays the main profiler window. See MAP user interface.
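If you want to inspect the profile data outside the GUI, recent versions of Linaro MAP can also export it to JSON; run map --help to confirm that the --export option is available in your installation. The output file name below is only an example:

map --export=mmult_profile.json mmult_8p_1n_YYYY-MM-DD_HH-MM.map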
Depending on your system configuration, the details might vary in your results. The profiler indicates that most of the time is spent in one line of the mmult function (or, when using the Python version, in the corresponding calls in the C or F90 version).

In C:

res += A[i*sz+k]*B[k*sz+j];

In F90:

res=A(k,i)*B(j,k)+res
Select this line of code. The CPU breakdown window appears on the right and shows the following results (Fig. 13):
Fig. 13 Linaro MAP line breakdown without optimized memory accesses
The results indicate inefficient memory accesses: the loop nest performs strided accesses to array B. In addition, a dependency on the intermediate result res prevents the compiler from vectorizing the loop properly.
Note

On non-x86 architectures, the CPU breakdown is not available. To visualize the high number of cycles per instruction, L2 (or L3) cache misses, and stalled back-end cycles while the mmult function executes, use the CPU instructions metric graphs instead, selecting them from the menu.

In C, replace the following code:
for(int i=0; i<sz/nslices; i++) {
  for(int j=0; j<sz; j++) {
    double res = 0.0;
    for(int k=0; k<sz; k++) {
      res += A[i*sz+k]*B[k*sz+j];
    }
    C[i*sz+j] += res;
  }
}
with:
for(int i=0; i<sz/nslices; i++) {
  for(int k=0; k<sz; k++) {
    for(int j=0; j<sz; j++) {
      C[i*sz+j] += A[i*sz+k]*B[k*sz+j];
    }
  }
}
and in Fortran replace:
do i=0,sz/nslices-1
  do j=0,sz-1
    res=0.0
    do k=0,sz-1
      res=A(k,i)*B(j,k)+res
    end do
    C(j,i)=res+C(j,i)
  end do
end do
with:
do i=0,sz/nslices-1
  do k=0,sz-1
    do j=0,sz-1
      C(j,i)=A(k,i)*B(j,k)+C(j,i)
    end do
  end do
end do
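With this interchange the innermost loop runs over j, so both C and B are accessed with unit stride, and the scalar reduction on res disappears, which allows the compiler to vectorize the loop. You can verify this independently of MAP by requesting a vectorization report from the compiler. For example, with GCC (other compilers have equivalents, such as Clang's -Rpass=loop-vectorize); the source file name mmult.c is an assumption based on the executable names used in this tutorial, and you may need to use mpicc instead so that the MPI headers are found:

gcc -Ofast -g -fopt-info-vec -c mmult.c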
Remove the previous executable, recompile, and run Linaro MAP again:
make -f mmult.makefile clean
make -f mmult.makefile
map --profile -n 8 ./mmult_c 3072
The profiling results show significant performance improvement because of the optimization (Fig. 14).
Fig. 14 Linaro MAP line breakdown with optimized memory accesses
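To quantify the improvement outside of MAP, you can also rerun Linaro Performance Reports on the optimized binary and compare the result with the report generated in Analyze the behavior with Linaro Performance Reports:

perf-report mpirun -n 8 ./mmult_c 3072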
Next Steps
To go further and use an optimized version of the matrix multiplication:
In the C version, call CBLAS instead of mmult:

#include <cblas.h>
...
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            sz/nproc, sz, sz, 1.0, mat_a, sz,
            mat_b, sz, 1.0, mat_c, sz);

In the F90 version, call BLAS instead of mmult:

call DGEMM('N','N', sz, sz/nproc, sz, 1.0D0, &
           mat_b, sz, &
           mat_a, sz, 1.0D0, &
           mat_c, sz)
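For reference, here is the CBLAS call from the C version again, annotated with the meaning of each parameter in the standard cblas_dgemm signature (the operation performed is C = alpha*A*B + beta*C):

cblas_dgemm(CblasRowMajor,   /* arrays are stored row-major            */
            CblasNoTrans,    /* A is not transposed                    */
            CblasNoTrans,    /* B is not transposed                    */
            sz/nproc,        /* M: rows of A and C (this rank's slice) */
            sz,              /* N: columns of B and C                  */
            sz,              /* K: columns of A, rows of B             */
            1.0, mat_a, sz,  /* alpha, A, leading dimension of A       */
            mat_b, sz,       /* B, leading dimension of B              */
            1.0, mat_c, sz); /* beta, C, leading dimension of C        */

Because beta is 1.0, the call accumulates into mat_c, matching the += behavior of the original loop.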
Make sure you edit mmult.makefile to include the BLAS header and link to your BLAS library, for instance with OpenBLAS:
CFLAGS = -Ofast -g -I/opt/openblas/include
LFLAGS = -L/opt/openblas/lib -lopenblas
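For example, if the C version is compiled with mpicc from a source file named mmult.c (an assumption based on the executable names used in this tutorial), the full compile line would look similar to:

mpicc -Ofast -g -I/opt/openblas/include mmult.c -o mmult_c -L/opt/openblas/lib -lopenblas

Adjust both paths to match where OpenBLAS is installed on your system.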
In the Python version, run the SciPy DGEMM kernel with the following command:
mpirun -n 8 python3 ./mmult.py -k Py -s 3072