Analyze the behavior with Linaro Performance Reports

Describes how to analyze the behavior of the mmult example code and how to check for performance issues using Linaro Performance Reports.

Prerequisites

Procedure

  1. Run the application with eight processes on a large test case, for example 3072x3072 matrices:

    perf-report mpirun -n 8 ./mmult_c 3072
    

    or

    perf-report mpirun -n 8 ./mmult_f 3072
    

    or

    perf-report mpirun -n 8 python3 ./mmult.py -s 3072
    

    If your MPI environment does not support express launch, run one of the following commands instead:

    perf-report -n 8 ./mmult_c 3072
    

    or

    perf-report -n 8 python3 ./mmult.py -s 3072
    

    When the execution terminates, Linaro Performance Reports creates two files:

    • mmult_8p_1n_YYYY-MM-DD_HH-MM.txt

    • mmult_8p_1n_YYYY-MM-DD_HH-MM.html

    YYYY-MM-DD_HH-MM is the timestamp of when the report was created. The two files contain the same data in two different formats.
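
Because the timestamp in the file name sorts chronologically as plain text, the newest report for a given configuration can be selected by name. The following is a minimal sketch, assuming the reports are in the current directory. `latest_report` is a hypothetical helper name, not part of Linaro Performance Reports.

```shell
# Hypothetical helper: print the newest report whose name starts with the
# given prefix (for example mmult_8p_1n) and ends with the given extension.
# Relies on the YYYY-MM-DD_HH-MM timestamp sorting chronologically as text.
latest_report() {
    prefix=$1
    ext=$2
    for f in "${prefix}"_*."${ext}"; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$f" ] && printf '%s\n' "$f"
    done | LC_ALL=C sort | tail -n 1
}
```

For example, `latest_report mmult_8p_1n html` prints the name of the most recent HTML report, which you can then pass to your browser.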

  2. To visualize the results, open the HTML file in your web browser (either locally or remotely if you have X forwarding enabled). For example, to use Firefox:

    firefox mmult_8p_1n_YYYY-MM-DD_HH-MM.html
    

    Alternatively, open the .txt file in any text editor that uses a fixed-width font:

    vim mmult_8p_1n_YYYY-MM-DD_HH-MM.txt
    

    The report (Fig. 10) shows different sections:

    ../../_images/mmult_perf_report.png

    Fig. 10 Linaro Performance Reports HTML document

    Application Details (top)

    Describes the system settings (including the number of physical and logical cores), the job configuration (including the number of processes and number of nodes) and the execution time.

    Summary (middle)

    Shows the amount of time spent in computation (CPU), communication (MPI), and I/O.

    Breakdown sections (bottom)

    Show a detailed breakdown for individual metric categories, such as the CPU breakdown described below.

    The details of the report depend on your system configuration, but the report should indicate that the application is CPU-bound.

    The CPU breakdown section (x86_64 only) gives more information about the types of instructions executed (Fig. 11):

    ../../_images/mmult_perf_report_cpu_O0.png

    Fig. 11 Linaro Performance Reports CPU metrics without compiler optimizations

    In this example, the compiler has not vectorized the code. As the report suggests, you can change this behavior by changing your compiler options.

    Note

    On non-x86 architectures, the tool reports a different set of CPU metrics:

    • Cycles per instruction

    • Number of L2 (or L3) cache accesses

    • Number of processor back-end/front-end stalls

    Keep these numbers low for better performance.
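
As a reminder of how the first of these metrics is defined, cycles per instruction (CPI) is the number of CPU cycles divided by the number of instructions executed, so a lower value means more work is done per cycle. The sketch below shows only that arithmetic. The counter values are made up for illustration and are not output from the tool, and `cpi` is a hypothetical helper name.

```shell
# Cycles per instruction = cycles / instructions.
# The counter values passed in below are illustrative, not measured.
cpi() {
    awk -v c="$1" -v i="$2" 'BEGIN { printf "%.2f\n", c / i }'
}

cpi 3000000 1500000   # prints 2.00
cpi 1200000 1600000   # prints 0.75
```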

  3. To enable the -Ofast compiler optimization (including vectorization), edit mmult.makefile:

    CFLAGS = -Ofast -g
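
If you prefer to script this change rather than edit the file by hand, the flag line can be rewritten with sed. This is a sketch that assumes the makefile assigns the flags on a line beginning with `CFLAGS =`. `set_cflags` is a hypothetical helper name, and a temporary file is used because in-place editing options for sed differ between platforms.

```shell
# Hypothetical helper: replace the CFLAGS assignment in a makefile.
# Usage: set_cflags <makefile> <flags...>
# Assumes the flags do not contain the characters '|' or '&'.
set_cflags() {
    makefile=$1
    shift
    tmp=$(mktemp) || return 1
    # Rewrite any line that starts with "CFLAGS" followed by "=".
    sed "s|^CFLAGS[[:space:]]*=.*|CFLAGS = $*|" "$makefile" > "$tmp" &&
        cat "$tmp" > "$makefile"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For example, `set_cflags mmult.makefile -Ofast -g` produces the line shown above and leaves the rest of the file unchanged.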
    
  4. Remove the previous executable, recompile, and run Linaro Performance Reports again:

    make -f mmult.makefile clean
    make -f mmult.makefile
    perf-report mpirun -n 8 ./mmult_c 3072
    

    The new report shows a performance improvement because the compiler has now vectorized the code (Fig. 12).

    ../../_images/mmult_perf_report_cpu_Ofast.png

    Fig. 12 Linaro Performance Reports CPU metrics with compiler optimizations
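
When comparing the reports from before and after the change, it can help to label each file with the optimization level it was produced with, since the generated names differ only in their timestamps. The following is a minimal sketch. `tag_report` is a hypothetical helper name, and the labels are whatever you choose.

```shell
# Hypothetical helper: insert a label before the extension of a report file,
# e.g. mmult_8p_1n_<timestamp>.html -> mmult_8p_1n_<timestamp>_O0.html
# Prints the new file name on success.
tag_report() {
    file=$1
    label=$2
    base=${file%.*}      # name without the final extension
    ext=${file##*.}      # final extension (html or txt)
    mv -- "$file" "${base}_${label}.${ext}" &&
        printf '%s\n' "${base}_${label}.${ext}"
}
```

For example, you might tag the first report with `O0` and the second with `Ofast`, then open both HTML files side by side.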

Tip

Always profile binaries compiled with the same optimization flags that you use in production, for example -O2 or -Ofast. For best results, also use the -g flag (or -g1 when you need to minimize the amount of debug information, such as when profiling the -g compiled binary triggers out-of-memory errors).

See "Prepare a program for profiling" for more information on the recommended compilation flags for use with Linaro MAP and Linaro Performance Reports.