CPU instructions

The following sections describe the CPU instruction metrics available on each supported platform: x86_64 and Arm®v8-A.

Note

Due to differences in processor models, not all metrics are available on all systems.

Tip

When you select one or more lines of code in the Source code viewer, Linaro MAP shows a breakdown of the CPU instructions used on those lines. Selected lines view describes this view in more detail.

CPU instruction metrics available on x86_64 systems

These metrics show the percentage of time that the active cores spent executing different classes of instruction. They are most useful for optimizing single-core and OpenMP performance.

CPU floating-point: The percentage of time each rank spends in floating-point CPU instructions. This includes all floating-point vector instructions as well as standard x87 floating-point instructions. High values here suggest CPU-bound areas of the code that are probably functioning as expected.

CPU integer: The percentage of time each rank spends in integer CPU instructions. This includes all integer vector instructions as well as standard integer operations. High values here suggest CPU-bound areas of the code that are probably functioning as expected.

CPU memory access: The percentage of time each rank spends in memory access CPU instructions, such as move, load, and store. This also includes vectorized memory access instructions, and may overlap with instructions classified elsewhere. High values here may indicate inefficiently structured code. Extremely high values (98% and above) almost always indicate cache problems. Typical cache problems include cache misses caused by poor loop ordering, but may also include more subtle effects such as false sharing or cache line collisions.

CPU floating-point vector

The percentage of time each rank spends in vectorized floating-point instructions. Optimized floating-point-based HPC code should spend most of its time running these operations. This metric provides a good check to see whether your compiler is correctly vectorizing hotspots.

See Linaro MAP does not correctly identify vectorized instructions for a list of the instructions considered vectorized.

CPU integer vector

The percentage of time each rank spends in vectorized integer instructions. Optimized integer-based HPC code should spend most of its time running these operations. This metric provides a good check to see whether your compiler is correctly vectorizing hotspots.

See Linaro MAP does not correctly identify vectorized instructions for a list of the instructions considered vectorized.

CPU branch: The percentage of time each rank spends in test and branch-related instructions such as test, cmp, and je. Optimized HPC code should not spend much time in branch-related instructions. Typically the only branch hotspots occur during MPI calls, in which the MPI layer checks whether a message has been fully received.
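The loop-ordering problem mentioned under CPU memory access can be illustrated with a toy model. The sketch below does not measure anything and is not how Linaro MAP works. It only counts how often consecutive accesses to a row-major array land on a different cache line for two loop orders. The 64-byte cache line, the 8-byte element size, and the function name cache_line_switches are assumptions chosen for illustration.

```python
# Toy model only: counts cache-line changes between consecutive accesses.
CACHE_LINE_BYTES = 64   # assumed cache line size
ELEMENT_BYTES = 8       # assumed size of one double-precision value


def cache_line_switches(rows, cols, inner_loop_over_columns):
    """Count how often consecutive accesses fall on different cache lines.

    The array is assumed to be stored row by row (C layout), so element
    (i, j) lives at byte offset (i * cols + j) * ELEMENT_BYTES.
    """
    if inner_loop_over_columns:
        # Inner loop walks along a row: consecutive addresses.
        indices = ((i, j) for i in range(rows) for j in range(cols))
    else:
        # Inner loop walks down a column: stride of one full row.
        indices = ((i, j) for j in range(cols) for i in range(rows))

    switches = 0
    previous_line = None
    for i, j in indices:
        line = ((i * cols + j) * ELEMENT_BYTES) // CACHE_LINE_BYTES
        if line != previous_line:
            switches += 1
            previous_line = line
    return switches


good = cache_line_switches(256, 256, inner_loop_over_columns=True)
bad = cache_line_switches(256, 256, inner_loop_over_columns=False)
print(f"inner loop over columns: {good} cache-line switches")
print(f"inner loop over rows:    {bad} cache-line switches")
```

Under these assumptions, the first order changes cache line 8192 times and the second 65536 times, that is, on every access. A real cache can retain some of those lines, so the actual miss count depends on the hardware, but the model shows why the second ordering tends to raise the CPU memory access percentage.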

CPU instruction metrics available on Arm®v8-A systems

Note

These metrics are not available on virtual machines. Linux perf events performance counters must be accessible on all systems on which the target program runs.

The CPU instruction metrics available on Arm®v8-A systems are:

Cycles per instruction: The average number of CPU cycles taken to execute an instruction. It is less than 1 when the CPU takes advantage of instruction-level parallelism.

L2 Data cache miss: The ratio of L2 data cache accesses that result in a miss to the number of instructions executed.

Branch mispredicts: The rate of speculatively executed instructions that do not retire due to incorrect prediction.

Stalled backend cycles: The percentage of cycles where no operation was issued because of the backend, due to a lack of required resources. Data-cache misses can be responsible for this.

Stalled frontend cycles: The percentage of cycles where no operation was issued because of the frontend, due to fetch starvation. Instruction-cache and i-TLB misses can be responsible for this.
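As a rough illustration of how several of the metrics above relate to raw counter values, the following sketch derives them from a set of hypothetical counts. The CounterSample class, its field names, and the numbers are all assumptions made for this example. They do not reflect how Linaro MAP collects or names its counters. Branch mispredicts is left out because the description above does not state its denominator.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Hypothetical raw counter values collected over one interval."""
    cycles: int
    instructions: int
    l2_data_cache_misses: int
    stalled_backend_cycles: int
    stalled_frontend_cycles: int


def derive_metrics(sample):
    """Turn raw counts into the ratios and percentages described above."""
    if sample.cycles <= 0 or sample.instructions <= 0:
        raise ValueError("cycles and instructions must be positive")
    return {
        # Below 1 when more than one instruction completes per cycle.
        "cycles_per_instruction": sample.cycles / sample.instructions,
        # L2 data cache misses relative to instructions executed.
        "l2_data_cache_miss_ratio":
            sample.l2_data_cache_misses / sample.instructions,
        # Share of cycles in which no operation was issued.
        "stalled_backend_percent":
            100.0 * sample.stalled_backend_cycles / sample.cycles,
        "stalled_frontend_percent":
            100.0 * sample.stalled_frontend_cycles / sample.cycles,
    }


# Made-up numbers, for illustration only.
sample = CounterSample(
    cycles=800_000,
    instructions=1_000_000,
    l2_data_cache_misses=5_000,
    stalled_backend_cycles=200_000,
    stalled_frontend_cycles=40_000,
)
metrics = derive_metrics(sample)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, the cycles per instruction value is 0.8, which is below 1 because more instructions were executed than cycles elapsed, and a quarter of all cycles were stalled on the backend.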