Limitations
Modern superscalar processors use instruction-level parallelism to
decode and execute multiple operations in a single cycle when internal
CPU resources are free. They can also retire multiple instructions at
once, which makes it appear as if the program counter “jumps” several
instructions per cycle.
Current architectures do not allow profilers such as MAP (or Intel
VTune, Linux perftools, and others) to efficiently measure which
instructions were “invisibly” executed by this instruction-level
parallelism. The time for these instructions is typically allocated to
the last instruction executed in the cycle.
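The attribution rule described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not how MAP or any CPU is implemented: the function name attribute_cycles, the instruction names, and the trace are all hypothetical, and each inner list stands for the instructions retired together in one cycle.

```python
from collections import Counter


def attribute_cycles(retire_batches):
    """Charge one cycle to the last instruction retired in each batch."""
    cycles = Counter()
    for batch in retire_batches:
        if batch:  # empty batches are simply ignored in this toy model
            cycles[batch[-1]] += 1
    return cycles


# Hypothetical trace: each inner list is what retired in a single cycle.
trace = [
    ["load", "add", "mul"],
    ["load", "add", "mul"],
    ["store"],
]
charged = attribute_cycles(trace)

# "load" and "add" ran twice each but were never last in a batch,
# so they receive no cycles in this model.
for name, count in sorted(charged.items()):
    print(name, count)
```

In this model the load and add instructions appear to cost nothing, even though they executed, because their time is folded into the instruction that retired last in the same cycle.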
Most MAP users will not be affected by this for the following reasons:
- Hot lines in an HPC code typically contain rather more than a single
instruction such as nop. This makes it unlikely that an entire
source line will be executed invisibly via the CPU's
instruction-level parallelism.
- Any such lines executed “for free” in parallel with another line by a
CPU core will clearly show up as a “gap” in the Source code view (but
this is unusual).
- Loops with stalls and mispredicted branches still show up, with the
line containing the problem highlighted, in all but the most
extreme cases.
Key points:
- Expert users: those wanting to use MAP's per-line instruction
metrics to investigate detailed CPU performance of a loop or kernel
(even down to the assembly level) should be aware that instructions
executed in parallel by the CPU will show up with time assigned only
to the last one in the batch executed.
- Other users: MAP's statistical instruction-based metrics correlate
well with where time is spent in the program and help to find
areas for optimization. Feel free to use them as such. If you see
lines inside your hot loops that contain very few operations (such as
a single add or multiply) and have no time assigned to them, these
are probably being executed “for free” by the CPU using
instruction-level parallelism. The time for each batch of such
instructions is assigned to the last instruction completed in the
cycle instead.
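As a rough aid for spotting such lines, the sketch below lists the source lines inside a loop that have no time assigned. It assumes the per-line figures have already been copied by hand into a dictionary; find_gap_lines, the line numbers, and the percentages are hypothetical and are not part of any MAP interface.

```python
def find_gap_lines(time_by_line, first_line, last_line):
    """Return source lines in [first_line, last_line] with no time assigned."""
    return [
        line
        for line in range(first_line, last_line + 1)
        if time_by_line.get(line, 0.0) == 0.0
    ]


# Hypothetical per-line time percentages for a hot loop on lines 40 to 44.
time_by_line = {40: 12.5, 41: 0.0, 42: 30.0, 44: 8.0}
gaps = find_gap_lines(time_by_line, 40, 44)
print(gaps)
```

Lines reported this way are only candidates. Blank lines and comments inside the loop will also be listed, so each one still needs to be checked against the source to confirm it contains a small operation such as a single add or multiply.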