Optimize the application job with thread affinities

Describes how to iteratively optimize the performance of a SLURM job running the wave_openmp example code, using the Thread affinity advisor.

Procedure

  1. Examine the first two commentary items:

    [ERROR] node-1 (1 similar), ranks 0-1: No bindings set for threads 238507-238508,238545-238558 from processes 238507-238508.
    [ERROR] node-1 (1 similar), ranks 0-1 (processes 238507-238508) overlap with at least one other process e.g. processes 238507 and 238508
    

    Click the 0-1 hyperlink to select ranks 0 and 1 under Processes and threads. Notice that compute threads from both processes are listed and that all of them are bound to logical CPUs 0-7 on the 8-core node; that is, no specific bindings have been set for these threads:

    ../../_images/map_thread_affinity_example_threads_overlapping.png

    This is problematic because threads that span NUMA nodes severely impact performance, as the next two commentary items show:

    [ERROR] node-1 (1 similar), ranks 0-1 (processes 238507-238508) spans multiple NUMA nodes e.g NUMA nodes 0 and 1
    [ERROR] node-1 (1 similar), ranks 0-1 (processes 238507-238508) contain at least one thread spanning multiple NUMA nodes e.g 238508 over NUMA nodes 0 and 1
    
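    These bindings can also be inspected from the command line while the job is running. A minimal sketch using the standard taskset utility, with process 238507 from the commentary above as the example:

    # List the CPU affinity of every thread in process 238507.
    # With no bindings set, each thread reports the full range 0-7.
    for tid in /proc/238507/task/*; do
        taskset -cp "$(basename "$tid")"
    done
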
  2. Click on a single thread to see this in the Node topology viewer:

    ../../_images/map_thread_affinity_example_topology_overlapping_highlighted.png
  3. Click on a single logical CPU to highlight all threads that are bound to it.

  4. Resolve these issues by rerunning the SLURM job with each rank bound to a single socket (in this case, a single NUMA node):

    map srun --ntasks-per-node=2 --cpu-bind=sockets ./wave_openmp
    

    Notice that the performance of the application has been greatly improved:

    points/second: 2804.2M (175.3M per process)
    
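    To confirm the bindings that SLURM applies, you can also request its binding report directly. A sketch; the verbose keyword only adds diagnostic output and does not change the bindings:

    # Ask SLURM to report the CPU mask each task is bound to.
    map srun --ntasks-per-node=2 --cpu-bind=verbose,sockets ./wave_openmp
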
  5. Open the Thread affinity advisor dialog to see that each rank is bound to a single NUMA node:

    ../../_images/map_thread_affinity_example_topology_bound.png
  6. Click on rank 0 under Processes and threads to verify that each of its compute threads is bound to NUMA node 0:

    ../../_images/map_thread_affinity_example_threads_bound.png
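
    The same check can be made without the GUI by running a binding-report utility under the same options. A sketch, assuming numactl is available on the compute nodes:

    # Each rank prints the CPUs and NUMA node it is bound to.
    srun --ntasks-per-node=2 --cpu-bind=sockets numactl --show
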
  7. Resolve the remaining commentary item by binding a single compute thread to each logical core. With the GNU C/C++/Fortran Compiler, this is accomplished by setting GOMP_CPU_AFFINITY=0-7 in the environment:

    GOMP_CPU_AFFINITY=0-7 map srun --ntasks-per-node=2 --cpu-bind=sockets ./wave_openmp
    
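    GOMP_CPU_AFFINITY is specific to the GNU OpenMP runtime. A portable sketch of the same binding, assuming an OpenMP 4.0 or later runtime, uses the standard placement variables instead:

    # Standard OpenMP equivalent: pin one thread per core.
    OMP_PLACES=cores OMP_PROC_BIND=close map srun --ntasks-per-node=2 --cpu-bind=sockets ./wave_openmp
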
  8. Notice that the Thread affinity advisor tool button no longer indicates thread affinity issues:

    ../../_images/map_thread_affinity_example_success_icon.png
  9. Open the Thread affinity advisor dialog to see that the commentary is empty.

  10. Select rank 0 to show that each compute thread is uniquely bound to a single logical CPU:

    ../../_images/map_thread_affinity_example_threads_unique_bound.png

    Notice that the performance of the application has not improved further:

    points/second: 2803.1M (175.2M per process)
    

    In this scenario, once each rank was bound to a single NUMA node, the kernel was able to schedule that rank's threads within the node effectively, so the application performed well without explicit per-thread bindings. Acting on every thread affinity issue may therefore be unnecessary.
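
    For a quick command-line confirmation of the final bindings, the logical CPU each thread last ran on can be sampled while the job is running. A sketch using standard ps options; the PSR column reports the last CPU used, not a guaranteed binding:

    # Show each wave_openmp thread and the logical CPU (PSR) it last ran on;
    # with unique per-core bindings, each compute thread stays on its own CPU.
    ps -C wave_openmp -Lo pid,tid,psr,comm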