Optimize the application job with thread affinities

Describes how to iteratively optimize the performance of a SLURM job running the wave_openmp example code, using the Thread affinity advisor.

Procedure

  1. Examine the first two commentary items:

    [ERROR] node-1 (1 similar), ranks 0-1: No bindings set for threads 238507-238508,238545-238558 from processes 238507-238508.
    [ERROR] node-1 (1 similar), ranks 0-1 (processes 238507-238508) overlap with at least one other process e.g. processes 238507 and 238508
    

    Click the 0-1 hyperlink to select ranks 0 and 1 under Processes and threads. Notice that compute threads from both processes are listed and that all of them are bound to logical CPUs 0-7 on the 8-core node; that is, no specific bindings have been set for these threads:

    ../../_images/map_thread_affinity_example_threads_overlapping.png

    This is problematic because threads that span NUMA nodes severely impact performance, as the next two commentary items show:

    [ERROR] node-1 (1 similar), ranks 0-1 (processes 238507-238508) spans multiple NUMA nodes e.g NUMA nodes 0 and 1
    [ERROR] node-1 (1 similar), ranks 0-1 (processes 238507-238508) contain at least one thread spanning multiple NUMA nodes e.g 238508 over NUMA nodes 0 and 1
    
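    These bindings can also be inspected from the command line while the job is running. A minimal sketch using the standard taskset utility, with process 238507 from the commentary above as the example:

    # List the CPU affinity of every thread in process 238507.
    # With no bindings set, each thread reports the full range 0-7.
    for tid in /proc/238507/task/*; do
        taskset -cp "$(basename "$tid")"
    done
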
  2. Click on a single thread to see this in the Node topology viewer:

    ../../_images/map_thread_affinity_example_topology_overlapping_highlighted.png
  3. Click on a single logical CPU to highlight all threads that are bound to it.

  4. Resolve these issues by rerunning the SLURM job with each rank bound to a single socket (in this case, a single NUMA node):

    map srun --ntasks-per-node=2 --cpu-bind=sockets ./wave_openmp
    

    Notice that the performance of the application has been greatly improved:

    points/second: 2804.2M (175.3M per process)
    
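    To confirm the bindings that SLURM applies, you can also request its binding report directly. A sketch; the verbose keyword only adds diagnostic output and does not change the bindings:

    # Ask SLURM to report the CPU mask each task is bound to.
    map srun --ntasks-per-node=2 --cpu-bind=verbose,sockets ./wave_openmp
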
  5. Open the Thread affinity advisor dialog to see that each rank is bound to a single NUMA node:

    ../../_images/map_thread_affinity_example_topology_bound.png
  6. Click on rank 0 under Processes and threads to verify that each of its compute threads is bound to NUMA node 0:

    ../../_images/map_thread_affinity_example_threads_bound.png
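
    The same check can be made without the GUI by running a binding-report utility under the same options. A sketch, assuming numactl is available on the compute nodes:

    # Each rank prints the CPUs and NUMA node it is bound to.
    srun --ntasks-per-node=2 --cpu-bind=sockets numactl --show
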
  7. Resolve the remaining commentary item by binding a single compute thread to each logical core. With the GNU C/C++/Fortran Compiler, this is accomplished by setting GOMP_CPU_AFFINITY=0-7 in the environment:

    GOMP_CPU_AFFINITY=0-7 map srun --ntasks-per-node=2 --cpu-bind=sockets ./wave_openmp
    
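    GOMP_CPU_AFFINITY is specific to the GNU OpenMP runtime. A portable sketch of the same binding, assuming an OpenMP 4.0 or later runtime, uses the standard placement variables instead:

    # Standard OpenMP equivalent: pin one thread per core.
    OMP_PLACES=cores OMP_PROC_BIND=close map srun --ntasks-per-node=2 --cpu-bind=sockets ./wave_openmp
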
  8. Notice that the Thread affinity advisor tool button no longer indicates thread affinity issues:

    ../../_images/map_thread_affinity_example_success_icon.png
  9. Open the Thread affinity advisor dialog to see that the commentary is empty.

  10. Select rank 0 to show that each compute thread is uniquely bound to a single logical CPU:

    ../../_images/map_thread_affinity_example_threads_unique_bound.png

    Notice that the performance of the application has not improved further:

    points/second: 2803.1M (175.2M per process)
    

    In this scenario, once each rank was bound to a single NUMA node, the kernel was able to schedule that rank's threads within the node effectively, so the application performed well without explicit per-thread bindings. Acting on every thread affinity issue may therefore be unnecessary.
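
    For a quick command-line confirmation of the final bindings, the logical CPU each thread last ran on can be sampled while the job is running. A sketch using standard ps options; the PSR column reports the last CPU used, not a guaranteed binding:

    # Show each wave_openmp thread and the logical CPU (PSR) it last ran on;
    # with unique per-core bindings, each compute thread stays on its own CPU.
    ps -C wave_openmp -Lo pid,tid,psr,comm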