Node memory threshold detection
Running out of memory often causes a job to be killed instantly with no further debugging or diagnostic information available.
Usage
To detect potential out-of-memory errors early, enable Set node memory threshold at in Memory debugging options, and optionally adjust the threshold if the default value of 90 percent is not suitable for your case.
There are two ways to set the node memory threshold:
As a percentage of node memory, by choosing percent.
As an absolute value, by choosing gigabytes or megabytes.
If an absolute value is chosen, it must be less than the node memory capacity, otherwise the node memory threshold detection feature will be disabled. Setting an absolute threshold slightly above the typical memory usage of your program may allow you to detect a memory leak earlier than a percentage threshold.
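The two ways of expressing the threshold, and the rule that disables an oversized absolute value, can be illustrated with a minimal sketch. This is not Linaro DDT's implementation: the function name `resolve_threshold_bytes`, the unit table, and the use of binary units (1 MB = 2**20 bytes) are assumptions made for illustration only.

```python
from typing import Optional

# Assumption: binary units. The documentation does not state whether
# megabytes and gigabytes are interpreted as binary or decimal units.
UNIT_BYTES = {"megabytes": 2**20, "gigabytes": 2**30}


def resolve_threshold_bytes(node_capacity_bytes: int, value: float, unit: str) -> Optional[int]:
    """Return the threshold in bytes, or None if detection would be disabled.

    Hypothetical helper: it only mirrors the rules described in the text.
    """
    if unit == "percent":
        # A percentage threshold is relative to the node memory capacity.
        return int(node_capacity_bytes * value / 100)
    threshold = int(value * UNIT_BYTES[unit])
    # An absolute threshold must be less than the node memory capacity,
    # otherwise the feature is disabled.
    if threshold >= node_capacity_bytes:
        return None
    return threshold


node = 64 * 2**30  # a hypothetical node with 64 GB of memory
print(resolve_threshold_bytes(node, 90, "percent"))     # default 90 percent
print(resolve_threshold_bytes(node, 32, "gigabytes"))   # valid absolute value
print(resolve_threshold_bytes(node, 128, "gigabytes"))  # too large: disabled
```

With a program that typically uses about 30 GB on such a node, an absolute threshold of 32 GB would be crossed far sooner by a leak than the 90 percent threshold.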
If node memory threshold detection is enabled, Linaro DDT reports an over node memory threshold limit memory error as soon as a dynamic memory allocation (such as malloc or ALLOCATE) would take the total memory usage of a compute node over the specified threshold.
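The check described above is made before an allocation is granted, against the memory usage of the whole node. The following sketch models that rule for a sequence of allocation requests. It is an illustration under stated assumptions, not the tool's allocator: the names `would_exceed_threshold` and `first_offending_allocation` are hypothetical, and node usage is modelled as a simple running total.

```python
from typing import Iterable, Optional


def would_exceed_threshold(node_usage_bytes: int, request_bytes: int, threshold_bytes: int) -> bool:
    """True if granting the request would take node usage over the threshold."""
    return node_usage_bytes + request_bytes > threshold_bytes


def first_offending_allocation(start_usage: int, requests: Iterable[int], threshold: int) -> Optional[int]:
    """Return the index of the first request that would cross the threshold.

    Assumption: nothing is freed between requests, as in a leaking loop.
    """
    usage = start_usage
    for index, size in enumerate(requests):
        if would_exceed_threshold(usage, size, threshold):
            return index
        usage += size
    return None


# A leak of 40 units per iteration against a threshold of 100 units.
print(first_offending_allocation(0, [40, 40, 40], 100))
```

Note that the request that triggers the report is simply the one that happens to cross the limit. It is not necessarily the allocation responsible for most of the memory usage.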
When an over node memory threshold limit memory error is reported, you can either continue playing the program or pause the execution.
If you choose to continue playing the program, it is likely that future allocations will continuously trigger the over node memory threshold limit memory error. To suppress this memory error in the future, use Memory debugging options to increase the threshold or to disable node memory threshold detection.
If you choose to pause the program, the line of your code that was being executed when the error was reported is highlighted. However:
The reported location might not be the root cause of reaching the threshold. To diagnose the issue, see Current memory usage for more details.
The reported process might not be the root cause of reaching the threshold if you debug more than one process on the same compute node. In that case, diagnose the issue and find the affected nodes rather than assuming that the reported process is responsible.
The root cause of reaching the threshold might be an external process. Use standard Linux command line utilities (such as ps or top) to diagnose the issue.
Note
The memory debugging library of Linaro DDT uses a custom allocator that behaves differently from the default allocator:
It is likely that the total memory usage with memory debugging enabled is higher than without, due to additional metadata.
Only allocations which request additional memory from the operating system will report the over node memory threshold limit memory error.
The custom allocator might not give back freed memory to the operating system.
CUDA memory allocations do not check if the node memory threshold has been reached.
Offline usage
You can use the node memory threshold detection with offline debugging. There are two ways to enable the node memory threshold detection:
Specify the --mem-debug-threshold command line option followed by the threshold percentage (must be between 1 and 99 percent, with or without %), for example --mem-debug-threshold=90.
Specify the --mem-debug-threshold command line option followed by an absolute threshold value (in either MB or GB), for example --mem-debug-threshold=32GB.
If an absolute value is chosen, it must be less than the node memory capacity, otherwise the node memory threshold detection feature will be disabled.
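The accepted forms of the option value can be summarised with a small parsing sketch. It only reflects the formats listed above (a percentage from 1 to 99 with or without %, or a value in MB or GB). The function `parse_threshold` is hypothetical, and the restriction to whole numbers and upper-case suffixes is an assumption; the real option may accept more than this.

```python
import re
from typing import Tuple

# Assumption: whole numbers only, with an optional %, MB or GB suffix.
_PATTERN = re.compile(r"^(\d+)(%|MB|GB)?$")


def parse_threshold(value: str) -> Tuple[str, int]:
    """Classify a --mem-debug-threshold value as percent, MB or GB.

    Hypothetical helper for illustration; not part of Linaro DDT.
    """
    match = _PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised threshold: {value!r}")
    number = int(match.group(1))
    suffix = match.group(2)
    if suffix in (None, "%"):
        # Percentages must be between 1 and 99, with or without the % sign.
        if not 1 <= number <= 99:
            raise ValueError("percentage must be between 1 and 99")
        return ("percent", number)
    return (suffix, number)


for example in ("90", "90%", "32GB", "512MB"):
    print(example, parse_threshold(example))
```

An absolute value that parses correctly can still leave detection disabled if it is not less than the node memory capacity, so the two checks are independent.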
If the node memory threshold limit is reached, the program will be terminated and the offline log will contain a memory leak report, which can be used to diagnose the issue. See Offline report HTML output for more details.