Low write performance on SLES 11/12 servers with large RAM

This document (7010287) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server 11 (SLES 11)
SUSE Linux Enterprise Server 11 Service Pack 1 (SLES 11 SP1)
SUSE Linux Enterprise Server 11 Service Pack 2 (SLES 11 SP2)
SUSE Linux Enterprise Server 11 Service Pack 3 (SLES 11 SP3)
SUSE Linux Enterprise Server 11 Service Pack 4 (SLES 11 SP4)

SUSE Linux Enterprise Server 12 (SLES 12)
SUSE Linux Enterprise Server 12 Service Pack 1 (SLES 12 SP1)
SUSE Linux Enterprise Server 12 Service Pack 2 (SLES 12 SP2)
SUSE Linux Enterprise Server 12 Service Pack 3 (SLES 12 SP3)

Situation

Low performance, especially involving writing of data to files over NFS, may occur on SLES servers with large amounts of RAM.

Resolution

For performance reasons, written data goes into a cache before being sent to disk.  The cache of data waiting to be written is called "dirty cache".  There are some tunable settings which influence how the Linux kernel deals with dirty cache.  The defaults for these settings are chosen for average workloads on average servers.  However, technology changes quickly and the amount of RAM in an "average" server is not easily predictable.  More and more modern systems have too much RAM for these settings to be reasonable.
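The relevant tunables all live under vm.dirty* and can be listed with the sysctl utility.  For example (the exact set of settings and their defaults will vary by kernel version):

sysctl -a 2>/dev/null | grep vm.dirty    # list the current dirty-cache related settings and their values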
 
If a server has more than 8 GB of RAM, there may be cases where these values should be decreased.  This may seem counter-intuitive, given that most caches give better performance as you increase their size.  That is often true of read caches, but for write caches there are trade-offs.  Write caches allow you to write (to memory) very quickly, but at some point you have to "pay that debt" and actually get the work done.  Writing out all that data can take considerable time.  This is especially true when an application is writing large amounts of data to a file system which resides over a network.  For example, when an application is writing to an NFS mount point, a large dirty cache can take excessive time to flush to the NFS server.  High-RAM systems which are NFS clients often need to be tuned downward.
 
Of course, it is also possible (but far less common) that NFS *servers* (not just NFS clients), or any typical Linux machine, might need these values tuned lower, if the amount of dirty cache is too large.  For dirty cache, "too large" simply means:  any size that cannot be flushed quickly and efficiently.  Naturally, "quickly and efficiently" will vary depending on the hardware in use, how it is configured, whether it is functioning perfectly or having intermittent errors, etc.  Therefore, it is difficult to give a rule of thumb about when and where tuning is most needed.  The best that can be said is, "If you have problems that involve performance during large writes, try tuning these caches."
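One simple way to see how much dirty cache has actually built up (and whether writeback is keeping up) is to watch the Dirty and Writeback counters in /proc/meminfo while a large write is running.  For example:

grep -E 'Dirty|Writeback' /proc/meminfo                  # current dirty / writeback amounts, in kB
watch -n 1 "grep -E 'Dirty|Writeback' /proc/meminfo"     # watch the counters change once per second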
 
The following tunable settings should be considered.  Most administrators will get the most benefit, most quickly, from the "Alternative method" further below, at least to initially test a dramatic reduction in dirty cache and evaluate the impact.  But it is best to become familiar with this entire discussion:
  
 
vm.dirty_ratio
Maximum percentage of dirty system memory (default on SLES 11 is 40, on SLES 12 the default is 20).
 
When this percentage of memory is hit, processes will not be allowed to write more until some of their cached data is written out.  This ensures that the ratio is enforced.  By itself, that can slow down writes noticeably, but not tremendously.  However, if an application has written a large amount of data which is still in the dirty cache, and then issues a "sync" command to have it all written to disk, this can take a significant amount of time to accomplish.  During that time, some applications may appear stuck or hung.  Some applications which have timers watching those processes may even believe that too much time has passed and the operation needs to be aborted, also known as a "timeout".
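A rough way to see how long a full flush takes on a particular system is to time a manual sync while a large amount of dirty data is cached (for example, during or immediately after a large copy to an NFS mount):

time sync    # reports how long it takes to flush all outstanding dirty data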
 
Therefore, on large memory servers, this setting may need to be reduced in order for the dirty cache to stay smaller.  This will allow a full sync (flush, or commit) without long delays.  A setting of 10% (instead of 40%) may sometimes be appropriate to test, but often it is necessary to go even lower.  A range of experimentation may be enlightening.
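As a rough sanity check, the ceiling implied by a given percentage can be estimated from the installed RAM.  Note that the kernel actually applies the ratio to "dirtyable" memory rather than to total RAM, so treat this only as an approximation.  For example, for a 10% ratio:

awk '/^MemTotal:/ {printf "approximate dirty cache ceiling at 10%%: %.1f GB\n", $2 * 0.10 / 1024 / 1024}' /proc/meminfo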
 
 
vm.dirty_background_ratio
Percentage of dirty system memory at which background writeback will start (default 10).
 
"Background" writes are kicked off to get writing done even when the application isn't forcing a sync, and even if the dirty_ratio has not yet been reached.   The goal of this setting is to keeps the dirty cache from growing too large.  When reducing dirty_ratio to 10, it can be common to reduce dirty_background_ratio to 5 or lower.  Rule of thumb:  dirty_background_ratio = 1/4 to 1/2 of the dirty_ratio.
 
 
These limits can be observed or modified with the sysctl utility (see man pages for sysctl(8), sysctl.conf(5)).  But simply put, these can be set (to come into effect upon boot) in /etc/sysctl.conf, as:
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
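Assuming the two lines above have been added to /etc/sysctl.conf, they can be applied immediately (without waiting for a reboot) and then verified, for example:

sysctl -p /etc/sysctl.conf                          # re-read sysctl.conf and apply the settings now
sysctl vm.dirty_ratio vm.dirty_background_ratio     # confirm the values currently in effect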
 
If desired, rather than setting the parameters "permanently" in sysctl.conf, they can be changed immediately, remaining in effect only until reboot (or until they are set again), with the following example method:
 
echo 10 > /proc/sys/vm/dirty_ratio
echo 5 > /proc/sys/vm/dirty_background_ratio 
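The same temporary change can also be made with sysctl -w, and the active values can be read back at any time to confirm, for example:

sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5
cat /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_background_ratio    # read back the values currently in effect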
 
 
Alternative method, and lesser known rules:
 
There is a coded limit to how low dirty_ratio can be set.  Therefore, in dealing with larger amounts of RAM, percentage ratios might not be granular enough.  Some kernels won't allow dirty_ratio to be set below 5%.  When a smaller setting is needed, switch to setting dirty_bytes and dirty_background_bytes instead of the corresponding ratios.  Keep in mind that only one method (bytes or ratios) can be used at a time.  Typically, setting one type will automatically disable the other type by setting it to 0.  It is usually not necessary to have more than a few hundred Megabytes of memory in dirty cache, so good test settings may be:
 
echo 629145600 > /proc/sys/vm/dirty_bytes  #for 600 MB maximum dirty cache
echo 314572800 > /proc/sys/vm/dirty_background_bytes    #to spawn background write threads once the cache holds 300 MB
 
Or in /etc/sysctl.conf:
vm.dirty_bytes = 629145600
vm.dirty_background_bytes = 314572800
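The byte values above are simply 600 MB and 300 MB expressed in bytes; for other sizes the arithmetic can be done directly in the shell, for example:

echo $((600 * 1024 * 1024))    # 629145600 bytes = 600 MB
echo $((300 * 1024 * 1024))    # 314572800 bytes = 300 MB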
 
By the way, if the dirty_background_* setting (either bytes or ratio) is set equal to or greater than the corresponding dirty_* setting, the kernel will automatically use dirty_background_* = 1/2 dirty_*.
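When in doubt about which pair of settings is currently in force (ratios or bytes), all four values can be read back at once; whichever member of each pair is nonzero is the one being applied.  For example:

sysctl vm.dirty_ratio vm.dirty_bytes vm.dirty_background_ratio vm.dirty_background_bytes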

Disclaimer

This Support Knowledgebase provides a valuable tool for NetIQ/Novell/SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 7010287
  • Creation Date: 09-MAR-12
  • Modified Date: 02-NOV-17
    • SUSE Linux Enterprise Server
