Effective Linux Resource Management
Use control groups to manage complexity and performance in SUSE Linux Enterprise systems
Written by Matthias G. Eckermann and Bill Tobey
When Linux servers underperform, particularly multi-purpose systems running multiple applications for multiple user groups, the root cause is frequently resource monopolization by one or more processes or users. Wouldn’t it be wonderful if you could set and enforce some ground rules to govern how much CPU, memory, disk I/O or network I/O each process or user could command?
Well, you can! Control groups (cgroups) are a Linux kernel feature that provides mechanisms for partitioning sets of tasks into one or more hierarchical groups, and for associating each group with a set of subsystem resource parameters that affect its execution performance. You might use control groups:
- To keep a Web server from using all the memory on a system that’s also running a database
- To keep a backup system from using too much network I/O bandwidth and crashing the business apps running on the same system
- To allocate system resources among user groups of different priority (the faculty, staff and students of a university, for instance)
There are two types of control group subsystems. Isolation and special control subsystems include five different controls: CPUset, Namespace, Freezer, Device and Checkpoint/Restart. Resource subsystems are a group of four controls: CPU, Memory, Disk and Network. Before we investigate the functions of each subsystem, it’s important to note that all are implemented in exactly the same manner, by mounting one or more subsystems as virtual file systems.
Subsystems can be mounted individually—in this case, the CPUset subsystem—as follows:
- mount -t cgroup -o cpuset none /cpuset
Or, all cgroup subsystems can be mounted at once:
- mount -t cgroup none /cgroup
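Before mounting anything, it can help to check which subsystems the running kernel actually offers. The following is a minimal sketch, assuming the kernel exposes /proc/cgroups in its usual four-column layout (name, hierarchy, number of cgroups, enabled). The function names and the /cgroup/<name> mount points are illustrative choices, not part of any tool. The second function only prints mount commands, so nothing is mounted until you run them yourself as root.

```shell
# List the cgroup subsystems the running kernel reports as enabled.
# Reads /proc/cgroups by default; another file can be passed for testing.
list_enabled_subsystems() {
    awk '!/^#/ && $4 == 1 { print $1 }' "${1:-/proc/cgroups}"
}

# Print (but do not run) one mount command per enabled subsystem.
# The /cgroup/<name> mount points are an assumed layout, not a requirement.
print_mount_commands() {
    list_enabled_subsystems "$1" | while read -r subsys; do
        printf 'mkdir -p /cgroup/%s && mount -t cgroup -o %s none /cgroup/%s\n' \
            "$subsys" "$subsys" "$subsys"
    done
}
```

Calling print_mount_commands with no argument reads the live /proc/cgroups file. Review the printed lines before pasting them into a root shell.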
The Isolation and Special Control Subsystems
- The CPUset subsystem ties processes to specific CPU and memory nodes (see Figure 1). In an SMP system, CPUset may restrict a process to a specific set of CPUs, or, in a system with multi-core processors, to a specific set of CPU cores.
- The Namespace subsystem provides a private view of the system to the processes in a cgroup, and is used primarily for OS-level virtualization. It has no special functions other than to track changes in namespace.
- The Freezer subsystem stops all the processes in a cgroup from executing by removing them from the kernel task scheduler. Once you’ve mounted the Freezer subsystem, you can stop every process in a cgroup completely by writing FROZEN to the group’s freezer.state file:
echo FROZEN > /freezer/freezer.state
When you’re ready, the frozen group of processes can be restarted by writing THAWED to the same file:
echo THAWED > /freezer/freezer.state
The primary application for the Freezer subsystem is backing up write-intensive applications. First you freeze the application, then you freeze the file system. Create your snapshot or backup, then unfreeze the file system. Finally, unfreeze the process and resume normal operation.
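The process-level half of that backup sequence can be sketched as a small wrapper. This is a minimal sketch, not a tested backup procedure: it assumes the application's tasks already sit in their own freezer cgroup, and the function name run_while_frozen and the example paths are hypothetical. Because the kernel can report FREEZING for a short time before the group is fully stopped, the sketch waits for FROZEN before running the backup command, and it thaws the group again whether or not that command succeeds.

```shell
# Run a command while every task in a freezer cgroup is frozen.
# $1: path to the cgroup directory (the one containing freezer.state)
# remaining arguments: the snapshot or backup command to run
run_while_frozen() {
    state_file="$1/freezer.state"
    shift
    echo FROZEN > "$state_file" || return 1

    # Wait up to roughly ten seconds for the state to settle on FROZEN.
    tries=0
    while [ "$(cat "$state_file")" != "FROZEN" ]; do
        tries=$((tries + 1))
        if [ "$tries" -ge 10 ]; then
            echo THAWED > "$state_file"   # give up and let the group run again
            return 1
        fi
        sleep 1
    done

    "$@"                                  # the backup or snapshot command
    status=$?
    echo THAWED > "$state_file"           # always thaw, even after a failure
    return "$status"
}

# Hypothetical usage (paths and make_snapshot are placeholders; assumes the
# system provides fsfreeze from util-linux for the file system step):
#   run_while_frozen /freezer/database \
#       sh -c 'fsfreeze -f /data && make_snapshot /data; fsfreeze -u /data'
```

Keeping the thaw inside the wrapper means a failed snapshot cannot leave the application frozen indefinitely.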
- The Device subsystem provides device white lists for groups of processes, allowing or denying read/write access to listed devices or file systems.
- The Checkpoint / Restart subsystem supports process migration between machines by stopping all the processes in a control group and saving their state information to a dump file for convenient relocation and restart.
The Resource Control Subsystems
- The CPU control subsystem uses the kernel’s CFS task scheduler to share CPU bandwidth among groups of processes. It’s an effective but somewhat mechanically complicated way to allocate CPU capacity.
- The Memory control subsystem limits memory usage in user-space processes, primarily by discarding least recently used pages (LRU) to reclaim memory when a group of processes exceeds a preset limit. This subsystem imposes no restrictions on memory use by the Linux kernel.
- The Disk I/O control subsystem allows or denies disk access to groups of tasks. Several approaches to this function have been proposed and are under active consideration by the Linux kernel community. A provisional controller subsystem is included in SUSE Linux Enterprise Server 11 Service Pack 1 that allows specific parameters of the CFQ I/O scheduler to be managed on a per-cgroup basis.
- The Network I/O control subsystem allows or denies network access to groups of tasks. This control is also under continuing development and discussion by the kernel community. A provisional subsystem is included in SUSE Linux Enterprise Server 11 Service Pack 1.
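To make the CPU and Memory controls more concrete, here is a minimal sketch of setting them by hand. It assumes a hierarchy with both the cpu and memory subsystems attached (for example, one mounted with all subsystems at /cgroup) and uses the control files cpu.shares, memory.limit_in_bytes and tasks. The function names, the group name and the values are hypothetical. cpu.shares is a relative weight rather than a hard cap, so a group with 512 shares receives about half the CPU time of a group with 1024 shares when both are busy.

```shell
# Create a child cgroup and set its CPU weight and memory limit.
# $1: mount point of a hierarchy with the cpu and memory subsystems attached
# $2: group name   $3: cpu.shares value   $4: memory limit in bytes
create_limited_group() {
    group_dir="$1/$2"
    mkdir -p "$group_dir" || return 1
    echo "$3" > "$group_dir/cpu.shares" || return 1
    echo "$4" > "$group_dir/memory.limit_in_bytes" || return 1
}

# Move an existing process into the group by writing its PID to the tasks file.
# $1: mount point   $2: group name   $3: PID
add_task_to_group() {
    echo "$3" > "$1/$2/tasks"
}

# Hypothetical usage, run as root:
#   create_limited_group /cgroup webserver 512 1073741824   # 1 GiB limit
#   add_task_to_group /cgroup webserver 4242
```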
Cset: An Easy Approach to Control Groups
Managing control groups manually—mounting the virtual file systems, creating the cgroup hierarchy, starting new processes in the appropriate groups, moving existing processes into or between groups, tracking group membership, then closing down unneeded groups—can become confusing and complex. Fortunately for Novell customers who want a simpler point of entry, there is a Novell-developed tool, first introduced in SUSE Linux Enterprise, that greatly simplifies cgroup implementation and management.
The cpuset management utility is a Python application that provides an easy-to-use command line interface for the cpusets functionality in the Linux kernel. Called cset after installation (yes, it’s admittedly a little confusing), the tool addresses only the CPU and memory partitioning functionality of the cpuset subsystem. But since these are the obvious starting points for resource management and performance optimization, the cset tool offers an ideal way to sample the power and potential of control groups.
Preparing to Use Cgroups
To prepare a system for performance optimization with control groups, begin with a patched SUSE Linux Enterprise 11 SP 1 install, then add the following packages:
- libcgroup1 – The library for controlling and monitoring cgroups
- libcpuset1 – A library that provides a convenient C API to the CPUset subsystem
- kernel-source – The source code for the Linux kernel
- cpuset – The cpuset management utility
- stress – A simple workload generator for testing the impact of our process grouping and resource allocation measures on application and system performance. Available through the openSUSE Build Service at: http://software.opensuse.org/.
- lxc – Linux containers (optional). We’ll talk a little more about this important new development at the end of this article.
Simple Cgroups with Cpuset
The cpuset (cset) utility makes it quite easy to execute the basic tasks of control group setup and management.
Step One: Discover the available CPU and memory resources on your system. Use the set command as follows:
- cset set --list
to create a list of the available resources.
Step Two: Create the CPUSET hierarchy. The simplest configuration has at least three cpusets: the root cpuset, which contains all CPU and memory nodes; the system cpuset, which is assigned CPU and memory resources for lower-priority system tasks; and at least one user cpuset, which receives sufficient resources to ensure adequate performance of higher-priority user tasks.
Assuming we have a four-way NUMA machine, the command:
- cset set --cpu=2-4 --mem=1 --set=Charlie
will create a user cpuset named Charlie and assign it the complete capacity of CPUs 2, 3 and 4, along with memory node 1.
Step Three: Start a process in a user CPUSET. The command:
- cset proc --set Charlie --exec -- stress -c 1 &
will start a process in the user CPUSET we just created. In this case, the new process is our workload generator.
Step Four: Move an existing process to a CPUSET. The command:
- cset proc --move --pid PID --toset=Charlie
will move an existing process (PID) into the CPUSET Charlie.
Step Five: List the tasks in a CPUSET, by using the command:
- cset proc --list --set Charlie
Step Six: Remove a CPUSET. Use the command:
- cset set --destroy Charlie
to remove the user CPUSET Charlie.
There, in six simple steps, is the complete lifecycle of a cpuset control group.
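The steps above can be strung together in one small script. The following is a minimal sketch that reuses the cset invocations exactly as they appear in this article. The helper names run and cpuset_lifecycle and the DRY_RUN variable are hypothetical additions for illustration. With DRY_RUN set, the commands are only printed, which is a convenient way to review them before running anything as root. Step Four is left out because it needs the PID of a process that already exists.

```shell
# Print each command when DRY_RUN is set; otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Steps One, Two, Three, Five and Six for a single user cpuset.
# $1: cpuset name   $2: CPU list   $3: memory node list
# remaining arguments: the command to start inside the cpuset
cpuset_lifecycle() {
    name=$1 cpus=$2 mems=$3
    shift 3
    run cset set --list &&
    run cset set --cpu="$cpus" --mem="$mems" --set="$name" &&
    run cset proc --set "$name" --exec -- "$@" &&
    run cset proc --list --set "$name" &&
    run cset set --destroy "$name"
}

# Hypothetical usage: preview the commands without executing them.
#   DRY_RUN=1 cpuset_lifecycle Charlie 2-4 1 stress -c 1 -t 30
```

Because the workload here runs in the foreground rather than with a trailing &, it should be one that exits on its own. The usage line gives stress a 30-second timeout with -t for that reason.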
Linux Containers: The Future of Kernel Resource Management
Even as work continues on the subsystems for disk and network resource management, the next generation of kernel resource management technology is fast approaching production readiness. Linux containers (lxc) builds on all the control group infrastructure that we’ve talked about in this article—CPU, Memory, Namespace, Freezer, Checkpoint/Restart and Network—to provide fast, lightweight, OS-level virtualization without the need for the instruction interpretation or emulation normally provided by a hypervisor. It’s similar to Linux-VServer or OpenVZ.
Linux containers can be used to run an application, a service or a full (Linux) operating system, partially separated from the rest of the system, but with essentially native performance. In particular, disk I/O is undiminished, and CPU and I/O scheduling are much fairer and more tunable than with full virtualization. This makes it possible to contain disk I/O intensive applications such as databases, and to manage their impact on other applications and processes.
Linux containers is provided as a technology preview in SUSE Linux Enterprise Server 11 Service Pack 1, and it is our intent to provide full production support in Service Pack 2. The lxc technology preview comes with rich online documentation (man lxc), including some implementation examples. Information on building and using Linux containers can be found on SourceForge, and on opensuse.org.