2005-04-16 16:20:36 -06:00
CPUSETS
-------

Copyright (C) 2004 BULL SA.
Written by Simon.Derr@bull.net

Portions Copyright (c) 2004 Silicon Graphics, Inc.
Modified by Paul Jackson <pj@sgi.com>

CONTENTS:
=========

1. Cpusets
  1.1 What are cpusets ?
  1.2 Why are cpusets needed ?
  1.3 How are cpusets implemented ?
  1.4 How do I use cpusets ?
2. Usage Examples and Syntax
  2.1 Basic Usage
  2.2 Adding/removing cpus
  2.3 Setting flags
  2.4 Attaching processes
3. Questions
4. Contact

1. Cpusets
==========

1.1 What are cpusets ?
----------------------

Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks.

Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset. They form a nested
hierarchy visible in a virtual file system. These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.

Each task has a pointer to a cpuset. Multiple tasks may reference
the same cpuset. Requests by a task, using the sched_setaffinity(2)
system call to include CPUs in its CPU affinity mask, and using the
mbind(2) and set_mempolicy(2) system calls to include Memory Nodes
in its memory policy, are both filtered through that task's cpuset,
filtering out any CPUs or Memory Nodes not in that cpuset. The
scheduler will not schedule a task on a CPU that is not allowed in
its cpus_allowed vector, and the kernel page allocator will not
allocate a page on a node that is not allowed in the requesting task's
mems_allowed vector.

If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct
ancestor or descendant, may share any of the same CPUs or Memory Nodes.
A cpuset that is cpu exclusive has a sched domain associated with it.
The sched domain consists of all cpus in the current cpuset that are not
part of any exclusive child cpusets.
This ensures that the scheduler load balancing code only balances
against the cpus that are in the sched domain as defined above and not
all of the cpus in the system. This removes any overhead due to
load balancing code trying to pull tasks outside of the cpu exclusive
cpuset only to be prevented by the tasks' cpus_allowed mask.
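
As a rough illustration of the sched domain rule above, the following
shell sketch computes which CPUs remain in a cpu exclusive cpuset's
sched domain. It is a toy model, not kernel code: the function name
sched_domain_cpus and the space-separated CPU lists are assumptions
made only for this example.

```shell
#!/bin/sh
# Toy model only (hypothetical helper, not part of any cpuset tool).
# Prints the CPUs of a cpuset that are not claimed by any of its
# exclusive child cpusets.
# Usage: sched_domain_cpus "CPUS OF CPUSET" "CPUS OF CHILD 1" ...
sched_domain_cpus() {
    own=$1
    shift
    taken=" $* "        # every CPU held by an exclusive child
    result=""
    for cpu in $own; do
        case "$taken" in
            *" $cpu "*) ;;                          # held by a child
            *) result="${result:+$result }$cpu" ;;  # stays in the domain
        esac
    done
    echo "$result"
}

# A cpuset with CPUs 0-7 whose exclusive children hold 0-1 and 4-5:
sched_domain_cpus "0 1 2 3 4 5 6 7" "0 1" "4 5"    # prints "2 3 6 7"
```

In this model the load balancer of the parent cpuset would only
consider CPUs 2, 3, 6 and 7.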

User level code may create and destroy cpusets by name in the cpuset
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.


1.2 Why are cpusets needed ?
----------------------------

The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies and multiple Memory Nodes having
non-uniform access times (NUMA) presents additional challenges for
the efficient scheduling and memory placement of processes.

Frequently more modest sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.

But larger systems, which benefit more from careful processor and
memory placement to reduce memory access times and contention,
and which typically represent a larger investment for the customer,
can benefit from explicitly placing jobs on properly sized subsets of
the system.

This can be especially valuable on:

    * Web Servers running multiple instances of the same web application,
    * Servers running different applications (for instance, a web server
      and a database),
    * NUMA systems running large HPC applications with demanding
      performance characteristics, or
    * Servers running orthogonal workloads, such as RT applications
      requiring low latency and HPC applications that are throughput
      sensitive, for which cpu_exclusive cpusets are useful.

These subsets, or "soft partitions", must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
executing jobs.

The kernel cpuset patch provides the minimum essential kernel
mechanisms required to efficiently implement such subsets. It
leverages existing CPU and Memory Placement facilities in the Linux
kernel to avoid any additional impact on the critical scheduler or
memory allocator code.


1.3 How are cpusets implemented ?
---------------------------------

Cpusets provide a Linux kernel (2.6.7 and above) mechanism to constrain
which CPUs and Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extend these two mechanisms as follows:

 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
   kernel.
 - Each task in the system is attached to a cpuset, via a pointer
   in the task structure to a reference counted cpuset structure.
 - Calls to sched_setaffinity are filtered to just those CPUs
   allowed in that task's cpuset.
 - Calls to mbind and set_mempolicy are filtered to just
   those Memory Nodes allowed in that task's cpuset.
 - The root cpuset contains all the system's CPUs and Memory
   Nodes.
 - For any cpuset, one can define child cpusets containing a subset
   of the parent's CPU and Memory Node resources.
 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
   browsing and manipulation from user space.
 - A cpuset may be marked exclusive, which ensures that no other
   cpuset (except direct ancestors and descendants) may contain
   any overlapping CPUs or Memory Nodes.
   Also, a cpu_exclusive cpuset is associated with a sched
   domain.
 - You can list all the tasks (by pid) attached to any cpuset.

The implementation of cpusets requires a few, simple hooks
into the rest of the kernel, none in performance critical paths:

 - in init/main.c, to initialize the root cpuset at system boot.
 - in fork and exit, to attach and detach a task from its cpuset.
 - in sched_setaffinity, to mask the requested CPUs by what's
   allowed in that task's cpuset.
 - in sched.c migrate_all_tasks(), to keep migrating tasks within
   the CPUs allowed by their cpuset, if possible.
 - in sched.c, a new API partition_sched_domains for handling
   sched domain changes associated with cpu_exclusive cpusets
   and related changes in both sched.c and arch/ia64/kernel/domain.c.
 - in the mbind and set_mempolicy system calls, to mask the requested
   Memory Nodes by what's allowed in that task's cpuset.
 - in page_alloc, to restrict memory to allowed nodes.
 - in vmscan.c, to restrict page recovery to the current cpuset.

In addition, a new file system of type "cpuset" may be mounted,
typically at /dev/cpuset, to enable browsing and modifying the cpusets
presently known to the kernel. No new system calls are added for
cpusets - all support for querying and modifying cpusets is via
this cpuset file system.

Each task under /proc has an added file named 'cpuset', displaying
the cpuset name, as the path relative to the root of the cpuset file
system.

The /proc/<pid>/status file for each task has two added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory),
in the format seen in the following example:

  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
  Mems_allowed:   ffffffff,ffffffff
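
The masks above are comma-separated groups of hex digits, one bit per
CPU or Memory Node. As a small sketch of how such a value can be
interpreted, the shell function below counts the bits that are set.
The name count_mask_bits is hypothetical and exists only for this
example.

```shell
#!/bin/sh
# Hypothetical helper: count the bits set in a comma-separated hex
# mask such as a Cpus_allowed or Mems_allowed value.
# Usage: count_mask_bits ffffffff,ffffffff
count_mask_bits() {
    # Drop the commas and fold upper case hex digits to lower case.
    digits=$(printf '%s' "$1" | tr -d ',' | tr 'A-F' 'a-f')
    count=0
    while [ -n "$digits" ]; do
        rest=${digits#?}            # everything after the first digit
        digit=${digits%"$rest"}     # the first digit itself
        case $digit in
            0)            n=0 ;;
            1|2|4|8)      n=1 ;;
            3|5|6|9|a|c)  n=2 ;;
            7|b|d|e)      n=3 ;;
            f)            n=4 ;;
            *) echo "bad hex digit: $digit" >&2; return 1 ;;
        esac
        count=$((count + n))
        digits=$rest
    done
    echo "$count"
}

count_mask_bits ffffffff,ffffffff    # prints 64

# On a live system the mask could be taken from the status file, e.g.:
#   count_mask_bits "$(awk '/^Cpus_allowed:/ {print $2}' /proc/self/status)"
```

For the example values shown above, this gives 128 allowed CPUs and
64 allowed Memory Nodes.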

Each cpuset is represented by a directory in the cpuset file system
containing the following files describing that cpuset:

 - cpus: list of CPUs in that cpuset
 - mems: list of Memory Nodes in that cpuset
 - cpu_exclusive flag: is cpu placement exclusive?
 - mem_exclusive flag: is memory placement exclusive?
 - tasks: list of tasks (by pid) attached to that cpuset

New cpusets are created using the mkdir system call or shell
command. The properties of a cpuset, such as its flags, allowed
CPUs and Memory Nodes, and attached tasks, are modified by writing
to the appropriate file in that cpuset's directory, as listed above.

The named hierarchical structure of nested cpusets allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".

The attachment of each task, automatically inherited at fork by any
children of that task, to a cpuset allows organizing the work load
on a system into related sets of tasks such that each set is constrained
to using the CPUs and Memory Nodes of a particular cpuset. A task
may be re-attached to any other cpuset, if allowed by the permissions
on the necessary cpuset file system directories.

Such management of a system "in the large" integrates smoothly with
the detailed placement done on individual tasks and memory regions
using the sched_setaffinity, mbind and set_mempolicy system calls.

The following rules apply to each cpuset:

 - Its CPUs and Memory Nodes must be a subset of its parent's.
 - It can only be marked exclusive if its parent is.
 - If its cpu or memory is exclusive, they may not overlap any sibling.

These rules, and the natural hierarchy of cpusets, enable efficient
enforcement of the exclusive guarantee, without having to scan all
cpusets every time any of them change to ensure nothing overlaps an
exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
to represent the cpuset hierarchy provides for a familiar permission
and name space for cpusets, with a minimum of additional kernel code.

1.4 How do I use cpusets ?
--------------------------

In order to minimize the impact of cpusets on critical kernel
code, such as the scheduler, and due to the fact that the kernel
does not support one task updating the memory placement of another
task directly, the impact on a task of changing its cpuset CPU
or Memory Node placement, or of changing to which cpuset a task
is attached, is subtle.

If a cpuset has its Memory Nodes modified, then for each task attached
to that cpuset, the next time that the kernel attempts to allocate
a page of memory for that task, the kernel will notice the change
in the task's cpuset, and update its per-task memory placement to
remain within the new cpuset's memory placement. If the task was using
mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
its new cpuset, then the task will continue to use whatever subset
of MPOL_BIND nodes are still allowed in the new cpuset. If the task
was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
in the new cpuset, then the task will be essentially treated as if it
was MPOL_BIND bound to the new cpuset (even though its numa placement,
as queried by get_mempolicy(), doesn't change). If a task is moved
from one cpuset to another, then the kernel will adjust the task's
memory placement, as above, the next time that the kernel attempts
to allocate a page of memory for that task.

If a cpuset has its CPUs modified, then each task using that
cpuset does _not_ change its behavior automatically. In order to
minimize the impact on the critical scheduling code in the kernel,
tasks will continue to use their prior CPU placement until they
are rebound to their cpuset, by rewriting their pid to the 'tasks'
file of their cpuset. If a task had been bound to some subset of its
cpuset using the sched_setaffinity() call, and if any of that subset
is still allowed in its new cpuset settings, then the task will be
restricted to the intersection of the CPUs it was allowed on before,
and its new cpuset CPU placement. If, on the other hand, there is
no overlap between a task's prior placement and its new cpuset CPU
placement, then the task will be allowed to run on any CPU allowed
in its new cpuset. If a task is moved from one cpuset to another,
its CPU placement is updated in the same way as if the task's pid were
rewritten to the 'tasks' file of its current cpuset.
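
The intersection rule for CPU placement described above can be modeled
in a few lines of shell. This is only a sketch of the rule as stated in
the text, not kernel code; the function name rebound_cpus and the
space-separated CPU lists are assumptions made for the example.

```shell
#!/bin/sh
# Toy model (hypothetical helper): which CPUs a task may use after it
# is rebound to a cpuset.
# Usage: rebound_cpus "PRIOR CPUS OF TASK" "CPUS OF NEW CPUSET"
rebound_cpus() {
    prior=" $1 "
    common=""
    for cpu in $2; do
        case "$prior" in
            *" $cpu "*) common="${common:+$common }$cpu" ;;
        esac
    done
    if [ -n "$common" ]; then
        echo "$common"      # overlap: keep only the intersection
    else
        echo "$2"           # no overlap: any CPU of the new cpuset
    fi
}

rebound_cpus "2 3" "3 4 5"    # prints "3"
rebound_cpus "0 1" "4 5"      # prints "4 5"
```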

In summary, the memory placement of a task whose cpuset is changed is
updated by the kernel, on the next allocation of a page for that task,
but the processor placement is not updated, until that task's pid is
rewritten to the 'tasks' file of its cpuset. This is done to avoid
impacting the scheduler code in the kernel with a check for changes
in a task's processor placement.
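
Since processor placement is only refreshed when a pid is rewritten to
the 'tasks' file, a change to a cpuset's 'cpus' file would typically be
followed by rewriting every attached pid. A minimal sketch of that
step follows; the function name rebind_tasks is hypothetical, and the
cpuset directory passed to it is whatever path the cpuset file system
is mounted under.

```shell
#!/bin/sh
# Hypothetical helper: rewrite every pid listed in a cpuset's 'tasks'
# file back to that file, one pid per write.
# Usage: rebind_tasks /dev/cpuset/my_cpuset
rebind_tasks() {
    dir=$1
    failed=0
    # Read the whole list first, so the loop never reads the file
    # while it is being written.
    pids=$(cat "$dir/tasks") || return 1
    for pid in $pids; do
        if ! /bin/echo "$pid" > "$dir/tasks"; then
            echo "could not rebind pid $pid" >&2
            failed=1
        fi
    done
    return $failed
}
```

A task that exits between the read and the write will make that one
write fail; the sketch reports it on stderr and carries on with the
remaining pids.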

There is an exception to the above. If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset,
then the kernel will automatically update the cpus_allowed of all
tasks attached to CPUs in that cpuset to allow all CPUs. When memory
hotplug functionality for removing Memory Nodes is available, a
similar exception is expected to apply there as well. In general,
the kernel prefers to violate cpuset placement, over starving a task
that has had all its allowed CPUs or Memory Nodes taken offline. User
code should reconfigure cpusets to only refer to online CPUs and Memory
Nodes when using hotplug to add or remove such resources.

There is a second exception to the above. GFP_ATOMIC requests are
kernel internal allocations that must be satisfied immediately.
The kernel may drop some request or, in rare cases, even panic, if a
GFP_ATOMIC alloc fails. If the request cannot be satisfied within
the current task's cpuset, then we relax the cpuset, and look for
memory anywhere we can find it. It's better to violate the cpuset
than stress the kernel.

To start a new job that is to be contained within a cpuset, the steps are:

 1) mkdir /dev/cpuset
 2) mount -t cpuset none /dev/cpuset
 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
    the /dev/cpuset virtual file system.
 4) Start a task that will be the "founding father" of the new job.
 5) Attach that task to the new cpuset by writing its pid to the
    /dev/cpuset tasks file for that cpuset.
 6) fork, exec or clone the job tasks from this founding father task.

For example, the following sequence of commands will set up a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cpuset:

  mount -t cpuset none /dev/cpuset
  cd /dev/cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpus
  /bin/echo 1 > mems
  /bin/echo $$ > tasks
  sh
  # The subshell 'sh' is now running in cpuset Charlie
  # The next line should display '/Charlie'
  cat /proc/self/cpuset

In the case that a change of cpuset includes wanting to move already
allocated memory pages, consider further the work of IWAMOTO
Toshihiro <iwamoto@valinux.co.jp> for page remapping and memory
hotremoval, which can be found at:

  http://people.valinux.co.jp/~iwamoto/mh.html

The integration of cpusets with such memory migration is not yet
available.

In the future, a C library interface to cpusets will likely be
available. For now, the only way to query or modify cpusets is
via the cpuset file system, using the various cd, mkdir, echo, cat,
rmdir commands from the shell, or their equivalent from C.

The sched_setaffinity calls can also be done at the shell prompt using
SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
calls can be done at the shell prompt using the numactl command
(part of Andi Kleen's numa package).

2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying, and using cpusets can be done through the cpuset
virtual filesystem.

To mount it, type:
  # mount -t cpuset none /dev/cpuset

Then under /dev/cpuset you can find a tree that corresponds to the
tree of the cpusets in the system. For instance, /dev/cpuset
is the cpuset that holds the whole system.

If you want to create a new cpuset under /dev/cpuset:
  # cd /dev/cpuset
  # mkdir my_cpuset

Now you want to do something with this cpuset.
  # cd my_cpuset

In this directory you can find several files:
  # ls
  cpus  cpu_exclusive  mems  mem_exclusive  tasks

Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
it, its properties. By writing to these files you can manipulate
the cpuset.

Set some flags:
  # /bin/echo 1 > cpu_exclusive

Add some cpus:
  # /bin/echo 0-7 > cpus

Now attach your shell to this cpuset:
  # /bin/echo $$ > tasks

You can also create cpusets inside your cpuset by using mkdir in this
directory.
  # mkdir my_sub_cs

To remove a cpuset, just use rmdir:
  # rmdir my_sub_cs
This will fail if the cpuset is in use (has cpusets inside, or has
processes attached).

2.2 Adding/removing cpus
------------------------

This is the syntax to use when writing in the cpus or mems files
in cpuset directories:

  # /bin/echo 1-4 > cpus        -> set cpus list to cpus 1,2,3,4
  # /bin/echo 1,2,3,4 > cpus    -> set cpus list to cpus 1,2,3,4
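
To make the equivalence of the two forms above concrete, the following
shell sketch expands a list into individual numbers. The function name
expand_list is hypothetical. Handling a mix of ranges and single
numbers in one list (such as 0-1,4) is an assumption of this sketch
that goes beyond the two forms shown above.

```shell
#!/bin/sh
# Hypothetical helper: expand a cpus or mems list such as "1-4" or
# "1,2,3,4" into space-separated numbers.
# Usage: expand_list 1-4
expand_list() {
    out=""
    old_ifs=$IFS
    IFS=,
    set -- $1               # split the list on commas
    IFS=$old_ifs
    for part in "$@"; do
        case $part in
            *-*) first=${part%-*}; last=${part#*-} ;;   # a range N-M
            *)   first=$part; last=$part ;;             # a single number
        esac
        n=$first
        while [ "$n" -le "$last" ]; do
            out="${out:+$out }$n"
            n=$((n + 1))
        done
    done
    echo "$out"
}

expand_list 1-4        # prints "1 2 3 4"
expand_list 1,2,3,4    # prints "1 2 3 4"
```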

2.3 Setting flags
-----------------

The syntax is very simple:

  # /bin/echo 1 > cpu_exclusive   -> set flag 'cpu_exclusive'
  # /bin/echo 0 > cpu_exclusive   -> unset flag 'cpu_exclusive'

2.4 Attaching processes
-----------------------

  # /bin/echo PID > tasks

Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another:

  # /bin/echo PID1 > tasks
  # /bin/echo PID2 > tasks
        ...
  # /bin/echo PIDn > tasks


3. Questions
============

Q: what's up with this '/bin/echo' ?
A: bash's builtin 'echo' command does not check calls to write() against
   errors. If you use it in the cpuset file system, you won't be
   able to tell whether a command succeeded or failed.
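
A small sketch of how a script might act on that exit status follows.
The function name cpuset_write is hypothetical; it only wraps the
/bin/echo idiom used throughout this document and reports a failure
on stderr.

```shell
#!/bin/sh
# Hypothetical helper: write a value to a cpuset file and report
# whether the write succeeded.
# Usage: cpuset_write VALUE FILE
cpuset_write() {
    if /bin/echo "$1" > "$2"; then
        return 0
    fi
    echo "write of '$1' to $2 failed" >&2
    return 1
}

# Example (paths are illustrative):
#   cpuset_write 1 /dev/cpuset/my_cpuset/cpu_exclusive || exit 1
```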

Q: When I attach processes, only the first one on the line really gets attached !
A: We can only return one error code per call to write(). So you should
   write only ONE pid per call.

4. Contact
==========

Web: http://www.bullopensource.org/cpuset