ef9fe980c6
Up until now, cgroup_freezer didn't implement hierarchy properly.
cgroups could be arranged in hierarchy but it didn't make any
difference in how each cgroup_freezer behaved. They all operated
separately.

This patch implements proper hierarchy support. If a cgroup is
frozen, all its descendants are frozen. A cgroup is thawed iff it and
all its ancestors are THAWED. freezer.self_freezing shows the current
freezing state for the cgroup itself. freezer.parent_freezing shows
whether the cgroup is freezing because any of its ancestors is
freezing.

freezer_post_create() locks the parent and new cgroup and inherits the
parent's state and freezer_change_state() applies new state top-down
using cgroup_for_each_descendant_pre() which guarantees that no child
can escape its parent's state. update_if_frozen() uses
cgroup_for_each_descendant_post() to propagate frozen states
bottom-up.

Synchronization could be coarser and easier by using a single mutex to
protect all hierarchy operations. Finer grained approach was used
because it wasn't too difficult for cgroup_freezer and I think it's
beneficial to have an example implementation and cgroup_freezer is
rather simple and can serve a good one.

As this makes cgroup_freezer properly hierarchical,
freezer_subsys.broken_hierarchy marking is removed.

Note that this patch changes userland visible behavior - freezing a
cgroup now freezes all its descendants too. This behavior change is
intended and has been warned via .broken_hierarchy.

v2: Michal spotted a bug in freezer_change_state() - descendants were
    inheriting from the wrong ancestor. Fixed.

v3: Documentation/cgroups/freezer-subsystem.txt updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>

The cgroup freezer is useful to batch job management systems which start
and stop sets of tasks in order to schedule the resources of a machine
according to the desires of a system administrator. This sort of program
is often used on HPC clusters to schedule access to the cluster as a
whole. The cgroup freezer uses cgroups to describe the set of tasks to
be started/stopped by the batch job management system. It also provides
a means to start and stop the tasks composing the job.

The cgroup freezer will also be useful for checkpointing running groups
of tasks. The freezer allows the checkpoint code to obtain a consistent
image of the tasks by attempting to force the tasks in a cgroup into a
quiescent state. Once the tasks are quiescent another task can
walk /proc or invoke a kernel interface to gather information about the
quiesced tasks. Checkpointed tasks can be restarted later should a
recoverable error occur. This also allows the checkpointed tasks to be
migrated between nodes in a cluster by copying the gathered information
to another node and restarting the tasks there.

Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping
and resuming tasks in userspace. Both of these signals are observable
from within the tasks we wish to freeze. While SIGSTOP cannot be caught,
blocked, or ignored it can be seen by waiting or ptracing parent tasks.
SIGCONT is especially unsuitable since it can be caught by the task. Any
programs designed to watch for SIGSTOP and SIGCONT could be broken by
attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can
demonstrate this problem using nested bash shells:

    $ echo $$
    16644
    $ bash
    $ echo $$
    16690

    From a second, unrelated bash shell:
    $ kill -SIGSTOP 16690
    $ kill -SIGCONT 16690

    <at this point 16690 exits and causes 16644 to exit too>

This happens because bash can observe both signals and choose how it
responds to them.

Another example of a program which catches and responds to these
signals is gdb. In fact any program designed to use ptrace is likely to
have a problem with this method of stopping and resuming tasks.

In contrast, the cgroup freezer uses the kernel freezer code to
prevent the freeze/unfreeze cycle from becoming visible to the tasks
being frozen. This allows the bash example above and gdb to run as
expected.

The cgroup freezer is hierarchical. Freezing a cgroup freezes all
tasks belonging to the cgroup and all its descendant cgroups. Each
cgroup has its own state (self-state) and the state inherited from the
parent (parent-state). Iff both states are THAWED, the cgroup is
THAWED.

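The combination rule above can be sketched as a small shell function.
This is only an illustration of the rule, not kernel code: the name
effective_state and its 0/1 arguments (0 = THAWED, 1 = freezing) are
assumptions made for this example.

```shell
# Hypothetical helper illustrating the rule "THAWED iff both the
# self-state and the parent-state are THAWED".
#   $1 = self-state flag   (0 = THAWED, 1 = freezing)
#   $2 = parent-state flag (0 = THAWED, 1 = freezing)
effective_state() {
    if [ "$1" -eq 0 ] && [ "$2" -eq 0 ]; then
        echo THAWED
    else
        # The real interface reports either FREEZING or FROZEN here,
        # depending on task progress; this sketch does not model that.
        echo FREEZING_OR_FROZEN
    fi
}
```

For instance, "effective_state 0 1" describes a cgroup that was never
frozen itself but has a freezing ancestor, so it is not THAWED.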
The following cgroupfs files are created by cgroup freezer.

* freezer.state: Read-write.

  When read, returns the effective state of the cgroup - "THAWED",
  "FREEZING" or "FROZEN". This is the combined self and parent-states.
  If any is freezing, the cgroup is freezing (FREEZING or FROZEN).

  FREEZING cgroup transitions into FROZEN state when all tasks
  belonging to the cgroup and its descendants become frozen. Note that
  a cgroup reverts to FREEZING from FROZEN after a new task is added
  to the cgroup or one of its descendant cgroups until the new task is
  frozen.

  When written, sets the self-state of the cgroup. Two values are
  allowed - "FROZEN" and "THAWED". If FROZEN is written, the cgroup,
  if not already freezing, enters FREEZING state along with all its
  descendant cgroups.

  If THAWED is written, the self-state of the cgroup is changed to
  THAWED. Note that the effective state may not change to THAWED if
  the parent-state is still freezing. If a cgroup's effective state
  becomes THAWED, all its descendants which are freezing because of
  the cgroup also leave the freezing state.

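Because a cgroup may read back as FREEZING for a while after FROZEN is
written, a caller that needs the tasks to be quiescent usually has to
poll freezer.state. A minimal sketch follows, assuming the cgroup
directory is passed as the first argument; the function name
freeze_and_wait, the retry count and the one second interval are all
choices made for this example, not part of the freezer interface.

```shell
# freeze_and_wait DIR [TRIES]
# Hypothetical helper: write FROZEN to DIR/freezer.state, then poll
# the file until it reads FROZEN or TRIES attempts have been used.
# Returns 0 once FROZEN is observed, 1 on error or timeout.
freeze_and_wait() {
    dir=$1
    tries=${2:-10}

    echo FROZEN > "$dir/freezer.state" || return 1

    while [ "$tries" -gt 0 ]; do
        state=$(cat "$dir/freezer.state") || return 1
        if [ "$state" = "FROZEN" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

A caller might run "freeze_and_wait /sys/fs/cgroup/freezer/0" and
start gathering checkpoint data only if it returns 0. Since a cgroup
can revert from FROZEN to FREEZING when a new task is added, a single
successful poll is not a lasting guarantee.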
* freezer.self_freezing: Read only.

  Shows the self-state. 0 if the self-state is THAWED; otherwise, 1.
  This value is 1 iff the last write to freezer.state was "FROZEN".

* freezer.parent_freezing: Read only.

  Shows the parent-state. 0 if none of the cgroup's ancestors is
  freezing; otherwise, 1.

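Reading the two flags together tells why a cgroup is not THAWED, which
helps when writing THAWED to freezer.state appears to have no effect.
The sketch below assumes the cgroup directory is passed as the first
argument; the function name freeze_reason and the strings it prints
are made up for this example.

```shell
# freeze_reason DIR
# Hypothetical helper: report whether the cgroup in DIR is freezing
# because of its own self-state, because of an ancestor, or both.
freeze_reason() {
    dir=$1
    self=$(cat "$dir/freezer.self_freezing") || return 1
    parent=$(cat "$dir/freezer.parent_freezing") || return 1

    case "$self$parent" in
        00) echo "not freezing" ;;
        10) echo "self" ;;
        01) echo "ancestor" ;;
        11) echo "self and ancestor" ;;
        *)  echo "unexpected" ; return 1 ;;
    esac
}
```

If this prints "ancestor", thawing the cgroup itself will not change
its effective state; an ancestor has to be thawed instead.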
The root cgroup is non-freezable and the above interface files don't
exist in it.

* Examples of usage :

   # mkdir /sys/fs/cgroup/freezer
   # mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer
   # mkdir /sys/fs/cgroup/freezer/0
   # echo $some_pid > /sys/fs/cgroup/freezer/0/tasks

to get status of the freezer subsystem :

   # cat /sys/fs/cgroup/freezer/0/freezer.state
   THAWED

to freeze all tasks in the container :

   # echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state
   # cat /sys/fs/cgroup/freezer/0/freezer.state
   FREEZING
   # cat /sys/fs/cgroup/freezer/0/freezer.state
   FROZEN

to unfreeze all tasks in the container :

   # echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state
   # cat /sys/fs/cgroup/freezer/0/freezer.state
   THAWED

This is the basic mechanism which should do the right thing for user space
tasks in a simple scenario.