cfq-iosched: enable full blkcg hierarchy support
With the previous two patches, all cfqg scheduling decisions are based
on vfraction and ready for hierarchy support. The only thing which
keeps the behavior flat is cfqg_flat_parent() which makes vfraction
calculation consider all non-root cfqgs children of the root cfqg.

Replace it with cfqg_parent() which returns the real parent. This
enables full blkcg hierarchy support for cfq-iosched. For example,
consider the following hierarchy.

        root
      /      \
   A:500      B:250
  /     \
 AA:500  AB:1000

For simplicity, let's say all the leaf nodes have active tasks and are
on service tree. For each leaf node, vfraction would be

 AA: (500  / 1500) * (500 / 750) =~ 0.2222
 AB: (1000 / 1500) * (500 / 750) =~ 0.4444
  B:                 (250 / 750) =~ 0.3333

and vdisktime will be distributed accordingly. For more detail,
please refer to Documentation/block/cfq-iosched.txt.

v2: cfq-iosched.txt updated to describe group scheduling as suggested
    by Vivek.

v3: blkio-controller.txt updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
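The vfraction arithmetic in the example above can be sketched in a few lines of C. This is only an illustration, not the kernel code: struct node, its level_total field, vfraction() and build_example() are hypothetical names made up for this sketch, and double is used for readability even though the kernel avoids floating point. Each node's share is its weight divided by the total active weight at its level, multiplied by the same ratio for every ancestor up to (but excluding) the root.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cfq_group; only what the example needs. */
struct node {
	const struct node *parent;	/* NULL for the root */
	unsigned int weight;		/* weight claimed at the parent's level */
	unsigned int level_total;	/* sum of active weights at that level */
};

/*
 * Walk from @n up to the root, multiplying weight / level_total at
 * every level.  The root has no parent and contributes nothing, so
 * its own fraction is 1.0.
 */
static double vfraction(const struct node *n)
{
	double frac = 1.0;

	for (; n->parent; n = n->parent) {
		assert(n->level_total > 0);
		frac *= (double)n->weight / n->level_total;
	}
	return frac;
}

enum { ROOT, A, B, AA, AB, NR_NODES };

/* Hierarchy from the commit message: only leaf nodes have tasks. */
static void build_example(struct node n[NR_NODES])
{
	n[ROOT] = (struct node){ NULL, 0, 0 };
	n[A]    = (struct node){ &n[ROOT], 500, 750 };	/* 500 + 250 */
	n[B]    = (struct node){ &n[ROOT], 250, 750 };
	n[AA]   = (struct node){ &n[A], 500, 1500 };	/* 500 + 1000 */
	n[AB]   = (struct node){ &n[A], 1000, 1500 };
}
```

Calling build_example() on an array of NR_NODES nodes and then vfraction() on the AA, AB and B entries gives the three fractions listed in the commit message, and they sum to one because every active leaf is accounted for.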
parent 41cad6ab2c
commit d02f7aa8dc
3 changed files with 88 additions and 26 deletions
@@ -102,6 +102,64 @@ processing of request. Therefore, increasing the value can imporve the
 performace although this can cause the latency of some I/O to increase due
 to more number of requests.
 
+CFQ Group scheduling
+====================
+
+CFQ supports blkio cgroup and has "blkio." prefixed files in each
+blkio cgroup directory. It is weight-based and there are four knobs
+for configuration - weight[_device] and leaf_weight[_device].
+Internal cgroup nodes (the ones with children) can also have tasks in
+them, so the former two configure how much proportion the cgroup as a
+whole is entitled to at its parent's level while the latter two
+configure how much proportion the tasks in the cgroup have compared to
+its direct children.
+
+Another way to think about it is assuming that each internal node has
+an implicit leaf child node which hosts all the tasks whose weight is
+configured by leaf_weight[_device]. Let's assume a blkio hierarchy
+composed of five cgroups - root, A, B, AA and AB - with the following
+weights where the names represent the hierarchy.
+
+        weight leaf_weight
+ root :  125    125
+ A    :  500    750
+ B    :  250    500
+ AA   :  500    500
+ AB   : 1000    500
+
+root never has a parent making its weight is meaningless. For backward
+compatibility, weight is always kept in sync with leaf_weight. B, AA
+and AB have no child and thus its tasks have no children cgroup to
+compete with. They always get 100% of what the cgroup won at the
+parent level. Considering only the weights which matter, the hierarchy
+looks like the following.
+
+          root
+       /    |   \
+      A     B    leaf
+     500   250   125
+   /  |  \
+  AA  AB  leaf
+ 500 1000 750
+
+If all cgroups have active IOs and competing with each other, disk
+time will be distributed like the following.
+
+Distribution below root. The total active weight at this level is
+A:500 + B:250 + C:125 = 875.
+
+ root-leaf :   125 /  875      =~ 14%
+ A         :   500 /  875      =~ 57%
+ B(-leaf)  :   250 /  875      =~ 28%
+
+A has children and further distributes its 57% among the children and
+the implicit leaf node. The total active weight at this level is
+AA:500 + AB:1000 + A-leaf:750 = 2250.
+
+ A-leaf    : ( 750 / 2250) * A =~ 19%
+ AA(-leaf) : ( 500 / 2250) * A =~ 12%
+ AB(-leaf) : (1000 / 2250) * A =~ 25%
+
 CFQ IOPS Mode for group scheduling
 ===================================
 Basic CFQ design is to provide priority based time slices. Higher priority
@@ -94,13 +94,11 @@ Throttling/Upper Limit policy
 
 Hierarchical Cgroups
 ====================
-- Currently none of the IO control policy supports hierarchical groups. But
-  cgroup interface does allow creation of hierarchical cgroups and internally
-  IO policies treat them as flat hierarchy.
+- Currently only CFQ supports hierarchical groups. For throttling,
+  cgroup interface does allow creation of hierarchical cgroups and
+  internally it treats them as flat hierarchy.
 
-  So this patch will allow creation of cgroup hierarchcy but at the backend
-  everything will be treated as flat. So if somebody created a hierarchy like
-  as follows.
+  If somebody created a hierarchy like as follows.
 
                         root
                         /  \
@@ -108,16 +106,20 @@ Hierarchical Cgroups
                         |
                      test3
 
-  CFQ and throttling will practically treat all groups at same level.
+  CFQ will handle the hierarchy correctly but and throttling will
+  practically treat all groups at same level. For details on CFQ
+  hierarchy support, refer to Documentation/block/cfq-iosched.txt.
+  Throttling will treat the hierarchy as if it looks like the
+  following.
 
                                 pivot
                              /  /   \  \
                         root  test1 test2  test3
 
-  Down the line we can implement hierarchical accounting/control support
-  and also introduce a new cgroup file "use_hierarchy" which will control
-  whether cgroup hierarchy is viewed as flat or hierarchical by the policy..
-  This is how memory controller also has implemented the things.
+  Nesting cgroups, while allowed, isn't officially supported and blkio
+  genereates warning when cgroups nest. Once throttling implements
+  hierarchy support, hierarchy will be supported and the warning will
+  be removed.
 
 Various user visible config options
 ===================================
@@ -172,6 +174,12 @@ Proportional weight policy files
 	  dev     weight
 	  8:16    300
 
+- blkio.leaf_weight[_device]
+	- Equivalents of blkio.weight[_device] for the purpose of
+	  deciding how much weight tasks in the given cgroup has while
+	  competing with the cgroup's child cgroups. For details,
+	  please refer to Documentation/block/cfq-iosched.txt.
+
 - blkio.time
 	- disk time allocated to cgroup per device in milliseconds. First
 	  two fields specify the major and minor number of the device and
@@ -279,6 +287,11 @@ Proportional weight policy files
 	  and minor number of the device and third field specifies the number
 	  of times a group was dequeued from a particular device.
 
+- blkio.*_recursive
+	- Recursive version of various stats. These files show the
+	  same information as their non-recursive counterparts but
+	  include stats from all the descendant cgroups.
+
 Throttling/Upper limit policy files
 -----------------------------------
 - blkio.throttle.read_bps_device
@@ -606,20 +606,11 @@ static inline struct cfq_group *blkg_to_cfqg(struct blkcg_gq *blkg)
 	return pd_to_cfqg(blkg_to_pd(blkg, &blkcg_policy_cfq));
 }
 
-/*
- * Determine the parent cfqg for weight calculation. Currently, cfqg
- * scheduling is flat and the root is the parent of everyone else.
- */
-static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg)
+static inline struct cfq_group *cfqg_parent(struct cfq_group *cfqg)
 {
-	struct blkcg_gq *blkg = cfqg_to_blkg(cfqg);
-	struct cfq_group *root;
+	struct blkcg_gq *pblkg = cfqg_to_blkg(cfqg)->parent;
 
-	while (blkg->parent)
-		blkg = blkg->parent;
-	root = blkg_to_cfqg(blkg);
-
-	return root != cfqg ? root : NULL;
+	return pblkg ? blkg_to_cfqg(pblkg) : NULL;
 }
 
 static inline void cfqg_get(struct cfq_group *cfqg)
@@ -722,7 +713,7 @@ static void cfq_pd_reset_stats(struct blkcg_gq *blkg)
 
 #else /* CONFIG_CFQ_GROUP_IOSCHED */
 
-static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg) { return NULL; }
+static inline struct cfq_group *cfqg_parent(struct cfq_group *cfqg) { return NULL; }
 static inline void cfqg_get(struct cfq_group *cfqg) { }
 static inline void cfqg_put(struct cfq_group *cfqg) { }
 
@@ -1290,7 +1281,7 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
 	 * stops once an already activated node is met. vfraction
 	 * calculation should always continue to the root.
 	 */
-	while ((parent = cfqg_flat_parent(pos))) {
+	while ((parent = cfqg_parent(pos))) {
 		if (propagate) {
 			propagate = !parent->nr_active++;
 			parent->children_weight += pos->weight;
@@ -1341,7 +1332,7 @@ cfq_group_service_tree_del(struct cfq_rb_root *st, struct cfq_group *cfqg)
 	pos->children_weight -= pos->leaf_weight;
 
 	while (propagate) {
-		struct cfq_group *parent = cfqg_flat_parent(pos);
+		struct cfq_group *parent = cfqg_parent(pos);
 
 		/* @pos has 0 nr_active at this point */
 		WARN_ON_ONCE(pos->children_weight);