mempolicy: use MPOL_PREFERRED for system-wide default policy

Currently, when one specifies MPOL_DEFAULT via a NUMA memory policy API
[set_mempolicy(), mbind() and internal versions], the kernel simply
installs a NULL struct mempolicy pointer in the appropriate context: task
policy, vma policy, or shared policy. This causes any use of that policy
to "fall back" to the next most specific policy scope.

The only use of MPOL_DEFAULT to mean "local allocation" is in the system
default policy. This requires extra checks/cases for MPOL_DEFAULT in many
mempolicy.c functions.

There is another, "preferred" way to specify local allocation via the
APIs: the MPOL_PREFERRED policy mode with an empty nodemask. Internally,
the empty nodemask gets converted to a preferred_node id of '-1'. All
internal usage of MPOL_PREFERRED will convert the '-1' to the id of the
node local to the cpu where the allocation occurs.

System default policy, except during boot, is hard-coded to "local
allocation". By using the MPOL_PREFERRED mode with a negative value of
preferred node for system default policy, MPOL_DEFAULT will never occur in
the 'mode' member of a struct mempolicy. Thus, we can remove all checks
for MPOL_DEFAULT when converting policy to a node id/zonelist in the
allocation paths.

In slab_node(), return the local node id when the policy pointer is NULL.
There is no need to set a pol value to take the switch default. Replace
the switch default with BUG()--i.e., shouldn't happen.

With this patch MPOL_DEFAULT is only used in the APIs, including internal
calls to do_set_mempolicy(), and in the display of policy in
/proc/<pid>/numa_maps. It always means "fall back" to the next most
specific policy scope. This simplifies the description of memory policies
quite a bit, with no visible change in behavior.

get_mempolicy() continues to return MPOL_DEFAULT and an empty nodemask
when the requested policy [task or vma/shared] is NULL. These are the
values one would supply via set_mempolicy() or mbind() to achieve that
condition--default behavior.

This patch updates Documentation to reflect this change.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
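
The API-to-internal conversion described in the message can be sketched in plain C. This is a minimal userspace model under stated assumptions, not kernel code: `model_mode`, `model_policy`, and `model_new()` are hypothetical names, only the two modes discussed are modelled, and the nodemask is reduced to a single `unsigned long` bitmask.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the two API modes discussed above. */
enum model_mode { MODEL_DEFAULT, MODEL_PREFERRED };

/* Hypothetical, simplified stand-in for struct mempolicy. */
struct model_policy {
	enum model_mode mode;
	int preferred_node;	/* -1 means "local allocation" */
};

/*
 * Model of the conversion:
 *  - MODEL_DEFAULT yields NULL, i.e. no policy object is installed and
 *    the caller falls back to the next most specific scope;
 *  - MODEL_PREFERRED with an empty mask yields preferred_node == -1;
 *  - MODEL_PREFERRED with a non-empty mask picks the lowest set node
 *    (an assumption of this sketch).
 * Bit n set in nodemask means node n is in the mask.
 */
static struct model_policy *model_new(enum model_mode mode,
				      unsigned long nodemask,
				      struct model_policy *storage)
{
	int node;

	if (mode == MODEL_DEFAULT)
		return NULL;

	storage->mode = MODEL_PREFERRED;
	storage->preferred_node = -1;	/* empty mask: local allocation */
	for (node = 0; node < (int)(8 * sizeof(nodemask)); node++) {
		if (nodemask & (1UL << node)) {
			storage->preferred_node = node;
			break;
		}
	}
	return storage;
}
```

In this model a caller that gets NULL back simply stores NULL in the task, vma, or shared policy slot, which is what lets MPOL_DEFAULT disappear from the policy objects themselves.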
parent 52cd3b0740
commit bea904d54d
2 changed files with 58 additions and 60 deletions
@@ -147,35 +147,18 @@ Components of Memory Policies

 Linux memory policy supports the following 4 behavioral modes:

-Default Mode--MPOL_DEFAULT: The behavior specified by this mode is
-context or scope dependent.
+Default Mode--MPOL_DEFAULT: This mode is only used in the memory
+policy APIs. Internally, MPOL_DEFAULT is converted to the NULL
+memory policy in all policy scopes. Any existing non-default policy
+will simply be removed when MPOL_DEFAULT is specified. As a result,
+MPOL_DEFAULT means "fall back to the next most specific policy scope."

-As mentioned in the Policy Scope section above, during normal
-system operation, the System Default Policy is hard coded to
-contain the Default mode.
+For example, a NULL or default task policy will fall back to the
+system default policy. A NULL or default vma policy will fall
+back to the task policy.

-In this context, default mode means "local" allocation--that is
-attempt to allocate the page from the node associated with the cpu
-where the fault occurs. If the "local" node has no memory, or the
-node's memory can be exhausted [no free pages available], local
-allocation will "fallback to"--attempt to allocate pages from--
-"nearby" nodes, in order of increasing "distance".
-
-Implementation detail -- subject to change: "Fallback" uses
-a per node list of sibling nodes--called zonelists--built at
-boot time, or when nodes or memory are added or removed from
-the system [memory hotplug]. These per node zonelist are
-constructed with nodes in order of increasing distance based
-on information provided by the platform firmware.
-
-When a task/process policy or a shared policy contains the Default
-mode, this also means "local allocation", as described above.
-
-In the context of a VMA, Default mode means "fall back to task
-policy"--which may or may not specify Default mode. Thus, Default
-mode can not be counted on to mean local allocation when used
-on a non-shared region of the address space. However, see
-MPOL_PREFERRED below.
+When specified in one of the memory policy APIs, the Default mode
+does not use the optional set of nodes.

 It is an error for the set of nodes specified for this policy to
 be non-empty.
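
The "fall back to the next most specific policy scope" rule in the updated text can be illustrated with a small userspace model. It is a sketch under stated assumptions: `model_policy`, `model_system_default`, and `model_effective_policy()` are hypothetical names, and shared policies are left out for brevity.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct mempolicy. */
struct model_policy {
	int mode;		/* 1 stands for "preferred" in this sketch */
	int preferred_node;	/* -1 means "local allocation" */
};

/* Model of the system default: preferred mode with node -1 (local). */
static const struct model_policy model_system_default = { 1, -1 };

/*
 * A NULL pointer models "MPOL_DEFAULT was specified for this scope".
 * The effective policy is the most specific non-NULL one:
 * vma policy, then task policy, then the system default.
 */
static const struct model_policy *
model_effective_policy(const struct model_policy *vma_pol,
		       const struct model_policy *task_pol)
{
	if (vma_pol)
		return vma_pol;
	if (task_pol)
		return task_pol;
	return &model_system_default;
}
```

With both arguments NULL the lookup ends at the system default, so the result is always a usable policy object and never carries a "default" mode of its own.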
@@ -187,19 +170,18 @@ Components of Memory Policies

 MPOL_PREFERRED: This mode specifies that the allocation should be
 attempted from the single node specified in the policy. If that
-allocation fails, the kernel will search other nodes, exactly as
-it would for a local allocation that started at the preferred node
-in increasing distance from the preferred node. "Local" allocation
-policy can be viewed as a Preferred policy that starts at the node
+allocation fails, the kernel will search other nodes, in order of
+increasing distance from the preferred node based on information
+provided by the platform firmware.
 containing the cpu where the allocation takes place.

 Internally, the Preferred policy uses a single node--the
 preferred_node member of struct mempolicy. A "distinguished
 value of this preferred_node, currently '-1', is interpreted
 as "the node containing the cpu where the allocation takes
-place"--local allocation. This is the way to specify
-local allocation for a specific range of addresses--i.e. for
-VMA policies.
+place"--local allocation. "Local" allocation policy can be
+viewed as a Preferred policy that starts at the node containing
+the cpu where the allocation takes place.

 It is possible for the user to specify that local allocation is
 always preferred by passing an empty nodemask with this mode.

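
The Preferred-mode behavior documented above (start at the preferred node, treat '-1' as the local node, then search by increasing distance) can be modelled as follows. This is a hedged sketch, not the kernel's zonelist code: `MODEL_NODES`, the distance table, the `free_pages` array, and `model_alloc_node()` are all assumptions introduced for illustration, and the table is assumed to give each node its smallest distance to itself.

```c
#define MODEL_NODES 4	/* hypothetical node count for this sketch */

/*
 * Pick the node an allocation would land on in this model:
 * start at preferred_node, or at local_node when preferred_node is -1,
 * and choose the node with free pages that is closest to the start.
 * Returns -1 when no node has free pages.
 */
static int model_alloc_node(int preferred_node, int local_node,
			    const int distance[MODEL_NODES][MODEL_NODES],
			    const int free_pages[MODEL_NODES])
{
	int start = preferred_node >= 0 ? preferred_node : local_node;
	int best = -1;
	int n;

	for (n = 0; n < MODEL_NODES; n++) {
		if (free_pages[n] <= 0)
			continue;	/* node exhausted: keep searching */
		if (best < 0 || distance[start][n] < distance[start][best])
			best = n;
	}
	return best;
}
```

Passing -1 as `preferred_node` reproduces "local" allocation in the model, which is why the system default can be expressed as a Preferred policy with no special case.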
@@ -104,9 +104,13 @@ static struct kmem_cache *sn_cache;
    policied. */
 enum zone_type policy_zone = 0;

+/*
+ * run-time system-wide default policy => local allocation
+ */
 struct mempolicy default_policy = {
 	.refcnt = ATOMIC_INIT(1), /* never free it */
-	.mode = MPOL_DEFAULT,
+	.mode = MPOL_PREFERRED,
+	.v = { .preferred_node = -1 },
 };

 static const struct mempolicy_operations {
@@ -189,7 +193,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	if (mode == MPOL_DEFAULT) {
 		if (nodes && !nodes_empty(*nodes))
 			return ERR_PTR(-EINVAL);
-		return NULL;
+		return NULL;	/* simply delete any existing policy */
 	}
 	VM_BUG_ON(!nodes);

@@ -246,7 +250,6 @@ void __mpol_put(struct mempolicy *p)
 {
 	if (!atomic_dec_and_test(&p->refcnt))
 		return;
-	p->mode = MPOL_DEFAULT;
 	kmem_cache_free(policy_cache, p);
 }

@@ -626,13 +629,16 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 	return 0;
 }

-/* Fill a zone bitmap for a policy */
-static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
+/*
+ * Return nodemask for policy for get_mempolicy() query
+ */
+static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 {
 	nodes_clear(*nodes);
+	if (p == &default_policy)
+		return;
+
 	switch (p->mode) {
-	case MPOL_DEFAULT:
-		break;
 	case MPOL_BIND:
 		/* Fall through */
 	case MPOL_INTERLEAVE:
@@ -686,6 +692,11 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 	}

 	if (flags & MPOL_F_ADDR) {
+		/*
+		 * Do NOT fall back to task policy if the
+		 * vma/shared policy at addr is NULL. We
+		 * want to return MPOL_DEFAULT in this case.
+		 */
 		down_read(&mm->mmap_sem);
 		vma = find_vma_intersection(mm, addr, addr+1);
 		if (!vma) {
@@ -700,7 +711,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 		return -EINVAL;

 	if (!pol)
-		pol = &default_policy;
+		pol = &default_policy;	/* indicates default behavior */

 	if (flags & MPOL_F_NODE) {
 		if (flags & MPOL_F_ADDR) {
@@ -715,8 +726,11 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 			err = -EINVAL;
 			goto out;
 		}
-	} else
-		*policy = pol->mode | pol->flags;
+	} else {
+		*policy = pol == &default_policy ? MPOL_DEFAULT :
+						pol->mode;
+		*policy |= pol->flags;
+	}

 	if (vma) {
 		up_read(&current->mm->mmap_sem);
@@ -725,7 +739,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,

 	err = 0;
 	if (nmask)
-		get_zonemask(pol, nmask);
+		get_policy_nodemask(pol, nmask);

  out:
 	mpol_cond_put(pol);
@@ -1286,8 +1300,7 @@ static struct mempolicy *get_vma_policy(struct task_struct *task,
 									addr);
 			if (vpol)
 				pol = vpol;
-		} else if (vma->vm_policy &&
-				vma->vm_policy->mode != MPOL_DEFAULT)
+		} else if (vma->vm_policy)
 			pol = vma->vm_policy;
 	}
 	if (!pol)
@@ -1334,7 +1347,6 @@ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy)
 		nd = first_node(policy->v.nodes);
 		break;
 	case MPOL_INTERLEAVE: /* should not happen */
-	case MPOL_DEFAULT:
 		nd = numa_node_id();
 		break;
 	default:
@@ -1369,9 +1381,15 @@ static unsigned interleave_nodes(struct mempolicy *policy)
  */
 unsigned slab_node(struct mempolicy *policy)
 {
-	unsigned short pol = policy ? policy->mode : MPOL_DEFAULT;
+	if (!policy)
+		return numa_node_id();
+
+	switch (policy->mode) {
+	case MPOL_PREFERRED:
+		if (unlikely(policy->v.preferred_node >= 0))
+			return policy->v.preferred_node;
+		return numa_node_id();

-	switch (pol) {
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);

@@ -1390,13 +1408,8 @@ unsigned slab_node(struct mempolicy *policy)
 		return zone->node;
 	}

-	case MPOL_PREFERRED:
-		if (policy->v.preferred_node >= 0)
-			return policy->v.preferred_node;
-		/* Fall through */
-
 	default:
-		return numa_node_id();
+		BUG();
 	}
 }

@@ -1650,8 +1663,6 @@ int __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	if (a->mode != MPOL_DEFAULT && !mpol_match_intent(a, b))
 		return 0;
 	switch (a->mode) {
-	case MPOL_DEFAULT:
-		return 1;
 	case MPOL_BIND:
 		/* Fall through */
 	case MPOL_INTERLEAVE:
@@ -1828,7 +1839,7 @@ void mpol_shared_policy_init(struct shared_policy *info, unsigned short policy,
 	if (policy != MPOL_DEFAULT) {
 		struct mempolicy *newpol;

-		/* Falls back to MPOL_DEFAULT on any error */
+		/* Falls back to NULL policy [MPOL_DEFAULT] on any error */
 		newpol = mpol_new(policy, flags, policy_nodes);
 		if (!IS_ERR(newpol)) {
 			/* Create pseudo-vma that contains just the policy */
@@ -1952,9 +1963,14 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 	char *p = buffer;
 	int l;
 	nodemask_t nodes;
-	unsigned short mode = pol ? pol->mode : MPOL_DEFAULT;
+	unsigned short mode;
 	unsigned short flags = pol ? pol->flags : 0;

+	if (!pol || pol == &default_policy)
+		mode = MPOL_DEFAULT;
+	else
+		mode = pol->mode;
+
 	switch (mode) {
 	case MPOL_DEFAULT:
 		nodes_clear(nodes);
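
The reporting rule shared by the do_get_mempolicy() and mpol_to_str() hunks (a NULL policy, or the system default policy object, is shown to the user as MPOL_DEFAULT) can be summarised with a small userspace model. The names `model_policy`, `model_default_policy`, and `model_reported_mode()` are hypothetical, and the mode constants are local to this sketch rather than the kernel's definitions.

```c
#include <stddef.h>

/* Mode values local to this sketch. */
enum { MODEL_MPOL_DEFAULT = 0, MODEL_MPOL_PREFERRED = 1 };

/* Hypothetical, simplified stand-in for struct mempolicy. */
struct model_policy {
	int mode;
	int preferred_node;	/* -1 means "local allocation" */
};

/* Model of the system-wide default: preferred mode, local node. */
static const struct model_policy model_default_policy = {
	MODEL_MPOL_PREFERRED, -1
};

/*
 * Mode reported to the user. Identity with the default policy object,
 * not its mode field, decides whether "default" is reported, so a
 * user-installed preferred/local policy is still reported as preferred.
 */
static int model_reported_mode(const struct model_policy *pol)
{
	if (!pol || pol == &model_default_policy)
		return MODEL_MPOL_DEFAULT;
	return pol->mode;
}
```

Comparing pointers rather than contents is the design point here: the system default and an explicit "preferred, local" policy hold the same values but are reported differently.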