rename all 'cnt' fields and variables to the less yucky 'count' name.
yuckage noticed by Andrew Morton.
no change in code, other than the /proc/sched_debug bkl_count string got
a bit larger:
text data bss dec hex filename
38236 3506 24 41766 a326 sched.o.before
38240 3506 24 41770 a32a sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
undo some of the recent changes that are not needed after all,
such as last_min_vruntime.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
add vslice: the load-dependent "virtual slice" a task should
run ideally, so that the observed latency stays within the
sched_latency window.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
add per task and per rq BKL usage statistics.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Enable user-id based fair group scheduling. This is useful for anyone
who wants to test the group scheduler w/o having to enable
CONFIG_CGROUPS.
A separate scheduling group (i.e struct task_grp) is automatically created for
every new user added to the system. Upon uid change for a task, it is made to
move to the corresponding scheduling group.
A /proc tunable (/proc/root_user_share) is also provided to tune root
user's quota of cpu bandwidth.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
With the view of supporting user-id based fair scheduling (and not just
container-based fair scheduling), this patch renames several functions
and makes them independent of whether they are being used for container
or user-id based fair scheduling.
Also fix a problem reported by KAMEZAWA Hiroyuki (wrt allocating
less-sized array for tg->cfs_rq[] and tf->se[]).
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
the 'p' (task_struct) parameter in the sched_class :: yield_task() is
redundant as the caller is always the 'current'. Get rid of it.
text data bss dec hex filename
24341 2734 20 27095 69d7 sched.o.before
24330 2734 20 27084 69cc sched.o.after
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Get rid of 'sched_entity::fair_key'.
As a side effect, 'current' is not kept withing the tree for
SCHED_NORMAL/BATCH tasks anymore. This simplifies some parts of code
(e.g. entity_tick() and yield_task_fair()) and also somewhat optimizes
them (e.g. a single update_curr() now vs. dequeue/enqueue() before in
entity_tick()).
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
remove wait_runtime based fields and features, now that the CFS
math has been changed over to the vruntime metric.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
remove the wait_runtime-limit fields and the code depending on it, now
that the math has been changed over to rely on the vruntime metric.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
introduce se->vruntime as a sum of weighted delta-exec's, and use that
as the key into the tree.
the idea to use absolute virtual time as the basic metric of scheduling
has been first raised by William Lee Irwin, advanced by Tong Li and first
prototyped by Roman Zippel in the "Really Fair Scheduler" (RFS) patchset.
also see:
http://lkml.org/lkml/2007/9/2/76
for a simpler variant of this patch.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
remove the stat_gran code - it was disabled by default and it causes
unnecessary overhead.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
use constants if !CONFIG_SCHED_DEBUG.
this speeds up the code and reduces code-size:
text data bss dec hex filename
27464 3014 16 30494 771e sched.o.before
26929 3010 20 29959 7507 sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
track the maximum amount of time a task has executed while
the CPU load was at least 2x. (i.e. at least two nice-0
tasks were runnable)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
This patch allows you to create a new network namespace
using sys_clone, or sys_unshare.
As the network namespace is still experimental and under development
clone and unshare support is only made available when CONFIG_NET_NS is
selected at compile time.
As this patch introduces network namespace support into code paths
that exist when the CONFIG_NET is not selected there are a few
additions made to net_namespace.h to allow a few more functions
to be used when the networking stack is not compiled in.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It turns out that there are a few other five-second timers in the
kernel, and if the timers get in sync, the load-average can get
artificially inflated by events that just happen to coincide.
So just offset the load average calculation it by a timer tick.
Noticed by Anders Boström, for whom the coincidence started triggering
on one of his machines with the JBD jiffies rounding code (JBD is one of
the subsystems that also end up using a 5-second timer by default).
Tested-by: Anders Boström <anders@bostrom.dyndns.org>
Cc: Chuck Ebbert <cebbert@redhat.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This simplifies signalfd code, by avoiding it to remain attached to the
sighand during its lifetime.
In this way, the signalfd remain attached to the sighand only during
poll(2) (and select and epoll) and read(2). This also allows to remove
all the custom "tsk == current" checks in kernel/signal.c, since
dequeue_signal() will only be called by "current".
I think this is also what Ben was suggesting time ago.
The external effect of this, is that a thread can extract only its own
private signals and the group ones. I think this is an acceptable
behaviour, in that those are the signals the thread would be able to
fetch w/out signalfd.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield()
more agressive, by moving the yielding task to the last position
in the rbtree.
with sched_compat_yield=0:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2539 mingo 20 0 1576 252 204 R 50 0.0 0:02.03 loop_yield
2541 mingo 20 0 1576 244 196 R 50 0.0 0:02.05 loop
with sched_compat_yield=1:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2584 mingo 20 0 1576 248 196 R 99 0.0 0:52.45 loop
2582 mingo 20 0 1576 256 204 R 0 0.0 0:00.00 loop_yield
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
It turned out, that the user namespace is released during the do_exit() in
exit_task_namespaces(), but the struct user_struct is released only during the
put_task_struct(), i.e. MUCH later.
On debug kernels with poisoned slabs this will cause the oops in
uid_hash_remove() because the head of the chain, which resides inside the
struct user_namespace, will be already freed and poisoned.
Since the uid hash itself is required only when someone can search it, i.e.
when the namespace is alive, we can safely unhash all the user_struct-s from
it during the namespace exiting. The subsequent free_uid() will complete the
user_struct destruction.
For example simple program
#include <sched.h>
char stack[2 * 1024 * 1024];
int f(void *foo)
{
return 0;
}
int main(void)
{
clone(f, stack + 1 * 1024 * 1024, 0x10000000, 0);
return 0;
}
run on kernel with CONFIG_USER_NS turned on will oops the
kernel immediately.
This was spotted during OpenVZ kernel testing.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Acked-by: "Serge E. Hallyn" <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Surprisingly, but (spotted by Alexey Dobriyan) the uid hash still uses
list_heads, thus occupying twice as much place as it could. Convert it to
hlist_heads.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
de-HZ-ification of the granularity defaults unearthed a pre-existing
property of CFS: while it correctly converges to the granularity goal,
it does not prevent run-time fluctuations in the range of
[-gran ... 0 ... +gran].
With the increase of the granularity due to the removal of HZ
dependencies, this becomes visible in chew-max output (with 5 tasks
running):
out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40
out: 27 . 27. 32 | flu: 0 . 0 | ran: 17 . 13 | per: 44 . 40
out: 27 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 36 . 40
out: 29 . 27. 32 | flu: 2 . 0 | ran: 17 . 13 | per: 46 . 40
out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40
out: 29 . 27. 32 | flu: 0 . 0 | ran: 18 . 13 | per: 47 . 40
out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40
average slice is the ideal 13 msecs and the period is picture-perfect 40
msecs. But the 'ran' field fluctuates around 13.33 msecs and there's no
mechanism in CFS to keep that from happening: it's a perfectly valid
solution that CFS finds.
to fix this we add a granularity/preemption rule that knows about
the "target latency", which makes tasks that run longer than the ideal
latency run a bit less. The simplest approach is to simply decrease the
preemption granularity when a task overruns its ideal latency. For this
we have to track how much the task executed since its last preemption.
( this adds a new field to task_struct, but we can eliminate that
overhead in 2.6.24 by putting all the scheduler timestamps into an
anonymous union. )
with this change in place, chew-max output is fluctuation-less all
around:
out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
out: 28 . 27. 39 | flu: 0 . 1 | ran: 13 . 13 | per: 41 . 40
out: 28 . 27. 39 | flu: 0 . 1 | ran: 13 . 13 | per: 41 . 40
this patch has no impact on any fastpath or on any globally observable
scheduling property. (unless you have sharp enough eyes to see
millisecond-level ruckles in glxgears smoothness :-)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
due to adaptive granularity scheduling the role of sched_granularity
has changed to "minimum granularity", so rename the variable (and the
tunable) accordingly.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Instead of specifying the preemption granularity, specify the wanted
latency. By fixing the granlarity to a constany the wakeup latency
it a function of the number of running tasks on the rq.
Invert this relation.
sysctl_sched_granularity becomes a minimum for the dynamic granularity
computed from the new sysctl_sched_latency.
Then use this latency to do more intelligent granularity decisions: if
there are fewer tasks running then we can schedule coarser. This helps
performance while still always keeping the latency target.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On a four package system with HT - HT load balancing optimizations were
broken. For example, if two tasks end up running on two logical threads
of one of the packages, scheduler is not able to pull one of the tasks
to a completely idle package.
In this scenario, for nice-0 tasks, imbalance calculated by scheduler
will be 512 and find_busiest_queue() will return 0 (as each cpu's load
is 1024 > imbalance and has only one task running).
Similarly MC scheduler optimizations also get fixed with this patch.
[ mingo@elte.hu: restored fair balancing by increasing the fuzz and
adding it back to the power decision, without the /2
factor. ]
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
construct a more or less wall-clock time out of sched_clock(), by
using ACPI-idle's existing knowledge about how much time we spent
idling. This allows the rq clock to work around TSC-stops-in-C2,
TSC-gets-corrupted-in-C3 type of problems.
( Besides the scheduler's statistics this also benefits blktrace and
printk-timestamps as well. )
Furthermore, the precise before-C2/C3-sleep and after-C2/C3-wakeup
callbacks allow the scheduler to get out the most of the period where
the CPU has a reliable TSC. This results in slightly more precise
task statistics.
the ACPI bits were acked by Len.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Len Brown <len.brown@intel.com>
remove the 'u64 now' parameter from ->task_new().
( identity transformation that causes no change in functionality. )
Signed-off-by: Ingo Molnar <mingo@elte.hu>
remove the 'u64 now' parameter from ->put_prev_task().
( identity transformation that causes no change in functionality. )
Signed-off-by: Ingo Molnar <mingo@elte.hu>
remove the 'u64 now' parameter from ->pick_next_task().
( identity transformation that causes no change in functionality. )
Signed-off-by: Ingo Molnar <mingo@elte.hu>
remove the 'u64 now' parameter from ->dequeue_task().
( identity transformation that causes no change in functionality. )
Signed-off-by: Ingo Molnar <mingo@elte.hu>
remove the 'u64 now' parameter from ->enqueue_task().
( identity transformation that causes no change in functionality. )
Signed-off-by: Ingo Molnar <mingo@elte.hu>
remove the 'u64 now' parameter from print_cfs_rq().
( identity transformation that causes no change in functionality. )
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There are two problems with balance_tasks() and how it used:
1. The variables best_prio and best_prio_seen (inherited from the old
move_tasks()) were only required to handle problems caused by the
active/expired arrays, the order in which they were processed and the
possibility that the task with the highest priority could be on either.
These issues are no longer present and the extra overhead associated
with their use is unnecessary (and possibly wrong).
2. In the absence of CONFIG_FAIR_GROUP_SCHED being set, the same
this_best_prio variable needs to be used by all scheduling classes or
there is a risk of moving too much load. E.g. if the highest priority
task on this at the beginning is a fairly low priority task and the rt
class migrates a task (during its turn) then that moved task becomes the
new highest priority task on this_rq but when the sched_fair class
initializes its copy of this_best_prio it will get the priority of the
original highest priority task as, due to the run queue locks being
held, the reschedule triggered by pull_task() will not have taken place.
This could result in inappropriate overriding of skip_for_load and
excessive load being moved.
The attached patch addresses these problems by deleting all reference to
best_prio and best_prio_seen and making this_best_prio a reference
parameter to the various functions involved.
load_balance_fair() has also been modified so that this_best_prio is
only reset (in the loop) if CONFIG_FAIR_GROUP_SCHED is set. This should
preserve the effect of helping spread groups' higher priority tasks
around the available CPUs while improving system performance when
CONFIG_FAIR_GROUP_SCHED isn't set.
Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The move_tasks() function is currently multiplexed with two distinct
capabilities:
1. attempt to move a specified amount of weighted load from one run
queue to another; and
2. attempt to move a specified number of tasks from one run queue to
another.
The first of these capabilities is used in two places, load_balance()
and load_balance_idle(), and in both of these cases the return value of
move_tasks() is used purely to decide if tasks/load were moved and no
notice of the actual number of tasks moved is taken.
The second capability is used in exactly one place,
active_load_balance(), to attempt to move exactly one task and, as
before, the return value is only used as an indicator of success or failure.
This multiplexing of sched_task() was introduced, by me, as part of the
smpnice patches and was motivated by the fact that the alternative, one
function to move specified load and one to move a single task, would
have led to two functions of roughly the same complexity as the old
move_tasks() (or the new balance_tasks()). However, the new modular
design of the new CFS scheduler allows a simpler solution to be adopted
and this patch addresses that solution by:
1. adding a new function, move_one_task(), to be used by
active_load_balance(); and
2. making move_tasks() a single purpose function that tries to move a
specified weighted load and returns 1 for success and 0 for failure.
One of the consequences of these changes is that neither move_one_task()
or the new move_tasks() care how many tasks sched_class.load_balance()
moves and this enables its interface to be simplified by returning the
amount of load moved as its result and removing the load_moved pointer
from the argument list. This helps simplify the new move_tasks() and
slightly reduces the amount of work done in each of
sched_class.load_balance()'s implementations.
Further simplification, e.g. changes to balance_tasks(), are possible
but (slightly) complicated by the special needs of load_balance_fair()
so I've left them to a later patch (if this one gets accepted).
NB Since move_tasks() gets called with two run queue locks held even
small reductions in overhead are worthwhile.
[ mingo@elte.hu ]
this change also reduces code size nicely:
text data bss dec hex filename
39216 3618 24 42858 a76a sched.o.before
39173 3618 24 42815 a73f sched.o.after
Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add an above_background_load() function which can be used by other
subsystems to detect if there is anything besides niced tasks running.
Place it in sched.h to allow it to be compiled out if not used.
Unused for now, but it is a useful hint to the IO scheduler and to
swap-prefetch.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This adds a general mechanism whereby a task can request the scheduler to
notify it whenever it is preempted or scheduled back in. This allows the
task to swap any special-purpose registers like the fpu or Intel's VT
registers.
Signed-off-by: Avi Kivity <avi@qumranet.com>
[ mingo@elte.hu: fixes, cleanups ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement the cpu_clock(cpu) interface for kernel-internal use:
high-speed (but slightly incorrect) per-cpu clock constructed from
sched_clock().
This API, unused at the moment, will be used in the future by blktrace,
by the softlockup-watchdog, by printk and by lockstat.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds an interface to set/reset flags which determines each memory
segment should be dumped or not when a core file is generated.
/proc/<pid>/coredump_filter file is provided to access the flags. You can
change the flag status for a particular process by writing to or reading from
the file.
The flag status is inherited to the child process when it is created.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch changes mm_struct.dumpable to a pair of bit flags.
set_dumpable() converts three-value dumpable to two flags and stores it into
lower two bits of mm_struct.flags instead of mm_struct.dumpable.
get_dumpable() behaves in the opposite way.
[akpm@linux-foundation.org: export set_dumpable]
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch enables the unshare of user namespaces.
It adds a new clone flag CLONE_NEWUSER and implements copy_user_ns() which
resets the current user_struct and adds a new root user (uid == 0)
For now, unsharing the user namespace allows a process to reset its
user_struct accounting and uid 0 in the new user namespace should be contained
using appropriate means, for instance selinux
The plan, when the full support is complete (all uid checks covered), is to
keep the original user's rights in the original namespace, and let a process
become uid 0 in the new namespace, with full capabilities to the new
namespace.
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Acked-by: Pavel Emelianov <xemul@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Andrew Morgan <agm@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Basically, it will allow a process to unshare its user_struct table,
resetting at the same time its own user_struct and all the associated
accounting.
A new root user (uid == 0) is added to the user namespace upon creation.
Such root users have full privileges and it seems that theses privileges
should be controlled through some means (process capabilities ?)
The unshare is not included in this patch.
Changes since [try #4]:
- Updated get_user_ns and put_user_ns to accept NULL, and
get_user_ns to return the namespace.
Changes since [try #3]:
- moved struct user_namespace to files user_namespace.{c,h}
Changes since [try #2]:
- removed struct user_namespace* argument from find_user()
Changes since [try #1]:
- removed struct user_namespace* argument from find_user()
- added a root_user per user namespace
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Pavel Emelianov <xemul@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Andrew Morgan <agm@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add TTY input auditing, used to audit system administrator's actions. This is
required by various security standards such as DCID 6/3 and PCI to provide
non-repudiation of administrator's actions and to allow a review of past
actions if the administrator seems to overstep their duties or if the system
becomes misconfigured for unknown reasons. These requirements do not make it
necessary to audit TTY output as well.
Compared to an user-space keylogger, this approach records TTY input using the
audit subsystem, correlated with other audit events, and it is completely
transparent to the user-space application (e.g. the console ioctls still
work).
TTY input auditing works on a higher level than auditing all system calls
within the session, which would produce an overwhelming amount of mostly
useless audit events.
Add an "audit_tty" attribute, inherited across fork (). Data read from TTYs
by process with the attribute is sent to the audit subsystem by the kernel.
The audit netlink interface is extended to allow modifying the audit_tty
attribute, and to allow sending explanatory audit events from user-space (for
example, a shell might send an event containing the final command, after the
interactive command-line editing and history expansion is performed, which
might be difficult to decipher from the TTY input alone).
Because the "audit_tty" attribute is inherited across fork (), it would be set
e.g. for sshd restarted within an audited session. To prevent this, the
audit_tty attribute is cleared when a process with no open TTY file
descriptors (e.g. after daemon startup) opens a TTY.
See https://www.redhat.com/archives/linux-audit/2007-June/msg00000.html for a
more detailed rationale document for an older version of this patch.
[akpm@linux-foundation.org: build fix]
Signed-off-by: Miloslav Trmac <mitr@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Paul Fulghum <paulkf@microgate.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 411187fb05 caused boot time to move and
process start times to become invalid after suspend. Using boot based time
for those restores the old behaviour and fixes the issue.
[akpm@linux-foundation.org: little cleanup]
Signed-off-by: Tomas Janousek <tjanouse@redhat.com>
Cc: Tomas Smetana <tsmetana@redhat.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sched_fork()/sched_exit() does not need to specify fastcall anymore,
as the x86 kernel defaults to regparm3, and no assembly code calls
these functions.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
track TSC-unstable events and propagate it to the scheduler code.
Also allow sched_clock() to be used when the TSC is unstable,
the rq_clock() wrapper creates a reliable clock out of it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
remove the sleep_type heuristics from the core scheduler - scheduling
policy is implemented in the scheduling-policy modules. (and CFS does
not use this type of sleep-type heuristics)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
increase SMP-nice's resolution. This is needed by CFS to
implement SCHED_IDLE and cleaned up nice level support.
no behavioral changes.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
add the init_idle_bootup_task() callback to the bootup thread,
unused at the moment. (CFS will use it to switch the scheduling
class of the boot thread to the idle class)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
the SMP load-balancer uses the boot-time migration-cost estimation
code to attempt to improve the quality of balancing. The reason for
this code is that the discrete priority queues do not preserve
the order of scheduling accurately, so the load-balancer skips
tasks that were running on a CPU 'recently'.
this code is fundamental fragile: the boot-time migration cost detector
doesnt really work on systems that had large L3 caches, it caused boot
delays on large systems and the whole cache-hot concept made the
balancing code pretty undeterministic as well.
(and hey, i wrote most of it, so i can say it out loud that it sucks ;-)
under CFS the same purpose of cache affinity can be achieved without
any special cache-hot special-case: tasks are sorted in the 'timeline'
tree and the SMP balancer picks tasks from the left side of the
tree, thus the most cache-cold task is balanced automatically.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
enum idle_type (used by the load-balancer) clashes with the
SCHED_IDLE name that we want to introduce. 'CPU_IDLE' instead
of 'SCHED_IDLE' is more descriptive as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
1. New entries can be added to tsk->pi_state_list after task completed
exit_pi_state_list(). The result is memory leakage and deadlocks.
2. handle_mm_fault() is called under spinlock. The result is obvious.
3. results in self-inflicted deadlock inside glibc.
Sometimes futex_lock_pi returns -ESRCH, when it is not expected
and glibc enters to for(;;) sleep() to simulate deadlock. This problem
is quite obvious and I think the patch is right. Though it looks like
each "if" in futex_lock_pi() got some stupid special case "else if". :-)
4. sometimes futex_lock_pi() returns -EDEADLK,
when nobody has the lock. The reason is also obvious (see comment
in the patch), but correct fix is far beyond my comprehension.
I guess someone already saw this, the chunk:
if (rt_mutex_trylock(&q.pi_state->pi_mutex))
ret = 0;
is obviously from the same opera. But it does not work, because the
rtmutex is really taken at this point: wake_futex_pi() of previous
owner reassigned it to us. My fix works. But it looks very stupid.
I would think about removal of shift of ownership in wake_futex_pi()
and making all the work in context of process taking lock.
From: Thomas Gleixner <tglx@linutronix.de>
Fix 1) Avoid the tasklist lock variant of the exit race fix by adding
an additional state transition to the exit code.
This fixes also the issue, when a task with recursive segfaults
is not able to release the futexes.
Fix 2) Cleanup the lookup_pi_state() failure path and solve the -ESRCH
problem finally.
Fix 3) Solve the fixup_pi_state_owner() problem which needs to do the fixup
in the lock protected section by using the in_atomic userspace access
functions.
This removes also the ugly lock drop / unqueue inside of fixup_pi_state()
Fix 4) Fix a stale lock in the error path of futex_wake_pi()
Added some error checks for verification.
The -EDEADLK problem is solved by the rtmutex fixups.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Steve Hawkes discovered a problem where recalc_sigpending_tsk was called in
do_sigaction but no signal_wake_up call was made, preventing later signals
from waking up blocked threads with TIF_SIGPENDING already set.
In fact, the few other calls to recalc_sigpending_tsk outside the signals
code are also subject to this problem in other race conditions.
This change makes recalc_sigpending_tsk private to the signals code. It
changes the outside calls, as well as do_sigaction, to use the new
recalc_sigpending_and_wake instead.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: <Steve.Hawkes@motorola.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently try_to_freeze_tasks() has to wait until all of the vforked processes
exit and for this reason every user can make it fail. To fix this problem we
can introduce the additional process flag PF_FREEZER_SKIP to be used by tasks
that do not want to be counted as freezable by the freezer and want to have
TIF_FREEZE set nevertheless. Then, this flag can be set by tasks using
sys_vfork() before they call wait_for_completion(&vfork) and cleared after
they have woken up. After clearing it, the tasks should call try_to_freeze()
as soon as possible.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block:
Fix compile/link of init/do_mounts.c with !CONFIG_BLOCK
When stacked block devices are in-use (e.g. md or dm), the recursive calls
This patch series implements the new signalfd() system call.
I took part of the original Linus code (and you know how badly it can be
broken :), and I added even more breakage ;) Signals are fetched from the same
signal queue used by the process, so signalfd will compete with standard
kernel delivery in dequeue_signal(). If you want to reliably fetch signals on
the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This
seems to be working fine on my Dual Opteron machine. I made a quick test
program for it:
http://www.xmailserver.org/signafd-test.c
The signalfd() system call implements signal delivery into a file descriptor
receiver. The signalfd file descriptor if created with the following API:
int signalfd(int ufd, const sigset_t *mask, size_t masksize);
The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
to close/create cycle (Linus idea). Use "ufd" == -1 if you want a brand new
signalfd file.
The "mask" allows to specify the signal mask of signals that we are interested
in. The "masksize" parameter is the size of "mask".
The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
will return POLLIN when signals are available to be dequeued. As a direct
consequence of supporting the Linux poll subsystem, the signalfd fd can use
used together with epoll(2) too.
The read(2) system call will return a "struct signalfd_siginfo" structure in
the userspace supplied buffer. The return value is the number of bytes copied
in the supplied buffer, or -1 in case of error. The read(2) call can also
return 0, in case the sighand structure to which the signalfd was attached,
has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will
return -EAGAIN in case no signal is available.
If the size of the buffer passed to read(2) is lower than sizeof(struct
signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also
return -ERESTARTSYS in case a signal hits the process. The format of the
struct signalfd_siginfo is, and the valid fields depends of the (->code &
__SI_MASK) value, in the same way a struct siginfo would:
struct signalfd_siginfo {
__u32 signo; /* si_signo */
__s32 err; /* si_errno */
__s32 code; /* si_code */
__u32 pid; /* si_pid */
__u32 uid; /* si_uid */
__s32 fd; /* si_fd */
__u32 tid; /* si_fd */
__u32 band; /* si_band */
__u32 overrun; /* si_overrun */
__u32 trapno; /* si_trapno */
__s32 status; /* si_status */
__s32 svint; /* si_int */
__u64 svptr; /* si_ptr */
__u64 utime; /* si_utime */
__u64 stime; /* si_stime */
__u64 addr; /* si_addr */
};
[akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If CONFIG_TASK_IO_ACCOUNTING is defined, we update io accounting counters for
each task.
This patch permits reporting of values using the well known getrusage()
syscall, filling ru_inblock and ru_oublock instead of null values.
As TASK_IO_ACCOUNTING currently counts bytes counts, we approximate blocks
count doing : nr_blocks = nr_bytes / 512
Example of use :
----------------------
After patch is applied, /usr/bin/time command can now give a good
approximation of IO that the process had to do.
$ /usr/bin/time grep tototo /usr/include/*
Command exited with non-zero status 1
0.00user 0.02system 0:02.11elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
24288inputs+0outputs (0major+259minor)pagefaults 0swaps
$ /usr/bin/time dd if=/dev/zero of=/tmp/testfile count=1000
1000+0 enregistrements lus
1000+0 enregistrements écrits
512000 octets (512 kB) copiés, 0,00326601 seconde, 157 MB/s
0.00user 0.00system 0:00.00elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+3000outputs (0major+299minor)pagefaults 0swaps
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
to generic_make_request can use up a lot of space, and we would rather they
didn't.
As generic_make_request is a void function, and as it is generally not
expected that it will have any effect immediately, it is safe to delay any
call to generic_make_request until there is sufficient stack space
available.
As ->bi_next is reserved for the driver to use, it can have no valid value
when generic_make_request is called, and as __make_request implicitly
assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
certain that all callers set it to NULL. We can therefore safely use
bi_next to link pending requests together, providing we clear it before
making the real call.
So, we choose to allow each thread to only be active in one
generic_make_request at a time. If a subsequent (recursive) call is made,
the bio is linked into a per-thread list, and is handled when the active
call completes.
As the list of pending bios is per-thread, there are no locking issues to
worry about.
I say above that it is "safe to delay any call...". There are, however,
some behaviours of a make_request_fn which would make it unsafe. These
include any behaviour that assumes anything will have changed after a
recursive call to generic_make_request.
These could include:
- waiting for that call to finish and call it's bi_end_io function.
md use to sometimes do this (marking the superblock dirty before
completing a write) but doesn't any more
- inspecting the bio for fields that generic_make_request might
change, such as bi_sector or bi_bdev. It is hard to see a good
reason for this, and I don't think anyone actually does it.
- inspecing the queue to see if, e.g. it is 'full' yet. Again, I
think this is very unlikely to be useful, or to be done.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <dm-devel@redhat.com>
Alasdair G Kergon <agk@redhat.com> said:
I can see nothing wrong with this in principle.
For device-mapper at the moment though it's essential that, while the bio
mappings may now get delayed, they still get processed in exactly
the same order as they were passed to generic_make_request().
My main concern is whether the timing changes implicit in this patch
will make the rare data-corrupting races in the existing snapshot code
more likely. (I'm working on a fix for these races, but the unfinished
patch is already several hundred lines long.)
It would be helpful if some people on this mailing list would test
this patch in various scenarios and report back.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This finally renames the thread_info field in task structure to stack, so that
the assumptions about this field are gone and archs have more freedom about
placing the thread_info structure.
Nonbroken archs which have a proper thread pointer can do the access to both
current thread and task structure via a single pointer.
It'll allow for a few more cleanups of the fork code, from which e.g. ia64
could benefit.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
[akpm@linux-foundation.org: build fix]
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Andi Kleen <ak@muc.de>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently kernel threads use sigprocmask(SIG_BLOCK) to protect against
signals. This doesn't prevent the signal delivery, this only blocks
signal_wake_up(). Every "killall -33 kthreadd" means a "struct siginfo"
leak.
Change kthreadd_setup() to set all handlers to SIG_IGN instead of blocking
them (make a new helper ignore_signals() for that). If the kernel thread
needs some signal, it should use allow_signal() anyway, and in that case it
should not use CLONE_SIGHAND.
Note that we can't change daemonize() (should die!) in the same way,
because it can be used along with CLONE_SIGHAND. This means that
allow_signal() still should unblock the signal to work correctly with
daemonize()ed threads.
However, disallow_signal() doesn't block the signal any longer but ignores
it.
NOTE: with or without this patch the kernel threads are not protected from
handle_stop_signal(), this seems harmless, but not good.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I noticed expensive divides done in try_to_wakeup() and
find_busiest_group() on a bi dual core Opteron machine (total of 4 cores),
moderatly loaded (15.000 context switch per second)
oprofile numbers :
CPU: AMD64 processors, speed 2600.05 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 50000
samples % symbol name
...
613914 1.0498 try_to_wake_up
834 0.0013 :ffffffff80227ae1: div %rcx
77513 0.1191 :ffffffff80227ae4: mov %rax,%r11
608893 1.0413 find_busiest_group
1841 0.0031 :ffffffff802260bf: div %rdi
140109 0.2394 :ffffffff802260c2: test %sil,%sil
Some of these divides can use the reciprocal divides we introduced some
time ago (currently used in slab AFAIK)
We can assume a load will fit in a 32bits number, because with a
SCHED_LOAD_SCALE=128 value, its still a theorical limit of 33554432
When/if we reach this limit one day, probably cpus will have a fast
hardware divide and we can zap the reciprocal divide trick.
Ingo suggested to rename cpu_power to __cpu_power to make clear it should
not be modified without changing its reciprocal value too.
I did not convert the divide in cpu_avg_load_per_task(), because tracking
nr_running changes may be not worth it ? We could use a static table of 32
reciprocal values but it would add a conditional branch and table lookup.
[akpm@linux-foundation.org: !SMP build fix]
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the process idle load balancing in the presence of dynticks. cpus for
which ticks are stopped will sleep till the next event wakes it up.
Potentially these sleeps can be for large durations and during which today,
there is no periodic idle load balancing being done.
This patch nominates an owner among the idle cpus, which does the idle load
balancing on behalf of the other idle cpus. And once all the cpus are
completely idle, then we can stop this idle load balancing too. Checks added
in fast path are minimized. Whenever there are busy cpus in the system, there
will be an owner(idle cpu) doing the system wide idle load balancing.
Open items:
1. Intelligent owner selection (like an idle core in a busy package).
2. Merge with rcu's nohz_cpu_mask?
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add touch_all_softlockup_watchdogs() to allow the softlockup watchdog
timers on all cpus to be updated. This is used to prevent sysrq-t from
generating a spurious watchdog message when generating lots of output.
Softlockup watchdogs use sched_clock() as its timebase, which is inherently
per-cpu (at least, when it is measuring unstolen time). Because of this,
it isn't possible for one CPU to directly update the other CPU's timers,
but it is possible to tell the other CPUs to do update themselves
appropriately.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Chris Lalancette <clalance@redhat.com>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Rick Lindsley <ricklind@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This past week I was playing around with that pahole tool
(http://oops.ghostprotocols.net:81/acme/dwarves/) and looking at the size
of various struct in the kernel. I was surprised by the size of the
task_struct on x86_64, approaching 4K. I looked through the fields in
task_struct and found that a number of them were declared as "unsigned
long" rather than "unsigned int" despite them appearing okay as 32-bit
sized fields. On x86_64 "unsigned long" ends up being 8 bytes in size and
forces 8 byte alignment. Is there a reason there a reason they are
"unsigned long"?
The patch below drops the size of the struct from 3808 bytes (60 64-byte
cachelines) to 3760 bytes (59 64-byte cachelines). A couple other fields
in the task struct take a signficant amount of space:
struct thread_struct thread; 688
struct held_lock held_locks[30]; 1680
CONFIG_LOCKDEP is turned on in the .config
[akpm@linux-foundation.org: fix printk warnings]
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show_state() (SysRq-T) developed the buggy habbit of not showing
TASK_RUNNING tasks. This was due to the mistaken belief that state_filter
== -1 would be a pass-through filter - while in reality it did not let
TASK_RUNNING == 0 p->state values through.
Fix this by restoring the original '!state_filter means all tasks'
special-case i had in the original version. Test-built and test-booted on
i686, SysRq-T now works as intended.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the SMT-nice feature which idles sibling cpus on SMT cpus to
facilitiate nice working properly where cpu power is shared. The idling of
cpus in the presence of runnable tasks is considered too fragile, easy to
break with outside code, and the complexity of managing this system if an
architecture comes along with many logical cores sharing cpu power will be
unworkable.
Remove the associated per_cpu_gain variable in sched_domains used only by
this code.
Also:
The reason is that with dynticks enabled, this code breaks without yet
further tweaks so dynticks brought on the rapid demise of this code. So
either we tweak this code or kill it off entirely. It was Ingo's preference
to kill it off. Either way this needs to happen for 2.6.21 since dynticks
has gone in.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* master.kernel.org:/pub/scm/linux/kernel/git/kyle/parisc-2.6: (78 commits)
[PARISC] Use symbolic last syscall in __NR_Linux_syscalls
[PARISC] Add missing statfs64 and fstatfs64 syscalls
Revert "[PARISC] Optimize TLB flush on SMP systems"
[PARISC] Compat signal fixes for 64-bit parisc
[PARISC] Reorder syscalls to match unistd.h
Revert "[PATCH] make kernel/signal.c:kill_proc_info() static"
[PARISC] fix sys_rt_sigqueueinfo
[PARISC] fix section mismatch warnings in harmony sound driver
[PARISC] do not export get_register/set_register
[PARISC] add ENTRY()/ENDPROC() and simplify assembly of HP/UX emulation code
[PARISC] convert to use CONFIG_64BIT instead of __LP64__
[PARISC] use CONFIG_64BIT instead of __LP64__
[PARISC] add ASM_EXCEPTIONTABLE_ENTRY() macro
[PARISC] more ENTRY(), ENDPROC(), END() conversions
[PARISC] fix ENTRY() and ENDPROC() for 64bit-parisc
[PARISC] Fixes /proc/cpuinfo cache output on B160L
[PARISC] implement standard ENTRY(), END() and ENDPROC()
[PARISC] kill ENTRY_SYS_CPUS
[PARISC] clean up debugging printks in smp.c
[PARISC] factor syscall_restart code out of do_signal
...
Fix conflict in include/linux/sched.h due to kill_proc_info() being made
publicly available to PARISC again.
Now that I have changed all of the in-tree users remove the old version of
these functions. This should make it clear to any out of tree users that they
should be using kill_pgrp kill_pgrp_info or __kill_pgrp_info instead.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Of kernel subsystems that work with pids the tty layer is probably the largest
consumer. But it has the nice virtue that the assiation with a session only
lasts until the session leader exits. Which means that no reference counting
is required. So using struct pid winds up being a simple optimization to
avoid hash table lookups.
In the long term the use of pid_nr also ensures that when we have multiple pid
spaces mixed everything will work correctly.
Signed-off-by: Eric W. Biederman <eric@maxwell.lnxi.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
They are fat: 4x8 bytes in task_struct.
They are uncoditionally updated in every fork, read, write and sendfile.
They are used only if you have some "extended acct fields feature".
And please, please, please, read(2) knows about bytes, not characters,
why it is called "rchar"?
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, to tell a task that it should go to the refrigerator, we set the
PF_FREEZE flag for it and send a fake signal to it. Unfortunately there
are two SMP-related problems with this approach. First, a task running on
another CPU may be updating its flags while the freezer attempts to set
PF_FREEZE for it and this may leave the task's flags in an inconsistent
state. Second, there is a potential race between freeze_process() and
refrigerator() in which freeze_process() running on one CPU is reading a
task's PF_FREEZE flag while refrigerator() running on another CPU has just
set PF_FROZEN for the same task and attempts to reset PF_FREEZE for it. If
the refrigerator wins the race, freeze_process() will state that PF_FREEZE
hasn't been set for the task and will set it unnecessarily, so the task
will go to the refrigerator once again after it's been thawed.
To solve first of these problems we need to stop using PF_FREEZE to tell
tasks that they should go to the refrigerator. Instead, we can introduce a
special TIF_*** flag and use it for this purpose, since it is allowed to
change the other tasks' TIF_*** flags and there are special calls for it.
To avoid the freeze_process()-refrigerator() race we can make
freeze_process() to always check the task's PF_FROZEN flag after it's read
its "freeze" flag. We should also make sure that refrigerator() will
always reset the task's "freeze" flag after it's set PF_FROZEN for it.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Andi Kleen <ak@muc.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove scheduler stats lb_stopbalance counter. This counter can be
calculated by: lb_balanced - lb_nobusyg - lb_nobusyq. There is no need to
create gazillion counters while we can derive the value.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently at a particular domain, each cpu in the sched group will do a
load balance at the frequency of balance_interval. More the cores and
threads, more the cpus will be in each sched group at SMP and NUMA domain.
And we endup spending quite a bit of time doing load balancing in those
domains.
Fix this by making only one cpu(first idle cpu or first cpu in the group if
all the cpus are busy) in the sched group do the load balance at that
particular sched domain and this load will slowly percolate down to the
other cpus with in that group(when they do load balancing at lower
domains).
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Large sched domains can be very expensive to scan. Add an option SD_SERIALIZE
to the sched domain flags. If that flag is set then we make sure that no
other such domain is being balanced.
[akpm@osdl.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The present per-task IO accounting isn't very useful. It simply counts the
number of bytes passed into read() and write(). So if a process reads 1MB
from an already-cached file, it is accused of having performed 1MB of I/O,
which is wrong.
(David Wright had some comments on the applicability of the present logical IO accounting:
For billing purposes it is useless but for workload analysis it is very
useful
read_bytes/read_calls average read request size
write_bytes/write_calls average write request size
read_bytes/read_blocks ie logical/physical can indicate hit rate or thrashing
write_bytes/write_blocks ie logical/physical guess since pdflush writes can
be missed
I often look for logical larger than physical to see filesystem cache
problems. And the bytes/cpusec can help find applications that are
dominating the cache and causing slow interactive response from page cache
contention.
I want to find the IO intensive applications and make sure they are doing
efficient IO. Thus the acctcms(sysV) or csacms command would give the high
IO commands).
This patchset adds new accounting which tries to be more accurate. We account
for three things:
reads:
attempt to count the number of bytes which this process really did cause
to be fetched from the storage layer. Done at the submit_bio() level, so it
is accurate for block-backed filesystems. I also attempt to wire up NFS and
CIFS.
writes:
attempt to count the number of bytes which this process caused to be sent
to the storage layer. This is done at page-dirtying time.
The big inaccuracy here is truncate. If a process writes 1MB to a file
and then deletes the file, it will in fact perform no writeout. But it will
have been accounted as having caused 1MB of write.
So...
cancelled_writes:
account the number of bytes which this process caused to not happen, by
truncating pagecache.
We _could_ just subtract this from the process's `write' accounting. But
that means that some processes would be reported to have done negative
amounts of write IO, which is silly.
So we just report the raw number and punt this decision up to userspace.
Now, we _could_ account for writes at the physical I/O level. But
- This would require that we track memory-dirtying tasks at the per-page
level (would require a new pointer in struct page).
- It would mean that IO statistics for a process are usually only available
long after that process has exitted. Which means that we probably cannot
communicate this info via taskstats.
This patch:
Wire up the kernel-private data structures and the accessor functions to
manipulate them.
Cc: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Cc: David Wright <daw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch provides process filtering feature.
The process filter allows failing only permitted processes
by /proc/<pid>/make-it-fail
Please see the example that demostrates how to inject slab allocation
failures into module init/cleanup code
in Documentation/fault-injection/fault-injection.txt
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a per pid_namespace child-reaper. This is needed so processes are reaped
within the same pid space and do not spill over to the parent pid space. Its
also needed so containers preserve existing semantic that pid == 1 would reap
orphaned children.
This is based on Eric Biederman's patch: http://lkml.org/lkml/2006/2/6/285
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add an anonymous union and ((deprecated)) to catch direct usage of the
session field.
[akpm@osdl.org: fix various missed conversions]
[jdike@addtoit.com: fix UML bug]
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Replace occurences of task->signal->session by a new process_session() helper
routine.
It will be useful for pid namespaces to abstract the session pid number.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>