kernel-fxtec-pro1x

Author	SHA1	Message	Date
Eduard - Gabriel Munteanu	42af9054c0	kmemtrace: restore original tracing data binary format, improve ABI When kmemtrace was ported to ftrace, the marker strings were taken as an indication of how the traced data was being exposed to the userspace. However, the actual format had been binary, not text. This restores the original binary format, while also adding an origin CPU field (since ftrace doesn't expose the data per-CPU to userspace), and re-adding the timestamp field. It also drops arch-independent field sizing where it didn't make sense, so pointers won't always be 64 bits wide like they used to. Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> LKML-Reference: <161be9ca8a27b432c4a6ab79f47788c4521652ae.1237813499.git.eduard.munteanu@linux360.ro> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-03 12:23:08 +02:00
Eduard - Gabriel Munteanu	da2635a985	kmemtrace: kmemtrace_alloc() must fill type_id Impact: fix trace output kmemtrace_alloc() was not filling type_id, which allowed garbage to make it into tracing data. Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> LKML-Reference: <284dba2732a144849d5aa82258fe0de2ad8dcb0b.1237813499.git.eduard.munteanu@linux360.ro> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-03 12:23:07 +02:00
Eduard - Gabriel Munteanu	ca2b84cb3c	kmemtrace: use tracepoints kmemtrace now uses tracepoints instead of markers. We no longer need to use format specifiers to pass arguments. Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> [ folded: Use the new TP_PROTO and TP_ARGS to fix the build. ] [ folded: fix build when CONFIG_KMEMTRACE is disabled. ] [ folded: define tracepoints when CONFIG_TRACEPOINTS is enabled. ] Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> LKML-Reference: <ae61c0f37156db8ec8dc0d5778018edde60a92e3.1237813499.git.eduard.munteanu@linux360.ro> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-03 12:23:06 +02:00
Ingo Molnar	a979241c53	kmemtrace, rcu: fix rcupreempt.c data structure dependencies Impact: cleanup We want to remove percpu.h from rcupreempt.h, but if we do that the percpu primitives there wont build anymore. Move them to the .c file instead. Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: paulmck@linux.vnet.ibm.com LKML-Reference: <1237898630.25315.83.camel@penberg-laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-03 12:23:04 +02:00
Ingo Molnar	6258c4fb59	kmemtrace, rcu: fix rcu_tree_trace.c data structure dependencies Impact: cleanup We want to remove rcutree internals from the public rcutree.h file for upcoming kmemtrace changes - but kernel/rcutree_trace.c depends on them. Introduce kernel/rcutree.h for internal definitions. (Probably all the other data types from include/linux/rcutree.h could be moved here too - except rcu_data.) Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: paulmck@linux.vnet.ibm.com LKML-Reference: <1237898630.25315.83.camel@penberg-laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-03 12:23:03 +02:00
Ingo Molnar	b1f77b0581	kmemtrace, rcu: fix linux/rcutree.h and linux/rcuclassic.h dependencies Impact: build fix for all non-x86 architectures We want to remove percpu.h from rcuclassic.h/rcutree.h (for upcoming kmemtrace changes) but that would break the DECLARE_PER_CPU based declarations in these files. Move the quiescent counter management functions to their respective RCU implementation .c files - they were slightly above the inlining limit anyway. Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: paulmck@linux.vnet.ibm.com LKML-Reference: <1237898630.25315.83.camel@penberg-laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-03 12:23:02 +02:00
Linus Torvalds	8fe74cf053	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: Remove two unneeded exports and make two symbols static in fs/mpage.c Cleanup after commit `585d3bc06f` Trim includes of fdtable.h Don't crap into descriptor table in binfmt_som Trim includes in binfmt_elf Don't mess with descriptor table in load_elf_binary() Get rid of indirect include of fs_struct.h New helper - current_umask() check_unsafe_exec() doesn't care about signal handlers sharing New locking/refcounting for fs_struct Take fs_struct handling to new file (fs/fs_struct.c) Get rid of bumping fs_struct refcount in pivot_root(2) Kill unsharing fs_struct in __set_personality()	2009-04-02 21:09:10 -07:00
Robin Holt	f5f7eac41d	Allow rwlocks to re-enable interrupts Pass the original flags to rwlock arch-code, so that it can re-enable interrupts if implemented for that architecture. Initially, make __raw_read_lock_flags and __raw_write_lock_flags stubs which just do the same thing as non-flags variants. Signed-off-by: Petr Tesarik <ptesarik@suse.cz> Signed-off-by: Robin Holt <holt@sgi.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <linux-arch@vger.kernel.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:11 -07:00
Robin Holt	e8c158bb31	Factor out #ifdefs from kernel/spinlock.c to LOCK_CONTENDED_FLAGS SGI has observed that on large systems, interrupts are not serviced for a long period of time when waiting for a rwlock. The following patch series re-enables irqs while waiting for the lock, resembling the code which is already there for spinlocks. I only made the ia64 version, because the patch adds some overhead to the fast path. I assume there is currently no demand to have this for other architectures, because the systems are not so large. Of course, the possibility to implement raw_{read\|write}_lock_flags for any architecture is still there. This patch: The new macro LOCK_CONTENDED_FLAGS expands to the correct implementation depending on the config options, so that IRQ's are re-enabled when possible, but they remain disabled if CONFIG_LOCKDEP is set. Signed-off-by: Petr Tesarik <ptesarik@suse.cz> Signed-off-by: Robin Holt <holt@sgi.com> Cc: <linux-arch@vger.kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:10 -07:00
Aravind Srinivasan	2c53d9109f	relay: fix for possible loss/corruption of produced subbufs Fix possible loss/corruption of produced subbufs in relay_subbufs_consumed(). When buf->subbufs_produced wraps around after UINT_MAX and buf->subbufs_consumed is still < UINT_MAX, the condition if (buf->subbufs_consumed > buf->subbufs_produced) will be true even for certain valid values of subbufs_consumed. This may lead to loss or corruption of produced subbufs. Signed-off-by: Aravind Srinivasan <raa.aars@gmail.com> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Tom Zanussi <zanussi@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:05 -07:00
Dmitri Vorobiev	edb79a2132	kexec: vmcoreinfo_data[] can become static The vmcoreinfo_data[] array is not used outside of kernel/kexec.c, and can therefore become static. This patch adds the relevant keyword to the definition of the array. Noticed by sparse. Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@movial.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:04 -07:00
Neil Horman	04d491ab2a	kexec: add dmesg log symbols to /proc/vmcoreinfo lists It would be nice to be able to extract the dmesg log from a vmcore file without needing to keep the debug symbols for the running kernel handy all the time. We have a facility to do this in /proc/vmcore. This patch adds the log_buf and log_end symbols to the vmcoreinfo area so that tools (like makedumpfile) can easily extract the dmesg logs from a vmcore image. [akpm@linux-foundation.org: several fixes and cleanups] [akpm@linux-foundation.org: fix unused log_buf_kexec_setup()] [akpm@linux-foundation.org: build fix] Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Cc: Simon Horman <horms@verge.net.au> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Simon Horman <horms@verge.net.au> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:04 -07:00
Oleg Nesterov	1b0f7ffd0e	pids: kill signal_struct-> __pgrp/__session and friends We are wasting 2 words in signal_struct without any reason to implement task_pgrp_nr() and task_session_nr(). task_session_nr() has no callers since `2e2ba22ea4`, we can remove it. task_pgrp_nr() is still (I believe wrongly) used in fs/autofsX and fs/coda. This patch reimplements task_pgrp_nr() via task_pgrp_nr_ns(), and kills __pgrp/__session and the related helpers. The change in drivers/char/tty_io.c is cosmetic, but hopefully makes sense anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Alan Cox <number6@the-village.bc.nu> [tty parts] Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:02 -07:00
Oleg Nesterov	52ee2dfdd4	pids: refactor vnr/nr_ns helpers to make them safe Inho, the safety rules for vnr/nr_ns helpers are horrible and buggy. task_pid_nr_ns(task) needs rcu/tasklist depending on task == current. As for "special" pids, vnr/nr_ns helpers always need rcu. However, if task != current, they are unsafe even under rcu lock, we can't trust task->group_leader without the special checks. And almost every helper has a callsite which needs a fix. Also, it is a bit annoying that the implementations of, say, task_pgrp_vnr() and task_pgrp_nr_ns() are not "symmetrical". This patch introduces the new helper, __task_pid_nr_ns(), which is always safe to use, and turns all other helpers into the trivial wrappers. After this I'll send another patch which converts task_tgid_xxx() as well, they're are a bit special. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Louis Rilling <Louis.Rilling@kerlabs.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:02 -07:00
Oleg Nesterov	2ae448efc8	pids: improve get_task_pid() to fix the unsafe sys_wait4()->task_pgrp() sys_wait4() does get_pid(task_pgrp(current)), this is not safe. We can add rcu lock/unlock around, but we already have get_task_pid() which can be improved to handle the special pids in more reliable manner. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Louis Rilling <Louis.Rilling@kerlabs.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:02 -07:00
Matthew Wilcox	8e654fba4a	sysctl: fix suid_dumpable and lease-break-time sysctls Arne de Bruijn points out that commit `76fdbb25f9` ("coredump masking: bound suid_dumpable sysctl") mistakenly limits lease-break-time instead of suid_dumpable. Signed-off-by: Matthew Wilcox <matthew@wil.cx> Reported-by: Arne de Bruijn <kernelbt@arbruijn.dds.nl> Cc: Kawai, Hidehiro <hidehiro.kawai.ez@hitachi.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:01 -07:00
Serge E. Hallyn	11dea19009	proc_sysctl: use CONFIG_PROC_SYSCTL around ipc and utsname proc_handlers As pointed out by Cedric Le Goater (in response to Alexey's original comment wrt mqns), ipc_sysctl.c and utsname_sysctl.c are using CONFIG_PROC_FS, not CONFIG_PROC_SYSCTL, to determine whether to define the proc_handlers. Change that. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Acked-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:01 -07:00
Lai Jiangshan	2355b70fd5	workqueue: avoid recursion in run_workqueue() 1) lockdep will complain when run_workqueue() performs recursion. 2) The recursive implementation of run_workqueue() means that flush_workqueue() and its documentation are inconsistent. This may hide deadlocks and other bugs. 3) The recursion in run_workqueue() will poison cwq->current_work, but flush_work() and __cancel_work_timer(), etcetera need a reliable cwq->current_work. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:00 -07:00
Oleg Nesterov	1ee1184485	ptrace_untrace: fix the SIGNAL_STOP_STOPPED check This bug is ancient too. ptrace_untrace() must not resume the task if the group stop in progress, we should set TASK_STOPPED instead. Unfortunately, we still have problems here: - if the process/thread was traced, SIGNAL_STOP_STOPPED does not necessary means this thread group is stopped. - ptrace breaks the bookkeeping of ->group_stop_count. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:00 -07:00
Oleg Nesterov	95a3540da9	ptrace_detach: the wrong wakeup breaks the ERESTARTxxx logic Another ancient bug. Consider this trivial test-case, int main(void) { int pid = fork(); if (pid) { ptrace(PTRACE_ATTACH, pid, NULL, NULL); wait(NULL); ptrace(PTRACE_DETACH, pid, NULL, NULL); } else { pause(); printf("WE HAVE A KERNEL BUG!!!\n"); } return 0; } the child must not "escape" for sys_pause(), but it can and this was seen in practice. This is because ptrace_detach does: if (!child->exit_state) wake_up_process(child); this wakeup can happen after this child has already restarted sys_pause(), because it gets another wakeup from ptrace_untrace(). With or without this patch, perhaps sys_pause() needs a fix. But this wakeup also breaks the SIGNAL_STOP_STOPPED logic in ptrace_untrace(). Remove this wakeup. The caller saw this task in TASK_TRACED state, and unless it was SIGKILL'ed in between __ptrace_unlink()->ptrace_untrace() should handle this case correctly. If it was SIGKILL'ed, we don't need to wakup the dying tracee too. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:00 -07:00
Oleg Nesterov	5dfc80be73	forget_original_parent: do not abuse child->ptrace_entry By discussion with Roland. - Use ->sibling instead of ->ptrace_entry to chain the need to be release_task'd childs. Nobody else can use ->sibling, this task is EXIT_DEAD and nobody can find it on its own list. - rename ptrace_dead to dead_childs. - Now that we don't have the "parallel" untrace code, change back reparent_thread() to return void, pass dead_childs as an argument. Actually, I don't understand why do we notify /sbin/init when we reparent a zombie, probably it is better to reap it unconditionally. [akpm@linux-foundation.org: s/childs/children/] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Metzger, Markus T" <markus.t.metzger@intel.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:00 -07:00
Oleg Nesterov	39c626ae47	forget_original_parent: split out the un-ptrace part By discussion with Roland. - Rename ptrace_exit() to exit_ptrace(), and change it to do all the necessary work with ->ptraced list by its own. - Move this code from exit.c to ptrace.c - Update the comment in ptrace_detach() to explain the rechecking of the child->ptrace. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Metzger, Markus T" <markus.t.metzger@intel.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:05:00 -07:00
Oleg Nesterov	7f5d3652d4	reparent_thread: fix a zombie leak if /sbin/init ignores SIGCHLD If /sbin/init ignores SIGCHLD and we re-parent a zombie, it is leaked. reparent_thread() does do_notify_parent() which sets ->exit_signal = -1 in this case. This means that nobody except us can reap it, the detached task is not visible to do_wait(). Change reparent_thread() to return a boolean (like __pthread_detach) to indicate that the thread is dead and must be released. Also change forget_original_parent() to add the child to ptrace_dead list in this case. The naming becomes insane, the next patch does the cleanup. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:59 -07:00
Oleg Nesterov	b1442b055c	reparent_thread: fix the "is it traced" check reparent_thread() uses ptrace_reparented() to check whether this thread is ptraced, in that case we should not notify the new parent. But ptrace_reparented() is not exactly correct when the reparented thread is traced by /sbin/init, because forget_original_parent() has already changed ->real_parent. Currently, the only problem is the false notification. But with the next patch the kernel crash in this (yes, pathological) case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:59 -07:00
Oleg Nesterov	0a967a044a	reparent_thread: don't call kill_orphaned_pgrp() if task_detached() If task_detached(p) == T, then either a) p is not the main thread, we will find the group leader on the ->children list. or b) p is the group leader but its ->exit_state = EXIT_DEAD. This can only happen when the last sub-thread has died, but in that case that thread has already called kill_orphaned_pgrp() from exit_notify(). In both cases kill_orphaned_pgrp() looks bogus. Move the task_detached() check up and simplify the code, this is also right from the "common sense" pov: we should do nothing with the detached childs, except move them to the new parent's ->children list. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:59 -07:00
Oleg Nesterov	4576145c1e	ptrace: fix possible zombie leak on PTRACE_DETACH When ptrace_detach() takes tasklist, the tracee can be SIGKILL'ed. If it has already passed exit_notify() we can leak a zombie, because a) ptracing disables the auto-reaping logic, and b) ->real_parent was not notified about the child's death. ptrace_detach() should follow the ptrace_exit's logic, change the code accordingly. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Tested-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:59 -07:00
Oleg Nesterov	b1b4c6799f	ptrace: reintroduce __ptrace_detach() as a callee of ptrace_exit() No functional changes, preparation for the next patch. Move the "should we release this child" logic into the separate handler, __ptrace_detach(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:59 -07:00
Oleg Nesterov	6d69cb87f0	ptrace: simplify ptrace_exit()->ignoring_children() path ignoring_children() takes parent->sighand->siglock and checks k_sigaction[SIGCHLD] atomically. But this buys nothing, we can't get the "really" wrong result even if we race with sigaction(SIGCHLD). If we read the "stale" sa_handler/sa_flags we can pretend it was changed right after the check. Remove spin_lock(->siglock), and kill "int ign" which caches the result of ignoring_children() which becomes rather trivial. Perhaps it makes sense to export this helper, do_notify_parent() can use it too. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:59 -07:00
Oleg Nesterov	95c3eb76dc	ptrace: kill __ptrace_detach(), fix ->exit_state check Move the code from __ptrace_detach() to its single caller and kill this helper. Also, fix the ->exit_state check, we shouldn't wake up EXIT_DEAD tasks. Actually, I think task_is_stopped_or_traced() makes more sense, but this needs another patch. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:59 -07:00
Sukadev Bhattiprolu	6588c1e3ff	signals: SI_USER: Masquerade si_pid when crossing pid ns boundary When sending a signal to a descendant namespace, set ->si_pid to 0 since the sender does not have a pid in the receiver's namespace. Note: - If rt_sigqueueinfo() sets si_code to SI_USER when sending a signal across a pid namespace boundary, the value in ->si_pid will be cleared to 0. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:58 -07:00
Sukadev Bhattiprolu	b3bfa0cba8	signals: protect cinit from blocked fatal signals Normally SIG_DFL signals to global and container-init are dropped early. But if a signal is blocked when it is posted, we cannot drop the signal since the receiver may install a handler before unblocking the signal. Once this signal is queued however, the receiver container-init has no way of knowing if the signal was sent from an ancestor or descendant namespace. This patch ensures that contianer-init drops all SIG_DFL signals in get_signal_to_deliver() except SIGKILL/SIGSTOP. If SIGSTOP/SIGKILL originate from a descendant of container-init they are never queued (i.e dropped in sig_ignored() in an earler patch). If SIGSTOP/SIGKILL originate from parent namespace, the signal is queued and container-init processes the signal. IOW, if get_signal_to_deliver() sees a sig_kernel_only() signal for global or container-init, the signal must have been generated internally or must have come from an ancestor ns and we process the signal. Further, the signal_group_exit() check was needed to cover the case of a multi-threaded init sending SIGKILL to other threads when doing an exit() or exec(). But since the new sig_kernel_only() check covers the SIGKILL, the signal_group_exit() check is no longer needed and can be removed. Finally, now that we have all pieces in place, set SIGNAL_UNKILLABLE for container-inits. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:58 -07:00
Sukadev Bhattiprolu	e4da026f98	signals: zap_pid_ns_process() should use force_sig() send_signal() assumes that signals with SEND_SIG_PRIV are generated from within the same namespace. So any nested container-init processes become immune to the SIGKILL generated by kill_proc_info() in zap_pid_ns_processes(). Use force_sig() in zap_pid_ns_processes() instead - force_sig() clears the SIGNAL_UNKILLABLE flag ensuring the signal is processed by container-inits. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:58 -07:00
Sukadev Bhattiprolu	921cf9f630	signals: protect cinit from unblocked SIG_DFL signals Drop early any SIG_DFL or SIG_IGN signals to container-init from within the same container. But queue SIGSTOP and SIGKILL to the container-init if they are from an ancestor container. Blocked, fatal signals (i.e when SIG_DFL is to terminate) from within the container can still terminate the container-init. That will be addressed in the next patch. Note: To be bisect-safe, SIGNAL_UNKILLABLE will be set for container-inits in a follow-on patch. Until then, this patch is just a preparatory step. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:58 -07:00
Sukadev Bhattiprolu	7978b567d3	signals: add from_ancestor_ns parameter to send_signal() send_signal() (or its helper) needs to determine the pid namespace of the sender. But a signal sent via kill_pid_info_as_uid() comes from within the kernel and send_signal() does not need to determine the pid namespace of the sender. So define a helper for send_signal() which takes an additional parameter, 'from_ancestor_ns' and have kill_pid_info_as_uid() use that helper directly. The 'from_ancestor_ns' parameter will be used in a follow-on patch. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:58 -07:00
Oleg Nesterov	f008faff0e	signals: protect init from unwanted signals more (This is a modified version of the patch submitted by Oleg Nesterov http://lkml.org/lkml/2008/11/18/249 and tries to address comments that came up in that discussion) init ignores the SIG_DFL signals but we queue them anyway, including SIGKILL. This is mostly OK, the signal will be dropped silently when dequeued, but the pending SIGKILL has 2 bad implications: - it implies fatal_signal_pending(), so we confuse things like wait_for_completion_killable/lock_page_killable. - for the sub-namespace inits, the pending SIGKILL can mask (legacy_queue) the subsequent SIGKILL from the parent namespace which must kill cinit reliably. (preparation, cinits don't have SIGNAL_UNKILLABLE yet) The patch can't help when init is ptraced, but ptracing of init is not "safe" anyway. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:58 -07:00
Oleg Nesterov	43918f2bf4	signals: remove 'handler' parameter to tracehook functions Container-init must behave like global-init to processes within the container and hence it must be immune to unhandled fatal signals from within the container (i.e SIG_DFL signals that terminate the process). But the same container-init must behave like a normal process to processes in ancestor namespaces and so if it receives the same fatal signal from a process in ancestor namespace, the signal must be processed. Implementing these semantics requires that send_signal() determine pid namespace of the sender but since signals can originate from workqueues/ interrupt-handlers, determining pid namespace of sender may not always be possible or safe. This patchset implements the design/simplified semantics suggested by Oleg Nesterov. The simplified semantics for container-init are: - container-init must never be terminated by a signal from a descendant process. - container-init must never be immune to SIGKILL from an ancestor namespace (so a process in parent namespace must always be able to terminate a descendant container). - container-init may be immune to unhandled fatal signals (like SIGUSR1) even if they are from ancestor namespace. SIGKILL/SIGSTOP are the only reliable signals to a container-init from ancestor namespace. This patch: Based on an earlier patch submitted by Oleg Nesterov and comments from Roland McGrath (http://lkml.org/lkml/2008/11/19/258). The handler parameter is currently unused in the tracehook functions. Besides, the tracehook functions are called with siglock held, so the functions can check the handler if they later need to. Removing the parameter simiplifies changes to sig_ignored() in a follow-on patch. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:58 -07:00
Oleg Nesterov	90bc8d8b1a	do_wait: fix waiting for the group stop with the dead leader do_wait(WSTOPPED) assumes that p->state must be == TASK_STOPPED, this is not true if the leader is already dead. Check SIGNAL_STOP_STOPPED instead and use signal->group_exit_code. Trivial test-case: void tfunc(void arg) { pause(); return NULL; } int main(void) { pthread_t thr; pthread_create(&thr, NULL, tfunc, NULL); pthread_exit(NULL); return 0; } It doesn't react to ^Z (and then to ^C or ^\). The task is stopped, but bash can't see this. The bug is very old, and it was reported multiple times. This patch was sent more than a year ago (http://marc.info/?t=119713920000003) but it was ignored. This change also fixes other oddities (but not all) in this area. For example, before this patch: $ sleep 100 ^Z [1]+ Stopped sleep 100 $ strace -p `pidof sleep` Process 11442 attached - interrupt to quit strace hangs in do_wait(), because ->exit_code was already consumed by bash. After this patch, strace happily proceeds: --- SIGTSTP (Stopped) @ 0 (0) --- restart_syscall(<... resuming interrupted call ...> To me, this looks much more "natural" and correct. Another example. Let's suppose we have the main thread M and sub-thread T, the process is stopped, and its parent did wait(WSTOPPED). Now we can ptrace T but not M. This looks at least strange to me. Imho, do_wait() should not confuse the per-thread ptrace stops with the per-process job control stops. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: Kaz Kylheku <kkylheku@gmail.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Roland McGrath <roland@redhat.com> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:57 -07:00
David Rientjes	6d7b2f5f9e	cpusets: prevent PF_THREAD_BOUND tasks from attaching to non-root cpusets Kthreads that have the PF_THREAD_BOUND bit set in their flags are bound to a specific cpu. Thus, their set of allowed cpus shall not change. This patch prevents such threads from attaching to non-root cpusets. They do not have mempolicies that restrict them to a subset of system nodes and, since their cpumask may never change, they cannot use any of the features of cpusets. The tasks will forever be a member of the root cpuset and will be returned when listing the tasks attached to that cpuset. Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:57 -07:00
Paul Menage	db7f47cf48	cpusets: allow cpusets to be configured/built on non-SMP systems Allow cpusets to be configured/built on non-SMP systems Currently it's impossible to build cpusets under UML on x86-64, since cpusets depends on SMP and x86-64 UML doesn't support SMP. There's code in cpusets that doesn't depend on SMP. This patch surrounds the minimum amount of cpusets code with #ifdef CONFIG_SMP in order to allow cpusets to build/run on UP systems (for testing purposes under UML). Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:57 -07:00
David Rientjes	a1bc5a4eee	cpusets: replace zone allowed functions with node allowed The cpuset_zone_allowed() variants are actually only a function of the zone's node. Cc: Paul Menage <menage@google.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:57 -07:00
Li Zefan	7f81b1ae18	cpuset: remove struct cpuset_hotplug_scanner Use cgroup_scanner.data, instead of introducing cpuset_hotplug_scanner. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:57 -07:00
Li Zefan	010cfac4ca	cpuset: avoid changing cpuset's mems when errno returned When writing to cpuset.mems, cpuset has to update its mems_allowed before calling update_tasks_nodemask(), but this function might return -ENOMEM. To avoid this rare case, we allocate the memory before changing mems_allowed, and then pass to update_tasks_nodemask(). Similar to what update_cpumask() does. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:57 -07:00
Li Zefan	3b6766fe66	cpuset: rewrite update_tasks_nodemask() This patch uses cgroup_scan_tasks() to rebind tasks' vmas to new cpuset's mems_allowed. Not only simplify the code largely, but also avoid allocating an array to hold mm pointers of all the tasks in the cpuset. This array can be big (size > PAGESIZE) if we have lots of tasks in that cpuset, thus has a chance to fail the allocation when under memory stress. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:57 -07:00
Li Zefan	0b4217b3fd	cpuset: fix possible races in cpu/memory hotplug Change to cpuset->cpus_allowed and cpuset->mems_allowed should be protected by callback_mutex, otherwise the reader may read wrong cpus/mems. This is cpuset's lock rule. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:56 -07:00
KAMEZAWA Hiroyuki	0b7f569e45	memcg: fix OOM killer under memcg This patch tries to fix OOM Killer problems caused by hierarchy. Now, memcg itself has OOM KILL function (in oom_kill.c) and tries to kill a task in memcg. But, when hierarchy is used, it's broken and correct task cannot be killed. For example, in following cgroup /groupA/ hierarchy=1, limit=1G, 01 nolimit 02 nolimit All tasks' memory usage under /groupA, /groupA/01, groupA/02 is limited to groupA's 1Gbytes but OOM Killer just kills tasks in groupA. This patch provides makes the bad process be selected from all tasks under hierarchy. BTW, currently, oom_jiffies is updated against groupA in above case. oom_jiffies of tree should be updated. To see how oom_jiffies is used, please check mem_cgroup_oom_called() callers. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: const fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:55 -07:00
Li Zefan	d969fbe69e	debug cgroup: remove unneeded cgroup_lock Since we are in cgroup write handler, so the cgrp is valid, so we don't have to hold cgroup_mutex when calling cgroup_task_count(). One similar example is in cgroup_tasks_open(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:54 -07:00
Li Zefan	0670e08bdf	cgroups: don't change release_agent when remount failed Remount can fail in either case: - wrong mount options is specified, or option 'noprefix' is changed. - a to-be-added subsys is already mounted/active. When using remount to change 'release_agent', for the above former failure case, remount will return errno with release_agent unchanged, but for the latter case, remount will return EBUSY with relase_agent changed, which is unexpected I think: # mount -t cgroup -o cpu xxx /cgrp1 # mount -t cgroup -o cpuset,release_agent=agent1 yyy /cgrp2 # cat /cgrp2/release_agent agent1 # mount -t cgroup -o remount,cpuset,noprefix,release_agent=agent2 yyy /cgrp2 mount: /cgrp2 not mounted already, or bad option # cat /cgrp2/release_agent agent1 <-- ok # mount -t cgroup -o remount,cpu,cpuset,release_agent=agent2 yyy /cgrp2 mount: /cgrp2 is busy # cat /cgrp2/release_agent agent2 <-- unexpected! Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:54 -07:00
Li Zefan	099fca3225	cgroups: show correct file mode We have some read-only files and write-only files, but currently they are all set to 0644, which is counter-intuitive and cause trouble for some cgroup tools like libcgroup. This patch adds 'mode' to struct cftype to allow cgroup subsys to set it's own files' file mode, and for the most cases cft->mode can be default to 0 and cgroup will figure out proper mode. Acked-by: Paul Menage <menage@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:54 -07:00
Jesper Juhl	66bdc9cfc7	kernel/cgroup.c: kfree(NULL) is legal Reduces object file size a bit: Before: $ size kernel/cgroup.o text data bss dec hex filename 21593 7804 4924 34321 8611 kernel/cgroup.o After: $ size kernel/cgroup.o text data bss dec hex filename 21537 7744 4924 34205 859d kernel/cgroup.o Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:54 -07:00
KAMEZAWA Hiroyuki	ec64f51545	cgroup: fix frequent -EBUSY at rmdir In following situation, with memory subsystem, /groupA use_hierarchy==1 /01 some tasks /02 some tasks /03 some tasks /04 empty When tasks under 01/02/03 hit limit on /groupA, hierarchical reclaim is triggered and the kernel walks tree under groupA. In this case, rmdir /groupA/04 fails with -EBUSY frequently because of temporal refcnt from the kernel. In general. cgroup can be rmdir'd if there are no children groups and no tasks. Frequent fails of rmdir() is not useful to users. (And the reason for -EBUSY is unknown to users.....in most cases) This patch tries to modify above behavior, by - retries if css_refcnt is got by someone. - add "return value" to pre_destroy() and allows subsystem to say "we're really busy!" Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:54 -07:00
KAMEZAWA Hiroyuki	38460b48d0	cgroup: CSS ID support Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code. This patch attaches unique ID to each css and provides following. - css_lookup(subsys, id) returns pointer to struct cgroup_subysys_state of id. - css_get_next(subsys, id, rootid, depth, foundid) returns the next css under "root" by scanning When cgroup_subsys->use_id is set, an id for css is maintained. The cgroup framework only parepares - css_id of root css for subsys - id is automatically attached at creation of css. - id is not freed automatically. Because the cgroup framework don't know lifetime of cgroup_subsys_state. free_css_id() function is provided. This must be called by subsys. There are several reasons to develop this. - Saving space .... For example, memcg's swap_cgroup is array of pointers to cgroup. But it is not necessary to be very fast. By replacing pointers(8bytes per ent) to ID (2byes per ent), we can reduce much amount of memory usage. - Scanning without lock. CSS_ID provides "scan id under this ROOT" function. By this, scanning css under root can be written without locks. ex) do { rcu_read_lock(); next = cgroup_get_next(subsys, id, root, &found); /* check sanity of next here */ css_tryget(); rcu_read_unlock(); id = found + 1 } while(...) Characteristics: - Each css has unique ID under subsys. - Lifetime of ID is controlled by subsys. - css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy - Allowed ID is 1-65535, ID 0 is UNUSED ID. Design Choices: - scan-by-ID v.s. scan-by-tree-walk. As /proc's pid scan does, scan-by-ID is robust when scanning is done by following kind of routine. scan -> rest a while(release a lock) -> conitunue from interrupted memcg's hierarchical reclaim does this. - When subsys->use_id is set, # of css in the system is limited to 65535. [bharata@linux.vnet.ibm.com: remove rcu_read_lock() from css_get_next()] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:53 -07:00
Grzegorz Nosek	313e924c08	cgroups: relax ns_can_attach checks to allow attaching to grandchild cgroups The ns_proxy cgroup allows moving processes to child cgroups only one level deep at a time. This commit relaxes this restriction and makes it possible to attach tasks directly to grandchild cgroups, e.g.: ($pid is in the root cgroup) echo $pid > /cgroup/CG1/CG2/tasks Previously this operation would fail with -EPERM and would have to be performed as two steps: echo $pid > /cgroup/CG1/tasks echo $pid > /cgroup/CG1/CG2/tasks Also, the target cgroup no longer needs to be empty to move a task there. Signed-off-by: Grzegorz Nosek <root@localdomain.pl> Acked-by: Serge Hallyn <serue@us.ibm.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:53 -07:00
Alexey Dobriyan	6f2c55b843	Simplify copy_thread() First argument unused since 2.3.11. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:51 -07:00
David Howells	33e5d76979	nommu: fix a number of issues with the per-MM VMA patch Fix a number of issues with the per-MM VMA patch: (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on a NOMMU system with more than 2G pages. Makes no difference on a 32-bit system. (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value, lest it overflow. (3) Move the allocation of the vm_area_struct slab back for fork.c. (4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs. (5) Use BUG_ON() rather than if () BUG(). (6) Make the default validate_nommu_regions() a static inline rather than a #define. (7) Make free_page_series()'s objection to pages with a refcount != 1 more informative. (8) Adjust the __put_nommu_region() banner comment to indicate that the semaphore must be held for writing. (9) Limit the number of warnings about munmaps of non-mmapped regions. Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-02 19:04:48 -07:00
Darren Hart	cd84a42f31	futex: comment requeue key reference semantics We've tripped over the futex_requeue drop_count refering to key2 instead of key1. The code is actually correct, but is non-intuitive. This patch adds an explicit comment explaining the requeue. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-02 23:39:53 +02:00
Randy Dunlap	633fe795b8	timers: add missing kernel-doc Add missing kernel-doc parameter notation and change function name to its new name: Warning(kernel/timer.c:543): No description found for parameter 'name' Warning(kernel/timer.c:543): No description found for parameter 'key' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: akpm <akpm@linux-foundation.org> Cc: Johannes Berg <johannes@sipsolutions.net> LKML-Reference: <20090401174723.f0bea0eb.randy.dunlap@oracle.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-02 05:54:00 +02:00
Ingo Molnar	8302294f43	Merge branch 'tracing/core-v2' into tracing-for-linus Conflicts: include/linux/slub_def.h lib/Kconfig.debug mm/slob.c mm/slub.c	2009-04-02 00:49:02 +02:00
Davide Libenzi	4ede816ac3	epoll keyed wakeups: add __wake_up_locked_key() and __wake_up_sync_key() This patchset introduces wakeup hints for some of the most popular (from epoll POV) devices, so that epoll code can avoid spurious wakeups on its waiters. The problem with epoll is that the callback-based wakeups do not, ATM, carry any information about the events the wakeup is related to. So the only choice epoll has (not being able to call f_op->poll() from inside the callback), is to add the file* to a ready-list and resolve the real events later on, at epoll_wait() (or its own f_op->poll()) time. This can cause spurious wakeups, since the wake_up() itself might be for an event the caller is not interested into. The rate of these spurious wakeup can be pretty high in case of many network sockets being monitored. By allowing devices to report the events the wakeups refer to (at least the two major classes - POLLIN/POLLOUT), we are able to spare useless wakeups by proper handling inside the epoll's poll callback. Epoll will have in any case to call f_op->poll() on the file* later on, since the change to be done in order to have the full event set sent via wakeup, is too invasive for the way our f_op->poll() system works (the full event set is calculated inside the poll function - there are too many of them to even start thinking the change - also poll/select would need change too). Epoll is changed in a way that both devices which send event hints, and the ones that don't, are correctly handled. The former will gain some efficiency though. As a general rule for devices, would be to add an event mask by using key-aware wakeup macros, when making up poll wait queues. I tested it (together with the epoll's poll fix patch Andrew has in -mm) and wakeups for the supported devices are correctly filtered. Test program available here: http://www.xmailserver.org/epoll_test.c This patch: Nothing revolutionary here. Just using the available "key" that our wakeup core already support. The __wake_up_locked_key() was no brainer, since both __wake_up_locked() and __wake_up_locked_key() are thin wrappers around __wake_up_common(). The __wake_up_sync() function had a body, so the choice was between borrowing the body for __wake_up_sync_key() and calling it from __wake_up_sync(), or make an inline and calling it from both. I chose the former since in most archs it all resolves to "mov $0, REG; jmp ADDR". Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: William Lee Irwin III <wli@movementarian.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:20 -07:00
Magnus Damm	a8af78982f	pm: rework includes, remove arch ifdefs Make the following header file changes: - remove arch ifdefs and asm/suspend.h from linux/suspend.h - add asm/suspend.h to disk.c (for arch_prepare_suspend()) - add linux/io.h to swsusp.c (for ioremap()) - x86 32/64 bit compile fixes Signed-off-by: Magnus Damm <damm@igel.co.jp> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:16 -07:00
Alexey Dobriyan	704503d836	mm: fix proc_dointvec_userhz_jiffies "breakage" Addresses http://bugzilla.kernel.org/show_bug.cgi?id=9838 On i386, HZ=1000, jiffies_to_clock_t() converts time in a somewhat strange way from the user's point of view: # echo 500 >/proc/sys/vm/dirty_writeback_centisecs # cat /proc/sys/vm/dirty_writeback_centisecs 499 So, we have 5000 jiffies converted to only 499 clock ticks and reported back. TICK_NSEC = 999848 ACTHZ = 256039 Keeping in-kernel variable in units passed from userspace will fix issue of course, but this probably won't be right for every sysctl. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:13 -07:00
KOSAKI Motohiro	ee99c71c59	mm: introduce for_each_populated_zone() macro Impact: cleanup In almost cases, for_each_zone() is used with populated_zone(). It's because almost function doesn't need memoryless node information. Therefore, for_each_populated_zone() can help to make code simplify. This patch has no functional change. [akpm@linux-foundation.org: small cleanup] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-01 08:59:11 -07:00
Gautham R Shenoy	46e0bb9c12	sched: Print sched_group::__cpu_power in sched_domain_debug Impact: extend debug info /proc/sched_debug If the user changes the value of the sched_mc/smt_power_savings sysfs tunable, it'll trigger a rebuilding of the whole sched_domain tree, with the SD_POWERSAVINGS_BALANCE flag set at certain levels. As a result, there would be a change in the __cpu_power of sched_groups in the sched_domain hierarchy. Print the __cpu_power values for each sched_group in sched_domain_debug to help verify this change and correlate it with the change in the load-balancing behavior. Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20090330045520.2869.24777.stgit@sofia.in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-01 17:58:03 +02:00
Bharata B Rao	ef12fefabf	cpuacct: add per-cgroup utime/stime statistics Add per-cgroup cpuacct controller statistics like the system and user time consumed by the group of tasks. Changelog: v7 - Changed the name of the statistic from utime to user and from stime to system so that in future we could easily add other statistics like irq, softirq, steal times etc easily. v6 - Fixed a bug in the error path of cpuacct_create() (pointed by Li Zefan). v5 - In cpuacct_stats_show(), use cputime64_to_clock_t() since we are operating on a 64bit variable here. v4 - Remove comments in cpuacct_update_stats() which explained why rcu_read_lock() was needed (as per Peter Zijlstra's review comments). - Don't say that percpu_counter_read() is broken in Documentation/cpuacct.txt as per KAMEZAWA Hiroyuki's review comments. v3 - Fix a small race in the cpuacct hierarchy walk. v2 - stime and utime now exported in clock_t units instead of msecs. - Addressed the code review comments from Balbir and Li Zefan. - Moved to -tip tree. v1 - Moved the stime/utime accounting to cpuacct controller. Earlier versions - http://lkml.org/lkml/2009/2/25/129 Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Signed-off-by: Balaji Rao <balajirrao@gmail.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> LKML-Reference: <20090331043222.GA4093@in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-01 16:49:38 +02:00
Hidetoshi Seto	c5f8d99585	posixtimers, sched: Fix posix clock monotonicity Impact: Regression fix (against clock_gettime() backwarding bug) This patch re-introduces a couple of functions, task_sched_runtime and thread_group_sched_runtime, which was once removed at the time of 2.6.28-rc1. These functions protect the sampling of thread/process clock with rq lock. This rq lock is required not to update rq->clock during the sampling. i.e. The clock_gettime() may return ((accounted runtime before update) + (delta after update)) that is less than what it should be. v2 -> v3: - Rename static helper function __task_delta_exec() to do_task_delta_exec() since -tip tree already has a __task_delta_exec() of different version. v1 -> v2: - Revises comments of function and patch description. - Add note about accuracy of thread group's runtime. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org [2.6.28.x][2.6.29.x] LKML-Reference: <49D1CC93.4080401@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-01 16:44:16 +02:00
Steven Rostedt	2e572895bf	ring-buffer: do not remove reader page from list on ring buffer free Impact: prevent possible memory leak The reader page of the ring buffer is special. Although it points into the ring buffer, it is not part of the actual buffer. It is a page used by the reader to swap with a page in the ring buffer. Once the swap is made, the new reader page is again outside the buffer. Even though the reader page points into the buffer, it is really pointing to residual data. Note, this data is used by the reader. reader page \| v (prev) +---+ (next) +----------\| \|----------+ \| +---+ \| v v +---+ +---+ +---+ -->\| \|------->\| \|------->\| \|---> <--\| \|<-------\| \|<-------\| \|<--- +---+ +---+ +---+ ^ ^ ^ \ \| / ------- Buffer--------- If we perform a list_del_init() on the reader page we will actually remove the last page the reader swapped with and not the reader page itself. This will cause that page to not be freed, and thus is a memory leak. Luckily, the only user of the ring buffer so far is ftrace. And ftrace will not free its ring buffer after it allocates it. There is no current possible memory leak. But once there are other users, or if ftrace dynamically creates and frees its ring buffer, then this would be a memory leak. This patch fixes the leak for future cases. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-01 14:47:53 +02:00
Steven Rostedt	2aad1b76e6	function-graph: allow unregistering twice Impact: fix to permanent disabling of function graph tracer There should be nothing to prevent a tracer from unregistering a function graph callback more than once. This can simplify error paths. But currently, the counter does not account for mulitple unregistering of the function graph callback. If it happens, the function graph tracer will be permanently disabled. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-01 14:47:42 +02:00
Rusty Russell	13b8bd0a57	sched_rt: don't allocate cpumask in fastpath Impact: cleanup As pointed out by Steven Rostedt. Since the arg in question is unused, we simply change cpupri_find() to accept NULL. Reported-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <200903251501.22664.rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-04-01 13:24:51 +02:00
Al Viro	5ad4e53bd5	Get rid of indirect include of fs_struct.h Don't pull it in sched.h; very few files actually need it and those can include directly. sched.h itself only needs forward declaration of struct fs_struct; Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-31 23:00:27 -04:00
Al Viro	498052bba5	New locking/refcounting for fs_struct * all changes of current->fs are done under task_lock and write_lock of old fs->lock * refcount is not atomic anymore (same protection) * its decrements are done when removing reference from current; at the same time we decide whether to free it. * put_fs_struct() is gone * new field - ->in_exec. Set by check_unsafe_exec() if we are trying to do execve() and only subthreads share fs_struct. Cleared when finishing exec (success and failure alike). Makes CLONE_FS fail with -EAGAIN if set. * check_unsafe_exec() may fail with -EAGAIN if another execve() from subthread is in progress. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-31 23:00:26 -04:00
Al Viro	3e93cd6718	Take fs_struct handling to new file (fs/fs_struct.c) Pure code move; two new helper functions for nfsd and daemonize (unshare_fs_struct() and daemonize_fs_struct() resp.; for now - the same code as used to be in callers). unshare_fs_struct() exported (for nfsd, as copy_fs_struct()/exit_fs() used to be), copy_fs_struct() and exit_fs() don't need exports anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-31 23:00:26 -04:00
Al Viro	11d06b2a1e	Kill unsharing fs_struct in __set_personality() That's a rudiment of altroot support. I.e. it should've been buried a long time ago. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-31 23:00:25 -04:00
Bharata B Rao	a18b83b7ef	cpuacct: make cpuacct hierarchy walk in cpuacct_charge() safe when rcupreempt is used -v2 Impact: fix cgroups race under rcu-preempt cpuacct_charge() obtains task's ca and does a hierarchy walk upwards. This can race with the task's movement between cgroups. This race can cause an access to freed ca pointer in cpuacct_charge() or access to invalid cgroups pointer of the task. This will not happen with rcu or tree rcu as cpuacct_charge() is called with preemption disabled. However if rcupreempt is used, the race is seen. Thanks to Li Zefan for explaining this. Fix this race by explicitly protecting ca and the hierarchy walk with rcu_read_lock(). Changes for v2: - Update patch descrition (as per Li Zefan's review comments). - Remove comments in cpuacct_charge() which explained why rcu_read_lock() was needed (as per Peter Zijlstra's review comments). Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-31 18:27:59 +02:00
Ingo Molnar	8b54e45b00	Merge branches 'tracing/docs', 'tracing/filters', 'tracing/ftrace', 'tracing/kprobes', 'tracing/blktrace-v2' and 'tracing/textedit' into tracing/core-v2	2009-03-31 17:46:40 +02:00
Li Zefan	b14b70a6a4	trace: make argument 'mem' of trace_seq_putmem() const Impact: fix build warning I passed a const value to trace_seq_putmem(), and I got compile warning. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-31 17:45:13 +02:00
Eduard - Gabriel Munteanu	f285901bb2	tracing: add missing 'extern' keywords to trace_output.h Impact: cleanup Many declarations within trace_output.h are missing the 'extern' keyword in an inconsistent manner. This adds 'extern' where it should be. Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-31 17:45:07 +02:00
Eduard - Gabriel Munteanu	bdd6df6af9	tracing: provide trace_seq_reserve() trace_seq_reserve() allows a caller to reserve space in a trace_seq and write directly into it. This makes it easier to export binary data to userspace via the tracing interface, by simply filling in a struct. Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-31 17:45:02 +02:00
Li Zefan	18cea4591a	blktrace: print out BLK_TN_MESSAGE properly Impact: improve ftrace plugin output Before this patch: # cat trace make-5383 [001] 741.240059: 8,7 P N [make] __trace_note_message: cfq1074 # echo 1 > options/blk_classic # cat trace 8,7 1 0.692221252 0 C W 130411392 + 1024 [0] Bad pc action 6361 Bad pc action 283d # echo 0 > options/blk_classic # echo bin > trace_options # cat trace_pipe \| blkparse -i - (can't parse messages generated by blk_add_trace_msg()) After this patch: # cat trace <idle>-0 [001] 187.600933: 8,7 C W 145220224 + 8 [0] <idle>-0 [001] 187.600946: 8,7 m N cfq1076 complete # echo 1 > options/blk_classic # cat trace 8,7 1 0.256378996 238 I W 113190728 + 8 [pdflush] 8,7 1 0.256378998 238 m N cfq1076 insert_request # echo 0 > options/blk_classic # echo bin > trace_options # cat trace_pipe \| blkparse -i - 8,7 1 0 22.973250293 0 C W 102770576 + 8 [0] 8,7 1 0 22.973259213 0 m N cfq1076 complete Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-31 17:29:11 +02:00
Li Zefan	b6a4b0c3ad	blktrace: extract duplidate code Impact: cleanup blk_trace_event_print() and blk_tracer_print_line() share most of the code. text data bss dec hex filename 8605 393 12 9010 2332 kernel/trace/blktrace.o.orig text data bss dec hex filename 8555 393 12 8960 2300 kernel/trace/blktrace.o This patch also prepares for the next patch, that prints out BLK_TN_MESSAGE. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-31 17:28:59 +02:00
Li Zefan	ad5dd5493a	blktrace: fix memory leak when freeing struct blk_io_trace Impact: fix mixed ioctl and ftrace-plugin blktrace use memory leak When mixing the use of ioctl-based blktrace and ftrace-based blktrace, we can leak memory in this way: # btrace /dev/sda > /dev/null & # echo 0 > /sys/block/sda/sda1/trace/enable now we leak bt->dropped_file, bt->msg_file, bt->rchan... Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-31 17:27:58 +02:00
Li Zefan	17ba97e347	blktrace: fix blk_probes_ref chaos Impact: fix mixed ioctl and ftrace-plugin blktrace use refcount bugs ioctl-based blktrace allocates bt and registers tracepoints when ioctl(BLKTRACESETUP), and do all cleanups when ioctl(BLKTRACETEARDOWN). while ftrace-based blktrace allocates/frees bt when: # echo 1/0 > /sys/block/sda/sda1/trace/enable and registers/unregisters tracepoints when: # echo blk/nop > /debugfs/tracing/current_tracer or # echo 1/0 > /debugfs/tracing/tracing_enable The separatation of allocation and registeration causes 2 problems: 1. current user-space blktrace still calls ioctl(TEARDOWN) when ioctl(SETUP) failed: # echo 1 > /sys/block/sda/sda1/trace/enable # blktrace /dev/sda BLKTRACESETUP: Device or resource busy ^C and now blk_probes_ref == -1 2. Another way to make blk_probes_ref == -1: # plugin sdb && mount sdb1 # echo 1 > /sys/block/sdb/sdb1/trace/enable # remove sdb This patch does the allocation and registeration when writing sdaX/trace/enable. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-31 17:27:45 +02:00
Li Zefan	35ac51bfe4	blktrace: make classic output more classic Impact: fix ftrace plugin timestamp output In the classic user-space blktrace, the output timestamp is sec.nsec not sec.usec. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-31 17:27:36 +02:00
Li Zefan	eb08f8eb06	blktrace: fix off-by-one bug 'what' is used as the index of array what2act, so it can't >= the array size. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-31 17:27:18 +02:00
Li Zefan	5554720482	blktrace: fix the original blktrace Currently the original blktrace, which is using relay and is used via ioctl, is broken. You can use ftrace to see the output of blktrace, but user-space blktrace is unusable. It's broken by "blktrace: add ftrace plugin" (`c71a896154`) - if (unlikely(bt->trace_state != Blktrace_running)) + if (unlikely(bt->trace_state != Blktrace_running \|\| !blk_tracer_enabled)) return; With this patch, both ioctl and ftrace can be used, but of course you can't use both of them at the same time. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-31 17:27:08 +02:00
Li Zefan	b5230b56ee	blktrace: fix a race when creating blk_tree_root in debugfs t1 t2 ------ ------ do_blk_trace_setup() do_blk_trace_setup() if (!blk_tree_root) { if (!blk_tree_root) blk_tree_root = create_dir() blk_tree_root = create_dir(); (now blk_tree_root == NULL) ... dir = create_dir(name, blk_tree_root); Due to this race, t1 will create 'dir' in /debugfs but not /debugfs/block. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-31 17:26:59 +02:00
Li Zefan	6c051ce030	blktrace: fix timestamp in binary output I found the timestamp is wrong: # echo bin > trace_option # echo blk > current_tracer # cat trace_pipe \| blkparse -i - 8,0 0 0 0.000000000 504 A W ... ... 8,7 1 0 0.008534097 0 C R ... (should be 8.534097xxx) user-space blkparse expects the timestamp to be nanosecond. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-31 17:26:47 +02:00
Peter Zijlstra	eedeeabdee	lockdep: add stack dumps to asserts Have a better idea about exactly which loc causes a lockdep limit overflow. Often it's a bug or inefficiency in that subsystem. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1237376327.5069.253.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-31 14:53:01 +02:00
Peter Zijlstra	7f1e2ca9f0	hrtimer: fix rq->lock inversion (again) It appears I inadvertly introduced rq->lock recursion to the hrtimer_start() path when I delegated running already expired timers to softirq context. This patch fixes it by introducing a __hrtimer_start_range_ns() method that will not use raise_softirq_irqoff() but __raise_softirq_irqoff() which avoids the wakeup. It then also changes schedule() to check for pending softirqs and do the wakeup then, I'm not quite sure I like this last bit, nor am I convinced its really needed. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus@samba.org LKML-Reference: <20090313112301.096138802@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-31 14:52:52 +02:00
Ingo Molnar	7bee946358	Merge branch 'linus' into locking-for-linus Conflicts: lib/Kconfig.debug	2009-03-31 13:53:43 +02:00
Rusty Russell	558f6ab910	Merge branch 'cpumask-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip Conflicts: arch/x86/include/asm/topology.h drivers/oprofile/buffer_sync.c (Both cases: changed in Linus' tree, removed in Ingo's).	2009-03-31 13:33:50 +10:30
Rusty Russell	49502677e1	module: use strstarts() Impact: minor cleanup. I'm not going to neaten anyone else's code, but I'm happy to clean up my own. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-03-31 13:05:37 +10:30
Rusty Russell	e91defa26c	module: don't use stop_machine on module load Kay Sievers <kay.sievers@vrfy.org> discovered that boot times are slowed by about half a second because all the stop_machine_create() calls, and he only probes about 40 modules (I have 125 loaded on this laptop). We only do stop_machine_create() so we can unlink the module if something goes wrong, but it's overkill (and buggy anyway: if stop_machine_create() fails we still call stop_machine_destroy()). Since we are only protecting against kallsyms (esp. oops) walking the list, synchronize_sched() is sufficient (synchronize_rcu() is probably sufficient, but we're not in a hurry). Kay says of this patch: ... no module takes more than 40 millisecs to link now, most of them are between 3 and 8 millisecs. That looks very different to the numbers without this patch and the otherwise same setup, where we get heavy noise in the traces and many delays of up to 200 millisecs until linking, most of them taking 30+ millisecs. Tested-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-03-31 13:05:35 +10:30
Arjan van de Ven	acae051565	module: create a request_module_nowait() There seems to be a common pattern in the kernel where drivers want to call request_module() from inside a module_init() function. Currently this would deadlock. As a result, several drivers go through hoops like scheduling things via kevent, or creating custom work queues (because kevent can deadlock on them). This patch changes this to use a request_module_nowait() function macro instead, which just fires the modprobe off but doesn't wait for it, and thus avoids the original deadlock entirely. On my laptop this already results in one less kernel thread running.. (Includes Jiri's patch to use enum umh_wait) Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (bool-ified) Cc: Jiri Slaby <jirislaby@gmail.com>	2009-03-31 13:05:35 +10:30
Rusty Russell	8c8ef42aee	module: include other structures in module version check With CONFIG_MODVERSIONS, we version 'struct module' using a dummy export, but other things matter too: 1) 'struct modversion_info' determines the layout of the __versions section, 2) 'struct kernel_param' determines the layout of the __params section, 3) 'struct kernel_symbol' determines __ksymtab*. 4) 'struct marker' determines __markers. 5) 'struct tracepoint' determines __tracepoints. So we rename 'struct_module' to 'module_layout' and include these in the signature. Now it's general we can add others later on without confusion. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-03-31 13:05:34 +10:30
Rusty Russell	9cb610d8e3	module: remove the SHF_ALLOC flag on the __versions section. Impact: reduce kernel memory usage This patch just takes off the SHF_ALLOC flag on __versions so we don't keep them around after module load. This saves about 7% of module memory if CONFIG_MODVERSIONS=y. Cc: Shawn Bohrer <shawn.bohrer@gmail.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-03-31 13:05:34 +10:30
Rusty Russell	c6e665c8f0	module: clarify the force-loading taint message. Impact: Message cleanup Two of three callers of try_to_force_load() are not because of a missing version, so change the messages: Old: <modname>: no version for "magic" found: kernel tainted. New: <modname>: bad vermagic: kernel tainted. Old: <modname>: no version for "nocrc" found: kernel tainted. New: <modname>: no versions for exported symbols: kernel tainted. Old: <modname>: no version for "<symname>" found: kernel tainted. New: <modname>: <symname>: kernel tainted. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-03-31 13:05:33 +10:30
Tim Abbott	c6b3780191	module: Export symbols needed for Ksplice Impact: Expose some module.c symbols Ksplice uses several functions from module.c in order to resolve symbols and implement dependency handling. Calling these functions requires holding module_mutex, so it is exported. (This is just the module part of a bigger add-exports patch from Tim). Cc: Anders Kaseorg <andersk@mit.edu> Cc: Jeff Arnold <jbarnold@mit.edu> Signed-off-by: Tim Abbott <tabbott@mit.edu> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-03-31 13:05:33 +10:30
Anders Kaseorg	75a66614db	Ksplice: Add functions for walking kallsyms symbols Impact: New API kallsyms_lookup_name only returns the first match that it finds. Ksplice needs information about all symbols with a given name in order to correctly resolve local symbols. kallsyms_on_each_symbol provides a generic mechanism for iterating over the kallsyms table. Cc: Jeff Arnold <jbarnold@mit.edu> Cc: Tim Abbott <tabbott@mit.edu> Signed-off-by: Anders Kaseorg <andersk@mit.edu> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-03-31 13:05:32 +10:30
Rusty Russell	a6e6abd575	module: remove module_text_address() Impact: Replace and remove risky (non-EXPORTed) API module_text_address() returns a pointer to the module, which given locking improvements in module.c, is useless except to test for NULL: 1) If the module can't go away, use __module_text_address. 2) Otherwise, just use is_module_text_address(). Cc: linux-mtd@lists.infradead.org Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-03-31 13:05:32 +10:30
Rusty Russell	e610499e26	module: __module_address Impact: New API, cleanup ksplice wants to know the bounds of a module, not just the module text. It makes sense to have __module_address. We then implement is_module_address and __module_text_address in terms of this (and change is_module_text_address() to bool while we're at it). Also, add proper kerneldoc for them all. Cc: Anders Kaseorg <andersk@mit.edu> Cc: Jeff Arnold <jbarnold@mit.edu> Cc: Tim Abbott <tabbott@mit.edu> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-03-31 13:05:31 +10:30
Tim Abbott	414fd31b25	module: Make find_symbol return a struct kernel_symbol Impact: Cleanup, internal API change Ksplice needs access to the kernel_symbol structure in order to support modifications to the exported symbol table. Cc: Anders Kaseorg <andersk@mit.edu> Cc: Jeff Arnold <jbarnold@mit.edu> Signed-off-by: Tim Abbott <tabbott@mit.edu> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (bugfix and style)	2009-03-31 13:05:31 +10:30
Américo Wang	b10153fe31	kernel/module.c: fix an unused goto label Impact: cleanup Label 'free_init' is only used when defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP), so move it inside to shut up gcc. Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-03-31 13:05:30 +10:30
Rusty Russell	e180a6b775	param: fix charp parameters set via sysfs Impact: fix crash on reading from /sys/module/.../ieee80211_default_rc_algo The module_param type "charp" simply sets a char * pointer in the module to the parameter in the commandline string: this is why we keep the (mangled) module command line around. But when set via sysfs (as about 11 charp parameters can be) this memory is freed on the way out of the write(). Future reads hit random mem. So we kstrdup instead: we have to check we're not in early commandline parsing, and we have to note when we've used it so we can reliably kfree the parameter when it's next overwritten, and also on module unload. (Thanks to Randy Dunlap for CONFIG_SYSFS=n fixes) Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com> Diagnosed-by: Frederic Weisbecker <fweisbec@gmail.com> Tested-by: Frederic Weisbecker <fweisbec@gmail.com> Tested-by: Christof Schmitt <christof.schmitt@de.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-03-31 13:05:30 +10:30
Linus Torvalds	d17abcd541	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask: oprofile: Thou shalt not call __exit functions from __init functions cpumask: remove the now-obsoleted pcibus_to_cpumask(): generic cpumask: remove cpumask_t from core cpumask: convert rcutorture.c cpumask: use new cpumask_ functions in core code. cpumask: remove references to struct irqaction's mask field. cpumask: use mm_cpumask() wrapper: kernel/fork.c cpumask: use set_cpu_active in init/main.c cpumask: remove node_to_first_cpu cpumask: fix seq_bitmap_*() functions. cpumask: remove dangerous CPU_MASK_ALL_PTR, &CPU_MASK_ALL	2009-03-30 18:00:26 -07:00
Linus Torvalds	c4e1aa67ed	Merge branch 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (33 commits) lockdep: fix deadlock in lockdep_trace_alloc lockdep: annotate reclaim context (__GFP_NOFS), fix SLOB lockdep: annotate reclaim context (__GFP_NOFS), fix lockdep: build fix for !PROVE_LOCKING lockstat: warn about disabled lock debugging lockdep: use stringify.h lockdep: simplify check_prev_add_irq() lockdep: get_user_chars() redo lockdep: simplify get_user_chars() lockdep: add comments to mark_lock_irq() lockdep: remove macro usage from mark_held_locks() lockdep: fully reduce mark_lock_irq() lockdep: merge the !_READ mark_lock_irq() helpers lockdep: merge the _READ mark_lock_irq() helpers lockdep: simplify mark_lock_irq() helpers #3 lockdep: further simplify mark_lock_irq() helpers lockdep: simplify the mark_lock_irq() helpers lockdep: split up mark_lock_irq() lockdep: generate usage strings lockdep: generate the state bit definitions ...	2009-03-30 17:17:35 -07:00
Lai Jiangshan	f69b17d7e7	rcu: rcu_barrier VS cpu_hotplug: Ensure callbacks in dead cpu are migrated to online cpu cpu hotplug may happen asynchronously, some rcu callbacks are maybe still on dead cpu, rcu_barrier() also needs to wait for these rcu callbacks to complete, so we must ensure callbacks in dead cpu are migrated to online cpu. Paul E. McKenney's review: Good stuff, Lai!!! Simpler than any of the approaches that I was considering, and, better yet, independent of the underlying RCU implementation!!! I was initially worried that wake_up() might wake only one of two possible wait_event()s, namely rcu_barrier() and the CPU_POST_DEAD code, but the fact that wait_event() clears WQ_FLAG_EXCLUSIVE avoids that issue. I was also worried about the fact that different RCU implementations have different mappings of call_rcu(), call_rcu_bh(), and call_rcu_sched(), but this is OK as well because we just get an extra (harmless) callback in the case that they map together (for example, Classic RCU has call_rcu_sched() mapping to call_rcu()). Overlap of CPU-hotplug operations is prevented by cpu_add_remove_lock, and any stray callbacks that arrive (for example, from irq handlers running on the dying CPU) either are ahead of the CPU_DYING callbacks on the one hand (and thus accounted for), or happened after the rcu_barrier() started on the other (and thus don't need to be accounted for). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <49C36476.1010400@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-31 00:09:37 +02:00
Ingo Molnar	65fb0d23fc	Merge branch 'linus' into cpumask-for-linus Conflicts: arch/x86/kernel/cpu/common.c	2009-03-30 23:53:32 +02:00
Peter Zijlstra	2f85018152	lockdep: fix deadlock in lockdep_trace_alloc Heiko reported that we grab the graph lock with irqs enabled. Fix this by providng the same wrapper as all other lockdep entry functions have. Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <npiggin@suse.de> LKML-Reference: <1237544000.24626.52.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-30 23:19:24 +02:00
Rafael J. Wysocki	749b0afc3a	kexec: Change kexec jump code ordering Change the ordering of the kexec jump code so that the nonboot CPUs are disabled after calling device drivers' "late suspend" methods. This change reflects the recent modifications of the power management code that is also used by kexec jump. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-03-30 21:46:55 +02:00
Rafael J. Wysocki	4aecd67189	PM: Change hibernation code ordering Change the ordering of the hibernation core code so that the platform "prepare" callbacks are executed and the nonboot CPUs are disabled after calling device drivers' "late suspend" methods. This change (along with the previous analogous change of the suspend core code) will allow us to rework the PCI PM core so that the power state of devices is changed in the "late" phase of suspend (and analogously in the "early" phase of resume), which in turn will allow us to avoid the race condition where a device using shared interrupts is put into a low power state with interrupts enabled and then an interrupt (for another device) comes in and confuses its driver. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-03-30 21:46:54 +02:00
Rafael J. Wysocki	900af0d973	PM: Change suspend code ordering Change the ordering of the suspend core code so that the platform "prepare" callback is executed and the nonboot CPUs are disabled after calling device drivers' "late suspend" methods. This change will allow us to rework the PCI PM core so that the power state of devices is changed in the "late" phase of suspend (and analogously in the "early" phase of resume), which in turn will allow us to avoid the race condition where a device using shared interrupts is put into a low power state with interrupts enabled and then an interrupt (for another device) comes in and confuses its driver. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-03-30 21:46:54 +02:00
Rafael J. Wysocki	2ed8d2b3a8	PM: Rework handling of interrupts during suspend-resume Use the functions introduced in by the previous patch, suspend_device_irqs(), resume_device_irqs() and check_wakeup_irqs(), to rework the handling of interrupts during suspend (hibernation) and resume. Namely, interrupts will only be disabled on the CPU right before suspending sysdevs, while device drivers will be prevented from receiving interrupts, with the help of the new helper function, before their "late" suspend callbacks run (and analogously during resume). In addition, since the device interrups are now disabled before the CPU has turned all interrupts off and the CPU will ACK the interrupts setting the IRQ_PENDING bit for them, check in sysdev_suspend() if any wake-up interrupts are pending and abort suspend if that's the case. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-03-30 21:46:54 +02:00
Rafael J. Wysocki	0a0c5168df	PM: Introduce functions for suspending and resuming device interrupts Introduce helper functions allowing us to prevent device drivers from getting any interrupts (without disabling interrupts on the CPU) during suspend (or hibernation) and to make them start to receive interrupts again during the subsequent resume. These functions make it possible to keep timer interrupts enabled while the "late" suspend and "early" resume callbacks provided by device drivers are being executed. In turn, this allows device drivers' "late" suspend and "early" resume callbacks to sleep, execute ACPI callbacks etc. The functions introduced here will be used to rework the handling of interrupts during suspend (hibernation) and resume. Namely, interrupts will only be disabled on the CPU right before suspending sysdevs, while device drivers will be prevented from receiving interrupts, with the help of the new helper function, before their "late" suspend callbacks run (and analogously during resume). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-03-30 21:46:54 +02:00
Matt LaPlante	692105b8ac	trivial: fix typos/grammar errors in Kconfig texts Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-03-30 15:22:01 +02:00
Nick Andrew	877d03105d	trivial: Fix misspelling of firmware Fix misspelling of firmware. Signed-off-by: Nick Andrew <nick@nick-andrew.net> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-03-30 15:21:59 +02:00
Uwe Kleine-Koenig	2a93a1f214	trivial: fix typo "resgister" -> "register" Signed-off-by: Uwe Kleine-Koenig <ukleinek@strlen.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-03-30 15:21:58 +02:00
Rusty Russell	612a726faf	cpumask: remove cpumask_t from core Impact: cleanup struct cpumask is nicer, and we use it to make where we've made code safe for CONFIG_CPUMASK_OFFSTACK=y. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-03-30 22:05:17 +10:30
Rusty Russell	73d0a4b107	cpumask: convert rcutorture.c We're getting rid of cpumasks on the stack. Simply change tmp_mask to a global, and allocate it in rcu_torture_init(). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@freedesktop.org>	2009-03-30 22:05:16 +10:30
Rusty Russell	aa85ea5b89	cpumask: use new cpumask_ functions in core code. Impact: cleanup Time to clean up remaining laggards using the old cpu_ functions. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Trond.Myklebust@netapp.com	2009-03-30 22:05:16 +10:30
Rusty Russell	9489424454	cpumask: use mm_cpumask() wrapper: kernel/fork.c Impact: futureproof Makes code futureproof against the impending change to mm->cpu_vm_mask. It's also a chance to use the new cpumask_ ops which take a pointer. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-03-30 22:05:13 +10:30
Rusty Russell	2b17fa506c	cpumask: use set_cpu_active in init/main.c cpu_active_map is deprecated in favor of cpu_active_mask, which is const for safety: we use accessors now (set_cpu_active) is we really want to make a change. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-03-30 22:05:12 +10:30
Rusty Russell	1a2142afa5	cpumask: remove dangerous CPU_MASK_ALL_PTR, &CPU_MASK_ALL Impact: cleanup (Thanks to Al Viro for reminding me of this, via Ingo) CPU_MASK_ALL is the (deprecated) "all bits set" cpumask, defined as so: #define CPU_MASK_ALL (cpumask_t) { { ... } } Taking the address of such a temporary is questionable at best, unfortunately `321a8e9d` (cpumask: add CPU_MASK_ALL_PTR macro) added CPU_MASK_ALL_PTR: #define CPU_MASK_ALL_PTR (&CPU_MASK_ALL) Which formalizes this practice. One day gcc could bite us over this usage (though we seem to have gotten away with it so far). So replace everywhere which used &CPU_MASK_ALL or CPU_MASK_ALL_PTR with the modern "cpu_all_mask" (a real const struct cpumask *). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Reported-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Mike Travis <travis@sgi.com>	2009-03-30 22:05:11 +10:30
Benjamin Herrenschmidt	9ff9a26b78	Merge commit 'origin/master' into next Manual merge of: arch/powerpc/include/asm/elf.h drivers/i2c/busses/i2c-mpc.c	2009-03-30 14:04:53 +11:00
Randy Dunlap	d5ac537e5f	sched: fix errors in struct & function comments Fix kernel-doc errors in sched.c: the structs don't have kernel-doc notation and the short function description needs to be one line only. Error(kernel/sched.c:3197): cannot understand prototype: 'struct sd_lb_stats ' Error(kernel/sched.c:3228): cannot understand prototype: 'struct sg_lb_stats ' Error(kernel/sched.c:3375): duplicate section name 'Description' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-29 08:12:39 -07:00
Linus Torvalds	c31f403de6	Merge branch 'futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: remove the pointer math from double_unlock_hb, fix futex: remove the pointer math from double_unlock_hb futex: clean up fault logic futex: unlock before returning -EFAULT futex: use current->time_slack_ns for rt tasks too futex: add double_unlock_hb() futex: additional (get\|put)_futex_key() fixes futex: update futex commentary	2009-03-28 17:32:14 -07:00
Ingo Molnar	38a6ed3ed8	Merge branch 'linus' into core/printk	2009-03-28 23:34:14 +01:00
Ingo Molnar	d00ab2fdd4	Merge branch 'linus' into core/futexes	2009-03-28 23:24:12 +01:00
Linus Torvalds	eedf2c5296	Merge git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async-for-30 * git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async-for-30: fastboot: remove duplicate unpack_to_rootfs() ide/net: flip the order of SATA and network init async: remove the temporary (2.6.29) "async is off by default" code Fix up conflicts in init/initramfs.c manually	2009-03-28 14:00:33 -07:00
Arjan van de Ven	9710794383	async: remove the temporary (2.6.29) "async is off by default" code Now that everyone has been able to test the async code (and it's being used in the Moblin betas by default), we can enable it by default. The various fixes needed have gone into 2.6.29 already. [With an important bugfix from Stefan Richter] Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2009-03-28 13:05:30 -07:00
Ingo Molnar	82268da1b1	Merge branch 'linus' into percpu-cpumask-x86-for-linus-2 Conflicts: arch/sparc/kernel/time_64.c drivers/gpu/drm/drm_proc.c Manual merge to resolve build warning due to phys_addr_t type change on x86: drivers/gpu/drm/drm_info.c Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-28 04:26:01 +01:00
Linus Torvalds	3ae5080f4c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (37 commits) fs: avoid I_NEW inodes Merge code for single and multiple-instance mounts Remove get_init_pts_sb() Move common mknod_ptmx() calls into caller Parse mount options just once and copy them to super block Unroll essentials of do_remount_sb() into devpts vfs: simple_set_mnt() should return void fs: move bdev code out of buffer.c constify dentry_operations: rest constify dentry_operations: configfs constify dentry_operations: sysfs constify dentry_operations: JFS constify dentry_operations: OCFS2 constify dentry_operations: GFS2 constify dentry_operations: FAT constify dentry_operations: FUSE constify dentry_operations: procfs constify dentry_operations: ecryptfs constify dentry_operations: CIFS constify dentry_operations: AFS ...	2009-03-27 16:23:12 -07:00
Sukadev Bhattiprolu	a3ec947c85	vfs: simple_set_mnt() should return void simple_set_mnt() is defined as returning 'int' but always returns 0. Callers assume simple_set_mnt() never fails and don't properly cleanup if it were to _ever_ fail. For instance, get_sb_single() and get_sb_nodev() should: up_write(sb->s_unmount); deactivate_super(sb); if simple_set_mnt() fails. Since simple_set_mnt() never fails, would be cleaner if it did not return anything. [akpm@linux-foundation.org: fix build] Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:03 -04:00
Al Viro	3ba13d179e	constify dentry_operations: rest Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-03-27 14:44:03 -04:00
Ingo Molnar	6e15cf0486	Merge branch 'core/percpu' into percpu-cpumask-x86-for-linus-2 Conflicts: arch/parisc/kernel/irq.c arch/x86/include/asm/fixmap_64.h arch/x86/include/asm/setup.h kernel/irq/handle.c Semantic merge: arch/x86/include/asm/fixmap.h Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-27 17:28:43 +01:00
Linus Torvalds	a8416961d3	Merge branch 'irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (32 commits) x86: disable __do_IRQ support sparseirq, powerpc/cell: fix unused variable warning in interrupt.c genirq: deprecate obsolete typedefs and defines genirq: deprecate __do_IRQ genirq: add doc to struct irqaction genirq: use kzalloc instead of explicit zero initialization genirq: make irqreturn_t an enum genirq: remove redundant if condition genirq: remove unused hw_irq_controller typedef irq: export remove_irq() and setup_irq() symbols irq: match remove_irq() args with setup_irq() irq: add remove_irq() for freeing of setup_irq() irqs genirq: assert that irq handlers are indeed running in hardirq context irq: name 'p' variables a bit better irq: further clean up the free_irq() code flow irq: refactor and clean up the free_irq() code flow irq: clean up manage.c irq: use GFP_KERNEL for action allocation in request_irq() kernel/irq: fix sparse warning: make symbol static irq: optimize init_kstat_irqs/init_copy_kstat_irqs ...	2009-03-26 16:06:50 -07:00
Linus Torvalds	6671de344c	Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (26 commits) posix timers: fix RLIMIT_CPU && fork() time: ntp: fix bug in ntp_update_offset() & do_adjtimex(), fix time: ntp: clean up second_overflow() time: ntp: simplify ntp_tick_adj calculations time: ntp: make 64-bit constants more robust time: ntp: refactor do_adjtimex() some more time: ntp: refactor do_adjtimex() time: ntp: fix bug in ntp_update_offset() & do_adjtimex() time: ntp: micro-optimize ntp_update_offset() time: ntp: simplify ntp_update_offset_fll() time: ntp: refactor and clean up ntp_update_offset() time: ntp: refactor up ntp_update_frequency() time: ntp: clean up ntp_update_frequency() time: ntp: simplify the MAX_TICKADJ_SCALED definition time: ntp: simplify the second_overflow() code flow time: ntp: clean up kernel/time/ntp.c x86: hpet: stop HPET_COUNTER when programming periodic mode x86: hpet: provide separate functions to stop and start the counter x86: hpet: print HPET registers during setup (if hpet=verbose is used) time: apply NTP frequency/tick changes immediately ...	2009-03-26 16:05:42 -07:00
Linus Torvalds	831576fe40	Merge branch 'sched-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (46 commits) sched: Add comments to find_busiest_group() function sched: Refactor the power savings balance code sched: Optimize the !power_savings_balance during fbg() sched: Create a helper function to calculate imbalance sched: Create helper to calculate small_imbalance in fbg() sched: Create a helper function to calculate sched_domain stats for fbg() sched: Define structure to store the sched_domain statistics for fbg() sched: Create a helper function to calculate sched_group stats for fbg() sched: Define structure to store the sched_group statistics for fbg() sched: Fix indentations in find_busiest_group() using gotos sched: Simple helper functions for find_busiest_group() sched: remove unused fields from struct rq sched: jiffies not printed per CPU sched: small optimisation of can_migrate_task() sched: fix typos in documentation sched: add avg_overlap decay x86, sched_clock(): mark variables read-mostly sched: optimize ttwu vs group scheduling sched: TIF_NEED_RESCHED -> need_reshed() cleanup sched: don't rebalance if attached on NULL domain ...	2009-03-26 16:05:01 -07:00
David S. Miller	08abe18af1	Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ Conflicts: drivers/net/wimax/i2400m/usb-notif.c	2009-03-26 15:23:24 -07:00
Linus Torvalds	0c93ea4064	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (61 commits) Dynamic debug: fix pr_fmt() build error Dynamic debug: allow simple quoting of words dynamic debug: update docs dynamic debug: combine dprintk and dynamic printk sysfs: fix some bin_vm_ops errors kobject: don't block for each kobject_uevent sysfs: only allow one scheduled removal callback per kobj Driver core: Fix device_move() vs. dpm list ordering, v2 Driver core: some cleanup on drivers/base/sys.c Driver core: implement uevent suppress in kobject vcs: hook sysfs devices into object lifetime instead of "binding" driver core: fix passing platform_data driver core: move platform_data into platform_device sysfs: don't block indefinitely for unmapped files. driver core: move knode_bus into private structure driver core: move knode_driver into private structure driver core: move klist_children into private structure driver core: create a private portion of struct device driver core: remove polling for driver_probe_done(v5) sysfs: reference sysfs_dirent from sysfs inodes ... Fixed conflicts in drivers/sh/maple/maple.c manually	2009-03-26 11:17:04 -07:00
Ingo Molnar	7c526e1fef	Merge branches 'timers/new-apis', 'timers/ntp' and 'timers/urgent' into timers/core	2009-03-26 15:45:52 +01:00
Ingo Molnar	a5ebc0b1a7	Merge commit 'v2.6.29' into timers/core	2009-03-26 15:45:22 +01:00
Tom Zanussi	9a8118baae	tracing: filter fix for TRACE_EVENT_FORMAT events Impact: fix crash (hang) when using TRACE_EVENT_FORMAT filter files filters are only hooked up to the tracepoint events defined using TRACE_EVENT but not the tracers that use TRACE_EVENT_FORMAT, such as ftrace. Do not display the filter files at all for TRACE_EVENT_FORMAT events for the time being. Cc: Steven Rostedt <rostedt@goodmis.org> Cc: =?ISO-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com> LKML-Reference: <1237878882.8339.61.camel@charm-linux> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-26 09:13:14 +01:00
Zhaolei	2a4efa4245	ftrace: Using FTRACE_WARN_ON() to check "freed record" in ftrace_release() "Because when we call ftrace_free_rec we change the rec->ip to point to the next record in the chain. Something is very wrong if rec->ip >= s && rec->ip < e and the record is already free." "Note, use FTRACE_WARN_ON() macro. This way it shuts down ftrace if it is hit and helps to avoid further damage later." -- Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com>	2009-03-25 17:45:36 -04:00
Lai Jiangshan	2f63b840bc	trace_workqueues: fix empty line's output Empty lines separate cpus stat. After previous fix(trace_stat: keep original order) applied, the empty lines are displayed at incorrect position. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Steven Rostedt <srostedt@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <49C9F266.2060706@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-25 18:32:35 +01:00
Lai Jiangshan	220ba351df	trace_stat: keep original order Impact: make trace_stat files show items with the original order trace_stat tracer reverse the items, it makes the output looks a little ugly. Example, when we read trace_stat/workqueues, we get cpu#7's stat. at first, and then cpu#6... cpu#0. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Steven Rostedt <srostedt@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <49C9F23F.5040307@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-25 18:32:34 +01:00
Lai Jiangshan	e6f489013b	trace_stat: don't call seq_printf() in seq_operation->start() Impact: Fix incorrect way using seq_file's API Use SEQ_START_TOKEN instead of calling ->stat_headers() int seq_operation->start(). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Steven Rostedt <srostedt@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> LKML-Reference: <49C9EAE5.5070202@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-25 18:32:34 +01:00
Gautham R Shenoy	b7bb4c9bb0	sched: Add comments to find_busiest_group() function Impact: cleanup Add /** style comments around find_busiest_group(). Also add a few explanatory comments. This concludes the find_busiest_group() cleanup. The function is now down to 72 lines from the original 313 lines. Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: "Balbir Singh" <balbir@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com> LKML-Reference: <20090325091427.13992.18933.stgit@sofia.in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-25 13:28:30 +01:00
Gautham R Shenoy	c071df1852	sched: Refactor the power savings balance code Impact: cleanup Create seperate helper functions to initialize the power-savings-balance related variables, to update them and to check if we have a scope for performing power-savings balance. Add no-op inline functions for the !(CONFIG_SCHED_MC \|\| CONFIG_SCHED_SMT) case. This will eliminate all the #ifdef jungle in find_busiest_group() and the other helper functions. Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: "Balbir Singh" <balbir@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com> LKML-Reference: <20090325091422.13992.73616.stgit@sofia.in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-25 10:30:48 +01:00
Gautham R Shenoy	a021dc0337	sched: Optimize the !power_savings_balance during fbg() Impact: cleanup, micro-optimization We don't need to perform power_savings balance if either the cpu is NOT_IDLE or if the sched_domain doesn't contain the SD_POWERSAVINGS_BALANCE flag set. Currently, we check for these conditions multiple number of times, even though these variables don't change over the scope of find_busiest_group(). Check once, and store the value in the already exiting "power_savings_balance" variable. Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: "Balbir Singh" <balbir@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com> LKML-Reference: <20090325091417.13992.2657.stgit@sofia.in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-25 10:30:48 +01:00
Gautham R Shenoy	dbc523a3b8	sched: Create a helper function to calculate imbalance Move all the imbalance calculation out of find_busiest_group() through this helper function. With this change, the structure of find_busiest_group() will be as follows: - update_sched_domain_statistics. - check if imbalance exits. - update imbalance and return busiest. Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: "Balbir Singh" <balbir@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com> LKML-Reference: <20090325091411.13992.43293.stgit@sofia.in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-25 10:30:47 +01:00
Gautham R Shenoy	2e6f44aeda	sched: Create helper to calculate small_imbalance in fbg() Impact: cleanup We have two places in find_busiest_group() where we need to calculate the minor imbalance before returning the busiest group. Encapsulate this functionality into a seperate helper function. Credit: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: "Balbir Singh" <balbir@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> LKML-Reference: <20090325091406.13992.54316.stgit@sofia.in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-25 10:30:47 +01:00
Gautham R Shenoy	37abe198b1	sched: Create a helper function to calculate sched_domain stats for fbg() Impact: cleanup Create a helper function named update_sd_lb_stats() to update the various sched_domain related statistics in find_busiest_group(). With this we would have moved all the statistics computation out of find_busiest_group(). Credit: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: "Balbir Singh" <balbir@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> LKML-Reference: <20090325091401.13992.88737.stgit@sofia.in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-25 10:30:46 +01:00
Gautham R Shenoy	222d656dea	sched: Define structure to store the sched_domain statistics for fbg() Impact: cleanup Currently we use a lot of local variables in find_busiest_group() to capture the various statistics related to the sched_domain. Group them together into a single data structure. This will help us to offload the job of updating the sched_domain statistics to a helper function. Credit: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: "Balbir Singh" <balbir@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> LKML-Reference: <20090325091356.13992.25970.stgit@sofia.in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-25 10:30:46 +01:00
Gautham R Shenoy	1f8c553d0f	sched: Create a helper function to calculate sched_group stats for fbg() Impact: cleanup Create a helper function named update_sg_lb_stats() which can be invoked to calculate the individual group's statistics in find_busiest_group(). This reduces the lenght of find_busiest_group() considerably. Credit: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Aked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: "Balbir Singh" <balbir@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> LKML-Reference: <20090325091351.13992.43461.stgit@sofia.in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-25 10:30:45 +01:00
Gautham R Shenoy	381be78fdc	sched: Define structure to store the sched_group statistics for fbg() Impact: cleanup Currently a whole bunch of variables are used to store the various statistics pertaining to the groups we iterate over in find_busiest_group(). Group them together in a single data structure and add appropriate comments. This will be useful later on when we create helper functions to calculate the sched_group statistics. Credit: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: "Balbir Singh" <balbir@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> LKML-Reference: <20090325091345.13992.20099.stgit@sofia.in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-25 10:30:45 +01:00
Gautham R Shenoy	6dfdb06290	sched: Fix indentations in find_busiest_group() using gotos Impact: cleanup Some indentations in find_busiest_group() can minimized by using early exits with the help of gotos. This improves readability in a couple of cases. Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: "Balbir Singh" <balbir@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com> LKML-Reference: <20090325091340.13992.45062.stgit@sofia.in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-25 10:30:44 +01:00
Gautham R Shenoy	67bb6c036d	sched: Simple helper functions for find_busiest_group() Impact: cleanup Currently the load idx calculation code is in find_busiest_group(). Move that to a static inline helper function. Similary, to find the first cpu of a sched_group we use cpumask_first(sched_group_cpus(group)) Use a helper to that. It improves readability in some cases. Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: "Balbir Singh" <balbir@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com> LKML-Reference: <20090325091335.13992.55424.stgit@sofia.in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-25 10:30:44 +01:00
Ingo Molnar	b6d9842258	Merge branch 'sched/cleanups'; commit 'v2.6.29' into sched/core	2009-03-25 10:26:51 +01:00
Jason Baron	e9d376f0fa	dynamic debug: combine dprintk and dynamic printk This patch combines Greg Bank's dprintk() work with the existing dynamic printk patchset, we are now calling it 'dynamic debug'. The new feature of this patchset is a richer /debugfs control file interface, (an example output from my system is at the bottom), which allows fined grained control over the the debug output. The output can be controlled by function, file, module, format string, and line number. for example, enabled all debug messages in module 'nf_conntrack': echo -n 'module nf_conntrack +p' > /mnt/debugfs/dynamic_debug/control to disable them: echo -n 'module nf_conntrack -p' > /mnt/debugfs/dynamic_debug/control A further explanation can be found in the documentation patch. Signed-off-by: Greg Banks <gnb@sgi.com> Signed-off-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-03-24 16:38:26 -07:00
Luis Henriques	67aa0f767a	sched: remove unused fields from struct rq Impact: cleanup, new schedstat ABI Since they are used on in statistics and are always set to zero, the following fields from struct rq have been removed: yld_exp_empty, yld_act_empty and yld_both_empty. Both Sched Debug and SCHEDSTAT_VERSION versions has also been incremented since ABIs have been changed. The schedtop tool has been updated to properly handle new version of schedstat: http://rt.wiki.kernel.org/index.php/Schedtop_utility Signed-off-by: Luis Henriques <henrix@sapo.pt> Acked-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20090324221002.GA10061@hades.domain.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-24 23:16:51 +01:00
Lai Jiangshan	ee000b7f9f	tracing: use union for multi-usages field Impact: cleanup struct dyn_ftrace::ip has different usages in his lifecycle, we use union for it. And also for struct dyn_ftrace::flags. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Steven Rostedt <srostedt@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <49C871BE.3080405@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-24 16:43:12 +01:00
Lai Jiangshan	cc59c9e8d0	ftrace: show virtual PID Impact: fix PID output under namespaces When current namespace is not the global namespace, pid read from set_ftrace_pid is no correct. # ~/newpid_namespace_run bash # echo $$ 1 # echo 1 > set_ftrace_pid # cat set_ftrace_pid 3756 Since we write virtual PID to set_ftrace_pid, we need get virtual PID when we read it. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Steven Rostedt <srostedt@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <49C84D65.9050606@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-24 16:42:49 +01:00
Steven Rostedt	be6f164a02	function-graph: add option for include sleep times Impact: give user a choice to show times spent while sleeping The user may want to see the time a function spent sleeping. This patch adds the trace option "sleep-time" to allow that. The "sleep-time" option is default on. echo sleep-time > /debug/tracing/trace_options produces: ------------------------------------------ 2) avahi-d-3428 => <idle>-0 ------------------------------------------ 2) \| finish_task_switch() { 2) 0.621 us \| _spin_unlock_irq(); 2) 2.202 us \| } 2) ! 1002.197 us \| } 2) ! 1003.521 us \| } where as, echo nosleep-time > /debug/tracing/trace_options produces: 0) <idle>-0 => yum-upd-3416 ------------------------------------------ 0) \| finish_task_switch() { 0) 0.643 us \| _spin_unlock_irq(); 0) 2.342 us \| } 0) + 41.302 us \| } 0) + 42.453 us \| } Signed-off-by: Steven Rostedt <srostedt@redhat.com>	2009-03-24 11:06:24 -04:00
Steven Rostedt	8aef2d2856	function-graph: ignore times across schedule Impact: more accurate timings The current method of function graph tracing does not take into account the time spent when a task is not running. This shows functions that call schedule have increased costs: 3) + 18.664 us \| } ------------------------------------------ 3) <idle>-0 => kblockd-123 ------------------------------------------ 3) \| finish_task_switch() { 3) 1.441 us \| _spin_unlock_irq(); 3) 3.966 us \| } 3) ! 2959.433 us \| } 3) ! 2961.465 us \| } This patch uses the tracepoint in the scheduling context switch to account for time that has elapsed while a task is scheduled out. Now we see: ------------------------------------------ 3) <idle>-0 => edac-po-1067 ------------------------------------------ 3) \| finish_task_switch() { 3) 0.685 us \| _spin_unlock_irq(); 3) 2.331 us \| } 3) + 41.439 us \| } 3) + 42.663 us \| } Signed-off-by: Steven Rostedt <srostedt@redhat.com>	2009-03-24 09:33:30 -04:00
Steven Rostedt	05ce5818ad	function-graph: prevent more than one tracer registering Impact: prevent crash due to multiple function graph tracers The function graph tracer can currently only handle a single tracer being registered. If another tracer registers with the function graph tracer it can crash the system. Signed-off-by: Steven Rostedt <srostedt@redhat.com>	2009-03-24 09:32:52 -04:00
Steven Rostedt	5d1a03dc54	function-graph: moved the timestamp from arch to generic code This patch move the timestamp from happening in the arch specific code into the general code. This allows for better control by the tracer to time manipulation. Signed-off-by: Steven Rostedt <srostedt@redhat.com>	2009-03-24 09:31:34 -04:00
Steven Rostedt	098335215a	tracing: fix memory leak in trace_stat If the function profiler does not have any items recorded and one were to cat the function stat file, the kernel would take a BUG with a NULL pointer dereference. Looking further into this, I found that returning NULL from stat_start did not stop the stat logic, and would later call stat_next. This breaks from the way seq_file works, so I looked into fixing the stat code. This is where I noticed that the last next_entry is never freed. It is allocated, and if the stat_next returns NULL, the code breaks out of the loop, unlocks the mutex and exits. We never link the next_entry nor do we free it. Thus it is a real memory leak. This patch rearranges the code a bit to not only fix the memory leak, but also to act more like seq_file where nothing is printed if there is nothing to print. That is, stat_start returns NULL. Signed-off-by: Steven Rostedt <srostedt@redhat.com>	2009-03-24 09:07:35 -04:00
Li Zefan	093419971e	blktrace: print human-readable act_mask Impact: new feature, allow symbolic values in /debug/tracing/act_mask Print stringified act_mask instead of hex value: # cat act_mask read,write,barrier,sync,queue,requeue,issue,complete,fs,pc,ahead,meta, discard,drv_data # echo "meta,write" > act_mask # cat act_mask write,meta Also: - make act_mask accept "ahead", "meta", "discard" and "drv_data" - use strsep() instead of strchr() to parse user input - return -EINVAL if a token is not found in the mask map - fix a bug that 'value' is unsigned, so it can < 0 - propagate error value of blk_trace_mask2str() to userspace, but not always return -ENXIO. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <49C8AB42.1000802@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-24 13:09:00 +01:00
Li Zefan	e0dc81bec0	blktrace: fix t_error() Impact: fix error flag output t_error() should return t->error but not t->sector. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <49C8945F.5020802@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-24 13:09:00 +01:00
Li Zefan	65796348e0	blktrace: fix wrong calculation of RWBS Impact: fix the output of IO type category characters Trace categories are the upper 16 bits, not the lower 16 bits. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <49C89432.8010805@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-24 13:08:59 +01:00
Li Zefan	e4955c9986	blktrace: mark ddir_act[] const Impact: cleanup ddir_act and what2act always stay immutable. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <49C89415.5080503@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-24 13:08:59 +01:00
Thomas Gleixner	f48fe81e5b	genirq: threaded irq handlers review fixups Delta patch to address the review comments. - Implement warning when IRQ_WAKE_THREAD is requested and no thread handler installed - coding style fixes Pointed-out-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-03-24 12:15:23 +01:00
Arjan van de Ven	935bd5b971	genirq: add support for threaded interrupts to devres Some devices use devres_request_irq() for to install their interrupt handler. Add support for threaded interrupts to devres as well. [tglx - simplified and adapted to latest threadirq version] Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-03-24 12:15:23 +01:00
Thomas Gleixner	3aa551c9b4	genirq: add threaded interrupt handler support Add support for threaded interrupt handlers: A device driver can request that its main interrupt handler runs in a thread. To achive this the device driver requests the interrupt with request_threaded_irq() and provides additionally to the handler a thread function. The handler function is called in hard interrupt context and needs to check whether the interrupt originated from the device. If the interrupt originated from the device then the handler can either return IRQ_HANDLED or IRQ_WAKE_THREAD. IRQ_HANDLED is returned when no further action is required. IRQ_WAKE_THREAD causes the genirq code to invoke the threaded (main) handler. When IRQ_WAKE_THREAD is returned handler must have disabled the interrupt on the device level. This is mandatory for shared interrupt handlers, but we need to do it as well for obscure x86 hardware where disabling an interrupt on the IO_APIC level redirects the interrupt to the legacy PIC interrupt lines. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2009-03-24 12:15:23 +01:00
Tom Zanussi	9f58a159d0	tracing/filters: disallow integer values for string filters and vice versa Impact: fix filter use boundary condition / crash Make sure filters for string fields don't use integer values and vice versa. Getting it wrong can crash the system or produce bogus results. Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: =?ISO-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com> LKML-Reference: <1237878882.8339.61.camel@charm-linux> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-24 08:26:52 +01:00
Tom Zanussi	4bda2d517b	tracing/filters: use trace_seq_printf() to print filters Impact: cleanup Instead of just using the trace_seq buffer to print the filters, use trace_seq_printf() as it was intended to be used. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Cc: =?ISO-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com> LKML-Reference: <1237878871.8339.59.camel@charm-linux> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-24 08:26:52 +01:00
Tom Zanussi	09f1f245c7	tracing/filters: free pred when clearing filters Impact: fix (small) per trace filter modification memory leak Free the current pred when clearing the filters via the filter files. Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: =?ISO-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com> LKML-Reference: <1237878851.8339.58.camel@charm-linux> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-24 08:26:51 +01:00
Tom Zanussi	1fc2d5c119	tracing/filters: use list_for_each_entry Impact: cleanup No need to use the safe version here, so use list_for_each_entry instead of list_for_each_entry_safe in find_event_field(). Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: =?ISO-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com> LKML-Reference: <1237878841.8339.57.camel@charm-linux> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-24 08:26:51 +01:00
Benjamin Herrenschmidt	9e41d9597e	Merge commit 'origin/master' into next	2009-03-24 13:38:30 +11:00
James Morris	703a3cd728	Merge branch 'master' into next	2009-03-24 10:52:46 +11:00
Frederic Weisbecker	1618536961	tracing/function-graph-tracer: fix functions call traces imbalance Impact: fix traces output Sometimes one can observe an imbalance in the traces between function calls and function return traces: func1() { } } The curly brace inside func1() is the return of another function nested inside func1. The return trace have been inserted in the buffer but not the entry. We are storing a return address on the function traces stack while we haven't inserted its entry on the buffer, hence the imbalance on the traces. This is because the tracers doesn't check all failures that can happen on buffer insertion. This patch reports the tracing recursion failures and the ring buffer failures. In such cases, we now restore the original return address for the function, giving up its return trace. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1237843021-11695-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-23 23:25:32 +01:00
Anton Vorontsov	45b9560895	tracing: Fix TRACING_SUPPORT dependency for PPC32 commit `40ada30f96` ("tracing: clean up menu"), despite the "clean up" in its purpose, introduced a behavioural change for Kconfig symbols: we no longer able to select tracing support on PPC32 (because IRQFLAGS_SUPPORT isn't yet implemented). The IRQFLAGS_SUPPORT is not mandatory for most tracers, tracing core has a special case for platforms w/o irqflags (which, by the way, has become useless as of the commit above). Though according to Ingo Molnar, there was periodic build failures on weird, unmaintained architectures that had no irqflags-tracing support and hence didn't know the raw_irqs_save/restore primitives. Thus we'd better not enable irqflags-less tracing for all architectures. This patch restores the old behaviour for PPC32, and thus brings the tracing back. Other architectures can either add themselves to the exception list or (better) implement TRACE_IRQFLAGS_SUPPORT. Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Acked-b: Steven Rostedt <rostedt@goodmis.org> Cc: linuxppc-dev@ozlabs.org LKML-Reference: <20090323220724.GA9851@oksana.dev.rtsoft.ru> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-23 23:23:03 +01:00
Thomas Gleixner	80c5520811	Merge branch 'cpus4096' into irq/threaded Conflicts: arch/parisc/kernel/irq.c kernel/irq/handle.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-03-23 21:20:20 +01:00
Oleg Nesterov	37bebc70d7	posix timers: fix RLIMIT_CPU && fork() See http://bugzilla.kernel.org/show_bug.cgi?id=12911 copy_signal() copies signal->rlim, but RLIMIT_CPU is "lost". Because posix_cpu_timers_init_group() sets cputime_expires.prof_exp = 0 and thus fastpath_timer_check() returns false unless we have other cpu timers. This is the minimal fix for 2.6.29 (tested) and 2.6.28. The patch is not optimal, we need further cleanups here. With this patch update_rlimit_cpu() is not really needed, but I don't think it should be removed. The proper fix (I think) is: - set_process_cpu_timer() should just start the cputimer->running logic (it does), no need to change cputime_expires.xxx_exp - posix_cpu_timers_init_group() should set ->running when needed - fastpath_timer_check() can check ->running instead of task_cputime_zero(signal->cputime_expires) Reported-by: Peter Lojkin <ia6432@inbox.ru> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roland McGrath <roland@redhat.com> Cc: <stable@kernel.org> [for 2.6.29.x] LKML-Reference: <20090323193411.GA17514@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-23 20:43:35 +01:00
Miklos Szeredi	53da1d9456	fix ptrace slowness This patch fixes bug #12208: Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=12208 Subject : uml is very slow on 2.6.28 host This turned out to be not a scheduler regression, but an already existing problem in ptrace being triggered by subtle scheduler changes. The problem is this: - task A is ptracing task B - task B stops on a trace event - task A is woken up and preempts task B - task A calls ptrace on task B, which does ptrace_check_attach() - this calls wait_task_inactive(), which sees that task B is still on the runq - task A goes to sleep for a jiffy - ... Since UML does lots of the above sequences, those jiffies quickly add up to make it slow as hell. This patch solves this by not rescheduling in read_unlock() after ptrace_stop() has woken up the tracer. Thanks to Oleg Nesterov and Ingo Molnar for the feedback. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-23 09:22:31 -07:00
Ingo Molnar	efd247fa34	Merge branches 'sched/debug' and 'linus' into sched/core	2009-03-23 16:53:20 +01:00
Frederic Weisbecker	3e1f60b80c	tracing/ftrace: check if debugfs is registered before creating files Impact: fix a crash with ftrace={nop,boot} parameter If the nop or initcall tracers are launched as boot tracers, they will attempt to create their option directory and files. But these tracers are registered very early and then assigned as "boot tracers" very early if asked to. Since they do this before debugfs has been registered (core initcall), a crash is triggered. Another early tracers could also come later. So we fix it by checking if debugfs is initialized before creating the root tracing directory. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1237759847-21025-3-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-23 16:25:47 +01:00
Ingo Molnar	b3e3b302cf	Merge branches 'irq/sparseirq' and 'linus' into irq/core	2009-03-23 10:07:49 +01:00
Tom Zanussi	c4cff064be	tracing/filters: clean up filter_add_subsystem_pred() Impact: cleanup, memory leak fix This patch cleans up filter_add_subsystem_pred(): - searches for the field before creating a copy of the pred - fixes memory leak in the case a predicate isn't applied - if -ENOMEM, makes sure there's no longer a reference to the pred so the caller can free the half-finished filter - changes the confusing i == MAX_FILTER_PRED - 1 comparison previously remarked upon This affects only per-subsystem event filtering. Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: =?ISO-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com> LKML-Reference: <1237796808.7527.40.camel@charm-linux> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-23 09:30:37 +01:00
Tom Zanussi	ee6cdabc82	tracing/filters: fix bug in copy_pred() Impact: fix potential crash on subsystem filter expression freeing When making a copy of the predicate, pred->field_name needs to be duplicated in the copy as well, otherwise bad things can happen due to later multiple frees of the same string. This affects only per-subsystem event filtering. Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: =?ISO-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com> LKML-Reference: <1237796802.7527.39.camel@charm-linux> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-23 09:30:36 +01:00
Tom Zanussi	75c8b41752	tracing/filters: use list_for_each_entry_safe Impact: cleanup Use list_for_each_entry_safe instead of list_for_each_entry in find_event_field(). Reported-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1237796788.7527.35.camel@charm-linux> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-23 09:28:07 +01:00
Frederic Weisbecker	b118415bfa	tracing/events: don't discard an event after commit When we want to filter an event, the filter test is done after the event is commited to the ring-buffer to be discarded later if needed. But a reader could be reading this event while we are trying to discard it. Other kind of racy events can even happen because the event is commited and can be read and/or consumed. What we want is to discard the event before committing it. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <1237763919-21505-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-23 09:22:15 +01:00
Frederic Weisbecker	7e6ea92df3	tracing/ftrace: make nop-tracer use polling wait for events on pipe Impact: display events when they arrive Now that the events don't use wake_up() anymore, we need the nop tracer to poll waiting for events on the pipe. Especially because nop is useful to look at orphan traces types (traces types that don't rely on specific tracers) because it doesn't produce traces itself. And unlike other tracers that trigger specific traces periodically, nop triggers no traces by itself that can wake him. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1237759847-21025-5-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-23 09:22:15 +01:00
Frederic Weisbecker	07edf71213	tracing/events: don't use wake up for events Impact: fix hard-lockup with sched switch events Some ftrace events, such as sched wakeup, can be traced while the runqueue lock is hold. Since they are using trace_current_buffer_unlock_commit(), they call wake_up() which can try to grab the runqueue lock too, resulting in a deadlock. Now for all event, we call a new helper: trace_nowake_buffer_unlock_commit() which do pretty the same than trace_current_buffer_unlock_commit() except than it doesn't call trace_wake_up(). Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1237759847-21025-4-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-23 09:22:14 +01:00
Frederic Weisbecker	9bd7d099ab	tracing/events: make the filter files writable We need the filter files to be writable, the current filter file permissions are only set readable. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <1237759847-21025-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-23 09:22:14 +01:00
Ingo Molnar	fe9f57f250	tracing: add run-time field descriptions for event filtering, kfree fix Impact: fix potential kfree of random data in (rare) failure path Zero-initialize the field structure. Reported-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <1237710639.7703.46.camel@charm-linux> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-22 18:43:25 +01:00
Tom Zanussi	cfb180f3e7	tracing: add per-subsystem filtering This patch adds per-subsystem filtering to the event tracing subsystem. It adds a 'filter' debugfs file to each subsystem directory. This file can be written to to set filters; reading from it will display the current set of filters set for that subsystem. Basically what it does is propagate the filter down to each event contained in the subsystem. If a particular event doesn't have a field with the name specified in the filter, it simply doesn't get set for that event. You can verify whether or not the filter was set for a particular event by looking at the filter file for that event. As with per-event filters, compound expressions are supported, echoing '0' to the subsystem's filter file clears all filters in the subsystem, etc. Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1237710677.7703.49.camel@charm-linux> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-22 18:38:47 +01:00
Tom Zanussi	7ce7e42499	tracing: add per-event filtering This patch adds per-event filtering to the event tracing subsystem. It adds a 'filter' debugfs file to each event directory. This file can be written to to set filters; reading from it will display the current set of filters set for that event. Basically, any field listed in the 'format' file for an event can be filtered on (including strings, but not yet other array types) using either matching ('==') or non-matching ('!=') 'predicates'. A 'predicate' can be either a single expression: # echo pid != 0 > filter # cat filter pid != 0 or a compound expression of up to 8 sub-expressions combined using '&&' or '\|\|': # echo comm == Xorg > filter # echo "&& sig != 29" > filter # cat filter comm == Xorg && sig != 29 Only events having field values matching an expression will be available in the trace output; non-matching events are discarded. Note that a compound expression is built up by echoing each sub-expression separately - it's not the most efficient way to do things, but it keeps the parser simple and assumes that compound expressions will be relatively uncommon. In any case, a subsequent patch introducing a way to set filters for entire subsystems should mitigate any need to do this for lots of events. Setting a filter without an '&&' or '\|\|' clears the previous filter completely and sets the filter to the new expression: # cat filter comm == Xorg && sig != 29 # echo comm != Xorg # cat filter comm != Xorg To clear a filter, echo 0 to the filter file: # echo 0 > filter # cat filter none The limit of 8 predicates for a compound expression is arbitrary - for efficiency, it's implemented as an array of pointers to predicates, and 8 seemed more than enough for any filter... Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1237710665.7703.48.camel@charm-linux> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-22 18:38:46 +01:00
Tom Zanussi	2d622719f1	tracing: add ring_buffer_event_discard() to ring buffer This patch overloads RINGBUF_TYPE_PADDING to provide a way to discard events from the ring buffer, for the event-filtering mechanism introduced in a subsequent patch. I did the initial version but thanks to Steven Rostedt for adding the parts that actually made it work. ;-) Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-22 18:38:25 +01:00
Dmitri Vorobiev	b8b9426533	tracing: fix four sparse warnings Impact: cleanup. This patch fixes the following sparse warnings: kernel/trace/trace.c:385:9: warning: symbol 'trace_seq_to_buffer' was not declared. Should it be static? kernel/trace/trace_clock.c:29:13: warning: symbol 'trace_clock_local' was not declared. Should it be static? kernel/trace/trace_clock.c:54:13: warning: symbol 'trace_clock' was not declared. Should it be static? kernel/trace/trace_clock.c:74:13: warning: symbol 'trace_clock_global' was not declared. Should it be static? Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@movial.com> LKML-Reference: <1237741871-5827-4-git-send-email-dmitri.vorobiev@movial.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-22 18:16:54 +01:00
Dmitri Vorobiev	f80d2d7725	tracing, Text Edit Lock: Fix one sparse warning in kernel/extable.c Impact: cleanup. The global mutex text_mutex if declared in linux/memory.h, so this file needs to be included into kernel/extable.c, where the same mutex is defined. This fixes the following sparse warning: kernel/extable.c:32:1: warning: symbol 'text_mutex' was not declared. Should it be static? Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@movial.com> LKML-Reference: <1237741871-5827-3-git-send-email-dmitri.vorobiev@movial.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-03-22 18:16:20 +01:00

... 2 3 4 5 6 ...

6896 commits