Commit graph

61 commits

Author SHA1 Message Date
Hugh Dickins
365e9c87a9 [PATCH] mm: update_hiwaters just in time
update_mem_hiwater has attracted various criticisms, in particular from those
concerned with mm scalability.  Originally it was called whenever rss or
total_vm got raised.  Then many of those callsites were replaced by a timer
tick call from account_system_time.  Now Frank van Maarseveen reports that to
be found inadequate.  How about this?  Works for Frank.

Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
update_hiwater_rss and update_hiwater_vm.  Don't attempt to keep
mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
by 1): those are hot paths.  Do the opposite, update only when about to lower
rss (usually by many), or just before final accounting in do_exit.  Handle
mm->hiwater_vm in the same way, though it's much less of an issue.  Demand
that whoever collects these hiwater statistics do the work of taking the
maximum with rss or total_vm.

And there has been no collector of these hiwater statistics in the tree.  The
new convention needs an example, so match Frank's usage by adding a VmPeak
line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
(High-Water-Mark or High-Water-Memory).

There was a particular anomaly during mremap move, that hiwater_vm might be
captured too high.  A fleeting such anomaly remains, but it's quickly
corrected now, whereas before it would stick.

What locking?  None: if the app is racy then these statistics will be racy,
it's not worth any overhead to make them exact.  But whenever it suits,
hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
page_table_lock (for now) or with preemption disabled (later on): without
going to any trouble, minimize the time between reading current values and
updating, to minimize those occasions when a racing thread bumps a count up
and back down in between.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:39 -07:00
Andrew Morton
bb32051532 [PATCH] export cpu_online_map
With CONFIG_SMP=n:

*** Warning: "cpu_online_map" [drivers/firmware/dcdbas.ko] undefined!

due to set_cpus_allowed().

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-26 10:39:43 -07:00
Ingo Molnar
da04c03503 [PATCH] Fix spinlock owner debugging
fix up the runqueue lock owner only if we truly did a context-switch
with the runqueue lock held. Impacts ia64, mips, sparc64 and arm.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-13 09:59:04 -07:00
Linus Torvalds
1df5c10a5b Mark ia64-specific MCA/INIT scheduler hooks as dangerous
..and only enable them for ia64. The functions are only valid
when the whole system has been totally stopped and no scheduler
activity is ongoing on any CPU, and interrupts are globally
disabled.

In other words, they aren't useful for anything else. So make
sure that nobody can use them by mistake.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-12 07:59:21 -07:00
Keith Owens
a2a979821b [PATCH] MCA/INIT: scheduler hooks
Scheduler hooks to see/change which process is deemed to be on a cpu.

Signed-off-by: Keith Owens <kaos@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-09-11 14:01:30 -07:00
Siddha, Suresh B
0c117f1b4d [PATCH] sched: allow the load to grow upto its cpu_power
Don't pull tasks from a group if that would cause the group's total load to
drop below its total cpu_power (ie.  cause the group to start going idle).

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:24 -07:00
Siddha, Suresh B
fa3b6ddc3f [PATCH] sched: don't kick ALB in the presence of pinned task
Jack Steiner brought this issue at my OLS talk.

Take a scenario where two tasks are pinned to two HT threads in a physical
package.  Idle packages in the system will keep kicking migration_thread on
the busy package with out any success.

We will run into similar scenarios in the presence of CMP/NUMA.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:24 -07:00
Renaud Lienhart
5927ad78ec [PATCH] sched: use cached variable in sys_sched_yield()
In sys_sched_yield(), we cache current->array in the "array" variable, thus
there's no need to dereference "current" again later.

Signed-Off-By: Renaud Lienhart <renaud.lienhart@free.fr>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:23 -07:00
Nick Piggin
5969fe0618 [PATCH] sched: HT optimisation
If an idle sibling of an HT queue encounters a busy sibling, then make
higher level load balancing of the non-idle variety.

Performance of multiprocessor HT systems with low numbers of tasks
(generally < number of virtual CPUs) can be significantly worse than the
exact same workloads when running in non-HT mode.  The reason is largely
due to poor scheduling behaviour.

This patch improves the situation, making the performance gap far less
significant on one problematic test case (tbench).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:23 -07:00
Nick Piggin
e17224bf1d [PATCH] sched: less locking
During periodic load balancing, don't hold this runqueue's lock while
scanning remote runqueues, which can take a non trivial amount of time
especially on very large systems.

Holding the runqueue lock will only help to stabilise ->nr_running, however
this doesn't do much to help because tasks being woken will simply get held
up on the runqueue lock, so ->nr_running would not provide a really
accurate picture of runqueue load in that case anyway.

What's more, ->nr_running (and possibly the cpu_load averages) of remote
runqueues won't be stable anyway, so load balancing is always an inexact
operation.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:23 -07:00
Nick Piggin
d6d5cfaf45 [PATCH] sched: less newidle locking
Similarly to the earlier change in load_balance, only lock the runqueue in
load_balance_newidle if the busiest queue found has a nr_running > 1.  This
will reduce frequency of expensive remote runqueue lock aquisitions in the
schedule() path on some workloads.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:23 -07:00
Ingo Molnar
67f9a619e7 [PATCH] sched: fix SMT scheduler latency bug
William Weston reported unusually high scheduling latencies on his x86 HT
box, on the -RT kernel.  I managed to reproduce it on my HT box and the
latency tracer shows the incident in action:

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
      du-2803  3Dnh2    0us : __trace_start_sched_wakeup (try_to_wake_up)
        ..............................................................
        ... we are running on CPU#3, PID 2778 gets woken to CPU#1: ...
        ..............................................................
      du-2803  3Dnh2    0us : __trace_start_sched_wakeup <<...>-2778> (73 1)
      du-2803  3Dnh2    0us : _raw_spin_unlock (try_to_wake_up)
        ................................................
        ... still on CPU#3, we send an IPI to CPU#1: ...
        ................................................
      du-2803  3Dnh1    0us : resched_task (try_to_wake_up)
      du-2803  3Dnh1    1us : smp_send_reschedule (try_to_wake_up)
      du-2803  3Dnh1    1us : send_IPI_mask_bitmask (smp_send_reschedule)
      du-2803  3Dnh1    2us : _raw_spin_unlock_irqrestore (try_to_wake_up)
        ...............................................
        ... 1 usec later, the IPI arrives on CPU#1: ...
        ...............................................
  <idle>-0     1Dnh.    2us : smp_reschedule_interrupt (c0100c5a 0 0)

So far so good, this is the normal wakeup/preemption mechanism.  But here
comes the scheduler anomaly on CPU#1:

  <idle>-0     1Dnh.    2us : preempt_schedule_irq (need_resched)
  <idle>-0     1Dnh.    2us : preempt_schedule_irq (need_resched)
  <idle>-0     1Dnh.    3us : __schedule (preempt_schedule_irq)
  <idle>-0     1Dnh.    3us : profile_hit (__schedule)
  <idle>-0     1Dnh1    3us : sched_clock (__schedule)
  <idle>-0     1Dnh1    4us : _raw_spin_lock_irq (__schedule)
  <idle>-0     1Dnh1    4us : _raw_spin_lock_irqsave (__schedule)
  <idle>-0     1Dnh2    5us : _raw_spin_unlock (__schedule)
  <idle>-0     1Dnh1    5us : preempt_schedule (__schedule)
  <idle>-0     1Dnh1    6us : _raw_spin_lock (__schedule)
  <idle>-0     1Dnh2    6us : find_next_bit (__schedule)
  <idle>-0     1Dnh2    6us : _raw_spin_lock (__schedule)
  <idle>-0     1Dnh3    7us : find_next_bit (__schedule)
  <idle>-0     1Dnh3    7us : find_next_bit (__schedule)
  <idle>-0     1Dnh3    8us : _raw_spin_unlock (__schedule)
  <idle>-0     1Dnh2    8us : preempt_schedule (__schedule)
  <idle>-0     1Dnh2    8us : find_next_bit (__schedule)
  <idle>-0     1Dnh2    9us : trace_stop_sched_switched (__schedule)
  <idle>-0     1Dnh2    9us : _raw_spin_lock (trace_stop_sched_switched)
  <idle>-0     1Dnh3   10us : trace_stop_sched_switched <<...>-2778> (73 8c)
  <idle>-0     1Dnh3   10us : _raw_spin_unlock (trace_stop_sched_switched)
  <idle>-0     1Dnh1   10us : _raw_spin_unlock (__schedule)
  <idle>-0     1Dnh.   11us : local_irq_enable_noresched (preempt_schedule_irq)
  <idle>-0     1Dnh.   11us < (0)

we didnt pick up pid 2778! It only gets scheduled much later:

   <...>-2778  1Dnh2  412us : __switch_to (__schedule)
   <...>-2778  1Dnh2  413us : __schedule <<idle>-0> (8c 73)
   <...>-2778  1Dnh2  413us : _raw_spin_unlock (__schedule)
   <...>-2778  1Dnh1  413us : trace_stop_sched_switched (__schedule)
   <...>-2778  1Dnh1  414us : _raw_spin_lock (trace_stop_sched_switched)
   <...>-2778  1Dnh2  414us : trace_stop_sched_switched <<...>-2778> (73 1)
   <...>-2778  1Dnh2  414us : _raw_spin_unlock (trace_stop_sched_switched)
   <...>-2778  1Dnh1  415us : trace_stop_sched_switched (__schedule)

the reason for this anomaly is the following code in dependent_sleeper():

                /*
                 * If a user task with lower static priority than the
                 * running task on the SMT sibling is trying to schedule,
                 * delay it till there is proportionately less timeslice
                 * left of the sibling task to prevent a lower priority
                 * task from using an unfair proportion of the
                 * physical cpu's resources. -ck
                 */
[...]
                        if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) /
                                100) > task_timeslice(p)))
                                        ret = 1;

Note that in contrast to the comment above, we dont actually do the check
based on static priority, we do the check based on timeslices.  But
timeslices go up and down, and even highprio tasks can randomly have very
low timeslices (just before their next refill) and can thus be judged as
'lowprio' by the above piece of code.  This condition is clearly buggy.
The correct test is to check for static_prio _and_ to check for the
preemption priority.  Even on different static priority levels, a
higher-prio interactive task should not be delayed due to a
higher-static-prio CPU hog.

There is a symmetric bug in the 'kick SMT sibling' code of this function as
well, which can be solved in a similar way.

The patch below (against the current scheduler queue in -mm) fixes both
bugs.  I have build and boot-tested this on x86 SMT, and nice +20 tasks
still get properly throttled - so the dependent-sleeper logic is still in
action.

btw., these bugs pessimised the SMT scheduler because the 'delay wakeup'
property was applied too liberally, so this fix is likely a throughput
improvement as well.

I separated out a smt_slice() function to make the code easier to read.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:23 -07:00
Ingo Molnar
d79fc0fc66 [PATCH] sched: TASK_NONINTERACTIVE
This patch implements a task state bit (TASK_NONINTERACTIVE), which can be
used by blocking points to mark the task's wait as "non-interactive".  This
does not mean the task will be considered a CPU-hog - the wait will simply
not have an effect on the waiting task's priority - positive or negative
alike.  Right now only pipe_wait() will make use of it, because it's a
common source of not-so-interactive waits (kernel compilation jobs, etc.).

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:22 -07:00
Ingo Molnar
95cdf3b799 [PATCH] sched cleanups
whitespace cleanups.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:22 -07:00
M.Baris Demiray
da5a552270 [PATCH] sched: make idlest_group/cpu cpus_allowed-aware
Add relevant checks into find_idlest_group() and find_idlest_cpu() to make
them return only the groups that have allowed CPUs and allowed CPUs
respectively.

Signed-off-by: M.Baris Demiray <baris@labristeknoloji.com>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:22 -07:00
Con Kolivas
fc38ed7531 [PATCH] sched: run SCHED_NORMAL tasks with real time tasks on SMT siblings
The hyperthread aware nice handling currently puts to sleep any non real
time task when a real time task is running on its sibling cpu.  This can
lead to prolonged starvation by having the non real time task pegged to the
cpu with load balancing not pulling that task away.

Currently we force lower priority hyperthread tasks to run a percentage of
time difference based on timeslice differences which is meaningless when
comparing real time tasks to SCHED_NORMAL tasks.  We can allow non real
time tasks to run with real time tasks on the sibling up to per_cpu_gain%
if we use jiffies as a counter.

Cleanups and micro-optimisations to the relevant code section should make
it more understandable as well.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:22 -07:00
Ingo Molnar
fb1c8f93d8 [PATCH] spinlock consolidation
This patch (written by me and also containing many suggestions of Arjan van
de Ven) does a major cleanup of the spinlock code.  It does the following
things:

 - consolidates and enhances the spinlock/rwlock debugging code

 - simplifies the asm/spinlock.h files

 - encapsulates the raw spinlock type and moves generic spinlock
   features (such as ->break_lock) into the generic code.

 - cleans up the spinlock code hierarchy to get rid of the spaghetti.

Most notably there's now only a single variant of the debugging code,
located in lib/spinlock_debug.c.  (previously we had one SMP debugging
variant per architecture, plus a separate generic one for UP builds)

Also, i've enhanced the rwlock debugging facility, it will now track
write-owners.  There is new spinlock-owner/CPU-tracking on SMP builds too.
All locks have lockup detection now, which will work for both soft and hard
spin/rwlock lockups.

The arch-level include files now only contain the minimally necessary
subset of the spinlock code - all the rest that can be generalized now
lives in the generic headers:

 include/asm-i386/spinlock_types.h       |   16
 include/asm-x86_64/spinlock_types.h     |   16

I have also split up the various spinlock variants into separate files,
making it easier to see which does what. The new layout is:

   SMP                         |  UP
   ----------------------------|-----------------------------------
   asm/spinlock_types_smp.h    |  linux/spinlock_types_up.h
   linux/spinlock_types.h      |  linux/spinlock_types.h
   asm/spinlock_smp.h          |  linux/spinlock_up.h
   linux/spinlock_api_smp.h    |  linux/spinlock_api_up.h
   linux/spinlock.h            |  linux/spinlock.h

/*
 * here's the role of the various spinlock/rwlock related include files:
 *
 * on SMP builds:
 *
 *  asm/spinlock_types.h: contains the raw_spinlock_t/raw_rwlock_t and the
 *                        initializers
 *
 *  linux/spinlock_types.h:
 *                        defines the generic type and initializers
 *
 *  asm/spinlock.h:       contains the __raw_spin_*()/etc. lowlevel
 *                        implementations, mostly inline assembly code
 *
 *   (also included on UP-debug builds:)
 *
 *  linux/spinlock_api_smp.h:
 *                        contains the prototypes for the _spin_*() APIs.
 *
 *  linux/spinlock.h:     builds the final spin_*() APIs.
 *
 * on UP builds:
 *
 *  linux/spinlock_type_up.h:
 *                        contains the generic, simplified UP spinlock type.
 *                        (which is an empty structure on non-debug builds)
 *
 *  linux/spinlock_types.h:
 *                        defines the generic type and initializers
 *
 *  linux/spinlock_up.h:
 *                        contains the __raw_spin_*()/etc. version of UP
 *                        builds. (which are NOPs on non-debug, non-preempt
 *                        builds)
 *
 *   (included on UP-non-debug builds:)
 *
 *  linux/spinlock_api_up.h:
 *                        builds the _spin_*() APIs.
 *
 *  linux/spinlock.h:     builds the final spin_*() APIs.
 */

All SMP and UP architectures are converted by this patch.

arm, i386, ia64, ppc, ppc64, s390/s390x, x64 was build-tested via
crosscompilers.  m32r, mips, sh, sparc, have not been tested yet, but should
be mostly fine.

From: Grant Grundler <grundler@parisc-linux.org>

  Booted and lightly tested on a500-44 (64-bit, SMP kernel, dual CPU).
  Builds 32-bit SMP kernel (not booted or tested).  I did not try to build
  non-SMP kernels.  That should be trivial to fix up later if necessary.

  I converted bit ops atomic_hash lock to raw_spinlock_t.  Doing so avoids
  some ugly nesting of linux/*.h and asm/*.h files.  Those particular locks
  are well tested and contained entirely inside arch specific code.  I do NOT
  expect any new issues to arise with them.

 If someone does ever need to use debug/metrics with them, then they will
  need to unravel this hairball between spinlocks, atomic ops, and bit ops
  that exist only because parisc has exactly one atomic instruction: LDCW
  (load and clear word).

From: "Luck, Tony" <tony.luck@intel.com>

   ia64 fix

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjanv@infradead.org>
Signed-off-by: Grant Grundler <grundler@parisc-linux.org>
Cc: Matthew Wilcox <willy@debian.org>
Signed-off-by: Hirokazu Takata <takata@linux-m32r.org>
Signed-off-by: Mikael Pettersson <mikpe@csd.uu.se>
Signed-off-by: Benoit Boissinot <benoit.boissinot@ens-lyon.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:21 -07:00
Chen, Kenneth W
383f2835eb [PATCH] Prefetch kernel stacks to speed up context switch
For architecture like ia64, the switch stack structure is fairly large
(currently 528 bytes).  For context switch intensive application, we found
that significant amount of cache misses occurs in switch_to() function.
The following patch adds a hook in the schedule() function to prefetch
switch stack structure as soon as 'next' task is determined.  This allows
maximum overlap in prefetch cache lines for that structure.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 13:57:31 -07:00
Linus Torvalds
0dd7f883a9 Merge branch 'upstream' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/misc-2.6 2005-09-07 17:28:25 -07:00
John Hawkes
d1b551386a [PATCH] cpusets: fix the "dynamic sched domains" bug
For a NUMA system with multiple CPUs per node, declaring a cpu-exclusive
cpuset that includes only some, but not all, of the CPUs in a node will mangle
the sched domain structures.

Signed-off-by: John Hawkes <hawkes@sgi.com>
Cc; Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:41 -07:00
John Hawkes
9c1cfda20a [PATCH] cpusets: Move the ia64 domain setup code to the generic code
Signed-off-by: John Hawkes <hawkes@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:40 -07:00
Jeff Garzik
344babaa9d [kernel-doc] fix various DocBook build problems/warnings
Most serious is fixing include/sound/pcm.h, which breaks the DocBook
build.

The other stuff is just filling in things that cause warnings.
2005-09-07 01:15:17 -04:00
Matt Mackall
024f474795 [PATCH] Make RLIMIT_NICE ranges consistent with getpriority(2)
As suggested by Michael Kerrisk <mtk-manpages@gmx.net>, make RLIMIT_NICE
consistent with getpriority before it becomes available in released glibc.

Signed-off-by: Matt Mackall <mpm@selenic.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-18 12:53:58 -07:00
Steven Rostedt
d46523ea32 [PATCH] fix MAX_USER_RT_PRIO and MAX_RT_PRIO
Here's the patch again to fix the code to handle if the values between
MAX_USER_RT_PRIO and MAX_RT_PRIO are different.

Without this patch, an SMP system will crash if the values are
different.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Dean Nelson <dcn@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26 15:40:00 -07:00
Andreas Steinmetz
18586e7216 [PATCH] Fix RLIMIT_RTPRIO breakage
RLIMIT_RTPRIO is supposed to grant non privileged users the right to use
SCHED_FIFO/SCHED_RR scheduling policies with priorites bounded by the
RLIMIT_RTPRIO value via sched_setscheduler(). This is usually used by
audio users.

Unfortunately this is broken in 2.6.13rc3 as you can see in the excerpt
from sched_setscheduler below:

        /*
         * Allow unprivileged RT tasks to decrease priority:
         */
        if (!capable(CAP_SYS_NICE)) {
                /* can't change policy */
                if (policy != p->policy)
                        return -EPERM;

After the above unconditional test which causes sched_setscheduler to
fail with no regard to the RLIMIT_RTPRIO value the following check is made:

               /* can't increase priority */
                if (policy != SCHED_NORMAL &&
                    param->sched_priority > p->rt_priority &&
                    param->sched_priority >
                                p->signal->rlim[RLIMIT_RTPRIO].rlim_cur)
                        return -EPERM;

Thus I do believe that the RLIMIT_RTPRIO value must be taken into
account for the policy check, especially as the RLIMIT_RTPRIO limit is
of no use without this change.

The attached patch fixes this problem.

Signed-off-by: Andreas Steinmetz <ast@domdv.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26 15:30:51 -07:00
Ingo Molnar
5bbcfd9000 [PATCH] cond_resched(): fix bogus might_sleep() warning
The BKS might be reacquired before we have dropped PREEMPT_ACTIVE, which
could trigger a second could trigger a second cond_resched() call.  Bug
found by Hirofumi Ogawa.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07 18:23:47 -07:00
Ingo Molnar
f340c0d1a3 [PATCH] Tweak idle thread setup semantics
This patch tweaks idle thread setup semantics a bit: instead of setting
NEED_RESCHED in init_idle(), we do an explicit schedule() before calling
into cpu_idle().

This patch, while having no negative side-effects, enables wider use of
cond_resched()s.  (which might happen in the stock kernel too, but it's
particulary important for voluntary-preempt)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28 14:56:51 -07:00
Jens Axboe
22e2c507c3 [PATCH] Update cfq io scheduler to time sliced design
This updates the CFQ io scheduler to the new time sliced design (cfq
v3).  It provides full process fairness, while giving excellent
aggregate system throughput even for many competing processes.  It
supports io priorities, either inherited from the cpu nice value or set
directly with the ioprio_get/set syscalls.  The latter closely mimic
set/getpriority.

This import is based on my latest from -mm.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-27 14:33:29 -07:00
Linus Torvalds
2031d0f586 Merge Christoph's freeze cleanup patch 2005-06-25 17:16:53 -07:00
Christoph Lameter
3e1d1d28d9 [PATCH] Cleanup patch for process freezing
1. Establish a simple API for process freezing defined in linux/include/sched.h:

   frozen(process)		Check for frozen process
   freezing(process)		Check if a process is being frozen
   freeze(process)		Tell a process to freeze (go to refrigerator)
   thaw_process(process)	Restart process
   frozen_process(process)	Process is frozen now

2. Remove all references to PF_FREEZE and PF_FROZEN from all
   kernel sources except sched.h

3. Fix numerous locations where try_to_freeze is manually done by a driver

4. Remove the argument that is no longer necessary from two function calls.

5. Some whitespace cleanup

6. Clear potential race in refrigerator (provides an open window of PF_FREEZE
   cleared before setting PF_FROZEN, recalc_sigpending does not check
   PF_FROZEN).

This patch does not address the problem of freeze_processes() violating the rule
that a task may only modify its own flags by setting PF_FREEZE. This is not clean
in an SMP environment. freeze(process) is therefore not SMP safe!

Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 17:10:13 -07:00
Dinakar Guniguntala
1a20ff27ef [PATCH] Dynamic sched domains: sched changes
The following patches add dynamic sched domains functionality that was
extensively discussed on lkml and lse-tech.  I would like to see this added to
-mm

o The main advantage with this feature is that it ensures that the scheduler
  load balacing code only balances against the cpus that are in the sched
  domain as defined by an exclusive cpuset and not all of the cpus in the
  system. This removes any overhead due to load balancing code trying to
  pull tasks outside of the cpu exclusive cpuset only to be prevented by
  the tasks' cpus_allowed mask.
o cpu exclusive cpusets are useful for servers running orthogonal
  workloads such as RT applications requiring low latency and HPC
  applications that are throughput sensitive

o It provides a new API partition_sched_domains in sched.c
  that makes dynamic sched domains possible.
o cpu_exclusive cpusets sets are now associated with a sched domain.
  Which means that the users can dynamically modify the sched domains
  through the cpuset file system interface
o ia64 sched domain code has been updated to support this feature as well
o Currently, this does not support hotplug. (However some of my tests
  indicate hotplug+preempt is currently broken)
o I have tested it extensively on x86.
o This should have very minimal impact on performance as none of
  the fast paths are affected

Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Matthew Dobson <colpatch@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:45 -07:00
Olivier Croquette
37e4ab3f0c [PATCH] Changing RT priority without CAP_SYS_NICE
Presently, a process without the capability CAP_SYS_NICE can not change
its own policy, which is OK.

But it can also not decrease its RT priority (if scheduled with policy
SCHED_RR or SCHED_FIFO), which is what this patch changes.

The rationale is the same as for the nice value: a process should be
able to require less priority for itself. Increasing the priority is
still not allowed.

This is for example useful if you give a multithreaded user process a RT
priority, and the process would like to organize its internal threads
using priorities also. Then you can give the process the highest
priority needed N, and the process starts its threads with lower
priorities: N-1, N-2...

The POSIX norm says that the permissions are implementation specific, so
I think we can do that.

In a sense, it makes the permissions consistent whatever the policy is:
with this patch, process scheduled by SCHED_FIFO, SCHED_RR and
SCHED_OTHER can all decrease their priority.

From: Ingo Molnar <mingo@elte.hu>

cleaned up and merged to -mm.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:44 -07:00
Chen Shang
a3464a102a [PATCH] sched: micro-optimize task requeueing in schedule()
micro-optimize task requeueing in schedule() & clean up recalc_task_prio().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:44 -07:00
Nick Piggin
77391d7168 [PATCH] sched: relax pinned balancing
The maximum rebalance interval allowed by the multiprocessor balancing
backoff is often not large enough to handle corner cases where there are
lots of tasks pinned on a CPU.  Suresh reported:

	I see system livelock's if for example I have 7000 processes
	pinned onto one cpu (this is on the fastest 8-way system I
	have access to).

After this patch, the machine is reported to go well above this number.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:44 -07:00
Nick Piggin
476d139c21 [PATCH] sched: consolidate sbe sbf
Consolidate balance-on-exec with balance-on-fork.  This is made easy by the
sched-domains RCU patches.

As well as the general goodness of code reduction, this allows the runqueues
to be unlocked during balance-on-fork.

schedstats is a problem.  Maybe just have balance-on-event instead of
distinguishing fork and exec?

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:44 -07:00
Nick Piggin
674311d5b4 [PATCH] sched: RCU domains
One of the problems with the multilevel balance-on-fork/exec is that it needs
to jump through hoops to satisfy sched-domain's locking semantics (that is,
you may traverse your own domain when not preemptable, and you may traverse
others' domains when holding their runqueue lock).

balance-on-exec had to potentially migrate between more than one CPU before
finding a final CPU to migrate to, and balance-on-fork needed to potentially
take multiple runqueue locks.

So bite the bullet and make sched-domains go completely RCU.  This actually
simplifies the code quite a bit.

From: Ingo Molnar <mingo@elte.hu>

schedstats RCU fix, and a nice comment on for_each_domain, from Ingo.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:44 -07:00
Nick Piggin
3dbd534207 [PATCH] sched: multilevel sbe sbf
The fundamental problem that Suresh has with balance on exec and fork is that
it only tries to balance the top level domain with the flag set.

This was worked around by removing degenerate domains, but is still a problem
if people want to start using more complex sched-domains, especially
multilevel NUMA that ia64 is already using.

This patch makes balance on fork and exec try balancing over not just the top
most domain with the flag set, but all the way down the domain tree.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:43 -07:00
Suresh Siddha
245af2c787 [PATCH] sched: remove degenerate domains
Remove degenerate scheduler domains during the sched-domain init.

For example on x86_64, we always have NUMA configured in.  On Intel EM64T
systems, top most sched domain will be of NUMA and with only one sched_group
in it.

With fork/exec balances(recent Nick's fixes in -mm tree), we always endup
taking wrong decisions because of this topmost domain (as it contains only one
group and find_idlest_group always returns NULL).  We will endup loading HT
package completely first, letting active load balance kickin and correct it.

In general, this patch also makes sense with out recent Nick's fixes in -mm.

From: Nick Piggin <nickpiggin@yahoo.com.au>

Modified to account for more than just sched_groups when scanning for
degenerate domains by Nick Piggin.  And allow a runqueue's sd to go NULL
rather than keep a single degenerate domain around (this happens when you run
with maxcpus=1).

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:43 -07:00
Nick Piggin
41c7ce9ad9 [PATCH] sched: null domains
Fix the last 2 places that directly access a runqueue's sched-domain and
assume it cannot be NULL.

That allows the use of NULL for domain, instead of a dummy domain, to signify
no balancing is to happen.  No functional changes.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:43 -07:00
Nick Piggin
4866cde064 [PATCH] sched: cleanup context switch locking
Instead of requiring architecture code to interact with the scheduler's
locking implementation, provide a couple of defines that can be used by the
architecture to request runqueue unlocked context switches, and ask for
interrupts to be enabled over the context switch.

Also replaces the "switch_lock" used by these architectures with an oncpu
flag (note, not a potentially slow bitflag).  This eliminates one bus
locked memory operation when context switching, and simplifies the
task_running function.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:43 -07:00
Ingo Molnar
48c08d3f8f [PATCH] sched: uninline task_timeslice
"Chen, Kenneth W" <kenneth.w.chen@intel.com>

uninline task_timeslice() - reduces code footprint noticeably, and it's
slowpath code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:43 -07:00
Nick Piggin
68767a0ae4 [PATCH] sched: schedstats update for balance on fork
Add SCHEDSTAT statistics for sched-balance-fork.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:42 -07:00
Nick Piggin
147cbb4bbe [PATCH] sched: balance on fork
Reimplement the balance on exec balancing to be sched-domains aware.  Use this
to also do balance on fork balancing.  Make x86_64 do balance on fork over the
NUMA domain.

The problem that the non sched domains aware blancing became apparent on dual
core, multi socket opterons.  What we want is for the new tasks to be sent to
a different socket, but more often than not, we would first load up our
sibling core, or fill two cores of a single remote socket before selecting a
new one.

This gives large improvements to STREAM on such systems.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:42 -07:00
Nick Piggin
cafb20c1f9 [PATCH] sched: no aggressive idle balancing
Remove the very aggressive idle stuff that has recently gone into 2.6 - it is
going against the direction we are trying to go.  Hopefully we can regain
performance through other methods.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:42 -07:00
Nick Piggin
a3f21bce1f [PATCH] sched: tweak affine wakeups
Do less affine wakeups.  We're trying to reduce dbt2-pgsql idle time
regressions here...  make sure we don't don't move tasks the wrong way in an
imbalance condition.  Also, remove the cache coldness requirement from the
calculation - this seems to induce sharp cutoff points where behaviour will
suddenly change on some workloads if the load creeps slightly over or under
some point.  It is good for periodic balancing because in that case have
otherwise have no other context to determine what task to move.

But also make a minor tweak to "wake balancing" - the imbalance tolerance is
now set at half the domain's imbalance, so we get the opportunity to do wake
balancing before the more random periodic rebalancing gets preformed.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:41 -07:00
Nick Piggin
7897986bad [PATCH] sched: balance timers
Do CPU load averaging over a number of different intervals.  Allow each
interval to be chosen by sending a parameter to source_load and target_load.
0 is instantaneous, idx > 0 returns a decaying average with the most recent
sample weighted at 2^(idx-1).  To a maximum of 3 (could be easily increased).

So generally a higher number will result in more conservative balancing.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:41 -07:00
Nick Piggin
99b61ccf0b [PATCH] sched: less aggressive idle balancing
Remove the special casing for idle CPU balancing.  Things like this are
hurting for example on SMT, where are single sibling being idle doesn't really
warrant a really aggressive pull over the NUMA domain, for example.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:41 -07:00
Nick Piggin
db935dbd43 [PATCH] sched: add debugging
These conditions should now be impossible, and we need to fix them if they
happen.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:41 -07:00
Nick Piggin
3950745131 [PATCH] sched: fix SMT scheduling problems
SMT balancing has a couple of problems.  Firstly, active_load_balance is too
complex - basically it should be a dumb helper for when the periodic balancer
has determined there is an imbalance, but gets stuck because the task is
running.

So rip out all its "smarts", and just make it move one task to the target CPU.

Second, the busy CPU's sched-domain tree was being used for active balancing.
This means that it may not see that nr_balance_failed has reached a critical
level.  So use the target CPU's sched-domain tree for this.  We can do this
because we hold its runqueue lock.

Lastly, reset nr_balance_failed to a point where we allow cache hot migration.
This will help ensure active load balancing is successful.

Thanks to Suresh Siddha for pointing out these issues.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:41 -07:00
Nick Piggin
16cfb1c04c [PATCH] sched: reduce active load balancing
Fix up active load balancing a bit so it doesn't get called when it shouldn't.
Reset the nr_balance_failed counter at more points where we have found
conditions to be balanced.  This reduces too aggressive active balancing seen
on some workloads.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:40 -07:00