kernel-fxtec-pro1x/drivers
Ingo Molnar 0888f06ac9 [PATCH] sched: fix bad missed wakeups in the i386, x86_64, ia64, ACPI and APM idle code
Fernando Lopez-Lezcano reported frequent scheduling latencies and audio
xruns starting at the 2.6.18-rt kernel, and those problems persisted all
the way up to current -rt kernels. The latencies were serious and
unjustified by system load, often in the milliseconds range.

After a patient and heroic multi-month effort by Fernando, in which he
tested dozens of kernels, tried various configs, boot options and
test-patches of mine, and provided latency traces of those incidents, the
following 'smoking gun' trace was captured by him:

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup (try_to_wake_up)
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup <<...>-5856> (37 0)
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup (c01262ba 0 0)
  IRQ_19-1479  1D..1    0us : resched_task (try_to_wake_up)
  IRQ_19-1479  1D..1    0us : __spin_unlock_irqrestore (try_to_wake_up)
  ...
  <idle>-0     1...1   11us!: default_idle (cpu_idle)
  ...
  <idle>-0     0Dn.1  602us : smp_apic_timer_interrupt (c0103baf 1 0)
  ...
   <...>-5856  0D..2  618us : __switch_to (__schedule)
   <...>-5856  0D..2  618us : __schedule <<idle>-0> (20 162)
   <...>-5856  0D..2  619us : __spin_unlock_irq (__schedule)
   <...>-5856  0...1  619us : trace_stop_sched_switched (__schedule)
   <...>-5856  0D..1  619us : trace_stop_sched_switched <<...>-5856> (37 0)

what is visible in this trace is that CPU#1 ran try_to_wake_up() for
PID:5856, placed PID:5856 on CPU#0's runqueue and ran resched_task()
for CPU#0. But it decided not to send an IPI to that CPU, due to
TS_POLLING. CPU#0 then never woke up after its NEED_RESCHED bit was set,
and only rescheduled to PID:5856 upon the next lapic timer IRQ. The
result was a 600+ usecs latency and a missed wakeup!

the bug turned out to be an idle-wakeup bug introduced into the mainline
kernel this summer via an optimization in the x86_64 tree:

    commit 495ab9c045
    Author: Andi Kleen <ak@suse.de>
    Date:   Mon Jun 26 13:59:11 2006 +0200

    [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status

    During some profiling I noticed that default_idle causes a lot of
    memory traffic. I think that is caused by the atomic operations
    to clear/set the polling flag in thread_info. There is actually
    no reason to make this atomic - only the idle thread does it
    to itself, other CPUs only read it. So I moved it into ti->status.

the problem is this type of change:

        if (!hlt_counter && boot_cpu_data.hlt_works_ok) {
-               clear_thread_flag(TIF_POLLING_NRFLAG);
+               current_thread_info()->status &= ~TS_POLLING;
                smp_mb__after_clear_bit();
                while (!need_resched()) {
                        local_irq_disable();

this changes clear_thread_flag() to an explicit clearing of TS_POLLING.
clear_thread_flag() is defined as:

        clear_bit(flag, &ti->flags);

and clear_bit() is a LOCK-ed atomic instruction on all x86 platforms:

  static inline void clear_bit(int nr, volatile unsigned long * addr)
  {
          __asm__ __volatile__( LOCK_PREFIX
                  "btrl %1,%0"

hence smp_mb__after_clear_bit() is defined as a simple compiler barrier:

  #define smp_mb__after_clear_bit()       barrier()

but the explicit TS_POLLING clearing introduced by the patch:

+               current_thread_info()->status &= ~TS_POLLING;

is not an atomic op! So the clearing of the TS_POLLING bit is freely
reorderable with the reading of the NEED_RESCHED bit - and the two now
reside at different memory addresses.
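
To make the difference between the two kinds of barrier concrete, here is
a small user-space sketch. It is not kernel code: the DEMO_* bit values
and the demo_status/demo_flags variables are hypothetical stand-ins for
ti->status and ti->flags, and __sync_synchronize() (a GCC/Clang builtin)
is assumed here as a stand-in for smp_mb(). Run single-threaded, both
functions behave the same; the difference only matters when another CPU
is watching the two words.

```c
/* Compiler-only barrier: constrains the compiler, emits no instruction.
 * This mirrors what barrier() amounts to on GCC-compatible compilers. */
#define compiler_barrier()  __asm__ __volatile__("" ::: "memory")

/* Full barrier: user-space stand-in for smp_mb() (assumption: GCC builtin). */
#define full_barrier()      __sync_synchronize()

/* Hypothetical bit values, chosen only for this sketch. */
#define DEMO_POLLING        0x4UL
#define DEMO_NEED_RESCHED   0x8UL

static volatile unsigned long demo_status = DEMO_POLLING; /* models ti->status */
static volatile unsigned long demo_flags;                 /* models ti->flags  */

/* Buggy pattern: plain store, compiler barrier, then load.
 * The CPU may still satisfy the load before the store becomes visible. */
unsigned long clear_polling_then_test_weak(void)
{
        demo_status &= ~DEMO_POLLING;
        compiler_barrier();
        return demo_flags & DEMO_NEED_RESCHED;
}

/* Fixed pattern: the full barrier orders the store before the load. */
unsigned long clear_polling_then_test_strong(void)
{
        demo_status &= ~DEMO_POLLING;
        full_barrier();
        return demo_flags & DEMO_NEED_RESCHED;
}
```

With the LOCK-prefixed clear_bit() the weak variant was sufficient,
because the locked instruction itself acted as the hardware barrier; with
a plain read-modify-write it no longer is.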

CPU idle wakeup very much depends on ordered memory ops: the clearing of
the TS_POLLING flag must always be done before we test need_resched()
and hit the idle instruction(s). [Symmetrically, the wakeup code needs
to set NEED_RESCHED before it tests the TS_POLLING flag, so memory
ordering is paramount.]
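
The two-sided ordering rule can be modelled in user space with C11
atomics. This is only a sketch under stated assumptions: ts_polling and
need_resched_flag are hypothetical stand-ins for the real thread_info
bits, the function names are invented for illustration, and
atomic_thread_fence(memory_order_seq_cst) plays the role of smp_mb(). A
missed wakeup is the case where the idle side decides to halt while the
waker decides to skip the IPI.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for TS_POLLING and NEED_RESCHED. */
static atomic_int ts_polling = 1;
static atomic_int need_resched_flag = 0;

static void reset_flags(void)
{
        atomic_store(&ts_polling, 1);
        atomic_store(&need_resched_flag, 0);
}

/* Idle side: clear polling, full fence, then test need-resched.
 * Returns true if the CPU would go on to halt. */
static bool idle_may_halt(void)
{
        atomic_store_explicit(&ts_polling, 0, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        return atomic_load_explicit(&need_resched_flag,
                                    memory_order_relaxed) == 0;
}

/* Waker side: set need-resched, full fence, then test polling.
 * Returns true if an IPI has to be sent. */
static bool waker_must_send_ipi(void)
{
        atomic_store_explicit(&need_resched_flag, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        return atomic_load_explicit(&ts_polling,
                                    memory_order_relaxed) == 0;
}

/* Idle side runs first: it halts, so the waker must send the IPI. */
bool missed_wakeup_idle_first(void)
{
        reset_flags();
        bool halted = idle_may_halt();
        bool ipi = waker_must_send_ipi();
        return halted && !ipi;
}

/* Waker runs first: it skips the IPI, so the idle side must not halt. */
bool missed_wakeup_waker_first(void)
{
        reset_flags();
        bool ipi = waker_must_send_ipi();
        bool halted = idle_may_halt();
        return halted && !ipi;
}
```

Both helper functions return false. With the fences in place, the same
guarantee holds when the two sides run concurrently on different CPUs:
at least one side is bound to observe the other's store. Without the
fence on the idle side, the idle CPU could read a stale need-resched
value while the waker still sees the polling flag set.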

Fernando's dual-core Athlon64 system has a sufficiently advanced memory
ordering model that it triggered this scenario very often.

( And it also turned out that the reason why these latencies never
  triggered on my test systems is that I routinely use idle=poll, which
  was the only idle variant not affected by this bug. )

The fix is to change the smp_mb__after_clear_bit() to an smp_mb(), to
act as an absolute barrier between the TS_POLLING write and the
NEED_RESCHED read. This affects almost all idling methods (default,
ACPI, APM), on all 3 x86 architectures: i386, x86_64, ia64.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:51 -08:00
..
acorn [PATCH] getting rid of all casts of k[cmz]alloc() calls 2006-12-13 09:05:58 -08:00
acpi [PATCH] sched: fix bad missed wakeups in the i386, x86_64, ia64, ACPI and APM idle code 2006-12-22 08:55:51 -08:00
amba [ARM] Fix __must_check warnings in drivers/bus/amba.c 2006-11-30 14:04:49 +00:00
ata [libata] sata_svw, sata_vsc: kill iomem warnings 2006-12-20 14:37:04 -05:00
atm [PATCH] getting rid of all casts of k[cmz]alloc() calls 2006-12-13 09:05:58 -08:00
base [PATCH] fix kernel-doc warnings in 2.6.20-rc1 2006-12-22 08:55:47 -08:00
block [PATCH] fix aoe without scatter-gather [Bug 7662] 2006-12-22 08:55:49 -08:00
bluetooth [PATCH] bluetooth: add support for another Kensington dongle 2006-12-20 11:29:29 -08:00
cdrom Merge branch 'for-linus' of git://brick.kernel.dk/data/git/linux-2.6-block 2006-12-21 00:03:38 -08:00
char [PATCH] tlclk: delete unnecessary sysfs_remove_group 2006-12-22 08:55:50 -08:00
clocksource [PATCH] clocksource: small cleanup 2006-12-10 09:57:22 -08:00
connector [CONNECTOR]: Replace delayed work with usual work queue. 2006-12-18 01:53:58 -08:00
cpufreq [CPUFREQ] fixes typo in cpufreq.c 2006-12-13 10:11:25 -05:00
crypto [PATCH] geode crypto is PCI device 2006-12-10 09:55:40 -08:00
dio
dma [PATCH] slab: remove SLAB_KERNEL 2006-12-07 08:39:24 -08:00
edac [PATCH] Add include/linux/freezer.h and move definitions from sched.h 2006-12-07 08:39:27 -08:00
eisa
fc4 [PATCH] getting rid of all casts of k[cmz]alloc() calls 2006-12-13 09:05:58 -08:00
firmware [PATCH] dell_rbu: fix error check 2006-11-16 11:43:38 -08:00
hid input/hid: Supporting more keys from the HUT Consumer Page 2006-12-14 13:37:24 +01:00
hwmon hwmon: New AMS hardware monitoring driver 2006-12-12 18:18:30 +01:00
i2c Merge branch 'hwmon-for-linus' of git://jdelvare.pck.nerim.net/jdelvare-2.6 2006-12-13 09:13:19 -08:00
ide PCI: ATI sb600 sata quirk 2006-12-20 10:54:44 -08:00
ieee1394 i2c: Discard the i2c algo del_bus wrappers 2006-12-10 21:21:33 +01:00
infiniband IB/mthca: Use DEFINE_MUTEX() instead of mutex_init() 2006-12-15 20:55:28 -08:00
input [SUNKBD]: Fix sunkbd_enable(sunkbd, 0); obvious. 2006-12-17 14:06:58 -08:00
isdn [PATCH] getting rid of all casts of k[cmz]alloc() calls 2006-12-13 09:05:58 -08:00
kvm [PATCH] KVM: API versioning 2006-12-22 08:55:46 -08:00
leds kconfig: Standardize "depends" -> "depends on" in Kconfig files 2006-12-12 20:04:19 +01:00
macintosh [PATCH] getting rid of all casts of k[cmz]alloc() calls 2006-12-13 09:05:58 -08:00
mca
md [PATCH] md: fix a few problems with the interface (sysfs and ioctl) to md 2006-12-22 08:55:51 -08:00
media [PATCH] getting rid of all casts of k[cmz]alloc() calls 2006-12-13 09:05:58 -08:00
message [PATCH] fix kernel-doc warnings in 2.6.20-rc1 2006-12-22 08:55:47 -08:00
mfd [PATCH] Add include/linux/freezer.h and move definitions from sched.h 2006-12-07 08:39:27 -08:00
misc [PATCH] tifm: fix NULL ptr and style 2006-12-07 08:39:33 -08:00
mmc AT91 MMC update for 2.6.19 2006-12-11 12:43:35 +01:00
mtd [PATCH] getting rid of all casts of k[cmz]alloc() calls 2006-12-13 09:05:58 -08:00
net [PATCH] smc911 workqueue fixes 2006-12-22 08:55:48 -08:00
nubus
oprofile [PATCH] struct path: convert oprofile 2006-12-08 08:28:48 -08:00
parisc [PATCH] getting rid of all casts of k[cmz]alloc() calls 2006-12-13 09:05:58 -08:00
parport [PATCH] Kconfig refactoring for better menu nesting 2006-12-10 09:55:39 -08:00
pci [PATCH] increase CARDBUS_MEM_SIZE 2006-12-22 08:55:51 -08:00
pcmcia [PATCH] Fix numerous kcalloc() calls, convert to kzalloc() 2006-12-13 09:05:52 -08:00
pnp [PATCH] getting rid of all casts of k[cmz]alloc() calls 2006-12-13 09:05:58 -08:00
ps3 [POWERPC] ps3: Add vuart support 2006-12-11 13:49:53 +11:00
rapidio
rtc [PATCH] rtc framewok: rtc_wkalrm.enabled reporting updates 2006-12-13 09:05:52 -08:00
s390 [S390] cio: css_register_subchannel race. 2006-12-15 17:18:30 +01:00
sbus [PATCH] getting rid of all casts of k[cmz]alloc() calls 2006-12-13 09:05:58 -08:00
scsi [PATCH] Remove queue merging hooks 2006-12-19 08:33:11 +01:00
serial [PATCH] Add support for Korenix 16C950-based PCI cards 2006-12-13 09:18:11 -08:00
sh
sn
spi [PATCH] fix s3c24xx gpio driver (include linux/workqueue.h) 2006-12-22 08:55:51 -08:00
tc [PATCH] tty: switch to ktermios 2006-12-08 08:28:57 -08:00
telephony [PATCH] struct path: convert ixj 2006-12-08 08:28:46 -08:00
usb USB Storage: remove duplicate Nokia entry in unusual_devs.h 2006-12-20 11:46:03 -08:00
video [PATCH] gxt4500: Fix colormap and PLL setting, support GXT6000P 2006-12-22 08:55:50 -08:00
w1 [PATCH] w1: Fix for kconfig entry typo 2006-12-13 09:05:48 -08:00
zorro [PATCH] struct path: convert zorro 2006-12-08 08:28:50 -08:00
Kconfig [PATCH] kvm: userspace interface 2006-12-10 09:57:22 -08:00
Makefile [PATCH] kvm: userspace interface 2006-12-10 09:57:22 -08:00