kernel-fxtec-pro1x/arch/ia64/kernel
Ingo Molnar 0888f06ac9 [PATCH] sched: fix bad missed wakeups in the i386, x86_64, ia64, ACPI and APM idle code
Fernando Lopez-Lezcano reported frequent scheduling latencies and audio
xruns starting at the 2.6.18-rt kernel, and those problems persisted all
until current -rt kernels. The latencies were serious and unjustified by
system load, often in the milliseconds range.

After a patient and heroic multi-month effort of Fernando, where he
tested dozens of kernels, tried various configs, boot options,
test-patches of mine and provided latency traces of those incidents, the
following 'smoking gun' trace was captured by him:

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup (try_to_wake_up)
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup <<...>-5856> (37 0)
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup (c01262ba 0 0)
  IRQ_19-1479  1D..1    0us : resched_task (try_to_wake_up)
  IRQ_19-1479  1D..1    0us : __spin_unlock_irqrestore (try_to_wake_up)
  ...
  <idle>-0     1...1   11us!: default_idle (cpu_idle)
  ...
  <idle>-0     0Dn.1  602us : smp_apic_timer_interrupt (c0103baf 1 0)
  ...
   <...>-5856  0D..2  618us : __switch_to (__schedule)
   <...>-5856  0D..2  618us : __schedule <<idle>-0> (20 162)
   <...>-5856  0D..2  619us : __spin_unlock_irq (__schedule)
   <...>-5856  0...1  619us : trace_stop_sched_switched (__schedule)
   <...>-5856  0D..1  619us : trace_stop_sched_switched <<...>-5856> (37 0)

what is visible in this trace is that CPU#1 ran try_to_wake_up() for
PID:5856, it placed PID:5856 on CPU#0's runqueue and ran resched_task()
for CPU#0. But it decided to not send an IPI that no CPU - due to
TS_POLLING. But CPU#0 never woke up after its NEED_RESCHED bit was set,
and only rescheduled to PID:5856 upon the next lapic timer IRQ. The
result was a 600+ usecs latency and a missed wakeup!

the bug turned out to be an idle-wakeup bug introduced into the mainline
kernel this summer via an optimization in the x86_64 tree:

    commit 495ab9c045
    Author: Andi Kleen <ak@suse.de>
    Date:   Mon Jun 26 13:59:11 2006 +0200

    [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status

    During some profiling I noticed that default_idle causes a lot of
    memory traffic. I think that is caused by the atomic operations
    to clear/set the polling flag in thread_info. There is actually
    no reason to make this atomic - only the idle thread does it
    to itself, other CPUs only read it. So I moved it into ti->status.

the problem is this type of change:

        if (!hlt_counter && boot_cpu_data.hlt_works_ok) {
-               clear_thread_flag(TIF_POLLING_NRFLAG);
+               current_thread_info()->status &= ~TS_POLLING;
                smp_mb__after_clear_bit();
                while (!need_resched()) {
                        local_irq_disable();

this changes clear_thread_flag() to an explicit clearing of TS_POLLING.
clear_thread_flag() is defined as:

        clear_bit(flag, &ti->flags);

and clear_bit() is a LOCK-ed atomic instruction on all x86 platforms:

  static inline void clear_bit(int nr, volatile unsigned long * addr)
  {
          __asm__ __volatile__( LOCK_PREFIX
                  "btrl %1,%0"

hence smp_mb__after_clear_bit() is defined as a simple compile barrier:

  #define smp_mb__after_clear_bit()       barrier()

but the explicit TS_POLLING clearing introduced by the patch:

+               current_thread_info()->status &= ~TS_POLLING;

is not an atomic op! So the clearing of the TS_POLLING bit is freely
reorderable with the reading of the NEED_RESCHED bit - and both now
reside in different memory addresses.

CPU idle wakeup very much depends on ordered memory ops, the clearing of
the TS_POLLING flag must always be done before we test need_resched()
and hit the idle instruction(s). [Symmetrically, the wakeup code needs
to set NEED_RESCHED before it tests the TS_POLLING flag, so memory
ordering is paramount.]

Fernando's dual-core Athlon64 system has a sufficiently advanced memory
ordering model so that it triggered this scenario very often.

( And it also turned out that the reason why these latencies never
  triggered on my testsystems is that i routinely use idle=poll, which
  was the only idle variant not affected by this bug. )

The fix is to change the smp_mb__after_clear_bit() to an smp_mb(), to
act as an absolute barrier between the TS_POLLING write and the
NEED_RESCHED read. This affects almost all idling methods (default,
ACPI, APM), on all 3 x86 architectures: i386, x86_64, ia64.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:51 -08:00
..
cpufreq [PATCH] Add support for type argument in PAL_GET_PSTATE 2006-12-07 11:21:55 -08:00
acpi-ext.c Pull acpi_os_free into release branch 2006-07-01 17:19:08 -04:00
acpi-processor.c fix file specification in comments 2006-10-03 23:01:26 +02:00
acpi.c [IA64] remove unused acpi_kbd_controller_present, acpi_legacy_devices 2006-10-17 14:57:33 -07:00
asm-offsets.c Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
audit.c [PATCH] audit: AUDIT_PERM support 2006-09-11 13:32:30 -04:00
brl_emu.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
crash.c [IA64] CONFIG_KEXEC/CONFIG_CRASH_DUMP permutations 2006-12-12 10:11:00 -08:00
crash_dump.c [IA64] CONFIG_KEXEC/CONFIG_CRASH_DUMP permutations 2006-12-12 10:11:00 -08:00
cyclone.c [IA64] don't report !sn2 or !summit hardware as an error 2006-03-07 15:26:49 -08:00
efi.c [IA64] resolve name clash by renaming is_available_memory() 2006-12-07 13:46:12 -08:00
efi_stub.S [IA64] make efi_stub.S fit in 80 cols 2006-06-21 14:35:28 -07:00
entry.h Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
entry.S [IA64] IA64 Kexec/kdump 2006-12-07 09:51:35 -08:00
esi.c [IA64] esi-support 2006-06-21 11:19:22 -07:00
esi_stub.S [IA64] esi-support 2006-06-21 11:19:22 -07:00
fsys.S [IA64] cleanup in fsys.S 2006-02-28 08:53:32 -08:00
gate-data.S Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
gate.lds.S [PATCH] vDSO hash-style fix 2006-07-31 13:28:43 -07:00
gate.S Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
head.S [IA64] Save register stack contents on cpu start 2006-09-08 11:05:13 -07:00
ia64_ksyms.c [IA64] Need export for csum_ipv6_magic 2006-12-07 13:18:57 -08:00
init_task.c [PATCH] nsproxy: move init_nsproxy into kernel/nsproxy.c 2006-10-02 07:57:20 -07:00
iosapic.c [IA64] IA64 Kexec/kdump 2006-12-07 09:51:35 -08:00
irq.c [IA64] use generic_handle_irq() 2006-11-16 09:38:35 -08:00
irq_ia64.c [IA64] use generic_handle_irq() 2006-11-16 09:38:35 -08:00
irq_lsapic.c [IA64] typename -> name conversion 2006-11-16 09:38:02 -08:00
ivt.S Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
jprobes.S [IA64] enable trap code on slot 1 2006-12-12 12:00:55 -08:00
kprobes.c [IA64] kprobe clears qp bits for special instructions 2006-12-12 12:04:42 -08:00
machine_kexec.c [IA64] kexec/kdump: tidy up declaration of relocate_new_kernel_t 2006-12-12 10:12:08 -08:00
machvec.c IRQ: Maintain regs pointer globally rather than passing to IRQ handlers 2006-10-05 15:10:12 +01:00
Makefile [IA64] CONFIG_KEXEC/CONFIG_CRASH_DUMP permutations 2006-12-12 10:11:00 -08:00
mca.c [IA64] CONFIG_KEXEC/CONFIG_CRASH_DUMP permutations 2006-12-12 10:11:00 -08:00
mca_asm.S [IA64] ar.fpsr not set on MCA/INIT kernel entry 2006-09-26 15:20:35 -07:00
mca_drv.c [IA64] MCA recovery: Montecito support 2006-10-31 14:30:34 -08:00
mca_drv.h [IA64] printing support for MCA/INIT 2006-09-26 14:44:37 -07:00
mca_drv_asm.S Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
minstate.h Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
module.c Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
msi_ia64.c [PATCH] msi: move the ia64 code into arch/ia64 2006-10-04 07:55:29 -07:00
numa.c [PATCH] cpumask: export node_to_cpu_mask consistently 2006-10-02 07:57:17 -07:00
pal.S [IA64] reformat pal.S to fit in 80 columns, fix typos 2006-10-17 14:54:19 -07:00
palinfo.c Merge branch 'release' of master.kernel.org:/pub/scm/linux/kernel/git/aegl/linux-2.6 2006-12-07 15:39:22 -08:00
patch.c [IA64] add init declaration - gate page functions 2006-03-22 16:54:38 -08:00
perfmon.c [PATCH] struct path: convert ia64 2006-12-08 08:28:45 -08:00
perfmon_default_smpl.c Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
perfmon_generic.h Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
perfmon_itanium.h Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
perfmon_mckinley.h Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
perfmon_montecito.h [IA64] sparse cleanups 2006-12-07 10:48:19 -08:00
process.c [PATCH] sched: fix bad missed wakeups in the i386, x86_64, ia64, ACPI and APM idle code 2006-12-22 08:55:51 -08:00
ptrace.c Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
relocate_kernel.S [IA64] IA64 Kexec/kdump 2006-12-07 09:51:35 -08:00
sal.c [IA64] move SAL_CACHE_FLUSH check later in boot 2006-10-31 14:32:10 -08:00
salinfo.c [PATCH] struct path: convert ia64 2006-12-08 08:28:45 -08:00
semaphore.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
setup.c [IA64] Take defensive stance on ia64_pal_get_brand_info() 2006-12-12 11:56:36 -08:00
sigframe.h Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
signal.c Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
smp.c [IA64] CONFIG_KEXEC/CONFIG_CRASH_DUMP permutations 2006-12-12 10:11:00 -08:00
smpboot.c WorkQueue: Fix up arch-specific work items where possible 2006-12-05 19:36:26 +00:00
sys_ia64.c [PATCH] IA64,sparc: local DoS with corrupted ELFs 2006-09-08 08:40:46 -07:00
time.c [IA64] - Allow IPIs in timer loop 2006-10-17 14:51:49 -07:00
topology.c [PATCH] i386: change the 'no_control' field to 'hotpluggable' in the struct cpu 2006-12-07 02:14:10 +01:00
traps.c [IA64] - Reduce overhead of FP exception logging messages 2006-12-12 11:47:09 -08:00
unaligned.c [IA64] sysctl option to silence unaligned trap warnings 2006-02-28 09:42:23 -08:00
uncached.c [PATCH] Define easier to handle GFP_THISNODE 2006-09-26 08:48:50 -07:00
unwind.c [IA64] MCA/INIT: remove obsolete unwind code 2005-09-11 14:09:34 -07:00
unwind_decoder.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
unwind_i.h Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
vmlinux.lds.S [PATCH] vmlinux.lds: consolidate initcall sections 2006-10-27 15:34:51 -07:00