construct a more or less wall-clock time out of sched_clock(), by
using ACPI-idle's existing knowledge about how much time we spent
idling. This allows the rq clock to work around TSC-stops-in-C2,
TSC-gets-corrupted-in-C3 type of problems.
( Besides the scheduler's statistics this also benefits blktrace and
printk-timestamps as well. )
Furthermore, the precise before-C2/C3-sleep and after-C2/C3-wakeup
callbacks allow the scheduler to get out the most of the period where
the CPU has a reliable TSC. This results in slightly more precise
task statistics.
the ACPI bits were acked by Len.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Len Brown <len.brown@intel.com>
Enable C3 without bm control only for CST based C3.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
This is a merge of Peter Keilty's initial patch (which was
revived by Bob Picco) for this with Hidetoshi Seto's fixes
and scaling improvements.
Acked-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
On systems that do not have pm2_control_block, we cannot really use
ARB_DISABLE before C3. We used to disable C3 totally on such systems.
To be compatible with Windows, we need to enable C3 on such systems now.
We just skip ARB_DISABLE step before entering the C3-state and assume
hardware is handling things correctly. Also, ACPI spec is not clear
about pm2_control is _needed_ for C3 or not.
We have atleast one system that need this to enable C3.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Always disable/enable interrupts in the acpi idle routine,
even in the error path.
This is required as the 2.6.20 change in git commit d331e739f5ad2aaa9...
"Fix interrupt race in idle callback" expects the idle handler
to enable interrupt before returning.
There was a case in acpi idle routine, in which interrupt was not being
enabled before return, which caused the system to hang at bootup, while
enabling C-states on an SMP system.
The signature of the hang was that "processor.nocst"
was required to enable boot.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Change mark_tsc_unstable() so it takes a string argument, which holds the
reason the TSC was marked unstable.
This is then displayed the first time mark_tsc_unstable is called.
This should help us better debug why the TSC was marked unstable on certain
systems and allow us to make sure we're not being overly paranoid when
throwing out this troublesome clocksource.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Thomas's patch for including <asm/apic.h> for x86 UP builds came into
Linus's tree from two different directions, both of which were merged.
This reverts the latter, yanking out the duplicate #include and comment.
Signed-off-by: Ray Lee <ray-lk@madrabbit.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
It turned out that it is almost impossible to trust ACPI, BIOS & Co.
regarding the C states. This was the reason to switch the local apic
timer off in C2 state already. OTOH there are sane and well behaving
systems, which get punished by that decision.
Allow the user to confirm that the local apic timer is trustworthy in C2
state. This keeps the default behaviour on the safe side.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 25496caec1, which
broke bootup on at least Ingo's ThinkPad T60. Need to figure out
exactly what is wrong before we can re-do the logic.
Requested-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Len Brown <len.brown@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use IPI for blacklisted CPUs, add parameter IPI vs LAPIC
Currently, Linux disables lapic timer for all machines with C2 and higher
C-state support.
According to Intel only specific Intel models (Banias/Dothan) are broken
in respect of not waking up from C2 with lapic.
However, I am not sure about the naming of the parameter and how it
could/should get integrated into the dyntick part
(CONFIG_GENERIC_CLOCKEVENTS). There, a more fine grained check (TSC
still running?, ..) is needed? Does this make sense (always use
CLOCK_EVT_NOTIFY_BROADCAST_ON, but use OFF if forced by use_ipi=0:
clockevents_notify(use_ipi ? CLOCK_EVT_NOTIFY_BROADCAST_ON :
CLOCK_EVT_NOTIFY_BROADCAST_OFF, &pr->id);
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Add clockevent drivers for i386: lapic (local) and PIT/HPET (global). Update
the timer IRQ to call into the PIT/HPET driver's event handler and the
lapic-timer IRQ to call into the lapic clockevent driver. The assignement of
timer functionality is delegated to the core framework code and replaces the
compile and runtime evalution in do_timer_interrupt_hook()
Use the clockevents broadcast support and implement the lapic_broadcast
function for ACPI.
No changes to existing functionality.
[ kdump fix from Vivek Goyal <vgoyal@in.ibm.com> ]
[ fixes based on review feedback from Arjan van de Ven <arjan@infradead.org> ]
Cleanups-from: Adrian Bunk <bunk@stusta.de>
Build-fixes-from: Andrew Morton <akpm@osdl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a preperatory patch for highres/dyntick:
- replace the big #ifdef ARCH_APICTIMER_STOPS_ON_C3 hackery by functions
- remove the double switch in the power verify function (in the worst case
we switched ipi to apic and 20usec later apic to ipi)
- keep track of the the state which stops local APIC timer
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Len Brown <len.brown@intel.com>
Cc: <linux-acpi@vger.kernel.org>
Cc: Andi Kleen <ak@suse.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
apic.h does not get included on UP compiles. That way the
APICTIMER_STOPS_ON_C3 is not there and UP boxen have no support for timer
broadcasting. This was never noticed, because the lapic timer is only used
for profiling on UP.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
apic.h does not get included on UP compiles. That way the
APICTIMER_STOPS_ON_C3 is not there and UP boxen have no support for timer
broadcasting. This was never noticed, because the lapic timer is only used
for profiling on UP.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Len Brown <len.brown@intel.com>
cosmetic only
Make "module name" actually match the file name.
Invoke with ';' as leaving it off confuses Lindent and gcc doesn't care.
Fix indentation where Lindent did get confused.
Signed-off-by: Len Brown <len.brown@intel.com>
drivers/acpi/namespace/nsparse.c:126: warning: int format, different type arg (arg 7)
drivers/acpi/tables/tbfadt.c:224: warning: unsigned int format, different type arg (arg 6)
drivers/acpi/utilities/utdebug.c:184: warning: cast from pointer to integer of different size
drivers/acpi/utilities/utdebug.c:184: warning: cast from pointer to integer of different size
drivers/acpi/utilities/utdebug.c:197: warning: cast from pointer to integer of different size
drivers/acpi/processor_idle.c:1093: warning: long long unsigned int format, u64 arg (arg 5)
Signed-off-by: Len Brown <len.brown@intel.com>
Remove flags parameter for acpi_{get,set}_register().
It is no longer necessary now that these functions use a
spinlock for mutual exclusion.
Signed-off-by: Alexey Starikovskiy <alexey.y.starikovskiy@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (68 commits)
ACPI: replace kmalloc+memset with kzalloc
ACPI: Add support for acpi_load_table/acpi_unload_table_id
fbdev: update after backlight argument change
ACPI: video: Add dev argument for backlight_device_register
ACPI: Implement acpi_video_get_next_level()
ACPI: Kconfig - depend on PM rather than selecting it
ACPI: fix NULL check in drivers/acpi/osl.c
ACPI: make drivers/acpi/ec.c:ec_ecdt static
ACPI: prevent processor module from loading on failures
ACPI: fix single linked list manipulation
ACPI: ibm_acpi: allow clean removal
ACPI: fix git automerge failure
ACPI: ibm_acpi: respond to workqueue update
ACPI: dock: add uevent to indicate change in device status
ACPI: ec: Lindent once again
ACPI: ec: Change #define to enums there possible.
ACPI: ec: Style changes.
ACPI: ec: Acquire Global Lock under EC mutex.
ACPI: ec: Drop udelay() from poll mode. Loop by reading status field instead.
ACPI: ec: Rename gpe_bit to gpe
...
Fernando Lopez-Lezcano reported frequent scheduling latencies and audio
xruns starting at the 2.6.18-rt kernel, and those problems persisted all
until current -rt kernels. The latencies were serious and unjustified by
system load, often in the milliseconds range.
After a patient and heroic multi-month effort of Fernando, where he
tested dozens of kernels, tried various configs, boot options,
test-patches of mine and provided latency traces of those incidents, the
following 'smoking gun' trace was captured by him:
_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
IRQ_19-1479 1D..1 0us : __trace_start_sched_wakeup (try_to_wake_up)
IRQ_19-1479 1D..1 0us : __trace_start_sched_wakeup <<...>-5856> (37 0)
IRQ_19-1479 1D..1 0us : __trace_start_sched_wakeup (c01262ba 0 0)
IRQ_19-1479 1D..1 0us : resched_task (try_to_wake_up)
IRQ_19-1479 1D..1 0us : __spin_unlock_irqrestore (try_to_wake_up)
...
<idle>-0 1...1 11us!: default_idle (cpu_idle)
...
<idle>-0 0Dn.1 602us : smp_apic_timer_interrupt (c0103baf 1 0)
...
<...>-5856 0D..2 618us : __switch_to (__schedule)
<...>-5856 0D..2 618us : __schedule <<idle>-0> (20 162)
<...>-5856 0D..2 619us : __spin_unlock_irq (__schedule)
<...>-5856 0...1 619us : trace_stop_sched_switched (__schedule)
<...>-5856 0D..1 619us : trace_stop_sched_switched <<...>-5856> (37 0)
what is visible in this trace is that CPU#1 ran try_to_wake_up() for
PID:5856, it placed PID:5856 on CPU#0's runqueue and ran resched_task()
for CPU#0. But it decided to not send an IPI that no CPU - due to
TS_POLLING. But CPU#0 never woke up after its NEED_RESCHED bit was set,
and only rescheduled to PID:5856 upon the next lapic timer IRQ. The
result was a 600+ usecs latency and a missed wakeup!
the bug turned out to be an idle-wakeup bug introduced into the mainline
kernel this summer via an optimization in the x86_64 tree:
commit 495ab9c045
Author: Andi Kleen <ak@suse.de>
Date: Mon Jun 26 13:59:11 2006 +0200
[PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status
During some profiling I noticed that default_idle causes a lot of
memory traffic. I think that is caused by the atomic operations
to clear/set the polling flag in thread_info. There is actually
no reason to make this atomic - only the idle thread does it
to itself, other CPUs only read it. So I moved it into ti->status.
the problem is this type of change:
if (!hlt_counter && boot_cpu_data.hlt_works_ok) {
- clear_thread_flag(TIF_POLLING_NRFLAG);
+ current_thread_info()->status &= ~TS_POLLING;
smp_mb__after_clear_bit();
while (!need_resched()) {
local_irq_disable();
this changes clear_thread_flag() to an explicit clearing of TS_POLLING.
clear_thread_flag() is defined as:
clear_bit(flag, &ti->flags);
and clear_bit() is a LOCK-ed atomic instruction on all x86 platforms:
static inline void clear_bit(int nr, volatile unsigned long * addr)
{
__asm__ __volatile__( LOCK_PREFIX
"btrl %1,%0"
hence smp_mb__after_clear_bit() is defined as a simple compile barrier:
#define smp_mb__after_clear_bit() barrier()
but the explicit TS_POLLING clearing introduced by the patch:
+ current_thread_info()->status &= ~TS_POLLING;
is not an atomic op! So the clearing of the TS_POLLING bit is freely
reorderable with the reading of the NEED_RESCHED bit - and both now
reside in different memory addresses.
CPU idle wakeup very much depends on ordered memory ops, the clearing of
the TS_POLLING flag must always be done before we test need_resched()
and hit the idle instruction(s). [Symmetrically, the wakeup code needs
to set NEED_RESCHED before it tests the TS_POLLING flag, so memory
ordering is paramount.]
Fernando's dual-core Athlon64 system has a sufficiently advanced memory
ordering model so that it triggered this scenario very often.
( And it also turned out that the reason why these latencies never
triggered on my testsystems is that i routinely use idle=poll, which
was the only idle variant not affected by this bug. )
The fix is to change the smp_mb__after_clear_bit() to an smp_mb(), to
act as an absolute barrier between the TS_POLLING write and the
NEED_RESCHED read. This affects almost all idling methods (default,
ACPI, APM), on all 3 x86 architectures: i386, x86_64, ia64.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch breaks C-state discovery on my IBM IntelliStation Z30 because
the return value of acpi_processor_get_power_info_fadt is not assigned to
"result" in the case that acpi_processor_get_power_info_cst returns
-ENODEV. Thus, if ACPI provides C-state data via the FADT and not _CST (as
is the case on this machine), we incorrectly exit the function with -ENODEV
after reading the FADT. The attached patch sets the value of result so
that we don't exit early.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Acked-by: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
Acked-by: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
drivers/acpi/processor_idle.c:1112: warning: 'smp_callback' defined but not used
Cc: Len Brown <lenb@kernel.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The ACPI processor init functions should be marked as __cpuinit as they use
structures marked with __cpuinitdata.
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Len Brown <len.brown@intel.com>
Intel processors starting with the Core Duo support
support processor native C-state using the MWAIT instruction.
Refer: Intel Architecture Software Developer's Manual
http://www.intel.com/design/Pentium4/manuals/253668.htm
Platform firmware exports the support for Native C-state to OS using
ACPI _PDC and _CST methods.
Refer: Intel Processor Vendor-Specific ACPI: Interface Specification
http://www.intel.com/technology/iapc/acpi/downloads/302223.htm
With Processor Native C-state, we use 'MWAIT' instruction on the processor
to enter different C-states (C1, C2, C3). We won't use the special IO
ports to enter C-state and no SMM mode etc required to enter C-state.
Overall this will mean better C-state support.
One major advantage of using MWAIT for all C-states is, with this and
"treat interrupt as break event" feature of MWAIT, we can now get accurate
timing for the time spent in C1, C2, .. states.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Len Brown <len.brown@intel.com>
Add infrastructure to track "maximum allowable latency" for power saving
policies.
The reason for adding this infrastructure is that power management in the
idle loop needs to make a tradeoff between latency and power savings
(deeper power save modes have a longer latency to running code again). The
code that today makes this tradeoff just does a rather simple algorithm;
however this is not good enough: There are devices and use cases where a
lower latency is required than that the higher power saving states provide.
An example would be audio playback, but another example is the ipw2100
wireless driver that right now has a very direct and ugly acpi hook to
disable some higher power states randomly when it gets certain types of
error.
The proposed solution is to have an interface where drivers can
* announce the maximum latency (in microseconds) that they can deal with
* modify this latency
* give up their constraint
and a function where the code that decides on power saving strategy can
query the current global desired maximum.
This patch has a user of each side: on the consumer side, ACPI is patched
to use this, on the producer side the ipw2100 driver is patched.
A generic maximum latency is also registered of 2 timer ticks (more and you
lose accurate time tracking after all).
While the existing users of the patch are x86 specific, the infrastructure
is not. I'd like to ask the arch maintainers of other architectures if the
infrastructure is generic enough for their use (assuming the architecture
has such a tradeoff as concept at all), and the sound/multimedia driver
owners to look at the driver facing API to see if this is something they
can use.
[akpm@osdl.org: cleanups]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jesse Barnes <jesse.barnes@intel.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While trying to look for superfluous I/O accesses that can be optimized
away, I stumbled upon this ACPI sleep I/O access and couldn't figure out
why the hell this dummy op was necessary.
After more than one hour of internet research, I had collected a sufficient
number of documents (among those very old kernel versions) that finally
told me what this dummy read was about: STPCLK# doesn't get asserted in time
on (some) chipsets, which is why we need to have a dummy I/O read to delay
further instruction processing until the CPU is fully stopped.
Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Only if bus master activity is going on at the present, we should avoid
entering C3-type sleep, as it might be a faulty transition. As long as the
bm_activity bitmask was based on the number of calls to the ACPI idle
function, looking at previous moments made sense. Now, with it being based on
what happened this jiffy, looking at this jiffy should be sufficient.
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Len Brown <len.brown@intel.com>
Do not assume there was bus mastering activity if the idle handler didn't get
called, as there's only reason to not enter C3-type sleep if there is bus
master activity going on. Only for the "promotion" into C3-type sleep bus
mastering activity is taken into account, and there only current bus mastering
activity, and not pure guessing should lead to the decision on whether to
enter C3-type sleep or not.
Also, as bm_activity is a jiffy-based bitmask (bit 0: bus mastering activity
during this juffy, bit 31: bus mastering activity 31 jiffies ago), fix the
setting of bit 0, as it might be called multiple times within one jiffy.
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Len Brown <len.brown@intel.com>
Track the actual time spent in C-States (C2 upwards, we can't determine this
for C1), not only the number of invocations. This is especially useful for
dynamic ticks / "tickless systems", but is also of interest on normal systems,
as any interrupt activity leads to C-States being exited, not only the timer
interrupt.
The time is being measured in PM timer ticks, so an increase by one equals 279
nanoseconds.
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Len Brown <len.brown@intel.com>
During some profiling I noticed that default_idle causes a lot of
memory traffic. I think that is caused by the atomic operations
to clear/set the polling flag in thread_info. There is actually
no reason to make this atomic - only the idle thread does it
to itself, other CPUs only read it. So I moved it into ti->status.
Converted i386/x86-64/ia64 for now because that was the easiest
way to fix ACPI which also manipulates these flags in its idle
function.
Cc: Nick Piggin <npiggin@novell.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As part of the i386 conversion to the generic timekeeping infrastructure, this
introduces a new tsc.c file. The code in this file replaces the TSC
initialization, management and access code currently in timer_tsc.c (which
will be removed) that we want to preserve.
The code also introduces the following functionality:
o tsc_khz: like cpu_khz but stores the TSC frequency on systems that do not
change TSC frequency w/ CPU frequency
o check/mark_tsc_unstable: accessor/modifier flag for TSC timekeeping
usability
o minor cleanups to calibration math.
This patch also includes a one line __cpuinitdata fix from Zwane Mwaikambo.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
make pm_idle_save, nocst and bm_history __read_mostly
remove initializer from static 'first_run'.
Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Len Brown <len.brown@intel.com>
Broken earlier by me by a x86-64 patch.
The code was optimized away, but the compiler still complained about an
undeclared function.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>