Returning to a task with a 16-bit stack requires special care: the iret
instruction does not restore the high word of esp in that case. The
espfix code fixes this, but currently is not invoked on NMIs. This means
that a running task gets the upper word of esp clobbered due intervening
NMIs. To reproduce, compile and run the following program with the nmi
watchdog enabled (nmi_watchdog=2 on the command line). Using gdb you can
see that the high bits of esp contain garbage, while the low bits are
still correct.
This patch puts the espfix code back into the NMI code path.
The patch is slightly complicated due to the irqtrace infrastructure not
being NMI-safe. The NMI return path cannot call TRACE_IRQS_IRET.
Otherwise, the tail of the normal iret-code is correct for the nmi code
path too. To be able to share this code-path, the TRACE_IRQS_IRET was
move up a bit. The espfix code exists after the TRACE_IRQS_IRET, but
this code explicitly disables interrupts. This short interrupts-off
section is now not traced anymore. The return-to-kernel path now always
includes the preliminary test to decide if the espfix code should be
called. This is never the case, but doing it this way keeps the patch as
simple as possible and the few extra instructions should not affect
timing in any significant way.
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/ldt.h>
int modify_ldt(int func, void *ptr, unsigned long bytecount)
{
return syscall(SYS_modify_ldt, func, ptr, bytecount);
}
/* this is assumed to be usable */
#define SEGBASEADDR 0x10000
#define SEGLIMIT 0x20000
/* 16-bit segment */
struct user_desc desc = {
.entry_number = 0,
.base_addr = SEGBASEADDR,
.limit = SEGLIMIT,
.seg_32bit = 0,
.contents = 0, /* ??? */
.read_exec_only = 0,
.limit_in_pages = 0,
.seg_not_present = 0,
.useable = 1
};
int main(void)
{
setvbuf(stdout, NULL, _IONBF, 0);
/* map a 64 kb segment */
char *pointer = mmap((void *)SEGBASEADDR, SEGLIMIT+1,
PROT_EXEC|PROT_READ|PROT_WRITE,
MAP_SHARED|MAP_ANONYMOUS, -1, 0);
if (pointer == NULL) {
printf("could not map space\n");
return 0;
}
/* write ldt, new mode */
int err = modify_ldt(0x11, &desc, sizeof(desc));
if (err) {
printf("error modifying ldt: %i\n", err);
return 0;
}
for (int i=0; i<1000; i++) {
asm volatile (
"pusha\n\t"
"mov %ss, %eax\n\t" /* preserve ss:esp */
"mov %esp, %ebp\n\t"
"push $7\n\t" /* index 0, ldt, user mode */
"push $65536-4096\n\t" /* esp */
"lss (%esp), %esp\n\t" /* switch to new stack */
"push %eax\n\t" /* save old ss:esp on new stack */
"push %ebp\n\t"
"add $17*65536, %esp\n\t" /* set high bits */
"mov %esp, %edx\n\t"
"mov $10000000, %ecx\n\t" /* wait... */
"1: loop 1b\n\t" /* ... a bit */
"cmp %esp, %edx\n\t"
"je 1f\n\t"
"ud2\n\t" /* esp changed inexplicably! */
"1:\n\t"
"sub $17*65536, %esp\n\t" /* restore high bits */
"lss (%esp), %esp\n\t" /* restore old ss:esp */
"popa\n\t");
printf("\rx%ix", i);
}
return 0;
}
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Stas Sergeev <stsp@aknet.ru>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Expand Intel NMI perfctr1 workaround to include a Core2 processor stepping
(cpuid family-6, model-f, stepping-4). Resolves a situation where the NMI
would not enable on these processors.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: prarit@redhat.com
Cc: suresh.b.siddha@intel.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This compiler warning:
arch/x86/kernel/apic/io_apic.c: In function ‘ioapic_write_entry’:
arch/x86/kernel/apic/io_apic.c:466: warning: ‘eu’ is used uninitialized in this function
arch/x86/kernel/apic/io_apic.c:465: note: ‘eu’ was declared here
Is bogus as 'eu' is always initialized. But annotate it away by
initializing the variable, to make it easier for people to notice
real warnings. A compiler that sees through this logic will
optimize away the initialization.
Signed-off-by: Figo.zhang <figo1802@gmail.com>
LKML-Reference: <1245248720.3312.27.camel@myhost>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If APIC was disabled (for some reason) and as result
it's not even mapped we should not try to enable thermal
interrupts at all.
Reported-by: Simon Holm Thøgersen <odie@cs.aau.dk>
Tested-by: Simon Holm Thøgersen <odie@cs.aau.dk>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
LKML-Reference: <20090615182633.GA7606@lenovo>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch causes all the EFI_RESERVED_TYPE memory reservations to be recorded
in the e820 table as type E820_RESERVED.
(This patch replaces one called 'x86: vendor reserved memory type'.
This version has been discussed a bit with Peter and Yinghai but not given
a final opinion.)
Without this patch EFI_RESERVED_TYPE memory reservations may be
marked usable in the e820 table. There may be a collision between
kernel use and some reserver's use of this memory.
(An example use of this functionality is the UV system, which
will access extremely large areas of memory with a memory engine
that allows a user to address beyond the processor's range. Such
areas are reserved in the EFI table by the BIOS.
Some loaders have a restricted number of entries possible in the e820 table,
hence the need to record the reservations in the unrestricted EFI table.)
The call to do_add_efi_memmap() is only made if "add_efi_memmap" is specified
on the kernel command line.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
iomem_resource is by default initialized to -1, which means 64 bits of
physical address space if 64-bit resources are enabled. However, x86
CPUs cannot address 64 bits of physical address space. Thus, we want
to cap the physical address space to what the union of all CPU can
actually address.
Without this patch, we may end up assigning inaccessible values to
uninitialized 64-bit PCI memory resources.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Martin Mares <mj@ucw.cz>
Cc: stable@kernel.org
Now that enable_iommus() will call iommu_disable() for each iommu,
the call to disable_iommus() during resume is redundant. Also, the order
for an invalidation is to invalidate device table entries first, then
domain translations.
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
* 'timers-for-linus-migration' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timers: Logic to move non pinned timers
timers: /proc/sys sysctl hook to enable timer migration
timers: Identifying the existing pinned timers
timers: Framework for identifying pinned timers
timers: allow deferrable timers for intervals tv2-tv5 to be deferred
Fix up conflicts in kernel/sched.c and kernel/timer.c manually
These registers may contain values from previous kernels. So reset them
to known values before enable the event buffer again.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The IOMMU spec states that IOMMU behavior may be undefined when the
IOMMU registers are rewritten while command or event buffer is enabled.
Disable them in IOMMU disable path.
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
When kexec'ing to a new kernel (for example, when crashing and launching
a kdump session), the AMD IOMMU may have cached translations. The kexec'd
kernel, during initialization, will invalidate the IOMMU device table
entries, but not the domain translations. These stale entries can cause
a device's DMA to fail, makes it rough to write a dump to disk when the
disk controller can't DMA ;-)
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
If the IOMMUs are still enabled when the kexec kernel boots access to
the disk is not possible. This is bad for tools like kdump or anything
else which wants to use PCI devices.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
When the IOMMU stays enabled the BIOS may not be able to finish the
machine shutdown properly. So disable the hardware on shutdown.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
timer interrupts are excluded from being disabled during suspend. The
clock events code manages the disabling of clock events on its own
because the timer interrupt needs to be functional before the resume
code reenables the device interrupts.
The hpet per cpu timers request their interrupt without setting the
IRQF_TIMER flag so suspend_device_irqs() disables them as well which
results in a fatal resume failure on the boot CPU.
Adding IRQF_TIMER to the interupt flags when requesting the hpet per
cpu timer interrupts solves the problem.
Reported-by: Benjamin S. <sbenni@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Benjamin S. <sbenni@gmx.de>
Cc: stable@kernel.org
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf_counter: Start documenting HAVE_PERF_COUNTERS requirements
perf_counter: Add forward/backward attribute ABI compatibility
perf record: Explicity program a default counter
perf_counter: Remove PERF_TYPE_RAW special casing
perf_counter: PERF_TYPE_HW_CACHE is a hardware counter too
powerpc, perf_counter: Fix performance counter event types
perf_counter/x86: Add a quirk for Atom processors
perf_counter tools: Remove one L1-data alias
This patch (as1241) renames a bunch of functions in the PM core.
Rather than go through a boring list of name changes, suffice it to
say that in the end we have a bunch of pairs of functions:
device_resume_noirq dpm_resume_noirq
device_resume dpm_resume
device_complete dpm_complete
device_suspend_noirq dpm_suspend_noirq
device_suspend dpm_suspend
device_prepare dpm_prepare
in which device_X does the X operation on a single device and dpm_X
invokes device_X for all devices in the dpm_list.
In addition, the old dpm_power_up and device_resume_noirq have been
combined into a single function (dpm_resume_noirq).
Lastly, dpm_suspend_start and dpm_resume_end are the renamed versions
of the former top-level device_suspend and device_resume routines.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Magnus Damm <damm@igel.co.jp>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Rename the functions performing "_noirq" dev_pm_ops
operations from device_power_down() and device_power_up()
to device_suspend_noirq() and device_resume_noirq().
The new function names are chosen to show that the functions
are responsible for calling the _noirq() versions to finalize
the suspend/resume operation. The current function names do
not perform power down/up anymore so the names may be misleading.
Global function renames:
- device_power_down() -> device_suspend_noirq()
- device_power_up() -> device_resume_noirq()
Static function renames:
- suspend_device_noirq() -> __device_suspend_noirq()
- resume_device_noirq() -> __device_resume_noirq()
Signed-off-by: Magnus Damm <damm@igel.co.jp>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Len Brown <lenb@kernel.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
One of the numbers in arch/x86/kernel/acpi/sleep.c is long, but it is
not annotated appropriately, so sparese warns about it. Fix that.
[rjw: added the changelog.]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
* 'topic/slab/earlyboot-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slab: setup cpu caches later on when interrupts are enabled
slab,slub: don't enable interrupts during early boot
slab: fix gfp flag in setup_cpu_cache()
x86: make zap_low_mapping could be used early
irq: slab alloc for default irq_affinity
memcg: fix page_cgroup fatal error in FLATMEM
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-lguest: (31 commits)
lguest: add support for indirect ring entries
lguest: suppress notifications in example Launcher
lguest: try to batch interrupts on network receive
lguest: avoid sending interrupts to Guest when no activity occurs.
lguest: implement deferred interrupts in example Launcher
lguest: remove obsolete LHREQ_BREAK call
lguest: have example Launcher service all devices in separate threads
lguest: use eventfds for device notification
eventfd: export eventfd_signal and eventfd_fget for lguest
lguest: allow any process to send interrupts
lguest: PAE fixes
lguest: PAE support
lguest: Add support for kvm_hypercall4()
lguest: replace hypercall name LHCALL_SET_PMD with LHCALL_SET_PGD
lguest: use native_set_* macros, which properly handle 64-bit entries when PAE is activated
lguest: map switcher with executable page table entries
lguest: fix writev returning short on console output
lguest: clean up length-used value in example launcher
lguest: Segment selectors are 16-bit long. Fix lg_cpu.ss1 definition.
lguest: beyond ARRAY_SIZE of cpu->arch.gdt
...
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-module-and-param:
module: cleanup FIXME comments about trimming exception table entries.
module: trim exception table on init free.
module: merge module_alloc() finally
uml module: fix uml build process due to this merge
x86 module: merge the rest functions with macros
x86 module: merge the same functions in module_32.c and module_64.c
uvesafb: improve parameter handling.
module_param: allow 'bool' module_params to be bool, not just int.
module_param: add __same_type convenience wrapper for __builtin_types_compatible_p
module_param: split perm field into flags and perm
module_param: invbool should take a 'bool', not an 'int'
cyber2000fb.c: use proper method for stopping unload if CONFIG_ARCH_SHARK
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Provide _sdata in the vmlinux.lds.S file
x86: handle initrd that extends into unusable memory
The downside of the last patch which made restore_flags and irq_enable
check interrupts is that they are now too big to be patched directly
into the callsites, so the C versions are always used.
But the C versions go via PV_CALLEE_SAVE_REGS_THUNK which saves all
the registers. In fact, we don't need any registers in the fast path,
so we can do better than this if we actually code them in assembler.
The results are in the noise, but since it's about the same amount of
code, it's worth applying.
1GB Guest->Host: input(suppressed),output(suppressed)
Before:
Seconds: 0:16.53
Packets: 377268,753673
Interrupts: 22461,24297
Notifications: 1(5245),21303(732370)
Net IRQs triggered: 377023(245),42578(711095)
After:
Seconds: 0:16.48
Packets: 377289,753673
Interrupts: 22281,24465
Notifications: 1(5245),21296(732377)
Net IRQs triggered: 377060(229),42564(711109)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Everyone cut and paste this comment from my original one. We now do
it generically, so cut the comments.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Amerigo Wang <amwang@redhat.com>
As Christoph Hellwig suggested, module_alloc() actually can be
unified for i386 and x86_64 (of course, also UML).
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: 'Ingo Molnar' <mingo@elte.hu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Merge the same functions both in module_32.c and module_64.c into
module.c.
This is the first step to merge both of them finally.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The fixed-function performance counters do not work on current Atom
processors. Use the general-purpose ones instead.
Signed-off-by: Yong Wang <yong.y.wang@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <20090612080855.GA2286@ywang-moblin2.bj.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
_sdata is a common symbol defined by many architectures and made
available to the kernel via asm-generic/sections.h. Kmemleak uses this
symbol when scanning the data sections.
[ Impact: add new global symbol ]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
LKML-Reference: <20090511122105.26556.96593.stgit@pc1117.cambridge.arm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
So we make sure MAXSMP gets a cleared cpumask
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On a system where system memory (according e820) is not covered by
mtrr, mtrr_trim_memory converts a portion of memory to reserved, but
bootloader has already put the initrd in that range.
Thus, we need to have 64bit to use relocate_initrd too.
[ Impact: fix using initrd when mtrr_trim_memory happen ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
* 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
perf_counter: Turn off by default
perf_counter: Add counter->id to the throttle event
perf_counter: Better align code
perf_counter: Rename L2 to LL cache
perf_counter: Standardize event names
perf_counter: Rename enums
perf_counter tools: Clean up u64 usage
perf_counter: Rename perf_counter_limit sysctl
perf_counter: More paranoia settings
perf_counter: powerpc: Implement generalized cache events for POWER processors
perf_counters: powerpc: Add support for POWER7 processors
perf_counter: Accurate period data
perf_counter: Introduce struct for sample data
perf_counter tools: Normalize data using per sample period data
perf_counter: Annotate exit ctx recursion
perf_counter tools: Propagate signals properly
perf_counter tools: Small frequency related fixes
perf_counter: More aggressive frequency adjustment
perf_counter/x86: Fix the model number of Intel Core2 processors
perf_counter, x86: Correct some event and umask values for Intel processors
...
* 'topic/slab/earlyboot' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
vgacon: use slab allocator instead of the bootmem allocator
irq: use kcalloc() instead of the bootmem allocator
sched: use slab in cpupri_init()
sched: use alloc_cpumask_var() instead of alloc_bootmem_cpumask_var()
memcg: don't use bootmem allocator in setup code
irq/cpumask: make memoryless node zero happy
x86: remove some alloc_bootmem_cpumask_var calling
vt: use kzalloc() instead of the bootmem allocator
sched: use kzalloc() instead of the bootmem allocator
init: introduce mm_init()
vmalloc: use kzalloc() instead of alloc_bootmem()
slab: setup allocators earlier in the boot sequence
bootmem: fix slab fallback on numa
bootmem: use slab if bootmem is no longer available
* 'kvm-updates/2.6.31' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (138 commits)
KVM: Prevent overflow in largepages calculation
KVM: Disable large pages on misaligned memory slots
KVM: Add VT-x machine check support
KVM: VMX: Rename rmode.active to rmode.vm86_active
KVM: Move "exit due to NMI" handling into vmx_complete_interrupts()
KVM: Disable CR8 intercept if tpr patching is active
KVM: Do not migrate pending software interrupts.
KVM: inject NMI after IRET from a previous NMI, not before.
KVM: Always request IRQ/NMI window if an interrupt is pending
KVM: Do not re-execute INTn instruction.
KVM: skip_emulated_instruction() decode instruction if size is not known
KVM: Remove irq_pending bitmap
KVM: Do not allow interrupt injection from userspace if there is a pending event.
KVM: Unprotect a page if #PF happens during NMI injection.
KVM: s390: Verify memory in kvm run
KVM: s390: Sanity check on validity intercept
KVM: s390: Unlink vcpu on destroy - v2
KVM: s390: optimize float int lock: spin_lock_bh --> spin_lock
KVM: s390: use hrtimer for clock wakeup from idle - v2
KVM: s390: Fix memory slot versus run - v3
...
Don't hardcode to node zero for early boot IRQ setup memory allocations.
[ penberg@cs.helsinki.fi: minor cleanups ]
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Now that we set up the slab allocator earlier, we can get rid of some
alloc_bootmem_cpumask_var() calls in boot code.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
The top (fastest) and last level (biggest) caches are the most
interesting ones, performance wise.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
[ Fixed the Nehalem LL table to LLC Reference/Miss events ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Pure renames only, to PERF_COUNT_HW_* and PERF_COUNT_SW_*.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch introduces three boot options (no_cmci, dont_log_ce
and ignore_ce) to control handling for corrected errors.
The "mce=no_cmci" boot option disables the CMCI feature.
Since CMCI is a new feature so having boot controls to disable
it will be a help if the hardware is misbehaving.
The "mce=dont_log_ce" boot option disables logging for corrected
errors. All reported corrected errors will be cleared silently.
This option will be useful if you never care about corrected
errors.
The "mce=ignore_ce" boot option disables features for corrected
errors, i.e. polling timer and cmci. All corrected events are
not cleared and kept in bank MSRs.
Usually this disablement is not recommended, however it will be
a help if there are some conflict with the BIOS or hardware
monitoring applications etc., that clears corrected events in
banks instead of OS.
[ And trivial cleanup (space -> tab) for doc is included. ]
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <4A30ACDF.5030408@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>