re-enable OPTIMIZE_INLINING more widely. Jeff Dike fixed the remaining
outstanding issue in this commit:
| commit 4f81c5350b
| Author: Jeff Dike <jdike@addtoit.com>
| Date: Mon Jul 7 13:36:56 2008 -0400
|
| [UML] fix gcc ICEs and unresolved externs
[...]
| This patch reintroduces unit-at-a-time for gcc >= 4.0, bringing back the
| possibility of Uli's crash. If that happens, we'll debug it.
it's still default-off and thus opt-in.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
i386 has show_trace_log_lvl and show_stack_log_lvl, allowing
traces to be emitted with log-level annotations. This patch
introduces them to x86_64, but log_lvl is only ever set to
an empty string. Output of traces is unchanged.
i386-chunk is whitespace-only.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make the diff between the traps_32.c and traps_64.c a bit smaller.
Change traps_32.c to look more like traps_64.c:
- move lock information to file scope
- split out oops_begin() and oops_end() from die()
- increment nest counter in oops_begin
Only whitespace change in traps_64.c
No functional changes intended.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Real-time code needs to know the number of cycles per second
on SGI UV. The information is provided via a run time BIOS
call. This patch provides the linux side of that interface.
This is the first of several run time BIOS calls to be defined
in uv/bios.h and bios_uv.c.
Note that BIOS_CALL() is just a stub for now. The bios
side is being worked on.
Signed-off-by: Russ Anderson <rja@sgi.com>
Cc: Jack Steiner <steiner@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Following recent (and less so) issues with the 8254 timer when routed
through the I/O or local APIC, always report which configurations have
been tried and which one has been set up eventually. This is so that logs
posted by people for some other reason can be used as a cross-reference
when investigating any possible future problems.
The change unifies messages printed on 32-bit and 64-bit platforms and
adds trailing newlines (removes leading ones), so that proper log level
annotation can be used and any possible interspersed output will not cause
a mess.
I have chosen to use apic_printk(APIC_QUIET, ...) rather than printk(...)
so that the distinction of these messages is maintained making possible
future decisions about changes in this area easier. A change posted
separately making apic_verbosity unsigned removes any extra code that
would otherwise be generated as a result of this design decision.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As a microoptimisation, make apic_verbosity unsigned. This will make
apic_printk(APIC_QUIET, ...) expand into just printk(...) with the
surrounding condition and a reference to apic_verbosity removed.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Include <asm/i8259.h> for i8259A_lock used in print_PIC() -- #if-0-ed out
by default. The 32-bit version gets it right already.
The plan is to enable this code with "apic=debug" eventually. This will
aid with debugging strange problems without the need to ask people to
apply patches.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce calibrate_APIC_clock so it could help in further 32/64bit
apic code merging.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: macro@linux-mips.org
Cc: yhlu.kernel@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
it's separate functionality that deserves its own file.
This also prepares 32-bit memtest support.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make and use macro FIX_EFLAGS, instead of immediate value 0x40DD5 in
ia32_restore_sigcontext().
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Various versions of the hypervisor have differences in what ABIs and
features they support. Print some details into the boot log to help
with remote debugging.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use alternatives to select the workaround for the 11AP Pentium erratum
for the affected steppings on the fly rather than build time. Remove the
X86_GOOD_APIC configuration option and replace all the calls to
apic_write_around() with plain apic_write(), protecting accesses to the
ESR as appropriate due to the 3AP Pentium erratum. Remove
apic_read_around() and all its invocations altogether as not needed.
Remove apic_write_atomic() and all its implementing backends. The use of
ASM_OUTPUT2() is not strictly needed for input constraints, but I have
used it for readability's sake.
I had the feeling no one else was brave enough to do it, so I went ahead
and here it is. Verified by checking the generated assembly and tested
with both a 32-bit and a 64-bit configuration, also with the 11AP
"feature" forced on and verified with gdb on /proc/kcore to work as
expected (as an 11AP machines are quite hard to get hands on these days).
Some script complained about the use of "volatile", but apic_write() needs
it for the same reason and is effectively a replacement for writel(), so I
have disregarded it.
I am not sure what the policy wrt defconfig files is, they are generated
and there is risk of a conflict resulting from an unrelated change, so I
have left changes to them out. The option will get removed from them at
the next run.
Some testing with machines other than mine will be needed to avoid some
stupid mistake, but despite its volume, the change is not really that
intrusive, so I am fairly confident that because it works for me, it will
everywhere.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Yinghai Lu noticed that arch/x86/kernel/smpcommon_32.c got
renamed to arch/x86/kernel/smpcommon.c but the old almost-empty
file stayed around. Zap it.
Reported-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Linus observed:
> The real bug is that we shouldn't have "double negatives", and
> certainly not negative config options. Making that "promiscuous
> /dev/mem" option a negated thing as a config option was bad.
right ... lets rename this option. There should never be a negation
in config options.
[ that reminds me of CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER, but that
is for another commit ;-) ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge the GDT_ENTRY() macro between arch/x86/boot/pm.c and
arch/x86/kernel/acpi/sleep.c and put the new one in
<asm-x86/segment.h>.
While we're at it, correct the bitmasks for the limit and flags. The
new version relies on using ULL constants in order to cause type
promotion rather than explicit casts; this avoids having to include
<linux/types.h> in <asm-x86/segments.h>.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: fix asm/e820.h for userspace inclusion
x86: fix numaq_tsc_disable
x86: fix kernel_physical_mapping_init() for large x86 systems
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ftrace: do not trace library functions
ftrace: do not trace scheduler functions
ftrace: fix lockup with MAXSMP
ftrace: fix merge buglet
fix:
arch/x86/kernel/numaq_32.c: In function ‘numaq_tsc_disable’:
arch/x86/kernel/numaq_32.c:99: warning: ‘return’ with a value, in function returning void
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Xen save/restore needs bits of code enabled by PM_SLEEP, and PM_SLEEP
depends on PM. So make XEN_SAVE_RESTORE depend on PM and PM_SLEEP
depend on XEN_SAVE_RESTORE.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-tip testing found a bootup hang here:
initcall anon_inode_init+0x0/0x130 returned 0 after 0 msecs
calling acpi_event_init+0x0/0x57
the bootup should have continued with:
initcall acpi_event_init+0x0/0x57 returned 0 after 45 msecs
but it hung hard there instead.
bisection led to this commit:
| commit 5806b81ac1
| Merge: d14c8a6... 6712e29...
| Author: Ingo Molnar <mingo@elte.hu>
| Date: Mon Jul 14 16:11:52 2008 +0200
| Merge branch 'auto-ftrace-next' into tracing/for-linus
turns out that i made this mistake in the merge:
ifdef CONFIG_FTRACE
# Do not profile debug utilities
CFLAGS_REMOVE_tsc_64.o = -pg
CFLAGS_REMOVE_tsc_32.o = -pg
those two files got unified meanwhile - so the dont-profile annotation
got lost. The proper rule is:
CFLAGS_REMOVE_tsc.o = -pg
i guess this could have been caught sooner if the CFLAGS_REMOVE* kbuild
rule aborted the build if it met a target that does not exist anymore?
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (72 commits)
Revert "x86/PCI: ACPI based PCI gap calculation"
PCI: remove unnecessary volatile in PCIe hotplug struct controller
x86/PCI: ACPI based PCI gap calculation
PCI: include linux/pm_wakeup.h for device_set_wakeup_capable
PCI PM: Fix pci_prepare_to_sleep
x86/PCI: Fix PCI config space for domains > 0
Fix acpi_pm_device_sleep_wake() by providing a stub for CONFIG_PM_SLEEP=n
PCI: Simplify PCI device PM code
PCI PM: Introduce pci_prepare_to_sleep and pci_back_from_sleep
PCI ACPI: Rework PCI handling of wake-up
ACPI: Introduce new device wakeup flag 'prepared'
ACPI: Introduce acpi_device_sleep_wake function
PCI: rework pci_set_power_state function to call platform first
PCI: Introduce platform_pci_power_manageable function
ACPI: Introduce acpi_bus_power_manageable function
PCI: make pci_name use dev_name
PCI: handle pci_name() being const
PCI: add stub for pci_set_consistent_dma_mask()
PCI: remove unused arch pcibios_update_resource() functions
PCI: fix pci_setup_device()'s sprinting into a const buffer
...
Fixed up conflicts in various files (arch/x86/kernel/setup_64.c,
arch/x86/pci/irq.c, arch/x86/pci/pci.h, drivers/acpi/sleep/main.c,
drivers/pci/pci.c, drivers/pci/pci.h, include/acpi/acpi_bus.h) from x86
and ACPI updates manually.
"idle=nomwait" disables the use of the MWAIT
instruction from both C1 (C1_FFH) and deeper (C2C3_FFH)
C-states.
When MWAIT is unavailable, the BIOS and OS generally
negotiate to use the HALT instruction for C1,
and use IO accesses for deeper C-states.
This option is useful for power and performance
comparisons, and also to work around BIOS bugs
where broken MWAIT support is advertised.
http://bugzilla.kernel.org/show_bug.cgi?id=10807http://bugzilla.kernel.org/show_bug.cgi?id=10914
Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
Signed-off-by: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
"idle=halt" limits the idle loop to using
the halt instruction. No MWAIT, no IO accesses,
no C-states deeper than C1.
If something is broken in the idle code,
"idle=halt" is a less severe workaround
than "idle=poll" which disables all power savings.
Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
991528d734
(ACPI: Processor native C-states using MWAIT)
started passing C2C3_FFH to _PDC to tell the BIOS
that Linux supports MWAIT for deep C-states.
However, we should first double check with the hardware
that it actually supports MWAIT before potentially exposing
a BIOS bug of an MWAIT _CST on HW that doesn't support MWAIT.
Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
Signed-off-by: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Synchronized tables with current specifications.
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
This closes some arcane holes in single-step handling that can arise
only when user programs set TF directly (via popf or sigreturn) and
then use vDSO (syscall/sysenter) system call entry. In those entry
paths, the clear_TF_reenable case hits and we must check TIF_SINGLESTEP
to be sure our bookkeeping stays correct wrt the user's view of TF.
Signed-off-by: Roland McGrath <roland@redhat.com>
This unifies and cleans up the syscall tracing code on i386 and x86_64.
Using a single function for entry and exit tracing on 32-bit made the
do_syscall_trace() into some terrible spaghetti. The logic is clear and
simple using separate syscall_trace_enter() and syscall_trace_leave()
functions as on 64-bit.
The unification adds PTRACE_SYSEMU and PTRACE_SYSEMU_SINGLESTEP support
on x86_64, for 32-bit ptrace() callers and for 64-bit ptrace() callers
tracing either 32-bit or 64-bit tasks. It behaves just like 32-bit.
Changing syscall_trace_enter() to return the syscall number shortens
all the assembly paths, while adding the SYSEMU feature in a simple way.
Signed-off-by: Roland McGrath <roland@redhat.com>
This unifies the treatment of TIF_SINGLESTEP on i386 and x86_64.
The bit is now excluded from _TIF_WORK_MASK on i386 as it has been
on x86_64. This means the do_notify_resume() path using it is never
used, so TIF_SINGLESTEP is not cleared on returning to user mode.
Both now leave TIF_SINGLESTEP set when returning to user, so that
it's already set on an int $0x80 system call entry. This removes
the need for testing TF on the system_call path. Doing it this way
fixes the regression for PTRACE_SINGLESTEP into a sigreturn syscall,
introduced by commit 1e2e99f0e4.
The clear_TF_reenable case that sets TIF_SINGLESTEP can only happen
on a non-exception kernel entry, i.e. sysenter/syscall instruction.
That will always get to the syscall exit tracing path.
Signed-off-by: Roland McGrath <roland@redhat.com>
The enable_single_step() logic bails out early if TF is already set.
That skips some of the bookkeeping that keeps things straight.
This makes PTRACE_SINGLEBLOCK break the behavior of a user task
that was already setting TF itself in user mode.
Fix the bookkeeping to notice the old TF setting as it should.
Test case at: http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/tests/ptrace-tests/tests/step-jump-cont-strict.c?cvsroot=systemtap
Signed-off-by: Roland McGrath <roland@redhat.com>
Fix bug in kernel_physical_mapping_init() that causes kernel
page table to be built incorrectly for systems with greater
than 512GB of memory.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
the paravirt-spinlock patches caused a boot hang with this config:
http://redhat.com/~mingo/misc/config-Wed_Jul__9_14_47_04_CEST_2008.bad
i have bisected it down to:
| commit e17b58c2e85bc2ad2afc07fb8d898017c2b75ed1
| Author: Jeremy Fitzhardinge <jeremy@goop.org>
| Date: Mon Jul 7 12:07:53 2008 -0700
|
| xen: implement Xen-specific spinlocks
i.e. applying that patch alone causes the hang. The hang happens in the
ftrace self-test:
initcall utsname_sysctl_init+0x0/0x19 returned 0 after 0 msecs
calling init_sched_switch_trace+0x0/0x4c
Testing tracer sched_switch: PASSED
initcall init_sched_switch_trace+0x0/0x4c returned 0 after 167 msecs
calling init_function_trace+0x0/0x12
Testing tracer ftrace:
[hard hang]
it should have continued like this:
Testing tracer ftrace: PASSED
initcall init_function_trace+0x0/0x12 returned 0 after 198 msecs
calling init_irqsoff_tracer+0x0/0x14
Testing tracer irqsoff: PASSED
initcall init_irqsoff_tracer+0x0/0x14 returned 0 after 3 msecs
calling init_mmio_trace+0x0/0x12
initcall init_mmio_trace+0x0/0x12 returned 0 after 0 msecs
the problem is that such lowlevel primitives as spinlocks should never
be built with -pg (which ftrace does). Marking paravirt.o as non-pg and
marking all spinlock ops as always-inline solve the hang.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The standard ticket spinlocks are very expensive in a virtual
environment, because their performance depends on Xen's scheduler
giving vcpus time in the order that they're supposed to take the
spinlock.
This implements a Xen-specific spinlock, which should be much more
efficient.
The fast-path is essentially the old Linux-x86 locks, using a single
lock byte. The locker decrements the byte; if the result is 0, then
they have the lock. If the lock is negative, then locker must spin
until the lock is positive again.
When there's contention, the locker spin for 2^16[*] iterations waiting
to get the lock. If it fails to get the lock in that time, it adds
itself to the contention count in the lock and blocks on a per-cpu
event channel.
When unlocking the spinlock, the locker looks to see if there's anyone
blocked waiting for the lock by checking for a non-zero waiter count.
If there's a waiter, it traverses the per-cpu "lock_spinners"
variable, which contains which lock each CPU is waiting on. It picks
one CPU waiting on the lock and sends it an event to wake it up.
This allows efficient fast-path spinlock operation, while allowing
spinning vcpus to give up their processor time while waiting for a
contended lock.
[*] 2^16 iterations is threshold at which 98% locks have been taken
according to Thomas Friebel's Xen Summit talk "Preventing Guests from
Spinning Around". Therefore, we'd expect the lock and unlock slow
paths will only be entered 2% of the time.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <clameter@linux-foundation.org>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Virtualization <virtualization@lists.linux-foundation.org>
Cc: Xen devel <xen-devel@lists.xensource.com>
Cc: Thomas Friebel <thomas.friebel@amd.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Switch to using the lock-byte spinlock implementation, to avoid the
worst of the performance hit from ticket locks.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <clameter@linux-foundation.org>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Virtualization <virtualization@lists.linux-foundation.org>
Cc: Xen devel <xen-devel@lists.xensource.com>
Cc: Thomas Friebel <thomas.friebel@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement a version of the old spinlock algorithm, in which everyone
spins waiting for a lock byte. In order to be compatible with the
ticket-lock's use of a zero initializer, this uses the convention of
'0' for unlocked and '1' for locked.
This algorithm is much better than ticket locks in a virtual
envionment, because it doesn't interact badly with the vcpu scheduler.
If there are multiple vcpus spinning on a lock and the lock is
released, the next vcpu to be scheduled will take the lock, rather
than cycling around until the next ticketed vcpu gets it.
To use this, you must call paravirt_use_bytelocks() very early, before
any spinlocks have been taken.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <clameter@linux-foundation.org>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Virtualization <virtualization@lists.linux-foundation.org>
Cc: Xen devel <xen-devel@lists.xensource.com>
Cc: Thomas Friebel <thomas.friebel@amd.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ticket spinlocks have absolutely ghastly worst-case performance
characteristics in a virtual environment. If there is any contention
for physical CPUs (ie, there are more runnable vcpus than cpus), then
ticket locks can cause the system to end up spending 90+% of its time
spinning.
The problem is that (v)cpus waiting on a ticket spinlock will be
granted access to the lock in strict order they got their tickets. If
the hypervisor scheduler doesn't give the vcpus time in that order,
they will burn timeslices waiting for the scheduler to give the right
vcpu some time. In the worst case it could take O(n^2) vcpu scheduler
timeslices for everyone waiting on the lock to get it, not counting
new cpus trying to take the lock while the log-jam is sorted out.
These hooks allow a paravirt backend to replace the spinlock
implementation.
At the very least, this could revert the implementation back to the
old lock algorithm, which allows the next scheduled vcpu to take the
lock, and has basically fairly good performance.
It also allows the spinlocks to take advantages of the hypervisor
features to make locks more efficient (spin and block, for example).
The cost to native execution is an extra direct call when using a
spinlock function. There's no overhead if CONFIG_PARAVIRT is turned
off.
The lock structure is fixed at a single "unsigned int", initialized to
zero, but the spinlock implementation can use it as it wishes.
Thanks to Thomas Friebel's Xen Summit talk "Preventing Guests from
Spinning Around" for pointing out this problem.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <clameter@linux-foundation.org>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Virtualization <virtualization@lists.linux-foundation.org>
Cc: Xen devel <xen-devel@lists.xensource.com>
Cc: Thomas Friebel <thomas.friebel@amd.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Exceptions using paranoidentry need to have their exception frames
adjusted explicitly.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Now that the vdso32 code can cope with both syscall and sysenter
missing for 32-bit compat processes, just disable the features without
disabling vdso altogether.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
AMD only supports "syscall" from 32-bit compat usermode.
Intel and Centaur(?) only support "sysenter" from 32-bit compat usermode.
Set the X86 feature bits accordingly, and set up the vdso in
accordance with those bits. On the offchance we run on in a 64-bit
environment which supports neither syscall nor sysenter from 32-bit
mode, then fall back to the int $0x80 vdso.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
fix:
arch/x86/xen/built-in.o: In function `xen_enable_syscall':
(.cpuinit.text+0xdb): undefined reference to `sysctl_vsyscall32'
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Old versions of Xen (3.1 and before) don't support sysenter or syscall
from 32-bit compat userspaces. If we can't set the appropriate
syscall callback, then disable the corresponding feature bit, which
will cause the vdso32 setup to fall back appropriately.
Linux assumes that syscall is always available to 32-bit userspace,
and installs it by default if sysenter isn't available. In that case,
we just disable vdso altogether, forcing userspace libc to fall back
to int $0x80.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit 033786969d1d1b5af12a32a19d3a760314d05329.
Suresh Siddha reported that this broke booting on his 2GB testbox.
Reported-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
arch/x86/xen/enlighten.c: In function 'xen_set_fixmap':
arch/x86/xen/enlighten.c:1127: error: 'FIX_KMAP_BEGIN' undeclared (first use in this function)
arch/x86/xen/enlighten.c:1127: error: (Each undeclared identifier is reported only once
arch/x86/xen/enlighten.c:1127: error: for each function it appears in.)
arch/x86/xen/enlighten.c:1127: error: 'FIX_KMAP_END' undeclared (first use in this function)
make[1]: *** [arch/x86/xen/enlighten.o] Error 1
make: *** [arch/x86/xen/enlighten.o] Error 2
FIX_KMAP_BEGIN is only available on HIGHMEM.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Allow Xen to be enabled on 64-bit.
Also extend domain size limit from 8 GB (on 32-bit) to 32 GB on 64-bit.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
64-bit uses MSRs for important things like the base for fs and
gs-prefixed addresses. It's more efficient to use a hypercall to
update these, rather than go via the trap and emulate path.
Other MSR writes are just passed through; in an unprivileged domain
they do nothing, but it might be useful later.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
64-bit userspace expects the vdso to be mapped at a specific fixed
address, which happens to be in the middle of the kernel address
space. Because we have split user and kernel pagetables, we need to
make special arrangements for the vsyscall mapping to appear in the
kernel part of the user pagetable.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We set up entrypoints for syscall and sysenter. sysenter is only used
for 32-bit compat processes, whereas syscall can be used in by both 32
and 64-bit processes.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Because the x86_64 architecture does not enforce segment limits, Xen
cannot protect itself with them as it does in 32-bit mode. Therefore,
to protect itself, it runs the guest kernel in ring 3. Since it also
runs the guest userspace in ring3, the guest kernel must maintain a
second pagetable for its userspace, which does not map kernel space.
Naturally, the guest kernel pagetables map both kernel and userspace.
The userspace pagetable is attached to the corresponding kernel
pagetable via the pgd's page->private field. It is allocated and
freed at the same time as the kernel pgd via the
paravirt_pgd_alloc/free hooks.
Fortunately, the user pagetable is almost entirely shared with the
kernel pagetable; the only difference is the pgd page itself. set_pgd
will populate all entries in the kernel pagetable, and also set the
corresponding user pgd entry if the address is less than
STACK_TOP_MAX.
The user pagetable must be pinned and unpinned with the kernel one,
but because the pagetables are aliased, pgd_walk() only needs to be
called on the kernel pagetable. The user pgd page is then
pinned/unpinned along with the kernel pgd page.
xen_write_cr3 must write both the kernel and user cr3s.
The init_mm.pgd pagetable never has a user pagetable allocated for it,
because it can never be used while running usermode.
One awkward area is that early in boot the page structures are not
available. No user pagetable can exist at that point, but it
complicates the logic to avoid looking at the page structure.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We need to do this, otherwise we can get a GPF on hypercall return
after TLS descriptor is cleared but %fs is still pointing to it.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement the failsafe callback, so that iret and segment register
load exceptions are reported to the kernel.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Point the boot params cmd_line_ptr to the domain-builder-provided
command line.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rewrite pgd_walk to deal with 64-bit address spaces. There are two
notible features of 64-bit workspaces:
1. The physical address is only 48 bits wide, with the upper 16 bits
being sign extension; kernel addresses are negative, and userspace is
positive.
2. The Xen hypervisor mapping is at the negative-most address, just above
the sign-extension hole.
1. means that we can't easily use addresses when traversing the space,
since we must deal with sign extension. This rewrite expresses
everything in terms of pgd/pud/pmd indices, which means we don't need
to worry about the exact configuration of the virtual memory space.
This approach works equally well in 32-bit.
To deal with 2, assume the hole is between the uppermost userspace
address and PAGE_OFFSET. For 64-bit this skips the Xen mapping hole.
For 32-bit, the hole is zero-sized.
In all cases, the uppermost kernel address is FIXADDR_TOP.
A side-effect of this patch is that the upper boundary is actually
handled properly, exposing a long-standing bug in 32-bit, which failed
to pin kernel pmd page. The kernel pmd is not shared, and so must be
explicitly pinned, even though the kernel ptes are shared and don't
need pinning.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The x86_64 interrupt subsystem is oriented towards vectors, as opposed
to a flat irq space as it is in x86-32. This patch adds a simple
identity irq->vector mapping so that we can continue to feed irqs into
do_IRQ() and get a good result.
Ideally x86_32 will unify with the 64-bit code and use vectors too.
At that point we can move to mapping event channels to vectors, which
will allow us to economise on irqs (so per-cpu event channels can
share irqs, rather than having to allocte one per cpu, for example).
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use callback_op hypercall to register callbacks in a 32/64-bit
independent way (64-bit doesn't need a code segment, but that detail
is hidden in XEN_CALLBACK).
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
swapgs is a no-op under Xen, because the hypervisor makes sure the
right version of %gs is current when switching between user and kernel
modes. This means that the swapgs "implementation" can be inlined and
used when the stack is unsafe (usermode). Unfortunately, it means
that disabling patching will result in a non-booting kernel...
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Xen pushes two extra words containing the values of rcx and r11. This
pvop hook copies the words back into their appropriate registers, and
cleans them off the stack. This leaves the stack in native form, so
the normal handler can run unchanged.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make Xen's set_pte_mfn() use set_pte_vaddr rather than copying it.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We need to wait until the page structure is available to use the
proper pagetable page alloc/release operations, since they use struct
page to determine if a pagetable is pinned.
This happened to work in 32bit because nobody allocated new pagetable
pages in the interim between xen_pagetable_setup_done and
xen_post_allocator_init, but the 64-bit kenrel needs to allocate more
pagetable levels.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Someone's got to do it.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When building initial pagetables in 64-bit kernel the pud/pmd pointer may
be in ioremap/fixmap space, so we need to walk the pagetable to look up the
physical address.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arbitrary_virt_to_machine can truncate a machine address if its above
4G. Cast the problem away.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rearrange the pagetable initialization to share code with the 64-bit
kernel. Rather than deferring anything to pagetable_setup_start, just
set up an initial pagetable in swapper_pg_dir early at startup, and
create an additional 8MB of physical memory mappings. This matches
the native head_32.S mappings to a large degree, and allows the rest
of the pagetable setup to continue without much Xen vs. native
difference.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Early in boot, map a chunk of extra physical memory for use later on.
We need a pool of mapped pages to allocate further pages to construct
pagetables mapping all physical memory.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It also doesn't need the 32-bit hack version of set_pte for initial
pagetable construction, so just make it use the real thing.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Set up the initial pagetables to map the kernel mapping into the
physical mapping space. This makes __va() usable, since it requires
physical mappings.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Split xen-asm into 32- and 64-bit files, and implement the 64-bit
variants.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add the Xen entrypoint and ELF notes to head_64.S. Adapts xen-head.S
to compile either 32-bit or 64-bit.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A number of random changes to make xen/smp.c compile in 64-bit mode.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>a
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As a stopgap until Mike Travis's x86-64 gs-based percpu patches are
ready, provide workaround functions for x86_read/write_percpu for
Xen's use.
Specifically, this means that we can't really make use of vcpu
placement, because we can't use a single gs-based memory access to get
to vcpu fields. So disable all that for now.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move all the smp_ops setup into smp.c, allowing a lot of things to
become static.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86_64 stores the active_mm in the pda, so fetch it from there.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We need extra pv_mmu_ops for 64-bit, to deal with the extra level of
pagetable.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Copy 64-bit definitions of various interface structures into place.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We need set_pte to work from a relatively early point, so enable it
from the start.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
add xen_timer_resume() hook.
Timer resume should be done after event channel is resumed.
add xen_arch_resume() hook when ipi becomes usable after resume.
After resume, some cpu specific resource must be reinitialized
on ia64 that can't be set by another cpu.
However available hooks is run once on only one cpu so that ipi has
to be used.
During stop_machine_run() ipi can't be used because interrupt is masked.
So add another hook after stop_machine_run().
Another approach might be use resume hook which is run by
device_resume(). However device_resume() may be executed on
suspend error recovery path.
So it is necessary to determine whether it is executed on real resume path
or error recovery path.
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Print a backtrace if a multicall fails, to help with debugging.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This allows Xen's xen_cpu_up() to allocate a pda for the new CPU.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Update arch/x86's use of page-aligned variables. The change to
arch/x86/xen/mmu.c fixes an actual bug, but the rest are cleanups
and to set a precedent.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
process_64.c:__switch_to has some very old strange formatting, some of
it dating back to pre-git. Fix it up.
No functional changes.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Early fixmap will allocate its own L1 pagetable page for fixmap
mappings, so there's no need to preallocate one.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Call paravirt_pagetable_setup_{start,done}
These paravirt_ops functions were not being called on x86_64.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Using ACPI to find free address space allows us to find a gap for the
unallocated PCI resources or MMIO resources for hotplug devices within
the BIOS allowed PCI regions.
It works by evaluating the _CRS object under PCI0 looking for producer
resources. Then searches the e820 memory space for a gap within these
producer resources.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Dave Hansen reported a build error on 32bit which went unnoticed
as newer gcc versions seem to optimize unused static functions
away before compiling them.
Make vread_tsc() depend on CONFIG_X86_64
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
fix merge fallout:
arch/x86/pci/amd_bus.c: In function ‘enable_pci_io_ecs':
arch/x86/pci/amd_bus.c:581: error: too many arguments to function ‘on_each_cpu'
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'timers/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: add PCI ID for 6300ESB force hpet
x86: add another PCI ID for ICH6 force-hpet
kernel-paramaters: document pmtmr= command line option
acpi_pm clccksource: fix printk format warning
nohz: don't stop idle tick if softirqs are pending.
pmtmr: allow command line override of ioport
nohz: reduce jiffies polling overhead
hrtimer: Remove unused variables in ktime_divns()
hrtimer: remove warning in hres_timers_resume
posix-timers: print RT watchdog message
On Mon, 2008-04-21 at 18:54 -0400, Masami Hiramatsu wrote:
> Thank you for reporting.
>
> Actually, kprobes tries to fixup thread's flags in post_kprobe_handler
> (which is called from kprobe_exceptions_notify) by
> trace_hardirqs_fixup_flags(pt_regs->flags). However, even the irq flag
> is set in pt_regs->flags, true hardirq is still off until returning
> from do_debug. Thus, lockdep assumes that hardirq is off without annotation.
>
> IMHO, one possible solution is that fixing hardirq flags right after
> notify_die in do_debug instead of in post_kprobe_handler.
My reply to BZ 10489:
> [ 2.707509] Kprobe smoke test started
> [ 2.709300] ------------[ cut here ]------------
> [ 2.709420] WARNING: at kernel/lockdep.c:2658 check_flags+0x4d/0x12c()
> [ 2.709541] Modules linked in:
> [ 2.709588] Pid: 1, comm: swapper Not tainted 2.6.25.jml.057 #1
> [ 2.709588] [<c0126acc>] warn_on_slowpath+0x41/0x51
> [ 2.709588] [<c010bafc>] ? save_stack_trace+0x1d/0x3b
> [ 2.709588] [<c0140a83>] ? save_trace+0x37/0x89
> [ 2.709588] [<c011987d>] ? kernel_map_pages+0x103/0x11c
> [ 2.709588] [<c0109803>] ? native_sched_clock+0xca/0xea
> [ 2.709588] [<c0142958>] ? mark_held_locks+0x41/0x5c
> [ 2.709588] [<c0382580>] ? kprobe_exceptions_notify+0x322/0x3af
> [ 2.709588] [<c0142aff>] ? trace_hardirqs_on+0xf1/0x119
> [ 2.709588] [<c03825b3>] ? kprobe_exceptions_notify+0x355/0x3af
> [ 2.709588] [<c0140823>] check_flags+0x4d/0x12c
> [ 2.709588] [<c0143c9d>] lock_release+0x58/0x195
> [ 2.709588] [<c038347c>] ? __atomic_notifier_call_chain+0x0/0x80
> [ 2.709588] [<c03834d6>] __atomic_notifier_call_chain+0x5a/0x80
> [ 2.709588] [<c0383508>] atomic_notifier_call_chain+0xc/0xe
> [ 2.709588] [<c013b6d4>] notify_die+0x2d/0x2f
> [ 2.709588] [<c038168a>] do_debug+0x67/0xfe
> [ 2.709588] [<c0381287>] debug_stack_correct+0x27/0x30
> [ 2.709588] [<c01564c0>] ? kprobe_target+0x1/0x34
> [ 2.709588] [<c0156572>] ? init_test_probes+0x50/0x186
> [ 2.709588] [<c04fae48>] init_kprobes+0x85/0x8c
> [ 2.709588] [<c04e947b>] kernel_init+0x13d/0x298
> [ 2.709588] [<c04e933e>] ? kernel_init+0x0/0x298
> [ 2.709588] [<c04e933e>] ? kernel_init+0x0/0x298
> [ 2.709588] [<c0105ef7>] kernel_thread_helper+0x7/0x10
> [ 2.709588] =======================
> [ 2.709588] ---[ end trace 778e504de7e3b1e3 ]---
> [ 2.709588] possible reason: unannotated irqs-off.
> [ 2.709588] irq event stamp: 370065
> [ 2.709588] hardirqs last enabled at (370065): [<c0382580>] kprobe_exceptions_notify+0x322/0x3af
> [ 2.709588] hardirqs last disabled at (370064): [<c0381bb7>] do_int3+0x1d/0x7d
> [ 2.709588] softirqs last enabled at (370050): [<c012b464>] __do_softirq+0xfa/0x100
> [ 2.709588] softirqs last disabled at (370045): [<c0107438>] do_softirq+0x74/0xd9
> [ 2.714751] Kprobe smoke test passed successfully
how I love this stuff...
Ok, do_debug() is a trap, this can happen at any time regardless of the
machine's IRQ state. So the first thing we do is fix up the IRQ state.
Then we call this die notifier stuff; and return with messed up IRQ
state... YAY.
So, kprobes fudges it..
notify_die(DIE_DEBUG)
kprobe_exceptions_notify()
post_kprobe_handler()
modify regs->flags
trace_hardirqs_fixup_flags(regs->flags); <--- must be it
So what's the use of modifying flags if they're not meant to take effect
at some point.
/me tries to reproduce issue; enable kprobes test thingy && boot
OK, that reproduces..
So the below makes it work - but I'm not getting this code; at the time
I wrote that stuff I CC'ed each and every kprobe maintainer listed in
the usual places but got no reposonse - can some please explain this
stuff to me?
Are the saved flags only for the TF bit or are they made in full effect
later (and if so, where) ?
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-2.6.27' of git://git.infradead.org/users/dwmw2/firmware-2.6: (64 commits)
firmware: convert sb16_csp driver to use firmware loader exclusively
dsp56k: use request_firmware
edgeport-ti: use request_firmware()
edgeport: use request_firmware()
vicam: use request_firmware()
dabusb: use request_firmware()
cpia2: use request_firmware()
ip2: use request_firmware()
firmware: convert Ambassador ATM driver to request_firmware()
whiteheat: use request_firmware()
ti_usb_3410_5052: use request_firmware()
emi62: use request_firmware()
emi26: use request_firmware()
keyspan_pda: use request_firmware()
keyspan: use request_firmware()
ttusb-budget: use request_firmware()
kaweth: use request_firmware()
smctr: use request_firmware()
firmware: convert ymfpci driver to use firmware loader exclusively
firmware: convert maestro3 driver to use firmware loader exclusively
...
Fix up trivial conflicts with BKL removal in drivers/char/dsp56k.c and
drivers/char/ip2/ip2main.c manually.
* 'core/printk' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, generic: mark early_printk as asmlinkage
printk: export console_drivers
printk: remember the message level for multi-line output
printk: refactor processing of line severity tokens
printk: don't prefer unsuited consoles on registration
printk: clean up recursion check related static variables
namespacecheck: more kernel/printk.c fixes
namespacecheck: fix kernel printk.c
* 'tracing/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (228 commits)
ftrace: build fix for ftraced_suspend
ftrace: separate out the function enabled variable
ftrace: add ftrace_kill_atomic
ftrace: use current CPU for function startup
ftrace: start wakeup tracing after setting function tracer
ftrace: check proper config for preempt type
ftrace: trace schedule
ftrace: define function trace nop
ftrace: move sched_switch enable after markers
ftrace: prevent ftrace modifications while being kprobe'd, v2
fix "ftrace: store mcount address in rec->ip"
mmiotrace broken in linux-next (8-bit writes only)
ftrace: avoid modifying kprobe'd records
ftrace: freeze kprobe'd records
kprobes: enable clean usage of get_kprobe
ftrace: store mcount address in rec->ip
ftrace: build fix with gcc 4.3
namespacecheck: fixes
ftrace: fix "notrace" filtering priority
ftrace: fix printout
...
John Keller reports that PCI config space access is broken on machines
with more than one domain. conf1 accesses only work for domain 0, so make sure
we check the domain number in the raw routines before trying conf1.
Reported-by: John Keller <jpk@sgi.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Explain that we set up the descriptors for Big Real Mode, and why we
do so. In particular, one system that is known to fail without it is
the Lenovo X61.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The explanation for recent video BIOS suspend quirk failures is that
the VESA BIOS expects to be entered in Big Real Mode (*.limit = 0xffffffff)
instead of ordinary Real Mode (*.limit = 0xffff).
This patch changes the segment descriptors to Big Real Mode instead.
The segment descriptor registers (what Intel calls "segment cache") is
always active. The only thing that changes based on CR0.PE is how it is
*loaded* and the interpretation of the CS flags.
The segment descriptor registers contain of the following sub-registers:
selector (the "visible" part), base, limit and flags. In protected mode
or long mode, they are loaded from descriptors (or fs.base or gs.base can
be manipulated directly in long mode.) In real mode, the only thing
changed by a segment register load is the selector and the base, where the
base <- selector << 4. In particular, *the limit and the flags are not
changed*.
As far as the handling of the CS flags: a code segment cannot be writable
in protected mode, whereas it is "just another segment" in real mode, so
there is some kind of quirk that kicks in for this when CR0.PE <- 0. I'm
not sure if this is accomplished by actually changing the cs.flags register
or just changing the interpretation; it might be something that is
CPU-specific. In particular, the Transmeta CPUs had an explicit "CS is
writable if you're in real mode" override, so even if you had loaded CS
with an execute-only segment it'd be writable (but not readable!) on return
to real mode. I'm not at all sure if that is how other CPUs behave.
Signed-off-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
when try to make hpet_enable use io_remap instead fixmap got
ioremap: invalid physical address fed00000
------------[ cut here ]------------
WARNING: at arch/x86/mm/ioremap.c:161 __ioremap_caller+0x8c/0x2f3()
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.26-rc9-tip-01873-ga9827e7-dirty #358
Call Trace:
[<ffffffff8026615e>] warn_on_slowpath+0x6c/0xa7
[<ffffffff802e2313>] ? __slab_alloc+0x20a/0x3fb
[<ffffffff802d85c5>] ? mpol_new+0x88/0x17d
[<ffffffff8022a4f4>] ? mcount_call+0x5/0x31
[<ffffffff8022a4f4>] ? mcount_call+0x5/0x31
[<ffffffff8024b0d2>] __ioremap_caller+0x8c/0x2f3
[<ffffffff80e86dbd>] ? hpet_enable+0x39/0x241
[<ffffffff8022a4f4>] ? mcount_call+0x5/0x31
[<ffffffff8024b466>] ioremap_nocache+0x2a/0x40
[<ffffffff80e86dbd>] hpet_enable+0x39/0x241
[<ffffffff80e7a1f6>] hpet_time_init+0x21/0x4e
[<ffffffff80e730e9>] start_kernel+0x302/0x395
[<ffffffff80e722aa>] x86_64_start_reservations+0xb9/0xd4
[<ffffffff80e722fe>] ? x86_64_init_pda+0x39/0x4f
[<ffffffff80e72400>] x86_64_start_kernel+0xec/0x107
---[ end trace a7919e7f17c0a725 ]---
it seems for amd system that is set later...
try to move setting early in early_identify_cpu.
and remove same code for intel and centaur.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
only add direct mapping for aperture
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Strengthen the return type for the _node_to_cpumask_ptr to be
a const pointer. This adds compiler checking to insure that
node_to_cpumask_map[] is not changed inadvertently.
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: "akpm@linux-foundation.org" <akpm@linux-foundation.org>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Acked-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that IRQ2 is never made available to the I/O APIC, there is no need
to special-case it and mask as a workaround for broken systems. Actually,
because of the former, mask_IO_APIC_irq(2) is a no-op already.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
got this on a test-system:
calling numaq_tsc_disable+0x0/0x39
NUMAQ: disabling TSC
initcall numaq_tsc_disable+0x0/0x39 returned 0 after 0 msecs
that's because we should not be using arch_initcall to call numaq_tsc_disable.
need to call it in setup_arch before time_init()/tsc_init()
and call it in init_intel() to make the cpu feature bits right.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
end_user_pfn used to modify the meaning of the e820 maps.
Now that all e820 operations are cleaned up, unified, tightened up,
the e820 map always get updated to reality, we don't need to keep
this secondary mechanism anymore.
If you hit this commit in bisection it means something slipped through.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
optimization: try to merge the range with same page size in
init_memory_mapping, to get the best possible linear mappings set up.
thus when GBpages is not there, we could do 2M pages.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
tighten the boundary checks around max_low_pfn_mapped - dont overmap
nor undermap into holes.
also print out tseg for AMD cpus, for diagnostic purposes.
(this is an SMM area, and we split up any big mappings around that area)
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On three of the several paths in entry_64.S that call
do_notify_resume() on the way back to user mode, we fail to properly
check again for newly-arrived work that requires another call to
do_notify_resume() before going to user mode. These paths set the
mask to check only _TIF_NEED_RESCHED, but this is wrong. The other
paths that lead to do_notify_resume() do this correctly already, and
entry_32.S does it correctly in all cases.
All paths back to user mode have to check all the _TIF_WORK_MASK
flags at the last possible stage, with interrupts disabled.
Otherwise, we miss any flags (TIF_SIGPENDING for example) that were
set any time after we entered do_notify_resume(). More work flags
can be set (or left set) synchronously inside do_notify_resume(), as
TIF_SIGPENDING can be, or asynchronously by interrupts or other CPUs
(which then send an asynchronous interrupt).
There are many different scenarios that could hit this bug, most of
them races. The simplest one to demonstrate does not require any
race: when one signal has done handler setup at the check before
returning from a syscall, and there is another signal pending that
should be handled. The second signal's handler should interrupt the
first signal handler before it actually starts (so the interrupted PC
is still at the handler's entry point). Instead, it runs away until
the next kernel entry (next syscall, tick, etc).
This test behaves correctly on 32-bit kernels, and fails on 64-bit
(either 32-bit or 64-bit test binary). With this fix, it works.
#define _GNU_SOURCE
#include <stdio.h>
#include <signal.h>
#include <string.h>
#include <sys/ucontext.h>
#ifndef REG_RIP
#define REG_RIP REG_EIP
#endif
static sig_atomic_t hit1, hit2;
static void
handler (int sig, siginfo_t *info, void *ctx)
{
ucontext_t *uc = ctx;
if ((void *) uc->uc_mcontext.gregs[REG_RIP] == &handler)
{
if (sig == SIGUSR1)
hit1 = 1;
else
hit2 = 1;
}
printf ("%s at %#lx\n", strsignal (sig),
uc->uc_mcontext.gregs[REG_RIP]);
}
int
main (void)
{
struct sigaction sa;
sigset_t set;
sigemptyset (&sa.sa_mask);
sa.sa_flags = SA_SIGINFO;
sa.sa_sigaction = &handler;
if (sigaction (SIGUSR1, &sa, NULL)
|| sigaction (SIGUSR2, &sa, NULL))
return 2;
sigemptyset (&set);
sigaddset (&set, SIGUSR1);
sigaddset (&set, SIGUSR2);
if (sigprocmask (SIG_BLOCK, &set, NULL))
return 3;
printf ("main at %p, handler at %p\n", &main, &handler);
raise (SIGUSR1);
raise (SIGUSR2);
if (sigprocmask (SIG_UNBLOCK, &set, NULL))
return 4;
if (hit1 + hit2 == 1)
{
puts ("PASS");
return 0;
}
puts ("FAIL");
return 1;
}
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We have two conflicting DMA-based quirks in there for the same set of
boxes (HP nx6325 and nx6125) and one of them actually breaks my box.
So remove the extra code.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: =?iso-8859-1?q?T=F6r=F6k_Edwin?= <edwintorok@gmail.com>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I don't know, if this new code boots, but at least it
compiles. Someone should really test it.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In the course of the recent unification of the NMI watchdog an assignment
to timer_ack to switch off unnecesary POLL commands to the 8259A in the
case of a watchdog failure has been accidentally removed. The statement
used to be limited to the 32-bit variation as since the rewrite of the
timer code it has been relevant for the 82489DX only. This change brings
it back.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is no such entity as ISA IRQ2. The ACPI spec does not make it
explicitly clear, but does not preclude it either -- all it says is ISA
legacy interrupts are identity mapped by default (subject to overrides),
but it does not state whether IRQ2 exists or not. As a result if there is
no IRQ0 override, then IRQ2 is normally initialised as an ISA interrupt,
which implies an edge-triggered line, which is unmasked by default as this
is what we do for edge-triggered I/O APIC interrupts so as not to miss an
edge.
To the best of my knowledge it is useless, as IRQ2 has not been in use
since the PC/AT as back then it was taken by the 8259A cascade interrupt
to the slave, with the line position in the slot rerouted to newly-created
IRQ9. No device could thus make use of this line with the pair of 8259A
chips. Now in theory INTIN2 of the I/O APIC may be usable, but the
interrupt of the device wired to it would not be available in the PIC mode
at all, so I seriously doubt if anybody decided to reuse it for a regular
device.
However there are two common uses of INTIN2. One is for IRQ0, with an
ACPI interrupt override (or its equivalent in the MP table). But in this
case IRQ2 is gone entirely with INTIN0 left vacant. The other one is for
an 8959A ExtINTA cascade. In this case IRQ0 goes to INTIN0 and if ACPI is
used INTIN2 is assumed to be IRQ2 (there is no override and ACPI has no
way to report ExtINTA interrupts). This is where a problem happens.
The problem is INTIN2 is configured as a native APIC interrupt, with a
vector assigned and the mask cleared. And the line may indeed get active
and inject interrupts if the master 8959A has its timer interrupt enabled
(it might happen for other interrupts too, but they are normally masked in
the process of rerouting them to the I/O APIC). There are two cases where
it will happen:
* When the I/O APIC NMI watchdog is enabled. This is actually a misnomer
as the watchdog pulses are delivered through the 8259A to the LINT0
inputs of all the local APICs in the system. The implication is the
output of the master 8259A goes high and low repeatedly, signalling
interrupts to INTIN2 which is enabled too!
[The origin of the name is I think for a brief period during the
development we had a capability in our code to configure the watchdog to
use an I/O APIC input; that would be INTIN2 in this scenario.]
* When the native route of IRQ0 via INTIN0 fails for whatever reason -- as
it happens with the system considered here. In this scenario the timer
pulse is delivered through the 8259A to LINT0 input of the local APIC of
the bootstrap processor, quite similarly to how is done for the watchdog
described above. The result is, again, INTIN2 receives these pulses
too. Rafael's system used to escape this scenario, because an incorrect
IRQ0 override would occupy INTIN2 and prevent it from being unmasked.
My conclusion is IRQ2 should be excluded from configuration in all the
cases and the current exception for ACPI systems should be lifted. The
reason being the exception not only being useless, but harmful as well.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>