This arranges for the lppaca structs for most cpus to be dynamically
allocated in the same manner as the paca structs. If we don't include
support for legacy iSeries, only the first lppaca is statically
allocated; the rest are dynamically allocated. If we include legacy
iSeries support, then we statically allocate the first 64 lppaca
structs, since the iSeries hypervisor requires that the lppaca
structs be present in the data section of the kernel image, but
legacy iSeries supports at most 64 cpus.
With CONFIG_NR_CPUS, the kernel image size for a typical pSeries config
went from:
text data bss dec hex filename
9524478 4734564 8469944 22728986 15ad11a ../test-1024/vmlinux
to:
text data bss dec hex filename
9524482 3751508 8469944 21745934 14bd10e ../test-1024/vmlinux
a reduction of 983052 bytes overall.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently we have the lppaca structs as a simple array of NR_CPUS
entries, taking up space in the data section of the kernel image.
In future we would like to allocate them dynamically, so this
abstracts out the accesses to the array, making it easier to
change how we locate the lppaca for a given cpu in future.
Specifically, lppaca[cpu] changes to lppaca_of(cpu).
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The POWER architecture does not require stcx to check that it is operating
on the same address as the larx. This means it is possible for an
an exception handler to execute a larx, get a reservation, decide
not to do the stcx and then return back with an active reservation. If the
interrupted code was in the middle of a larx/stcx sequence the stcx could
incorrectly succeed.
All recent POWER CPUs check the address before letting the stcx succeed
so we can create a CPU feature and nop it out. As Ben suggested, we can
only do this in our syscall path because there is a remote possibility
some kernel code gets interrupted by an exception that ends up operating
on the same cacheline.
Thanks to Paul Mackerras and Derek Williams for the idea.
To test this I used a very simple null syscall (actually getppid) testcase
at http://ozlabs.org/~anton/junkcode/null_syscall.c
I tested against 2.6.35-git10 with the following changes against the
pseries_defconfig:
CONFIG_VIRT_CPU_ACCOUNTING=n
CONFIG_AUDIT=n
CONFIG_PPC_4K_PAGES=n
CONFIG_PPC_64K_PAGES=y
CONFIG_FORCE_MAX_ZONEORDER=9
CONFIG_PPC_SUBPAGE_PROT=n
CONFIG_FUNCTION_TRACER=n
CONFIG_FUNCTION_GRAPH_TRACER=n
CONFIG_IRQSOFF_TRACER=n
CONFIG_STACK_TRACER=n
to remove the overhead of virtual CPU accounting, syscall auditing and
the ftrace mcount tracers. 64kB pages were enabled to minimise TLB misses.
POWER6: +8.2%
POWER7: +7.0%
Another suggestion was to use a larx to something in the L1 instead of a stcx.
This was almost as fast as removing the larx on POWER6, but only 3.5% faster
on POWER7. We can use this to speed up the reservation clear in our
exception exit code.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds the equivalent of csum_and_copy_from_user for the receive side so we
can copy and checksum in one pass. It is modelled on the generic checksum
routine.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We use the same core loop as the new csum_partial, adding in the
stores and exception handling code. To keep things simple we do all the
exception fixup in csum_and_copy_from_user. This wrapper function is
modelled on the generic checksum code and is careful to always calculate
a complete checksum even if we only copied part of the data to userspace.
To test this I forced checksumming on over loopback and ran socklib (a
simple TCP benchmark). On a POWER6 575 throughput improved by 19% with
this patch. If I forced both the sender and receiver onto the same cpu
(with the hope of shifting the benchmark from being cache bandwidth limited
to cpu limited), adding this patch improved performance by 55%
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
I'm sick of seeing ppc64_runlatch_off in our profiles, so inline it
into the callers. To avoid a mess of circular includes I didn't add
it as an inline function.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The code is wrapped in an #if 0, but it's wrong so we may as well fix it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This makes the 64-bit kernel use 64-bit signed integers for the counter
(effectively supporting 32-bit of active count in the semaphore), thus
avoiding things like overflow of the mmap_sem if you use a really crazy
number of threads
Note: Ideally the type in the structure should be atomic_long_t rather
than "long". However, there's some nasty issues with that. It needs to
be initialized statically -and- lib/rwsem.c does things like
sem->count = RWSEM_UNLOCKED_VALUE;
Now, if you mix in the fact that atomic_* types are actually structures
with one member and note typedefs of a scalar, it makes its really nasty.
So I stuck to what we did before using a long and casts for now.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Fairly simple conflicts, the most serious ones are the i.MX ones which I
suspect now need another rename.
Conflicts:
arch/arm/mach-mx2/clock_imx27.c
arch/arm/mach-mx2/devices.c
arch/arm/mach-omap2/board-rx51-peripherals.c
arch/arm/mach-omap2/board-zoom2.c
sound/soc/fsl/mpc5200_dma.c
sound/soc/fsl/mpc5200_dma.h
sound/soc/fsl/mpc8610_hpcd.c
sound/soc/pxa/spitz.c
The immap_86xx.h header file only defines one data structure: the "global
utilities" register set found on Freescale PowerPC SOCs. Rename this file
to fsl_guts.h to reflect its true purpose, and extend it to cover the "GUTS"
register set on 85xx chips.
Signed-off-by: Timur Tabi <timur@freescale.com>
Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Acked-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Liam Girdwood <lrg@slimlogic.co.uk>
Architectures implement dma_is_consistent() in different ways (some
misinterpret the definition of API in DMA-API.txt). So it hasn't been so
useful for drivers. We have only one user of the API in tree. Unlikely
out-of-tree drivers use the API.
Even if we fix dma_is_consistent() in some architectures, it doesn't look
useful at all. It was invented long ago for some old systems that can't
allocate coherent memory at all. It's better to export only APIs that are
definitely necessary for drivers.
Let's remove this API.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dma_get_cache_alignment returns the minimum DMA alignment. Architectures
defines it as ARCH_DMA_MINALIGN (formally ARCH_KMALLOC_MINALIGN). So we
can unify dma_get_cache_alignment implementations.
Note that some architectures implement dma_get_cache_alignment wrongly.
dma_get_cache_alignment() should return the minimum DMA alignment. So
fully-coherent architectures should return 1. This patch also fixes this
issue.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now each architecture has the own dma_get_cache_alignment implementation.
dma_get_cache_alignment returns the minimum DMA alignment. Architectures
define it as ARCH_KMALLOC_MINALIGN (it's used to make sure that malloc'ed
buffer is DMA-safe; the buffer doesn't share a cache with the others). So
we can unify dma_get_cache_alignment implementations.
This patch:
dma_get_cache_alignment() needs to know if an architecture defines
ARCH_KMALLOC_MINALIGN or not (needs to know if architecture has DMA
alignment restriction). However, slab.h define ARCH_KMALLOC_MINALIGN if
architectures doesn't define it.
Let's rename ARCH_KMALLOC_MINALIGN to ARCH_DMA_MINALIGN.
ARCH_KMALLOC_MINALIGN is used only in the internals of slab/slob/slub
(except for crypto).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
xen-blkfront: fix missing out label
blkdev: fix blkdev_issue_zeroout return value
block: update request stacking methods to support discards
block: fix missing export of blk_types.h
writeback: fix bad _bh spinlock nesting
drbd: revert "delay probes", feature is being re-implemented differently
drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
drbd: Disable delay probes for the upcomming release
writeback: cleanup bdi_register
writeback: add new tracepoints
writeback: remove unnecessary init_timer call
writeback: optimize periodic bdi thread wakeups
writeback: prevent unnecessary bdi threads wakeups
writeback: move bdi threads exiting logic to the forker thread
writeback: restructure bdi forker loop a little
writeback: move last_active to bdi
writeback: do not remove bdi from bdi_list
writeback: simplify bdi code a little
writeback: do not lose wake-ups in bdi threads
...
Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
drivers/scsi/scsi_error.c as per Jens.
This patch is against the 2.6.34 source.
Paraphrased from the 1989 BSD patch by David Borman @ cray.com:
These are the changes needed for the kernel to support
LINEMODE in the server.
There is a new bit in the termios local flag word, EXTPROC.
When this bit is set, several aspects of the terminal driver
are disabled. Input line editing, character echo, and mapping
of signals are all disabled. This allows the telnetd to turn
off these functions when in linemode, but still keep track of
what state the user wants the terminal to be in.
New ioctl:
TIOCSIG Generate a signal to processes in the
current process group of the pty.
There is a new mode for packet driver, the TIOCPKT_IOCTL bit.
When packet mode is turned on in the pty, and the EXTPROC bit
is set, then whenever the state of the pty is changed, the
next read on the master side of the pty will have the TIOCPKT_IOCTL
bit set. This allows the process on the server side of the pty
to know when the state of the terminal has changed; it can then
issue the appropriate ioctl to retrieve the new state.
Since the original BSD patches accompanied the source code for telnet
I've left that reference here, but obviously the feature is useful for
any remote terminal protocol, including ssh.
The corresponding feature has existed in the BSD tty driver since 1989.
For historical reference, a good copy of the relevant files can be found
here:
http://anonsvn.mit.edu/viewvc/krb5/trunk/src/appl/telnet/?pathrev=17741
Signed-off-by: Howard Chu <hyc@symas.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
kunmap_atomic() is currently at level -4 on Rusty's "Hard To Misuse"
list[1] ("Follow common convention and you'll get it wrong"), except in
some architectures when CONFIG_DEBUG_HIGHMEM is set[2][3].
kunmap() takes a pointer to a struct page; kunmap_atomic(), however, takes
takes a pointer to within the page itself. This seems to once in a while
trip people up (the convention they are following is the one from
kunmap()).
Make it much harder to misuse, by moving it to level 9 on Rusty's list[4]
("The compiler/linker won't let you get it wrong"). This is done by
refusing to build if the type of its first argument is a pointer to a
struct page.
The real kunmap_atomic() is renamed to kunmap_atomic_notypecheck()
(which is what you would call in case for some strange reason calling it
with a pointer to a struct page is not incorrect in your code).
The previous version of this patch was compile tested on x86-64.
[1] http://ozlabs.org/~rusty/index.cgi/tech/2008-04-01.html
[2] In these cases, it is at level 5, "Do it right or it will always
break at runtime."
[3] At least mips and powerpc look very similar, and sparc also seems to
share a common ancestor with both; there seems to be quite some
degree of copy-and-paste coding here. The include/asm/highmem.h file
for these three archs mention x86 CPUs at its top.
[4] http://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html
[5] As an aside, could someone tell me why mn10300 uses unsigned long as
the first parameter of kunmap_atomic() instead of void *?
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Cc: Russell King <linux@arm.linux.org.uk> (arch/arm)
Cc: Ralf Baechle <ralf@linux-mips.org> (arch/mips)
Cc: David Howells <dhowells@redhat.com> (arch/frv, arch/mn10300)
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com> (arch/mn10300)
Cc: Kyle McMartin <kyle@mcmartin.ca> (arch/parisc)
Cc: Helge Deller <deller@gmx.de> (arch/parisc)
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> (arch/parisc)
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> (arch/powerpc)
Cc: Paul Mackerras <paulus@samba.org> (arch/powerpc)
Cc: "David S. Miller" <davem@davemloft.net> (arch/sparc)
Cc: Thomas Gleixner <tglx@linutronix.de> (arch/x86)
Cc: Ingo Molnar <mingo@redhat.com> (arch/x86)
Cc: "H. Peter Anvin" <hpa@zytor.com> (arch/x86)
Cc: Arnd Bergmann <arnd@arndb.de> (include/asm-generic)
Cc: Rusty Russell <rusty@rustcorp.com.au> ("Hard To Misuse" list)
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Work around a silicon bug in the ac97 reset functionality of the
mpc5200(b). The implementation of the ac97 "cold" reset is flawed.
If the sync and output lines are high when reset is asserted the
attached ac97 device may go into test mode. Avoid this by
reconfiguring the psc to gpio mode and generating the reset manually.
From MPC5200B User's Manual:
"Some AC97 devices goes to a test mode, if the Sync line is high
during the Res line is low (reset phase). To avoid this behavior the
Sync line must be also forced to zero during the reset phase. To do
that, the pin muxing should switch to GPIO mode and the GPIO control
register should be used to control the output lines."
Signed-off-by: Eric Millbrandt <emillbrandt@dekaresearch.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (27 commits)
sched: Use correct macro to display sched_child_runs_first in /proc/sched_debug
sched: No need for bootmem special cases
sched: Revert nohz_ratelimit() for now
sched: Reduce update_group_power() calls
sched: Update rq->clock for nohz balanced cpus
sched: Fix spelling of sibling
sched, cpuset: Drop __cpuexit from cpu hotplug callbacks
sched: Fix the racy usage of thread_group_cputimer() in fastpath_timer_check()
sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()
sched: thread_group_cputime: Simplify, document the "alive" check
sched: Remove the obsolete exit_state/signal hacks
sched: task_tick_rt: Remove the obsolete ->signal != NULL check
sched: __sched_setscheduler: Read the RLIMIT_RTPRIO value lockless
sched: Fix comments to make them DocBook happy
sched: Fix fix_small_capacity
powerpc: Exclude arch_sd_sibiling_asym_packing() on UP
powerpc: Enable asymmetric SMT scheduling on POWER7
sched: Add asymmetric group packing option for sibling domain
sched: Fix capacity calculations for SMT4
sched: Change nohz idle load balancing logic to push model
...
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (162 commits)
tracing/kprobes: unregister_trace_probe needs to be called under mutex
perf: expose event__process function
perf events: Fix mmap offset determination
perf, powerpc: fsl_emb: Restore setting perf_sample_data.period
perf, powerpc: Convert the FSL driver to use local64_t
perf tools: Don't keep unreferenced maps when unmaps are detected
perf session: Invalidate last_match when removing threads from rb_tree
perf session: Free the ref_reloc_sym memory at the right place
x86,mmiotrace: Add support for tracing STOS instruction
perf, sched migration: Librarize task states and event headers helpers
perf, sched migration: Librarize the GUI class
perf, sched migration: Make the GUI class client agnostic
perf, sched migration: Make it vertically scrollable
perf, sched migration: Parameterize cpu height and spacing
perf, sched migration: Fix key bindings
perf, sched migration: Ignore unhandled task states
perf, sched migration: Handle ignored migrate out events
perf: New migration tool overview
tracing: Drop cpparg() macro
perf: Use tracepoint_synchronize_unregister() to flush any pending tracepoint call
...
Fix up trivial conflicts in Makefile and drivers/cpufreq/cpufreq.c
* 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6: (63 commits)
of/platform: Register of_platform_drivers with an "of:" prefix
of/address: Clean up function declarations
of/spi: call of_register_spi_devices() from spi core code
of: Provide default of_node_to_nid() implementation.
of/device: Make of_device_make_bus_id() usable by other code.
of/irq: Fix endian issues in parsing interrupt specifiers
of: Fix phandle endian issues
of/flattree: fix of_flat_dt_is_compatible() to match the full compatible string
of: remove of_default_bus_ids
of: make of_find_device_by_node generic
microblaze: remove references to of_device and to_of_device
sparc: remove references to of_device and to_of_device
powerpc: remove references to of_device and to_of_device
of/device: Replace of_device with platform_device in includes and core code
of/device: Protect against binding of_platform_drivers to non-OF devices
of: remove asm/of_device.h
of: remove asm/of_platform.h
of/platform: remove all of_bus_type and of_platform_bus_type references
of: Merge of_platform_bus_type with platform_bus_type
drivercore/of: Add OF style matching to platform bus
...
Fix up trivial conflicts in arch/microblaze/kernel/Makefile due to just
some obj-y removals by the devicetree branch, while the microblaze
updates added a new file.
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (79 commits)
powerpc/8xx: Add support for the MPC8xx based boards from TQC
powerpc/85xx: Introduce support for the Freescale P1022DS reference board
powerpc/85xx: Adding DTS for the STx GP3-SSA MPC8555 board
powerpc/85xx: Change deprecated binding for 85xx-based boards
powerpc/tqm85xx: add a quirk for ti1520 PCMCIA bridge
powerpc/tqm85xx: update PCI interrupt-map attribute
powerpc/mpc8308rdb: support for MPC8308RDB board from Freescale
powerpc/fsl_pci: add quirk for mpc8308 pcie bridge
powerpc/85xx: Cleanup QE initialization for MPC85xxMDS boards
powerpc/85xx: Fix booting for P1021MDS boards
powerpc/85xx: Fix SWIOTLB initalization for MPC85xxMDS boards
powerpc/85xx: kexec for SMP 85xx BookE systems
powerpc/5200/i2c: improve i2c bus error recovery
of/xilinxfb: update tft compatible versions
powerpc/fsl-diu-fb: Support setting display mode using EDID
powerpc/5121: doc/dts-bindings: update doc of FSL DIU bindings
powerpc/5121: shared DIU framebuffer support
powerpc/5121: move fsl-diu-fb.h to include/linux
powerpc/5121: fsl-diu-fb: fix issue with re-enabling DIU area descriptor
powerpc/512x: add clock structure for Video-IN (VIU) unit
...
The RMA (RMO is a misnomer) is a concept specific to ppc64 (in fact
server ppc64 though I hijack it on embedded ppc64 for similar purposes)
and represents the area of memory that can be accessed in real mode
(aka with MMU off), or on embedded, from the exception vectors (which
is bolted in the TLB) which pretty much boils down to the same thing.
We take that out of the generic MEMBLOCK data structure and move it into
arch/powerpc where it belongs, renaming it to "RMA" while at it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This introduce memblock.current_limit which is used to limit allocations
from memblock_alloc() or memblock_alloc_base(..., MEMBLOCK_ALLOC_ACCESSIBLE).
The old MEMBLOCK_ALLOC_ANYWHERE changes value from 0 to ~(u64)0 and can still
be used with memblock_alloc_base() to allocate really anywhere.
It is -no-longer- cropped to MEMBLOCK_REAL_LIMIT which disappears.
Note to archs: I'm leaving the default limit to MEMBLOCK_ALLOC_ANYWHERE. I
strongly recommend that you ensure that you set an appropriate limit
during boot in order to guarantee that an memblock_alloc() at any time
results in something that is accessible with a simple __va().
The reason is that a subsequent patch will introduce the ability for
the array to resize itself by reallocating itself. The MEMBLOCK core will
honor the current limit when performing those allocations.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1443 commits)
phy/marvell: add 88ec048 support
igb: Program MDICNFG register prior to PHY init
e1000e: correct MAC-PHY interconnect register offset for 82579
hso: Add new product ID
can: Add driver for esd CAN-USB/2 device
l2tp: fix export of header file for userspace
can-raw: Fix skb_orphan_try handling
Revert "net: remove zap_completion_queue"
net: cleanup inclusion
phy/marvell: add 88e1121 interface mode support
u32: negative offset fix
net: Fix a typo from "dev" to "ndev"
igb: Use irq_synchronize per vector when using MSI-X
ixgbevf: fix null pointer dereference due to filter being set for VLAN 0
e1000e: Fix irq_synchronize in MSI-X case
e1000e: register pm_qos request on hardware activation
ip_fragment: fix subtracting PPPOE_SES_HLEN from mtu twice
net: Add getsockopt support for TCP thin-streams
cxgb4: update driver version
cxgb4: add new PCI IDs
...
Manually fix up conflicts in:
- drivers/net/e1000e/netdev.c: due to pm_qos registration
infrastructure changes
- drivers/net/phy/marvell.c: conflict between adding 88ec048 support
and cleaning up the IDs
- drivers/net/wireless/ipw2x00/ipw2100.c: trivial ipw2100_pm_qos_req
conflict (registration change vs marking it static)
MPC5121 DIU configuration/setup as initialized by the boot
loader currently will get lost while booting Linux. As a
result displaying the boot splash is not possible through
the boot process.
To prevent this we reserve configured DIU frame buffer
address range while booting and preserve AOI descriptor
and gamma table so that DIU continues displaying through
the whole boot process. On first open from user space
DIU frame buffer driver releases the reserved frame
buffer area and continues to operate as usual.
Signed-off-by: John Rigby <jcrigby@gmail.com>
Signed-off-by: Anatolij Gustschin <agust@denx.de>
Acked-by: Timur Tabi <timur@freescale.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
This patch converts unnecessary divide and modulo operations
in the KVM large page related code into logical operations.
This allows to convert gfn_t to u64 while not breaking 32
bit builds.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We just introduced generic functions to handle shadow pages on PPC.
This patch makes the respective backends make use of them, getting
rid of a lot of duplicate code along the way.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Initially we had to search for pte entries to invalidate them. Since
the logic has improved since then, we can just get rid of the search
function.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch moves the declaration of of_get_address(), of_get_pci_address(),
and of_pci_address_to_resource() out of arch code and into the common
linux/of_address header file.
This patch also fixes some of the asm/prom.h ordering issues. It still
includes some header files that it ideally shouldn't be, but at least the
ordering is consistent now so that of_* overrides work.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Instead of instantiating a whole thread_struct on the stack use only the
required parts of it.
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Tested-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
With dynamic PACAs, the kexecing CPU's PACA won't lie within the kernel
static data and there is a chance that something may stomp it when preparing
to kexec. This patch switches this final CPU to a static PACA just before
we pull the switch.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
of_node_to_nid() is only relevant in a few architectures. Don't force
everyone to implement it anyway.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
of_device is currently just an #define alias to platform_device until it
gets removed entirely. This patch removes references to it from the
include directories and the core drivers/of code.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: David S. Miller <davem@davemloft.net>
It is mostly unused now. Sparc has a few defines left in it, but they
can be moved to other headers. Removing this header means that new
architectures adding CONFIG_OF support don't need to also add this
header file.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: David S. Miller <davem@davemloft.net>
Only thing left in it is of_instantiate_rtc() which can be moved to
asm/prom.h on PowerPC and is unused in microblaze.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: David S. Miller <davem@davemloft.net>
This adds some debug output to our MMU hash code to print out some
useful debug data if the hypervisor refuses the insertion (which
should normally never happen).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
The KEXEC_*_MEMORY_LIMITs are inclusive addresses. We define them as
2Gs as that is what we allow mapping via TLBs. However, this should be
2G - 1 to be inclusive, otherwise if we have >2G of memory in a system
we fail to boot properly via kexec.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
via following scripts
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/lmb/memblock/g' \
-e 's/LMB/MEMBLOCK/g' \
$FILES
for N in $(find . -name lmb.[ch]); do
M=$(echo $N | sed 's/lmb/memblock/g')
mv $N $M
done
and remove some wrong change like lmbench and dlmb etc.
also move memblock.c from lib/ to mm/
Suggested-by: Ingo Molnar <mingo@elte.hu>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Use the MMU config registers to scan for available direct and
indirect page sizes and print out the result. Will be needed
for future hugetlbfs implementation.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We use a similar technique to ppc32: We set a thread local flag
to indicate that we are about to enter or have entered the stop
state, and have fixup code in the async interrupt entry code that
reacts to this flag to make us return to a different location
(sets NIP to LINK in our case).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
--
v2. Fix lockdep bug
Re-mask interrupts when coming back from idle
This saves runtime memory and fixes lots of sparse warnings like this:
CHECK arch/powerpc/sysdev/micropatch.c
arch/powerpc/sysdev/micropatch.c:27:6: warning: symbol 'patch_2000'
was not declared. Should it be static?
arch/powerpc/sysdev/micropatch.c:146:6: warning: symbol 'patch_2f00'
was not declared. Should it be static?
...
Signed-off-by: Anton Vorontsov <avorontsov@mvista.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
spi_t was removed in commit 644b2a680c
("powerpc/cpm: Remove SPI defines and spi structs"), the commit assumed
that spi_t isn't used anywhere outside of the spi_mpc8xxx driver. But
it appears that the struct is needed for micropatch code. So, let's
reintroduce the struct.
Fixes the following build issue:
CC arch/powerpc/sysdev/micropatch.o
micropatch.c: In function 'cpm_load_patch':
micropatch.c:629: error: expected '=', ',', ';', 'asm' or '__attribute__' before '*' token
micropatch.c:629: error: 'spp' undeclared (first use in this function)
micropatch.c:629: error: (Each undeclared identifier is reported only once
micropatch.c:629: error: for each function it appears in.)
Reported-by: LEROY Christophe <christophe.leroy@c-s.fr>
Reported-by: Tony Breeds <tony@bakeyournoodle.com>
Cc: <stable@kernel.org> [ .33, .34 ]
Signed-off-by: Anton Vorontsov <avorontsov@mvista.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
If we are soft disabled and receive a doorbell exception we don't process
it immediately. This means we need to check on the way out of irq restore
if there are any doorbell exceptions to process.
The problem is at that point we don't know what our regs are, and that
in turn makes xmon unhappy. To workaround the problem, instead of checking
for and processing doorbells, we check for any doorbells and if there were
any we send ourselves another.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The doorbells use the content of the PIR register to match messages
from other CPUs. This may or may not be the same as our linux CPU
number, so using that as the "target" is no right.
Instead, we sample the PIR register at boot on every processor
and use that value subsequently when sending IPIs.
We also use a per-cpu message mask rather than a global array which
should limit cache line contention.
Note: We could use the CPU number in the device-tree instead of
the PIR register, as they are supposed to be equivalent. This
might prove useful if doorbells are to be used to kick CPUs out
of FW at boot time, thus before we can sample the PIR. This is
however not the case now and using the PIR just works.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Our handling of debug interrupts on Book3E 64-bit is not quite
the way it should be just yet. This is a workaround to let gdb
work at least for now. We ensure that when context switching,
we set the appropriate DBCR0 value for the new task. We also
make sure that we turn off MSR[DE] within the kernel, and set
it as part of the bits that get set when going back to userspace.
In the long run, we will probably set the userspace DBCR0 on the
exception exit code path and ensure we have some proper kernel
value to set on the way into the kernel, a bit like ppc32 does,
but that will take more work.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Form 1 affinity allows multiple entries in ibm,associativity-reference-points
which represent affinity domains in decreasing order of importance. The
Linux concept of a node is always the first entry, but using the other
values as an input to node_distance() allows the memory allocator to make
better decisions on which node to go first when local memory has been
exhausted.
We keep things simple and create an array indexed by NUMA node, capped at
4 entries. Each time we lookup an associativity property we initialise
the array which is overkill, but since we should only hit this path during
boot it didn't seem worth adding a per node valid bit.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The RAS code has a #define, RAS_VECTOR_OFFSET, that's used in the
check-exception RTAS call for the vector offset of the exception.
We'll be using this same vector offset for the upcoming IO Event interrupts
code (0x500) so let's move it to include/asm/rtas.h and call it
RTAS_VECTOR_EXTERNAL_INTERRUPT.
Signed-off-by: Mark Nelson <markn@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Now we dynamically allocate the paca array, it takes an extra load
whenever we want to access another cpu's paca. One place we do that a lot
is per cpu variables. A simple example:
DEFINE_PER_CPU(unsigned long, vara);
unsigned long test4(int cpu)
{
return per_cpu(vara, cpu);
}
This takes 4 loads, 5 if you include the actual load of the per cpu variable:
ld r11,-32760(r30) # load address of paca pointer
ld r9,-32768(r30) # load link address of percpu variable
sldi r3,r29,9 # get offset into paca (each entry is 512 bytes)
ld r0,0(r11) # load paca pointer
add r3,r0,r3 # paca + offset
ld r11,64(r3) # load paca[cpu].data_offset
ldx r3,r9,r11 # load per cpu variable
If we remove the ppc64 specific per_cpu_offset(), we get the generic one
which indexes into a statically allocated array. This removes one load and
one add:
ld r11,-32760(r30) # load address of __per_cpu_offset
ld r9,-32768(r30) # load link address of percpu variable
sldi r3,r29,3 # get offset into __per_cpu_offset (each entry 8 bytes)
ldx r11,r11,r3 # load __per_cpu_offset[cpu]
ldx r3,r9,r11 # load per cpu variable
Having all the offsets in one array also helps when iterating over a per cpu
variable across a number of cpus, such as in the scheduler. Before we would
need to load one paca cacheline when calculating each per cpu offset. Now we
have 16 (128 / sizeof(long)) per cpu offsets in each cacheline.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Fix smatch warning: constant 0x8000000000000000 is so big it is unsigned long
Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Enables support for HMC initiated partition hibernation. This is
a firmware assisted hibernation, since the firmware handles writing
the memory out to disk, along with other partition information,
so we just mimic suspend to ram.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Partition hibernation will use some of the same code as is
currently used for Live Partition Migration. This function
further abstracts this code such that code outside of rtas.c
can utilize it. It also changes the error field in the suspend
me data structure to be an atomic type, since it is set and
checked on different cpus without any barriers or locking.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since the decrementer and timekeeping code was moved over to using
the generic clockevents and timekeeping infrastructure, several
variables and functions have been obsolete and effectively unused.
This deletes them.
In particular, wakeup_decrementer() is no longer needed since the
generic code reprograms the decrementer as part of the process of
resuming the timekeeping code, which happens during sysdev resume.
Thus the wakeup_decrementer calls in the suspend_enter methods for
52xx platforms have been removed. The call in the powermac cpu
frequency change code has been replaced by set_dec(1), which will
cause a timer interrupt as soon as interrupts are enabled, and the
generic code will then reprogram the decrementer with the correct
value.
This also simplifies the generic_suspend_en/disable_irqs functions
and makes them static since they are not referenced outside time.c.
The preempt_enable/disable calls are removed because the generic
code has disabled all but the boot cpu at the point where these
functions are called, so we can't be moved to another cpu.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently it is possible for userspace to see the result of
gettimeofday() going backwards by 1 microsecond, assuming that
userspace is using the gettimeofday() in the VDSO. The VDSO
gettimeofday() algorithm computes the time in "xsecs", which are
units of 2^-20 seconds, or approximately 0.954 microseconds,
using the algorithm
now = (timebase - tb_orig_stamp) * tb_to_xs + stamp_xsec
and then converts the time in xsecs to seconds and microseconds.
The kernel updates the tb_orig_stamp and stamp_xsec values every
tick in update_vsyscall(). If the length of the tick is not an
integer number of xsecs, then some precision is lost in converting
the current time to xsecs. For example, with CONFIG_HZ=1000, the
tick is 1ms long, which is 1048.576 xsecs. That means that
stamp_xsec will advance by either 1048 or 1049 on each tick.
With the right conditions, it is possible for userspace to get
(timebase - tb_orig_stamp) * tb_to_xs being 1049 if the kernel is
slightly late in updating the vdso_datapage, and then for stamp_xsec
to advance by 1048 when the kernel does update it, and for userspace
to then see (timebase - tb_orig_stamp) * tb_to_xs being zero due to
integer truncation. The result is that time appears to go backwards
by 1 microsecond.
To fix this we change the VDSO gettimeofday to use a new field in the
VDSO datapage which stores the nanoseconds part of the time as a
fractional number of seconds in a 0.32 binary fraction format.
(Or put another way, as a 32-bit number in units of 0.23283 ns.)
This is convenient because we can use the mulhwu instruction to
convert it to either microseconds or nanoseconds.
Since it turns out that computing the time of day using this new field
is simpler than either using stamp_xsec (as gettimeofday does) or
stamp_xtime.tv_nsec (as clock_gettime does), this converts both
gettimeofday and clock_gettime to use the new field. The existing
__do_get_tspec function is converted to use the new field and take
a parameter in r7 that indicates the desired resolution, 1,000,000
for microseconds or 1,000,000,000 for nanoseconds. The __do_get_xsec
function is then unused and is deleted.
The new algorithm is
now = ((timebase - tb_orig_stamp) << 12) * tb_to_xs
+ (stamp_xtime_seconds << 32) + stamp_sec_fraction
with 'now' in units of 2^-32 seconds. That is then converted to
seconds and either microseconds or nanoseconds with
seconds = now >> 32
partseconds = ((now & 0xffffffff) * resolution) >> 32
The 32-bit VDSO code also makes a further simplification: it ignores
the bottom 32 bits of the tb_to_xs value, which is a 0.64 format binary
fraction. Doing so gets rid of 4 multiply instructions. Assuming
a timebase frequency of 1GHz or less and an update interval of no
more than 10ms, the upper 32 bits of tb_to_xs will be at least
4503599, so the error from ignoring the low 32 bits will be at most
2.2ns, which is more than an order of magnitude less than the time
taken to do gettimeofday or clock_gettime on our fastest processors,
so there is no possibility of seeing inconsistent values due to this.
This also moves update_gtod() down next to its only caller, and makes
update_vsyscall use the time passed in via the wall_time argument rather
than accessing xtime directly. At present, wall_time always points to
xtime, but that could change in future.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Build of ptrace.h failed for assembly because it
pulls in stdint.h.
Use exportable types (__u32, __u64) to avoid the dependency
on stdint.h.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Andrey Volkov <avolkov@varma-el.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch merges the common routines of_device_alloc() and
of_device_make_bus_id() from powerpc and microblaze.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
CC: Michal Simek <monstr@monstr.eu>
CC: Grant Likely <grant.likely@secretlab.ca>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: microblaze-uclinux@itee.uq.edu.au
CC: linuxppc-dev@ozlabs.org
CC: devicetree-discuss@lists.ozlabs.org
Merge common code between PowerPC and microblaze. This patch merges
the code that scans the tree and registers devices. The functions
merged are of_platform_bus_probe(), of_platform_bus_create(), and
of_platform_device_create().
This patch also move the of_default_bus_ids[] table out of a Microblaze
header file and makes it non-static. The device ids table isn't merged
because powerpc and microblaze use different default data.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
CC: Michal Simek <monstr@monstr.eu>
CC: Grant Likely <grant.likely@secretlab.ca>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: microblaze-uclinux@itee.uq.edu.au
CC: linuxppc-dev@ozlabs.org
Merge common code between powerpc and microblaze
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
CC: Michal Simek <monstr@monstr.eu>
CC: Wolfram Sang <w.sang@pengutronix.de>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: microblaze-uclinux@itee.uq.edu.au
CC: linuxppc-dev@ozlabs.org
Microblaze and PowerPC share a large chunk of code for translating
OF device tree data into usable addresses. Differences between the two
consist of cosmetic differences, and the addition of dma-ranges support
code to powerpc but not microblaze. This patch moves the powerpc
version into common code and applies many of the cosmetic (non-functional)
changes from the microblaze version.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Michal Simek <monstr@monstr.eu>
CC: Wolfram Sang <w.sang@pengutronix.de>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
Merge common code between PowerPC and Microblaze. This patch also
moves the prototype of pci_address_to_pio() out of pci-bridge.h and
into prom.h because the only user of pci_address_to_pio() is
of_address_to_resource().
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Michal Simek <monstr@monstr.eu>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
Merge common code between Microblaze and PowerPC. This patch creates
new of_address.h and address.c files to containing address translation
and mapping routines. First routine to be moved it of_iomap()
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Michal Simek <monstr@monstr.eu>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
Merge common irq mapping code between PowerPC and Microblaze.
This patch merges of_irq_find_parent(), of_irq_map_raw() and
of_irq_map_one(). The functions are dependent on one another, so all
three are merged in a single patch. Other than cosmetic difference
(ie. DBG() vs. pr_debug()), the implementations are identical.
of_irq_to_resource() is also merged, but in this case the
implementations are different. This patch drops the microblaze version
and uses the powerpc implementation unchanged. The microblaze version
essentially open-coded irq_of_parse_and_map() which it does not need
to do. Therefore the powerpc version is safe to adopt.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
CC: Michal Simek <monstr@monstr.eu>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
The code that figures out what is wrong with the powermac irq device
tree data belongs with the rest of the powermac irq code. This patch
moves it out of prom_parse.c and into powermac/pic.c so that it is only
compiled in when actually needed.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
At present, hw_breakpoint_slots() returns 1 regardless of what
type of breakpoint is specified in the type argument. Since we
don't define CONFIG_HAVE_MIXED_BREAKPOINTS_REGS, there are
separate values for TYPE_INST and TYPE_DATA, and hw_breakpoint_slots()
returns 1 for both, effectively advertising instruction breakpoint
support which doesn't exist.
This fixes it by making hw_breakpoint_slots return 1 for TYPE_DATA
and 0 for TYPE_INST. This moves hw_breakpoint_slots() from the
powerpc hw_breakpoint.h to hw_breakpoint.c because the definitions
of TYPE_INST and TYPE_DATA aren't available in <asm/hw_breakpoint.h>.
They are defined in <linux/hw_breakpoint.h> but we can't include
that header in <asm/hw_breakpoint.h>, and nor can we rely on
<linux/hw_breakpoint.h> being included before <asm/hw_breakpoint.h>.
Since hw_breakpoint_slots() is only called at boot time, there is
no performance impact from making it a real function rather than
a static inline.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Merge common code between PowerPC and Microblaze. SPARC implements
irq_of_parse_and_map(), but the implementation is different, so it
does not use this code.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Jeremy Kerr <jeremy.kerr@canonical.com>
Now that the device tree node pointer has been moved out of struct
of_device and into the common struct device, there isn't anything
unique about of_device anymore. In fact, there isn't much need
for a separate of_bus when all busses have access to OF style
probing.
arch/powerpc and arch/microblaze are moving away from using the of_bus
and using the regular platform bus instead for mmio devices. This
patch makes of_device the same as platform_device as a stepping stone
in migrating of_platform_drivers over to the platform bus.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Many a times, the requested breakpoint length can be less than the
fixed breakpoint length i.e. 8 bytes supported by PowerPC 64-bit
server (Book III S) processors. This could lead to extraneous
interrupts resulting in false breakpoint notifications. This
detects and discards such interrupts for non-ptrace requests.
We don't change ptrace behaviour to avoid breaking compatability.
[Suggestion from Paul Mackerras <paulus@samba.org> to add a new flag in
'struct arch_hw_breakpoint' to identify extraneous interrupts]
Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
A signal delivered between a hw_breakpoint_handler() and the
single_step_dabr_instruction() will not have the breakpoint active
while the signal handler is running -- the signal delivery will
set up a new MSR value which will not have MSR_SE set, so we
won't get the signal step interrupt until and unless the signal
handler returns (which it may never do).
To fix this, we restore the breakpoint when delivering a signal --
we clear the MSR_SE bit and set the DABR again. If the signal
handler returns, the DABR interrupt will occur again when the
instruction that we were originally trying to single-step gets
re-executed.
[Paul Mackerras <paulus@samba.org> pointed out the need to do this.]
Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Implement perf-events based hw-breakpoint interfaces for PowerPC
64-bit server (Book III S) processors. This allows access to a
given location to be used as an event that can be counted or
profiled by the perf_events subsystem.
This is done using the DABR (data breakpoint register), which can
also be used for process debugging via ptrace. When perf_event
hw_breakpoint support is configured in, the perf_event subsystem
manages the DABR and arbitrates access to it, and ptrace then
creates a perf_event when it is requested to set a data breakpoint.
[Adopted suggestions from Paul Mackerras <paulus@samba.org> to
- emulate_step() all system-wide breakpoints and single-step only the
per-task breakpoints
- perform arch-specific cleanup before unregistration through
arch_unregister_hw_breakpoint()
]
Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This extends the emulate_step() function to handle a large proportion
of the Book I instructions implemented on current 64-bit server
processors. The aim is to handle all the load and store instructions
used in the kernel, plus all of the instructions that appear between
l[wd]arx and st[wd]cx., so this handles the Altivec/VMX lvx and stvx
and the VSX lxv2dx and stxv2dx instructions (implemented in POWER7).
The new code can emulate user mode instructions, and checks the
effective address for a load or store if the saved state is for
user mode. It doesn't handle little-endian mode at present.
For floating-point, Altivec/VMX and VSX instructions, it checks
that the saved MSR has the enable bit for the relevant facility
set, and if so, assumes that the FP/VMX/VSX registers contain
valid state, and does loads or stores directly to/from the
FP/VMX/VSX registers, using assembly helpers in ldstfp.S.
Instructions supported now include:
* Loads and stores, including some but not all VMX and VSX instructions,
and lmw/stmw
* Atomic loads and stores (l[dw]arx, st[dw]cx.)
* Arithmetic instructions (add, subtract, multiply, divide, etc.)
* Compare instructions
* Rotate and mask instructions
* Shift instructions
* Logical instructions (and, or, xor, etc.)
* Condition register logical instructions
* mtcrf, cntlz[wd], exts[bhw]
* isync, sync, lwsync, ptesync, eieio
* Cache operations (dcbf, dcbst, dcbt, dcbtst)
The overflow-checking arithmetic instructions are not included, but
they appear not to be ever used in C code.
This uses decimal values for the minor opcodes in the switch statements
because that is what appears in the Power ISA specification, thus it is
easier to check that they are correct if they are in decimal.
If this is used to single-step an instruction where a data breakpoint
interrupt occurred, then there is the possibility that the instruction
is a lwarx or ldarx. In that case we have to be careful not to lose the
reservation until we get to the matching st[wd]cx., or we'll never make
forward progress. One alternative is to try to arrange that we can
return from interrupts and handle data breakpoint interrupts without
losing the reservation, which means not using any spinlocks, mutexes,
or atomic ops (including bitops). That seems rather fragile. The
other alternative is to emulate the larx/stcx and all the instructions
in between. This is why this commit adds support for a wide range
of integer instructions.
Signed-off-by: Paul Mackerras <paulus@samba.org>
In old kernels, NET_SKB_PAD was defined to 16.
Then commit d6301d3dd1 (net: Increase default NET_SKB_PAD to 32), and
commit 18e8c134f4 (net: Increase NET_SKB_PAD to 64 bytes) increased it
to 64.
While first patch was governed by network stack needs, second was more
driven by performance issues on current hardware. Real intent was to
align data on a cache line boundary.
So use max(32, L1_CACHE_BYTES) instead of 64, to be more generic.
Remove microblaze and powerpc own NET_SKB_PAD definitions.
Thanks to Alexander Duyck and David Miller for their comments.
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Irq stacks provide an essential protection from stack overflows through
external interrupts, at the cost of two additionals stacks per CPU.
Enable them unconditionally to simplify the kernel build and prevent
people from accidentally disabling them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We are seeing boot fails on some System p machines when using the kdump
crashkernel= boot option. The default kdump base address is 32MB, so if we
reserve 256MB for kdump then we reserve all of the RMO except the first 32MB.
We really want kdump to reserve some memory in the RMO and most of it
elsewhere but that will require more significant changes. For now we can shift
the default base address to 64MB when CONFIG_PPC64 and CONFIG_RELOCATABLE are
set. This isn't quite correct since what we really care about is the kdump
kernel is relocatable, but we already make the assumption that base kernel
and kdump kernel have the same CONFIG_RELOCATABLE setting, eg:
#ifndef CONFIG_RELOCATABLE
if (crashk_res.start != KDUMP_KERNELBASE)
printk("Crash kernel location must be 0x%x\n",
KDUMP_KERNELBASE);
...
RTAS is instantiated towards the top of our RMO, so if we were to go any
higher we risk not having enough RMO memory for the kdump kernel on boxes
with a 128MB RMO.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The POWER7 core has dynamic SMT mode switching which is controlled by
the hypervisor. There are 3 SMT modes:
SMT1 uses thread 0
SMT2 uses threads 0 & 1
SMT4 uses threads 0, 1, 2 & 3
When in any particular SMT mode, all threads have the same performance
as each other (ie. at any moment in time, all threads perform the same).
The SMT mode switching works such that when linux has threads 2 & 3 idle
and 0 & 1 active, it will cede (H_CEDE hypercall) threads 2 and 3 in the
idle loop and the hypervisor will automatically switch to SMT2 for that
core (independent of other cores). The opposite is not true, so if
threads 0 & 1 are idle and 2 & 3 are active, we will stay in SMT4 mode.
Similarly if thread 0 is active and threads 1, 2 & 3 are idle, we'll go
into SMT1 mode.
If we can get the core into a lower SMT mode (SMT1 is best), the threads
will perform better (since they share less core resources). Hence when
we have idle threads, we want them to be the higher ones.
This adds a feature bit for asymmetric packing to powerpc and then
enables it on POWER7.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@ozlabs.org
LKML-Reference: <20100608045702.31FB5CC8C7@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On 64bit, local_t is of size long, and thus we make local64_t an alias.
On 32bit, we fall back to atomic64_t. (architecture can provide optimized
32-bit version)
(This new facility is to be used by perf events optimizations.)
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: linux-arch@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Drop this argument now that we always want to rewind only to the
state of the first caller.
It means frame pointers are not necessary anymore to reliably get
the source of an event. But this also means we need this helper
to be a macro now, as an inline function is not an option since
we need to know when to provide a default implentation.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: David Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Grant patches added an of mach table to struct device_driver. However,
while he changed the macio device code to use that, he left the match
table pointer in struct macio_driver and didn't update drivers to use
the "new" one, thus breaking the probing.
This completes the change by moving all drivers to setup the "new"
one, removing all traces of the old one, and while at it (since it
changes the exact same locations), I also remove two other duplicates
from struct driver which are the name and owner fields.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: Don't export cvt_fd & _df when CONFIG_PPC_FPU is not set
powerpc/44x: icon: select SM502 and frame buffer console support
powerpc/85xx: Add P1021MDS board support
powerpc/85xx: Change MPC8572DS camp dtses for MSI sharing
powerpc/fsl_msi: add removal path and probe failing path
powerpc/fsl_msi: enable msi sharing through AMP OSes
powerpc/fsl_msi: enable msi allocation in all banks
powerpc/fsl_msi: fix the conflict of virt_msir's chip_data
powerpc/fsl_msi: Add multiple MSI bank support
powerpc/kexec: Add support for FSL-BookE
powerpc/fsl-booke: Move the entry setup code into a seperate file
powerpc/fsl-booke: fix the case where we are not in the first page
powerpc/85xx: Enable support for ports 3 and 4 on 8548 CDS
powerpc/fsl-booke: Add hibernation support for FSL BookE processors
powerpc/e500mc: Implement machine check handler.
powerpc/44x: Add basic ICON PPC440SPe board support
powerpc/44x: Fix UART clocks on 440SPe
powerpc/44x: Add reset-type to katmai.dts
powerpc/44x: Adding PCI-E support for PowerPC 460SX based SOC.
* 'for-35' of git://repo.or.cz/linux-kbuild: (81 commits)
kbuild: Revert part of e8d400a to resolve a conflict
kbuild: Fix checking of scm-identifier variable
gconfig: add support to show hidden options that have prompts
menuconfig: add support to show hidden options which have prompts
gconfig: remove show_debug option
gconfig: remove dbg_print_ptype() and dbg_print_stype()
kconfig: fix zconfdump()
kconfig: some small fixes
add random binaries to .gitignore
kbuild: Include gen_initramfs_list.sh and the file list in the .d file
kconfig: recalc symbol value before showing search results
.gitignore: ignore *.lzo files
headerdep: perlcritic warning
scripts/Makefile.lib: Align the output of LZO
kbuild: Generate modules.builtin in make modules_install
Revert "kbuild: specify absolute paths for cscope"
kbuild: Do not unnecessarily regenerate modules.builtin
headers_install: use local file handles
headers_check: fix perl warnings
export_report: fix perl warnings
...
There are more architectures that don't support ARCH_HAS_SG_CHAIN than
those that support it. This removes removes ARCH_HAS_SG_CHAIN in
asm-generic/scatterlist.h and lets arhictectures to define it.
It's clearer than defining ARCH_HAS_SG_CHAIN asm-generic/scatterlist.h and
undefing it in arhictectures that don't support it.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit b3b77c8cae, which was
also totally broken (see commit 0d2daf5cc8 that reverted the crc32
version of it). As reported by Stephen Rothwell, it causes problems on
big-endian machines:
> In file included from fs/jfs/jfs_types.h:33,
> from fs/jfs/jfs_incore.h:26,
> from fs/jfs/file.c:22:
> fs/jfs/endian24.h:36:101: warning: "__LITTLE_ENDIAN" is not defined
The kernel has never had that crazy "__BYTE_ORDER == __LITTLE_ENDIAN"
model. It's not how we do things, and it isn't how we _should_ do
things. So don't go there.
Requested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'next-spi' of git://git.secretlab.ca/git/linux-2.6:
spi/xilinx: Fix compile error
spi/davinci: Fix clock prescale factor computation
spi: move bitbang txrx utility functions to private header
spi/mpc5121: Add SPI master driver for MPC5121 PSC
powerpc/mpc5121: move PSC FIFO memory init to platform code
spi/ep93xx: implemented driver for Cirrus EP93xx SPI controller
Documentation/spi/* compile warning fix
spi/omap2_mcspi: Check params before dereference or use
spi/omap2_mcspi: add turbo mode support
spi/omap2_mcspi: change default DMA_MIN_BYTES value to 160
spi/pl022: fix stop queue procedure
spi/pl022: add support for the PL023 derivate
spi/pl022: fix up differences between ARM and ST versions
spi/spi_mpc8xxx: Do not use map_tx_dma to unmap rx_dma
spi/spi_mpc8xxx: Fix QE mode Litte Endian
spi/spi_mpc8xxx: fix potential memory corruption.
Linux does not define __BYTE_ORDER in its endian header files which makes
some header files bend backwards to get at the current endian. Lets
#define __BYTE_ORDER in big_endian.h/litte_endian.h to make it easier for
header files that are used in user space too.
In userspace the convention is that
1. _both_ __LITTLE_ENDIAN and __BIG_ENDIAN are defined,
2. you have to test for e.g. __BYTE_ORDER == __BIG_ENDIAN.
Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds support kexec on FSL-BookE where the MMU can not be simply
switched off. The code borrows the initial MMU-setup code to create the
identical mapping mapping. The only difference to the original boot code
is the size of the mapping(s) and the executeable address.
The kexec code maps the first 2 GiB of memory in 256 MiB steps. This
should work also on e500v1 boxes.
SMP support is still not available.
(Kumar: Added minor change to build to ifdef CONFIG_PPC_STD_MMU_64 some
code that was PPC64 specific)
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Merging in current state of Linus' tree to deal with merge conflicts and
build failures in vio.c after merge.
Conflicts:
drivers/i2c/busses/i2c-cpm.c
drivers/i2c/busses/i2c-mpc.c
drivers/net/gianfar.c
Also fixed up one line in arch/powerpc/kernel/vio.c to use the
correct node pointer.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
By moving dma_mask into pdev_archdata, and adding archdata to
struct of_device, it makes it possible to substitute of_device
with struct platform_device, which is a stepping stone to
removing the of_platform bus entirely.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
* git://git.infradead.org/iommu-2.6:
intel-iommu: Set a more specific taint flag for invalid BIOS DMAR tables
intel-iommu: Combine the BIOS DMAR table warning messages
panic: Add taint flag TAINT_FIRMWARE_WORKAROUND ('I')
panic: Allow warnings to set different taint flags
intel-iommu: intel_iommu_map_range failed at very end of address space
intel-iommu: errors with smaller iommu widths
intel-iommu: Fix boot inside 64bit virtualbox with io-apic disabled
intel-iommu: use physfn to search drhd for VF
intel-iommu: Print out iommu seq_id
intel-iommu: Don't complain that ACPI_DMAR_SCOPE_TYPE_IOAPIC is not supported
intel-iommu: Avoid global flushes with caching mode.
intel-iommu: Use correct domain ID when caching mode is enabled
intel-iommu mistakenly uses offset_pfn when caching mode is enabled
intel-iommu: use for_each_set_bit()
intel-iommu: Fix section mismatch dmar_ir_support() uses dmar_tbl.
* 'kvm-updates/2.6.35' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (269 commits)
KVM: x86: Add missing locking to arch specific vcpu ioctls
KVM: PPC: Add missing vcpu_load()/vcpu_put() in vcpu ioctls
KVM: MMU: Segregate shadow pages with different cr0.wp
KVM: x86: Check LMA bit before set_efer
KVM: Don't allow lmsw to clear cr0.pe
KVM: Add cpuid.txt file
KVM: x86: Tell the guest we'll warn it about tsc stability
x86, paravirt: don't compute pvclock adjustments if we trust the tsc
x86: KVM guest: Try using new kvm clock msrs
KVM: x86: export paravirtual cpuid flags in KVM_GET_SUPPORTED_CPUID
KVM: x86: add new KVMCLOCK cpuid feature
KVM: x86: change msr numbers for kvmclock
x86, paravirt: Add a global synchronization point for pvclock
x86, paravirt: Enable pvclock flags in vcpu_time_info structure
KVM: x86: Inject #GP with the right rip on efer writes
KVM: SVM: Don't allow nested guest to VMMCALL into host
KVM: x86: Fix exception reinjection forced to true
KVM: Fix wallclock version writing race
KVM: MMU: Don't read pdptrs with mmu spinlock held in mmu_alloc_roots
KVM: VMX: enable VMXON check with SMX enabled (Intel TXT)
...
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (92 commits)
powerpc: Remove unused 'protect4gb' boot parameter
powerpc: Build-in e1000e for pseries & ppc64_defconfig
powerpc/pseries: Make request_ras_irqs() available to other pseries code
powerpc/numa: Use ibm,architecture-vec-5 to detect form 1 affinity
powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
powerpc: Use smt_snooze_delay=-1 to always busy loop
powerpc: Remove check of ibm,smt-snooze-delay OF property
powerpc/kdump: Fix race in kdump shutdown
powerpc/kexec: Fix race in kexec shutdown
powerpc/kexec: Speedup kexec hash PTE tear down
powerpc/pseries: Add hcall to read 4 ptes at a time in real mode
powerpc: Use more accurate limit for first segment memory allocations
powerpc/kdump: Use chip->shutdown to disable IRQs
powerpc/kdump: CPUs assume the context of the oopsing CPU
powerpc/crashdump: Do not fail on NULL pointer dereferencing
powerpc/eeh: Fix oops when probing in early boot
powerpc/pci: Check devices status property when scanning OF tree
powerpc/vio: Switch VIO Bus PM to use generic helpers
powerpc: Avoid bad relocations in iSeries code
powerpc: Use common cpu_die (fixes SMP+SUSPEND build)
...
Most of the MSCR bit assigments are different in e500mc versus
e500, and they are now write-one-to-clear.
Some e500mc machine check conditions are made recoverable (as long as
they aren't stuck on), most notably L1 instruction cache parity errors.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
I noticed /proc/sys/vm/zone_reclaim_mode was 0 on a ppc64 NUMA box. It gets
enabled via this:
/*
* If another node is sufficiently far away then it is better
* to reclaim pages in a zone before going off node.
*/
if (distance > RECLAIM_DISTANCE)
zone_reclaim_mode = 1;
Since we use the default value of 20 for REMOTE_DISTANCE and 20 for
RECLAIM_DISTANCE it never kicks in.
The local to remote bandwidth ratios can be quite large on System p
machines so it makes sense for us to reclaim clean pagecache locally before
going off node.
The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
zone reclaim.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In kexec_prepare_cpus, the primary CPU IPIs the secondary CPUs to
kexec_smp_down(). kexec_smp_down() calls kexec_smp_wait() which sets
the hw_cpu_id() to -1. The primary does this while leaving IRQs on
which means the primary can take a timer interrupt which can lead to
the IPIing one of the secondary CPUs (say, for a scheduler re-balance)
but since the secondary CPU now has a hw_cpu_id = -1, we IPI CPU
-1... Kaboom!
We are hitting this case regularly on POWER7 machines.
There is also a second race, where the primary will tear down the MMU
mappings before knowing the secondaries have entered real mode.
Also, the secondaries are clearing out any pending IPIs before
guaranteeing that no more will be received.
This changes kexec_prepare_cpus() so that we turn off IRQs in the
primary CPU much earlier. It adds a paca flag to say that the
secondaries have entered the kexec_smp_down() IPI and turned off IRQs,
rather than overloading hw_cpu_id with -1. This new paca flag is
again used to in indicate when the secondaries has entered real mode.
It also ensures that all CPUs have their IRQs off before we clear out
any pending IPI requests (in kexec_cpu_down()) to ensure there are no
trailing IPIs left unacknowledged.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds plpar_pte_read_4_raw() which can be used read 4 PTEs from
PHYP at a time, while in real mode.
It also creates a new hcall9 which can be used in real mode. It's the
same as plpar_hcall9 but minus the tracing hcall statistics which may
require variables outside the RMO.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch contains the hooks and instrumentation into kernel which
live outside the kernel/debug directory, which the kdb core
will call to run commands like lsmod, dmesg, bt etc...
CC: linux-arch@vger.kernel.org
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Martin Hicks <mort@sgi.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits)
vlynq: make whole Kconfig-menu dependant on architecture
add descriptive comment for TIF_MEMDIE task flag declaration.
EEPROM: max6875: Header file cleanup
EEPROM: 93cx6: Header file cleanup
EEPROM: Header file cleanup
agp: use NULL instead of 0 when pointer is needed
rtc-v3020: make bitfield unsigned
PCI: make bitfield unsigned
jbd2: use NULL instead of 0 when pointer is needed
cciss: fix shadows sparse warning
doc: inode uses a mutex instead of a semaphore.
uml: i386: Avoid redefinition of NR_syscalls
fix "seperate" typos in comments
cocbalt_lcdfb: correct sections
doc: Change urls for sparse
Powerpc: wii: Fix typo in comment
i2o: cleanup some exit paths
Documentation/: it's -> its where appropriate
UML: Fix compiler warning due to missing task_struct declaration
UML: add kernel.h include to signal.c
...
WARN() is used in some places to report firmware or hardware bugs that
are then worked-around. These bugs do not affect the stability of the
kernel and should not set the flag for TAINT_WARN. To allow for this,
add WARN_TAINT() and WARN_TAINT_ONCE() macros that take a taint number
as argument.
Architectures that implement warnings using trap instructions instead
of calls to warn_slowpath_*() now implement __WARN_TAINT(taint)
instead of __WARN().
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Helge Deller <deller@gmx.de>
Tested-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
This patch eliminates the node pointer from struct of_device and the
of_node (or prom_node) pointer from struct dev_archdata since the node
pointer is now part of struct device proper when CONFIG_OF is set, and
all users of the old pointer locations have already been converted over
to use device->of_node.
Also remove dev_archdata_{get,set}_node() as it is no longer used by
anything.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
The following structure elements duplicate the information in
'struct device.of_node' and so are being eliminated. This patch
makes all readers of these elements use device.of_node instead.
(struct of_device *)->node
(struct dev_archdata *)->prom_node (sparc)
(struct dev_archdata *)->of_node (powerpc & microblaze)
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
When we're on a paired single capable host, we can just always enable
paired singles and expose them to the guest directly.
This approach breaks when multiple VMs run and access PS concurrently,
but this should suffice until we get a proper framework for it in Linux.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
When in split mode, instruction relocation and data relocation are not equal.
So far we implemented this mode by reserving a special pseudo-VSID for the
two cases and flushing all PTEs when going into split mode, which is slow.
Unfortunately 32bit Linux and Mac OS X use split mode extensively. So to not
slow down things too much, I came up with a different idea: Mark the split
mode with a bit in the VSID and then treat it like any other segment.
This means we can just flush the shadow segment cache, but keep the PTEs
intact. I verified that this works with ppc32 Linux and Mac OS X 10.4
guests and does speed them up.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
There are some pieces in the code that I overlooked that still use
u64s instead of longs. This slows down 32 bit hosts unnecessarily, so
let's just move them to ulong.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
We need to keep the pointer to the shadow vcpu somewhere accessible from
within really early interrupt code. The best fit I found was the thread
struct, as that resides in an SPRG.
So let's put a pointer to the shadow vcpu in the thread struct and add
an asm-offset so we can find it.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
The host shadow mmu code needs to get initialized. It needs to fetch a
segment it can use to put shadow PTEs into.
That initialization code was in generic code, which is icky. Let's move
it over to the respective MMU file.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
The shadow vcpu now contains some fields we don't use from the vcpu anymore.
Access to them happens using inline functions that happily use the shadow
vcpu fields.
So let's now ifdef them out to booke only and add asm-offsets.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
For assembly code there are several "long" load and store defines already.
The one that's missing is the typical stack store, stdu/stwu.
So let's add that define as well, making my KVM code happy.
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
Upstream recently added a new name for PPC64: Book3S_64.
So instead of using CONFIG_PPC64 we should use CONFIG_PPC_BOOK3S consotently.
That makes understanding the code easier (I hope).
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
So far we had a lot of conditional code on CONFIG_KVM_BOOK3S_64_HANDLER.
As we're moving towards common code between 32 and 64 bits, most of
these ifdefs can be moved to a more generic term define, called
CONFIG_KVM_BOOK3S_HANDLER.
This patch adds the new generic config option and moves ifdefs over.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
We already have some inline fuctions we use to access vcpu or svcpu structs,
depending on whether we're on booke or book3s. Since we just put a few more
registers into the svcpu, we also need to make sure the respective callbacks
are available and get used.
So this patch moves direct use of the now in the svcpu struct fields to
inline function calls. While at it, it also moves the definition of those
inline function calls to respective header files for booke and book3s,
greatly improving readability.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
After a lot of thought on how to make the entry / exit code easier,
I figured it'd be clever to put even more register state into the
shadow vcpu. That way we have more registers available to use, making
the code easier to read.
So this patch adds a few new fields to that shadow vcpu. Later on we
will remove the originals from the vcpu and paca.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
In analogy to the 64 bit specific header file, this is the 32 bit
pendant. With this in place we can just always call to_svcpu and
be assured we get the right pointer anywhere.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
In the process of generalizing as much code as possible, I also moved
the shadow vcpu code together to a generic book3s file. Unfortunately
the location of the shadow vcpu is different on 32 and 64 bit, so we
need a wrapper function to tell us where it is.
That sounded like a perfect fit for a subarch specific header file.
Here we can put anything that needs to be different between those two.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
We need to reserve a context from KVM to make sure we have our own
segment space. While we did that split for Book3S_64 already, 32 bit
is still outstanding.
So let's split it now.
Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
We have quite some code that can be used by Book3S_32 and Book3S_64 alike,
so let's call it "Book3S" instead of "Book3S_64", so we can later on
use it from the 32 bit port too.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
Bool defaults to at least byte width. We usually only want to waste a single
bit on this. So let's move all the bool values to bitfields, potentially
saving memory.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
Some constants were bigger than ints. Let's mark them as such so we don't
accidently truncate them.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
MOL uses its own hypercall interface to call back into userspace when
the guest wants to do something.
So let's implement that as an exit reason, specify it with a CAP and
only really use it when userspace wants us to.
The only user of it so far is MOL.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
Mac OS X has some applications - namely the Finder - that require alignment
interrupts to work properly. So we need to implement them.
But the spec for 970 and 750 also looks different. While 750 requires the
DSISR and DAR fields to reflect some instruction bits (DSISR) and the fault
address (DAR), the 970 declares this as an optional feature. So we need
to reconstruct DSISR and DAR manually.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch makes the VSID of mapped pages always reflecting all special cases
we have, like split mode.
It also changes the tlbie mask to 0x0ffff000 according to the spec. The mask
we used before was incorrect.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
DSISR is only defined as 32 bits wide. So let's reflect that in the
structs too.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
Userspace can tell us that it wants to trigger an interrupt. But
so far it can't tell us that it wants to stop triggering one.
So let's interpret the parameter to the ioctl that we have anyways
to tell us if we want to raise or lower the interrupt line.
Signed-off-by: Alexander Graf <agraf@suse.de>
v2 -> v3:
- Add CAP for unset irq
Signed-off-by: Avi Kivity <avi@redhat.com>
On PowerPC we can go into MMU Split Mode. That means that either
data relocation is on but instruction relocation is off or vice
versa.
That mode didn't work properly, as we weren't always flushing
entries when going into a new split mode, potentially mapping
different code or data that we're supposed to.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
Anton Blanchard found that large POWER systems would occasionally
crash in the exception exit path when profiling with perf_events.
The symptom was that an interrupt would occur late in the exit path
when the MSR[RI] (recoverable interrupt) bit was clear. Interrupts
should be hard-disabled at this point but they were enabled. Because
the interrupt was not recoverable the system panicked.
The reason is that the exception exit path was calling
perf_event_do_pending after hard-disabling interrupts, and
perf_event_do_pending will re-enable interrupts.
The simplest and cleanest fix for this is to use the same mechanism
that 32-bit powerpc does, namely to cause a self-IPI by setting the
decrementer to 1. This means we can remove the tests in the exception
exit path and raw_local_irq_restore.
This also makes sure that the call to perf_event_do_pending from
timer_interrupt() happens within irq_enter/irq_exit. (Note that
calling perf_event_do_pending from timer_interrupt does not mean that
there is a possible 1/HZ latency; setting the decrementer to 1 ensures
that the timer interrupt will happen immediately, i.e. within one
timebase tick, which is a few nanoseconds or 10s of nanoseconds.)
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: stable@kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Convert to the new cpumask API.
irq_choose_cpu can be simplified by using cpumask_next and cpumask_first.
smp_mpic_message_pass was doing open coded cpumask manipulation and passing an
int for a cpumask into mpic_send_ipi. Since mpic_send_ipi is only used
locally, make it static and convert it to take a cpumask. This allows us
to clean up the mess in smp_mpic_message_pass.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Convert NUMA code to new cpumask API. We shift the node to cpumask
setup code until after we complete bootmem allocation so we can
dynamically allocate the cpumasks.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Dynamically allocate cpu_sibling_map and cpu_core_map cpumasks.
We don't need to set_cpu_online() the boot cpu in smp_prepare_boot_cpu,
init/main.c does it for us.
We also postpone setting of the boot cpu in cpu_sibling_map and cpu_core_map
until when the memory allocator is available (smp_prepare_cpus), similar
to x86.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Use new cpumask_* functions, and dynamically allocate cpumask in fixup_irqs.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We need to keep track of the backing pages that get allocated by
vmemmap_populate() so that when we use kdump, the dump-capture kernel knows
where these pages are.
We use a simple linked list of structures that contain the physical address
of the backing page and corresponding virtual address to track the backing
pages.
To save space, we just use a pointer to the next struct vmemmap_backing. We
can also do this because we never remove nodes. We call the pointer "list"
to be compatible with changes made to the crash utility.
vmemmap_populate() is called either at boot-time or on a memory hotplug
operation. We don't have to worry about the boot-time calls because they
will be inherently single-threaded, and for a memory hotplug operation
vmemmap_populate() is called through:
sparse_add_one_section()
|
V
kmalloc_section_memmap()
|
V
sparse_mem_map_populate()
|
V
vmemmap_populate()
and in sparse_add_one_section() we're protected by pgdat_resize_lock().
So, we don't need a spinlock to protect the vmemmap_list.
We allocate space for the vmemmap_backing structs by allocating whole pages
in vmemmap_list_alloc() and then handing out chunks of this to
vmemmap_list_populate().
This means that we waste at most just under one page, but this keeps the code
is simple.
Signed-off-by: Mark Nelson <markn@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>