Commit graph

4289 commits

Author SHA1 Message Date
Linus Torvalds
24b6124047 KVM fixes for v4.15-rc9
ARM:
 * fix incorrect huge page mappings on systems using the contiguous hint
   for hugetlbfs
 * support alternative GICv4 init sequence
 * correctly implement the ARM SMCC for HVC and SMC handling
 
 PPC:
 * add KVM IOCTL for reporting vulnerability and workaround status
 
 s390:
 * provide userspace interface for branch prediction changes in firmware
 
 x86:
 * use correct macros for bits
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJaY3/eAAoJEED/6hsPKofo64kH/16SCSA9pKJTf39+jLoCPzbp
 tlhzxoaqb9cPNMQBAk8Cj5xNJ6V4Clwnk8iRWaE6dRI5nWQxnxRHiWxnrobHwUbK
 I0zSy+SywynSBnollKzLzQrDUBZ72fv3oLwiYEYhjMvs0zW6Q/vg10WERbav912Q
 bv8nb5e8TbvU500ErndKTXOa8/B6uZYkMVjBNvAHwb+4AQ7bJgDQs5/qOeXllm8A
 MT/SNYop/fkjRP7mQng5XYzoO+70tbe0hWpOQGgBnduzrbkNNvZtYtovusHYytLX
 PAB7DDPbLZm5L2HBo4zvKgTHIoHTxU0X2yfUDzt7O151O2WSyqBRC3y1tpj6xa8=
 =GnNJ
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Radim Krčmář:
 "ARM:
   - fix incorrect huge page mappings on systems using the contiguous
     hint for hugetlbfs
   - support alternative GICv4 init sequence
   - correctly implement the ARM SMCC for HVC and SMC handling

  PPC:
   - add KVM IOCTL for reporting vulnerability and workaround status

  s390:
   - provide userspace interface for branch prediction changes in
     firmware

  x86:
   - use correct macros for bits"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: s390: wire up bpb feature
  KVM: PPC: Book3S: Provide information about hardware/firmware CVE workarounds
  KVM/x86: Fix wrong macro references of X86_CR0_PG_BIT and X86_CR4_PAE_BIT in kvm_valid_sregs()
  arm64: KVM: Fix SMCCC handling of unimplemented SMC/HVC calls
  KVM: arm64: Fix GICv4 init when called from vgic_its_create
  KVM: arm/arm64: Check pagesize when allocating a hugepage at Stage 2
2018-01-20 11:41:09 -08:00
Ram Pai
3350eb2ea1 powerpc: sys_pkey_mprotect() system call
Patch provides the ability for a process to
associate a pkey with a address range.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-21 01:06:16 +11:00
Ram Pai
9499ec1b5e powerpc: sys_pkey_alloc() and sys_pkey_free() system calls
Finally this patch provides the ability for a process to
allocate and free a protection key.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-21 01:06:15 +11:00
Ram Pai
cf43d3b264 powerpc: Enable pkey subsystem
PAPR defines 'ibm,processor-storage-keys' property. It exports two
values. The first value holds the number of data-access keys and the
second holds the number of instruction-access keys. Due to a bug in
the firmware, instruction-access keys is always reported as zero.
However any key can be configured to disable data-access and/or
disable execution-access. The inavailablity of the second value is not
a big handicap, though it could have been used to determine if the
platform supported disable-execution-access.

Non-PAPR platforms do not define this property in the device tree yet.
Fortunately power8 is the only released Non-PAPR platform that is
supported. Here, we hardcode the number of supported pkey to 32, by
consulting the PowerISA3.0

This patch calculates the number of keys supported by the platform.
Also it determines the platform support for read/write/execution
access support for pkeys.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[mpe: Use a PVR check instead of CPU_FTR for execute. Restrict to
 Power7/8/9 for now until older CPUs are tested.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-21 01:06:10 +11:00
Thiago Jung Bauermann
c5cc1f4df6 powerpc/ptrace: Add memory protection key regset
The AMR/IAMR/UAMOR are part of the program context.
Allow it to be accessed via ptrace and through core files.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:59:06 +11:00
Ram Pai
99cd130232 powerpc: Deliver SEGV signal on pkey violation
The value of the pkey, whose protection got violated,
is made available in si_pkey field of the siginfo structure.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:59:05 +11:00
Ram Pai
087003e9ef powerpc: introduce get_mm_addr_key() helper
get_mm_addr_key() helper returns the pkey associated with
an address corresponding to a given mm_struct.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:59:05 +11:00
Ram Pai
e6c2a4797e powerpc: Handle exceptions caused by pkey violation
Handle Data and  Instruction exceptions caused by memory
protection-key.

The CPU will detect the key fault if the HPTE is already
programmed with the key.

However if the HPTE is not  hashed, a key fault will not
be detected by the hardware. The software will detect
pkey violation in such a case.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:59:04 +11:00
Ram Pai
1137573acf powerpc: implementation for arch_vma_access_permitted()
This patch provides the implementation for
arch_vma_access_permitted(). Returns true if the
requested access is allowed by pkey associated with the
vma.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:59:04 +11:00
Ram Pai
bca7aacfe8 powerpc: check key protection for user page access
Make sure that the kernel does not access user pages without
checking their key-protection.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[mpe: Integrate with upstream version of pte_access_permitted()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:59:04 +11:00
Ram Pai
f2407ef3ba powerpc: helper to validate key-access permissions of a pte
helper function that checks if the read/write/execute is allowed
on the pte.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:59:03 +11:00
Ram Pai
a6590ca55f powerpc: Program HPTE key protection bits
Map the PTE protection key bits to the HPTE key protection bits,
while creating HPTE  entries.

Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:59:03 +11:00
Ram Pai
eb95d016ce powerpc: map vma key-protection bits to pte key bits.
Map  the  key  protection  bits of the vma to the pkey bits in
the PTE.

The PTE  bits used  for pkey  are  3,4,5,6  and 57. The  first
four bits are the same four bits that were freed up  initially
in this patch series. remember? :-) Without those four bits
this patch wouldn't be possible.

BUT, on 4k kernel, bit 3, and 4 could not be freed up. remember?
Hence we have to be satisfied with 5, 6 and 7.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:59:02 +11:00
Ram Pai
87bbabbed8 powerpc: implementation for arch_override_mprotect_pkey()
arch independent code calls arch_override_mprotect_pkey()
to return a pkey that best matches the requested protection.

This patch provides the implementation.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:59:02 +11:00
Ram Pai
013a91b39c powerpc: ability to associate pkey to a vma
arch-independent code expects the arch to  map
a  pkey  into the vma's protection bit setting.
The patch provides that ability.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:59:01 +11:00
Ram Pai
5586cf61e1 powerpc: introduce execute-only pkey
This patch provides the implementation of execute-only pkey.
The architecture-independent layer expects the arch-dependent
layer, to support the ability to create and enable a special
key which has execute-only permission.

Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:59:01 +11:00
Ram Pai
06bb53b338 powerpc: store and restore the pkey state across context switches
Store and restore the AMR, IAMR and UAMOR register state of the task
before scheduling out and after scheduling in, respectively.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:59:00 +11:00
Ram Pai
dcf872956d powerpc: ability to create execute-disabled pkeys
powerpc has hardware support to disable execute on a pkey.
This patch enables the ability to create execute-disabled
keys.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:59:00 +11:00
Ram Pai
2ddc53f3a7 powerpc: implementation for arch_set_user_pkey_access()
This patch provides the detailed implementation for
a user to allocate a key and enable it in the hardware.

It provides the plumbing, but it cannot be used till
the system call is implemented. The next patch  will
do so.

Reviewed-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:58:59 +11:00
Ram Pai
0685f217af powerpc: cleanup AMR, IAMR when a key is allocated or freed
Cleanup the bits corresponding to a key in the AMR, and IAMR
register, when the key is newly allocated/activated or is freed.
We dont want some residual bits cause the hardware enforce
unintended behavior when the key is activated or freed.

Reviewed-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:58:59 +11:00
Ram Pai
4fb158f65a powerpc: track allocation status of all pkeys
Total 32 keys are available on power7 and above. However
pkey 0,1 are reserved. So effectively we  have  30 pkeys.

On 4K kernels, we do not  have  5  bits  in  the  PTE to
represent  all the keys; we only have 3bits. Two of those
keys are reserved; pkey 0 and pkey 1. So effectively  we
have 6 pkeys.

This patch keeps track of reserved keys, allocated  keys
and keys that are currently free.

Also it  adds  skeletal  functions  and macros, that the
architecture-independent code expects to be available.

Reviewed-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:58:35 +11:00
Ram Pai
92e3da3cf1 powerpc: initial pkey plumbing
Basic  plumbing  to   initialize  the   pkey  system.
Nothing is enabled yet. A later patch will enable it
once all the infrastructure is in place.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[mpe: Rework copyrights to use SPDX tags]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 21:45:03 +11:00
Linus Torvalds
4917d5df38 powerpc fixes for 4.15 #8
More than we'd like after rc8, but nothing very alarming either, just tying up
 loose ends before the release:
 
 Since we changed powernv to use cpufreq_get() from show_cpuinfo(), we see
 warnings with PREEMPT enabled. But the preempt_disable() in show_cpuinfo()
 doesn't actually prevent CPU hotplug as it suggests, so remove it.
 
 Two updates to the recently merged RFI flush code. Wire up the generic sysfs
 file to report the status, and add a debugfs file to allow enabling/disabling it
 at runtime.
 
 Two updates to xmon, one to add the RFI flush related fields to the paca dump,
 and another to not use hashed pointers in the paca dump.
 
 And one minor fix to add a missing include of linux/types.h in asm/hvcall.h, not
 seen to break the build in upstream, but correct anyway.
 
 Thanks to:
   Benjamin Herrenschmidt, Michal Suchanek, Nicholas Piggin.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJaYcINExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYDQ
 /w//S5OeowmRncTPsiCNlZSLlGsynE7/3QiB3qS0+hGK9vzcdl9Zw5nBxPmA16g+
 53z2pgRpsJvagqR3JFHwYsoOKg157heMZKbrFyV/EWXPUoZQD8Iko88BkNQfV9kH
 dZM4gOuPnn7lLJQJrdc13SRwlOlc1oqOiPqFQtwH5CTlH/I4kVe3iR+oaReA71+U
 bKZQWwXjRqjexLgxfPhaMH9CzfgK6LpM/TEc2tG1YBbKxbTCbXeFYhrNFvLKILHg
 sC/8QDYzFbdgqTrcw6rXpwoxA65Eu8I3puOLpTl/oJ/OQZZBLN6sjjJ33+Aq1yNR
 cXOakJgC+WjNYmw0C2eRBttMG1o0t/g52C9e0DWN2cXU9nwZ5mdFtbQos1kOdaLv
 V5sOpX6HH0+K8jefvw9qdmx4+kma38nyBY1hlsSXkMG5FwXVOx5V/hLFdH4zJs4+
 /9q3Bnwklc7X7Y+Qvqz/5rwi2T4++cIQpngRE2B7uUwjqNQR/iQUrXcb1MBie8LB
 RO7JD8D0ut+Utj75/b6PmBh2tvIHjWvC8BY9mexGfYgU4e/kHbpUa0tN8/OZhYv2
 pm1LP7Qiq+Lt8ii9UjvRKRdXWuxP/8dZ5EI7l7DJkwDItHTzQCDnfCRYjLqvXuDQ
 CGSsyMpQzzReWhWHDn4WJp4f8wRqDyN24d2a95ozaY7HJ4M=
 =zgy0
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.15-8' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "More than we'd like after rc8, but nothing very alarming either, just
  tying up loose ends before the release:

  Since we changed powernv to use cpufreq_get() from show_cpuinfo(), we
  see warnings with PREEMPT enabled. But the preempt_disable() in
  show_cpuinfo() doesn't actually prevent CPU hotplug as it suggests, so
  remove it.

  Two updates to the recently merged RFI flush code. Wire up the generic
  sysfs file to report the status, and add a debugfs file to allow
  enabling/disabling it at runtime.

  Two updates to xmon, one to add the RFI flush related fields to the
  paca dump, and another to not use hashed pointers in the paca dump.

  And one minor fix to add a missing include of linux/types.h in
  asm/hvcall.h, not seen to break the build in upstream, but correct
  anyway.

  Thanks to: Benjamin Herrenschmidt, Michal Suchanek, Nicholas Piggin"

* tag 'powerpc-4.15-8' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/pseries: include linux/types.h in asm/hvcall.h
  powerpc/64s: Allow control of RFI flush via debugfs
  powerpc/64s: Wire up cpu_show_meltdown()
  powerpc: Don't preempt_disable() in show_cpuinfo()
  powerpc/xmon: Don't print hashed pointers in paca dump
  powerpc/xmon: Add RFI flush related fields to paca dump
2018-01-19 11:19:11 -08:00
Anju T Sudhakar
684d984038 powerpc/powernv: Add debugfs interface for imc-mode and imc-command
In memory Collection (IMC) counter pmu driver controls the ucode's
execution state. At the system boot, IMC perf driver pause the ucode.
Ucode state is changed to "running" only when any of the nest units
are monitored or profiled using perf tool.

Nest units support only limited set of hardware counters and ucode is
always programmed in the "production mode" ("accumulation") mode. This
mode is configured to provide key performance metric data for most of
the nest units.

But ucode also supports other modes which would be used for "debug" to
drill down specific nest units. That is, ucode when switched to
"powerbus" debug mode (for example), will dynamically reconfigure the
nest counters to target only "powerbus" related events in the hardware
counters. This allows the IMC nest unit to focus on powerbus related
transactions in the system in more detail. At this point, production
mode events may or may not be counted.

IMC nest counters has both in-band (ucode access) and out of band
access to it. Since not all nest counter configurations are supported
by ucode, out of band tools are used to characterize other nest
counter configurations.

Patch provides an interface via "debugfs" to enable the switching of
ucode modes in the system. To switch ucode mode, one has to first
pause the microcode (imc_cmd), and then write the target mode value to
the "imc_mode" file.

Proposed Approach:

In the proposed approach, the function (export_imc_mode_and_cmd) which
creates the debugfs interface for imc mode and command is implemented
in opal-imc.c. Thus we can use imc_get_mem_addr() to get the homer
base address for each chip.

The interface to expose imc mode and command is required only if we
have nest pmu units registered. Employing the existing data structures
to track whether we have any nest units registered will require to
extend data from perf side to opal-imc.c. Instead an integer is
introduced to hold that information by counting successful nest unit
registration. Debugfs interface is removed based on the integer count.

Example for the interface:

  $ ls /sys/kernel/debug/imc
  imc_cmd_0  imc_cmd_8  imc_mode_0  imc_mode_8

Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 23:05:00 +11:00
Anju T Sudhakar
8b4e6deaff powerpc/perf: Pass struct imc_events as a parameter to imc_parse_event()
Remove the allocation of struct imc_events from imc_parse_event().
Instead pass imc_events as a parameter to imc_parse_event(), which is
a pointer to a slot in the array allocated in
update_events_in_group().

Reported-by: Dan Carpenter ("powerpc/perf: Fix a sizeof() typo so we allocate less memory")
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 23:04:44 +11:00
Madhavan Srinivasan
6cd74d2b61 powerpc/64s: Implement local_t using irq soft masking
local_t is used for atomic modifications for per-CPU data, versus
re-entrant modifications via interrupts.

local_t read-modify-write atomic operations are currently implemented
with hardware atomics (larx/stcx), which are quite slow. This patch
implements them by masking all types of interrupts that may do local_t
operations ("standard" and perf interrupts).

Rusty's benchmark (https://lkml.org/lkml/2008/12/16/450) gives the
following timings for the local_t test, in nanoseconds per iteration:

             larx/stcx   irq+pmu disable
_inc                38                10
_add                38                10
_read                4                 4
_add_return         38                10

There are still some interrupt types (system reset, machine check, and
watchdog), which can not safely use local_t operations, because they
are not masked.

An alternative approach was proposed, using a CR bit to mark a critical
section, which is tested in the interrupt return path, and would then
branch to a fixup handler (similar to exception fixups), which re-starts
the operation. The problem with this was the complexity of the fixup
handler and the latency of the slow path.

https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-November/123024.html

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:04 +11:00
Madhavan Srinivasan
3b7e302086 powerpc: use generic atomic implementation for local_t
powerpc implements local_t with atomic operations. There is already
an asm-generic implementation which does this using atomic_t.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:04 +11:00
Madhavan Srinivasan
c6424382de powerpc/64s: Add new set of irq_soft_mask_ functions for PMI masking
To support soft-masking of the performance monitor interrupt, a set of
new powerpc_local_irq_pmu_save() and powerpc_local_irq_restore()
functions are added. And powerpc_local_irq_save() implemented, by
adding a new irq_soft_mask manipulation function
irq_soft_mask_or_return().

Local_irq_pmu_* macros are provided to access these
powerpc_local_irq_pmu* functions which includes
trace_hardirqs_on|off() to match what we have in
include/linux/irqflags.h.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:03 +11:00
Madhavan Srinivasan
9aa88188ee powerpc: Add new kconfig CONFIG_PPC_IRQ_SOFT_MASK_DEBUG
New Kconfig is added "CONFIG_PPC_IRQ_SOFT_MASK_DEBUG" to add WARN_ON
to alert the invalid transitions. Also moved the code under the
CONFIG_TRACE_IRQFLAGS in arch_local_irq_restore() to new Kconfig.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
[mpe: Fix name of CONFIG option in change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:03 +11:00
Madhavan Srinivasan
f442d00480 powerpc/64s: Add support to mask perf interrupts and replay them
Two new bit mask field "IRQ_DISABLE_MASK_PMU" is introduced to support
the masking of PMI and "IRQ_DISABLE_MASK_ALL" to aid interrupt masking
checking.

Couple of new irq #defs "PACA_IRQ_PMI" and "SOFTEN_VALUE_0xf0*" added
to use in the exception code to check for PMI interrupts.

In the masked_interrupt handler, for PMIs we reset the MSR[EE] and
return. In the __check_irq_replay(), replay the PMI interrupt by
calling performance_monitor_common handler.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:02 +11:00
Madhavan Srinivasan
f14e953b19 powerpc/64s: Add support to take additional parameter in MASKABLE_* macro
To support addition of "bitmask" to MASKABLE_* macros, factor out the
EXCPETION_PROLOG_1 macro.

Make it explicit the interrupt masking supported by a gievn interrupt
handler. Patch correspondingly extends the MASKABLE_* macros with an
addition's parameter. "bitmask" parameter is passed to SOFTEN_TEST
macro to decide on masking the interrupt.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:02 +11:00
Madhavan Srinivasan
d32eb1b550 powerpc/64s: Avoid using EXCEPTION_PROLOG_1 macro in MASKABLE_*
Currently we use both EXCEPTION_PROLOG_1 and __EXCEPTION_PROLOG_1 in
the MASKABLE_* macros. As a cleanup, this patch makes MASKABLE_* to
use only __EXCEPTION_PROLOG_1. There is not logic change.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:01 +11:00
Madhavan Srinivasan
4e26bc4a4e powerpc/64: Rename soft_enabled to irq_soft_mask
Rename the paca->soft_enabled to paca->irq_soft_mask as it is no
longer used as a flag for interrupt state, but a mask.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:01 +11:00
Madhavan Srinivasan
01417c6cc7 powerpc/64: Change soft_enabled from flag to bitmask
"paca->soft_enabled" is used as a flag to mask some of interrupts.
Currently supported flags values and their details:

soft_enabled    MSR[EE]

0               0       Disabled (PMI and HMI not masked)
1               1       Enabled

"paca->soft_enabled" is initialized to 1 to make the interripts as
enabled. arch_local_irq_disable() will toggle the value when
interrupts needs to disbled. At this point, the interrupts are not
actually disabled, instead, interrupt vector has code to check for the
flag and mask it when it occurs. By "mask it", it update interrupt
paca->irq_happened and return. arch_local_irq_restore() is called to
re-enable interrupts, which checks and replays interrupts if any
occured.

Now, as mentioned, current logic doesnot mask "performance monitoring
interrupts" and PMIs are implemented as NMI. But this patchset depends
on local_irq_* for a successful local_* update. Meaning, mask all
possible interrupts during local_* update and replay them after the
update.

So the idea here is to reserve the "paca->soft_enabled" logic. New
values and details:

soft_enabled    MSR[EE]

1               0       Disabled  (PMI and HMI not masked)
0               1       Enabled

Reason for the this change is to create foundation for a third mask
value "0x2" for "soft_enabled" to add support to mask PMIs. When
->soft_enabled is set to a value "3", PMI interrupts are mask and when
set to a value of "1", PMI are not mask. With this patch also extends
soft_enabled as interrupt disable mask.

Current flags are renamed from IRQ_[EN?DIS}ABLED to
IRQS_ENABLED and IRQS_DISABLED.

Patch also fixes the ptrace call to force the user to see the softe
value to be alway 1. Reason being, even though userspace has no
business knowing about softe, it is part of pt_regs. Like-wise in
signal context.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:00 +11:00
Madhavan Srinivasan
acb396d7c2 powerpc/64: Cleanup hard_irq_disable() macro
Minor cleanup to use helper function for manipulating
paca->soft_enabled variable.

Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:00 +11:00
Madhavan Srinivasan
a67c543aac powerpc/64: Implement and use soft_enabled_set_return API
Add a new wrapper function, soft_enabled_set_return(), added to do the
paca->soft_enabled updates requiring a set-return.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:36:59 +11:00
Madhavan Srinivasan
e0b5687bed powerpc/64: Implement and use soft_enabled_return API
Add a new wrapper function, soft_enabled_return(), added to return
paca->soft_enabled value.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:36:59 +11:00
Madhavan Srinivasan
0b63acf4a0 powerpc/64: Move set_soft_enabled() and rename
Move set_soft_enabled() from powerpc/kernel/irq.c to asm/hw_irq.c, to
encourage updates to paca->soft_enabled done via these access
function. Add "memory" clobber to hint compiler since
paca->soft_enabled memory is the target here.

Renaming it as soft_enabled_set() will make namespaces works better as
prefix than a postfix when new soft_enabled manipulation functions are
introduced.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:36:58 +11:00
Madhavan Srinivasan
b5c1bd62c0 powerpc/64: Fix arch_local_irq_disable() prototype
In powerpc/64, the arch_local_irq_disable() function returns unsigned
long, which is not consistent with other architectures.

Move that set-return asm implementation into arch_local_irq_save(),
and make arch_local_irq_disable() return void, simplifying the
assembly.

Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:36:58 +11:00
Nicholas Piggin
9f83e00f4c powerpc/64: Improve inline asm in arch_local_irq_disable
arch_local_irq_disable is implemented strangely, with a temporary
output register being set to the desired soft_enabled value via an
immediate input, which is then used to store to memory. This is not
required, the immediate can be specified directly as a register input.

For simple cases at least, assembly is unchanged except register
mapping.

Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:36:57 +11:00
Madhavan Srinivasan
c2e480ba82 powerpc/64: Add #defines for paca->soft_enabled flags
Two #defines IRQS_ENABLED and IRQS_DISABLED are added to be used when
updating paca->soft_enabled. Replace the hardcoded values used when
updating paca->soft_enabled with IRQ_(EN|DIS)ABLED #define. No logic
change.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:36:56 +11:00
Michael Ellerman
7a074fc083 powerpc/64s: Fix ps3 build error due to tlbiel_all()
The recent changes to TLB handling broke the PS3 build:

  arch/powerpc/include/asm/book3s/64/tlbflush.h:30: undefined reference to `.hash__tlbiel_all'

Fix it by adding an fallback version of tlbiel_all() for non-native
builds. It should never be called, due to checks in callers so it
calls BUG(). We should probably clean it up further but this will
suffice for now.

Fixes: d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:36:45 +11:00
Paul Mackerras
3214d01f13 KVM: PPC: Book3S: Provide information about hardware/firmware CVE workarounds
This adds a new ioctl, KVM_PPC_GET_CPU_CHAR, that gives userspace
information about the underlying machine's level of vulnerability
to the recently announced vulnerabilities CVE-2017-5715,
CVE-2017-5753 and CVE-2017-5754, and whether the machine provides
instructions to assist software to work around the vulnerabilities.

The ioctl returns two u64 words describing characteristics of the
CPU and required software behaviour respectively, plus two mask
words which indicate which bits have been filled in by the kernel,
for extensibility.  The bit definitions are the same as for the
new H_GET_CPU_CHARACTERISTICS hypercall.

There is also a new capability, KVM_CAP_PPC_GET_CPU_CHAR, which
indicates whether the new ioctl is available.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 15:17:01 +11:00
Benjamin Herrenschmidt
9b9b13a6d1 KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded
This works on top of the single escalation support. When in single
escalation, with this change, we will keep the escalation interrupt
disabled unless the VCPU is in H_CEDE (idle). In any other case, we
know the VCPU will be rescheduled and thus there is no need to take
escalation interrupts in the host whenever a guest interrupt fires.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:10:21 +11:00
Benjamin Herrenschmidt
35c2405efc KVM: PPC: Book3S HV: Make xive_pushed a byte, not a word
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:10:21 +11:00
Benjamin Herrenschmidt
2267ea7661 KVM: PPC: Book3S HV: Don't use existing "prodded" flag for XIVE escalations
The prodded flag is only cleared at the beginning of H_CEDE,
so every time we have an escalation, we will cause the *next*
H_CEDE to return immediately.

Instead use a dedicated "irq_pending" flag to indicate that
a guest interrupt is pending for the VCPU. We don't reuse the
existing exception bitmap so as to avoid expensive atomic ops.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:10:21 +11:00
Benjamin Herrenschmidt
bf4159da47 KVM: PPC: Book3S HV: Enable use of the new XIVE "single escalation" feature
That feature, provided by Power9 DD2.0 and later, when supported
by newer OPAL versions, allows us to sacrifice a queue (priority 7)
in favor of merging all the escalation interrupts of the queues
of a single VP into a single interrupt.

This reduces the number of host interrupts used up by KVM guests
especially when those guests use multiple priorities.

It will also enable a future change to control the masking of the
escalation interrupts more precisely to avoid spurious ones.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:10:21 +11:00
Paul Mackerras
d27998185d Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next
This merges in the ppc-kvm topic branch of the powerpc tree to get
two patches which are prerequisites for the following patch series,
plus another patch which touches both powerpc and KVM code.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:09:57 +11:00
Nicholas Piggin
c16bee4bde powerpc: define __ARCH_IRQ_EXIT_IRQS_DISABLED
powerpc calls irq_exit() with local irqs disabled, therefore it
can define __ARCH_IRQ_EXIT_IRQS_DISABLED.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 15:43:43 +11:00
Nicholas Piggin
47712a921b powerpc/watchdog: remove arch_trigger_cpumask_backtrace
The powerpc NMI IPIs may not be recoverable if they are taken in
some sections of code, and also there have been and still are issues
with taking NMIs (in KVM guest code, in firmware, etc) which makes them
a bit dangerous to use.

Generic code like softlockup detector and rcu stall detectors really
hammer on trigger_*_backtrace, which has lead to further problems
because we've implemented it with the NMI.

So stop providing NMI backtraces for now. Importantly, the powerpc code
uses NMI IPIs in crash/debug, and the SMP hardlockup watchdog. So if the
softlockup and rcu hang detection traces are not being printed because
the CPU is stuck with interrupts off, then the hard lockup watchdog
should get it with the NMI IPI.

Fixes: 2104180a53 ("powerpc/64s: implement arch-specific hardlockup watchdog")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 15:43:43 +11:00
Paul Mackerras
d075745d89 KVM: PPC: Book3S HV: Improve handling of debug-trigger HMIs on POWER9
Hypervisor maintenance interrupts (HMIs) are generated by various
causes, signalled by bits in the hypervisor maintenance exception
register (HMER).  In most cases calling OPAL to handle the interrupt
is the correct thing to do, but the "debug trigger" HMIs signalled by
PPC bit 17 (bit 46) of HMER are used to invoke software workarounds
for hardware bugs, and OPAL does not have any code to handle this
cause.  The debug trigger HMI is used in POWER9 DD2.0 and DD2.1 chips
to work around a hardware bug in executing vector load instructions to
cache inhibited memory.  In POWER9 DD2.2 chips, it is generated when
conditions are detected relating to threads being in TM (transactional
memory) suspended mode when the core SMT configuration needs to be
reconfigured.

The kernel currently has code to detect the vector CI load condition,
but only when the HMI occurs in the host, not when it occurs in a
guest.  If a HMI occurs in the guest, it is always passed to OPAL, and
then we always re-sync the timebase, because the HMI cause might have
been a timebase error, for which OPAL would re-sync the timebase, thus
removing the timebase offset which KVM applied for the guest.  Since
we don't know what OPAL did, we don't know whether to subtract the
timebase offset from the timebase, so instead we re-sync the timebase.

This adds code to determine explicitly what the cause of a debug
trigger HMI will be.  This is based on a new device-tree property
under the CPU nodes called ibm,hmi-special-triggers, if it is
present, or otherwise based on the PVR (processor version register).
The handling of debug trigger HMIs is pulled out into a separate
function which can be called from the KVM guest exit code.  If this
function handles and clears the HMI, and no other HMI causes remain,
then we skip calling OPAL and we proceed to subtract the guest
timebase offset from the timebase.

The overall handling for HMIs that occur in the host (i.e. not in a
KVM guest) is largely unchanged, except that we now don't set the flag
for the vector CI load workaround on DD2.2 processors.

This also removes a BUG_ON in the KVM code.  BUG_ON is generally not
useful in KVM guest entry/exit code since it is difficult to handle
the resulting trap gracefully.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 15:31:25 +11:00
Nicholas Piggin
d4748276ae powerpc/64s: Improve local TLB flush for boot and MCE on POWER9
There are several cases outside the normal address space management
where a CPU's entire local TLB is to be flushed:

  1. Booting the kernel, in case something has left stale entries in
     the TLB (e.g., kexec).

  2. Machine check, to clean corrupted TLB entries.

One other place where the TLB is flushed, is waking from deep idle
states. The flush is a side-effect of calling ->cpu_restore with the
intention of re-setting various SPRs. The flush itself is unnecessary
because in the first case, the TLB should not acquire new corrupted
TLB entries as part of sleep/wake (though they may be lost).

This type of TLB flush is coded inflexibly, several times for each CPU
type, and they have a number of problems with ISA v3.0B:

- The current radix mode of the MMU is not taken into account, it is
  always done as a hash flushn For IS=2 (LPID-matching flush from host)
  and IS=3 with HV=0 (guest kernel flush), tlbie(l) is undefined if
  the R field does not match the current radix mode.

- ISA v3.0B hash must flush the partition and process table caches as
  well.

- ISA v3.0B radix must flush partition and process scoped translations,
  partition and process table caches, and also the page walk cache.

So consolidate the flushing code and implement it in C and inline asm
under the mm/ directory with the rest of the flush code. Add ISA v3.0B
cases for radix and hash, and use the radix flush in radix environment.

Provide a way for IS=2 (LPID flush) to specify the radix mode of the
partition. Have KVM pass in the radix mode of the guest.

Take out the flushes from early cputable/dt_cpu_ftrs detection hooks,
and move it later in the boot process after, the MMU registers are set
up and before relocation is first turned on.

The TLB flush is no longer called when restoring from deep idle states.
This was not be done as a separate step because booting secondaries
uses the same cpu_restore as idle restore, which needs the TLB flush.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 00:40:31 +11:00
Michal Suchanek
1b689a95ce powerpc/pseries: include linux/types.h in asm/hvcall.h
Commit 6e032b350c ("powerpc/powernv: Check device-tree for RFI flush
settings") uses u64 in asm/hvcall.h without including linux/types.h

This breaks hvcall.h users that do not include the header themselves.

Fixes: 6e032b350c ("powerpc/powernv: Check device-tree for RFI flush settings")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-17 23:30:46 +11:00
Benjamin Herrenschmidt
872e2ae4bd powerpc: Remove useless EXC_COMMON_HV
The only difference between EXC_COMMON_HV and EXC_COMMON is that the
former adds "2" to the trap number which is supposed to represent the
fact that this is an "HV" interrupt which uses HSRR0/1.

However KVM is the only one who cares and it has its own separate macros.

In fact, we only have one user of EXC_COMMON_HV and it's for an
unknown interrupt case. All the other ones already using EXC_COMMON.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:51:42 +11:00
Christophe Leroy
4f94b2c746 powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP
When CONFIG_SWAP is set, the TLB miss handlers have to also take
into account _PAGE_ACCESSED flag. At the moment it is done by
anding _PAGE_ACCESSED into _PAGE_PRESENT using 3 instructions.

This patch uses APG for handling _PAGE_ACCESSED, allowing to
just copy _PAGE_ACCESSED bit into APG field, hence reducing the
action to a single instruction.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:15 +11:00
Christophe Leroy
de0f938739 powerpc/8xx: Remove _PAGE_USER and handle user access at PMD level
As Linux kernel separates KERNEL and USER address spaces, there is
therefore no need to flag USER access at page level.

Today, the 8xx TLB handlers already handle user access in the L1 entry
through Access Protection Groups, it is then natural to move the user
access handling at PMD level once _PAGE_NA allows to handle PAGE_NONE
protection without _PAGE_USER

In the mean time, as we free up one bit in the PTE, we can use it to
include SPS (page size flag) in the PTE and avoid handling it at every
TLB miss hence removing special handling based on compiled page size.

For _PAGE_EXEC, we rework it to use PP PTE bits, avoiding the copy
of _PAGE_EXEC bit into the L1 entry. Unfortunatly we are not
able to put it at the correct location as it conflicts with
NA/RO/RW bits for data entries.

Upper bits of APG in L1 entry overlap with PMD base address. In
order to avoid having to filter that out, we set up all groups so that
upper bits can have any value.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:14 +11:00
Christophe Leroy
351750331f powerpc/mm: Introduce _PAGE_NA
Today, PAGE_NONE is defined as a page not having _PAGE_USER.
In some circunstances, when the CPU supports it, it might be
better to be able to flag a page with NO ACCESS.

In a following patch, the 8xx will switch user access being flagged
in the PMD, therefore it will not be possible anymore to use
_PAGE_USER as a way to flag a page with no access.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:14 +11:00
Christophe Leroy
812fadcb94 powerpc/mm: extend _PAGE_PRIVILEGED to all CPUs
commit ac29c64089 ("powerpc/mm: Replace _PAGE_USER with
_PAGE_PRIVILEGED") introduced _PAGE_PRIVILEGED for BOOK3S/64

This patch generalises _PAGE_PRIVILEGED for all CPUs, allowing
to have either _PAGE_PRIVILEGED or _PAGE_USER or both.

PPC_8xx has a _PAGE_SHARED flag which is set for and only for
all non user pages. Lets rename it _PAGE_PRIVILEGED to remove
confusion as it has nothing to do with Linux shared pages.

On BookE, there's a _PAGE_BAP_SR which has to be set for kernel
pages: defining _PAGE_PRIVILEGED as _PAGE_BAP_SR will make
this generic

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:13 +11:00
Christophe Leroy
5f356497c3 powerpc/8xx: remove unused _PAGE_WRITETHRU
_PAGE_WRITETHRU is only used in:
* AMIGA_Z2RAM block driver which is never activated on powerPC
* Video/FB driver which is for PPC_PMAC

Therefore, no need to spend time in 8xx TLB miss handlers for
handling it.

And by removing it, we free up bit 20 which then avoids having
to clear it on each TLB miss.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:13 +11:00
Christophe Leroy
cd99ddbea2 powerpc/8xx: Only perform perf counting when perf is in use.
In TLB miss handlers, updating the perf counter is only useful
when performing a perf analysis. As it has a noticeable overhead,
let's only do it when needed.

In order to do so, the exit of the miss handlers will be patched
when starting/stopping 'perf': the first register restore
instruction of each exit point will be replaced by a jump to
the counting code.

Once this is done, CONFIG_PPC_8xx_PERF_EVENT becomes useless as
this feature doesn't add any overhead.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:12 +11:00
Christophe Leroy
2a45addd21 powerpc/8xx: Remove CPU6 ERRATA Workaround
CPU6 ERRATA affects only MPC860 revisions prior to C.0. Manufacturing
of those revisiosn was stopped in 1999-2000.
Therefore, it has been almost 20 years since this ERRATA has been
fixed in the silicon.

This patch removes the workaround for that ERRATA.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:12 +11:00
Balbir Singh
4145f35864 powernv/kdump: Fix cases where the kdump kernel can get HMI's
Certain HMI's such as malfunction error propagate through
all threads/core on the system. If a thread was offline
prior to us crashing the system and jumping to the kdump
kernel, bad things happen when it wakes up due to an HMI
in the kdump kernel.

There are several possible ways to solve this problem

1. Put the offline cores in a state such that they are
not woken up for machine check and HMI errors. This
does not work, since we might need to wake up offline
threads to handle TB errors
2. Ignore HMI errors, setup HMEER to mask HMI errors,
but this still leads the window open for any MCEs
and masking them for the duration of the dump might
be a concern
3. Wake up offline CPUs, as in send them to
crash_ipi_callback (not wake them up as in mark them
online as seen by the hotplug). kexec does a
wake_online_cpus() call, this patch does something
similar, but instead sends an IPI and forces them to
crash_ipi_callback()

This patch takes approach #3.

Care is taken to enable this only for powenv platforms
via crash_wake_offline (a global value set at setup
time). The crash code sends out IPI's to all CPU's
which then move to crash_ipi_callback and kexec_smp_wait().

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:11 +11:00
Nathan Fontenot
0c38ed6f6f powerpc/pseries: Enable support of ibm,dynamic-memory-v2
Add required bits to the architecture vector to enable support
of the ibm,dynamic-memory-v2 device tree property.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:26:30 +11:00
Nathan Fontenot
2b31e3aec1 powerpc/drmem: Add support for ibm, dynamic-memory-v2 property
The Power Hypervisor has introduced a new device tree format for
the property describing the dynamic reconfiguration LMBs for a system,
ibm,dynamic-memory-v2. This new format condenses the size of the
property, especially on large memory systems, by reporting sets
of LMBs that have the same properties (flags and associativity array
index).

This patch updates the powerpc/mm/drmem.c code to provide routines
that can parse the new device tree format during the walk_drmem_lmb*
routines used during boot, the creation of the LMB array, and updating
the device tree to create a new property in the proper format for
ibm,dynamic-memory-v2.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:26:29 +11:00
Nathan Fontenot
2c77721552 powerpc: Move of_drconf_cell struct to asm/drmem.h
Now that the powerpc code parses dynamic reconfiguration memory
LMB information from the LMB array and not the device tree
directly we can move the of_drconf_cell struct to drmem.h where
it fits better.

In addition, the struct is renamed to of_drconf_cell_v1 in
anticipation of upcoming support for version 2 of the dynamic
reconfiguration property and the members are typed as __be*
values to reflect how they exist in the device tree.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:26:29 +11:00
Nathan Fontenot
6195a5001f powerpc/pseries: Update memory hotplug code to use drmem LMB array
Update the pseries memory hotplug code to use the newly added
dynamic reconfiguration LMB array. Doing this is required for the
upcoming support of version 2 of the dynamic reconfiguration
device tree property.

In addition, making this change cleans up the code that parses the
LMB information as we no longer need to worry about device tree
format. This allows us to discard one of the first steps on memory
hotplug where we make a working copy of the device tree property and
convert the entire property to cpu format. Instead we just use the
LMB array directly while holding the memory hotplug lock.

This patch also moves the updating of the device tree property to
powerpc/mm/drmem.c. This allows to the hotplug code to work without
needing to know the device tree format and provides a single
routine for updating the device tree property. This new routine
will handle determination of the proper device tree format and
generate a properly formatted device tree property.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:26:28 +11:00
Nathan Fontenot
514a9cb331 powerpc/numa: Update numa code use walk_drmem_lmbs
Update code in powerpc/numa.c to use the walk_drmem_lmbs()
routine instead of parsing the device tree directly. This is
in anticipation of introducing a new ibm,dynamic-memory-v2
property with a different format. This will allow the numa code
to use a single initialization routine per-LMB irregardless of
the device tree format.

Additionally, to support additional routines in numa.c that need
to look up LMB information, an late_init routine is added to drmem.c
to allocate the array of LMB information. This LMB array will provide
per-LMB information to separate the LMB data from the device tree
format.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:26:28 +11:00
Nathan Fontenot
6c6ea53725 powerpc/mm: Separate ibm, dynamic-memory data from DT format
We currently have code to parse the dynamic reconfiguration LMB
information from the ibm,dynamic-meory device tree property in
multiple locations; numa.c, prom.c, and pseries/hotplug-memory.c.
In anticipation of adding support for a version 2 of the
ibm,dynamic-memory property this patch aims to separate the device
tree information from the device tree format.

Doing this requires a two step process to avoid a possibly very large
bootmem allocation early in boot. During initial boot, new routines
are provided to walk the device tree property and make a call-back
for each LMB.

The second step (introduced in later patches) will allocate an
array of LMB information that can be used directly without needing
to know the DT format.

This approach provides the benefit of consolidating the device tree
property parsing to a single location and (eventually) providing
a common data structure for retrieving LMB information.

This patch introduces a routine to walk the ibm,dynamic-memory
property in the flattened device tree and updates the prom.c code
to use this to initialize memory.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:26:27 +11:00
Eric W. Biederman
ea64d5acc8 signal: Unify and correct copy_siginfo_to_user32
Among the existing architecture specific versions of
copy_siginfo_to_user32 there are several different implementation
problems.  Some architectures fail to handle all of the cases in in
the siginfo union.  Some architectures perform a blind copy of the
siginfo union when the si_code is negative.  A blind copy suggests the
data is expected to be in 32bit siginfo format, which means that
receiving such a signal via signalfd won't work, or that the data is
in 64bit siginfo and the code is copying nonsense to userspace.

Create a single instance of copy_siginfo_to_user32 that all of the
architectures can share, and teach it to handle all of the cases in
the siginfo union correctly, with the assumption that siginfo is
stored internally to the kernel is 64bit siginfo format.

A special case is made for x86 x32 format.  This is needed as presence
of both x32 and ia32 on x86_64 results in two different 32bit signal
formats.  By allowing this small special case there winds up being
exactly one code base that needs to be maintained between all of the
architectures.  Vastly increasing the testing base and the chances of
finding bugs.

As the x86 copy of copy_siginfo_to_user32 the call of the x86
signal_compat_build_tests were moved into sigaction_compat_abi, so
that they will keep running.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 19:56:20 -06:00
Paul Mackerras
5855564c8a KVM: PPC: Book3S HV: Enable migration of decrementer register
This adds a register identifier for use with the one_reg interface
to allow the decrementer expiry time to be read and written by
userspace.  The decrementer expiry time is in guest timebase units
and is equal to the sum of the decrementer and the guest timebase.
(The expiry time is used rather than the decrementer value itself
because the expiry time is not constantly changing, though the
decrementer value is, while the guest vcpu is not running.)

Without this, a guest vcpu migrated to a new host will see its
decrementer set to some random value.  On POWER8 and earlier, the
decrementer is 32 bits wide and counts down at 512MHz, so the
guest vcpu will potentially see no decrementer interrupts for up
to about 4 seconds, which will lead to a stall.  With POWER9, the
decrementer is now 56 bits side, so the stall can be much longer
(up to 2.23 years) and more noticeable.

To help work around the problem in cases where userspace has not been
updated to migrate the decrementer expiry time, we now set the
default decrementer expiry at vcpu creation time to the current time
rather than the maximum possible value.  This should mean an
immediate decrementer interrupt when a migrated vcpu starts
running.  In cases where the decrementer is 32 bits wide and more
than 4 seconds elapse between the creation of the vcpu and when it
first runs, the decrementer would have wrapped around to positive
values and there may still be a stall - but this is no worse than
the current situation.  In the large-decrementer case, we are sure
to get an immediate decrementer interrupt (assuming the time from
vcpu creation to first run is less than 2.23 years) and we thus
avoid a very long stall.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-16 11:54:45 +11:00
Eric W. Biederman
ad2b1ab57d signal/powerpc: Remove redefinition of NSIGTRAP on powerpc
NSIGTRAP is 4 in the generic siginfo and powerpc just undefines
NSGTRAP and redefine it as 4.  That accomplishes nothing so remove
the duplication.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 17:42:33 -06:00
Al Viro
b713da69e4 signal: unify compat_siginfo_t
--EWB Added #ifdef CONFIG_X86_X32_ABI to arch/x86/kernel/signal_compat.c
      Changed #ifdef CONFIG_X86_X32 to #ifdef CONFIG_X86_X32_ABI in
      linux/compat.h

      CONFIG_X86_X32 is set when the user requests X32 support.

      CONFIG_X86_X32_ABI is set when the user requests X32 support
      and the tool-chain has X32 allowing X32 support to be built.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2018-01-15 17:40:31 -06:00
Christoph Hellwig
a37a3710f3 powerpc: rename swiotlb_dma_ops
We'll need that name for a generic implementation soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2018-01-15 09:35:26 +01:00
Linus Torvalds
6bb821193b powerpc fixes for 4.15 #7
One fix for an oops at boot if we take a hotplug interrupt before we are ready
 to handle it.
 
 The bulk is patches to implement mitigation for Meltdown, see the change logs
 for more details.
 
 Thanks to:
   Nicholas Piggin, Michael Neuling, Oliver O'Halloran, Jon Masters, Jose Ricardo
   Ziviani, David Gibson.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJaWXZ+ExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYBe
 ew/+KMp80VAIGSyFeYHLg8oClnQjKcEDMzlNoHo8WU6NwwNzk2XXBXNaGT7Osg7B
 Ny69R3/VRrGmi/2uule9OzF2eZt8BUmRn3cHDcUB5OwFjMysv4M2jc+OK0STQxLU
 F769rX+ZlkyAvQn4UxtNCUob+JDfVbE5Lurrx475ZGL1YHk6QqBGhlN/niXKtkBz
 upeBEpu4JgwsLLD9H7c4QQha9PkhxY51jtvxL24cgwDuKJKhTdxLahPlA3QmLafu
 FJxR9R0YzTj9Ivuae5yM7xRKr5wROTPtTRbG9sN7XOgFNrQ+LSgemXGrlV/fzdhV
 6iPn2hHs4bu45SOIM+Hhf7OwHTtnMetXmKo6NldlKHVCw1HwBLigRohnOTPjiHfZ
 P6brR4T/cglakBiE29+8T0UFqOPizEJiQRHnsoWzwIa/aTqCGEE3m3rgTxU6ovt5
 3JggC51ZRDFsjGvj+JFnxGxCjYC4tDbfBI28M+GuWZZPBKLkFnEk8PlSzYyHPW2p
 5uVA/UN4KHVGYUfJVUC+LMP+ZIOSEyEISKk7RuK0C0mwHSXcvxbeEefXbc18Xjai
 J/pl/c9/4CRnEbdPm305xBxKKmU6gWxWRK3KbRFsoTI/wp+Ryx5TjPpgAa5GnaLv
 nrtrfKs8skWfGMoDWH7+KILt0fgRR2x+91GTSxc5aaHtYBI=
 =UZHv
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.15-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "One fix for an oops at boot if we take a hotplug interrupt before we
  are ready to handle it.

  The bulk is patches to implement mitigation for Meltdown, see the
  change logs for more details.

  Thanks to: Nicholas Piggin, Michael Neuling, Oliver O'Halloran, Jon
  Masters, Jose Ricardo Ziviani, David Gibson"

* tag 'powerpc-4.15-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/powernv: Check device-tree for RFI flush settings
  powerpc/pseries: Query hypervisor for RFI flush settings
  powerpc/64s: Support disabling RFI flush with no_rfi_flush and nopti
  powerpc/64s: Add support for RFI flush of L1-D cache
  powerpc/64s: Convert slb_miss_common to use RFI_TO_USER/KERNEL
  powerpc/64: Convert fast_exception_return to use RFI_TO_USER/KERNEL
  powerpc/64: Convert the syscall exit path to use RFI_TO_USER/KERNEL
  powerpc/64s: Simple RFI macro conversions
  powerpc/64: Add macros for annotating the destination of rfid/hrfid
  powerpc/pseries: Add H_GET_CPU_CHARACTERISTICS flags & wrapper
  powerpc/pseries: Make RAS IRQ explicitly dependent on DLPAR WQ
2018-01-14 15:03:17 -08:00
Eric W. Biederman
2f82a46f66 signal: Remove _sys_private and _overrun_incr from struct compat_siginfo
We have never passed either field to or from userspace so just remove them.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-12 14:34:46 -06:00
Eric W. Biederman
cf4674c46c signal/powerpc: Document conflicts with SI_USER and SIGFPE and SIGTRAP
Setting si_code to 0 results in a userspace seeing an si_code of 0.
This is the same si_code as SI_USER.  Posix and common sense requires
that SI_USER not be a signal specific si_code.  As such this use of 0
for the si_code is a pretty horribly broken ABI.

Further use of si_code == 0 guaranteed that copy_siginfo_to_user saw a
value of __SI_KILL and now sees a value of SIL_KILL with the result
that uid and pid fields are copied and which might copying the si_addr
field by accident but certainly not by design.  Making this a very
flakey implementation.

Utilizing FPE_FIXME and TRAP_FIXME, siginfo_layout() will now return
SIL_FAULT and the appropriate fields will be reliably copied.

Possible ABI fixes includee:
- Send the signal without siginfo
- Don't generate a signal
- Possibly assign and use an appropriate si_code
- Don't handle cases which can't happen
Cc: Paul Mackerras <paulus@samba.org>
Cc: Kumar Gala <kumar.gala@freescale.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc:  linuxppc-dev@lists.ozlabs.org
Ref: 9bad068c24d7 ("[PATCH] ppc32: support for e500 and 85xx")
Ref: 0ed70f6105ef ("PPC32: Provide proper siginfo information on various exceptions.")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-12 14:21:04 -06:00
Benjamin Herrenschmidt
7f1c410da5 powerpc/xive: Add interrupt flag to disable automatic EOI
This will be used by KVM in order to keep escalation interrupts
in the non-EOI (masked) state after they fire. They will be
re-enabled directly in HW by KVM when needed.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-12 15:24:44 +11:00
Benjamin Herrenschmidt
12c1f339cd powerpc/xive: Move definition of ESB bits
From xive.h to xive-regs.h since it's a HW register definition
and it can be used from assembly

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-12 15:24:41 +11:00
Christoph Hellwig
2d9d6f6c9e powerpc: rename dma_direct_ to dma_nommu_
We want to use the dma_direct_ namespace for a generic implementation,
so rename powerpc to the second best choice: dma_nommu_.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-10 16:41:14 +01:00
Christoph Hellwig
b49efd7624 dma-mapping: move dma_mark_clean to dma-direct.h
And unlike the other helpers we don't require a <asm/dma-direct.h> as
this helper is a special case for ia64 only, and this keeps it as
simple as possible.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-10 16:41:12 +01:00
Christoph Hellwig
ea8c64ace8 dma-mapping: move swiotlb arch helpers to a new header
phys_to_dma, dma_to_phys and dma_capable are helpers published by
architecture code for use of swiotlb and xen-swiotlb only.  Drivers are
not supposed to use these directly, but use the DMA API instead.

Move these to a new asm/dma-direct.h helper, included by a
linux/dma-direct.h wrapper that provides the default linear mapping
unless the architecture wants to override it.

In the MIPS case the existing dma-coherent.h is reused for now as
untangling it will take a bit of work.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Robin Murphy <robin.murphy@arm.com>
2018-01-10 16:40:54 +01:00
Michael Ellerman
aa8a5e0062 powerpc/64s: Add support for RFI flush of L1-D cache
On some CPUs we can prevent the Meltdown vulnerability by flushing the
L1-D cache on exit from kernel to user mode, and from hypervisor to
guest.

This is known to be the case on at least Power7, Power8 and Power9. At
this time we do not know the status of the vulnerability on other CPUs
such as the 970 (Apple G5), pasemi CPUs (AmigaOne X1000) or Freescale
CPUs. As more information comes to light we can enable this, or other
mechanisms on those CPUs.

The vulnerability occurs when the load of an architecturally
inaccessible memory region (eg. userspace load of kernel memory) is
speculatively executed to the point where its result can influence the
address of a subsequent speculatively executed load.

In order for that to happen, the first load must hit in the L1,
because before the load is sent to the L2 the permission check is
performed. Therefore if no kernel addresses hit in the L1 the
vulnerability can not occur. We can ensure that is the case by
flushing the L1 whenever we return to userspace. Similarly for
hypervisor vs guest.

In order to flush the L1-D cache on exit, we add a section of nops at
each (h)rfi location that returns to a lower privileged context, and
patch that with some sequence. Newer firmwares are able to advertise
to us that there is a special nop instruction that flushes the L1-D.
If we do not see that advertised, we fall back to doing a displacement
flush in software.

For guest kernels we support migration between some CPU versions, and
different CPUs may use different flush instructions. So that we are
prepared to migrate to a machine with a different flush instruction
activated, we may have to patch more than one flush instruction at
boot if the hypervisor tells us to.

In the end this patch is mostly the work of Nicholas Piggin and
Michael Ellerman. However a cast of thousands contributed to analysis
of the issue, earlier versions of the patch, back ports testing etc.
Many thanks to all of them.

Tested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-10 21:27:06 +11:00
David Howells
0500871f21 Construct init thread stack in the linker script rather than by union
Construct the init thread stack in the linker script rather than doing it
by means of a union so that ia64's init_task.c can be got rid of.

The following symbols are then made available from INIT_TASK_DATA() linker
script macro:

	init_thread_union
	init_stack

INIT_TASK_DATA() also expands the region to THREAD_SIZE to accommodate the
size of the init stack.  init_thread_union is given its own section so that
it can be placed into the stack space in the right order.  I'm assuming
that the ia64 ordering is correct and that the task_struct is first and the
thread_info second.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: Will Deacon <will.deacon@arm.com> (arm64)
Tested-by: Palmer Dabbelt <palmer@sifive.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2018-01-09 23:21:02 +00:00
Nicholas Piggin
222f20f140 powerpc/64s: Simple RFI macro conversions
This commit does simple conversions of rfi/rfid to the new macros that
include the expected destination context. By simple we mean cases
where there is a single well known destination context, and it's
simply a matter of substituting the instruction for the appropriate
macro.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-10 03:07:30 +11:00
Nicholas Piggin
50e51c13b3 powerpc/64: Add macros for annotating the destination of rfid/hrfid
The rfid/hrfid ((Hypervisor) Return From Interrupt) instruction is
used for switching from the kernel to userspace, and from the
hypervisor to the guest kernel. However it can and is also used for
other transitions, eg. from real mode kernel code to virtual mode
kernel code, and it's not always clear from the code what the
destination context is.

To make it clearer when reading the code, add macros which encode the
expected destination context.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-10 03:07:30 +11:00
Christoph Hellwig
c91a7a405b powerpc: remove unused flush_write_buffers definition
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-09 16:28:36 +01:00
Michael Ellerman
a6978f405d Merge branch 'topic/ppc-kvm' into fixes
Merge the topic branch with share with the kvm-ppc tree. In this case
we need to share the definition of a new hypervisor call and
associated flags.
2018-01-10 02:24:34 +11:00
Michael Neuling
191eccb158 powerpc/pseries: Add H_GET_CPU_CHARACTERISTICS flags & wrapper
A new hypervisor call has been defined to communicate various
characteristics of the CPU to guests. Add definitions for the hcall
number, flags and a wrapper function.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-10 01:46:34 +11:00
Sergey Senozhatsky
5633e85b2c powerpc64: Add .opd based function descriptor dereference
We are moving towards separate kernel and module function descriptor
dereference callbacks. This patch enables it for powerpc64.

For pointers that belong to the kernel
-  Added __start_opd and __end_opd pointers, to track the kernel
   .opd section address range;

-  Added dereference_kernel_function_descriptor(). Now we
   will dereference only function pointers that are within
   [__start_opd, __end_opd);

For pointers that belong to a module
-  Added dereference_module_function_descriptor() to handle module
   function descriptor dereference. Now we will dereference only
   pointers that are within [module->opd.start, module->opd.end).

Link: http://lkml.kernel.org/r/20171109234830.5067-4-sergey.senozhatsky@gmail.com
To: Tony Luck <tony.luck@intel.com>
To: Fenghua Yu <fenghua.yu@intel.com>
To: Helge Deller <deller@gmx.de>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To: Paul Mackerras <paulus@samba.org>
To: Michael Ellerman <mpe@ellerman.id.au>
To: James Bottomley <jejb@parisc-linux.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-ia64@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Santosh Sivaraj <santosh@fossix.org> #powerpc
Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-01-09 10:45:37 +01:00
Stephen Boyd
e0af0c1610 arch: Remove clkdev.h asm-generic from Kbuild
Now that every architecture is using the generic clkdev.h file
and we no longer include asm/clkdev.h anywhere in the tree, we
can remove it.

Cc: Russell King <linux@armlinux.org.uk>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: <linux-arch@vger.kernel.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
2018-01-03 09:02:11 -08:00
Linus Torvalds
caf9a82657 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 PTI preparatory patches from Thomas Gleixner:
 "Todays Advent calendar window contains twentyfour easy to digest
  patches. The original plan was to have twenty three matching the date,
  but a late fixup made that moot.

   - Move the cpu_entry_area mapping out of the fixmap into a separate
     address space. That's necessary because the fixmap becomes too big
     with NRCPUS=8192 and this caused already subtle and hard to
     diagnose failures.

     The top most patch is fresh from today and cures a brain slip of
     that tall grumpy german greybeard, who ignored the intricacies of
     32bit wraparounds.

   - Limit the number of CPUs on 32bit to 64. That's insane big already,
     but at least it's small enough to prevent address space issues with
     the cpu_entry_area map, which have been observed and debugged with
     the fixmap code

   - A few TLB flush fixes in various places plus documentation which of
     the TLB functions should be used for what.

   - Rename the SYSENTER stack to CPU_ENTRY_AREA stack as it is used for
     more than sysenter now and keeping the name makes backtraces
     confusing.

   - Prevent LDT inheritance on exec() by moving it to arch_dup_mmap(),
     which is only invoked on fork().

   - Make vysycall more robust.

   - A few fixes and cleanups of the debug_pagetables code. Check
     PAGE_PRESENT instead of checking the PTE for 0 and a cleanup of the
     C89 initialization of the address hint array which already was out
     of sync with the index enums.

   - Move the ESPFIX init to a different place to prepare for PTI.

   - Several code moves with no functional change to make PTI
     integration simpler and header files less convoluted.

   - Documentation fixes and clarifications"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  x86/cpu_entry_area: Prevent wraparound in setup_cpu_entry_area_ptes() on 32bit
  init: Invoke init_espfix_bsp() from mm_init()
  x86/cpu_entry_area: Move it out of the fixmap
  x86/cpu_entry_area: Move it to a separate unit
  x86/mm: Create asm/invpcid.h
  x86/mm: Put MMU to hardware ASID translation in one place
  x86/mm: Remove hard-coded ASID limit checks
  x86/mm: Move the CR3 construction functions to tlbflush.h
  x86/mm: Add comments to clarify which TLB-flush functions are supposed to flush what
  x86/mm: Remove superfluous barriers
  x86/mm: Use __flush_tlb_one() for kernel memory
  x86/microcode: Dont abuse the TLB-flush interface
  x86/uv: Use the right TLB-flush API
  x86/entry: Rename SYSENTER_stack to CPU_ENTRY_AREA_entry_stack
  x86/doc: Remove obvious weirdnesses from the x86 MM layout documentation
  x86/mm/64: Improve the memory map documentation
  x86/ldt: Prevent LDT inheritance on exec
  x86/ldt: Rework locking
  arch, mm: Allow arch_dup_mmap() to fail
  x86/vsyscall/64: Warn and fail vsyscall emulation in NATIVE mode
  ...
2017-12-23 11:53:04 -08:00
Thomas Gleixner
c10e83f598 arch, mm: Allow arch_dup_mmap() to fail
In order to sanitize the LDT initialization on x86 arch_dup_mmap() must be
allowed to fail. Fix up all instances.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: dan.j.williams@intel.com
Cc: hughd@google.com
Cc: keescook@google.com
Cc: kirill.shutemov@linux.intel.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-22 20:13:01 +01:00
Aneesh Kumar K.V
5769beaf18 powerpc/mm: Add proper pte access check helper for other platforms
pte_access_premitted get called in get_user_pages_fast path. If we
have marked the pte PROT_NONE, we should not allow a read access on
the address. With the current implementation we are not checking the
READ and only check for WRITE. This is needed on archs like ppc64 that
implement PROT_NONE using _PAGE_USER access instead of _PAGE_PRESENT.
Also add pte_user check just to make sure we are not accessing kernel
mapping.

Even though there is code duplication, keeping the low level pte
accessors different for different platforms helps in code readability.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-22 22:28:40 +11:00
Aneesh Kumar K.V
f72a85e347 powerpc/mm/book3s/64: Add proper pte access check helper
pte_access_premitted get called in get_user_pages_fast path. If we
have marked the pte PROT_NONE, we should not allow a read access on
the address. With the current implementation we are not checking the
READ and only check for WRITE. This is needed on archs like ppc64 that
implement PROT_NONE using RWX access instead of _PAGE_PRESENT. Also
add pte_user check just to make sure we are not accessing kernel
mapping.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-22 22:28:35 +11:00
Ram Pai
273b493689 powerpc: Swizzle around 4K PTE bits to free up bit 5 and bit 6
We need PTE bits 3 ,4, 5, 6 and 57 to support protection-keys,
because these are the bits we want to consolidate on across all
configuration to support protection keys.

Bit 3,4,5 and 6 are currently used on 4K-pte kernels. But bit 9
and 10 are available. Hence we use the two available bits and
free up bit 5 and 6. We will still not be able to free up bit 3
and 4. In the absence of any other free bits, we will have to
stay satisfied with what we have :-(. This means we will not
be able to support 32 protection keys, but only 8. The bit
numbers are big-endian as defined in the ISA3.0

This patch does the following change to 4K PTE.

H_PAGE_F_SECOND (S) which occupied bit 4 moves to bit 7.
H_PAGE_F_GIX (G,I,X) which occupied bit 5, 6 and 7 also moves
to bit 8,9, 10 respectively.
H_PAGE_HASHPTE (H) which occupied bit 8 moves to bit 4.

Before the patch, the 4k PTE format was as follows

 0 1 2 3 4  5  6  7  8 9 10....................57.....63
 : : : : :  :  :  :  : : :                      :     :
 v v v v v  v  v  v  v v v                      v     v
,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
|x|x|x|B|S |G |I |X |H| | |x|x|................| |x|x|x|
'_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'

After the patch, the 4k PTE format is as follows

 0 1 2 3 4  5  6  7  8 9 10....................57.....63
 : : : : :  :  :  :  : : :                      :     :
 v v v v v  v  v  v  v v v                      v     v
,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
|x|x|x|B|H |  |  |S |G|I|X|x|x|................| |.|.|.|
'_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'

The patch has no code changes; just swizzles around bits.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20 18:57:24 +11:00
Ram Pai
7b84947cad powerpc: shifted-by-one hidx value
0xf is considered invalid hidx value. It indicates absence of a backing
HPTE. A PTE is initialized to 0xf either
a) when it is new it is newly allocated to hold 4k-backing-HPTE
	or
b) Any time it gets demoted to a 4k-backing-HPTE

This patch shifts the representation by one-modulo-0xf; i.e hidx 0 is
represented as 1, 1 as 2,... , and 0xf as 0. This convention lets us
initialize the secondary-part of the PTE to all zeroes. PTEs are anyway
zero'd when allocated. We do not have to zero them again; thus saving on
the initialization.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20 18:57:23 +11:00
Ram Pai
bf9a95f9a6 powerpc: Free up four 64K PTE bits in 64K backed HPTE pages
Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
in the 64K backed HPTE pages. This along with the earlier
patch will entirely free up the four bits from 64K PTE.
The bit numbers are big-endian as defined in the ISA3.0

This patch does the following change to 64K PTE backed
by 64K HPTE.

H_PAGE_F_SECOND (S) which occupied bit 4 moves to the
	second part of the pte to bit 60.
H_PAGE_F_GIX (G,I,X) which occupied bit 5, 6 and 7 also
	moves to the second part of the pte to bit 61,
 	62, 63, 64 respectively

since bit 7 is now freed up, we move H_PAGE_BUSY (B) from
bit 9 to bit 7.

The second part of the PTE will hold
(H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63.
NOTE: None of the bits in the secondary PTE were not used
by 64k-HPTE backed PTE.

Before the patch, the 64K HPTE backed 64k PTE format was
as follows

 0 1 2 3 4  5  6  7  8 9 10...........................63
 : : : : :  :  :  :  : : :                            :
 v v v v v  v  v  v  v v v                            v

,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
|x|x|x| |S |G |I |X |x|B| |x|x|................|x|x|x|x| <- primary pte
'_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'
| | | | |  |  |  |  | | | | |..................| | | | | <- secondary pte
'_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_'

After the patch, the 64k HPTE backed 64k PTE format is
as follows

 0 1 2 3 4  5  6  7  8 9 10...........................63
 : : : : :  :  :  :  : : :                            :
 v v v v v  v  v  v  v v v                            v

,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
|x|x|x| |  |  |  |B |x| | |x|x|................|.|.|.|.| <- primary pte
'_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'
| | | | |  |  |  |  | | | | |..................|S|G|I|X| <- secondary pte
'_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_'

The above PTE changes is applicable to hugetlbpages aswell.

The patch does the following code changes:

a) moves the H_PAGE_F_SECOND and H_PAGE_F_GIX to 4k PTE
	header since it is no more needed b the 64k PTEs.
b) abstracts out __real_pte() and __rpte_to_hidx() so the
	caller need not know the bit location of the slot.
c) moves the slot bits to the secondary pte.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20 18:57:22 +11:00
Ram Pai
9d2edb1848 powerpc: Free up four 64K PTE bits in 4K backed HPTE pages
Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6,
in the 4K backed HPTE pages.These bits continue to be used
for 64K backed HPTE pages in this patch, but will be freed
up in the next patch. The bit numbers are big-endian as
defined in the ISA3.0

The patch does the following change to the 4k HTPE backed
64K PTE's format.

H_PAGE_BUSY moves from bit 3 to bit 9 (B bit in the figure
		below)
V0 which occupied bit 4 is not used anymore.
V1 which occupied bit 5 is not used anymore.
V2 which occupied bit 6 is not used anymore.
V3 which occupied bit 7 is not used anymore.

Before the patch, the 4k backed 64k PTE format was as follows

 0 1 2 3 4  5  6  7  8 9 10...........................63
 : : : : :  :  :  :  : : :                            :
 v v v v v  v  v  v  v v v                            v

,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
|x|x|x|B|V0|V1|V2|V3|x| | |x|x|................|x|x|x|x| <- primary pte
'_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'
|S|G|I|X|S |G |I |X |S|G|I|X|..................|S|G|I|X| <- secondary pte
'_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_'

After the patch, the 4k backed 64k PTE format is as follows

 0 1 2 3 4  5  6  7  8 9 10...........................63
 : : : : :  :  :  :  : : :                            :
 v v v v v  v  v  v  v v v                            v

,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
|x|x|x| |  |  |  |  |x|B| |x|x|................|.|.|.|.| <- primary pte
'_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'
|S|G|I|X|S |G |I |X |S|G|I|X|..................|S|G|I|X| <- secondary pte
'_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_'

the four bits S,G,I,X (one quadruplet per 4k HPTE) that
cache the hash-bucket slot value, is initialized to
1,1,1,1 indicating -- an invalid slot. If a HPTE gets
cached in a 1111 slot(i.e 7th slot of secondary hash
bucket), it is released immediately. In other words,
even though 1111 is a valid slot value in the hash
bucket, we consider it invalid and release the slot and
the HPTE. This gives us the opportunity to determine
the validity of S,G,I,X bits based on its contents and
not on any of the bits V0,V1,V2 or V3 in the primary PTE

When we release a HPTE cached in the 1111 slot
we also release a legitimate slot in the primary
hash bucket and unmap its corresponding HPTE. This
is to ensure that we do get a HPTE cached in a slot
of the primary hash bucket, the next time we retry.

Though treating 1111 slot as invalid, reduces the
number of available slots in the hash bucket and may
have an effect on the performance, the probabilty of
hitting a 1111 slot is extermely low.

Compared to the current scheme, the above scheme
reduces the number of false hash table updates
significantly and has the added advantage of releasing
four valuable PTE bits for other purpose.

NOTE:even though bits 3, 4, 5, 6, 7 are not used when
the 64K PTE is backed by 4k HPTE, they continue to be
used if the PTE gets backed by 64k HPTE. The next
patch will decouple that aswell, and truely release the
bits.

This idea was jointly developed by Paul Mackerras,
Aneesh, Michael Ellermen and myself.

4K PTE format remains unchanged currently.

The patch does the following code changes
a) PTE flags are split between 64k and 4k header files.
b) __hash_page_4K() is reimplemented to reflect the
 above logic.

Acked-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20 18:57:20 +11:00
Ram Pai
318995b4f5 powerpc: introduce pte_get_hash_gslot() helper
Introduce pte_get_hash_gslot()() which returns the global slot number of
the HPTE in the global hash table.

This function will come in handy as we work towards re-arranging the PTE
bits in the later patches.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20 18:57:19 +11:00
Ram Pai
59aa31fd6f powerpc: introduce pte_set_hidx() helper
Introduce pte_set_hidx().It sets the (H_PAGE_F_SECOND|H_PAGE_F_GIX) bits
at the appropriate location in the PTE of 4K PTE. For 64K PTE, it sets
the bits in the second part of the PTE. Though the implementation for
the former just needs the slot parameter, it does take some additional
parameters to keep the prototype consistent.

This function will be handy as we work towards re-arranging the bits in
the subsequent patches.

Acked-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20 18:57:19 +11:00