When idle_power4() hard disables interrupts then finds a soft pending
interrupt, it returns with interrupts hard disabled but without
PACA_IRQ_HARD_DIS set. Commit 9b81c0211c ("powerpc/64s: make
PACA_IRQ_HARD_DIS track MSR[EE] closely") added a warning for that
condition (since disabled).
Fix this by adding the PACA_IRQ_HARD_DIS for that case.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Marking default memory region as cached is not always sufficient and is
not flexible. Allow specifying cache attributes for the whole memory
address space with new config entry MEMMAP_CACHEATTR. Apply it after
cache initialization.
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Cache invalidation macros use cache line size to iterate over
invalidated cache lines, assuming that all cache ways are invalidated by
single instruction, but xtensa ISA recommends to not assume that for
future compatibility:
In some implementations all ways at index Addry-1..z are invalidated
regardless of the specified way, but for future compatibility this
behavior should not be assumed.
Iterate over all cache ways in ___invalidate_icache_all and
___invalidate_dcache_all.
Cc: stable@vger.kernel.org
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
When building kernel for xtensa cores with big cache lines (e.g. 128
bytes or more) __loop_cache_all and __loop_cache_page may generate
assembly instructions with immediate fields that are too big. This
results in the following build errors:
arch/xtensa/mm/misc.S: Assembler messages:
arch/xtensa/mm/misc.S:464: Error: operand 2 of 'diwbi' has invalid value '256'
arch/xtensa/mm/misc.S:464: Error: operand 2 of 'diwbi' has invalid value '384'
arch/xtensa/kernel/head.S: Assembler messages:
arch/xtensa/kernel/head.S:172: Error: operand 2 of 'diu' has invalid value '256'
arch/xtensa/kernel/head.S:172: Error: operand 2 of 'diu' has invalid value '384'
arch/xtensa/kernel/head.S:176: Error: operand 2 of 'iiu' has invalid value '256'
arch/xtensa/kernel/head.S:176: Error: operand 2 of 'iiu' has invalid value '384'
arch/xtensa/kernel/head.S:255: Error: operand 2 of 'diwb' has invalid value '256'
arch/xtensa/kernel/head.S:255: Error: operand 2 of 'diwb' has invalid value '384'
Add parameter max_immed to these macros and use it to limit values of
immediate operands. Extract common code of these macros into the new
macro __loop_cache_unroll.
Cc: stable@vger.kernel.org
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
An overview of the general architecture changes:
- Massive DMA ops refactoring from Christoph Hellwig (huzzah for
deleting crufty code!).
- We introduce NT_MIPS_DSP & NT_MIPS_FP_MODE ELF notes & corresponding
regsets to expose DSP ASE & floating point mode state respectively,
both for live debugging & core dumps.
- We better optimize our code by hard-coding cpu_has_* macros at
compile time where their values are known due to the ISA revision
that the kernel build is targeting.
- The EJTAG exception handler now better handles SMP systems, where it
was previously possible for CPUs to clobber a register value saved
by another CPU.
- Our implementation of memset() gained a couple of fixes for MIPSr6
systems to return correct values in some cases where stores fault.
- We now implement ioremap_wc() using the uncached-accelerated cache
coherency attribute where supported, which is detected during boot,
and fall back to plain uncached access where necessary. The
MIPS-specific (and unused in tree) ioremap_uncached_accelerated() &
ioremap_cacheable_cow() are removed.
- The prctl(PR_SET_FP_MODE, ...) syscall is better supported for SMP
systems by reworking the way we ensure remote CPUs that may be
running threads within the affected process switch mode.
- Systems using the MIPS Coherence Manager will now set the
MIPS_IC_SNOOPS_REMOTE flag to avoid some unnecessary cache
maintenance overhead when flushing the icache.
- A few fixes were made for building with clang/LLVM, which
now sucessfully builds kernels for many of our platforms.
- Miscellaneous cleanups all over.
And some platform-specific changes:
- ar7 gained stubs for a few clock API functions to fix build failures
for some drivers.
- ath79 gained support for a few new SoCs, a few fixes & better
gpio-keys support.
- Ci20 now exposes its SPI bus using the spi-gpio driver.
- The generic platform can now auto-detect a suitable value for
PHYS_OFFSET based upon the memory map described by the device tree,
allowing us to avoid wasting memory on page book-keeping for systems
where RAM starts at a non-zero physical address.
- Ingenic systems using the jz4740 platform code now link their
vmlinuz higher to allow for kernels of a realistic size.
- Loongson32 now builds the kernel targeting MIPSr1 rather than MIPSr2
to avoid CPU errata.
- Loongson64 gains a couple of fixes, a workaround for a write
buffering issue & support for the Loongson 3A R3.1 CPU.
- Malta now uses the piix4-poweroff driver to handle powering down.
- Microsemi Ocelot gained support for its SPI bus & NOR flash, its
second MDIO bus and can now be supported by a FIT/.itb image.
- Octeon saw a bunch of header cleanups which remove a lot of
duplicate or unused code.
-----BEGIN PGP SIGNATURE-----
iIsEABYIADMWIQRgLjeFAZEXQzy86/s+p5+stXUA3QUCW3G6JxUccGF1bC5idXJ0
b25AbWlwcy5jb20ACgkQPqefrLV1AN0n/gD/Rpdgay31G/4eTTKBmBrcaju6Shjt
/2Iu6WC5Sj4hDHUBAJSbuI+B9YjcNsjekBYxB/LLD7ImcLBl6nLMIvKmXLAL
=cUiF
-----END PGP SIGNATURE-----
Merge tag 'mips_4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS updates from Paul Burton:
"Here are the main MIPS changes for 4.19.
An overview of the general architecture changes:
- Massive DMA ops refactoring from Christoph Hellwig (huzzah for
deleting crufty code!).
- We introduce NT_MIPS_DSP & NT_MIPS_FP_MODE ELF notes &
corresponding regsets to expose DSP ASE & floating point mode state
respectively, both for live debugging & core dumps.
- We better optimize our code by hard-coding cpu_has_* macros at
compile time where their values are known due to the ISA revision
that the kernel build is targeting.
- The EJTAG exception handler now better handles SMP systems, where
it was previously possible for CPUs to clobber a register value
saved by another CPU.
- Our implementation of memset() gained a couple of fixes for MIPSr6
systems to return correct values in some cases where stores fault.
- We now implement ioremap_wc() using the uncached-accelerated cache
coherency attribute where supported, which is detected during boot,
and fall back to plain uncached access where necessary. The
MIPS-specific (and unused in tree) ioremap_uncached_accelerated() &
ioremap_cacheable_cow() are removed.
- The prctl(PR_SET_FP_MODE, ...) syscall is better supported for SMP
systems by reworking the way we ensure remote CPUs that may be
running threads within the affected process switch mode.
- Systems using the MIPS Coherence Manager will now set the
MIPS_IC_SNOOPS_REMOTE flag to avoid some unnecessary cache
maintenance overhead when flushing the icache.
- A few fixes were made for building with clang/LLVM, which now
sucessfully builds kernels for many of our platforms.
- Miscellaneous cleanups all over.
And some platform-specific changes:
- ar7 gained stubs for a few clock API functions to fix build
failures for some drivers.
- ath79 gained support for a few new SoCs, a few fixes & better
gpio-keys support.
- Ci20 now exposes its SPI bus using the spi-gpio driver.
- The generic platform can now auto-detect a suitable value for
PHYS_OFFSET based upon the memory map described by the device tree,
allowing us to avoid wasting memory on page book-keeping for
systems where RAM starts at a non-zero physical address.
- Ingenic systems using the jz4740 platform code now link their
vmlinuz higher to allow for kernels of a realistic size.
- Loongson32 now builds the kernel targeting MIPSr1 rather than
MIPSr2 to avoid CPU errata.
- Loongson64 gains a couple of fixes, a workaround for a write
buffering issue & support for the Loongson 3A R3.1 CPU.
- Malta now uses the piix4-poweroff driver to handle powering down.
- Microsemi Ocelot gained support for its SPI bus & NOR flash, its
second MDIO bus and can now be supported by a FIT/.itb image.
- Octeon saw a bunch of header cleanups which remove a lot of
duplicate or unused code"
* tag 'mips_4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (123 commits)
MIPS: Remove remnants of UASM_ISA
MIPS: netlogic: xlr: Remove erroneous check in nlm_fmn_send()
MIPS: VDSO: Force link endianness
MIPS: Always specify -EB or -EL when using clang
MIPS: Use dins to simplify __write_64bit_c0_split()
MIPS: Use read-write output operand in __write_64bit_c0_split()
MIPS: Avoid using array as parameter to write_c0_kpgd()
MIPS: vdso: Allow clang's --target flag in VDSO cflags
MIPS: genvdso: Remove GOT checks
MIPS: Remove obsolete MIPS checks for DST node "chosen@0"
MIPS: generic: Remove input symbols from defconfig
MIPS: Delete unused code in linux32.c
MIPS: Remove unused sys_32_mmap2
MIPS: Remove nabi_no_regargs
mips: dts: mscc: enable spi and NOR flash support on ocelot PCB123
mips: dts: mscc: Add spi on Ocelot
MIPS: Loongson: Merge load addresses
MIPS: Loongson: Set Loongson32 to MIPS32R1
MIPS: mscc: ocelot: add interrupt controller properties to GPIO controller
MIPS: generic: Select MIPS_AUTO_PFN_OFFSET
...
Pull parisc updates from Helge Deller:
- parisc now uses the generic dma_noncoherent_ops implementation
(Christoph Hellwig)
- further memory barrier and spinlock improvements (John David Anglin)
- prepare removal of current_text_addr() functions (Nick Desaulniers)
- improve kernel stack unwinding on parisc (me)
- drop ENOTSUP which was defined on parisc only (me)
* 'parisc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix and improve kernel stack unwinding
parisc: Remove unnecessary barriers from spinlock.h
parisc: Remove ordered stores from syscall.S
parisc: prefer _THIS_IP_ and _RET_IP_ statement expressions
parisc: Add HAVE_REGS_AND_STACK_ACCESS_API feature
parisc: Drop architecture-specific ENOTSUP define
parisc: use generic dma_noncoherent_ops
parisc: always use flush_kernel_dcache_range for DMA cache maintainance
parisc: merge pcx_dma_ops and pcxl_dma_ops
Pull ARM updates from Russell King:
- further Spectre variant 1 fixes for user accessors.
- kbuild cleanups (Masahiro Yamada)
- hook up sync core functionality (Will Deacon)
- nommu updates for hypervisor mode booting (Vladimir Murzin)
- use compiler built-ins for fls and ffs (Nicolas Pitre)
* 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: spectre-v1: mitigate user accesses
ARM: spectre-v1: use get_user() for __get_user()
ARM: use __inttype() in get_user()
ARM: oabi-compat: copy semops using __copy_from_user()
ARM: vfp: use __copy_from_user() when restoring VFP state
ARM: 8785/1: use compiler built-ins for ffs and fls
ARM: 8784/1: NOMMU: Allow enter in Hyp mode
ARM: 8783/1: NOMMU: Extend check for VBAR support
ARM: 8782/1: vfp: clean up arch/arm/vfp/Makefile
ARM: signal: copy registers using __copy_from_user()
ARM: tcm: ensure inline stub functions are marked static
ARM: 8779/1: add endianness option to LDFLAGS instead of LD
ARM: 8777/1: Hook up SYNC_CORE functionality for sys_membarrier()
Pull s390 updates from Heiko Carstens:
"Since Martin is on vacation you get the s390 pull request from me:
- Host large page support for KVM guests. As the patches have large
impact on arch/s390/mm/ this series goes out via both the KVM and
the s390 tree.
- Add an option for no compression to the "Kernel compression mode"
menu, this will come in handy with the rework of the early boot
code.
- A large rework of the early boot code that will make life easier
for KASAN and KASLR. With the rework the bootable uncompressed
image is not generated anymore, only the bzImage is available. For
debuggung purposes the new "no compression" option is used.
- Re-enable the gcc plugins as the issue with the latent entropy
plugin is solved with the early boot code rework.
- More spectre relates changes:
+ Detect the etoken facility and remove expolines automatically.
+ Add expolines to a few more indirect branches.
- A rewrite of the common I/O layer trace points to make them
consumable by 'perf stat'.
- Add support for format-3 PCI function measurement blocks.
- Changes for the zcrypt driver:
+ Add attributes to indicate the load of cards and queues.
+ Restructure some code for the upcoming AP device support in KVM.
- Build flags improvements in various Makefiles.
- A few fixes for the kdump support.
- A couple of patches for gcc 8 compile warning cleanup.
- Cleanup s390 specific proc handlers.
- Add s390 support to the restartable sequence self tests.
- Some PTR_RET vs PTR_ERR_OR_ZERO cleanup.
- Lots of bug fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (107 commits)
s390/dasd: fix hanging offline processing due to canceled worker
s390/dasd: fix panic for failed online processing
s390/mm: fix addressing exception after suspend/resume
rseq/selftests: add s390 support
s390: fix br_r1_trampoline for machines without exrl
s390/lib: use expoline for all bcr instructions
s390/numa: move initial setup of node_to_cpumask_map
s390/kdump: Fix elfcorehdr size calculation
s390/cpum_sf: save TOD clock base in SDBs for time conversion
KVM: s390: Add huge page enablement control
s390/mm: Add huge page gmap linking support
s390/mm: hugetlb pages within a gmap can not be freed
KVM: s390: Add skey emulation fault handling
s390/mm: Add huge pmd storage key handling
s390/mm: Clear skeys for newly mapped huge guest pmds
s390/mm: Clear huge page storage keys on enable_skey
s390/mm: Add huge page dirty sync support
s390/mm: Add gmap pmd invalidation and clearing
s390/mm: Add gmap pmd notification bit setting
s390/mm: Add gmap pmd linking
...
Pull x86 timer updates from Thomas Gleixner:
"Early TSC based time stamping to allow better boot time analysis.
This comes with a general cleanup of the TSC calibration code which
grew warts and duct taping over the years and removes 250 lines of
code. Initiated and mostly implemented by Pavel with help from various
folks"
* 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
x86/kvmclock: Mark kvm_get_preset_lpj() as __init
x86/tsc: Consolidate init code
sched/clock: Disable interrupts when calling generic_sched_clock_init()
timekeeping: Prevent false warning when persistent clock is not available
sched/clock: Close a hole in sched_clock_init()
x86/tsc: Make use of tsc_calibrate_cpu_early()
x86/tsc: Split native_calibrate_cpu() into early and late parts
sched/clock: Use static key for sched_clock_running
sched/clock: Enable sched clock early
sched/clock: Move sched clock initialization and merge with generic clock
x86/tsc: Use TSC as sched clock early
x86/tsc: Initialize cyc2ns when tsc frequency is determined
x86/tsc: Calibrate tsc only once
ARM/time: Remove read_boot_clock64()
s390/time: Remove read_boot_clock64()
timekeeping: Default boot time offset to local_clock()
timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
s390/time: Add read_persistent_wall_and_boot_offset()
x86/xen/time: Output xen sched_clock time from 0
x86/xen/time: Initialize pv xen time in init_hypervisor_platform()
...
Pull x86 PTI updates from Thomas Gleixner:
"The Speck brigade sadly provides yet another large set of patches
destroying the perfomance which we carefully built and preserved
- PTI support for 32bit PAE. The missing counter part to the 64bit
PTI code implemented by Joerg.
- A set of fixes for the Global Bit mechanics for non PCID CPUs which
were setting the Global Bit too widely and therefore possibly
exposing interesting memory needlessly.
- Protection against userspace-userspace SpectreRSB
- Support for the upcoming Enhanced IBRS mode, which is preferred
over IBRS. Unfortunately we dont know the performance impact of
this, but it's expected to be less horrible than the IBRS
hammering.
- Cleanups and simplifications"
* 'x86/pti' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
x86/mm/pti: Move user W+X check into pti_finalize()
x86/relocs: Add __end_rodata_aligned to S_REL
x86/mm/pti: Clone kernel-image on PTE level for 32 bit
x86/mm/pti: Don't clear permissions in pti_clone_pmd()
x86/mm/pti: Fix 32 bit PCID check
x86/mm/init: Remove freed kernel image areas from alias mapping
x86/mm/init: Add helper for freeing kernel image pages
x86/mm/init: Pass unconverted symbol addresses to free_init_pages()
mm: Allow non-direct-map arguments to free_reserved_area()
x86/mm/pti: Clear Global bit more aggressively
x86/speculation: Support Enhanced IBRS on future CPUs
x86/speculation: Protect against userspace-userspace spectreRSB
x86/kexec: Allocate 8k PGDs for PTI
Revert "perf/core: Make sure the ring-buffer is mapped in all page-tables"
x86/mm: Remove in_nmi() warning from vmalloc_fault()
x86/entry/32: Check for VM86 mode in slow-path check
perf/core: Make sure the ring-buffer is mapped in all page-tables
x86/pti: Check the return value of pti_user_pagetable_walk_pmd()
x86/pti: Check the return value of pti_user_pagetable_walk_p4d()
x86/entry/32: Add debug code to check entry/exit CR3
...
Pull x86 vdso update from Thomas Gleixner:
"Use LD to link the VDSO libs instead of indirecting trough CC which
causes build failures with Clang"
* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: vdso: Use $LD instead of $CC to link
Add addition argument 'arch_uprobe' to uprobe_write_opcode().
We need this in later set of patches.
Link: http://lkml.kernel.org/r/20180809041856.1547-3-ravi.bangoria@linux.ibm.com
Reviewed-by: Song Liu <songliubraving@fb.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull misc x86 fixes from Thomas Gleixner:
"Two fixes for x86:
- Provide a declaration for native_save_fl() which unbreaks the
wreckage caused by making it 'extern inline'.
- Fix the failing paravirt patching which is supposed to replace
indirect with direct calls. The wreckage is caused by an incorrect
clobber test"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/paravirt: Fix spectre-v2 mitigations for paravirt guests
x86/irqflags: Provide a declaration for native_save_fl
Pull x86 mm updates from Thomas Gleixner:
- Make lazy TLB mode even lazier to avoid pointless switch_mm()
operations, which reduces CPU load by 1-2% for memcache workloads
- Small cleanups and improvements all over the place
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Remove redundant check for kmem_cache_create()
arm/asm/tlb.h: Fix build error implicit func declaration
x86/mm/tlb: Make clear_asid_other() static
x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off()
x86/mm/tlb: Always use lazy TLB mode
x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
x86/mm/tlb: Make lazy TLB mode lazier
x86/mm/tlb: Restructure switch_mm_irqs_off()
x86/mm/tlb: Leave lazy TLB mode at page table free time
mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids
x86/mm: Add TLB purge to free pmd/pte page interfaces
ioremap: Update pgtable free interfaces with addr
x86/mm: Disable ioremap free page handling on x86-PAE
Pull x86 platform updates from Thomas Gleixner:
"Trivial cleanups and improvements"
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/UV: Remove redundant check of p == q
x86/platform/olpc: Use PTR_ERR_OR_ZERO()
x86/platform/UV: Mark memblock related init code and data correctly
Pull x86 cache QoS (RDT/CAR) updates from Thomas Gleixner:
"Add support for pseudo-locked cache regions.
Cache Allocation Technology (CAT) allows on certain CPUs to isolate a
region of cache and 'lock' it. Cache pseudo-locking builds on the fact
that a CPU can still read and write data pre-allocated outside its
current allocated area on cache hit. With cache pseudo-locking data
can be preloaded into a reserved portion of cache that no application
can fill, and from that point on will only serve cache hits. The cache
pseudo-locked memory is made accessible to user space where an
application can map it into its virtual address space and thus have a
region of memory with reduced average read latency.
The locking is not perfect and gets totally screwed by WBINDV and
similar mechanisms, but it provides a reasonable enhancement for
certain types of latency sensitive applications.
The implementation extends the current CAT mechanism and provides a
generally useful exclusive CAT mode on which it builds the extra
pseude-locked regions"
* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
x86/intel_rdt: Disable PMU access
x86/intel_rdt: Fix possible circular lock dependency
x86/intel_rdt: Make CPU information accessible for pseudo-locked regions
x86/intel_rdt: Support restoration of subset of permissions
x86/intel_rdt: Fix cleanup of plr structure on error
x86/intel_rdt: Move pseudo_lock_region_clear()
x86/intel_rdt: Limit C-states dynamically when pseudo-locking active
x86/intel_rdt: Support L3 cache performance event of Broadwell
x86/intel_rdt: More precise L2 hit/miss measurements
x86/intel_rdt: Create character device exposing pseudo-locked region
x86/intel_rdt: Create debugfs files for pseudo-locking testing
x86/intel_rdt: Create resctrl debug area
x86/intel_rdt: Ensure RDT cleanup on exit
x86/intel_rdt: Resctrl files reflect pseudo-locked information
x86/intel_rdt: Support creation/removal of pseudo-locked region
x86/intel_rdt: Pseudo-lock region creation/removal core
x86/intel_rdt: Discover supported platforms via prefetch disable bits
x86/intel_rdt: Add utilities to test pseudo-locked region possibility
x86/intel_rdt: Split resource group removal in two
x86/intel_rdt: Enable entering of pseudo-locksetup mode
...
Pull x86/hyper-v update from Thomas Gleixner:
"Add fast hypercall support for guest running on the Microsoft HyperV(isor)"
* 'x86-hyperv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/hyper-v: Fix wrong merge conflict resolution
x86/hyper-v: Check for VP_INVAL in hyperv_flush_tlb_others()
x86/hyper-v: Check cpumask_to_vpset() return value in hyperv_flush_tlb_others_ex()
x86/hyper-v: Trace PV IPI send
x86/hyper-v: Use cheaper HVCALL_SEND_IPI hypercall when possible
x86/hyper-v: Use 'fast' hypercall for HVCALL_SEND_IPI
x86/hyper-v: Implement hv_do_fast_hypercall16
x86/hyper-v: Use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} hypercalls when possible
Pull x86 dump printing cleanup from Thomas Gleixner:
"Clean up the show_opcodes() printout so nested dumps can be properly
differentiated"
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Avoid pr_cont() in show_opcodes()
Pull x86 cpu updates from Thomas Gleixner:
"Two small updates for the CPU code:
- Improve NUMA emulation
- Add the EPT_AD CPU feature bit"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpufeatures: Add EPT_AD feature bit
x86/numa_emulation: Introduce uniform split capability
x86/numa_emulation: Fix emulated-to-physical node mapping
Pull x86 cleanups from Thomas Gleixner:
"Trival cleanups"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/iommu: Use NULL instead of 0
x86/platform/pcspeaker: Use PTR_ERR_OR_ZERO() to fix ptr_ret.cocci warning
Pull x86 build cleanup from Thomas Gleixner:
"Remove a stale quirk for a no longer supported GCC version"
* 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/build: Remove old -funit-at-a-time GCC quirk
Pull x86 asm updates from Thomas Gleixner:
"The lowlevel and ASM code updates for x86:
- Make stack trace unwinding more reliable
- ASM instruction updates for better code generation
- Various cleanups"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry/64: Add two more instruction suffixes
x86/asm/64: Use 32-bit XOR to zero registers
x86/build/vdso: Simplify 'cmd_vdso2c'
x86/build/vdso: Remove unused vdso-syms.lds
x86/stacktrace: Enable HAVE_RELIABLE_STACKTRACE for the ORC unwinder
x86/unwind/orc: Detect the end of the stack
x86/stacktrace: Do not fail for ORC with regs on stack
x86/stacktrace: Clarify the reliable success paths
x86/stacktrace: Remove STACKTRACE_DUMP_ONCE
x86/stacktrace: Do not unwind after user regs
x86/asm: Use CC_SET/CC_OUT in percpu_cmpxchg8b_double() to micro-optimize code generation
Pull x86 boot updates from Thomas Gleixner:
"Boot code updates for x86:
- Allow to skip a given amount of huge pages for address layout
randomization on the kernel command line to prevent regressions in
the huge page allocation with small memory sizes
- Various cleanups"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Use CC_SET()/CC_OUT() instead of open coding it
x86/boot/KASLR: Make local variable mem_limit static
x86/boot/KASLR: Skip specified number of 1GB huge pages when doing physical randomization (KASLR)
x86/boot/KASLR: Add two new functions for 1GB huge pages handling
Pull x86 apic update from Thomas Gleixner:
"Trivial cleanups of the APIC related code"
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Trivial coding style fixes
x86/vector: Merge allocate_vector() into assign_vector_locked()
Pull perf update from Thomas Gleixner:
"The perf crowd presents:
Kernel updates:
- Removal of jprobes
- Cleanup and consolidatation the handling of kprobes
- Cleanup and consolidation of hardware breakpoints
- The usual pile of fixes and updates to PMUs and event descriptors
Tooling updates:
- Updates and improvements all over the place. Nothing outstanding,
just the (good) boring incremental grump work"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
perf trace: Do not require --no-syscalls to suppress strace like output
perf bpf: Include uapi/linux/bpf.h from the 'perf trace' script's bpf.h
perf tools: Allow overriding MAX_NR_CPUS at compile time
perf bpf: Show better message when failing to load an object
perf list: Unify metric group description format with PMU event description
perf vendor events arm64: Update ThunderX2 implementation defined pmu core events
perf cs-etm: Generate branch sample for CS_ETM_TRACE_ON packet
perf cs-etm: Generate branch sample when receiving a CS_ETM_TRACE_ON packet
perf cs-etm: Support dummy address value for CS_ETM_TRACE_ON packet
perf cs-etm: Fix start tracing packet handling
perf build: Fix installation directory for eBPF
perf c2c report: Fix crash for empty browser
perf tests: Fix indexing when invoking subtests
perf trace: Beautify the AF_INET & AF_INET6 'socket' syscall 'protocol' args
perf trace beauty: Add beautifiers for 'socket''s 'protocol' arg
perf trace beauty: Do not print NULL strarray entries
perf beauty: Add a generator for IPPROTO_ socket's protocol constants
tools include uapi: Grab a copy of linux/in.h
perf tests: Fix complex event name parsing
perf evlist: Fix error out while applying initial delay and LBR
...
Pull locking/atomics update from Thomas Gleixner:
"The locking, atomics and memory model brains delivered:
- A larger update to the atomics code which reworks the ordering
barriers, consolidates the atomic primitives, provides the new
atomic64_fetch_add_unless() primitive and cleans up the include
hell.
- Simplify cmpxchg() instrumentation and add instrumentation for
xchg() and cmpxchg_double().
- Updates to the memory model and documentation"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
locking/atomics: Rework ordering barriers
locking/atomics: Instrument cmpxchg_double*()
locking/atomics: Instrument xchg()
locking/atomics: Simplify cmpxchg() instrumentation
locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
tools/memory-model: Rename litmus tests to comply to norm7
tools/memory-model/Documentation: Fix typo, smb->smp
sched/Documentation: Update wake_up() & co. memory-barrier guarantees
locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
sched/core: Use smp_mb() in wake_woken_function()
tools/memory-model: Add informal LKMM documentation to MAINTAINERS
locking/atomics/Documentation: Describe atomic_set() as a write operation
tools/memory-model: Make scripts executable
tools/memory-model: Remove ACCESS_ONCE() from model
tools/memory-model: Remove ACCESS_ONCE() from recipes
locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
tools/memory-model: Add litmus test for full multicopy atomicity
locking/refcount: Always allow checked forms
...
Pull scheduler updates from Thomas Gleixner:
- Cleanup and improvement of NUMA balancing
- Refactoring and improvements to the PELT (Per Entity Load Tracking)
code
- Watchdog simplification and related cleanups
- The usual pile of small incremental fixes and improvements
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
watchdog: Reduce message verbosity
stop_machine: Reflow cpu_stop_queue_two_works()
sched/numa: Move task_numa_placement() closer to numa_migrate_preferred()
sched/numa: Use group_weights to identify if migration degrades locality
sched/numa: Update the scan period without holding the numa_group lock
sched/numa: Remove numa_has_capacity()
sched/numa: Modify migrate_swap() to accept additional parameters
sched/numa: Remove unused task_capacity from 'struct numa_stats'
sched/numa: Skip nodes that are at 'hoplimit'
sched/debug: Reverse the order of printing faults
sched/numa: Use task faults only if numa_group is not yet set up
sched/numa: Set preferred_node based on best_cpu
sched/numa: Simplify load_too_imbalanced()
sched/numa: Evaluate move once per node
sched/numa: Remove redundant field
sched/debug: Show the sum wait time of a task group
sched/fair: Remove #ifdefs from scale_rt_capacity()
sched/core: Remove get_cpu() from sched_fork()
sched/cpufreq: Clarify sugov_get_util()
sched/sysctl: Remove unused sched_time_avg_ms sysctl
...
Pull x86 RAS updates from Thomas Gleixner:
"A small set of changes to the RAS core:
- Rework of the MCE bank scanning code
- Y2038 converion"
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Cleanup __mc_scan_banks()
x86/mce: Carve out bank scanning code
x86/mce: Remove !banks check
x86/mce: Carve out the crashing_cpu check
x86/mce: Always use 64-bit timestamps
Pull genirq updates from Thomas Gleixner:
"The irq departement provides:
- A synchronization fix for free_irq() to synchronize just the
removed interrupt thread on shared interrupt lines.
- Consolidate the multi low level interrupt entry handling and mvoe
it to the generic code instead of adding yet another copy for
RISC-V
- Refactoring of the ARM LPI allocator and LPI exposure to the
hypervisor
- Yet another interrupt chip driver for the JZ4725B SoC
- Speed up for /proc/interrupts as people seem to love reading this
file with high frequency
- Miscellaneous fixes and updates"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
irqchip/gic-v3-its: Make its_lock a raw_spin_lock_t
genirq/irqchip: Remove MULTI_IRQ_HANDLER as it's now obselete
openrisc: Use the new GENERIC_IRQ_MULTI_HANDLER
arm64: Use the new GENERIC_IRQ_MULTI_HANDLER
ARM: Convert to GENERIC_IRQ_MULTI_HANDLER
irqchip: Port the ARM IRQ drivers to GENERIC_IRQ_MULTI_HANDLER
irqchip/gic-v3-its: Reduce minimum LPI allocation to 1 for PCI devices
dt-bindings: irqchip: renesas-irqc: Document r8a77980 support
dt-bindings: irqchip: renesas-irqc: Document r8a77470 support
irqchip/ingenic: Add support for the JZ4725B SoC
irqchip/stm32: Add exti0 translation for stm32mp1
genirq: Remove redundant NULL pointer check in __free_irq()
irqchip/gic-v3-its: Honor hypervisor enforced LPI range
irqchip/gic-v3: Expose GICD_TYPER in the rdist structure
irqchip/gic-v3-its: Drop chunk allocation compatibility
irqchip/gic-v3-its: Move minimum LPI requirements to individual busses
irqchip/gic-v3-its: Use full range of LPIs
irqchip/gic-v3-its: Refactor LPI allocator
genirq: Synchronize only with single thread on free_irq()
genirq: Update code comments wrt recycled thread_mask
...
Pull EFI updates from Thomas Gleixner:
"The EFI pile:
- Make mixed mode UEFI runtime service invocations mutually
exclusive, as mandated by the UEFI spec
- Perform UEFI runtime services calls from a work queue so the calls
into the firmware occur from a kernel thread
- Honor the UEFI memory map attributes for live memory regions
configured by UEFI as a framebuffer. This works around a coherency
problem with KVM guests running on ARM.
- Cleanups, improvements and fixes all over the place"
* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efivars: Call guid_parse() against guid_t type of variable
efi/cper: Use consistent types for UUIDs
efi/x86: Replace references to efi_early->is64 with efi_is_64bit()
efi: Deduplicate efi_open_volume()
efi/x86: Add missing NULL initialization in UGA draw protocol discovery
efi/x86: Merge 32-bit and 64-bit UGA draw protocol setup routines
efi/x86: Align efi_uga_draw_protocol typedef names to convention
efi/x86: Merge the setup_efi_pci32() and setup_efi_pci64() routines
efi/x86: Prevent reentrant firmware calls in mixed mode
efi/esrt: Only call efi_mem_reserve() for boot services memory
fbdev/efifb: Honour UEFI memory map attributes when mapping the FB
efi: Drop type and attribute checks in efi_mem_desc_lookup()
efi/libstub/arm: Add opt-in Kconfig option for the DTB loader
efi: Remove the declaration of efi_late_init() as the function is unused
efi/cper: Avoid using get_seconds()
efi: Use a work queue to invoke EFI Runtime Services
efi/x86: Use non-blocking SetVariable() for efi_delete_dummy_variable()
efi/x86: Clean up the eboot code
Last set of new features for 4.19. Most notable is simplifying SSB
debugging code with two Kconfig option removals and fixing mt76 USB
build problems.
Major changes:
ath10k
* add debugfs file warm_hw_reset
wil6210
* add debugfs files tx_latency, link_stats and link_stats_global
* add 3-MSI support
* allow scan on AP interface
* support max aggregation window size 64
ssb
* remove CONFIG_SSB_SILENT and CONFIG_SSB_DEBUG Kconfig options
mt76
* fix build problems with recently added USB support
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJbcHsyAAoJEG4XJFUm622biTsH/17f4jCOjg49PsLxX7GhJdyS
3LiauBKgT+7s5HU9Ak7ogkM16uzb3yjw22VsO/6SEoUIAG77P2Ns1/8u8fgfvQxm
RyiPglctfLMFlzAyOUihkVmKICH0pjVCU5AB94+71g8ZjSkvxRF0SbeP5an4ojYt
0rTcSbrweuay5YDw4bTpJ7QkSDPYoyNrcTPMTwdjIqry498NO2SMGv573TTn4jeT
cWxQLKIDtmCPdrvRcyOsNLsAve73x760glWqmtFJuyVtgwjnWUyVlNSlTcVTW4OR
NsrTVzfA5AbZb56WQmhmBxVcKn9UwKxPIwjZ9I3zydBagsODIWkRoViBTXNQjsk=
=aha/
-----END PGP SIGNATURE-----
Merge tag 'wireless-drivers-next-for-davem-2018-08-12' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
Kalle Valo says:
====================
pull-request: wireless-drivers-next 2018-08-12
wireless-drivers-next patches for 4.19
Last set of new features for 4.19. Most notable is simplifying SSB
debugging code with two Kconfig option removals and fixing mt76 USB
build problems.
Major changes:
ath10k
* add debugfs file warm_hw_reset
wil6210
* add debugfs files tx_latency, link_stats and link_stats_global
* add 3-MSI support
* allow scan on AP interface
* support max aggregation window size 64
ssb
* remove CONFIG_SSB_SILENT and CONFIG_SSB_DEBUG Kconfig options
mt76
* fix build problems with recently added USB support
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
- Enable mac_scsi PDMA on PowerBook 500,
- Generic dma_noncoherent_ops conversion,
- Time handling improvements,
- I/O accessor improvements,
- Conversion to MEMBLOCK and NO_BOOTMEM, to bring m68k in line with
other mainstream architectures,
- Miscellaneous fixes and cleanups,
- Defconfig updates.
-----BEGIN PGP SIGNATURE-----
iIsEABYIADMWIQQ9qaHoIs/1I4cXmEiKwlD9ZEnxcAUCW2rv8BUcZ2VlcnRAbGlu
dXgtbTY4ay5vcmcACgkQisJQ/WRJ8XBRAQD+O3x187pxeICV5LUC4SGtPj3kw07e
wr1NA0hnRqri77MBAPNx96fy0wnQAgMuvIghpEYqX6ha2O4NTxKUmLdxl/QL
=gJGl
-----END PGP SIGNATURE-----
Merge tag 'm68k-for-v4.19-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k updates from Geert Uytterhoeven:
- Enable mac_scsi PDMA on PowerBook 500
- Generic dma_noncoherent_ops conversion
- Time handling improvements
- I/O accessor improvements
- Conversion to MEMBLOCK and NO_BOOTMEM, to bring m68k in line with
other mainstream architectures
- Miscellaneous fixes and cleanups
- Defconfig updates
* tag 'm68k-for-v4.19-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k/defconfig: Update defconfigs for v4.18-rc6
m68k: switch to MEMBLOCK + NO_BOOTMEM
m68k/page_no.h: force __va argument to be unsigned long
m68k/bitops: convert __ffs to match generic declaration
m68k/io: Switch mmu variant to <asm-generic/io.h>
m68k/io: Move mem*io define guards to <asm/kmap.h>
Input: hilkbd - Add casts to HP9000/300 I/O accessors
net: mac8390: Use standard memcpy_{from,to}io()
m68k/io: Add missing ioremap define guards, fix typo
m68k: Remove unused set_clock_mmss() helpers
m68k: mac: Use time64_t in RTC handling
m68k: Use generic dma_noncoherent_ops
nubus: Set default dma mask for nubus_board devices
m68k/mac: Enable PDMA for PowerBook 500 series
Enabling both CONFIG_PERF_EVENTS without !CONFIG_SMP
generates following compilation error.
arch/riscv/include/asm/perf_event.h:80:2: error: expected
specifier-qualifier-list before 'irqreturn_t'
irqreturn_t (*handle_irq)(int irq_num, void *dev);
^~~~~~~~~~~
Include interrupt.h in proper place to avoid compilation
error.
Signed-off-by: Atish Patra <atish.patra@wdc.com>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
Add a driver for the SiFive implementation of the RISC-V Platform Level
Interrupt Controller (PLIC). The PLIC connects global interrupt sources
to the local interrupt controller on each hart.
This driver is based on the driver in the RISC-V tree from Palmer Dabbelt,
but has been almost entirely rewritten since, and includes many fixes
from Atish Patra.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Atish Patra <atish.patra@wdc.com>
[Binding update by Palmer]
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
The stvec's value must be 4 byte alignment by specification definition.
These directives avoid to stvec be set the non-alignment value.
Signed-off-by: Zong Li <zong@andestech.com>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
The RISC-V ISA defines a per-hart real-time clock and timer, which is
present on all systems. The clock is accessed via the 'rdtime'
pseudo-instruction (which reads a CSR), and the timer is set via an SBI
call.
Contains various improvements from Atish Patra <atish.patra@wdc.com>.
Signed-off-by: Dmitriy Cherkasov <dmitriy@oss-tech.org>
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
[hch: remove dead code, add SPDX tags, used riscv_of_processor_hart(),
minor cleanups, merged hotplug cpu support and other improvements
from Atish]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Atish Patra <atish.patra@wdc.com>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
Add support for a routine that dispatches exceptions with the interrupt
flags set to either the IPI or irqdomain code (and the clock source in the
future).
Loosely based on the irq-riscv-int.c irqchip driver from the RISC-V tree.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
This mirrors the SIE_SSIE and SETE bits that are used in a similar
fashion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
These are only of use to the local irq controller driver, so add them in
that driver implementation instead, which will be submitted soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
Rename handle_ipi to riscv_software_interrupt, drop the unused return
value and move the prototype to irq.h together with riscv_timer_interupt.
This allows simplifying the upcoming interrupt handling support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
This code is currently unused and will be added back later in a different
place with the real interrupt and clocksource support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
This code lives entirely within the RISC-V arch code. I've left it
within an "#ifdef CONFIG_EARLY_PRINTK" despite always having
EARLY_PRINTK support on RISC-V just in case someone wants to remove
it.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
Adding 4 to sepc is pointless, and is wrong if we executed a 2-byte
compressed breakpoint. This plus a corresponding gdb patch allows
compressed breakpoints to work in gdb. Gdb maintainers have already
agreed that this is the right approach.
Signed-off-by: Jim Wilson <jimw@sifive.com>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
If you use a 64-bit compiler to build a 32-bit kernel then you'll get an
error when building the vDSO due to a library mismatch. The happens
because the relevant "-march" argument isn't supplied to the GCC run
that generates one of the vDSO intermediate files.
I'm not actually sure what the right thing to do here is as I'm not
particularly familiar with the kernel build system. I poked the
documentation and it appears that KCFLAGS is the correct thing to do
(it's suggested that should be used when building modules), but we set
KBUILD_CFLAGS in arch/riscv/Makefile.
This does at least fix the build error.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
This patchset fixes and improves stack unwinding a lot:
1. Show backward stack traces with up to 30 callsites
2. Add callinfo to ENTRY_CFI() such that every assembler function will get an
entry in the unwind table
3. Use constants instead of numbers in call_on_stack()
4. Do not depend on CONFIG_KALLSYMS to generate backtraces.
5. Speed up backtrace generation
Make sure you have this patch to GNU as installed:
https://sourceware.org/ml/binutils/2018-07/msg00474.html
Without this patch, unwind info in the kernel is often wrong for various
functions.
Signed-off-by: Helge Deller <deller@gmx.de>
Now that mb() is an instruction barrier, it will slow performance if we issue
unnecessary barriers.
The spinlock defines have a number of unnecessary barriers. The __ldcw()
define is both a hardware and compiler barrier. The mb() barriers in the
routines using __ldcw() serve no purpose.
The only barrier needed is the one in arch_spin_unlock(). We need to ensure
all accesses are complete prior to releasing the lock.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Helge Deller <deller@gmx.de>
Now that we use a sync prior to releasing the locks in syscall.S, we don't need
the PA 2.0 ordered stores used to release some locks. Using an ordered store,
potentially slows the release and subsequent code.
There are a number of other ordered stores and loads that serve no purpose. I
have converted these to normal stores.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Helge Deller <deller@gmx.de>
As part of the effort to reduce the code duplication between _THIS_IP_
and current_text_addr(), let's consolidate callers of
current_text_addr() to use _THIS_IP_.
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Helge Deller <deller@gmx.de>
Some parts of the HAVE_REGS_AND_STACK_ACCESS_API feature is needed for
the rseq syscall. This patch adds the most important parts, and as long
as we don't support kprobes, we should be fine.
Signed-off-by: Helge Deller <deller@gmx.de>
parisc is the only Linux architecture which has defined a value for ENOTSUP.
All other architectures #define ENOTSUP as EOPNOTSUPP in their libc headers.
Having an own value for ENOTSUP which is different than EOPNOTSUPP often gives
problems with userspace programs which expect both to be the same. One such
example is a build error in the libuv package, as can be seen in
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900237.
Since we dropped HP-UX support, there is no real benefit in keeping an own
value for ENOTSUP. This patch drops the parisc value for ENOTSUP from the
kernel sources. glibc needs no patch, it reuses the exported headers.
Signed-off-by: Helge Deller <deller@gmx.de>
Switch to the generic noncoherent direct mapping implementation.
Fix sync_single_for_cpu to do skip the cache flush unless the transfer
is to the device to match the more tested unmap_single path which should
have the same cache coherency implications.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Helge Deller <deller@gmx.de>
Current the S/G list based DMA ops use flush_kernel_vmap_range which
contains a few UP optimizations, while the rest of the DMA operations
uses flush_kernel_dcache_range. The single vs sg operations are supposed
to have the same effect, so they should use the same routines. Use
the more conservation version for now, but if people more familiar with
parisc think the vmap version is generally fine for DMA we should switch
all interfaces over to it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Helge Deller <deller@gmx.de>
The only difference is that pcxl supports dma coherent allocations, while
pcx only supports non-consistent allocations and otherwise fails.
But dma_alloc* is not in the fast path, and merging these two allows an
easy migration path to the generic dma-noncoherent implementation, so
do it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Helge Deller <deller@gmx.de>
Add statistics that show how memory is mapped within the kernel linear mapping.
This is similar to commit 37cd944c8d ("s390/pgtable: add mapping statistics")
We don't do this with Hash translation mode. Hash uses one size (mmu_linear_psize)
to map the kernel linear mapping and we print the linear psize during boot as
below.
"Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24"
A sample output looks like:
DirectMap4k: 0 kB
DirectMap64k: 18432 kB
DirectMap2M: 1030144 kB
DirectMap1G: 11534336 kB
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Return statements in functions returning bool should use true or false
instead of an integer value.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
In order to generate Group0 SGIs, let's add some decoding logic to
access_gic_sgi(), and pass the generating group accordingly.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
In order to generate Group0 SGIs, let's add some decoding logic to
access_gic_sgi(), and pass the generating group accordingly.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Although vgic-v3 now supports Group0 interrupts, it still doesn't
deal with Group0 SGIs. As usually with the GIC, nothing is simple:
- ICC_SGI1R can signal SGIs of both groups, since GICD_CTLR.DS==1
with KVM (as per 8.1.10, Non-secure EL1 access)
- ICC_SGI0R can only generate Group0 SGIs
- ICC_ASGI1R sees its scope refocussed to generate only Group0
SGIs (as per the note at the bottom of Table 8-14)
We only support Group1 SGIs so far, so no material change.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
ICC_SGI1R is a 64bit system register, even on AArch32. It is thus
pointless to have such an encoding in the 32bit cp15 array. Let's
drop it.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Holding uts_sem as a writer while accessing userspace memory allows a
namespace admin to stall all processes that attempt to take uts_sem.
Instead, move data through stack buffers and don't access userspace memory
while uts_sem is held.
Cc: stable@vger.kernel.org
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
With gcc-8 fsanitize=null become very noisy. GCC started to complain
about things like &a->b, where 'a' is NULL pointer. There is no NULL
dereference, we just calculate address to struct member. It's
technically undefined behavior so UBSAN is correct to report it. But as
long as there is no real NULL-dereference, I think, we should be fine.
-fno-delete-null-pointer-checks compiler flag should protect us from any
consequences. So let's just no use -fsanitize=null as it's not useful
for us. If there is a real NULL-deref we will see crash. Even if
userspace mapped something at NULL (root can do this), with things like
SMAP should catch the issue.
Link: http://lkml.kernel.org/r/20180802153209.813-1-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since at least the beginning of the git era we've declared our TLB
exception handling functions inconsistently. They're actually functions,
but we declare them as arrays of u32 where each u32 is an encoded
instruction. This has always been the case for arch/mips/mm/tlbex.c, and
has also been true for arch/mips/kernel/traps.c since commit
86a1708a9d ("MIPS: Make tlb exception handler definitions and
declarations match.") which aimed for consistency but did so by
consistently making the our C code inconsistent with our assembly.
This is all usually harmless, but when using GCC 7 or newer to build a
kernel targeting microMIPS (ie. CONFIG_CPU_MICROMIPS=y) it becomes
problematic. With microMIPS bit 0 of the program counter indicates the
ISA mode. When bit 0 is zero instructions are decoded using the standard
MIPS32 or MIPS64 ISA. When bit 0 is one instructions are decoded using
microMIPS. This means that function pointers become odd - their least
significant bit is one for microMIPS code. We work around this in cases
where we need to access code using loads & stores with our
msk_isa16_mode() macro which simply clears bit 0 of the value it is
given:
#define msk_isa16_mode(x) ((x) & ~0x1)
For example we do this for our TLB load handler in
build_r4000_tlb_load_handler():
u32 *p = (u32 *)msk_isa16_mode((ulong)handle_tlbl);
We then write code to p, expecting it to be suitably aligned (our LEAF
macro aligns functions on 4 byte boundaries, so (ulong)handle_tlbl will
give a value one greater than a multiple of 4 - ie. the start of a
function on a 4 byte boundary, with the ISA mode bit 0 set).
This worked fine up to GCC 6, but GCC 7 & onwards is smart enough to
presume that handle_tlbl which we declared as an array of u32s must be
aligned sufficiently that bit 0 of its address will never be set, and as
a result optimize out msk_isa16_mode(). This leads to p having an
address with bit 0 set, and when we go on to attempt to store code at
that address we take an address error exception due to the unaligned
memory access.
This leads to an exception prior to the kernel having configured its own
exception handlers, so we jump to whatever handlers the bootloader
configured. In the case of QEMU this results in a silent hang, since it
has no useful general exception vector.
Fix this by consistently declaring our TLB-related functions as
functions. For handle_tlbl(), handle_tlbs() & handle_tlbm() we do this
in asm/tlbex.h & we make use of the existing declaration of
tlbmiss_handler_setup_pgd() in asm/mmu_context.h. Our TLB handler
generation code in arch/mips/mm/tlbex.c is adjusted to deal with these
definitions, in most cases simply by casting the function pointers to
u32 pointers.
This allows us to include asm/mmu_context.h in arch/mips/mm/tlbex.c to
get the definitions of tlbmiss_handler_setup_pgd & pgd_current, removing
some needless duplication. Consistently using msk_isa16_mode() on
function pointers means we no longer need the
tlbmiss_handler_setup_pgd_start symbol so that is removed entirely.
Now that we're declaring our functions as functions GCC stops optimizing
out msk_isa16_mode() & a microMIPS kernel built with either GCC 7.3.0 or
8.1.0 boots successfully.
Signed-off-by: Paul Burton <paul.burton@mips.com>
We export tlbmiss_handler_setup_pgd in arch/mips/mm/tlbex.c close to a
declaration of it, rather than close to its definition as is standard.
We've supported exporting symbols in assembly code since commit
22823ab419 ("EXPORT_SYMBOL() for asm"), so move the export to follow
the function's (stub) definition.
Signed-off-by: Paul Burton <paul.burton@mips.com>
The user page-table gets the updated kernel mappings in pti_finalize(),
which runs after the RO+X permissions got applied to the kernel page-table
in mark_readonly().
But with CONFIG_DEBUG_WX enabled, the user page-table is already checked in
mark_readonly() for insecure mappings. This causes false-positive
warnings, because the user page-table did not get the updated mappings yet.
Move the W+X check for the user page-table into pti_finalize() after it
updated all required mappings.
[ tglx: Folded !NX supported fix ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1533727000-9172-1-git-send-email-joro@8bytes.org
Currently if you build a 32-bit powerpc kernel and use get_user() to
load a u64 value it will fail to build with eg:
kernel/rseq.o: In function `rseq_get_rseq_cs':
kernel/rseq.c:123: undefined reference to `__get_user_bad'
This is hitting the check in __get_user_size() that makes sure the
size we're copying doesn't exceed the size of the destination:
#define __get_user_size(x, ptr, size, retval)
do {
retval = 0;
__chk_user_ptr(ptr);
if (size > sizeof(x))
(x) = __get_user_bad();
Which doesn't immediately make sense because the size of the
destination is u64, but it's not really, because __get_user_check()
etc. internally create an unsigned long and copy into that:
#define __get_user_check(x, ptr, size)
({
long __gu_err = -EFAULT;
unsigned long __gu_val = 0;
The problem being that on 32-bit unsigned long is not big enough to
hold a u64. We can fix this with a trick from hpa in the x86 code, we
statically check the type of x and set the type of __gu_val to either
unsigned long or unsigned long long.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The machine check code that flushes and restores bolted segments in
real mode belongs in mm/slb.c. This will also be used by pseries
machine check and idle code in future changes.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Fix the below build error using strlcpy instead of strncpy
In function 'pnv_parse_cpuidle_dt',
inlined from 'pnv_init_idle_states' at arch/powerpc/platforms/powernv/idle.c:840:7,
inlined from '__machine_initcall_powernv_pnv_init_idle_states' at arch/powerpc/platforms/powernv/idle.c:870:1:
arch/powerpc/platforms/powernv/idle.c:820:3: error: 'strncpy' specified bound 16 equals destination size [-Werror=stringop-truncation]
strncpy(pnv_idle_states[i].name, temp_string[i],
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PNV_IDLE_NAME_LEN);
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch makes sure we update the mmu_gather page size even if we are
requesting for a fullmm flush. This avoids triggering VM_WARN_ON in code
paths like __tlb_remove_page_size that explicitly check for removing range page
size to be same as mmu gather page size.
Fixes: 5a6099346c ("powerpc/64s/radix: tlb do not flush on page size when fullmm")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
‘type’ is only used when CONFIG_DEBUG_HIGHMEM is set. So add a possibly
unused tag to variable. Remove warning treated as error with W=1:
arch/powerpc/mm/highmem.c:59:6: error: variable ‘type’ set but not used [-Werror=unused-but-set-variable]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Make sure to include setup.h to provide the following prototypes:
- irqstack_early_init
- setup_power_save
- initialize_cache_info
Fix the following warnings (treated as error in W=1):
arch/powerpc/kernel/setup_32.c:198:13: error: no previous prototype for ‘irqstack_early_init’
arch/powerpc/kernel/setup_32.c:238:13: error: no previous prototype for ‘setup_power_save’
arch/powerpc/kernel/setup_32.c:253:13: error: no previous prototype for ‘initialize_cache_info’
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add gcc attribute unused for two variables. Fix warnings treated as errors
with W=1:
arch/powerpc/kernel/prom_init.c:1388:8: error: variable ‘path’ set but not used [-Werror=unused-but-set-variable]
Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
These functions can all be static, make it so. Fix warnings treated as
errors with W=1:
arch/powerpc/platforms/powermac/pci.c:1022:6: error: no previous prototype for ‘pmac_pci_fixup_ohci’
arch/powerpc/platforms/powermac/pci.c:1057:6: error: no previous prototype for ‘pmac_pci_fixup_cardbus’
arch/powerpc/platforms/powermac/pci.c:1094:6: error: no previous prototype for ‘pmac_pci_fixup_pciata’
Remove has_address declaration and assignment since it's not used.
Also add gcc attribute unused to fix a warning treated as error with
W=1:
arch/powerpc/platforms/powermac/pci.c:784:19: error: variable ‘has_address’ set but not used
arch/powerpc/platforms/powermac/pci.c:907:22: error: variable ‘ht’ set but not used
Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since the value of x is never intended to be read, remove it. Fix
warning treated as error with W=1:
arch/powerpc/platforms/powermac/udbg_scc.c:76:9: error: variable ‘x’ set but not used
Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The header `pmac.h` was not included, leading to the following warnings,
treated as error with W=1:
arch/powerpc/platforms/powermac/time.c:69:13: error: no previous prototype for ‘pmac_time_init’ [-Werror=missing-prototypes]
arch/powerpc/platforms/powermac/time.c:207:15: error: no previous prototype for ‘pmac_get_boot_time’ [-Werror=missing-prototypes]
arch/powerpc/platforms/powermac/time.c:222:6: error: no previous prototype for ‘pmac_get_rtc_time’ [-Werror=missing-prototypes]
arch/powerpc/platforms/powermac/time.c:240:5: error: no previous prototype for ‘pmac_set_rtc_time’ [-Werror=missing-prototypes]
arch/powerpc/platforms/powermac/time.c:259:12: error: no previous prototype for ‘via_calibrate_decr’ [-Werror=missing-prototypes]
arch/powerpc/platforms/powermac/time.c:311:13: error: no previous prototype for ‘pmac_calibrate_decr’ [-Werror=missing-prototypes]
The function `via_calibrate_decr` was made static to silence a warning.
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a jump target so that a bit of exception handling can be better
reused at the end of this function.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently, in xmon, there is no obvious way to get an address for a
percpu symbol for a particular cpu. Having such an ability would be
good for debugging the system when percpu variables got involved.
Therefore, this patch introduces a new xmon command "lp" to lookup the
address for percpu symbols. Usage of "lp" is similar to "ls", except
that we could add a cpu number to choose the variable of which cpu we
want to lookup. If no cpu number is given, lookup for current cpu.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
huge_pte_offset_and_shift() has never existed
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The symbol memcpy_nocache_branch defined in order to allow patching
of memset function once cache is enabled leads to confusing reports
by perf tool.
Using the new patch_site functionality solves this issue.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
During Machine Check interrupt on pseries platform, register r3 points
RTAS extended event log passed by hypervisor. Since hypervisor uses r3
to pass pointer to rtas log, it stores the original r3 value at the
start of the memory (first 8 bytes) pointed by r3. Since hypervisor
stores this info and rtas log is in BE format, linux should make
sure to restore r3 value in correct endian format.
Without this patch when MCE handler, after recovery, returns to code that
that caused the MCE may end up with Data SLB access interrupt for invalid
address followed by kernel panic or hang.
Severe Machine check interrupt [Recovered]
NIP [d00000000ca301b8]: init_module+0x1b8/0x338 [bork_kernel]
Initiator: CPU
Error type: SLB [Multihit]
Effective address: d00000000ca70000
cpu 0xa: Vector: 380 (Data SLB Access) at [c0000000fc7775b0]
pc: c0000000009694c0: vsnprintf+0x80/0x480
lr: c0000000009698e0: vscnprintf+0x20/0x60
sp: c0000000fc777830
msr: 8000000002009033
dar: a803a30c000000d0
current = 0xc00000000bc9ef00
paca = 0xc00000001eca5c00 softe: 3 irq_happened: 0x01
pid = 8860, comm = insmod
vscnprintf+0x20/0x60
vprintk_emit+0xb4/0x4b0
vprintk_func+0x5c/0xd0
printk+0x38/0x4c
init_module+0x1c0/0x338 [bork_kernel]
do_one_initcall+0x54/0x230
do_init_module+0x8c/0x248
load_module+0x12b8/0x15b0
sys_finit_module+0xa8/0x110
system_call+0x58/0x6c
--- Exception: c00 (System Call) at 00007fff8bda0644
SP (7fffdfbfe980) is in userspace
This patch fixes this issue.
Fixes: a08a53ea4c ("powerpc/le: Enable RTAS events support")
Cc: stable@vger.kernel.org # v3.15+
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With dynamic memory allocation support for crash memory ranges array,
there is no hard limit on the no. of crash memory ranges kernel could
export, but program headers count could overflow in the /proc/vmcore
ELF file while exporting each memory range as PT_LOAD segment. Reduce
the likelihood of a such scenario, by folding adjacent crash memory
ranges which minimizes the total number of PT_LOAD segments.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Crash memory ranges is an array of memory ranges of the crashing kernel
to be exported as a dump via /proc/vmcore file. The size of the array
is set based on INIT_MEMBLOCK_REGIONS, which works alright in most cases
where memblock memory regions count is less than INIT_MEMBLOCK_REGIONS
value. But this count can grow beyond INIT_MEMBLOCK_REGIONS value since
commit 142b45a72e ("memblock: Add array resizing support").
On large memory systems with a few DLPAR operations, the memblock memory
regions count could be larger than INIT_MEMBLOCK_REGIONS value. On such
systems, registering fadump results in crash or other system failures
like below:
task: c00007f39a290010 ti: c00000000b738000 task.ti: c00000000b738000
NIP: c000000000047df4 LR: c0000000000f9e58 CTR: c00000000010f180
REGS: c00000000b73b570 TRAP: 0300 Tainted: G L X (4.4.140+)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22004484 XER: 20000000
CFAR: c000000000008500 DAR: 000007a450000000 DSISR: 40000000 SOFTE: 0
...
NIP [c000000000047df4] smp_send_reschedule+0x24/0x80
LR [c0000000000f9e58] resched_curr+0x138/0x160
Call Trace:
resched_curr+0x138/0x160 (unreliable)
check_preempt_curr+0xc8/0xf0
ttwu_do_wakeup+0x38/0x150
try_to_wake_up+0x224/0x4d0
__wake_up_common+0x94/0x100
ep_poll_callback+0xac/0x1c0
__wake_up_common+0x94/0x100
__wake_up_sync_key+0x70/0xa0
sock_def_readable+0x58/0xa0
unix_stream_sendmsg+0x2dc/0x4c0
sock_sendmsg+0x68/0xa0
___sys_sendmsg+0x2cc/0x2e0
__sys_sendmsg+0x5c/0xc0
SyS_socketcall+0x36c/0x3f0
system_call+0x3c/0x100
as array index overflow is not checked for while setting up crash memory
ranges causing memory corruption. To resolve this issue, dynamically
allocate memory for crash memory ranges and resize it incrementally,
in units of pagesize, on hitting array size limit.
Fixes: 2df173d9e8 ("fadump: Initialize elfcore header and add PT_LOAD program headers.")
Cc: stable@vger.kernel.org # v3.4+
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
[mpe: Just use PAGE_SIZE directly, fixup variable placement]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
commit e8cb7a55eb ("powerpc: remove superflous inclusions of
asm/fixmap.h") removed inclusion of asm/fixmap.h from files not
including objects from that file.
However, asm/mmu-8xx.h includes call to __fix_to_virt(). The proper
way would be to include asm/fixmap.h in asm/mmu-8xx.h but it creates
an inclusion loop.
So we have to leave asm/fixmap.h in sysdep/cpm_common.c for
CONFIG_PPC_EARLY_DEBUG_CPM
CC arch/powerpc/sysdev/cpm_common.o
In file included from ./arch/powerpc/include/asm/mmu.h:340:0,
from ./arch/powerpc/include/asm/reg_8xx.h:8,
from ./arch/powerpc/include/asm/reg.h:29,
from ./arch/powerpc/include/asm/processor.h:13,
from ./arch/powerpc/include/asm/thread_info.h:28,
from ./include/linux/thread_info.h:38,
from ./arch/powerpc/include/asm/ptrace.h:159,
from ./arch/powerpc/include/asm/hw_irq.h:12,
from ./arch/powerpc/include/asm/irqflags.h:12,
from ./include/linux/irqflags.h:16,
from ./include/asm-generic/cmpxchg-local.h:6,
from ./arch/powerpc/include/asm/cmpxchg.h:537,
from ./arch/powerpc/include/asm/atomic.h:11,
from ./include/linux/atomic.h:5,
from ./include/linux/mutex.h:18,
from ./include/linux/kernfs.h:13,
from ./include/linux/sysfs.h:16,
from ./include/linux/kobject.h:20,
from ./include/linux/device.h:16,
from ./include/linux/node.h:18,
from ./include/linux/cpu.h:17,
from ./include/linux/of_device.h:5,
from arch/powerpc/sysdev/cpm_common.c:21:
arch/powerpc/sysdev/cpm_common.c: In function ‘udbg_init_cpm’:
./arch/powerpc/include/asm/mmu-8xx.h:218:25: error: implicit declaration of function ‘__fix_to_virt’ [-Werror=implicit-function-declaration]
#define VIRT_IMMR_BASE (__fix_to_virt(FIX_IMMR_BASE))
^
arch/powerpc/sysdev/cpm_common.c:75:7: note: in expansion of macro ‘VIRT_IMMR_BASE’
VIRT_IMMR_BASE);
^
./arch/powerpc/include/asm/mmu-8xx.h:218:39: error: ‘FIX_IMMR_BASE’ undeclared (first use in this function)
#define VIRT_IMMR_BASE (__fix_to_virt(FIX_IMMR_BASE))
^
arch/powerpc/sysdev/cpm_common.c:75:7: note: in expansion of macro ‘VIRT_IMMR_BASE’
VIRT_IMMR_BASE);
^
./arch/powerpc/include/asm/mmu-8xx.h:218:39: note: each undeclared identifier is reported only once for each function it appears in
#define VIRT_IMMR_BASE (__fix_to_virt(FIX_IMMR_BASE))
^
arch/powerpc/sysdev/cpm_common.c:75:7: note: in expansion of macro ‘VIRT_IMMR_BASE’
VIRT_IMMR_BASE);
^
cc1: all warnings being treated as errors
make[1]: *** [arch/powerpc/sysdev/cpm_common.o] Error 1
Fixes: e8cb7a55eb ("powerpc: remove superflous inclusions of asm/fixmap.h")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The problem is the the calculation should be "end - start + 1" but the
plus one is missing in this calculation.
Fixes: 8626816e90 ("powerpc: add support for MPIC message register API")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch allows the memory removed by memtrace to be readded to the
kernel. So now you don't have to reboot your system to add the memory
back to the kernel or to have a different amount of memory removed.
Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com>
Tested-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The kernel unnecessarily prevents late microcode loading when SMT is
disabled. It should be safe to allow it if all the primary threads are
online.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Commit 33679a5037 ("MIPS: uasm: Remove needless ISA abstraction")
removed use of the MIPS_ISA preprocessor macro, but left a couple of
unused definitions of it behind.
Remove the dead code.
Signed-off-by: Paul Burton <paul.burton@mips.com>
This new symbol needs to be in the workaround-list for buggy
binutils, otherwise the build with gcc-4.6 fails.
Fixes: 39d668e04e ('x86/mm/pti: Make pti_clone_kernel_text() compile on 32 bit')
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linux-Next Mailing List <linux-next@vger.kernel.org>
Link: https://lkml.kernel.org/r/20180809094449.ddmnrkz7qkvo3j2x@suse.de
Pull crypto fix from Herbert Xu:
"This fixes a performance regression in arm64 NEON crypto as well as a
crash in x86 aegis/morus on unsupported CPUs"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: x86/aegis,morus - Fix and simplify CPUID checks
crypto: arm64 - revert NEON yield for fast AEAD implementations
Use the standard WARN_ON instead.
If a small kernel is desired, WARN_ON can be disabled globally.
Also remove SSB_DEBUG. Besides WARN_ON it only adds a tiny debug check.
Include this check unconditionally.
Signed-off-by: Michael Buesch <m@bues.ch>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
The host-progs has been kept as an alias of hostprogs-y for a long time
(at least since the beginning of Git era), with the clear prompt:
Usage of host-progs is deprecated. Please replace with hostprogs-y!
Enough time for the migration has passed.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
In order to remove the additional check before calling the
ghes_notify_sea(), make stub definition when !CONFIG_ACPI_APEI_SEA.
After this cleanup, we can simply call the ghes_notify_sea() to let
APEI driver handle the SEA notification.
Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit c9b5ad546e "s390/mm: tag normal pages vs pages used in page tables"
accidentally changed the logic in arch_set_page_states(), which is used by
the suspend/resume code. set_page_stable(page, order) was changed to
set_page_stable_dat(page, 0). After this, only the first page of higher order
pages will be set to stable, and a write to one of the unstable pages will
result in an addressing exception.
Fix this by using "order" again, instead of "0".
Fixes: c9b5ad546e ("s390/mm: tag normal pages vs pages used in page tables")
Cc: stable@vger.kernel.org # 4.14+
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The Cortina PHY is not compatible with IEEE 802.3 clause 45.
Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
[scottwood: made commit message about compatibility, not driver choice]
Signed-off-by: Scott Wood <oss@buserror.net>
The Cortina PHY is not compatible with IEEE 802.3 clause 45.
Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
[scottwood: made commit message about compatibility, not driver choice]
Signed-off-by: Scott Wood <oss@buserror.net>
Cortina PHYs are present on T4240RDB and T2080RDB. Enable the driver
by default.
Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
commit e8cb7a55eb ("powerpc: remove superflous inclusions of
asm/fixmap.h") removed inclusion of asm/fixmap.h from files not
including objects from that file.
However, asm/mmu-8xx.h includes call to __fix_to_virt(). The proper
way would be to include asm/fixmap.h in asm/mmu-8xx.h but it creates
an inclusion loop.
So we have to leave asm/fixmap.h in sysdep/cpm_common.c for
CONFIG_PPC_EARLY_DEBUG_CPM
CC arch/powerpc/sysdev/cpm_common.o
In file included from ./arch/powerpc/include/asm/mmu.h:340:0,
from ./arch/powerpc/include/asm/reg_8xx.h:8,
from ./arch/powerpc/include/asm/reg.h:29,
from ./arch/powerpc/include/asm/processor.h:13,
from ./arch/powerpc/include/asm/thread_info.h:28,
from ./include/linux/thread_info.h:38,
from ./arch/powerpc/include/asm/ptrace.h:159,
from ./arch/powerpc/include/asm/hw_irq.h:12,
from ./arch/powerpc/include/asm/irqflags.h:12,
from ./include/linux/irqflags.h:16,
from ./include/asm-generic/cmpxchg-local.h:6,
from ./arch/powerpc/include/asm/cmpxchg.h:537,
from ./arch/powerpc/include/asm/atomic.h:11,
from ./include/linux/atomic.h:5,
from ./include/linux/mutex.h:18,
from ./include/linux/kernfs.h:13,
from ./include/linux/sysfs.h:16,
from ./include/linux/kobject.h:20,
from ./include/linux/device.h:16,
from ./include/linux/node.h:18,
from ./include/linux/cpu.h:17,
from ./include/linux/of_device.h:5,
from arch/powerpc/sysdev/cpm_common.c:21:
arch/powerpc/sysdev/cpm_common.c: In function ‘udbg_init_cpm’:
./arch/powerpc/include/asm/mmu-8xx.h:218:25: error: implicit declaration of function ‘__fix_to_virt’ [-Werror=implicit-function-declaration]
#define VIRT_IMMR_BASE (__fix_to_virt(FIX_IMMR_BASE))
^
arch/powerpc/sysdev/cpm_common.c:75:7: note: in expansion of macro ‘VIRT_IMMR_BASE’
VIRT_IMMR_BASE);
^
./arch/powerpc/include/asm/mmu-8xx.h:218:39: error: ‘FIX_IMMR_BASE’ undeclared (first use in this function)
#define VIRT_IMMR_BASE (__fix_to_virt(FIX_IMMR_BASE))
^
arch/powerpc/sysdev/cpm_common.c:75:7: note: in expansion of macro ‘VIRT_IMMR_BASE’
VIRT_IMMR_BASE);
^
./arch/powerpc/include/asm/mmu-8xx.h:218:39: note: each undeclared identifier is reported only once for each function it appears in
#define VIRT_IMMR_BASE (__fix_to_virt(FIX_IMMR_BASE))
^
arch/powerpc/sysdev/cpm_common.c:75:7: note: in expansion of macro ‘VIRT_IMMR_BASE’
VIRT_IMMR_BASE);
^
cc1: all warnings being treated as errors
make[1]: *** [arch/powerpc/sysdev/cpm_common.o] Error 1
Fixes: e8cb7a55eb ("powerpc: remove superflous inclusions of asm/fixmap.h")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
The mmio tracer sets io mapping PTEs and PMDs to non present when enabled
without inverting the address bits, which makes the PTE entry vulnerable
for L1TF.
Make it use the right low level macros to actually invert the address bits
to protect against L1TF.
In principle this could be avoided because MMIO tracing is not likely to be
enabled on production machines, but the fix is straigt forward and for
consistency sake it's better to get rid of the open coded PTE manipulation.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
For years I thought all parisc machines executed loads and stores in
order. However, Jeff Law recently indicated on gcc-patches that this is
not correct. There are various degrees of out-of-order execution all the
way back to the PA7xxx processor series (hit-under-miss). The PA8xxx
series has full out-of-order execution for both integer operations, and
loads and stores.
This is described in the following article:
http://web.archive.org/web/20040214092531/http://www.cpus.hp.com/technical_references/advperf.shtml
For this reason, we need to define mb() and to insert a memory barrier
before the store unlocking spinlocks. This ensures that all memory
accesses are complete prior to unlocking. The ldcw instruction performs
the same function on entry.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Helge Deller <deller@gmx.de>
Enable the -mlong-calls compiler option by default, because otherwise in most
cases linking the vmlinux binary fails due to truncations of R_PARISC_PCREL22F
relocations. This fixes building the 64-bit defconfig.
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Helge Deller <deller@gmx.de>
In nlm_fmn_send() we have a loop which attempts to send a message
multiple times in order to handle the transient failure condition of a
lack of available credit. When examining the status register to detect
the failure we check for a condition that can never be true, which falls
foul of gcc 8's -Wtautological-compare:
In file included from arch/mips/netlogic/common/irq.c:65:
./arch/mips/include/asm/netlogic/xlr/fmn.h: In function 'nlm_fmn_send':
./arch/mips/include/asm/netlogic/xlr/fmn.h:304:22: error: bitwise
comparison always evaluates to false [-Werror=tautological-compare]
if ((status & 0x2) == 1)
^~
If the path taken if this condition were true all we do is print a
message to the kernel console. Since failures seem somewhat expected
here (making the console message questionable anyway) and the condition
has clearly never evaluated true we simply remove it, rather than
attempting to fix it to check status correctly.
Signed-off-by: Paul Burton <paul.burton@mips.com>
Patchwork: https://patchwork.linux-mips.org/patch/20174/
Cc: Ganesan Ramalingam <ganesanr@broadcom.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jayachandran C <jnair@caviumnetworks.com>
Cc: John Crispin <john@phrozen.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Return statements in functions returning bool should use true or false
instead of an integer value. This code was detected with the help of
Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
set_memory_np() is used to mark kernel mappings not present, but it has
it's own open coded mechanism which does not have the L1TF protection of
inverting the address bits.
Replace the open coded PTE manipulation with the L1TF protecting low level
PTE routines.
Passes the CPA self test.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Some cases in THP like:
- MADV_FREE
- mprotect
- split
mark the PMD non present for temporarily to prevent races. The window for
an L1TF attack in these contexts is very small, but it wants to be fixed
for correctness sake.
Use the proper low level functions for pmd/pud_mknotpresent() to address
this.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
For kernel mappings PAGE_PROTNONE is not necessarily set for a non present
mapping, but the inversion logic explicitely checks for !PRESENT and
PROT_NONE.
Remove the PROT_NONE check and make the inversion unconditional for all not
present mappings.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When building the VDSO with clang it appears to invoke ld without
specifying endianness, even though clang itself was provided with a -EB
or -EL flag. This results in the build failing due to a mismatch between
the objects that are the input to ld, and the output it is attempting to
create:
VDSO arch/mips/vdso/vdso.so.dbg.raw
mips-linux-ld: arch/mips/vdso/elf.o: compiled for a big endian system
and target is little endian
mips-linux-ld: arch/mips/vdso/elf.o: endianness incompatible with that
of the selected emulation
mips-linux-ld: failed to merge target specific data of file
arch/mips/vdso/elf.o
...
Work around this problem by explicitly specifying the link endianness
using -Wl,-EB or -Wl,-EL when -EB or -EL are part of KBUILD_CFLAGS. This
resolves the build failure when using clang, and doesn't have any
negative effect on gcc.
Signed-off-by: Paul Burton <paul.burton@mips.com>
When building using clang, always specify -EB or -EL in order to ensure
we target the desired endianness.
Since clang cross compiles using a single compiler build with multiple
targets, our -dumpmachine tests which don't specify clang's --target
argument check output based upon the build machine rather than the
machine our build will target. This means our detection of whether to
specify -EB fails miserably & we never do. Providing the endianness flag
unconditionally for clang resolves this issue & simplifies the clang
path somewhat.
Signed-off-by: Paul Burton <paul.burton@mips.com>
On 32 bit the kernel sections are not huge-page aligned. When we clone
them on PMD-level we unevitably map some areas that are normal kernel
memory and may contain secrets to user-space. To prevent that we need to
clone the kernel-image on PTE-level for 32 bit.
Also make the page-table cloning code more general so that it can handle
PMD and PTE level cloning. This can be generalized further in the future to
also handle clones on the P4D-level.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1533637471-30953-4-git-send-email-joro@8bytes.org
The function sets the global-bit on cloned PMD entries, which only makes
sense when the permissions are identical between the user and the kernel
page-table. Further, only write-permissions are cleared for entry-text and
kernel-text sections, which are not writeable at the end of the boot
process.
The reason why this RW clearing exists is that in the early PTI
implementations the cloned kernel areas were set up during early boot
before the kernel text is set to read only and not touched afterwards.
This is not longer true. The cloned areas are still set up early to get the
entry code working for interrupts and other things, but after the kernel
text has been set RO the clone is repeated which copies the RO PMD/PTEs
over to the user visible clone. That means the initial clearing of the
writable bit can be avoided.
[ tglx: Amended changelog ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1533637471-30953-3-git-send-email-joro@8bytes.org
Nadav reported that on guests we're failing to rewrite the indirect
calls to CALLEE_SAVE paravirt functions. In particular the
pv_queued_spin_unlock() call is left unpatched and that is all over the
place. This obviously wrecks Spectre-v2 mitigation (for paravirt
guests) which relies on not actually having indirect calls around.
The reason is an incorrect clobber test in paravirt_patch_call(); this
function rewrites an indirect call with a direct call to the _SAME_
function, there is no possible way the clobbers can be different
because of this.
Therefore remove this clobber check. Also put WARNs on the other patch
failure case (not enough room for the instruction) which I've not seen
trigger in my (limited) testing.
Three live kernel image disassemblies for lock_sock_nested (as a small
function that illustrates the problem nicely). PRE is the current
situation for guests, POST is with this patch applied and NATIVE is with
or without the patch for !guests.
PRE:
(gdb) disassemble lock_sock_nested
Dump of assembler code for function lock_sock_nested:
0xffffffff817be970 <+0>: push %rbp
0xffffffff817be971 <+1>: mov %rdi,%rbp
0xffffffff817be974 <+4>: push %rbx
0xffffffff817be975 <+5>: lea 0x88(%rbp),%rbx
0xffffffff817be97c <+12>: callq 0xffffffff819f7160 <_cond_resched>
0xffffffff817be981 <+17>: mov %rbx,%rdi
0xffffffff817be984 <+20>: callq 0xffffffff819fbb00 <_raw_spin_lock_bh>
0xffffffff817be989 <+25>: mov 0x8c(%rbp),%eax
0xffffffff817be98f <+31>: test %eax,%eax
0xffffffff817be991 <+33>: jne 0xffffffff817be9ba <lock_sock_nested+74>
0xffffffff817be993 <+35>: movl $0x1,0x8c(%rbp)
0xffffffff817be99d <+45>: mov %rbx,%rdi
0xffffffff817be9a0 <+48>: callq *0xffffffff822299e8
0xffffffff817be9a7 <+55>: pop %rbx
0xffffffff817be9a8 <+56>: pop %rbp
0xffffffff817be9a9 <+57>: mov $0x200,%esi
0xffffffff817be9ae <+62>: mov $0xffffffff817be993,%rdi
0xffffffff817be9b5 <+69>: jmpq 0xffffffff81063ae0 <__local_bh_enable_ip>
0xffffffff817be9ba <+74>: mov %rbp,%rdi
0xffffffff817be9bd <+77>: callq 0xffffffff817be8c0 <__lock_sock>
0xffffffff817be9c2 <+82>: jmp 0xffffffff817be993 <lock_sock_nested+35>
End of assembler dump.
POST:
(gdb) disassemble lock_sock_nested
Dump of assembler code for function lock_sock_nested:
0xffffffff817be970 <+0>: push %rbp
0xffffffff817be971 <+1>: mov %rdi,%rbp
0xffffffff817be974 <+4>: push %rbx
0xffffffff817be975 <+5>: lea 0x88(%rbp),%rbx
0xffffffff817be97c <+12>: callq 0xffffffff819f7160 <_cond_resched>
0xffffffff817be981 <+17>: mov %rbx,%rdi
0xffffffff817be984 <+20>: callq 0xffffffff819fbb00 <_raw_spin_lock_bh>
0xffffffff817be989 <+25>: mov 0x8c(%rbp),%eax
0xffffffff817be98f <+31>: test %eax,%eax
0xffffffff817be991 <+33>: jne 0xffffffff817be9ba <lock_sock_nested+74>
0xffffffff817be993 <+35>: movl $0x1,0x8c(%rbp)
0xffffffff817be99d <+45>: mov %rbx,%rdi
0xffffffff817be9a0 <+48>: callq 0xffffffff810a0c20 <__raw_callee_save___pv_queued_spin_unlock>
0xffffffff817be9a5 <+53>: xchg %ax,%ax
0xffffffff817be9a7 <+55>: pop %rbx
0xffffffff817be9a8 <+56>: pop %rbp
0xffffffff817be9a9 <+57>: mov $0x200,%esi
0xffffffff817be9ae <+62>: mov $0xffffffff817be993,%rdi
0xffffffff817be9b5 <+69>: jmpq 0xffffffff81063aa0 <__local_bh_enable_ip>
0xffffffff817be9ba <+74>: mov %rbp,%rdi
0xffffffff817be9bd <+77>: callq 0xffffffff817be8c0 <__lock_sock>
0xffffffff817be9c2 <+82>: jmp 0xffffffff817be993 <lock_sock_nested+35>
End of assembler dump.
NATIVE:
(gdb) disassemble lock_sock_nested
Dump of assembler code for function lock_sock_nested:
0xffffffff817be970 <+0>: push %rbp
0xffffffff817be971 <+1>: mov %rdi,%rbp
0xffffffff817be974 <+4>: push %rbx
0xffffffff817be975 <+5>: lea 0x88(%rbp),%rbx
0xffffffff817be97c <+12>: callq 0xffffffff819f7160 <_cond_resched>
0xffffffff817be981 <+17>: mov %rbx,%rdi
0xffffffff817be984 <+20>: callq 0xffffffff819fbb00 <_raw_spin_lock_bh>
0xffffffff817be989 <+25>: mov 0x8c(%rbp),%eax
0xffffffff817be98f <+31>: test %eax,%eax
0xffffffff817be991 <+33>: jne 0xffffffff817be9ba <lock_sock_nested+74>
0xffffffff817be993 <+35>: movl $0x1,0x8c(%rbp)
0xffffffff817be99d <+45>: mov %rbx,%rdi
0xffffffff817be9a0 <+48>: movb $0x0,(%rdi)
0xffffffff817be9a3 <+51>: nopl 0x0(%rax)
0xffffffff817be9a7 <+55>: pop %rbx
0xffffffff817be9a8 <+56>: pop %rbp
0xffffffff817be9a9 <+57>: mov $0x200,%esi
0xffffffff817be9ae <+62>: mov $0xffffffff817be993,%rdi
0xffffffff817be9b5 <+69>: jmpq 0xffffffff81063ae0 <__local_bh_enable_ip>
0xffffffff817be9ba <+74>: mov %rbp,%rdi
0xffffffff817be9bd <+77>: callq 0xffffffff817be8c0 <__lock_sock>
0xffffffff817be9c2 <+82>: jmp 0xffffffff817be993 <lock_sock_nested+35>
End of assembler dump.
Fixes: 63f70270cc ("[PATCH] i386: PARAVIRT: add common patching machinery")
Fixes: 3010a0663f ("x86/paravirt, objtool: Annotate indirect calls")
Reported-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: stable@vger.kernel.org
The code in __write_64bit_c0_split() is used by MIPS32 kernels running
on MIPS64 CPUs to write a 64-bit value to a 64-bit coprocessor 0
register using a single 64-bit dmtc0 instruction. It does this by
combining the 2x 32-bit registers used to hold the 64-bit value into a
single register, which in the existing code involves three steps:
1) Zero extend register A which holds bits 31:0 of our data, since it
may have previously held a sign-extended value.
2) Shift register B which holds bits 63:32 of our data in bits 31:0
left by 32 bits, such that the bits forming our data are in the
position they'll be in the final 64-bit value & bits 31:0 of the
register are zero.
3) Or the two registers together to form the 64-bit value in one
64-bit register.
From MIPS r2 onwards we have a dins instruction which can effectively
perform all 3 of those steps using a single instruction.
Add a path for MIPS r2 & beyond which uses dins to take bits 31:0 from
register B & insert them into bits 63:32 of register A, giving us our
full 64-bit value in register A with one instruction.
Since we know that MIPS r2 & above support the sel field for the dmtc0
instruction, we don't bother special casing sel==0. Omiting the sel
field would assemble to exactly the same instruction as when we
explicitly specify that it equals zero.
Signed-off-by: Paul Burton <paul.burton@mips.com>
Commit c22c804310 ("MIPS: Fix input modify in
__write_64bit_c0_split()") modified __write_64bit_c0_split() constraints
such that we have both an input & an output which we hope to assign to
the same registers, and modify the output rather than incorrectly
clobbering an input.
The way in which we use both an output & an input parameter with the
input constrained to share the output registers is a little convoluted &
also problematic for clang, which complains if the input & output values
have different widths. For example:
In file included from kernel/fork.c:98:
./arch/mips/include/asm/mmu_context.h:149:19: error: unsupported
inline asm: input with type 'unsigned long' matching output with
type 'unsigned long long'
write_c0_entryhi(cpu_asid(cpu, next));
~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
./arch/mips/include/asm/mmu_context.h:93:2: note: expanded from macro
'cpu_asid'
(cpu_context((cpu), (mm)) & cpu_asid_mask(&cpu_data[cpu]))
^
./arch/mips/include/asm/mipsregs.h:1617:65: note: expanded from macro
'write_c0_entryhi'
#define write_c0_entryhi(val) __write_ulong_c0_register($10, 0, val)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
./arch/mips/include/asm/mipsregs.h:1430:39: note: expanded from macro
'__write_ulong_c0_register'
__write_64bit_c0_register(reg, sel, val); \
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
./arch/mips/include/asm/mipsregs.h:1400:41: note: expanded from macro
'__write_64bit_c0_register'
__write_64bit_c0_split(register, sel, value); \
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
./arch/mips/include/asm/mipsregs.h:1498:13: note: expanded from macro
'__write_64bit_c0_split'
: "r,0" (val)); \
^~~
We can both fix this build failure & simplify the code somewhat by
assigning the __tmp variable with the input value in C prior to our
inline assembly, and then using a single read-write output operand (ie.
a constraint beginning with +) to provide this value to our assembly.
Signed-off-by: Paul Burton <paul.burton@mips.com>
Using privcmd_call() for a singleton multicall seems to be wrong, as
privcmd_call() is using stac()/clac() to enable hypervisor access to
Linux user space.
Even if currently not a problem (pv domains can't use SMAP while HVM
and PVH domains can't use multicalls) things might change when
PVH dom0 support is added to the kernel.
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
For (bad) historical reasons, OPAL used to create a non-standard pair
of properties "opal-interrupts" and "opal-interrupts-names" for
representing the list of interrupts it wants Linux to request on its
behalf.
Among other issues, the opal-interrupts doesn't have a way to carry
the type of interrupts, and they were assumed to be all level
sensitive.
This is wrong on some recent systems where some of them are edge
sensitive causing warnings in the XIVE code and possible misbehaviours
if they need to be retriggered (typically the NPU2 TCE error
interrupts).
This makes Linux switch to using the standard "interrupts" and
"interrupt-names" properties instead when they are available, using
standard of_irq helpers, which can carry all the desired type
information.
Newer versions of OPAL will generate those properties in addition to
the legacy ones.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Fixup prefix logic to check strlen(r->name). Reinstate setting
of start = 0 in opal_event_shutdown() to avoid double free warnings]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
GCC supports -mcpu=e300c2 and -mcpu=e300c3
This patch gives the opportunity to tune kernel to one of
those two types.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch extends to PPC32 the capability to select the exact
CPU type.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the time being, when adding a new CPU for selection, both
Kconfig.cputype and Makefile have to be modified.
This patch moves into Kconfig.cputype the name of the CPU to me
passed to the -mcpu= argument.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Rename the option to TARGET_CPU to echo the gcc documentation]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In Makefiles if we're testing a CONFIG_FOO symbol for equality with 'y'
we can instead just use ifdef. The latter reads easily, so convert to
it where possible.
Signed-off-by: Rodrigo R. Galvao <rosattig@linux.vnet.ibm.com>
Reviewed-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In __copy_tofrom_user, if we encounter an exception on a store, we
stop copying and return the number of bytes not copied. However,
if the store is wider than one byte and is to an unaligned address,
it is possible that the store operand overlaps a page boundary
and the exception occurred on the latter part of the store operand,
meaning that it would be possible to copy a few more bytes. Since
copy_to_user is generally expected to copy as much as possible,
it would be better to copy those extra few bytes. This adds code
to do that. Since this edge case is not performance-critical,
the code has been written to be compact rather than as fast as
possible.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The hand-coded assembler 64-bit copy routines include feature sections
that select one code path or another depending on which CPU we are
executing on. The self-tests for these copy routines end up testing
just one path. This adds a mechanism for selecting any desired code
path at compile time, and makes 2 or 3 versions of each test, each
using a different code path, so as to cover all the possible paths.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
[mpe: Add -mcpu=power4 to CFLAGS for older compilers]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This aims to make the generation of exception table entries for the
loads and stores in __copy_tofrom_user_base clearer and easier to
verify. Instead of having a series of local labels on the loads and
stores, with a series of corresponding labels later for the exception
handlers, we now use macros to generate exception table entries at the
point of each load and store that could potentially trap. We do this
with the macros lex (load exception) and stex (store exception).
These macros are used right before the load or store to which they
apply.
Some complexity is introduced by the fact that we have some more work
to do after hitting an exception, because we need to calculate and
return the number of bytes not copied. The code uses r3 as the
current pointer into the destination buffer, that is, the address of
the first byte of the destination that has not been modified.
However, at various points in the copy loops, r3 can be 4, 8, 16 or 24
bytes behind that point.
To express this offset in an understandable way, we define a symbol
r3_offset which is updated at various points so that it equal to the
difference between the address of the first unmodified byte of the
destination and the value in r3. (In fact it only needs to be
accurate at the point of each lex or stex macro invocation.)
The rules for updating r3_offset are as follows:
* It starts out at 0
* An addi r3,r3,N instruction decreases r3_offset by N
* A store instruction (stb, sth, stw, std) to N(r3)
increases r3_offset by the width of the store (1, 2, 4, 8)
* A store with update instruction (stbu, sthu, stwu, stdu) to N(r3)
sets r3_offset to the width of the store.
There is some trickiness to the way that the lex and stex macros and
the associated exception handlers work. I would have liked to use
the current value of r3_offset in the name of the symbol used as
the exception handler, as in ".Lld_exc_$(r3_offset)" and then
have symbols .Lld_exc_0, .Lld_exc_8, .Lld_exc_16 etc. corresponding
to the offsets that needed to be added to r3. However, I couldn't
see a way to do that with gas.
Instead, the exception handler address is .Lld_exc - r3_offset or
.Lst_exc - r3_offset, that is, the distance ahead of .Lld_exc/.Lst_exc
that we start executing is equal to the amount that we need to add to
r3. This works because r3_offset is always a small multiple of 4,
and our instructions are 4 bytes long. This means that before
.Lld_exc and .Lst_exc, we have a sequence of instructions that
increments r3 by 4, 8, 16 or 24 depending on where we start. The
sequence increments r3 by 4 per instruction (on average).
We also replace the exception table for the 4k copy loop by a
macro per load or store. These loads and stores all use exactly
the same exception handler, which simply resets the argument registers
r3, r4 and r5 to there original values and re-does the whole copy
using the slower loop.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
for_each_node_by_name() iterators only exit normally when the loop
cursor is NULL, So there is no need to call of_node_put().
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
NX increments readOffset by FIFO size in receive FIFO control register
when CRB is read. But the index in RxFIFO has to match with the
corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
may be processing incorrect CRBs and can cause CRB timeout.
VAS FIFO offset is 0 when the receive window is opened during
initialization. When the module is reloaded or in kexec boot, readOffset
in FIFO control register may not match with VAS entry. This patch adds
nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
control register for both high and normal FIFOs.
Signed-off-by: Haren Myneni <haren@us.ibm.com>
[mpe: Fixup uninitialized variable warning]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Export opal_check_token symbol for modules to check the availability
of OPAL calls before using them.
Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Fix build errors and warnings in t1042rdb_diu.c by adding header files
and MODULE_LICENSE().
../arch/powerpc/platforms/85xx/t1042rdb_diu.c:152:1: warning: data definition has no type or storage class
early_initcall(t1042rdb_diu_init);
../arch/powerpc/platforms/85xx/t1042rdb_diu.c:152:1: error: type defaults to 'int' in declaration of 'early_initcall' [-Werror=implicit-int]
../arch/powerpc/platforms/85xx/t1042rdb_diu.c:152:1: warning: parameter names (without types) in function declaration
and
WARNING: modpost: missing MODULE_LICENSE() in arch/powerpc/platforms/85xx/t1042rdb_diu.o
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Scott Wood <oss@buserror.net>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The page table fragment allocator uses the main page refcount racily
with respect to speculative references. A customer observed a BUG due
to page table page refcount underflow in the fragment allocator. This
can be caused by the fragment allocator set_page_count stomping on a
speculative reference, and then the speculative failure handler
decrements the new reference, and the underflow eventually pops when
the page tables are freed.
Fix this by using a dedicated field in the struct page for the page
table fragment allocator.
Fixes: 5c1f6ee9a3 ("powerpc: Reduce PTE table memory wastage")
Cc: stable@vger.kernel.org # v3.10+
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pasemi code still uses printk(KERN_ERR/KERN_WARN ... change these to
pr_err(, pr_warn(... to match other powerpc arch code.
No functional changes.
Signed-off-by: Darren Stevens <darren@stevens-zone.net>
[mpe: Unsplit some strings while we're at it]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Call show_user_instructions() in arch/powerpc/kernel/traps.c to dump
instructions at faulty location, useful to debugging.
Before this patch, an unhandled signal message looked like:
pandafault[10524]: segfault (11) at 100007d0 nip 1000061c lr 7fffbd295100 code 2 in pandafault[10000000+10000]
After this patch, it looks like:
pandafault[10524]: segfault (11) at 100007d0 nip 1000061c lr 7fffbd295100 code 2 in pandafault[10000000+10000]
pandafault[10524]: code: 4bfffeec 4bfffee8 3c401002 38427f00 fbe1fff8 f821ffc1 7c3f0b78 3d22fffe
pandafault[10524]: code: 392988d0 f93f0020 e93f0020 39400048 <99490000> 39200000 7d234b78 383f0040
Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show_user_instructions() is a slightly modified version of
show_instructions() that allows userspace instruction dump.
This will be useful within show_signal_msg() to dump userspace
instructions of the faulty location.
Here is a sample of what show_user_instructions() outputs:
pandafault[10850]: code: 4bfffeec 4bfffee8 3c401002 38427f00 fbe1fff8 f821ffc1 7c3f0b78 3d22fffe
pandafault[10850]: code: 392988d0 f93f0020 e93f0020 39400048 <99490000> 39200000 7d234b78 383f0040
The current->comm and current->pid printed can serve as a glue that
links the instructions dump to its originator, allowing messages to be
interleaved in the logs.
Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds VMA address in the message printed for unhandled signals,
similarly to what other architectures, like x86, print.
Before this patch, a page fault looked like:
pandafault[61470]: unhandled signal 11 at 100007d0 nip 1000061c lr 7fff8d185100 code 2
After this patch, a page fault looks like:
pandafault[6303]: segfault 11 at 100007d0 nip 1000061c lr 7fff93c55100 code 2 in pandafault[10000000+10000]
Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use %lx format to print registers. This avoids having two different
formats and avoids checking for MSR_64BIT, improving readability of the
function.
Even though we could have used %px, which is functionally equivalent to %lx
as per Documentation/core-api/printk-formats.rst, it is not semantically
correct because the data printed are not pointers. And using %px requires
casting data to (void *).
Besides that, %lx matches the format used in show_regs().
Before this patch:
pandafault[4808]: unhandled signal 11 at 0000000010000718 nip 0000000010000574 lr 00007fff935e7a6c code 2
After this patch:
pandafault[4732]: unhandled signal 11 at 10000718 nip 10000574 lr 7fff86697a6c code 2
Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Replace printk_ratelimited() by printk() and a default rate limit
burst to limit displaying unhandled signals messages.
This will allow us to call print_vma_addr() in a future patch, which
does not work with printk_ratelimited().
Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Isolate the logic of printing unhandled signals out of _exception_pkey().
No functional change, only code rearrangement.
Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Because rfi_flush_fallback runs immediately before the return to
userspace it currently runs with the user r1 (stack pointer). This
means if we oops in there we will report a bad kernel stack pointer in
the exception entry path, eg:
Bad kernel stack pointer 7ffff7150e40 at c0000000000023b4
Oops: Bad kernel stack pointer, sig: 6 [#1]
LE SMP NR_CPUS=32 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1246 Comm: klogd Not tainted 4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3 #7
NIP: c0000000000023b4 LR: 0000000010053e00 CTR: 0000000000000040
REGS: c0000000fffe7d40 TRAP: 4100 Not tainted (4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3)
MSR: 9000000002803031 <SF,HV,VEC,VSX,FP,ME,IR,DR,LE> CR: 44000442 XER: 20000000
CFAR: c00000000000bac8 IRQMASK: c0000000f1e66a80
GPR00: 0000000002000000 00007ffff7150e40 00007fff93a99900 0000000000000020
...
NIP [c0000000000023b4] rfi_flush_fallback+0x34/0x80
LR [0000000010053e00] 0x10053e00
Although the NIP tells us where we were, and the TRAP number tells us
what happened, it would still be nicer if we could report the actual
exception rather than barfing about the stack pointer.
We an do that fairly simply by loading the kernel stack pointer on
entry and restoring the user value before returning. That way we see a
regular oops such as:
Unrecoverable exception 4100 at c00000000000239c
Oops: Unrecoverable exception, sig: 6 [#1]
LE SMP NR_CPUS=32 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1251 Comm: klogd Not tainted 4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty #40
NIP: c00000000000239c LR: 0000000010053e00 CTR: 0000000000000040
REGS: c0000000f1e17bb0 TRAP: 4100 Not tainted (4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty)
MSR: 9000000002803031 <SF,HV,VEC,VSX,FP,ME,IR,DR,LE> CR: 44000442 XER: 20000000
CFAR: c00000000000bac8 IRQMASK: 0
...
NIP [c00000000000239c] rfi_flush_fallback+0x3c/0x80
LR [0000000010053e00] 0x10053e00
Call Trace:
[c0000000f1e17e30] [c00000000000b9e4] system_call+0x5c/0x70 (unreliable)
Note this shouldn't make the kernel stack pointer vulnerable to a
meltdown attack, because it should be flushed from the cache before we
return to userspace. The user r1 value will be in the cache, because
we load it in the return path, but that is harmless.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Look for fw-features properties to determine the appropriate settings
for the count cache flush, and then call the generic powerpc code to
set it up based on the security feature flags.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use the existing hypercall to determine the appropriate settings for
the count cache flush, and then call the generic powerpc code to set
it up based on the security feature flags.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some CPU revisions support a mode where the count cache needs to be
flushed by software on context switch. Additionally some revisions may
have a hardware accelerated flush, in which case the software flush
sequence can be shortened.
If we detect the appropriate flag from firmware we patch a branch
into _switch() which takes us to a count cache flush sequence.
That sequence in turn may be patched to return early if we detect that
the CPU supports accelerating the flush sequence in hardware.
Add debugfs support for reporting the state of the flush, as well as
runtime disabling it.
And modify the spectre_v2 sysfs file to report the state of the
software flush.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add security feature flags to indicate the need for software to flush
the count cache on context switch, and for the presence of a hardware
assisted count cache flush.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a macro and some helper C functions for patching single asm
instructions.
The gas macro means we can do something like:
1: nop
patch_site 1b, patch__foo
Which is less visually distracting than defining a GLOBAL symbol at 1,
and also doesn't pollute the symbol table which can confuse eg. perf.
These are obviously similar to our existing feature sections, but are
not automatically patched based on CPU/MMU features, rather they are
designed to be manually patched by C code at some arbitrary point.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Used barrier_nospec to sanitize the syscall table.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Implement the barrier_nospec as a isync;sync instruction sequence.
The implementation uses the infrastructure built for BOOK3S 64.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In a subsequent patch we will enable building security.c for Book3E.
However the NXP platforms are not vulnerable to Meltdown, so make the
Meltdown vulnerability reporting PPC_BOOK3S_64 specific.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently we require platform code to call setup_barrier_nospec(). But
if we add an empty definition for the !CONFIG_PPC_BARRIER_NOSPEC case
then we can call it in setup_arch().
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a config symbol to encode which platforms support the
barrier_nospec speculation barrier. Currently this is just Book3S 64
but we will add Book3E in a future patch.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
NXP Book3E platforms are not vulnerable to speculative store
bypass, so make the mitigations PPC_BOOK3S_64 specific.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now that '%asm-generic' is added to no-dot-config-targets,
'make asm-generic' does not include the kernel configuration.
You can simply do 'make asm-generic' in the recursed top Makefile
without bothering syncconfig.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Richard Weinberger <richard@nod.at>
Randy Dunlap reports UML occasionally fails to build with -j<N> and
O=<builddir> options.
make[1]: Entering directory '/home/rdunlap/mmotm-2018-0802-1529/UM64'
UPD include/generated/uapi/linux/version.h
WRAP arch/x86/include/generated/asm/dma-contiguous.h
WRAP arch/x86/include/generated/asm/export.h
WRAP arch/x86/include/generated/asm/early_ioremap.h
WRAP arch/x86/include/generated/asm/mcs_spinlock.h
WRAP arch/x86/include/generated/asm/mm-arch-hooks.h
WRAP arch/x86/include/generated/uapi/asm/bpf_perf_event.h
WRAP arch/x86/include/generated/uapi/asm/poll.h
GEN ./Makefile
make[2]: *** No rule to make target 'archheaders'. Stop.
arch/um/Makefile:119: recipe for target 'archheaders' failed
make[1]: *** [archheaders] Error 2
make[1]: *** Waiting for unfinished jobs....
UPD include/config/kernel.release
make[1]: *** wait: No child processes. Stop.
Makefile:146: recipe for target 'sub-make' failed
make: *** [sub-make] Error 2
The cause of the problem is the use of '$(MAKE) KBUILD_SRC=',
which recurses to the top Makefile via the $(objtree)/Makefile
generated by scripts/mkmakefile.
When you run "make -j<N> O=<builddir> ARCH=um", Make can execute
'archheaders' and 'outputmakefile' targets simultaneously because
there is no dependency between them.
If it happens,
$(Q)$(MAKE) KBUILD_SRC= ARCH=$(HEADER_ARCH) archheaders
... tries to run $(objtree)/Makefile that is being updated.
The correct way for the recursion is
$(Q)$(MAKE) -f $(srctree)/Makefile ARCH=$(HEADER_ARCH) archheaders
..., which does not rely on the generated Makefile.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Richard Weinberger <richard@nod.at>
The speculation barrier can be disabled from the command line
with the parameter: "nospectre_v1".
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We only need to use __MASKABLE_EXCEPTION in one of the four cases for
hardware interrupt, so use the helper macros in the other cases.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We pass the "loc" (location) parameter to MASKABLE_EXCEPTION and
friends, but it's not used, so drop it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
_MASKABLE_RELON_EXCEPTION_PSERIES() does nothing useful, update all
callers to use __MASKABLE_RELON_EXCEPTION_PSERIES() directly.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
_MASKABLE_EXCEPTION_PSERIES() does nothing useful, update all callers
to use __MASKABLE_EXCEPTION_PSERIES() directly.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The EXCEPTION_RELON_PROLOG_PSERIES_1() macro does the same job as
EXCEPTION_PROLOG_2 (which we just recently created), except for
"RELON" (relocation on) exceptions.
So rename it as such.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As with the other patches in this series, we are removing the
"PSERIES" from the name as it's no longer meaningful.
In this case it's not simply a case of removing the "PSERIES" as that
would result in a clash with the existing EXCEPTION_PROLOG_1.
Instead we name this one EXCEPTION_PROLOG_2, as it's usually used in
sequence after 0 and 1.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The "PSERIES" in STD_EXCEPTION_PSERIES is to differentiate the macros
from the legacy iSeries versions, which are called
STD_EXCEPTION_ISERIES. It is not anything to do with pseries vs
powernv or powermac etc.
We removed the legacy iSeries code in 2012, in commit 8ee3e0d69623x
("powerpc: Remove the main legacy iSerie platform code").
So remove "PSERIES" from the macros.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
EXCEPTION_RELON_PROLOG_PSERIES() only has two users,
STD_RELON_EXCEPTION_PSERIES() and STD_RELON_EXCEPTION_HV() both of
which "call" SET_SCRATCH0(), so just move SET_SCRATCH0() into
EXCEPTION_RELON_PROLOG_PSERIES().
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
EXCEPTION_PROLOG_PSERIES() only has two users, STD_EXCEPTION_PSERIES()
and STD_EXCEPTION_HV() both of which "call" SET_SCRATCH0(), so just
move SET_SCRATCH0() into EXCEPTION_PROLOG_PSERIES().
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pasemi arch code finds the root of the PCI-e bus by searching the
device-tree for a node called 'pxp'. But the root bus has a compatible
property of 'pasemi,rootbus' so search for that instead.
Signed-off-by: Darren Stevens <darren@stevens-zone.net>
Acked-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The generic implementation of strlen() reads strings byte per byte.
This patch implements strlen() in assembly based on a read of entire
words, in the same spirit as what some other arches and glibc do.
On a 8xx the time spent in strlen is reduced by 3/4 for long strings.
strlen() selftest on an 8xx provides the following values:
Before the patch (ie with the generic strlen() in lib/string.c):
len 256 : time = 1.195055
len 016 : time = 0.083745
len 008 : time = 0.046828
len 004 : time = 0.028390
After the patch:
len 256 : time = 0.272185 ==> 78% improvment
len 016 : time = 0.040632 ==> 51% improvment
len 008 : time = 0.033060 ==> 29% improvment
len 004 : time = 0.029149 ==> 2% degradation
On a 832x:
Before the patch:
len 256 : time = 0.236125
len 016 : time = 0.018136
len 008 : time = 0.011000
len 004 : time = 0.007229
After the patch:
len 256 : time = 0.094950 ==> 60% improvment
len 016 : time = 0.013357 ==> 26% improvment
len 008 : time = 0.010586 ==> 4% improvment
len 004 : time = 0.008784
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
rtas_log_buf is a buffer to hold RTAS event data that are communicated
to kernel by hypervisor. This buffer is then used to pass RTAS event
data to user through proc fs. This buffer is allocated from
vmalloc (non-linear mapping) area.
On Machine check interrupt, register r3 points to RTAS extended event
log passed by hypervisor that contains the MCE event. The pseries
machine check handler then logs this error into rtas_log_buf. The
rtas_log_buf is a vmalloc-ed (non-linear) buffer we end up taking up a
page fault (vector 0x300) while accessing it. Since machine check
interrupt handler runs in NMI context we can not afford to take any
page fault. Page faults are not honored in NMI context and causes
kernel panic. Apart from that, as Nick pointed out,
pSeries_log_error() also takes a spin_lock while logging error which
is not safe in NMI context. It may endup in deadlock if we get another
MCE before releasing the lock. Fix this by deferring the logging of
rtas error to irq work queue.
Current implementation uses two different buffers to hold rtas error
log depending on whether extended log is provided or not. This makes
bit difficult to identify which buffer has valid data that needs to
logged later in irq work. Simplify this using single buffer, one per
paca, and copy rtas log to it irrespective of whether extended log is
provided or not. Allocate this buffer below RMA region so that it can
be accessed in real mode mce handler.
Fixes: b96672dd84 ("powerpc: Machine check interrupt is a non-maskable interrupt")
Cc: stable@vger.kernel.org # v4.14+
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The global mce data buffer that used to copy rtas error log is of 2048
(RTAS_ERROR_LOG_MAX) bytes in size. Before the copy we read
extended_log_length from rtas error log header, then use max of
extended_log_length and RTAS_ERROR_LOG_MAX as a size of data to be copied.
Ideally the platform (phyp) will never send extended error log with
size > 2048. But if that happens, then we have a risk of buffer overrun
and corruption. Fix this by using min_t instead.
Fixes: d368514c30 ("powerpc: Fix corruption when grabbing FWNMI data")
Reported-by: Michal Suchanek <msuchanek@suse.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
It's identical to xive_teardown_cpu() so just use the latter
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Those overly verbose statement in the setup of the pool VP
aren't particularly useful (esp. considering we don't actually
use the pool, we configure it bcs HW requires it only). So
remove them which improves the code readability.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The kernel page table caches are tied to init_mm, so there is no
more need for them after userspace is finished.
destroy_context() gets called when we drop the last reference for an
mm, which can be much later than the task exit due to other lazy mm
references to it. We can free the page table cache pages on task exit
because they only cache the userspace page tables and kernel threads
should not access user space addresses.
The mapping for kernel threads itself is maintained in init_mm and
page table cache for that is attached to init_mm.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Merge change log additions from Aneesh]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When the mm is being torn down there will be a full PID flush so
there is no need to flush the TLB on page size changes.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This makes it easy to run checkpatch with settings that I like.
Usage is eg:
$ ./arch/powerpc/tools/checkpatch.sh -g origin/master..
To check all commits since origin/master.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Russell Currey <ruscur@russell.cc>
We recently added a warning in arch_local_irq_restore() to check that
the soft masking state matches reality.
Unfortunately it trips in a few places, which are not entirely trivial
to fix. The key problem is if we're doing function_graph tracing of
restore_math(), the warning pops and then seems to recurse. It's not
entirely clear because the system continuously oopses on all CPUs,
with the output interleaved and unreadable.
It's also been observed on a G5 coming out of idle.
Until we can fix those cases disable the warning for now.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For machines without the exrl instruction the BFP jit generates
code that uses an "br %r1" instruction located in the lowcore page.
Unfortunately there is a cut & paste error that puts an additional
"larl %r1,.+14" instruction in the code that clobbers the branch
target address in %r1. Remove the larl instruction.
Cc: <stable@vger.kernel.org> # v4.17+
Fixes: de5cb6eb51 ("s390: use expoline thunks in the BPF JIT")
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The memove, memset, memcpy, __memset16, __memset32 and __memset64
function have an additional indirect return branch in form of a
"bzr" instruction. These need to use expolines as well.
Cc: <stable@vger.kernel.org> # v4.17+
Fixes: 97489e0663 ("s390/lib: use expoline for indirect branches")
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Josh reported that the late SMT evaluation in cpu_smt_state_init() sets
cpu_smt_control to CPU_SMT_NOT_SUPPORTED in case that 'nosmt' was supplied
on the kernel command line as it cannot differentiate between SMT disabled
by BIOS and SMT soft disable via 'nosmt'. That wreckages the state and
makes the sysfs interface unusable.
Rework this so that during bringup of the non boot CPUs the availability of
SMT is determined in cpu_smt_allowed(). If a newly booted CPU is not a
'primary' thread then set the local cpu_smt_available marker and evaluate
this explicitely right after the initial SMP bringup has finished.
SMT evaulation on x86 is a trainwreck as the firmware has all the
information _before_ booting the kernel, but there is no interface to query
it.
Fixes: 73d5e2b472 ("cpu/hotplug: detect SMT disabled by BIOS")
Reported-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Enhance the GHASH implementation that uses 64-bit polynomial
multiplication by adding support for 4-way aggregation. This
more than doubles the performance, from 2.4 cycles per byte
to 1.1 cpb on Cortex-A53.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Checking the TIF_NEED_RESCHED flag is disproportionately costly on cores
with fast crypto instructions and comparatively slow memory accesses.
On algorithms such as GHASH, which executes at ~1 cycle per byte on
cores that implement support for 64 bit polynomial multiplication,
there is really no need to check the TIF_NEED_RESCHED particularly
often, and so we can remove the NEON yield check from the assembler
routines.
However, unlike the AEAD or skcipher APIs, the shash/ahash APIs take
arbitrary input lengths, and so there needs to be some sanity check
to ensure that we don't hog the CPU for excessive amounts of time.
So let's simply cap the maximum input size that is processed in one go
to 64 KB.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
It turns out I had misunderstood how the x86_match_cpu() function works.
It evaluates a logical OR of the matching conditions, not logical AND.
This caused the CPU feature checks for AEGIS to pass even if only SSE2
(but not AES-NI) was supported (or vice versa), leading to potential
crashes if something tried to use the registered algs.
This patch switches the checks to a simpler method that is used e.g. in
the Camellia x86 code.
The patch also removes the MODULE_DEVICE_TABLE declarations which
actually seem to cause the modules to be auto-loaded at boot, which is
not desired. The crypto API on-demand module loading is sufficient.
Fixes: 1d373d4e8e ("crypto: x86 - Add optimized AEGIS implementations")
Fixes: 6ecc9d9ff9 ("crypto: x86 - Add optimized MORUS implementations")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Squeeze out another 5% of performance by minimizing the number
of invocations of kernel_neon_begin()/kernel_neon_end() on the
common path, which also allows some reloads of the key schedule
to be optimized away.
The resulting code runs at 2.3 cycles per byte on a Cortex-A53.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Implement a faster version of the GHASH transform which amortizes
the reduction modulo the characteristic polynomial across two
input blocks at a time.
On a Cortex-A53, the gcm(aes) performance increases 24%, from
3.0 cycles per byte to 2.4 cpb for large input sizes.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Update the core AES/GCM transform and the associated plumbing to operate
on 2 AES/GHASH blocks at a time. By itself, this is not expected to
result in a noticeable speedup, but it paves the way for reimplementing
the GHASH component using 2-way aggregation.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
As it turns out, checking the TIF_NEED_RESCHED flag after each
iteration results in a significant performance regression (~10%)
when running fast algorithms (i.e., ones that use special instructions
and operate in the < 4 cycles per byte range) on in-order cores with
comparatively slow memory accesses such as the Cortex-A53.
Given the speed of these ciphers, and the fact that the page based
nature of the AEAD scatterwalk API guarantees that the core NEON
transform is never invoked with more than a single page's worth of
input, we can estimate the worst case duration of any resulting
scheduling blackout: on a 1 GHz Cortex-A53 running with 64k pages,
processing a page's worth of input at 4 cycles per byte results in
a delay of ~250 us, which is a reasonable upper bound.
So let's remove the yield checks from the fused AES-CCM and AES-GCM
routines entirely.
This reverts commit 7b67ae4d5c and
partially reverts commit 7c50136a8a.
Fixes: 7c50136a8a ("crypto: arm64/aes-ghash - yield NEON after every ...")
Fixes: 7b67ae4d5c ("crypto: arm64/aes-ccm - yield NEON after every ...")
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Passing an array (swapper_pg_dir) as the argument to write_c0_kpgd() in
setup_pw() will become problematic if we modify __write_64bit_c0_split()
to cast its val argument to unsigned long long, because for 32-bit
kernel builds the size of a pointer will differ from the size of an
unsigned long long. This would fall foul of gcc's pointer-to-int-cast
diagnostic.
Cast the value to a long, which should be the same width as the pointer
that we ultimately want & will be sign extended if required to the
unsigned long long that __write_64bit_c0_split() ultimately needs.
Signed-off-by: Paul Burton <paul.burton@mips.com>
The MIPS VDSO code filters out a subset of known-good flags from
KBUILD_CFLAGS to use when building VDSO libraries. When we build using
clang we need to allow the --target flag through, otherwise we'll
generally attempt to build the VDSO for the architecture of the build
machine rather than for MIPS.
Signed-off-by: Paul Burton <paul.burton@mips.com>
Patchwork: https://patchwork.linux-mips.org/patch/20154/
Cc: James Hogan <jhogan@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Our genvdso tool performs some rather paranoid checking that the VDSO
library isn't attempting to make use of a GOT by constraining the number
of entries that the GOT is allowed to contain to the minimum 2 entries
that are always generated by binutils.
Unfortunately lld prior to revision 334390 generates a third entry,
which is unused & thus harmless but falls foul of genvdso's checks &
causes the build to fail.
Since we already check that the VDSO contains no relocations it seems
reasonable to presume that it also doesn't contain use of a GOT, which
would involve relocations. Thus rather than attempting to work around
this issue by allowing 3 GOT entries when using lld, simply remove the
GOT checks which seem overly paranoid.
Signed-off-by: Paul Burton <paul.burton@mips.com>
Patchwork: https://patchwork.linux-mips.org/patch/20152/
Cc: James Hogan <jhogan@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Commit d94a155c59 ("x86/cpu: Prevent cpuinfo_x86::x86_phys_bits
adjustment corruption") has moved the query and calculation of the
x86_virt_bits and x86_phys_bits fields of the cpuinfo_x86 struct
from the get_cpu_cap function to a new function named
get_cpu_address_sizes.
One of the call sites related to Xen PV VMs was unfortunately missed
in the aforementioned commit. This prevents successful boot-up of
kernel versions 4.17 and up in Xen PV VMs if CONFIG_DEBUG_VIRTUAL
is enabled, due to the following code path:
enlighten_pv.c::xen_start_kernel
mmu_pv.c::xen_reserve_special_pages
page.h::__pa
physaddr.c::__phys_addr
physaddr.h::phys_addr_valid
phys_addr_valid uses boot_cpu_data.x86_phys_bits to validate physical
addresses. boot_cpu_data.x86_phys_bits is no longer populated before
the call to xen_reserve_special_pages due to the aforementioned commit
though, so the validation performed by phys_addr_valid fails, which
causes __phys_addr to trigger a BUG, preventing boot-up.
Signed-off-by: M. Vefa Bicakci <m.v.b@runbox.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: xen-devel@lists.xenproject.org
Cc: x86@kernel.org
Cc: stable@vger.kernel.org # for v4.17 and up
Fixes: d94a155c59 ("x86/cpu: Prevent cpuinfo_x86::x86_phys_bits adjustment corruption")
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
The kernel image is mapped into two places in the virtual address space
(addresses without KASLR, of course):
1. The kernel direct map (0xffff880000000000)
2. The "high kernel map" (0xffffffff81000000)
We actually execute out of #2. If we get the address of a kernel symbol,
it points to #2, but almost all physical-to-virtual translations point to
Parts of the "high kernel map" alias are mapped in the userspace page
tables with the Global bit for performance reasons. The parts that we map
to userspace do not (er, should not) have secrets. When PTI is enabled then
the global bit is usually not set in the high mapping and just used to
compensate for poor performance on systems which lack PCID.
This is fine, except that some areas in the kernel image that are adjacent
to the non-secret-containing areas are unused holes. We free these holes
back into the normal page allocator and reuse them as normal kernel memory.
The memory will, of course, get *used* via the normal map, but the alias
mapping is kept.
This otherwise unused alias mapping of the holes will, by default keep the
Global bit, be mapped out to userspace, and be vulnerable to Meltdown.
Remove the alias mapping of these pages entirely. This is likely to
fracture the 2M page mapping the kernel image near these areas, but this
should affect a minority of the area.
The pageattr code changes *all* aliases mapping the physical pages that it
operates on (by default). We only want to modify a single alias, so we
need to tweak its behavior.
This unmapping behavior is currently dependent on PTI being in place.
Going forward, we should at least consider doing this for all
configurations. Having an extra read-write alias for memory is not exactly
ideal for debugging things like random memory corruption and this does
undercut features like DEBUG_PAGEALLOC or future work like eXclusive Page
Frame Ownership (XPFO).
Before this patch:
current_kernel:---[ High Kernel Mapping ]---
current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd
current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte
current_kernel-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE NX pmd
current_kernel-0xffffffff82c00000-0xffffffff82e00000 2M RW NX pte
current_kernel-0xffffffff82e00000-0xffffffff83200000 4M RW PSE NX pmd
current_kernel-0xffffffff83200000-0xffffffffa0000000 462M pmd
current_user:---[ High Kernel Mapping ]---
current_user-0xffffffff80000000-0xffffffff81000000 16M pmd
current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_user-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte
current_user-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd
After this patch:
current_kernel:---[ High Kernel Mapping ]---
current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd
current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K pte
current_kernel-0xffffffff82000000-0xffffffff82400000 4M ro PSE GLB NX pmd
current_kernel-0xffffffff82400000-0xffffffff82488000 544K ro NX pte
current_kernel-0xffffffff82488000-0xffffffff82600000 1504K pte
current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE NX pmd
current_kernel-0xffffffff82c00000-0xffffffff82c0d000 52K RW NX pte
current_kernel-0xffffffff82c0d000-0xffffffff82dc0000 1740K pte
current_user:---[ High Kernel Mapping ]---
current_user-0xffffffff80000000-0xffffffff81000000 16M pmd
current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_user-0xffffffff81e11000-0xffffffff82000000 1980K pte
current_user-0xffffffff82000000-0xffffffff82400000 4M ro PSE GLB NX pmd
current_user-0xffffffff82400000-0xffffffff82488000 544K ro NX pte
current_user-0xffffffff82488000-0xffffffff82600000 1504K pte
current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd
[ tglx: Do not unmap on 32bit as there is only one mapping ]
Fixes: 0f561fce4d ("x86/pti: Enable global pages for shared areas")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20180802225831.5F6A2BFC@viggo.jf.intel.com
As there is precious little left in any DTS files referring to the
node "/chosen@0" as opposed to "/chosen", remove the two checks for
the former node name.
[paul.burton@mips.com:
The modified yamon-dt code only operates on
arch/mips/boot/dts/mti/sead3.dts right now, and that uses chosen
rather than chosen@0 anyway, so this should have no behavioural
effect.]
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Paul Burton <paul.burton@mips.com>
Patchwork: https://patchwork.linux-mips.org/patch/20131/
Cc: linux-mips@linux-mips.org
Remove open-coded uses of set instructions to use CC_SET()/CC_OUT() in
arch/x86/kvm/vmx.c.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
[Mark error paths as unlikely while touching this. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implement paravirtual apic hooks to enable PV IPIs for KVM if the "send IPI"
hypercall is available. The hypercall lets a guest send IPIs, with
at most 128 destinations per hypercall in 64-bit mode and 64 vCPUs per
hypercall in 32-bit mode.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add kvm hypervisor init time platform setup callback which
will be used to replace native apic hooks by pararvirtual
hooks.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
X86_CR4_OSXSAVE check belongs to sregs check and so move into
kvm_valid_sregs().
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Considering the fact that the pae_root shadow is not needed when
tdp is in use, skip the pae_root shadow page allocation to allow
mmu creation even not being able to obtain memory from DMA32
zone when particular cgroup cpuset.mems or mempolicy control is
applied.
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
mmu_set_spte() flushes remote tlbs for drop_parent_pte/drop_spte()
and set_spte() separately. This may introduce redundant flush. This
patch is to combine these flushes and check flush request after
calling set_spte().
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Reviewed-by: Junaid Shahid <junaids@google.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The host's FS.base and GS.base rarely change, e.g. ~0.1% of host/guest
swaps on my system. Cache the last value written to the VMCS and skip
the VMWRITE to the associated VMCS fields when loading host state if
the value hasn't changed since the last VMWRITE.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On a 64-bit host, FS.sel and GS.sel are all but guaranteed to be 0,
which in turn means they'll rarely change. Skip the VMWRITE for the
associated VMCS fields when loading host state if the selector hasn't
changed since the last VMWRITE.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The HOST_{FS,GS}_BASE fields are guaranteed to be written prior to
VMENTER, by way of vmx_prepare_switch_to_guest(). Initialize the
fields to zero for 64-bit kernels instead of pulling the base values
from their respective MSRs. In addition to eliminating two RDMSRs,
vmx_prepare_switch_to_guest() can safely assume the initial value of
the fields is zero in all cases.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make host_state a property of a loaded_vmcs so that it can be
used as a cache of the VMCS fields, e.g. to lazily VMWRITE the
corresponding VMCS field. Treating host_state as a cache does
not work if it's not VMCS specific as the cache would become
incoherent when switching between vmcs01 and vmcs02.
Move vmcs_host_cr3 and vmcs_host_cr4 into host_state.
Explicitly zero out host_state when allocating a new VMCS for a
loaded_vmcs. Unlike the pre-existing vmcs_host_cr{3,4} usage,
the segment information is not guaranteed to be (re)initialized
when running a new nested VMCS, e.g. HOST_FS_BASE is not written
in vmx_set_constant_host_state().
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove fs_reload_needed and gs_ldt_reload_needed from host_state
and instead compute whether we need to reload various state at
the time we actually do the reload. The state that is tracked
by the *_reload_needed variables is not any more volatile than
the trackers themselves.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
prepare_vmcs02() has an odd comment that says certain fields are
"not in vmcs02". AFAICT the intent of the comment is to document
that various VMCS fields are not handled by prepare_vmcs02(),
e.g. HOST_{FS,GS}_{BASE,SELECTOR}. While technically true, the
comment is misleading, e.g. it can lead the reader to think that
KVM never writes those fields to vmcs02.
Remove the comment altogether as the handling of FS and GS is
not specific to nested VMX, and GUEST_PML_INDEX has been written
by prepare_vmcs02() since commit "4e59516a12a6 (kvm: vmx: ensure
VMCS is current while enabling PML)"
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that the vmx_load_host_state() wrapper is gone, i.e. the only
time we call the core functions is when we're actually about to
switch between guest/host, rename the functions that handle lazy
state switching to vmx_prepare_switch_to_{guest,host}_state() to
better document the full extent of their functionality.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When lazy save/restore of MSR_KERNEL_GS_BASE was introduced[1], the
MSR was intercepted in all modes and was only restored for the host
when the guest is in 64-bit mode. So at the time, going through the
full host restore prior to accessing MSR_KERNEL_GS_BASE was necessary
to load host state and was not a significant waste of cycles.
Later, MSR_KERNEL_GS_BASE interception was disabled for a 64-bit
guest[2], and then unconditionally saved/restored for the host[3].
As a result, loading full host state is overkill for accesses to
MSR_KERNEL_GS_BASE, and completely unnecessary when the guest is
not in 64-bit mode.
Add a dedicated utility to read/write the guest's MSR_KERNEL_GS_BASE
(outside of the save/restore flow) to minimize the overhead incurred
when accessing the MSR. When setting EFER, only decache the MSR if
the new EFER will disable long mode.
Removing out-of-band usage of vmx_load_host_state() also eliminates,
or at least reduces, potential corner cases in its usage, which in
turn will (hopefuly) make it easier to reason about future changes
to the save/restore flow, e.g. optimization of saving host state.
[1] commit 44ea2b1758 ("KVM: VMX: Move MSR_KERNEL_GS_BASE out of the vmx
autoload msr area")
[2] commit 5897297bc2 ("KVM: VMX: Don't intercept MSR_KERNEL_GS_BASE")
[3] commit c8770e7ba6 ("KVM: VMX: Fix host userspace gsbase corruption")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Using 'struct loaded_vmcs*' to track whether the CPU registers
contain host or guest state kills two birds with one stone.
1. The (effective) boolean host_state.loaded is poorly named.
It does not track whether or not host state is loaded into
the CPU registers (which most readers would expect), but
rather tracks if host state has been saved AND guest state
is loaded.
2. Using a loaded_vmcs pointer provides a more robust framework
for the optimized guest/host state switching, especially when
consideration per-VMCS enhancements. To that end, WARN_ONCE
if we try to switch to host state with a different VMCS than
was last used to save host state.
Resolve an occurrence of the new WARN by setting loaded_vmcs after
the call to vmx_vcpu_put() in vmx_switch_vmcs().
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use local variables in vmx_save_host_state() to temporarily track
the selector and base values for FS and GS, and reorganize the
code so that the 64-bit vs 32-bit portions are contained within
a single #ifdef. This refactoring paves the way for future patches
to modify the updating of VMCS state with minimal changes to the
code, and (hopefully) simplifies resolving a likely conflict with
another in-flight patch[1] by being the whipping boy for future
patches.
[1] https://www.spinics.net/lists/kvm/msg171647.html
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When checking emulated VMX instructions for faults, the #UD for "IF
(not in VMX operation)" should take precedence over the #GP for "ELSIF
CPL > 0."
Suggested-by: Eric Northup <digitaleric@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The fault that should be raised for a privilege level violation is #GP
rather than #UD.
Fixes: 727ba748e1 ("kvm: nVMX: Enforce cpl=0 for VMX instructions")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Register tlb_remote_flush callback for vmx when hyperv capability of
nested guest mapping flush is detected. The interface can help to
reduce overhead when flush ept table among vcpus for nested VM. The
tradition way is to send IPIs to all affected vcpus and executes
INVEPT on each vcpus. It will trigger several vmexits for IPI
and INVEPT emulation. Hyper-V provides such hypercall to do
flush for all vcpus and call the hypercall when all ept table
pointers of single VM are same.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch is to provide a way for platforms to register hv tlb remote
flush callback and this helps to optimize operation of tlb flush
among vcpus for nested virtualization case.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch is to add hyperv_nested_flush_guest_mapping support to trace
hvFlushGuestPhysicalAddressSpace hypercall.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Acked-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hyper-V supports a pv hypercall HvFlushGuestPhysicalAddressSpace to
flush nested VM address space mapping in l1 hypervisor and it's to
reduce overhead of flushing ept tlb among vcpus. This patch is to
implement it.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Acked-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On a VM with only 1 vCPU, the locking fast path will always be
successful. In this case, there is no need to use the the PV qspinlock
code which has higher overhead on the unlock side than the native
qspinlock code.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge check of "sp->role.cr4_pae != !!is_pae(vcpu))" and "vcpu->
arch.mmu.sync_page(vcpu, sp) == 0". kvm_mmu_prepare_zap_page()
is called under both these conditions.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It is a duplicate of X86_CR3_PCID_NOFLUSH. So just use that instead.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Adds support for storing multiple previous CR3/root_hpa pairs maintained
as an LRU cache, so that the lockless CR3 switch path can be used when
switching back to any of them.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This needs a minor bug fix. The updated patch is as follows.
Thanks,
Junaid
------------------------------------------------------------------------------
kvm_mmu_invlpg() and kvm_mmu_invpcid_gva() only need to flush the TLB
entries for the specific guest virtual address, instead of flushing all
TLB entries associated with the VM.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the guest indicates that the TLB doesn't need to be flushed in a
CR3 switch, we can also skip resyncing the shadow page tables since an
out-of-sync shadow page table is equivalent to an out-of-sync TLB.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_mmu_free_roots() now takes a mask specifying which roots to free, so
that either one of the roots (active/previous) can be individually freed
when needed.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This allows invlpg() to be called using either the active root_hpa
or the prev_root_hpa.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When PCIDs are enabled, the MSb of the source operand for a MOV-to-CR3
instruction indicates that the TLB doesn't need to be flushed.
This change enables this optimization for MOV-to-CR3s in the guest
that have been intercepted by KVM for shadow paging and are handled
within the fast CR3 switch path.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implement support for INVPCID in shadow paging mode as well.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When using shadow paging mode, propagate the guest's PCID value to
the shadow CR3 in the host instead of always using PCID 0.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the implicit flush from the set_cr3 handlers, so that the
callers are able to decide whether to flush the TLB or not.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the fast CR3 switch mechanism to locklessly change the MMU root
page when switching between L1 and L2. The switch from L2 to L1 should
always go through the fast path, while the switch from L1 to L2 should
go through the fast path if L1's CR3/EPTP for L2 hasn't changed
since the last time.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This adds support for re-initializing the MMU context in a different
mode while preserving the active root_hpa and the prev_root.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This generalizes the lockless CR3 switch path to be able to work
across different MMU modes (e.g. nested vs non-nested) by checking
that the expected page role of the new root page matches the page role
of the previously stored root page in addition to checking that the new
CR3 matches the previous CR3. Furthermore, instead of loading the
hardware CR3 in fast_cr3_switch(), it is now done in vcpu_enter_guest(),
as by that time the MMU context would be up-to-date with the VCPU mode.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The KVM_REQ_LOAD_CR3 request loads the hardware CR3 using the
current root_hpa.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These functions factor out the base role calculation from the
corresponding kvm_init_*_mmu() functions. The new functions return
what would be the role assigned to a root page in the current VCPU
state. This can be masked with mmu_base_role_mask to derive the base
role.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When using shadow paging, a CR3 switch in the guest results in a VM Exit.
In the common case, that VM exit doesn't require much processing by KVM.
However, it does acquire the MMU lock, which can start showing signs of
contention under some workloads even on a 2 VCPU VM when the guest is
using KPTI. Therefore, we add a fast path that avoids acquiring the MMU
lock in the most common cases e.g. when switching back and forth between
the kernel and user mode CR3s used by KPTI with no guest page table
changes in between.
For now, this fast path is implemented only for 64-bit guests and hosts
to avoid the handling of PDPTEs, but it can be extended later to 32-bit
guests and/or hosts as well.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_mmu_sync_roots() can locklessly check whether a sync is needed and just
bail out if it isn't.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
sync_page() calls set_spte() from a loop across a page table. It would
work better if set_spte() left the TLB flushing to its callers, so that
sync_page() can aggregate into a single call.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
No functionality change.
This is done as a preparation for VMCS shadowing virtualization.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
No functionality change.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose VMCS shadowing to L1 as a VMX capability of the virtual CPU,
whether or not VMCS shadowing is supported by the physical CPU.
(VMCS shadowing emulation)
Shadowed VMREADs and VMWRITEs from L2 are handled by L0, without a
VM-exit to L1.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is done as a preparation for VMCS shadowing emulation.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>