When software changes D bit (either from 1 to 0, or 0 to 1), the
corresponding TLB entity in the hardware won't be updated immediately. We
should flush it to guarantee the consistence of D bit between TLB and
MMU page table in memory. This is especially important when clearing
the D bit, since it may cause false negatives in reporting dirtiness.
Sanity test was done on my machine with Intel processor.
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
[Check A bit too. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Adds a function kvm_vcpu_set_pending_timer instead of calling
kvm_make_request in lapic.c.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When access to descriptor in LDT/GDT wraparound outside long-mode, the address
of the descriptor should be truncated to 32-bit. Citing Intel SDM 2.1.1.1
"Global and Local Descriptor Tables in IA-32e Mode": "GDTR and LDTR registers
are expanded to 64-bits wide in both IA-32e sub-modes (64-bit mode and
compatibility mode)."
So in other cases, we need to truncate. Creating new function to return a
pointer to descriptor table to avoid too much code duplication.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
[Wrap 64-bit check with #ifdef CONFIG_X86_64, to avoid a "right shift count
>= width of type" warning and consequent undefined behavior. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When segment is loaded, the segment access bit is set unconditionally. In
fact, it should be set conditionally, based on whether the segment had the
accessed bit set before. In addition, it can improve performance.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to Intel SDM: "If the ESP register is used as a base register for
addressing a destination operand in memory, the POP instruction computes the
effective address of the operand after it increments the ESP register."
The current emulation does not behave so. The fix required to waste another
of the precious instruction flags and to check the flag in decode_modrm.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, if em_call_far fails it returns success instead of the resulting
error-code. Fix it.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The KVM emulator does not emulate JMP and CALL that target a call gate or a
task gate. This patch does not try to implement these scenario as they are
presumably rare; yet it returns X86EMUL_UNHANDLEABLE error in such cases
instead of generating an exception.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since the operand size of fnstcw and fnstsw is updated during the execution,
the emulation may cause spurious exceptions as it reads the memory beforehand.
Marking these instructions as Mov (since the previous value is ignored) and
DstMem16 to simplify the setting of operand size.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Although pop sreg updates RSP according to the operand size, only 2 bytes are
read. The current behavior may result in incorrect #GP or #PF exceptions.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Because ASSERT is just a printk, these would oops right away.
The assertion thus hardly adds anything.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The initialization function in mmu.c can always use walk_mmu, which
is known to be vcpu->arch.mmu. Only init_kvm_nested_mmu is used to
initialize vcpu->arch.nested_mmu.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add tracepoint to wait_lapic_expire.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
[Remind reader if early or late. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For the hrtimer which emulates the tscdeadline timer in the guest,
add an option to advance expiration, and busy spin on VM-entry waiting
for the actual expiration time to elapse.
This allows achieving low latencies in cyclictest (or any scenario
which requires strict timing regarding timer expiration).
Reduces average cyclictest latency from 12us to 8us
on Core i5 desktop.
Note: this option requires tuning to find the appropriate value
for a particular hardware/guest combination. One method is to measure the
average delay between apic_timer_fn and VM-entry.
Another method is to start with 1000ns, and increase the value
in say 500ns increments until avg cyclictest numbers stop decreasing.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_x86_ops->test_posted_interrupt() returns true/false depending
whether 'vector' is set.
Next patch makes use of this interface.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In most cases calling hwapic_isr_update(), we always check if
kvm_apic_vid_enabled() == 1, but actually,
kvm_apic_vid_enabled()
-> kvm_x86_ops->vm_has_apicv()
-> vmx_vm_has_apicv() or '0' in svm case
-> return enable_apicv && irqchip_in_kernel(kvm)
So its a little cost to recall vmx_vm_has_apicv() inside
hwapic_isr_update(), here just NULL out hwapic_isr_update() in
case of !enable_apicv inside hardware_setup() then make all
related stuffs follow this. Note we don't check this under that
condition of irqchip_in_kernel() since we should make sure
definitely any caller don't work without in-kernel irqchip.
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove FIXME comments about needing fault addresses to be returned. These
are propaagated from walk_addr_generic to gva_to_gpa and from there to
ops->read_std and ops->write_std.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When generating #PF VM-exit, check equality:
(PFEC & PFEC_MASK) == PFEC_MATCH
If there is equality, the 14 bit of exception bitmap is used to take decision
about generating #PF VM-exit. If there is inequality, inverted 14 bit is used.
Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch improve checks required by Intel Software Developer Manual.
- SMM MSRs are not allowed.
- microcode MSRs are not allowed.
- check x2apic MSRs only when LAPIC is in x2apic mode.
- MSR switch areas must be aligned to 16 bytes.
- address of first and last byte in MSR switch areas should not set any bits
beyond the processor's physical-address width.
Also it adds warning messages on failures during MSR switch. These messages
are useful for people who debug their VMMs in nVMX.
Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Several hypervisors need MSR auto load/restore feature.
We read MSRs from VM-entry MSR load area which specified by L1,
and load them via kvm_set_msr in the nested entry.
When nested exit occurs, we get MSRs via kvm_get_msr, writing
them to L1`s MSR store area. After this, we read MSRs from VM-exit
MSR load area, and load them via kvm_set_msr.
Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SRCU is not necessary to be compiled by default in all cases. For tinification
efforts not compiling SRCU unless necessary is desirable.
The current patch tries to make compiling SRCU optional by introducing a new
Kconfig option CONFIG_SRCU which is selected when any of the components making
use of SRCU are selected.
If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.
text data bss dec hex filename
2007 0 0 2007 7d7 kernel/rcu/srcu.o
Size of arch/powerpc/boot/zImage changes from
text data bss dec hex filename
831552 64180 23944 919676 e087c arch/powerpc/boot/zImage : before
829504 64180 23952 917636 e0084 arch/powerpc/boot/zImage : after
so the savings are about ~2000 bytes.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
Since most virtual machines raise this message once, it is a bit annoying.
Make it KERN_DEBUG severity.
Cc: stable@vger.kernel.org
Fixes: 7a2e8aaf0f
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The commit 34a1cd60d1, "x86: vmx: move some vmx setting from
vmx_init() to hardware_setup()", tried to refactor some codes
specific to vmx hardware setting into hardware_setup(), but some
msr writing should depend on our previous setting condition like
enable_apicv, enable_ept and so on.
Reported-by: Jamie Heilman <jamie@audible.transient.net>
Tested-by: Jamie Heilman <jamie@audible.transient.net>
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- spring cleaning: removed support for IA64, and for hardware-assisted
virtualization on the PPC970
- ARM, PPC, s390 all had only small fixes
For x86:
- small performance improvements (though only on weird guests)
- usual round of hardware-compliancy fixes from Nadav
- APICv fixes
- XSAVES support for hosts and guests. XSAVES hosts were broken because
the (non-KVM) XSAVES patches inadvertently changed the KVM userspace
ABI whenever XSAVES was enabled; hence, this part is going to stable.
Guest support is just a matter of exposing the feature and CPUID leaves
support.
Right now KVM is broken for PPC BookE in your tree (doesn't compile).
I'll reply to the pull request with a patch, please apply it either
before the pull request or in the merge commit, in order to preserve
bisectability somewhat.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJUkpg+AAoJEL/70l94x66DUmoH/jzXYkptSW9NGgm79KqxGJlD
lzLnLBkitVvx++Mz5YBhdJEhKKLUlCtifFT1zPJQ/pthQhIRSaaAwZyNGgUs5w5x
yMGKHiPQFyZRbmQtZhCInW0BftJoYHHciO3nUfHCZnp34My9MP2D55W7/z+fYFfQ
DuqBSE9ThyZJtZ4zh8NRA9fCOeuqwVYRyoBs820Wbsh4cpIBoIK63Dg7k+CLE+ZV
MZa/mRL6bAfsn9W5bnOUAgHJ3SPznnWbO3/g0aV+roL/5pffblprJx9lKNR08xUM
6hDFLop2gDehDJesDkY/o8Ckp1hEouvfsVpSShry4vcgtn0hgh2O5/6Orbmj6vE=
=Zwq1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM update from Paolo Bonzini:
"3.19 changes for KVM:
- spring cleaning: removed support for IA64, and for hardware-
assisted virtualization on the PPC970
- ARM, PPC, s390 all had only small fixes
For x86:
- small performance improvements (though only on weird guests)
- usual round of hardware-compliancy fixes from Nadav
- APICv fixes
- XSAVES support for hosts and guests. XSAVES hosts were broken
because the (non-KVM) XSAVES patches inadvertently changed the KVM
userspace ABI whenever XSAVES was enabled; hence, this part is
going to stable. Guest support is just a matter of exposing the
feature and CPUID leaves support"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (179 commits)
KVM: move APIC types to arch/x86/
KVM: PPC: Book3S: Enable in-kernel XICS emulation by default
KVM: PPC: Book3S HV: Improve H_CONFER implementation
KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register
KVM: PPC: Book3S HV: Remove code for PPC970 processors
KVM: PPC: Book3S HV: Tracepoints for KVM HV guest interactions
KVM: PPC: Book3S HV: Simplify locking around stolen time calculations
arch: powerpc: kvm: book3s_paired_singles.c: Remove unused function
arch: powerpc: kvm: book3s_pr.c: Remove unused function
arch: powerpc: kvm: book3s.c: Remove some unused functions
arch: powerpc: kvm: book3s_32_mmu.c: Remove unused function
KVM: PPC: Book3S HV: Check wait conditions before sleeping in kvmppc_vcore_blocked
KVM: PPC: Book3S HV: ptes are big endian
KVM: PPC: Book3S HV: Fix inaccuracies in ICP emulation for H_IPI
KVM: PPC: Book3S HV: Fix KSM memory corruption
KVM: PPC: Book3S HV: Fix an issue where guest is paused on receiving HMI
KVM: PPC: Book3S HV: Fix computation of tlbie operand
KVM: PPC: Book3S HV: Add missing HPTE unlock
KVM: PPC: BookE: Improve irq inject tracepoint
arm/arm64: KVM: Require in-kernel vgic for the arch timers
...
problems, clarifies VCPU init, and fixes a regression concerning the
VGIC init flow.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUjsVhAAoJEEtpOizt6ddy5rIH/1V/YVwhprC55YqdHelU9Qu2
Muzsx+7F71NxC7xgMGFqPD1YrPR+hxvoPhy+ADOBlvcqlolrkDnV9I+8e3geaYNc
nZ/yEnoGTtbAggiS1smx7usBv34Z88Sd5txNjmj1cmHBy+VOWlyidWMkGBTsfBRe
mVc61BDUfyC47udgRHXhwS80sbHLJHElmADisFOVmQNBYwwiHiTdx0hMBMnHcC3Y
/3T0tKxHdeTISnmA+J+n7TcChtTIM4xqC6kwf3rw3b7XX8gdtTKylDHX2GLAg646
RdebAG2twmGpIc6SxXZbo38f3oY9OFo1Le5xZGa6iUjD56VDw/e4wg4iA2juo0Y=
=J2Ut
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-3.19-take2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
Second round of changes for KVM for arm/arm64 for v3.19; fixes reboot
problems, clarifies VCPU init, and fixes a regression concerning the
VGIC init flow.
Conflicts:
arch/ia64/kvm/kvm-ia64.c [deleted in HEAD and modified in kvmarm]
commit d50eaa1803 ("KVM: x86: Perform limit checks when assigning EIP")
mistakenly used zero as cpl on em_ret_far. Use the actual one.
Fixes: d50eaa1803
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If L0 has disabled EPT, don't advertise unrestricted
mode at all since it depends on EPT to run real mode code.
Fixes: 92fbc7b195
Cc: stable@vger.kernel.org
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
to the trace_seq code. It also removed the return values to the
trace_seq_*() functions and use trace_seq_has_overflowed() to see if
the buffer filled up or not. This is similar to work being done to the
seq_file code as well in another tree.
Some of the other goodies include:
o Added some "!" (NOT) logic to the tracing filter.
o Fixed the frame pointer logic to the x86_64 mcount trampolines
o Added the logic for dynamic trampolines on !CONFIG_PREEMPT systems.
That is, the ftrace trampoline can be dynamically allocated
and be called directly by functions that only have a single hook
to them.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUhbLGAAoJEEjnJuOKh9ldRV4H/3NcLbgGB2iu96la1zdYE6pG
Q7cDJMxXK80YIIL70h9G0IItcD4t62LMb72lfBnMGRj3msgFb3AgISW57EuI0Pxk
xk24wuIPoTG2S7v9sc3SboNFwO8qbtIjxD2OBmqIUrGo2sZIiGjyj3gX7mCY3uzL
WB2bUOSFz/22OgaANinR5EELHA3pZZCf54Vz1K9ndmtK0xp0j1a7xJShD6TrMdYv
mZ3zH5ViIhW4A3mdcMceh6fy2JLQAiEKF0uPTvcMMz7NlVul0mxyL/+10P7AE/3R
Ehw4fzmm4NDshPDtBOkKH0LsppgXzuItFuQUTpact3JlqTg++bV6onSsrkt1hlY=
=Z7Cm
-----END PGP SIGNATURE-----
Merge tag 'trace-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"There was a lot of clean ups and minor fixes. One of those clean ups
was to the trace_seq code. It also removed the return values to the
trace_seq_*() functions and use trace_seq_has_overflowed() to see if
the buffer filled up or not. This is similar to work being done to
the seq_file code as well in another tree.
Some of the other goodies include:
- Added some "!" (NOT) logic to the tracing filter.
- Fixed the frame pointer logic to the x86_64 mcount trampolines
- Added the logic for dynamic trampolines on !CONFIG_PREEMPT systems.
That is, the ftrace trampoline can be dynamically allocated and be
called directly by functions that only have a single hook to them"
* tag 'trace-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (55 commits)
tracing: Truncated output is better than nothing
tracing: Add additional marks to signal very large time deltas
Documentation: describe trace_buf_size parameter more accurately
tracing: Allow NOT to filter AND and OR clauses
tracing: Add NOT to filtering logic
ftrace/fgraph/x86: Have prepare_ftrace_return() take ip as first parameter
ftrace/x86: Get rid of ftrace_caller_setup
ftrace/x86: Have save_mcount_regs macro also save stack frames if needed
ftrace/x86: Add macro MCOUNT_REG_SIZE for amount of stack used to save mcount regs
ftrace/x86: Simplify save_mcount_regs on getting RIP
ftrace/x86: Have save_mcount_regs store RIP in %rdi for first parameter
ftrace/x86: Rename MCOUNT_SAVE_FRAME and add more detailed comments
ftrace/x86: Move MCOUNT_SAVE_FRAME out of header file
ftrace/x86: Have static tracing also use ftrace_caller_setup
ftrace/x86: Have static function tracing always test for function graph
kprobes: Add IPMODIFY flag to kprobe_ftrace_ops
ftrace, kprobes: Support IPMODIFY flag to find IP modify conflict
kprobes/ftrace: Recover original IP if pre_handler doesn't change it
tracing/trivial: Fix typos and make an int into a bool
tracing: Deletion of an unnecessary check before iput()
...
Intel SDM table 6-2 ("Priority Among Simultaneous Exceptions and Interrupts")
shows that faults from decoding the next instruction got higher priority than
general protection. Moving the protected-mode check before the CPL check to
avoid wrong exception on vm86 mode.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The pushf instruction does not push eflags.VM, so emulation should not do so as
well. Although eflags.RF should not be pushed as well, it is already cleared
by the time pushf is executed.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The macro GP already sets the flag Prefix. Remove the redundant flag for
0f_38_f0 and 0f_38_f1 opcodes.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We reused host EBX and ECX, but KVM might not support all features;
emulated XSAVE size should be smaller.
EBX depends on unknown XCR0, so we default to ECX.
SDM CPUID (EAX = 0DH, ECX = 0):
EBX Bits 31-00: Maximum size (bytes, from the beginning of the
XSAVE/XRSTOR save area) required by enabled features in XCR0. May
be different than ECX if some features at the end of the XSAVE save
area are not enabled.
ECX Bit 31-00: Maximum size (bytes, from the beginning of the
XSAVE/XRSTOR save area) of the XSAVE/XRSTOR save area required by
all supported features in the processor, i.e all the valid bit
fields in XCR0.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add nested virtualization support for xsaves.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add logic to get/set the XSS model-specific register.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Initialize the XSS exit bitmap. It is zero so there should be no XSAVES
or XRSTORS exits.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- EAX=0Dh, ECX=1: output registers EBX/ECX/EDX are reserved.
- EAX=0Dh, ECX>1: output register ECX bit 0 is clear for all the CPUID
leaves we support, because variable "supported" comes from XCR0 and not
XSS. Bits above 0 are reserved, so ECX is overall zero. Output register
EDX is reserved.
Source: Intel Architecture Instruction Set Extensions Programming
Reference, ref. number 319433-022
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is the size of the XSAVES area. This starts providing guest support
for XSAVES (with no support yet for supervisor states, i.e. XSS == 0
always in guests for now).
Wanpeng Li suggested testing XSAVEC as well as XSAVES, since in practice
no real processor exists that only has one of them, and there is no
other way for userspace programs to compute the area of the XSAVEC
save area. CPUID(EAX=0xd,ECX=1).EBX provides an upper bound.
Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose the XSAVES feature to the guest if the kvm_x86_ops say it is
available.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Userspace is expecting non-compacted format for KVM_GET_XSAVE, but
struct xsave_struct might be using the compacted format. Convert
in order to preserve userspace ABI.
Likewise, userspace is passing non-compacted format for KVM_SET_XSAVE
but the kernel will pass it to XRSTORS, and we need to convert back.
Fixes: f31a9f7c71
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: stable@vger.kernel.org
Cc: H. Peter Anvin <hpa@linux.intel.com>
Tested-by: Nadav Amit <namit@cs.technion.ac.il>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While fixing an x2apic bug,
17d68b7 KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376)
we've made only one cluster available. This means that the amount of
logically addressible x2APICs was reduced to 16 and VCPUs kept
overwriting themselves in that region, so even the first cluster wasn't
set up correctly.
This patch extends x2APIC support back to the logical_map's limit, and
keeps the CVE fixed as messages for non-present APICs are dropped.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
They can't be violated now, but play it safe for the future.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x2apic allows destinations > 0xff and we don't want them delivered to
lower APICs. They are correctly handled by doing nothing.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Physical mode can't address more than one APIC, but lowest-prio is
allowed, so we just reuse our paths.
SDM 10.6.2.1 Physical Destination:
Also, for any non-broadcast IPI or I/O subsystem initiated interrupt
with lowest priority delivery mode, software must ensure that APICs
defined in the interrupt address are present and enabled to receive
interrupts.
We could warn on top of that.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
False from kvm_irq_delivery_to_apic_fast() means that we don't handle it
in the fast path, but we still return false in cases that were perfectly
handled, fix that.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
0x830 MSR is 0x300 xAPIC MMIO, which is MSR_ICR.
Signed-off-by: Radim KrÄmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x2APIC has no registers for DFR and ICR2 (see Intel SDM 10.12.1.2 "x2APIC
Register Address Space"). KVM needs to cause #GP on such accesses.
Fix it (DFR and ICR2 on read, ICR2 on write, DFR already handled on writes).
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Certain x86 instructions that use modrm operands only allow memory operand
(i.e., mod012), and cause a #UD exception otherwise. KVM ignores this fact.
Currently, the instructions that are such and are emulated by KVM are MOVBE,
MOVNTPS, MOVNTPD and MOVNTI. MOVBE is the most blunt example, since it may be
emulated by the host regardless of MMIO.
The fix introduces a new group for handling such instructions, marking mod3 as
illegal instruction.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit 85c8555ff0 ("KVM: check for !is_zero_pfn() in
kvm_is_mmio_pfn()") and renames the function to kvm_is_reserved_pfn.
The problem being addressed by the patch above was that some ARM code
based the memory mapping attributes of a pfn on the return value of
kvm_is_mmio_pfn(), whose name indeed suggests that such pfns should
be mapped as device memory.
However, kvm_is_mmio_pfn() doesn't do quite what it says on the tin,
and the existing non-ARM users were already using it in a way which
suggests that its name should probably have been 'kvm_is_reserved_pfn'
from the beginning, e.g., whether or not to call get_page/put_page on
it etc. This means that returning false for the zero page is a mistake
and the patch above should be reverted.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit 85c8555ff0 ("KVM: check for !is_zero_pfn() in
kvm_is_mmio_pfn()") and renames the function to kvm_is_reserved_pfn.
The problem being addressed by the patch above was that some ARM code
based the memory mapping attributes of a pfn on the return value of
kvm_is_mmio_pfn(), whose name indeed suggests that such pfns should
be mapped as device memory.
However, kvm_is_mmio_pfn() doesn't do quite what it says on the tin,
and the existing non-ARM users were already using it in a way which
suggests that its name should probably have been 'kvm_is_reserved_pfn'
from the beginning, e.g., whether or not to call get_page/put_page on
it etc. This means that returning false for the zero page is a mistake
and the patch above should be reverted.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
cs.base is declared as a __u64 variable and vector is a u32 so this
causes a static checker warning. The user indeed can set "sipi_vector"
to any u32 value in kvm_vcpu_ioctl_x86_set_vcpu_events(), but the
value should really have 8-bit precision only.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Create a new header, and hide the device assignment functions there.
Move struct kvm_assigned_dev_kernel to assigned-dev.c by modifying
arch/x86/kvm/iommu.c to take a PCI device struct.
Based on a patch by Radim Krcmar <rkrcmark@redhat.com>.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This feature is not supported inside KVM guests yet, because we do not emulate
MSR_IA32_XSS. Mask it out.
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that ia64 is gone, we can hide deprecated device assignment in x86.
Notable changes:
- kvm_vm_ioctl_assigned_device() was moved to x86/kvm_arch_vm_ioctl()
The easy parts were removed from generic kvm code, remaining
- kvm_iommu_(un)map_pages() would require new code to be moved
- struct kvm_assigned_dev_kernel depends on struct kvm_irq_ack_notifier
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
ia64 does not need them anymore. Ack notifiers become x86-specific
too.
Suggested-by: Gleb Natapov <gleb@kernel.org>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To allow for the restructiong of the trace_seq code, we need users
of it to use the helper functions instead of accessing the internals
of the trace_seq structure itself.
Link: http://lkml.kernel.org/r/20141104160221.585025609@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Mark Rustad <mark.d.rustad@intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove FIXME comments about needing fault addresses to be returned. These
are propaagated from walk_addr_generic to gva_to_gpa and from there to
ops->read_std and ops->write_std.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The check on the higher limit of the segment, and the check on the
maximum accessible size, is the same for both expand-up and
expand-down segments. Only the computation of "lim" varies.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
register_address has been a duplicate of address_mask ever since the
ancestor of __linearize was born in 90de84f50b (KVM: x86 emulator:
preserve an operand's segment identity, 2010-11-17).
However, we can put it to a better use by including the call to reg_read
in register_address. Similarly, the call to reg_rmw can be moved to
register_address_increment.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In __linearize there is check of the condition whether to check if masking of
the linear address is needed. It occurs immediately after switch that
evaluates the same condition. Merge them.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When SS is used using a non-canonical address, an #SS exception is generated on
real hardware. KVM emulator causes a #GP instead. Fix it to behave as real x86
CPU.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If branch (e.g., jmp, ret) causes limit violations, since the target IP >
limit, the #GP exception occurs before the branch. In other words, the RIP
pushed on the stack should be that of the branch and not that of the target.
To do so, we can call __linearize, with new EIP, which also saves us the code
which performs the canonical address checks. On the case of assigning an EIP >=
2^32 (when switching cs.l), we also safe, as __linearize will check the new EIP
does not exceed the limit and would trigger #GP(0) otherwise.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When segment is accessed, real hardware does not perform any privilege level
checks. In contrast, KVM emulator does. This causes some discrepencies from
real hardware. For instance, reading from readable code segment may fail due to
incorrect segment checks. In addition, it introduces unnecassary overhead.
To reference Intel SDM 5.5 ("Privilege Levels"): "Privilege levels are checked
when the segment selector of a segment descriptor is loaded into a segment
register." The SDM never mentions privilege level checks during memory access,
except for loading far pointers in section 5.10 ("Pointer Validation"). Those
are actually segment selector loads and are emulated in the similarily (i.e.,
regardless to __linearize checks).
This behavior was also checked using sysexit. A data-segment whose DPL=0 was
loaded, and after sysexit (CPL=3) it is still accessible.
Therefore, all the privilege level checks in __linearize are removed.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When performing segmented-read/write in the emulator for stack operations, it
ignores the stack size, and uses the ad_bytes as indication for the pointer
size. As a result, a wrong address may be accessed.
To fix this behavior, we can remove the masking of address in __linearize and
perform it beforehand. It is already done for the operands (so currently it is
inefficiently done twice). It is missing in two cases:
1. When using rip_relative
2. On fetch_bit_operand that changes the address.
This patch masks the address on these two occassions, and removes the masking
from __linearize.
Note that it does not mask EIP during fetch. In protected/legacy mode code
fetch when RIP >= 2^32 should result in #GP and not wrap-around. Since we make
limit checks within __linearize, this is the expected behavior.
Partial revert of commit 518547b32a (KVM: x86: Emulator does not
calculate address correctly, 2014-09-30).
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 10e38fc7cab6 ("KVM: x86: Emulator flag for instruction that only support
16-bit addresses in real mode") introduced NoBigReal for instructions such as
MONITOR. Apparetnly, the Intel SDM description that led to this patch is
misleading. Since no instruction is using NoBigReal, it is safe to remove it,
we fully understand what the SDM means.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MMIO_MAX_GEN is the same as MMIO_GEN_MASK. Use only one.
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Instead, just use PFERR_{FETCH, PRESENT, WRITE}_MASK
inside handle_ept_violation() for slightly better code.
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
apic_find_highest_irr assumes irr_pending is set if any vector in APIC_IRR is
set. If this assumption is broken and apicv is disabled, the injection of
interrupts may be deferred until another interrupt is delivered to the guest.
Ultimately, if no other interrupt should be injected to that vCPU, the pending
interrupt may be lost.
commit 56cc2406d6 ("KVM: nVMX: fix "acknowledge interrupt on exit" when APICv
is in use") changed the behavior of apic_clear_irr so irr_pending is cleared
after setting APIC_IRR vector. After this commit, if apic_set_irr and
apic_clear_irr run simultaneously, a race may occur, resulting in APIC_IRR
vector set, and irr_pending cleared. In the following example, assume a single
vector is set in IRR prior to calling apic_clear_irr:
apic_set_irr apic_clear_irr
------------ --------------
apic->irr_pending = true;
apic_clear_vector(...);
vec = apic_search_irr(apic);
// => vec == -1
apic_set_vector(...);
apic->irr_pending = (vec != -1);
// => apic->irr_pending == false
Nonetheless, it appears the race might even occur prior to this commit:
apic_set_irr apic_clear_irr
------------ --------------
apic->irr_pending = true;
apic->irr_pending = false;
apic_clear_vector(...);
if (apic_search_irr(apic) != -1)
apic->irr_pending = true;
// => apic->irr_pending == false
apic_set_vector(...);
Fixing this issue by:
1. Restoring the previous behavior of apic_clear_irr: clear irr_pending, call
apic_clear_vector, and then if APIC_IRR is non-zero, set irr_pending.
2. On apic_set_irr: first call apic_set_vector, then set irr_pending.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Logical destination mode can be used to send NMI IPIs even when all
APICs are software disabled, so if all APICs are software disabled we
should still look at the DFRs.
So the DFRs should all be the same, even if some or all APICs are
software disabled. However, the SDM does not say this, so tweak
the logic as follows:
- if one APIC is enabled and has LDR != 0, use that one to build the map.
This picks the right DFR in case an OS is only setting it for the
software-enabled APICs, or in case an OS is using logical addressing
on some APICs while leaving the rest in reset state (using LDR was
suggested by Radim).
- if all APICs are disabled, pick a random one to build the map.
We use the last one with LDR != 0 for simplicity.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, the APIC logical map does not consider VCPUs whose local-apic is
software-disabled. However, NMIs, INIT, etc. should still be delivered to such
VCPUs. Therefore, the APIC mode should first be determined, and then the map,
considering all VCPUs should be constructed.
To address this issue, first find the APIC mode, and only then construct the
logical map.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When running the tsc_adjust kvm-unit-test on an AMD processor with the
IA32_TSC_ADJUST feature enabled, the WARN_ON in svm_adjust_tsc_offset can be
triggered. This WARN_ON checks for a negative adjustment in case __scale_tsc
is called; however it may trigger unnecessary warnings.
This patch moves the WARN_ON to trigger only if __scale_tsc will actually be
called from svm_adjust_tsc_offset. In addition make adj in kvm_set_msr_common
s64 since this can have signed values.
Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There's nothing to switch if the host and guest values are the same.
I am unable to find evidence that this makes any difference
whatsoever.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
[I could see a difference on Nehalem. From 5 runs:
userspace exit, guest!=host 12200 11772 12130 12164 12327
userspace exit, guest=host 11983 11780 11920 11919 12040
lightweight exit, guest!=host 3214 3220 3238 3218 3337
lightweight exit, guest=host 3178 3193 3193 3187 3220
This passes the t-test with 99% confidence for userspace exit,
98.5% confidence for lightweight exit. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
At least on Sandy Bridge, letting the CPU switch IA32_EFER is much
faster than switching it manually.
I benchmarked this using the vmexit kvm-unit-test (single run, but
GOAL multiplied by 5 to do more iterations):
Test Before After Change
cpuid 2000 1932 -3.40%
vmcall 1914 1817 -5.07%
mov_from_cr8 13 13 0.00%
mov_to_cr8 19 19 0.00%
inl_from_pmtimer 19164 10619 -44.59%
inl_from_qemu 15662 10302 -34.22%
inl_from_kernel 3916 3802 -2.91%
outl_to_kernel 2230 2194 -1.61%
mov_dr 172 176 2.33%
ipi (skipped) (skipped)
ipi+halt (skipped) (skipped)
ple-round-robin 13 13 0.00%
wr_tsc_adjust_msr 1920 1845 -3.91%
rd_tsc_adjust_msr 1892 1814 -4.12%
mmio-no-eventfd:pci-mem 16394 11165 -31.90%
mmio-wildcard-eventfd:pci-mem 4607 4645 0.82%
mmio-datamatch-eventfd:pci-mem 4601 4610 0.20%
portio-no-eventfd:pci-io 11507 7942 -30.98%
portio-wildcard-eventfd:pci-io 2239 2225 -0.63%
portio-datamatch-eventfd:pci-io 2250 2234 -0.71%
I haven't explicitly computed the significance of these numbers,
but this isn't subtle.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
[The results were reproducible on all of Nehalem, Sandy Bridge and
Ivy Bridge. The slowness of manual switching is because writing
to EFER with WRMSR triggers a TLB flush, even if the only bit you're
touching is SCE (so the page table format is not affected). Doing
the write as part of vmentry/vmexit, instead, does not flush the TLB,
probably because all processors that have EPT also have VPID. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
PCIDs are only supported in 64-bit mode. No need to clear bit 63
of CR3 unless the host is 64-bit.
Reported by Fengguang Wu's autobuilder.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The new trace event records:
* the id of vcpu being updated
* the pvclock_vcpu_time_info struct being written to guest memory
This is useful for debugging pvclock bugs, such as the bug fixed by
"[PATCH] kvm: x86: Fix kvm clock versioning.".
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm updates the version number for the guest paravirt clock structure by
incrementing the version of its private copy. It does not read the guest
version, so will write version = 2 in the first update for every new VM,
including after restoring a saved state. If guest state is saved during
reading the clock, it could read and accept struct fields and guest TSC
from two different updates. This changes the code to increment the guest
version and write it back.
Signed-off-by: Owen Hofmann <osh@google.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 3b32004a66 ("KVM: x86: movnti minimum op size of 32-bit is not kept")
did not fully fix the minimum operand size of MONTI emulation. Still, MOVNTI
may be mistakenly performed using 16-bit opsize.
This patch add No16 flag to mark an instruction does not support 16-bits
operand size.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the guest writes to the TSC, the masterclock TSC copy must be
updated as well along with the TSC_OFFSET update, otherwise a negative
tsc_timestamp is calculated at kvm_guest_time_update.
Once "if (!vcpus_matched && ka->use_master_clock)" is simplified to
"if (ka->use_master_clock)", the corresponding "if (!ka->use_master_clock)"
becomes redundant, so remove the do_request boolean and collapse
everything into a single condition.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that KVM injects #UD on "unhandlable" error, it makes better sense to
return such error on sysenter instead of directly injecting #UD to the guest.
This allows to track more easily the unhandlable cases the emulator does not
support.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
APIC base relocation is unsupported by KVM. If anyone uses it, the least should
be to report a warning in the hypervisor.
Note that KVM-unit-tests uses this feature for some reason, so running the
tests triggers the warning.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 7fe864dc94 (KVM: x86: Mark VEX-prefix instructions emulation as
unimplemented, 2014-06-02) marked VEX instructions as such in protected
mode. VEX-prefix instructions are not supported relevant on real-mode
and VM86, but should cause #UD instead of being decoded as LES/LDS.
Fix this behaviour to be consistent with real hardware.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
[Check for mod == 3, rather than 2 or 3. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Task-switch emulation checks the privilege level prior to performing the
task-switch. This check is incorrect in the case of task-gates, in which the
tss.dpl is ignored, and can cause superfluous exceptions. Moreover this check
is unnecassary, since the CPU checks the privilege levels prior to exiting.
Intel SDM 25.4.2 says "If CALL or JMP accesses a TSS descriptor directly
outside IA-32e mode, privilege levels are checked on the TSS descriptor" prior
to exiting. AMD 15.14.1 says "The intercept is checked before the task switch
takes place but after the incoming TSS and task gate (if one was involved) have
been checked for correctness."
This patch removes the CPL checks for CALL and JMP.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When emulating LTR/LDTR/LGDT/LIDT, #GP should be injected if the base is
non-canonical. Otherwise, VM-entry will fail.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
LGDT and LIDT emulation logic is almost identical. Merge the logic into a
single point to avoid redundancy. This will be used by the next patch that
will ensure the bases of the loaded GDTR and IDTR are canonical.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the emulation ends in fault, eflags should not be updated. However, several
instruction emulations (actually all the fastops) currently update eflags, if
the fault was detected afterwards (e.g., #PF during writeback).
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Although Intel SDM mentions bit 63 is reserved, MOV to CR3 can have bit 63 set.
As Intel SDM states in section 4.10.4 "Invalidation of TLBs and
Paging-Structure Caches": " MOV to CR3. ... If CR4.PCIDE = 1 and bit 63 of the
instruction’s source operand is 0 ..."
In other words, bit 63 is not reserved. KVM emulator currently consider bit 63
as reserved. Fix it.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to Intel SDM push of segment selectors is done in the following
manner: "if the operand size is 32-bits, either a zero-extended value is pushed
on the stack or the segment selector is written on the stack using a 16-bit
move. For the last case, all recent Core and Atom processors perform a 16-bit
move, leaving the upper portion of the stack location unmodified."
This patch modifies the behavior to match the core behavior.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
CMPS and SCAS instructions are evaluated in the wrong order. For reference (of
CMPS), see http://www.fermimn.gov.it/linux/quarta/x86/cmps.htm : "Note that the
direction of subtraction for CMPS is [SI] - [DI] or [ESI] - [EDI]. The left
operand (SI or ESI) is the source and the right operand (DI or EDI) is the
destination. This is the reverse of the usual Intel convention in which the
left operand is the destination and the right operand is the source."
Introducing em_cmp_r for this matter that performs comparison in reverse order
using fastop infrastructure to avoid a wrapper function.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SYSCALL emulation currently clears in 64-bit mode eflags according to
MSR_SYSCALL_MASK. However, on bare-metal eflags[1] which is fixed to one
cannot be cleared, even if MSR_SYSCALL_MASK masks the bit. This wrong behavior
may result in failed VM-entry, as VT disallows entry with eflags[1] cleared.
This patch sets the bit after masking eflags on syscall.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In x86, you can only MOV-sreg to memory with either 16-bits or 64-bits size.
In contrast, KVM may write to 32-bits memory on MOV-sreg. This patch fixes KVM
behavior, and sets the destination operand size to two, if the destination is
memory.
When destination is registers, and the operand size is 32-bits, the high
16-bits in modern CPUs is filled with zero. This is handled correctly.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x86 debug registers hold a linear address. Therefore, breakpoints detection
should consider CS.base, and check whether instruction linear address equals
(CS.base + RIP). This patch introduces a function to evaluate RIP linear
address and uses it for breakpoints detection.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
DR6[0:3] (previous breakpoint indications) are cleared when #DB is injected
during handle_exception, just as real hardware does. Similarily, handle_dr
should clear DR6[0:3].
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Real-mode exceptions do not deliver error code. As can be seen in Intel SDM
volume 2, real-mode exceptions do not have parentheses, which indicate
error-code. To avoid significant changes of the code, the error code is
"removed" during exception queueing.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In one occassion, decode_modrm uses the rm field after it is extended with
REX.B to determine the addressing mode. Doing so causes it not to read the
offset for rip-relative addressing with REX.B=1.
This patch moves the fetch where we already mask REX.B away instead.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A bug was reported as follows: when running Windows 7 32-bit guests on qemu-kvm,
sometimes the guests run into blue screen during reboot. The problem was that a
guest's RVI was not cleared when it rebooted. This patch has fixed the problem.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
Tested-by: Rongrong Liu <rongrongx.liu@intel.com>, Da Chun <ngugc@qq.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return a negative error code instead, and WARN() when we should be covering
the entire 2-bit space of vmcs_field_type's return value. For increased
robustness, add a BUILD_BUG_ON checking the range of vmcs_field_to_offset.
Suggested-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Instead of vmx_init(), actually it would make reasonable sense to do
anything specific to vmx hardware setting in vmx_x86_ops->hardware_setup().
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Just move this pair of functions down to make sure later we can
add something dependent on others.
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The emulator could reuse an op->type from a previous instruction for some
immediate values. If it mistakenly considers the operands as memory
operands, it will performs a memory read and overwrite op->val.
Consider for instance the ROR instruction - src2 (the number of times)
would be read from memory instead of being used as immediate.
Mark every immediate operand as such to avoid this problem.
Cc: stable@vger.kernel.org
Fixes: c44b4c6ab8
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We mirror a subset of these registers in separate variables.
Using them directly should be faster.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
APIC-write VM exits are "trap-like": they save CS:RIP values for the
instruction after the write, and more importantly, the handler will
already see the new value in the virtual-APIC page. This means that
apic_reg_write cannot use kvm_apic_get_reg to omit timer cancelation
when mode changes.
timer_mode_mask shouldn't be changing as it depends on cpuid.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
APIC-write VM exits are "trap-like": they save CS:RIP values for the
instruction after the write, and more importantly, the handler will
already see the new value in the virtual-APIC page.
This caused a bug if you used KVM_SET_IRQCHIP to set the SW-enabled bit
in the SPIV register. The chain of events is as follows:
* When the irqchip is added to the destination VM, the apic_sw_disabled
static key is incremented (1)
* When the KVM_SET_IRQCHIP ioctl is invoked, it is decremented (0)
* When the guest disables the bit in the SPIV register, e.g. as part of
shutdown, apic_set_spiv does not notice the change and the static key is
_not_ incremented.
* When the guest is destroyed, the static key is decremented (-1),
resulting in this trace:
WARNING: at kernel/jump_label.c:81 __static_key_slow_dec+0xa6/0xb0()
jump label: negative count!
[<ffffffff816bf898>] dump_stack+0x19/0x1b
[<ffffffff8107c6f1>] warn_slowpath_common+0x61/0x80
[<ffffffff8107c76c>] warn_slowpath_fmt+0x5c/0x80
[<ffffffff811931e6>] __static_key_slow_dec+0xa6/0xb0
[<ffffffff81193226>] static_key_slow_dec_deferred+0x16/0x20
[<ffffffffa0637698>] kvm_free_lapic+0x88/0xa0 [kvm]
[<ffffffffa061c63e>] kvm_arch_vcpu_uninit+0x2e/0xe0 [kvm]
[<ffffffffa05ff301>] kvm_vcpu_uninit+0x21/0x40 [kvm]
[<ffffffffa067cec7>] vmx_free_vcpu+0x47/0x70 [kvm_intel]
[<ffffffffa061bc50>] kvm_arch_vcpu_free+0x50/0x60 [kvm]
[<ffffffffa061ca22>] kvm_arch_destroy_vm+0x102/0x260 [kvm]
[<ffffffff810b68fd>] ? synchronize_srcu+0x1d/0x20
[<ffffffffa06030d1>] kvm_put_kvm+0xe1/0x1c0 [kvm]
[<ffffffffa06036f8>] kvm_vcpu_release+0x18/0x20 [kvm]
[<ffffffff81215c62>] __fput+0x102/0x310
[<ffffffff81215f4e>] ____fput+0xe/0x10
[<ffffffff810ab664>] task_work_run+0xb4/0xe0
[<ffffffff81083944>] do_exit+0x304/0xc60
[<ffffffff816c8dfc>] ? _raw_spin_unlock_irq+0x2c/0x50
[<ffffffff810fd22d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[<ffffffff8108432c>] do_group_exit+0x4c/0xc0
[<ffffffff810843b4>] SyS_exit_group+0x14/0x20
[<ffffffff816d33a9>] system_call_fastpath+0x16/0x1b
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The check in kvm_set_lapic_tscdeadline_msr() was trying to prevent a
situation where we lose a pending deadline timer in a MSR write.
Losing it is fine, because it effectively occurs before the timer fired,
so we should be able to cancel or postpone it.
Another problem comes from interaction with QEMU, or other userspace
that can set deadline MSR without a good reason, when timer is already
pending: one guest's deadline request results in more than one
interrupt because one is injected immediately on MSR write from
userspace and one through hrtimer later.
The solution is to remove the injection when replacing a pending timer
and to improve the usual QEMU path, we inject without a hrtimer when the
deadline has already passed.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reported-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make the code reusable.
If the timer was already pending, we shouldn't be waiting in a queue,
so wake_up can be skipped, simplifying the path.
There is no 'reinject' case => the comment is removed.
Current race behaves correctly.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If DR4/5 is accessed when it is unavailable (since CR4.DE is set), then #UD
should be generated even if CPL>0. This is according to Intel SDM Table 6-2:
"Priority Among Simultaneous Exceptions and Interrupts".
Note, that this may happen on the first DR access, even if the host does not
sets debug breakpoints. Obviously, it occurs when the host debugs the guest.
This patch moves the DR4/5 checks from __kvm_set_dr/_kvm_get_dr to handle_dr.
The emulator already checks DR4/5 availability in check_dr_read. Nested
virutalization related calls to kvm_set_dr/kvm_get_dr would not like to inject
exceptions to the guest.
As for SVM, the patch follows the previous logic as much as possible. Anyhow,
it appears the DR interception code might be buggy - even if the DR access
may cause an exception, the instruction is skipped.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When read access is performed using a readable code segment, the "conforming"
and "non-conforming" checks should not be done. As a result, read using
non-conforming readable code segment fails.
This is according to Intel SDM 5.6.1 ("Accessing Data in Code Segments").
The fix is not to perform the "non-conforming" checks if the access is not a
fetch; the relevant checks are already done when loading the segment.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
DR7.LE should be cleared during task-switch. This feature is poorly documented.
For reference, see:
http://pdos.csail.mit.edu/6.828/2005/readings/i386/s12_02.htm
SDM [17.2.4]:
This feature is not supported in the P6 family processors, later IA-32
processors, and Intel 64 processors.
AMD [2:13.1.1.4]:
This bit is ignored by implementations of the AMD64 architecture.
Intel's formulation could mean that it isn't even zeroed, but current
hardware indeed does not behave like that.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In long-mode, when the address size is 4 bytes, the linear address is not
truncated as the emulator mistakenly does. Instead, the offset within the
segment (the ea field) should be truncated according to the address size.
As Intel SDM says: "In 64-bit mode, the effective address components are added
and the effective address is truncated ... before adding the full 64-bit
segment base."
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel SDM 17.2.4 (Debug Control Register (DR7)) says: "The processor clears the
GD flag upon entering to the debug exception handler." This sentence may be
misunderstood as if it happens only on #DB due to debug-register protection,
but it happens regardless to the cause of the #DB.
Fix the behavior to match both real hardware and Bochs.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM does not deliver x2APIC broadcast messages with physical mode. Intel SDM
(10.12.9 ICR Operation in x2APIC Mode) states: "A destination ID value of
FFFF_FFFFH is used for broadcast of interrupts in both logical destination and
physical destination modes."
In addition, the local-apic enables cluster mode broadcast. As Intel SDM
10.6.2.2 says: "Broadcast to all local APICs is achieved by setting all
destination bits to one." This patch enables cluster mode broadcast.
The fix tries to combine broadcast in different modes through a unified code.
One rare case occurs when the source of IPI has its APIC disabled. In such
case, the source can still issue IPIs, but since the source is not obliged to
have the same LAPIC mode as the enabled ones, we cannot rely on it.
Since it is a rare case, it is unoptimized and done on the slow-path.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com>
[As per Radim's review, use unsigned int for X2APIC_BROADCAST, return bool from
kvm_apic_broadcast. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
CR4.TSD is guest-owned; don't trap writes to it in VMX guests. This
avoids a VM exit on context switches into or out of a PR_TSC_SIGSEGV
task.
I think that this fixes an unintentional side-effect of:
4c38609ac5 KVM: VMX: Make guest cr4 mask more conservative
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the operand size is not 64-bit, then the sysexit instruction should assign
ECX to RSP and EDX to RIP. The current code assigns the full 64-bits.
Fix it by masking.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In 64-bit, stack operations default to 64-bits, but can be overriden (to
16-bit) using opsize override prefix. In contrast, near-branches are always
64-bit. This patch distinguish between the different behaviors.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Breaking grp45 to the relevant functions to speed up the emulation and simplify
the code. In addition, it is necassary the next patch will distinguish between
far and near branches according to the flags.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace the current canonical address check with the new function which is
identical.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In order to access the shadow VMCS, we need to load it. At this point,
vmx->loaded_vmcs->vmcs and the actually loaded one start to differ. If
we now get preempted by Linux, vmx_vcpu_put and, on return, the
vmx_vcpu_load will work against the wrong vmcs. That can cause
copy_shadow_to_vmcs12 to corrupt the vmcs12 state.
Fix the issue by disabling preemption during the copy operation.
copy_vmcs12_to_shadow is safe from this issue as it is executed by
vmx_vcpu_run when preemption is already disabled before vmentry.
This bug is exposed by running Jailhouse within KVM on CPUs with
shadow VMCS support. Jailhouse never expects an interrupt pending
vmexit, but the bug can cause it if, after copy_shadow_to_vmcs12
is preempted, the active VMCS happens to have the virtual interrupt
pending flag set in the CPU-based execution controls.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit d1442d85cc ("KVM: x86: Handle errors when RIP is set during far
jumps") introduced a bug that caused the fix to be incomplete. Due to
incorrect evaluation, far jump to segment with L bit cleared (i.e., 32-bit
segment) and RIP with any of the high bits set (i.e, RIP[63:32] != 0) set may
not trigger #GP. As we know, this imposes a security problem.
In addition, the condition for two warnings was incorrect.
Fixes: d1442d85cc
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
[Add #ifdef CONFIG_X86_64 to avoid complaints of undefined behavior. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In order to access the shadow VMCS, we need to load it. At this point,
vmx->loaded_vmcs->vmcs and the actually loaded one start to differ. If
we now get preempted by Linux, vmx_vcpu_put and, on return, the
vmx_vcpu_load will work against the wrong vmcs. That can cause
copy_shadow_to_vmcs12 to corrupt the vmcs12 state.
Fix the issue by disabling preemption during the copy operation.
copy_vmcs12_to_shadow is safe from this issue as it is executed by
vmx_vcpu_run when preemption is already disabled before vmentry.
This bug is exposed by running Jailhouse within KVM on CPUs with
shadow VMCS support. Jailhouse never expects an interrupt pending
vmexit, but the bug can cause it if, after copy_shadow_to_vmcs12
is preempted, the active VMCS happens to have the virtual interrupt
pending flag set in the CPU-based execution controls.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit d1442d85cc ("KVM: x86: Handle errors when RIP is set during far
jumps") introduced a bug that caused the fix to be incomplete. Due to
incorrect evaluation, far jump to segment with L bit cleared (i.e., 32-bit
segment) and RIP with any of the high bits set (i.e, RIP[63:32] != 0) set may
not trigger #GP. As we know, this imposes a security problem.
In addition, the condition for two warnings was incorrect.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
[Add #ifdef CONFIG_X86_64 to avoid complaints of undefined behavior. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Emulation of code that is 14 bytes to the segment limit or closer
(e.g. RIP = 0xFFFFFFF2 after reset) is broken because we try to read as
many as 15 bytes from the beginning of the instruction, and __linearize
fails when the passed (address, size) pair reaches out of the segment.
To fix this, let __linearize return the maximum accessible size (clamped
to 2^32-1) for usage in __do_insn_fetch_bytes, and avoid the limit check
by passing zero for the desired size.
For expand-down segments, __linearize is performing a redundant check.
(u32)(addr.ea + size - 1) <= lim can only happen if addr.ea is close
to 4GB; in this case, addr.ea + size - 1 will also fail the check against
the upper bound of the segment (which is provided by the D/B bit).
After eliminating the redundant check, it is simple to compute
the *max_size for expand-down segments too.
Now that the limit check is done in __do_insn_fetch_bytes, we want
to inject a general protection fault there if size < op_size (like
__linearize would have done), instead of just aborting.
This fixes booting Tiano Core from emulated flash with EPT disabled.
Cc: stable@vger.kernel.org
Fixes: 719d5a9b24
Reported-by: Borislav Petkov <bp@suse.de>
Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The error code for #GP and #SS is zero when the segment is used to
access an operand or an instruction. It is only non-zero when
a segment register is being loaded; for limit checks this means
cases such as:
* for #GP, when RIP is beyond the limit on a far call (before the first
instruction is executed). We do not implement this check, but it
would be in em_jmp_far/em_call_far.
* for #SS, if the new stack overflows during an inter-privilege-level
call to a non-conforming code segment. We do not implement stack
switching at all.
So use an error code of zero.
Reviewed-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Even after the recent fix, the assertion on paging_tmpl.h is triggered.
Apparently, the assertion wants to check that the PAE is always set on
long-mode, but does it in incorrect way. Note that the assertion is not
enabled unless the code is debugged by defining MMU_DEBUG.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The decode phase of the x86 emulator assumes that every instruction with the
ModRM flag, and which can be used with RIP-relative addressing, has either
SrcMem or DstMem. This is not the case for several instructions - prefetch,
hint-nop and clflush.
Adding SrcMem|NoAccess for prefetch and hint-nop and SrcMem for clflush.
This fixes CVE-2014-8480.
Fixes: 41061cdb98
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, all group15 instructions are decoded as clflush (e.g., mfence,
xsave). In addition, the clflush instruction requires no prefix (66/f2/f3)
would exist. If prefix exists it may encode a different instruction (e.g.,
clflushopt).
Creating a group for clflush, and different group for each prefix.
This has been the case forever, but the next patch needs the cflush group
in order to fix a bug introduced in 3.17.
Fixes: 41061cdb98
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A failure to decode the instruction can cause a NULL pointer access.
This is fixed simply by moving the "done" label as close as possible
to the return.
This fixes CVE-2014-8481.
Reported-by: Andy Lutomirski <luto@amacapital.net>
Cc: stable@vger.kernel.org
Fixes: 41061cdb98
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Once an instruction crosses a page boundary, the size read from the second page
disregards the common case that part of the operand resides on the first page.
As a result, fetch of long insturctions may fail, and thereby cause the
decoding to fail as well.
Cc: stable@vger.kernel.org
Fixes: 5cfc7e0f5e
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_EXIT_UNKNOWN is a kvm bug, we don't really know whether it was
triggered by a priveledged application. Let's not kill the guest: WARN
and inject #UD instead.
Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On systems with invvpid instruction support (corresponding bit in
IA32_VMX_EPT_VPID_CAP MSR is set) guest invocation of invvpid
causes vm exit, which is currently not handled and results in
propagation of unknown exit to userspace.
Fix this by installing an invvpid vm exit handler.
This is CVE-2014-3646.
Cc: stable@vger.kernel.org
Signed-off-by: Petr Matousek <pmatouse@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Far jmp/call/ret may fault while loading a new RIP. Currently KVM does not
handle this case, and may result in failed vm-entry once the assignment is
done. The tricky part of doing so is that loading the new CS affects the
VMCS/VMCB state, so if we fail during loading the new RIP, we are left in
unconsistent state. Therefore, this patch saves on 64-bit the old CS
descriptor and restores it if loading RIP failed.
This fixes CVE-2014-3647.
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Before changing rip (during jmp, call, ret, etc.) the target should be asserted
to be canonical one, as real CPUs do. During sysret, both target rsp and rip
should be canonical. If any of these values is noncanonical, a #GP exception
should occur. The exception to this rule are syscall and sysenter instructions
in which the assigned rip is checked during the assignment to the relevant
MSRs.
This patch fixes the emulator to behave as real CPUs do for near branches.
Far branches are handled by the next patch.
This fixes CVE-2014-3647.
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Relative jumps and calls do the masking according to the operand size, and not
according to the address size as the KVM emulator does today.
This patch fixes KVM behavior.
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There's a race condition in the PIT emulation code in KVM. In
__kvm_migrate_pit_timer the pit_timer object is accessed without
synchronization. If the race condition occurs at the wrong time this
can crash the host kernel.
This fixes CVE-2014-3611.
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The previous patch blocked invalid writes directly when the MSR
is written. As a precaution, prevent future similar mistakes by
gracefulling handle GPs caused by writes to shared MSRs.
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Honig <ahonig@google.com>
[Remove parts obsoleted by Nadav's patch. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is
written to certain MSRs. The behavior is "almost" identical for AMD and Intel
(ignoring MSRs that are not implemented in either architecture since they would
anyhow #GP). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if
non-canonical address is written on Intel but not on AMD (which ignores the top
32-bits).
Accordingly, this patch injects a #GP on the MSRs which behave identically on
Intel and AMD. To eliminate the differences between the architecutres, the
value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is turned to
canonical value before writing instead of injecting a #GP.
Some references from Intel and AMD manuals:
According to Intel SDM description of WRMSR instruction #GP is expected on
WRMSR "If the source register contains a non-canonical address and ECX
specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP."
According to AMD manual instruction manual:
LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the
LSTAR and CSTAR registers. If an RIP written by WRMSR is not in canonical
form, a general-protection exception (#GP) occurs."
IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the
base field must be in canonical form or a #GP fault will occur."
IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must
be in canonical form."
This patch fixes CVE-2014-3610.
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
CR4 isn't constant; at least the TSD and PCE bits can vary.
TBH, treating CR0 and CR3 as constant scares me a bit, too, but it looks
like it's correct.
This adds a branch and a read from cr4 to each vm entry. Because it is
extremely likely that consecutive entries into the same vcpu will have
the same host cr4 value, this fixes up the vmcs instead of restoring cr4
after the fact. A subsequent patch will add a kernel-wide cr4 shadow,
reducing the overhead in the common case to just two memory reads and a
branch.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull percpu consistent-ops changes from Tejun Heo:
"Way back, before the current percpu allocator was implemented, static
and dynamic percpu memory areas were allocated and handled separately
and had their own accessors. The distinction has been gone for many
years now; however, the now duplicate two sets of accessors remained
with the pointer based ones - this_cpu_*() - evolving various other
operations over time. During the process, we also accumulated other
inconsistent operations.
This pull request contains Christoph's patches to clean up the
duplicate accessor situation. __get_cpu_var() uses are replaced with
with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().
Unfortunately, the former sometimes is tricky thanks to C being a bit
messy with the distinction between lvalues and pointers, which led to
a rather ugly solution for cpumask_var_t involving the introduction of
this_cpu_cpumask_var_ptr().
This converts most of the uses but not all. Christoph will follow up
with the remaining conversions in this merge window and hopefully
remove the obsolete accessors"
* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
irqchip: Properly fetch the per cpu offset
percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
Revert "powerpc: Replace __get_cpu_var uses"
percpu: Remove __this_cpu_ptr
clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
sparc: Replace __get_cpu_var uses
avr32: Replace __get_cpu_var with __this_cpu_write
blackfin: Replace __get_cpu_var uses
tile: Use this_cpu_ptr() for hardware counters
tile: Replace __get_cpu_var uses
powerpc: Replace __get_cpu_var uses
alpha: Replace __get_cpu_var
ia64: Replace __get_cpu_var uses
s390: cio driver &__get_cpu_var replacements
s390: Replace __get_cpu_var uses
mips: Replace __get_cpu_var uses
MIPS: Replace __get_cpu_var uses in FPU emulator.
arm: Replace __this_cpu_ptr with raw_cpu_ptr
...
Pull percpu updates from Tejun Heo:
"A lot of activities on percpu front. Notable changes are...
- percpu allocator now can take @gfp. If @gfp doesn't contain
GFP_KERNEL, it tries to allocate from what's already available to
the allocator and a work item tries to keep the reserve around
certain level so that these atomic allocations usually succeed.
This will replace the ad-hoc percpu memory pool used by
blk-throttle and also be used by the planned blkcg support for
writeback IOs.
Please note that I noticed a bug in how @gfp is interpreted while
preparing this pull request and applied the fix 6ae833c7fe
("percpu: fix how @gfp is interpreted by the percpu allocator")
just now.
- percpu_ref now uses longs for percpu and global counters instead of
ints. It leads to more sparse packing of the percpu counters on
64bit machines but the overhead should be negligible and this
allows using percpu_ref for refcnting pages and in-memory objects
directly.
- The switching between percpu and single counter modes of a
percpu_ref is made independent of putting the base ref and a
percpu_ref can now optionally be initialized in single or killed
mode. This allows avoiding percpu shutdown latency for cases where
the refcounted objects may be synchronously created and destroyed
in rapid succession with only a fraction of them reaching fully
operational status (SCSI probing does this when combined with
blk-mq support). It's also planned to be used to implement forced
single mode to detect underflow more timely for debugging.
There's a separate branch percpu/for-3.18-consistent-ops which cleans
up the duplicate percpu accessors. That branch causes a number of
conflicts with s390 and other trees. I'll send a separate pull
request w/ resolutions once other branches are merged"
* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
percpu: fix how @gfp is interpreted by the percpu allocator
blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
percpu_ref: add PERCPU_REF_INIT_* flags
percpu_ref: decouple switching to percpu mode and reinit
percpu_ref: decouple switching to atomic mode and killing
percpu_ref: add PCPU_REF_DEAD
percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
percpu_ref: replace pcpu_ prefix with percpu_
percpu_ref: minor code and comment updates
percpu_ref: relocate percpu_ref_reinit()
Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
Revert "percpu: free percpu allocation info for uniprocessor system"
percpu-refcount: make percpu_ref based on longs instead of ints
percpu-refcount: improve WARN messages
percpu: fix locking regression in the failure path of pcpu_alloc()
percpu-refcount: add @gfp to percpu_ref_init()
proportions: add @gfp to init functions
percpu_counter: add @gfp to percpu_counter_init()
percpu_counter: make percpu_counters_lock irq-safe
...
This is to receive 0a30288da1 ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The
commit reverted and patches to implement proper fix will be added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
In order to make the APIC access page migratable, stop pinning it in
memory.
And because the APIC access page is not pinned in memory, we can
remove kvm_arch->apic_access_page. When we need to write its
physical address into vmcs, we use gfn_to_page() to get its page
struct, which is needed to call page_to_phys(); the page is then
immediately unpinned.
Suggested-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, the APIC access page is pinned by KVM for the entire life
of the guest. We want to make it migratable in order to make memory
hot-unplug available for machines that run KVM.
This patch prepares to handle this for the case where there is no nested
virtualization, or where the nested guest does not have an APIC page of
its own. All accesses to kvm->arch.apic_access_page are changed to go
through kvm_vcpu_reload_apic_access_page.
If the APIC access page is invalidated when the host is running, we update
the VMCS in the next guest entry.
If it is invalidated when the guest is running, the MMU notifier will force
an exit, after which we will handle everything as in the previous case.
If it is invalidated when a nested guest is running, the request will update
either the VMCS01 or the VMCS02. Updating the VMCS01 is done at the
next L2->L1 exit, while updating the VMCS02 is done in prepare_vmcs02.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, the APIC access page is pinned by KVM for the entire life
of the guest. We want to make it migratable in order to make memory
hot-unplug available for machines that run KVM.
This patch prepares to handle this in generic code, through a new
request bit (that will be set by the MMU notifier) and a new hook
that is called whenever the request bit is processed.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>