and fix the broken case if a core's frequency depends on others.
trace_power_frequency was only implemented in a rather ungeneric way
in acpi-cpufreq driver's target() function only.
-> Move the call to trace_power_frequency to
cpufreq.c:cpufreq_notify_transition() where CPUFREQ_POSTCHANGE
notifier is triggered.
This will support power frequency tracing by all cpufreq drivers
trace_power_frequency did not trace frequency changes correctly when
the userspace governor was used or when CPU cores' frequency depend
on each other.
-> Moving this into the CPUFREQ_POSTCHANGE notifier and pass the cpu
which gets switched automatically fixes this.
Robert Schoene provided some important fixes on top of my initial
quick shot version which are integrated in this patch:
- Forgot some changes in power_end trace (TP_printk/variable names)
- Variable dummy in power_end must now be cpu_id
- Use static 64 bit variable instead of unsigned int for cpu_id
Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: davej@redhat.com
CC: arjan@infradead.org
CC: linux-kernel@vger.kernel.org
CC: robert.schoene@tu-dresden.de
Tested-by: robert.schoene@tu-dresden.de
Signed-off-by: Dave Jones <davej@redhat.com>
On Wed, 2010-01-20 at 16:56 +0100, Thomas Renninger wrote:
> But most often this happens if people upgrade their CPU and do not
> update their BIOS.
> Or the vendor does not recognise the new CPU even if the BIOS got
> updated.
Maybe some of those people just didn't realize it was disabled in BIOS?
If you tell users that it's a firmware bug then they'll probably just
give up.
> The itself message might be an enhancment, IMO it's not worth a patch.
Why do you think so? I spent an hour on hunting down the BIOS upgrade,
only to find that it didn't improve anything. It was a day later that I
realized that it might be a BIOS option; and the option was literally
the _last_ option in the whole BIOS setup. :)
This message would have saved the day.
> But do not revert the FW_BUG part!
Sure, you have a point here.
How about this patch?
The Pstate transition latency check was added for broken F10h BIOSen
which wrongly contain a value of 0 for transition and bus master
latency. Fam11h and later, however, (will) have similar transition
latency so extend that behavior for them too.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Dave Jones <davej@redhat.com>
The PCC cpufreq driver unmaps the mailbox address range if any CPUs fail to
initialise, but doesn't do anything to remove the registered CPUs from the
cpufreq core resulting in failures further down the line. We're better off
simply returning a failure - the cpufreq core will unregister us cleanly if
we end up with no successfully registered CPUs. Tidy up the failure path
and also add a sanity check to ensure that the firmware gives us a realistic
frequency - the core deals badly with that being set to 0.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Signed-off-by: Dave Jones <davej@redhat.com>
The pcc specification documents an _OSC method that's incompatible with the
one defined as part of the ACPI spec. This shouldn't be a problem as both
are supposed to be guarded with a UUID. Unfortunately approximately nobody
(including HP, who wrote this spec) properly check the UUID on entry to the
_OSC call. Right now this could result in surprising behaviour if the pcc
driver performs an _OSC call on a machine that doesn't implement the pcc
specification. Check whether the PCCH method exists first in order to reduce
this probability.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Signed-off-by: Dave Jones <davej@redhat.com>
The firmware of production devices does not support this interface so this
is dead code.
Signed-off-by: Sreedhara DS <sreedhara.ds@intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Matthew Garrett <mjg@redhat.com>
It is a subset of <ctype.h> functionality, so name it ctype.h. Also,
reorganize header files so #include statements are clustered near the
top as they should be.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <4C5752F2.8030206@kernel.org>
This enables the decompressor output to be seen on the serial console.
Most of the code is shared with the regular boot code.
We could add printf to the decompressor if needed, but currently there
is no sufficiently compelling user.
-v2: define BOOT_BOOT_H to avoid include boot.h
-v3: early_serial_base need to be static in misc.c ?
-v4: create seperate string.c printf.c cmdline.c early_serial_console.c
after hpa's patch that allow global variables in compressed/misc stage
-v5: remove printf.c related
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
When running on VMware's platform, we have seen situations where
the AP's try to calibrate the lpj values and fail to get good calibration
runs becasue of timing issues. As a result delays don't work correctly
on all cpus.
The solutions is to set preset_lpj value based on the current tsc frequency
value. This is similar to what KVM does as well.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
LKML-Reference: <1280790637.14933.29.camel@ank32.eng.vmware.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Separate early_serial_console from tty.c
This allows for reuse of
early_serial_console.c/string.c/printf.c/cmdline.c in boot/compressed/.
-v2: according to hpa, don't include string.c etc
-v3: compressed/misc.c must have early_serial_base as static, so move it back to tty.c
for setup code
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4C568D2B.205@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
In order for global variables and functions to work in the
decompressor, we need to fix up the GOT in assembly code.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
LKML-Reference: <4C57382E.8050501@zytor.com>
We mapped vdso pages but never unmapped them and the virtual address
is lost after exiting from the function, so unmap vdso pages here.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
LKML-Reference: <20100802004934.GA2505@sli10-desk.sh.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Accomodate the original C1E-aware idle routine to the different times
during boot when the BIOS enables C1E. While at it, remove the synthetic
CPUID flag in favor of a single global setting which denotes C1E status
on the system.
[ hpa: changed c1e_enabled to be a bool; clarified cpu bit 3:21 comment ]
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
LKML-Reference: <20100727165335.GA11630@aftab>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
vmx does not restore GDT.LIMIT to the host value, instead it sets it to 64KB.
This means host userspace can learn a few bits of host memory.
Fix by reloading GDTR when we load other host state.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Sometimes, atomically set spte is not needed, this patch call __xchg_spte()
more smartly
Note: if the old mapping's access bit is already set, we no need atomic operation
since the access bit is not lost
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce set_spte_track_bits() to cleanup current code
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If the old mapping is not present, the spte.a is not lost, so no need
atomic operation to set it
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In sync-page path, if spte.writable is changed, it will lose page dirty
tracking, for example:
assume spte.writable = 0 in a unsync-page, when it's synced, it map spte
to writable(that is spte.writable = 1), later guest write spte.gfn, it means
spte.gfn is dirty, then guest changed this mapping to read-only, after it's
synced, spte.writable = 0
So, when host release the spte, it detect spte.writable = 0 and not mark page
dirty
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In current code, if ept is enabled(shadow_accessed_mask = 0), the page
accessed tracking is lost.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In the speculative path, we should check guest pte's reserved bits just as
the real processor does
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The index wasn't calculated correctly (off by one) for huge spte so KVM guest
was unstable with transparent hugepages.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If the destination is a memory operand and the memory cannot
map to a valid page, the xchg instruction emulation and locked
instruction will not work on io regions and stuck in endless
loop. We should emulate exchange as write to fix it.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If pit delivers interrupt while pic is masking it OS will never do EOI
and ack notifier will not be called so when pit will be unmasked no pit
interrupts will be delivered any more. Calling mask notifiers solves this
issue.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
With tdp enabled we should get into emulator only when emulating io, so
reexecution will always bring us back into emulator.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Mark inc (0xfe/0 0xff/0) and dec (0xfe/1 0xff/1) as lock prefix capable.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
'level' and 'sptep' are aliases for 'interator.level' and 'iterator.sptep', no
need for them.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Currently, when we fetch an spte, we only verify that gptes match those that
the walker saw if we build new shadow pages for them.
However, this misses the following race:
vcpu1 vcpu2
walk
change gpte
walk
instantiate sp
fetch existing sp
Fix by validating every gpte, regardless of whether it is used for building
a new sp or not.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Partition the function into three sections:
- fetching indirect shadow pages (host_level > guest_level)
- fetching direct shadow pages (page_level < host_level <= guest_level)
- the final spte (page_level == host_level)
Instead of the current spaghetti.
A slight change from the original code is that we call validate_direct_spte()
more often: previously we called it only for gw->level, now we also call it for
lower levels. The change should have no effect.
[xiao: fix regression caused by validate_direct_spte() called too late]
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Move the code to check whether a gpte has changed since we fetched it into
a helper.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Add a helper to verify that a direct shadow page is valid wrt the required
access permissions; drop the page if it is not valid.
Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
To clarify spte fetching code, move large spte handling into a helper.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
To avoid split accesses to 64 bit sptes on i386, use __set_spte() to link
shadow pages together.
(not technically required since shadow pages are __GFP_KERNEL, so upper 32
bits are always clear)
Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
To simplify the process of fetching an spte, add a helper that links
a shadow page to an spte.
Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Userspace needs to reset and save/restore these MSRs.
The MCE banks are not exposed since their number varies from vcpu to vcpu.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
When shadow pages are in use sometimes KVM try to emulate an instruction
when it accesses a shadowed page. If emulation fails KVM un-shadows the
page and reenter guest to allow vcpu to execute the instruction. If page
is not in shadow page hash KVM assumes that this was attempt to do MMIO
and reports emulation failure to userspace since there is no way to fix
the situation. This logic has a race though. If two vcpus tries to write
to the same shadowed page simultaneously both will enter emulator, but
only one of them will find the page in shadow page hash since the one who
founds it also removes it from there, so another cpu will report failure
to userspace and will abort the guest.
Fix this by checking (in addition to checking shadowed page hash) that
page that caused the emulation belongs to valid memory slot. If it is
then reenter the guest to allow vcpu to reexecute the instruction.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently if guest access address that belongs to memory slot but is not
backed up by page or page is read only KVM treats it like MMIO access.
Remove that capability. It was never part of the interface and should
not be relied upon.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Stanse found that there is an omitted unlock in kvm_create_pit in one fail
path. Add proper unlock there.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Gleb Natapov <gleb@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Gregory Haskins <ghaskins@novell.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
Real hardware disregards permission errors when computing page fault error
code bit 0 (page present). Do the same.
Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Bit 4 of the page fault error code is set only if EFER.NX is set.
Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch change to use DstAcc for decoding 'mov AL, moffs'
and introduced SrcAcc for decoding 'mov moffs, AL'.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If IOPL check fail, the cli/sti emulate GP and then we should
skip writeback since the default write OP is OP_REG.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The source operand of 'mov rm,sreg' is segment register, not
general-purpose register, so remove SrcReg from decoding.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
'and AL,imm8' should be mask as ByteOp, otherwise the dest operand
length will no correct and we may fill the full EAX when writeback.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Fix the comment of out instruction, using the same style as the
other instructions.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
__set_spte() will happily replace an spte with the accessed bit set with
one that has the accessed bit clear. Add a helper update_spte() which checks
for this condition and updates the page flag if needed.
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently, in the window between the check for the accessed bit, and actually
dropping the spte, a vcpu can access the page through the spte and set the bit,
which will be ignored by the mmu.
Fix by using an exchange operation to atmoically fetch the spte and drop it.
Signed-off-by: Avi Kivity <avi@redhat.com>
When we call rmap_remove(), we (almost) always immediately follow it by
an __set_spte() to a nonpresent pte. Since we need to perform the two
operations atomically, to avoid losing the dirty and accessed bits, introduce
a helper drop_spte() and convert all call sites.
The operation is still nonatomic at this point.
Signed-off-by: Avi Kivity <avi@redhat.com>
Commit 341d9b535b6c simplify reload logic while entry guest mode, it
can avoid unnecessary sync-root if KVM_REQ_MMU_RELOAD and
KVM_REQ_MMU_SYNC both set.
But, it cause a issue that when we handle 'KVM_REQ_TLB_FLUSH', the
root is invalid, it is triggered during my test:
Kernel BUG at ffffffffa00212b8 [verbose debug info unavailable]
......
Fixed by directly return if the root is not ready.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Add support for stos access tracing with mmiotrace.
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Acked-by: Pekka Paalanen <pq@iki.fi>
Cc: Nouveau <nouveau@lists.freedesktop.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20100731205101.GA5860@joi.lan>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Use a direct per-cpu reference for the GDT instead of using a scratch
register.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
LKML-Reference: <1280594903-6341-2-git-send-email-brgerst@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Use a direct per-cpu reference for the IST instead of using a scratch
register.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
LKML-Reference: <1280594903-6341-1-git-send-email-brgerst@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This patch converts unnecessary divide and modulo operations
in the KVM large page related code into logical operations.
This allows to convert gfn_t to u64 while not breaking 32
bit builds.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Cleanup this function that we are already get the direct sp's access
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If the mapping is writable but the dirty flag is not set, we will find
the read-only direct sp and setup the mapping, then if the write #PF
occur, we will mark this mapping writable in the read-only direct sp,
now, other real read-only mapping will happily write it without #PF.
It may hurt guest's COW
Fixed by re-install the mapping when write #PF occur.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
In no-direct mapping, we mark sp is 'direct' when we mapping the
guest's larger page, but its access is encoded form upper page-struct
entire not include the last mapping, it will cause access conflict.
For example, have this mapping:
[W]
/ PDE1 -> |---|
P[W] | | LPA
\ PDE2 -> |---|
[R]
P have two children, PDE1 and PDE2, both PDE1 and PDE2 mapping the
same lage page(LPA). The P's access is WR, PDE1's access is WR,
PDE2's access is RO(just consider read-write permissions here)
When guest access PDE1, we will create a direct sp for LPA, the sp's
access is from P, is W, then we will mark the ptes is W in this sp.
Then, guest access PDE2, we will find LPA's shadow page, is the same as
PDE's, and mark the ptes is RO.
So, if guest access PDE1, the incorrect #PF is occured.
Fixed by encode the last mapping access into direct shadow page
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
While we sync many unsync sp at one time(in mmu_sync_children()),
we may mapping the spte writable, it's dangerous, if one unsync
sp's mapping gfn is another unsync page's gfn.
For example:
SP1.pte[0] = P
SP2.gfn's pfn = P
[SP1.pte[0] = SP2.gfn's pfn]
First, we write protected SP1 and SP2, but SP1 and SP2 are still the
unsync sp.
Then, sync SP1 first, it will detect SP1.pte[0].gfn only has one unsync-sp,
that is SP2, so it will mapping it writable, but we plan to sync SP2 soon,
at this point, the SP2->unsync is not reliable since later we sync SP2 but
SP2->gfn is already writable.
So the final result is: SP2 is the sync page but SP2.gfn is writable.
This bug will corrupt guest's page table, fixed by mark read-only mapping
if the mapped gfn has shadow pages.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Some guest device driver may leverage the "Non-Snoop" I/O, and explicitly
WBINVD or CLFLUSH to a RAM space. Since migration may occur before WBINVD or
CLFLUSH, we need to maintain data consistency either by:
1: flushing cache (wbinvd) when the guest is scheduled out if there is no
wbinvd exit, or
2: execute wbinvd on all dirty physical CPUs when guest wbinvd exits.
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
No need to reload the mmu in between two different vcpu->requests checks.
kvm_mmu_reload() may trigger KVM_REQ_TRIPLE_FAULT, but that will be caught
during atomic guest entry later.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Older versions of 32-bit linux have a "Checking 'hlt' instruction"
test where they repeatedly call the 'hlt' instruction, and then
expect a timer interrupt to kick the CPU out of halt. This happens
before any LAPIC or IOAPIC setup happens, which means that all of
the APIC's are in virtual wire mode at this point. Unfortunately,
the current implementation of virtual wire mode is hardcoded to
only kick the BSP, so if a crash+kexec occurs on a different
vcpu, it will never get kicked.
This patch makes pic_unlock() do the equivalent of
kvm_irq_delivery_to_apic() for the IOAPIC code. That is, it runs
through all of the vcpus looking for one that is in virtual wire
mode. In the normal case where LAPICs and IOAPICs are configured,
this won't be used at all. In the bootstrap phase of a modern
OS, before the LAPICs and IOAPICs are configured, this will have
exactly the same behavior as today; VCPU0 is always looked at
first, so it will always get out of the loop after the first
iteration. This will only go through the loop more than once
during a kexec/kdump, in which case it will only do it a few times
until the kexec'ed kernel programs the LAPIC and IOAPIC.
Signed-off-by: Chris Lalancette <clalance@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Enable Intel(R) Advanced Vector Extension(AVX) for guest.
The detection of AVX feature includes OSXSAVE bit testing. When OSXSAVE bit is
not set, even if AVX is supported, the AVX instruction would result in UD as
well. So we're safe to expose AVX bits to guest directly.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If a process with a memory slot is COWed, the page will change its address
(despite having an elevated reference count). This breaks internal memory
slots which have their physical addresses loaded into vmcs registers (see
the APIC access memory slot).
Signed-off-by: Avi Kivity <avi@redhat.com>
Part of the i8259 code pretends it isn't part of kvm, but we know better.
Reduce excessive abstraction, eliminating callbacks and void pointers.
Signed-off-by: Avi Kivity <avi@redhat.com>
As advertised in feature-removal-schedule.txt. Equivalent support is provided
by overlapping memory regions.
Signed-off-by: Avi Kivity <avi@redhat.com>
Instead of three temporary variables and three free calls, have one temporary
variable (with four names) and one free call.
Signed-off-by: Avi Kivity <avi@redhat.com>
Group 3 instruction with ModRM reg field as 001 is
defined as test instruction under AMD arch, and
emulate_grp3() is ready for emulate it, so fix the
decoding.
static inline int emulate_grp3(...)
{
...
switch (c->modrm_reg) {
case 0 ... 1: /* test */
emulate_2op_SrcV("test", c->src, c->dst, ctxt->eflags);
...
}
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If the guest wants to accept timer interrupts on a CPU other
than the BSP, we need to remove this gate.
Signed-off-by: Chris Lalancette <clalance@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We really want to "kvm_set_irq" during the hrtimer callback,
but that is risky because that is during interrupt context.
Instead, offload the work to a workqueue, which is a bit safer
and should provide most of the same functionality.
Signed-off-by: Chris Lalancette <clalance@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
emulate pusha instruction only writeback the last
EDI register, but the other registers which need
to be writeback is ignored. This patch fixed it.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Fix a slight error with assertion in local APIC code.
Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
While we mark the parent's unsync_child_bitmap, if the parent is already
unsynced, it no need walk it's parent, it can reduce some unnecessary
workload
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
In current code, some page's unsync_child_bitmap is not cleared completely
in mmu_sync_children(), for example, if two PDPEs shard one PDT, one of
PDPE's unsync_child_bitmap is not cleared.
Currently, it not harm anything just little overload, but it's the prepare
work for the later patch
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If the sync-sp just sync transient, don't mark its pte notrap
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The sync page is already write protected in mmu_sync_children(), don't
write protected it again
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Rename 'page' and 'shadow_page' to 'sp' to better fit the context
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
On Intel, we call skip_emulated_instruction() even if we injected a #GP,
resulting in the #GP pointing at the wrong address.
Fix by injecting the exception and skipping the instruction at the same place,
so we can do just one or the other.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
On Intel, we call skip_emulated_instruction() even if we injected a #GP,
resulting in the #GP pointing at the wrong address.
Fix by injecting the exception and skipping the instruction at the same place,
so we can do just one or the other.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
On Intel, we call skip_emulated_instruction() even if we injected a #GP,
resulting in the #GP pointing at the wrong address.
Fix by injecting the exception and skipping the instruction at the same place,
so we can do just one or the other.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch enable guest to use XSAVE/XRSTOR instructions.
We assume that host_xcr0 would use all possible bits that OS supported.
And we loaded xcr0 in the same way we handled fpu - do it as late as we can.
Signed-off-by: Dexuan Cui <dexuan.cui@intel.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
delay local tlb flush until enter guest moden, it can reduce vpid flush
frequency and reduce remote tlb flush IPI(if KVM_REQ_TLB_FLUSH bit is
already set, IPI is not sent)
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Use kvm_mmu_flush_tlb() function instead of calling
kvm_x86_ops->tlb_flush(vcpu) directly.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This remote tlb flush is no necessary since we have synced while
sp is zapped
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>