Commit graph

14163 commits

Author SHA1 Message Date
David Vrabel
dc91c728fd xen: allow extra memory to be in multiple regions
Allow the extra memory (used by the balloon driver) to be in multiple
regions (typically two regions, one for low memory and one for high
memory).  This allows the balloon driver to increase the number of
available low pages (if the initial number if pages is small).

As a side effect, the algorithm for building the e820 memory map is
simpler and more obviously correct as the map supplied by the
hypervisor is (almost) used as is (in particular, all reserved regions
and gaps are preserved).  Only RAM regions are altered and RAM regions
above max_pfn + extra_pages are marked as unused (the region is split
in two if necessary).

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-09-29 11:12:10 -04:00
David Vrabel
8b5d44a5ac xen: allow balloon driver to use more than one memory region
Allow the xen balloon driver to populate its list of extra pages from
more than one region of memory.  This will allow platforms to provide
(for example) a region of low memory and a region of high memory.

The maximum possible number of extra regions is 128 (== E820MAX) which
is quite large so xen_extra_mem is placed in __initdata.  This is safe
as both xen_memory_setup() and balloon_init() are in __init.

The balloon regions themselves are not altered (i.e., there is still
only the one region).

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-09-29 11:12:10 -04:00
David Vrabel
aa24411b67 xen/balloon: account for pages released during memory setup
In xen_memory_setup() pages that occur in gaps in the memory map are
released back to Xen.  This reduces the domain's current page count in
the hypervisor.  The Xen balloon driver does not correctly decrease
its initial current_pages count to reflect this.  If 'delta' pages are
released and the target is adjusted the resulting reservation is
always 'delta' less than the requested target.

This affects dom0 if the initial allocation of pages overlaps the PCI
memory region but won't affect most domU guests that have been setup
with pseudo-physical memory maps that don't have gaps.

Fix this by accouting for the released pages when starting the balloon
driver.

If the domain's targets are managed by xapi, the domain may eventually
run out of memory and die because xapi currently gets its target
calculations wrong and whenever it is restarted it always reduces the
target by 'delta'.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-09-29 11:12:09 -04:00
Stefano Stabellini
b17d0b5c08 xen: XEN_PVHVM depends on PCI
Xen PV on HVM guests require PCI support because they need the
xen-platform-pci driver in order to initialize xenbus.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-09-29 10:52:16 -04:00
Stefano Stabellini
0930bba674 xen: modify kernel mappings corresponding to granted pages
If we want to use granted pages for AIO, changing the mappings of a user
vma and the corresponding p2m is not enough, we also need to update the
kernel mappings accordingly.
Currently this is only needed for pages that are created for user usages
through /dev/xen/gntdev. As in, pages that have been in use by the
kernel and use the P2M will not need this special mapping.
However there are no guarantees that in the future the kernel won't
start accessing pages through the 1:1 even for internal usage.

In order to avoid the complexity of dealing with highmem, we allocated
the pages lowmem.
We issue a HYPERVISOR_grant_table_op right away in
m2p_add_override and we remove the mappings using another
HYPERVISOR_grant_table_op in m2p_remove_override.
Considering that m2p_add_override and m2p_remove_override are called
once per page we use multicalls and hypercall batching.

Use the kmap_op pointer directly as argument to do the mapping as it is
guaranteed to be present up until the unmapping is done.
Before issuing any unmapping multicalls, we need to make sure that the
mapping has already being done, because we need the kmap->handle to be
set correctly.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
[v1: Removed GRANT_FRAME_BIT usage]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-09-29 10:32:58 -04:00
Jan Beulich
eab9e6137f x86-64: Fix CFI data for interrupt frames
The patch titled "x86: Don't use frame pointer to save old stack
on irq entry" did not properly adjust CFI directives, so this
patch is a follow-up to that one.

With the old stack pointer no longer stored in a callee-saved
register (plus some offset), we now have to use a CFA expression
to describe the memory location where it is being found. This
requires the use of .cfi_escape (allowing arbitrary byte streams
to be emitted into .eh_frame), as there is no
.cfi_def_cfa_expression (which also cannot reasonably be
expected, as it would require a full expression parser).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/4E8360200200007800058467@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-09-28 19:04:52 +02:00
Jan Beulich
e05139f256 x86-64: Don't apply destructive erratum workaround on unaffected CPUs
Erratum 93 applies to AMD K8 CPUs only, and its workaround
(forcing the upper 32 bits of %rip to all get set under certain
conditions) is actually getting in the way of analyzing page
faults occurring during EFI physical mode runtime calls (in
particular the page table walk shown is completely unrelated to
the actual fault). This is because typically EFI runtime code
lives in the space between 2G and 4G, which - modulo the above
manipulation - is likely to overlap with the kernel or modules
area.

While even for the other errata workarounds their taking effect
could be limited to just the affected CPUs, none of them appears
to be destructive, and they're generally getting called only
outside of performance critical paths, so they're being left
untouched.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/4E835FE30200007800058464@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-09-28 19:04:48 +02:00
Jan Beulich
838312be46 apic, i386/bigsmp: Fix false warnings regarding logical APIC ID mismatches
These warnings (generally one per CPU) are a result of
initializing x86_cpu_to_logical_apicid while apic_default is
still in use, but the check in setup_local_APIC() being done
when apic_bigsmp was already used as an override in
default_setup_apic_routing():

 Overriding APIC driver with bigsmp
 Enabling APIC mode:  Physflat.  Using 5 I/O APICs
 ------------[ cut here ]------------
 WARNING: at .../arch/x86/kernel/apic/apic.c:1239
 ...
 CPU 1 irqstacks, hard=f1c9a000 soft=f1c9c000
 Booting Node   0, Processors  #1
 smpboot cpu 1: start_ip = 9e000
 Initializing CPU#1
 ------------[ cut here ]------------
 WARNING: at .../arch/x86/kernel/apic/apic.c:1239
 setup_local_APIC+0x137/0x46b() Hardware name: ...
 CPU1 logical APIC ID: 2 != 8
 ...

Fix this (for the time being, i.e. until
x86_32_early_logical_apicid() will get removed again, as Tejun
says ought to be possible) by overriding the previously stored
values at the point where the APIC driver gets overridden.

v2: Move this and the pre-existing override logic into
    arch/x86/kernel/apic/bigsmp_32.c.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@kernel.org> (2.6.39 and onwards)
Link: http://lkml.kernel.org/r/4E835D16020000780005844C@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-09-28 19:01:53 +02:00
Stephen Rothwell
910b2c5122 x86, amd: Include linux/elf.h since we use stuff from asm/elf.h
After merging the moduleh tree, today's linux-next build (x86_64
allmodconfig) failed like this:

  arch/x86/kernel/sys_x86_64.c:28:10: warning: 'enum align_flags' declared inside parameter list
  arch/x86/kernel/sys_x86_64.c:28:10: warning: its scope is only this definition or declaration, which is probably not what you
  want arch/x86/kernel/sys_x86_64.c:28:22: error: parameter 3 ('flags') has incomplete type
  [...]

Presumably caused by the module.h split interacting with a
new commit dfb09f9b7a ("x86, amd: Avoid cache aliasing penalties
on AMD family 15h") from the x8 tree.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: http://lkml.kernel.org/r/20110928174214.17a58be15d84d67c185930e1@canb.auug.org.au
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-09-28 10:34:31 +02:00
Ingo Molnar
695d16f787 Merge branch 'upstream/ticketlock-cleanup' of git://github.com/jsgf/linux-xen into x86/spinlocks 2011-09-28 08:57:10 +02:00
Jeremy Fitzhardinge
4a7f340c6a x86, ticketlock: remove obsolete comment
The note about partial registers is not really relevent now that we
rely on gcc to generate all the assembler.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-09-27 23:37:20 -07:00
Randy Dunlap
d6eed550a9 x86: Perf_event_amd.c needs <asm/apicdef.h>
Fix (rare) build error by adding <asm/apicdef.h> header file:

  arch/x86/kernel/cpu/perf_event_amd.c:350:2: error: 'BAD_APICID' undeclared (first use in this function)

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Andre Przywara <andre.przywara@amd.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/4E820138.90301@xenotime.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-09-27 19:55:09 +02:00
Paul Bolle
395cf9691d doc: fix broken references
There are numerous broken references to Documentation files (in other
Documentation files, in comments, etc.). These broken references are
caused by typo's in the references, and by renames or removals of the
Documentation files. Some broken references are simply odd.

Fix these broken references, sometimes by dropping the irrelevant text
they were part of.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-09-27 18:08:04 +02:00
Jeremy Fitzhardinge
fdb9eb9f15 xen/dom0: set wallclock time in Xen
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-09-26 11:04:39 -07:00
Jeremy Fitzhardinge
eec07a9ecf xen: add dom0_op hypercall
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-09-26 11:04:39 -07:00
Yu Ke
3e0996798a xen/acpi: Domain0 acpi parser related platform hypercall
This patches implements the xen_platform_op hypercall, to pass the parsed
ACPI info to hypervisor.

Signed-off-by: Yu Ke <ke.yu@intel.com>
Signed-off-by: Tian Kevin <kevin.tian@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
[v1: Added DEFINE_GUEST.. in appropiate headers]
[v2: Ripped out typedefs]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-09-26 11:04:39 -07:00
Kevin Winchester
de0428a7ad x86, perf: Clean up perf_event cpu code
The CPU support for perf events on x86 was implemented via included C files
with #ifdefs.  Clean this up by creating a new header file and compiling
the vendor-specific files as needed.

Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1314747665-2090-1-git-send-email-kjwinchester@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-09-26 12:58:00 +02:00
Ingo Molnar
ed3982cf37 Merge commit 'v3.1-rc7' into perf/core
Merge reason: Pick up the latest upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-09-26 12:54:28 +02:00
Liu, Jinsong
b90dfb0419 x86: TSC deadline definitions
This pre-defination is preparing for KVM tsc deadline timer emulation, but
theirself are not kvm specific.

Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:53:00 +03:00
Avi Kivity
7460fb4a34 KVM: Fix simultaneous NMIs
If simultaneous NMIs happen, we're supposed to queue the second
and next (collapsing them), but currently we sometimes collapse
the second into the first.

Fix by using a counter for pending NMIs instead of a bool; since
the counter limit depends on whether the processor is currently
in an NMI handler, which can only be checked in vcpu context
(via the NMI mask), we add a new KVM_REQ_NMI to request recalculation
of the counter.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:59 +03:00
Avi Kivity
1cd196ea42 KVM: x86 emulator: convert push %sreg/pop %sreg to direct decode
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:58 +03:00
Avi Kivity
d4b4325fdb KVM: x86 emulator: switch lds/les/lss/lfs/lgs to direct decode
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:57 +03:00
Avi Kivity
c191a7a0f4 KVM: x86 emulator: streamline decode of segment registers
The opcodes

  push %seg
  pop %seg
  l%seg, %mem, %reg  (e.g. lds/les/lss/lfs/lgs)

all have an segment register encoded in the instruction.  To allow reuse,
decode the segment number into src2 during the decode stage instead of the
execution stage.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:56 +03:00
Avi Kivity
41ddf9784c KVM: x86 emulator: simplify OpMem64 decode
Use the same technique as the other OpMem variants, and goto mem_common.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:55 +03:00
Avi Kivity
0fe5912884 KVM: x86 emulator: switch src decode to decode_operand()
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:54 +03:00
Avi Kivity
5217973ef8 KVM: x86 emulator: qualify OpReg inhibit_byte_regs hack
OpReg decoding has a hack that inhibits byte registers for movsx and movzx
instructions.  It should be replaced by something better, but meanwhile,
qualify that the hack is only active for the destination operand.

Note these instructions only use OpReg for the destination, but better to
be explicit about it.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:53 +03:00
Avi Kivity
608aabe316 KVM: x86 emulator: switch OpImmUByte decode to decode_imm()
Similar to SrcImmUByte.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:52 +03:00
Avi Kivity
20c29ff205 KVM: x86 emulator: free up some flag bits near src, dst
Op fields are going to grow by a bit, we need two free bits.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:51 +03:00
Avi Kivity
4dd6a57df7 KVM: x86 emulator: switch src2 to generic decode_operand()
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:50 +03:00
Avi Kivity
b1ea50b2b6 KVM: x86 emulator: expand decode flags to 64 bits
Unifiying the operands means not taking advantage of the fact that some
operand types can only go into certain operands (for example, DI can only
be used by the destination), so we need more bits to hold the operand type.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:49 +03:00
Avi Kivity
a99455499a KVM: x86 emulator: split dst decode to a generic decode_operand()
Instead of decoding each operand using its own code, use a generic
function.  Start with the destination operand.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:48 +03:00
Avi Kivity
f09ed83e21 KVM: x86 emulator: move memop, memopp into emulation context
Simplifies further generalization of decode.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:47 +03:00
Avi Kivity
3329ece161 KVM: x86 emulator: convert group 3 instructions to direct decode
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:46 +03:00
Jan Kiszka
9bc5791d4a KVM: x86: Add module parameter for lapic periodic timer limit
Certain guests, specifically RTOSes, request faster periodic timers than
what we allow by default. Add a module parameter to adjust the limit for
non-standard setups. Also add a rate-limited warning in case the guest
requested more.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:44 +03:00
Jan Kiszka
bd80158aff KVM: Clean up and extend rate-limited output
The use of printk_ratelimit is discouraged, replace it with
pr*_ratelimited or __ratelimit. While at it, convert remaining
guest-triggerable printks to rate-limited variants.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:43 +03:00
Jan Kiszka
7712de872c KVM: x86: Avoid guest-triggerable printks in APIC model
Convert remaining printks that the guest can trigger to apic_printk.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:42 +03:00
Jan Kiszka
1e2b1dd797 KVM: x86: Move kvm_trace_exit into atomic vmexit section
This avoids that events causing the vmexit are recorded before the
actual exit reason.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:41 +03:00
Avi Kivity
caa8a168e3 KVM: x86 emulator: disable writeback for TEST
The TEST instruction doesn't write its destination operand.  This
could cause problems if an MMIO register was accessed using the TEST
instruction.  Recently Windows XP was observed to use TEST against
the APIC ICR; this can cause spurious IPIs.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:40 +03:00
Avi Kivity
e8f2b1d621 KVM: x86 emulator: simplify emulate_1op_rax_rdx()
emulate_1op_rax_rdx() is always called with the same parameters.  Simplify
by passing just the emulation context.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:37 +03:00
Avi Kivity
9fef72ce10 KVM: x86 emulator: merge the two emulate_1op_rax_rdx implementations
We have two emulate-with-extended-accumulator implementations: once
which expect traps (_ex) and one which doesn't (plain).  Drop the
plain implementation and always use the one which expects traps;
it will simply return 0 in the _ex argument and we can happily ignore
it.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:36 +03:00
Avi Kivity
d1eef45d59 KVM: x86 emulator: simplify emulate_1op()
emulate_1op() is always called with the same parameters.  Simplify
by passing just the emulation context.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:35 +03:00
Avi Kivity
29053a60d7 KVM: x86 emulator: simplify emulate_2op_cl()
emulate_2op_cl() is always called with the same parameters.  Simplify
by passing just the emulation context.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:34 +03:00
Avi Kivity
761441b9f4 KVM: x86 emulator: simplify emulate_2op_cl()
emulate_2op_cl() is always called with the same parameters.  Simplify
by passing just the emulation context.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:33 +03:00
Avi Kivity
a31b9ceadb KVM: x86 emulator: simplify emulate_2op_SrcV()
emulate_2op_SrcV(), and its siblings, emulate_2op_SrcV_nobyte()
and emulate_2op_SrcB(), all use the same calling conventions
and all get passed exactly the same parameters.  Simplify them
by passing just the emulation context.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:52:32 +03:00
Kevin Tian
58fbbf26eb KVM: APIC: avoid instruction emulation for EOI writes
Instruction emulation for EOI writes can be skipped, since sane
guest simply uses MOV instead of string operations. This is a nice
improvement when guest doesn't support x2apic or hyper-V EOI
support.

a single VM bandwidth is observed with ~8% bandwidth improvement
(7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation.

Signed-off-by: Kevin Tian <kevin.tian@intel.com>
<Based on earlier work from>:
Signed-off-by: Eddie Dong <eddie.dong@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:52:17 +03:00
Nadav Har'El
45133ecaae KVM: SVM: Fix TSC MSR read in nested SVM
When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be
read without exit), we need to return L2's notion of the TSC, not L1's.

The current code incorrectly returned L1 TSC, because svm_get_msr() was also
used in x86.c where this was assumed, but now that these places call the new
svm_read_l1_tsc(), the MSR read can be fixed.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Tested-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:18:03 +03:00
Nadav Har'El
27fc51b21c KVM: nVMX: Fix nested VMX TSC emulation
This patch fixes two corner cases in nested (L2) handling of TSC-related
issues:

1. Somewhat suprisingly, according to the Intel spec, if L1 allows WRMSR to
the TSC MSR without an exit, then this should set L1's TSC value itself - not
offset by vmcs12.TSC_OFFSET (like was wrongly done in the previous code).

2. Allow L1 to disable the TSC_OFFSETING control, and then correctly ignore
the vmcs12.TSC_OFFSET.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:18:02 +03:00
Nadav Har'El
d5c1785d2f KVM: L1 TSC handling
KVM assumed in several places that reading the TSC MSR returns the value for
L1. This is incorrect, because when L2 is running, the correct TSC read exit
emulation is to return L2's value.

We therefore add a new x86_ops function, read_l1_tsc, to use in places that
specifically need to read the L1 TSC, NOT the TSC of the current level of
guest.

Note that one change, of one line in kvm_arch_vcpu_load, is made redundant
by a different patch sent by Zachary Amsden (and not yet applied):
kvm_arch_vcpu_load() should not read the guest TSC, and if it didn't, of
course we didn't have to change the call of kvm_get_msr() to read_l1_tsc().

[avi: moved callback to kvm_x86_ops tsc block]

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Acked-by: Zachary Amsdem <zamsden@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:18:02 +03:00
Yang, Wei Y
cd46868c7f KVM: MMU: Fix SMEP failure during fetch
This patch fix kvm-unit-tests hanging and incorrect PT_ACCESSED_MASK
bit set in the case of SMEP fault.  The code updated 'eperm' after
the variable was checked.

Signed-off-by: Yang, Wei <wei.y.yang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25 19:18:02 +03:00
Avi Kivity
e4e517b4be KVM: MMU: Do not unconditionally read PDPTE from guest memory
Architecturally, PDPTEs are cached in the PDPTRs when CR3 is reloaded.
On SVM, it is not possible to implement this, but on VMX this is possible
and was indeed implemented until nested SVM changed this to unconditionally
read PDPTEs dynamically.  This has noticable impact when running PAE guests.

Fix by changing the MMU to read PDPTRs from the cache, falling back to
reading from memory for the nested MMU.

Signed-off-by: Avi Kivity <avi@redhat.com>
Tested-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:18:01 +03:00