Commit graph

88 commits

Author SHA1 Message Date
Izik Eidus
82ce2c9683 KVM: Allow dynamic allocation of the mmu shadow cache size
The user is now able to set how many mmu pages will be allocated to the guest.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:50 +02:00
Izik Eidus
290fc38da8 KVM: Remove the usage of page->private field by rmap
When kvm uses user-allocated pages in the future for the guest, we won't
be able to use page->private for rmap, since page->rmap is reserved for
the filesystem.  So we move the rmap base pointers to the memory slot.

A side effect of this is that we need to store the gfn of each gpte in
the shadow pages, since the memory slot is addressed by gfn, instead of
hfn like struct page.

Signed-off-by: Izik Eidus <izik@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:50 +02:00
Avi Kivity
12b7d28fc1 KVM: MMU: Make flooding detection work when guest page faults are bypassed
When we allow guest page faults to reach the guests directly, we lose
the fault tracking which allows us to detect demand paging.  So we provide
an alternate mechnism by clearing the accessed bit when we set a pte, and
checking it later to see if the guest actually used it.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:48 +02:00
Avi Kivity
c7addb9020 KVM: Allow not-present guest page faults to bypass kvm
There are two classes of page faults trapped by kvm:
 - host page faults, where the fault is needed to allow kvm to install
   the shadow pte or update the guest accessed and dirty bits
 - guest page faults, where the guest has faulted and kvm simply injects
   the fault back into the guest to handle

The second class, guest page faults, is pure overhead.  We can eliminate
some of it on vmx using the following evil trick:
 - when we set up a shadow page table entry, if the corresponding guest pte
   is not present, set up the shadow pte as not present
 - if the guest pte _is_ present, mark the shadow pte as present but also
   set one of the reserved bits in the shadow pte
 - tell the vmx hardware not to trap faults which have the present bit clear

With this, normal page-not-present faults go directly to the guest,
bypassing kvm entirely.

Unfortunately, this trick only works on Intel hardware, as AMD lacks a
way to discriminate among page faults based on error code.  It is also
a little risky since it uses reserved bits which might become unreserved
in the future, so a module parameter is provided to disable it.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:52:48 +02:00
Eddie Dong
8668a3c468 KVM: VMX: Reset mmu context when entering real mode
Resetting an SMP guest will force AP enter real mode (RESET) with
paging enabled in protected mode. While current enter_rmode() can
only handle mode switch from nonpaging mode to real mode which leads
to SMP reboot failure.

Fix by reloading the mmu context on entering real mode.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-22 12:03:28 +02:00
Izik Eidus
7f2145ad6f KVM: MMU: Set shadow pte atomically in mmu_pte_write_zap_pte()
Setting shadow page table entry should be set atomicly using set_shadow_pte().

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-22 12:03:28 +02:00
Avi Kivity
2e3e5882dc KVM: MMU: Don't do GFP_NOWAIT allocations
Before preempt notifiers, kvm needed to allocate memory with GFP_NOWAIT so
as not to have to enable preemption and take a heavyweight exit.  On oom, we'd
fall back to a GFP_KERNEL allocation.

With preemption notifiers, we can do a GFP_KERNEL allocation, and perform
the heavyweight exit only if the kernel decides to put us to sleep.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:27 +02:00
Christian Ehrhardt
cbdd1bea2a KVM: Rename kvm_arch_ops to kvm_x86_ops
This patch just renames the current (misnamed) _arch namings to _x86 to
ensure better readability when a real arch layer takes place.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:27 +02:00
Shaohua Li
11ec280471 KVM: Convert vm lock to a mutex
This allows the kvm mmu to perform sleepy operations, such as memory
allocation.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:20 +02:00
Avi Kivity
15ad71460d KVM: Use the scheduler preemption notifiers to make kvm preemptible
Current kvm disables preemption while the new virtualization registers are
in use.  This of course is not very good for latency sensitive workloads (one
use of virtualization is to offload user interface and other latency
insensitive stuff to a container, so that it is easier to analyze the
remaining workload).  This patch re-enables preemption for kvm; preemption
is now only disabled when switching the registers in and out, and during
the switch to guest mode and back.

Contains fixes from Shaohua Li <shaohua.li@intel.com>.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:20 +02:00
Shaohua Li
fe55188194 KVM: Move gfn_to_page out of kmap/unmap pairs
gfn_to_page might sleep with swap support. Move it out of the kmap calls.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:19 +02:00
Rusty Russell
707d92fa72 KVM: Trivial: Use standard CR0 flags macros from asm/cpu-features.h
The kernel now has asm/cpu-features.h: use those macros instead of
inventing our own.

Also spell out definition of CR0_RESEVED_BITS (no code change) and fix typo.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-13 10:18:18 +02:00
Avi Kivity
22d95b1282 KVM: MMU: Fix rare oops on guest context switch
A guest context switch to an uncached cr3 can require allocation of
shadow pages, but we only recycle shadow pages in kvm_mmu_page_fault().

Move shadow page recycling to mmu_topup_memory_caches(), which is called
from both the page fault handler and from guest cr3 reload.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-14 13:59:55 -07:00
Avi Kivity
c4d198d518 KVM: MMU: Fix cleaning up the shadow page allocation cache
__free_page() wants a struct page, not a virtual address.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-20 23:48:47 -07:00
Avi Kivity
c1158e63df KVM: MMU: Fix oopses with SLUB
The kvm mmu uses page->private on shadow page tables; so does slub, and
an oops result.  Fix by allocating regular pages for shadows instead of
using slub.

Tested-by: S.Çağlar Onur <caglar@pardus.org.tr>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-20 20:23:59 +03:00
Avi Kivity
90cb0529dd KVM: Fix memory slot management functions for guest smp
The memory slot management functions were oriented against vcpu 0, where
they should be kvm-wide.  This causes hangs starting X on guest smp.

Fix by making the functions (and resultant tail in the mmu) non-vcpu-specific.
Unfortunately this reduces the efficiency of the mmu object cache a bit.  We
may have to revisit this later.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-20 20:16:29 +03:00
Paul Mundt
20c2df83d2 mm: Remove slab destructors from kmem_cache_create().
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-07-20 10:11:58 +09:00
Avi Kivity
e495606dd0 KVM: Clean up #includes
Remove unnecessary ones, and rearange the remaining in the standard order.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:49 +03:00
Shaohua Li
88a97f0b2f KVM: MMU: Fix Wrong tlb flush order
Need to flush the tlb after updating a pte, not before.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:48 +03:00
Avi Kivity
d9e368d612 KVM: Flush remote tlbs when reducing shadow pte permissions
When a vcpu causes a shadow tlb entry to have reduced permissions, it
must also clear the tlb on remote vcpus.  We do that by:

- setting a bit on the vcpu that requests a tlb flush before the next entry
- if the vcpu is currently executing, we send an ipi to make sure it
  exits before we continue

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:46 +03:00
Avi Kivity
7b53aa5650 KVM: Fix vcpu freeing for guest smp
A vcpu can pin up to four mmu shadow pages, which means the freeing
loop will never terminate.  Fix by first unpinning shadow pages on
all vcpus, then freeing shadow pages.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:45 +03:00
Avi Kivity
17c3ba9d37 KVM: Lazy guest cr3 switching
Switch guest paging context may require us to allocate memory, which
might fail.  Instead of wiring up error paths everywhere, make context
switching lazy and actually do the switch before the next guest entry,
where we can return an error if allocation fails.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:45 +03:00
Avi Kivity
bd2b2baa5c KVM: MMU: Remove unused large page marker
This has not been used for some time, as the same information is available
in the page header.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:45 +03:00
Avi Kivity
b64b3763a5 KVM: MMU: Don't cache guest access bits in the shadow page table
This was once used to avoid accessing the guest pte when upgrading
the shadow pte from read-only to read-write.  But usually we need
to set the guest pte dirty or accessed bits anyway, so this wasn't
really exploited.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:44 +03:00
Avi Kivity
fd97dc516c KVM: MMU: Simpify accessed/dirty/present/nx bit handling
Always set the accessed and dirty bit (since having them cleared causes
a read-modify-write cycle), always set the present bit, and copy the
nx bit from the guest.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:44 +03:00
Avi Kivity
e663ee64ae KVM: MMU: Make setting shadow ptes atomic on i386
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:44 +03:00
Avi Kivity
97a0a01ea9 KVM: MMU: Fold fix_write_pf() into set_pte_common()
This prevents some work from being performed twice, and, more importantly,
reduces the number of places where we modify shadow ptes.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:43 +03:00
Avi Kivity
63b1ad24d2 KVM: MMU: Fold fix_read_pf() into set_pte_common()
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:43 +03:00
Avi Kivity
e60d75ea29 KVM: MMU: Move set_pte_common() to pte width dependent code
In preparation of some modifications.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:43 +03:00
Avi Kivity
d3d25b048b KVM: MMU: Use slab caches for shadow pages and their headers
Use slab caches instead of a simple custom list.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:43 +03:00
Avi Kivity
47ad8e689b KVM: MMU: Store shadow page tables as kernel virtual addresses, not physical
Simpifies things a bit.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:40 +03:00
Avi Kivity
4b02d6daa1 KVM: MMU: Simplify kvm_mmu_free_page() a tiny bit
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:40 +03:00
Avi Kivity
0028425f64 KVM: Update shadow pte on write to guest pte
A typical demand page/copy on write pattern is:

- page fault on vaddr
- kvm propagates fault to guest
- guest handles fault, updates pte
- kvm traps write, clears shadow pte, resumes guest
- guest returns to userspace, re-faults on same vaddr
- kvm installs shadow pte, resumes guest
- guest continues

So, three vmexits for a single guest page fault.  But if instead of clearing
the page table entry, we update to correspond to the value that the guest
has just written, we eliminate the third vmexit.

This patch does exactly that, reducing kbuild time by about 10%.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:39 +03:00
Avi Kivity
fce0657ff9 KVM: MMU: Respect nonpae pagetable quadrant when zapping ptes
When a guest writes to a page that has an mmu shadow, we have to clear
the shadow pte corresponding to the memory location touched by the guest.

Now, in nonpae mode, a single guest page may have two or four shadow
pages (because a nonpae page maps 4MB or 4GB, whereas the pae shadow maps
2MB or 1GB), so we when we look up the page we find up to three additional
aliases for the page.  Since we _clear_ the shadow pte, it doesn't matter
except for a slight performance penalty, but if we want to _update_ the
shadow pte instead of clearing it, it is vital that we don't modify the
aliases.

Fortunately, exactly which page is needed (the "quadrant") is easily
computed, and is accessible in the shadow page header.  All we need is
to ignore shadow pages from the wrong quadrants.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:39 +03:00
Avi Kivity
09072daf37 KVM: Unify kvm_mmu_pre_write() and kvm_mmu_post_write()
Instead of calling two functions and repeating expensive checks, call one
function and provide it with before/after information.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:38 +03:00
Avi Kivity
e925c5ba93 KVM: Assume that writes smaller than 4 bytes are to non-pagetable pages
This allows us to remove write protection earlier than otherwise.  Should
some mad OS choose to use byte writes to update pagetables, it will suffer
a performance hit, but still work correctly.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-07-16 12:05:38 +03:00
Adrian Bunk
2807696c37 KVM: fix an if() condition
It might have worked in this case since PT_PRESENT_MASK is 1, but let's
express this correctly.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:31 +03:00
Avi Kivity
1165f5fec1 KVM: Per-vcpu statistics
Make the exit statistics per-vcpu instead of global.  This gives a 3.5%
boost when running one virtual machine per core on my two socket dual core
(4 cores total) machine.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:30 +03:00
Yaozu Dong
d6c69ee9a2 KVM: MMU: Avoid heavy ASSERT at non debug mode.
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:30 +03:00
Avi Kivity
8c4385024d KVM: Retry sleeping allocation if atomic allocation fails
This avoids -ENOMEM under memory pressure.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:29 +03:00
Avi Kivity
b5a33a7572 KVM: Use slab caches to allocate mmu data structures
Better leak detection, statistics, memory use, speed -- goodness all
around.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:29 +03:00
Avi Kivity
417726a3fb KVM: Handle partial pae pdptr
Some guests (Solaris) do not set up all four pdptrs, but leave some invalid.
kvm incorrectly treated these as valid page directories, pinning the
wrong pages and causing general confusion.

Fix by checking the valid bit of a pae pdpte.  This closes sourceforge bug
1698922.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:29 +03:00
Avi Kivity
954bbbc236 KVM: Simply gfn_to_page()
Mapping a guest page to a host page is a common operation.  Currently,
one has first to find the memory slot where the page belongs (gfn_to_memslot),
then locate the page itself (gfn_to_page()).

This is clumsy, and also won't work well with memory aliases.  So simplify
gfn_to_page() not to require memory slot translation first, and instead do it
internally.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:28 +03:00
Dor Laor
e0fa826f96 KVM: Add mmu cache clear function
Functions that play around with the physical memory map
need a way to clear mappings to possibly nonexistent or
invalid memory.  Both the mmu cache and the processor tlb
are cleared.

Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:28 +03:00
Avi Kivity
36868f7b0e KVM: Use list_move()
Use list_move() where possible.  Noticed by Dor Laor.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:27 +03:00
Avi Kivity
d28c6cfbbc KVM: MMU: Fix hugepage pdes mapping same physical address with different access
The kvm mmu keeps a shadow page for hugepage pdes; if several such pdes map
the same physical address, they share the same shadow page.  This is a fairly
common case (kernel mappings on i386 nonpae Linux, for example).

However, if the two pdes map the same memory but with different permissions, kvm
will happily use the cached shadow page.  If the access through the more
permissive pde will occur after the access to the strict pde, an endless pagefault
loop will be generated and the guest will make no progress.

Fix by making the access permissions part of the cache lookup key.

The fix allows Xen pae to boot on kvm and run guest domains.

Thanks to Jeremy Fitzhardinge for reporting the bug and testing the fix.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:27 +03:00
Avi Kivity
aac012245a KVM: MMU: Remove global pte tracking
The initial, noncaching, version of the kvm mmu flushed the all nonglobal
shadow page table translations (much like a native tlb flush).  The new
implementation flushes translations only when they change, rendering global
pte tracking superfluous.

This removes the unused tracking mechanism and storage space.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:25 +03:00
Avi Kivity
039576c03c KVM: Avoid guest virtual addresses in string pio userspace interface
The current string pio interface communicates using guest virtual addresses,
relying on userspace to translate addresses and to check permissions.  This
interface cannot fully support guest smp, as the check needs to take into
account two pages at one in case an unaligned string transfer straddles a
page boundary.

Change the interface not to communicate guest addresses at all; instead use
a buffer page (mmaped by userspace) and do transfers there.  The kernel
manages the virtual to physical translation and can perform the checks
atomically by taking the appropriate locks.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:25 +03:00
Avi Kivity
1ea252afcd KVM: Fix bogus sign extension in mmu mapping audit
When auditing a 32-bit guest on a 64-bit host, sign extension of the page
table directory pointer table index caused bogus addresses to be shown on
audit errors.

Fix by declaring the index unsigned.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-05-03 10:52:23 +03:00
Avi Kivity
6b8d0f9b18 KVM: Fix off-by-one when writing to a nonpae guest pde
Nonpae guest pdes are shadowed by two pae ptes, so we double the offset
twice: once to account for the pte size difference, and once because we
need to shadow pdes for a single guest pde.

But when writing to the upper guest pde we also need to truncate the
lower bits, otherwise the multiply shifts these bits into the pde index
and causes an access to the wrong shadow pde.  If we're at the end of the
page (accessing the very last guest pde) we can even overflow into the
next host page and oops.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-04-19 18:39:26 +03:00