Currently KVM pretends that pages with EPT mappings never got
accessed. This has some side effects in the VM, like swapping
out actively used guest pages and needlessly breaking up actively
used hugepages.
We can avoid those very costly side effects by emulating the
accessed bit for EPT PTEs, which should only be slightly costly
because pages pass through page_referenced infrequently.
TLB flushing is taken care of by kvm_mmu_notifier_clear_flush_young().
This seems to help prevent KVM guests from being swapped out when
they should not on my system.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch introduces a generic function to find out the
host page size for a given gfn. This function is needed by
the kvm iommu code. This patch also simplifies the x86
host_mapping_level function.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If we fail to alloc page for vcpu->arch.mmu.pae_root, call to
free_mmu_pages() is unnecessary, which just do free the page
malloc for vcpu->arch.mmu.pae_root.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
There are two spellings of "writable" in
arch/x86/kvm/mmu.c and paging_tmpl.h .
This patch renames is_writeble_pte() to is_writable_pte()
and makes grepping easy.
New name is consistent with the definition of itself:
return pte & PT_WRITABLE_MASK;
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
Since we'd like to allow the guest to own a few bits of cr0 at times, we need
to know when we access those bits.
Signed-off-by: Avi Kivity <avi@redhat.com>
Use two steps for memslot deletion: mark the slot invalid (which stops
instantiation of new shadow pages for that slot, but allows destruction),
then instantiate the new empty slot.
Also simplifies kvm_handle_hva locking.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Have a pointer to an allocated region inside struct kvm.
[alex: fix ppc book 3s]
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
In the past we've had errors of single-bit in the other two cases; the
printk() may confirm it for the third case (many->many).
Signed-off-by: Avi Kivity <avi@redhat.com>
When found a error hva, should not return PAGE_SIZE but the level...
Also clean up the coding style of the following loop.
Cc: stable@kernel.org
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Usually userspace will freeze the guest so we can inspect it, but some
internal state is not available. Add extra data to internal error
reporting so we can expose it to the debugger. Extra data is specific
to the suberror.
Signed-off-by: Avi Kivity <avi@redhat.com>
On a 32 bits compile, commit 3da0dd433d
introduced the following warnings:
arch/x86/kvm/mmu.c: In function ‘kvm_set_pte_rmapp’:
arch/x86/kvm/mmu.c:770: warning: cast to pointer from integer of different size
arch/x86/kvm/mmu.c: In function ‘kvm_set_spte_hva’:
arch/x86/kvm/mmu.c:849: warning: cast from pointer to integer of different size
The following patch uses 'unsigned long' instead of u64 to match the
pointer size on both arches.
Signed-off-by: Frederik Deweerdt <frederik.deweerdt@xprog.eu>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
this is needed for kvm if it want ksm to directly map pages into its
shadow page tables.
[marcelo: cast pfn assignment to u64]
Signed-off-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
this flag notify that the host physical page we are pointing to from
the spte is write protected, and therefore we cant change its access
to be write unless we run get_user_pages(write = 1).
(this is needed for change_pte support in kvm)
Signed-off-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
When using mmu notifiers, we are allowed to remove the page count
reference tooken by get_user_pages to a specific page that is mapped
inside the shadow page tables.
This is needed so we can balance the pagecount against mapcount
checking.
(Right now kvm increase the pagecount and does not increase the
mapcount when mapping page into shadow page table entry,
so when comparing pagecount against mapcount, you have no
reliable result.)
Signed-off-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We know no pages are protected, so we can short-circuit the whole thing
(including fairly nasty guest memory accesses).
Signed-off-by: Avi Kivity <avi@redhat.com>
Remove the bogus n_free_mmu_pages assignment from alloc_mmu_pages.
It breaks accounting of mmu pages, since n_free_mmu_pages is modified
but the real number of pages remains the same.
Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
First check if the list is empty before attempting to look at list
entries.
Cc: stable@kernel.org
Signed-off-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch adds support for shadow paging to the 1gb page table code in KVM.
With this code the guest can use 1gb pages even if the host does not support
them.
[ Marcelo: fix shadow page collision on pmd level if a guest 1gb page is mapped
with 4kb ptes on host level ]
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The page walker may be used with nested paging too when accessing mmio
areas. Make it support the additional page-level too.
[ Marcelo: fix reserved bit check for 1gb pte ]
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
With the new name and the corresponding backend changes this function
can now support multiple hugepage sizes.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch removes the largepage parameter from the rmap_add function.
Together with rmap_remove this function now uses the role.level field to
find determine if the page is a huge page.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
n_requested_mmu_pages/n_free_mmu_pages are used by
kvm_mmu_change_mmu_pages to calculate the number of pages to zap.
alloc_mmu_pages, called from the vcpu initialization path, modifies this
variables without proper locking, which can result in a negative value
in kvm_mmu_change_mmu_pages (say, with cpu hotplug).
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
set_cr3() should already cover the TLB flushing.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This way there is no need to add explicit checks in every
for_each_shadow_entry user.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
- Fail early in case gfn_to_pfn returns is_error_pfn.
- For the pre pte write case, avoid spurious "gva is valid but spte is notrap"
messages (the emulation code does the guest write first, so this particular
case is OK).
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Under testing, count_writable_mappings returns a value that is 2 integers
larger than what count_rmaps returns.
Suspicion is that either of the two functions is counting a duplicate (either
positively or negatively).
Modifying check_writable_mappings_rmap to check for rmap existance on
all present MMU pages fails to trigger an error, which should keep Avi
happy.
Also introduce mmu_spte_walk to invoke a callback on all present sptes visible
to the current vcpu, might be useful in the future.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Hiding some of the last largepage / level interaction (which is useful
for gbpages and for zero based levels).
Also merge the PT_PAGE_TABLE_LEVEL clearing loop in unlink_children.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We use shadow_pte and spte inconsistently, switch to the shorter spelling.
Rename set_shadow_pte() to __set_spte() to avoid a conflict with the
existing set_spte(), and to indicate its lowlevelness.
Signed-off-by: Avi Kivity <avi@redhat.com>
Since the guest and host ptes can have wildly different format, adjust
the pte accessor names to indicate on which type of pte they operate on.
No functional changes.
Signed-off-by: Avi Kivity <avi@redhat.com>
is_dirty_pte() is used on guest ptes, not shadow ptes, so it needs to avoid
shadow_dirty_mask and use PT_DIRTY_MASK instead.
Misdetecting dirty pages could lead to unnecessarily setting the dirty bit
under EPT.
Signed-off-by: Avi Kivity <avi@redhat.com>
Instead of reloading the pdptrs on every entry and exit (vmcs writes on vmx,
guest memory access on svm) extract them on demand.
Signed-off-by: Avi Kivity <avi@redhat.com>
Otherwise the host can spend too long traversing an rmap chain, which
happens under a spinlock.
Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>