Some architectures just don't have PAGE_KERNEL_EXEC. The mm/nommu.c and
mm/vmalloc.c code have been using PAGE_KERNEL as a fallback for years.
Move this fallback to asm-generic.
Link: http://lkml.kernel.org/r/20180510185507.2439-3-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
link_mem_sections() and walk_memory_range() share most of the code, so
we can use convert link_mem_sections() into a dummy function that calls
walk_memory_range() with a callback to register_mem_sect_under_node().
This patch converts register_mem_sect_under_node() in order to match a
walk_memory_range's callback, getting rid of the check_nid argument and
checking instead if the system is still boothing, since we only have to
check for the nid if the system is in such state.
Link: http://lkml.kernel.org/r/20180622111839.10071-4-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When hotplugging memory, it is possible that two calls are being made to
register_mem_sect_under_node().
One comes from __add_section()->hotplug_memory_register() and the other
from add_memory_resource()->link_mem_sections() if we had to register a
new node.
In case we had to register a new node, hotplug_memory_register() will
only handle/allocate the memory_block's since
register_mem_sect_under_node() will return right away because the node
it is not online yet.
I think it is better if we leave hotplug_memory_register() to
handle/allocate only memory_block's and make link_mem_sections() to call
register_mem_sect_under_node().
So this patch removes the call to register_mem_sect_under_node() from
hotplug_memory_register(), and moves the call to link_mem_sections() out
of the condition, so it will always be called. In this way we only have
one place where the memory sections are registered.
Link: http://lkml.kernel.org/r/20180622111839.10071-3-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a small cleanup for the memhotplug code. A lot more could be
done, but it is better to start somewhere. I tried to unify/remove
duplicated code.
The following is what this patchset does:
1) add_memory_resource() has code to allocate a node in case it was
offline. Since try_online_node has some code for that as well, I just
made add_memory_resource() to use that so we can remove duplicated
code.. This is better explained in patch 1/4.
2) register_mem_sect_under_node() will be called only from
link_mem_sections()
3) Make register_mem_sect_under_node() a callback of
walk_memory_range()
4) Drop unnecessary checks from register_mem_sect_under_node()
I have done some tests and I could not see anything broken because of
this patchset.
add_memory_resource() contains code to allocate a new node in case it is
necessary. Since try_online_node() also has some code for this purpose,
let us make use of that and remove duplicate code.
This introduces __try_online_node(), which is called by
add_memory_resource() and try_online_node(). __try_online_node() has
two new parameters, start_addr of the node, and if the node should be
onlined and registered right away. This is always wanted if we are
calling from do_cpu_up(), but not when we are calling from memhotplug
code. Nothing changes from the point of view of the users of
try_online_node(), since try_online_node passes start_addr=0 and
online_node=true to __try_online_node().
Link: http://lkml.kernel.org/r/20180622111839.10071-2-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__list_lru_count_one() has a single callsite.
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
shadow_lru_isolate() disables interrupts and acquires a lock. It could
use spin_lock_irq() instead. It also uses local_irq_enable() while it
could use spin_unlock_irq()/xa_unlock_irq().
Use proper suffix for lock/unlock in order to enable/disable interrupts
during release/acquire of a lock.
Link: http://lkml.kernel.org/r/20180622151221.28167-3-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: use irq locking suffix instead local_irq_disable()".
A small series which avoids using local_irq_disable()/local_irq_enable()
but instead does spin_lock_irq()/spin_unlock_irq() so it is within the
context of the lock which it belongs to. Patch #1 is a cleanup where
local_irq_.*() remained after the lock was removed.
This patch (of 2):
In 0c7c1bed7e ("mm: make counting of list_lru_one::nr_items lockless")
the
spin_lock(&nlru->lock);
statement was replaced with
rcu_read_lock();
in __list_lru_count_one(). The comment in count_shadow_nodes() says
that the local_irq_disable() is required because the lock must be
acquired with disabled interrupts and (spin_lock()) does not do so.
Since the lock is replaced with rcu_read_lock() the local_irq_disable()
is no longer needed. The code path is
list_lru_shrink_count()
-> list_lru_count_one()
-> __list_lru_count_one()
-> rcu_read_lock()
-> list_lru_from_memcg_idx()
-> rcu_read_unlock()
Remove the local_irq_disable() statement.
Link: http://lkml.kernel.org/r/20180622151221.28167-2-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no real reason to blow up just because the caller doesn't know
that __get_free_pages cannot return highmem pages. Simply fix that up
silently. Even if we have some confused users such a fixup will not be
harmful.
[akpm@linux-foundation.org: mask off __GFP_HIGHMEM]
Link: http://lkml.kernel.org/r/20180622162841.25114-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jiankang Chen <chenjiankang1@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is to take better advantage of the general huge page copying
optimization. Where, the target subpage will be copied last to avoid
the cache lines of target subpage to be evicted when copying other
subpages. This works better if the address of the target subpage is
available when copying huge page. So hugetlbfs page fault handlers are
changed to pass that information to hugetlb_cow(). This will benefit
workloads which don't access the begin of the hugetlbfs huge page after
the page fault under heavy cache contention.
Link: http://lkml.kernel.org/r/20180524005851.4079-5-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Punit Agrawal <punit.agrawal@arm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To take better advantage of general huge page copying optimization, the
target subpage address will be passed to hugetlb_cow(), then
copy_user_huge_page(). So we will use both target subpage address and
huge page size aligned address in hugetlb_cow(). To distinguish between
them, "haddr" is used for huge page size aligned address to be
consistent with Transparent Huge Page naming convention.
Now, only huge page size aligned address is used in hugetlb_cow(), so
the "address" is renamed to "haddr" in hugetlb_cow() in this patch.
Next patch will use target subpage address in hugetlb_cow() too.
The patch is just code cleanup without any functionality changes.
Link: http://lkml.kernel.org/r/20180524005851.4079-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Punit Agrawal <punit.agrawal@arm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Huge page helps to reduce TLB miss rate, but it has higher cache
footprint, sometimes this may cause some issue. For example, when
copying huge page on x86_64 platform, the cache footprint is 4M. But on
a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC
(last level cache). That is, in average, there are 2.5M LLC for each
core and 1.25M LLC for each thread.
If the cache contention is heavy when copying the huge page, and we copy
the huge page from the begin to the end, it is possible that the begin
of huge page is evicted from the cache after we finishing copying the
end of the huge page. And it is possible for the application to access
the begin of the huge page after copying the huge page.
In c79b57e462 ("mm: hugetlb: clear target sub-page last when clearing
huge page"), to keep the cache lines of the target subpage hot, the
order to clear the subpages in the huge page in clear_huge_page() is
changed to clearing the subpage which is furthest from the target
subpage firstly, and the target subpage last. The similar order
changing helps huge page copying too. That is implemented in this
patch. Because we have put the order algorithm into a separate
function, the implementation is quite simple.
The patch is a generic optimization which should benefit quite some
workloads, not for a specific use case. To demonstrate the performance
benefit of the patch, we tested it with vm-scalability run on
transparent huge page.
With this patch, the throughput increases ~16.6% in vm-scalability
anon-cow-seq test case with 36 processes on a 2 socket Xeon E5 v3 2699
system (36 cores, 72 threads). The test case set
/sys/kernel/mm/transparent_hugepage/enabled to be always, mmap() a big
anonymous memory area and populate it, then forked 36 child processes,
each writes to the anonymous memory area from the begin to the end, so
cause copy on write. For each child process, other child processes
could be seen as other workloads which generate heavy cache pressure.
At the same time, the IPC (instruction per cycle) increased from 0.63 to
0.78, and the time spent in user space is reduced ~7.2%.
Link: http://lkml.kernel.org/r/20180524005851.4079-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Christopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm, huge page: Copy target sub-page last when copy huge
page", v2.
Huge page helps to reduce TLB miss rate, but it has higher cache
footprint, sometimes this may cause some issue. For example, when
copying huge page on x86_64 platform, the cache footprint is 4M. But on
a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC
(last level cache). That is, in average, there are 2.5M LLC for each
core and 1.25M LLC for each thread.
If the cache contention is heavy when copying the huge page, and we copy
the huge page from the begin to the end, it is possible that the begin
of huge page is evicted from the cache after we finishing copying the
end of the huge page. And it is possible for the application to access
the begin of the huge page after copying the huge page.
In c79b57e462 ("mm: hugetlb: clear target sub-page last when clearing
huge page"), to keep the cache lines of the target subpage hot, the
order to clear the subpages in the huge page in clear_huge_page() is
changed to clearing the subpage which is furthest from the target
subpage firstly, and the target subpage last. The similar order
changing helps huge page copying too. That is implemented in this
patchset.
The patchset is a generic optimization which should benefit quite some
workloads, not for a specific use case. To demonstrate the performance
benefit of the patchset, we have tested it with vm-scalability run on
transparent huge page.
With this patchset, the throughput increases ~16.6% in vm-scalability
anon-cow-seq test case with 36 processes on a 2 socket Xeon E5 v3 2699
system (36 cores, 72 threads). The test case set
/sys/kernel/mm/transparent_hugepage/enabled to be always, mmap() a big
anonymous memory area and populate it, then forked 36 child processes,
each writes to the anonymous memory area from the begin to the end, so
cause copy on write. For each child process, other child processes
could be seen as other workloads which generate heavy cache pressure.
At the same time, the IPC (instruction per cycle) increased from 0.63 to
0.78, and the time spent in user space is reduced ~7.2%.
This patch (of 4):
In c79b57e462 ("mm: hugetlb: clear target sub-page last when clearing
huge page"), to keep the cache lines of the target subpage hot, the
order to clear the subpages in the huge page in clear_huge_page() is
changed to clearing the subpage which is furthest from the target
subpage firstly, and the target subpage last. This optimization could
be applied to copying huge page too with the same order algorithm. To
avoid code duplication and reduce maintenance overhead, in this patch,
the order algorithm is moved out of clear_huge_page() into a separate
function: process_huge_page(). So that we can use it for copying huge
page too.
This will change the direct calls to clear_user_highpage() into the
indirect calls. But with the proper inline support of the compilers,
the indirect call will be optimized to be the direct call. Our tests
show no performance change with the patch.
This patch is a code cleanup without functionality change.
Link: http://lkml.kernel.org/r/20180524005851.4079-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Christopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed is used
to record the counter of collapsed THP, but it just gets inc'ed in
anonymous THP collapse path, do this for shmem THP collapse too.
Link: http://lkml.kernel.org/r/1529622949-75504-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When merging anonymous page vma, if the size of the vma can fit in at
least one hugepage, the mm will be registered for khugepaged for
collapsing THP in the future.
But it skips shmem vmas. Do so for shmem also, but not for file-private
mappings when merging a vma in order to increase the odds of collapsing
a hugepage via khugepaged.
hugepage_vma_check() sounds like a good fit to do the check. And move
the definition of it before khugepaged_enter_vma_merge() to avoid a
build error.
Link: http://lkml.kernel.org/r/1529697791-6950-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The argument "gfp_t flags" is not used in kasan_unpoison_element() and
remove_element(), so remove it.
Link: http://lkml.kernel.org/r/20180621070332.16633-1-baijiaju1990@gmail.com
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use smaller scan_control fields for order, priority, and reclaim_idx.
Convert fields from int => s8. All easily fit within a byte:
- allocation order range: 0..MAX_ORDER(64?)
- priority range: 0..12(DEF_PRIORITY)
- reclaim_idx range: 0..6(__MAX_NR_ZONES)
Since 6538b8ea88 ("x86_64: expand kernel stack to 16K") x86_64 stack
overflows are not an issue. But it's inefficient to use ints.
Use s8 (signed byte) rather than u8 to allow for loops like:
do {
...
} while (--sc.priority >= 0);
Add BUILD_BUG_ON to verify that s8 is capable of storing max values.
This reduces sizeof(struct scan_control):
- 96 => 80 bytes (x86_64)
- 68 => 56 bytes (i386)
scan_control structure field order is changed to utilize padding. After
this patch there is 1 bit of scan_control padding.
akpm: makes my vmscan.o's .text 572 bytes smaller as well.
Link: http://lkml.kernel.org/r/20180530061212.84915-1-gthelen@google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lookup_page_ext() finds 'struct page_ext' for a given page. It requires
only read access to the given struct page.
Current implemnentation takes 'struct page *' as an argument. It makes
compiler complain when 'const struct page *' passed.
Change the argument to 'const struct page *'.
Link: http://lkml.kernel.org/r/20180531135457.20167-3-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_seconds() is deprecated because it will lead to a 32-bit overflow in
2038 or 2106. We don't need the i_generation to be strictly monotonic
anyway, and other file systems like ext4 and xfs just use prandom_u32(),
so let's use the same one here.
If this is considered too slow, we could also use ktime_get_seconds() or
ktime_get_real_seconds() to keep the previous behavior. Both of these
return a time64_t and are not deprecated, but only return a unique value
once per second, and are predictable.
Link: http://lkml.kernel.org/r/20180620082556.581543-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__alloc_pages_slowpath() has for a long time contained code to ignore
node restrictions from memory policies for high priority allocations.
The current code that resets the zonelist iterator however does
effectively nothing after commit 7810e6781e ("mm, page_alloc: do not
break __GFP_THISNODE by zonelist reset") removed a buggy zonelist reset.
Even before that commit, mempolicy restrictions were still not ignored,
as they are passed in ac->nodemask which is untouched by the code.
We can either remove the code, or make it work as intended. Since
ac->nodemask can be set from task's mempolicy via alloc_pages_current()
and thus also alloc_pages(), it may indeed affect kernel allocations,
and it makes sense to ignore it to allow progress for high priority
allocations.
Thus, this patch resets ac->nodemask to NULL in such cases. This
assumes all callers can handle it (i.e. there are no guarantees as in
the case of __GFP_THISNODE) which seems to be the case. The same
assumption is already present in check_retry_cpuset() for some time.
The expected effect is that high priority kernel allocations in the
context of userspace tasks (e.g. OOM victims) restricted by mempolicies
will have higher chance to succeed if they are restricted to nodes with
depleted memory, while there are other nodes with free memory left.
It's not a new intention, but for the first time the code will match the
intention, AFAICS. It was intended by commit 183f6371aa ("mm: ignore
mempolicies when using ALLOC_NO_WATERMARK") in v3.6 but I think it never
really worked, as mempolicy restriction was already encoded in nodemask,
not zonelist, at that time.
So originally that was for ALLOC_NO_WATERMARK only. Then it was
adjusted by e46e7b77c9 ("mm, page_alloc: recalculate the preferred
zoneref if the context can ignore memory policies") and cd04ae1e2d
("mm, oom: do not rely on TIF_MEMDIE for memory reserves access") to the
current state. So even GFP_ATOMIC would now ignore mempolicies after
the initial attempts fail - if the code worked as people thought it
does.
Link: http://lkml.kernel.org/r/20180612122624.8045-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit eca56ff906 ("mm, shmem: add internal shmem resident
memory accounting"), MM_SHMEMPAGES is added to separate the shmem
accounting from regular files. So, all shmem pages should be accounted
to MM_SHMEMPAGES instead of MM_FILEPAGES.
And, normal 4K shmem pages have been accounted to MM_SHMEMPAGES, so
shmem thp pages should be not treated differently. Account them to
MM_SHMEMPAGES via mm_counter_file() since shmem pages are swap backed to
keep consistent with normal 4K shmem pages.
This will not change the rss counter of processes since shmem pages are
still a part of it.
The /proc/pid/status and /proc/pid/statm counters will however be more
accurate wrt shmem usage, as originally intended. And as eca56ff906
("mm, shmem: add internal shmem resident memory accounting") mentioned,
oom also could report more accurate "shmem-rss".
Link: http://lkml.kernel.org/r/1529442518-17398-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The role of zero_resv_unavail() is to make sure that every struct page
that is allocated but is not backed by memory that is accessible by
kernel is zeroed and not in some uninitialized state.
Since struct pages are allocated in blocks (2M pages in x86 case), we
can skip pageblock_nr_pages at a time, when the first one is found to be
invalid.
This optimization may help since now on x86 every hole in e820 maps is
marked as reserved in memblock, and thus will go through this function.
This function is called before sched_clock() is initialized, so I used
my x86 early boot clock patches to measure the performance improvement.
With 1T hole on i7-8700 currently we would take 0.606918s of boot time,
but with this optimization 0.001103s.
Link: http://lkml.kernel.org/r/20180615155733.1175-1-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.
Ref-> commit 1c8f422059 ("mm: change return type to vm_fault_t")
In this patch all the caller of handle_mm_fault() are changed to return
vm_fault_t type.
Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: James Hogan <jhogan@kernel.org>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: David S. Miller <davem@davemloft.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In SLUB, prefetch_freepointer() is used when allocating an object from
cache's freelist, to make sure the next object in the list is cache-hot,
since it's probable it will be allocated soon.
Commit 2482ddec67 ("mm: add SLUB free list pointer obfuscation") has
unintentionally changed the prefetch in a way where the prefetch is
turned to a real fetch, and only the next->next pointer is prefetched.
In case there is not a stream of allocations that would benefit from
prefetching, the extra real fetch might add a useless cache miss to the
allocation. Restore the previous behavior.
Link: http://lkml.kernel.org/r/20180809085245.22448-1-vbabka@suse.cz
Fixes: 2482ddec67 ("mm: add SLUB free list pointer obfuscation")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is reworked from an earlier patch that Dan has posted:
https://patchwork.kernel.org/patch/10131727/
VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that
the memory page it is dealing with is not typical memory from the linear
map. The get_user_pages_fast() path, since it does not resolve the vma,
is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
use that as a VM_MIXEDMAP replacement in some locations. In the cases
where there is no pte to consult we fallback to using vma_is_dax() to
detect the VM_MIXEDMAP special case.
Now that we have explicit driver pfn_t-flag opt-in/opt-out for
get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This
also means we no longer need to worry about safely manipulating vm_flags
in a future where we support dynamically changing the dax mode of a
file.
DAX should also now be supported with madvise_behavior(), vma_merge(),
and copy_page_range().
This patch has been tested against ndctl unit test. It has also been
tested against xfstests commit: 625515d using fake pmem created by
memmap and no additional issues have been observed.
Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the source statements of arch-independent Kconfig files instead of
duplicating the includes in every arch/$(SRCARCH)/Kconfig.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbdFsfAAoJED2LAQed4NsGxHsP/1tmA57OOOj8oGxO2OXhXVbr
Q0MZqCoV4bqMvK/hgCQdl9f+tp0m+j12x4xDLdVf4OqnTXMbqvPDu3uQVKvaj/k1
gHhsFA1tFgSbuJ8InltUsrPEQqbceeJsj50xHVAKijqI6LYeRPPSU7aE9obn+OzH
n2nd5sLKvMI/dqdJvW6i5KPydqTH3r3iA7D+ne/XQj0s0EMXvXUPmDT1+ijTnM4a
yfm6W5p7L/c3Ugf1Pz5PfnPl4BxBwZMfW5ie/UO8j5C6Rl0iPaOGuuHurocaaJb3
MefR/7NEAR3G8MhJyL2+70jbbwhjpqR2b5ooz1vpuulPHxjeU45BY60XIBWq1afR
ewsc12MMCYB695ieYWoHdaWgxD/jhffyRuajfpkXKIZEMgDxS03sMhdULXENVMx1
M0ZQ01g/NLWt9ti9DY3eTKB3ymOhnBa1sa77nGGUHkITq4DQKwPX1J9FP/HT6RNt
uOvzeH5kGzc7tqOlZAO0kHbwhQG1uqGcd78IYd4lgf/XfkSgDERTWjnJmnQbwr9m
3PFuST2u8eyO+8Lh1MK76TXOEkXsHMdFugPmb6SlgtMEPKGVLDPlsj52o/LFtgzl
eygfMiBFr2+ttkZ6IpNcpmQ4IztmDpz6XoMk3PqDAfUTUSYpCnq1gAEuff/eisCM
Odva1ZZaeQ7WpxhsP8rr
=gsQJ
-----END PGP SIGNATURE-----
Merge tag 'kconfig-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kconfig consolidation from Masahiro Yamada:
"Consolidation of Kconfig files by Christoph Hellwig.
Move the source statements of arch-independent Kconfig files instead
of duplicating the includes in every arch/$(SRCARCH)/Kconfig"
* tag 'kconfig-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: add a Memory Management options" menu
kconfig: move the "Executable file formats" menu to fs/Kconfig.binfmt
kconfig: use a menu in arch/Kconfig to reduce clutter
kconfig: include kernel/Kconfig.preempt from init/Kconfig
Kconfig: consolidate the "Kernel hacking" menu
kconfig: include common Kconfig files from top-level Kconfig
kconfig: remove duplicate SWAP symbol defintions
um: create a proper drivers Kconfig
um: cleanup Kconfig files
um: stop abusing KBUILD_KCONFIG
- add "hardened_usercopy=off" rare performance needs (Chris von Recklinghausen)
-----BEGIN PGP SIGNATURE-----
Comment: Kees Cook <kees@outflux.net>
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAltx6hAWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJrrsEAChFhTgko1nNKYhks9KIIMZ7YWc
bCWpXMnBkmTbPa192a/4aDvvwuor5EFDavWY+vEciOvT2iY6h6uus/BzKB5JlHZ9
QsZS2uLr6SJX76Ri2r8alWT0hWovp/tFopXfnFt4fOHgSK+6rcWJRFzFefsZkcYd
xNEw2HnS0kYpgw0aEe3BsnsEn6u0/CxzyGTv6OLcnXU5riOkFUqm8ehLSA44aJW4
cfqWmdelfhvs0thR0rJItUUUmhVM3i6Zccvv0HCt6z8Xz9LIZgyxnnD9Ac7mGz8y
WjNPipLqXhu8/JVsd0Y6GK6b8bYh8uNID20fgr/6aWDZkOvUHe54/ChCkjs7cW6F
JWGn1hS1tg75rdw09tr4POVw4tUIe1JcqCfsJ7IzXA7oc6PsXzlGl8USDtK9f/fK
ryC60NQKo1dXGlY+18i1iw7HsMuWbtaIiWf8Zudy7JethDn3RbHshyF5tGpx0nFB
/qRTtMaC5WqIfZAbVb1Qou71gJzmS+k/RjltCO0AnhZrvFr0Qq3eQKRTkGhzOKRq
1dvOHb9ScNeehlQeaC+k0mm8ANf16gzXSGmGg3Z/7LfECbCqc7R7B767dN52hx2X
48P5cDNKUuXgHNk+p20Yr5m16oJDkAOxSHvFN9Kizy/eL7RbgOZREQcB4an9S+A0
yb6uQKU9CQ3n/NSZyA==
=j2xG
-----END PGP SIGNATURE-----
Merge tag 'hardened-usercopy-v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardened usercopy updates from Kees Cook:
"This cleans up a minor Kconfig issue and adds a kernel boot option for
disabling hardened usercopy for distro users that may have corner-case
performance issues (e.g. high bandwidth small-packet UDP traffic).
Summary:
- drop unneeded Kconfig "select BUG" (Kamal Mostafa)
- add "hardened_usercopy=off" rare performance needs (Chris von
Recklinghausen)"
* tag 'hardened-usercopy-v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
usercopy: Allow boot cmdline disabling of hardening
usercopy: Do not select BUG with HARDENED_USERCOPY
small fixes and updates. We also have new ktime_get_*() docs from Arnd,
some kernel-doc fixes, a new set of Italian translations (non so se vale la
pena, ma non fa male - speriamo bene), and some extensive early
memory-management documentation improvements from Mike Rapoport.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbcZVtAAoJEI3ONVYwIuV64ekP/jgTlMi/fErRu6zlsrsWgiIK
ir8ueCQ1OSiwOA+N2fUb+2+zlLlfTLgQ+o5IwmZk6rizG87fQ3Rp6i+bvYAZWITh
YUuls3VhtRlJZqG1EW7gww1Q2IhRO6GhcpsIamAvhrSLFPaCKiN3JomJi/X47Pfj
Ibl24HsruI2fDM1JwWRwCtE5J6vCL9lH1/5v4zVv7xdrVgTrwkZ/hAsE7HBNNat5
dSku2u9HSAXa4KR4sLWrVJ8UI5+fylwilz/57HhCeduQDwKCHE/mfhxLdqL4Oa4q
oHTCNq2zTUj4w7GTvHS1g0P3y/iWMYjAzH2is+BokilpIC65NwwsKx2ybZd3Srdh
zwP/kYk5U+mYSgdDlyNqwPCibw8KDXB3srKMzyQSN6tkosKCOHFSXF0Js0eupi7t
NqmGigl3Qozj1uvU6Wy7vh58u+GFeuO4PF566t2m70Jp0cWzuVKLrBvgNO1X37rB
aEBrpOYB/H54t/qf79IFW//pptWXFNZ3S9AgyDVIcmX5C2ihaCoaPNRTom+KbH/D
QEoH9rwWSoCi2DGoR83D+G8thCUfB4yfEGulSSIA4pUR7qvIR5rd1ZioI/qtgAHm
l7MjTbLpPwiMnpFkBrxxxlFFb4gbETakMBGYoYee8ww5WbQLu0qA93AbwIXyjhE8
mqCOLyBdCAZ0mNxqPSsc
=x/P0
-----END PGP SIGNATURE-----
Merge tag 'docs-4.19' of git://git.lwn.net/linux
Pull documentation update from Jonathan Corbet:
"This was a moderately busy cycle for docs, with the usual collection
of small fixes and updates.
We also have new ktime_get_*() docs from Arnd, some kernel-doc fixes,
a new set of Italian translations (non so se vale la pena, ma non fa
male - speriamo bene), and some extensive early memory-management
documentation improvements from Mike Rapoport"
* tag 'docs-4.19' of git://git.lwn.net/linux: (52 commits)
Documentation: corrections to console/console.txt
Documentation: add ioctl number entry for v4l2-subdev.h
Remove gendered language from management style documentation
scripts/kernel-doc: Escape all literal braces in regexes
docs/mm: add description of boot time memory management
docs/mm: memblock: add overview documentation
docs/mm: memblock: add kernel-doc description for memblock types
docs/mm: memblock: add kernel-doc comments for memblock_add[_node]
docs/mm: memblock: update kernel-doc comments
mm/memblock: add a name for memblock flags enumeration
docs/mm: bootmem: add overview documentation
docs/mm: bootmem: add kernel-doc description of 'struct bootmem_data'
docs/mm: bootmem: fix kernel-doc warnings
docs/mm: nobootmem: fixup kernel-doc comments
mm/bootmem: drop duplicated kernel-doc comments
Documentation: vm.txt: Adding 'nr_hugepages_mempolicy' parameter description.
doc:it_IT: translation for kernel-hacking
docs: Fix the reference labels in Locking.rst
doc: tracing: Fix a typo of trace_stat
mm: Introduce new type vm_fault_t
...
- Add a new framework for CPU idle time injection (Daniel Lezcano).
- Add AVS support to the armada-37xx cpufreq driver (Gregory CLEMENT).
- Add support for current CPU frequency reporting to the ACPI CPPC
cpufreq driver (George Cherian).
- Rework the cooling device registration in the imx6q/thermal
driver (Bastian Stender).
- Make the pcc-cpufreq driver refuse to work with dynamic
scaling governors on systems with many CPUs to avoid
scalability issues with it (Rafael Wysocki).
- Fix the intel_pstate driver to report different maximum CPU
frequencies on systems where they really are different and to
ignore the turbo active ratio if hardware-managend P-states (HWP)
are in use; make it use the match_string() helper (Xie Yisheng,
Srinivas Pandruvada).
- Fix a minor deferred probe issue in the qcom-kryo cpufreq
driver (Niklas Cassel).
- Add a tracepoint for the tracking of frequency limits changes
(from Andriod) to the cpufreq core (Ruchi Kandoi).
- Fix a circular lock dependency between CPU hotplug and sysfs
locking in the cpufreq core reported by lockdep (Waiman Long).
- Avoid excessive error reports on driver registration failures
in the ARM cpuidle driver (Sudeep Holla).
- Add a new device links flag to the driver core to make links go
away automatically on supplier driver removal (Vivek Gautam).
- Eliminate potential race condition between system-wide power
management transitions and system shutdown (Pingfan Liu).
- Add a quirk to save NVS memory on system suspend for the ASUS
1025C laptop (Willy Tarreau).
- Make more systems use suspend-to-idle (instead of ACPI S3) by
default (Tristian Celestin).
- Get rid of stack VLA usage in the low-level hibernation code on
64-bit x86 (Kees Cook).
- Fix error handling in the hibernation core and mark an expected
fall-through switch in it (Chengguang Xu, Gustavo Silva).
- Extend the generic power domains (genpd) framework to support
attaching a device to a power domain by name (Ulf Hansson).
- Fix device reference counting and user limits initialization in
the devfreq core (Arvind Yadav, Matthias Kaehlcke).
- Fix a few issues in the rk3399_dmc devfreq driver and improve its
documentation (Enric Balletbo i Serra, Lin Huang, Nick Milner).
- Drop a redundant error message from the exynos-ppmu devfreq driver
(Markus Elfring).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJbcqOqAAoJEILEb/54YlRxOxMP/2ZFvnXU0pey/VX/+TelLMS7
/ROVGQ+s75QP1c9P/3BjvnXc0dsMRLRFPog+7wyoG/2DbEIV25COyAYsmSE0TRni
XUaZO6YAx4/e3pm2AfamYbLCPvjw85eucHg5QJQ4b1mSVRNJOsNv+fUo6lmxwvnm
j9kHvfttFeIhoa/3wa7hbhPKLln46atnpVSxCIceY7L5EFNhkKBvQt6B5yx9geb9
QMY6ohgkyN+bnK9QySXX+trcWpzx1uGX0apI07NkX7n9QGFdU4lCW8lsAf8jMC3g
PPValTsUQsdRONUJJsrgqBioq4tvtgQWibyS2tfRrOGXYvHpJNpGmHVplfsrf/SE
cvlsciR47YbmrXZuqg/r8hql+qefNN16/rnZIZ9VnbcG806VBy2z8IzI5wcdWR7p
vzxhbCqVqOHcEdEwRwvuM2io67MWvkGtKsbCP+33DBh8SubpsECpKN4nIDboa3SE
CJ15RUqXnF6enmmfCKOoHZeu7iXWDz6Pi71XmRzaj9DqbITVV281IerqLgV3rbal
BVa53+202iD0IP+2b7KedGe/5ALlI97ffN0gB+L/eB832853DKSZQKzcvvpRhEN7
Iv2crnUwuQED9ns8P7hzp1Bk9CFCAOLW8UM43YwZRPWnmdeSsPJusJ5lzkAf7bss
wfsFoUE3RaY4msnuHyCh
=kv2M
-----END PGP SIGNATURE-----
Merge tag 'pm-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These add a new framework for CPU idle time injection, to be used by
all of the idle injection code in the kernel in the future, fix some
issues and add a number of relatively small extensions in multiple
places.
Specifics:
- Add a new framework for CPU idle time injection (Daniel Lezcano).
- Add AVS support to the armada-37xx cpufreq driver (Gregory
CLEMENT).
- Add support for current CPU frequency reporting to the ACPI CPPC
cpufreq driver (George Cherian).
- Rework the cooling device registration in the imx6q/thermal driver
(Bastian Stender).
- Make the pcc-cpufreq driver refuse to work with dynamic scaling
governors on systems with many CPUs to avoid scalability issues
with it (Rafael Wysocki).
- Fix the intel_pstate driver to report different maximum CPU
frequencies on systems where they really are different and to
ignore the turbo active ratio if hardware-managend P-states (HWP)
are in use; make it use the match_string() helper (Xie Yisheng,
Srinivas Pandruvada).
- Fix a minor deferred probe issue in the qcom-kryo cpufreq driver
(Niklas Cassel).
- Add a tracepoint for the tracking of frequency limits changes (from
Andriod) to the cpufreq core (Ruchi Kandoi).
- Fix a circular lock dependency between CPU hotplug and sysfs
locking in the cpufreq core reported by lockdep (Waiman Long).
- Avoid excessive error reports on driver registration failures in
the ARM cpuidle driver (Sudeep Holla).
- Add a new device links flag to the driver core to make links go
away automatically on supplier driver removal (Vivek Gautam).
- Eliminate potential race condition between system-wide power
management transitions and system shutdown (Pingfan Liu).
- Add a quirk to save NVS memory on system suspend for the ASUS 1025C
laptop (Willy Tarreau).
- Make more systems use suspend-to-idle (instead of ACPI S3) by
default (Tristian Celestin).
- Get rid of stack VLA usage in the low-level hibernation code on
64-bit x86 (Kees Cook).
- Fix error handling in the hibernation core and mark an expected
fall-through switch in it (Chengguang Xu, Gustavo Silva).
- Extend the generic power domains (genpd) framework to support
attaching a device to a power domain by name (Ulf Hansson).
- Fix device reference counting and user limits initialization in the
devfreq core (Arvind Yadav, Matthias Kaehlcke).
- Fix a few issues in the rk3399_dmc devfreq driver and improve its
documentation (Enric Balletbo i Serra, Lin Huang, Nick Milner).
- Drop a redundant error message from the exynos-ppmu devfreq driver
(Markus Elfring)"
* tag 'pm-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (35 commits)
PM / reboot: Eliminate race between reboot and suspend
PM / hibernate: Mark expected switch fall-through
cpufreq: intel_pstate: Ignore turbo active ratio in HWP
cpufreq: Fix a circular lock dependency problem
cpu/hotplug: Add a cpus_read_trylock() function
x86/power/hibernate_64: Remove VLA usage
cpufreq: trace frequency limits change
cpufreq: intel_pstate: Show different max frequency with turbo 3 and HWP
cpufreq: pcc-cpufreq: Disable dynamic scaling on many-CPU systems
cpufreq: qcom-kryo: Silently error out on EPROBE_DEFER
cpufreq / CPPC: Add cpuinfo_cur_freq support for CPPC
cpufreq: armada-37xx: Add AVS support
dt-bindings: marvell: Add documentation for the Armada 3700 AVS binding
PM / devfreq: rk3399_dmc: Fix duplicated opp table on reload.
PM / devfreq: Init user limits from OPP limits, not viceversa
PM / devfreq: rk3399_dmc: fix spelling mistakes.
PM / devfreq: rk3399_dmc: do not print error when get supply and clk defer.
dt-bindings: devfreq: rk3399_dmc: move interrupts to be optional.
PM / devfreq: rk3399_dmc: remove wait for dcf irq event.
dt-bindings: clock: add rk3399 DDR3 standard speed bins.
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltwvasQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv65EACTq5gSLnJBI6ZPr1RAHruVDnjfzO2Veitl
tUtjm0XfWmnEiwQ3dYvnyhy99xbyaG3900d9BClCTlH6xaUdSiQkDpcKG/R2F36J
5mZitYukQcpFAQJWF8YKsTTE7JPl4VglCIDqYiC4+C3rOSVi8lrKn2qp4J4MMCFn
thRg3jCcq7c5s9Eigsop1pXWQSasubkXfk55Krcp4oybKYpYRKXXf74Mj14QAbwJ
QHN3VisyAUWoBRg7UQZo1Npe2oPk6bbnJypnjf8M0M2EnlvddEkIlHob91sodka8
6p4APOEu5cbyXOBCAQsw/koff14mb8aEadqeQA68WvXfIdX9ZjfxCX0OoC3sBEXk
yqJhZ0C980AM13zIBD8ejv4uasGcPca8W+47mE5P8sRiI++5kBsFWDZPCtUBna0X
2Kh24NsmEya9XRR5vsB84dsIPQ3tLMkxg/IgQRVDaSnfJz0c/+zm54xDyKRaFT4l
5iERk2WSkm9+8jNfVmWG0edrv6nRAXjpGwFfOCPh6/LCSCi4xQRULYN7sVzsX8ZK
FRjt24HftBI8mJbh4BtweJvg+ppVe1gAk3IO3HvxAQhv29Hz+uvFYe9kL+3N8LJA
Qosr9n9O4+wKYizJcDnw+5iPqCHfAwOm9th4pyedR+R7SmNcP3yNC8AbbheNBiF5
Zolos5H+JA==
=b9ib
-----END PGP SIGNATURE-----
Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"First pull request for this merge window, there will also be a
followup request with some stragglers.
This pull request contains:
- Fix for a thundering heard issue in the wbt block code (Anchal
Agarwal)
- A few NVMe pull requests:
* Improved tracepoints (Keith)
* Larger inline data support for RDMA (Steve Wise)
* RDMA setup/teardown fixes (Sagi)
* Effects log suppor for NVMe target (Chaitanya Kulkarni)
* Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
* TP4004 (ANA) support (Christoph)
* Various NVMe fixes
- Block io-latency controller support. Much needed support for
properly containing block devices. (Josef)
- Series improving how we handle sense information on the stack
(Kees)
- Lightnvm fixes and updates/improvements (Mathias/Javier et al)
- Zoned device support for null_blk (Matias)
- AIX partition fixes (Mauricio Faria de Oliveira)
- DIF checksum code made generic (Max Gurtovoy)
- Add support for discard in iostats (Michael Callahan / Tejun)
- Set of updates for BFQ (Paolo)
- Removal of async write support for bsg (Christoph)
- Bio page dirtying and clone fixups (Christoph)
- Set of bcache fix/changes (via Coly)
- Series improving blk-mq queue setup/teardown speed (Ming)
- Series improving merging performance on blk-mq (Ming)
- Lots of other fixes and cleanups from a slew of folks"
* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
blkcg: Make blkg_root_lookup() work for queues in bypass mode
bcache: fix error setting writeback_rate through sysfs interface
null_blk: add lock drop/acquire annotation
Blk-throttle: reduce tail io latency when iops limit is enforced
block: paride: pd: mark expected switch fall-throughs
block: Ensure that a request queue is dissociated from the cgroup controller
block: Introduce blk_exit_queue()
blkcg: Introduce blkg_root_lookup()
block: Remove two superfluous #include directives
blk-mq: count the hctx as active before allocating tag
block: bvec_nr_vecs() returns value for wrong slab
bcache: trivial - remove tailing backslash in macro BTREE_FLAG
bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
bcache: set max writeback rate when I/O request is idle
bcache: add code comments for bset.c
bcache: fix mistaken comments in request.c
bcache: fix mistaken code comments in bcache.h
bcache: add a comment in super.c
bcache: avoid unncessary cache prefetch bch_btree_node_get()
bcache: display rate debug parameters to 0 when writeback is not running
...
Merge L1 Terminal Fault fixes from Thomas Gleixner:
"L1TF, aka L1 Terminal Fault, is yet another speculative hardware
engineering trainwreck. It's a hardware vulnerability which allows
unprivileged speculative access to data which is available in the
Level 1 Data Cache when the page table entry controlling the virtual
address, which is used for the access, has the Present bit cleared or
other reserved bits set.
If an instruction accesses a virtual address for which the relevant
page table entry (PTE) has the Present bit cleared or other reserved
bits set, then speculative execution ignores the invalid PTE and loads
the referenced data if it is present in the Level 1 Data Cache, as if
the page referenced by the address bits in the PTE was still present
and accessible.
While this is a purely speculative mechanism and the instruction will
raise a page fault when it is retired eventually, the pure act of
loading the data and making it available to other speculative
instructions opens up the opportunity for side channel attacks to
unprivileged malicious code, similar to the Meltdown attack.
While Meltdown breaks the user space to kernel space protection, L1TF
allows to attack any physical memory address in the system and the
attack works across all protection domains. It allows an attack of SGX
and also works from inside virtual machines because the speculation
bypasses the extended page table (EPT) protection mechanism.
The assoicated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646
The mitigations provided by this pull request include:
- Host side protection by inverting the upper address bits of a non
present page table entry so the entry points to uncacheable memory.
- Hypervisor protection by flushing L1 Data Cache on VMENTER.
- SMT (HyperThreading) control knobs, which allow to 'turn off' SMT
by offlining the sibling CPU threads. The knobs are available on
the kernel command line and at runtime via sysfs
- Control knobs for the hypervisor mitigation, related to L1D flush
and SMT control. The knobs are available on the kernel command line
and at runtime via sysfs
- Extensive documentation about L1TF including various degrees of
mitigations.
Thanks to all people who have contributed to this in various ways -
patches, review, testing, backporting - and the fruitful, sometimes
heated, but at the end constructive discussions.
There is work in progress to provide other forms of mitigations, which
might be less horrible performance wise for a particular kind of
workloads, but this is not yet ready for consumption due to their
complexity and limitations"
* 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
x86/microcode: Allow late microcode loading with SMT disabled
tools headers: Synchronise x86 cpufeatures.h for L1TF additions
x86/mm/kmmio: Make the tracer robust against L1TF
x86/mm/pat: Make set_memory_np() L1TF safe
x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
x86/speculation/l1tf: Invert all not present mappings
cpu/hotplug: Fix SMT supported evaluation
KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
x86/speculation: Simplify sysfs report of VMX L1TF vulnerability
Documentation/l1tf: Remove Yonah processors from not vulnerable list
x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()
x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
x86: Don't include linux/irq.h from asm/hardirq.h
x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()
cpu/hotplug: detect SMT disabled by BIOS
...
Merge changes in the PM core, system-wide PM infrastructure, generic
power domains (genpd) framework, ACPI PM infrastructure and cpuidle
for 4.19.
* pm-core:
driver core: Add flag to autoremove device link on supplier unbind
driver core: Rename flag AUTOREMOVE to AUTOREMOVE_CONSUMER
* pm-domains:
PM / Domains: Introduce dev_pm_domain_attach_by_name()
PM / Domains: Introduce option to attach a device by name to genpd
PM / Domains: dt: Add a power-domain-names property
* pm-sleep:
PM / reboot: Eliminate race between reboot and suspend
PM / hibernate: Mark expected switch fall-through
x86/power/hibernate_64: Remove VLA usage
PM / hibernate: cast PAGE_SIZE to int when comparing with error code
* acpi-pm:
ACPI / PM: save NVS memory for ASUS 1025C laptop
ACPI / PM: Default to s2idle in all machines supporting LP S0
* pm-cpuidle:
ARM: cpuidle: silence error on driver registration failure
Pull vfs open-related updates from Al Viro:
- "do we need fput() or put_filp()" rules are gone - it's always fput()
now. We keep track of that state where it belongs - in ->f_mode.
- int *opened mess killed - in finish_open(), in ->atomic_open()
instances and in fs/namei.c code around do_last()/lookup_open()/atomic_open().
- alloc_file() wrappers with saner calling conventions are introduced
(alloc_file_clone() and alloc_file_pseudo()); callers converted, with
much simplification.
- while we are at it, saner calling conventions for path_init() and
link_path_walk(), simplifying things inside fs/namei.c (both on
open-related paths and elsewhere).
* 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
few more cleanups of link_path_walk() callers
allow link_path_walk() to take ERR_PTR()
make path_init() unconditionally paired with terminate_walk()
document alloc_file() changes
make alloc_file() static
do_shmat(): grab shp->shm_file earlier, switch to alloc_file_clone()
new helper: alloc_file_clone()
create_pipe_files(): switch the first allocation to alloc_file_pseudo()
anon_inode_getfile(): switch to alloc_file_pseudo()
hugetlb_file_setup(): switch to alloc_file_pseudo()
ocxlflash_getfile(): switch to alloc_file_pseudo()
cxl_getfile(): switch to alloc_file_pseudo()
... and switch shmem_file_setup() to alloc_file_pseudo()
__shmem_file_setup(): reorder allocations
new wrapper: alloc_file_pseudo()
kill FILE_{CREATED,OPENED}
switch atomic_open() and lookup_open() to returning 0 in all success cases
document ->atomic_open() changes
->atomic_open(): return 0 in all success cases
get rid of 'opened' in path_openat() and the helpers downstream
...
Pull x86 PTI updates from Thomas Gleixner:
"The Speck brigade sadly provides yet another large set of patches
destroying the perfomance which we carefully built and preserved
- PTI support for 32bit PAE. The missing counter part to the 64bit
PTI code implemented by Joerg.
- A set of fixes for the Global Bit mechanics for non PCID CPUs which
were setting the Global Bit too widely and therefore possibly
exposing interesting memory needlessly.
- Protection against userspace-userspace SpectreRSB
- Support for the upcoming Enhanced IBRS mode, which is preferred
over IBRS. Unfortunately we dont know the performance impact of
this, but it's expected to be less horrible than the IBRS
hammering.
- Cleanups and simplifications"
* 'x86/pti' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
x86/mm/pti: Move user W+X check into pti_finalize()
x86/relocs: Add __end_rodata_aligned to S_REL
x86/mm/pti: Clone kernel-image on PTE level for 32 bit
x86/mm/pti: Don't clear permissions in pti_clone_pmd()
x86/mm/pti: Fix 32 bit PCID check
x86/mm/init: Remove freed kernel image areas from alias mapping
x86/mm/init: Add helper for freeing kernel image pages
x86/mm/init: Pass unconverted symbol addresses to free_init_pages()
mm: Allow non-direct-map arguments to free_reserved_area()
x86/mm/pti: Clear Global bit more aggressively
x86/speculation: Support Enhanced IBRS on future CPUs
x86/speculation: Protect against userspace-userspace spectreRSB
x86/kexec: Allocate 8k PGDs for PTI
Revert "perf/core: Make sure the ring-buffer is mapped in all page-tables"
x86/mm: Remove in_nmi() warning from vmalloc_fault()
x86/entry/32: Check for VM86 mode in slow-path check
perf/core: Make sure the ring-buffer is mapped in all page-tables
x86/pti: Check the return value of pti_user_pagetable_walk_pmd()
x86/pti: Check the return value of pti_user_pagetable_walk_p4d()
x86/entry/32: Add debug code to check entry/exit CR3
...
Pull x86 mm updates from Thomas Gleixner:
- Make lazy TLB mode even lazier to avoid pointless switch_mm()
operations, which reduces CPU load by 1-2% for memcache workloads
- Small cleanups and improvements all over the place
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Remove redundant check for kmem_cache_create()
arm/asm/tlb.h: Fix build error implicit func declaration
x86/mm/tlb: Make clear_asid_other() static
x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off()
x86/mm/tlb: Always use lazy TLB mode
x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
x86/mm/tlb: Make lazy TLB mode lazier
x86/mm/tlb: Restructure switch_mm_irqs_off()
x86/mm/tlb: Leave lazy TLB mode at page table free time
mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids
x86/mm: Add TLB purge to free pmd/pte page interfaces
ioremap: Update pgtable free interfaces with addr
x86/mm: Disable ioremap free page handling on x86-PAE
ioremap_prot() can return NULL which could lead to an oops.
Link: http://lkml.kernel.org/r/1533195441-58594-1-git-send-email-chenjie6@huawei.com
Signed-off-by: chen jie <chenjie6@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: chenjie <chenjie6@huawei.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At present, "systemctl suspend" and "shutdown" can run in parrallel. A
system can suspend after devices_shutdown(), and resume. Then the shutdown
task goes on to power off. This causes many devices are not really shut
off. Hence replacing reboot_mutex with system_transition_mutex (renamed
from pm_mutex) to achieve the exclusion. The renaming of pm_mutex as
system_transition_mutex can be better to reflect the purpose of the mutex.
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltU8z0eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5X8H/2fJr7m3k242+t76
sitwvx1eoPqTgryW59dRKm9IuXAGA+AjauvHzaz1QxomeQa50JghGWefD0eiJfkA
1AphQ/24EOiAbbVk084dAI/C2p122dE4D5Fy7CrfLnuouyrbFaZI5STbnrRct7sR
9deeYW0GDHO1Uenp4WDCj0baaqJqaevZ+7GG09DnWpya2nQtSkGBjqn6GpYmrfOU
mqFuxAX8mEOW6cwK16y/vYtnVjuuMAiZ63/OJ8AQ6d6ArGLwAsdn7f8Fn4I4tEr2
L0d3CRLUyegms4++Dmlu05k64buQu46WlPhjCZc5/Ts4kjrNxBuHejj2/jeSnUSt
vJJlibI=
=42a5
-----END PGP SIGNATURE-----
Merge tag 'v4.18-rc6' into for-4.19/block2
Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.
Signed-of-by: Jens Axboe <axboe@kernel.dk>
free_reserved_area() takes pointers as arguments to show which addresses
should be freed. However, it does this in a somewhat ambiguous way. If it
gets a kernel direct map address, it always works. However, if it gets an
address that is part of the kernel image alias mapping, it can fail.
It fails if all of the following happen:
* The specified address is part of the kernel image alias
* Poisoning is requested (forcing a memset())
* The address is in a read-only portion of the kernel image
The memset() fails on the read-only mapping, of course.
free_reserved_area() *is* called both on the direct map and on kernel image
alias addresses. We've just lucked out thus far that the kernel image
alias areas it gets used on are read-write. I'm fairly sure this has been
just a happy accident.
It is quite easy to make free_reserved_area() work for all cases: just
convert the address to a direct map address before doing the memset(), and
do this unconditionally. There is little chance of a regression here
because we previously did a virt_to_page() on the address for the memset,
so we know these are not highmem pages for which virt_to_page() would fail.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@google.com
Cc: aarcange@redhat.com
Cc: jgross@suse.com
Cc: jpoimboe@redhat.com
Cc: gregkh@linuxfoundation.org
Cc: peterz@infradead.org
Cc: hughd@google.com
Cc: torvalds@linux-foundation.org
Cc: bp@alien8.de
Cc: luto@kernel.org
Cc: ak@linux.intel.com
Cc: Kees Cook <keescook@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/20180802225826.1287AE3E@viggo.jf.intel.com
Commit 05ea88608d ("mm, hugetlbfs: introduce ->pagesize() to
vm_operations_struct") adds a new ->pagesize() function to
hugetlb_vm_ops, intended to cover all hugetlbfs backed files.
With System V shared memory model, if "huge page" is specified, the
"shared memory" is backed by hugetlbfs files, but the mappings initiated
via shmget/shmat have their original vm_ops overwritten with shm_vm_ops,
so we need to add a ->pagesize function to shm_vm_ops. Otherwise,
vma_kernel_pagesize() returns PAGE_SIZE given a hugetlbfs backed vma,
result in below BUG:
fs/hugetlbfs/inode.c
443 if (unlikely(page_mapped(page))) {
444 BUG_ON(truncate_op);
resulting in
hugetlbfs: oracle (4592): Using mlock ulimits for SHM_HUGETLB is deprecated
------------[ cut here ]------------
kernel BUG at fs/hugetlbfs/inode.c:444!
Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 ...
CPU: 35 PID: 5583 Comm: oracle_5583_sbt Not tainted 4.14.35-1829.el7uek.x86_64 #2
RIP: 0010:remove_inode_hugepages+0x3db/0x3e2
....
Call Trace:
hugetlbfs_evict_inode+0x1e/0x3e
evict+0xdb/0x1af
iput+0x1a2/0x1f7
dentry_unlink_inode+0xc6/0xf0
__dentry_kill+0xd8/0x18d
dput+0x1b5/0x1ed
__fput+0x18b/0x216
____fput+0xe/0x10
task_work_run+0x90/0xa7
exit_to_usermode_loop+0xdd/0x116
do_syscall_64+0x187/0x1ae
entry_SYSCALL_64_after_hwframe+0x150/0x0
[jane.chu@oracle.com: relocate comment]
Link: http://lkml.kernel.org/r/20180731044831.26036-1-jane.chu@oracle.com
Link: http://lkml.kernel.org/r/20180727211727.5020-1-jane.chu@oracle.com
Fixes: 05ea88608d ("mm, hugetlbfs: introduce ->pagesize() to vm_operations_struct")
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In case of memcg_online_kmem() failure, memcg_cgroup::id remains hashed
in mem_cgroup_idr even after memcg memory is freed. This leads to leak
of ID in mem_cgroup_idr.
This patch adds removal into mem_cgroup_css_alloc(), which fixes the
problem. For better readability, it adds a generic helper which is used
in mem_cgroup_alloc() and mem_cgroup_id_put_many() as well.
Link: http://lkml.kernel.org/r/152354470916.22460.14397070748001974638.stgit@localhost.localdomain
Fixes 73f576c04b ("mm: memcontrol: fix cgroup creation failure after many small jobs")
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* make memblock_discard description kernel-doc compatible
* add brief description for memblock_setclr_flag and describe its
parameters
* fixup return value descriptions
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Since kernel-doc does not like anonymous enums the name is required for
adding documentation. While on it, I've also updated all the function
declarations to use 'enum memblock_flags' instead of unsigned long.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Add descriptions of the return value where they were missing and fixup the
syntax for present ones.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
* add kernel-doc marking to free_bootmem_late() description
* add return value descriptions
* mention that address parameter of free_bootmem{_node} is a physical address
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Parts of the bootmem interfaces are duplicated in nobootmem.c along with
the kernel-doc comments. There is no point to keep two copies of the
comments, so let's drop the bootmem.c copy.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
This moves all the options under a proper menu.
Based on a patch from Randy Dunlap.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Delete the old VM_BUG_ON_VMA() from zap_pmd_range(), which asserted
that mmap_sem must be held when splitting an "anonymous" vma there.
Whether that's still strictly true nowadays is not entirely clear,
but the danger of sometimes crashing on the BUG is now fairly clear.
Even with the new stricter rules for anonymous vma marking, the
condition it checks for can possible trigger. Commit 44960f2a7b
("staging: ashmem: Fix SIGBUS crash when traversing mmaped ashmem
pages") is good, and originally I thought it was safe from that
VM_BUG_ON_VMA(), because the /dev/ashmem fd exposed to the user is
disconnected from the vm_file in the vma, and madvise(,,MADV_REMOVE)
insists on VM_SHARED.
But after I read John's earlier mail, drawing attention to the
vfs_fallocate() in there: I may be wrong, and I don't know if Android
has THP in the config anyway, but it looks to me like an
unmap_mapping_range() from ashmem's vfs_fallocate() could hit precisely
the VM_BUG_ON_VMA(), once it's vma_is_anonymous().
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ondemand_readahead() checks bdi->io_pages to cap the maximum pages
that need to be processed. This works until the readit section. If
we would do an async only readahead (async size = sync size) and
target is at beginning of window we expand the pages by another
get_next_ra_size() pages. Btrace for large reads shows that kernel
always issues a doubled size read at the beginning of processing.
Add an additional check for io_pages in the lower part of the func.
The fix helps devices that hard limit bio pages and rely on proper
handling of max_hw_read_sectors (e.g. older FusionIO cards). For
that reason it could qualify for stable.
Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
Cc: stable@vger.kernel.org
Signed-off-by: Markus Stockhausen stockhausen@collogia.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
/sys/../zswap/stored_pages keeps rising in a zswap test with
"zswap.max_pool_percent=0" parameter. But it should not compress or
store pages any more since there is no space in the compressed pool.
Reproduce steps:
1. Boot kernel with "zswap.enabled=1"
2. Set the max_pool_percent to 0
# echo 0 > /sys/module/zswap/parameters/max_pool_percent
3. Do memory stress test to see if some pages have been compressed
# stress --vm 1 --vm-bytes $mem_available"M" --timeout 60s
4. Watching the 'stored_pages' number increasing or not
The root cause is:
When zswap_max_pool_percent is set to 0 via kernel parameter,
zswap_is_full() will always return true due to zswap_shrink(). But if
the shinking is able to reclain a page successfully the code then
proceeds to compressing/storing another page, so the value of
stored_pages will keep changing.
To solve the issue, this patch adds a zswap_is_full() check again after
zswap_shrink() to make sure it's now under the max_pool_percent, and to
not compress/store if we reached the limit.
Link: http://lkml.kernel.org/r/20180530103936.17812-1-liwang@redhat.com
Signed-off-by: Li Wang <liwang@redhat.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Huang Ying <huang.ying.caritas@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make sure to initialize all VMAs properly, not only those which come
from vm_area_cachep.
Link: http://lkml.kernel.org/r/20180724121139.62570-3-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered
In contrast to typical memory, dev_pagemap pages may be dax mapped. With
dax there is no possibility to map in another page dynamically since dax
establishes 1:1 physical address to file offset associations. Also
dev_pagemap pages associated with NVDIMM / persistent memory devices can
internal remap/repair addresses with poison. While memory_failure()
assumes that it can discard typical poisoned pages and keep them
unmapped indefinitely, dev_pagemap pages may be returned to service
after the error is cleared.
Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
dev_pagemap pages that have poison consumed by userspace. Mark the
memory as UC instead of unmapping it completely to allow ongoing access
via the device driver (nd_pmem). Later, nd_pmem will grow support for
marking the page back to WB when the error is cleared.
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
In preparation for supporting memory_failure() for dax mappings, teach
collect_procs() to also determine the mapping size. Unlike typical
mappings the dax mapping size is determined by walking page-table
entries rather than using the compound-page accounting for THP pages.
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
The madvise_inject_error() routine uses get_user_pages() to lookup the
pfn and other information for injected error, but it does not release
that pin. The assumption is that failed pages should be taken out of
circulation.
However, for dax mappings it is not possible to take pages out of
circulation since they are 1:1 physically mapped as filesystem blocks,
or device-dax capacity. They also typically represent persistent memory
which has an error clearing capability.
In preparation for adding a special handler for dax mappings, shift the
responsibility of taking the page reference to memory_failure(). I.e.
drop the page reference and do not specify MF_COUNT_INCREASED to
memory_failure().
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
MEMORY_DEVICE_FS_DAX relies on typical page semantics whereby ->mapping
is only ever cleared by truncation, not final put.
Without this fix dax pages may forget their mapping association at the
end of every page pin event.
Move this atypical behavior that HMM wants into the HMM ->page_free()
callback.
Cc: <stable@vger.kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: d2c997c0f1 ("fs, dax: use page->mapping...")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Like vm_area_dup(), it initializes the anon_vma_chain head, and the
basic mm pointer.
The rest of the fields end up being different for different users,
although the plan is to also initialize the 'vm_ops' field to a dummy
entry.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
.. and re-initialize th eanon_vma_chain head.
This removes some boiler-plate from the users, and also makes it clear
why it didn't need use the 'zalloc()' version.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vm_area_struct is one of the most fundamental memory management
objects, but the management of it is entirely open-coded evertwhere,
ranging from allocation and freeing (using kmem_cache_[z]alloc and
kmem_cache_free) to initializing all the fields.
We want to unify this in order to end up having some unified
initialization of the vmas, and the first step to this is to at least
have basic allocation functions.
Right now those functions are literally just wrappers around the
kmem_cache_*() calls. This is a purely mechanical conversion:
# new vma:
kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()
# copy old vma
kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)
# free vma
kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)
to the point where the old vma passed in to the vm_area_dup() function
isn't even used yet (because I've left all the old manual initialization
alone).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It was reported that a kernel crash happened in mem_cgroup_iter(), which
can be triggered if the legacy cgroup-v1 non-hierarchical mode is used.
Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b8f
......
Call trace:
mem_cgroup_iter+0x2e0/0x6d4
shrink_zone+0x8c/0x324
balance_pgdat+0x450/0x640
kswapd+0x130/0x4b8
kthread+0xe8/0xfc
ret_from_fork+0x10/0x20
mem_cgroup_iter():
......
if (css_tryget(css)) <-- crash here
break;
......
The crashing reason is that mem_cgroup_iter() uses the memcg object whose
pointer is stored in iter->position, which has been freed before and
filled with POISON_FREE(0x6b).
And the root cause of the use-after-free issue is that
invalidate_reclaim_iterators() fails to reset the value of iter->position
to NULL when the css of the memcg is released in non- hierarchical mode.
Link: http://lkml.kernel.org/r/1531994807-25639-1-git-send-email-jing.xia@unisoc.com
Fixes: 6df38689e0 ("mm: memcontrol: fix possible memcg leak due to interrupted reclaim")
Signed-off-by: Jing Xia <jing.xia.mail@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <chunyan.zhang@unisoc.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__split_huge_pmd_locked() must check if the cleared huge pmd was dirty,
and propagate that to PageDirty: otherwise, data may be lost when a huge
tmpfs page is modified then split then reclaimed.
How has this taken so long to be noticed? Because there was no problem
when the huge page is written by a write system call (shmem_write_end()
calls set_page_dirty()), nor when the page is allocated for a write fault
(fault_dirty_shared_page() calls set_page_dirty()); but when allocated for
a read fault (which MAP_POPULATE simulates), no set_page_dirty().
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1807111741430.1106@eggly.anvils
Fixes: d21b9e57c7 ("thp: handle file pages in split_huge_pmd()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Ashwin Chaugule <ashwinch@google.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: <stable@vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 26f09e9b3a ("mm/memblock: add memblock memory allocation apis")
introduced two new function definitions:
memblock_virt_alloc_try_nid_nopanic()
memblock_virt_alloc_try_nid()
and commit ea1f5f3712 ("mm: define memblock_virt_alloc_try_nid_raw")
introduced the following function definition:
memblock_virt_alloc_try_nid_raw()
This commit adds an include of header file <linux/bootmem.h> to provide
the missing function prototypes. This silences the following gcc warning
(W=1):
mm/memblock.c:1334:15: warning: no previous prototype for `memblock_virt_alloc_try_nid_raw' [-Wmissing-prototypes]
mm/memblock.c:1371:15: warning: no previous prototype for `memblock_virt_alloc_try_nid_nopanic' [-Wmissing-prototypes]
mm/memblock.c:1407:15: warning: no previous prototype for `memblock_virt_alloc_try_nid' [-Wmissing-prototypes]
Also adds #ifdef blockers to prevent compilation failure on mips/ia64
where CONFIG_NO_BOOTMEM=n as could be seen in commit commit 6cc22dc08a
("revert "mm/memblock: add missing include <linux/bootmem.h>"").
Because Makefile already does:
obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o
The #ifdef has been simplified from:
#if defined(CONFIG_HAVE_MEMBLOCK) && defined(CONFIG_NO_BOOTMEM)
to simply:
#if defined(CONFIG_NO_BOOTMEM)
Link: http://lkml.kernel.org/r/20180626184422.24974-1-malat@debian.org
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Suggested-by: Tony Luck <tony.luck@intel.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This passes the information we already have at the call sight into
do_send_sig_info. Ultimately allowing for better handling of signals
sent to a group of processes during fork.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Given that dax / device-mapped pages are never subject to page
allocations remove them from consideration by the soft-offline
mechanism.
Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Use new return type vm_fault_t for fault and huge_fault handler. For
now, this is just documenting that the function returns a VM_FAULT value
rather than an errno. Once all instances are converted, vm_fault_t will
become a distinct type.
Commit 1c8f422059 ("mm: change return type to vm_fault_t")
Previously vm_insert_mixed() returned an error code which driver mapped into
VM_FAULT_* type. The new function vmf_insert_mixed() will replace this
inefficiency by returning VM_FAULT_* type.
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Andy discovered that speculative memory accesses while in lazy
TLB mode can crash a system, when a CPU tries to dereference a
speculative access using memory contents that used to be valid
page table memory, but have since been reused for something else
and point into la-la land.
The latter problem can be prevented in two ways. The first is to
always send a TLB shootdown IPI to CPUs in lazy TLB mode, while
the second one is to only send the TLB shootdown at page table
freeing time.
The second should result in fewer IPIs, since operationgs like
mprotect and madvise are very common with some workloads, but
do not involve page table freeing. Also, on munmap, batching
of page table freeing covers much larger ranges of virtual
memory than the batching of unmapped user pages.
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-3-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The mm_struct always contains a cpumask bitmap, regardless of
CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
simplify things, and simply have one bitmask at the end of the
mm_struct for the mm_cpumask.
This does necessitate moving everything else in mm_struct into
an anonymous sub-structure, which can be randomized when struct
randomization is enabled.
The second step is to determine the correct size for the
mm_struct slab object from the size of the mm_struct
(excluding the CPU bitmap) and the size the cpumask.
For init_mm we can simply allocate the maximum size this
kernel is compiled for, since we only have one init_mm
in the system, anyway.
Pointer magic by Mike Galbraith, to evade -Wstringop-overflow
getting confused by the dynamically sized array.
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-2-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Moving zero_resv_unavail before memmap_init_zone(), caused a regression on
x86-32.
The cause is that we access struct pages before they are allocated when
CONFIG_FLAT_NODE_MEM_MAP is used.
free_area_init_nodes()
zero_resv_unavail()
mm_zero_struct_page(pfn_to_page(pfn)); <- struct page is not alloced
free_area_init_node()
if CONFIG_FLAT_NODE_MEM_MAP
alloc_node_mem_map()
memblock_virt_alloc_node_nopanic() <- struct page alloced here
On the other hand memblock_virt_alloc_node_nopanic() zeroes all the memory
that it returns, so we do not need to do zero_resv_unavail() here.
Fixes: e181ae0c5d ("mm: zero unavailable pages before memmap init")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Matt Hart <matt@mattface.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge misc fixes from Andrew Morton:
"11 fixes"
* emailed patches form Andrew Morton <akpm@linux-foundation.org>:
reiserfs: fix buffer overflow with long warning messages
checkpatch: fix duplicate invalid vsprintf pointer extension '%p<foo>' messages
mm: do not bug_on on incorrect length in __mm_populate()
mm/memblock.c: do not complain about top-down allocations for !MEMORY_HOTREMOVE
fs, elf: make sure to page align bss in load_elf_library
x86/purgatory: add missing FORCE to Makefile target
net/9p/client.c: put refcount of trans_mod in error case in parse_opts()
mm: allow arch to supply p??_free_tlb functions
autofs: fix slab out of bounds read in getname_kernel()
fs/proc/task_mmu.c: fix Locked field in /proc/pid/smaps*
mm: do not drop unused pages when userfaultd is running
syzbot has noticed that a specially crafted library can easily hit
VM_BUG_ON in __mm_populate
kernel BUG at mm/gup.c:1242!
invalid opcode: 0000 [#1] SMP
CPU: 2 PID: 9667 Comm: a.out Not tainted 4.18.0-rc3 #644
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
RIP: 0010:__mm_populate+0x1e2/0x1f0
Code: 55 d0 65 48 33 14 25 28 00 00 00 89 d8 75 21 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 75 18 f1 ff 0f 0b e8 6e 18 f1 ff <0f> 0b 31 db eb c9 e8 93 06 e0 ff 0f 1f 00 55 48 89 e5 53 48 89 fb
Call Trace:
vm_brk_flags+0xc3/0x100
vm_brk+0x1f/0x30
load_elf_library+0x281/0x2e0
__ia32_sys_uselib+0x170/0x1e0
do_fast_syscall_32+0xca/0x420
entry_SYSENTER_compat+0x70/0x7f
The reason is that the length of the new brk is not page aligned when we
try to populate the it. There is no reason to bug on that though.
do_brk_flags already aligns the length properly so the mapping is
expanded as it should. All we need is to tell mm_populate about it.
Besides that there is absolutely no reason to to bug_on in the first
place. The worst thing that could happen is that the last page wouldn't
get populated and that is far from putting system into an inconsistent
state.
Fix the issue by moving the length sanitization code from do_brk_flags
up to vm_brk_flags. The only other caller of do_brk_flags is brk
syscall entry and it makes sure to provide the proper length so t here
is no need for sanitation and so we can use do_brk_flags without it.
Also remove the bogus BUG_ONs.
[osalvador@techadventures.net: fix up vm_brk_flags s@request@len@]
Link: http://lkml.kernel.org/r/20180706090217.GI32658@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: syzbot <syzbot+5dcb560fe12aa5091c06@syzkaller.appspotmail.com>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport is converting architectures from bootmem to nobootmem
allocator. While doing so for m68k Geert has noticed that he gets a
scary looking warning:
WARNING: CPU: 0 PID: 0 at mm/memblock.c:230
memblock_find_in_range_node+0x11c/0x1be
memblock: bottom-up allocation failed, memory hotunplug may be affected
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted
4.18.0-rc3-atari-01343-gf2fb5f2e09a97a3c-dirty #7
Call Trace: __warn+0xa8/0xc2
kernel_pg_dir+0x0/0x1000
netdev_lower_get_next+0x2/0x22
warn_slowpath_fmt+0x2e/0x36
memblock_find_in_range_node+0x11c/0x1be
memblock_find_in_range_node+0x11c/0x1be
memblock_find_in_range_node+0x0/0x1be
vprintk_func+0x66/0x6e
memblock_virt_alloc_internal+0xd0/0x156
netdev_lower_get_next+0x2/0x22
netdev_lower_get_next+0x2/0x22
kernel_pg_dir+0x0/0x1000
memblock_virt_alloc_try_nid_nopanic+0x58/0x7a
netdev_lower_get_next+0x2/0x22
kernel_pg_dir+0x0/0x1000
kernel_pg_dir+0x0/0x1000
EXPTBL+0x234/0x400
EXPTBL+0x234/0x400
alloc_node_mem_map+0x4a/0x66
netdev_lower_get_next+0x2/0x22
free_area_init_node+0xe2/0x29e
EXPTBL+0x234/0x400
paging_init+0x430/0x462
kernel_pg_dir+0x0/0x1000
printk+0x0/0x1a
EXPTBL+0x234/0x400
setup_arch+0x1b8/0x22c
start_kernel+0x4a/0x40a
_sinittext+0x344/0x9e8
The warning is basically saying that a top-down allocation can break
memory hotremove because memblock allocation is not movable. But m68k
doesn't even support MEMORY_HOTREMOVE so there is no point to warn about
it.
Make the warning conditional only to configurations that care.
Link: http://lkml.kernel.org/r/20180706061750.GH32658@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Sam Creasey <sammy@sammy.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KVM guests on s390 can notify the host of unused pages. This can result
in pte_unused callbacks to be true for KVM guest memory.
If a page is unused (checked with pte_unused) we might drop this page
instead of paging it. This can have side-effects on userfaultd, when
the page in question was already migrated:
The next access of that page will trigger a fault and a user fault
instead of faulting in a new and empty zero page. As QEMU does not
expect a userfault on an already migrated page this migration will fail.
The most straightforward solution is to ignore the pte_unused hint if a
userfault context is active for this VMA.
Link: http://lkml.kernel.org/r/20180703171854.63981-1-borntraeger@de.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We must zero struct pages for memory that is not backed by physical
memory, or kernel does not have access to.
Recently, there was a change which zeroed all memmap for all holes in
e820. Unfortunately, it introduced a bug that is discussed here:
https://www.spinics.net/lists/linux-mm/msg156764.html
Linus, also saw this bug on his machine, and confirmed that reverting
commit 124049decb ("x86/e820: put !E820_TYPE_RAM regions into
memblock.reserved") fixes the issue.
The problem is that we incorrectly zero some struct pages after they
were setup.
The fix is to zero unavailable struct pages prior to initializing of
struct pages.
A more detailed fix should come later that would avoid double zeroing
cases: one in __init_single_page(), the other one in
zero_resv_unavail().
Fixes: 124049decb ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... so that it could set both ->f_flags and ->f_mode, without callers
having to set ->f_flags manually.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We noticed in testing we'd get pretty bad latency stalls under heavy
pressure because read ahead would try to do its thing while the cgroup
was under severe pressure. If we're under this much pressure we want to
do as little IO as possible so we can still make progress on real work
if we're a throttled cgroup, so just skip readahead if our group is
under pressure.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Memory allocations can induce swapping via kswapd or direct reclaim. If
we are having IO done for us by kswapd and don't actually go into direct
reclaim we may never get scheduled for throttling. So instead check to
see if our cgroup is congested, and if so schedule the throttling.
Before we return to user space the throttling stuff will only throttle
if we actually required it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For backcharging we need to know who the page belongs to when swapping
it out. We don't worry about things that do ->rw_page (zram etc) at the
moment, we're only worried about pages that actually go to a block
device.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just like REQ_META, it's important to know the IO coming down is swap
in order to guard against potential IO priority inversion issues with
cgroups. Add REQ_SWAP and use it for all swap IO, and add it to our
bio_issue_as_root_blkg helper.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Enabling HARDENED_USERCOPY may cause measurable regressions in networking
performance: up to 8% under UDP flood.
I ran a small packet UDP flood using pktgen vs. a host b2b connected. On
the receiver side the UDP packets are processed by a simple user space
process that just reads and drops them:
https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
Not very useful from a functional PoV, but it helps to pin-point
bottlenecks in the networking stack.
When running a kernel with CONFIG_HARDENED_USERCOPY=y, I see a 5-8%
regression in the receive tput, compared to the same kernel without this
option enabled.
With CONFIG_HARDENED_USERCOPY=y, perf shows ~6% of CPU time spent
cumulatively in __check_object_size (~4%) and __virt_addr_valid (~2%).
The call-chain is:
__GI___libc_recvfrom
entry_SYSCALL_64_after_hwframe
do_syscall_64
__x64_sys_recvfrom
__sys_recvfrom
inet_recvmsg
udp_recvmsg
__check_object_size
udp_recvmsg() actually calls copy_to_iter() (inlined) and the latters
calls check_copy_size() (again, inlined).
A generic distro may want to enable HARDENED_USERCOPY in their default
kernel config, but at the same time, such distro may want to be able to
avoid the performance penalties in with the default configuration and
disable the stricter check on a per-boot basis.
This change adds a boot parameter that conditionally disables
HARDENED_USERCOPY via "hardened_usercopy=off".
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
If struct page is poisoned, and uninitialized access is detected via
PF_POISONED_CHECK(page) dump_page() is called to output the page. But,
the dump_page() itself accesses struct page to determine how to print
it, and therefore gets into a recursive loop.
For example:
dump_page()
__dump_page()
PageSlab(page)
PF_POISONED_CHECK(page)
VM_BUG_ON_PGFLAGS(PagePoisoned(page), page)
dump_page() recursion loop.
Link: http://lkml.kernel.org/r/20180702180536.2552-1-pasha.tatashin@oracle.com
Fixes: f165b378bb ("mm: uninitialized struct page poisoning sanity checking")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a special case that the size is "(N << KASAN_SHADOW_SCALE_SHIFT)
Pages plus X", the value of X is [1, KASAN_SHADOW_SCALE_SIZE-1]. The
operation "size >> KASAN_SHADOW_SCALE_SHIFT" will drop X, and the
roundup operation can not retrieve the missed one page. For example:
size=0x28006, PAGE_SIZE=0x1000, KASAN_SHADOW_SCALE_SHIFT=3, we will get
shadow_size=0x5000, but actually we need 6 pages.
shadow_size = round_up(size >> KASAN_SHADOW_SCALE_SHIFT, PAGE_SIZE);
This can lead to a kernel crash when kasan is enabled and the value of
mod->core_layout.size or mod->init_layout.size is like above. Because
the shadow memory of X has not been allocated and mapped.
move_module:
ptr = module_alloc(mod->core_layout.size);
...
memset(ptr, 0, mod->core_layout.size); //crashed
Unable to handle kernel paging request at virtual address ffff0fffff97b000
......
Call trace:
__asan_storeN+0x174/0x1a8
memset+0x24/0x48
layout_and_allocate+0xcd8/0x1800
load_module+0x190/0x23e8
SyS_finit_module+0x148/0x180
Link: http://lkml.kernel.org/r/1529659626-12660-1-git-send-email-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Dmitriy Vyukov <dvyukov@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Libin <huawei.libin@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When booting with very large numbers of gigantic (i.e. 1G) pages, the
operations in the loop of gather_bootmem_prealloc, and specifically
prep_compound_gigantic_page, takes a very long time, and can cause a
softlockup if enough pages are requested at boot.
For example booting with 3844 1G pages requires prepping
(set_compound_head, init the count) over 1 billion 4K tail pages, which
takes considerable time.
Add a cond_resched() to the outer loop in gather_bootmem_prealloc() to
prevent this lockup.
Tested: Booted with softlockup_panic=1 hugepagesz=1G hugepages=3844 and
no softlockup is reported, and the hugepages are reported as
successfully setup.
Link: http://lkml.kernel.org/r/20180627214447.260804-1-cannonmatthews@google.com
Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In kernel 4.17 I removed some code from dm-bufio that did slab cache
merging (commit 21bb132767: "dm bufio: remove code that merges slab
caches") - both slab and slub support merging caches with identical
attributes, so dm-bufio now just calls kmem_cache_create and relies on
implicit merging.
This uncovered a bug in the slub subsystem - if we delete a cache and
immediatelly create another cache with the same attributes, it fails
because of duplicate filename in /sys/kernel/slab/. The slub subsystem
offloads freeing the cache to a workqueue - and if we create the new
cache before the workqueue runs, it complains because of duplicate
filename in sysfs.
This patch fixes the bug by moving the call of kobject_del from
sysfs_slab_remove_workfn to shutdown_cache. kobject_del must be called
while we hold slab_mutex - so that the sysfs entry is deleted before a
cache with the same attributes could be created.
Running device-mapper-test-suite with:
dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/
triggered:
Buffer I/O error on dev dm-0, logical block 1572848, async page read
device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
device-mapper: thin: 253:1: aborting current metadata transaction
sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
Workqueue: dm-thin do_worker [dm_thin_pool]
Call Trace:
dump_stack+0x5a/0x73
sysfs_warn_dup+0x58/0x70
sysfs_create_dir_ns+0x77/0x80
kobject_add_internal+0xba/0x2e0
kobject_init_and_add+0x70/0xb0
sysfs_slab_add+0xb1/0x250
__kmem_cache_create+0x116/0x150
create_cache+0xd9/0x1f0
kmem_cache_create_usercopy+0x1c1/0x250
kmem_cache_create+0x18/0x20
dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
__create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
metadata_operation_failed+0x59/0x100 [dm_thin_pool]
alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
process_cell+0x2a3/0x550 [dm_thin_pool]
do_worker+0x28d/0x8f0 [dm_thin_pool]
process_one_work+0x171/0x370
worker_thread+0x49/0x3f0
kthread+0xf8/0x130
ret_from_fork+0x35/0x40
kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
kmem_cache_create(dm_bufio_buffer-16) failed with error -17
Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit c7f26ccfb2 ("mm/vmstat.c: fix vmstat_update() preemption
BUG"). Steven saw a "using smp_processor_id() in preemptible" message
and added a preempt_disable() section around it to keep it quiet. This
is not the right thing to do it does not fix the real problem.
vmstat_update() is invoked by a kworker on a specific CPU. This worker
it bound to this CPU. The name of the worker was "kworker/1:1" so it
should have been a worker which was bound to CPU1. A worker which can
run on any CPU would have a `u' before the first digit.
smp_processor_id() can be used in a preempt-enabled region as long as
the task is bound to a single CPU which is the case here. If it could
run on an arbitrary CPU then this is the problem we have an should seek
to resolve.
Not only this smp_processor_id() must not be migrated to another CPU but
also refresh_cpu_vm_stats() which might access wrong per-CPU variables.
Not to mention that other code relies on the fact that such a worker
runs on one specific CPU only.
Therefore revert that commit and we should look instead what broke the
affinity mask of the kworker.
Link: http://lkml.kernel.org/r/20180504104451.20278-1-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven J. Hill <steven.hill@cavium.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlsudKcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgphYXEAC0G+fJcLgue64LwLNjXksDUsoldqNWjTDZ
e3szwseN3Woe7t+Q9DwEUO48T421WhDYe3UY50EZF3Fz5yYLi0ggTeq4qsKgpaXx
80dUY9/STxFZUV/tKrYMz5xPjM3k5x+m8FTeGlFgyR77fbOZE6/8w9XPkpmQN/B2
ayyEPMgAR9aZKhtOU4TJY45O278b2Qdfd0s6XUsxDTwBoR1mTQqKhNnY38PrIA5s
wkhtKz5YRdq5tNhKt2IwYABpJ4HJ1Hz/FE6Fo1RNx+6vfxfEhsk2/DZBn0rzzGDj
kvSaVlPeQEu9SZdJGtb2sx26elVWdt5sqMORQ0zb1kVCochZiT1IO3Z81XCy2Y3u
5WVvOlgsGni9BWr/K+iiN/dc+cKoRPy0g/ib77XUiJHgIC+urUIr3XgcIiuP9GZc
cgCfF+6+GvRKLatSsDZSbns5cpgueJLNogZUyhs7oUouRIECo0DD9NnDAlHmaZYB
957bRd/ykE6QOvtxEsG0TWc3tgyHjJFQQ9ukb45VmzYVB8mITKOGLpceJaqzju9c
AnGyQDddK4fqOct1pgFsSA5C3V6OOVXDpUiFn3DTNoQZcu5HN2hMwEA/g/u5TnGD
N2sEwcu2WCoUjzsMEWFLmX8R4w8laZUDwkNSRC5tYzQEZpVDfmAsRtRcUHJdNLl/
zPj8QMAidA==
=hCgr
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180623' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Further timeout fixes. We aren't quite there yet, so expect another
round of fixes for that to completely close some of the IRQ vs
completion races. (Christoph/Bart)
- Set of NVMe fixes from the usual suspects, mostly error handling
- Two off-by-one fixes (Dan)
- Another bdi race fix (Jan)
- Fix nbd reconfigure with NBD_DISCONNECT_ON_CLOSE (Doron)
* tag 'for-linus-20180623' of git://git.kernel.dk/linux-block:
blk-mq: Fix timeout handling in case the timeout handler returns BLK_EH_DONE
bdi: Fix another oops in wb_workfn()
lightnvm: Remove depends on HAS_DMA in case of platform dependency
nvme-pci: limit max IO size and segments to avoid high order allocations
nvme-pci: move nvme_kill_queues to nvme_remove_dead_ctrl
nvme-fc: release io queues to allow fast fail
nbd: Add the nbd NBD_DISCONNECT_ON_CLOSE config flag.
block: sed-opal: Fix a couple off by one bugs
blk-mq-debugfs: Off by one in blk_mq_rq_state_name()
nvmet: reset keep alive timer in controller enable
nvme-rdma: don't override opts->queue_size
nvme-rdma: Fix command completion race at error recovery
nvme-rdma: fix possible free of a non-allocated async event buffer
nvme-rdma: fix possible double free condition when failing to create a controller
Revert "block: Add warning for bi_next not NULL in bio_endio()"
block: fix timeout changes for legacy request drivers
syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
WB_shutting_down after wb->bdi->dev became NULL. This indicates that
unregister_bdi() failed to call wb_shutdown() on one of wb objects.
The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
drops bdi's reference to wb structures before going through the list of
wbs again and calling wb_shutdown() on each of them. This way the loop
iterating through all wbs can easily miss a wb if that wb has already
passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
from cgwb_release_workfn() and as a result fully shutdown bdi although
wb_workfn() for this wb structure is still running. In fact there are
also other ways cgwb_bdi_unregister() can race with
cgwb_release_workfn() leading e.g. to use-after-free issues:
CPU1 CPU2
cgwb_bdi_unregister()
cgwb_kill(*slot);
cgwb_release()
queue_work(cgwb_release_wq, &wb->release_work);
cgwb_release_workfn()
wb = list_first_entry(&bdi->wb_list, ...)
spin_unlock_irq(&cgwb_lock);
wb_shutdown(wb);
...
kfree_rcu(wb, rcu);
wb_shutdown(wb); -> oops use-after-free
We solve these issues by synchronizing writeback structure shutdown from
cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
way we also no longer need synchronization using WB_shutting_down as the
mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
bdi_unregister().
Reported-by: syzbot <syzbot+4a7438e774b21ddd8eca@syzkaller.appspotmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For the L1TF workaround its necessary to limit the swap file size to below
MAX_PA/2, so that the higher bits of the swap offset inverted never point
to valid memory.
Add a mechanism for the architecture to override the swap file size check
in swapfile.c and add a x86 specific max swapfile check function that
enforces that limit.
The check is only enabled if the CPU is vulnerable to L1TF.
In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system
with 46bit PA it is 32TB. The limit is only per individual swap file, so
it's always possible to exceed these limits with multiple swap files or
partitions.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
For L1TF PROT_NONE mappings are protected by inverting the PFN in the page
table entry. This sets the high bits in the CPU's address space, thus
making sure to point to not point an unmapped entry to valid cached memory.
Some server system BIOSes put the MMIO mappings high up in the physical
address space. If such an high mapping was mapped to unprivileged users
they could attack low memory by setting such a mapping to PROT_NONE. This
could happen through a special device driver which is not access
protected. Normal /dev/mem is of course access protected.
To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.
Valid page mappings are allowed because the system is then unsafe anyways.
It's not expected that users commonly use PROT_NONE on MMIO. But to
minimize any impact this is only enforced if the mapping actually refers to
a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip
the check for root.
For mmaps this is straight forward and can be handled in vm_insert_pfn and
in remap_pfn_range().
For mprotect it's a bit trickier. At the point where the actual PTEs are
accessed a lot of state has been changed and it would be difficult to undo
on an error. Since this is a uncommon case use a separate early page talk
walk pass for MMIO PROT_NONE mappings that checks for this condition
early. For non MMIO and non PROT_NONE there are no changes.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
The patch fixed a W=1 warning but broke the ia64 build:
CC mm/memblock.o
mm/memblock.c:1340: error: redefinition of `memblock_virt_alloc_try_nid_raw'
./include/linux/bootmem.h:335: error: previous definition of `memblock_virt_alloc_try_nid_raw' was here
Because inlcude/linux/bootmem.h says
#if defined(CONFIG_HAVE_MEMBLOCK) && defined(CONFIG_NO_BOOTMEM)
whereas mm/Makefile says
obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o
So revert 26f09e9b3a ("mm/memblock: add missing include
<linux/bootmem.h>") while a full fix can be worked on.
Fixes: 26f09e9b3a ("mm/memblock: add missing include <linux/bootmem.h>")
Reported-by: Tony Luck <tony.luck@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit e27be240df ("mm: memcg: make sure memory.events is uptodate
when waking pollers") converted most of memcg event counters to
per-memcg atomics, which made them less confusing for a user. The
"oom_kill" counter remained untouched, so now it behaves differently
than other counters (including "oom"). This adds nothing but confusion.
Let's fix this by adding the MEMCG_OOM_KILL event, and follow the
MEMCG_OOM approach.
This also removes a hack from count_memcg_event_mm(), introduced earlier
specially for the OOM_KILL counter.
[akpm@linux-foundation.org: fix for droppage of memcg-replace-mm-owner-with-mm-memcg.patch]
Link: http://lkml.kernel.org/r/20180508124637.29984-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/*.c files use symbolic and octal styles for permissions.
Using octal and not symbolic permissions is preferred by many as more
readable.
https://lkml.org/lkml/2016/8/2/1945
Prefer the direct use of octal for permissions.
Done using
$ scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace mm/*.c
and some typing.
Before: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
44
After: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
86
Miscellanea:
o Whitespace neatening around these conversions.
Link: http://lkml.kernel.org/r/2e032ef111eebcd4c5952bae86763b541d373469.1522102887.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 5d1904204c ("mremap: fix race between mremap() and page
cleanning") fixed races between mremap and other operations for both
file-backed and anonymous mappings. The file-backed was the most
critical as it allowed the possibility that data could be changed on a
physical page after page_mkclean returned which could trigger data loss
or data integrity issues.
A customer reported that the cost of the TLBs for anonymous regressions
was excessive and resulting in a 30-50% drop in performance overall
since this commit on a microbenchmark. Unfortunately I neither have
access to the test-case nor can I describe what it does other than
saying that mremap operations dominate heavily.
This patch removes the LATENCY_LIMIT to handle TLB flushes on a PMD
boundary instead of every 64 pages to reduce the number of TLB
shootdowns by a factor of 8 in the ideal case. LATENCY_LIMIT was almost
certainly used originally to limit the PTL hold times but the latency
savings are likely offset by the cost of IPIs in many cases. This patch
is not reported to completely restore performance but gets it within an
acceptable percentage. The given metric here is simply described as
"higher is better".
Baseline that was known good
002: Metric: 91.05
004: Metric: 109.45
008: Metric: 73.08
016: Metric: 58.14
032: Metric: 61.09
064: Metric: 57.76
128: Metric: 55.43
Current
001: Metric: 54.98
002: Metric: 56.56
004: Metric: 41.22
008: Metric: 35.96
016: Metric: 36.45
032: Metric: 35.71
064: Metric: 35.73
128: Metric: 34.96
With patch
001: Metric: 61.43
002: Metric: 81.64
004: Metric: 67.92
008: Metric: 51.67
016: Metric: 50.47
032: Metric: 52.29
064: Metric: 50.01
128: Metric: 49.04
So for low threads, it's not restored but for larger number of threads,
it's closer to the "known good" baseline.
Using a different mremap-intensive workload that is not representative
of the real workload there is little difference observed outside of
noise in the headline metrics However, the TLB shootdowns are reduced by
11% on average and at the peak, TLB shootdowns were reduced by 21%.
Interrupts were sampled every second while the workload ran to get those
figures. It's known that the figures will vary as the
non-representative load is non-deterministic.
An alternative patch was posted that should have significantly reduced
the TLB flushes but unfortunately it does not perform as well as this
version on the customer test case. If revisited, the two patches can
stack on top of each other.
Link: http://lkml.kernel.org/r/20180606183803.k7qaw2xnbvzshv34@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 26f09e9b3a ("mm/memblock: add memblock memory allocation apis")
introduced two new function definitions:
memblock_virt_alloc_try_nid_nopanic()
memblock_virt_alloc_try_nid()
Commit ea1f5f3712 ("mm: define memblock_virt_alloc_try_nid_raw")
introduced the following function definition:
memblock_virt_alloc_try_nid_raw()
This commit adds an includeof header file <linux/bootmem.h> to provide
the missing function prototypes. Silence the following gcc warning
(W=1):
mm/memblock.c:1334:15: warning: no previous prototype for `memblock_virt_alloc_try_nid_raw' [-Wmissing-prototypes]
mm/memblock.c:1371:15: warning: no previous prototype for `memblock_virt_alloc_try_nid_nopanic' [-Wmissing-prototypes]
mm/memblock.c:1407:15: warning: no previous prototype for `memblock_virt_alloc_try_nid' [-Wmissing-prototypes]
Link: http://lkml.kernel.org/r/20180606194144.16990-1-malat@debian.org
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memcg kmem cache creation and deactivation (SLUB only) is
asynchronous. If a root kmem cache is destroyed whose memcg cache is in
the process of creation or deactivation, the kernel may crash.
Example of one such crash:
general protection fault: 0000 [#1] SMP PTI
CPU: 1 PID: 1721 Comm: kworker/14:1 Not tainted 4.17.0-smp
...
Workqueue: memcg_kmem_cache kmemcg_deactivate_workfn
RIP: 0010:has_cpu_slab
...
Call Trace:
? on_each_cpu_cond
__kmem_cache_shrink
kmemcg_cache_deact_after_rcu
kmemcg_deactivate_workfn
process_one_work
worker_thread
kthread
ret_from_fork+0x35/0x40
To fix this race, on root kmem cache destruction, mark the cache as
dying and flush the workqueue used for memcg kmem cache creation and
deactivation. SLUB's memcg kmem cache deactivation also includes RCU
callback and thus make sure all previous registered RCU callbacks have
completed as well.
[shakeelb@google.com: handle the RCU callbacks for SLUB deactivation]
Link: http://lkml.kernel.org/r/20180611192951.195727-1-shakeelb@google.com
[shakeelb@google.com: add more documentation, rename fields for readability]
Link: http://lkml.kernel.org/r/20180522201336.196994-1-shakeelb@google.com
[akpm@linux-foundation.org: fix build, per Shakeel]
[shakeelb@google.com: v3. Instead of refcount, flush the workqueue]
Link: http://lkml.kernel.org/r/20180530001204.183758-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180521174116.171846-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 570a335b8e ("swap_info: swap count continuations") introduces
COUNT_CONTINUED but refers to it incorrectly as SWAP_HAS_CONT in a
comment in swap_count. Fix it.
Link: http://lkml.kernel.org/r/20180612175919.30413-1-daniel.m.jordan@oracle.com
Fixes: 570a335b8e ("swap_info: swap count continuations")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In our armv8a server(QDF2400), I noticed lots of WARN_ON caused by
PAGE_SIZE unaligned for rmap_item->address under memory pressure
tests(start 20 guests and run memhog in the host).
WARNING: CPU: 4 PID: 4641 at virt/kvm/arm/mmu.c:1826 kvm_age_hva_handler+0xc0/0xc8
CPU: 4 PID: 4641 Comm: memhog Tainted: G W 4.17.0-rc3+ #8
Call trace:
kvm_age_hva_handler+0xc0/0xc8
handle_hva_to_gpa+0xa8/0xe0
kvm_age_hva+0x4c/0xe8
kvm_mmu_notifier_clear_flush_young+0x54/0x98
__mmu_notifier_clear_flush_young+0x6c/0xa0
page_referenced_one+0x154/0x1d8
rmap_walk_ksm+0x12c/0x1d0
rmap_walk+0x94/0xa0
page_referenced+0x194/0x1b0
shrink_page_list+0x674/0xc28
shrink_inactive_list+0x26c/0x5b8
shrink_node_memcg+0x35c/0x620
shrink_node+0x100/0x430
do_try_to_free_pages+0xe0/0x3a8
try_to_free_pages+0xe4/0x230
__alloc_pages_nodemask+0x564/0xdc0
alloc_pages_vma+0x90/0x228
do_anonymous_page+0xc8/0x4d0
__handle_mm_fault+0x4a0/0x508
handle_mm_fault+0xf8/0x1b0
do_page_fault+0x218/0x4b8
do_translation_fault+0x90/0xa0
do_mem_abort+0x68/0xf0
el0_da+0x24/0x28
In rmap_walk_ksm, the rmap_item->address might still have the
STABLE_FLAG, then the start and end in handle_hva_to_gpa might not be
PAGE_SIZE aligned. Thus it will cause exceptions in handle_hva_to_gpa
on arm64.
This patch fixes it by ignoring (not removing) the low bits of address
when doing rmap_walk_ksm.
IMO, it should be backported to stable tree. the storm of WARN_ONs is
very easy for me to reproduce. More than that, I watched a panic (not
reproducible) as follows:
page:ffff7fe003742d80 count:-4871 mapcount:-2126053375 mapping: (null) index:0x0
flags: 0x1fffc00000000000()
raw: 1fffc00000000000 0000000000000000 0000000000000000 ffffecf981470000
raw: dead000000000100 dead000000000200 ffff8017c001c000 0000000000000000
page dumped because: nonzero _refcount
CPU: 29 PID: 18323 Comm: qemu-kvm Tainted: G W 4.14.15-5.hxt.aarch64 #1
Hardware name: <snip for confidential issues>
Call trace:
dump_backtrace+0x0/0x22c
show_stack+0x24/0x2c
dump_stack+0x8c/0xb0
bad_page+0xf4/0x154
free_pages_check_bad+0x90/0x9c
free_pcppages_bulk+0x464/0x518
free_hot_cold_page+0x22c/0x300
__put_page+0x54/0x60
unmap_stage2_range+0x170/0x2b4
kvm_unmap_hva_handler+0x30/0x40
handle_hva_to_gpa+0xb0/0xec
kvm_unmap_hva_range+0x5c/0xd0
I even injected a fault on purpose in kvm_unmap_hva_range by seting
size=size-0x200, the call trace is similar as above. So I thought the
panic is similarly caused by the root cause of WARN_ON.
Andrea said:
: It looks a straightforward safe fix, on x86 hva_to_gfn_memslot would
: zap those bits and hide the misalignment caused by the low metadata
: bits being erroneously left set in the address, but the arm code
: notices when that's the last page in the memslot and the hva_end is
: getting aligned and the size is below one page.
:
: I think the problem triggers in the addr += PAGE_SIZE of
: unmap_stage2_ptes that never matches end because end is aligned but
: addr is not.
:
: } while (pte++, addr += PAGE_SIZE, addr != end);
:
: x86 again only works on hva_start/hva_end after converting it to
: gfn_start/end and that being in pfn units the bits are zapped before
: they risk to cause trouble.
Jia He said:
: I've tested by myself in arm64 server (QDF2400,46 cpus,96G mem) Without
: this patch, the WARN_ON is very easy for reproducing. After this patch, I
: have run the same benchmarch for a whole day without any WARN_ONs
Link: http://lkml.kernel.org/r/1525403506-6750-1-git-send-email-hejianet@gmail.com
Signed-off-by: Jia He <jia.he@hxt-semitech.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Jia He <hejianet@gmail.com>
Cc: Suzuki K Poulose <Suzuki.Poulose@arm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* DAX broke a fundamental assumption of truncate of file mapped pages.
The truncate path assumed that it is safe to disconnect a pinned page
from a file and let the filesystem reclaim the physical block. With DAX
the page is equivalent to the filesystem block. Introduce
dax_layout_busy_page() to enable filesystems to wait for pinned DAX
pages to be released. Without this wait a filesystem could allocate
blocks under active device-DMA to a new file.
* DAX arranges for the block layer to be bypassed and uses
dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
However, the memcpy_mcsafe() facility is available through the pmem
block driver. In order to safely handle media errors, via the DAX
block-layer bypass, introduce copy_to_iter_mcsafe().
* Fix cache management policy relative to the ACPI NFIT Platform
Capabilities Structure to properly elide cache flushes when they are not
necessary. The table indicates whether CPU caches are power-fail
protected. Clarify that a deep flush is always performed on
REQ_{FUA,PREFLUSH} requests.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbGxI7AAoJEB7SkWpmfYgCDjsP/2Lcibu9Kf4tKIzuInsle6iE
6qP29qlkpHVTpDKbhvIxTYTYL9sMU0DNUrpPCJR/EYdeyztLWDFC5EAT1wF240vf
maV37s/uP331jSC/2VJnKWzBs2ztQxmKLEIQCxh6aT0qs9cbaOvJgB/WlVu+qtsl
aGJFLmb6vdQacp31noU5plKrMgMA1pADyF5qx9I9K2HwowHE7T368ZEFS/3S//c3
LXmpx/Nfq52sGu/qbRbu6B1CTJhIGhmarObyQnvBYoKntK1Ov4e8DS95wD3EhNDe
FuRkOCUKhjl6cFy7QVWh1ct1bFm84ny+b4/AtbpOmv9l/+0mveJ7e+5mu8HQTifT
wYiEe2xzXJ+OG/xntv8SvlZKMpjP3BqI0jYsTutsjT4oHrciiXdXM186cyS+BiGp
KtFmWyncQJgfiTq6+Hj5XpP9BapNS+OYdYgUagw9ZwzdzptuGFYUMSVOBrYrn6c/
fwqtxjubykJoW0P3pkIoT91arFSea7nxOKnGwft06imQ7TwR4ARsI308feQ9itJq
2P2e7/20nYMsw2aRaUDDA70Yu+Lagn1m8WL87IybUGeUDLb1BAkjphAlWa6COJ+u
PhvAD2tvyM9m0c7O5Mytvz7iWKG6SVgatoAyOPkaeplQK8khZ+wEpuK58sO6C1w8
4GBvt9ri9i/Ww/A+ppWs
=4bfw
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"This adds a user for the new 'bytes-remaining' updates to
memcpy_mcsafe() that you already received through Ingo via the
x86-dax- for-linus pull.
Not included here, but still targeting this cycle, is support for
handling memory media errors (poison) consumed via userspace dax
mappings.
Summary:
- DAX broke a fundamental assumption of truncate of file mapped
pages. The truncate path assumed that it is safe to disconnect a
pinned page from a file and let the filesystem reclaim the physical
block. With DAX the page is equivalent to the filesystem block.
Introduce dax_layout_busy_page() to enable filesystems to wait for
pinned DAX pages to be released. Without this wait a filesystem
could allocate blocks under active device-DMA to a new file.
- DAX arranges for the block layer to be bypassed and uses
dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
However, the memcpy_mcsafe() facility is available through the pmem
block driver. In order to safely handle media errors, via the DAX
block-layer bypass, introduce copy_to_iter_mcsafe().
- Fix cache management policy relative to the ACPI NFIT Platform
Capabilities Structure to properly elide cache flushes when they
are not necessary. The table indicates whether CPU caches are
power-fail protected. Clarify that a deep flush is always performed
on REQ_{FUA,PREFLUSH} requests"
* tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
dax: Use dax_write_cache* helpers
libnvdimm, pmem: Do not flush power-fail protected CPU caches
libnvdimm, pmem: Unconditionally deep flush on *sync
libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
acpi, nfit: Remove ecc_unit_size
dax: dax_insert_mapping_entry always succeeds
libnvdimm, e820: Register all pmem resources
libnvdimm: Debug probe times
linvdimm, pmem: Preserve read-only setting for pmem devices
x86, nfit_test: Add unit test for memcpy_mcsafe()
pmem: Switch to copy_to_iter_mcsafe()
dax: Report bytes remaining in dax_iomap_actor()
dax: Introduce a ->copy_to_iter dax operation
uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
xfs, dax: introduce xfs_break_dax_layouts()
xfs: prepare xfs_break_layouts() for another layout type
xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
mm, fs, dax: handle layout changes to pinned dax mappings
mm: fix __gup_device_huge vs unmap
mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
...
Merge updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- v9fs updates
- MM
- procfs updates
- lib/ updates
- autofs updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
autofs: small cleanup in autofs_getpath()
autofs: clean up includes
autofs: comment on selinux changes needed for module autoload
autofs: update MAINTAINERS entry for autofs
autofs: use autofs instead of autofs4 in documentation
autofs: rename autofs documentation files
autofs: create autofs Kconfig and Makefile
autofs: delete fs/autofs4 source files
autofs: update fs/autofs4/Makefile
autofs: update fs/autofs4/Kconfig
autofs: copy autofs4 to autofs
autofs4: use autofs instead of autofs4 everywhere
autofs4: merge auto_fs.h and auto_fs4.h
fs/binfmt_misc.c: do not allow offset overflow
checkpatch: improve patch recognition
lib/ucs2_string.c: add MODULE_LICENSE()
lib/mpi: headers cleanup
lib/percpu_ida.c: use _irqsave() instead of local_irq_save() + spin_lock
lib/idr.c: remove simple_ida_lock
lib/bitmap.c: micro-optimization for __bitmap_complement()
...
kvmalloc warned about incompatible gfp_mask to catch abusers (mostly
GFP_NOFS) with an intention that this will motivate authors of the code
to fix those. Linus argues that this just motivates people to do even
more hacks like
if (gfp == GFP_KERNEL)
kvmalloc
else
kmalloc
I haven't seen this happening much (Linus pointed to bucket_lock special
cases an atomic allocation but my git foo hasn't found much more) but it
is true that we can grow those in future. Therefore Linus suggested to
simply not fallback to vmalloc for incompatible gfp flags and rather
stick with the kmalloc path.
Link: http://lkml.kernel.org/r/20180601115329.27807-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tom Herbert <tom@quantonium.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
shmem/tmpfs uses pseudo vma to allocate page with correct NUMA policy.
The pseudo vma doesn't have vm_page_prot set. We are going to encode
encryption KeyID in vm_page_prot. Having garbage there causes problems.
Zero out all unused fields in the pseudo vma.
Link: http://lkml.kernel.org/r/20180531135602.20321-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
allocations that can ignore memory policies. The zonelist is obtained
from current CPU's node. This is a problem for __GFP_THISNODE
allocations that want to allocate on a different node, e.g. because the
allocating thread has been migrated to a different CPU.
This has been observed to break SLAB in our 4.4-based kernel, because
there it relies on __GFP_THISNODE working as intended. If a slab page
is put on wrong node's list, then further list manipulations may corrupt
the list because page_to_nid() is used to determine which node's
list_lock should be locked and thus we may take a wrong lock and race.
Current SLAB implementation seems to be immune by luck thanks to commit
511e3a0588 ("mm/slab: make cache_grow() handle the page allocated on
arbitrary node") but there may be others assuming that __GFP_THISNODE
works as promised.
We can fix it by simply removing the zonelist reset completely. There
is actually no reason to reset it, because memory policies and cpusets
don't affect the zonelist choice in the first place. This was different
when commit 183f6371aa ("mm: ignore mempolicies when using
ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
own restricted zonelists.
We might consider this for 4.17 although I don't know if there's
anything currently broken.
SLAB is currently not affected, but in kernels older than 4.7 that don't
yet have 511e3a0588 ("mm/slab: make cache_grow() handle the page
allocated on arbitrary node") it is. That's at least 4.4 LTS. Older
ones I'll have to check.
So stable backports should be more important, but will have to be
reviewed carefully, as the code went through many changes. BTW I think
that also the ac->preferred_zoneref reset is currently useless if we
don't also reset ac->nodemask from a mempolicy to NULL first (which we
probably should for the OOM victims etc?), but I would leave that for a
separate patch.
Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: 183f6371aa ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a process monitored with userfaultfd changes it's memory mappings or
forks() at the same time as uffd monitor fills the process memory with
UFFDIO_COPY, the actual creation of page table entries and copying of
the data in mcopy_atomic may happen either before of after the memory
mapping modifications and there is no way for the uffd monitor to
maintain consistent view of the process memory layout.
For instance, let's consider fork() running in parallel with
userfaultfd_copy():
process | uffd monitor
---------------------------------+------------------------------
fork() | userfaultfd_copy()
... | ...
dup_mmap() | down_read(mmap_sem)
down_write(mmap_sem) | /* create PTEs, copy data */
dup_uffd() | up_read(mmap_sem)
copy_page_range() |
up_write(mmap_sem) |
dup_uffd_complete() |
/* notify monitor */ |
If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
be present by the time copy_page_range() is called and they will appear
in the child's memory mappings. However, if the fork() is the first to
take the mmap_sem, the new pages won't be mapped in the child's address
space.
If the pages are not present and child tries to access them, the monitor
will get page fault notification and everything is fine. However, if
the pages *are present*, the child can access them without uffd
noticing. And if we copy them into child it'll see the wrong data.
Since we are talking about background copy, we'd need to decide whether
the pages should be copied or not regardless #PF notifications.
Since userfaultfd monitor has no way to determine what was the order,
let's disallow userfaultfd_copy in parallel with the non-cooperative
events. In such case we return -EAGAIN and the uffd monitor can
understand that userfaultfd_copy() clashed with a non-cooperative event
and take an appropriate action.
Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently an attempt to set swap.max into a value lower than the actual
swap usage fails, which causes configuration problems as there's no way
of lowering the configuration below the current usage short of turning
off swap entirely. This makes swap.max difficult to use and allows
delegatees to lock the delegator out of reducing swap allocation.
This patch updates swap_max_write() so that the limit can be lowered
below the current usage. It doesn't implement active reclaiming of swap
entries for the following reasons.
* mem_cgroup_swap_full() already tells the swap machinary to
aggressively reclaim swap entries if the usage is above 50% of
limit, so simply lowering the limit automatically triggers gradual
reclaim.
* Forcing back swapped out pages is likely to heavily impact the
workload and mess up the working set. Given that swap usually is a
lot less valuable and less scarce, letting the existing usage
dissipate over time through the above gradual reclaim and as they're
falted back in is likely the better behavior.
Link: http://lkml.kernel.org/r/20180523185041.GR1718769@devbig577.frc2.facebook.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.
See commit 1c8f422059 ("mm: change return type to vm_fault_t")
vmf_error() is the newly introduce inline function in 4.17-rc6.
Link: http://lkml.kernel.org/r/20180521202410.GA17912@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christoph doubts anyone was using the 'reserved' file in sysfs, so remove
it.
Link: http://lkml.kernel.org/r/20180518194519.3820-17-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The reserved field was only used for embedding an rcu_head in the data
structure. With the previous commit, we no longer need it. That lets us
remove the 'reserved' argument to a lot of functions.
Link: http://lkml.kernel.org/r/20180518194519.3820-16-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rcu_head may now grow larger than list_head without affecting slab or
slub.
Link: http://lkml.kernel.org/r/20180518194519.3820-15-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This gives us five words of space in a single union in struct page. The
compound_mapcount moves position (from offset 24 to offset 20) on 64-bit
systems, but that does not seem likely to cause any trouble.
Link: http://lkml.kernel.org/r/20180518194519.3820-11-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the LRU is two words, this does not affect the double-word alignment
of SLUB's freelist.
Link: http://lkml.kernel.org/r/20180518194519.3820-10-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we can represent the location of 'deferred_list' in C instead of
comments, make use of that ability.
Link: http://lkml.kernel.org/r/20180518194519.3820-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By moving page->private to the fourth word of struct page, we can put the
SLUB counters in the same word as SLAB's s_mem and still do the
cmpxchg_double trick. Now the SLUB counters no longer overlap with the
mapcount or refcount so we can drop the call to page_mapcount_reset() and
simplify set_page_slub_counters() to a single line.
Link: http://lkml.kernel.org/r/20180518194519.3820-6-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This will allow us to store slub's counters in the same bits as slab's
s_mem. slub now needs to set page->mapping to NULL as it frees the page,
just like slab does.
Link: http://lkml.kernel.org/r/20180518194519.3820-5-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We're already using a union of many fields here, so stop abusing the
_mapcount and make page_type its own field. That implies renaming some of
the machinery that creates PageBuddy, PageBalloon and PageKmemcg; bring
back the PG_buddy, PG_balloon and PG_kmemcg names.
As suggested by Kirill, make page_type a bitmask. Because it starts out
life as -1 (thanks to sharing the storage with _mapcount), setting a page
flag means clearing the appropriate bit. This gives us space for probably
twenty or so extra bits (depending how paranoid we want to be about
_mapcount underflow).
Link: http://lkml.kernel.org/r/20180518194519.3820-3-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is to take better advantage of general huge page clearing
optimization (commit c79b57e462: "mm: hugetlb: clear target sub-page
last when clearing huge page") for hugetlbfs.
In the general optimization patch, the sub-page to access will be
cleared last to avoid the cache lines of to access sub-page to be
evicted when clearing other sub-pages. This works better if we have the
address of the sub-page to access, that is, the fault address inside the
huge page. So the hugetlbfs no page fault handler is changed to pass
that information. This will benefit workloads which don't access the
begin of the hugetlbfs huge page after the page fault under heavy cache
contention for shared last level cache.
The patch is a generic optimization which should benefit quite some
workloads, not for a specific use case. To demonstrate the performance
benefit of the patch, we tested it with vm-scalability run on hugetlbfs.
With this patch, the throughput increases ~28.1% in vm-scalability
anon-w-seq test case with 88 processes on a 2 socket Xeon E5 2699 v4
system (44 cores, 88 threads). The test case creates 88 processes, each
process mmaps a big anonymous memory area with MAP_HUGETLB and writes to
it from the end to the begin. For each process, other processes could
be seen as other workload which generates heavy cache pressure. At the
same time, the cache miss rate reduced from ~36.3% to ~25.6%, the IPC
(instruction per cycle) increased from 0.3 to 0.37, and the time spent
in user space is reduced ~19.3%.
Link: http://lkml.kernel.org/r/20180517083539.9242-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Punit Agrawal <punit.agrawal@arm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct. For now, this is just documenting that the
function returns a VM_FAULT value rather than an errno. Once all
instances are converted, vm_fault_t will become a distinct type.
See commit 1c8f422059 ("mm: change return type to vm_fault_t")
Link: http://lkml.kernel.org/r/20180512063745.GA26866@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joe Perches <joe@perches.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct. For now, this is just documenting that the
function returns a VM_FAULT value rather than an errno. Once all
instances are converted, vm_fault_t will become a distinct type.
Link: http://lkml.kernel.org/r/20180511190542.GA2412@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
finalise_ac() has parameter order which is not used at all. Remove it.
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The new helper returns index of the matching string in an array. We are
going to use it here.
Link: http://lkml.kernel.org/r/20180503203206.44046-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using kstrndup() simplifies the code.
Link: http://lkml.kernel.org/r/20180503201807.24941-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memory controller implements the memory.low best-effort memory
protection mechanism, which works perfectly in many cases and allows
protecting working sets of important workloads from sudden reclaim.
But its semantics has a significant limitation: it works only as long as
there is a supply of reclaimable memory. This makes it pretty useless
against any sort of slow memory leaks or memory usage increases. This
is especially true for swapless systems. If swap is enabled, memory
soft protection effectively postpones problems, allowing a leaking
application to fill all swap area, which makes no sense. The only
effective way to guarantee the memory protection in this case is to
invoke the OOM killer.
It's possible to handle this case in userspace by reacting on MEMCG_LOW
events; but there is still a place for a fail-safe in-kernel mechanism
to provide stronger guarantees.
This patch introduces the memory.min interface for cgroup v2 memory
controller. It works very similarly to memory.low (sharing the same
hierarchical behavior), except that it's not disabled if there is no
more reclaimable memory in the system.
If cgroup is not populated, its memory.min is ignored, because otherwise
even the OOM killer wouldn't be able to reclaim the protected memory,
and the system can stall.
[guro@fb.com: s/low/min/ in docs]
Link: http://lkml.kernel.org/r/20180510130758.GA9129@castle.DHCP.thefacebook.com
Link: http://lkml.kernel.org/r/20180509180734.GA4856@castle.DHCP.thefacebook.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
is_pageblock_removable_nolock() is not used outside of
mm/memory_hotplug.c. Move it next to unique caller
is_mem_section_removable() and make it static.
Remove prototype in <linux/memory_hotplug.h> to silence gcc warning (W=1):
mm/page_alloc.c:7704:6: warning: no previous prototype for `is_pageblock_removable_nolock' [-Wmissing-prototypes]
Link: http://lkml.kernel.org/r/20180509190001.24789-1-malat@debian.org
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memblock_remove report is useful to see why MemTotal of /proc/meminfo
between two kernels makes difference.
Link: http://lkml.kernel.org/r/20180508104223.8028-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The per-cpu memcg stock can retain a charge of upto 32 pages. On a
machine with large number of cpus, this can amount to a decent amount of
memory. Additionally force_empty interface might be triggering unneeded
memcg reclaims.
Link: http://lkml.kernel.org/r/20180507201651.165879-1-shakeelb@google.com
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Resizing the memcg limit for cgroup-v2 drains the stocks before
triggering the memcg reclaim. Do the same for cgroup-v1 to make the
behavior consistent.
Link: http://lkml.kernel.org/r/20180504205548.110696-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark memcg1_events static: it's only used by memcontrol.c. And mark it
const: it's not modified.
Link: http://lkml.kernel.org/r/20180503192940.94971-1-gthelen@google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_cgwb_list is a very simple wrapper and it will never be used
outside of code under CONFIG_CGROUP_WRITEBACK. so use memcg->cgwb_list
directly.
Link: http://lkml.kernel.org/r/1524406173-212182-1-git-send-email-wanglong19@meituan.com
Signed-off-by: Wang Long <wanglong19@meituan.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
tmpfs uses the helper d_find_alias() to find a dentry from a decoded
inode, but d_find_alias() skips unhashed dentries, so unlinked files
cannot be decoded from a file handle.
This can be reproduced using xfstests test program open_by_handle:
$ open_by handle -c /tmp/testdir
$ open_by_handle -dk /tmp/testdir
open_by_handle(/tmp/testdir/file000000) returned 116 incorrectly on an unlinked open file!
To fix this, if d_find_alias() can't find a hashed alias, call
d_find_any_alias() to return an unhashed one.
Link: http://lkml.kernel.org/r/CAOQ4uxg+qSLP0KwdW+h1tcPqOCQd+_pGZVXiePQB1TXCMBMctQ@mail.gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_stable_node() and set_page_stable_node() are only used in mm/ksm.c
and there is no point to keep them in the include/linux/ksm.h
[akpm@linux-foundation.org: fix SYSFS=n build]
Link: http://lkml.kernel.org/r/1524552106-7356-3-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While revisiting my Btrfs swapfile series [1], I introduced a situation
in which reclaim would lock i_rwsem, and even though the swapon() path
clearly made GFP_KERNEL allocations while holding i_rwsem, I got no
complaints from lockdep. It turns out that the rework of the fs_reclaim
annotation was broken: if the current task has PF_MEMALLOC set, we don't
acquire the dummy fs_reclaim lock, but when reclaiming we always check
this _after_ we've just set the PF_MEMALLOC flag. In most cases, we can
fix this by moving the fs_reclaim_{acquire,release}() outside of the
memalloc_noreclaim_{save,restore}(), althought kswapd is slightly
different. After applying this, I got the expected lockdep splats.
1: https://lwn.net/Articles/625412/
Link: http://lkml.kernel.org/r/9f8aa70652a98e98d7c4de0fc96a4addcee13efe.1523778026.git.osandov@fb.com
Fixes: d92a8cfcb3 ("locking/lockdep: Rework FS_RECLAIM annotation")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since tmpfs THP was supported in 4.8, hugetlbfs is not the only
filesystem with huge page support anymore. tmpfs can use huge page via
THP when mounting by "huge=" mount option.
When applications use huge page on hugetlbfs, it just need check the
filesystem magic number, but it is not enough for tmpfs. Make
stat.st_blksize return huge page size if it is mounted by appropriate
"huge=" option to give applications a hint to optimize the behavior with
THP.
Some applications may not do wisely with THP. For example, QEMU may
mmap file on non huge page aligned hint address with MAP_FIXED, which
results in no pages are PMD mapped even though THP is used. Some
applications may mmap file with non huge page aligned offset. Both
behaviors make THP pointless.
statfs.f_bsize still returns 4KB for tmpfs since THP could be split, and
it also may fallback to 4KB page silently if there is not enough huge
page. Furthermore, different f_bsize makes max_blocks and free_blocks
calculation harder but without too much benefit. Returning huge page
size via stat.st_blksize sounds good enough.
Since PUD size huge page for THP has not been supported, now it just
returns HPAGE_PMD_SIZE.
Hugh said:
: Sorry, I have no enthusiasm for this patch; but do I feel strongly
: enough to override you and everyone else to NAK it? No, I don't feel
: that strongly, maybe st_blksize isn't worth arguing over.
:
: We did look at struct stat when designing huge tmpfs, to see if there
: were any fields that should be adjusted for it; but concluded none.
: Yes, it would sometimes be nice to have a quickly accessible indicator
: for when tmpfs has been mounted huge (scanning /proc/mounts for options
: can be tiresome, agreed); but since tmpfs tries to supply huge (or not)
: pages transparently, no difference seemed right.
:
: So, because st_blksize is a not very useful field of struct stat, with
: "size" in the name, we're going to put HPAGE_PMD_SIZE in there instead
: of PAGE_SIZE, if the tmpfs was mounted with one of the huge "huge"
: options (force or always, okay; within_size or advise, not so much).
: Though HPAGE_PMD_SIZE is no more its "preferred I/O size" or "blocksize
: for file system I/O" than PAGE_SIZE was.
:
: Which we can expect to speed up some applications and disadvantage
: others, depending on how they interpret st_blksize: just like if we
: changed it in the same way on non-huge tmpfs. (Did I actually try
: changing st_blksize early on, and find it broke something? If so, I've
: now forgotten what, and a search through commit messages didn't find
: it; but I guess we'll find out soon enough.)
:
: If there were an mstat() syscall, returning a field "preferred
: alignment", then we could certainly agree to put HPAGE_PMD_SIZE in
: there; but in stat()'s st_blksize? And what happens when (in future)
: mm maps this or that hard-disk filesystem's blocks with a pmd mapping -
: should that filesystem then advertise a bigger st_blksize, despite the
: same disk layout as before? What happens with DAX?
:
: And this change is not going to help the QEMU suboptimality that
: brought you here (or does QEMU align mmaps according to st_blksize?).
: QEMU ought to work well with kernels without this change, and kernels
: with this change; and I hope it can easily deal with both by avoiding
: that use of MAP_FIXED which prevented the kernel's intended alignment.
[akpm@linux-foundation.org: remove unneeded `else']
Link: http://lkml.kernel.org/r/1524665633-83806-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Client can call vunmap with some intermediate 'addr' which may not be
the start of the VM area. Entire unmap code works with vm->vm_start
which is proper but debug object API is called with 'addr'. This could
be a problem within debug objects.
Pass proper start address into debug object API.
[akpm@linux-foundation.org: fix warning]
Link: http://lkml.kernel.org/r/1523961828-9485-3-git-send-email-cpandya@codeaurora.org
Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, __vunmap flow is,
1) Release the VM area
2) Free the debug objects corresponding to that vm area.
This leave some race window open.
1) Release the VM area
1.5) Some other client gets the same vm area
1.6) This client allocates new debug objects on the same
vm area
2) Free the debug objects corresponding to this vm area.
Here, we actually free 'other' client's debug objects.
Fix this by freeing the debug objects first and then releasing the VM
area.
Link: http://lkml.kernel.org/r/1523961828-9485-2-git-send-email-cpandya@codeaurora.org
Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vunmap does page table clear operations twice in the case when
DEBUG_PAGEALLOC_ENABLE_DEFAULT is enabled.
So, clean up the code as that is unintended.
As a perf gain, we save few us. Below ftrace data was obtained while
doing 1 MB of vmalloc/vfree on ARM64 based SoC *without* this patch
applied. After this patch, we can save ~3 us (on 1 extra
vunmap_page_range).
CPU DURATION FUNCTION CALLS
| | | | | | |
6) | __vunmap() {
6) | vmap_debug_free_range() {
6) 3.281 us | vunmap_page_range();
6) + 45.468 us | }
6) 2.760 us | vunmap_page_range();
6) ! 505.105 us | }
[cpandya@codeaurora.org: v3]
Link: http://lkml.kernel.org/r/1525176960-18408-1-git-send-email-cpandya@codeaurora.org
Link: http://lkml.kernel.org/r/1523876342-10545-1-git-send-email-cpandya@codeaurora.org
Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit c4e1be9ec1 ("mm, sparsemem: break out of loops early")
__highest_present_section_nr is introduced to reduce the loop counts for
present section. This is also helpful for usemap and memmap allocation.
This patch uses __highest_present_section_nr + 1 to optimize the loop.
Link: http://lkml.kernel.org/r/20180326081956.75275-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When searching a present section, there are two boundaries:
* __highest_present_section_nr
* NR_MEM_SECTIONS
And it is known, __highest_present_section_nr is a more strict boundary
than NR_MEM_SECTIONS. This means it would be necessary to check
__highest_present_section_nr only.
Link: http://lkml.kernel.org/r/20180326081956.75275-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mmap_sem will be read locked when calling follow_pmd_mask(). But this
cannot prevent PMD from being changed for all cases when PTL is
unlocked, for example, from pmd_trans_huge() to pmd_none() via
MADV_DONTNEED. So it is possible for the pmd_present() check in
follow_pmd_mask() to encounter an invalid PMD. This may cause an
incorrect VM_BUG_ON() or an infinite loop. Fix this by reading the PMD
entry into a local variable with READ_ONCE() and checking the local
variable and pmd_none() in the retry loop.
As Kirill pointed out, with PTL unlocked, the *pmd may be changed under
us, so reading it directly again and again may incur weird bugs. So
although using *pmd directly other than for pmd_present() checking may
be safe, it is still better to replace them to read *pmd once and check
the local variable multiple times.
When PTL unlocked, replace all *pmd with local variable was suggested by
Kirill.
Link: http://lkml.kernel.org/r/20180419083514.1365-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If memcg's usage is equal to the memory.low value, avoid reclaiming from
this cgroup while there is a surplus of reclaimable memory.
This sounds more logical and also matches memory.high and memory.max
behavior: both are inclusive.
Empty cgroups are not considered protected, so MEMCG_LOW events are not
emitted for empty cgroups, if there is no more reclaimable memory in the
system.
Link: http://lkml.kernel.org/r/20180406122132.GA7185@castle
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch aims to address an issue in current memory.low semantics,
which makes it hard to use it in a hierarchy, where some leaf memory
cgroups are more valuable than others.
For example, there are memcgs A, A/B, A/C, A/D and A/E:
A A/memory.low = 2G, A/memory.current = 6G
//\\
BC DE B/memory.low = 3G B/memory.current = 2G
C/memory.low = 1G C/memory.current = 2G
D/memory.low = 0 D/memory.current = 2G
E/memory.low = 10G E/memory.current = 0
If we apply memory pressure, B, C and D are reclaimed at the same pace
while A's usage exceeds 2G. This is obviously wrong, as B's usage is
fully below B's memory.low, and C has 1G of protection as well. Also, A
is pushed to the size, which is less than A's 2G memory.low, which is
also wrong.
A simple bash script (provided below) can be used to reproduce
the problem. Current results are:
A: 1430097920
A/B: 711929856
A/C: 717426688
A/D: 741376
A/E: 0
To address the issue a concept of effective memory.low is introduced.
Effective memory.low is always equal or less than original memory.low.
In a case, when there is no memory.low overcommittment (and also for
top-level cgroups), these two values are equal.
Otherwise it's a part of parent's effective memory.low, calculated as a
cgroup's memory.low usage divided by sum of sibling's memory.low usages
(under memory.low usage I mean the size of actually protected memory:
memory.current if memory.current < memory.low, 0 otherwise). It's
necessary to track the actual usage, because otherwise an empty cgroup
with memory.low set (A/E in my example) will affect actual memory
distribution, which makes no sense. To avoid traversing the cgroup tree
twice, page_counters code is reused.
Calculating effective memory.low can be done in the reclaim path, as we
conveniently traversing the cgroup tree from top to bottom and check
memory.low on each level. So, it's a perfect place to calculate
effective memory low and save it to use it for children cgroups.
This also eliminates a need to traverse the cgroup tree from bottom to
top each time to check if parent's guarantee is not exceeded.
Setting/resetting effective memory.low is intentionally racy, but it's
fine and shouldn't lead to any significant differences in actual memory
distribution.
With this patch applied results are matching the expectations:
A: 2147930112
A/B: 1428721664
A/C: 718393344
A/D: 815104
A/E: 0
Test script:
#!/bin/bash
CGPATH="/sys/fs/cgroup"
truncate /file1 --size 2G
truncate /file2 --size 2G
truncate /file3 --size 2G
truncate /file4 --size 50G
mkdir "${CGPATH}/A"
echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
mkdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
echo 2G > "${CGPATH}/A/memory.low"
echo 3G > "${CGPATH}/A/B/memory.low"
echo 1G > "${CGPATH}/A/C/memory.low"
echo 0 > "${CGPATH}/A/D/memory.low"
echo 10G > "${CGPATH}/A/E/memory.low"
echo $$ > "${CGPATH}/A/B/cgroup.procs" && vmtouch -qt /file1
echo $$ > "${CGPATH}/A/C/cgroup.procs" && vmtouch -qt /file2
echo $$ > "${CGPATH}/A/D/cgroup.procs" && vmtouch -qt /file3
echo $$ > "${CGPATH}/cgroup.procs" && vmtouch -qt /file4
echo "A: " `cat "${CGPATH}/A/memory.current"`
echo "A/B: " `cat "${CGPATH}/A/B/memory.current"`
echo "A/C: " `cat "${CGPATH}/A/C/memory.current"`
echo "A/D: " `cat "${CGPATH}/A/D/memory.current"`
echo "A/E: " `cat "${CGPATH}/A/E/memory.current"`
rmdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
rmdir "${CGPATH}/A"
rm /file1 /file2 /file3 /file4
Link: http://lkml.kernel.org/r/20180405185921.4942-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch renames struct page_counter fields:
count -> usage
limit -> max
and the corresponding functions:
page_counter_limit() -> page_counter_set_max()
mem_cgroup_get_limit() -> mem_cgroup_get_max()
mem_cgroup_resize_limit() -> mem_cgroup_resize_max()
memcg_update_kmem_limit() -> memcg_update_kmem_max()
memcg_update_tcp_limit() -> memcg_update_tcp_max()
The idea behind this renaming is to have the direct matching
between memory cgroup knobs (low, high, max) and page_counters API.
This is pure renaming, this patch doesn't bring any functional change.
Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>