kernel-fxtec-pro1x

Andrea Arcangeli 70b50f94f1 mm: thp: tail page refcounting fix

Michel, while working on the working set estimation code, noticed that calling get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe if the pfn ended up being a tail page of a transparent hugepage under splitting by __split_huge_page_refcount().

He then found the problem could also theoretically materialize with page_cache_get_speculative() during the speculative radix tree lookups that use get_page_unless_zero() in SMP, if the radix tree page is freed and reallocated and get_user_pages is called on it before page_cache_get_speculative has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail->_count zero at all times. This will guarantee that get_page_unless_zero() can never succeed on any tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail pages of a compound page, so we can simply account the tail page references there and transfer them to tail_page->_count in __split_huge_page_refcount() (in addition to the head_page->_mapcount).

While debugging this s/_count/_mapcount/ change I also noticed get_page is called by direct-io.c on pages returned by get_user_pages. That wasn't entirely safe because the two atomic_inc in get_page weren't atomic. By contrast, other get_user_pages users, such as the secondary-MMU page fault that establishes the shadow pagetables, never call any superfluous get_page after get_user_pages returns. It's safer to make get_page universally safe for tail pages and to use get_page_foll() within follow_page (inside get_user_pages()). get_page_foll() is safe to do the refcounting for tail pages without taking any locks because it is run within PT lock protected critical sections (PT lock for pte and page_table_lock for pmd_trans_huge).

The standard get_page() as invoked by direct-io will instead now take the compound_lock, but still only for tail pages. The direct-io paths are usually I/O bound and the compound_lock is per THP, so very fine-grained, so there's no risk of scalability issues with it. A simple direct-io benchmark with all lockdep prove locking and spinlock debugging infrastructure enabled shows identical performance and no overhead. So it's worth it. Ideally direct-io should stop calling get_page() on pages returned by get_user_pages(). The spinlock in get_page() is already optimized away for no-THP builds, but doing get_page() on tail pages returned by GUP is generally a rare operation and usually only run in I/O paths.

This new refcounting on page_tail->_mapcount, in addition to avoiding new RCU critical sections, will also allow the working set estimation code to work without any further complexity associated with the tail page refcounting with THP.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:06:57 -07:00
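The core idea of the commit message above (keep the tail page's _count at zero, account tail references in a separate field, and transfer them at split time) can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: struct toy_page and the toy_* functions are hypothetical names, the tail_refs field stands in for page_tail->_mapcount with a simplified zero-based encoding, and the head page, the head's mapcount contribution, and all locking (compound_lock, PT lock) are left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a tail page; not the kernel's struct page. */
struct toy_page {
    atomic_int count;     /* models page_tail->_count, kept at zero while compound */
    atomic_int tail_refs; /* models the references accounted in page_tail->_mapcount */
};

/* Models get_page_unless_zero(): take a reference only if count is non-zero. */
bool toy_get_page_unless_zero(struct toy_page *p)
{
    int old = atomic_load(&p->count);

    while (old != 0) {
        /* On failure, old is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&p->count, &old, old + 1))
            return true;
    }
    return false;
}

/* Models taking a reference on a tail page: count is deliberately left untouched. */
void toy_get_tail_ref(struct toy_page *tail)
{
    atomic_fetch_add(&tail->tail_refs, 1);
}

/*
 * Models the transfer done at split time: the references accounted on the
 * tail page move into count, so the page behaves like a normal page afterwards.
 * The real split also folds in the head page's mapcount, which is omitted here.
 */
void toy_split_transfer(struct toy_page *tail)
{
    int refs = atomic_exchange(&tail->tail_refs, 0);

    atomic_fetch_add(&tail->count, refs);
}
```

In this model a speculative toy_get_page_unless_zero() on a tail page always fails until toy_split_transfer() has run, no matter how many tail references are held, which is the guarantee the commit message is after.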
..
kmemcheck	x86: Swap save_stack_trace_regs parameters	2011-06-14 22:48:51 -04:00
amdtopology.c	x86, NUMA: Enable CONFIG_AMD_NUMA on 32bit too	2011-05-02 17:24:48 +02:00
dump_pagetables.c	x86, mm: Create symbolic index into address_markers array	2010-07-20 16:56:19 -07:00
extable.c
fault.c	Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2011-10-28 05:46:02 -07:00
gup.c	mm: thp: tail page refcounting fix	2011-11-02 16:06:57 -07:00
highmem_32.c	mm: fix race in kunmap_atomic()	2010-10-27 18:03:05 -07:00
hugetlbpage.c	mm: Convert i_mmap_lock to a mutex	2011-05-25 08:39:18 -07:00
init.c	x86: Fix S4 regression	2011-10-24 06:55:20 +02:00
init_32.c	x86, mm: Allow ZONE_DMA to be configurable	2011-05-16 14:03:28 -07:00
init_64.c	mm: Move definition of MIN_MEMORY_BLOCK_SIZE to a header	2011-07-12 11:08:01 +10:00
iomap_32.c	mm: fix race in kunmap_atomic()	2010-10-27 18:03:05 -07:00
ioremap.c	ioremap: Delay sanity check until after a successful mapping	2011-04-29 08:02:47 +02:00
kmmio.c	x86, kmmio/mmiotrace: Fix double free of kmmio_fault_pages	2010-06-18 11:30:09 +02:00
Makefile	x86, NUMA: Rename amdtopology_64.c to amdtopology.c	2011-05-02 17:24:48 +02:00
memblock.c	x86, efi: Do not reserve boot services regions within reserved areas	2011-06-18 22:48:49 +02:00
memtest.c	x86, memblock: Replace e820_/_early string with memblock_	2010-08-27 11:13:47 -07:00
mmap.c	x86-32, amd: Move va_align definition to unbreak 32-bit build	2011-08-06 11:44:57 -07:00
mmio-mod.c	Merge branch 'master' into for-next	2011-09-15 15:08:18 +02:00
numa.c	x86, numa: Implement pfn -> nid mapping granularity check	2011-07-12 21:58:29 -07:00
numa_32.c	x86, mm: s/PAGES_PER_ELEMENT/PAGES_PER_SECTION/	2011-07-12 21:58:11 -07:00
numa_64.c	x86, NUMA: Move NUMA init logic from numa_64.c to numa.c	2011-05-02 14:18:53 +02:00
numa_emulation.c	x86, NUMA: Enable emulation on 32bit too	2011-05-02 17:24:48 +02:00
numa_internal.h	x86, NUMA: Initialize and use remap allocator from setup_node_bootmem()	2011-05-02 14:18:54 +02:00
pageattr-test.c	x86: Convert vmalloc()+memset() to vzalloc()	2011-05-28 19:53:57 +02:00
pageattr.c	x86: Fix common misspellings	2011-03-18 10:39:30 +01:00
pat.c	Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip	2010-08-06 10:17:52 -07:00
pat_internal.h	x86, pat: Fix memory leak in free_memtype	2010-05-26 11:26:04 -07:00
pat_rbtree.c	rbtree: Undo augmented trees performance damage and regression	2010-07-05 14:43:50 +02:00
pf_in.c	x86: Eliminate various 'set but not used' warnings	2011-05-21 19:10:33 +02:00
pf_in.h
pgtable.c	x86: Flush TLB if PGD entry is changed in i386 PAE mode	2011-03-18 11:44:01 +01:00
pgtable_32.c	x86: remove last traces of quicklist usage	2010-05-24 13:33:31 -07:00
physaddr.c
physaddr.h
setup_nx.c	x86, cpu: Only CPU features determine NX capabilities	2010-11-10 15:43:15 -08:00
srat.c	x86, NUMA: make srat.c 32bit safe	2011-05-02 14:18:52 +02:00
testmmiotrace.c	x86, kmmio/mmiotrace: Fix double free of kmmio_fault_pages	2010-06-18 11:30:09 +02:00
tlb.c	x86, tlb, UV: Do small micro-optimization for native_flush_tlb_others()	2011-03-15 08:30:34 +01:00