90afa5de6f
A bug was brought to my attention against a distro kernel, but it affects mainline, and I believe problems like this have been reported in various guises on the mailing lists, although I don't have specific examples at the moment.

The reported problem was that malloc() stalled for a long time (minutes in some cases) if a large tmpfs mount was occupying a large percentage of memory overall. The pages did not get cleaned or reclaimed by zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists were uselessly scanned frequently, making the CPU spin at near 100%.

This patchset intends to address that bug and bring the behaviour of zone_reclaim() more in line with expectations, which were noticed during investigation. It is based on top of mmotm and takes advantage of Kosaki's work with respect to zone_reclaim().

Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the scan should go ahead. The broken heuristic is what was causing the malloc() stall, as it uselessly scanned the LRU constantly. Currently, zone_reclaim is assuming zone_reclaim_mode is 1, and historically it could not deal with tmpfs pages at all. This fixes up the heuristic so that an unnecessary scan is more likely to be correctly avoided.

Patch 2 notes that zone_reclaim() returning a failure automatically means the zone is marked full. This is not always true. It could have failed because the GFP mask or zone_reclaim_mode were unsuitable.

Patch 3 introduces a counter zreclaim_failed that will increment each time the zone_reclaim scan-avoidance heuristics fail. If that counter is rapidly increasing, then zone_reclaim_mode should be set to 0 as a temporary resolution and a bug reported, because the scan-avoidance heuristic is still broken.

This patch:

On NUMA machines, the administrator can configure zone_reclaim_mode, which is a more targeted form of direct reclaim. On machines with large NUMA distances, for example, zone_reclaim_mode defaults to 1, meaning that clean unmapped pages will be reclaimed if the zone watermarks are not being met.

There is a heuristic that determines if the scan is worthwhile, but the problem is that the heuristic is not being properly applied and is basically assuming zone_reclaim_mode is 1 if it is enabled. The lack of proper detection can manifest as high CPU usage as the LRU list is scanned uselessly.

Historically, once enabled, it was depending on NR_FILE_PAGES, which may include swapcache pages that the reclaim_mode cannot deal with. Patch vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included pages that were not file-backed, such as swapcache, and made a calculation based on the inactive, active and mapped files. This is far superior when zone_reclaim_mode==1, but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a reasonable starting figure.

This patch alters how zone_reclaim() works out how many pages it might be able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set in the reclaim_mode, it will consider NR_FILE_PAGES as potential candidates; otherwise it will use NR_{IN,}ACTIVE_PAGES - NR_FILE_MAPPED to discount swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set, then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is not set, then NR_FILE_MAPPED are not.

[kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
[fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
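
The candidate-page estimate described in the commit message above can be sketched in plain userspace C. This is only an illustration of the described rules, not the kernel code itself: the `zone_counts` struct and both function names are hypothetical, the counters are passed in as plain numbers rather than read with zone_page_state(), and the RECLAIM_* bit values are assumptions defined locally for the sketch.

```c
/*
 * Illustrative sketch only: models the estimate described in the commit
 * message with plain integers instead of real per-zone kernel counters.
 */

/* Assumed bit values for the reclaim_mode flags named in the text. */
#define RECLAIM_ZONE  (1 << 0)  /* zone reclaim enabled */
#define RECLAIM_WRITE (1 << 1)  /* dirty pages may be written out */
#define RECLAIM_SWAP  (1 << 2)  /* mapped/swap-backed pages may be reclaimed */

/* Hypothetical snapshot of the counters the message refers to. */
struct zone_counts {
    unsigned long nr_file_pages;    /* may include tmpfs and swapcache */
    unsigned long nr_inactive_file; /* inactive file LRU */
    unsigned long nr_active_file;   /* active file LRU */
    unsigned long nr_file_mapped;
    unsigned long nr_file_dirty;
};

/* File LRU pages minus mapped pages, clamped at zero to avoid underflow. */
static unsigned long unmapped_file_pages(const struct zone_counts *z)
{
    unsigned long file_lru = z->nr_inactive_file + z->nr_active_file;

    return file_lru > z->nr_file_mapped ? file_lru - z->nr_file_mapped : 0;
}

/* Estimate how many page cache pages could be reclaimed in this mode. */
static unsigned long pagecache_reclaimable(const struct zone_counts *z, int mode)
{
    unsigned long candidates;
    unsigned long delta = 0;

    /* With RECLAIM_SWAP every file page is a potential candidate. */
    if (mode & RECLAIM_SWAP)
        candidates = z->nr_file_pages;
    else
        candidates = unmapped_file_pages(z);

    /* Without RECLAIM_WRITE, dirty pages cannot be cleaned, so drop them. */
    if (!(mode & RECLAIM_WRITE))
        delta += z->nr_file_dirty;

    /* Clamp so the subtraction cannot underflow. */
    if (delta > candidates)
        delta = candidates;

    return candidates - delta;
}
```

With hypothetical counts of 1000 file pages, 300 inactive and 200 active file LRU pages, 100 mapped and 50 dirty, the sketch yields 350 candidates in mode RECLAIM_ZONE and 950 once RECLAIM_SWAP is added. In a tmpfs-heavy zone where the file LRU is smaller than the mapped count, the mode-1 estimate drops to 0, which is the case where a scan would be pointless.
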
||
---|---|---|
.. | ||
ABI | ||
accounting | ||
acpi | ||
aoe | ||
arm | ||
auxdisplay | ||
blackfin | ||
block | ||
blockdev | ||
cdrom | ||
cgroups | ||
connector | ||
console | ||
cpu-freq | ||
cpuidle | ||
cris | ||
crypto | ||
development-process | ||
device-mapper | ||
DocBook | ||
driver-model | ||
dvb | ||
early-userspace | ||
fault-injection | ||
fb | ||
filesystems | ||
firmware_class | ||
frv | ||
hwmon | ||
i2c | ||
i2o | ||
ia64 | ||
ide | ||
infiniband | ||
input | ||
ioctl | ||
isdn | ||
ja_JP | ||
kbuild | ||
kdump | ||
ko_KR | ||
laptops | ||
lguest | ||
m68k | ||
make | ||
mips | ||
misc-devices | ||
mn10300 | ||
mtd | ||
namespaces | ||
netlabel | ||
networking | ||
parisc | ||
PCI | ||
pcmcia | ||
power | ||
powerpc | ||
prctl | ||
RCU | ||
s390 | ||
scheduler | ||
scsi | ||
serial | ||
sh | ||
sound | ||
sparc | ||
spi | ||
sysctl | ||
telephony | ||
thermal | ||
timers | ||
trace | ||
uml | ||
usb | ||
video4linux | ||
vm | ||
w1 | ||
watchdog | ||
wimax | ||
x86 | ||
zh_CN | ||
00-INDEX | ||
applying-patches.txt | ||
atomic_ops.txt | ||
bad_memory.txt | ||
basic_profiling.txt | ||
binfmt_misc.txt | ||
braille-console.txt | ||
bt8xxgpio.txt | ||
BUG-HUNTING | ||
c2port.txt | ||
cachetlb.txt | ||
Changes | ||
CodingStyle | ||
cpu-hotplug.txt | ||
cpu-load.txt | ||
cputopology.txt | ||
credentials.txt | ||
dcdbas.txt | ||
debugging-modules.txt | ||
debugging-via-ohci1394.txt | ||
dell_rbu.txt | ||
devices.txt | ||
DMA-API.txt | ||
DMA-attributes.txt | ||
DMA-ISA-LPC.txt | ||
DMA-mapping.txt | ||
dmaengine.txt | ||
dontdiff | ||
dynamic-debug-howto.txt | ||
edac.txt | ||
eisa.txt | ||
email-clients.txt | ||
exception.txt | ||
feature-removal-schedule.txt | ||
futex-requeue-pi.txt | ||
gpio.txt | ||
highuid.txt | ||
HOWTO | ||
hw_random.txt | ||
ics932s401 | ||
initrd.txt | ||
Intel-IOMMU.txt | ||
io-mapping.txt | ||
IO-mapping.txt | ||
io_ordering.txt | ||
iostats.txt | ||
IPMI.txt | ||
IRQ-affinity.txt | ||
IRQ.txt | ||
irqflags-tracing.txt | ||
isapnp.txt | ||
java.txt | ||
kernel-doc-nano-HOWTO.txt | ||
kernel-docs.txt | ||
kernel-parameters.txt | ||
keys-request-key.txt | ||
keys.txt | ||
kmemleak.txt | ||
kobject.txt | ||
kprobes.txt | ||
kref.txt | ||
ldm.txt | ||
leds-class.txt | ||
local_ops.txt | ||
lockdep-design.txt | ||
lockstat.txt | ||
logo.gif | ||
logo.txt | ||
magic-number.txt | ||
Makefile | ||
ManagementStyle | ||
markers.txt | ||
mca.txt | ||
md.txt | ||
memory-barriers.txt | ||
memory-hotplug.txt | ||
memory.txt | ||
mono.txt | ||
mutex-design.txt | ||
nmi_watchdog.txt | ||
nommu-mmap.txt | ||
numastat.txt | ||
oops-tracing.txt | ||
parport-lowlevel.txt | ||
parport.txt | ||
pi-futex.txt | ||
pnp.txt | ||
preempt-locking.txt | ||
printk-formats.txt | ||
prio_tree.txt | ||
rbtree.txt | ||
rfkill.txt | ||
robust-futex-ABI.txt | ||
robust-futexes.txt | ||
rt-mutex-design.txt | ||
rt-mutex.txt | ||
rtc.txt | ||
SAK.txt | ||
SecurityBugs | ||
SELinux.txt | ||
serial-console.txt | ||
sgi-ioc4.txt | ||
sgi-visws.txt | ||
slow-work.txt | ||
SM501.txt | ||
Smack.txt | ||
sparse.txt | ||
spinlocks.txt | ||
stable_api_nonsense.txt | ||
stable_kernel_rules.txt | ||
SubmitChecklist | ||
SubmittingDrivers | ||
SubmittingPatches | ||
svga.txt | ||
sysfs-rules.txt | ||
sysrq.txt | ||
tomoyo.txt | ||
unaligned-memory-access.txt | ||
unicode.txt | ||
unshare.txt | ||
VGA-softcursor.txt | ||
video-output.txt | ||
volatile-considered-harmful.txt | ||
voyager.txt | ||
zorro.txt |