The cache reaper currently tries to free all alien caches and all remote
per cpu pages in each pass of cache_reap. For a machines with large number
of nodes (such as Altix) this may lead to sporadic delays of around ~10ms.
Interrupts are disabled while reclaiming creating unacceptable delays.
This patch changes that behavior by adding a per cpu reap_node variable.
Instead of attempting to free all caches, we free only one alien cache and
the per cpu pages from one remote node. That reduces the time spend in
cache_reap. However, doing so will lengthen the time it takes to
completely drain all remote per cpu pagesets and all alien caches. The
time needed will grow with the number of nodes in the system. All caches
are drained when they overflow their respective capacity. So the drawback
here is only that a bit of memory may be wasted for awhile longer.
Details:
1. Rename drain_remote_pages to drain_node_pages to allow the specification
of the node to drain of pcp pages.
2. Add additional functions init_reap_node, next_reap_node for NUMA
that manage a per cpu reap_node counter.
3. Add a reap_alien function that reaps only from the current reap_node.
For us this seems to be a critical issue. Holdoffs of an average of ~7ms
cause some HPC benchmarks to slow down significantly. F.e. NAS parallel
slows down dramatically. NAS parallel has a 12-16 seconds runtime w/o rotor
compared to 5.8 secs with the rotor patches. It gets down to 5.05 secs with
the additional interrupt holdoff reductions.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We require that all archs implement atomic_cmpxchg(), for the generic
version of atomic_add_unless().
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When pages are onlined, not only zone->present_pages but also
pgdat->node_present_pages should be refreshed.
This parameter is used to show information at
/sys/device/system/node/nodeX/meminfo via si_meminfo_node().
So, it shows strange value for MemUsed which is calculated
(node_present_pages - all zones free pages).
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
EDAC is still causing a few problems and the code is relatively green. Mark
it as experimental until thing settle down.
Also, provide some documentation pointers in Kconfig help.
Signed-off-by: Tim Small <tim@buttersideup.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently the code tries up to spin_retry times to grab a lock using the cs
instruction. The cs instruction has exclusive access to a memory region
and therefore invalidates the appropiate cache line of all other cpus. If
there is contention on a lock this leads to cache line trashing. This can
be avoided if we first check wether a cs instruction is likely to succeed
before the instruction gets actually executed.
Signed-off-by: Christian Ehrhardt <ehrhardt@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The kobject_put() can free the memory at *cmd, but cmd->lock points to a
persistent lock that is not freed with cmd.
Signed-off-by: Max Asbock <masbock@us.ibm.com>
Cc: Vernon Mauery <vernux@us.ibm.com>
Cc: Srihari Vijayaraghavan <sriharivijayaraghavan@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If the process has already set PF_MALLOC and is already using
current->reclaim_state then do not try to reclaim memory from the zone.
This is set by kswapd and/or synchrononous global reclaim which will not
take it lightly if we zap the reclaim_state.
Signed-off-by: Christoph Lameter <clameter@sig.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- wrong test for 'is this a BARRIER bio'
- not freeing on all possible paths.
- using r1_bio after freeing it.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
/usr/src/ctest/git/kernel/mm/rmap.c: In function `page_referenced_one':
/usr/src/ctest/git/kernel/mm/rmap.c:354: warning: implicit declaration of function `rwsem_is_locked'
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix some bugs in mtd/jffs2 on 64bit platform.
The MEMGETBADBLOCK/MEMSETBADBLOCK ioctl are not listed in compat_ioctl.h.
And some variables in jffs2 are declared as uint32_t but used to hold
size_t values.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix a lockup which was introduced during the conversion to the generic IRQ
framework.
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ERR_PTR() is supposed to be passed a negative value.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
smi_data_buf_free() references dcdbas_pdev when calling
dma_free_coherent(). In dcdbas_exit(), smi_data_buf_free() is called after
platform_device_unregister(dcdbas_pdev).
This patch moves platform_device_unregister(dcdbas_pdev) after
smi_data_buf_free() in dcdbas_exit().
Signed-off-by: Doug Warzecha <Douglas_Warzecha@dell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove two early-development BUG_ONs from page_add_file_rmap.
The pfn_valid test (originally useful for checking that nobody passed an
artificial struct page) comes too late, since we already have the struct
page.
The PageAnon test (useful when anon was first distinguished from file rmap)
prevents ->nopage implementations from reusing ->mapping, which would
otherwise be available.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Turn on truncation to prevent getting choked by frames larger than expected.
Without this fix, driver hangs after receiving an oversize packet.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Avoid premature transmit ring full conditions.
Force a transmit status interrupt if transmit ring gets nearly full
and after a TSO send.
Allow more entries in transmit ring to be used if dma_addr is 32 bits
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Don't use sky2 to seed random pool beacause the network packet arrival time
will not be truly random due to NAPI and interrupt mitigation.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
de_init_hw enables the irq thus it must be issued after request_irq.
Signed-off-by: Martin Michlmayr <tbm@cyrius.com>
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
This is gluing the build of cirrusfb but really the mess that would need
cleaning and fixing is <video/vga.h> and <linux/vt_buffer.h> ...
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
At times gcc will place bits of __exit functions into .rodata. If
compiled into the kernle itself we used to discard .exit.text - but
not the bits left in .rodata. While harmless this did at times result
in a large number of warnings. So until gcc fixes this, discard
.exit.text at runtime.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
In case a particular system doesn't support highmem the runtime checks
will ensure nothing bad is going to happen.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
What: Support for NEC DDB5074 and DDB5476 evaluation boards.
When: June 2006
Why: Board specific code doesn't build anymore since ~2.6.0 and no
users have complained indicating there is no more need for these
boards. This should really be considered a last call.
Who: Ralf Baechle <ralf@linux-mips.org>
The low level PCI DMA mapping functions should handle it in most cases.
This should fix problems with depleting the DMA zone early. The old
code used precious GFP_DMA memory in many cases where it was not needed.
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ATI chipsets tend to generate double timer interrupts for the local APIC
timer when both the 8254 and the IO-APIC timer pins are enabled. This is
because they route it to both and the result is anded together and the CPU
ends up processing it twice.
This patch changes check_timer to disable the 8254 routing for interrupt 0.
I think it would be safe on all chipsets actually (i tested it on a couple
and it worked everywhere) and Windows seems to do it in a similar way, but
to be conservative this patch only enables this mode on ATI (and adds
options to enable/disable too)
Ported over from a similar x86-64 change.
I reused the ACPI earlyquirk infrastructure for the ATI bridge check, but
tweaked it a bit to work even without ACPI.
Inspired by a patch from Chuck Ebbert, but redone.
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A recent change to compat. dev_ifconf() in fs/compat_ioctl.c
causes ifconf data to be truncated 1 entry too early when copying it
to userspace. The correct amount of data (length) is returned,
but the final entry is empty (zero, not filled in).
The for-loop 'i' check should use <= to allow the final struct
ifreq32 to be copied. I also used the ifconf-corruption program
in kernel bugzilla #4746 to make sure that this change does not
re-introduce the corruption.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
A pte may be zapped by the swapper, exiting process, unmapping or page
migration while the accessed or dirty bit handers are about to run. In that
case the accessed bit or dirty is set on an zeroed pte which leads the VM to
conclude that this is a swap pte. This may lead to
- Messages from the vm like
swap_free: Bad swap file entry 4000000000000000
- Processes being aborted
swap_dup: Bad swap file entry 4000000000000000
VM: killing process ....
Page migration is particular suitable for the creation of this race since
it needs to remove and restore page table entries.
The fix here is to check for the present bit and simply not update
the pte if the page is not present anymore. If the page is not present
then the fault handler should run next which will take care of the problem
by bringing the page back and then mark the page dirty or move it onto the
active list.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Patch from Alessandro Zummo
The patch that would have made the NSLU2
kernel non compatible with other ixp4xx machs
never entered the kernel. So it is actually
safe to remove the prompt dependencies.
Signed-off-by: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Patch from Alessandro Zummo
Disable GPIO clocks to allow
the power led to work properly.
Signed-off-by: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
While testing kexec and kdump we hit problems where the new kernel would
freeze or instantly reboot. The easiest way to trigger it was to kexec a
kernel compiled for CONFIG_M586 on an athlon cpu. Compiling for CONFIG_MK7
instead would work fine.
The patch fixes a few problems with the kexec inline asm.
Signed-off-by: Chris Mason <mason@suse.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Matt Mackall <mpm@selenic.com>
Tested-by: Anders K. Pedersen <akp@cohaesio.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
kmem_cache_init() incorrectly assumes that the cache_cache object will fit
in an order 0 allocation. On very large systems, this is not true. Change
the code to try larger order allocations if order 0 fails.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Miscellaneous fixes related to accessing uninitialized variables or memory
that was already freed.
Signed-off-by: Latchesar Ionkov <lucho@ionkov.net>
Cc: Eric Van Hensbergen <ericvh@ericvh.myip.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The x86_model calculation also applies for family 6. early_cpu_detect
does the right thing, but generic_identify misses.
Signed-off-by: Shaohua Li<shaohua.li@intel.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: "Seth, Rohit" <rohit.seth@intel.com>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
DASD allows to open a device as soon as gendisk is registered, which means the
device is a fake device (capacity=0) and we do know nothing about blocksize
and partitions at that point of time. In case the device is opened by
someone, the bdev and inode creation is done with the fake device info and the
following partition detection code is just using the wrong data.
To avoid this modify the DASD state machine to make sure that the open is
rejected until the device analysis is either finished or an unformatted device
was detected.
Signed-off-by: Horst Hummel <horst.hummel@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The message limit on the iucv connect call for the smsg module is too low.
Therefore increase the smsg message limit to 255.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
strnlen_user is supposed to return then length count + 1 if no terminating \0
is found, and it should return 0 on exception. Found by David Howells
<dhowells@redhat.com>.
Signed-off-by: Gerald Schaefer <geraldsc@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I have benchmarked this on an x86_64 NUMA system and see no significant
performance difference on kernbench. Tested on both x86_64 and powerpc.
The way we do file struct accounting is not very suitable for batched
freeing. For scalability reasons, file accounting was
constructor/destructor based. This meant that nr_files was decremented
only when the object was removed from the slab cache. This is susceptible
to slab fragmentation. With RCU based file structure, consequent batched
freeing and a test program like Serge's, we just speed this up and end up
with a very fragmented slab -
llm22:~ # cat /proc/sys/fs/file-nr
587730 0 758844
At the same time, I see only a 2000+ objects in filp cache. The following
patch I fixes this problem.
This patch changes the file counting by removing the filp_count_lock.
Instead we use a separate percpu counter, nr_files, for now and all
accesses to it are through get_nr_files() api. In the sysctl handler for
nr_files, we populate files_stat.nr_files before returning to user.
Counting files as an when they are created and destroyed (as opposed to
inside slab) allows us to correctly count open files with RCU.
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds new tunables for RCU queue and finished batches. There are
two types of controls - number of completed RCU updates invoked in a batch
(blimit) and monitoring for high rate of incoming RCUs on a cpu (qhimark,
qlowmark).
By default, the per-cpu batch limit is set to a small value. If the input
RCU rate exceeds the high watermark, we do two things - force quiescent
state on all cpus and set the batch limit of the CPU to INTMAX. Setting
batch limit to INTMAX forces all finished RCUs to be processed in one shot.
If we have more than INTMAX RCUs queued up, then we have bigger problems
anyway. Once the incoming queued RCUs fall below the low watermark, the
batch limit is set to the default.
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>