kernel-fxtec-pro1x

Author	SHA1	Message	Date
Johannes Berg	3b5ad0797c	stacktrace: fix header file for !CONFIG_STACKTRACE The print_stack_trace macro in stacktrace.h has a wrong number of arguments, fix it. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:49 -07:00
Peter Zijlstra	96645678cd	lockstat: measure lock bouncing __acquire \| lock _____ \| \ \| __contended \| \| \| wait \| _______/ \|/ \| __acquired \| __release \| unlock We measure acquisition and contention bouncing. This is done by recording a cpu stamp in each lock instance. Contention bouncing requires the cpu stamp to be set on acquisition. Hence we move __acquired into the generic path. __acquired is then used to measure acquisition bouncing by comparing the current cpu with the old stamp before replacing it. __contended is used to measure contention bouncing (only useful for preemptable locks) [akpm@linux-foundation.org: cleanups] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:49 -07:00
Peter Zijlstra	4b32d0a4e9	lockdep: various fixes - update the copyright notices - use the default hash function - fix a thinko in a BUILD_BUG_ON - add a WARN_ON to spot inconsitent naming - fix a termination issue in /proc/lock_stat [akpm@linux-foundation.org: cleanups] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:49 -07:00
Peter Zijlstra	f20786ff4d	lockstat: core infrastructure Introduce the core lock statistics code. Lock statistics provides lock wait-time and hold-time (as well as the count of corresponding contention and acquisitions events). Also, the first few call-sites that encounter contention are tracked. Lock wait-time is the time spent waiting on the lock. This provides insight into the locking scheme, that is, a heavily contended lock is indicative of a too coarse locking scheme. Lock hold-time is the duration the lock was held, this provides a reference for the wait-time numbers, so they can be put into perspective. 1) lock 2) ... do stuff .. unlock 3) The time between 1 and 2 is the wait-time. The time between 2 and 3 is the hold-time. The lockdep held-lock tracking code is reused, because it already collects locks into meaningful groups (classes), and because it is an existing infrastructure for lock instrumentation. Currently lockdep tracks lock acquisition with two hooks: lock() lock_acquire() _lock() ... code protected by lock ... unlock() lock_release() _unlock() We need to extend this with two more hooks, in order to measure contention. lock_contended() - used to measure contention events lock_acquired() - completion of the contention These are then placed the following way: lock() lock_acquire() if (!_try_lock()) lock_contended() _lock() lock_acquired() ... do locked stuff ... unlock() lock_release() _unlock() (Note: the try_lock() 'trick' is used to avoid instrumenting all platform dependent lock primitive implementations.) It is also possible to toggle the two lockdep features at runtime using: /proc/sys/kernel/prove_locking /proc/sys/kernel/lock_stat (esp. turning off the O(n^2) prove_locking functionaliy can help) [akpm@linux-foundation.org: build fixes] [akpm@linux-foundation.org: nuke unneeded ifdefs] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:49 -07:00
Peter Zijlstra	21f8ca3bf6	fix raw_spinlock_t vs lockdep Use the lockdep infrastructure to track lock contention and other lock statistics. It tracks lock contention events, and the first four unique call-sites that encountered contention. It also measures lock wait-time and hold-time in nanoseconds. The minimum and maximum times are tracked, as well as a total (which together with the number of event can give the avg). All statistics are done per lock class, per write (exclusive state) and per read (shared state). The statistics are collected per-cpu, so that the collection overhead is minimized via having no global cachemisses. This new lock statistics feature is independent of the lock dependency checking traditionally done by lockdep; it just shares the lock tracking code. It is also possible to enable both and runtime disabled either component - thereby avoiding the O(n^2) lock chain walks for instance. This patch: raw_spinlock_t should not use lockdep (and doesn't) since lockdep itself relies on it. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:49 -07:00
Jan Harkes	3cf01f28c3	coda: remove statistics counters from /proc/fs/coda Similar information can easily be obtained with strace -c. Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:48 -07:00
Jan Harkes	a1b0aa8764	coda: remove struct coda_sb_info The sb_info structure only contains a single pointer to the character device, there is no need for the added indirection. Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:48 -07:00
Jan Harkes	d9664c95af	coda: block signals during upcall processing We ignore signals for about 30 seconds to give userspace a chance to see the upcall. As we did not block signals we ended up in a busy loop for the remainder of the period when a signal is received. Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:48 -07:00
Kawai, Hidehiro	3cb4a0bb1e	coredump masking: add an interface for core dump filter This patch adds an interface to set/reset flags which determines each memory segment should be dumped or not when a core file is generated. /proc/<pid>/coredump_filter file is provided to access the flags. You can change the flag status for a particular process by writing to or reading from the file. The flag status is inherited to the child process when it is created. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:47 -07:00
Kawai, Hidehiro	6c5d523826	coredump masking: reimplementation of dumpable using two flags This patch changes mm_struct.dumpable to a pair of bit flags. set_dumpable() converts three-value dumpable to two flags and stores it into lower two bits of mm_struct.flags instead of mm_struct.dumpable. get_dumpable() behaves in the opposite way. [akpm@linux-foundation.org: export set_dumpable] Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:46 -07:00
Josef 'Jeff' Sipek	f79c20f525	fs: remove path_walk export Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Neil Brown <neilb@suse.de> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:45 -07:00
Josef 'Jeff' Sipek	c4a7808fc3	fs: mark link_path_walk static Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Neil Brown <neilb@suse.de> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:45 -07:00
Josef 'Jeff' Sipek	16f1820028	fs: introduce vfs_path_lookup Stackable file systems, among others, frequently need to lookup paths or path components starting from an arbitrary point in the namespace (identified by a dentry and a vfsmount). Currently, such file systems use lookup_one_len, which is frowned upon [1] as it does not pass the lookup intent along; not passing a lookup intent, for example, can trigger BUG_ON's when stacking on top of NFSv4. The first patch introduces a new lookup function to allow lookup starting from an arbitrary point in the namespace. This approach has been suggested by Christoph Hellwig [2]. The second patch changes sunrpc to use vfs_path_lookup. The third patch changes nfsctl.c to use vfs_path_lookup. The fourth patch marks link_path_walk static. The fifth, and last patch, unexports path_walk because it is no longer unnecessary to call it directly, and using the new vfs_path_lookup is cleaner. For example, the following snippet of code, looks up "some/path/component" in a directory pointed to by parent_{dentry,vfsmnt}: err = vfs_path_lookup(parent_dentry, parent_vfsmnt, "some/path/component", 0, &nd); if (!err) { /* exits / ... / once done, release the references / path_release(&nd); } else if (err == -ENOENT) { / doesn't exist / } else { / other error */ } VFS functions such as lookup_create can be used on the nameidata structure to pass the create intent to the file system. Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Neil Brown <neilb@suse.de> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:45 -07:00
Ollie Wild	b6a2fea393	mm: variable length argument support Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from the old mm into the new mm. We create the new mm before the binfmt code runs, and place the new stack at the very top of the address space. Once the binfmt code runs and figures out where the stack should be, we move it downwards. It is a bit peculiar in that we have one task with two mm's, one of which is inactive. [a.p.zijlstra@chello.nl: limit stack size] Signed-off-by: Ollie Wild <aaw@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <linux-arch@vger.kernel.org> Cc: Hugh Dickins <hugh@veritas.com> [bunk@stusta.de: unexport bprm_mm_init] Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:45 -07:00
Peter Zijlstra	bdf4c48af2	audit: rework execve audit The purpose of audit_bprm() is to log the argv array to a userspace daemon at the end of the execve system call. Since user-space hasn't had time to run, this array is still in pristine state on the process' stack; so no need to copy it, we can just grab it from there. In order to minimize the damage to audit_log_*() copy each string into a temporary kernel buffer first. Currently the audit code requires that the full argument vector fits in a single packet. So currently it does clip the argv size to a (sysctl) limit, but only when execve auditing is enabled. If the audit protocol gets extended to allow for multiple packets this check can be removed. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ollie Wild <aaw@google.com> Cc: <linux-audit@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:45 -07:00
Michael Ellerman	3d7e33825d	jprobes: make jprobes a little safer for users I realise jprobes are a razor-blades-included type of interface, but that doesn't mean we can't try and make them safer to use. This guy I know once wrote code like this: struct jprobe jp = { .kp.symbol_name = "foo", .entry = "jprobe_foo" }; And then his kernel exploded. Oops. This patch adds an arch hook, arch_deref_entry_point() (I don't like it either) which takes the void * in a struct jprobe, and gives back the text address that it represents. We can then use that in register_jprobe() to check that the entry point we're passed is actually in the kernel text, rather than just some random value. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:44 -07:00
Michael Ellerman	9e367d8592	jprobes: remove JPROBE_ENTRY() AFAICT now that jprobe.entry is a void *, JPROBE_ENTRY doesn't do anything useful - so remove it .. I've left a do-nothing version so that out-of-tree jprobes code will still compile without modifications. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:44 -07:00
Michael Ellerman	81eae375ec	jprobes: make struct jprobe.entry a void * Currently jprobe.entry is a kprobe_opcode_t , but that's a lie. On some platforms it doesn't point to an opcode at all, it points to a function descriptor. It's really a pointer to something that the arch code can turn into a function entry point. And that's what actually happens, none of the generic code ever looks at jprobe.entry, it's only ever dereferenced by arch code. So just make it a void . Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:44 -07:00
Fengguang Wu	f9acc8c7b3	readahead: sanify file_ra_state names Rename some file_ra_state variables and remove some accessors. It results in much simpler code. Kudos to Rusty! Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:44 -07:00
Rusty Russell	cf914a7d65	readahead: split ondemand readahead interface into two functions Split ondemand readahead interface into two functions. I think this makes it a little clearer for non-readahead experts (like Rusty). Internally they both call ondemand_readahead(), but the page argument is changed to an obvious boolean flag. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:44 -07:00
Fengguang Wu	fe3cba17c4	mm: share PG_readahead and PG_reclaim Share the same page flag bit for PG_readahead and PG_reclaim. One is used only on file reads, another is only for emergency writes. One is used mostly for fresh/young pages, another is for old pages. Combinations of possible interactions are: a) clear PG_reclaim => implicit clear of PG_readahead it will delay an asynchronous readahead into a synchronous one it actually does _good_ for readahead: the pages will be reclaimed soon, it's readahead thrashing! in this case, synchronous readahead makes more sense. b) clear PG_readahead => implicit clear of PG_reclaim one(and only one) page will not be reclaimed in time it can be avoided by checking PageWriteback(page) in readahead first c) set PG_reclaim => implicit set of PG_readahead will confuse readahead and make it restart the size rampup process it's a trivial problem, and can mostly be avoided by checking PageWriteback(page) first in readahead d) set PG_readahead => implicit set of PG_reclaim PG_readahead will never be set on already cached pages. PG_reclaim will always be cleared on dirtying a page. so not a problem. In summary, a) we get better behavior b,d) possible interactions can be avoided c) racy condition exists that might affect readahead, but the chance is _really_ low, and the hurt on readahead is trivial. Compound pages also use PG_reclaim, but for now they do not interact with reclaim/readahead code. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:44 -07:00
Fengguang Wu	c743d96b6d	readahead: remove the old algorithm Remove the old readahead algorithm. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:44 -07:00
Fengguang Wu	122a21d11c	readahead: on-demand readahead logic This is a minimal readahead algorithm that aims to replace the current one. It is more flexible and reliable, while maintaining almost the same behavior and performance. Also it is full integrated with adaptive readahead. It is designed to be called on demand: - on a missing page, to do synchronous readahead - on a lookahead page, to do asynchronous readahead In this way it eliminated the awkward workarounds for cache hit/miss, readahead thrashing, retried read, and unaligned read. It also adopts the data structure introduced by adaptive readahead, parameterizes readahead pipelining with `lookahead_index', and reduces the current/ahead windows to one single window. HEURISTICS The logic deals with four cases: - sequential-next found a consistent readahead window, so push it forward - random standalone small read, so read as is - sequential-first create a new readahead window for a sequential/oversize request - lookahead-clueless hit a lookahead page not associated with the readahead window, so create a new readahead window and ramp it up In each case, three parameters are determined: - readahead index: where the next readahead begins - readahead size: how much to readahead - lookahead size: when to do the next readahead (for pipelining) BEHAVIORS The old behaviors are maximally preserved for trivial sequential/random reads. Notable changes are: - It no longer imposes strict sequential checks. It might help some interleaved cases, and clustered random reads. It does introduce risks of a random lookahead hit triggering an unexpected readahead. But in general it is more likely to do good than to do evil. - Interleaved reads are supported in a minimal way. Their chances of being detected and proper handled are still low. - Readahead thrashings are better handled. The current readahead leads to tiny average I/O sizes, because it never turn back for the thrashed pages. They have to be fault in by do_generic_mapping_read() one by one. Whereas the on-demand readahead will redo readahead for them. OVERHEADS The new code reduced the overheads of - excessively calling the readahead routine on small sized reads (the current readahead code insists on seeing all requests) - doing a lot of pointless page-cache lookups for small cached files (the current readahead only turns itself off after 256 cache hits, unfortunately most files are < 1MB, so never see that chance) That accounts for speedup of - 0.3% on 1-page sequential reads on sparse file - 1.2% on 1-page cache hot sequential reads - 3.2% on 256-page cache hot sequential reads - 1.3% on cache hot `tar /lib` However, it does introduce one extra page-cache lookup per cache miss, which impacts random reads slightly. That's 1% overheads for 1-page random reads on sparse file. PERFORMANCE The basic benchmark setup is - 2.6.20 kernel with on-demand readahead - 1MB max readahead size - 2.9GHz Intel Core 2 CPU - 2GB memory - 160G/8M Hitachi SATA II 7200 RPM disk The benchmarks show that - it maintains the same performance for trivial sequential/random reads - sysbench/OLTP performance on MySQL gains up to 8% - performance on readahead thrashing gains up to 3 times iozone throughput (KB/s): roughly the same ========================================== iozone -c -t1 -s 4096m -r 64k 2.6.20 on-demand gain first run " Initial write " 61437.27 64521.53 +5.0% " Rewrite " 47893.02 48335.20 +0.9% " Read " 62111.84 62141.49 +0.0% " Re-read " 62242.66 62193.17 -0.1% " Reverse Read " 50031.46 49989.79 -0.1% " Stride read " 8657.61 8652.81 -0.1% " Random read " 13914.28 13898.23 -0.1% " Mixed workload " 19069.27 19033.32 -0.2% " Random write " 14849.80 14104.38 -5.0% " Pwrite " 62955.30 65701.57 +4.4% " Pread " 62209.99 62256.26 +0.1% second run " Initial write " 60810.31 66258.69 +9.0% " Rewrite " 49373.89 57833.66 +17.1% " Read " 62059.39 62251.28 +0.3% " Re-read " 62264.32 62256.82 -0.0% " Reverse Read " 49970.96 50565.72 +1.2% " Stride read " 8654.81 8638.45 -0.2% " Random read " 13901.44 13949.91 +0.3% " Mixed workload " 19041.32 19092.04 +0.3% " Random write " 14019.99 14161.72 +1.0% " Pwrite " 64121.67 68224.17 +6.4% " Pread " 62225.08 62274.28 +0.1% In summary, writes are unstable, reads are pretty close on average: access pattern 2.6.20 on-demand gain Read 62085.61 62196.38 +0.2% Re-read 62253.49 62224.99 -0.0% Reverse Read 50001.21 50277.75 +0.6% Stride read 8656.21 8645.63 -0.1% Random read 13907.86 13924.07 +0.1% Mixed workload 19055.29 19062.68 +0.0% Pread 62217.53 62265.27 +0.1% aio-stress: roughly the same ============================ aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso 2.6.20 on-demand delta sequential 92.57s 92.54s -0.0% random 311.87s 312.15s +0.1% sysbench fileio: roughly the same ================================= sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \ --file-total-size=4G --file-block-size=64K \ --num-threads=001 --max-requests=10000 --max-time=900 run threads 2.6.20 on-demand delta first run 1 59.1974s 59.2262s +0.0% 2 58.0575s 58.2269s +0.3% 4 48.0545s 47.1164s -2.0% 8 41.0684s 41.2229s +0.4% 16 35.8817s 36.4448s +1.6% 32 32.6614s 32.8240s +0.5% 64 23.7601s 24.1481s +1.6% 128 24.3719s 23.8225s -2.3% 256 23.2366s 22.0488s -5.1% second run 1 59.6720s 59.5671s -0.2% 8 41.5158s 41.9541s +1.1% 64 25.0200s 23.9634s -4.2% 256 22.5491s 20.9486s -7.1% Note that the numbers are not very stable because of the writes. The overall performance is close when we sum all seconds up: sum all up 495.046s 491.514s -0.7% sysbench oltp (trans/sec): up to 8% gain ======================================== sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \ --mysql-socket=/var/run/mysqld/mysqld.sock \ --mysql-user=root --mysql-password=readahead \ --num-threads=064 --max-requests=10000 --max-time=900 run 10000-transactions run threads 2.6.20 on-demand gain 1 62.81 64.56 +2.8% 2 67.97 70.93 +4.4% 4 81.81 85.87 +5.0% 8 94.60 97.89 +3.5% 16 99.07 104.68 +5.7% 32 95.93 104.28 +8.7% 64 96.48 103.68 +7.5% 5000-transactions run 1 48.21 48.65 +0.9% 8 68.60 70.19 +2.3% 64 70.57 74.72 +5.9% 2000-transactions run 1 37.57 38.04 +1.3% 2 38.43 38.99 +1.5% 4 45.39 46.45 +2.3% 8 51.64 52.36 +1.4% 16 54.39 55.18 +1.5% 32 52.13 54.49 +4.5% 64 54.13 54.61 +0.9% That's interesting results. Some investigations show that - MySQL is accessing the db file non-uniformly: some parts are more hot than others - It is mostly doing 4-page random reads, and sometimes doing two reads in a row, the latter one triggers a 16-page readahead. - The on-demand readahead leaves many lookahead pages (flagged PG_readahead) there. Many of them will be hit, and trigger more readahead pages. Which might save more seeks. - Naturally, the readahead windows tend to lie in hot areas, and the lookahead pages in hot areas is more likely to be hit. - The more overall read density, the more possible gain. That also explains the adaptive readahead tricks for clustered random reads. readahead thrashing: 3 times better =================================== We boot kernel with "mem=128m single", and start a 100KB/s stream on every second, until reaching 200 streams. max throughput min avg I/O size 2.6.20: 5MB/s 16KB on-demand: 15MB/s 140KB Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:44 -07:00
Fengguang Wu	5ce1110b92	readahead: data structure and routines Extend struct file_ra_state to support the on-demand readahead logic. Also define some helpers for it. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:44 -07:00
Fengguang Wu	d77c2d7cc5	readahead: introduce PG_readahead Introduce a new page flag: PG_readahead. It acts as a look-ahead mark, which tells the page reader: Hey, it's time to invoke the read-ahead logic. For the sake of I/O pipelining, don't wait until it runs out of cached pages! Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:43 -07:00
David Brownell	2ba2d00363	AIO sparse fix (type of ki_flags) Fix type issue reported by latest 'sparse': kiocb.ki_flags should be "unsigned long" (not "long"), to match bitop type signature. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:43 -07:00
Akinobu Mita	e53252d97e	unregister_chrdev() return void unregister_chrdev() does not return meaningful value. This patch makes it return void like most unregister_* functions. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:43 -07:00
Pavel Machek	77afcf78a2	PM: Integrate beeping flag with existing acpi_sleep flags Move "debug during resume from s2ram" into the variable we already use for real-mode flags to simplify code. It also closes nasty trap for the user in acpi_sleep_setup; order of parameters actually mattered there, acpi_sleep=s3_bios,s3_mode doing something different from acpi_sleep=s3_mode,s3_bios. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:43 -07:00
Nigel Cunningham	5a60d6235c	PM: Optional beeping during resume from suspend to RAM Add a feature allowing the user to make the system beep during a resume from suspend to RAM, on x86_64 and i386. This is useful for the users with broken resume from RAM, so that they can verify if the control reaches the kernel after a wake-up event. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:43 -07:00
Rafael J. Wysocki	bd804eba1c	PM: Introduce pm_power_off_prepare Introduce the pm_power_off_prepare() callback that can be registered by the interested platforms in analogy with pm_idle() and pm_power_off(), used for preparing the system to power off (needed by ACPI). This allows us to drop acpi_sysclass and device_acpi that are only defined in order to register the ACPI power off preparation callback, which is needed by pm_power_off() registered in a much different way. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:42 -07:00
Rafael J. Wysocki	b10d911749	PM: introduce hibernation and suspend notifiers Make it possible to register hibernation and suspend notifiers, so that subsystems can perform hibernation-related or suspend-related operations that should not be carried out by device drivers' .suspend() and .resume() routines. [akpm@linux-foundation.org: build fixes] [akpm@linux-foundation.org: cleanups] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:42 -07:00
Rafael J. Wysocki	0c1eecfb34	Freezer: avoid freezing kernel threads prematurely Kernel threads should not have TIF_FREEZE set when user space processes are being frozen, since otherwise some of them might be frozen prematurely. To prevent this from happening we can (1) make exit_mm() unset TIF_FREEZE unconditionally just after clearing tsk->mm and (2) make try_to_freeze_tasks() check if p->mm is different from zero and PF_BORROWED_MM is unset in p->flags when user space processes are to be frozen. Namely, when user space processes are being frozen, we only should set TIF_FREEZE for tasks that have p->mm different from NULL and don't have PF_BORROWED_MM set in p->flags. For this reason task_lock() must be used to prevent try_to_freeze_tasks() from racing with use_mm()/unuse_mm(), in which p->mm and p->flags.PF_BORROWED_MM are changed under task_lock(p). Also, we need to prevent the following scenario from happening: * daemonize() is called by a task spawned from a user space code path * freezer checks if the task has p->mm set and the result is positive * task enters exit_mm() and clears its TIF_FREEZE * freezer sets TIF_FREEZE for the task * task calls try_to_freeze() and goes to the refrigerator, which is wrong at that point This requires us to acquire task_lock(p) before p->flags.PF_BORROWED_MM and p->mm are examined and release it after TIF_FREEZE is set for p (or it turns out that TIF_FREEZE should not be set). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:42 -07:00
Rafael J. Wysocki	a634cc1016	swsusp: introduce restore platform operations At least on some machines it is necessary to prepare the ACPI firmware for the restoration of the system memory state from the hibernation image if the "platform" mode of hibernation has been used. Namely, in that cases we need to disable the GPEs before replacing the "boot" kernel with the "frozen" kernel (cf. http://bugzilla.kernel.org/show_bug.cgi?id=7887). After the restore they will be re-enabled by hibernation_ops->finish(), but if the restore fails, they have to be re-enabled by the restore code explicitly. For this purpose we can introduce two additional hibernation operations, called pre_restore() and restore_cleanup() and call them from the restore code path. Still, they should be called if the "platform" mode of hibernation has been used, so we need to pass the information about the hibernation mode from the "frozen" kernel to the "boot" kernel in the image header. Apparently, we can't drop the disabling of GPEs before the restore because of Bug #7887 . We also can't do it unconditionally, because the GPEs wouldn't have been enabled after a successful restore if the suspend had been done in the 'shutdown' or 'reboot' mode. In principle we could (and probably should) unconditionally disable the GPEs before each snapshot creation and before the restore, but then we'd have to unconditionally enable them after the snapshot creation as well as after the restore (or restore failure) Still, for this purpose we'd need to modify acpi_enter_sleep_state_prep() and acpi_leave_sleep_state() and we'd have to introduce some mechanism synchronizing the disablind/enabling of the GPEs with the device drivers' .suspend()/.resume() routines and with disable_/enable_nonboot_cpus(). However, this would have affected the suspend (ie. s2ram) code as well as the hibernation, which I'd like to avoid in this patch series. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:42 -07:00
Mel Gorman	bb2d5ce164	Remove alloc_zeroed_user_highpage() alloc_zeroed_user_highpage() has no in-tree users and it is not exported. As it is not exported, it can simply be removed. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:41 -07:00
Nick Piggin	83c54070ee	mm: fault feedback #2 This patch completes Linus's wish that the fault return codes be made into bit flags, which I agree makes everything nicer. This requires requires all handle_mm_fault callers to be modified (possibly the modifications should go further and do things like fault accounting in handle_mm_fault -- however that would be for another patch). [akpm@linux-foundation.org: fix alpha build] [akpm@linux-foundation.org: fix s390 build] [akpm@linux-foundation.org: fix sparc build] [akpm@linux-foundation.org: fix sparc64 build] [akpm@linux-foundation.org: fix ia64 build] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Bryan Wu <bryan.wu@analog.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Cc: Matthew Wilcox <willy@debian.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Chris Zankel <chris@zankel.net> Acked-by: Kyle McMartin <kyle@mcmartin.ca> Acked-by: Haavard Skinnemoen <hskinnemoen@atmel.com> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Still apparently needs some ARM and PPC loving - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:41 -07:00
Nick Piggin	d0217ac04c	mm: fault feedback #1 Change ->fault prototype. We now return an int, which contains VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte. FAULT_RET_ code tells the VM whether a page was found, whether it has been locked, and potentially other things. This is not quite the way he wanted it yet, but that's changed in the next patch (which requires changes to arch code). This means we no longer set VM_CAN_INVALIDATE in the vma in order to say that a page is locked which requires filemap_nopage to go away (because we can no longer remain backward compatible without that flag), but we were going to do that anyway. struct fault_data is renamed to struct vm_fault as Linus asked. address is now a void __user * that we should firmly encourage drivers not to use without really good reason. The page is now returned via a page pointer in the vm_fault struct. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:41 -07:00
Nick Piggin	54cb8821de	mm: merge populate and nopage into fault (fixes nonlinear) Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes the virtual address -> file offset differently from linear mappings. ->populate is a layering violation because the filesystem/pagecache code should need to know anything about the virtual memory mapping. The hitch here is that the ->nopage handler didn't pass down enough information (ie. pgoff). But it is more logical to pass pgoff rather than have the ->nopage function calculate it itself anyway (because that's a similar layering violation). Having the populate handler install the pte itself is likewise a nasty thing to be doing. This patch introduces a new fault handler that replaces ->nopage and ->populate and (later) ->nopfn. Most of the old mechanism is still in place so there is a lot of duplication and nice cleanups that can be removed if everyone switches over. The rationale for doing this in the first place is that nonlinear mappings are subject to the pagefault vs invalidate/truncate race too, and it seemed stupid to duplicate the synchronisation logic rather than just consolidate the two. After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in pagecache. Seems like a fringe functionality anyway. NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no users have hit mainline yet. [akpm@linux-foundation.org: cleanup] [randy.dunlap@oracle.com: doc. fixes for readahead] [akpm@linux-foundation.org: build fix] Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:41 -07:00
Nick Piggin	d00806b183	mm: fix fault vs invalidate race for linear mappings Fix the race between invalidate_inode_pages and do_no_page. Andrea Arcangeli identified a subtle race between invalidation of pages from pagecache with userspace mappings, and do_no_page. The issue is that invalidation has to shoot down all mappings to the page, before it can be discarded from the pagecache. Between shooting down ptes to a particular page, and actually dropping the struct page from the pagecache, do_no_page from any process might fault on that page and establish a new mapping to the page just before it gets discarded from the pagecache. The most common case where such invalidation is used is in file truncation. This case was catered for by doing a sort of open-coded seqlock between the file's i_size, and its truncate_count. Truncation will decrease i_size, then increment truncate_count before unmapping userspace pages; do_no_page will read truncate_count, then find the page if it is within i_size, and then check truncate_count under the page table lock and back out and retry if it had subsequently been changed (ptl will serialise against unmapping, and ensure a potentially updated truncate_count is actually visible). Complexity and documentation issues aside, the locking protocol fails in the case where we would like to invalidate pagecache inside i_size. do_no_page can come in anytime and filemap_nopage is not aware of the invalidation in progress (as it is when it is outside i_size). The end result is that dangling (->mapping == NULL) pages that appear to be from a particular file may be mapped into userspace with nonsense data. Valid mappings to the same place will see a different page. Andrea implemented two working fixes, one using a real seqlock, another using a page->flags bit. He also proposed using the page lock in do_no_page, but that was initially considered too heavyweight. However, it is not a global or per-file lock, and the page cacheline is modified in do_no_page to increment _count and _mapcount anyway, so a further modification should not be a large performance hit. Scalability is not an issue. This patch implements this latter approach. ->nopage implementations return with the page locked if it is possible for their underlying file to be invalidated (in that case, they must set a special vm_flags bit to indicate so). do_no_page only unlocks the page after setting up the mapping completely. invalidation is excluded because it holds the page lock during invalidation of each page (and ensures that the page is not mapped while holding the lock). This also allows significant simplifications in do_no_page, because we have the page locked in the right place in the pagecache from the start. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:41 -07:00
David Chinner	5417169026	[FS] Implement block_page_mkwrite. Many filesystems need a ->page-mkwrite callout to correctly set up pages that have been written to by mmap. This is especially important when mmap is writing into holes as it allows filesystems to correctly account for and allocate space before the mmap write is allowed to proceed. Protection against truncate races is provided by locking the page and checking to see whether the page mapping is correct and whether it is beyond EOF so we don't end up allowing allocations beyond the current EOF or changing EOF as a result of a mmap write. SGI-PV: 940392 SGI-Modid: 2.6.x-xfs-melb:linux:29146a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com>	2007-07-19 19:50:50 +10:00
Linus Torvalds	ce524c8360	Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6 * 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6: eHEA: Fix bonding support Blackfin ethernet driver: on chip ethernet MAC controller driver fix wrong argument of tc35815_read_plat_dev_addr() ARM/ETHER3: Handle multicast frames. SAA9730: Handle multicast frames. NI5010: Handle multicast frames. NS83820: Handle multicast frames. Fix RGMII-ID handling in gianfar Fix Vitesse RGMII-ID support Add phy-connection-type to gianfar nodes Fix Vitesse 824x PHY interrupt acking [PATCH] zd1211rw: Add ID for Siemens Gigaset USB Stick 54 [PATCH] zd1211rw: Add ID for Planex GW-US54GXS [PATCH] Update version ipw2200 stamp to 1.2.2 [PATCH] ipw2200: Fix ipw_isr() comments error on shared IRQ [PATCH] Fix ipw2200 set wrong power parameter causing firmware error [PATCH] ipw2100: Fix `iwpriv set_power` error [PATCH] softmac: Channel is listed twice in scan output	2007-07-18 18:33:45 -07:00
Linus Torvalds	29e7ee378e	Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6 * master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6: sysfs: cosmetic clean up on node creation failure paths sysfs: kill an extra put in sysfs_create_link() failure path Driver core: check return code of sysfs_create_link() HOWTO: Add the knwon_regression URI to the documentation dev_vdbg() documentation dev_vdbg(), available with -DVERBOSE_DEBUG sysfs: make sysfs_init_inode() static sysfs: fix sysfs root inode nlink accounting Documentation fix devres.txt: lib/iomap.c -> lib/devres.c sysfs: avoid kmem_cache_free(NULL) PM: remove deprecated dpm_runtime_* routines PM: Remove deprecated sysfs files Driver core: accept all valid action-strings in uevent-trigger debugfs: remove rmdir() non-empty complaint	2007-07-18 18:28:08 -07:00
Linus Torvalds	fc15bc817e	Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/uio-2.6 * master.kernel.org:/pub/scm/linux/kernel/git/gregkh/uio-2.6: UIO: Hilscher CIF card driver UIO: Documentation UIO: Add the User IO core code	2007-07-18 18:27:50 -07:00
Linus Torvalds	a8dcf12f9e	Merge branch 'for-linus' of git://linux-nfs.org/~bfields/linux * 'for-linus' of git://linux-nfs.org/~bfields/linux: locks: fix vfs_test_lock() comment locks: make posix_test_lock() interface more consistent nfs: disable leases over NFS gfs2: stop giving out non-cluster-coherent leases locks: export setlease to filesystems locks: provide a file lease method enabling cluster-coherent leases locks: rename lease functions to reflect locks.c conventions locks: share more common lease code locks: clean up lease_alloc() locks: convert an -EINVAL return to a BUG leases: minor break_lease() comment clarification	2007-07-18 18:27:00 -07:00
J. Bruce Fields	6d34ac199a	locks: make posix_test_lock() interface more consistent Since posix_test_lock(), like fcntl() and ->lock(), indicates absence or presence of a conflict lock by setting fl_type to, respectively, F_UNLCK or something other than F_UNLCK, the return value is no longer needed. Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>	2007-07-18 19:17:19 -04:00
J. Bruce Fields	4698afe8e3	locks: export setlease to filesystems Export setlease so it can used by filesystems to implement their lease methods. Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>	2007-07-18 19:17:06 -04:00
J. Bruce Fields	f9ffed26d6	locks: provide a file lease method enabling cluster-coherent leases Currently leases are only kept locally, so there's no way for a distributed filesystem to enforce them against multiple clients. We're particularly interested in the case of nfsd exporting a cluster filesystem, in which case nfsd needs cluster-coherent leases in order to implement delegations correctly. Also add some documentation. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2007-07-18 19:14:47 -04:00
J. Bruce Fields	a9933cea7a	locks: rename lease functions to reflect locks.c conventions We've been using the convention that vfs_foo is the function that calls a filesystem-specific foo method if it exists, or falls back on a generic method if it doesn't; thus vfs_foo is what is called when some other part of the kernel (normally lockd or nfsd) wants to get a lock, whereas foo is what filesystems call to use the underlying local functionality as part of their lock implementation. So rename setlease to vfs_setlease (which will call a filesystem-specific setlease after a later patch) and __setlease to setlease. Also, vfs_setlease need only be GPL-exported as long as it's only needed by lockd and nfsd. Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>	2007-07-18 19:14:12 -04:00
Hans J. Koch	beafc54c4e	UIO: Add the User IO core code This interface allows the ability to write the majority of a driver in userspace with only a very small shell of a driver in the kernel itself. It uses a char device and sysfs to interact with a userspace process to process interrupts and control memory accesses. See the docbook documentation for more details on how to use this interface. From: Hans J. Koch <hjk@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Benedikt Spranger <b.spranger@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2007-07-18 15:57:15 -07:00
David Brownell	aebdc3b450	dev_vdbg(), available with -DVERBOSE_DEBUG This defines a dev_vdbg() call, which is enabled with -DVERBOSE_DEBUG. When enabled, dev_vdbg() acts just like dev_dbg(). When disabled, it is a NOP ... just like dev_dbg() without -DDEBUG. The specific code was moved out of a USB patch, but lots of drivers have similar support. That is, code can now be written to use an additional level of debug output, selected at compile time. Many driver authors have found this idiom to be very useful. A typical usage model is for "normal" debug messages to focus on fault paths and not be very "chatty", so that those messages can be left on during normal operation without much of a performance or syslog load. On the other hand "verbose" messages would be noisy enough that they wouldn't normally be enabled; they might even affect timings enough to change system or driver behavior. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2007-07-18 15:49:50 -07:00
Alan Stern	3f8df781fc	PM: remove deprecated dpm_runtime_* routines This patch (as933) removes the deprecated dpm_runtime_suspend() and dpm_runtime_resume() routines from the PM core. The only user of those routines is the PCMCIA ds driver; local replacements are added. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> CC: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2007-07-18 15:49:49 -07:00
Kay Sievers	60a96a5956	Driver core: accept all valid action-strings in uevent-trigger This allows the uevent file to handle any type of uevent action to be triggered by userspace instead of just the "add" uevent. Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2007-07-18 15:49:49 -07:00
Johannes Berg	2dbba6f773	[GENETLINK]: Dynamic multicast groups. Introduce API to dynamically register and unregister multicast groups. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Patrick McHardy <kaber@trash.net> Acked-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-07-18 15:47:52 -07:00
Johannes Berg	84659eb529	[NETLIKN]: Allow removing multicast groups. Allow kicking listeners out of a multicast group when necessary (for example if that group is going to be removed.) Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Patrick McHardy <kaber@trash.net> Acked-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-07-18 15:47:05 -07:00
Johannes Berg	b4ff4f0419	[NETLINK]: allocate group bitmaps dynamically Allow changing the number of groups for a netlink family after it has been created, use RCU to protect the listeners bitmap keeping netlink_has_listeners() lock-free. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Patrick McHardy <kaber@trash.net> Acked-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-07-18 15:46:06 -07:00
Andy Fleming	7132ab7f6e	Fix RGMII-ID handling in gianfar The TSEC/eTSEC can detect the interface to the PHY automatically, but it isn't able to detect whether the RGMII connection needs internal delay. So we need to detect that change in the device tree, propagate it to the platform data, and then check it if we're in RGMII. This fixes a bug on the 8641D HPCN board where the Vitesse PHY doesn't use the delay for RGMII. Signed-off-by: Andy Fleming <afleming@freescale.com>	2007-07-18 18:29:37 -04:00
Linus Torvalds	d756d10e24	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: extent macros cleanup Fix compilation with EXT_DEBUG, also fix leXX_to_cpu conversions. ext4: remove extra IS_RDONLY() check ext4: Use is_power_of_2() Use zero_user_page() in ext4 where possible ext4: Remove 65000 subdirectory limit ext4: Expand extra_inodes space per the s_{want,min}_extra_isize fields ext4: Add nanosecond timestamps jbd2: Move jbd2-debug file to debugfs jbd2: Fix CONFIG_JBD_DEBUG ifdef to be CONFIG_JBD2_DEBUG ext4: Set the journal JBD2_FEATURE_INCOMPAT_64BIT on large devices ext4: Make extents code sanely handle on-disk corruption ext4: copy i_flags to inode flags on write ext4: Enable extents by default Change on-disk format to support 2^15 uninitialized extents write support for preallocated blocks fallocate support in ext4 sys_fallocate() implementation on i386, x86_64 and powerpc	2007-07-18 10:32:00 -07:00
Linus Torvalds	485cf925d8	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 * 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (24 commits) [NETFILTER]: xt_connlimit needs to depend on nf_conntrack [NETFILTER]: ipt_iprange.h must #include <linux/types.h> [IrDA]: Fix IrDA build failure [ATM]: nicstar needs virt_to_bus [NET]: move __dev_addr_discard adjacent to dev_addr_discard for readability [NET]: merge dev_unicast_discard and dev_mc_discard into one [NET]: move dev_mc_discard from dev_mcast.c to dev.c [NETLINK]: negative groups in netlink_setsockopt [PPPOL2TP]: Reset meta-data in xmit function [PPPOL2TP]: Fix use-after-free [PKT_SCHED]: Some typo fixes in net/sched/Kconfig [XFRM]: Fix crash introduced by struct dst_entry reordering [TCP]: remove unused argument to cong_avoid op [ATM]: [idt77252] Rename CONFIG_ATM_IDT77252_SEND_IDLE to not resemble a Kconfig variable [ATM]: [drivers] ioremap balanced with iounmap [ATM]: [lanai] sram_test_word() must be __devinit [ATM]: [nicstar] Replace C code with call to ARRAY_SIZE() macro. [ATM]: Eliminate dead config variable CONFIG_BR2684_FAST_TRANS. [ATM]: Replacing kmalloc/memset combination with kzalloc. [NET]: gen_estimator deadlock fix ...	2007-07-18 10:24:36 -07:00
Prakash, Sathya	0c8db6beb8	[SCSI] add PCI_VENDOR_ID macro for Brocade in pci_ids.h Adds PCI_VENDOR_ID_BROCADE macro in include/linux/pci_ids.h file. This macro is used in MPT Fusion FC drivers to support Brocade branded FC controllers signed-off-by: Sathya Prakash <sathya.prakash@lsi.com> Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>	2007-07-18 11:17:11 -05:00
Darrick J. Wong	1c50dc83f9	[SCSI] sas_ata: ata_post_internal should abort the sas_task This patch adds a new field, lldd_task, to ata_queued_cmd so that libata users such as libsas can associate some data with a qc. The particular ambition with this patch is to associate a sas_task with a qc; that way, if libata decides to timeout a command, we can come back (in sas_ata_post_internal) and abort the sas task. One question remains: Is it necessary to reset the phy on error, or will the libata error handler take care of it? (Assuming that one is written, of course.) This patch, as it is today, works well enough to clean things up when an ATA device probe attempt fails halfway through the probe, though I'm not sure this is always the right thing to do. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>	2007-07-18 11:15:13 -05:00
Jeremy Fitzhardinge	9f27ee5950	xen: add virtual block device driver. The block device frontend driver allows the kernel to access block devices exported exported by a virtual machine containing a physical block device driver. Signed-off-by: Ian Pratt <ian.pratt@xensource.com> Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Greg KH <greg@kroah.com> Cc: Jens Axboe <axboe@kernel.dk>	2007-07-18 08:47:45 -07:00
Jeremy Fitzhardinge	c85b04c374	xen: add pinned page flag Add a new definition for PG_owner_priv_1 to define PG_pinned on Xen pagetable pages. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org>	2007-07-18 08:47:43 -07:00
Jeremy Fitzhardinge	5f4352fbff	Allocate and free vmalloc areas Allocate/release a chunk of vmalloc address space: alloc_vm_area reserves a chunk of address space, and makes sure all the pagetables are constructed for that address range - but no pages. free_vm_area releases the address space range. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Ian Pratt <ian.pratt@xensource.com> Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: "Jan Beulich" <JBeulich@novell.com> Cc: "Andi Kleen" <ak@muc.de>	2007-07-18 08:47:41 -07:00
Jeremy Fitzhardinge	810bab448e	use elfnote.h to generate vsyscall notes. Use existing elfnote.h to generate vsyscall notes, rather than doing it locally. Changes elfnote.h a bit to suit, since this is the first asm user, and it wasn't quite right. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Roland McGrath <roland@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.com>	2007-07-18 08:47:40 -07:00
Jeremy Fitzhardinge	86313c488a	usermodehelper: Tidy up waiting Rather than using a tri-state integer for the wait flag in call_usermodehelper_exec, define a proper enum, and use that. I've preserved the integer values so that any callers I've missed should still work OK. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Andi Kleen <ak@suse.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: David Howells <dhowells@redhat.com>	2007-07-18 08:47:40 -07:00
Jeremy Fitzhardinge	10a0a8d4e3	Add common orderly_poweroff() Various pieces of code around the kernel want to be able to trigger an orderly poweroff. This pulls them together into a single implementation. By default the poweroff command is /sbin/poweroff, but it can be set via sysctl: kernel/poweroff_cmd. This is split at whitespace, so it can include command-line arguments. This patch replaces four other instances of invoking either "poweroff" or "shutdown -h now": two sbus drivers, and acpi thermal management. sparc64 has its own "powerd"; still need to determine whether it should be replaced by orderly_poweroff(). Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Acked-by: Len Brown <lenb@kernel.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Andi Kleen <ak@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David S. Miller <davem@davemloft.net>	2007-07-18 08:47:40 -07:00
Jeremy Fitzhardinge	0ab4dc9227	usermodehelper: split setup from execution Rather than having hundreds of variations of call_usermodehelper for various pieces of usermode state which could be set up, split the info allocation and initialization from the actual process execution. This means the general pattern becomes: info = call_usermodehelper_setup(path, argv, envp); /* basic state / call_usermodehelper_<SET EXTRA STATE>(info, stuff...); / extra state / call_usermodehelper_exec(info, wait); / run process and free info / This patch introduces wrappers for all the existing calling styles for call_usermodehelper_, but folds their implementations into one. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Andi Kleen <ak@suse.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: David Howells <dhowells@redhat.com> Cc: Bj?rn Steinbrink <B.Steinbrink@gmx.de> Cc: Randy Dunlap <randy.dunlap@oracle.com>	2007-07-18 08:47:40 -07:00
Jeremy Fitzhardinge	d84d1cc764	add argv_split() argv_split() is a helper function which takes a string, splits it at whitespace, and returns a NULL-terminated argv vector. This is deliberately simple - it does no quote processing of any kind. [ Seems to me that this is something which is already being done in the kernel, but I couldn't find any other implementations, either to steal or replace. Keep an eye out. ] Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Randy Dunlap <randy.dunlap@oracle.com>	2007-07-18 08:47:40 -07:00
Jeremy Fitzhardinge	1e66df3ee3	add kstrndup Add a kstrndup function, modelled on strndup. Like strndup this returns a string copied into its own allocated memory, but it copies no more than the specified number of bytes from the source. Remove private strndup() from irda code. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@mandriva.com> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Panagiotis Issaris <takis@issaris.org> Cc: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>	2007-07-18 08:47:39 -07:00
Maciej W. Rozycki	8b4a40809e	zs: move to the serial subsystem This is a reimplementation of the zs driver for the serial subsystem. Any resemblance to the old driver is purely coincidential. ;-) I do hope I got the handling of modem lines right -- better do not tackle me about the issue unless you feel too good... Any users of the old driver: please note the numbers of the serial lines have now been swapped, i.e. ttyS0 <-> ttyS1 and ttyS2 <-> ttyS3. It has to do with the modem lines mentioned above; basically the port A in a given chip has to be initialised before the port B if you want to use the latter as the serial console (which is usually the case), as operations on modem lines of the serial line associated with the port B access both ports (see the comment at the top of the driver for the details of wiring used). Please update your scripts. This is also the reason each SCC now requests an IRQ once only (as seen in "/proc/interrupts") -- the handler takes care of both ports at once as the line associated with the port B has to take status update interrupts from both ports (and yet the line of the port A takes its own for itself too). The old driver never got it right... Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-18 08:38:22 -07:00
Yinghai Lu	b187f180cc	serial: add early_serial_setup() back to header file early_serial_setup was removed from serial.h, but forgot to put in serial_8250.h Signed-off-by: Yinghai Lu <yinghai.lu@sun.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-18 08:38:22 -07:00
Andreas Dilger	f8628a14a2	ext4: Remove 65000 subdirectory limit This patch adds support to ext4 for allowing more than 65000 subdirectories. Currently the maximum number of subdirectories is capped at 32000. If we exceed 65000 subdirectories in an htree directory it sets the inode link count to 1 and no longer counts subdirectories. The directory link count is not actually used when determining if a directory is empty, as that only counts subdirectories and not regular files that might be in there. A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if the subdir count for any directory crosses 65000. A later fsck will clear EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any directory with >65000 subdirs. Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Kalpak Shah <kalpak@clusterfs.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2007-07-18 08:38:01 -04:00
Kalpak Shah	6dd4ee7cab	ext4: Expand extra_inodes space per the s_{want,min}_extra_isize fields We need to make sure that existing ext3 filesystems can also avail the new fields that have been added to the ext4 inode. We use s_want_extra_isize and s_min_extra_isize to decide by how much we should expand the inode. If EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature is set then we expand the inode by max(s_want_extra_isize, s_min_extra_isize , sizeof(ext4_inode) - EXT4_GOOD_OLD_INODE_SIZE) bytes. Actually it is still an open question about whether users should be able to set s_*_extra_isize smaller than the known fields or not. This patch also adds the functionality to expand inodes to include the newly added fields. We start by trying to expand by s_want_extra_isize bytes and if its fails we try to expand by s_min_extra_isize bytes. This is done by changing the i_extra_isize if enough space is available in the inode and no EAs are present. If EAs are present and there is enough space in the inode then the EAs in the inode are shifted to make space. If enough space is not available in the inode due to the EAs then 1 or more EAs are shifted to the external EA block. In the worst case when even the external EA block does not have enough space we inform the user that some EA would need to be deleted or s_min_extra_isize would have to be reduced. Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Kalpak Shah <kalpak@clusterfs.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2007-07-18 09:19:57 -04:00
Kalpak Shah	ef7f38359e	ext4: Add nanosecond timestamps This patch adds nanosecond timestamps for ext4. This involves adding *time_extra fields to the ext4_inode to extend the timestamps to 64-bits. Creation time is also added by this patch. These extended fields will fit into an inode if the filesystem was formatted with large inodes (-I 256 or larger) and there are currently no EAs consuming all of the available space. For new inodes we always reserve enough space for the kernel's known extended fields, but for inodes created with an old kernel this might not have been the case. So this patch also adds the EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature flag(ro-compat so that older kernels can't create inodes with a smaller extra_isize). which indicates if the fields fitting inside s_min_extra_isize are available or not. If the expansion of inodes if unsuccessful then this feature will be disabled. This feature is only enabled if requested by the sysadmin. None of the extended inode fields is critical for correct filesystem operation. Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Kalpak Shah <kalpak@clusterfs.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2007-07-18 09:15:20 -04:00
Jose R. Santos	0f49d5d019	jbd2: Move jbd2-debug file to debugfs The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but it incorrectly used create_proc_entry() instead of the sysctl routines, and no proc entry was ever created. Instead of fixing this we might as well move the jbd2-debug file to debugfs which would be the preferred location for this kind of tunable. The new location is now /sys/kernel/debug/jbd2/jbd2-debug. Signed-off-by: Jose R. Santos <jrs@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2007-07-18 08:50:18 -04:00
Jose R. Santos	e23291b912	jbd2: Fix CONFIG_JBD_DEBUG ifdef to be CONFIG_JBD2_DEBUG When the JBD code was forked to create the new JBD2 code base, the references to CONFIG_JBD_DEBUG where never changed to CONFIG_JBD2_DEBUG. This patch fixes that. Signed-off-by: Jose R. Santos <jrs@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2007-07-18 08:57:06 -04:00
Jan Kara	ff9ddf7e84	ext4: copy i_flags to inode flags on write Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into ext4-specific i_flags. Quota code changes these flags on quota files (to make it harder for sysadmin to screw himself) and these changes were not correctly propagated into the filesystem. (This is a forward port patch from ext3) Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2007-07-18 09:24:20 -04:00
Amit Arora	749269faca	Change on-disk format to support 2^15 uninitialized extents This change was suggested by Andreas Dilger. This patch changes the EXT_MAX_LEN value and extent code which marks/checks uninitialized extents. With this change it will be possible to have initialized extents with 2^15 blocks (earlier the max blocks we could have was 2^15 - 1). This way we can have better extent-to-block alignment. Now, maximum number of blocks we can have in an initialized extent is 2^15 and in an uninitialized extent is 2^15 - 1. Signed-off-by: Amit Arora <aarora@in.ibm.com>	2007-07-18 09:02:56 -04:00
Adrian Bunk	ebd61cc042	[NETFILTER]: ipt_iprange.h must #include <linux/types.h> ipt_iprange.h must #include <linux/types.h> since it uses __be32. This patch fixes kernel Bugzilla #7604. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-07-18 02:21:50 -07:00
Denis Cheng	456ad75c89	[NET]: move dev_mc_discard from dev_mcast.c to dev.c Because this function is only called by unregister_netdevice, this moving could make this non-global function static, and also remove its declaration in netdevice.h; Any further, function __dev_addr_discard is also just called by dev_mc_discard and dev_unicast_discard, keeping this two functions both in one c file could make __dev_addr_discard also static and remove its declaration in netdevice.h; Futhermore, the sequential call to dev_unicast_discard and then dev_mc_discard in unregister_netdevice have a similar mechanism that: (netif_tx_lock_bh / __dev_addr_discard / netif_tx_unlock_bh), they should merged into one to eliminate duplicates in acquiring and releasing the dev->_xmit_lock, this would be done in my following patch. Signed-off-by: Denis Cheng <crquan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-07-18 02:10:54 -07:00
Dmitry Torokhov	5517853712	Input: document intended meaning of KEY_SWITCHVIDEOMODE Signed-off-by: Dmitry Torokhov <dtor@mail.ru>	2007-07-18 00:38:45 -04:00
Dmitry Torokhov	85f202d5df	Input: add driver for Fujitsu serial touchscreens These serial touchscreens are found on some Fujitsu lifebook P-series laptops, and the B6210. Using this requires a new version of inputattach and doing: inputattach -fjt /dev/ttyS0 Big thanks to Stephen Hemminger for testing it and making it work on his B6210 laptop. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: Dmitry Torokhov <dtor@mail.ru>	2007-07-18 00:37:01 -04:00
Semih Hazar	1d25891f32	Input: ads7846 - re-check pendown status before reporting events Pendown status from the PENIRQ pin is currently read only at the beginning of a sample set. If the pen is lifted just after sampling has began then sampled values become wrong. This patch adds an optional platform penirq_recheck_delay attribute. If non-zero, samples are only reported to the input subsystem if PENIRQ is still active that long after the samples taken. Signed-off-by: Semih Hazar <semih.hazar@indefia.com> Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Dmitry Torokhov <dtor@mail.ru>	2007-07-18 00:36:04 -04:00
Semih Hazar	e4f4886199	Input: ads7846 - introduce sample settling delay The ads7846 driver has support for filtering, but when the chip gets deselected between samples this causes noise. This patch adds support for an optional settling delay time, so that two consecutive samples will be taken with the specified delay time apart. This ensures that the chip won't be deselected, so the noise won't appear. Filtering can still be done, but will have less work to do since each time a new sample is taken the same delay applies. Signed-off-by: Semih Hazar <semih.hazar@indefia.com> Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Dmitry Torokhov <dtor@mail.ru>	2007-07-18 00:35:56 -04:00
Amit Arora	56055d3ae4	write support for preallocated blocks This patch adds write support to the uninitialized extents that get created when a preallocation is done using fallocate(). It takes care of splitting the extents into multiple (upto three) extents and merging the new split extents with neighbouring ones, if possible. Signed-off-by: Amit Arora <aarora@in.ibm.com>	2007-07-17 21:42:38 -04:00
Amit Arora	a2df2a6340	fallocate support in ext4 This patch implements ->fallocate() inode operation in ext4. With this patch users of ext4 file systems will be able to use fallocate() system call for persistent preallocation. Current implementation only supports preallocation for regular files (directories not supported as of date) with extent maps. This patch does not support block-mapped files currently. Only FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are being supported as of now. Signed-off-by: Amit Arora <aarora@in.ibm.com>	2007-07-17 21:42:41 -04:00
Amit Arora	97ac73506c	sys_fallocate() implementation on i386, x86_64 and powerpc fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. Each file system implementation that wants to use this feature will need to support an inode operation called ->fallocate(). Applications can use this feature to avoid fragmentation to certain level and thus get faster access speed. With preallocation, applications also get a guarantee of space for particular file(s) - even if later the the system becomes full. Currently, glibc provides an interface called posix_fallocate() which can be used for similar cause. Though this has the advantage of working on all file systems, but it is quite slow (since it writes zeroes to each block that has to be preallocated). Without a doubt, file systems can do this more efficiently within the kernel, by implementing the proposed fallocate() system call. It is expected that posix_fallocate() will be modified to call this new system call first and incase the kernel/filesystem does not implement it, it should fall back to the current implementation of writing zeroes to the new blocks. ToDos: 1. Implementation on other architectures (other than i386, x86_64, and ppc). Patches for s390(x) and ia64 are already available from previous posts, but it was decided that they should be added later once fallocate is in the mainline. Hence not including those patches in this take. 2. Changes to glibc, a) to support fallocate() system call b) to make posix_fallocate() and posix_fallocate64() call fallocate() Signed-off-by: Amit Arora <aarora@in.ibm.com>	2007-07-17 21:42:44 -04:00
Paul Mundt	cb32da0416	slob: Kill off duplicate kzalloc() definition. With the slab zeroing allocations cleanups Christoph stubbed in a generic kzalloc(), which was missed on SLOB. Follow the SLAB/SLUB changes and kill off the __kzalloc() wrapper that SLOB was using. Reported-by: Jan Engelhardt <jengelh@computergmbh.de> Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 17:26:43 -07:00
Linus Torvalds	f3d9071667	Merge branch 'bsg' of git://git.kernel.dk/data/git/linux-2.6-block * 'bsg' of git://git.kernel.dk/data/git/linux-2.6-block: bsg: fix missing space in version print Don't define empty struct bsg_class_device if !CONFIG_BLK_DEV_BSG bsg: Kconfig updates bsg: minor cleanup bsg: device hash table cleanup bsg: fix initialization error handling bugs bsg: mark FUJITA Tomonori as bsg maintainer bsg: convert to dynamic major bsg: address various review comments	2007-07-17 15:26:31 -07:00
Al Viro	8dfd588c31	smp_call_function_single() should be a macro on UP ... or we end up with header include order problems from hell. E.g. on m68k this is 100% fatal - local_irq_enable() there wants preempt_count(), which wants task_struct fields, which we won't have when we are in smp.h pulled from sched.h. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 14:39:19 -07:00
Satyam Sharma	3bd858ab1c	Introduce is_owner_or_cap() to wrap CAP_FOWNER use with fsuid check Introduce is_owner_or_cap() macro in fs.h, and convert over relevant users to it. This is done because we want to avoid bugs in the future where we check for only effective fsuid of the current task against a file's owning uid, without simultaneously checking for CAP_FOWNER as well, thus violating its semantics. [ XFS uses special macros and structures, and in general looked ... untouchable, so we leave it alone -- but it has been looked over. ] The (current->fsuid != inode->i_uid) check in generic_permission() and exec_permission_lite() is left alone, because those operations are covered by CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH. Similarly operations falling under the purview of CAP_CHOWN and CAP_LEASE are also left alone. Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in> Cc: Al Viro <viro@ftp.linux.org.uk> Acked-by: Serge E. Hallyn <serge@hallyn.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 12:00:03 -07:00
Linus Torvalds	49c13b51a1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (80 commits) KVM: Use CPU_DYING for disabling virtualization KVM: Tune hotplug/suspend IPIs KVM: Keep track of which cpus have virtualization enabled SMP: Allow smp_call_function_single() to current cpu i386: Allow smp_call_function_single() to current cpu x86_64: Allow smp_call_function_single() to current cpu HOTPLUG: Adapt thermal throttle to CPU_DYING HOTPLUG: Adapt cpuset hotplug callback to CPU_DYING HOTPLUG: Add CPU_DYING notifier KVM: Clean up #includes KVM: Remove kvmfs in favor of the anonymous inodes source KVM: SVM: Reliably detect if SVM was disabled by BIOS KVM: VMX: Remove unnecessary code in vmx_tlb_flush() KVM: MMU: Fix Wrong tlb flush order KVM: VMX: Reinitialize the real-mode tss when entering real mode KVM: Avoid useless memory write when possible KVM: Fix x86 emulator writeback KVM: Add support for in-kernel pio handlers KVM: VMX: Fix interrupt checking on lightweight exit KVM: Adds support for in-kernel mmio handlers ...	2007-07-17 11:50:26 -07:00
NeilBrown	4ad1366376	md: change bitmap_unplug and others to void functions bitmap_unplug only ever returns 0, so it may as well be void. Two callers try to print a message if it returns non-zero, but that message is already printed by bitmap_file_kick. write_page returns an error which is not consistently checked. It always causes BITMAP_WRITE_ERROR to be set on an error, and that can more conveniently be checked. When the return of write_page is checked, an error causes bitmap_file_kick to be called - so move that call into write_page - and protect against recursive calls into bitmap_file_kick. bitmap_update_sb returns an error that is never checked. So make these 'void' and be consistent about checking the bit. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:15 -07:00
NeilBrown	713f6ab18b	md: improve the is_mddev_idle test fix Don't use 'unsigned' variable to track sync vs non-sync IO, as the only thing we want to do with them is a signed comparison, and fix up the comment which had become quite wrong. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:15 -07:00
Imre Deak	fe0e3a9df6	OMAP: add TI OMAP1610 accelerator entry. Signed-off-by: Trilok Soni <soni.trilok@gmail.com> Cc: Tony Lindgren <tony@atomide.com> Cc: "Antonino A. Daplas" <adaplas@pol.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:13 -07:00
Geert Uytterhoeven	9900abfb5e	fbdev: Add fb_append_extra_logo() Add fb_append_extra_logo(), to append extra lines of logos below the standard Linux logo. Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com> Acked-By: James Simmons <jsimmons@infradead.org> Cc: "Antonino A. Daplas" <adaplas@pol.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:13 -07:00
Jesse Barnes	b7269dd2b9	vt: add comment for unbind_con_driver() - add comment for unbind_con_driver(). - bind_con_driver() is made private again Signed-off-by: Jesse Barnes <jesse.barnes@intel.com> Signed-off-by: Antonino Daplas <adaplas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:11 -07:00
Jesse Barnes	cfafca8067	fbdev: fbcon: console unregistration from unregister_framebuffer This allows for proper console unregistration via the VT layer, and updates the FB layer to use it. This makes debugging new console drivers much easier, since you can properly clean them up before unloading. [adaplas] unregister_framebuffer() is typically called as part of the driver's module_exit(). Doing so otherwise will freeze the machine as the VT layer is holding reference counts on fbcon, and fbcon on the driver. With this change, it allows unregister_framebuffer() to be called safely anywhere as needed. Additions from the original: If multiple drivers are used by fbcon, and if one of them unregisters, a driver will take over the consoles vacated by the outgoing one (via set_con2fb_map). Once only the outgoing driver remains, then fbcon will unbind from the VT layer (if CONFIG_HW_CONSOLE_UNBINDING is set to y). It is important that these drivers implement fb_open() and fb_release() just to ensure that no other process is using the driver. Likewise, these drivers _must_ check the return value of unregister_framebuffer(). [akpm@linux-foundation.org: make fbcon_unbind() stub inline] Signed-off-by: Jesse Barnes <jesse.barnes@intel.com> Signed-off-by: Antonino Daplas <adaplas@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:11 -07:00
Antonino A. Daplas	623e71b035	fbcon: allow fbcon to use the primary display driver Allow fbcon to select the primary display adapter using the fb_is_primary_device() arch-specific helper. If a a primary adapter is detected, fbcon will unbind the old adapter from the VT layer, then rebind using the new adapter. This requires that bind_/unbind_con_driver() be made public. Because this feature may produce unexpected behavior (from the user's POV), this must be explicitly enabled in Kconfig. [akpm@linux-foundation.org: export unbind_con_driver] Signed-off-by: Antonino Daplas <adaplas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:11 -07:00
Mark Zhan	2e774c7caf	rtc: add support for the ST M48T59 RTC [akpm@linux-foundation.org: x86_64 build fix] [akpm@linux-foundation.org: The acpi guys changed the bin_attribute code] Signed-off-by: Mark Zhan <rongkai.zhan@windriver.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:09 -07:00
J. Bruce Fields	1269bc69b6	knfsd: nfsd: enforce per-flavor id squashing Allow root squashing to vary per-pseudoflavor, so that you can (for example) allow root access only when sufficiently strong security is in use. Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:08 -07:00
J. Bruce Fields	4796f45740	knfsd: nfsd4: secinfo handling without secinfo= option We could return some sort of error in the case where someone asks for secinfo on an export without the secinfo= option set--that'd be no worse than what we've been doing. But it's not really correct. So, hack up an approximate secinfo response in that case--it may not be complete, but it'll tell the client at least one acceptable security flavor. Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:08 -07:00
Andy Adamson	dcb488a3b7	knfsd: nfsd4: implement secinfo Implement the secinfo operation. (Thanks to Usha Ketineni wrote an earlier version of this support.) Cc: Usha Ketineni <uketinen@us.ibm.com> Signed-off-by: Andy Adamson <andros@citi.umich.edu> Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:08 -07:00
J. Bruce Fields	0ec757df97	knfsd: nfsd4: make readonly access depend on pseudoflavor Allow readonly access to vary depending on the pseudoflavor, using the flag passed with each pseudoflavor in the export downcall. The rest of the flags are ignored for now, though some day we might also allow id squashing to vary based on the flavor. Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:08 -07:00
Andy Adamson	32c1eb0cd7	knfsd: nfsd4: return nfserr_wrongsec Make the first actual use of the secinfo information by using it to return nfserr_wrongsec when an export is found that doesn't allow the flavor used on this request. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Andy Adamson <andros@citi.umich.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:08 -07:00
J. Bruce Fields	3ab4d8b121	knfsd: nfsd: set rq_client to ip-address-determined-domain We want it to be possible for users to restrict exports both by IP address and by pseudoflavor. The pseudoflavor information has previously been passed using special auth_domains stored in the rq_client field. After the preceding patch that stored the pseudoflavor in rq_pflavor, that's now superfluous; so now we use rq_client for the ip information, as auth_null and auth_unix do. However, we keep around the special auth_domain in the rq_gssclient field for backwards compatibility purposes, so we can still do upcalls using the old "gss/pseudoflavor" auth_domain if upcalls using the unix domain to give us an appropriate export. This allows us to continue supporting old mountd. In fact, for this first patch, we always use the "gss/pseudoflavor" auth_domain (and only it) if it is available; thus rq_client is ignored in the auth_gss case, and this patch on its own makes no change in behavior; that will be left to later patches. Note on idmap: I'm almost tempted to just replace the auth_domain in the idmap upcall by a dummy value--no version of idmapd has ever used it, and it's unlikely anyone really wants to perform idmapping differently depending on the where the client is (they may want to perform credential mapping differently, but that's a different matter--the idmapper just handles id's used in getattr and setattr). But I'm updating the idmapd code anyway, just out of general backwards-compatibility paranoia. Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:07 -07:00
J. Bruce Fields	0989a78896	knfsd: nfsd: provide export lookup wrappers which take a svc_rqst Split the callers of exp_get_by_name(), exp_find(), and exp_parent() into those that are processing requests and those that are doing other stuff (like looking up filehandles for mountd). No change in behavior, just a (fairly pointless, on its own) cleanup. (Note this has the effect of making nfsd_cross_mnt() pass rqstp->rq_client instead of exp->ex_client into exp_find_by_name(). However, the two should have the same value at this point.) Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:07 -07:00
J. Bruce Fields	df547efb03	knfsd: nfsd4: simplify exp_pseudoroot arguments We're passing three arguments to exp_pseudoroot, two of which are just fields of the svc_rqst. Soon we'll want to pass in a third field as well. So let's just give up and pass in the whole struct svc_rqst. Also sneak in some minor style cleanups while we're at it. Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:07 -07:00
Andy Adamson	e677bfe4d4	knfsd: nfsd4: parse secinfo information in exports downcall We add a list of pseudoflavors to each export downcall, which will be used both as a list of security flavors allowed on that export, and (in the order given) as the list of pseudoflavors to return on secinfo calls. This patch parses the new downcall information and adds it to the export structure, but doesn't use it for anything yet. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Andy Adamson <andros@citi.umich.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:07 -07:00
Andy Adamson	c4170583f6	knfsd: nfsd4: store pseudoflavor in request Add a new field to the svc_rqst structure to record the pseudoflavor that the request was made with. For now we record the pseudoflavor but don't use it for anything. Signed-off-by: Andy Adamson <andros@citi.umich.edu> Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:07 -07:00
Meelap Shah	47f9940c55	knfsd: nfsd4: don't delegate files that have had conflicts One more incremental delegation policy improvement: don't give out a delegation on a file if conflicting access has previously required that a delegation be revoked on that file. (In practice we'll forget about the conflict when the struct nfs4_file is removed on close, so this is of limited use for now, though it should at least solve a temporary problem with self-conflicts on write opens from the same client.) Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:07 -07:00
Meelap Shah	c2f1a551de	knfsd: nfsd4: vary maximum delegation limit based on RAM size Our original NFSv4 delegation policy was to give out a read delegation on any open when it was possible to. Since the lifetime of a delegation isn't limited to that of an open, a client may quite reasonably hang on to a delegation as long as it has the inode cached. This becomes an obvious problem the first time a client's inode cache approaches the size of the server's total memory. Our first quick solution was to add a hard-coded limit. This patch makes a mild incremental improvement by varying that limit according to the server's total memory size, allowing at most 4 delegations per megabyte of RAM. My quick back-of-the-envelope calculation finds that in the worst case (where every delegation is for a different inode), a delegation could take about 1.5K, which would make the worst case usage about 6% of memory. The new limit works out to be about the same as the old on a 1-gig server. [akpm@linux-foundation.org: Don't needlessly bloat vmlinux] [akpm@linux-foundation.org: Make it right for highmem machines] Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:07 -07:00
J. Bruce Fields	1e5140279f	knfsd: nfsd: remove unused header interface.h It looks like Al Viro gutted this header file five years ago and it hasn't been touched since. Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:07 -07:00
J. Bruce Fields	33a1060ae7	knfsd: nfsd4: fix NFSv4 filehandle size units confusion NFS4_FHSIZE is measured in bytes, not 4-byte words, so much more space than necessary is being allocated for struct nfs4_cb_recall. I should have wondered why this structure was so much larger than it needed to be! Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:07 -07:00
Marc Eshel	9a8db97e77	knfsd: lockd: nfsd4: use same grace period for lockd and nfsd4 Both lockd and (in the nfsv4 case) nfsd enforce a "grace period" after reboot, during which clients may reclaim locks from the previous server instance, but may not acquire new locks. Currently the lockd and nfsd enforce grace periods of different lengths. This may cause problems when we reboot a server with both v2/v3 and v4 clients. For example, if the lockd grace period is shorter (as is likely the case), then a v3 client might acquire a new lock that conflicts with a lock already held (but not yet reclaimed) by a v4 client. This patch calculates a lease time that lockd and nfsd can both use. Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:07 -07:00
Christoph Hellwig	d37065cd6d	knfsd: exportfs: add procedural interface for NFSD Currently NFSD calls directly into filesystems through the export_operations structure. I plan to change this interface in various ways in later patches, and want to avoid the export of the default operations to NFSD, so this patch adds two simple exportfs_encode_fh/exportfs_decode_fh helpers for NFSD to call instead of poking into exportfs guts. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:06 -07:00
Christoph Hellwig	5ca2960733	knfsd: exportfs: remove iget abuse When the exportfs interface was added the expectation was that filesystems provide an operation to convert from a file handle to an inode/dentry, but it kept a backwards compat option that still calls into iget. Calling into iget from non-filesystem code is very bad, because it gives too little information to filesystem, and simply crashes if the filesystem doesn't implement the ->read_inode routine. Fortunately there are only two filesystems left using this fallback: efs and jfs. This patch moves a copy of export_iget to each of those to implement the get_dentry method. While this is a temporary increase of lines of code in the kernel it allows for a much cleaner interface and important code restructuring in later patches. [akpm@linux-foundation.org: add jfs_get_inode_flags() declaration] Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:06 -07:00
Christoph Hellwig	a569425512	knfsd: exportfs: add exportfs.h header currently the export_operation structure and helpers related to it are in fs.h. fs.h is already far too large and there are very few places needing the export bits, so split them off into a separate header. [akpm@linux-foundation.org: fix cifs build] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Neil Brown <neilb@suse.de> Cc: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:06 -07:00
Matthias Kaehlcke	67837f232d	Use mutex instead of semaphore in CAPI 2.0 driver The CAPI 2.0 driver uses a semaphore as mutex. Use the mutex API instead of the (binary) semaphore. Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com> Acked-by: Karsten Keil <kkeil@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:05 -07:00
Joakim Tjernlund	f29ba280ec	spi_mpc83xx.c: support QE enabled 83xx CPU's like mpc832x Quicc Engine enabled mpc83xx CPU's has a somewhat different HW interface to the SPI controller. This patch adds a qe_mode knob that sees to that needed adaptions are performed. Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:05 -07:00
Ben Dooks	447aef1a19	SPI: tle620x power switch driver Add support for the Infineon TLE62x0 series of low-side driver chips, such as the TLE6220 or TLE6230. These can be viewed as output GPIOs specialized for power switching applications. The driver provides a userspace interface to those GPIOs, and to the switch status they provide. Signed-off-by: Ben Dooks <ben-linux@fluff.org> Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:05 -07:00
Jan Nikitenko	ad241528c4	CRC7 support Add CRC7 routines, used for example in MMC over SPI communication. Kerneldoc updates [akpm@linux-foundation.org: fix funny mix of const and non-const] Signed-off-by: Jan Nikitenko <jan.nikitenko@gmail.com> Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Cc: "Randy.Dunlap" <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:04 -07:00
David Brownell	c06e677aed	SPI: add 3wire mode flag Add a new spi->mode bit: SPI_3WIRE, for chips where the SI and SO signals are shared (and which are thus only half duplex). Update the LM70 driver to require support for that hardware mode from the controller. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:04 -07:00
David Brownell	dccd573bb0	SPI controller drivers: check for unsupported modes Minor SPI controller driver updates: make the setup() methods reject spi->mode bits they don't support, by masking aginst the inverse of bits they do support. This insures against misbehavior later when new mode bits get added. Most controllers can't support SPI_LSB_FIRST; more handle SPI_CS_HIGH. Support for all four SPI clock/transfer modes is routine. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:04 -07:00
Jan Engelhardt	5b78cc9ac8	make timespec_equal() take const arguments Make arguments of timespec_equal() const struct timespec. Signed-off-by: Jan Engelhardt <jengelh@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:03 -07:00
Tejun Heo	9281acea6a	kallsyms: make KSYM_NAME_LEN include space for trailing '\0' KSYM_NAME_LEN is peculiar in that it does not include the space for the trailing '\0', forcing all users to use KSYM_NAME_LEN + 1 when allocating buffer. This is nonsense and error-prone. Moreover, when the caller forgets that it's very likely to subtly bite back by corrupting the stack because the last position of the buffer is always cleared to zero. This patch increments KSYM_NAME_LEN by one and updates code accordingly. * off-by-one bug in asm-powerpc/kprobes.h::kprobe_lookup_name() macro is fixed. * Where MODULE_NAME_LEN and KSYM_NAME_LEN were used together, MODULE_NAME_LEN was treated as if it didn't include space for the trailing '\0'. Fix it. Signed-off-by: Tejun Heo <htejun@gmail.com> Acked-by: Paulo Marques <pmarques@grupopie.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:03 -07:00
Maciej W. Rozycki	b45d527974	sb1250-duart.c: SB1250 DUART serial support This is a driver for the SB1250 DUART, a dual serial port implementation included in the Broadcom family of SOCs descending from the SiByte SB1250 MIPS64 chip multiprocessor. It is a new implementation replacing the old-fashioned driver currently present in the linux-mips.org tree. It supports all the usual features one would expect from a(n asynchronous) serial driver, including modem line control (as far as hardware supports it -- there is edge detection logic missing from the DCD and RI lines and the driver does not implement polling of these lines at the moment), the serial console, BREAK transmission and reception, including the magic SysRq. The receive FIFO threshold is not maintained though. The driver was tested with a SWARM board which uses a BCM1250 SOC (which is dual MIPS64 CMP) and has both ports of the single DUART implemented wired externally. Both were tested. Testing included using the ports as terminal lines at 1200bps (which is the ports minimum), 115200bps and a couple of random speeds inbetween. The modem lines were verified to operate correctly. No testing was performed with a use as a network interface, like with SLIP or PPP. Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Acked-by: Ralf Baechle <ralf@linux-mips.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:03 -07:00
Roland McGrath	f9e86f4190	Remove CHILD_MAX The CHILD_MAX macro in limits.h should not be there. It claims to be the limit on processes a user can own, but its value is wrong for that. There is no constant value, but a variable resource limit (RLIMIT_NPROC). Nothing in the kernel uses CHILD_MAX. The proper thing to do according to POSIX is not to define CHILD_MAX at all. The sysconf (_SC_CHILD_MAX) implementation works by calling getrlimit. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:03 -07:00
Roland McGrath	7729303469	Remove OPEN_MAX The OPEN_MAX macro in limits.h should not be there. It claims to be the limit on file descriptors in a process, but its value is wrong for that. There is no constant value, but a variable resource limit (RLIMIT_NOFILE). Nothing in the kernel uses OPEN_MAX except things that are wrong to do so. I've submitted other patches to remove those uses. The proper thing to do according to POSIX is not to define OPEN_MAX at all. The sysconf (_SC_OPEN_MAX) implementation works by calling getrlimit. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:03 -07:00
Akinobu Mita	f4480240f7	unregister_blkdev(): return void Put WARN_ON and fixed all callers of unregister_blkdev(). Now we can make unregister_blkdev return void. Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:03 -07:00
Adrian Bunk	62239ac2b3	proper prototype for proc_nr_files() Add a proper prototype for proc_nr_files() in include/linux/fs.h Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:03 -07:00
Alexey Dobriyan	f284ce7269	PTRACE_POKEDATA consolidation Identical implementations of PTRACE_POKEDATA go into generic_ptrace_pokedata() function. AFAICS, fix bug on xtensa where successful PTRACE_POKEDATA will nevertheless return EPERM. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:03 -07:00
Alexey Dobriyan	7664732315	PTRACE_PEEKDATA consolidation Identical implementations of PTRACE_PEEKDATA go into generic_ptrace_peekdata() function. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:03 -07:00
Pavel Emelianov	bcdcd8e725	Report that kernel is tainted if there was an OOPS If the kernel OOPSed or BUGed then it probably should be considered as tainted. Thus, all subsequent OOPSes and SysRq dumps will report the tainted kernel. This saves a lot of time explaining oddities in the calltraces. Signed-off-by: Pavel Emelianov <xemul@openvz.org> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Added parisc patch from Matthew Wilson -Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:02 -07:00
Christoph Lameter	2a7326b5bb	CONFIG_BOUNCE to avoid useless inclusion of bounce buffer logic The bounce buffer logic is included on systems that do not need it. If a system does not have zones like ZONE_DMA and ZONE_HIGHMEM that can lead to the use of bounce buffers then there is no need to reserve memory pools etc etc. This is true f.e. for SGI Altix. Also nicifies the Makefile and gets rid of the tricky "and" there. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:02 -07:00
Rafael J. Wysocki	8314418629	Freezer: make kernel threads nonfreezable by default Currently, the freezer treats all tasks as freezable, except for the kernel threads that explicitly set the PF_NOFREEZE flag for themselves. This approach is problematic, since it requires every kernel thread to either set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't care for the freezing of tasks at all. It seems better to only require the kernel threads that want to or need to be frozen to use some freezer-related code and to remove any freezer-related code from the other (nonfreezable) kernel threads, which is done in this patch. The patch causes all kernel threads to be nonfreezable by default (ie. to have PF_NOFREEZE set by default) and introduces the set_freezable() function that should be called by the freezable kernel threads in order to unset PF_NOFREEZE. It also makes all of the currently freezable kernel threads call set_freezable(), so it shouldn't cause any (intentional) change of behaviour to appear. Additionally, it updates documentation to describe the freezing of tasks more accurately. [akpm@linux-foundation.org: build fixes] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:02 -07:00
Christoph Lameter	b5fab14e5d	Add VM_BUG_ON in case someone uses page_mapping on a slab page Detect slab objects being passed to the page oriented functions of the VM. It is not sufficient to simply return NULL because the functions calling page_mapping may depend on other items of the page_struct also to be setup properly. Moreover slab object may not be properly aligned. The page oriented functions of the VM expect to operate on page aligned, page sized objects. Operations on object straddling page boundaries may only affect the objects partially which may lead to surprising results. It is better to detect eventually remaining uses and eliminate them. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:02 -07:00
Christoph Lameter	81cda66261	Slab allocators: Cleanup zeroing allocations It becomes now easy to support the zeroing allocs with generic inline functions in slab.h. Provide inline definitions to allow the continued use of kzalloc, kmem_cache_zalloc etc but remove other definitions of zeroing functions from the slab allocators and util.c. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Christoph Lameter	0c71001320	SLUB: add some more inlines and #ifdef CONFIG_SLUB_DEBUG Add #ifdefs around data structures only needed if debugging is compiled into SLUB. Add inlines to small functions to reduce code size. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Christoph Lameter	6cb8f91320	Slab allocators: consistent ZERO_SIZE_PTR support and NULL result semantics Define ZERO_OR_NULL_PTR macro to be able to remove the checks from the allocators. Move ZERO_SIZE_PTR related stuff into slab.h. Make ZERO_SIZE_PTR work for all slab allocators and get rid of the WARN_ON_ONCE(size == 0) that is still remaining in SLAB. Make slub return NULL like the other allocators if a too large memory segment is requested via __kmalloc. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Rusty Russell	8e1f936b73	mm: clean up and kernelify shrinker registration I can never remember what the function to register to receive VM pressure is called. I have to trace down from __alloc_pages() to find it. It's called "set_shrinker()", and it needs Your Help. 1) Don't hide struct shrinker. It contains no magic. 2) Don't allocate "struct shrinker". It's not helpful. 3) Call them "register_shrinker" and "unregister_shrinker". 4) Call the function "shrink" not "shrinker". 5) Reduce the 17 lines of waffly comments to 13, but document it properly. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: David Chinner <dgc@sgi.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:00 -07:00
Andy Whitcroft	5ad333eb66	Lumpy Reclaim V4 When we are out of memory of a suitable size we enter reclaim. The current reclaim algorithm targets pages in LRU order, which is great for fairness at order-0 but highly unsuitable if you desire pages at higher orders. To get pages of higher order we must shoot down a very high proportion of memory; >95% in a lot of cases. This patch set adds a lumpy reclaim algorithm to the allocator. It targets groups of pages at the specified order anchored at the end of the active and inactive lists. This encourages groups of pages at the requested orders to move from active to inactive, and active to free lists. This behaviour is only triggered out of direct reclaim when higher order pages have been requested. This patch set is particularly effective when utilised with an anti-fragmentation scheme which groups pages of similar reclaimability together. This patch set is based on Peter Zijlstra's lumpy reclaim V2 patch which forms the foundation. Credit to Mel Gorman for sanitity checking. Mel said: The patches have an application with hugepage pool resizing. When lumpy-reclaim is used used with ZONE_MOVABLE, the hugepages pool can be resized with greater reliability. Testing on a desktop machine with 2GB of RAM showed that growing the hugepage pool with ZONE_MOVABLE on it's own was very slow as the success rate was quite low. Without lumpy-reclaim, each attempt to grow the pool by 100 pages would yield 1 or 2 hugepages. With lumpy-reclaim, getting 40 to 70 hugepages on each attempt was typical. [akpm@osdl.org: ia64 pfn_to_nid fixes and loop cleanup] [bunk@stusta.de: static declarations for internal functions] [a.p.zijlstra@chello.nl: initial lumpy V2 implementation] Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Bob Picco <bob.picco@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:22:59 -07:00
Mel Gorman	ed7ed36517	handle kernelcore=: generic This patch adds the kernelcore= parameter for x86. Once all patches are applied, a new command-line parameter exist and a new sysctl. This patch adds the necessary documentation. From: Yasunori Goto <y-goto@jp.fujitsu.com> When "kernelcore" boot option is specified, kernel can't boot up on ia64 because of an infinite loop. In addition, the parsing code can be handled in an architecture-independent manner. This patch uses common code to handle the kernelcore= parameter. It is only available to architectures that support arch-independent zone-sizing (i.e. define CONFIG_ARCH_POPULATES_NODE_MAP). Other architectures will ignore the boot parameter. [bunk@stusta.de: make cmdline_parse_kernelcore() static] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Acked-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:22:59 -07:00
Mel Gorman	396faf0303	Allow huge page allocations to use GFP_HIGH_MOVABLE Huge pages are not movable so are not allocated from ZONE_MOVABLE. However, as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it can be used to satisfy hugepage allocations even when the system has been running a long time. This allows an administrator to resize the hugepage pool at runtime depending on the size of ZONE_MOVABLE. This patch adds a new sysctl called hugepages_treat_as_movable. When a non-zero value is written to it, future allocations for the huge page pool will use ZONE_MOVABLE. Despite huge pages being non-movable, we do not introduce additional external fragmentation of note as huge pages are always the largest contiguous block we care about. [akpm@linux-foundation.org: various fixes] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:22:59 -07:00
Mel Gorman	2a1e274acf	Create the ZONE_MOVABLE zone The following 8 patches against 2.6.20-mm2 create a zone called ZONE_MOVABLE that is only usable by allocations that specify both __GFP_HIGHMEM and __GFP_MOVABLE. This has the effect of keeping all non-movable pages within a single memory partition while allowing movable allocations to be satisfied from either partition. The patches may be applied with the list-based anti-fragmentation patches that groups pages together based on mobility. The size of the zone is determined by a kernelcore= parameter specified at boot-time. This specifies how much memory is usable by non-movable allocations and the remainder is used for ZONE_MOVABLE. Any range of pages within ZONE_MOVABLE can be released by migrating the pages or by reclaiming. When selecting a zone to take pages from for ZONE_MOVABLE, there are two things to consider. First, only memory from the highest populated zone is used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second, the amount of memory usable by the kernel will be spread evenly throughout NUMA nodes where possible. If the nodes are not of equal size, the amount of memory usable by the kernel on some nodes may be greater than others. By default, the zone is not as useful for hugetlb allocations because they are pinned and non-migratable (currently at least). A sysctl is provided that allows huge pages to be allocated from that zone. This means that the huge page pool can be resized to the size of ZONE_MOVABLE during the lifetime of the system assuming that pages are not mlocked. Despite huge pages being non-movable, we do not introduce additional external fragmentation of note as huge pages are always the largest contiguous block we care about. Credit goes to Andy Whitcroft for catching a large variety of problems during review of the patches. This patch creates an additional zone, ZONE_MOVABLE. This zone is only usable by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE. Hot-added memory continues to be placed in their existing destination as there is no mechanism to redirect them to a specific zone. [y-goto@jp.fujitsu.com: Fix section mismatch of memory hotplug related code] [akpm@linux-foundation.org: various fixes] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:22:59 -07:00
Mel Gorman	769848c038	Add __GFP_MOVABLE for callers to flag allocations from high memory that may be migrated It is often known at allocation time whether a page may be migrated or not. This patch adds a flag called __GFP_MOVABLE and a new mask called GFP_HIGH_MOVABLE. Allocations using the __GFP_MOVABLE can be either migrated using the page migration mechanism or reclaimed by syncing with backing storage and discarding. An API function very similar to alloc_zeroed_user_highpage() is added for __GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable(). The flags used by alloc_zeroed_user_highpage() are not changed because it would change the semantics of an existing API. After this patch is applied there are no in-kernel users of alloc_zeroed_user_highpage() so it probably should be marked deprecated if this patch is merged. Note that this patch includes a minor cleanup to the use of __GFP_ZERO in shmem.c to keep all flag modifications to inode->mapping in the shmem_dir_alloc() helper function. This clean-up suggestion is courtesy of Hugh Dickens. Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the concept. Credit to Hugh Dickens for catching issues with shmem swap vector and ramfs allocations. [akpm@linux-foundation.org: build fix] [hugh@veritas.com: __GFP_ZERO cleanup] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:22:59 -07:00
Geert Uytterhoeven	13bd59a111	Don't define empty struct bsg_class_device if !CONFIG_BLK_DEV_BSG Don't define an empty struct bsg_class_device if !CONFIG_BLK_DEV_BSG. It's embedded in struct request_queue, but there we have #if defined(CONFIG_BLK_DEV_BSG) struct bsg_class_device bsg_dev; #endif anyway. Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-17 14:18:47 +02:00
Linus Torvalds	489de30259	Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (209 commits) [POWERPC] Create add_rtc() function to enable the RTC CMOS driver [POWERPC] Add H_ILLAN_ATTRIBUTES hcall number [POWERPC] xilinxfb: Parameterize xilinxfb platform device registration [POWERPC] Oprofile support for Power 5++ [POWERPC] Enable arbitary speed tty ioctls and split input/output speed [POWERPC] Make drivers/char/hvc_console.c:khvcd() static [POWERPC] Remove dead code for preventing pread() and pwrite() calls [POWERPC] Remove unnecessary #undef printk from prom.c [POWERPC] Fix typo in Ebony default DTS [POWERPC] Check for NULL ppc_md.init_IRQ() before calling [POWERPC] Remove extra return statement [POWERPC] pasemi: Don't auto-select CONFIG_EMBEDDED [POWERPC] pasemi: Rename platform [POWERPC] arch/powerpc/kernel/sysfs.c: Move NUMA exports [POWERPC] Add __read_mostly support for powerpc [POWERPC] Modify sched_clock() to make CONFIG_PRINTK_TIME more sane [POWERPC] Create a dummy zImage if no valid platform has been selected [POWERPC] PS3: Bootwrapper support. [POWERPC] powermac i2c: Use mutex [POWERPC] Schedule removal of arch/ppc ... Fixed up conflicts manually in: Documentation/feature-removal-schedule.txt arch/powerpc/kernel/pci_32.c arch/powerpc/kernel/pci_64.c include/asm-powerpc/pci.h and asked the powerpc people to double-check the result..	2007-07-16 17:58:08 -07:00
Linus Torvalds	2e27afb300	Revert "[NET]: Fix races in net_rx_action vs netpoll." This reverts commit `29578624e3`. Ingo Molnar reports complete breakage with his e1000 card (no networking, card reports transmit timeouts), and bisected it down to this commit. Let's figure out what went wrong, but not keep breaking machines until we do. Cc: Ingo Molnar <mingo@elte.hu> Cc: Olaf Kirch <olaf.kirch@oracle.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 14:31:08 -07:00
Linus Torvalds	add096909d	Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (32 commits) [PATCH] ocfs2: zero_user_page conversion ocfs2: Support xfs style space reservation ioctls ocfs2: support for removing file regions ocfs2: update truncate handling of partial clusters ocfs2: btree support for removal of arbirtrary extents ocfs2: Support creation of unwritten extents ocfs2: support writing of unwritten extents ocfs2: small cleanup of ocfs2_write_begin_nolock() ocfs2: btree changes for unwritten extents ocfs2: abstract btree growing calls ocfs2: use all extent block suballocators ocfs2: plug truncate into cached dealloc routines ocfs2: simplify deallocation locking ocfs2: harden buffer check during mapping of page blocks ocfs2: shared writeable mmap ocfs2: factor out write aops into nolock variants ocfs2: rework ocfs2_buffered_write_cluster() ocfs2: take ip_alloc_sem during entire truncate ocfs2: Add "preferred slot" mount option [KJ PATCH] Replacing memset(<addr>,0,PAGE_SIZE) with clear_page() in fs/ocfs2/dlm/dlmrecovery.c ...	2007-07-16 10:52:55 -07:00
Linus Torvalds	e245befce7	Merge branch 'bsg' of git://git.kernel.dk/data/git/linux-2.6-block * 'bsg' of git://git.kernel.dk/data/git/linux-2.6-block: (25 commits) bsg: Kconfig updates bsg: add SCSI transport-level request support bsg: add bidi support add a struct request pointer to the request structure bsg: fix the deadlock on discarding done commands bsg: fix a blocking read bug bsg: minor bug fixes improve bsg device allocation bind bsg to all SCSI devices bsg: bind bsg to request_queue instead of gendisk bsg: add a request_queue argument to scsi_cmd_ioctl() bsg: simplify __bsg_alloc_command failpath bsg: add cheasy error checks for sysfs stuff Add queue resizing support Replace s32, u32 and u64 with __s32, __u32 and __u64 in bsg.h for userspace bsg: silence a bogus gcc warning bsg: style cleanup bsg: use u32 etc instead of uint32_t bsg: add SG_IO to SG v4 bsg: replace SG v3 with SG v4 ...	2007-07-16 10:50:19 -07:00
Badari Pulavarty	5e70030d4c	ext4: statfs speed up This is a patch that speeds up statfs. It is very simple - the "overhead" calculation, which takes a huge amount of time for large filesystems, never changes unless the size of the filesystem itself changes. That means we can store it in memory and only recalculate if the filesystem has been resized (almost never). It also fixes a minor problem that we never update the on-disk superblock free blocks/inodes counts until the filesystem is unmounted. While not fatal, we may as well update that on disk when we have the information, and it makes things like debugfs and dumpe2fs report a bit more accurate info. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:52 -07:00
Badari Pulavarty	a71ce8c6c9	ext3: statfs speed up This is a patch that speeds up statfs. It is very simple - the "overhead" calculation, which takes a huge amount of time for large filesystems, never changes unless the size of the filesystem itself changes. That means we can store it in memory and only recalculate if the filesystem has been resized (almost never). It also fixes a minor problem that we never update the on-disk superblock free blocks/inodes counts until the filesystem is unmounted. While not fatal, we may as well update that on disk when we have the information, and it makes things like debugfs and dumpe2fs report a bit more accurate info. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:52 -07:00
Badari Pulavarty	2235219b77	ext2: statfs speed up This is a patch that speeds up statfs. It is very simple - the "overhead" calculation, which takes a huge amount of time for large filesystems, never changes unless the size of the filesystem itself changes. That means we can store it in memory and only recalculate if the filesystem has been resized (almost never). It also fixes a minor problem that we never update the on-disk superblock free blocks/inodes counts until the filesystem is unmounted. While not fatal, we may as well update that on disk when we have the information, and it makes things like debugfs and dumpe2fs report a bit more accurate info. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:52 -07:00
Adrian Bunk	8f8a68ee48	remove mm/backing-dev.c:congestion_wait_interruptible() congestion_wait_interruptible() is no longer used. Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:52 -07:00
Oleg Nesterov	1f1f642e2f	make cancel_xxx_work_sync() return a boolean Change cancel_work_sync() and cancel_delayed_work_sync() to return a boolean indicating whether the work was actually cancelled. A zero return value means that the work was not pending/queued. Without that kind of change it is not possible to avoid flush_workqueue() sometimes, see the next patch as an example. Also, this patch unifies both functions and kills the (unlikely) busy-wait loop. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Jarek Poplawski <jarkao2@o2.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:51 -07:00
Oleg Nesterov	f5a421a450	rename cancel_rearming_delayed_work() to cancel_delayed_work_sync() Imho, the current naming of cancel_xxx workqueue functions is very confusing. cancel_delayed_work() cancel_rearming_delayed_work() cancel_rearming_delayed_workqueue() // obsolete cancel_work_sync() This looks as if the first 2 functions differ in "type" of their argument which is not true any longer, nowadays the difference is the behaviour. The semantics of cancel_rearming_delayed_work(dwork) was changed significantly, it doesn't require that dwork rearms itself, and cancels dwork synchronously. Rename it to cancel_delayed_work_sync(). This matches cancel_delayed_work() and cancel_work_sync(). Re-create cancel_rearming_delayed_work() as a simple inline obsolete wrapper, like cancel_rearming_delayed_workqueue(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Jarek Poplawski <jarkao2@o2.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:51 -07:00
Jesper Juhl	d52988023a	Remove the last few UMSDOS leftovers The UMSDOS filesystem was removed back in 2.6.11, but some tiny bits stuck around. This patch removes the few remaining leftovers. The only things left behind after this are the entries in the CREDITS file and the ioctl number in Documentation/ioctl-number.txt as documentation. This third (hopefully final) version of the patch doesn't edit the arch/um/config.release file, since Jeff Dike pointed out to me that it should die completely, and asked me to remove it from my patch as he'll send in a seperate patch removing the file completely. Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:51 -07:00
Heiko Carstens	608e261968	generic bug: use show_regs() instead of dump_stack() The current generic bug implementation has a call to dump_stack() in case a WARN_ON(whatever) gets hit. Since report_bug(), which calls dump_stack(), gets called from an exception handler we can do better: just pass the pt_regs structure to report_bug() and pass it to show_regs() in case of a warning. This will give more debug informations like register contents, etc... In addition this avoids some pointless lines that dump_stack() emits, since it includes a stack backtrace of the exception handler which is of no interest in case of a warning. E.g. on s390 the following lines are currently always present in a stack backtrace if dump_stack() gets called from report_bug(): [<000000000001517a>] show_trace+0x92/0xe8) [<0000000000015270>] show_stack+0xa0/0xd0 [<00000000000152ce>] dump_stack+0x2e/0x3c [<0000000000195450>] report_bug+0x98/0xf8 [<0000000000016cc8>] illegal_op+0x1fc/0x21c [<00000000000227d6>] sysc_return+0x0/0x10 Acked-by: Jeremy Fitzhardinge <jeremy@goop.org> Acked-by: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Andi Kleen <ak@suse.de> Cc: Kyle McMartin <kyle@parisc-linux.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:51 -07:00
Andrew Morton	cc2ea416b2	uninline check_signature() This is a rather bizarre thing to have inlined in io.h. Stick it in lib/ instead. While we're there, despaghetti it a bit, and fix its off-by-one behaviour when passed a zero length. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:50 -07:00
Andrea Arcangeli	cf99abace7	make seccomp zerocost in schedule This follows a suggestion from Chuck Ebbert on how to make seccomp absolutely zerocost in schedule too. The only remaining footprint of seccomp is in terms of the bzImage size that becomes a few bytes (perhaps even a few kbytes) larger, measure it if you care in the embedded. Signed-off-by: Andrea Arcangeli <andrea@cpushare.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:50 -07:00
Andrea Arcangeli	1d9d02feee	move seccomp from /proc to a prctl This reduces the memory footprint and it enforces that only the current task can enable seccomp on itself (this is a requirement for a strightforward [modulo preempt ;) ] TIF_NOTSC implementation). Signed-off-by: Andrea Arcangeli <andrea@cpushare.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:50 -07:00
Robert P. J. Day	132e4b0a04	cdrom: replace hard-coded constants by kernel.h macro. Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:48 -07:00
Eric W. Biederman	213dd266d4	namespace: ensure clone_flags are always stored in an unsigned long While working on unshare support for the network namespace I noticed we were putting clone flags in an int. Which is weird because the syscall uses unsigned long and we at least need an unsigned to properly hold all of the unshare flags. So to make the code consistent, this patch updates the code to use unsigned long instead of int for the clone flags in those places where we get it wrong today. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:48 -07:00
Robert P. J. Day	f489592597	Remove final two references to "__obsolete_setup" macro Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:48 -07:00
Nathan Lynch	dcf5008db1	remove unused lock_cpu_hotplug_interruptible definition `aa95387774` removed the implementation of lock_cpu_hotplug_interruptible and all users of it. This stub definition for !CONFIG_HOTPLUG_CPU was left over -- kill it now. Signed-off-by: Nathan Lynch <ntl@pobox.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:48 -07:00
Robert P. J. Day	ea5a3dcfda	COBALT: remove all references to Cobalt NVRAM Remove not only the references to Cobalt NVRAM, but the header file as well. Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Acked-by: Tim Hockin <thockin@hockin.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:47 -07:00
Serge E. Hallyn	77ec739d8d	user namespace: add unshare This patch enables the unshare of user namespaces. It adds a new clone flag CLONE_NEWUSER and implements copy_user_ns() which resets the current user_struct and adds a new root user (uid == 0) For now, unsharing the user namespace allows a process to reset its user_struct accounting and uid 0 in the new user namespace should be contained using appropriate means, for instance selinux The plan, when the full support is complete (all uid checks covered), is to keep the original user's rights in the original namespace, and let a process become uid 0 in the new namespace, with full capabilities to the new namespace. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Andrew Morgan <agm@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:47 -07:00
Cedric Le Goater	acce292c82	user namespace: add the framework Basically, it will allow a process to unshare its user_struct table, resetting at the same time its own user_struct and all the associated accounting. A new root user (uid == 0) is added to the user namespace upon creation. Such root users have full privileges and it seems that theses privileges should be controlled through some means (process capabilities ?) The unshare is not included in this patch. Changes since [try #4]: - Updated get_user_ns and put_user_ns to accept NULL, and get_user_ns to return the namespace. Changes since [try #3]: - moved struct user_namespace to files user_namespace.{c,h} Changes since [try #2]: - removed struct user_namespace* argument from find_user() Changes since [try #1]: - removed struct user_namespace* argument from find_user() - added a root_user per user namespace Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Andrew Morgan <agm@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:47 -07:00
Cedric Le Goater	7d69a1f4a7	remove CONFIG_UTS_NS and CONFIG_IPC_NS CONFIG_UTS_NS and CONFIG_IPC_NS have very little value as they only deactivate the unshare of the uts and ipc namespaces and do not improve performance. Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Acked-by: "Serge E. Hallyn" <serue@us.ibm.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:47 -07:00
Miloslav Trmac	522ed7767e	Audit: add TTY input auditing Add TTY input auditing, used to audit system administrator's actions. This is required by various security standards such as DCID 6/3 and PCI to provide non-repudiation of administrator's actions and to allow a review of past actions if the administrator seems to overstep their duties or if the system becomes misconfigured for unknown reasons. These requirements do not make it necessary to audit TTY output as well. Compared to an user-space keylogger, this approach records TTY input using the audit subsystem, correlated with other audit events, and it is completely transparent to the user-space application (e.g. the console ioctls still work). TTY input auditing works on a higher level than auditing all system calls within the session, which would produce an overwhelming amount of mostly useless audit events. Add an "audit_tty" attribute, inherited across fork (). Data read from TTYs by process with the attribute is sent to the audit subsystem by the kernel. The audit netlink interface is extended to allow modifying the audit_tty attribute, and to allow sending explanatory audit events from user-space (for example, a shell might send an event containing the final command, after the interactive command-line editing and history expansion is performed, which might be difficult to decipher from the TTY input alone). Because the "audit_tty" attribute is inherited across fork (), it would be set e.g. for sshd restarted within an audited session. To prevent this, the audit_tty attribute is cleared when a process with no open TTY file descriptors (e.g. after daemon startup) opens a TTY. See https://www.redhat.com/archives/linux-audit/2007-June/msg00000.html for a more detailed rationale document for an older version of this patch. [akpm@linux-foundation.org: build fix] Signed-off-by: Miloslav Trmac <mitr@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Paul Fulghum <paulkf@microgate.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:47 -07:00
Alan Cox	4f27c00bf8	Improve behaviour of spurious IRQ detect Currently we handle spurious IRQ activity based upon seeing a lot of invalid interrupts, and we clear things back on the base of lots of valid interrupts. Unfortunately in some cases you get legitimate invalid interrupts caused by timing asynchronicity between the PCI bus and the APIC bus when disabling interrupts and pulling other tricks. In this case although the spurious IRQs are not a problem our unhandled counters didn't clear and they act as a slow running timebomb. (This is effectively what the serial port/tty problem that was fixed by clearing counters when registering a handler showed up) It's easy enough to add a second parameter - time. This means that if we see a regular stream of harmless spurious interrupts which are not harming processing we don't go off and do something stupid like disable the IRQ after a month of running. OTOH lockups and performance killers show up a lot more than 10/second [akpm@linux-foundation.org: cleanup] Signed-off-by: Alan Cox <alan@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:46 -07:00
Dave Jones	5216184571	fix typo in prefetch.h Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:46 -07:00
Maxim Uvarov	b663a79c19	taskstats: add context-switch counters Make available to the user the following task and process performance statistics: * Involuntary Context Switches (task_struct->nivcsw) * Voluntary Context Switches (task_struct->nvcsw) Statistics information is available from: 1. taskstats interface (Documentation/accounting/) 2. /proc/PID/status (task only). This data is useful for detecting hyperactivity patterns between processes. [akpm@linux-foundation.org: cleanup] Signed-off-by: Maxim Uvarov <muvarov@ru.mvista.com> Cc: Shailabh Nagar <nagar@watson.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Jay Lan <jlan@engr.sgi.com> Cc: Jonathan Lim <jlim@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:46 -07:00
Robert P. J. Day	dcae56ea66	Drop an empty isicom.h from being exported to user space. Drop <linux/isicom.h> from being exported to user space since it would be only an empty file. Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Acked-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:46 -07:00
Adrian Bunk	c289dca379	remove sonypi_camera_command() Remove the no longer used sonypi_camera_command(). Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Mattia Dongili <malattia@linux.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:46 -07:00
Jan Engelhardt	759448f459	Kernel utf-8 handling This patch fixes dead keys and copy/paste of non-ASCII characters in UTF-8 mode on Linux console. See more details about the original patch at: http://chris.heathens.co.nz/linux/utf8.html Already posted on (Oldest) http://lkml.org/lkml/2003/5/31/148 http://lkml.org/lkml/2005/12/24/69 (Recent) http://lkml.org/lkml/2006/8/7/75 [bunk@stusta.de: make drivers/char/selection.c:store_utf8() static] Signed-off-by: Jan Engelhardt <jengelh@gmx.de> Cc: Alexander E. Patrakov <patrakov@ums.usu.ru> Cc: Dmitry Torokhov <dtor@mail.ru> Cc: "Antonino A. Daplas" <adaplas@pol.net> Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:46 -07:00
Alexey Dobriyan	aa0ac36518	Remove capability.h from mm.h I forgot to remove capability.h from mm.h while removing sched.h! This patch remedies that, because the only inline function which was using CAP_something was made out of line. Cross-compile tested without regressions on: all powerpc defconfigs all mips defconfigs all m68k defconfigs all arm defconfigs all ia64 defconfigs alpha alpha-allnoconfig alpha-defconfig alpha-up arm i386 i386-allnoconfig i386-defconfig i386-up ia64 ia64-allnoconfig ia64-defconfig ia64-up m68k mips parisc parisc-allnoconfig parisc-defconfig parisc-up powerpc powerpc-up s390 s390-allnoconfig s390-defconfig s390-up sparc sparc-allnoconfig sparc-defconfig sparc-up sparc64 sparc64-allnoconfig sparc64-defconfig sparc64-up um-x86_64 x86_64 x86_64-allnoconfig x86_64-defconfig x86_64-up as well as my two usual configs. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:45 -07:00
Ulrich Drepper	4a19542e5f	O_CLOEXEC for SCM_RIGHTS Part two in the O_CLOEXEC saga: adding support for file descriptors received through Unix domain sockets. The patch is once again pretty minimal, it introduces a new flag for recvmsg and passes it just like the existing MSG_CMSG_COMPAT flag. I think this bit is not used otherwise but the networking people will know better. This new flag is not recognized by recvfrom and recv. These functions cannot be used for that purpose and the asymmetry this introduces is not worse than the already existing MSG_CMSG_COMPAT situations. The patch must be applied on the patch which introduced O_CLOEXEC. It has to remove static from the new get_unused_fd_flags function but since scm.c cannot live in a module the function still hasn't to be exported. Here's a test program to make sure the code works. It's so much longer than the actual patch... #include <errno.h> #include <error.h> #include <fcntl.h> #include <stdio.h> #include <string.h> #include <unistd.h> #include <sys/socket.h> #include <sys/un.h> #ifndef O_CLOEXEC # define O_CLOEXEC 02000000 #endif #ifndef MSG_CMSG_CLOEXEC # define MSG_CMSG_CLOEXEC 0x40000000 #endif int main (int argc, char argv[]) { if (argc > 1) { int fd = atol (argv[1]); printf ("child: fd = %d\n", fd); if (fcntl (fd, F_GETFD) == 0 \|\| errno != EBADF) { puts ("file descriptor valid in child"); return 1; } return 0; } struct sockaddr_un sun; strcpy (sun.sun_path, "./testsocket"); sun.sun_family = AF_UNIX; char databuf[] = "hello"; struct iovec iov[1]; iov[0].iov_base = databuf; iov[0].iov_len = sizeof (databuf); union { struct cmsghdr hdr; char bytes[CMSG_SPACE (sizeof (int))]; } buf; struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 1, .msg_control = buf.bytes, .msg_controllen = sizeof (buf) }; struct cmsghdr cmsg = CMSG_FIRSTHDR (&msg); cmsg->cmsg_level = SOL_SOCKET; cmsg->cmsg_type = SCM_RIGHTS; cmsg->cmsg_len = CMSG_LEN (sizeof (int)); msg.msg_controllen = cmsg->cmsg_len; pid_t child = fork (); if (child == -1) error (1, errno, "fork"); if (child == 0) { int sock = socket (PF_UNIX, SOCK_STREAM, 0); if (sock < 0) error (1, errno, "socket"); if (bind (sock, (struct sockaddr ) &sun, sizeof (sun)) < 0) error (1, errno, "bind"); if (listen (sock, SOMAXCONN) < 0) error (1, errno, "listen"); int conn = accept (sock, NULL, NULL); if (conn == -1) error (1, errno, "accept"); (int ) CMSG_DATA (cmsg) = sock; if (sendmsg (conn, &msg, MSG_NOSIGNAL) < 0) error (1, errno, "sendmsg"); return 0; } / For a test suite this should be more robust like a barrier in shared memory. / sleep (1); int sock = socket (PF_UNIX, SOCK_STREAM, 0); if (sock < 0) error (1, errno, "socket"); if (connect (sock, (struct sockaddr ) &sun, sizeof (sun)) < 0) error (1, errno, "connect"); unlink (sun.sun_path); (int ) CMSG_DATA (cmsg) = -1; if (recvmsg (sock, &msg, MSG_CMSG_CLOEXEC) < 0) error (1, errno, "recvmsg"); int fd = (int ) CMSG_DATA (cmsg); if (fd == -1) error (1, 0, "no descriptor received"); char fdname[20]; snprintf (fdname, sizeof (fdname), "%d", fd); execl ("/proc/self/exe", argv[0], fdname, NULL); puts ("execl failed"); return 1; } [akpm@linux-foundation.org: Fix fastcall inconsistency noted by Michael Buesch] [akpm@linux-foundation.org: build fix] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Michael Buesch <mb@bu3sch.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:45 -07:00
Venki Pallipadi	c5c061b8f9	Add a flag to indicate deferrable timers in /proc/timer_stats Add a flag in /proc/timer_stats to indicate deferrable timers. This will let developers/users to differentiate between types of tiemrs in /proc/timer_stats. Deferrable timer and normal timer will appear in /proc/timer_stats as below. 10D, 1 swapper queue_delayed_work_on (delayed_work_timer_fn) 10, 1 swapper queue_delayed_work_on (delayed_work_timer_fn) Also version of timer_stats changes from v0.1 to v0.2 Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:45 -07:00
Christoph Hellwig	e080706190	remove odd and misleading comments from uio.h Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:45 -07:00
Dan Williams	1b0fac4587	dma-mapping: prevent dma dependent code from linking on !HAS_DMA archs Continuing the work started in `411f0f3edc` ... This enables code with a dma path, that compiles away, to build without requiring additional code factoring. It also prevents code that calls dma_alloc_coherent and dma_free_coherent from linking whereas previously the code would hit a BUG() at run time. Finally, it allows archs that set !HAS_DMA to delete their asm/dma-mapping.h file. Cc: Cornelia Huck <cornelia.huck@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: John W. Linville <linville@tuxdriver.com> Cc: Kyle McMartin <kyle@parisc-linux.org> Cc: James Bottomley <James.Bottomley@SteelEye.com> Cc: Tejun Heo <htejun@gmail.com> Cc: Jeff Garzik <jeff@garzik.org> Cc: <geert@linux-m68k.org> Cc: <zippel@linux-m68k.org> Cc: <spyro@f2s.com> Cc: <ysato@users.sourceforge.jp> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:45 -07:00
Stefan Richter	9e7bf24b1b	fs: clarify "dummy" member in struct inodes_stat_t Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:45 -07:00
David Howells	e8d6c55412	AFS: implement file locking Implement file locking for AFS. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:43 -07:00
Robert P. J. Day	0a3021f4e2	Remove unnecessary includes of spinlock.h under include/linux Remove the obviously unnecessary includes of <linux/spinlock.h> under the include/linux/ directory, and fix the couple errors that are introduced as a result of that. Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:42 -07:00
OGAWA Hirofumi	9aacd59934	fat: gcc 4.3 warning fix This patch fixes the following warnings. fs/fat/dir.c: In function 'fat_parse_long': include/linux/msdos_fs.h:294: warning: array subscript is above array bounds include/linux/msdos_fs.h:295: warning: array subscript is above array bounds include/linux/msdos_fs.h:295: warning: array subscript is above array bounds The ->name is defined as "name[8], ext[3]", but fat_checksum() uses those as name[11]. There is no actual problem, but it's not a good manner. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:42 -07:00
Andrew Morton	c67ad917cb	percpu_counters(): use cpu notifiers per-cpu counters presently must iterate over all possible CPUs in the exhaustive percpu_counter_sum(). But it can be much better to only iterate over the presently-online CPUs. To do this, we must arrange for an offlined CPU's count to be spilled into the counter's central count. We can do this for all percpu_counters in the machine by linking them into a single global list and walking that list at CPU_DEAD time. (I hope. Might have race windows in which the percpu_counter_sum() count is inaccurate?) Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:41 -07:00
Andrew Morton	21f3da95da	fuse warning fix gcc-4.3: fs/fuse/dir.c: In function 'parse_dirfile': fs/fuse/dir.c:833: warning: cast from pointer to integer of different size fs/fuse/dir.c:835: warning: cast from pointer to integer of different size [miklos@szeredi.hu: use offsetof] Acked-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:41 -07:00
Matthias Kaehlcke	9ac162521c	Use mutexes instead of semaphores in I2O driver The I2O driver uses two semaphores as mutexes. Use the mutex API instead of the (binary) semaphores. Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:41 -07:00
Alan Cox	9c1729db3e	Prevent an O_NDELAY writer from blocking when a tty write is blocked by the tty atomic writer mutex Without this a tty write could block if a previous blocking tty write was in progress on the same tty and blocked by a line discipline or hardware event. Originally found and reported by Dave Johnson. Signed-off-by: Alan Cox <alan@redhat.com> Acked-by: Dave Johnson <djohnson+linux-kernel@sw.starentnetworks.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:41 -07:00
Tomas Janousek	924b42d5a2	Use boot based time for process start time and boot time in /proc Commit `411187fb05` caused boot time to move and process start times to become invalid after suspend. Using boot based time for those restores the old behaviour and fixes the issue. [akpm@linux-foundation.org: little cleanup] Signed-off-by: Tomas Janousek <tjanouse@redhat.com> Cc: Tomas Smetana <tsmetana@redhat.com> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:41 -07:00
Tomas Janousek	7c3f1a5732	Introduce boot based time The commits `411187fb05` (GTOD: persistent clock support) `c1d370e167` (i386: use GTOD persistent clock support) changed the monotonic time so that it no longer jumps after resume, but it's not possible to use it for boot time and process start time calculations then. Also, the uptime no longer increases during suspend. I add a variable to track the wall_to_monotonic changes, a function to get the real boot time and a function to get the boot based time from the monotonic one. [akpm@linux-foundation.org: remove exports, add comment] Signed-off-by: Tomas Janousek <tjanouse@redhat.com> Cc: Tomas Smetana <tsmetana@redhat.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:41 -07:00
Zhang, Yanmin	ed4aaadb1a	fix jvc cdrom drive lockup Before calling init_hwif_default, ide_unregister gets lock ide_lock and disables irq. init_hwif_default calls ide_default_io_base which calls pci_get_device and later pci_get_subsys tries to apply for semaphore pci_bus_sem and goes to sleep. Mostly, pci_get_device should be called when irq is turned on. ide_default_io_base just needs find if list pci_devices is empty. Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com> Cc: Greg KH <greg@kroah.com> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:40 -07:00
Satyam Sharma	e1f4a88c5a	introduce write_trylock_irqsave() Introduce a write_trylock_irqsave() implementation. Similar in style to the implementation of spin_trylock_irqsave() in mainline. Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in> Cc: Sripathi Kodi <sripathik@in.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:40 -07:00
Alexey Dobriyan	786d7e1612	Fix rmmod/read/write races in /proc entries Fix following races: =========================================== 1. Write via ->write_proc sleeps in copy_from_user(). Module disappears meanwhile. Or, more generically, system call done on /proc file, method supplied by module is called, module dissapeares meanwhile. pde = create_proc_entry() if (!pde) return -ENOMEM; pde->write_proc = ... open write copy_from_user pde = create_proc_entry(); if (!pde) { remove_proc_entry(); return -ENOMEM; /* module unloaded / } boom* ========================================== 2. bogo-revoke aka proc_kill_inodes() remove_proc_entry vfs_read proc_kill_inodes [check ->f_op validness] [check ->f_op->read validness] [verify_area, security permissions checks] ->f_op = NULL; if (file->f_op->read) /* ->f_op dereference, boom */ NOTE, NOTE, NOTE: file_operations are proxied for regular files only. Let's see how this scheme behaves, then extend if needed for directories. Directories creators in /proc only set ->owner for them, so proxying for directories may be unneeded. NOTE, NOTE, NOTE: methods being proxied are ->llseek, ->read, ->write, ->poll, ->unlocked_ioctl, ->ioctl, ->compat_ioctl, ->open, ->release. If your in-tree module uses something else, yell on me. Full audit pending. [akpm@linux-foundation.org: build fix] Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:39 -07:00
Miklos Szeredi	0165ab4435	split mmap This is a straightforward split of do_mmap_pgoff() into two functions: - do_mmap_pgoff() checks the parameters, and calculates the vma flags. Then it calls - mmap_region(), which does the actual mapping Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:37 -07:00
Paul Mundt	6193a2ff18	slob: initial NUMA support This adds preliminary NUMA support to SLOB, primarily aimed at systems with small nodes (tested all the way down to a 128kB SRAM block), whether asymmetric or otherwise. We follow the same conventions as SLAB/SLUB, preferring current node placement for new pages, or with explicit placement, if a node has been specified. Presently on UP NUMA this has the side-effect of preferring node#0 allocations (since numa_node_id() == 0, though this could be reworked if we could hand off a pfn to determine node placement), so single-CPU NUMA systems will want to place smaller nodes further out in terms of node id. Once a page has been bound to a node (via explicit node id typing), we only do block allocations from partial free pages that have a matching node id in the page flags. The current implementation does have some scalability problems, in that all partial free pages are tracked in the global freelist (with contention due to the single spinlock). However, these are things that are being reworked for SMP scalability first, while things like per-node freelists can easily be built on top of this sort of functionality once it's been added. More background can be found in: http://marc.info/?l=linux-mm&m=118117916022379&w=2 http://marc.info/?l=linux-mm&m=118170446306199&w=2 http://marc.info/?l=linux-mm&m=118187859420048&w=2 and subsequent threads. Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:36 -07:00
Jan Beulich	8f0accc862	kill vmalloc_earlyreserve This symbol got orphaned quite a while ago. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:36 -07:00
Andrew Morton	fc9a07e7bf	invalidate_mapping_pages(): add cond_resched invalidate_mapping_pages() can sometimes take a long time (millions of pages to free). Long enough for the softlockup detector to trigger. We used to have a cond_resched() in there but I took it out because the drop_caches code calls invalidate_mapping_pages() under inode_lock. The patch adds a nasty flag and puts the cond_resched() back. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:36 -07:00
Robert P. J. Day	698827fa9f	Remove the deprecated "kmem_cache_t" typedef from slab.h. Given that there is no remaining usage of the deprecated kmem_cache_t typedef anywhere in the tree, remove that typedef. Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:35 -07:00
KAMEZAWA Hiroyuki	f0c0b2b808	change zonelist order: zonelist order selection logic Make zonelist creation policy selectable from sysctl/boot option v6. This patch makes NUMA's zonelist (of pgdat) order selectable. Available order are Default(automatic)/ Node-based / Zone-based. [Default Order] The kernel selects Node-based or Zone-based order automatically. [Node-based Order] This policy treats the locality of memory as the most important parameter. Zonelist order is created by each zone's locality. This means lower zones (ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion. IOW. ZONE_DMA will be in the middle of zonelist. current 2.6.21 kernel uses this. Pros. * A user can expect local memory as much as possible. Cons. * lower zone will be exhansted before higher zone. This may cause OOM_KILL. Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL because of ZONE_DMA exhaution and you need the best locality. (example) assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL. node(0)'s memory allocation order: node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL. node(1)'s memory allocation order: node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA. [Zone-based order] This policy treats the zone type as the most important parameter. Zonelist order is created by zone-type order. This means lower zone never be used bofere higher zone exhaustion. IOW. ZONE_DMA will be always at the tail of zonelist. Pros. * OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted. Cons. * memory locality may not be best. (example) assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL. node(0)'s memory allocation order: node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA. node(1)'s memory allocation order: node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA. bootoption "numa_zonelist_order=" and proc/sysctl is supporetd. command: %echo N > /proc/sys/vm/numa_zonelist_order Will rebuild zonelist in Node-based order. command: %echo Z > /proc/sys/vm/numa_zonelist_order Will rebuild zonelist in Zone-based order. Thanks to Lee Schermerhorn, he gives me much help and codes. [Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order] [akpm@linux-foundation.org: build fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:35 -07:00
Yinghai Lu	18a8bd949d	serial: convert early_uart to earlycon for 8250 Beacuse SERIAL_PORT_DFNS is removed from include/asm-i386/serial.h and include/asm-x86_64/serial.h. the serial8250_ports need to be probed late in serial initializing stage. the console_init=>serial8250_console_init=> register_console=>serial8250_console_setup will return -ENDEV, and console ttyS0 can not be enabled at that time. need to wait till uart_add_one_port in drivers/serial/serial_core.c to call register_console to get console ttyS0. that is too late. Make early_uart to use early_param, so uart console can be used earlier. Make it to be bootconsole with CON_BOOT flag, so can use console handover feature. and it will switch to corresponding normal serial console automatically. new command line will be: console=uart8250,io,0x3f8,9600n8 console=uart8250,mmio,0xff5e0000,115200n8 or earlycon=uart8250,io,0x3f8,9600n8 earlycon=uart8250,mmio,0xff5e0000,115200n8 it will print in very early stage: Early serial console at I/O port 0x3f8 (options '9600n8') console [uart0] enabled later for console it will print: console handover: boot [uart0] -> real [ttyS0] Signed-off-by: <yinghai.lu@sun.com> Cc: Andi Kleen <ak@suse.de> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Gerd Hoffmann <kraxel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:35 -07:00
Kristian Hoegsberg	23936cc0b5	lib: add idr_remove_all Remove all ids from the given idr tree. idr_destroy() only frees up unused, cached idp_layers, but this function will remove all id mappings and leave all idp_layers unused. A typical clean-up sequence for objects stored in an idr tree, will use idr_for_each() to free all objects, if necessay, then idr_remove_all() to remove all ids, and idr_destroy() to free up the cached idr_layers. Signed-off-by: Kristian Hoegsberg <krh@redhat.com> Cc: Tejun Heo <htejun@gmail.com> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:34 -07:00
Kristian Hoegsberg	96d7fa421e	lib: add idr_for_each() This patch adds an iterator function for the idr data structure. Compared to just iterating through the idr with an integer and idr_find, this iterator is (almost, but not quite) linear in the number of elements, as opposed to the number of integers in the range covered by the idr. This makes a difference for sparse idrs, but more importantly, it's a nicer way to iterate through the elements. The drm subsystem is moving to idr for tracking contexts and drawables, and with this change, we can use the idr exclusively for tracking these resources. [akpm@linux-foundation.org: fix comment] Signed-off-by: Kristian Hoegsberg <krh@redhat.com> Cc: Tejun Heo <htejun@gmail.com> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:34 -07:00
Nitin Gupta	f2a11b158a	LZO1X: fix lzo1x_worst_compress This is a correction for a macro which gives worst case compressed data size by LZO1X. This patch was provided by the LZO author (Markus Oberhumer). Signed-off-by: Nitin Gupta <nitingupta910@gmail.com> Cc: "Markus F.X.J. Oberhumer" <markus@oberhumer.com> Cc: "Richard Purdie" <rpurdie@openedhand.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:34 -07:00
Avi Kivity	a52b1752c0	SMP: Allow smp_call_function_single() to current cpu This removes the requirement for callers to get_cpu() to check in simple cases. This patch is for !CONFIG_SMP. Cc: Andi Kleen <ak@suse.de> Signed-off-by: Avi Kivity <avi@qumranet.com>	2007-07-16 12:05:50 +03:00
Avi Kivity	db912f9639	HOTPLUG: Add CPU_DYING notifier KVM wants a notification when a cpu is about to die, so it can disable hardware extensions, but at a time when user processes cannot be scheduled on the cpu, so it doesn't try to use virtualization extensions after they have been disabled. This adds a CPU_DYING notification. The notification is called in atomic context on the doomed cpu. Signed-off-by: Avi Kivity <avi@qumranet.com>	2007-07-16 12:05:49 +03:00
Avi Kivity	d6d2816849	KVM: Remove kvmfs in favor of the anonymous inodes source kvm uses a pseudo filesystem, kvmfs, to generate inodes, a job that the new anonymous inodes source does much better. Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Avi Kivity <avi@qumranet.com>	2007-07-16 12:05:49 +03:00
FUJITA Tomonori	15d10b611f	bsg: add SCSI transport-level request support This enables bsg to handle SCSI transport-level request like SAS management protocol (SMP). - add BSG_SUB_PROTOCOL_{SCSI_CMD, SCSI_TMF, SCSI_TRANSPORT} definitions. - SCSI transport-level requests skip blk_verify_command(). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-16 08:52:47 +02:00
FUJITA Tomonori	abae1fde63	add a struct request pointer to the request structure This adds a struct request pointer to the request structure for the second data phase (bidi for now). A request queue supporting bidi requests sets QUEUE_FLAG_BIDI. This prevents sending bidi requests to a non-bidi queue. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-16 08:52:46 +02:00
FUJITA Tomonori	4cf0723ac8	bsg: minor bug fixes This fixes the following minor issues: - add EXPORT_SYMBOL_GPL for bsg_register_queue and bsg_unregister_queue. - shut up gcc warnings Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <axboe@nelson.home.kernel.dk>	2007-07-16 08:52:46 +02:00
FUJITA Tomonori	d351af01b9	bsg: bind bsg to request_queue instead of gendisk This patch binds bsg devices to request_queue instead of gendisk. Any objects (like transport entities) can define own request_handler and create own bsg device. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-16 08:52:46 +02:00
FUJITA Tomonori	45e79a3acd	bsg: add a request_queue argument to scsi_cmd_ioctl() bsg uses scsi_cmd_ioctl() for some SCSI/sg ioctl commands. scsi_cmd_ioctl() gets a request queue from a gendisk arguement. This prevents bsg being bound to SCSI devices that don't have a gendisk (like OSD). This adds a request_queue argument to scsi_cmd_ioctl(). The SCSI/sg ioctl commands doesn't use a gendisk so it's safe for any SCSI devices to use scsi_cmd_ioctl(). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-16 08:52:45 +02:00
FUJITA Tomonori	3862153b67	Replace s32, u32 and u64 with __s32, __u32 and __u64 in bsg.h for userspace Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-16 08:52:45 +02:00
Jens Axboe	1594a3f0eb	bsg: use u32 etc instead of uint32_t Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-16 08:52:44 +02:00
FUJITA Tomonori	45977d0e87	bsg: add sg_io_v4 structure This patch adds sg_io_v4 structure that Doug proposed last month. There's one major change from the RFC. I dropped iovec, which needs compat stuff. The bsg code simply calls blk_rq_map_user against dout_xferp/din_xferp. So if possible, the page frames are directly mapped. If not possible, the block layer allocates new page frames and does memory copies. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-16 08:52:44 +02:00
FUJITA Tomonori	337ad41dea	block: export blk_verify_command for SG v4 blk_fill_sghdr_rq doesn't work for SG v4 so verify_command needed to be exported. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-16 08:52:44 +02:00
Jens Axboe	3d6392cfbd	bsg: support for full generic block layer SG v3 Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-16 08:52:44 +02:00
Richard Purdie	655bfd7aeb	backlight: Convert from struct class_device to struct device Convert the backlight and LCD classes from struct class_device to struct device since class_device is scheduled for removal. One nasty API break is the backlight power attribute has had to be renamed to bl_power and the LCD power attribute has had to be renamed to lcd_power since the original names clash with the core. I can't see a way around this. Signed-off-by: Richard Purdie <rpurdie@rpsys.net> Acked-by: Greg Kroah-Hartman <gregkh@suse.de>	2007-07-16 01:20:23 +01:00
Richard Purdie	f8a7c6fe14	leds: Convert from struct class_device to struct device Convert the LEDs class from struct class_device to struct device since class_device is scheduled for removal. Signed-off-by: Richard Purdie <rpurdie@rpsys.net> Acked-by: Greg Kroah-Hartman <gregkh@suse.de>	2007-07-16 01:15:51 +01:00
Raphael Assenat	22e03f3b58	leds: Add generic GPIO LED driver This patch adds support for GPIO connected leds via the new GPIO framework. Information about leds (gpio, polarity, name, default trigger) is passed to the driver via platform_data. Signed-off-by: Raphael Assenat <raph@8d.com> Signed-off-by: Richard Purdie <rpurdie@rpsys.net>	2007-07-16 01:15:50 +01:00
Linus Torvalds	d3502d7f25	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 * 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (53 commits) [TCP]: Verify the presence of RETRANS bit when leaving FRTO [IPV6]: Call inet6addr_chain notifiers on link down [NET_SCHED]: Kill CONFIG_NET_CLS_POLICE [NET_SCHED]: act_api: qdisc internal reclassify support [NET_SCHED]: sch_dsmark: act_api support [NET_SCHED]: sch_atm: act_api support [NET_SCHED]: sch_atm: Lindent [IPV6]: MSG_ERRQUEUE messages do not pass to connected raw sockets [IPV4]: Cleanup call to __neigh_lookup() [NET_SCHED]: Revert "avoid transmit softirq on watchdog wakeup" optimization [NETFILTER]: nf_conntrack: UDPLITE support [NETFILTER]: nf_conntrack: mark protocols __read_mostly [NETFILTER]: x_tables: add connlimit match [NETFILTER]: Lower *tables printk severity [NETFILTER]: nf_conntrack: Don't track locally generated special ICMP error [NETFILTER]: nf_conntrack: Introduces nf_ct_get_tuplepr and uses it [NETFILTER]: nf_conntrack: make l3proto->prepare() generic and renames it [NETFILTER]: nf_conntrack: Increment error count on parsing IPv4 header [NET]: Add ethtool support for NETIF_F_IPV6_CSUM devices. [AF_IUCV]: Add lock when updating accept_q ...	2007-07-15 16:50:46 -07:00
Al Viro	4381ca3c23	fix return type of skb_checksum_complete() It returns __sum16, not unsigned int Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-15 16:40:51 -07:00
David S. Miller	d09f51b699	Merge master.kernel.org:/pub/scm/linux/kernel/git/herbert/crypto-2.6 Conflicts: crypto/Kconfig	2007-07-14 23:47:04 -07:00
Jan Engelhardt	370786f9cf	[NETFILTER]: x_tables: add connlimit match ipt_connlimit has been sitting in POM-NG for a long time. Here is a new shiny xt_connlimit with: * xtables'ified * will request the layer3 module (previously it hotdropped every packet when it was not loaded) * fixed: there was a deadlock in case of an OOM condition * support for any layer4 protocol (e.g. UDP/SCTP) * using jhash, as suggested by Eric Dumazet * ipv6 support Signed-off-by: Jan Engelhardt <jengelh@gmx.de> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-07-14 20:47:26 -07:00
Michael Chan	6460d948f3	[NET]: Add ethtool support for NETIF_F_IPV6_CSUM devices. Add ethtool utility function to set or clear IPV6_CSUM feature flag. Modify tg3.c and bnx2.c to use this function when doing ethtool -K to change tx checksum. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-07-14 19:07:52 -07:00
David S. Miller	cf3842ec50	Merge branch 'upstream-davem' of master.kernel.org:/pub/scm/linux/kernel/git/linville/wireless-2.6	2007-07-14 18:58:49 -07:00
Patrick McHardy	b863ceb7dd	[NET]: Add macvlan driver Add macvlan driver, which allows to create virtual ethernet devices based on MAC address. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-07-14 18:55:06 -07:00
Patrick McHardy	56addd6eee	[VLAN]: Use multicast list synchronization helpers Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-07-14 18:53:28 -07:00
Patrick McHardy	6c78dcbd47	[VLAN]: Fix promiscous/allmulti synchronization races The set_multicast_list function may be called without holding the rtnl mutex, resulting in races when changing the underlying device's promiscous and allmulti state. Use the change_rx_mode hook, which is always invoked under the rtnl. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-07-14 18:52:56 -07:00
Patrick McHardy	a0a400d79e	[NET]: dev_mcast: add multicast list synchronization helpers The method drivers currently use to synchronize multicast lists is not very pretty: - walk the multicast list - search each entry on a copy of the previous list - if new add to lower device - walk the copy of the previous list - search each entry on the current list - if removed delete from lower device - copy entire list This patch adds a new field to struct dev_addr_list to store the synchronization state and adds two helper functions for synchronization and cleanup. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-07-14 18:52:02 -07:00
Patrick McHardy	24023451c8	[NET]: Add net_device change_rx_mode callback Currently the set_multicast_list (and set_rx_mode) callbacks are responsible for configuring the device according to the IFF_PROMISC, IFF_MULTICAST and IFF_ALLMULTI flags and the mc_list (and uc_list in case of set_rx_mode). These callbacks can be invoked from BH context without the rtnl_mutex by dev_mc_add/dev_mc_delete, which makes reading the device flags and promiscous/allmulti count racy. For real hardware drivers that just commit all changes to the hardware this is not a real problem since the stack guarantees to call them for every change, so at least the final call will not race and commit the correct configuration to the hardware. For software devices that want to synchronize promiscous and multicast state to an underlying device however this can cause corruption of the underlying device's flags or promisc/allmulti counts. When the software device is concurrently put in promiscous or allmulti mode while set_multicast_list is invoked from bottem half context, the device might synchronize the change to the underlying device without holding the rtnl_mutex, which races with concurrent changes to the underlying device. Add a dev->change_rx_flags hook that is invoked when any of the flags that affect rx filtering change (under the rtnl_mutex), which allows drivers to perform synchronization immediately and only synchronize the address lists in set_multicast_list/set_rx_mode. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-07-14 18:51:31 -07:00
Linus Torvalds	16cefa8c38	Merge git://git.linux-nfs.org/pub/linux/nfs-2.6 * git://git.linux-nfs.org/pub/linux/nfs-2.6: (122 commits) sunrpc: drop BKL around wrap and unwrap NFSv4: Make sure unlock is really an unlock when cancelling a lock NLM: fix source address of callback to client SUNRPC client: add interface for binding to a local address SUNRPC server: record the destination address of a request SUNRPC: cleanup transport creation argument passing NFSv4: Make the NFS state model work with the nosharedcache mount option NFS: Error when mounting the same filesystem with different options NFS: Add the mount option "nosharecache" NFS: Add support for mounting NFSv4 file systems with string options NFS: Add final pieces to support in-kernel mount option parsing NFS: Introduce generic mount client API NFS: Add enums and match tables for mount option parsing NFS: Improve debugging output in NFS in-kernel mount client NFS: Clean up in-kernel NFS mount NFS: Remake nfsroot_mount as a permanent part of NFS client SUNRPC: Add a convenient default for the hostname when calling rpc_create() SUNRPC: Rename rpcb_getport to be consistent with new rpcb_getport_sync name SUNRPC: Rename rpcb_getport_external routine SUNRPC: Allow rpcbind requests to be interrupted by a signal. ...	2007-07-13 16:46:18 -07:00
Linus Torvalds	e030dbf91a	Merge branch 'ioat-md-accel-for-linus' of git://lost.foo-projects.org/~dwillia2/git/iop * 'ioat-md-accel-for-linus' of git://lost.foo-projects.org/~dwillia2/git/iop: (28 commits) ioatdma: add the unisys "i/oat" pci vendor/device id ARM: Add drivers/dma to arch/arm/Kconfig iop3xx: surface the iop3xx DMA and AAU units to the iop-adma driver iop13xx: surface the iop13xx adma units to the iop-adma driver dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines md: remove raid5 compute_block and compute_parity5 md: handle_stripe5 - request io processing in raid5_run_ops md: handle_stripe5 - add request/completion logic for async expand ops md: handle_stripe5 - add request/completion logic for async read ops md: handle_stripe5 - add request/completion logic for async check ops md: handle_stripe5 - add request/completion logic for async compute ops md: handle_stripe5 - add request/completion logic for async write ops md: common infrastructure for running operations with raid5_run_ops md: raid5_run_ops - run stripe operations outside sh->lock raid5: replace custom debug PRINTKs with standard pr_debug raid5: refactor handle_stripe5 and handle_stripe6 (v3) async_tx: add the async_tx api xor: make 'xor_blocks' a library routine for use with async_tx dmaengine: make clients responsible for managing channels dmaengine: refactor dmaengine around dma_async_tx_descriptor ...	2007-07-13 10:52:27 -07:00
Linus Torvalds	aba2da66cf	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched * git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: [PATCH] sched: small topology.h cleanup [PATCH] sched: fix show_task()/show_tasks() output [PATCH] sched: remove stale version info from kernel/sched_debug.c [PATCH] sched: allow larger granularity [PATCH] sched: fix prio_to_wmult[] for nice 1 [ I re-did the commits to get rid of some bogus merge commit that Ingo had. - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-13 10:13:37 -07:00
Ingo Molnar	f787a50306	[PATCH] sched: small topology.h cleanup trivial cleanup: LOCAL_DISTANCE and REMOTE_DISTANCE are only used in topology.h and inside an #ifndef section - limit their existence to that #ifndef. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-13 10:11:52 -07:00
Dan Williams	3039f0735a	ioatdma: add the unisys "i/oat" pci vendor/device id Cc: John Magolan <john.magolan@unisys.com> Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2007-07-13 08:06:19 -07:00
Dan Williams	b5e98d65d3	md: handle_stripe5 - add request/completion logic for async read ops When a read bio is attached to the stripe and the corresponding block is marked R5_UPTODATE, then a read (biofill) operation is scheduled to copy the data from the stripe cache to the bio buffer. handle_stripe flags the blocks to be operated on with the R5_Wantfill flag. If new read requests arrive while raid5_run_ops is running they will not be handled until handle_stripe is scheduled to run again. Changelog: * cleanup to_read and to_fill accounting * do not fail reads that have reached the cache Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:17 -07:00
Dan Williams	f38e12199a	md: handle_stripe5 - add request/completion logic for async compute ops handle_stripe will compute a block when a backing disk has failed, or when it determines it can save a disk read by computing the block from all the other up-to-date blocks. Previously a block would be computed under the lock and subsequent logic in handle_stripe could use the newly up-to-date block. With the raid5_run_ops implementation the compute operation is carried out a later time outside the lock. To preserve the old functionality we take advantage of the dependency chain feature of async_tx to flag the block as R5_Wantcompute and then let other parts of handle_stripe operate on the block as if it were up-to-date. raid5_run_ops guarantees that the block will be ready before it is used in another operation. However, this only works in cases where the compute and the dependent operation are scheduled at the same time. If a previous call to handle_stripe sets the R5_Wantcompute flag there is no facility to pass the async_tx dependency chain across successive calls to raid5_run_ops. The req_compute variable protects against this case. Changelog: * remove the req_compute BUG_ON Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:17 -07:00
Dan Williams	91c0092484	md: raid5_run_ops - run stripe operations outside sh->lock When the raid acceleration work was proposed, Neil laid out the following attack plan: 1/ move the xor and copy operations outside spin_lock(&sh->lock) 2/ find/implement an asynchronous offload api The raid5_run_ops routine uses the asynchronous offload api (async_tx) and the stripe_operations member of a stripe_head to carry out xor+copy operations asynchronously, outside the lock. To perform operations outside the lock a new set of state flags is needed to track new requests, in-flight requests, and completed requests. In this new model handle_stripe is tasked with scanning the stripe_head for work, updating the stripe_operations structure, and finally dropping the lock and calling raid5_run_ops for processing. The following flags outline the requests that handle_stripe can make of raid5_run_ops: STRIPE_OP_BIOFILL - copy data into request buffers to satisfy a read request STRIPE_OP_COMPUTE_BLK - generate a missing block in the cache from the other blocks STRIPE_OP_PREXOR - subtract existing data as part of the read-modify-write process STRIPE_OP_BIODRAIN - copy data out of request buffers to satisfy a write request STRIPE_OP_POSTXOR - recalculate parity for new data that has entered the cache STRIPE_OP_CHECK - verify that the parity is correct STRIPE_OP_IO - submit i/o to the member disks (note this was already performed outside the stripe lock, but it made sense to add it as an operation type The flow is: 1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending 2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the operation to the async_tx api 3/ async_tx triggers the completion callback routine to set sh->ops.complete and release the stripe 4/ handle_stripe runs again to finish the operation and optionally submit new operations that were previously blocked Note this patch just defines raid5_run_ops, subsequent commits (one per major operation type) modify handle_stripe to take advantage of this routine. Changelog: * removed ops_complete_biodrain in favor of ops_complete_postxor and ops_complete_write. * removed the raid5_run_ops workqueue * call bi_end_io for reads in ops_complete_biofill, saves a call to handle_stripe * explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown * fix race between async engines and bi_end_io call for reads, Neil Brown * remove unnecessary spin_lock from ops_complete_biofill * remove test_and_set/test_and_clear BUG_ONs, Neil Brown * remove explicit interrupt handling for channel switching, this feature was absorbed (i.e. it is now implicit) by the async_tx api * use return_io in ops_complete_biofill Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:15 -07:00
Dan Williams	a445685647	raid5: refactor handle_stripe5 and handle_stripe6 (v3) handle_stripe5 and handle_stripe6 have very deep logic paths handling the various states of a stripe_head. By introducing the 'stripe_head_state' and 'r6_state' objects, large portions of the logic can be moved to sub-routines. 'struct stripe_head_state' consumes all of the automatic variables that previously stood alone in handle_stripe5,6. 'struct r6_state' contains the handle_stripe6 specific variables like p_failed and q_failed. One of the nice side effects of the 'stripe_head_state' change is that it allows for further reductions in code duplication between raid5 and raid6. The following new routines are shared between raid5 and raid6: handle_completed_write_requests handle_requests_to_failed_array handle_stripe_expansion Changes: * v2: fixed 'conf->raid_disk-1' for the raid6 'handle_stripe_expansion' path * v3: removed the unused 'dirty' field from struct stripe_head_state * v3: coalesced open coded bi_end_io routines into return_io() Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:15 -07:00
Dan Williams	9bc89cd82d	async_tx: add the async_tx api The async_tx api provides methods for describing a chain of asynchronous bulk memory transfers/transforms with support for inter-transactional dependencies. It is implemented as a dmaengine client that smooths over the details of different hardware offload engine implementations. Code that is written to the api can optimize for asynchronous operation and the api will fit the chain of operations to the available offload resources. I imagine that any piece of ADMA hardware would register with the 'async_' subsystem, and a call to async_X would be routed as appropriate, or be run in-line. - Neil Brown async_tx exploits the capabilities of struct dma_async_tx_descriptor to provide an api of the following general format: struct dma_async_tx_descriptor async_<operation>(..., struct dma_async_tx_descriptor depend_tx, dma_async_tx_callback cb_fn, void cb_param) { struct dma_chan chan = async_tx_find_channel(depend_tx, <operation>); struct dma_device device = chan ? chan->device : NULL; int int_en = cb_fn ? 1 : 0; struct dma_async_tx_descriptor tx = device ? device->device_prep_dma_<operation>(chan, len, int_en) : NULL; if (tx) { / run <operation> asynchronously / ... tx->tx_set_dest(addr, tx, index); ... tx->tx_set_src(addr, tx, index); ... async_tx_submit(chan, tx, flags, depend_tx, cb_fn, cb_param); } else { / run <operation> synchronously / ... <operation> ... async_tx_sync_epilog(flags, depend_tx, cb_fn, cb_param); } return tx; } async_tx_find_channel() returns a capable channel from its pool. The channel pool is organized as a per-cpu array of channel pointers. The async_tx_rebalance() routine is tasked with managing these arrays. In the uniprocessor case async_tx_rebalance() tries to spread responsibility evenly over channels of similar capabilities. For example if there are two copy+xor channels, one will handle copy operations and the other will handle xor. In the SMP case async_tx_rebalance() attempts to spread the operations evenly over the cpus, e.g. cpu0 gets copy channel0 and xor channel0 while cpu1 gets copy channel 1 and xor channel 1. When a dependency is specified async_tx_find_channel defaults to keeping the operation on the same channel. A xor->copy->xor chain will stay on one channel if it supports both operation types, otherwise the transaction will transition between a copy and a xor resource. Currently the raid5 implementation in the MD raid456 driver has been converted to the async_tx api. A driver for the offload engines on the Intel Xscale series of I/O processors, iop-adma, is provided in a later commit. With the iop-adma driver and async_tx, raid456 is able to offload copy, xor, and xor-zero-sum operations to hardware engines. On iop342 tiobench showed higher throughput for sequential writes (20 - 30% improvement) and sequential reads to a degraded array (40 - 55% improvement). For the other cases performance was roughly equal, +/- a few percentage points. On a x86-smp platform the performance of the async_tx implementation (in synchronous mode) was also +/- a few percentage points of the original implementation. According to 'top' on iop342 CPU utilization drops from ~50% to ~15% during a 'resync' while the speed according to /proc/mdstat doubles from ~25 MB/s to ~50 MB/s. The tiobench command line used for testing was: tiobench --size 2048 --block 4096 --block 131072 --dir /mnt/raid --numruns 5 iop342 had 1GB of memory available Details: * if CONFIG_DMA_ENGINE=n the asynchronous path is compiled away by making async_tx_find_channel a static inline routine that always returns NULL * when a callback is specified for a given transaction an interrupt will fire at operation completion time and the callback will occur in a tasklet. if the the channel does not support interrupts then a live polling wait will be performed * the api is written as a dmaengine client that requests all available channels * In support of dependencies the api implicitly schedules channel-switch interrupts. The interrupt triggers the cleanup tasklet which causes pending operations to be scheduled on the next channel * Xor engines treat an xor destination address differently than a software xor routine. To the software routine the destination address is an implied source, whereas engines treat it as a write-only destination. This patch modifies the xor_blocks routine to take a an explicit destination address to mirror the hardware. Changelog: * fixed a leftover debug print * don't allow callbacks in async_interrupt_cond * fixed xor_block changes * fixed usage of ASYNC_TX_XOR_DROP_DEST * drop dma mapping methods, suggested by Chris Leech * printk warning fixups from Andrew Morton * don't use inline in C files, Adrian Bunk * select the API when MD is enabled * BUG_ON xor source counts <= 1 * implicitly handle hardware concerns like channel switching and interrupts, Neil Brown * remove the per operation type list, and distribute operation capabilities evenly amongst the available channels * simplify async_tx_find_channel to optimize the fast path * introduce the channel_table_initialized flag to prevent early calls to the api * reorganize the code to mimic crypto * include mm.h as not all archs include it in dma-mapping.h * make the Kconfig options non-user visible, Adrian Bunk * move async_tx under crypto since it is meant as 'core' functionality, and the two may share algorithms in the future * move large inline functions into c files * checkpatch.pl fixes * gpl v2 only correction Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:14 -07:00
Dan Williams	685784aaf3	xor: make 'xor_blocks' a library routine for use with async_tx The async_tx api tries to use a dma engine for an operation, but will fall back to an optimized software routine otherwise. Xor support is implemented using the raid5 xor routines. For organizational purposes this routine is moved to a common area. The following fixes are also made: * rename xor_block => xor_blocks, suggested by Adrian Bunk * ensure that xor.o initializes before md.o in the built-in case * checkpatch.pl fixes * mark calibrate_xor_blocks __init, Adrian Bunk Cc: Adrian Bunk <bunk@stusta.de> Cc: NeilBrown <neilb@suse.de> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2007-07-13 08:06:14 -07:00
Dan Williams	d379b01e90	dmaengine: make clients responsible for managing channels The current implementation assumes that a channel will only be used by one client at a time. In order to enable channel sharing the dmaengine core is changed to a model where clients subscribe to channel-available-events. Instead of tracking how many channels a client wants and how many it has received the core just broadcasts the available channels and lets the clients optionally take a reference. The core learns about the clients' needs at dma_event_callback time. In support of multiple operation types, clients can specify a capability mask to only be notified of channels that satisfy a certain set of capabilities. Changelog: * removed DMA_TX_ARRAY_INIT, no longer needed * dma_client_chan_free -> dma_chan_release: switch to global reference counting only at device unregistration time, before it was also happening at client unregistration time * clients now return dma_state_client to dmaengine (ack, dup, nak) * checkpatch.pl fixes * fixup merge with git-ioat Cc: Chris Leech <christopher.leech@intel.com> Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David S. Miller <davem@davemloft.net>	2007-07-13 08:06:13 -07:00
Dan Williams	7405f74bad	dmaengine: refactor dmaengine around dma_async_tx_descriptor The current dmaengine interface defines mutliple routines per operation, i.e. dma_async_memcpy_buf_to_buf, dma_async_memcpy_buf_to_page etc. Adding more operation types (xor, crc, etc) to this model would result in an unmanageable number of method permutations. Are we really going to add a set of hooks for each DMA engine whizbang feature? - Jeff Garzik The descriptor creation process is refactored using the new common dma_async_tx_descriptor structure. Instead of per driver do_<operation>_<dest>_to_<src> methods, drivers integrate dma_async_tx_descriptor into their private software descriptor and then define a 'prep' routine per operation. The prep routine allocates a descriptor and ensures that the tx_set_src, tx_set_dest, tx_submit routines are valid. Descriptor creation and submission becomes: struct dma_device dev; struct dma_chan chan; struct dma_async_tx_descriptor tx; tx = dev->device_prep_dma_<operation>(chan, len, int_flag) tx->tx_set_src(dma_addr_t, tx, index / for multi-source ops /) tx->tx_set_dest(dma_addr_t, tx, index) tx->tx_submit(tx) In addition to the refactoring, dma_async_tx_descriptor also lays the groundwork for definining cross-channel-operation dependencies, and a callback facility for asynchronous notification of operation completion. Changelog: drop dma mapping methods, suggested by Chris Leech * fix ioat_dma_dependency_added, also caught by Andrew Morton * fix dma_sync_wait, change from Andrew Morton * uninline large functions, change from Andrew Morton * add tx->callback = NULL to dmaengine calls to interoperate with async_tx calls * hookup ioat_tx_submit * convert channel capabilities to a 'cpumask_t like' bitmap * removed DMA_TX_ARRAY_INIT, no longer needed * checkpatch.pl fixes * make set_src, set_dest, and tx_submit descriptor specific methods * fixup git-ioat merge * move group_list and phys to dma_async_tx_descriptor Cc: Jeff Garzik <jeff@garzik.org> Cc: Chris Leech <christopher.leech@intel.com> Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David S. Miller <davem@davemloft.net>	2007-07-13 08:06:11 -07:00
Linus Torvalds	9374430a52	Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/usb-2.6 * master.kernel.org:/pub/scm/linux/kernel/git/gregkh/usb-2.6: (149 commits) USB: ohci-pnx4008: Remove unnecessary cast of return value of kzalloc USB: additions to the quirk list usb-storage: implement autosuspend USB: cdc-acm: add new device id to option driver USB: goku_udc trivial cleanups USB: usb gadget stack can now -DDEBUG with Kconfig usb gadget stack: remove usb_ep__buffer(), part 2 usb gadget stack: remove usb_ep__buffer(), part 1 USB: pxa2xx_udc -- cleanups, mostly removing dma hooks USB: pxa2xx_udc: use generic gpio layer USB: quirk for samsung printer USB: usb/dma doc updates USB: drivers/usb/storage/unusual_devs.h whitespace cleanup USB: remove Makefile reference to obsolete OHCI_AT91 USB: io_*: remove bogus termios no change checks USB: mos7720: remove bogus no termios change check USB: visor and whiteheat: remove bogus termios change checks USB: pl2303: remove bogus checks and fix speed support to use tty_get_baud_rate() USB: mos7840.c: turn this into a serial driver USB: make the usb_device numa_node get assigned from controller ...	2007-07-12 16:46:58 -07:00
Linus Torvalds	0cdf6990e9	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (76 commits) IB: Update MAINTAINERS with Hal's new email address IB/mlx4: Implement query SRQ IB/mlx4: Implement query QP IB/cm: Send no match if a SIDR REQ does not match a listen IB/cm: Fix handling of duplicate SIDR REQs IB/cm: cm_msgs.h should include ib_cm.h IB/cm: Include HCA ACK delay in local ACK timeout IB/cm: Use spin_lock_irq() instead of spin_lock_irqsave() when possible IB/sa: Make sure SA queries use default P_Key IPoIB: Recycle loopback skbs instead of freeing and reallocating IB/mthca: Replace memset(<addr>, 0, PAGE_SIZE) with clear_page(<addr>) IPoIB/cm: Fix warning if IPV6 is not enabled IB/core: Take sizeof the correct pointer when calling kmalloc() IB/ehca: Improve latency by unlocking after triggering the hardware IB/ehca: Notify consumers of LID/PKEY/SM changes after nondisruptive events IB/ehca: Return QP pointer in poll_cq() IB/ehca: Change idr spinlocks into rwlocks IB/ehca: Refactor sync between completions and destroy_cq using atomic_t IB/ehca: Lock renaming, static initializers IB/ehca: Report RDMA atomic attributes in query_qp() ...	2007-07-12 16:45:40 -07:00
David Brownell	c67ab134ba	usb gadget stack: remove usb_ep_*_buffer(), part 2 This patch removes controller driver infrastructure which supported the now-removed usb_ep_{alloc,free}_buffer() calls. As can be seen, many of the implementations of this were broken to various degrees. Many didn't properly return dma-coherent mappings; those which did so were necessarily ugly because of bogosity in the underlying dma_free_coherent() calls ... which on many platforms can't be called from the same contexts (notably in_irq) from which their dma_alloc_coherent() sibling can be called. The main potential downside of removing this is that gadget drivers wouldn't have specific knowledge that the controller drivers have: endpoints that aren't dma-capable don't need any dma mappings at all. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2007-07-12 16:34:42 -07:00
David Brownell	9d8bab58b7	usb gadget stack: remove usb_ep__buffer(), part 1 Remove usb_ep_{alloc,free}_buffer() calls, for small dma-coherent buffers. This patch just removes the interface and its users; later patches will remove controller driver support. - This interface is invariably not implemented correctly in the controller drivers (e.g. using dma pools, a mechanism which post-dates the interface by several years). - At this point no gadget driver really needs* to use it. In current kernels, any driver that needs such a mechanism could allocate a dma pool themselves. Removing this interface is thus a simplification and improvement. Note that the gmidi.c driver had a bug in this area; fixed. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2007-07-12 16:34:42 -07:00
Craig W. Nadler	165fe97ed6	USB: add IAD support to usbfs and sysfs USB_IAD: Adds support for USB Interface Association Descriptors. This patch adds support to the USB host stack for parsing, storing, and displaying Interface Association Descriptors. In /proc/bus/usb/devices lines starting with A: show the fields in an IAD. In sysfs if an interface on a USB device is referenced by an IAD the following files will be added to the sysfs directory for that interface: iad_bFirstInterface, iad_bInterfaceCount, iad_bFunctionClass, and iad_bFunctionSubClass, iad_bFunctionProtocol Signed-off-by: Craig W. Nadler <craig@nadler.us> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2007-07-12 16:34:40 -07:00
Marcel Holtmann	8b3b01c898	USB: Add URB_FREE_BUFFER flag and the logic behind it USB: Add URB_FREE_BUFFER flag for freeing the transfer buffer In some cases it is not needed that the driver keeps track of the transfer buffer of an URB. It can be simply freed along with the URB itself when the reference count goes down to zero. The new flag URB_FREE_BUFFER enables this behavior. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2007-07-12 16:34:38 -07:00

... 3 4 5 6 7 ...

7308 commits