Commit graph

16193 commits

Author SHA1 Message Date
Sukadev Bhattiprolu
edfacdd6f8 devpts_get_tty() should validate inode
devpts_get_tty() assumes that the inode passed in is associated with a valid
pty.  But if the only reference to the pty is via a bind-mount, the inode
passed to devpts_get_tty() while valid, would refer to a pty that no longer
exists.

With a lot of debug effort, Grzegorz Nosek developed a small program (see
below) to reproduce a crash on recent kernels. This crash is a regression
introduced by the commit:

	commit 527b3e4773
	Author: Sukadev Bhattiprolu <sukadev@us.ibm.com>
	Date:   Mon Oct 13 10:43:08 2008 +0100

To fix, ensure that the dentry associated with the inode has not yet been
deleted/unhashed by devpts_pty_kill().

See also:
https://lists.linux-foundation.org/pipermail/containers/2009-July/019273.html 

tty-bug.c:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/signal.h>
#include <unistd.h>
#include <stdio.h>

#include <linux/fs.h>

void dummy(int sig)
{
}

static int child(void *unused)
{
	int fd;

	signal(SIGINT, dummy); signal(SIGHUP, dummy);
	pause(); /* cheesy synchronisation to wait for /dev/pts/0 to appear */

	mount("/dev/pts/0", "/dev/console", NULL, MS_BIND, NULL);
	sleep(2);

	fd = open("/dev/console", O_RDWR);
	dup(0); dup(0);
	write(1, "Hello world!\n", sizeof("Hello world!\n")-1);
	return 0;
}

int main(void)
{
	pid_t pid;
	char *stack;

	stack = malloc(16384);
	pid = clone(child, stack+16384, CLONE_NEWNS|SIGCHLD, NULL);

	open("/dev/ptmx", O_RDWR|O_NOCTTY|O_NONBLOCK);

	unlockpt(fd); grantpt(fd);

	sleep(2);
	kill(pid, SIGHUP);
	sleep(1);
	return 0; /* exit before child opens /dev/console */
}

Reported-by: Grzegorz Nosek <root@localdomain.pl>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Tested-by: Serge Hallyn <serue@us.ibm.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 15:18:05 -08:00
Jason Gunthorpe
44a743f687 xfs: Fix error return for fallocate() on XFS
Noticed that through glibc fallocate would return 28 rather than -1
and errno = 28 for ENOSPC. The xfs routines uses XFS_ERROR format
positive return error codes while the syscalls use negative return
codes.  Fixup the two cases in xfs_vn_fallocate syscall to convert to
negative.

Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-12-11 15:11:23 -06:00
Christoph Hellwig
30ac0683dd xfs: cleanup dmapi macros in the umount path
Stop the flag saving as we never mangle those in the unmount path, and
hide all the weird arguents to the dmapi code inside the
XFS_SEND_PREUNMOUNT / XFS_SEND_UNMOUNT macros.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-12-11 15:11:23 -06:00
Christoph Hellwig
0c3dc2b02a xfs: remove incorrect sparse annotation for xfs_iget_cache_miss
xfs_iget_cache_miss does not get called with the pag_ici_lock held, so
the __releases annotation is incorrect.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-12-11 15:11:23 -06:00
Christoph Hellwig
b8f82a4a6f xfs: kill the STATIC_INLINE macro
Remove our own STATIC_INLINE macro.  For small function inside
implementation files just use STATIC and let gcc inline it, and for
those in headers do the normal static inline - they are all small
enough to be inlined for debug builds, too.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-12-11 15:11:22 -06:00
Christoph Hellwig
5683f53e36 xfs: uninline xfs_get_extsz_hint
This function is too large to efficiently be inlined.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-12-11 15:11:22 -06:00
Christoph Hellwig
e82fa0c7ca xfs: rename xfs_attr_fetch to xfs_attr_get_int
Using a totally different name for the low-level get operation does
not fit the _int convention used in the rest of the attr code, so
rename it.

While we're at it also fix the prototype to use the normal convention
and mark it static as it's never used outside of xfs_attr.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-12-11 15:11:22 -06:00
Christoph Hellwig
6ad112bfb5 xfs: simplify xfs_buf_get / xfs_buf_read interfaces
Currently the low-level buffer cache interfaces are highly confusing
as we have a _flags variant of each that does actually respect the
flags, and one without _flags which has a flags argument that gets
ignored and overriden with a default set.  Given that very few places
use the default arguments get rid of the duplication and convert all
callers to pass the flags explicitly.  Also remove the now confusing
_flags postfix.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-12-11 15:11:21 -06:00
Christoph Hellwig
c355c656fe xfs: remove IO_ISAIO
We set the IO_ISAIO flag for all read/write I/O since early Linux
2.6.x.  Remove it as it has lost it's purpose long ago.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-12-11 15:11:21 -06:00
Andy Poling
fc5bc4c85c xfs: Wrapped journal record corruption on read at recovery
Summary of problem:

If a journal record wraps at the physical end of the journal, it has to be
read in two parts in xlog_do_recovery_pass(): a read at the physical end and a
read at the physical beginning.  If xlog_bread() has to re-align the first
read, the second read request does not take that re-alignment into account.
If the first read was re-aligned, the second read over-writes the end of the
data from the first read, effectively corrupting it.  This can happen either
when reading the record header or reading the record data.

The first sanity check in xlog_recover_process_data() is to check for a valid
clientid, so that is the error reported.

Summary of fix:

If there was a first read at the physical end, XFS_BUF_PTR() returns where the
data was requested to begin.  Conversely, because it is the result of
xlog_align(), offset indicates where the requested data for the first read
actually begins - whether or not xlog_bread() has re-aligned it.

Using offset as the base for the calculation of where to place the second read
data ensures that it will be correctly placed immediately following the data
from the first read instead of sometimes over-writing the end of it.

The attached patch has resolved the reported problem of occasional inability
to recover the journal (reporting "bad clientid").

Signed-off-by: Andy Poling <andy@realbig.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-12-11 15:11:21 -06:00
Christoph Hellwig
5ec4fabb02 xfs: cleanup data end I/O handlers
Currently we have different end I/O handlers for read vs the different
types of write I/O.  But they are all very similar so we could just
use one with a few conditionals and reduce code size a lot.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-12-11 15:11:20 -06:00
Christoph Hellwig
06342cf8ad xfs: use WRITE_SYNC_PLUG for synchronous writeout
The VM and I/O schedulers now expect us to use WRITE_SYNC_PLUG for
synchronous writeout.  Right now I can't see any changes in performance
numbers with this, but we're getting some beating for not using it,
and the knowledge definitely could help the block code to make better
decisions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-12-11 15:11:20 -06:00
Christoph Hellwig
033da48fda xfs: reset the i_iolock lock class in the reclaim path
The iolock is used for protecting reads, writes and block truncates
against each other.  We have two classes of callers, the first one is
induced by a file operation and requires a reference to the inode be
held and not dropped after the operation is done:

 - xfs_vm_vmap, xfs_vn_fallocate, xfs_read, xfs_write, xfs_splice_read,
   xfs_splice_write and xfs_setattr are all implementations of VFS
   methods that require a live inode
 - xfs_getbmap and xfs_swap_extents are ioctl subcommand for which the
   same is true
 - xfs_truncate_file is only called on quota inodes just returned from
   xfs_iget
 - xfs_sync_inode_data does the lock just after an igrab()
 - xfs_filestream_associate and xfs_filestream_new_ag take the iolock
   on the parent inode of an inode which by VFS rules must be referenced

And we have various calls to truncate blocks past EOF or the whole
file when dropping the last reference to an inode.  Unfortunately
lockdep complains when we do memory allocations that can recurse into
the filesystem in the first class because the second class happens to
take the same lock.  To avoid this re-init the iolock in the beginning
of xfs_fs_clear_inode to get a new lock class.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-12-11 15:11:20 -06:00
Christoph Hellwig
80641dc66a xfs: I/O completion handlers must use NOFS allocations
When completing I/O requests we must not allow the memory allocator to
recurse into the filesystem, as we might deadlock on waiting for the
I/O completion otherwise.  The only thing currently allocating normal
GFP_KERNEL memory is the allocation of the transaction structure for
the unwritten extent conversion.  Add a memflags argument to
_xfs_trans_alloc to allow controlling the allocator behaviour.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Thomas Neumann <tneumann@users.sourceforge.net>
Tested-by: Thomas Neumann <tneumann@users.sourceforge.net>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-12-11 15:11:20 -06:00
Christoph Hellwig
c56c9631cb xfs: fix mmap_sem/iolock inversion in xfs_free_eofblocks
When xfs_free_eofblocks is called from ->release the VM might already
hold the mmap_sem, but in the write path we take the iolock before
taking the mmap_sem in the generic write code.

Switch xfs_free_eofblocks to only trylock the iolock if called from
->release and skip trimming the prellocated blocks in that case.
We'll still free them later on the final iput.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-12-11 15:11:19 -06:00
Christoph Hellwig
848ce8f731 xfs: simplify inode teardown
Currently the reclaim code for the case where we don't reclaim the
final reclaim is overly complicated.  We know that the inode is clean
but instead of just directly reclaiming the clean inode we go through
the whole process of marking the inode reclaimable just to directly
reclaim it from the calling context.  Besides being overly complicated
this introduces a race where iget could recycle an inode between
marked reclaimable and actually being reclaimed leading to panics.

This patch gets rid of the existing reclaim path, and replaces it with
a simple call to xfs_ireclaim if the inode was clean.  While we're at
it we also use the slightly more lax xfs_inode_clean check we'd use
later to determine if we need to flush the inode here.

Finally get rid of xfs_reclaim function and place the remaining small
bits of reclaim code directly into xfs_fs_destroy_inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Patrick Schreurs <patrick@news-service.com>
Reported-by: Tommy van Leeuwen <tommy@news-service.com>
Tested-by: Patrick Schreurs <patrick@news-service.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-12-11 15:11:19 -06:00
Linus Torvalds
4e2ccdb040 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (49 commits)
  nilfs2: separate wait function from nilfs_segctor_write
  nilfs2: add iterator for segment buffers
  nilfs2: hide nilfs_write_info struct in segment buffer code
  nilfs2: relocate io status variables to segment buffer
  nilfs2: do not return io error for bio allocation failure
  nilfs2: use list_splice_tail or list_splice_tail_init
  nilfs2: replace mark_inode_dirty as nilfs_mark_inode_dirty
  nilfs2: delete mark_inode_dirty in nilfs_delete_entry
  nilfs2: delete mark_inode_dirty in nilfs_commit_chunk
  nilfs2: change return type of nilfs_commit_chunk
  nilfs2: split nilfs_unlink as nilfs_do_unlink and nilfs_unlink
  nilfs2: delete redundant mark_inode_dirty
  nilfs2: expand inode_inc_link_count and inode_dec_link_count
  nilfs2: delete mark_inode_dirty from nilfs_set_link
  nilfs2: delete mark_inode_dirty in nilfs_new_inode
  nilfs2: add norecovery mount option
  nilfs2: add helper to get if volume is in a valid state
  nilfs2: move recovery completion into load_nilfs function
  nilfs2: apply readahead for recovery on mount
  nilfs2: clean up get/put function of a segment usage
  ...
2009-12-11 11:49:18 -08:00
Eric W. Biederman
e16acb503b sysfs: sysfs_setattr remove unnecessary permission check.
inode_change_ok already clears the SGID bit when necessary
so there is no reason for sysfs_setattr to carry code to do
the same, and it is good to kill the extra copy because when
I moved the code last in certain corner cases the code will
look at the wrong gid.

Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:54 -08:00
Eric W. Biederman
ca1bab3819 sysfs: Factor out sysfs_rename from sysfs_rename_dir and sysfs_move_dir
These two functions do 90% of the same work and it doesn't significantly
obfuscate the function to allow both the parent dir and the name to change
at the same time.  So merge them together to simplify maintenance, and
increase testing.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:54 -08:00
Eric W. Biederman
832b6af198 sysfs: Propagate renames to the vfs on demand
By teaching sysfs_revalidate to hide a dentry for
a sysfs_dirent if the sysfs_dirent has been renamed,
and by teaching sysfs_lookup to return the original
dentry if the sysfs dirent has been renamed.  I can
show the results of renames correctly without having to
update the dcache during the directory rename.

This massively simplifies the rename logic allowing a lot
of weird sysfs special cases to be removed along with
a lot of now unnecesary helper code.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:54 -08:00
Eric W. Biederman
a16bbc3430 sysfs: Gut sysfs_addrm_start and sysfs_addrm_finish
With lazy inode updates and dentry operations bringing everything
into sync on demand there is no longer any need to immediately
update the vfs or grab i_mutex to protect those updates as we
make changes to sysfs.

Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:54 -08:00
Eric W. Biederman
06fc0d66f7 sysfs: In sysfs_chmod_file lazily propagate the mode change.
Now that sysfs_getattr and sysfs_permission refresh the vfs
inode there is no need to immediatly push the mode change
into the vfs cache.  Reducing the amount of work needed and
simplifying the locking.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:54 -08:00
Eric W. Biederman
e61ab4ae48 sysfs: Implement sysfs_getattr & sysfs_permission
With the implementation of sysfs_getattr and sysfs_permission
sysfs becomes able to lazily propogate inode attribute changes
from the sysfs_dirents to the vfs inodes.   This paves the way
for deleting significant chunks of now unnecessary code.

While doing this we did not reference sysfs_setattr from
sysfs_symlink_inode_operations so I added along with
sysfs_getattr and sysfs_permission.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:54 -08:00
Eric W. Biederman
c099aacd48 sysfs: Nicely indent sysfs_symlink_inode_operations
Lining up the functions in sysfs_symlink_inode_operations
follows the pattern in the rest of sysfs and makes things
slightly more readable.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Eric W. Biederman
6b0bfe9383 sysfs: Update s_iattr on link and unlink.
Currently sysfs updates the timestamps on the vfs directory
inode when we create or remove a directory entry but doesn't
update the cached copy on the sysfs_dirent, fix that oversight.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Eric W. Biederman
35df63c46c sysfs: Fix locking and factor out sysfs_sd_setattr
Cleanly separate the work that is specific to setting the
attributes of a sysfs_dirent from what is needed to update
the attributes of a vfs inode.

Additionally grab the sysfs_mutex to keep any nasties from
surprising us when updating the sysfs_dirent.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Eric W. Biederman
4be3df28be sysfs: Simplify iattr time assignments
The granularity of sysfs time when we keep it is 1 ns.  Which
when passed to timestamp_trunc results in a nop.  So remove
the unnecessary function call making sysfs_setattr slightly
easier to read.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Eric W. Biederman
4c6974f51a sysfs: Simplify sysfs_chmod_file semantics
Currently every caller of sysfs_chmod_file happens at either
file creation time to set a non-default mode or in response
to a specific user requested space change in policy.  Making
timestamps of when the chmod happens and notification of
a file changing mode uninteresting.

Remove the unnecessary time stamp and filesystem change
notification, and removes the last of the explicit inotify
and donitfy support from sysfs.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Eric W. Biederman
e8f077c883 sysfs: Use dentry_ops instead of directly playing with the dcache
Calling d_drop unconditionally when a sysfs_dirent is deleted has
the potential to leak mounts, so instead implement dentry delete
and revalidate operations that cause sysfs dentries to be removed
at the appropriate time.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Eric W. Biederman
28a027cfc0 sysfs: Rename sysfs_d_iput to sysfs_dentry_iput
Using dentry instead of d in the function name is what
several other filesystems are doing and it seems to be
a more readable convention.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Eric W. Biederman
f44d3e7857 sysfs: Update sysfs_setxattr so it updates secdata under the sysfs_mutex
The sysfs_mutex is required to ensure updates are and will remain
atomic with respect to other inode iattr updates, that do not happen
through the filesystem.

Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Mathieu Desnoyers
d3a3b0adad debugfs: fix create mutex racy fops and private data
Setting fops and private data outside of the mutex at debugfs file
creation introduces a race where the files can be opened with the wrong
file operations and private data.  It is easy to trigger with a process
waiting on file creation notification.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:53 -08:00
Stefan Richter
f38506c49d sysfs: mark a locally-only used function static
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Acked-by: David P. Quigley <dpquigl@tycho.nsa.gov>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 11:24:51 -08:00
Arnd Bergmann
637e8a60a7 usbdevfs: move compat_ioctl handling to devio.c
Half the compat_ioctl handling is in devio.c, the other
half is in fs/compat_ioctl.c. This moves everything into
one place for consistency.

As a positive side-effect, push down the BKL into the
ioctl methods.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oliver@neukum.org>
Cc: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: David Vrabel <david.vrabel@csr.com>
Cc: linux-usb@vger.kernel.org
2009-12-10 22:55:37 +01:00
Arnd Bergmann
3695669cd4 lp: move compat_ioctl handling into lp.c
Handling for LPSETTIMEOUT can easily be done in lp_ioctl, which
is the only user. As a positive side-effect, push the BKL
into the ioctl methods.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-10 22:55:36 +01:00
Arnd Bergmann
43c6e7b97f compat_ioctl: pass compat pointer directly to handlers
Instead of having each handler call compat_ptr, we can now
convert the pointer once and pass that to each handler.
This saves a little bit of both source and object code size.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2009-12-10 22:52:12 +01:00
Arnd Bergmann
661f627da9 compat_ioctl: simplify lookup table
The compat_ioctl table now only contains entries for
COMPATIBLE_IOCTL, so we only need to know if a number
is listed in it or now.

As an optimization, we hash the table entries with a
reversible transformation to get a more uniform distribution
over it, sort the table at startup and then guess the
position in the table when an ioctl number gets called
to do a linear search from there.

With the current set of ioctl numbers and the chosen
transformation function, we need an average of four
steps to find if a number is in the set, all of the
accesses within one or two cache lines.

This at least as good as the previous hash table
approach but saves 8.5 kb of kernel memory.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2009-12-10 22:52:11 +01:00
Arnd Bergmann
789f0f8911 compat_ioctl: simplify calling of handlers
The compat_ioctl array now contains only entries for ioctl numbers
that do not require a separate handler. By special-casing the
ULONG_IOCTL case in the do_ioctl_trans function, we can kill the
final use of a function pointer in the array.

   text    data     bss     dec     hex filename
   7539   13352    2080   22971    59bb before/fs/compat_ioctl.o
   7910    8552    2080   18542    486e after/fs/compat_ioctl.o

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2009-12-10 22:52:10 +01:00
Arnd Bergmann
5a07ea0b97 compat_ioctl: inline all conversion handlers
This makes all ioctl conversion handlers called from
a single switch statement, leaving only COMPATIBLE_IOCTL
and ULONG_IOCTL statements in the table. This is somewhat
more space efficient and also lets us simplify the
handling of the lookup table significantly.

before:
   text    data     bss     dec     hex filename
   7619   14024    2080   23723    5cab obj/fs/compat_ioctl.o
after:
   7567   13352    2080   22999    59d7 obj/fs/compat_ioctl.o

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2009-12-10 22:52:09 +01:00
Arnd Bergmann
348c4b9078 compat_ioctl: Remove BKL
We have always called ioctl conversion handlers under the big kernel lock,
although that is generally not necessary.  In particular it is not needed
for conversion of data structures and for calling sys_ioctl or
do_vfs_ioctl, which will get the BKL again if needed.

Handlers doing more than those two have been moved out, so we can kill off
the BKL from compat_sys_ioctl.  This may significantly improve latencies
with 32 bit applications, and it avoids a common scenario where a thread
acquires the BKL twice.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2009-12-10 22:52:08 +01:00
Arnd Bergmann
fb07a5f857 compat_ioctl: remove all VT ioctl handling
The VT driver now handles all of these ioctls directly, so we can remove
the handlers from common code.

These are the only handlers that require the BKL because they directly
perform the ioctl action rather than just converting the data structures.
Once they are gone, we can remove the BKL from the remaining ioctl
conversion handlers.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2009-12-10 22:52:08 +01:00
Linus Torvalds
02412f49f6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: always use GFP_NOFS
2009-12-10 09:33:59 -08:00
Linus Torvalds
4515c3069d Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits)
  ext4: Fix potential fiemap deadlock (mmap_sem vs. i_data_sem)
  ext4: Do not override ext2 or ext3 if built they are built as modules
  jbd2: Export jbd2_log_start_commit to fix ext4 build
  ext4: Fix insufficient checks in EXT4_IOC_MOVE_EXT
  ext4: Wait for proper transaction commit on fsync
  ext4: fix incorrect block reservation on quota transfer.
  ext4: quota macros cleanup
  ext4: ext4_get_reserved_space() must return bytes instead of blocks
  ext4: remove blocks from inode prealloc list on failure
  ext4: wait for log to commit when umounting
  ext4: Avoid data / filesystem corruption when write fails to copy data
  ext4: Use ext4 file system driver for ext2/ext3 file system mounts
  ext4: Return the PTR_ERR of the correct pointer in setup_new_group_blocks()
  jbd2: Add ENOMEM checking in and for jbd2_journal_write_metadata_buffer()
  ext4: remove unused parameter wbc from __ext4_journalled_writepage()
  ext4: remove encountered_congestion trace
  ext4: move_extent_per_page() cleanup
  ext4: initialize moved_len before calling ext4_move_extents()
  ext4: Fix double-free of blocks with EXT4_IOC_MOVE_EXT
  ext4: use ext4_data_block_valid() in ext4_free_blocks()
  ...
2009-12-10 09:33:29 -08:00
Linus Torvalds
a5eba3f66f Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  exofs: Multi-device mirror support
  exofs: Move all operations to an io_engine
  exofs: move osd.c to ios.c
  exofs: statfs blocks is sectors not FS blocks
  exofs: Prints on mount and unmout
  exofs: refactor exofs_i_info initialization into common helper
  exofs: dbg-print less
  exofs: More sane debug print
  trivial: some small fixes in exofs documentation
2009-12-10 09:32:24 -08:00
Linus Torvalds
fc1495bf99 Merge git://git.infradead.org/ubifs-2.6
* git://git.infradead.org/ubifs-2.6:
  UBIFS: fix return code in check_leaf
  UBI: flush wl before clearing update marker
  MAINTAINERS: change e-mail of Artem Bityutskiy
  UBIFS: remove manual O_SYNC handling
  UBIFS: support mounting of UBI volume character devices
  UBI: Add ubi_open_volume_path
2009-12-10 09:31:45 -08:00
Trond Myklebust
190f38e5ce NFS: Fix nfs_migrate_page()
The call to migrate_page() will cause the page->private field to be
cleared.
Also fix up the locking around the page->private transfer, so that we ensure
that calls to nfs_page_find_request() don't end up racing.

Finally, fix up a double free bug: nfs_unlock_request() already calls
nfs_release_request() for us...

Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Tested-by: Andi Kleen <andi@firstfloor.org>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-12-10 09:05:55 -05:00
Roel Kluin
8e0eb4011b ext3: PTR_ERR return of wrong pointer in setup_new_group_blocks()
Return the PTR_ERR of the correct pointer.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:55 +01:00
Jan Kara
68eb3db083 ext3: Fix data / filesystem corruption when write fails to copy data
When ext3_write_begin fails after allocating some blocks or
generic_perform_write fails to copy data to write, we truncate blocks already
instantiated beyond i_size. Although these blocks were never inside i_size, we
have to truncate pagecache of these blocks so that corresponding buffers get
unmapped. Otherwise subsequent __block_prepare_write (called because we are
retrying the write) will find the buffers mapped, not call ->get_block, and
thus the page will be backed by already freed blocks leading to filesystem and
data corruption.

Reported-by: James Y Knight <foom@fuhm.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:55 +01:00
Jan Kara
5a20bdfcdc ext4: Support for 64-bit quota format
Add support for new 64-bit quota format. It is enough to add proper
mount options handling. The rest is done by the generic code.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:54 +01:00
Jan Kara
1aeec43432 ext3: Support for vfsv1 quota format
We just have to add proper mount options handling. The rest is handled by
the generic quota code.

CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:54 +01:00
Jan Kara
498c60153e quota: Implement quota format with 64-bit space and inode limits
So far the maximum quota space limit was 4TB. Apparently this isn't enough
for Lustre guys anymore. So implement new quota format which raises block
limits to 2^64 bytes. Also store number of inodes and inode limits in
64-bit variables as 2^32 files isn't that insanely high anymore.

The first version of the patch has been developed by Andrew Perepechko
<Andrew.Perepechko@Sun.COM>.

CC: Andrew.Perepechko@Sun.COM
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:54 +01:00
Jan Kara
3067393005 quota: Move definition of QFMT_OCFS2 to linux/quota.h
Move definition of this constant to linux/quota.h so that it
cannot clash with other format IDs.

CC: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:53 +01:00
Jérémy Cochoy
92e128884b ext2: fix comment in ext2_find_entry about return values
Signed-off-by: Jérémy Cochoy <jeremy.cochoy@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:53 +01:00
Alexey Fisher
4cf46b67eb ext3: Unify log messages in ext3
Make messages produced by ext3 more unified. It should be
easy to parse.

dmesg before patch:
[ 4893.684892] reservations ON
[ 4893.684896] xip option not supported
[ 4893.684964] EXT3-fs warning: maximal mount count reached, running
e2fsck is recommended

dmesg after patch:
[  873.300792] EXT3-fs (loop0): using internal journaln
[  873.300796] EXT3-fs (loop0): mounted filesystem with writeback data mode
[  924.163657] EXT3-fs (loop0): error: can't find ext3 filesystem on dev loop0.
[  723.755642] EXT3-fs (loop0): error: bad blocksize 8192
[  357.874687] EXT3-fs (loop0): error: no journal found. mounting ext3 over ext2?
[  873.300764] EXT3-fs (loop0): warning: maximal mount count reached, running e2fsck is recommended
[  924.163657] EXT3-fs (loop0): error: can't find ext3 filesystem on dev loop0.

Signed-off-by: Alexey Fisher <bug-track@fisher-privat.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:53 +01:00
Stephen Hemminger
2074abfeb8 ext2: clear uptodate flag on super block I/O error
This fixes a WARN backtrace in mark_buffer_dirty() that occurs during
unmount when a USB or floppy device is removed. I reported this a kernel
regression, but looks like it might have been there for longer
than that.

The super block update from a previous operation has marked the buffer
as in error, and the flag has to be cleared before doing the update.
(Similar code already exists in ext4).

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:53 +01:00
Alexey Fisher
2314b07cb4 ext2: Unify log messages in ext2
make messages produced by ext2 more unified. It should be
easy to parse.

dmesg before patch:
[ 4893.684892] reservations ON
[ 4893.684896] xip option not supported
[ 4893.684961] EXT2-fs warning: mounting ext3 filesystem as ext2
[ 4893.684964] EXT2-fs warning: maximal mount count reached, running
e2fsck is recommended
[ 4893.684990] EXT II FS: 0.5b, 95/08/09, bs=1024, fs=1024, gc=2,
bpg=8192, ipg=1280, mo=80010]

dmesg after patch:
[ 4893.684892] EXT2-fs (loop0): reservations ON
[ 4893.684896] EXT2-fs (loop0): xip option not supported
[ 4893.684961] EXT2-fs (loop0): warning: mounting ext3 filesystem as
ext2
[ 4893.684964] EXT2-fs (loop0): warning: maximal mount count reached,
running e2fsck is recommended
[ 4893.684990] EXT2-fs (loop0): 0.5b, 95/08/09, bs=1024, fs=1024, gc=2,
bpg=8192, ipg=1280, mo=80010]

Signed-off-by: Alexey Fisher <bug-track@fisher-privat.net>
Reviewed-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:52 +01:00
Eric Sandeen
dee1d3b627 ext3: make "norecovery" an alias for "noload"
Users on the list recently complained about differences across
filesystems w.r.t. how to mount without a journal replay.

In the discussion it was noted that xfs's "norecovery" option is
perhaps more descriptively accurate than "noload," so let's make
that an alias for ext3.

Also show this status in /proc/mounts

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:52 +01:00
Eric Sandeen
b918397542 ext3: Don't update the superblock in ext3_statfs()
commit a71ce8c6c9 updated ext3_statfs()
to update the on-disk superblock counters, but modified this buffer
directly without any journaling of the change.  This is one of the
accesses that was causing the crc errors in journal replay as seen in
kernel.org bugzilla #14354.

The modifications were originally to keep the sb "more" in sync,
so that a readonly fsck of the device didn't flag this as an
error (as often), but apparently e2fsprogs deals with this differently
now, anyway.

Based on Ted's patch for ext4, which was in turn based on my
work on that bug and another preliminary patch...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:52 +01:00
Eric Sandeen
d965736b8c ext3: journal all modifications in ext3_xattr_set_handle
ext3_xattr_set_handle() was zeroing out an inode outside
of journaling constraints; this is one of the accesses that
was causing the crc errors in journal replay as seen in
kernel.org bugzilla #14354.

Although ext3 doesn't have the crc issue, modifications
out of journal control are a Bad Thing.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:52 +01:00
Jan Kara
c56818d7dc quota: Fix WARN_ON in lookup_one_len
We should hold i_mutex when looking up quota files for journaled quotas,
otherwise a WARN_ON in lookup_one_len triggers. The fact that we didn't
hold i_mutex previously probably could not lead to a real bug since the
filesystem is just being mounted / remounted read-write and thus the
root directory cannot change anyway but it's definitely cleaner with
i_mutex.

Reported-by: Bastien ROUCARIES <roucaries.bastien@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:51 +01:00
Alexey Dobriyan
1472da5fdc const: struct quota_format_ops
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:51 +01:00
Christoph Hellwig
5ced58f735 ubifs: remove manual O_SYNC handling
generic_file_aio_write already calls into ->fsync to handle O_SYNC/O_DSYNC.
Remove the duplicate call to ubifs_sync_wbufs_by_inode which is already
covered by ubifs_fsync.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:51 +01:00
Christoph Hellwig
027cf316af afs: remove manual O_SYNC handling
generic_file_aio_write already calls into ->fsync to handle O_SYNC/O_DSYNC.
Remove the duplicate manual invocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:50 +01:00
Christoph Hellwig
94004ed726 kill wait_on_page_writeback_range
All callers really want the more logical filemap_fdatawait_range interface,
so convert them to use it and merge wait_on_page_writeback_range into
filemap_fdatawait_range.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:50 +01:00
Christoph Hellwig
6b2f3d1f76 vfs: Implement proper O_SYNC semantics
While Linux provided an O_SYNC flag basically since day 1, it took until
Linux 2.4.0-test12pre2 to actually get it implemented for filesystems,
since that day we had generic_osync_around with only minor changes and the
great "For now, when the user asks for O_SYNC, we'll actually give
O_DSYNC" comment.  This patch intends to actually give us real O_SYNC
semantics in addition to the O_DSYNC semantics.  After Jan's O_SYNC
patches which are required before this patch it's actually surprisingly
simple, we just need to figure out when to set the datasync flag to
vfs_fsync_range and when not.

This patch renames the existing O_SYNC flag to O_DSYNC while keeping it's
numerical value to keep binary compatibility, and adds a new real O_SYNC
flag.  To guarantee backwards compatiblity it is defined as expanding to
both the O_DSYNC and the new additional binary flag (__O_SYNC) to make
sure we are backwards-compatible when compiled against the new headers.

This also means that all places that don't care about the differences can
just check O_DSYNC and get the right behaviour for O_SYNC, too - only
places that actuall care need to check __O_SYNC in addition.  Drivers and
network filesystems have been updated in a fail safe way to always do the
full sync magic if O_DSYNC is set.  The few places setting O_SYNC for
lower layers are kept that way for now to stay failsafe.

We enforce that O_DSYNC is set when __O_SYNC is set early in the open path
to make sure we always get these sane options.

Note that parisc really screwed up their headers as they already define a
O_DSYNC that has always been a no-op.  We try to repair it by using it for
the new O_DSYNC and redefinining O_SYNC to send both the traditional
O_SYNC numerical value _and_ the O_DSYNC one.

Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger@sun.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:50 +01:00
Jan Kara
59bc055211 zisofs: Implement reading of compressed files when PAGE_CACHE_SIZE > compress block size
Also split and cleanup zisofs_readpage() when we are changing it anyway.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:49 +01:00
Boaz Harrosh
04dc1e88ad exofs: Multi-device mirror support
This patch changes on-disk format, it is accompanied with a parallel
patch to mkfs.exofs that enables multi-device capabilities.

After this patch, old exofs will refuse to mount a new formatted FS and
new exofs will refuse an old format. This is done by moving the magic
field offset inside the FSCB. A new FSCB *version* field was added. In
the future, exofs will refuse to mount unmatched FSCB version. To
up-grade or down-grade an exofs one must use mkfs.exofs --upgrade option
before mounting.

Introduced, a new object that contains a *device-table*. This object
contains the default *data-map* and a linear array of devices
information, which identifies the devices used in the filesystem. This
object is only written to offline by mkfs.exofs. This is why it is kept
separate from the FSCB, since the later is written to while mounted.

Same partition number, same object number is used on all devices only
the device varies.

* define the new format, then load the device table on mount time make
  sure every thing is supported.

* Change I/O engine to now support Mirror IO, .i.e write same data
  to multiple devices, read from a random device to spread the
  read-load from multiple clients (TODO: stripe read)

Implementation notes:
 A few points introduced in previous patch should be mentioned here:

* Special care was made so absolutlly all operation that have any chance
  of failing are done before any osd-request is executed. This is to
  minimize the need for a data consistency recovery, to only real IO
  errors.

* Each IO state has a kref. It starts at 1, any osd-request executed
  will increment the kref, finally when all are executed the first ref
  is dropped. At IO-done, each request completion decrements the kref,
  the last one to return executes the internal _last_io() routine.
  _last_io() will call the registered io_state_done. On sync mode a
  caller does not supply a done method, indicating a synchronous
  request, the caller is put to sleep and a special io_state_done is
  registered that will awaken the caller. Though also in sync mode all
  operations are executed in parallel.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2009-12-10 09:59:23 +02:00
Boaz Harrosh
06886a5a3d exofs: Move all operations to an io_engine
In anticipation for multi-device operations, we separate osd operations
into an abstract I/O API. Currently only one device is used but later
when adding more devices, we will drive all devices in parallel according
to a "data_map" that describes how data is arranged on multiple devices.
The file system level operates, like before, as if there is one object
(inode-number) and an i_size. The io engine will split this to the same
object-number but on multiple device.

At first we introduce Mirror (raid 1) layout. But at the final outcome
we intend to fully implement the pNFS-Objects data-map, including
raid 0,4,5,6 over mirrored devices, over multiple device-groups. And
more. See: http://tools.ietf.org/html/draft-ietf-nfsv4-pnfs-obj-12

* Define an io_state based API for accessing osd storage devices
  in an abstract way.
  Usage:
	First a caller allocates an io state with:
		exofs_get_io_state(struct exofs_sb_info *sbi,
				   struct exofs_io_state** ios);

	Then calles one of:
		exofs_sbi_create(struct exofs_io_state *ios);
		exofs_sbi_remove(struct exofs_io_state *ios);
		exofs_sbi_write(struct exofs_io_state *ios);
		exofs_sbi_read(struct exofs_io_state *ios);
		exofs_oi_truncate(struct exofs_i_info *oi, u64 new_len);

	And when done
		exofs_put_io_state(struct exofs_io_state *ios);

* Convert all source files to use this new API
* Convert from bio_alloc to bio_kmalloc
* In io engine we make use of the now fixed osd_req_decode_sense

There are no functional changes or on disk additions after this patch.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2009-12-10 09:59:22 +02:00
Boaz Harrosh
8ce9bdd1fb exofs: move osd.c to ios.c
If I do a "git mv" together with a massive code change
and commit in one patch, git looses the rename and
records a delete/new instead. This is bad because I want
a rename recorded so later rebased/cherry-picked patches
to the old name will work. Also the --follow is lost.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2009-12-10 09:59:21 +02:00
Boaz Harrosh
cae012d853 exofs: statfs blocks is sectors not FS blocks
Even though exofs has a 4k block size, statfs blocks
is in sectors (512 bytes).

Also if target returns 0 for capacity then make it
ULLONG_MAX. df does not like zero-size filesystems

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2009-12-10 09:59:21 +02:00
Boaz Harrosh
19fe294f2e exofs: Prints on mount and unmout
It is important to print in the logs when a filesystem was
mounted and eventually unmounted.

Print the osd-device's osd_name and pid the FS was
mounted/unmounted on.

TODO: How to also print the namespace path the filesystem was
      mounted on?

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2009-12-10 09:59:20 +02:00
Boaz Harrosh
9cfdc7aa9f exofs: refactor exofs_i_info initialization into common helper
There are two places that initialize inodes: exofs_iget() and
exofs_new_inode()

As more members of exofs_i_info that need initialization are
added this code will grow. (soon)

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2009-12-10 09:59:19 +02:00
Boaz Harrosh
fe33cc1ee1 exofs: dbg-print less
Iner-loops printing is converted to EXOFS_DBG2 which is #defined
to nothing.

It is now almost bareable to just leave debug-on. Every operation
is printed once, with most relevant info (I hope).

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2009-12-10 09:59:18 +02:00
Boaz Harrosh
58311c43df exofs: More sane debug print
debug prints should be somewhat useful without actually
reading the source code

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2009-12-10 09:59:17 +02:00
Linus Torvalds
4ef58d4e2a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (42 commits)
  tree-wide: fix misspelling of "definition" in comments
  reiserfs: fix misspelling of "journaled"
  doc: Fix a typo in slub.txt.
  inotify: remove superfluous return code check
  hdlc: spelling fix in find_pvc() comment
  doc: fix regulator docs cut-and-pasteism
  mtd: Fix comment in Kconfig
  doc: Fix IRQ chip docs
  tree-wide: fix assorted typos all over the place
  drivers/ata/libata-sff.c: comment spelling fixes
  fix typos/grammos in Documentation/edac.txt
  sysctl: add missing comments
  fs/debugfs/inode.c: fix comment typos
  sgivwfb: Make use of ARRAY_SIZE.
  sky2: fix sky2_link_down copy/paste comment error
  tree-wide: fix typos "couter" -> "counter"
  tree-wide: fix typos "offest" -> "offset"
  fix kerneldoc for set_irq_msi()
  spidev: fix double "of of" in comment
  comment typo fix: sybsystem -> subsystem
  ...
2009-12-09 19:43:33 -08:00
Theodore Ts'o
fab3a549e2 ext4: Fix potential fiemap deadlock (mmap_sem vs. i_data_sem)
Fix the following potential circular locking dependency between
mm->mmap_sem and ei->i_data_sem:

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.32-04115-gec044c5 #37
    -------------------------------------------------------
    ureadahead/1855 is trying to acquire lock:
     (&mm->mmap_sem){++++++}, at: [<ffffffff81107224>] might_fault+0x5c/0xac

    but task is already holding lock:
     (&ei->i_data_sem){++++..}, at: [<ffffffff811be1fd>] ext4_fiemap+0x11b/0x159

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&ei->i_data_sem){++++..}:
           [<ffffffff81099bfa>] __lock_acquire+0xb67/0xd0f
           [<ffffffff81099e7e>] lock_acquire+0xdc/0x102
           [<ffffffff81516633>] down_read+0x51/0x84
           [<ffffffff811a2414>] ext4_get_blocks+0x50/0x2a5
           [<ffffffff811a3453>] ext4_get_block+0xab/0xef
           [<ffffffff81154f39>] do_mpage_readpage+0x198/0x48d
           [<ffffffff81155360>] mpage_readpages+0xd0/0x114
           [<ffffffff811a104b>] ext4_readpages+0x1d/0x1f
           [<ffffffff810f8644>] __do_page_cache_readahead+0x12f/0x1bc
           [<ffffffff810f86f2>] ra_submit+0x21/0x25
           [<ffffffff810f0cfd>] filemap_fault+0x19f/0x32c
           [<ffffffff81107b97>] __do_fault+0x55/0x3a2
           [<ffffffff81109db0>] handle_mm_fault+0x327/0x734
           [<ffffffff8151aaa9>] do_page_fault+0x292/0x2aa
           [<ffffffff81518205>] page_fault+0x25/0x30
           [<ffffffff812a34d8>] clear_user+0x38/0x3c
           [<ffffffff81167e16>] padzero+0x20/0x31
           [<ffffffff81168b47>] load_elf_binary+0x8bc/0x17ed
           [<ffffffff81130e95>] search_binary_handler+0xc2/0x259
           [<ffffffff81166d64>] load_script+0x1b8/0x1cc
           [<ffffffff81130e95>] search_binary_handler+0xc2/0x259
           [<ffffffff8113255f>] do_execve+0x1ce/0x2cf
           [<ffffffff81027494>] sys_execve+0x43/0x5a
           [<ffffffff8102918a>] stub_execve+0x6a/0xc0

    -> #0 (&mm->mmap_sem){++++++}:
           [<ffffffff81099aa4>] __lock_acquire+0xa11/0xd0f
           [<ffffffff81099e7e>] lock_acquire+0xdc/0x102
           [<ffffffff81107251>] might_fault+0x89/0xac
           [<ffffffff81139382>] fiemap_fill_next_extent+0x95/0xda
           [<ffffffff811bcb43>] ext4_ext_fiemap_cb+0x138/0x157
           [<ffffffff811be069>] ext4_ext_walk_space+0x178/0x1f1
           [<ffffffff811be21e>] ext4_fiemap+0x13c/0x159
           [<ffffffff811390e6>] do_vfs_ioctl+0x348/0x4d6
           [<ffffffff811392ca>] sys_ioctl+0x56/0x79
           [<ffffffff81028cb2>] system_call_fastpath+0x16/0x1b

    other info that might help us debug this:

    1 lock held by ureadahead/1855:
     #0:  (&ei->i_data_sem){++++..}, at: [<ffffffff811be1fd>] ext4_fiemap+0x11b/0x159

    stack backtrace:
    Pid: 1855, comm: ureadahead Not tainted 2.6.32-04115-gec044c5 #37
    Call Trace:
     [<ffffffff81098c70>] print_circular_bug+0xa8/0xb7
     [<ffffffff81099aa4>] __lock_acquire+0xa11/0xd0f
     [<ffffffff8102f229>] ? sched_clock+0x9/0xd
     [<ffffffff81099e7e>] lock_acquire+0xdc/0x102
     [<ffffffff81107224>] ? might_fault+0x5c/0xac
     [<ffffffff81107251>] might_fault+0x89/0xac
     [<ffffffff81107224>] ? might_fault+0x5c/0xac
     [<ffffffff81124b44>] ? __kmalloc+0x13b/0x18c
     [<ffffffff81139382>] fiemap_fill_next_extent+0x95/0xda
     [<ffffffff811bcb43>] ext4_ext_fiemap_cb+0x138/0x157
     [<ffffffff811bca0b>] ? ext4_ext_fiemap_cb+0x0/0x157
     [<ffffffff811be069>] ext4_ext_walk_space+0x178/0x1f1
     [<ffffffff811be21e>] ext4_fiemap+0x13c/0x159
     [<ffffffff81107224>] ? might_fault+0x5c/0xac
     [<ffffffff811390e6>] do_vfs_ioctl+0x348/0x4d6
     [<ffffffff8129f6d0>] ? __up_read+0x8d/0x95
     [<ffffffff81517fb5>] ? retint_swapgs+0x13/0x1b
     [<ffffffff811392ca>] sys_ioctl+0x56/0x79
     [<ffffffff81028cb2>] system_call_fastpath+0x16/0x1b

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-09 21:30:02 -05:00
Theodore Ts'o
a214238d3b ext4: Do not override ext2 or ext3 if built they are built as modules
The CONFIG_EXT4_USE_FOR_EXT23 option must not try to take over the
ext2 or ext3 file systems if the those file system drivers are
configured to be built as mdoules.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-09 21:09:58 -05:00
Theodore Ts'o
3b799d15f2 jbd2: Export jbd2_log_start_commit to fix ext4 build
This fixes:
    ERROR: "jbd2_log_start_commit" [fs/ext4/ext4.ko] undefined!

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-09 20:42:53 -05:00
Linus Torvalds
a9280fed38 Merge branch 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing
* 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: (31 commits)
  kill-the-bkl/reiserfs: turn GFP_ATOMIC flag to GFP_NOFS in reiserfs_get_block()
  kill-the-bkl/reiserfs: drop the fs race watchdog from _get_block_create_0()
  kill-the-bkl/reiserfs: definitely drop the bkl from reiserfs_ioctl()
  kill-the-bkl/reiserfs: always lock the ioctl path
  kill-the-bkl/reiserfs: fix reiserfs lock to cpu_add_remove_lock dependency
  kill-the-bkl/reiserfs: Fix induced mm->mmap_sem to sysfs_mutex dependency
  kill-the-bkl/reiserfs: panic in case of lock imbalance
  kill-the-bkl/reiserfs: fix recursive reiserfs write lock in reiserfs_commit_write()
  kill-the-bkl/reiserfs: fix recursive reiserfs lock in reiserfs_mkdir()
  kill-the-bkl/reiserfs: fix "reiserfs lock" / "inode mutex" lock inversion dependency
  kill-the-bkl/reiserfs: move the concurrent tree accesses checks per superblock
  kill-the-bkl/reiserfs: acquire the inode mutex safely
  kill-the-bkl/reiserfs: unlock only when needed in search_by_key
  kill-the-bkl/reiserfs: use mutex_lock in reiserfs_mutex_lock_safe
  kill-the-bkl/reiserfs: factorize the locking in reiserfs_write_end()
  kill-the-bkl/reiserfs: reduce number of contentions in search_by_key()
  kill-the-bkl/reiserfs: don't hold the write recursively in reiserfs_lookup()
  kill-the-bkl/reiserfs: lock only once on reiserfs_get_block()
  kill-the-bkl/reiserfs: conditionaly release the write lock on fs_changed()
  kill-the-BKL/reiserfs: add reiserfs_cond_resched()
  ...
2009-12-09 07:58:15 -08:00
Benjamin Herrenschmidt
bcd6acd51f Merge commit 'origin/master' into next
Conflicts:
	include/linux/kvm.h
2009-12-09 17:14:38 +11:00
Ricardo Labiaga
7cab89b275 nfs41: Invoke RECLAIM_COMPLETE on all new client ids
The NFSv4.1 spec indicates RECLAIM_COMPLETE is to be issued
whenever a client establishes a new client id, not only after
detecting the server has rebooted.

Set the NFS4CLNT_RECLAIM_REBOOT bit after every new client id has
been established.  This enables us to issue RECLAIM_COMPLETE
during the wrap up of the NFS4CLNT_RECLAIM_REBOOT state.

Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-12-08 14:35:28 -05:00
Linus Torvalds
6035ccd8e9 Merge branch 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block: (113 commits)
  cfq-iosched: Do not access cfqq after freeing it
  block: include linux/err.h to use ERR_PTR
  cfq-iosched: use call_rcu() instead of doing grace period stall on queue exit
  blkio: Allow CFQ group IO scheduling even when CFQ is a module
  blkio: Implement dynamic io controlling policy registration
  blkio: Export some symbols from blkio as its user CFQ can be a module
  block: Fix io_context leak after failure of clone with CLONE_IO
  block: Fix io_context leak after clone with CLONE_IO
  cfq-iosched: make nonrot check logic consistent
  io controller: quick fix for blk-cgroup and modular CFQ
  cfq-iosched: move IO controller declerations to a header file
  cfq-iosched: fix compile problem with !CONFIG_CGROUP
  blkio: Documentation
  blkio: Wait on sync-noidle queue even if rq_noidle = 1
  blkio: Implement group_isolation tunable
  blkio: Determine async workload length based on total number of queues
  blkio: Wait for cfq queue to get backlogged if group is empty
  blkio: Propagate cgroup weight updation to cfq groups
  blkio: Drop the reference to queue once the task changes cgroup
  blkio: Provide some isolation between groups
  ...
2009-12-08 08:19:16 -08:00
Linus Torvalds
d7fc02c7ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits)
  mac80211: fix reorder buffer release
  iwmc3200wifi: Enable wimax core through module parameter
  iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter
  iwmc3200wifi: Coex table command does not expect a response
  iwmc3200wifi: Update wiwi priority table
  iwlwifi: driver version track kernel version
  iwlwifi: indicate uCode type when fail dump error/event log
  iwl3945: remove duplicated event logging code
  b43: fix two warnings
  ipw2100: fix rebooting hang with driver loaded
  cfg80211: indent regulatory messages with spaces
  iwmc3200wifi: fix NULL pointer dereference in pmkid update
  mac80211: Fix TX status reporting for injected data frames
  ath9k: enable 2GHz band only if the device supports it
  airo: Fix integer overflow warning
  rt2x00: Fix padding bug on L2PAD devices.
  WE: Fix set events not propagated
  b43legacy: avoid PPC fault during resume
  b43: avoid PPC fault during resume
  tcp: fix a timewait refcnt race
  ...

Fix up conflicts due to sysctl cleanups (dead sysctl_check code and
CTL_UNNUMBERED removed) in
	kernel/sysctl_check.c
	net/ipv4/sysctl_net_ipv4.c
	net/ipv6/addrconf.c
	net/sctp/sysctl.c
2009-12-08 07:55:01 -08:00
Linus Torvalds
1557d33007 Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6: (43 commits)
  security/tomoyo: Remove now unnecessary handling of security_sysctl.
  security/tomoyo: Add a special case to handle accesses through the internal proc mount.
  sysctl: Drop & in front of every proc_handler.
  sysctl: Remove CTL_NONE and CTL_UNNUMBERED
  sysctl: kill dead ctl_handler definitions.
  sysctl: Remove the last of the generic binary sysctl support
  sysctl net: Remove unused binary sysctl code
  sysctl security/tomoyo: Don't look at ctl_name
  sysctl arm: Remove binary sysctl support
  sysctl x86: Remove dead binary sysctl support
  sysctl sh: Remove dead binary sysctl support
  sysctl powerpc: Remove dead binary sysctl support
  sysctl ia64: Remove dead binary sysctl support
  sysctl s390: Remove dead sysctl binary support
  sysctl frv: Remove dead binary sysctl support
  sysctl mips/lasat: Remove dead binary sysctl support
  sysctl drivers: Remove dead binary sysctl support
  sysctl crypto: Remove dead binary sysctl support
  sysctl security/keys: Remove dead binary sysctl support
  sysctl kernel: Remove binary sysctl logic
  ...
2009-12-08 07:38:50 -08:00
Trond Myklebust
88069f77e1 NFSv41: Fix a potential state leakage when restarting nfs4_close_prepare
Currently, if the call to nfs4_setup_sequence() in nfs4_close_prepare
fails, any later retries will fail to launch an RPC call, due to the fact
that the &state->flags will have been cleared.
Ditto if nfs4_close_done() triggers a call to the NFSv4.1 version of
nfs_restart_rpc().

We therefore move the actual clearing of the state->flags to
nfs4_close_done(), when we know that the RPC call was successful.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-12-08 08:33:16 -05:00
Roel Kluin
b38882f5c0 UBIFS: fix return code in check_leaf
Return the PTR_ERR of the correct pointer. This fixes the debugging code.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-12-08 14:24:00 +02:00
Jiri Kosina
d014d04386 Merge branch 'for-next' into for-linus
Conflicts:

	kernel/irq/chip.c
2009-12-07 18:36:35 +01:00
Ricardo Labiaga
74e7bb73a3 nfs41: Handle NFSv4.1 session errors in the delegation recall code
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-12-07 09:48:30 -05:00
Ricardo Labiaga
7970886118 nfs41: Retry delegation return if it failed with session error
Update nfs4_delegreturn_done() to retry the operation after setting the
NFS4CLNT_SESSION_SETUP bit to indicate the need to reset the session.

Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-12-07 09:23:21 -05:00
Ricardo Labiaga
bcfa49f6f9 nfs41: Handle session errors during delegation return
Add session error handling to nfs4_open_delegation_recall()

Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-12-07 09:22:29 -05:00
Ricardo Labiaga
f455848a11 nfs41: Mark stateids in need of reclaim if state manager gets stale clientid
The state manager was not marking the stateids as needing to be reclaimed
after reestablishing the clientid.

Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-12-07 09:16:09 -05:00
Trond Myklebust
0110ee152b NFS: Fix up the declaration of nfs4_restart_rpc when NFSv4 not configured
Also rename it: it is used in generic code, and so should not have a 'nfs4'
prefix.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-12-07 09:00:24 -05:00
Akira Fujita
4a58579b9e ext4: Fix insufficient checks in EXT4_IOC_MOVE_EXT
This patch fixes three problems in the handling of the
EXT4_IOC_MOVE_EXT ioctl:

1. In current EXT4_IOC_MOVE_EXT, there are read access mode checks for
original and donor files, but they allow the illegal write access to
donor file, since donor file is overwritten by original file data.  To
fix this problem, change access mode checks of original (r->r/w) and
donor (r->w) files.

2.  Disallow the use of donor files that have a setuid or setgid bits.

3.  Call mnt_want_write() and mnt_drop_write() before and after
ext4_move_extents() calling to get write access to a mount.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-06 23:38:31 -05:00
Jan Kara
b436b9bef8 ext4: Wait for proper transaction commit on fsync
We cannot rely on buffer dirty bits during fsync because pdflush can come
before fsync is called and clear dirty bits without forcing a transaction
commit. What we do is that we track which transaction has last changed
the inode and which transaction last changed allocation and force it to
disk on fsync.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 23:51:10 -05:00
Dmitry Monakhov
194074acac ext4: fix incorrect block reservation on quota transfer.
Inside ->setattr() call both ATTR_UID and ATTR_GID may be valid
This means that we may end-up with transferring all quotas. Add
we have to reserve QUOTA_DEL_BLOCKS for all quotas, as we do in
case of QUOTA_INIT_BLOCKS.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 22:42:28 -05:00
Dmitry Monakhov
5aca07eb7d ext4: quota macros cleanup
Currently all quota block reservation macros contains hard-coded "2"
aka MAXQUOTAS value. This is no good because in some places it is not
obvious to understand what does this digit represent. Let's introduce
new macro with self descriptive name.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 22:42:15 -05:00
Dmitry Monakhov
8aa6790f87 ext4: ext4_get_reserved_space() must return bytes instead of blocks
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 22:41:52 -05:00
Curt Wohlgemuth
b844167edc ext4: remove blocks from inode prealloc list on failure
This fixes a leak of blocks in an inode prealloc list if device failures
cause ext4_mb_mark_diskspace_used() to fail.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 22:18:25 -05:00
Josef Bacik
d4edac314e ext4: wait for log to commit when umounting
There is a potential race when a transaction is committing right when
the file system is being umounting.  This could reduce in a race
because EXT4_SB(sb)->s_group_info could be freed in ext4_put_super
before the commit code calls a callback so the mballoc code can
release freed blocks in the transaction, resulting in a panic trying
to access the freed s_group_info.

The fix is to wait for the transaction to finish committing before we
shutdown the multiblock allocator.  

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 21:48:58 -05:00
Jan Kara
b9a4207d5e ext4: Avoid data / filesystem corruption when write fails to copy data
When ext4_write_begin fails after allocating some blocks or
generic_perform_write fails to copy data to write, we truncate blocks
already instantiated beyond i_size.  Although these blocks were never
inside i_size, we have to truncate the pagecache of these blocks so
that corresponding buffers get unmapped.  Otherwise subsequent
__block_prepare_write (called because we are retrying the write) will
find the buffers mapped, not call ->get_block, and thus the page will
be backed by already freed blocks leading to filesystem and data
corruption.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 21:24:33 -05:00