Without an extra lock, we cannot just assume that the delegation->inode is
valid when we're traversing the rcu-protected nfs_client lists. Use the
delegation->lock to ensure that it is truly valid.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When we can update_open_stateid(), we need to be certain that we don't
race with a delegation return. While we could do this by grabbing the
nfs_client->cl_lock, a dedicated spin lock in the delegation structure
will scale better.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the admin has specified the "noresvport" option for an NFS mount
point, the kernel's NFS client uses an unprivileged source port for
the main NFS transport. The kernel's lockd client should use an
unprivileged port in this case as well.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the admin has specified the "noresvport" option for an NFS mount
point, the kernel's NFS client uses an unprivileged source port for
the main NFS transport. The kernel's mountd client should use an
unprivileged port in this case as well.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The standard default security setting for NFS is AUTH_SYS. An NFS
client connects to NFS servers via a privileged source port and a
fixed standard destination port (2049). The client sends raw uid and
gid numbers to identify users making NFS requests, and the server
assumes an appropriate authority on the client has vetted these
values because the source port is privileged.
On Linux, by default in-kernel RPC services use a privileged port in
the range between 650 and 1023 to avoid using source ports of well-
known IP services. Using such a small range limits the number of NFS
mount points and the number of unique NFS servers to which a client
can connect concurrently.
An NFS client can use unprivileged source ports to expand the range of
source port numbers, allowing more concurrent server connections and
more NFS mount points. Servers must explicitly allow NFS connections
from unprivileged ports for this to work.
In the past, bumping the value of the sunrpc.max_resvport sysctl on
the client would permit the NFS client to use unprivileged ports.
Bumping this setting also changes the maximum port number used by
other in-kernel RPC services, some of which still required a port
number less than 1023.
This is exacerbated by the way source port numbers are chosen by the
Linux RPC client, which starts at the top of the range and works
downwards. It means that bumping the maximum means all RPC services
requesting a source port will likely get an unprivileged port instead
of a privileged one.
Changing this setting effects all NFS mount points on a client. A
sysadmin could not selectively choose which mount points would use
non-privileged ports and which could not.
Lastly, this mechanism of expanding the limit on the number of NFS
mount points was entirely undocumented.
To address the need for the NFS client to use a large range of source
ports without interfering with the activity of other in-kernel RPC
services, we introduce a new NFS mount option. This option explicitly
tells only the NFS client to use a non-privileged source port when
communicating with the NFS server for one specific mount point.
This new mount option is called "resvport," like the similar NFS mount
option on FreeBSD and Mac OS X. A sister patch for nfs-utils will be
submitted that documents this new option in nfs(5).
The default setting for this new mount option requires the NFS client
to use a privileged port, as before. Explicitly specifying the
"noresvport" mount option allows the NFS client to use an unprivileged
source port for this mount point when connecting to the NFS server
port.
This mount option is supported only for text-based NFS mounts.
[ Sidebar: it is widely known that security mechanisms based on the
use of privileged source ports are ineffective. However, the NFS
client can combine the use of unprivileged ports with the use of
secure authentication mechanisms, such as Kerberos. This allows a
large number of connections and mount points while ensuring a useful
level of security.
Eventually we may change the default setting for this option
depending on the security flavor used for the mount. For example,
if the mount is using only AUTH_SYS, then the default setting will
be "resvport;" if the mount is using a strong security flavor such
as krb5, the default setting will be "noresvport." ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[Trond.Myklebust@netapp.com: Fixed a bug whereby nfs4_init_client()
was being called with incorrect arguments.]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make it possible for the NFSv4 mount set up logic to pass mount option
flags down the stack to nfs_create_rpc_client().
This is immediately useful if we want NFS mount options to modulate
settings of the underlying RPC transport, but it may be useful at some
later point if other parts of the NFSv4 mount initialization logic
want to know what the mount options are.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The nfs_create_rpc_client() function sets up an RPC client for an NFS
mount point. Add an option that allows it to set up an RPC transport
from an unprivileged port.
Instead of having nfs_create_rpc_client()'s callers retain local
knowledge about how to set up an RPC client, create a couple of flag
arguments to control the use of RPC_CLNT_CREATE flags.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: convert nfs_mount() to take a single data structure argument to make
it simpler to add more arguments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: The nfs_mount() function is not to be used outside of the
NFS client. Move its public declaration to fs/nfs/internal.h.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: I'm about to move the declaration of nfs_mount into
fs/nfs/internal.h and include it in fs/nfs/nfsroot.c. There's a
conflicting definition of nfs_path in fs/nfs/internal.h and
fs/nfs/nfsroot.c, so rename the private one.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
aops->readpages() and its NFS helper readpage_async_filler() will only
be called to do readahead I/O for newly allocated pages. So it's not
necessary to test for the always 0 dirty/uptodate page flags.
The removal of nfs_wb_page() call also fixes a readahead bug: the NFS
readahead has been synchronous since 2.6.23, because that call will
clear PG_readahead, which is the reminder for asynchronous readahead.
More background: the PG_readahead page flag is shared with PG_reclaim,
one for read path and the other for write path. clear_page_dirty_for_io()
unconditionally clears PG_readahead to prevent possible readahead residuals,
assuming itself to be always called in the write path. However, NFS is one
and the only exception in that it _always_ calls clear_page_dirty_for_io()
in the read path, i.e. for readpages()/readpage().
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <wfg@linux.intel.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The iscsi_ibft.c changes are almost certainly a bugfix as the
pointer 'ip' is a u8 *, so they never print the last 8 bytes
of the IPv6 address, and the eight bytes they do print have
a zero byte with them in each 16-bit word.
Other than that, this should cause no difference in functionality.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The most important property we need from nfs_attr_generation_counter is
monotonicity, which is not guaranteed by the current system of smp memory
barriers. We should convert it to an atomic_long_t, and drop the memory
barriers.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For execute permission on a regular files we need to check if file has
any execute bits at all, regardless of capabilites.
This check is normally performed by generic_permission() but was also
added to the case when the filesystem defines its own ->permission()
method. In the latter case the filesystem should be responsible for
performing this check.
Move the check from inode_permission() inside filesystems which are
not calling generic_permission().
Create a helper function execute_ok() that returns true if the inode
is a directory or if any execute bits are present in i_mode.
Also fix up the following code:
- coda control file is never executable
- sysctl files are never executable
- hfs_permission seems broken on MAY_EXEC, remove
- hfsplus_permission is eqivalent to generic_permission(), remove
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
New flag: LOOKUP_EXCL. Set before doing the final step of pathname
resolution on the paths that have LOOKUP_CREATE and O_EXCL.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: use correct fs type for v4 submounts and referrals
Make nfs_file_cred more robust.
NFS: Enable NFSv4 callback server to listen on AF_INET6 sockets
Split the LRU lists in two, one set for pages that are backed by real file
systems ("file") and one for pages that are backed by memory and swap
("anon"). The latter includes tmpfs.
The advantage of doing this is that the VM will not have to scan over lots
of anonymous pages (which we generally do not want to swap out), just to
find the page cache pages that it should evict.
This patch has the infrastructure and a basic policy to balance how much
we scan the anon lists and how much we scan the file lists. The big
policy changes are in separate patches.
[lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
[kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
[kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
[hugh@veritas.com: memcg swapbacked pages active]
[hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
[akpm@linux-foundation.org: fix /proc/vmstat units]
[nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
[kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
[kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As not all files have an associated open_context (e.g. device special
files), it is safest to test for the existence of the open context
before de-referencing it.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Allow the NFS callback server to listen for requests via an AF_INET6 or
AF_INET socket when IPv6 support is present in the kernel.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The cache_change_attribute is used to decide whether or not a directory has
changed, in which case we may need to look it up again. Again, the use of
'jiffies' leads to an issue of resolution.
Once again, the fix is to change nfs_inode->cache_change_attribute, and
just make it a simple counter.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It appears that 'jiffies' timestamps do not have high enough resolution for
nfs_inode_attrs_need_update(). One problem is that a GETATTR can be
launched within < 1 jiffy of the last operation that updated the attribute.
Another problem is that RPC calls can take < 1 jiffy to execute.
We can fix this by switching the variables to use a simple global counter
that gets incremented every time we start another GETATTR call.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-2.6.28' of git://linux-nfs.org/~bfields/linux: (59 commits)
svcrdma: Fix IRD/ORD polarity
svcrdma: Update svc_rdma_send_error to use DMA LKEY
svcrdma: Modify the RPC reply path to use FRMR when available
svcrdma: Modify the RPC recv path to use FRMR when available
svcrdma: Add support to svc_rdma_send to handle chained WR
svcrdma: Modify post recv path to use local dma key
svcrdma: Add a service to register a Fast Reg MR with the device
svcrdma: Query device for Fast Reg support during connection setup
svcrdma: Add FRMR get/put services
NLM: Remove unused argument from svc_addsock() function
NLM: Remove "proto" argument from lockd_up()
NLM: Always start both UDP and TCP listeners
lockd: Remove unused fields in the nlm_reboot structure
lockd: Add helper to sanity check incoming NOTIFY requests
lockd: change nlmclnt_grant() to take a "struct sockaddr *"
lockd: Adjust nlmsvc_lookup_host() to accomodate AF_INET6 addresses
lockd: Adjust nlmclnt_lookup_host() signature to accomodate non-AF_INET
lockd: Support non-AF_INET addresses in nlm_lookup_host()
NLM: Convert nlm_lookup_host() to use a single argument
svcrdma: Add Fast Reg MR Data Types
...
This is a much better version of a previous patch to make the parser
tables constant. Rather than changing the typedef, we put the "const" in
all the various places where its required, allowing the __initconst
exception for nfsroot which was the cause of the previous trouble.
This was posted for review some time ago and I believe its been in -mm
since then.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Alexander Viro <aviro@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bruce observed that nfs_parse_ip_address() will successfully parse an
IPv6 address that looks like this:
"::1%"
A scope delimiter is present, but there is no scope ID following it.
This is harmless, as it would simply set the scope ID to zero. However,
in some cases we would like to flag this as an improperly formed
address.
We are now also careful to reject addresses where garbage follows the
address (up to the length of the string), instead of ignoring the
non-address characters; and where the scope ID is nonsense (not a valid
device name, but also not numeric). Before, both of these cases would
result in a harmless zero scope ID.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This fixes a regression seen when running the Connectathon testsuite
against an ext3 filesystem. The reason was that the inode was constantly
being marked as 'just updated' by the jiffy wraparound test.
This again meant that newer GETATTR calls were failing to pass the
nfs_inode_attrs_need_update() test unless the changes caused a ctime update
on the server, since they were perceived as having been started before the
latest inode update.
Given that nfs_inode_attrs_need_update() already checks for wraparound
of nfsi->last_updated, we can drop the buggy "protection" in
nfs_update_inode().
Also make a slight micro-optimisation of nfs_inode_attrs_need_update(): we
are more often going to see time_after(fattr->time_start, nfsi->last_updated)
be true, rather than seeing an update of ctime/size, so put that test
first to ensure that we optimise away the ctime/size tests.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch fixes a regression that was introduced by the string based mounts.
nfs_mount() statically returns -EACCES for every error returned
by the remote mounted. This is incorrect because -EACCES is
an non-fatal error to the mount.nfs command. This error causes
mount.nfs to retry the mount even in the case when the exported
directory does not exist.
This patch maps the errors returned by the remote mountd into
valid errno values, exactly how it was done pre-string based
mounts. By returning the correct errno enables mount.nfs
to do the right thing.
Signed-off-by: Steve Dickson <steved@redhat.com>
[Trond.Myklebust@netapp.com: nfs_stat_to_errno() now correctly returns
negative errors, so remove the sign change.]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The code incorrectly assumes here that the server name (or ip address)
is null-terminated. This can cause referrals to fail in some cases.
Also support ipv6 addresses.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We plan to use this function elsewhere.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Whoever wrote this had a bizarre allergy to for loops.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This function is a little longer and more deeply nested than necessary.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Allow mount to do authenticated mounts below the root of the exported tree.
The wording in RFC 2623, sec 2.3.2. allows fsinfo with UNIX authentication
on the root of the export. Mounts are not always done on the root
of the exported tree. Especially autoumounts often mount below the root of
the exported tree.
Some server implementations (justly) require full authentication for the
so-called deep mounts. The old code used AUTH_SYS only. This caused deep
mounts to fail on systems requiring stronger authentication..
The client should try both authentication types and use the first one that
succeeds.
This method was already partially implemented. This patch completes
the implementation for NFS2 and NFS3.
This patch was developed to allow Debian systems to automount home directories
on Solaris servers with krb5 authentication.
Tested on kernel 2.6.24-etchnhalf.1
Signed-off-by: E.G. Keizer <keie@few.vu.nl>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The fattrs used in the NFSv3 getacl/setacl calls are not being properly
initialized. This occasionally causes nfs_update_inode to fall into
NFSv4 specific codepaths when handling post-op attrs from these calls.
Thanks to Cai Qian for noticing the spurious NFSv4 messages in debug
output from a v3 mount...
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We *do* now allow bsd flocks over nfs.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Unfortunately, BUG_ON(IS_ROOT(dentry)) can happen inside
nfs_follow_mountpoint with NFS running Fedora 8 using a
specific setup.
https://bugzilla.redhat.com/show_bug.cgi?id=458622
So, the situation should be handled on NFS client gracefully.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
CC: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, if two processes are both trying to revalidate metadata for the
same inode, they will find themselves being serialised. There is no good
justification for this now that we have improved our ability to detect
stale attribute data, so we should remove that serialisation.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we're merely checking the inode attributes because we suspect that the
'updated' attributes returned by the RPC call are stale, then we shouldn't
be doing weak cache consistency updates or clearing the cache_validity
flags.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In the case where there are parallel RPC calls to the same inode, we may
receive stale metadata due to the lack of ordering, hence the sanity
checking of metadata in nfs_refresh_inode().
Currently, __nfs_revalidate_inode() is calling nfs_update_inode() directly,
without any further sanity checks, and hence may end up setting the inode
up with stale metadata.
Fix is to use nfs_refresh_inode() instead of nfs_update_inode().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we believe that the attributes are old (see nfs_refresh_inode()), then
we shouldn't force an update.
Also ensure that we hold the inode->i_lock across attribute checks and the
call to nfs_refresh_inode_locked() to ensure that we don't race with other
attribute updates.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently nfs_refresh_inode() will only update the inode metadata if it
sees that the RPC call that returned the nfs_fattr was started
after the last update of the inode. This means that if we have parallel
RPC calls to the same inode (when sending WRITE calls, for instance), we
may often miss updates.
This patch attempts to recover those missed updates by also accepting
them if the ctime in the nfs_fattr is more recent than the inode's
cached ctime.
It also recovers the case where the file size has increased, but the
ctime has not been updated due to limited ctime resolution.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Try to avoid taking and dropping the inode->i_lock more than once. Do so by
moving the code in nfs_refresh_inode() that needs to be done under the
spinlock into a function nfs_refresh_inode_locked(), and then having both
nfs_refresh_inode() and nfs_post_op_update_inode() call it directly.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The point of introducing text-based mounts was to allow us to add
functionality without having to worry about legacy binary mount formats.
The mask should be there in order to ensure that binary formats don't start
enabling features that they cannot support. There is no justification for
applying it to the text mount path.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add the flag NFS_MOUNT_LOOKUP_CACHE_NONEG to turn off the caching of
negative dentries. In reality what we do is to force
nfs_lookup_revalidate() to always discard negative dentries.
Add the flag NFS_MOUNT_LOOKUP_CACHE_NONE for enforcing stricter
revalidation of dentries. It forces the revalidate code to always do a
lookup instead of just checking the cached mtime of the parent directory.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Instead of causing umount requests to block on server->active_wq while the
asynchronous sillyrename deletes are executing, we can use the sb->s_active
counter to obtain a reference to the super_block, and then release that
reference in nfs_async_unlink_release().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
After the BKL removal patches were applied to the rest of the NFS code, the
BKL protection in nfs_file_llseek() is no longer sufficient to ensure that
inode->i_size is read safely in generic_file_llseek_unlocked().
In order to fix the situation, we either have to replace the naked read of
inode->i_size in generic_file_llseek_unlocked() with i_size_read(), or the
whole thing needs to be executed under the inode->i_lock;
In order to avoid disrupting other filesystems, avoid touching
generic_file_llseek_unlocked() for now...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Introduce and initialize an address family field in the svc_serv structure.
This field will determine what family to use for the service's listener
sockets and what families are advertised via the local rpcbind daemon.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Automounter maps can contain mount options valid for other NFS
implementations but not for Linux. The Linux automounter uses the
mount command's "-s" command line option ("s" for "sloppy") so that
mount requests containing such options are not rejected.
Commit f45663ce5f attempted to address a
known regression with text-based NFS mount option parsing. Unrecognized
mount options would cause mount requests to fail, even if the "-s"
option was used on the mount command line.
Unfortunately, this commit was not complete as submitted. It adds a
new mount option, "sloppy". But it is missing a hunk, so it now allows
NFS mounts with unrecognized mount options, even if the "sloppy" option
is not present. This could be a problem if a required critical mount
option such as "sync" is misspelled, for example, and is considered a
regression from 2.6.26.
This patch restores the missing hunk. Now, the default behavior of
text-based NFS mount options is as before: any unrecognized mount option
will cause the mount to fail.
Please include this in 2.6.27-rc.
Thanks to Neil Brown for reporting this.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit f9247273cb (and
fb2e405fc1 - "fix fs/nfs/nfsroot.c
compilation" - that fixed a missed conversion).
The changes cause problems for at least the sparc build. Let's re-do
them when the exact issues are resolved.
Requested-by: Andrew Morton <akpm@linux-foundation.org>
Requested-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to avoid the "Busy inodes after unmount" error message, we need to
ensure that nfs_async_unlink_release() releases the super block after the
call to nfs_free_unlinkdata().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* kill nameidata * argument; map the 3 bits in ->flags anybody cares
about to new MAY_... ones and pass with the mask.
* kill redundant gfs2_iop_permission()
* sanitize ecryptfs_permission()
* fix remaining places where ->permission() instances might barf on new
MAY_... found in mask.
The obvious next target in that direction is permission(9)
folded fix for nfs_permission() breakage from Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Kmem cache passed to constructor is only needed for constructors that are
themselves multiplexeres. Nobody uses this "feature", nor does anybody uses
passed kmem cache in non-trivial way, so pass only pointer to object.
Non-trivial places are:
arch/powerpc/mm/init_64.c
arch/powerpc/mm/hugetlbpage.c
This is flag day, yes.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Matt Mackall <mpm@selenic.com>
[akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
[akpm@linux-foundation.org: fix mm/slab.c]
[akpm@linux-foundation.org: fix ubifs]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes the following compile error caused by commit
f9247273cb ("UFS: add const to parser
token table"):
CC fs/nfs/nfsroot.o
/home/bunk/linux/kernel-2.6/git/linux-2.6/fs/nfs/nfsroot.c:130: error: tokens causes a section type conflict
make[3]: *** [fs/nfs/nfsroot.o] Error 1
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Page accesses are serialised using the page locks, whereas all attribute
updates are serialised using the inode->i_lock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Page cache accesses are serialised using page locks, whereas attribute
updates are serialised using inode->i_lock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Attribute updates are safe, and dentry operations are protected using VFS
level locks. Defer removing the BKL from sillyrename until a separate
patch.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
All dentry-related operations are already BKL-safe, since they are
protected by the VFS locking. No extra locks should be needed in the NFS
code.
In the case of nfs_revalidate_inode(), we're only doing an attribute
update (protected by the inode->i_lock).
In the case of nfs_lookup(), we're instantiating a new dentry, so there
should be no contention possible until after we call d_materialise_unique.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_instantiate() does not require the BKL, neither do the attribute
updates or the RPC code.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
All the NFSv4 stateful operations are already protected by other locks (in
particular by the rpc_sequence locks.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The main problem is dealing with inode->i_size: we need to set the
inode->i_lock on all attribute updates, and so vmtruncate won't cut it.
Make an NFS-private version of vmtruncate that has the necessary locking
semantics.
The result should be that the following inode attribute updates are
protected by inode->i_lock
nfsi->cache_validity
nfsi->read_cache_jiffies
nfsi->attrtimeo
nfsi->attrtimeo_timestamp
nfsi->change_attr
nfsi->last_updated
nfsi->cache_change_attribute
nfsi->access_cache
nfsi->access_cache_entry_lru
nfsi->access_cache_inode_lru
nfsi->acl_access
nfsi->acl_default
nfsi->nfs_page_tree
nfsi->ncommit
nfsi->npages
nfsi->open_files
nfsi->silly_list
nfsi->acl
nfsi->open_states
inode->i_size
inode->i_atime
inode->i_mtime
inode->i_ctime
inode->i_nlink
inode->i_uid
inode->i_gid
The following is protected by dir->i_mutex
nfsi->cookieverf
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The kernel's NFS client mount option parser currently doesn't allow
unrecognized or incorrect mount options. This prevents misspellings or
incorrectly specified mount options from possibly causing silent data
corruption.
However, NFS mount options are not standardized, so different operating
systems can use differently spelled mount options to support similar
features, or can support mount options which no other operating system
supports.
"Sloppy" mount option parsing, which allows the parser to ignore any
option it doesn't recognize, is needed to support automounters that often
use maps that are shared between heterogenous operating systems.
The legacy mount command ignores the validity of the values of mount
options entirely, except for the "sec=" and "proto=" options. If an
incorrect value is specified, the out-of-range value is passed to the
kernel; if a value is specified that contains non-numeric characters,
it appears as though the legacy mount command sets that option to zero
(probably incorrect behavior in general).
In any case, this sets a precedent which we will partially follow for
the kernel mount option parser:
+ if "sloppy" is not set, the parser will be strict about both
unrecognized options (same as legacy) and invalid option
values (stricter than legacy)
+ if "sloppy" is set, the parser will ignore unrecognized
options and invalid option values (same as legacy)
An "invalid" option value in this case means that either the type
(integer, short, or string) or sign (for integer values) of the specified
value is incorrect.
This patch does two things: it changes the NFS client's mount option
parsing loop so that it parses the whole string instead of failing at
the first unrecognized option or invalid option value. An unrecognized
option or an invalid option value cause the option to be skipped.
Then, the patch adds a "sloppy" mount option that allows the parsing
to succeed anyway if there were any problems during parsing. When
parsing a set of options is complete, if there are errors and "sloppy"
was specified, return success anyway. Otherwise, only return success
if there are no errors.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Set the default security flavor when we set the other mount option
default values for NFSv4. This cleans up the NFSv4 mount option parsing
path to look like the NFSv2/v3 one.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Set the default security flavor when we set the other mount option default
values. After this change, only the legacy user-space mount path needs to
set the NFS_MOUNT_SECFLAVOUR flag.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Refactor the NFS mount option parsing function to extract the
security flavor parsing logic into a separate function.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The remount path does not need to set the port in the server address.
Since it's not really a part of option parsing, move the nfs_set_port()
call to nfs_parse_mount_options()'s callers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move the UDP/TCP default timeo/retrans settings for text mounts to
nfs_init_timeout_values(), which was were they were always being
initialised (and sanity checked) for binary mounts.
Document the default timeout values using appropriate #defines.
Ensure that we initialise and sanity check the transport protocols that
may have been specified by the user.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the nfsv4 callback thread is rapidly brought up and down, it's
possible that nfs_callback_svc might never get a chance to run. If
this happens, the cleanup at thread exit might never occur, throwing
the refcounting off and nfs_callback_info in an incorrect state.
Move the clean functions into nfs_callback_down. Also change the
nfs_callback_info struct to track the svc_rqst rather than svc_serv
since we need to know that to call svc_exit_thread.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The nfs_callback_mutex is sufficient protection.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
gcc (4.3.0) rightfully warns about this:
/usr0/export/dev/bhalevy/git/linux-pnfs-bh-nfs41/fs/nfs/nfs4proc.c: In function nfs4_proc_setclientid_confirm:
/usr0/export/dev/bhalevy/git/linux-pnfs-bh-nfs41/fs/nfs/nfs4proc.c:2936: warning: timeout may be used uninitialized in this function
nfs4_delay that's passed a pointer to 'timeout' is looking at its value
and sets it up to some value in the range: NFS4_POLL_RETRY_MIN..NFS4_POLL_RETRY_MAX
if (*timeout <= 0)
*timeout = NFS4_POLL_RETRY_MIN;
if (*timeout > NFS4_POLL_RETRY_MAX)
*timeout = NFS4_POLL_RETRY_MAX;
Therefore it will end up set to some sane, though rather indeterministic, value.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add support in the kernel NFS client's address parser for interface
identifiers.
IPv6 link-local addresses require an additional "interface identifier",
which is a network device name or an integer that indexes the array of
local network interfaces. They are suffixed to the address with a '%'.
For example:
fe80::215:c5ff:fe3b:e1b2%2
indicates an interface index of 2. Or
fe80::215:c5ff:fe3b:e1b2%eth0
indicates that requests should be routed through the eth0 device.
Without the interface ID, link-local addresses are not usable for NFS.
Both the kernel NFS client mount option parser and the mount.nfs command
can take either form. The mount.nfs command always passes the address
through getnameinfo(3), which usually re-writes interface indices as
device names.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
To make nfs_parse_server_address() more generally useful, allow it to
accept input strings that are not terminated with '\0'.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Traditionally the mount command has looked for a ":" to separate the
server's hostname from the export path in the mounted on device name,
like this:
mount server:/export /mounted/on/dir
The server's hostname is "server" and the export path is "/export".
You can also substitute a specific IPv4 network address for the server
hostname, like this:
mount 192.168.0.55:/export /mounted/on/dir
Raw IPv6 addresses present a problem, however, because they look
something like this:
fe80::200:5aff:fe00:30b
Note the use of colons.
To get around the presence of colons, copy the Solaris convention used for
mounting IPv6 servers by address: wrap a raw IPv6 address with square
brackets.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
To support passing a raw IPv6 address as a server hostname, we need to
expand the logic that handles splitting the passed-in device name into
a server hostname and export path
Start by pulling device name parsing out of the mount option validation
functions and into separate helper functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix the 'nfs4_fs_type' undeclared error in nfs_remount when compiling sans
NFSv4...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Jeff Layton <jlayton@redhat.com>
Currently, if an unstable write completes, we cannot redirty the page in
order to reflect a new change in the page data until after we've sent a
COMMIT request.
This patch allows a page rewrite to proceed without the unnecessary COMMIT
step, putting it immediately back onto the dirty page list, undoing the
VM unstable write accounting, and removing the NFS_PAGE_TAG_COMMIT tag from
the NFS radix tree.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Simplify the loop in nfs_update_request by moving into a separate function
the code that attempts to update an existing cached NFS write.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: the "intr" and "nointr" mount options were recently retired.
Document this in the NFS mount option parser.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The kernel NFS mount option parser should ignore the retry= mount option
since it is meaningful only in user space. Today it expects a number
rather than arbitrary text, so it ignores the option if the value is
numeric, but chokes if there are other characters in the value.
Change it to allow any text (except ",") as its value.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We're not modifying the nfs_server when we call nfs_inc_server_stats and
friends, so allow the compiler to pass 'const' pointers too.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The fs/nfs/iostat.h header has definitions that were designed to be exposed
to user space. Move these definitions under include/linux so user space can
use the definitions in applications that read /proc/self/mountstats.
Also address a handful of coding style issues called out by checkpatch.pl in
fs/nfs/iostat.h.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
All instances are set to nfs_open(), so we should just remove the redundant
indirection. Ditto for the file_release op
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
ftruncate() access checking is supposed to be performed at open() time,
just like reads and writes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When remounting an NFS or NFS4 filesystem, the new NFS options are not
respected, yet the remount will still return success. This patch adds
a remount_fs sb op for NFS that checks any new nfs mount options against
the existing ones and fails the mount if any have changed.
This is only implemented for string-based mount options since doing
this with binary options isn't really feasible.
This is essentially the same as the original patch I sent out, but
adds a check to see if the addr= option has changed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch removes a CVS keyword that wasn't updated for a long time
from a comment.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: fix a few dprintk messages that still need to show the RPC task ID
correctly, and be sure we use the preferred %lld or %llu instead of %Ld or
%Lu.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: some fops use NFSDBG_FILE, some use NFSDBG_VFS. Let's use
NFSDBG_FILE for all fops, and consistently report file names instead
of inode numbers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Recent work in fs/nfs/file.c neglected to add appropriate trace debugging
for the NFS client's address space operations.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Report the same debugging info and count function calls the
same for files and directories in nfs_opendir() and nfs_file_open().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Report the same debugging info in nfs_llseek_dir() and
nfs_llseek_file().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Report the same debugging info, count function calls the same,
and use similar function naming in nfs_fsync_dir() and nfs_fsync().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Revert commit 44dd151d "NFS: Don't mark a written page as uptodate until it
is on disk". While it is true that the write may fail, that is always the
case. There is no reason why we should treat data on pages that are not
already marked as PG_uptodate as being special. The only thing we gain is a
noticeable slowdown when re-reading these pages.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If a file is being extended, and we're creating a hole, we might as well
declare the entire page to be up to date.
This patch significantly improves the write performance for sparse files
in the case where lseek(SEEK_END) is used to append several non-contiguous
writes at intervals of < PAGE_SIZE.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFSv2 file locking currently fails the Connectathon tests, because the
calls to the VFS locking code do not return an EINVAL error if the
struct file_lock overflows the 32-bit boundaries.
The problem is due to the fact that we occasionally call helpers from
fs/locks.c in order to avoid RPC calls to the server when we know that a
local process holds the lock. These helpers are, of course, always
64-bit enabled, so EINVAL is not returned in cases when it would if
the call had gone to the NLM code.
For consistency, we therefore add support for a bounds-checking helper.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The commit 2785259631 (nfs: use GFP_NOFS
preloads for radix-tree insertion) appears to have introduced a bug:
We only want to call radix_tree_preload() once after creating a request.
Calling it every time we loop after we created the request, will cause
preemption count leaks.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Nick Piggin <npiggin@suse.de>
invalidate_inode_pages2_range() takes page offset arguments, not byte
ranges.
Another thought is that individual pages might perhaps get evicted by VM
pressure, in which case we might perhaps want to re-read not only the
evicted page, but all subsequent pages too (in case the server returns
more/less data per page so that the alignment of the next entry
changes). We should therefore remove the condition that we only do this on
page->index==0.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
- Replace remote_llseek with generic_file_llseek_unlocked (to force compilation
failures in all users)
- Change all users to either use generic_file_llseek_unlocked directly or
take the BKL around. I changed the file systems who don't use the BKL
for anything (CIFS, GFS) to call it directly. NCPFS and SMBFS and NFS
take the BKL, but explicitely in their own source now.
I moved them all over in a single patch to avoid unbisectable sections.
Open problem: 32bit kernels can corrupt fpos because its modification
is not atomic, but they can do that anyways because there's other paths who
modify it without BKL.
Do we need a special lock for the pos/f_version = 0 checks?
Trond says the NFS BKL is likely not needed, but keep it for now
until his full audit.
v2: Use generic_file_llseek_unlocked instead of remote_llseek_unlocked
and factor duplicated code (suggested by hch)
Cc: Trond.Myklebust@netapp.com
Cc: swhiteho@redhat.com
Cc: sfrench@samba.org
Cc: vandrove@vc.cvut.cz
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Fix a sign issue in xdr_decode_fhstatus3()
Fix incorrect comparison in nfs_validate_mount_data()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs4_drop_state_owner() can now become static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Here are some more places where path_{get,put}() can be used instead of
dput()/mntput() pair.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFSv3 get_sb operations call into the LSM layer to set security options passed
from userspace. NFSv4 hooks were not originally added since it was reasonably
late in the merge window and NFSv3 was the only thing that had regressed (v4
has never supported any LSM options)
This patch makes NFSv4 call into the LSM to set security options rather than
just blindly dropping them with no notice to the user as happens today. This
patch was tested in a simple NFSv4 environment with the context= option and
appeared to work as expected.
Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <jmorris@namei.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If decode_compound_hdr_arg() returns a resource error, then we cannot
proceed to process the callback. Return a 'GARBAGE_ARGS' rpc-level error to
the caller instead.
If, however, the minor version field is incorrect, then we need to
propagate the resulting NFS4ERR_MINOR_VERS_MISMATCH error back as the
compound status field (setting the nops field to 0).
Finally, if encode_compound_hdr_res() returns an error, we need to return
an RPC_SYSTEM_ERR to the caller.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When called from nfs_flush_incompatible, the req is not locked, so
req->wb_page might be set to NULL before it is used by PageWriteback.
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Both the 'noac' and 'actimeo=0' mount options should ensure that attributes
are not cached, however a bug in nfs_attribute_timeout() means that
currently, the attributes may in fact get cached for up to one jiffy. This
has been seen to cause corruption in some applications.
The reason for the bug is that the time_in_range() test returns 'true' as
long as the current time lies between nfsi->read_cache_jiffies and
nfsi->read_cache_jiffies + nfsi->attrtimeo. In other words, if jiffies
equals nfsi->read_cache_jiffies, then we still cache the attribute data.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Register NFS' backing_dev_info under sysfs with the name "nfs-MAJOR:MINOR"
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use proc_create() to make sure that ->proc_fops be setup before gluing PDE to
main tree.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use creation by full path instead: "fs/foo".
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (80 commits)
SUNRPC: Invalidate the RPCSEC_GSS session if the server dropped the request
make nfs_automount_list static
NFS: remove duplicate flags assignment from nfs_validate_mount_data
NFS - fix potential NULL pointer dereference v2
SUNRPC: Don't change the RPCSEC_GSS context on a credential that is in use
SUNRPC: Fix a race in gss_refresh_upcall()
SUNRPC: Don't disconnect more than once if retransmitting NFSv4 requests
SUNRPC: Remove the unused export of xprt_force_disconnect
SUNRPC: remove XS_SENDMSG_RETRY
SUNRPC: Protect creds against early garbage collection
NFSv4: Attempt to use machine credentials in SETCLIENTID calls
NFSv4: Reintroduce machine creds
NFSv4: Don't use cred->cr_ops->cr_name in nfs4_proc_setclientid()
nfs: fix printout of multiword bitfields
nfs: return negative error value from nfs{,4}_stat_to_errno
NLM/lockd: Ensure client locking calls use correct credentials
NFS: Remove the buggy lock-if-signalled case from do_setlk()
NLM/lockd: Fix a race when cancelling a blocking lock
NLM/lockd: Ensure that nlmclnt_cancel() returns results of the CANCEL call
NLM: Remove the signal masking in nlmclnt_proc/nlmclnt_cancel
...
When svc_recv returns an unexpected error, nfs_callback_svc will print a
warning and exit. This problematic for several reasons. In particular,
it will cause the reference counts for the thread to be wrong, and no
new thread will be started until all nfs4 mounts are unmounted.
Rather than exiting on error from svc_recv, have the thread do a 1s
sleep and then retry the loop. This is unlikely to cause any harm, and
if the error turns out to be something temporary then it may be able to
recover.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
There's a general push to convert kernel threads to use the (much
cleaner) kthread API. This patch converts the NFSv4 callback kernel
thread to the kthread API. In addition to being generally cleaner this
also removes the dependency on signals when shutting down the thread.
Note that this patch depends on the recent patches to svc_recv() to
make it check kthread_should_stop() periodically. Those patches are
in Bruce's tree at the moment and are slated for 2.6.26 along with
the lockd conversion, so this conversion is probably also appropriate
for 2.6.26.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Thanks to Robert Day for pointing out that these two defines are unused.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Trond Myklebust <trond@netapp.com>Trond Myklebust <trond@netapp.com>
Cc: Neil Brown <neilb@suse.de>
Cc: "Robert P. J. Day" <rpjday@crashcourse.ca>
There is possible NULL pointer dereference if kstr[n]dup failed.
So fix them for safety.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We need to try to ensure that we always use the same credentials whenever
we re-establish the clientid on the server. If not, the server won't
recognise that we're the same client, and so may not allow us to recover
state.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
With the recent change to generic creds, we can no longer use
cred->cr_ops->cr_name to distinguish between RPCSEC_GSS principals and
AUTH_SYS/AUTH_NULL identities. Replace it with the rpc_authops->au_name
instead...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Benny points out that zero-padding of multiword bitfields is necessary,
and that delimiting each word is nice to avoid endianess confusion.
bhalevy: without zero padding output can be ambiguous. Also,
since the printed array of two 32-bit unsigned integers is not a
64-bit number, delimiting the output with a semicolon makes more sense.
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
All use sites for nfs{,4}_stat_to_errno negate their return value.
It's more efficient to return a negative error from the stat_to_errno convertors
rather than negating its return value everywhere. This also produces slightly
smaller code.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Both NLM and NFSv4 should be able to clean up adequately in the case where
the user interrupts the RPC call...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It is quite possible that the OPEN, CLOSE, LOCK, LOCKU,... compounds fail
before the actual stateful operation has been executed (for instance in the
PUTFH call). There is no way to tell from the overall status result which
operations were executed from the COMPOUND.
The fix is to move incrementing of the sequence id into the XDR layer,
so that we do it as we process the results from the stateful operation.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There should be no need to invalidate a perfectly good state owner just
because of a stale filehandle. Doing so can cause the state recovery code
to break, since nfs4_get_renew_cred() and nfs4_get_setclientid_cred() rely
on finding active state owners.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In the case of readpage() we need to ensure that the pages get unlocked,
and that the error is flagged.
In the case of O_DIRECT, we need to ensure that the pages are all released.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It is possible for nfs_wb_page() to sometimes exit with 0 return value, yet
the page is left in a dirty state.
For instance in the case where the server rebooted, and the COMMIT request
failed, then all the previously "clean" pages which were cached by the
server, but were not guaranteed to have been writted out to disk,
have to be redirtied and resent to the server.
The fix is to have nfs_wb_page_priority() check that the page is clean
before it exits...
This fixes a condition that triggers the BUG_ON(PagePrivate(page)) in
nfs_create_request() when we're in the nfs_readpage() path.
Also eliminate a redundant BUG_ON(!PageLocked(page)) while we're at it. It
turns out that clear_page_dirty_for_io() has the exact same test.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we depend on the inodes for writeability, we will not catch the r/o mounts
when implemented.
This patches uses __mnt_want_write(). It does not guarantee that the mount
will stay writeable after the check. But, this is OK for one of the checks
because it is just for a printk().
The other two are probably unnecessary and duplicate existing checks in the
VFS. This won't make them better checks than before, but it will make them
detect r/o mounts.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The nfs_open_context struct had a "flags" field added recently, but the
allocator isn't initializing it. It also looks like the allocator isn't
initializing the mode or list either, but they seem to be overwritten
by the caller, so that's less of an issue.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
... and take it out of ->umount_begin() instances. Call with all locks
already taken (by do_umount()) and leave calling release_mounts() to
caller (it will do release_mounts() anyway, so we can just put into
the same list).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Since O_DIRECT is a standard feature that is enabled in most distros,
eliminate the CONFIG_NFS_DIRECTIO build option, and change the
fs/nfs/Makefile to always build in the NFS direct I/O engine.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Display all mount options in /proc/mount which may be needed to reconstruct
a previous mount.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Save the value of the mountproto= mountport= mountvers= and mountaddr=
options so that these values can be displayed later via
nfs_show_options().
This preserves the intent of the original mount options, should the file
system need to be remounted based on what's displayed in /proc/mounts.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
During a remount based on the mount options displayed in /proc/mounts, we
want to preserve the original behavior of the mount request. Let's save
the original setting of the "port=" mount option in the mount's nfs_server
structure.
This allows us to simplify the default behavior of port setting for NFSv4
mounts: by default, NFSv2/3 mounts first try an RPC bind to determine the
NFS server's port, unless the user specified the "port=" mount option;
Users can force the client to skip the RPC bind by explicitly specifying
"port=<value>".
NFSv4, by contrast, assumes the NFS server port is 2049 and skips the RPC
bind, unless the user specifies "port=". Users can force an RPC bind for
NFSv4 by explicitly specifying "port=0".
I added a couple of extra comments to clarify this behavior.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: make data types of fields in nfs_parsed_mount_options more
consistent with other uses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: use %u instead of %d when displaying NFS mount options.
Nit: Fix reporting of "namlen=" option in nfs_show_mount_stats. The mount
option is called "namlen" without the "e".
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, the NFS readdir decoders have a workaround for buggy servers
that send an empty readdir response with the EOF bit unset. If the
server sends a malformed response in some cases, this workaround kicks
in and just returns an empty response rather than returning a proper
error to the caller.
This patch does 3 things:
1) have malformed responses with no entries return error (-EIO)
2) preserve existing workaround for servers that send empty
responses with the EOF marker unset.
3) Add some comments to clarify the logic in decode_readdir().
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, the NFS readdir decoders have a workaround for buggy servers
that send an empty readdir response with the EOF bit unset. If the
server sends a malformed response in some cases, this workaround kicks
in and just returns an empty response rather than returning a proper
error to the caller.
This patch does 3 things:
1) have malformed responses with no entries return error (-EIO)
2) preserve existing workaround for servers that send empty
responses with the EOF marker unset.
3) Add some comments to clarify the logic in nfs3_xdr_readdirres().
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, the NFS readdir decoders have a workaround for buggy servers
that send an empty readdir response with the EOF bit unset. If the
server sends a malformed response in some cases, this workaround kicks
in and just returns an empty response rather than returning a proper
error to the caller.
This patch does 3 things:
1) have malformed responses with no entries return error (-EIO)
2) preserve existing workaround for servers that send empty
responses with the EOF marker unset.
3) Add some comments to clarify the logic in nfs_xdr_readdirres().
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Both flush functions have the same error handling routine. Pull
it out as a function.
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ignoring the return value from nfs_pageio_add_request can cause deadlocks.
In read path:
call nfs_pageio_add_request from readpage_async_filler
assume at this point that there are requests already in desc, that
can't be merged with the current request.
so nfs_pageio_doio is fired up to clear out desc.
assume something goes wrong in setting up the io, so desc->pg_error is set.
This causes nfs_pageio_add_request to return 0, *WITHOUT* adding the original
request.
BUT, since return code is ignored, readpage_async_filler assumes it has
been added, and does nothing further, leaving page locked.
do_generic_mapping_read will eventually call lock_page, resulting in deadlock
In write path:
page is marked dirty by generic_perform_write
nfs_writepages is called
call nfs_pageio_add_request from nfs_page_async_flush
assume at this point that there are requests already in desc, that
can't be merged with the current request.
so nfs_pageio_doio is fired up to clear out desc.
assume something goes wrong in setting up the io, so desc->pg_error is set.
This causes nfs_page_async_flush to return 0, *WITHOUT* adding the original
request, yet marking the request as locked (PG_BUSY) and in writeback,
clearing dirty marks.
The next time a write is done to the page, deadlock will result as
nfs_write_end calls nfs_update_request
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The previous value was not taking into account space for bitmap array size.
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFSv4 protocol allows clients to negotiate security protocols on the
fly in the case where an administrator on the server changes the export
settings and/or in the case where we may have a filesystem migration event.
Instead of having the NFS client code cache credentials that are tied to a
particular AUTH method it is therefore preferable to have a generic credential
that can be converted into whatever AUTH is in use by the RPC client when
the read/write/sillyrename/... is put on the wire.
We do this by means of the new "generic" credential, which basically just
caches the minimal information that is needed to look up an RPCSEC_GSS,
AUTH_SYS, or AUTH_NULL credential.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'hotfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Fix dentry revalidation for NFSv4 referrals and mountpoint crossings
NFS: Fix the fsid revalidation in nfs_update_inode()
SUNRPC: Fix a nfs4 over rdma transport oops
NFS: Fix an f_mode/f_flags confusion in fs/nfs/write.c
As long as the directory contents haven't changed, we should just let the
path walk proceed to cross the mountpoint. Apart from being an optimisation
in the case of 'nohide' mountpoint traversals, it also fixes an issue with
referrals: referral inodes don't have valid filehandles, so calling
nfs_revalidate_inode() on them is a bug.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When we detect that we've crossed a mountpoint on the remote server, we
must take care not to use that inode to revalidate the fsid on our
current superblock. To do so, we label the inode as a remote mountpoint,
and check for that in nfs_update_inode().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFS and SELinux worked together previously because SELinux had NFS
specific knowledge built in. This design was approved by both groups
back in 2004 but the recent NFS changes to use nfs_parsed_mount_data and
the usage of nfs_clone_mount_data showed this to be a poor fragile
solution. This patch fixes the NFS functionality regression by making
use of the new LSM interfaces to allow an FS to explicitly set its own
mount options.
The explicit setting of mount options is done in the nfs get_sb
functions which are called before the generic vfs hooks try to set mount
options for filesystems which use text mount data.
This does not currently support NFSv4 as that functionality did not
exist in previous kernels and thus there is no regression. I will be
adding the needed code, which I believe to be the exact same as the v3
code, in nfs4_get_sb for 2.6.26.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: James Morris <jmorris@namei.org>
Now that we've tightened up the locking rules for RPC queue wakeups, we can
remove the RCU-safe kfree calls...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
An audit of the current RPC timeout functions shows that they don't really
ever need to run in the softirq context. As long as the softirq is
able to signal that the wakeup is due to a timeout (which it can do by
setting task->tk_status to -ETIMEDOUT) then the callback functions can just
run as standard task->tk_callback functions (in the rpciod/process
context).
The only possible border-line case would be xprt_timer() for the case of
UDP, when the callback is used to reduce the size of the transport
congestion window. In testing, however, the effect of moving that update
to a callback would appear to be minor.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We want to ensure that rpc_call_ops that involve mntput() are run on nfsiod
rather than on rpciod, so that they don't deadlock when the resulting
umount calls rpc_shutdown_client(). Hence we specify that read, write and
commit calls must complete on nfsiod.
Ditto for NFSv4 open, lock, locku and close asynchronous calls.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFS post-rpciod cleanups often involve tasks that cannot be safely
performed within the rpciod context (due to deadlock concerns). We
therefore add a dedicated NFS workqueue that can perform tasks like
cleaning up state after an interrupted NFSv4 open() call, or calling
put_nfs_open_context() after an asynchronous read or write call.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We can't allow rpc callback functions like task->tk_ops->rpc_call_prepare()
and task->tk_ops->rpc_call_done() to call mntput() in any way, since
that will cause a deadlock when the call to rpc_shutdown_client() attempts
to wait on 'task' to complete.
We can avoid the above deadlock by moving calls to mntput to
task->tk_ops->rpc_release() callback, since at that time the task will be
marked as completed, and so rpc_shutdown_client won't attempt to wait on
it.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Sorry for the noise, but here's the v3 of this compilation fix :)
There are some places, which declare the char buf[...] on the stack
to push it later into dprintk(). Since the dprintk sometimes (if the
CONFIG_SYSCTL=n) becomes an empty do { } while (0) stub, these buffers
cause gcc to produce appropriate warnings.
Wrap these buffers with RPC_IFDEBUG macro, as Trond proposed, to
compile them out when not needed.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
fs/nfs/nfs4state.c:788:34: warning: Using plain integer as NULL pointer
fs/nfs/delegation.c:52:34: warning: Using plain integer as NULL pointer
fs/nfs/idmap.c:312:12: warning: Using plain integer as NULL pointer
fs/nfs/callback_xdr.c:257:6: warning: Using plain integer as NULL pointer
fs/nfs/callback_xdr.c:270:6: warning: Using plain integer as NULL pointer
fs/nfs/callback_xdr.c:281:6: warning: Using plain integer as NULL pointer
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that the reference counting on the callback thread is working as
expected, it uncovers another problem. Peter Staubach noticed while
testing that patch on an older kernel that he would occasionally see
this printk in rpc_register fire:
"RPC: failed to contact portmap (errno -512).
The NFSv4 callback thread is signaled by nfs_callback_down(), but never
flushes that signal. All of the shutdown processing is done with that
signal pending. This makes it fail the call to unregister the port with
the portmapper.
In actuality, this rpc_register call isn't necessary at all since the
port isn't actually registered with the portmapper anymore. Regardless,
there doesn't seem to be any reason to leave the signal pending while
the thread is being shut down and flushing it should generally silence
that printk.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* Add path_put() functions for releasing a reference to the dentry and
vfsmount of a struct path in the right order
* Switch from path_release(nd) to path_put(&nd->path)
* Rename dput_path() to path_put_conditional()
[akpm@linux-foundation.org: fix cifs]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the central patch of a cleanup series. In most cases there is no good
reason why someone would want to use a dentry for itself. This series reflects
that fact and embeds a struct path into nameidata.
Together with the other patches of this series
- it enforced the correct order of getting/releasing the reference count on
<dentry,vfsmount> pairs
- it prepares the VFS for stacking support since it is essential to have a
struct path in every place where the stack can be traversed
- it reduces the overall code size:
without patch series:
text data bss dec hex filename
5321639 858418 715768 6895825 6938d1 vmlinux
with patch series:
text data bss dec hex filename
5320026 858418 715768 6894212 693284 vmlinux
This patch:
Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix cifs]
[akpm@linux-foundation.org: fix smack]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NFS should use GFP_NOFS mode radix tree preloads rather than GFP_ATOMIC
allocations at radix-tree insertion-time. This is important to reduce the
atomic memory requirement.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The error field in nfs_readdir_descriptor_t is never used outside of the
function in which it is set. Remove the field and change the place that
does use it to use an existing local variable.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The warning message for a v4 server returning various bad sequence-ids is
missing spaces.
Signed-off-by: Dan Muntz <dmuntz@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The reference counting for the NFSv4 callback thread stays artificially
high. When this thread comes down, it doesn't properly tear down the
svc_serv, causing a memory leak. In my testing on an older kernel on
x86_64, memory would leak out of the 8k kmalloc slab. So, we're leaking
at least a page of memory every time the thread comes down.
svc_create() creates the svc_serv with a sv_nrthreads count of 1, and
then svc_create_thread() increments that count. Whenever the callback
thread is started it has a sv_nrthreads count of 2. When coming down, it
calls svc_exit_thread() which decrements that count and if it hits 0, it
tears everything down. That never happens here since the count is always
at 2 when the thread exits.
The problem is that nfs_callback_up() should be calling svc_destroy() on
the svc_serv on both success and failure. This is how lockd_up_proto()
handles the reference counting, and doing that here fixes the leak.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the inode is flagged as having an invalid mapping, then we can't rely on
the PageUptodate() flag. Ensure that we don't use the "anti-fragmentation"
write optimisation in nfs_updatepage(), since that will cause NFS to write
out areas of the page that are no longer guaranteed to be up to date.
A potential corruption could occur in the following scenario:
client 1 client 2
=============== ===============
fd=open("f",O_CREAT|O_WRONLY,0644);
write(fd,"fubar\n",6); // cache last page
close(fd);
fd=open("f",O_WRONLY|O_APPEND);
write(fd,"foo\n",4);
close(fd);
fd=open("f",O_WRONLY|O_APPEND);
write(fd,"bar\n",4);
close(fd);
-----
The bug may lead to the file "f" reading 'fubar\n\0\0\0\nbar\n' because
client 2 does not update the cached page after re-opening the file for
write. Instead it keeps it marked as PageUptodate() until someone calls
invaldate_inode_pages2() (typically by calling read()).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Simplify page cache zeroing of segments of pages through 3 functions
zero_user_segments(page, start1, end1, start2, end2)
Zeros two segments of the page. It takes the position where to
start and end the zeroing which avoids length calculations and
makes code clearer.
zero_user_segment(page, start, end)
Same for a single segment.
zero_user(page, start, length)
Length variant for the case where we know the length.
We remove the zero_user_page macro. Issues:
1. Its a macro. Inline functions are preferable.
2. The KM_USER0 macro is only defined for HIGHMEM.
Having to treat this special case everywhere makes the
code needlessly complex. The parameter for zeroing is always
KM_USER0 except in one single case that we open code.
Avoiding KM_USER0 makes a lot of code not having to be dealing
with the special casing for HIGHMEM anymore. Dealing with
kmap is only necessary for HIGHMEM configurations. In those
configurations we use KM_USER0 like we do for a series of other
functions defined in highmem.h.
Since KM_USER0 is depends on HIGHMEM the existing zero_user_page
function could not be a macro. zero_user_* functions introduced
here can be be inline because that constant is not used when these
functions are called.
Also extract the flushing of the caches to be outside of the kmap.
[akpm@linux-foundation.org: fix nfs and ntfs build]
[akpm@linux-foundation.org: fix ntfs build some more]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Modify the various kernel RPC svcs to use the svc_create_xprt service.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Acked-by: Neil Brown <neilb@suse.de>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: (22 commits)
Remove commented-out code copied from NFS
NFS: Switch from intr mount option to TASK_KILLABLE
Add wait_for_completion_killable
Add wait_event_killable
Add schedule_timeout_killable
Use mutex_lock_killable in vfs_readdir
Add mutex_lock_killable
Use lock_page_killable
Add lock_page_killable
Add fatal_signal_pending
Add TASK_WAKEKILL
exit: Use task_is_*
signal: Use task_is_*
sched: Use task_contributes_to_load, TASK_ALL and TASK_NORMAL
ptrace: Use task_is_*
power: Use task_is_*
wait: Use TASK_NORMAL
proc/base.c: Use task_is_*
proc/array.c: Use TASK_REPORT
perfmon: Use task_is_*
...
Fixed up conflicts in NFS/sunrpc manually..
The same delegation may have been handed out to more than one nfs_client.
Ensure that if a recall occurs, we return all instances.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If a (broken?) server hands out two different delegations for the same
file, then we should return one of them.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Otherwise, there is a potential deadlock if the last dput() from an NFSv4
close() or other asynchronous operation leads to nfs_clear_inode calling
the synchronous delegreturn.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
David Howells noticed that repeating the same mount option twice during an
NFS mount request can result in orphaned memory in certain cases.
Only the client_address and mount_server.hostname strings are initialized
in the mount parsing loop, so those appear to be the only two pointers that
might be written over by repeating a mount option. The strings in the
nfs_server section of the nfs_parsed_mount_data structure are set only once
after the options are parsed, thus these are not susceptible to being
overwritten.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The rfc doesn't give any reason it shouldn't be possible to set an
attribute on a non-regular file. And if the server supports it, then it
shouldn't be up to us to prevent it.
Thanks to Erez for the report and Trond for further analysis.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Tested-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There are no interruptible waits for asynchronous RPC tasks, so we don't
need to wrap calls to rpc_run_task() with an
rpc_clnt_sigmask/rpc_clnt_unsigmask pair.
Instead we can wrap the wait_for_completion_interruptible() in
nfs_direct_wait(). This means that we completely optimise away sigmask
setting for the case of non-blocking aio/dio.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: pass 5 arguments to nlmclnt_init() in a structure similar to the
new nfs_client_initdata structure.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that each NFS mount point caches its own nlm_host structure, it can be
passed to nlmclnt_proc() for each lock request. By pinning an nlm_host for
each mount point, we trade the overhead of looking up or creating a fresh
nlm_host struct during every NLM procedure call for a little extra memory.
We also restrict the nlmclnt_proc symbol to limit the use of this call to
in-tree modules.
Note that nlm_lookup_host() (just removed from the client's per-request
NLM processing) could also trigger an nlm_host garbage collection. Now
client-side nlm_host garbage collection occurs only during NFS mount
processing. Since the NFS client now holds a reference on these nlm_host
structures, they wouldn't have been affected by garbage collection
anyway.
Given that nlm_lookup_host() reorders the global nlm_host chain after
every successful lookup, and that a garbage collection could be triggered
during the call, we've removed a significant amount of per-NLM-request
CPU processing overhead.
Sidebar: there are only a few remaining references to the internals of
NFS inodes in the client-side NLM code. The only references I found are
related to extracting or comparing the inode's file handle via NFS_FH().
One is in nlmclnt_grant(); the other is in nlmclnt_setlockargs().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cache an appropriate nlm_host structure in the NFS client's mount point
metadata for later use.
Note that there is no need to set NFS_MOUNT_NONLM in the error case -- if
nfs_start_lockd() returns a non-zero value, its callers ensure that the
mount request fails outright.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: commit 4899f9c8 added nfs_write_end(), which introduces a
conditional expression that returns an unsigned integer in one arm and
a signed integer in the other.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: PAGE_CACHE_SIZE is unsigned, and nfs_pageio_init() takes a size_t.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: always use the same type when handling buffer lengths. As a
bonus, this prevents a mixed sign comparison in idmap_lookup_name.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The idmap_pipe_upcall() function expects the copy_to_user() function to
return a negative error value if the call fails, but copy_to_user()
returns an unsigned long number of bytes that couldn't be copied.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up white space damage and use standard kernel coding conventions for
return statements.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, if you have a server mounted using networking protocol, you
cannot specify a different value using the 'proto=' option on another
mountpoint.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that the needed IPv6 infrastructure is in place, allow the NFS client's
IP address parser to generate AF_INET6 addresses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Replace the nfs_server and mount_server address fields in the
nfs_parsed_mount_data structure with a "struct sockaddr_storage"
instead of a "struct sockaddr_in".
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Aurelien Charbon <aurelien.charbon@ext.bull.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Refactor the logic to parse incoming text-based IP addresses. Use the
in4_pton() function instead of the older in_aton(), following the lead
of the in-kernel CIFS client.
Later we'll add IPv6 address parsing using the matching in6_pton()
function. For now we can't allow IPv6 address parsing: we must expand
the size of the address storage fields in the nfs_parsed_mount_options
struct before we can parse and store IPv6 addresses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Aurelien Charbon <aurelien.charbon@ext.bull.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In the name of address family compatibility, we can't have the NIP_FMT and
NIPQUAD macros in nfs_try_mount(). Instead, we can make use of an unused
mount option to display the mount server's hostname.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Aurelien Charbon <aurelien.charbon@ext.bull.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Change the addr field in the nfs_clone_mount structure to store a "struct
sockaddr *" to support non-IPv4 addresses in the NFS client.
Note this is mostly a cosmetic change, and does not actually allow
referrals using IPv6 addresses. The existing referral code assumes that
the server returns a string that represents an IPv4 address. This code
needs to support hostnames and IPv6 addresses as well as IPv4 addresses,
thus it will need to be reorganized completely (to handle DNS resolution
in user space).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Aurelien Charbon <aurelien.charbon@ext.bull.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Adjust the arguments and callers of nfs4_set_client() to pass a "struct
sockaddr *" instead of a "struct sockaddr_in *" to support non-IPv4
addresses in the NFS client.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Aurelien Charbon <aurelien.charbon@ext.bull.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Adjust arguments and callers of nfs_get_client() to pass a
"struct sockaddr *" instead of "struct sockaddr_in *" to support
non-IPv4 addresses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Aurelien Charbon <aurelien.charbon@ext.bull.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Adjust arguments and callers of nfs_find_client() to pass a
"struct sockaddr *" instead of "struct sockaddr_in *" to support non-IPv4
addresses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Aurelien Charbon <aurelien.charbon@ext.bull.net>
Trond: Also fix up protocol version number argument in nfs_find_client() to
use the correct u32 type.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Change the addr field in the cb_recallargs struct to a "struct sockaddr *"
to support non-IPv4 addresses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Aurelien Charbon <aurelien.charbon@ext.bull.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Change the addr field in the cb_getattrargs struct to a "struct sockaddr *"
to support non-IPv4 addresses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Aurelien Charbon <aurelien.charbon@ext.bull.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Prepare for managing larger addresses in the NFS client by widening the
nfs_client struct's cl_addr field.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Aurelien Charbon <aurelien.charbon@ext.bull.net>
(Modified to work with the new parameters for nfs_alloc_client)
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Create a helper function to set the default NFS port for NFSv4 mount
points. The helper supports both AF_INET and AF_INET6 family addresses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We'll need to set the port number of an AF_INET or AF_INET6 address in
several places in fs/nfs/super.c, so introduce a helper that can manage
this for us. We put this helper to immediate use.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add support to nfs_verify_server_address for recognizing AF_INET6
addresses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Refactor nfs_compare_super() and add AF_INET6 support.
Replace the generic memcmp() to document explicitly what parts of the
addresses must match in this check, and make the comparison independent
of the lengths of both addresses.
A side benefit is both tests are more computationally efficient than a
memcmp().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: fix an outdated block comment, and address a comparison
between a signed and unsigned integer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: The client side peer address is available in callback_proc.c,
so move a dprintk out of fs/nfs/callback.c and into
fs/nfs/callback_proc.c.
This is more consistent with other debugging messages, and the proc
routines have more information about each request to display.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Aurelien Charbon <aurelien.charbon@ext.bull.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
To ensure the NFS client displays IPv6 addresses properly, replace
address family-specific NIPQUAD() invocations with a call to the RPC
client to get a formatted string representing the remote peer's
address.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Aurelien Charbon <aurelien.charbon@ext.bull.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We recently added methods to RPC transports that provide string versions of
the remote peer address information. Convert the NFSv4 SETCLIENTID
procedure to use those methods instead of building the client ID out of
whole cloth.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Aurelien Charbon <aurelien.charbon@ext.bull.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that the RPC buffer size specified for NFSv4 SETCLIENTID procedures
matches what we are encoding into the buffer. See the definition of
struct nfs4_setclientid {} and the encode_setclientid() function.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: The header tag length is unsigned, so checking that it is less
than zero is unnecessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The address comparison in the __nfs_find_client() function is deceptive.
It uses a memcmp() to check a pair of u32 fields for equality. Not only is
this inefficient, but usually memcmp() is used for comparing two *whole*
sockaddr_in's (which includes comparisons of the address family and port
number), so it's easy to mistake the comparison here for a whole sockaddr
comparison, which it isn't.
So for clarity and efficiency, we replace the memcmp() with a simple test
for equality between the two s_addr fields. This should have no
behavioral effect.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: mount option parsing uses kstrndup in several places, rather than
using kzalloc. Replace the few remaining uses of kzalloc with kstrndup,
for consistency.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Remove the mount option that allows users to specify an alternate mountd
program number. The client hasn't support setting an alternate mountd
program number for a very long time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Remove the mount option that allows users to specify an alternate NFS
program number. The client hasn't support setting an alternate NFS
program number for a very long time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Text-based mount option parsing introduced a minor regression in the
behavior of NFS version 4 mounts. NFS version 4 is not supposed to require
a running rpcbind service on the server in order for a mount to succeed.
In other words, if the mount options don't specify a port number, the port
number is supposed to default to 2049. For earlier versions of NFS, the
default port number was zero in order to cause the RPC client to autobind
to the server's NFS service.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
POSIX requires that ctime and mtime, as reported by the stat(2) call,
reflect the activity of the most recent write(2). To that end, nfs_getattr()
flushes pending dirty writes to a file before doing a GETATTR to allow the
NFS server to set the file's size, ctime, and mtime properly.
However, nfs_getattr() can be starved when a constant stream of application
writes to a file prevents nfs_wb_nocommit() from completing. This usually
results in hangs of programs doing a stat against an NFS file that is being
written. "ls -l" is a common victim of this behavior.
To prevent starvation, hold the file's i_mutex in nfs_getattr() to
freeze applications writes temporarily so the client can more quickly obtain
clean values for a file's size, mtime, and ctime.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The nfs_wcc_update_inode() function omits logic to convert the type of
the NFS on-the-wire value of a file's size (__u64) to the type of file
size value stored in struct inode (loff_t, which is signed).
Everywhere else in the NFS client I checked already correctly converts the
file size type.
This effects only very large files.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Replace use of rpc_call_setup() with rpc_init_task(), and in cases where we
need to initialise task->tk_action, with rpc_call_start().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move the common code for setting up the nfs_write_data and nfs_read_data
structures into fs/nfs/read.c, fs/nfs/write.c and fs/nfs/direct.c.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We want the default scheduling priority (priority == 0) to remain
RPC_PRIORITY_NORMAL.
Also ensure that the priority wait queue scheduling is per process id
instead of sometimes being per thread, and sometimes being per inode.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Added an active/deactive mechanism to the nfs_server structure
allowing async operations to hold off umount until the
operations are done.
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reduce the time spent locking the rpc_sequence structure by queuing the
nfs_seqid only when we are ready to take the lock (when calling
nfs_wait_on_sequence).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The current model locks the page twice for no good reason. Optimise by
inlining the parts of nfs_write_begin()/nfs_write_end() that we care about.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the server returns an ENOENT error, we still need to do a d_delete() in
order to ensure that the dentry is deleted.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In nfs_do_call_unlink() we check that we haven't raced, and that lookup()
hasn't created an aliased dentry to our sillydeleted dentry. If somebody
has deleted the file on the server and the lookup() resulted in a negative
dentry, then ignore...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that readdir revalidates its data cache after blocking on
sillyrename.
Also fix a typo in nfs_do_call_unlink(): swap the ^= for an |=. The result
is the same, since we've already checked that the flag is unset, but it
makes the code more readable.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Sharing the open sequence queue causes a deadlock when we try to take
both a lock sequence id and and open sequence id.
This fixes the regression reported by Dimitri Puzin and Jeff Garzik: See
http://bugzilla.kernel.org/show_bug.cgi?id=9712
for details.
Reported-and-tested-by: Dimitri Puzin <bugs@psycast.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NFSv4 file locking is currently completely broken since it doesn't respect
the OPEN sequencing when it is given an unconfirmed lock_owner and needs to
do an open_to_lock_owner. Worse: it breaks the sunrpc rules by doing a
GFP_KERNEL allocation inside an rpciod callback.
Fix is to preallocate the open seqid structure in nfs4_alloc_lockdata if we
see that the lock_owner is unconfirmed.
Then, in nfs4_lock_prepare() we wait for either the open_seqid, if
the lock_owner is still unconfirmed, or else fall back to waiting on the
standard lock_seqid.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
RFC3530 states that the open_owner is confirmed if and only if the client
sends an OPEN_CONFIRM request with the appropriate sequence id and stateid
within the lease period.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Sigh... commit 4584f520e1 (NFS: Fix NFS
mountpoint crossing...) had a slight flaw: server can be NULL if sget()
returned an existing superblock.
Fix the fix by dereferencing s->s_fs_info.
Thanks to Coverity/Adrian Bunk and Frank Filz for spotting the bug.
(See http://bugzilla.kernel.org/show_bug.cgi?id=9647)
Also add in the same namespace Oops fix for NFSv4 in both the mountpoint
crossing case, and the referral case.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that the dummy 'root dentry' is invisible to d_find_alias(). If not,
then it may be spliced into the tree if a parent directory from the same
filesystem gets mounted at a later time.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This reverts commit b9148c6b80.
On Wed, 12 Dec 2007 10:57:30 -0500, Chuck Lever wrote
> commit b9148c6b should be reverted. It was recently forward-ported
> from some years-old patches, and is clearly not needed now.
>
> On Dec 11, 2007, at 5:21 PM, Adrian Bunk wrote:
>
>> This code became dead after commit
>> b9148c6b80
>> (which BTW doesn't seem to have changed any behaviour) and can
>> therefore
>> be removed.
>>
>> Spotted by the Coverity checker.
>>
>> Signed-off-by: Adrian Bunk <bunk@kernel.org>
>>
>> ---
>> --- linux-2.6/fs/nfs/direct.c.old 2007-12-02 21:54:53.000000000 +0100
>> +++ linux-2.6/fs/nfs/direct.c 2007-12-02 21:55:10.000000000 +0100
>> @@ -897,15 +897,12 @@ ssize_t nfs_file_direct_write(struct kio
>> if (!count)
>> goto out; /* return 0 */
>>
>> retval = -EINVAL;
>> if ((ssize_t) count < 0)
>> goto out;
>> - retval = 0;
>> - if (!count)
>> - goto out;
>>
>> retval = nfs_sync_mapping(mapping);
>> if (retval)
>> goto out;
>>
>> retval = nfs_direct_write(iocb, iov, nr_segs, pos, count);
>>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Neil Brown said:
> Hi Trond,
>
> We found that a machine which made moderately heavy use of
> 'automount' was leaking some nfs data structures - particularly the
> 4K allocated by rpc_alloc_iostats.
> It turns out that this only happens with filesystems with -onolock
> set.
> The problem is that if NFS_MOUNT_NONLM is set, nfs_start_lockd doesn't
> set server->destroy, so when the filesystem is unmounted, the
> ->client_acl is not shutdown, and so several resources are still
> held. Multiple mount/umount cycles will slowly eat away memory
> several pages at a time.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: NeilBrown <neilb@suse.de>
The check that was added to nfs_xdev_get_sb() to work around broken
servers, works fine for NFSv2, but causes mountpoint crossing on NFSv3 to
always return ESTALE.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
By using the TASK_KILLABLE infrastructure, we can get rid of the 'intr'
mount option. We have to use _killable everywhere instead of _interruptible
as we get rid of rpc_clnt_sigmask/sigunmask.
Signed-off-by: Liam R. Howlett <howlett@gmail.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Simplify calling sequence of nfs_direct_{read,write}_schedule(), and
rename them to reflect their new role.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
A zero byte count direct write request should be a successful no-op, not an
error.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Allow applications to perform asynchronous scatter-gather direct I/O
to NFS files.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add helpers that iterate over multi-segment iovecs. These will
be used to support multi-segment scatter/gather direct I/O in a
later patch.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Every file should include the headers containing the prototypes for its global
functions (in this case nfs_access_cache_shrinker()).
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_wb_page_priority() can now become static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
While testing a kernel based upon ecd744eec3
(with wrong boot arguments), I got the following bad page state entry while
NFS was trying to mount it's rootfs:
IP-Config: Complete:
device=eth0, addr=192.168.1.101, mask=255.255.255.0, gw=255.255.255.255,
host=192.168.1.101, domain=, nis-domain=(none),
bootserver=192.168.1.100, rootserver=192.168.1.100, rootpath=
Looking up port of RPC 100003/2 on 192.168.1.100
rpcbind: server 192.168.1.100 not responding, timed out
Root-NFS: Unable to get nfsd port number from server, using default
Looking up port of RPC 100005/1 on 192.168.1.100
rpcbind: server 192.168.1.100 not responding, timed out
Root-NFS: Unable to get mountd port number from server, using default
mount: server 192.168.1.100 not responding, timed out
Root-NFS: Server returned error -5 while mounting /nfs/rootfs/
VFS: Unable to mount root fs via NFS, trying floppy.
Bad page state in process 'swapper'
page:c02b1260 flags:0x00000400 mapping:00000000 mapcount:0 count:0
Trying to fix it up, but a reboot is needed
Backtrace:
[<c0023e34>] (dump_stack+0x0/0x14) from [<c0062570>] (bad_page+0x70/0xac)
[<c0062500>] (bad_page+0x0/0xac) from [<c0064914>] (free_hot_cold_page+0x80/0x178)
[<c0064894>] (free_hot_cold_page+0x0/0x178) from [<c0064a74>] (free_hot_page+0x14/0x18)
[<c0064a60>] (free_hot_page+0x0/0x18) from [<c0067078>] (put_page+0xf8/0x154)
[<c0066f80>] (put_page+0x0/0x154) from [<c007dbc8>] (kfree+0xc8/0xd0)
[<c007db00>] (kfree+0x0/0xd0) from [<c00cbb54>] (nfs_get_sb+0x230/0x710)
[<c00cb924>] (nfs_get_sb+0x0/0x710) from [<c0084334>] (vfs_kern_mount+0x58/0xac)[<c00842dc>] (vfs_kern_mount+0x0/0xac) from [<c00843c0>] (do_kern_mount+0x38/0xf4)
[<c0084388>] (do_kern_mount+0x0/0xf4) from [<c0099c7c>] (do_mount+0x1e8/0x614)
...
This seems to be caused by use of an uninitialised structure due to NULL
options being passed to nfs_validate_mount_data(). Ensure that the
parsed mount data is always initialised.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
(Trond: added fix for the same bug in nfs4_validate_mount_data()).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Hi Trond,
I have discovered that the BUG_ON in nfs_follow_mountpoint:
BUG_ON(IS_ROOT(dentry));
can be triggered by a misbehaving server.
What happens is the client does a lookup and discoveres that the named
directory has a different fsid, so it initiates a mount.
It then performs a GETATTR on the mounted directory and gets a
different fsid again (due to a bug in the NFS server).
This causes nfs_follow_mountpoint to be called on the newly mounted
root, which triggers the BUG_ON.
To duplicate this, have a directory which contains some mountpoints,
and export that directory with the "crossmnt" flag using nfs-utils
1.1.1 (or 1.1.0 I think)
The GETATTR on the root of the mounted filesystem will return the
information for the top exportpoint, while a lookup will return the
correct information. This difference causes the NFS client to BUG.
I think the best way to fix this is to trap this possibility early, so
just before completing the mount in the NFS client, check that it isn't
going to use nfs_mountpoint_inode_operations.
As long as i_op will never change once set (is that true?), this
should be adequately safe.
The following patch shows a possible approach, and it works for me.
i.e. when the NFS server is misbehaving, I get ESTALE on those
mountpoints, while when the NFS server is working correctly, I get
correct behaviour on the client.
NeilBrown
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since 2.6.18, the superblock sb->s_root has been a dummy dentry with a
dummy inode. This breaks ustat(), which actually uses sb->s_root in a
vfstat() call.
Fix this by making the s_root a dummy alias to the directory inode that was
used when creating the superblock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Commit eda3cef8dd ("NFS: Fix error
handling in nfs_direct_write_result()") ensured that if a WRITE returns
an error, then data->res.verf->committed is not tested (as it is not
initialised).
Then commit 60fa3f769f ("NFS: Fix two bugs
in the O_DIRECT write code") inadvertently reverted this while fixing
other problems.
So move the test so that we never examine ->committed in an error case,
and fix a speeling error while we are there.
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (74 commits)
fix do_sys_open() prototype
sysfs: trivial: fix sysfs_create_file kerneldoc spelling mistake
Documentation: Fix typo in SubmitChecklist.
Typo: depricated -> deprecated
Add missing profile=kvm option to Documentation/kernel-parameters.txt
fix typo about TBI in e1000 comment
proc.txt: Add /proc/stat field
small documentation fixes
Fix compiler warning in smount example program from sharedsubtree.txt
docs/sysfs: add missing word to sysfs attribute explanation
documentation/ext3: grammar fixes
Documentation/java.txt: typo and grammar fixes
Documentation/filesystems/vfs.txt: typo fix
include/asm-*/system.h: remove unused set_rmb(), set_wmb() macros
trivial copy_data_pages() tidy up
Fix typo in arch/x86/kernel/tsc_32.c
file link fix for Pegasus USB net driver help
remove unused return within void return function
Typo fixes retrun -> return
x86 hpet.h: remove broken links
...
Erez Zadok reports that certain configurations fail to build due to
schedule() TASK_[UN]INTERRUPTIBLE not being declared. Add proper
include files to fix.
Cc: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Convert files to UTF-8.
* Also correct some people's names
(one example is Eißfeldt, which was found in a source file.
Given that the author used an ß at all in a source file
indicates that the real name has in fact a 'ß' and not an 'ss',
which is commonly used as a substitute for 'ß' when limited to
7bit.)
* Correct town names (Goettingen -> Göttingen)
* Update Eberhard Mönkeberg's address (http://lkml.org/lkml/2007/1/8/313)
Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Otherwise, we do end up breaking close-to-open semantics. We also end up
breaking some of the silly-rename tests in Connectathon on some setups.
Please refer to the bug-report at
http://bugzilla.linux-nfs.org/show_bug.cgi?id=150
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
lookup() and sillyrename() can race one another because the sillyrename()
completion cannot take the parent directory's inode->i_mutex since the
latter may be held by whoever is calling dput().
We therefore have little option but to add extra locking to ensure that
nfs_lookup() and nfs_atomic_open() do not race with the sillyrename
completion.
If somebody has looked up the sillyrenamed file in the meantime, we just
transfer the sillydelete information to the new dentry.
Please refer to the bug-report at
http://bugzilla.linux-nfs.org/show_bug.cgi?id=150
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch fixes a regression that was introduced by commit
44dd151d5c
We cannot zero the user page in nfs_mark_uptodate() any more, since
a) We'd be modifying the page without holding the page lock
b) We can race with other updates of the page, most notably
because of the call to nfs_wb_page() in nfs_writepage_setup().
Instead, we do the zeroing in nfs_update_request() if we see that we're
creating a request that might potentially be marked as up to date.
Thanks to Olivier Paquet for reporting the bug and providing a test-case.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the ATTR_KILL_S*ID bits are set then any mode change is only for clearing
the setuid/setgid bits. For NFS, skip the mode change and let the server
handle it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Slab constructors currently have a flags parameter that is never used. And
the order of the arguments is opposite to other slab functions. The object
pointer is placed before the kmem_cache pointer.
Convert
ctor(void *object, struct kmem_cache *s, unsigned long flags)
to
ctor(struct kmem_cache *s, void *object)
throughout the kernel
[akpm@linux-foundation.org: coupla fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These patches aim to improve balance_dirty_pages() and directly address three
issues:
1) inter device starvation
2) stacked device deadlocks
3) inter process starvation
1 and 2 are a direct result from removing the global dirty limit and using
per device dirty limits. By giving each device its own dirty limit is will
no longer starve another device, and the cyclic dependancy on the dirty limit
is broken.
In order to efficiently distribute the dirty limit across the independant
devices a floating proportion is used, this will allocate a share of the total
limit proportional to the device's recent activity.
3 is done by also scaling the dirty limit proportional to the current task's
recent dirty rate.
This patch:
nfs: remove congestion_end(). It's redundant, clear_bdi_congested() already
wakes the waiters.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'locks' of git://linux-nfs.org/~bfields/linux:
nfsd: remove IS_ISMNDLCK macro
Rework /proc/locks via seq_files and seq_list helpers
fs/locks.c: use list_for_each_entry() instead of list_for_each()
NFS: clean up explicit check for mandatory locks
AFS: clean up explicit check for mandatory locks
9PFS: clean up explicit check for mandatory locks
GFS2: clean up explicit check for mandatory locks
Cleanup macros for distinguishing mandatory locks
Documentation: move locks.txt in filesystems/
locks: add warning about mandatory locking races
Documentation: move mandatory locking documentation to filesystems/
locks: Fix potential OOPS in generic_setlease()
Use list_first_entry in locks_wake_up_blocks
locks: fix flock_lock_file() comment
Memory shortage can result in inconsistent flocks state
locks: kill redundant local variable
locks: reverse order of posix_locks_conflict() arguments
We were intending to put the previous instance of delegation->cred
before setting a new one.
Thanks to David Howells for spotting this.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The __mandatory_lock(inode) macro makes the same check, but makes the code
more readable.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This boot parameter will allow legacy 32-bit applications which call stat()
to continue to function even if the NFSv3/v4 server uses 64-bit inode
numbers.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the cached attributes match the ones supplied in the fattr, then assume
we've revalidated the inode.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We're failing basic test6 against Linux servers because they lack a correct
change attribute. The fix is to assume that we always want to invalidate
the readdir caches when we call update_changeattr and/or
nfs_post_op_update_inode on a directory.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_post_op_update_inode() is really only meant to be used if we expect the
inode and its attributes to have changed in some way.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
- NFS_READTIME, NFS_CHANGE_ATTR are completely unused.
- Inline the few remaining uses of NFS_ATTRTIMEO, and remove.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
LOOKUP returns the directory post-op attributes whether or not the
operation was successful.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We don't need to call nfs_revalidate_inode() on the directory if we already
know that the verifiers don't match.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The lower level routines in fs/nfs/proc.c, fs/nfs/nfs3proc.c and
fs/nfs/nfs4proc.c should already be dealing with the revalidation issues.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The fact that we're in the process of modifying the inode does not mean
that we should not invalidate the attribute and data caches. The defensive
thing is to always invalidate when we're confronted with inode
mtime/ctime or change_attribute updates that we do not immediately
recognise.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the ->lookup() call causes the directory verifier to change, then there
is still no need to use the old verifier, since our dentry has been
verified.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If nfs_post_op_update_inode fails because the server didn't return any
attributes, then we let the subsequent inode revalidation update
cache_change_attribute.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>