Failure of the layoutreturn allocation fails is not a good reason to
mark the pnfs_layout_hdr as having failed a layoutget or i/o. Just
exit cleanly.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It serves no purpose that the test for whether or not we have valid
layout segments doesn't already serve.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Once all the affected layout segments have been freed up, clear the
NFS_LAYOUT_BULK_RECALL flag so that we can reuse the pnfs_layout_hdr
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We already have a mechanism for blocking LAYOUTGET by means of the
plh_block_lgets counter. The only "service" that NFS_LAYOUT_DESTROYED
provides at this point is to block layoutget once the layout segment
list is empty, which basically means that you have to wait until
the pnfs_layout_hdr is destroyed before you can do pNFS on that file
again.
This patch enables the reuse of the pnfs_layout_hdr if the layout
segment list is empty.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that the reference count for pnfs_layout_hdr reverts to the
original value after a call to pnfs_layout_remove_lseg().
Note that the caller is expected to hold a reference to the struct
pnfs_layout_hdr.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There is no longer a need to use pnfs_free_lseg_list(). Just call
pnfs_free_lseg() directly.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move the code into pnfs_free_layout_hdr(), and add checks to
get_layout_by_fh_locked to ensure that they don't reference a layout
that is being freed.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
None of the existing pNFS layout drivers seem to require the inode
to be locked while they free the layout header.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Each layout segment already holds a reference to the pnfs_layout_hdr,
so there is no need to hold an extra reference that is released once
the last layout segment is freed.
Ensure that pnfs_find_alloc_layout() always returns a reference
to the pnfs_layout_hdr, which will be matched by the final call to
pnfs_put_layout_hdr() in pnfs_update_layout().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The latter name is more descriptive of the actual function.
Also rename pnfs_insert_layout to pnfs_layout_insert_lseg.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Instead of resetting the inode MDS threshold counters when we mark
the layout for destruction, do it as part of freeing the layout.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In all cases where we set NFS_LAYOUT_INVALID, we also set NFS_LAYOUT_DESTROYED.
Furthermore, in all cases where we test for NFS_LAYOUT_INVALID, we should
also be testing for NFS_LAYOUT_DESTROYED, since the latter means that
we hold no valid layout segments.
Ergo the two are redundant.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we sleep after dropping the inode->i_lock, then we are no longer
atomic with respect to the rpc_wake_up() call in pnfs_layout_remove_lseg().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If pnfs_layout_io_test_failed() authorises a retry of the failed layoutgets,
we should clear the existing layout segments so that we start afresh. Do
this in pnfs_layout_io_set_failed().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We want to cache the pnfs_layout_hdr after a layoutget or i/o
failure so that pnfs_update_layout() can find it and know when
it is time to retry.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we exit after the call to pnfs_find_alloc_layout(), we have to ensure
that we put the struct pnfs_layout_hdr.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In cases where the pNFS data server is just temporarily out of service,
we want to mark it as such, and then try again later. Typically that will
be in cases of network connection errors etc.
This patch allows us to mark the devices as being "unavailable" for such
transient errors, and will make them available for retries after a
2 minute timeout period.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we had to fall back to read/write through MDS, then assume that we should
retry pNFS after a suitable timeout period.
The following patch sets a timeout of 2 minutes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Dereferencing nfsi->layout in order to read plh_flags without holding
a spin lock is bug prone. Furthermore, the dprintk() tells you nothing
about whether or not the call succeeded.
Replace it with something that tells you about whether or not a valid
layout segment was returned for the inode in question.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we do return errors from nfs4_proc_layoutget() and that we
don't mark the layout as having failed if the error was due to a
signal or resource problem on the client side.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This is to ensure that we don't clear the NFS_CONTEXT_RESEND_WRITES
flag while there are still writes that haven't been resent.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the server reboots before it can commit the unstable writes to disk,
then nfs_commit_release_pages() will detect this when it compares the
verifier returned by COMMIT to the one returned by WRITE. When this
happens, the client needs to resend those writes in order to guarantee
that they make it to stable storage.
This patch adds a signalling mechanism to notify fsync() that it
needs to retry all writes before it can exit.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We want to be able to pass on the information that the page was not
dirtied under a lock. Instead of adding a flag parameter, do this
by passing a pointer to a 'struct nfs_lock_owner' that may be NULL.
Also reuse this structure in struct nfs_lock_context to carry the
fl_owner_t and pid_t.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We want to be able to distinguish between allocation failures, and
the case where the lock context is not needed (because there are no
locks).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In nfs4_create_sec_client, 'flavor' can hold a negative error
code (returned from nfs4_negotiate_security), even though it
is an 'enum' and hence unsigned.
The code is careful to cast it to an (int) before testing if it
is negative, however it doesn't cast to an (int) before calling
ERR_PTR.
On a machine where "void*" is larger than "int", this results in
the unsigned equivalent of -1 (e.g. 0xffffffff) being converted
to a pointer. Subsequent code determines that this is not
negative, and so dereferences it with predictable results.
So: cast 'flavor' to a (signed) int before passing to ERR_PTR.
cc: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In case of error, the function rpcauth_create() returns ERR_PTR()
and never returns NULL pointer. The NULL test in the return value
check should be replaced with IS_ERR().
dpatch engine is used to auto generated this patch.
(https://github.com/weiyj/dpatch)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
IBM reported a soft lockup after applying the fix for the rename_lock
deadlock. Commit c83ce989cb ("VFS: Fix the nfs sillyrename regression
in kernel 2.6.38") was found to be the culprit.
The nfs sillyrename fix used DCACHE_DISCONNECTED to indicate that the
dentry was killed. This flag can be set on non-killed dentries too,
which results in infinite retries when trying to traverse the dentry
tree.
This patch introduces a separate flag: DCACHE_DENTRY_KILLED, which is
only set in d_kill() and makes try_to_ascend() test only this flag.
IBM reported successful test results with this patch.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The unregister_sysctl_table() function hangs if all references to its
ctl_table_header structure are not dropped.
This can happen sometimes because of a leak in proc_sys_lookup():
proc_sys_lookup() gets a reference to the table via lookup_entry(), but
it does not release it when a subsequent call to sysctl_follow_link()
fails.
This patch fixes this leak by making sure the reference is always
dropped on return.
See also commit 076c3eed2c ("sysctl: Rewrite proc_sys_lookup
introducing find_entry and lookup_entry") which reorganized this code in
3.4.
Tested in Linux 3.4.4.
Signed-off-by: Francesco Ruggeri <fruggeri@aristanetworks.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull a btrfs revert from Chris Mason:
"My for-linus branch has one revert in the new quota code.
We're building up more fixes at etc for the next merge window, but I'm
keeping them out unless they are bigger regressions or have a huge
impact."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Revert "Btrfs: fix some error codes in btrfs_qgroup_inherit()"
Pull GFS2 fixes from Steven Whitehouse:
"Here are three GFS2 fixes for the current kernel tree. These are all
related to the block reservation code which was added at the merge
window. That code will be getting an update at the forthcoming merge
window too. In the mean time though there are a few smaller issues
which should be fixed.
The first patch resolves an issue with write sizes of greater than 32
bits with the size hinting code. The second ensures that the
allocation data structure is initialised when using xattrs and the
third takes into account allocations which may have been made by other
nodes which affect a reservation on the local node."
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
GFS2: Take account of blockages when using reserved blocks
GFS2: Fix missing allocation data for set/remove xattr
GFS2: Make write size hinting code common
shared memory mapping is dirtied and unmapped. The lower file was being
released when the eCryptfs file was closed and the dirtied pages could not be
written out.
- Adds a call to the lower filesystem's ->flush() from ecryptfs_flush().
- Fixes a regression, introduced in 2.6.39, when a file is renamed on top of
another file. The target file's inode was not being evicted and the space
taken by the file was not reclaimed until eCryptfs was unmounted.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABCgAGBQJQU14UAAoJENaSAD2qAscKrpwQAK+Xu3RAHPLO16bEjhrFZaIr
FY2z8yK5geFzHfmhDxUT7jovw8d2ghAjrE3XHf+PgrqfUqIZ6vo57GFmR8ZNS+30
exNOb1DffaRqIejStrFkz+Ek4/FbnaPD/FvJYgC8D3Bv04YllyNI9Euq49VkkOiy
v6jxAy4H+/MCGzYvR4K0jmKlOLuC47TtJ322jXTm0iNeJJH2+kNXm0WvE8dY7dYV
2QMvv11M1HMI39S4yIY0eBPwbloP3AoO++H3Id7/XWcIlaaeZrVA7HmV8z04DEWq
Nt2vceSbX4FVBxmLab5TzTftorskJRDLCtMaSM8opve+3Sxgx+wooLskgrHqJjZH
8AtGfs/ZzvUkKcYMNv+iMax0/KqkGKOhdJMmqc37aI8qkBo5IYXpIPU1ZTy/S2T8
t3fM8jEI4AOKFn7GlG0kZg7EXo0wWPw3A8X73tCQx8W6sJy5STvv07hFZspX/4nn
+qhlM7Qzpr1XXPEv9cKo4oXzqOmlSi1Jd2ueGANNGJvche90uCWgmSZCfNNxGnPq
mLiiiV6KJ94xjjyf/ZWiKLryk5rv0fWiYXzutaORADn6WWwIVKX9IyvJfFI8y+M6
phREWNdOnyDFFA++h5QotWh3L7ZCL/HNeTpLDSD1ONxosVDhViviA2HrRlyl8LSc
soDzN7w+yOh+QjlN4sO/
=UC2V
-----END PGP SIGNATURE-----
Merge tag 'ecryptfs-3.6-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
Pull ecryptfs fixes from Tyler Hicks:
- Fixes a regression, introduced in 3.6-rc1, when a file is closed
before its shared memory mapping is dirtied and unmapped. The lower
file was being released when the eCryptfs file was closed and the
dirtied pages could not be written out.
- Adds a call to the lower filesystem's ->flush() from
ecryptfs_flush().
- Fixes a regression, introduced in 2.6.39, when a file is renamed on
top of another file. The target file's inode was not being evicted
and the space taken by the file was not reclaimed until eCryptfs was
unmounted.
* tag 'ecryptfs-3.6-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
eCryptfs: Copy up attributes of the lower target inode after rename
eCryptfs: Call lower ->flush() from ecryptfs_flush()
eCryptfs: Write out all dirty pages just before releasing the lower file
This reverts commit 5986802c2f.
Both paths are not error paths but regular cases where non-qgroup
subvols are involved.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We already use them for openat() and friends, but fstat() also wants to
be able to use O_PATH file descriptors. This should make it more
directly comparable to the O_SEARCH of Solaris.
Note that you could already do the same thing with "fstatat()" and an
empty path, but just doing "fstat()" directly is simpler and faster, so
there is no reason not to just allow it directly.
See also commit 332a2e1244, which did the same thing for fchdir, for
the same reasons.
Reported-by: ольга крыжановская <olga.kryzhanovska@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org # O_PATH introduced in 3.0+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After calling into the lower filesystem to do a rename, the lower target
inode's attributes were not copied up to the eCryptfs target inode. This
resulted in the eCryptfs target inode staying around, rather than being
evicted, because i_nlink was not updated for the eCryptfs inode. This
also meant that eCryptfs didn't do the final iput() on the lower target
inode so it stayed around, as well. This would result in a failure to
free up space occupied by the target file in the rename() operation.
Both target inodes would eventually be evicted when the eCryptfs
filesystem was unmounted.
This patch calls fsstack_copy_attr_all() after the lower filesystem
does its ->rename() so that important inode attributes, such as i_nlink,
are updated at the eCryptfs layer. ecryptfs_evict_inode() is now called
and eCryptfs can drop its final reference on the lower inode.
http://launchpad.net/bugs/561129
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Tested-by: Colin Ian King <colin.king@canonical.com>
Cc: <stable@vger.kernel.org> [2.6.39+]
Since eCryptfs only calls fput() on the lower file in
ecryptfs_release(), eCryptfs should call the lower filesystem's
->flush() from ecryptfs_flush().
If the lower filesystem implements ->flush(), then eCryptfs should try
to flush out any dirty pages prior to calling the lower ->flush(). If
the lower filesystem does not implement ->flush(), then eCryptfs has no
need to do anything in ecryptfs_flush() since dirty pages are now
written out to the lower filesystem in ecryptfs_release().
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Fixes a regression caused by:
821f749 eCryptfs: Revert to a writethrough cache model
That patch reverted some code (specifically, 32001d6f) that was
necessary to properly handle open() -> mmap() -> close() -> dirty pages
-> munmap(), because the lower file could be closed before the dirty
pages are written out.
Rather than reapplying 32001d6f, this approach is a better way of
ensuring that the lower file is still open in order to handle writing
out the dirty pages. It is called from ecryptfs_release(), while we have
a lock on the lower file pointer, just before the lower file gets the
final fput() and we overwrite the pointer.
https://launchpad.net/bugs/1047261
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Reported-by: Artemy Tregubenko <me@arty.name>
Tested-by: Artemy Tregubenko <me@arty.name>
Tested-by: Colin Ian King <colin.king@canonical.com>
The claim_reserved_blks() function was not taking account of
the possibility of "blockages" while performing allocation.
This can be caused by another node allocating something in
the same extent which has been reserved locally.
This patch tests for this condition and then skips the remainder
of the reservation in this case. This is a relatively rare event,
so that it should not affect the general performance improvement
which the block reservations provide.
The claim_reserved_blks() function also appears not to be able
to deal with reservations which cross bitmap boundaries, but
that can be dealt with in a future patch since we don't generate
boundary crossing reservations currently.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: David Teigland <teigland@redhat.com>
Cc: Bob Peterson <rpeterso@redhat.com>
This collects up the write size hinting code which is used by the
block reservation subsystem into a single function. At the same
time this also corrects the rounding for this calculation.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
- Final (hopefully) fix for the range checking code in NFSv4 getacl. This
should fix the Oopses being seen when the acl size is close to PAGE_SIZE.
- Fix a regression with the legacy binary mount code
- Fix a regression in the readdir cookieverf initialisation
- Fix an RPC over UDP regression
- Ensure that we report all errors in the NFSv4 open code
- Ensure that fsync() reports all relevant synchronisation errors.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJQUN+KAAoJEGcL54qWCgDyHGcQAKj7MYVDIjhdmsVGGNWXUCnf
X0LVg/ajh+vjusK+hmquzcJikZqgce5IU5DW4vcFr1X8BgP+R51UVvU0KksByD5H
ourV2JVCztAQzQ4WWOsZAGqN0tooJUjyEjl4lEiDsQCF4Nk1HWbuCHeYuX74OToZ
jrgedj0EZ6zb7TOizvbgU/7lI+FKu3Hlw6+u27M9phtSuefJdYSHZHYVMOX81qPh
k0zgZ4tuLIaDuBB84iCrPwNt9icnevq6cIc+AGluI6xhDw+foPvUaUR+OUI420IZ
tunNzP2So+nNoyjEiyMVENaCdEyA75XAmmGHTUUdBiVOsMV4HF/TqvTtSsjk2mN1
FbZVvtjD6srjsQaKdVmqMIZBdhY9LSMLIQVqb4H2rYP6Mwq06WTuyCxf5YhzFfoy
2tai7JuqBkTAWfKB8ESWywV6Qk/MkUWRAOBO6ksS66gAwpcFDj6nfeAdwaEmoYKc
uzLUIRZaclPMZf661cs1fWeFV5XOnCL7je4owgTRGs7MHooWHPcC3273fEJqnhFz
5MkC7nfmUiGcdO1v0mfYTEtMj9Pp9icBoZcVTGn4eZIHzvhhZOx//8LhyBfS+jll
bKjaLZ1rErvIqwnSGcB7PK2yBYY9P6ZaxWjOrAAncZmiOxfhN0hvCo54jNOr/VZ+
atsDEAuqSTeK7ouBqyO4
=e5yE
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.6-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- Final (hopefully) fix for the range checking code in NFSv4 getacl.
This should fix the Oopses being seen when the acl size is close to
PAGE_SIZE.
- Fix a regression with the legacy binary mount code
- Fix a regression in the readdir cookieverf initialisation
- Fix an RPC over UDP regression
- Ensure that we report all errors in the NFSv4 open code
- Ensure that fsync() reports all relevant synchronisation errors.
* tag 'nfs-for-3.6-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: fsync() must exit with an error if page writeback failed
SUNRPC: Fix a UDP transport regression
NFS: return error from decode_getfh in decode open
NFSv4: Fix buffer overflow checking in __nfs4_get_acl_uncached
NFSv4: Fix range checking in __nfs4_get_acl_uncached and __nfs4_proc_set_acl
NFS: Fix a problem with the legacy binary mount code
NFS: Fix the initialisation of the readdir 'cookieverf' array