Pull misc vfs updates from Al Viro:
"All kinds of stuff. That probably should've been 5 or 6 separate
branches, but by the time I'd realized how large and mixed that bag
had become it had been too close to -final to play with rebasing.
Some fs/namei.c cleanups there, memdup_user_nul() introduction and
switching open-coded instances, burying long-dead code, whack-a-mole
of various kinds, several new helpers for ->llseek(), assorted
cleanups and fixes from various people, etc.
One piece probably deserves special mention - Neil's
lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets
called without ->i_mutex and tries to avoid ever taking it. That, of
course, means that it's not useful for any directory modifications,
but things like getting inode attributes in nfds readdirplus are fine
with that. I really should've asked for moratorium on lookup-related
changes this cycle, but since I hadn't done that early enough... I
*am* asking for that for the coming cycle, though - I'm going to try
and get conversion of i_mutex to rwsem with ->lookup() done under lock
taken shared.
There will be a patch closer to the end of the window, along the lines
of the one Linus had posted last May - mechanical conversion of
->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
inode_is_locked()/inode_lock_nested(). To quote Linus back then:
-----
| This is an automated patch using
|
| sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
| sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
| sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[ ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
| sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
| sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
|
| with a very few manual fixups
-----
I'm going to send that once the ->i_mutex-affecting stuff in -next
gets mostly merged (or when Linus says he's about to stop taking
merges)"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
nfsd: don't hold i_mutex over userspace upcalls
fs:affs:Replace time_t with time64_t
fs/9p: use fscache mutex rather than spinlock
proc: add a reschedule point in proc_readfd_common()
logfs: constify logfs_block_ops structures
fcntl: allow to set O_DIRECT flag on pipe
fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
fs: xattr: Use kvfree()
[s390] page_to_phys() always returns a multiple of PAGE_SIZE
nbd: use ->compat_ioctl()
fs: use block_device name vsprintf helper
lib/vsprintf: add %*pg format specifier
fs: use gendisk->disk_name where possible
poll: plug an unused argument to do_poll
amdkfd: don't open-code memdup_user()
cdrom: don't open-code memdup_user()
rsxx: don't open-code memdup_user()
mtip32xx: don't open-code memdup_user()
[um] mconsole: don't open-code memdup_user_nul()
[um] hostaudio: don't open-code memdup_user()
...
Pull vfs xattr updates from Al Viro:
"Andreas' xattr cleanup series.
It's a followup to his xattr work that went in last cycle; -0.5KLoC"
* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
xattr handlers: Simplify list operation
ocfs2: Replace list xattr handler operations
nfs: Move call to security_inode_listsecurity into nfs_listxattr
xfs: Change how listxattr generates synthetic attributes
tmpfs: listxattr should include POSIX ACL xattrs
tmpfs: Use xattr handler infrastructure
btrfs: Use xattr handler infrastructure
vfs: Distinguish between full xattr names and proper prefixes
posix acls: Remove duplicate xattr name definitions
gfs2: Remove gfs2_xattr_acl_chmod
vfs: Remove vfs_xattr_cmp
Change the list operation to only return whether or not an attribute
should be listed. Copying the attribute names into the buffer is moved
to the callers.
Since the result only depends on the dentry and not on the attribute
name, we do not pass the attribute name to list operations.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new method: ->get_link(); replacement of ->follow_link(). The differences
are:
* inode and dentry are passed separately
* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.
It's a flagday change - the old method is gone, all in-tree instances
converted. Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode. That'll change
in the next commits.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
kmap() in page_follow_link_light() needed to go - allowing to hold
an arbitrary number of kmaps for long is a great way to deadlocking
the system.
new helper (inode_nohighmem(inode)) needs to be used for pagecache
symlinks inodes; done for all in-tree cases. page_follow_link_light()
instrumented to yell about anything missed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add an additional "name" field to struct xattr_handler. When the name
is set, the handler matches attributes with exactly that name. When the
prefix is set instead, the handler matches attributes with the given
prefix and with a non-empty suffix.
This patch should avoid bugs like the one fixed in commit c361016a in
the future.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that the xattr handler is passed to the xattr handler operations, we
have access to the attribute name prefix, so simplify
f2fs_xattr_generic_list.
Also, f2fs_xattr_advise_list is only ever called for
f2fs_xattr_advise_handler; there is no need to double check for that.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Changman Lee <cm224.lee@samsung.com>
Cc: Chao Yu <chao2.yu@samsung.com>
Cc: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The xattr_handler operations are currently all passed a file system
specific flags value which the operations can use to disambiguate between
different handlers; some file systems use that to distinguish the xattr
namespace, for example. In some oprations, it would be useful to also have
access to the handler prefix. To allow that, pass a pointer to the handler
to operations instead of the flags value alone.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new_valid_dev() always returns 1, so the !new_valid_dev() check is not
needed. Remove it.
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Changman Lee <cm224.lee@samsung.com>
Cc: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull security subsystem update from James Morris:
"This is mostly maintenance updates across the subsystem, with a
notable update for TPM 2.0, and addition of Jarkko Sakkinen as a
maintainer of that"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (40 commits)
apparmor: clarify CRYPTO dependency
selinux: Use a kmem_cache for allocation struct file_security_struct
selinux: ioctl_has_perm should be static
selinux: use sprintf return value
selinux: use kstrdup() in security_get_bools()
selinux: use kmemdup in security_sid_to_context_core()
selinux: remove pointless cast in selinux_inode_setsecurity()
selinux: introduce security_context_str_to_sid
selinux: do not check open perm on ftruncate call
selinux: change CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE default
KEYS: Merge the type-specific data with the payload data
KEYS: Provide a script to extract a module signature
KEYS: Provide a script to extract the sys cert list from a vmlinux file
keys: Be more consistent in selection of union members used
certs: add .gitignore to stop git nagging about x509_certificate_list
KEYS: use kvfree() in add_key
Smack: limited capability for changing process label
TPM: remove unnecessary little endian conversion
vTPM: support little endian guests
char: Drop owner assignment from i2c_driver
...
In f2fs_shrink_extent_tree we should stop shrink flow if we have already
shrunk enough nodes in extent cache.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, in ->symlink of f2fs, we kept the fixed invoking order between
f2fs_add_link and page_symlink since we should init node info firstly
in f2fs_add_link, then such node info can be used in page_symlink.
But we didn't fix to release meta info which was done before page_symlink
in our error path, so this will leave us corrupt symlink entry in its
parent's dentry page. Fix this issue by adding f2fs_unlink in the error
path for removing such linking.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Atomic write page can be GCed, after committing this kind of page, we should
clear the GCed flag for it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we got failure during commit_atomic_write, abort_volatile_write will be
called, but will not drop the inmemory pages due to no FI_ATOMIC_FILE.
Actually, there is no reason to check the flag in abort_volatile_write.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Users expect bmap will give allocated block addresses.
Let's play likewise ext4.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Once f2fs_gc is done, wait_ms is changed once more.
So, its tracepoint would be located after it.
Reported-by: He YunLei <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
different competitors
Since we use different page cache (normally inode's page cache for R/W
and meta inode's page cache for GC) to cache the same physical block
which is belong to an encrypted inode. Writeback of these two page
cache should be exclusive, but now we didn't handle writeback state
well, so there may be potential racing problem:
a)
kworker: f2fs_gc:
- f2fs_write_data_pages
- f2fs_write_data_page
- do_write_data_page
- write_data_page
- f2fs_submit_page_mbio
(page#1 in inode's page cache was queued
in f2fs bio cache, and be ready to write
to new blkaddr)
- gc_data_segment
- move_encrypted_block
- pagecache_get_page
(page#2 in meta inode's page cache
was cached with the invalid datas
of physical block located in new
blkaddr)
- f2fs_submit_page_mbio
(page#1 was submitted, later, page#2
with invalid data will be submitted)
b)
f2fs_gc:
- gc_data_segment
- move_encrypted_block
- f2fs_submit_page_mbio
(page#1 in meta inode's page cache was
queued in f2fs bio cache, and be ready
to write to new blkaddr)
user thread:
- f2fs_write_begin
- f2fs_submit_page_bio
(we submit the request to block layer
to update page#2 in inode's page cache
with physical block located in new
blkaddr, so here we may read gabbage
data from new blkaddr since GC hasn't
writebacked the page#1 yet)
This patch fixes above potential racing problem for encrypted inode.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After finishing building free nid cache, we will try to readahead
asynchronously 4 more pages for the next reloading, the count of
readahead nid pages is fixed.
In some case, like SMR drive, read less sectors with fixed count
each time we trigger RA may be low efficient, since we will face
high seeking overhead, so we'd better let user to configure this
parameter from sysfs in specific workload.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When there is no free nid in nid cache, all new node allocaters stop their
job to wait for reloading of free nids, however reloading is synchronous as
we will read 4 NAT pages for building nid cache, it cause the long latency.
This patch tries to readahead more NAT pages with READA request flag after
reloading of free nids. It helps to improve performance when users allocate
node id intensively.
Env: Sandisk 32G sd card
time for i in `seq 1 60000`; { echo -n > /mnt/f2fs/$i; echo XXXXXX > /mnt/f2fs/$i;}
Before:
real 0m2.814s
user 0m1.220s
sys 0m1.536s
After:
real 0m2.711s
user 0m1.136s
sys 0m1.568s
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now, we use ra_meta_pages to reads continuous physical blocks as much as
possible to improve performance of following reads. However, ra_meta_pages
uses a synchronous readahead approach by submitting bio with READ, as READ
is with high priority, it can not be used in the case of preloading blocks,
and it's not sure when these RAed pages will be used.
This patch supports asynchronous readahead in ra_meta_pages by tagging bio
with READA flag in order to allow preloading.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In recovery or checkpoint flow, we grab pages temperarily in meta inode's
mapping for caching temperary data, actually, datas in these pages were
not meta data of f2fs, but still we tag them with REQ_META flag. However,
lower device like eMMC may do some optimization for data of such type.
So in order to avoid wrong optimization, we'd better remove such flag
for temperary non-meta pages.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a tracepoint for f2fs_read_data_pages to trace when pages
are readahead by VFS.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For normal inodes, their pages are allocated with __GFP_FS, which can cause
filesystem calls when reclaiming memory.
This can incur a dead lock condition accordingly.
So, this patch addresses this problem by introducing
f2fs_grab_cache_page(.., bool for_write), which calls
grab_cache_page_write_begin() with AOP_FLAG_NOFS.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The f2fs_collapse_range and f2fs_insert_range changes the block addresses
directly. But that can cause uncovered SSA updates.
In that case, we need to give up to change the block addresses and do buffered
writes to keep filesystem consistency.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The periodic checkpoint can resolve the previous issue.
So, now we can use this again to improve the reported performance regression:
https://lkml.org/lkml/2015/10/8/20
This reverts commit 15bec0ff5a9ba6d203178fa8772259df6207942a.
This patch introduces F2FS_GOING_DOWN_METAFLUSH which flushes meta pages like
SSA blocks and then blocks all the writes.
This can be used by power-failure tests.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a periodic checkpoint feature.
Note that, this is not enforcing to conduct checkpoints very strictly in terms
of trigger timing, instead just hope to help user experiences.
The default value is 60 seconds.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a tracepoint to monitor background gc behaviors.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduce a new ioctl for those users who want to trigger
checkpoint from userspace through ioctl.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch drops in batches gc triggered through ioctl, since user
can easily control the gc by designing the loop around the ->ioctl.
We support synchronous gc by forcing using FG_GC in f2fs_gc, so with
it, user can make sure that in this round all blocks gced were
persistent in the device until ioctl returned.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When searching victim during gc, if there are no dirty segments in
filesystem, we will still take the time to search the whole dirty segment
map, it's not needed, it's better to skip in this condition.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When doing gc, we search a victim in dirty map, starting from position of
last victim, we will reset the current searching position until we touch
the end of dirty map, and then search the whole diryt map. So sometimes we
will search the range [victim, last] twice, it's redundant, this patch
avoids this issue.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Our hit stat of extent cache will increase all the time until remount,
and we use atomic_t type for the stat variable, so it may easily incur
overflow when we query extent cache frequently in a long time running
fs.
So to avoid that, this patch uses atomic64_t for hit stat variables.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we do not call get_victim first, we cannot get a new victim for retrial
path.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to maintain the right section count freed in garbage
collecting when triggering a foreground gc.
Besides, when a foreground gc is running on current selected section, once
we fail to gc one segment, it's better to abandon gcing the left segments
in current section, because anyway we will select next victim for
foreground gc, so gc on the left segments in previous section will become
overhead and also cause the long latency for caller.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to update ctime and atime correctly when truncating
larger in ->setattr.
The bug is reported by xfstest generic/313 as below:
generic/313 2s ... - output mismatch (see ./results/generic/313.out.bad)
--- tests/generic/313.out 2015-08-04 15:28:53.430798882 +0800
+++ results/generic/313.out.bad 2015-09-28 17:04:27.294278016 +0800
@@ -1,2 +1,4 @@
QA output created by 313
Silence is golden
+ctime not updated after truncate up
+mtime not updated after truncate up
...
(Run 'diff -u tests/generic/313.out tests/generic/313.out.bad' to see the entire diff)
Ran: generic/313
Failures: generic/313
Failed 1 of 1 tests
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, we skip dentry block writes when wbc is SYNC_NONE with no memory
pressure and the number of dirty pages is pretty small.
But, we didn't skip for normal data writes, which gives us not much big impact
on overall performance.
Moreover, by skipping some data writes, kworker falls into infinite loop to try
to write blocks, when many dir inodes have only one dentry block.
So, this patch removes skipping data writes.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Protecting recovery flow by using cp_rwsem is not needed, since we have
prevent triggering any checkpoint by locking cp_mutex previously.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In update_sit_info, we use div_u64 to handle 'u64 divide u64' case, but
div_u64 can only handle 32-bits divisor, so our divisor with u64 type
passed to div_u64 will overflow, result in the wrong calculation when
show debug info of f2fs as below:
BDF: 464, avg. vblocks: 23509
(BDF should never exceed 100)
So change to use div64_u64 to handle this case correctly.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>