When initializing an fs-verity hash algorithm, also initialize a mempool
that contains a single preallocated hash request object. Then replace
the direct calls to ahash_request_alloc() and ahash_request_free() with
allocating and freeing from this mempool.
This eliminates the possibility of the allocation failing, which is
desirable for the I/O path.
This doesn't cause deadlocks because there's no case where multiple hash
requests are needed at a time to make forward progress.
Link: https://lore.kernel.org/r/20191231175545.20709-1-ebiggers@kernel.org
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
When fs-verity verifies data pages, currently it reads each Merkle tree
page synchronously using read_mapping_page().
Therefore, when the Merkle tree pages aren't already cached, fs-verity
causes an extra 4 KiB I/O request for every 512 KiB of data (assuming
that the Merkle tree uses SHA-256 and 4 KiB blocks). This results in
more I/O requests and performance loss than is strictly necessary.
Therefore, implement readahead of the Merkle tree pages.
For simplicity, we take advantage of the fact that the kernel already
does readahead of the file's *data*, just like it does for any other
file. Due to this, we don't really need a separate readahead state
(struct file_ra_state) just for the Merkle tree, but rather we just need
to piggy-back on the existing data readahead requests.
We also only really need to bother with the first level of the Merkle
tree, since the usual fan-out factor is 128, so normally over 99% of
Merkle tree I/O requests are for the first level.
Therefore, make fsverity_verify_bio() enable readahead of the first
Merkle tree level, for up to 1/4 the number of pages in the bio, when it
sees that the REQ_RAHEAD flag is set on the bio. The readahead size is
then passed down to ->read_merkle_tree_page() for the filesystem to
(optionally) implement if it sees that the requested page is uncached.
While we're at it, also make build_merkle_tree_level() set the Merkle
tree readahead size, since it's easy to do there.
However, for now don't set the readahead size in fsverity_verify_page(),
since currently it's only used to verify holes on ext4 and f2fs, and it
would need parameters added to know how much to read ahead.
This patch significantly improves fs-verity sequential read performance.
Some quick benchmarks with 'cat'-ing a 250MB file after dropping caches:
On an ARM64 phone (using sha256-ce):
Before: 217 MB/s
After: 263 MB/s
(compare to sha256sum of non-verity file: 357 MB/s)
In an x86_64 VM (using sha256-avx2):
Before: 173 MB/s
After: 215 MB/s
(compare to sha256sum of non-verity file: 223 MB/s)
Link: https://lore.kernel.org/r/20200106205533.137005-1-ebiggers@kernel.org
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
When it builds the first level of the Merkle tree, FS_IOC_ENABLE_VERITY
sequentially reads each page of the file using read_mapping_page().
This works fine if the file's data is already in pagecache, which should
normally be the case, since this ioctl is normally used immediately
after writing out the file.
But in any other case this implementation performs very poorly, since
only one page is read at a time.
Fix this by implementing readahead using the functions from
mm/readahead.c.
This improves performance in the uncached case by about 20x, as seen in
the following benchmarks done on a 250MB file (on x86_64 with SHA-NI):
FS_IOC_ENABLE_VERITY uncached (before) 3.299s
FS_IOC_ENABLE_VERITY uncached (after) 0.160s
FS_IOC_ENABLE_VERITY cached 0.147s
sha256sum uncached 0.191s
sha256sum cached 0.145s
Note: we could instead switch to kernel_read(). But that would mean
we'd no longer be hashing the data directly from the pagecache, which is
a nice optimization of its own. And using kernel_read() would require
allocating another temporary buffer, hashing the data and tree pages
separately, and explicitly zero-padding the last page -- so it wouldn't
really be any simpler than direct pagecache access, at least for now.
Link: https://lore.kernel.org/r/20200106205410.136707-1-ebiggers@kernel.org
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
When an encrypted directory is listed without the key, the filesystem
must show "no-key names" that uniquely identify directory entries, are
at most 255 (NAME_MAX) bytes long, and don't contain '/' or '\0'.
Currently, for short names the no-key name is the base64 encoding of the
ciphertext filename, while for long names it's the base64 encoding of
the ciphertext filename's dirhash and second-to-last 16-byte block.
This format has the following problems:
- Since it doesn't always include the dirhash, it's incompatible with
directories that will use a secret-keyed dirhash over the plaintext
filenames. In this case, the dirhash won't be computable from the
ciphertext name without the key, so it instead must be retrieved from
the directory entry and always included in the no-key name.
Casefolded encrypted directories will use this type of dirhash.
- It's ambiguous: it's possible to craft two filenames that map to the
same no-key name, since the method used to abbreviate long filenames
doesn't use a proper cryptographic hash function.
Solve both these problems by switching to a new no-key name format that
is the base64 encoding of a variable-length structure that contains the
dirhash, up to 149 bytes of the ciphertext filename, and (if any bytes
remain) the SHA-256 of the remaining bytes of the ciphertext filename.
This ensures that each no-key name contains everything needed to find
the directory entry again, contains only legal characters, doesn't
exceed NAME_MAX, is unambiguous unless there's a SHA-256 collision, and
that we only take the performance hit of SHA-256 on very long filenames.
Note: this change does *not* address the existing issue where users can
modify the 'dirhash' part of a no-key name and the filesystem may still
accept the name.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
[EB: improved comments and commit message, fixed checking return value
of base64_decode(), check for SHA-256 error, continue to set disk_name
for short names to keep matching simpler, and many other cleanups]
Link: https://lore.kernel.org/r/20200120223201.241390-7-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
In order to support a new dirhash method that is a secret-keyed hash
over the plaintext filenames (which will be used by encrypted+casefolded
directories on ext4 and f2fs), fscrypt will be switching to a new no-key
name format that always encodes the dirhash in the name.
UBIFS isn't happy with this because it has assertions that verify that
either the hash or the disk name is provided, not both.
Change it to use the disk name if one is provided, even if a hash is
available too; else use the hash.
Link: https://lore.kernel.org/r/20200120223201.241390-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
If userspace provides an invalid fscrypt no-key filename which encodes a
hash value with any of the UBIFS node type bits set (i.e. the high 3
bits), gracefully report ENOENT rather than triggering ubifs_assert().
Test case with kvm-xfstests shell:
. fs/ubifs/config
. ~/xfstests/common/encrypt
dev=$(__blkdev_to_ubi_volume /dev/vdc)
ubiupdatevol $dev -t
mount $dev /mnt -t ubifs
mkdir /mnt/edir
xfs_io -c set_encpolicy /mnt/edir
rm /mnt/edir/_,,,,,DAAAAAAAAAAAAAAAAAAAAAAAAAA
With the bug, the following assertion fails on the 'rm' command:
[ 19.066048] UBIFS error (ubi0:0 pid 379): ubifs_assert_failed: UBIFS assert failed: !(hash & ~UBIFS_S_KEY_HASH_MASK), in fs/ubifs/key.h:170
Fixes: f4f61d2cc6 ("ubifs: Implement encrypted filenames")
Cc: <stable@vger.kernel.org> # v4.10+
Link: https://lore.kernel.org/r/20200120223201.241390-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Now that there's sometimes a second type of per-file key (the dirhash
key), clarify some function names, macros, and documentation that
specifically deal with per-file *encryption* keys.
Link: https://lore.kernel.org/r/20200120223201.241390-4-ebiggers@kernel.org
Reviewed-by: Daniel Rosenberg <drosen@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
When we allow indexed directories to use both encryption and
casefolding, for the dirhash we can't just hash the ciphertext filenames
that are stored on-disk (as is done currently) because the dirhash must
be case insensitive, but the stored names are case-preserving. Nor can
we hash the plaintext names with an unkeyed hash (or a hash keyed with a
value stored on-disk like ext4's s_hash_seed), since that would leak
information about the names that encryption is meant to protect.
Instead, if we can accept a dirhash that's only computable when the
fscrypt key is available, we can hash the plaintext names with a keyed
hash using a secret key derived from the directory's fscrypt master key.
We'll use SipHash-2-4 for this purpose.
Prepare for this by deriving a SipHash key for each casefolded encrypted
directory. Make sure to handle deriving the key not only when setting
up the directory's fscrypt_info, but also in the case where the casefold
flag is enabled after the fscrypt_info was already set up. (We could
just always derive the key regardless of casefolding, but that would
introduce unnecessary overhead for people not using casefolding.)
Signed-off-by: Daniel Rosenberg <drosen@google.com>
[EB: improved commit message, updated fscrypt.rst, squashed with change
that avoids unnecessarily deriving the key, and many other cleanups]
Link: https://lore.kernel.org/r/20200120223201.241390-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Casefolded encrypted directories will use a new dirhash method that
requires a secret key. If the directory uses a v2 encryption policy,
it's easy to derive this key from the master key using HKDF. However,
v1 encryption policies don't provide a way to derive additional keys.
Therefore, don't allow casefolding on directories that use a v1 policy.
Specifically, make it so that trying to enable casefolding on a
directory that has a v1 policy fails, trying to set a v1 policy on a
casefolded directory fails, and trying to open a casefolded directory
that has a v1 policy (if one somehow exists on-disk) fails.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
[EB: improved commit message, updated fscrypt.rst, and other cleanups]
Link: https://lore.kernel.org/r/20200120223201.241390-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
fname_encrypt() is a global function, due to being used in both fname.c
and hooks.c. So it should be prefixed with "fscrypt_", like all the
other global functions in fs/crypto/.
Link: https://lore.kernel.org/r/20200120071736.45915-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
When an encryption key can't be fully removed due to file(s) protected
by it still being in-use, we shouldn't really print the path to one of
these files to the kernel log, since parts of this path are likely to be
encrypted on-disk, and (depending on how the system is set up) the
confidentiality of this path might be lost by printing it to the log.
This is a trade-off: a single file path often doesn't matter at all,
especially if it's a directory; the kernel log might still be protected
in some way; and I had originally hoped that any "inode(s) still busy"
bugs (which are security weaknesses in their own right) would be quickly
fixed and that to do so it would be super helpful to always know the
file path and not have to run 'find dir -inum $inum' after the fact.
But in practice, these bugs can be hard to fix (e.g. due to asynchronous
process killing that is difficult to eliminate, for performance
reasons), and also not tied to specific files, so knowing a file path
doesn't necessarily help.
So to be safe, for now let's just show the inode number, not the path.
If someone really wants to know a path they can use 'find -inum'.
Fixes: b1c0ec3599f4 ("fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl")
Cc: <stable@vger.kernel.org> # v5.4+
Link: https://lore.kernel.org/r/20200120060732.390362-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Document that fscrypt_encrypt_pagecache_blocks() allocates the bounce
page from a mempool, and document what this means for the @gfp_flags
argument.
Link: https://lore.kernel.org/r/20191231181026.47400-1-ebiggers@kernel.org
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Currently fscrypt_zeroout_range() issues and waits on a bio for each
block it writes, which makes it very slow.
Optimize it to write up to 16 pages at a time instead.
Also add a function comment, and improve reliability by allowing the
allocations of the bio and the first ciphertext page to wait on the
corresponding mempools.
Link: https://lore.kernel.org/r/20191226160813.53182-1-ebiggers@kernel.org
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
The commit 643fa9612bf1 ("fscrypt: remove filesystem specific
build config option") removed modular support for fs/crypto. This
causes the Crypto API to be built-in whenever fscrypt is enabled.
This makes it very difficult for me to test modular builds of
the Crypto API without disabling fscrypt which is a pain.
As fscrypt is still evolving and it's developing new ties with the
fs layer, it's hard to build it as a module for now.
However, the actual algorithms are not required until a filesystem
is mounted. Therefore we can allow them to be built as modules.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/20191227024700.7vrzuux32uyfdgum@gondor.apana.org.au
Signed-off-by: Eric Biggers <ebiggers@google.com>
<linux/fscrypt.h> defines ioctl numbers using the macros like _IOWR()
which are defined in <linux/ioctl.h>, so <linux/ioctl.h> should be
included as a prerequisite, like it is in many other kernel headers.
In practice this doesn't really matter since anyone referencing these
ioctl numbers will almost certainly include <sys/ioctl.h> too in order
to actually call ioctl(). But we might as well fix this.
Link: https://lore.kernel.org/r/20191219185624.21251-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
fscrypt_valid_enc_modes() is only used by policy.c, so move it to there.
Also adjust the order of the checks to be more natural, matching the
numerical order of the constants and also keeping AES-256 (the
recommended default) first in the list.
No change in behavior.
Link: https://lore.kernel.org/r/20191209211829.239800-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
FSCRYPT_POLICY_FLAG_DIRECT_KEY is currently only allowed with Adiantum
encryption. But FS_IOC_SET_ENCRYPTION_POLICY allowed it in combination
with other encryption modes, and an error wasn't reported until later
when the encrypted directory was actually used.
Fix it to report the error earlier by validating the correct use of the
DIRECT_KEY flag in fscrypt_supported_policy(), similar to how we
validate the IV_INO_LBLK_64 flag.
Link: https://lore.kernel.org/r/20191209211829.239800-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Make fscrypt_supported_policy() call new functions
fscrypt_supported_v1_policy() and fscrypt_supported_v2_policy(), to
reduce the indentation level and make the code easier to read.
Also adjust the function comment to mention that whether the encryption
policy is supported can also depend on the inode.
No change in behavior.
Link: https://lore.kernel.org/r/20191209211829.239800-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add a function fscrypt_needs_contents_encryption() which takes an inode
and returns true if it's an encrypted regular file and the kernel was
built with fscrypt support.
This will allow replacing duplicated checks of IS_ENCRYPTED() &&
S_ISREG() on the I/O paths in ext4 and f2fs, while also optimizing out
unneeded code when !CONFIG_FS_ENCRYPTION.
Link: https://lore.kernel.org/r/20191209205021.231767-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
fscrypt_d_revalidate() and fscrypt_d_ops really belong in fname.c, since
they're specific to filenames encryption. crypto.c is for contents
encryption and general fs/crypto/ initialization and utilities.
Link: https://lore.kernel.org/r/20191209204359.228544-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Constify the struct fscrypt_hkdf parameter to fscrypt_hkdf_expand().
This makes it clearer that struct fscrypt_hkdf contains the key only,
not any per-request state.
Link: https://lore.kernel.org/r/20191209204054.227736-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
As a sanity check, verify that the allocated crypto_skcipher actually
has the ivsize that fscrypt is assuming it has. This will always be the
case unless there's a bug. But if there ever is such a bug (e.g. like
there was in earlier versions of the ESSIV conversion patch [1]) it's
preferable for it to be immediately obvious, and not rely on the
ciphertext verification tests failing due to uninitialized IV bytes.
[1] https://lkml.kernel.org/linux-crypto/20190702215517.GA69157@gmail.com/
Link: https://lore.kernel.org/r/20191209203918.225691-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Crypto API users shouldn't really be accessing struct skcipher_alg
directly. <crypto/skcipher.h> already has a function
crypto_skcipher_driver_name(), so use that instead.
No change in behavior.
Link: https://lore.kernel.org/r/20191209203810.225302-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Extend the FS_IOC_ADD_ENCRYPTION_KEY ioctl to allow the raw key to be
specified by a Linux keyring key, rather than specified directly.
This is useful because fscrypt keys belong to a particular filesystem
instance, so they are destroyed when that filesystem is unmounted.
Usually this is desired. But in some cases, userspace may need to
unmount and re-mount the filesystem while keeping the keys, e.g. during
a system update. This requires keeping the keys somewhere else too.
The keys could be kept in memory in a userspace daemon. But depending
on the security architecture and assumptions, it can be preferable to
keep them only in kernel memory, where they are unreadable by userspace.
We also can't solve this by going back to the original fscrypt API
(where for each file, the master key was looked up in the process's
keyring hierarchy) because that caused lots of problems of its own.
Therefore, add the ability for FS_IOC_ADD_ENCRYPTION_KEY to accept a
Linux keyring key. This solves the problem by allowing userspace to (if
needed) save the keys securely in a Linux keyring for re-provisioning,
while still using the new fscrypt key management ioctls.
This is analogous to how dm-crypt accepts a Linux keyring key, but the
key is then stored internally in the dm-crypt data structures rather
than being looked up again each time the dm-crypt device is accessed.
Use a custom key type "fscrypt-provisioning" rather than one of the
existing key types such as "logon". This is strongly desired because it
enforces that these keys are only usable for a particular purpose: for
fscrypt as input to a particular KDF. Otherwise, the keys could also be
passed to any kernel API that accepts a "logon" key with any service
prefix, e.g. dm-crypt, UBIFS, or (recently proposed) AF_ALG. This would
risk leaking information about the raw key despite it ostensibly being
unreadable. Of course, this mistake has already been made for multiple
kernel APIs; but since this is a new API, let's do it right.
This patch has been tested using an xfstest which I wrote to test it.
Link: https://lore.kernel.org/r/20191119222447.226853-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Export lookup_user_key() symbol in order to allow nvdimm passphrase
update to retrieve user injected keys.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Since ->d_compare() and ->d_hash() can be called in RCU-walk mode,
->d_parent and ->d_inode can be concurrently modified, and in
particular, ->d_inode may be changed to NULL. For f2fs_d_hash() this
resulted in a reproducible NULL dereference if a lookup is done in a
directory being deleted, e.g. with:
int main()
{
if (fork()) {
for (;;) {
mkdir("subdir", 0700);
rmdir("subdir");
}
} else {
for (;;)
access("subdir/file", 0);
}
}
... or by running the 't_encrypted_d_revalidate' program from xfstests.
Both repros work in any directory on a filesystem with the encoding
feature, even if the directory doesn't actually have the casefold flag.
I couldn't reproduce a crash in f2fs_d_compare(), but it appears that a
similar crash is possible there.
Fix these bugs by reading ->d_parent and ->d_inode using READ_ONCE() and
falling back to the case sensitive behavior if the inode is NULL.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 2c2eb7a300cd ("f2fs: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Do the name comparison for non-casefolded directories correctly.
This is analogous to ext4's commit 66883da1eee8 ("ext4: fix dcache
lookup of !casefolded directories").
Fixes: 2c2eb7a300cd ("f2fs: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently f2fs stats are only available from /d/f2fs/status. This patch
adds some of the f2fs stats to sysfs so that they are accessible even
when debugfs is not mounted.
The following sysfs nodes are added:
-/sys/fs/f2fs/<disk>/free_segments
-/sys/fs/f2fs/<disk>/cp_foreground_calls
-/sys/fs/f2fs/<disk>/cp_background_calls
-/sys/fs/f2fs/<disk>/gc_foreground_calls
-/sys/fs/f2fs/<disk>/gc_background_calls
-/sys/fs/f2fs/<disk>/moved_blocks_foreground
-/sys/fs/f2fs/<disk>/moved_blocks_background
-/sys/fs/f2fs/<disk>/avg_vblocks
Signed-off-by: Hridya Valsaraju <hridya@google.com>
[Jaegeuk Kim: allow STAT_FS without DEBUG_FS]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch merges the sysfs node documentation present in
Documentation/filesystems/f2fs.txt and
Documentation/ABI/testing/sysfs-fs-f2fs
and deletes the duplicate information from
Documentation/filesystems/f2fs.txt. This is to prevent having to update
both files when a new sysfs node is added for f2fs.
The patch also makes minor formatting changes to
Documentation/ABI/testing/sysfs-fs-f2fs.
Signed-off-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Mutex lock won't serialize callers, in order to avoid starving of unlucky
caller, let's use rwsem lock instead.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Setting 0x40 in /sys/fs/f2fs/dev/ipu_policy gives a way to turn off
bio cache, which is useufl to check whether block layer using hardware
encryption engine merges IOs correctly.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove the duplicate CP_UMOUNT enum and add the new CP_PAUSE
enum to show the checkpoint reason in the trace prints.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Without any form of coordination, any case where multiple allocations
from the same mempool are needed at a time to make forward progress can
deadlock under memory pressure.
This is the case for struct bio_post_read_ctx, as one can be allocated
to decrypt a Merkle tree page during fsverity_verify_bio(), which itself
is running from a post-read callback for a data bio which has its own
struct bio_post_read_ctx.
Fix this by freeing first bio_post_read_ctx before calling
fsverity_verify_bio(). This works because verity (if enabled) is always
the last post-read step.
This deadlock can be reproduced by trying to read from an encrypted
verity file after reducing NUM_PREALLOC_POST_READ_CTXS to 1 and patching
mempool_alloc() to pretend that pool->alloc() always fails.
Note that since NUM_PREALLOC_POST_READ_CTXS is actually 128, to actually
hit this bug in practice would require reading from lots of encrypted
verity files at the same time. But it's theoretically possible, as N
available objects doesn't guarantee forward progress when > N/2 threads
each need 2 objects at a time.
Fixes: 95ae251fe828 ("f2fs: add fs-verity support")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since allocating an object from a mempool never fails when
__GFP_DIRECT_RECLAIM (which is included in GFP_NOFS) is set, the check
for failure to allocate a bio_post_read_ctx is unnecessary. Remove it.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we hit an error during rename, we'll get two dentries in different
directories.
Chao adds to check the room in inline_dir which can avoid needless
inversion. This should be done by inode_lock(&old_dir).
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If kobject_init_and_add() failed, caller needs to invoke kobject_put()
to release kobject explicitly.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Youling reported in mailing list:
https://www.linuxquestions.org/questions/linux-newbie-8/the-file-system-f2fs-is-broken-4175666043/https://www.linux.org/threads/the-file-system-f2fs-is-broken.26490/
There is a test case can corrupt f2fs image:
- dd if=/dev/zero of=/swapfile bs=1M count=4096
- chmod 600 /swapfile
- mkswap /swapfile
- swapon --discard /swapfile
The root cause is f2fs_swap_activate() intends to return zero value
to setup_swap_extents() to enable SWP_FS mode (swap file goes through
fs), in this flow, setup_swap_extents() setups swap extent with wrong
block address range, result in discard_swap() erasing incorrect address.
Because f2fs_swap_activate() has pinned swapfile, its data block
address will not change, it's safe to let swap to handle IO through
raw device, so we can get rid of SWAP_FS mode and initial swap extents
inside f2fs_swap_activate(), by this way, later discard_swap() can trim
in right address range.
Fixes: 4969c06a0d83 ("f2fs: support swap file w/ DIO")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Btrfs currently does not support swap files because swap's use of bmap
does not work with copy-on-write and multiple devices. See 35054394c4
("Btrfs: stop providing a bmap operation to avoid swapfile corruptions").
However, the swap code has a mechanism for the filesystem to manually add
swap extents using add_swap_extent() from the ->swap_activate() aop.
iomap has done this since 67482129cd ("iomap: add a swapfile activation
function"). Btrfs will do the same in a later patch, so export
add_swap_extent().
Link: http://lkml.kernel.org/r/bb1208575e02829aae51b538709476964f97b1ea.1536704650.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Sterba <dsterba@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch tries to support compression in f2fs.
- New term named cluster is defined as basic unit of compression, file can
be divided into multiple clusters logically. One cluster includes 4 << n
(n >= 0) logical pages, compression size is also cluster size, each of
cluster can be compressed or not.
- In cluster metadata layout, one special flag is used to indicate cluster
is compressed one or normal one, for compressed cluster, following metadata
maps cluster to [1, 4 << n - 1] physical blocks, in where f2fs stores
data including compress header and compressed data.
- In order to eliminate write amplification during overwrite, F2FS only
support compression on write-once file, data can be compressed only when
all logical blocks in file are valid and cluster compress ratio is lower
than specified threshold.
- To enable compression on regular inode, there are three ways:
* chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext
Compress metadata layout:
[Dnode Structure]
+-----------------------------------------------+
| cluster 1 | cluster 2 | ......... | cluster N |
+-----------------------------------------------+
. . . .
. . . .
. Compressed Cluster . . Normal Cluster .
+----------+---------+---------+---------+ +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+ +---------+---------+---------+---------+
. .
. .
. .
+-------------+-------------+----------+----------------------------+
| data length | data chksum | reserved | compressed data |
+-------------+-------------+----------+----------------------------+
Changelog:
20190326:
- fix error handling of read_end_io().
- remove unneeded comments in f2fs_encrypt_one_page().
20190327:
- fix wrong use of f2fs_cluster_is_full() in f2fs_mpage_readpages().
- don't jump into loop directly to avoid uninitialized variables.
- add TODO tag in error path of f2fs_write_cache_pages().
20190328:
- fix wrong merge condition in f2fs_read_multi_pages().
- check compressed file in f2fs_post_read_required().
20190401
- allow overwrite on non-compressed cluster.
- check cluster meta before writing compressed data.
20190402
- don't preallocate blocks for compressed file.
- add lz4 compress algorithm
- process multiple post read works in one workqueue
Now f2fs supports processing post read work in multiple workqueue,
it shows low performance due to schedule overhead of multiple
workqueue executing orderly.
20190921
- compress: support buffered overwrite
C: compress cluster flag
V: valid block address
N: NEW_ADDR
One cluster contain 4 blocks
before overwrite after overwrite
- VVVV -> CVNN
- CVNN -> VVVV
- CVNN -> CVNN
- CVNN -> CVVV
- CVVV -> CVNN
- CVVV -> CVVV
20191029
- add kconfig F2FS_FS_COMPRESSION to isolate compression related
codes, add kconfig F2FS_FS_{LZO,LZ4} to cover backend algorithm.
note that: will remove lzo backend if Jaegeuk agreed that too.
- update codes according to Eric's comments.
20191101
- apply fixes from Jaegeuk
20191113
- apply fixes from Jaegeuk
- split workqueue for fsverity
20191216
- apply fixes from Jaegeuk
[Jaegeuk Kim]
- add tracepoint for f2fs_{,de}compress_pages()
- fix many bugs and add some compression stats
- fix overwrite/mmap bugs
- address 32bit build error, reported by Geert.
- bug fixes when handling errors and i_compressed_blocks
Reported-by: <noreply@ellerman.id.au>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_rename(), new_page is gone after f2fs_set_link(), but it tries
to put again when whiteout is failed and jumped to put_out_dir.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>