This reverts commit 74cca90e7d as sdcardfs
is now gone.
Bug: 157700134
Cc: Daniel Rosenberg <drosen@google.com>
Cc: Yongqin Liu <yongqin.liu@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie67083f34797ed5187194b173fa034b125b1477f
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl57BzQACgkQONu9yGCS
aT4sXg/9ERHCo0CNoF+KpzfPcH718NEzICFa5pHVE5OJnjGQ+W+AUx4ERkU9gMrk
W4N+zAFH+v6D/ejNVUPKVB+XMXeZo+QXVBfLme/N4VlrxaSeA2pMxiRrXVY1UX3/
mb6eShhpBn5q942c8QkzcQJbA1TUKnrqGuqq6rfDQTnjm6OcU/PbPHWqOBVIuOVk
Op9VSCcTC41N8sCdsjg411fRBue24+zRU05mw4leeqh0f/XaLjZ9xJyEAuW+6zIz
Vu/8c/vzb4j9TPBg0FaEzgDrR7TUwde7F/O9eejY/GWdQngyfPxjOGpKdakrQOOA
qRnPv0Fou8hFlKZWFKJhAYcPpTc7KP/y9o1aUsNDrIeA00t9R3nDKng+ewW09hfk
K75Znyw6yrWlc5nHotd1pG2NEEeDvSKdXVbyarnaTwqunla9WAdQ0IDFwdaiTklt
CfrJ+AJcd5Smnuo+JfljqF4oF88UJSzhI5hp0Zi9w0JROfbPOYFK0JM2DeEOE27J
IFm1Z5lvTj4VEJmyLL7CvJSM23yjK5todlG3+zFJt2ZncY2Kw1eHEOIvIwbBtxBp
2AWRkco+hsf+GToJNQopxGYyTjMI3NDy/FocAVIJ8wMSEWZkyS8NkIlgjPjTC2dk
ygJ0ZDDiPU0pKouZofQhzGR/Esv/phjWvTLPerFkKIIuiaG+uUo=
=1W2f
-----END PGP SIGNATURE-----
Merge 4.19.113 into android-4.19
Changes in 4.19.113
drm/mediatek: Find the cursor plane instead of hard coding it
spi: qup: call spi_qup_pm_resume_runtime before suspending
powerpc: Include .BTF section
ARM: dts: dra7: Add "dma-ranges" property to PCIe RC DT nodes
spi: pxa2xx: Add CS control clock quirk
spi/zynqmp: remove entry that causes a cs glitch
drm/exynos: dsi: propagate error value and silence meaningless warning
drm/exynos: dsi: fix workaround for the legacy clock name
drivers/perf: arm_pmu_acpi: Fix incorrect checking of gicc pointer
altera-stapl: altera_get_note: prevent write beyond end of 'key'
dm bio record: save/restore bi_end_io and bi_integrity
dm integrity: use dm_bio_record and dm_bio_restore
riscv: avoid the PIC offset of static percpu data in module beyond 2G limits
drm/amd/display: Clear link settings on MST disable connector
drm/amd/display: fix dcc swath size calculations on dcn1
xenbus: req->body should be updated before req->state
xenbus: req->err should be updated before req->state
block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group()
parse-maintainers: Mark as executable
USB: Disable LPM on WD19's Realtek Hub
usb: quirks: add NO_LPM quirk for RTL8153 based ethernet adapters
USB: serial: option: add ME910G1 ECM composition 0x110b
usb: host: xhci-plat: add a shutdown
USB: serial: pl2303: add device-id for HP LD381
usb: xhci: apply XHCI_SUSPEND_DELAY to AMD XHCI controller 1022:145c
ALSA: line6: Fix endless MIDI read loop
ALSA: seq: virmidi: Fix running status after receiving sysex
ALSA: seq: oss: Fix running status after receiving sysex
ALSA: pcm: oss: Avoid plugin buffer overflow
ALSA: pcm: oss: Remove WARNING from snd_pcm_plug_alloc() checks
iio: st_sensors: remap SMO8840 to LIS2DH12
iio: trigger: stm32-timer: disable master mode when stopping
iio: magnetometer: ak8974: Fix negative raw values in sysfs
iio: adc: at91-sama5d2_adc: fix differential channels in triggered mode
mmc: rtsx_pci: Fix support for speed-modes that relies on tuning
mmc: sdhci-of-at91: fix cd-gpios for SAMA5D2
staging: rtl8188eu: Add device id for MERCUSYS MW150US v2
staging: greybus: loopback_test: fix poll-mask build breakage
staging/speakup: fix get_word non-space look-ahead
intel_th: Fix user-visible error codes
intel_th: pci: Add Elkhart Lake CPU support
rtc: max8907: add missing select REGMAP_IRQ
xhci: Do not open code __print_symbolic() in xhci trace events
btrfs: fix log context list corruption after rename whiteout error
drm/amd/amdgpu: Fix GPR read from debugfs (v2)
drm/lease: fix WARNING in idr_destroy
memcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_event
mm: slub: be more careful about the double cmpxchg of freelist
mm, slub: prevent kmalloc_node crashes and memory leaks
page-flags: fix a crash at SetPageError(THP_SWAP)
x86/mm: split vmalloc_sync_all()
USB: cdc-acm: fix close_delay and closing_wait units in TIOCSSERIAL
USB: cdc-acm: fix rounding error in TIOCSSERIAL
iio: light: vcnl4000: update sampling periods for vcnl4200
kbuild: Disable -Wpointer-to-enum-cast
futex: Fix inode life-time issue
futex: Unbreak futex hashing
Revert "vrf: mark skb for multicast or link-local as enslaved to VRF"
Revert "ipv6: Fix handling of LLA with VRF and sockets bound to VRF"
ALSA: hda/realtek: Fix pop noise on ALC225
arm64: smp: fix smp_send_stop() behaviour
arm64: smp: fix crash_smp_send_stop() behaviour
drm/bridge: dw-hdmi: fix AVI frame colorimetry
staging: greybus: loopback_test: fix potential path truncation
staging: greybus: loopback_test: fix potential path truncations
Linux 4.19.113
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I90c48cd7189a964e59d199ecc0f32c0a68688ec5
commit 8019ad13ef7f64be44d4f892af9c840179009254 upstream.
As reported by Jann, ihold() does not in fact guarantee inode
persistence. And instead of making it so, replace the usage of inode
pointers with a per boot, machine wide, unique inode identifier.
This sequence number is global, but shared (file backed) futexes are
rare enough that this should not become a performance issue.
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* aosp/upstream-f2fs-stable-linux-4.19.y:
fs-verity: use u64_to_user_ptr()
fs-verity: use mempool for hash requests
fs-verity: implement readahead of Merkle tree pages
fs-verity: implement readahead for FS_IOC_ENABLE_VERITY
fscrypt: improve format of no-key names
ubifs: allow both hash and disk name to be provided in no-key names
ubifs: don't trigger assertion on invalid no-key filename
fscrypt: clarify what is meant by a per-file key
fscrypt: derive dirhash key for casefolded directories
fscrypt: don't allow v1 policies with casefolding
fscrypt: add "fscrypt_" prefix to fname_encrypt()
fscrypt: don't print name of busy file when removing key
fscrypt: document gfp_flags for bounce page allocation
fscrypt: optimize fscrypt_zeroout_range()
fscrypt: remove redundant bi_status check
fscrypt: Allow modular crypto algorithms
fscrypt: include <linux/ioctl.h> in UAPI header
fscrypt: don't check for ENOKEY from fscrypt_get_encryption_info()
fscrypt: remove fscrypt_is_direct_key_policy()
fscrypt: move fscrypt_valid_enc_modes() to policy.c
fscrypt: check for appropriate use of DIRECT_KEY flag earlier
fscrypt: split up fscrypt_supported_policy() by policy version
fscrypt: introduce fscrypt_needs_contents_encryption()
fscrypt: move fscrypt_d_revalidate() to fname.c
fscrypt: constify inode parameter to filename encryption functions
fscrypt: constify struct fscrypt_hkdf parameter to fscrypt_hkdf_expand()
fscrypt: verify that the crypto_skcipher has the correct ivsize
fscrypt: use crypto_skcipher_driver_name()
fscrypt: support passing a keyring key to FS_IOC_ADD_ENCRYPTION_KEY
keys: Export lookup_user_key to external users
Conflicts:
fs/crypto/Kconfig
fs/crypto/bio.c
fs/crypto/fname.c
fs/crypto/fscrypt_private.h
fs/crypto/keyring.c
fs/crypto/keysetup.c
fs/ubifs/dir.c
include/uapi/linux/fscrypt.h
Resolved the conflicts as per the corresponding android-mainline change,
Ib1e6b9eda8fb5dcfc6bdc8fa89d93f72b088c5f6.
Bug: 148667616
Change-Id: I5f8b846f0cd4d5403d8c61b9e12acb4581fac6f7
Signed-off-by: Eric Biggers <ebiggers@google.com>
Casefolded encrypted directories will use a new dirhash method that
requires a secret key. If the directory uses a v2 encryption policy,
it's easy to derive this key from the master key using HKDF. However,
v1 encryption policies don't provide a way to derive additional keys.
Therefore, don't allow casefolding on directories that use a v1 policy.
Specifically, make it so that trying to enable casefolding on a
directory that has a v1 policy fails, trying to set a v1 policy on a
casefolded directory fails, and trying to open a casefolded directory
that has a v1 policy (if one somehow exists on-disk) fails.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
[EB: improved commit message, updated fscrypt.rst, and other cleanups]
Link: https://lore.kernel.org/r/20200120223201.241390-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl4bAEoACgkQONu9yGCS
aT70DQ//YgSd3JJ9/d7TLy+Mm8GrlZIZWgRrz8YZdKcJBG+l/c+m6FR0uLDw95nf
zIFq1GY8DwS3djuNkxPoz8yvIgqoHuSjEUo+YF8v71ZdnGCXIt6SCoKbugAH+azp
zTEJ4lU3/Wrc1Gh4w5zeSgdQVPIJBQ+d7pKccZHOJ0DFwbU3hQ69vqXcaxrFuhop
vqKvNNEeCT2l1AxgAhKwNhceFL3Rb/2HxAjUED68ueY/EZ0/OsoinboBu2riSKNu
NWZTPq7B2Kht19aV48FThwksEvkelz72gkSzKvrsT03tDtNewQ7C/mreUqpU6VZE
lrcIWOHjXMQyk3g3XZfG3j8ppzHvZPaG/WEqmnJV8txjtaRU8hvj3Q95kHPheTv6
O/Ds4OpHBaS5g6+dmsUmtfSrbrhm3KKXMn0IJvyYkC/VY0RVdLKTZu0TM5ZS5id/
zW4AZWWzephUID9LAwRrTWvfGKMBK6gEiv2AtnY4XRiEMYEFw45uP97yWcUCg89L
a3BCbhhiO4tMqL/lf7KD4vPbXZsRgDM3q1wLitvM2KCVVSA32XGRMT9x8A6uwTJl
OLL39NSi8h8rqn5S22AowKcR/3VakRNw9pf5Y6AH5xOnP3R/OBHzYJNho2DUc8OI
6p+zeUZkDEIsFI1wdJSaaSWsNdpJj3Sw6SdB2QJI3c782F4pFUA=
=RCVS
-----END PGP SIGNATURE-----
Merge 4.19.95 into android-4.19
Changes in 4.19.95
USB: dummy-hcd: use usb_urb_dir_in instead of usb_pipein
USB: dummy-hcd: increase max number of devices to 32
bpf: Fix passing modified ctx to ld/abs/ind instruction
regulator: fix use after free issue
ASoC: max98090: fix possible race conditions
locking/spinlock/debug: Fix various data races
netfilter: ctnetlink: netns exit must wait for callbacks
mwifiex: Fix heap overflow in mmwifiex_process_tdls_action_frame()
libtraceevent: Fix lib installation with O=
x86/efi: Update e820 with reserved EFI boot services data to fix kexec breakage
ASoC: Intel: bytcr_rt5640: Update quirk for Teclast X89
efi/gop: Return EFI_NOT_FOUND if there are no usable GOPs
efi/gop: Return EFI_SUCCESS if a usable GOP was found
efi/gop: Fix memory leak in __gop_query32/64()
ARM: dts: imx6ul: imx6ul-14x14-evk.dtsi: Fix SPI NOR probing
ARM: vexpress: Set-up shared OPP table instead of individual for each CPU
netfilter: uapi: Avoid undefined left-shift in xt_sctp.h
netfilter: nft_set_rbtree: bogus lookup/get on consecutive elements in named sets
netfilter: nf_tables: validate NFT_SET_ELEM_INTERVAL_END
netfilter: nf_tables: validate NFT_DATA_VALUE after nft_data_init()
ARM: dts: BCM5301X: Fix MDIO node address/size cells
selftests/ftrace: Fix multiple kprobe testcase
ARM: dts: Cygnus: Fix MDIO node address/size cells
spi: spi-cavium-thunderx: Add missing pci_release_regions()
ASoC: topology: Check return value for soc_tplg_pcm_create()
ARM: dts: bcm283x: Fix critical trip point
bnxt_en: Return error if FW returns more data than dump length
bpf, mips: Limit to 33 tail calls
spi: spi-ti-qspi: Fix a bug when accessing non default CS
ARM: dts: am437x-gp/epos-evm: fix panel compatible
samples: bpf: Replace symbol compare of trace_event
samples: bpf: fix syscall_tp due to unused syscall
powerpc: Ensure that swiotlb buffer is allocated from low memory
btrfs: Fix error messages in qgroup_rescan_init
bpf: Clear skb->tstamp in bpf_redirect when necessary
bnx2x: Do not handle requests from VFs after parity
bnx2x: Fix logic to get total no. of PFs per engine
cxgb4: Fix kernel panic while accessing sge_info
net: usb: lan78xx: Fix error message format specifier
parisc: add missing __init annotation
rfkill: Fix incorrect check to avoid NULL pointer dereference
ASoC: wm8962: fix lambda value
regulator: rn5t618: fix module aliases
iommu/iova: Init the struct iova to fix the possible memleak
kconfig: don't crash on NULL expressions in expr_eq()
perf/x86/intel: Fix PT PMI handling
fs: avoid softlockups in s_inodes iterators
net: stmmac: Do not accept invalid MTU values
net: stmmac: xgmac: Clear previous RX buffer size
net: stmmac: RX buffer size must be 16 byte aligned
net: stmmac: Always arm TX Timer at end of transmission start
s390/purgatory: do not build purgatory with kcov, kasan and friends
drm/exynos: gsc: add missed component_del
s390/dasd/cio: Interpret ccw_device_get_mdc return value correctly
s390/dasd: fix memleak in path handling error case
block: fix memleak when __blk_rq_map_user_iov() is failed
parisc: Fix compiler warnings in debug_core.c
llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c)
hv_netvsc: Fix unwanted rx_table reset
powerpc/vcpu: Assume dedicated processors as non-preempt
powerpc/spinlocks: Include correct header for static key
cpufreq: imx6q: read OCOTP through nvmem for imx6ul/imx6ull
ARM: dts: imx6ul: use nvmem-cells for cpu speed grading
PCI/switchtec: Read all 64 bits of part_event_bitmap
gtp: fix bad unlock balance in gtp_encap_enable_socket
macvlan: do not assume mac_header is set in macvlan_broadcast()
net: dsa: mv88e6xxx: Preserve priority when setting CPU port.
net: stmmac: dwmac-sun8i: Allow all RGMII modes
net: stmmac: dwmac-sunxi: Allow all RGMII modes
net: usb: lan78xx: fix possible skb leak
pkt_sched: fq: do not accept silly TCA_FQ_QUANTUM
sch_cake: avoid possible divide by zero in cake_enqueue()
sctp: free cmd->obj.chunk for the unprocessed SCTP_CMD_REPLY
tcp: fix "old stuff" D-SACK causing SACK to be treated as D-SACK
vxlan: fix tos value before xmit
vlan: fix memory leak in vlan_dev_set_egress_priority
vlan: vlan_changelink() should propagate errors
mlxsw: spectrum_qdisc: Ignore grafting of invisible FIFO
net: sch_prio: When ungrafting, replace with FIFO
usb: dwc3: gadget: Fix request complete check
USB: core: fix check for duplicate endpoints
USB: serial: option: add Telit ME910G1 0x110a composition
usb: missing parentheses in USE_NEW_SCHEME
Linux 4.19.95
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I611f034c58f975a5d2b70eed0a0884f1ff5b09cc
[ Upstream commit 04646aebd30b99f2cfa0182435a2ec252fcb16d0 ]
Anything that walks all inodes on sb->s_inodes list without rescheduling
risks softlockups.
Previous efforts were made in 2 functions, see:
c27d82f fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()
ac05fbb inode: don't softlockup when evicting inodes
but there hasn't been an audit of all walkers, so do that now. This
also consistently moves the cond_resched() calls to the bottom of each
loop in cases where it already exists.
One loop remains: remove_dquot_ref(), because I'm not quite sure how
to deal with that one w/o taking the i_lock.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
* origin/upstream-f2fs-stable-linux-4.19.y:
f2fs: use EINVAL for superblock with invalid magic
f2fs: fix to read source block before invalidating it
f2fs: remove redundant check from f2fs_setflags_common()
f2fs: use generic checking function for FS_IOC_FSSETXATTR
f2fs: use generic checking and prep function for FS_IOC_SETFLAGS
ubifs, fscrypt: cache decrypted symlink target in ->i_link
vfs: use READ_ONCE() to access ->i_link
fs, fscrypt: clear DCACHE_ENCRYPTED_NAME when unaliasing directory
fscrypt: cache decrypted symlink target in ->i_link
fscrypt: fix race where ->lookup() marks plaintext dentry as ciphertext
fscrypt: only set dentry_operations on ciphertext dentries
fscrypt: fix race allowing rename() and link() of ciphertext dentries
fscrypt: clean up and improve dentry revalidation
fscrypt: use READ_ONCE() to access ->i_crypt_info
fscrypt: remove WARN_ON_ONCE() when decryption fails
fscrypt: drop inode argument from fscrypt_get_ctx()
f2fs: improve print log in f2fs_sanity_check_ckpt()
f2fs: avoid out-of-range memory access
f2fs: fix to avoid long latency during umount
f2fs: allow all the users to pin a file
f2fs: support swap file w/ DIO
f2fs: allocate blocks for pinned file
f2fs: fix is_idle() check for discard type
f2fs: add a rw_sem to cover quota flag changes
f2fs: set SBI_NEED_FSCK for xattr corruption case
f2fs: use generic EFSBADCRC/EFSCORRUPTED
f2fs: Use DIV_ROUND_UP() instead of open-coding
f2fs: print kernel message if filesystem is inconsistent
f2fs: introduce f2fs_<level> macros to wrap f2fs_printk()
f2fs: avoid get_valid_blocks() for cleanup
f2fs: ioctl for removing a range from F2FS
f2fs: only set project inherit bit for directory
f2fs: separate f2fs i_flags from fs_flags and ext4 i_flags
f2fs: Add option to limit required GC for checkpoint=disable
f2fs: Fix accounting for unusable blocks
f2fs: Fix root reserved on remount
f2fs: Lower threshold for disable_cp_again
f2fs: fix sparse warning
f2fs: fix f2fs_show_options to show nodiscard mount option
f2fs: add error prints for debugging mount failure
f2fs: fix to do sanity check on segment bitmap of LFS curseg
f2fs: add missing sysfs entries in documentation
f2fs: fix to avoid deadloop if data_flush is on
f2fs: always assume that the device is idle under gc_urgent
f2fs: add bio cache for IPU
f2fs: allow ssr block allocation during checkpoint=disable period
f2fs: fix to check layout on last valid checkpoint park
Change-Id: Ie910f127f574c2115e5b9a6725461ce002c267be
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
Make the f2fs implementation of FS_IOC_FSSETXATTR use the new VFS helper
function vfs_ioc_fssetxattr_check(), and remove the project quota check
since it's now done by the helper function.
This is based on a patch from Darrick Wong, but reworked to apply after
commit 360985573b55 ("f2fs: separate f2fs i_flags from fs_flags and ext4
i_flags").
Originally-from: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Make the f2fs implementation of FS_IOC_SETFLAGS use the new VFS helper
function vfs_ioc_setflags_prepare().
This is based on a patch from Darrick Wong, but reworked to apply after
commit 360985573b55 ("f2fs: separate f2fs i_flags from fs_flags and ext4
i_flags").
Originally-from: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl0Nx3oACgkQONu9yGCS
aT7hJRAAzdu1EKr/VUiFJshvPL/+veH1MLdMgafW6gXcoQCQSdMuv1j3mv3axElC
dz2e/nHXvIhMKMrhzgEvVwV8m8eKPwM4IZjTJ8ji8st9uy0R2UIdbPUHrLtRAQzI
bK2BIk6697B9/9z2s8nDXhKex9EtZ2vj8DeQLT1AgQKMM0+abPA2YR9+cp9HEjQR
Wa5NjajwWgpty1VVk+Q6GMD+GBnEILyqxtVGv30z/prWoVadFAXJEZrWsaXotUvh
ySc2CS9Iu/muiIegTSStnNXaWaawZblA4JdDcdRJ235rPK8sbACYtDgZispYopi4
y/9kUtuZzWEv61aFZkXd13jKZ8pJsrLZLoc+yDqew0IA6M2Hz7d7Pq/UWQ0m4qnt
rCcU1vTFooSp/IHOToXN0oQQrTqD2oK5zO4V9S5McwQniuO/RqUPJfX/sKjtrvNT
SI/yLVKMXImAwVXxNEfO4bE+T9F/QYptOHdpfqJkPvpKLzjxnPBPswG8poIKwQzJ
w3p1AoQ1ISuW6b94a/nI5nSC28gVslVg+oTjRjF7kfIcOtvglX79XPvfdM5bB2DC
qD51A8veEGCtAu57/tSyisGOomqTqqqbaiYXwuhgTI1NSszbqKDLNvjsw1GKj0rK
mSmDzVMgQhxaHOagOeDqLYlj1zgTamx5R6+dretf8AOXyEE2RSg=
=+MJy
-----END PGP SIGNATURE-----
Merge 4.19.54 into android-4.19
Changes in 4.19.54
ax25: fix inconsistent lock state in ax25_destroy_timer
be2net: Fix number of Rx queues used for flow hashing
hv_netvsc: Set probe mode to sync
ipv6: flowlabel: fl6_sock_lookup() must use atomic_inc_not_zero
lapb: fixed leak of control-blocks.
neigh: fix use-after-free read in pneigh_get_next
net: dsa: rtl8366: Fix up VLAN filtering
net: openvswitch: do not free vport if register_netdevice() is failed.
nfc: Ensure presence of required attributes in the deactivate_target handler
sctp: Free cookie before we memdup a new one
sunhv: Fix device naming inconsistency between sunhv_console and sunhv_reg
tipc: purge deferredq list for each grp member in tipc_group_delete
vsock/virtio: set SOCK_DONE on peer shutdown
net/mlx5: Avoid reloading already removed devices
net: mvpp2: prs: Fix parser range for VID filtering
net: mvpp2: prs: Use the correct helpers when removing all VID filters
Staging: vc04_services: Fix a couple error codes
perf/x86/intel/ds: Fix EVENT vs. UEVENT PEBS constraints
netfilter: nf_queue: fix reinject verdict handling
ipvs: Fix use-after-free in ip_vs_in
selftests: netfilter: missing error check when setting up veth interface
clk: ti: clkctrl: Fix clkdm_clk handling
powerpc/powernv: Return for invalid IMC domain
usb: xhci: Fix a potential null pointer dereference in xhci_debugfs_create_endpoint()
mISDN: make sure device name is NUL terminated
x86/CPU/AMD: Don't force the CPB cap when running under a hypervisor
perf/ring_buffer: Fix exposing a temporarily decreased data_head
perf/ring_buffer: Add ordering to rb->nest increment
perf/ring-buffer: Always use {READ,WRITE}_ONCE() for rb->user_page data
gpio: fix gpio-adp5588 build errors
net: stmmac: update rx tail pointer register to fix rx dma hang issue.
net: tulip: de4x5: Drop redundant MODULE_DEVICE_TABLE()
ACPI/PCI: PM: Add missing wakeup.flags.valid checks
drm/etnaviv: lock MMU while dumping core
net: aquantia: tx clean budget logic error
net: aquantia: fix LRO with FCS error
i2c: dev: fix potential memory leak in i2cdev_ioctl_rdwr
ALSA: hda - Force polling mode on CNL for fixing codec communication
configfs: Fix use-after-free when accessing sd->s_dentry
perf data: Fix 'strncat may truncate' build failure with recent gcc
perf namespace: Protect reading thread's namespace
perf record: Fix s390 missing module symbol and warning for non-root users
ia64: fix build errors by exporting paddr_to_nid()
xen/pvcalls: Remove set but not used variable
xenbus: Avoid deadlock during suspend due to open transactions
KVM: PPC: Book3S: Use new mutex to synchronize access to rtas token list
KVM: PPC: Book3S HV: Don't take kvm->lock around kvm_for_each_vcpu
arm64: fix syscall_fn_t type
arm64: use the correct function type in SYSCALL_DEFINE0
arm64: use the correct function type for __arm64_sys_ni_syscall
net: sh_eth: fix mdio access in sh_eth_close() for R-Car Gen2 and RZ/A1 SoCs
net: phylink: ensure consistent phy interface mode
net: phy: dp83867: Set up RGMII TX delay
scsi: libcxgbi: add a check for NULL pointer in cxgbi_check_route()
scsi: smartpqi: properly set both the DMA mask and the coherent DMA mask
scsi: scsi_dh_alua: Fix possible null-ptr-deref
scsi: libsas: delete sas port if expander discover failed
mlxsw: spectrum: Prevent force of 56G
ocfs2: fix error path kobject memory leak
coredump: fix race condition between collapse_huge_page() and core dumping
Abort file_remove_privs() for non-reg. files
Linux 4.19.54
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit f69e749a49353d96af1a293f56b5b56de59c668a upstream.
file_remove_privs() might be called for non-regular files, e.g.
blkdev inode. There is no reason to do its job on things
like blkdev inodes, pipes, or cdevs. Hence, abort if
file does not refer to a regular inode.
AV: more to the point, for devices there might be any number of
inodes refering to given device. Which one to strip the permissions
from, even if that made any sense in the first place? All of them
will be observed with contents modified, after all.
Found by LockDoc (Alexander Lochmann, Horst Schirmeier and Olaf
Spinczyk)
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Alexander Lochmann <alexander.lochmann@tu-dortmund.de>
Signed-off-by: Horst Schirmeier <horst.schirmeier@tu-dortmund.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Zubin Mithra <zsm@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlxtHSUACgkQONu9yGCS
aT6WIw//Xdd2P7aIWSDnmS6L+RlEtdz2hnO07O2lMVIhsH1MgetAk0nJbP8b9sYp
zJY4kW6gEue1/J7vHIiTnDpnhaylxESOxOb49O/E95bnPIdhE9EqR1aiMmo8R5rU
8F2lH6kZncAawZ/7WGnQdHqukO9GtwFO0hwtsDT5RcwiIPe/fHkIpp1QfN21ntib
nV7xQOfDFEMVcbyzAFa+59A4UUQSWKTkld9tcJFG62pjcN9DlpPej86M2qy9mG8U
0d7hEkYGqHthh092TAxgobsJDq5j0jUUVjK80/Fq0TKe/OSvsgQ8MRBryVsgBf8J
hJBCLHay4Ps3ONtNI3IXDVZu9dgdL8UynBEfbWdgJHlGgKJEvMGxFCckzI8OP3yr
HJtFm6gG24kySFt+QnvwHg1/zwAho5FDBMfx02RkukJwDiN16DZSbKRimrTVAioV
bBTmL+JyoCBLYGfUzq4gM1+3L5TRyMBqjgcjLvXWuXF18fdpVdb0exKlNtLHMUf6
aqzhroF90sn5vUybpw02ouduxmRPWDaDel7ArA3zb16Dp+ODCxYtpkV7dmRBjPVK
YxwweO604Rd2musacQkVAB5JkshjPdV7WA/FmhJf7Fn1/CaaXliAVrfMNbEGW15a
NFQvTMrJva2pihXYJY4p/UfhnsBmxXmY30wYlLX2VxRFb+qoy3Y=
=MO0g
-----END PGP SIGNATURE-----
Merge 4.19.24 into android-4.19
Changes in 4.19.24
dt-bindings: eeprom: at24: add "atmel,24c2048" compatible string
eeprom: at24: add support for 24c2048
blk-mq: fix a hung issue when fsync
ARM: 8789/1: signal: copy registers using __copy_to_user()
ARM: 8790/1: signal: always use __copy_to_user to save iwmmxt context
ARM: 8791/1: vfp: use __copy_to_user() when saving VFP state
ARM: 8792/1: oabi-compat: copy oabi events using __copy_to_user()
ARM: 8793/1: signal: replace __put_user_error with __put_user
ARM: 8794/1: uaccess: Prevent speculative use of the current addr_limit
ARM: 8795/1: spectre-v1.1: use put_user() for __put_user()
ARM: 8796/1: spectre-v1,v1.1: provide helpers for address sanitization
ARM: 8797/1: spectre-v1.1: harden __copy_to_user
ARM: 8810/1: vfp: Fix wrong assignement to ufp_exc
ARM: make lookup_processor_type() non-__init
ARM: split out processor lookup
ARM: clean up per-processor check_bugs method call
ARM: add PROC_VTABLE and PROC_TABLE macros
ARM: spectre-v2: per-CPU vtables to work around big.Little systems
ARM: ensure that processor vtables is not lost after boot
ARM: fix the cockup in the previous patch
drm/amdgpu/sriov:Correct pfvf exchange logic
ACPI: NUMA: Use correct type for printing addresses on i386-PAE
perf report: Fix wrong iteration count in --branch-history
perf test shell: Use a fallback to get the pathname in vfs_getname
tools uapi: fix RISC-V 64-bit support
riscv: fix trace_sys_exit hook
cpufreq: check if policy is inactive early in __cpufreq_get()
drm/bridge: tc358767: add bus flags
drm/bridge: tc358767: add defines for DP1_SRCCTRL & PHY_2LANE
drm/bridge: tc358767: fix single lane configuration
drm/bridge: tc358767: fix initial DP0/1_SRCCTRL value
drm/bridge: tc358767: reject modes which require too much BW
drm/bridge: tc358767: fix output H/V syncs
nvme-pci: use the same attributes when freeing host_mem_desc_bufs.
nvme-pci: fix out of bounds access in nvme_cqe_pending
nvme-multipath: zero out ANA log buffer
nvme: pad fake subsys NQN vid and ssvid with zeros
drm/amdgpu: set WRITE_BURST_LENGTH to 64B to workaround SDMA1 hang
ARM: dts: da850-evm: Correct the audio codec regulators
ARM: dts: da850-evm: Correct the sound card name
ARM: dts: da850-lcdk: Correct the audio codec regulators
ARM: dts: da850-lcdk: Correct the sound card name
ARM: dts: kirkwood: Fix polarity of GPIO fan lines
gpio: pl061: handle failed allocations
drm/nouveau: Don't disable polling in fallback mode
drm/nouveau/falcon: avoid touching registers if engine is off
cifs: Limit memory used by lock request calls to a page
kvm: sev: Fail KVM_SEV_INIT if already initialized
CIFS: Do not assume one credit for async responses
gpio: mxc: move gpio noirq suspend/resume to syscore phase
Revert "Input: elan_i2c - add ACPI ID for touchpad in ASUS Aspire F5-573G"
Input: elan_i2c - add ACPI ID for touchpad in Lenovo V330-15ISK
ARM: OMAP5+: Fix inverted nirq pin interrupts with irq_set_type
perf/core: Fix impossible ring-buffer sizes warning
perf/x86: Add check_period PMU callback
ALSA: hda - Add quirk for HP EliteBook 840 G5
ALSA: usb-audio: Fix implicit fb endpoint setup by quirk
ASoC: hdmi-codec: fix oops on re-probe
tools uapi: fix Alpha support
riscv: Add pte bit to distinguish swap from invalid
x86/kvm/nVMX: read from MSR_IA32_VMX_PROCBASED_CTLS2 only when it is available
kvm: vmx: Fix entry number check for add_atomic_switch_msr()
mmc: sunxi: Filter out unsupported modes declared in the device tree
mmc: block: handle complete_work on separate workqueue
Input: bma150 - register input device after setting private data
Input: elantech - enable 3rd button support on Fujitsu CELSIUS H780
Revert "nfsd4: return default lease period"
Revert "mm: don't reclaim inodes with many attached pages"
Revert "mm: slowly shrink slabs with a relatively small number of objects"
alpha: fix page fault handling for r16-r18 targets
alpha: Fix Eiger NR_IRQS to 128
s390/zcrypt: fix specification exception on z196 during ap probe
tracing/uprobes: Fix output for multiple string arguments
x86/platform/UV: Use efi_runtime_lock to serialise BIOS calls
scsi: sd: fix entropy gathering for most rotational disks
signal: Restore the stop PTRACE_EVENT_EXIT
md/raid1: don't clear bitmap bits on interrupted recovery.
x86/a.out: Clear the dump structure initially
dm crypt: don't overallocate the integrity tag space
dm thin: fix bug where bio that overwrites thin block ignores FUA
drm: Use array_size() when creating lease
drm/vkms: Fix license inconsistent
drm/i915: Block fbdev HPD processing during suspend
drm/i915: Prevent a race during I915_GEM_MMAP ioctl with WC set
mm: proc: smaps_rollup: fix pss_locked calculation
Linux 4.19.24
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 69056ee6a8a3d576ed31e38b3b14c70d6c74edcc upstream.
This reverts commit a76cf1a474d7d ("mm: don't reclaim inodes with many
attached pages").
This change causes serious changes to page cache and inode cache
behaviour and balance, resulting in major performance regressions when
combining worklaods such as large file copies and kernel compiles.
https://bugzilla.kernel.org/show_bug.cgi?id=202441
This change is a hack to work around the problems introduced by changing
how agressive shrinkers are on small caches in commit 172b06c32b ("mm:
slowly shrink slabs with a relatively small number of objects"). It
creates more problems than it solves, wasn't adequately reviewed or
tested, so it needs to be reverted.
Link: http://lkml.kernel.org/r/20190130041707.27750-2-david@fromorbit.com
Fixes: a76cf1a474d7d ("mm: don't reclaim inodes with many attached pages")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Wolfgang Walter <linux@stwm.de>
Cc: Roman Gushchin <guro@fb.com>
Cc: Spock <dairinin@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This allows filesystems to use their mount private data to
influence the permssions they use in setattr2. It has
been separated into a new call to avoid disrupting current
setattr users.
Bug: 120446149
Change-Id: I19959038309284448f1b7f232d579674ef546385
Signed-off-by: Daniel Rosenberg <drosen@google.com>
commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0 upstream.
Spock reported that commit 172b06c32b ("mm: slowly shrink slabs with a
relatively small number of objects") leads to a regression on his setup:
periodically the majority of the pagecache is evicted without an obvious
reason, while before the change the amount of free memory was balancing
around the watermark.
The reason behind is that the mentioned above change created some
minimal background pressure on the inode cache. The problem is that if
an inode is considered to be reclaimed, all belonging pagecache page are
stripped, no matter how many of them are there. So, if a huge
multi-gigabyte file is cached in the memory, and the goal is to reclaim
only few slab objects (unused inodes), we still can eventually evict all
gigabytes of the pagecache at once.
The workload described by Spock has few large non-mapped files in the
pagecache, so it's especially noticeable.
To solve the problem let's postpone the reclaim of inodes, which have
more than 1 attached page. Let's wait until the pagecache pages will be
evicted naturally by scanning the corresponding LRU lists, and only then
reclaim the inode structure.
Link: http://lkml.kernel.org/r/20181023164302.20436-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Spock <dairinin@gmail.com>
Tested-by: Spock <dairinin@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org> [4.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This contains two new features:
1) Stack file operations: this allows removal of several hacks from the
VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
2) Metadata only copy-up: when file is on lower layer and only metadata is
modified (except size) then only copy up the metadata and continue to
use the data from the lower file.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3srhAAKCRDh3BK/laaZ
PC6tAQCP+KklcN+TvNp502f+O/kATahSpgnun4NY1/p4I8JV+AEAzdlkTN3+MiAO
fn9brN6mBK7h59DO3hqedPLJy2vrgwg=
=QDXH
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
"This contains two new features:
- Stack file operations: this allows removal of several hacks from
the VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
- Metadata only copy-up: when file is on lower layer and only
metadata is modified (except size) then only copy up the metadata
and continue to use the data from the lower file"
* tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
ovl: Enable metadata only feature
ovl: Do not do metacopy only for ioctl modifying file attr
ovl: Do not do metadata only copy-up for truncate operation
ovl: add helper to force data copy-up
ovl: Check redirect on index as well
ovl: Set redirect on upper inode when it is linked
ovl: Set redirect on metacopy files upon rename
ovl: Do not set dentry type ORIGIN for broken hardlinks
ovl: Add an inode flag OVL_CONST_INO
ovl: Treat metacopy dentries as type OVL_PATH_MERGE
ovl: Check redirects for metacopy files
ovl: Move some dir related ovl_lookup_single() code in else block
ovl: Do not expose metacopy only dentry from d_real()
ovl: Open file with data except for the case of fsync
ovl: Add helper ovl_inode_realdata()
ovl: Store lower data inode in ovl_inode
ovl: Fix ovl_getattr() to get number of blocks from lower
ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
ovl: Copy up meta inode data from lowest data inode
ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
...
Pull vfs icache updates from Al Viro:
- NFS mkdir/open_by_handle race fix
- analogous solution for FUSE, replacing the one currently in mainline
- new primitive to be used when discarding halfway set up inodes on
failed object creation; gives sane warranties re icache lookups not
returning such doomed by still not freed inodes. A bunch of
filesystems switched to that animal.
- Miklos' fix for last cycle regression in iget5_locked(); -stable will
need a slightly different variant, unfortunately.
- misc bits and pieces around things icache-related (in adfs and jfs).
* 'work.mkdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
jfs: don't bother with make_bad_inode() in ialloc()
adfs: don't put inodes into icache
new helper: inode_fake_hash()
vfs: don't evict uninitialized inode
jfs: switch to discard_new_inode()
ext2: make sure that partially set up inodes won't be returned by ext2_iget()
udf: switch to discard_new_inode()
ufs: switch to discard_new_inode()
btrfs: switch to discard_new_inode()
new primitive: discard_new_inode()
kill d_instantiate_no_diralias()
nfs_instantiate(): prevent multiple aliases for directory inode
iput() ends up calling ->evict() on new inode, which is not yet initialized
by owning fs. So use destroy_inode() instead.
Add to sb->s_inodes list only if inode is not in I_CREATING state (meaning
that it wasn't allocated with new_inode(), which already does the
insertion).
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 80ea09a002 ("vfs: factor out inode_insert5()")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We don't want open-by-handle picking half-set-up in-core
struct inode from e.g. mkdir() having failed halfway through.
In other words, we don't want such inodes returned by iget_locked()
on their way to extinction. However, we can't just have them
unhashed - otherwise open-by-handle immediately *after* that would've
ended up creating a new in-core inode over the on-disk one that
is in process of being freed right under us.
Solution: new flag (I_CREATING) set by insert_inode_locked() and
removed by unlock_new_inode() and a new primitive (discard_new_inode())
to be used by such halfway-through-setup failure exits instead of
unlock_new_inode() / iput() combinations. That primitive unlocks new
inode, but leaves I_CREATING in place.
iget_locked() treats finding an I_CREATING inode as failure
(-ESTALE, once we sort out the error propagation).
insert_inode_locked() treats the same as instant -EBUSY.
ilookup() treats those as icache miss.
[Fix by Dan Carpenter <dan.carpenter@oracle.com> folded in]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
sgid directories have special semantics, making newly created files in
the directory belong to the group of the directory, and newly created
subdirectories will also become sgid. This is historically used for
group-shared directories.
But group directories writable by non-group members should not imply
that such non-group members can magically join the group, so make sure
to clear the sgid bit on non-directories for non-members (but remember
that sgid without group execute means "mandatory locking", just to
confuse things even more).
Reported-by: Jann Horn <jannh@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
There were no conflicts between this and the contents of linux-next
until just before the merge window, when we saw multiple problems:
- A minor conflict with my own y2038 fixes, which I could address
by adding another patch on top here.
- One semantic conflict with late changes to the NFS tree. I addressed
this by merging Deepa's original branch on top of the changes that
now got merged into mainline and making sure the merge commit includes
the necessary changes as produced by coccinelle.
- A trivial conflict against the removal of staging/lustre.
- Multiple conflicts against the VFS changes in the overlayfs tree.
These are still part of linux-next, but apparently this is no longer
intended for 4.18 [1], so I am ignoring that part.
As Deepa writes:
The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64 timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual
replacement becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions.
Thomas Gleixner adds:
I think there is no point to drag that out for the next merge window.
The whole thing needs to be done in one go for the core changes which
means that you're going to play that catchup game forever. Let's get
over with it towards the end of the merge window.
[1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbInZAAAoJEGCrR//JCVInReoQAIlVIIMt5ZX6wmaKbrjy9Itf
MfgbFihQ/djLnuSPVQ3nztcxF0d66BKHZ9puVjz6+mIHqfDvJTRwZs9nU+sOF/T1
g78fRkM1cxq6ZCkGYAbzyjyo5aC4PnSMP/NQLmwqvi0MXqqrbDoq5ZdP9DHJw39h
L9lD8FM/P7T29Fgp9tq/pT5l9X8VU8+s5KQG1uhB5hii4VL6pD6JyLElDita7rg+
Z7/V7jkxIGEUWF7vGaiR1QTFzEtpUA/exDf9cnsf51OGtK/LJfQ0oiZPPuq3oA/E
LSbt8YQQObc+dvfnGxwgxEg1k5WP5ekj/Wdibv/+rQKgGyLOTz6Q4xK6r8F2ahxs
nyZQBdXqHhJYyKr1H1reUH3mrSgQbE5U5R1i3My0xV2dSn+vtK5vgF21v2Ku3A1G
wJratdtF/kVBzSEQUhsYTw14Un+xhBLRWzcq0cELonqxaKvRQK9r92KHLIWNE7/v
c0TmhFbkZA+zR8HdsaL3iYf1+0W/eYy8PcvepyldKNeW2pVk3CyvdTfY2Z87G2XK
tIkK+BUWbG3drEGG3hxZ3757Ln3a9qWyC5ruD3mBVkuug/wekbI8PykYJS7Mx4s/
WNXl0dAL0Eeu1M8uEJejRAe1Q3eXoMWZbvCYZc+wAm92pATfHVcKwPOh8P7NHlfy
A3HkjIBrKW5AgQDxfgvm
=CZX2
-----END PGP SIGNATURE-----
Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground
Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
"This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
As Deepa writes:
'The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64
timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual replacement
becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions'
Thomas Gleixner adds:
'I think there is no point to drag that out for the next merge
window. The whole thing needs to be done in one go for the core
changes which means that you're going to play that catchup game
forever. Let's get over with it towards the end of the merge window'"
* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
pstore: Remove bogus format string definition
vfs: change inode times to use struct timespec64
pstore: Convert internal records to timespec64
udf: Simplify calls to udf_disk_stamp_to_time
fs: nfs: get rid of memcpys for inode times
ceph: make inode time prints to be long long
lustre: Use long long type to print inode time
fs: add timespec64_truncate()
This contains a fix for the vfs_mkdir() issue discovered by Al, as well as
other fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCWxatHQAKCRDh3BK/laaZ
POg5AP95a/uUOrTJeTsENJwTmyAwHed9a6y4abKtvNErxUm4awD9FmhyYXodzJNq
9/mheT4kV2XkR/KkxI5sizfT1uPuvgA=
=+ljQ
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"This contains a fix for the vfs_mkdir() issue discovered by Al, as
well as other fixes and cleanups"
* tag 'ovl-fixes-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: use inode_insert5() to hash a newly created inode
ovl: Pass argument to ovl_get_inode() in a structure
vfs: factor out inode_insert5()
ovl: clean up copy-up error paths
ovl: return EIO on internal error
ovl: make ovl_create_real() cope with vfs_mkdir() safely
ovl: create helper ovl_create_temp()
ovl: return dentry from ovl_create_real()
ovl: struct cattr cleanups
ovl: strip debug argument from ovl_do_ helpers
ovl: remove WARN_ON() real inode attributes mismatch
ovl: Kconfig documentation fixes
ovl: update documentation for unionmount-testsuite
Split out common helper for race free insertion of an already allocated
inode into the cache. Use this from iget5_locked() and
insert_inode_locked4(). Make iget5_locked() use new_inode()/iput() instead
of alloc_inode()/destroy_inode() directly.
Also export to modules for use by filesystems which want to preallocate an
inode before file/directory creation.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In inode_init_always(), we clear the inode mapping flags, which clears
any retained error (AS_EIO, AS_ENOSPC) bits. Unfortunately, we do not
also clear wb_err, which means that old mapping errors can leak through
to new inodes.
This is crucial for the XFS inode allocation path because we recycle old
in-core inodes and we do not want error state from an old file to leak
into the new file. This bug was discovered by running generic/036 and
generic/047 in a loop and noticing that the EIOs generated by the
collision of direct and buffered writes in generic/036 would survive the
remount between 036 and 047, and get reported to the fsyncs (on
different files!) in generic/047.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
As vfs moves to using struct timespec64 to represent times,
update the argument to timespec_truncate() to use
struct timespec64. Also change the name of the function.
The rest of the implementation logic is the same.
Move this to fs/inode.c instead of kernel/time/time.c as all the
users of this api are filesystems.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <viro@zeniv.linux.org.uk>
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.
[willy@infradead.org: fix nds32, fs/dax.c]
Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Noticed when looking at why cycling 600k inodes/s through the inode
cache was taking a total of 8% cpu in memset() during inode
initialisation. There is no need to zero the inode.i_data structure
twice.
This increases single threaded bulkstat throughput from ~200,000
inodes/s to ~220,000 inodes/s, so we save a substantial amount of
CPU time per inode init by doing this.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
__mark_inode_dirty already takes care of that, and for the XFS lazytime
implementation we need to know that ->dirty_inode was called because
I_DIRTY_TIME was set.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Commit 7994e6f725 ("vfs: Move waiting for inode writeback from
end_writeback() to evict_inode()") removed inode_sync_wait() from
end_writeback() and commit dbd5768f87 ("vfs: Rename end_writeback() to
clear_inode()") renamed end_writeback() to clear_inode().
After these patches there is no sleeping operation in clear_inode().
So, remove might_sleep() from it.
Link: http://lkml.kernel.org/r/20171108004354.40308-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We only really need to update i_version if someone has queried for it
since we last incremented it. By doing that, we can avoid having to
update the inode if the times haven't changed.
If the times have changed, then we go ahead and forcibly increment the
counter, under the assumption that we'll be going to the storage
anyway, and the increment itself is relatively cheap.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Add a documentation blob that explains what the i_version field is, how
it is expected to work, and how it is currently implemented by various
filesystems.
We already have inode_inc_iversion. Add several other functions for
manipulating and accessing the i_version counter. For now, the
implementation is trivial and basically works the way that all of the
open-coded i_version accesses work today.
Future patches will convert existing users of i_version to use the new
API, and then convert the backend implementation to do things more
efficiently.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
This is a pure automated search-and-replace of the internal kernel
superblock flags.
The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.
Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.
The script to do this was:
# places to look in; re security/*: it generally should *not* be
# touched (that stuff parses mount(2) arguments directly), but
# there are two places where we really deal with superblock flags.
FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
include/linux/fs.h include/uapi/linux/bfs_fs.h \
security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
# the list of MS_... constants
SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
ACTIVE NOUSER"
SED_PROG=
for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
# we want files that contain at least one of MS_...,
# with fs/namespace.c and fs/pnode.c excluded.
L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
for f in $L; do sed -i $f $SED_PROG; done
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull overlayfs updates from Miklos Szeredi:
"This fixes d_ino correctness in readdir, which brings overlayfs on par
with normal filesystems regarding inode number semantics, as long as
all layers are on the same filesystem.
There are also some bug fixes, one in particular (random ioctl's
shouldn't be able to modify lower layers) that touches some vfs code,
but of course no-op for non-overlay fs"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix false positive ESTALE on lookup
ovl: don't allow writing ioctl on lower layer
ovl: fix relatime for directories
vfs: add flags to d_real()
ovl: cleanup d_real for negative
ovl: constant d_ino for non-merge dirs
ovl: constant d_ino across copy up
ovl: fix readdir error value
ovl: check snprintf return
Allow interval trees to quickly check for overlaps to avoid unnecesary
tree lookups in interval_tree_iter_first().
As of this patch, all interval tree flavors will require using a
'rb_root_cached' such that we can have the leftmost node easily
available. While most users will make use of this feature, those with
special functions (in addition to the generic insert, delete, search
calls) will avoid using the cached option as they can do funky things
with insertions -- for example, vma_interval_tree_insert_after().
[jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Doug Ledford <dledford@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Christian Benvenuti <benve@cisco.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Need to treat non-regular overlayfs files the same as regular files when
checking for an atime update.
Add a d_real() flag to make it return the upper dentry for all file types.
Reported-by: "zhangyi (F)" <yi.zhang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When we introduced the bmap redo log items, we set MS_ACTIVE on the
mountpoint and XFS_IRECOVERY on the inode to prevent unlinked inodes
from being truncated prematurely during log recovery. This also had the
effect of putting linked inodes on the lru instead of evicting them.
Unfortunately, we neglected to find all those unreferenced lru inodes
and evict them after finishing log recovery, which means that we leak
them if anything goes wrong in the rest of xfs_mountfs, because the lru
is only cleaned out on unmount.
Therefore, evict unreferenced inodes in the lru list immediately
after clearing MS_ACTIVE.
Fixes: 17c12bcd30 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: viro@ZenIV.linux.org.uk
Reviewed-by: Brian Foster <bfoster@redhat.com>
Pull misc filesystem updates from Al Viro:
"Assorted normal VFS / filesystems stuff..."
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
dentry name snapshots
Make statfs properly return read-only state after emergency remount
fs/dcache: init in_lookup_hashtable
minix: Deinline get_block, save 2691 bytes
fs: Reorder inode_owner_or_capable() to avoid needless
fs: warn in case userspace lied about modprobe return
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Add the SYSTEM_SCHEDULING bootup state to move various scheduler
debug checks earlier into the bootup. This turns silent and
sporadically deadly bugs into nice, deterministic splats. Fix some
of the splats that triggered. (Thomas Gleixner)
- A round of restructuring and refactoring of the load-balancing and
topology code (Peter Zijlstra)
- Another round of consolidating ~20 of incremental scheduler code
history: this time in terms of wait-queue nomenclature. (I didn't
get much feedback on these renaming patches, and we can still
easily change any names I might have misplaced, so if anyone hates
a new name, please holler and I'll fix it.) (Ingo Molnar)
- sched/numa improvements, fixes and updates (Rik van Riel)
- Another round of x86/tsc scheduler clock code improvements, in hope
of making it more robust (Peter Zijlstra)
- Improve NOHZ behavior (Frederic Weisbecker)
- Deadline scheduler improvements and fixes (Luca Abeni, Daniel
Bristot de Oliveira)
- Simplify and optimize the topology setup code (Lauro Ramos
Venancio)
- Debloat and decouple scheduler code some more (Nicolas Pitre)
- Simplify code by making better use of llist primitives (Byungchul
Park)
- ... plus other fixes and improvements"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
sched/cputime: Refactor the cputime_adjust() code
sched/debug: Expose the number of RT/DL tasks that can migrate
sched/numa: Hide numa_wake_affine() from UP build
sched/fair: Remove effective_load()
sched/numa: Implement NUMA node level wake_affine()
sched/fair: Simplify wake_affine() for the single socket case
sched/numa: Override part of migrate_degrades_locality() when idle balancing
sched/rt: Move RT related code from sched/core.c to sched/rt.c
sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
sched/fair: Spare idle load balancing on nohz_full CPUs
nohz: Move idle balancer registration to the idle path
sched/loadavg: Generalize "_idle" naming to "_nohz"
sched/core: Drop the unused try_get_task_struct() helper function
sched/fair: WARN() and refuse to set buddy when !se->on_rq
sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
...
Checking for capabilities should be the last operation when performing
access control tests so that PF_SUPERPRIV is set only when it was required
for success (implying that the capability was needed for the operation).
Reported-by: Solar Designer <solar@openwall.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Define a set of write life time hints:
RWH_WRITE_LIFE_NOT_SET No hint information set
RWH_WRITE_LIFE_NONE No hints about write life time
RWH_WRITE_LIFE_SHORT Data written has a short life time
RWH_WRITE_LIFE_MEDIUM Data written has a medium life time
RWH_WRITE_LIFE_LONG Data written has a long life time
RWH_WRITE_LIFE_EXTREME Data written has an extremely long life time
The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.
Add an fcntl interface for querying these flags, and also for
setting them as well:
F_GET_RW_HINT Returns the read/write hint set on the
underlying inode.
F_SET_RW_HINT Set one of the above write hints on the
underlying inode.
F_GET_FILE_RW_HINT Returns the read/write hint set on the
file descriptor.
F_SET_FILE_RW_HINT Set one of the above write hints on the
file descriptor.
The user passes in a 64-bit pointer to get/set these values, and
the interface returns 0/-1 on success/error.
Sample program testing/implementing basic setting/getting of write
hints is below.
Add support for storing the write life time hint in the inode flags
and in struct file as well, and pass them to the kiocb flags. If
both a file and its corresponding inode has a write hint, then we
use the one in the file, if available. The file hint can be used
for sync/direct IO, for buffered writeback only the inode hint
is available.
This is in preparation for utilizing these hints in the block layer,
to guide on-media data placement.
/*
* writehint.c: get or set an inode write hint
*/
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdbool.h>
#include <inttypes.h>
#ifndef F_GET_RW_HINT
#define F_LINUX_SPECIFIC_BASE 1024
#define F_GET_RW_HINT (F_LINUX_SPECIFIC_BASE + 11)
#define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12)
#endif
static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
"RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
"RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };
int main(int argc, char *argv[])
{
uint64_t hint;
int fd, ret;
if (argc < 2) {
fprintf(stderr, "%s: file <hint>\n", argv[0]);
return 1;
}
fd = open(argv[1], O_RDONLY);
if (fd < 0) {
perror("open");
return 2;
}
if (argc > 2) {
hint = atoi(argv[2]);
ret = fcntl(fd, F_SET_RW_HINT, &hint);
if (ret < 0) {
perror("fcntl: F_SET_RW_HINT");
return 4;
}
}
ret = fcntl(fd, F_GET_RW_HINT, &hint);
if (ret < 0) {
perror("fcntl: F_GET_RW_HINT");
return 3;
}
printf("%s: hint %s\n", argv[1], str[hint]);
close(fd);
return 0;
}
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rename 'struct wait_bit_queue::wait' to ::wq_entry, to more clearly
name it as a wait-queue entry.
Propagate it to a couple of usage sites where the wait-bit-queue internals
are exposed.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull misc vfs updates from Al Viro:
"Assorted bits and pieces from various people. No common topic in this
pile, sorry"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/affs: add rename exchange
fs/affs: add rename2 to prepare multiple methods
Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()
fs: don't set *REFERENCED on single use objects
fs: compat: Remove warning from COMPATIBLE_IOCTL
remove pointless extern of atime_need_update_rcu()
fs: completely ignore unknown open flags
fs: add a VALID_OPEN_FLAGS
fs: remove _submit_bh()
fs: constify tree_descr arrays passed to simple_fill_super()
fs: drop duplicate header percpu-rwsem.h
fs/affs: bugfix: Write files greater than page size on OFS
fs/affs: bugfix: enable writes on OFS disks
fs/affs: remove node generation check
fs/affs: import amigaffs.h
fs/affs: bugfix: make symbolic links work again
Fix typos and add the following to the scripts/spelling.txt:
intialisation||initialisation
intialised||initialised
intialise||initialise
This commit does not intend to change the British spelling itself.
Link: http://lkml.kernel.org/r/1481573103-11329-18-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By default we set DCACHE_REFERENCED and I_REFERENCED on any dentry or
inode we create. This is problematic as this means that it takes two
trips through the LRU for any of these objects to be reclaimed,
regardless of their actual lifetime. With enough pressure from these
caches we can easily evict our working set from page cache with single
use objects. So instead only set *REFERENCED if we've already been
added to the LRU list. This means that we've been touched since the
first time we were accessed, and so more likely to need to hang out in
cache.
To illustrate this issue I wrote the following scripts
https://github.com/josefbacik/debug-scripts/tree/master/cache-pressure
on my test box. It is a single socket 4 core CPU with 16gib of RAM and
I tested on an Intel 2tib NVME drive. The cache-pressure.sh script
creates a new file system and creates 2 6.5gib files in order to take up
13gib of the 16gib of ram with pagecache. Then it runs a test program
that reads these 2 files in a loop, and keeps track of how often it has
to read bytes for each loop. On an ideal system with no pressure we
should have to read 0 bytes indefinitely. The second thing this script
does is start a fs_mark job that creates a ton of 0 length files,
putting pressure on the system with slab only allocations. On exit the
script prints out how many bytes were read by the read-file program.
The results are as follows
Without patch:
/mnt/btrfs-test/reads/file1: total read during loops 27262988288
/mnt/btrfs-test/reads/file2: total read during loops 27262976000
With patch:
/mnt/btrfs-test/reads/file2: total read during loops 18640457728
/mnt/btrfs-test/reads/file1: total read during loops 9565376512
This patch results in a 50% reduction of the amount of pages evicted
from our working set.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>