This includes one fix merged in advance via f2fs-stable
("xfs: drop I_DIRTY_TIME_EXPIRED").
* aosp/upstream-f2fs-stable-linux-4.19.y:
writeback: Drop I_DIRTY_TIME_EXPIRE
writeback: Fix sync livelock due to b_dirty_time processing
writeback: Avoid skipping inode writeback
writeback: Protect inode->i_io_list with inode->i_lock
Revert "writeback: Avoid skipping inode writeback"
Bug: 154542664
Change-Id: I98a6258cb60227e6ca02e57bf7adf28ab7816cbf
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
Merge 4.19.129 into android-4.19-stable
Changes in 4.19.129
ipv6: fix IPV6_ADDRFORM operation logic
net_failover: fixed rollback in net_failover_open()
bridge: Avoid infinite loop when suppressing NS messages with invalid options
vxlan: Avoid infinite loop when suppressing NS messages with invalid options
tun: correct header offsets in napi frags mode
selftests: bpf: fix use of undeclared RET_IF macro
make 'user_access_begin()' do 'access_ok()'
Fix 'acccess_ok()' on alpha and SH
arch/openrisc: Fix issues with access_ok()
x86: uaccess: Inhibit speculation past access_ok() in user_access_begin()
lib: Reduce user_access_begin() boundaries in strncpy_from_user() and strnlen_user()
btrfs: merge btrfs_find_device and find_device
btrfs: Detect unbalanced tree with empty leaf before crashing btree operations
crypto: talitos - fix ECB and CBC algs ivsize
Input: mms114 - fix handling of mms345l
ARM: 8977/1: ptrace: Fix mask for thumb breakpoint hook
sched/fair: Don't NUMA balance for kthreads
Input: synaptics - add a second working PNP_ID for Lenovo T470s
drivers/net/ibmvnic: Update VNIC protocol version reporting
powerpc/xive: Clear the page tables for the ESB IO mapping
ath9k_htc: Silence undersized packet warnings
RDMA/uverbs: Make the event_queue fds return POLLERR when disassociated
x86/cpu/amd: Make erratum #1054 a legacy erratum
perf probe: Accept the instance number of kretprobe event
mm: add kvfree_sensitive() for freeing sensitive data objects
aio: fix async fsync creds
btrfs: tree-checker: Check level for leaves and nodes
x86_64: Fix jiffies ODR violation
x86/PCI: Mark Intel C620 MROMs as having non-compliant BARs
x86/speculation: Prevent rogue cross-process SSBD shutdown
x86/reboot/quirks: Add MacBook6,1 reboot quirk
efi/efivars: Add missing kobject_put() in sysfs entry creation error path
ALSA: es1688: Add the missed snd_card_free()
ALSA: hda/realtek - add a pintbl quirk for several Lenovo machines
ALSA: usb-audio: Fix inconsistent card PM state after resume
ALSA: usb-audio: Add vendor, product and profile name for HP Thunderbolt Dock
ACPI: sysfs: Fix reference count leak in acpi_sysfs_add_hotplug_profile()
ACPI: CPPC: Fix reference count leak in acpi_cppc_processor_probe()
ACPI: GED: add support for _Exx / _Lxx handler methods
ACPI: PM: Avoid using power resources if there are none for D0
cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages
nilfs2: fix null pointer dereference at nilfs_segctor_do_construct()
spi: dw: Fix controller unregister order
spi: bcm2835aux: Fix controller unregister order
spi: bcm-qspi: when tx/rx buffer is NULL set to 0
PM: runtime: clk: Fix clk_pm_runtime_get() error path
crypto: cavium/nitrox - Fix 'nitrox_get_first_device()' when ndevlist is fully iterated
ALSA: pcm: disallow linking stream to itself
x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned
KVM: x86: Fix APIC page invalidation race
kvm: x86: Fix L1TF mitigation for shadow MMU
KVM: x86/mmu: Consolidate "is MMIO SPTE" code
KVM: x86: only do L1TF workaround on affected processors
x86/speculation: Change misspelled STIPB to STIBP
x86/speculation: Add support for STIBP always-on preferred mode
x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.
x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for indirect branches.
spi: No need to assign dummy value in spi_unregister_controller()
spi: Fix controller unregister order
spi: pxa2xx: Fix controller unregister order
spi: bcm2835: Fix controller unregister order
spi: pxa2xx: Balance runtime PM enable/disable on error
spi: pxa2xx: Fix runtime PM ref imbalance on probe error
crypto: virtio: Fix use-after-free in virtio_crypto_skcipher_finalize_req()
crypto: virtio: Fix src/dst scatterlist calculation in __virtio_crypto_skcipher_do_req()
crypto: virtio: Fix dest length calculation in __virtio_crypto_skcipher_do_req()
selftests/net: in rxtimestamp getopt_long needs terminating null entry
ovl: initialize error in ovl_copy_xattr
proc: Use new_inode not new_inode_pseudo
video: fbdev: w100fb: Fix a potential double free.
KVM: nSVM: fix condition for filtering async PF
KVM: nSVM: leave ASID aside in copy_vmcb_control_area
KVM: nVMX: Consult only the "basic" exit reason when routing nested exit
KVM: MIPS: Define KVM_ENTRYHI_ASID to cpu_asid_mask(&boot_cpu_data)
KVM: MIPS: Fix VPN2_MASK definition for variable cpu_vmbits
KVM: arm64: Make vcpu_cp1x() work on Big Endian hosts
scsi: megaraid_sas: TM command refire leads to controller firmware crash
ath9k: Fix use-after-free Read in ath9k_wmi_ctrl_rx
ath9k: Fix use-after-free Write in ath9k_htc_rx_msg
ath9x: Fix stack-out-of-bounds Write in ath9k_hif_usb_rx_cb
ath9k: Fix general protection fault in ath9k_hif_usb_rx_cb
Smack: slab-out-of-bounds in vsscanf
drm/vkms: Hold gem object while still in-use
mm/slub: fix a memory leak in sysfs_slab_add()
fat: don't allow to mount if the FAT length == 0
perf: Add cond_resched() to task_function_call()
agp/intel: Reinforce the barrier after GTT updates
mmc: sdhci-msm: Clear tuning done flag while hs400 tuning
ARM: dts: at91: sama5d2_ptc_ek: fix sdmmc0 node description
mmc: sdio: Fix potential NULL pointer error in mmc_sdio_init_card()
xen/pvcalls-back: test for errors when calling backend_connect()
KVM: arm64: Synchronize sysreg state on injecting an AArch32 exception
ACPI: GED: use correct trigger type field in _Exx / _Lxx handling
drm: bridge: adv7511: Extend list of audio sample rates
crypto: ccp -- don't "select" CONFIG_DMADEVICES
media: si2157: Better check for running tuner in init
objtool: Ignore empty alternatives
spi: pxa2xx: Apply CS clk quirk to BXT
net: atlantic: make hw_get_regs optional
net: ena: fix error returning in ena_com_get_hash_function()
efi/libstub/x86: Work around LLVM ELF quirk build regression
arm64: cacheflush: Fix KGDB trap detection
spi: dw: Zero DMA Tx and Rx configurations on stack
arm64: insn: Fix two bugs in encoding 32-bit logical immediates
ixgbe: Fix XDP redirect on archs with PAGE_SIZE above 4K
MIPS: Loongson: Build ATI Radeon GPU driver as module
Bluetooth: Add SCO fallback for invalid LMP parameters error
kgdb: Disable WARN_CONSOLE_UNLOCKED for all kgdb
kgdb: Prevent infinite recursive entries to the debugger
spi: dw: Enable interrupts in accordance with DMA xfer mode
clocksource: dw_apb_timer: Make CPU-affiliation being optional
clocksource: dw_apb_timer_of: Fix missing clockevent timers
btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums
ARM: 8978/1: mm: make act_mm() respect THREAD_SIZE
batman-adv: Revert "disable ethtool link speed detection when auto negotiation off"
mmc: meson-mx-sdio: trigger a soft reset after a timeout or CRC error
spi: dw: Fix Rx-only DMA transfers
x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit
net: vmxnet3: fix possible buffer overflow caused by bad DMA value in vmxnet3_get_rss()
staging: android: ion: use vmap instead of vm_map_ram
brcmfmac: fix wrong location to get firmware feature
tools api fs: Make xxx__mountpoint() more scalable
e1000: Distribute switch variables for initialization
dt-bindings: display: mediatek: control dpi pins mode to avoid leakage
audit: fix a net reference leak in audit_send_reply()
media: dvb: return -EREMOTEIO on i2c transfer failure.
media: platform: fcp: Set appropriate DMA parameters
MIPS: Make sparse_init() using top-down allocation
Bluetooth: btbcm: Add 2 missing models to subver tables
audit: fix a net reference leak in audit_list_rules_send()
netfilter: nft_nat: return EOPNOTSUPP if type or flags are not supported
selftests/bpf: Fix memory leak in extract_build_id()
net: bcmgenet: set Rx mode before starting netif
lib/mpi: Fix 64-bit MIPS build with Clang
exit: Move preemption fixup up, move blocking operations down
sched/core: Fix illegal RCU from offline CPUs
drivers/perf: hisi: Fix typo in events attribute array
net: lpc-enet: fix error return code in lpc_mii_init()
media: cec: silence shift wrapping warning in __cec_s_log_addrs()
net: allwinner: Fix use correct return type for ndo_start_xmit()
powerpc/spufs: fix copy_to_user while atomic
xfs: clean up the error handling in xfs_swap_extents
Crypto/chcr: fix for ccm(aes) failed test
MIPS: Truncate link address into 32bit for 32bit kernel
mips: cm: Fix an invalid error code of INTVN_*_ERR
kgdb: Fix spurious true from in_dbg_master()
xfs: reset buffer write failure state on successful completion
xfs: fix duplicate verification from xfs_qm_dqflush()
platform/x86: intel-vbtn: Use acpi_evaluate_integer()
platform/x86: intel-vbtn: Split keymap into buttons and switches parts
platform/x86: intel-vbtn: Do not advertise switches to userspace if they are not there
platform/x86: intel-vbtn: Also handle tablet-mode switch on "Detachable" and "Portable" chassis-types
nvme: refine the Qemu Identify CNS quirk
ath10k: Remove msdu from idr when management pkt send fails
wcn36xx: Fix error handling path in 'wcn36xx_probe()'
net: qed*: Reduce RX and TX default ring count when running inside kdump kernel
mt76: avoid rx reorder buffer overflow
md: don't flush workqueue unconditionally in md_open
veth: Adjust hard_start offset on redirect XDP frames
net/mlx5e: IPoIB, Drop multicast packets that this interface sent
rtlwifi: Fix a double free in _rtl_usb_tx_urb_setup()
mwifiex: Fix memory corruption in dump_station
x86/boot: Correct relocation destination on old linkers
mips: MAAR: Use more precise address mask
mips: Add udelay lpj numbers adjustment
crypto: stm32/crc32 - fix ext4 chksum BUG_ON()
crypto: stm32/crc32 - fix run-time self test issue.
crypto: stm32/crc32 - fix multi-instance
x86/mm: Stop printing BRK addresses
m68k: mac: Don't call via_flush_cache() on Mac IIfx
btrfs: qgroup: mark qgroup inconsistent if we're inherting snapshot to a new qgroup
macvlan: Skip loopback packets in RX handler
PCI: Don't disable decoding when mmio_always_on is set
MIPS: Fix IRQ tracing when call handle_fpe() and handle_msa_fpe()
bcache: fix refcount underflow in bcache_device_free()
mmc: sdhci-msm: Set SDHCI_QUIRK_MULTIBLOCK_READ_ACMD12 quirk
staging: greybus: sdio: Respect the cmd->busy_timeout from the mmc core
mmc: via-sdmmc: Respect the cmd->busy_timeout from the mmc core
ixgbe: fix signed-integer-overflow warning
mmc: sdhci-esdhc-imx: fix the mask for tuning start point
spi: dw: Return any value retrieved from the dma_transfer callback
cpuidle: Fix three reference count leaks
platform/x86: hp-wmi: Convert simple_strtoul() to kstrtou32()
platform/x86: intel-hid: Add a quirk to support HP Spectre X2 (2015)
platform/x86: intel-vbtn: Only blacklist SW_TABLET_MODE on the 9 / "Laptop" chasis-type
string.h: fix incompatibility between FORTIFY_SOURCE and KASAN
btrfs: include non-missing as a qualifier for the latest_bdev
btrfs: send: emit file capabilities after chown
mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()
mm: initialize deferred pages with interrupts enabled
ima: Fix ima digest hash table key calculation
ima: Directly assign the ima_default_policy pointer to ima_rules
evm: Fix possible memory leak in evm_calc_hmac_or_hash()
ext4: fix EXT_MAX_EXTENT/INDEX to check for zeroed eh_max
ext4: fix error pointer dereference
ext4: fix race between ext4_sync_parent() and rename()
PCI: Avoid Pericom USB controller OHCI/EHCI PME# defect
PCI: Avoid FLR for AMD Matisse HD Audio & USB 3.0
PCI: Avoid FLR for AMD Starship USB 3.0
PCI: Add ACS quirk for iProc PAXB
PCI: Add ACS quirk for Intel Root Complex Integrated Endpoints
PCI: Remove unused NFP32xx IDs
pci:ipmi: Move IPMI PCI class id defines to pci_ids.h
hwmon/k10temp, x86/amd_nb: Consolidate shared device IDs
x86/amd_nb: Add PCI device IDs for family 17h, model 30h
PCI: add USR vendor id and use it in r8169 and w6692 driver
PCI: Move Synopsys HAPS platform device IDs
PCI: Move Rohm Vendor ID to generic list
misc: pci_endpoint_test: Add the layerscape EP device support
misc: pci_endpoint_test: Add support to test PCI EP in AM654x
PCI: Add Synopsys endpoint EDDA Device ID
PCI: Add NVIDIA GPU multi-function power dependencies
PCI: Enable NVIDIA HDA controllers
PCI: mediatek: Add controller support for MT7629
x86/amd_nb: Add PCI device IDs for family 17h, model 70h
ALSA: lx6464es - add support for LX6464ESe pci express variant
PCI: Add Genesys Logic, Inc. Vendor ID
PCI: Add Amazon's Annapurna Labs vendor ID
PCI: vmd: Add device id for VMD device 8086:9A0B
x86/amd_nb: Add Family 19h PCI IDs
PCI: Add Loongson vendor ID
serial: 8250_pci: Move Pericom IDs to pci_ids.h
PCI: Make ACS quirk implementations more uniform
PCI: Unify ACS quirk desired vs provided checking
PCI: Generalize multi-function power dependency device links
btrfs: fix error handling when submitting direct I/O bio
btrfs: fix wrong file range cleanup after an error filling dealloc range
ima: Call ima_calc_boot_aggregate() in ima_eventdigest_init()
PCI: Program MPS for RCiEP devices
e1000e: Disable TSO for buffer overrun workaround
e1000e: Relax condition to trigger reset for ME workaround
carl9170: remove P2P_GO support
media: go7007: fix a miss of snd_card_free
Bluetooth: hci_bcm: fix freeing not-requested IRQ
b43legacy: Fix case where channel status is corrupted
b43: Fix connection problem with WPA3
b43_legacy: Fix connection problem with WPA3
media: ov5640: fix use of destroyed mutex
igb: Report speed and duplex as unknown when device is runtime suspended
power: vexpress: add suppress_bind_attrs to true
pinctrl: samsung: Correct setting of eint wakeup mask on s5pv210
pinctrl: samsung: Save/restore eint_mask over suspend for EINT_TYPE GPIOs
gnss: sirf: fix error return code in sirf_probe()
sparc32: fix register window handling in genregs32_[gs]et()
sparc64: fix misuses of access_process_vm() in genregs32_[sg]et()
dm crypt: avoid truncating the logical block size
alpha: fix memory barriers so that they conform to the specification
kernel/cpu_pm: Fix uninitted local in cpu_pm
ARM: tegra: Correct PL310 Auxiliary Control Register initialization
ARM: dts: exynos: Fix GPIO polarity for thr GalaxyS3 CM36651 sensor's bus
ARM: dts: at91: sama5d2_ptc_ek: fix vbus pin
ARM: dts: s5pv210: Set keep-power-in-suspend for SDHCI1 on Aries
drivers/macintosh: Fix memleak in windfarm_pm112 driver
powerpc/64s: Don't let DT CPU features set FSCR_DSCR
powerpc/64s: Save FSCR to init_task.thread.fscr after feature init
kbuild: force to build vmlinux if CONFIG_MODVERSION=y
sunrpc: svcauth_gss_register_pseudoflavor must reject duplicate registrations.
sunrpc: clean up properly in gss_mech_unregister()
mtd: rawnand: brcmnand: fix hamming oob layout
mtd: rawnand: pasemi: Fix the probe error path
w1: omap-hdq: cleanup to add missing newline for some dev_dbg
perf probe: Do not show the skipped events
perf probe: Fix to check blacklist address correctly
perf probe: Check address correctness by map instead of _etext
perf symbols: Fix debuginfo search for Ubuntu
Linux 4.19.129
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I7b1108d90ee1109a28fe488a4358b7a3e101d9c9
commit 9b0eb69b75bccada2d341d7e7ca342f0cb1c9a6a upstream.
btrfs is going to use css_put() and wbc helpers to improve cgroup
writeback support. Add dummy css_get() definition and export wbc
helpers to prepare for module and !CONFIG_CGROUP builds.
[only backport the export of __inode_attach_wb for stable kernels - gregkh]
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The only use of I_DIRTY_TIME_EXPIRE is to detect in
__writeback_single_inode() that the inode got there because the flush
worker decided it's time to write back the dirty inode time stamps
(either because we are syncing or because of age). However, we can
detect this directly in __writeback_single_inode() and there's no need
for the strange propagation via the I_DIRTY_TIME_EXPIRE flag.
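A minimal sketch of the kind of check this allows in __writeback_single_inode(),
assuming the existing wbc->for_sync flag, the dirtied_time_when field and the
dirtytime_expire_interval knob; the exact form in the patch may differ:
    if ((inode->i_state & I_DIRTY_TIME) &&
        (wbc->sync_mode == WB_SYNC_ALL || wbc->for_sync ||
         time_after(jiffies, inode->dirtied_time_when +
                    dirtytime_expire_interval * HZ))) {
            /* Treat the timestamps as dirty now; no flag propagation needed. */
            dirty |= I_DIRTY_TIME;
    }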
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
When we are processing writeback for sync(2), move_expired_inodes()
didn't set any inode expiry value (older_than_this). This can result in
writeback never completing if there's a steady stream of inodes added to
the b_dirty_time list, as writeback rechecks the dirty lists after each
writeback round to see whether there's more work to be done. Fix the
problem by using the sync(2) start time as the inode expiry value when
processing the b_dirty_time list, similarly to ordinarily dirtied inodes.
This requires some refactoring of the older_than_this handling, which
simplifies the code noticeably as a bonus.
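A sketch of the resulting queue_io() shape, under the assumption that
move_expired_inodes() now takes a single expiry cut-off and that the sync path
passes the time the sync started; helper names follow fs/fs-writeback.c but the
exact signatures here are illustrative:
    static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work,
                         unsigned long dirtied_before)
    {
            int moved;
            unsigned long time_expire_jif = dirtied_before;

            assert_spin_locked(&wb->list_lock);
            list_splice_init(&wb->b_more_io, &wb->b_io);
            moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, dirtied_before);
            /*
             * For sync(2), expire b_dirty_time against the sync start time too;
             * otherwise fall back to the usual lazytime ageing interval.
             */
            if (!work->for_sync)
                    time_expire_jif = jiffies - dirtytime_expire_interval * HZ;
            moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io,
                                         time_expire_jif);
            if (moved)
                    wb_io_lists_populated(wb);
    }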
Fixes: 0ae45f63d4 ("vfs: add support for a lazytime mount option")
CC: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Inode's i_io_list list head is used to attach inode to several different
lists - wb->{b_dirty, b_dirty_time, b_io, b_more_io}. When flush worker
prepares a list of inodes to writeback e.g. for sync(2), it moves inodes
to b_io list. Thus it is critical for sync(2) data integrity guarantees
that inode is not requeued to any other writeback list when inode is
queued for processing by flush worker. That's the reason why
writeback_single_inode() does not touch i_io_list (unless the inode is
completely clean) and why __mark_inode_dirty() does not touch i_io_list
if I_SYNC flag is set.
However there are two flaws in the current logic:
1) When inode has only I_DIRTY_TIME set but it is already queued in b_io
list due to sync(2), concurrent __mark_inode_dirty(inode, I_DIRTY_SYNC)
can still move inode back to b_dirty list resulting in skipping
writeback of inode time stamps during sync(2).
2) When inode is on b_dirty_time list and writeback_single_inode() races
with __mark_inode_dirty() like:
writeback_single_inode()               __mark_inode_dirty(inode, I_DIRTY_PAGES)
  inode->i_state |= I_SYNC
  __writeback_single_inode()
                                         inode->i_state |= I_DIRTY_PAGES;
                                         if (inode->i_state & I_SYNC)
                                           bail
  if (!(inode->i_state & I_DIRTY_ALL))
    - not true so nothing done
We end up with I_DIRTY_PAGES inode on b_dirty_time list and thus
standard background writeback will not writeback this inode leading to
possible dirty throttling stalls etc. (thanks to Martijn Coenen for this
analysis).
Fix these problems by tracking whether inode is queued in b_io or
b_more_io lists in a new I_SYNC_QUEUED flag. When this flag is set, we
know flush worker has queued inode and we should not touch i_io_list.
On the other hand we also know that once flush worker is done with the
inode it will requeue the inode to appropriate dirty list. When
I_SYNC_QUEUED is not set, __mark_inode_dirty() can (and must) move inode
to appropriate dirty list.
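A sketch of the __mark_inode_dirty() side, assuming the new I_SYNC_QUEUED bit
and the existing inode_io_list_move_locked() helper; simplified from the actual
patch:
    /* Only requeue if the flush worker has not already queued the inode. */
    if (!(inode->i_state & I_SYNC_QUEUED)) {
            struct list_head *dirty_list;

            if (inode->i_state & I_DIRTY)
                    dirty_list = &wb->b_dirty;
            else
                    dirty_list = &wb->b_dirty_time;
            inode_io_list_move_locked(inode, wb, dirty_list);
    }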
Reported-by: Martijn Coenen <maco@android.com>
Reviewed-by: Martijn Coenen <maco@android.com>
Tested-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 0ae45f63d4 ("vfs: add support for a lazytime mount option")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Currently, operations on inode->i_io_list are protected by
wb->list_lock. In the following patches we'll need to maintain
consistency between inode->i_state and inode->i_io_list, so change the
code so that inode->i_lock also protects all of the inode's i_io_list handling.
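A minimal, simplified sketch of what the new locking rule looks like in the
list-moving helper; the lockdep-style assertion on i_lock is the essence of the
change, the rest mirrors the existing helper in fs/fs-writeback.c:
    static bool inode_io_list_move_locked(struct inode *inode,
                                          struct bdi_writeback *wb,
                                          struct list_head *head)
    {
            assert_spin_locked(&wb->list_lock);
            assert_spin_locked(&inode->i_lock);     /* new rule after this patch */

            list_move(&inode->i_io_list, head);
            /* dirty_time doesn't count as dirty_io until expiration */
            return head != &wb->b_dirty_time;
    }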
Reviewed-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
CC: stable@vger.kernel.org # Prerequisite for "writeback: Avoid skipping inode writeback"
Signed-off-by: Jan Kara <jack@suse.cz>
* aosp/upstream-f2fs-stable-linux-4.19.y:
f2fs: attach IO flags to the missing cases
f2fs: add node_io_flag for bio flags likewise data_io_flag
f2fs: remove unused parameter of f2fs_put_rpages_mapping()
f2fs: handle readonly filesystem in f2fs_ioc_shutdown()
f2fs: avoid utf8_strncasecmp() with unstable name
f2fs: don't return vmalloc() memory from f2fs_kmalloc()
f2fs: fix retry logic in f2fs_write_cache_pages()
f2fs: fix wrong discard space
f2fs: compress: don't compress any datas after cp stop
f2fs: remove unneeded return value of __insert_discard_tree()
f2fs: fix wrong value of tracepoint parameter
f2fs: protect new segment allocation in expand_inode_data
f2fs: code cleanup by removing ifdef macro surrounding
writeback: Avoid skipping inode writeback
f2fs: avoid inifinite loop to wait for flushing node pages at cp_error
f2fs: compress: fix zstd data corruption
f2fs: add compressed/gc data read IO stat
f2fs: fix potential use-after-free issue
f2fs: compress: don't handle non-compressed data in workqueue
f2fs: remove redundant assignment to variable err
f2fs: refactor resize_fs to avoid meta updates in progress
f2fs: use round_up to enhance calculation
f2fs: introduce F2FS_IOC_RESERVE_COMPRESS_BLOCKS
f2fs: Avoid double lock for cp_rwsem during checkpoint
f2fs: report delalloc reserve as non-free in statfs for project quota
f2fs: Fix wrong stub helper update_sit_info
f2fs: compress: let lz4 compressor handle output buffer budget properly
f2fs: remove blk_plugging in block_operations
f2fs: introduce F2FS_IOC_RELEASE_COMPRESS_BLOCKS
f2fs: shrink spinlock coverage
f2fs: correctly fix the parent inode number during fsync()
f2fs: introduce mempool for {,de}compress intermediate page allocation
f2fs: introduce f2fs_bmap_compress()
f2fs: support fiemap on compressed inode
f2fs: support partial truncation on compressed inode
f2fs: remove redundant compress inode check
f2fs: use strcmp() in parse_options()
f2fs: Use the correct style for SPDX License Identifier
Conflicts:
fs/f2fs/data.c
fs/f2fs/dir.c
Bug: 154167995
Change-Id: I04ec97a9cafef2d7b8736f36a2a8d244965cae9a
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
Inode's i_io_list list head is used to attach inode to several different
lists - wb->{b_dirty, b_dirty_time, b_io, b_more_io}. When flush worker
prepares a list of inodes to writeback e.g. for sync(2), it moves inodes
to b_io list. Thus it is critical for sync(2) data integrity guarantees
that inode is not requeued to any other writeback list when inode is
queued for processing by flush worker. That's the reason why
writeback_single_inode() does not touch i_io_list (unless the inode is
completely clean) and why __mark_inode_dirty() does not touch i_io_list
if I_SYNC flag is set.
However there are two flaws in the current logic:
1) When inode has only I_DIRTY_TIME set but it is already queued in b_io
list due to sync(2), concurrent __mark_inode_dirty(inode, I_DIRTY_SYNC)
can still move inode back to b_dirty list resulting in skipping
writeback of inode time stamps during sync(2).
2) When inode is on b_dirty_time list and writeback_single_inode() races
with __mark_inode_dirty() like:
writeback_single_inode()               __mark_inode_dirty(inode, I_DIRTY_PAGES)
  inode->i_state |= I_SYNC
  __writeback_single_inode()
                                         inode->i_state |= I_DIRTY_PAGES;
                                         if (inode->i_state & I_SYNC)
                                           bail
  if (!(inode->i_state & I_DIRTY_ALL))
    - not true so nothing done
We end up with I_DIRTY_PAGES inode on b_dirty_time list and thus
standard background writeback will not writeback this inode leading to
possible dirty throttling stalls etc. (thanks to Martijn Coenen for this
analysis).
Fix these problems by tracking whether inode is queued in b_io or
b_more_io lists in a new I_SYNC_QUEUED flag. When this flag is set, we
know flush worker has queued inode and we should not touch i_io_list.
On the other hand we also know that once flush worker is done with the
inode it will requeue the inode to appropriate dirty list. When
I_SYNC_QUEUED is not set, __mark_inode_dirty() can (and must) move inode
to appropriate dirty list.
Reported-by: Martijn Coenen <maco@android.com>
Fixes: 0ae45f63d4 ("vfs: add support for a lazytime mount option")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
commit 65de03e251382306a4575b1779c57c87889eee49 upstream.
cgroup writeback tries to refresh the associated wb immediately if the
current wb is dead. This is to avoid keeping issuing IOs on the stale
wb after memcg - blkcg association has changed (ie. when blkcg got
disabled / enabled higher up in the hierarchy).
Unfortunately, the logic gets triggered spuriously on inodes which are
associated with dead cgroups. When the logic is triggered on dead
cgroups, the attempt fails only after doing quite a bit of work
allocating and initializing a new wb.
c3aab9a0bd91 ("mm/filemap.c: don't initiate writeback if mapping has no
dirty pages") alleviated the issue significantly, as the logic now only
triggers when the inode has dirty pages. However, the condition can
still be triggered before the inode is switched to a different cgroup,
where the logic simply doesn't make sense.
Skip the immediate switching if the associated memcg is dying.
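A sketch of the check, assuming the wb_dying()/css_is_dying() helpers and the
wbc->wb/wbc->wb_id fields used by cgroup writeback:
    /*
     * In wbc_attach_and_unlock_inode(): only force an immediate switch
     * away from a dead wb when the owning memcg is still alive.
     */
    if (unlikely(wb_dying(wbc->wb) && !css_is_dying(wbc->wb->memcg_css)))
            inode_switch_wbs(inode, wbc->wb_id);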
This is a simplified version of the following two patches:
* https://lore.kernel.org/linux-mm/20190513183053.GA73423@dennisz-mbp/
* http://lkml.kernel.org/r/156355839560.2063.5265687291430814589.stgit@buzz
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: e8a7abf5a5 ("writeback: disassociate inodes from dying bdi_writebacks")
Acked-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6631142229005e1b1c311a09efe9fb3cfdac8559 ]
wbc_account_io() collects information on cgroup ownership of writeback
pages to determine which cgroup should own the inode. Pages can stay
associated with dead memcgs but we want to avoid attributing IOs to
dead blkcgs as much as possible as the association is likely to be
stale. However, currently, pages associated with dead memcgs
contribute to the accounting delaying and/or confusing the
arbitration.
Fix it by ignoring pages associated with dead memcgs.
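A sketch of the early return inside wbc_account_io(), assuming
mem_cgroup_css_from_page() and the CSS_ONLINE flag:
    css = mem_cgroup_css_from_page(page);
    /* Dead cgroups shouldn't contribute to inode ownership arbitration. */
    if (!(css->flags & CSS_ONLINE))
            return;
    id = css->id;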
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit ec084de929e419e51bcdafaafe567d9e7d0273b7 upstream.
synchronize_rcu() doesn't wait for call_rcu() callbacks, so an inode wb
switch may not have reached the workqueue after synchronize_rcu()
returns. Thus previously scheduled switches may not be finished even
after flushing the workqueue, which causes the NULL pointer dereference
shown below.
VFS: Busy inodes after unmount of vdd. Self-destruct in 5 seconds. Have a nice day...
BUG: unable to handle kernel NULL pointer dereference at 0000000000000278
evict+0xb3/0x180
iput+0x1b0/0x230
inode_switch_wbs_work_fn+0x3c0/0x6a0
worker_thread+0x4e/0x490
? process_one_work+0x410/0x410
kthread+0xe6/0x100
ret_from_fork+0x39/0x50
Replace the synchronize_rcu() call with rcu_barrier() to wait for all
pending callbacks to finish. Also increment isw_nr_in_flight after
call_rcu() in inode_switch_wbs(), which makes more sense.
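Roughly what the umount-side flush looks like afterwards (a sketch, assuming
the isw_wq workqueue and the isw_nr_in_flight counter mentioned above):
    void cgroup_writeback_umount(void)
    {
            if (atomic_read(&isw_nr_in_flight)) {
                    /*
                     * rcu_barrier() waits for pending call_rcu() callbacks,
                     * so every queued switch is visible to the flush below.
                     */
                    rcu_barrier();
                    flush_workqueue(isw_wq);
            }
    }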
Link: http://lkml.kernel.org/r/20190429024108.54150-1-jiufei.xue@linux.alibaba.com
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Acked-by: Tejun Heo <tj@kernel.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 7fc5854f8c6efae9e7624970ab49a1eac2faefb1 ]
sync_inodes_sb() can race against cgwb (cgroup writeback) membership
switches and fail to writeback some inodes. For example, if an inode
switches to another wb while sync_inodes_sb() is in progress, the new
wb might not be visible to bdi_split_work_to_wbs() at all or the inode
might jump from a wb which hasn't issued writebacks yet to one which
already has.
This patch adds backing_dev_info->wb_switch_rwsem to synchronize cgwb
switch path against sync_inodes_sb() so that sync_inodes_sb() is
guaranteed to see all the target wbs and inodes can't jump wbs to
escape syncing.
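A sketch of how the rwsem is used on both sides, with helper names as added by
this patch (illustrative, not the full diff):
    /* sync_inodes_sb(): exclude cgwb switches while fanning work out. */
    bdi_down_write_wb_switch_rwsem(bdi);
    bdi_split_work_to_wbs(bdi, &work, false);
    wb_wait_for_completion(bdi, &done);
    bdi_up_write_wb_switch_rwsem(bdi);

    /* inode_switch_wbs_work_fn(): take it for read before moving i_wb. */
    down_read(&bdi->wb_switch_rwsem);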
v2: Fixed misplaced rwsem init. Spotted by Jiufei.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jiufei Xue <xuejiufei@gmail.com>
Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.com
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Syzbot has reported that it can hit a NULL pointer dereference in
wb_workfn() due to wb->bdi->dev being NULL. This indicates that
wb_workfn() was called for an already unregistered bdi which should not
happen as wb_shutdown() called from bdi_unregister() should make sure
all pending writeback works are completed before bdi is unregistered.
Except that wb_workfn() itself can requeue the work with:
mod_delayed_work(bdi_wq, &wb->dwork, 0);
and if this happens while wb_shutdown() is waiting in:
flush_delayed_work(&wb->dwork);
the dwork can get executed after wb_shutdown() has finished and
bdi_unregister() has cleared wb->bdi->dev.
Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
the necessary precautions against racing with bdi unregistration.
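A sketch of the requeueing tail of wb_workfn() after the fix; the helpers are
spelled wb_wakeup()/wb_wakeup_delayed() in fs/fs-writeback.c, and both check
WB_registered under wb->work_lock:
    if (!list_empty(&wb->work_list))
            wb_wakeup(wb);          /* checks WB_registered under work_lock */
    else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
            wb_wakeup_delayed(wb);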
CC: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
CC: Tejun Heo <tj@kernel.org>
Fixes: 839a8e8660
Reported-by: syzbot <syzbot+9873874c735f2892e7e9@syzkaller.appspotmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.
unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains. Switches occur when
enough writes are issued from a new domain.
This existing pattern is thus suspicious:
lock_page_memcg(page);
unlocked_inode_to_wb_begin(inode, &locked);
...
unlocked_inode_to_wb_end(inode, locked);
unlock_page_memcg(page);
If an inode switch and a process memcg migration are both in flight, then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock. This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg():
truncate
__cancel_dirty_page
lock_page_memcg
unlocked_inode_to_wb_begin
unlocked_inode_to_wb_end
<interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    <deadlock>
unlock_page_memcg
Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).
If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:
cd /mnt/cgroup/memory
mkdir a b
echo 1 > a/memory.move_charge_at_immigrate
echo 1 > b/memory.move_charge_at_immigrate
(
echo $BASHPID > a/cgroup.procs
while true; do
dd if=/dev/zero of=/mnt/big bs=1M count=256
done
) &
while true; do
sync
done &
sleep 1h &
SLEEP=$!
while true; do
echo $SLEEP > a/cgroup.procs
echo $SLEEP > b/cgroup.procs
done
The deadlock does not seem possible today, so it's debatable whether
there's any reason to modify the kernel. I suggest we do, to prevent
future surprises. And Wang Long said "this deadlock occurs three times
in our environment", so there's more reason to apply this, even to
stable.
Stable 4.4 has minor conflicts applying this patch. For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146
Wang Long said "this deadlock occurs three times in our environment"
[gthelen@google.com: v4]
Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
[akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
Fixes: 682aa8e1a6 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reported-by: Wang Long <wanglong19@meituan.com>
Acked-by: Wang Long <wanglong19@meituan.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: <stable@vger.kernel.org> [v4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.
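An illustrative before/after for a typical caller (a sketch; the mechanical
conversion in the patch covers many such sites):
    /* old */
    spin_lock_irq(&mapping->tree_lock);
    radix_tree_delete(&mapping->page_tree, page->index);
    spin_unlock_irq(&mapping->tree_lock);

    /* new */
    xa_lock_irq(&mapping->i_pages);
    radix_tree_delete(&mapping->i_pages, page->index);
    xa_unlock_irq(&mapping->i_pages);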
[willy@infradead.org: fix nds32, fs/dax.c]
Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
And use it in a few more places rather than opencoding the values.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The @head can be wb->b_dirty_time, so update the comment.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Wang Long <wanglong19@meituan.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is a pure automated search-and-replace of the internal kernel
superblock flags.
The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.
Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.
The script to do this was:
# places to look in; re security/*: it generally should *not* be
# touched (that stuff parses mount(2) arguments directly), but
# there are two places where we really deal with superblock flags.
FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
include/linux/fs.h include/uapi/linux/bfs_fs.h \
security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
# the list of MS_... constants
SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
ACTIVE NOUSER"
SED_PROG=
for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
# we want files that contain at least one of MS_...,
# with fs/namespace.c and fs/pnode.c excluded.
L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
for f in $L; do sed -i $f $SED_PROG; done
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 925a6efb8f ("Btrfs: stop using
try_to_writeback_inodes_sb_nr to flush delalloc") this function hasn't
had any external users, so stop exporting it.
In addition we merge it into try_to_writeback_inodes_sb() which is the
only caller. Also change return type of try_to_writeback_inodes_sb to
void as the only user ext4 doesn't care.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Handle start-all writeback like we do periodic or kupdate
style writeback - by marking the bdi_writeback as needing a full
flush, and simply waking the thread. This eliminates the need to
allocate and queue a specific work item just for this purpose.
After this change, we truly only ever have one of them running at
any point in time. We mark the need to start all flushes, and the
writeback thread will clear it once it has processed the request.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When someone calls wakeup_flusher_threads() or
wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
pages in the system (or on that bdi). If we are tight on memory, we
can get tons of these queued from kswapd/vmscan. This causes (at
least) two problems:
1) We consume a ton of memory just allocating writeback work items.
We've seen as much as 600 million of these writeback work items
pending. That's a lot of memory to pointlessly hold hostage,
while the box is under memory pressure.
2) We spend so much time processing these work items, that we
introduce a softlockup in writeback processing. This is because
each of the writeback work items don't end up doing any work (it's
hard when you have millions of identical ones coming in to the
flush machinery), so we just sit in a tight loop pulling work
items and deleting/freeing them.
Fix this by adding a 'start_all' bit to the writeback structure, and
set that when someone attempts to flush all dirty pages. The bit is
cleared when we start writeback on that work item. If the bit is
already set when we attempt to queue !nr_pages writeback, then we
simply ignore it.
This provides us one full flush in flight, with one pending as well,
and makes for more efficient handling of this type of writeback.
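A sketch of the resulting wb_start_writeback(), assuming the WB_start_all state
bit and start_all_reason field introduced by this series:
    static void wb_start_writeback(struct bdi_writeback *wb, enum wb_reason reason)
    {
            if (!wb_has_dirty_io(wb))
                    return;

            /* Only one "flush everything" request may be pending at a time. */
            if (test_bit(WB_start_all, &wb->state) ||
                test_and_set_bit(WB_start_all, &wb->state))
                    return;

            wb->start_all_reason = reason;
            wb_wakeup(wb);
    }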
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that we have no external callers of wb_start_writeback(), we
can shuffle the passing in of 'nr_pages'. Everybody passes in 0
at this point, so just kill the argument and move the dirty
count retrieval to that function.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't have any callers outside of fs-writeback.c anymore,
make it private.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Similar to wakeup_flusher_threads(), except that we only wake
up the flusher threads on the specified backing device.
No functional changes in this patch.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We're writing back the full range of dirty pages on the devices,
there's no point in making this special and not do normal range
cyclic writeback.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Everybody is passing in 0 now, let's get rid of the argument.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently the writeback statistics code uses percpu counters to hold
various statistics. Furthermore we have 2 families of functions - those
which disable local irqs and those which don't, whose names begin with a
double underscore. However, they both end up calling __add_wb_stat,
which in turn calls percpu_counter_add_batch, which is already irq-safe.
Exploiting this fact allows us to eliminate the __wb_* functions, since
they don't add any protection beyond what we already have.
Furthermore, refactor the wb_* functions to call __add_wb_stat directly,
without the irq-disabling dance. This will likely result in better
runtime of code which modifies the stat counters.
While at it also document why percpu_counter_add_batch is in fact
preempt and irq-safe since at least 3 people got confused.
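Roughly what the surviving helpers look like (a sketch of the post-patch
include/linux/backing-dev.h shape):
    static inline void __add_wb_stat(struct bdi_writeback *wb,
                                     enum wb_stat_item item, s64 amount)
    {
            /* percpu_counter_add_batch() is already preempt- and irq-safe. */
            percpu_counter_add_batch(&wb->stat[item], amount, WB_STAT_BATCH);
    }

    static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
    {
            __add_wb_stat(wb, item, 1);
    }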
Link: http://lkml.kernel.org/r/1498029937-27293-1-git-send-email-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sphinx gets confused when it finds indentation without a
good reason for it and without a preceding blank line:
./fs/mpage.c:347: ERROR: Unexpected indentation.
./fs/namei.c:4303: ERROR: Unexpected indentation.
./fs/fs-writeback.c:2060: ERROR: Unexpected indentation.
No functional changes.
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
When WB_registered flag is not set, wb_queue_work() skips queuing the
work, but does not perform the necessary clean up. In particular, if
work->auto_free is true, it should free the memory.
The leak condition can be reproduced by following these steps:
mount /dev/sdb /mnt/sdb
/* In qemu console: device_del sdb */
umount /dev/sdb
Above will result in a wb_queue_work() call on an unregistered wb and
thus leak memory.
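A simplified sketch of the fix in wb_queue_work(); the actual patch also
completes any work->done waiters, which is omitted here:
    spin_lock_bh(&wb->work_lock);
    if (test_bit(WB_registered, &wb->state)) {
            list_add_tail(&work->list, &wb->work_list);
            mod_delayed_work(bdi_wq, &wb->dwork, 0);
    } else {
            /* The wb is shut down: nobody will run or free this work. */
            if (work->auto_free)
                    kfree(work);
    }
    spin_unlock_bh(&wb->work_lock);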
Reported-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
I've found a funny live-lock between raid10 barriers during resync and
memory controller hard limits. Inside mpage_readpages() the task holds on
to its plug bio, which blocks the barrier in raid10. Its memory cgroup has
no free memory, so the task goes into the reclaimer, but all reclaimable
pages are dirty and cannot be written because raid10 is rebuilding and
stuck on the barrier.
The usual flush of such IO in schedule() never happens, because the caller
doesn't go to sleep.
The lock is 'live' because changing the memory limit or killing tasks
which hold that stuck bio unblocks the whole thing.
That is what happened in 3.18.x, but I see no difference in the upstream
logic. Theoretically this might happen even without a memory cgroup.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@fb.com>
There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages are
accounted on the zone. This can be coped with to some extent but it's
confusing, so this patch moves the relevant file-based accounting to the
node. Due to throttling logic in the page allocator for reliable OOM
detection, it is still necessary to track dirty and writeback pages on a
per-zone basis.
[mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The per-sb inode writeback list tracks inodes currently under writeback
to facilitate efficient sync processing. In particular, it ensures that
sync only needs to walk through a list of inodes that were cleaned by
the sync.
Add a couple tracepoints to help identify when inodes are added/removed
to and from the writeback lists. Piggyback off of the writeback
lazytime tracepoint template as it already tracks the relevant inode
information.
Link: http://lkml.kernel.org/r/1466594593-6757-3-git-send-email-bfoster@redhat.com
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
cc: Josef Bacik <jbacik@fb.com>
Cc: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
wait_sb_inodes() currently does a walk of all inodes in the filesystem
to find dirty ones to wait on during sync. This is highly inefficient
and wastes a lot of CPU when there are lots of clean cached inodes that
we don't need to wait on.
To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for. We do this by
adding inodes to a writeback list on the sb when the mapping is first
tagged as having pages under writeback. wait_sb_inodes() can then walk
this list of "inodes under IO" and wait specifically just for the inodes
that the current sync(2) needs to wait for.
Define a couple helpers to add/remove an inode from the writeback list
and call them when the overall mapping is tagged for or cleared from
writeback. Update wait_sb_inodes() to walk only the inodes under
writeback due to the sync.
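A sketch of the add-side helper, assuming the i_wb_list/s_inodes_wb names and
the s_inode_wblist_lock protecting this list (trimmed; the tracepoint from the
follow-up patch is omitted):
    void sb_mark_inode_writeback(struct inode *inode)
    {
            struct super_block *sb = inode->i_sb;
            unsigned long flags;

            if (list_empty(&inode->i_wb_list)) {
                    spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
                    if (list_empty(&inode->i_wb_list))
                            list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
                    spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
            }
    }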
With this change, filesystem sync times are significantly reduced for
fs' with largely populated inode caches and otherwise no other work to
do. For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem
with a ~10m entry inode cache, sync times are reduced from ~7.3s to less
than 0.1s when the filesystem is fully clean.
Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
time ago with the promise that one day it would be possible to implement
the page cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and likely never will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE, and it's a constant source of confusion whether the
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in the page cache are special. They are
not.
The changes are pretty straightforward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
the script below. For some reason, coccinelle doesn't patch header
files, so I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When cgroup writeback is in use, there can be multiple wb's
(bdi_writeback's) per bdi and an inode may switch among them
dynamically. In a couple of places, the wrong wb was used, leading to
operations being performed on the wrong list under the wrong lock and
corrupting the io lists.
* writeback_single_inode() was taking @wb parameter and used it to
remove the inode from io lists if it becomes clean after writeback.
The callers of this function were always passing in the root wb
regardless of the actual wb that the inode was associated with,
which could also change while writeback is in progress.
Fix it by dropping the @wb parameter and using
inode_to_wb_and_lock_list() to determine and lock the associated wb.
* After writeback_sb_inodes() writes out an inode, it re-locks @wb and
inode to remove it from or move it to the right io list. It assumes
that the inode is still associated with @wb; however, the inode may
have switched to another wb while writeback was in progress.
Fix it by using inode_to_wb_and_lock_list() to determine and lock
the associated wb after writeback is complete. As the function
requires the original @wb->list_lock locked for the next iteration,
in the unlikely case where the inode has changed association, switch
the locks.
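A sketch of the second fix in writeback_sb_inodes(), assuming
inode_to_wb_and_lock_list() and the existing requeue_inode()/
inode_sync_complete() helpers; the lock juggling at the end is the interesting
part:
    /* The inode may have switched wbs while we were writing it out. */
    tmp_wb = inode_to_wb_and_lock_list(inode);
    spin_lock(&inode->i_lock);
    if (!(inode->i_state & I_DIRTY_ALL))
            wrote++;
    requeue_inode(inode, tmp_wb, &wbc);
    inode_sync_complete(inode);
    spin_unlock(&inode->i_lock);

    if (unlikely(tmp_wb != wb)) {
            /* Keep the original @wb->list_lock held for the next iteration. */
            spin_unlock(&tmp_wb->list_lock);
            spin_lock(&wb->list_lock);
    }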
Kudos to Tahsin for pinpointing these subtle breakages.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: d10c809552 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Link: http://lkml.kernel.org/g/CAAeU0aMYeM_39Y2+PaRvyB1nqAPYZSNngJ1eBRmrxn7gKAt2Mg@mail.gmail.com
Reported-and-diagnosed-by: Tahsin Erdogan <tahsin@google.com>
Tested-by: Tahsin Erdogan <tahsin@google.com>
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Jens Axboe <axboe@fb.com>
locked_inode_to_wb_and_lock_list() wb_get()'s the wb associated with
the target inode, unlocks inode, locks the wb's list_lock and verifies
that the inode is still associated with the wb. To prevent the wb
going away between dropping inode lock and acquiring list_lock, the wb
is pinned while inode lock is held. The wb reference is put right
after acquiring list_lock citing that the wb won't be dereferenced
anymore.
This isn't true. If the inode is still associated with the wb, the
inode has a reference and it's safe to return the wb; however, if the
inode has been switched, the wb still needs to be unlocked, which is a
dereference and can lead to a use-after-free if it races with wb
destruction.
Fix it by putting the reference after releasing list_lock.
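A sketch of the corrected ordering inside the retry loop, assuming the
wb_get()/wb_put() refcount helpers:
    wb_get(wb);
    spin_unlock(&inode->i_lock);
    spin_lock(&wb->list_lock);
    if (likely(wb == inode->i_wb)) {
            wb_put(wb);     /* @inode already holds a reference */
            return wb;
    }
    spin_unlock(&wb->list_lock);
    wb_put(wb);             /* only drop after the last dereference */
    cpu_relax();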
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 87e1d789bf ("writeback: implement [locked_]inode_to_wb_and_lock_list()")
Cc: stable@vger.kernel.org # v4.2+
Tested-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If cgroup writeback is in use, inodes can be scheduled for
asynchronous wb switching. Before 5ff8eaac16 ("writeback: keep
superblock pinned during cgroup writeback association switches"), this
could race with umount leading to super_block being destroyed while
inodes are pinned for wb switching. 5ff8eaac16 fixed it by bumping
s_active while wb switches are in flight; however, this allowed
in-flight wb switches to make umounts asynchronous when userland
expected them to be synchronous - e.g. an fsck immediately following
umount may fail because the device is still busy.
This patch removes the problematic super_block pinning and instead
makes generic_shutdown_super() flush in-flight wb switches. wb
switches are now executed on a dedicated isw_wq so that they can be
flushed and isw_nr_in_flight keeps track of the number of in-flight wb
switches so that flushing can be avoided in most cases.
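A sketch of the mechanism, using the isw_wq/isw_nr_in_flight names from the
description (the later rcu_barrier() refinement is a separate patch in this
log):
    static struct workqueue_struct *isw_wq;         /* dedicated switch workqueue */
    static atomic_t isw_nr_in_flight = ATOMIC_INIT(0);

    /* Called from generic_shutdown_super() before the sb goes away. */
    void cgroup_writeback_umount(void)
    {
            if (atomic_read(&isw_nr_in_flight))
                    flush_workqueue(isw_wq);
    }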
v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
check in inode_switch_wbs() as Jan and Al suggested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tahsin Erdogan <tahsin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: 5ff8eaac16 ("writeback: keep superblock pinned during cgroup writeback association switches")
Cc: stable@vger.kernel.org #v4.5
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If cgroup writeback is in use, an inode is associated with a cgroup
for writeback. If the inode's main dirtier changes to another cgroup,
the association gets updated asynchronously. Nothing was pinning the
superblock while such switches are in progress and superblock could go
away while async switching is pending or in progress leading to
crashes like the following.
kernel BUG at fs/jbd2/transaction.c:319!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51
Hardware name: Google Google, BIOS Google 01/01/2011
Workqueue: events inode_switch_wbs_work_fn
task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000
RIP: 0010:[<ffffffff803e6922>] [<ffffffff803e6922>] start_this_handle+0x382/0x3e0
RSP: 0018:ffff880209267c30 EFLAGS: 00010202
...
Call Trace:
[<ffffffff803e6be4>] jbd2__journal_start+0xf4/0x190
[<ffffffff803cfc7e>] __ext4_journal_start_sb+0x4e/0x70
[<ffffffff803b31ec>] ext4_evict_inode+0x12c/0x3d0
[<ffffffff8035338b>] evict+0xbb/0x190
[<ffffffff80354190>] iput+0x130/0x190
[<ffffffff80360223>] inode_switch_wbs_work_fn+0x343/0x4c0
[<ffffffff80279819>] process_one_work+0x129/0x300
[<ffffffff80279b16>] worker_thread+0x126/0x480
[<ffffffff8027ed14>] kthread+0xc4/0xe0
[<ffffffff809771df>] ret_from_fork+0x3f/0x70
Fix it by bumping s_active while cgroup association switching is in
flight.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Tahsin Erdogan <tahsin@google.com>
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: d10c809552 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Cc: stable@vger.kernel.org #v4.5+
Signed-off-by: Jens Axboe <axboe@fb.com>
In earlier versions, mem_cgroup_css_from_page() could return non-root
css on a legacy hierarchy which can go away and required rcu locking;
however, the eventual version simply returns the root cgroup if memcg is
on a legacy hierarchy and thus doesn't need rcu locking around or in it.
Remove spurious rcu lockings.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-doc warnings in fs/fs-writeback.c by moving a #define macro to
after the function's opening brace. Also #undef this macro at the end of
the function.
../fs/fs-writeback.c:1984: warning: Excess function parameter 'inode' description in 'I_DIRTY_INODE'
../fs/fs-writeback.c:1984: warning: Excess function parameter 'flags' description in 'I_DIRTY_INODE'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
filemap_fdatawait() is a function to wait for on-going writeback to
complete, but it also consumes and clears the error status of the mapping
set during writeback.
The latter functionality is critical for applications to detect writeback
error with system calls like fsync(2)/fdatasync(2).
However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl,
which don't check error status of individual mappings.
As a result, fsync() may not be able to detect writeback error if events
happen in the following order:
Application                          System admin
----------------------------------------------------------
write data on page cache
                                     Run sync command
                                     writeback completes with error
                                     filemap_fdatawait() clears error
fsync returns success
(but the data is not on disk)
This patch adds filemap_fdatawait_keep_errors() for call sites where
writeback error is not handled so that they don't clear error status.
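Illustrative use at a sync(2)-style call site (a sketch):
    /*
     * Wait for writeback without consuming the mapping's AS_EIO/AS_ENOSPC
     * bits, so a later fsync()/fdatasync() still sees the error.
     */
    filemap_fdatawait_keep_errors(inode->i_mapping);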
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Fengguang Wu <fengguang.wu@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue()
to walk @bdi->wb_list. To set up the initial iteration
condition, it uses list_entry_rcu() to calculate the entry
pointer corresponding to the list head; however, this isn't an
actual RCU dereference and using list_entry_rcu() for it ended
up breaking a proposed list_entry_rcu() change because it was
feeding a non-lvalue pointer into the macro.
Don't use the RCU variant for simple pointer offsetting. Use
list_entry() instead.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Patrick Marlier <patrick.marlier@gmail.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pranith kumar <bobby.prani@gmail.com>
Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
bdi_for_each_wb() is used in several places to wake up or issue
writeback work items to all wb's (bdi_writeback's) on a given bdi.
The iteration is performed by walking bdi->cgwb_tree; however, the
tree only indexes wb's which are currently active.
For example, when a memcg gets associated with a different blkcg, the
old wb is removed from the tree so that the new one can be indexed.
The old wb starts dying from then on but will linger till all its
inodes are drained. As these dying wb's may still host dirty inodes,
writeback operations which affect all wb's must include them.
bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
failing to sync the inodes belonging to those wb's.
This patch adds an RCU-protected @bdi->wb_list which lists all wb's
belonging to that bdi. wb's are added on creation and removed on
release rather than on the start of destruction. bdi_for_each_wb()
usages are replaced with list_for_each[_continue]_rcu() iterations
over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.
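A sketch of the two sides, assuming the bdi_node list member added by this
patch (the plain _rcu iterator is shown here; some callers use the _continue
variant):
    /* On wb creation: visible to writeback until final release. */
    list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);

    /* Issuing work to every wb, including ones that are already dying: */
    rcu_read_lock();
    list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) {
            /* ... queue or wait for writeback on @wb ... */
    }
    rcu_read_unlock();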
v2: Updated as per Jan. last_wb ref leak in bdi_split_work_to_wbs()
fixed and unnecessary list head severing in cgwb_bdi_destroy()
removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com>
Fixes: ebe41ab0c7 ("writeback: implement bdi_for_each_wb()")
Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>