This includes one fix merged in advance via f2fs-stable
("xfs: drop I_DIRTY_TIME_EXPIRED").
* aosp/upstream-f2fs-stable-linux-4.19.y:
writeback: Drop I_DIRTY_TIME_EXPIRE
writeback: Fix sync livelock due to b_dirty_time processing
writeback: Avoid skipping inode writeback
writeback: Protect inode->i_io_list with inode->i_lock
Revert "writeback: Avoid skipping inode writeback"
Bug: 154542664
Change-Id: I98a6258cb60227e6ca02e57bf7adf28ab7816cbf
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
Merge 4.19.129 into android-4.19-stable
Changes in 4.19.129
ipv6: fix IPV6_ADDRFORM operation logic
net_failover: fixed rollback in net_failover_open()
bridge: Avoid infinite loop when suppressing NS messages with invalid options
vxlan: Avoid infinite loop when suppressing NS messages with invalid options
tun: correct header offsets in napi frags mode
selftests: bpf: fix use of undeclared RET_IF macro
make 'user_access_begin()' do 'access_ok()'
Fix 'acccess_ok()' on alpha and SH
arch/openrisc: Fix issues with access_ok()
x86: uaccess: Inhibit speculation past access_ok() in user_access_begin()
lib: Reduce user_access_begin() boundaries in strncpy_from_user() and strnlen_user()
btrfs: merge btrfs_find_device and find_device
btrfs: Detect unbalanced tree with empty leaf before crashing btree operations
crypto: talitos - fix ECB and CBC algs ivsize
Input: mms114 - fix handling of mms345l
ARM: 8977/1: ptrace: Fix mask for thumb breakpoint hook
sched/fair: Don't NUMA balance for kthreads
Input: synaptics - add a second working PNP_ID for Lenovo T470s
drivers/net/ibmvnic: Update VNIC protocol version reporting
powerpc/xive: Clear the page tables for the ESB IO mapping
ath9k_htc: Silence undersized packet warnings
RDMA/uverbs: Make the event_queue fds return POLLERR when disassociated
x86/cpu/amd: Make erratum #1054 a legacy erratum
perf probe: Accept the instance number of kretprobe event
mm: add kvfree_sensitive() for freeing sensitive data objects
aio: fix async fsync creds
btrfs: tree-checker: Check level for leaves and nodes
x86_64: Fix jiffies ODR violation
x86/PCI: Mark Intel C620 MROMs as having non-compliant BARs
x86/speculation: Prevent rogue cross-process SSBD shutdown
x86/reboot/quirks: Add MacBook6,1 reboot quirk
efi/efivars: Add missing kobject_put() in sysfs entry creation error path
ALSA: es1688: Add the missed snd_card_free()
ALSA: hda/realtek - add a pintbl quirk for several Lenovo machines
ALSA: usb-audio: Fix inconsistent card PM state after resume
ALSA: usb-audio: Add vendor, product and profile name for HP Thunderbolt Dock
ACPI: sysfs: Fix reference count leak in acpi_sysfs_add_hotplug_profile()
ACPI: CPPC: Fix reference count leak in acpi_cppc_processor_probe()
ACPI: GED: add support for _Exx / _Lxx handler methods
ACPI: PM: Avoid using power resources if there are none for D0
cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages
nilfs2: fix null pointer dereference at nilfs_segctor_do_construct()
spi: dw: Fix controller unregister order
spi: bcm2835aux: Fix controller unregister order
spi: bcm-qspi: when tx/rx buffer is NULL set to 0
PM: runtime: clk: Fix clk_pm_runtime_get() error path
crypto: cavium/nitrox - Fix 'nitrox_get_first_device()' when ndevlist is fully iterated
ALSA: pcm: disallow linking stream to itself
x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned
KVM: x86: Fix APIC page invalidation race
kvm: x86: Fix L1TF mitigation for shadow MMU
KVM: x86/mmu: Consolidate "is MMIO SPTE" code
KVM: x86: only do L1TF workaround on affected processors
x86/speculation: Change misspelled STIPB to STIBP
x86/speculation: Add support for STIBP always-on preferred mode
x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.
x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for indirect branches.
spi: No need to assign dummy value in spi_unregister_controller()
spi: Fix controller unregister order
spi: pxa2xx: Fix controller unregister order
spi: bcm2835: Fix controller unregister order
spi: pxa2xx: Balance runtime PM enable/disable on error
spi: pxa2xx: Fix runtime PM ref imbalance on probe error
crypto: virtio: Fix use-after-free in virtio_crypto_skcipher_finalize_req()
crypto: virtio: Fix src/dst scatterlist calculation in __virtio_crypto_skcipher_do_req()
crypto: virtio: Fix dest length calculation in __virtio_crypto_skcipher_do_req()
selftests/net: in rxtimestamp getopt_long needs terminating null entry
ovl: initialize error in ovl_copy_xattr
proc: Use new_inode not new_inode_pseudo
video: fbdev: w100fb: Fix a potential double free.
KVM: nSVM: fix condition for filtering async PF
KVM: nSVM: leave ASID aside in copy_vmcb_control_area
KVM: nVMX: Consult only the "basic" exit reason when routing nested exit
KVM: MIPS: Define KVM_ENTRYHI_ASID to cpu_asid_mask(&boot_cpu_data)
KVM: MIPS: Fix VPN2_MASK definition for variable cpu_vmbits
KVM: arm64: Make vcpu_cp1x() work on Big Endian hosts
scsi: megaraid_sas: TM command refire leads to controller firmware crash
ath9k: Fix use-after-free Read in ath9k_wmi_ctrl_rx
ath9k: Fix use-after-free Write in ath9k_htc_rx_msg
ath9x: Fix stack-out-of-bounds Write in ath9k_hif_usb_rx_cb
ath9k: Fix general protection fault in ath9k_hif_usb_rx_cb
Smack: slab-out-of-bounds in vsscanf
drm/vkms: Hold gem object while still in-use
mm/slub: fix a memory leak in sysfs_slab_add()
fat: don't allow to mount if the FAT length == 0
perf: Add cond_resched() to task_function_call()
agp/intel: Reinforce the barrier after GTT updates
mmc: sdhci-msm: Clear tuning done flag while hs400 tuning
ARM: dts: at91: sama5d2_ptc_ek: fix sdmmc0 node description
mmc: sdio: Fix potential NULL pointer error in mmc_sdio_init_card()
xen/pvcalls-back: test for errors when calling backend_connect()
KVM: arm64: Synchronize sysreg state on injecting an AArch32 exception
ACPI: GED: use correct trigger type field in _Exx / _Lxx handling
drm: bridge: adv7511: Extend list of audio sample rates
crypto: ccp -- don't "select" CONFIG_DMADEVICES
media: si2157: Better check for running tuner in init
objtool: Ignore empty alternatives
spi: pxa2xx: Apply CS clk quirk to BXT
net: atlantic: make hw_get_regs optional
net: ena: fix error returning in ena_com_get_hash_function()
efi/libstub/x86: Work around LLVM ELF quirk build regression
arm64: cacheflush: Fix KGDB trap detection
spi: dw: Zero DMA Tx and Rx configurations on stack
arm64: insn: Fix two bugs in encoding 32-bit logical immediates
ixgbe: Fix XDP redirect on archs with PAGE_SIZE above 4K
MIPS: Loongson: Build ATI Radeon GPU driver as module
Bluetooth: Add SCO fallback for invalid LMP parameters error
kgdb: Disable WARN_CONSOLE_UNLOCKED for all kgdb
kgdb: Prevent infinite recursive entries to the debugger
spi: dw: Enable interrupts in accordance with DMA xfer mode
clocksource: dw_apb_timer: Make CPU-affiliation being optional
clocksource: dw_apb_timer_of: Fix missing clockevent timers
btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums
ARM: 8978/1: mm: make act_mm() respect THREAD_SIZE
batman-adv: Revert "disable ethtool link speed detection when auto negotiation off"
mmc: meson-mx-sdio: trigger a soft reset after a timeout or CRC error
spi: dw: Fix Rx-only DMA transfers
x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit
net: vmxnet3: fix possible buffer overflow caused by bad DMA value in vmxnet3_get_rss()
staging: android: ion: use vmap instead of vm_map_ram
brcmfmac: fix wrong location to get firmware feature
tools api fs: Make xxx__mountpoint() more scalable
e1000: Distribute switch variables for initialization
dt-bindings: display: mediatek: control dpi pins mode to avoid leakage
audit: fix a net reference leak in audit_send_reply()
media: dvb: return -EREMOTEIO on i2c transfer failure.
media: platform: fcp: Set appropriate DMA parameters
MIPS: Make sparse_init() using top-down allocation
Bluetooth: btbcm: Add 2 missing models to subver tables
audit: fix a net reference leak in audit_list_rules_send()
netfilter: nft_nat: return EOPNOTSUPP if type or flags are not supported
selftests/bpf: Fix memory leak in extract_build_id()
net: bcmgenet: set Rx mode before starting netif
lib/mpi: Fix 64-bit MIPS build with Clang
exit: Move preemption fixup up, move blocking operations down
sched/core: Fix illegal RCU from offline CPUs
drivers/perf: hisi: Fix typo in events attribute array
net: lpc-enet: fix error return code in lpc_mii_init()
media: cec: silence shift wrapping warning in __cec_s_log_addrs()
net: allwinner: Fix use correct return type for ndo_start_xmit()
powerpc/spufs: fix copy_to_user while atomic
xfs: clean up the error handling in xfs_swap_extents
Crypto/chcr: fix for ccm(aes) failed test
MIPS: Truncate link address into 32bit for 32bit kernel
mips: cm: Fix an invalid error code of INTVN_*_ERR
kgdb: Fix spurious true from in_dbg_master()
xfs: reset buffer write failure state on successful completion
xfs: fix duplicate verification from xfs_qm_dqflush()
platform/x86: intel-vbtn: Use acpi_evaluate_integer()
platform/x86: intel-vbtn: Split keymap into buttons and switches parts
platform/x86: intel-vbtn: Do not advertise switches to userspace if they are not there
platform/x86: intel-vbtn: Also handle tablet-mode switch on "Detachable" and "Portable" chassis-types
nvme: refine the Qemu Identify CNS quirk
ath10k: Remove msdu from idr when management pkt send fails
wcn36xx: Fix error handling path in 'wcn36xx_probe()'
net: qed*: Reduce RX and TX default ring count when running inside kdump kernel
mt76: avoid rx reorder buffer overflow
md: don't flush workqueue unconditionally in md_open
veth: Adjust hard_start offset on redirect XDP frames
net/mlx5e: IPoIB, Drop multicast packets that this interface sent
rtlwifi: Fix a double free in _rtl_usb_tx_urb_setup()
mwifiex: Fix memory corruption in dump_station
x86/boot: Correct relocation destination on old linkers
mips: MAAR: Use more precise address mask
mips: Add udelay lpj numbers adjustment
crypto: stm32/crc32 - fix ext4 chksum BUG_ON()
crypto: stm32/crc32 - fix run-time self test issue.
crypto: stm32/crc32 - fix multi-instance
x86/mm: Stop printing BRK addresses
m68k: mac: Don't call via_flush_cache() on Mac IIfx
btrfs: qgroup: mark qgroup inconsistent if we're inherting snapshot to a new qgroup
macvlan: Skip loopback packets in RX handler
PCI: Don't disable decoding when mmio_always_on is set
MIPS: Fix IRQ tracing when call handle_fpe() and handle_msa_fpe()
bcache: fix refcount underflow in bcache_device_free()
mmc: sdhci-msm: Set SDHCI_QUIRK_MULTIBLOCK_READ_ACMD12 quirk
staging: greybus: sdio: Respect the cmd->busy_timeout from the mmc core
mmc: via-sdmmc: Respect the cmd->busy_timeout from the mmc core
ixgbe: fix signed-integer-overflow warning
mmc: sdhci-esdhc-imx: fix the mask for tuning start point
spi: dw: Return any value retrieved from the dma_transfer callback
cpuidle: Fix three reference count leaks
platform/x86: hp-wmi: Convert simple_strtoul() to kstrtou32()
platform/x86: intel-hid: Add a quirk to support HP Spectre X2 (2015)
platform/x86: intel-vbtn: Only blacklist SW_TABLET_MODE on the 9 / "Laptop" chasis-type
string.h: fix incompatibility between FORTIFY_SOURCE and KASAN
btrfs: include non-missing as a qualifier for the latest_bdev
btrfs: send: emit file capabilities after chown
mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()
mm: initialize deferred pages with interrupts enabled
ima: Fix ima digest hash table key calculation
ima: Directly assign the ima_default_policy pointer to ima_rules
evm: Fix possible memory leak in evm_calc_hmac_or_hash()
ext4: fix EXT_MAX_EXTENT/INDEX to check for zeroed eh_max
ext4: fix error pointer dereference
ext4: fix race between ext4_sync_parent() and rename()
PCI: Avoid Pericom USB controller OHCI/EHCI PME# defect
PCI: Avoid FLR for AMD Matisse HD Audio & USB 3.0
PCI: Avoid FLR for AMD Starship USB 3.0
PCI: Add ACS quirk for iProc PAXB
PCI: Add ACS quirk for Intel Root Complex Integrated Endpoints
PCI: Remove unused NFP32xx IDs
pci:ipmi: Move IPMI PCI class id defines to pci_ids.h
hwmon/k10temp, x86/amd_nb: Consolidate shared device IDs
x86/amd_nb: Add PCI device IDs for family 17h, model 30h
PCI: add USR vendor id and use it in r8169 and w6692 driver
PCI: Move Synopsys HAPS platform device IDs
PCI: Move Rohm Vendor ID to generic list
misc: pci_endpoint_test: Add the layerscape EP device support
misc: pci_endpoint_test: Add support to test PCI EP in AM654x
PCI: Add Synopsys endpoint EDDA Device ID
PCI: Add NVIDIA GPU multi-function power dependencies
PCI: Enable NVIDIA HDA controllers
PCI: mediatek: Add controller support for MT7629
x86/amd_nb: Add PCI device IDs for family 17h, model 70h
ALSA: lx6464es - add support for LX6464ESe pci express variant
PCI: Add Genesys Logic, Inc. Vendor ID
PCI: Add Amazon's Annapurna Labs vendor ID
PCI: vmd: Add device id for VMD device 8086:9A0B
x86/amd_nb: Add Family 19h PCI IDs
PCI: Add Loongson vendor ID
serial: 8250_pci: Move Pericom IDs to pci_ids.h
PCI: Make ACS quirk implementations more uniform
PCI: Unify ACS quirk desired vs provided checking
PCI: Generalize multi-function power dependency device links
btrfs: fix error handling when submitting direct I/O bio
btrfs: fix wrong file range cleanup after an error filling dealloc range
ima: Call ima_calc_boot_aggregate() in ima_eventdigest_init()
PCI: Program MPS for RCiEP devices
e1000e: Disable TSO for buffer overrun workaround
e1000e: Relax condition to trigger reset for ME workaround
carl9170: remove P2P_GO support
media: go7007: fix a miss of snd_card_free
Bluetooth: hci_bcm: fix freeing not-requested IRQ
b43legacy: Fix case where channel status is corrupted
b43: Fix connection problem with WPA3
b43_legacy: Fix connection problem with WPA3
media: ov5640: fix use of destroyed mutex
igb: Report speed and duplex as unknown when device is runtime suspended
power: vexpress: add suppress_bind_attrs to true
pinctrl: samsung: Correct setting of eint wakeup mask on s5pv210
pinctrl: samsung: Save/restore eint_mask over suspend for EINT_TYPE GPIOs
gnss: sirf: fix error return code in sirf_probe()
sparc32: fix register window handling in genregs32_[gs]et()
sparc64: fix misuses of access_process_vm() in genregs32_[sg]et()
dm crypt: avoid truncating the logical block size
alpha: fix memory barriers so that they conform to the specification
kernel/cpu_pm: Fix uninitted local in cpu_pm
ARM: tegra: Correct PL310 Auxiliary Control Register initialization
ARM: dts: exynos: Fix GPIO polarity for thr GalaxyS3 CM36651 sensor's bus
ARM: dts: at91: sama5d2_ptc_ek: fix vbus pin
ARM: dts: s5pv210: Set keep-power-in-suspend for SDHCI1 on Aries
drivers/macintosh: Fix memleak in windfarm_pm112 driver
powerpc/64s: Don't let DT CPU features set FSCR_DSCR
powerpc/64s: Save FSCR to init_task.thread.fscr after feature init
kbuild: force to build vmlinux if CONFIG_MODVERSION=y
sunrpc: svcauth_gss_register_pseudoflavor must reject duplicate registrations.
sunrpc: clean up properly in gss_mech_unregister()
mtd: rawnand: brcmnand: fix hamming oob layout
mtd: rawnand: pasemi: Fix the probe error path
w1: omap-hdq: cleanup to add missing newline for some dev_dbg
perf probe: Do not show the skipped events
perf probe: Fix to check blacklist address correctly
perf probe: Check address correctness by map instead of _etext
perf symbols: Fix debuginfo search for Ubuntu
Linux 4.19.129
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I7b1108d90ee1109a28fe488a4358b7a3e101d9c9
commit 9b0eb69b75bccada2d341d7e7ca342f0cb1c9a6a upstream.
btrfs is going to use css_put() and wbc helpers to improve cgroup
writeback support. Add dummy css_get() definition and export wbc
helpers to prepare for module and !CONFIG_CGROUP builds.
[only backport the export of __inode_attach_wb for stable kernels - gregkh]
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The only use of I_DIRTY_TIME_EXPIRE is to detect in
__writeback_single_inode() that the inode got there because the flush
worker decided it's time to write back the dirty inode time stamps
(either because we are syncing or because of age). However, we can
detect this directly in __writeback_single_inode() and there's no need
for the strange propagation via the I_DIRTY_TIME_EXPIRE flag.
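A minimal sketch of the kind of check this allows in __writeback_single_inode(),
assuming the existing wbc->for_sync flag, the dirtied_time_when field and the
dirtytime_expire_interval knob; the exact form in the patch may differ:
    if ((inode->i_state & I_DIRTY_TIME) &&
        (wbc->sync_mode == WB_SYNC_ALL || wbc->for_sync ||
         time_after(jiffies, inode->dirtied_time_when +
                    dirtytime_expire_interval * HZ))) {
            /* Treat the timestamps as dirty now; no flag propagation needed. */
            dirty |= I_DIRTY_TIME;
    }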
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
When we are processing writeback for sync(2), move_expired_inodes()
didn't set any inode expiry value (older_than_this). This can result in
writeback never completing if there's a steady stream of inodes added to
the b_dirty_time list, as writeback rechecks the dirty lists after each
writeback round to see whether there's more work to be done. Fix the
problem by using the sync(2) start time as the inode expiry value when
processing the b_dirty_time list, similarly to ordinarily dirtied inodes.
This requires some refactoring of the older_than_this handling, which
simplifies the code noticeably as a bonus.
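A sketch of the resulting queue_io() shape, under the assumption that
move_expired_inodes() now takes a single expiry cut-off and that the sync path
passes the time the sync started; helper names follow fs/fs-writeback.c but the
exact signatures here are illustrative:
    static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work,
                         unsigned long dirtied_before)
    {
            int moved;
            unsigned long time_expire_jif = dirtied_before;

            assert_spin_locked(&wb->list_lock);
            list_splice_init(&wb->b_more_io, &wb->b_io);
            moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, dirtied_before);
            /*
             * For sync(2), expire b_dirty_time against the sync start time too;
             * otherwise fall back to the usual lazytime ageing interval.
             */
            if (!work->for_sync)
                    time_expire_jif = jiffies - dirtytime_expire_interval * HZ;
            moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io,
                                         time_expire_jif);
            if (moved)
                    wb_io_lists_populated(wb);
    }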
Fixes: 0ae45f63d4 ("vfs: add support for a lazytime mount option")
CC: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Inode's i_io_list list head is used to attach inode to several different
lists - wb->{b_dirty, b_dirty_time, b_io, b_more_io}. When flush worker
prepares a list of inodes to writeback e.g. for sync(2), it moves inodes
to b_io list. Thus it is critical for sync(2) data integrity guarantees
that inode is not requeued to any other writeback list when inode is
queued for processing by flush worker. That's the reason why
writeback_single_inode() does not touch i_io_list (unless the inode is
completely clean) and why __mark_inode_dirty() does not touch i_io_list
if I_SYNC flag is set.
However there are two flaws in the current logic:
1) When inode has only I_DIRTY_TIME set but it is already queued in b_io
list due to sync(2), concurrent __mark_inode_dirty(inode, I_DIRTY_SYNC)
can still move inode back to b_dirty list resulting in skipping
writeback of inode time stamps during sync(2).
2) When inode is on b_dirty_time list and writeback_single_inode() races
with __mark_inode_dirty() like:
writeback_single_inode()               __mark_inode_dirty(inode, I_DIRTY_PAGES)
  inode->i_state |= I_SYNC
  __writeback_single_inode()
                                         inode->i_state |= I_DIRTY_PAGES;
                                         if (inode->i_state & I_SYNC)
                                           bail
  if (!(inode->i_state & I_DIRTY_ALL))
    - not true so nothing done
We end up with I_DIRTY_PAGES inode on b_dirty_time list and thus
standard background writeback will not writeback this inode leading to
possible dirty throttling stalls etc. (thanks to Martijn Coenen for this
analysis).
Fix these problems by tracking whether inode is queued in b_io or
b_more_io lists in a new I_SYNC_QUEUED flag. When this flag is set, we
know flush worker has queued inode and we should not touch i_io_list.
On the other hand we also know that once flush worker is done with the
inode it will requeue the inode to appropriate dirty list. When
I_SYNC_QUEUED is not set, __mark_inode_dirty() can (and must) move inode
to appropriate dirty list.
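A sketch of the __mark_inode_dirty() side, assuming the new I_SYNC_QUEUED bit
and the existing inode_io_list_move_locked() helper; simplified from the actual
patch:
    /* Only requeue if the flush worker has not already queued the inode. */
    if (!(inode->i_state & I_SYNC_QUEUED)) {
            struct list_head *dirty_list;

            if (inode->i_state & I_DIRTY)
                    dirty_list = &wb->b_dirty;
            else
                    dirty_list = &wb->b_dirty_time;
            inode_io_list_move_locked(inode, wb, dirty_list);
    }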
Reported-by: Martijn Coenen <maco@android.com>
Reviewed-by: Martijn Coenen <maco@android.com>
Tested-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 0ae45f63d4 ("vfs: add support for a lazytime mount option")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Currently, operations on inode->i_io_list are protected by
wb->list_lock. In the following patches we'll need to maintain
consistency between inode->i_state and inode->i_io_list, so change the
code so that inode->i_lock also protects all of the inode's i_io_list handling.
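A minimal, simplified sketch of what the new locking rule looks like in the
list-moving helper; the lockdep-style assertion on i_lock is the essence of the
change, the rest mirrors the existing helper in fs/fs-writeback.c:
    static bool inode_io_list_move_locked(struct inode *inode,
                                          struct bdi_writeback *wb,
                                          struct list_head *head)
    {
            assert_spin_locked(&wb->list_lock);
            assert_spin_locked(&inode->i_lock);     /* new rule after this patch */

            list_move(&inode->i_io_list, head);
            /* dirty_time doesn't count as dirty_io until expiration */
            return head != &wb->b_dirty_time;
    }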
Reviewed-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
CC: stable@vger.kernel.org # Prerequisite for "writeback: Avoid skipping inode writeback"
Signed-off-by: Jan Kara <jack@suse.cz>
* aosp/upstream-f2fs-stable-linux-4.19.y:
f2fs: attach IO flags to the missing cases
f2fs: add node_io_flag for bio flags likewise data_io_flag
f2fs: remove unused parameter of f2fs_put_rpages_mapping()
f2fs: handle readonly filesystem in f2fs_ioc_shutdown()
f2fs: avoid utf8_strncasecmp() with unstable name
f2fs: don't return vmalloc() memory from f2fs_kmalloc()
f2fs: fix retry logic in f2fs_write_cache_pages()
f2fs: fix wrong discard space
f2fs: compress: don't compress any datas after cp stop
f2fs: remove unneeded return value of __insert_discard_tree()
f2fs: fix wrong value of tracepoint parameter
f2fs: protect new segment allocation in expand_inode_data
f2fs: code cleanup by removing ifdef macro surrounding
writeback: Avoid skipping inode writeback
f2fs: avoid inifinite loop to wait for flushing node pages at cp_error
f2fs: compress: fix zstd data corruption
f2fs: add compressed/gc data read IO stat
f2fs: fix potential use-after-free issue
f2fs: compress: don't handle non-compressed data in workqueue
f2fs: remove redundant assignment to variable err
f2fs: refactor resize_fs to avoid meta updates in progress
f2fs: use round_up to enhance calculation
f2fs: introduce F2FS_IOC_RESERVE_COMPRESS_BLOCKS
f2fs: Avoid double lock for cp_rwsem during checkpoint
f2fs: report delalloc reserve as non-free in statfs for project quota
f2fs: Fix wrong stub helper update_sit_info
f2fs: compress: let lz4 compressor handle output buffer budget properly
f2fs: remove blk_plugging in block_operations
f2fs: introduce F2FS_IOC_RELEASE_COMPRESS_BLOCKS
f2fs: shrink spinlock coverage
f2fs: correctly fix the parent inode number during fsync()
f2fs: introduce mempool for {,de}compress intermediate page allocation
f2fs: introduce f2fs_bmap_compress()
f2fs: support fiemap on compressed inode
f2fs: support partial truncation on compressed inode
f2fs: remove redundant compress inode check
f2fs: use strcmp() in parse_options()
f2fs: Use the correct style for SPDX License Identifier
Conflicts:
fs/f2fs/data.c
fs/f2fs/dir.c
Bug: 154167995
Change-Id: I04ec97a9cafef2d7b8736f36a2a8d244965cae9a
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
Inode's i_io_list list head is used to attach inode to several different
lists - wb->{b_dirty, b_dirty_time, b_io, b_more_io}. When flush worker
prepares a list of inodes to writeback e.g. for sync(2), it moves inodes
to b_io list. Thus it is critical for sync(2) data integrity guarantees
that inode is not requeued to any other writeback list when inode is
queued for processing by flush worker. That's the reason why
writeback_single_inode() does not touch i_io_list (unless the inode is
completely clean) and why __mark_inode_dirty() does not touch i_io_list
if I_SYNC flag is set.
However there are two flaws in the current logic:
1) When inode has only I_DIRTY_TIME set but it is already queued in b_io
list due to sync(2), concurrent __mark_inode_dirty(inode, I_DIRTY_SYNC)
can still move inode back to b_dirty list resulting in skipping
writeback of inode time stamps during sync(2).
2) When inode is on b_dirty_time list and writeback_single_inode() races
with __mark_inode_dirty() like:
writeback_single_inode()               __mark_inode_dirty(inode, I_DIRTY_PAGES)
  inode->i_state |= I_SYNC
  __writeback_single_inode()
                                         inode->i_state |= I_DIRTY_PAGES;
                                         if (inode->i_state & I_SYNC)
                                           bail
  if (!(inode->i_state & I_DIRTY_ALL))
    - not true so nothing done
We end up with I_DIRTY_PAGES inode on b_dirty_time list and thus
standard background writeback will not writeback this inode leading to
possible dirty throttling stalls etc. (thanks to Martijn Coenen for this
analysis).
Fix these problems by tracking whether inode is queued in b_io or
b_more_io lists in a new I_SYNC_QUEUED flag. When this flag is set, we
know flush worker has queued inode and we should not touch i_io_list.
On the other hand we also know that once flush worker is done with the
inode it will requeue the inode to appropriate dirty list. When
I_SYNC_QUEUED is not set, __mark_inode_dirty() can (and must) move inode
to appropriate dirty list.
Reported-by: Martijn Coenen <maco@android.com>
Fixes: 0ae45f63d4 ("vfs: add support for a lazytime mount option")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
commit 65de03e251382306a4575b1779c57c87889eee49 upstream.
cgroup writeback tries to refresh the associated wb immediately if the
current wb is dead. This is to avoid keeping issuing IOs on the stale
wb after memcg - blkcg association has changed (ie. when blkcg got
disabled / enabled higher up in the hierarchy).
Unfortunately, the logic gets triggered spuriously on inodes which are
associated with dead cgroups. When the logic is triggered on dead
cgroups, the attempt fails only after doing quite a bit of work
allocating and initializing a new wb.
c3aab9a0bd91 ("mm/filemap.c: don't initiate writeback if mapping has no
dirty pages") alleviated the issue significantly, as the logic now only
triggers when the inode has dirty pages. However, the condition can
still be triggered before the inode is switched to a different cgroup,
where the logic simply doesn't make sense.
Skip the immediate switching if the associated memcg is dying.
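A sketch of the check, assuming the wb_dying()/css_is_dying() helpers and the
wbc->wb/wbc->wb_id fields used by cgroup writeback:
    /*
     * In wbc_attach_and_unlock_inode(): only force an immediate switch
     * away from a dead wb when the owning memcg is still alive.
     */
    if (unlikely(wb_dying(wbc->wb) && !css_is_dying(wbc->wb->memcg_css)))
            inode_switch_wbs(inode, wbc->wb_id);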
This is a simplified version of the following two patches:
* https://lore.kernel.org/linux-mm/20190513183053.GA73423@dennisz-mbp/
* http://lkml.kernel.org/r/156355839560.2063.5265687291430814589.stgit@buzz
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: e8a7abf5a5 ("writeback: disassociate inodes from dying bdi_writebacks")
Acked-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6631142229005e1b1c311a09efe9fb3cfdac8559 ]
wbc_account_io() collects information on cgroup ownership of writeback
pages to determine which cgroup should own the inode. Pages can stay
associated with dead memcgs but we want to avoid attributing IOs to
dead blkcgs as much as possible as the association is likely to be
stale. However, currently, pages associated with dead memcgs
contribute to the accounting delaying and/or confusing the
arbitration.
Fix it by ignoring pages associated with dead memcgs.
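A sketch of the early return inside wbc_account_io(), assuming
mem_cgroup_css_from_page() and the CSS_ONLINE flag:
    css = mem_cgroup_css_from_page(page);
    /* Dead cgroups shouldn't contribute to inode ownership arbitration. */
    if (!(css->flags & CSS_ONLINE))
            return;
    id = css->id;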
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit ec084de929e419e51bcdafaafe567d9e7d0273b7 upstream.
synchronize_rcu() doesn't wait for call_rcu() callbacks, so an inode wb
switch may not have reached the workqueue after synchronize_rcu()
returns. Thus previously scheduled switches may not be finished even
after flushing the workqueue, which causes the NULL pointer dereference
shown below.
VFS: Busy inodes after unmount of vdd. Self-destruct in 5 seconds. Have a nice day...
BUG: unable to handle kernel NULL pointer dereference at 0000000000000278
evict+0xb3/0x180
iput+0x1b0/0x230
inode_switch_wbs_work_fn+0x3c0/0x6a0
worker_thread+0x4e/0x490
? process_one_work+0x410/0x410
kthread+0xe6/0x100
ret_from_fork+0x39/0x50
Replace the synchronize_rcu() call with rcu_barrier() to wait for all
pending callbacks to finish. Also increment isw_nr_in_flight after
call_rcu() in inode_switch_wbs(), which makes more sense.
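Roughly what the umount-side flush looks like afterwards (a sketch, assuming
the isw_wq workqueue and the isw_nr_in_flight counter mentioned above):
    void cgroup_writeback_umount(void)
    {
            if (atomic_read(&isw_nr_in_flight)) {
                    /*
                     * rcu_barrier() waits for pending call_rcu() callbacks,
                     * so every queued switch is visible to the flush below.
                     */
                    rcu_barrier();
                    flush_workqueue(isw_wq);
            }
    }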
Link: http://lkml.kernel.org/r/20190429024108.54150-1-jiufei.xue@linux.alibaba.com
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Acked-by: Tejun Heo <tj@kernel.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 7fc5854f8c6efae9e7624970ab49a1eac2faefb1 ]
sync_inodes_sb() can race against cgwb (cgroup writeback) membership
switches and fail to writeback some inodes. For example, if an inode
switches to another wb while sync_inodes_sb() is in progress, the new
wb might not be visible to bdi_split_work_to_wbs() at all or the inode
might jump from a wb which hasn't issued writebacks yet to one which
already has.
This patch adds backing_dev_info->wb_switch_rwsem to synchronize cgwb
switch path against sync_inodes_sb() so that sync_inodes_sb() is
guaranteed to see all the target wbs and inodes can't jump wbs to
escape syncing.
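A sketch of how the rwsem is used on both sides, with helper names as added by
this patch (illustrative, not the full diff):
    /* sync_inodes_sb(): exclude cgwb switches while fanning work out. */
    bdi_down_write_wb_switch_rwsem(bdi);
    bdi_split_work_to_wbs(bdi, &work, false);
    wb_wait_for_completion(bdi, &done);
    bdi_up_write_wb_switch_rwsem(bdi);

    /* inode_switch_wbs_work_fn(): take it for read before moving i_wb. */
    down_read(&bdi->wb_switch_rwsem);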
v2: Fixed misplaced rwsem init. Spotted by Jiufei.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jiufei Xue <xuejiufei@gmail.com>
Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.com
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Syzbot has reported that it can hit a NULL pointer dereference in
wb_workfn() due to wb->bdi->dev being NULL. This indicates that
wb_workfn() was called for an already unregistered bdi which should not
happen as wb_shutdown() called from bdi_unregister() should make sure
all pending writeback works are completed before bdi is unregistered.
Except that wb_workfn() itself can requeue the work with:
mod_delayed_work(bdi_wq, &wb->dwork, 0);
and if this happens while wb_shutdown() is waiting in:
flush_delayed_work(&wb->dwork);
the dwork can get executed after wb_shutdown() has finished and
bdi_unregister() has cleared wb->bdi->dev.
Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
the necessary precautions against racing with bdi unregistration.
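A sketch of the requeueing tail of wb_workfn() after the fix; the helpers are
spelled wb_wakeup()/wb_wakeup_delayed() in fs/fs-writeback.c, and both check
WB_registered under wb->work_lock:
    if (!list_empty(&wb->work_list))
            wb_wakeup(wb);          /* checks WB_registered under work_lock */
    else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
            wb_wakeup_delayed(wb);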
CC: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
CC: Tejun Heo <tj@kernel.org>
Fixes: 839a8e8660
Reported-by: syzbot <syzbot+9873874c735f2892e7e9@syzkaller.appspotmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.
unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains. Switches occur when
enough writes are issued from a new domain.
This existing pattern is thus suspicious:
lock_page_memcg(page);
unlocked_inode_to_wb_begin(inode, &locked);
...
unlocked_inode_to_wb_end(inode, locked);
unlock_page_memcg(page);
If an inode switch and a process memcg migration are both in flight, then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock. This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg():
truncate
__cancel_dirty_page
lock_page_memcg
unlocked_inode_to_wb_begin
unlocked_inode_to_wb_end
<interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    <deadlock>
unlock_page_memcg
Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).
If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:
cd /mnt/cgroup/memory
mkdir a b
echo 1 > a/memory.move_charge_at_immigrate
echo 1 > b/memory.move_charge_at_immigrate
(
echo $BASHPID > a/cgroup.procs
while true; do
dd if=/dev/zero of=/mnt/big bs=1M count=256
done
) &
while true; do
sync
done &
sleep 1h &
SLEEP=$!
while true; do
echo $SLEEP > a/cgroup.procs
echo $SLEEP > b/cgroup.procs
done
The deadlock does not seem possible today, so it's debatable whether
there's any reason to modify the kernel. I suggest we do, to prevent
future surprises. And Wang Long said "this deadlock occurs three times
in our environment", so there's more reason to apply this, even to
stable.
Stable 4.4 has minor conflicts applying this patch. For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146
Wang Long said "this deadlock occurs three times in our environment"
[gthelen@google.com: v4]
Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
[akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
Fixes: 682aa8e1a6 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reported-by: Wang Long <wanglong19@meituan.com>
Acked-by: Wang Long <wanglong19@meituan.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: <stable@vger.kernel.org> [v4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.
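An illustrative before/after for a typical caller (a sketch; the mechanical
conversion in the patch covers many such sites):
    /* old */
    spin_lock_irq(&mapping->tree_lock);
    radix_tree_delete(&mapping->page_tree, page->index);
    spin_unlock_irq(&mapping->tree_lock);

    /* new */
    xa_lock_irq(&mapping->i_pages);
    radix_tree_delete(&mapping->i_pages, page->index);
    xa_unlock_irq(&mapping->i_pages);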
[willy@infradead.org: fix nds32, fs/dax.c]
Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
And use it in a few more places rather than opencoding the values.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The @head can be wb->b_dirty_time, so update the comment.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Wang Long <wanglong19@meituan.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is a pure automated search-and-replace of the internal kernel
superblock flags.
The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.
Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.
The script to do this was:
# places to look in; re security/*: it generally should *not* be
# touched (that stuff parses mount(2) arguments directly), but
# there are two places where we really deal with superblock flags.
FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
include/linux/fs.h include/uapi/linux/bfs_fs.h \
security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
# the list of MS_... constants
SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
ACTIVE NOUSER"
SED_PROG=
for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
# we want files that contain at least one of MS_...,
# with fs/namespace.c and fs/pnode.c excluded.
L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
for f in $L; do sed -i $f $SED_PROG; done
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 925a6efb8f ("Btrfs: stop using
try_to_writeback_inodes_sb_nr to flush delalloc") this function hasn't
had any external users, so stop exporting it.
In addition we merge it into try_to_writeback_inodes_sb() which is the
only caller. Also change return type of try_to_writeback_inodes_sb to
void as the only user ext4 doesn't care.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Handle start-all writeback like we do periodic or kupdate
style writeback - by marking the bdi_writeback as needing a full
flush, and simply waking the thread. This eliminates the need to
allocate and queue a specific work item just for this purpose.
After this change, we truly only ever have one of them running at
any point in time. We mark the need to start all flushes, and the
writeback thread will clear it once it has processed the request.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When someone calls wakeup_flusher_threads() or
wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
pages in the system (or on that bdi). If we are tight on memory, we
can get tons of these queued from kswapd/vmscan. This causes (at
least) two problems:
1) We consume a ton of memory just allocating writeback work items.
We've seen as much as 600 million of these writeback work items
pending. That's a lot of memory to pointlessly hold hostage,
while the box is under memory pressure.
2) We spend so much time processing these work items, that we
introduce a softlockup in writeback processing. This is because
each of the writeback work items don't end up doing any work (it's
hard when you have millions of identical ones coming in to the
flush machinery), so we just sit in a tight loop pulling work
items and deleting/freeing them.
Fix this by adding a 'start_all' bit to the writeback structure, and
set that when someone attempts to flush all dirty pages. The bit is
cleared when we start writeback on that work item. If the bit is
already set when we attempt to queue !nr_pages writeback, then we
simply ignore it.
This provides us one full flush in flight, with one pending as well,
and makes for more efficient handling of this type of writeback.
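A sketch of the resulting wb_start_writeback(), assuming the WB_start_all state
bit and start_all_reason field introduced by this series:
    static void wb_start_writeback(struct bdi_writeback *wb, enum wb_reason reason)
    {
            if (!wb_has_dirty_io(wb))
                    return;

            /* Only one "flush everything" request may be pending at a time. */
            if (test_bit(WB_start_all, &wb->state) ||
                test_and_set_bit(WB_start_all, &wb->state))
                    return;

            wb->start_all_reason = reason;
            wb_wakeup(wb);
    }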
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that we have no external callers of wb_start_writeback(), we
can shuffle the passing in of 'nr_pages'. Everybody passes in 0
at this point, so just kill the argument and move the dirty
count retrieval to that function.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't have any callers outside of fs-writeback.c anymore,
make it private.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Similar to wakeup_flusher_threads(), except that we only wake
up the flusher threads on the specified backing device.
No functional changes in this patch.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We're writing back the full range of dirty pages on the devices,
there's no point in making this special and not do normal range
cyclic writeback.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Everybody is passing in 0 now, let's get rid of the argument.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently the writeback statistics code uses percpu counters to hold
various statistics. Furthermore we have 2 families of functions - those
which disable local irqs and those which don't, whose names begin with a
double underscore. However, they both end up calling __add_wb_stat,
which in turn calls percpu_counter_add_batch, which is already irq-safe.
Exploiting this fact allows us to eliminate the __wb_* functions, since
they don't add any protection beyond what we already have.
Furthermore, refactor the wb_* functions to call __add_wb_stat directly,
without the irq-disabling dance. This will likely result in better
runtime of code which modifies the stat counters.
While at it also document why percpu_counter_add_batch is in fact
preempt and irq-safe since at least 3 people got confused.
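Roughly what the surviving helpers look like (a sketch of the post-patch
include/linux/backing-dev.h shape):
    static inline void __add_wb_stat(struct bdi_writeback *wb,
                                     enum wb_stat_item item, s64 amount)
    {
            /* percpu_counter_add_batch() is already preempt- and irq-safe. */
            percpu_counter_add_batch(&wb->stat[item], amount, WB_STAT_BATCH);
    }

    static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
    {
            __add_wb_stat(wb, item, 1);
    }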
Link: http://lkml.kernel.org/r/1498029937-27293-1-git-send-email-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sphinx gets confused when it finds indentation without a
good reason for it and without a preceding blank line:
./fs/mpage.c:347: ERROR: Unexpected indentation.
./fs/namei.c:4303: ERROR: Unexpected indentation.
./fs/fs-writeback.c:2060: ERROR: Unexpected indentation.
No functional changes.
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
When WB_registered flag is not set, wb_queue_work() skips queuing the
work, but does not perform the necessary clean up. In particular, if
work->auto_free is true, it should free the memory.
The leak condition can be reproduced by following these steps:
mount /dev/sdb /mnt/sdb
/* In qemu console: device_del sdb */
umount /dev/sdb
Above will result in a wb_queue_work() call on an unregistered wb and
thus leak memory.
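A simplified sketch of the fix in wb_queue_work(); the actual patch also
completes any work->done waiters, which is omitted here:
    spin_lock_bh(&wb->work_lock);
    if (test_bit(WB_registered, &wb->state)) {
            list_add_tail(&work->list, &wb->work_list);
            mod_delayed_work(bdi_wq, &wb->dwork, 0);
    } else {
            /* The wb is shut down: nobody will run or free this work. */
            if (work->auto_free)
                    kfree(work);
    }
    spin_unlock_bh(&wb->work_lock);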
Reported-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
I've found a funny live-lock between raid10 barriers during resync and
memory controller hard limits. Inside mpage_readpages() the task holds on
to its plug bio, which blocks the barrier in raid10. Its memory cgroup has
no free memory, so the task goes into the reclaimer, but all reclaimable
pages are dirty and cannot be written because raid10 is rebuilding and
stuck on the barrier.
The usual flush of such IO in schedule() never happens, because the caller
doesn't go to sleep.
The lock is 'live' because changing the memory limit or killing tasks
which hold that stuck bio unblocks the whole thing.
That is what happened in 3.18.x, but I see no difference in the upstream
logic. Theoretically this might happen even without a memory cgroup.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@fb.com>
There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages are
accounted on the zone. This can be coped with to some extent but it's
confusing, so this patch moves the relevant file-based accounting to the
node. Due to throttling logic in the page allocator for reliable OOM
detection, it is still necessary to track dirty and writeback pages on a
per-zone basis.
[mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The per-sb inode writeback list tracks inodes currently under writeback
to facilitate efficient sync processing. In particular, it ensures that
sync only needs to walk through a list of inodes that were cleaned by
the sync.
Add a couple tracepoints to help identify when inodes are added/removed
to and from the writeback lists. Piggyback off of the writeback
lazytime tracepoint template as it already tracks the relevant inode
information.
Link: http://lkml.kernel.org/r/1466594593-6757-3-git-send-email-bfoster@redhat.com
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
cc: Josef Bacik <jbacik@fb.com>
Cc: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
wait_sb_inodes() currently does a walk of all inodes in the filesystem
to find dirty ones to wait on during sync. This is highly inefficient
and wastes a lot of CPU when there are lots of clean cached inodes that
we don't need to wait on.
To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for. We do this by
adding inodes to a writeback list on the sb when the mapping is first
tagged as having pages under writeback. wait_sb_inodes() can then walk
this list of "inodes under IO" and wait specifically just for the inodes
that the current sync(2) needs to wait for.
Define a couple helpers to add/remove an inode from the writeback list
and call them when the overall mapping is tagged for or cleared from
writeback. Update wait_sb_inodes() to walk only the inodes under
writeback due to the sync.
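A sketch of the add-side helper, assuming the i_wb_list/s_inodes_wb names and
the s_inode_wblist_lock protecting this list (trimmed; the tracepoint from the
follow-up patch is omitted):
    void sb_mark_inode_writeback(struct inode *inode)
    {
            struct super_block *sb = inode->i_sb;
            unsigned long flags;

            if (list_empty(&inode->i_wb_list)) {
                    spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
                    if (list_empty(&inode->i_wb_list))
                            list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
                    spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
            }
    }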
With this change, filesystem sync times are significantly reduced for
fs' with largely populated inode caches and otherwise no other work to
do. For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem
with a ~10m entry inode cache, sync times are reduced from ~7.3s to less
than 0.1s when the filesystem is fully clean.
Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
time ago with the promise that one day it would be possible to implement
the page cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and likely never will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE, and it's a constant source of confusion whether the
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in the page cache are special. They are
not.
The changes are pretty straightforward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
the script below. For some reason, coccinelle doesn't patch header
files, so I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When cgroup writeback is in use, there can be multiple wb's
(bdi_writeback's) per bdi and an inode may switch among them
dynamically. In a couple of places, the wrong wb was used, leading to
operations being performed on the wrong list under the wrong lock and
corrupting the io lists.
* writeback_single_inode() was taking @wb parameter and used it to
remove the inode from io lists if it becomes clean after writeback.
The callers of this function were always passing in the root wb
regardless of the actual wb that the inode was associated with,
which could also change while writeback is in progress.
Fix it by dropping the @wb parameter and using
inode_to_wb_and_lock_list() to determine and lock the associated wb.
* After writeback_sb_inodes() writes out an inode, it re-locks @wb and
inode to remove it from or move it to the right io list. It assumes
that the inode is still associated with @wb; however, the inode may
have switched to another wb while writeback was in progress.
Fix it by using inode_to_wb_and_lock_list() to determine and lock
the associated wb after writeback is complete. As the function
requires the original @wb->list_lock locked for the next iteration,
in the unlikely case where the inode has changed association, switch
the locks.
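A sketch of the second fix in writeback_sb_inodes(), assuming
inode_to_wb_and_lock_list() and the existing requeue_inode()/
inode_sync_complete() helpers; the lock juggling at the end is the interesting
part:
    /* The inode may have switched wbs while we were writing it out. */
    tmp_wb = inode_to_wb_and_lock_list(inode);
    spin_lock(&inode->i_lock);
    if (!(inode->i_state & I_DIRTY_ALL))
            wrote++;
    requeue_inode(inode, tmp_wb, &wbc);
    inode_sync_complete(inode);
    spin_unlock(&inode->i_lock);

    if (unlikely(tmp_wb != wb)) {
            /* Keep the original @wb->list_lock held for the next iteration. */
            spin_unlock(&tmp_wb->list_lock);
            spin_lock(&wb->list_lock);
    }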
Kudos to Tahsin for pinpointing these subtle breakages.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: d10c809552 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Link: http://lkml.kernel.org/g/CAAeU0aMYeM_39Y2+PaRvyB1nqAPYZSNngJ1eBRmrxn7gKAt2Mg@mail.gmail.com
Reported-and-diagnosed-by: Tahsin Erdogan <tahsin@google.com>
Tested-by: Tahsin Erdogan <tahsin@google.com>
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Jens Axboe <axboe@fb.com>
locked_inode_to_wb_and_lock_list() wb_get()'s the wb associated with
the target inode, unlocks inode, locks the wb's list_lock and verifies
that the inode is still associated with the wb. To prevent the wb
going away between dropping inode lock and acquiring list_lock, the wb
is pinned while inode lock is held. The wb reference is put right
after acquiring list_lock citing that the wb won't be dereferenced
anymore.
This isn't true. If the inode is still associated with the wb, the
inode has a reference and it's safe to return the wb; however, if the
inode has been switched, the wb still needs to be unlocked, which is a
dereference and can lead to a use-after-free if it races with wb
destruction.
Fix it by putting the reference after releasing list_lock.
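A sketch of the corrected ordering inside the retry loop, assuming the
wb_get()/wb_put() refcount helpers:
    wb_get(wb);
    spin_unlock(&inode->i_lock);
    spin_lock(&wb->list_lock);
    if (likely(wb == inode->i_wb)) {
            wb_put(wb);     /* @inode already holds a reference */
            return wb;
    }
    spin_unlock(&wb->list_lock);
    wb_put(wb);             /* only drop after the last dereference */
    cpu_relax();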
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 87e1d789bf ("writeback: implement [locked_]inode_to_wb_and_lock_list()")
Cc: stable@vger.kernel.org # v4.2+
Tested-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If cgroup writeback is in use, inodes can be scheduled for
asynchronous wb switching. Before 5ff8eaac16 ("writeback: keep
superblock pinned during cgroup writeback association switches"), this
could race with umount leading to super_block being destroyed while
inodes are pinned for wb switching. 5ff8eaac16 fixed it by bumping
s_active while wb switches are in flight; however, this allowed
in-flight wb switches to make umounts asynchronous when userland
expected them to be synchronous - e.g. an fsck immediately following
umount may fail because the device is still busy.
This patch removes the problematic super_block pinning and instead
makes generic_shutdown_super() flush in-flight wb switches. wb
switches are now executed on a dedicated isw_wq so that they can be
flushed and isw_nr_in_flight keeps track of the number of in-flight wb
switches so that flushing can be avoided in most cases.
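A sketch of the mechanism, using the isw_wq/isw_nr_in_flight names from the
description (the later rcu_barrier() refinement is a separate patch in this
log):
    static struct workqueue_struct *isw_wq;         /* dedicated switch workqueue */
    static atomic_t isw_nr_in_flight = ATOMIC_INIT(0);

    /* Called from generic_shutdown_super() before the sb goes away. */
    void cgroup_writeback_umount(void)
    {
            if (atomic_read(&isw_nr_in_flight))
                    flush_workqueue(isw_wq);
    }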
v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
check in inode_switch_wbs() as Jan and Al suggested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tahsin Erdogan <tahsin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: 5ff8eaac16 ("writeback: keep superblock pinned during cgroup writeback association switches")
Cc: stable@vger.kernel.org #v4.5
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If cgroup writeback is in use, an inode is associated with a cgroup
for writeback. If the inode's main dirtier changes to another cgroup,
the association gets updated asynchronously. Nothing was pinning the
superblock while such switches are in progress and superblock could go
away while async switching is pending or in progress leading to
crashes like the following.
kernel BUG at fs/jbd2/transaction.c:319!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51
Hardware name: Google Google, BIOS Google 01/01/2011
Workqueue: events inode_switch_wbs_work_fn
task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000
RIP: 0010:[<ffffffff803e6922>] [<ffffffff803e6922>] start_this_handle+0x382/0x3e0
RSP: 0018:ffff880209267c30 EFLAGS: 00010202
...
Call Trace:
[<ffffffff803e6be4>] jbd2__journal_start+0xf4/0x190
[<ffffffff803cfc7e>] __ext4_journal_start_sb+0x4e/0x70
[<ffffffff803b31ec>] ext4_evict_inode+0x12c/0x3d0
[<ffffffff8035338b>] evict+0xbb/0x190
[<ffffffff80354190>] iput+0x130/0x190
[<ffffffff80360223>] inode_switch_wbs_work_fn+0x343/0x4c0
[<ffffffff80279819>] process_one_work+0x129/0x300
[<ffffffff80279b16>] worker_thread+0x126/0x480
[<ffffffff8027ed14>] kthread+0xc4/0xe0
[<ffffffff809771df>] ret_from_fork+0x3f/0x70
Fix it by bumping s_active while cgroup association switching is in
flight.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Tahsin Erdogan <tahsin@google.com>
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: d10c809552 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Cc: stable@vger.kernel.org #v4.5+
Signed-off-by: Jens Axboe <axboe@fb.com>
In earlier versions, mem_cgroup_css_from_page() could return non-root
css on a legacy hierarchy which can go away and required rcu locking;
however, the eventual version simply returns the root cgroup if memcg is
on a legacy hierarchy and thus doesn't need rcu locking around or in it.
Remove spurious rcu lockings.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-doc warnings in fs/fs-writeback.c by moving a #define macro to
after the function's opening brace. Also #undef this macro at the end of
the function.
../fs/fs-writeback.c:1984: warning: Excess function parameter 'inode' description in 'I_DIRTY_INODE'
../fs/fs-writeback.c:1984: warning: Excess function parameter 'flags' description in 'I_DIRTY_INODE'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
filemap_fdatawait() is a function to wait for on-going writeback to
complete, but it also consumes and clears the error status of the mapping
set during writeback.
The latter functionality is critical for applications to detect writeback
error with system calls like fsync(2)/fdatasync(2).
However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl,
which don't check error status of individual mappings.
As a result, fsync() may not be able to detect writeback error if events
happen in the following order:
Application                          System admin
----------------------------------------------------------
write data on page cache
                                     Run sync command
                                     writeback completes with error
                                     filemap_fdatawait() clears error
fsync returns success
(but the data is not on disk)
This patch adds filemap_fdatawait_keep_errors() for call sites where
writeback error is not handled so that they don't clear error status.
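Illustrative use at a sync(2)-style call site (a sketch):
    /*
     * Wait for writeback without consuming the mapping's AS_EIO/AS_ENOSPC
     * bits, so a later fsync()/fdatasync() still sees the error.
     */
    filemap_fdatawait_keep_errors(inode->i_mapping);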
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Fengguang Wu <fengguang.wu@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue()
to walk @bdi->wb_list. To set up the initial iteration
condition, it uses list_entry_rcu() to calculate the entry
pointer corresponding to the list head; however, this isn't an
actual RCU dereference and using list_entry_rcu() for it ended
up breaking a proposed list_entry_rcu() change because it was
feeding a non-lvalue pointer into the macro.
Don't use the RCU variant for simple pointer offsetting. Use
list_entry() instead.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Patrick Marlier <patrick.marlier@gmail.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pranith kumar <bobby.prani@gmail.com>
Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
bdi_for_each_wb() is used in several places to wake up or issue
writeback work items to all wb's (bdi_writeback's) on a given bdi.
The iteration is performed by walking bdi->cgwb_tree; however, the
tree only indexes wb's which are currently active.
For example, when a memcg gets associated with a different blkcg, the
old wb is removed from the tree so that the new one can be indexed.
The old wb starts dying from then on but will linger till all its
inodes are drained. As these dying wb's may still host dirty inodes,
writeback operations which affect all wb's must include them.
bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
failing to sync the inodes belonging to those wb's.
This patch adds an RCU-protected @bdi->wb_list which lists all wb's
belonging to that bdi. wb's are added on creation and removed on
release rather than on the start of destruction. bdi_for_each_wb()
usages are replaced with list_for_each[_continue]_rcu() iterations
over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.
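A sketch of the two sides, assuming the bdi_node list member added by this
patch (the plain _rcu iterator is shown here; some callers use the _continue
variant):
    /* On wb creation: visible to writeback until final release. */
    list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);

    /* Issuing work to every wb, including ones that are already dying: */
    rcu_read_lock();
    list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) {
            /* ... queue or wait for writeback on @wb ... */
    }
    rcu_read_unlock();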
v2: Updated as per Jan. last_wb ref leak in bdi_split_work_to_wbs()
fixed and unnecessary list head severing in cgwb_bdi_destroy()
removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com>
Fixes: ebe41ab0c7 ("writeback: implement bdi_for_each_wb()")
Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>