In bfq_idle_slice_timer(), bfqq = bfqd->in_service_queue is read
outside the bfqd->lock critical section. Even though bfqq is non-NULL
at that point, it may be freed before, or while, it is passed to
bfq_idle_slice_timer_body(), so we may end up accessing freed memory.
In addition, since bfqq may be racing with expiration, we should first
check, in bfq_idle_slice_timer_body(), whether bfqq is still in
service before doing anything to it. If the racing bfqq is no longer
in service, it has already been expired through __bfq_bfqq_expire(),
and its wait_request flag has already been cleared in
__bfq_bfqd_reset_in_service(), so there is no need to clear
wait_request again for a bfqq that is not in service.
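A minimal sketch of the resulting check (not the literal patch; helper
and reason names are those of bfq-iosched.c, and the reason selection
is simplified):

static void bfq_idle_slice_timer_body(struct bfq_data *bfqd,
                                      struct bfq_queue *bfqq)
{
    unsigned long flags;

    spin_lock_irqsave(&bfqd->lock, flags);

    /*
     * bfqq may be racing with expiration: act on it only if it is
     * still the in-service queue.
     */
    if (bfqq != bfqd->in_service_queue) {
        spin_unlock_irqrestore(&bfqd->lock, flags);
        return;
    }

    /* bfqq cannot be freed while bfqd->lock is held */
    bfq_clear_bfqq_wait_request(bfqq);
    bfq_bfqq_expire(bfqd, bfqq, true, BFQQE_TOO_IDLE);
    bfq_schedule_dispatch(bfqd);

    spin_unlock_irqrestore(&bfqd->lock, flags);
}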
The KASAN log follows:
[13058.354613] ==============================================================
[13058.354640] BUG: KASAN: use-after-free in bfq_idle_slice_timer+0xac/0x290
[13058.354644] Read of size 8 at addr ffffa02cf3e63f78 by task fork13/19767
[13058.354646]
[13058.354655] CPU: 96 PID: 19767 Comm: fork13
[13058.354661] Call trace:
[13058.354667] dump_backtrace+0x0/0x310
[13058.354672] show_stack+0x28/0x38
[13058.354681] dump_stack+0xd8/0x108
[13058.354687] print_address_description+0x68/0x2d0
[13058.354690] kasan_report+0x124/0x2e0
[13058.354697] __asan_load8+0x88/0xb0
[13058.354702] bfq_idle_slice_timer+0xac/0x290
[13058.354707] __hrtimer_run_queues+0x298/0x8b8
[13058.354710] hrtimer_interrupt+0x1b8/0x678
[13058.354716] arch_timer_handler_phys+0x4c/0x78
[13058.354722] handle_percpu_devid_irq+0xf0/0x558
[13058.354731] generic_handle_irq+0x50/0x70
[13058.354735] __handle_domain_irq+0x94/0x110
[13058.354739] gic_handle_irq+0x8c/0x1b0
[13058.354742] el1_irq+0xb8/0x140
[13058.354748] do_wp_page+0x260/0xe28
[13058.354752] __handle_mm_fault+0x8ec/0x9b0
[13058.354756] handle_mm_fault+0x280/0x460
[13058.354762] do_page_fault+0x3ec/0x890
[13058.354765] do_mem_abort+0xc0/0x1b0
[13058.354768] el0_da+0x24/0x28
[13058.354770]
[13058.354773] Allocated by task 19731:
[13058.354780] kasan_kmalloc+0xe0/0x190
[13058.354784] kasan_slab_alloc+0x14/0x20
[13058.354788] kmem_cache_alloc_node+0x130/0x440
[13058.354793] bfq_get_queue+0x138/0x858
[13058.354797] bfq_get_bfqq_handle_split+0xd4/0x328
[13058.354801] bfq_init_rq+0x1f4/0x1180
[13058.354806] bfq_insert_requests+0x264/0x1c98
[13058.354811] blk_mq_sched_insert_requests+0x1c4/0x488
[13058.354818] blk_mq_flush_plug_list+0x2d4/0x6e0
[13058.354826] blk_flush_plug_list+0x230/0x548
[13058.354830] blk_finish_plug+0x60/0x80
[13058.354838] read_pages+0xec/0x2c0
[13058.354842] __do_page_cache_readahead+0x374/0x438
[13058.354846] ondemand_readahead+0x24c/0x6b0
[13058.354851] page_cache_sync_readahead+0x17c/0x2f8
[13058.354858] generic_file_buffered_read+0x588/0xc58
[13058.354862] generic_file_read_iter+0x1b4/0x278
[13058.354965] ext4_file_read_iter+0xa8/0x1d8 [ext4]
[13058.354972] __vfs_read+0x238/0x320
[13058.354976] vfs_read+0xbc/0x1c0
[13058.354980] ksys_read+0xdc/0x1b8
[13058.354984] __arm64_sys_read+0x50/0x60
[13058.354990] el0_svc_common+0xb4/0x1d8
[13058.354994] el0_svc_handler+0x50/0xa8
[13058.354998] el0_svc+0x8/0xc
[13058.354999]
[13058.355001] Freed by task 19731:
[13058.355007] __kasan_slab_free+0x120/0x228
[13058.355010] kasan_slab_free+0x10/0x18
[13058.355014] kmem_cache_free+0x288/0x3f0
[13058.355018] bfq_put_queue+0x134/0x208
[13058.355022] bfq_exit_icq_bfqq+0x164/0x348
[13058.355026] bfq_exit_icq+0x28/0x40
[13058.355030] ioc_exit_icq+0xa0/0x150
[13058.355035] put_io_context_active+0x250/0x438
[13058.355038] exit_io_context+0xd0/0x138
[13058.355045] do_exit+0x734/0xc58
[13058.355050] do_group_exit+0x78/0x220
[13058.355054] __wake_up_parent+0x0/0x50
[13058.355058] el0_svc_common+0xb4/0x1d8
[13058.355062] el0_svc_handler+0x50/0xa8
[13058.355066] el0_svc+0x8/0xc
Change-Id: I510c704a6f2324741d70db33f0350e14642fe92f
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reported-by: Wang Wang <wangwang2@huawei.com>
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Feilong Lin <linfeilong@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Git-commit: 2f95fa5c955d0a9987ffdc3a095e2f4e62c5f2a9
Git-repo: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block
Signed-off-by: Pradeep P V K <ppvk@codeaurora.org>
* refs/heads/tmp-8e36d3d:
Documentation: devicetree: Remove unimac-mdio bindings from kernel
Revert "media: dt-bindings: adv748x: Fix decimal unit addresses"
Linux 4.19.85
slcan: Fix memory leak in error path
memfd: Use radix_tree_deref_slot_protected to avoid the warning.
net: phy: mdio-bcm-unimac: mark PM functions as __maybe_unused
s390/vdso: correct vdso mapping for compat tasks
media: ov2680: fix null dereference at power on
IB/iser: Fix possible NULL deref at iser_inv_desc()
fuse: use READ_ONCE on congestion_threshold and max_background
usb: usbtmc: uninitialized symbol 'actual' in usbtmc_ioctl_clear
usb: xhci-mtk: fix ISOC error when interval is zero
netfilter: masquerade: don't flush all conntracks if only one address deleted on device
rtc: armada38x: fix possible race condition
rtc: tx4939: fixup nvmem name and register size
rtc: isl1208: avoid possible sysfs race
ARM: dts: lpc32xx: Fix SPI controller node names
arm64: dts: lg: Fix SPI controller node names
arm64: dts: amd: Fix SPI bus warnings
scsi: NCR5380: Check for bus reset
scsi: NCR5380: Handle BUS FREE during reselection
scsi: NCR5380: Don't call dsprintk() following reselection interrupt
scsi: NCR5380: Don't clear busy flag when abort fails
scsi: NCR5380: Check for invalid reselection target
scsi: NCR5380: Use DRIVER_SENSE to indicate valid sense data
scsi: NCR5380: Withhold disconnect privilege for REQUEST SENSE
scsi: NCR5380: Have NCR5380_select() return a bool
scsi: NCR5380: Clear all unissued commands on host reset
iwlwifi: mvm: Allow TKIP for AP mode
iwlwifi: mvm: use correct FIFO length
iwlwifi: pcie: fit reclaim msg to MAX_MSG_LEN
iwlwifi: pcie: gen2: build A-MSDU only for GSO
iwlwifi: api: annotate compressed BA notif array sizes
iwlwifi: pcie: read correct prph address for newer devices
iwlwifi: fix non_shared_ant for 22000 devices
iwlwifi: dbg: don't crash if the firmware crashes in the middle of a debug dump
crypto: fix a memory leak in rsa-kcs1pad's encryption mode
crypto: s5p-sss: Fix Fix argument list alignment
crypto: s5p-sss: Fix race in error handling
x86/hyperv: Suppress "PCI: Fatal: No config space access function found"
Bluetooth: btrsi: fix bt tx timeout issue
Bluetooth: L2CAP: Detect if remote is not able to use the whole MPS
Bluetooth: hci_serdev: clear HCI_UART_PROTO_READY to avoid closing proto races
firmware: dell_rbu: Make payload memory uncachable
ARM: dts: realview: Fix SPI controller node names
EDAC: Raise the maximum number of memory controllers
RDMA: Fix dependencies for rdma_user_mmap_io
f2fs: mark inode dirty explicitly in recover_inode()
f2fs: fix to recover inode's project id during POR
f2fs: update i_size after DIO completion
PCI/ERR: Run error recovery callbacks for all affected devices
net: faraday: fix return type of ndo_start_xmit function
net: smsc: fix return type of ndo_start_xmit function
ARM: dts: paz00: fix wakeup gpio keycode
ARM: tegra: colibri_t30: fix mcp2515 can controller interrupt polarity
ARM: tegra: apalis_t30: fix mcp2515 can controller interrupt polarity
ARM: tegra: apalis_t30: fix mmc1 cmd pull-up
ARM: dts: tegra20: restore address order
ARM: dts: tegra30: fix xcvr-setup-use-fuses
arm64: tegra: I2C on Tegra194 is not compatible with Tegra114
ARM: dts: imx51-zii-rdu1: Fix the rtc compatible string
arm64: dts: fsl: Fix I2C and SPI bus warnings
phy: lantiq: Fix compile warning
f2fs: fix remount problem of option io_bits
scsi: libsas: always unregister the old device if going to discover new
iw_cxgb4: Use proper enumerated type in c4iw_bar2_addrs
vfio/pci: Mask buggy SR-IOV VF INTx support
vfio/pci: Fix potential memory leak in vfio_msi_cap_len
vmbus: keep pointer to ring buffer page
misc: genwqe: should return proper error value.
misc: kgdbts: Fix restrict error
silmbus: ngd: register controller after power up.
slimbus: ngd: return proper error code instead of zero
slimbus: ngd: register ngd driver only once.
coresight: dynamic-replicator: Handle multiple connections
coresight: tmc: Fix byte-address alignment for RRP
coresight: etm4x: Configure EL2 exception level when kernel is running in HYP
coresight: tmc-etr: Handle driver mode specific ETR buffers
coresight: perf: Disable trace path upon source error
coresight: perf: Fix per cpu path management
coresight: Fix handling of sinks
coresight: Use ERR_CAST instead of ERR_PTR
usb: gadget: uvc: Only halt video streaming endpoint in bulk mode
usb: gadget: uvc: Factor out video USB request queueing
ARM: dts: imx6ull: update vdd_soc voltage for 900MHz operating point
phy: phy-twl4030-usb: fix denied runtime access
phy: renesas: rcar-gen3-usb2: fix vbus_ctrl for role sysfs
phy: brcm-sata: allow PHY_BRCM_SATA driver to be built for DSL SoCs
ARM: at91: pm: call put_device instead of of_node_put in at91_pm_config_ws
gpiolib: Fix gpio_direction_* for single direction GPIOs
i2c: aspeed: fix invalid clock parameters for very large divisors
ARM: dts: exynos: Correct audio subsystem parent clock on Peach Chromebooks
usb: gadget: uvc: configfs: Sort frame intervals upon writing
usb: gadget: uvc: configfs: Prevent format changes after linking header
usb: gadget: uvc: configfs: Drop leaked references to config items
ARM: dts: rockchip: explicitly set vcc_sd0 pin to gpio on rk3188-radxarock
media: davinci: Fix implicit enum conversion warning
media: au0828: Fix incorrect error messages
media: pci: ivtv: Fix a sleep-in-atomic-context bug in ivtv_yuv_init()
media: imx: work around false-positive warning, again
mlxsw: Make MLXSW_SP1_FWREV_MINOR a hard requirement
arm64: dts: rockchip: Fix microSD in rk3399 sapphire board
MIPS: kexec: Relax memory restriction
EDAC: Correct DIMM capacity unit symbol
x86/CPU: Change query logic so CPUID is enabled before testing
x86/CPU: Use correct macros for Cyrix calls
net: freescale: fix return type of ndo_start_xmit function
net: micrel: fix return type of ndo_start_xmit function
net: phy: mdio-bcm-unimac: Allow configuring MDIO clock divider
samples/bpf: fix compilation failure
PCI/ERR: Use slot reset if available
PCI/AER: Don't read upstream ports below fatal errors
PCI/AER: Take reference on error devices
bnx2x: Ignore bandwidth attention in single function mode
ARM: dts: stm32: Fix SPI controller node names
ARM: dts: clearfog: fix sdhci supply property name
ARM: dts: stm32: enable display on stm32mp157c-ev1 board
x86/mce-inject: Reset injection struct after injection
ARM: dts: marvell: Fix SPI and I2C bus warnings
crypto: arm/crc32 - avoid warning when compiling with Clang
cpufeature: avoid warning when compiling with clang
crypto: chacha20 - Fix chacha20_block() keystream alignment (again)
spi: pic32: Use proper enum in dmaengine_prep_slave_rg
ARM: dts: ste: Fix SPI controller node names
ARM: dts: ux500: Fix LCDA clock line muxing
ARM: dts: ux500: Correct SCU unit address
f2fs: fix to recover inode's uid/gid during POR
f2fs: avoid infinite loop in f2fs_alloc_nid
ARM: dts: ti: Fix SPI and I2C bus warnings
ARM: dts: am335x-evm: fix number of cpsw
PCI: portdrv: Initialize service drivers directly
mlxsw: spectrum: Init shaper for TCs 8..15
brcmsmac: Use kvmalloc() for ucode allocations
brcmfmac: increase buffer for obtaining firmware capabilities
s390/vdso: correct CFI annotations of vDSO functions
s390/vdso: avoid 64-bit vdso mapping for compat tasks
s390/zcrypt: enable AP bus scan without a valid default domain
usb: usbtmc: Fix ioctl USBTMC_IOCTL_ABORT_BULK_OUT
usb: chipidea: Fix otg event handler
usb: chipidea: imx: enable OTG overcurrent in case USB subsystem is already started
nfp: provide a better warning when ring allocation fails
net: hns3: Fix parameter type for q_id in hclge_tm_q_to_qs_map_cfg()
net: hns3: Fix client initialize state issue when roce client initialize failed
net: hns3: Clear client pointer when initialize client failed or unintialize finished
net: hns3: Fix cmdq registers initialization issue for vf
net: hns3: Fix for setting speed for phy failed problem
net: sun: fix return type of ndo_start_xmit function
net: amd: fix return type of ndo_start_xmit function
net: broadcom: fix return type of ndo_start_xmit function
net: xilinx: fix return type of ndo_start_xmit function
net: toshiba: fix return type of ndo_start_xmit function
net: marvell: fix return type of ndo_start_xmit function
net: mvpp2: fix the number of queues per cpu for PPv2.2
power: supply: twl4030_charger: disable eoc interrupt on linear charge
power: supply: twl4030_charger: fix charging current out-of-bounds
libfdt: Ensure INT_MAX is defined in libfdt_env.h
of/unittest: Fix I2C bus unit-address error
OPP: Protect dev_list with opp_table lock
ARM: dts: atmel: Fix I2C and SPI bus warnings
RDMA/i40iw: Fix incorrect iterator type
powerpc: Fix duplicate const clang warning in user access code
powerpc/pseries: Disable CPU hotplug across migrations
powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR request
powerpc/64s/hash: Fix stab_rr off by one initialization
selftests/powerpc: Do not fail with reschedule
powerpc/iommu: Avoid derefence before pointer check
net: ibm: fix return type of ndo_start_xmit function
net: cavium: fix return type of ndo_start_xmit function
net: hns3: fix return type of ndo_start_xmit function
ipmi: fix return value of ipmi_set_my_LUN
ipmi:dmi: Ignore IPMI SMBIOS entries with a zero base address
ipmi_si: fix potential integer overflow on large shift
ipmi_si_pci: fix NULL device in ipmi_si error message
ASoC: rt5682: Fix the boost volume at the begining of playback
spi: mediatek: Don't modify spi_transfer when transfer.
spi/bcm63xx-hsspi: keep pll clk enabled
samples/bpf: fix a compilation failure
arm64: dts: ti: k3-am65: Change #address-cells and #size-cells of interconnect to 2
tty: serial: qcom_geni_serial: Fix serial when not used as console
serial: mxs-auart: Fix potential infinite loop
serial: samsung: Enable baud clock for UART reset procedure in resume
serial: uartps: Fix suspend functionality
ARM: dts: xilinx: Fix I2C and SPI bus warnings
PCI: mediatek: Fix unchecked return value
net: socionext: Fix two sleep-in-atomic-context bugs in ave_rxfifo_reset()
PCI/ACPI: Correct error message for ASPM disabling
media: ov2680: don't register the v4l2 subdevice before checking chip ID
media: vsp1: Fix YCbCr planar formats pitch calculation
media: vsp1: Fix vsp1_regs.h license header
s390/qeth: invoke softirqs after napi_schedule()
s390/qeth: uninstall IRQ handler on device removal
ath9k: Fix a locking bug in ath9k_add_interface()
netfilter: nf_tables: avoid BUG_ON usage
ACPI / LPSS: Exclude I2C busses shared with PUNIT from pmc_atom_d3_mask
arm64: dts: rockchip: Fix I2C bus unit-address error on rk3399-puma-haikou
ARM: dts: rockchip: Fix erroneous SPI bus dtc warnings on rk3036
scsi: ufshcd: Fix NULL pointer dereference for in ufshcd_init
ip_gre: fix parsing gre header in ipgre_err
kernfs: Fix range checks in kernfs_get_target_path
component: fix loop condition to call unbind() if bind() fails
power: supply: max8998-charger: Fix platform data retrieval
power: reset: at91-poweroff: do not procede if at91_shdwc is allocated
power: supply: ab8500_fg: silence uninitialized variable warnings
arm64: dts: meson: Fix erroneous SPI bus warnings
blok, bfq: do not plug I/O if all queues are weight-raised
block, bfq: inject other-queue I/O into seeky idle queues on NCQ flash
arm64: fix for bad_mode() handler to always result in panic
cxgb4: Fix endianness issue in t4_fwcache()
android: binder: no outgoing transaction when thread todo has transaction
ARM: dts: sun9i: Fix I2C bus warnings
pinctrl: at91: don't use the same irqchip with multiple gpiochips
ARM: dts: sunxi: Fix I2C bus warnings
ARM: dts: socfpga: Fix I2C bus unit-address error
powerpc/vdso: Correct call frame information
ARM: dts: aspeed: Fix I2C bus warnings
ARM: dts: bcm: Fix SPI bus warnings
arm64: dts: broadcom: Fix I2C and SPI bus warnings
drivers: qcom: rpmh-rsc: clear wait_for_compl after use
soc: qcom: apr: Avoid string overflow
soc: qcom: wcnss_ctrl: Avoid string overflow
soc: qcom: geni: geni_se_clk_freq_match() should always accept multiples
soc: qcom: geni: Don't ignore clk_round_rate() errors in geni_se_clk_tbl_get()
ARM: dts: qcom: ipq4019: fix cpu0's qcom,saw2 reg value
llc: avoid blocking in llc_sap_close()
pinctrl: at91-pio4: fix has_config check in atmel_pctl_dt_subnode_to_map()
arm64: dts: renesas: r8a77965: Fix clock/reset for usb2_phy1
arm64: dts: renesas: r8a77965: Fix HS-USB compatible
arm64: dts: renesas: r8a77965: Attach the SYS-DMAC to the IPMMU
arm64: dts: renesas: salvator-common: adv748x: Override secondary addresses
ALSA: intel8x0m: Register irq handler after register initializations
arm64: dts: meson-axg: use the proper compatible for ethmac
arm64: dts: meson: libretech: update board model
net: bcmgenet: Fix speed selection for reverse MII
media: dvb: fix compat ioctl translation
media: fix: media: pci: meye: validate offset to avoid arbitrary access
ALSA: hda: Fix implicit definition of pci_iomap() on SH
media: dt-bindings: adv748x: Fix decimal unit addresses
nvmem: core: return error code instead of NULL from nvmem_device_get
Drivers: hv: vmbus: Fix synic per-cpu context initialization
net: aquantia: fix hw_atl_utils_fw_upload_dwords
kprobes: Don't call BUG_ON() if there is a kprobe in use on free list
scsi: pm80xx: Fixed system hang issue during kexec boot
scsi: pm80xx: Corrected dma_unmap_sg() parameter
ARM: imx6: register pm_power_off handler if "fsl,pmic-stby-poweroff" is set
scsi: sym53c8xx: fix NULL pointer dereference panic in sym_int_sir()
scsi: lpfc: Fix errors in log messages.
scsi: lpfc: Correct invalid EQ doorbell write on if_type=6
scsi: lpfc: Fix GFT_ID and PRLI logic for RSCN
scsi: qla2xxx: Fix duplicate switch's Nport ID entries
scsi: qla2xxx: Fix dropped srb resource.
scsi: qla2xxx: Fix port speed display on chip reset
scsi: qla2xxx: Check for Register disconnect
scsi: qla2xxx: Increase abort timeout value
scsi: qla2xxx: Fix deadlock between ATIO and HW lock
scsi: qla2xxx: Terminate Plogi/PRLI if WWN is 0
scsi: qla2xxx: Defer chip reset until target mode is enabled
scsi: qla2xxx: Fix iIDMA error
scsi: qla2xxx: Use correct qpair for ABTS/CMD
f2fs: fix setattr project check upon fssetxattr ioctl
f2fs: fix memory leak of percpu counter in fill_super()
f2fs: fix memory leak of write_io in fill_super()
signal: Properly deliver SIGSEGV from x86 uprobes
signal: Properly deliver SIGILL from uprobes
signal: Always ignore SIGKILL and SIGSTOP sent to the global init
IB/hfi1: Missing return value in error path for user sdma
RDMA/hns: Fix an error code in hns_roce_v2_init_eq_table()
dmaengine: at_xdmac: remove a stray bottom half unlock
ath9k: add back support for using active monitor interfaces for tx99
rtc: pl030: fix possible race condition
rtc: mt6397: fix possible race condition
EDAC, sb_edac: Return early on ADDRV bit and address type test
dmaengine: dma-jz4780: Further residue status fix
dmaengine: dma-jz4780: Don't depend on MACH_JZ4780
usb: mtu3: disable vbus rise/fall interrupts of ltssm
ARM: dts: exynos: Disable pull control for PMIC IRQ line on Artik5 board
arm64: dts: rockchip: Fix VCC5V0_HOST_EN on rk3399-sapphire
firmware: arm_scmi: use strlcpy to ensure NULL-terminated strings
sched/debug: Use symbolic names for task state constants
sched/debug: Explicitly cast sched_feat() to bool
failover: Fix error return code in net_failover_create
f2fs: submit bio after shutdown
ARM: dts: omap3-gta04: keep vpll2 always on
ARM: dts: omap3-gta04: make NAND partitions compatible with recent U-Boot
ARM: dts: omap3-gta04: fix touchscreen tsc2007
ARM: dts: omap3-gta04: tvout: enable as display1 alias
ARM: dts: omap3-gta04: fixes for tvout / venc
ARM: dts: omap3-gta04: give spi_lcd node a label so that we can overwrite in other DTS files
of: make PowerMac cache node search conditional on CONFIG_PPC_PMAC
ata: Disable AHCI ALPM feature for Ampere Computing eMAG SATA
ASoC: Intel: hdac_hdmi: Limit sampling rates at dai creation
ASoC: dapm: Avoid uninitialised variable warning
udf: Fix crash during mount
mips: txx9: fix iounmap related issue
RDMA/core: Follow correct unregister order between sysfs and cgroup
RDMA/core: Rate limit MAD error messages
IB/ipoib: Ensure that MTU isn't less than minimum permitted
IB/mlx5: Don't hold spin lock while checking device state
i2c: mediatek: Use DMA safe buffers for i2c transactions
ath10k: wmi: disable softirq's while calling ieee80211_rx
ARM: dts: exynos: Disable pull control for S5M8767 PMIC
ath10k: avoid possible memory access violation
ASoC: sgtl5000: avoid division by zero if lo_vag is zero
rtnetlink: move type calculation out of loop
net: lan78xx: Bail out if lan78xx_get_endpoints fails
f2fs: avoid wrong decrypted data from disk
cfg80211: validate wmm rule when setting
mac80211: fix saving a few HE values
qxl: fix null-pointer crash during suspend
IB/mlx5: Change TX affinity assignment in RoCE LAG mode
mtd: rawnand: qcom: don't include dma-direct.h
mtd: rawnand: fsl_ifc: fixup SRAM init for newer ctrl versions
mtd: rawnand: fsl_ifc: check result of SRAM initialization
mtd: rawnand: marvell: use regmap_update_bits() for syscon access
ARM: dts: meson8b: fix the clock controller register size
ARM: dts: meson8: fix the clock controller register size
net: phy: mscc: read 'vsc8531, edge-slowdown' as an u32
net: phy: mscc: read 'vsc8531,vddmac' as an u32
net/mlx5: Fix atomic_mode enum values
net: hns3: Change the dst mac addr of loopback packet
net: hns3: Fix for loopback selftest failed problem
net: hns3: Fix error of checking used vlan id
net: hns3: Fix for multicast failure
ASoC: rsnd: ssi: Fix issue in dma data address assignment
soc: imx: gpc: fix PDN delay
mt76: Fix comparisons with invalid hardware key index
brcmfmac: fix wrong strnchr usage
mwifex: free rx_cmd skb in suspended state
mwifiex: do no submit URB in suspended state
rtl8187: Fix warning generated when strncpy() destination length matches the sixe argument
ARM: dts: pxa: fix power i2c base address
ARM: dts: pxa: fix the rtc controller
media: ov772x: Disable clk on error path
media: i2c: Fix pm_runtime_get_if_in_use() usage in sensor drivers
media: vicodec: fix out-of-range values when decoding
iwlwifi: mvm: avoid sending too many BARs
iwlwifi: don't WARN on trying to dump dead firmware
iwlwifi: drop packets with bad status in CD
IB/rxe: fixes for rdma read retry
IB/rxe: avoid back-to-back retries
i40e: Prevent deleting MAC address from VF when set by PF
i40evf: cancel workqueue sync for adminq when a VF is removed
i40e: hold the rtnl lock on clearing interrupt scheme
i40evf: Don't enable vlan stripping when rx offload is turned on
i40e: Check and correct speed values for link on open
i40evf: set IFF_UNICAST_FLT flag for the VF
i40e: use correct length for strncpy
i40evf: Validate the number of queues a PF sends
ARM: dts: exynos: Fix regulators configuration on Peach Pi/Pit Chromebooks
arm64: dts: stratix10: i2c clock running out of spec
liquidio: fix race condition in instruction completion processing
ARM: dts: exynos: Fix sound in Snow-rev5 Chromebook
ARM: dts: exynos: Fix HDMI-HPD line handling on Arndale
ARM: dts: exynos: Use i2c-gpio for HDMI-DDC on Arndale
MIPS: BCM47XX: Enable USB power on Netgear WNDR3400v3
pinctrl: ingenic: Probe driver at subsys_initcall
ASoC: AMD: Change MCLK to 48Mhz
ASoC: meson: axg-fifo: report interrupt request failure
ASoC: dpcm: Properly initialise hw->rate_max
ASoC: dapm: Don't fail creating new DAPM control on NULL pinctrl
ice: Fix and update driver version string
gfs2: Don't set GFS2_RDF_UPTODATE when the lvb is updated
ice: Prevent control queue operations during reset
ice: Update request resource command to latest specification
ath10k: limit available channels via DT ieee80211-freq-limit
wil6210: fix invalid memory access for rx_buff_mgmt debugfs
wil6210: prevent usage of tx ring 0 for eDMA
wil6210: set edma variables only for Talyn-MB devices
wil6210: drop Rx multicast packets that are looped-back to STA
ath9k: fix tx99 with monitor mode interface
ath10k: skip resetting rx filter for WCN3990
ALSA: seq: Do error checks at creating system ports
cfg80211: Avoid regulatory restore when COUNTRY_IE_IGNORE is set
extcon: cht-wc: Return from default case to avoid warnings
remoteproc/davinci: Use %zx for formating size_t
rtc: rv8803: fix the rv8803 id in the OF table
rtc: sysfs: fix NULL check in rtc_add_groups()
ARM: dts: at91/trivial: Fix USART1 definition for at91sam9g45
ARM: dts: rcar: Correct SATA device sizes to 2 MiB
y2038: make do_gettimeofday() and get_seconds() inline
arm64: dts: tegra210-p2180: Correct sdmmc4 vqmmc-supply
soc/tegra: pmc: Fix pad voltage configuration for Tegra186
ALSA: pcm: signedness bug in snd_pcm_plug_alloc()
arm64: dts: allwinner: a64: NanoPi-A64: Fix DCDC1 voltage
arm64: dts: allwinner: a64: Olinuxino: fix DRAM voltage
arm64: dts: allwinner: a64: Orange Pi Win: Fix SD card node
soundwire: intel: Fix uninitialized adev deref
soundwire: Initialize completion for defer messages
clk: sunxi-ng: h6: fix PWM gate/reset offset
iio: dac: mcp4922: fix error handling in mcp4922_write_raw
ath10k: fix kernel panic by moving pci flush after napi_disable
tee: optee: take DT status property into account
iio: adc: max9611: explicitly cast gain_selectors
mmc: sdhci-of-at91: fix quirk2 overwrite
mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup()
mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
mm: mempolicy: fix the wrong return value and potential pages leak of mbind
iommu/vt-d: Fix QI_DEV_IOTLB_PFSID and QI_DEV_EIOTLB_PFSID macros
net: ethernet: dwmac-sun8i: Use the correct function in exit path
ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either
ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
i2c: acpi: Force bus speed to 400KHz if a Silead touchscreen is present
IB/hfi1: Use a common pad buffer for 9B and 16B packets
IB/hfi1: Ensure full Gen3 speed in a Gen4 system
Input: synaptics-rmi4 - destroy F54 poller workqueue when removing
Input: synaptics-rmi4 - clear IRQ enables for F54
Input: synaptics-rmi4 - do not consume more data than we have (F11, F12)
Input: synaptics-rmi4 - disable the relative position IRQ in the F12 driver
Input: synaptics-rmi4 - fix video buffer size
Input: ff-memless - kill timer in destroy()
Btrfs: fix log context list corruption after rename exchange operation
ALSA: usb-audio: Fix incorrect size check for processing/extension units
ALSA: usb-audio: Fix incorrect NULL check in create_yamaha_midi_quirk()
ALSA: usb-audio: not submit urb for stopped endpoint
ALSA: usb-audio: Fix missing error check at mixer resolution test
slip: Fix memory leak in slip_open error path
net: usb: qmi_wwan: add support for Foxconn T77W968 LTE modules
net: gemini: add missed free_netdev
ipmr: Fix skb headroom in ipmr_get_route().
ax88172a: fix information leak on short answers
scsi: core: Handle drivers which set sg_tablesize to zero
MIPS: BCM63XX: fix switch core reset on BCM6368
KVM: x86: introduce is_pae_paging
Conflicts:
drivers/hwtracing/coresight/coresight-etm-perf.c
drivers/hwtracing/coresight/coresight-tmc-etr.c
drivers/hwtracing/coresight/coresight-tmc.h
drivers/hwtracing/coresight/coresight.c
drivers/net/wireless/ath/wil6210/pcie_bus.c
drivers/net/wireless/ath/wil6210/txrx.c
drivers/scsi/ufs/ufshcd.c
drivers/slimbus/qcom-ngd-ctrl.c
include/linux/libfdt_env.h
Change-Id: Iba6cbaecffd0ef9fd94503df06397ca4cce9b4fb
Signed-off-by: Ivaylo Georgiev <irgeorgiev@codeaurora.org>
[ Upstream commit c8765de0adfcaaf4ffb2d951e07444f00ffa9453 ]
To reduce latency for interactive and soft real-time applications, bfq
privileges the bfq_queues containing the I/O of these
applications. These privileged queues, referred to as weight-raised
queues, get a much higher share of the device throughput
w.r.t. non-privileged queues. To preserve this higher share, the I/O
of any non-weight-raised queue must be plugged whenever a sync
weight-raised queue, while being served, remains temporarily empty. To
attain this goal, bfq simply plugs any I/O (from any queue), if a sync
weight-raised queue remains empty while in service.
Unfortunately, this plugging typically lowers throughput with random
I/O, on devices with internal queueing (because it reduces the filling
level of the internal queues of the device).
This commit addresses this issue by restricting the cases where
plugging is performed: if a sync weight-raised queue remains empty
while in service, then I/O plugging is performed only if some of the
active bfq_queues are *not* weight-raised (which is actually the only
circumstance where plugging is needed to preserve the higher share of
the throughput of weight-raised queues). This restriction proved able
to boost throughput in many use cases that need only maximum
throughput.
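In concrete terms, the restriction boils down to a condition of this
shape in the plugging decision (a sketch built on the existing
bfqd->wr_busy_queues and bfqd->busy_queues counters, not the full
upstream condition):

/*
 * Plug dispatch for the empty, in-service, weight-raised queue only
 * if at least one busy queue is *not* weight-raised; if all busy
 * queues are weight-raised, plugging is not needed to preserve
 * their share of the throughput.
 */
bool plug_for_wr_queue =
    bfqq->wr_coeff > 1 &&
    bfqd->wr_busy_queues < bfqd->busy_queues;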
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d0edc2473be9d70f999282e1ca7863ad6ae704dc ]
The Achilles' heel of BFQ is its failure to reach a high throughput
with sync random I/O on flash storage with internal queueing, in case
the processes doing I/O have differentiated weights.
The cause of this failure is as follows. If at least two processes do
sync I/O, and have a different weight from each other, then BFQ plugs
I/O dispatching every time one of these processes, while it is being
served, remains temporarily without pending I/O requests. This
plugging is necessary to guarantee that every process enjoys a
bandwidth proportional to its weight; but it empties the internal
queue(s) of the drive. And this kills throughput with random I/O. So,
if some processes have differentiated weights and do both sync and
random I/O, the end result is a throughput collapse.
This commit tries to counter this problem by injecting the service of
other processes, in a controlled way, while the process in service
happens to have no I/O. This injection is performed only if the medium
is non-rotational and performs internal queueing, and the process in
service does random I/O (service injection might be beneficial for
sequential I/O too, we'll work on that).
As an example of the benefits of this commit, on a PLEXTOR PX-256M5S
SSD, and with five processes having differentiated weights and doing
sync random 4KB I/O, this commit makes the throughput with bfq grow by
400%, from 25 to 100MB/s. This higher throughput is 10MB/s lower than
that reached with the none I/O scheduler. As some less random I/O is
added to the mix, the throughput becomes equal to or higher than that
with none.
This commit is a very first attempt to recover throughput without
losing control, and certainly has many limitations. One is, e.g., that
the processes whose service is injected are not chosen so as to
distribute the extra bandwidth they receive in accordance with their
weights. Thus there might be loss of weighted fairness in some
cases. Anyway, this loss concerns extra service, which would not have
been received at all without this commit. Other limitations and issues
will probably show up with usage.
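The gist of when injection is allowed can be sketched as a predicate
on the in-service queue (using the BFQQ_SEEKY() helper and the hw_tag
flag from bfq-iosched.c; the injection budget and limits are omitted):

static bool bfq_bfqq_injectable(struct bfq_queue *bfqq)
{
    struct bfq_data *bfqd = bfqq->bfqd;

    /*
     * Non-rotational medium with internal queueing, and a
     * non-weight-raised in-service queue doing seeky (random) I/O.
     */
    return BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 &&
        blk_queue_nonrot(bfqd->queue) && bfqd->hw_tag;
}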
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
BFQ idling reduces IOPS throughput on non-rotational disks.
Since disk-head seeking does not apply to SSDs, idling does not really
help performance by anticipating future nearby IO requests.
Idling in anticipation of future nearby IO requests, and waiting for
the completion of submitted requests, also affects the other
bfq-queues by delaying their scheduling, and thereby affects some
time-bounded applications.
By turning off idling (and switching to IOPS mode), we allow other
processes (bfq-queues) to dispatch IO requests down to the driver and
so increase IO throughput.
The following FIO benchmark results were taken on a local SSD run:
RandomReads:
  Idling   iops   avg-lat(us)   stddev     bw
  --------------------------------------------------
  On       4136   1189.07       17221.65   16.9MB/s
  Off      7246    670.11        1054.76   29.7MB/s
fio --name=temp --size=5G --time_based --ioengine=sync \
--randrepeat=0 --direct=1 --invalidate=1 --verify=0 \
--verify_fatal=0 --rw=randread --blocksize=4k \
--group_reporting=1 --directory=/data --runtime=10 \
--iodepth=64 --numjobs=5
RandomWrites:
  Idling   iops   avg-lat(us)   stddev     bw
  --------------------------------------------------
  On       1368   3631.38       28234.55    5.47MB/s
  Off      4746   1024.61       12184.00   19.4MB/s
fio --name=temp --size=5G --time_based --ioengine=sync \
--randrepeat=0 --direct=1 --invalidate=1 --verify=0 \
--verify_fatal=0 --rw=randwrite --blocksize=4k \
--group_reporting=1 --directory=/data --runtime=10 \
--iodepth=64 --numjobs=5
Change-Id: I9e55eee03917a1ab07fbd3f04635ca1a6541b860
Signed-off-by: Pradeep P V K <ppvk@codeaurora.org>
Use the currently served bfq-queue to update bfq-group statistics.
The currently served bfq-queue can expire once the time/budget
allocated to it runs out. When that happens, BFQ selects, from the
queue's service tree, a new queue to be served; if no queue is left
there, it picks the next group of queues to serve from the group
service tree. The new request, from the new group and queue, is then
selected via __bfq_dispatch_request().
Since the "in_serv_queue" variable is not updated after this point,
the group associated with "in_serv_queue" can be freed if it has no
more active queues. So, picking in_serv_queue as the active queue and
updating its group statistics leads to a kernel panic like the one
below.
[ 120.572960] Hardware name: Qualcomm Technologies, Inc. Lito MTP (DT)
[ 120.572973] Workqueue: kblockd blk_mq_run_work_fn
[ 120.572979] pstate: a0c00085 (NzCv daIf +PAN +UAO)
[ 120.572987] pc : bfqg_stats_update_idle_time+0x14/0x50
[ 120.572992] lr : bfq_dispatch_request+0x398/0x948
[ 121.185249] Call trace:
[ 121.187772] bfqg_stats_update_idle_time+0x14/0x50
[ 121.192700] bfq_dispatch_request+0x398/0x948
[ 121.197187] blk_mq_do_dispatch_sched+0x84/0x118
[ 121.198270] CPU7: update max cpu_capacity 1024
[ 121.206504] blk_mq_sched_dispatch_requests+0x130/0x190
[ 121.211873] __blk_mq_run_hw_queue+0xcc/0x148
[ 121.216359] blk_mq_run_work_fn+0x24/0x30
[ 121.220489] process_one_work+0x328/0x6b0
[ 121.224619] worker_thread+0x330/0x4d0
[ 121.228475] kthread+0x128/0x138
[ 121.231806] ret_from_fork+0x10/0x1c
To avoid this, always use the bfq-queue derived from the request that
has actually been dispatched, i.e., the queue that is currently being
served.
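A sketch of the idea in bfq_dispatch_request() (not the complete
change):

rq = __bfq_dispatch_request(hctx);
if (rq) {
    /*
     * Derive the queue, and hence the group, from the request that
     * was actually dispatched; in_serv_queue may have expired in
     * the meantime and its group may already have been freed.
     */
    struct bfq_queue *bfqq = RQ_BFQQ(rq);

    if (bfqq)
        bfqg_stats_update_idle_time(bfqq_group(bfqq));
}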
Change-Id: I51d5b9d2020da9f3a3a31378b06257463afd08eb
Signed-off-by: Pradeep P V K <ppvk@codeaurora.org>
[ Upstream commit fd03177c33b287c6541f4048f1d67b7b45a1abc9 ]
As reported in [1], the call bfq_init_rq(rq) may return NULL in case
of OOM (in particular, if rq->elv.icq is NULL because memory
allocation failed in ioc_create_icq()).
This commit handles this circumstance.
[1] https://lkml.org/lkml/2019/7/22/824
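The handling amounts to treating a NULL return like the other cases
in which a request bypasses bfq's queues, roughly:

bfqq = bfq_init_rq(rq);
if (!bfqq || at_head || blk_rq_is_passthrough(rq)) {
    /* no bfq_queue (e.g. OOM): put rq on the dispatch list */
    if (at_head)
        list_add(&rq->queuelist, &bfqd->dispatch);
    else
        list_add_tail(&rq->queuelist, &bfqd->dispatch);
} else {
    __bfq_insert_request(bfqd, rq);
}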
Cc: Hsin-Yi Wang <hsinyi@google.com>
Cc: Nicolas Boichat <drinkcat@chromium.org>
Cc: Doug Anderson <dianders@chromium.org>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Hsin-Yi Wang <hsinyi@google.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit dbc3117d4ca9e17819ac73501e914b8422686750 upstream.
In reboot tests on several devices we were seeing a "use after free"
when slub_debug or KASAN was enabled. The kernel complained about:
Unable to handle kernel paging request at virtual address 6b6b6c2b
...which is a classic sign of use after free under slub_debug. The
stack crawl in kgdb looked like:
0 test_bit (addr=<optimized out>, nr=<optimized out>)
1 bfq_bfqq_busy (bfqq=<optimized out>)
2 bfq_select_queue (bfqd=<optimized out>)
3 __bfq_dispatch_request (hctx=<optimized out>)
4 bfq_dispatch_request (hctx=<optimized out>)
5 0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440)
6 0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440)
7 0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440)
8 0xc0568d94 in blk_mq_run_work_fn (work=<optimized out>)
9 0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480)
10 0xc024cff4 in worker_thread (__worker=0xec6d4640)
Digging in kgdb, it could be found that, though bfqq looked fine,
bfqq->bic had been freed.
Through further digging, I postulated that perhaps it is illegal to
access a "bic" (AKA an "icq") after bfq_exit_icq() had been called
because the "bic" can be freed at some point in time after this call
is made. I confirmed that there certainly were cases where the exact
crashing code path would access the "bic" after bfq_exit_icq() had
been called. Specifically, I set the "bfqq->bic" to (void *)0x7 and
saw that the bic was 0x7 at the time of the crash.
To understand a bit more about why this crash was fairly uncommon (I
saw it only once in a few hundred reboots), you can see that much of
the time bfq_exit_icq_bfqq() fully frees the bfqq and thus it can't
access the ->bic anymore. The only case it doesn't is if
bfq_put_queue() sees a reference still held.
However, even in the case when bfqq isn't freed, the crash is still
rare. Why? I tracked what happened to the "bic" after the exit
routine. It doesn't get freed right away. Rather,
put_io_context_active() eventually called put_io_context() which
queued up freeing on a workqueue. The freeing then actually happened
later than that through call_rcu(). Despite all these delays, some
extra debugging showed that all the hoops could be jumped through in
time and the memory could be freed causing the original crash. Phew!
To make a long story short, assuming it truly is illegal to access an
icq after the "exit_icq" callback is finished, this patch is needed.
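A sketch of the resulting change: while tearing the queue down under
bfqd->lock, sever the back-pointer so that no later path can reach the
soon-to-be-freed bic through bfqq->bic (helper names as in
bfq-iosched.c):

static void bfq_exit_icq_bfqq(struct bfq_io_cq *bic, bool is_sync)
{
    struct bfq_queue *bfqq = bic_to_bfqq(bic, is_sync);

    if (bfqq) {
        struct bfq_data *bfqd = bfqq->bfqd;
        unsigned long flags;

        spin_lock_irqsave(&bfqd->lock, flags);
        /* the bic may be freed any time after exit_icq returns */
        bfqq->bic = NULL;
        bfq_exit_bfqq(bfqd, bfqq);
        bic_set_bfqq(bic, NULL, is_sync);
        spin_unlock_irqrestore(&bfqd->lock, flags);
    }
}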
Cc: stable@vger.kernel.org
Reviewed-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 778c02a236a8728bb992de10ed1f12c0be5b7b0e ]
If a sync bfq_queue has a higher weight than some other queue, and
remains temporarily empty while in service, then, to preserve the
bandwidth share of the queue, it is necessary to plug I/O dispatching
until a new request arrives for the queue. In addition, a timeout
needs to be set, to avoid waiting for ever if the process associated
with the queue has actually finished its I/O.
Even with the above timeout, the device is however not fed with new
I/O for a while, if the process has finished its I/O. If this happens
often, then throughput drops and latencies grow. For this reason, the
timeout is kept rather low: 8 ms is the current default.
Unfortunately, such a low value may cause, on the opposite end, a
violation of bandwidth guarantees for a process that happens to issue
new I/O too late. The higher the system load, the higher the
probability that this happens to some process. This is a problem in
scenarios where service guarantees matter more than throughput. One
important case are weight-raised queues, which need to be granted a
very high fraction of the bandwidth.
To address this issue, this commit lower-bounds the plugging timeout
for weight-raised queues to 20 ms. This simple change provides
relevant benefits. For example, on a PLEXTOR PX-256M5S, with which
gnome-terminal starts in 0.6 seconds if there is no other I/O in
progress, the same application starts in
- 0.8 seconds, instead of 1.2 seconds, if ten files are being read
sequentially in parallel
- 1 second, instead of 2 seconds, if, in parallel, five files are
being read sequentially, and five more files are being written
sequentially
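The change itself is small: in bfq_arm_slice_timer() the idling window
gets a lower bound for weight-raised queues, roughly:

u64 sl = bfqd->bfq_slice_idle;

/* never grant a weight-raised queue less than 20 ms of plugging */
if (bfqq->wr_coeff > 1)
    sl = max_t(u64, sl, 20ULL * NSEC_PER_MSEC);

hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl),
              HRTIMER_MODE_REL);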
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 77f1e0a52d26242b6c2dba019f6ebebfb9ff701e upstream
A previous commit moved the shallow depth and BFQ depth map calculations
to be done at init time, moving it outside of the hotter IO path. This
potentially causes hangs if the users changes the depth of the scheduler
map, by writing to the 'nr_requests' sysfs file for that device.
Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if
the depth changes, so that the scheduler can update its internal state.
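On the bfq side the hook can be as small as recomputing the depth map
from the (possibly resized) sched tags; a sketch, assuming
bfq_update_depths() is made to return the smallest shallow depth it
computed:

static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
{
    struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
    struct blk_mq_tags *tags = hctx->sched_tags;
    unsigned int min_shallow;

    min_shallow = bfq_update_depths(bfqd, &tags->bitmap_tags);
    sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, min_shallow);
}

blk-mq then invokes this hook (wired up as the new depth_updated
elevator callback) whenever 'nr_requests' is changed for the device.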
Signed-off-by: Eric Wheeler <bfq@linux.ewheeler.net>
Tested-by: Kai Krakow <kai@kaishome.de>
Reported-by: Paolo Valente <paolo.valente@linaro.org>
Fixes: f0635b8a41 ("bfq: calculate shallow depths at init time")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 058fdecc6de7cdecbf4c59b851e80eb2d6c5295f ]
When a new I/O request arrives for a bfq_queue, say Q, bfq checks
whether that request is close to
(a) the head request of some other queue waiting to be served, or
(b) the last request dispatched for the in-service queue (in case Q
itself is not the in-service queue)
If a queue, say Q2, is found for which the above condition holds, then
bfq merges Q and Q2, to hopefully get a more sequential I/O in the
resulting merged queue, and thus a possibly higher throughput.
Case (b) is checked by comparing the new request for Q with the last
request dispatched, assuming that the latter necessarily belonged to the
in-service queue. Unfortunately, this assumption is no longer always
correct, since commit d0edc2473be9 ("block, bfq: inject other-queue I/O
into seeky idle queues on NCQ flash").
When the assumption does not hold, queues that must not be merged may be
merged, causing unexpected loss of control on per-queue service
guarantees.
This commit solves this problem by adding an extra field, which stores
the actual last request dispatched for the in-service queue, and by
using this new field to correctly check case (b).
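A sketch of the bookkeeping, with in_serv_last_pos used as an
illustrative name for the new field, updated where the dispatch
positions are tracked:

bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);

/*
 * Remember where the last request dispatched for the in-service
 * queue ends; last_position alone may now refer to an injected
 * request from another queue.
 */
if (RQ_BFQQ(rq) == bfqd->in_service_queue)
    bfqd->in_serv_last_pos = bfqd->last_position;

Case (b) then compares the position of the new request against
in_serv_last_pos rather than against last_position.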
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
When a sync request is dispatched, the queue that contains that
request, and all the ancestor entities of that queue, are charged with
the number of sectors of the request. In contrast, if the request is
async, then the queue and its ancestor entities are charged with the
number of sectors of the request, multiplied by an overcharge
factor. This throttles the bandwidth of async I/O w.r.t. sync I/O,
and it is done to counter the tendency of async writes to steal I/O
throughput from reads.
On the opposite end, the lower this factor, the more stable the I/O
control, in the following respect: the lower this factor is, the less
the bandwidth enjoyed by a group decreases
- when the group does writes, w.r.t. when it does reads;
- when other groups do reads, w.r.t. when they do writes.
The fixes "block, bfq: always update the budget of an entity when
needed" and "block, bfq: readd missing reset of parent-entity service"
improved I/O control in bfq to such an extent that it has been
possible to revise this overcharge factor downwards. This commit
introduces the resulting, new value.
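For reference, the overcharge is applied when a request is converted
into service; a sketch of that conversion (the commit only lowers the
constant bfq_async_charge_factor used here):

static unsigned long bfq_serv_to_charge(struct request *rq,
                                        struct bfq_queue *bfqq)
{
    if (bfq_bfqq_sync(bfqq))
        return blk_rq_sectors(rq);

    /* async I/O is overcharged by bfq_async_charge_factor */
    return blk_rq_sectors(rq) * bfq_async_charge_factor;
}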
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The received-service counter needs to be equal to 0 when an entity is
set in service. Unfortunately, commit "block, bfq: fix service being
wrongly set to zero in case of preemption" mistakenly removed the
resetting of this counter for the parent entities of the bfq_queue
being set in service. This commit fixes this issue by resetting
service for parent entities, directly on the expiration of the
in-service bfq_queue.
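The reset itself is a short walk up the hierarchy, performed when the
in-service bfq_queue really expires; roughly:

struct bfq_entity *entity = &bfqq->entity;

/* reset the service accumulated by bfqq and by all its parents */
for_each_entity(entity)
    entity->service = 0;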
Fixes: 9fae8dd59f ("block, bfq: fix service being wrongly set to zero in case of preemption")
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The actual goal of the function bfq_bfqq_may_idle is to tell whether
it is better to perform device idling (more precisely: I/O-dispatch
plugging) for the input bfq_queue, either to boost throughput or to
preserve service guarantees. This commit improves the name of the
function accordingly.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If
- a bfq_queue Q preempts another queue, because one request of Q
arrives in time,
- but, after this preemption, Q is not the queue that is set in service,
then Q->entity.service is set to 0 when Q is eventually set in
service. But Q should have continued receiving service with its old
budget (which is why preemption has occurred) and its old service.
This commit addresses this issue by resetting service on queue real
expiration.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For some bfq_queues, BFQ plugs I/O dispatching when the queue becomes
idle, and keeps the plug until a new request of the queue arrives, or
a timeout fires. BFQ does so either to boost throughput or to preserve
service guarantees for the queue.
More precisely, for such a queue, plugging starts when the queue
happens to have either no request enqueued, or no request in flight,
that is, no request already dispatched but not yet completed.
On the opposite end, BFQ may happen to expire a queue with no request
enqueued, without doing any plugging, if the queue still has some
request in flight. Unfortunately, such a premature expiration causes
the queue to lose its chance to enjoy dispatch plugging a moment
later, i.e., when its in-flight requests finally get completed. This
breaks service guarantees for the queue.
This commit prevents BFQ from expiring an empty queue if the latter
still has in-flight requests.
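In bfq_select_queue() terms, the check is of this shape (simplified
sketch):

/*
 * Do not expire an empty in-service queue that still has requests
 * in flight, if idling on it is worthwhile: plugging can then start
 * (or continue) when those requests complete.
 */
if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
    bfqq->dispatched > 0 && bfq_better_to_idle(bfqq))
    goto keep_queue;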
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To keep I/O throughput high as often as possible, BFQ performs
I/O-dispatch plugging (aka device idling) only when beneficial exactly
for throughput, or when needed for service guarantees (low latency,
fairness). An important case where the latter condition holds is when
the scenario is 'asymmetric' in terms of weights: i.e., when some
bfq_queue or whole group of queues has a higher weight, and thus has
to receive more service, than other queues or groups. Without dispatch
plugging, lower-weight queues/groups may unjustly steal bandwidth to
higher-weight queues/groups.
To detect asymmetric scenarios, BFQ checks some sufficient
conditions. One of these conditions is that active groups have
different weights. BFQ controls this condition by maintaining a
special set of unique weights of active groups
(group_weights_tree). To this purpose, in the function
bfq_active_insert/bfq_active_extract BFQ adds/removes the weight of a
group to/from this set.
Unfortunately, the function bfq_active_extract may happen to be
invoked also for a group that is still active (to preserve the correct
update of the next queue to serve, see comments in function
bfq_no_longer_next_in_service() for details). In this case, removing
the weight of the group makes the set group_weights_tree
inconsistent. Service-guarantee violations follow.
This commit addresses this issue by moving group_weights_tree
insertions from their previous location (in bfq_active_insert) into
the function __bfq_activate_entity, and by moving group_weights_tree
extractions from bfq_active_extract to when the entity that represents
a group remains thoroughly idle, i.e., with no request either enqueued
or dispatched.
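The extraction side can be sketched as a walk up the group hierarchy
that stops as soon as some level is still active (a sketch, not the
literal patch; __bfq_weights_tree_remove() stands for the low-level
removal from the tree):

void bfq_weights_tree_remove(struct bfq_data *bfqd,
                             struct bfq_queue *bfqq)
{
    struct bfq_entity *entity = bfqq->entity.parent;

    for_each_entity(entity) {
        struct bfq_sched_data *sd = entity->my_sched_data;

        /*
         * Some queue or group below this level is still being
         * served or is next to be served: the group is not
         * thoroughly idle yet, keep its weight in the tree.
         */
        if (sd->next_in_service || sd->in_service_entity)
            break;

        __bfq_weights_tree_remove(bfqd, entity,
                                  &bfqd->group_weights_tree);
    }
}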
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BFQ can deem a bfq_queue as soft real-time only if the queue
- periodically becomes completely idle, i.e., empty and with
no still-outstanding I/O request;
- after becoming idle, gets new I/O only after a special reference
time soft_rt_next_start.
In this respect, after commit "block, bfq: consider also past I/O in
soft real-time detection", the value of soft_rt_next_start can never
decrease. This causes a problem with the following special updating
case for soft_rt_next_start: to prevent queues that are not completely
idle from being wrongly detected as soft real-time (when they become
non-empty again), soft_rt_next_start is temporarily set to infinity
for empty queues with still outstanding I/O requests. But, if such an
update is actually performed, then, because of the above commit,
soft_rt_next_start will be stuck at infinity forever, and the queue
will have no more chance to be considered soft real-time.
On slow systems, this problem does cause actual soft real-time
applications to be occasionally not detected as such.
This commit addresses this issue by eliminating the pushing of
soft_rt_next_start to infinity, and by changing the way non-empty
queues are prevented from being wrongly detected as soft
real-time. Simply, a queue that becomes non-empty again can now be
detected as soft real-time only if it has no outstanding I/O request.
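In sketch form, the detection condition in the idle-to-busy switch
becomes:

/*
 * A queue becoming busy again can be flagged as soft real-time only
 * if it is truly idle, i.e., it also has no I/O still in flight.
 */
soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
    !bfq_bfqq_in_large_burst(bfqq) &&
    bfqq->dispatched == 0 &&
    time_is_before_jiffies(bfqq->soft_rt_next_start);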
Signed-off-by: Davide Sapienza <sapienza.dav@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The maximum possible duration of the weight-raising period for
interactive applications is limited to 13 seconds, as this is the time
needed to load the largest application that we considered when tuning
weight raising. Unfortunately, in such an evaluation, we did not
consider the case of very slow virtual machines.
For example, on a QEMU/KVM virtual machine
- running in a slow PC;
- with a virtual disk stacked on a slow low-end 5400rpm HDD;
- serving a heavy I/O workload, such as the sequential reading of
several files;
mplayer takes 23 seconds to start, if constantly weight-raised.
To address this issue, this commit conservatively sets the upper limit
for weight-raising duration to 25 seconds.
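The cap is applied where the duration is computed (bfq_wr_duration()),
along these lines:

/*
 * Clamp the automatically computed weight-raising duration, so that
 * even very slow (e.g. virtual) systems do not exceed 25 seconds.
 */
if (dur > msecs_to_jiffies(25000))
    dur = msecs_to_jiffies(25000);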
Signed-off-by: Davide Sapienza <sapienza.dav@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BFQ computes the duration of weight raising for interactive
applications automatically, using some reference parameters. In
particular, BFQ uses the best durations (see comments in the code for
how these durations have been assessed) for two classes of systems:
slow and fast ones. Examples of slow systems are old phones or systems
using micro HDDs. Fast systems are all the remaining ones. Using these
parameters, BFQ computes the actual duration of the weight raising,
for the system at hand, as a function of the relative speed of the
system w.r.t. the speed of a reference system, belonging to the same
class of systems as the system at hand.
This slow vs fast differentiation proved to be useful in the past, but
happens to have little meaning with current hardware. Even worse, it
does cause problems in virtual systems, where the speed of the system
can vary frequently, and so widely as to confuse the class-detection
mechanism, and, as we have verified experimentally, to cause BFQ to
compute non-sensical weight-raising durations.
This commit addresses this issue by removing the slow class and the
class-detection mechanism.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A description of how weight raising works is missing in BFQ
sources. In addition, the code for handling weight raising is
scattered across a few functions. This makes it rather hard to
understand the mechanism and its rationale. This commit adds such a
description at the beginning of the main source file.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since bfq_finish_request() is always called on the request 'next',
after bfq_requests_merged() is finished, and bfq_finish_request()
removes 'next' from its bfq_queue if needed, it isn't necessary to do
such a removal in advance in bfq_requests_merged().
This commit removes such a useless 'next' removal.
Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The request rq passed to the function bfq_requests_merged is always in
a bfq_queue, so the check !RB_EMPTY_NODE(&rq->rb_node) at the
beginning of bfq_requests_merged always succeeds, and the control
flow systematically skips to the end of the function. This implies
that the body of the function is never executed, i.e., the
repositioning of rq is never performed.
On the opposite end, a check is missing in the body of the function:
'next' must be removed only if it is inside a bfq_queue.
This commit removes the wrong check on rq, and adds the missing check
on 'next'. In addition, this commit adds comments on
bfq_requests_merged.
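After the fix, the body of the function runs, and 'next' is removed
only if it actually sits in a bfq_queue; in sketch form:

/* remove 'next' from its bfq_queue only if it is queued there */
if (!RB_EMPTY_NODE(&next->rb_node))
    bfq_remove_request(next->q, next);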
Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In bfq_requests_merged(), there is a deadlock because the lock on
bfqq->bfqd->lock is held by the calling function, but the code of
this function tries to grab the lock again.
This deadlock is currently hidden by another bug (fixed by the next
commit for this source file), which causes the body of
bfq_requests_merged() never to be executed.
This commit removes the deadlock by removing the lock/unlock pair.
Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If our shallow depth is smaller than the wake batching of sbitmap,
we can introduce hangs. Ensure that sbitmap knows how low we'll go.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bfqd->sb_shift was intended as a cache for the sbitmap queue shift,
but we don't need it, as it never changes. Kill it with fire.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It doesn't change, so don't put it in the per-IO hot path.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reserved tags are used for error handling; we don't need to
care about them for regular IO. The core won't call us for these
anyway.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When invoked for an I/O request rq, the prepare_request hook of bfq
increments reference counters in the destination bfq_queue for rq. In
this respect, after this hook has been invoked, rq may still be
transformed into a request with no icq attached, i.e., for bfq, a
request not associated with any bfq_queue. No further hook is invoked
to signal this transformation to bfq (in general, to the destination
elevator for rq). This leads bfq into an inconsistent state, because
bfq has no chance to correctly lower these counters back. This
inconsistency may in its turn cause incorrect scheduling and hangs. It
certainly causes memory leaks, by making it impossible for bfq to free
the involved bfq_queue.
On the bright side, no transformation can still happen for rq after rq
has been inserted into bfq, or merged with another, already inserted,
request. Exploiting this fact, this commit addresses the above issue
by delaying the preparation of an I/O request to when the request is
inserted or merged.
This change also gives a performance bonus: a lock-contention point
gets removed. To prepare a request, bfq needs to hold its scheduler
lock. After postponing request preparation to insertion or merging, no
lock needs to be grabbed any longer in the prepare_request hook, while
the lock already taken to perform insertion or merging is used to
prepare the request as well.
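A sketch of where the preparation now happens (insertion path only;
the merge path is analogous):

static void bfq_insert_request(struct blk_mq_hw_ctx *hctx,
                               struct request *rq, bool at_head)
{
    struct request_queue *q = hctx->queue;
    struct bfq_data *bfqd = q->elevator->elevator_data;
    struct bfq_queue *bfqq;

    spin_lock_irq(&bfqd->lock);
    /*
     * Associate rq with its bfq_queue here, under the scheduler
     * lock that insertion takes anyway, instead of in the
     * prepare_request hook: at this point rq can no longer be
     * transformed into a request with no icq.
     */
    bfqq = bfq_init_rq(rq);
    /* ... insertion into bfqq, or into the dispatch list, follows ... */
    spin_unlock_irq(&bfqd->lock);
}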
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Even if we don't have an IO context attached to a request, we still
need to clear the priv[0..1] pointers, as they could be pointing
to previously used bic/bfqq structures. If we don't do so, we'll
either corrupt memory on dispatching a request, or cause an
imbalance in counters.
Inspired by a fix from Kees.
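A minimal sketch of the idea (treat it as illustrative rather than as
the exact patch; rq->elv.priv[] is the elevator-private data attached
to each request):

    static void example_prepare_request(struct request *rq)
    {
            /* A recycled request may still carry the bic/bfqq pointers
             * of its previous user; clear them even when no io_context
             * is attached, so that dispatch and finish_request never
             * act on stale pointers.
             */
            rq->elv.priv[0] = rq->elv.priv[1] = NULL;
    }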
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Reported-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Fixes: aee69d78de ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a storage device handled by BFQ happens to be slower than 7.5 KB/s
for a certain amount of time (in the order of a second), then the
estimated peak rate of the device, maintained in BFQ, becomes equal to
0. The reason is the limited precision with which the rate is
represented (details on the range of representable values in the
comments introduced by this commit). This leads to a division-by-zero
error where the estimated peak rate is used as a divisor. This type of
failure has been reported in [1].
This commit addresses this issue by:
1. Lower-bounding the estimated peak rate to 1
2. Adding and improving comments on the range of rates representable
[1] https://www.spinics.net/lists/kernel/msg2739205.html
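Countermeasure 1 boils down to a one-line clamp; a sketch follows
(new_rate stands for the freshly computed estimate, and the rate is
assumed to be kept in a 32-bit field):

    /* Never let the estimated peak rate fall to 0: it is later used
     * as a divisor, so clamp it to the lowest representable value.
     */
    bfqd->peak_rate = max_t(u32, new_rate, 1);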
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
be re-inserted into the active I/O scheduler for that device. As a
consequence, I/O schedulers may get the same request inserted again,
even several times, without a finish_request invoked on that request
before each re-insertion.
This fact is the cause of the failure reported in [1]. For an I/O
scheduler, every re-insertion of the same re-prepared request is
equivalent to the insertion of a new request. For schedulers like
mq-deadline or kyber, this fact causes no harm. In contrast, it
confuses a stateful scheduler like BFQ, which keeps state for an I/O
request, until the finish_request hook is invoked on the request. In
particular, BFQ may get stuck, waiting forever for the number of
request dispatches, of the same request, to be balanced by an equal
number of request completions (while there will be only one completion for
that request). In this state, BFQ may refuse to serve I/O requests
from other bfq_queues. The hang reported in [1] then follows.
However, the above re-prepared requests undergo a requeue, thus the
requeue_request hook of the active elevator is invoked for these
requests, if set. This commit then addresses the above issue by
properly implementing the hook requeue_request in BFQ.
[1] https://marc.info/?l=linux-block&m=151211117608676
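In elevator terms, the fix amounts to providing a requeue_request
operation that rolls back the per-request state left over by the
previous dispatch. A hedged sketch (hypothetical helper name, not the
actual code):

    /* Registered as the elevator's requeue_request operation. */
    static void example_requeue_request(struct request *rq)
    {
            /* The very same request is about to be re-inserted: undo
             * the bookkeeping done at dispatch time (in-flight
             * counters, queue references, ...), so that the coming
             * re-insertion is not accounted as a brand new request.
             */
            example_undo_dispatch_accounting(rq); /* hypothetical */
    }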
Reported-by: Ivan Kozik <ivan@ludios.org>
Reported-by: Alban Browaeys <alban.browaeys@gmail.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Serena Ziviani <ziviani.serena@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To maximise responsiveness, BFQ raises the weight, and performs device
idling, for bfq_queues associated with processes deemed as
interactive. In particular, weight raising has a maximum duration,
equal to the time needed to start a large application. If a
weight-raised process goes on doing I/O beyond this maximum duration,
it loses weight-raising.
This mechanism is evidently vulnerable to the following false
positives: I/O-bound applications that will go on doing I/O for much
longer than the duration of weight-raising. These applications have
basically no benefit from being weight-raised at the beginning of
their I/O. On the opposite end, while being weight-raised, these
applications
a) unjustly steal throughput from applications that may truly need
low latency;
b) make BFQ uselessly perform device idling; device idling results
in loss of device throughput with most flash-based storage, and may
increase latencies when used purposelessly.
This commit adds a countermeasure to reduce both the above
problems. To introduce this countermeasure, we provide the following
extra piece of information (full details in the comments added by this
commit). During the start-up of the large application used as a
reference to set the duration of weight-raising, involved processes
transfer at most ~110K sectors each. Accordingly, a process initially
deemed as interactive has no right to be weight-raised any longer,
once it has transferred 110K sectors or more.
Based on this consideration, this commit ends weight-raising early
for a bfq_queue if the latter happens to have received an amount of
service at least equal to 110K sectors (actually, a little bit more,
to keep a safety margin). I/O-bound applications that reach a high
throughput, such as file copy, reach this threshold well before the
allowed weight-raising period finishes. Thus this early ending of
weight-raising reduces the amount of time during which these
applications cause the problems described above.
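A condensed sketch of the new early-end condition (names are
illustrative; the actual threshold is 110K sectors plus a safety
margin):

    if (bfqq->wr_coeff > 1 && /* queue is currently weight-raised */
        bfqq->service_from_wr > max_service_from_wr)
            bfq_bfqq_end_wr(bfqq); /* end weight-raising early */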
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Asynchronous I/O can easily starve synchronous I/O (both sync reads
and sync writes), by consuming all request tags. Similarly, storms of
synchronous writes, such as those that sync(2) may trigger, can starve
synchronous reads. In their turn, these two problems may also cause
BFQ to lose control of latency for interactive and soft real-time
applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice
Writer takes 0.6 seconds to start if the device is idle, but it takes
more than 45 seconds (!) if there are sequential writes in the
background.
This commit addresses this issue by limiting the maximum percentage of
tags that asynchronous I/O requests and synchronous write requests can
consume. In particular, this commit grants a higher threshold to
synchronous writes, to prevent the latter from being starved by
asynchronous I/O.
According to the above test, LibreOffice Writer now starts in about
1.2 seconds on average, regardless of the background workload, and
apart from some rare outlier. To check this improvement, run, e.g.,
sudo ./comm_startup_lat.sh bfq 5 5 seq 10 "lowriter --terminate_after_init"
for the comm_startup_lat benchmark in the S suite [1].
[1] https://github.com/Algodev-github/S
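The mechanism can be sketched as follows (the fractions below are
example values, not the ones actually chosen by the patch): the depth
of the tag map seen by a request is capped according to the type of
the request, leaving sync reads free to use the whole depth.

    static unsigned int example_limit_depth(unsigned int op,
                                            unsigned int depth)
    {
            if (!op_is_sync(op))
                    return max(depth / 4, 1U);  /* async I/O */
            if (op_is_write(op))
                    return max(depth / 2, 1U);  /* sync writes: higher share */
            return depth;                       /* sync reads: no limit */
    }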
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit '7b9e93616399' ("blk-mq-sched: unify request finished methods")
changed the name of the method now called bfq_finish_request, but left
the old name unchanged elsewhere in the code (in related comments, and
as part of the name of the helper function bfq_put_rq_priv_body).
This commit fixes all occurrences of the old name of this method by
changing them into the current name.
Fixes: 7b9e936163 ("blk-mq-sched: unify request finished methods")
Reviewed-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Federico Motta <federico@willer.it>
Signed-off-by: Chiara Bruschi <bruschi.chiara@outlook.it>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's not available if we don't have group io scheduling set, and
there's no need to call it.
Fixes: 0d52af5905 ("block, bfq: release oom-queue ref to root group on exit")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On scheduler init, a reference to the root group, and a reference to
its corresponding blkg are taken for the oom queue. Yet these
references are not released on scheduler exit, which prevents these
objects from being freed. This commit adds the missing reference
releases.
Reported-by: Davide Ferrari <davideferrari8@gmail.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit a33801e8b4 ("block, bfq: move debug blkio stats behind
CONFIG_DEBUG_BLK_CGROUP") introduced two batches of confusing ifdefs:
one reported in [1], plus a similar one in another function. This
commit removes both batches, in the way suggested in [1].
[1] https://www.spinics.net/lists/linux-block/msg20043.html
Fixes: a33801e8b4 ("block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BFQ privileges the I/O of soft real-time applications, such as video
players, so as to guarantee these applications a high bandwidth and a
low latency. In this respect, it is not easy to correctly detect when an
application is soft real-time. A particularly nasty false positive is
that of an I/O-bound application that occasionally happens to meet all
requirements to be deemed as soft real-time. After being detected as
soft real-time, such an application monopolizes the device. Fortunately,
BFQ soon realizes that the application is actually not soft real-time,
and suspends every privilege. Yet, the application may again happen to
be wrongly detected as soft real-time, and so on.
As highlighted by our tests, this problem causes BFQ to occasionally
fail to guarantee a high responsiveness, in the presence of heavy
background I/O workloads. The reason is that the background workload
happens to be detected as soft real-time, more or less frequently,
during the execution of the interactive task under test. To give an
idea, because of this problem, Libreoffice Writer occasionally takes 8
seconds, instead of 3, to start up, if there are sequential reads and
writes in the background, on a Kingston SSDNow V300.
This commit addresses this issue by leveraging the following facts.
The reason why some applications are detected as soft real-time despite
all BFQ checks to avoid false positives is simply that, during high
CPU or storage-device load, I/O-bound applications may happen to do
I/O slowly enough to meet all soft real-time requirements, and pass
all BFQ extra checks. Yet, this happens only for limited time periods:
slow-speed time intervals are usually interspersed between other time
intervals during which these applications do I/O at a very high speed.
To exploit these facts, this commit introduces a little change, in the
detection of soft real-time behavior, to systematically consider also
the recent past: the higher the speed was in the recent past, the
later the next I/O should arrive for the application to be considered
soft real-time. At the beginning of a slow-speed interval, the minimum
arrival time allowed for the next I/O usually happens to still be so
high as to fall *after* the end of the slow-speed period itself. As a
consequence, the application does not risk being deemed soft
real-time during the slow-speed interval. Then, during the next
high-speed interval, the application cannot, evidently, be deemed as
soft real-time (exactly because of its speed), and so on.
This extra filtering proved to be rather effective: in the above test,
the frequency of false positives became so low that the start-up time
was 3 seconds in all iterations (apart from occasional outliers,
caused by page-cache-management issues, which are out of the scope of
this commit, and cannot be solved by an I/O scheduler).
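A hedged sketch of the extra filtering (variable names are
assumptions): the next-allowed arrival time grows with the amount of
service received since the queue became backlogged, divided by the
maximum rate compatible with soft real-time behavior, so a recent
high-speed phase pushes that time far into the future.

    static unsigned long example_softrt_next_start(unsigned long last_backlogged,
                                                   unsigned long service_sectors,
                                                   unsigned long max_softrt_rate)
    {
            /* Time by which an application doing I/O at most at
             * max_softrt_rate (sectors/sec) would have generated
             * service_sectors of I/O since last_backlogged (jiffies).
             */
            return last_backlogged + HZ * service_sectors / max_softrt_rate;
    }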
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When two or more processes do I/O in such a way that their requests are
sequential with respect to one another, BFQ merges the bfq_queues
associated with the processes. This way the overall I/O pattern becomes
sequential, and thus there is a boost in throughput.
These cooperating processes usually start or restart to do I/O shortly
after each other. So, in order to avoid merging non-cooperating processes,
BFQ ensures that none of these queues has been in weight raising for too
long.
In this respect, from commit "block, bfq-sq, bfq-mq: let a queue be merged
only shortly after being created", BFQ checks whether any queue (and not
only weight-raised ones) has been doing I/O continuously for too long
to be merged.
This new, additional check makes the first one useless: a queue that
has been doing I/O for long enough, if weight-raised, is also a queue
that has been in weight raising for too long to be merged.
Accordingly, this commit
removes the first check.
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In BFQ and CFQ, two processes are said to be cooperating if they do
I/O in such a way that the union of their I/O requests yields a
sequential I/O pattern. To get such a sequential I/O pattern out of
the non-sequential pattern of each cooperating process, BFQ and CFQ
merge the queues associated with these processes. In more detail,
cooperating processes, and thus their associated queues, usually
start, or restart, to do I/O shortly after each other. This is the
case, e.g., for the I/O threads of KVM/QEMU and of the dump
utility. Based on this assumption, this commit allows a bfq_queue to
be merged only during a short time interval (100ms) after it starts,
or re-starts, to do I/O. This filtering provides two important
benefits.
First, it greatly reduces the probability that two non-cooperating
processes have their queues merged by mistake, if they just happen to
do I/O close to each other for a short time interval. These spurious
merges cause loss of service guarantees. A low-weight bfq_queue may
unjustly get more than its expected share of the throughput: if such a
low-weight queue is merged with a high-weight queue, then the I/O for
the low-weight queue is served as if the queue had a high weight. This
may damage other high-weight queues unexpectedly. For instance,
because of this issue, lxterminal occasionally took 7.5 seconds to
start, instead of 6.5 seconds, when some sequential readers and
writers did I/O in the background on a FUJITSU MHX2300BT HDD. The
reason is that the bfq_queues associated with some of the readers or
the writers were merged with the high-weight queues of some processes
that had to do some urgent but little I/O. The readers then exploited
the inherited high weight for all or most of their I/O, during the
start-up of the terminal. The filtering introduced by this commit
eliminated any outlier caused by spurious queue merges in our start-up
time tests.
This filtering also provides a little boost of the throughput
sustainable by BFQ: 3-4%, depending on the CPU. The reason is that,
once a bfq_queue cannot be merged any longer, this commit makes BFQ
stop updating the data needed to handle merging for the queue.
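A minimal sketch of the filter (field and constant names are
assumptions): a queue is considered for merging only within a short
window after it (re)starts doing I/O.

    static const unsigned long example_merge_time_limit = HZ / 10; /* ~100ms */

    static bool example_too_late_for_merging(struct bfq_queue *bfqq)
    {
            return time_is_before_jiffies(bfqq->first_IO_time +
                                          example_merge_time_limit);
    }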
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A just-created bfq_queue will certainly be deemed as interactive on
the arrival of its first I/O request, if the low_latency flag is
set. Yet, if the queue is merged with another queue on the arrival of
its first I/O request, it will not have the chance to be flagged as
interactive. Nevertheless, if the queue is then split soon enough, it
has to be flagged as interactive after the split.
To handle this early-merge scenario correctly, BFQ saves the state of
the queue, on the merge, as if the latter had already been deemed
interactive. So, if the queue is split soon, it will get
weight-raised, because the previous state of the queue is resumed on
the split.
Unfortunately, in the act of saving the state of the newly-created
queue, BFQ doesn't check whether the low_latency flag is set, and this
causes early-merged queues to be then weight-raised, on queue splits,
even if low_latency is off. This commit addresses this problem by
adding the missing check.
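A hedged sketch of the added check (field names are assumptions, not
the actual code): the just-created queue is saved as weight-raised
only if low_latency is actually enabled.

    if (queue_is_just_created && bfqd->low_latency)
            /* resume as an interactive, weight-raised queue on a split */
            bic->saved_wr_coeff = bfqd->bfq_wr_coeff;
    else
            bic->saved_wr_coeff = bfqq->wr_coeff;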
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If two processes do I/O close to each other, then BFQ merges the
bfq_queues associated with these processes, to get a more sequential
I/O, and thus a higher throughput. In this respect, to detect whether
two processes are doing I/O close to each other, BFQ keeps a list of
the head-of-line I/O requests of all active bfq_queues. The list is
ordered by initial sectors, and implemented through a red-black tree
(rq_pos_tree).
Unfortunately, the update of the rq_pos_tree was incomplete, because
the tree was not updated on the removal of the head-of-line I/O
request of a bfq_queue, in case the queue did not remain empty. This
commit adds the missing update.
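The missing update can be sketched as follows (hypothetical helper
names): when the removed request was the head-of-line request of a
queue that remains non-empty, the queue must be re-keyed in the
rq_pos_tree on its new head sector.

    if (rq == bfqq->next_rq) {
            bfqq->next_rq = example_find_next_rq(bfqq);  /* hypothetical */
            if (bfqq->next_rq)
                    example_pos_tree_reposition(bfqq);   /* hypothetical: re-insert
                                                          * bfqq keyed on the new
                                                          * head sector */
    }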
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If two processes do I/O close to each other, i.e., are cooperating
processes in BFQ's (and CFQ's) nomenclature, then BFQ merges their
associated bfq_queues, so as to get sequential I/O from the union of
the I/O requests of the processes, and thus reach a higher
throughput. A merged queue is then split if its I/O stops being
sequential. In this respect, BFQ deems the I/O of a bfq_queue as
(mostly) sequential only if less than 4 I/O requests are random, out
of the last 32 requests inserted into the queue.
Unfortunately, extensive testing (with the interleaved_io benchmark of
the S suite [1], and with real applications spawning cooperating
processes) has clearly shown that, with such a low threshold, only a
rather low I/O throughput may be reached when several cooperating
processes do I/O. In particular, the outcome of each test run was
bimodal: if queue merging occurred and was stable during the test,
then the throughput was close to the peak rate of the storage device,
otherwise the throughput was arbitrarily low (usually around 1/10 of
the peak rate with a rotational device). The probability of getting the
unlucky outcome grew with the number of cooperating processes: it was
already significant with 5 processes, and close to one with 7 or more
processes.
The cause of the low throughput in the unlucky runs was that the
merged queues containing the I/O of these cooperating processes were
soon split, because they contained more random I/O requests than those
tolerated by the 4/32 threshold, but
- that I/O would nevertheless have allowed the storage device to reach
peak or almost-peak throughput;
- in contrast, the I/O of these processes, if served individually
(from separate queues), yielded a rather low throughput.
So we repeated our tests with increasing values of the threshold,
until we found the minimum value (19) for which we obtained maximum
throughput, reliably, with at least up to 9 cooperating
processes. Then we checked that the use of that higher threshold value
did not cause any regression for any other benchmark in the suite [1].
This commit raises the threshold to such a higher value.
[1] https://github.com/Algodev-github/S
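In terms of code, the change can be sketched as follows (assuming a
32-bit per-queue bitmap with one bit per recent random request):

    /* Old: a queue was deemed seeky if more than 4 of the last 32
     * requests were random; new threshold: more than 19 of 32.
     */
    #define EXAMPLE_SEEKY_THRESHOLD 19
    #define EXAMPLE_SEEKY(seek_history) \
            (hweight32(seek_history) > EXAMPLE_SEEKY_THRESHOLD)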
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BFQ currently creates, and updates, its own instance of the whole
set of blkio statistics that cfq creates. Yet, from the comments
of Tejun Heo in [1], it turned out that most of these statistics
are meant/useful only for debugging. This commit makes BFQ create
these debugging statistics only if the option
CONFIG_DEBUG_BLK_CGROUP is set.
By doing so, this commit also enables BFQ to enjoy a high performance
boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then
BFQ has to update far fewer statistics, and, in particular, not the
heaviest to update. To give an idea of the benefits, if
CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and
with 8 threads doing random I/O in parallel on null_blk (configured
with 0 latency), the throughput of BFQ grows from 310 to 400 KIOPS
(+30%). We have measured similar or even much higher boosts with other
CPUs: e.g., +45% with an ARM CortexTM-A53 Octa-core. Our results have
been obtained and can be reproduced very easily with the script in [1].
[1] https://www.spinics.net/lists/linux-block/msg18943.html
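One typical way to obtain this, sketched here with a hypothetical
function name, is to let the debugging-only statistic updaters compile
away into empty stubs when the option is not set:

    #ifdef CONFIG_DEBUG_BLK_CGROUP
    void example_stats_update(struct bfq_group *bfqg, unsigned int op);
    #else
    static inline void example_stats_update(struct bfq_group *bfqg,
                                            unsigned int op) { }
    #endif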
Suggested-by: Tejun Heo <tj@kernel.org>
Suggested-by: Ulf Hansson <ulf.hansson@linaro.org>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bfq invokes various blkg_*stats_* functions to update the statistics
contained in the special files blkio.bfq.* in the blkio controller
groups, i.e., the I/O accounting related to the proportional-share
policy provided by bfq. The execution of these functions takes a
considerable percentage, about 40%, of the total per-request execution
time of bfq (i.e., of the sum of the execution time of all the bfq
functions that have to be executed to process an I/O request from its
creation to its destruction). This reduces the request-processing
rate sustainable by bfq noticeably, even on a multicore CPU. In fact,
the bfq functions that invoke blkg_*stats_* functions cannot be
executed in parallel with the rest of the code of bfq, because both
are executed under the same per-device scheduler lock.
To reduce this slowdown, this commit moves, wherever possible, the
invocation of these functions (more precisely, of the bfq functions
that invoke blkg_*stats_* functions) outside the critical sections
protected by the scheduler lock.
With this change, and with all blkio.bfq.* statistics enabled, the
throughput grows, e.g., from 250 to 310 KIOPS (+25%) on an Intel
i7-4850HQ, in case of 8 threads doing random I/O in parallel on
null_blk, with the latter configured with 0 latency. We obtained the
same or higher throughput boosts, up to +30%, with other processors
(some figures are reported in the documentation). For our tests, we
used the script [1], with which our results can be easily reproduced.
NOTE. This commit still protects the invocation of blkg_*stats_*
functions with the request_queue lock, because the group these
functions are invoked on may otherwise disappear before or while these
functions are executed. Fortunately, tests without even this lock
show, by difference, that the serialization caused by this lock has
little impact (at most a ~5% throughput reduction).
[1] https://github.com/Algodev-github/IOSpeed
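The resulting structure can be sketched as follows (hypothetical
helper names): the scheduler work proper runs under the scheduler
lock, while the blkg_*stats_* updates run afterwards, under the
request_queue lock only.

    spin_lock_irq(&bfqd->lock);
    rq = example_dispatch_request(bfqd);      /* scheduler work proper */
    spin_unlock_irq(&bfqd->lock);

    spin_lock_irq(q->queue_lock);             /* keeps the group alive */
    example_update_dispatch_stats(q, rq);     /* blkg_*stats_* calls */
    spin_unlock_irq(q->queue_lock);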
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bfqg_stats_update_io_add and bfqg_stats_update_io_remove are to be
invoked, respectively, when an I/O request enters and when an I/O
request exits the scheduler. Unfortunately, bfq does not fully comply
with this scheme, because it does not invoke these functions for
requests that are inserted into or extracted from its priority
dispatch list. This commit fixes this mistake.
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The commit "block, bfq: decrease burst size when queues in burst
exit" introduced the decrement of burst_size on the removal of a
bfq_queue from the burst list. Unfortunately, this decrement can
happen to be performed even when burst size is already equal to 0,
because of unbalanced decrements. A description follows of the cause
of these unbalanced decrements, namely a wrong assumption, and of the
way this wrong assumption leads to unbalanced decrements.
The wrong assumption is that a bfq_queue can exit only if the process
associated with the bfq_queue has exited. This is false, because a
bfq_queue, say Q, may exit also as a consequence of a merge with
another bfq_queue. In this case, Q exits because the I/O of its
associated process has been redirected to another bfq_queue.
The decrement unbalance occurs because Q may then be re-created after
a split, and added back to the current burst list, *without*
incrementing burst_size. burst_size is not incremented because Q is
not a new bfq_queue added to the burst list, but a bfq_queue only
temporarily removed from the list, and, before the commit "bfq-sq,
bfq-mq: decrease burst size when queues in burst exit", burst_size was
not decremented when Q was removed.
This commit addresses this issue by just checking whether the exiting
bfq_queue is a merged bfq_queue, and, in that case, not decrementing
burst_size. Unfortunately, this still leaves room for unbalanced
decrements, in the following rarer case: on a split, the bfq_queue
happens to be inserted into a different burst list than that it was
removed from when merged. If this happens, the number of elements in
the new burst list becomes higher than burst_size (by one). When the
bfq_queue then exits, it is of course not in a merged state any
longer, thus burst_size is decremented, which results in an unbalanced
decrement. To handle this sporadic, unlucky case in a simple way,
this commit also checks that burst_size is larger than 0 before
decrementing it.
Finally, this commit removes a useless, extra check: the check that
the bfq_queue is sync, performed before checking whether the bfq_queue
is in the burst list. This extra check is redundant, because only sync
bfq_queues can be inserted into the burst list.
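Both countermeasures fit in a few lines; a hedged sketch follows (the
two conditions are written with illustrative names):

    /* On bfq_queue exit: */
    if (queue_is_in_burst_list && !queue_exits_because_of_merge) {
            hlist_del_init(&bfqq->burst_list_node);
            if (bfqd->burst_size > 0)   /* never underflow */
                    bfqd->burst_size--;
    }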
Fixes: 7cb04004fa ("block, bfq: decrease burst size when queues in burst exit")
Reported-by: Philip Müller <philm@manjaro.org>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Tested-by: Philip Müller <philm@manjaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>