-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAly6xzUACgkQONu9yGCS
aT5sIA//b7nAk2zuhmbkonsBfzFq5uBJmqXcCrOgy3XHMs4fE+Q11kLd1wMAV7dx
U7FNHe4PIJ8Rczxgqr2VP3VmFbV6UuTK+UTclJKfbV3ouIAQiQBuutABBmbDUj2p
FInc/yAYyhVc9n7gX78czTiUxKnKi4+sisUYDCZPr3hr6jDPcLvm/WVWdyrcXJje
rYFNmE/2MBH1NofG+MOpq+ILhKHXlf2APN2/spl+I42a8bwodiSl9g+dhuWr7wgT
Ln2Ocf7BZ6BPCQKoveZdD1Gd56NNR/lJh4ulqpuhaZw4Yp+B/C7GmrBtdPzVSGka
IwPWoSc9/9VSUl+ooSZHms78VLbqq0rNNclskL2bN6m962u04Eu7sB2Tg/bwUs52
Wkcw0DY4J/oMJtj/CMHcQOUPsk6vwHxqnjsj+LYJ1ZjHO68tUshnENxXrbAoDc45
2fuY3TCA+XqFvqNt5HbkLPtFR78u8QmZ1lP/Pkri6xoG/GA6O0EAxhS0Z9hncGK7
8wJNuxLMd2UX94wlajQ+DF7yyCU4HOFEdeSEOwlHHBid/fckXsGzL2tKJUAbbUPP
ux3An8kJHni8nQrmUkyy1Nx29ROyAFxBLOQshWGpXgJrV3qRMYLyB2Icv0WYCGFk
zZCTupPgvb46u81VzqxrLH4RZdy4Ar4uB3BQGPKs596rlYmvnSo=
=CArs
-----END PGP SIGNATURE-----
Merge 4.19.36 into android-4.19
Changes in 4.19.36
ARC: u-boot args: check that magic number is correct
arc: hsdk_defconfig: Enable CONFIG_BLK_DEV_RAM
inotify: Fix fsnotify_mark refcount leak in inotify_update_existing_watch()
perf/core: Restore mmap record type correctly
ext4: avoid panic during forced reboot
ext4: add missing brelse() in add_new_gdb_meta_bg()
ext4: report real fs size after failed resize
ALSA: echoaudio: add a check for ioremap_nocache
ALSA: sb8: add a check for request_region
auxdisplay: hd44780: Fix memory leak on ->remove()
drm/udl: use drm_gem_object_put_unlocked.
IB/mlx4: Fix race condition between catas error reset and aliasguid flows
i40iw: Avoid panic when handling the inetdev event
mmc: davinci: remove extraneous __init annotation
ALSA: opl3: fix mismatch between snd_opl3_drum_switch definition and declaration
thermal/intel_powerclamp: fix __percpu declaration of worker_data
thermal: samsung: Fix incorrect check after code merge
thermal: bcm2835: Fix crash in bcm2835_thermal_debugfs
thermal/int340x_thermal: Add additional UUIDs
thermal/int340x_thermal: fix mode setting
thermal/intel_powerclamp: fix truncated kthread name
scsi: iscsi: flush running unbind operations when removing a session
sched/cpufreq: Fix 32-bit math overflow
sched/core: Fix buffer overflow in cgroup2 property cpu.max
x86/mm: Don't leak kernel addresses
tools/power turbostat: return the exit status of a command
perf list: Don't forget to drop the reference to the allocated thread_map
perf config: Fix an error in the config template documentation
perf config: Fix a memory leak in collect_config()
perf build-id: Fix memory leak in print_sdt_events()
perf top: Fix error handling in cmd_top()
perf hist: Add missing map__put() in error case
perf evsel: Free evsel->counts in perf_evsel__exit()
perf tests: Fix a memory leak of cpu_map object in the openat_syscall_event_on_all_cpus test
perf tests: Fix memory leak by expr__find_other() in test__expr()
perf tests: Fix a memory leak in test__perf_evsel__tp_sched_test()
ACPI / utils: Drop reference in test for device presence
PM / Domains: Avoid a potential deadlock
blk-iolatency: #include "blk.h"
drm/exynos/mixer: fix MIXER shadow registry synchronisation code
irqchip/stm32: Don't clear rising/falling config registers at init
irqchip/mbigen: Don't clear eventid when freeing an MSI
x86/hpet: Prevent potential NULL pointer dereference
x86/hyperv: Prevent potential NULL pointer dereference
x86/cpu/cyrix: Use correct macros for Cyrix calls on Geode processors
drm/nouveau/debugfs: Fix check of pm_runtime_get_sync failure
iommu/vt-d: Check capability before disabling protected memory
x86/hw_breakpoints: Make default case in hw_breakpoint_arch_parse() return an error
fix incorrect error code mapping for OBJECTID_NOT_FOUND
x86/gart: Exclude GART aperture from kcore
ext4: prohibit fstrim in norecovery mode
drm/cirrus: Use drm_framebuffer_put to avoid kernel oops in clean-up
gpio: pxa: handle corner case of unprobed device
rsi: improve kernel thread handling to fix kernel panic
f2fs: fix to avoid NULL pointer dereference on se->discard_map
9p: do not trust pdu content for stat item size
9p locks: add mount option for lock retry interval
ASoC: Fix UBSAN warning at snd_soc_get/put_volsw_sx()
f2fs: fix to do sanity check with current segment number
netfilter: xt_cgroup: shrink size of v2 path
serial: uartps: console_setup() can't be placed to init section
powerpc/pseries: Remove prrn_work workqueue
media: au0828: cannot kfree dev before usb disconnect
Bluetooth: Fix debugfs NULL pointer dereference
HID: i2c-hid: override HID descriptors for certain devices
pinctrl: core: make sure strcmp() doesn't get a null parameter
ARM: samsung: Limit SAMSUNG_PM_CHECK config option to non-Exynos platforms
usbip: fix vhci_hcd controller counting
ACPI / SBS: Fix GPE storm on recent MacBookPro's
HID: usbhid: Add quirk for Redragon/Dragonrise Seymur 2
KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail
compiler.h: update definition of unreachable()
netfilter: nf_flow_table: remove flowtable hook flush routine in netns exit routine
f2fs: cleanup dirty pages if recover failed
net: stmmac: Set OWN bit for jumbo frames
cifs: fallback to older infolevels on findfirst queryinfo retry
kernel: hung_task.c: disable on suspend
platform/x86: Add Intel AtomISP2 dummy / power-management driver
drm/ttm: Fix bo_global and mem_global kfree error
ALSA: hda: fix front speakers on Huawei MBXP
ACPI: EC / PM: Disable non-wakeup GPEs for suspend-to-idle
net/rds: fix warn in rds_message_alloc_sgs
xfrm: destroy xfrm_state synchronously on net exit path
crypto: sha256/arm - fix crash bug in Thumb2 build
crypto: sha512/arm - fix crash bug in Thumb2 build
net: ip6_gre: fix possible NULL pointer dereference in ip6erspan_set_version
iommu/dmar: Fix buffer overflow during PCI bus notification
scsi: core: Avoid that system resume triggers a kernel warning
soc/tegra: pmc: Drop locking from tegra_powergate_is_powered()
lkdtm: Print real addresses
lkdtm: Add tests for NULL pointer dereference
drm/panel: panel-innolux: set display off in innolux_panel_unprepare
crypto: axis - fix for recursive locking from bottom half
Revert "ACPI / EC: Remove old CLEAR_ON_RESUME quirk"
coresight: cpu-debug: Support for CA73 CPUs
PCI: Blacklist power management of Gigabyte X299 DESIGNARE EX PCIe ports
drm/nouveau/volt/gf117: fix speedo readout register
ARM: 8839/1: kprobe: make patch_lock a raw_spinlock_t
drm/amdkfd: use init_mqd function to allocate object for hid_mqd (CI)
appletalk: Fix use-after-free in atalk_proc_exit
lib/div64.c: off by one in shift
rxrpc: Fix client call connect/disconnect race
f2fs: fix to dirty inode for i_mode recovery
include/linux/swap.h: use offsetof() instead of custom __swapoffset macro
bpf: fix use after free in bpf_evict_inode
IB/hfi1: Failed to drain send queue when QP is put into error state
mm: hide incomplete nr_indirectly_reclaimable in /proc/zoneinfo
mm: hide incomplete nr_indirectly_reclaimable in sysfs
appletalk: Fix compile regression
Linux 4.19.36
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 373e915cd8e84544609eced57a44fbc084f8d60f ]
This patch avoids that the following warning is reported when building
with W=1:
block/blk-iolatency.c:734:5: warning: no previous prototype for 'blk_iolatency_init' [-Wmissing-prototypes]
Cc: Josef Bacik <jbacik@fb.com>
Fixes: d706751215 ("block: introduce blk-iolatency io controller") # v4.19
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAly2yf8ACgkQONu9yGCS
aT5OkA/+Jf3UWnM8MdPoxVUROYg6WRSMThvuKkcspkvXxXqu7Z9a/+meqdIZ0NDr
j+RdhCfZf7RRNpTpYB9jMqXXuRFhVWniAVM3/me2LxFvwyKkNqURyIz9azh7T0VW
nDa2NDQ9C0IyLFLW4cyzZncrmWntgW5XKZfVot1TMNpryHB4FPsKP5rInDQpSNIu
bbJBWuZziUqdEnk3TWuqdQr4ZTXgnjYSBaTbOQBU3F6E88TRmFIatyYovLcJCQh5
jBmXV3U6mGPZe64I+JXj8dTp32WD74afMX1YSZmjdLL6KgbD9a4k47UzpXzEbKT8
uMG057zPiYcwOyMCFZOWsCYIH7Yr4oMIAzOEyYcDI1uCuKd4aESdpgiWfUWambR8
BJO5oGELzLArMVWusWK7xBfm+Et2ePOFWC7V9MRVVQJLdAWkEafYuLoPqrhhqxQX
Iu3ThoY3UmEoNqi1345gmxQBJfbOqdK7CwQGzKOtiPNs0nyMfg62B9Rcr9O+FE5A
+womRQF2Tik1cZhXxkJVSD4f+orL+xuOu59ufvMTBe4co9tPVFm9Qp6SeWoOVuwZ
l8SS7DD5wb7vzDUZOO4KmA7D6p3YQWdBvGeN6jKqjASIFS9t5Sb+fNiKznTo/WCT
1I68sLm7/rs95+WazKu66ewW8bh91VAo+Gmpfo5zEJjcfajnDxY=
=F1Te
-----END PGP SIGNATURE-----
Merge 4.19.35 into android-4.19
Changes in 4.19.35
kvm: nVMX: NMI-window and interrupt-window exiting should wake L2 from HLT
drm/i915/gvt: do not let pin count of shadow mm go negative
powerpc/tm: Limit TM code inside PPC_TRANSACTIONAL_MEM
hv_netvsc: Fix unwanted wakeup after tx_disable
ibmvnic: Fix completion structure initialization
ip6_tunnel: Match to ARPHRD_TUNNEL6 for dev type
ipv6: Fix dangling pointer when ipv6 fragment
ipv6: sit: reset ip header pointer in ipip6_rcv
kcm: switch order of device registration to fix a crash
net: ethtool: not call vzalloc for zero sized memory request
net-gro: Fix GRO flush when receiving a GSO packet.
net/mlx5: Decrease default mr cache size
netns: provide pure entropy for net_hash_mix()
net: rds: force to destroy connection if t_sock is NULL in rds_tcp_kill_sock().
net/sched: act_sample: fix divide by zero in the traffic path
net/sched: fix ->get helper of the matchall cls
openvswitch: fix flow actions reallocation
qmi_wwan: add Olicard 600
r8169: disable ASPM again
sctp: initialize _pad of sockaddr_in before copying to user memory
tcp: Ensure DCTCP reacts to losses
tcp: fix a potential NULL pointer dereference in tcp_sk_exit
vrf: check accept_source_route on the original netdevice
net/mlx5e: Fix error handling when refreshing TIRs
net/mlx5e: Add a lock on tir list
nfp: validate the return code from dev_queue_xmit()
nfp: disable netpoll on representors
bnxt_en: Improve RX consumer index validity check.
bnxt_en: Reset device on RX buffer errors.
net: ip_gre: fix possible use-after-free in erspan_rcv
net: ip6_gre: fix possible use-after-free in ip6erspan_rcv
net: core: netif_receive_skb_list: unlist skb before passing to pt->func
r8169: disable default rx interrupt coalescing on RTL8168
net: mlx5: Add a missing check on idr_find, free buf
net/mlx5e: Update xoff formula
net/mlx5e: Update xon formula
kbuild: deb-pkg: fix bindeb-pkg breakage when O= is used
kbuild: clang: choose GCC_TOOLCHAIN_DIR not on LD
x86/vdso: Drop implicit common-page-size linker flag
lib/string.c: implement a basic bcmp
Revert "clk: meson: clean-up clock registration"
netfilter: nfnetlink_cttimeout: pass default timeout policy to obj_to_nlattr
netfilter: nfnetlink_cttimeout: fetch timeouts for udplite and gre, too
arm64: kaslr: Reserve size of ARM64_MEMSTART_ALIGN in linear region
tty: mark Siemens R3964 line discipline as BROKEN
tty: ldisc: add sysctl to prevent autoloading of ldiscs
hwmon: (w83773g) Select REGMAP_I2C to fix build error
ACPICA: Clear status of GPEs before enabling them
ACPICA: Namespace: remove address node from global list after method termination
ALSA: seq: Fix OOB-reads from strlcpy
ALSA: hda/realtek: Enable headset MIC of Acer TravelMate B114-21 with ALC233
ALSA: hda/realtek - Add quirk for Tuxedo XC 1509
ALSA: hda - Add two more machines to the power_save_blacklist
mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()
arm64: dts: rockchip: fix rk3328 sdmmc0 write errors
parisc: Detect QEMU earlier in boot process
parisc: regs_return_value() should return gpr28
parisc: also set iaoq_b in instruction_pointer_set()
alarmtimer: Return correct remaining time
drm/i915/gvt: do not deliver a workload if its creation fails
drm/udl: add a release method and delay modeset teardown
kvm: svm: fix potential get_num_contig_pages overflow
include/linux/bitrev.h: fix constant bitrev
mm: writeback: use exact memcg dirty counts
ASoC: intel: Fix crash at suspend/resume after failed codec registration
ASoC: fsl_esai: fix channel swap issue when stream starts
Btrfs: do not allow trimming when a fs is mounted with the nologreplay option
btrfs: prop: fix zstd compression parameter validation
btrfs: prop: fix vanished compression property after failed set
riscv: Fix syscall_get_arguments() and syscall_set_arguments()
block: do not leak memory in bio_copy_user_iov()
block: fix the return errno for direct IO
genirq: Respect IRQCHIP_SKIP_SET_WAKE in irq_chip_set_wake_parent()
genirq: Initialize request_mutex if CONFIG_SPARSE_IRQ=n
virtio: Honour 'may_reduce_num' in vring_create_virtqueue
ARM: dts: rockchip: fix rk3288 cpu opp node reference
ARM: dts: am335x-evmsk: Correct the regulators for the audio codec
ARM: dts: am335x-evm: Correct the regulators for the audio codec
ARM: dts: at91: Fix typo in ISC_D0 on PC9
arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result value
arm64: dts: rockchip: fix rk3328 rgmii high tx error rate
arm64: backtrace: Don't bother trying to unwind the userspace stack
xen: Prevent buffer overflow in privcmd ioctl
sched/fair: Do not re-read ->h_load_next during hierarchical load calculation
xtensa: fix return_address
x86/asm: Remove dead __GNUC__ conditionals
x86/asm: Use stricter assembly constraints in bitops
x86/perf/amd: Resolve race condition when disabling PMC
x86/perf/amd: Resolve NMI latency issues for active PMCs
x86/perf/amd: Remove need to check "running" bit in NMI handler
PCI: Add function 1 DMA alias quirk for Marvell 9170 SATA controller
PCI: pciehp: Ignore Link State Changes after powering off a slot
dm integrity: change memcmp to strncmp in dm_integrity_ctr
dm: revert 8f50e35815 ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE")
dm table: propagate BDI_CAP_STABLE_WRITES to fix sporadic checksum errors
dm integrity: fix deadlock with overlapping I/O
arm64: dts: rockchip: fix vcc_host1_5v pin assign on rk3328-rock64
arm64: dts: rockchip: Fix vcc_host1_5v GPIO polarity on rk3328-rock64
ACPICA: AML interpreter: add region addresses in global list during initialization
KVM: x86: nVMX: close leak of L0's x2APIC MSRs (CVE-2019-3887)
KVM: x86: nVMX: fix x2APIC VTPR read intercept
Linux 4.19.35
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit a3761c3c91209b58b6f33bf69dd8bb8ec0c9d925 upstream.
When bio_add_pc_page() fails in bio_copy_user_iov() we should free
the page we just allocated otherwise we are leaking it.
Cc: linux-block@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlynu40ACgkQONu9yGCS
aT5X6g//Wkfm/+qSZ0GhLDQkPniiH1QkvzhOmVrrxu+KB0qsiwsEl8Srw33ZVkJK
LT8+IPGiG9jEGu9dj+BYXTIfy9ZvfSsEL2N6GhYwDSXP0fok2rUaHbZvv1IB2g4W
afhGdNwNAUCJ/j1UrUsi+SAFJ+xWbVxFpGstd0cqM9IbKdEV7RIukvuKckHiKOKR
qI8FxC+G2PAr+BtnETfk5/suPDJ7B3ZicDoMhiWJGxJ6dfFTVmkSmasSoPDaMiHm
4S3hN2lu+WTeRpRPPB17Dlk4MmIp0k+bGYBKAlaxAMCc/RZxvbT2pRYaMQbId2/L
mNUfSnOQFGEAhlAPfb7wdbObphnyT34GhlkWfZBTrnhPO0/FomLOvU6xVdcNuakX
Tv2JKfDzb+2ttcMZ+0T84Ru9RztoswFATSw8uFMVxW8oTS6MVWnHu96Kxfl7QO3J
PdlIGcyqxSuWNE8OX1QVtdSruGZfwUDNs94S4nQJtkB8BViRwhGJlqaXuy4d9Wp6
fGlI2W6qhjyosi2wBSMTjh/ytk/jq0vfs+z2XjR2gAYssvB/SOLR/AlSVguWsDnf
WaoFBkXvCbuPvPlo0TrLpl5RW5WlOtLXHE3Vr3dKp458wLwpf/OZBGoZiknp7DrF
PzBZs2ie5tmyqTxbAygl7WkbQPJ682pd5R4nf5CY+zvUaOMZv1g=
=Iuup
-----END PGP SIGNATURE-----
Merge 4.19.34 into android-4.19
Changes in 4.19.34
arm64: debug: Don't propagate UNKNOWN FAR into si_code for debug signals
ext4: cleanup bh release code in ext4_ind_remove_space()
tty/serial: atmel: Add is_half_duplex helper
tty/serial: atmel: RS485 HD w/DMA: enable RX after TX is stopped
CIFS: fix POSIX lock leak and invalid ptr deref
h8300: use cc-cross-prefix instead of hardcoding h8300-unknown-linux-
f2fs: fix to adapt small inline xattr space in __find_inline_xattr()
f2fs: fix to avoid deadlock in f2fs_read_inline_dir()
tracing: kdb: Fix ftdump to not sleep
net/mlx5: Avoid panic when setting vport rate
net/mlx5: Avoid panic when setting vport mac, getting vport config
gpio: gpio-omap: fix level interrupt idling
include/linux/relay.h: fix percpu annotation in struct rchan
sysctl: handle overflow for file-max
net: stmmac: Avoid sometimes uninitialized Clang warnings
enic: fix build warning without CONFIG_CPUMASK_OFFSTACK
libbpf: force fixdep compilation at the start of the build
scsi: hisi_sas: Set PHY linkrate when disconnected
scsi: hisi_sas: Fix a timeout race of driver internal and SMP IO
iio: adc: fix warning in Qualcomm PM8xxx HK/XOADC driver
x86/hyperv: Fix kernel panic when kexec on HyperV
perf c2c: Fix c2c report for empty numa node
mm/sparse: fix a bad comparison
mm/cma.c: cma_declare_contiguous: correct err handling
mm/page_ext.c: fix an imbalance with kmemleak
mm, swap: bounds check swap_info array accesses to avoid NULL derefs
mm,oom: don't kill global init via memory.oom.group
memcg: killed threads should not invoke memcg OOM killer
mm, mempolicy: fix uninit memory access
mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512!
mm/slab.c: kmemleak no scan alien caches
ocfs2: fix a panic problem caused by o2cb_ctl
f2fs: do not use mutex lock in atomic context
fs/file.c: initialize init_files.resize_wait
page_poison: play nicely with KASAN
cifs: use correct format characters
dm thin: add sanity checks to thin-pool and external snapshot creation
f2fs: fix to check inline_xattr_size boundary correctly
cifs: Accept validate negotiate if server return NT_STATUS_NOT_SUPPORTED
cifs: Fix NULL pointer dereference of devname
netfilter: nf_tables: check the result of dereferencing base_chain->stats
netfilter: conntrack: tcp: only close if RST matches exact sequence
jbd2: fix invalid descriptor block checksum
fs: fix guard_bio_eod to check for real EOD errors
tools lib traceevent: Fix buffer overflow in arg_eval
PCI/PME: Fix hotplug/sysfs remove deadlock in pcie_pme_remove()
wil6210: check null pointer in _wil_cfg80211_merge_extra_ies
mt76: fix a leaked reference by adding a missing of_node_put
crypto: crypto4xx - add missing of_node_put after of_device_is_available
crypto: cavium/zip - fix collision with generic cra_driver_name
usb: chipidea: Grab the (legacy) USB PHY by phandle first
powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tables
scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c
kbuild: invoke syncconfig if include/config/auto.conf.cmd is missing
powerpc/xmon: Fix opcode being uninitialized in print_insn_powerpc
coresight: etm4x: Add support to enable ETMv4.2
serial: 8250_pxa: honor the port number from devicetree
ARM: 8840/1: use a raw_spinlock_t in unwind
iommu/io-pgtable-arm-v7s: Only kmemleak_ignore L2 tables
powerpc/hugetlb: Handle mmap_min_addr correctly in get_unmapped_area callback
btrfs: qgroup: Make qgroup async transaction commit more aggressive
mmc: omap: fix the maximum timeout setting
net: dsa: mv88e6xxx: Add lockdep classes to fix false positive splat
e1000e: Fix -Wformat-truncation warnings
mlxsw: spectrum: Avoid -Wformat-truncation warnings
platform/x86: ideapad-laptop: Fix no_hw_rfkill_list for Lenovo RESCUER R720-15IKBN
platform/mellanox: mlxreg-hotplug: Fix KASAN warning
loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
IB/mlx4: Increase the timeout for CM cache
clk: fractional-divider: check parent rate only if flag is set
perf annotate: Fix getting source line failure
ASoC: qcom: Fix of-node refcount unbalance in qcom_snd_parse_of()
cpufreq: acpi-cpufreq: Report if CPU doesn't support boost technologies
efi: cper: Fix possible out-of-bounds access
s390/ism: ignore some errors during deregistration
scsi: megaraid_sas: return error when create DMA pool failed
scsi: fcoe: make use of fip_mode enum complete
drm/amd/display: Clear stream->mode_changed after commit
perf test: Fix failure of 'evsel-tp-sched' test on s390
mwifiex: don't advertise IBSS features without FW support
perf report: Don't shadow inlined symbol with different addr range
SoC: imx-sgtl5000: add missing put_device()
media: ov7740: fix runtime pm initialization
media: sh_veu: Correct return type for mem2mem buffer helpers
media: s5p-jpeg: Correct return type for mem2mem buffer helpers
media: rockchip/rga: Correct return type for mem2mem buffer helpers
media: s5p-g2d: Correct return type for mem2mem buffer helpers
media: mx2_emmaprp: Correct return type for mem2mem buffer helpers
media: mtk-jpeg: Correct return type for mem2mem buffer helpers
mt76: usb: do not run mt76u_queues_deinit twice
xen/gntdev: Do not destroy context while dma-bufs are in use
vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1
HID: intel-ish-hid: avoid binding wrong ishtp_cl_device
cgroup, rstat: Don't flush subtree root unless necessary
jbd2: fix race when writing superblock
leds: lp55xx: fix null deref on firmware load failure
perf report: Add s390 diagnosic sampling descriptor size
iwlwifi: pcie: fix emergency path
ACPI / video: Refactor and fix dmi_is_desktop()
selftests: skip seccomp get_metadata test if not real root
kprobes: Prohibit probing on bsearch()
kprobes: Prohibit probing on RCU debug routine
netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm
ARM: 8833/1: Ensure that NEON code always compiles with Clang
ARM: dts: meson8b: fix the Ethernet data line signals in eth_rgmii_pins
ALSA: PCM: check if ops are defined before suspending PCM
ath10k: fix shadow register implementation for WCN3990
usb: f_fs: Avoid crash due to out-of-scope stack ptr access
sched/topology: Fix percpu data types in struct sd_data & struct s_data
bcache: fix input overflow to cache set sysfs file io_error_halflife
bcache: fix input overflow to sequential_cutoff
bcache: fix potential div-zero error of writeback_rate_i_term_inverse
bcache: improve sysfs_strtoul_clamp()
genirq: Avoid summation loops for /proc/stat
net: marvell: mvpp2: fix stuck in-band SGMII negotiation
iw_cxgb4: fix srqidx leak during connection abort
net: phy: consider latched link-down status in polling mode
fbdev: fbmem: fix memory access if logo is bigger than the screen
cdrom: Fix race condition in cdrom_sysctl_register
drm: rcar-du: add missing of_node_put
drm/amd/display: Don't re-program planes for DPMS changes
drm/amd/display: Disconnect mpcc when changing tg
perf/aux: Make perf_event accessible to setup_aux()
e1000e: fix cyclic resets at link up with active tx
e1000e: Exclude device from suspend direct complete optimization
platform/x86: intel_pmc_core: Fix PCH IP sts reading
i2c: of: Try to find an I2C adapter matching the parent
staging: spi: mt7621: Add return code check on device_reset()
iwlwifi: mvm: fix RFH config command with >=10 CPUs
ASoC: fsl-asoc-card: fix object reference leaks in fsl_asoc_card_probe
sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
efi/memattr: Don't bail on zero VA if it equals the region's PA
sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
drm/vkms: Bugfix extra vblank frame
ARM: dts: lpc32xx: Remove leading 0x and 0s from bindings notation
efi/arm/arm64: Allow SetVirtualAddressMap() to be omitted
soc: qcom: gsbi: Fix error handling in gsbi_probe()
mt7601u: bump supported EEPROM version
ARM: 8830/1: NOMMU: Toggle only bits in EXC_RETURN we are really care of
ARM: avoid Cortex-A9 livelock on tight dmb loops
block, bfq: fix in-service-queue check for queue merging
bpf: fix missing prototype warnings
selftests/bpf: skip verifier tests for unsupported program types
powerpc/64s: Clear on-stack exception marker upon exception return
cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting
backlight: pwm_bl: Use gpiod_get_value_cansleep() to get initial state
tty: increase the default flip buffer limit to 2*640K
powerpc/pseries: Perform full re-add of CPU for topology update post-migration
drm/amd/display: Enable vblank interrupt during CRC capture
ALSA: dice: add support for Solid State Logic Duende Classic/Mini
usb: dwc3: gadget: Fix OTG events when gadget driver isn't loaded
platform/x86: intel-hid: Missing power button release on some Dell models
perf script python: Use PyBytes for attr in trace-event-python
perf script python: Add trace_context extension module to sys.modules
media: mt9m111: set initial frame size other than 0x0
hwrng: virtio - Avoid repeated init of completion
soc/tegra: fuse: Fix illegal free of IO base address
HID: intel-ish: ipc: handle PIMR before ish_wakeup also clear PISR busy_clear bit
f2fs: UBSAN: set boolean value iostat_enable correctly
hpet: Fix missing '=' character in the __setup() code of hpet_mmap_enable
cpu/hotplug: Mute hotplug lockdep during init
dmaengine: imx-dma: fix warning comparison of distinct pointer types
dmaengine: qcom_hidma: assign channel cookie correctly
dmaengine: qcom_hidma: initialize tx flags in hidma_prep_dma_*
netfilter: physdev: relax br_netfilter dependency
media: rcar-vin: Allow independent VIN link enablement
media: s5p-jpeg: Check for fmt_ver_flag when doing fmt enumeration
regulator: act8865: Fix act8600_sudcdc_voltage_ranges setting
pinctrl: meson: meson8b: add the eth_rxd2 and eth_rxd3 pins
drm: Auto-set allow_fb_modifiers when given modifiers at plane init
drm/nouveau: Stop using drm_crtc_force_disable
x86/build: Specify elf_i386 linker emulation explicitly for i386 objects
selinux: do not override context on context mounts
brcmfmac: Use firmware_request_nowarn for the clm_blob
wlcore: Fix memory leak in case wl12xx_fetch_firmware failure
x86/build: Mark per-CPU symbols as absolute explicitly for LLD
drm/fb-helper: fix leaks in error path of drm_fb_helper_fbdev_setup
clk: meson: clean-up clock registration
clk: rockchip: fix frac settings of GPLL clock for rk3328
dmaengine: tegra: avoid overflow of byte tracking
Input: soc_button_array - fix mapping of the 5th GPIO in a PNP0C40 device
drm/dp/mst: Configure no_stop_bit correctly for remote i2c xfers
net: stmmac: Avoid one more sometimes uninitialized Clang warning
ACPI / video: Extend chassis-type detection with a "Lunch Box" check
bcache: fix potential div-zero error of writeback_rate_p_term_inverse
kprobes/x86: Blacklist non-attachable interrupt functions
Linux 4.19.34
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 058fdecc6de7cdecbf4c59b851e80eb2d6c5295f ]
When a new I/O request arrives for a bfq_queue, say Q, bfq checks
whether that request is close to
(a) the head request of some other queue waiting to be served, or
(b) the last request dispatched for the in-service queue (in case Q
itself is not the in-service queue)
If a queue, say Q2, is found for which the above condition holds, then
bfq merges Q and Q2, to hopefully get a more sequential I/O in the
resulting merged queue, and thus a possibly higher throughput.
Case (b) is checked by comparing the new request for Q with the last
request dispatched, assuming that the latter necessarily belonged to the
in-service queue. Unfortunately, this assumption is no longer always
correct, since commit d0edc2473be9 ("block, bfq: inject other-queue I/O
into seeky idle queues on NCQ flash").
When the assumption does not hold, queues that must not be merged may be
merged, causing unexpected loss of control on per-queue service
guarantees.
This commit solves this problem by adding an extra field, which stores
the actual last request dispatched for the in-service queue, and by
using this new field to correctly check case (b).
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlyWhJcACgkQONu9yGCS
aT6XzxAAzP2QGzC4SVPgcFH1woF/d8Cz0zQ81mLXzjXtEPm39fZCM2hbBnxkXLu1
peFyrKNk6/c9541D9gsQCQT6Fu+H6u1bJKcIezlKJ2xyB/MsU1hXkjZrTJYW3RRs
gimy1EGdood2el1ubEBZiaspazoeRzBqtg1Nsmr4V0l+RT8HwtKKw+0+Nxixfp59
NoVkqTpPI5mL0FiH2R9ogcfg3SvgMZOsOhOBjdPvSjiJJsbvIWcW48MCs95XSUpF
R+l/fWn+oiFCcIqBaFheujuqZMvVrUHZHaWAPMuoR/c3Cdf0lTBokdv6UM9c0nv3
61jX5r5ImRI/dfQANN5mbB1YKcs5xOI+I7QZHQ2q4clsWrWyLapXW4clrAZJ6z5t
UVeVbuLV2y5PL9GJyBcXpyY0BOf4e2gZURaPY3C5McNwgybNoiR0ZePqKb8ZhZyh
jYOYRoBjJJpZoVTSt6MNX95NTvGaSAtqKMu1s3IeMfpwCfQKBPMOuBHr/dUqSC6I
U0xxjk/71C15dSPVcTVJT/lmcKc6TXgoagnfbn8GBtDOAjBNsYyUJLQI+db1ERCe
9MEB9k1Z87ROQ5jQCQmWsewOVAtFZBEvSszFmpKv3zTe8M2oFpXG56zckdiumwHU
nSfeZTTeWzsFJd30MioEnGYm3ZwKwZx7wi0x4B4WWvBfSpp20Us=
=xtLx
-----END PGP SIGNATURE-----
Merge 4.19.31 into android-4.19
Changes in 4.19.31
media: videobuf2-v4l2: drop WARN_ON in vb2_warn_zero_bytesused()
9p: use inode->i_lock to protect i_size_write() under 32-bit
9p/net: fix memory leak in p9_client_create
ASoC: fsl_esai: fix register setting issue in RIGHT_J mode
ASoC: codecs: pcm186x: fix wrong usage of DECLARE_TLV_DB_SCALE()
ASoC: codecs: pcm186x: Fix energysense SLEEP bit
iio: adc: exynos-adc: Fix NULL pointer exception on unbind
mei: hbm: clean the feature flags on link reset
mei: bus: move hw module get/put to probe/release
stm class: Fix an endless loop in channel allocation
crypto: caam - fix hash context DMA unmap size
crypto: ccree - fix missing break in switch statement
crypto: caam - fixed handling of sg list
crypto: caam - fix DMA mapping of stack memory
crypto: ccree - fix free of unallocated mlli buffer
crypto: ccree - unmap buffer before copying IV
crypto: ccree - don't copy zero size ciphertext
crypto: cfb - add missing 'chunksize' property
crypto: cfb - remove bogus memcpy() with src == dest
crypto: ahash - fix another early termination in hash walk
crypto: rockchip - fix scatterlist nents error
crypto: rockchip - update new iv to device in multiple operations
drm/imx: ignore plane updates on disabled crtcs
gpu: ipu-v3: Fix i.MX51 CSI control registers offset
drm/imx: imx-ldb: add missing of_node_puts
gpu: ipu-v3: Fix CSI offsets for imx53
ASoC: rt5682: Correct the setting while select ASRC clk for AD/DA filter
clocksource: timer-ti-dm: Fix pwm dmtimer usage of fck reparenting
KVM: arm/arm64: vgic: Make vgic_dist->lpi_list_lock a raw_spinlock
arm64: dts: rockchip: fix graph_port warning on rk3399 bob kevin and excavator
s390/dasd: fix using offset into zero size array error
Input: pwm-vibra - prevent unbalanced regulator
Input: pwm-vibra - stop regulator after disabling pwm, not before
ARM: dts: Configure clock parent for pwm vibra
ARM: OMAP2+: Variable "reg" in function omap4_dsi_mux_pads() could be uninitialized
ASoC: dapm: fix out-of-bounds accesses to DAPM lookup tables
ASoC: rsnd: fixup rsnd_ssi_master_clk_start() user count check
KVM: arm/arm64: Reset the VCPU without preemption and vcpu state loaded
arm/arm64: KVM: Allow a VCPU to fully reset itself
arm/arm64: KVM: Don't panic on failure to properly reset system registers
KVM: arm/arm64: vgic: Always initialize the group of private IRQs
KVM: arm64: Forbid kprobing of the VHE world-switch code
ASoC: samsung: Prevent clk_get_rate() calls in atomic context
ARM: OMAP2+: fix lack of timer interrupts on CPU1 after hotplug
Input: cap11xx - switch to using set_brightness_blocking()
Input: ps2-gpio - flush TX work when closing port
Input: matrix_keypad - use flush_delayed_work()
mac80211: call drv_ibss_join() on restart
mac80211: Fix Tx aggregation session tear down with ITXQs
netfilter: compat: initialize all fields in xt_init
blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue
ipvs: fix dependency on nf_defrag_ipv6
floppy: check_events callback should not return a negative number
xprtrdma: Make sure Send CQ is allocated on an existing compvec
NFS: Don't use page_file_mapping after removing the page
mm/gup: fix gup_pmd_range() for dax
Revert "mm: use early_pfn_to_nid in page_ext_init"
scsi: qla2xxx: Fix panic from use after free in qla2x00_async_tm_cmd
net: dsa: bcm_sf2: potential array overflow in bcm_sf2_sw_suspend()
x86/CPU: Add Icelake model number
mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
net: hns: Fix object reference leaks in hns_dsaf_roce_reset()
i2c: cadence: Fix the hold bit setting
i2c: bcm2835: Clear current buffer pointers and counts after a transfer
auxdisplay: ht16k33: fix potential user-after-free on module unload
Input: st-keyscan - fix potential zalloc NULL dereference
clk: sunxi-ng: v3s: Fix TCON reset de-assert bit
kallsyms: Handle too long symbols in kallsyms.c
clk: sunxi: A31: Fix wrong AHB gate number
esp: Skip TX bytes accounting when sending from a request socket
ARM: 8824/1: fix a migrating irq bug when hotplug cpu
bpf: only adjust gso_size on bytestream protocols
bpf: fix lockdep false positive in stackmap
af_key: unconditionally clone on broadcast
ARM: 8835/1: dma-mapping: Clear DMA ops on teardown
assoc_array: Fix shortcut creation
keys: Fix dependency loop between construction record and auth key
scsi: libiscsi: Fix race between iscsi_xmit_task and iscsi_complete_task
net: systemport: Fix reception of BPDUs
net: dsa: bcm_sf2: Do not assume DSA master supports WoL
pinctrl: meson: meson8b: fix the sdxc_a data 1..3 pins
qmi_wwan: apply SET_DTR quirk to Sierra WP7607
net: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe()
xfrm: Fix inbound traffic via XFRM interfaces across network namespaces
mailbox: bcm-flexrm-mailbox: Fix FlexRM ring flush timeout issue
ASoC: topology: free created components in tplg load error
qed: Fix iWARP buffer size provided for syn packet processing.
qed: Fix iWARP syn packet mac address validation.
ARM: dts: armada-xp: fix Armada XP boards NAND description
arm64: Relax GIC version check during early boot
ARM: tegra: Restore DT ABI on Tegra124 Chromebooks
net: marvell: mvneta: fix DMA debug warning
mm: handle lru_add_drain_all for UP properly
tmpfs: fix link accounting when a tmpfile is linked in
ixgbe: fix older devices that do not support IXGBE_MRQC_L3L4TXSWEN
ARCv2: lib: memcpy: fix doing prefetchw outside of buffer
ARC: uacces: remove lp_start, lp_end from clobber list
ARCv2: support manual regfile save on interrupts
ARCv2: don't assume core 0x54 has dual issue
phonet: fix building with clang
mac80211_hwsim: propagate genlmsg_reply return code
bpf, lpm: fix lookup bug in map_delete_elem
net: thunderx: make CFG_DONE message to run through generic send-ack sequence
net: thunderx: add nicvf_send_msg_to_pf result check for set_rx_mode_task
nfp: bpf: fix code-gen bug on BPF_ALU | BPF_XOR | BPF_K
nfp: bpf: fix ALU32 high bits clearance bug
bnxt_en: Fix typo in firmware message timeout logic.
bnxt_en: Wait longer for the firmware message response to complete.
net: set static variable an initial value in atl2_probe()
selftests: fib_tests: sleep after changing carrier. again.
tmpfs: fix uninitialized return value in shmem_link
stm class: Prevent division by zero
nfit: acpi_nfit_ctl(): Check out_obj->type in the right place
acpi/nfit: Fix bus command validation
nfit/ars: Attempt a short-ARS whenever the ARS state is idle at boot
nfit/ars: Attempt short-ARS even in the no_init_ars case
libnvdimm/label: Clear 'updating' flag after label-set update
libnvdimm, pfn: Fix over-trim in trim_pfn_device()
libnvdimm/pmem: Honor force_raw for legacy pmem regions
libnvdimm: Fix altmap reservation size calculation
fix cgroup_do_mount() handling of failure exits
crypto: aead - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
crypto: aegis - fix handling chunked inputs
crypto: arm/crct10dif - revert to C code for short inputs
crypto: arm64/aes-neonbs - fix returning final keystream block
crypto: arm64/crct10dif - revert to C code for short inputs
crypto: hash - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
crypto: morus - fix handling chunked inputs
crypto: pcbc - remove bogus memcpy()s with src == dest
crypto: skcipher - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
crypto: testmgr - skip crc32c context test for ahash algorithms
crypto: x86/aegis - fix handling chunked inputs and MAY_SLEEP
crypto: x86/aesni-gcm - fix crash on empty plaintext
crypto: x86/morus - fix handling chunked inputs and MAY_SLEEP
crypto: arm64/aes-ccm - fix logical bug in AAD MAC handling
crypto: arm64/aes-ccm - fix bugs in non-NEON fallback routine
CIFS: Do not reset lease state to NONE on lease break
CIFS: Do not skip SMB2 message IDs on send failures
CIFS: Fix read after write for files with read caching
tracing: Use strncpy instead of memcpy for string keys in hist triggers
tracing: Do not free iter->trace in fail path of tracing_open_pipe()
tracing/perf: Use strndup_user() instead of buggy open-coded version
xen: fix dom0 boot on huge systems
ACPI / device_sysfs: Avoid OF modalias creation for removed device
mmc: sdhci-esdhc-imx: fix HS400 timing issue
mmc:fix a bug when max_discard is 0
netfilter: ipt_CLUSTERIP: fix warning unused variable cn
spi: ti-qspi: Fix mmap read when more than one CS in use
spi: pxa2xx: Setup maximum supported DMA transfer length
regulator: s2mps11: Fix steps for buck7, buck8 and LDO35
regulator: max77620: Initialize values for DT properties
regulator: s2mpa01: Fix step values for some LDOs
clocksource/drivers/exynos_mct: Move one-shot check from tick clear to ISR
clocksource/drivers/exynos_mct: Clear timer interrupt when shutdown
clocksource/drivers/arch_timer: Workaround for Allwinner A64 timer instability
s390/setup: fix early warning messages
s390/virtio: handle find on invalid queue gracefully
scsi: virtio_scsi: don't send sc payload with tmfs
scsi: aacraid: Fix performance issue on logical drives
scsi: sd: Optimal I/O size should be a multiple of physical block size
scsi: target/iscsi: Avoid iscsit_release_commands_from_conn() deadlock
scsi: qla2xxx: Fix LUN discovery if loop id is not assigned yet by firmware
fs/devpts: always delete dcache dentry-s in dput()
splice: don't merge into linked buffers
ovl: During copy up, first copy up data and then xattrs
ovl: Do not lose security.capability xattr over metadata file copy-up
m68k: Add -ffreestanding to CFLAGS
Btrfs: setup a nofs context for memory allocation at btrfs_create_tree()
Btrfs: setup a nofs context for memory allocation at __btrfs_set_acl
btrfs: ensure that a DUP or RAID1 block group has exactly two stripes
Btrfs: fix corruption reading shared and compressed extents after hole punching
soc: qcom: rpmh: Avoid accessing freed memory from batch API
libertas_tf: don't set URB_ZERO_PACKET on IN USB transfer
irqchip/gic-v3-its: Avoid parsing _indirect_ twice for Device table
irqchip/brcmstb-l2: Use _irqsave locking variants in non-interrupt code
x86/kprobes: Prohibit probing on optprobe template code
cpufreq: kryo: Release OPP tables on module removal
cpufreq: tegra124: add missing of_node_put()
cpufreq: pxa2xx: remove incorrect __init annotation
ext4: fix check of inode in swap_inode_boot_loader
ext4: cleanup pagecache before swap i_data
ext4: update quota information while swapping boot loader inode
ext4: add mask of ext4 flags to swap
ext4: fix crash during online resizing
PCI/ASPM: Use LTR if already enabled by platform
PCI/DPC: Fix print AER status in DPC event handling
PCI: dwc: skip MSI init if MSIs have been explicitly disabled
IB/hfi1: Close race condition on user context disable and close
cxl: Wrap iterations over afu slices inside 'afu_list_lock'
ext2: Fix underflow in ext2_max_size()
clk: uniphier: Fix update register for CPU-gear
clk: clk-twl6040: Fix imprecise external abort for pdmclk
clk: samsung: exynos5: Fix possible NULL pointer exception on platform_device_alloc() failure
clk: samsung: exynos5: Fix kfree() of const memory on setting driver_override
clk: ingenic: Fix round_rate misbehaving with non-integer dividers
clk: ingenic: Fix doc of ingenic_cgu_div_info
usb: chipidea: tegra: Fix missed ci_hdrc_remove_device()
usb: typec: tps6598x: handle block writes separately with plain-I2C adapters
dmaengine: usb-dmac: Make DMAC system sleep callbacks explicit
mm: hwpoison: fix thp split handing in soft_offline_in_use_page()
mm/vmalloc: fix size check for remap_vmalloc_range_partial()
mm/memory.c: do_fault: avoid usage of stale vm_area_struct
kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv
device property: Fix the length used in PROPERTY_ENTRY_STRING()
intel_th: Don't reference unassigned outputs
parport_pc: fix find_superio io compare code, should use equal test.
i2c: tegra: fix maximum transfer size
media: i2c: ov5640: Fix post-reset delay
gpio: pca953x: Fix dereference of irq data in shutdown
can: flexcan: FLEXCAN_IFLAG_MB: add () around macro argument
drm/i915: Relax mmap VMA check
bpf: only test gso type on gso packets
serial: uartps: Fix stuck ISR if RX disabled with non-empty FIFO
serial: 8250_of: assume reg-shift of 2 for mrvl,mmp-uart
serial: 8250_pci: Fix number of ports for ACCES serial cards
serial: 8250_pci: Have ACCES cards that use the four port Pericom PI7C9X7954 chip use the pci_pericom_setup()
jbd2: clear dirty flag when revoking a buffer from an older transaction
jbd2: fix compile warning when using JBUFFER_TRACE
selinux: add the missing walk_size + len check in selinux_sctp_bind_connect
security/selinux: fix SECURITY_LSM_NATIVE_LABELS on reused superblock
powerpc/32: Clear on-stack exception marker upon exception return
powerpc/wii: properly disable use of BATs when requested.
powerpc/powernv: Make opal log only readable by root
powerpc/83xx: Also save/restore SPRG4-7 during suspend
powerpc/powernv: Don't reprogram SLW image on every KVM guest entry/exit
powerpc: Fix 32-bit KVM-PR lockup and host crash with MacOS guest
powerpc/ptrace: Simplify vr_get/set() to avoid GCC warning
powerpc/hugetlb: Don't do runtime allocation of 16G pages in LPAR configuration
powerpc/traps: fix recoverability of machine check handling on book3s/32
powerpc/traps: Fix the message printed when stack overflows
ARM: s3c24xx: Fix boolean expressions in osiris_dvs_notify
arm64: Fix HCR.TGE status for NMI contexts
arm64: debug: Ensure debug handlers check triggering exception level
arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
ipmi_si: fix use-after-free of resource->name
dm: fix to_sector() for 32bit
dm integrity: limit the rate of error messages
mfd: sm501: Fix potential NULL pointer dereference
cpcap-charger: generate events for userspace
NFS: Fix I/O request leakages
NFS: Fix an I/O request leakage in nfs_do_recoalesce
NFS: Don't recoalesce on error in nfs_pageio_complete_mirror()
nfsd: fix performance-limiting session calculation
nfsd: fix memory corruption caused by readdir
nfsd: fix wrong check in write_v4_end_grace()
NFSv4.1: Reinitialise sequence results before retransmitting a request
svcrpc: fix UDP on servers with lots of threads
PM / wakeup: Rework wakeup source timer cancellation
bcache: never writeback a discard operation
stable-kernel-rules.rst: add link to networking patch queue
vt: perform safe console erase in the right order
x86/unwind/orc: Fix ORC unwind table alignment
perf intel-pt: Fix CYC timestamp calculation after OVF
perf tools: Fix split_kallsyms_for_kcore() for trampoline symbols
perf auxtrace: Define auxtrace record alignment
perf intel-pt: Fix overlap calculation for padding
perf/x86/intel/uncore: Fix client IMC events return huge result
perf intel-pt: Fix divide by zero when TSC is not available
md: Fix failed allocation of md_register_thread
tpm/tpm_crb: Avoid unaligned reads in crb_recv()
tpm: Unify the send callback behaviour
rcu: Do RCU GP kthread self-wakeup from softirq and interrupt
media: imx: prpencvf: Stop upstream before disabling IDMA channel
media: lgdt330x: fix lock status reporting
media: uvcvideo: Avoid NULL pointer dereference at the end of streaming
media: vimc: Add vimc-streamer for stream control
media: imx: csi: Disable CSI immediately after last EOF
media: imx: csi: Stop upstream before disabling IDMA channel
drm/fb-helper: generic: Fix drm_fbdev_client_restore()
drm/radeon/evergreen_cs: fix missing break in switch statement
drm/amd/powerplay: correct power reading on fiji
drm/amd/display: don't call dm_pp_ function from an fpu block
KVM: Call kvm_arch_memslots_updated() before updating memslots
KVM: x86/mmu: Detect MMIO generation wrap in any address space
KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux
KVM: nVMX: Sign extend displacements of VMX instr's mem operands
KVM: nVMX: Apply addr size mask to effective address for VMX instructions
KVM: nVMX: Ignore limit checks on VMX instructions using flat segments
bcache: use (REQ_META|REQ_PRIO) to indicate bio for metadata
s390/setup: fix boot crash for machine without EDAT-1
Linux 4.19.31
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit aef1897cd36dcf5e296f1d2bae7e0d268561b685 ]
When requeue, if RQF_DONTPREP, rq has contained some driver
specific data, so insert it to hctx dispatch list to avoid any
merge. Take scsi as example, here is the trace event log (no
io scheduler, because RQF_STARTED would prevent merging),
kworker/0:1H-339 [000] ...1 2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H]
scsi_inert_test-1987 [000] .... 2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test]
scsi_inert_test-1987 [000] ...2 2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test]
kworker/0:1H-339 [000] .... 2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H]
scsi_inert_test-1996 [000] ..s1 2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0]
scsi_inert_test-1996 [000] .Ns1 2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0]
kworker/0:1H-339 [000] ...1 2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
kworker/0:1H-339 [000] ...1 2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
scsi_inert_test-1986 [000] ..s1 2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0]
(32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP.
Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP,
the sdb only contained the part of (32768 + 8), then only that part
was completed. The lucky thing was that scsi_io_completion detected
it and requeued the remaining part. So we didn't get corrupted data.
However, the requeue of (32776 + 8) is not expected.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
There are several definitions of those functions/macros in places that
mess with fixed-point load averages. Provide an official version.
[akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8508cf3ffad4defa202b303e5b6379efc4cd9054)
Conflicts:
block/blk-iolatency.c
(1. manual merge to replace stat->rqs.mean with stat.mean)
Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I716b4874491cff75a2355c6d95c64cf02d05e7ee
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
[ Upstream commit 8c772a9bfc7c07c76f4a58b58910452fbb20843b ]
Our test reported the following stack, and vmcore showed that
->inflight counter is -1.
[ffffc9003fcc38d0] __schedule at ffffffff8173d95d
[ffffc9003fcc3958] schedule at ffffffff8173de26
[ffffc9003fcc3970] io_schedule at ffffffff810bb6b6
[ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb
[ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3
[ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a
[ffffc9003fcc3b08] generic_make_request at ffffffff81368b49
[ffffc9003fcc3b68] submit_bio at ffffffff81368d7d
[ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4]
[ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4]
[ffffc9003fcc3d68] do_writepages at ffffffff811c49ae
[ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188
[ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301
[ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4]
[ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b
[ffffc9003fcc3ee8] do_fsync at ffffffff81285abd
[ffffc9003fcc3f18] sys_fsync at ffffffff81285d50
[ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04
[ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e
The ->inflight counter may be negative (-1) if
1) blk-iolatency was disabled when the IO was issued,
2) blk-iolatency was enabled before this IO reached its endio,
3) the ->inflight counter is decreased from 0 to -1 in endio()
In fact the hang can be easily reproduced by the below script,
H=/sys/fs/cgroup/unified/
P=/sys/fs/cgroup/unified/test
echo "+io" > $H/cgroup.subtree_control
mkdir -p $P
echo $$ > $P/cgroup.procs
xfs_io -f -d -c "pwrite 0 4k" /dev/sdg
echo "`cat /sys/block/sdg/dev` target=1000000" > $P/io.latency
xfs_io -f -d -c "pwrite 0 4k" /dev/sdg
This fixes the problem by freezing the queue so that while
enabling/disabling iolatency, there is no inflight rq running.
Note that quiesce_queue is not needed as this only updating iolatency
configuration about which dispatching request_queue doesn't care.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 85bd6e61f34dffa8ec2dc75ff3c02ee7b2f1cbce ]
Florian reported a io hung issue when fsync(). It should be
triggered by following race condition.
data + post flush a flush
blk_flush_complete_seq
case REQ_FSEQ_DATA
blk_flush_queue_rq
issued to driver blk_mq_dispatch_rq_list
try to issue a flush req
failed due to NON-NCQ command
.queue_rq return BLK_STS_DEV_RESOURCE
request completion
req->end_io // doesn't check RESTART
mq_flush_data_end_io
case REQ_FSEQ_POSTFLUSH
blk_kick_flush
do nothing because previous flush
has not been completed
blk_mq_run_hw_queue
insert rq to hctx->dispatch
due to RESTART is still set, do nothing
To fix this, replace the blk_mq_run_hw_queue in mq_flush_data_end_io
with blk_mq_sched_restart to check and clear the RESTART flag.
Fixes: bd166ef1 (blk-mq-sched: add framework for MQ capable IO schedulers)
Reported-by: Florian Stecker <m19@florianstecker.de>
Tested-by: Florian Stecker <m19@florianstecker.de>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 7211aef86f79583e59b88a0aba0bc830566f7e8e upstream.
For a zoned block device using mq-deadline, if a write request for a
zone is received while another write was already dispatched for the same
zone, dd_dispatch_request() will return NULL and the newly inserted
write request is kept in the scheduler queue waiting for the ongoing
zone write to complete. With this behavior, when no other request has
been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty
and blk_mq_sched_mark_restart_hctx() not called. This in turn leads to
__blk_mq_free_request() call of blk_mq_sched_restart() to not run the
queue when the already dispatched write request completes. The newly
dispatched request stays stuck in the scheduler queue until eventually
another request is submitted.
This problem does not affect SCSI disk as the SCSI stack handles queue
restart on request completion. However, this problem is can be triggered
the nullblk driver with zoned mode enabled.
Fix this by always requesting a queue restart in dd_dispatch_request()
if no request was dispatched while WRITE requests are queued.
Fixes: 5700f69178 ("mq-deadline: Introduce zone locking support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add missing export of blk_mq_sched_restart()
Signed-off-by: Jens Axboe <axboe@kernel.dk>
commit f55adad601c6a97c8c9628195453e0fb23b4a0ae upstream.
We don't need to zero fill the bio if not using kernel allocated pages.
Fixes: f3587d76da05 ("block: Clear kernel memory before copying to user") # v4.20-rc2
Reported-by: Todd Aiken <taiken@mvtech.ca>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: stable@vger.kernel.org
Cc: Bart Van Assche <bvanassche@acm.org>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c616cbee97aed4bc6178f148a7240206dcdb85a6 upstream.
After the direct dispatch corruption fix, we permanently disallow direct
dispatch of non read/write requests. This works fine off the normal IO
path, as they will be retried like any other failed direct dispatch
request. But for the blk_insert_cloned_request() that only DM uses to
bypass the bottom level scheduler, we always first attempt direct
dispatch. For some types of requests, that's now a permanent failure,
and no amount of retrying will make that succeed. This results in a
livelock.
Instead of making special cases for what we can direct issue, and now
having to deal with DM solving the livelock while still retaining a BUSY
condition feedback loop, always just add a request that has been through
->queue_rq() to the hardware queue dispatch list. These are safe to use
as no merging can take place there. Additionally, if requests do have
prepped data from drivers, we aren't dependent on them not sharing space
in the request structure to safely add them to the IO scheduler lists.
This basically reverts ffe81d45322c and is based on a patch from Ming,
but with the list insert case covered as well.
Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
Cc: stable@vger.kernel.org
Suggested-by: Ming Lei <ming.lei@redhat.com>
Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ffe81d45322cc3cb140f0db080a4727ea284661e upstream.
If we attempt a direct issue to a SCSI device, and it returns BUSY, then
we queue the request up normally. However, the SCSI layer may have
already setup SG tables etc for this particular command. If we later
merge with this request, then the old tables are no longer valid. Once
we issue the IO, we only read/write the original part of the request,
not the new state of it.
This causes data corruption, and is most often noticed with the file
system complaining about the just read data being invalid:
[ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
because most of it is garbage...
This doesn't happen from the normal issue path, as we will simply defer
the request to the hardware queue dispatch list if we fail. Once it's on
the dispatch list, we never merge with it.
Fix this from the direct issue path by flagging the request as
REQ_NOMERGE so we don't change the size of it before issue.
See also:
https://bugzilla.kernel.org/show_bug.cgi?id=201685
Tested-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 6ce3dd6eec ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ca474b73896bf6e0c1eb8787eb217b0f80221610 ]
We need to copy the io priority, too; otherwise the clone will run
with a different priority than the original one.
Fixes: 43b62ce3ff ("block: move bio io prio to a new field")
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Fixed up subject, and ordered stores.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f3587d76da05f68098ddb1cb3c98cc6a9e8a402c ]
If the kernel allocates a bounce buffer for user read data, this memory
needs to be cleared before copying it to the user, otherwise it may leak
kernel memory to user space.
Laurence Oberman <loberman@redhat.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a upstream.
c2856ae2f3 ("blk-mq: quiesce queue before freeing queue") has
already fixed this race, however the implied synchronize_rcu()
in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused
performance regression.
Then 1311326cf4 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
tried to quiesce queue for avoiding unnecessary synchronize_rcu()
only when queue initialization is done, because it is usual to see
lots of inexistent LUNs which need to be probed.
However, turns out it isn't safe to quiesce queue only when queue
initialization is done. Because when one SCSI command is completed,
the user of sending command can be waken up immediately, then the
scsi device may be removed, meantime the run queue in scsi_end_request()
is still in-progress, so kernel panic can be caused.
In Red Hat QE lab, there are several reports about this kind of kernel
panic triggered during kernel booting.
This patch tries to address the issue by grabing one queue usage
counter during freeing one request and the following run queue.
Fixes: 1311326cf4 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
Cc: Andrew Jones <drjones@redhat.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
Cc: stable <stable@vger.kernel.org>
Cc: jianchao.wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit cbeb869a3d1110450186b738199963c5e68c2a71 ]
BFQ schedules entities (which represent either per-process queues or
groups of queues) as a function of their timestamps. In particular, as
a function of their (virtual) finish times. The finish time of an
entity is computed as a function of the budget assigned to the entity,
assuming, tentatively, that the entity, once in service, will receive
an amount of service equal to its budget. Then, when the entity is
expired because it finishes to be served, this finish time is updated
as a function of the actual service received by the entity. This
allows the entity to be correctly charged with only the service
received, and then to be correctly re-scheduled.
Yet an entity may receive service also while not being the entity in
service (in the scheduling environment of its parent entity), for
several reasons. If the entity remains with no backlog while receiving
this 'unofficial' service, then it is expired. Also on such an
expiration, the finish time of the entity should be updated to account
for only the service actually received by the entity. Unfortunately,
such an update is not performed for an entity expiring without being
the entity in service.
In a similar vein, the service counter of the entity in service is
reset when the entity is expired, to be ready to be used for next
service cycle. This reset too should be performed also in case an
entity is expired because it remains empty after receiving service
while not being the entity in service. But in this case the reset is
not performed.
This commit performs the above update of the finish time and reset of
the service received, also for an entity expiring while not being the
entity in service.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 34ffec60b27aa81d04e274e71e4c6ef740f75fc7 upstream.
Obviously the created writesame bio has to be aligned with logical block
size, and use bio_allowed_max_sectors() to retrieve this number.
Cc: stable@vger.kernel.org
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Fixes: b49a0871be ("block: remove split code in blkdev_issue_{discard,write_same}")
Tested-by: Rui Salvaterra <rsalvaterra@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1adfc5e4136f5967d591c399aff95b3b035f16b7 upstream.
Obviously the created discard bio has to be aligned with logical block size.
This patch introduces the helper of bio_allowed_max_sectors() for
this purpose.
Cc: stable@vger.kernel.org
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Fixes: 744889b7cb ("block: don't deal with discard limit in blkdev_issue_discard()")
Fixes: a22c4d7e34 ("block: re-add discard_granularity and alignment checks")
Reported-by: Rui Salvaterra <rsalvaterra@gmail.com>
Tested-by: Rui Salvaterra <rsalvaterra@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 52990a5fb0c991ecafebdab43138b5ed41376852 upstream.
We're only setting up the bounce bio sets if we happen
to need bouncing for regular HIGHMEM, not if we only need
it for ISA devices.
Protect the ISA bounce setup with a mutex, since it's
being invoked from driver init functions and can thus be
called in parallel.
Cc: stable@vger.kernel.org
Reported-by: Ondrej Zary <linux@rainbow-software.org>
Tested-by: Ondrej Zary <linux@rainbow-software.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
blk_queue_split() does respect this limit via bio splitting, so no
need to do that in blkdev_issue_discard(), then we can align to
normal bio submit(bio_add_page() & submit_bio()).
More importantly, this patch fixes one issue introduced in a22c4d7e34
("block: re-add discard_granularity and alignment checks"), in which
zero discard bio may be generated in case of zero alignment.
Fixes: a22c4d7e34 ("block: re-add discard_granularity and alignment checks")
Cc: stable@vger.kernel.org
Cc: Ming Lin <ming.l@ssi.samsung.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Xiao Ni <xni@redhat.com>
Tested-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tetsuo brought to my attention that I screwed up the scale_up/scale_down
helpers when I factored out the rq-qos code. We need to wake up all the
waiters when we add slots for requests to make, not when we shrink the
slots. Otherwise we'll end up things waiting forever. This was a
mistake and simply puts everything back the way it was.
cc: stable@vger.kernel.org
Fixes: a79050434b ("blk-rq-qos: refactor out common elements of blk-wbt")
eported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
trace_block_unplug() takes true for explicit unplugs and false for
implicit unplugs. schedule() unplugs are implicit and should be
reported as timer unplugs. While correct in the legacy code, this has
been inverted in blk-mq since 4.11.
Cc: stable@vger.kernel.org
Fixes: bd166ef183 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the deadline scheduler is used with a zoned block device, writes
to a zone will be dispatched one at a time. This causes the warning
message:
deadline: forced dispatching is broken (nr_sorted=X), please report this
to be displayed when switching to another elevator with the legacy I/O
path while write requests to a zone are being retained in the scheduler
queue.
Prevent this message from being displayed when executing
elv_drain_elevator() for a zoned block device. __blk_drain_queue() will
loop until all writes are dispatched and completed, resulting in the
desired elevator queue drain without extensive modifications to the
deadline code itself to handle forced-dispatch calls.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Fixes: 8dc8146f9c ("deadline-iosched: Introduce zone locking support")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A recent commit runs tag iterator callbacks under the rcu read lock,
but existing callbacks do not satisfy the non-blocking requirement.
The commit intended to prevent an iterator from accessing a queue that's
being modified. This patch fixes the original issue by taking a queue
reference instead of reading it, which allows callbacks to make blocking
calls.
Fixes: f5bbbbe4d6 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter")
Acked-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
updating properly on 4.18. This is because we started using ktime to
track elapsed time, and we convert nanoseconds to jiffies when we update
the partition counter. However, this gets rounded down, so any I/Os that
take less than a jiffy are not accounted for. Previously in this case,
the value of jiffies would sometimes increment while we were doing I/O,
so at least some I/Os were accounted for.
Let's convert the stats to use nanoseconds internally. We still report
milliseconds as before, now more accurately than ever. The value is
still truncated to 32 bits for backwards compatibility.
Fixes: 522a777566 ("block: consolidate struct request timestamp fields")
Cc: stable@vger.kernel.org
Reported-by: Klaus Kusche <klaus.kusche@computerix.info>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After merging the iolatency policy, we potentially now have 4 policies
being registered, but only support 3. This causes one of them to fail
loading. Takashi reports that BFQ no longer works for him, because it
fails to load due to policy registration failure.
Bump to 5 policies, and also add a warning for when we have exceeded
the global amount. If we have to touch this again, we should switch
to a dynamic scheme instead.
Reported-by: Takashi Iwai <tiwai@suse.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix trivial use-after-free. This could be last reference to bfqg.
Fixes: 8f9bebc33d ("block, bfq: access and cache blkg data only when safe")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is possible to call fsync on a read-only handle (for example, fsck.ext2
does it when doing read-only check), and this call results in kernel
warning.
The patch b089cfd95d ("block: don't warn for flush on read-only device")
attempted to disable the warning, but it is buggy and it doesn't
(op_is_flush tests flags, but bio_op strips off the flags).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 721c7fc701 ("block: fail op_is_write() requests to read-only partitions")
Cc: stable@vger.kernel.org # 4.18
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a very small change a bio gets caught up in a really
unfortunate race between a task migration, cgroup exiting, and itself
trying to associate with a blkg. This is due to css offlining being
performed after the css->refcnt is killed which triggers removal of
blkgs that reach their blkg->refcnt of 0.
To avoid this, association with a blkg should use tryget and fallback to
using the root_blkg.
Fixes: 08e18eab0c ("block: add bi_blkg to the bio for cgroups")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, blkcg destruction relies on a sequence of events:
1. Destruction starts. blkcg_css_offline() is called and blkgs
release their reference to the blkcg. This immediately destroys
the cgwbs (writeback).
2. With blkgs giving up their reference, the blkcg ref count should
become zero and eventually call blkcg_css_free() which finally
frees the blkcg.
Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
on the completion of all writeback associated with the blkcg. A count of
the number of cgwbs is maintained and once that goes to zero, blkg
destruction can follow. This should prevent premature blkg destruction
related to writeback.
The new process for blkcg cleanup is as follows:
1. Destruction starts. blkcg_css_offline() is called which offlines
writeback. Blkg destruction is delayed on the cgwb_refcnt count to
avoid punting potentially large amounts of outstanding writeback
to root while maintaining any ongoing policies. Here, the base
cgwb_refcnt is put back.
2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
and handles destruction of blkgs. This is where the css reference
held by each blkg is released.
3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
This finally frees the blkg.
It seems in the past blk-throttle didn't do the most understandable
things with taking data from a blkg while associating with current. So,
the simplification and unification of what blk-throttle is doing caused
this.
Fixes: 08e18eab0c ("block: add bi_blkg to the bio for cgroups")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 4c6994806f.
Destroying blkgs is tricky because of the nature of the relationship. A
blkg should go away when either a blkcg or a request_queue goes away.
However, blkg's pin the blkcg to ensure they remain valid. To break this
cycle, when a blkcg is offlined, blkgs put back their css ref. This
eventually lets css_free() get called which frees the blkcg.
The above commit (4c6994806f) breaks this order of events by trying to
destroy blkgs in css_free(). As the blkgs still hold references to the
blkcg, css_free() is never called.
The race between blkcg_bio_issue_check() and cgroup_rmdir() will be
addressed in the following patch by delaying destruction of a blkg until
all writeback associated with the blkcg has been finished.
Fixes: 4c6994806f ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, variable ref_count within the bsg_device struct is of
type atomic_t. For variables being used as reference counters,
the refcount API should be used instead of atomic. The newer
refcount API works to prevent counter overflows and use-after-free
bugs. So, move this varable from the atomic API to refcount,
potentially avoiding the issues mentioned.
Signed-off-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
kmem_cache_destroy() can handle NULL pointer correctly, so there is
no need to check e->icq_cache before calling kmem_cache_destroy().
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have two potential issues:
1) After commit 2887e41b91, we only wake one process at the time when
we finish an IO. We really want to wake up as many tasks as can
queue IO. Before this commit, we woke up everyone, which could cause
a thundering herd issue.
2) A task can potentially consume two wakeups, causing us to (in
practice) miss a wakeup.
Fix both by providing our own wakeup function, which stops
__wake_up_common() from waking up more tasks if we fail to get a
queueing token. With the strict ordering we have on the wait list, this
wakes the right tasks and the right amount of tasks.
Based on a patch from Jianchao Wang <jianchao.w.wang@oracle.com>.
Tested-by: Agarwal, Anchal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Prep patch for calling the handler from a different context,
no functional changes in this patch.
Tested-by: Agarwal, Anchal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit removed the ability to have per-rq flags. We used
those flags to maintain inflight counts. Since we don't have those
anymore, we have to always maintain inflight counts, even if wbt is
disabled. This is clearly suboptimal.
Add a queue quiesce around changing the wbt latency settings from sysfs
to work around this. With that, we can reliably put the enabled check in
our bio_to_wbt_flags(), since we know the WBT_TRACKED flag will be
consistent for the lifetime of the request.
Fixes: c1c80384c8 ("block: remove external dependency on wbt_flags")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to do this inside the loop as well, or we can allow new
IO to supersede previous IO.
Tested-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need the memory barrier before checking the list head,
use the appropriate helper for this. The matching queue
side memory barrier is provided by set_current_state().
Tested-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlt9on8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpj1xEADKBmJlV9aVyxc5w6XggqAGeHqI4afFrl+v
9fW6WUQMAaBUrr7PMIEJQ0Zm4B7KxgBaEWNtuuj4ULkjpgYm2AuGUuTJSyKz41rS
Ma+KNyCA2Zmq4SvwGFbcdCuCbUqnoxTycscAgCjuDvIYLW0+nFGNc47ibmu9lZIV
33Ef5LrxuCjhC2zyNxEdWpUxDCjoYzock85LW+wYyIYLU9uKdoExS+YmT8U+ebA/
AkXBcxPztNDxwkcsIwgGVoTjwxiowqGz3uueWfyEmYgaCPiNOsxkoNQAtjX4ykQE
hnqnHWyzJkRwbYo7Vd/bRAZXvszKGYE1YcJmu5QrNf0dK5MSq2o5OYJAEJWbucPj
m0R2u7O9qbS2JEnxGrm5+oYJwBzwNY5/Lajr15WkljTqobKnqcvn/Hdgz/XdGtek
0S1QHkkBsF7e+cax8sePWK+O3ilY7pl9CzyZKB/tJngl8A45Jv8xVojg0v3O7oS+
zZib0rwWg/bwR/uN6uPCDcEsQusqL5YovB7m6NRVshwz6cV1zVNp2q+iOulk7KuC
MprW4Du9CJf8HA19XtyJIG1XLstnuz+Exy+i5BiimUJ5InoEFDuj/6OZa6Qaczbo
SrDDvpGtSf4h7czKpE5kV4uZiTOrjuI30TrI+4csdZ7HQIlboxNL72seNTLJs55F
nbLjRM8L6g==
=FS7e
-----END PGP SIGNATURE-----
Merge tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
- Set of bcache fixes and changes (Coly)
- The flush warn fix (me)
- Small series of BFQ fixes (Paolo)
- wbt hang fix (Ming)
- blktrace fix (Steven)
- blk-mq hardware queue count update fix (Jianchao)
- Various little fixes
* tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits)
block/DAC960.c: make some arrays static const, shrinks object size
blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
blk-mq: init hctx sched after update ctx and hctx mapping
block: remove duplicate initialization
tracing/blktrace: Fix to allow setting same value
pktcdvd: fix setting of 'ret' error return for a few cases
block: change return type to bool
block, bfq: return nbytes and not zero from struct cftype .write() method
block, bfq: improve code of bfq_bfqq_charge_time
block, bfq: reduce write overcharge
block, bfq: always update the budget of an entity when needed
block, bfq: readd missing reset of parent-entity service
blk-wbt: fix IO hang in wbt_wait()
block: don't warn for flush on read-only device
bcache: add the missing comments for smp_mb()/smp_wmb()
bcache: remove unnecessary space before ioctl function pointer arguments
bcache: add missing SPDX header
bcache: move open brace at end of function definitions to next line
bcache: add static const prefix to char * array declarations
bcache: fix code comments style
...
For blk-mq, part_in_flight/rw will invoke blk_mq_in_flight/rw to
account the inflight requests. It will access the queue_hw_ctx and
nr_hw_queues w/o any protection. When updating nr_hw_queues and
blk_mq_in_flight/rw occur concurrently, panic comes up.
Before update nr_hw_queues, the q will be frozen. So we could use
q_usage_counter to avoid the race. percpu_ref_is_zero is used here
so that we will not miss any in-flight request. The access to
nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are
under rcu critical section, __blk_mq_update_nr_hw_queues could use
synchronize_rcu to ensure the zeroed q_usage_counter to be globally
visible.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, when update nr_hw_queues, IO scheduler's init_hctx will
be invoked before the mapping between ctx and hctx is adapted
correctly by blk_mq_map_swqueue. The IO scheduler init_hctx (kyber)
may depend on this mapping and get wrong result and panic finally.
A simply way to fix this is that switch the IO scheduler to 'none'
before update the nr_hw_queues, and then switch it back after
update nr_hw_queues. blk_mq_sched_init_/exit_hctx are removed due
to nobody use them any more.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch removes the duplicate initialization of q->queue_head
in the blk_alloc_queue_node(). This removes the 2nd initialization
so that we preserve the initialization order same as declaration
present in struct request_queue.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Because blk_do_io_stat() only does a judgement about the request
contributes to IO statistics, it better changes return type to bool.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The value that struct cftype .write() method returns is then directly
returned to userspace as the value returned by write() syscall, so it
should be the number of bytes actually written (or consumed) and not zero.
Returning zero from write() syscall makes programs like /bin/echo or bash
spin.
Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Fixes: e21b7a0b98 ("block, bfq: add full hierarchical scheduling and cgroups support")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bfq_bfqq_charge_time contains some lengthy and redundant code. This
commit trims and condenses that code.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a sync request is dispatched, the queue that contains that
request, and all the ancestor entities of that queue, are charged with
the number of sectors of the request. In constrast, if the request is
async, then the queue and its ancestor entities are charged with the
number of sectors of the request, multiplied by an overcharge
factor. This throttles the bandwidth for async I/O, w.r.t. to sync
I/O, and it is done to counter the tendency of async writes to steal
I/O throughput to reads.
On the opposite end, the lower this parameter, the stabler I/O
control, in the following respect. The lower this parameter is, the
less the bandwidth enjoyed by a group decreases
- when the group does writes, w.r.t. to when it does reads;
- when other groups do reads, w.r.t. to when they do writes.
The fixes "block, bfq: always update the budget of an entity when
needed" and "block, bfq: readd missing reset of parent-entity service"
improved I/O control in bfq to such an extent that it has been
possible to revise this overcharge factor downwards. This commit
introduces the resulting, new value.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the next child entity to serve changes for a given parent entity,
the budget of that parent entity must be updated accordingly.
Unfortunately, this update is not performed, by mistake, for the
entities that happen to switch from having no child entity to serve,
to having one child entity to serve.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The received-service counter needs to be equal to 0 when an entity is
set in service. Unfortunately, commit "block, bfq: fix service being
wrongly set to zero in case of preemption" mistakenly removed the
resetting of this counter for the parent entities of the bfq_queue
being set in service. This commit fixes this issue by resetting
service for parent entities, directly on the expiration of the
in-service bfq_queue.
Fixes: 9fae8dd59f ("block, bfq: fix service being wrongly set to zero in case of preemption")
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltwvasQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv65EACTq5gSLnJBI6ZPr1RAHruVDnjfzO2Veitl
tUtjm0XfWmnEiwQ3dYvnyhy99xbyaG3900d9BClCTlH6xaUdSiQkDpcKG/R2F36J
5mZitYukQcpFAQJWF8YKsTTE7JPl4VglCIDqYiC4+C3rOSVi8lrKn2qp4J4MMCFn
thRg3jCcq7c5s9Eigsop1pXWQSasubkXfk55Krcp4oybKYpYRKXXf74Mj14QAbwJ
QHN3VisyAUWoBRg7UQZo1Npe2oPk6bbnJypnjf8M0M2EnlvddEkIlHob91sodka8
6p4APOEu5cbyXOBCAQsw/koff14mb8aEadqeQA68WvXfIdX9ZjfxCX0OoC3sBEXk
yqJhZ0C980AM13zIBD8ejv4uasGcPca8W+47mE5P8sRiI++5kBsFWDZPCtUBna0X
2Kh24NsmEya9XRR5vsB84dsIPQ3tLMkxg/IgQRVDaSnfJz0c/+zm54xDyKRaFT4l
5iERk2WSkm9+8jNfVmWG0edrv6nRAXjpGwFfOCPh6/LCSCi4xQRULYN7sVzsX8ZK
FRjt24HftBI8mJbh4BtweJvg+ppVe1gAk3IO3HvxAQhv29Hz+uvFYe9kL+3N8LJA
Qosr9n9O4+wKYizJcDnw+5iPqCHfAwOm9th4pyedR+R7SmNcP3yNC8AbbheNBiF5
Zolos5H+JA==
=b9ib
-----END PGP SIGNATURE-----
Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"First pull request for this merge window, there will also be a
followup request with some stragglers.
This pull request contains:
- Fix for a thundering heard issue in the wbt block code (Anchal
Agarwal)
- A few NVMe pull requests:
* Improved tracepoints (Keith)
* Larger inline data support for RDMA (Steve Wise)
* RDMA setup/teardown fixes (Sagi)
* Effects log suppor for NVMe target (Chaitanya Kulkarni)
* Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
* TP4004 (ANA) support (Christoph)
* Various NVMe fixes
- Block io-latency controller support. Much needed support for
properly containing block devices. (Josef)
- Series improving how we handle sense information on the stack
(Kees)
- Lightnvm fixes and updates/improvements (Mathias/Javier et al)
- Zoned device support for null_blk (Matias)
- AIX partition fixes (Mauricio Faria de Oliveira)
- DIF checksum code made generic (Max Gurtovoy)
- Add support for discard in iostats (Michael Callahan / Tejun)
- Set of updates for BFQ (Paolo)
- Removal of async write support for bsg (Christoph)
- Bio page dirtying and clone fixups (Christoph)
- Set of bcache fix/changes (via Coly)
- Series improving blk-mq queue setup/teardown speed (Ming)
- Series improving merging performance on blk-mq (Ming)
- Lots of other fixes and cleanups from a slew of folks"
* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
blkcg: Make blkg_root_lookup() work for queues in bypass mode
bcache: fix error setting writeback_rate through sysfs interface
null_blk: add lock drop/acquire annotation
Blk-throttle: reduce tail io latency when iops limit is enforced
block: paride: pd: mark expected switch fall-throughs
block: Ensure that a request queue is dissociated from the cgroup controller
block: Introduce blk_exit_queue()
blkcg: Introduce blkg_root_lookup()
block: Remove two superfluous #include directives
blk-mq: count the hctx as active before allocating tag
block: bvec_nr_vecs() returns value for wrong slab
bcache: trivial - remove tailing backslash in macro BTREE_FLAG
bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
bcache: set max writeback rate when I/O request is idle
bcache: add code comments for bset.c
bcache: fix mistaken comments in request.c
bcache: fix mistaken code comments in bcache.h
bcache: add a comment in super.c
bcache: avoid unncessary cache prefetch bch_btree_node_get()
bcache: display rate debug parameters to 0 when writeback is not running
...
On wbt invariant is that if one IO is tracked via WBT_TRACKED, rqw->inflight
should be updated for tracking this IO.
But commit c1c80384c8 ("block: remove external dependency on wbt_flags")
forgets to remove the early handling of !rwb_enabled(rwb) inside wbt_wait(),
then the inflight counter may not be increased in wbt_wait(), but decreased
in wbt_done() for this kind of IO, so this counter may become negative, then
wbt_wait() may wait forever.
This patch fixes the report in the following link:
https://marc.info/?l=linux-block&m=153221542021033&w=2
Fixes: c1c80384c8 ("block: remove external dependency on wbt_flags")
Cc: Josef Bacik <jbacik@fb.com>
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't warn for a flush issued to a read-only device. It's not strictly
a writable command, as it doesn't change any on-media data by itself.
Reported-by: Stefan Agner <stefan@agner.ch>
Fixes: 721c7fc701 ("block: fail op_is_write() requests to read-only partitions")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For legacy queues the only call of blkg_root_lookup() happens after
bypass mode has been enabled. Since blkg_lookup() returns NULL for
queues in bypass mode, modify the blkg_root_lookup() such that it
no longer depends on bypass mode. Rename the function into
blk_queue_root_blkg() as suggested by Tejun.
Suggested-by: Tejun Heo <tj@kernel.org>
Fixes: 6bad9b210a ("blkcg: Introduce blkg_root_lookup()")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Several block drivers call alloc_disk() followed by put_disk() if
something fails before device_add_disk() is called without calling
blk_cleanup_queue(). Make sure that also for this scenario a request
queue is dissociated from the cgroup controller. This patch avoids
that loading the parport_pc, paride and pf drivers triggers the
following kernel crash:
BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
Read of size 4 at addr 0000000000000008 by task modprobe/744
Call Trace:
dump_stack+0x9a/0xeb
kasan_report+0x139/0x350
pi_init+0x42e/0x580 [paride]
pf_init+0x2bb/0x1000 [pf]
do_one_initcall+0x8e/0x405
do_init_module+0xd9/0x2f2
load_module+0x3ab4/0x4700
SYSC_finit_module+0x176/0x1a0
do_syscall_64+0xee/0x2b0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
Reported-by: Alexandru Moise <00moses.alexander00@gmail.com>
Fixes: a063057d7c ("block: Fix a race between request queue removal and the block cgroup controller") # v4.17
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Alexandru Moise <00moses.alexander00@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alexandru Moise <00moses.alexander00@gmail.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Alexandru Moise <00moses.alexander00@gmail.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, we count the hctx as active after allocate driver tag
successfully. If a previously inactive hctx try to get tag first
time, it may fails and need to wait. However, due to the stale tag
->active_queues, the other shared-tags users are still able to
occupy all driver tags while there is someone waiting for tag.
Consequently, even if the previously inactive hctx is waked up, it
still may not be able to get a tag and could be starved.
To fix it, we count the hctx as active before try to allocate driver
tag, then when it is waiting the tag, the other shared-tag users
will reserve budget for it.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In commit ed996a52c8 ("block: simplify and cleanup bvec pool
handling"), the value of the slab index is incremented by one in
bvec_alloc() after the allocation is done to indicate an index value of
0 does not need to be later freed.
bvec_nr_vecs() was not updated accordingly, and thus returns the wrong
value. Decrement idx before performing the lookup.
Fixes: ed996a52c8 ("block: simplify and cleanup bvec pool handling")
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch does not change any functionality but avoids that gcc
reports the following warnings when building with W=1:
block/cfq-iosched.c: In function ?cfq_back_seek_max_store?:
block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
if (__data < (MIN)) \
^
block/cfq-iosched.c:4756:1: note: in expansion of macro ?STORE_FUNCTION?
STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
^~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_slice_idle_store?:
block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
if (__data < (MIN)) \
^
block/cfq-iosched.c:4759:1: note: in expansion of macro ?STORE_FUNCTION?
STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
^~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_group_idle_store?:
block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
if (__data < (MIN)) \
^
block/cfq-iosched.c:4760:1: note: in expansion of macro ?STORE_FUNCTION?
STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
^~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_low_latency_store?:
block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
if (__data < (MIN)) \
^
block/cfq-iosched.c:4765:1: note: in expansion of macro ?STORE_FUNCTION?
STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
^~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_slice_idle_us_store?:
block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
if (__data < (MIN)) \
^
block/cfq-iosched.c:4782:1: note: in expansion of macro ?USEC_STORE_FUNCTION?
USEC_STORE_FUNCTION(cfq_slice_idle_us_store, &cfqd->cfq_slice_idle, 0, UINT_MAX);
^~~~~~~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_group_idle_us_store?:
block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
if (__data < (MIN)) \
^
block/cfq-iosched.c:4783:1: note: in expansion of macro ?USEC_STORE_FUNCTION?
USEC_STORE_FUNCTION(cfq_group_idle_us_store, &cfqd->cfq_group_idle, 0, UINT_MAX);
^~~~~~~~~~~~~~~~~~~
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch avoids that gcc complains about fall-through when building
with W=1.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
I am currently running a large bare metal instance (i3.metal)
on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
4.18 kernel. I have a workload that simulates a database
workload and I am running into lockup issues when writeback
throttling is enabled,with the hung task detector also
kicking in.
Crash dumps show that most CPUs (up to 50 of them) are
all trying to get the wbt wait queue lock while trying to add
themselves to it in __wbt_wait (see stack traces below).
[ 0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[ 0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[ 0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
[ 0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
[ 0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
[ 0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
[ 0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
[ 0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
[ 0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
[ 0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[ 0.948132] FS: 0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
[ 0.948134] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
[ 0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 0.948138] Call Trace:
[ 0.948139] <IRQ>
[ 0.948142] do_raw_spin_lock+0xad/0xc0
[ 0.948145] _raw_spin_lock_irqsave+0x44/0x4b
[ 0.948149] ? __wake_up_common_lock+0x53/0x90
[ 0.948150] __wake_up_common_lock+0x53/0x90
[ 0.948155] wbt_done+0x7b/0xa0
[ 0.948158] blk_mq_free_request+0xb7/0x110
[ 0.948161] __blk_mq_complete_request+0xcb/0x140
[ 0.948166] nvme_process_cq+0xce/0x1a0 [nvme]
[ 0.948169] nvme_irq+0x23/0x50 [nvme]
[ 0.948173] __handle_irq_event_percpu+0x46/0x300
[ 0.948176] handle_irq_event_percpu+0x20/0x50
[ 0.948179] handle_irq_event+0x34/0x60
[ 0.948181] handle_edge_irq+0x77/0x190
[ 0.948185] handle_irq+0xaf/0x120
[ 0.948188] do_IRQ+0x53/0x110
[ 0.948191] common_interrupt+0x87/0x87
[ 0.948192] </IRQ>
....
[ 0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[ 0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[ 0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
[ 0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
[ 0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
[ 0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
[ 0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
[ 0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
[ 0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
[ 0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
[ 0.311149] FS: 000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
[ 0.311150] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
[ 0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 0.311154] Call Trace:
[ 0.311157] do_raw_spin_lock+0xad/0xc0
[ 0.311160] _raw_spin_lock_irqsave+0x44/0x4b
[ 0.311162] ? prepare_to_wait_exclusive+0x28/0xb0
[ 0.311164] prepare_to_wait_exclusive+0x28/0xb0
[ 0.311167] wbt_wait+0x127/0x330
[ 0.311169] ? finish_wait+0x80/0x80
[ 0.311172] ? generic_make_request+0xda/0x3b0
[ 0.311174] blk_mq_make_request+0xd6/0x7b0
[ 0.311176] ? blk_queue_enter+0x24/0x260
[ 0.311178] ? generic_make_request+0xda/0x3b0
[ 0.311181] generic_make_request+0x10c/0x3b0
[ 0.311183] ? submit_bio+0x5c/0x110
[ 0.311185] submit_bio+0x5c/0x110
[ 0.311197] ? __ext4_journal_stop+0x36/0xa0 [ext4]
[ 0.311210] ext4_io_submit+0x48/0x60 [ext4]
[ 0.311222] ext4_writepages+0x810/0x11f0 [ext4]
[ 0.311229] ? do_writepages+0x3c/0xd0
[ 0.311239] ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
[ 0.311240] do_writepages+0x3c/0xd0
[ 0.311243] ? _raw_spin_unlock+0x24/0x30
[ 0.311245] ? wbc_attach_and_unlock_inode+0x165/0x280
[ 0.311248] ? __filemap_fdatawrite_range+0xa3/0xe0
[ 0.311250] __filemap_fdatawrite_range+0xa3/0xe0
[ 0.311253] file_write_and_wait_range+0x34/0x90
[ 0.311264] ext4_sync_file+0x151/0x500 [ext4]
[ 0.311267] do_fsync+0x38/0x60
[ 0.311270] SyS_fsync+0xc/0x10
[ 0.311272] do_syscall_64+0x6f/0x170
[ 0.311274] entry_SYSCALL_64_after_hwframe+0x42/0xb7
In the original patch, wbt_done is waking up all the exclusive
processes in the wait queue, which can cause a thundering herd
if there is a large number of writer threads in the queue. The
original intention of the code seems to be to wake up one thread
only however, it uses wake_up_all() in __wbt_done(), and then
uses the following check in __wbt_wait to have only one thread
actually get out of the wait loop:
if (waitqueue_active(&rqw->wait) &&
rqw->wait.head.next != &wait->entry)
return false;
The problem with this is that the wait entry in wbt_wait is
define with DEFINE_WAIT, which uses the autoremove wakeup function.
That means that the above check is invalid - the wait entry will
have been removed from the queue already by the time we hit the
check in the loop.
Secondly, auto-removing the wait entries also means that the wait
queue essentially gets reordered "randomly" (e.g. threads re-add
themselves in the order they got to run after being woken up).
Additionally, new requests entering wbt_wait might overtake requests
that were queued earlier, because the wait queue will be
(temporarily) empty after the wake_up_all, so the waitqueue_active
check will not stop them. This can cause certain threads to starve
under high load.
The fix is to leave the woken up requests in the queue and remove
them in finish_wait() once the current thread breaks out of the
wait loop in __wbt_wait. This will ensure new requests always
end up at the back of the queue, and they won't overtake requests
that are already in the wait queue. With that change, the loop
in wbt_wait is also in line with many other wait loops in the kernel.
Waking up just one thread drastically reduces lock contention, as
does moving the wait queue add/remove out of the loop.
A significant drop in lockdep's lock contention numbers is seen when
running the test application on the patched kernel.
Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltU8z0eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5X8H/2fJr7m3k242+t76
sitwvx1eoPqTgryW59dRKm9IuXAGA+AjauvHzaz1QxomeQa50JghGWefD0eiJfkA
1AphQ/24EOiAbbVk084dAI/C2p122dE4D5Fy7CrfLnuouyrbFaZI5STbnrRct7sR
9deeYW0GDHO1Uenp4WDCj0baaqJqaevZ+7GG09DnWpya2nQtSkGBjqn6GpYmrfOU
mqFuxAX8mEOW6cwK16y/vYtnVjuuMAiZ63/OJ8AQ6d6ArGLwAsdn7f8Fn4I4tEr2
L0d3CRLUyegms4++Dmlu05k64buQu46WlPhjCZc5/Ts4kjrNxBuHejj2/jeSnUSt
vJJlibI=
=42a5
-----END PGP SIGNATURE-----
Merge tag 'v4.18-rc6' into for-4.19/block2
Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.
Signed-of-by: Jens Axboe <axboe@kernel.dk>
It turns out that commit 721c7fc701 ("block: fail op_is_write()
requests to read-only partitions"), while obviously correct, causes
problems for some older lvm2 installations.
The reason is that the lvm snapshotting will continue to write to the
snapshow COW volume, even after the volume has been marked read-only.
End result: snapshot failure.
This has actually been fixed in newer version of the lvm2 tool, but the
old tools still exist, and the breakage was reported both in the kernel
bugzilla and in the Debian bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=200439https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900442
The lvm2 fix is here
https://sourceware.org/git/?p=lvm2.git;a=commit;h=a6fdb9d9d70f51c49ad11a87ab4243344e6701a3
but until everybody has updated to recent versions, we'll have to weaken
the "never write to read-only partitions" check. It now allows the
write to happen, but causes a warning, something like this:
generic_make_request: Trying to write to read-only block-device dm-3 (partno X)
Modules linked in: nf_tables xt_cgroup xt_owner kvm_intel iwlmvm kvm irqbypass iwlwifi
CPU: 1 PID: 77 Comm: kworker/1:1 Not tainted 4.17.9-gentoo #3
Hardware name: LENOVO 20B6A019RT/20B6A019RT, BIOS GJET91WW (2.41 ) 09/21/2016
Workqueue: ksnaphd do_metadata
RIP: 0010:generic_make_request_checks+0x4ac/0x600
...
Call Trace:
generic_make_request+0x64/0x400
submit_bio+0x6c/0x140
dispatch_io+0x287/0x430
sync_io+0xc3/0x120
dm_io+0x1f8/0x220
do_metadata+0x1d/0x30
process_one_work+0x1b9/0x3e0
worker_thread+0x2b/0x3c0
kthread+0x113/0x130
ret_from_fork+0x35/0x40
Note that this is a "revert" in behavior only. I'm leaving alone the
actual code cleanups in commit 721c7fc701, but letting the previously
uncaught request go through with a warning instead of stopping it.
Fixes: 721c7fc701 ("block: fail op_is_write() requests to read-only partitions")
Reported-and-tested-by: WGH <wgh@torlan.ru>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltkcI4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpuByD/9Pf8mfsv/WDJposa5pRqnY18dtBNdfHjsb
t6vuBZiLxe86LDo1Y3e7663u+kgraARyswUpNTv5iAKseVkwIcz0RTArQJkaDLy7
AD2BEfLvpcacAoWZ96MiayjQabGXEnTWqiKnIlcbc6J4/LIfdYYZHd+SXUXH0R81
BaXPANL222srxuGbBifsfZ/yXxNPmBYfT1mTaFU6cKAftVO/24RV/MbwkU4cgym7
0ZM/iBDkFDy6eV3XhhGeX1miDhGa2zNM/NHvsnSsP8jQqlyJW+lpI5dRPEoH1Fzp
ecpxSH/HtjW5GYMwP7qhHMth/XHhjkOmNDRblkRWYrMGxgy7mAcSR8axdjqTdDxD
+vlez+b9uPF3qXUk1D2NkW3RmTSkxjV6ztzc+yddbFSzqmQcHSqH+7ohaHnQRefX
IGngTOq5pSmluNmrul48y/wmkSwVI1/jrteOfJfhJXWL0pY65g+8KWQxLmeZHke7
ytFu6msOsVmZjDQ7tckSxSLMB+frvzuc/h+eOBTx8ida0bjopVlfNUcLQRD4kZfy
xHn6nPxGsdiq2PM71nxYvqoUfmZnIE3o0A9+IB4OIp5bLVAGt2/fbWV3llunGK0A
JM8UmyOysh71xh288VyE/hPz/7tyea/KgTaTm0jdTkTeoHH2isHsgWWvYB3zjeEn
/1c7o2T13Q==
=Ml3z
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180803' of git://git.kernel.dk/linux-block
Pull block fix from Jens Axboe:
"Just a single fix, from Ming, fixing a regression in this cycle where
the busy tag iteration was changed to only calling the callback
function for requests that are started. We really want all non-free
requests.
This fixes a boot regression on certain VM setups"
* tag 'for-linus-20180803' of git://git.kernel.dk/linux-block:
blk-mq: fix blk_mq_tagset_busy_iter
Commit d250bf4e776ff09d5("blk-mq: only iterate over inflight requests
in blk_mq_tagset_busy_iter") uses 'blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT'
to replace 'blk_mq_request_started(req)', this way is wrong, and causes
lots of test system hang during booting.
Fix the issue by using blk_mq_request_started(req) inside bt_tags_iter().
Fixes: d250bf4e77 ("blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter")
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: Matt Hart <matthew.hart@linaro.org>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: John Garry <john.garry@huawei.com>
Cc: Hannes Reinecke <hare@suse.com>,
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: linux-scsi@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Mark Brown <broonie@kernel.org>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The passed 'nr' from userspace represents the total depth, meantime
inside 'struct blk_mq_tags', 'nr_tags' stores the total tag depth,
and 'nr_reserved_tags' stores the reserved part.
There are two issues in blk_mq_tag_update_depth() now:
1) for growing tags, we should have used the passed 'nr', and keep the
number of reserved tags not changed.
2) the passed 'nr' should have been used for checking against
'tags->nr_tags', instead of number of the normal part.
This patch fixes the above two cases, and avoids kernel crash caused
by wrong resizing sbitmap queue.
Cc: "Ewan D. Milne" <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Omar Sandoval <osandov@fb.com>
Tested by: Marco Patalano <mpatalan@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Runtime PM isn't ready for blk-mq yet, and commit 765e40b675 ("block:
disable runtime-pm for blk-mq") tried to disable it. Unfortunately,
it can't take effect in that way since user space still can switch
it on via 'echo auto > /sys/block/sdN/device/power/control'.
This patch disables runtime-pm for blk-mq really by pm_runtime_disable()
and fixes all kinds of PM related kernel crash.
Cc: Tomas Janousek <tomi@nomi.cz>
Cc: Przemek Socha <soprwa@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: <stable@vger.kernel.org>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, avg_lat is calculated by accumulating the mean of every
window in a long running cumulative average. As time goes on, the metric
becomes less and less useful due to the accumulated history.
This patch reuses the same calculation done in load averages to make the
avg_lat metric more lively. Unlike load averages, the avg only advances
when a window elapses (due to an io). Idle periods extend the most
recent window. Bucketing is used to limit the history of avg_lat by
binding it to the window size. So, the window range for 1/exp (decay
rate) is [1 min, 2.5 min) when windows elapse immediately.
The current sample window size is exposed in the debug info to enable
calculation of the window range.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The blkg lifetime is protected by the queue lifetime, so we need to put
the queue _after_ we're done using the blkg.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
At this point we have a ref on the blkg, we need to drop it if we don't
have a iolat.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Simplify the code by using the PTR_ERR_OR_ZERO, instead of the
open code. It is better.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We find the memory use-after-free issue in __blk_drain_queue()
on the kernel 4.14. After read the latest kernel 4.18-rc6 we
think it has the same problem.
Memory is allocated for q->fq in the blk_init_allocated_queue().
If the elevator init function called with error return, it will
run into the fail case to free the q->fq.
Then the __blk_drain_queue() uses the same memory after the free
of the q->fq, it will lead to the unpredictable event.
The patch is to set q->fq as NULL in the fail case of
blk_init_allocated_queue().
Fixes: commit 7c94e1c157 ("block: introduce blk_flush_queue to drive flush machinery")
Cc: <stable@vger.kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: xiao jin <jin.xiao@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently these functions are implemented in the scsi layer, but their
actual place should be the block layer since T10-PI is a general data
integrity feature that is used in the nvme protocol as well. Also, use
the tuple size from the integrity profile since it may vary between
integrity types.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltbc20QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgh4D/9GYQcjk9qLVFxkv5ucAUvCuxEL6gjsMf4W
M/QdxVIrwh3zpvsH++2IXXn+xH+UjujMA5NkzhsSr4+hsSO2iAGOYMJbroNfhsTD
onvQQ6NTaHPu/+PZs0otVK4KMWHwZGWOV6YU00TWTfRgzRmGEsSMe91oeBIXVv9w
v6d09twaLSY0lUkAAbcdu5fuFBtXu4Bxy60qyHEKkAdWWHEUYaZLrODhVjoGg2V4
KdAWS5X4A6kJMcPcoOvG6RFtpf71boaip9o/DRLUWhGdIQnI38UgSCUmz1XMYnik
Sq8r74vqCm8IhIOLTlxnPrMHHbKv7JZhY3Ow9fxnS6HZRNI0aPX31Yml6NULqnWh
MsQh+6gZXd3xC1O7txEQn4a15Lk0OLXa8HJcIn5ADNxqz5/r/g0mPUG9HmPSIalO
ISFF/9UKQFcAd0RjHR+bEEH2VMznz59UWKfdOsmwFZtZSCmR1ucj0xAKDj+oP1JS
ZsgZ09K2GezrL4GEueocISo9ACIWgDWH8T7/bTxlBok0IYbybAfmOe+MZInL1Tf4
pklmoXm3ntgV3Pq8Ptk05LYyIgAaUIltuSiR3AFaXIADX0wNtV0ZgysIWgHf3BSA
18j+I1yPG1IwBdM8xNwxi56xMQR84uY5tsIyafbfj+laRI2nH5OIYjNZnrKpm957
4xZUgIECBA==
=2ogY
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180727' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Bigger than usual at this time, mostly due to the O_DIRECT corruption
issue and the fact that I was on vacation last week. This contains:
- NVMe pull request with two fixes for the FC code, and two target
fixes (Christoph)
- a DIF bio reset iteration fix (Greg Edwards)
- two nbd reply and requeue fixes (Josef)
- SCSI timeout fixup (Keith)
- a small series that fixes an issue with bio_iov_iter_get_pages(),
which ended up causing corruption for larger sized O_DIRECT writes
that ended up racing with buffered writes (Martin Wilck)"
* tag 'for-linus-20180727' of git://git.kernel.dk/linux-block:
block: reset bi_iter.bi_done after splitting bio
block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
blkdev: __blkdev_direct_IO_simple: fix leak in error case
block: bio_iov_iter_get_pages: fix size of last iovec
nvmet: only check for filebacking on -ENOTBLK
nvmet: fixup crash on NULL device path
scsi: set timed out out mq requests to complete
blk-mq: export setting request completion state
nvme: if_ready checks to fail io to deleting controller
nvmet-fc: fix target sgl list on large transfers
nbd: handle unexpected replies better
nbd: don't requeue the same request twice.
Even if properly initialized, the lvname array (i.e., strings)
is read from disk, and might contain corrupt data (e.g., lack
the null terminating character for strings).
So, make sure the partition name string used in pr_warn() has
the null terminating character.
Fixes: 6ceea22bbb ("partitions: add aix lvm partition support files")
Suggested-by: Daniel J. Axtens <daniel.axtens@canonical.com>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The if-block that sets a successful return value in aix_partition()
uses 'lvip[].pps_per_lv' and 'n[].name' potentially uninitialized.
For example, if 'numlvs' is zero or alloc_lvn() fails, neither is
initialized, but are used anyway if alloc_pvd() succeeds after it.
So, make the alloc_pvd() call conditional on their initialization.
This has been hit when attaching an apparently corrupted/stressed
AIX LUN, misleading the kernel to pr_warn() invalid data and hang.
[...] partition (null) (11 pp's found) is not contiguous
[...] partition (null) (2 pp's found) is not contiguous
[...] partition (null) (3 pp's found) is not contiguous
[...] partition (null) (64 pp's found) is not contiguous
Fixes: 6ceea22bbb ("partitions: add aix lvm partition support files")
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After the bio has been updated to represent the remaining sectors, reset
bi_done so bio_rewind_iter() does not rewind further than it should.
This resolves a bio_integrity_process() failure on reads where the
original request was split.
Fixes: 63573e359d ("bio-integrity: Restore original iterator on verify stage")
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This allows bio_integrity_bytes() to be called from drivers instead of
open coding it.
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_iov_iter_get_pages() currently only adds pages for the next non-zero
segment from the iov_iter to the bio. That's suboptimal for callers,
which typically try to pin as many pages as fit into the bio. This patch
converts the current bio_iov_iter_get_pages() into a static helper, and
introduces a new helper that allocates as many pages as
1) fit into the bio,
2) are present in the iov_iter,
3) and can be pinned by MM.
Error is returned only if zero pages could be pinned. Because of 3), a
zero return value doesn't necessarily mean all pages have been pinned.
Callers that have to pin every page in the iov_iter must still call this
function in a loop (this is currently the case).
This change matters most for __blkdev_direct_IO_simple(), which calls
bio_iov_iter_get_pages() only once. If it obtains less pages than
requested, it returns a "short write" or "short read", and
__generic_file_write_iter() falls back to buffered writes, which may
lead to data corruption.
Fixes: 72ecad22d9 ("block: support a full bio worth of IO for simplified bdev direct-io")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the last page of the bio is not "full", the length of the last
vector slot needs to be corrected. This slot has the index
(bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
array, which is shifted by the value of bio->bi_vcnt at function
invocation, the correct index is (nr_pages - 1).
v2: improved readability following suggestions from Ming Lei.
v3: followed a formatting suggestion from Christoph Hellwig.
Fixes: 2cefe4dbaa ("block: add bio_iov_iter_get_pages()")
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set max_discard_segments to USHRT_MAX in blk_set_stacking_limits() so
that blk_stack_limits() can stack up this limit for stacked devices.
before:
$ cat /sys/block/nvme0n1/queue/max_discard_segments
256
$ cat /sys/block/dm-0/queue/max_discard_segments
1
after:
$ cat /sys/block/nvme0n1/queue/max_discard_segments
256
$ cat /sys/block/dm-0/queue/max_discard_segments
256
Fixes: 1e739730c5 ("block: optionally merge discontiguous discard bios into a single request")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now only used by the bounce code, so move it there and mark the function
static.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is preparing for drivers that want to directly alter the state of
their requests. No functional change here.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
So don't bother handling it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_check_pages_dirty currently inviolates the invariant that bv_page of
a bio_vec inside bi_vcnt shouldn't be zero, and that is going to become
really annoying with multpath biovecs. Fortunately there isn't any
all that good reason for it - once we decide to defer freeing the bio
to a workqueue holding onto a few additional pages isn't really an
issue anymore. So just check if there is a clean page that needs
dirtying in the first path, and do a second pass to free them if there
was none, while the cache is still hot.
Also use the chance to micro-optimize bio_dirty_fn a bit by not saving
irq state - we know we are called from a workqueue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Inside blk_mq_try_issue_list_directly(), if the request is issued as
failed, we shouldn't try to do it again, otherwise the warning in
blk_mq_start_request() will be triggered. This change is aligned to
behaviour of other ways of request issue & dispatch.
Fixes: 6ce3dd6eec ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: kernel test robot <rong.a.chen@intel.com>
Cc: LKP <lkp@01.org>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the change to use UINT_MAX I broke the depth check as any value of
inflight (ie 0) would be less than (int)UINT_MAX. Fix this by changing
everything to unsigned int to match the depth.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add tracking of REQ_OP_DISCARD ios to the per-cgroup io.stat. Two
fields, dbytes and dios, to respectively count the total bytes and
number of discards are added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Cc: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add tracking of REQ_OP_DISCARD ios to the partition statistics and
append them to the various stat files in /sys as well as
/proc/diskstats. These are tracked with the same four stats as reads
and writes:
Number of discard ios completed.
Number of discard ios merged
Number of discard sectors completed
Milliseconds spent on discard requests
This is done via adding a new STAT_DISCARD define to genhd.h and then
using it to index that stat field for discard requests.
tj: Refreshed on top of v4.17 and other previous updates.
Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add and use a new op_stat_group() function for indexing partition stat
fields rather than indexing them by rq_data_dir() or bio_data_dir().
This function works similarly to op_is_sync() in that it takes the
request::cmd_flags or bio::bi_opf flags and determines which stats
should et updated.
In addition, the second parameter to generic_start_io_acct() and
generic_end_io_acct() is now a REQ_OP rather than simply a read or
write bit and it uses op_stat_group() on the parameter to determine
the stat group.
Note that the partition in_flight counts are not part of the per-cpu
statistics and as such are not indexed via this function. It's now
indexed by op_is_write().
tj: Refreshed on top of v4.17. Updated to pass around REQ_OP.
Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add defines for STAT_READ and STAT_WRITE for indexing the partition
stat entries. This clarifies some fs/ code which has hardcoded 1 for
STAT_WRITE and will make it easier to extend the stats with additional
fields.
tj: Refreshed on top of v4.17.
Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of 'none' io scheduler, when hw queue isn't busy, it isn't
necessary to enqueue request to sw queue and dequeue it from
sw queue because request may be submitted to hw queue asap without
extra cost, meantime there shouldn't be much request in sw queue,
and we don't need to worry about effect on IO merge.
There are still some single hw queue SCSI HBAs(HPSA, megaraid_sas, ...)
which may connect high performance devices, so 'none' is often required
for obtaining good performance.
This patch improves IOPS and decreases CPU unilization on megaraid_sas,
per Kashyap's test.
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In our longer tests we noticed that some boxes would degrade to the
point of uselessness. This is because we truncate the current time when
saving it in our bio, but I was using the raw current time to subtract
from. So once the box had been up a certain amount of time it would
appear as if our IO's were taking several years to complete. Fix this
by truncating the current time so it matches the issue time. Verified
this worked by running with this patch for a week on our test tier.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Early versions of these patches had us waiting for seconds at a time
during submission, so we had to adjust the timing window we monitored
for latency. Now we don't do things like that so this is unnecessary
code.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The code poses a security risk due to user memory access in ->release
and had an API that can't be used reliably. As far as we know it was
never used for real, but if that turns out wrong we'll have to revert
this commit and come up with a band aid.
Jann Horn did look software archives for users of this interface,
and the only users found were example code in sg3_utils, and optional
support in an optional module of the tgt user space iscsi target,
which looks like a proof of concept extension of the /dev/sg
read/write support.
Tony Battersby chimes in that the code is basically unsafe to use in
general:
The read/write interface on /dev/bsg is impossible to use safely
because the list of completed commands is per-device (bd->done_list)
rather than per-fd like it is with /dev/sg. So if program A and
program B are both using the write/read interface on the same bsg
device, then their command responses will get mixed up, and program
A will read() some command results from program B and vice versa.
So no, I don't use read/write on /dev/bsg. From a security standpoint,
it should definitely be fixed or removed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix a regression introduced in Linux kernel 4.17 where sending a SCSI
command that does not transfer data (such as TEST UNIT READY) via
/dev/bsg/* results in EINVAL.
Fixes: 17cb960f29 ("bsg: split handling of SCSI CDBs vs transport requeues")
Cc: <stable@vger.kernel.org> # 4.17+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
max_depth used to be a u64, but I changed it to a unsigned int but
didn't convert my comparisons over everywhere. Fix by using UINT_MAX
everywhere instead of (u64)-1.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On 32-bit architectures, dividing a 64-bit number needs to use the
do_div() function or something like it to avoid a link failure:
block/blk-iolatency.o: In function `iolatency_prfill_limit':
blk-iolatency.c:(.text+0x8cc): undefined reference to `__aeabi_uldivmod'
Using div_u64() gives us the best output and avoids the need for an
explicit cast.
Fixes: d706751215 ("block: introduce blk-iolatency io controller")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With gcc 4.9.0 and 7.3.0:
block/blk-core.c: In function 'blk_pm_allow_request':
block/blk-core.c:2747:2: warning: enumeration value 'RPM_ACTIVE' not handled in switch [-Wswitch]
switch (rq->q->rpm_status) {
^
Convert the return statement below the switch() block into a default
case to fix this.
Fixes: e4f36b249b ("block: fix peeking requests during PM")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If __blkdev_issue_discard is in progress and a device mapper device is
reloaded with a table that doesn't support discard,
q->limits.max_discard_sectors is set to zero. This results in infinite
loop in __blkdev_issue_discard.
This patch checks if max_discard_sectors is zero and aborts with
-EOPNOTSUPP.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Zdenek Kabelac <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The flag GFP_ATOMIC already contains __GFP_HIGH. There is no need to
explicitly or __GFP_HIGH again. So, just remove unnecessary __GFP_HIGH.
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Current IO controllers for the block layer are less than ideal for our
use case. The io.max controller is great at hard limiting, but it is
not work conserving. This patch introduces io.latency. You provide a
latency target for your group and we monitor the io in short windows to
make sure we are not exceeding those latency targets. This makes use of
the rq-qos infrastructure and works much like the wbt stuff. There are
a few differences from wbt
- It's bio based, so the latency covers the whole block layer in addition to
the actual io.
- We will throttle all IO types that comes in here if we need to.
- We use the mean latency over the 100ms window. This is because writes can
be particularly fast, which could give us a false sense of the impact of
other workloads on our protected workload.
- By default there's no throttling, we set the queue_depth to INT_MAX so that
we can have as many outstanding bio's as we're allowed to. Only at
throttle time do we pay attention to the actual queue depth.
- We backcharge cgroups for root cg issued IO and induce artificial
delays in order to deal with cases like metadata only or swap heavy
workloads.
In testing this has worked out relatively well. Protected workloads
will throttle noisy workloads down to 1 io at time if they are doing
normal IO on their own, or induce up to a 1 second delay per syscall if
they are doing a lot of root issued IO (metadata/swap IO).
Our testing has revolved mostly around our production web servers where
we have hhvm (the web server application) in a protected group and
everything else in another group. We see slightly higher requests per
second (RPS) on the test tier vs the control tier, and much more stable
RPS across all machines in the test tier vs the control tier.
Another test we run is a slow memory allocator in the unprotected group.
Before this would eventually push us into swap and cause the whole box
to die and not recover at all. With these patches we see slight RPS
drops (usually 10-15%) before the memory consumer is properly killed and
things recover within seconds.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
wbt cares only about request completion time, but controllers may need
information that is on the bio itself, so add a done_bio callback for
rq-qos so things like blk-iolatency can use it to have the bio when it
completes.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't really need to save this stuff in the core block code, we can
just pass the bio back into the helpers later on to derive the same
flags and update the rq->wbt_flags appropriately.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkcg-qos is going to do essentially what wbt does, only on a cgroup
basis. Break out the common code that will be shared between blkcg-qos
and wbt into blk-rq-qos.* so they can both utilize the same
infrastructure.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to use blk_rq_stat in the blkcg qos stuff, so export some of
these helpers so they can be used by other things.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since IO can be issued from literally anywhere it's almost impossible to
do throttling without having some sort of adverse effect somewhere else
in the system because of locking or other dependencies. The best way to
solve this is to do the throttling when we know we aren't holding any
other kernel resources. Do this by tracking throttling in a per-blkg
basis, and if we require throttling flag the task that it needs to check
before it returns to user space and possibly sleep there.
This is to address the case where a process is doing work that is
generating IO that can't be throttled, whether that is directly with a
lot of REQ_META IO, or indirectly by allocating so much memory that it
is swamping the disk with REQ_SWAP. We can't use task_add_work as we
don't want to induce a memory allocation in the IO path, so simply
saving the request queue in the task and flagging it to do the
notify_resume thing achieves the same result without the overhead of a
memory allocation.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For backcharging we need to know who the page belongs to when swapping
it out. We don't worry about things that do ->rw_page (zram etc) at the
moment, we're only worried about pages that actually go to a block
device.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-iolatency has a few stats that it would like to print out, and
instead of adding a bunch of crap to the generic code just provide a
helper so that controllers can add stuff to the stat line if they want
to.
Hide it behind a boot option since it changes the output of io.stat from
normal, and these stats are only interesting to developers.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently io.low uses a bi_cg_private to stash its private data for the
blkg, however other blkcg policies may want to use this as well. Since
we can get the private data out of the blkg, move this to bi_blkg in the
bio and make it generic, then we can use bio_associate_blkg() to attach
the blkg to the bio.
Theoretically we could simply replace the bi_css with this since we can
get to all the same information from the blkg, however you have to
lookup the blkg, so for example wbc_init_bio() would have to lookup and
possibly allocate the blkg for the css it was trying to attach to the
bio. This could be problematic and result in us either not attaching
the css at all to the bio, or falling back to the root blkcg if we are
unable to allocate the corresponding blkg.
So for now do this, and in the future if possible we could just replace
the bi_css with bi_blkg and update the helpers to do the correct
translation.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It won't be efficient to dequeue request one by one from sw queue,
but we have to do that when queue is busy for better merge performance.
This patch takes the Exponential Weighted Moving Average(EWMA) to figure
out if queue is busy, then only dequeue request one by one from sw queue
when queue is busy.
Fixes: b347689ffb ("blk-mq-sched: improve dispatching from sw queue")
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Only attempt to merge bio iff the ctx->rq_list isn't empty, because:
1) for high-performance SSD, most of times dispatch may succeed, then
there may be nothing left in ctx->rq_list, so don't try to merge over
sw queue if it is empty, then we can save one acquiring of ctx->lock
2) we can't expect good merge performance on per-cpu sw queue, and missing
one merge on sw queue won't be a big deal since tasks can be scheduled from
one CPU to another.
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
list_splice_tail_init() is much more faster than inserting each
request one by one, given all requets in 'list' belong to
same sw queue and ctx->lock is required to insert requests.
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix typo in a function blk_mq_alloc_tag_set() comment.
if if it too large -> if it's too large.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
set->mq_map is now currently cleared if something goes wrong when
establishing a queue map in blk-mq-pci.c. It's also cleared before
updating a queue map in blk_mq_update_queue_map().
This patch provides an API to clear set->mq_map to make it clear.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pointer dgrp is being assigned but is never used hence it is redundant
and can be removed.
Cleans up clang warning:
warning: variable 'dgrp' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Once one cgroup has io.low configured, @low_valid becomes true and other
cgroups won't switch it back whatsoever.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The payload of struct request is stored in the request.bio chain if
the RQF_SPECIAL_PAYLOAD flag is not set and in request.special_vec if
RQF_SPECIAL_PAYLOAD has been set. However, blk_update_request()
iterates over req->bio whether or not RQF_SPECIAL_PAYLOAD has been
set. Additionally, the RQF_SPECIAL_PAYLOAD flag is ignored by
blk_rq_bytes() which means that the value returned by that function
is incorrect if the RQF_SPECIAL_PAYLOAD flag has been set. It is not
clear to me whether this is an oversight or whether this happened on
purpose. Anyway, document that it is known that both functions ignore
RQF_SPECIAL_PAYLOAD. See also commit f9d03f96b9 ("block: improve
handling of the magic discard payload").
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
SCSI probing may synchronously create and destroy a lot of request_queues
for non-existent devices. Any synchronize_rcu() in queue creation or
destroy path may introduce long latency during booting, see detailed
description in comment of blk_register_queue().
This patch removes one synchronize_rcu() inside blk_cleanup_queue()
for this case, commit c2856ae2f315d75(blk-mq: quiesce queue before freeing queue)
needs synchronize_rcu() for implementing blk_mq_quiesce_queue(), but
when queue isn't initialized, it isn't necessary to do that since
only pass-through requests are involved, no original issue in
scsi_execute() at all.
Without this patch and previous one, it may take more 20+ seconds for
virtio-scsi to complete disk probe. With the two patches, the time becomes
less than 100ms.
Fixes: c2856ae2f3 ("blk-mq: quiesce queue before freeing queue")
Reported-by: Andrew Jones <drjones@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have to remove synchronize_rcu() from blk_queue_cleanup(),
otherwise long delay can be caused during lun probe. For removing
it, we have to avoid to iterate the set->tag_list in IO path, eg,
blk_mq_sched_restart().
This patch reverts 5b79413946d (Revert "blk-mq: don't handle
TAG_SHARED in restart"). Given we have fixed enough IO hang issue,
and there isn't any reason to restart all queues in one tags any more,
see the following reasons:
1) blk-mq core can deal with shared-tags case well via blk_mq_get_driver_tag(),
which can wake up queues waiting for driver tag.
2) SCSI is a bit special because it may return BLK_STS_RESOURCE if queue,
target or host is ready, but SCSI built-in restart can cover all these well,
see scsi_end_request(), queue will be rerun after any request initiated from
this host/target is completed.
In my test on scsi_debug(8 luns), this patch may improve IOPS by 20% ~ 30%
when running I/O on these 8 luns concurrently.
Fixes: 705cda97ee ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Reported-by: Andrew Jones <drjones@redhat.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now hctx->lock is only acquired when adding hctx->dispatch_wait to
one wait queue, but not held when removing it from the wait queue.
IO hang can be observed easily if SCHED RESTART is disabled, that means
now RESTART exits just for fixing the issue in blk_mq_mark_tag_wait().
This patch fixes the issue by introducing hctx->dispatch_wait_lock and
holding it for removing hctx->dispatch_wait in blk_mq_dispatch_wake(),
since we need to avoid acquiring hctx->lock in irq context.
Fixes: eb619fdb2d ("blk-mq: fix issue with shared tag queue re-running")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
'hctx' won't be changed at all, so not necessary to pass
'**hctx' to blk_mq_mark_tag_wait().
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We never pass 'wait' as true to blk_mq_get_driver_tag(), and hence
we never change '**hctx' as well. The last use of these went away
with the flush cleanup, commit 0c2a6fe4dc.
So cleanup the usage and remove the two extra parameters.
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The actual goal of the function bfq_bfqq_may_idle is to tell whether
it is better to perform device idling (more precisely: I/O-dispatch
plugging) for the input bfq_queue, either to boost throughput or to
preserve service guarantees. This commit improves the name of the
function accordingly.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If
- a bfq_queue Q preempts another queue, because one request of Q
arrives in time,
- but, after this preemption, Q is not the queue that is set in service,
then Q->entity.service is set to 0 when Q is eventually set in
service. But Q should have continued receiving service with its old
budget (which is why preemption has occurred) and its old service.
This commit addresses this issue by resetting service on queue real
expiration.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For some bfq_queues, BFQ plugs I/O dispatching when the queue becomes
idle, and keeps the plug until a new request of the queue arrives, or
a timeout fires. BFQ does so either to boost throughput or to preserve
service guarantees for the queue.
More precisely, for such a queue, plugging starts when the queue
happens to have either no request enqueued, or no request in flight,
that is, no request already dispatched but not yet completed.
On the opposite end, BFQ may happen to expire a queue with no request
enqueued, without doing any plugging, if the queue still has some
request in flight. Unfortunately, such a premature expiration causes
the queue to lose its chance to enjoy dispatch plugging a moment
later, i.e., when its in-flight requests finally get completed. This
breaks service guarantees for the queue.
This commit prevents BFQ from expiring an empty queue if the latter
still has in-flight requests.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To keep I/O throughput high as often as possible, BFQ performs
I/O-dispatch plugging (aka device idling) only when beneficial exactly
for throughput, or when needed for service guarantees (low latency,
fairness). An important case where the latter condition holds is when
the scenario is 'asymmetric' in terms of weights: i.e., when some
bfq_queue or whole group of queues has a higher weight, and thus has
to receive more service, than other queues or groups. Without dispatch
plugging, lower-weight queues/groups may unjustly steal bandwidth to
higher-weight queues/groups.
To detect asymmetric scenarios, BFQ checks some sufficient
conditions. One of these conditions is that active groups have
different weights. BFQ controls this condition by maintaining a
special set of unique weights of active groups
(group_weights_tree). To this purpose, in the function
bfq_active_insert/bfq_active_extract BFQ adds/removes the weight of a
group to/from this set.
Unfortunately, the function bfq_active_extract may happen to be
invoked also for a group that is still active (to preserve the correct
update of the next queue to serve, see comments in function
bfq_no_longer_next_in_service() for details). In this case, removing
the weight of the group makes the set group_weights_tree
inconsistent. Service-guarantee violations follow.
This commit addresses this issue by moving group_weights_tree
insertions from their previous location (in bfq_active_insert) into
the function __bfq_activate_entity, and by moving group_weights_tree
extractions from bfq_active_extract to when the entity that represents
a group remains throughly idle, i.e., with no request either enqueued
or dispatched.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Exclude zoned block device members from struct request_queue for
CONFIG_BLK_DEV_ZONED == n. Avoid breaking the build by only building
the code that uses these struct request_queue members if
CONFIG_BLK_DEV_ZONED != n.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since the implementation of blk_queue_nr_zones() is trivial and since
it only has a single caller, inline this function.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No cast is necessary when assigning a non-void pointer to a void
pointer.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAls2oVcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpr89D/9PaJCtLpCEU1HdaohIDXDyJKBdtCllIsqJ
YAJTlUkFRkMvsbdEA3U43rVRVlu7zxP3aRg79VRI+kdZ8y8XSiisAF/QUbG/Bwd3
NuFqKj4Hm5xFTtLMKIUlumR2/exD4ZpPHKnWmD4idp1PZRoTUwxeIjqVv77L6Nfy
c4/GVlz2t9buWxH/5PSa//rsT49xM6CLpyIwO2lFsNEk2HYE/uAWya5KZVJ1YKVw
mSjUdyFtJ//lFK9lVU4YkdNF0d5w2PcEoCfNyPe0n9F43iITACo2om7rkSkTgciS
izPAs1DkqNdWPta2twULt656WlUoGlwnWyFchU34N/I/pDiww4kdCQFfNIOi6qBW
7rPQ4xpywiw/U93C2GLEeE09++T8+yO4KuELhIcN5+D3D2VGRIuaGcf4xeY0CX3A
a335EZIxBGOfxyejknBDg3BsoUcLvejbANKD8ltis3zciGf/QyLt+FFqx25WCzkN
N028ppml8WPSiqhSlFDMyfscO1dx/WSYh1q7ANrBiPPaEjFYmakzEDT0cZW4JsUh
av4jqYt3cqrFIXyhRHJM2wjFymQv6aVnmLa4zOKCFbwm9KzMWUUFjc4R962T/sQK
0g+eCSFGidii4JZ4ghOAQk2dqCTIZqV72pmLlpJBGu+SAIQxkh+29h0LOi5U3v1Y
FnTtJ1JhCg==
=0uqV
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180629' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Small set of fixes for this series. Mostly just minor fixes, the only
oddball in here is the sg change.
The sg change came out of the stall fix for NVMe, where we added a
mempool and limited us to a single page allocation. CONFIG_SG_DEBUG
sort-of ruins that, since we'd need to account for that. That's
actually a generic problem, since lots of drivers need to allocate SG
lists. So this just removes support for CONFIG_SG_DEBUG, which I added
back in 2007 and to my knowledge it was never useful.
Anyway, outside of that, this pull contains:
- clone of request with special payload fix (Bart)
- drbd discard handling fix (Bart)
- SATA blk-mq stall fix (me)
- chunk size fix (Keith)
- double free nvme rdma fix (Sagi)"
* tag 'for-linus-20180629' of git://git.kernel.dk/linux-block:
sg: remove ->sg_magic member
drbd: Fix drbd_request_prepare() discard handling
blk-mq: don't queue more if we get a busy return
block: Fix cloning of requests with a special payload
nvme-rdma: fix possible double free of controller async event buffer
block: Fix transfer when chunk sectors exceeds max
Some devices have different queue limits depending on the type of IO. A
classic case is SATA NCQ, where some commands can queue, but others
cannot. If we have NCQ commands inflight and encounter a non-queueable
command, the driver returns busy. Currently we attempt to dispatch more
from the scheduler, if we were able to queue some commands. But for the
case where we ended up stopping due to BUSY, we should not attempt to
retrieve more from the scheduler. If we do, we can get into a situation
where we attempt to queue a non-queueable command, get BUSY, then
successfully retrieve more commands from that scheduler and queue those.
This can repeat forever, starving the non-queuable command indefinitely.
Fix this by NOT attempting to pull more commands from the scheduler, if
we get a BUSY return. This should also be more optimal in terms of
letting requests stay in the scheduler for as long as possible, if we
get a BUSY due to the regular out-of-tags condition.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch avoids that removing a path controlled by the dm-mpath driver
while mkfs is running triggers the following kernel bug:
kernel BUG at block/blk-core.c:3347!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 20 PID: 24369 Comm: mkfs.ext4 Not tainted 4.18.0-rc1-dbg+ #2
RIP: 0010:blk_end_request_all+0x68/0x70
Call Trace:
<IRQ>
dm_softirq_done+0x326/0x3d0 [dm_mod]
blk_done_softirq+0x19b/0x1e0
__do_softirq+0x128/0x60d
irq_exit+0x100/0x110
smp_call_function_single_interrupt+0x90/0x330
call_function_single_interrupt+0xf/0x20
</IRQ>
Fixes: f9d03f96b9 ("block: improve handling of the magic discard payload")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlsudKcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgphYXEAC0G+fJcLgue64LwLNjXksDUsoldqNWjTDZ
e3szwseN3Woe7t+Q9DwEUO48T421WhDYe3UY50EZF3Fz5yYLi0ggTeq4qsKgpaXx
80dUY9/STxFZUV/tKrYMz5xPjM3k5x+m8FTeGlFgyR77fbOZE6/8w9XPkpmQN/B2
ayyEPMgAR9aZKhtOU4TJY45O278b2Qdfd0s6XUsxDTwBoR1mTQqKhNnY38PrIA5s
wkhtKz5YRdq5tNhKt2IwYABpJ4HJ1Hz/FE6Fo1RNx+6vfxfEhsk2/DZBn0rzzGDj
kvSaVlPeQEu9SZdJGtb2sx26elVWdt5sqMORQ0zb1kVCochZiT1IO3Z81XCy2Y3u
5WVvOlgsGni9BWr/K+iiN/dc+cKoRPy0g/ib77XUiJHgIC+urUIr3XgcIiuP9GZc
cgCfF+6+GvRKLatSsDZSbns5cpgueJLNogZUyhs7oUouRIECo0DD9NnDAlHmaZYB
957bRd/ykE6QOvtxEsG0TWc3tgyHjJFQQ9ukb45VmzYVB8mITKOGLpceJaqzju9c
AnGyQDddK4fqOct1pgFsSA5C3V6OOVXDpUiFn3DTNoQZcu5HN2hMwEA/g/u5TnGD
N2sEwcu2WCoUjzsMEWFLmX8R4w8laZUDwkNSRC5tYzQEZpVDfmAsRtRcUHJdNLl/
zPj8QMAidA==
=hCgr
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180623' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Further timeout fixes. We aren't quite there yet, so expect another
round of fixes for that to completely close some of the IRQ vs
completion races. (Christoph/Bart)
- Set of NVMe fixes from the usual suspects, mostly error handling
- Two off-by-one fixes (Dan)
- Another bdi race fix (Jan)
- Fix nbd reconfigure with NBD_DISCONNECT_ON_CLOSE (Doron)
* tag 'for-linus-20180623' of git://git.kernel.dk/linux-block:
blk-mq: Fix timeout handling in case the timeout handler returns BLK_EH_DONE
bdi: Fix another oops in wb_workfn()
lightnvm: Remove depends on HAS_DMA in case of platform dependency
nvme-pci: limit max IO size and segments to avoid high order allocations
nvme-pci: move nvme_kill_queues to nvme_remove_dead_ctrl
nvme-fc: release io queues to allow fast fail
nbd: Add the nbd NBD_DISCONNECT_ON_CLOSE config flag.
block: sed-opal: Fix a couple off by one bugs
blk-mq-debugfs: Off by one in blk_mq_rq_state_name()
nvmet: reset keep alive timer in controller enable
nvme-rdma: don't override opts->queue_size
nvme-rdma: Fix command completion race at error recovery
nvme-rdma: fix possible free of a non-allocated async event buffer
nvme-rdma: fix possible double free condition when failing to create a controller
Revert "block: Add warning for bi_next not NULL in bio_endio()"
block: fix timeout changes for legacy request drivers
Make sure that RQF_TIMED_OUT is cleared when a request is reused
after a block driver timeout handler has returned BLK_EH_DONE.
Fixes: da66126739 ("blk-mq: don't time out requests again that are in the timeout handler")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Andrew Randrianasulu <randrianasulu@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
resp->num is the number of tokens in resp->tok[]. It gets set in
response_parse(). So if n == resp->num then we're reading beyond the
end of the data.
Fixes: 455a7b238c ("block: Add Sed-opal library")
Reviewed-by: Scott Bauer <scott.bauer@intel.com>
Tested-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If rq_state == ARRAY_SIZE() then we read one element beyond the end of
the blk_mq_rq_state_name_array[] array.
Fixes: ec6dcf63c5 ("blk-mq-debugfs: Show more request state information")
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 0ba99ca483 ("block: Add warning for bi_next not NULL in
bio_endio()") breaks the dm driver. end_clone_bio() detects whether
or not a bio is the last bio associated with a request by checking
the .bi_next field. Commit 0ba99ca483 clears that field before
end_clone_bio() has had a chance to inspect that field. Hence revert
commit 0ba99ca483.
This patch avoids that KASAN reports the following complaint when
running the srp-test software (srp-test/run_tests -c -d -r 10 -t 02-mq):
==================================================================
BUG: KASAN: use-after-free in bio_advance+0x11b/0x1d0
Read of size 4 at addr ffff8801300e06d0 by task ksoftirqd/0/9
CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 4.18.0-rc1-dbg+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0xa4/0xf5
print_address_description+0x6f/0x270
kasan_report+0x241/0x360
__asan_load4+0x78/0x80
bio_advance+0x11b/0x1d0
blk_update_request+0xa7/0x5b0
scsi_end_request+0x56/0x320 [scsi_mod]
scsi_io_completion+0x7d6/0xb20 [scsi_mod]
scsi_finish_command+0x1c0/0x280 [scsi_mod]
scsi_softirq_done+0x19a/0x230 [scsi_mod]
blk_mq_complete_request+0x160/0x240
scsi_mq_done+0x50/0x1a0 [scsi_mod]
srp_recv_done+0x515/0x1330 [ib_srp]
__ib_process_cq+0xa0/0xf0 [ib_core]
ib_poll_handler+0x38/0xa0 [ib_core]
irq_poll_softirq+0xe8/0x1f0
__do_softirq+0x128/0x60d
run_ksoftirqd+0x3f/0x60
smpboot_thread_fn+0x352/0x460
kthread+0x1c1/0x1e0
ret_from_fork+0x24/0x30
Allocated by task 1918:
save_stack+0x43/0xd0
kasan_kmalloc+0xad/0xe0
kasan_slab_alloc+0x11/0x20
kmem_cache_alloc+0xfe/0x350
mempool_alloc_slab+0x15/0x20
mempool_alloc+0xfb/0x270
bio_alloc_bioset+0x244/0x350
submit_bh_wbc+0x9c/0x2f0
__block_write_full_page+0x299/0x5a0
block_write_full_page+0x16b/0x180
blkdev_writepage+0x18/0x20
__writepage+0x42/0x80
write_cache_pages+0x376/0x8a0
generic_writepages+0xbe/0x110
blkdev_writepages+0xe/0x10
do_writepages+0x9b/0x180
__filemap_fdatawrite_range+0x178/0x1c0
file_write_and_wait_range+0x59/0xc0
blkdev_fsync+0x46/0x80
vfs_fsync_range+0x66/0x100
do_fsync+0x3d/0x70
__x64_sys_fsync+0x21/0x30
do_syscall_64+0x77/0x230
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Freed by task 9:
save_stack+0x43/0xd0
__kasan_slab_free+0x137/0x190
kasan_slab_free+0xe/0x10
kmem_cache_free+0xd3/0x380
mempool_free_slab+0x17/0x20
mempool_free+0x63/0x160
bio_free+0x81/0xa0
bio_put+0x59/0x60
end_bio_bh_io_sync+0x5d/0x70
bio_endio+0x1a7/0x360
blk_update_request+0xd0/0x5b0
end_clone_bio+0xa3/0xd0 [dm_mod]
bio_endio+0x1a7/0x360
blk_update_request+0xd0/0x5b0
scsi_end_request+0x56/0x320 [scsi_mod]
scsi_io_completion+0x7d6/0xb20 [scsi_mod]
scsi_finish_command+0x1c0/0x280 [scsi_mod]
scsi_softirq_done+0x19a/0x230 [scsi_mod]
blk_mq_complete_request+0x160/0x240
scsi_mq_done+0x50/0x1a0 [scsi_mod]
srp_recv_done+0x515/0x1330 [ib_srp]
__ib_process_cq+0xa0/0xf0 [ib_core]
ib_poll_handler+0x38/0xa0 [ib_core]
irq_poll_softirq+0xe8/0x1f0
__do_softirq+0x128/0x60d
The buggy address belongs to the object at ffff8801300e0640
which belongs to the cache bio-0 of size 200
The buggy address is located 144 bytes inside of
200-byte region [ffff8801300e0640, ffff8801300e0708)
The buggy address belongs to the page:
page:ffffea0004c03800 count:1 mapcount:0 mapping:ffff88015a563a00 index:0x0 compound_mapcount: 0
flags: 0x8000000000008100(slab|head)
raw: 8000000000008100 dead000000000100 dead000000000200 ffff88015a563a00
raw: 0000000000000000 0000000000330033 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8801300e0580: fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc
ffff8801300e0600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
>ffff8801300e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8801300e0700: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8801300e0780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes: 0ba99ca483 ("block: Add warning for bi_next not NULL in bio_endio()")
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_complete_request can only be called for blk-mq drivers, but when
removing the BLK_EH_HANDLED return value, two legacy request timeout
methods incorrectly got switched to call blk_mq_complete_request.
Call __blk_complete_request instead to reinstance the previous behavior.
For that __blk_complete_request needs to be exported.
Fixes: 1fc2b62e ("scsi_transport_fc: complete requests from ->timeout")
Fixes: 0df0bb08 ("null_blk: complete requests from ->timeout")
Reported-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlslJkoQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpi3READZgtKEaixC4EMLNH/7WqVLC5XxsIpuV3N7
kxaV2KoJ/zJbBPqE+WIW/UasuTRkIFQpmXp3aYw879FZfCq1sN7K9yyzY2i7qvuG
l8WRPWFc2SMfMrXb3pctGiNyYhTIZrOOzPmKAwlNinyyr7tlan2Ha4z5XHOJon9O
uVk252kTlBqvTEX4nZKN2x6UiI9RaVegbOV6lthm3lQZF+25C6kN/8emdkhDIkjG
mOI8MaHp6iDbpIBWBziEpkvGxvxO4PAaESldyJH6Fg+uzmNEDEwp6jvFBmlCiQIE
wdt5Swsl2zKk8g/tBHeDSDK0inHnP5xho4tC68rDtLrgxyLeBuVjyLMItkYJdTqh
s8B5gdCs3SS41aXMk+w3BnLBIRRMrByGiY2Nx23Si6KZU8pT5sM0dGGOTGCma1dy
NNitIRbePyrajV2p/xqnMeiGvxqnCckRRYr0uV2KPJzqwwMoZvNsnS+XL2KqdgCY
ndK5Nda5oRb+yDDlPKlRvY8cKh1N8GWdoFIGbfrMIYJcGiUXBoX7UoA/DKKbnY2o
p+TE3t9RGz9UPwxHqSMTXRV+xs4X3lgWf4sA/ciGGPeuAgwdOoGJnfn4kY7AIPq8
LYXj8K4rx3cRA+l7Y4DCppPQEl8bxKXhFFccKsXToFITW+PNdCuM7FDugOHdzwQE
QlesAkLV+w==
=gyaY
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180616' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A collection of fixes that should go into -rc1. This contains:
- bsg_open vs bsg_unregister race fix (Anatoliy)
- NVMe pull request from Christoph, with fixes for regressions in
this window, FC connect/reconnect path code unification, and a
trace point addition.
- timeout fix (Christoph)
- remove a few unused functions (Christoph)
- blk-mq tag_set reinit fix (Roman)"
* tag 'for-linus-20180616' of git://git.kernel.dk/linux-block:
bsg: fix race of bsg_open and bsg_unregister
block: remov blk_queue_invalidate_tags
nvme-fabrics: fix and refine state checks in __nvmf_check_ready
nvme-fabrics: handle the admin-only case properly in nvmf_check_ready
nvme-fabrics: refactor queue ready check
blk-mq: remove blk_mq_tagset_iter
nvme: remove nvme_reinit_tagset
nvme-fc: fix nulling of queue data on reconnect
nvme-fc: remove reinit_request routine
blk-mq: don't time out requests again that are in the timeout handler
nvme-fc: change controllers first connect to use reconnect path
nvme: don't rely on the changed namespace list log
nvmet: free smart-log buffer after use
nvme-rdma: fix error flow during mapping request data
nvme: add bio remapping tracepoint
nvme: fix NULL pointer dereference in nvme_init_subsystem
blk-mq: reinit q->tag_set_list entry only after grace period
As we move stuff around, some doc references are broken. Fix some of
them via this script:
./scripts/documentation-file-ref-check --fix
Manually checked if the produced result is valid, removing a few
false-positives.
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com>
Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Jonathan Corbet <corbet@lwn.net>
The existing implementation allows races between bsg_unregister and
bsg_open paths. bsg_unregister and request_queue cleanup and deletion
may start and complete right after bsg_get_device (in bsg_open path)
retrieves bsg_class_device and releases the mutex. Then bsg_open path
touches freed memory of bsg_class_device and request_queue.
One possible fix is to hold the mutex all the way through bsg_get_device
instead of releasing it after bsg_class_device retrieval.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-Off-By: Anatoliy Glagolev <glagolig@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This function is entirely unused, so remove it and the tag_queue_busy
member of struct request_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>