Commit graph

4137 commits

Author SHA1 Message Date
Greg Kroah-Hartman
10f41ccfc7 This is the 4.19.36 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAly6xzUACgkQONu9yGCS
 aT5sIA//b7nAk2zuhmbkonsBfzFq5uBJmqXcCrOgy3XHMs4fE+Q11kLd1wMAV7dx
 U7FNHe4PIJ8Rczxgqr2VP3VmFbV6UuTK+UTclJKfbV3ouIAQiQBuutABBmbDUj2p
 FInc/yAYyhVc9n7gX78czTiUxKnKi4+sisUYDCZPr3hr6jDPcLvm/WVWdyrcXJje
 rYFNmE/2MBH1NofG+MOpq+ILhKHXlf2APN2/spl+I42a8bwodiSl9g+dhuWr7wgT
 Ln2Ocf7BZ6BPCQKoveZdD1Gd56NNR/lJh4ulqpuhaZw4Yp+B/C7GmrBtdPzVSGka
 IwPWoSc9/9VSUl+ooSZHms78VLbqq0rNNclskL2bN6m962u04Eu7sB2Tg/bwUs52
 Wkcw0DY4J/oMJtj/CMHcQOUPsk6vwHxqnjsj+LYJ1ZjHO68tUshnENxXrbAoDc45
 2fuY3TCA+XqFvqNt5HbkLPtFR78u8QmZ1lP/Pkri6xoG/GA6O0EAxhS0Z9hncGK7
 8wJNuxLMd2UX94wlajQ+DF7yyCU4HOFEdeSEOwlHHBid/fckXsGzL2tKJUAbbUPP
 ux3An8kJHni8nQrmUkyy1Nx29ROyAFxBLOQshWGpXgJrV3qRMYLyB2Icv0WYCGFk
 zZCTupPgvb46u81VzqxrLH4RZdy4Ar4uB3BQGPKs596rlYmvnSo=
 =CArs
 -----END PGP SIGNATURE-----

Merge 4.19.36 into android-4.19

Changes in 4.19.36
	ARC: u-boot args: check that magic number is correct
	arc: hsdk_defconfig: Enable CONFIG_BLK_DEV_RAM
	inotify: Fix fsnotify_mark refcount leak in inotify_update_existing_watch()
	perf/core: Restore mmap record type correctly
	ext4: avoid panic during forced reboot
	ext4: add missing brelse() in add_new_gdb_meta_bg()
	ext4: report real fs size after failed resize
	ALSA: echoaudio: add a check for ioremap_nocache
	ALSA: sb8: add a check for request_region
	auxdisplay: hd44780: Fix memory leak on ->remove()
	drm/udl: use drm_gem_object_put_unlocked.
	IB/mlx4: Fix race condition between catas error reset and aliasguid flows
	i40iw: Avoid panic when handling the inetdev event
	mmc: davinci: remove extraneous __init annotation
	ALSA: opl3: fix mismatch between snd_opl3_drum_switch definition and declaration
	thermal/intel_powerclamp: fix __percpu declaration of worker_data
	thermal: samsung: Fix incorrect check after code merge
	thermal: bcm2835: Fix crash in bcm2835_thermal_debugfs
	thermal/int340x_thermal: Add additional UUIDs
	thermal/int340x_thermal: fix mode setting
	thermal/intel_powerclamp: fix truncated kthread name
	scsi: iscsi: flush running unbind operations when removing a session
	sched/cpufreq: Fix 32-bit math overflow
	sched/core: Fix buffer overflow in cgroup2 property cpu.max
	x86/mm: Don't leak kernel addresses
	tools/power turbostat: return the exit status of a command
	perf list: Don't forget to drop the reference to the allocated thread_map
	perf config: Fix an error in the config template documentation
	perf config: Fix a memory leak in collect_config()
	perf build-id: Fix memory leak in print_sdt_events()
	perf top: Fix error handling in cmd_top()
	perf hist: Add missing map__put() in error case
	perf evsel: Free evsel->counts in perf_evsel__exit()
	perf tests: Fix a memory leak of cpu_map object in the openat_syscall_event_on_all_cpus test
	perf tests: Fix memory leak by expr__find_other() in test__expr()
	perf tests: Fix a memory leak in test__perf_evsel__tp_sched_test()
	ACPI / utils: Drop reference in test for device presence
	PM / Domains: Avoid a potential deadlock
	blk-iolatency: #include "blk.h"
	drm/exynos/mixer: fix MIXER shadow registry synchronisation code
	irqchip/stm32: Don't clear rising/falling config registers at init
	irqchip/mbigen: Don't clear eventid when freeing an MSI
	x86/hpet: Prevent potential NULL pointer dereference
	x86/hyperv: Prevent potential NULL pointer dereference
	x86/cpu/cyrix: Use correct macros for Cyrix calls on Geode processors
	drm/nouveau/debugfs: Fix check of pm_runtime_get_sync failure
	iommu/vt-d: Check capability before disabling protected memory
	x86/hw_breakpoints: Make default case in hw_breakpoint_arch_parse() return an error
	fix incorrect error code mapping for OBJECTID_NOT_FOUND
	x86/gart: Exclude GART aperture from kcore
	ext4: prohibit fstrim in norecovery mode
	drm/cirrus: Use drm_framebuffer_put to avoid kernel oops in clean-up
	gpio: pxa: handle corner case of unprobed device
	rsi: improve kernel thread handling to fix kernel panic
	f2fs: fix to avoid NULL pointer dereference on se->discard_map
	9p: do not trust pdu content for stat item size
	9p locks: add mount option for lock retry interval
	ASoC: Fix UBSAN warning at snd_soc_get/put_volsw_sx()
	f2fs: fix to do sanity check with current segment number
	netfilter: xt_cgroup: shrink size of v2 path
	serial: uartps: console_setup() can't be placed to init section
	powerpc/pseries: Remove prrn_work workqueue
	media: au0828: cannot kfree dev before usb disconnect
	Bluetooth: Fix debugfs NULL pointer dereference
	HID: i2c-hid: override HID descriptors for certain devices
	pinctrl: core: make sure strcmp() doesn't get a null parameter
	ARM: samsung: Limit SAMSUNG_PM_CHECK config option to non-Exynos platforms
	usbip: fix vhci_hcd controller counting
	ACPI / SBS: Fix GPE storm on recent MacBookPro's
	HID: usbhid: Add quirk for Redragon/Dragonrise Seymur 2
	KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail
	compiler.h: update definition of unreachable()
	netfilter: nf_flow_table: remove flowtable hook flush routine in netns exit routine
	f2fs: cleanup dirty pages if recover failed
	net: stmmac: Set OWN bit for jumbo frames
	cifs: fallback to older infolevels on findfirst queryinfo retry
	kernel: hung_task.c: disable on suspend
	platform/x86: Add Intel AtomISP2 dummy / power-management driver
	drm/ttm: Fix bo_global and mem_global kfree error
	ALSA: hda: fix front speakers on Huawei MBXP
	ACPI: EC / PM: Disable non-wakeup GPEs for suspend-to-idle
	net/rds: fix warn in rds_message_alloc_sgs
	xfrm: destroy xfrm_state synchronously on net exit path
	crypto: sha256/arm - fix crash bug in Thumb2 build
	crypto: sha512/arm - fix crash bug in Thumb2 build
	net: ip6_gre: fix possible NULL pointer dereference in ip6erspan_set_version
	iommu/dmar: Fix buffer overflow during PCI bus notification
	scsi: core: Avoid that system resume triggers a kernel warning
	soc/tegra: pmc: Drop locking from tegra_powergate_is_powered()
	lkdtm: Print real addresses
	lkdtm: Add tests for NULL pointer dereference
	drm/panel: panel-innolux: set display off in innolux_panel_unprepare
	crypto: axis - fix for recursive locking from bottom half
	Revert "ACPI / EC: Remove old CLEAR_ON_RESUME quirk"
	coresight: cpu-debug: Support for CA73 CPUs
	PCI: Blacklist power management of Gigabyte X299 DESIGNARE EX PCIe ports
	drm/nouveau/volt/gf117: fix speedo readout register
	ARM: 8839/1: kprobe: make patch_lock a raw_spinlock_t
	drm/amdkfd: use init_mqd function to allocate object for hid_mqd (CI)
	appletalk: Fix use-after-free in atalk_proc_exit
	lib/div64.c: off by one in shift
	rxrpc: Fix client call connect/disconnect race
	f2fs: fix to dirty inode for i_mode recovery
	include/linux/swap.h: use offsetof() instead of custom __swapoffset macro
	bpf: fix use after free in bpf_evict_inode
	IB/hfi1: Failed to drain send queue when QP is put into error state
	mm: hide incomplete nr_indirectly_reclaimable in /proc/zoneinfo
	mm: hide incomplete nr_indirectly_reclaimable in sysfs
	appletalk: Fix compile regression
	Linux 4.19.36

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-04-20 15:53:36 +02:00
Bart Van Assche
bde271d1ad blk-iolatency: #include "blk.h"
[ Upstream commit 373e915cd8e84544609eced57a44fbc084f8d60f ]

This patch avoids that the following warning is reported when building
with W=1:

block/blk-iolatency.c:734:5: warning: no previous prototype for 'blk_iolatency_init' [-Wmissing-prototypes]

Cc: Josef Bacik <jbacik@fb.com>
Fixes: d706751215 ("block: introduce blk-iolatency io controller") # v4.19
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-20 09:15:58 +02:00
Greg Kroah-Hartman
3e166630b0 This is the 4.19.35 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAly2yf8ACgkQONu9yGCS
 aT5OkA/+Jf3UWnM8MdPoxVUROYg6WRSMThvuKkcspkvXxXqu7Z9a/+meqdIZ0NDr
 j+RdhCfZf7RRNpTpYB9jMqXXuRFhVWniAVM3/me2LxFvwyKkNqURyIz9azh7T0VW
 nDa2NDQ9C0IyLFLW4cyzZncrmWntgW5XKZfVot1TMNpryHB4FPsKP5rInDQpSNIu
 bbJBWuZziUqdEnk3TWuqdQr4ZTXgnjYSBaTbOQBU3F6E88TRmFIatyYovLcJCQh5
 jBmXV3U6mGPZe64I+JXj8dTp32WD74afMX1YSZmjdLL6KgbD9a4k47UzpXzEbKT8
 uMG057zPiYcwOyMCFZOWsCYIH7Yr4oMIAzOEyYcDI1uCuKd4aESdpgiWfUWambR8
 BJO5oGELzLArMVWusWK7xBfm+Et2ePOFWC7V9MRVVQJLdAWkEafYuLoPqrhhqxQX
 Iu3ThoY3UmEoNqi1345gmxQBJfbOqdK7CwQGzKOtiPNs0nyMfg62B9Rcr9O+FE5A
 +womRQF2Tik1cZhXxkJVSD4f+orL+xuOu59ufvMTBe4co9tPVFm9Qp6SeWoOVuwZ
 l8SS7DD5wb7vzDUZOO4KmA7D6p3YQWdBvGeN6jKqjASIFS9t5Sb+fNiKznTo/WCT
 1I68sLm7/rs95+WazKu66ewW8bh91VAo+Gmpfo5zEJjcfajnDxY=
 =F1Te
 -----END PGP SIGNATURE-----

Merge 4.19.35 into android-4.19

Changes in 4.19.35
	kvm: nVMX: NMI-window and interrupt-window exiting should wake L2 from HLT
	drm/i915/gvt: do not let pin count of shadow mm go negative
	powerpc/tm: Limit TM code inside PPC_TRANSACTIONAL_MEM
	hv_netvsc: Fix unwanted wakeup after tx_disable
	ibmvnic: Fix completion structure initialization
	ip6_tunnel: Match to ARPHRD_TUNNEL6 for dev type
	ipv6: Fix dangling pointer when ipv6 fragment
	ipv6: sit: reset ip header pointer in ipip6_rcv
	kcm: switch order of device registration to fix a crash
	net: ethtool: not call vzalloc for zero sized memory request
	net-gro: Fix GRO flush when receiving a GSO packet.
	net/mlx5: Decrease default mr cache size
	netns: provide pure entropy for net_hash_mix()
	net: rds: force to destroy connection if t_sock is NULL in rds_tcp_kill_sock().
	net/sched: act_sample: fix divide by zero in the traffic path
	net/sched: fix ->get helper of the matchall cls
	openvswitch: fix flow actions reallocation
	qmi_wwan: add Olicard 600
	r8169: disable ASPM again
	sctp: initialize _pad of sockaddr_in before copying to user memory
	tcp: Ensure DCTCP reacts to losses
	tcp: fix a potential NULL pointer dereference in tcp_sk_exit
	vrf: check accept_source_route on the original netdevice
	net/mlx5e: Fix error handling when refreshing TIRs
	net/mlx5e: Add a lock on tir list
	nfp: validate the return code from dev_queue_xmit()
	nfp: disable netpoll on representors
	bnxt_en: Improve RX consumer index validity check.
	bnxt_en: Reset device on RX buffer errors.
	net: ip_gre: fix possible use-after-free in erspan_rcv
	net: ip6_gre: fix possible use-after-free in ip6erspan_rcv
	net: core: netif_receive_skb_list: unlist skb before passing to pt->func
	r8169: disable default rx interrupt coalescing on RTL8168
	net: mlx5: Add a missing check on idr_find, free buf
	net/mlx5e: Update xoff formula
	net/mlx5e: Update xon formula
	kbuild: deb-pkg: fix bindeb-pkg breakage when O= is used
	kbuild: clang: choose GCC_TOOLCHAIN_DIR not on LD
	x86/vdso: Drop implicit common-page-size linker flag
	lib/string.c: implement a basic bcmp
	Revert "clk: meson: clean-up clock registration"
	netfilter: nfnetlink_cttimeout: pass default timeout policy to obj_to_nlattr
	netfilter: nfnetlink_cttimeout: fetch timeouts for udplite and gre, too
	arm64: kaslr: Reserve size of ARM64_MEMSTART_ALIGN in linear region
	tty: mark Siemens R3964 line discipline as BROKEN
	tty: ldisc: add sysctl to prevent autoloading of ldiscs
	hwmon: (w83773g) Select REGMAP_I2C to fix build error
	ACPICA: Clear status of GPEs before enabling them
	ACPICA: Namespace: remove address node from global list after method termination
	ALSA: seq: Fix OOB-reads from strlcpy
	ALSA: hda/realtek: Enable headset MIC of Acer TravelMate B114-21 with ALC233
	ALSA: hda/realtek - Add quirk for Tuxedo XC 1509
	ALSA: hda - Add two more machines to the power_save_blacklist
	mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()
	arm64: dts: rockchip: fix rk3328 sdmmc0 write errors
	parisc: Detect QEMU earlier in boot process
	parisc: regs_return_value() should return gpr28
	parisc: also set iaoq_b in instruction_pointer_set()
	alarmtimer: Return correct remaining time
	drm/i915/gvt: do not deliver a workload if its creation fails
	drm/udl: add a release method and delay modeset teardown
	kvm: svm: fix potential get_num_contig_pages overflow
	include/linux/bitrev.h: fix constant bitrev
	mm: writeback: use exact memcg dirty counts
	ASoC: intel: Fix crash at suspend/resume after failed codec registration
	ASoC: fsl_esai: fix channel swap issue when stream starts
	Btrfs: do not allow trimming when a fs is mounted with the nologreplay option
	btrfs: prop: fix zstd compression parameter validation
	btrfs: prop: fix vanished compression property after failed set
	riscv: Fix syscall_get_arguments() and syscall_set_arguments()
	block: do not leak memory in bio_copy_user_iov()
	block: fix the return errno for direct IO
	genirq: Respect IRQCHIP_SKIP_SET_WAKE in irq_chip_set_wake_parent()
	genirq: Initialize request_mutex if CONFIG_SPARSE_IRQ=n
	virtio: Honour 'may_reduce_num' in vring_create_virtqueue
	ARM: dts: rockchip: fix rk3288 cpu opp node reference
	ARM: dts: am335x-evmsk: Correct the regulators for the audio codec
	ARM: dts: am335x-evm: Correct the regulators for the audio codec
	ARM: dts: at91: Fix typo in ISC_D0 on PC9
	arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result value
	arm64: dts: rockchip: fix rk3328 rgmii high tx error rate
	arm64: backtrace: Don't bother trying to unwind the userspace stack
	xen: Prevent buffer overflow in privcmd ioctl
	sched/fair: Do not re-read ->h_load_next during hierarchical load calculation
	xtensa: fix return_address
	x86/asm: Remove dead __GNUC__ conditionals
	x86/asm: Use stricter assembly constraints in bitops
	x86/perf/amd: Resolve race condition when disabling PMC
	x86/perf/amd: Resolve NMI latency issues for active PMCs
	x86/perf/amd: Remove need to check "running" bit in NMI handler
	PCI: Add function 1 DMA alias quirk for Marvell 9170 SATA controller
	PCI: pciehp: Ignore Link State Changes after powering off a slot
	dm integrity: change memcmp to strncmp in dm_integrity_ctr
	dm: revert 8f50e35815 ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE")
	dm table: propagate BDI_CAP_STABLE_WRITES to fix sporadic checksum errors
	dm integrity: fix deadlock with overlapping I/O
	arm64: dts: rockchip: fix vcc_host1_5v pin assign on rk3328-rock64
	arm64: dts: rockchip: Fix vcc_host1_5v GPIO polarity on rk3328-rock64
	ACPICA: AML interpreter: add region addresses in global list during initialization
	KVM: x86: nVMX: close leak of L0's x2APIC MSRs (CVE-2019-3887)
	KVM: x86: nVMX: fix x2APIC VTPR read intercept
	Linux 4.19.35

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-04-17 11:46:16 +02:00
Jérôme Glisse
2591bfc682 block: do not leak memory in bio_copy_user_iov()
commit a3761c3c91209b58b6f33bf69dd8bb8ec0c9d925 upstream.

When bio_add_pc_page() fails in bio_copy_user_iov() we should free
the page we just allocated otherwise we are leaking it.

Cc: linux-block@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-17 08:38:51 +02:00
Greg Kroah-Hartman
d885da678e This is the 4.19.34 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlynu40ACgkQONu9yGCS
 aT5X6g//Wkfm/+qSZ0GhLDQkPniiH1QkvzhOmVrrxu+KB0qsiwsEl8Srw33ZVkJK
 LT8+IPGiG9jEGu9dj+BYXTIfy9ZvfSsEL2N6GhYwDSXP0fok2rUaHbZvv1IB2g4W
 afhGdNwNAUCJ/j1UrUsi+SAFJ+xWbVxFpGstd0cqM9IbKdEV7RIukvuKckHiKOKR
 qI8FxC+G2PAr+BtnETfk5/suPDJ7B3ZicDoMhiWJGxJ6dfFTVmkSmasSoPDaMiHm
 4S3hN2lu+WTeRpRPPB17Dlk4MmIp0k+bGYBKAlaxAMCc/RZxvbT2pRYaMQbId2/L
 mNUfSnOQFGEAhlAPfb7wdbObphnyT34GhlkWfZBTrnhPO0/FomLOvU6xVdcNuakX
 Tv2JKfDzb+2ttcMZ+0T84Ru9RztoswFATSw8uFMVxW8oTS6MVWnHu96Kxfl7QO3J
 PdlIGcyqxSuWNE8OX1QVtdSruGZfwUDNs94S4nQJtkB8BViRwhGJlqaXuy4d9Wp6
 fGlI2W6qhjyosi2wBSMTjh/ytk/jq0vfs+z2XjR2gAYssvB/SOLR/AlSVguWsDnf
 WaoFBkXvCbuPvPlo0TrLpl5RW5WlOtLXHE3Vr3dKp458wLwpf/OZBGoZiknp7DrF
 PzBZs2ie5tmyqTxbAygl7WkbQPJ682pd5R4nf5CY+zvUaOMZv1g=
 =Iuup
 -----END PGP SIGNATURE-----

Merge 4.19.34 into android-4.19

Changes in 4.19.34
	arm64: debug: Don't propagate UNKNOWN FAR into si_code for debug signals
	ext4: cleanup bh release code in ext4_ind_remove_space()
	tty/serial: atmel: Add is_half_duplex helper
	tty/serial: atmel: RS485 HD w/DMA: enable RX after TX is stopped
	CIFS: fix POSIX lock leak and invalid ptr deref
	h8300: use cc-cross-prefix instead of hardcoding h8300-unknown-linux-
	f2fs: fix to adapt small inline xattr space in __find_inline_xattr()
	f2fs: fix to avoid deadlock in f2fs_read_inline_dir()
	tracing: kdb: Fix ftdump to not sleep
	net/mlx5: Avoid panic when setting vport rate
	net/mlx5: Avoid panic when setting vport mac, getting vport config
	gpio: gpio-omap: fix level interrupt idling
	include/linux/relay.h: fix percpu annotation in struct rchan
	sysctl: handle overflow for file-max
	net: stmmac: Avoid sometimes uninitialized Clang warnings
	enic: fix build warning without CONFIG_CPUMASK_OFFSTACK
	libbpf: force fixdep compilation at the start of the build
	scsi: hisi_sas: Set PHY linkrate when disconnected
	scsi: hisi_sas: Fix a timeout race of driver internal and SMP IO
	iio: adc: fix warning in Qualcomm PM8xxx HK/XOADC driver
	x86/hyperv: Fix kernel panic when kexec on HyperV
	perf c2c: Fix c2c report for empty numa node
	mm/sparse: fix a bad comparison
	mm/cma.c: cma_declare_contiguous: correct err handling
	mm/page_ext.c: fix an imbalance with kmemleak
	mm, swap: bounds check swap_info array accesses to avoid NULL derefs
	mm,oom: don't kill global init via memory.oom.group
	memcg: killed threads should not invoke memcg OOM killer
	mm, mempolicy: fix uninit memory access
	mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512!
	mm/slab.c: kmemleak no scan alien caches
	ocfs2: fix a panic problem caused by o2cb_ctl
	f2fs: do not use mutex lock in atomic context
	fs/file.c: initialize init_files.resize_wait
	page_poison: play nicely with KASAN
	cifs: use correct format characters
	dm thin: add sanity checks to thin-pool and external snapshot creation
	f2fs: fix to check inline_xattr_size boundary correctly
	cifs: Accept validate negotiate if server return NT_STATUS_NOT_SUPPORTED
	cifs: Fix NULL pointer dereference of devname
	netfilter: nf_tables: check the result of dereferencing base_chain->stats
	netfilter: conntrack: tcp: only close if RST matches exact sequence
	jbd2: fix invalid descriptor block checksum
	fs: fix guard_bio_eod to check for real EOD errors
	tools lib traceevent: Fix buffer overflow in arg_eval
	PCI/PME: Fix hotplug/sysfs remove deadlock in pcie_pme_remove()
	wil6210: check null pointer in _wil_cfg80211_merge_extra_ies
	mt76: fix a leaked reference by adding a missing of_node_put
	crypto: crypto4xx - add missing of_node_put after of_device_is_available
	crypto: cavium/zip - fix collision with generic cra_driver_name
	usb: chipidea: Grab the (legacy) USB PHY by phandle first
	powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tables
	scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c
	kbuild: invoke syncconfig if include/config/auto.conf.cmd is missing
	powerpc/xmon: Fix opcode being uninitialized in print_insn_powerpc
	coresight: etm4x: Add support to enable ETMv4.2
	serial: 8250_pxa: honor the port number from devicetree
	ARM: 8840/1: use a raw_spinlock_t in unwind
	iommu/io-pgtable-arm-v7s: Only kmemleak_ignore L2 tables
	powerpc/hugetlb: Handle mmap_min_addr correctly in get_unmapped_area callback
	btrfs: qgroup: Make qgroup async transaction commit more aggressive
	mmc: omap: fix the maximum timeout setting
	net: dsa: mv88e6xxx: Add lockdep classes to fix false positive splat
	e1000e: Fix -Wformat-truncation warnings
	mlxsw: spectrum: Avoid -Wformat-truncation warnings
	platform/x86: ideapad-laptop: Fix no_hw_rfkill_list for Lenovo RESCUER R720-15IKBN
	platform/mellanox: mlxreg-hotplug: Fix KASAN warning
	loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
	IB/mlx4: Increase the timeout for CM cache
	clk: fractional-divider: check parent rate only if flag is set
	perf annotate: Fix getting source line failure
	ASoC: qcom: Fix of-node refcount unbalance in qcom_snd_parse_of()
	cpufreq: acpi-cpufreq: Report if CPU doesn't support boost technologies
	efi: cper: Fix possible out-of-bounds access
	s390/ism: ignore some errors during deregistration
	scsi: megaraid_sas: return error when create DMA pool failed
	scsi: fcoe: make use of fip_mode enum complete
	drm/amd/display: Clear stream->mode_changed after commit
	perf test: Fix failure of 'evsel-tp-sched' test on s390
	mwifiex: don't advertise IBSS features without FW support
	perf report: Don't shadow inlined symbol with different addr range
	SoC: imx-sgtl5000: add missing put_device()
	media: ov7740: fix runtime pm initialization
	media: sh_veu: Correct return type for mem2mem buffer helpers
	media: s5p-jpeg: Correct return type for mem2mem buffer helpers
	media: rockchip/rga: Correct return type for mem2mem buffer helpers
	media: s5p-g2d: Correct return type for mem2mem buffer helpers
	media: mx2_emmaprp: Correct return type for mem2mem buffer helpers
	media: mtk-jpeg: Correct return type for mem2mem buffer helpers
	mt76: usb: do not run mt76u_queues_deinit twice
	xen/gntdev: Do not destroy context while dma-bufs are in use
	vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1
	HID: intel-ish-hid: avoid binding wrong ishtp_cl_device
	cgroup, rstat: Don't flush subtree root unless necessary
	jbd2: fix race when writing superblock
	leds: lp55xx: fix null deref on firmware load failure
	perf report: Add s390 diagnosic sampling descriptor size
	iwlwifi: pcie: fix emergency path
	ACPI / video: Refactor and fix dmi_is_desktop()
	selftests: skip seccomp get_metadata test if not real root
	kprobes: Prohibit probing on bsearch()
	kprobes: Prohibit probing on RCU debug routine
	netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm
	ARM: 8833/1: Ensure that NEON code always compiles with Clang
	ARM: dts: meson8b: fix the Ethernet data line signals in eth_rgmii_pins
	ALSA: PCM: check if ops are defined before suspending PCM
	ath10k: fix shadow register implementation for WCN3990
	usb: f_fs: Avoid crash due to out-of-scope stack ptr access
	sched/topology: Fix percpu data types in struct sd_data & struct s_data
	bcache: fix input overflow to cache set sysfs file io_error_halflife
	bcache: fix input overflow to sequential_cutoff
	bcache: fix potential div-zero error of writeback_rate_i_term_inverse
	bcache: improve sysfs_strtoul_clamp()
	genirq: Avoid summation loops for /proc/stat
	net: marvell: mvpp2: fix stuck in-band SGMII negotiation
	iw_cxgb4: fix srqidx leak during connection abort
	net: phy: consider latched link-down status in polling mode
	fbdev: fbmem: fix memory access if logo is bigger than the screen
	cdrom: Fix race condition in cdrom_sysctl_register
	drm: rcar-du: add missing of_node_put
	drm/amd/display: Don't re-program planes for DPMS changes
	drm/amd/display: Disconnect mpcc when changing tg
	perf/aux: Make perf_event accessible to setup_aux()
	e1000e: fix cyclic resets at link up with active tx
	e1000e: Exclude device from suspend direct complete optimization
	platform/x86: intel_pmc_core: Fix PCH IP sts reading
	i2c: of: Try to find an I2C adapter matching the parent
	staging: spi: mt7621: Add return code check on device_reset()
	iwlwifi: mvm: fix RFH config command with >=10 CPUs
	ASoC: fsl-asoc-card: fix object reference leaks in fsl_asoc_card_probe
	sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
	efi/memattr: Don't bail on zero VA if it equals the region's PA
	sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
	drm/vkms: Bugfix extra vblank frame
	ARM: dts: lpc32xx: Remove leading 0x and 0s from bindings notation
	efi/arm/arm64: Allow SetVirtualAddressMap() to be omitted
	soc: qcom: gsbi: Fix error handling in gsbi_probe()
	mt7601u: bump supported EEPROM version
	ARM: 8830/1: NOMMU: Toggle only bits in EXC_RETURN we are really care of
	ARM: avoid Cortex-A9 livelock on tight dmb loops
	block, bfq: fix in-service-queue check for queue merging
	bpf: fix missing prototype warnings
	selftests/bpf: skip verifier tests for unsupported program types
	powerpc/64s: Clear on-stack exception marker upon exception return
	cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting
	backlight: pwm_bl: Use gpiod_get_value_cansleep() to get initial state
	tty: increase the default flip buffer limit to 2*640K
	powerpc/pseries: Perform full re-add of CPU for topology update post-migration
	drm/amd/display: Enable vblank interrupt during CRC capture
	ALSA: dice: add support for Solid State Logic Duende Classic/Mini
	usb: dwc3: gadget: Fix OTG events when gadget driver isn't loaded
	platform/x86: intel-hid: Missing power button release on some Dell models
	perf script python: Use PyBytes for attr in trace-event-python
	perf script python: Add trace_context extension module to sys.modules
	media: mt9m111: set initial frame size other than 0x0
	hwrng: virtio - Avoid repeated init of completion
	soc/tegra: fuse: Fix illegal free of IO base address
	HID: intel-ish: ipc: handle PIMR before ish_wakeup also clear PISR busy_clear bit
	f2fs: UBSAN: set boolean value iostat_enable correctly
	hpet: Fix missing '=' character in the __setup() code of hpet_mmap_enable
	cpu/hotplug: Mute hotplug lockdep during init
	dmaengine: imx-dma: fix warning comparison of distinct pointer types
	dmaengine: qcom_hidma: assign channel cookie correctly
	dmaengine: qcom_hidma: initialize tx flags in hidma_prep_dma_*
	netfilter: physdev: relax br_netfilter dependency
	media: rcar-vin: Allow independent VIN link enablement
	media: s5p-jpeg: Check for fmt_ver_flag when doing fmt enumeration
	regulator: act8865: Fix act8600_sudcdc_voltage_ranges setting
	pinctrl: meson: meson8b: add the eth_rxd2 and eth_rxd3 pins
	drm: Auto-set allow_fb_modifiers when given modifiers at plane init
	drm/nouveau: Stop using drm_crtc_force_disable
	x86/build: Specify elf_i386 linker emulation explicitly for i386 objects
	selinux: do not override context on context mounts
	brcmfmac: Use firmware_request_nowarn for the clm_blob
	wlcore: Fix memory leak in case wl12xx_fetch_firmware failure
	x86/build: Mark per-CPU symbols as absolute explicitly for LLD
	drm/fb-helper: fix leaks in error path of drm_fb_helper_fbdev_setup
	clk: meson: clean-up clock registration
	clk: rockchip: fix frac settings of GPLL clock for rk3328
	dmaengine: tegra: avoid overflow of byte tracking
	Input: soc_button_array - fix mapping of the 5th GPIO in a PNP0C40 device
	drm/dp/mst: Configure no_stop_bit correctly for remote i2c xfers
	net: stmmac: Avoid one more sometimes uninitialized Clang warning
	ACPI / video: Extend chassis-type detection with a "Lunch Box" check
	bcache: fix potential div-zero error of writeback_rate_p_term_inverse
	kprobes/x86: Blacklist non-attachable interrupt functions
	Linux 4.19.34

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-04-05 22:43:09 +02:00
Paolo Valente
06666a19d5 block, bfq: fix in-service-queue check for queue merging
[ Upstream commit 058fdecc6de7cdecbf4c59b851e80eb2d6c5295f ]

When a new I/O request arrives for a bfq_queue, say Q, bfq checks
whether that request is close to
(a) the head request of some other queue waiting to be served, or
(b) the last request dispatched for the in-service queue (in case Q
itself is not the in-service queue)

If a queue, say Q2, is found for which the above condition holds, then
bfq merges Q and Q2, to hopefully get a more sequential I/O in the
resulting merged queue, and thus a possibly higher throughput.

Case (b) is checked by comparing the new request for Q with the last
request dispatched, assuming that the latter necessarily belonged to the
in-service queue. Unfortunately, this assumption is no longer always
correct, since commit d0edc2473be9 ("block, bfq: inject other-queue I/O
into seeky idle queues on NCQ flash").

When the assumption does not hold, queues that must not be merged may be
merged, causing unexpected loss of control on per-queue service
guarantees.

This commit solves this problem by adding an extra field, which stores
the actual last request dispatched for the in-service queue, and by
using this new field to correctly check case (b).

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-05 22:33:12 +02:00
Greg Kroah-Hartman
bb418a146a This is the 4.19.31 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlyWhJcACgkQONu9yGCS
 aT6XzxAAzP2QGzC4SVPgcFH1woF/d8Cz0zQ81mLXzjXtEPm39fZCM2hbBnxkXLu1
 peFyrKNk6/c9541D9gsQCQT6Fu+H6u1bJKcIezlKJ2xyB/MsU1hXkjZrTJYW3RRs
 gimy1EGdood2el1ubEBZiaspazoeRzBqtg1Nsmr4V0l+RT8HwtKKw+0+Nxixfp59
 NoVkqTpPI5mL0FiH2R9ogcfg3SvgMZOsOhOBjdPvSjiJJsbvIWcW48MCs95XSUpF
 R+l/fWn+oiFCcIqBaFheujuqZMvVrUHZHaWAPMuoR/c3Cdf0lTBokdv6UM9c0nv3
 61jX5r5ImRI/dfQANN5mbB1YKcs5xOI+I7QZHQ2q4clsWrWyLapXW4clrAZJ6z5t
 UVeVbuLV2y5PL9GJyBcXpyY0BOf4e2gZURaPY3C5McNwgybNoiR0ZePqKb8ZhZyh
 jYOYRoBjJJpZoVTSt6MNX95NTvGaSAtqKMu1s3IeMfpwCfQKBPMOuBHr/dUqSC6I
 U0xxjk/71C15dSPVcTVJT/lmcKc6TXgoagnfbn8GBtDOAjBNsYyUJLQI+db1ERCe
 9MEB9k1Z87ROQ5jQCQmWsewOVAtFZBEvSszFmpKv3zTe8M2oFpXG56zckdiumwHU
 nSfeZTTeWzsFJd30MioEnGYm3ZwKwZx7wi0x4B4WWvBfSpp20Us=
 =xtLx
 -----END PGP SIGNATURE-----

Merge 4.19.31 into android-4.19

Changes in 4.19.31
	media: videobuf2-v4l2: drop WARN_ON in vb2_warn_zero_bytesused()
	9p: use inode->i_lock to protect i_size_write() under 32-bit
	9p/net: fix memory leak in p9_client_create
	ASoC: fsl_esai: fix register setting issue in RIGHT_J mode
	ASoC: codecs: pcm186x: fix wrong usage of DECLARE_TLV_DB_SCALE()
	ASoC: codecs: pcm186x: Fix energysense SLEEP bit
	iio: adc: exynos-adc: Fix NULL pointer exception on unbind
	mei: hbm: clean the feature flags on link reset
	mei: bus: move hw module get/put to probe/release
	stm class: Fix an endless loop in channel allocation
	crypto: caam - fix hash context DMA unmap size
	crypto: ccree - fix missing break in switch statement
	crypto: caam - fixed handling of sg list
	crypto: caam - fix DMA mapping of stack memory
	crypto: ccree - fix free of unallocated mlli buffer
	crypto: ccree - unmap buffer before copying IV
	crypto: ccree - don't copy zero size ciphertext
	crypto: cfb - add missing 'chunksize' property
	crypto: cfb - remove bogus memcpy() with src == dest
	crypto: ahash - fix another early termination in hash walk
	crypto: rockchip - fix scatterlist nents error
	crypto: rockchip - update new iv to device in multiple operations
	drm/imx: ignore plane updates on disabled crtcs
	gpu: ipu-v3: Fix i.MX51 CSI control registers offset
	drm/imx: imx-ldb: add missing of_node_puts
	gpu: ipu-v3: Fix CSI offsets for imx53
	ASoC: rt5682: Correct the setting while select ASRC clk for AD/DA filter
	clocksource: timer-ti-dm: Fix pwm dmtimer usage of fck reparenting
	KVM: arm/arm64: vgic: Make vgic_dist->lpi_list_lock a raw_spinlock
	arm64: dts: rockchip: fix graph_port warning on rk3399 bob kevin and excavator
	s390/dasd: fix using offset into zero size array error
	Input: pwm-vibra - prevent unbalanced regulator
	Input: pwm-vibra - stop regulator after disabling pwm, not before
	ARM: dts: Configure clock parent for pwm vibra
	ARM: OMAP2+: Variable "reg" in function omap4_dsi_mux_pads() could be uninitialized
	ASoC: dapm: fix out-of-bounds accesses to DAPM lookup tables
	ASoC: rsnd: fixup rsnd_ssi_master_clk_start() user count check
	KVM: arm/arm64: Reset the VCPU without preemption and vcpu state loaded
	arm/arm64: KVM: Allow a VCPU to fully reset itself
	arm/arm64: KVM: Don't panic on failure to properly reset system registers
	KVM: arm/arm64: vgic: Always initialize the group of private IRQs
	KVM: arm64: Forbid kprobing of the VHE world-switch code
	ASoC: samsung: Prevent clk_get_rate() calls in atomic context
	ARM: OMAP2+: fix lack of timer interrupts on CPU1 after hotplug
	Input: cap11xx - switch to using set_brightness_blocking()
	Input: ps2-gpio - flush TX work when closing port
	Input: matrix_keypad - use flush_delayed_work()
	mac80211: call drv_ibss_join() on restart
	mac80211: Fix Tx aggregation session tear down with ITXQs
	netfilter: compat: initialize all fields in xt_init
	blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue
	ipvs: fix dependency on nf_defrag_ipv6
	floppy: check_events callback should not return a negative number
	xprtrdma: Make sure Send CQ is allocated on an existing compvec
	NFS: Don't use page_file_mapping after removing the page
	mm/gup: fix gup_pmd_range() for dax
	Revert "mm: use early_pfn_to_nid in page_ext_init"
	scsi: qla2xxx: Fix panic from use after free in qla2x00_async_tm_cmd
	net: dsa: bcm_sf2: potential array overflow in bcm_sf2_sw_suspend()
	x86/CPU: Add Icelake model number
	mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
	net: hns: Fix object reference leaks in hns_dsaf_roce_reset()
	i2c: cadence: Fix the hold bit setting
	i2c: bcm2835: Clear current buffer pointers and counts after a transfer
	auxdisplay: ht16k33: fix potential user-after-free on module unload
	Input: st-keyscan - fix potential zalloc NULL dereference
	clk: sunxi-ng: v3s: Fix TCON reset de-assert bit
	kallsyms: Handle too long symbols in kallsyms.c
	clk: sunxi: A31: Fix wrong AHB gate number
	esp: Skip TX bytes accounting when sending from a request socket
	ARM: 8824/1: fix a migrating irq bug when hotplug cpu
	bpf: only adjust gso_size on bytestream protocols
	bpf: fix lockdep false positive in stackmap
	af_key: unconditionally clone on broadcast
	ARM: 8835/1: dma-mapping: Clear DMA ops on teardown
	assoc_array: Fix shortcut creation
	keys: Fix dependency loop between construction record and auth key
	scsi: libiscsi: Fix race between iscsi_xmit_task and iscsi_complete_task
	net: systemport: Fix reception of BPDUs
	net: dsa: bcm_sf2: Do not assume DSA master supports WoL
	pinctrl: meson: meson8b: fix the sdxc_a data 1..3 pins
	qmi_wwan: apply SET_DTR quirk to Sierra WP7607
	net: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe()
	xfrm: Fix inbound traffic via XFRM interfaces across network namespaces
	mailbox: bcm-flexrm-mailbox: Fix FlexRM ring flush timeout issue
	ASoC: topology: free created components in tplg load error
	qed: Fix iWARP buffer size provided for syn packet processing.
	qed: Fix iWARP syn packet mac address validation.
	ARM: dts: armada-xp: fix Armada XP boards NAND description
	arm64: Relax GIC version check during early boot
	ARM: tegra: Restore DT ABI on Tegra124 Chromebooks
	net: marvell: mvneta: fix DMA debug warning
	mm: handle lru_add_drain_all for UP properly
	tmpfs: fix link accounting when a tmpfile is linked in
	ixgbe: fix older devices that do not support IXGBE_MRQC_L3L4TXSWEN
	ARCv2: lib: memcpy: fix doing prefetchw outside of buffer
	ARC: uacces: remove lp_start, lp_end from clobber list
	ARCv2: support manual regfile save on interrupts
	ARCv2: don't assume core 0x54 has dual issue
	phonet: fix building with clang
	mac80211_hwsim: propagate genlmsg_reply return code
	bpf, lpm: fix lookup bug in map_delete_elem
	net: thunderx: make CFG_DONE message to run through generic send-ack sequence
	net: thunderx: add nicvf_send_msg_to_pf result check for set_rx_mode_task
	nfp: bpf: fix code-gen bug on BPF_ALU | BPF_XOR | BPF_K
	nfp: bpf: fix ALU32 high bits clearance bug
	bnxt_en: Fix typo in firmware message timeout logic.
	bnxt_en: Wait longer for the firmware message response to complete.
	net: set static variable an initial value in atl2_probe()
	selftests: fib_tests: sleep after changing carrier. again.
	tmpfs: fix uninitialized return value in shmem_link
	stm class: Prevent division by zero
	nfit: acpi_nfit_ctl(): Check out_obj->type in the right place
	acpi/nfit: Fix bus command validation
	nfit/ars: Attempt a short-ARS whenever the ARS state is idle at boot
	nfit/ars: Attempt short-ARS even in the no_init_ars case
	libnvdimm/label: Clear 'updating' flag after label-set update
	libnvdimm, pfn: Fix over-trim in trim_pfn_device()
	libnvdimm/pmem: Honor force_raw for legacy pmem regions
	libnvdimm: Fix altmap reservation size calculation
	fix cgroup_do_mount() handling of failure exits
	crypto: aead - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
	crypto: aegis - fix handling chunked inputs
	crypto: arm/crct10dif - revert to C code for short inputs
	crypto: arm64/aes-neonbs - fix returning final keystream block
	crypto: arm64/crct10dif - revert to C code for short inputs
	crypto: hash - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
	crypto: morus - fix handling chunked inputs
	crypto: pcbc - remove bogus memcpy()s with src == dest
	crypto: skcipher - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
	crypto: testmgr - skip crc32c context test for ahash algorithms
	crypto: x86/aegis - fix handling chunked inputs and MAY_SLEEP
	crypto: x86/aesni-gcm - fix crash on empty plaintext
	crypto: x86/morus - fix handling chunked inputs and MAY_SLEEP
	crypto: arm64/aes-ccm - fix logical bug in AAD MAC handling
	crypto: arm64/aes-ccm - fix bugs in non-NEON fallback routine
	CIFS: Do not reset lease state to NONE on lease break
	CIFS: Do not skip SMB2 message IDs on send failures
	CIFS: Fix read after write for files with read caching
	tracing: Use strncpy instead of memcpy for string keys in hist triggers
	tracing: Do not free iter->trace in fail path of tracing_open_pipe()
	tracing/perf: Use strndup_user() instead of buggy open-coded version
	xen: fix dom0 boot on huge systems
	ACPI / device_sysfs: Avoid OF modalias creation for removed device
	mmc: sdhci-esdhc-imx: fix HS400 timing issue
	mmc:fix a bug when max_discard is 0
	netfilter: ipt_CLUSTERIP: fix warning unused variable cn
	spi: ti-qspi: Fix mmap read when more than one CS in use
	spi: pxa2xx: Setup maximum supported DMA transfer length
	regulator: s2mps11: Fix steps for buck7, buck8 and LDO35
	regulator: max77620: Initialize values for DT properties
	regulator: s2mpa01: Fix step values for some LDOs
	clocksource/drivers/exynos_mct: Move one-shot check from tick clear to ISR
	clocksource/drivers/exynos_mct: Clear timer interrupt when shutdown
	clocksource/drivers/arch_timer: Workaround for Allwinner A64 timer instability
	s390/setup: fix early warning messages
	s390/virtio: handle find on invalid queue gracefully
	scsi: virtio_scsi: don't send sc payload with tmfs
	scsi: aacraid: Fix performance issue on logical drives
	scsi: sd: Optimal I/O size should be a multiple of physical block size
	scsi: target/iscsi: Avoid iscsit_release_commands_from_conn() deadlock
	scsi: qla2xxx: Fix LUN discovery if loop id is not assigned yet by firmware
	fs/devpts: always delete dcache dentry-s in dput()
	splice: don't merge into linked buffers
	ovl: During copy up, first copy up data and then xattrs
	ovl: Do not lose security.capability xattr over metadata file copy-up
	m68k: Add -ffreestanding to CFLAGS
	Btrfs: setup a nofs context for memory allocation at btrfs_create_tree()
	Btrfs: setup a nofs context for memory allocation at __btrfs_set_acl
	btrfs: ensure that a DUP or RAID1 block group has exactly two stripes
	Btrfs: fix corruption reading shared and compressed extents after hole punching
	soc: qcom: rpmh: Avoid accessing freed memory from batch API
	libertas_tf: don't set URB_ZERO_PACKET on IN USB transfer
	irqchip/gic-v3-its: Avoid parsing _indirect_ twice for Device table
	irqchip/brcmstb-l2: Use _irqsave locking variants in non-interrupt code
	x86/kprobes: Prohibit probing on optprobe template code
	cpufreq: kryo: Release OPP tables on module removal
	cpufreq: tegra124: add missing of_node_put()
	cpufreq: pxa2xx: remove incorrect __init annotation
	ext4: fix check of inode in swap_inode_boot_loader
	ext4: cleanup pagecache before swap i_data
	ext4: update quota information while swapping boot loader inode
	ext4: add mask of ext4 flags to swap
	ext4: fix crash during online resizing
	PCI/ASPM: Use LTR if already enabled by platform
	PCI/DPC: Fix print AER status in DPC event handling
	PCI: dwc: skip MSI init if MSIs have been explicitly disabled
	IB/hfi1: Close race condition on user context disable and close
	cxl: Wrap iterations over afu slices inside 'afu_list_lock'
	ext2: Fix underflow in ext2_max_size()
	clk: uniphier: Fix update register for CPU-gear
	clk: clk-twl6040: Fix imprecise external abort for pdmclk
	clk: samsung: exynos5: Fix possible NULL pointer exception on platform_device_alloc() failure
	clk: samsung: exynos5: Fix kfree() of const memory on setting driver_override
	clk: ingenic: Fix round_rate misbehaving with non-integer dividers
	clk: ingenic: Fix doc of ingenic_cgu_div_info
	usb: chipidea: tegra: Fix missed ci_hdrc_remove_device()
	usb: typec: tps6598x: handle block writes separately with plain-I2C adapters
	dmaengine: usb-dmac: Make DMAC system sleep callbacks explicit
	mm: hwpoison: fix thp split handing in soft_offline_in_use_page()
	mm/vmalloc: fix size check for remap_vmalloc_range_partial()
	mm/memory.c: do_fault: avoid usage of stale vm_area_struct
	kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv
	device property: Fix the length used in PROPERTY_ENTRY_STRING()
	intel_th: Don't reference unassigned outputs
	parport_pc: fix find_superio io compare code, should use equal test.
	i2c: tegra: fix maximum transfer size
	media: i2c: ov5640: Fix post-reset delay
	gpio: pca953x: Fix dereference of irq data in shutdown
	can: flexcan: FLEXCAN_IFLAG_MB: add () around macro argument
	drm/i915: Relax mmap VMA check
	bpf: only test gso type on gso packets
	serial: uartps: Fix stuck ISR if RX disabled with non-empty FIFO
	serial: 8250_of: assume reg-shift of 2 for mrvl,mmp-uart
	serial: 8250_pci: Fix number of ports for ACCES serial cards
	serial: 8250_pci: Have ACCES cards that use the four port Pericom PI7C9X7954 chip use the pci_pericom_setup()
	jbd2: clear dirty flag when revoking a buffer from an older transaction
	jbd2: fix compile warning when using JBUFFER_TRACE
	selinux: add the missing walk_size + len check in selinux_sctp_bind_connect
	security/selinux: fix SECURITY_LSM_NATIVE_LABELS on reused superblock
	powerpc/32: Clear on-stack exception marker upon exception return
	powerpc/wii: properly disable use of BATs when requested.
	powerpc/powernv: Make opal log only readable by root
	powerpc/83xx: Also save/restore SPRG4-7 during suspend
	powerpc/powernv: Don't reprogram SLW image on every KVM guest entry/exit
	powerpc: Fix 32-bit KVM-PR lockup and host crash with MacOS guest
	powerpc/ptrace: Simplify vr_get/set() to avoid GCC warning
	powerpc/hugetlb: Don't do runtime allocation of 16G pages in LPAR configuration
	powerpc/traps: fix recoverability of machine check handling on book3s/32
	powerpc/traps: Fix the message printed when stack overflows
	ARM: s3c24xx: Fix boolean expressions in osiris_dvs_notify
	arm64: Fix HCR.TGE status for NMI contexts
	arm64: debug: Ensure debug handlers check triggering exception level
	arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
	ipmi_si: fix use-after-free of resource->name
	dm: fix to_sector() for 32bit
	dm integrity: limit the rate of error messages
	mfd: sm501: Fix potential NULL pointer dereference
	cpcap-charger: generate events for userspace
	NFS: Fix I/O request leakages
	NFS: Fix an I/O request leakage in nfs_do_recoalesce
	NFS: Don't recoalesce on error in nfs_pageio_complete_mirror()
	nfsd: fix performance-limiting session calculation
	nfsd: fix memory corruption caused by readdir
	nfsd: fix wrong check in write_v4_end_grace()
	NFSv4.1: Reinitialise sequence results before retransmitting a request
	svcrpc: fix UDP on servers with lots of threads
	PM / wakeup: Rework wakeup source timer cancellation
	bcache: never writeback a discard operation
	stable-kernel-rules.rst: add link to networking patch queue
	vt: perform safe console erase in the right order
	x86/unwind/orc: Fix ORC unwind table alignment
	perf intel-pt: Fix CYC timestamp calculation after OVF
	perf tools: Fix split_kallsyms_for_kcore() for trampoline symbols
	perf auxtrace: Define auxtrace record alignment
	perf intel-pt: Fix overlap calculation for padding
	perf/x86/intel/uncore: Fix client IMC events return huge result
	perf intel-pt: Fix divide by zero when TSC is not available
	md: Fix failed allocation of md_register_thread
	tpm/tpm_crb: Avoid unaligned reads in crb_recv()
	tpm: Unify the send callback behaviour
	rcu: Do RCU GP kthread self-wakeup from softirq and interrupt
	media: imx: prpencvf: Stop upstream before disabling IDMA channel
	media: lgdt330x: fix lock status reporting
	media: uvcvideo: Avoid NULL pointer dereference at the end of streaming
	media: vimc: Add vimc-streamer for stream control
	media: imx: csi: Disable CSI immediately after last EOF
	media: imx: csi: Stop upstream before disabling IDMA channel
	drm/fb-helper: generic: Fix drm_fbdev_client_restore()
	drm/radeon/evergreen_cs: fix missing break in switch statement
	drm/amd/powerplay: correct power reading on fiji
	drm/amd/display: don't call dm_pp_ function from an fpu block
	KVM: Call kvm_arch_memslots_updated() before updating memslots
	KVM: x86/mmu: Detect MMIO generation wrap in any address space
	KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux
	KVM: nVMX: Sign extend displacements of VMX instr's mem operands
	KVM: nVMX: Apply addr size mask to effective address for VMX instructions
	KVM: nVMX: Ignore limit checks on VMX instructions using flat segments
	bcache: use (REQ_META|REQ_PRIO) to indicate bio for metadata
	s390/setup: fix boot crash for machine without EDAT-1
	Linux 4.19.31

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-03-23 21:13:30 +01:00
Jianchao Wang
29452f665c blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue
[ Upstream commit aef1897cd36dcf5e296f1d2bae7e0d268561b685 ]

When requeue, if RQF_DONTPREP, rq has contained some driver
specific data, so insert it to hctx dispatch list to avoid any
merge. Take scsi as example, here is the trace event log (no
io scheduler, because RQF_STARTED would prevent merging),

   kworker/0:1H-339   [000] ...1  2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H]
scsi_inert_test-1987  [000] ....  2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test]
scsi_inert_test-1987  [000] ...2  2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test]
   kworker/0:1H-339   [000] ....  2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H]
scsi_inert_test-1996  [000] ..s1  2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0]
scsi_inert_test-1996  [000] .Ns1  2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0]
   kworker/0:1H-339   [000] ...1  2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
   kworker/0:1H-339   [000] ...1  2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
scsi_inert_test-1986  [000] ..s1  2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0]

(32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP.
Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP,
the sdb only contained the part of (32768 + 8), then only that part
was completed. The lucky thing was that scsi_io_completion detected
it and requeued the remaining part. So we didn't get corrupted data.
However, the requeue of (32776 + 8) is not expected.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23 20:09:45 +01:00
Johannes Weiner
2ba18b41d3 BACKPORT: sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD
There are several definitions of those functions/macros in places that
mess with fixed-point load averages.  Provide an official version.

[akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 8508cf3ffad4defa202b303e5b6379efc4cd9054)

Conflicts:
        block/blk-iolatency.c

(1. manual merge to replace stat->rqs.mean with stat.mean)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I716b4874491cff75a2355c6d95c64cf02d05e7ee
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:26 -07:00
Liu Bo
6d482bc569 blk-iolatency: fix IO hang due to negative inflight counter
[ Upstream commit 8c772a9bfc7c07c76f4a58b58910452fbb20843b ]

Our test reported the following stack, and vmcore showed that
->inflight counter is -1.

[ffffc9003fcc38d0] __schedule at ffffffff8173d95d
[ffffc9003fcc3958] schedule at ffffffff8173de26
[ffffc9003fcc3970] io_schedule at ffffffff810bb6b6
[ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb
[ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3
[ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a
[ffffc9003fcc3b08] generic_make_request at ffffffff81368b49
[ffffc9003fcc3b68] submit_bio at ffffffff81368d7d
[ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4]
[ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4]
[ffffc9003fcc3d68] do_writepages at ffffffff811c49ae
[ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188
[ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301
[ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4]
[ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b
[ffffc9003fcc3ee8] do_fsync at ffffffff81285abd
[ffffc9003fcc3f18] sys_fsync at ffffffff81285d50
[ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04
[ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e

The ->inflight counter may be negative (-1) if

1) blk-iolatency was disabled when the IO was issued,

2) blk-iolatency was enabled before this IO reached its endio,

3) the ->inflight counter is decreased from 0 to -1 in endio()

In fact the hang can be easily reproduced by the below script,

H=/sys/fs/cgroup/unified/
P=/sys/fs/cgroup/unified/test

echo "+io" > $H/cgroup.subtree_control
mkdir -p $P

echo $$ > $P/cgroup.procs

xfs_io -f -d -c "pwrite 0 4k" /dev/sdg

echo "`cat /sys/block/sdg/dev` target=1000000" > $P/io.latency

xfs_io -f -d -c "pwrite 0 4k" /dev/sdg

This fixes the problem by freezing the queue so that while
enabling/disabling iolatency, there is no inflight rq running.

Note that quiesce_queue is not needed as this only updating iolatency
configuration about which dispatching request_queue doesn't care.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13 14:02:38 -07:00
Jianchao Wang
80c8452ad4 blk-mq: fix a hung issue when fsync
[ Upstream commit 85bd6e61f34dffa8ec2dc75ff3c02ee7b2f1cbce ]

Florian reported a io hung issue when fsync(). It should be
triggered by following race condition.

data + post flush         a flush

blk_flush_complete_seq
  case REQ_FSEQ_DATA
    blk_flush_queue_rq
    issued to driver      blk_mq_dispatch_rq_list
                            try to issue a flush req
                            failed due to NON-NCQ command
                            .queue_rq return BLK_STS_DEV_RESOURCE

request completion
  req->end_io // doesn't check RESTART
  mq_flush_data_end_io
    case REQ_FSEQ_POSTFLUSH
      blk_kick_flush
        do nothing because previous flush
        has not been completed
     blk_mq_run_hw_queue
                              insert rq to hctx->dispatch
                              due to RESTART is still set, do nothing

To fix this, replace the blk_mq_run_hw_queue in mq_flush_data_end_io
with blk_mq_sched_restart to check and clear the RESTART flag.

Fixes: bd166ef1 (blk-mq-sched: add framework for MQ capable IO schedulers)
Reported-by: Florian Stecker <m19@florianstecker.de>
Tested-by: Florian Stecker <m19@florianstecker.de>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-20 10:25:36 +01:00
Yufen Yu
4cc66cc4f8 block: use rcu_work instead of call_rcu to avoid sleep in softirq
commit 94a2c3a32b62e868dc1e3d854326745a7f1b8c7a upstream.

We recently got a stack by syzkaller like this:

BUG: sleeping function called from invalid context at mm/slab.h:361
in_atomic(): 1, irqs_disabled(): 0, pid: 6644, name: blkid
INFO: lockdep is turned off.
CPU: 1 PID: 6644 Comm: blkid Not tainted 4.4.163-514.55.6.9.x86_64+ #76
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
 0000000000000000 5ba6a6b879e50c00 ffff8801f6b07b10 ffffffff81cb2194
 0000000041b58ab3 ffffffff833c7745 ffffffff81cb2080 5ba6a6b879e50c00
 0000000000000000 0000000000000001 0000000000000004 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff81cb2194>] __dump_stack lib/dump_stack.c:15 [inline]
 <IRQ>  [<ffffffff81cb2194>] dump_stack+0x114/0x1a0 lib/dump_stack.c:51
 [<ffffffff8129a981>] ___might_sleep+0x291/0x490 kernel/sched/core.c:7675
 [<ffffffff8129ac33>] __might_sleep+0xb3/0x270 kernel/sched/core.c:7637
 [<ffffffff81794c13>] slab_pre_alloc_hook mm/slab.h:361 [inline]
 [<ffffffff81794c13>] slab_alloc_node mm/slub.c:2610 [inline]
 [<ffffffff81794c13>] slab_alloc mm/slub.c:2692 [inline]
 [<ffffffff81794c13>] kmem_cache_alloc_trace+0x2c3/0x5c0 mm/slub.c:2709
 [<ffffffff81cbe9a7>] kmalloc include/linux/slab.h:479 [inline]
 [<ffffffff81cbe9a7>] kzalloc include/linux/slab.h:623 [inline]
 [<ffffffff81cbe9a7>] kobject_uevent_env+0x2c7/0x1150 lib/kobject_uevent.c:227
 [<ffffffff81cbf84f>] kobject_uevent+0x1f/0x30 lib/kobject_uevent.c:374
 [<ffffffff81cbb5b9>] kobject_cleanup lib/kobject.c:633 [inline]
 [<ffffffff81cbb5b9>] kobject_release+0x229/0x440 lib/kobject.c:675
 [<ffffffff81cbb0a2>] kref_sub include/linux/kref.h:73 [inline]
 [<ffffffff81cbb0a2>] kref_put include/linux/kref.h:98 [inline]
 [<ffffffff81cbb0a2>] kobject_put+0x72/0xd0 lib/kobject.c:692
 [<ffffffff8216f095>] put_device+0x25/0x30 drivers/base/core.c:1237
 [<ffffffff81c4cc34>] delete_partition_rcu_cb+0x1d4/0x2f0 block/partition-generic.c:232
 [<ffffffff813c08bc>] __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
 [<ffffffff813c08bc>] rcu_do_batch kernel/rcu/tree.c:2705 [inline]
 [<ffffffff813c08bc>] invoke_rcu_callbacks kernel/rcu/tree.c:2973 [inline]
 [<ffffffff813c08bc>] __rcu_process_callbacks kernel/rcu/tree.c:2940 [inline]
 [<ffffffff813c08bc>] rcu_process_callbacks+0x59c/0x1c70 kernel/rcu/tree.c:2957
 [<ffffffff8120f509>] __do_softirq+0x299/0xe20 kernel/softirq.c:273
 [<ffffffff81210496>] invoke_softirq kernel/softirq.c:350 [inline]
 [<ffffffff81210496>] irq_exit+0x216/0x2c0 kernel/softirq.c:391
 [<ffffffff82c2cd7b>] exiting_irq arch/x86/include/asm/apic.h:652 [inline]
 [<ffffffff82c2cd7b>] smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926
 [<ffffffff82c2bc25>] apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:746
 <EOI>  [<ffffffff814cbf40>] ? audit_kill_trees+0x180/0x180
 [<ffffffff8187d2f7>] fd_install+0x57/0x80 fs/file.c:626
 [<ffffffff8180989e>] do_sys_open+0x45e/0x550 fs/open.c:1043
 [<ffffffff818099c2>] SYSC_open fs/open.c:1055 [inline]
 [<ffffffff818099c2>] SyS_open+0x32/0x40 fs/open.c:1050
 [<ffffffff82c299e1>] entry_SYSCALL_64_fastpath+0x1e/0x9a

In softirq context, we call rcu callback function delete_partition_rcu_cb(),
which may allocate memory by kzalloc with GFP_KERNEL flag. If the
allocation cannot be satisfied, it may sleep. However, That is not allowed
in softirq contex.

Although we found this problem on linux 4.4, the latest kernel version
seems to have this problem as well. And it is very similar to the
previous one:
	https://lkml.org/lkml/2018/7/9/391

Fix it by using RCU workqueue, which allows sleep.

Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22 21:40:35 +01:00
Damien Le Moal
6353c0a037 block: mq-deadline: Fix write completion handling
commit 7211aef86f79583e59b88a0aba0bc830566f7e8e upstream.

For a zoned block device using mq-deadline, if a write request for a
zone is received while another write was already dispatched for the same
zone, dd_dispatch_request() will return NULL and the newly inserted
write request is kept in the scheduler queue waiting for the ongoing
zone write to complete. With this behavior, when no other request has
been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty
and blk_mq_sched_mark_restart_hctx() not called. This in turn leads to
__blk_mq_free_request() call of blk_mq_sched_restart() to not run the
queue when the already dispatched write request completes. The newly
dispatched request stays stuck in the scheduler queue until eventually
another request is submitted.

This problem does not affect SCSI disk as the SCSI stack handles queue
restart on request completion. However, this problem is can be triggered
the nullblk driver with zoned mode enabled.

Fix this by always requesting a queue restart in dd_dispatch_request()
if no request was dispatched while WRITE requests are queued.

Fixes: 5700f69178 ("mq-deadline: Introduce zone locking support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Add missing export of blk_mq_sched_restart()

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-13 09:51:07 +01:00
Ming Lei
69e9b2858b block: deactivate blk_stat timer in wbt_disable_default()
commit 544fbd16a461a318cd80537d1331c0df5c6cf930 upstream.

rwb_enabled() can't be changed when there is any inflight IO.

wbt_disable_default() may set rwb->wb_normal as zero, however the
blk_stat timer may still be pending, and the timer function will update
wrb->wb_normal again.

This patch introduces blk_stat_deactivate() and applies it in
wbt_disable_default(), then the following IO hang triggered when running
parted & switching io scheduler can be fixed:

[  369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
[  369.938941]       Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
[  369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  369.940768] parted          D    0  3645   3239 0x00000000
[  369.941500] Call Trace:
[  369.941874]  ? __schedule+0x6d9/0x74c
[  369.942392]  ? wbt_done+0x5e/0x5e
[  369.942864]  ? wbt_cleanup_cb+0x16/0x16
[  369.943404]  ? wbt_done+0x5e/0x5e
[  369.943874]  schedule+0x67/0x78
[  369.944298]  io_schedule+0x12/0x33
[  369.944771]  rq_qos_wait+0xb5/0x119
[  369.945193]  ? karma_partition+0x1c2/0x1c2
[  369.945691]  ? wbt_cleanup_cb+0x16/0x16
[  369.946151]  wbt_wait+0x85/0xb6
[  369.946540]  __rq_qos_throttle+0x23/0x2f
[  369.947014]  blk_mq_make_request+0xe6/0x40a
[  369.947518]  generic_make_request+0x192/0x2fe
[  369.948042]  ? submit_bio+0x103/0x11f
[  369.948486]  ? __radix_tree_lookup+0x35/0xb5
[  369.949011]  submit_bio+0x103/0x11f
[  369.949436]  ? blkg_lookup_slowpath+0x25/0x44
[  369.949962]  submit_bio_wait+0x53/0x7f
[  369.950469]  blkdev_issue_flush+0x8a/0xae
[  369.951032]  blkdev_fsync+0x2f/0x3a
[  369.951502]  do_fsync+0x2e/0x47
[  369.951887]  __x64_sys_fsync+0x10/0x13
[  369.952374]  do_syscall_64+0x89/0x149
[  369.952819]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  369.953492] RIP: 0033:0x7f95a1e729d4
[  369.953996] Code: Bad RIP value.
[  369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[  369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
[  369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
[  369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
[  369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
[  369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008

Cc: stable@vger.kernel.org
Cc: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-13 09:51:06 +01:00
Keith Busch
7290c71ded block/bio: Do not zero user pages
commit f55adad601c6a97c8c9628195453e0fb23b4a0ae upstream.

We don't need to zero fill the bio if not using kernel allocated pages.

Fixes: f3587d76da05 ("block: Clear kernel memory before copying to user") # v4.20-rc2
Reported-by: Todd Aiken <taiken@mvtech.ca>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: stable@vger.kernel.org
Cc: Bart Van Assche <bvanassche@acm.org>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-19 19:19:50 +01:00
Jens Axboe
55cbeea76e blk-mq: punt failed direct issue to dispatch list
commit c616cbee97aed4bc6178f148a7240206dcdb85a6 upstream.

After the direct dispatch corruption fix, we permanently disallow direct
dispatch of non read/write requests. This works fine off the normal IO
path, as they will be retried like any other failed direct dispatch
request. But for the blk_insert_cloned_request() that only DM uses to
bypass the bottom level scheduler, we always first attempt direct
dispatch. For some types of requests, that's now a permanent failure,
and no amount of retrying will make that succeed. This results in a
livelock.

Instead of making special cases for what we can direct issue, and now
having to deal with DM solving the livelock while still retaining a BUSY
condition feedback loop, always just add a request that has been through
->queue_rq() to the hardware queue dispatch list. These are safe to use
as no merging can take place there. Additionally, if requests do have
prepped data from drivers, we aren't dependent on them not sharing space
in the request structure to safely add them to the IO scheduler lists.

This basically reverts ffe81d45322c and is based on a patch from Ming,
but with the list insert case covered as well.

Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
Cc: stable@vger.kernel.org
Suggested-by: Ming Lei <ming.lei@redhat.com>
Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-08 12:59:10 +01:00
Jens Axboe
724ff9cbfe blk-mq: fix corruption with direct issue
commit ffe81d45322cc3cb140f0db080a4727ea284661e upstream.

If we attempt a direct issue to a SCSI device, and it returns BUSY, then
we queue the request up normally. However, the SCSI layer may have
already setup SG tables etc for this particular command. If we later
merge with this request, then the old tables are no longer valid. Once
we issue the IO, we only read/write the original part of the request,
not the new state of it.

This causes data corruption, and is most often noticed with the file
system complaining about the just read data being invalid:

[  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)

because most of it is garbage...

This doesn't happen from the normal issue path, as we will simply defer
the request to the hardware queue dispatch list if we fail. Once it's on
the dispatch list, we never merge with it.

Fix this from the direct issue path by flagging the request as
REQ_NOMERGE so we don't change the size of it before issue.

See also:
  https://bugzilla.kernel.org/show_bug.cgi?id=201685

Tested-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 6ce3dd6eec ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-08 12:59:06 +01:00
Hannes Reinecke
487d58a9c3 block: copy ioprio in __bio_clone_fast() and bounce
[ Upstream commit ca474b73896bf6e0c1eb8787eb217b0f80221610 ]

We need to copy the io priority, too; otherwise the clone will run
with a different priority than the original one.

Fixes: 43b62ce3ff ("block: move bio io prio to a new field")
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jean Delvare <jdelvare@suse.de>

Fixed up subject, and ordered stores.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-01 09:37:32 +01:00
Keith Busch
a4da95ea10 block: Clear kernel memory before copying to user
[ Upstream commit f3587d76da05f68098ddb1cb3c98cc6a9e8a402c ]

If the kernel allocates a bounce buffer for user read data, this memory
needs to be cleared before copying it to the user, otherwise it may leak
kernel memory to user space.

Laurence Oberman <loberman@redhat.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-11-27 16:13:05 +01:00
Ming Lei
410306a0f2 SCSI: fix queue cleanup race before queue initialization is done
commit 8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a upstream.

c2856ae2f3 ("blk-mq: quiesce queue before freeing queue") has
already fixed this race, however the implied synchronize_rcu()
in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused
performance regression.

Then 1311326cf4 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
tried to quiesce queue for avoiding unnecessary synchronize_rcu()
only when queue initialization is done, because it is usual to see
lots of inexistent LUNs which need to be probed.

However, turns out it isn't safe to quiesce queue only when queue
initialization is done. Because when one SCSI command is completed,
the user of sending command can be waken up immediately, then the
scsi device may be removed, meantime the run queue in scsi_end_request()
is still in-progress, so kernel panic can be caused.

In Red Hat QE lab, there are several reports about this kind of kernel
panic triggered during kernel booting.

This patch tries to address the issue by grabing one queue usage
counter during freeing one request and the following run queue.

Fixes: 1311326cf4 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
Cc: Andrew Jones <drjones@redhat.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
Cc: stable <stable@vger.kernel.org>
Cc: jianchao.wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21 09:19:18 +01:00
Paolo Valente
668c01c11b block, bfq: correctly charge and reset entity service in all cases
[ Upstream commit cbeb869a3d1110450186b738199963c5e68c2a71 ]

BFQ schedules entities (which represent either per-process queues or
groups of queues) as a function of their timestamps. In particular, as
a function of their (virtual) finish times. The finish time of an
entity is computed as a function of the budget assigned to the entity,
assuming, tentatively, that the entity, once in service, will receive
an amount of service equal to its budget. Then, when the entity is
expired because it finishes to be served, this finish time is updated
as a function of the actual service received by the entity. This
allows the entity to be correctly charged with only the service
received, and then to be correctly re-scheduled.

Yet an entity may receive service also while not being the entity in
service (in the scheduling environment of its parent entity), for
several reasons. If the entity remains with no backlog while receiving
this 'unofficial' service, then it is expired. Also on such an
expiration, the finish time of the entity should be updated to account
for only the service actually received by the entity. Unfortunately,
such an update is not performed for an entity expiring without being
the entity in service.

In a similar vein, the service counter of the entity in service is
reset when the entity is expired, to be ready to be used for next
service cycle. This reset too should be performed also in case an
entity is expired because it remains empty after receiving service
while not being the entity in service. But in this case the reset is
not performed.

This commit performs the above update of the finish time and reset of
the service received, also for an entity expiring while not being the
entity in service.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13 11:08:28 -08:00
Ming Lei
1ea5c403dd block: make sure writesame bio is aligned with logical block size
commit 34ffec60b27aa81d04e274e71e4c6ef740f75fc7 upstream.

Obviously the created writesame bio has to be aligned with logical block
size, and use bio_allowed_max_sectors() to retrieve this number.

Cc: stable@vger.kernel.org
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Fixes: b49a0871be ("block: remove split code in blkdev_issue_{discard,write_same}")
Tested-by: Rui Salvaterra <rsalvaterra@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13 11:08:16 -08:00
Ming Lei
14657efd3a block: make sure discard bio is aligned with logical block size
commit 1adfc5e4136f5967d591c399aff95b3b035f16b7 upstream.

Obviously the created discard bio has to be aligned with logical block size.

This patch introduces the helper of bio_allowed_max_sectors() for
this purpose.

Cc: stable@vger.kernel.org
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Fixes: 744889b7cb ("block: don't deal with discard limit in blkdev_issue_discard()")
Fixes: a22c4d7e34 ("block: re-add discard_granularity and alignment checks")
Reported-by: Rui Salvaterra <rsalvaterra@gmail.com>
Tested-by: Rui Salvaterra <rsalvaterra@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13 11:08:16 -08:00
Jens Axboe
cf8d0973c2 block: setup bounce bio_sets properly
commit 52990a5fb0c991ecafebdab43138b5ed41376852 upstream.

We're only setting up the bounce bio sets if we happen
to need bouncing for regular HIGHMEM, not if we only need
it for ISA devices.

Protect the ISA bounce setup with a mutex, since it's
being invoked from driver init functions and can thus be
called in parallel.

Cc: stable@vger.kernel.org
Reported-by: Ondrej Zary <linux@rainbow-software.org>
Tested-by: Ondrej Zary <linux@rainbow-software.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13 11:08:16 -08:00
Ming Lei
744889b7cb block: don't deal with discard limit in blkdev_issue_discard()
blk_queue_split() does respect this limit via bio splitting, so no
need to do that in blkdev_issue_discard(), then we can align to
normal bio submit(bio_add_page() & submit_bio()).

More importantly, this patch fixes one issue introduced in a22c4d7e34
("block: re-add discard_granularity and alignment checks"), in which
zero discard bio may be generated in case of zero alignment.

Fixes: a22c4d7e34 ("block: re-add discard_granularity and alignment checks")
Cc: stable@vger.kernel.org
Cc: Ming Lin <ming.l@ssi.samsung.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Xiao Ni <xni@redhat.com>
Tested-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-18 07:23:40 -06:00
Josef Bacik
5e65a20341 blk-wbt: wake up all when we scale up, not down
Tetsuo brought to my attention that I screwed up the scale_up/scale_down
helpers when I factored out the rq-qos code.  We need to wake up all the
waiters when we add slots for requests to make, not when we shrink the
slots.  Otherwise we'll end up things waiting forever.  This was a
mistake and simply puts everything back the way it was.

cc: stable@vger.kernel.org
Fixes: a79050434b ("blk-rq-qos: refactor out common elements of blk-wbt")
eported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-11 13:31:28 -06:00
Ilya Dryomov
587562d0c7 blk-mq: I/O and timer unplugs are inverted in blktrace
trace_block_unplug() takes true for explicit unplugs and false for
implicit unplugs.  schedule() unplugs are implicit and should be
reported as timer unplugs.  While correct in the legacy code, this has
been inverted in blk-mq since 4.11.

Cc: stable@vger.kernel.org
Fixes: bd166ef183 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-27 13:12:44 -06:00
Damien Le Moal
854f31ccdd block: fix deadline elevator drain for zoned block devices
When the deadline scheduler is used with a zoned block device, writes
to a zone will be dispatched one at a time. This causes the warning
message:

deadline: forced dispatching is broken (nr_sorted=X), please report this

to be displayed when switching to another elevator with the legacy I/O
path while write requests to a zone are being retained in the scheduler
queue.

Prevent this message from being displayed when executing
elv_drain_elevator() for a zoned block device. __blk_drain_queue() will
loop until all writes are dispatched and completed, resulting in the
desired elevator queue drain without extensive modifications to the
deadline code itself to handle forced-dispatch calls.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Fixes: 8dc8146f9c ("deadline-iosched: Introduce zone locking support")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26 19:57:24 -06:00
Keith Busch
530ca2c9bd blk-mq: Allow blocking queue tag iter callbacks
A recent commit runs tag iterator callbacks under the rcu read lock,
but existing callbacks do not satisfy the non-blocking requirement.
The commit intended to prevent an iterator from accessing a queue that's
being modified. This patch fixes the original issue by taking a queue
reference instead of reading it, which allows callbacks to make blocking
calls.

Fixes: f5bbbbe4d6 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter")
Acked-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-25 20:17:59 -06:00
Omar Sandoval
b57e99b4b8 block: use nanosecond resolution for iostat
Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
updating properly on 4.18. This is because we started using ktime to
track elapsed time, and we convert nanoseconds to jiffies when we update
the partition counter. However, this gets rounded down, so any I/Os that
take less than a jiffy are not accounted for. Previously in this case,
the value of jiffies would sometimes increment while we were doing I/O,
so at least some I/Os were accounted for.

Let's convert the stats to use nanoseconds internally. We still report
milliseconds as before, now more accurately than ever. The value is
still truncated to 32 bits for backwards compatibility.

Fixes: 522a777566 ("block: consolidate struct request timestamp fields")
Cc: stable@vger.kernel.org
Reported-by: Klaus Kusche <klaus.kusche@computerix.info>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:26:59 -06:00
Jens Axboe
01c5f85aeb blk-cgroup: increase number of supported policies
After merging the iolatency policy, we potentially now have 4 policies
being registered, but only support 3. This causes one of them to fail
loading. Takashi reports that BFQ no longer works for him, because it
fails to load due to policy registration failure.

Bump to 5 policies, and also add a warning for when we have exceeded
the global amount. If we have to touch this again, we should switch
to a dynamic scheme instead.

Reported-by: Takashi Iwai <tiwai@suse.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-11 10:59:53 -06:00
Konstantin Khlebnikov
d5274b3cd6 block: bfq: swap puts in bfqg_and_blkg_put
Fix trivial use-after-free. This could be last reference to bfqg.

Fixes: 8f9bebc33d ("block, bfq: access and cache blkg data only when safe")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-06 11:32:58 -06:00
Mikulas Patocka
8b2ded1c94 block: don't warn when doing fsync on read-only devices
It is possible to call fsync on a read-only handle (for example, fsck.ext2
does it when doing read-only check), and this call results in kernel
warning.

The patch b089cfd95d ("block: don't warn for flush on read-only device")
attempted to disable the warning, but it is buggy and it doesn't
(op_is_flush tests flags, but bio_op strips off the flags).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 721c7fc701 ("block: fail op_is_write() requests to read-only partitions")
Cc: stable@vger.kernel.org	# 4.18
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-05 16:14:36 -06:00
Dennis Zhou (Facebook)
3111885015 blkcg: use tryget logic when associating a blkg with a bio
There is a very small change a bio gets caught up in a really
unfortunate race between a task migration, cgroup exiting, and itself
trying to associate with a blkg. This is due to css offlining being
performed after the css->refcnt is killed which triggers removal of
blkgs that reach their blkg->refcnt of 0.

To avoid this, association with a blkg should use tryget and fallback to
using the root_blkg.

Fixes: 08e18eab0c ("block: add bi_blkg to the bio for cgroups")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-31 14:48:58 -06:00
Dennis Zhou (Facebook)
59b57717ff blkcg: delay blkg destruction until after writeback has finished
Currently, blkcg destruction relies on a sequence of events:
  1. Destruction starts. blkcg_css_offline() is called and blkgs
     release their reference to the blkcg. This immediately destroys
     the cgwbs (writeback).
  2. With blkgs giving up their reference, the blkcg ref count should
     become zero and eventually call blkcg_css_free() which finally
     frees the blkcg.

Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
on the completion of all writeback associated with the blkcg. A count of
the number of cgwbs is maintained and once that goes to zero, blkg
destruction can follow. This should prevent premature blkg destruction
related to writeback.

The new process for blkcg cleanup is as follows:
  1. Destruction starts. blkcg_css_offline() is called which offlines
     writeback. Blkg destruction is delayed on the cgwb_refcnt count to
     avoid punting potentially large amounts of outstanding writeback
     to root while maintaining any ongoing policies. Here, the base
     cgwb_refcnt is put back.
  2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
     and handles destruction of blkgs. This is where the css reference
     held by each blkg is released.
  3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
     This finally frees the blkg.

It seems in the past blk-throttle didn't do the most understandable
things with taking data from a blkg while associating with current. So,
the simplification and unification of what blk-throttle is doing caused
this.

Fixes: 08e18eab0c ("block: add bi_blkg to the bio for cgroups")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-31 14:48:56 -06:00
Dennis Zhou (Facebook)
6b06546206 Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()"
This reverts commit 4c6994806f.

Destroying blkgs is tricky because of the nature of the relationship. A
blkg should go away when either a blkcg or a request_queue goes away.
However, blkg's pin the blkcg to ensure they remain valid. To break this
cycle, when a blkcg is offlined, blkgs put back their css ref. This
eventually lets css_free() get called which frees the blkcg.

The above commit (4c6994806f) breaks this order of events by trying to
destroy blkgs in css_free(). As the blkgs still hold references to the
blkcg, css_free() is never called.

The race between blkcg_bio_issue_check() and cgroup_rmdir() will be
addressed in the following patch by delaying destruction of a blkg until
all writeback associated with the blkcg has been finished.

Fixes: 4c6994806f ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-31 14:48:54 -06:00
John Pittman
db193954ed block: bsg: move atomic_t ref_count variable to refcount API
Currently, variable ref_count within the bsg_device struct is of
type atomic_t.  For variables being used as reference counters,
the refcount API should be used instead of atomic.  The newer
refcount API works to prevent counter overflows and use-after-free
bugs.  So, move this varable from the atomic API to refcount,
potentially avoiding the issues mentioned.

Signed-off-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 19:17:02 -06:00
Chengguang Xu
62d2a19407 block: remove unnecessary condition check
kmem_cache_destroy() can handle NULL pointer correctly, so there is
no need to check e->icq_cache before calling kmem_cache_destroy().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 19:16:06 -06:00
Jens Axboe
b0a84beb2e blk-wbt: remove dead code
We already note and mark discard and swap IO from bio_to_wbt_flags().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 13:32:12 -06:00
Jens Axboe
38cfb5a45e blk-wbt: improve waking of tasks
We have two potential issues:

1) After commit 2887e41b91, we only wake one process at the time when
   we finish an IO. We really want to wake up as many tasks as can
   queue IO. Before this commit, we woke up everyone, which could cause
   a thundering herd issue.

2) A task can potentially consume two wakeups, causing us to (in
   practice) miss a wakeup.

Fix both by providing our own wakeup function, which stops
__wake_up_common() from waking up more tasks if we fail to get a
queueing token. With the strict ordering we have on the wait list, this
wakes the right tasks and the right amount of tasks.

Based on a patch from Jianchao Wang <jianchao.w.wang@oracle.com>.

Tested-by: Agarwal, Anchal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 11:27:26 -06:00
Jens Axboe
061a542753 blk-wbt: abstract out end IO completion handler
Prep patch for calling the handler from a different context,
no functional changes in this patch.

Tested-by: Agarwal, Anchal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 11:27:24 -06:00
Jens Axboe
c125311d96 blk-wbt: don't maintain inflight counts if disabled
A previous commit removed the ability to have per-rq flags. We used
those flags to maintain inflight counts. Since we don't have those
anymore, we have to always maintain inflight counts, even if wbt is
disabled. This is clearly suboptimal.

Add a queue quiesce around changing the wbt latency settings from sysfs
to work around this. With that, we can reliably put the enabled check in
our bio_to_wbt_flags(), since we know the WBT_TRACKED flag will be
consistent for the lifetime of the request.

Fixes: c1c80384c8 ("block: remove external dependency on wbt_flags")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-23 09:34:46 -06:00
Jens Axboe
c45e6a037a blk-wbt: fix has-sleeper queueing check
We need to do this inside the loop as well, or we can allow new
IO to supersede previous IO.

Tested-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-22 15:07:32 -06:00
Jens Axboe
b78820937b blk-wbt: use wq_has_sleeper() for wq active check
We need the memory barrier before checking the list head,
use the appropriate helper for this. The matching queue
side memory barrier is provided by set_current_state().

Tested-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-22 15:07:31 -06:00
Jens Axboe
ffa358dcaa blk-wbt: move disable check into get_limit()
Check it in one place, instead of in multiple places.

Tested-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-22 15:07:31 -06:00
Linus Torvalds
5bed49adfe for-4.19/post-20180822
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlt9on8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpj1xEADKBmJlV9aVyxc5w6XggqAGeHqI4afFrl+v
 9fW6WUQMAaBUrr7PMIEJQ0Zm4B7KxgBaEWNtuuj4ULkjpgYm2AuGUuTJSyKz41rS
 Ma+KNyCA2Zmq4SvwGFbcdCuCbUqnoxTycscAgCjuDvIYLW0+nFGNc47ibmu9lZIV
 33Ef5LrxuCjhC2zyNxEdWpUxDCjoYzock85LW+wYyIYLU9uKdoExS+YmT8U+ebA/
 AkXBcxPztNDxwkcsIwgGVoTjwxiowqGz3uueWfyEmYgaCPiNOsxkoNQAtjX4ykQE
 hnqnHWyzJkRwbYo7Vd/bRAZXvszKGYE1YcJmu5QrNf0dK5MSq2o5OYJAEJWbucPj
 m0R2u7O9qbS2JEnxGrm5+oYJwBzwNY5/Lajr15WkljTqobKnqcvn/Hdgz/XdGtek
 0S1QHkkBsF7e+cax8sePWK+O3ilY7pl9CzyZKB/tJngl8A45Jv8xVojg0v3O7oS+
 zZib0rwWg/bwR/uN6uPCDcEsQusqL5YovB7m6NRVshwz6cV1zVNp2q+iOulk7KuC
 MprW4Du9CJf8HA19XtyJIG1XLstnuz+Exy+i5BiimUJ5InoEFDuj/6OZa6Qaczbo
 SrDDvpGtSf4h7czKpE5kV4uZiTOrjuI30TrI+4csdZ7HQIlboxNL72seNTLJs55F
 nbLjRM8L6g==
 =FS7e
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:

 - Set of bcache fixes and changes (Coly)

 - The flush warn fix (me)

 - Small series of BFQ fixes (Paolo)

 - wbt hang fix (Ming)

 - blktrace fix (Steven)

 - blk-mq hardware queue count update fix (Jianchao)

 - Various little fixes

* tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits)
  block/DAC960.c: make some arrays static const, shrinks object size
  blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
  blk-mq: init hctx sched after update ctx and hctx mapping
  block: remove duplicate initialization
  tracing/blktrace: Fix to allow setting same value
  pktcdvd: fix setting of 'ret' error return for a few cases
  block: change return type to bool
  block, bfq: return nbytes and not zero from struct cftype .write() method
  block, bfq: improve code of bfq_bfqq_charge_time
  block, bfq: reduce write overcharge
  block, bfq: always update the budget of an entity when needed
  block, bfq: readd missing reset of parent-entity service
  blk-wbt: fix IO hang in wbt_wait()
  block: don't warn for flush on read-only device
  bcache: add the missing comments for smp_mb()/smp_wmb()
  bcache: remove unnecessary space before ioctl function pointer arguments
  bcache: add missing SPDX header
  bcache: move open brace at end of function definitions to next line
  bcache: add static const prefix to char * array declarations
  bcache: fix code comments style
  ...
2018-08-22 13:38:05 -07:00
Jianchao Wang
f5bbbbe4d6 blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
For blk-mq, part_in_flight/rw will invoke blk_mq_in_flight/rw to
account the inflight requests. It will access the queue_hw_ctx and
nr_hw_queues w/o any protection. When updating nr_hw_queues and
blk_mq_in_flight/rw occur concurrently, panic comes up.

Before update nr_hw_queues, the q will be frozen. So we could use
q_usage_counter to avoid the race. percpu_ref_is_zero is used here
so that we will not miss any in-flight request. The access to
nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are
under rcu critical section, __blk_mq_update_nr_hw_queues could use
synchronize_rcu to ensure the zeroed q_usage_counter to be globally
visible.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-21 09:02:56 -06:00
Jianchao Wang
d48ece209f blk-mq: init hctx sched after update ctx and hctx mapping
Currently, when update nr_hw_queues, IO scheduler's init_hctx will
be invoked before the mapping between ctx and hctx is adapted
correctly by blk_mq_map_swqueue. The IO scheduler init_hctx (kyber)
may depend on this mapping and get wrong result and panic finally.
A simply way to fix this is that switch the IO scheduler to 'none'
before update the nr_hw_queues, and then switch it back after
update nr_hw_queues. blk_mq_sched_init_/exit_hctx are removed due
to nobody use them any more.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-21 09:02:55 -06:00
Chaitanya Kulkarni
fcedba42d9 block: remove duplicate initialization
This patch removes the duplicate initialization of q->queue_head
in the blk_alloc_queue_node(). This removes the 2nd initialization
so that we preserve the initialization order same as declaration
present in struct request_queue.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-17 12:33:36 -06:00
Chengguang Xu
599d067dd3 block: change return type to bool
Because blk_do_io_stat() only does a judgement about the request
contributes to IO statistics, it better changes return type to bool.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:44:17 -06:00
Maciej S. Szmigiero
fc8ebd01de block, bfq: return nbytes and not zero from struct cftype .write() method
The value that struct cftype .write() method returns is then directly
returned to userspace as the value returned by write() syscall, so it
should be the number of bytes actually written (or consumed) and not zero.

Returning zero from write() syscall makes programs like /bin/echo or bash
spin.

Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Fixes: e21b7a0b98 ("block, bfq: add full hierarchical scheduling and cgroups support")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:11:16 -06:00
Paolo Valente
f812164869 block, bfq: improve code of bfq_bfqq_charge_time
bfq_bfqq_charge_time contains some lengthy and redundant code. This
commit trims and condenses that code.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:08:15 -06:00
Paolo Valente
d5801088a7 block, bfq: reduce write overcharge
When a sync request is dispatched, the queue that contains that
request, and all the ancestor entities of that queue, are charged with
the number of sectors of the request. In constrast, if the request is
async, then the queue and its ancestor entities are charged with the
number of sectors of the request, multiplied by an overcharge
factor. This throttles the bandwidth for async I/O, w.r.t. to sync
I/O, and it is done to counter the tendency of async writes to steal
I/O throughput to reads.

On the opposite end, the lower this parameter, the stabler I/O
control, in the following respect.  The lower this parameter is, the
less the bandwidth enjoyed by a group decreases
- when the group does writes, w.r.t. to when it does reads;
- when other groups do reads, w.r.t. to when they do writes.

The fixes "block, bfq: always update the budget of an entity when
needed" and "block, bfq: readd missing reset of parent-entity service"
improved I/O control in bfq to such an extent that it has been
possible to revise this overcharge factor downwards.  This commit
introduces the resulting, new value.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:08:13 -06:00
Paolo Valente
e02a0aa26b block, bfq: always update the budget of an entity when needed
When the next child entity to serve changes for a given parent entity,
the budget of that parent entity must be updated accordingly.
Unfortunately, this update is not performed, by mistake, for the
entities that happen to switch from having no child entity to serve,
to having one child entity to serve.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:08:12 -06:00
Paolo Valente
8a511ba5fe block, bfq: readd missing reset of parent-entity service
The received-service counter needs to be equal to 0 when an entity is
set in service. Unfortunately, commit "block, bfq: fix service being
wrongly set to zero in case of preemption" mistakenly removed the
resetting of this counter for the parent entities of the bfq_queue
being set in service. This commit fixes this issue by resetting
service for parent entities, directly on the expiration of the
in-service bfq_queue.

Fixes: 9fae8dd59f ("block, bfq: fix service being wrongly set to zero in case of preemption")
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:08:10 -06:00
Linus Torvalds
73ba2fb33c for-4.19/block-20180812
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltwvasQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpv65EACTq5gSLnJBI6ZPr1RAHruVDnjfzO2Veitl
 tUtjm0XfWmnEiwQ3dYvnyhy99xbyaG3900d9BClCTlH6xaUdSiQkDpcKG/R2F36J
 5mZitYukQcpFAQJWF8YKsTTE7JPl4VglCIDqYiC4+C3rOSVi8lrKn2qp4J4MMCFn
 thRg3jCcq7c5s9Eigsop1pXWQSasubkXfk55Krcp4oybKYpYRKXXf74Mj14QAbwJ
 QHN3VisyAUWoBRg7UQZo1Npe2oPk6bbnJypnjf8M0M2EnlvddEkIlHob91sodka8
 6p4APOEu5cbyXOBCAQsw/koff14mb8aEadqeQA68WvXfIdX9ZjfxCX0OoC3sBEXk
 yqJhZ0C980AM13zIBD8ejv4uasGcPca8W+47mE5P8sRiI++5kBsFWDZPCtUBna0X
 2Kh24NsmEya9XRR5vsB84dsIPQ3tLMkxg/IgQRVDaSnfJz0c/+zm54xDyKRaFT4l
 5iERk2WSkm9+8jNfVmWG0edrv6nRAXjpGwFfOCPh6/LCSCi4xQRULYN7sVzsX8ZK
 FRjt24HftBI8mJbh4BtweJvg+ppVe1gAk3IO3HvxAQhv29Hz+uvFYe9kL+3N8LJA
 Qosr9n9O4+wKYizJcDnw+5iPqCHfAwOm9th4pyedR+R7SmNcP3yNC8AbbheNBiF5
 Zolos5H+JA==
 =b9ib
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "First pull request for this merge window, there will also be a
  followup request with some stragglers.

  This pull request contains:

   - Fix for a thundering heard issue in the wbt block code (Anchal
     Agarwal)

   - A few NVMe pull requests:
      * Improved tracepoints (Keith)
      * Larger inline data support for RDMA (Steve Wise)
      * RDMA setup/teardown fixes (Sagi)
      * Effects log suppor for NVMe target (Chaitanya Kulkarni)
      * Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
      * TP4004 (ANA) support (Christoph)
      * Various NVMe fixes

   - Block io-latency controller support. Much needed support for
     properly containing block devices. (Josef)

   - Series improving how we handle sense information on the stack
     (Kees)

   - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

   - Zoned device support for null_blk (Matias)

   - AIX partition fixes (Mauricio Faria de Oliveira)

   - DIF checksum code made generic (Max Gurtovoy)

   - Add support for discard in iostats (Michael Callahan / Tejun)

   - Set of updates for BFQ (Paolo)

   - Removal of async write support for bsg (Christoph)

   - Bio page dirtying and clone fixups (Christoph)

   - Set of bcache fix/changes (via Coly)

   - Series improving blk-mq queue setup/teardown speed (Ming)

   - Series improving merging performance on blk-mq (Ming)

   - Lots of other fixes and cleanups from a slew of folks"

* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
  blkcg: Make blkg_root_lookup() work for queues in bypass mode
  bcache: fix error setting writeback_rate through sysfs interface
  null_blk: add lock drop/acquire annotation
  Blk-throttle: reduce tail io latency when iops limit is enforced
  block: paride: pd: mark expected switch fall-throughs
  block: Ensure that a request queue is dissociated from the cgroup controller
  block: Introduce blk_exit_queue()
  blkcg: Introduce blkg_root_lookup()
  block: Remove two superfluous #include directives
  blk-mq: count the hctx as active before allocating tag
  block: bvec_nr_vecs() returns value for wrong slab
  bcache: trivial - remove tailing backslash in macro BTREE_FLAG
  bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
  bcache: set max writeback rate when I/O request is idle
  bcache: add code comments for bset.c
  bcache: fix mistaken comments in request.c
  bcache: fix mistaken code comments in bcache.h
  bcache: add a comment in super.c
  bcache: avoid unncessary cache prefetch bch_btree_node_get()
  bcache: display rate debug parameters to 0 when writeback is not running
  ...
2018-08-14 10:23:25 -07:00
Ming Lei
df60f6e835 blk-wbt: fix IO hang in wbt_wait()
On wbt invariant is that if one IO is tracked via WBT_TRACKED, rqw->inflight
should be updated for tracking this IO.

But commit c1c80384c8 ("block: remove external dependency on wbt_flags")
forgets to remove the early handling of !rwb_enabled(rwb) inside wbt_wait(),
then the inflight counter may not be increased in wbt_wait(), but decreased
in wbt_done() for this kind of IO, so this counter may become negative, then
wbt_wait() may wait forever.

This patch fixes the report in the following link:

	https://marc.info/?l=linux-block&m=153221542021033&w=2

Fixes: c1c80384c8 ("block: remove external dependency on wbt_flags")
Cc: Josef Bacik <jbacik@fb.com>
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-14 11:05:52 -06:00
Jens Axboe
b089cfd95d block: don't warn for flush on read-only device
Don't warn for a flush issued to a read-only device. It's not strictly
a writable command, as it doesn't change any on-media data by itself.

Reported-by: Stefan Agner <stefan@agner.ch>
Fixes: 721c7fc701 ("block: fail op_is_write() requests to read-only partitions")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-14 10:52:40 -06:00
Bart Van Assche
b86d865cb1 blkcg: Make blkg_root_lookup() work for queues in bypass mode
For legacy queues the only call of blkg_root_lookup() happens after
bypass mode has been enabled. Since blkg_lookup() returns NULL for
queues in bypass mode, modify the blkg_root_lookup() such that it
no longer depends on bypass mode. Rename the function into
blk_queue_root_blkg() as suggested by Tejun.

Suggested-by: Tejun Heo <tj@kernel.org>
Fixes: 6bad9b210a ("blkcg: Introduce blkg_root_lookup()")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:41:25 -06:00
Liu Bo
991f61fe7e Blk-throttle: reduce tail io latency when iops limit is enforced
When an application's iops has exceeded its cgroup's iops limit, surely it
is throttled and kernel will set a timer for dispatching, thus IO latency
includes the delay.

However, the dispatch delay which is calculated by the limit and the
elapsed jiffies is suboptimal.  As the dispatch delay is only calculated
once the application's iops is (iops limit + 1), it doesn't need to wait
any longer than the remaining time of the current slice.

The difference can be proved by the following fio job and cgroup iops
setting,
-----
$ echo 4 > /mnt/config/nullb/disk1/mbps    # limit nullb's bandwidth to 4MB/s for testing.
$ echo "253:1 riops=100 rbps=max" > /sys/fs/cgroup/unified/cg1/io.max
$ cat r2.job
[global]
name=fio-rand-read
filename=/dev/nullb1
rw=randread
bs=4k
direct=1
numjobs=1
time_based=1
runtime=60
group_reporting=1

[file1]
size=4G
ioengine=libaio
iodepth=1
rate_iops=50000
norandommap=1
thinktime=4ms
-----

wo patch:
file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7-66-gedfc
Starting 1 process

   read: IOPS=99, BW=400KiB/s (410kB/s)(23.4MiB/60001msec)
    slat (usec): min=10, max=336, avg=27.71, stdev=17.82
    clat (usec): min=2, max=28887, avg=5929.81, stdev=7374.29
     lat (usec): min=24, max=28901, avg=5958.73, stdev=7366.22
    clat percentiles (usec):
     |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    6], 60.00th=[11731],
     | 70.00th=[11863], 80.00th=[11994], 90.00th=[12911], 95.00th=[22676],
     | 99.00th=[23725], 99.50th=[23987], 99.90th=[23987], 99.95th=[25035],
     | 99.99th=[28967]

w/ patch:
file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7-66-gedfc
Starting 1 process

   read: IOPS=100, BW=400KiB/s (410kB/s)(23.4MiB/60005msec)
    slat (usec): min=10, max=155, avg=23.24, stdev=16.79
    clat (usec): min=2, max=12393, avg=5961.58, stdev=5959.25
     lat (usec): min=23, max=12412, avg=5985.91, stdev=5951.92
    clat percentiles (usec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    5], 50.00th=[   47], 60.00th=[11863],
     | 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994],
     | 99.00th=[11994], 99.50th=[11994], 99.90th=[12125], 99.95th=[12125],
     | 99.99th=[12387]

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 12:43:16 -06:00
Bart Van Assche
24ecc35853 block: Ensure that a request queue is dissociated from the cgroup controller
Several block drivers call alloc_disk() followed by put_disk() if
something fails before device_add_disk() is called without calling
blk_cleanup_queue(). Make sure that also for this scenario a request
queue is dissociated from the cgroup controller. This patch avoids
that loading the parport_pc, paride and pf drivers triggers the
following kernel crash:

BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
Read of size 4 at addr 0000000000000008 by task modprobe/744
Call Trace:
dump_stack+0x9a/0xeb
kasan_report+0x139/0x350
pi_init+0x42e/0x580 [paride]
pf_init+0x2bb/0x1000 [pf]
do_one_initcall+0x8e/0x405
do_init_module+0xd9/0x2f2
load_module+0x3ab4/0x4700
SYSC_finit_module+0x176/0x1a0
do_syscall_64+0xee/0x2b0
entry_SYSCALL_64_after_hwframe+0x42/0xb7

Reported-by: Alexandru Moise <00moses.alexander00@gmail.com>
Fixes: a063057d7c ("block: Fix a race between request queue removal and the block cgroup controller") # v4.17
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Alexandru Moise <00moses.alexander00@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alexandru Moise <00moses.alexander00@gmail.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 09:13:00 -06:00
Bart Van Assche
4cf6324b17 block: Introduce blk_exit_queue()
This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Alexandru Moise <00moses.alexander00@gmail.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 09:12:59 -06:00
Jianchao Wang
d263ed9926 blk-mq: count the hctx as active before allocating tag
Currently, we count the hctx as active after allocate driver tag
successfully. If a previously inactive hctx try to get tag first
time, it may fails and need to wait. However, due to the stale tag
->active_queues, the other shared-tags users are still able to
occupy all driver tags while there is someone waiting for tag.
Consequently, even if the previously inactive hctx is waked up, it
still may not be able to get a tag and could be starved.

To fix it, we count the hctx as active before try to allocate driver
tag, then when it is waiting the tag, the other shared-tag users
will reserve budget for it.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:34:17 -06:00
Greg Edwards
d6c02a9beb block: bvec_nr_vecs() returns value for wrong slab
In commit ed996a52c8 ("block: simplify and cleanup bvec pool
handling"), the value of the slab index is incremented by one in
bvec_alloc() after the allocation is done to indicate an index value of
0 does not need to be later freed.

bvec_nr_vecs() was not updated accordingly, and thus returns the wrong
value.  Decrement idx before performing the lookup.

Fixes: ed996a52c8 ("block: simplify and cleanup bvec pool handling")
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:25:04 -06:00
Bart Van Assche
f7ecb1b109 cfq: Suppress compiler warnings about comparisons
This patch does not change any functionality but avoids that gcc
reports the following warnings when building with W=1:

block/cfq-iosched.c: In function ?cfq_back_seek_max_store?:
block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4756:1: note: in expansion of macro ?STORE_FUNCTION?
 STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 ^~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_slice_idle_store?:
block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4759:1: note: in expansion of macro ?STORE_FUNCTION?
 STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
 ^~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_group_idle_store?:
block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4760:1: note: in expansion of macro ?STORE_FUNCTION?
 STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
 ^~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_low_latency_store?:
block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4765:1: note: in expansion of macro ?STORE_FUNCTION?
 STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
 ^~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_slice_idle_us_store?:
block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4782:1: note: in expansion of macro ?USEC_STORE_FUNCTION?
 USEC_STORE_FUNCTION(cfq_slice_idle_us_store, &cfqd->cfq_slice_idle, 0, UINT_MAX);
 ^~~~~~~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_group_idle_us_store?:
block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4783:1: note: in expansion of macro ?USEC_STORE_FUNCTION?
 USEC_STORE_FUNCTION(cfq_group_idle_us_store, &cfqd->cfq_group_idle, 0, UINT_MAX);
 ^~~~~~~~~~~~~~~~~~~

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-07 17:57:13 -06:00
Bart Van Assche
9b4f43460d cfq: Annotate fall-through in a switch statement
This patch avoids that gcc complains about fall-through when building
with W=1.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-07 17:57:11 -06:00
Anchal Agarwal
2887e41b91 blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
I am currently running a large bare metal instance (i3.metal)
on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
4.18 kernel. I have a workload that simulates a database
workload and I am running into lockup issues when writeback
throttling is enabled,with the hung task detector also
kicking in.

Crash dumps show that most CPUs (up to 50 of them) are
all trying to get the wbt wait queue lock while trying to add
themselves to it in __wbt_wait (see stack traces below).

[    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
[    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
[    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
[    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
[    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
[    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
[    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
[    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
[    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
[    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.948138] Call Trace:
[    0.948139]  <IRQ>
[    0.948142]  do_raw_spin_lock+0xad/0xc0
[    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
[    0.948149]  ? __wake_up_common_lock+0x53/0x90
[    0.948150]  __wake_up_common_lock+0x53/0x90
[    0.948155]  wbt_done+0x7b/0xa0
[    0.948158]  blk_mq_free_request+0xb7/0x110
[    0.948161]  __blk_mq_complete_request+0xcb/0x140
[    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
[    0.948169]  nvme_irq+0x23/0x50 [nvme]
[    0.948173]  __handle_irq_event_percpu+0x46/0x300
[    0.948176]  handle_irq_event_percpu+0x20/0x50
[    0.948179]  handle_irq_event+0x34/0x60
[    0.948181]  handle_edge_irq+0x77/0x190
[    0.948185]  handle_irq+0xaf/0x120
[    0.948188]  do_IRQ+0x53/0x110
[    0.948191]  common_interrupt+0x87/0x87
[    0.948192]  </IRQ>
....
[    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
[    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
[    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
[    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
[    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
[    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
[    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
[    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
[    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
[    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
[    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.311154] Call Trace:
[    0.311157]  do_raw_spin_lock+0xad/0xc0
[    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
[    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
[    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
[    0.311167]  wbt_wait+0x127/0x330
[    0.311169]  ? finish_wait+0x80/0x80
[    0.311172]  ? generic_make_request+0xda/0x3b0
[    0.311174]  blk_mq_make_request+0xd6/0x7b0
[    0.311176]  ? blk_queue_enter+0x24/0x260
[    0.311178]  ? generic_make_request+0xda/0x3b0
[    0.311181]  generic_make_request+0x10c/0x3b0
[    0.311183]  ? submit_bio+0x5c/0x110
[    0.311185]  submit_bio+0x5c/0x110
[    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
[    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
[    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
[    0.311229]  ? do_writepages+0x3c/0xd0
[    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
[    0.311240]  do_writepages+0x3c/0xd0
[    0.311243]  ? _raw_spin_unlock+0x24/0x30
[    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
[    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
[    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
[    0.311253]  file_write_and_wait_range+0x34/0x90
[    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
[    0.311267]  do_fsync+0x38/0x60
[    0.311270]  SyS_fsync+0xc/0x10
[    0.311272]  do_syscall_64+0x6f/0x170
[    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

In the original patch, wbt_done is waking up all the exclusive
processes in the wait queue, which can cause a thundering herd
if there is a large number of writer threads in the queue. The
original intention of the code seems to be to wake up one thread
only however, it uses wake_up_all() in __wbt_done(), and then
uses the following check in __wbt_wait to have only one thread
actually get out of the wait loop:

if (waitqueue_active(&rqw->wait) &&
            rqw->wait.head.next != &wait->entry)
                return false;

The problem with this is that the wait entry in wbt_wait is
define with DEFINE_WAIT, which uses the autoremove wakeup function.
That means that the above check is invalid - the wait entry will
have been removed from the queue already by the time we hit the
check in the loop.

Secondly, auto-removing the wait entries also means that the wait
queue essentially gets reordered "randomly" (e.g. threads re-add
themselves in the order they got to run after being woken up).
Additionally, new requests entering wbt_wait might overtake requests
that were queued earlier, because the wait queue will be
(temporarily) empty after the wake_up_all, so the waitqueue_active
check will not stop them. This can cause certain threads to starve
under high load.

The fix is to leave the woken up requests in the queue and remove
them in finish_wait() once the current thread breaks out of the
wait loop in __wbt_wait. This will ensure new requests always
end up at the back of the queue, and they won't overtake requests
that are already in the wait queue. With that change, the loop
in wbt_wait is also in line with many other wait loops in the kernel.
Waking up just one thread drastically reduces lock contention, as
does moving the wait queue add/remove out of the loop.

A significant drop in lockdep's lock contention numbers is seen when
running the test application on the patched kernel.

Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-07 14:40:49 -06:00
Jens Axboe
05b9ba4b55 Linux 4.18-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltU8z0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5X8H/2fJr7m3k242+t76
 sitwvx1eoPqTgryW59dRKm9IuXAGA+AjauvHzaz1QxomeQa50JghGWefD0eiJfkA
 1AphQ/24EOiAbbVk084dAI/C2p122dE4D5Fy7CrfLnuouyrbFaZI5STbnrRct7sR
 9deeYW0GDHO1Uenp4WDCj0baaqJqaevZ+7GG09DnWpya2nQtSkGBjqn6GpYmrfOU
 mqFuxAX8mEOW6cwK16y/vYtnVjuuMAiZ63/OJ8AQ6d6ArGLwAsdn7f8Fn4I4tEr2
 L0d3CRLUyegms4++Dmlu05k64buQu46WlPhjCZc5/Ts4kjrNxBuHejj2/jeSnUSt
 vJJlibI=
 =42a5
 -----END PGP SIGNATURE-----

Merge tag 'v4.18-rc6' into for-4.19/block2

Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.

Signed-of-by: Jens Axboe <axboe@kernel.dk>
2018-08-05 19:32:09 -06:00
Linus Torvalds
a32e236eb9 Partially revert "block: fail op_is_write() requests to read-only partitions"
It turns out that commit 721c7fc701 ("block: fail op_is_write()
requests to read-only partitions"), while obviously correct, causes
problems for some older lvm2 installations.

The reason is that the lvm snapshotting will continue to write to the
snapshow COW volume, even after the volume has been marked read-only.
End result: snapshot failure.

This has actually been fixed in newer version of the lvm2 tool, but the
old tools still exist, and the breakage was reported both in the kernel
bugzilla and in the Debian bugzilla:

  https://bugzilla.kernel.org/show_bug.cgi?id=200439
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900442

The lvm2 fix is here

  https://sourceware.org/git/?p=lvm2.git;a=commit;h=a6fdb9d9d70f51c49ad11a87ab4243344e6701a3

but until everybody has updated to recent versions, we'll have to weaken
the "never write to read-only partitions" check.  It now allows the
write to happen, but causes a warning, something like this:

  generic_make_request: Trying to write to read-only block-device dm-3 (partno X)
  Modules linked in: nf_tables xt_cgroup xt_owner kvm_intel iwlmvm kvm irqbypass iwlwifi
  CPU: 1 PID: 77 Comm: kworker/1:1 Not tainted 4.17.9-gentoo #3
  Hardware name: LENOVO 20B6A019RT/20B6A019RT, BIOS GJET91WW (2.41 ) 09/21/2016
  Workqueue: ksnaphd do_metadata
  RIP: 0010:generic_make_request_checks+0x4ac/0x600
  ...
  Call Trace:
   generic_make_request+0x64/0x400
   submit_bio+0x6c/0x140
   dispatch_io+0x287/0x430
   sync_io+0xc3/0x120
   dm_io+0x1f8/0x220
   do_metadata+0x1d/0x30
   process_one_work+0x1b9/0x3e0
   worker_thread+0x2b/0x3c0
   kthread+0x113/0x130
   ret_from_fork+0x35/0x40

Note that this is a "revert" in behavior only.  I'm leaving alone the
actual code cleanups in commit 721c7fc701, but letting the previously
uncaught request go through with a warning instead of stopping it.

Fixes: 721c7fc701 ("block: fail op_is_write() requests to read-only partitions")
Reported-and-tested-by: WGH <wgh@torlan.ru>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-04 18:19:55 -07:00
Linus Torvalds
71abe04269 for-linus-20180803
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltkcI4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuByD/9Pf8mfsv/WDJposa5pRqnY18dtBNdfHjsb
 t6vuBZiLxe86LDo1Y3e7663u+kgraARyswUpNTv5iAKseVkwIcz0RTArQJkaDLy7
 AD2BEfLvpcacAoWZ96MiayjQabGXEnTWqiKnIlcbc6J4/LIfdYYZHd+SXUXH0R81
 BaXPANL222srxuGbBifsfZ/yXxNPmBYfT1mTaFU6cKAftVO/24RV/MbwkU4cgym7
 0ZM/iBDkFDy6eV3XhhGeX1miDhGa2zNM/NHvsnSsP8jQqlyJW+lpI5dRPEoH1Fzp
 ecpxSH/HtjW5GYMwP7qhHMth/XHhjkOmNDRblkRWYrMGxgy7mAcSR8axdjqTdDxD
 +vlez+b9uPF3qXUk1D2NkW3RmTSkxjV6ztzc+yddbFSzqmQcHSqH+7ohaHnQRefX
 IGngTOq5pSmluNmrul48y/wmkSwVI1/jrteOfJfhJXWL0pY65g+8KWQxLmeZHke7
 ytFu6msOsVmZjDQ7tckSxSLMB+frvzuc/h+eOBTx8ida0bjopVlfNUcLQRD4kZfy
 xHn6nPxGsdiq2PM71nxYvqoUfmZnIE3o0A9+IB4OIp5bLVAGt2/fbWV3llunGK0A
 JM8UmyOysh71xh288VyE/hPz/7tyea/KgTaTm0jdTkTeoHH2isHsgWWvYB3zjeEn
 /1c7o2T13Q==
 =Ml3z
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180803' of git://git.kernel.dk/linux-block

Pull block fix from Jens Axboe:
 "Just a single fix, from Ming, fixing a regression in this cycle where
  the busy tag iteration was changed to only calling the callback
  function for requests that are started. We really want all non-free
  requests.

  This fixes a boot regression on certain VM setups"

* tag 'for-linus-20180803' of git://git.kernel.dk/linux-block:
  blk-mq: fix blk_mq_tagset_busy_iter
2018-08-03 10:43:56 -07:00
Ming Lei
2d5ba0e2de blk-mq: fix blk_mq_tagset_busy_iter
Commit d250bf4e776ff09d5("blk-mq: only iterate over inflight requests
in blk_mq_tagset_busy_iter") uses 'blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT'
to replace 'blk_mq_request_started(req)', this way is wrong, and causes
lots of test system hang during booting.

Fix the issue by using blk_mq_request_started(req) inside bt_tags_iter().

Fixes: d250bf4e77 ("blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter")
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: Matt Hart <matthew.hart@linaro.org>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: John Garry <john.garry@huawei.com>
Cc: Hannes Reinecke <hare@suse.com>,
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: linux-scsi@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Mark Brown <broonie@kernel.org>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-02 14:47:20 -06:00
Ming Lei
75d6e175fc blk-mq: fix updating tags depth
The passed 'nr' from userspace represents the total depth, meantime
inside 'struct blk_mq_tags', 'nr_tags' stores the total tag depth,
and 'nr_reserved_tags' stores the reserved part.

There are two issues in blk_mq_tag_update_depth() now:

1) for growing tags, we should have used the passed 'nr', and keep the
number of reserved tags not changed.

2) the passed 'nr' should have been used for checking against
'tags->nr_tags', instead of number of the normal part.

This patch fixes the above two cases, and avoids kernel crash caused
by wrong resizing sbitmap queue.

Cc: "Ewan D. Milne" <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Omar Sandoval <osandov@fb.com>
Tested by: Marco Patalano <mpatalan@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-02 14:41:58 -06:00
Ming Lei
b233f12704 block: really disable runtime-pm for blk-mq
Runtime PM isn't ready for blk-mq yet, and commit 765e40b675 ("block:
disable runtime-pm for blk-mq") tried to disable it. Unfortunately,
it can't take effect in that way since user space still can switch
it on via 'echo auto > /sys/block/sdN/device/power/control'.

This patch disables runtime-pm for blk-mq really by pm_runtime_disable()
and fixes all kinds of PM related kernel crash.

Cc: Tomas Janousek <tomi@nomi.cz>
Cc: Przemek Socha <soprwa@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: <stable@vger.kernel.org>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-02 10:36:02 -06:00
Dennis Zhou (Facebook)
c480bcf97b block: make iolatency avg_lat exponentially decay
Currently, avg_lat is calculated by accumulating the mean of every
window in a long running cumulative average. As time goes on, the metric
becomes less and less useful due to the accumulated history.

This patch reuses the same calculation done in load averages to make the
avg_lat metric more lively. Unlike load averages, the avg only advances
when a window elapses (due to an io). Idle periods extend the most
recent window. Bucketing is used to limit the history of avg_lat by
binding it to the window size. So, the window range for 1/exp (decay
rate) is [1 min, 2.5 min) when windows elapse immediately.

The current sample window size is exposed in the debug info to enable
calculation of the window range.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-02 09:58:14 -06:00
Josef Bacik
cc7ecc2585 blk-cgroup: hold the queue ref during throttling
The blkg lifetime is protected by the queue lifetime, so we need to put
the queue _after_ we're done using the blkg.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-01 09:16:03 -06:00
Josef Bacik
52a1199ccd blk-iolatency: fix blkg leak in timer_fn
At this point we have a ref on the blkg, we need to drop it if we don't
have a iolat.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-01 09:16:01 -06:00
zhong jiang
4725549192 block/bsg-lib: use PTR_ERR_OR_ZERO to simplify the flow path
Simplify the code by using the PTR_ERR_OR_ZERO, instead of the
open code. It is better.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-01 09:13:03 -06:00
xiao jin
54648cf1ec block: blk_init_allocated_queue() set q->fq as NULL in the fail case
We find the memory use-after-free issue in __blk_drain_queue()
on the kernel 4.14. After read the latest kernel 4.18-rc6 we
think it has the same problem.

Memory is allocated for q->fq in the blk_init_allocated_queue().
If the elevator init function called with error return, it will
run into the fail case to free the q->fq.

Then the __blk_drain_queue() uses the same memory after the free
of the q->fq, it will lead to the unpredictable event.

The patch is to set q->fq as NULL in the fail case of
blk_init_allocated_queue().

Fixes: commit 7c94e1c157 ("block: introduce blk_flush_queue to drive flush machinery")
Cc: <stable@vger.kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: xiao jin <jin.xiao@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-30 08:28:39 -06:00
Max Gurtovoy
10c41ddd61 block: move dif_prepare/dif_complete functions to block layer
Currently these functions are implemented in the scsi layer, but their
actual place should be the block layer since T10-PI is a general data
integrity feature that is used in the nvme protocol as well. Also, use
the tuple size from the integrity profile since it may vary between
integrity types.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-30 08:27:02 -06:00
Linus Torvalds
eb181a814c for-linus-20180727
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltbc20QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgh4D/9GYQcjk9qLVFxkv5ucAUvCuxEL6gjsMf4W
 M/QdxVIrwh3zpvsH++2IXXn+xH+UjujMA5NkzhsSr4+hsSO2iAGOYMJbroNfhsTD
 onvQQ6NTaHPu/+PZs0otVK4KMWHwZGWOV6YU00TWTfRgzRmGEsSMe91oeBIXVv9w
 v6d09twaLSY0lUkAAbcdu5fuFBtXu4Bxy60qyHEKkAdWWHEUYaZLrODhVjoGg2V4
 KdAWS5X4A6kJMcPcoOvG6RFtpf71boaip9o/DRLUWhGdIQnI38UgSCUmz1XMYnik
 Sq8r74vqCm8IhIOLTlxnPrMHHbKv7JZhY3Ow9fxnS6HZRNI0aPX31Yml6NULqnWh
 MsQh+6gZXd3xC1O7txEQn4a15Lk0OLXa8HJcIn5ADNxqz5/r/g0mPUG9HmPSIalO
 ISFF/9UKQFcAd0RjHR+bEEH2VMznz59UWKfdOsmwFZtZSCmR1ucj0xAKDj+oP1JS
 ZsgZ09K2GezrL4GEueocISo9ACIWgDWH8T7/bTxlBok0IYbybAfmOe+MZInL1Tf4
 pklmoXm3ntgV3Pq8Ptk05LYyIgAaUIltuSiR3AFaXIADX0wNtV0ZgysIWgHf3BSA
 18j+I1yPG1IwBdM8xNwxi56xMQR84uY5tsIyafbfj+laRI2nH5OIYjNZnrKpm957
 4xZUgIECBA==
 =2ogY
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180727' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Bigger than usual at this time, mostly due to the O_DIRECT corruption
  issue and the fact that I was on vacation last week. This contains:

   - NVMe pull request with two fixes for the FC code, and two target
     fixes (Christoph)

   - a DIF bio reset iteration fix (Greg Edwards)

   - two nbd reply and requeue fixes (Josef)

   - SCSI timeout fixup (Keith)

   - a small series that fixes an issue with bio_iov_iter_get_pages(),
     which ended up causing corruption for larger sized O_DIRECT writes
     that ended up racing with buffered writes (Martin Wilck)"

* tag 'for-linus-20180727' of git://git.kernel.dk/linux-block:
  block: reset bi_iter.bi_done after splitting bio
  block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
  blkdev: __blkdev_direct_IO_simple: fix leak in error case
  block: bio_iov_iter_get_pages: fix size of last iovec
  nvmet: only check for filebacking on -ENOTBLK
  nvmet: fixup crash on NULL device path
  scsi: set timed out out mq requests to complete
  blk-mq: export setting request completion state
  nvme: if_ready checks to fail io to deleting controller
  nvmet-fc: fix target sgl list on large transfers
  nbd: handle unexpected replies better
  nbd: don't requeue the same request twice.
2018-07-27 12:51:00 -07:00
Mauricio Faria de Oliveira
d43fdae7ba partitions/aix: append null character to print data from disk
Even if properly initialized, the lvname array (i.e., strings)
is read from disk, and might contain corrupt data (e.g., lack
the null terminating character for strings).

So, make sure the partition name string used in pr_warn() has
the null terminating character.

Fixes: 6ceea22bbb ("partitions: add aix lvm partition support files")
Suggested-by: Daniel J. Axtens <daniel.axtens@canonical.com>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27 09:17:41 -06:00
Mauricio Faria de Oliveira
14cb2c8a6c partitions/aix: fix usage of uninitialized lv_info and lvname structures
The if-block that sets a successful return value in aix_partition()
uses 'lvip[].pps_per_lv' and 'n[].name' potentially uninitialized.

For example, if 'numlvs' is zero or alloc_lvn() fails, neither is
initialized, but are used anyway if alloc_pvd() succeeds after it.

So, make the alloc_pvd() call conditional on their initialization.

This has been hit when attaching an apparently corrupted/stressed
AIX LUN, misleading the kernel to pr_warn() invalid data and hang.

    [...] partition (null) (11 pp's found) is not contiguous
    [...] partition (null) (2 pp's found) is not contiguous
    [...] partition (null) (3 pp's found) is not contiguous
    [...] partition (null) (64 pp's found) is not contiguous

Fixes: 6ceea22bbb ("partitions: add aix lvm partition support files")
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27 09:17:39 -06:00
Greg Edwards
5151842b9d block: reset bi_iter.bi_done after splitting bio
After the bio has been updated to represent the remaining sectors, reset
bi_done so bio_rewind_iter() does not rewind further than it should.

This resolves a bio_integrity_process() failure on reads where the
original request was split.

Fixes: 63573e359d ("bio-integrity: Restore original iterator on verify stage")
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27 09:10:34 -06:00
Greg Edwards
359f642700 block: move bio_integrity_{intervals,bytes} into blkdev.h
This allows bio_integrity_bytes() to be called from drivers instead of
open coding it.

Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 15:49:41 -06:00
Martin Wilck
17d51b10d7 block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
bio_iov_iter_get_pages() currently only adds pages for the next non-zero
segment from the iov_iter to the bio. That's suboptimal for callers,
which typically try to pin as many pages as fit into the bio. This patch
converts the current bio_iov_iter_get_pages() into a static helper, and
introduces a new helper that allocates as many pages as

 1) fit into the bio,
 2) are present in the iov_iter,
 3) and can be pinned by MM.

Error is returned only if zero pages could be pinned. Because of 3), a
zero return value doesn't necessarily mean all pages have been pinned.
Callers that have to pin every page in the iov_iter must still call this
function in a loop (this is currently the case).

This change matters most for __blkdev_direct_IO_simple(), which calls
bio_iov_iter_get_pages() only once. If it obtains less pages than
requested, it returns a "short write" or "short read", and
__generic_file_write_iter() falls back to buffered writes, which may
lead to data corruption.

Fixes: 72ecad22d9 ("block: support a full bio worth of IO for simplified bdev direct-io")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 11:52:36 -06:00
Martin Wilck
b403ea2404 block: bio_iov_iter_get_pages: fix size of last iovec
If the last page of the bio is not "full", the length of the last
vector slot needs to be corrected. This slot has the index
(bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
array, which is shifted by the value of bio->bi_vcnt at function
invocation, the correct index is (nr_pages - 1).

v2: improved readability following suggestions from Ming Lei.
v3: followed a formatting suggestion from Christoph Hellwig.

Fixes: 2cefe4dbaa ("block: add bio_iov_iter_get_pages()")
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 11:52:29 -06:00
Mike Snitzer
42c9cdfe1e block: allow max_discard_segments to be stacked
Set max_discard_segments to USHRT_MAX in blk_set_stacking_limits() so
that blk_stack_limits() can stack up this limit for stacked devices.

before:

$ cat /sys/block/nvme0n1/queue/max_discard_segments
256
$ cat /sys/block/dm-0/queue/max_discard_segments
1

after:

$ cat /sys/block/nvme0n1/queue/max_discard_segments
256
$ cat /sys/block/dm-0/queue/max_discard_segments
256

Fixes: 1e739730c5 ("block: optionally merge discontiguous discard bios into a single request")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:46:39 -06:00
Christoph Hellwig
c55183c9aa block: unexport bio_clone_bioset
Now only used by the bounce code, so move it there and mark the function
static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:43:26 -06:00
Keith Busch
0fc09f9209 blk-mq: export setting request completion state
This is preparing for drivers that want to directly alter the state of
their requests. No functional change here.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:41:50 -06:00
Christoph Hellwig
3bb5098310 block: bio_set_pages_dirty can't see NULL bv_page in a valid bio_vec
So don't bother handling it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:39:28 -06:00
Christoph Hellwig
24d5493f20 block: simplify bio_check_pages_dirty
bio_check_pages_dirty currently inviolates the invariant that bv_page of
a bio_vec inside bi_vcnt shouldn't be zero, and that is going to become
really annoying with multpath biovecs.  Fortunately there isn't any
all that good reason for it - once we decide to defer freeing the bio
to a workqueue holding onto a few additional pages isn't really an
issue anymore.  So just check if there is a clean page that needs
dirtying in the first path, and do a second pass to free them if there
was none, while the cache is still hot.

Also use the chance to micro-optimize bio_dirty_fn a bit by not saving
irq state - we know we are called from a workqueue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:39:27 -06:00
Ming Lei
8824f62246 blk-mq: fail the request in case issue failure
Inside blk_mq_try_issue_list_directly(), if the request is issued as
failed, we shouldn't try to do it again, otherwise the warning in
blk_mq_start_request() will be triggered. This change is aligned to
behaviour of other ways of request issue & dispatch.

Fixes: 6ce3dd6eec ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: kernel test robot <rong.a.chen@intel.com>
Cc: LKP <lkp@01.org>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-22 17:31:18 -06:00
Josef Bacik
22f17952c7 blk-rq-qos: make depth comparisons unsigned
With the change to use UINT_MAX I broke the depth check as any value of
inflight (ie 0) would be less than (int)UINT_MAX.  Fix this by changing
everything to unsigned int to match the depth.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-22 11:30:53 -06:00
Tejun Heo
636620b66d blkcg: Track DISCARD statistics and output them in cgroup io.stat
Add tracking of REQ_OP_DISCARD ios to the per-cgroup io.stat.  Two
fields, dbytes and dios, to respectively count the total bytes and
number of discards are added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Cc: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:23 -06:00
Michael Callahan
bdca3c87fb block: Track DISCARD statistics and output them in stat and diskstat
Add tracking of REQ_OP_DISCARD ios to the partition statistics and
append them to the various stat files in /sys as well as
/proc/diskstats.  These are tracked with the same four stats as reads
and writes:

Number of discard ios completed.
Number of discard ios merged
Number of discard sectors completed
Milliseconds spent on discard requests

This is done via adding a new STAT_DISCARD define to genhd.h and then
using it to index that stat field for discard requests.

tj: Refreshed on top of v4.17 and other previous updates.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:22 -06:00
Michael Callahan
ddcf35d397 block: Add and use op_stat_group() for indexing disk_stat fields.
Add and use a new op_stat_group() function for indexing partition stat
fields rather than indexing them by rq_data_dir() or bio_data_dir().
This function works similarly to op_is_sync() in that it takes the
request::cmd_flags or bio::bi_opf flags and determines which stats
should et updated.

In addition, the second parameter to generic_start_io_acct() and
generic_end_io_acct() is now a REQ_OP rather than simply a read or
write bit and it uses op_stat_group() on the parameter to determine
the stat group.

Note that the partition in_flight counts are not part of the per-cpu
statistics and as such are not indexed via this function.  It's now
indexed by op_is_write().

tj: Refreshed on top of v4.17.  Updated to pass around REQ_OP.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:20 -06:00
Michael Callahan
dbae2c5513 block: Define and use STAT_READ and STAT_WRITE
Add defines for STAT_READ and STAT_WRITE for indexing the partition
stat entries. This clarifies some fs/ code which has hardcoded 1 for
STAT_WRITE and will make it easier to extend the stats with additional
fields.

tj: Refreshed on top of v4.17.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:18 -06:00
Ming Lei
6ce3dd6eec blk-mq: issue directly if hw queue isn't busy in case of 'none'
In case of 'none' io scheduler, when hw queue isn't busy, it isn't
necessary to enqueue request to sw queue and dequeue it from
sw queue because request may be submitted to hw queue asap without
extra cost, meantime there shouldn't be much request in sw queue,
and we don't need to worry about effect on IO merge.

There are still some single hw queue SCSI HBAs(HPSA, megaraid_sas, ...)
which may connect high performance devices, so 'none' is often required
for obtaining good performance.

This patch improves IOPS and decreases CPU unilization on megaraid_sas,
per Kashyap's test.

Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-17 16:04:00 -06:00
Josef Bacik
71e9690b59 blk-iolatency: truncate our current time
In our longer tests we noticed that some boxes would degrade to the
point of uselessness.  This is because we truncate the current time when
saving it in our bio, but I was using the raw current time to subtract
from.  So once the box had been up a certain amount of time it would
appear as if our IO's were taking several years to complete.  Fix this
by truncating the current time so it matches the issue time.  Verified
this worked by running with this patch for a week on our test tier.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-16 10:15:19 -06:00
Josef Bacik
d607eefa3b blk-iolatency: don't change the latency window
Early versions of these patches had us waiting for seconds at a time
during submission, so we had to adjust the timing window we monitored
for latency.  Now we don't do things like that so this is unnecessary
code.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-16 10:15:17 -06:00
Linus Torvalds
2da8c426d9 for-linus-20180713
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltJZNkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiYPEADGvN9iXz71j5vKV4FmV6nRo66gRhlegGg2
 QDcf88BVUlCly+wZq5zHvyWoI8PFzHD0DOK83u6mPkCm1oRG5mETatBnK3y6xxPK
 10V2UadAALD0ZA6bS4Xj4toKVouZt2mC8xwLR/TCqmCN45eL+7Y2IZuegu6GcESE
 dxCrnQ8uFKLcDOAPXHIPGN6IFM7gyAAQjBvHS3mvIyKuVo+0Rwv4S2q7DcAZmxer
 8nzT6GhwHCzos1kjZRrJhWe9WWSFprI504rhF58h4PTx1GXwR5Arsmqz5DaftGVI
 0Co+uodx8uUrDP+9ChgJKgPT/eiOEmO5oUS531XFcbKNwU0vNktXpne5e/0MAeUG
 e5uwm8x35UIbwI07+Av78FyYrRSe8IBdv492uT+WB8uTwbwts3BJNr+FgeXw3h9+
 jGIRtWBuHY623mqsiJlQ7WOopK8raHfl2zZcrRsWQcAByh2v9lzV60voY50ssNrR
 Os/ZdLN4g+BgP0gfcHjm0Km2q4RO/hHTVq06oPbydkOjbanHvKhqtLJAGlMBlGAY
 Z65+nDu1xTZtKMMDU9r42K5zWzylnW9pdnOYMz6q+PyQXhBaZGmAOQ2Mm/ohGf1f
 8Hs+5fHBQA090bpLAWiuJvEAKVKGhP/TCenKY/PhPkkdIQgIoJce9cgQYSjnuc/W
 Nejp8SStHA==
 =wZtn
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180713' of git://git.kernel.dk/linux-block

Pull block fix from Jens Axboe:
 "Just a single regression fix (from 4.17) for bsg, fixing an EINVAL
  return on non-data commands"

* tag 'for-linus-20180713' of git://git.kernel.dk/linux-block:
  bsg: fix bogus EINVAL on non-data commands
2018-07-14 12:28:00 -07:00
Christoph Hellwig
28519c891c bsg: remove read/write support
The code poses a security risk due to user memory access in ->release
and had an API that can't be used reliably.  As far as we know it was
never used for real, but if that turns out wrong we'll have to revert
this commit and come up with a band aid.

Jann Horn did look software archives for users of this interface,
and the only users found were example code in sg3_utils, and optional
support in an optional module of the tgt user space iscsi target,
which looks like a proof of concept extension of the /dev/sg
read/write support.

Tony Battersby chimes in that the code is basically unsafe to use in
general:

  The read/write interface on /dev/bsg is impossible to use safely
  because the list of completed commands is per-device (bd->done_list)
  rather than per-fd like it is with /dev/sg.  So if program A and
  program B are both using the write/read interface on the same bsg
  device, then their command responses will get mixed up, and program
  A will read() some command results from program B and vice versa.
  So no, I don't use read/write on /dev/bsg.  From a security standpoint,
  it should definitely be fixed or removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-12 08:04:08 -06:00
Tony Battersby
70dbcc2254 bsg: fix bogus EINVAL on non-data commands
Fix a regression introduced in Linux kernel 4.17 where sending a SCSI
command that does not transfer data (such as TEST UNIT READY) via
/dev/bsg/* results in EINVAL.

Fixes: 17cb960f29 ("bsg: split handling of SCSI CDBs vs transport requeues")
Cc: <stable@vger.kernel.org> # 4.17+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-11 08:48:28 -06:00
Josef Bacik
a284390b39 blk-iolatency: fix max_depth comparisons
max_depth used to be a u64, but I changed it to a unsigned int but
didn't convert my comparisons over everywhere.  Fix by using UINT_MAX
everywhere instead of (u64)-1.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-11 08:37:38 -06:00
Arnd Bergmann
88b7210c81 block: iolatency: avoid 64-bit division
On 32-bit architectures, dividing a 64-bit number needs to use the
do_div() function or something like it to avoid a link failure:

block/blk-iolatency.o: In function `iolatency_prfill_limit':
blk-iolatency.c:(.text+0x8cc): undefined reference to `__aeabi_uldivmod'

Using div_u64() gives us the best output and avoids the need for an
explicit cast.

Fixes: d706751215 ("block: introduce blk-iolatency io controller")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-10 12:26:09 -06:00
Geert Uytterhoeven
e9a8385330 block: Add default switch case to blk_pm_allow_request() to kill warning
With gcc 4.9.0 and 7.3.0:

    block/blk-core.c: In function 'blk_pm_allow_request':
    block/blk-core.c:2747:2: warning: enumeration value 'RPM_ACTIVE' not handled in switch [-Wswitch]
      switch (rq->q->rpm_status) {
      ^

Convert the return statement below the switch() block into a default
case to fix this.

Fixes: e4f36b249b ("block: fix peeking requests during PM")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Mikulas Patocka
b88aef36b8 block: fix infinite loop if the device loses discard capability
If __blkdev_issue_discard is in progress and a device mapper device is
reloaded with a table that doesn't support discard,
q->limits.max_discard_sectors is set to zero. This results in infinite
loop in __blkdev_issue_discard.

This patch checks if max_discard_sectors is zero and aborts with
-EOPNOTSUPP.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Zdenek Kabelac <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Shakeel Butt
c137969bd4 block, mm: remove unnecessary __GFP_HIGH flag
The flag GFP_ATOMIC already contains __GFP_HIGH. There is no need to
explicitly or __GFP_HIGH again. So, just remove unnecessary __GFP_HIGH.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
d706751215 block: introduce blk-iolatency io controller
Current IO controllers for the block layer are less than ideal for our
use case.  The io.max controller is great at hard limiting, but it is
not work conserving.  This patch introduces io.latency.  You provide a
latency target for your group and we monitor the io in short windows to
make sure we are not exceeding those latency targets.  This makes use of
the rq-qos infrastructure and works much like the wbt stuff.  There are
a few differences from wbt

 - It's bio based, so the latency covers the whole block layer in addition to
   the actual io.
 - We will throttle all IO types that comes in here if we need to.
 - We use the mean latency over the 100ms window.  This is because writes can
   be particularly fast, which could give us a false sense of the impact of
   other workloads on our protected workload.
 - By default there's no throttling, we set the queue_depth to INT_MAX so that
   we can have as many outstanding bio's as we're allowed to.  Only at
   throttle time do we pay attention to the actual queue depth.
 - We backcharge cgroups for root cg issued IO and induce artificial
   delays in order to deal with cases like metadata only or swap heavy
   workloads.

In testing this has worked out relatively well.  Protected workloads
will throttle noisy workloads down to 1 io at time if they are doing
normal IO on their own, or induce up to a 1 second delay per syscall if
they are doing a lot of root issued IO (metadata/swap IO).

Our testing has revolved mostly around our production web servers where
we have hhvm (the web server application) in a protected group and
everything else in another group.  We see slightly higher requests per
second (RPS) on the test tier vs the control tier, and much more stable
RPS across all machines in the test tier vs the control tier.

Another test we run is a slow memory allocator in the unprotected group.
Before this would eventually push us into swap and cause the whole box
to die and not recover at all.  With these patches we see slight RPS
drops (usually 10-15%) before the memory consumer is properly killed and
things recover within seconds.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
67b42d0bf7 rq-qos: introduce dio_bio callback
wbt cares only about request completion time, but controllers may need
information that is on the bio itself, so add a done_bio callback for
rq-qos so things like blk-iolatency can use it to have the bio when it
completes.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
c1c80384c8 block: remove external dependency on wbt_flags
We don't really need to save this stuff in the core block code, we can
just pass the bio back into the helpers later on to derive the same
flags and update the rq->wbt_flags appropriately.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
a79050434b blk-rq-qos: refactor out common elements of blk-wbt
blkcg-qos is going to do essentially what wbt does, only on a cgroup
basis.  Break out the common code that will be shared between blkcg-qos
and wbt into blk-rq-qos.* so they can both utilize the same
infrastructure.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
2ecbf45635 blk-stat: export helpers for modifying blk_rq_stat
We need to use blk_rq_stat in the blkcg qos stuff, so export some of
these helpers so they can be used by other things.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
d09d8df3a2 blkcg: add generic throttling mechanism
Since IO can be issued from literally anywhere it's almost impossible to
do throttling without having some sort of adverse effect somewhere else
in the system because of locking or other dependencies.  The best way to
solve this is to do the throttling when we know we aren't holding any
other kernel resources.  Do this by tracking throttling in a per-blkg
basis, and if we require throttling flag the task that it needs to check
before it returns to user space and possibly sleep there.

This is to address the case where a process is doing work that is
generating IO that can't be throttled, whether that is directly with a
lot of REQ_META IO, or indirectly by allocating so much memory that it
is swamping the disk with REQ_SWAP.  We can't use task_add_work as we
don't want to induce a memory allocation in the IO path, so simply
saving the request queue in the task and flagging it to do the
notify_resume thing achieves the same result without the overhead of a
memory allocation.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Tejun Heo
0d3bd88d54 swap,blkcg: issue swap io with the appropriate context
For backcharging we need to know who the page belongs to when swapping
it out.  We don't worry about things that do ->rw_page (zram etc) at the
moment, we're only worried about pages that actually go to a block
device.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
903d23f0a3 blk-cgroup: allow controllers to output their own stats
blk-iolatency has a few stats that it would like to print out, and
instead of adding a bunch of crap to the generic code just provide a
helper so that controllers can add stuff to the stat line if they want
to.

Hide it behind a boot option since it changes the output of io.stat from
normal, and these stats are only interesting to developers.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
08e18eab0c block: add bi_blkg to the bio for cgroups
Currently io.low uses a bi_cg_private to stash its private data for the
blkg, however other blkcg policies may want to use this as well.  Since
we can get the private data out of the blkg, move this to bi_blkg in the
bio and make it generic, then we can use bio_associate_blkg() to attach
the blkg to the bio.

Theoretically we could simply replace the bi_css with this since we can
get to all the same information from the blkg, however you have to
lookup the blkg, so for example wbc_init_bio() would have to lookup and
possibly allocate the blkg for the css it was trying to attach to the
bio.  This could be problematic and result in us either not attaching
the css at all to the bio, or falling back to the root blkcg if we are
unable to allocate the corresponding blkg.

So for now do this, and in the future if possible we could just replace
the bi_css with bi_blkg and update the helpers to do the correct
translation.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Ming Lei
6e76871730 blk-mq: dequeue request one by one from sw queue if hctx is busy
It won't be efficient to dequeue request one by one from sw queue,
but we have to do that when queue is busy for better merge performance.

This patch takes the Exponential Weighted Moving Average(EWMA) to figure
out if queue is busy, then only dequeue request one by one from sw queue
when queue is busy.

Fixes: b347689ffb ("blk-mq-sched: improve dispatching from sw queue")
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Ming Lei
b04f50ab8a blk-mq: only attempt to merge bio if there is rq in sw queue
Only attempt to merge bio iff the ctx->rq_list isn't empty, because:

1) for high-performance SSD, most of times dispatch may succeed, then
there may be nothing left in ctx->rq_list, so don't try to merge over
sw queue if it is empty, then we can save one acquiring of ctx->lock

2) we can't expect good merge performance on per-cpu sw queue, and missing
one merge on sw queue won't be a big deal since tasks can be scheduled from
one CPU to another.

Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Ming Lei
3f0cedc7e9 blk-mq: use list_splice_tail_init() to insert requests
list_splice_tail_init() is much more faster than inserting each
request one by one, given all requets in 'list' belong to
same sw queue and ctx->lock is required to insert requests.

Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Minwoo Im
c018c84fdb blk-mq: fix typo in a function comment
Fix typo in a function blk_mq_alloc_tag_set() comment.
if if it too large -> if it's too large.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Minwoo Im
0da73d00ca blk-mq: code clean-up by adding an API to clear set->mq_map
set->mq_map is now currently cleared if something goes wrong when
establishing a queue map in blk-mq-pci.c.  It's also cleared before
updating a queue map in blk_mq_update_queue_map().

This patch provides an API to clear set->mq_map to make it clear.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Colin Ian King
e84422cdf3 partitions/ldm: remove redundant pointer dgrp
Pointer dgrp is being assigned but is never used hence it is redundant
and can be removed.

Cleans up clang warning:
warning: variable 'dgrp' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Liu Bo
43ada78781 Block: blk-throttle: set low_valid immediately once one cgroup has io.low configured
Once one cgroup has io.low configured, @low_valid becomes true and other
cgroups won't switch it back whatsoever.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Bart Van Assche
1954e9a998 block: Document how blk_update_request() handles RQF_SPECIAL_PAYLOAD requests
The payload of struct request is stored in the request.bio chain if
the RQF_SPECIAL_PAYLOAD flag is not set and in request.special_vec if
RQF_SPECIAL_PAYLOAD has been set. However, blk_update_request()
iterates over req->bio whether or not RQF_SPECIAL_PAYLOAD has been
set. Additionally, the RQF_SPECIAL_PAYLOAD flag is ignored by
blk_rq_bytes() which means that the value returned by that function
is incorrect if the RQF_SPECIAL_PAYLOAD flag has been set. It is not
clear to me whether this is an oversight or whether this happened on
purpose. Anyway, document that it is known that both functions ignore
RQF_SPECIAL_PAYLOAD. See also commit f9d03f96b9 ("block: improve
handling of the magic discard payload").

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Ming Lei
1311326cf4 blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()
SCSI probing may synchronously create and destroy a lot of request_queues
for non-existent devices. Any synchronize_rcu() in queue creation or
destroy path may introduce long latency during booting, see detailed
description in comment of blk_register_queue().

This patch removes one synchronize_rcu() inside blk_cleanup_queue()
for this case, commit c2856ae2f315d75(blk-mq: quiesce queue before freeing queue)
needs synchronize_rcu() for implementing blk_mq_quiesce_queue(), but
when queue isn't initialized, it isn't necessary to do that since
only pass-through requests are involved, no original issue in
scsi_execute() at all.

Without this patch and previous one, it may take more 20+ seconds for
virtio-scsi to complete disk probe. With the two patches, the time becomes
less than 100ms.

Fixes: c2856ae2f3 ("blk-mq: quiesce queue before freeing queue")
Reported-by: Andrew Jones <drjones@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Ming Lei
97889f9ac2 blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
We have to remove synchronize_rcu() from blk_queue_cleanup(),
otherwise long delay can be caused during lun probe. For removing
it, we have to avoid to iterate the set->tag_list in IO path, eg,
blk_mq_sched_restart().

This patch reverts 5b79413946d (Revert "blk-mq: don't handle
TAG_SHARED in restart"). Given we have fixed enough IO hang issue,
and there isn't any reason to restart all queues in one tags any more,
see the following reasons:

1) blk-mq core can deal with shared-tags case well via blk_mq_get_driver_tag(),
which can wake up queues waiting for driver tag.

2) SCSI is a bit special because it may return BLK_STS_RESOURCE if queue,
target or host is ready, but SCSI built-in restart can cover all these well,
see scsi_end_request(), queue will be rerun after any request initiated from
this host/target is completed.

In my test on scsi_debug(8 luns), this patch may improve IOPS by 20% ~ 30%
when running I/O on these 8 luns concurrently.

Fixes: 705cda97ee ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Reported-by: Andrew Jones <drjones@redhat.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Ming Lei
5815839b3c blk-mq: introduce new lock for protecting hctx->dispatch_wait
Now hctx->lock is only acquired when adding hctx->dispatch_wait to
one wait queue, but not held when removing it from the wait queue.

IO hang can be observed easily if SCHED RESTART is disabled, that means
now RESTART exits just for fixing the issue in blk_mq_mark_tag_wait().

This patch fixes the issue by introducing hctx->dispatch_wait_lock and
holding it for removing hctx->dispatch_wait in blk_mq_dispatch_wake(),
since we need to avoid acquiring hctx->lock in irq context.

Fixes: eb619fdb2d ("blk-mq: fix issue with shared tag queue re-running")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Ming Lei
2278d69f03 blk-mq: don't pass **hctx to blk_mq_mark_tag_wait()
'hctx' won't be changed at all, so not necessary to pass
'**hctx' to blk_mq_mark_tag_wait().

Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Ming Lei
8ab6bb9ee8 blk-mq: cleanup blk_mq_get_driver_tag()
We never pass 'wait' as true to blk_mq_get_driver_tag(), and hence
we never change '**hctx' as well. The last use of these went away
with the flush cleanup, commit 0c2a6fe4dc.

So cleanup the usage and remove the two extra parameters.

Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Paolo Valente
277a4a9b56 block, bfq: give a better name to bfq_bfqq_may_idle
The actual goal of the function bfq_bfqq_may_idle is to tell whether
it is better to perform device idling (more precisely: I/O-dispatch
plugging) for the input bfq_queue, either to boost throughput or to
preserve service guarantees. This commit improves the name of the
function accordingly.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Paolo Valente
9fae8dd59f block, bfq: fix service being wrongly set to zero in case of preemption
If
- a bfq_queue Q preempts another queue, because one request of Q
arrives in time,
- but, after this preemption, Q is not the queue that is set in service,
then Q->entity.service is set to 0 when Q is eventually set in
service. But Q should have continued receiving service with its old
budget (which is why preemption has occurred) and its old service.

This commit addresses this issue by resetting service on queue real
expiration.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Paolo Valente
4420b095cc block, bfq: do not expire a queue that will deserve dispatch plugging
For some bfq_queues, BFQ plugs I/O dispatching when the queue becomes
idle, and keeps the plug until a new request of the queue arrives, or
a timeout fires. BFQ does so either to boost throughput or to preserve
service guarantees for the queue.

More precisely, for such a queue, plugging starts when the queue
happens to have either no request enqueued, or no request in flight,
that is, no request already dispatched but not yet completed.

On the opposite end, BFQ may happen to expire a queue with no request
enqueued, without doing any plugging, if the queue still has some
request in flight. Unfortunately, such a premature expiration causes
the queue to lose its chance to enjoy dispatch plugging a moment
later, i.e., when its in-flight requests finally get completed. This
breaks service guarantees for the queue.

This commit prevents BFQ from expiring an empty queue if the latter
still has in-flight requests.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Paolo Valente
0471559c2f block, bfq: add/remove entity weights correctly
To keep I/O throughput high as often as possible, BFQ performs
I/O-dispatch plugging (aka device idling) only when beneficial exactly
for throughput, or when needed for service guarantees (low latency,
fairness). An important case where the latter condition holds is when
the scenario is 'asymmetric' in terms of weights: i.e., when some
bfq_queue or whole group of queues has a higher weight, and thus has
to receive more service, than other queues or groups. Without dispatch
plugging, lower-weight queues/groups may unjustly steal bandwidth to
higher-weight queues/groups.

To detect asymmetric scenarios, BFQ checks some sufficient
conditions. One of these conditions is that active groups have
different weights. BFQ controls this condition by maintaining a
special set of unique weights of active groups
(group_weights_tree). To this purpose, in the function
bfq_active_insert/bfq_active_extract BFQ adds/removes the weight of a
group to/from this set.

Unfortunately, the function bfq_active_extract may happen to be
invoked also for a group that is still active (to preserve the correct
update of the next queue to serve, see comments in function
bfq_no_longer_next_in_service() for details). In this case, removing
the weight of the group makes the set group_weights_tree
inconsistent. Service-guarantee violations follow.

This commit addresses this issue by moving group_weights_tree
insertions from their previous location (in bfq_active_insert) into
the function __bfq_activate_entity, and by moving group_weights_tree
extractions from bfq_active_extract to when the entity that represents
a group remains throughly idle, i.e., with no request either enqueued
or dispatched.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Bart Van Assche
6a5ac98465 block: Make struct request_queue smaller for CONFIG_BLK_DEV_ZONED=n
Exclude zoned block device members from struct request_queue for
CONFIG_BLK_DEV_ZONED == n. Avoid breaking the build by only building
the code that uses these struct request_queue members if
CONFIG_BLK_DEV_ZONED != n.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Bart Van Assche
7c8542b798 block: Inline blk_queue_nr_zones()
Since the implementation of blk_queue_nr_zones() is trivial and since
it only has a single caller, inline this function.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Bart Van Assche
f441108fa0 block: Remove a superfluous cast from blkdev_report_zones()
No cast is necessary when assigning a non-void pointer to a void
pointer.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Linus Torvalds
e6e5bec43c for-linus-20180629
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAls2oVcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr89D/9PaJCtLpCEU1HdaohIDXDyJKBdtCllIsqJ
 YAJTlUkFRkMvsbdEA3U43rVRVlu7zxP3aRg79VRI+kdZ8y8XSiisAF/QUbG/Bwd3
 NuFqKj4Hm5xFTtLMKIUlumR2/exD4ZpPHKnWmD4idp1PZRoTUwxeIjqVv77L6Nfy
 c4/GVlz2t9buWxH/5PSa//rsT49xM6CLpyIwO2lFsNEk2HYE/uAWya5KZVJ1YKVw
 mSjUdyFtJ//lFK9lVU4YkdNF0d5w2PcEoCfNyPe0n9F43iITACo2om7rkSkTgciS
 izPAs1DkqNdWPta2twULt656WlUoGlwnWyFchU34N/I/pDiww4kdCQFfNIOi6qBW
 7rPQ4xpywiw/U93C2GLEeE09++T8+yO4KuELhIcN5+D3D2VGRIuaGcf4xeY0CX3A
 a335EZIxBGOfxyejknBDg3BsoUcLvejbANKD8ltis3zciGf/QyLt+FFqx25WCzkN
 N028ppml8WPSiqhSlFDMyfscO1dx/WSYh1q7ANrBiPPaEjFYmakzEDT0cZW4JsUh
 av4jqYt3cqrFIXyhRHJM2wjFymQv6aVnmLa4zOKCFbwm9KzMWUUFjc4R962T/sQK
 0g+eCSFGidii4JZ4ghOAQk2dqCTIZqV72pmLlpJBGu+SAIQxkh+29h0LOi5U3v1Y
 FnTtJ1JhCg==
 =0uqV
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180629' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Small set of fixes for this series. Mostly just minor fixes, the only
  oddball in here is the sg change.

  The sg change came out of the stall fix for NVMe, where we added a
  mempool and limited us to a single page allocation. CONFIG_SG_DEBUG
  sort-of ruins that, since we'd need to account for that. That's
  actually a generic problem, since lots of drivers need to allocate SG
  lists. So this just removes support for CONFIG_SG_DEBUG, which I added
  back in 2007 and to my knowledge it was never useful.

  Anyway, outside of that, this pull contains:

   - clone of request with special payload fix (Bart)

   - drbd discard handling fix (Bart)

   - SATA blk-mq stall fix (me)

   - chunk size fix (Keith)

   - double free nvme rdma fix (Sagi)"

* tag 'for-linus-20180629' of git://git.kernel.dk/linux-block:
  sg: remove ->sg_magic member
  drbd: Fix drbd_request_prepare() discard handling
  blk-mq: don't queue more if we get a busy return
  block: Fix cloning of requests with a special payload
  nvme-rdma: fix possible double free of controller async event buffer
  block: Fix transfer when chunk sectors exceeds max
2018-06-30 10:47:46 -07:00
Jens Axboe
1f57f8d442 blk-mq: don't queue more if we get a busy return
Some devices have different queue limits depending on the type of IO. A
classic case is SATA NCQ, where some commands can queue, but others
cannot. If we have NCQ commands inflight and encounter a non-queueable
command, the driver returns busy. Currently we attempt to dispatch more
from the scheduler, if we were able to queue some commands. But for the
case where we ended up stopping due to BUSY, we should not attempt to
retrieve more from the scheduler. If we do, we can get into a situation
where we attempt to queue a non-queueable command, get BUSY, then
successfully retrieve more commands from that scheduler and queue those.
This can repeat forever, starving the non-queuable command indefinitely.

Fix this by NOT attempting to pull more commands from the scheduler, if
we get a BUSY return. This should also be more optimal in terms of
letting requests stay in the scheduler for as long as possible, if we
get a BUSY due to the regular out-of-tags condition.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-29 07:52:31 -06:00
Bart Van Assche
297ba57dcd block: Fix cloning of requests with a special payload
This patch avoids that removing a path controlled by the dm-mpath driver
while mkfs is running triggers the following kernel bug:

    kernel BUG at block/blk-core.c:3347!
    invalid opcode: 0000 [#1] PREEMPT SMP KASAN
    CPU: 20 PID: 24369 Comm: mkfs.ext4 Not tainted 4.18.0-rc1-dbg+ #2
    RIP: 0010:blk_end_request_all+0x68/0x70
    Call Trace:
     <IRQ>
     dm_softirq_done+0x326/0x3d0 [dm_mod]
     blk_done_softirq+0x19b/0x1e0
     __do_softirq+0x128/0x60d
     irq_exit+0x100/0x110
     smp_call_function_single_interrupt+0x90/0x330
     call_function_single_interrupt+0xf/0x20
     </IRQ>

Fixes: f9d03f96b9 ("block: improve handling of the magic discard payload")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-28 09:51:30 -06:00
Linus Torvalds
77072ca59f for-linus-20180623
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlsudKcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphYXEAC0G+fJcLgue64LwLNjXksDUsoldqNWjTDZ
 e3szwseN3Woe7t+Q9DwEUO48T421WhDYe3UY50EZF3Fz5yYLi0ggTeq4qsKgpaXx
 80dUY9/STxFZUV/tKrYMz5xPjM3k5x+m8FTeGlFgyR77fbOZE6/8w9XPkpmQN/B2
 ayyEPMgAR9aZKhtOU4TJY45O278b2Qdfd0s6XUsxDTwBoR1mTQqKhNnY38PrIA5s
 wkhtKz5YRdq5tNhKt2IwYABpJ4HJ1Hz/FE6Fo1RNx+6vfxfEhsk2/DZBn0rzzGDj
 kvSaVlPeQEu9SZdJGtb2sx26elVWdt5sqMORQ0zb1kVCochZiT1IO3Z81XCy2Y3u
 5WVvOlgsGni9BWr/K+iiN/dc+cKoRPy0g/ib77XUiJHgIC+urUIr3XgcIiuP9GZc
 cgCfF+6+GvRKLatSsDZSbns5cpgueJLNogZUyhs7oUouRIECo0DD9NnDAlHmaZYB
 957bRd/ykE6QOvtxEsG0TWc3tgyHjJFQQ9ukb45VmzYVB8mITKOGLpceJaqzju9c
 AnGyQDddK4fqOct1pgFsSA5C3V6OOVXDpUiFn3DTNoQZcu5HN2hMwEA/g/u5TnGD
 N2sEwcu2WCoUjzsMEWFLmX8R4w8laZUDwkNSRC5tYzQEZpVDfmAsRtRcUHJdNLl/
 zPj8QMAidA==
 =hCgr
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180623' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Further timeout fixes. We aren't quite there yet, so expect another
   round of fixes for that to completely close some of the IRQ vs
   completion races. (Christoph/Bart)

 - Set of NVMe fixes from the usual suspects, mostly error handling

 - Two off-by-one fixes (Dan)

 - Another bdi race fix (Jan)

 - Fix nbd reconfigure with NBD_DISCONNECT_ON_CLOSE (Doron)

* tag 'for-linus-20180623' of git://git.kernel.dk/linux-block:
  blk-mq: Fix timeout handling in case the timeout handler returns BLK_EH_DONE
  bdi: Fix another oops in wb_workfn()
  lightnvm: Remove depends on HAS_DMA in case of platform dependency
  nvme-pci: limit max IO size and segments to avoid high order allocations
  nvme-pci: move nvme_kill_queues to nvme_remove_dead_ctrl
  nvme-fc: release io queues to allow fast fail
  nbd: Add the nbd NBD_DISCONNECT_ON_CLOSE config flag.
  block: sed-opal: Fix a couple off by one bugs
  blk-mq-debugfs: Off by one in blk_mq_rq_state_name()
  nvmet: reset keep alive timer in controller enable
  nvme-rdma: don't override opts->queue_size
  nvme-rdma: Fix command completion race at error recovery
  nvme-rdma: fix possible free of a non-allocated async event buffer
  nvme-rdma: fix possible double free condition when failing to create a controller
  Revert "block: Add warning for bi_next not NULL in bio_endio()"
  block: fix timeout changes for legacy request drivers
2018-06-24 06:33:54 +08:00
Bart Van Assche
f5e350f021 blk-mq: Fix timeout handling in case the timeout handler returns BLK_EH_DONE
Make sure that RQF_TIMED_OUT is cleared when a request is reused
after a block driver timeout handler has returned BLK_EH_DONE.

Fixes: da66126739 ("blk-mq: don't time out requests again that are in the timeout handler")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Andrew Randrianasulu <randrianasulu@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-23 10:25:45 -06:00
Dan Carpenter
ce042c183b block: sed-opal: Fix a couple off by one bugs
resp->num is the number of tokens in resp->tok[].  It gets set in
response_parse().  So if n == resp->num then we're reading beyond the
end of the data.

Fixes: 455a7b238c ("block: Add Sed-opal library")
Reviewed-by: Scott Bauer <scott.bauer@intel.com>
Tested-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-20 12:04:06 -06:00
Dan Carpenter
a1e7918862 blk-mq-debugfs: Off by one in blk_mq_rq_state_name()
If rq_state == ARRAY_SIZE() then we read one element beyond the end of
the blk_mq_rq_state_name_array[] array.

Fixes: ec6dcf63c5 ("blk-mq-debugfs: Show more request state information")
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-20 11:26:04 -06:00
Bart Van Assche
9c24c10a2c Revert "block: Add warning for bi_next not NULL in bio_endio()"
Commit 0ba99ca483 ("block: Add warning for bi_next not NULL in
bio_endio()") breaks the dm driver. end_clone_bio() detects whether
or not a bio is the last bio associated with a request by checking
the .bi_next field. Commit 0ba99ca483 clears that field before
end_clone_bio() has had a chance to inspect that field. Hence revert
commit 0ba99ca483.

This patch avoids that KASAN reports the following complaint when
running the srp-test software (srp-test/run_tests -c -d -r 10 -t 02-mq):

==================================================================
BUG: KASAN: use-after-free in bio_advance+0x11b/0x1d0
Read of size 4 at addr ffff8801300e06d0 by task ksoftirqd/0/9

CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 4.18.0-rc1-dbg+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
 dump_stack+0xa4/0xf5
 print_address_description+0x6f/0x270
 kasan_report+0x241/0x360
 __asan_load4+0x78/0x80
 bio_advance+0x11b/0x1d0
 blk_update_request+0xa7/0x5b0
 scsi_end_request+0x56/0x320 [scsi_mod]
 scsi_io_completion+0x7d6/0xb20 [scsi_mod]
 scsi_finish_command+0x1c0/0x280 [scsi_mod]
 scsi_softirq_done+0x19a/0x230 [scsi_mod]
 blk_mq_complete_request+0x160/0x240
 scsi_mq_done+0x50/0x1a0 [scsi_mod]
 srp_recv_done+0x515/0x1330 [ib_srp]
 __ib_process_cq+0xa0/0xf0 [ib_core]
 ib_poll_handler+0x38/0xa0 [ib_core]
 irq_poll_softirq+0xe8/0x1f0
 __do_softirq+0x128/0x60d
 run_ksoftirqd+0x3f/0x60
 smpboot_thread_fn+0x352/0x460
 kthread+0x1c1/0x1e0
 ret_from_fork+0x24/0x30

Allocated by task 1918:
 save_stack+0x43/0xd0
 kasan_kmalloc+0xad/0xe0
 kasan_slab_alloc+0x11/0x20
 kmem_cache_alloc+0xfe/0x350
 mempool_alloc_slab+0x15/0x20
 mempool_alloc+0xfb/0x270
 bio_alloc_bioset+0x244/0x350
 submit_bh_wbc+0x9c/0x2f0
 __block_write_full_page+0x299/0x5a0
 block_write_full_page+0x16b/0x180
 blkdev_writepage+0x18/0x20
 __writepage+0x42/0x80
 write_cache_pages+0x376/0x8a0
 generic_writepages+0xbe/0x110
 blkdev_writepages+0xe/0x10
 do_writepages+0x9b/0x180
 __filemap_fdatawrite_range+0x178/0x1c0
 file_write_and_wait_range+0x59/0xc0
 blkdev_fsync+0x46/0x80
 vfs_fsync_range+0x66/0x100
 do_fsync+0x3d/0x70
 __x64_sys_fsync+0x21/0x30
 do_syscall_64+0x77/0x230
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 9:
 save_stack+0x43/0xd0
 __kasan_slab_free+0x137/0x190
 kasan_slab_free+0xe/0x10
 kmem_cache_free+0xd3/0x380
 mempool_free_slab+0x17/0x20
 mempool_free+0x63/0x160
 bio_free+0x81/0xa0
 bio_put+0x59/0x60
 end_bio_bh_io_sync+0x5d/0x70
 bio_endio+0x1a7/0x360
 blk_update_request+0xd0/0x5b0
 end_clone_bio+0xa3/0xd0 [dm_mod]
 bio_endio+0x1a7/0x360
 blk_update_request+0xd0/0x5b0
 scsi_end_request+0x56/0x320 [scsi_mod]
 scsi_io_completion+0x7d6/0xb20 [scsi_mod]
 scsi_finish_command+0x1c0/0x280 [scsi_mod]
 scsi_softirq_done+0x19a/0x230 [scsi_mod]
 blk_mq_complete_request+0x160/0x240
 scsi_mq_done+0x50/0x1a0 [scsi_mod]
 srp_recv_done+0x515/0x1330 [ib_srp]
 __ib_process_cq+0xa0/0xf0 [ib_core]
 ib_poll_handler+0x38/0xa0 [ib_core]
 irq_poll_softirq+0xe8/0x1f0
 __do_softirq+0x128/0x60d

The buggy address belongs to the object at ffff8801300e0640
 which belongs to the cache bio-0 of size 200
The buggy address is located 144 bytes inside of
 200-byte region [ffff8801300e0640, ffff8801300e0708)
The buggy address belongs to the page:
page:ffffea0004c03800 count:1 mapcount:0 mapping:ffff88015a563a00 index:0x0 compound_mapcount: 0
flags: 0x8000000000008100(slab|head)
raw: 8000000000008100 dead000000000100 dead000000000200 ffff88015a563a00
raw: 0000000000000000 0000000000330033 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8801300e0580: fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc
 ffff8801300e0600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
>ffff8801300e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                 ^
 ffff8801300e0700: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff8801300e0780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================

Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes: 0ba99ca483 ("block: Add warning for bi_next not NULL in bio_endio()")
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-19 11:59:47 -06:00
Christoph Hellwig
0cc61e64e2 block: fix timeout changes for legacy request drivers
blk_mq_complete_request can only be called for blk-mq drivers, but when
removing the BLK_EH_HANDLED return value, two legacy request timeout
methods incorrectly got switched to call blk_mq_complete_request.
Call __blk_complete_request instead to reinstance the previous behavior.
For that __blk_complete_request needs to be exported.

Fixes: 1fc2b62e ("scsi_transport_fc: complete requests from ->timeout")
Fixes: 0df0bb08 ("null_blk: complete requests from ->timeout")
Reported-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-19 11:27:18 -06:00
Linus Torvalds
265c5596da for-linus-20180616
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlslJkoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpi3READZgtKEaixC4EMLNH/7WqVLC5XxsIpuV3N7
 kxaV2KoJ/zJbBPqE+WIW/UasuTRkIFQpmXp3aYw879FZfCq1sN7K9yyzY2i7qvuG
 l8WRPWFc2SMfMrXb3pctGiNyYhTIZrOOzPmKAwlNinyyr7tlan2Ha4z5XHOJon9O
 uVk252kTlBqvTEX4nZKN2x6UiI9RaVegbOV6lthm3lQZF+25C6kN/8emdkhDIkjG
 mOI8MaHp6iDbpIBWBziEpkvGxvxO4PAaESldyJH6Fg+uzmNEDEwp6jvFBmlCiQIE
 wdt5Swsl2zKk8g/tBHeDSDK0inHnP5xho4tC68rDtLrgxyLeBuVjyLMItkYJdTqh
 s8B5gdCs3SS41aXMk+w3BnLBIRRMrByGiY2Nx23Si6KZU8pT5sM0dGGOTGCma1dy
 NNitIRbePyrajV2p/xqnMeiGvxqnCckRRYr0uV2KPJzqwwMoZvNsnS+XL2KqdgCY
 ndK5Nda5oRb+yDDlPKlRvY8cKh1N8GWdoFIGbfrMIYJcGiUXBoX7UoA/DKKbnY2o
 p+TE3t9RGz9UPwxHqSMTXRV+xs4X3lgWf4sA/ciGGPeuAgwdOoGJnfn4kY7AIPq8
 LYXj8K4rx3cRA+l7Y4DCppPQEl8bxKXhFFccKsXToFITW+PNdCuM7FDugOHdzwQE
 QlesAkLV+w==
 =gyaY
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180616' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A collection of fixes that should go into -rc1. This contains:

   - bsg_open vs bsg_unregister race fix (Anatoliy)

   - NVMe pull request from Christoph, with fixes for regressions in
     this window, FC connect/reconnect path code unification, and a
     trace point addition.

   - timeout fix (Christoph)

   - remove a few unused functions (Christoph)

   - blk-mq tag_set reinit fix (Roman)"

* tag 'for-linus-20180616' of git://git.kernel.dk/linux-block:
  bsg: fix race of bsg_open and bsg_unregister
  block: remov blk_queue_invalidate_tags
  nvme-fabrics: fix and refine state checks in __nvmf_check_ready
  nvme-fabrics: handle the admin-only case properly in nvmf_check_ready
  nvme-fabrics: refactor queue ready check
  blk-mq: remove blk_mq_tagset_iter
  nvme: remove nvme_reinit_tagset
  nvme-fc: fix nulling of queue data on reconnect
  nvme-fc: remove reinit_request routine
  blk-mq: don't time out requests again that are in the timeout handler
  nvme-fc: change controllers first connect to use reconnect path
  nvme: don't rely on the changed namespace list log
  nvmet: free smart-log buffer after use
  nvme-rdma: fix error flow during mapping request data
  nvme: add bio remapping tracepoint
  nvme: fix NULL pointer dereference in nvme_init_subsystem
  blk-mq: reinit q->tag_set_list entry only after grace period
2018-06-17 05:37:55 +09:00
Mauro Carvalho Chehab
5fb94e9ca3 docs: Fix some broken references
As we move stuff around, some doc references are broken. Fix some of
them via this script:
	./scripts/documentation-file-ref-check --fix

Manually checked if the produced result is valid, removing a few
false-positives.

Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com>
Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Jonathan Corbet <corbet@lwn.net>
2018-06-15 18:10:01 -03:00
Anatoliy Glagolev
d6c73964f1 bsg: fix race of bsg_open and bsg_unregister
The existing implementation allows races between bsg_unregister and
bsg_open paths. bsg_unregister and request_queue cleanup and deletion
may start and complete right after bsg_get_device (in bsg_open path)
retrieves bsg_class_device and releases the mutex. Then bsg_open path
touches freed memory of bsg_class_device and request_queue.

One possible fix is to hold the mutex all the way through bsg_get_device
instead of releasing it after bsg_class_device retrieval.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-Off-By: Anatoliy Glagolev <glagolig@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-15 08:15:37 -06:00
Christoph Hellwig
be7f99c536 block: remov blk_queue_invalidate_tags
This function is entirely unused, so remove it and the tag_queue_busy
member of struct request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-15 08:13:35 -06:00