* refs/heads/tmp-cd1eb9f:
Revert "net: qrtr: Stop rx_worker before freeing node"
Linux 4.19.77
drm/amd/display: Restore backlight brightness after system resume
mm/compaction.c: clear total_{migrate,free}_scanned before scanning a new zone
fuse: fix deadlock with aio poll and fuse_iqueue::waitq.lock
md/raid0: avoid RAID0 data corruption due to layout confusion.
CIFS: Fix oplock handling for SMB 2.1+ protocols
CIFS: fix max ea value size
i2c: riic: Clear NACK in tend isr
hwrng: core - don't wait on add_early_randomness()
quota: fix wrong condition in is_quota_modification()
ext4: fix punch hole for inline_data file systems
ext4: fix warning inside ext4_convert_unwritten_extents_endio
/dev/mem: Bail out upon SIGKILL.
cfg80211: Purge frame registrations on iftype change
md: only call set_in_sync() when it is expected to succeed.
md: don't report active array_state until after revalidate_disk() completes.
md/raid6: Set R5_ReadError when there is read failure on parity disk
Btrfs: fix race setting up and completing qgroup rescan workers
btrfs: qgroup: Fix reserved data space leak if we have multiple reserve calls
btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space
btrfs: Relinquish CPUs in btrfs_compare_trees
Btrfs: fix use-after-free when using the tree modification log
btrfs: fix allocation of free space cache v1 bitmap pages
ovl: filter of trusted xattr results in audit
ovl: Fix dereferencing possible ERR_PTR()
smb3: allow disabling requesting leases
block: fix null pointer dereference in blk_mq_rq_timed_out()
i40e: check __I40E_VF_DISABLE bit in i40e_sync_filters_subtask
memcg, kmem: do not fail __GFP_NOFAIL charges
memcg, oom: don't require __GFP_FS when invoking memcg OOM killer
gfs2: clear buf_in_tr when ending a transaction in sweep_bh_for_rgrps
efifb: BGRT: Improve efifb_bgrt_sanity_check
regulator: Defer init completion for a while after late_initcall
alarmtimer: Use EOPNOTSUPP instead of ENOTSUPP
arm64: dts: rockchip: limit clock rate of MMC controllers for RK3328
arm64: tlb: Ensure we execute an ISB following walk cache invalidation
Revert "arm64: Remove unnecessary ISBs from set_{pte,pmd,pud}"
ARM: zynq: Use memcpy_toio instead of memcpy on smp bring-up
ARM: samsung: Fix system restart on S3C6410
ASoC: Intel: Fix use of potentially uninitialized variable
ASoC: Intel: Skylake: Use correct function to access iomem space
ASoC: Intel: NHLT: Fix debug print format
binfmt_elf: Do not move brk for INTERP-less ET_EXEC
media: don't drop front-end reference count for ->detach
media: sn9c20x: Add MSI MS-1039 laptop to flip_dmi_table
KVM: x86: Manually calculate reserved bits when loading PDPTRS
KVM: x86: set ctxt->have_exception in x86_decode_insn()
KVM: x86: always stop emulation on page fault
parisc: Disable HP HSC-PCI Cards to prevent kernel crash
fuse: fix missing unlock_page in fuse_writepage()
powerpc/imc: Dont create debugfs files for cpu-less nodes
scsi: implement .cleanup_rq callback
blk-mq: add callback of .cleanup_rq
ALSA: hda/realtek - PCI quirk for Medion E4254
ceph: use ceph_evict_inode to cleanup inode's resource
Revert "ceph: use ceph_evict_inode to cleanup inode's resource"
randstruct: Check member structs in is_pure_ops_struct()
IB/hfi1: Define variables as unsigned long to fix KASAN warning
IB/mlx5: Free mpi in mp_slave mode
printk: Do not lose last line in kmsg buffer dump
scsi: qla2xxx: Fix Relogin to prevent modifying scan_state flag
scsi: scsi_dh_rdac: zero cdb in send_mode_select()
ALSA: firewire-tascam: check intermediate state of clock status and retry
ALSA: firewire-tascam: handle error code when getting current source of clock
iwlwifi: fw: don't send GEO_TX_POWER_LIMIT command to FW version 36
PM / devfreq: passive: fix compiler warning
media: omap3isp: Set device on omap3isp subdevs
btrfs: extent-tree: Make sure we only allocate extents from block groups with the same type
iommu/amd: Override wrong IVRS IOAPIC on Raven Ridge systems
ALSA: hda/realtek - Blacklist PC beep for Lenovo ThinkCentre M73/93
media: ttusb-dec: Fix info-leak in ttusb_dec_send_command()
drm/amd/powerplay/smu7: enforce minimal VBITimeout (v2)
ALSA: hda - Drop unsol event handler for Intel HDMI codecs
e1000e: add workaround for possible stalled packet
libertas: Add missing sentinel at end of if_usb.c fw_table
raid5: don't increment read_errors on EILSEQ return
mmc: dw_mmc: Re-store SDIO IRQs mask at system resume
mmc: core: Add helper function to indicate if SDIO IRQs is enabled
mmc: sdhci: Fix incorrect switch to HS mode
mmc: core: Clarify sdio_irq_pending flag for MMC_CAP2_SDIO_IRQ_NOTHREAD
raid5: don't set STRIPE_HANDLE to stripe which is in batch list
ASoC: dmaengine: Make the pcm->name equal to pcm->id if the name is not set
platform/x86: intel_pmc_core: Do not ioremap RAM
x86/cpu: Add Tiger Lake to Intel family
s390/crypto: xts-aes-s390 fix extra run-time crypto self tests finding
kprobes: Prohibit probing on BUG() and WARN() address
dmaengine: ti: edma: Do not reset reserved paRAM slots
md/raid1: fail run raid1 array when active disk less than one
hwmon: (acpi_power_meter) Change log level for 'unsafe software power cap'
closures: fix a race on wakeup from closure_sync
ACPI / PCI: fix acpi_pci_irq_enable() memory leak
ACPI: custom_method: fix memory leaks
ARM: dts: exynos: Mark LDO10 as always-on on Peach Pit/Pi Chromebooks
libtraceevent: Change users plugin directory
iommu/iova: Avoid false sharing on fq_timer_on
libata/ahci: Drop PCS quirk for Denverton and beyond
iommu/amd: Silence warnings under memory pressure
ALSA: firewire-motu: add support for MOTU 4pre
nvme-multipath: fix ana log nsid lookup when nsid is not found
nvmet: fix data units read and written counters in SMART log
x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()
ASoC: fsl_ssi: Fix clock control issue in master mode
x86/mm/pti: Do not invoke PTI functions when PTI is disabled
arm64: kpti: ensure patched kernel text is fetched from PoU
x86/apic/vector: Warn when vector space exhaustion breaks affinity
sched/cpufreq: Align trace event behavior of fast switching
ACPI / CPPC: do not require the _PSD method
ASoC: es8316: fix headphone mixer volume table
media: ov9650: add a sanity check
perf trace beauty ioctl: Fix off-by-one error in cmd->string table
media: saa7134: fix terminology around saa7134_i2c_eeprom_md7134_gate()
media: cpia2_usb: fix memory leaks
media: saa7146: add cleanup in hexium_attach()
media: cec-notifier: clear cec_adap in cec_notifier_unregister
PM / devfreq: exynos-bus: Correct clock enable sequence
PM / devfreq: passive: Use non-devm notifiers
EDAC/amd64: Decode syndrome before translating address
EDAC/amd64: Recognize DRAM device type ECC capability
libperf: Fix alignment trap with xyarray contents in 'perf stat'
media: dvb-core: fix a memory leak bug
posix-cpu-timers: Sanitize bogus WARNONS
media: dvb-frontends: use ida for pll number
media: mceusb: fix (eliminate) TX IR signal length limit
nbd: add missing config put
led: triggers: Fix a memory leak bug
ASoC: sun4i-i2s: Don't use the oversample to calculate BCLK
tools headers: Fixup bitsperlong per arch includes
ASoC: uniphier: Fix double reset assersion when transitioning to suspend state
media: hdpvr: add terminating 0 at end of string
media: radio/si470x: kill urb on error
ARM: dts: imx7-colibri: disable HS400
ARM: dts: imx7d: cl-som-imx7: make ethernet work again
m68k: Prevent some compiler warnings in Coldfire builds
net: lpc-enet: fix printk format strings
media: imx: mipi csi-2: Don't fail if initial state times-out
media: omap3isp: Don't set streaming state on random subdevs
media: i2c: ov5645: Fix power sequence
media: vsp1: fix memory leak of dl on error return path
perf record: Support aarch64 random socket_id assignment
dmaengine: iop-adma: use correct printk format strings
media: rc: imon: Allow iMON RC protocol for ffdc 7e device
media: em28xx: modules workqueue not inited for 2nd device
media: fdp1: Reduce FCP not found message level to debug
media: mtk-mdp: fix reference count on old device tree
perf test vfs_getname: Disable ~/.perfconfig to get default output
perf config: Honour $PERF_CONFIG env var to specify alternate .perfconfig
media: gspca: zero usb_buf on error
idle: Prevent late-arriving interrupts from disrupting offline
sched/fair: Use rq_lock/unlock in online_fair_sched_group
firmware: arm_scmi: Check if platform has released shmem before using
efi: cper: print AER info of PCIe fatal error
EDAC, pnd2: Fix ioremap() size in dnv_rd_reg()
loop: Add LOOP_SET_DIRECT_IO to compat ioctl
ACPI / processor: don't print errors for processorIDs == 0xff
media: media/platform: fsl-viu.c: fix build for MICROBLAZE
md: don't set In_sync if array is frozen
md: don't call spare_active in md_reap_sync_thread if all member devices can't work
md/raid1: end bio when the device faulty
arm64/prefetch: fix a -Wtype-limits warning
ASoC: rsnd: don't call clk_get_rate() under atomic context
EDAC/altera: Use the proper type for the IRQ status bits
ia64:unwind: fix double free for mod->arch.init_unw_table
ALSA: usb-audio: Skip bSynchAddress endpoint check if it is invalid
base: soc: Export soc_device_register/unregister APIs
media: iguanair: add sanity checks
EDAC/mc: Fix grain_bits calculation
ALSA: i2c: ak4xxx-adda: Fix a possible null pointer dereference in build_adc_controls()
ALSA: hda - Show the fatal CORB/RIRB error more clearly
x86/apic: Soft disable APIC before initializing it
x86/reboot: Always use NMI fallback when shutdown via reboot vector IPI fails
sched/deadline: Fix bandwidth accounting at all levels after offline migration
x86/apic: Make apic_pending_intr_clear() more robust
sched/core: Fix CPU controller for !RT_GROUP_SCHED
sched/fair: Fix imbalance due to CPU affinity
time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
media: i2c: ov5640: Check for devm_gpiod_get_optional() error
media: hdpvr: Add device num check and handling
media: exynos4-is: fix leaked of_node references
media: mtk-cir: lower de-glitch counter for rc-mm protocol
media: dib0700: fix link error for dibx000_i2c_set_speed
leds: leds-lp5562 allow firmware files up to the maximum length
dmaengine: bcm2835: Print error in case setting DMA mask fails
firmware: qcom_scm: Use proper types for dma mappings
ASoC: sgtl5000: Fix charge pump source assignment
ASoC: sgtl5000: Fix of unmute outputs on probe
ASoC: tlv320aic31xx: suppress error message for EPROBE_DEFER
regulator: lm363x: Fix off-by-one n_voltages for lm3632 ldo_vpos/ldo_vneg
ALSA: hda: Flush interrupts on disabling
nfp: flower: prevent memory leak in nfp_flower_spawn_phy_reprs
nfc: enforce CAP_NET_RAW for raw sockets
ieee802154: enforce CAP_NET_RAW for raw sockets
ax25: enforce CAP_NET_RAW for raw sockets
appletalk: enforce CAP_NET_RAW for raw sockets
mISDN: enforce CAP_NET_RAW for raw sockets
net/mlx5: Add device ID of upcoming BlueField-2
tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
net: sched: fix possible crash in tcf_action_destroy()
usbnet: sanity checking of packet sizes and device mtu
usbnet: ignore endpoints with invalid wMaxPacketSize
skge: fix checksum byte order
sch_netem: fix a divide by zero in tabledist()
ppp: Fix memory leak in ppp_write
openvswitch: change type of UPCALL_PID attribute to NLA_UNSPEC
nfp: flower: fix memory leak in nfp_flower_spawn_vnic_reprs
net_sched: add max len check for TCA_KIND
net/sched: act_sample: don't push mac header on ip6gre ingress
net: qrtr: Stop rx_worker before freeing node
net/phy: fix DP83865 10 Mbps HDX loopback disable function
macsec: drop skb sk before calling gro_cells_receive
cdc_ncm: fix divide-by-zero caused by invalid wMaxPacketSize
arcnet: provide a buffer big enough to actually receive packets
ANDROID: properly export new symbols with _GPL tag
Conflicts:
kernel/sched/cpufreq_schedutil.c
mm/compaction.c
Change-Id: I3f2cc4a1421480479b9040aa5eefe7e17437021d
Signed-off-by: Ivaylo Georgiev <irgeorgiev@codeaurora.org>
* refs/heads/tmp-8ed9c66:
Linux 4.19.76
f2fs: use generic EFSBADCRC/EFSCORRUPTED
net/rds: Check laddr_check before calling it
net/rds: An rds_sock is added too early to the hash table
net_sched: check cops->tcf_block in tc_bind_tclass()
Bluetooth: btrtl: Additional Realtek 8822CE Bluetooth devices
netfilter: nft_socket: fix erroneous socket assignment
xfs: don't crash on null attr fork xfs_bmapi_read
drm/nouveau/disp/nv50-: fix center/aspect-corrected scaling
ACPI: video: Add new hw_changes_brightness quirk, set it on PB Easynote MZ35
Bluetooth: btrtl: HCI reset on close for Realtek BT chip
net: don't warn in inet diag when IPV6 is disabled
drm: Flush output polling on shutdown
f2fs: fix to do sanity check on segment bitmap of LFS curseg
net/ibmvnic: Fix missing { in __ibmvnic_reset
dm zoned: fix invalid memory access
Revert "f2fs: avoid out-of-range memory access"
blk-mq: move cancel of requeue_work to the front of blk_exit_queue
blk-mq: change gfp flags to GFP_NOIO in blk_mq_realloc_hw_ctxs
initramfs: don't free a non-existent initrd
bcache: remove redundant LIST_HEAD(journal) from run_cache_set()
PCI: hv: Avoid use of hv_pci_dev->pci_slot after freeing it
f2fs: check all the data segments against all node ones
irqchip/gic-v3-its: Fix LPI release for Multi-MSI devices
bpf: libbpf: retry loading program on EAGAIN
Revert "drm/amd/powerplay: Enable/Disable NBPSTATE on On/OFF of UVD"
scsi: qla2xxx: Return switch command on a timeout
scsi: qla2xxx: Remove all rports if fabric scan retry fails
scsi: qla2xxx: Turn off IOCB timeout timer on IOCB completion
locking/lockdep: Add debug_locks check in __lock_downgrade()
power: supply: sysfs: ratelimit property read error message
pinctrl: sprd: Use define directive for sprd_pinconf_params values
objtool: Clobber user CFLAGS variable
ALSA: hda - Apply AMD controller workaround for Raven platform
ALSA: hda - Add laptop imic fixup for ASUS M9V laptop
ALSA: dice: fix wrong packet parameter for Alesis iO26
ALSA: usb-audio: Add DSD support for EVGA NU Audio
ALSA: usb-audio: Add Hiby device family to quirks for native DSD support
ASoC: fsl: Fix of-node refcount unbalance in fsl_ssi_probe_from_dt()
ASoC: Intel: cht_bsw_max98090_ti: Enable codec clock once and keep it enabled
media: tvp5150: fix switch exit in set control handler
iwlwifi: mvm: always init rs_fw with 20MHz bandwidth rates
iwlwifi: mvm: send BCAST management frames to the right station
net/mlx5e: Rx, Check ip headers sanity
net/mlx5e: Rx, Fixup skb checksum for packets with tail padding
net/mlx5e: XDP, Avoid checksum complete when XDP prog is loaded
net/mlx5e: Allow reporting of checksum unnecessary
mlx5: fix get_ip_proto()
net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets
net/mlx5e: Set ECN for received packets using CQE indication
CIFS: fix deadlock in cached root handling
crypto: talitos - fix missing break in switch statement
mtd: cfi_cmdset_0002: Use chip_good() to retry in do_write_oneword()
HID: Add quirk for HP X500 PIXART OEM mouse
HID: hidraw: Fix invalid read in hidraw_ioctl
HID: logitech: Fix general protection fault caused by Logitech driver
HID: sony: Fix memory corruption issue on cleanup.
HID: prodikeys: Fix general protection fault during probe
IB/core: Add an unbound WQ type to the new CQ API
drm/amd/display: readd -msse2 to prevent Clang from emitting libcalls to undefined SW FP routines
powerpc/xive: Fix bogus error code returned by OPAL
RDMA/restrack: Protect from reentry to resource return path
net/ibmvnic: free reset work of removed device from queue
Revert "Bluetooth: validate BLE connection interval updates"
Conflicts:
fs/f2fs/data.c
fs/f2fs/f2fs.h
fs/f2fs/gc.c
fs/f2fs/inode.c
Change-Id: I6626d6288e229c78e400be190dca200c62d2ec51
Signed-off-by: Ivaylo Georgiev <irgeorgiev@codeaurora.org>
commit 8d6996630c03d7ceeabe2611378fea5ca1c3f1b3 upstream.
We got a null pointer deference BUG_ON in blk_mq_rq_timed_out()
as following:
[ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
[ 108.827059] PGD 0 P4D 0
[ 108.827313] Oops: 0000 [#1] SMP PTI
[ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
[ 108.829503] Workqueue: kblockd blk_mq_timeout_work
[ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
[ 108.838191] Call Trace:
[ 108.838406] bt_iter+0x74/0x80
[ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
[ 108.839074] ? __switch_to_asm+0x34/0x70
[ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
[ 108.840732] blk_mq_timeout_work+0x74/0x200
[ 108.841151] process_one_work+0x297/0x680
[ 108.841550] worker_thread+0x29c/0x6f0
[ 108.841926] ? rescuer_thread+0x580/0x580
[ 108.842344] kthread+0x16a/0x1a0
[ 108.842666] ? kthread_flush_work+0x170/0x170
[ 108.843100] ret_from_fork+0x35/0x40
The bug is caused by the race between timeout handle and completion for
flush request.
When timeout handle function blk_mq_rq_timed_out() try to read
'req->q->mq_ops', the 'req' have completed and reinitiated by next
flush request, which would call blk_rq_init() to clear 'req' as 0.
After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
normal requests lifetime are protected by refcount. Until 'rq->ref'
drop to zero, the request can really be free. Thus, these requests
cannot been reused before timeout handle finish.
However, flush request has defined .end_io and rq->end_io() is still
called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq'
can be reused by the next flush request handle, resulting in null
pointer deference BUG ON.
We fix this problem by covering flush request with 'rq->ref'.
If the refcount is not zero, flush_end_io() return and wait the
last holder recall it. To record the request status, we add a new
entry 'rq_status', which will be used in flush_end_io().
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org # v4.18+
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-------
v2:
- move rq_status from struct request to struct blk_flush_queue
v3:
- remove unnecessary '{}' pair.
v4:
- let spinlock to protect 'fq->rq_status'
v5:
- move rq_status after flush_running_idx member of struct blk_flush_queue
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[ Upstream commit 5b202853ffbc54b29f23c4b1b5f3948efab489a2 ]
blk_mq_realloc_hw_ctxs could be invoked during update hw queues.
At the momemt, IO is blocked. Change the gfp flags from GFP_KERNEL
to GFP_NOIO to avoid forever hang during memory allocation in
blk_mq_realloc_hw_ctxs.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
If all targets in a device-mapper table support inline encryption, mark
the mapped device as supporting inline encryption. This will allow
filesystems such as ext4 to tell whether the hardware supports inline
encryption even in the case where there is an intervening dm device.
Change-Id: Ic83203f8cc8d440d50fefac62d7849ce035c657a
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Shivaprasad Hongal <shongal@codeaurora.org>
Signed-off-by: Barani Muthukumaran <bmuthuku@codeaurora.org>
This is a snapshot of the crypto drivers as of msm-4.14 commit
367c46b1 (Enable hardware based FBE on f2fs and adapt ext4 fs).
Change-Id: Ifb52ed101d6e971c5823037f7895049b830c78c5
Signed-off-by: Zhen Kong <zkong@codeaurora.org>
commit 1adfc5e4136f5967d591c399aff95b3b035f16b7 upstream.
Obviously the created discard bio has to be aligned with logical block size.
This patch introduces the helper of bio_allowed_max_sectors() for
this purpose.
Cc: stable@vger.kernel.org
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Fixes: 744889b7cb ("block: don't deal with discard limit in blkdev_issue_discard()")
Fixes: a22c4d7e34 ("block: re-add discard_granularity and alignment checks")
Reported-by: Rui Salvaterra <rsalvaterra@gmail.com>
Tested-by: Rui Salvaterra <rsalvaterra@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, when update nr_hw_queues, IO scheduler's init_hctx will
be invoked before the mapping between ctx and hctx is adapted
correctly by blk_mq_map_swqueue. The IO scheduler init_hctx (kyber)
may depend on this mapping and get wrong result and panic finally.
A simply way to fix this is that switch the IO scheduler to 'none'
before update the nr_hw_queues, and then switch it back after
update nr_hw_queues. blk_mq_sched_init_/exit_hctx are removed due
to nobody use them any more.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Because blk_do_io_stat() only does a judgement about the request
contributes to IO statistics, it better changes return type to bool.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Alexandru Moise <00moses.alexander00@gmail.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Current IO controllers for the block layer are less than ideal for our
use case. The io.max controller is great at hard limiting, but it is
not work conserving. This patch introduces io.latency. You provide a
latency target for your group and we monitor the io in short windows to
make sure we are not exceeding those latency targets. This makes use of
the rq-qos infrastructure and works much like the wbt stuff. There are
a few differences from wbt
- It's bio based, so the latency covers the whole block layer in addition to
the actual io.
- We will throttle all IO types that comes in here if we need to.
- We use the mean latency over the 100ms window. This is because writes can
be particularly fast, which could give us a false sense of the impact of
other workloads on our protected workload.
- By default there's no throttling, we set the queue_depth to INT_MAX so that
we can have as many outstanding bio's as we're allowed to. Only at
throttle time do we pay attention to the actual queue depth.
- We backcharge cgroups for root cg issued IO and induce artificial
delays in order to deal with cases like metadata only or swap heavy
workloads.
In testing this has worked out relatively well. Protected workloads
will throttle noisy workloads down to 1 io at time if they are doing
normal IO on their own, or induce up to a 1 second delay per syscall if
they are doing a lot of root issued IO (metadata/swap IO).
Our testing has revolved mostly around our production web servers where
we have hhvm (the web server application) in a protected group and
everything else in another group. We see slightly higher requests per
second (RPS) on the test tier vs the control tier, and much more stable
RPS across all machines in the test tier vs the control tier.
Another test we run is a slow memory allocator in the unprotected group.
Before this would eventually push us into swap and cause the whole box
to die and not recover at all. With these patches we see slight RPS
drops (usually 10-15%) before the memory consumer is properly killed and
things recover within seconds.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is almost no shared logic, which leads to a very confusing code
flow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reported-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
These are only used by the block core. Also move the declarations to
block/blk.h.
Reported-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch helps to avoid that new code gets introduced in block drivers
that manipulates queue flags without holding the queue lock when that
lock should be held.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block updates from Jens Axboe:
"This is the main pull request for block IO related changes for the
4.16 kernel. Nothing major in this pull request, but a good amount of
improvements and fixes all over the map. This contains:
- BFQ improvements, fixes, and cleanups from Angelo, Chiara, and
Paolo.
- Support for SMR zones for deadline and mq-deadline from Damien and
Christoph.
- Set of fixes for bcache by way of Michael Lyle, including fixes
from himself, Kent, Rui, Tang, and Coly.
- Series from Matias for lightnvm with fixes from Hans Holmberg,
Javier, and Matias. Mostly centered around pblk, and the removing
rrpc 1.2 in preparation for supporting 2.0.
- A couple of NVMe pull requests from Christoph. Nothing major in
here, just fixes and cleanups, and support for command tracing from
Johannes.
- Support for blk-throttle for tracking reads and writes separately.
From Joseph Qi. A few cleanups/fixes also for blk-throttle from
Weiping.
- Series from Mike Snitzer that enables dm to register its queue more
logically, something that's alwways been problematic on dm since
it's a stacked device.
- Series from Ming cleaning up some of the bio accessor use, in
preparation for supporting multipage bvecs.
- Various fixes from Ming closing up holes around queue mapping and
quiescing.
- BSD partition fix from Richard Narron, fixing a problem where we
can't mount newer (10/11) FreeBSD partitions.
- Series from Tejun reworking blk-mq timeout handling. The previous
scheme relied on atomic bits, but it had races where we would think
a request had timed out if it to reused at the wrong time.
- null_blk now supports faking timeouts, to enable us to better
exercise and test that functionality separately. From me.
- Kill the separate atomic poll bit in the request struct. After
this, we don't use the atomic bits on blk-mq anymore at all. From
me.
- sgl_alloc/free helpers from Bart.
- Heavily contended tag case scalability improvement from me.
- Various little fixes and cleanups from Arnd, Bart, Corentin,
Douglas, Eryu, Goldwyn, and myself"
* 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
block: remove smart1,2.h
nvme: add tracepoint for nvme_complete_rq
nvme: add tracepoint for nvme_setup_cmd
nvme-pci: introduce RECONNECTING state to mark initializing procedure
nvme-rdma: remove redundant boolean for inline_data
nvme: don't free uuid pointer before printing it
nvme-pci: Suspend queues after deleting them
bsg: use pr_debug instead of hand crafted macros
blk-mq-debugfs: don't allow write on attributes with seq_operations set
nvme-pci: Fix queue double allocations
block: Set BIO_TRACE_COMPLETION on new bio during split
blk-throttle: use queue_is_rq_based
block: Remove kblockd_schedule_delayed_work{,_on}()
blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
lib/scatterlist: Fix chaining support in sgl_alloc_order()
blk-throttle: track read and write request individually
block: add bdev_read_only() checks to common helpers
block: fail op_is_write() requests to read-only partitions
blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
...
These two functions are only called from inside the block layer so
unexport them.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We only have one atomic flag left. Instead of using an entire
unsigned long for that, steal the bottom bit of the deadline
field that we already reserved.
Remove ->atomic_flags, since it's now unused.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We reduce the resolution of request expiry, but since we're already
using jiffies for this where resolution depends on the kernel
configuration and since the timeout resolution is coarse anyway,
that should be fine.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't need this to be an atomic flag, it can be a regular
flag. We either end up on the same CPU for the polling, in which
case the state is sane, or we did the sleep which would imply
the needed barrier to ensure we see the right state.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After the recent updates to use generation number and state based
synchronization, we can easily replace REQ_ATOM_STARTED usages by
adding an extra state to distinguish completed but not yet freed
state.
Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
blk_mq_rq_state() tests. REQ_ATOM_STARTED no longer has any users
left and is removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now we track legacy requests with .q_usage_counter in commit 055f6e18e0
("block: Make q_usage_counter also track legacy requests"), but that
commit never runs and drains legacy queue before waiting for this counter
becoming zero, then IO hang is caused in the test of pulling disk during IO.
This patch fixes the issue by draining requests before waiting for
q_usage_counter becoming zero, both Mauricio and chenxiang reported this
issue, and observed that it can be fixed by this patch.
Link: https://marc.info/?l=linux-block&m=151192424731797&w=2
Fixes: 055f6e18e08f("block: Make q_usage_counter also track legacy requests")
Cc: Wen Xiong <wenxiong@us.ibm.com>
Tested-by: "chenxiang (M)" <chenxiang66@hisilicon.com>
Tested-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull core block layer updates from Jens Axboe:
"This is the main pull request for block storage for 4.15-rc1.
Nothing out of the ordinary in here, and no API changes or anything
like that. Just various new features for drivers, core changes, etc.
In particular, this pull request contains:
- A patch series from Bart, closing the whole on blk/scsi-mq queue
quescing.
- A series from Christoph, building towards hidden gendisks (for
multipath) and ability to move bio chains around.
- NVMe
- Support for native multipath for NVMe (Christoph).
- Userspace notifications for AENs (Keith).
- Command side-effects support (Keith).
- SGL support (Chaitanya Kulkarni)
- FC fixes and improvements (James Smart)
- Lots of fixes and tweaks (Various)
- bcache
- New maintainer (Michael Lyle)
- Writeback control improvements (Michael)
- Various fixes (Coly, Elena, Eric, Liang, et al)
- lightnvm updates, mostly centered around the pblk interface
(Javier, Hans, and Rakesh).
- Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)
- Writeback series that fix the much discussed hundreds of millions
of sync-all units. This goes all the way, as discussed previously
(me).
- Fix for missing wakeup on writeback timer adjustments (Yafang
Shao).
- Fix laptop mode on blk-mq (me).
- {mq,name} tupple lookup for IO schedulers, allowing us to have
alias names. This means you can use 'deadline' on both !mq and on
mq (where it's called mq-deadline). (me).
- blktrace race fix, oopsing on sg load (me).
- blk-mq optimizations (me).
- Obscure waitqueue race fix for kyber (Omar).
- NBD fixes (Josef).
- Disable writeback throttling by default on bfq, like we do on cfq
(Luca Miccio).
- Series from Ming that enable us to treat flush requests on blk-mq
like any other request. This is a really nice cleanup.
- Series from Ming that improves merging on blk-mq with schedulers,
getting us closer to flipping the switch on scsi-mq again.
- BFQ updates (Paolo).
- blk-mq atomic flags memory ordering fixes (Peter Z).
- Loop cgroup support (Shaohua).
- Lots of minor fixes from lots of different folks, both for core and
driver code"
* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
nvme: fix visibility of "uuid" ns attribute
blk-mq: fixup some comment typos and lengths
ide: ide-atapi: fix compile error with defining macro DEBUG
blk-mq: improve tag waiting setup for non-shared tags
brd: remove unused brd_mutex
blk-mq: only run the hardware queue if IO is pending
block: avoid null pointer dereference on null disk
fs: guard_bio_eod() needs to consider partitions
xtensa/simdisk: fix compile error
nvme: expose subsys attribute to sysfs
nvme: create 'slaves' and 'holders' entries for hidden controllers
block: create 'slaves' and 'holders' entries for hidden gendisks
nvme: also expose the namespace identification sysfs files for mpath nodes
nvme: implement multipath access to nvme subsystems
nvme: track shared namespaces
nvme: introduce a nvme_ns_ids structure
nvme: track subsystems
block, nvme: Introduce blk_mq_req_flags_t
block, scsi: Make SCSI quiesce and resume work reliably
block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
...
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For memory ordering guarantees on stores, we need to ensure that
these two bits share the same byte of storage in the unsigned
long. Add a comment as to why, and a BUILD_BUG_ON() to ensure that
we don't violate this requirement.
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No need to have this helper inline in a header. Also drop the __ prefix.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The only caller of this function is blk_start_request() in the same
file. Fix blk_start_request() description accordingly.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This helper allows looking up a partion under RCU protection without
grabbing a reference to it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
And instead call directly into the integrity code from bio_end_io.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some functions in block/blk-core.c must only be used on blk-sq queues
while others are safe to use against any queue type. Document which
functions are intended for blk-sq queues and issue a warning if the
blk-sq API is misused. This does not only help block driver authors
but will also make it easier to remove the blk-sq code once that code
is declared obsolete.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Export this function such that it becomes available to block
drivers.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matias Bjørling <m@bjorling.me>
Cc: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk_insert_flush should be using __blk_end_request to start with.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
User configures latency target, but the latency threshold for each
request size isn't fixed. For a SSD, the IO latency highly depends on
request size. To calculate latency threshold, we sample some data, eg,
average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
threshold of each request size will be the sample latency (I'll call it
base latency) plus latency target. For example, the base latency for
request size 4k is 80us and user configures latency target 60us. The 4k
latency threshold will be 80 + 60 = 140us.
To sample data, we calculate the order base 2 of rounded up IO sectors.
If the IO size is bigger than 1M, it will be accounted as 1M. Since the
calculation does round up, the base latency will be slightly smaller
than actual value. Also if there isn't any IO dispatched for a specific
IO size, we will use the base latency of smaller IO size for this IO
size.
But we shouldn't sample data at any time. The base latency is supposed
to be latency where disk isn't congested, because we use latency
threshold to schedule IOs between cgroups. If disk is congested, the
latency is higher, using it for scheduling is meaningless. Hence we only
do the sampling when block throttling is in the LOW limit, with
assumption disk isn't congested in such state. If the assumption isn't
true, eg, low limit is too high, calculated latency threshold will be
higher.
Hard disk is completely different. Latency depends on spindle seek
instead of request size. Currently this feature is SSD only, we probably
can use a fixed threshold like 4ms for hard disk though.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
A cgroup gets assigned a low limit, but the cgroup could never dispatch
enough IO to cross the low limit. In such case, the queue state machine
will remain in LIMIT_LOW state and all other cgroups will be throttled
according to low limit. This is unfair for other cgroups. We should
treat the cgroup idle and upgrade the state machine to lower state.
We also have a downgrade logic. If the state machine upgrades because of
cgroup idle (real idle), the state machine will downgrade soon as the
cgroup is below its low limit. This isn't what we want. A more
complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But
when queue gets upgraded to lower state, other cgroups could dispatch
more IO and this cgroup can't dispatch enough IO, so the cgroup is below
its low limit and looks like idle (fake idle). In this case, the queue
should downgrade soon. The key to determine if we should do downgrade is
to detect if cgroup is truely idle.
Unfortunately it's very hard to determine if a cgroup is real idle. This
patch uses the 'think time check' idea from CFQ for the purpose. Please
note, the idea doesn't work for all workloads. For example, a workload
with io depth 8 has disk utilization 100%, hence think time is 0, eg,
not idle. But the workload can run higher bandwidth with io depth 16.
Compared to io depth 16, the io depth 8 workload is idle. We use the
idea to roughly determine if a cgroup is idle.
We treat a cgroup idle if its think time is above a threshold (by
default 1ms for SSD and 100ms for HD). The idea is think time above the
threshold will start to harm performance. HD is much slower so a longer
think time is ok.
The patch (and the latter patches) uses 'unsigned long' to track time.
We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
precision, should not a big deal.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The throtl_slice is 100ms by default. This is a long time for SSD, a lot
of IO can run. To make cgroups have smoother throughput, we choose a
small value (20ms) for SSD.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
throtl_slice is important for blk-throttling. It's called slice
internally but it really is a time window blk-throttling samples data.
blk-throttling will make decision based on the samplings. An example is
bandwidth measurement. A cgroup's bandwidth is measured in the time
interval of throtl_slice.
A small throtl_slice meanse cgroups have smoother throughput but burn
more CPUs. It has 100ms default value, which is not appropriate for all
disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes
it tunable.
Since throtl_slice isn't a time slice, the sysfs name
'throttle_sample_time' reflects its character better.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add a new merge strategy that merges discard bios into a request until the
maximum number of discard ranges (or the maximum discard size) is reached
from the plug merging code. I/O scheduler merging is not wired up yet
but might also be useful, although not for fast devices like NVMe which
are the only user for now.
Note that for now we don't support limiting the size of each discard range,
but if needed that can be added later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Switch these constants to an enum, and make let the compiler ensure that
all callers of blk_try_merge and elv_merge handle all potential values.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This makes it available outside of blk-merge.c, and inlining such a trivial
helper seems pretty useful to start with.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
When we attempt to merge request-to-request, we return a 0/1 if we
ended up merging or not. Change that to return the pointer to the
request that we freed. We will use this to move the freeing of
that request out of the merge logic, so that callers can drop
locks before freeing the request.
There should be no functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
When I added the blk-mq debugging information to debugfs, I didn't
notice that blktrace also creates a "block" directory in debugfs. Make
them use the same dentry, now created in the core block code. Based on a
patch from Jens.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This can be used to check for fs vs non-fs requests and basically
removes all knowledge of BLOCK_PC specific from the block layer,
as well as preparing for removing the cmd_type field in struct request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
We want to use it outside of blk-core.c.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Prep patch for adding MQ ops as well, since doing anon unions with
named initializers doesn't work on older compilers.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
This patch enables a hybrid polling mode. Instead of polling after IO
submission, we can induce an artificial delay, and then poll after that.
For example, if the IO is presumed to complete in 8 usecs from now, we
can sleep for 4 usecs, wake up, and then do our polling. This still puts
a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
after the IO has completed, it'll happen before. With this hybrid
scheme, we can achieve big latency reductions while still using the same
(or less) amount of CPU.
Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-By: Stephen Bates <sbates@raithlin.com>
Reviewed-By: Stephen Bates <sbates@raithlin.com>