d706751215
Current IO controllers for the block layer are less than ideal for our use case. The io.max controller is great at hard limiting, but it is not work conserving. This patch introduces io.latency. You provide a latency target for your group and we monitor the io in short windows to make sure we are not exceeding those latency targets. This makes use of the rq-qos infrastructure and works much like the wbt stuff. There are a few differences from wbt - It's bio based, so the latency covers the whole block layer in addition to the actual io. - We will throttle all IO types that comes in here if we need to. - We use the mean latency over the 100ms window. This is because writes can be particularly fast, which could give us a false sense of the impact of other workloads on our protected workload. - By default there's no throttling, we set the queue_depth to INT_MAX so that we can have as many outstanding bio's as we're allowed to. Only at throttle time do we pay attention to the actual queue depth. - We backcharge cgroups for root cg issued IO and induce artificial delays in order to deal with cases like metadata only or swap heavy workloads. In testing this has worked out relatively well. Protected workloads will throttle noisy workloads down to 1 io at time if they are doing normal IO on their own, or induce up to a 1 second delay per syscall if they are doing a lot of root issued IO (metadata/swap IO). Our testing has revolved mostly around our production web servers where we have hhvm (the web server application) in a protected group and everything else in another group. We see slightly higher requests per second (RPS) on the test tier vs the control tier, and much more stable RPS across all machines in the test tier vs the control tier. Another test we run is a slow memory allocator in the unprotected group. Before this would eventually push us into swap and cause the whole box to die and not recover at all. With these patches we see slight RPS drops (usually 10-15%) before the memory consumer is properly killed and things recover within seconds. Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
231 lines
6.7 KiB
Text
231 lines
6.7 KiB
Text
# SPDX-License-Identifier: GPL-2.0
|
|
#
|
|
# Block layer core configuration
|
|
#
|
|
menuconfig BLOCK
|
|
bool "Enable the block layer" if EXPERT
|
|
default y
|
|
select SBITMAP
|
|
select SRCU
|
|
help
|
|
Provide block layer support for the kernel.
|
|
|
|
Disable this option to remove the block layer support from the
|
|
kernel. This may be useful for embedded devices.
|
|
|
|
If this option is disabled:
|
|
|
|
- block device files will become unusable
|
|
- some filesystems (such as ext3) will become unavailable.
|
|
|
|
Also, SCSI character devices and USB storage will be disabled since
|
|
they make use of various block layer definitions and facilities.
|
|
|
|
Say Y here unless you know you really don't want to mount disks and
|
|
suchlike.
|
|
|
|
if BLOCK
|
|
|
|
config LBDAF
|
|
bool "Support for large (2TB+) block devices and files"
|
|
depends on !64BIT
|
|
default y
|
|
help
|
|
Enable block devices or files of size 2TB and larger.
|
|
|
|
This option is required to support the full capacity of large
|
|
(2TB+) block devices, including RAID, disk, Network Block Device,
|
|
Logical Volume Manager (LVM) and loopback.
|
|
|
|
This option also enables support for single files larger than
|
|
2TB.
|
|
|
|
The ext4 filesystem requires that this feature be enabled in
|
|
order to support filesystems that have the huge_file feature
|
|
enabled. Otherwise, it will refuse to mount in the read-write
|
|
mode any filesystems that use the huge_file feature, which is
|
|
enabled by default by mke2fs.ext4.
|
|
|
|
The GFS2 filesystem also requires this feature.
|
|
|
|
If unsure, say Y.
|
|
|
|
config BLK_SCSI_REQUEST
|
|
bool
|
|
|
|
config BLK_DEV_BSG
|
|
bool "Block layer SG support v4"
|
|
default y
|
|
select BLK_SCSI_REQUEST
|
|
help
|
|
Saying Y here will enable generic SG (SCSI generic) v4 support
|
|
for any block device.
|
|
|
|
Unlike SG v3 (aka block/scsi_ioctl.c drivers/scsi/sg.c), SG v4
|
|
can handle complicated SCSI commands: tagged variable length cdbs
|
|
with bidirectional data transfers and generic request/response
|
|
protocols (e.g. Task Management Functions and SMP in Serial
|
|
Attached SCSI).
|
|
|
|
This option is required by recent UDEV versions to properly
|
|
access device serial numbers, etc.
|
|
|
|
If unsure, say Y.
|
|
|
|
config BLK_DEV_BSGLIB
|
|
bool "Block layer SG support v4 helper lib"
|
|
default n
|
|
select BLK_DEV_BSG
|
|
select BLK_SCSI_REQUEST
|
|
help
|
|
Subsystems will normally enable this if needed. Users will not
|
|
normally need to manually enable this.
|
|
|
|
If unsure, say N.
|
|
|
|
config BLK_DEV_INTEGRITY
|
|
bool "Block layer data integrity support"
|
|
select CRC_T10DIF if BLK_DEV_INTEGRITY
|
|
---help---
|
|
Some storage devices allow extra information to be
|
|
stored/retrieved to help protect the data. The block layer
|
|
data integrity option provides hooks which can be used by
|
|
filesystems to ensure better data integrity.
|
|
|
|
Say yes here if you have a storage device that provides the
|
|
T10/SCSI Data Integrity Field or the T13/ATA External Path
|
|
Protection. If in doubt, say N.
|
|
|
|
config BLK_DEV_ZONED
|
|
bool "Zoned block device support"
|
|
---help---
|
|
Block layer zoned block device support. This option enables
|
|
support for ZAC/ZBC host-managed and host-aware zoned block devices.
|
|
|
|
Say yes here if you have a ZAC or ZBC storage device.
|
|
|
|
config BLK_DEV_THROTTLING
|
|
bool "Block layer bio throttling support"
|
|
depends on BLK_CGROUP=y
|
|
default n
|
|
---help---
|
|
Block layer bio throttling support. It can be used to limit
|
|
the IO rate to a device. IO rate policies are per cgroup and
|
|
one needs to mount and use blkio cgroup controller for creating
|
|
cgroups and specifying per device IO rate policies.
|
|
|
|
See Documentation/cgroup-v1/blkio-controller.txt for more information.
|
|
|
|
config BLK_DEV_THROTTLING_LOW
|
|
bool "Block throttling .low limit interface support (EXPERIMENTAL)"
|
|
depends on BLK_DEV_THROTTLING
|
|
default n
|
|
---help---
|
|
Add .low limit interface for block throttling. The low limit is a best
|
|
effort limit to prioritize cgroups. Depending on the setting, the limit
|
|
can be used to protect cgroups in terms of bandwidth/iops and better
|
|
utilize disk resource.
|
|
|
|
Note, this is an experimental interface and could be changed someday.
|
|
|
|
config BLK_CMDLINE_PARSER
|
|
bool "Block device command line partition parser"
|
|
default n
|
|
---help---
|
|
Enabling this option allows you to specify the partition layout from
|
|
the kernel boot args. This is typically of use for embedded devices
|
|
which don't otherwise have any standardized method for listing the
|
|
partitions on a block device.
|
|
|
|
See Documentation/block/cmdline-partition.txt for more information.
|
|
|
|
config BLK_WBT
|
|
bool "Enable support for block device writeback throttling"
|
|
default n
|
|
---help---
|
|
Enabling this option enables the block layer to throttle buffered
|
|
background writeback from the VM, making it more smooth and having
|
|
less impact on foreground operations. The throttling is done
|
|
dynamically on an algorithm loosely based on CoDel, factoring in
|
|
the realtime performance of the disk.
|
|
|
|
config BLK_CGROUP_IOLATENCY
|
|
bool "Enable support for latency based cgroup IO protection"
|
|
depends on BLK_CGROUP=y
|
|
default n
|
|
---help---
|
|
Enabling this option enables the .latency interface for IO throttling.
|
|
The IO controller will attempt to maintain average IO latencies below
|
|
the configured latency target, throttling anybody with a higher latency
|
|
target than the victimized group.
|
|
|
|
Note, this is an experimental interface and could be changed someday.
|
|
|
|
config BLK_WBT_SQ
|
|
bool "Single queue writeback throttling"
|
|
default n
|
|
depends on BLK_WBT
|
|
---help---
|
|
Enable writeback throttling by default on legacy single queue devices
|
|
|
|
config BLK_WBT_MQ
|
|
bool "Multiqueue writeback throttling"
|
|
default y
|
|
depends on BLK_WBT
|
|
---help---
|
|
Enable writeback throttling by default on multiqueue devices.
|
|
Multiqueue currently doesn't have support for IO scheduling,
|
|
enabling this option is recommended.
|
|
|
|
config BLK_DEBUG_FS
|
|
bool "Block layer debugging information in debugfs"
|
|
default y
|
|
depends on DEBUG_FS
|
|
---help---
|
|
Include block layer debugging information in debugfs. This information
|
|
is mostly useful for kernel developers, but it doesn't incur any cost
|
|
at runtime.
|
|
|
|
Unless you are building a kernel for a tiny system, you should
|
|
say Y here.
|
|
|
|
config BLK_DEBUG_FS_ZONED
|
|
bool
|
|
default BLK_DEBUG_FS && BLK_DEV_ZONED
|
|
|
|
config BLK_SED_OPAL
|
|
bool "Logic for interfacing with Opal enabled SEDs"
|
|
---help---
|
|
Builds Logic for interfacing with Opal enabled controllers.
|
|
Enabling this option enables users to setup/unlock/lock
|
|
Locking ranges for SED devices using the Opal protocol.
|
|
|
|
menu "Partition Types"
|
|
|
|
source "block/partitions/Kconfig"
|
|
|
|
endmenu
|
|
|
|
endif # BLOCK
|
|
|
|
config BLOCK_COMPAT
|
|
bool
|
|
depends on BLOCK && COMPAT
|
|
default y
|
|
|
|
config BLK_MQ_PCI
|
|
bool
|
|
depends on BLOCK && PCI
|
|
default y
|
|
|
|
config BLK_MQ_VIRTIO
|
|
bool
|
|
depends on BLOCK && VIRTIO
|
|
default y
|
|
|
|
config BLK_MQ_RDMA
|
|
bool
|
|
depends on BLOCK && INFINIBAND
|
|
default y
|
|
|
|
source block/Kconfig.iosched
|