Commit graph

417 commits

Author SHA1 Message Date
Greg Kroah-Hartman
9ce79d9bed This is the 4.19.149 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl91ulMACgkQONu9yGCS
 aT7ezhAArTOQxPGkhktgdGfCMYgjvIHdny8o4pNGumnxW6TG7FCiJHoZuj8OLkdx
 2x5brOOvSGgcGTOwJXyUjL6opQzD5syTCuzbgEpGB2Tyd1x5q8vgqvI2XPxZeYHy
 x+mUDgacT+4m7FNbFDhNMZoTS4KCiJ3IcTevjeQexDtIs6R38HhxNl0Ee67gkqxZ
 p7c6L3kbUuR5T9EWGE1DPPLhOFGeOMk592qzkFsCGERsuswQOpXrxyw6zkik/0UG
 6Losmo2i+OtQFeiDz0WYJZNO9ySI511j+7R2Ewch/nFuTp6yFzy9kJZnP0YWK/KE
 U4BLmopgzCs9q+TQ/QNjxlCltl4eOrrjkFXF3Zz8o5ddbKwrugEsJUdUUDIpva71
 qEUgSw7vguGKoCttBenCDwyYOcjIVJRBFSWTVDzkgw5pXrz3m7qePF1Kj+KzG0pN
 8gTqosXPlYPzH1mh+2vRVntiCpZRMJYo18CX+ifqN20dHH3dsM4vA5NiWwjTJVY8
 JddRXfujxBQ0jxs2jFKvPZNrgqeY3Mh51L0a5G+HbHCIb+4kgD+2jl+C/X38TKch
 osTM1/qQriFVxtlH9TkTa8opYvrYBWO+G+XhNVc2tSpmd8T2EaKokMAVVvGiK3l9
 ZPq06SytJyKDPsSLvk4BKxCUv5CY0VT18k6mCYd1fq4oxTR92A4=
 =5bC5
 -----END PGP SIGNATURE-----

Merge 4.19.149 into android-4.19-stable

Changes in 4.19.149
	selinux: allow labeling before policy is loaded
	media: mc-device.c: fix memleak in media_device_register_entity
	dma-fence: Serialise signal enabling (dma_fence_enable_sw_signaling)
	ath10k: fix array out-of-bounds access
	ath10k: fix memory leak for tpc_stats_final
	mm: fix double page fault on arm64 if PTE_AF is cleared
	scsi: aacraid: fix illegal IO beyond last LBA
	m68k: q40: Fix info-leak in rtc_ioctl
	gma/gma500: fix a memory disclosure bug due to uninitialized bytes
	ASoC: kirkwood: fix IRQ error handling
	media: smiapp: Fix error handling at NVM reading
	arch/x86/lib/usercopy_64.c: fix __copy_user_flushcache() cache writeback
	x86/ioapic: Unbreak check_timer()
	ALSA: usb-audio: Add delay quirk for H570e USB headsets
	ALSA: hda/realtek - Couldn't detect Mic if booting with headset plugged
	ALSA: hda/realtek: Enable front panel headset LED on Lenovo ThinkStation P520
	lib/string.c: implement stpcpy
	leds: mlxreg: Fix possible buffer overflow
	PM / devfreq: tegra30: Fix integer overflow on CPU's freq max out
	scsi: fnic: fix use after free
	scsi: lpfc: Fix kernel crash at lpfc_nvme_info_show during remote port bounce
	net: silence data-races on sk_backlog.tail
	clk/ti/adpll: allocate room for terminating null
	drm/amdgpu/powerplay: fix AVFS handling with custom powerplay table
	mtd: cfi_cmdset_0002: don't free cfi->cfiq in error path of cfi_amdstd_setup()
	mfd: mfd-core: Protect against NULL call-back function pointer
	drm/amdgpu/powerplay/smu7: fix AVFS handling with custom powerplay table
	tpm_crb: fix fTPM on AMD Zen+ CPUs
	tracing: Adding NULL checks for trace_array descriptor pointer
	bcache: fix a lost wake-up problem caused by mca_cannibalize_lock
	dmaengine: mediatek: hsdma_probe: fixed a memory leak when devm_request_irq fails
	RDMA/qedr: Fix potential use after free
	RDMA/i40iw: Fix potential use after free
	fix dget_parent() fastpath race
	xfs: fix attr leaf header freemap.size underflow
	RDMA/iw_cgxb4: Fix an error handling path in 'c4iw_connect()'
	ubi: Fix producing anchor PEBs
	mmc: core: Fix size overflow for mmc partitions
	gfs2: clean up iopen glock mess in gfs2_create_inode
	scsi: pm80xx: Cleanup command when a reset times out
	debugfs: Fix !DEBUG_FS debugfs_create_automount
	CIFS: Properly process SMB3 lease breaks
	ASoC: max98090: remove msleep in PLL unlocked workaround
	kernel/sys.c: avoid copying possible padding bytes in copy_to_user
	KVM: arm/arm64: vgic: Fix potential double free dist->spis in __kvm_vgic_destroy()
	xfs: fix log reservation overflows when allocating large rt extents
	neigh_stat_seq_next() should increase position index
	rt_cpu_seq_next should increase position index
	ipv6_route_seq_next should increase position index
	seqlock: Require WRITE_ONCE surrounding raw_seqcount_barrier
	media: ti-vpe: cal: Restrict DMA to avoid memory corruption
	sctp: move trace_sctp_probe_path into sctp_outq_sack
	ACPI: EC: Reference count query handlers under lock
	scsi: ufs: Make ufshcd_add_command_trace() easier to read
	scsi: ufs: Fix a race condition in the tracing code
	dmaengine: zynqmp_dma: fix burst length configuration
	s390/cpum_sf: Use kzalloc and minor changes
	powerpc/eeh: Only dump stack once if an MMIO loop is detected
	Bluetooth: btrtl: Use kvmalloc for FW allocations
	tracing: Set kernel_stack's caller size properly
	ARM: 8948/1: Prevent OOB access in stacktrace
	ar5523: Add USB ID of SMCWUSBT-G2 wireless adapter
	ceph: ensure we have a new cap before continuing in fill_inode
	selftests/ftrace: fix glob selftest
	tools/power/x86/intel_pstate_tracer: changes for python 3 compatibility
	Bluetooth: Fix refcount use-after-free issue
	mm/swapfile.c: swap_next should increase position index
	mm: pagewalk: fix termination condition in walk_pte_range()
	Bluetooth: prefetch channel before killing sock
	KVM: fix overflow of zero page refcount with ksm running
	ALSA: hda: Clear RIRB status before reading WP
	skbuff: fix a data race in skb_queue_len()
	audit: CONFIG_CHANGE don't log internal bookkeeping as an event
	selinux: sel_avc_get_stat_idx should increase position index
	scsi: lpfc: Fix RQ buffer leakage when no IOCBs available
	scsi: lpfc: Fix coverity errors in fmdi attribute handling
	drm/omap: fix possible object reference leak
	clk: stratix10: use do_div() for 64-bit calculation
	crypto: chelsio - This fixes the kernel panic which occurs during a libkcapi test
	mt76: clear skb pointers from rx aggregation reorder buffer during cleanup
	ALSA: usb-audio: Don't create a mixer element with bogus volume range
	perf test: Fix test trace+probe_vfs_getname.sh on s390
	RDMA/rxe: Fix configuration of atomic queue pair attributes
	KVM: x86: fix incorrect comparison in trace event
	dmaengine: stm32-mdma: use vchan_terminate_vdesc() in .terminate_all
	media: staging/imx: Missing assignment in imx_media_capture_device_register()
	x86/pkeys: Add check for pkey "overflow"
	bpf: Remove recursion prevention from rcu free callback
	dmaengine: stm32-dma: use vchan_terminate_vdesc() in .terminate_all
	dmaengine: tegra-apb: Prevent race conditions on channel's freeing
	drm/amd/display: dal_ddc_i2c_payloads_create can fail causing panic
	firmware: arm_sdei: Use cpus_read_lock() to avoid races with cpuhp
	random: fix data races at timer_rand_state
	bus: hisi_lpc: Fixup IO ports addresses to avoid use-after-free in host removal
	media: go7007: Fix URB type for interrupt handling
	Bluetooth: guard against controllers sending zero'd events
	timekeeping: Prevent 32bit truncation in scale64_check_overflow()
	ext4: fix a data race at inode->i_disksize
	perf jevents: Fix leak of mapfile memory
	mm: avoid data corruption on CoW fault into PFN-mapped VMA
	drm/amdgpu: increase atombios cmd timeout
	drm/amd/display: Stop if retimer is not available
	ath10k: use kzalloc to read for ath10k_sdio_hif_diag_read
	scsi: aacraid: Disabling TM path and only processing IOP reset
	Bluetooth: L2CAP: handle l2cap config request during open state
	media: tda10071: fix unsigned sign extension overflow
	xfs: don't ever return a stale pointer from __xfs_dir3_free_read
	xfs: mark dir corrupt when lookup-by-hash fails
	ext4: mark block bitmap corrupted when found instead of BUGON
	tpm: ibmvtpm: Wait for buffer to be set before proceeding
	rtc: sa1100: fix possible race condition
	rtc: ds1374: fix possible race condition
	nfsd: Don't add locks to closed or closing open stateids
	RDMA/cm: Remove a race freeing timewait_info
	KVM: PPC: Book3S HV: Treat TM-related invalid form instructions on P9 like the valid ones
	drm/msm: fix leaks if initialization fails
	drm/msm/a5xx: Always set an OPP supported hardware value
	tracing: Use address-of operator on section symbols
	thermal: rcar_thermal: Handle probe error gracefully
	perf parse-events: Fix 3 use after frees found with clang ASAN
	serial: 8250_port: Don't service RX FIFO if throttled
	serial: 8250_omap: Fix sleeping function called from invalid context during probe
	serial: 8250: 8250_omap: Terminate DMA before pushing data on RX timeout
	perf cpumap: Fix snprintf overflow check
	cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_work_fn
	tools: gpio-hammer: Avoid potential overflow in main
	nvme-multipath: do not reset on unknown status
	nvme: Fix controller creation races with teardown flow
	RDMA/rxe: Set sys_image_guid to be aligned with HW IB devices
	scsi: hpsa: correct race condition in offload enabled
	SUNRPC: Fix a potential buffer overflow in 'svc_print_xprts()'
	svcrdma: Fix leak of transport addresses
	PCI: Use ioremap(), not phys_to_virt() for platform ROM
	ubifs: Fix out-of-bounds memory access caused by abnormal value of node_len
	ALSA: usb-audio: Fix case when USB MIDI interface has more than one extra endpoint descriptor
	PCI: pciehp: Fix MSI interrupt race
	NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests()
	mm/kmemleak.c: use address-of operator on section symbols
	mm/filemap.c: clear page error before actual read
	mm/vmscan.c: fix data races using kswapd_classzone_idx
	nvmet-rdma: fix double free of rdma queue
	mm/mmap.c: initialize align_offset explicitly for vm_unmapped_area
	scsi: qedi: Fix termination timeouts in session logout
	serial: uartps: Wait for tx_empty in console setup
	KVM: Remove CREATE_IRQCHIP/SET_PIT2 race
	bdev: Reduce time holding bd_mutex in sync in blkdev_close()
	drivers: char: tlclk.c: Avoid data race between init and interrupt handler
	KVM: arm64: vgic-its: Fix memory leak on the error path of vgic_add_lpi()
	net: openvswitch: use u64 for meter bucket
	scsi: aacraid: Fix error handling paths in aac_probe_one()
	staging:r8188eu: avoid skb_clone for amsdu to msdu conversion
	sparc64: vcc: Fix error return code in vcc_probe()
	arm64: cpufeature: Relax checks for AArch32 support at EL[0-2]
	dt-bindings: sound: wm8994: Correct required supplies based on actual implementaion
	atm: fix a memory leak of vcc->user_back
	perf mem2node: Avoid double free related to realloc
	power: supply: max17040: Correct voltage reading
	phy: samsung: s5pv210-usb2: Add delay after reset
	Bluetooth: Handle Inquiry Cancel error after Inquiry Complete
	USB: EHCI: ehci-mv: fix error handling in mv_ehci_probe()
	tipc: fix memory leak in service subscripting
	tty: serial: samsung: Correct clock selection logic
	ALSA: hda: Fix potential race in unsol event handler
	powerpc/traps: Make unrecoverable NMIs die instead of panic
	fuse: don't check refcount after stealing page
	USB: EHCI: ehci-mv: fix less than zero comparison of an unsigned int
	scsi: cxlflash: Fix error return code in cxlflash_probe()
	arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register
	e1000: Do not perform reset in reset_task if we are already down
	drm/nouveau/debugfs: fix runtime pm imbalance on error
	drm/nouveau: fix runtime pm imbalance on error
	drm/nouveau/dispnv50: fix runtime pm imbalance on error
	printk: handle blank console arguments passed in.
	usb: dwc3: Increase timeout for CmdAct cleared by device controller
	btrfs: don't force read-only after error in drop snapshot
	vfio/pci: fix memory leaks of eventfd ctx
	perf evsel: Fix 2 memory leaks
	perf trace: Fix the selection for architectures to generate the errno name tables
	perf stat: Fix duration_time value for higher intervals
	perf util: Fix memory leak of prefix_if_not_in
	perf metricgroup: Free metric_events on error
	perf kcore_copy: Fix module map when there are no modules loaded
	ASoC: img-i2s-out: Fix runtime PM imbalance on error
	wlcore: fix runtime pm imbalance in wl1271_tx_work
	wlcore: fix runtime pm imbalance in wlcore_regdomain_config
	mtd: rawnand: omap_elm: Fix runtime PM imbalance on error
	PCI: tegra: Fix runtime PM imbalance on error
	ceph: fix potential race in ceph_check_caps
	mm/swap_state: fix a data race in swapin_nr_pages
	rapidio: avoid data race between file operation callbacks and mport_cdev_add().
	mtd: parser: cmdline: Support MTD names containing one or more colons
	x86/speculation/mds: Mark mds_user_clear_cpu_buffers() __always_inline
	vfio/pci: Clear error and request eventfd ctx after releasing
	cifs: Fix double add page to memcg when cifs_readpages
	nvme: fix possible deadlock when I/O is blocked
	scsi: libfc: Handling of extra kref
	scsi: libfc: Skip additional kref updating work event
	selftests/x86/syscall_nt: Clear weird flags after each test
	vfio/pci: fix racy on error and request eventfd ctx
	btrfs: qgroup: fix data leak caused by race between writeback and truncate
	ubi: fastmap: Free unused fastmap anchor peb during detach
	perf parse-events: Use strcmp() to compare the PMU name
	net: openvswitch: use div_u64() for 64-by-32 divisions
	nvme: explicitly update mpath disk capacity on revalidation
	ASoC: wm8994: Skip setting of the WM8994_MICBIAS register for WM1811
	ASoC: wm8994: Ensure the device is resumed in wm89xx_mic_detect functions
	ASoC: Intel: bytcr_rt5640: Add quirk for MPMAN Converter9 2-in-1
	RISC-V: Take text_mutex in ftrace_init_nop()
	s390/init: add missing __init annotations
	lockdep: fix order in trace_hardirqs_off_caller()
	drm/amdkfd: fix a memory leak issue
	i2c: core: Call i2c_acpi_install_space_handler() before i2c_acpi_register_devices()
	objtool: Fix noreturn detection for ignored functions
	ieee802154: fix one possible memleak in ca8210_dev_com_init
	ieee802154/adf7242: check status of adf7242_read_reg
	clocksource/drivers/h8300_timer8: Fix wrong return value in h8300_8timer_init()
	mwifiex: Increase AES key storage size to 256 bits
	batman-adv: bla: fix type misuse for backbone_gw hash indexing
	atm: eni: fix the missed pci_disable_device() for eni_init_one()
	batman-adv: mcast/TT: fix wrongly dropped or rerouted packets
	mac802154: tx: fix use-after-free
	bpf: Fix clobbering of r2 in bpf_gen_ld_abs
	drm/vc4/vc4_hdmi: fill ASoC card owner
	net: qed: RDMA personality shouldn't fail VF load
	drm/sun4i: sun8i-csc: Secondary CSC register correction
	batman-adv: Add missing include for in_interrupt()
	batman-adv: mcast: fix duplicate mcast packets in BLA backbone from mesh
	batman-adv: mcast: fix duplicate mcast packets from BLA backbone to mesh
	bpf: Fix a rcu warning for bpffs map pretty-print
	ALSA: asihpi: fix iounmap in error handler
	regmap: fix page selection for noinc reads
	MIPS: Add the missing 'CPU_1074K' into __get_cpu_type()
	KVM: x86: Reset MMU context if guest toggles CR4.SMAP or CR4.PKE
	KVM: SVM: Add a dedicated INVD intercept routine
	tracing: fix double free
	s390/dasd: Fix zero write for FBA devices
	kprobes: Fix to check probe enabled before disarm_kprobe_ftrace()
	mm, THP, swap: fix allocating cluster for swapfile by mistake
	s390/zcrypt: Fix ZCRYPT_PERDEV_REQCNT ioctl
	kprobes: Fix compiler warning for !CONFIG_KPROBES_ON_FTRACE
	ata: define AC_ERR_OK
	ata: make qc_prep return ata_completion_errors
	ata: sata_mv, avoid trigerrable BUG_ON
	KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch
	Linux 4.19.149

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Idfc1b35ec63b4b464aeb6e32709102bee0efc872
2020-10-01 16:49:05 +02:00
Douglas Anderson
b6256c2966 bdev: Reduce time holding bd_mutex in sync in blkdev_close()
[ Upstream commit b849dd84b6ccfe32622988b79b7b073861fcf9f7 ]

While trying to "dd" to the block device for a USB stick, I
encountered a hung task warning (blocked for > 120 seconds).  I
managed to come up with an easy way to reproduce this on my system
(where /dev/sdb is the block device for my USB stick) with:

  while true; do dd if=/dev/zero of=/dev/sdb bs=4M; done

With my reproduction here are the relevant bits from the hung task
detector:

 INFO: task udevd:294 blocked for more than 122 seconds.
 ...
 udevd           D    0   294      1 0x00400008
 Call trace:
  ...
  mutex_lock_nested+0x40/0x50
  __blkdev_get+0x7c/0x3d4
  blkdev_get+0x118/0x138
  blkdev_open+0x94/0xa8
  do_dentry_open+0x268/0x3a0
  vfs_open+0x34/0x40
  path_openat+0x39c/0xdf4
  do_filp_open+0x90/0x10c
  do_sys_open+0x150/0x3c8
  ...

 ...
 Showing all locks held in the system:
 ...
 1 lock held by dd/2798:
  #0: ffffff814ac1a3b8 (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0x50/0x204
 ...
 dd              D    0  2798   2764 0x00400208
 Call trace:
  ...
  schedule+0x8c/0xbc
  io_schedule+0x1c/0x40
  wait_on_page_bit_common+0x238/0x338
  __lock_page+0x5c/0x68
  write_cache_pages+0x194/0x500
  generic_writepages+0x64/0xa4
  blkdev_writepages+0x24/0x30
  do_writepages+0x48/0xa8
  __filemap_fdatawrite_range+0xac/0xd8
  filemap_write_and_wait+0x30/0x84
  __blkdev_put+0x88/0x204
  blkdev_put+0xc4/0xe4
  blkdev_close+0x28/0x38
  __fput+0xe0/0x238
  ____fput+0x1c/0x28
  task_work_run+0xb0/0xe4
  do_notify_resume+0xfc0/0x14bc
  work_pending+0x8/0x14

The problem appears related to the fact that my USB disk is terribly
slow and that I have a lot of RAM in my system to cache things.
Specifically my writes seem to be happening at ~15 MB/s and I've got
~4 GB of RAM in my system that can be used for buffering.  To write 4
GB of buffer to disk thus takes ~4000 MB / ~15 MB/s = ~267 seconds.

The 267 second number is a problem because in __blkdev_put() we call
sync_blockdev() while holding the bd_mutex.  Any other callers who
want the bd_mutex will be blocked for the whole time.

The problem is made worse because I believe blkdev_put() specifically
tells other tasks (namely udev) to go try to access the device at right
around the same time we're going to hold the mutex for a long time.

Putting some traces around this (after disabling the hung task detector),
I could confirm:
 dd:    437.608600: __blkdev_put() right before sync_blockdev() for sdb
 udevd: 437.623901: blkdev_open() right before blkdev_get() for sdb
 dd:    661.468451: __blkdev_put() right after sync_blockdev() for sdb
 udevd: 663.820426: blkdev_open() right after blkdev_get() for sdb

A simple fix for this is to realize that sync_blockdev() works fine if
you're not holding the mutex.  Also, it's not the end of the world if
you sync a little early (though it can have performance impacts).
Thus we can make a guess that we're going to need to do the sync and
then do it without holding the mutex.  We still do one last sync with
the mutex but it should be much, much faster.

With this, my hung task warnings for my test case are gone.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01 13:14:42 +02:00
Greg Kroah-Hartman
b96549a28b This is the 4.19.130 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl70p5sACgkQONu9yGCS
 aT4nRg/9HAG5FdyJgH3OnUY5LtrjtlVpRsD+AgdUnnftpDprZ0m4dZsUuUSF4fJJ
 hzEr98/vX41yYqi2ne9mZcPnm4uOw21dEJK9X/2Z5+654i4qOwcv9FfVQSAdSpXC
 KEN0rRlZw0MdXwISmU9UbBTwSRqxi0qJNWfSR3rS9DZRiSPhhxElMwbdcYXjpgKR
 GZ7Wd/pDb0q371mR2Ar7+13PVxsvBzoEwYUngbeTr3EXtCiWiavBqBzABpJQYH9L
 xm5ZfFLwLMsnQSd/gdW7DZGq+6JdLgf6HaY4FOcpkoDo5gSSDc2RNTg8jbzu8dM0
 o2Wge54q9aTbuu+sc6k9x5HAaTPdv7uFnORdBvNhYGZWfXv69SZjCFgqwjsmPiww
 +wg6D4uYWeh+faIz4tNBHk/bjIIMACnSJifsd131aXUed1cm4GeeQEQgimT8ea99
 uOZrKPSTd+tbY3tZJrX2TyixYKgtaegdYTh3GWLx0AmYS8dOTHKfezUMMtFj4F7F
 XTbZ3IWyVTSynO/7LRLdEGwAZjA/vWCK3YQS95qVCY1ni7h2vZw4FQb2C9CHyHAq
 BIqiu2rOvxAsmIgh6GFcLLYN6cAa+lgIifmOMn2DPYrfF2gcrrGt6iOm1fc0gr5t
 QbNYK7OQ/1L70BgvnI1lDAl4qG6lhSKV28SB6Csbzrz5WYzbfnE=
 =iU+4
 -----END PGP SIGNATURE-----

Merge 4.19.130 into android-4.19-stable

Changes in 4.19.130
	power: supply: bq24257_charger: Replace depends on REGMAP_I2C with select
	clk: sunxi: Fix incorrect usage of round_down()
	ASoC: tegra: tegra_wm8903: Support nvidia, headset property
	i2c: piix4: Detect secondary SMBus controller on AMD AM4 chipsets
	iio: pressure: bmp280: Tolerate IRQ before registering
	remoteproc: Fix IDR initialisation in rproc_alloc()
	clk: qcom: msm8916: Fix the address location of pll->config_reg
	backlight: lp855x: Ensure regulators are disabled on probe failure
	ASoC: davinci-mcasp: Fix dma_chan refcnt leak when getting dma type
	ARM: integrator: Add some Kconfig selections
	scsi: qedi: Check for buffer overflow in qedi_set_path()
	ALSA: hda/realtek - Introduce polarity for micmute LED GPIO
	ALSA: isa/wavefront: prevent out of bounds write in ioctl
	PCI: Allow pci_resize_resource() for devices on root bus
	scsi: qla2xxx: Fix issue with adapter's stopping state
	iio: bmp280: fix compensation of humidity
	f2fs: report delalloc reserve as non-free in statfs for project quota
	i2c: pxa: clear all master action bits in i2c_pxa_stop_message()
	clk: samsung: Mark top ISP and CAM clocks on Exynos542x as critical
	usblp: poison URBs upon disconnect
	serial: 8250: Fix max baud limit in generic 8250 port
	dm mpath: switch paths in dm_blk_ioctl() code path
	PCI: aardvark: Don't blindly enable ASPM L0s and don't write to read-only register
	ps3disk: use the default segment boundary
	vfio/pci: fix memory leaks in alloc_perm_bits()
	RDMA/mlx5: Add init2init as a modify command
	m68k/PCI: Fix a memory leak in an error handling path
	gpio: dwapb: Call acpi_gpiochip_free_interrupts() on GPIO chip de-registration
	mfd: wm8994: Fix driver operation if loaded as modules
	scsi: lpfc: Fix lpfc_nodelist leak when processing unsolicited event
	clk: clk-flexgen: fix clock-critical handling
	powerpc/perf/hv-24x7: Fix inconsistent output values incase multiple hv-24x7 events run
	nfsd: Fix svc_xprt refcnt leak when setup callback client failed
	PCI: vmd: Filter resource type bits from shadow register
	powerpc/crashkernel: Take "mem=" option into account
	pwm: img: Call pm_runtime_put() in pm_runtime_get_sync() failed case
	yam: fix possible memory leak in yam_init_driver
	NTB: ntb_pingpong: Choose doorbells based on port number
	NTB: Fix the default port and peer numbers for legacy drivers
	mksysmap: Fix the mismatch of '.L' symbols in System.map
	apparmor: fix introspection of of task mode for unconfined tasks
	apparmor: check/put label on apparmor_sk_clone_security()
	ASoC: meson: add missing free_irq() in error path
	scsi: sr: Fix sr_probe() missing deallocate of device minor
	scsi: ibmvscsi: Don't send host info in adapter info MAD after LPM
	apparmor: fix nnp subset test for unconfined
	x86/purgatory: Disable various profiling and sanitizing options
	staging: greybus: fix a missing-check bug in gb_lights_light_config()
	arm64: dts: mt8173: fix unit name warnings
	scsi: qedi: Do not flush offload work if ARP not resolved
	ARM: dts: sun8i-h2-plus-bananapi-m2-zero: Fix led polarity
	gpio: dwapb: Append MODULE_ALIAS for platform driver
	scsi: qedf: Fix crash when MFW calls for protocol stats while function is still probing
	pinctrl: rza1: Fix wrong array assignment of rza1l_swio_entries
	firmware: qcom_scm: fix bogous abuse of dma-direct internals
	staging: gasket: Fix mapping refcnt leak when put attribute fails
	staging: gasket: Fix mapping refcnt leak when register/store fails
	ALSA: usb-audio: Improve frames size computation
	ALSA: usb-audio: Fix racy list management in output queue
	s390/qdio: put thinint indicator after early error
	tty: hvc: Fix data abort due to race in hvc_open
	slimbus: ngd: get drvdata from correct device
	thermal/drivers/ti-soc-thermal: Avoid dereferencing ERR_PTR
	usb: dwc3: gadget: Properly handle failed kick_transfer
	staging: sm750fb: add missing case while setting FB_VISUAL
	PCI: v3-semi: Fix a memory leak in v3_pci_probe() error handling paths
	i2c: pxa: fix i2c_pxa_scream_blue_murder() debug output
	serial: amba-pl011: Make sure we initialize the port.lock spinlock
	drivers: base: Fix NULL pointer exception in __platform_driver_probe() if a driver developer is foolish
	PCI: rcar: Fix incorrect programming of OB windows
	PCI/ASPM: Allow ASPM on links to PCIe-to-PCI/PCI-X Bridges
	scsi: qla2xxx: Fix warning after FC target reset
	power: supply: lp8788: Fix an error handling path in 'lp8788_charger_probe()'
	power: supply: smb347-charger: IRQSTAT_D is volatile
	scsi: mpt3sas: Fix double free warnings
	pinctrl: rockchip: fix memleak in rockchip_dt_node_to_map
	dlm: remove BUG() before panic()
	clk: ti: composite: fix memory leak
	PCI: Fix pci_register_host_bridge() device_register() error handling
	powerpc/64: Don't initialise init_task->thread.regs
	tty: n_gsm: Fix SOF skipping
	tty: n_gsm: Fix waking up upper tty layer when room available
	HID: Add quirks for Trust Panora Graphic Tablet
	ipmi: use vzalloc instead of kmalloc for user creation
	powerpc/pseries/ras: Fix FWNMI_VALID off by one
	powerpc/ps3: Fix kexec shutdown hang
	vfio-pci: Mask cap zero
	usb/ohci-platform: Fix a warning when hibernating
	drm/msm/mdp5: Fix mdp5_init error path for failed mdp5_kms allocation
	ASoC: Intel: bytcr_rt5640: Add quirk for Toshiba Encore WT8-A tablet
	USB: host: ehci-mxc: Add error handling in ehci_mxc_drv_probe()
	tty: n_gsm: Fix bogus i++ in gsm_data_kick
	fpga: dfl: afu: Corrected error handling levels
	clk: samsung: exynos5433: Add IGNORE_UNUSED flag to sclk_i2s1
	scsi: target: tcmu: Userspace must not complete queued commands
	arm64: tegra: Fix ethernet phy-mode for Jetson Xavier
	powerpc/64s/pgtable: fix an undefined behaviour
	dm zoned: return NULL if dmz_get_zone_for_reclaim() fails to find a zone
	PCI/PTM: Inherit Switch Downstream Port PTM settings from Upstream Port
	PCI: dwc: Fix inner MSI IRQ domain registration
	IB/cma: Fix ports memory leak in cma_configfs
	watchdog: da9062: No need to ping manually before setting timeout
	usb: dwc2: gadget: move gadget resume after the core is in L0 state
	USB: gadget: udc: s3c2410_udc: Remove pointless NULL check in s3c2410_udc_nuke
	usb: gadget: lpc32xx_udc: don't dereference ep pointer before null check
	usb: gadget: fix potential double-free in m66592_probe.
	usb: gadget: Fix issue with config_ep_by_speed function
	RDMA/iw_cxgb4: cleanup device debugfs entries on ULD remove
	x86/apic: Make TSC deadline timer detection message visible
	ASoC: fix incomplete error-handling in img_i2s_in_probe.
	scsi: target: tcmu: Fix a use after free in tcmu_check_expired_queue_cmd()
	clk: bcm2835: Fix return type of bcm2835_register_gate
	scsi: ufs-qcom: Fix scheduling while atomic issue
	KVM: PPC: Book3S HV: Ignore kmemleak false positives
	clk: sprd: return correct type of value for _sprd_pll_recalc_rate
	net: sunrpc: Fix off-by-one issues in 'rpc_ntop6'
	NFSv4.1 fix rpc_call_done assignment for BIND_CONN_TO_SESSION
	of: Fix a refcounting bug in __of_attach_node_sysfs()
	powerpc/4xx: Don't unmap NULL mbase
	extcon: adc-jack: Fix an error handling path in 'adc_jack_probe()'
	ASoC: fsl_asrc_dma: Fix dma_chan leak when config DMA channel failed
	vfio/mdev: Fix reference count leak in add_mdev_supported_type
	rxrpc: Adjust /proc/net/rxrpc/calls to display call->debug_id not user_ID
	openrisc: Fix issue with argument clobbering for clone/fork
	gfs2: Allow lock_nolock mount to specify jid=X
	scsi: iscsi: Fix reference count leak in iscsi_boot_create_kobj
	scsi: ufs: Don't update urgent bkops level when toggling auto bkops
	pinctrl: imxl: Fix an error handling path in 'imx1_pinctrl_core_probe()'
	pinctrl: freescale: imx: Fix an error handling path in 'imx_pinctrl_probe()'
	crypto: omap-sham - add proper load balancing support for multicore
	geneve: change from tx_error to tx_dropped on missing metadata
	lib/zlib: remove outdated and incorrect pre-increment optimization
	include/linux/bitops.h: avoid clang shift-count-overflow warnings
	elfnote: mark all .note sections SHF_ALLOC
	selftests/vm/pkeys: fix alloc_random_pkey() to make it really random
	blktrace: use errno instead of bi_status
	blktrace: fix endianness in get_pdu_int()
	blktrace: fix endianness for blk_log_remap()
	gfs2: fix use-after-free on transaction ail lists
	ntb_perf: pass correct struct device to dma_alloc_coherent
	ntb_tool: pass correct struct device to dma_alloc_coherent
	NTB: ntb_tool: reading the link file should not end in a NULL byte
	NTB: Revert the change to use the NTB device dev for DMA allocations
	NTB: perf: Don't require one more memory window than number of peers
	NTB: perf: Fix support for hardware that doesn't have port numbers
	NTB: perf: Fix race condition when run with ntb_test
	NTB: ntb_test: Fix bug when counting remote files
	drivers/perf: hisi: Fix wrong value for all counters enable
	selftests/net: in timestamping, strncpy needs to preserve null byte
	afs: Fix memory leak in afs_put_sysnames()
	ASoC: core: only convert non DPCM link to DPCM link
	ASoC: Intel: bytcr_rt5640: Add quirk for Toshiba Encore WT10-A tablet
	ASoC: rt5645: Add platform-data for Asus T101HA
	drm/sun4i: hdmi ddc clk: Fix size of m divider
	scsi: acornscsi: Fix an error handling path in acornscsi_probe()
	x86/idt: Keep spurious entries unset in system_vectors
	net/filter: Permit reading NET in load_bytes_relative when MAC not set
	xdp: Fix xsk_generic_xmit errno
	usb/xhci-plat: Set PM runtime as active on resume
	usb: host: ehci-platform: add a quirk to avoid stuck
	usb/ehci-platform: Set PM runtime as active on resume
	perf report: Fix NULL pointer dereference in hists__fprintf_nr_sample_events()
	ext4: stop overwrite the errcode in ext4_setup_super
	bcache: fix potential deadlock problem in btree_gc_coalesce
	afs: Fix non-setting of mtime when writing into mmap
	afs: afs_write_end() should change i_size under the right lock
	block: Fix use-after-free in blkdev_get()
	arm64: hw_breakpoint: Don't invoke overflow handler on uaccess watchpoints
	libata: Use per port sync for detach
	drm: encoder_slave: fix refcouting error for modules
	drm/dp_mst: Reformat drm_dp_check_act_status() a bit
	drm/qxl: Use correct notify port address when creating cursor ring
	drm/amdgpu: Replace invalid device ID with a valid device ID
	selinux: fix double free
	ext4: fix partial cluster initialization when splitting extent
	ext4: avoid race conditions when remounting with options that change dax
	drm/dp_mst: Increase ACT retry timeout to 3s
	x86/boot/compressed: Relax sed symbol type regex for LLVM ld.lld
	block: nr_sects_write(): Disable preemption on seqcount write
	mtd: rawnand: Pass a nand_chip object to nand_scan()
	mtd: rawnand: Pass a nand_chip object to nand_release()
	mtd: rawnand: diskonchip: Fix the probe error path
	mtd: rawnand: sharpsl: Fix the probe error path
	mtd: rawnand: xway: Fix the probe error path
	mtd: rawnand: orion: Fix the probe error path
	mtd: rawnand: oxnas: Add of_node_put()
	mtd: rawnand: oxnas: Fix the probe error path
	mtd: rawnand: socrates: Fix the probe error path
	mtd: rawnand: plat_nand: Fix the probe error path
	mtd: rawnand: mtk: Fix the probe error path
	mtd: rawnand: tmio: Fix the probe error path
	s390: fix syscall_get_error for compat processes
	drm/i915: Whitelist context-local timestamp in the gen9 cmdparser
	drm/i915/icl+: Fix hotplug interrupt disabling after storm detection
	crypto: algif_skcipher - Cap recv SG list at ctx->used
	crypto: algboss - don't wait during notifier callback
	kprobes: Fix to protect kick_kprobe_optimizer() by kprobe_mutex
	e1000e: Do not wake up the system via WOL if device wakeup is disabled
	net: octeon: mgmt: Repair filling of RX ring
	kretprobe: Prevent triggering kretprobe from within kprobe_flush_task
	sched/rt, net: Use CONFIG_PREEMPTION.patch
	net: core: device_rename: Use rwsem instead of a seqcount
	Revert "dpaa_eth: fix usage as DSA master, try 3"
	md: add feature flag MD_FEATURE_RAID0_LAYOUT
	kvm: x86: Move kvm_set_mmio_spte_mask() from x86.c to mmu.c
	kvm: x86: Fix reserved bits related calculation errors caused by MKTME
	KVM: x86/mmu: Set mmio_value to '0' if reserved #PF can't be generated
	Linux 4.19.130

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I8fff23470852b747c3d75461b45f9d77460062d3
2020-06-27 09:50:13 +02:00
Jason Yan
49289b1fa5 block: Fix use-after-free in blkdev_get()
[ Upstream commit 2d3a8e2deddea6c89961c422ec0c5b851e648c14 ]

In blkdev_get() we call __blkdev_get() to do some internal jobs and if
there is some errors in __blkdev_get(), the bdput() is called which
means we have released the refcount of the bdev (actually the refcount of
the bdev inode). This means we cannot access bdev after that point. But
acctually bdev is still accessed in blkdev_get() after calling
__blkdev_get(). This results in use-after-free if the refcount is the
last one we released in __blkdev_get(). Let's take a look at the
following scenerio:

  CPU0            CPU1                    CPU2
blkdev_open     blkdev_open           Remove disk
                  bd_acquire
		  blkdev_get
		    __blkdev_get      del_gendisk
					bdev_unhash_inode
  bd_acquire          bdev_get_gendisk
    bd_forget           failed because of unhashed
	  bdput
	              bdput (the last one)
		        bdev_evict_inode

	  	    access bdev => use after free

[  459.350216] BUG: KASAN: use-after-free in __lock_acquire+0x24c1/0x31b0
[  459.351190] Read of size 8 at addr ffff88806c815a80 by task syz-executor.0/20132
[  459.352347]
[  459.352594] CPU: 0 PID: 20132 Comm: syz-executor.0 Not tainted 4.19.90 #2
[  459.353628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  459.354947] Call Trace:
[  459.355337]  dump_stack+0x111/0x19e
[  459.355879]  ? __lock_acquire+0x24c1/0x31b0
[  459.356523]  print_address_description+0x60/0x223
[  459.357248]  ? __lock_acquire+0x24c1/0x31b0
[  459.357887]  kasan_report.cold+0xae/0x2d8
[  459.358503]  __lock_acquire+0x24c1/0x31b0
[  459.359120]  ? _raw_spin_unlock_irq+0x24/0x40
[  459.359784]  ? lockdep_hardirqs_on+0x37b/0x580
[  459.360465]  ? _raw_spin_unlock_irq+0x24/0x40
[  459.361123]  ? finish_task_switch+0x125/0x600
[  459.361812]  ? finish_task_switch+0xee/0x600
[  459.362471]  ? mark_held_locks+0xf0/0xf0
[  459.363108]  ? __schedule+0x96f/0x21d0
[  459.363716]  lock_acquire+0x111/0x320
[  459.364285]  ? blkdev_get+0xce/0xbe0
[  459.364846]  ? blkdev_get+0xce/0xbe0
[  459.365390]  __mutex_lock+0xf9/0x12a0
[  459.365948]  ? blkdev_get+0xce/0xbe0
[  459.366493]  ? bdev_evict_inode+0x1f0/0x1f0
[  459.367130]  ? blkdev_get+0xce/0xbe0
[  459.367678]  ? destroy_inode+0xbc/0x110
[  459.368261]  ? mutex_trylock+0x1a0/0x1a0
[  459.368867]  ? __blkdev_get+0x3e6/0x1280
[  459.369463]  ? bdev_disk_changed+0x1d0/0x1d0
[  459.370114]  ? blkdev_get+0xce/0xbe0
[  459.370656]  blkdev_get+0xce/0xbe0
[  459.371178]  ? find_held_lock+0x2c/0x110
[  459.371774]  ? __blkdev_get+0x1280/0x1280
[  459.372383]  ? lock_downgrade+0x680/0x680
[  459.373002]  ? lock_acquire+0x111/0x320
[  459.373587]  ? bd_acquire+0x21/0x2c0
[  459.374134]  ? do_raw_spin_unlock+0x4f/0x250
[  459.374780]  blkdev_open+0x202/0x290
[  459.375325]  do_dentry_open+0x49e/0x1050
[  459.375924]  ? blkdev_get_by_dev+0x70/0x70
[  459.376543]  ? __x64_sys_fchdir+0x1f0/0x1f0
[  459.377192]  ? inode_permission+0xbe/0x3a0
[  459.377818]  path_openat+0x148c/0x3f50
[  459.378392]  ? kmem_cache_alloc+0xd5/0x280
[  459.379016]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  459.379802]  ? path_lookupat.isra.0+0x900/0x900
[  459.380489]  ? __lock_is_held+0xad/0x140
[  459.381093]  do_filp_open+0x1a1/0x280
[  459.381654]  ? may_open_dev+0xf0/0xf0
[  459.382214]  ? find_held_lock+0x2c/0x110
[  459.382816]  ? lock_downgrade+0x680/0x680
[  459.383425]  ? __lock_is_held+0xad/0x140
[  459.384024]  ? do_raw_spin_unlock+0x4f/0x250
[  459.384668]  ? _raw_spin_unlock+0x1f/0x30
[  459.385280]  ? __alloc_fd+0x448/0x560
[  459.385841]  do_sys_open+0x3c3/0x500
[  459.386386]  ? filp_open+0x70/0x70
[  459.386911]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[  459.387610]  ? trace_hardirqs_off_caller+0x55/0x1c0
[  459.388342]  ? do_syscall_64+0x1a/0x520
[  459.388930]  do_syscall_64+0xc3/0x520
[  459.389490]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  459.390248] RIP: 0033:0x416211
[  459.390720] Code: 75 14 b8 02 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83
04 19 00 00 c3 48 83 ec 08 e8 0a fa ff ff 48 89 04 24 b8 02 00 00 00 0f
   05 <48> 8b 3c 24 48 89 c2 e8 53 fa ff ff 48 89 d0 48 83 c4 08 48 3d
      01
[  459.393483] RSP: 002b:00007fe45dfe9a60 EFLAGS: 00000293 ORIG_RAX: 0000000000000002
[  459.394610] RAX: ffffffffffffffda RBX: 00007fe45dfea6d4 RCX: 0000000000416211
[  459.395678] RDX: 00007fe45dfe9b0a RSI: 0000000000000002 RDI: 00007fe45dfe9b00
[  459.396758] RBP: 000000000076bf20 R08: 0000000000000000 R09: 000000000000000a
[  459.397930] R10: 0000000000000075 R11: 0000000000000293 R12: 00000000ffffffff
[  459.399022] R13: 0000000000000bd9 R14: 00000000004cdb80 R15: 000000000076bf2c
[  459.400168]
[  459.400430] Allocated by task 20132:
[  459.401038]  kasan_kmalloc+0xbf/0xe0
[  459.401652]  kmem_cache_alloc+0xd5/0x280
[  459.402330]  bdev_alloc_inode+0x18/0x40
[  459.402970]  alloc_inode+0x5f/0x180
[  459.403510]  iget5_locked+0x57/0xd0
[  459.404095]  bdget+0x94/0x4e0
[  459.404607]  bd_acquire+0xfa/0x2c0
[  459.405113]  blkdev_open+0x110/0x290
[  459.405702]  do_dentry_open+0x49e/0x1050
[  459.406340]  path_openat+0x148c/0x3f50
[  459.406926]  do_filp_open+0x1a1/0x280
[  459.407471]  do_sys_open+0x3c3/0x500
[  459.408010]  do_syscall_64+0xc3/0x520
[  459.408572]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  459.409415]
[  459.409679] Freed by task 1262:
[  459.410212]  __kasan_slab_free+0x129/0x170
[  459.410919]  kmem_cache_free+0xb2/0x2a0
[  459.411564]  rcu_process_callbacks+0xbb2/0x2320
[  459.412318]  __do_softirq+0x225/0x8ac

Fix this by delaying bdput() to the end of blkdev_get() which means we
have finished accessing bdev.

Fixes: 77ea887e43 ("implement in-kernel gendisk events handling")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-25 15:33:06 +02:00
Greg Kroah-Hartman
ff0e96e80f This is the 4.19.94 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl4W8A4ACgkQONu9yGCS
 aT5ZcBAAha0GMcpxm1ettNVMXUVD/Df2pntc3x2G1T+dtI89YwIilJcdQBpbDGB6
 6oNRpnopc+/ynm820SMlhjBNE8KlDzHS3Tmsn1lplru0yOqZMFcFlHSESCAA0b4E
 T21KwQ4rLZTzW4LvTf//4WpJZD1RnVrwKkbgkci9kvCjZsdh2GMK3XkBeVBUdXuX
 3gvW+9zsgmkU3Bhk5Mi8JUmOw2yY5sJt2tDmIyxOtBknAo1TK6n4kqd+NgjfsdcI
 cnNTstDU0Ikmi4UBOZGDmey0THtHdvi/oM3DUkzHtZ68W0rg/gPu4nDR+Fx3sKvo
 y5bI10j4W16PKXyxVehel+lD8XmIV/+zSerS0enGjijBPZKI9FTlGEuczk0x7sj+
 wqMh3WkkPig2bQPrCOIjkA5VW4n/ZL07cjd1nNeZ48MkvA/3k47o4vDV/lPE88ZT
 49qqaJvZ3kAdqtV1pfzpQtrG1Pp8YPcEHAMYIM/6jb6poCro5vFtuRX4tzj2fRSf
 u7jSVPDt7ED9SgHPe+RrGWVIx2/iVnr5mVdi53rjWTWfeTdNIY5bUs/wsTde1k99
 9bnAhwD4ZbFrO240MMYPWpOCr8kl0LXAeyQ104m7ONbMRnLoRp4sQCae252jyHFD
 Qxgez5cDwDQnj2W4/SdXSWytioTnyVHsI89FkWw+f/IU8AsbBuw=
 =mmeT
 -----END PGP SIGNATURE-----

Merge 4.19.94 into android-4.19

Changes in 4.19.94
	nvme_fc: add module to ops template to allow module references
	nvme-fc: fix double-free scenarios on hw queues
	drm/amdgpu: add check before enabling/disabling broadcast mode
	drm/amdgpu: add cache flush workaround to gfx8 emit_fence
	drm/amd/display: Fixed kernel panic when booting with DP-to-HDMI dongle
	iio: adc: max9611: Fix too short conversion time delay
	PM / devfreq: Fix devfreq_notifier_call returning errno
	PM / devfreq: Set scaling_max_freq to max on OPP notifier error
	PM / devfreq: Don't fail devfreq_dev_release if not in list
	afs: Fix afs_find_server lookups for ipv4 peers
	afs: Fix SELinux setting security label on /afs
	RDMA/cma: add missed unregister_pernet_subsys in init failure
	rxe: correctly calculate iCRC for unaligned payloads
	scsi: lpfc: Fix memory leak on lpfc_bsg_write_ebuf_set func
	scsi: qla2xxx: Drop superfluous INIT_WORK of del_work
	scsi: qla2xxx: Don't call qlt_async_event twice
	scsi: qla2xxx: Fix PLOGI payload and ELS IOCB dump length
	scsi: qla2xxx: Configure local loop for N2N target
	scsi: qla2xxx: Send Notify ACK after N2N PLOGI
	scsi: qla2xxx: Ignore PORT UPDATE after N2N PLOGI
	scsi: iscsi: qla4xxx: fix double free in probe
	scsi: libsas: stop discovering if oob mode is disconnected
	drm/nouveau: Move the declaration of struct nouveau_conn_atom up a bit
	usb: gadget: fix wrong endpoint desc
	net: make socket read/write_iter() honor IOCB_NOWAIT
	afs: Fix creation calls in the dynamic root to fail with EOPNOTSUPP
	md: raid1: check rdev before reference in raid1_sync_request func
	s390/cpum_sf: Adjust sampling interval to avoid hitting sample limits
	s390/cpum_sf: Avoid SBD overflow condition in irq handler
	IB/mlx4: Follow mirror sequence of device add during device removal
	IB/mlx5: Fix steering rule of drop and count
	xen-blkback: prevent premature module unload
	xen/balloon: fix ballooned page accounting without hotplug enabled
	PM / hibernate: memory_bm_find_bit(): Tighten node optimisation
	ALSA: hda/realtek - Add Bass Speaker and fixed dac for bass speaker
	ALSA: hda/realtek - Enable the bass speaker of ASUS UX431FLC
	ALSA: hda - fixup for the bass speaker on Lenovo Carbon X1 7th gen
	xfs: fix mount failure crash on invalid iclog memory access
	taskstats: fix data-race
	drm: limit to INT_MAX in create_blob ioctl
	netfilter: nft_tproxy: Fix port selector on Big Endian
	ALSA: ice1724: Fix sleep-in-atomic in Infrasonic Quartet support code
	ALSA: usb-audio: fix set_format altsetting sanity check
	ALSA: usb-audio: set the interface format after resume on Dell WD19
	ALSA: hda/realtek - Add headset Mic no shutup for ALC283
	drm/sun4i: hdmi: Remove duplicate cleanup calls
	MIPS: Avoid VDSO ABI breakage due to global register variable
	media: pulse8-cec: fix lost cec_transmit_attempt_done() call
	media: cec: CEC 2.0-only bcast messages were ignored
	media: cec: avoid decrementing transmit_queue_sz if it is 0
	media: cec: check 'transmit_in_progress', not 'transmitting'
	mm/zsmalloc.c: fix the migrated zspage statistics.
	memcg: account security cred as well to kmemcg
	mm: move_pages: return valid node id in status if the page is already on the target node
	pstore/ram: Write new dumps to start of recycled zones
	locks: print unsigned ino in /proc/locks
	dmaengine: Fix access to uninitialized dma_slave_caps
	compat_ioctl: block: handle Persistent Reservations
	compat_ioctl: block: handle BLKREPORTZONE/BLKRESETZONE
	ata: libahci_platform: Export again ahci_platform_<en/dis>able_phys()
	ata: ahci_brcm: Fix AHCI resources management
	ata: ahci_brcm: Allow optional reset controller to be used
	ata: ahci_brcm: Add missing clock management during recovery
	ata: ahci_brcm: BCM7425 AHCI requires AHCI_HFLAG_DELAY_ENGINE
	libata: Fix retrieving of active qcs
	gpiolib: fix up emulated open drain outputs
	riscv: ftrace: correct the condition logic in function graph tracer
	rseq/selftests: Fix: Namespace gettid() for compatibility with glibc 2.30
	tracing: Fix lock inversion in trace_event_enable_tgid_record()
	tracing: Avoid memory leak in process_system_preds()
	tracing: Have the histogram compare functions convert to u64 first
	tracing: Fix endianness bug in histogram trigger
	apparmor: fix aa_xattrs_match() may sleep while holding a RCU lock
	ALSA: cs4236: fix error return comparison of an unsigned integer
	ALSA: firewire-motu: Correct a typo in the clock proc string
	exit: panic before exit_mm() on global init exit
	arm64: Revert support for execute-only user mappings
	ftrace: Avoid potential division by zero in function profiler
	drm/msm: include linux/sched/task.h
	PM / devfreq: Check NULL governor in available_governors_show
	nfsd4: fix up replay_matches_cache()
	HID: i2c-hid: Reset ALPS touchpads on resume
	ACPI: sysfs: Change ACPI_MASKABLE_GPE_MAX to 0x100
	xfs: don't check for AG deadlock for realtime files in bunmapi
	platform/x86: pmc_atom: Add Siemens CONNECT X300 to critclk_systems DMI table
	Bluetooth: btusb: fix PM leak in error case of setup
	Bluetooth: delete a stray unlock
	Bluetooth: Fix memory leak in hci_connect_le_scan
	media: flexcop-usb: ensure -EIO is returned on error condition
	regulator: ab8500: Remove AB8505 USB regulator
	media: usb: fix memory leak in af9005_identify_state
	dt-bindings: clock: renesas: rcar-usb2-clock-sel: Fix typo in example
	arm64: dts: meson: odroid-c2: Disable usb_otg bus to avoid power failed warning
	tty: serial: msm_serial: Fix lockup for sysrq and oops
	fix compat handling of FICLONERANGE, FIDEDUPERANGE and FS_IOC_FIEMAP
	bdev: Factor out bdev revalidation into a common helper
	bdev: Refresh bdev size for disks without partitioning
	scsi: qedf: Do not retry ELS request if qedf_alloc_cmd fails
	drm/mst: Fix MST sideband up-reply failure handling
	powerpc/pseries/hvconsole: Fix stack overread via udbg
	selftests: rtnetlink: add addresses with fixed life time
	KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag
	rxrpc: Fix possible NULL pointer access in ICMP handling
	tcp: annotate tp->rcv_nxt lockless reads
	net: core: limit nested device depth
	ath9k_htc: Modify byte order for an error message
	ath9k_htc: Discard undersized packets
	xfs: periodically yield scrub threads to the scheduler
	net: add annotations on hh->hh_len lockless accesses
	ubifs: ubifs_tnc_start_commit: Fix OOB in layout_in_gaps
	s390/smp: fix physical to logical CPU map for SMT
	xen/blkback: Avoid unmapping unmapped grant pages
	perf/x86/intel/bts: Fix the use of page_private()
	Linux 4.19.94

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic3d1a4e10565c38d0e82448f0fb7b6fd1822aab2
2020-01-09 16:14:43 +01:00
Jan Kara
a06ee0d2ff bdev: Refresh bdev size for disks without partitioning
commit cba22d86e0a10b7070d2e6a7379dbea51aa0883c upstream.

Currently, block device size in not updated on second and further open
for block devices where partition scan is disabled. This is particularly
annoying for example for DVD drives as that means block device size does
not get updated once the media is inserted into a drive if the device is
already open when inserting the media. This is actually always the case
for example when pktcdvd is in use.

Fix the problem by revalidating block device size on every open even for
devices with partition scan disabled.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09 10:19:07 +01:00
Jan Kara
2a5b513148 bdev: Factor out bdev revalidation into a common helper
commit 731dc4868311ee097757b8746eaa1b4f8b2b4f1c upstream.

Factor out code handling revalidation of bdev on disk change into a
common helper.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09 10:19:07 +01:00
Jaegeuk Kim
435c9a613f Merge remote-tracking branch 'aosp/upstream-f2fs-stable-linux-4.19.y/v5.5-rc1' into android-4.19
* aosp/upstream-f2fs-stable-linux-4.19.y:
  f2fs: stop GC when the victim becomes fully valid
  f2fs: expose main_blkaddr in sysfs
  f2fs: choose hardlimit when softlimit is larger than hardlimit in f2fs_statfs_project()
  f2fs: Fix deadlock in f2fs_gc() context during atomic files handling
  f2fs: show f2fs instance in printk_ratelimited
  f2fs: fix potential overflow
  f2fs: fix to update dir's i_pino during cross_rename
  f2fs: support aligned pinned file
  f2fs: avoid kernel panic on corruption test
  f2fs: fix wrong description in document
  f2fs: cache global IPU bio
  f2fs: fix to avoid memory leakage in f2fs_listxattr
  f2fs: check total_segments from devices in raw_super
  f2fs: update multi-dev metadata in resize_fs
  f2fs: mark recovery flag correctly in read_raw_super_block()
  f2fs: fix to update time in lazytime mode
  vfs: don't allow writes to swap files
  mm: set S_SWAPFILE on blockdev swap devices

Bug: 146023540
Change-Id: Ia24ce5f48f245dd7ba4fd94aa00a7d84615a8b22
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
2019-12-18 10:26:14 -08:00
Darrick J. Wong
8cfd90e159 vfs: don't allow writes to swap files
Don't let userspace write to an active swap file because the kernel
effectively has a long term lease on the storage and things could get
seriously corrupted if we let this happen.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-12-02 15:03:46 -08:00
Jason Yan
543bb48dc4 block: fix the return errno for direct IO
commit a89afe58f1a74aac768a5eb77af95ef4ee15beaa upstream.

If the last bio returned is not dio->bio, the status of the bio will
not assigned to dio->bio if it is error. This will cause the whole IO
status wrong.

    ksoftirqd/21-117   [021] ..s.  4017.966090:   8,0    C   N 4883648 [0]
          <idle>-0     [018] ..s.  4017.970888:   8,0    C  WS 4924800 + 1024 [0]
          <idle>-0     [018] ..s.  4017.970909:   8,0    D  WS 4935424 + 1024 [<idle>]
          <idle>-0     [018] ..s.  4017.970924:   8,0    D  WS 4936448 + 321 [<idle>]
    ksoftirqd/21-117   [021] ..s.  4017.995033:   8,0    C   R 4883648 + 336 [65475]
    ksoftirqd/21-117   [021] d.s.  4018.001988: myprobe1: (blkdev_bio_end_io+0x0/0x168) bi_status=7
    ksoftirqd/21-117   [021] d.s.  4018.001992: myprobe: (aio_complete_rw+0x0/0x148) x0=0xffff802f2595ad80 res=0x12a000 res2=0x0

We always have to assign bio->bi_status to dio->bio.bi_status because we
will only check dio->bio.bi_status when we return the whole IO to
the upper layer.

Fixes: 542ff7bf18 ("block: new direct I/O implementation")
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-17 08:38:52 +02:00
Jan Kara
1e11b1d630 blockdev: Fix livelocks on loop device
commit 04906b2f542c23626b0ef6219b808406f8dddbe9 upstream.

bd_set_size() updates also block device's block size. This is somewhat
unexpected from its name and at this point, only blkdev_open() uses this
functionality. Furthermore, this can result in changing block size under
a filesystem mounted on a loop device which leads to livelocks inside
__getblk_gfp() like:

Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 PID: 10863 Comm: syz-executor0 Not tainted 4.18.0-rc5+ #151
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
01/01/2011
RIP: 0010:__sanitizer_cov_trace_pc+0x3f/0x50 kernel/kcov.c:106
...
Call Trace:
 init_page_buffers+0x3e2/0x530 fs/buffer.c:904
 grow_dev_page fs/buffer.c:947 [inline]
 grow_buffers fs/buffer.c:1009 [inline]
 __getblk_slow fs/buffer.c:1036 [inline]
 __getblk_gfp+0x906/0xb10 fs/buffer.c:1313
 __bread_gfp+0x2d/0x310 fs/buffer.c:1347
 sb_bread include/linux/buffer_head.h:307 [inline]
 fat12_ent_bread+0x14e/0x3d0 fs/fat/fatent.c:75
 fat_ent_read_block fs/fat/fatent.c:441 [inline]
 fat_alloc_clusters+0x8ce/0x16e0 fs/fat/fatent.c:489
 fat_add_cluster+0x7a/0x150 fs/fat/inode.c:101
 __fat_get_block fs/fat/inode.c:148 [inline]
...

Trivial reproducer for the problem looks like:

truncate -s 1G /tmp/image
losetup /dev/loop0 /tmp/image
mkfs.ext4 -b 1024 /dev/loop0
mount -t ext4 /dev/loop0 /mnt
losetup -c /dev/loop0
l /mnt

Fix the problem by moving initialization of a block device block size
into a separate function and call it when needed.

Thanks to Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> for help with
debugging the problem.

Reported-by: syzbot+9933e4476f365f5d5a1b@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22 21:40:36 +01:00
Linus Torvalds
73ba2fb33c for-4.19/block-20180812
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltwvasQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpv65EACTq5gSLnJBI6ZPr1RAHruVDnjfzO2Veitl
 tUtjm0XfWmnEiwQ3dYvnyhy99xbyaG3900d9BClCTlH6xaUdSiQkDpcKG/R2F36J
 5mZitYukQcpFAQJWF8YKsTTE7JPl4VglCIDqYiC4+C3rOSVi8lrKn2qp4J4MMCFn
 thRg3jCcq7c5s9Eigsop1pXWQSasubkXfk55Krcp4oybKYpYRKXXf74Mj14QAbwJ
 QHN3VisyAUWoBRg7UQZo1Npe2oPk6bbnJypnjf8M0M2EnlvddEkIlHob91sodka8
 6p4APOEu5cbyXOBCAQsw/koff14mb8aEadqeQA68WvXfIdX9ZjfxCX0OoC3sBEXk
 yqJhZ0C980AM13zIBD8ejv4uasGcPca8W+47mE5P8sRiI++5kBsFWDZPCtUBna0X
 2Kh24NsmEya9XRR5vsB84dsIPQ3tLMkxg/IgQRVDaSnfJz0c/+zm54xDyKRaFT4l
 5iERk2WSkm9+8jNfVmWG0edrv6nRAXjpGwFfOCPh6/LCSCi4xQRULYN7sVzsX8ZK
 FRjt24HftBI8mJbh4BtweJvg+ppVe1gAk3IO3HvxAQhv29Hz+uvFYe9kL+3N8LJA
 Qosr9n9O4+wKYizJcDnw+5iPqCHfAwOm9th4pyedR+R7SmNcP3yNC8AbbheNBiF5
 Zolos5H+JA==
 =b9ib
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "First pull request for this merge window, there will also be a
  followup request with some stragglers.

  This pull request contains:

   - Fix for a thundering heard issue in the wbt block code (Anchal
     Agarwal)

   - A few NVMe pull requests:
      * Improved tracepoints (Keith)
      * Larger inline data support for RDMA (Steve Wise)
      * RDMA setup/teardown fixes (Sagi)
      * Effects log suppor for NVMe target (Chaitanya Kulkarni)
      * Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
      * TP4004 (ANA) support (Christoph)
      * Various NVMe fixes

   - Block io-latency controller support. Much needed support for
     properly containing block devices. (Josef)

   - Series improving how we handle sense information on the stack
     (Kees)

   - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

   - Zoned device support for null_blk (Matias)

   - AIX partition fixes (Mauricio Faria de Oliveira)

   - DIF checksum code made generic (Max Gurtovoy)

   - Add support for discard in iostats (Michael Callahan / Tejun)

   - Set of updates for BFQ (Paolo)

   - Removal of async write support for bsg (Christoph)

   - Bio page dirtying and clone fixups (Christoph)

   - Set of bcache fix/changes (via Coly)

   - Series improving blk-mq queue setup/teardown speed (Ming)

   - Series improving merging performance on blk-mq (Ming)

   - Lots of other fixes and cleanups from a slew of folks"

* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
  blkcg: Make blkg_root_lookup() work for queues in bypass mode
  bcache: fix error setting writeback_rate through sysfs interface
  null_blk: add lock drop/acquire annotation
  Blk-throttle: reduce tail io latency when iops limit is enforced
  block: paride: pd: mark expected switch fall-throughs
  block: Ensure that a request queue is dissociated from the cgroup controller
  block: Introduce blk_exit_queue()
  blkcg: Introduce blkg_root_lookup()
  block: Remove two superfluous #include directives
  blk-mq: count the hctx as active before allocating tag
  block: bvec_nr_vecs() returns value for wrong slab
  bcache: trivial - remove tailing backslash in macro BTREE_FLAG
  bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
  bcache: set max writeback rate when I/O request is idle
  bcache: add code comments for bset.c
  bcache: fix mistaken comments in request.c
  bcache: fix mistaken code comments in bcache.h
  bcache: add a comment in super.c
  bcache: avoid unncessary cache prefetch bch_btree_node_get()
  bcache: display rate debug parameters to 0 when writeback is not running
  ...
2018-08-14 10:23:25 -07:00
Linus Torvalds
eb181a814c for-linus-20180727
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltbc20QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgh4D/9GYQcjk9qLVFxkv5ucAUvCuxEL6gjsMf4W
 M/QdxVIrwh3zpvsH++2IXXn+xH+UjujMA5NkzhsSr4+hsSO2iAGOYMJbroNfhsTD
 onvQQ6NTaHPu/+PZs0otVK4KMWHwZGWOV6YU00TWTfRgzRmGEsSMe91oeBIXVv9w
 v6d09twaLSY0lUkAAbcdu5fuFBtXu4Bxy60qyHEKkAdWWHEUYaZLrODhVjoGg2V4
 KdAWS5X4A6kJMcPcoOvG6RFtpf71boaip9o/DRLUWhGdIQnI38UgSCUmz1XMYnik
 Sq8r74vqCm8IhIOLTlxnPrMHHbKv7JZhY3Ow9fxnS6HZRNI0aPX31Yml6NULqnWh
 MsQh+6gZXd3xC1O7txEQn4a15Lk0OLXa8HJcIn5ADNxqz5/r/g0mPUG9HmPSIalO
 ISFF/9UKQFcAd0RjHR+bEEH2VMznz59UWKfdOsmwFZtZSCmR1ucj0xAKDj+oP1JS
 ZsgZ09K2GezrL4GEueocISo9ACIWgDWH8T7/bTxlBok0IYbybAfmOe+MZInL1Tf4
 pklmoXm3ntgV3Pq8Ptk05LYyIgAaUIltuSiR3AFaXIADX0wNtV0ZgysIWgHf3BSA
 18j+I1yPG1IwBdM8xNwxi56xMQR84uY5tsIyafbfj+laRI2nH5OIYjNZnrKpm957
 4xZUgIECBA==
 =2ogY
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180727' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Bigger than usual at this time, mostly due to the O_DIRECT corruption
  issue and the fact that I was on vacation last week. This contains:

   - NVMe pull request with two fixes for the FC code, and two target
     fixes (Christoph)

   - a DIF bio reset iteration fix (Greg Edwards)

   - two nbd reply and requeue fixes (Josef)

   - SCSI timeout fixup (Keith)

   - a small series that fixes an issue with bio_iov_iter_get_pages(),
     which ended up causing corruption for larger sized O_DIRECT writes
     that ended up racing with buffered writes (Martin Wilck)"

* tag 'for-linus-20180727' of git://git.kernel.dk/linux-block:
  block: reset bi_iter.bi_done after splitting bio
  block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
  blkdev: __blkdev_direct_IO_simple: fix leak in error case
  block: bio_iov_iter_get_pages: fix size of last iovec
  nvmet: only check for filebacking on -ENOTBLK
  nvmet: fixup crash on NULL device path
  scsi: set timed out out mq requests to complete
  blk-mq: export setting request completion state
  nvme: if_ready checks to fail io to deleting controller
  nvmet-fc: fix target sgl list on large transfers
  nbd: handle unexpected replies better
  nbd: don't requeue the same request twice.
2018-07-27 12:51:00 -07:00
Martin Wilck
9362dd1109 blkdev: __blkdev_direct_IO_simple: fix leak in error case
Fixes: 72ecad22d9 ("block: support a full bio worth of IO for simplified bdev direct-io")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 11:52:33 -06:00
Tejun Heo
3f289dcb4b block: make bdev_ops->rw_page() take a REQ_OP instead of bool
c11f0c0b5b ("block/mm: make bdev_ops->rw_page() take a bool for
read/write") replaced @op with boolean @is_write, which limited the
amount of information going into ->rw_page() and more importantly
page_endio(), which removed the need to expose block internals to mm.

Unfortunately, we want to track discards separately and @is_write
isn't enough information.  This patch updates bdev_ops->rw_page() to
take REQ_OP instead but leaves page_endio() to take bool @is_write.
This allows the block part of operations to have enough information
while not leaking it to mm.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:14 -06:00
Kees Cook
6da2ec5605 treewide: kmalloc() -> kmalloc_array()
The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:

        kmalloc(a * b, gfp)

with:
        kmalloc_array(a * b, gfp)

as well as handling cases of:

        kmalloc(a * b * c, gfp)

with:

        kmalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kmalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kmalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kmalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kmalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kmalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kmalloc
+ kmalloc_array
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kmalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kmalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kmalloc(sizeof(THING) * C2, ...)
|
  kmalloc(sizeof(TYPE) * C2, ...)
|
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Linus Torvalds
4a189982e2 Merge branch 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull aio iopriority support from Al Viro:
 "The rest of aio stuff for this cycle - Adam's aio ioprio series"

* 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: aio ioprio use ioprio_check_cap ret val
  fs: aio ioprio add explicit block layer dependence
  fs: iomap dio set bio prio from kiocb prio
  fs: blkdev set bio prio from kiocb prio
  fs: Add aio iopriority support
  fs: Convert kiocb rw_hint from enum to u16
  block: add ioprio_check_cap function
2018-06-08 10:00:20 -07:00
Adam Manzanares
074111ca5f fs: blkdev set bio prio from kiocb prio
Now that kiocb has an ioprio field copy this over to the bio when it is
created from the kiocb.

Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-31 10:50:55 -04:00
Kent Overstreet
52190f8abe fs: convert block_dev.c to bioset_init()
Convert block DIO code to embedded bio sets.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Christoph Hellwig
5afb78356c block: don't print a message when the device went away
The information about a size change in this case just creates confusion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Christoph Hellwig
4163a03984 block: unexport check_disk_size_change
Only used in block_dev.c and the partitions code, and it should remain
that way..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Linus Torvalds
9f3a0941fb libnvdimm for 4.17
* A rework of the filesytem-dax implementation provides for detection of
   unmap operations (truncate / hole punch) colliding with in-progress
   device-DMA. A fix for these collisions remains a work-in-progress
   pending resolution of truncate latency and starvation regressions.
 
 * The of_pmem driver expands the users of libnvdimm outside of x86 and
   ACPI to describe an implementation of persistent memory on PowerPC with
   Open Firmware / Device tree.
 
 * Address Range Scrub (ARS) handling is completely rewritten to account for
   the fact that ARS may run for 100s of seconds and there is no platform
   defined way to cancel it. ARS will now no longer block namespace
   initialization.
 
 * The NVDIMM Namespace Label implementation is updated to handle label
   areas as small as 1K, down from 128K.
 
 * Miscellaneous cleanups and updates to unit test infrastructure.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJazDt5AAoJEB7SkWpmfYgCqGMQALLwdPeY87cUK7AvQ2IXj46B
 lJgeVuHPzyQDbC03AS5uUYnnU3I5lFd7i4y7ZrywNpFs4lsb/bNmbUpQE5xp+Yvc
 1MJ/JYDIP5X4misWYm3VJo85N49+VqSRgAQk52PBigwnZ7M6/u4cSptXM9//c9JL
 /NYbat6IjjY6Tx49Tec6+F3GMZjsFLcuTVkQcREoOyOqVJE4YpP0vhNjEe0vq6vr
 EsSWiqEI5VFH4PfJwKdKj/64IKB4FGKj2A5cEgjQBxW2vw7tTJnkRkdE3jDUjqtg
 xYAqGp/Dqs4+bgdYlT817YhiOVrcr5mOHj7TKWQrBPgzKCbcG5eKDmfT8t+3NEga
 9kBlgisqIcG72lwZNA7QkEHxq1Omy9yc1hUv9qz2YA0G+J1WE8l1T15k1DOFwV57
 qIrLLUypklNZLxvrzNjclempboKc4JCUlj+TdN5E5Y6pRs55UWTXaP7Xf5O7z0vf
 l/uiiHkc3MPH73YD2PSEGFJ8m8EU0N8xhrcz3M9E2sHgYCnbty1Lw3FH0/GhThVA
 ya1mMeDdb8A2P7gWCBk1Lqeig+rJKXSey4hKM6D0njOEtMQO1H4tFqGjyfDX1xlJ
 3plUR9WBVEYzN5+9xWbwGag/ezGZ+NfcVO2gmy6yXiEph796BxRAZx/18zKRJr0m
 9eGJG1H+JspcbtLF9iHn
 =acZQ
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "This cycle was was not something I ever want to repeat as there were
  several late changes that have only now just settled.

  Half of the branch up to commit d2c997c0f1 ("fs, dax: use
  page->mapping to warn...") have been in -next for several releases.
  The of_pmem driver and the address range scrub rework were late
  arrivals, and the dax work was scaled back at the last moment.

  The of_pmem driver missed a previous merge window due to an oversight.
  A sense of obligation to rectify that miss is why it is included for
  4.17. It has acks from PowerPC folks. Stephen reported a build failure
  that only occurs when merging it with your latest tree, for now I have
  fixed that up by disabling modular builds of of_pmem. A test merge
  with your tree has received a build success report from the 0day robot
  over 156 configs.

  An initial version of the ARS rework was submitted before the merge
  window. It is self contained to libnvdimm, a net code reduction, and
  passing all unit tests.

  The filesystem-dax changes are based on the wait_var_event()
  functionality from tip/sched/core. However, late review feedback
  showed that those changes regressed truncate performance to a large
  degree. The branch was rewound to drop the truncate behavior change
  and now only includes preparation patches and cleanups (with full acks
  and reviews). The finalization of this dax-dma-vs-trnucate work will
  need to wait for 4.18.

  Summary:

   - A rework of the filesytem-dax implementation provides for detection
     of unmap operations (truncate / hole punch) colliding with
     in-progress device-DMA. A fix for these collisions remains a
     work-in-progress pending resolution of truncate latency and
     starvation regressions.

   - The of_pmem driver expands the users of libnvdimm outside of x86
     and ACPI to describe an implementation of persistent memory on
     PowerPC with Open Firmware / Device tree.

   - Address Range Scrub (ARS) handling is completely rewritten to
     account for the fact that ARS may run for 100s of seconds and there
     is no platform defined way to cancel it. ARS will now no longer
     block namespace initialization.

   - The NVDIMM Namespace Label implementation is updated to handle
     label areas as small as 1K, down from 128K.

   - Miscellaneous cleanups and updates to unit test infrastructure"

* tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
  libnvdimm, of_pmem: workaround OF_NUMA=n build error
  nfit, address-range-scrub: add module option to skip initial ars
  nfit, address-range-scrub: rework and simplify ARS state machine
  nfit, address-range-scrub: determine one platform max_ars value
  powerpc/powernv: Create platform devs for nvdimm buses
  doc/devicetree: Persistent memory region bindings
  libnvdimm: Add device-tree based driver
  libnvdimm: Add of_node to region and bus descriptors
  libnvdimm, region: quiet region probe
  libnvdimm, namespace: use a safe lookup for dimm device name
  libnvdimm, dimm: fix dpa reservation vs uninitialized label area
  libnvdimm, testing: update the default smart ctrl_temperature
  libnvdimm, testing: Add emulation for smart injection commands
  nfit, address-range-scrub: introduce nfit_spa->ars_state
  libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
  nfit, address-range-scrub: fix scrub in-progress reporting
  dax, dm: allow device-mapper to operate without dax support
  dax: introduce CONFIG_DAX_DRIVER
  fs, dax: use page->mapping to warn if truncate collides with a busy page
  ext2, dax: introduce ext2_dax_aops
  ...
2018-04-10 10:25:57 -07:00
shunki-fujita
849cf55963 fs: don't flush pagecache when expanding block device
When changing the size of a block device, its all caches are freed.
It's necessary on shrinking to prevent spurious I/Os to the disappeared
region.  However, on expanding, such kind of I/Os doesn't happen.

Similar things can be considered for btrfs filesystem resize and
resize2fs, but they are designed not to drop caches when expanding.
Therefore this patch removes unnecessary cache drop.

Link: http://lkml.kernel.org/r/1521457240-153390-1-git-send-email-shunki-fujita@cybozu.co.jp
Signed-off-by: Shunki Fujita <shunki-fujita@cybozu.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:23 -07:00
Dan Williams
15aa8a0118 block, dax: remove dead code in blkdev_writepages()
Block device inodes never have S_DAX set, so kill the check for DAX and
diversion to dax_writeback_mapping_range().

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-03-30 11:34:55 -07:00
Jan Kara
560e7cb2f3 blockdev: Avoid two active bdev inodes for one device
When blkdev_open() races with device removal and creation it can happen
that unhashed bdev inode gets associated with newly created gendisk
like:

CPU0					CPU1
blkdev_open()
  bdev = bd_acquire()
					del_gendisk()
					  bdev_unhash_inode(bdev);
					remove device
					create new device with the same number
  __blkdev_get()
    disk = get_gendisk()
      - gets reference to gendisk of the new device

Now another blkdev_open() will not find original 'bdev' as it got
unhashed, create a new one and associate it with the same 'disk' at
which point problems start as we have two independent page caches for
one device.

Fix the problem by verifying that the bdev inode didn't get unhashed
before we acquired gendisk reference. That way we make sure gendisk can
get associated only with visible bdev inodes.

Tested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-26 09:48:42 -07:00
Jan Kara
897366537f genhd: Fix use after free in __blkdev_get()
When two blkdev_open() calls race with device removal and recreation,
__blkdev_get() can use looked up gendisk after it is freed:

CPU0				CPU1			CPU2
							del_gendisk(disk);
							  bdev_unhash_inode(inode);
blkdev_open()			blkdev_open()
  bdev = bd_acquire(inode);
    - creates and returns new inode
				  bdev = bd_acquire(inode);
				    - returns the same inode
  __blkdev_get(devt)		  __blkdev_get(devt)
    disk = get_gendisk(devt);
      - got structure of device going away
							<finish device removal>
							<new device gets
							 created under the same
							 device number>
				  disk = get_gendisk(devt);
				    - got new device structure
				  if (!bdev->bd_openers) {
				    does the first open
				  }
    if (!bdev->bd_openers)
      - false
    } else {
      put_disk_and_module(disk)
        - remember this was old device - this was last ref and disk is
          now freed
    }
    disk_unblock_events(disk); -> oops

Fix the problem by making sure we drop reference to disk in
__blkdev_get() only after we are really done with it.

Reported-by: Hou Tao <houtao1@huawei.com>
Tested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-26 09:48:42 -07:00
Jan Kara
9df6c29912 genhd: Add helper put_disk_and_module()
Add a proper counterpart to get_disk_and_module() -
put_disk_and_module(). Currently it is opencoded in several places.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-26 09:48:42 -07:00
Linus Torvalds
e2c5923c34 Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-block
Pull core block layer updates from Jens Axboe:
 "This is the main pull request for block storage for 4.15-rc1.

  Nothing out of the ordinary in here, and no API changes or anything
  like that. Just various new features for drivers, core changes, etc.
  In particular, this pull request contains:

   - A patch series from Bart, closing the whole on blk/scsi-mq queue
     quescing.

   - A series from Christoph, building towards hidden gendisks (for
     multipath) and ability to move bio chains around.

   - NVMe
        - Support for native multipath for NVMe (Christoph).
        - Userspace notifications for AENs (Keith).
        - Command side-effects support (Keith).
        - SGL support (Chaitanya Kulkarni)
        - FC fixes and improvements (James Smart)
        - Lots of fixes and tweaks (Various)

   - bcache
        - New maintainer (Michael Lyle)
        - Writeback control improvements (Michael)
        - Various fixes (Coly, Elena, Eric, Liang, et al)

   - lightnvm updates, mostly centered around the pblk interface
     (Javier, Hans, and Rakesh).

   - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

   - Writeback series that fix the much discussed hundreds of millions
     of sync-all units. This goes all the way, as discussed previously
     (me).

   - Fix for missing wakeup on writeback timer adjustments (Yafang
     Shao).

   - Fix laptop mode on blk-mq (me).

   - {mq,name} tupple lookup for IO schedulers, allowing us to have
     alias names. This means you can use 'deadline' on both !mq and on
     mq (where it's called mq-deadline). (me).

   - blktrace race fix, oopsing on sg load (me).

   - blk-mq optimizations (me).

   - Obscure waitqueue race fix for kyber (Omar).

   - NBD fixes (Josef).

   - Disable writeback throttling by default on bfq, like we do on cfq
     (Luca Miccio).

   - Series from Ming that enable us to treat flush requests on blk-mq
     like any other request. This is a really nice cleanup.

   - Series from Ming that improves merging on blk-mq with schedulers,
     getting us closer to flipping the switch on scsi-mq again.

   - BFQ updates (Paolo).

   - blk-mq atomic flags memory ordering fixes (Peter Z).

   - Loop cgroup support (Shaohua).

   - Lots of minor fixes from lots of different folks, both for core and
     driver code"

* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
  nvme: fix visibility of "uuid" ns attribute
  blk-mq: fixup some comment typos and lengths
  ide: ide-atapi: fix compile error with defining macro DEBUG
  blk-mq: improve tag waiting setup for non-shared tags
  brd: remove unused brd_mutex
  blk-mq: only run the hardware queue if IO is pending
  block: avoid null pointer dereference on null disk
  fs: guard_bio_eod() needs to consider partitions
  xtensa/simdisk: fix compile error
  nvme: expose subsys attribute to sysfs
  nvme: create 'slaves' and 'holders' entries for hidden controllers
  block: create 'slaves' and 'holders' entries for hidden gendisks
  nvme: also expose the namespace identification sysfs files for mpath nodes
  nvme: implement multipath access to nvme subsystems
  nvme: track shared namespaces
  nvme: introduce a nvme_ns_ids structure
  nvme: track subsystems
  block, nvme: Introduce blk_mq_req_flags_t
  block, scsi: Make SCSI quiesce and resume work reliably
  block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
  ...
2017-11-14 15:32:19 -08:00
Bart Van Assche
3a0a529971 block, scsi: Make SCSI quiesce and resume work reliably
The contexts from which a SCSI device can be quiesced or resumed are:
* Writing into /sys/class/scsi_device/*/device/state.
* SCSI parallel (SPI) domain validation.
* The SCSI device power management methods. See also scsi_bus_pm_ops.

It is essential during suspend and resume that neither the filesystem
state nor the filesystem metadata in RAM changes. This is why while
the hibernation image is being written or restored that SCSI devices
are quiesced. The SCSI core quiesces devices through scsi_device_quiesce()
and scsi_device_resume(). In the SDEV_QUIESCE state execution of
non-preempt requests is deferred. This is realized by returning
BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
devices. Avoid that a full queue prevents power management requests
to be submitted by deferring allocation of non-preempt requests for
devices in the quiesced state. This patch has been tested by running
the following commands and by verifying that after each resume the
fio job was still running:

for ((i=0; i<10; i++)); do
  (
    cd /sys/block/md0/md &&
    while true; do
      [ "$(<sync_action)" = "idle" ] && echo check > sync_action
      sleep 1
    done
  ) &
  pids=($!)
  for d in /sys/class/block/sd*[a-z]; do
    bdev=${d#/sys/class/block/}
    hcil=$(readlink "$d/device")
    hcil=${hcil#../../../}
    echo 4 > "$d/queue/nr_requests"
    echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
    fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 \
      --rw=randread --ioengine=libaio --numjobs=4 --iodepth=16       \
      --iodepth_batch=1 --thread --loops=$((2**31)) &
    pids+=($!)
  done
  sleep 1
  echo "$(date) Hibernating ..." >>hibernate-test-log.txt
  systemctl hibernate
  sleep 10
  kill "${pids[@]}"
  echo idle > /sys/block/md0/md/sync_action
  wait
  echo "$(date) Done." >>hibernate-test-log.txt
done

Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
References: "I/O hangs after resuming from suspend-to-ram" (https://marc.info/?l=linux-block&m=150340235201348).
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Martin Steigerwald <martin@lichtvoll.de>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Christoph Hellwig
ea435e1b93 block: add a poll_fn callback to struct request_queue
That we we can also poll non blk-mq queues.  Mostly needed for
the NVMe multipath code, but could also be useful elsewhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Matthew Wilcox
f892760aa6 fs/mpage.c: fix mpage_writepage() for pages with buffers
When using FAT on a block device which supports rw_page, we can hit
BUG_ON(!PageLocked(page)) in try_to_free_buffers().  This is because we
call clean_buffers() after unlocking the page we've written.  Introduce
a new clean_page_buffers() which cleans all buffers associated with a
page and call it from within bdev_write_page().

[akpm@linux-foundation.org: s/PAGE_SIZE/~0U/ per Linus and Matthew]
Link: http://lkml.kernel.org/r/20171006211541.GA7409@bombadil.infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Tested-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-13 16:18:33 -07:00
Rakesh Pandit
7f66721a7d fs/block_dev: remove vfs_msg() interface
Replaced by pr_err usage in commit ef51042472 ("block, dax: move
"select DAX" from BLOCK to FS_DAX")

Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-12 12:30:24 -06:00
Linus Torvalds
e253d98f5b Merge branch 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull nowait read support from Al Viro:
 "Support IOCB_NOWAIT for buffered reads and block devices"

* 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  block_dev: support RFW_NOWAIT on block device nodes
  fs: support RWF_NOWAIT for buffered reads
  fs: support IOCB_NOWAIT in generic_file_buffered_read
  fs: pass iocb to do_generic_file_read
2017-09-14 19:29:55 -07:00
Christoph Hellwig
c35fc7a5ab block_dev: support RFW_NOWAIT on block device nodes
All support is already there in the generic code, we just need to wire
it up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:04:23 -04:00
Christoph Hellwig
74d46992e0 block: replace bi_bdev with a gendisk pointer and partitions index
This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:55 -06:00
Christoph Hellwig
c2ee070fb0 block: cache the partition index in struct block_device
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:53 -06:00
Linus Torvalds
088737f44b Writeback error handling fixes (pile #2)
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZXhmCAAoJEAAOaEEZVoIVpRkP/1qlYn3pq6d5Kuz84pejOmlL
 5jbkS/cOmeTxeUU4+B1xG8Lx7bAk8PfSXQOADbSJGiZd0ug95tJxplFYIGJzR/tG
 aNMHeu/BVKKhUKORGuKR9rJKtwC839L/qao+yPBo5U3mU4L73rFWX8fxFuhSJ8HR
 hvkgBu3Hx6GY59CzxJ8iJzj+B+uPSFrNweAk0+0UeWkBgTzEdiGqaXBX4cHIkq/5
 hMoCG+xnmwHKbCBsQ5js+YJT+HedZ4lvfjOqGxgElUyjJ7Bkt/IFYOp8TUiu193T
 tA4UinDjN8A7FImmIBIftrECmrAC9HIGhGZroYkMKbb8ReDR2ikE5FhKEpuAGU3a
 BXBgX2mPQuArvZWM7qeJCkxV9QJ0u/8Ykbyzo30iPrICyrzbEvIubeB/mDA034+Z
 Z0/z8C3v7826F3zP/NyaQEojUgRq30McMOIS8GMnx15HJwRsRKlzjfy9Wm4tWhl0
 t3nH1jMqAZ7068s6rfh/oCwdgGOwr5o4hW/bnlITzxbjWQUOnZIe7KBxIezZJ2rv
 OcIwd5qE8PNtpagGj5oUbnjGOTkERAgsMfvPk5tjUNt28/qUlVs2V0aeo47dlcsh
 oYr8WMOIzw98Rl7Bo70mplLrqLD6nGl0LfXOyUlT4STgLWW4ksmLVuJjWIUxcO/0
 yKWjj9wfYRQ0vSUqhsI5
 =3Z93
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull Writeback error handling updates from Jeff Layton:
 "This pile represents the bulk of the writeback error handling fixes
  that I have for this cycle. Some of the earlier patches in this pile
  may look trivial but they are prerequisites for later patches in the
  series.

  The aim of this set is to improve how we track and report writeback
  errors to userland. Most applications that care about data integrity
  will periodically call fsync/fdatasync/msync to ensure that their
  writes have made it to the backing store.

  For a very long time, we have tracked writeback errors using two flags
  in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
  writeback error occurs (via mapping_set_error) and are cleared as a
  side-effect of filemap_check_errors (as you noted yesterday). This
  model really sucks for userland.

  Only the first task to call fsync (or msync or fdatasync) will see the
  error. Any subsequent task calling fsync on a file will get back 0
  (unless another writeback error occurs in the interim). If I have
  several tasks writing to a file and calling fsync to ensure that their
  writes got stored, then I need to have them coordinate with one
  another. That's difficult enough, but in a world of containerized
  setups that coordination may even not be possible.

  But wait...it gets worse!

  The calls to filemap_check_errors can be buried pretty far down in the
  call stack, and there are internal callers of filemap_write_and_wait
  and the like that also end up clearing those errors. Many of those
  callers ignore the error return from that function or return it to
  userland at nonsensical times (e.g. truncate() or stat()). If I get
  back -EIO on a truncate, there is no reason to think that it was
  because some previous writeback failed, and a subsequent fsync() will
  (incorrectly) return 0.

  This pile aims to do three things:

   1) ensure that when a writeback error occurs that that error will be
      reported to userland on a subsequent fsync/fdatasync/msync call,
      regardless of what internal callers are doing

   2) report writeback errors on all file descriptions that were open at
      the time that the error occurred. This is a user-visible change,
      but I think most applications are written to assume this behavior
      anyway. Those that aren't are unlikely to be hurt by it.

   3) document what filesystems should do when there is a writeback
      error. Today, there is very little consistency between them, and a
      lot of cargo-cult copying. We need to make it very clear what
      filesystems should do in this situation.

  To achieve this, the set adds a new data type (errseq_t) and then
  builds new writeback error tracking infrastructure around that. Once
  all of that is in place, we change the filesystems to use the new
  infrastructure for reporting wb errors to userland.

  Note that this is just the initial foray into cleaning up this mess.
  There is a lot of work remaining here:

   1) convert the rest of the filesystems in a similar fashion. Once the
      initial set is in, then I think most other fs' will be fairly
      simple to convert. Hopefully most of those can in via individual
      filesystem trees.

   2) convert internal waiters on writeback to use errseq_t for
      detecting errors instead of relying on the AS_* flags. I have some
      draft patches for this for ext4, but they are not quite ready for
      prime time yet.

  This was a discussion topic this year at LSF/MM too. If you're
  interested in the gory details, LWN has some good articles about this:

      https://lwn.net/Articles/718734/
      https://lwn.net/Articles/724307/"

* tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  btrfs: minimal conversion to errseq_t writeback error reporting on fsync
  xfs: minimal conversion to errseq_t writeback error reporting
  ext4: use errseq_t based error handling for reporting data writeback errors
  fs: convert __generic_file_fsync to use errseq_t based reporting
  block: convert to errseq_t based writeback error tracking
  dax: set errors in mapping when writeback fails
  Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
  mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
  fs: new infrastructure for writeback error handling and reporting
  lib: add errseq_t type and infrastructure for handling it
  mm: don't TestClearPageError in __filemap_fdatawait_range
  mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
  jbd2: don't clear and reset errors after waiting on writeback
  buffer: set errors in mapping at the time that the error occurs
  fs: check for writeback errors after syncing out buffers in generic_file_fsync
  buffer: use mapping_set_error instead of setting the flag
  mm: fix mapping_set_error call in me_pagecache_dirty
2017-07-07 19:38:17 -07:00
Jeff Layton
372cf243ea block: convert to errseq_t based writeback error tracking
This is a very minimal conversion to errseq_t based error tracking
for raw block device access. Just have it use the standard
file_write_and_wait_range call.

Note that there are internal callers that call sync_blockdev
and the like that are not affected by this. They'll continue
to use the AS_EIO/AS_ENOSPC flags for error reporting like
they always have for now.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-06 07:02:28 -04:00
Jeff Layton
5660e13d2f fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.

If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.

This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.

This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-06 07:02:25 -04:00
Linus Torvalds
c6b1e36c8f Merge branch 'for-4.13/block' of git://git.kernel.dk/linux-block
Pull core block/IO updates from Jens Axboe:
 "This is the main pull request for the block layer for 4.13. Not a huge
  round in terms of features, but there's a lot of churn related to some
  core cleanups.

  Note this depends on the UUID tree pull request, that Christoph
  already sent out.

  This pull request contains:

   - A series from Christoph, unifying the error/stats codes in the
     block layer. We now use blk_status_t everywhere, instead of using
     different schemes for different places.

   - Also from Christoph, some cleanups around request allocation and IO
     scheduler interactions in blk-mq.

   - And yet another series from Christoph, cleaning up how we handle
     and do bounce buffering in the block layer.

   - A blk-mq debugfs series from Bart, further improving on the support
     we have for exporting internal information to aid debugging IO
     hangs or stalls.

   - Also from Bart, a series that cleans up the request initialization
     differences across types of devices.

   - A series from Goldwyn Rodrigues, allowing the block layer to return
     failure if we will block and the user asked for non-blocking.

   - Patch from Hannes for supporting setting loop devices block size to
     that of the underlying device.

   - Two series of patches from Javier, fixing various issues with
     lightnvm, particular around pblk.

   - A series from me, adding support for write hints. This comes with
     NVMe support as well, so applications can help guide data placement
     on flash to improve performance, latencies, and write
     amplification.

   - A series from Ming, improving and hardening blk-mq support for
     stopping/starting and quiescing hardware queues.

   - Two pull requests for NVMe updates. Nothing major on the feature
     side, but lots of cleanups and bug fixes. From the usual crew.

   - A series from Neil Brown, greatly improving the bio rescue set
     support. Most notably, this kills the bio rescue work queues, if we
     don't really need them.

   - Lots of other little bug fixes that are all over the place"

* 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
  lightnvm: pblk: set line bitmap check under debug
  lightnvm: pblk: verify that cache read is still valid
  lightnvm: pblk: add initialization check
  lightnvm: pblk: remove target using async. I/Os
  lightnvm: pblk: use vmalloc for GC data buffer
  lightnvm: pblk: use right metadata buffer for recovery
  lightnvm: pblk: schedule if data is not ready
  lightnvm: pblk: remove unused return variable
  lightnvm: pblk: fix double-free on pblk init
  lightnvm: pblk: fix bad le64 assignations
  nvme: Makefile: remove dead build rule
  blk-mq: map all HWQ also in hyperthreaded system
  nvmet-rdma: register ib_client to not deadlock in device removal
  nvme_fc: fix error recovery on link down.
  nvmet_fc: fix crashes on bad opcodes
  nvme_fc: Fix crash when nvme controller connection fails.
  nvme_fc: replace ioabort msleep loop with completion
  nvme_fc: fix double calls to nvme_cleanup_cmd()
  nvme-fabrics: verify that a controller returns the correct NQN
  nvme: simplify nvme_dev_attrs_are_visible
  ...
2017-07-03 10:34:51 -07:00
Jens Axboe
9ae3b3f52c block: provide bio_uninit() free freeing integrity/task associations
Wen reports significant memory leaks with DIF and O_DIRECT:

"With nvme devive + T10 enabled, On a system it has 256GB and started
logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
leaking.

/proc/meminfo | grep SUnreclaim...

SUnreclaim:      6752128 kB
SUnreclaim:      6874880 kB
SUnreclaim:      7238080 kB
....
SUnreclaim:     22307264 kB
SUnreclaim:     22485888 kB
SUnreclaim:     22720256 kB

When testcases with T10 enabled call into __blkdev_direct_IO_simple,
code doesn't free memory allocated by bio_integrity_alloc. The patch
fixes the issue. HTX has been run with +60 hours without failure."

Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
doesn't go through the regular bio free. This means that any ancillary
data allocated with the bio through the stack is not freed. Hence, we
can leak the integrity data associated with the bio, if the device is
using DIF/DIX.

Fix this by providing a bio_uninit() and export it, so that we can use
it to free this data. Note that this is a minimal fix for this issue.
Any current user of bio's that are allocated outside of
bio_alloc_bioset() suffers from this issue, most notably some drivers.
We will fix those in a more comprehensive patch for 4.13. This also
means that the commit marked as being fixed by this isn't the real
culprit, it's just the most obvious one out there.

Fixes: 542ff7bf18 ("block: new direct I/O implementation")
Reported-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 15:30:13 -06:00
Jens Axboe
45d06cf701 fs: add O_DIRECT and aio support for sending down write life time hints
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 12:05:36 -06:00
NeilBrown
011067b056 blk: replace bioset_create_nobvec() with a flags arg to bioset_create()
"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().

To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().

Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 12:40:59 -06:00
Christoph Hellwig
4e4cbee93d block: switch bios to blk_status_t
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig
36ffc6c1c0 block_dev: propagate bio_iov_iter_get_pages error in __blkdev_direct_IO
Once we move the block layer to its own status code we'll still want to
propagate the bio_iov_iter_get_pages, so restructure __blkdev_direct_IO
to take ret into account when returning the errno.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Linus Torvalds
0fcc3ab23d Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
 "Incremental fixes and a small feature addition on top of the main
  libnvdimm 4.12 pull request:

   - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
     The size regression is fixed by moving all dax helpers into the
     dax-core and only specifying "select DAX" for FS_DAX and
     dax-capable drivers. He also asked for clarification of the
     NR_DEV_DAX config option which, on closer look, does not need to be
     a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
     for good measure.

   - Ben's attention to detail on -stable patch submissions caught a
     case where the recent fixes to arch_copy_from_iter_pmem() missed a
     condition where we strand dirty data in the cache. This is tagged
     for -stable and will also be included in the rework of the pmem api
     to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

   - Vishal adds a feature that missed the initial pull due to pending
     review feedback. It allows the kernel to clear media errors when
     initializing a BTT (atomic sector update driver) instance on a pmem
     namespace.

   - Ross noticed that the dax_device + dax_operations conversion broke
     __dax_zero_page_range(). The nvdimm unit tests fail to check this
     path, but xfstests immediately trips over it. No excuse for missing
     this before submitting the 4.12 pull request.

  These all pass the nvdimm unit tests and an xfstests spot check. The
  set has received a build success notification from the kbuild robot"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  filesystem-dax: fix broken __dax_zero_page_range() conversion
  libnvdimm, btt: ensure that initializing metadata clears poison
  libnvdimm: add an atomic vs process context flag to rw_bytes
  x86, pmem: Fix cache flushing for iovec write < 8 bytes
  device-dax: kill NR_DEV_DAX
  block, dax: move "select DAX" from BLOCK to FS_DAX
  device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX
2017-05-12 15:43:10 -07:00
Dan Williams
ef51042472 block, dax: move "select DAX" from BLOCK to FS_DAX
For configurations that do not enable DAX filesystems or drivers, do not
require the DAX core to be built.

Given that the 'direct_access' method has been removed from
'block_device_operations', we can also go ahead and remove the
block-related dax helper functions from fs/block_dev.c to
drivers/dax/super.c. This keeps dax details out of the block layer and
lets the DAX core be built as a module in the FS_DAX=n case.

Filesystems need to include dax.h to call bdev_dax_supported().

Cc: linux-xfs@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-08 10:55:27 -07:00
Linus Torvalds
53ef7d0e20 libnvdimm for 4.12
* Region media error reporting: A libnvdimm region device is the parent
 to one or more namespaces. To date, media errors have been reported via
 the "badblocks" attribute attached to pmem block devices for namespaces
 in "raw" or "memory" mode. Given that namespaces can be in "device-dax"
 or "btt-sector" mode this new interface reports media errors
 generically, i.e. independent of namespace modes or state. This
 subsequently allows userspace tooling to craft "ACPI 6.1 Section
 9.20.7.6 Function Index 4 - Clear Uncorrectable Error" requests and
 submit them via the ioctl path for NVDIMM root bus devices.
 
 * Introduce 'struct dax_device' and 'struct dax_operations': Prompted by
 a request from Linus and feedback from Christoph this allows for dax
 capable drivers to publish their own custom dax operations. This fixes
 the broken assumption that all dax operations are related to a
 persistent memory device, and makes it easier for other architectures
 and platforms to add customized persistent memory support.
 
 * 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
 available for storage appliance applications to manually trigger memory
 controllers to drain write-pending buffers that would otherwise be
 flushed automatically by the platform ADR (asynchronous-DRAM-refresh)
 mechanism at a power loss event. Support for "locked" DIMMs is included
 to prevent namespaces from surfacing when the namespace label data area
 is locked. Finally, fixes for various reported deadlocks and crashes,
 also tagged for -stable.
 
 * ACPI / nfit driver updates: General updates of the nfit driver to add
 DSM command overrides, ACPI 6.1 health state flags support, DSM payload
 debug available by default, and various fixes.
 
 Acknowledgements that came after the branch was pushed:
 
 commmit 565851c972 "device-dax: fix sysfs attribute deadlock"
 Tested-by: Yi Zhang <yizhan@redhat.com>
 
 commit 23f4984483 "libnvdimm: rework region badblocks clearing"
 Tested-by: Toshi Kani <toshi.kani@hpe.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZDONJAAoJEB7SkWpmfYgC3SsP/2KrLvTUcz646ViuPOgZ2cC4
 W6wAx6cvDSt+H52kLnFEsYoFt7WAj20ggPirb/Bc5jkGlvwE0lT9Xtmso9GpVkYT
 J9ZJ9pP/4YaAD3II1gmTwaUjYi0FxoOdx3Eb92yuWkO/8ylz4b2Nu3cBpYwyziGQ
 nIfEVwDXRLE86u6x0bWuf6TlVuvsbdiAI55CDqDMVQC6xIOLbSez7b8QIHlpiKEb
 Mw+xqdQva0esoreZEOXEhWNO+qtfILx8/ceBEGTNMp4e/JjZ2FbrSNplM+9bH5k7
 ywqP8lW+mBEw0fmBBkYoVG/xyesiiBb55JLnbi8Ew+7IUxw8a3iV7wftRi62lHcK
 zAjsHe4L+MansgtZsCL8wluvIPaktAdtB4xr7l9VNLKRYRUG73jEWU0gcUNryHIL
 BkQJ52pUS1PkClyAsWbBBHl1I/CvzVPd21VW0YELmLR4OywKy1c+eKw2bcYgjrb4
 59HZSv6S6EoKaQC+2qvVNpePil7cdfg5V2ubH/ki9HoYVyoxDptEWHnvf0NNatIH
 Y7mNcOPvhOksJmnKSyHbDjtRur7WoHIlC9D7UjEFkSBWsKPjxJHoidN4SnCMRtjQ
 WKQU0seoaKj04b68Bs/Qm9NozVgnsPFIUDZeLMikLFX2Jt7YSPu+Jmi2s4re6WLh
 TmJQ3Ly9t3o3/weHSzmn
 =Ox0s
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "The bulk of this has been in multiple -next releases. There were a few
  late breaking fixes and small features that got added in the last
  couple days, but the whole set has received a build success
  notification from the kbuild robot.

  Change summary:

   - Region media error reporting: A libnvdimm region device is the
     parent to one or more namespaces. To date, media errors have been
     reported via the "badblocks" attribute attached to pmem block
     devices for namespaces in "raw" or "memory" mode. Given that
     namespaces can be in "device-dax" or "btt-sector" mode this new
     interface reports media errors generically, i.e. independent of
     namespace modes or state.

     This subsequently allows userspace tooling to craft "ACPI 6.1
     Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
     requests and submit them via the ioctl path for NVDIMM root bus
     devices.

   - Introduce 'struct dax_device' and 'struct dax_operations': Prompted
     by a request from Linus and feedback from Christoph this allows for
     dax capable drivers to publish their own custom dax operations.
     This fixes the broken assumption that all dax operations are
     related to a persistent memory device, and makes it easier for
     other architectures and platforms to add customized persistent
     memory support.

   - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
     available for storage appliance applications to manually trigger
     memory controllers to drain write-pending buffers that would
     otherwise be flushed automatically by the platform ADR
     (asynchronous-DRAM-refresh) mechanism at a power loss event.
     Support for "locked" DIMMs is included to prevent namespaces from
     surfacing when the namespace label data area is locked. Finally,
     fixes for various reported deadlocks and crashes, also tagged for
     -stable.

   - ACPI / nfit driver updates: General updates of the nfit driver to
     add DSM command overrides, ACPI 6.1 health state flags support, DSM
     payload debug available by default, and various fixes.

  Acknowledgements that came after the branch was pushed:

   - commmit 565851c972 "device-dax: fix sysfs attribute deadlock":
     Tested-by: Yi Zhang <yizhan@redhat.com>

   - commit 23f4984483 "libnvdimm: rework region badblocks clearing"
     Tested-by: Toshi Kani <toshi.kani@hpe.com>"

* tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
  libnvdimm, pfn: fix 'npfns' vs section alignment
  libnvdimm: handle locked label storage areas
  libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
  brd: fix uninitialized use of brd->dax_dev
  block, dax: use correct format string in bdev_dax_supported
  device-dax: fix sysfs attribute deadlock
  libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
  libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
  libnvdimm: rework region badblocks clearing
  acpi, nfit: kill ACPI_NFIT_DEBUG
  libnvdimm: fix clear length of nvdimm_forget_poison()
  libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
  libnvdimm, region: sysfs trigger for nvdimm_flush()
  libnvdimm: fix phys_addr for nvdimm_clear_poison
  x86, dax, pmem: remove indirection around memcpy_from_pmem()
  block: remove block_device_operations ->direct_access()
  block, dax: convert bdev_dax_supported() to dax_direct_access()
  filesystem-dax: convert to dax_direct_access()
  Revert "block: use DAX for partition table reads"
  ext2, ext4, xfs: retrieve dax_device for iomap operations
  ...
2017-05-05 18:49:20 -07:00
Andrey Ryabinin
a5f6a6a9c7 fs/block_dev: always invalidate cleancache in invalidate_bdev()
invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0
which doen't make any sense.

Make sure that invalidate_bdev() always calls cleancache_invalidate_inode()
regardless of mapping->nrpages value.

Fixes: c515e1fd36 ("mm/fs: add hooks to support cleancache")
Link: http://lkml.kernel.org/r/20170424164135.22350-3-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:12 -07:00
Arnd Bergmann
67fd389735 block, dax: use correct format string in bdev_dax_supported
The new message has an incorrect format string, causing a warning in some
configurations:

fs/block_dev.c: In function 'bdev_dax_supported':
fs/block_dev.c:779:5: error: format '%d' expects argument of type 'int', but argument 2 has type 'long int' [-Werror=format=]
     "error: dax access failed (%d)", len);

This changes it to use the correct %ld instead of %d.

Fixes: 2093f2e9df ("block, dax: convert bdev_dax_supported() to dax_direct_access()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-01 13:16:29 -07:00