[ Upstream commit a6ecfb11bf37743c1ac49b266595582b107b61d4 ]
When halting a guest, QEMU flushes the virtual ITS caches, which
amounts to writing to the various tables that the guest has allocated.
When doing this, we fail to take the srcu lock, and the kernel
shouts loudly if running a lockdep kernel:
[ 69.680416] =============================
[ 69.680819] WARNING: suspicious RCU usage
[ 69.681526] 5.1.0-rc1-00008-g600025238f51-dirty #18 Not tainted
[ 69.682096] -----------------------------
[ 69.682501] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!
[ 69.683225]
[ 69.683225] other info that might help us debug this:
[ 69.683225]
[ 69.683975]
[ 69.683975] rcu_scheduler_active = 2, debug_locks = 1
[ 69.684598] 6 locks held by qemu-system-aar/4097:
[ 69.685059] #0: 0000000034196013 (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0
[ 69.686087] #1: 00000000f2ed935e (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0
[ 69.686919] #2: 000000005e71ea54 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
[ 69.687698] #3: 00000000c17e548d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
[ 69.688475] #4: 00000000ba386017 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
[ 69.689978] #5: 00000000c2c3c335 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
[ 69.690729]
[ 69.690729] stack backtrace:
[ 69.691151] CPU: 2 PID: 4097 Comm: qemu-system-aar Not tainted 5.1.0-rc1-00008-g600025238f51-dirty #18
[ 69.691984] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019
[ 69.692831] Call trace:
[ 69.694072] lockdep_rcu_suspicious+0xcc/0x110
[ 69.694490] gfn_to_memslot+0x174/0x190
[ 69.694853] kvm_write_guest+0x50/0xb0
[ 69.695209] vgic_its_save_tables_v0+0x248/0x330
[ 69.695639] vgic_its_set_attr+0x298/0x3a0
[ 69.696024] kvm_device_ioctl_attr+0x9c/0xd8
[ 69.696424] kvm_device_ioctl+0x8c/0xf8
[ 69.696788] do_vfs_ioctl+0xc8/0x960
[ 69.697128] ksys_ioctl+0x8c/0xa0
[ 69.697445] __arm64_sys_ioctl+0x28/0x38
[ 69.697817] el0_svc_common+0xd8/0x138
[ 69.698173] el0_svc_handler+0x38/0x78
[ 69.698528] el0_svc+0x8/0xc
The fix is to obviously take the srcu lock, just like we do on the
read side of things since bf308242ab. One wonders why this wasn't
fixed at the same time, but hey...
Fixes: bf308242ab ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock")
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
[ Upstream commit ebff0b0e3d3c862c16c487959db5e0d879632559 ]
We've become very cautious to now always reset the vcpu when nothing
is loaded on the physical CPU. To do so, we now disable preemption
and do a kvm_arch_vcpu_put() to make sure we have all the state
in memory (and that it won't be loaded behind out back).
This now causes issues with resetting the PMU, which calls into perf.
Perf itself uses mutexes, which clashes with the lack of preemption.
It is worth realizing that the PMU is fully emulated, and that
no PMU state is ever loaded on the physical CPU. This means we can
perfectly reset the PMU outside of the non-preemptible section.
Fixes: e761a927bc9a ("KVM: arm/arm64: Reset the VCPU without preemption and vcpu state loaded")
Reported-by: Julien Grall <julien.grall@arm.com>
Tested-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
[ Upstream commit 72ff51d8dd262d1fef25baedc2ac35116435be47 ]
Andrey has reported on OpenWrt's bug tracking system[1], that he
currently can't use ar93xx_uart as pure serial UART without console
(CONFIG_SERIAL_8250_CONSOLE and CONFIG_SERIAL_AR933X_CONSOLE undefined),
because compilation ends with following error:
ar933x_uart.c: In function 'ar933x_uart_console_write':
ar933x_uart.c:550:14: error: 'struct uart_port' has no
member named 'sysrq'
So this patch moves all the code related to console handling behind
series of CONFIG_SERIAL_AR933X_CONSOLE ifdefs.
1. https://bugs.openwrt.org/index.php?do=details&task_id=2152
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Andrey Batyiev <batyiev@gmail.com>
Reported-by: Andrey Batyiev <batyiev@gmail.com>
Tested-by: Andrey Batyiev <batyiev@gmail.com>
Signed-off-by: Petr Štetiar <ynezz@true.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
[ Upstream commit ac0cdb3d990108df795b676cd0d0e65ac34b2273 ]
Add the missing uart_unregister_driver() and i2c_del_driver() before return
from sc16is7xx_init() in the error handling case.
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reviewed-by: Vladimir Zapolskiy <vz@mleia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
[ Upstream commit 0c17e83fe423467e3ccf0a02f99bd050a73bbeb4 ]
The call to of_get_next_child returns a node pointer with refcount
incremented thus it must be explicitly decremented after the last
usage.
Detected by coccinelle with the following warnings:
./arch/arm/mach-imx/mach-imx51.c:64:2-8: ERROR: missing of_node_put; acquired a node pointer with refcount incremented on line 57, but without a corresponding object release within this function.
Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Pengutronix Kernel Team <kernel@pengutronix.de>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: NXP Linux Team <linux-imx@nxp.com>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
[ Upstream commit 7221b727f0079a32aca91f657141e1de564d4b97 ]
The ucast IP table is utilized by some of the L3-specific sysfs attributes
that qeth_l3_create_device_attributes() provides. So initialize the table
_before_ registering the attributes.
Fixes: ebccc7397e ("s390/qeth: add missing hash table initializations")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
[ Upstream commit 6d65561f3d5ec933151939c543d006b79044e7a6 ]
skb_header_pointer may return NULL. The current code dereference
its return values without a NULL check.
The fix inserts the checks to avoid NULL pointer dereferences.
Fixes: 202a8ff545 ("netfilter: add IPv6 segment routing header 'srh' match")
Signed-off-by: Kangjie Lu <kjlu@umn.edu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
[ Upstream commit d1fa381033eb718df5c602f64b6e88676138dfc6 ]
With NETFILTER_XT_TARGET_TEE=y and IP6_NF_IPTABLES=m, we get a link
error when referencing the NF_DUP_IPV6 module:
net/netfilter/xt_TEE.o: In function `tee_tg6':
xt_TEE.c:(.text+0x14): undefined reference to `nf_dup_ipv6'
The problem here is the 'select NF_DUP_IPV6 if IP6_NF_IPTABLES'
that forces NF_DUP_IPV6 to be =m as well rather than setting it
to =y as was intended here. Adding a soft dependency on
IP6_NF_IPTABLES avoids that broken configuration.
Fixes: 5d400a4933 ("netfilter: Kconfig: Change select IPv6 dependencies")
Cc: Máté Eckl <ecklm94@gmail.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Link: https://patchwork.ozlabs.org/patch/999498/
Link: https://lore.kernel.org/patchwork/patch/960062/
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
[ Upstream commit e166e4fdaced850bee3d5ee12a5740258fb30587 ]
Since Commit 21d1196a35 ("ipv4: set transport header earlier"),
skb->transport_header has been always set before entering INET
netfilter. This patch is to set skb->transport_header for bridge
before entering INET netfilter by bridge-nf-call-iptables.
It also fixes an issue that sctp_error() couldn't compute a right
csum due to unset skb->transport_header.
Fixes: e6d8b64b34 ("net: sctp: fix and consolidate SCTP checksumming code")
Reported-by: Li Shuang <shuali@redhat.com>
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
[ Upstream commit 1beea6204e2304dd11600791d8dad8e7350af6ad ]
When building without CONFIG_OF, the compiler loses track of the flow
control in axis_fifo_probe(), and thinks that many variables are used
without an initialization even though we actually leave the function
before the first use:
drivers/staging/axis-fifo/axis-fifo.c: In function 'axis_fifo_probe':
drivers/staging/axis-fifo/axis-fifo.c:900:5: error: 'rxd_tdata_width' may be used uninitialized in this function [-Werror=maybe-uninitialized]
if (rxd_tdata_width != 32) {
^
drivers/staging/axis-fifo/axis-fifo.c:907:5: error: 'txd_tdata_width' may be used uninitialized in this function [-Werror=maybe-uninitialized]
if (txd_tdata_width != 32) {
^
drivers/staging/axis-fifo/axis-fifo.c:914:5: error: 'has_tdest' may be used uninitialized in this function [-Werror=maybe-uninitialized]
if (has_tdest) {
^
drivers/staging/axis-fifo/axis-fifo.c:919:5: error: 'has_tid' may be used uninitialized in this function [-Werror=maybe-uninitialized]
When CONFIG_OF is set, this does not happen, and since the driver cannot
work without it, just add that option as a Kconfig dependency.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
[ Upstream commit 044175a06706d516aa42874bb44dbbfc3c4d20eb ]
When the umem is cleaned up, the task that created it might already be
gone. If the task was gone, the xdp_umem_release function did not free
the pages member of struct xdp_umem.
It turned out that the task lookup was not needed at all; The code was
a left-over when we moved from task accounting to user accounting [1].
This patch fixes the memory leak by removing the task lookup logic
completely.
[1] https://lore.kernel.org/netdev/20180131135356.19134-3-bjorn.topel@gmail.com/
Link: https://lore.kernel.org/netdev/c1cb2ca8-6a14-3980-8672-f3de0bb38dfd@suse.cz/
Fixes: c0c77d8fb7 ("xsk: add user memory registration support sockopt")
Reported-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
[ Upstream commit 5bf7295fe34a5251b1d241b9736af4697b590670 ]
netdev_alloc_skb can fail and return a NULL pointer which is
dereferenced without a check. The patch avoids such a scenario.
Signed-off-by: Aditya Pakki <pakki001@umn.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
[ Upstream commit 80acbed9f8fca1db3fbe915540b756f048aa0fd7 ]
Commit 0e80bdc9a7 ("stmmac: first frame prep at the end of xmit
routine") overlooked jumbo frames when re-ordering the code, and as a
result the own bit was not getting set anymore for the first jumbo frame
descriptor. Commit 487e2e22ab79 ("net: stmmac: Set OWN bit for jumbo
frames") tried to fix this, but now the bit is getting set too early and
the DMA may start while we are still setting up the remaining descriptors.
And with the chain mode the own bit remains still unset.
Fix by setting the own bit at the end of xmit also with jumbo frames.
Fixes: 0e80bdc9a7 ("stmmac: first frame prep at the end of xmit routine")
Fixes: 487e2e22ab79 ("net: stmmac: Set OWN bit for jumbo frames")
Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Acked-by: Jose Abreu <joabreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
[ Upstream commit 2795e8c251614ac0784c9d41008551109f665716 ]
In case alloc_ordered_workqueue fails, the fix releases
sources and returns -ENOMEM to avoid NULL pointer dereference.
Signed-off-by: Kangjie Lu <kjlu@umn.edu>
Acked-by: Michael Hennerich <michael.hennerich@analog.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
[ Upstream commit cd479eccd2e057116d504852814402a1e68ead80 ]
For a 64-bit process the randomization of the program break is quite
large with 1GB. That is as big as the randomization of the anonymous
mapping base, for a test case started with '/lib/ld64.so.1 <exec>'
it can happen that the heap is placed after the stack. To avoid
this limit the program break randomization to 32MB for 64-bit and
keep 8MB for 31-bit.
Reported-by: Stefan Liebler <stli@linux.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
[ Upstream commit 544e784188f1dd7c797c70b213385e67d92005b6 ]
Raspberry pi board model B revison 2 have the hot plug detector gpio
active high (and not low as it was in the dts).
Signed-off-by: Helen Koike <helen.koike@collabora.com>
Fixes: 49ac67e0c3 ("ARM: bcm2835: Add VC4 to the device tree.")
Reviewed-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
commit 15fab63e1e57be9fdb5eec1bbc5916e9825e9acb upstream.
Change pipe_buf_get() to return a bool indicating whether it succeeded
in raising the refcount of the page (if the thing in the pipe is a page).
This removes another mechanism for overflowing the page refcount. All
callers converted to handle a failure.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8fde12ca79aff9b5ba951fce1a2641901b8d8e64 upstream.
If the page refcount wraps around past zero, it will be freed while
there are still four billion references to it. One of the possible
avenues for an attacker to try to make this happen is by doing direct IO
on a page multiple times. This patch makes get_user_pages() refuse to
take a new page reference if there are already more than two billion
references to the page.
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 88b1a17dfc3ed7728316478fae0f5ad508f50397 upstream.
This is the same as the traditional 'get_page()' function, but instead
of unconditionally incrementing the reference count of the page, it only
does so if the count was "safe". It returns whether the reference count
was incremented (and is marked __must_check, since the caller obviously
has to be aware of it).
Also like 'get_page()', you can't use this function unless you already
had a reference to the page. The intent is that you can use this
exactly like get_page(), but in situations where you want to limit the
maximum reference count.
The code currently does an unconditional WARN_ON_ONCE() if we ever hit
the reference count issues (either zero or negative), as a notification
that the conditional non-increment actually happened.
NOTE! The count access for the "safety" check is inherently racy, but
that doesn't matter since the buffer we use is basically half the range
of the reference count (ie we look at the sign of the count).
Acked-by: Matthew Wilcox <willy@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f958d7b528b1b40c44cfda5eabe2d82760d868c3 upstream.
We have a VM_BUG_ON() to check that the page reference count doesn't
underflow (or get close to overflow) by checking the sign of the count.
That's all fine, but we actually want to allow people to use a "get page
ref unless it's already very high" helper function, and we want that one
to use the sign of the page ref (without triggering this VM_BUG_ON).
Change the VM_BUG_ON to only check for small underflows (or _very_ close
to overflowing), and ignore overflows which have strayed into negative
territory.
Acked-by: Matthew Wilcox <willy@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2c2a2fb1e2a9256714338875bede6b7cbd4b9542 upstream.
Revert commit c8b1917c8987 ("ACPICA: Clear status of GPEs before
enabling them") that causes problems with Thunderbolt controllers
to occur if a dock device is connected at init time (the xhci_hcd
and thunderbolt modules crash which prevents peripherals connected
through them from working).
Commit c8b1917c8987 effectively causes commit ecc1165b8b ("ACPICA:
Dispatch active GPEs at init time") to get undone, so the problem
addressed by commit ecc1165b8b appears again as a result of it.
Fixes: c8b1917c8987 ("ACPICA: Clear status of GPEs before enabling them")
Link: https://lore.kernel.org/lkml/s5hy33siofw.wl-tiwai@suse.de/T/#u
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1132943
Reported-by: Michael Hirmke <opensuse@mike.franken.de>
Reported-by: Takashi Iwai <tiwai@suse.de>
Cc: 4.17+ <stable@vger.kernel.org> # 4.17+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dfbd199a7cfe3e3cd8531e1353cdbd7175bfbc5e upstream.
When compiling genheaders and mdp from a newer host kernel, the
following error happens:
In file included from scripts/selinux/genheaders/genheaders.c:18:
./security/selinux/include/classmap.h:238:2: error: #error New
address family defined, please update secclass_map. #error New
address family defined, please update secclass_map. ^~~~~
make[3]: *** [scripts/Makefile.host:107:
scripts/selinux/genheaders/genheaders] Error 1 make[2]: ***
[scripts/Makefile.build:599: scripts/selinux/genheaders] Error 2
make[1]: *** [scripts/Makefile.build:599: scripts/selinux] Error 2
make[1]: *** Waiting for unfinished jobs....
Instead of relying on the host definition, include linux/socket.h in
classmap.h to have PF_MAX.
Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara <paulo@paulo.ac>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
[PM: manually merge in mdp.c, subject line tweaks]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 12c7686111326148b4b5db189130522a4ad1be4a ]
When device refuses the offload in tls_set_device_offload_rx()
it calls tls_sw_free_resources_rx() to clean up software context
state.
Unfortunately, tls_sw_free_resources_rx() does not free all
the state tls_set_sw_offload() allocated - it leaks IV and
sequence number buffers. All other code paths which lead to
tls_sw_release_resources_rx() (which tls_sw_free_resources_rx()
calls) free those right before the call.
Avoid the leak by moving freeing of iv and rec_seq into
tls_sw_release_resources_rx().
Fixes: 4799ac81e5 ("tls: Add rx inline crypto offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 62ef81d5632634d5e310ed25b9b940b2b6612b46 ]
If device supports offload, but offload fails tls_set_device_offload_rx()
will call tls_sw_free_resources_rx() which (unhelpfully) releases
and reacquires the socket lock.
For a small fix release and reacquire the device_offload_lock.
Fixes: 4799ac81e5 ("tls: Add rx inline crypto offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 12fc512f5741443a03adde2ead20724da8ad550a ]
xdp_return_frame releases the frame. It leads to releasing the page, so
it's not allowed to access xdpi.xdpf->len after that, because xdpi.xdpf
is at xdp->data_hard_start after convert_to_xdp_frame. This patch moves
the memory access to precede the return of the frame.
Fixes: 58b99ee3e3 ("net/mlx5e: Add support for XDP_REDIRECT in device-out side")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d460c2718906252a2a69bc6f89b537071f792e6e ]
MLX5E_XDP_MAX_MTU was calculated incorrectly. It didn't account for
NET_IP_ALIGN and MLX5E_HW2SW_MTU, and it also misused MLX5_SKB_FRAG_SZ.
This commit fixes the calculations and adds a brief explanation for the
formula used.
Fixes: a26a5bdf3e ("net/mlx5e: Restrict the combination of large MTU and XDP")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit f476b3f809fa02f47af6333ed63715058c3fc348 ]
Both Spectrum-1 and Spectrum-2 chips are currently configured such that
pairs of TC n (which is used for UC traffic) and TC n+8 (which is used
for MC traffic) are feeding into the same subgroup. Strict
prioritization is configured between the two TCs, and by enabling
MC-aware mode on the switch, the lower-numbered (UC) TCs are favored
over the higher-numbered (MC) TCs.
On Spectrum-2 however, there is an issue in configuration of the
MC-aware mode. As a result, MC traffic is prioritized over UC traffic.
To work around the issue, configure the MC TCs with DWRR mode (while
keeping the UC TCs in strict mode).
With this patch, the multicast-unicast arbitration results in the same
behavior on both Spectrum-1 and Spectrum-2 chips.
Fixes: 7b81953066 ("mlxsw: spectrum: Configure MC-aware mode on mlxsw ports")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1ab3030193d25878b3b1409060e1e0a879800c95 ]
During driver initialization the driver sends a reset to the device and
waits for the firmware to signal that it is ready to continue.
Commit d2f372ba0914 ("mlxsw: pci: Increase PCI SW reset timeout")
increased the timeout to 13 seconds due to longer PHY calibration in
Spectrum-2 compared to Spectrum-1.
Recently it became apparent that this timeout is too short and therefore
this patch increases it again to a safer limit that will be reduced in
the future.
Fixes: c3ab435466 ("mlxsw: spectrum: Extend to support Spectrum-2 ASIC")
Fixes: d2f372ba0914 ("mlxsw: pci: Increase PCI SW reset timeout")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit dfdf26babc98 upstream
this patch need merge to 4.19.y stable kernel
Fix Conflict:already fixed the confilct dfdf26babc98 with Yonglong Liu
stable candidate:user cannot connect to the internet via hns dev
by default setting without this patch
we have already verified this patch on kunpeng916 platform,
and it works well.
Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Jun Xiao <xiaojun2@hisilicon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 925b0c841e066b488cc3a60272472b2c56300704 ]
If we add a bond device which is already the master of the team interface,
we will hold the team->lock in team_add_slave() first and then request the
lock in team_set_mac_address() again. The functions are called like:
- team_add_slave()
- team_port_add()
- team_port_enter()
- team_modeop_port_enter()
- __set_port_dev_addr()
- dev_set_mac_address()
- bond_set_mac_address()
- dev_set_mac_address()
- team_set_mac_address
Although team_upper_dev_link() would check the upper devices but it is
called too late. Fix it by adding a checking before processing the slave.
v2: Do not split the string in netdev_err()
Fixes: 3d249d4ca7 ("net: introduce ethernet teaming device")
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e0c1d14a1a3211dccf0540a6703ffbd5d2a75bdb ]
Since there are more IOT2040 variants with identical hardware but
different asset tags, the asset tag matching should be adjusted to
support them.
For the board name "SIMATIC IOT2000", currently there are 2 types of
hardware, IOT2020 and IOT2040. The IOT2020 is identified by its unique
asset tag. Match on it first. If we then match on the board name only,
we will catch all IOT2040 variants. In the future there will be no other
devices with the "SIMATIC IOT2000" DMI board name but different
hardware.
Signed-off-by: Su Bao Cheng <baocheng.su@siemens.com>
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9188d5ca454fd665145904267e726e9e8d122f5c ]
Unlike atomic_add(), refcount_add() does not deal well
with a negative argument. TLS fallback code reallocates
the skb and is very likely to shrink the truesize, leading to:
[ 189.513254] WARNING: CPU: 5 PID: 0 at lib/refcount.c:81 refcount_add_not_zero_checked+0x15c/0x180
Call Trace:
refcount_add_checked+0x6/0x40
tls_enc_skb+0xb93/0x13e0 [tls]
Once wmem_allocated count saturates the application can no longer
send data on the socket. This is similar to Eric's fixes for GSO,
TCP:
commit 7ec318feee ("tcp: gso: avoid refcount_t warning from tcp_gso_segment()")
and UDP:
commit 575b65bc5b ("udp: avoid refcount_t saturation in __udp_gso_segment()").
Unlike the GSO case, for TLS fallback it's likely that the skb has
shrunk, so the "likely" annotation is the other way around (likely
branch being "sub").
Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b561af36b1841088552464cdc3f6371d92f17710 ]
stmmac_check_ether_addr() checks the MAC address and assigns one in
driver open(). In many cases when we create slave netdevice, the dev
addr is inherited from master but the master dev addr maybe NULL at
that time, so move this call to driver probe so that address is
always valid.
Signed-off-by: Xiaofei Shen <xiaofeis@codeaurora.org>
Tested-by: Xiaofei Shen <xiaofeis@codeaurora.org>
Signed-off-by: Sneh Shah <snehshah@codeaurora.org>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4b9fc7146249a6e0e3175d0acc033fdcd2bfcb17 ]
Before the commit 490ea5967b ("RDS: IB: move FMR code to its own file"),
when the dirty_count is greater than 9/10 of max_items of 8K pool,
1M pool is used, Vice versa. After the commit 490ea5967b ("RDS: IB: move
FMR code to its own file"), the above is removed. When we make the
following tests.
Server:
rds-stress -r 1.1.1.16 -D 1M
Client:
rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M
The following will appear.
"
connecting to 1.1.1.16:4000
negotiated options, tasks will start in 2 seconds
Starting up..header from 1.1.1.166:4001 to id 4001 bogus
..
tsks tx/s rx/s tx+rx K/s mbi K/s mbo K/s tx us/c rtt us
cpu %
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
...
"
So this exchange between 8K and 1M pool is added back.
Fixes: commit 490ea5967b ("RDS: IB: move FMR code to its own file")
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ace329f4ab3ba434be2adf618073c752d083b524 ]
Querying EEPROM high pages data for SFP module is currently
not supported by our driver and yet queried, resulting in
invalid FW queries.
Set the EEPROM ethtool data length to 256 for SFP module will
limit the reading for page 0 only and prevent invalid FW queries.
Fixes: bb64143eee ("net/mlx5e: Add ethtool support for dump module EEPROM")
Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 151f0dddbbfe4c35c9c5b64873115aafd436af9d ]
If link is down and autoneg is set to on/off, the status in ethtool does
not change.
The reason is when the link is down the function returns with zero
before changing autoneg value.
Move the checking of link state (up/down) to be performed after setting
autoneg value, in order to be sure that autoneg will change in any case.
Fixes: 56ade8fe3f ("mlxsw: spectrum: Add initial support for Spectrum ASIC")
Signed-off-by: Amit Cohen <amitc@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 12209993e98c5fa1855c467f22a24e3d5b8be205 upstream.
There is one user of __kernel_fpu_begin() and before invoking it,
it invokes preempt_disable(). So it could invoke kernel_fpu_begin()
right away. The 32bit version of arch_efi_call_virt_setup() and
arch_efi_call_virt_teardown() does this already.
The comment above *kernel_fpu*() claims that before invoking
__kernel_fpu_begin() preemption should be disabled and that KVM is a
good example of doing it. Well, KVM doesn't do that since commit
f775b13eed ("x86,kvm: move qemu/guest FPU switching out to vcpu_run")
so it is not an example anymore.
With EFI gone as the last user of __kernel_fpu_{begin|end}(), both can
be made static and not exported anymore.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Cc: linux-efi <linux-efi@vger.kernel.org>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20181129150210.2k4mawt37ow6c2vq@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f2c57d91b0d96aa13ccff4e3b178038f17b00658 upstream.
In DAX mode a write pagefault can race with write(2) in the following
way:
CPU0 CPU1
write fault for mapped zero page (hole)
dax_iomap_rw()
iomap_apply()
xfs_file_iomap_begin()
- allocates blocks
dax_iomap_actor()
invalidate_inode_pages2_range()
- invalidates radix tree entries in given range
dax_iomap_pte_fault()
grab_mapping_entry()
- no entry found, creates empty
...
xfs_file_iomap_begin()
- finds already allocated block
...
vmf_insert_mixed_mkwrite()
- WARNs and does nothing because there
is still zero page mapped in PTE
unmap_mapping_pages()
This race results in WARN_ON from insert_pfn() and is occasionally
triggered by fstest generic/344. Note that the race is otherwise
harmless as before write(2) on CPU0 is finished, we will invalidate page
tables properly and thus user of mmap will see modified data from
write(2) from that point on. So just restrict the warning only to the
case when the PFN in PTE is not zero page.
Link: http://lkml.kernel.org/r/20180824154542.26872-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a9d57ef15cbe327fe54416dd194ee0ea66ae53a4 upstream.
Commit ce02ef06fcf7 ("x86, retpolines: Raise limit for generating indirect
calls from switch-case") raised the limit under retpolines to 20 switch
cases where gcc would only then start to emit jump tables, and therefore
effectively disabling the emission of slow indirect calls in this area.
After this has been brought to attention to gcc folks [0], Martin Liska
has then fixed gcc to align with clang by avoiding to generate switch jump
tables entirely under retpolines. This is taking effect in gcc starting
from stable version 8.4.0. Given kernel supports compilation with older
versions of gcc where the fix is not being available or backported anymore,
we need to keep the extra KBUILD_CFLAGS around for some time and generally
set the -fno-jump-tables to align with what more recent gcc is doing
automatically today.
More than 20 switch cases are not expected to be fast-path critical, but
it would still be good to align with gcc behavior for versions < 8.4.0 in
order to have consistency across supported gcc versions. vmlinux size is
slightly growing by 0.27% for older gcc. This flag is only set to work
around affected gcc, no change for clang.
[0] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952
Suggested-by: Martin Liska <mliska@suse.cz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Björn Töpel<bjorn.topel@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Link: https://lkml.kernel.org/r/20190325135620.14882-1-daniel@iogearbox.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ce02ef06fcf7a399a6276adb83f37373d10cbbe1 upstream.
From networking side, there are numerous attempts to get rid of indirect
calls in fast-path wherever feasible in order to avoid the cost of
retpolines, for example, just to name a few:
* 283c16a2dfd3 ("indirect call wrappers: helpers to speed-up indirect calls of builtin")
* aaa5d90b395a ("net: use indirect call wrappers at GRO network layer")
* 028e0a476684 ("net: use indirect call wrappers at GRO transport layer")
* 356da6d0cde3 ("dma-mapping: bypass indirect calls for dma-direct")
* 09772d92cd ("bpf: avoid retpoline for lookup/update/delete calls on maps")
* 10870dd89e95 ("netfilter: nf_tables: add direct calls for all builtin expressions")
[...]
Recent work on XDP from Björn and Magnus additionally found that manually
transforming the XDP return code switch statement with more than 5 cases
into if-else combination would result in a considerable speedup in XDP
layer due to avoidance of indirect calls in CONFIG_RETPOLINE enabled
builds. On i40e driver with XDP prog attached, a 20-26% speedup has been
observed [0]. Aside from XDP, there are many other places later in the
networking stack's critical path with similar switch-case
processing. Rather than fixing every XDP-enabled driver and locations in
stack by hand, it would be good to instead raise the limit where gcc would
emit expensive indirect calls from the switch under retpolines and stick
with the default as-is in case of !retpoline configured kernels. This would
also have the advantage that for archs where this is not necessary, we let
compiler select the underlying target optimization for these constructs and
avoid potential slow-downs by if-else hand-rewrite.
In case of gcc, this setting is controlled by case-values-threshold which
has an architecture global default that selects 4 or 5 (latter if target
does not have a case insn that compares the bounds) where some arch back
ends like arm64 or s390 override it with their own target hooks, for
example, in gcc commit db7a90aa0de5 ("S/390: Disable prediction of indirect
branches") the threshold pretty much disables jump tables by limit of 20
under retpoline builds. Comparing gcc's and clang's default code
generation on x86-64 under O2 level with retpoline build results in the
following outcome for 5 switch cases:
* gcc with -mindirect-branch=thunk-inline -mindirect-branch-register:
# gdb -batch -ex 'disassemble dispatch' ./c-switch
Dump of assembler code for function dispatch:
0x0000000000400be0 <+0>: cmp $0x4,%edi
0x0000000000400be3 <+3>: ja 0x400c35 <dispatch+85>
0x0000000000400be5 <+5>: lea 0x915f8(%rip),%rdx # 0x4921e4
0x0000000000400bec <+12>: mov %edi,%edi
0x0000000000400bee <+14>: movslq (%rdx,%rdi,4),%rax
0x0000000000400bf2 <+18>: add %rdx,%rax
0x0000000000400bf5 <+21>: callq 0x400c01 <dispatch+33>
0x0000000000400bfa <+26>: pause
0x0000000000400bfc <+28>: lfence
0x0000000000400bff <+31>: jmp 0x400bfa <dispatch+26>
0x0000000000400c01 <+33>: mov %rax,(%rsp)
0x0000000000400c05 <+37>: retq
0x0000000000400c06 <+38>: nopw %cs:0x0(%rax,%rax,1)
0x0000000000400c10 <+48>: jmpq 0x400c90 <fn_3>
0x0000000000400c15 <+53>: nopl (%rax)
0x0000000000400c18 <+56>: jmpq 0x400c70 <fn_2>
0x0000000000400c1d <+61>: nopl (%rax)
0x0000000000400c20 <+64>: jmpq 0x400c50 <fn_1>
0x0000000000400c25 <+69>: nopl (%rax)
0x0000000000400c28 <+72>: jmpq 0x400c40 <fn_0>
0x0000000000400c2d <+77>: nopl (%rax)
0x0000000000400c30 <+80>: jmpq 0x400cb0 <fn_4>
0x0000000000400c35 <+85>: push %rax
0x0000000000400c36 <+86>: callq 0x40dd80 <abort>
End of assembler dump.
* clang with -mretpoline emitting search tree:
# gdb -batch -ex 'disassemble dispatch' ./c-switch
Dump of assembler code for function dispatch:
0x0000000000400b30 <+0>: cmp $0x1,%edi
0x0000000000400b33 <+3>: jle 0x400b44 <dispatch+20>
0x0000000000400b35 <+5>: cmp $0x2,%edi
0x0000000000400b38 <+8>: je 0x400b4d <dispatch+29>
0x0000000000400b3a <+10>: cmp $0x3,%edi
0x0000000000400b3d <+13>: jne 0x400b52 <dispatch+34>
0x0000000000400b3f <+15>: jmpq 0x400c50 <fn_3>
0x0000000000400b44 <+20>: test %edi,%edi
0x0000000000400b46 <+22>: jne 0x400b5c <dispatch+44>
0x0000000000400b48 <+24>: jmpq 0x400c20 <fn_0>
0x0000000000400b4d <+29>: jmpq 0x400c40 <fn_2>
0x0000000000400b52 <+34>: cmp $0x4,%edi
0x0000000000400b55 <+37>: jne 0x400b66 <dispatch+54>
0x0000000000400b57 <+39>: jmpq 0x400c60 <fn_4>
0x0000000000400b5c <+44>: cmp $0x1,%edi
0x0000000000400b5f <+47>: jne 0x400b66 <dispatch+54>
0x0000000000400b61 <+49>: jmpq 0x400c30 <fn_1>
0x0000000000400b66 <+54>: push %rax
0x0000000000400b67 <+55>: callq 0x40dd20 <abort>
End of assembler dump.
For sake of comparison, clang without -mretpoline:
# gdb -batch -ex 'disassemble dispatch' ./c-switch
Dump of assembler code for function dispatch:
0x0000000000400b30 <+0>: cmp $0x4,%edi
0x0000000000400b33 <+3>: ja 0x400b57 <dispatch+39>
0x0000000000400b35 <+5>: mov %edi,%eax
0x0000000000400b37 <+7>: jmpq *0x492148(,%rax,8)
0x0000000000400b3e <+14>: jmpq 0x400bf0 <fn_0>
0x0000000000400b43 <+19>: jmpq 0x400c30 <fn_4>
0x0000000000400b48 <+24>: jmpq 0x400c10 <fn_2>
0x0000000000400b4d <+29>: jmpq 0x400c20 <fn_3>
0x0000000000400b52 <+34>: jmpq 0x400c00 <fn_1>
0x0000000000400b57 <+39>: push %rax
0x0000000000400b58 <+40>: callq 0x40dcf0 <abort>
End of assembler dump.
Raising the cases to a high number (e.g. 100) will still result in similar
code generation pattern with clang and gcc as above, in other words clang
generally turns off jump table emission by having an extra expansion pass
under retpoline build to turn indirectbr instructions from their IR into
switch instructions as a built-in -mno-jump-table lowering of a switch (in
this case, even if IR input already contained an indirect branch).
For gcc, adding --param=case-values-threshold=20 as in similar fashion as
s390 in order to raise the limit for x86 retpoline enabled builds results
in a small vmlinux size increase of only 0.13% (before=18,027,528
after=18,051,192). For clang this option is ignored due to i) not being
needed as mentioned and ii) not having above cmdline
parameter. Non-retpoline-enabled builds with gcc continue to use the
default case-values-threshold setting, so nothing changes here.
[0] https://lore.kernel.org/netdev/20190129095754.9390-1-bjorn.topel@gmail.com/
and "The Path to DPDK Speeds for AF_XDP", LPC 2018, networking track:
- http://vger.kernel.org/lpc_net2018_talks/lpc18_pres_af_xdp_perf-v3.pdf
- http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: netdev@vger.kernel.org
Cc: David S. Miller <davem@davemloft.net>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20190221221941.29358-1-daniel@iogearbox.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit af5c72b1fc7a00aa484e90b0c4e0eeb582545634 upstream.
aio_poll() has to cope with several unpleasant problems:
* requests that might stay around indefinitely need to
be made visible for io_cancel(2); that must not be done to
a request already completed, though.
* in cases when ->poll() has placed us on a waitqueue,
wakeup might have happened (and request completed) before ->poll()
returns.
* worse, in some early wakeup cases request might end
up re-added into the queue later - we can't treat "woken up and
currently not in the queue" as "it's not going to stick around
indefinitely"
* ... moreover, ->poll() might have decided not to
put it on any queues to start with, and that needs to be distinguished
from the previous case
* ->poll() might have tried to put us on more than one queue.
Only the first will succeed for aio poll, so we might end up missing
wakeups. OTOH, we might very well notice that only after the
wakeup hits and request gets completed (all before ->poll() gets
around to the second poll_wait()). In that case it's too late to
decide that we have an error.
req->woken was an attempt to deal with that. Unfortunately, it was
broken. What we need to keep track of is not that wakeup has happened -
the thing might come back after that. It's that async reference is
already gone and won't come back, so we can't (and needn't) put the
request on the list of cancellables.
The easiest case is "request hadn't been put on any waitqueues"; we
can tell by seeing NULL apt.head, and in that case there won't be
anything async. We should either complete the request ourselves
(if vfs_poll() reports anything of interest) or return an error.
In all other cases we get exclusion with wakeups by grabbing the
queue lock.
If request is currently on queue and we have something interesting
from vfs_poll(), we can steal it and complete the request ourselves.
If it's on queue and vfs_poll() has not reported anything interesting,
we either put it on the cancellable list, or, if we know that it
hadn't been put on all queues ->poll() wanted it on, we steal it and
return an error.
If it's _not_ on queue, it's either been already dealt with (in which
case we do nothing), or there's aio_poll_complete_work() about to be
executed. In that case we either put it on the cancellable list,
or, if we know it hadn't been put on all queues ->poll() wanted it on,
simulate what cancel would've done.
It's a lot more convoluted than I'd like it to be. Single-consumer APIs
suck, and unfortunately aio is not an exception...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2bb874c0d873d13bd9b9b9c6d7b7c4edab18c8b4 upstream.
Instead of having aio_complete() set ->ki_res.{res,res2}, do that
explicitly in its callers, drop the reference (as aio_complete()
used to do) and delay the rest until the final iocb_put().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>