Commit graph

324 commits

Author SHA1 Message Date
Greg Kroah-Hartman
bfe2901c20 This is the 4.19.111 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl5xvNEACgkQONu9yGCS
 aT6VUg//SJSSC5IX7gulaIm8IzvVijE7EKkdkjukJ4TD672J1QqzXVlhKp8tSvAV
 ZBknOar0AP5sDNtvF3cgz0t6w6IJHrLWGyWqcMfUTC75M9HVZH6YUgHDkPmi0g8f
 dyTrVe20/lC5yBNAFmS0pnYB+UfL8biJEF6N++pULZQhOY0eRr6BMKdl2npxH7D3
 YL/jipdGHmwkr/OgOtRaOBgEP6HIu1xKnZUkGzvhF0BOxAM/ib/5lQognOD6x4Hm
 9vHzc8+nBXlWj6N7XkE+I3RiZumUx+vEr2kLljdrTE7cH7ALzJQl4GQ6Db6lbd0E
 q78Y44FhrfKiwxDeGPHKOX39sgzVwCsKhwTg3a4Rq4Aq0I7QQoPikAyCUj9kaeFq
 q8bI0Wub+4nQhzuyv6UgRWaQnIBZxXe56M8z3u4CTy6ljwvn4hXeZ9bkVRyXdQtS
 D4h3WtxFwBed0tQGb5ypv83Wg/lwK8bQHab4LDV9AZNZ3Jrbg70ldlea0GiA8Csc
 Y3MncS6zF9mnAU8ZdsYT3GNkRQS3OTFNeb7+V5MRgdCnG3xk5GltHTy0JYhZKmXH
 8zMXlUgUeyihFx6f7LwFhYk8NTSg3+W700SKND/zd+VK8m7mqT7PB1bkny5zJ6aC
 teehBWmHlxZlL1ENXya8lUEEsOieAWxi3IMlhOEo2roidPW1N0o=
 =ryZW
 -----END PGP SIGNATURE-----

Merge 4.19.111 into android-4.19

Changes in 4.19.111
	phy: Revert toggling reset changes.
	net: phy: Avoid multiple suspends
	cgroup, netclassid: periodically release file_lock on classid updating
	gre: fix uninit-value in __iptunnel_pull_header
	inet_diag: return classid for all socket types
	ipv6/addrconf: call ipv6_mc_up() for non-Ethernet interface
	ipvlan: add cond_resched_rcu() while processing muticast backlog
	ipvlan: do not add hardware address of master to its unicast filter list
	ipvlan: do not use cond_resched_rcu() in ipvlan_process_multicast()
	ipvlan: don't deref eth hdr before checking it's set
	net/ipv6: use configured metric when add peer route
	netlink: Use netlink header as base to calculate bad attribute offset
	net: macsec: update SCI upon MAC address change.
	net: nfc: fix bounds checking bugs on "pipe"
	net/packet: tpacket_rcv: do not increment ring index on drop
	net: stmmac: dwmac1000: Disable ACS if enhanced descs are not used
	net: systemport: fix index check to avoid an array out of bounds access
	r8152: check disconnect status after long sleep
	sfc: detach from cb_page in efx_copy_channel()
	bnxt_en: reinitialize IRQs when MTU is modified
	cgroup: memcg: net: do not associate sock with unrelated cgroup
	net: memcg: late association of sock to memcg
	net: memcg: fix lockdep splat in inet_csk_accept()
	devlink: validate length of param values
	fib: add missing attribute validation for tun_id
	nl802154: add missing attribute validation
	nl802154: add missing attribute validation for dev_type
	can: add missing attribute validation for termination
	macsec: add missing attribute validation for port
	net: fq: add missing attribute validation for orphan mask
	team: add missing attribute validation for port ifindex
	team: add missing attribute validation for array index
	nfc: add missing attribute validation for SE API
	nfc: add missing attribute validation for deactivate target
	nfc: add missing attribute validation for vendor subcommand
	net: phy: fix MDIO bus PM PHY resuming
	selftests/net/fib_tests: update addr_metric_test for peer route testing
	net/ipv6: need update peer route when modify metric
	net/ipv6: remove the old peer route if change it to a new one
	tipc: add missing attribute validation for MTU property
	devlink: validate length of region addr/len
	bonding/alb: make sure arp header is pulled before accessing it
	slip: make slhc_compress() more robust against malicious packets
	net: fec: validate the new settings in fec_enet_set_coalesce()
	macvlan: add cond_resched() during multicast processing
	cgroup: cgroup_procs_next should increase position index
	cgroup: Iterate tasks that did not finish do_exit()
	iwlwifi: mvm: Do not require PHY_SKU NVM section for 3168 devices
	virtio-blk: fix hw_queue stopped on arbitrary error
	iommu/vt-d: quirk_ioat_snb_local_iommu: replace WARN_TAINT with pr_warn + add_taint
	netfilter: nf_conntrack: ct_cpu_seq_next should increase position index
	netfilter: synproxy: synproxy_cpu_seq_next should increase position index
	netfilter: xt_recent: recent_seq_next should increase position index
	netfilter: x_tables: xt_mttg_seq_next should increase position index
	workqueue: don't use wq_select_unbound_cpu() for bound works
	drm/amd/display: remove duplicated assignment to grph_obj_type
	ktest: Add timeout for ssh sync testing
	cifs_atomic_open(): fix double-put on late allocation failure
	gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcache
	KVM: x86: clear stale x86_emulate_ctxt->intercept value
	ARC: define __ALIGN_STR and __ALIGN symbols for ARC
	macintosh: windfarm: fix MODINFO regression
	efi: Fix a race and a buffer overflow while reading efivars via sysfs
	efi: Make efi_rts_work accessible to efi page fault handler
	mt76: fix array overflow on receiving too many fragments for a packet
	x86/mce: Fix logic and comments around MSR_PPIN_CTL
	iommu/dma: Fix MSI reservation allocation
	iommu/vt-d: dmar: replace WARN_TAINT with pr_warn + add_taint
	iommu/vt-d: Fix a bug in intel_iommu_iova_to_phys() for huge page
	batman-adv: Don't schedule OGM for disabled interface
	pinctrl: meson-gxl: fix GPIOX sdio pins
	pinctrl: core: Remove extra kref_get which blocks hogs being freed
	drm/i915/gvt: Fix unnecessary schedule timer when no vGPU exits
	i2c: gpio: suppress error on probe defer
	nl80211: add missing attribute validation for critical protocol indication
	nl80211: add missing attribute validation for beacon report scanning
	nl80211: add missing attribute validation for channel switch
	perf bench futex-wake: Restore thread count default to online CPU count
	netfilter: cthelper: add missing attribute validation for cthelper
	netfilter: nft_payload: add missing attribute validation for payload csum flags
	netfilter: nft_tunnel: add missing attribute validation for tunnels
	iommu/vt-d: Fix the wrong printing in RHSA parsing
	iommu/vt-d: Ignore devices with out-of-spec domain number
	i2c: acpi: put device when verifying client fails
	ipv6: restrict IPV6_ADDRFORM operation
	net/smc: check for valid ib_client_data
	net/smc: cancel event worker during device removal
	efi: Add a sanity check to efivar_store_raw()
	batman-adv: Avoid free/alloc race when handling OGM2 buffer
	Linux 4.19.111

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ide220f0b6a12d291bda4a83f17cde25bbe64e2ff
2020-03-18 08:19:47 +01:00
Al Viro
a8ab0b7097 cifs_atomic_open(): fix double-put on late allocation failure
commit d9a9f4849fe0c9d560851ab22a85a666cddfdd24 upstream.

several iterations of ->atomic_open() calling conventions ago, we
used to need fput() if ->atomic_open() failed at some point after
successful finish_open().  Now (since 2016) it's not needed -
struct file carries enough state to make fput() work regardless
of the point in struct file lifecycle and discarding it on
failure exits in open() got unified.  Unfortunately, I'd missed
the fact that we had an instance of ->atomic_open() (cifs one)
that used to need that fput(), as well as the stale comment in
finish_open() demanding such late failure handling.  Trivially
fixed...

Fixes: fe9ec8291f "do_last(): take fput() on error after opening to out:"
Cc: stable@kernel.org # v4.7+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-18 07:14:21 +01:00
Greg Kroah-Hartman
75ff56e1a2 This is the 4.19.63 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl1BJrAACgkQONu9yGCS
 aT6MpRAAt2nuozm5Z/MshFAuGFzddAwPoYtkIPSy8BiPHYjf7x0+D5Ew4dz5OihS
 ElfbA94hMOpvhhXlzBU3ZFJsWZIK78gzV6+LiHyb5R97Jdzj/zT4h40y0kxKw+pS
 gghnZ6zx+pGSIXm/EsODW2gg98yTrmhBFpUhXpAGoC/71c1vxVlj3jHcuhd778YK
 NRlj2tFWJGIBpmXrApo1Eg7qQRj4tzbOjthfgANWPq+EP68PgiOaxMDMMp7IUwju
 KrXEFcXlk5aHvfaJ06FSBRBnn45XMyGXPYV/76HsaqkBNmg1r7o02dZKy/0S84Fn
 YXoEFxHt6NFOJ52LiO95z7cQ0xeTSYygNCtYXJ70uyDMnrCPXCrKp7DRyka+vDPs
 RCrcpB1QjcCb3xTL//SPkNWM3oZEW9CawpRFh9bmqbw/h7ZjaUEuwNIJnunISulu
 2fvOjUmFWfUVIARiwKFVuIkXzgf3cSLYZTtiDFC5/yBpkGVNnXqyO7YtWZwnrMHq
 L3DC3pOKuYXMa03KWGEzZoCZEXjtoRhRwCSgV7wbK5o90sZeRj/HC1zXEyDPPD/R
 7A1rePTuwlAH3gHCJGhYkmYqULx62ZdvV6IC2N7xNxeTL1Y7OVNBT7ZUqexxY6WC
 OG1vVxUKNJIvBLYmc6cmQATgR6XH5/B9H2p1YRBLuAQuqHk+jqo=
 =ZQf3
 -----END PGP SIGNATURE-----

Merge 4.19.63 into android-4.19

Changes in 4.19.63
	hvsock: fix epollout hang from race condition
	drm/panel: simple: Fix panel_simple_dsi_probe
	iio: adc: stm32-dfsdm: manage the get_irq error case
	iio: adc: stm32-dfsdm: missing error case during probe
	staging: vt6656: use meaningful error code during buffer allocation
	usb: core: hub: Disable hub-initiated U1/U2
	tty: max310x: Fix invalid baudrate divisors calculator
	pinctrl: rockchip: fix leaked of_node references
	tty: serial: cpm_uart - fix init when SMC is relocated
	drm/amd/display: Fill prescale_params->scale for RGB565
	drm/amdgpu/sriov: Need to initialize the HDP_NONSURFACE_BAStE
	drm/amd/display: Disable ABM before destroy ABM struct
	drm/amdkfd: Fix a potential memory leak
	drm/amdkfd: Fix sdma queue map issue
	drm/edid: Fix a missing-check bug in drm_load_edid_firmware()
	PCI: Return error if cannot probe VF
	drm/bridge: tc358767: read display_props in get_modes()
	drm/bridge: sii902x: pixel clock unit is 10kHz instead of 1kHz
	gpu: host1x: Increase maximum DMA segment size
	drm/crc-debugfs: User irqsafe spinlock in drm_crtc_add_crc_entry
	drm/crc-debugfs: Also sprinkle irqrestore over early exits
	memstick: Fix error cleanup path of memstick_init
	tty/serial: digicolor: Fix digicolor-usart already registered warning
	tty: serial: msm_serial: avoid system lockup condition
	serial: 8250: Fix TX interrupt handling condition
	drm/amd/display: Always allocate initial connector state state
	drm/virtio: Add memory barriers for capset cache.
	phy: renesas: rcar-gen2: Fix memory leak at error paths
	drm/amd/display: fix compilation error
	powerpc/pseries/mobility: prevent cpu hotplug during DT update
	drm/rockchip: Properly adjust to a true clock in adjusted_mode
	serial: imx: fix locking in set_termios()
	tty: serial_core: Set port active bit in uart_port_activate
	usb: gadget: Zero ffs_io_data
	mmc: sdhci: sdhci-pci-o2micro: Check if controller supports 8-bit width
	powerpc/pci/of: Fix OF flags parsing for 64bit BARs
	drm/msm: Depopulate platform on probe failure
	serial: mctrl_gpio: Check if GPIO property exisits before requesting it
	PCI: sysfs: Ignore lockdep for remove attribute
	i2c: stm32f7: fix the get_irq error cases
	kbuild: Add -Werror=unknown-warning-option to CLANG_FLAGS
	genksyms: Teach parser about 128-bit built-in types
	PCI: xilinx-nwl: Fix Multi MSI data programming
	iio: iio-utils: Fix possible incorrect mask calculation
	powerpc/cacheflush: fix variable set but not used
	powerpc/xmon: Fix disabling tracing while in xmon
	recordmcount: Fix spurious mcount entries on powerpc
	mfd: madera: Add missing of table registration
	mfd: core: Set fwnode for created devices
	mfd: arizona: Fix undefined behavior
	mfd: hi655x-pmic: Fix missing return value check for devm_regmap_init_mmio_clk
	mm/swap: fix release_pages() when releasing devmap pages
	um: Silence lockdep complaint about mmap_sem
	powerpc/4xx/uic: clear pending interrupt after irq type/pol change
	RDMA/i40iw: Set queue pair state when being queried
	serial: sh-sci: Terminate TX DMA during buffer flushing
	serial: sh-sci: Fix TX DMA buffer flushing and workqueue races
	IB/mlx5: Fixed reporting counters on 2nd port for Dual port RoCE
	powerpc/mm: Handle page table allocation failures
	IB/ipoib: Add child to parent list only if device initialized
	arm64: assembler: Switch ESB-instruction with a vanilla nop if !ARM64_HAS_RAS
	PCI: mobiveil: Fix PCI base address in MEM/IO outbound windows
	PCI: mobiveil: Fix the Class Code field
	kallsyms: exclude kasan local symbols on s390
	PCI: mobiveil: Initialize Primary/Secondary/Subordinate bus numbers
	PCI: mobiveil: Use the 1st inbound window for MEM inbound transactions
	perf test mmap-thread-lookup: Initialize variable to suppress memory sanitizer warning
	perf stat: Fix use-after-freed pointer detected by the smatch tool
	perf top: Fix potential NULL pointer dereference detected by the smatch tool
	perf session: Fix potential NULL pointer dereference found by the smatch tool
	perf annotate: Fix dereferencing freed memory found by the smatch tool
	perf hists browser: Fix potential NULL pointer dereference found by the smatch tool
	RDMA/rxe: Fill in wc byte_len with IB_WC_RECV_RDMA_WITH_IMM
	PCI: dwc: pci-dra7xx: Fix compilation when !CONFIG_GPIOLIB
	powerpc/boot: add {get, put}_unaligned_be32 to xz_config.h
	block: init flush rq ref count to 1
	f2fs: avoid out-of-range memory access
	mailbox: handle failed named mailbox channel request
	dlm: check if workqueues are NULL before flushing/destroying
	powerpc/eeh: Handle hugepages in ioremap space
	block/bio-integrity: fix a memory leak bug
	sh: prevent warnings when using iounmap
	mm/kmemleak.c: fix check for softirq context
	9p: pass the correct prototype to read_cache_page
	mm/gup.c: mark undo_dev_pagemap as __maybe_unused
	mm/gup.c: remove some BUG_ONs from get_gate_page()
	memcg, fsnotify: no oom-kill for remote memcg charging
	mm/mmu_notifier: use hlist_add_head_rcu()
	proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup
	proc: use down_read_killable mmap_sem for /proc/pid/pagemap
	proc: use down_read_killable mmap_sem for /proc/pid/clear_refs
	proc: use down_read_killable mmap_sem for /proc/pid/map_files
	cxgb4: reduce kernel stack usage in cudbg_collect_mem_region()
	proc: use down_read_killable mmap_sem for /proc/pid/maps
	locking/lockdep: Fix lock used or unused stats error
	mm: use down_read_killable for locking mmap_sem in access_remote_vm
	locking/lockdep: Hide unused 'class' variable
	usb: wusbcore: fix unbalanced get/put cluster_id
	usb: pci-quirks: Correct AMD PLL quirk detection
	btrfs: inode: Don't compress if NODATASUM or NODATACOW set
	x86/sysfb_efi: Add quirks for some devices with swapped width and height
	x86/speculation/mds: Apply more accurate check on hypervisor platform
	binder: prevent transactions to context manager from its own process.
	fpga-manager: altera-ps-spi: Fix build error
	mei: me: add mule creek canyon (EHL) device ids
	hpet: Fix division by zero in hpet_time_div()
	ALSA: ac97: Fix double free of ac97_codec_device
	ALSA: line6: Fix wrong altsetting for LINE6_PODHD500_1
	ALSA: hda - Add a conexant codec entry to let mute led work
	powerpc/xive: Fix loop exit-condition in xive_find_target_in_mask()
	powerpc/tm: Fix oops on sigreturn on systems without TM
	libnvdimm/bus: Stop holding nvdimm_bus_list_mutex over __nd_ioctl()
	access: avoid the RCU grace period for the temporary subjective credentials
	Linux 4.19.63

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic31529aa6fd283d16d6bfb182187a9402a4db44f
2019-07-31 08:03:42 +02:00
Linus Torvalds
408af82309 access: avoid the RCU grace period for the temporary subjective credentials
commit d7852fbd0f0423937fa287a598bfde188bb68c22 upstream.

It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
work because it installs a temporary credential that gets allocated and
freed for each system call.

The allocation and freeing overhead is mostly benign, but because
credentials can be accessed under the RCU read lock, the freeing
involves a RCU grace period.

Which is not a huge deal normally, but if you have a lot of access()
calls, this causes a fair amount of seconday damage: instead of having a
nice alloc/free patterns that hits in hot per-CPU slab caches, you have
all those delayed free's, and on big machines with hundreds of cores,
the RCU overhead can end up being enormous.

But it turns out that all of this is entirely unnecessary.  Exactly
because access() only installs the credential as the thread-local
subjective credential, the temporary cred pointer doesn't actually need
to be RCU free'd at all.  Once we're done using it, we can just free it
synchronously and avoid all the RCU overhead.

So add a 'non_rcu' flag to 'struct cred', which can be set by users that
know they only use it in non-RCU context (there are other potential
users for this).  We can make it a union with the rcu freeing list head
that we need for the RCU case, so this doesn't need any extra storage.

Note that this also makes 'get_current_cred()' clear the new non_rcu
flag, in case we have filesystems that take a long-term reference to the
cred and then expect the RCU delayed freeing afterwards.  It's not
entirely clear that this is required, but it makes for clear semantics:
the subjective cred remains non-RCU as long as you only access it
synchronously using the thread-local accessors, but you _can_ use it as
a generic cred if you want to.

It is possible that we should just remove the whole RCU markings for
->cred entirely.  Only ->real_cred is really supposed to be accessed
through RCU, and the long-term cred copies that nfs uses might want to
explicitly re-enable RCU freeing if required, rather than have
get_current_cred() do it implicitly.

But this is a "minimal semantic changes" change for the immediate
problem.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jan Glauber <jglauber@marvell.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>
Cc: Greg KH <greg@kroah.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-31 07:27:11 +02:00
Greg Kroah-Hartman
44c5f03127 This is the 4.19.41 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlzSZ3MACgkQONu9yGCS
 aT7f2hAAww50MxygeVQXgB0zJ7oSlEBwJAUH1GRCj1ual70pDoK9W8iuacPI4gNu
 +A0V4cD+lndMZ8E+F9/Vmzhmj/xHdxRfCkRmtnpa5IKUJqA/bLr1kdspYO6G354G
 ZBGvizvmFaAikbqNSnoadQZ5AYoT/C4FxxD6t/meQtnwmrK+OhRlh1xCkTddAbQo
 UEPDns9pxqrwHK7tS0GbPXLWgsr4y7BPq92ILdt/GFZGmEBJa0+tf66/DF+RMKjD
 OCmAkiFr+hN8KScUHeZpJKdmI3s7dHEUnQnwkHyC1TwjEXwxhtA0QZgzl9j1F8JI
 f27kbJrpmYuw2QADJyZmCnTiqBo7XglItxEZ2csJKyV7+jFFlzKT4S26iyrHwgXp
 BTdSuTaVe2Ubp/EiULwoSsmxaIyW6fROotQK/x4oB0eDm+OSXq217nnbnF2MRpMb
 eOqzRPCd5CvL8+7T9rc0kOJcDuK/9h95OjzZWBo9rDGz0glxlokw7cJ4fUyrJMPo
 kcw7ZIRRHLtKGTD/ZKX3XrErAqz9I4RAEzBRAXT6W0is7PUi00vFNcOU4ThNSsSK
 ihfuHsp9Jq/SKEUl6/CumU8e91LWv77/gqYxLLuygpuKLFLpMMc/PQOruxXT6ojx
 QC/KIUtcMfTP+xy23Wg2UQ1BXhl+HcfShXzvcEXxHzjgHpglKvc=
 =S4RT
 -----END PGP SIGNATURE-----

Merge 4.19.41 into android-4.19

Changes in 4.19.41
	iwlwifi: fix driver operation for 5350
	mwifiex: Make resume actually do something useful again on SDIO cards
	mac80211: don't attempt to rename ERR_PTR() debugfs dirs
	i2c: synquacer: fix enumeration of slave devices
	i2c: imx: correct the method of getting private data in notifier_call
	i2c: Remove unnecessary call to irq_find_mapping
	i2c: Clear client->irq in i2c_device_remove
	i2c: Allow recovery of the initial IRQ by an I2C client device.
	i2c: Prevent runtime suspend of adapter when Host Notify is required
	ALSA: hda/realtek - Add new Dell platform for headset mode
	ALSA: hda/realtek - Fixed Dell AIO speaker noise
	ALSA: hda/realtek - Apply the fixup for ASUS Q325UAR
	USB: yurex: Fix protection fault after device removal
	USB: w1 ds2490: Fix bug caused by improper use of altsetting array
	USB: dummy-hcd: Fix failure to give back unlinked URBs
	usb: usbip: fix isoc packet num validation in get_pipe
	USB: core: Fix unterminated string returned by usb_string()
	USB: core: Fix bug caused by duplicate interface PM usage counter
	nvme-loop: init nvmet_ctrl fatal_err_work when allocate
	efi: Fix debugobjects warning on 'efi_rts_work'
	arm64: dts: rockchip: fix rk3328-roc-cc gmac2io tx/rx_delay
	HID: logitech: check the return value of create_singlethread_workqueue
	HID: debug: fix race condition with between rdesc_show() and device removal
	rtc: cros-ec: Fail suspend/resume if wake IRQ can't be configured
	rtc: sh: Fix invalid alarm warning for non-enabled alarm
	batman-adv: Reduce claim hash refcnt only for removed entry
	batman-adv: Reduce tt_local hash refcnt only for removed entry
	batman-adv: Reduce tt_global hash refcnt only for removed entry
	batman-adv: fix warning in function batadv_v_elp_get_throughput
	ARM: dts: rockchip: Fix gpu opp node names for rk3288
	reset: meson-audio-arb: Fix missing .owner setting of reset_controller_dev
	igb: Fix WARN_ONCE on runtime suspend
	riscv: fix accessing 8-byte variable from RV32
	HID: quirks: Fix keyboard + touchpad on Lenovo Miix 630
	net: hns3: fix compile error
	net/mlx5: E-Switch, Fix esw manager vport indication for more vport commands
	bonding: show full hw address in sysfs for slave entries
	net: stmmac: use correct DMA buffer size in the RX descriptor
	net: stmmac: ratelimit RX error logs
	net: stmmac: don't stop NAPI processing when dropping a packet
	net: stmmac: don't overwrite discard_frame status
	net: stmmac: fix dropping of multi-descriptor RX frames
	net: stmmac: don't log oversized frames
	jffs2: fix use-after-free on symlink traversal
	debugfs: fix use-after-free on symlink traversal
	mfd: twl-core: Disable IRQ while suspended
	block: use blk_free_flush_queue() to free hctx->fq in blk_mq_init_hctx
	rtc: da9063: set uie_unsupported when relevant
	HID: input: add mapping for Assistant key
	vfio/pci: use correct format characters
	scsi: core: add new RDAC LENOVO/DE_Series device
	scsi: storvsc: Fix calculation of sub-channel count
	arm/mach-at91/pm : fix possible object reference leak
	arm64: fix wrong check of on_sdei_stack in nmi context
	net: hns: fix KASAN: use-after-free in hns_nic_net_xmit_hw()
	net: hns: Use NAPI_POLL_WEIGHT for hns driver
	net: hns: Fix probabilistic memory overwrite when HNS driver initialized
	net: hns: fix ICMP6 neighbor solicitation messages discard problem
	net: hns: Fix WARNING when remove HNS driver with SMMU enabled
	libcxgb: fix incorrect ppmax calculation
	KVM: SVM: prevent DBG_DECRYPT and DBG_ENCRYPT overflow
	kmemleak: powerpc: skip scanning holes in the .bss section
	hugetlbfs: fix memory leak for resv_map
	sh: fix multiple function definition build errors
	xsysace: Fix error handling in ace_setup
	fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
	ARM: orion: don't use using 64-bit DMA masks
	ARM: iop: don't use using 64-bit DMA masks
	block: pass no-op callback to INIT_WORK().
	perf/x86/amd: Update generic hardware cache events for Family 17h
	Bluetooth: btusb: request wake pin with NOAUTOEN
	Bluetooth: mediatek: fix up an error path to restore bdev->tx_state
	clk: qcom: Add missing freq for usb30_master_clk on 8998
	staging: iio: adt7316: allow adt751x to use internal vref for all dacs
	staging: iio: adt7316: fix the dac read calculation
	staging: iio: adt7316: fix the dac write calculation
	scsi: RDMA/srpt: Fix a credit leak for aborted commands
	ASoC: Intel: bytcr_rt5651: Revert "Fix DMIC map headsetmic mapping"
	ASoC: wm_adsp: Correct handling of compressed streams that restart
	ASoC: stm32: fix sai driver name initialisation
	platform/x86: intel_pmc_core: Fix PCH IP name
	platform/x86: intel_pmc_core: Handle CFL regmap properly
	IB/core: Unregister notifier before freeing MAD security
	IB/core: Fix potential memory leak while creating MAD agents
	IB/core: Destroy QP if XRC QP fails
	Input: snvs_pwrkey - initialize necessary driver data before enabling IRQ
	Input: stmfts - acknowledge that setting brightness is a blocking call
	gpio: mxc: add check to return defer probe if clock tree NOT ready
	selinux: avoid silent denials in permissive mode under RCU walk
	selinux: never allow relabeling on context mounts
	mac80211: Honor SW_CRYPTO_CONTROL for unicast keys in AP VLAN mode
	powerpc/mm/hash: Handle mmap_min_addr correctly in get_unmapped_area topdown search
	x86/mce: Improve error message when kernel cannot recover, p2
	clk: x86: Add system specific quirk to mark clocks as critical
	x86/mm/KASLR: Fix the size of the direct mapping section
	x86/mm: Fix a crash with kmemleak_scan()
	x86/mm/tlb: Revert "x86/mm: Align TLB invalidation info"
	i2c: i2c-stm32f7: Fix SDADEL minimum formula
	media: v4l2: i2c: ov7670: Fix PLL bypass register values
	ASoC: wm_adsp: Check for buffer in trigger stop
	mm/kmemleak.c: fix unused-function warning
	Linux 4.19.41

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-05-08 07:39:48 +02:00
Kirill Smelkov
04b4d5f75a fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
[ Upstream commit 10dce8af34226d90fa56746a934f8da5dcdba3df ]

Commit 9c225f2655 ("vfs: atomic f_pos accesses as per POSIX") added
locking for file.f_pos access and in particular made concurrent read and
write not possible - now both those functions take f_pos lock for the
whole run, and so if e.g. a read is blocked waiting for data, write will
deadlock waiting for that read to complete.

This caused regression for stream-like files where previously read and
write could run simultaneously, but after that patch could not do so
anymore. See e.g. commit 581d21a2d0 ("xenbus: fix deadlock on writes
to /proc/xen/xenbus") which fixes such regression for particular case of
/proc/xen/xenbus.

The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
safety for read/write/lseek and added the locking to file descriptors of
all regular files. In 2014 that thread-safety problem was not new as it
was already discussed earlier in 2006.

However even though 2006'th version of Linus's patch was adding f_pos
locking "only for files that are marked seekable with FMODE_LSEEK (thus
avoiding the stream-like objects like pipes and sockets)", the 2014
version - the one that actually made it into the tree as 9c225f2655 -
is doing so irregardless of whether a file is seekable or not.

See

    https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
    https://lwn.net/Articles/180387
    https://lwn.net/Articles/180396

for historic context.

The reason that it did so is, probably, that there are many files that
are marked non-seekable, but e.g. their read implementation actually
depends on knowing current position to correctly handle the read. Some
examples:

	kernel/power/user.c		snapshot_read
	fs/debugfs/file.c		u32_array_read
	fs/fuse/control.c		fuse_conn_waiting_read + ...
	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
	arch/s390/hypfs/inode.c		hypfs_read_iter
	...

Despite that, many nonseekable_open users implement read and write with
pure stream semantics - they don't depend on passed ppos at all. And for
those cases where read could wait for something inside, it creates a
situation similar to xenbus - the write could be never made to go until
read is done, and read is waiting for some, potentially external, event,
for potentially unbounded time -> deadlock.

Besides xenbus, there are 14 such places in the kernel that I've found
with semantic patch (see below):

	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()

In addition to the cases above another regression caused by f_pos
locking is that now FUSE filesystems that implement open with
FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
stream-like files - for the same reason as above e.g. read can deadlock
write locking on file.f_pos in the kernel.

FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f7 ("fuse:
implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
write routines not depending on current position at all, and with both
read and write being potentially blocking operations:

See

    https://github.com/libfuse/osspd
    https://lwn.net/Articles/308445

    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510

Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
"somewhat pipe-like files ..." with read handler not using offset.
However that test implements only read without write and cannot exercise
the deadlock scenario:

    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216

I've actually hit the read vs write deadlock for real while implementing
my FUSE filesystem where there is /head/watch file, for which open
creates separate bidirectional socket-like stream in between filesystem
and its user with both read and write being later performed
simultaneously. And there it is semantically not easy to split the
stream into two separate read-only and write-only channels:

    https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169

Let's fix this regression. The plan is:

1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
   doing so would break many in-kernel nonseekable_open users which
   actually use ppos in read/write handlers.

2. Add stream_open() to kernel to open stream-like non-seekable file
   descriptors. Read and write on such file descriptors would never use
   nor change ppos. And with that property on stream-like files read and
   write will be running without taking f_pos lock - i.e. read and write
   could be running simultaneously.

3. With semantic patch search and convert to stream_open all in-kernel
   nonseekable_open users for which read and write actually do not
   depend on ppos and where there is no other methods in file_operations
   which assume @offset access.

4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
   steam_open if that bit is present in filesystem open reply.

   It was tempting to change fs/fuse/ open handler to use stream_open
   instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
   grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
   and in particular GVFS which actually uses offset in its read and
   write handlers

	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481

   so if we would do such a change it will break a real user.

5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
   from v3.14+ (the kernel where 9c225f2655 first appeared).

   This will allow to patch OSSPD and other FUSE filesystems that
   provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
   in their open handler and this way avoid the deadlock on all kernel
   versions. This should work because fs/fuse/ ignores unknown open
   flags returned from a filesystem and so passing FOPEN_STREAM to a
   kernel that is not aware of this flag cannot hurt. In turn the kernel
   that is not aware of FOPEN_STREAM will be < v3.14 where just
   FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
   write deadlock.

This patch adds stream_open, converts /proc/xen/xenbus to it and adds
semantic patch to automatically locate in-kernel places that are either
required to be converted due to read vs write deadlock, or that are just
safe to be converted because read and write do not use ppos and there
are no other funky methods in file_operations.

Regarding semantic patch I've verified each generated change manually -
that it is correct to convert - and each other nonseekable_open instance
left - that it is either not correct to convert there, or that it is not
converted due to current stream_open.cocci limitations.

The script also does not convert files that should be valid to convert,
but that currently have .llseek = noop_llseek or generic_file_llseek for
unknown reason despite file being opened with nonseekable_open (e.g.
drivers/input/mousedev.c)

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Yongzhi Pan <panyongzhi@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Nikolaus Rath <Nikolaus@rath.org>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-08 07:21:51 +02:00
Greg Kroah-Hartman
0b065cd568 This is the 4.19.33 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlykNfcACgkQONu9yGCS
 aT40dRAAiCeYjEC1zH8dkAnbFlKo6IZuhKgISfVTgrWlRe9nUTYaenBXqAfGjufH
 EzXHrD1IANRAnFfWg8xt01TNBfTaiEYnYzFmJkWAHFWGKxa5fRU5Kan0MB97r8s9
 NjoSRsnFl8l2oJI88zwFa7k89Itop9ST/zvZIgnrysAr+j8yEZb7BZWaU2UrKK/q
 qfnJxjfCb/jeqAxwVh3OkasXj0gG2JkGR/uEGTw2EARuI6pvKo5OCzYz0tXTN6ZJ
 CSzM4X7dhkGSgLIUw3JOCB28riK9TYbOdPr4MFYMrnoU5VL8+n62tXoKXewobJ1C
 2+Pmg5E54r13Rr35eoGCiHsW2LGQrOyvm8S9TFB/0SmTPtzUjFlHBC62Vs/AW5ut
 HSmwVy+ILM/xTIdts5QT58Gw+5NCmHxw2oEdrgcct+6FtnR9XPOqZYyzH1tw1ZB+
 DL2PqYyT9czuo2bKanWA37M8339q2INDFskXqKRokQ9GNiqUx1E6fNBqtK5SqyDI
 CdohtAs7xSQoPPbKDiITOmt82MM8xvefKmqTIvHN5B7Ns4lT1QC74DCXEFEKtw6M
 l+p64h6Qw4DiKmqna7fsbKPjmg8pg1lVrCOwD5iUF3JXy+Wxi/OKnr2gqAVMChAq
 GSfFFf+MMZhYzJPRQKxOn4GwDBDz5niaWvQmlPcPWLJ5jdbzl+Y=
 =Yyc/
 -----END PGP SIGNATURE-----

Merge 4.19.33 into android-4.19

Changes in 4.19.33
	Bluetooth: Check L2CAP option sizes returned from l2cap_get_conf_opt
	Bluetooth: Verify that l2cap_get_conf_opt provides large enough buffer
	ipmi_si: Fix crash when using hard-coded device
	dccp: do not use ipv6 header for ipv4 flow
	genetlink: Fix a memory leak on error path
	gtp: change NET_UDP_TUNNEL dependency to select
	ipv6: make ip6_create_rt_rcu return ip6_null_entry instead of NULL
	mac8390: Fix mmio access size probe
	mISDN: hfcpci: Test both vendor & device ID for Digium HFC4S
	net: aquantia: fix rx checksum offload for UDP/TCP over IPv6
	net: datagram: fix unbounded loop in __skb_try_recv_datagram()
	net/packet: Set __GFP_NOWARN upon allocation in alloc_pg_vec
	net: phy: meson-gxl: fix interrupt support
	net: rose: fix a possible stack overflow
	net: stmmac: fix memory corruption with large MTUs
	net-sysfs: call dev_hold if kobject_init_and_add success
	packets: Always register packet sk in the same order
	rhashtable: Still do rehash when we get EEXIST
	sctp: get sctphdr by offset in sctp_compute_cksum
	sctp: use memdup_user instead of vmemdup_user
	tcp: do not use ipv6 header for ipv4 flow
	tipc: allow service ranges to be connect()'ed on RDM/DGRAM
	tipc: change to check tipc_own_id to return in tipc_net_stop
	tipc: fix cancellation of topology subscriptions
	tun: properly test for IFF_UP
	vrf: prevent adding upper devices
	vxlan: Don't call gro_cells_destroy() before device is unregistered
	ila: Fix rhashtable walker list corruption
	net: sched: fix cleanup NULL pointer exception in act_mirr
	thunderx: enable page recycling for non-XDP case
	thunderx: eliminate extra calls to put_page() for pages held for recycling
	tun: add a missing rcu_read_unlock() in error path
	powerpc/fsl: Add infrastructure to fixup branch predictor flush
	powerpc/fsl: Add macro to flush the branch predictor
	powerpc/fsl: Emulate SPRN_BUCSR register
	powerpc/fsl: Add nospectre_v2 command line argument
	powerpc/fsl: Flush the branch predictor at each kernel entry (64bit)
	powerpc/fsl: Flush the branch predictor at each kernel entry (32 bit)
	powerpc/fsl: Flush branch predictor when entering KVM
	powerpc/fsl: Enable runtime patching if nospectre_v2 boot arg is used
	powerpc/fsl: Update Spectre v2 reporting
	powerpc/fsl: Fixed warning: orphan section `__btb_flush_fixup'
	powerpc/fsl: Fix the flush of branch predictor.
	powerpc/security: Fix spectre_v2 reporting
	Btrfs: fix incorrect file size after shrinking truncate and fsync
	btrfs: remove WARN_ON in log_dir_items
	btrfs: don't report readahead errors and don't update statistics
	btrfs: raid56: properly unmap parity page in finish_parity_scrub()
	btrfs: Avoid possible qgroup_rsv_size overflow in btrfs_calculate_inode_block_rsv_size
	Btrfs: fix assertion failure on fsync with NO_HOLES enabled
	ARM: imx6q: cpuidle: fix bug that CPU might not wake up at expected time
	powerpc: bpf: Fix generation of load/store DW instructions
	vfio: ccw: only free cp on final interrupt
	NFS: fix mount/umount race in nlmclnt.
	NFSv4.1 don't free interrupted slot on open
	net: dsa: qca8k: remove leftover phy accessors
	ALSA: rawmidi: Fix potential Spectre v1 vulnerability
	ALSA: seq: oss: Fix Spectre v1 vulnerability
	ALSA: pcm: Fix possible OOB access in PCM oss plugins
	ALSA: pcm: Don't suspend stream in unrecoverable PCM state
	ALSA: hda/realtek - Add support headset mode for DELL WYSE AIO
	ALSA: hda/realtek - Add support headset mode for New DELL WYSE NB
	ALSA: hda/realtek: Enable headset MIC of Acer AIO with ALC286
	ALSA: hda/realtek: Enable headset MIC of Acer Aspire Z24-890 with ALC286
	ALSA: hda/realtek - Add support for Acer Aspire E5-523G/ES1-432 headset mic
	ALSA: hda/realtek: Enable ASUS X441MB and X705FD headset MIC with ALC256
	ALSA: hda/realtek: Enable headset mic of ASUS P5440FF with ALC256
	ALSA: hda/realtek: Enable headset MIC of ASUS X430UN and X512DK with ALC256
	ALSA: hda/realtek - Fix speakers on Acer Predator Helios 500 Ryzen laptops
	kbuild: modversions: Fix relative CRC byte order interpretation
	fs/open.c: allow opening only regular files during execve()
	ocfs2: fix inode bh swapping mixup in ocfs2_reflink_inodes_lock
	scsi: sd: Fix a race between closing an sd device and sd I/O
	scsi: sd: Quiesce warning if device does not report optimal I/O size
	scsi: zfcp: fix rport unblock if deleted SCSI devices on Scsi_Host
	scsi: zfcp: fix scsi_eh host reset with port_forced ERP for non-NPIV FCP devices
	drm/rockchip: vop: reset scale mode when win is disabled
	tty: mxs-auart: fix a potential NULL pointer dereference
	tty: atmel_serial: fix a potential NULL pointer dereference
	tty: serial: qcom_geni_serial: Initialize baud in qcom_geni_console_setup
	staging: comedi: ni_mio_common: Fix divide-by-zero for DIO cmdtest
	staging: speakup_soft: Fix alternate speech with other synths
	staging: vt6655: Remove vif check from vnt_interrupt
	staging: vt6655: Fix interrupt race condition on device start up.
	staging: erofs: fix to handle error path of erofs_vmap()
	serial: max310x: Fix to avoid potential NULL pointer dereference
	serial: mvebu-uart: Fix to avoid a potential NULL pointer dereference
	serial: sh-sci: Fix setting SCSCR_TIE while transferring data
	USB: serial: cp210x: add new device id
	USB: serial: ftdi_sio: add additional NovaTech products
	USB: serial: mos7720: fix mos_parport refcount imbalance on error path
	USB: serial: option: set driver_info for SIM5218 and compatibles
	USB: serial: option: add support for Quectel EM12
	USB: serial: option: add Olicard 600
	Disable kgdboc failed by echo space to /sys/module/kgdboc/parameters/kgdboc
	fs/proc/proc_sysctl.c: fix NULL pointer dereference in put_links
	drm/vgem: fix use-after-free when drm_gem_handle_create() fails
	drm/vkms: fix use-after-free when drm_gem_handle_create() fails
	drm/i915/gvt: Fix MI_FLUSH_DW parsing with correct index check
	gpio: exar: add a check for the return value of ida_simple_get fails
	gpio: adnp: Fix testing wrong value in adnp_gpio_direction_input
	phy: sun4i-usb: Support set_mode to USB_HOST for non-OTG PHYs
	usb: mtu3: fix EXTCON dependency
	USB: gadget: f_hid: fix deadlock in f_hidg_write()
	usb: common: Consider only available nodes for dr_mode
	usb: host: xhci-rcar: Add XHCI_TRUST_TX_LENGTH quirk
	xhci: Fix port resume done detection for SS ports with LPM enabled
	usb: xhci: dbc: Don't free all memory with spinlock held
	xhci: Don't let USB3 ports stuck in polling state prevent suspend
	usb: cdc-acm: fix race during wakeup blocking TX traffic
	mm: add support for kmem caches in DMA32 zone
	iommu/io-pgtable-arm-v7s: request DMA32 memory, and improve debugging
	mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified
	mm/migrate.c: add missing flush_dcache_page for non-mapped page migrate
	perf pmu: Fix parser error for uncore event alias
	perf intel-pt: Fix TSC slip
	objtool: Query pkg-config for libelf location
	powerpc/pseries/energy: Use OF accessor functions to read ibm,drc-indexes
	powerpc/64: Fix memcmp reading past the end of src/dest
	watchdog: Respect watchdog cpumask on CPU hotplug
	cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n
	x86/smp: Enforce CONFIG_HOTPLUG_CPU when SMP=y
	KVM: Reject device ioctls from processes other than the VM's creator
	KVM: x86: update %rip after emulating IO
	KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts
	staging: erofs: fix error handling when failed to read compresssed data
	staging: erofs: keep corrupted fs from crashing kernel in erofs_readdir()
	bpf: do not restore dst_reg when cur_state is freed
	drivers: base: Helpers for adding device connection descriptions
	platform: x86: intel_cht_int33fe: Register all connections at once
	platform: x86: intel_cht_int33fe: Add connection for the DP alt mode
	platform: x86: intel_cht_int33fe: Add connections for the USB Type-C port
	usb: typec: class: Don't use port parent for getting mux handles
	platform: x86: intel_cht_int33fe: Remove the old connections for the muxes
	Linux 4.19.33

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-04-03 06:53:19 +02:00
Tetsuo Handa
72b790c417 fs/open.c: allow opening only regular files during execve()
commit 73601ea5b7b18eb234219ae2adf77530f389da79 upstream.

syzbot is hitting lockdep warning [1] due to trying to open a fifo
during an execve() operation.  But we don't need to open non regular
files during an execve() operation, for all files which we will need are
the executable file itself and the interpreter programs like /bin/sh and
ld-linux.so.2 .

Since the manpage for execve(2) says that execve() returns EACCES when
the file or a script interpreter is not a regular file, and the manpage
for uselib(2) says that uselib() can return EACCES, and we use
FMODE_EXEC when opening for execve()/uselib(), we can bail out if a non
regular file is requested with FMODE_EXEC set.

Since this deadlock followed by khungtaskd warnings is trivially
reproducible by a local unprivileged user, and syzbot's frequent crash
due to this deadlock defers finding other bugs, let's workaround this
deadlock until we get a chance to find a better solution.

[1] https://syzkaller.appspot.com/bug?id=b5095bfec44ec84213bac54742a82483aad578ce

Link: http://lkml.kernel.org/r/1552044017-7890-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Reported-by: syzbot <syzbot+e93a80c1bb7c5c56e522461c149f8bf55eab1b2b@syzkaller.appspotmail.com>
Fixes: 8924feff66 ("splice: lift pipe_lock out of splice_to_pipe()")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org>	[4.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-03 06:26:23 +02:00
Daniel Rosenberg
e81cea2a6f ANDROID: vfs: Add permission2 for filesystems with per mount permissions
This allows filesystems to use their mount private data to
influence the permssions they return in permission2. It has
been separated into a new call to avoid disrupting current
permission users.

Bug: 35848445
Bug: 120446149
Change-Id: I9d416e3b8b6eca84ef3e336bd2af89ddd51df6ca
Signed-off-by: Daniel Rosenberg <drosen@google.com>
[AmitP: Minor refactoring of original patch to align with
        changes from the following upstream commit
        4bfd054ae1 ("fs: fold __inode_permission() into inode_permission()").
        Also introduce vfs_mkobj2(), because do_create()
        moved from using vfs_create() to vfs_mkobj()
        eecec19d9e ("mqueue: switch to vfs_mkobj(), quit abusing ->d_fsdata")
        do_create() is dropped/cleaned-up upstream so a
        minor refactoring there as well.
        066cc813e9 ("do_mq_open(): move all work prior to dentry_open() into a helper")]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
[astrachan: Folded the following changes into this patch:
            f46c9d62dd81 ("ANDROID: fs: Export vfs_rmdir2")
            9992eb8b9a1e ("ANDROID: xattr: Pass EOPNOTSUPP to permission2")]
Signed-off-by: Alistair Strachan <astrachan@google.com>
2018-12-05 09:48:14 -08:00
Daniel Rosenberg
74cca90e7d ANDROID: vfs: Add setattr2 for filesystems with per mount permissions
This allows filesystems to use their mount private data to
influence the permssions they use in setattr2. It has
been separated into a new call to avoid disrupting current
setattr users.

Bug: 120446149
Change-Id: I19959038309284448f1b7f232d579674ef546385
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2018-12-05 09:48:13 -08:00
Linus Torvalds
d9a185f8b4 overlayfs update for 4.19
This contains two new features:
 
  1) Stack file operations: this allows removal of several hacks from the
     VFS, proper interaction of read-only open files with copy-up,
     possibility to implement fs modifying ioctls properly, and others.
 
  2) Metadata only copy-up: when file is on lower layer and only metadata is
     modified (except size) then only copy up the metadata and continue to
     use the data from the lower file.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3srhAAKCRDh3BK/laaZ
 PC6tAQCP+KklcN+TvNp502f+O/kATahSpgnun4NY1/p4I8JV+AEAzdlkTN3+MiAO
 fn9brN6mBK7h59DO3hqedPLJy2vrgwg=
 =QDXH
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs updates from Miklos Szeredi:
 "This contains two new features:

   - Stack file operations: this allows removal of several hacks from
     the VFS, proper interaction of read-only open files with copy-up,
     possibility to implement fs modifying ioctls properly, and others.

   - Metadata only copy-up: when file is on lower layer and only
     metadata is modified (except size) then only copy up the metadata
     and continue to use the data from the lower file"

* tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
  ovl: Enable metadata only feature
  ovl: Do not do metacopy only for ioctl modifying file attr
  ovl: Do not do metadata only copy-up for truncate operation
  ovl: add helper to force data copy-up
  ovl: Check redirect on index as well
  ovl: Set redirect on upper inode when it is linked
  ovl: Set redirect on metacopy files upon rename
  ovl: Do not set dentry type ORIGIN for broken hardlinks
  ovl: Add an inode flag OVL_CONST_INO
  ovl: Treat metacopy dentries as type OVL_PATH_MERGE
  ovl: Check redirects for metacopy files
  ovl: Move some dir related ovl_lookup_single() code in else block
  ovl: Do not expose metacopy only dentry from d_real()
  ovl: Open file with data except for the case of fsync
  ovl: Add helper ovl_inode_realdata()
  ovl: Store lower data inode in ovl_inode
  ovl: Fix ovl_getattr() to get number of blocks from lower
  ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
  ovl: Copy up meta inode data from lowest data inode
  ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
  ...
2018-08-21 18:19:09 -07:00
Miklos Szeredi
8cf9ee5061 Revert "vfs: do get_write_access() on upper layer of overlayfs"
This reverts commit 4d0c5ba2ff.

We now get write access on both overlay and underlying layers so this patch
is no longer needed for correct operation.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:43 +02:00
Miklos Szeredi
4ab30319fd Revert "vfs: add flags to d_real()"
This reverts commit 495e642939.

No user of "flags" argument of d_real() remain.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:43 +02:00
Miklos Szeredi
6742cee043 Revert "ovl: don't allow writing ioctl on lower layer"
This reverts commit 7c6893e3c9.

Overlayfs no longer relies on the vfs for checking writability of files.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:43 +02:00
Miklos Szeredi
a6518f73e6 vfs: don't open real
Let overlayfs do its thing when opening a file.

This enables stacking and fixes the corner case when a file is opened for
read, modified through a writable open, and data is read from the read-only
file.  After this patch the read-only open will not return stale data even
in this case.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-18 15:44:42 +02:00
Miklos Szeredi
d3b1084dfd vfs: make open_with_fake_path() not contribute to nr_files
Stacking file operations in overlay will store an extra open file for each
overlay file opened.

The overhead is just that of "struct file" which is about 256bytes, because
overlay already pins an extra dentry and inode when the file is open, which
add up to a much larger overhead.

For fear of breaking working setups, don't start accounting the extra file.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:40 +02:00
Al Viro
2abc77af89 new helper: open_with_fake_path()
open a file by given inode, faking ->f_path.  Use with shitloads
of caution - at the very least you'd damn better make sure that
some dentry alias of that inode is pinned down by the path in
question.  Again, this is no general-purpose interface and I hope
it will eventually go away.  Right now overlayfs wants something
like that, but nothing else should.

Any out-of-tree code with bright idea of using this one *will*
eventually get hurt, with zero notice and great delight on my part.
I refuse to use EXPORT_SYMBOL_GPL(), especially in situations when
it's really EXPORT_SYMBOL_DONT_USE_IT(), but don't take that export
as "you are welcome to use it".

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 11:18:42 -04:00
Al Viro
64e1ac4d46 ->atomic_open(): return 0 in all success cases
FMODE_OPENED can be used to distingusish "successful open" from the
"called finish_no_open(), do it yourself" cases.  Since finish_no_open()
has been adjusted, no changes in the instances were actually needed.
The caller has been adjusted.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:21 -04:00
Al Viro
be12af3ef5 getting rid of 'opened' argument of ->atomic_open() - part 1
'opened' argument of finish_open() is unused.  Kill it.

Signed-off-by Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:19 -04:00
Al Viro
aad888f828 switch all remaining checks for FILE_OPENED to FMODE_OPENED
... and don't bother with setting FILE_OPENED at all.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:18 -04:00
Al Viro
69527c554f now we can fold open_check_o_direct() into do_dentry_open()
These checks are better off in do_dentry_open(); the reason we couldn't
put them there used to be that callers couldn't tell what kind of cleanup
would do_dentry_open() failure call for.  Now that we have FMODE_OPENED,
cleanup is the same in all cases - it's simply fput().  So let's fold
that into do_dentry_open(), as Christoph's patch tried to.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:17 -04:00
Al Viro
4d27f3266f fold put_filp() into fput()
Just check FMODE_OPENED in __fput() and be done with that...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:16 -04:00
Al Viro
f5d11409e6 introduce FMODE_OPENED
basically, "is that instance set up enough for regular fput(), or
do we want put_filp() for that one".

NOTE: the only alloc_file() caller that could be followed by put_filp()
is in arch/ia64/kernel/perfmon.c, which is (Kconfig-level) broken.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:16 -04:00
Al Viro
e3f20ae210 security_file_open(): lose cred argument
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:15 -04:00
Al Viro
ae2bb293a3 get rid of cred argument of vfs_open() and do_dentry_open()
always equal to ->f_cred

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:14 -04:00
Al Viro
ea73ea7279 pass ->f_flags value to alloc_empty_file()
... and have it set the f_flags-derived part of ->f_mode.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:13 -04:00
Al Viro
6de37b6dc0 pass creds to get_empty_filp(), make sure dentry_open() passes the right creds
... and rename get_empty_filp() to alloc_empty_file().

dentry_open() gets creds as argument, but the only thing that sees those is
security_file_open() - file->f_cred still ends up with current_cred().  For
almost all callers it's the same thing, but there are several broken cases.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:13 -04:00
Al Viro
6b4e8085c0 make sure do_dentry_open() won't return positive as an error
An ->open() instances really, really should not be doing that.  There's
a lot of places e.g. around atomic_open() that could be confused by that,
so let's catch that early.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-10 23:29:03 -04:00
Al Viro
19f391eb05 turn filp_clone_open() into inline wrapper for dentry_open()
it's exactly the same thing as
	dentry_open(&file->f_path, file->f_flags, file->f_cred)

... and rename it to file_clone_open(), while we are at it.
'filp' naming convention is bogus; sure, it's "file pointer",
but we generally don't do that kind of Hungarian notation.
Some of the instances have too many callers to touch, but this
one has only two, so let's sanitize it while we can...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-10 23:29:03 -04:00
Al Viro
af04fadcaa Revert "fs: fold open_check_o_direct into do_dentry_open"
This reverts commit cab64df194.

Having vfs_open() in some cases drop the reference to
struct file combined with

	error = vfs_open(path, f, cred);
	if (error) {
		put_filp(f);
		return ERR_PTR(error);
	}
	return f;

is flat-out wrong.  It used to be

		error = vfs_open(path, f, cred);
		if (!error) {
			/* from now on we need fput() to dispose of f */
			error = open_check_o_direct(f);
			if (error) {
				fput(f);
				f = ERR_PTR(error);
			}
		} else {
			put_filp(f);
			f = ERR_PTR(error);
		}

and sure, having that open_check_o_direct() boilerplate gotten rid of is
nice, but not that way...

Worse, another call chain (via finish_open()) is FUBAR now wrt
FILE_OPENED handling - in that case we get error returned, with file
already hit by fput() *AND* FILE_OPENED not set.  Guess what happens in
path_openat(), when it hits

	if (!(opened & FILE_OPENED)) {
		BUG_ON(!error);
		put_filp(file);
	}

The root cause of all that crap is that the callers of do_dentry_open()
have no way to tell which way did it fail; while that could be fixed up
(by passing something like int *opened to do_dentry_open() and have it
marked if we'd called ->open()), it's probably much too late in the
cycle to do so right now.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-03 10:58:23 -07:00
Linus Torvalds
9022ca6b11 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff, including Christoph's I_DIRTY patches"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: move I_DIRTY_INODE to fs.h
  ubifs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
  ntfs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
  gfs2: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) calls
  fs: fold open_check_o_direct into do_dentry_open
  vfs: Replace stray non-ASCII homoglyph characters with their ASCII equivalents
  vfs: make sure struct filename->iname is word-aligned
  get rid of pointless includes of fs_struct.h
  [poll] annotate SAA6588_CMD_POLL users
2018-04-06 11:07:08 -07:00
Dominik Brodowski
edf292c76b fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
Using the ksys_fallocate() wrapper allows us to get rid of in-kernel
calls to the sys_fallocate() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_fallocate().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:16:09 +02:00
Dominik Brodowski
df260e21e6 fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
Using the ksys_truncate() wrapper allows us to get rid of in-kernel
calls to the sys_truncate() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_truncate().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:16:08 +02:00
Dominik Brodowski
bae217ea8c fs: add ksys_open() wrapper; remove in-kernel calls to sys_open()
Using this wrapper allows us to avoid the in-kernel calls to the
sys_open() syscall. The ksys_ prefix denotes that this function is meant
as a drop-in replacement for the syscall. In particular, it uses the
same calling convention as sys_open().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:16:01 +02:00
Dominik Brodowski
2ca2a09d62 fs: add ksys_close() wrapper; remove in-kernel calls to sys_close()
Using the ksys_close() wrapper allows us to get rid of in-kernel calls
to the sys_close() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it
uses the same calling convention as sys_close(), with one subtle
difference:

The few places which checked the return value did not care about the return
value re-writing in sys_close(), so simply use a wrapper around
__close_fd().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:16:00 +02:00
Dominik Brodowski
411d9475cf fs: add ksys_ftruncate() wrapper; remove in-kernel calls to sys_ftruncate()
Using the ksys_ftruncate() wrapper allows us to get rid of in-kernel
calls to the sys_ftruncate() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_ftruncate().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:16:00 +02:00
Dominik Brodowski
55731b3cda fs: add do_fchownat(), ksys_fchown() helpers and ksys_{,l}chown() wrappers
Using the fs-interal do_fchownat() wrapper allows us to get rid of
fs-internal calls to the sys_fchownat() syscall.

Introducing the ksys_fchown() helper and the ksys_{,}chown() wrappers
allows us to avoid the in-kernel calls to the sys_{,l,f}chown() syscalls.
The ksys_ prefix denotes that these functions are meant as a drop-in
replacement for the syscalls. In particular, they use the same calling
convention as sys_{,l,f}chown().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:59 +02:00
Dominik Brodowski
cbfe20f565 fs: add do_faccessat() helper and ksys_access() wrapper; remove in-kernel calls to syscall
Using the fs-internal do_faccessat() helper allows us to get rid of
fs-internal calls to the sys_faccessat() syscall.

Introducing the ksys_access() wrapper allows us to avoid the in-kernel
calls to the sys_access() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_access().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:58 +02:00
Dominik Brodowski
03450e271a fs: add ksys_fchmod() and do_fchmodat() helpers and ksys_chmod() wrapper; remove in-kernel calls to syscall
Using the fs-internal do_fchmodat() helper allows us to get rid of
fs-internal calls to the sys_fchmodat() syscall.

Introducing the ksys_fchmod() helper and the ksys_chmod() wrapper allows
us to avoid the in-kernel calls to the sys_fchmod() and sys_chmod()
syscalls. The ksys_ prefix denotes that these functions are meant as a
drop-in replacement for the syscalls. In particular, they use the same
calling convention as sys_fchmod() and sys_chmod().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:57 +02:00
Dominik Brodowski
447016e968 fs: add ksys_chdir() helper; remove in-kernel calls to sys_chdir()
Using this helper allows us to avoid the in-kernel calls to the sys_chdir()
syscall. The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as sys_chdir().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:51 +02:00
Dominik Brodowski
a16fe33ab5 fs: add ksys_chroot() helper; remove-in kernel calls to sys_chroot()
Using this helper allows us to avoid the in-kernel calls to the
sys_chroot() syscall. The ksys_ prefix denotes that this function is
meant as a drop-in replacement for the syscall. In particular, it uses the
same calling convention as sys_chroot().

In the near future, the fs-external callers of ksys_chroot() should be
converted to use kern_path()/set_fs_root() directly. Then ksys_chroot()
can be moved within sys_chroot() again.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:50 +02:00
Christoph Hellwig
cab64df194 fs: fold open_check_o_direct into do_dentry_open
do_dentry_open is where we do the actual open of the file, so this is
where we should do our O_DIRECT sanity check to cover all potential
callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-28 01:39:01 -04:00
Miklos Szeredi
7c6893e3c9 ovl: don't allow writing ioctl on lower layer
Problem with ioctl() is that it's a file operation, yet often used as an
inode operation (i.e. modify the inode despite the file being opened for
read-only).

mnt_want_write_file() is used by filesystems in such cases to get write
access on an arbitrary open file.

Since overlayfs lets filesystems do all file operations, including ioctl,
this can lead to mnt_want_write_file() returning OK for a lower file and
modification of that lower file.

This patch prevents modification by checking if the file is from an
overlayfs lower layer and returning EPERM in that case.

Need to introduce a mnt_want_write_file_path() variant that still does the
old thing for inode operations that can do the copy up + modification
correctly in such cases (fchown, fsetxattr, fremovexattr).

This does not address the correctness of such ioctls on overlayfs (the
correct way would be to copy up and attempt to perform ioctl on upper
file).

In theory this could be a regression.  We very much hope that nobody is
relying on such a hack in any sane setup.

While this patch meddles in VFS code, it has no effect on non-overlayfs
filesystems.

Reported-by: "zhangyi (F)" <yi.zhang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-09-05 12:53:12 +02:00
Miklos Szeredi
495e642939 vfs: add flags to d_real()
Add a separate flags argument (in addition to the open flags) to control
the behavior of d_real().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-09-04 21:42:22 +02:00
Linus Torvalds
088737f44b Writeback error handling fixes (pile #2)
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZXhmCAAoJEAAOaEEZVoIVpRkP/1qlYn3pq6d5Kuz84pejOmlL
 5jbkS/cOmeTxeUU4+B1xG8Lx7bAk8PfSXQOADbSJGiZd0ug95tJxplFYIGJzR/tG
 aNMHeu/BVKKhUKORGuKR9rJKtwC839L/qao+yPBo5U3mU4L73rFWX8fxFuhSJ8HR
 hvkgBu3Hx6GY59CzxJ8iJzj+B+uPSFrNweAk0+0UeWkBgTzEdiGqaXBX4cHIkq/5
 hMoCG+xnmwHKbCBsQ5js+YJT+HedZ4lvfjOqGxgElUyjJ7Bkt/IFYOp8TUiu193T
 tA4UinDjN8A7FImmIBIftrECmrAC9HIGhGZroYkMKbb8ReDR2ikE5FhKEpuAGU3a
 BXBgX2mPQuArvZWM7qeJCkxV9QJ0u/8Ykbyzo30iPrICyrzbEvIubeB/mDA034+Z
 Z0/z8C3v7826F3zP/NyaQEojUgRq30McMOIS8GMnx15HJwRsRKlzjfy9Wm4tWhl0
 t3nH1jMqAZ7068s6rfh/oCwdgGOwr5o4hW/bnlITzxbjWQUOnZIe7KBxIezZJ2rv
 OcIwd5qE8PNtpagGj5oUbnjGOTkERAgsMfvPk5tjUNt28/qUlVs2V0aeo47dlcsh
 oYr8WMOIzw98Rl7Bo70mplLrqLD6nGl0LfXOyUlT4STgLWW4ksmLVuJjWIUxcO/0
 yKWjj9wfYRQ0vSUqhsI5
 =3Z93
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull Writeback error handling updates from Jeff Layton:
 "This pile represents the bulk of the writeback error handling fixes
  that I have for this cycle. Some of the earlier patches in this pile
  may look trivial but they are prerequisites for later patches in the
  series.

  The aim of this set is to improve how we track and report writeback
  errors to userland. Most applications that care about data integrity
  will periodically call fsync/fdatasync/msync to ensure that their
  writes have made it to the backing store.

  For a very long time, we have tracked writeback errors using two flags
  in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
  writeback error occurs (via mapping_set_error) and are cleared as a
  side-effect of filemap_check_errors (as you noted yesterday). This
  model really sucks for userland.

  Only the first task to call fsync (or msync or fdatasync) will see the
  error. Any subsequent task calling fsync on a file will get back 0
  (unless another writeback error occurs in the interim). If I have
  several tasks writing to a file and calling fsync to ensure that their
  writes got stored, then I need to have them coordinate with one
  another. That's difficult enough, but in a world of containerized
  setups that coordination may even not be possible.

  But wait...it gets worse!

  The calls to filemap_check_errors can be buried pretty far down in the
  call stack, and there are internal callers of filemap_write_and_wait
  and the like that also end up clearing those errors. Many of those
  callers ignore the error return from that function or return it to
  userland at nonsensical times (e.g. truncate() or stat()). If I get
  back -EIO on a truncate, there is no reason to think that it was
  because some previous writeback failed, and a subsequent fsync() will
  (incorrectly) return 0.

  This pile aims to do three things:

   1) ensure that when a writeback error occurs that that error will be
      reported to userland on a subsequent fsync/fdatasync/msync call,
      regardless of what internal callers are doing

   2) report writeback errors on all file descriptions that were open at
      the time that the error occurred. This is a user-visible change,
      but I think most applications are written to assume this behavior
      anyway. Those that aren't are unlikely to be hurt by it.

   3) document what filesystems should do when there is a writeback
      error. Today, there is very little consistency between them, and a
      lot of cargo-cult copying. We need to make it very clear what
      filesystems should do in this situation.

  To achieve this, the set adds a new data type (errseq_t) and then
  builds new writeback error tracking infrastructure around that. Once
  all of that is in place, we change the filesystems to use the new
  infrastructure for reporting wb errors to userland.

  Note that this is just the initial foray into cleaning up this mess.
  There is a lot of work remaining here:

   1) convert the rest of the filesystems in a similar fashion. Once the
      initial set is in, then I think most other fs' will be fairly
      simple to convert. Hopefully most of those can in via individual
      filesystem trees.

   2) convert internal waiters on writeback to use errseq_t for
      detecting errors instead of relying on the AS_* flags. I have some
      draft patches for this for ext4, but they are not quite ready for
      prime time yet.

  This was a discussion topic this year at LSF/MM too. If you're
  interested in the gory details, LWN has some good articles about this:

      https://lwn.net/Articles/718734/
      https://lwn.net/Articles/724307/"

* tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  btrfs: minimal conversion to errseq_t writeback error reporting on fsync
  xfs: minimal conversion to errseq_t writeback error reporting
  ext4: use errseq_t based error handling for reporting data writeback errors
  fs: convert __generic_file_fsync to use errseq_t based reporting
  block: convert to errseq_t based writeback error tracking
  dax: set errors in mapping when writeback fails
  Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
  mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
  fs: new infrastructure for writeback error handling and reporting
  lib: add errseq_t type and infrastructure for handling it
  mm: don't TestClearPageError in __filemap_fdatawait_range
  mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
  jbd2: don't clear and reset errors after waiting on writeback
  buffer: set errors in mapping at the time that the error occurs
  fs: check for writeback errors after syncing out buffers in generic_file_fsync
  buffer: use mapping_set_error instead of setting the flag
  mm: fix mapping_set_error call in me_pagecache_dirty
2017-07-07 19:38:17 -07:00
Jeff Layton
5660e13d2f fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.

If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.

This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.

This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-06 07:02:25 -04:00
Jens Axboe
c75b1d9421 fs: add fcntl() interface for setting/getting write life time hints
Define a set of write life time hints:

RWH_WRITE_LIFE_NOT_SET	No hint information set
RWH_WRITE_LIFE_NONE	No hints about write life time
RWH_WRITE_LIFE_SHORT	Data written has a short life time
RWH_WRITE_LIFE_MEDIUM	Data written has a medium life time
RWH_WRITE_LIFE_LONG	Data written has a long life time
RWH_WRITE_LIFE_EXTREME	Data written has an extremely long life time

The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.

Add an fcntl interface for querying these flags, and also for
setting them as well:

F_GET_RW_HINT		Returns the read/write hint set on the
			underlying inode.

F_SET_RW_HINT		Set one of the above write hints on the
			underlying inode.

F_GET_FILE_RW_HINT	Returns the read/write hint set on the
			file descriptor.

F_SET_FILE_RW_HINT	Set one of the above write hints on the
			file descriptor.

The user passes in a 64-bit pointer to get/set these values, and
the interface returns 0/-1 on success/error.

Sample program testing/implementing basic setting/getting of write
hints is below.

Add support for storing the write life time hint in the inode flags
and in struct file as well, and pass them to the kiocb flags. If
both a file and its corresponding inode has a write hint, then we
use the one in the file, if available. The file hint can be used
for sync/direct IO, for buffered writeback only the inode hint
is available.

This is in preparation for utilizing these hints in the block layer,
to guide on-media data placement.

/*
 * writehint.c: get or set an inode write hint
 */
 #include <stdio.h>
 #include <fcntl.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <stdbool.h>
 #include <inttypes.h>

 #ifndef F_GET_RW_HINT
 #define F_LINUX_SPECIFIC_BASE	1024
 #define F_GET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 11)
 #define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
 #endif

static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
			"RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
			"RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };

int main(int argc, char *argv[])
{
	uint64_t hint;
	int fd, ret;

	if (argc < 2) {
		fprintf(stderr, "%s: file <hint>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 2;
	}

	if (argc > 2) {
		hint = atoi(argv[2]);
		ret = fcntl(fd, F_SET_RW_HINT, &hint);
		if (ret < 0) {
			perror("fcntl: F_SET_RW_HINT");
			return 4;
		}
	}

	ret = fcntl(fd, F_GET_RW_HINT, &hint);
	if (ret < 0) {
		perror("fcntl: F_GET_RW_HINT");
		return 3;
	}

	printf("%s: hint %s\n", argv[1], str[hint]);
	close(fd);
	return 0;
}

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 12:05:22 -06:00
Linus Torvalds
050453295f Merge branch 'work.sane_pwd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Making sure that something like a referral point won't end up as pwd
  or root.

  The main part is the last commit (fixing mntns_install()); that one
  fixes a hard-to-hit race. The fchdir() commit is making fchdir(2) a
  bit more robust - it should be impossible to get opened files (even
  O_PATH ones) for referral points in the first place, so the existing
  checks are OK, but checking the same thing as in chdir(2) is just as
  cheap.

  The path_init() commit removes a redundant check that shouldn't have
  been there in the first place"

* 'work.sane_pwd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make sure that mntns_install() doesn't end up with referral for root
  path_init(): don't bother with checking MAY_EXEC for LOOKUP_ROOT
  make sure that fchdir() won't accept referral points, etc.
2017-05-12 11:39:59 -07:00
Linus Torvalds
b948abf53a Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs update from Miklos Szeredi:
 "The biggest part of this is making st_dev/st_ino on the overlay behave
  like a normal filesystem (i.e. st_ino doesn't change on copy up,
  st_dev is the same for all files and directories). Currently this only
  works if all layers are on the same filesystem, but future work will
  move the general case towards more sane behavior.

  There are also miscellaneous fixes, including fixes to handling
  append-only files. There's a small change in the VFS, but that only
  has an effect on overlayfs, since otherwise file->f_path.dentry->inode
  and file_inode(file) are always the same"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: update documentation w.r.t. constant inode numbers
  ovl: persistent inode numbers for upper hardlinks
  ovl: merge getattr for dir and nondir
  ovl: constant st_ino/st_dev across copy up
  ovl: persistent inode number for directories
  ovl: set the ORIGIN type flag
  ovl: lookup non-dir copy-up-origin by file handle
  ovl: use an auxiliary var for overlay root entry
  ovl: store file handle of lower inode on copy up
  ovl: check if all layers are on the same fs
  ovl: do not set overlay.opaque on non-dir create
  ovl: check IS_APPEND() on real upper inode
  vfs: ftruncate check IS_APPEND() on real upper inode
  ovl: Use designated initializers
  ovl: lockdep annotate of nested stacked overlayfs inode lock
2017-05-10 09:03:48 -07:00
Linus Torvalds
11fbf53d66 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted bits and pieces from various people. No common topic in this
  pile, sorry"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/affs: add rename exchange
  fs/affs: add rename2 to prepare multiple methods
  Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()
  fs: don't set *REFERENCED on single use objects
  fs: compat: Remove warning from COMPATIBLE_IOCTL
  remove pointless extern of atime_need_update_rcu()
  fs: completely ignore unknown open flags
  fs: add a VALID_OPEN_FLAGS
  fs: remove _submit_bh()
  fs: constify tree_descr arrays passed to simple_fill_super()
  fs: drop duplicate header percpu-rwsem.h
  fs/affs: bugfix: Write files greater than page size on OFS
  fs/affs: bugfix: enable writes on OFS disks
  fs/affs: remove node generation check
  fs/affs: import amigaffs.h
  fs/affs: bugfix: make symbolic links work again
2017-05-09 09:12:53 -07:00