packages_gid.list was improperly returning the wrong
count. Use scnprintf instead, and inform the user that
the list was truncated if it is.
Bug: 30013843
Change-Id: Ida2b2ef7cd86dd87300bfb4c2cdb6bfe2ee1650d
Signed-off-by: Daniel Rosenberg <drosen@google.com>
CONFIG_SDCARD_FS_CI_SEARCH only guards a define for
LOOKUP_CASE_INSENSITIVE, which is never used in the
kernel. Remove both, along with the option matching
that supports it.
Change-Id: I363a8f31de8ee7a7a934d75300cc9ba8176e2edf
Signed-off-by: Daniel Rosenberg <drosen@google.com>
In M, the workings of sdcardfs were changed significantly.
This brings sdcardfs into line with the changes.
Change-Id: I10e91a84a884c838feef7aa26c0a2b21f02e052e
Fixed existing type-casting in packagelist management code. All
warnings at compile time were taken care of.
Change-Id: I1ea97786d1d1325f31b9f09ae966af1f896a2af5
Signed-off-by: Daniel Campello <campello@google.com>
The permission call for xattr operations happens regardless of
whether or not the xattr functions are implemented.
The xattr functions currently don't have support for permission2.
Passing EOPNOTSUPP as the mount point in xattr_permission allows
us to return EOPNOTSUPP early in permission2, if the filesystem
supports it.
Change-Id: I9d07e4cd633cf40af60450ffbff7ac5c1b4e8c2c
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35848445
This allows filesystems to use their mount private data to
influence the permssions they return in permission2. It has
been separated into a new call to avoid disrupting current
permission users.
Change-Id: I9d416e3b8b6eca84ef3e336bd2af89ddd51df6ca
Signed-off-by: Daniel Rosenberg <drosen@google.com>
[AmitP: Minor refactoring of original patch to align with
changes from the following upstream commit
4bfd054ae1 ("fs: fold __inode_permission() into inode_permission()").
Also introduce vfs_mkobj2(), because do_create()
moved from using vfs_create() to vfs_mkobj()
eecec19d9e ("mqueue: switch to vfs_mkobj(), quit abusing ->d_fsdata")
do_create() is dropped/cleaned up upstream so a
minor refactoring there as well.
066cc813e9 ("do_mq_open(): move all work prior to dentry_open() into a helper")]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
This allows filesystems to use their mount private data to
influence the permssions they use in setattr2. It has
been separated into a new call to avoid disrupting current
setattr users.
Change-Id: I19959038309284448f1b7f232d579674ef546385
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Now we pass the vfsmount when mounting and remounting.
This allows the filesystem to actually set up the mount
specific data, although we can't quite do anything with
it yet. show_options is expanded to include data that
lives with the mount.
To avoid changing existing filesystems, these have
been added as new vfs functions.
Change-Id: If80670bfad9f287abb8ac22457e1b034c9697097
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Allows FUSE to report to inotify that it is acting
as a layered filesystem. The userspace component
returns a string representing the location of the
underlying file. If the string cannot be resolved
into a path, the top level path is returned instead.
bug: 23904372
Change-Id: Iabdca0bbedfbff59e9c820c58636a68ef9683d9f
Signed-off-by: Daniel Rosenberg <drosen@google.com>
This starts to add private data associated directly
to mount points. The intent is to give filesystems
a sense of where they have come from, as a means of
letting a filesystem take different actions based on
this information.
Change-Id: Ie769d7b3bb2f5972afe05c1bf16cf88c91647ab2
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Inotify does not currently know when a filesystem
is acting as a wrapper around another fs. This means
that inotify watchers will miss any modifications to
the base file, as well as any made in a separate
stacked fs that points to the same file.
d_canonical_path solves this problem by allowing the fs
to map a dentry to a path in the lower fs. Inotify
can use it to find the appropriate place to watch to
be informed of all changes to a file.
Change-Id: I09563baffad1711a045e45c1bd0bd8713c2cc0b6
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Only included the source code as is for kernel 3.0. Following patches
take care of porting this file system to version 3.10.
Change-Id: I09e76db77cd98a059053ba5b6fd88572a4b75b5b
Signed-off-by: Daniel Campello <campello@google.com>
Userspace processes often have multiple allocators that each do
anonymous mmaps to get memory. When examining memory usage of
individual processes or systems as a whole, it is useful to be
able to break down the various heaps that were allocated by
each layer and examine their size, RSS, and physical memory
usage.
This patch adds a user pointer to the shared union in
vm_area_struct that points to a null terminated string inside
the user process containing a name for the vma. vmas that
point to the same address will be merged, but vmas that
point to equivalent strings at different addresses will
not be merged.
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
Setting the name to NULL clears it.
The names of named anonymous vmas are shown in /proc/pid/maps
as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
that is only present for named vmas. If the userspace pointer
is no longer valid all or part of the name will be replaced
with "<fault>".
The idea to store a userspace pointer to reduce the complexity
within mm (at the expense of the complexity of reading
/proc/pid/mem) came from Dave Hansen. This results in no
runtime overhead in the mm subsystem other than comparing
the anon_name pointers when considering vma merging. The pointer
is stored in a union with fieds that are only used on file-backed
mappings, so it does not increase memory usage.
Includes fix from Jed Davis <jld@mozilla.com> for typo in
prctl_set_vma_anon_name, which could attempt to set the name
across two vmas at the same time due to a typo, which might
corrupt the vma list. Fix it to use tmp instead of end to limit
the name setting to a single vma at a time.
Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
[AmitP: Fix get_user_pages_remote() call to align with upstream commit
5b56d49fc3 ("mm: add locked parameter to get_user_pages_remote()")]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
adb reboot calls /proc/sysrq-trigger to force an emergency remount (ro) of all
mounted disks. This is executed in the order of the time the mount was originally
done. Because we have a test system which loop mount images from an extra
partition, we see errors cause the loop mounted partitions gets remounted after
this physical partition was set to read only already.
Fix this by reversing the order of the emergency remount. This will remount the
disk first which have been mounted last.
So instead of remounting in this order:
/dev/sda1
/dev/loop1
/dev/loop2
we now remount in this order:
/dev/loop2
/dev/loop1
/dev/sda1
Change-Id: I68fe7e16cc9400ab5278877af70c9ea1d9b57936
Signed-off-by: Christian Poetzsch <christian.potzsch@imgtec.com>
Refactor the fs readpage/write tracepoints to move the
inode->path lookup outside the tracepoint code, and pass a pointer
to the path into the tracepoint code instead. This is necessary
because the tracepoint code runs non-preemptible. Thanks to
Trilok Soni for catching this in 4.4.
Signed-off-by: Mohan Srinivasan <srmohan@google.com>
[AmitP: Folded following android-4.9 commit changes into this patch
a5c4dbb05ab7 ("ANDROID: Replace spaces by '_' for some android filesystem tracepoints.")]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Adds tracepoints in ext4/f2fs/mpage to track readpages/buffered
write()s. This allows us to track files that are being read/written
to PIDs. (Merged from android4.4-common).
Signed-off-by: Mohan Srinivasan <srmohan@google.com>
Change-Id: I6c138de5b2332eea70f57e098134d1d141247b3f
Signed-off-by: Jin Qian <jinqian@google.com>
[AmitP: Refactored changes to align with changes from upstream commit
9a07000400 ("sched/headers: Move CONFIG_TASK_XACCT bits from <linux/sched.h> to <linux/sched/xacct.h>")]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Avoid waking up every thread sleeping in an epoll_wait call during
suspend and resume by calling a freezable blocking call. Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.
This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.
Change-Id: I848d08d28c89302fd42bbbdfa76489a474ab27bf
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull IDA updates from Matthew Wilcox:
"A better IDA API:
id = ida_alloc(ida, GFP_xxx);
ida_free(ida, id);
rather than the cumbersome ida_simple_get(), ida_simple_remove().
The new IDA API is similar to ida_simple_get() but better named. The
internal restructuring of the IDA code removes the bitmap
preallocation nonsense.
I hope the net -200 lines of code is convincing"
* 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits)
ida: Change ida_get_new_above to return the id
ida: Remove old API
test_ida: check_ida_destroy and check_ida_alloc
test_ida: Convert check_ida_conv to new API
test_ida: Move ida_check_max
test_ida: Move ida_check_leaf
idr-test: Convert ida_check_nomem to new API
ida: Start new test_ida module
target/iscsi: Allocate session IDs from an IDA
iscsi target: fix session creation failure handling
drm/vmwgfx: Convert to new IDA API
dmaengine: Convert to new IDA API
ppc: Convert vas ID allocation to new IDA API
media: Convert entity ID allocation to new IDA API
ppc: Convert mmu context allocation to new IDA API
Convert net_namespace to new IDA API
cb710: Convert to new IDA API
rsxx: Convert to new IDA API
osd: Convert to new IDA API
sd: Convert to new IDA API
...
Pull perf updates from Thomas Gleixner:
"Kernel:
- Improve kallsyms coverage
- Add x86 entry trampolines to kcore
- Fix ARM SPE handling
- Correct PPC event post processing
Tools:
- Make the build system more robust
- Small fixes and enhancements all over the place
- Update kernel ABI header copies
- Preparatory work for converting libtraceevnt to a shared library
- License cleanups"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
tools arch: Update arch/x86/lib/memcpy_64.S copy used in 'perf bench mem memcpy'
tools arch x86: Update tools's copy of cpufeatures.h
perf python: Fix pyrf_evlist__read_on_cpu() interface
perf mmap: Store real cpu number in 'struct perf_mmap'
perf tools: Remove ext from struct kmod_path
perf tools: Add gzip_is_compressed function
perf tools: Add lzma_is_compressed function
perf tools: Add is_compressed callback to compressions array
perf tools: Move the temp file processing into decompress_kmodule
perf tools: Use compression id in decompress_kmodule()
perf tools: Store compression id into struct dso
perf tools: Add compression id into 'struct kmod_path'
perf tools: Make is_supported_compression() static
perf tools: Make decompress_to_file() function static
perf tools: Get rid of dso__needs_decompress() call in __open_dso()
perf tools: Get rid of dso__needs_decompress() call in symbol__disassemble()
perf tools: Get rid of dso__needs_decompress() call in read_object_code()
tools lib traceevent: Change to SPDX License format
perf llvm: Allow passing options to llc in addition to clang
perf parser: Improve error message for PMU address filters
...
* memory_failure() gets confused by dev_pagemap backed mappings. The
recovery code has specific enabling for several possible page states
that needs new enabling to handle poison in dax mappings. Teach
memory_failure() about ZONE_DEVICE pages.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9ui8ACgkQYGjFFmlT
OEpNRw//XGj9s7sezfJFeol4psJlRUd935yii/gmJRgi/yPf2VxxQG9qyM6SMBUc
75jASfOL6FSsfxHz0kplyWzMDNdrTkNNAD+9rv80FmY7GqWgcas9DaJX7jZ994vI
5SRO7pfvNZcXlo7IhqZippDw3yxkIU9Ufi0YQKaEUm7GFieptvCZ0p9x3VYfdvwM
BExrxQe0X1XUF4xErp5P78+WUbKxP47DLcucRDig8Q7dmHELUdyNzo3E1SVoc7m+
3CmvyTj6XuFQgOZw7ZKun1BJYfx/eD5ZlRJLZbx6wJHRtTXv/Uea8mZ8mJ31ykN9
F7QVd0Pmlyxys8lcXfK+nvpL09QBE0/PhwWKjmZBoU8AdgP/ZvBXLDL/D6YuMTg6
T4wwtPNJorfV4lVD06OliFkVI4qbKbmNsfRq43Ns7PCaLueu4U/eMaSwSH99UMaZ
MGbO140XW2RZsHiU9yTRUmZq73AplePEjxtzR8oHmnjo45nPDPy8mucWPlkT9kXA
oUFMhgiviK7dOo19H4eaPJGqLmHM93+x5tpYxGqTr0dUOXUadKWxMsTnkID+8Yi7
/kzQWCFvySz3VhiEHGuWkW08GZT6aCcpkREDomnRh4MEnETlZI8bblcuXYOCLs6c
nNf1SIMtLdlsl7U1fEX89PNeQQ2y237vEDhFQZftaalPeu/JJV0=
=Ftop
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm memory-failure update from Dave Jiang:
"As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.
In order to support reliable reverse mapping of user space addresses:
1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.
2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine
the size of the mapping that encompasses a given poisoned pfn.
3/ Given pmem errors can be repaired, change the speculatively
accessed poison protection, mce_unmap_kpfn(), to be reversible and
otherwise allow ongoing access from the kernel.
A side effect of this enabling is that MADV_HWPOISON becomes usable
for dax mappings, however the primary motivation is to allow the
system to survive userspace consumption of hardware-poison via dax.
Specifically the current behavior is:
mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered
<reboot>
...and with these changes:
Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
Memory failure: 0x20cb00: recovery action for dax page: Recovered
Given all the cross dependencies I propose taking this through
nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
folks"
* tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm, pmem: Restore page attributes when clearing errors
x86/memory_failure: Introduce {set, clear}_mce_nospec()
x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
mm, memory_failure: Teach memory_failure() about dev_pagemap pages
filesystem-dax: Introduce dax_lock_mapping_entry()
mm, memory_failure: Collect mapping size in collect_procs()
mm, madvise_inject_error: Let memory_failure() optionally take a page reference
mm, dev_pagemap: Do not clear ->mapping on final put
mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
filesystem-dax: Set page->index
device-dax: Set page->index
device-dax: Enable page_mapping()
device-dax: Convert to vmf_insert_mixed and vm_fault_t
Collection of misc libnvdimm patches for 4.19 submission
* Adding support to read locked nvdimm capacity.
* Change test code to make DSM failure code injection an override.
* Add support for calculate maximum contiguous area for namespace.
* Add support for queueing a short ARS when there is on going ARS for
nvdimm.
* Allow NULL to be passed in to ->direct_access() for kaddr and
pfn params.
* Improve smart injection support for nvdimm emulation testing.
* Fix test code that supports for emulating controller temperature.
* Fix hang on error before devm_memremap_pages()
* Fix a bug that causes user memory corruption when data returned
to user for ars_status.
* Maintainer updates for Ross Zwisler emails and adding Jan Kara to fsdax.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9uUIACgkQYGjFFmlT
OErL+xAAgWHSGs8w98VtYA9kLDeTYEXutq93wJZQoBu/FMAXuuU3hYmQYnOQU87h
KKYmfDkeusaih1R3IX7mzlegnnzSfQ6MraNSV76M43noJHbRTunknCPZH6ebp4fo
b/eljvWlZF/idM+7YcsnoFMnHSRj2pjJGXmKQDlKedHD+KMxpmk6zEl2s5Y0zvPU
4U7UQLtk3D5IIpLNsLEmxge32BfvNf5IzoSO1aZp7Eqk0+U5Tq3Sq/Tjmd+J0RKt
6WH5yA6NqXQgBh+ayHsYU8YX62RqnbKQZXqVxD35OH64zJEUefnP1fpt9pmaZ9eL
43BPMkpM09eLAikO2ET3/3c2k6h3h9ttz1sH8t/hiroCtfmxs3XgskY06hxpKjZV
EbN+BUmut5Mr+zzYitRr3dbK2aHPVU9IbU7jUw/1Tz23rq3kU5iI7SHHv1b/eWup
1Cr77Z1M6HB8VBhjnJ+R607sbRrnKQUOV7fGzAaIskyUOTWsEvIgTh/6MRiaj9MD
5HXIgc/0y9E+G93s7MsUWwzpB7J6E7EGoybST2SKPtqwtDMPsBNeWRjyA9quBCoN
u1s+e+lWHYutqRW0eisDTTlq3nJwPijSx1nnzhJxw9s1EkCXz3f7KRZhyH1C79Co
7wjiuvKQ79e/HI/oXvGmTnv5lbLEpWYyJ3U3KIFfoUqugeyhr0k=
=5p2n
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dave Jiang:
"Collection of misc libnvdimm patches for 4.19 submission:
- Adding support to read locked nvdimm capacity.
- Change test code to make DSM failure code injection an override.
- Add support for calculate maximum contiguous area for namespace.
- Add support for queueing a short ARS when there is on going ARS for
nvdimm.
- Allow NULL to be passed in to ->direct_access() for kaddr and pfn
params.
- Improve smart injection support for nvdimm emulation testing.
- Fix test code that supports for emulating controller temperature.
- Fix hang on error before devm_memremap_pages()
- Fix a bug that causes user memory corruption when data returned to
user for ars_status.
- Maintainer updates for Ross Zwisler emails and adding Jan Kara to
fsdax"
* tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm: fix ars_status output length calculation
device-dax: avoid hang on error before devm_memremap_pages()
tools/testing/nvdimm: improve emulation of smart injection
filesystem-dax: Do not request kaddr and pfn when not required
md/dm-writecache: Don't request pointer dummy_addr when not required
dax/super: Do not request a pointer kaddr when not required
tools/testing/nvdimm: kaddr and pfn can be NULL to ->direct_access()
s390, dcssblk: kaddr and pfn can be NULL to ->direct_access()
libnvdimm, pmem: kaddr and pfn can be NULL to ->direct_access()
acpi/nfit: queue issuing of ars when an uc error notification comes in
libnvdimm: Export max available extent
libnvdimm: Use max contiguous area for namespace size
MAINTAINERS: Add Jan Kara for filesystem DAX
MAINTAINERS: update Ross Zwisler's email address
tools/testing/nvdimm: Fix support for emulating controller temperature
tools/testing/nvdimm: Make DSM failure code injection an override
acpi, nfit: Prefer _DSM over _LSR for namespace label reads
libnvdimm: Introduce locked DIMM capacity support
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAluAdzkACgkQiiy9cAdy
T1GjlAv/SOsNm2sj9Bcq/Z/CpPoFRFoJBFsLeReF78QdbF/+eUFuQJvq0aIK8BmV
0NRmvlQk9oCGQfWN0pLWeRn7a7xUqMQ7HYKSS1fzW1O+kJGlyA1cFjCbiZe4py6m
AFSlpraPTtL7isX/ZMOyZ1D7YKMj4Fq5wndcHPSnMQXI2GlaUAip5k/zamXUbMmo
dFaGDkXc67un6Y/04v18LsKJtOHgbVIAES2OgO0sjqiwp0cnGATsZl/OGzvsTo31
brstBum/0Ig2Mpr+5IXa4QFoP+naNXDyhv+D69huETwsMSImnjGL6L/GAgUUO5/t
sDN6bpQdM9wqpckNuFcV9hKbBHQ1nZhzr0gUjRnmN8CJFHOgI2Ndo4J4N9slnWo9
wEXJV7RN5/VQh3Ozb7m+DpAYo4K4r3Oexxa3MG7+IFY8t89O39PBc0dEQ6O8ztaJ
NCcubQVpSFxC7fwVjY56IHHofyznS7JLKMAcCe3+ssOHwJt0PZr/iqdBNHC9W4ZE
7fcNZtct
=jMLB
-----END PGP SIGNATURE-----
Merge tag '4.19-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three small SMB3 fixes, one for stable"
* tag '4.19-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number for cifs.ko to 2.12
cifs: check kmalloc before use
cifs: check if SMB2 PDU size has been padded and suppress the warning
cifs: create a define for how many iovs we need for an SMB2_open()
At the point where r is being checked for different values, r is always
going to be equal to 2 as the previous if statements jump to end or end1
if r is not 2. Hence the assignment to err can be simplified to just
err an assignment without any checks on the value or r.
Detected by CoverityScan, CID#1226737 ("Logically dead code")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull namespace fixes from Eric Biederman:
"This is a set of four fairly obvious bug fixes:
- a switch from d_find_alias to d_find_any_alias because the xattr
code perversely takes a dentry
- two mutex vs copy_to_user fixes from Jann Horn
- a fix to use a sanitized size not the size userspace passed in from
Christian Brauner"
* 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
getxattr: use correct xattr length
sys: don't hold uts_sem while accessing userspace memory
userns: move user access out of the mutex
cap_inode_getsecurity: use d_find_any_alias() instead of d_find_alias()
This empty file sneaked into the tree by mistake.
Remove it.
Fixes: 6eb61d587f ("ubifs: Pass struct ubifs_info to ubifs_assert()")
Signed-off-by: Richard Weinberger <richard@nod.at>
Merge yet more updates from Andrew Morton:
- the rest of MM
- various misc fixes and tweaks
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
mm: Change return type int to vm_fault_t for fault handlers
lib/fonts: convert comments to utf-8
s390: ebcdic: convert comments to UTF-8
treewide: convert ISO_8859-1 text comments to utf-8
drivers/gpu/drm/gma500/: change return type to vm_fault_t
docs/core-api: mm-api: add section about GFP flags
docs/mm: make GFP flags descriptions usable as kernel-doc
docs/core-api: split memory management API to a separate file
docs/core-api: move *{str,mem}dup* to "String Manipulation"
docs/core-api: kill trailing whitespace in kernel-api.rst
mm/util: add kernel-doc for kvfree
mm/util: make strndup_user description a kernel-doc comment
fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
treewide: correct "differenciate" and "instanciate" typos
fs/afs: use new return type vm_fault_t
drivers/hwtracing/intel_th/msu.c: change return type to vm_fault_t
mm: soft-offline: close the race against page allocation
mm: fix race on soft-offlining free huge pages
namei: allow restricted O_CREAT of FIFOs and regular files
hfs: prevent crash on exit from failed search
...
Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.
Ref-> commit 1c8f422059 ("mm: change return type to vm_fault_t")
The aim is to change the return type of finish_fault() and
handle_mm_fault() to vm_fault_t type. As part of that clean up return
type of all other recursively called functions have been changed to
vm_fault_t type.
The places from where handle_mm_fault() is getting invoked will be
change to vm_fault_t type but in a separate patch.
vmf_error() is the newly introduce inline function in 4.17-rc6.
[akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without CONFIG_MMU, we get a build warning:
fs/proc/vmcore.c:228:12: error: 'vmcoredd_mmap_dumps' defined but not used [-Werror=unused-function]
static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,
The function is only referenced from an #ifdef'ed caller, so
this uses the same #ifdef around it.
Link: http://lkml.kernel.org/r/20180525213526.2117790-1-arnd@arndb.de
Fixes: 7efe48df8a ("vmcore: append device dumps to vmcore as elf notes")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Ganesh Goudar <ganeshgr@chelsio.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct. For now, this is just documenting that the
function returns a VM_FAULT value rather than an errno. Once all
instances are converted, vm_fault_t will become a distinct type.
See 1c8f422059 ("mm: change return type to vm_fault_t") for reference.
Link: http://lkml.kernel.org/r/20180702152017.GA3780@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Disallows open of FIFOs or regular files not owned by the user in world
writable sticky directories, unless the owner is the same as that of the
directory or the file is opened without the O_CREAT flag. The purpose
is to make data spoofing attacks harder. This protection can be turned
on and off separately for FIFOs and regular files via sysctl, just like
the symlinks/hardlinks protection. This patch is based on Openwall's
"HARDEN_FIFO" feature by Solar Designer.
This is a brief list of old vulnerabilities that could have been prevented
by this feature, some of them even allow for privilege escalation:
CVE-2000-1134
CVE-2007-3852
CVE-2008-0525
CVE-2009-0416
CVE-2011-4834
CVE-2015-1838
CVE-2015-7442
CVE-2016-7489
This list is not meant to be complete. It's difficult to track down all
vulnerabilities of this kind because they were often reported without any
mention of this particular attack vector. In fact, before
hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
vehicle to exploit them.
[s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
[keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
[keescook@chromium.org: adjust commit subjet]
Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Solar Designer <solar@openwall.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hfs_find_exit() expects fd->bnode to be NULL after a search has failed.
hfs_brec_insert() may instead set it to an error-valued pointer. Fix
this to prevent a crash.
Link: http://lkml.kernel.org/r/53d9749a029c41b4016c495fc5838c9dba3afc52.1530294815.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hfs_find_exit() expects fd->bnode to be NULL after a search has failed.
hfs_brec_insert() may instead set it to an error-valued pointer. Fix
this to prevent a crash.
Link: http://lkml.kernel.org/r/803590a35221fbf411b2c141419aea3233a6e990.1530294813.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An HFS+ filesystem can be mounted read-only without having a metadata
directory, which is needed to support hardlinks. But if the catalog
data is corrupted, a directory lookup may still find dentries claiming
to be hardlinks.
hfsplus_lookup() does check that ->hidden_dir is not NULL in such a
situation, but mistakenly does so after dereferencing it for the first
time. Reorder this check to prevent a crash.
This happens when looking up corrupted catalog data (dentry) on a
filesystem with no metadata directory (this could only ever happen on a
read-only mount). Wen Xu sent the replication steps in detail to the
fsdevel list: https://bugzilla.kernel.org/show_bug.cgi?id=200297
Link: http://lkml.kernel.org/r/20180712215344.q44dyrhymm4ajkao@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stable bufixes:
- v3.17+: Fix an off-by-one in bl_map_stripe()
- v4.9+: NFSv4 client live hangs after live data migration recovery
- v4.18+: xprtrdma: Fix disconnect regression
- v4.14+: Fix locking in pnfs_generic_recover_commit_reqs
- v4.9+: Fix a sleep in atomic context in nfs4_callback_sequence()
Features:
- Add support for asynchronous server-side COPY operations
Other bugfixes and cleanups:
- Optitmizations and fixes involving NFS v4.1 / pNFS layout handling
- Optimize lseek(fd, SEEK_CUR, 0) on directories to avoid locking
- Immediately reschedule writeback when the server replies with an error
- Fix excessive attribute revalidation in nfs_execute_ok()
- Add error checking to nfs_idmap_prepare_message()
- Use new vm_fault_t return type
- Return a delegation when reclaiming one that the server has recalled
- Referrals should inherit proto setting from parents
- Make rpc_auth_create_args a const
- Improvements to rpc_iostats tracking
- Fix a potential reference leak when there is an error processing a callback
- Fix rmdir / mkdir / rename nlink accounting
- Fix updating inode change attribute
- Fix error handling in nfsn4_sp4_select_mode()
- Use an appropriate work queue for direct-write completion
- Don't busy wait if NFSv4 session draining is interrupted
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlt/CYIACgkQ18tUv7Cl
QOu8gBAA0xQWmgRoG6oIdYUxvgYqhuJmMqC4SU1E6mCJ93xEuUSvEFw51X+84KCt
r6UPkp/bKiVe3EIinKTplIzuxgggXNG0EQmO46FYNTl7nqpN85ffLsQoWsiD23fp
j8afqKPFR2zfhHXLKQC7k1oiOpwGqJ+EJWgIW4llE80pSNaErEoEaDqSPds5thMN
dHEjjLr8ef6cbBux6sSPjwWGNbE82uoSu3MDuV2+e62hpGkgvuEYo1vyE6ujeZW5
MUsmw+AHZkwro0msTtNBOHcPZAS0q/2UMPzl1tsDeCWNl2mugqZ6szQLSS2AThKq
Zr6iK9Q5dWjJfrQHcjRMnYJB+SCX1SfPA7ASuU34opwcWPjecbS9Q92BNTByQYwN
o9ngs2K0mZfqpYESMAmf7Il134cCBrtEp3skGko2KopJcYcE5YUFhdKihi1yQQjU
UbOOubMpQk8vY9DpDCAwGbICKwUZwGvq27uuUWL20kFVDb1+jvfHwcV4KjRAJo/E
J9aFtU+qOh4rMPMnYlEVZcAZBGfenlv/DmBl1upRpjzBkteUpUJsAbCmGyAk4616
3RECasehgsjNCQpFIhv3FpUkWzP5jt0T3gRr1NeY6WKJZwYnHEJr9PtapS+EIsCT
tB5DvvaJqFtuHFOxzn+KlGaxdSodHF7klOq7NM3AC0cX8AkWqaU=
=8+9t
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"These patches include adding async support for the v4.2 COPY
operation. I think Bruce is planning to send the server patches for
the next release, but I figured we could get the client side out of
the way now since it's been in my tree for a while. This shouldn't
cause any problems, since the server will still respond with
synchronous copies even if the client requests async.
Features:
- Add support for asynchronous server-side COPY operations
Stable bufixes:
- Fix an off-by-one in bl_map_stripe() (v3.17+)
- NFSv4 client live hangs after live data migration recovery (v4.9+)
- xprtrdma: Fix disconnect regression (v4.18+)
- Fix locking in pnfs_generic_recover_commit_reqs (v4.14+)
- Fix a sleep in atomic context in nfs4_callback_sequence() (v4.9+)
Other bugfixes and cleanups:
- Optimizations and fixes involving NFS v4.1 / pNFS layout handling
- Optimize lseek(fd, SEEK_CUR, 0) on directories to avoid locking
- Immediately reschedule writeback when the server replies with an
error
- Fix excessive attribute revalidation in nfs_execute_ok()
- Add error checking to nfs_idmap_prepare_message()
- Use new vm_fault_t return type
- Return a delegation when reclaiming one that the server has
recalled
- Referrals should inherit proto setting from parents
- Make rpc_auth_create_args a const
- Improvements to rpc_iostats tracking
- Fix a potential reference leak when there is an error processing a
callback
- Fix rmdir / mkdir / rename nlink accounting
- Fix updating inode change attribute
- Fix error handling in nfsn4_sp4_select_mode()
- Use an appropriate work queue for direct-write completion
- Don't busy wait if NFSv4 session draining is interrupted"
* tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (54 commits)
pNFS: Remove unwanted optimisation of layoutget
pNFS/flexfiles: ff_layout_pg_init_read should exit on error
pNFS: Treat RECALLCONFLICT like DELAY...
pNFS: When updating the stateid in layoutreturn, also update the recall range
NFSv4: Fix a sleep in atomic context in nfs4_callback_sequence()
NFSv4: Fix locking in pnfs_generic_recover_commit_reqs
NFSv4: Fix a typo in nfs4_init_channel_attrs()
NFSv4: Don't busy wait if NFSv4 session draining is interrupted
NFS recover from destination server reboot for copies
NFS add a simple sync nfs4_proc_commit after async COPY
NFS handle COPY ERR_OFFLOAD_NO_REQS
NFS send OFFLOAD_CANCEL when COPY killed
NFS export nfs4_async_handle_error
NFS handle COPY reply CB_OFFLOAD call race
NFS add support for asynchronous COPY
NFS COPY xdr handle async reply
NFS OFFLOAD_CANCEL xdr
NFS CB_OFFLOAD xdr
NFS: Use an appropriate work queue for direct-write completion
NFSv4: Fix error handling in nfs4_sp4_select_mode()
...
missing Chuck's fixes for the problem with callbacks over GSS from
multi-homed servers, and a smaller fix from Laura Abbott.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbftA8AAoJECebzXlCjuG+QPMQALieEKkX0YoqRhPz5G+RrWFy
KgOBFAoiRcjFQD6wMt9FzD6qYEZqSJ+I2b+K5N3BkdyDDQu845iD0wK0zBGhMgLm
7ith85nphIMbe18+5jPorqAsI9RlfBQjiSGw1MEx5dicLQQzTObHL5q+l5jcWna4
jWS3yUKv1URpOsR1hIryw74ktSnhuH8n//zmntw8aWrCkq3hnXOZK/agtYxZ7Viv
V3kiQsiNpL2FPRcHN7ejhLUTnRkkuD2iYKrzP/SpTT/JfdNEUXlMhKkAySogNpus
nvR9X7hwta8Lgrt7PSB9ibFTXtCupmuICg5mbDWy6nXea2NvpB01QhnTzrlX17Eh
Yfk/18z95b6Qs1v4m3SI8ESmyc6l5dMZozLudtHzifyCqooWZriEhCR1PlQfQ/FJ
4cYQ8U/qiMiZIJXL7N2wpSoSaWR5bqU1rXen29Np1WEDkiv4Nf5u2fsCXzv0ZH2C
ReWpNkbnNxsNiKpp4geBZtlcSEU1pk+1PqE0MagTdBV3iptiUHRSP4jR7qLnc0zT
J1lCvU7Fodnt9vNSxMpt2Jd6XxQ6xtx7n6aMQAiYFnXDs+hP2hPnJVCScnYW3L6R
2r1sHRKKeoOzCJ2thw+zu4lOwMm7WPkJPWAYfv90reWkiKoy2vG0S9P7wsNGoJuW
fuEjB2b9pow1Ffynat6q
=JnLK
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Chuck Lever fixed a problem with NFSv4.0 callbacks over GSS from
multi-homed servers.
The only new feature is a minor bit of protocol (change_attr_type)
which the client doesn't even use yet.
Other than that, various bugfixes and cleanup"
* tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linux: (27 commits)
sunrpc: Add comment defining gssd upcall API keywords
nfsd: Remove callback_cred
nfsd: Use correct credential for NFSv4.0 callback with GSS
sunrpc: Extract target name into svc_cred
sunrpc: Enable the kernel to specify the hostname part of service principals
sunrpc: Don't use stack buffer with scatterlist
rpc: remove unneeded variable 'ret' in rdma_listen_handler
nfsd: use true and false for boolean values
nfsd: constify write_op[]
fs/nfsd: Delete invalid assignment statements in nfsd4_decode_exchange_id
NFSD: Handle full-length symlinks
NFSD: Refactor the generic write vector fill helper
svcrdma: Clean up Read chunk path
svcrdma: Avoid releasing a page in svc_xprt_release()
nfsd: Mark expected switch fall-through
sunrpc: remove redundant variables 'checksumlen','blocksize' and 'data'
nfsd: fix leaked file lock with nfs exported overlayfs
nfsd: don't advertise a SCSI layout for an unsupported request_queue
nfsd: fix corrupted reply to badly ordered compound
nfsd: clarify check_op_ordering
...
- Year 2038 preparations
- New UBI feature to skip CRC checks of static volumes
- A new Kconfig option to disable xattrs in UBIFS
- Lots of fixes in UBIFS, found by our new test framework
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAlt9zFkWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7waiuD/oDYzerOLe0R7n2sRT9zjtY8kCx
LuizRvDYUlmMynI6EVahfyJy2IixcDmXOklGdxJqUkN5igDC/FORWdQjv2X9y56d
qZ2dlS8aBvI0ZKBG2ew4VP1H67CXtCw8H9fE32SGotPmxKRUQqt2vKqo+vgQfapH
eSVPrOaoqoRh+/ieumYXsvFdEUWpa66G3tVMFe4znu+kYRBbGzSszxpuq1ukIls2
P9wewqbWAZpqn+n9A9+RBIv81g+jH87/acfjK2L7/lT9wsFO7BQGKi373dPbnTa5
9WsjGEd+Gt0kb4Kjh5QegY97bPqWjmaMj1BLqeQVpSbQqpzkiFMf9GW5+h3XqAfO
hM1zzgONZMxHdZSKH0bWzIRQbvU6v0d9C4J/elfFuH9ke2XscrxjOtZZQbtbGeYj
tE7FWoZnB8euXubulGAUBKofzWe+gItBe9+iA29EBETNOemrJyHyKjO0Fe9ze5p2
bfVFvN62kHz4ZCJoinwO/OpXnCuA91xrVocLOOIreb4dkZ/kqP+YZWFf70FcE1o5
sPAbAUu+hfb2LbpktEdZHHbhoupfCnJokzfboJMX0NWKRtFXJDONjogJYTFUjrpW
eXS+55+WFHoLWtx9J2IVmcb3cQrj/W/4J83kSg99cUkVjGpil50zmtzhq9bHzsLc
wazngueP7kW2l9bSSg==
=gCyp
-----END PGP SIGNATURE-----
Merge tag 'upstream-4.19-rc1' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS updates from Richard Weinberger:
- Year 2038 preparations
- New UBI feature to skip CRC checks of static volumes
- A new Kconfig option to disable xattrs in UBIFS
- Lots of fixes in UBIFS, found by our new test framework
* tag 'upstream-4.19-rc1' of git://git.infradead.org/linux-ubifs: (21 commits)
ubifs: Set default assert action to read-only
ubifs: Allow setting assert action as mount parameter
ubifs: Rework ubifs_assert()
ubifs: Pass struct ubifs_info to ubifs_assert()
ubifs: Turn two ubifs_assert() into a WARN_ON()
ubi: expose the volume CRC check skip flag
ubi: provide a way to skip CRC checks
ubifs: Use kmalloc_array()
ubifs: Check data node size before truncate
Revert "UBIFS: Fix potential integer overflow in allocation"
ubifs: Add comment on c->commit_sem
ubifs: introduce Kconfig symbol for xattr support
ubifs: use swap macro in swap_dirty_idx
ubifs: tnc: use monotonic znode timestamp
ubifs: use timespec64 for inode timestamps
ubifs: xattr: Don't operate on deleted inodes
ubifs: gc: Fix typo
ubifs: Fix memory leak in lprobs self-check
ubi: Initialize Fastmap checkmapping correctly
ubifs: Fix synced_i_size calculation for xattr inodes
...
The kmalloc was not being checked - if it fails issue a warning
and return -ENOMEM to the caller.
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Fixes: b8da344b74 ("cifs: dynamic allocation of ntlmssp blob")
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
cc: Stable <stable@vger.kernel.org>`
Some SMB2/3 servers, Win2016 but possibly others too, adds padding
not only between PDUs in a compound but also to the final PDU.
This padding extends the PDU to a multiple of 8 bytes.
Check if the unexpected length looks like this might be the case
and avoid triggering the log messages for :
"SMB2 server sent bad RFC1001 len %d not %d\n"
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When running in a container with a user namespace, if you call getxattr
with name = "system.posix_acl_access" and size % 8 != 4, then getxattr
silently skips the user namespace fixup that it normally does resulting in
un-fixed-up data being returned.
This is caused by posix_acl_fix_xattr_to_user() being passed the total
buffer size and not the actual size of the xattr as returned by
vfs_getxattr().
This commit passes the actual length of the xattr as returned by
vfs_getxattr() down.
A reproducer for the issue is:
touch acl_posix
setfacl -m user:0:rwx acl_posix
and the compile:
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <attr/xattr.h>
/* Run in user namespace with nsuid 0 mapped to uid != 0 on the host. */
int main(int argc, void **argv)
{
ssize_t ret1, ret2;
char buf1[128], buf2[132];
int fret = EXIT_SUCCESS;
char *file;
if (argc < 2) {
fprintf(stderr,
"Please specify a file with "
"\"system.posix_acl_access\" permissions set\n");
_exit(EXIT_FAILURE);
}
file = argv[1];
ret1 = getxattr(file, "system.posix_acl_access",
buf1, sizeof(buf1));
if (ret1 < 0) {
fprintf(stderr, "%s - Failed to retrieve "
"\"system.posix_acl_access\" "
"from \"%s\"\n", strerror(errno), file);
_exit(EXIT_FAILURE);
}
ret2 = getxattr(file, "system.posix_acl_access",
buf2, sizeof(buf2));
if (ret2 < 0) {
fprintf(stderr, "%s - Failed to retrieve "
"\"system.posix_acl_access\" "
"from \"%s\"\n", strerror(errno), file);
_exit(EXIT_FAILURE);
}
if (ret1 != ret2) {
fprintf(stderr, "The value of \"system.posix_acl_"
"access\" for file \"%s\" changed "
"between two successive calls\n", file);
_exit(EXIT_FAILURE);
}
for (ssize_t i = 0; i < ret2; i++) {
if (buf1[i] == buf2[i])
continue;
fprintf(stderr,
"Unexpected different in byte %zd: "
"%02x != %02x\n", i, buf1[i], buf2[i]);
fret = EXIT_FAILURE;
}
if (fret == EXIT_SUCCESS)
fprintf(stderr, "Test passed\n");
else
fprintf(stderr, "Test failed\n");
_exit(fret);
}
and run:
./tester acl_posix
On a non-fixed up kernel this should return something like:
root@c1:/# ./t
Unexpected different in byte 16: ffffffa0 != 00
Unexpected different in byte 17: ffffff86 != 00
Unexpected different in byte 18: 01 != 00
and on a fixed kernel:
root@c1:~# ./t
Test passed
Cc: stable@vger.kernel.org
Fixes: 2f6f0654ab ("userns: Convert vfs posix_acl support to use kuids and kgids")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199945
Reported-by: Colin Watson <cjwatson@ubuntu.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Clean up: The global callback_cred is no longer used, so it can be
removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
I've had trouble when operating a multi-homed Linux NFS server with
Kerberos using NFSv4.0. Lately, I've seen my clients reporting
this (and then hanging):
May 9 11:43:26 manet kernel: NFS: NFSv4 callback contains invalid cred
The client-side commit f11b2a1cfb ("nfs4: copy acceptor name from
context to nfs_client") appears to be related, but I suspect this
problem has been going on for some time before that.
RFC 7530 Section 3.3.3 says:
> For Kerberos V5, nfs/hostname would be a server principal in the
> Kerberos Key Distribution Center database. This is the same
> principal the client acquired a GSS-API context for when it issued
> the SETCLIENTID operation ...
In other words, an NFSv4.0 client expects that the server will use
the same GSS principal for callback that the client used to
establish its lease. For example, if the client used the service
principal "nfs@server.domain" to establish its lease, the server
is required to use "nfs@server.domain" when performing NFSv4.0
callback operations.
The Linux NFS server currently does not. It uses a common service
principal for all callback connections. Sometimes this works as
expected, and other times -- for example, when the server is
accessible via multiple hostnames -- it won't work at all.
This patch scrapes the target name from the client credential,
and uses that for the NFSv4.0 callback credential. That should
be correct much more often.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
NFSv4.0 callback needs to know the GSS target name the client used
when it established its lease. That information is available from
the GSS context created by gssproxy. Make it available in each
svc_cred.
Note this will also give us access to the real target service
principal name (which is typically "nfs", but spec does not require
that).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In this round, we've tuned f2fs to improve general performance by serializing
block allocation and enhancing discard flows like fstrim which avoids user IO
contention. And we've added fsync_mode=nobarrier which gives an option to user
where it skips issuing cache_flush commands to underlying flash storage. And
there are many bug fixes related to fuzzed images, revoked atomic writes, quota
ops, and minor direct IO.
Enhancement:
- add fsync_mode=nobarrier which bypasses cache_flush command
- enhance the discarding flow which avoids user IOs and issues in LBA order
- readahead some encrypted blocks during GC
- enable in-memory inode checksum to verify the blocks if F2FS_CHECK_FS is set
- enhance nat_bits behavior
- set -o discard by default
- set REQ_RAHEAD to bio in ->readpages
Bug fixes:
- fix a corner case to corrupt atomic_writes revoking flow
- revisit i_gc_rwsem to fix race conditions
- fix some dio behaviors captured by xfstests
- correct handling errors given by quota-related failures
- add many sanity check flows to avoid fuzz test failures
- add more error number propagation to their callers
- fix several corner cases to continue fault injection w/ shutdown loop
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlt82U4ACgkQQBSofoJI
UNJTLQ/+PhewnNa5tDfUgWdFnUFz3h9/NcC677l0OplOOUNxA8iSa1xamlKf/nf9
sB5ey0I7oBF8zQGxfndhHQfi6fpfUcNMr14hm+TS/3+d54xLJmiVShD5fjNSV2vB
Ur0xoozuQDwYF1e3QKdBQjFqaCf78VheTr3aWxyv22/Sg+PYylZJ2K8rHTB7mGPU
UG0aRnKrP3FPRjL7Q3m0Vm6b6eZ5uNdNrFfjgn/8yuQQ9V197K8vwSbPAsR5/pOh
miCQXyM708NgEYJRWkWmC/rDSQdU0/h/mGnJWrBrbceW62QefGOgd2jcVfmthHJa
ZXpj+BEG5bYpCCxGxF6N+u0e28OKonCwO/uvL8YAd5icN7yXtsKzoF1CCuXxOYf1
9K5SMylCTSyrs/+LV8CJoT2ya8w0l0w+R/txUYn8UT+4AgqU+chS2kJeXqw9tcHB
WLFs/rnAyofWCI/8frVBmJY+zA1ZZvTqs/lmVYrtJUkiOcMTq34WICBUAEFKV452
BM5dcu21bSIkapYispEt4Rr7o4P4HHMQ+N1i2yUZMFCz5T0RyzdybeS5THk2yVzd
L0kxfYU+zHigNX51ez8+Z7DyDLDBp6jkD0e66x73bUK9TGPH+ZbnAL6gwLikmD3M
+VxYl5nyW/3bxx1HdfK1Xwd4/wYMNBmdtn5NC50oZ+jpB0h8YCE=
=zgbL
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've tuned f2fs to improve general performance by
serializing block allocation and enhancing discard flows like fstrim
which avoids user IO contention. And we've added fsync_mode=nobarrier
which gives an option to user where it skips issuing cache_flush
commands to underlying flash storage. And there are many bug fixes
related to fuzzed images, revoked atomic writes, quota ops, and minor
direct IO.
Enhancements:
- add fsync_mode=nobarrier which bypasses cache_flush command
- enhance the discarding flow which avoids user IOs and issues in
LBA order
- readahead some encrypted blocks during GC
- enable in-memory inode checksum to verify the blocks if
F2FS_CHECK_FS is set
- enhance nat_bits behavior
- set -o discard by default
- set REQ_RAHEAD to bio in ->readpages
Bug fixes:
- fix a corner case to corrupt atomic_writes revoking flow
- revisit i_gc_rwsem to fix race conditions
- fix some dio behaviors captured by xfstests
- correct handling errors given by quota-related failures
- add many sanity check flows to avoid fuzz test failures
- add more error number propagation to their callers
- fix several corner cases to continue fault injection w/ shutdown
loop"
* tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (89 commits)
f2fs: readahead encrypted block during GC
f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc
f2fs: fix performance issue observed with multi-thread sequential read
f2fs: fix to skip verifying block address for non-regular inode
f2fs: rework fault injection handling to avoid a warning
f2fs: support fault_type mount option
f2fs: fix to return success when trimming meta area
f2fs: fix use-after-free of dicard command entry
f2fs: support discard submission error injection
f2fs: split discard command in prior to block layer
f2fs: wake up gc thread immediately when gc_urgent is set
f2fs: fix incorrect range->len in f2fs_trim_fs()
f2fs: refresh recent accessed nat entry in lru list
f2fs: fix avoid race between truncate and background GC
f2fs: avoid race between zero_range and background GC
f2fs: fix to do sanity check with block address in main area v2
f2fs: fix to do sanity check with inline flags
f2fs: fix to reset i_gc_failures correctly
f2fs: fix invalid memory access
f2fs: fix to avoid broken of dnode block list
...
...otherwise there will be list corruption due to inode_sb_list_add() being
called for inode already on the sb list.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: e950564b97 ("vfs: don't evict uninitialized inode")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge more updates from Andrew Morton:
- the rest of MM
- procfs updates
- various misc things
- more y2038 fixes
- get_maintainer updates
- lib/ updates
- checkpatch updates
- various epoll updates
- autofs updates
- hfsplus
- some reiserfs work
- fatfs updates
- signal.c cleanups
- ipc/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits)
ipc/util.c: update return value of ipc_getref from int to bool
ipc/util.c: further variable name cleanups
ipc: simplify ipc initialization
ipc: get rid of ids->tables_initialized hack
lib/rhashtable: guarantee initial hashtable allocation
lib/rhashtable: simplify bucket_table_alloc()
ipc: drop ipc_lock()
ipc/util.c: correct comment in ipc_obtain_object_check
ipc: rename ipcctl_pre_down_nolock()
ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
ipc: reorganize initialization of kern_ipc_perm.seq
ipc: compute kern_ipc_perm.id under the ipc lock
init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
adfs: use timespec64 for time conversion
kernel/sysctl.c: fix typos in comments
drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
fork: don't copy inconsistent signal handler state to child
signal: make get_signal() return bool
signal: make sigkill_pending() return bool
...
get_seconds() is deprecated in favor of ktime_get_real_seconds(), which
returns a 64-bit timestamp.
In the SYSV file system, the superblock timestamp is only 32 bits wide,
and it is used to check whether a file system is clean, so the best
solution seems to be to force a wraparound and explicitly convert it to an
unsigned 32-bit value.
This is independent of the inode timestamps that are also 32-bit wide on
disk and that come from current_time().
Link: http://lkml.kernel.org/r/20180713145236.3152513-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We just truncate the seconds to 32-bit in one place now, so this can
trivially be converted over to using timespec64 consistently.
Link: http://lkml.kernel.org/r/20180620100133.4035614-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we pass down 64-bit timestamps from VFS, we just need to convert
that correctly into on-disk timestamps. To make that work correctly, this
changes the last use of time_to_tm() in the kernel to time64_to_tm(),
which also lets use remove that deprecated interfaces.
Similarly, the time_t use in fat_time_fat2unix() truncates the timestamp
on the way in, which can be avoided by using types that are wide enough to
hold the intermediate values during the conversion.
[hirofumi@mail.parknet.co.jp: remove useless temporary variable, needless long long]
Link: http://lkml.kernel.org/r/20180619153646.3637529-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On corrupted FATfs may have invalid ->i_start. To handle it, this checks
->i_start before using, and return proper error code.
Link: http://lkml.kernel.org/r/87o9f8y1t5.fsf_-_@mail.parknet.co.jp
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Tested-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes the following issues:
- When a buffer size is supplied to reiserfs_listxattr() such that each
individual name fits, but the concatenation of all names doesn't fit,
reiserfs_listxattr() overflows the supplied buffer. This leads to a
kernel heap overflow (verified using KASAN) followed by an out-of-bounds
usercopy and is therefore a security bug.
- When a buffer size is supplied to reiserfs_listxattr() such that a
name doesn't fit, -ERANGE should be returned. But reiserfs instead just
truncates the list of names; I have verified that if the only xattr on a
file has a longer name than the supplied buffer length, listxattr()
incorrectly returns zero.
With my patch applied, -ERANGE is returned in both cases and the memory
corruption doesn't happen anymore.
Credit for making me clean this code up a bit goes to Al Viro, who pointed
out that the ->actor calling convention is suboptimal and should be
changed.
Link: http://lkml.kernel.org/r/20180802151539.5373-1-jannh@google.com
Fixes: 48b32a3553 ("reiserfs: use generic xattr handlers")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This uses the deprecated time_t type but is write-only, and could be
removed, but as Jeff explains, having a timestamp can be usefule for
post-mortem analysis in crash dumps.
In order to remove one of the last instances of time_t, this changes the
type to time64_t, same as j_trans_start_time.
Link: http://lkml.kernel.org/r/20180622133315.221210-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before linux-2.4.6, print_time() was used to pretty-print an inode time
when running reiserfs in user space, after that it has become obsolete and
is still a bit incorrect: It behaves differently on 32-bit and 64-bit
machines, and uses a static buffer to hold a string, which could lead to
undefined behavior if we ever called this from multiple places
simultaneously.
Since we always want to treat the timestamps as 'unsigned' anyway, simply
printing them as an integer is both simpler and safer while avoiding the
deprecated time_t type.
Link: http://lkml.kernel.org/r/20180620142522.27639-3-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using CLOCK_REALTIME time_t timestamps breaks on 32-bit systems in 2038,
and gives surprising results with a concurrent settimeofday().
This changes the reiserfs journal timestamps to use ktime_get_seconds()
instead, which makes it use a 64-bit CLOCK_MONOTONIC stamp.
In the procfs output, the monotonic timestamp needs to be converted back
to CLOCK_REALTIME to keep the existing ABI.
Link: http://lkml.kernel.org/r/20180620142522.27639-2-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The HFS+ Access Control Lists have not worked at all for the past five
years, and nobody seems to have noticed. Besides, POSIX draft ACLs are
not compatible with MacOS. Drop the feature entirely.
Link: http://lkml.kernel.org/r/20180714190608.wtnmmtjqeyladkut@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Files created under macOS cannot be opened under linux if their names
contain Korean characters, and vice versa.
The Korean alphabet is special because its normalization is done without a
table. The module deals with it correctly when composing, but forgets
about it for the decomposition.
Fix this using the Hangul decomposition function provided in the Unicode
Standard. The code fits a bit awkwardly because it requires a buffer,
while all the other normalizations are returned as pointers to the
decomposition table. This is actually also a bug because reordering may
still be needed, but for now leave it as it is.
The patch will cause trouble for Hangul filenames already created by the
module in the past. This shouldn't really be concern because its main
purpose was always sharing with macOS. If a user actually needs to access
such a file the nodecompose mount option should be enough.
Link: http://lkml.kernel.org/r/20180717220951.p6qqrgautc4pxvzu@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reported-by: Ting-Chang Hou <tchou@synology.com>
Tested-by: Ting-Chang Hou <tchou@synology.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After an extent is removed from the extent tree, the corresponding bits
are also cleared from the block allocation file. This is currently done
without releasing the tree lock.
The problem is that the allocation file has extents of its own; if it is
fragmented enough, some of them may be in the extent tree as well, and
hfsplus_get_block() will try to take the lock again.
To avoid deadlock, only hold the extent tree lock during the actual tree
operations.
Link: http://lkml.kernel.org/r/20180709202549.auxwkb6memlegb4a@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
syzbot is reporting NULL pointer dereference at mount_fs() [1]. This is
because hfsplus_fill_super() is by error returning 0 when
hfsplus_fill_super() detected invalid filesystem image, and mount_bdev()
is returning NULL because dget(s->s_root) == NULL if s->s_root == NULL,
and mount_fs() is accessing root->d_sb because IS_ERR(root) == false if
root == NULL. Fix this by returning -EINVAL when hfsplus_fill_super()
detected invalid filesystem image.
[1] https://syzkaller.appspot.com/bug?id=21acb6850cecbc960c927229e597158cf35f33d0
Link: http://lkml.kernel.org/r/d83ce31a-874c-dd5b-f790-41405983a5be@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+01ffaf5d9568dd1609f7@syzkaller.appspotmail.com>
Reviewed-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new return type vm_fault_t for page_mkwrite handler.
Link: http://lkml.kernel.org/r/1529555928-2411-1-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The mount time field in the superblock uses a 64-bit timestamp, but
calling get_seconds() may truncate the current time to 32 bits.
This changes it to ktime_get_real_seconds() to avoid the potential
overflow.
Link: http://lkml.kernel.org/r/20180620075041.4154396-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The userspace automount(8) daemon is meant to perform a forced expire when
sent a SIGUSR2.
But since the expiration is routed through the kernel and the kernel
doesn't send an expire request if the mount is busy this hasn't worked at
least since autofs version 5.
Add an AUTOFS_EXP_FORCED flag to allow implemention of the feature and
bump the protocol version so user space can check if it's implemented if
needed.
Link: http://lkml.kernel.org/r/152937734715.21213.6594007182776598970.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the usage of the expire flags consistent by naming the expire flags
the same as it is named in the version 5 miscelaneous ioctl parameters and
only check the bit flags when needed.
Link: http://lkml.kernel.org/r/152937734046.21213.9454131988766280028.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The expire flag AUTOFS_EXP_LEAVES is cleared before the second call to
should_expire() in autofs_expire_indirect() but the parameter passed in
the second call is incorrect.
Fortunately AUTOFS_EXP_LEAVES expire flag has not been used for a long
time but might be needed in the future so fix it rather than remove the
expire leaves functionality.
Link: http://lkml.kernel.org/r/152937732410.21213.7447294898147765076.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The global variable "now" in fs/autofs/expire.c is used in an inconsistent
way, sometimes using jiffies directly, and sometimes using the "now"
variable, and setting it isn't done consistently either.
But the autofs dentry info last_used field is only updated during path
walks or during expire so jiffies can be used directly and the global
variable "now" removed.
Link: http://lkml.kernel.org/r/152937731702.21213.7371321165189170865.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Depending on how it is configured the autofs user space daemon can leave
in use mounts mounted at exit and re-connect to them at start up. But for
this to work best the state of the autofs file system needs to be left
intact over the restart.
Also, at system shutdown, mounts in an autofs file system might be
umounted exposing a mount point trigger for which subsequent access can
lead to a hang. So recent versions of automount(8) now does its best to
set autofs file system mounts catatonic at shutdown.
When autofs file system mounts are catatonic it's currently possible to
create and remove directories and symlinks which can be a problem at
restart, as described above.
So return EACCES in the directory, symlink and unlink methods if the
autofs file system is catatonic.
Link: http://lkml.kernel.org/r/152902119090.4144.9561910674530214291.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of having each caller pass the rdllink explicitly, just have
ep_is_linked() pass it while the callers just need the epi pointer. This
helper is all about the rdllink, and this change, furthermore, improves
the function's self documentation.
Link: http://lkml.kernel.org/r/20180727053432.16679-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similar to other calls, ep_poll() is not called with interrupts disabled,
and we can therefore avoid the irq save/restore dance and just disable
local irqs. In fact, the call should never be called in irq context at
all, considering that the only path is
epoll_wait(2) -> do_epoll_wait() -> ep_poll().
When running on a 2 socket 40-core (ht) IvyBridge a common pipe based
epoll_wait(2) microbenchmark, the following performance improvements are
seen:
# threads vanilla dirty
1 1805587 2106412
2 1854064 2090762
4 1805484 2017436
8 1751222 1974475
16 1725299 1962104
32 1378463 1571233
64 787368 900784
Which is a pretty constantly near 15%.
Also add a lockdep check such that we detect any mischief before
deadlocking.
Link: http://lkml.kernel.org/r/20180727053432.16679-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... 'tis easier on the eye.
[akpm@linux-foundation.org: use inlines rather than macros]
Link: http://lkml.kernel.org/r/20180725185620.11020-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sprinkle lockdep_assert_irqs_enabled() checks in the functions that do not
save and restore interrupts when dealing with the ep->wq.lock. These are
ep_scan_ready_list() and those called by epoll_ctl(): ep_insert, ep_modify
and ep_remove.
[akpm@linux-foundation.org: remove too-obvious comments]
Link: http://lkml.kernel.org/r/20180721183127.3busfa335zlcjeox@linux-r8p5
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both functions are similar to the context of ep_modify(), called via
epoll_ctl(2). Just like ep_modify(), saving and restoring interrupts is
an overkill in these calls as it will never be called with irqs disabled.
While ep_remove() can be called directly from EPOLL_CTL_DEL, it can also
be called when releasing the file, but this also complies with the above.
Link: http://lkml.kernel.org/r/20180720172956.2883-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "fs/epoll: loosen irq safety when possible".
Both patches replace saving+restoring interrupts when taking the ep->lock
(now the waitqueue lock), with just disabling local irqs. This shows
immediate performance benefits in patch 1 for an epoll workload running on
Xen. The main concern we need to have with this sort of changes in epoll
is the ep_poll_callback() which is passed to the wait queue wakeup and is
done very often under irq context, this patch does not touch this call.
Patches have been tested pretty heavily with the customer workload,
microbenchmarks, ltp testcases and two high level workloads that use epoll
under the hood: nginx and libevent benchmarks.
This patch (of 2):
Saving and restoring interrupts in ep_scan_ready_list() is an
overkill as it is never called with interrupts disabled. Loosen
this to simply disabling local irqs such that archs where managing
irqs is expensive or virtual environments. This patch yields
some throughput improvements on a workload that is epoll intensive
running on a single Xen DomU.
1 Job 7500 --> 8800 enq/s (+17%)
2 Jobs 14000 --> 15200 enq/s (+8%)
3 Jobs 20500 --> 22300 enq/s (+8%)
4 Jobs 25000 --> 28000 enq/s (+8-12)%
On bare metal:
For a 2-socket 40-core (ht) IvyBridge on a few workloads, unfortunately I
don't have a xen environment and the results for Xen I do have (which
numbers are in patch 1) I don't have the actual workload, so cannot
compare them directly.
1) Different configurations were used for a epoll_wait (pipes io)
microbench (http://linux-scalability.org/epoll/epoll-test.c) and shows
around a 7-10% improvement in overall total number of times the
epoll_wait() loops when using both regular and nested epolls, so very
raw numbers, but measurable nonetheless.
# threads vanilla dirty
1 1677717 1805587
2 1660510 1854064
4 1610184 1805484
8 1577696 1751222
16 1568837 1725299
32 1291532 1378463
64 752584 787368
Note that stddev is pretty small.
2) Another pipe test, which shows no real measurable improvement.
(http://www.xmailserver.org/linux-patches/pipetest.c)
Link: http://lkml.kernel.org/r/20180720172956.2883-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The userfaultfd code currently uses the unlocked waitqueue helpers for
managing fault_wqh, but instead of holding the waitqueue lock for this
waitqueue around these calls, it the waitqueue lock of
fault_pending_wq, which is a different waitqueue instance. Given that
the waitqueue is not exposed to the rest of the kernel this actually
works ok at the moment, but prevents the userfaultfd locking rules from
being enforced using lockdep.
Switch to the internally locked waitqueue helpers instead. This means
that the lock inside fault_wqh now nests inside the fault_pending_wqh
lock, but that's not a problem since it was entirely unused before.
[hch@lst.de: slight changelog updates]
[rppt@linux.vnet.ibm.com: spotted changelog spellos]
Link: http://lkml.kernel.org/r/20171214152344.6880-3-hch@lst.de
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "waitqueue lockdep annotation", v3.
This series adds a strategic lockdep_assert_held to __wake_up_common to
ensure callers really do hold the wait_queue_head lock when calling the
unlocked wake_up variants. It turns out epoll did not do this for a
fairly common path (hit all the time by systemd during bootup), so the
second patch fixed this instance as well.
This patch (of 3):
The epoll code currently uses the unlocked waitqueue helpers for managing
ep->wq, but instead of holding the waitqueue lock around these calls, it
uses its own ep->lock spinlock. Given that the waitqueue is not exposed
to the rest of the kernel this actually works ok at the moment, but
prevents the epoll locking rules from being enforced using lockdep.
Remove ep->lock and use the waitqueue lock to not only reduce the size of
struct eventpoll but also to make sure we can assert locking invariants in
the waitqueue code.
Link: http://lkml.kernel.org/r/20171214152344.6880-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Baron <jbaron@akamai.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vmcoreinfo information is useful for runtime debugging tools, not just
for crash dumps. A lot of this information can be determined by other
means, but this is much more convenient, and it only adds a page at most
to the file.
Link: http://lkml.kernel.org/r/fddbcd08eed76344863303878b12de1c1e2a04b6.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current code does a full search of the segment list every time for
every page. This is wasteful, since it's almost certain that the next
page will be in the same segment. Instead, check if the previous segment
covers the current page before doing the list search.
Link: http://lkml.kernel.org/r/fd346c11090cf93d867e01b8d73a6567c5ac6361.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the ELF file header, program headers, and note segment are
allocated all at once, in some icky code dating back to 2.3. Programs
tend to read the file header, then the program headers, then the note
segment, all separately, so this is a waste of effort. It's cleaner and
more efficient to handle the three separately.
Link: http://lkml.kernel.org/r/19c92cbad0e11f6103ff3274b2e7a7e51a1eb74b.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we're using an rwsem, we can hold it during the entirety of
read_kcore() and have a common return path. This is preparation for the
next change.
[akpm@linux-foundation.org: fix locking bug reported by Tetsuo Handa]
Link: http://lkml.kernel.org/r/d7cfbc1e8a76616f3b699eaff9df0a2730380534.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a theoretical race condition that will cause /proc/kcore to miss
a memory hotplug event:
CPU0 CPU1
// hotplug event 1
kcore_need_update = 1
open_kcore() open_kcore()
kcore_update_ram() kcore_update_ram()
// Walk RAM // Walk RAM
__kcore_update_ram() __kcore_update_ram()
kcore_need_update = 0
// hotplug event 2
kcore_need_update = 1
kcore_need_update = 0
Note that CPU1 set up the RAM kcore entries with the state after hotplug
event 1 but cleared the flag for hotplug event 2. The RAM entries will
therefore be stale until there is another hotplug event.
This is an extremely unlikely sequence of events, but the fix makes the
synchronization saner, anyways: we serialize the entire update sequence,
which means that whoever clears the flag will always succeed in replacing
the kcore list.
Link: http://lkml.kernel.org/r/6106c509998779730c12400c1b996425df7d7089.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now we only need kclist_lock from user context and at fs init time, and
the following changes need to sleep while holding the kclist_lock.
Link: http://lkml.kernel.org/r/521ba449ebe921d905177410fee9222d07882f0d.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memory hotplug notifier kcore_callback() only needs kclist_lock to
prevent races with __kcore_update_ram(), but we can easily eliminate that
race by using an atomic xchg() in __kcore_update_ram(). This is
preparation for converting kclist_lock to an rwsem.
Link: http://lkml.kernel.org/r/0a4bc89f4dbde8b5b2ea309f7b4fb6a85fe29df2.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "/proc/kcore improvements", v4.
This series makes a few improvements to /proc/kcore. It fixes a couple of
small issues in v3 but is otherwise the same. Patches 1, 2, and 3 are
prep patches. Patch 4 is a fix/cleanup. Patch 5 is another prep patch.
Patches 6 and 7 are optimizations to ->read(). Patch 8 makes it possible
to enable CRASH_CORE on any architecture, which is needed for patch 9.
Patch 9 adds vmcoreinfo to /proc/kcore.
This patch (of 9):
kclist_add() is only called at init time, so there's no point in grabbing
any locks. We're also going to replace the rwlock with a rwsem, which we
don't want to try grabbing during early boot.
While we're here, mark kclist_add() with __init so that we'll get a
warning if it's called from non-init code.
Link: http://lkml.kernel.org/r/98208db1faf167aa8b08eebfa968d95c70527739.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
Tested-by: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
elf_kcore_store_hdr() uses __pa() to find the physical address of
KCORE_RAM or KCORE_TEXT entries exported as program headers.
This trips CONFIG_DEBUG_VIRTUAL's checks, as the KCORE_TEXT entries are
not in the linear map.
Handle these two cases separately, using __pa_symbol() for the KCORE_TEXT
entries.
Link: http://lkml.kernel.org/r/20180711131944.15252-1-james.morse@arm.com
Signed-off-by: James Morse <james.morse@arm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct. For now, this is just documenting that the function
returns a VM_FAULT value rather than an errno. Once all instances are
converted, vm_fault_t will become a distinct type.
See 1c8f422059 ("mm: change return type to vm_fault_t") for reference.
Link: http://lkml.kernel.org/r/20180702153325.GA3875@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ganesh Goudar <ganeshgr@chelsio.com>
Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Number of CPUs is never high enough to force 64-bit arithmetic.
Save couple of bytes on x86_64.
Link: http://lkml.kernel.org/r/20180627200710.GC18434@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
->latency_record is defined as
struct latency_record[LT_SAVECOUNT];
so use the same macro whie iterating.
Link: http://lkml.kernel.org/r/20180627200534.GA18434@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Code checks if write is done by current to its own attributes.
For that get/put pair is unnecessary as it can be done under RCU.
Note: rcu_read_unlock() can be done even earlier since pointer to a task
is not dereferenced. It depends if /proc code should look scary or not:
rcu_read_lock();
task = pid_task(...);
rcu_read_unlock();
if (!task)
return -ESRCH;
if (task != current)
return -EACCESS:
P.S.: rename "length" variable. Code like this
length = -EINVAL;
should not exist.
Link: http://lkml.kernel.org/r/20180627200218.GF18113@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Readdir context is thread local, so ->pos is thread local,
move it out of readlock.
Link: http://lkml.kernel.org/r/20180627195339.GD18113@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_monotonic_boottime() is deprecated and uses the old timespec type.
Let's convert /proc/uptime to use ktime_get_boottime_ts64().
Link: http://lkml.kernel.org/r/20180620081746.282742-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
24074a35c5 ("proc: Make inline name size calculation automatic")
started to put PDE allocations into kmalloc-256 which is unnecessary as
~40 character names are very rare.
Put allocation back into kmalloc-192 cache for 64-bit non-debug builds.
Put BUILD_BUG_ON to know when PDE size has gotten out of control.
[adobriyan@gmail.com: fix BUILD_BUG_ON breakage on powerpc64]
Link: http://lkml.kernel.org/r/20180703191602.GA25521@avx2
Link: http://lkml.kernel.org/r/20180617215732.GA24688@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, percpu memory only exposes allocation and utilization
information via debugfs. This more or less is only really useful for
understanding the fragmentation and allocation information at a per-chunk
level with a few global counters. This is also gated behind a config.
BPF and cgroup, for example, have seen an increase in use causing
increased use of percpu memory. Let's make it easier for someone to
identify how much memory is being used.
This patch adds the "Percpu" stat to meminfo to more easily look up how
much percpu memory is in use. This number includes the cost for all
allocated backing pages and not just insight at the per a unit, per chunk
level. Metadata is excluded. I think excluding metadata is fair because
the backing memory scales with the numbere of cpus and can quickly
outweigh the metadata. It also makes this calculation light.
Link: http://lkml.kernel.org/r/20180807184723.74919-1-dennisszhou@gmail.com
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rather than in vm_area_alloc(). To ensure that the various oddball
stack-based vmas are in a good state. Some of the callers were zeroing
them out, others were not.
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The /proc/pid/smaps_rollup file is currently implemented via the
m_start/m_next/m_stop seq_file iterators shared with the other maps files,
that iterate over vma's. However, the rollup file doesn't print anything
for each vma, only accumulate the stats.
There are some issues with the current code as reported in [1] - the
accumulated stats can get skewed if seq_file start()/stop() op is called
multiple times, if show() is called multiple times, and after seeks to
non-zero position.
Patch [1] fixed those within existing design, but I believe it is
fundamentally wrong to expose the vma iterators to the seq_file mechanism
when smaps_rollup shows logically a single set of values for the whole
address space.
This patch thus refactors the code to provide a single "value" at offset
0, with vma iteration to gather the stats done internally. This fixes the
situations where results are skewed, and simplifies the code, especially
in show_smap(), at the expense of somewhat less code reuse.
[1] https://marc.info/?l=linux-mm&m=151927723128134&w=2
[vbabka@suse.c: use seq_file infrastructure]
Link: http://lkml.kernel.org/r/bf4525b0-fd5b-4c4c-2cb3-adee3dd95a48@suse.cz
Link: http://lkml.kernel.org/r/20180723111933.15443-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Daniel Colascione <dancol@google.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To prepare for handling /proc/pid/smaps_rollup differently from
/proc/pid/smaps factor out from show_smap() printing the parts of output
that are common for both variants, which is the bulk of the gathered
memory stats.
[vbabka@suse.cz: add const, per Alexey]
Link: http://lkml.kernel.org/r/b45f319f-cd04-337b-37f8-77f99786aa8a@suse.cz
Link: http://lkml.kernel.org/r/20180723111933.15443-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To prepare for handling /proc/pid/smaps_rollup differently from
/proc/pid/smaps factor out vma mem stats gathering from show_smap() - it
will be used by both.
Link: http://lkml.kernel.org/r/20180723111933.15443-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "cleanups and refactor of /proc/pid/smaps*".
The recent regression in /proc/pid/smaps made me look more into the code.
Especially the issues with smaps_rollup reported in [1] as explained in
Patch 4, which fixes them by refactoring the code. Patches 2 and 3 are
preparations for that. Patch 1 is me realizing that there's a lot of
boilerplate left from times where we tried (unsuccessfuly) to mark thread
stacks in the output.
Originally I had also plans to rework the translation from
/proc/pid/*maps* file offsets to the internal structures. Now the offset
means "vma number", which is not really stable (vma's can come and go
between read() calls) and there's an extra caching of last vma's address.
My idea was that offsets would be interpreted directly as addresses, which
would also allow meaningful seeks (see the ugly seek_to_smaps_entry() in
tools/testing/selftests/vm/mlock2.h). However loff_t is (signed) long
long so that might be insufficient somewhere for the unsigned long
addresses.
So the result is fixed issues with skewed /proc/pid/smaps_rollup results,
simpler smaps code, and a lot of unused code removed.
[1] https://marc.info/?l=linux-mm&m=151927723128134&w=2
This patch (of 4):
Commit b76437579d ("procfs: mark thread stack correctly in
proc/<pid>/maps") introduced differences between /proc/PID/maps and
/proc/PID/task/TID/maps to mark thread stacks properly, and this was
also done for smaps and numa_maps. However it didn't work properly and
was ultimately removed by commit b18cb64ead ("fs/proc: Stop trying to
report thread stacks").
Now the is_pid parameter for the related show_*() functions is unused
and we can remove it together with wrapper functions and ops structures
that differ for PID and TID cases only in this parameter.
Link: http://lkml.kernel.org/r/20180723111933.15443-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
autofs_sbi() does not check the superblock magic number to verify it has
been given an autofs super block.
Link: http://lkml.kernel.org/r/153475422934.17131.7563724552005298277.stgit@pluto.themaw.net
Reported-by: <syzbot+87c3c541582e56943277@syzkaller.appspotmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ida_alloc_max() matches what this driver wants to do. Also removes a
call to ida_pre_get(). We no longer need the protection of the mutex,
so convert pty_count to an atomic_t and remove the mutex entirely.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
This contains various bug fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3xvGwAKCRDh3BK/laaZ
PKECAP9qUpdtQ5RaIL/y9OGZzJLSZbBZuK3LGNY2u2B3EfrSjgEAvhkhXyOQgvVi
kgYLNszbg/C+w8U4Xc5GWB6cjNm6rwE=
=GJI7
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse update from Miklos Szeredi:
"Various bug fixes and cleanups"
* tag 'fuse-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: reduce allocation size for splice_write
fuse: use kvmalloc to allocate array of pipe_buffer structs.
fuse: convert last timespec use to timespec64
fs: fuse: Adding new return type vm_fault_t
fuse: simplify fuse_abort_conn()
fuse: Add missed unlock_page() to fuse_readpages_fill()
fuse: Don't access pipe->buffers without pipe_lock()
fuse: fix initial parallel dirops
fuse: Fix oops at process_init_reply()
fuse: umount should wait for all requests
fuse: fix unlocked access to processing queue
fuse: fix double request_end()
This contains two new features:
1) Stack file operations: this allows removal of several hacks from the
VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
2) Metadata only copy-up: when file is on lower layer and only metadata is
modified (except size) then only copy up the metadata and continue to
use the data from the lower file.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3srhAAKCRDh3BK/laaZ
PC6tAQCP+KklcN+TvNp502f+O/kATahSpgnun4NY1/p4I8JV+AEAzdlkTN3+MiAO
fn9brN6mBK7h59DO3hqedPLJy2vrgwg=
=QDXH
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
"This contains two new features:
- Stack file operations: this allows removal of several hacks from
the VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
- Metadata only copy-up: when file is on lower layer and only
metadata is modified (except size) then only copy up the metadata
and continue to use the data from the lower file"
* tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
ovl: Enable metadata only feature
ovl: Do not do metacopy only for ioctl modifying file attr
ovl: Do not do metadata only copy-up for truncate operation
ovl: add helper to force data copy-up
ovl: Check redirect on index as well
ovl: Set redirect on upper inode when it is linked
ovl: Set redirect on metacopy files upon rename
ovl: Do not set dentry type ORIGIN for broken hardlinks
ovl: Add an inode flag OVL_CONST_INO
ovl: Treat metacopy dentries as type OVL_PATH_MERGE
ovl: Check redirects for metacopy files
ovl: Move some dir related ovl_lookup_single() code in else block
ovl: Do not expose metacopy only dentry from d_real()
ovl: Open file with data except for the case of fsync
ovl: Add helper ovl_inode_realdata()
ovl: Store lower data inode in ovl_inode
ovl: Fix ovl_getattr() to get number of blocks from lower
ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
ovl: Copy up meta inode data from lowest data inode
ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
...
- Fix an uninitialized variable
- Don't use obviously garbage AG header counters to calculate
transaction reservations
- Trigger icount recalculation on bad icount when monting.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlty8tsACgkQ+H93GTRK
tOtoKw/+OeCaY6jZc2JoztBwLSUJsMYQ0R8Wsj5GRb4bVp9b0zes7RJMFU03nCtj
XuE4Rhdsx+6+QZQKxTq/Z6lrKHEjF0kL1EVGHtL46Inr+Z+Rr4bLBG6NV1o0dg7B
CR1IqW5vYcZ7Vrk9ko/RXVXtuCIxBS5jSW/S/uFT95Y4lVMAf/2asR/OoYt5ZVE3
17CUfWRifiSGoBQpjtfZd63F23XlEEusiErC5iS9rUbE2qC9FxP9EuvoUP5M/n01
nLS34Fjw7X739AiwHbf10fQPOvBr7atTazCXskjy4gbwqIWTmuhbF4ieTU1OfTI8
ozhvYomBYLiZbsEYBhVCs09VEnIfHmf2HoLh//efGE8VEvoQllxdn/g2PQekoPAn
M7VnRUXCTvaLI8IE2d3Ed1VWm0OTea09xqEiNpB0XGjegim9pXuf6t/zbe4R0vJy
YLBgQT8XRPw5ZgCnBbxvZOXXxQtAqKnqZzYSWGxlHJhhduKVeKMqerhP0nn0ui8g
wAOmOe3XEoyLfSY8WY0ACEEEA00pAwErerwVEFLCpaKTh5GOY4i3OBdqcZOtXacn
f5oIeG9HZKAXKkOTGwpq1zGHTOYhz4mxAYhodRFiEE8rXHDa9odUWQ/iG0zgZaO6
19xznXjXkVWVg0QJqQJi6SbEkkrAEFtFRYH+VPTgWM/1tg47a14=
=+0Eq
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.19-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- Fix an uninitialized variable
- Don't use obviously garbage AG header counters to calculate
transaction reservations
- Trigger icount recalculation on bad icount when mounting
* tag 'xfs-4.19-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: fix WARN_ON_ONCE on uninitialized variable
xfs: sanity check ag header values in xrep_calc_ag_resblks
xfs: recalculate summary counters at mount time if icount is bad
Pull core signal handling updates from Eric Biederman:
"It was observed that a periodic timer in combination with a
sufficiently expensive fork could prevent fork from every completing.
This contains the changes to remove the need for that restart.
This set of changes is split into several parts:
- The first part makes PIDTYPE_TGID a proper pid type instead
something only for very special cases. The part starts using
PIDTYPE_TGID enough so that in __send_signal where signals are
actually delivered we know if the signal is being sent to a a group
of processes or just a single process.
- With that prep work out of the way the logic in fork is modified so
that fork logically makes signals received while it is running
appear to be received after the fork completes"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
signal: Don't send signals to tasks that don't exist
signal: Don't restart fork when signals come in.
fork: Have new threads join on-going signal group stops
fork: Skip setting TIF_SIGPENDING in ptrace_init_task
signal: Add calculate_sigpending()
fork: Unconditionally exit if a fatal signal is pending
fork: Move and describe why the code examines PIDNS_ADDING
signal: Push pid type down into complete_signal.
signal: Push pid type down into __send_signal
signal: Push pid type down into send_signal
signal: Pass pid type into do_send_sig_info
signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
signal: Pass pid type into group_send_sig_info
signal: Pass pid and pid type into send_sigqueue
posix-timers: Noralize good_sigevent
signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
pid: Implement PIDTYPE_TGID
pids: Move the pgrp and session pid pointers from task_struct to signal_struct
kvm: Don't open code task_pid in kvm_vcpu_ioctl
pids: Compute task_tgid using signal->leader_pid
...
If we knew that the file was empty, we wouldn't be asking for a layout.
Any optimisation here is already done before calling pnfs_update_layout().
As it stands, we sometimes end up doing an unnecessary inband read to
the MDS even when holding a layout.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If we get an error while retrieving the layout, then we should
report it rather than falling back to I/O through the MDS.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The f2fs_gc() called by f2fs_balance_fs() requires to be called outside of
fi->i_gc_rwsem[WRITE], since f2fs_gc() can try to grab it in a loop.
If it hits the miximum retrials in GC, let's give a chance to release
gc_mutex for a short time in order not to go into live lock in the worst
case.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reverts the commit - "b93f771 - f2fs: remove writepages lock"
to fix the drop in sequential read throughput.
Test: ./tiotest -t 32 -d /data/tio_tmp -f 32 -b 524288 -k 1 -k 3 -L
device: UFS
Before -
read throughput: 185 MB/s
total read requests: 85177 (of these ~80000 are 4KB size requests).
total write requests: 2546 (of these ~2208 requests are written in 512KB).
After -
read throughput: 758 MB/s
total read requests: 2417 (of these ~2042 are 512KB reads).
total write requests: 2701 (of these ~2034 requests are written in 512KB).
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
- Restructure of lockdep and latency tracers
This is the biggest change. Joel Fernandes restructured the hooks
from irqs and preemption disabling and enabling. He got rid of
a lot of the preprocessor #ifdef mess that they caused.
He turned both lockdep and the latency tracers to use trace events
inserted in the preempt/irqs disabling paths. But unfortunately,
these started to cause issues in corner cases. Thus, parts of the
code was reverted back to where lockde and the latency tracers
just get called directly (without using the trace events).
But because the original change cleaned up the code very nicely
we kept that, as well as the trace events for preempt and irqs
disabling, but they are limited to not being called in NMIs.
- Have trace events use SRCU for "rcu idle" calls. This was required
for the preempt/irqs off trace events. But it also had to not
allow them to be called in NMI context. Waiting till Paul makes
an NMI safe SRCU API.
- New notrace SRCU API to allow trace events to use SRCU.
- Addition of mcount-nop option support
- SPDX headers replacing GPL templates.
- Various other fixes and clean ups.
- Some fixes are marked for stable, but were not fully tested
before the merge window opened.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW3ruhRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qiM7AP47NhYdSnCFCRUJfrt6PovXmQtuCHt3
c3QMoGGdvzh9YAEAqcSXwh7uLhpHUp1LjMAPkXdZVwNddf4zJQ1zyxQ+EAU=
=vgEr
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Restructure of lockdep and latency tracers
This is the biggest change. Joel Fernandes restructured the hooks
from irqs and preemption disabling and enabling. He got rid of a lot
of the preprocessor #ifdef mess that they caused.
He turned both lockdep and the latency tracers to use trace events
inserted in the preempt/irqs disabling paths. But unfortunately,
these started to cause issues in corner cases. Thus, parts of the
code was reverted back to where lockdep and the latency tracers just
get called directly (without using the trace events). But because the
original change cleaned up the code very nicely we kept that, as well
as the trace events for preempt and irqs disabling, but they are
limited to not being called in NMIs.
- Have trace events use SRCU for "rcu idle" calls. This was required
for the preempt/irqs off trace events. But it also had to not allow
them to be called in NMI context. Waiting till Paul makes an NMI safe
SRCU API.
- New notrace SRCU API to allow trace events to use SRCU.
- Addition of mcount-nop option support
- SPDX headers replacing GPL templates.
- Various other fixes and clean ups.
- Some fixes are marked for stable, but were not fully tested before
the merge window opened.
* tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
tracing: Fix SPDX format headers to use C++ style comments
tracing: Add SPDX License format tags to tracing files
tracing: Add SPDX License format to bpf_trace.c
blktrace: Add SPDX License format header
s390/ftrace: Add -mfentry and -mnop-mcount support
tracing: Add -mcount-nop option support
tracing: Avoid calling cc-option -mrecord-mcount for every Makefile
tracing: Handle CC_FLAGS_FTRACE more accurately
Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
Uprobes: Simplify uprobe_register() body
tracepoints: Free early tracepoints after RCU is initialized
uprobes: Use synchronize_rcu() not synchronize_sched()
tracing: Fix synchronizing to event changes with tracepoint_synchronize_unregister()
ftrace: Remove unused pointer ftrace_swapper_pid
tracing: More reverting of "tracing: Centralize preemptirq tracepoints and unify their usage"
tracing/irqsoff: Handle preempt_count for different configs
tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"
tracing: irqsoff: Account for additional preempt_disable
trace: Use rcu_dereference_raw for hooks from trace-event subsystem
tracing/kprobes: Fix within_notrace_func() to check only notrace functions
...
basic support for rbd images within namespaces (myself). Also included
y2038 conversion patches from Arnd, a pile of miscellaneous fixes from
Chengguang and Zheng's feature bit infrastructure for the filesystem.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlt62CkTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzizfhB/0c/rz6frunc6EcZMWuBNzlOIOktJ/m
MEbPGjCxMAsmidO1rqHHYF4iEN5hr+3AWTbtIL2m6wkqYVdg3FjmNaAYB27AdQMG
kH9bLfrKIew72/NZqXfm25yjY/86kIt8t91kay4Lchc97tSYhnFSnku7iAX2HTND
TMhq/1O/GvEyw/RmqnenJEQqFJvKnfgPPQm6W8sM2bH0T5j+EXmDT/Rv+90LogFR
J4+pZkHqDfvyMb1WJ5MkumohytbRVzRNKcMpOvjquJSqUgtgZa2JdrIsypDqSNKY
nUT6jGGlxoSbHCqRwDJoFEJOlh5A9RwKqYxNuM2a/vs9u7HpvdCK/Iah
=AtgY
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The main things are support for cephx v2 authentication protocol and
basic support for rbd images within namespaces (myself).
Also included are y2038 conversion patches from Arnd, a pile of
miscellaneous fixes from Chengguang and Zheng's feature bit
infrastructure for the filesystem"
* tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-client: (40 commits)
ceph: don't drop message if it contains more data than expected
ceph: support cephfs' own feature bits
crush: fix using plain integer as NULL warning
libceph: remove unnecessary non NULL check for request_key
ceph: refactor error handling code in ceph_reserve_caps()
ceph: refactor ceph_unreserve_caps()
ceph: change to void return type for __do_request()
ceph: compare fsc->max_file_size and inode->i_size for max file size limit
ceph: add additional size check in ceph_setattr()
ceph: add additional offset check in ceph_write_iter()
ceph: add additional range check in ceph_fallocate()
ceph: add new field max_file_size in ceph_fs_client
libceph: weaken sizeof check in ceph_x_verify_authorizer_reply()
libceph: check authorizer reply/challenge length before reading
libceph: implement CEPHX_V2 calculation mode
libceph: add authorizer challenge
libceph: factor out encrypt_authorizer()
libceph: factor out __ceph_x_decrypt()
libceph: factor out __prepare_write_connect()
libceph: store ceph_auth_handshake pointer in ceph_connection
...
Here are all of the driver core and related patches for 4.19-rc1.
Nothing huge here, just a number of small cleanups and the ability to
now stop the deferred probing after init happens.
All of these have been in linux-next for a while with only a merge issue
reported. That merge issue is in fs/sysfs/group.c and Stephen has
posted the diff of what it should be to resolve this. I'll follow up
with that diff to this pull request.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCW3g86Q8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynyXQCePaZSW8wft4b7nLN8RdZ98ATBru0Ani10lrJa
HQeQJRNbWU1AZ0ym7695
=tOaH
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here are all of the driver core and related patches for 4.19-rc1.
Nothing huge here, just a number of small cleanups and the ability to
now stop the deferred probing after init happens.
All of these have been in linux-next for a while with only a merge
issue reported"
* tag 'driver-core-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (21 commits)
base: core: Remove WARN_ON from link dependencies check
drivers/base: stop new probing during shutdown
drivers: core: Remove glue dirs from sysfs earlier
driver core: remove unnecessary function extern declare
sysfs.h: fix non-kernel-doc comment
PM / Domains: Stop deferring probe at the end of initcall
iommu: Remove IOMMU_OF_DECLARE
iommu: Stop deferring probe at end of initcalls
pinctrl: Support stopping deferred probe after initcalls
dt-bindings: pinctrl: add a 'pinctrl-use-default' property
driver core: allow stopping deferred probe after init
driver core: add a debugfs entry to show deferred devices
sysfs: Fix internal_create_group() for named group updates
base: fix order of OF initialization
linux/device.h: fix kernel-doc notation warning
Documentation: update firmware loader fallback reference
kobject: Replace strncpy with memcpy
drivers: base: cacheinfo: use OF property_read_u32 instead of get_property,read_number
kernfs: Replace strncpy with memcpy
device: Add #define dev_fmt similar to #define pr_fmt
...
This tag is the same as 9p-for-4.19 without the two MAINTAINERS patches
Contains mostly fixes (6 to be backported to stable) and a few changes,
here is the breakdown:
* Rework how fids are attributed by replacing some custom tracking in a
list by an idr (f28cdf0430)
* For packet-based transports (virtio/rdma) validate that the packet
length matches what the header says (f984579a01)
* A few race condition fixes found by syzkaller (9f476d7c54,
430ac66eb4)
* Missing argument check when NULL device is passed in sys_mount
(10aa14527f)
* A few virtio fixes (23cba9cbde, 31934da810, d28c756cae)
* Some spelling and style fixes
----------------------------------------------------------------
Chirantan Ekbote (1):
9p/net: Fix zero-copy path in the 9p virtio transport
Colin Ian King (1):
fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown"
Jean-Philippe Brucker (1):
net/9p: fix error path of p9_virtio_probe
Matthew Wilcox (4):
9p: Fix comment on smp_wmb
9p: Change p9_fid_create calling convention
9p: Replace the fidlist with an IDR
9p: Embed wait_queue_head into p9_req_t
Souptick Joarder (1):
fs/9p/vfs_file.c: use new return type vm_fault_t
Stephen Hemminger (1):
9p: fix whitespace issues
Tomas Bortoli (5):
net/9p/client.c: version pointer uninitialized
net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
net/9p/trans_fd.c: fix race by holding the lock
9p: validate PDU length
9p: fix multiple NULL-pointer-dereferences
jiangyiwen (2):
net/9p/virtio: Fix hard lockup in req_done
9p/virtio: fix off-by-one error in sg list bounds check
piaojun (5):
net/9p/client.c: add missing '\n' at the end of p9_debug()
9p/net/protocol.c: return -ENOMEM when kmalloc() failed
net/9p/trans_virtio.c: fix some spell mistakes in comments
fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
net/9p/trans_virtio.c: add null terminal for mount tag
fs/9p/v9fs.c | 2 +-
fs/9p/vfs_file.c | 2 +-
fs/9p/xattr.c | 6 ++++--
include/net/9p/client.h | 11 ++++-------
net/9p/client.c | 119 +++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------------------------------------
net/9p/protocol.c | 2 +-
net/9p/trans_fd.c | 22 +++++++++++++++-------
net/9p/trans_rdma.c | 4 ++++
net/9p/trans_virtio.c | 66 +++++++++++++++++++++++++++++++++++++---------------------------
net/9p/trans_xen.c | 3 +++
net/9p/util.c | 1 -
12 files changed, 122 insertions(+), 116 deletions(-)
-----BEGIN PGP SIGNATURE-----
iF0EABECAB0WIQQ8idm2ZSicIMLgzKqoqIItDqvwPAUCW3ElNwAKCRCoqIItDqvw
PMzfAKCkCYFyNC89vcpxcCNsK7rFQ1qKlwCgoaBpZDdegOu0jMB7cyKwAWrB0LM=
=h3T0
-----END PGP SIGNATURE-----
Merge tag '9p-for-4.19-2' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"This contains mostly fixes (6 to be backported to stable) and a few
changes, here is the breakdown:
- rework how fids are attributed by replacing some custom tracking in
a list by an idr
- for packet-based transports (virtio/rdma) validate that the packet
length matches what the header says
- a few race condition fixes found by syzkaller
- missing argument check when NULL device is passed in sys_mount
- a few virtio fixes
- some spelling and style fixes"
* tag '9p-for-4.19-2' of git://github.com/martinetd/linux: (21 commits)
net/9p/trans_virtio.c: add null terminal for mount tag
9p/virtio: fix off-by-one error in sg list bounds check
9p: fix whitespace issues
9p: fix multiple NULL-pointer-dereferences
fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
9p: validate PDU length
net/9p/trans_fd.c: fix race by holding the lock
net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
net/9p/virtio: Fix hard lockup in req_done
net/9p/trans_virtio.c: fix some spell mistakes in comments
9p/net: Fix zero-copy path in the 9p virtio transport
9p: Embed wait_queue_head into p9_req_t
9p: Replace the fidlist with an IDR
9p: Change p9_fid_create calling convention
9p: Fix comment on smp_wmb
net/9p/client.c: version pointer uninitialized
fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown"
net/9p: fix error path of p9_virtio_probe
9p/net/protocol.c: return -ENOMEM when kmalloc() failed
net/9p/client.c: add missing '\n' at the end of p9_debug()
...
Merge updates from Andrew Morton:
- a few misc things
- a few Y2038 fixes
- ntfs fixes
- arch/sh tweaks
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
mm/hmm.c: remove unused variables align_start and align_end
fs/userfaultfd.c: remove redundant pointer uwq
mm, vmacache: hash addresses based on pmd
mm/list_lru: introduce list_lru_shrink_walk_irq()
mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
mm/sparse: delete old sparse_init and enable new one
mm/sparse: add new sparse_init_nid() and sparse_init()
mm/sparse: move buffer init/fini to the common place
mm/sparse: use the new sparse buffer functions in non-vmemmap
mm/sparse: abstract sparse buffer allocations
mm/hugetlb.c: don't zero 1GiB bootmem pages
mm, page_alloc: double zone's batchsize
mm/oom_kill.c: document oom_lock
mm/hugetlb: remove gigantic page support for HIGHMEM
mm, oom: remove sleep from under oom_lock
kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
...
Pointer uwq is being assigned but is never used hence it is redundant
and can be removed.
Cleans up clang warning:
warning: variable 'uwq' set but not used [-Wunused-but-set-variable]
Link: http://lkml.kernel.org/r/20180717090802.18357-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to distinguish the situations when shrinker has very small
amount of objects (see vfs_pressure_ratio() called from
super_cache_count()), and when it has no objects at all. Currently, in
the both of these cases, shrinker::count_objects() returns 0.
The patch introduces new SHRINK_EMPTY return value, which will be used
for "no objects at all" case. It's is a refactoring mostly, as
SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this
patch, and all the magic will happen in further.
Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add list_lru::shrinker_id field and populate it by registered shrinker
id.
This will be used to set correct bit in memcg shrinkers map by lru code
in next patches, after there appeared the first related to memcg element
in list_lru.
Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do two list_lru_init_memcg() calls after prealloc_super().
destroy_unused_super() in fail path is OK with this. Next patch needs
such the order.
Link: http://lkml.kernel.org/r/153063058712.1818.3382490999719078571.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The buffer_head can consume a significant amount of system memory and is
directly related to the amount of page cache. In our production
environment we have observed that a lot of machines are spending a
significant amount of memory as buffer_head and can not be left as
system memory overhead.
Charging buffer_head is not as simple as adding __GFP_ACCOUNT to the
allocation. The buffer_heads can be allocated in a memcg different from
the memcg of the page for which buffer_heads are being allocated. One
concrete example is memory reclaim. The reclaim can trigger I/O of
pages of any memcg on the system. So, the right way to charge
buffer_head is to extract the memcg from the page for which buffer_heads
are being allocated and then use targeted memcg charging API.
[shakeelb@google.com: use __GFP_ACCOUNT for directed memcg charging]
Link: http://lkml.kernel.org/r/20180702220208.213380-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Directed kmem charging", v8.
The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs. All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.
The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated. However there are cases
where the allocated kernel memory should be charged to the memcg
different from the current processes's memcg. This patch series
contains two such concrete use-cases i.e. fsnotify and buffer_head.
The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener. The events
are allocated in the context of the event producer. However they should
be charged to the event consumer. Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.
To solve this issue, this patch series introduces mechanism to charge
kernel memory to a given memcg. In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged. For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.
This patch (of 2):
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener. This can cause
system level memory pressure or OOMs. So, it's better to account the
fsnotify kmem caches to the memcg of the listener.
However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.
There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happens in the context of syscall from the
listener. So, SLAB_ACCOUNT is enough for these caches.
The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.
The allocations from the event caches happen in the context of the event
producer. For such caches we will need to remote charge the allocations
to the listener's memcg. Thus we save the memcg reference in the
fsnotify_group structure of the listener.
This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a_ops->readpages() is only ever used for read-ahead. Ensure that we
pass this information down to the block layer.
Link: http://lkml.kernel.org/r/20180621010725.17813-5-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a_ops->readpages() is only ever used for read-ahead. Ensure that we
pass this information down to the block layer.
Link: http://lkml.kernel.org/r/20180621010725.17813-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a_ops->readpages() is only ever used for read-ahead, yet we don't flag
the IO being submitted as such. Fix that up. Any file system that uses
mpage_readpages() as its ->readpages() implementation will now get this
right.
Since we're passing in whether the IO is read-ahead or not, we don't
need to pass in the 'gfp' separately, as it is dependent on the IO being
read-ahead. Kill off that member.
Add some documentation notes on ->readpages() being purely for
read-ahead.
Link: http://lkml.kernel.org/r/20180621010725.17813-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Submit ->readpages() IO as read-ahead", v4.
The only caller of ->readpages() is from read-ahead, yet we don't submit
IO flagged with REQ_RAHEAD. This means we don't see it in blktrace, for
instance, which is a shame. Additionally, it's preventing further
functional changes in the block layer for deadling with read-ahead more
intelligently. We already make assumptions about ->readpages() just
being for read-ahead in the mpage implementation, using
readahead_gfp_mask(mapping) as out GFP mask of choice.
This small series fixes up mpage_readpages() to submit with REQ_RAHEAD,
which takes care of file systems using mpage_readpages(). The first
patch is a prep patch, that makes do_mpage_readpage() take an argument
structure.
This patch (of 4):
We're currently passing 8 arguments to this function, clean it up a bit
by packing the arguments in an args structure we pass to it.
No intentional functional changes in this patch.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20180621010725.17813-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The documentation for seq_file suggests that it is necessary to be able
to move the iterator to a given offset, however that is not the case.
If the iterator is stored in the private data and is stable from one
read() syscall to the next, it is only necessary to support first/next
interactions. Implementing this in a client is a little clumsy.
- if ->start() is given a pos of zero, it should go to start of
sequence.
- if ->start() is given the name pos that was given to the most recent
next() or start(), it should restore the iterator to state just
before that last call
- if ->start is given another number, it should set the iterator one
beyond the start just before the last ->start or ->next call.
Also, the documentation says that the implementation can interpret the
pos however it likes (other than zero meaning start), but seq_file
increments the pos sometimes which does impose on the implementation.
This patch simplifies the interface for first/next iteration and
simplifies the code, while maintaining complete backward compatability.
Now:
- if ->start() is given a pos of zero, it should return an iterator
placed at the start of the sequence
- if ->start() is given a non-zero pos, it should return the iterator
in the same state it was after the last ->start or ->next.
This is particularly useful for interators which walk the multiple
chains in a hash table, e.g. using rhashtable_walk*. See
fs/gfs2/glock.c and drivers/staging/lustre/lustre/llite/vvp_dev.c
A large part of achieving this is to *always* call ->next after ->show
has successfully stored all of an entry in the buffer. Never just
increment the index instead. Also:
- always pass &m->index to ->start() and ->next(), never a temp
variable
- don't clear ->from when ->count is zero, as ->from is dead when
->count is zero.
Some ->next functions do not increment *pos when they return NULL. To
maintain compatability with this, we still need to increment m->index in
one place, if ->next didn't increment it. Note that such ->next
functions are buggy and should be fixed. A simple demonstration is
dd if=/proc/swaps bs=1000 skip=1
Choose any block size larger than the size of /proc/swaps. This will
always show the whole last line of /proc/swaps.
This patch doesn't work around buggy next() functions for this case.
[neilb@suse.com: ensure ->from is valid]
Link: http://lkml.kernel.org/r/87601ryb8a.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Jonathan Corbet <corbet@lwn.net> [docs]
Tested-by: Jann Horn <jannh@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This flag was introduce in 2.1.37pre1 and the only place it was tested
was removed in 2.1.43pre1. The flag was never set.
Let's discard it properly.
Link: http://lkml.kernel.org/r/877en0hewz.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since only dentry->d_name.len + 1 bytes out of DNAME_INLINE_LEN bytes
are initialized at __d_alloc(), we can't copy the whole size
unconditionally.
WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff8fa27465ac50)
636f6e66696766732e746d70000000000010000000000000020000000188ffff
i i i i i i i i i i i i i u u u u u u u u u u i i i i i u u u u
^
RIP: 0010:take_dentry_name_snapshot+0x28/0x50
RSP: 0018:ffffa83000f5bdf8 EFLAGS: 00010246
RAX: 0000000000000020 RBX: ffff8fa274b20550 RCX: 0000000000000002
RDX: ffffa83000f5be40 RSI: ffff8fa27465ac50 RDI: ffffa83000f5be60
RBP: ffffa83000f5bdf8 R08: ffffa83000f5be48 R09: 0000000000000001
R10: ffff8fa27465ac00 R11: ffff8fa27465acc0 R12: ffff8fa27465ac00
R13: ffff8fa27465acc0 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f79737ac8c0(0000) GS:ffffffff8fc30000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8fa274c0b000 CR3: 0000000134aa7002 CR4: 00000000000606f0
take_dentry_name_snapshot+0x28/0x50
vfs_rename+0x128/0x870
SyS_rename+0x3b2/0x3d0
entry_SYSCALL_64_fastpath+0x1a/0xa4
0xffffffffffffffff
Link: http://lkml.kernel.org/r/201709131912.GBG39012.QMJLOVFSFFOOtH@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are a variety of functions and variables that are local to the
source and do not need to be in global scope, so make them static. Also
make a couple of char arrays static const.
Cleans up sparse warnings:
symbol 'o2hb_heartbeat_mode_desc' was not declared. Should it be static?
symbol 'o2hb_heartbeat_mode' was not declared. Should it be static?
symbol 'o2hb_dependent_users' was not declared. Should it be static?
symbol 'o2hb_region_dec_user' was not declared. Should it be static?
symbol 'o2nm_fence_method_desc' was not declared. Should it be static?
symbol 'lockdep_keys' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20180628131659.12133-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several functions have some unnecessary code, clean up these code.
Link: http://lkml.kernel.org/r/5B14DF72.5020800@huawei.com
Signed-off-by: Yan Wang <wangyan122@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should return -EROFS rather than other errno if filesystem becomes
read-only.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/5B191B26.9010501@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the quest to remove all stack VLA usage from the kernel[1], this
allocates the maximum size stack buffer. Existing checks already
require that blocksize >= NTFS_BLOCK_SIZE and mft_record_size <=
PAGE_SIZE, so max_bhs can be at most PAGE_SIZE / NTFS_BLOCK_SIZE.
Sanity checks are added for robustness.
[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
Link: http://lkml.kernel.org/r/20180626172909.41453-4-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the quest to remove all stack VLA usage from the kernel[1], this
moves the stack buffer used during decompression to be allocated
externally.
The existing "dest_max_index" used in the VLA is bounded by cb_max_page.
cb_max_page is bounded by max_page, and max_page is bounded by nr_pages.
Since nr_pages is used for the "pages" allocation, it can similarly be
used for the "completed_pages" allocation and passed into the
decompression function. The error paths are updated to free the new
allocation.
[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
Link: http://lkml.kernel.org/r/20180626172909.41453-3-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ntfs_end_buffer_async_read() disables interrupts around kmap_atomic().
This is a leftover from the old kmap_atomic() implementation which
relied on fixed mapping slots, so the caller had to make sure that the
same slot could not be reused from an interrupting context.
kmap_atomic() was changed to dynamic slots long ago and commit
1ec9c5ddc1 ("include/linux/highmem.h: remove the second argument of
k[un]map_atomic()") removed the slot assignements, but the callers were
not checked for now redundant interrupt disabling.
Remove the conditional interrupt disable.
Link: http://lkml.kernel.org/r/20180611144913.gln5mklhqcrfsoom@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The VFS timestamps are all 64-bit now, the only missing piece for hpfs
is the internal conversion function. One interesting bit about hpfs is
that it can already deal with moving the 136 year window of its
timestamps to support a much wider range than other file systems with
32-bit timestamps. It also treats the timestamps as 'unsigned' on
64-bit architectures (but signed on 32-bit, because time_t always around
to negative numbers in 2038).
Changing the conversion to use time64_t makes 32-bit architectures
behave the same way as 64-bit. For completeness, this also adds a
clamp_t call for each conversion, so we don't wrap the timestamps but
instead stay within the [0..U32_MAX] range of the on-disk timestamps.
Link: http://lkml.kernel.org/r/20180718115017.742609-3-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that the VFS has been converted from timespec to timespec64
timestamps, only the conversion to/from ntfs timestamps uses 32-bit
seconds.
This changes that last missing piece to get the ntfs implementation
y2038 safe on 32-bit architectures.
Link: http://lkml.kernel.org/r/20180718115017.742609-2-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_seconds() is deprecated because of the 32-bit overflow and will be
removed. All callers in ufs also truncate to a 32-bit number, so
nothing changes during the conversion, but this should be harmless as
the superblock and cylinder group timestamps are not visible to user
space, except for checking the fs-dirty state, wich works fine across
the overflow.
This moves the call to get_seconds() into a new inline function, with a
comment explaining the constraints, while converting it to
ktime_get_real_seconds().
Link: http://lkml.kernel.org/r/20180718115017.742609-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is reworked from an earlier patch that Dan has posted:
https://patchwork.kernel.org/patch/10131727/
VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that
the memory page it is dealing with is not typical memory from the linear
map. The get_user_pages_fast() path, since it does not resolve the vma,
is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
use that as a VM_MIXEDMAP replacement in some locations. In the cases
where there is no pte to consult we fallback to using vma_is_dax() to
detect the VM_MIXEDMAP special case.
Now that we have explicit driver pfn_t-flag opt-in/opt-out for
get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This
also means we no longer need to worry about safely manipulating vm_flags
in a future where we support dynamically changing the dax mode of a
file.
DAX should also now be supported with madvise_behavior(), vma_merge(),
and copy_page_range().
This patch has been tested against ndctl unit test. It has also been
tested against xfstests commit: 625515d using fake pmem created by
memmap and no additional issues have been observed.
Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>