summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2019-03-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds8-105/+168
Merge more updates from Andrew Morton: - some of the rest of MM - various misc things - dynamic-debug updates - checkpatch - some epoll speedups - autofs - rapidio - lib/, lib/lzo/ updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits) samples/mic/mpssd/mpssd.h: remove duplicate header kernel/fork.c: remove duplicated include include/linux/relay.h: fix percpu annotation in struct rchan arch/nios2/mm/fault.c: remove duplicate include unicore32: stop printing the virtual memory layout MAINTAINERS: fix GTA02 entry and mark as orphan mm: create the new vm_fault_t type arm, s390, unicore32: remove oneliner wrappers for memblock_alloc() arch: simplify several early memory allocations openrisc: simplify pte_alloc_one_kernel() sh: prefer memblock APIs returning virtual address microblaze: prefer memblock API returning virtual address powerpc: prefer memblock APIs returning virtual address lib/lzo: separate lzo-rle from lzo lib/lzo: implement run-length encoding lib/lzo: fast 8-byte copy on arm64 lib/lzo: 64-bit CTZ on arm64 lib/lzo: tidy-up ifdefs ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_size ipc: annotate implicit fall through ...
2019-03-07exec: increase BINPRM_BUF_SIZE to 256Oleg Nesterov1-1/+1
Large enterprise clients often run applications out of networked file systems where the IT mandated layout of project volumes can end up leading to paths that are longer than 128 characters. Bumping this up to the next order of two solves this problem in all but the most egregious case while still fitting into a 512b slab. [oleg@redhat.com: update comment, per Kees] Link: http://lkml.kernel.org/r/20181112160956.GA28472@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Ben Woodard <woodard@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07fs/exec.c: replace opencoded set_mask_bits()Vineet Gupta1-6/+1
Link: http://lkml.kernel.org/r/1548275584-18096-2-git-send-email-vgupta@synopsys.com Link: http://lkml.kernel.org/g/20150807115710.GA16897@redhat.com Signed-off-by: Vineet Gupta <vgupta@synopsys.com> Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07fat: enable .splice_write to support splice on O_DIRECT fileHou Tao1-0/+1
Now splice() on O_DIRECT-opened fat file will return -EFAULT, that is because the default .splice_write, namely default_file_splice_write(), will construct an ITER_KVEC iov_iter and dio_refill_pages() in dio path can not handle it. Fix it by implementing .splice_write through iter_file_splice_write(). Spotted by xfs-tests generic/091. Link: http://lkml.kernel.org/r/20190210094754.56355-1-houtao1@huawei.com Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07autofs: clear O_NONBLOCK on the pipeNeilBrown1-0/+2
autofs does not expect the pipe it is given to have O_NONBLOCK set - specifically if __kernel_write() in autofs_write() returns -EAGAIN, this is treated as a fatal error and the pipe is closed. For safety autofs should, therefore, clear the O_NONBLOCK flag. Releases of systemd prior to 8th February 2019 used pipe2(p, O_NONBLOCK|O_CLOEXEC) and thus (inadvertently) set this flag. Link: http://lkml.kernel.org/r/154993550902.3321.1183632970046073478.stgit@pluto-themaw-net Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07fs/autofs/inode.c: use seq_puts() for simple strings in autofs_show_options()Ian Kent1-6/+6
Fix checkpatch.sh WARNING about the use of seq_printf() to print simple strings in autofs_show_options(), use seq_puts() in this case. Link: http://lkml.kernel.org/r/154889012613.4863.12231175554744203482.stgit@pluto-themaw-net Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07autofs: add ignore mount optionIan Kent2-1/+9
Add an autofs file system mount option that can be used to provide a generic indicator to applications that the mount entry should be ignored when displaying mount information. In other OSes that provide autofs and that provide a mount list to user space based on the kernel mount list a no-op mount option ("ignore" is the one use on the most common OS) is allowed so that autofs file system users can optionally use it. The idea is that it be used by user space programs to exclude autofs mounts from consideration when reading the mounts list. Prior to the change to link /etc/mtab to /proc/self/mounts all I needed to do to achieve this was to use mount(2) and not update the mtab but now that no longer works. I know the symlinking happened a long time ago and I considered doing this then but, at the time I couldn't remember the commonly used option name and thought persuading the various utility maintainers would be too hard. But now I have a RHEL request to do this for compatibility for a widely used product so I want to go ahead with it and try and enlist the help of some utility package maintainers. Clearly, without the option nothing can be done so it's at least a start. Link: http://lkml.kernel.org/r/154725123970.11260.6113771566924907275.stgit@pluto-themaw-net Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07fs/binfmt_elf.c: spread const a littleAlexey Dobriyan1-5/+3
Link: http://lkml.kernel.org/r/20190204202830.GC27482@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07fs/binfmt_elf.c: use list_for_each_entry()Alexey Dobriyan1-10/+5
[adobriyan@gmail.com: fixup compilation] Link: http://lkml.kernel.org/r/20190205064334.GA2152@avx2 Link: http://lkml.kernel.org/r/20190204202800.GB27482@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07fs/binfmt_elf.c: don't be afraid of overflowAlexey Dobriyan1-6/+3
Number of ELF program headers is 16-bit by spec, so total size comfortably fits into "unsigned int". Space savings: 7 bytes! add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-7 (-7) Function old new delta load_elf_phdrs 137 130 -7 Link: http://lkml.kernel.org/r/20190204202715.GA27482@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07epoll: use rwlock in order to reduce ep_poll_callback() contentionRoman Penyaev1-36/+122
The goal of this patch is to reduce contention of ep_poll_callback() which can be called concurrently from different CPUs in case of high events rates and many fds per epoll. Problem can be very well reproduced by generating events (write to pipe or eventfd) from many threads, while consumer thread does polling. In other words this patch increases the bandwidth of events which can be delivered from sources to the poller by adding poll items in a lockless way to the list. The main change is in replacement of the spinlock with a rwlock, which is taken on read in ep_poll_callback(), and then by adding poll items to the tail of the list using xchg atomic instruction. Write lock is taken everywhere else in order to stop list modifications and guarantee that list updates are fully completed (I assume that write side of a rwlock does not starve, it seems qrwlock implementation has these guarantees). The following are some microbenchmark results based on the test [1] which starts threads which generate N events each. The test ends when all events are successfully fetched by the poller thread: spinlock ======== threads events/ms run-time ms 8 6402 12495 16 7045 22709 32 7395 43268 rwlock + xchg ============= threads events/ms run-time ms 8 10038 7969 16 12178 13138 32 13223 24199 According to the results bandwidth of delivered events is significantly increased, thus execution time is reduced. This patch was tested with different sort of microbenchmarks and artificial delays (e.g. "udelay(get_random_int() & 0xff)") introduced in kernel on paths where items are added to lists. [1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c Link: http://lkml.kernel.org/r/20190103150104.17128-5-rpenyaev@suse.de Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07epoll: unify awaking of wakeup source on ep_poll_callback() pathRoman Penyaev1-8/+1
Original comment "Activate ep->ws since epi->ws may get deactivated at any time" indeed sounds loud, but it is incorrect, because the path where we check epi->ws is a path where insert to ovflist happens, i.e. ep_scan_ready_list() has taken ep->mtx and waits for this callback to finish, thus ep_modify() (which unregisters wakeup source) waits for ep_scan_ready_list(). Here in this patch I simply call ep_pm_stay_awake_rcu(), which is a bit extra for this path (indirectly protected by main ep->mtx, so even rcu is not needed), but I do not want to create another naked __ep_pm_stay_awake() variant only for this particular case, so rcu variant is just better for all the cases. Link: http://lkml.kernel.org/r/20190103150104.17128-4-rpenyaev@suse.de Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07epoll: make sure all elements in ready list are in FIFO orderRoman Penyaev1-1/+5
Patch series "use rwlock in order to reduce ep_poll_callback() contention", v3. The last patch targets the contention problem in ep_poll_callback(), which can be very well reproduced by generating events (write to pipe or eventfd) from many threads, while consumer thread does polling. The following are some microbenchmark results based on the test [1] which starts threads which generate N events each. The test ends when all events are successfully fetched by the poller thread: spinlock ======== threads events/ms run-time ms 8 6402 12495 16 7045 22709 32 7395 43268 rwlock + xchg ============= threads events/ms run-time ms 8 10038 7969 16 12178 13138 32 13223 24199 According to the results bandwidth of delivered events is significantly increased, thus execution time is reduced. This patch (of 4): All coming events are stored in FIFO order and this is also should be applicable to ->ovflist, which originally is stack, i.e. LIFO. Thus to keep correct FIFO order ->ovflist should reversed by adding elements to the head of the read list but not to the tail. Link: http://lkml.kernel.org/r/20190103150104.17128-2-rpenyaev@suse.de Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07btrfs: implement btrfs_debug* in terms of helper macroRasmus Villemoes1-24/+10
First, the btrfs_debug macros open-code (one possible definition of) DYNAMIC_DEBUG_BRANCH, so they don't benefit from the CONFIG_JUMP_LABEL optimization. Second, a planned change of struct _ddebug (to reduce its size on 64 bit machines) requires that all descriptors in a translation unit use distinct identifiers. Using the new _dynamic_func_call_no_desc helper macro from dynamic_debug.h takes care of both of these. No functional change. Link: http://lkml.kernel.org/r/20190212214150.4807-12-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: David Sterba <dsterba@suse.com> Acked-by: Jason Baron <jbaron@akamai.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: "Rafael J . Wysocki" <rafael.j.wysocki@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07linux/fs.h: move member alignment check next to definition of struct filenameRasmus Villemoes1-2/+0
Instead of doing this compile-time check in some slightly arbitrary user of struct filename, put it next to the definition. Link: http://lkml.kernel.org/r/20190208203015.29702-3-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07Merge tag 'audit-pr-20190305' of ↵Linus Torvalds3-4/+6
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: "A lucky 13 audit patches for v5.1. Despite the rather large diffstat, most of the changes are from two bug fix patches that move code from one Kconfig option to another. Beyond that bit of churn, the remaining changes are largely cleanups and bug-fixes as we slowly march towards container auditing. It isn't all boring though, we do have a couple of new things: file capabilities v3 support, and expanded support for filtering on filesystems to solve problems with remote filesystems. All changes pass the audit-testsuite. Please merge for v5.1" * tag 'audit-pr-20190305' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: mark expected switch fall-through audit: hide auditsc_get_stamp and audit_serial prototypes audit: join tty records to their syscall audit: remove audit_context when CONFIG_ AUDIT and not AUDITSYSCALL audit: remove unused actx param from audit_rule_match audit: ignore fcaps on umount audit: clean up AUDITSYSCALL prototypes and stubs audit: more filter PATH records keyed on filesystem magic audit: add support for fcaps v3 audit: move loginuid and sessionid from CONFIG_AUDITSYSCALL to CONFIG_AUDIT audit: add syscall information to CONFIG_CHANGE records audit: hand taken context to audit_kill_trees for syscall logging audit: give a clue what CONFIG_CHANGE op was involved
2019-03-07Merge branch 'next-general' of ↵Linus Torvalds2-9/+56
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris: - Extend LSM stacking to allow sharing of cred, file, ipc, inode, and task blobs. This paves the way for more full-featured LSMs to be merged, and is specifically aimed at LandLock and SARA LSMs. This work is from Casey and Kees. - There's a new LSM from Micah Morton: "SafeSetID gates the setid family of syscalls to restrict UID/GID transitions from a given UID/GID to only those approved by a system-wide whitelist." This feature is currently shipping in ChromeOS. * 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (62 commits) keys: fix missing __user in KEYCTL_PKEY_QUERY LSM: Update list of SECURITYFS users in Kconfig LSM: Ignore "security=" when "lsm=" is specified LSM: Update function documentation for cap_capable security: mark expected switch fall-throughs and add a missing break tomoyo: Bump version. LSM: fix return value check in safesetid_init_securityfs() LSM: SafeSetID: add selftest LSM: SafeSetID: remove unused include LSM: SafeSetID: 'depend' on CONFIG_SECURITY LSM: Add 'name' field for SafeSetID in DEFINE_LSM LSM: add SafeSetID module that gates setid calls LSM: add SafeSetID module that gates setid calls tomoyo: Allow multiple use_group lines. tomoyo: Coding style fix. tomoyo: Swicth from cred->security to task_struct->security. security: keys: annotate implicit fall throughs security: keys: annotate implicit fall throughs security: keys: annotate implicit fall through capabilities:: annotate implicit fall through ...
2019-03-07Merge tag 'xfs-5.1-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds76-1238/+2132
Pull xfs updates from Darrick Wong: "Here are a number of new features and bug fixes for 5.1 They've undergone a week's worth of fstesting and merge cleanly with master as of this morning Most of the changes center on improving metadata validation and fixing problems with online fsck, though there's also a new cache to speed up unlinked inode handling and cleanup of the copy on write code in preparation for future features Changes for Linux 5.1: - Fix online fsck to handle inode btrees correctly on 64k block filesystems - Teach online fsck to check directory and attribute names for invalid characters - Miscellanous fixes for online fsck - Introduce a new panic mask so that we can halt immediately on metadata corruption (for debugging purposes) - Fix a block mapping race during writeback - Cache unlinked inode list backrefs in memory to speed up list processing - Separate the bnobt/cntbt and inobt/finobt buffer verifiers so that we can detect crosslinked btrees - Refactor magic number verification so that we can standardize it - Strengthen ondisk metadata structure offset build time verification - Fix a memory corruption problem in the listxattr code - Fix a shutdown problem during log recovery due to unreserved finobt expansion - Fix a referential integrity problem where O_TMPFILE inodes were put on the unlinked list with nlink > 0 which would cause asserts during log recovery if the system went down immediately - Refactor the delayed allocation allocator to be more clever about the possibility that its mapping might be stale - Various fixes to the copy on write mechanism - Make CoW preallocation suitable for use even with writes that wouldn't otherwise require it - Refactor an internal API - Fix some statx implementation bugs - Fix miscellaneous compiler and static checker complaints" * tag 'xfs-5.1-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (70 commits) xfs: fix reporting supported extra file attributes for statx() xfs: fix backwards endian conversion in scrub xfs: fix uninitialized error variables xfs: rework breaking of shared extents in xfs_file_iomap_begin xfs: don't pass iomap flags to xfs_reflink_allocate_cow xfs: fix uninitialized error variable xfs: introduce an always_cow mode xfs: report IOMAP_F_SHARED from xfs_file_iomap_begin_delay xfs: make COW fork unwritten extent conversions more robust xfs: merge COW handling into xfs_file_iomap_begin_delay xfs: also truncate holes covered by COW blocks xfs: don't use delalloc extents for COW on files with extsize hints xfs: fix SEEK_DATA for speculative COW fork preallocation xfs: make xfs_bmbt_to_iomap more useful xfs: fix xfs_buf magic number endian checks xfs: retry COW fork delalloc conversion when no extent was found xfs: remove the truncate short cut in xfs_map_blocks xfs: move xfs_iomap_write_allocate to xfs_aops.c xfs: move stat accounting to xfs_bmapi_convert_delalloc xfs: move transaction handling to xfs_bmapi_convert_delalloc ..
2019-03-07Merge tag 'for-5.1-part1-tag' of ↵Linus Torvalds36-931/+1990
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This contains usual mix of new features, core changes and fixes; full list below. I'm planning second pull request, with a few more fixes that arrived recently but too close to merge window, will send it next week. New features: - support zstd compression levels - new ioctl to unregister a device from the module (ie. reverse of device scan) - scrub prints a message to log when it's about to start or finish Core changes: - qgroups can now skip part of a tree that does not get updated during relocation, because this does not affect the quota accounting, estimated speedup in run time is about 20% - the compression workspace management had to be enhanced due to zstd requirements - various enospc fixes, when there's high fragmentation the over-reservation can cause ENOSPC that might not happen after a flush, in such cases try to wait if the situation improves Fixes: - various ioctls could overwrite previous return value if copy_to_user fails, fix this so the original error is reported - more reclaim vs GFP_KERNEL fixes - other cleanups and refactoring - fix a (valid) lockdep warning in a test when device replace is destroying worker threads - make qgroup async transaction commit more aggressive, this avoids some 'quota limit reached' errors if there are not enough data to trigger transaction in order to flush - fix deadlock between snapshot deletion and quotas when backref walking is called from context that already holds the same locks - fsync fixes: - fix fsync after succession of renames of different files - fix fsync after succession of renames and unlink/rmdir" * tag 'for-5.1-part1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (92 commits) btrfs: Remove unnecessary casts in btrfs_read_root_item Btrfs: remove assertion when searching for a key in a node/leaf Btrfs: add missing error handling after doing leaf/node binary search btrfs: drop the lock on error in btrfs_dev_replace_cancel btrfs: ensure that a DUP or RAID1 block group has exactly two stripes btrfs: init csum_list before possible free Btrfs: remove no longer needed range length checks for deduplication Btrfs: fix fsync after succession of renames and unlink/rmdir Btrfs: fix fsync after succession of renames of different files btrfs: honor path->skip_locking in backref code btrfs: qgroup: Make qgroup async transaction commit more aggressive btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record btrfs: scrub: remove unused nocow worker pointer btrfs: scrub: add assertions for worker pointers btrfs: scrub: convert scrub_workers_refcnt to refcount_t btrfs: scrub: add scrub_lock lockdep check in scrub_workers_get btrfs: scrub: fix circular locking dependency warning btrfs: fix comment its device list mutex not volume lock btrfs: extent_io: Kill the forward declaration of flush_write_bio btrfs: Fix grossly misleading argument names in extent io search ...
2019-03-07Merge tag 'fsnotify_for_v5.1-rc1' of ↵Linus Torvalds11-223/+671
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fanotify updates from Jan Kara: "Support for fanotify directory events and changes to make waiting for fanotify permission event response killable" * tag 'fsnotify_for_v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (25 commits) fanotify: Make waits for fanotify events only killable fanotify: Use interruptible wait when waiting for permission events fanotify: Track permission event state fanotify: Simplify cleaning of access_list fsnotify: Create function to remove event from notification list fanotify: Move locking inside get_one_event() fanotify: Fold dequeue_event() into process_access_response() fanotify: Select EXPORTFS fanotify: report FAN_ONDIR to listener with FAN_REPORT_FID fanotify: add support for create/attrib/move/delete events fanotify: support events with data type FSNOTIFY_EVENT_INODE fanotify: check FS_ISDIR flag instead of d_is_dir() fsnotify: report FS_ISDIR flag with MOVE_SELF and DELETE_SELF events fanotify: use vfs_get_fsid() helper instead of vfs_statfs() vfs: add vfs_get_fsid() helper fanotify: cache fsid in fsnotify_mark_connector fanotify: enable FAN_REPORT_FID init flag fanotify: copy event fid info to user fanotify: encode file identifier for FAN_REPORT_FID fanotify: open code fill_event_metadata() ...
2019-03-07Merge tag 'fs_for_v5.1-rc1' of ↵Linus Torvalds9-37/+97
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext2 and udf fixes from Jan Kara: "A couple of fixes for udf and ext2. Namely: - fix making ext2 mountable (again) with 64k blocksize - fix for ext2 statx(2) handling - fix for udf handling of corrupted filesystem so that it doesn't get corrupted even further - couple smaller ext2 and udf cleanups" * tag 'fs_for_v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Drop pointless check from udf_sync_fs() ext2: support statx syscall udf: disallow RW mount without valid integrity descriptor udf: finalize integrity descriptor before writeback udf: factor out LVID finalization for reuse ext2: Fix underflow in ext2_max_size() ext2: Fix a typo in comment ext2: Remove redundant check for finding no group ext2: Annotate implicit fall through in __ext2_truncate_blocks ext2: Set superblock revision when enabling xattr feature ext2: Remove redundant check on s_inode_size ext2: set proper return code
2019-03-07Merge tag 'dtype_for_v5.1-rc1' of ↵Linus Torvalds4-46/+113
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull dtype handling cleanups from Jan Kara: "A reworked dtype cleanup patches based on your feedback to the previous version of these. Again the series includes only the generic code and ext2 cleanup as a sample. The plan is to push cleanups for other filesystems separately through respective trees once the generic code lands to reduce the number of conflicts" * tag 'dtype_for_v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext2: use common file type conversion fs: common implementation of file type
2019-03-06Merge tag 'driver-core-5.1-rc1' of ↵Linus Torvalds9-30/+18
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the big driver core patchset for 5.1-rc1 More patches than "normal" here this merge window, due to some work in the driver core by Alexander Duyck to rework the async probe functionality to work better for a number of devices, and independant work from Rafael for the device link functionality to make it work "correctly". Also in here is: - lots of BUS_ATTR() removals, the macro is about to go away - firmware test fixups - ihex fixups and simplification - component additions (also includes i915 patches) - lots of minor coding style fixups and cleanups. All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (65 commits) driver core: platform: remove misleading err_alloc label platform: set of_node in platform_device_register_full() firmware: hardcode the debug message for -ENOENT driver core: Add missing description of new struct device_link field driver core: Fix PM-runtime for links added during consumer probe drivers/component: kerneldoc polish async: Add cmdline option to specify drivers to be async probed driver core: Fix possible supplier PM-usage counter imbalance PM-runtime: Fix __pm_runtime_set_status() race with runtime resume driver: platform: Support parsing GpioInt 0 in platform_get_irq() selftests: firmware: fix verify_reqs() return value Revert "selftests: firmware: remove use of non-standard diff -Z option" Revert "selftests: firmware: add CONFIG_FW_LOADER_USER_HELPER_FALLBACK to config" device: Fix comment for driver_data in struct device kernfs: Allocating memory for kernfs_iattrs with kmem_cache. sysfs: remove unused include of kernfs-internal.h driver core: Postpone DMA tear-down until after devres release driver core: Document limitation related to DL_FLAG_RPM_ACTIVE PM-runtime: Take suppliers into account in __pm_runtime_set_status() device.h: Add __cold to dev_<level> logging functions ...
2019-03-06Merge branch 'akpm' (patches from Andrew)Linus Torvalds22-158/+208
Merge misc updates from Andrew Morton: - a few misc things - ocfs2 updates - most of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (159 commits) tools/testing/selftests/proc/proc-self-syscall.c: remove duplicate include proc: more robust bulk read test proc: test /proc/*/maps, smaps, smaps_rollup, statm proc: use seq_puts() everywhere proc: read kernel cpu stat pointer once proc: remove unused argument in proc_pid_lookup() fs/proc/thread_self.c: code cleanup for proc_setup_thread_self() fs/proc/self.c: code cleanup for proc_setup_self() proc: return exit code 4 for skipped tests mm,mremap: bail out earlier in mremap_to under map pressure mm/sparse: fix a bad comparison mm/memory.c: do_fault: avoid usage of stale vm_area_struct writeback: fix inode cgroup switching comment mm/huge_memory.c: fix "orig_pud" set but not used mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC mm/memcontrol.c: fix bad line in comment mm/cma.c: cma_declare_contiguous: correct err handling mm/page_ext.c: fix an imbalance with kmemleak mm/compaction: pass pgdat to too_many_isolated() instead of zone mm: remove zone_lru_lock() function, access ->lru_lock directly ...
2019-03-06Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this cycle were: - refcount conversions - Solve the rq->leaf_cfs_rq_list can of worms for real. - improve power-aware scheduling - add sysctl knob for Energy Aware Scheduling - documentation updates - misc other changes" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits) kthread: Do not use TIMER_IRQSAFE kthread: Convert worker lock to raw spinlock sched/fair: Use non-atomic cpumask_{set,clear}_cpu() sched/fair: Remove unused 'sd' parameter from select_idle_smt() sched/wait: Use freezable_schedule() when possible sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block sched/fair: Explain LLC nohz kick condition sched/fair: Simplify nohz_balancer_kick() sched/topology: Fix percpu data types in struct sd_data & struct s_data sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument sched/fair: Fix O(nr_cgroups) in the load balancing path sched/fair: Optimize update_blocked_averages() sched/fair: Fix insertion in rq->leaf_cfs_rq_list sched/fair: Add tmp_alone_branch assertion sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock() sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity sched/fair: Update scale invariance of PELT sched/fair: Move the rq_of() helper function sched/core: Convert task_struct.stack_refcount to refcount_t ...
2019-03-06Merge branch 'locking-core-for-linus' of ↵Linus Torvalds1-16/+16
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The biggest part of this tree is the new auto-generated atomics API wrappers by Mark Rutland. The primary motivation was to allow instrumentation without uglifying the primary source code. The linecount increase comes from adding the auto-generated files to the Git space as well: include/asm-generic/atomic-instrumented.h | 1689 ++++++++++++++++-- include/asm-generic/atomic-long.h | 1174 ++++++++++--- include/linux/atomic-fallback.h | 2295 +++++++++++++++++++++++++ include/linux/atomic.h | 1241 +------------ I preferred this approach, so that the full call stack of the (already complex) locking APIs is still fully visible in 'git grep'. But if this is excessive we could certainly hide them. There's a separate build-time mechanism to determine whether the headers are out of date (they should never be stale if we do our job right). Anyway, nothing from this should be visible to regular kernel developers. Other changes: - Add support for dynamic keys, which removes a source of false positives in the workqueue code, among other things (Bart Van Assche) - Updates to tools/memory-model (Andrea Parri, Paul E. McKenney) - qspinlock, wake_q and lockdep micro-optimizations (Waiman Long) - misc other updates and enhancements" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits) locking/lockdep: Shrink struct lock_class_key locking/lockdep: Add module_param to enable consistency checks lockdep/lib/tests: Test dynamic key registration lockdep/lib/tests: Fix run_tests.sh kernel/workqueue: Use dynamic lockdep keys for workqueues locking/lockdep: Add support for dynamic keys locking/lockdep: Verify whether lock objects are small enough to be used as class keys locking/lockdep: Check data structure consistency locking/lockdep: Reuse lock chains that have been freed locking/lockdep: Fix a comment in add_chain_cache() locking/lockdep: Introduce lockdep_next_lockchain() and lock_chain_count() locking/lockdep: Reuse list entries that are no longer in use locking/lockdep: Free lock classes that are no longer in use locking/lockdep: Update two outdated comments locking/lockdep: Make it easy to detect whether or not inside a selftest locking/lockdep: Split lockdep_free_key_range() and lockdep_reset_lock() locking/lockdep: Initialize the locks_before and locks_after lists earlier locking/lockdep: Make zap_class() remove all matching lock order entries locking/lockdep: Reorder struct lock_class members locking/lockdep: Avoid that add_chain_cache() adds an invalid chain to the cache ...
2019-03-05proc: use seq_puts() everywhereAlexey Dobriyan3-10/+10
seq_printf() without format specifiers == faster seq_puts() Link: http://lkml.kernel.org/r/20190114200545.GC9680@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05proc: read kernel cpu stat pointer onceAlexey Dobriyan1-28/+32
Help gcc generate better code: $ ./scripts/bloat-o-meter ../vmlinux-000 ../vmlinux-001 add/remove: 2/2 grow/shrink: 0/1 up/down: 92/-142 (-50) Function old new delta get_iowait_time.isra - 46 +46 get_idle_time.isra - 46 +46 show_stat 1489 1477 -12 get_iowait_time 65 - -65 get_idle_time 65 - -65 Link: http://lkml.kernel.org/r/20190114195907.GA9680@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05proc: remove unused argument in proc_pid_lookup()Zhikang Zhang3-3/+3
[adobriyan@gmail.com: delete "extern" from prototype] Link: http://lkml.kernel.org/r/20190114195635.GA9372@avx2 Signed-off-by: Zhikang Zhang <zhangzhikang1@huawei.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05fs/proc/thread_self.c: code cleanup for proc_setup_thread_self()Chengguang Xu1-8/+8
Remove unnecessary ERR_PTR()/PTR_ERR() cast in proc_setup_thread_self(). Link: http://lkml.kernel.org/r/20190124030150.8472-2-cgxu519@gmx.com Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05fs/proc/self.c: code cleanup for proc_setup_self()Chengguang Xu1-8/+8
Remove unnecessary ERR_PTR()/PTR_ERR() cast in proc_setup_self(). Link: http://lkml.kernel.org/r/20190124030150.8472-1-cgxu519@gmx.com Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfdJoel Fernandes (Google)1-1/+1
Android uses ashmem for sharing memory regions. We are looking forward to migrating all usecases of ashmem to memfd so that we can possibly remove the ashmem driver in the future from staging while also benefiting from using memfd and contributing to it. Note staging drivers are also not ABI and generally can be removed at anytime. One of the main usecases Android has is the ability to create a region and mmap it as writeable, then add protection against making any "future" writes while keeping the existing already mmap'ed writeable-region active. This allows us to implement a usecase where receivers of the shared memory buffer can get a read-only view, while the sender continues to write to the buffer. See CursorWindow documentation in Android for more details: https://developer.android.com/reference/android/database/CursorWindow This usecase cannot be implemented with the existing F_SEAL_WRITE seal. To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal which prevents any future mmap and write syscalls from succeeding while keeping the existing mmap active. A better way to do F_SEAL_FUTURE_WRITE seal was discussed [1] last week where we don't need to modify core VFS structures to get the same behavior of the seal. This solves several side-effects pointed by Andy. self-tests are provided in later patch to verify the expected semantics. [1] https://lore.kernel.org/lkml/20181111173650.GA256781@google.com/ Thanks a lot to Andy for suggestions to improve code. Link: http://lkml.kernel.org/r/20190112203816.85534-2-joel@joelfernandes.org Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: J. Bruce Fields <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc-Andr Lureau <marcandre.lureau@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm: update ptep_modify_prot_commit to take old pte value as argAneesh Kumar K.V1-3/+5
Architectures like ppc64 require to do a conditional tlb flush based on the old and new value of pte. Enable that by passing old pte value as the arg. Link: http://lkml.kernel.org/r/20190116085035.29729-3-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm: update ptep_modify_prot_start/commit to take vm_area_struct as argAneesh Kumar K.V1-2/+2
Patch series "NestMMU pte upgrade workaround for mprotect", v5. We can upgrade pte access (R -> RW transition) via mprotect. We need to make sure we follow the recommended pte update sequence as outlined in commit bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to handle nest MMU hang") for such updates. This patch series does that. This patch (of 5): Some architectures may want to call flush_tlb_range from these helpers. Link: http://lkml.kernel.org/r/20190116085035.29729-2-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05fs: kernfs: add poll file operationJohannes Weiner1-11/+20
Patch series "psi: pressure stall monitors", v3. Android is adopting psi to detect and remedy memory pressure that results in stuttering and decreased responsiveness on mobile devices. Psi gives us the stall information, but because we're dealing with latencies in the millisecond range, periodically reading the pressure files to detect stalls in a timely fashion is not feasible. Psi also doesn't aggregate its averages at a high enough frequency right now. This patch series extends the psi interface such that users can configure sensitive latency thresholds and use poll() and friends to be notified when these are breached. As high-frequency aggregation is costly, it implements an aggregation method that is optimized for fast, short-interval averaging, and makes the aggregation frequency adaptive, such that high-frequency updates only happen while monitored stall events are actively occurring. With these patches applied, Android can monitor for, and ward off, mounting memory shortages before they cause problems for the user. For example, using memory stall monitors in userspace low memory killer daemon (lmkd) we can detect mounting pressure and kill less important processes before device becomes visibly sluggish. In our memory stress testing psi memory monitors produce roughly 10x less false positives compared to vmpressure signals. Having ability to specify multiple triggers for the same psi metric allows other parts of Android framework to monitor memory state of the device and act accordingly. The new interface is straightforward. The user opens one of the pressure files for writing and writes a trigger description into the file descriptor that defines the stall state - some or full, and the maximum stall time over a given window of time. E.g.: /* Signal when stall time exceeds 100ms of a 1s window */ char trigger[] = "full 100000 1000000"; fd = open("/proc/pressure/memory"); write(fd, trigger, sizeof(trigger)); while (poll() >= 0) { ... } close(fd); When the monitored stall state is entered, psi adapts its aggregation frequency according to what the configured time window requires in order to emit event signals in a timely fashion. Once the stalling subsides, aggregation reverts back to normal. The trigger is associated with the open file descriptor. To stop monitoring, the user only needs to close the file descriptor and the trigger is discarded. Patches 1-4 prepare the psi code for polling support. Patch 5 implements the adaptive polling logic, the pressure growth detection optimized for short intervals, and hooks up write() and poll() on the pressure files. The patches were developed in collaboration with Johannes Weiner. This patch (of 5): Kernfs has a standardized poll/notification mechanism for waking all pollers on all fds when a filesystem node changes. To allow polling for custom events, add a .poll callback that can override the default. This is in preparation for pollable cgroup pressure files which have per-fd trigger configurations. Link: http://lkml.kernel.org/r/20190124211518.244221-2-surenb@google.com Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05memcg: localize memcg_kmem_enabled() checkShakeel Butt1-2/+1
Move the memcg_kmem_enabled() checks into memcg kmem charge/uncharge functions, so, the users don't have to explicitly check that condition. This is purely code cleanup patch without any functional change. Only the order of checks in memcg_charge_slab() can potentially be changed but the functionally it will be same. This should not matter as memcg_charge_slab() is not in the hot path. Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05mm: convert PG_balloon to PG_offlineDavid Hildenbrand1-2/+2
PG_balloon was introduced to implement page migration/compaction for pages inflated in virtio-balloon. Nowadays, it is only a marker that a page is part of virtio-balloon and therefore logically offline. We also want to make use of this flag in other balloon drivers - for inflated pages or when onlining a section but keeping some pages offline (e.g. used right now by XEN and Hyper-V via set_online_page_callback()). We are going to expose this flag to dump tools like makedumpfile. But instead of exposing PG_balloon, let's generalize the concept of marking pages as logically offline, so it can be reused for other purposes later on. Rename PG_balloon to PG_offline. This is an indicator that the page is logically offline, the content stale and that it should not be touched (e.g. a hypervisor would have to allocate backing storage in order for the guest to dump an unused page). We can then e.g. exclude such pages from dumps. We replace and reuse KPF_BALLOON (23), as this shouldn't really harm (and for now the semantics stay the same). In following patches, we will make use of this bit also in other balloon drivers. While at it, document PGTABLE. [akpm@linux-foundation.org: fix comment text, per David] Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Konstantin Khlebnikov <koct9i@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Pankaj gupta <pagupta@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Christian Hansen <chansen3@cisco.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Miles Chen <miles.chen@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Dave Young <dyoung@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Juergen Gross <jgross@suse.com> Cc: Julien Freche <jfreche@vmware.com> Cc: Kairui Song <kasong@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Len Brown <len.brown@intel.com> Cc: Lianbo Jiang <lijiang@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xavier Deguillard <xdeguillard@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05fs/file.c: initialize init_files.resize_waitShuriyc Chu1-0/+1
(Taken from https://bugzilla.kernel.org/show_bug.cgi?id=200647) 'get_unused_fd_flags' in kthread cause kernel crash. It works fine on 4.1, but causes crash after get 64 fds. It also cause crash on ubuntu1404/1604/1804, centos7.5, and the crash messages are almost the same. The crash message on centos7.5 shows below: start fd 61 start fd 62 start fd 63 BUG: unable to handle kernel NULL pointer dereference at (null) IP: __wake_up_common+0x2e/0x90 PGD 0 Oops: 0000 [#1] SMP Modules linked in: test(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter devlink sunrpc kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg ppdev pcspkr virtio_balloon parport_pc parport i2c_piix4 joydev ip_tables xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi virtio_scsi virtio_console virtio_net cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel drm ata_piix serio_raw libata virtio_pci virtio_ring i2c_core virtio floppy dm_mirror dm_region_hash dm_log dm_mod CPU: 2 PID: 1820 Comm: test_fd Kdump: loaded Tainted: G OE ------------ 3.10.0-862.3.3.el7.x86_64 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 task: ffff8e92b9431fa0 ti: ffff8e94247a0000 task.ti: ffff8e94247a0000 RIP: 0010:__wake_up_common+0x2e/0x90 RSP: 0018:ffff8e94247a2d18 EFLAGS: 00010086 RAX: 0000000000000000 RBX: ffffffff9d09daa0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffff9d09daa0 RBP: ffff8e94247a2d50 R08: 0000000000000000 R09: ffff8e92b95dfda8 R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9d09daa8 R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff8e9434e80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000017c686000 CR4: 00000000000207e0 Call Trace: __wake_up+0x39/0x50 expand_files+0x131/0x250 __alloc_fd+0x47/0x170 get_unused_fd_flags+0x30/0x40 test_fd+0x12a/0x1c0 [test] kthread+0xd1/0xe0 ret_from_fork_nospec_begin+0x21/0x21 Code: 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 49 89 fc 49 83 c4 08 53 48 83 ec 10 48 8b 47 08 89 55 cc 4c 89 45 d0 <48> 8b 08 49 39 c4 48 8d 78 e8 4c 8d 69 e8 75 08 eb 3b 4c 89 ef RIP __wake_up_common+0x2e/0x90 RSP <ffff8e94247a2d18> CR2: 0000000000000000 This issue exists since CentOS 7.5 3.10.0-862 and CentOS 7.4 (3.10.0-693.21.1 ) is ok. Root cause: the item 'resize_wait' is not initialized before being used. Reported-by: Richard Zhang <zhang.zijian@h3c.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05fs/inode.c: inode_set_flags(): replace opencoded set_mask_bits()Vineet Gupta1-7/+1
It seems that commits 5f16f3225b0624 and 00a1a053ebe5, both with same commitlog ("ext4: atomically set inode->i_flags in ext4_set_inode_flags()") introduced the set_mask_bits API, but somehow missed not using it in ext4 in the end. Also, set_mask_bits() is used in fs quite a bit and we can possibly come up with a generic llsc based implementation (w/o the cmpxchg loop) Link: http://lkml.kernel.org/r/1548275584-18096-3-git-send-email-vgupta@synopsys.com Signed-off-by: Vineet Gupta <vgupta@synopsys.com> Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05ocfs2: Use zero-sized array and struct_size() in kzalloc()Gustavo A. R. Silva1-6/+2
Update the code to use a zero-sized array instead of a pointer in structure ocfs2_slot_info and use struct_size() in kzalloc(). Notice that one of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; void *entry[]; }; instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Link: http://lkml.kernel.org/r/20190108191903.GA22056@embeddedor Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05ocfs2: fix the application IO timeout when fstrim is runningGang He5-63/+106
The user reported this problem, the upper application IO was timeout when fstrim was running on this ocfs2 partition. the application monitoring resource agent considered that this application did not work, then this node was fenced by the cluster brain (e.g. pacemaker). The root cause is that fstrim thread always holds main_bm meta-file related locks until all the cluster groups are trimmed. This patch will make fstrim thread release main_bm meta-file related locks when each cluster group is trimmed, this will let the current application IO has a chance to claim the clusters from main_bm meta-file. Link: http://lkml.kernel.org/r/20190111090014.31645-1-ghe@suse.com Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Changwei Ge <ge.changwei@h3c.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05ocfs2: fix a panic problem caused by o2cb_ctlJia Guo1-6/+8
In the process of creating a node, it will cause NULL pointer dereference in kernel if o2cb_ctl failed in the interval (mkdir, o2cb_set_node_attribute(node_num)] in function o2cb_add_node. The node num is initialized to 0 in function o2nm_node_group_make_item, o2nm_node_group_drop_item will mistake the node number 0 for a valid node number when we delete the node before the node number is set correctly. If the local node number of the current host happens to be 0, cluster->cl_local_node will be set to O2NM_INVALID_NODE_NUM while o2hb_thread still running. The panic stack is generated as follows: o2hb_thread \-o2hb_do_disk_heartbeat \-o2hb_check_own_slot |-slot = &reg->hr_slots[o2nm_this_node()]; //o2nm_this_node() return O2NM_INVALID_NODE_NUM We need to check whether the node number is set when we delete the node. Link: http://lkml.kernel.org/r/133d8045-72cc-863e-8eae-5013f9f6bc51@huawei.com Signed-off-by: Jia Guo <guojia12@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Acked-by: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05Merge branch 'timers-2038-for-linus' of ↵Linus Torvalds4-14/+14
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull year 2038 updates from Thomas Gleixner: "Another round of changes to make the kernel ready for 2038. After lots of preparatory work this is the first set of syscalls which are 2038 safe: 403 clock_gettime64 404 clock_settime64 405 clock_adjtime64 406 clock_getres_time64 407 clock_nanosleep_time64 408 timer_gettime64 409 timer_settime64 410 timerfd_gettime64 411 timerfd_settime64 412 utimensat_time64 413 pselect6_time64 414 ppoll_time64 416 io_pgetevents_time64 417 recvmmsg_time64 418 mq_timedsend_time64 419 mq_timedreceiv_time64 420 semtimedop_time64 421 rt_sigtimedwait_time64 422 futex_time64 423 sched_rr_get_interval_time64 The syscall numbers are identical all over the architectures" * 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) riscv: Use latest system call ABI checksyscalls: fix up mq_timedreceive and stat exceptions unicore32: Fix __ARCH_WANT_STAT64 definition asm-generic: Make time32 syscall numbers optional asm-generic: Drop getrlimit and setrlimit syscalls from default list 32-bit userspace ABI: introduce ARCH_32BIT_OFF_T config option compat ABI: use non-compat openat and open_by_handle_at variants y2038: add 64-bit time_t syscalls to all 32-bit architectures y2038: rename old time and utime syscalls y2038: remove struct definition redirects y2038: use time32 syscall names on 32-bit syscalls: remove obsolete __IGNORE_ macros y2038: syscalls: rename y2038 compat syscalls x86/x32: use time64 versions of sigtimedwait and recvmmsg timex: change syscalls to use struct __kernel_timex timex: use __kernel_timex internally sparc64: add custom adjtimex/clock_adjtime functions time: fix sys_timer_settime prototype time: Add struct __kernel_timex time: make adjtime compat handling available for 32 bit ...
2019-03-05Merge branch 'irq-core-for-linus' of ↵Linus Torvalds1-3/+26
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "The interrupt departement delivers this time: - New infrastructure to manage NMIs on platforms which have a sane NMI delivery, i.e. identifiable NMI vectors instead of a single lump. - Simplification of the interrupt affinity management so drivers don't have to implement ugly loops around the PCI/MSI enablement. - Speedup for interrupt statistics in /proc/stat - Provide a function to retrieve the default irq domain - A new interrupt controller for the Loongson LS1X platform - Affinity support for the SiFive PLIC - Better support for the iMX irqsteer driver - NUMA aware memory allocations for GICv3 - The usual small fixes, improvements and cleanups all over the place" * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) irqchip/imx-irqsteer: Add multi output interrupts support irqchip/imx-irqsteer: Change to use reg_num instead of irq_group dt-bindings: irq: imx-irqsteer: Add multi output interrupts support dt-binding: irq: imx-irqsteer: Use irq number instead of group number irqchip/brcmstb-l2: Use _irqsave locking variants in non-interrupt code irqchip/gicv3-its: Use NUMA aware memory allocation for ITS tables irqdomain: Allow the default irq domain to be retrieved irqchip/sifive-plic: Implement irq_set_affinity() for SMP host irqchip/sifive-plic: Differentiate between PLIC handler and context irqchip/sifive-plic: Add warning in plic_init() if handler already present irqchip/sifive-plic: Pre-compute context hart base and enable base PCI/MSI: Remove obsolete sanity checks for multiple interrupt sets genirq/affinity: Remove the leftovers of the original set support nvme-pci: Simplify interrupt allocation genirq/affinity: Add new callback for (re)calculating interrupt sets genirq/affinity: Store interrupt sets size in struct irq_affinity genirq/affinity: Code consolidation irqchip/irq-sifive-plic: Check and continue in case of an invalid cpuid. irqchip/i8259: Fix shutdown order by moving syscore_ops registration dt-bindings: interrupt-controller: loongson ls1x intc ...
2019-03-05a.out: remove core dumping supportLinus Torvalds1-83/+0
We're (finally) phasing out a.out support for good. As Borislav Petkov points out, we've supported ELF binaries for about 25 years by now, and coredumping in particular has bitrotted over the years. None of the tool chains even support generating a.out binaries any more, and the plan is to deprecate a.out support entirely for the kernel. But I want to start with just removing the core dumping code, because I can still imagine that somebody actually might want to support a.out as a simpler biinary format. Particularly if you generate some random binaries on the fly, ELF is a much more complicated format (admittedly ELF also does have a lot of toolchain support, mitigating that complexity a lot and you really should have moved over in the last 25 years). So it's at least somewhat possible that somebody out there has some workflow that still involves generating and running a.out executables. In contrast, it's very unlikely that anybody depends on debugging any legacy a.out core files. But regardless, I want this phase-out to be done in two steps, so that we can resurrect a.out support (if needed) without having to resurrect the core file dumping that is almost certainly not needed. Jann Horn pointed to the <asm/a.out-core.h> file that my first trivial cut at this had missed. And Alan Cox points out that the a.out binary loader _could_ be done in user space if somebody wants to, but we might keep just the loader in the kernel if somebody really wants it, since the loader isn't that big and has no really odd special cases like the core dumping does. Acked-by: Borislav Petkov <bp@alien8.de> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> Cc: Jann Horn <jannh@google.com> Cc: Richard Weinberger <richard@nod.at> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05Merge branch 'linus' of ↵Linus Torvalds2-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto update from Herbert Xu: "API: - Add helper for simple skcipher modes. - Add helper to register multiple templates. - Set CRYPTO_TFM_NEED_KEY when setkey fails. - Require neither or both of export/import in shash. - AEAD decryption test vectors are now generated from encryption ones. - New option CONFIG_CRYPTO_MANAGER_EXTRA_TESTS that includes random fuzzing. Algorithms: - Conversions to skcipher and helper for many templates. - Add more test vectors for nhpoly1305 and adiantum. Drivers: - Add crypto4xx prng support. - Add xcbc/cmac/ecb support in caam. - Add AES support for Exynos5433 in s5p. - Remove sha384/sha512 from artpec7 as hardware cannot do partial hash" [ There is a merge of the Freescale SoC tree in order to pull in changes required by patches to the caam/qi2 driver. ] * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (174 commits) crypto: s5p - add AES support for Exynos5433 dt-bindings: crypto: document Exynos5433 SlimSSS crypto: crypto4xx - add missing of_node_put after of_device_is_available crypto: cavium/zip - fix collision with generic cra_driver_name crypto: af_alg - use struct_size() in sock_kfree_s() crypto: caam - remove redundant likely/unlikely annotation crypto: s5p - update iv after AES-CBC op end crypto: x86/poly1305 - Clear key material from stack in SSE2 variant crypto: caam - generate hash keys in-place crypto: caam - fix DMA mapping xcbc key twice crypto: caam - fix hash context DMA unmap size hwrng: bcm2835 - fix probe as platform device crypto: s5p-sss - Use AES_BLOCK_SIZE define instead of number crypto: stm32 - drop pointless static qualifier in stm32_hash_remove() crypto: chelsio - Fixed Traffic Stall crypto: marvell - Remove set but not used variable 'ivsize' crypto: ccp - Update driver messages to remove some confusion crypto: adiantum - add 1536 and 4096-byte test vectors crypto: nhpoly1305 - add a test vector with len % 16 != 0 crypto: arm/aes-ce - update IV after partial final CTR block ...
2019-03-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds1-2/+2
Pull networking updates from David Miller: "Here we go, another merge window full of networking and #ebpf changes: 1) Snoop DHCPACKS in batman-adv to learn MAC/IP pairs in the DHCP range without dealing with floods of ARP traffic, from Linus Lüssing. 2) Throttle buffered multicast packet transmission in mt76, from Felix Fietkau. 3) Support adaptive interrupt moderation in ice, from Brett Creeley. 4) A lot of struct_size conversions, from Gustavo A. R. Silva. 5) Add peek/push/pop commands to bpftool, as well as bash completion, from Stanislav Fomichev. 6) Optimize sk_msg_clone(), from Vakul Garg. 7) Add SO_BINDTOIFINDEX, from David Herrmann. 8) Be more conservative with local resends due to local congestion, from Yuchung Cheng. 9) Allow vetoing of unsupported VXLAN FDBs, from Petr Machata. 10) Add health buffer support to devlink, from Eran Ben Elisha. 11) Add TXQ scheduling API to mac80211, from Toke Høiland-Jørgensen. 12) Add statistics to basic packet scheduler filter, from Cong Wang. 13) Add GRE tunnel support for mlxsw Spectrum-2, from Nir Dotan. 14) Lots of new IP tunneling forwarding tests, also from Nir Dotan. 15) Add 3ad stats to bonding, from Nikolay Aleksandrov. 16) Lots of probing improvements for bpftool, from Quentin Monnet. 17) Various nfp drive #ebpf JIT improvements from Jakub Kicinski. 18) Allow #ebpf programs to access gso_segs from skb shared info, from Eric Dumazet. 19) Add sock_diag support for AF_XDP sockets, from Björn Töpel. 20) Support 22260 iwlwifi devices, from Luca Coelho. 21) Use rbtree for ipv6 defragmentation, from Peter Oskolkov. 22) Add JMP32 instruction class support to #ebpf, from Jiong Wang. 23) Add spinlock support to #ebpf, from Alexei Starovoitov. 24) Support 256-bit keys and TLS 1.3 in ktls, from Dave Watson. 25) Add device infomation API to devlink, from Jakub Kicinski. 26) Add new timestamping socket options which are y2038 safe, from Deepa Dinamani. 27) Add RX checksum offloading for various sh_eth chips, from Sergei Shtylyov. 28) Flow offload infrastructure, from Pablo Neira Ayuso. 29) Numerous cleanups, improvements, and bug fixes to the PHY layer and many drivers from Heiner Kallweit. 30) Lots of changes to try and make packet scheduler classifiers run lockless as much as possible, from Vlad Buslov. 31) Support BCM957504 chip in bnxt_en driver, from Erik Burrows. 32) Add concurrency tests to tc-tests infrastructure, from Vlad Buslov. 33) Add hwmon support to aquantia, from Heiner Kallweit. 34) Allow 64-bit values for SO_MAX_PACING_RATE, from Eric Dumazet. And I would be remiss if I didn't thank the various major networking subsystem maintainers for integrating much of this work before I even saw it. Alexei Starovoitov, Daniel Borkmann, Pablo Neira Ayuso, Johannes Berg, Kalle Valo, and many others. Thank you!" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2207 commits) net/sched: avoid unused-label warning net: ignore sysctl_devconf_inherit_init_net without SYSCTL phy: mdio-mux: fix Kconfig dependencies net: phy: use phy_modify_mmd_changed in genphy_c45_an_config_aneg net: dsa: mv88e6xxx: add call to mv88e6xxx_ports_cmode_init to probe for new DSA framework selftest/net: Remove duplicate header sky2: Disable MSI on Dell Inspiron 1545 and Gateway P-79 net/mlx5e: Update tx reporter status in case channels were successfully opened devlink: Add support for direct reporter health state update devlink: Update reporter state to error even if recover aborted sctp: call iov_iter_revert() after sending ABORT team: Free BPF filter when unregistering netdev ip6mr: Do not call __IP6_INC_STATS() from preemptible context isdn: mISDN: Fix potential NULL pointer dereference of kzalloc net: dsa: mv88e6xxx: support in-band signalling on SGMII ports with external PHYs cxgb4/chtls: Prefix adapter flags with CXGB4 net-sysfs: Switch to bitmap_zalloc() mellanox: Switch to bitmap_zalloc() bpf: add test cases for non-pointer sanitiation logic mlxsw: i2c: Extend initialization by querying resources data ...
2019-03-04fs: Make splice() and tee() take into account O_NONBLOCK flag on pipesSlavomir Kaslev1-0/+12
The current implementation of splice() and tee() ignores O_NONBLOCK set on pipe file descriptors and checks only the SPLICE_F_NONBLOCK flag for blocking on pipe arguments. This is inconsistent since splice()-ing from/to non-pipe file descriptors does take O_NONBLOCK into consideration. Fix this by promoting O_NONBLOCK, when set on a pipe, to SPLICE_F_NONBLOCK. Some context for how the current implementation of splice() leads to inconsistent behavior. In the ongoing work[1] to add VM tracing capability to trace-cmd we stream tracing data over named FIFOs or vsockets from guests back to the host. When we receive SIGINT from user to stop tracing, we set O_NONBLOCK on the input file descriptor and set SPLICE_F_NONBLOCK for the next call to splice(). If splice() was blocked waiting on data from the input FIFO, after SIGINT splice() restarts with the same arguments (no SPLICE_F_NONBLOCK) and blocks again instead of returning -EAGAIN when no data is available. This differs from the splice() behavior when reading from a vsocket or when we're doing a traditional read()/write() loop (trace-cmd's --nosplice argument). With this patch applied we get the same behavior in all situations after setting O_NONBLOCK which also matches the behavior of doing a read()/write() loop instead of splice(). This change does have potential of breaking users who don't expect EAGAIN from splice() when SPLICE_F_NONBLOCK is not set. OTOH programs that set O_NONBLOCK and don't anticipate EAGAIN are arguably buggy[2]. [1] https://github.com/skaslev/trace-cmd/tree/vsock [2] https://github.com/torvalds/linux/blob/d47e3da1759230e394096fd742aad423c291ba48/fs/read_write.c#L1425 Signed-off-by: Slavomir Kaslev <kaslevs@vmware.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller4-6/+13
2019-03-04Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds4-7/+17
Pull vfs fixes from Al Viro: "Assorted fixes that sat in -next for a while, all over the place" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: aio: Fix locking in aio_poll() exec: Fix mem leak in kernel_read_file copy_mount_string: Limit string length to PATH_MAX cgroup: saner refcounting for cgroup_root fix cgroup_do_mount() handling of failure exits