path: root/fs
Age | Commit message | Author | Files | Lines
2017-09-06 | ceph: fix "range cyclic" mode writepages | Yan, Zheng | 1 | -17/+24
In range cyclic mode, writepages() should first write dirty pages in range [writeback_index, (pgoff_t)-1], then write pages in range [0, writeback_index - 1]. Besides, if writepages() encounters a page that is beyond EOF, it should restart from the beginning. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
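The scan order described in this message can be sketched as follows. This is a minimal userspace illustration, not the code in ceph_writepages_start(): `struct wb_range` and `range_cyclic_passes()` are hypothetical names, and `pgoff_t` is redefined locally so the block compiles outside the kernel.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long pgoff_t;	/* local stand-in for the kernel type */

/* Hypothetical helper type: one inclusive page-index range to scan. */
struct wb_range {
	pgoff_t start;
	pgoff_t end;
};

/*
 * Fill out[] with the ranges to scan, in order, and return how many
 * there are. The first pass always runs from writeback_index to the
 * largest index; the wrap-around pass exists only when there are
 * pages below writeback_index.
 */
static int range_cyclic_passes(pgoff_t writeback_index, struct wb_range out[2])
{
	int n = 0;

	assert(out != NULL);

	out[n].start = writeback_index;
	out[n].end = (pgoff_t)-1;
	n++;

	if (writeback_index > 0) {
		out[n].start = 0;
		out[n].end = writeback_index - 1;
		n++;
	}
	return n;
}
```

With writeback_index equal to 5, the helper yields [5, (pgoff_t)-1] followed by [0, 4]; with writeback_index equal to 0 there is only one pass.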
2017-09-06 | ceph: cleanup local variables in ceph_writepages_start() | Yan, Zheng | 1 | -12/+9
Remove two variables and define variables of same type together. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: optimize pagevec iterating in ceph_writepages_start() | Yan, Zheng | 1 | -29/+25
ceph_writepages_start() supports writing non-contiguous pages. If it encounters a non-dirty or non-writeable page in the pagevec, it can continue to check the remaining pages in the pagevec. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: make writepage_nounlock() invalidate page that beyonds EOF | Yan, Zheng | 1 | -18/+32
Otherwise, the page is left in a state where it is associated with a snapc, but (PageDirty(page) || PageWriteback(page)) is false. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: properly get capsnap's size in get_oldest_context() | Yan, Zheng | 1 | -57/+80
A capsnap's size is set by __ceph_finish_cap_snap(). If the capsnap is still being written, its size is zero. In this case, get_oldest_context() should read i_size. Besides, ceph_writepages_start() should re-check the capsnap's size after dirty pages get locked. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: remove stale check in ceph_invalidatepage() | Yan, Zheng | 1 | -8/+1
Both set_page_dirty and truncate_complete_page should be called on a locked page, so they can't race with each other. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: queue cap snap only when snap realm's context changes | Yan, Zheng | 1 | -21/+16
If we create a capsnap when the snap realm's context does not change, the new capsnap's snapc is equal to ci->i_head_snapc. Page writeback code can't differentiate dirty pages associated with the new capsnap from dirty pages associated with i_head_snapc. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: handle race between vmtruncate and queuing cap snap | Yan, Zheng | 1 | -1/+12
It's possible that we create a cap snap while there is pending vmtruncate (truncate hasn't been processed by worker thread). We should truncate dirty pages beyond capsnap->size in that case. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: fix message order check in handle_cap_export() | Yan, Zheng | 1 | -1/+1
If caps for the importer mds exist but the cap ID mismatches, the client should have received the corresponding import message, because the cap ID does not change as long as the client holds the caps. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: fix NULL pointer dereference in ceph_flush_snaps() | Yan, Zheng | 1 | -1/+1
Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: adjust 36 checks for NULL pointers | Markus Elfring | 10 | -36/+36
The script “checkpatch.pl” pointed information out like the following. Comparison to NULL could be written ... Thus fix the affected source code places. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: delete an unnecessary return statement in update_dentry_lease() | Markus Elfring | 1 | -1/+0
The script "checkpatch.pl" pointed information out like the following. WARNING: void function return statements are not generally useful Thus remove such a statement in the affected function. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: ENOMEM pr_err in __get_or_create_frag() is redundant | Markus Elfring | 1 | -5/+2
Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: check negative offsets in ceph_llseek() | Luis Henriques | 1 | -2/+2
When a user requests SEEK_HOLE or SEEK_DATA with a negative offset ceph_llseek should return -ENXIO. Currently -EINVAL is being returned for SEEK_DATA and 0 for SEEK_HOLE. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
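The intended behaviour can be sketched in userspace as below. This is only an illustration under stated assumptions: `demo_seek_hole_data()` and the `DEMO_SEEK_*` constants are hypothetical names, and the sketch assumes the file is treated as having no holes other than the implicit one at EOF. It is not the body of ceph_llseek().

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the SEEK_DATA and SEEK_HOLE whence values. */
enum demo_whence { DEMO_SEEK_DATA, DEMO_SEEK_HOLE };

/*
 * Return the new file position, or -ENXIO when the requested offset is
 * negative or at/after EOF. Assumption of this sketch: the whole file is
 * data, so SEEK_DATA stays where it is and SEEK_HOLE lands on EOF.
 */
static long long demo_seek_hole_data(enum demo_whence whence,
				     long long offset, long long i_size)
{
	assert(i_size >= 0);

	if (offset < 0 || offset >= i_size)
		return -ENXIO;

	return whence == DEMO_SEEK_DATA ? offset : i_size;
}
```

The point of the sketch is that both whence values share a single range check, so a negative offset produces the same error for SEEK_DATA and SEEK_HOLE.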
2017-09-06 | ceph: more accurate statfs | Douglas Fuller | 1 | -1/+8
Improve accuracy of statfs reporting for Ceph filesystems comprising exactly one data pool. In this case, the Ceph monitor can now report the space usage for the single data pool instead of the global data for the entire Ceph cluster. Include support for this message in mon_client and leverage it in ceph/super. Signed-off-by: Douglas Fuller <dfuller@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: properly set snap follows for cap reconnect | Yan, Zheng | 1 | -1/+1
Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: don't use CEPH_OSD_FLAG_ORDERSNAP | Yan, Zheng | 1 | -3/+3
An inode can be moved between snap realms. It's possible that an inode is moved into a snap realm whose seq number is smaller than the old snap realm's. So there is no guarantee that the seq number of an inode's snap context always increases. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: include snapc in debug message of write | Yan, Zheng | 2 | -7/+8
Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: make sure flushsnap messages are sent in proper order | Yan, Zheng | 1 | -5/+7
Before sending new flushsnap message, check if there are old flushsnap messages that need to be re-sent. If there are, re-send old messages first. This guarantees ordering of flushsnap messages. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: fix -EOLDSNAPC handling | Yan, Zheng | 1 | -10/+7
Need to drop cap reference before retry. Besides, it's better to redo file write checks for each retry because we re-lock inode. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: send LSSNAP request to auth mds of directory inode | Yan, Zheng | 2 | -5/+14
A snapdir inode has no capability. __choose_mds() should choose the mds based on the capabilities of the snapdir's parent inode. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: don't fill readdir cache for LSSNAP reply | Yan, Zheng | 1 | -8/+11
Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: cleanup ceph_readdir_prepopulate() | Yan, Zheng | 1 | -7/+0
In LSSNAP case, req->r_dentry is already set to snapdir dentry. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: use errseq_t for writeback error reporting | Jeff Layton | 1 | -1/+1
Ensure that when writeback errors are marked, we report them to all file descriptions that were open at the time of the error. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: new cap message flags indicate if there is pending capsnap | Yan, Zheng | 1 | -1/+4
These flags tell the mds explicitly if there is a pending capsnap. Without this explicit notification, the mds can only infer whether the client has a pending capsnap. The method the mds uses is inefficient and error-prone. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: nuke startsync op | Yanhu Cao | 2 | -22/+4
startsync is a no-op, has been for years. Remove it. Link: http://tracker.ceph.com/issues/20604 Signed-off-by: Yanhu Cao <gmayyyha@gmail.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: validate correctness of some mount options | Yan, Zheng | 2 | -7/+23
Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: limit osd write size | Yan, Zheng | 4 | -5/+11
The OSD has a configurable limit on max write size. The OSD returns an error if the write request size is larger than the limit. For now, set max write size to CEPH_MSG_MAX_DATA_LEN. It should be small enough. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: limit osd read size to CEPH_MSG_MAX_DATA_LEN | Yan, Zheng | 4 | -18/+15
libceph returns -EIO when read size > CEPH_MSG_MAX_DATA_LEN. Link: http://tracker.ceph.com/issues/20528 Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 | ceph: remove unused cap_release_safety mount option | Yan, Zheng | 2 | -7/+0
Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-01 | Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 | Linus Torvalds | 2 | -2/+27
Pull cifs version warning fix from Steve French: "As requested, additional kernel warning messages to clarify the default dialect changes" [ There is still some discussion about exactly which version should be the new default. Longer-term we have auto-negotiation coming, but that's not there yet.. - Linus ] * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: Fix warning messages when mounting to older servers
2017-09-01 | epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove() | Oleg Nesterov | 1 | -16/+26
The race was introduced by me in commit 971316f0503a ("epoll: ep_unregister_pollwait() can use the freed pwq->whead"). I did not realize that nothing can protect eventpoll after ep_poll_callback() sets ->whead = NULL, only whead->lock can save us from the race with ep_free() or ep_remove().

Move ->whead = NULL to the end of ep_poll_callback() and add the necessary barriers.

TODO: cleanup the ewake/EPOLLEXCLUSIVE logic, it was confusing even before this patch.

Hopefully this explains the use-after-free reported by syzkaller:

    BUG: KASAN: use-after-free in debug_spin_lock_before
    ...
     _raw_spin_lock_irqsave+0x4a/0x60 kernel/locking/spinlock.c:159
     ep_poll_callback+0x29f/0xff0 fs/eventpoll.c:1148

this is spin_lock(eventpoll->lock),

    ...
    Freed by task 17774:
    ...
     kfree+0xe8/0x2c0 mm/slub.c:3883
     ep_free+0x22c/0x2a0 fs/eventpoll.c:865

Fixes: 971316f0503a ("epoll: ep_unregister_pollwait() can use the freed pwq->whead")
Reported-by: 范龙飞 <long7573@126.com>
Cc: stable@vger.kernel.org
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-01 | Merge tag 'ceph-for-4.13-rc8' of git://github.com/ceph/ceph-client | Linus Torvalds | 2 | -18/+18
Pull ceph fix from Ilya Dryomov: "ceph fscache page locking fix from Zheng, marked for stable" * tag 'ceph-for-4.13-rc8' of git://github.com/ceph/ceph-client: ceph: fix readpage from fscache
2017-09-01 | Fix warning messages when mounting to older servers | Steve French | 2 | -2/+27
When mounting to older servers, such as Windows XP (or even Windows 7), the limited error messages that can be passed back to user space can get confusing since the default dialect has changed from SMB1 (CIFS) to more secure SMB3 dialect. Log additional information when the user chooses to use the default dialects and when the server does not support the dialect requested. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-08-31 | Merge tag 'cifs-fixes-for-4.13-rc7-and-stable' of git://git.samba.org/sfrench/cifs-2.6 | Linus Torvalds | 2 | -3/+3
Pull cifs fixes from Steve French: "Two cifs bug fixes for stable"

* tag 'cifs-fixes-for-4.13-rc7-and-stable' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: remove endian related sparse warning
  CIFS: Fix maximum SMB2 header size
2017-08-31 | Merge branch 'mmu_notifier_fixes' | Linus Torvalds | 1 | -8/+11
Merge mmu_notifier fixes from Jérôme Glisse:

"The invalidate_page callback suffered from two pitfalls. First, it used to happen after the page table lock was released, and thus a new page might have been set up for the virtual address before the call to invalidate_page().

This was, in a weird way, fixed by commit c7ab0d2fdc84 ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()"), which moved the callback under the page table lock. That also broke several existing users of the mmu_notifier API that assumed they could sleep inside this callback.

The second pitfall was that invalidate_page was the only callback that did not take an address range for invalidation but was instead given an address and a page. Many of the callback implementers assumed this could never be THP and thus failed to invalidate the appropriate range for THP pages.

By killing this callback we unify the mmu_notifier callback API to always take a virtual address range as input.

There are now two clear APIs (I am not mentioning the youngness API, which is seldom used):

 - invalidate_range_start()/end() callbacks (which allow you to sleep)
 - invalidate_range(), where you cannot sleep but which happens right after the page table update, under the page table lock

Note that a lot of existing users feel broken with respect to range_start/range_end. Many users only have a range_start() callback, but there is nothing preventing them from undoing what was invalidated in their range_start() callback after it returns but before any CPU page table update takes place.

The code pattern used in kvm or umem odp is an example of how to properly avoid such a race. In a nutshell, use some kind of sequence number and an active range invalidation counter to block anything that might undo what the range_start() callback did.

If you do not care about keeping fully in sync with the CPU page table (i.e. you can live with the CPU page table pointing to a new, different page for a given virtual address), then you can take a reference on the pages inside the range_start callback and drop it in range_end or when your driver is done with those pages.

The last alternative is to use invalidate_range() if you can do invalidation without sleeping, as the invalidate_range() callback happens under the CPU page table spinlock right after the page table is updated.

The first two patches convert existing mmu_notifier_invalidate_page() calls to mmu_notifier_invalidate_range() and bracket those calls with calls to mmu_notifier_invalidate_range_start()/end().

The next ten patches remove existing invalidate_page() callbacks, as they can no longer happen.

Finally, the last patch removes the invalidate_page() callback completely so it can RIP.

Changes since v1:
 - remove more dead code in kvm (no testing impact)
 - more accurate end address computation (patch 2) in page_mkclean_one and try_to_unmap_one
 - added tested-by/reviewed-by gotten so far"

* emailed patches from Jérôme Glisse <jglisse@redhat.com>:
  mm/mmu_notifier: kill invalidate_page
  KVM: update to new mmu_notifier semantic v2
  xen/gntdev: update to new mmu_notifier semantic
  sgi-gru: update to new mmu_notifier semantic
  misc/mic/scif: update to new mmu_notifier semantic
  iommu/intel: update to new mmu_notifier semantic
  iommu/amd: update to new mmu_notifier semantic
  IB/hfi1: update to new mmu_notifier semantic
  IB/umem: update to new mmu_notifier semantic
  drm/amdgpu: update to new mmu_notifier semantic
  powerpc/powernv: update to new mmu_notifier semantic
  mm/rmap: update to new mmu_notifier semantic v2
  dax: update to new mmu_notifier semantic
2017-08-31 | jfs should use MAX_LFS_FILESIZE when calculating s_maxbytes | Dave Kleikamp | 1 | -9/+3
jfs had previously avoided the use of MAX_LFS_FILESIZE because it hadn't accounted for the whole 32-bit index range on 32-bit systems. That has been fixed by commit 0cc3b0ec23ce ("Clarify (and fix) MAX_LFS_FILESIZE macros"), so we can simplify the code now. Suggested by Andreas Dilger. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: jfs-discussion@lists.sourceforge.net Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-31 | dax: update to new mmu_notifier semantic | Jérôme Glisse | 1 | -8/+11
Replace all mmu_notifier_invalidate_page() calls with *_invalidate_range() and make sure each is bracketed by calls to *_invalidate_range_start()/end(). Note that because we cannot presume the pmd value or pte value, we have to assume the worst and unconditionally report an invalidation as happening. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Bernhard Held <berny156@gmx.de> Cc: Adam Borowski <kilobyte@angband.pl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: axie <axie@amd.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-01 | ceph: fix readpage from fscache | Yan, Zheng | 2 | -18/+18
ceph_readpage() unlocks the page prematurely in the case that the page is being read from fscache. The caller of readpage expects that the page is uptodate when it gets unlocked. So the page should get unlocked by the completion callback of fscache_read_or_alloc_pages(). Cc: stable@vger.kernel.org # 4.1+, needs backporting for < 4.7 Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-08-30 | CIFS: remove endian related sparse warning | Steve French | 1 | -1/+1
A recent patch, "cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup()", introduced an endian warning. Signed-off-by: Steve French <smfrench@gmail.com> CC: Ronnie Sahlberg <lsahlber@redhat.com> CC: Stable <stable@vger.kernel.org> Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-08-30 | CIFS: Fix maximum SMB2 header size | Pavel Shilovsky | 1 | -2/+2
Currently the maximum size of the SMB2/3 header is set incorrectly, which leads to directory listing operations hanging on encrypted SMB3 connections. Fix this by setting the maximum size to 176 bytes, calculated as RFC1002 length field size (4) + transform header size (52) + SMB2 header size (64) + create response size (56). Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Sachin Prabhu <sprabhu@redhat.com>
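Adding up the four components named in the message gives 176 bytes (0xb0). The short sketch below spells out that arithmetic; the `DEMO_*` names and `demo_max_smb2_hdr_size()` are hypothetical and only mirror the figures quoted in the message, they are not the identifiers used in the cifs sources.

```c
#include <assert.h>

/* Sizes in bytes, taken from the commit message; names are illustrative. */
enum {
	DEMO_RFC1002_LEN_FIELD = 4,	/* RFC1002 length field */
	DEMO_TRANSFORM_HDR = 52,	/* SMB3 transform (encryption) header */
	DEMO_SMB2_HDR = 64,		/* SMB2 header */
	DEMO_CREATE_RSP = 56,		/* create response */
};

/* Compile-time check that the components sum to 176 (0xb0). */
static_assert(DEMO_RFC1002_LEN_FIELD + DEMO_TRANSFORM_HDR +
	      DEMO_SMB2_HDR + DEMO_CREATE_RSP == 0xb0,
	      "header components must sum to 176 bytes");

static unsigned int demo_max_smb2_hdr_size(void)
{
	return DEMO_RFC1002_LEN_FIELD + DEMO_TRANSFORM_HDR +
	       DEMO_SMB2_HDR + DEMO_CREATE_RSP;
}
```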
2017-08-28 | fs/select: Fix memory corruption in compat_get_fd_set() | Helge Deller | 1 | -5/+1
Commit 464d62421cb8 ("select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()") changed the calculation of how many bytes need to be zeroed when userspace hands over a NULL pointer for an fdset array in the select syscall.

The calculation in compat_get_fd_set() was wrongly changed from

    memset(fdset, 0, ((nr + 1) & ~1)*sizeof(compat_ulong_t));

to

    memset(fdset, 0, ALIGN(nr, BITS_PER_LONG));

ALIGN(nr, BITS_PER_LONG) calculates the number of _bits_ which need to be zeroed in the target fdset array (rounded up to the next full unsigned long). But the memset() call expects the number of _bytes_ to be zeroed.

This leads to clearing more memory than wanted (on the stack area or even in kmalloc()ed memory areas) and to random kernel crashes, as we have seen on the parisc platform.

The correct change should have been

    memset(fdset, 0, (ALIGN(nr, BITS_PER_LONG) / BITS_PER_LONG) * BYTES_PER_LONG);

which is the same as can be achieved with a call to zero_fd_set(nr, fdset).

Fixes: 464d62421cb8 ("select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()")
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
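The bits-versus-bytes mix-up can be reproduced in a few lines of userspace C. The sketch below is illustrative only: the `DEMO_*` macros and `demo_fdset_*()` helpers are hypothetical stand-ins for the kernel's ALIGN(), BITS_PER_LONG and BYTES_PER_LONG, and they use the native `unsigned long` rather than `compat_ulong_t`.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_BYTES_PER_LONG	(sizeof(unsigned long))
/* Round x up to a multiple of a (division-based, so a need not be 2^n). */
#define DEMO_ALIGN(x, a)	((((x) + (a) - 1) / (a)) * (a))

/* Number of bits covered, rounded up to whole longs: what ALIGN() yields. */
static size_t demo_fdset_bits(size_t nr)
{
	return DEMO_ALIGN(nr, DEMO_BITS_PER_LONG);
}

/* Number of bytes that memset() actually has to clear. */
static size_t demo_fdset_bytes(size_t nr)
{
	size_t bits = demo_fdset_bits(nr);

	assert(bits % DEMO_BITS_PER_LONG == 0);
	return bits / DEMO_BITS_PER_LONG * DEMO_BYTES_PER_LONG;
}
```

Passing demo_fdset_bits(nr) to memset() as a length would clear CHAR_BIT times as many bytes as demo_fdset_bytes(nr), which is the overrun the message describes.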
2017-08-25 | Merge branch 'akpm' (patches from Andrew) | Linus Torvalds | 1 | -0/+10
Merge misc fixes from Andrew Morton: "6 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/memblock.c: reversed logic in memblock_discard() fork: fix incorrect fput of ->exe_file causing use-after-free mm/madvise.c: fix freeing of locked page with MADV_FREE dax: fix deadlock due to misaligned PMD faults mm, shmem: fix handling /sys/kernel/mm/transparent_hugepage/shmem_enabled PM/hibernate: touch NMI watchdog when creating snapshot
2017-08-25 | Merge tag 'nfsd-4.13-2' of git://linux-nfs.org/~bfields/linux | Linus Torvalds | 1 | -4/+2
Pull nfsd fixes from Bruce Fields: "Two nfsd bugfixes, neither 4.13 regressions, but both potentially serious" * tag 'nfsd-4.13-2' of git://linux-nfs.org/~bfields/linux: net: sunrpc: svcsock: fix NULL-pointer exception nfsd: Limit end of page list when decoding NFSv4 WRITE
2017-08-25 | Merge tag 'cifs-fixes-for-4.13-rc6-and-stable' of git://git.samba.org/sfrench/cifs-2.6 | Linus Torvalds | 2 | -8/+14
Pull cifs fixes from Steve French: "Some bug fixes for stable for cifs"

* tag 'cifs-fixes-for-4.13-rc6-and-stable' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup()
  cifs: Fix df output for users with quota limits
2017-08-25 | dax: fix deadlock due to misaligned PMD faults | Ross Zwisler | 1 | -0/+10
In DAX there are two separate places where the 2MiB range of a PMD is defined. The first is in the page tables, where a PMD mapping inserted for a given address spans from (vmf->address & PMD_MASK) to ((vmf->address & PMD_MASK) + PMD_SIZE - 1). That is, from the 2MiB boundary below the address to the 2MiB boundary above the address. So, for example, a fault at address 3MiB (0x30 0000) falls within the PMD that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000). The second PMD range is in the mapping->page_tree, where a given file offset is covered by a radix tree entry that spans from one 2MiB aligned file offset to another 2MiB aligned file offset. So, for example, the file offset for 3MiB (pgoff 768) falls within the PMD range for the order 9 radix tree entry that ranges from 2MiB (pgoff 512) to 4MiB (pgoff 1024). This system works so long as the addresses and file offsets for a given mapping both have the same offsets relative to the start of each PMD. Consider the case where the starting address for a given file isn't 2MiB aligned - say our faulting address is 3 MiB (0x30 0000), but that corresponds to the beginning of our file (pgoff 0). Now all the PMDs in the mapping are misaligned so that the 2MiB range defined in the page tables never matches up with the 2MiB range defined in the radix tree. The current code notices this case for DAX faults to storage with the following test in dax_pmd_insert_mapping(): if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR) goto unlock_fallback; This test makes sure that the pfn we get from the driver is 2MiB aligned, and relies on the assumption that the 2MiB alignment of the pfn we get back from the driver matches the 2MiB alignment of the faulting address. However, faults to holes were not checked and we could hit the problem described above. 
This was reported in response to the NVML nvml/src/test/pmempool_sync TEST5: $ cd nvml/src/test/pmempool_sync $ make TEST5 You can grab NVML here: https://github.com/pmem/nvml/ The dmesg warning you see when you hit this error is: WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310 Where we notice in dax_insert_mapping_entry() that the radix tree entry we are about to replace doesn't match the locked entry that we had previously inserted into the tree. This happens because the initial insertion was done in grab_mapping_entry() using a pgoff calculated from the faulting address (vmf->address), and the replacement in dax_pmd_load_hole() => dax_insert_mapping_entry() is done using vmf->pgoff. In our failure case those two page offsets (one calculated from vmf->address, one using vmf->pgoff) point to different order 9 radix tree entries. This failure case can result in a deadlock because the radix tree unlock also happens on the pgoff calculated from vmf->address. This means that the locked radix tree entry that we swapped in to the tree in dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all future faults to that 2MiB range will block forever. Fix this by validating that the faulting address's PMD offset matches the PMD offset from the start of the file. This check is done at the very beginning of the fault and covers faults that would have mapped to storage as well as faults to holes. I left the COLOUR check in dax_pmd_insert_mapping() in place in case we ever hit the insanity condition where the alignment of the pfn we get from the driver doesn't match the alignment of the userspace address. 
Link: http://lkml.kernel.org/r/20170822222436.18926-1-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: "Slusarz, Marcin" <marcin.slusarz@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
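The alignment check described in the dax message above (the faulting address's PMD offset must match the file offset's PMD offset) can be sketched as follows. This is a userspace illustration under stated assumptions: 4 KiB pages and 2 MiB PMDs are hard-coded, and `demo_pmd_offsets_match()` and the `DEMO_*` macros are hypothetical names rather than the identifiers used in fs/dax.c.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_PAGE_SHIFT	12	/* 4 KiB pages (assumed) */
#define DEMO_PMD_SHIFT	21	/* 2 MiB PMD mappings (assumed) */

static_assert(DEMO_PMD_SHIFT > DEMO_PAGE_SHIFT, "a PMD must span several pages");

/* Mask selecting a page's index within its 2 MiB PMD range (0..511). */
#define DEMO_PMD_COLOUR	((1UL << (DEMO_PMD_SHIFT - DEMO_PAGE_SHIFT)) - 1)

/*
 * Return true when the faulting address and the file page offset sit at
 * the same position inside their respective 2 MiB ranges, so that the
 * page table PMD and the radix tree entry cover the same pages.
 */
static bool demo_pmd_offsets_match(unsigned long address, unsigned long pgoff)
{
	unsigned long addr_colour = (address >> DEMO_PAGE_SHIFT) & DEMO_PMD_COLOUR;
	unsigned long file_colour = pgoff & DEMO_PMD_COLOUR;

	return addr_colour == file_colour;
}
```

Using the numbers from the message: a fault at 3 MiB (0x300000) that maps pgoff 768 passes the check, while the same address mapping pgoff 0 does not and would have to fall back to smaller mappings.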
2017-08-24 | nfsd: Limit end of page list when decoding NFSv4 WRITE | Chuck Lever | 1 | -4/+2
When processing an NFSv4 WRITE operation, argp->end should never point past the end of the data in the final page of the page list. Otherwise, nfsd4_decode_compound can walk into uninitialized memory. More critically, nfsd4_decode_write is failing to increment argp->pagelen when it increments argp->pagelist. This can cause later xdr decoders to assume more data is available than really is, which can cause server crashes on malformed requests. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24 | Merge branch 'for-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux | Linus Torvalds | 5 | -60/+64
Pull btrfs fix from David Sterba: "We have one more fixup that stems from the blk_status_t conversion that did not quite cover everything. The normal cases were not affected because the code is 0, but any error and retries could mix up new and old values"

* 'for-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix blk_status_t/errno confusion
2017-08-24 | pty: Repair TIOCGPTPEER | Eric W. Biederman | 1 | -16/+49
The implementation of TIOCGPTPEER has two issues.

When /dev/ptmx (as opposed to /dev/pts/ptmx) is opened, the wrong vfsmount is passed to dentry_open, which results in the kernel displaying the wrong pathname for the peer.

The second is simply that caching the vfsmount and dentry of the peer leaves them open, in a way they were not previously. Because of the increased reference counts, this can cause unnecessary behaviour differences resulting in regressions.

To fix these, move the ioctl into tty_io.c at a generic level, allowing the ioctl to have access to the struct file on which the ioctl is being called. This allows the path of the slave to be derived when opening the slave through TIOCGPTPEER instead of requiring the path to the slave be cached, thus removing the need for caching the path.

A new function devpts_ptmx_path is factored out of devpts_acquire and used to implement a function devpts_mntget. The new function devpts_mntget takes a filp to perform the lookup on and fsi so that it can confirm that the superblock that is found by devpts_ptmx_path is the proper superblock.

v2: Lots of fixes to make the code actually work
v3: Suggestions by Linus
    - Removed the unnecessary initialization of filp in ptm_open_peer
    - Simplified devpts_ptmx_path as gotos are no longer required

[ This is the fix for the issue that was reverted in commit 143c97cc6529, but this time without breaking 'pbuilder' due to increased reference counts - Linus ]

Fixes: 54ebbfb16034 ("tty: add TIOCGPTPEER ioctl")
Reported-by: Christian Brauner <christian.brauner@canonical.com>
Reported-and-tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-24 | Btrfs: fix blk_status_t/errno confusion | Omar Sandoval | 5 | -60/+64
This fixes several instances of blk_status_t and bare errno ints being mixed up, some of which are real bugs. In the normal case, 0 matches BLK_STS_OK, so we don't observe any effects of the missing conversion, but in case of errors or passes through the repair/retry paths, the errors get mixed up. The changes were identified using 'sparse', we don't have reports of the buggy behaviour. Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
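Why the mix-up goes unnoticed on success but matters on failure can be illustrated with a small userspace sketch. Everything here is hypothetical: `enum demo_blk_status`, its numeric values, and the two conversion helpers are stand-ins chosen for the example, not the kernel's blk_status_t definitions.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical status codes; the numeric values are illustrative only. */
enum demo_blk_status {
	DEMO_STS_OK = 0,
	DEMO_STS_NOSPC = 3,
	DEMO_STS_IOERR = 10,
};

/* Convert a block-layer style status into a negative errno. */
static int demo_status_to_errno(enum demo_blk_status sts)
{
	switch (sts) {
	case DEMO_STS_OK:
		return 0;
	case DEMO_STS_NOSPC:
		return -ENOSPC;
	case DEMO_STS_IOERR:
		return -EIO;
	}
	return -EIO;
}

/* Convert a negative errno (or 0) into a block-layer style status. */
static enum demo_blk_status demo_errno_to_status(int err)
{
	assert(err <= 0);

	switch (err) {
	case 0:
		return DEMO_STS_OK;
	case -ENOSPC:
		return DEMO_STS_NOSPC;
	default:
		return DEMO_STS_IOERR;
	}
}
```

Zero means success in both encodings, so skipping the conversion is invisible in the normal case; any non-zero value, however, means something different depending on which encoding the receiver assumes.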