summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2014-08-04Merge branch 'for-3.17' of ↵Linus Torvalds1-2/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu updates from Tejun Heo: - Major reorganization of percpu header files which I think makes things a lot more readable and logical than before. - percpu-refcount is updated so that it requires explicit destruction and can be reinitialized if necessary. This was pulled into the block tree to replace the custom percpu refcnting implemented in blk-mq. - In the process, percpu and percpu-refcount got cleaned up a bit * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (21 commits) percpu-refcount: implement percpu_ref_reinit() and percpu_ref_is_zero() percpu-refcount: require percpu_ref to be exited explicitly percpu-refcount: use unsigned long for pcpu_count pointer percpu-refcount: add helpers for ->percpu_count accesses percpu-refcount: one bit is enough for REF_STATUS percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc() workqueue: stronger test in process_one_work() workqueue: clear POOL_DISASSOCIATED in rebind_workers() percpu: Use ALIGN macro instead of hand coding alignment calculation percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations percpu: preffity percpu header files percpu: use raw_cpu_*() to define __this_cpu_*() percpu: reorder macros in percpu header files percpu: move {raw|this}_cpu_*() definitions to include/linux/percpu-defs.h percpu: move generic {raw|this}_cpu_*_N() definitions to include/asm-generic/percpu.h percpu: only allow sized arch overrides for {raw|this}_cpu_*() ops percpu: reorganize include/linux/percpu-defs.h percpu: move accessors from include/linux/percpu.h to percpu-defs.h percpu: include/asm-generic/percpu.h should contain only arch-overridable parts percpu: introduce arch_raw_cpu_ptr() ...
2014-08-04Merge tag 'locks-v3.17-1' of git://git.samba.org/jlayton/linuxLinus Torvalds1-13/+13
Pull file locking related changes from Jeff Layton: "Just a couple of changes from Christoph to start us down the road toward getting rid of the fl_owner_t typedef" * tag 'locks-v3.17-1' of git://git.samba.org/jlayton/linux: locks: purge fl_owner_t from fs/locks.c locks: typedef fl_owner_t to void *
2014-08-01vfs: fix check for fallocate on active swapfileEric Biggers1-3/+2
Fix the broken check for calling sys_fallocate() on an active swapfile, introduced by commit 0790b31b69374ddadefe ("fs: disallow all fallocate operation on active swapfile"). Signed-off-by: Eric Biggers <ebiggers3@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-01direct-io: fix AIO regressionChristoph Hellwig1-5/+4
The direct-io.c rewrite to use the iov_iter infrastructure stopped updating the size field in struct dio_submit, and thus rendered the check for allowing asynchronous completions to always return false. Fix this by comparing it to the count of bytes in the iov_iter instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Tim Chen <tim.c.chen@linux.intel.com> Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
2014-07-29AFS: Correctly assemble the client UUIDDavid Howells1-2/+2
Correctly assemble the client UUID by OR'ing in the flags rather than assigning them over the other components. Reported-by: Himangi Saraogi <himangi774@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-27Merge branch 'vfs-for-3.16' of git://git.infradead.org/users/hch/vfsLinus Torvalds2-8/+9
Pull vfs fixes from Christoph Hellwig: "A vfsmount leak fix, and a compile warning fix" * 'vfs-for-3.16' of git://git.infradead.org/users/hch/vfs: fs: umount on symlink leaks mnt count direct-io: fix uninitialized warning in do_direct_IO()
2014-07-25Merge branch 'for-linus' of ↵Linus Torvalds1-4/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "These two pathes fix issues with the kernel-userspace protocol changes in v3.15" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: add FUSE_NO_OPEN_SUPPORT flag to INIT fuse: s_time_gran fix
2014-07-24fs: umount on symlink leaks mnt countVasily Averin1-1/+2
Currently umount on symlink blocks following umount: /vz is separate mount # ls /vz/ -al | grep test drwxr-xr-x. 2 root root 4096 Jul 19 01:14 testdir lrwxrwxrwx. 1 root root 11 Jul 19 01:16 testlink -> /vz/testdir # umount -l /vz/testlink umount: /vz/testlink: not mounted (expected) # lsof /vz # umount /vz umount: /vz: device is busy. (unexpected) In this case mountpoint_last() gets an extra refcount on path->mnt Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Ian Kent <raven@themaw.net> Acked-by: Jeff Layton <jlayton@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: Christoph Hellwig <hch@lst.de>
2014-07-24direct-io: fix uninitialized warning in do_direct_IO()Boaz Harrosh1-7/+7
The following warnings: fs/direct-io.c: In function ‘__blockdev_direct_IO’: fs/direct-io.c:1011:12: warning: ‘to’ may be used uninitialized in this function [-Wmaybe-uninitialized] fs/direct-io.c:913:16: note: ‘to’ was declared here fs/direct-io.c:1011:12: warning: ‘from’ may be used uninitialized in this function [-Wmaybe-uninitialized] fs/direct-io.c:913:10: note: ‘from’ was declared here are false positive because dio_get_page() either fails, or sets both 'from' and 'to'. Paul Bolle said ... Maybe it's better to move initializing "to" and "from" out of dio_get_page(). That _might_ make it easier for both the the reader and the compiler to understand what's going on. Something like this: Christoph Hellwig said ... The fix of moving the code definitively looks nicer, while I think uninitialized_var is horrible wart that won't get anywhere near my code. Boaz Harrosh: I agree with Christoph and Paul Signed-off-by: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2014-07-23Merge branch 'for-3.16' of git://linux-nfs.org/~bfields/linuxLinus Torvalds1-1/+3
Pull nfsd bugfix from Bruce Fields: "Another regression from the xdr encoding rewrite" * 'for-3.16' of git://linux-nfs.org/~bfields/linux: NFSD: Fix crash encoding lock reply on 32-bit
2014-07-23simple_xattr: permit 0-size extended attributesHugh Dickins1-1/+1
If a filesystem uses simple_xattr to support user extended attributes, LTP setxattr01 and xfstests generic/062 fail with "Cannot allocate memory": simple_xattr_alloc()'s wrap-around test mistakenly excludes values of zero size. Fix that off-by-one (but apparently no filesystem needs them yet). Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jeff Layton <jlayton@poochiereds.net> Cc: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-23coredump: fix the setting of PF_DUMPCORESilesh C V1-1/+1
Commit 079148b919d0 ("coredump: factor out the setting of PF_DUMPCORE") cleaned up the setting of PF_DUMPCORE by removing it from all the linux_binfmt->core_dump() and moving it to zap_threads().But this ended up clearing all the previously set flags. This causes issues during core generation when tsk->flags is checked again (eg. for PF_USED_MATH to dump floating point registers). Fix this. Signed-off-by: Silesh C V <svellattu@mvista.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: <stable@vger.kernel.org> [3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-23NFSD: Fix crash encoding lock reply on 32-bitKinglong Mee1-1/+3
Commit 8c7424cff6 "nfsd4: don't try to encode conflicting owner if low on space" forgot to free conf->data in nfsd4_encode_lockt and before sign conf->data to NULL in nfsd4_encode_lock_denied, causing a leak. Worse, kfree() can be called on an uninitialized pointer in the case of a succesful lock (or one that fails for a reason other than a conflict). (Note that lock->lk_denied.ld_owner.data appears it should be zero here, until you notice that it's one arm of a union the other arm of which is written to in the succesful case by the memcpy(&lock->lk_resp_stateid, &lock_stp->st_stid.sc_stateid, sizeof(stateid_t)); in nfsd4_lock(). In the 32-bit case this overwrites ld_owner.data.) Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Fixes: 8c7424cff6 ""nfsd4: don't try to encode conflicting owner if low on space" Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-22fuse: add FUSE_NO_OPEN_SUPPORT flag to INITAndrew Gallagher1-1/+1
Here some additional changes to set a capability flag so that clients can detect when it's appropriate to return -ENOSYS from open. This amends the following commit introduced in 3.14: 7678ac50615d fuse: support clients that don't implement 'open' However we can only add the flag to 3.15 and later since there was no protocol version update in 3.14. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: <stable@vger.kernel.org> # v3.15+
2014-07-22fuse: s_time_gran fixMiklos Szeredi1-3/+0
Default s_time_gran is 1, don't overwrite that if userspace didn't explicitly specify one. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: <stable@vger.kernel.org> # v3.15+
2014-07-20Merge branch 'for-linus' of ↵Linus Torvalds2-4/+15
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "We have two more fixes in my for-linus branch. I was hoping to also include a fix for a btrfs deadlock with compression enabled, but we're still nailing that one down" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: test for valid bdev before kobj removal in btrfs_rm_device Btrfs: fix abnormal long waiting in fsync
2014-07-20Merge tag 'nfs-for-3.16-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds6-62/+343
Pull NFS client fixes from Trond Myklebust: "Apologies for the relative lateness of this pull request, however the commits fix some issues with the NFS read/write code updates in 3.16-rc1 that can cause serious Oopsing when using small r/wsize. The delay was mainly due to extra testing to make sure that the fixes behave correctly. Highlights include; - Stable fix for an NFSv3 posix ACL regression - Multiple fixes for regressions to the NFS generic read/write code: - Fix page splitting bugs that come into play when a small rsize/wsize read/write needs to be sent again (due to error conditions or page redirty) - Fix nfs_wb_page_cancel, which is called by the "invalidatepage" method - Fix 2 compile warnings about unused variables - Fix a performance issue affecting unstable writes" * tag 'nfs-for-3.16-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Don't reset pg_moreio in __nfs_pageio_add_request NFS: Remove 2 unused variables nfs: handle multiple reqs in nfs_wb_page_cancel nfs: handle multiple reqs in nfs_page_async_flush nfs: change find_request to find_head_request nfs: nfs_page should take a ref on the head req nfs: mark nfs_page reqs with flag for extra ref nfs: only show Posix ACLs in listxattr if actually present
2014-07-19btrfs: test for valid bdev before kobj removal in btrfs_rm_deviceEric Sandeen1-4/+4
commit 99994cd btrfs: dev delete should remove sysfs entry added a btrfs_kobj_rm_device, which dereferences device->bdev... right after we check whether device->bdev might be NULL. I don't honestly know if it's possible to have a NULL device->bdev here, but assuming that it is (given the test), we need to move the kobject removal to be under that test. (Coverity spotted this) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-07-19Btrfs: fix abnormal long waiting in fsyncLiu Bo1-0/+11
xfstests generic/127 detected this problem. With commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8, now fsync will only flush data within the passed range. This is the cause of the above problem, -- btrfs's fsync has a stage called 'sync log' which will wait for all the ordered extents it've recorded to finish. In xfstests/generic/127, with mixed operations such as truncate, fallocate, punch hole, and mapwrite, we get some pre-allocated extents, and mapwrite will mmap, and then msync. And I find that msync will wait for quite a long time (about 20s in my case), thanks to ftrace, it turns out that the previous fallocate calls 'btrfs_wait_ordered_range()' to flush dirty pages, but as the range of dirty pages may be larger than 'btrfs_wait_ordered_range()' wants, there can be some ordered extents created but not getting corresponding pages flushed, then they're left in memory until we fsync which runs into the stage 'sync log', and fsync will just wait for the system writeback thread to flush those pages and get ordered extents finished, so the latency is inevitable. This adds a flush similar to btrfs_start_ordered_extent() in btrfs_wait_logged_extents() to fix that. Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-07-18Merge tag 'gfs2-fixes' of ↵Linus Torvalds5-13/+17
git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes Pull gfs2 fixes from Steven Whitehouse: "This patch set contains two minor docs/spelling fixes, some fixes for flock, a change to use GFP_NOFS to avoid recursion on a rarely used code path and a fix for a race relating to the glock lru" * tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes: GFS2: fs/gfs2/rgrp.c: kernel-doc warning fixes GFS2: memcontrol: Spelling s/invlidate/invalidate/ GFS2: Allow caching of glocks for flock GFS2: Allow flocks to use normal glock dq rather than dq_wait GFS2: replace count*size kzalloc by kcalloc GFS2: Use GFP_NOFS when allocating glocks GFS2: Fix race in glock lru glock disposal GFS2: Only wait for demote when last holder is dequeued
2014-07-18Merge tag 'xfs-for-linus-3.16-rc5' of git://oss.sgi.com/xfs/xfsLinus Torvalds7-72/+106
Pull xfs fixes from Dave Chinner: "Fixes for low memory perforamnce regressions and a quota inode handling regression. These are regression fixes for issues recently introduced - the change in the stack switch location is fairly important, so I've held off sending this update until I was sure that it still addresses the stack usage problem the original solved. So while the commits in the xfs tree are recent, it has been under tested for several weeks now" * tag 'xfs-for-linus-3.16-rc5' of git://oss.sgi.com/xfs/xfs: xfs: null unused quota inodes when quota is on xfs: refine the allocation stack switch Revert "xfs: block allocation work needs to be kswapd aware"
2014-07-18GFS2: fs/gfs2/rgrp.c: kernel-doc warning fixesFabian Frederick1-2/+2
Cc: cluster-devel@redhat.com Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18GFS2: memcontrol: Spelling s/invlidate/invalidate/Geert Uytterhoeven1-2/+2
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: cluster-devel@redhat.com Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18GFS2: Allow caching of glocks for flockBob Peterson1-1/+1
This patch removes the GLF_NOCACHE flag from the glocks associated with flocks. There should be no good reason not to cache glocks for flocks: they only force the glock to be demoted before they can be reacquired, which can slow down performance and even cause glock hangs, especially in cases where the flocks are held in Shared (SH) mode. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18GFS2: Allow flocks to use normal glock dq rather than dq_waitBob Peterson2-4/+2
This patch allows flock glocks to use a non-blocking dequeue rather than dq_wait. It also reverts the previous patch I had posted regarding dq_wait. The reverted patch isn't necessarily a bad idea, but I decided this might avoid unforeseen side effects, and was therefore safer. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18GFS2: replace count*size kzalloc by kcallocFabian Frederick1-2/+2
kcalloc manages count*sizeof overflow. Cc: cluster-devel@redhat.com Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18GFS2: Use GFP_NOFS when allocating glocksSteven Whitehouse1-2/+2
Normally GFP_KERNEL is ok here, but there is now a rarely used code path relating to deallocation of unlinked inodes (in certain corner cases) which if hit at times of memory shortage can cause recursion while trying to free memory. One solution would be to try and move the gfs2_glock_get() call so that it is no longer called while another glock is held, but that doesn't look at all easy, so GFP_NOFS is the best solution for the time being. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18GFS2: Fix race in glock lru glock disposalSteven Whitehouse1-3/+7
We must not leave items on the LRU list with GLF_LOCK set, since they can be removed if the glock is brought back into use, which may then potentially result in a hang, waiting for GLF_LOCK to clear. It doesn't happen very often, since it requires a glock that has not been used for a long time to be brought back into use at the same moment that the shrinker is part way through disposing of glocks. The fix is to set GLF_LOCK at a later time, when we already know that the other locks can be obtained. Also, we now only release the lru_lock in case a resched is needed, rather than on every iteration. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18GFS2: Only wait for demote when last holder is dequeuedBob Peterson1-1/+3
Function gfs2_glock_dq_wait is supposed to dequeue a glock and then wait for the lock to be demoted. The problem is, if this is a shared lock, its demote will depend on the other holders, which means you might end up waiting forever because the other process is blocked. This problem is especially apparent when dealing with nested flocks. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-15Merge branch 'for_linus' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota fix from Jan Kara: "Fix locking of dquot shrinker" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: quota: missing lock in dqcache_shrink_scan()
2014-07-15quota: missing lock in dqcache_shrink_scan()Niu Yawei1-0/+2
Commit 1ab6c4997e04 (fs: convert fs shrinkers to new scan/count API) accidentally removed locking from quota shrinker. Fix it - dqcache_shrink_scan() should use dq_list_lock to protect the scan on free_dquots list. CC: stable@vger.kernel.org Fixes: 1ab6c4997e04a00c50c6d786c2f046adc0d1f5de Signed-off-by: Niu Yawei <yawei.niu@intel.com> Signed-off-by: Jan Kara <jack@suse.cz>
2014-07-15Merge branch 'for-linus' of ↵Linus Torvalds4-53/+69
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "This contains miscellaneous fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: replace count*size kzalloc by kcalloc fuse: release temporary page if fuse_writepage_locked() failed fuse: restructure ->rename2() fuse: avoid scheduling while atomic fuse: handle large user and group ID fuse: inode: drop cast fuse: ignore entry-timeout on LOOKUP_REVAL fuse: timeout comparison fix
2014-07-15xfs: null unused quota inodes when quota is onDave Chinner1-4/+21
When quota is on, it is expected that unused quota inodes have a value of NULLFSINO. The changes to support a separate project quota in 3.12 broken this rule for non-project quota inode enabled filesystem, as the code now refuses to write the group quota inode if neither group or project quotas are enabled. This regression was introduced by commit d892d58 ("xfs: Start using pquotaino from the superblock"). In this case, we should be writing NULLFSINO rather than nothing to ensure that we leave the group quota inode in a valid state while quotas are enabled. Failure to do so doesn't cause a current kernel to break - the separate project quota inodes introduced translation code to always treat a zero inode as NULLFSINO. This was introduced by commit 0102629 ("xfs: Initialize all quota inodes to be NULLFSINO") with is also in 3.12 but older kernels do not do this and hence taking a filesystem back to an older kernel can result in quotas failing initialisation at mount time. When that happens, we see this in dmesg: [ 1649.215390] XFS (sdb): Mounting Filesystem [ 1649.316894] XFS (sdb): Failed to initialize disk quotas. [ 1649.316902] XFS (sdb): Ending clean mount By ensuring that we write NULLFSINO to quota inodes that aren't active, we avoid this problem. We have to be really careful when determining if the quota inodes are active or not, because we don't want to write a NULLFSINO if the quota inodes are active and we simply aren't updating them. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15xfs: refine the allocation stack switchDave Chinner6-62/+90
The allocation stack switch at xfs_bmapi_allocate() has served it's purpose, but is no longer a sufficient solution to the stack usage problem we have in the XFS allocation path. Whilst the kernel stack size is now 16k, that is not a valid reason for undoing all our "keep stack usage down" modifications. What it does allow us to do is have the freedom to refine and perfect the modifications knowing that if we get it wrong it won't blow up in our faces - we have a safety net now. This is important because we still have the issue of older kernels having smaller stacks and that they are still supported and are demonstrating a wide range of different stack overflows. Red Hat has several open bugs for allocation based stack overflows from directory modifications and direct IO block allocation and these problems still need to be solved. If we can solve them upstream, then distro's won't need to bake their own unique solutions. To that end, I've observed that every allocation based stack overflow report has had a specific characteristic - it has happened during or directly after a bmap btree block split. That event requires a new block to be allocated to the tree, and so we effectively stack one allocation stack on top of another, and that's when we get into trouble. A further observation is that bmap btree block splits are much rarer than writeback allocation - over a range of different workloads I've observed the ratio of bmap btree inserts to splits ranges from 100:1 (xfstests run) to 10000:1 (local VM image server with sparse files that range in the hundreds of thousands to millions of extents). Either way, bmap btree split events are much, much rarer than allocation events. Finally, we have to move the kswapd state to the allocation workqueue work when allocation is done on behalf of kswapd. This is proving to cause significant perturbation in performance under memory pressure and appears to be generating allocation deadlock warnings under some workloads, so avoiding the use of a workqueue for the majority of kswapd writeback allocation will minimise the impact of such behaviour. Hence it makes sense to move the stack switch to xfs_btree_split() and only do it for bmap btree splits. Stack switches during allocation will be much rarer, so there won't be significant performacne overhead caused by switching stacks. The worse case stack from all allocation paths will be split, not just writeback. And the majority of memory allocations will be done in the correct context (e.g. kswapd) without causing additional latency, and so we simplify the memory reclaim interactions between processes, workqueues and kswapd. The worst stack I've been able to generate with this patch in place is 5600 bytes deep. It's very revealing because we exit XFS at: 37) 1768 64 kmem_cache_alloc+0x13b/0x170 about 1800 bytes of stack consumed, and the remaining 3800 bytes (and 36 functions) is memory reclaim, swap and the IO stack. And this occurs in the inode allocation from an open(O_CREAT) syscall, not writeback. The amount of stack being used is much less than I've previously be able to generate - fs_mark testing has been able to generate stack usage of around 7k without too much trouble; with this patch it's only just getting to 5.5k. This is primarily because the metadata allocation paths (e.g. directory blocks) are no longer causing double splits on the same stack, and hence now stack tracing is showing swapping being the worst stack consumer rather than XFS. Performance of fs_mark inode create workloads is unchanged. Performance of fs_mark async fsync workloads is consistently good with context switches reduced by around 150,000/s (30%). Performance of dbench, streaming IO and postmark is unchanged. Allocation deadlock warnings have not been seen on the workloads that generated them since adding this patch. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15Revert "xfs: block allocation work needs to be kswapd aware"Dave Chinner2-20/+9
This reverts commit 1f6d64829db78a7e1d63e15c9f48f0a5d2b5a679. This commit resulted in regressions in performance in low memory situations where kswapd was doing writeback of delayed allocation blocks. It resulted in significant parallelism of the kswapd work and with the special kswapd flags meant that hundreds of active allocation could dip into kswapd specific memory reserves and avoid being throttled. This cause a large amount of performance variation, as well as random OOM-killer invocations that didn't previously exist. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-14aio: protect reqs_available updates from changes in interrupt handlersBenjamin LaHaise1-0/+7
As of commit f8567a3845ac05bb28f3c1b478ef752762bd39ef it is now possible to have put_reqs_available() called from irq context. While put_reqs_available() is per cpu, it did not protect itself from interrupts on the same CPU. This lead to aio_complete() corrupting the available io requests count when run under a heavy O_DIRECT workloads as reported by Robert Elliott. Fix this by disabling irq updates around the per cpu batch updates of reqs_available. Many thanks to Robert and folks for testing and tracking this down. Reported-by: Robert Elliot <Elliott@hp.com> Tested-by: Robert Elliot <Elliott@hp.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@infradead.org> Cc: stable@vger.kenel.org
2014-07-14fuse: replace count*size kzalloc by kcallocFabian Frederick1-2/+2
kcalloc manages count*sizeof overflow. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-07-14fuse: release temporary page if fuse_writepage_locked() failedMaxim Patlasov1-1/+3
tmp_page to be freed if fuse_write_file_get() returns NULL. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-07-13locks: purge fl_owner_t from fs/locks.cChristoph Hellwig1-13/+13
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-07-13Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds5-45/+44
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o: "More bug fixes for ext4 -- most importantly, a fix for a bug introduced in 3.15 that can end up triggering a file system corruption error after a journal replay. It shouldn't lead to any actual data corruption, but it is scary and can force file systems to be remounted read-only, etc" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix potential null pointer dereference in ext4_free_inode ext4: fix a potential deadlock in __ext4_es_shrink() ext4: revert commit which was causing fs corruption after journal replays ext4: disable synchronous transaction batching if max_batch_time==0 ext4: clarify ext4_error message in ext4_mb_generate_buddy_error() ext4: clarify error count warning messages ext4: fix unjournalled bg descriptor while initializing inode bitmap
2014-07-13NFS: Don't reset pg_moreio in __nfs_pageio_add_requestTrond Myklebust1-1/+1
Once we've started sending unstable NFS writes, we do not want to clear pg_moreio, or we may end up sending the very last request as a stable write if the commit lists are still empty. Do, however, reset pg_moreio in the case where we end up having to recoalesce the write if an attempt to use pNFS failed. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12NFS: Remove 2 unused variablesTrond Myklebust2-4/+0
Cc: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12nfs: handle multiple reqs in nfs_wb_page_cancelWeston Andros Adamson1-20/+21
Use nfs_lock_and_join_requests to merge all subrequests into the head request - this cancels and dereferences all subrequests. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12nfs: handle multiple reqs in nfs_page_async_flushWeston Andros Adamson3-25/+235
Change nfs_find_and_lock_request so nfs_page_async_flush can handle multiple requests in a page. There is only one request for a page the first time nfs_page_async_flush is called, but if a write or commit fails, async_flush is called again and there may be multiple requests associated with the page. The solution is to merge all the requests in a page group into a single request before calling nfs_pageio_add_request. Rename nfs_find_and_lock_request to nfs_lock_and_join_requests and change it to first lock all requests for the page, then cancel and merge all subrequests into the head request. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12nfs: change find_request to find_head_requestWeston Andros Adamson1-9/+24
nfs_page_find_request_locked* should find the head request for that page. Rename the functions and add comments to make this clear, and fix a bug that could return a subrequest when page_private isn't set on the page. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12nfs: nfs_page should take a ref on the head reqWeston Andros Adamson1-0/+10
nfs_pages that aren't the the head of a group must take a reference on the head as long as ->wb_head is set to it. This stops the head from hitting a refcount of 0 while there is still an active nfs_page for the page group. This avoids kref warnings in the writeback code when the page group head is found and referenced. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12nfs: mark nfs_page reqs with flag for extra refWeston Andros Adamson2-3/+9
Change the use of PG_INODE_REF - set it when taking extra reference on subrequests and take care to only release once for each request. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12ext4: fix potential null pointer dereference in ext4_free_inodeNamjae Jeon1-1/+1
Fix potential null pointer dereferencing problem caused by e43bb4e612 ("ext4: decrement free clusters/inodes counters when block group declared bad") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-07-12ext4: fix a potential deadlock in __ext4_es_shrink()Theodore Ts'o1-2/+2
This fixes the following lockdep complaint: [ INFO: possible circular locking dependency detected ] 3.16.0-rc2-mm1+ #7 Tainted: G O ------------------------------------------------------- kworker/u24:0/4356 is trying to acquire lock: (&(&sbi->s_es_lru_lock)->rlock){+.+.-.}, at: [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0 but task is already holding lock: (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180 which lock already depends on the new lock. Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ei->i_es_lock); lock(&(&sbi->s_es_lru_lock)->rlock); lock(&ei->i_es_lock); lock(&(&sbi->s_es_lru_lock)->rlock); *** DEADLOCK *** 6 locks held by kworker/u24:0/4356: #0: ("writeback"){.+.+.+}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560 #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560 #2: (&type->s_umount_key#22){++++++}, at: [<ffffffff811a9c74>] grab_super_passive+0x44/0x90 #3: (jbd2_handle){+.+...}, at: [<ffffffff812979f9>] start_this_handle+0x189/0x5f0 #4: (&ei->i_data_sem){++++..}, at: [<ffffffff81247062>] ext4_map_blocks+0x132/0x550 #5: (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180 stack backtrace: CPU: 0 PID: 4356 Comm: kworker/u24:0 Tainted: G O 3.16.0-rc2-mm1+ #7 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Workqueue: writeback bdi_writeback_workfn (flush-253:0) ffffffff8213dce0 ffff880014b07538 ffffffff815df0bb 0000000000000007 ffffffff8213e040 ffff880014b07588 ffffffff815db3dd ffff880014b07568 ffff880014b07610 ffff88003b868930 ffff88003b868908 ffff88003b868930 Call Trace: [<ffffffff815df0bb>] dump_stack+0x4e/0x68 [<ffffffff815db3dd>] print_circular_bug+0x1fb/0x20c [<ffffffff810a7a3e>] __lock_acquire+0x163e/0x1d00 [<ffffffff815e89dc>] ? retint_restore_args+0xe/0xe [<ffffffff815ddc7b>] ? __slab_alloc+0x4a8/0x4ce [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0 [<ffffffff810a8707>] lock_acquire+0x87/0x120 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0 [<ffffffff8128592d>] ? ext4_es_free_extent+0x5d/0x70 [<ffffffff815e6f09>] _raw_spin_lock+0x39/0x50 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0 [<ffffffff8119760b>] ? kmem_cache_alloc+0x18b/0x1a0 [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0 [<ffffffff812869b8>] ext4_es_insert_extent+0xc8/0x180 [<ffffffff812470f4>] ext4_map_blocks+0x1c4/0x550 [<ffffffff8124c4c4>] ext4_writepages+0x6d4/0xd00 ... Reported-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Minchan Kim <minchan@kernel.org> Cc: stable@vger.kernel.org Cc: Zheng Liu <gnehzuil.liu@gmail.com>
2014-07-11Merge branch 'for-3.16' of git://linux-nfs.org/~bfields/linuxLinus Torvalds1-1/+1
Pull nfsd bugfix from Bruce Fields: "Another xdr encoding regression that may cause incorrect encoding on failures of certain readdirs" * 'for-3.16' of git://linux-nfs.org/~bfields/linux: nfsd: Fix bad reserving space for encoding rdattr_error