path: root/fs
Age | Commit message | Author | Files | Lines
2010-05-11 | CacheFiles: Fix occasional EIO on call to vfs_unlink() | David Howells | 2 | -12/+87
Fix an occasional EIO returned by a call to vfs_unlink():

    [ 4868.465413] CacheFiles: I/O Error: Unlink failed
    [ 4868.465444] FS-Cache: Cache cachefiles stopped due to I/O error
    [ 4947.320011] CacheFiles: File cache on md3 unregistering
    [ 4947.320041] FS-Cache: Withdrawing cache "mycache"
    [ 5127.348683] FS-Cache: Cache "mycache" added (type cachefiles)
    [ 5127.348716] CacheFiles: File cache on md3 registered
    [ 7076.871081] CacheFiles: I/O Error: Unlink failed
    [ 7076.871130] FS-Cache: Cache cachefiles stopped due to I/O error
    [ 7116.780891] CacheFiles: File cache on md3 unregistering
    [ 7116.780937] FS-Cache: Withdrawing cache "mycache"
    [ 7296.813394] FS-Cache: Cache "mycache" added (type cachefiles)
    [ 7296.813432] CacheFiles: File cache on md3 registered

What happens is this:

 (1) A cached NFS file is seen to have become out of date, so NFS retires the object and immediately acquires a new object with the same key.

 (2) Retirement of the old object is done asynchronously - so the lookup/create to generate the new object may be done first. This can be a problem as the old object and the new object must exist at the same point in the backing filesystem (i.e. they must have the same pathname).

 (3) The lookup for the new object sees that a backing file already exists, checks to see whether it is valid and sees that it isn't. It then deletes that file and creates a new one on disk.

 (4) The retirement phase for the old file is then performed. It tries to delete the dentry it has, but ext4_unlink() returns -EIO because the inode attached to that dentry no longer matches the inode number associated with the filename in the parent directory.

The trace below shows this quite well.

    [md5sum] ==> __fscache_relinquish_cookie(ffff88002d12fb58{NFS.fh,ffff88002ce62100},1)
    [md5sum] ==> __fscache_acquire_cookie({NFS.server},{NFS.fh},ffff88002ce62100)

NFS has retired the old cookie and asked for a new one.

    [kslowd] ==> fscache_object_state_machine({OBJ52,OBJECT_ACTIVE,24})
    [kslowd] <== fscache_object_state_machine() [->OBJECT_DYING]
    [kslowd] ==> fscache_object_state_machine({OBJ53,OBJECT_INIT,0})
    [kslowd] <== fscache_object_state_machine() [->OBJECT_LOOKING_UP]
    [kslowd] ==> fscache_object_state_machine({OBJ52,OBJECT_DYING,24})
    [kslowd] <== fscache_object_state_machine() [->OBJECT_RECYCLING]

The old object (OBJ52) is going through the terminal states to get rid of it, whilst the new object - (OBJ53) - is coming into being.

    [kslowd] ==> fscache_object_state_machine({OBJ53,OBJECT_LOOKING_UP,0})
    [kslowd] ==> cachefiles_walk_to_object({ffff88003029d8b8},OBJ53,@68,)
    [kslowd] lookup '@68'
    [kslowd] next -> ffff88002ce41bd0 positive
    [kslowd] advance
    [kslowd] lookup 'Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA'
    [kslowd] next -> ffff8800369faac8 positive

The new object has looked up the subdir in which the file would be in (getting dentry ffff88002ce41bd0) and then looked up the file itself (getting dentry ffff8800369faac8).

    [kslowd] validate 'Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA'
    [kslowd] ==> cachefiles_bury_object(,'@68','Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA')
    [kslowd] remove ffff8800369faac8 from ffff88002ce41bd0
    [kslowd] unlink stale object
    [kslowd] <== cachefiles_bury_object() = 0

It then checks the file's xattrs to see if it's valid. NFS says that the auxiliary data indicate the file is out of date (obvious to us - that's why NFS ditched the old version and got a new one). CacheFiles then deletes the old file (dentry ffff8800369faac8).

    [kslowd] redo lookup
    [kslowd] lookup 'Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA'
    [kslowd] next -> ffff88002cd94288 negative
    [kslowd] create -> ffff88002cd94288{ffff88002cdaf238{ino=148247}}

CacheFiles then redoes the lookup and gets a negative result in a new dentry (ffff88002cd94288) which it then creates a file for.

    [kslowd] ==> cachefiles_mark_object_active(,OBJ53)
    [kslowd] <== cachefiles_mark_object_active() = 0
    [kslowd] === OBTAINED_OBJECT ===
    [kslowd] <== cachefiles_walk_to_object() = 0 [148247]
    [kslowd] <== fscache_object_state_machine() [->OBJECT_AVAILABLE]

The new object is then marked active and the state machine moves to the available state - at which point NFS can start filling the object.

    [kslowd] ==> fscache_object_state_machine({OBJ52,OBJECT_RECYCLING,20})
    [kslowd] ==> fscache_release_object()
    [kslowd] ==> cachefiles_drop_object({OBJ52,2})
    [kslowd] ==> cachefiles_delete_object(,OBJ52{ffff8800369faac8})

The old object, meanwhile, goes on with being retired. If allocation occurs first, cachefiles_delete_object() has to wait for dir->d_inode->i_mutex to become available before it can continue.

    [kslowd] ==> cachefiles_bury_object(,'@68','Es0g00og0_Nd_XCYe3BOzvXrsBLMlN6aw16M1htaA')
    [kslowd] remove ffff8800369faac8 from ffff88002ce41bd0
    [kslowd] unlink stale object
    EXT4-fs warning (device sda6): ext4_unlink: Inode number mismatch in unlink (148247!=148193)
    CacheFiles: I/O Error: Unlink failed
    FS-Cache: Cache cachefiles stopped due to I/O error

CacheFiles then tries to delete the file for the old object, but the dentry it has (ffff8800369faac8) no longer points to a valid inode for that directory entry, and so ext4_unlink() returns -EIO when de->inode does not match i_ino.

    [kslowd] <== cachefiles_bury_object() = -5
    [kslowd] <== cachefiles_delete_object() = -5
    [kslowd] <== fscache_object_state_machine() [->OBJECT_DEAD]
    [kslowd] ==> fscache_object_state_machine({OBJ53,OBJECT_AVAILABLE,0})
    [kslowd] <== fscache_object_state_machine() [->OBJECT_ACTIVE]

(Note that the above trace includes extra information beyond that produced by the upstream code).

The fix is to note when an object that is being retired has had its object deleted preemptively by a replacement object that is being created, and to skip the second removal attempt in such a case.

Reported-by: Greg M <gregm@servu.net.au>
Reported-by: Mark Moseley <moseleymark@gmail.com>
Reported-by: Romain DEGEZ <romain.degez@smartjog.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
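The idea behind the CacheFiles fix (remember that a replacement already removed the backing file, then skip the second unlink) can be modelled in a few lines of userspace C. This is only an illustrative sketch, not the kernel patch: `struct cache_object`, `struct dir_entry`, the `buried` flag and all three helper functions are hypothetical names invented here, and the directory is reduced to a single inode number.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a cache object that owns one backing file. */
struct cache_object {
    unsigned long ino;  /* inode the object's dentry refers to */
    bool buried;        /* set when a replacement already unlinked our file */
};

/* Hypothetical model of the directory entry for the shared pathname. */
struct dir_entry {
    unsigned long ino;  /* inode currently bound to the name, 0 if none */
};

/* Model of unlink: fail with -EIO on an inode mismatch, as in the trace. */
static int model_unlink(struct dir_entry *de, const struct cache_object *obj)
{
    if (de->ino != obj->ino)
        return -EIO;
    de->ino = 0;
    return 0;
}

/* The new object finds a stale file: unlink it and mark the old object. */
static int replace_stale(struct dir_entry *de, struct cache_object *old_obj,
                         struct cache_object *new_obj, unsigned long new_ino)
{
    int ret;

    assert(new_ino != 0);
    ret = model_unlink(de, old_obj);
    if (ret < 0)
        return ret;
    old_obj->buried = true;   /* note the preemptive deletion */
    de->ino = new_ino;        /* create the replacement file */
    new_obj->ino = new_ino;
    new_obj->buried = false;
    return 0;
}

/* Retirement: skip the unlink if the file is already gone. */
static int retire_object(struct dir_entry *de, struct cache_object *obj)
{
    if (obj->buried)
        return 0;
    return model_unlink(de, obj);
}
```

With the inode numbers from the ext4 warning (148193 for the old file, 148247 for the new one), retiring the old object after the replacement succeeds only because the flag is set; clearing the flag reproduces the -EIO.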
2010-05-10 | autofs4-2.6.34-rc1 - fix link_count usage | Ian Kent | 1 | -3/+2
After commit 1f36f774b2 ("Switch !O_CREAT case to use of do_last()") in 2.6.34-rc1 autofs direct mounts stopped working. This is caused by current->link_count being 0 when ->follow_link() is called from do_filp_open(). I can't work out why this hasn't been seen before Al's patch series. This patch removes the autofs dependence on current->link_count.

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-07 | Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 | Linus Torvalds | 1 | -35/+51
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: Fix RCU issues in the NFSv4 delegation code
  NFSv4: Fix the locking in nfs_inode_reclaim_delegation()
2010-05-04 | Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 | Linus Torvalds | 8 | -74/+110
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  ocfs2: Avoid a gcc warning in ocfs2_wipe_inode().
  ocfs2: Avoid direct write if we fall back to buffered I/O
  ocfs2_dlmfs: Fix math error when reading LVB.
  ocfs2: Update VFS inode's id info after reflink.
  ocfs2: potential ERR_PTR dereference on error paths
  ocfs2: Add directory entry later in ocfs2_symlink() and ocfs2_mknod()
  ocfs2: use OCFS2_INODE_SKIP_ORPHAN_DIR in ocfs2_mknod error path
  ocfs2: use OCFS2_INODE_SKIP_ORPHAN_DIR in ocfs2_symlink error path
  ocfs2: add OCFS2_INODE_SKIP_ORPHAN_DIR flag and honor it in the inode wipe code
  ocfs2: Reset status if we want to restart file extension.
  ocfs2: Compute metaecc for superblocks during online resize.
  ocfs2: Check the owner of a lockres inside the spinlock
  ocfs2: one more warning fix in ocfs2_file_aio_write(), v2
  ocfs2_dlmfs: User DLM_* when decoding file open flags.
2010-05-03 | ocfs2: Avoid a gcc warning in ocfs2_wipe_inode(). | Joel Becker | 1 | -1/+1
gcc warns that a variable is uninitialized. It's actually handled, but an early return fools gcc. Let's just initialize the variable to a garbage value that will crash if the usage is ever broken.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-05-03 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client | Linus Torvalds | 12 | -38/+71
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: remove bad auth_x kmem_cache
  ceph: fix lockless caps check
  ceph: clear dir complete, invalidate dentry on replayed rename
  ceph: fix direct io truncate offset
  ceph: discard incoming messages with bad seq #
  ceph: fix seq counting for skipped messages
  ceph: add missing #includes
  ceph: fix leaked spinlock during mds reconnect
  ceph: print more useful version info on module load
  ceph: fix snap realm splits
  ceph: clear dir complete on d_move
2010-05-03 | ceph: remove bad auth_x kmem_cache | Sage Weil | 1 | -22/+10
It's useless, since our allocations are already a power of 2. And it was allocated per-instance (not globally), which caused a name collision when we tried to mount a second file system with auth_x enabled.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 | ceph: fix lockless caps check | Sage Weil | 1 | -1/+1
The __ variant requires the caller to hold i_lock.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 | ceph: clear dir complete, invalidate dentry on replayed rename | Sage Weil | 1 | -0/+9
If a rename operation is resent to the MDS following an MDS restart, the client does not get a full reply (containing the resulting metadata) back. In that case, ceph_rename() needs to compensate by doing anything useful that fill_inode() would have done, like d_move(). It also needs to invalidate the dentry (to work around the vfs_rename_dir() bug) and clear the dir complete flag, just like fill_trace().

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 | ceph: fix direct io truncate offset | Sage Weil | 1 | -1/+2
truncate_inode_pages_range wants the end offset to align with the last byte in a page.

Signed-off-by: Sage Weil <sage@newdream.net>
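The alignment requirement stated above can be illustrated with a little arithmetic. The sketch below is not the ceph change itself; it assumes a 4096-byte page (`SKETCH_PAGE_SIZE`) and uses hypothetical helper names to show how an inclusive end offset can be rounded up to the last byte of the page that contains it.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL  /* assumed page size, power of two */

/* Offset of the last byte of the page that contains 'pos'. */
static uint64_t last_byte_of_page(uint64_t pos)
{
    return pos | (SKETCH_PAGE_SIZE - 1);
}

/*
 * Inclusive end offset for a range of 'len' bytes starting at 'pos',
 * extended so that it lands on the last byte of a page.
 */
static uint64_t truncate_range_end(uint64_t pos, uint64_t len)
{
    assert(len > 0);
    return last_byte_of_page(pos + len - 1);
}
```

For example, a 5000-byte write at offset 100 ends at byte 5099, which lies in the second page, so the aligned end offset is 8191.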
2010-05-03 | ceph: discard incoming messages with bad seq # | Sage Weil | 1 | -0/+20
We can get old message seq #'s after a tcp reconnect for stateful sessions (i.e., the MDS). If we get a higher seq #, that is an error, and we shouldn't see any bad seq #'s for stateless (mon, osd) connections.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 | ceph: fix seq counting for skipped messages | Sage Weil | 1 | -0/+2
Increment in_seq even when the message is skipped for some reason.

Signed-off-by: Sage Weil <sage@newdream.net>
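The two sequence-number rules described in this commit and the previous one (drop old seq #'s, treat a jump ahead as an error, and count a message even when it is skipped) can be sketched as a small state machine. This is an assumed model for illustration only: `struct conn_model`, `check_seq()` and `receive_msg()` are hypothetical names, not the ceph messenger code, and only the `in_seq` counter is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum seq_verdict { SEQ_ACCEPT, SEQ_DISCARD_OLD, SEQ_ERROR_GAP };

/* Hypothetical per-connection state: seq # of the last message counted. */
struct conn_model {
    uint64_t in_seq;
};

/* Classify an incoming seq # against the last one counted. */
static enum seq_verdict check_seq(const struct conn_model *c, uint64_t seq)
{
    assert(c != NULL);
    if (seq <= c->in_seq)
        return SEQ_DISCARD_OLD;  /* replay after a reconnect */
    if (seq > c->in_seq + 1)
        return SEQ_ERROR_GAP;    /* a higher seq # is an error */
    return SEQ_ACCEPT;
}

/* Returns true if the message should be delivered to the upper layer. */
static bool receive_msg(struct conn_model *c, uint64_t seq, bool skip)
{
    if (check_seq(c, seq) != SEQ_ACCEPT)
        return false;
    c->in_seq++;                 /* counted even when the message is skipped */
    return !skip;
}
```

If `in_seq` were not advanced for a skipped message, the next in-order message would look like a gap and be rejected, which is the failure the second commit avoids.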
2010-05-03 | ceph: add missing #includes | Sage Weil | 3 | -0/+4
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 | ceph: fix leaked spinlock during mds reconnect | Sage Weil | 1 | -1/+1
Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 | ceph: print more useful version info on module load | Sage Weil | 1 | -3/+4
Decouple the client version from the server side. Print relevant protocol and map version info instead.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 | ceph: fix snap realm splits | Sage Weil | 1 | -10/+14
The snap realm split was checking i_snap_realm, not the list_head, to determine if an inode belonged in the new realm. The check always failed, which meant we always moved the inode, corrupting the old realm's list and causing various crashes.

Also wait to release old realm reference to avoid possibility of use after free.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 | ceph: clear dir complete on d_move | Sage Weil | 1 | -0/+4
d_move() reorders the d_subdirs list, breaking the readdir result caching. Unless/until d_move preserves that ordering, clear CEPH_I_COMPLETE on rename.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-05-03 | nilfs2: fix sync silent failure | Ryusuke Konishi | 1 | -0/+1
As of 32a88aa1, __sync_filesystem() will return 0 if s_bdi is not set. And nilfs does not set s_bdi anywhere. I noticed this problem through the warning introduced by the recent commit 5129a469 ("Catch filesystem lacking s_bdi").

    WARNING: at fs/super.c:959 vfs_kern_mount+0xc5/0x14e()
    Hardware name: PowerEdge 2850
    Modules linked in: nilfs2 loop tpm_tis tpm tpm_bios video shpchp pci_hotplug output dcdbas
    Pid: 3773, comm: mount.nilfs2 Not tainted 2.6.34-rc6-debug #38
    Call Trace:
     [<c1028422>] warn_slowpath_common+0x60/0x90
     [<c102845f>] warn_slowpath_null+0xd/0x10
     [<c1095936>] vfs_kern_mount+0xc5/0x14e
     [<c1095a03>] do_kern_mount+0x32/0xbd
     [<c10a811e>] do_mount+0x671/0x6d0
     [<c1073794>] ? __get_free_pages+0x1f/0x21
     [<c10a684f>] ? copy_mount_options+0x2b/0xe2
     [<c107b634>] ? strndup_user+0x48/0x67
     [<c10a81de>] sys_mount+0x61/0x8f
     [<c100280c>] sysenter_do_call+0x12/0x32

This patch makes nilfs set s_bdi, which fixes the silent sync failure.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-01 | NFS: Fix RCU issues in the NFSv4 delegation code | David Howells | 1 | -21/+23
Fix a number of RCU issues in the NFSv4 delegation code.

 (1) delegation->cred doesn't need to be RCU protected as it's essentially an invariant refcounted structure.

     By the time we get to nfs_free_delegation(), the delegation is being released, so no one else should be attempting to use the saved credentials, and they can be cleared.

     However, since the list of delegations could still be under traversal at this point by, for example, nfs_client_return_marked_delegations(), the cred should be released in nfs_do_free_delegation() rather than in nfs_free_delegation(). Simply using rcu_assign_pointer() to clear it is insufficient as that doesn't stop the cred from being destroyed, and nor does calling put_rpccred() after call_rcu(), given that the latter is asynchronous.

 (2) nfs_detach_delegation_locked() and nfs_inode_set_delegation() should use rcu_dereference_protected() because they can only be called if nfs_client::cl_lock is held, and that guards against anyone changing nfsi->delegation under it. Furthermore, the barrier imposed by rcu_dereference() is superfluous, given that the spin_lock() is also a barrier.

 (3) nfs_detach_delegation_locked() is now passed a pointer to the nfs_client struct so that it can issue lockdep advice based on clp->cl_lock for (2).

 (4) nfs_inode_return_delegation_noreclaim() and nfs_inode_return_delegation() should use rcu_access_pointer() outside the spinlocked region as they merely examine the pointer and don't follow it, thus rendering unnecessary the need to impose a partial ordering over the one item of interest.

These result in an RCU warning like the following:

    [ INFO: suspicious rcu_dereference_check() usage. ]
    ---------------------------------------------------
    fs/nfs/delegation.c:332 invoked rcu_dereference_check() without protection!

    other info that might help us debug this:

    rcu_scheduler_active = 1, debug_locks = 0
    2 locks held by mount.nfs4/2281:
     #0: (&type->s_umount_key#34){+.+...}, at: [<ffffffff810b25b4>] deactivate_super+0x60/0x80
     #1: (iprune_sem){+.+...}, at: [<ffffffff810c332a>] invalidate_inodes+0x39/0x13a

    stack backtrace:
    Pid: 2281, comm: mount.nfs4 Not tainted 2.6.34-rc1-cachefs #110
    Call Trace:
     [<ffffffff8105149f>] lockdep_rcu_dereference+0xaa/0xb2
     [<ffffffffa00b4591>] nfs_inode_return_delegation_noreclaim+0x5b/0xa0 [nfs]
     [<ffffffffa0095d63>] nfs4_clear_inode+0x11/0x1e [nfs]
     [<ffffffff810c2d92>] clear_inode+0x9e/0xf8
     [<ffffffff810c3028>] dispose_list+0x67/0x10e
     [<ffffffff810c340d>] invalidate_inodes+0x11c/0x13a
     [<ffffffff810b1dc1>] generic_shutdown_super+0x42/0xf4
     [<ffffffff810b1ebe>] kill_anon_super+0x11/0x4f
     [<ffffffffa009893c>] nfs4_kill_super+0x3f/0x72 [nfs]
     [<ffffffff810b25bc>] deactivate_super+0x68/0x80
     [<ffffffff810c6744>] mntput_no_expire+0xbb/0xf8
     [<ffffffff810c681b>] release_mounts+0x9a/0xb0
     [<ffffffff810c689b>] put_mnt_ns+0x6a/0x79
     [<ffffffffa00983a1>] nfs_follow_remote_path+0x5a/0x146 [nfs]
     [<ffffffffa0098334>] ? nfs_do_root_mount+0x82/0x95 [nfs]
     [<ffffffffa00985a9>] nfs4_try_mount+0x75/0xaf [nfs]
     [<ffffffffa0098874>] nfs4_get_sb+0x291/0x31a [nfs]
     [<ffffffff810b2059>] vfs_kern_mount+0xb8/0x177
     [<ffffffff810b2176>] do_kern_mount+0x48/0xe8
     [<ffffffff810c810b>] do_mount+0x782/0x7f9
     [<ffffffff810c8205>] sys_mount+0x83/0xbe
     [<ffffffff81001eeb>] system_call_fastpath+0x16/0x1b

Also on:

    fs/nfs/delegation.c:215 invoked rcu_dereference_check() without protection!
     [<ffffffff8105149f>] lockdep_rcu_dereference+0xaa/0xb2
     [<ffffffffa00b4223>] nfs_inode_set_delegation+0xfe/0x219 [nfs]
     [<ffffffffa00a9c6f>] nfs4_opendata_to_nfs4_state+0x2c2/0x30d [nfs]
     [<ffffffffa00aa15d>] nfs4_do_open+0x2a6/0x3a6 [nfs]
     ...

And:

    fs/nfs/delegation.c:40 invoked rcu_dereference_check() without protection!
     [<ffffffff8105149f>] lockdep_rcu_dereference+0xaa/0xb2
     [<ffffffffa00b3bef>] nfs_free_delegation+0x3d/0x6e [nfs]
     [<ffffffffa00b3e71>] nfs_do_return_delegation+0x26/0x30 [nfs]
     [<ffffffffa00b406a>] __nfs_inode_return_delegation+0x1ef/0x1fe [nfs]
     [<ffffffffa00b448a>] nfs_client_return_marked_delegations+0xc9/0x124 [nfs]
     ...

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-05-01 | NFSv4: Fix the locking in nfs_inode_reclaim_delegation() | Trond Myklebust | 1 | -14/+28
Ensure that we correctly rcu-dereference the delegation itself, and that we protect against removal while we're changing the contents.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-04-30 | ocfs2: Avoid direct write if we fall back to buffered I/O | Li Dongyang | 1 | -11/+14
When we fall back to buffered write from direct write, we call __generic_file_aio_write(), but that will end up doing a direct write even though we are only prepared to do a buffered write, because the file has the O_DIRECT flag set. This is a fix for https://bugzilla.novell.com/show_bug.cgi?id=591039 revised with Joel's comments.

Signed-off-by: Li Dongyang <lidongyang@novell.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-04-30 | Merge branch 'skip_delete_inode' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2-mark into ocfs2-fixes | Joel Becker | 387 | -905/+1370
2010-04-30 | Inotify: Fix build failure in inotify user support | Ralf Baechle | 1 | -0/+1
CONFIG_INOTIFY_USER defined but CONFIG_ANON_INODES undefined will result in the following build failure:

      LD      vmlinux
    fs/built-in.o: In function 'sys_inotify_init1':
    (.text.sys_inotify_init1+0x22c): undefined reference to 'anon_inode_getfd'
    fs/built-in.o: In function `sys_inotify_init1':
    (.text.sys_inotify_init1+0x22c): relocation truncated to fit: R_MIPS_26 against 'anon_inode_getfd'
    make[2]: *** [vmlinux] Error 1
    make[1]: *** [sub-make] Error 2
    make: *** [all] Error 2

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-29 | Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs | Linus Torvalds | 6 | -9/+120
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: add a shrinker to background inode reclaim
2010-04-29 | Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block | Linus Torvalds | 1 | -0/+1
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  exofs: Fix "add bdi backing to mount session" fall out
  fs: fs/super.c needs to include backing-dev.h for !CONFIG_BLOCK
2010-04-29 | xfs: add a shrinker to background inode reclaim | Dave Chinner | 6 | -9/+120
On low memory boxes or those with highmem, the kernel can OOM before background inode reclaim runs via xfssyncd. Add a shrinker to run inode reclaim so that inode reclaim is expedited when memory is low.

This is more complex than it needs to be because the VM folk don't want a context added to the shrinker infrastructure. Hence we need to add a global list of XFS mount structures so the shrinker can traverse them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-04-29 | exofs: Fix "add bdi backing to mount session" fall out | Boaz Harrosh | 1 | -1/+1
The patch "add bdi backing to mount session" (b3d0ab7e60d1865bb6f6a79a77aaba22f2543236) has a bug in the placement of the bdi member in struct exofs_sb_info. The layout member must be kept last.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-29 | fs: fs/super.c needs to include backing-dev.h for !CONFIG_BLOCK | Jens Axboe | 1 | -0/+1
When CONFIG_BLOCK is set, it ends up getting backing-dev.h included. But for !CONFIG_BLOCK, it isn't so lucky. The proper thing to do is include <linux/backing-dev.h> directly from the file it's used from, so do that.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-29 | Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 | Linus Torvalds | 5 | -22/+45
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  nfs: fix memory leak in nfs_get_sb with CONFIG_NFS_V4
  nfs: fix some issues in nfs41_proc_reclaim_complete()
  NFS: Ensure that nfs_wb_page() waits for Pg_writeback to clear
  NFS: Fix an unstable write data integrity race
  nfs: testing for null instead of ERR_PTR()
  NFS: rsize and wsize settings ignored on v4 mounts
  NFSv4: Don't attempt an atomic open if the file is a mountpoint
  SUNRPC: Fix a bug in rpcauth_prune_expired
2010-04-29 | pktcdvd: improve BKL and compat_ioctl.c usage | Arnd Bergmann | 1 | -3/+0
The pktcdvd driver uses proper locking and does not need the BKL in the ioctl and llseek functions of the character device, so kill both.

Moving the compat_ioctl handling from common code into the driver itself fixes build problems when CONFIG_BLOCK is disabled.

Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-29 | exofs: Fix "add bdi backing to mount session" fall out | Boaz Harrosh | 1 | -1/+1
Commit b3d0ab7e60d1865bb6f6a79a77aaba22f2543236 ("exofs: add bdi backing to mount session") has a bug in the placement of the bdi member at struct exofs_sb_info. The layout member must be kept last.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-28 | nfs d_revalidate() is too trigger-happy with d_drop() | Al Viro | 1 | -0/+2
If a dentry found stale happens to be the root of a disconnected tree, we can't d_drop() it; its d_hash is actually part of s_anon and d_drop() would simply hide it from shrink_dcache_for_umount(), leading to all sorts of fun, including busy inodes on umount and oopsen after that.

The bug has been there since at least 2006 (commit c636eb already has it), so it's definitely -stable fodder.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-28 | nfs: fix memory leak in nfs_get_sb with CONFIG_NFS_V4 | Xiaotian Feng | 1 | -0/+1
With CONFIG_NFS_V4 and data version 4, nfs_get_sb will allocate memory for export_path in nfs4_validate_text_mount_data, so we need to free it then. This addresses the following kmemleak report:

    unreferenced object 0xffff88016bf48a50 (size 16):
      comm "mount.nfs", pid 22567, jiffies 4651574704 (age 175471.200s)
      hex dump (first 16 bytes):
        2f 6f 70 74 2f 77 6f 72 6b 00 6b 6b 6b 6b 6b a5  /opt/work.kkkkk.
      backtrace:
        [<ffffffff814b34f9>] kmemleak_alloc+0x60/0xa7
        [<ffffffff81102c76>] kmemleak_alloc_recursive.clone.5+0x1b/0x1d
        [<ffffffff811046b3>] __kmalloc_track_caller+0x18f/0x1b7
        [<ffffffff810e1b08>] kstrndup+0x37/0x54
        [<ffffffffa0336971>] nfs_parse_devname+0x152/0x204 [nfs]
        [<ffffffffa0336af3>] nfs4_validate_text_mount_data+0xd0/0xdc [nfs]
        [<ffffffffa0338deb>] nfs_get_sb+0x325/0x736 [nfs]
        [<ffffffff81113671>] vfs_kern_mount+0xbd/0x17c
        [<ffffffff81113798>] do_kern_mount+0x4d/0xed
        [<ffffffff81129a87>] do_mount+0x787/0x7fe
        [<ffffffff81129b86>] sys_mount+0x88/0xc2
        [<ffffffff81009b42>] system_call_fastpath+0x16/0x1b

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Benny Halevy <bhalevy@panasas.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-04-28 | nfs: fix some issues in nfs41_proc_reclaim_complete() | Dan Carpenter | 1 | -1/+4
The original code passed an ERR_PTR() to rpc_put_task() and instead of returning zero on success it returned -ENOMEM.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
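The class of bug described here (handing an error pointer to a release function, and returning the wrong status on success) is easier to see in a small userspace model. The sketch below re-creates simplified versions of the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() helpers under `sketch_` names; `struct task_model`, `start_task()`, `put_task()` and `run_reclaim_complete()` are hypothetical stand-ins and not the NFS or SUNRPC code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_ERRNO 4095

/* Simplified userspace imitations of the kernel's error-pointer helpers. */
static void *sketch_err_ptr(long err) { return (void *)(intptr_t)err; }
static long sketch_ptr_err(const void *p) { return (long)(intptr_t)p; }
static int sketch_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Hypothetical refcounted task. */
struct task_model {
    int refs;
};

/* Returns a task, or an error pointer when 'fail' is set. */
static struct task_model *start_task(int fail)
{
    struct task_model *t;

    if (fail)
        return sketch_err_ptr(-ENOMEM);
    t = malloc(sizeof(*t));
    if (!t)
        return sketch_err_ptr(-ENOMEM);
    t->refs = 1;
    return t;
}

/* Must only ever be given a real task, never an error pointer. */
static void put_task(struct task_model *t)
{
    assert(!sketch_is_err(t));
    if (--t->refs == 0)
        free(t);
}

/* Corrected shape: check for the error first, return 0 on success. */
static int run_reclaim_complete(int fail)
{
    struct task_model *t = start_task(fail);

    if (sketch_is_err(t))
        return (int)sketch_ptr_err(t);
    put_task(t);
    return 0;
}
```

The key point is the ordering in `run_reclaim_complete()`: the error check comes before any use of the pointer, and the success path has its own explicit `return 0`.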
2010-04-28 | Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block | Linus Torvalds | 19 | -16/+90
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  coda: move backing-dev.h kernel include inside __KERNEL__
  mtd: ensure that bdi entries are properly initialized and registered
  Move mtd_bdi_*mappable to mtdcore.c
  btrfs: convert to using bdi_setup_and_register()
  Catch filesystems lacking s_bdi
  drbd: Terminate a connection early if sending the protocol fails
  drbd: fix memory leak
  Fix JFFS2 sync silent failure
  smbfs: add bdi backing to mount session
  ncpfs: add bdi backing to mount session
  exofs: add bdi backing to mount session
  ecryptfs: add bdi backing to mount session
  coda: add bdi backing to mount session
  cifs: add bdi backing to mount session
  afs: add bdi backing to mount session.
  9p: add bdi backing to mount session
  bdi: add helper function for doing init and register of a bdi for a file system
  block: ensure jiffies wrap is handled correctly in blk_rq_timed_out_timer
2010-04-27 | Merge branch 'for-2.6.34' of git://linux-nfs.org/~bfields/linux | Linus Torvalds | 1 | -4/+4
* 'for-2.6.34' of git://linux-nfs.org/~bfields/linux:
  nfsd4: bug in read_buf
2010-04-27 | procfs: fix tid fdinfo | Jerome Marchand | 1 | -1/+1
Correct the file_operations struct in fdinfo entry of tid_base_stuff[]. Presently /proc/*/task/*/fdinfo contains symlinks to opened files like /proc/*/fd/.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-27 | NFS: Ensure that nfs_wb_page() waits for Pg_writeback to clear | Trond Myklebust | 1 | -15/+4
Neil Brown reports that he is seeing the BUG_ON(ret == 0) trigger in nfs_page_async_flush. According to the trace in https://bugzilla.novell.com/show_bug.cgi?id=599628 the problem appears to be due to nfs_wb_page() not waiting for the PG_writeback flag to clear.

The same problem exists in nfs_wb_page_cancel().

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-04-27 | Remove redundant check for CONFIG_MMU | Christoph Egger | 1 | -7/+0
The checks for CONFIG_MMU at this location are duplicated as all the code is located inside a #ifndef CONFIG_MMU block. So the first conditional block will always be included while the second never will.

Signed-off-by: Christoph Egger <siccegge@stud.informatik.uni-erlangen.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-27 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus | Linus Torvalds | 3 | -5/+7
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
  squashfs: fix potential buffer over-run on 4K block file systems
  squashfs: add missing buffer free
  squashfs: fix warn_on when root inode is corrupted
  squashfs: fix locking bug in zlib wrapper
2010-04-26 | nfsd4: bug in read_buf | Neil Brown | 1 | -4/+4
When read_buf is called to move over to the next page in the pagelist of an NFSv4 request, it sets argp->end to essentially a random number, certainly not an address within the page which argp->p now points to. So subsequent calls to READ_BUF will think there is much more than a page of spare space (the cast to u32 ensures an unsigned comparison) so we can expect to fall off the end of the second page.

We never encountered this in testing because typically the only operations which use more than two pages are write-like operations, which have their own decoding logic. Something like a getattr after a write may cross a page boundary, but it would be very unusual for it to cross another boundary after that.

Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
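The effect of the unsigned comparison mentioned in this message can be reproduced with plain integers. The sketch below is an assumed model of the space check, not the nfsd macro: `struct decode_state` and `has_room()` are hypothetical, and byte offsets stand in for the `argp->p` and `argp->end` pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the decode state, using byte offsets. */
struct decode_state {
    int64_t p;    /* current read position */
    int64_t end;  /* one past the last valid byte of the current page */
};

/*
 * Space check modelled on the description above: the remaining space
 * is computed as end - p and cast to an unsigned 32-bit value.
 */
static int has_room(const struct decode_state *s, uint32_t nbytes)
{
    assert(s != NULL);
    return nbytes <= (uint32_t)(s->end - s->p);
}
```

With a consistent state (`p = 0`, `end = 4096`) a 5000-byte request is refused. If `end` is left behind `p`, say `p = 8192` and `end = 4096`, the difference is negative and the cast turns it into roughly 4 GiB of apparent space, so even a 100000-byte request is accepted.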
2010-04-26 | xfs: more swap extent fixes for dynamic fork offsets | Dave Chinner | 1 | -6/+16
A new xfsqa test (226) with a prototype xfs_fsr change to try to handle dynamic fork offsets better triggers an assertion failure where the inode data fork is in btree format, yet there is room in the inode for it to be in extent format. The two inodes look like:

    before: ino 0x101 (target), num_extents 11, Max in-fork extents 6, broot size 40, fork offset 96
    before: ino 0x115 (temp), num_extents 5, Max in-fork extents 3, broot size 40, fork offset 56
    after:  ino 0x101 (target), num_extents 5, Max in-fork extents 6, broot size 40, fork offset 96
    after:  ino 0x115 (temp), num_extents 11, Max in-fork extents 3, broot size 40, fork offset 56

Basically the target inode ends up with 5 extents in btree format, but it had space for 6 extents in extent format, so ends up incorrect. Notably here the broot size is the same, and that is where the kernel code is going wrong - the btree root will fit, so it lets the swap go ahead.

The check should not allow the swap to take place if the number of extents while in btree format is less than the number of extents that can fit in the inode in extent format. Adding that check will prevent this swap and corruption from occurring.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
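The extra check described in this message can be written down as a small predicate and tried against the inode numbers quoted above. This is a sketch under stated assumptions, not the XFS code: `struct fork_model` and `swap_allowed()` are hypothetical, the existing broot-size check is left out, and treating an extent count equal to the in-fork maximum as also fitting (`<=`) is an assumption; the message itself only says "less than".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one inode's data fork. */
struct fork_model {
    bool btree_format;             /* fork is currently in btree format */
    unsigned num_extents;          /* extents recorded in the fork */
    unsigned max_in_fork_extents;  /* extents that fit in extent format */
};

/*
 * May the data fork of 'src' be moved into inode 'dst'?
 * Refuse if 'src' is in btree format but its extents would fit
 * in 'dst' in extent format, since 'dst' would then hold a btree
 * fork that ought to be an extent fork.
 */
static bool swap_allowed(const struct fork_model *src,
                         const struct fork_model *dst)
{
    assert(src != NULL && dst != NULL);
    if (src->btree_format && src->num_extents <= dst->max_in_fork_extents)
        return false;
    return true;
}
```

Using the "before" values, moving the temp inode's fork (5 extents, btree format) into the target (room for 6) is refused, while the opposite direction (11 extents into room for 3) passes. A full swap would need both directions to pass.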
2010-04-26 | btrfs: convert to using bdi_setup_and_register() | Jens Axboe | 1 | -11/+1
It's now a provided helper, so get rid of the internal setup and btrfs atomic_t bdi enumerator.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-25 | Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 | Linus Torvalds | 3 | -11/+14
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Issue the discard operation *before* releasing the blocks to be reused
  ext4: Fix buffer head leaks after calls to ext4_get_inode_loc()
  ext4: Fix possible lost inode write in no journal mode
2010-04-25 | Catch filesystems lacking s_bdi | Jörn Engel | 2 | -4/+7
noop_backing_dev_info is used only as a flag to mark filesystems that don't have any backing store, like tmpfs, procfs, spufs, etc.

Signed-off-by: Joern Engel <joern@logfs.org>

Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes to the noop_backing_dev_info is not legal and will not result in them being flushed, but we already catch this condition in __mark_inode_dirty() when checking for a registered bdi.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-25 | squashfs: fix potential buffer over-run on 4K block file systems | Phillip Lougher | 1 | -3/+2
Sizing the buffer based on block size is incorrect, leading to a potential buffer over-run on 4K block size file systems (because the metadata block size is always 8K). This bug doesn't seem to have been triggered because 4K block size file systems are not the default, and also because metadata blocks after compression tend to be less than 4K.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
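The sizing problem can be shown with two constants. The sketch below is not the squashfs change; it only assumes the 8K metadata block size stated in the message (`SKETCH_METADATA_SIZE`) and uses hypothetical helpers to contrast a buffer sized from the file system block size with one that is never smaller than a metadata block. Taking the larger of the two sizes is one possible remedy chosen here for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_METADATA_SIZE 8192u  /* metadata blocks are always 8K */

/* Problematic sizing: follows the file system block size only. */
static size_t buffer_size_from_block_size(size_t block_size)
{
    return block_size;
}

/* Assumed remedy: never smaller than one metadata block. */
static size_t buffer_size_for_metadata(size_t block_size)
{
    assert(block_size > 0);
    return block_size > SKETCH_METADATA_SIZE ? block_size
                                             : SKETCH_METADATA_SIZE;
}

/* True if 'data_len' bytes would not fit in a buffer of 'buf_len'. */
static int would_overrun(size_t buf_len, size_t data_len)
{
    return data_len > buf_len;
}
```

On a 4K block size file system the first helper yields a 4096-byte buffer, which an uncompressed 8192-byte metadata block would overrun; the second helper yields 8192 bytes and leaves larger block sizes unchanged.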
2010-04-25 | squashfs: add missing buffer free | Phillip Lougher | 1 | -0/+1
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2010-04-25 | squashfs: fix warn_on when root inode is corrupted | Phillip Lougher | 1 | -1/+2
Fix warn_on triggered by mounting a fsfuzzer corrupted file system, where the root inode has been corrupted.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Reported-by: Steve Grubb <sgrubb@redhat.com>
2010-04-24fs/block_dev.c: fix performance regression in O_DIRECT|O_SYNC writes to ↵Anton Blanchard1-5/+12
block devices

We are seeing a large regression in database performance on recent kernels. The database opens a block device with O_DIRECT|O_SYNC and a number of threads write to different regions of the file at the same time. A simple test case is below. I haven't defined DEVICE since getting it wrong will destroy your data :)

On a 3 disk LVM with a 64k chunk size we see about 17MB/sec and only a few threads in IO wait:

    procs  -----io---- -system-- -----cpu------
     r  b    bi    bo   in   cs us sy id wa st
     0  3     0 16170  656 2259  0  0 86 14  0
     0  2     0 16704  695 2408  0  0 92  8  0
     0  2     0 17308  744 2653  0  0 86 14  0
     0  2     0 17933  759 2777  0  0 89 10  0

Most threads are blocking in vfs_fsync_range, which has:

    mutex_lock(&mapping->host->i_mutex);
    err = fop->fsync(file, dentry, datasync);
    if (!ret)
            ret = err;
    mutex_unlock(&mapping->host->i_mutex);

commit 148f948ba877f4d3cdef036b1ff6d9f68986706a (vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode) offers some explanation of what is going on:

    Use these new helpers for syncing from generic VFS functions. This makes O_SYNC writes to block devices acquire i_mutex for syncing. If we really care about this, we can make block_fsync() drop the i_mutex and reacquire it before it returns.

Thanks Jan for such a good commit message! As well as dropping i_mutex, Christoph suggests we should remove the call to sync_blockdev():

> sync_blockdev is an overcomplicated alias for filemap_write_and_wait on
> the block device inode, which is exactly what we did just before calling
> into ->fsync

The patch below incorporates both suggestions.

With it the testcase improves from 17MB/sec to 68MB/sec:

    procs  -----io---- -system-- -----cpu------
     r  b    bi    bo   in   cs us sy id wa st
     0  7     0 65536 1000 3878  0  0 70 30  0
     0 34     0 69632 1016 3921  0  1 46 53  0
     0 57     0 69632 1000 3921  0  0 55 45  0
     0 53     0 69640  754 4111  0  0 81 19  0

Testcase:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <pthread.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    #define NR_THREADS 64
    #define BUFSIZE (64 * 1024)

    #define DEVICE "/dev/mapper/XXXXXX"

    #define ALIGN(VAL, SIZE) (((VAL)+(SIZE)-1) & ~((SIZE)-1))

    static int fd;

    static void *doit(void *arg)
    {
            unsigned long offset = (long)arg;
            char *b, *buf;

            b = malloc(BUFSIZE + 1024);
            buf = (char *)ALIGN((unsigned long)b, 1024);
            memset(buf, 0, BUFSIZE);

            while (1)
                    pwrite(fd, buf, BUFSIZE, offset);
    }

    int main(int argc, char *argv[])
    {
            int flags = O_RDWR|O_DIRECT;
            int i;
            unsigned long offset = 0;

            if (argc > 1 && !strcmp(argv[1], "O_SYNC"))
                    flags |= O_SYNC;

            fd = open(DEVICE, flags);
            if (fd == -1) {
                    perror("open");
                    exit(1);
            }

            for (i = 0; i < NR_THREADS-1; i++) {
                    pthread_t tid;
                    pthread_create(&tid, NULL, doit, (void *)offset);
                    offset += BUFSIZE;
            }
            doit((void *)offset);

            return 0;
    }

Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
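The locking idea in this message (the fsync callback is entered with i_mutex held, drops it around the flush, and reacquires it before returning) can be sketched in user space with a pthread mutex. This is only an analogue under stated assumptions, not the kernel patch: `sketch_i_mutex`, `sketch_flush()`, `sketch_block_fsync()` and `sketch_fsync_range()` are hypothetical names, and the flush is a stub that merely records whether the lock was free while it ran.

```c
#include <pthread.h>

/* Hypothetical stand-in for the per-inode i_mutex. */
static pthread_mutex_t sketch_i_mutex = PTHREAD_MUTEX_INITIALIZER;

/* 1 if the lock was free during the flush, 0 if held, -1 if never run. */
static int lock_free_during_flush = -1;

/* Stub for the slow device flush; it only probes the lock state. */
static int sketch_flush(void)
{
    if (pthread_mutex_trylock(&sketch_i_mutex) == 0) {
        lock_free_during_flush = 1;
        pthread_mutex_unlock(&sketch_i_mutex);
    } else {
        lock_free_during_flush = 0;
    }
    return 0;
}

/* fsync-style callback: entered with the lock held, drops it around the
 * flush so other writers are not serialised, then reacquires it. */
static int sketch_block_fsync(void)
{
    int err;

    pthread_mutex_unlock(&sketch_i_mutex);
    err = sketch_flush();
    pthread_mutex_lock(&sketch_i_mutex);
    return err;
}

/* Caller shaped like the vfs_fsync_range() excerpt quoted above. */
static int sketch_fsync_range(void)
{
    int err;

    pthread_mutex_lock(&sketch_i_mutex);
    err = sketch_block_fsync();
    pthread_mutex_unlock(&sketch_i_mutex);
    return err;
}
```

After a call to `sketch_fsync_range()`, `lock_free_during_flush` is 1, which shows that the flush ran without the lock held. The caller's lock/unlock pairing stays balanced because the callback reacquires the mutex before it returns.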
2010-04-24reiserfs: fix corruption during shrinking of xattrsJeff Mahoney1-1/+1
Commit 48b32a3553a54740d236b79a90f20147a25875e3 ("reiserfs: use generic xattr handlers") introduced a problem that causes corruption when extended attributes are replaced with a smaller value. The issue is that the reiserfs_setattr to shrink the xattr file was moved from before the write to after the write. The root issue has always been in the reiserfs xattr code, but was papered over by the fact that in the shrink case, the file would just be expanded again while the xattr was written. The end result is that the last 8 bytes of xattr data are lost. This patch fixes it to use new_size. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=14826 Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reported-by: Christian Kujau <lists@nerdbynature.de> Tested-by: Christian Kujau <lists@nerdbynature.de> Cc: Edward Shishkin <edward.shishkin@gmail.com> Cc: Jethro Beekman <kernel@jbeekman.nl> Cc: Greg Surbey <gregsurbey@hotmail.com> Cc: Marco Gatti <marco.gatti@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>