summaryrefslogtreecommitdiff
path: root/fs/ext4/inode.c
AgeCommit message (Collapse)AuthorFilesLines
2014-10-11ext4: fix reservation overflow in ext4_da_write_beginEric Sandeen1-1/+16
Delalloc write journal reservations only reserve 1 credit, to update the inode if necessary. However, it may happen once in a filesystem's lifetime that a file will cross the 2G threshold, and require the LARGE_FILE feature to be set in the superblock as well, if it was not set already. This overruns the transaction reservation, and can be demonstrated simply on any ext4 filesystem without the LARGE_FILE feature already set: dd if=/dev/zero of=testfile bs=1 seek=2147483646 count=1 \ conv=notrunc of=testfile sync dd if=/dev/zero of=testfile bs=1 seek=2147483647 count=1 \ conv=notrunc of=testfile leads to: EXT4-fs: ext4_do_update_inode:4296: aborting transaction: error 28 in __ext4_handle_dirty_super EXT4-fs error (device loop0) in ext4_do_update_inode:4301: error 28 EXT4-fs error (device loop0) in ext4_reserve_inode_write:4757: Readonly filesystem EXT4-fs error (device loop0) in ext4_dirty_inode:4876: error 28 EXT4-fs error (device loop0) in ext4_da_write_end:2685: error 28 Adjust the number of credits based on whether the flag is already set, and whether the current write may extend past the LARGE_FILE limit. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@vger.kernel.org
2014-10-05ext4: add ext4_iget_normal() which is to be used for dir tree lookupsTheodore Ts'o1-0/+7
If there is a corrupted file system which has directory entries that point at reserved, metadata inodes, prohibit them from being used by treating them the same way we treat Boot Loader inodes --- that is, mark them to be bad inodes. This prohibits them from being opened, deleted, or modified via chmod, chown, utimes, etc. In particular, this prevents a corrupted file system which has a directory entry which points at the journal inode from being deleted and its blocks released, after which point Much Hilarity Ensues. Reported-by: Sami Liedes <sami.liedes@iki.fi> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-10-05ext4: don't orphan or truncate the boot loader inodeTheodore Ts'o1-4/+3
The boot loader inode (inode #5) should never be visible in the directory hierarchy, but it's possible if the file system is corrupted that there will be a directory entry that points at inode #5. In order to avoid accidentally trashing it, when such a directory inode is opened, the inode will be marked as a bad inode, so that it's not possible to modify (or read) the inode from userspace. Unfortunately, when we unlink this (invalid/illegal) directory entry, we will put the bad inode on the ophan list, and then when try to unlink the directory, we don't actually remove the bad inode from the orphan list before freeing in-memory inode structure. This means the in-memory orphan list is corrupted, leading to a kernel oops. In addition, avoid truncating a bad inode in ext4_destroy_inode(), since truncating the boot loader inode is not a smart thing to do. Reported-by: Sami Liedes <sami.liedes@iki.fi> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-10-01ext4: fix return value of ext4_do_update_inodeLi Xi1-1/+2
When ext4_do_update_inode() gets error from ext4_inode_blocks_set(), error number should be returned. Signed-off-by: Li Xi <lixi@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
2014-10-01ext4: fix mmap data corruption when blocksize < pagesizeJan Kara1-1/+5
Use truncate_isize_extended() when hole is being created in a file so that ->page_mkwrite() will get called for the partial tail page if it is mmaped (see the first patch in the series for details). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-09-04ext4: drop the EXT4_STATE_DELALLOC_RESERVED flagTheodore Ts'o1-16/+4
Having done a full regression test, we can now drop the DELALLOC_RESERVED state flag. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-01ext4: fix comments about get_blocksSeunghun Lee1-2/+2
get_blocks is renamed to get_block. Signed-off-by: Seunghun Lee <waydi1@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-30ext4: use ext4_update_i_disksize instead of opencoded onesDmitry Monakhov1-4/+1
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-29ext4: convert ext4_bread() to use the ERR_PTR conventionTheodore Ts'o1-10/+4
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-29ext4: convert ext4_getblk() to use the ERR_PTR conventionTheodore Ts'o1-26/+25
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-28ext4: update i_disksize coherently with block allocation on error pathDmitry Monakhov1-2/+8
In case of delalloc block i_disksize may be less than i_size. So we have to update i_disksize each time we allocated and submitted some blocks beyond i_disksize. We weren't doing this on the error paths, so fix this. testcase: xfstest generic/019 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-08-23ext4: move i_size,i_disksize update routines to helper functionDmitry Monakhov1-26/+8
Cc: stable@vger.kernel.org # needed for bug fix patches Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15ext4: fix punch hole on files with indirect mappingLukas Czerner1-1/+1
Currently punch hole code on files with direct/indirect mapping has some problems which may lead to a data loss. For example (from Jan Kara): fallocate -n -p 10240000 4096 will punch the range 10240000 - 12632064 instead of the range 1024000 - 10244096. Also the code is a bit weird and it's not using infrastructure provided by indirect.c, but rather creating it's own way. This patch fixes the issues as well as making the operation to run 4 times faster from my testing (punching out 60GB file). It uses similar approach used in ext4_ind_truncate() which takes advantage of ext4_free_branches() function. Also rename the ext4_free_hole_blocks() to something more sensible, like the equivalent we have for extent mapped files. Call it ext4_ind_remove_space(). This has been tested mostly with fsx and some xfstests which are testing punch hole but does not require unwritten extents which are not supported with direct/indirect mapping. Not problems showed up even with 1024k block size. CC: stable@vger.kernel.org Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15ext4: remove metadata reservation checksTheodore Ts'o1-123/+5
Commit 27dd43854227b ("ext4: introduce reserved space") reserves 2% of the file system space to make sure metadata allocations will always succeed. Given that, tracking the reservation of metadata blocks is no longer necessary. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-06-12Merge branch 'for-linus' of ↵Linus Torvalds1-13/+11
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "This the bunch that sat in -next + lock_parent() fix. This is the minimal set; there's more pending stuff. In particular, I really hope to get acct.c fixes merged this cycle - we need that to deal sanely with delayed-mntput stuff. In the next pile, hopefully - that series is fairly short and localized (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more iov_iter work. Most of prereqs for ->splice_write with sane locking order are there and Kent's dio rewrite would also fit nicely on top of this pile" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits) lock_parent: don't step on stale ->d_parent of all-but-freed one kill generic_file_splice_write() ceph: switch to iter_file_splice_write() shmem: switch to iter_file_splice_write() nfs: switch to iter_splice_write_file() fs/splice.c: remove unneeded exports ocfs2: switch to iter_file_splice_write() ->splice_write() via ->write_iter() bio_vec-backed iov_iter optimize copy_page_{to,from}_iter() bury generic_file_aio_{read,write} lustre: get rid of messing with iovecs ceph: switch to ->write_iter() ceph_sync_direct_write: stop poking into iov_iter guts ceph_sync_read: stop poking into iov_iter guts new helper: copy_page_from_iter() fuse: switch to ->write_iter() btrfs: switch to ->write_iter() ocfs2: switch to ->write_iter() xfs: switch to ->write_iter() ...
2014-06-02ext4: handle symlink properly with inline_dataZheng Liu1-0/+3
This commit tries to fix a bug that we can't read symlink properly with inline data feature when the length of symlink is greater than 60 bytes but less than extra space. The key issue is in ext4_inode_is_fast_symlink() that it doesn't check whether or not an inode has inline data. When the user creates a new symlink, an inode will be allocated with MAY_INLINE_DATA flag. Then symlink will be stored in ->i_block and extended attribute space. In the mean time, this inode is with inline data flag. After remounting it, ext4_inode_is_fast_symlink() function thinks that this inode is a fast symlink so that the data in ->i_block is copied to the user, and the data in extra space is trimmed. In fact this inode should be as a normal symlink. The following script can hit this bug. #!/bin/bash cd ${MNT} filename=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 rm -rf test mkdir test cd test echo "hello" >$filename ln -s $filename symlinkfile cd sudo umount /mnt/sda1 sudo mount -t ext4 /dev/sda1 /mnt/sda1 readlink /mnt/sda1/test/symlinkfile After applying this patch, it will break the assumption in e2fsck because the original implementation doesn't want to support symlink with inline data. Reported-by: "Darrick J. Wong" <darrick.wong@oracle.com> Reported-by: Ian Nartowicz <claws@nartowicz.co.uk> Cc: Ian Nartowicz <claws@nartowicz.co.uk> Cc: Tao Ma <tm@tao.ma> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-05-12ext4: add missing BUFFER_TRACE before ext4_journal_get_write_accessliang xie1-0/+3
Make them more consistently Signed-off-by: xieliang <xieliang@xiaomi.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12ext4: remove unnecessary double parenthesesLukas Czerner1-5/+5
Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12ext4: make local functions staticStephen Hemminger1-1/+1
I have been running make namespacecheck to look for unneeded globals, and found these in ext4. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12ext4: fix data integrity sync in ordered modeNamjae Jeon1-2/+4
When we perform a data integrity sync we tag all the dirty pages with PAGECACHE_TAG_TOWRITE at start of ext4_da_writepages. Later we check for this tag in write_cache_pages_da and creates a struct mpage_da_data containing contiguously indexed pages tagged with this tag and sync these pages with a call to mpage_da_map_and_submit. This process is done in while loop until all the PAGECACHE_TAG_TOWRITE pages are synced. We also do journal start and stop in each iteration. journal_stop could initiate journal commit which would call ext4_writepage which in turn will call ext4_bio_write_page even for delayed OR unwritten buffers. When ext4_bio_write_page is called for such buffers, even though it does not sync them but it clears the PAGECACHE_TAG_TOWRITE of the corresponding page and hence these pages are also not synced by the currently running data integrity sync. We will end up with dirty pages although sync is completed. This could cause a potential data loss when the sync call is followed by a truncate_pagecache call, which is exactly the case in collapse_range. (It will cause generic/127 failure in xfstests) To avoid this issue, we can use set_page_writeback_keepwrite instead of set_page_writeback, which doesn't clear TOWRITE tag. Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-05-06Merge ext4 changes in ext4_file_write() into for-nextAl Viro1-12/+13
From ext4.git#dev, needed for switch of ext4 to ->write_iter() ;-/
2014-05-06switch {__,}blockdev_direct_IO() to iov_iterAl Viro1-2/+2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06get rid of pointless iov_length() in ->direct_IO()Al Viro1-4/+4
all callers have iov_length(iter->iov, iter->nr_segs) == iov_iter_count(iter) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06ext4: switch the guts of ->direct_IO() to iov_iterAl Viro1-8/+7
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06pass iov_iter to ->direct_IO()Al Viro1-6/+5
unmodified, for now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-21ext4: add a new spinlock i_raw_lock to protect the ext4's raw inodeTheodore Ts'o1-19/+22
To avoid potential data races, use a spinlock which protects the raw (on-disk) inode. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-20ext4: rename uninitialized extents to unwrittenLukas Czerner1-9/+9
Currently in ext4 there is quite a mess when it comes to naming unwritten extents. Sometimes we call it uninitialized and sometimes we refer to it as unwritten. The right name for the extent which has been allocated but does not contain any written data is _unwritten_. Other file systems are using this name consistently, even the buffer head state refers to it as unwritten. We need to fix this confusion in ext4. This commit changes every reference to an uninitialized extent (meaning allocated but unwritten) to unwritten extent. This includes comments, function names and variable names. It even covers abbreviation of the word uninitialized (such as uninit) and some misspellings. This commit does not change any of the code paths at all. This has been confirmed by comparing md5sums of the assembly code of each object file after all the function names were stripped from it. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-20ext4: get rid of EXT4_MAP_UNINIT flagLukas Czerner1-3/+4
Currently EXT4_MAP_UNINIT is used in dioread_nolock case to mark the cases where we're using dioread_nolock and we're writing into either unallocated, or unwritten extent, because we need to make sure that any DIO write into that inode will wait for the extent conversion. However EXT4_MAP_UNINIT is not only entirely misleading name but also unnecessary because we can check for EXT4_MAP_UNWRITTEN in the dioread_nolock case instead. This commit removes EXT4_MAP_UNINIT flag. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-18ext4: discard preallocations after removing spaceLukas Czerner1-1/+0
Currently in ext4_collapse_range() and ext4_punch_hole() we're discarding preallocation twice. Once before we attempt to do any changes and second time after we're done with the changes. While the second call to ext4_discard_preallocations() in ext4_punch_hole() case is not needed, we need to discard preallocation right after ext4_ext_remove_space() in collapse range case because in the case we had to restart a transaction in the middle of removing space we might have new preallocations created. Remove unneeded ext4_discard_preallocations() ext4_punch_hole() and move it to the better place in ext4_collapse_range() Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-12fs: disallow all fallocate operation on active swapfileLukas Czerner1-5/+0
Currently some file system have IS_SWAPFILE check in their fallocate implementations and some do not. However we should really prevent any fallocate operation on swapfile so move the check to vfs and remove the redundant checks from the file systems fallocate implementations. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-12ext4: remove unnecessary check for APPEND and IMMUTABLELukas Czerner1-5/+1
All the checks IS_APPEND and IS_IMMUTABLE for the fallocate operation on the inode are done in vfs. No need to do this again in ext4. Remove it. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-11ext4: move ext4_update_i_disksize() into mpage_map_and_submit_extent()Theodore Ts'o1-3/+13
The function ext4_update_i_disksize() is used in only one place, in the function mpage_map_and_submit_extent(). Move its code to simplify the code paths, and also move the call to ext4_mark_inode_dirty() into the i_data_sem's critical region, to be consistent with all of the other places where we update i_disksize. That way, we also keep the raw_inode's i_disksize protected, to avoid the following race: CPU #1 CPU #2 down_write(&i_data_sem) Modify i_disk_size up_write(&i_data_sem) down_write(&i_data_sem) Modify i_disk_size Copy i_disk_size to on-disk inode up_write(&i_data_sem) Copy i_disk_size to on-disk inode Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org
2014-04-08ext4: update PF_MEMALLOC handling in ext4_write_inode()Theodore Ts'o1-12/+11
The special handling of PF_MEMALLOC callers in ext4_write_inode() shouldn't be necessary as there shouldn't be any. Warn about it. Also update comment before the function as it seems somewhat outdated. (Changes modeled on an ext3 patch posted by Jan Kara to the linux-ext4 mailing list on Februaryt 28, 2014, which apparently never went into the ext3 tree.) Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz>
2014-04-07ext4: FIBMAP ioctl causes BUG_ON due to handle EXT_MAX_BLOCKSKazuya Mio1-0/+4
When we try to get 2^32-1 block of the file which has the extent (ee_block=2^32-2, ee_len=1) with FIBMAP ioctl, it causes BUG_ON in ext4_ext_put_gap_in_cache(). To avoid the problem, ext4_map_blocks() needs to check the file logical block number. ext4_ext_put_gap_in_cache() called via ext4_map_blocks() cannot handle 2^32-1 because the maximum file logical block number is 2^32-2. Note that ext4_ind_map_blocks() returns -EIO when the block number is invalid. So ext4_map_blocks() should also return the same errno. Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-04-04Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-45/+75
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Major changes for 3.14 include support for the newly added ZERO_RANGE and COLLAPSE_RANGE fallocate operations, and scalability improvements in the jbd2 layer and in xattr handling when the extended attributes spill over into an external block. Other than that, the usual clean ups and minor bug fixes" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits) ext4: fix premature freeing of partial clusters split across leaf blocks ext4: remove unneeded test of ret variable ext4: fix comment typo ext4: make ext4_block_zero_page_range static ext4: atomically set inode->i_flags in ext4_set_inode_flags() ext4: optimize Hurd tests when reading/writing inodes ext4: kill i_version support for Hurd-castrated file systems ext4: each filesystem creates and uses its own mb_cache fs/mbcache.c: doucple the locking of local from global data fs/mbcache.c: change block and index hash chain to hlist_bl_node ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate ext4: refactor ext4_fallocate code ext4: Update inode i_size after the preallocation ext4: fix partial cluster handling for bigalloc file systems ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents ext4: only call sync_filesystm() when remounting read-only fs: push sync_filesystem() down to the file system's remount_fs() jbd2: improve error messages for inconsistent journal heads jbd2: minimize region locked by j_list_lock in jbd2_journal_forget() jbd2: minimize region locked by j_list_lock in journal_get_create_access() ...
2014-04-03mm + fs: store shadow entries in page cacheJohannes Weiner1-2/+2
Reclaim will be leaving shadow entries in the page cache radix tree upon evicting the real page. As those pages are found from the LRU, an iput() can lead to the inode being freed concurrently. At this point, reclaim must no longer install shadow pages because the inode freeing code needs to ensure the page tree is really empty. Add an address_space flag, AS_EXITING, that the inode freeing code sets under the tree lock before doing the final truncate. Reclaim will check for this flag before installing shadow pages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-30ext4: atomically set inode->i_flags in ext4_set_inode_flags()Theodore Ts'o1-6/+9
Use cmpxchg() to atomically set i_flags instead of clearing out the S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race where an immutable file has the immutable flag cleared for a brief window of time. Reported-by: John Sullivan <jsrhbz@kanargh.force9.co.uk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-24ext4: fix comment typoMatthew Wilcox1-1/+1
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-24ext4: make ext4_block_zero_page_range staticMatthew Wilcox1-21/+21
It's only called within inode.c, so make it static, remove its prototype from ext4.h and move it above all of its callers so it doesn't need a prototype within inode.c. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-24ext4: atomically set inode->i_flags in ext4_set_inode_flags()Theodore Ts'o1-6/+8
Use cmpxchg() to atomically set i_flags instead of clearing out the S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race where an immutable file has the immutable flag cleared for a brief window of time. Reported-by: John Sullivan <jsrhbz@kanargh.force9.co.uk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2014-03-24ext4: optimize Hurd tests when reading/writing inodesTheodore Ts'o1-6/+3
Set a in-memory superblock flag to indicate whether the file system is designed to support the Hurd. Also, add a sanity check to make sure the 64-bit feature is not set for Hurd file systems, since i_file_acl_high conflicts with a Hurd-specific field. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-20ext4: kill i_version support for Hurd-castrated file systemsTheodore Ts'o1-11/+18
The Hurd file system uses uses the inode field which is now used for i_version for its translator block. This means that ext2 file systems that are formatted for GNU Hurd can't be used to support NFSv4. Given that Hurd file systems don't support extents, and a huge number of modern file system features, this is no great loss. If we don't do this, the attempt to update the i_version field will stomp over the translator block field, which will cause file system corruption for Hurd file systems. This can be replicated via: mke2fs -t ext2 -o hurd /dev/vdc mount -t ext4 /dev/vdc /vdc touch /vdc/bug0000 umount /dev/vdc e2fsck -f /dev/vdc Addresses-Debian-Bug: #738758 Reported-By: Gabriele Giacone <1o5g4r8o@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-18ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocateLukas Czerner1-6/+11
Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same functionality as xfs ioctl XFS_IOC_ZERO_RANGE. It can be used to convert a range of file to zeros preferably without issuing data IO. Blocks should be preallocated for the regions that span holes in the file, and the entire range is preferable converted to unwritten extents This can be also used to preallocate blocks past EOF in the same way as with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode size to remain the same. Also add appropriate tracepoints. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-04ext4: Speedup WB_SYNC_ALL pass called from sync(2)Jan Kara1-2/+11
When doing filesystem wide sync, there's no need to force transaction commit (or synchronously write inode buffer) separately for each inode because ext4_sync_fs() takes care of forcing commit at the end (VFS takes care of flushing buffer cache, respectively). Most of the time this slowness doesn't manifest because previous WB_SYNC_NONE writeback doesn't leave much to write but when there are processes aggressively creating new files and several filesystems to sync, the sync slowness can be noticeable. In the following test script sync(1) takes around 6 minutes when there are two ext4 filesystems mounted on a standard SATA drive. After this patch sync takes a couple of seconds so we have about two orders of magnitude improvement. function run_writers { for (( i = 0; i < 10; i++ )); do mkdir $1/dir$i for (( j = 0; j < 40000; j++ )); do dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null done & done } for dir in "$@"; do run_writers $dir done sleep 40 time sync Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-20ext4: avoid exposure of stale data in ext4_punch_hole()Maxim Patlasov1-0/+6
While handling punch-hole fallocate, it's useless to truncate page cache before removing the range from extent tree (or block map in indirect case) because page cache can be re-populated (by read-ahead or read(2) or mmap-ed read) immediately after truncating page cache, but before updating extent tree (or block map). In that case the user will see stale data even after fallocate is completed. Until the problem of data corruption resulting from pages backed by already freed blocks is fully resolved, the simple thing we can do now is to add another truncation of pagecache after punch hole is done. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-02-20ext4: avoid possible overflow in ext4_map_blocks()Theodore Ts'o1-0/+6
The ext4_map_blocks() function returns the number of blocks which satisfying the caller's request. This number of blocks requested by the caller is specified by an unsigned integer, but the return value of ext4_map_blocks() is a signed integer (to accomodate error codes per the kernel's standard error signalling convention). Historically, overflows could never happen since mballoc() will refuse to allocate more than 2048 blocks at a time (which is something we should fix), and if the blocks were already allocated, the fact that there would be some number of intervening metadata blocks pretty much guaranteed that there could never be a contiguous region of data blocks that was greater than 2**31 blocks. However, this is now possible if there is a file system which is a bit bigger than 8TB, and is created using the new mke2fs hugeblock feature, which can create a perfectly contiguous file. In that case, if a userspace program attempted to call fallocate() on this already fully allocated file, it's possible that ext4_map_blocks() could return a number large enough that it would overflow a signed integer, resulting in a ext4 thinking that the ext4_map_blocks() call had failed with some strange error code. Since ext4_map_blocks() is always free to return a smaller number of blocks than what was requested by the caller, fix this by capping the number of blocks that ext4_map_blocks() will ever try to map to 2**31 - 1. In practice this should never get hit, except by someone deliberately trying to provke the above-described bug. Thanks to the PaX team for asking whethre this could possibly happen in some off-line discussions about using some static code checking technology they are developing to find bugs in kernel code. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-01-28Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-8/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 update from Ted Ts'o: "Bug fixes and cleanups for ext4. We also enable the punch hole functionality for bigalloc file systems" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: delete "set but not used" variables ext4: don't pass freed handle to ext4_walk_page_buffers ext4: avoid clearing beyond i_blocks when truncating an inline data file ext4: ext4_inode_is_fast_symlink should use EXT4_CLUSTER_SIZE ext4: fix a typo in extents.c ext4: use %pd printk specificer ext4: standardize error handling in ext4_da_write_inline_data_begin() ext4: retry allocation when inline->extent conversion failed ext4: enable punch hole for bigalloc
2014-01-28Merge branch 'for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "Assorted stuff; the biggest pile here is Christoph's ACL series. Plus assorted cleanups and fixes all over the place... There will be another pile later this week" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits) __dentry_path() fixes vfs: Remove second variable named error in __dentry_path vfs: Is mounted should be testing mnt_ns for NULL or error. Fix race when checking i_size on direct i/o read hfsplus: remove can_set_xattr nfsd: use get_acl and ->set_acl fs: remove generic_acl nfs: use generic posix ACL infrastructure for v3 Posix ACLs gfs2: use generic posix ACL infrastructure jfs: use generic posix ACL infrastructure xfs: use generic posix ACL infrastructure reiserfs: use generic posix ACL infrastructure ocfs2: use generic posix ACL infrastructure jffs2: use generic posix ACL infrastructure hfsplus: use generic posix ACL infrastructure f2fs: use generic posix ACL infrastructure ext2/3/4: use generic posix ACL infrastructure btrfs: use generic posix ACL infrastructure fs: make posix_acl_create more useful fs: make posix_acl_chmod more useful ...
2014-01-25ext2/3/4: use generic posix ACL infrastructureChristoph Hellwig1-1/+1
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-23Merge tag 'xfs-for-linus-v3.14-rc1' of git://oss.sgi.com/xfs/xfsLinus Torvalds1-0/+4
Pull xfs update from Ben Myers: "This is primarily bug fixes, many of which you already have. New stuff includes a series to decouple the in-memory and on-disk log format, helpers in the area of inode clusters, and i_version handling. We decided to try to use more topic branches this release, so there are some merge commits in there on account of that. I'm afraid I didn't do a good job of putting meaningful comments in the first couple of merges. Sorry about that. I think I have the hang of it now. For 3.14-rc1 there are fixes in the areas of remote attributes, discard, growfs, memory leaks in recovery, directory v2, quotas, the MAINTAINERS file, allocation alignment, extent list locking, and in xfs_bmapi_allocate. There are cleanups in xfs_setsize_buftarg, removing unused macros, quotas, setattr, and freeing of inode clusters. The in-memory and on-disk log format have been decoupled, a common helper to calculate the number of blocks in an inode cluster has been added, and handling of i_version has been pulled into the filesystems that use it. - cleanup in xfs_setsize_buftarg - removal of remaining unused flags for vop toss/flush/flushinval - fix for memory corruption in xfs_attrlist_by_handle - fix for out-of-date comment in xfs_trans_dqlockedjoin - fix for discard if range length is less than one block - fix for overrun of agfl buffer using growfs on v4 superblock filesystems - pull i_version handling out into the filesystems that use it - don't leak recovery items on error - fix for memory leak in xfs_dir2_node_removename - several cleanups for quotas - fix bad assertion in xfs_qm_vop_create_dqattach - cleanup for xfs_setattr_mode, and add xfs_setattr_time - fix quota assert in xfs_setattr_nonsize - fix an infinite loop when turning off group/project quota before user quota - fix for temporary buffer allocation failure in xfs_dir2_block_to_sf with large directory block sizes - fix Dave's email address in MAINTAINERS - cleanup calculation of freed inode cluster blocks - fix alignment of initial file allocations to match filesystem geometry - decouple in-memory and on-disk log format - introduce a common helper to calculate the number of filesystem blocks in an inode cluster - fixes for extent list locking - fix for off-by-one in xfs_attr3_rmt_verify - fix for missing destroy_work_on_stack in xfs_bmapi_allocate" * tag 'xfs-for-linus-v3.14-rc1' of git://oss.sgi.com/xfs/xfs: (51 commits) xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK() xfs: fix off-by-one error in xfs_attr3_rmt_verify xfs: assert that we hold the ilock for extent map access xfs: use xfs_ilock_attr_map_shared in xfs_attr_list_int xfs: use xfs_ilock_attr_map_shared in xfs_attr_get xfs: use xfs_ilock_data_map_shared in xfs_qm_dqiterate xfs: use xfs_ilock_data_map_shared in xfs_qm_dqtobp xfs: take the ilock around xfs_bmapi_read in xfs_zero_remaining_bytes xfs: reinstate the ilock in xfs_readdir xfs: add xfs_ilock_attr_map_shared xfs: rename xfs_ilock_map_shared xfs: remove xfs_iunlock_map_shared xfs: no need to lock the inode in xfs_find_handle xfs: use xfs_icluster_size_fsb in xfs_imap xfs: use xfs_icluster_size_fsb in xfs_ifree_cluster xfs: use xfs_icluster_size_fsb in xfs_ialloc_inode_init xfs: use xfs_icluster_size_fsb in xfs_bulkstat xfs: introduce a common helper xfs_icluster_size_fsb xfs: get rid of XFS_IALLOC_BLOCKS macros xfs: get rid of XFS_INODE_CLUSTER_SIZE macros ...