summaryrefslogtreecommitdiff
path: root/fs/ext4/extents.c
AgeCommit message (Collapse)AuthorFilesLines
2014-01-06ext4: fix bigalloc regressionEric Whitney1-1/+1
Commit f5a44db5d2 introduced a regression on filesystems created with the bigalloc feature (cluster size > blocksize). It causes xfstests generic/006 and /013 to fail with an unexpected JBD2 failure and transaction abort that leaves the test file system in a read only state. Other xfstests run on bigalloc file systems are likely to fail as well. The cause is the accidental use of a cluster mask where a cluster offset was needed in ext4_ext_map_blocks(). Signed-off-by: Eric Whitney <enwlinux@gmail.com>
2013-12-20ext4: add explicit casts when masking cluster sizesTheodore Ts'o1-14/+12
The missing casts can cause the high 64-bits of the physical blocks to be lost. Set up new macros which allows us to make sure the right thing happen, even if at some point we end up supporting larger logical block numbers. Thanks to the Emese Revfy and the PaX security team for reporting this issue. Reported-by: PaX Team <pageexec@freemail.hu> Reported-by: Emese Revfy <re.emese@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-12-03ext4: check for overlapping extents in ext4_valid_extent_entries()Eryu Guan1-1/+18
A corrupted ext4 may have out of order leaf extents, i.e. extent: lblk 0--1023, len 1024, pblk 9217, flags: LEAF UNINIT extent: lblk 1000--2047, len 1024, pblk 10241, flags: LEAF UNINIT ^^^^ overlap with previous extent Reading such extent could hit BUG_ON() in ext4_es_cache_extent(). BUG_ON(end < lblk); The problem is that __read_extent_tree_block() tries to cache holes as well but assumes 'lblk' is greater than 'prev' and passes underflowed length to ext4_es_cache_extent(). Fix it by checking for overlapping extents in ext4_valid_extent_entries(). I hit this when fuzz testing ext4, and am able to reproduce it by modifying the on-disk extent by hand. Also add the check for (ee_block + len - 1) in ext4_valid_extent() to make sure the value is not overflow. Ran xfstests on patched ext4 and no regression. Cc: Lukáš Czerner <lczerner@redhat.com> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-11-07ext4: remove unreachable code after ext4_can_extents_be_merged()Eric Sandeen1-26/+0
Commit ec22ba8e ("ext4: disable merging of uninitialized extents") ensured that if either extent under consideration is uninit, we decline to merge, and ext4_can_extents_be_merged() returns false. So there is no need for the caller to then test whether the extent under consideration is unitialized; if it were, we wouldn't have gotten that far. The comments were also inaccurate; ext4_can_extents_be_merged() no longer XORs the states, it fails if *either* is uninit. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-11-04ext4: remove unreachable code in ext4_can_extents_be_merged()Eric Sandeen1-7/+2
Commit ec22ba8e ("ext4: disable merging of uninitialized extents") ensured that if either extent under consideration is uninit, we decline to merge, and immediately return. But right after that test, we test again for an uninit extent; we can never hit this. So just remove the impossible test and associated variable. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-08-28ext4: isolate ext4_extents.h fileZheng Liu1-2/+19
After applied the commit (4a092d73), we have reduced the number of source files that need to #include ext4_extents.h. But we can do better. This commit defines ext4_zeroout_es() in extents.c and move EXT_MAX_BLOCKS into ext4.h in order not to include ext4_extents.h in indirect.c and ioctl.c. Meanwhile we just need to include this file in extent_status.c when ES_AGGRESSIVE_TEST is defined. Otherwise, this commit removes a duplicated declaration in trace/events/ext4.h. After applied this patch, we just need to include ext4_extents.h file in {super,migrate,move_extents,extents}.c, and it is easy for us to define a new extent disk layout. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28ext4: Fix misspellings using 'codespell' toolAnatol Pomozov1-1/+1
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28ext4: fix use of potentially uninitialized variables in debugging codeAndi Shyti1-3/+2
If ext_debugging is enabled and path[depth].p_ext is NULL, len and lblock are printed non initialized Signed-off-by: Andi Shyti <andi@etezian.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-17ext4: fix warning in ext4_da_update_reserve_space()Jan Kara1-1/+2
reaim workfile.dbase test easily triggers warning in ext4_da_update_reserve_space(): EXT4-fs warning (device ram0): ext4_da_update_reserve_space:365: ino 12, allocated 1 with only 0 reserved metadata blocks (releasing 1 blocks with reserved 9 data blocks) The problem is that (one of) tests creates file and then randomly writes to it with O_SYNC. That results in writing back pages of the file in random order so we create extents for written blocks say 0, 2, 4, 6, 8 - this last allocation also allocates new block for extents. Then we writeout block 1 so we have extents 0-2, 4, 6, 8 and we release indirect extent block because extents fit in the inode again. Then we writeout block 10 and we need to allocate indirect extent block again which triggers the warning because we don't have the reservation anymore. Fix the problem by giving back freed metadata blocks resulting from extent merging into inode's reservation pool. Signed-off-by: Jan Kara <jack@suse.cz>
2013-08-16ext4: add support for extent pre-cachingTheodore Ts'o1-1/+72
Add a new fiemap flag which forces the all of the extents in an inode to be cached in the extent_status tree. This is critically important when using AIO to a preallocated file, since if we need to read in blocks from the extent tree, the io_submit(2) system call becomes synchronous, and the AIO is no longer "A", which is bad. In addition, for most files which have an external leaf tree block, the cost of caching the information in the extent status tree will be less than caching the entire 4k block in the buffer cache. So it is generally a win to keep the extent information cached. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-16ext4: cache all of an extent tree's leaf block upon readingTheodore Ts'o1-27/+60
When we read in an extent tree leaf block from disk, arrange to have all of its entries cached. In nearly all cases the in-memory representation will be more compact than the on-disk representation in the buffer cache, and it allows us to get the information without having to traverse the extent tree for successive extents. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-08-16ext4: print the block number of invalid extent tree blocksTheodore Ts'o1-12/+12
When we find an invalid extent tree block, report the block number of the bad block for debugging purposes. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-08-16ext4: refactor code to read the extent tree blockTheodore Ts'o1-54/+43
Refactor out the code needed to read the extent tree block into a single read_extent_tree_block() function. In addition to simplifying the code, it also makes sure that we call the ext4_ext_load_extent tracepoint whenever we need to read an extent tree block from disk. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-07-29ext4: fix retry handling in ext4_ext_truncate()Theodore Ts'o1-1/+1
We tested for ENOMEM instead of -ENOMEM. Oops. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-07-16ext4: call ext4_es_lru_add() after handling cache missTheodore Ts'o1-2/+3
If there are no items in the extent status tree, ext4_es_lru_add() is a no-op. So it is not sufficient to call ext4_es_lru_add() before we try to lookup an entry in the extent status tree. We also need to call it at the end of ext4_ext_map_blocks(), after items have been added to the extent status tree. This could lead to inodes with that have extent status trees but which are not in the LRU list, which means they won't get considered for eviction by the es_shrinker. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Zheng Liu <wenqing.lz@taobao.com> Cc: stable@vger.kernel.org
2013-07-15ext4: yield during large unlinksTheodore Ts'o1-0/+3
During large unlink operations on files with extents, we can use a lot of CPU time. This adds a cond_resched() call when starting to examine the next level of a multi-level extent tree. Multi-level extent trees are rare in the first place, and this should rarely be executed. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-15ext4: simplify calculation of blocks to free on errorTheodore Ts'o1-2/+2
In ext4_ext_map_blocks(), if we have successfully allocated the data blocks, but then run into trouble inserting the extent into the extent tree, most likely due to an ENOSPC condition, determine the arguments to ext4_free_blocks() in a simpler way which is easier to prove to be correct. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-15ext4: fix error handling in ext4_ext_truncate()Theodore Ts'o1-0/+11
Previously ext4_ext_truncate() was ignoring potential error returns from ext4_es_remove_extent() and ext4_ext_remove_space(). This can lead to the on-diks extent tree and the extent status tree cache getting out of sync, which is particuarlly bad, and can lead to file system corruption and potential data loss. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-07-01ext4: optimize starting extent in ext4_ext_rm_leaf()Ashish Sangwan1-1/+3
Both hole punch and truncate use ext4_ext_rm_leaf() for removing blocks. Currently we choose the last extent as the starting point for removing blocks: ex = EXT_LAST_EXTENT(eh); This is OK for truncate but for hole punch we can optimize the extent selection as the path is already initialized. We could use this information to select proper starting extent. The code change in this patch will not affect truncate as for truncate path[depth].p_ext will always be NULL. Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01ext4: translate flag bits to strings in tracepointsTheodore Ts'o1-1/+1
Translate the bitfields used in various flags argument to strings to make the tracepoint output more human-readable. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01ext4: pass inode pointer instead of file pointer to punch holeAshish Sangwan1-1/+1
No need to pass file pointer when we can directly pass inode pointer. Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extentsJie Liu1-1/+2
Return the FIEMAP_EXTENT_UNKNOWN flag as well except the FIEMAP_EXTENT_DELALLOC because the data location of an delayed allocation extent is unknown. Signed-off-by: Jie Liu <jeff.liu@oracle.com>
2013-06-12ext4: don't use EXT4_FREE_BLOCKS_FORGET unnecessarilyTheodore Ts'o1-14/+12
Commit 18888cf0883c: "ext4: speed up truncate/unlink by not using bforget() unless needed" removed the use of EXT4_FREE_BLOCKS_FORGET in the most important codepath for file systems using extents, but a similar optimization also can be done for file systems using indirect blocks, and for the two special cases in the ext4 extents code. Cc: Andrey Sidorov <qrxd43@motorola.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: use transaction reservation for extent conversion in ext4_end_ioJan Kara1-11/+29
Later we would like to clear PageWriteback bit only after extent conversion from unwritten to written extents is performed. However it is not possible to start a transaction after PageWriteback is set because that violates lock ordering (and is easy to deadlock). So we have to reserve a transaction before locking pages and sending them for IO and later we use the transaction for extent conversion from ext4_end_io(). Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04ext4: better estimate credits needed for ext4_da_writepages()Jan Kara1-9/+7
We limit the number of blocks written in a single loop of ext4_da_writepages() to 64 when inode uses indirect blocks. That is unnecessary as credit estimates for mapping logically continguous run of blocks is rather low even for inode with indirect blocks. So just lift this limitation and properly calculate the number of necessary credits. This better credit estimate will also later allow us to always write at least a single page in one iteration. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-05-31ext4: fix data offset overflow in ext4_xattr_fiemap() on 32-bit archsJan Kara1-2/+2
On 32-bit architectures with 32-bit sector_t computation of data offset in ext4_xattr_fiemap() can overflow resulting in reporting bogus data location. Fix the problem by typing block number to proper type before shifting. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27ext4: make punch hole code path work with bigallocLukas Czerner1-18/+51
Currently punch hole is disabled in file systems with bigalloc feature enabled. However the recent changes in punch hole patch should make it easier to support punching holes on bigalloc enabled file systems. This commit changes partial_cluster handling in ext4_remove_blocks(), ext4_ext_rm_leaf() and ext4_ext_remove_space(). Currently partial_cluster is unsigned long long type and it makes sure that we will free the partial cluster if all extents has been released from that cluster. However it has been specifically designed only for truncate. With punch hole we can be freeing just some extents in the cluster leaving the rest untouched. So we have to make sure that we will notice cluster which still has some extents. To do this I've changed partial_cluster to be signed long long type. The only scenario where this could be a problem is when cluster_size == block size, however in that case there would not be any partial clusters so we're safe. For bigger clusters the signed type is enough. Now we use the negative value in partial_cluster to mark such cluster used, hence we know that we must not free it even if all other extents has been freed from such cluster. This scenario can be described in simple diagram: |FFF...FF..FF.UUU| ^----------^ punch hole . - free space | - cluster boundary F - freed extent U - used extent Also update respective tracepoints to use signed long long type for partial_cluster. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27ext4: update ext4_ext_remove_space trace pointLukas Czerner1-3/+3
Add "end" variable. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27ext4: remove unused code from ext4_remove_blocks()Lukas Czerner1-17/+4
The "head removal" branch in the condition is never used in any code path in ext4 since the function only caller ext4_ext_rm_leaf() will make sure that the extent is properly split before removing blocks. Note that there is a bug in this branch anyway. This commit removes the unused code completely and makes use of ext4_error() instead of printk if dubious range is provided. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-03ext4: fix fio regressionYan, Zheng1-4/+5
We (Linux Kernel Performance project) found a regression introduced by commit: f7fec032aa ext4: track all extent status in extent status tree The commit causes about 20% performance decrease in fio random write test. Profiler shows that rb_next() uses a lot of CPU time. The call stack is: rb_next ext4_es_find_delayed_extent ext4_map_blocks _ext4_get_block ext4_get_block_write __blockdev_direct_IO ext4_direct_IO generic_file_direct_write __generic_file_aio_write ext4_file_write aio_rw_vect_retry aio_run_iocb do_io_submit sys_io_submit system_call_fastpath io_submit td_io_getevents io_u_queued_complete thread_main main __libc_start_main The cause is that ext4_es_find_delayed_extent() doesn't have an upper bound, it keeps searching until a delayed extent is found. When there are a lots of non-delayed entries in the extent state tree, ext4_es_find_delayed_extent() may uses a lot of CPU time. Reported-by: LKP project <lkp@linux.intel.com> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Cc: "Theodore Ts'o" <tytso@mit.edu>
2013-04-19ext4: mext_insert_extents should update extent block checksumDarrick J. Wong1-5/+2
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-10ext4: move ext4_ind_migrate() into migrate.cLukas Czerner1-57/+0
Move ext4_ind_migrate() into migrate.c file since it makes much more sense and ext4_ext_migrate() is there as well. Also fix tiny style problem - add spaces around "=" in "i=0". Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09ext4: fix big-endian bug in extent migration codeDmitry Monakhov1-1/+1
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-04-09ext4: introduce reserved spaceLukas Czerner1-9/+18
Currently in ENOSPC condition when writing into unwritten space, or punching a hole, we might need to split the extent and grow extent tree. However since we can not allocate any new metadata blocks we'll have to zero out unwritten part of extent or punched out part of extent, or in the worst case return ENOSPC even though use actually does not allocate any space. Also in delalloc path we do reserve metadata and data blocks for the time we're going to write out, however metadata block reservation is very tricky especially since we expect that logical connectivity implies physical connectivity, however that might not be the case and hence we might end up allocating more metadata blocks than previously reserved. So in future, metadata reservation checks should be removed since we can not assure that we do not under reserve. And this is where reserved space comes into the picture. When mounting the file system we slice off a little bit of the file system space (2% or 4096 clusters, whichever is smaller) which can be then used for the cases mentioned above to prevent costly zeroout, or unexpected ENOSPC. The number of reserved clusters can be set via sysfs, however it can never be bigger than number of free clusters in the file system. Note that this patch fixes the failure of xfstest 274 as expected. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-04-08ext4: fix incorrect lock ordering for ext4_ind_migrateDmitry Monakhov1-7/+5
existing locking ordering: journal-> i_data_sem, but ext4_ind_migrate() grab locks in opposite order which may result in deadlock. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03ext4: try to prepend extent to the existing oneLukas Czerner1-23/+85
Currently when inserting extent in ext4_ext_insert_extent() we would only try to to see if we can append new extent to the found extent. If we can not, then we proceed with adding new extent into the extent tree, but then possibly merging it back again. We can avoid this situation by trying to append and prepend new extent to the existing ones. However since the new extent can be on either sides of the existing extent, we have to pick the right extent to try to append/prepend to. This patch adds the conditions to pick the right extent to append/prepend to and adds the actual prepending condition as well. This will also eliminate the need to use "reserved" block for possibly growing extent tree. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03ext4: Transfer initialized block to right neighbor if possibleLukas Czerner1-46/+89
Currently when converting extent to initialized we attempt to transfer initialized block to the left neighbour if possible when certain criteria are met. However we do not attempt to do the same for the right neighbor. This commit adds the possibility to transfer initialized block to the right neighbour if: 1. We're not converting the whole extent 2. Both extents are stored in the same extent tree node 3. Right neighbor is initialized 4. Right neighbor is logically abutting the current one 5. Right neighbor is physically abutting the current one 6. Right neighbor would not overflow the length limit This is basically the same logic as with transferring to the left. This will gain us some performance benefits since it is faster than inserting extent and then merging it. It would also prevent some situation in delalloc patch when we might run out of metadata reservation. This is due to the fact that we would attempt to split the extent first (possibly allocating new metadata block) even though we did not counted for that because it can (and will) be merged again. This commit fix that scenario, because we no longer need to split the extent in such case. Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03ext4: support simple conversion of extent-mapped inodes to use i_blocksTheodore Ts'o1-0/+59
In order to make it simpler to test the code which support i_blocks/indirect-mapped inodes, support the conversion of inodes which are less than 12 blocks and which are contained in no more than a single extent. The primary intended use of this code is to converting freshly created zero-length files and empty directories. Note that the version of chattr in e2fsprogs 1.42.7 and earlier has a check that prevents the clearing of the extent flag. A simple patch which allows "chattr -e <file>" to work will be checked into the e2fsprogs git repository. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03ext4: refactor truncate codeTheodore Ts'o1-59/+1
Move common code in ext4_ind_truncate() and ext4_ext_truncate() into ext4_truncate(). This saves over 60 lines of code. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03ext4: refactor punch hole codeTheodore Ts'o1-183/+2
Move common code in ext4_ind_punch_hole() and ext4_ext_punch_hole() into ext4_punch_hole(). This saves over 150 lines of code. This also fixes a potential bug when the punch_hole() code is racing against indirect-to-extents or extents-to-indirect migation. We are currently using i_mutex to protect against changes to the inode flag; specifically, the append-only, immutable, and extents inode flags. So we need to take i_mutex before deciding whether to use the extents-specific or indirect-specific punch_hole code. Also, there was a missing call to ext4_inode_block_unlocked_dio() in the indirect punch codepath. This was added in commit 02d262dffcf4c to block DIO readers racing against the punch operation in the codepath for extent-mapped inodes, but it was missing for indirect-block mapped inodes. One of the advantages of refactoring the code is that it makes such oversights much less likely. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03ext4: fix big-endian bugs which could cause fs corruptionsZheng Liu1-4/+7
When an extent was zeroed out, we forgot to do convert from cpu to le16. It could make us hit a BUG_ON when we try to write dirty pages out. So fix it. [ Also fix a bug found by Dmitry Monakhov where we were missing le32_to_cpu() calls in the new indirect punch hole code. There are a number of other big endian warnings found by static code analyzers, but we'll wait for the next merge window to fix them all up. These fixes are designed to be Obviously Correct by code inspection, and easy to demonstrate that it won't make any difference (and hence, won't introduce any bugs) on little endian architectures such as x86. --tytso ] Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: CAI Qian <caiqian@redhat.com> Reported-by: Christian Kujau <lists@nerdbynature.de> Cc: Dmitry Monakhov <dmonakhov@openvz.org>
2013-03-21Merge tag 'ext4_for_linue' of ↵Linus Torvalds1-21/+84
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Fix a number of regression and other bugs in ext4, most of which were relatively obscure cornercases or races that were found using regression tests." * tag 'ext4_for_linue' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits) ext4: fix data=journal fast mount/umount hang ext4: fix ext4_evict_inode() racing against workqueue processing code ext4: fix memory leakage in mext_check_coverage ext4: use s_extent_max_zeroout_kb value as number of kb ext4: use atomic64_t for the per-flexbg free_clusters count jbd2: fix use after free in jbd2_journal_dirty_metadata() ext4: reserve metadata block for every delayed write ext4: update reserved space after the 'correction' ext4: do not use yield() ext4: remove unused variable in ext4_free_blocks() ext4: fix WARN_ON from ext4_releasepage() ext4: fix the wrong number of the allocated blocks in ext4_split_extent() ext4: update extent status tree after an extent is zeroed out ext4: fix wrong m_len value after unwritten extent conversion ext4: add self-testing infrastructure to do a sanity check ext4: avoid a potential overflow in ext4_es_can_be_merged() ext4: invalidate extent status tree during extent migration ext4: remove unnecessary wait for extent conversion in ext4_fallocate() ext4: add warning to ext4_convert_unwritten_extents_endio ext4: disable merging of uninitialized extents ...
2013-03-12ext4: use s_extent_max_zeroout_kb value as number of kbLukas Czerner1-1/+1
Currently when converting extent to initialized, we have to decide whether to zeroout part/all of the uninitialized extent in order to avoid extent tree growing rapidly. The decision is made by comparing the size of the extent with the configurable value s_extent_max_zeroout_kb which is in kibibytes units. However when converting it to number of blocks we currently use it as it was in bytes. This is obviously bug and it will result in ext4 _never_ zeroout extents, but rather always split and convert parts to initialized while leaving the rest uninitialized in default setting. Fix this by using s_extent_max_zeroout_kb as kibibytes. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-03-10ext4: update reserved space after the 'correction'Lukas Czerner1-3/+9
Currently in ext4_ext_map_blocks() in delayed allocation writeback we would update the reservation and after that check whether we claimed cluster outside of the range of the allocation and if so, we'll give the block back to the reservation pool. However this also means that if the number of reserved data block dropped to zero before the correction, we would release all the metadata reservation as well, however we might still need it because the we're not done with the delayed allocation and there might be more blocks to come. This will result in error messages such as: EXT4-fs warning (device sdb): ext4_da_update_reserve_space:361: ino 12, allocated 1 with only 0 reserved metadata blocks (releasing 1 blocks with reserved 1 data blocks) This will only happen on bigalloc file system and it can be easily reproduced using fiemap-tester from xfstests like this: ./src/fiemap-tester -m DHDHDHDHD -S -p0 /mnt/test/file Or using xfstests such as 225. Fix this by doing the correction first and updating the reservation after that so that we do not accidentally decrease i_reserved_data_blocks to zero. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-10ext4: fix the wrong number of the allocated blocks in ext4_split_extent()Zheng Liu1-1/+5
This commit fixes a wrong return value of the number of the allocated blocks in ext4_split_extent. When the length of blocks we want to allocate is greater than the length of the current extent, we return a wrong number. Let's see what happens in the following case when we call ext4_split_extent(). map: [48, 72] ex: [32, 64, u] 'ex' will be split into two parts: ex1: [32, 47, u] ex2: [48, 64, w] 'map->m_len' is returned from this function, and the value is 24. But the real length is 16. So it should be fixed. Meanwhile in this commit we use right length of the allocated blocks when get_reserved_cluster_alloc in ext4_ext_handle_uninitialized_extents is called. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Dmitry Monakhov <dmonakhov@openvz.org> Cc: stable@vger.kernel.org
2013-03-10ext4: update extent status tree after an extent is zeroed outZheng Liu1-4/+31
When we try to split an extent, this extent could be zeroed out and mark as initialized. But we don't know this in ext4_map_blocks because it only returns a length of allocated extent. Meanwhile we will mark this extent as uninitialized because we only check m_flags. This commit update extent status tree when we try to split an unwritten extent. We don't need to worry about the status of this extent because we always mark it as initialized. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Dmitry Monakhov <dmonakhov@openvz.org>
2013-03-10ext4: fix wrong m_len value after unwritten extent conversionZheng Liu1-0/+4
The ext4_ext_handle_uninitialized_extents() function was assuming the return value of ext4_ext_map_blocks() is equal to map->m_len. This incorrect assumption was harmless until we started use status tree as a extent cache because we need to update status tree according to 'm_len' value. Meanwhile this commit marks EXT4_MAP_MAPPED flag after unwritten extent conversion. It shouldn't cause a bug because we update status tree according to checking EXT4_MAP_UNWRITTEN flag. But it should be fixed. After applied this commit, the following error message from self-testing infrastructure disappears. ... kernel: ES len assertation failed for inode: 230 retval 1 != map->m_len 3 in ext4_map_blocks (allocation) ... Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Dmitry Monakhov <dmonakhov@openvz.org>
2013-03-04ext4: remove unnecessary wait for extent conversion in ext4_fallocate()Jan Kara1-2/+0
Now that we don't merge uninitialized extents anymore, ext4_fallocate() is free to operate on the inode while there are still some extent conversions pending - it won't disturb them in any way. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-04ext4: add warning to ext4_convert_unwritten_extents_endioDmitry Monakhov1-1/+12
Splitting extents inside endio is a bad thing, but unfortunately it is still possible. In fact we are pretty close to the moment when all related issues will be fixed. Let's warn developer if it still the case. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-03-04ext4: disable merging of uninitialized extentsDmitry Monakhov1-3/+5
Derived from Jan's patch:http://permalink.gmane.org/gmane.comp.file-systems.ext4/36470 Merging of uninitialized extents creates all sorts of interesting race possibilities when writeback / DIO races with fallocate. Thus ext4_convert_unwritten_extents_endio() has to deal with a case where extent to be converted needs to be split out first. That isn't nice for two reasons: 1) It may need allocation of extent tree block so ENOSPC is possible. 2) It complicates end_io handling code So we disable merging of uninitialized extents which allows us to simplify the code. Extents will get merged after they are converted to initialized ones. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>