summaryrefslogtreecommitdiff
path: root/fs/btrfs/disk-io.c
AgeCommit message (Collapse)AuthorFilesLines
2010-08-07block: unify flags for struct bio and struct requestChristoph Hellwig1-4/+4
Remove the current bio flags and reuse the request flags for the bio, too. This allows to more easily trace the type of I/O from the filesystem down to the block driver. There were two flags in the bio that were missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've renamed two request flags that had a superflous RW in them. Note that the flags are in bio.h despite having the REQ_ name - as blkdev.h includes bio.h that is the only way to go for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-11Btrfs: btrfs_read_fs_root_no_name() returns ERR_PTRsDan Carpenter1-0/+4
btrfs_read_fs_root_no_name() returns ERR_PTRs on error so I added a check for that. It's not clear to me if it can also return NULL pointers or not so I left the original NULL pointer check as is. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11Btrfs: handle kzalloc() failure in open_ctree()Dan Carpenter1-2/+5
Unwind and return -ENOMEM if the allocation fails here. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25Btrfs: use async helpers for DIO write checksummingChris Mason1-5/+18
The async helper threads offload crc work onto all the CPUs, and make streaming writes much faster. This changes the O_DIRECT write code to use them. The only small complication was that we need to pass in the logical offset in the file for each bio, because we can't find it in the bio's pages. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25Btrfs: Metadata ENOSPC handling for tree logYan, Zheng1-36/+0
Previous patches make the allocater return -ENOSPC if there is no unreserved free metadata space. This patch updates tree log code and various other places to propagate/handle the ENOSPC error. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25Btrfs: Metadata reservation for orphan inodesYan, Zheng1-12/+20
reserve metadata space for handling orphan inodes Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25Btrfs: Introduce global metadata reservationYan, Zheng1-30/+29
Reserve metadata space for extent tree, checksum tree and root tree Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25Btrfs: Integrate metadata reservation with start_transactionYan, Zheng1-3/+3
Besides simplify the code, this change makes sure all metadata reservation for normal metadata operations are released after committing transaction. Changes since V1: Add code that check if unlink and rmdir will free space. Add ENOSPC handling for clone ioctl. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25Btrfs: Introduce contexts for metadata reservationYan, Zheng1-0/+9
Introducing metadata reseravtion contexts has two major advantages. First, it makes metadata reseravtion more traceable. Second, it can reclaim freed space and re-add them to the itself after transaction committed. Besides add btrfs_block_rsv structure and related helper functions, This patch contains following changes: Move code that decides if freed tree block should be pinned into btrfs_free_tree_block(). Make space accounting more accurate, mainly for handling read only block groups. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25Btrfs: Shrink delay allocated space in a synchronizedYan, Zheng1-6/+0
Shrink delayed allocation space in a synchronized manner is more controllable than flushing all delay allocated space in an async thread. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-04-26btrfs: convert to using bdi_setup_and_register()Jens Axboe1-11/+1
It's now a provided helper, so get rid of the internal setup and btrfs atomic_t bdi enumerator. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds1-4/+8
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: add check for changed leaves in setup_leaf_for_split Btrfs: create snapshot references in same commit as snapshot Btrfs: fix small race with delalloc flushing waitqueue's Btrfs: use add_to_page_cache_lru, use __page_cache_alloc Btrfs: fix chunk allocate size calculation Btrfs: kill max_extent mount option Btrfs: fail to mount if we have problems reading the block groups Btrfs: check btrfs_get_extent return for IS_ERR() Btrfs: handle kmalloc() failure in inode lookup ioctl Btrfs: dereferencing freed memory Btrfs: Simplify num_stripes's calculation logical for __btrfs_alloc_chunk() Btrfs: Add error handle for btrfs_search_slot() in btrfs_read_chunk_tree() Btrfs: Remove unnecessary finish_wait() in wait_current_trans() Btrfs: add NULL check for do_walk_down() Btrfs: remove duplicate include in ioctl.c Fix trivial conflict in fs/btrfs/compression.c due to slab.h include cleanups.
2010-03-30Btrfs: kill max_extent mount optionJosef Bacik1-1/+0
As Yan pointed out, theres not much reason for all this complicated math to account for file extents being split up into max_extent chunks, since they are likely to all end up in the same leaf anyway. Since there isn't much reason to use max_extent, just remove the option altogether so we have one less thing we need to test. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-30Btrfs: fail to mount if we have problems reading the block groupsJosef Bacik1-3/+8
We don't actually check the return value of btrfs_read_block_groups, so we can possibly succeed to mount, but then fail to say read the superblock xattr for selinux which will cause the vfs code to deactivate the super. This is a problem because in find_free_extent we just assume that we will find the right space_info for the allocation we want. But if we failed to read the block groups, we won't have setup any space_info's, and we'll hit a NULL pointer deref in find_free_extent. This patch fixes that problem by checking the return value of btrfs_read_block_groups, and failing out properly. I've also added a check in find_free_extent so if for some reason we don't find an appropriate space_info, we just return -ENOSPC. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo1-0/+1
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-15Btrfs: cache the extent state everywhere we possibly can V2Josef Bacik1-6/+9
This patch just goes through and fixes everybody that does lock_extent() blah unlock_extent() to use lock_extent_bits() blah unlock_extent_cached() and pass around a extent_state so we only have to do the searches once per function. This gives me about a 3 mb/s boots on my random write test. I have not converted some things, like the relocation and ioctl's, since they aren't heavily used and the relocation stuff is in the middle of being re-written. I also changed the clear_extent_bit() to only unset the cached state if we are clearing EXTENT_LOCKED and related stuff, so we can do things like this lock_extent_bits() clear delalloc bits unlock_extent_cached() without losing our cached state. I tested this thoroughly and turned on LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out fine. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-08Btrfs: use RB_ROOT to intialize rb_trees instead of setting rb_node to NULLEric Paris1-2/+2
btrfs inialize rb trees in quite a number of places by settin rb_node = NULL; The problem with this is that 17d9ddc72fb8bba0d4f678 in the linux-next tree adds a new field to that struct which needs to be NULL for the new rbtree library code to work properly. This patch uses RB_ROOT as the intializer so all of the relevant fields will be NULL'd. Without the patch I get a panic. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-02-04Btrfs: remove BUG_ON() due to mounting bad filesystemMiao Xie1-1/+6
Mounting a bad filesystem caused a BUG_ON(). The following is steps to reproduce it. # mkfs.btrfs /dev/sda2 # mount /dev/sda2 /mnt # mkfs.btrfs /dev/sda1 /dev/sda2 (the program says that /dev/sda2 was mounted, and then exits. ) # umount /mnt # mount /dev/sda1 /mnt At the third step, mkfs.btrfs exited in the way of make filesystem. So the initialization of the filesystem didn't finish. So the filesystem was bad, and it caused BUG_ON() when mounting it. But BUG_ON() should be called by the wrong code, not user's operation, so I think it is a bug of btrfs. This patch fixes it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-28Btrfs: run orphan cleanup on default fs rootJosef Bacik1-0/+6
This patch revert's commit 6c090a11e1c403b727a6a8eff0b97d5fb9e95cb5 Since it introduces this problem where we can run orphan cleanup on a volume that can have orphan entries re-added. Instead of my original fix, Yan Zheng pointed out that we can just revert my original fix and then run the orphan cleanup in open_ctree after we look up the fs_root. I have tested this with all the tests that gave me problems and this patch fixes both problems. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-17Btrfs: Add delayed iputYan, Zheng1-0/+4
iput() can trigger new transactions if we are dropping the final reference, so calling it in btrfs_commit_transaction may end up deadlock. This patch adds delayed iput to avoid the issue. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-17Btrfs: Avoid orphan inodes cleanup while replaying logYan, Zheng1-6/+11
We do log replay in a single transaction, so it's not good to do unbound operations. This patch cleans up orphan inodes cleanup after replaying the log. It also avoids doing other unbound operations such as truncating a file during replaying log. These unbound operations are postponed to the orphan inode cleanup stage. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-15Btrfs: Avoid superfluous tree-log writeoutYan, Zheng1-3/+3
We allow two log transactions at a time, but use same flag to mark dirty tree-log btree blocks. So we may flush dirty blocks belonging to newer log transaction when committing a log transaction. This patch fixes the issue by using two flags to mark dirty tree-log btree blocks. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-15Merge branch 'master' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: always pin metadata in discard mode Btrfs: enable discard support Btrfs: add -o discard option Btrfs: properly wait log writers during log sync Btrfs: fix possible ENOSPC problems with truncate Btrfs: fix btrfs acl #ifdef checks Btrfs: streamline tree-log btree block writeout Btrfs: avoid tree log commit when there are no changes Btrfs: only write one super copy during fsync
2009-10-13Btrfs: avoid tree log commit when there are no changesChris Mason1-0/+2
rpm has a habit of running fdatasync when the file hasn't changed. We already detect if a file hasn't been changed in the current transaction but it might have been sent to the tree-log in this transaction and not changed since the last call to fsync. In this case, we want to avoid a tree log sync, which includes a number of synchronous writes and barriers. This commit extends the existing tracking of the last transaction to change a file to also track the last sub-transaction. The end result is that rpm -ivh and -Uvh are roughly twice as fast, and on par with ext3. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds1-19/+29
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: fix file clone ioctl for bookend extents Btrfs: fix uninit compiler warning in cow_file_range_nocow Btrfs: constify dentry_operations Btrfs: optimize back reference update during btrfs_drop_snapshot Btrfs: remove negative dentry when deleting subvolumne Btrfs: optimize fsync for the single writer case Btrfs: async delalloc flushing under space pressure Btrfs: release delalloc reservations on extent item insertion Btrfs: delay clearing EXTENT_DELALLOC for compressed extents Btrfs: cleanup extent_clear_unlock_delalloc flags Btrfs: fix possible softlockup in the allocator Btrfs: fix deadlock on async thread startup
2009-10-08Btrfs: async delalloc flushing under space pressureJosef Bacik1-0/+6
This patch moves the delalloc flushing that occurs when we are under space pressure off to a async thread pool. This helps since we only free up metadata space when we actually insert the extent item, which means it takes quite a while for space to be free'ed up if we wait on all ordered extents. However, if space is freed up due to inline extents being inserted, we can wake people who are waiting up early, and they can finish their work. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-05Btrfs: fix deadlock on async thread startupChris Mason1-19/+23
The btrfs async worker threads are used for a wide variety of things, including processing bio end_io functions. This means that when the endio threads aren't running, the rest of the FS isn't able to do the final processing required to clear PageWriteback. The endio threads also try to exit as they become idle and start more as the work piles up. The problem is that starting more threads means kthreadd may need to allocate ram, and that allocation may wait until the global number of writeback pages on the system is below a certain limit. The result of that throttling is that end IO threads wait on kthreadd, who is waiting on IO to end, which will never happen. This commit fixes the deadlock by handing off thread startup to a dedicated thread. It also fixes a bug where the on-demand thread creation was creating far too many threads because it didn't take into account threads being started by other procs. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-01Btrfs: remove duplicates of filemap_ helpersChristoph Hellwig1-6/+4
Use filemap_fdatawrite_range and filemap_fdatawait_range instead of local copies of the functions. For filemap_fdatawait_range that also means replacing the awkward old wait_on_page_writeback_range calling convention with the regular filemap byte offsets. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-01Merge branch 'master' of ↵Chris Mason1-2/+4
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable into for-linus
2009-10-01Btrfs: fix arguments to btrfs_wait_on_page_writeback_rangeChristoph Hellwig1-1/+3
wait_on_page_writeback_range/btrfs_wait_on_page_writeback_range takes a pagecache offset, not a byte offset into the file. Shift the arguments around to wait for the correct range Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-28Btrfs: proper -ENOSPC handlingJosef Bacik1-1/+1
At the start of a transaction we do a btrfs_reserve_metadata_space() and specify how many items we plan on modifying. Then once we've done our modifications and such, just call btrfs_unreserve_metadata_space() for the same number of items we reserved. For keeping track of metadata needed for data I've had to add an extent_io op for when we merge extents. This lets us track space properly when we are doing sequential writes, so we don't end up reserving way more metadata space than what we need. The only place where the metadata space accounting is not done is in the relocation code. This is because Yan is going to be reworking that code in the near future, so running btrfs-vol -b could still possibly result in a ENOSPC related panic. This patch also turns off the metadata_ratio stuff in order to allow users to more efficiently use their disk space. This patch makes it so we track how much metadata we need for an inode's delayed allocation extents by tracking how many extents are currently waiting for allocation. It introduces two new callbacks for the extent_io tree's, merge_extent_hook and split_extent_hook. These help us keep track of when we merge delalloc extents together and split them up. Reservations are handled prior to any actually dirty'ing occurs, and then we unreserve after we dirty. btrfs_unreserve_metadata_for_delalloc() will make the appropriate unreservations as needed based on the number of reservations we currently have and the number of extents we currently have. Doing the reservation outside of doing any of the actual dirty'ing lets us do things like filemap_flush() the inode to try and force delalloc to happen, or as a last resort actually start allocation on all delalloc inodes in the fs. This has survived dbench, fs_mark and an fsx torture test. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-24Merge branch 'master' of ↵Chris Mason1-81/+154
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable into for-linus Conflicts: fs/btrfs/super.c
2009-09-24Btrfs: hash the btree inode during fill_superYan Zheng1-0/+1
The snapshot deletion patches dropped this line, but the inode needs to be hashed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-22const: mark remaining address_space_operations constAlexey Dobriyan1-1/+1
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-21Btrfs: add snapshot/subvolume destroy ioctlYan, Zheng1-19/+62
This patch adds snapshot/subvolume destroy ioctl. A subvolume that isn't being used and doesn't contains links to other subvolumes can be destroyed. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-21Btrfs: change how subvolumes are organizedYan, Zheng1-18/+53
btrfs allows subvolumes and snapshots anywhere in the directory tree. If we snapshot a subvolume that contains a link to other subvolume called subvolA, subvolA can be accessed through both the original subvolume and the snapshot. This is similar to creating hard link to directory, and has the very similar problems. The aim of this patch is enforcing there is only one access point to each subvolume. Only the first directory entry (the one added when the subvolume/snapshot was created) is treated as valid access point. The first directory entry is distinguished by checking root forward reference. If the corresponding root forward reference is missing, we know the entry is not the first one. This patch also adds snapshot/subvolume rename support, the code allows rename subvolume link across subvolumes. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-21Btrfs: do not reuse objectid of deleted snapshot/subvolYan, Zheng1-25/+14
The new back reference format does not allow reusing objectid of deleted snapshot/subvol. So we use ++highest_objectid to allocate objectid for new snapshot/subvol. Now we use ++highest_objectid to allocate objectid for both new inode and new snapshot/subvolume, so this patch removes 'find hole' code in btrfs_find_free_objectid. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-17Btrfs: improve async block group cachingYan Zheng1-2/+5
This patch gets rid of two limitations of async block group caching. The old code delays handling pinned extents when block group is in caching. To allocate logged file extents, the old code need wait until block group is fully cached. To get rid of the limitations, This patch introduces a data structure to track the progress of caching. Base on the caching progress, we know which extents should be added to the free space cache when handling the pinned extents. The logged file extents are also handled in a similar way. This patch also changes how pinned extents are tracked. The old code uses one tree to track pinned extents, and copy the pinned extents tree at transaction commit time. This patch makes it use two trees to track pinned extents. One tree for extents that are pinned in the running transaction, one tree for extents that can be unpinned. At transaction commit time, we swap the two trees. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-16fs: Assign bdi in super_blockJens Axboe1-0/+1
We do this automatically in get_sb_bdev() from the set_bdev_super() callback. Filesystems that have their own private backing_dev_info must assign that in ->fill_super(). Note that ->s_bdi assignment is required for proper writeback! Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11Merge branch 'master' of ↵Chris Mason1-17/+19
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
2009-09-11Btrfs: switch extent_map to a rw lockChris Mason1-7/+7
There are two main users of the extent_map tree. The first is regular file inodes, where it is evenly spread between readers and writers. The second is the chunk allocation tree, which maps blocks from logical addresses to phyiscal ones, and it is 99.99% reads. The mapping tree is a point of lock contention during heavy IO workloads, so this commit switches things to a rw lock. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11Btrfs: Allow worker threads to exit when idleChris Mason1-10/+12
The Btrfs worker threads don't currently die off after they have been idle for a while, leading to a lot of threads sitting around doing nothing for each mount. Also, they are unable to start atomically (from end_io hanlders). This commit reworks the worker threads so they can be started from end_io handlers (just setting a flag that asks for a thread to be added at a later date) and so they can exit if they have been idle for a long time. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11writeback: add name to backing_dev_infoJens Axboe1-0/+1
This enables us to track who does what and print info. Its main use is catching dirty inodes on the default_backing_dev_info, so we can fix that up. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds1-1/+1
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: be more polite in the async caching threads Btrfs: preserve commit_root for async caching
2009-07-30Btrfs: preserve commit_root for async cachingYan Zheng1-1/+1
The async block group caching code uses the commit_root pointer to get a stable version of the extent allocation tree for scanning. This copy of the tree root isn't going to change and it significantly reduces the complexity of the scanning code. During a commit, we have a loop where we update the extent allocation tree root. We need to loop because updating the root pointer in the tree of tree roots may allocate blocks which may change the extent allocation tree. Right now the commit_root pointer is changed inside this loop. It is more correct to change the commit_root pointer only after all the looping is done. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds1-0/+15
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (22 commits) Btrfs: Fix async caching interaction with unmount Btrfs: change how we unpin extents Btrfs: Correct redundant test in add_inode_ref Btrfs: find smallest available device extent during chunk allocation Btrfs: clear all space_info->full after removing a block group Btrfs: make flushoncommit mount option correctly wait on ordered_extents Btrfs: Avoid delayed reference update looping Btrfs: Fix ordering of key field checks in btrfs_previous_item Btrfs: find_free_dev_extent doesn't handle holes at the start of the device Btrfs: Remove code duplication in comp_keys Btrfs: async block group caching Btrfs: use hybrid extents+bitmap rb tree for free space Btrfs: Fix crash on read failures at mount Btrfs: remove of redundant btrfs_header_level Btrfs: adjust NULL test Btrfs: Remove broken sanity check from btrfs_rmap_block() Btrfs: convert nested spin_lock_irqsave to spin_lock Btrfs: make sure all dirty blocks are written at commit time Btrfs: fix locking issue in btrfs_find_next_key Btrfs: fix double increment of path->slots[0] in btrfs_next_leaf ...
2009-07-28Btrfs: Fix async caching interaction with unmountYan Zheng1-0/+3
- don't stop the caching thread until btrfs_commit_super return. - if caching is interrupted by umount, set last to (u64)-1. otherwise the un-scanned range of block group will be considered as free extent. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-27Btrfs: change how we unpin extentsJosef Bacik1-2/+1
We are racy with async block caching and unpinning extents. This patch makes things much less complicated by only unpinning the extent if the block group is cached. We check the block_group->cached var under the block_group->lock spin lock. If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters, and unpin the extent and add the free space back. If it is not set to this, we start the caching of the block group so the next time we unpin extents we can unpin the extent. This keeps us from racing with the async caching threads, lets us kill the fs wide async thread counter, and keeps us from having to set DELALLOC bits for every extent we hit if there are caching kthreads going. One thing that needed to be changed was btrfs_free_super_mirror_extents. Now instead of just looking for LOCKED extents, we also look for DIRTY extents, since we could have left some extents pinned in the previous transaction that will never get freed now that we are unmounting, which would cause us to leak memory. So btrfs_free_super_mirror_extents has been changed to btrfs_free_pinned_extents, and it will clear the extents locked for the super mirror, and any remaining pinned extents that may be present. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24Btrfs: async block group cachingJosef Bacik1-0/+3
This patch moves the caching of the block group off to a kthread in order to allow people to allocate sooner. Instead of blocking up behind the caching mutex, we instead kick of the caching kthread, and then attempt to make an allocation. If we cannot, we wait on the block groups caching waitqueue, which the caching kthread will wake the waiting threads up everytime it finds 2 meg worth of space, and then again when its finished caching. This is how I tested the speedup from this mkfs the disk mount the disk fill the disk up with fs_mark unmount the disk mount the disk time touch /mnt/foo Without my changes this took 11 seconds on my box, with these changes it now takes 1 second. Another change thats been put in place is we lock the super mirror's in the pinned extent map in order to keep us from adding that stuff as free space when caching the block group. This doesn't really change anything else as far as the pinned extent map is concerned, since for actual pinned extents we use EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock those extents to keep from leaking memory. I've also added a check where when we are reading block groups from disk, if the amount of space used == the size of the block group, we go ahead and mark the block group as cached. This drastically reduces the amount of time it takes to cache the block groups. Using the same test as above, except doing a dd to a file and then unmounting, it used to take 33 seconds to umount, now it takes 3 seconds. This version uses the commit_root in the caching kthread, and then keeps track of how many async caching threads are running at any given time so if one of the async threads is still running as we cross transactions we can wait until its finished before handling the pinned extents. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22Btrfs: Fix crash on read failures at mountDavid Woodhouse1-0/+10
If the tree roots hit read errors during mount, btrfs is not properly erroring out. We need to check the uptodate bits after reading in the tree root node. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>