summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2008-09-25Btrfs: Cleanup and comment ordered-data.cChris Mason3-70/+121
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Force caching of metadata block groups on mount to avoid deadlockChris Mason1-0/+5
This is a temporary change to avoid deadlocks until the extent tree locking is fixed up. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25btrfs_next_leaf: do readahead when skip_locking is turned onChris Mason1-1/+2
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Add a per-inode lock around btrfs_drop_extentsChris Mason4-0/+21
btrfs_drop_extents is always called with a range lock held on the inode. But, it may operate on extents outside that range as it drops and splits them. This patch adds a per-inode mutex that is held while calling btrfs_drop_extents and while inserting new extents into the tree. It prevents races from two procs working against adjacent ranges in the tree. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Don't pin pages in ram until the entire ordered extent is on disk.Chris Mason4-19/+69
Checksum items are not inserted until the entire ordered extent is on disk, but individual pages might be clean and available for reclaim long before the whole extent is on disk. In order to allow those pages to be freed, we need to be able to search the list of ordered extents to find the checksum that is going to be inserted in the tree. This way if the page needs to be read back in before the checksums are in the btree, we'll be able to verify the checksum on the page. This commit adds the ability to search the pending ordered extents for a given offset in the file, and changes btrfs_releasepage to allow ordered pages to be freed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25btrfs_start_transaction: wait for commits in progress to finishChris Mason6-9/+51
btrfs_commit_transaction has to loop waiting for any writers in the transaction to finish before it can proceed. btrfs_start_transaction should be polite and not join a transaction that is in the process of being finished off. There are a few places that can't wait, basically the ones doing IO that might be needed to finish the transaction. For them, btrfs_join_transaction is added. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Update on disk i_size only after pending ordered extents are doneChris Mason5-11/+119
This changes the ordered data code to update i_size after the extent is on disk. An on disk i_size is maintained in the in-memory btrfs inode structures, and this is updated as extents finish. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Use async helpers to deal with pages that have been improperly dirtiedChris Mason6-9/+106
Higher layers sometimes call set_page_dirty without asking the filesystem to help. This causes many problems for the data=ordered and cow code. This commit detects pages that haven't been properly setup for IO and kicks off an async helper to deal with them. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: New data=ordered implementationChris Mason14-502/+910
The old data=ordered code would force commit to wait until all the data extents from the transaction were fully on disk. This introduced large latencies into the commit and stalled new writers in the transaction for a long time. The new code changes the way data allocations and extents work: * When delayed allocation is filled, data extents are reserved, and the extent bit EXTENT_ORDERED is set on the entire range of the extent. A struct btrfs_ordered_extent is allocated an inserted into a per-inode rbtree to track the pending extents. * As each page is written EXTENT_ORDERED is cleared on the bytes corresponding to that page. * When all of the bytes corresponding to a single struct btrfs_ordered_extent are written, The previously reserved extent is inserted into the FS btree and into the extent allocation trees. The checksums for the file data are also updated. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Drop some verbose printksChris Mason2-4/+0
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add locking around volume management (device add/remove/balance)Chris Mason6-40/+103
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix deadlock while searching for dead roots on mountChris Mason1-1/+9
btrfs_find_dead_roots called btrfs_read_fs_root_no_radix, which means we end up calling btrfs_search_slot with a path already held. The fix is to remember the key inside btrfs_find_dead_roots and drop the path. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Reduce contention on the root nodeChris Mason2-6/+21
This calls unlock_up sooner in btrfs_search_slot in order to decrease the amount of work done with the higher level tree locks held. Also, it changes btrfs_tree_lock to spin for a big against the page lock before scheduling. This makes a big difference in context switch rate under highly contended workloads. Longer term, a better locking structure is needed than the page lock. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Online btree defragmentation fixesChris Mason9-129/+190
The btree defragger wasn't making forward progress because the new key wasn't being saved by the btrfs_search_forward function. This also disables the automatic btree defrag, it wasn't scaling well to huge filesystems. The auto-defrag needs to be done differently. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add a per-inode csum mutex to avoid races creating csum itemsChris Mason6-6/+21
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Change find_extent_buffer to use TestSetPageLockedChris Mason2-3/+6
This makes it possible for callers to check for extent_buffers in cache without deadlocking against any btree locks held. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add btree locking to the tree defragmentation codeChris Mason4-201/+93
The online btree defragger is simplified and rewritten to use standard btree searches instead of a walk up / down mechanism. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Replace the transaction work queue with kthreadsChris Mason8-118/+136
This creates one kthread for commits and one kthread for deleting old snapshots. All the work queues are removed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Add btrfs_end_transaction_throttle to force writers to wait for pending commitsChris Mason7-59/+55
The existing throttle mechanism was often not sufficient to prevent new writers from coming in and making a given transaction run forever. This adds an explicit wait at the end of most operations so they will allow the current transaction to close. There is no wait inside file_write, inode updates, or cow filling, all which have different deadlock possibilities. This is a temporary measure until better asynchronous commit support is added. This code leads to stalls as it waits for data=ordered writeback, and it really needs to be fixed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix snapshot deletion to release the alloc_mutex much more often.Chris Mason3-9/+23
This lowers the impact of snapshot deletion on the rest of the FS. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add a skip_locking parameter to struct path, and make various funcs ↵Chris Mason3-14/+25
honor it Allocations may need to read in block groups from the extent allocation tree, which will require a tree search and take locks on the extent allocation tree. But, those locks might already be held in other places, leading to deadlocks. Since the alloc_mutex serializes everything right now, it is safe to skip the btree locking while caching block groups. A better fix will be to either create a recursive lock or find a way to back off existing locks while caching block groups. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Fix btrfs_next_leaf to check for new items after dropping locksChris Mason1-0/+7
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Fix btrfs_del_ordered_inode to allow forcing the drop during unlinksChris Mason5-13/+19
This allows us to delete an unlinked inode with dirty pages from the list instead of forcing commit to write these out before deleting the inode. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Drop locks in btrfs_search_slot when reading a tree block.Chris Mason4-39/+38
One lock per btree block can make for significant congestion if everyone has to wait for IO at the high levels of the btree. This drops locks held by a path when doing reads during a tree search. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Replace the big fs_mutex with a collection of other locksChris Mason12-165/+101
Extent alloctions are still protected by a large alloc_mutex. Objectid allocations are covered by a objectid mutex Other btree operations are protected by a lock on individual btree nodes Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Start btree concurrency work.Chris Mason12-214/+579
The allocation trees and the chunk trees are serialized via their own dedicated mutexes. This means allocation location is still not very fine grained. The main FS btree is protected by locks on each block in the btree. Locks are taken top / down, and as processing finishes on a given level of the tree, the lock is released after locking the lower level. The end result of a search is now a path where only the lowest level is locked. Releasing or freeing the path drops any locks held. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add a thread pool just for submit_bioChris Mason3-1/+10
If a bio submission is after a lock holder waiting for the bio on the work queue, it is possible to deadlock. Move the bios into their own pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25BTRFS_IOC_TRANS_START should be privileguedChristoph Hellwig1-0/+3
As mentioned in the comment next to it btrfs_ioctl_trans_start can do bad damage to filesystems and thus should be limited to privilegued users. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: split out ioctl.cChristoph Hellwig4-729/+796
Split the ioctl handling out of inode.c into a file of it's own. Also fix up checkpatch.pl warnings for the moved code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: kerneldoc comments for extent_map.cChristoph Hellwig1-12/+49
Add kerneldoc comments for all exported functions. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add a mount option to control worker thread pool sizeChris Mason3-16/+28
mount -o thread_pool_size changes the default, which is min(num_cpus + 2, 8). Larger thread pools would make more sense on very large disk arrays. This mount option controls the max size of each thread pool. There are multiple thread pools, so the total worker count will be larger than the mount option. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Worker thread optimizationsChris Mason2-34/+73
This changes the worker thread pool to maintain a list of idle threads, avoiding a complex search for a good thread to wake up. Threads have two states: idle - we try to reuse the last thread used in hopes of improving the batching ratios busy - each time a new work item is added to a busy task, the task is rotated to the end of the line. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add backport for the kthread work on kernels older than 2.6.20Chris Mason1-1/+8
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix mount -o max_inline=0Chris Mason1-2/+5
max_inline=0 used to force the max_inline size to one sector instead. Now it properly disables inline data items, while still being able to read any that happen to exist on disk. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add async worker threads for pre and post IO checksummingChris Mason8-132/+626
Btrfs has been using workqueues to spread the checksumming load across other CPUs in the system. But, workqueues only schedule work on the same CPU that queued the work, giving them a limited benefit for systems with higher CPU counts. This code adds a generic facility to schedule work with pools of kthreads, and changes the bio submission code to queue bios up. The queueing is important to make sure large numbers of procs on the system don't turn streaming workloads into random workloads by sending IO down concurrently. The end result of all of this is much higher performance (and CPU usage) when doing checksumming on large machines. Two worker pools are created, one for writes and one for endio processing. The two could deadlock if we tried to service both from a single pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25btrfs: allow scanning multiple devices during mountChristoph Hellwig1-5/+16
Allows to specify one or multiple device=/dev/foo options during mount so that ioctls on the control device can be avoided. Especially useful when trying to mount a multi-device setup as root. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25btrfs: sanity mount option parsing and early mount codeChristoph Hellwig3-108/+141
Also adds lots of comments to describe what's going on here. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25btrfs: fix strange indentation in lookup_extent_mappingChristoph Hellwig1-1/+7
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25btrfs: tiny makefile cleanupChristoph Hellwig1-7/+1
use normal kbuild syntax to build acl.o conditinally and remove comment out lines. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: transaction ioctlsSage Weil5-2/+83
These ioctls let a user application hold a transaction open while it performs a series of operations. A final ioctl does a sync on the fs (closing the current transaction). This is the main requirement for Ceph's OSD to be able to keep the data it's storing in a btrfs volume consistent, and AFAICS it works just fine. The application would do something like fd = ::open("some/file", O_RDONLY); ::ioctl(fd, BTRFS_IOC_TRANS_START); /* do a bunch of stuff */ ::ioctl(fd, BTRFS_IOC_TRANS_END); or just ::close(fd); And to ensure it commits to disk, ::ioctl(fd, BTRFS_IOC_SYNC); When a transaction is held open, the trans_handle is attached to the struct file (via private_data) so that it will get cleaned up if the process dies unexpectedly. A held transaction is also ended on fsync() to avoid a deadlock. A misbehaving application could also deliberately hold a transaction open, effectively locking up the FS, so it may make sense to restrict something like this to root or something. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Dislable acl xattr handlersYan1-6/+6
The acl code is not yet complete, and the xattr handlers are causing problems for cp -p on some distros. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: bdi_init and bdi_destroy come with 2.6.23Jan Engelhardt1-3/+3
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25btrfsctl -A error code fixupLinda Knippers1-2/+2
Send the error back to userland if the ioctl fails Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Invalidate dcache entry after creating snapshot andSven Wegener3-1/+39
We need to invalidate an existing dcache entry after creating a new snapshot or subvolume, because a negative dache entry will stop us from accessing the new snapshot or subvolume. --- ctree.h | 23 +++++++++++++++++++++++ inode.c | 4 ++++ transaction.c | 4 ++++ 3 files changed, 31 insertions(+) Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix race in running_transaction checksChris Mason1-1/+3
When a new transaction was started, the code would incorrectly set the pointer in fs_info before all the data structures were setup. fsync heavy workloads hit races on the setup of the ordered inode spinlock Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25btrfs delete ordered inode handling fixMingming5-32/+23
Use btrfs_release_file instead of a put_inode call Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Always use the async submission queue for checksummed writesChris Mason1-7/+0
This avoids IO stalls and poorly ordered IO from inline writers mixing in with the async submission queue Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Allocator fix variety packChris Mason5-97/+209
* Force chunk allocation when find_free_extent has to do a full scan * Record the max key at the start of defrag so it doesn't run forever * Block groups might not be contiguous, make a forward search for the next block group in extent-tree.c * Get rid of extra checks for total fs size * Fix relocate_one_reference to avoid relocating the same file data block twice when referenced by an older transaction * Use the open device count when allocating chunks so that we don't try to allocate from devices that don't exist Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Use kzalloc on the fs_devices allocationChris Mason1-2/+1
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Handle transid == 0 while opening devicesChris Mason1-1/+1
Signed-off-by: Chris Mason <chris.mason@oracle.com>