summaryrefslogtreecommitdiff
path: root/fs/btrfs/delayed-inode.c
AgeCommit message (Collapse)AuthorFilesLines
2022-07-25btrfs: do not batch insert non-consecutive dir indexes during log replayJosef Bacik1-2/+33
While running generic/475 in a loop I got the following error BTRFS critical (device dm-11): corrupt leaf: root=5 block=31096832 slot=69, bad key order, prev (263 96 531) current (263 96 524) <snip> item 65 key (263 96 517) itemoff 14132 itemsize 33 item 66 key (263 96 523) itemoff 14099 itemsize 33 item 67 key (263 96 525) itemoff 14066 itemsize 33 item 68 key (263 96 531) itemoff 14033 itemsize 33 item 69 key (263 96 524) itemoff 14000 itemsize 33 As you can see here we have 3 dir index keys with the dir index value of 523, 524, and 525 inserted between 517 and 524. This occurs because our dir index insertion code will bulk insert all dir index items on the node regardless of their actual key value. This makes sense on a normally running system, because if there's a gap in between the items there was a deletion before the item was inserted, so there's not going to be an overlap of the dir index items that need to be inserted and what exists on disk. However during log replay this isn't necessarily true, we could have any number of dir indexes in the tree already. Fix this by seeing if we're replaying the log, and if we are simply skip batching if there's a gap in the key space. This file system was left broken from the fstest, I tested this patch against the broken fs to make sure it replayed the log properly, and then btrfs checked the file system after the log replay to verify everything was ok. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: reduce amount of reserved metadata for delayed item insertionFilipe Manana1-14/+143
Whenever we want to create a new dir index item (when creating an inode, create a hard link, rename a file) we reserve 1 unit of metadata space for it in a transaction (that's 256K for a node/leaf size of 16K), and then create a delayed insertion item for it to be added later to the subvolume's tree. That unit of metadata is kept until the delayed item is inserted into the subvolume tree, which may take a while to happen (in the worst case, it's done only when the transaction commits). If we have multiple dir index items to insert for the same directory, say N index items, and they all fit in a single leaf of metadata, then we are holding N units of reserved metadata space when all we need is 1 unit. This change addresses that, whenever a new delayed dir index item is added, we release the unit of metadata the caller has reserved when it started the transaction if adding that new dir index item does not result in touching one more metadata leaf, otherwise the reservation is kept by transferring it from the transaction block reserve to the delayed items block reserve, just like before. Given that with a leaf size of 16K we can have a few hundred dir index items in a single leaf (the exact value depends on file name lengths), this reduces pressure on metadata reservation by releasing unnecessary space much sooner. The following fs_mark test showed some improvement when creating many files in parallel on machine running a non debug kernel (debian's default kernel config) with 12 cores: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" FILES=100000 THREADS=$(nproc --all) echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor mkfs.btrfs -f $DEV mount $MOUNT_OPTIONS $DEV $MNT OPTS="-S 0 -L 10 -n $FILES -s 0 -t $THREADS -k" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT Before: FSUse% Count Size Files/sec App Overhead 2 1200000 0 225991.3 5465891 4 2400000 0 345728.1 5512106 4 3600000 0 346959.5 5557653 8 4800000 0 329643.0 5587548 8 6000000 0 312657.4 5606717 8 7200000 0 281707.5 5727985 12 8400000 0 88309.8 5020422 12 9600000 0 85835.9 5207496 16 10800000 0 81039.2 5404964 16 12000000 0 58548.6 5842468 After: FSUse% Count Size Files/sec App Overhead 2 1200000 0 230604.5 5778375 4 2400000 0 348908.3 5508072 4 3600000 0 357028.7 5484337 6 4800000 0 342898.3 5565703 6 6000000 0 314670.8 5751555 8 7200000 0 282548.2 5778177 12 8400000 0 90844.9 5306819 12 9600000 0 86963.1 5304689 16 10800000 0 89113.2 5455248 16 12000000 0 86693.5 5518933 The "after" results are after applying this patch and all the other patches in the same patchset, which is comprised of the following changes: btrfs: balance btree dirty pages and delayed items after a rename btrfs: free the path earlier when creating a new inode btrfs: balance btree dirty pages and delayed items after clone and dedupe btrfs: add assertions when deleting batches of delayed items btrfs: deal with deletion errors when deleting delayed items btrfs: refactor the delayed item deletion entry point btrfs: improve batch deletion of delayed dir index items btrfs: assert that delayed item is a dir index item when adding it btrfs: improve batch insertion of delayed dir index items btrfs: do not BUG_ON() on failure to reserve metadata for delayed item btrfs: set delayed item type when initializing it btrfs: reduce amount of reserved metadata for delayed item insertion Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: set delayed item type when initializing itFilipe Manana1-22/+8
Currently we set the type of a delayed item only after successfully inserting it into its respective rbtree. This is fine, as the type is not used anywhere before that point, but for the next patch in the series, there will be the need to check the type of a delayed item before inserting it into a rbtree. So set the type of a delayed item immediately after allocating it. This also makes the trivial wrappers for adding insertion and deletion useless, so it removes them as well. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: do not BUG_ON() on failure to reserve metadata for delayed itemFilipe Manana1-3/+6
At btrfs_insert_delayed_dir_index(), we don't expect the metadata reservation for the delayed dir index item insertion to fail, because the caller is supposed to have reserved 1 unit of metadata space for that. All callers are able to deal with an error in case that happens, so there is no need for something so drastic as a BUG_ON() in case of failure. Instead just emit a warning, so that's easily noticed during development (fstests in particular), and return the error to the caller. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: improve batch insertion of delayed dir index itemsFilipe Manana1-15/+9
Currently we group delayed dir index items for insertion as a single batch (a single btree operation) as long as their keys are sequential in the key space. For example we have delayed index items for the following index keys: 10, 11, 12, 15, 16, 20, 21 We end up building three batches: 1) First one for index keys 10, 11 and 12; 2) Second one for index keys 15 and 16; 3) Third one for index keys 20 and 21. However, since the dir index numbers come from a monotonically increasing counter and are never reused, we could group all these items into a single batch. The existence of holes in the sequence happens only when we had delayed dir index items for insertion that got deleted before they were flushed to the subvolume's tree. The delayed items are stored in a rbtree based on their key order, so we can just group items into a batch as long as they all fit in a leaf, and ignore if there's a gap (key offset, index number) between two consecutive items. This is more efficient and reduces the amount of time spent when running delayed items if there are gaps between dir index items. For example running the following test script: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV mount $DEV $MNT NUM_FILES=100 mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done # Now delete every other file, to create gaps in the dir index keys. for ((i = 1; i <= $NUM_FILES; i += 2)); do rm -f $MNT/testdir/file_$i done start=$(date +%s%N) sync end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo -e "\nsync took $dur milliseconds" umount $MNT While having the following bpftrace script running in another shell: $ cat bpf-delayed-items-inserts.sh #!/usr/bin/bpftrace /* Must add 'noinline' to btrfs_insert_delayed_items(). */ k:btrfs_insert_delayed_items { @start_insert_delayed_items[tid] = nsecs; } k:btrfs_insert_empty_items /@start_insert_delayed_items[tid]/ { @insert_batches = count(); } kr:btrfs_insert_delayed_items /@start_insert_delayed_items[tid]/ { $dur = (nsecs - @start_insert_delayed_items[tid]) / 1000; @btrfs_insert_delayed_items_total_time = sum($dur); delete(@start_insert_delayed_items[tid]); } Before this change: @btrfs_insert_delayed_items_total_time: 576 @insert_batches: 51 After this change: @btrfs_insert_delayed_items_total_time: 174 @insert_batches: 2 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: assert that delayed item is a dir index item when adding itFilipe Manana1-3/+5
All delayed items are for dir index items, we don't support any other item types at the moment. So simplify __btrfs_add_delayed_item() and add an assertion for checking the item's key type. This also allows the next change to be simpler and avoid to check key types. In case we add support for different item types in the future, then we'll hit the assertion during development and be able to adjust any code that is assuming delayed items are always associated to dir index items. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: improve batch deletion of delayed dir index itemsFilipe Manana1-35/+25
Currently we group delayed dir index items for deletion in a single batch (single btree operation) as long as they all exist in the same leaf and as long as their keys are sequential in the key space. For example if we have a leaf that has dir index items with offsets: 2, 3, 4, 6, 7, 10 And we have delayed dir index items for deleting all these indexes, and no delayed items for any other index keys in between, then we end up deleting in 3 batches: 1) First batch for indexes 2, 3 and 4; 2) Second batch for indexes 6 and 7; 3) Third batch for index 10. This is a waste because we can delete all the index keys in a single batch. What matters is that each consecutive delayed index key matches each consecutive dir index key in a leaf. So update the logic at btrfs_batch_delete_items() to check only for a key match between delayed dir index items and dir index items in a leaf. Also avoid the useless first iteration on comparing the key of the first slot to delete with the key of the first delayed item, as it's silly since they always match, as the delayed item's key was used for the btree search that gave us the path we have. This is more efficient and reduces runtime of running delayed items, as well as lock contention on the subvolume's tree. For example, the following test script: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV mount $DEV $MNT NUM_FILES=1000 mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done # Now delete every other file, to create gaps in the dir index keys. for ((i = 1; i <= $NUM_FILES; i += 2)); do rm -f $MNT/testdir/file_$i done # Sync to force any delayed items to be flushed to the tree. sync start=$(date +%s%N) rm -fr $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo -e "\nrm -fr took $dur milliseconds" umount $MNT Running that test script while having the following bpftrace script running in another shell: $ cat bpf-measure.sh #!/usr/bin/bpftrace /* Add 'noinline' to btrfs_delete_delayed_items()'s definition. */ k:btrfs_delete_delayed_items { @start_delete_delayed_items[tid] = nsecs; } k:btrfs_del_items /@start_delete_delayed_items[tid]/ { @delete_batches = count(); } kr:btrfs_delete_delayed_items /@start_delete_delayed_items[tid]/ { $dur = (nsecs - @start_delete_delayed_items[tid]) / 1000; @btrfs_delete_delayed_items_total_time = sum($dur); delete(@start_delete_delayed_items[tid]); } Before this change: @btrfs_delete_delayed_items_total_time: 9563 @delete_batches: 1001 After this change: @btrfs_delete_delayed_items_total_time: 7328 @delete_batches: 509 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: refactor the delayed item deletion entry pointFilipe Manana1-32/+39
The delayed item deletion entry point, btrfs_delete_delayed_items(), is a bit convoluted for a few reasons: 1) It's really a loop disguised with labels and goto statements; 2) There's a 'delete_fail' label which isn't only for error cases, we can jump to that label even if no error happened, if we simply don't have more delayed items to delete; 3) Unnecessarily keeps track of the current and previous items for no good reason, as after getting the next item and releasing the current one, it just jumps to the 'again' label just to look again for the first delayed item; 4) When a delayed item is not in the tree (because it was already deleted before), it releases the item while holding a path locked, which is not necessary and adds more contention to the tree, specially taking into account that the path came from a deletion search, meaning we have write locks for nodes at levels 2, 1 and 0. And releasing the item is not computationally trivial (rb tree deletion, a kfree() and some trivial things). So refactor it to use a while loop and add some comments to make it more obvious why we can have delayed items without a matching item in the tree as well as why not keep the delayed node locked all the time when running all its deletion items. This is also a preparation for some upcoming work involving delayed items. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: deal with deletion errors when deleting delayed itemsFilipe Manana1-1/+3
Currently, btrfs_delete_delayed_items() ignores any errors returned from btrfs_batch_delete_items(). This looks fishy but it's not a problem at the moment because: 1) Two of the errors returned from btrfs_batch_delete_items() are for impossible cases, cases where a delayed item does not match any item in the leaf the path points to - btrfs_delete_delayed_items() always calls btrfs_batch_delete_items() with a path that points to a leaf that contains an item matching a delayed item; 2) btrfs_batch_delete_items() may return an error from btrfs_del_items(), in which case it does not release the delayed items of the batch. At the moment this is harmless because btrfs_del_items() actually is always able to delete items, even if it returns an error - when it returns an error it's because it ended up with a leaf mostly empty (less than 1/3 full) and failed to migrate items from that leaf into its neighbour leaves - this is not critical, as all the items were deleted, we just left the tree a bit unbalanced, but it's still a valid tree and causes no harm, and future operations on the tree will eventually balance it. So even if we get an error from btrfs_del_items(), the delayed items will not be released but the next time we run delayed items we will find out, at btrfs_delete_delayed_items(), that they are not present in the tree anymore and then release them. This is all a bit subtle, and it's certainly prone to be a disaster in case btrfs_del_items() changes one day and may return errors before being able to delete all the requested items, in which case we could leave the filesystem in an inconsistent state as we would commit a transaction despite a failure from deleting items from the tree. So make btrfs_delete_delayed_items() check for any errors from the call to btrfs_batch_delete_items(). Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: add assertions when deleting batches of delayed itemsFilipe Manana1-8/+16
There are a few impossible cases that btrfs_batch_delete_items() tries to deal with: 1) Getting a path pointing to a NULL leaf; 2) The leaf slot is pointing beyond the last item in the leaf; 3) We can't find a single item to delete. The first case is impossible because the given path was returned by a successful call to btrfs_search_slot(). Replace the BUG_ON() with an ASSERT for this. The second case is impossible because we are always called when a delayed item matches an item in the given leaf. So add an ASSERT() for that and if that condition is not satisfied, trigger a warning and return an error. The third case is impossible exactly because of the same reason as the second case. The given delayed item matches one item in the leaf, so we know that our batch always has at least one item. Add an ASSERT to check that, trigger a warning if that expectation fails and return an error. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-15Revert "btrfs: turn delayed_nodes_tree into an XArray"David Sterba1-39/+45
This reverts commit 253bf57555e451dec5a7f09dc95d380ce8b10e5b. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: turn delayed_nodes_tree into an XArrayGabriel Niebler1-45/+39
… in the btrfs_root struct and adjust all usages of this object to use the XArray API, because it is notionally easier to use and understand, as it provides array semantics, and also takes care of locking for us, further simplifying the code. Also use the opportunity to do some light refactoring. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: add an inode-item.hJosef Bacik1-0/+1
We have a few helpers in inode-item.c, and I'm going to make a few changes to how we do truncate in the future, so break out these definitions into their own header file to trim down ctree.h some and make it easier to do the work on truncate in the future. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: change root to fs_info for btrfs_reserve_metadata_bytesJosef Bacik1-1/+1
We used to need the root for btrfs_reserve_metadata_bytes to check the orphan cleanup state, but we no longer need that, we simply need the fs_info. Change btrfs_reserve_metadata_bytes() to use the fs_info, and change both btrfs_block_rsv_refill() and btrfs_block_rsv_add() to do the same as they simply call btrfs_reserve_metadata_bytes() and then manipulate the block_rsv that is being used. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: loop only once over data sizes array when inserting an item batchFilipe Manana1-19/+22
When inserting a batch of items into a btree, we end up looping over the data sizes array 3 times: 1) Once in the caller of btrfs_insert_empty_items(), when it populates the array with the data sizes for each item; 2) Once at btrfs_insert_empty_items() to sum the elements of the data sizes array and compute the total data size; 3) And then once again at setup_items_for_insert(), where we do exactly the same as what we do at btrfs_insert_empty_items(), to compute the total data size. That is not bad for small arrays, but when the arrays have hundreds of elements, the time spent on looping is not negligible. For example when doing batch inserts of delayed items for dir index items or when logging a directory, it's common to have 200 to 260 dir index items in a single batch when using a leaf size of 16K and using file names between 8 and 12 characters. For a 64K leaf size, multiply that by 4. Taking into account that during directory logging or when flushing delayed dir index items we can have many of those large batches, the time spent on the looping adds up quickly. It's also more important to avoid it at setup_items_for_insert(), since we are holding a write lock on a leaf and, in some cases, on upper nodes of the btree, which causes us to block other tasks that want to access the leaf and nodes for longer than necessary. So change the code so that setup_items_for_insert() and btrfs_insert_empty_items() no longer compute the total data size, and instead rely on the caller to supply it. This makes us loop over the array only once, where we can both populate the data size array and compute the total data size, taking advantage of spatial and temporal locality. To make this more manageable, use a structure to contain all the relevant details for a batch of items (keys array, data sizes array, total data size, number of items), and use it as an argument for btrfs_insert_empty_items() and setup_items_for_insert(). This patch is part of a small patchset that is comprised of the following patches: btrfs: loop only once over data sizes array when inserting an item batch btrfs: unexport setup_items_for_insert() btrfs: use single bulk copy operations when logging directories This is patch 1/3 and performance results, and the specific tests, are included in the changelog of patch 3/3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: add ro compat flags to inodesBoris Burkov1-2/+7
Currently, inode flags are fully backwards incompatible in btrfs. If we introduce a new inode flag, then tree-checker will detect it and fail. This can even cause us to fail to mount entirely. To make it possible to introduce new flags which can be read-only compatible, like VERITY, we add new ro flags to btrfs without treating them quite so harshly in tree-checker. A read-only file system can survive an unexpected flag, and can be mounted. As for the implementation, it unfortunately gets a little complicated. The on-disk representation of the inode, btrfs_inode_item, has an __le64 for flags but the in-memory representation, btrfs_inode, uses a u32. David Sterba had the nice idea that we could reclaim those wasted 32 bits on disk and use them for the new ro_compat flags. It turns out that the tree-checker code which checks for unknown flags is broken, and ignores the upper 32 bits we are hoping to use. The issue is that the flags use the literal 1 rather than 1ULL, so the flags are signed ints, and one of them is specifically (1 << 31). As a result, the mask which ORs the flags is a negative integer on machines where int is 32 bit twos complement. When tree-checker evaluates the expression: btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK) The mask is something like 0x80000abc, which gets promoted to u64 with sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves all the upper bits zeroed, and we can't detect unexpected flags. This suggests that we can't use those bits after all. Luckily, we have good reason to believe that they are zero anyway. Inode flags are metadata, which is always checksummed, so any bit flips that would introduce 1s would cause a checksum failure anyway (excluding the improbable case of the checksum getting corrupted exactly badly). Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit inode flag should preserve its value and not add leading zeroes (at least for twos complement). The only place that flag (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in the root item, and indeed for that inode we see 0xffffffff80000000 as the flags on disk. However, that inode is never seen by tree checker, nor is it used in a context where verity might be meaningful. Theoretically, a future ro flag might cause trouble on that inode, so we should proactively clean up that mess before it does. With the introduction of the new ro flags, keep two separate unsigned masks and check them against the appropriate u32. Since we no longer run afoul of sign extension, this also stops writing out 0xffffffff80000000 in root_item inodes going forward. Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: stop doing GFP_KERNEL memory allocations in the ref verify toolFilipe Manana1-12/+0
In commit 351cbf6e4410e7 ("btrfs: use nofs allocations for running delayed items") we wrapped all btree updates when running delayed items with memalloc_nofs_save() and memalloc_nofs_restore(), due to a lock inversion detected by lockdep involving reclaim and the mutex of delayed nodes. The problem is because the ref verify tool does some memory allocations with GFP_KERNEL, which can trigger reclaim and reclaim can trigger inode eviction, which requires locking the mutex of an inode's delayed node. On the other hand the ref verify tool is called when allocating metadata extents as part of operations that modify a btree, which is a problem when running delayed nodes, where we do btree updates while holding the mutex of a delayed node. This is what caused the lockdep warning. Instead of wrapping every btree update when running delayed nodes, change the ref verify tool to never do GFP_KERNEL allocations, because: 1) We get less repeated code, which at the moment does not even have a comment mentioning why we need to setup the NOFS context, which is a recommended good practice as mentioned at Documentation/core-api/gfp_mask-from-fs-io.rst 2) The ref verify tool is something meant only for debugging and not something that should be enabled on non-debug / non-development kernels; 3) We may have yet more places outside delayed-inode.c where we have similar problem: doing btree updates while holding some lock and then having the GFP_KERNEL memory allocations, from the ref verify tool, trigger reclaim and trying again to acquire the same lock through the reclaim path. Or we could get more such cases in the future, therefore this change prevents getting into similar cases when using the ref verify tool. Curiously most of the memory allocations done by the ref verify tool were already using GFP_NOFS, except a few ones for no apparent reason. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: improve the batch insertion of delayed itemsFilipe Manana1-133/+79
When we insert the delayed items of an inode, which corresponds to the directory index keys for a directory (key type BTRFS_DIR_INDEX_KEY), we do the following: 1) Pick the first delayed item from the rbtree and insert it into the fs/subvolume btree, using btrfs_insert_empty_item() for that; 2) Without releasing the path returned by btrfs_insert_empty_item(), keep collecting as many consecutive delayed items from the rbtree as possible, as long as each one's BTRFS_DIR_INDEX_KEY key is the immediate successor of the previously picked item and as long as they fit in the available space of the leaf the path points to; 3) Then insert all the collected items into the leaf; 4) Release the reserve metadata space for each collected item and release each item (implies deleting from the rbtree); 5) Unlock the path. While this is much better than inserting items one by one, it can be improved in a few aspects: 1) Instead of adding items based on the remaining free space of the leaf, collect as many items that can fit in a leaf and bulk insert them. This results in less and larger batches, reducing the total amount of time to insert the delayed items. For example when adding 100K files to a directory, we ended up creating 1658 batches with very variable sizes ranging from 1 item to 118 items, on a filesystem with a node/leaf size of 16K. After this change, we end up with 839 batches, with the vast majority of them having exactly 120 items; 2) We do the search for more items to batch, by iterating the rbtree, while holding a write lock on the leaf; 3) While still holding the leaf locked, we are releasing the reserved metadata for each item and then deleting each item, keeping a write lock on the leaf for longer than necessary. Releasing the delayed items one by one can take a significant amount of time, because deleting them from the rbtree can often be a bit slow when the deletion results in rebalancing the rbtree. So change this so that we try to create larger batches, with a total item size up to the maximum a leaf can support, and by unlocking the leaf immediately after inserting the items, releasing the reserved metadata space of each item and releasing each item without holding the write lock on the leaf. The following script that runs fs_mark was used to test this change: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-m single -d single" FILES=1000000 THREADS=16 FILE_SIZE=0 echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $DEV &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t 16" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT It was run on machine with 12 cores, 64G of ram, using a NVMe device and using a non-debug kernel config (Debian's default config). Results before this change: FSUse% Count Size Files/sec App Overhead 1 16000000 0 76182.1 72223046 3 32000000 0 62746.9 80776528 5 48000000 0 77029.0 93022381 6 64000000 0 73691.6 95251075 8 80000000 0 66288.0 85089634 Results after this change: FSUse% Count Size Files/sec App Overhead 1 16000000 0 79049.5 (+3.7%) 69700824 3 32000000 0 65248.9 (+3.9%) 80583693 5 48000000 0 77991.4 (+1.2%) 90040908 6 64000000 0 75096.8 (+1.9%) 89862241 8 80000000 0 66926.8 (+1.0%) 84429169 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: remove total_data_size variable in btrfs_batch_insert_items()Nathan Chancellor1-2/+1
clang warns: fs/btrfs/delayed-inode.c:684:6: warning: variable 'total_data_size' set but not used [-Wunused-but-set-variable] int total_data_size = 0, total_size = 0; ^ 1 warning generated. This variable's value has been unused since commit fc0d82e103c7 ("btrfs: sink total_data parameter in setup_items_for_insert"). Eliminate it. Link: https://github.com/ClangBuiltLinux/linux/issues/1391 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: abort transaction if we fail to update the delayed inodeJosef Bacik1-0/+8
If we fail to update the delayed inode we need to abort the transaction, because we could leave an inode with the improper counts or some other such corruption behind. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: fix error handling in __btrfs_update_delayed_inodeJosef Bacik1-6/+4
If we get an error while looking up the inode item we'll simply bail without cleaning up the delayed node. This results in this style of warning happening on commit: WARNING: CPU: 0 PID: 76403 at fs/btrfs/delayed-inode.c:1365 btrfs_assert_delayed_root_empty+0x5b/0x90 CPU: 0 PID: 76403 Comm: fsstress Tainted: G W 5.13.0-rc1+ #373 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:btrfs_assert_delayed_root_empty+0x5b/0x90 RSP: 0018:ffffb8bb815a7e50 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff95d6d07e1888 RCX: ffff95d6c0fa3000 RDX: 0000000000000002 RSI: 000000000029e91c RDI: ffff95d6c0fc8060 RBP: ffff95d6c0fc8060 R08: 00008d6d701a2c1d R09: 0000000000000000 R10: ffff95d6d1760ea0 R11: 0000000000000001 R12: ffff95d6c15a4d00 R13: ffff95d6c0fa3000 R14: 0000000000000000 R15: ffffb8bb815a7e90 FS: 00007f490e8dbb80(0000) GS:ffff95d73bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6e75555cb0 CR3: 00000001101ce001 CR4: 0000000000370ef0 Call Trace: btrfs_commit_transaction+0x43c/0xb00 ? finish_wait+0x80/0x80 ? vfs_fsync_range+0x90/0x90 iterate_supers+0x8c/0x100 ksys_sync+0x50/0x90 __do_sys_sync+0xa/0x10 do_syscall_64+0x3d/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Because the iref isn't dropped and this leaves an elevated node->count, so any release just re-queues it onto the delayed inodes list. Fix this by going to the out label to handle the proper cleanup of the delayed node. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: make btrfs_release_delayed_iref handle the !iref caseJosef Bacik1-10/+10
Right now we only cleanup the delayed iref if we have BTRFS_DELAYED_NODE_DEL_IREF set on the node. However we have some error conditions that need to cleanup the iref if it still exists, so to make this code cleaner move the test_bit into btrfs_release_delayed_iref itself and unconditionally call it in each of the cases instead. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock helpersJosef Bacik1-2/+2
A few places we intermix btrfs_inode_lock with a inode_unlock, and some places we just use inode_lock/inode_unlock instead of btrfs_inode_lock. None of these places are using this incorrectly, but as we adjust some of these callers it would be nice to keep everything consistent, so convert everybody to use btrfs_inode_lock/btrfs_inode_unlock. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: simplify code flow in btrfs_delayed_inode_reserve_metadataNikolay Borisov1-20/+5
btrfs_block_rsv_add can return only ENOSPC since it's called with NO_FLUSH modifier. This so simplify the logic in btrfs_delayed_inode_reserve_metadata to exploit this invariant. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add assert and comment ] Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: remove btrfs_inode parameter from btrfs_delayed_inode_reserve_metadataNikolay Borisov1-5/+3
It's only used for tracepoint to obtain the inode number, but we already have the ino from btrfs_delayed_node::inode_id. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-02btrfs: don't flush from btrfs_delayed_inode_reserve_metadataNikolay Borisov1-1/+2
Calling btrfs_qgroup_reserve_meta_prealloc from btrfs_delayed_inode_reserve_metadata can result in flushing delalloc while holding a transaction and delayed node locks. This is deadlock prone. In the past multiple commits: * ae5e070eaca9 ("btrfs: qgroup: don't try to wait flushing if we're already holding a transaction") * 6f23277a49e6 ("btrfs: qgroup: don't commit transaction when we already hold the handle") Tried to solve various aspects of this but this was always a whack-a-mole game. Unfortunately those 2 fixes don't solve a deadlock scenario involving btrfs_delayed_node::mutex. Namely, one thread can call btrfs_dirty_inode as a result of reading a file and modifying its atime: PID: 6963 TASK: ffff8c7f3f94c000 CPU: 2 COMMAND: "test" #0 __schedule at ffffffffa529e07d #1 schedule at ffffffffa529e4ff #2 schedule_timeout at ffffffffa52a1bdd #3 wait_for_completion at ffffffffa529eeea <-- sleeps with delayed node mutex held #4 start_delalloc_inodes at ffffffffc0380db5 #5 btrfs_start_delalloc_snapshot at ffffffffc0393836 #6 try_flush_qgroup at ffffffffc03f04b2 #7 __btrfs_qgroup_reserve_meta at ffffffffc03f5bb6 <-- tries to reserve space and starts delalloc inodes. #8 btrfs_delayed_update_inode at ffffffffc03e31aa <-- acquires delayed node mutex #9 btrfs_update_inode at ffffffffc0385ba8 #10 btrfs_dirty_inode at ffffffffc038627b <-- TRANSACTIION OPENED #11 touch_atime at ffffffffa4cf0000 #12 generic_file_read_iter at ffffffffa4c1f123 #13 new_sync_read at ffffffffa4ccdc8a #14 vfs_read at ffffffffa4cd0849 #15 ksys_read at ffffffffa4cd0bd1 #16 do_syscall_64 at ffffffffa4a052eb #17 entry_SYSCALL_64_after_hwframe at ffffffffa540008c This will cause an asynchronous work to flush the delalloc inodes to happen which can try to acquire the same delayed_node mutex: PID: 455 TASK: ffff8c8085fa4000 CPU: 5 COMMAND: "kworker/u16:30" #0 __schedule at ffffffffa529e07d #1 schedule at ffffffffa529e4ff #2 schedule_preempt_disabled at ffffffffa529e80a #3 __mutex_lock at ffffffffa529fdcb <-- goes to sleep, never wakes up. #4 btrfs_delayed_update_inode at ffffffffc03e3143 <-- tries to acquire the mutex #5 btrfs_update_inode at ffffffffc0385ba8 <-- this is the same inode that pid 6963 is holding #6 cow_file_range_inline.constprop.78 at ffffffffc0386be7 #7 cow_file_range at ffffffffc03879c1 #8 btrfs_run_delalloc_range at ffffffffc038894c #9 writepage_delalloc at ffffffffc03a3c8f #10 __extent_writepage at ffffffffc03a4c01 #11 extent_write_cache_pages at ffffffffc03a500b #12 extent_writepages at ffffffffc03a6de2 #13 do_writepages at ffffffffa4c277eb #14 __filemap_fdatawrite_range at ffffffffa4c1e5bb #15 btrfs_run_delalloc_work at ffffffffc0380987 <-- starts running delayed nodes #16 normal_work_helper at ffffffffc03b706c #17 process_one_work at ffffffffa4aba4e4 #18 worker_thread at ffffffffa4aba6fd #19 kthread at ffffffffa4ac0a3d #20 ret_from_fork at ffffffffa54001ff To fully address those cases the complete fix is to never issue any flushing while holding the transaction or the delayed node lock. This patch achieves it by calling qgroup_reserve_meta directly which will either succeed without flushing or will fail and return -EDQUOT. In the latter case that return value is going to be propagated to btrfs_dirty_inode which will fallback to start a new transaction. That's fine as the majority of time we expect the inode will have BTRFS_DELAYED_NODE_INODE_DIRTY flag set which will result in directly copying the in-memory state. Fixes: c53e9653605d ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-02btrfs: free correct amount of space in btrfs_delayed_inode_reserve_metadataNikolay Borisov1-1/+1
Following commit f218ea6c4792 ("btrfs: delayed-inode: Remove wrong qgroup meta reservation calls") this function now reserves num_bytes, rather than the fixed amount of nodesize. As such this requires the same amount to be freed in case of failure. Fix this by adjusting the amount we are freeing. Fixes: f218ea6c4792 ("btrfs: delayed-inode: Remove wrong qgroup meta reservation calls") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08btrfs: simplify condition in __btrfs_run_delayed_itemsAbaci Team1-1/+1
Fix the following coccicheck warnings: ./fs/btrfs/delayed-inode.c:1157:39-41: WARNING !A || A && B is equivalent to !A || B. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Suggested-by: Jiapeng Zhong <oswb@linux.alibaba.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Abaci Team <abaci-bugfix@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: make btrfs_delayed_update_inode take btrfs_inodeNikolay Borisov1-5/+7
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: locking: rip out path->leave_spinningJosef Bacik1-4/+0
We no longer distinguish between blocking and spinning, so rip out all this code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: locking: remove all the blocking helpersJosef Bacik1-7/+0
Now that we're using a rw_semaphore we no longer need to indicate if a lock is blocking or not, nor do we need to flip the entire path from blocking to spinning. Remove these helpers and all the places they are called. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: sink total_data parameter in setup_items_for_insertNikolay Borisov1-2/+1
That parameter can easily be derived based on the "data_size" and "nr" parameters exploit this fact to simply the function's signature. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: eliminate total_size parameter from setup_items_for_insertNikolay Borisov1-2/+2
The value of this argument can be derived from the total_data as it's simply the value of the data size + size of btrfs_items being touched. Move the parameter calculation inside the function. This results in a simpler interface and also a minor size reduction: ./scripts/bloat-o-meter ctree.original fs/btrfs/ctree.o add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-34 (-34) Function old new delta btrfs_duplicate_item 260 259 -1 setup_items_for_insert 1200 1190 -10 btrfs_insert_empty_items 177 154 -23 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: qgroup: fix wrong qgroup metadata reserve for delayed inodeQu Wenruo1-2/+1
For delayed inode facility, qgroup metadata is reserved for it, and later freed. However we're freeing more bytes than we reserved. In btrfs_delayed_inode_reserve_metadata(): num_bytes = btrfs_calc_metadata_size(fs_info, 1); ... ret = btrfs_qgroup_reserve_meta_prealloc(root, fs_info->nodesize, true); ... if (!ret) { node->bytes_reserved = num_bytes; But in btrfs_delayed_inode_release_metadata(): if (qgroup_free) btrfs_qgroup_free_meta_prealloc(node->root, node->bytes_reserved); else btrfs_qgroup_convert_reserved_meta(node->root, node->bytes_reserved); This means, we're always releasing more qgroup metadata rsv than we have reserved. This won't trigger selftest warning, as btrfs qgroup metadata rsv has extra protection against cases like quota enabled half-way. But we still need to fix this problem any way. This patch will use the same num_bytes for qgroup metadata rsv so we could handle it correctly. Fixes: f218ea6c4792 ("btrfs: delayed-inode: Remove wrong qgroup meta reservation calls") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-25btrfs: use nofs allocations for running delayed itemsJosef Bacik1-0/+13
Zygo reported the following lockdep splat while testing the balance patches ====================================================== WARNING: possible circular locking dependency detected 5.6.0-c6f0579d496a+ #53 Not tainted ------------------------------------------------------ kswapd0/1133 is trying to acquire lock: ffff888092f622c0 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node+0x7c/0x5b0 but task is already holding lock: ffffffff8fc5f860 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}: fs_reclaim_acquire.part.91+0x29/0x30 fs_reclaim_acquire+0x19/0x20 kmem_cache_alloc_trace+0x32/0x740 add_block_entry+0x45/0x260 btrfs_ref_tree_mod+0x6e2/0x8b0 btrfs_alloc_tree_block+0x789/0x880 alloc_tree_block_no_bg_flush+0xc6/0xf0 __btrfs_cow_block+0x270/0x940 btrfs_cow_block+0x1ba/0x3a0 btrfs_search_slot+0x999/0x1030 btrfs_insert_empty_items+0x81/0xe0 btrfs_insert_delayed_items+0x128/0x7d0 __btrfs_run_delayed_items+0xf4/0x2a0 btrfs_run_delayed_items+0x13/0x20 btrfs_commit_transaction+0x5cc/0x1390 insert_balance_item.isra.39+0x6b2/0x6e0 btrfs_balance+0x72d/0x18d0 btrfs_ioctl_balance+0x3de/0x4c0 btrfs_ioctl+0x30ab/0x44a0 ksys_ioctl+0xa1/0xe0 __x64_sys_ioctl+0x43/0x50 do_syscall_64+0x77/0x2c0 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #0 (&delayed_node->mutex){+.+.}: __lock_acquire+0x197e/0x2550 lock_acquire+0x103/0x220 __mutex_lock+0x13d/0xce0 mutex_lock_nested+0x1b/0x20 __btrfs_release_delayed_node+0x7c/0x5b0 btrfs_remove_delayed_node+0x49/0x50 btrfs_evict_inode+0x6fc/0x900 evict+0x19a/0x2c0 dispose_list+0xa0/0xe0 prune_icache_sb+0xbd/0xf0 super_cache_scan+0x1b5/0x250 do_shrink_slab+0x1f6/0x530 shrink_slab+0x32e/0x410 shrink_node+0x2a5/0xba0 balance_pgdat+0x4bd/0x8a0 kswapd+0x35a/0x800 kthread+0x1e9/0x210 ret_from_fork+0x3a/0x50 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(&delayed_node->mutex); lock(fs_reclaim); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by kswapd0/1133: #0: ffffffff8fc5f860 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffff8fc380d8 (shrinker_rwsem){++++}, at: shrink_slab+0x1e8/0x410 #2: ffff8881e0e6c0e8 (&type->s_umount_key#42){++++}, at: trylock_super+0x1b/0x70 stack backtrace: CPU: 2 PID: 1133 Comm: kswapd0 Not tainted 5.6.0-c6f0579d496a+ #53 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Call Trace: dump_stack+0xc1/0x11a print_circular_bug.isra.38.cold.57+0x145/0x14a check_noncircular+0x2a9/0x2f0 ? print_circular_bug.isra.38+0x130/0x130 ? stack_trace_consume_entry+0x90/0x90 ? save_trace+0x3cc/0x420 __lock_acquire+0x197e/0x2550 ? btrfs_inode_clear_file_extent_range+0x9b/0xb0 ? register_lock_class+0x960/0x960 lock_acquire+0x103/0x220 ? __btrfs_release_delayed_node+0x7c/0x5b0 __mutex_lock+0x13d/0xce0 ? __btrfs_release_delayed_node+0x7c/0x5b0 ? __asan_loadN+0xf/0x20 ? pvclock_clocksource_read+0xeb/0x190 ? __btrfs_release_delayed_node+0x7c/0x5b0 ? mutex_lock_io_nested+0xc20/0xc20 ? __kasan_check_read+0x11/0x20 ? check_chain_key+0x1e6/0x2e0 mutex_lock_nested+0x1b/0x20 ? mutex_lock_nested+0x1b/0x20 __btrfs_release_delayed_node+0x7c/0x5b0 btrfs_remove_delayed_node+0x49/0x50 btrfs_evict_inode+0x6fc/0x900 ? btrfs_setattr+0x840/0x840 ? do_raw_spin_unlock+0xa8/0x140 evict+0x19a/0x2c0 dispose_list+0xa0/0xe0 prune_icache_sb+0xbd/0xf0 ? invalidate_inodes+0x310/0x310 super_cache_scan+0x1b5/0x250 do_shrink_slab+0x1f6/0x530 shrink_slab+0x32e/0x410 ? do_shrink_slab+0x530/0x530 ? do_shrink_slab+0x530/0x530 ? __kasan_check_read+0x11/0x20 ? mem_cgroup_protected+0x13d/0x260 shrink_node+0x2a5/0xba0 balance_pgdat+0x4bd/0x8a0 ? mem_cgroup_shrink_node+0x490/0x490 ? _raw_spin_unlock_irq+0x27/0x40 ? finish_task_switch+0xce/0x390 ? rcu_read_lock_bh_held+0xb0/0xb0 kswapd+0x35a/0x800 ? _raw_spin_unlock_irqrestore+0x4c/0x60 ? balance_pgdat+0x8a0/0x8a0 ? finish_wait+0x110/0x110 ? __kasan_check_read+0x11/0x20 ? __kthread_parkme+0xc6/0xe0 ? balance_pgdat+0x8a0/0x8a0 kthread+0x1e9/0x210 ? kthread_create_worker_on_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x50 This is because we hold that delayed node's mutex while doing tree operations. Fix this by just wrapping the searches in nofs. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: Remove __ prefix from btrfs_block_rsv_releaseNikolay Borisov1-4/+2
Currently the non-prefixed version is a simple wrapper used to hide the 4th argument of the prefixed version. This doesn't bring much value in practice and only makes the code harder to follow by adding another level of indirection. Rectify this by removing the __ prefix and have only one public function to release bytes from a block reservation. No semantic changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: add wrapper for transaction abort predicateDavid Sterba1-1/+1
The status of aborted transaction can change between calls and it needs to be accessed by READ_ONCE. Add a helper that also wraps the unlikely hint. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: use the file extent tree infrastructureJosef Bacik1-0/+3
We want to use this everywhere we modify the file extent items permanently. These include: 1) Inserting new file extents for writes and prealloc extents. 2) Truncating inode items. 3) btrfs_cont_expand(). 4) Insert inline extents. 5) Insert new extents from log replay. 6) Insert a new extent for clone, as it could be past i_size. 7) Hole punching For hole punching in particular it might seem it's not necessary because anybody extending would use btrfs_cont_expand, however there is a corner that still can give us trouble. Start with an empty file and fallocate KEEP_SIZE 1M-2M We now have a 0 length file, and a hole file extent from 0-1M, and a prealloc extent from 1M-2M. Now punch 1M-1.5M Because this is past i_size we have [HOLE EXTENT][ NOTHING ][PREALLOC] [0 1M][1M 1.5M][1.5M 2M] with an i_size of 0. Now if we pwrite 0-1.5M we'll increas our i_size to 1.5M, but our disk_i_size is still 0 until the ordered extent completes. However if we now immediately truncate 2M on the file we'll just call btrfs_cont_expand(inode, 1.5M, 2M), since our old i_size is 1.5M. If we commit the transaction here and crash we'll expose the gap. To fix this we need to clear the file extent mapping for the range that we punched but didn't insert a corresponding file extent for. This will mean the truncate will only get an disk_i_size set to 1M if we crash before the finish ordered io happens. I've written an xfstest to reproduce the problem and validate this fix. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: use refcount_inc_not_zero in kill_all_nodesJosef Bacik1-3/+10
We hit the following warning while running down a different problem [ 6197.175850] ------------[ cut here ]------------ [ 6197.185082] refcount_t: underflow; use-after-free. [ 6197.194704] WARNING: CPU: 47 PID: 966 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60 [ 6197.521792] Call Trace: [ 6197.526687] __btrfs_release_delayed_node+0x76/0x1c0 [ 6197.536615] btrfs_kill_all_delayed_nodes+0xec/0x130 [ 6197.546532] ? __btrfs_btree_balance_dirty+0x60/0x60 [ 6197.556482] btrfs_clean_one_deleted_snapshot+0x71/0xd0 [ 6197.566910] cleaner_kthread+0xfa/0x120 [ 6197.574573] kthread+0x111/0x130 [ 6197.581022] ? kthread_create_on_node+0x60/0x60 [ 6197.590086] ret_from_fork+0x1f/0x30 [ 6197.597228] ---[ end trace 424bb7ae00509f56 ]--- This is because the free side drops the ref without the lock, and then takes the lock if our refcount is 0. So you can have nodes on the tree that have a refcount of 0. Fix this by zero'ing out that element in our temporary array so we don't try to kill it again. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: move btrfs_unlock_up_safe to other locking functionsDavid Sterba1-0/+1
The function belongs to the family of locking functions, so move it there. The 'noinline' keyword is dropped as it's now an exported function that does not need it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: get rid of unique workqueue helper functionsOmar Sandoval1-2/+2
Commit 9e0af2376434 ("Btrfs: fix task hang under heavy compressed write") worked around the issue that a recycled work item could get a false dependency on the original work item due to how the workqueue code guarantees non-reentrancy. It did so by giving different work functions to different types of work. However, the fixes in the previous few patches are more complete, as they prevent a work item from being recycled at all (except for a tiny window that the kernel workqueue code handles for us). This obsoletes the previous fix, so we don't need the unique helpers for correctness. The only other reason to keep them would be so they show up in stack traces, but they always seem to be optimized to a tail call, so they don't show up anyways. So, let's just get rid of the extra indirection. While we're here, rename normal_work_helper() to the more informative btrfs_work_helper(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: move cond_wake_up functions out of ctreeDavid Sterba1-0/+1
The file ctree.h serves as a header for everything and has become quite bloated. Split some helpers that are generic and create a new file that should be the catch-all for code that's not btrfs-specific. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: only reserve metadata_size for inodesJosef Bacik1-1/+1
Historically we reserved worst case for every btree operation, and generally speaking we want to do that in cases where it could be the worst case. However for updating inodes we know the inode items are already in the tree, so it will only be an update operation and never an insert operation. This allows us to always reserve only the metadata_size amount for inode updates rather than the insert_metadata_size amount. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: rename the btrfs_calc_*_metadata_size helpersJosef Bacik1-2/+2
btrfs_calc_trunc_metadata_size differs from trans_metadata_size in that it doesn't take into account any splitting at the levels, because truncate will never split nodes. However truncate _and_ changing will never split nodes, so rename btrfs_calc_trunc_metadata_size to btrfs_calc_metadata_size. Also btrfs_calc_trans_metadata_size is purely for inserting items, so rename this to btrfs_calc_insert_metadata_size. Making these clearer will help when I start using them differently in upcoming patches. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: delayed-inode: Kill the BUG_ON() in btrfs_delete_delayed_dir_index()Qu Wenruo1-2/+11
There is one report of fuzzed image which leads to BUG_ON() in btrfs_delete_delayed_dir_index(). Although that fuzzed image can already be addressed by enhanced extent-tree error handler, it's still better to hunt down more BUG_ON(). This patch will hunt down two BUG_ON()s in btrfs_delete_delayed_dir_index(): - One for error from btrfs_delayed_item_reserve_metadata() Instead of BUG_ON(), we output an error message and free the item. And return the error. All callers of this function handles the error by aborting current trasaction. - One for possible EEXIST from __btrfs_add_delayed_deletion_item() That function can return -EEXIST. We already have a good enough error message for that, only need to clean up the reserved metadata space and allocated item. To help above cleanup, also modifiy __btrfs_remove_delayed_item() called in btrfs_release_delayed_item(), to skip unassociated item. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203253 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: get fs_info from eb in btrfs_leaf_free_spaceDavid Sterba1-2/+1
We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: use common file type conversionPhillip Potter1-1/+1
Deduplicate the btrfs file type conversion implementation - file systems that use the same file types as defined by POSIX do not need to define their own versions and can use the common helper functions decared in fs_types.h and implemented in fs_types.c Common implementation can be found via commit: bbe7449e2599 "fs: common implementation of file type" Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Phillip Potter <phil@philpotter.co.uk> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15Btrfs: kill btrfs_clear_path_blockingLiu Bo1-3/+0
Btrfs's btree locking has two modes, spinning mode and blocking mode, while searching btree, locking is always acquired in spinning mode and then converted to blocking mode if necessary, and in some hot paths we may switch the locking back to spinning mode by btrfs_clear_path_blocking(). When acquiring locks, both of reader and writer need to wait for blocking readers and writers to complete before doing read_lock()/write_lock(). The problem is that btrfs_clear_path_blocking() needs to switch nodes in the path to blocking mode at first (by btrfs_set_path_blocking) to make lockdep happy before doing its actual clearing blocking job. When switching to blocking mode from spinning mode, it consists of step 1) bumping up blocking readers counter and step 2) read_unlock()/write_unlock(), this has caused serious ping-pong effect if there're a great amount of concurrent readers/writers, as waiters will be woken up and go to sleep immediately. 1) Killing this kind of ping-pong results in a big improvement in my 1600k files creation script, MNT=/mnt/btrfs mkfs.btrfs -f /dev/sdf mount /dev/def $MNT time fsmark -D 10000 -S0 -n 100000 -s 0 -L 1 -l /tmp/fs_log.txt \ -d $MNT/0 -d $MNT/1 \ -d $MNT/2 -d $MNT/3 \ -d $MNT/4 -d $MNT/5 \ -d $MNT/6 -d $MNT/7 \ -d $MNT/8 -d $MNT/9 \ -d $MNT/10 -d $MNT/11 \ -d $MNT/12 -d $MNT/13 \ -d $MNT/14 -d $MNT/15 w/o patch: real 2m27.307s user 0m12.839s sys 13m42.831s w/ patch: real 1m2.273s user 0m15.802s sys 8m16.495s 1.1) latency histogram from funclatency[1] Overall with the patch, there're ~50% less write lock acquisition and the 95% max latency that write lock takes also reduces to ~100ms from >500ms. -------------------------------------------- w/o patch: -------------------------------------------- Function = btrfs_tree_lock msecs : count distribution 0 -> 1 : 2385222 |****************************************| 2 -> 3 : 37147 | | 4 -> 7 : 20452 | | 8 -> 15 : 13131 | | 16 -> 31 : 3877 | | 32 -> 63 : 3900 | | 64 -> 127 : 2612 | | 128 -> 255 : 974 | | 256 -> 511 : 165 | | 512 -> 1023 : 13 | | Function = btrfs_tree_read_lock msecs : count distribution 0 -> 1 : 6743860 |****************************************| 2 -> 3 : 2146 | | 4 -> 7 : 190 | | 8 -> 15 : 38 | | 16 -> 31 : 4 | | -------------------------------------------- w/ patch: -------------------------------------------- Function = btrfs_tree_lock msecs : count distribution 0 -> 1 : 1318454 |****************************************| 2 -> 3 : 6800 | | 4 -> 7 : 3664 | | 8 -> 15 : 2145 | | 16 -> 31 : 809 | | 32 -> 63 : 219 | | 64 -> 127 : 10 | | Function = btrfs_tree_read_lock msecs : count distribution 0 -> 1 : 6854317 |****************************************| 2 -> 3 : 2383 | | 4 -> 7 : 601 | | 8 -> 15 : 92 | | 2) dbench also proves the improvement, dbench -t 120 -D /mnt/btrfs 16 w/o patch: Throughput 158.363 MB/sec w/ patch: Throughput 449.52 MB/sec 3) xfstests didn't show any additional failures. One thing to note is that callers may set path->leave_spinning to have all nodes in the path stay in spinning mode, which means callers are ready to not sleep before releasing the path, but it won't cause problems if they don't want to sleep in blocking mode. [1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15Btrfs: delayed-inode: use rb_first_cached for ins_root and del_rootLiu Bo1-13/+16
rb_first_cached() trades an extra pointer "leftmost" for doing the same job as rb_first() but in O(1). Functions manipulating delayed_item need to get the first entry, this converts it to use rb_first_cached(). For more details about the optimization see patch "Btrfs: delayed-refs: use rb_first_cached for href_root". Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: Remove 'objectid' member from struct btrfs_rootMisono Tomohiro1-2/+3
There are two members in struct btrfs_root which indicate root's objectid: objectid and root_key.objectid. They are both set to the same value in __setup_root(): static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info, u64 objectid) { ... root->objectid = objectid; ... root->root_key.objectid = objecitd; ... } and not changed to other value after initialization. grep in btrfs directory shows both are used in many places: $ grep -rI "root->root_key.objectid" | wc -l 133 $ grep -rI "root->objectid" | wc -l 55 (4.17, inc. some noise) It is confusing to have two similar variable names and it seems that there is no rule about which should be used in a certain case. Since ->root_key itself is needed for tree reloc tree, let's remove 'objecitd' member and unify code to use ->root_key.objectid in all places. Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>