btrfs: skip unnecessary extent buffer sharedness checks during fiemap

During fiemap, for each file extent we find, we must check if it's shared or not. The sharedness check starts by verifying if the extent is directly shared (its refcount in the extent tree is > 1), and if it is not directly shared, then we will check if every node in the subvolume b+tree leading from the root to the leaf that has the file extent item (in reverse order), is shared (through snapshots). However this second step is not needed if our extent was created in a transaction more recent than the last transaction where a snapshot of the inode's root happened, because it can't be shared indirectly (through shared subtrees) without a snapshot created in a more recent transaction. So grab the generation of the extent from the extent map and pass it to btrfs_is_data_extent_shared(), which will skip this second phase when the generation is more recent than the root's last snapshot value. Note that we skip this optimization if the extent map is the result of merging 2 or more extent maps, because in this case its generation is the maximum of the generations of all merged extent maps. The fact the we use extent maps and they can be merged despite the underlying extents being distinct (different file extent items in the subvolume b+tree and different extent items in the extent b+tree), can result in some bugs when reporting shared extents. But this is a problem of the current implementation of fiemap relying on extent maps. One example where we get incorrect results is: $ cat fiemap-bug.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV mount $DEV $MNT # Create a file with two 256K extents. # Since there is no other write activity, they will be contiguous, # and their extent maps merged, despite having two distinct extents. xfs_io -f -c "pwrite -S 0xab 0 256K" \ -c "fsync" \ -c "pwrite -S 0xcd 256K 256K" \ -c "fsync" \ $MNT/foo # Now clone only the second extent into another file. xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar # Filefrag will report a single 512K extent, and say it's not shared. echo filefrag -v $MNT/foo umount $MNT Running the reproducer: $ ./fiemap-bug.sh wrote 262144/262144 bytes at offset 0 256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec) wrote 262144/262144 bytes at offset 262144 256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec) linked 262144/262144 bytes at offset 0 256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec) Filesystem type is: 9123683e File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 127: 3328.. 3455: 128: last,eof /mnt/sdj/foo: 1 extent found We end up reporting that we have a single 512K that is not shared, however we have two 256K extents, and the second one is shared. Changing the reproducer to clone instead the first extent into file 'bar', makes us report a single 512K extent that is shared, which is algo incorrect since we have two 256K extents and only the first one is shared. This is z problem that existed before this change, and remains after this change, as it can't be easily fixed. The next patch in the series reworks fiemap to primarily use file extent items instead of extent maps (except for checking for delalloc ranges), with the goal of improving its scalability and performance, but it also ends up fixing this particular bug caused by extent map merging. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
author: Filipe Manana <fdmanana@suse.com> 2022-09-01 14:18:29 +0100
committer: David Sterba <dsterba@suse.com> 2022-09-26 12:28:01 +0200
commit: b8f164e3e67f224f1751b708e66ccebcce1864c4 (patch)
tree: 28e19669c0799f70111a455594011491198470ba
parent: 12a824dc67a61ec02ad2960f1856654febcd5d9d (diff)
3 files changed, 38 insertions, 8 deletions
diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index e0ff71159071..dce3a16996b9 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1613,12 +1613,14 @@ static void store_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
 /*
  * Check if a data extent is shared or not.
  *
- * @root:   root inode belongs to
- * @inum:   inode number of the inode whose extent we are checking
- * @bytenr: logical bytenr of the extent we are checking
- * @roots:  list of roots this extent is shared among
- * @tmp:    temporary list used for iteration
- * @cache:  a backref lookup result cache
+ * @root:        The root the inode belongs to.
+ * @inum:        Number of the inode whose extent we are checking.
+ * @bytenr:      Logical bytenr of the extent we are checking.
+ * @extent_gen:  Generation of the extent (file extent item) or 0 if it is
+ *               not known.
+ * @roots:       List of roots this extent is shared among.
+ * @tmp:         Temporary list used for iteration.
+ * @cache:       A backref lookup result cache.
  *
  * btrfs_is_data_extent_shared uses the backref walking code but will short
  * circuit as soon as it finds a root or inode that doesn't match the
@@ -1632,6 +1634,7 @@ static void store_backref_shared_cache(struct btrfs_backref_shared_cache *cache,
  * Return: 0 if extent is not shared, 1 if it is shared, < 0 on error.
  */
 int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
+				u64 extent_gen,
 				struct ulist *roots, struct ulist *tmp,
 				struct btrfs_backref_shared_cache *cache)
 {
@@ -1683,6 +1686,18 @@ int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
 		if (ret < 0 && ret != -ENOENT)
 			break;
 		ret = 0;
+		/*
+		 * If our data extent is not shared through reflinks and it was
+		 * created in a generation after the last one used to create a
+		 * snapshot of the inode's root, then it can not be shared
+		 * indirectly through subtrees, as that can only happen with
+		 * snapshots. In this case bail out, no need to check for the
+		 * sharedness of extent buffers.
+		 */
+		if (level == -1 &&
+		    extent_gen > btrfs_root_last_snapshot(&root->root_item))
+			break;
+
 		if (level >= 0)
 			store_backref_shared_cache(cache, root, bytenr,
 						   level, false);
diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h
index fec77b30a249..52ae6957b414 100644
--- a/fs/btrfs/backref.h
+++ b/fs/btrfs/backref.h
@@ -77,6 +77,7 @@ int btrfs_find_one_extref(struct btrfs_root *root, u64 inode_objectid,
 			  struct btrfs_inode_extref **ret_extref,
 			  u64 *found_off);
 int btrfs_is_data_extent_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
+				u64 extent_gen,
 				struct ulist *roots, struct ulist *tmp,
 				struct btrfs_backref_shared_cache *cache);
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 618c6890699a..46f89cbe6193 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5574,10 +5574,24 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 			flags |= (FIEMAP_EXTENT_DELALLOC |
 				  FIEMAP_EXTENT_UNKNOWN);
 		} else if (fieinfo->fi_extents_max) {
+			u64 extent_gen;
 			u64 bytenr = em->block_start -
 				(em->start - em->orig_start);
 
 			/*
+			 * If two extent maps are merged, then their generation
+			 * is set to the maximum between their generations.
+			 * Otherwise its generation matches the one we have in
+			 * corresponding file extent item. If we have a merged
+			 * extent map, don't use its generation to speedup the
+			 * sharedness check below.
+			 */
+			if (test_bit(EXTENT_FLAG_MERGED, &em->flags))
+				extent_gen = 0;
+			else
+				extent_gen = em->generation;
+
+			/*
 			 * As btrfs supports shared space, this information
 			 * can be exported to userspace tools via
 			 * flag FIEMAP_EXTENT_SHARED.  If fi_extents_max == 0
@@ -5585,8 +5599,8 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 			 * lookup stuff.
 			 */
 			ret = btrfs_is_data_extent_shared(root, btrfs_ino(inode),
-							  bytenr, roots,
-							  tmp_ulist,
+							  bytenr, extent_gen,
+							  roots, tmp_ulist,
 							  backref_cache);
 			if (ret < 0)
 				goto out_free;
author	Filipe Manana <fdmanana@suse.com>	2022-09-01 14:18:29 +0100
committer	David Sterba <dsterba@suse.com>	2022-09-26 12:28:01 +0200
commit	b8f164e3e67f224f1751b708e66ccebcce1864c4 (patch)
tree	28e19669c0799f70111a455594011491198470ba
parent	12a824dc67a61ec02ad2960f1856654febcd5d9d (diff)