Age | Commit message (Collapse) | Author | Files | Lines |
|
There are quite some div64 calls inside btrfs_map_block() and its
variants.
Such calls are for @stripe_nr, where @stripe_nr is the number of
stripes before our logical bytenr inside a chunk.
However we can eliminate such div64 calls by just reducing the width of
@stripe_nr from 64 to 32.
This can be done because our chunk size limit is already 10G, with fixed
stripe length 64K.
Thus a U32 is definitely enough to contain the number of stripes.
With such width reduction, we can get rid of slower div64, and extra
warning for certain 32bit arch.
This patch would do:
- Add a new tree-checker chunk validation on chunk length
Make sure no chunk can reach 256G, which can also act as a bitflip
checker.
- Reduce the width from u64 to u32 for @stripe_nr variables
- Replace unnecessary div64 calls with regular modulo and division
32bit division and modulo are much faster than 64bit operations, and
we are finally free of the div64 fear at least in those involved
functions.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently btrfs doesn't support stripe lengths other than 64KiB.
This is already set in the tree-checker.
There is really no meaning to record that fixed value in map_lookup for
now, and can all be replaced with BTRFS_STRIPE_LEN.
Furthermore we can use the fix stripe length to do the following
optimization:
- Use BTRFS_STRIPE_LEN_SHIFT to replace some 64bit division
Now we only need to do a right shift.
And the value of BTRFS_STRIPE_LEN itself is already too large for bit
shift, thus if we accidentally use BTRFS_STRIPE_LEN to do bit shift,
a compiler warning would be triggered.
Thus this bit shift optimization would be safe.
- Use BTRFS_STRIPE_LEN_MASK to calculate the offset inside a stripe
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There quite a few spelling mistakes as found using codespell. Fix them.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
When debugging a scrub related metadata error, it turns out that our
metadata error reporting is not ideal.
The only 3 error messages are:
- BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 0, gen 1
Showing we have metadata generation mismatch errors.
- BTRFS error (device dm-2): unable to fixup (regular) error at logical 7110656 on dev /dev/mapper/test-scratch1
Showing which tree blocks are corrupted.
- BTRFS warning (device dm-2): checksum/header error at logical 24772608 on dev /dev/mapper/test-scratch2, physical 3801088: metadata node (level 1) in tree 5
Showing which physical range the corrupted metadata is at.
We have to combine the above 3 to know we have a corrupted metadata with
generation mismatch.
And this is already the better case, if we have other problems, like
fsid mismatch, we can not even know the cause.
[CAUSE]
The problem is caused by the fact that, scrub_checksum_tree_block()
never outputs any error message.
It just return two bits for scrub: sblock->header_error, and
sblock->generation_error.
And later we report error in scrub_print_warning(), but unfortunately we
only have two bits, there is not really much thing we can done to print
any detailed errors.
[FIX]
This patch will do the following to enhance the error reporting of
metadata scrub:
- Add extra warning (ratelimited) for every error we hit
This can help us to distinguish the different types of errors.
Some errors can help us to know what's going wrong immediately,
like bytenr mismatch.
- Re-order the checks
Currently we check bytenr first, then immediately generation.
This can lead to false generation mismatch reports, while the fsid
mismatches.
Here is the new output for the bug I'm debugging (we forgot to
writeback tree blocks for commit roots):
BTRFS warning (device dm-2): tree block 24117248 mirror 1 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc
BTRFS warning (device dm-2): tree block 24117248 mirror 0 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc
Now we can immediately know it's some tree blocks didn't even get written
back, other than the original confusing generation mismatch.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Although we have an existing function, btrfs_lookup_csums_range(), to
find all data checksums for a range, it's based on a btrfs_ordered_sum
list.
For the incoming RAID56 data checksum verification at RMW time, we don't
want to waste time by allocating temporary memory.
So this patch will introduce a new helper, btrfs_lookup_csums_bitmap().
It will use bitmap based result, which will be a perfect fit for later
RAID56 usage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
If dev-replace failed to re-construct its data/metadata, the kernel
message would be incorrect for the missing device:
BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault)
Note the above "dev (efault)" of the second line.
While the first line is properly reporting "<missing disk>".
[CAUSE]
Although dev-replace is using btrfs_dev_name(), the heavy lifting work
is still done by scrub (scrub is reused by both dev-replace and regular
scrub).
Unfortunately scrub code never uses btrfs_dev_name() helper, as it's
only declared locally inside dev-replace.c.
[FIX]
Fix the output by:
- Move the btrfs_dev_name() helper to volumes.h
- Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls
Only zoned code is not touched, as I'm not familiar with degraded
zoned code.
- Constify return value and parameter
Now the output looks pretty sane:
BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The public backref walking functions have quite a lot of arguments that
are passed down the call stack to find_parent_nodes(), the core function
of the backref walking code.
The next patches in series will need to add even arguments to these
functions that should be passed not only to find_parent_nodes(), but also
to other functions used by the later (directly or even lower in the call
stack).
So create a structure to hold all these arguments and state used by the
main backref walking function, find_parent_nodes(), and use it as the
argument for the public backref walking functions iterate_extent_inodes(),
btrfs_find_all_leafs() and btrfs_find_all_roots().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The interface for find_parent_nodes() has two extent offset related
arguments:
1) One u64 pointer argument for the extent offset;
2) One boolean argument to tell if the extent offset should be ignored or
not.
These are confusing, becase the extent offset pointer can be NULL and in
some cases callers pass a NULL value as a way to tell the backref walking
code to ignore offsets in file extent items (and simply consider all file
extent items that point to the target data extent).
The boolean argument was added in commit c995ab3cda3f ("btrfs: add a flag
to iterate_inodes_from_logical to find all extent refs for uncompressed
extents"), but it was never really necessary, it was enough if it could
find a way to get a NULL value passed to the "extent_item_pos" argument of
find_parent_nodes(). The arguments are also passed to functions called
by find_parent_nodes() and respective helper functions, which further
makes everything more complicated than needed.
Then we have several backref walking related functions that end up calling
find_parent_nodes(), either directly or through some other function that
they call, and for many we have to use an "extent_item_pos" (u64) argument
and a boolean "ignore_offset" argument too.
This is confusing and not really necessary. So use a single argument to
specify the extent offset, as a simple u64 and not as a pointer, but
using a special value of (u64)-1, defined as a documented constant, to
indicate when the extent offset should be ignored.
This is also preparation work for the upcoming patches in the series that
add other arguments to find_parent_nodes() and other related functions
that use it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently send does not do the best decisions when it comes to decide
between multiple clone sources, which results in clone operations for
partial extent ranges, which has the following disadvantages:
1) We get less shared extents at the destination;
2) We have to read more data during the send operation and emit more
write commands.
Besides not being optimal behaviour, it also breaks user expectations and
is often reported by users, with a recent example in the Link tag at the
bottom of this change log.
Part of the reason for this non-optimal behaviour is that the backref
walking code does not provide information about the length of the file
extent items that were found for each backref, so send is blind about
which backref is the best to chose as a cloning source.
The other existing reasons are just silliness, namely always prefering
the inode with the lowest number when multiple are found for the same
root and when we can clone from multiple roots, always prefer the send
root over any of the other clone roots. This does not make any sense
since any inode or root is fine and as good as any other inode/root.
Fix this by making backref walking pass information about the number of
bytes referenced by each file extent item and then have send's backref
callback pick the inode with the highest number of bytes for each root.
Finally select the root from which we can clone more bytes from.
Example reproducer:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount $DEV $MNT
xfs_io -f -c "pwrite -S 0xab -b 2M 0 2M" $MNT/foo
cp --reflink=always $MNT/foo $MNT/bar
cp --reflink=always $MNT/foo $MNT/baz
sync
# Overwrite the second half of file foo.
xfs_io -c "pwrite -S 0xcd -b 1M 1M 1M" $MNT/foo
sync
echo
echo "*** fiemap in the original filesystem ***"
echo
xfs_io -c "fiemap -v" $MNT/foo
xfs_io -c "fiemap -v" $MNT/bar
xfs_io -c "fiemap -v" $MNT/baz
echo
btrfs filesystem du $MNT
btrfs subvolume snapshot -r $MNT $MNT/snap
btrfs send -f /tmp/send_stream $MNT/snap
umount $MNT
mkfs.btrfs -f $DEV &> /dev/null
mount $DEV $MNT
btrfs receive -f /tmp/send_stream $MNT
echo
echo "*** fiemap in the new filesystem ***"
echo
xfs_io -r -c "fiemap -v" $MNT/snap/foo
xfs_io -r -c "fiemap -v" $MNT/snap/bar
xfs_io -r -c "fiemap -v" $MNT/snap/baz
echo
btrfs filesystem du $MNT
rm -f /tmp/send_stream
rm -f /tmp/snap.fssum
umount $MNT
Before this change:
$ ./test.sh
(...)
*** fiemap in the original filesystem ***
/mnt/sdi/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 30720..32767 2048 0x1
/mnt/sdi/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001
/mnt/sdi/baz:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001
Total Exclusive Set shared Filename
2.00MiB 1.00MiB - /mnt/sdi/foo
2.00MiB 0.00B - /mnt/sdi/bar
2.00MiB 0.00B - /mnt/sdi/baz
6.00MiB 1.00MiB 2.00MiB /mnt/sdi
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap'
At subvol /mnt/sdi/snap
At subvol snap
*** fiemap in the new filesystem ***
/mnt/sdi/snap/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001
/mnt/sdi/snap/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 30720..32767 2048 0x1
/mnt/sdi/snap/baz:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 32768..34815 2048 0x1
Total Exclusive Set shared Filename
2.00MiB 0.00B - /mnt/sdi/snap/foo
2.00MiB 1.00MiB - /mnt/sdi/snap/bar
2.00MiB 1.00MiB - /mnt/sdi/snap/baz
6.00MiB 2.00MiB - /mnt/sdi/snap
6.00MiB 2.00MiB 2.00MiB /mnt/sdi
We end up with two 1M extents that are not shared for files bar and baz.
After this change:
$ ./test.sh
(...)
*** fiemap in the original filesystem ***
/mnt/sdi/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 30720..32767 2048 0x1
/mnt/sdi/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001
/mnt/sdi/baz:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001
Total Exclusive Set shared Filename
2.00MiB 1.00MiB - /mnt/sdi/foo
2.00MiB 0.00B - /mnt/sdi/bar
2.00MiB 0.00B - /mnt/sdi/baz
6.00MiB 1.00MiB 2.00MiB /mnt/sdi
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap'
At subvol /mnt/sdi/snap
At subvol snap
*** fiemap in the new filesystem ***
/mnt/sdi/snap/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001
/mnt/sdi/snap/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 30720..32767 2048 0x2001
/mnt/sdi/snap/baz:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 30720..32767 2048 0x2001
Total Exclusive Set shared Filename
2.00MiB 0.00B - /mnt/sdi/snap/foo
2.00MiB 0.00B - /mnt/sdi/snap/bar
2.00MiB 0.00B - /mnt/sdi/snap/baz
6.00MiB 0.00B - /mnt/sdi/snap
6.00MiB 0.00B 3.00MiB /mnt/sdi
Now there's a much better sharing, files bar and baz share 1M of the
extent of file foo and the second extent of files bar and baz is shared
between themselves.
This will later be turned into a test case for fstests.
Link: https://lore.kernel.org/linux-btrfs/20221008005704.795b44b0@crass-HP-ZBook-15-G2/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move these out of ctree.h into scrub.h to cut down on code in ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move these prototypes out of ctree.h and into file-item.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
All callers pas GFP_KERNEL as parameter so we can use it directly in
alloc_scrub_sector.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's only one caller that calls scrub_setup_recheck_block in the
memalloc_nofs_save/_restore protection so it's effectively already
GFP_NOFS and it's safe to use GFP_KERNEL.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is a large patch, but because they're all macros it's impossible to
split up. Simply copy all of the item accessors in ctree.h and paste
them in accessors.h, and then update any files to include the header so
everything compiles.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments, style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have several fs wide related helpers in ctree.h. The bulk of these
are the incompat flag test helpers, but there are things such as
btrfs_fs_closing() and the read only helpers that also aren't directly
related to the ctree code. Move these into a fs.h header, which will
serve as the location for file system wide related helpers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is only used locally in scrub.c, move it out of ctree.h into
scrub.c.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we're doing device replace on a zoned filesystem and discover in
scrub_enumerate_chunks() that we don't have to copy the block group it is
unlocked before it gets skipped.
But as the block group hasn't yet been locked before it leads to a locking
imbalance. To fix this simply remove the unlock.
This was uncovered by fstests' testcase btrfs/163.
Fixes: 9283b9e09a6d ("btrfs: remove lock protection for BLOCK_GROUP_FLAG_TO_COPY")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This reverts commit 786672e9e1a39a231806313e3c445c236588ceef.
[BUG]
Since commit 786672e9e1a3 ("btrfs: scrub: use larger block size for data
extent scrub"), btrfs scrub no longer reports errors if the corruption
is not in the first sector of a STRIPE_LEN.
The following script can expose the problem:
mkfs.btrfs -f $dev
mount $dev $mnt
xfs_io -f -c "pwrite -S 0xff 0 8k" $mnt/foobar
umount $mnt
# 13631488 is the logical bytenr of above 8K extent
btrfs-map-logical -l 13631488 -b 4096 $dev
mirror 1 logical 13631488 physical 13631488 device /dev/test/scratch1
# Corrupt the 2nd sector of that extent
xfs_io -f -c "pwrite -S 0x00 13635584 4k" $dev
mount $dev $mnt
btrfs scrub start -B $mnt
scrub done for 54e63f9f-0c30-4c84-a33b-5c56014629b7
Scrub started: Mon Nov 7 07:18:27 2022
Status: finished
Duration: 0:00:00
Total to scrub: 536.00MiB
Rate: 0.00B/s
Error summary: no errors found <<<
[CAUSE]
That offending commit enlarges the data extent scrub size from sector
size to BTRFS_STRIPE_LEN, to avoid extra scrub_block to be allocated.
But unfortunately the data extent scrub is still heavily relying on the
fact that there is only one scrub_sector per scrub_block.
Thus it will only check the first sector, and ignoring the remaining
sectors.
Furthermore the error reporting is not able to handle multiple sectors
either.
[FIX]
For now just revert the offending commit.
The consequence is just extra memory usage during scrub.
We will need a proper change to make the remaining data scrub path to
handle multiple sectors before we enlarging the data scrub size.
Reported-by: Li Zhang <zhanglikernel@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we have NOWAIT specified on our IOCB and we're writing into a
PREALLOC or NOCOW extent then we need to be able to tell
can_nocow_extent that we don't want to wait on any locks or metadata IO.
Fix can_nocow_extent to allow for NOWAIT.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The parity raid write/recover functionality is currently not very well
abstracted from the bio submission and completion handling in volumes.c:
- the raid56 code directly completes the original btrfs_bio fed into
btrfs_submit_bio instead of dispatching back to volumes.c
- the raid56 code consumes the bioc and bio_counter references taken
by volumes.c, which also leads to special casing of the calls from
the scrub code into the raid56 code
To fix this up supply a bi_end_io handler that calls back into the
volumes.c machinery, which then puts the bioc, decrements the bio_counter
and completes the original bio, and updates the scrub code to also
take ownership of the bioc and bio_counter in all cases.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[PROBLEM]
The existing scrub code for data extents always limit the block size to
sectorsize.
This causes quite some extra scrub_block being allocated:
(there is a data extent at logical bytenr 298844160, length 64KiB)
alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1
alloc_scrub_block: new block: logical=298848256 physical=298848256 mirror=1
alloc_scrub_block: new block: logical=298852352 physical=298852352 mirror=1
alloc_scrub_block: new block: logical=298856448 physical=298856448 mirror=1
alloc_scrub_block: new block: logical=298860544 physical=298860544 mirror=1
alloc_scrub_block: new block: logical=298864640 physical=298864640 mirror=1
alloc_scrub_block: new block: logical=298868736 physical=298868736 mirror=1
alloc_scrub_block: new block: logical=298872832 physical=298872832 mirror=1
alloc_scrub_block: new block: logical=298876928 physical=298876928 mirror=1
alloc_scrub_block: new block: logical=298881024 physical=298881024 mirror=1
alloc_scrub_block: new block: logical=298885120 physical=298885120 mirror=1
alloc_scrub_block: new block: logical=298889216 physical=298889216 mirror=1
alloc_scrub_block: new block: logical=298893312 physical=298893312 mirror=1
alloc_scrub_block: new block: logical=298897408 physical=298897408 mirror=1
alloc_scrub_block: new block: logical=298901504 physical=298901504 mirror=1
alloc_scrub_block: new block: logical=298905600 physical=298905600 mirror=1
...
scrub_block_put: free block: logical=298844160 physical=298844160 len=4096 mirror=1
scrub_block_put: free block: logical=298848256 physical=298848256 len=4096 mirror=1
scrub_block_put: free block: logical=298852352 physical=298852352 len=4096 mirror=1
scrub_block_put: free block: logical=298856448 physical=298856448 len=4096 mirror=1
scrub_block_put: free block: logical=298860544 physical=298860544 len=4096 mirror=1
scrub_block_put: free block: logical=298864640 physical=298864640 len=4096 mirror=1
scrub_block_put: free block: logical=298868736 physical=298868736 len=4096 mirror=1
scrub_block_put: free block: logical=298872832 physical=298872832 len=4096 mirror=1
scrub_block_put: free block: logical=298876928 physical=298876928 len=4096 mirror=1
scrub_block_put: free block: logical=298881024 physical=298881024 len=4096 mirror=1
scrub_block_put: free block: logical=298885120 physical=298885120 len=4096 mirror=1
scrub_block_put: free block: logical=298889216 physical=298889216 len=4096 mirror=1
scrub_block_put: free block: logical=298893312 physical=298893312 len=4096 mirror=1
scrub_block_put: free block: logical=298897408 physical=298897408 len=4096 mirror=1
scrub_block_put: free block: logical=298901504 physical=298901504 len=4096 mirror=1
scrub_block_put: free block: logical=298905600 physical=298905600 len=4096 mirror=1
This behavior will waste a lot of memory, especially after we have moved
quite some members from scrub_sector to scrub_block.
[FIX]
To reduce the allocation of scrub_block, and to reduce memory usage, use
BTRFS_STRIPE_LEN instead of sectorsize as the block size to scrub data
extents.
This results only one scrub_block to be allocated for above data extent:
alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1
scrub_block_put: free block: logical=298844160 physical=298844160 len=65536 mirror=1
This would greatly reduce the memory usage (even it's just transient)
for larger data extents scrub.
For above example, the memory usage would be:
Old: num_sectors * (sizeof(scrub_block) + sizeof(scrub_sector))
16 * (408 + 96) = 8065
New: sizeof(scrub_block) + num_sectors * sizeof(scrub_sector)
408 + 16 * 96 = 1944
A good reduction of 75.9%.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
scrub_block
Currently we store the following members in scrub_sector:
- logical
- physical
- physical_for_dev_replace
- dev
- mirror_num
However the current scrub code has ensured that scrub_blocks never cross
stripe boundary.
This is caused by the entry functions (scrub_simple_mirror,
scrub_simple_stripe), thus every scrub_block will not cross stripe
boundary.
Thus this makes it possible to move those members into scrub_block other
than putting them into scrub_sector.
This should save quite some memory, as a scrub_block can be as large as 64
sectors, even for metadata it's 16 sectors byte default.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Although scrub currently works for subpage (PAGE_SIZE > sectorsize) cases,
it will allocate one page for each scrub_sector, which can cause extra
unnecessary memory usage.
Utilize scrub_block::pages[] instead of allocating page for each
scrub_sector, this allows us to integrate larger extents while using
less memory.
For example, if our page size is 64K, sectorsize is 4K, and we got an
32K sized extent.
We will only allocate one page for scrub_block, and all 8 scrub sectors
will point to that page.
To do that properly, here we introduce several small helpers:
- scrub_page_get_logical()
Get the logical bytenr of a page.
We store the logical bytenr of the page range into page::private.
But for 32bit systems, their (void *) is not large enough to contain
a u64, so in that case we will need to allocate extra memory for it.
For 64bit systems, we can use page::private directly.
- scrub_block_get_logical()
Just get the logical bytenr of the first page.
- scrub_sector_get_page()
Return the page which the scrub_sector points to.
- scrub_sector_get_page_offset()
Return the offset inside the page which the scrub_sector points to.
- scrub_sector_get_kaddr()
Return the address which the scrub_sector points to.
Just a wrapper using scrub_sector_get_page() and
scrub_sector_get_page_offset()
- bio_add_scrub_sector()
Please note that, even with this patch, we're still allocating one page
for one sector for data extents.
This is because in scrub_extent() we split the data extent using
sectorsize.
The memory usage reduction will need extra work to make scrub to work
like data read to only use the correct sector(s).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
for subpage
[BACKGROUND]
Currently for scrub, we allocate one page for one sector, this is fine
for PAGE_SIZE == sectorsize support, but can waste extra memory for
subpage support.
[CODE CHANGE]
Make scrub_block contain all the pages, so if we're scrubbing an extent
sized 64K, and our page size is also 64K, we only need to allocate one
page.
[LIFESPAN CHANGE]
Since now scrub_sector no longer holds a page, but is using
scrub_block::pages[] instead, we have to ensure scrub_block has a longer
lifespan for write bio. The lifespan for read bio is already large
enough.
Now scrub_block will only be released after the write bio finished.
[COMING NEXT]
Currently we only added scrub_block::pages[] for this purpose, but
scrub_sector is still utilizing the old scrub_sector::page.
The switch will happen in the next patch.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
helper
The allocation and initialization is shared by 3 call sites, and we're
going to change the initialization of some members in the upcoming
patches.
So factor out the allocation and initialization of scrub_sector into a
helper, alloc_scrub_sector(), which will do the following work:
- Allocate the memory for scrub_sector
- Allocate a page for scrub_sector::page
- Initialize scrub_sector::refs to 1
- Attach the allocated scrub_sector to scrub_block
The attachment is bidirectional, which means scrub_block::sectorv[]
will be updated and scrub_sector::sblock will also be updated.
- Update scrub_block::sector_count and do extra sanity check on it
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Although there are only two callers, we are going to add some members
for scrub_block in the incoming patches. Factoring out the
initialization code will make later expansion easier.
One thing to note is, even scrub_handle_errored_block() doesn't utilize
scrub_block::refs, we still use alloc_scrub_block() to initialize
sblock::ref, allowing us to use scrub_block_put() to do cleanup.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In function scrub_handle_errored_block(), we use @sblocks_for_recheck
pointer to hold one scrub_block for each mirror, and uses kcalloc() to
allocate an array.
But this one pointer for an array is not readable due to the member
offsets done by addition and not [].
Change this pointer to struct scrub_block *[BTRFS_MAX_MIRRORS], this
will slightly increase the stack memory usage.
Since function scrub_handle_errored_block() won't get iterative calls,
this extra cost would completely be acceptable.
And since we're here, also set sblock->refs and use scrub_block_put() to
clean them up, as later we will add extra members in scrub_block, which
needs scrub_block_put() to clean them up.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are several sanity checks which are no longer possible to trigger
inside btrfs_scrub_dev().
Since we have mount time check against super block nodesize/sectorsize,
and our fixed macro is hardcoded to handle even the worst combination.
Thus those sanity checks are no longer needed, can be easily removed.
But this patch still uses some ASSERT()s as a safe net just in case we
change some features in the future to trigger those impossible
combinations.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We use this during device replace for zoned devices, we were simply
taking the lock because it was in a bit field and we needed the lock to
be safe with other modifications in the bitfield. With the bit helpers
we no longer require that locking.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We use a bit field in the btrfs_block_group for different flags, however
this is awkward because we have to hold the block_group->lock for any
modification of any of these fields, and makes the code clunky for a few
of these flags. Convert these to a properly flags setup so we can
utilize the bit helpers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
The following script shows that, although scrub can detect super block
errors, it never tries to fix it:
mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2
xfs_io -c "pwrite 67108864 4k" $dev2
mount $dev1 $mnt
btrfs scrub start -B $dev2
btrfs scrub start -Br $dev2
umount $mnt
The first scrub reports the super error correctly:
scrub done for f3289218-abd3-41ac-a630-202f766c0859
Scrub started: Tue Aug 2 14:44:11 2022
Status: finished
Duration: 0:00:00
Total to scrub: 1.26GiB
Rate: 0.00B/s
Error summary: super=1
Corrected: 0
Uncorrectable: 0
Unverified: 0
But the second read-only scrub still reports the same super error:
Scrub started: Tue Aug 2 14:44:11 2022
Status: finished
Duration: 0:00:00
Total to scrub: 1.26GiB
Rate: 0.00B/s
Error summary: super=1
Corrected: 0
Uncorrectable: 0
Unverified: 0
[CAUSE]
The comments already shows that super block can be easily fixed by
committing a transaction:
/*
* If we find an error in a super block, we just report it.
* They will get written with the next transaction commit
* anyway
*/
But the truth is, such assumption is not always true, and since scrub
should try to repair every error it found (except for read-only scrub),
we should really actively commit a transaction to fix this.
[FIX]
Just commit a transaction if we found any super block errors, after
everything else is done.
We cannot do this just after scrub_supers(), as
btrfs_commit_transaction() will try to pause and wait for the running
scrub, thus we can not call it with scrub_lock hold.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[PROBLEM]
Unlike data/metadata corruption, if scrub detected some error in the
super block, the only error message is from the updated device status:
BTRFS info (device dm-1): scrub: started on devid 2
BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
This is not helpful at all.
[CAUSE]
Unlike data/metadata error reporting, there is no visible report in
kernel dmesg to report supper block errors.
In fact, return value of scrub_checksum_super() is intentionally
skipped, thus scrub_handle_errored_block() will never be called for
super blocks.
[FIX]
Make super block errors to output an error message, now the full
dmesg would looks like this:
BTRFS info (device dm-1): scrub: started on devid 2
BTRFS warning (device dm-1): super block error on device /dev/mapper/test-scratch2, physical 67108864
BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
BTRFS info (device dm-1): scrub: started on devid 2
This fix involves:
- Move the super_errors reporting to scrub_handle_errored_block()
This allows the device status message to show after the super block
error message.
But now we no longer distinguish super block corruption and generation
mismatch, now all counted as corruption.
- Properly check the return value from scrub_checksum_super()
- Add extra super block error reporting for scrub_print_warning().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it. This matches what
the block layer submission does and avoids any confusion on who
needs to handle errors.
Also use the proper bool type for the generic_io argument.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The raid56 code assumes a fixed stripe length BTRFS_STRIPE_LEN but there
are functions passing it as arguments, this is not necessary. The fixed
value has been used for a long time and though the stripe length should
be configurable by super block member stripesize, this hasn't been
implemented and would require more changes so we don't need to keep this
code around until then.
Partially based on a patch from Qu Wenruo.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For scrub_stripe() we can easily calculate the dev extent length as we
have the full info of the chunk.
Thus there is no need to pass @dev_extent_len from the caller, and we
introduce a helper, btrfs_calc_stripe_length(), to do the calculation
from extent_map structure.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Previously we use "unsigned long *" for those two bitmaps.
But since we only support fixed stripe length (64KiB, already checked in
tree-checker), "unsigned long *" is really a waste of memory, while we
can just use "unsigned long".
This saves us 8 bytes in total for scrub_parity.
To be extra safe, add an ASSERT() making sure calclulated @nsectors is
always smaller than BITS_PER_LONG.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[SUSPICIOUS CODE]
When refactoring scrub code, I noticed a very strange behavior around
scrub_remap_extent():
if (sctx->is_dev_replace)
scrub_remap_extent(fs_info, cur_logical, scrub_len,
&cur_physical, &target_dev, &cur_mirror);
As replace target is a 1:1 copy of the source device, thus physical
offset inside the target should be the same as physical inside source,
thus this remap call makes no sense to me.
[REAL FUNCTIONALITY]
After more investigation, the function name scrub_remap_extent()
doesn't tell anything of the truth, nor does its if () condition.
The real story behind this function is that, for scrub_pages() we never
expect missing device, even for replacing missing device.
What scrub_remap_extent() is really doing is to find a live mirror, and
make later scrub_pages() to read data from the good copy, other than
from the missing device and increase error counters unnecessarily.
[IMPROVEMENT]
We have no need to bother scrub_remap_extent() in scrub_simple_mirror()
at all, we only need to call it before we call scrub_pages().
And rename the function to scrub_find_live_copy(), add extra comments on
them.
By this we can remove one parameter from scrub_extent(), and reduce the
unnecessary calls to scrub_remap_extent() for regular replace.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since we have find_first_extent_item() to iterate the extent items of a
certain range, there is no need to use the open-coded version.
Replace the final scrub call site with find_first_extent_item().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently scrub_raid56_parity() has a large double loop, handling the
following things at the same time:
- Iterate each data stripe
- Iterate each extent item in one data stripe
Refactor this by:
- Introduce a new helper to handle data stripe iteration
The new helper is scrub_raid56_data_stripe_for_parity(), which
only has one while() loop handling the extent items inside the
data stripe.
The code is still mostly the same as the old code.
- Call cond_resched() for each extent
Previously we only call cond_resched() under a complex if () check.
I see no special reason to do that, and for other scrub functions,
like scrub_simple_mirror() we're already doing the same cond_resched()
after scrubbing one extent.
- Add more comments
Please note that, this patch is only to address the double loop, there
are incoming patches to do extra cleanup.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Although RAID56 has complex repair mechanism, which involves reading the
whole full stripe, but inside one data stripe, it's in fact no different
than SINGLE/RAID1.
The point here is, for data stripe we just check the csum for each
extent we hit. Only for csum mismatch case, our repair paths divide.
So we can still reuse scrub_simple_mirror() for RAID56 data stripes,
which saves quite some code.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since we have moved all other profiles handling into their own
functions, now the main body of scrub_stripe() is just handling RAID56
profiles.
There is no need to address other profiles in the main loop of
scrub_stripe(), so we can remove those dead branches.
Since we're here, also slightly change the timing of initialization of
variables like @offset, @increment and @logical.
Especially for @logical, we don't really need to initialize it for
btrfs_extent_root()/btrfs_csum_root(), we can use bg->start for that
purpose.
Now those variables are only initialize for RAID56 branches.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The new entrance will iterate through each data stripe which belongs to
the target device.
And since inside each data stripe, RAID0 is just SINGLE, while RAID10 is
just RAID1, we can reuse scrub_simple_mirror() to do the scrub properly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The new helper, scrub_simple_mirror(), will scrub all extents inside a
range which only has simple mirror based duplication.
This covers every range of SINGLE/DUP/RAID1/RAID1C*, and inside each
data stripe for RAID0/RAID10.
Currently we will use this function to scrub SINGLE/DUP/RAID1/RAID1C*
profiles. As one can see, the new entrance for those simple-mirror
based profiles can be small enough (with comments, just reach 100
lines).
This function will be the basis for the incoming scrub refactor.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The new helper, find_first_extent_item(), will locate an extent item
(either EXTENT_ITEM or METADATA_ITEM) which covers any byte of the
search range.
This helper will later be used to refactor scrub code.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The variable @physical_end is the exclusive stripe end, currently it's
calculated using @physical + @dev_extent_len / map->stripe_len *
map->stripe_len.
And since at allocation time we ensured dev_extent_len is stripe_len
aligned, the result is the same as @physical + @dev_extent_len.
So this patch will just assign @physical and @physical_end early,
without using @nstripes.
This is especially helpful for any possible out: label user, as now we
only need to initialize @offset before going to out: label.
Since we're here, also make @physical_end constant.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
All three scrub workqueues don't need ordered execution or thread
disabling threshold (as the thresh parameter is less than DFT_THRESHOLD).
Just switch to the normal workqueues that use a lot less resources,
especially in the work_struct vs btrfs_work structures.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This requires one extra parameter @pgoff for the function.
In the current code base, scrub is still one page per sector, thus the
new parameter will always be 0.
It needs the extra subpage scrub optimization code to fully take
advantage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
All the scrub bios go straight to the block device or the raid56 code,
none of which looks at the btrfs_bio.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The I/O in repair_io_failue is synchronous and doesn't need a btrfs_bio,
so just use an on-stack bio.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The I/O in repair_io_failue is synchronous and doesn't need a btrfs_bio,
so just use an on-stack bio.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|