summaryrefslogtreecommitdiff
path: root/fs/btrfs/compression.c
AgeCommit message (Collapse)AuthorFilesLines
2023-02-15btrfs: split zone append bios in btrfs_submit_bioChristoph Hellwig1-90/+22
The current btrfs zoned device support is a little cumbersome in the data I/O path as it requires the callers to not issue I/O larger than the supported ZONE_APPEND size of the underlying device. This leads to a lot of extra accounting. Instead change btrfs_submit_bio so that it can take write bios of arbitrary size and form from the upper layers, and just split them internally to the ZONE_APPEND queue limits. Then remove all the upper layer warts catering to limited write sized on zoned devices, including the extra refcount in the compressed_bio. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: remove the fs_info argument to btrfs_submit_bioChristoph Hellwig1-2/+2
btrfs_submit_bio can derive it trivially from bbio->inode, so stop bothering in the callers. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: remove stripe boundary calculation for compressed I/OQu Wenruo1-97/+33
Stop looking at the stripe boundary in alloc_compressed_bio() now that that btrfs_submit_bio can split bios, open code the now trivial code from alloc_compressed_bio() in btrfs_submit_compressed_read and stop maintaining the pending_ios count for reads as there is always just a single bio now. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [hch: remove more cruft in btrfs_submit_compressed_read, use btrfs_zoned_get_device in alloc_compressed_bio] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: handle recording of zoned writes in the storage layerChristoph Hellwig1-1/+0
Move the code that splits the ordered extents and records the physical location for them to the storage layer so that the higher level consumers don't have to care about physical block numbers at all. This will also allow to eventually remove accounting for the zone append write sizes in the upper layer with a little bit more block layer work. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: handle checksum generation in the storage layerChristoph Hellwig1-9/+0
Instead of letting the callers of btrfs_submit_bio deal with checksumming the (meta)data in the bio and making decisions on when to offload the checksumming to the bio, leave that to btrfs_submit_bio. Do do so the existing btrfs_submit_bio function is split into an upper and a lower half, so that the lower half can be offloaded to a workqueue. Note that this changes the behavior for direct writes to raid56 volumes so that async checksum offloading is not skipped when more I/O is expected. This runs counter to the argument explaining why it was done, although I can't measure any affects of the change. Commits later in this series will make sure the entire direct writes is offloaded to the workqueue at once and thus make sure it is sent to the raid56 code from a single thread. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: simplify the btrfs_csum_one_bio calling conventionChristoph Hellwig1-2/+4
To prepare for further bio submission changes btrfs_csum_one_bio should be able to take all it's arguments from the btrfs_bio structure. It can always use the bbio->inode already, and once the compression code is updated to set ->file_offset that one can be used unconditionally as well instead of looking at the page mapping now that btrfs doesn't allow ordered extents to span discontiguous data ranges. The only slightly tricky bit is the one_ordered flag set by the compressed writes. Replace that one with the driver private bio flag, which gets cleared before the bio is handed off to the block layer so that we don't get in the way of driver use. Note: this leaves an argument and a flag to btrfs_wq_submit_bio unused. But that whole mechanism will be removed in its current form in the next patch. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: handle checksum validation and repair at the storage layerChristoph Hellwig1-39/+2
Currently btrfs handles checksum validation and repair in the end I/O handler for the btrfs_bio. This leads to a lot of duplicate code plus issues with varying semantics or bugs, e.g. - the until recently broken repair for compressed extents - the fact that encoded reads validate the checksums but do not kick of read repair - the inconsistent checking of the BTRFS_FS_STATE_NO_CSUMS flag This commit revamps the checksum validation and repair code to instead work below the btrfs_submit_bio interfaces. In case of a checksum failure (or a plain old I/O error), the repair is now kicked off before the upper level ->end_io handler is invoked. Progress of an in-progress repair is tracked by a small structure that is allocated using a mempool for each original bio with failed sectors, which holds a reference to the original bio. This new structure is allocated using a mempool to guarantee forward progress even under memory pressure. The mempool will be replenished when the repair completes, just as the mempools backing the bios. There is one significant behavior change here: If repair fails or is impossible to start with, the whole bio will be failed to the upper layer. This is the behavior that all I/O submitters except for buffered I/O already emulated in their end_io handler. For buffered I/O this now means that a large readahead request can fail due to a single bad sector, but as readahead errors are ignored the following readpage if the sector is actually accessed will still be able to read. This also matches the I/O failure handling in other file systems. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: pre-load data checksum for reads in btrfs_submit_bioChristoph Hellwig1-6/+0
Instead of calling btrfs_lookup_bio_sums in every caller of btrfs_submit_bio that reads data, do the call once in btrfs_submit_bio. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: save the bio iter for checksum validation in common codeChristoph Hellwig1-4/+0
All callers of btrfs_submit_bio that want to validate checksums currently have to store a copy of the iter in the btrfs_bio. Move the assignment into common code. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: simplify parameters of btrfs_lookup_bio_sumsChristoph Hellwig1-1/+1
The csums argument is always NULL now, so remove it and always allocate the csums array in the btrfs_bio. Also pass the btrfs_bio instead of inode + bio to document that this function requires a btrfs_bio and not just any bio. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: add a btrfs_inode pointer to struct btrfs_bioChristoph Hellwig1-1/+2
All btrfs_bio I/Os are associated with an inode. Add a pointer to that inode, which will allow to simplify a lot of calling conventions, and which will be needed in the I/O completion path in the future. This grow the btrfs_bio structure by a pointer, but that grows will be offset by the removal of the device pointer soon. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: fix spelling mistakes found using codespellColin Ian King1-2/+2
There quite a few spelling mistakes as found using codespell. Fix them. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: use PAGE_{ALIGN, ALIGNED, ALIGN_DOWN} macroYushan Zhou1-1/+1
The header file linux/mm.h provides PAGE_ALIGN, PAGE_ALIGNED, PAGE_ALIGN_DOWN macros. Use these macros to make code more concise. Signed-off-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: split the bio submission path into a separate fileChristoph Hellwig1-1/+1
The code used by btrfs_submit_bio only interacts with the rest of volumes.c through __btrfs_map_block (which itself is a more generic version of two exported helpers) and does not really have anything to do with volumes.c. Create a new bio.c file and a bio.h header going along with it for the btrfs_bio-based storage layer, which will grow even more going forward. Also update the file with my copyright notice given that a large part of the moved code was written or rewritten by me. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: constify input buffer parameter in compression codeDavid Sterba1-2/+2
The input buffers passed down to compression must never be changed, switch type to u8 as it's a raw byte buffer and use const. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: pass btrfs_inode to btrfs_check_data_csumDavid Sterba1-1/+1
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: pass btrfs_inode to btrfs_repair_one_sectorDavid Sterba1-1/+1
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: change how repair action is passed to btrfs_repair_one_sectorDavid Sterba1-1/+1
There's a function pointer passed to btrfs_repair_one_sector that will submit the right bio for repair. However there are only two callbacks, for buffered and for direct IO. This can be simplified to a bool-based switch and call either function, indirect calls in this case is an unnecessary abstraction. This allows to remove the submit_bio_hook_t typedef. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move super_block specific helpers into super.hJosef Bacik1-0/+1
This will make syncing fs.h to user space a little easier if we can pull the super block specific helpers out of fs.h and put them in super.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move file-item prototypes into their own headerJosef Bacik1-0/+1
Move these prototypes out of ctree.h and into file-item.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move BTRFS_FS_STATE* definitions and helpers to fs.hJosef Bacik1-0/+1
We're going to use fs.h to hold fs wide related helpers and definitions, move the FS_STATE enum and related helpers to fs.h, and then update all files that need these definitions to include fs.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: make module init/exit match their sequenceQu Wenruo1-1/+2
[BACKGROUND] In theory init_btrfs_fs() and exit_btrfs_fs() should match their sequence, thus normally they should look like this: init_btrfs_fs() | exit_btrfs_fs() ----------------------+------------------------ init_A(); | init_B(); | init_C(); | | exit_C(); | exit_B(); | exit_A(); So is for the error path of init_btrfs_fs(). But it's not the case, some exit functions don't match their init functions sequence in init_btrfs_fs(). Furthermore in init_btrfs_fs(), we need to have a new error label for each new init function we added. This is not really expandable, especially recently we may add several new functions to init_btrfs_fs(). [ENHANCEMENT] The patch will introduce the following things to enhance the situation: - struct init_sequence Just a wrapper of init and exit function pointers. The init function must use int type as return value, thus some init functions need to be updated to return 0. The exit function can be NULL, as there are some init sequence just outputting a message. - struct mod_init_seq[] array This is a const array, recording all the initialization we need to do in init_btrfs_fs(), and the order follows the old init_btrfs_fs(). - bool mod_init_result[] array This is a bool array, recording if we have initialized one entry in mod_init_seq[]. The reason to split mod_init_seq[] and mod_init_result[] is to avoid section mismatch in reference. All init function are in .init.text, but if mod_init_seq[] records the @initialized member it can no longer be const, thus will be put into .data section, and cause modpost warning. For init_btrfs_fs() we just call all init functions in their order in mod_init_seq[] array, and after each call, setting corresponding mod_init_result[] to true. For exit_btrfs_fs() and error handling path of init_btrfs_fs(), we just iterate mod_init_seq[] in reverse order, and skip all uninitialized entry. With this patch, init_btrfs_fs()/exit_btrfs_fs() will be much easier to expand and will always follow the strict order. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-08fs: fix leaked psi pressure stateJohannes Weiner1-6/+8
When psi annotations were added to to btrfs compression reads, the psi state tracking over add_ra_bio_pages and btrfs_submit_compressed_read was faulty. A pressure state, once entered, is never left. This results in incorrectly elevated pressure, which triggers OOM kills. pflags record the *previous* memstall state when we enter a new one. The code tried to initialize pflags to 1, and then optimize the leave call when we either didn't enter a memstall, or were already inside a nested stall. However, there can be multiple PageWorkingset pages in the bio, at which point it's that path itself that enters repeatedly and overwrites pflags. This causes us to miss the exit. Enter the stall only once if needed, then unwind correctly. erofs has the same problem, fix that up too. And move the memstall exit past submit_bio() to restore submit accounting originally added by b8e24a9300b0 ("block: annotate refault stalls from IO submission"). Link: https://lkml.kernel.org/r/Y2UHRqthNUwuIQGS@cmpxchg.org Fixes: 4088a47e78f9 ("btrfs: add manual PSI accounting for compressed reads") Fixes: 99486c511f68 ("erofs: add manual PSI accounting for the compressed address space") Fixes: 118f3663fbc6 ("block: remove PSI accounting from the bio layer") Link: https://lore.kernel.org/r/d20a0a85-e415-cf78-27f9-77dd7a94bc8d@leemhuis.info/ Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Thorsten Leemhuis <linux@leemhuis.info> Tested-by: Thorsten Leemhuis <linux@leemhuis.info> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Sterba <dsterba@suse.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-10Merge tag 'mm-stable-2022-10-08' of ↵Linus Torvalds1-16/+15
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ...
2022-10-07Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linuxLinus Torvalds1-2/+12
Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - handle number of queue changes in the TCP and RDMA drivers (Daniel Wagner) - allow changing the number of queues in nvmet (Daniel Wagner) - also consider host_iface when checking ip options (Daniel Wagner) - don't map pages which can't come from HIGHMEM (Fabio M. De Francesco) - avoid unnecessary flush bios in nvmet (Guixin Liu) - shrink and better pack the nvme_iod structure (Keith Busch) - add comment for unaligned "fake" nqn (Linjun Bao) - print actual source IP address through sysfs "address" attr (Martin Belanger) - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang) - handle effects after freeing the request (Keith Busch) - copy firmware_rev on each init (Keith Busch) - restrict management ioctls to admin (Keith Busch) - ensure subsystem reset is single threaded (Keith Busch) - report the actual number of tagset maps in nvme-pci (Keith Busch) - small fabrics authentication fixups (Christoph Hellwig) - add common code for tagset allocation and freeing (Christoph Hellwig) - stop using the request_queue in nvmet (Christoph Hellwig) - set min_align_mask before calculating max_hw_sectors (Rishabh Bhatnagar) - send a rediscover uevent when a persistent discovery controller reconnects (Sagi Grimberg) - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi) - MD pull request via Song: - Various raid5 fix and clean up, by Logan Gunthorpe and David Sloan. - Raid10 performance optimization, by Yu Kuai. - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu) - IO scheduler switching quisce fix (Keith) - s390/dasd block driver updates (Stefan) - support for recovery for the ublk driver (ZiyangZhang) - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph) - blk-mq and null_blk map fixes (Bart) - various bcache fixes (Coly, Jilin, Jules) - nbd signal hang fix (Shigeru) - block writeback throttling fix (Yu) - optimize the passthrough mapping handling (me) - prepare block cgroups to being gendisk based (Christoph) - get rid of an old PSI hack in the block layer, moving it to the callers instead where it belongs (Christoph) - blk-throttle fixes and cleanups (Yu) - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj, Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming, Miaohe, Bart, Coly, Gaosheng * tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ...
2022-09-26btrfs: unify the lock/unlock extent variantsJosef Bacik1-3/+3
We have two variants of lock/unlock extent, one set that takes a cached state, another that does not. This is slightly annoying, and generally speaking there are only a few places where we don't have a cached state. Simplify this by making lock_extent/unlock_extent the only variant and make it take a cached state, then convert all the callers appropriately. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: rename clean_io_failure and remove extraneous argsJosef Bacik1-3/+2
This is exported, so rename it to btrfs_clean_io_failure. Additionally we are passing in the io tree's and such from the inode, so instead of doing all that simply pass in the inode itself and get all the components we need directly inside of btrfs_clean_io_failure. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: give struct btrfs_bio a real end_io handlerChristoph Hellwig1-24/+17
Currently btrfs_bio end I/O handling is a bit of a mess. The bi_end_io handler and bi_private pointer of the embedded struct bio are both used to handle the completion of the high-level btrfs_bio and for the I/O completion for the low-level device that the embedded bio ends up being sent to. To support this bi_end_io and bi_private are saved into the btrfs_io_context structure and then restored after the bio sent to the underlying device has completed the actual I/O. Untangle this by adding an end I/O handler and private data to struct btrfs_bio for the high-level btrfs_bio based completions, and leave the actual bio bi_end_io handler and bi_private pointer entirely to the low-level device I/O. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: pass the operation to btrfs_bio_allocChristoph Hellwig1-3/+1
Pass the operation to btrfs_bio_alloc, matching what bio_alloc_bioset set does. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-20btrfs: add manual PSI accounting for compressed readsChristoph Hellwig1-2/+12
btrfs compressed reads try to always read the entire compressed chunk, even if only a subset is requested. Currently this is covered by the magic PSI accounting underneath submit_bio, but that is about to go away. Instead add manual psi_memstall_{enter,leave} annotations. Note that for readahead this really should be using readahead_expand, but the additionals reads are also done for plain ->read_folio where readahead_expand can't work, so this overall logic is left as-is for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220915094200.139713-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-11btrfs: convert end_compressed_writeback() to use filemap_get_folios()Vishal Moola (Oracle)1-16/+15
Converted function to use folios throughout. This is in preparation for the removal of find_get_pages_contig(). Now also supports large folios. Since we may receive more than nr_pages pages, nr_pages may underflow. Since nr_pages > 0 is equivalent to index <= end_index, we replaced it with this check instead. Also this function does not care about the pages being contiguous so we can just use filemap_get_folios() to be more efficient. Link: https://lkml.kernel.org/r/20220824004023.77310-4-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: David Sterba <dsterba@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterb@suse.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-03Merge tag 'for-5.20-tag' of ↵Linus Torvalds1-257/+102
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This brings some long awaited changes, the send protocol bump, otherwise lots of small improvements and fixes. The main core part is reworking bio handling, cleaning up the submission and endio and improving error handling. There are some changes outside of btrfs adding helpers or updating API, listed at the end of the changelog. Features: - sysfs: - export chunk size, in debug mode add tunable for setting its size - show zoned among features (was only in debug mode) - show commit stats (number, last/max/total duration) - send protocol updated to 2 - new commands: - ability write larger data chunks than 64K - send raw compressed extents (uses the encoded data ioctls), ie. no decompression on send side, no compression needed on receive side if supported - send 'otime' (inode creation time) among other timestamps - send file attributes (a.k.a file flags and xflags) - this is first version bump, backward compatibility on send and receive side is provided - there are still some known and wanted commands that will be implemented in the near future, another version bump will be needed, however we want to minimize that to avoid causing usability issues - print checksum type and implementation at mount time - don't print some messages at mount (mentioned as people asked about it), we want to print messages namely for new features so let's make some space for that - big metadata - this has been supported for a long time and is not a feature that's worth mentioning - skinny metadata - same reason, set by default by mkfs Performance improvements: - reduced amount of reserved metadata for delayed items - when inserted items can be batched into one leaf - when deleting batched directory index items - when deleting delayed items used for deletion - overall improved count of files/sec, decreased subvolume lock contention - metadata item access bounds checker micro-optimized, with a few percent of improved runtime for metadata-heavy operations - increase direct io limit for read to 256 sectors, improved throughput by 3x on sample workload Notable fixes: - raid56 - reduce parity writes, skip sectors of stripe when there are no data updates - restore reading from on-disk data instead of using stripe cache, this reduces chances to damage correct data due to RMW cycle - refuse to replay log with unknown incompat read-only feature bit set - zoned - fix page locking when COW fails in the middle of allocation - improved tracking of active zones, ZNS drives may limit the number and there are ENOSPC errors due to that limit and not actual lack of space - adjust maximum extent size for zone append so it does not cause late ENOSPC due to underreservation - mirror reading error messages show the mirror number - don't fallback to buffered IO for NOWAIT direct IO writes, we don't have the NOWAIT semantics for buffered io yet - send, fix sending link commands for existing file paths when there are deleted and created hardlinks for same files - repair all mirrors for profiles with more than 1 copy (raid1c34) - fix repair of compressed extents, unify where error detection and repair happen Core changes: - bio completion cleanups - don't double defer compression bios - simplify endio workqueues - add more data to btrfs_bio to avoid allocation for read requests - rework bio error handling so it's same what block layer does, the submission works and errors are consumed in endio - when asynchronous bio offload fails fall back to synchronous checksum calculation to avoid errors under writeback or memory pressure - new trace points - raid56 events - ordered extent operations - super block log_root_transid deprecated (never used) - mixed_backref and big_metadata sysfs feature files removed, they've been default for sufficiently long time, there are no known users and mixed_backref could be confused with mixed_groups Non-btrfs changes, API updates: - minor highmem API update to cover const arguments - switch all kmap/kmap_atomic to kmap_local - remove redundant flush_dcache_page() - address_space_operations::writepage callback removed - add bdev_max_segments() helper" * tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (163 commits) btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read btrfs: fix repair of compressed extents btrfs: remove the start argument to check_data_csum and export btrfs: pass a btrfs_bio to btrfs_repair_one_sector btrfs: simplify the pending I/O counting in struct compressed_bio btrfs: repair all known bad mirrors btrfs: merge btrfs_dev_stat_print_on_error with its only caller btrfs: join running log transaction when logging new name btrfs: simplify error handling in btrfs_lookup_dentry btrfs: send: always use the rbtree based inode ref management infrastructure btrfs: send: fix sending link commands for existing file paths btrfs: send: introduce recorded_ref_alloc and recorded_ref_free btrfs: zoned: wait until zone is finished when allocation didn't progress btrfs: zoned: write out partially allocated region btrfs: zoned: activate necessary block group btrfs: zoned: activate metadata block group on flush_space btrfs: zoned: disable metadata overcommit for zoned btrfs: zoned: introduce space_info->active_total_bytes btrfs: zoned: finish least available block group on data bg allocation btrfs: let can_allocate_chunk return error ...
2022-07-25btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_readChristoph Hellwig1-22/+2
This flag was used to communicate that the low-level compression code already did verify the checksum to the high-level I/O completion code. But it has been unused for a long time as the upper btrfs_bio for the decompressed data had a NULL csum pointer basically since that pointer existed and the code already checks for that a little later. Note that this does not affect the other use of the checked flag, which is only used for the COW fixup worker. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: fix repair of compressed extentsChristoph Hellwig1-113/+58
Currently the checksum of compressed extents is verified based on the compressed data and the lower btrfs_bio, but the actual repair process is driven by end_bio_extent_readpage on the upper btrfs_bio for the decompressed data. This has a bunch of issues, including not being able to properly communicate the failed mirror up in case that the I/O submission got preempted, a general loss of if an error was an I/O error or a checksum verification failure, but most importantly that this design causes btrfs_clean_io_failure to eventually write back the uncompressed good data onto the disk sectors that are supposed to contain compressed data. Fix this by moving the repair to the lower btrfs_bio. To do so, a fair amount of code has to be reshuffled: a) the lower btrfs_bio now needs a valid csum pointer. The easiest way to achieve that is to pass NULL btrfs_lookup_bio_sums and just use the btrfs_bio management of csums. For a compressed_bio that is split into multiple btrfs_bios this means additional memory allocations, but the code becomes a lot more regular. b) checksum verification now runs directly on the lower btrfs_bio instead of the compressed_bio. This actually nicely simplifies the end I/O processing. c) btrfs_repair_one_sector can't just look up the logical address for the file offset any more, as there is no corresponding relative offsets that apply to the file offset and the logic address for compressed extents. Instead require that the saved bvec_iter in the btrfs_bio is filled out for all read bios and use that, which again removes a fair amount of code. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: simplify the pending I/O counting in struct compressed_bioChristoph Hellwig1-95/+30
Instead of counting the sectors just count the bios, with an extra reference held during submission. This significantly simplifies the submission side error handling. This slightly changes completion and error handling of btrfs_submit_compressed_{read,write} because with the old code the compressed_bio could have been completed in submit_compressed_{read,write} only if there was an error during submission for one of the lower bio, whilst with the new code there is a chance for this to happen even for successful submission if the all the lower bios complete before the end of the function is reached. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: do not return errors from btrfs_map_bioChristoph Hellwig1-6/+2
Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. As this requires touching all the callers, rename the function to btrfs_submit_bio, which describes the functionality much better. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: remove btrfs_end_io_wqChristoph Hellwig1-4/+0
All reads bio that go through btrfs_map_bio need to be completed in user context. And read I/Os are the most common and timing critical in almost any file system workloads. Embed a work_struct into struct btrfs_bio and use it to complete all read bios submitted through btrfs_map, using the REQ_META flag to decide which workqueue they are placed on. This removes the need for a separate 128 byte allocation (typically rounded up to 192 bytes by slab) for all reads with a size increase of 24 bytes for struct btrfs_bio. Future patches will reorganize struct btrfs_bio to make use of this extra space for writes as well. (All sizes are based a on typical 64-bit non-debug build) Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: don't use btrfs_bio_wq_end_io for compressed writesChristoph Hellwig1-22/+22
Compressed write bio completion is the only user of btrfs_bio_wq_end_io for writes, and the use of btrfs_bio_wq_end_io is a little suboptimal here as we only real need user context for the final completion of a compressed_bio structure, and not every single bio completion. Add a work_struct to struct compressed_bio instead and use that to call finish_compressed_bio_write. This allows to remove all handling of write bios in the btrfs_bio_wq_end_io infrastructure. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: remove redundant calls to flush_dcache_pageDavid Sterba1-2/+0
Both memzero_page and memcpy_to_page already call flush_dcache_page so we can remove the calls from btrfs code. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: introduce a data checksum checking helperQu Wenruo1-9/+4
Although we have several data csum verification code, we never have a function really just to verify checksum for one sector. Function check_data_csum() do extra work for error reporting, thus it requires a lot of extra things like file offset, bio_offset etc. Function btrfs_verify_data_csum() is even worse, it will utilize page checked flag, which means it can not be utilized for direct IO pages. Here we introduce a new helper, btrfs_check_sector_csum(), which really only accept a sector in page, and expected checksum pointer. We use this function to implement check_data_csum(), and export it for incoming patch. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [hch: keep passing the csum array as an arguments, as the callers want to print it, rename per request] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-14fs/btrfs: Use the enum req_op and blk_opf_t typesBart Van Assche1-3/+3
Improve static type checking by using the enum req_op type for variables that represent a request operation and the new blk_opf_t type for variables that represent request flags. Acked-by: David Sterba <dsterba@suse.com> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-51-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-16btrfs: derive compression type from extent map during readsGoldwyn Rodrigues1-5/+5
Derive the compression type from extent map as opposed to the bio flags passed. This makes it more precise and not reliant on function parameters. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: do not return errors from btrfs_submit_compressed_readChristoph Hellwig1-6/+5
btrfs_submit_compressed_read already calls ->bi_end_io on error and the caller must ignore the return value, so remove it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: do not pass compressed_bio to submit_compressed_bio()Goldwyn Rodrigues1-3/+2
Parameter struct compressed_bio is not used by the function submit_compressed_bio(). Remove it. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: factor out allocating an array of pagesSweet Tea Dorminy1-21/+15
Several functions currently populate an array of page pointers one allocated page at a time. Factor out the common code so as to allow improvements to all of the sites at once. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-06btrfs: fix btrfs_submit_compressed_write cgroup attributionDennis Zhou1-0/+8
This restores the logic from commit 46bcff2bfc5e ("btrfs: fix compressed write bio blkcg attribution") which added cgroup attribution to btrfs writeback. It also adds back the REQ_CGROUP_PUNT flag for these ios. Fixes: 91507240482e ("btrfs: determine stripe boundary at bio allocation time in btrfs_submit_compressed_write") CC: stable@vger.kernel.org # 5.16+ Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: do not double complete bio on errors during compressed readsJosef Bacik1-6/+14
I hit some weird panics while fixing up the error handling from btrfs_lookup_bio_sums(). Turns out the compression path will complete the bio we use if we set up any of the compression bios and then return an error, and then btrfs_submit_data_bio() will also call bio_endio() on the bio. Fix this by making btrfs_submit_compressed_read() responsible for calling bio_endio() on the bio if there are any errors. Currently it was only doing it if we created the compression bios, otherwise it was depending on btrfs_submit_data_bio() to do the right thing. This creates the above problem, so fix up btrfs_submit_compressed_read() to always call bio_endio() in case of an error, and then simply return from btrfs_submit_data_bio() if we had to call btrfs_submit_compressed_read(). Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: track compressed bio errors as blk_status_tJosef Bacik1-11/+13
Right now we just have a binary "errors" flag, so any error we get on the compressed bio's gets translated to EIO. This isn't necessarily a bad thing, but if we get an ENOMEM it may be nice to know that's what happened instead of an EIO. Track our errors as a blk_status_t, and do the appropriate setting of the errors accordingly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: remove the bio argument from finish_compressed_bio_readJosef Bacik1-5/+3
This bio is usually one of the compressed bio's, and we don't actually need it in this function, so remove the argument and stop passing it around. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: check correct bio in finish_compressed_bio_readJosef Bacik1-1/+1
Commit c09abff87f90 ("btrfs: cloned bios must not be iterated by bio_for_each_segment_all") added ASSERT()'s to make sure we weren't calling bio_for_each_segment_all() on a RAID5/6 bio. However it was checking the bio that the compression code passed in, not the cb->orig_bio that we actually iterate over, so adjust this ASSERT() to check the correct bio. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>