summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)AuthorFilesLines
2022-04-29bfq: Fix warning in bfqq_request_over_limit()Jan Kara1-3/+9
People are occasionally reporting a warning bfqq_request_over_limit() triggering reporting that BFQ's idea of cgroup hierarchy (and its depth) does not match what generic blkcg code thinks. This can actually happen when bfqq gets moved between BFQ groups while bfqq_request_over_limit() is running. Make sure the code is safe against BFQ queue being moved to a different BFQ group. Fixes: 76f1df88bbc2 ("bfq: Limit number of requests consumed by each cgroup") CC: stable@vger.kernel.org Link: https://lore.kernel.org/all/CAJCQCtTw_2C7ZSz7as5Gvq=OmnDiio=HRkQekqWpKot84sQhFA@mail.gmail.com/ Reported-by: Chris Murphy <lists@colorremedies.com> Reported-by: "yukuai (C)" <yukuai3@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220407140738.9723-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-27Revert "block: inherit request start time from bio for BLK_CGROUP"Tejun Heo1-8/+1
This reverts commit 0006707723233cb2a9a23ca19fc3d0864835704c. It has a couple problems: * bio_issue_time() is stored in bio->bi_issue truncated to 51 bits. This overflows in slightly over 26 days. Setting rq->io_start_time_ns with it means that io duration calculation would yield >26days after 26 days of uptime. This, for example, confuses kyber making it cause high IO latencies. * rq->io_start_time_ns should record the time that the IO is issued to the device so that on-device latency can be measured. However, bio_issue_time() is set before the bio goes through the rq-qos controllers (wbt, iolatency, iocost), so when the bio gets throttled in any of the mechanisms, the measured latencies make no sense - on-device latencies end up higher than request-alloc-to-completion latencies. We'll need a smarter way to avoid calling ktime_get_ns() repeatedly back-to-back. For now, let's revert the commit. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v5.16+ Link: https://lore.kernel.org/r/YmmeOLfo5lzc+8yI@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-27iocost: don't reset the inuse weight of under-weighted debtorsTejun Heo1-1/+11
When an iocg is in debt, its inuse weight is owned by debt handling and should stay at 1. This invariant was broken when determining the amount of surpluses at the beginning of donation calculation - when an iocg's hierarchical weight is too low, the iocg is excluded from donation calculation and its inuse is reset to its active regardless of its indebtedness, triggering warnings like the following: WARNING: CPU: 5 PID: 0 at block/blk-iocost.c:1416 iocg_kick_waitq+0x392/0x3a0 ... RIP: 0010:iocg_kick_waitq+0x392/0x3a0 Code: 00 00 be ff ff ff ff 48 89 4d a8 e8 98 b2 70 00 48 8b 4d a8 85 c0 0f 85 4a fe ff ff 0f 0b e9 43 fe ff ff 0f 0b e9 4d fe ff ff <0f> 0b e9 50 fe ff ff e8 a2 ae 70 00 66 90 0f 1f 44 00 00 55 48 89 RSP: 0018:ffffc90000200d08 EFLAGS: 00010016 ... <IRQ> ioc_timer_fn+0x2e0/0x1470 call_timer_fn+0xa1/0x2c0 ... As this happens only when an iocg's hierarchical weight is negligible, its impact likely is limited to triggering the warnings. Fix it by skipping resetting inuse of under-weighted debtors. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Rik van Riel <riel@surriel.com> Fixes: c421a3eb2e27 ("blk-iocost: revamp debt handling") Cc: stable@vger.kernel.org # v5.10+ Link: https://lore.kernel.org/r/YmjODd4aif9BzFuO@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-15block: don't print I/O error warning for dead disksChristoph Hellwig1-1/+2
When a disk has been marked dead, don't print warnings for I/O errors as they are very much expected. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220323163815.1526998-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-15block/compat_ioctl: fix range check in BLKGETSIZEKhazhismel Kumykov1-1/+1
kernel ulong and compat_ulong_t may not be same width. Use type directly to eliminate mismatches. This would result in truncation rather than EFBIG for 32bit mode for large disks. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220414224056.2875681-1-khazhy@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-14block: fix offset/size check in bio_trim()Ming Lei1-1/+1
Unit of bio->bi_iter.bi_size is bytes, but unit of offset/size is sector. Fix the above issue in checking offset/size in bio_trim(). Fixes: e83502ca5f1e ("block: fix argument type of bio_trim()") Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220414084443.1736850-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-01Merge tag 'for-5.18/block-2022-04-01' of git://git.kernel.dk/linux-blockLinus Torvalds5-25/+40
Pull block fixes from Jens Axboe: "Either fixes or a few additions that got missed in the initial merge window pull. In detail: - List iterator fix to avoid leaking value post loop (Jakob) - One-off fix in minor count (Christophe) - Fix for a regression in how io priority setting works for an exiting task (Jiri) - Fix a regression in this merge window with blkg_free() being called in an inappropriate context (Ming) - Misc fixes (Ming, Tom)" * tag 'for-5.18/block-2022-04-01' of git://git.kernel.dk/linux-block: blk-wbt: remove wbt_track stub block: use dedicated list iterator variable block: Fix the maximum minor value is blk_alloc_ext_minor() block: restore the old set_task_ioprio() behaviour wrt PF_EXITING block: avoid calling blkg_free() in atomic context lib/sbitmap: allocate sb->map via kvzalloc_node
2022-03-31blk-wbt: remove wbt_track stubTom Rix1-3/+0
cppcheck returns this warning [block/blk-wbt.h:104] -> [block/blk-wbt.c:592]: (warning) Function 'wbt_track' argument order different: declaration 'rq, flags, ' definition 'rqos, rq, bio' In commit c1c80384c8f4 ("block: remove external dependency on wbt_flags") wbt_track was removed for the real declaration, its stub should have been as well. Signed-off-by: Tom Rix <trix@redhat.com> Link: https://lore.kernel.org/r/20220331185458.3427454-1-trix@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-31block: use dedicated list iterator variableJakob Koschel1-9/+16
To move the list iterator variable into the list_for_each_entry_*() macro in the future it should be avoided to use the list iterator variable after the loop body. To *never* use the list iterator variable after the loop it was concluded to use a separate iterator variable instead of a found boolean [1]. Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1] Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com> Link: https://lore.kernel.org/r/20220331091218.641532-1-jakobkoschel@gmail.com [axboe: move lookup to where return value is checked] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-28Merge tag 'ptrace-cleanups-for-v5.18' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ptrace cleanups from Eric Biederman: "This set of changes removes tracehook.h, moves modification of all of the ptrace fields inside of siglock to remove races, adds a missing permission check to ptrace.c The removal of tracehook.h is quite significant as it has been a major source of confusion in recent years. Much of that confusion was around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled making the semantics clearer). For people who don't know tracehook.h is a vestiage of an attempt to implement uprobes like functionality that was never fully merged, and was later superseeded by uprobes when uprobes was merged. For many years now we have been removing what tracehook functionaly a little bit at a time. To the point where anything left in tracehook.h was some weird strange thing that was difficult to understand" * tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ptrace: Remove duplicated include in ptrace.c ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE ptrace: Return the signal to continue with from ptrace_stop ptrace: Move setting/clearing ptrace_message into ptrace_stop tracehook: Remove tracehook.h resume_user_mode: Move to resume_user_mode.h resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume signal: Move set_notify_signal and clear_notify_signal into sched/signal.h task_work: Decouple TIF_NOTIFY_SIGNAL and task_work task_work: Call tracehook_notify_signal from get_signal on all architectures task_work: Introduce task_work_pending task_work: Remove unnecessary include from posix_timers.h ptrace: Remove tracehook_signal_handler ptrace: Remove arch_syscall_{enter,exit}_tracehook ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h ptrace/arm: Rename tracehook_report_syscall report_syscall ptrace: Move ptrace_report_syscall into ptrace.h
2022-03-28block: Fix the maximum minor value is blk_alloc_ext_minor()Christophe JAILLET1-1/+1
ida_alloc_range(..., min, max, ...) returns values from min to max, inclusive. So, NR_EXT_DEVT is a valid idx returned by blk_alloc_ext_minor(). This is an issue because in device_add_disk(), this value is used in: ddev->devt = MKDEV(disk->major, disk->first_minor); and NR_EXT_DEVT is '(1 << MINORBITS)'. So, should 'disk->first_minor' be NR_EXT_DEVT, it would overflow. Fixes: 22ae8ce8b892 ("block: simplify bdev/disk lookup in blkdev_get") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/cc17199798312406b90834e433d2cefe8266823d.1648306232.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-28block: restore the old set_task_ioprio() behaviour wrt PF_EXITINGJiri Slaby1-2/+1
PF_EXITING tasks were silently ignored before the below commits. Continue doing so. Otherwise python-psutil tests fail: ERROR: psutil.tests.test_process.TestProcess.test_zombie_process ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/abuild/rpmbuild/BUILD/psutil-5.9.0/build/lib.linux-x86_64-3.9/psutil/_pslinux.py", line 1661, in wrapper return fun(self, *args, **kwargs) File "/home/abuild/rpmbuild/BUILD/psutil-5.9.0/build/lib.linux-x86_64-3.9/psutil/_pslinux.py", line 2133, in ionice_set return cext.proc_ioprio_set(self.pid, ioclass, value) ProcessLookupError: [Errno 3] No such process During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/abuild/rpmbuild/BUILD/psutil-5.9.0/psutil/tests/test_process.py", line 1313, in test_zombie_process succeed_or_zombie_p_exc(fun) File "/home/abuild/rpmbuild/BUILD/psutil-5.9.0/psutil/tests/test_process.py", line 1288, in succeed_or_zombie_p_exc return fun() File "/home/abuild/rpmbuild/BUILD/psutil-5.9.0/build/lib.linux-x86_64-3.9/psutil/__init__.py", line 792, in ionice return self._proc.ionice_set(ioclass, value) File "/home/abuild/rpmbuild/BUILD/psutil-5.9.0/build/lib.linux-x86_64-3.9/psutil/_pslinux.py", line 1665, in wrapper raise NoSuchProcess(self.pid, self._name) psutil.NoSuchProcess: process no longer exists (pid=2057) Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Fixes: 5fc11eebb4 (block: open code create_task_io_context in set_task_ioprio) Fixes: a957b61254 (block: fix error in handling dead task for ioprio setting) Signed-off-by: Jiri Slaby <jslaby@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220328085928.7899-1-jslaby@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-26Merge tag 'for-5.18/64bit-pi-2022-03-25' of git://git.kernel.dk/linux-blockLinus Torvalds3-2/+198
Pull block layer 64-bit data integrity support from Jens Axboe: "This adds support for 64-bit data integrity in the block layer and in NVMe" * tag 'for-5.18/64bit-pi-2022-03-25' of git://git.kernel.dk/linux-block: crypto: fix crc64 testmgr digest byte order nvme: add support for enhanced metadata block: add pi for extended integrity crypto: add rocksoft 64b crc guard tag framework lib: add rocksoft model crc64 linux/kernel: introduce lower_48_bits function asm-generic: introduce be48 unaligned accessors nvme: allow integrity on extended metadata formats block: support pi with extended metadata
2022-03-26Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds7-46/+0
Pull NVMe write streams removal from Jens Axboe: "This removes the write streams support in NVMe. No vendor ever really shipped working support for this, and they are not interested in supporting it. With the NVMe support gone, we have nothing in the tree that supports this. Remove passing around of the hints. The only discussion point in this patchset imho is the fact that the file specific write hint setting/getting fcntl helpers will now return -1/EINVAL like they did before we supported write hints. No known applications use these functions, I only know of one prototype that I help do for RocksDB, and that's not used. That said, with a change like this, it's always a bit controversial. Alternatively, we could just make them return 0 and pretend it worked. It's placement based hints after all" * tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block: fs: remove fs.f_write_hint fs: remove kiocb.ki_hint block: remove the per-bio/request write hint nvme: remove support or stream based temperature hint
2022-03-24Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds8-163/+2
Pull SCSI updates from James Bottomley: "This series consists of the usual driver updates (qla2xxx, pm8001, libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates and bug fixes. The high blast radius core update is the removal of write same, which affects block and several non-SCSI devices. The other big change, which is more local, is the removal of the SCSI pointer" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (281 commits) scsi: scsi_ioctl: Drop needless assignment in sg_io() scsi: bsg: Drop needless assignment in scsi_bsg_sg_io_fn() scsi: lpfc: Copyright updates for 14.2.0.0 patches scsi: lpfc: Update lpfc version to 14.2.0.0 scsi: lpfc: SLI path split: Refactor BSG paths scsi: lpfc: SLI path split: Refactor Abort paths scsi: lpfc: SLI path split: Refactor SCSI paths scsi: lpfc: SLI path split: Refactor CT paths scsi: lpfc: SLI path split: Refactor misc ELS paths scsi: lpfc: SLI path split: Refactor VMID paths scsi: lpfc: SLI path split: Refactor FDISC paths scsi: lpfc: SLI path split: Refactor LS_RJT paths scsi: lpfc: SLI path split: Refactor LS_ACC paths scsi: lpfc: SLI path split: Refactor the RSCN/SCR/RDF/EDC/FARPR paths scsi: lpfc: SLI path split: Refactor PLOGI/PRLI/ADISC/LOGO paths scsi: lpfc: SLI path split: Refactor base ELS paths and the FLOGI path scsi: lpfc: SLI path split: Introduce lpfc_prep_wqe scsi: lpfc: SLI path split: Refactor fast and slow paths to native SLI4 scsi: lpfc: SLI path split: Refactor lpfc_iocbq scsi: lpfc: Use kcalloc() ...
2022-03-24Merge tag 'for-5.18/dm-changes' of ↵Linus Torvalds2-5/+13
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Significant refactoring and fixing of how DM core does bio-based IO accounting with focus on fixing wildly inaccurate IO stats for dm-crypt (and other DM targets that defer bio submission in their own workqueues). End result is proper IO accounting, made possible by targets being updated to use the new dm_submit_bio_remap() interface. - Add hipri bio polling support (REQ_POLLED) to bio-based DM. - Reduce dm_io and dm_target_io structs so that a single dm_io (which contains dm_target_io and first clone bio) weighs in at 256 bytes. For reference the bio struct is 128 bytes. - Various other small cleanups, fixes or improvements in DM core and targets. - Update MAINTAINERS with my kernel.org email address to allow distinction between my "upstream" and "Red" Hats. * tag 'for-5.18/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (46 commits) dm: consolidate spinlocks in dm_io struct dm: reduce size of dm_io and dm_target_io structs dm: switch dm_target_io booleans over to proper flags dm: switch dm_io booleans over to proper flags dm: update email address in MAINTAINERS dm: return void from __send_empty_flush dm: factor out dm_io_complete dm cache: use dm_submit_bio_remap dm: simplify dm_sumbit_bio_remap interface dm thin: use dm_submit_bio_remap dm: add WARN_ON_ONCE to dm_submit_bio_remap dm: support bio polling block: add ->poll_bio to block_device_operations dm mpath: use DMINFO instead of printk with KERN_INFO dm: stop using bdevname dm-zoned: remove the ->name field in struct dmz_dev dm: remove unnecessary local variables in __bind dm: requeue IO if mapping table not yet available dm io: remove stale comment block for dm_io() dm thin metadata: remove unused dm_thin_remove_block and __remove ...
2022-03-22block: avoid calling blkg_free() in atomic contextMing Lei1-10/+22
blkg_free() can currently be called in atomic context, either spin lock is held, or run in rcu callback. Meantime either request queue's release handler or ->pd_free_fn can sleep. Fix the issue by scheduling a work function for freeing blkcg_gq the instance. [ 148.553894] BUG: sleeping function called from invalid context at block/blk-sysfs.c:767 [ 148.557381] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 0, name: swapper/13 [ 148.560741] preempt_count: 101, expected: 0 [ 148.562577] RCU nest depth: 0, expected: 0 [ 148.564379] 1 lock held by swapper/13/0: [ 148.566127] #0: ffffffff82615f80 (rcu_callback){....}-{0:0}, at: rcu_lock_acquire+0x0/0x1b [ 148.569640] Preemption disabled at: [ 148.569642] [<ffffffff8123f9c3>] ___slab_alloc+0x554/0x661 [ 148.573559] CPU: 13 PID: 0 Comm: swapper/13 Kdump: loaded Not tainted 5.17.0_up+ #110 [ 148.576834] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1.fc33 04/01/2014 [ 148.579768] Call Trace: [ 148.580567] <IRQ> [ 148.581262] dump_stack_lvl+0x56/0x7c [ 148.582367] ? ___slab_alloc+0x554/0x661 [ 148.583526] __might_resched+0x1af/0x1c8 [ 148.584678] blk_release_queue+0x24/0x109 [ 148.585861] kobject_cleanup+0xc9/0xfe [ 148.586979] blkg_free+0x46/0x63 [ 148.587962] rcu_do_batch+0x1c5/0x3db [ 148.589057] rcu_core+0x14a/0x184 [ 148.590065] __do_softirq+0x14d/0x2c7 [ 148.591167] __irq_exit_rcu+0x7a/0xd4 [ 148.592264] sysvec_apic_timer_interrupt+0x82/0xa5 [ 148.593649] </IRQ> [ 148.594354] <TASK> [ 148.595058] asm_sysvec_apic_timer_interrupt+0x12/0x20 Cc: Tejun Heo <tj@kernel.org> Fixes: 0a9a25ca7843 ("block: let blkcg_gq grab request queue's refcnt") Reported-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-block/20220322093322.GA27283@lst.de/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220323011308.2010380-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-22Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds1-1/+2
Pull filesystem folio updates from Matthew Wilcox: "Primarily this series converts some of the address_space operations to take a folio instead of a page. Notably: - a_ops->is_partially_uptodate() takes a folio instead of a page and changes the type of the 'from' and 'count' arguments to make it obvious they're bytes. - a_ops->invalidatepage() becomes ->invalidate_folio() and has a similar type change. - a_ops->launder_page() becomes ->launder_folio() - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the address_space as an argument. There are a couple of other misc changes up front that weren't worth separating into their own pull request" * tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits) fs: Remove aops ->set_page_dirty fb_defio: Use noop_dirty_folio() fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio fs: Convert __set_page_dirty_buffers to block_dirty_folio nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio() mm: Convert swap_set_page_dirty() to swap_dirty_folio() ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio() btrfs: Convert extent_range_redirty_for_io() to use folios fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio btrfs: Convert from set_page_dirty to dirty_folio fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio() fs: Add aops->dirty_folio fs: Remove aops->launder_page orangefs: Convert launder_page to launder_folio nfs: Convert from launder_page to launder_folio fuse: Convert from launder_page to launder_folio ...
2022-03-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds2-2/+2
Merge updates from Andrew Morton: - A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs - Most the MM patches which precede the patches in Willy's tree: kasan, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb, userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp, cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap, zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (227 commits) mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release() Docs/ABI/testing: add DAMON sysfs interface ABI document Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface selftests/damon: add a test for DAMON sysfs interface mm/damon/sysfs: support DAMOS stats mm/damon/sysfs: support DAMOS watermarks mm/damon/sysfs: support schemes prioritization mm/damon/sysfs: support DAMOS quotas mm/damon/sysfs: support DAMON-based Operation Schemes mm/damon/sysfs: support the physical address space monitoring mm/damon/sysfs: link DAMON for virtual address spaces monitoring mm/damon: implement a minimal stub for sysfs-based DAMON interface mm/damon/core: add number of each enum type values mm/damon/core: allow non-exclusive DAMON start/stop Docs/damon: update outdated term 'regions update interval' Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling Docs/vm/damon: call low level monitoring primitives the operations mm/damon: remove unnecessary CONFIG_DAMON option mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}() mm/damon/dbgfs-test: fix is_target_id() change ...
2022-03-22fs: allocate inode by using alloc_inode_sb()Muchun Song1-1/+1
The inode allocation is supposed to use alloc_inode_sb(), so convert kmem_cache_alloc() of all filesystems to alloc_inode_sb(). Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4] Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22block/bfq-iosched.c: use "false" rather than "BLK_RW_ASYNC"NeilBrown1-1/+1
bfq_get_queue() expects a "bool" for the third arg, so pass "false" rather than "BLK_RW_ASYNC" which will soon be removed. Link: https://lkml.kernel.org/r/164549983746.9187.7949730109246767909.stgit@noble.brown Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Layton <jlayton@kernel.org> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-21Merge tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds47-659/+1430
Pull block updates from Jens Axboe: - BFQ cleanups and fixes (Yu, Zhang, Yahu, Paolo) - blk-rq-qos completion fix (Tejun) - blk-cgroup merge fix (Tejun) - Add offline error return value to distinguish it from an IO error on the device (Song) - IO stats fixes (Zhang, Christoph) - blkcg refcount fixes (Ming, Yu) - Fix for indefinite dispatch loop softlockup (Shin'ichiro) - blk-mq hardware queue management improvements (Ming) - sbitmap dead code removal (Ming, John) - Plugging merge improvements (me) - Show blk-crypto capabilities in sysfs (Eric) - Multiple delayed queue run improvement (David) - Block throttling fixes (Ming) - Start deprecating auto module loading based on dev_t (Christoph) - bio allocation improvements (Christoph, Chaitanya) - Get rid of bio_devname (Christoph) - bio clone improvements (Christoph) - Block plugging improvements (Christoph) - Get rid of genhd.h header (Christoph) - Ensure drivers use appropriate flush helpers (Christoph) - Refcounting improvements (Christoph) - Queue initialization and teardown improvements (Ming, Christoph) - Misc fixes/improvements (Barry, Chaitanya, Colin, Dan, Jiapeng, Lukas, Nian, Yang, Eric, Chengming) * tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block: (127 commits) block: cancel all throttled bios in del_gendisk() block: let blkcg_gq grab request queue's refcnt block: avoid use-after-free on throttle data block: limit request dispatch loop duration block/bfq-iosched: Fix spelling mistake "tenative" -> "tentative" sr: simplify the local variable initialization in sr_block_open() block: don't merge across cgroup boundaries if blkcg is enabled block: fix rq-qos breakage from skipping rq_qos_done_bio() block: flush plug based on hardware and software queue order block: ensure plug merging checks the correct queue at least once block: move rq_qos_exit() into disk_release() block: do more work in elevator_exit block: move blk_exit_queue into disk_release block: move q_usage_counter release into blk_queue_release block: don't remove hctx debugfs dir from blk_mq_exit_queue block: move blkcg initialization/destroy into disk allocation/release handler sr: implement ->free_disk to simplify refcounting sd: implement ->free_disk to simplify refcounting sd: delay calling free_opal_dev sd: call sd_zbc_release_disk before releasing the scsi_device reference ...
2022-03-18block: cancel all throttled bios in del_gendisk()Yu Kuai3-1/+41
Throttled bios can't be issued after del_gendisk() is done, thus it's better to cancel them immediately rather than waiting for throttle is done. For example, if user thread is throttled with low bps while it's issuing large io, and the device is deleted. The user thread will wait for a long time for io to return. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220318130144.1066064-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-18block: let blkcg_gq grab request queue's refcntMing Lei1-0/+5
In the whole lifetime of blkcg_gq instance, ->q will be referred, such as, ->pd_free_fn() is called in blkg_free, and throtl_pd_free() still may touch the request queue via &tg->service_queue.pending_timer which is handled by throtl_pending_timer_fn(), so it is reasonable to grab request queue's refcnt by blkcg_gq instance. Previously blkcg_exit_queue() is called from blk_release_queue, and it is hard to avoid the use-after-free. But recently commit 1059699f87eb ("block: move blkcg initialization/destroy into disk allocation/release handler") is merged to for-5.18/block, it becomes simple to fix the issue by simply grabbing request queue's refcnt. Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220318130144.1066064-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-18block: avoid use-after-free on throttle dataMing Lei1-1/+11
In throtl_pending_timer_fn(), request queue is retrieved from throttle data. And tg's pending timer is deleted synchronously when releasing the associated blkg, at that time, throttle data may have been freed since commit 1059699f87eb ("block: move blkcg initialization/destroy into disk allocation/release handler") moves freeing q->td to disk_release() from blk_release_queue(). So use-after-free on q->td in throtl_pending_timer_fn can be triggered. Fixes the issue by: - do nothing in case that disk is released, when there isn't any bio to dispatch - retrieve request queue from blkg instead of throttle data for non top-level pending timer. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220318130144.1066064-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-17block: limit request dispatch loop durationShin'ichiro Kawasaki1-1/+8
When IO requests are made continuously and the target block device handles requests faster than request arrival, the request dispatch loop keeps on repeating to dispatch the arriving requests very long time, more than a minute. Since the loop runs as a workqueue worker task, the very long loop duration triggers workqueue watchdog timeout and BUG [1]. To avoid the very long loop duration, break the loop periodically. When opportunity to dispatch requests still exists, check need_resched(). If need_resched() returns true, the dispatch loop already consumed its time slice, then reschedule the dispatch work and break the loop. With heavy IO load, need_resched() does not return true for 20~30 seconds. To cover such case, check time spent in the dispatch loop with jiffies. If more than 1 second is spent, reschedule the dispatch work and break the loop. [1] [ 609.691437] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 35s! [ 609.701820] Showing busy workqueues and worker pools: [ 609.707915] workqueue events: flags=0x0 [ 609.712615] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 [ 609.712626] pending: drm_fb_helper_damage_work [drm_kms_helper] [ 609.712687] workqueue events_freezable: flags=0x4 [ 609.732943] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 [ 609.732952] pending: pci_pme_list_scan [ 609.732968] workqueue events_power_efficient: flags=0x80 [ 609.751947] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 [ 609.751955] pending: neigh_managed_work [ 609.752018] workqueue kblockd: flags=0x18 [ 609.769480] pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4 [ 609.769488] in-flight: 1020:blk_mq_run_work_fn [ 609.769498] pending: blk_mq_timeout_work, blk_mq_run_work_fn [ 609.769744] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=35s workers=2 idle: 67 [ 639.899730] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 66s! [ 639.909513] Showing busy workqueues and worker pools: [ 639.915404] workqueue events: flags=0x0 [ 639.920197] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 [ 639.920215] pending: drm_fb_helper_damage_work [drm_kms_helper] [ 639.920365] workqueue kblockd: flags=0x18 [ 639.939932] pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4 [ 639.939942] in-flight: 1020:blk_mq_run_work_fn [ 639.939955] pending: blk_mq_timeout_work, blk_mq_run_work_fn [ 639.940212] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=66s workers=2 idle: 67 Fixes: 6e6fcbc27e778 ("blk-mq: support batching dispatch in case of io") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Cc: stable@vger.kernel.org # v5.10+ Link: https://lore.kernel.org/linux-block/20220310091649.zypaem5lkyfadymg@shindev/ Link: https://lore.kernel.org/r/20220318022641.133484-1-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16fs: Convert __set_page_dirty_buffers to block_dirty_folioMatthew Wilcox (Oracle)1-1/+1
Convert all callers; mostly this is just changing the aops to point at it, but a few implementations need a little more work. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-16block/bfq-iosched: Fix spelling mistake "tenative" -> "tentative"Colin Ian King1-1/+1
There is a spelling mistake in a bfq_log_bfqq message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220315221539.2959167-1-colin.i.king@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-15fs: Turn block_invalidatepage into block_invalidate_folioMatthew Wilcox (Oracle)1-0/+1
Remove special-casing of a NULL invalidatepage, since there is no more block_invalidatepage. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-14block: don't merge across cgroup boundaries if blkcg is enabledTejun Heo2-0/+28
blk-iocost and iolatency are cgroup aware rq-qos policies but they didn't disable merges across different cgroups. This obviously can lead to accounting and control errors but more importantly to priority inversions - e.g. an IO which belongs to a higher priority cgroup or IO class may end up getting throttled incorrectly because it gets merged to an IO issued from a low priority cgroup. Fix it by adding blk_cgroup_mergeable() which is called from merge paths and rejects cross-cgroup and cross-issue_as_root merges. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: d70675121546 ("block: introduce blk-iolatency io controller") Cc: stable@vger.kernel.org # v4.19+ Cc: Josef Bacik <jbacik@fb.com> Link: https://lore.kernel.org/r/Yi/eE/6zFNyWJ+qd@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-14block: fix rq-qos breakage from skipping rq_qos_done_bio()Tejun Heo3-12/+13
a647a524a467 ("block: don't call rq_qos_ops->done_bio if the bio isn't tracked") made bio_endio() skip rq_qos_done_bio() if BIO_TRACKED is not set. While this fixed a potential oops, it also broke blk-iocost by skipping the done_bio callback for merged bios. Before, whether a bio goes through rq_qos_throttle() or rq_qos_merge(), rq_qos_done_bio() would be called on the bio on completion with BIO_TRACKED distinguishing the former from the latter. rq_qos_done_bio() is not called for bios which wenth through rq_qos_merge(). This royally confuses blk-iocost as the merged bios never finish and are considered perpetually in-flight. One reliably reproducible failure mode is an intermediate cgroup geting stuck active preventing its children from being activated due to the leaf-only rule, leading to loss of control. The following is from resctl-bench protection scenario which emulates isolating a web server like workload from a memory bomb run on an iocost configuration which should yield a reasonable level of protection. # cat /sys/block/nvme2n1/device/model Samsung SSD 970 PRO 512GB # cat /sys/fs/cgroup/io.cost.model 259:0 ctrl=user model=linear rbps=834913556 rseqiops=93622 rrandiops=102913 wbps=618985353 wseqiops=72325 wrandiops=71025 # cat /sys/fs/cgroup/io.cost.qos 259:0 enable=1 ctrl=user rpct=95.00 rlat=18776 wpct=95.00 wlat=8897 min=60.00 max=100.00 # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1 ... Memory Hog Summary ================== IO Latency: R p50=242u:336u/2.5m p90=794u:1.4m/7.5m p99=2.7m:8.0m/62.5m max=8.0m:36.4m/350m W p50=221u:323u/1.5m p90=709u:1.2m/5.5m p99=1.5m:2.5m/9.5m max=6.9m:35.9m/350m Isolation and Request Latency Impact Distributions: min p01 p05 p10 p25 p50 p75 p90 p95 p99 max mean stdev isol% 15.90 15.90 15.90 40.05 57.24 59.07 60.01 74.63 74.63 90.35 90.35 58.12 15.82 lat-imp% 0 0 0 0 0 4.55 14.68 15.54 233.5 548.1 548.1 53.88 143.6 Result: isol=58.12:15.82% lat_imp=53.88%:143.6 work_csv=100.0% missing=3.96% The isolation result of 58.12% is close to what this device would show without any IO control. Fix it by introducing a new flag BIO_QOS_MERGED to mark merged bios and calling rq_qos_done_bio() on them too. For consistency and clarity, rename BIO_TRACKED to BIO_QOS_THROTTLED. The flag checks are moved into rq_qos_done_bio() so that it's next to the code paths that set the flags. With the patch applied, the above same benchmark shows: # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1 ... Memory Hog Summary ================== IO Latency: R p50=123u:84.4u/985u p90=322u:256u/2.5m p99=1.6m:1.4m/9.5m max=11.1m:36.0m/350m W p50=429u:274u/995u p90=1.7m:1.3m/4.5m p99=3.4m:2.7m/11.5m max=7.9m:5.9m/26.5m Isolation and Request Latency Impact Distributions: min p01 p05 p10 p25 p50 p75 p90 p95 p99 max mean stdev isol% 84.91 84.91 89.51 90.73 92.31 94.49 96.36 98.04 98.71 100.0 100.0 94.42 2.81 lat-imp% 0 0 0 0 0 2.81 5.73 11.11 13.92 17.53 22.61 4.10 4.68 Result: isol=94.42:2.81% lat_imp=4.10%:4.68 work_csv=58.34% missing=0% Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: a647a524a467 ("block: don't call rq_qos_ops->done_bio if the bio isn't tracked") Cc: stable@vger.kernel.org # v5.15+ Cc: Ming Lei <ming.lei@redhat.com> Cc: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/Yi7rdrzQEHjJLGKB@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-14block: release rq qos structures for queue without diskMing Lei1-0/+4
blkcg_init_queue() may add rq qos structures to request queue, previously blk_cleanup_queue() calls rq_qos_exit() to release them, but commit 8e141f9eb803 ("block: drain file system I/O on del_gendisk") moves rq_qos_exit() into del_gendisk(), so memory leak is caused because queues may not have disk, such as un-present scsi luns, nvme admin queue, ... Fixes the issue by adding rq_qos_exit() to blk_cleanup_queue() back. BTW, v5.18 won't need this patch any more since we move blkcg_init_queue()/blkcg_exit_queue() into disk allocation/release handler, and patches have been in for-5.18/block. Cc: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Fixes: 8e141f9eb803 ("block: drain file system I/O on del_gendisk") Reported-by: syzbot+b42749a851a47a0f581b@syzkaller.appspotmail.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220314043018.177141-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-11block: flush plug based on hardware and software queue orderJens Axboe1-31/+28
We used to sort the plug list if we had multiple queues before dispatching requests to the IO scheduler. This usually isn't needed, but for certain workloads that interleave requests to disks, it's a less efficient to process the plug list one-by-one if everything is interleaved. Don't sort the list, but skip through it and flush out entries that have the same target at the same time. Fixes: df87eb0fce8f ("block: get rid of plug list sorting") Reported-and-tested-by: Song Liu <song@kernel.org> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-11block: ensure plug merging checks the correct queue at least onceJens Axboe1-6/+14
Song reports that a RAID rebuild workload runs much slower recently, and it is seeing a lot less merging than it did previously. The reason is that a previous commit reduced the amount of work we do for plug merging. RAID rebuild interleaves requests between disks, so a last-entry check in plug merging always misses a merge opportunity since we always find a different disk than what we are looking for. Modify the logic such that it's still a one-hit cache, but ensure that we check enough to find the right target before giving up. Fixes: d38a9c04c0d5 ("block: only check previous entry for plug merge attempt") Reported-and-tested-by: Song Liu <song@kernel.org> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10resume_user_mode: Move to resume_user_mode.hEric W. Biederman1-1/+1
Move set_notify_resume and tracehook_notify_resume into resume_user_mode.h. While doing that rename tracehook_notify_resume to resume_user_mode_work. Update all of the places that included tracehook.h for these functions to include resume_user_mode.h instead. Update all of the callers of tracehook_notify_resume to call resume_user_mode_work. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20220309162454.123006-12-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-09block: add ->poll_bio to block_device_operationsMing Lei2-5/+13
Prepare for supporting IO polling for bio-based driver. Add ->poll_bio callback so that bio-based driver can provide their own logic for polling bio. Also fix ->submit_bio_bio typo in comment block above __submit_bio_noacct. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2022-03-08block: move rq_qos_exit() into disk_release()Ming Lei1-2/+1
Keep all teardown of file system I/O related functionality in one place. There can't be file system I/O in disk_release(), so it is safe to move rq_qos_exit() there. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220308055200.735835-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08block: do more work in elevator_exitChristoph Hellwig2-6/+3
Move the calls to ioc_clear_queue and blk_mq_sched_free_rqs into elevator_exit. Except for one call where we know we can't have io_cq structures yet these always go together, and that extra call in an error path is harmless. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220308055200.735835-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08block: move blk_exit_queue into disk_releaseMing Lei2-17/+30
There can't be file system I/O in disk_release(), so move the call to blk_exit_queue() there, preparing to have the teardown of file system I/O only functionality in one place, when the gendisk that is needed for it is torn down. We still need to freeze queue here since the request is freed after the bio is completed and passthrough request rely on scheduler tags as well. The disk can be released before or after queue is cleaned up, and we have to free the scheduler request pool before blk_cleanup_queue returns, while the static request pool has to be freed before exiting the I/O scheduler. Signed-off-by: Ming Lei <ming.lei@redhat.com> [hch: rebased, updated the commit log] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220308055200.735835-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08block: move q_usage_counter release into blk_queue_releaseMing Lei2-2/+2
After blk_cleanup_queue() returns, disk may not be released yet, so probably bio may still be submitted and ->q_usage_counter may be touched, so far this way seems safe, but not good from API's viewpoint. Move the release q_usage_counter into blk_queue_release(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220308055200.735835-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08block: don't remove hctx debugfs dir from blk_mq_exit_queueMing Lei1-1/+0
The queue's top debugfs dir is removed from blk_release_queue(), so all hctx's debugfs dirs are removed from there. Given blk_mq_exit_queue() is only called from blk_cleanup_queue(), it isn't necessary to remove hctx debugfs from blk_mq_exit_queue(). So remove it from blk_mq_exit_queue(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220308055200.735835-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08block: move blkcg initialization/destroy into disk allocation/release handlerMing Lei3-12/+9
blkcg works on FS bio level, so it is reasonable to make both blkcg and gendisk sharing same lifetime. Meantime there won't be any FS IO when releasing disk, so safe to move blkcg initialization/destroy into disk allocation/release handler Long term, we can move blkcg into gendisk completely. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220308055200.735835-10-hch@lst.de [axboe: fixup missing blk-cgroup.h include] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08blk-mq: handle already freed tags gracefully in blk_mq_free_rqsMing Lei1-0/+3
To simplify further changes allow for double calling blk_mq_free_rqs on a queue. Signed-off-by: Ming Lei <ming.lei@redhat.com> [hch: split out from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220308055200.735835-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08blk-mq: do not include passthrough requests in I/O accountingChristoph Hellwig2-4/+9
I/O accounting buckets I/O into the read/write/discard categories into which passthrough I/O does not fit at all. It also accounts to the block_device, which may not even exist for passthrough I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220308055200.735835-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08blk-mq: manage hctx map via xarrayMing Lei4-40/+28
First code becomes more clean by switching to xarray from plain array. Second use-after-free on q->queue_hw_ctx can be fixed because queue_for_each_hw_ctx() may be run when updating nr_hw_queues is in-progress. With this patch, q->hctx_table is defined as xarray, and this structure will share same lifetime with request queue, so queue_for_each_hw_ctx() can use q->hctx_table to lookup hctx reliably. Reported-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220308073219.91173-7-ming.lei@redhat.com [axboe: fix blk_mq_hw_ctx forward declaration] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08blk-mq: prepare for implementing hctx table via xarrayMing Lei5-28/+35
It is inevitable to cause use-after-free on q->queue_hw_ctx between queue_for_each_hw_ctx() and blk_mq_update_nr_hw_queues(). And converting to xarray can fix the uaf, meantime code gets cleaner. Prepare for converting q->queue_hctx_ctx into xarray, one thing is that xa_for_each() can only accept 'unsigned long' as index, so changes type of hctx index of queue_for_each_hw_ctx() into 'unsigned long'. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220308073219.91173-6-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08blk-mq: reconfigure poll after queue map is changedMing Lei1-3/+13
queue map can be changed when updating nr_hw_queues, so we need to reconfigure queue's poll capability. Add one helper for doing this job. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220308073219.91173-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08blk-mq: simplify reallocation of hw ctxs a bitMing Lei1-19/+14
blk_mq_alloc_and_init_hctx() has already taken reuse into account, so no need to do it outside, then we can simplify blk_mq_realloc_hw_ctxs(). Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220308073219.91173-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08blk-mq: figure out correct numa node for hw queueMing Lei1-6/+30
The current code always uses default queue map and hw queue index for figuring out the numa node for hw queue, this way isn't correct because blk-mq supports three queue maps, and the correct queue map should be used for the specified hw queue. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220308073219.91173-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08Revert "Revert "block, bfq: honor already-setup queue merges""Paolo Valente1-3/+13
A crash [1] happened to be triggered in conjunction with commit 2d52c58b9c9b ("block, bfq: honor already-setup queue merges"). The latter was then reverted by commit ebc69e897e17 ("Revert "block, bfq: honor already-setup queue merges""). Yet, the reverted commit was not the one introducing the bug. In fact, it actually triggered a UAF introduced by a different commit, and now fixed by commit d29bd41428cf ("block, bfq: reset last_bfqq_created on group change"). So, there is no point in keeping commit 2d52c58b9c9b ("block, bfq: honor already-setup queue merges") out. This commit restores it. [1] https://bugzilla.kernel.org/show_bug.cgi?id=214503 Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20211125181510.15004-1-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>