summaryrefslogtreecommitdiff
path: root/drivers
AgeCommit message (Collapse)AuthorFilesLines
2022-09-28Merge tag 'nvme-6.1-2022-09-28' of git://git.infradead.org/nvme into ↵Jens Axboe16-433/+345
for-6.1/block Pull NVMe updates from Christoph: "nvme updates for Linux 6.1 - handle effects after freeing the request (Keith Busch) - copy firmware_rev on each init (Keith Busch) - restrict management ioctls to admin (Keith Busch) - ensure subsystem reset is single threaded (Keith Busch) - report the actual number of tagset maps in nvme-pci (Keith Busch) - small fabrics authentication fixups (Christoph Hellwig) - add common code for tagset allocation and freeing (Christoph Hellwig) - stop using the request_queue in nvmet (Christoph Hellwig) - set min_align_mask before calculating max_hw_sectors (Rishabh Bhatnagar) - send a rediscover uevent when a persistent discovery controller reconnects (Sagi Grimberg) - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi)" * tag 'nvme-6.1-2022-09-28' of git://git.infradead.org/nvme: (31 commits) nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data nvme-tcp: remove the unused queue_size member in nvme_tcp_queue nvme: add common helpers to allocate and free tagsets nvme-auth: add a MAINTAINERS entry nvmet: add helpers to set the result field for connect commands nvme: improve the NVME_CONNECT_AUTHREQ* definitions nvmet-auth: don't try to cancel a non-initialized work_struct nvmet-tcp: remove nvmet_tcp_finish_cmd ...
2022-09-28s390/dasd: use blk_mq_alloc_diskChristoph Hellwig7-88/+39
As far as I can tell there is no need for the staged setup in dasd, so allocate the tagset and the disk with the queue in dasd_gendisk_alloc. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Stefan Haberland <sth@linux.ibm.com> Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Link: https://lore.kernel.org/r/20220928143945.1687114-2-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-27nvmet: don't look at the request_queue in nvmet_bdev_set_limitsChristoph Hellwig1-6/+5
nvmet is a consumer of the block layer and should not directly look at the request_queue. Use the bdev_ helpers to retrieve the device limits instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org>
2022-09-27nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_allChristoph Hellwig1-2/+1
nvmet is a consumer of the block layer and should not directly look at the request_queue. Just use the NUMA node ID from the gendisk instead of the request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org>
2022-09-27block: replace blk_queue_nowait with bdev_nowaitChristoph Hellwig2-5/+3
Replace blk_queue_nowait with a bdev_nowait helpers that takes the block_device given that the I/O submission path should not have to look into the request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20220927075815.269694-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-27nvme: remove nvme_ctrl_init_connect_qChristoph Hellwig1-8/+0
Unused now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27nvme-loop: use the tagset alloc/free helpersChristoph Hellwig1-64/+19
Use the common helpers to allocate and free the tagsets. To make this work the generic nvme_ctrl now needs to be stored in the hctx private data instead of the nvme_loop_ctrl. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27nvme-loop: store the generic nvme_ctrl in set->driver_dataChristoph Hellwig1-5/+5
Point the private data to the generic controller structure in preparation of using the common tagset init/exit code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27nvme-loop: initialize sqsize laterChristoph Hellwig1-1/+1
Defer initializing the sqsize field from the options until it has been capped by MAXCMD. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27nvme-fc: use the tagset alloc/free helpersChristoph Hellwig1-66/+17
Use the common helpers to allocate and free the tagsets. To make this work the generic nvme_ctrl now needs to be stored in the hctx private data instead of the nvme_fc_ctrl. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <jsmart2021@gmail.com>
2022-09-27nvme-fc: store the generic nvme_ctrl in set->driver_dataChristoph Hellwig1-20/+12
Point the private data to the generic controller structure in preparation of using the common tagset init/exit code and use the chance the cleanup the init_hctx methods a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <jsmart2021@gmail.com>
2022-09-27nvme-fc: keep ctrl->sqsize in sync with opts->queue_sizeChristoph Hellwig1-9/+1
Also update the sqsize field when capping the queue size, and remove the check a queue size that is larger than sqsize given that sqsize is only initialized from opts->queue_size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <jsmart2021@gmail.com>
2022-09-27nvme-rdma: use the tagset alloc/free helpersChristoph Hellwig1-99/+34
Use the common helpers to allocate and free the tagsets. To make this work the generic nvme_ctrl now needs to be stored in the hctx private data instead of the nvme_rdma_ctrl. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27nvme-rdma: store the generic nvme_ctrl in set->driver_dataChristoph Hellwig1-6/+6
Point the private data to the generic controller structure in preparation of using the common tagset init/exit code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27nvme-tcp: use the tagset alloc/free helpersChristoph Hellwig1-85/+16
Use the common helpers to allocate and free the tagsets. To make this work the generic nvme_ctrl now needs to be stored in the hctx private data instead of the nvme_tcp_ctrl. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27nvme-tcp: store the generic nvme_ctrl in set->driver_dataChristoph Hellwig1-6/+6
Point the private data to the generic controller structure in preparation of using the common tagset init/exit code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27nvme-tcp: remove the unused queue_size member in nvme_tcp_queueChristoph Hellwig1-6/+3
->nvme_tcp_queue is not used anywhere, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27nvme: add common helpers to allocate and free tagsetsChristoph Hellwig2-0/+110
Add common helpers to allocate and tear down the admin and I/O tag sets, including the special queues allocated with them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27nvmet: add helpers to set the result field for connect commandsChristoph Hellwig1-10/+8
The code to set the result field for the admin and I/O connect commands is not only verbose and duplicated, but also violates the aliasing rules as it accesses both the u16 and u32 members in the union. Add a little helper to sort all that out. Fixes: db1312dd9548 ("nvmet: implement basic In-Band Authentication") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de>
2022-09-27nvme: improve the NVME_CONNECT_AUTHREQ* definitionsChristoph Hellwig1-4/+2
Mark them as unsigned so that we don't need extra casts, and define them relative to cdword0 instead of requiring extra shifts. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de>
2022-09-27nvmet-auth: don't try to cancel a non-initialized work_structChristoph Hellwig4-14/+13
Currently blktests nvme/002 trips up debugobjects if CONFIG_NVME_AUTH is enabled, but authentication is not on a queue. This is because nvmet_auth_sq_free cancels sq->auth_expired_work unconditionaly, while auth_expired_work is only ever initialized if authentication is enabled for a given controller. Fix this by calling most of what is nvmet_init_auth unconditionally when initializing the SQ, and just do the setting of the result field in the connect command handler. Fixes: db1312dd9548 ("nvmet: implement basic In-Band Authentication") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de>
2022-09-27nvmet-tcp: remove nvmet_tcp_finish_cmdzhenwei pi1-8/+2
There is only a single call-site of nvmet_tcp_finish_cmd(), this becomes redundant. Remove nvmet_tcp_finish_cmd() and use the original function body instead. Suggested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvmet-tcp: add bounds check on Transfer TagVarun Prakash1-2/+9
ttag is used as an index to get cmd in nvmet_tcp_handle_h2c_data_pdu(), add a bounds check to avoid out-of-bounds access. Signed-off-by: Varun Prakash <varun@chelsio.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvmet-tcp: handle ICReq PDU received in NVMET_TCP_Q_LIVE stateVarun Prakash1-0/+7
As per NVMe/TCP transport specification ICReq PDU is the first PDU received by the controller and controller should receive only one ICReq PDU. If controller receives more than one ICReq PDU then this can be considered as fatal error. nvmet-tcp driver does not check for ICReq PDU opcode if queue state is NVMET_TCP_Q_LIVE. In LIVE state ICReq PDU is treated as CapsuleCmd PDU, this can result in abnormal behavior. Add a check for ICReq PDU in nvmet_tcp_done_recv_pdu() to fix this issue. Signed-off-by: Varun Prakash <varun@chelsio.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvmet-tcp: fix NULL pointer dereference during releasezhenwei pi1-3/+16
nvmet-tcp frees CMD buffers in nvmet_tcp_uninit_data_in_cmds(), and waits the inflight IO requests in nvmet_sq_destroy(). During wait the inflight IO requests, the callback nvmet_tcp_queue_response() is called from backend after IO complete, this leads a typical Use-After-Free issue like this: BUG: kernel NULL pointer dereference, address: 0000000000000008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 107f80067 P4D 107f80067 PUD 10789e067 PMD 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 1 PID: 123 Comm: kworker/1:1H Kdump: loaded Tainted: G E 6.0.0-rc2.bm.1-amd64 #15 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Workqueue: nvmet_tcp_wq nvmet_tcp_io_work [nvmet_tcp] RIP: 0010:shash_ahash_digest+0x2b/0x110 Code: 1f 44 00 00 41 57 41 56 41 55 41 54 55 48 89 fd 53 48 89 f3 48 83 ec 08 44 8b 67 30 45 85 e4 74 1c 48 8b 57 38 b8 00 10 00 00 <44> 8b 7a 08 44 29 f8 39 42 0c 0f 46 42 0c 41 39 c4 76 43 48 8b 03 RSP: 0018:ffffc9000051bdd8 EFLAGS: 00010206 RAX: 0000000000001000 RBX: ffff888100ab5470 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff888100ab5470 RDI: ffff888100ab5420 RBP: ffff888100ab5420 R08: ffff8881024d08c8 R09: ffff888103e1b4b8 R10: 8080808080808080 R11: 0000000000000000 R12: 0000000000001000 R13: 0000000000000000 R14: ffff88813412bd4c R15: ffff8881024d0800 FS: 0000000000000000(0000) GS:ffff88883fa40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 0000000104b48000 CR4: 0000000000350ee0 Call Trace: <TASK> nvmet_tcp_io_work+0xa52/0xb52 [nvmet_tcp] ? __switch_to+0x106/0x420 process_one_work+0x1ae/0x380 ? process_one_work+0x380/0x380 worker_thread+0x30/0x360 ? process_one_work+0x380/0x380 kthread+0xe6/0x110 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 Separate nvmet_tcp_uninit_data_in_cmds() into two steps: uninit data in cmds <- new step 1 nvmet_sq_destroy(); cancel_work_sync(&queue->io_work); free CMD buffers <- new step 2 Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvme-pci: report the actual number of tagset mapsKeith Busch1-2/+4
We've been reporting 2 maps regardless of whether the module parameter asked for anything beyond the default queues. A consequence of this means that blk-mq will reinitialize the all the hardware contexts and io schedulers on every controller reset when the mapping is exactly the same as before. This unnecessary overhead is adding several milliseconds on a reset for environments that don't need it. Report the actual number of mappings in use. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvme-pci: set min_align_mask before calculating max_hw_sectorsRishabh Bhatnagar1-1/+2
If swiotlb is force enabled dma_max_mapping_size ends up calling swiotlb_max_mapping_size which takes into account the min align mask for the device. Set the min align mask for nvme driver before calling dma_max_mapping_size while calculating max hw sectors. Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvme: send a rediscover uevent when a persistent discovery controller reconnectsSagi Grimberg2-0/+11
When a discovery controller is disconnected, no AENs will arrive to notify the host about discovery log change events. In order to solve this, send a uevent notification when a persistent discovery controller reconnects. We add a new ctrl flag NVME_CTRL_STARTED_ONCE that will be set on the first start, and consecutive calls will find it set, and send the event to userspace if the controller is a discovery controller. Upon the event reception, userspace will re-read the discovery log page and will act upon changes as it sees fit. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvme: enumerate controller flagsSagi Grimberg1-2/+5
We expect to grow a few of these flags for various purposes so make them a proper enumeration. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvme: ensure subsystem reset is single threadedKeith Busch1-3/+13
The subsystem reset writes to a register, so we have to ensure the device state is capable of handling that otherwise the driver may access unmapped registers. Use the state machine to ensure the subsystem reset doesn't try to write registers on a device already undergoing this type of reset. Link: https://bugzilla.kernel.org/show_bug.cgi?id=214771 Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvme: restrict management ioctls to adminKeith Busch1-0/+6
The passthrough commands already have this restriction, but the other operations do not. Require the same capabilities for all users as all of these operations, which include resets and rescans, can be disruptive. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvme: copy firmware_rev on each initKeith Busch1-1/+2
The firmware revision can change on after a reset so copy the most recent info each time instead of just the first time, otherwise the sysfs firmware_rev entry may contain stale data. Reported-by: Jeff Lien <jeff.lien@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Chao Leng <lengchao@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvme: handle effects after freeing the requestKeith Busch4-14/+23
If a reset occurs after the scan work attempts to issue a command, the reset may quisce the admin queue, which blocks the scan work's command from dispatching. The scan work will not be able to complete while the queue is quiesced. Meanwhile, the reset work will cancel all outstanding admin tags and wait until all requests have transitioned to idle, which includes the passthrough request. But the passthrough request won't be set to idle until after the scan_work flushes, so we're deadlocked. Fix this by handling the end effects after the request has been freed. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216354 Reported-by: Jonathan Derrick <Jonathan.Derrick@solidigm.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chao Leng <lengchao@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-23ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY supportZiyangZhang1-0/+116
START_USER_RECOVERY and END_USER_RECOVERY are two new control commands to support user recovery feature. After a crash, user should send START_USER_RECOVERY, it will: (1) check if (a)current ublk_device is UBLK_S_DEV_QUIESCED which was set by quiesce_work and (b)chardev is released (2) reinit all ubqs, including: (a) put the task_struct and reset ->ubq_daemon to NULL. (b) reset all ublk_io. (3) reset ub->mm to NULL. Then, user should start a new process and send FETCH_REQ on each ubq_daemon. Finally, user should send END_USER_RECOVERY, it will: (1) wait for all new ubq_daemons getting ready. (2) update ublksrv_pid (3) unquiesce the request queue and expect incoming ublk_queue_rq() (4) convert ub's state to UBLK_S_DEV_LIVE Note: we can handle STOP_DEV between START_USER_RECOVERY and END_USER_RECOVERY. This is helpful to users who cannot start new process after sending START_USER_RECOVERY ctrl-cmd. Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220923153919.44078-7-ZiyangZhang@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23ublk_drv: support UBLK_F_USER_RECOVERY_REISSUEZiyangZhang1-4/+18
UBLK_F_USER_RECOVERY_REISSUE implies that: With a dying ubq_daemon, ublk_drv let monitor_work requeues rq issued to userspace(ublksrv) before the ubq_daemon is dying. UBLK_F_USER_RECOVERY_REISSUE is designed for backends which: (1) tolerate double-write since ublk_drv may issue the same rq twice. (2) does not let frontend users get I/O error, such as read-only FS and VM backend. Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220923153919.44078-6-ZiyangZhang@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23ublk_drv: consider recovery feature in aborting mechanismZiyangZhang1-6/+110
With USER_RECOVERY feature enabled, the monitor_work schedules quiesce_work after finding a dying ubq_daemon. The monitor_work should also abort all rqs issued to userspace before the ubq_daemon is dying. The quiesce_work's job is to: (1) quiesce request queue. (2) check if there is any INFLIGHT rq. If so, we retry until all these rqs are requeued and become IDLE. These rqs should be requeued by ublk_queue_rq(), task work, io_uring fallback wq or monitor_work. (3) complete all ioucmds by calling io_uring_cmd_done(). We are safe to do so because no ioucmd can be referenced now. (5) set ub's state to UBLK_S_DEV_QUIESCED, which means we are ready for recovery. This state is exposed to userspace by GET_DEV_INFO. The driver can always handle STOP_DEV and cleanup everything no matter ub's state is LIVE or QUIESCED. After ub's state is UBLK_S_DEV_QUIESCED, user can recover with new process. Note: we do not change the default behavior with reocvery feature disabled. monitor_work still schedules stop_work and abort inflight rqs. And finally ublk_device is released. Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220923153919.44078-5-ZiyangZhang@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23ublk_drv: requeue rqs with recovery feature enabledZiyangZhang1-5/+15
With recovery feature enabled, in ublk_queue_rq or task work (in exit_task_work or fallback wq), we requeue rqs instead of ending(aborting) them. Besides, No matter recovery feature is enabled or disabled, we schedule monitor_work immediately. Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220923153919.44078-4-ZiyangZhang@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23ublk_drv: define macros for recovery feature and check themZiyangZhang1-1/+17
Define some macros for recovery feature. UBLK_S_DEV_QUIESCED implies that ublk_device is quiesced and is ready for recovery. This state can be observed by userspace. UBLK_F_USER_RECOVERY implies that: (1) ublk_drv enables recovery feature. It won't let monitor_work to automatically abort rqs and release the device. (2) With a dying ubq_daemon, ublk_drv ends(aborts) rqs issued to userspace(ublksrv) before crash. (3) With a dying ubq_daemon, in task work and ublk_queue_rq(), ublk_drv requeues rqs. Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220923153919.44078-3-ZiyangZhang@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23ublk_drv: check 'current' instead of 'ubq_daemon'ZiyangZhang1-2/+10
This check is not atomic. So with recovery feature, ubq_daemon may be modified simultaneously by recovery task. Instead, check 'current' is safe here because 'current' never changes. Also add comment explaining this check, which is really important for understanding recovery feature. Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220923153919.44078-2-ZiyangZhang@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-22md: Fix spelling mistake in comments of r5l_logZhou nan1-1/+1
Fix spelling of dones't in comments. Signed-off-by: Zhou nan <zhounan@nfschina.com> Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5dLogan Gunthorpe1-0/+12
A complicated deadlock exists when using the journal and an elevated group_thrtead_cnt. It was found with loop devices, but its not clear whether it can be seen with real disks. The deadlock can occur simply by writing data with an fio script. When the deadlock occurs, multiple threads will hang in different ways: 1) The group threads will hang in the blk-wbt code with bios waiting to be submitted to the block layer: io_schedule+0x70/0xb0 rq_qos_wait+0x153/0x210 wbt_wait+0x115/0x1b0 io_schedule+0x70/0xb0 rq_qos_wait+0x153/0x210 wbt_wait+0x115/0x1b0 __rq_qos_throttle+0x38/0x60 blk_mq_submit_bio+0x589/0xcd0 wbt_wait+0x115/0x1b0 __rq_qos_throttle+0x38/0x60 blk_mq_submit_bio+0x589/0xcd0 __submit_bio+0xe6/0x100 submit_bio_noacct_nocheck+0x42e/0x470 submit_bio_noacct+0x4c2/0xbb0 ops_run_io+0x46b/0x1a30 handle_stripe+0xcd3/0x36b0 handle_active_stripes.constprop.0+0x6f6/0xa60 raid5_do_work+0x177/0x330 Or: io_schedule+0x70/0xb0 rq_qos_wait+0x153/0x210 wbt_wait+0x115/0x1b0 __rq_qos_throttle+0x38/0x60 blk_mq_submit_bio+0x589/0xcd0 __submit_bio+0xe6/0x100 submit_bio_noacct_nocheck+0x42e/0x470 submit_bio_noacct+0x4c2/0xbb0 flush_deferred_bios+0x136/0x170 raid5_do_work+0x262/0x330 2) The r5l_reclaim thread will hang in the same way, submitting a bio to the block layer: io_schedule+0x70/0xb0 rq_qos_wait+0x153/0x210 wbt_wait+0x115/0x1b0 __rq_qos_throttle+0x38/0x60 blk_mq_submit_bio+0x589/0xcd0 __submit_bio+0xe6/0x100 submit_bio_noacct_nocheck+0x42e/0x470 submit_bio_noacct+0x4c2/0xbb0 submit_bio+0x3f/0xf0 md_super_write+0x12f/0x1b0 md_update_sb.part.0+0x7c6/0xff0 md_update_sb+0x30/0x60 r5l_do_reclaim+0x4f9/0x5e0 r5l_reclaim_thread+0x69/0x30b However, before hanging, the MD_SB_CHANGE_PENDING flag will be set for sb_flags in r5l_write_super_and_discard_space(). This flag will never be cleared because the submit_bio() call never returns. 3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe() will do no processing on any pending stripes and re-set STRIPE_HANDLE. This will cause the raid5d thread to enter an infinite loop, constantly trying to handle the same stripes stuck in the queue. The raid5d thread has a blk_plug that holds a number of bios that are also stuck waiting seeing the thread is in a loop that never schedules. These bios have been accounted for by blk-wbt thus preventing the other threads above from continuing when they try to submit bios. --Deadlock. To fix this, add the same wait_event() that is used in raid5_do_work() to raid5d() such that if MD_SB_CHANGE_PENDING is set, the thread will schedule and wait until the flag is cleared. The schedule action will flush the plug which will allow the r5l_reclaim thread to continue, thus preventing the deadlock. However, md_check_recovery() calls can also clear MD_SB_CHANGE_PENDING from the same thread and can thus deadlock if the thread is put to sleep. So avoid waiting if md_check_recovery() is being called in the loop. It's not clear when the deadlock was introduced, but the similar wait_event() call in raid5_do_work() was added in 2017 by this commit: 16d997b78b15 ("md/raid5: simplfy delaying of writes while metadata is updated.") Link: https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@deltatee.com Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md/raid10: convert resync_lock to use seqlockYu Kuai2-30/+59
Currently, wait_barrier() will hold 'resync_lock' to read 'conf->barrier', and io can't be dispatched until 'barrier' is dropped. Since holding the 'barrier' is not common, convert 'resync_lock' to use seqlock so that holding lock can be avoided in fast path. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-and-Tested-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md/raid10: fix improper BUG_ON() in raise_barrier()Yu Kuai1-1/+1
'conf->barrier' is protected by 'conf->resync_lock', reading 'conf->barrier' without holding the lock is wrong. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md/raid10: prevent unnecessary calls to wake_up() in fast pathYu Kuai1-2/+8
Currently, wake_up() is called unconditionally in fast path such as raid10_make_request(), which will cause lock contention under high concurrency: raid10_make_request wake_up __wake_up_common_lock spin_lock_irqsave Improve performance by only call wake_up() if waitqueue is not empty in allow_barrier() and raid10_make_request(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md/raid10: don't modify 'nr_waitng' in wait_barrier() for the case nowaitYu Kuai1-2/+2
For the case nowait in wait_barrier(), there is no point to increase nr_waiting and then decrease it. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md/raid10: factor out code from wait_barrier() to stop_waiting_barrier()Yu Kuai1-22/+28
Currently the nasty condition in wait_barrier() is hard to read. This patch factors out the condition into a function. There are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md: Remove extra mddev_get() in md_seq_start()Logan Gunthorpe1-1/+0
A regression is seen where mddev devices stay permanently after they are stopped due to an elevated reference count. This was tracked down to an extra mddev_get() in md_seq_start(). It only happened rarely because most of the time the md_seq_start() is called with a zero offset. The path with an extra mddev_get() only happens when it starts with a non-zero offset. The commit noted below changed an mddev_get() to check its success but inadvertently left the original call in. Remove the extra call. Fixes: 12a6caf27324 ("md: only delete entries from all_mddevs when the disk is freed") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Guoqing Jiang <Guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md/raid5: Remove unnecessary bio_put() in raid5_read_one_chunk()David Sloan1-1/+0
When running chunk-sized reads on disks with badblocks duplicate bio free/puts are observed: ============================================================================= BUG bio-200 (Not tainted): Object already free ----------------------------------------------------------------------------- Allocated in mempool_alloc_slab+0x17/0x20 age=3 cpu=2 pid=7504 __slab_alloc.constprop.0+0x5a/0xb0 kmem_cache_alloc+0x31e/0x330 mempool_alloc_slab+0x17/0x20 mempool_alloc+0x100/0x2b0 bio_alloc_bioset+0x181/0x460 do_mpage_readpage+0x776/0xd00 mpage_readahead+0x166/0x320 blkdev_readahead+0x15/0x20 read_pages+0x13f/0x5f0 page_cache_ra_unbounded+0x18d/0x220 force_page_cache_ra+0x181/0x1c0 page_cache_sync_ra+0x65/0xb0 filemap_get_pages+0x1df/0xaf0 filemap_read+0x1e1/0x700 blkdev_read_iter+0x1e5/0x330 vfs_read+0x42a/0x570 Freed in mempool_free_slab+0x17/0x20 age=3 cpu=2 pid=7504 kmem_cache_free+0x46d/0x490 mempool_free_slab+0x17/0x20 mempool_free+0x66/0x190 bio_free+0x78/0x90 bio_put+0x100/0x1a0 raid5_make_request+0x2259/0x2450 md_handle_request+0x402/0x600 md_submit_bio+0xd9/0x120 __submit_bio+0x11f/0x1b0 submit_bio_noacct_nocheck+0x204/0x480 submit_bio_noacct+0x32e/0xc70 submit_bio+0x98/0x1a0 mpage_readahead+0x250/0x320 blkdev_readahead+0x15/0x20 read_pages+0x13f/0x5f0 page_cache_ra_unbounded+0x18d/0x220 Slab 0xffffea000481b600 objects=21 used=0 fp=0xffff8881206d8940 flags=0x17ffffc0010201(locked|slab|head|node=0|zone=2|lastcpupid=0x1fffff) CPU: 0 PID: 34525 Comm: kworker/u24:2 Not tainted 6.0.0-rc2-localyes-265166-gf11c5343fa3f #143 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Workqueue: raid5wq raid5_do_work Call Trace: <TASK> dump_stack_lvl+0x5a/0x78 dump_stack+0x10/0x16 print_trailer+0x158/0x165 object_err+0x35/0x50 free_debug_processing.cold+0xb7/0xbe __slab_free+0x1ae/0x330 kmem_cache_free+0x46d/0x490 mempool_free_slab+0x17/0x20 mempool_free+0x66/0x190 bio_free+0x78/0x90 bio_put+0x100/0x1a0 mpage_end_io+0x36/0x150 bio_endio+0x2fd/0x360 md_end_io_acct+0x7e/0x90 bio_endio+0x2fd/0x360 handle_failed_stripe+0x960/0xb80 handle_stripe+0x1348/0x3760 handle_active_stripes.constprop.0+0x72a/0xaf0 raid5_do_work+0x177/0x330 process_one_work+0x616/0xb20 worker_thread+0x2bd/0x6f0 kthread+0x179/0x1b0 ret_from_fork+0x22/0x30 </TASK> The double free is caused by an unnecessary bio_put() in the if(is_badblock(...)) error path in raid5_read_one_chunk(). The error path was moved ahead of bio_alloc_clone() in c82aa1b76787c ("md/raid5: move checking badblock before clone bio in raid5_read_one_chunk"). The previous code checked and freed align_bio which required a bio_put. After the move that is no longer needed as raid_bio is returned to the control of the common io path which performs its own endio resulting in a double free on bad device blocks. Fixes: c82aa1b76787c ("md/raid5: move checking badblock before clone bio in raid5_read_one_chunk") Signed-off-by: David Sloan <david.sloan@eideticom.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Guoqing Jiang <Guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md/raid5: Ensure stripe_fill happens on non-read IO with journalLogan Gunthorpe1-1/+1
When doing degrade/recover tests using the journal a kernel BUG is hit at drivers/md/raid5.c:4381 in handle_parity_checks5(): BUG_ON(!test_bit(R5_UPTODATE, &dev->flags)); This was found to occur because handle_stripe_fill() was skipped for stripes in the journal due to a condition in that function. Thus blocks were not fetched and R5_UPTODATE was not set when the code reached handle_parity_checks5(). To fix this, don't skip handle_stripe_fill() unless the stripe is for read. Fixes: 07e83364845e ("md/r5cache: shift complex rmw from read path to write path") Link: https://lore.kernel.org/linux-raid/e05c4239-41a9-d2f7-3cfa-4aa9d2cea8c1@deltatee.com/ Suggested-by: Song Liu <song@kernel.org> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md/raid5: Don't read ->active_stripes if it's not neededLogan Gunthorpe1-3/+2
The atomic_read() is not needed in many cases so only do the read after the first checks are done. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>