summaryrefslogtreecommitdiff
path: root/drivers/nvme/target/loop.c
AgeCommit message (Collapse)AuthorFilesLines
2019-11-27nvmet-loop: Avoid preallocating big SGL for dataIsrael Rukshin1-4/+4
nvme_loop_create_io_queues() preallocates a big buffer for the IO SGL based on SG_CHUNK_SIZE. Modern DMA engines are often capable of dealing with very big segments so the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB SGL allocation per command. If a controller has lots of deep queues, preallocation for the sg list can consume substantial amounts of memory. For nvmet-loop, nr_hw_queues can be 128 and each queue's depth 128. This means the resulting preallocation for the data SGL is 128*128*4K = 64MB per controller. Switch to runtime allocation for SGL for lists longer than 2 entries. This is the approach used by NVMe PCI so it should be reasonable for NVMeOF as well. Runtime SGL allocation has always been the case for the legacy I/O path so this is nothing new. Tested-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-11-25Merge tag 'for-5.5/drivers-20191121' of git://git.kernel.dk/linux-blockLinus Torvalds1-4/+3
Pull block driver updates from Jens Axboe: "Here are the main block driver updates for 5.5. Nothing major in here, mostly just fixes. This contains: - a set of bcache changes via Coly - MD changes from Song - loop unmap write-zeroes fix (Darrick) - spelling fixes (Geert) - zoned additions cleanups to null_blk/dm (Ajay) - allow null_blk online submit queue changes (Bart) - NVMe changes via Keith, nothing major here either" * tag 'for-5.5/drivers-20191121' of git://git.kernel.dk/linux-block: (56 commits) Revert "bcache: fix fifo index swapping condition in journal_pin_cmp()" drivers/md/raid5-ppl.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET drivers/md/raid5.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET bcache: don't export symbols bcache: remove the extra cflags for request.o bcache: at least try to shrink 1 node in bch_mca_scan() bcache: add idle_max_writeback_rate sysfs interface bcache: add code comments in bch_btree_leaf_dirty() bcache: fix deadlock in bcache_allocator bcache: add code comment bch_keylist_pop() and bch_keylist_pop_front() bcache: deleted code comments for dead code in bch_data_insert_keys() bcache: add more accurate error messages in read_super() bcache: fix static checker warning in bcache_device_free() bcache: fix a lost wake-up problem caused by mca_cannibalize_lock bcache: fix fifo index swapping condition in journal_pin_cmp() md/raid10: prevent access of uninitialized resync_pages offset md: avoid invalid memory access for array sb->dev_roles md/raid1: avoid soft lockup under high load null_blk: add zone open, close, and finish support dm: add zone open, close and finish support ...
2019-11-04nvmet: Open code nvmet_req_execute()Christoph Hellwig1-1/+1
Now that nvmet_req_execute does nothing, open code it. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [split patch, update changelog] Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-04nvme: move common call to nvme_cleanup_cmd to core layerMax Gurtovoy1-1/+0
nvme_cleanup_cmd should be called for each call to nvme_setup_cmd (symmetrical functions). Move the call for nvme_cleanup_cmd to the common core layer and call it during nvme_complete_rq for the good flow. For error flow, each transport will call nvme_cleanup_cmd independently. Also take care of a special case of path failure, where we call nvme_complete_rq without doing nvme_setup_cmd. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-04nvme: introduce nvme_is_aen_req functionIsrael Rukshin1-2/+2
This function improves code readability and reduces code duplication. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-15nvmet-loop: fix possible leakage during error flowMax Gurtovoy1-1/+3
During nvme_loop_queue_rq error flow, one must call nvme_cleanup_cmd since it's symmetric to nvme_setup_cmd. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-09-17Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-blockLinus Torvalds1-14/+16
Pull block updates from Jens Axboe: - Two NVMe pull requests: - ana log parse fix from Anton - nvme quirks support for Apple devices from Ben - fix missing bio completion tracing for multipath stack devices from Hannes and Mikhail - IP TOS settings for nvme rdma and tcp transports from Israel - rq_dma_dir cleanups from Israel - tracing for Get LBA Status command from Minwoo - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself - Some consolidation between the fabrics transports for handling the CAP register - reset race with ns scanning fix for fabrics (move fabrics commands to a dedicated request queue with a different lifetime from the admin request queue)." - controller reset and namespace scan races fixes - nvme discovery log change uevent support - naming improvements from Keith - multiple discovery controllers reject fix from James - some regular cleanups from various people - Series fixing (and re-fixing) null_blk debug printing and nr_devices checks (André) - A few pull requests from Song, with fixes from Andy, Guoqing, Guilherme, Neil, Nigel, and Yufen. - REQ_OP_ZONE_RESET_ALL support (Chaitanya) - Bio merge handling unification (Christoph) - Pick default elevator correctly for devices with special needs (Damien) - Block stats fixes (Hou) - Timeout and support devices nbd fixes (Mike) - Series fixing races around elevator switching and device add/remove (Ming) - sed-opal cleanups (Revanth) - Per device weight support for BFQ (Fam) - Support for blk-iocost, a new model that can properly account cost of IO workloads. (Tejun) - blk-cgroup writeback fixes (Tejun) - paride queue init fixes (zhengbin) - blk_set_runtime_active() cleanup (Stanley) - Block segment mapping optimizations (Bart) - lightnvm fixes (Hans/Minwoo/YueHaibing) - Various little fixes and cleanups * tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits) null_blk: format pr_* logs with pr_fmt null_blk: match the type of parameter nr_devices null_blk: do not fail the module load with zero devices block: also check RQF_STATS in blk_mq_need_time_stamp() block: make rq sector size accessible for block stats bfq: Fix bfq linkage error raid5: use bio_end_sector in r5_next_bio raid5: remove STRIPE_OPS_REQ_PENDING md: add feature flag MD_FEATURE_RAID0_LAYOUT md/raid0: avoid RAID0 data corruption due to layout confusion. raid5: don't set STRIPE_HANDLE to stripe which is in batch list raid5: don't increment read_errors on EILSEQ return nvmet: fix a wrong error status returned in error log page nvme: send discovery log page change events to userspace nvme: add uevent variables for controller devices nvme: enable aen regardless of the presence of I/O queues nvme-fabrics: allow discovery subsystems accept a kato nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery() nvme: Remove redundant assignment of cq vector nvme: Assign subsys instance from first ctrl ...
2019-08-29nvme: make fabrics command run on a separate request queueSagi Grimberg1-3/+13
We have a fundamental issue that fabric commands use the admin_q. The reason is, that admin-connect, register reads and writes and admin commands cannot be guaranteed ordering while we are running controller resets. For example, when we reset a controller we perform: 1. disable the controller 2. teardown the admin queue 3. re-establish the admin queue 4. enable the controller In order to perform (3), we need to unquiesce the admin queue, however we may have some admin commands that are already pending on the quiesced admin_q and will immediate execute when we unquiesce it before we execute (4). The host must not send admin commands to the controller before enabling the controller. To fix this, we have the fabric commands (admin connect and property get/set, but not I/O queue connect) use a separate fabrics_q and make sure to quiesce the admin_q before we disable the controller, and unquiesce it only after we enable the controller. This fixes the error prints from nvmet in a controller reset storm test: kernel: nvmet: got cmd 6 while CC.EN == 0 on qid = 0 Which indicate that the host is sending an admin command when the controller is not enabled. Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme: move sqsize setting to the coreSagi Grimberg1-11/+1
nvme_enable_ctrl reads the cap register right after, so no need to do that locally in the transport driver. Have sqsize setting in nvme_init_identify. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-04nvme: wait until all completed request's complete fn is calledMing Lei1-0/+2
When aborting in-flight request for recovering controller, we have to make sure that queue's complete function is called on completed request before moving on. Otherwise, for example, the warning of WARN_ON_ONCE(qp->mrs_used > 0) in ib_destroy_qp_user() may be triggered on nvme-rdma. Fix this issue by using blk_mq_tagset_wait_completed_request. Cc: Max Gurtovoy <maxg@mellanox.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-31nvmet-loop: Flush nvme_delete_wq when removing the portLogan Gunthorpe1-0/+8
After calling nvme_loop_delete_ctrl(), the controllers will not yet be deleted because nvme_delete_ctrl() only schedules work to do the delete. This means a race can occur if a port is removed but there are still active controllers trying to access that memory. To fix this, flush the nvme_delete_wq before returning from nvme_loop_remove_port() so that any controllers that might be in the process of being deleted won't access a freed port. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by : Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-06-20scsi: lib/sg_pool.c: improve APIs for allocating sg poolMing Lei1-2/+2
sg_alloc_table_chained() currently allows the caller to provide one preallocated SGL and returns if the requested number isn't bigger than size of that SGL. This is used to inline an SGL for an IO request. However, scattergather code only allows that size of the 1st preallocated SGL to be SG_CHUNK_SIZE(128). This means a substantial amount of memory (4KB) is claimed for the SGL for each IO request. If the I/O is small, it would be prudent to allocate a smaller SGL. Introduce an extra parameter to sg_alloc_table_chained() and sg_free_table_chained() for specifying size of the preallocated SGL. Both __sg_free_table() and __sg_alloc_table() assume that each SGL has the same size except for the last one. Change the code to allow both functions to accept a variable size for the 1st preallocated SGL. [mkp: attempted to clarify commit desc] Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Ewan D. Milne <emilne@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: netdev@vger.kernel.org Cc: linux-nvme@lists.infradead.org Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-04-25nvme-loop: kill timeout handlerMing Lei1-16/+0
Firstly it doesn't make sense to handle timeout for loop: 1) for admin queue, the request is always completed in code path of queuing IO. 2) for normal IO request, the timeout on these IOs have been handled by underlying queue already. Secondly nvme-loop's timeout handler is simply broken, and easy to cause issue: 1) no any sync/protection between timeout and normal completion, and now it is driver's responsibility to deal with that; 2) bad reset implementation, blk_mq_update_nr_hw_queues() is called after all NSs's queue is stopped(quiesced), and easy to trigger deadlock. So kill the timeout handler. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewd-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25nvmet: rename nvme_completion instances from rsp to cqeMax Gurtovoy1-3/+3
Use NVMe namings for improving code readability. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by : Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20nvme-loop: convert to SPDX identifiersChristoph Hellwig1-9/+1
Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2018-12-18nvme-fabrics: allow nvmf_connect_io_queue to pollSagi Grimberg1-1/+1
Preparation for polling support for fabrics. Polling support means that our completion queues are not generating any interrupts which means we need to poll for the nvmf io queue connect as well. Reviewed by Steve Wise <swise@opengridcomputing.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-08-14Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-blockLinus Torvalds1-0/+1
Pull block updates from Jens Axboe: "First pull request for this merge window, there will also be a followup request with some stragglers. This pull request contains: - Fix for a thundering heard issue in the wbt block code (Anchal Agarwal) - A few NVMe pull requests: * Improved tracepoints (Keith) * Larger inline data support for RDMA (Steve Wise) * RDMA setup/teardown fixes (Sagi) * Effects log suppor for NVMe target (Chaitanya Kulkarni) * Buffered IO suppor for NVMe target (Chaitanya Kulkarni) * TP4004 (ANA) support (Christoph) * Various NVMe fixes - Block io-latency controller support. Much needed support for properly containing block devices. (Josef) - Series improving how we handle sense information on the stack (Kees) - Lightnvm fixes and updates/improvements (Mathias/Javier et al) - Zoned device support for null_blk (Matias) - AIX partition fixes (Mauricio Faria de Oliveira) - DIF checksum code made generic (Max Gurtovoy) - Add support for discard in iostats (Michael Callahan / Tejun) - Set of updates for BFQ (Paolo) - Removal of async write support for bsg (Christoph) - Bio page dirtying and clone fixups (Christoph) - Set of bcache fix/changes (via Coly) - Series improving blk-mq queue setup/teardown speed (Ming) - Series improving merging performance on blk-mq (Ming) - Lots of other fixes and cleanups from a slew of folks" * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits) blkcg: Make blkg_root_lookup() work for queues in bypass mode bcache: fix error setting writeback_rate through sysfs interface null_blk: add lock drop/acquire annotation Blk-throttle: reduce tail io latency when iops limit is enforced block: paride: pd: mark expected switch fall-throughs block: Ensure that a request queue is dissociated from the cgroup controller block: Introduce blk_exit_queue() blkcg: Introduce blkg_root_lookup() block: Remove two superfluous #include directives blk-mq: count the hctx as active before allocating tag block: bvec_nr_vecs() returns value for wrong slab bcache: trivial - remove tailing backslash in macro BTREE_FLAG bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section bcache: set max writeback rate when I/O request is idle bcache: add code comments for bset.c bcache: fix mistaken comments in request.c bcache: fix mistaken code comments in bcache.h bcache: add a comment in super.c bcache: avoid unncessary cache prefetch bch_btree_node_get() bcache: display rate debug parameters to 0 when writeback is not running ...
2018-07-24nvme: if_ready checks to fail io to deleting controllerJames Smart1-1/+1
The revised if_ready checks skipped over the case of returning error when the controller is being deleted. Instead it was returning BUSY, which caused the ios to retry, which caused the ns delete to hang waiting for the ios to drain. Stack trace of hang looks like: kworker/u64:2 D 0 74 2 0x80000000 Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core] Call Trace: ? __schedule+0x26d/0x820 schedule+0x32/0x80 blk_mq_freeze_queue_wait+0x36/0x80 ? remove_wait_queue+0x60/0x60 blk_cleanup_queue+0x72/0x160 nvme_ns_remove+0x106/0x140 [nvme_core] nvme_remove_namespaces+0x7e/0xa0 [nvme_core] nvme_delete_ctrl_work+0x4d/0x80 [nvme_core] process_one_work+0x160/0x350 worker_thread+0x1c3/0x3d0 kthread+0xf5/0x130 ? process_one_work+0x350/0x350 ? kthread_bind+0x10/0x10 ret_from_fork+0x1f/0x30 Extend nvmf_fail_nonready_command() to supply the controller pointer so that the controller state can be looked at. Fail any io to a controller that is deleting. Fixes: 3bc32bb1186c ("nvme-fabrics: refactor queue ready check") Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready") Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Ewan D. Milne <emilne@redhat.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com>
2018-07-23nvme: cache struct nvme_ctrl reference to struct nvme_requestSagi Grimberg1-0/+1
We will need to reference the controller in the setup and completion time for tracing and future traffic based keep alive support. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-06-15nvme-fabrics: refactor queue ready checkChristoph Hellwig1-4/+3
Move the is_connected check to the fibre channel transport, as it has no meaning for other transports. To facilitate this split out a new nvmf_fail_nonready_command helper that is called by the transport when it is asked to handle a command on a queue that is not ready. Also avoid a function call for the queue live fast path by inlining the check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Smart <james.smart@broadcom.com>
2018-05-30nvme-loop: add support for multiple portsChristoph Hellwig1-18/+30
This is useful at least for multipath testing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-05-29Merge branch 'nvme-4.18-2' of git://git.infradead.org/nvme into for-4.18/blockJens Axboe1-1/+1
Pull NVMe changes from Christoph: "Here is the current batch of nvme updates for 4.18, we have a few more patches in the queue, but I'd like to get this pile into your tree and linux-next ASAP. The biggest item is support for file-backed namespaces in the NVMe target from Chaitanya, in addition to that we mostly small fixes from all the usual suspects." * 'nvme-4.18-2' of git://git.infradead.org/nvme: nvme: fixup memory leak in nvme_init_identify() nvme: fix KASAN warning when parsing host nqn nvmet-loop: use nr_phys_segments when map rq to sgl nvmet-fc: increase LS buffer count per fc port nvmet: add simple file backed ns support nvmet: remove duplicate NULL initialization for req->ns nvmet: make a few error messages more generic nvme-fabrics: allow duplicate connections to the discovery controller nvme-fabrics: centralize discovery controller defaults nvme-fabrics: remove unnecessary controller subnqn validation nvme-fc: remove setting DNR on exception conditions nvme-rdma: stop admin queue before freeing it nvme-pci: Fix AER reset handling nvme-pci: set nvmeq->cq_vector after alloc cq/sq nvme: host: core: fix precedence of ternary operator nvme: fix lockdep warning in nvme_mpath_clear_current_path
2018-05-29nvme: return BLK_EH_DONE from ->timeoutChristoph Hellwig1-1/+1
NVMe always completes the request before returning from ->timeout, either by polling for it, or by disabling the controller. Return BLK_EH_DONE so that the block layer doesn't even try to complete it again. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-25nvmet-loop: use nr_phys_segments when map rq to sglChaitanya Kulkarni1-1/+1
Use blk_rq_nr_phys_segments() instead of blk_rq_payload_bytes() to check if a command contains data to me mapped. This fixes the case where a struct requests contains LBAs, but no data will actually be send, e.g. the pending Write Zeroes support. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-03nvmet: switch loopback target state to connecting when resettingJohannes Thumshirn1-0/+6
After commit bb06ec31452f ("nvme: expand nvmf_check_if_ready checks") resetting of the loopback nvme target failed as we forgot to switch it's state to NVME_CTRL_CONNECTING before we reconnect the admin queues. Therefore the checks in nvmf_check_if_ready() choose to go to the reject_io case and thus we couldn't sent out an identify controller command to reconnect. Change the controller state to NVME_CTRL_CONNECTING after tearing down the old connection and before re-establishing the connection. Fixes: bb06ec31452f ("nvme: expand nvmf_check_if_ready checks") Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-12nvme: expand nvmf_check_if_ready checksJames Smart1-9/+2
The nvmf_check_if_ready() checks that were added are very simplistic. As such, the routine allows a lot of cases to fail ios during windows of reset or re-connection. In cases where there are not multi-path options present, the error goes back to the callee - the filesystem or application. Not good. The common routine was rewritten and calling syntax slightly expanded so that per-transport is_ready routines don't need to be present. The transports now call the routine directly. The routine is now a fabrics routine rather than an inline function. The routine now looks at controller state to decide the action to take. Some states mandate io failure. Others define the condition where a command can be accepted. When the decision is unclear, a generic queue-or-reject check is made to look for failfast or multipath ios and only fails the io if it is so marked. Otherwise, the io will be queued and wait for the controller state to resolve. Admin commands issued via ioctl share a live admin queue with commands from the transport for controller init. The ioctls could be intermixed with the initialization commands. It's possible for the ioctl cmd to be issued prior to the controller being enabled. To block this, the ioctl admin commands need to be distinguished from admin commands used for controller init. Added a USERCMD nvme_req(req)->rq_flags bit to reflect this division and set it on ioctls requests. As the nvmf_check_if_ready() routine is called prior to nvme_setup_cmd(), ensure that commands allocated by the ioctl path (actually anything in core.c) preps the nvme_req(req) before starting the io. This will preserve the USERCMD flag during execution and/or retry. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.e> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-12nvme-loop: fix kernel oops in case of unhandled commandMing Lei1-7/+2
When nvmet_req_init() fails, __nvmet_req_complete() is called to handle the target request via .queue_response(), so nvme_loop_queue_response() shouldn't be called again for handling the failure. This patch fixes this case by the following way: - move blk_mq_start_request() before nvmet_req_init(), so nvme_loop_queue_response() may work well to complete this host request - don't call nvme_cleanup_cmd() which is done in nvme_loop_complete_rq() - don't call nvme_loop_queue_response() which is done via .queue_response() Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> [trimmed changelog] Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvmet: constify struct nvmet_fabrics_opsChristoph Hellwig1-2/+2
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-22nvmet-loop: use blk_rq_payload_bytes for sgl selectionChristoph Hellwig1-2/+2
blk_rq_bytes does the wrong thing for special payloads like discards and might cause the driver to not set up a SGL. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Keith Busch <keith.busch@intel.com>
2018-01-15nvme: host delete_work and reset_work on separate workqueuesRoy Shterman1-1/+1
We need to ensure that delete_work will be hosted on a different workqueue than all the works we flush or cancel from it. Otherwise we may hit a circular dependency warning [1]. Also, given that delete_work flushes reset_work, host reset_work on nvme_reset_wq and delete_work on nvme_delete_wq. In addition, fix the flushing in the individual drivers to flush nvme_delete_wq when draining queued deletes. [1]: [ 178.491942] ============================================= [ 178.492718] [ INFO: possible recursive locking detected ] [ 178.493495] 4.9.0-rc4-c844263313a8-lb #3 Tainted: G OE [ 178.494382] --------------------------------------------- [ 178.495160] kworker/5:1/135 is trying to acquire lock: [ 178.495894] ( [ 178.496120] "nvme-wq" [ 178.496471] ){++++.+} [ 178.496599] , at: [ 178.496921] [<ffffffffa70ac206>] flush_work+0x1a6/0x2d0 [ 178.497670] but task is already holding lock: [ 178.498499] ( [ 178.498724] "nvme-wq" [ 178.499074] ){++++.+} [ 178.499202] , at: [ 178.499520] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0 [ 178.500343] other info that might help us debug this: [ 178.501269] Possible unsafe locking scenario: [ 178.502113] CPU0 [ 178.502472] ---- [ 178.502829] lock( [ 178.503115] "nvme-wq" [ 178.503467] ); [ 178.503716] lock( [ 178.504001] "nvme-wq" [ 178.504353] ); [ 178.504601] *** DEADLOCK *** [ 178.505441] May be due to missing lock nesting notation [ 178.506453] 2 locks held by kworker/5:1/135: [ 178.507068] #0: [ 178.507330] ( [ 178.507598] "nvme-wq" [ 178.507726] ){++++.+} [ 178.508079] , at: [ 178.508173] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0 [ 178.509004] #1: [ 178.509265] ( [ 178.509532] (&ctrl->delete_work) [ 178.509795] ){+.+.+.} [ 178.510145] , at: [ 178.510239] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0 [ 178.511070] stack backtrace: : [ 178.511693] CPU: 5 PID: 135 Comm: kworker/5:1 Tainted: G OE 4.9.0-rc4-c844263313a8-lb #3 [ 178.512974] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014 [ 178.514247] Workqueue: nvme-wq nvme_del_ctrl_work [nvme_tcp] [ 178.515071] ffffc2668175bae0 ffffffffa7450823 ffffffffa88abd80 ffffffffa88abd80 [ 178.516195] ffffc2668175bb98 ffffffffa70eb012 ffffffffa8d8d90d ffff9c472e9ea700 [ 178.517318] ffff9c472e9ea700 ffff9c4700000000 ffff9c4700007200 ab83be61bec0d50e [ 178.518443] Call Trace: [ 178.518807] [<ffffffffa7450823>] dump_stack+0x85/0xc2 [ 178.519542] [<ffffffffa70eb012>] __lock_acquire+0x17d2/0x18f0 [ 178.520377] [<ffffffffa75839a7>] ? serial8250_console_putchar+0x27/0x30 [ 178.521330] [<ffffffffa7583980>] ? wait_for_xmitr+0xa0/0xa0 [ 178.522174] [<ffffffffa70ac1eb>] ? flush_work+0x18b/0x2d0 [ 178.522975] [<ffffffffa70eb7cb>] lock_acquire+0x11b/0x220 [ 178.523753] [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0 [ 178.524535] [<ffffffffa70ac229>] flush_work+0x1c9/0x2d0 [ 178.525291] [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0 [ 178.526077] [<ffffffffa70a9cf0>] ? flush_workqueue_prep_pwqs+0x220/0x220 [ 178.527040] [<ffffffffa70ae7cf>] __cancel_work_timer+0x10f/0x1d0 [ 178.527907] [<ffffffffa70fecb9>] ? vprintk_default+0x29/0x40 [ 178.528726] [<ffffffffa71cb507>] ? printk+0x48/0x50 [ 178.529434] [<ffffffffa70ae8c3>] cancel_delayed_work_sync+0x13/0x20 [ 178.530381] [<ffffffffc042100b>] nvme_stop_ctrl+0x5b/0x70 [nvme_core] [ 178.531314] [<ffffffffc0403dcc>] nvme_del_ctrl_work+0x2c/0x50 [nvme_tcp] [ 178.532271] [<ffffffffa70ad741>] process_one_work+0x1e1/0x6a0 [ 178.533101] [<ffffffffa70ad6c2>] ? process_one_work+0x162/0x6a0 [ 178.533954] [<ffffffffa70adc4e>] worker_thread+0x4e/0x490 [ 178.534735] [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0 [ 178.535588] [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0 [ 178.536441] [<ffffffffa70b48cf>] kthread+0xff/0x120 [ 178.537149] [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60 [ 178.538094] [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60 [ 178.538900] [<ffffffffa78e332a>] ret_from_fork+0x2a/0x40 Signed-off-by: Roy Shterman <roys@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08nvme-fabrics: protect against module unload during create_ctrlRoy Shterman1-0/+1
NVMe transport driver module unload may (and usually does) trigger iteration over the active controllers and delete them all (sometimes under a mutex). However, a controller can be created concurrently with module unload which can lead to leakage of resources (most important char device node leakage) in case the controller creation occured after the unload delete and drain sequence. To protect against this, we take a module reference to guarantee that the nvme transport driver is not unloaded while creating a controller. Signed-off-by: Roy Shterman <roys@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-20nvme-loop: check if queue is ready in queue_rqSagi Grimberg1-1/+24
In case the queue is not LIVE (fully functional and connected at the nvmf level), we cannot allow any commands other than connect to pass through. Add a new queue state flag NVME_LOOP_Q_LIVE which is set after nvmf connect and cleared in queue teardown. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-10nvmet: better data length validationChristoph Hellwig1-1/+2
Currently the NVMe target stores the expexted data length in req->data_len and uses that for data transfer decisions, but that does not take the actual transfer length in the SGLs into account. So this adds a new transfer_len field, into which the transport drivers store the actual transfer length. We then check the two match before actually executing the command. The FC transport driver already had such a field, which is removed in favour of the common one. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10nvme: remove handling of multiple AEN requestsKeith Busch1-1/+1
The driver can handle tracking only one AEN request, so this patch removes handling for multiple ones. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10nvme: centralize AEN definesKeith Busch1-11/+3
All the transports were unnecessarilly duplicating the AEN request accounting. This patch defines everything in one place. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01nvme: consolidate common code from ->reset_workChristoph Hellwig1-4/+0
No change in behavior except that the FC code cancels two work items a little later now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01nvme: move controller deletion to common codeChristoph Hellwig1-39/+9
Move the ->delete_work and the associated helpers to common code instead of duplicating them in every driver. This also adds the missing reference get/put for the loop driver. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-10-27nvme: switch controller refcounting to use struct deviceChristoph Hellwig1-1/+1
Instead of allocating a separate struct device for the character device handle embedd it into struct nvme_ctrl and use it for the main controller refcounting. This removes double refcounting and gets us an automatic reference for the character device operations. We keep ctrl->device as a pointer for now to avoid chaning printks all over, but in the future we could look into message printing helpers that take a controller structure similar to what other subsystems do. Note the delete_ctrl operation always already has a reference (either through sysfs due this change, or because every open file on the /dev/nvme-fabrics node has a refernece) when it is entered now, so we don't need to do the unless_zero variant there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com>
2017-10-19nvme-loop: Add BLK_MQ_F_NO_SCHED flag to admin tag setIsrael Rukshin1-0/+1
This flag is useful for admin queues that aren't used for normal IO. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-08-28nvme: Add admin_tagset pointer to nvme_ctrlSagi Grimberg1-0/+1
Will be used when we centralize control flows. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-06nvme: split nvme_uninit_ctrl into stop and uninitSagi Grimberg1-13/+6
Usually before we teardown the controller we want to: 1. complete/cancel any ctrl inflight works 2. remove ctrl namespaces (only for removal though, resets shouldn't remove any namespaces). but we do not want to destroy the controller device as we might use it for logging during the teardown stage. This patch adds nvme_start_ctrl() which queues inflight controller works (aen, ns scan, queue start and keep-alive if kato is set) and nvme_stop_ctrl() which cancels the works namespace removal is left to the callers to handle. Move nvme_uninit_ctrl after we are done with the controller device. Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-07-06nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queuesSagi Grimberg1-1/+2
unlike blk_mq_stop_hw_queues and blk_mq_start_stopped_hw_queues quiescing/unquiescing respects the submission path rcu grace. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-07-04nvme-loop: update tagset nr_hw_queues after reconnecting/resettingSagi Grimberg1-0/+3
We might have more/less queues once we reconnect/reset. For example due to cpu going online/offline Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-07-02nvme: move ctrl cap to struct nvme_ctrlSagi Grimberg1-4/+3
All transports use either a private cache of controller cap or an on-stack copy, move it to the generic struct nvme_ctrl. In the future it will also be maintained by the core. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-07-02nvme: move queue_count to the nvme_ctrlSagi Grimberg1-8/+7
All all transports use the queue_count in exactly the same, so move it to the generic struct nvme_ctrl. In the future it will also be maintained by the core. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-By: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-06-28nvme: read the subsystem NQN from Identify ControllerChristoph Hellwig1-1/+0
NVMe 1.2.1 or later requires controllers to provide a subsystem NQN in the Identify controller data structures. Use this NQN for the subsysnqn sysfs attribute by storing it in the nvme_ctrl structure after verifying it. For older controllers we generate a "fake" NQN per non-normative text in the NVMe 1.3 spec. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28nvme: use a single NVME_AQ_DEPTH and relax it to 32Sagi Grimberg1-3/+1
No need to differentiate fabrics from pci/loop, also lower it to 32 as we don't really need 256 inflight admin commands. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-15nvme: move reset workqueue handling to common codeChristoph Hellwig1-21/+4
This moves the nvme_reset function from the PCIe driver to common code, renaming it to nvme_reset_ctrl in the process. Additionally a new helper nvme_reset_ctrl_sync is added for the case where we want to wait for the reset. To facilitate that the reset_work work structure is move to the common nvme_ctrl structure and the ->reset_ctrl method is removed. For now the drivers initialize the reset_work with their own callback, but longer term we should move to callouts for specific parts of the reset process and move even more code to the core. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2017-06-15nvme-loop: merge init_request methodsChristoph Hellwig1-9/+4
Now that we get the tagset passed we can have a single implementation for the I/O and admin queues. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-15nvme: Move transports to use nvme-core workqueueSagi Grimberg1-4/+4
Instead of each transport using it's own workqueue, export a single nvme-core workqueue and use that instead. In the future, this will help us moving towards some unification if controller setup/teardown flows. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>