path: root/drivers/md
Age | Commit message | Author | Files | Lines
2020-10-13  Merge tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block  (Linus Torvalds, 24 files, -620/+1032)

Pull block driver updates from Jens Axboe:
 "Here are the driver updates for 5.10. A few SCSI updates in here too, in coordination with Martin as they depend on core block changes for the shared tag bitmap.

  This contains:

   - NVMe pull requests via Christoph:
      - fix keep alive timer modification (Amit Engel)
      - order the PCI ID list more sensibly (Andy Shevchenko)
      - cleanup the open by controller helper (Chaitanya Kulkarni)
      - use an xarray for the CSE log lookup (Chaitanya Kulkarni)
      - support ZNS in nvmet passthrough mode (Chaitanya Kulkarni)
      - fix nvme_ns_report_zones (Christoph Hellwig)
      - add a sanity check to nvmet-fc (James Smart)
      - fix interrupt allocation when too many polled queues are specified (Jeffle Xu)
      - small nvmet-tcp optimization (Mark Wunderlich)
      - fix a controller refcount leak on init failure (Chaitanya Kulkarni)
      - misc cleanups (Chaitanya Kulkarni)
      - major refactoring of the scanning code (Christoph Hellwig)

   - MD updates via Song:
      - Bug fixes in bitmap code, from Zhao Heming
      - Fix a work queue check, from Guoqing Jiang
      - Fix raid5 oops with reshape, from Song Liu
      - Clean up unused code, from Jason Yan
      - Discard improvements, from Xiao Ni
      - raid5/6 page offset support, from Yufen Yu

   - Shared tag bitmap for SCSI/hisi_sas/null_blk (John, Kashyap, Hannes)

   - null_blk open/active zone limit support (Niklas)

   - Set of bcache updates (Coly, Dongsheng, Qinglang)"

* tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (78 commits)
  md/raid5: fix oops during stripe resizing
  md/bitmap: fix memory leak of temporary bitmap
  md: fix the checking of wrong work queue
  md/bitmap: md_bitmap_get_counter returns wrong blocks
  md/bitmap: md_bitmap_read_sb uses wrong bitmap blocks
  md/raid0: remove unused function is_io_in_chunk_boundary()
  nvme-core: remove extra condition for vwc
  nvme-core: remove extra variable
  nvme: remove nvme_identify_ns_list
  nvme: refactor nvme_validate_ns
  nvme: move nvme_validate_ns
  nvme: query namespace identifiers before adding the namespace
  nvme: revalidate zone bitmaps in nvme_update_ns_info
  nvme: remove nvme_update_formats
  nvme: update the known admin effects
  nvme: set the queue limits in nvme_update_ns_info
  nvme: remove the 0 lba_shift check in nvme_update_ns_info
  nvme: clean up the check for too large logic block sizes
  nvme: freeze the queue over ->lba_shift updates
  nvme: factor out a nvme_configure_metadata helper
  ...
2020-10-13  Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block  (Linus Torvalds, 13 files, -119/+106)

Pull block updates from Jens Axboe:

 - Series of merge handling cleanups (Baolin, Christoph)
 - Series of blk-throttle fixes and cleanups (Baolin)
 - Series cleaning up BDI, separating the block device from the backing_dev_info (Christoph)
 - Removal of bdget() as a generic API (Christoph)
 - Removal of blkdev_get() as a generic API (Christoph)
 - Cleanup of is-partition checks (Christoph)
 - Series reworking disk revalidation (Christoph)
 - Series cleaning up bio flags (Christoph)
 - bio crypt fixes (Eric)
 - IO stats inflight tweak (Gabriel)
 - blk-mq tags fixes (Hannes)
 - Buffer invalidation fixes (Jan)
 - Allow soft limits for zone append (Johannes)
 - Shared tag set improvements (John, Kashyap)
 - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)
 - DM no-wait support (Mike, Konstantin)
 - Request allocation improvements (Ming)
 - Allow md/dm/bcache to use IO stat helpers (Song)
 - Series improving blk-iocost (Tejun)
 - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang, Xianting, Yang, Yufen, yangerkun)

* tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
  block: fix uapi blkzoned.h comments
  blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
  blk-mq: get rid of the dead flush handle code path
  block: get rid of unnecessary local variable
  block: fix comment and add lockdep assert
  blk-mq: use helper function to test hw stopped
  block: use helper function to test queue register
  block: remove redundant mq check
  block: invoke blk_mq_exit_sched no matter whether have .exit_sched
  percpu_ref: don't refer to ref->data if it isn't allocated
  block: ratelimit handle_bad_sector() message
  blk-throttle: Re-use the throtl_set_slice_end()
  blk-throttle: Open code __throtl_de/enqueue_tg()
  blk-throttle: Move service tree validation out of the throtl_rb_first()
  blk-throttle: Move the list operation after list validation
  blk-throttle: Fix IO hang for a corner case
  blk-throttle: Avoid tracking latency if low limit is invalid
  blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
  blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
  block: Remove redundant 'return' statement
  ...
2020-10-12  Merge tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 1 file, -6/+9)

Pull RAS updates from Borislav Petkov:

 - Extend the recovery from MCE in kernel space also to processes which encounter an MCE in kernel space but while copying from user memory, by sending them a SIGBUS on return to user space and unmapping the faulty memory, by Tony Luck and Youquan Song.

 - memcpy_mcsafe() rework by splitting the functionality into copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables support for new hardware which can recover from a machine check encountered during a fast string copy and makes that the default, and lets the older hardware which does not support that advanced recovery opt in to use the old, fragile, slow variant, by Dan Williams.

 - New AMD hw enablement, by Yazen Ghannam and Akshay Gupta.

 - Do not use MSR-tracing accessors in #MC context and flag any fault while accessing MCA architectural MSRs as an architectural violation, with the hope that such hw/fw misdesigns are caught early during the hw eval phase and they don't make it into production.

 - Misc fixes, improvements and cleanups, as always.

* tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Allow for copy_mc_fragile symbol checksum to be generated
  x86/mce: Decode a kernel instruction to determine if it is copying from user
  x86/mce: Recover from poison found while copying from user space
  x86/mce: Avoid tail copy when machine check terminated a copy from user
  x86/mce: Add _ASM_EXTABLE_CPY for copy user access
  x86/mce: Provide method to find out the type of an exception handler
  x86/mce: Pass pointer to saved pt_regs to severity calculation routines
  x86/copy_mc: Introduce copy_mc_enhanced_fast_string()
  x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
  x86/mce: Drop AMD-specific "DEFERRED" case from Intel severity rule list
  x86/mce: Add Skylake quirk for patrol scrub reported errors
  RAS/CEC: Convert to DEFINE_SHOW_ATTRIBUTE()
  x86/mce: Annotate mce_rd/wrmsrl() with noinstr
  x86/mce/dev-mcelog: Do not update kflags on AMD systems
  x86/mce: Stop mce_reign() from re-computing severity for every CPU
  x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR
  x86/mce: Increase maximum number of banks to 64
  x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check()
  x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmap
  RAS/CEC: Fix cec_init() prototype
2020-10-08  md/raid5: fix oops during stripe resizing  (Song Liu, 1 file, -2/+2)

KoWei reported a crash during raid5 reshape:

 [ 1032.252932] Oops: 0002 [#1] SMP PTI
 [...]
 [ 1032.252943] RIP: 0010:memcpy_erms+0x6/0x10
 [...]
 [ 1032.252947] RSP: 0018:ffffba1ac0c03b78 EFLAGS: 00010286
 [ 1032.252949] RAX: 0000784ac0000000 RBX: ffff91bec3d09740 RCX: 0000000000001000
 [ 1032.252951] RDX: 0000000000001000 RSI: ffff91be6781c000 RDI: 0000784ac0000000
 [ 1032.252953] RBP: ffffba1ac0c03bd8 R08: 0000000000001000 R09: ffffba1ac0c03bf8
 [ 1032.252954] R10: 0000000000000000 R11: 0000000000000000 R12: ffffba1ac0c03bf8
 [ 1032.252955] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000000
 [ 1032.252958] FS: 0000000000000000(0000) GS:ffff91becf500000(0000) knlGS:0000000000000000
 [ 1032.252959] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [ 1032.252961] CR2: 0000784ac0000000 CR3: 000000031780a002 CR4: 00000000001606e0
 [ 1032.252962] Call Trace:
 [ 1032.252969]  ? async_memcpy+0x179/0x1000 [async_memcpy]
 [ 1032.252977]  ? raid5_release_stripe+0x8e/0x110 [raid456]
 [ 1032.252982]  handle_stripe_expansion+0x15a/0x1f0 [raid456]
 [ 1032.252988]  handle_stripe+0x592/0x1270 [raid456]
 [ 1032.252993]  handle_active_stripes.isra.0+0x3cb/0x5a0 [raid456]
 [ 1032.252999]  raid5d+0x35c/0x550 [raid456]
 [ 1032.253002]  ? schedule+0x42/0xb0
 [ 1032.253006]  ? schedule_timeout+0x10e/0x160
 [ 1032.253011]  md_thread+0x97/0x160
 [ 1032.253015]  ? wait_woken+0x80/0x80
 [ 1032.253019]  kthread+0x104/0x140
 [ 1032.253022]  ? md_start_sync+0x60/0x60
 [ 1032.253024]  ? kthread_park+0x90/0x90
 [ 1032.253027]  ret_from_fork+0x35/0x40

This happens because cache_size_mutex was unlocked too early in resize_stripes(), which races with grow_one_stripe(): grow_one_stripe() allocates a stripe with the wrong pool_size. Fix this issue by unlocking cache_size_mutex only after updating pool_size.

Cc: <stable@vger.kernel.org> # v4.4+
Reported-by: KoWei Sung <winders@amazon.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-08  md/bitmap: fix memory leak of temporary bitmap  (Zhao Heming, 2 files, -1/+3)
Callers of get_bitmap_from_slot() are responsible to free the bitmap. Suggested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-08  md: fix the checking of wrong work queue  (Guoqing Jiang, 1 file, -1/+1)
It should check md_rdev_misc_wq instead of md_misc_wq. Fixes: cc1ffe61c026 ("md: add new workqueue for delete rdev") Cc: <stable@vger.kernel.org> # v5.8+ Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-08  md/bitmap: md_bitmap_get_counter returns wrong blocks  (Zhao Heming, 1 file, -1/+1)

md_bitmap_get_counter() has this code:

```
if (bitmap->bp[page].hijacked ||
    bitmap->bp[page].map == NULL)
	csize = ((sector_t)1) << (bitmap->chunkshift +
				  PAGE_COUNTER_SHIFT - 1);
```

The "- 1" is wrong: this branch should report 2048 bits of space, but with the "- 1" it only reports 1024 bits.

The buggy code returns the wrong blocks, but it doesn't influence the bitmap logic:

1. Most callers only use the function's return value (the counter at the given offset), not the blocks output parameter.
2. The bug is only triggered when hijacked is true or map is NULL. The hijacked condition is very rare, and "map == NULL" is only true while the array is being created or resized.
3. Even if a caller gets the wrong blocks, the current code merely makes the caller call md_bitmap_get_counter() one more time.

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
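A minimal sketch of the corrected branch, assuming the fix simply drops the stray "- 1" from the shift; the surrounding lines are paraphrased for context, not the verbatim upstream diff:

```
/* Sketch: without the "- 1", the hijacked / unallocated-page branch
 * reports the full range covered by the page instead of half of it. */
if (bitmap->bp[page].hijacked ||
    bitmap->bp[page].map == NULL)
	csize = ((sector_t)1) << (bitmap->chunkshift +
				  PAGE_COUNTER_SHIFT);
else
	csize = ((sector_t)1) << bitmap->chunkshift;

*blocks = csize - (offset & (csize - 1));	/* blocks left in this chunk */
```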
2020-10-08  md/bitmap: md_bitmap_read_sb uses wrong bitmap blocks  (Zhao Heming, 1 file, -2/+2)

The patched code computes the number of chunks; it should use a round-up division instead of the current sector_div. The same calculation is already used in md_bitmap_resize():

```
chunks = DIV_ROUND_UP_SECTOR_T(blocks, 1 << chunkshift);
```

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
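To illustrate why the rounding matters, here is a small sketch with hypothetical numbers (the helper and values are illustrative, not taken from the patch):

```
/* Sketch: a truncating division such as blocks >> chunkshift drops a
 * partial last chunk, e.g. 1000 blocks with 64-block chunks yields 15
 * instead of the 16 chunks the bitmap actually needs to cover. */
static unsigned long example_chunk_count(sector_t blocks,
					 unsigned long chunkshift)
{
	return DIV_ROUND_UP_SECTOR_T(blocks, 1 << chunkshift);
}
```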
2020-10-08  md/raid0: remove unused function is_io_in_chunk_boundary()  (Jason Yan, 1 file, -17/+0)

This function is no longer needed after commit 20d0189b1012 ("block: Introduce new bio_split()").

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-06  x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()  (Dan Williams, 1 file, -6/+9)

In reaction to a proposal to introduce a memcpy_mcsafe_fast() implementation Linus points out that memcpy_mcsafe() is poorly named relative to communicating the scope of the interface. Specifically what addresses are valid to pass as source, destination, and what faults / exceptions are handled.

Of particular concern is that even though x86 might be able to handle the semantics of copy_mc_to_user() with its common copy_user_generic() implementation other archs likely need / want an explicit path for this case:

  On Fri, May 1, 2020 at 11:28 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
  >
  > On Thu, Apr 30, 2020 at 6:21 PM Dan Williams <dan.j.williams@intel.com> wrote:
  > >
  > > However now I see that copy_user_generic() works for the wrong reason.
  > > It works because the exception on the source address due to poison
  > > looks no different than a write fault on the user address to the
  > > caller, it's still just a short copy. So it makes copy_to_user() work
  > > for the wrong reason relative to the name.
  >
  > Right.
  >
  > And it won't work that way on other architectures. On x86, we have a
  > generic function that can take faults on either side, and we use it
  > for both cases (and for the "in_user" case too), but that's an
  > artifact of the architecture oddity.
  >
  > In fact, it's probably wrong even on x86 - because it can hide bugs -
  > but writing those things is painful enough that everybody prefers
  > having just one function.

Replace a single top-level memcpy_mcsafe() with either copy_mc_to_user(), or copy_mc_to_kernel().

Introduce an x86 copy_mc_fragile() name as the rename for the low-level x86 implementation formerly named memcpy_mcsafe(). It is used as the slow / careful backend that is supplanted by a fast copy_mc_generic() in a follow-on patch.

One side-effect of this reorganization is that separating copy_mc_64.S to its own file means that perf no longer needs to track dependencies for its memcpy_64.S benchmarks.

[ bp: Massage a bit. ]

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: <stable@vger.kernel.org>
Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg@mail.gmail.com
Link: https://lkml.kernel.org/r/160195561680.2163339.11574962055305783722.stgit@dwillia2-desk3.amr.corp.intel.com
2020-10-05  block: make bio_crypt_clone() able to fail  (Eric Biggers, 1 file, -3/+4)
bio_crypt_clone() assumes its gfp_mask argument always includes __GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed. However, bio_crypt_clone() might be called with GFP_ATOMIC via setup_clone() in drivers/md/dm-rq.c, or with GFP_NOWAIT via kcryptd_io_read() in drivers/md/dm-crypt.c. Neither case is currently reachable with a bio that actually has an encryption context. However, it's fragile to rely on this. Just make bio_crypt_clone() able to fail, analogous to bio_integrity_clone(). Reported-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Satya Tangirala <satyat@google.com> Cc: Satya Tangirala <satyat@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
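A hedged sketch of the resulting caller pattern; the helper name and error handling here are illustrative, not the actual dm or block-layer call sites:

```
/* Sketch: once bio_crypt_clone() can fail, a cloning path has to check
 * its return value instead of assuming the mempool always succeeds. */
static int example_copy_crypt_ctx(struct bio *clone, struct bio *orig,
				  gfp_t gfp_mask)
{
	int ret = bio_crypt_clone(clone, orig, gfp_mask);

	if (ret < 0) {
		/* e.g. -ENOMEM when called with GFP_ATOMIC or GFP_NOWAIT */
		bio_put(clone);
		return ret;
	}
	return 0;
}
```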
2020-10-02  bcache: remove embedded struct cache_sb from struct cache_set  (Coly Li, 11 files, -59/+46)

Since bcache code was merged into the mainline kernel, each cache set has had only a single cache in it. The multiple-caches framework is there, but the code is far from complete. Considering that multiple copies of cached data can also be stored on e.g. md raid1 devices, it is unnecessary to support multiple caches in one cache set.

The previous preparation patches fixed the dependencies of explicitly making a cache set only have a single cache. Now we don't have to maintain an embedded partial super block in struct cache_set; the in-memory super block can be directly referenced from struct cache.

This patch removes the embedded struct cache_sb from struct cache_set, and fixes all locations where the super block was referenced from this removed copy by referencing the in-memory super block of struct cache.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02  bcache: check and set sync status on cache's in-memory super block  (Coly Li, 4 files, -10/+7)

Currently the cache's sync status is checked and set on the cache set's in-memory partial super block. After removing the embedded struct cache_sb from struct cache_set and referencing the cache's in-memory super block from struct cache_set, the sync status can be set and checked directly on the cache's super block.

This patch checks and sets the cache sync status directly on the cache's in-memory super block. This is a preparation for later removing the embedded struct cache_sb from struct cache_set.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02  bcache: remove can_attach_cache()  (Coly Li, 1 file, -10/+0)

After removing the embedded struct cache_sb from struct cache_set, the cache set will directly reference the in-memory super block of struct cache. It is unnecessary to compare block_size, bucket_size and nr_in_set from the identical in-memory super block in can_attach_cache().

This is a preparation patch for later removing cache_set->sb from struct cache_set.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02  bcache: don't check seq numbers in register_cache_set()  (Coly Li, 1 file, -15/+0)

In order to update the partial super block of the cache set, the seq numbers of the cache and the cache set are checked in register_cache_set(). If the cache's seq number is larger than the cache set's seq number, the cache set must update its partial super block from the cache's super block. This is unnecessary once the embedded struct cache_sb is removed from struct cache_set.

This patch removes the seq number checking from register_cache_set(), because later there will be no such partial super block in struct cache_set; the cache set will directly reference the in-memory super block from struct cache. This is a preparation patch for removing the embedded struct cache_sb from struct cache_set.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02  bcache: only use bucket_bytes() on struct cache  (Coly Li, 2 files, -2/+2)

Because struct cache_set and struct cache both have a struct cache_sb, the macro bucket_bytes() is currently used on both of them. When the embedded struct cache_sb is removed from struct cache_set, this macro won't be used on struct cache_set anymore.

This patch unifies all bucket_bytes() usage on struct cache only; it is one of the preparations for removing the embedded struct cache_sb from struct cache_set.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02  bcache: remove useless bucket_pages()  (Coly Li, 1 file, -1/+0)
It seems alloc_bucket_pages() is the only user of bucket_pages(). Considering alloc_bucket_pages() is removed from bcache code, it is safe to remove the useless macro bucket_pages() now. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02  bcache: remove useless alloc_bucket_pages()  (Coly Li, 1 file, -3/+0)
Now no one uses alloc_bucket_pages() anymore, remove it from bcache.h. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02  bcache: only use block_bytes() on struct cache  (Coly Li, 7 files, -24/+24)

Because struct cache_set and struct cache both have a struct cache_sb, the macro block_bytes() can currently be used on both of them. When the embedded struct cache_sb is removed from struct cache_set, this macro won't be used on struct cache_set anymore.

This patch unifies all block_bytes() usage on struct cache only; it is one of the preparations for removing the embedded struct cache_sb from struct cache_set.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02  bcache: add set_uuid in struct cache_set  (Coly Li, 3 files, -13/+14)

This patch adds a separate set_uuid[16] in struct cache_set to store the uuid of the cache set. This is a preparation for removing the embedded struct cache_sb from struct cache_set.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02  bcache: remove for_each_cache()  (Coly Li, 6 files, -294/+237)

Since each cache_set now explicitly has a single cache, for_each_cache() is unnecessary. This patch removes the macro, updates all locations where it is used, and makes sure the code logic stays consistent.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02  bcache: explicitly make cache_set only have single cache  (Coly Li, 3 files, -17/+12)

Although the bcache code has a framework for multiple caches in a cache set, the multiple-caches support was never completed and users rely on md raid1 for multiple copies of the cached data.

This patch makes the following changes in struct cache_set to explicitly make a cache_set have only a single cache:

- Change the pointer array "*cache[MAX_CACHES_PER_SET]" to a single pointer "*cache".
- Remove the pointer array "*cache_by_alloc[MAX_CACHES_PER_SET]".
- Remove "caches_loaded".

Now the code reflects exactly what it does in practice: only one cache is used in the cache set.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02  bcache: remove 'int n' from parameter list of bch_bucket_alloc_set()  (Coly Li, 4 files, -24/+19)

The parameter 'int n' of bch_bucket_alloc_set() is not clearly defined. From the code comments, n is the number of buckets to allocate, but from the code itself 'n' is the maximum number of caches to iterate over. In practice, at every location where bch_bucket_alloc_set() is called, 'n' is always 1.

This patch removes the confusing and unnecessary 'int n' from the parameter list of bch_bucket_alloc_set(), and explicitly allocates only 1 bucket for its callers.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02  bcache: Convert to DEFINE_SHOW_ATTRIBUTE  (Qinglang Miao, 1 file, -13/+3)

Use the DEFINE_SHOW_ATTRIBUTE macro to simplify the code. Since inode->i_private equals the third parameter of debugfs_create_file(), which is NULL, this is equivalent to the original code logic.

Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
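As a rough illustration of the pattern (the names below are hypothetical, not the actual bcache symbols), DEFINE_SHOW_ATTRIBUTE() collapses the single_open()/file_operations boilerplate around a *_show() function:

```
#include <linux/debugfs.h>
#include <linux/seq_file.h>

/* Sketch: given a foo_show() that prints into a seq_file ... */
static int foo_show(struct seq_file *m, void *unused)
{
	seq_puts(m, "example\n");
	return 0;
}

/* ... this macro generates foo_open() and foo_fops, replacing the
 * hand-written open wrapper and struct file_operations. */
DEFINE_SHOW_ATTRIBUTE(foo);

/* Usage: debugfs_create_file("foo", 0444, parent, NULL, &foo_fops); */
```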
2020-10-02  bcache: check c->root with IS_ERR_OR_NULL() in mca_reserve()  (Dongsheng Yang, 1 file, -1/+1)

The mca_reserve(c) macro checks whether c->root is NULL, but that's not enough. When we read the root node in run_cache_set() and get an error from bch_btree_node_read_done(), ERR_PTR(-EIO) is assigned to c->root. We then continue to unregister, but before calling unregister_shrinker(&c->shrink) there is a window in which bch_mca_count() can be called, and we get a crash with a call trace like this:

 [ 2149.876008] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000b5
 ... ...
 [ 2150.598931] Call trace:
 [ 2150.606439]  bch_mca_count+0x58/0x98 [escache]
 [ 2150.615866]  do_shrink_slab+0x54/0x310
 [ 2150.624429]  shrink_slab+0x248/0x2d0
 [ 2150.632633]  drop_slab_node+0x54/0x88
 [ 2150.640746]  drop_slab+0x50/0x88
 [ 2150.648228]  drop_caches_sysctl_handler+0xf0/0x118
 [ 2150.657219]  proc_sys_call_handler.isra.18+0xb8/0x110
 [ 2150.666342]  proc_sys_write+0x40/0x50
 [ 2150.673889]  __vfs_write+0x48/0x90
 [ 2150.681095]  vfs_write+0xac/0x1b8
 [ 2150.688145]  ksys_write+0x6c/0xd0
 [ 2150.695127]  __arm64_sys_write+0x24/0x30
 [ 2150.702749]  el0_svc_handler+0xa0/0x128
 [ 2150.710296]  el0_svc+0x8/0xc

Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02  bcache: share register sysfs with async register  (Coly Li, 1 file, -5/+7)

Previously the experimental async registration used a separate sysfs file, register_async. The async registration code has been working well for a while now, so we can do further testing with it.

This patch changes async bcache registration to share the same sysfs file /sys/fs/bcache/register (and register_quiet). Async registration becomes the default behavior if BCACHE_ASYNC_REGISTRATION is set in the kernel configuration. By default, BCACHE_ASYNC_REGISTRATION is not enabled.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25  dm: add support for REQ_NOWAIT and enable it for linear target  (Konstantin Khlebnikov, 3 files, -3/+38)

Add the DM target feature flag DM_TARGET_NOWAIT, which advertises that the target works with REQ_NOWAIT bios. Add dm_table_supports_nowait() and update dm_table_set_restrictions() to set/clear QUEUE_FLAG_NOWAIT accordingly.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
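A minimal sketch of how a target might advertise the new flag; the target name and version are illustrative, not the actual dm-linear registration:

```
/* Sketch: a DM target opts in to REQ_NOWAIT bios by setting
 * DM_TARGET_NOWAIT in its feature mask; dm_table_supports_nowait()
 * then lets the core set QUEUE_FLAG_NOWAIT on the mapped device. */
static struct target_type example_target = {
	.name     = "example",		/* hypothetical target */
	.version  = {1, 0, 0},
	.features = DM_TARGET_NOWAIT,
	.module   = THIS_MODULE,
	/* .ctr / .dtr / .map callbacks elided in this sketch */
};
```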
2020-09-25  md: don't detour through bd_contains for the gendisk  (Christoph Hellwig, 2 files, -2/+2)
bd_disk is set on all block devices, including those for partitions. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25  md: compare bd_disk instead of bd_contains  (Christoph Hellwig, 1 file, -4/+3)
To check for partitions of the same disk bd_contains works as well, but bd_disk is way more obvious. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25  block: add a bdev_is_partition helper  (Christoph Hellwig, 1 file, -1/+1)

Add a little helper to make the somewhat arcane bd_contains checks a little more obvious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
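The helper is presumably a thin wrapper along these lines (a sketch, not the verbatim upstream definition):

```
/* Sketch: a partition's block_device carries a non-zero partition
 * number, so the check reduces to looking at bd_partno instead of
 * comparing bdev->bd_contains against bdev itself. */
static inline bool bdev_is_partition(struct block_device *bdev)
{
	return bdev->bd_partno != 0;
}
```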
2020-09-24  md/raid10: improve discard request for far layout  (Xiao Ni, 2 files, -23/+64)

For the far layout, the discard region is not contiguous on the disks, so it needs "far copies" r10bios to cover all regions. It also needs a way to know whether all r10bios have finished. Similar to raid10_sync_request, only the first r10bio's master_bio records the discard bio; the other r10bios' master_bio record the first r10bio. The first r10bio can finish after the other r10bios finish and then return the discard bio.

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24  md/raid10: improve raid10 discard request  (Xiao Ni, 1 file, -1/+255)

Currently the discard request is split by chunk size, so it takes a long time to finish mkfs on disks which support the discard function. This patch improves the handling of raid10 discard requests. It uses an approach similar to patch 29efc390b ("md/md0: optimize raid0 discard handling"), but it's a little more complex than raid0 because raid10 has a different layout. If raid10 uses the offset layout and the discard request is smaller than the stripe size, there are holes when we submit the discard bio to the underlying disks. For example, with five disks (disk1 - disk5):

	D01 D02 D03 D04 D05
	D05 D01 D02 D03 D04
	D06 D07 D08 D09 D10
	D10 D06 D07 D08 D09

The discard bio just wants to discard from D03 to D10. For disk3, there is a hole between D03 and D08. For disk4, there is a hole between D04 and D09. D03 is a chunk, and raid10_write_request can handle one chunk perfectly, so the part that is not aligned with the stripe size is still handled by raid10_write_request.

If a reshape is running when a discard bio arrives and the discard bio spans the reshape position, raid10_write_request is responsible for handling this discard bio.

I did a test with this patch set.

Without the patch:

	time mkfs.xfs /dev/md0
	real    4m39.775s
	user    0m0.000s
	sys     0m0.298s

With the patch:

	time mkfs.xfs /dev/md0
	real    0m0.105s
	user    0m0.000s
	sys     0m0.007s

	nvme3n1     259:1    0  477G  0 disk
	└─nvme3n1p1 259:10   0   50G  0 part
	nvme4n1     259:2    0  477G  0 disk
	└─nvme4n1p1 259:11   0   50G  0 part
	nvme5n1     259:6    0  477G  0 disk
	└─nvme5n1p1 259:12   0   50G  0 part
	nvme2n1     259:9    0  477G  0 disk
	└─nvme2n1p1 259:15   0   50G  0 part
	nvme0n1     259:13   0  477G  0 disk
	└─nvme0n1p1 259:14   0   50G  0 part

Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24  md/raid10: pull codes that wait for blocked dev into one function  (Xiao Ni, 1 file, -51/+67)

The following patch will reuse this logic, so pull the shared code into one function.

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24  md/raid10: extend r10bio devs to raid disks  (Xiao Ni, 1 file, -5/+4)

Currently we allocate r10bio->devs[conf->copies]. A discard bio needs to be submitted to all member disks and it needs to use r10bio, so extend this to r10bio->devs[geo.raid_disks].

Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24  md: add md_submit_discard_bio() for submitting discard bio  (Xiao Ni, 3 files, -12/+24)

Move this logic from raid0.c to md.c, so that we can also use it in raid10.c.

Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24  md: Simplify code with existing definition RESYNC_SECTORS in raid10.c  (Zhen Lei, 1 file, -4/+4)

raid10.c already defines:

```
#define RESYNC_SECTORS (RESYNC_BLOCK_SIZE >> 9)
```

"RESYNC_BLOCK_SIZE/512" is equal to "RESYNC_BLOCK_SIZE >> 9", so replace it with RESYNC_SECTORS.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24  md/raid5: reallocate page array after setting new stripe_size  (Yufen Yu, 1 file, -0/+20)

When trying to resize stripe_size, we also need to free the old shared page array and allocate a new one.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24  md/raid5: resize stripe_head when reshape array  (Yufen Yu, 1 file, -0/+24)

When reshaping the array, we try to reuse the shared pages of the old stripe_head, and allocate more for the new one if needed.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24  md/raid5: let multiple devices of stripe_head share page  (Yufen Yu, 2 files, -3/+114)

In the current implementation, grow_buffers() uses alloc_page() to allocate the buffers for each stripe_head, i.e. it allocates one page for each dev[i] in the stripe_head. After making stripe_size a configurable value via a sysfs entry, this means we always allocate 64K buffers but only use 4K of them when stripe_size is 4K on 64K-page arm64.

To avoid wasting memory, we let multiple sh->dev share one real page. That means multiple sh->dev[i].page point to the same page at different offsets. Example with 64K PAGE_SIZE and 4K stripe_size:

	                        64K PAGE_SIZE
	+-----------+------------+------------+-------------+----------------+
	| offset(0) | offset(4K) | offset(8K) | offset(12K) |     unused     |
	+-----------+------------+------------+-------------+----------------+
	      ^            ^            ^             ^
	 dev[0].page  dev[1].page  dev[2].page  dev[3].page   (same struct page)

A new 'pages' array is added to stripe_head to record the shared pages used by this stripe_head. They are allocated in grow_buffers() and freed in shrink_buffers().

After sharing pages, the users of sh->dev[i].page need to take care of the related page offset: pages of issued bios and pages passed to the xor computation functions. Thanks to the previously added support for different page offsets, we just need to set the correct dev[i].offset.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
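A sketch of the offset arithmetic implied by the layout above, assuming stripe_size divides PAGE_SIZE; the helper and loop are illustrative, while the 'pages' array and dev[i].offset field follow the commit messages in this series:

```
/* Sketch: carve one real page into PAGE_SIZE / stripe_size slots and
 * point consecutive sh->dev[i].page at it with increasing offsets. */
static void example_share_stripe_pages(struct stripe_head *sh, int disks,
				       unsigned int stripe_size)
{
	unsigned int per_page = PAGE_SIZE / stripe_size;   /* e.g. 64K / 4K = 16 */
	int i;

	for (i = 0; i < disks; i++) {
		sh->dev[i].page   = sh->pages[i / per_page];
		sh->dev[i].offset = (i % per_page) * stripe_size;
	}
}
```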
2020-09-24  md/raid6: let async recovery function support different page offset  (Yufen Yu, 1 file, -2/+2)

For now, the asynchronous raid6 recovery functions require a common offset for the pages. But we expect them to support different page offsets after introducing stripe shared pages. Do that by simply adding the page offset wherever each page address is referenced. Then, replace the old interface with the new one in raid6 and raid6test.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24  md/raid6: let syndrome computor support different page offset  (Yufen Yu, 1 file, -12/+24)

For now, the syndrome compute functions require a common offset in the pages array. However, we expect them to support different offsets when we try to use shared pages later. Simply convert them by adding the page offset wherever each page address is referenced. Since the only callers of async_gen_syndrome() and async_syndrome_val() are in raid6, we don't keep the old interface but modify it directly. After that, the old interfaces are replaced with the new ones in raid6 and raid6test.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24  md/raid5: convert to new xor compution interface  (Yufen Yu, 1 file, -9/+34)

Replace async_xor() and async_xor_val() with the newly introduced interfaces async_xor_offs() and async_xor_val_offs() for raid456.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24  md/raid5: make async_copy_data() to support different page offset  (Yufen Yu, 1 file, -3/+5)

ops_run_biofill() and ops_run_biodrain() call async_copy_data() to copy sh->dev[i].page from or to a bio page. For now, this implies that the offset of dev[i].page is 0. But we want to support different page offsets in the following patches. Thus, pass the page offset to these functions and replace 'page_offset' with 'page_offset + poff'.

No functional change.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24  md/raid5: add a new member of offset into r5dev  (Yufen Yu, 2 files, -11/+40)

Add a new member, offset, to struct r5dev. It indicates the offset of the related dev[i].page. For now, since each device has a private page, the value is always 0; thus we set offset to 0 when allocating pages in grow_buffers() and resize_stripes().

To support different page offsets later, we use the page offset rather than '0' directly for async_memcpy() and ops_run_io(). Support for different page offsets in the xor computation functions follows in later patches. To avoid repeatedly allocating a new array each time, we add a memory region to the scribble buffer to record the offsets.

No functional change.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24  md: only calculate blocksize once and use i_blocksize()  (Xianting Tian, 1 file, -3/+4)

We already have the interface i_blocksize(), which can be used to get the block size, so use it. Calculate the block size only once and use it within read_page().

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
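A sketch of the substitution, assuming read_page() previously recomputed the block size from i_blkbits at each use; the surrounding context is illustrative:

```
/* Sketch: compute the block size once via the existing helper instead
 * of open-coding 1 << inode->i_blkbits at every reference. */
struct inode *inode = file_inode(file);
unsigned int blocksize = i_blocksize(inode);	/* == 1 << inode->i_blkbits */

/* ... use 'blocksize' for the rest of read_page() ... */
```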
2020-09-24  Merge branch 'for-5.10/block' into for-5.10/drivers  (Jens Axboe, 12 files, -106/+58)

* for-5.10/block: (140 commits)
  bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
  bdi: invert BDI_CAP_NO_ACCT_WB
  bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag
  mm: use SWP_SYNCHRONOUS_IO more intelligently
  bdi: remove BDI_CAP_SYNCHRONOUS_IO
  bdi: remove BDI_CAP_CGROUP_WRITEBACK
  block: lift setting the readahead size into the block layer
  md: update the optimal I/O size on reshape
  bdi: initialize ->ra_pages and ->io_pages in bdi_init
  aoe: set an optimal I/O size
  bcache: inherit the optimal I/O size
  drbd: remove dead code in device_to_statistics
  fs: remove the unused SB_I_MULTIROOT flag
  block: mark blkdev_get static
  PM: mm: cleanup swsusp_swap_check
  mm: split swap_type_of
  PM: rewrite is_hibernate_resume_dev to not require an inode
  mm: cleanup claim_swapfile
  ocfs2: cleanup o2hb_region_dev_store
  dasd: cleanup dasd_scan_partitions
  ...
2020-09-24  bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag  (Christoph Hellwig, 2 files, -7/+7)

BDI_CAP_STABLE_WRITES is one of the few bits of information in the backing_dev_info shared between the block drivers and the writeback code. To help untangle the dependency, replace it with a queue flag and a superblock flag derived from it. This also helps with the case of e.g. a file system requiring stable writes due to its own checksumming, but not forcing it on other users of the block device like the swap code.

One downside is that we can't support the stable_pages_required bdi attribute in sysfs anymore. It is replaced with a queue attribute which is also writable for easier testing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
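A hedged sketch of what the driver-side opt-in presumably looks like after the conversion; the flag name is taken from this series, while the helper wrapping it is illustrative:

```
/* Sketch: instead of setting BDI_CAP_STABLE_WRITES on the bdi, a driver
 * that needs pages to stay untouched during writeback marks its queue;
 * the writeback code then derives the behaviour from the queue and the
 * superblock rather than from backing_dev_info capabilities. */
static void example_mark_stable_writes(struct request_queue *q)
{
	blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);
}
```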
2020-09-24  block: lift setting the readahead size into the block layer  (Christoph Hellwig, 5 files, -56/+3)
Drivers shouldn't really mess with the readahead size, as that is a VM concept. Instead set it based on the optimal I/O size by lifting the algorithm from the md driver when registering the disk. Also set bdi->io_pages there as well by applying the same scheme based on max_sectors. To ensure the limits work well for stacking drivers a new helper is added to update the readahead limits from the block limits, which is also called from disk_stack_limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
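Roughly, the lifted md algorithm derives the readahead window from the optimal I/O size with a floor at the VM default, and io_pages from max_sectors. A sketch of the idea under those assumptions (the helper name is illustrative, not the exact block-layer symbol):

```
/* Sketch: readahead follows 2x the optimal I/O size, never dropping
 * below the VM's default window; io_pages follows max_sectors. */
static void example_update_readahead(struct request_queue *q)
{
	struct backing_dev_info *bdi = q->backing_dev_info;

	bdi->ra_pages = max(queue_io_opt(q) * 2 / PAGE_SIZE,
			    VM_READAHEAD_PAGES);
	bdi->io_pages = queue_max_sectors(q) >> (PAGE_SHIFT - 9);
}
```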
2020-09-24  md: update the optimal I/O size on reshape  (Christoph Hellwig, 2 files, -10/+22)
The raid5 and raid10 drivers currently update the read-ahead size, but not the optimal I/O size on reshape. To prepare for deriving the read-ahead size from the optimal I/O size make sure it is updated as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24  bcache: inherit the optimal I/O size  (Christoph Hellwig, 1 file, -0/+2)
Inherit the optimal I/O size setting just like the readahead window, as any reason to do larger I/O does not apply to just readahead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>