summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
2017-09-14Merge tag 'for-4.14/dm-changes' of ↵Linus Torvalds18-142/+170
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Some request-based DM core and DM multipath fixes and cleanups - Constify a few variables in DM core and DM integrity - Add bufio optimization and checksum failure accounting to DM integrity - Fix DM integrity to avoid checking integrity of failed reads - Fix DM integrity to use init_completion - A couple DM log-writes target fixes - Simplify DAX flushing by eliminating the unnecessary flush abstraction that was stood up for DM's use. * tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dax: remove the pmem_dax_ops->flush abstraction dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACK dm integrity: make blk_integrity_profile structure const dm integrity: do not check integrity for failed read operations dm log writes: fix >512b sectorsize support dm log writes: don't use all the cpu while waiting to log blocks dm ioctl: constify ioctl lookup table dm: constify argument arrays dm integrity: count and display checksum failures dm integrity: optimize writing dm-bufio buffers that are partially changed dm rq: do not update rq partially in each ending bio dm rq: make dm-sq requeuing behavior consistent with dm-mq behavior dm mpath: complain about unsupported __multipath_map_bio() return values dm mpath: avoid that building with W=1 causes gcc 7 to complain about fall-through
2017-09-11dax: remove the pmem_dax_ops->flush abstractionMikulas Patocka3-54/+0
Commit abebfbe2f731 ("dm: add ->flush() dax operation support") is buggy. A DM device may be composed of multiple underlying devices and all of them need to be flushed. That commit just routes the flush request to the first device and ignores the other devices. It could be fixed by adding more complex logic to the device mapper. But there is only one implementation of the method pmem_dax_ops->flush - that is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we don't need the pmem_dax_ops->flush abstraction at all, we can call arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush can't ever reach anything different from arch_wb_cache_pmem(). It should be also pointed out that for some uses of persistent memory it is needed to flush only a very small amount of data (such as 1 cacheline), and it would be overkill if we go through that device mapper machinery for a single flushed cache line. Fix this by removing the pmem_dax_ops->flush abstraction and call arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device mapper code that forwards the flushes. Fixes: abebfbe2f731 ("dm: add ->flush() dax operation support") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-09-11dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACKArnd Bergmann1-10/+10
The new lockdep support for completions causeed the stack usage in dm-integrity to explode, in case of write_journal from 504 bytes to 1120 (using arm gcc-7.1.1): drivers/md/dm-integrity.c: In function 'write_journal': drivers/md/dm-integrity.c:827:1: error: the frame size of 1120 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] The problem is that not only the size of 'struct completion' grows significantly, but we end up having multiple copies of it on the stack when we assign it from a local variable after the initial declaration. COMPLETION_INITIALIZER_ONSTACK() is the right thing to use when we want to declare and initialize a completion on the stack. However, this driver doesn't do that and instead initializes the completion just before it is used. In this case, init_completion() does the same thing more efficiently, and drops the stack usage for the function above down to 496 bytes. While the other functions in this file are not bad enough to cause a warning, they benefit equally from the change, so I do the change across the entire file. In the one place where we reuse a completion, I picked the cheaper reinit_completion() over init_completion(). Fixes: cd8084f91c02 ("locking/lockdep: Apply crossrelease to completions") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-09-11dm integrity: make blk_integrity_profile structure constBhumika Goyal1-1/+1
Make this structure const as it is only stored in the profile field of a blk_integrity structure. This field is of type const, so make structure as const. Signed-off-by: Bhumika Goyal <bhumirks@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-09-11dm integrity: do not check integrity for failed read operationsHyunchul Lee1-1/+5
Even though read operations fail, dm_integrity_map_continue() calls integrity_metadata() to check integrity. In this case, just complete these. This also makes it so read I/O errors do not generate integrity warnings in the kernel log. Cc: stable@vger.kernel.org Signed-off-by: Hyunchul Lee <cheol.lee@lge.com> Acked-by: Milan Broz <gmazyland@gmail.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-09-11dm log writes: fix >512b sectorsize supportJosef Bacik1-11/+31
512b sectors vs device's physical sectorsize was not maintained consistently and as such the support for >512b sector devices has bugs. The log metadata expects native sectorsize but 512b sectors were being stored. Also, device's sectorsize was assumed when assigning the bi_sector for blocks that were being logged. Fix this up by adding two helpers to convert between bio and dev sectors, and use these in the appropriate places to fix the problem and make it clear which units go where. Doing so allows dm-log-writes use with 4k devices. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-09-11dm log writes: don't use all the cpu while waiting to log blocksJosef Bacik1-1/+1
The check to see if the logging kthread needs to go to sleep is wrong, it checks lc->pending_blocks, which will be non-0 if there are any blocks that are pending, whether they are ready to be logged or not. What we really want is to go to sleep until it's time to log blocks, so change this check so we do actually go to sleep in between flushes. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-09-09Merge branch 'for-4.14/block-postmerge' of git://git.kernel.dk/linux-blockLinus Torvalds10-57/+99
Pull followup block layer updates from Jens Axboe: "I ended up splitting the main pull request for this series into two, mainly because of clashes between NVMe fixes that went into 4.13 after the for-4.14 branches were split off. This pull request is mostly NVMe, but not exclusively. In detail, it contains: - Two pull request for NVMe changes from Christoph. Nothing new on the feature front, basically just fixes all over the map for the core bits, transport, rdma, etc. - Series from Bart, cleaning up various bits in the BFQ scheduler. - Series of bcache fixes, which has been lingering for a release or two. Coly sent this in, but patches from various people in this area. - Set of patches for BFQ from Paolo himself, updating both documentation and fixing some corner cases in performance. - Series from Omar, attempting to now get the 4k loop support correct. Our confidence level is higher this time. - Series from Shaohua for loop as well, improving O_DIRECT performance and fixing a use-after-free" * 'for-4.14/block-postmerge' of git://git.kernel.dk/linux-block: (74 commits) bcache: initialize dirty stripes in flash_dev_run() loop: set physical block size to logical block size bcache: fix bch_hprint crash and improve output bcache: Update continue_at() documentation bcache: silence static checker warning bcache: fix for gc and write-back race bcache: increase the number of open buckets bcache: Correct return value for sysfs attach errors bcache: correct cache_dirty_target in __update_writeback_rate() bcache: gc does not work when triggering by manual command bcache: Don't reinvent the wheel but use existing llist API bcache: do not subtract sectors_to_gc for bypassed IO bcache: fix sequential large write IO bypass bcache: Fix leak of bdev reference block/loop: remove unused field block/loop: fix use after free bfq: Use icq_to_bic() consistently bfq: Suppress compiler warnings about comparisons bfq: Check kstrtoul() return value bfq: Declare local functions static ...
2017-09-07Merge tag 'md/4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds9-65/+225
Pull MD updates from Shaohua Li: "This update mainly fixes bugs: - Make raid5 ppl support several ppl from Pawel - Several raid5-cache bug fixes from Song - Bitmap fixes from Neil and Me - One raid1/10 regression fix since 4.12 from Me - Other small fixes and cleanup" * tag 'md/4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md/bitmap: disable bitmap_resize for file-backed bitmaps. raid5-ppl: Recovery support for multiple partial parity logs md: Runtime support for multiple ppls md/raid0: attach correct cgroup info in bio lib/raid6: align AVX512 constants to 512 bits, not bytes raid5: remove raid5_build_block md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_show md: replace seq_release_private with seq_release md: notify about new spare disk in the container md/raid1/10: reset bio allocated from mempool md/raid5: release/flush io in raid5_do_work() md/bitmap: copy correct data for bitmap super
2017-09-07bcache: initialize dirty stripes in flash_dev_run()Tang Junhui3-6/+7
bcache uses a Proportion-Differentiation Controller algorithm to control writeback rate to cached devices. In the PD controller algorithm, dirty stripes of thin flash device should not be counted in, because flash only volumes never write back dirty data. Currently dirty stripe counter for thin flash device is not initialized when the thin flash device starts. Which means the following calculation in PD controller will reference an undefined dirty stripes number, and all cached devices attached to the same cache set where the thin flash device lies on may have an inaccurate writeback rate. This patch calles bch_sectors_dirty_init() in flash_dev_run(), to correctly initialize dirty stripe counter when the thin flash device starts to run. This patch also does following parameter data type change, -void bch_sectors_dirty_init(struct cached_dev *dc); +void bch_sectors_dirty_init(struct bcache_device *); to call this function conveniently in flash_dev_run(). (Commit log is composed by Coly Li) Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-07Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-blockLinus Torvalds38-162/+165
Pull block layer updates from Jens Axboe: "This is the first pull request for 4.14, containing most of the code changes. It's a quiet series this round, which I think we needed after the churn of the last few series. This contains: - Fix for a registration race in loop, from Anton Volkov. - Overflow complaint fix from Arnd for DAC960. - Series of drbd changes from the usual suspects. - Conversion of the stec/skd driver to blk-mq. From Bart. - A few BFQ improvements/fixes from Paolo. - CFQ improvement from Ritesh, allowing idling for group idle. - A few fixes found by Dan's smatch, courtesy of Dan. - A warning fixup for a race between changing the IO scheduler and device remova. From David Jeffery. - A few nbd fixes from Josef. - Support for cgroup info in blktrace, from Shaohua. - Also from Shaohua, new features in the null_blk driver to allow it to actually hold data, among other things. - Various corner cases and error handling fixes from Weiping Zhang. - Improvements to the IO stats tracking for blk-mq from me. Can drastically improve performance for fast devices and/or big machines. - Series from Christoph removing bi_bdev as being needed for IO submission, in preparation for nvme multipathing code. - Series from Bart, including various cleanups and fixes for switch fall through case complaints" * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits) kernfs: checking for IS_ERR() instead of NULL drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set drbd: Fix allyesconfig build, fix recent commit drbd: switch from kmalloc() to kmalloc_array() drbd: abort drbd_start_resync if there is no connection drbd: move global variables to drbd namespace and make some static drbd: rename "usermode_helper" to "drbd_usermode_helper" drbd: fix race between handshake and admin disconnect/down drbd: fix potential deadlock when trying to detach during handshake drbd: A single dot should be put into a sequence. drbd: fix rmmod cleanup, remove _all_ debugfs entries drbd: Use setup_timer() instead of init_timer() to simplify the code. drbd: fix potential get_ldev/put_ldev refcount imbalance during attach drbd: new disk-option disable-write-same drbd: Fix resource role for newly created resources in events2 drbd: mark symbols static where possible drbd: Send P_NEG_ACK upon write error in protocol != C drbd: add explicit plugging when submitting batches drbd: change list_for_each_safe to while(list_first_entry_or_null) drbd: introduce drbd_recv_header_maybe_unplug ...
2017-09-06Merge branch 'linus' of ↵Linus Torvalds1-6/+5
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "Here is the crypto update for 4.14: API: - Defer scompress scratch buffer allocation to first use. - Add __crypto_xor that takes separte src and dst operands. - Add ahash multiple registration interface. - Revamped aead/skcipher algif code to fix async IO properly. Drivers: - Add non-SIMD fallback code path on ARM for SVE. - Add AMD Security Processor framework for ccp. - Add support for RSA in ccp. - Add XTS-AES-256 support for CCP version 5. - Add support for PRNG in sun4i-ss. - Add support for DPAA2 in caam. - Add ARTPEC crypto support. - Add Freescale RNGC hwrng support. - Add Microchip / Atmel ECC driver. - Add support for STM32 HASH module" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits) crypto: af_alg - get_page upon reassignment to TX SGL crypto: cavium/nitrox - Fix an error handling path in 'nitrox_probe()' crypto: inside-secure - fix an error handling path in safexcel_probe() crypto: rockchip - Don't dequeue the request when device is busy crypto: cavium - add release_firmware to all return case crypto: sahara - constify platform_device_id MAINTAINERS: Add ARTPEC crypto maintainer crypto: axis - add ARTPEC-6/7 crypto accelerator driver crypto: hash - add crypto_(un)register_ahashes() dt-bindings: crypto: add ARTPEC crypto crypto: algif_aead - fix comment regarding memory layout crypto: ccp - use dma_mapping_error to check map error lib/mpi: fix build with clang crypto: sahara - Remove leftover from previous used spinlock crypto: sahara - Fix dma unmap direction crypto: af_alg - consolidation of duplicate code crypto: caam - Remove unused dentry members crypto: ccp - select CONFIG_CRYPTO_RSA crypto: ccp - avoid uninitialized variable warning crypto: serpent - improve __serpent_setkey with UBSAN ...
2017-09-06bcache: fix bch_hprint crash and improve outputMichael Lyle1-15/+35
Most importantly, solve a crash where %llu was used to format signed numbers. This would cause a buffer overflow when reading sysfs writeback_rate_debug, as only 20 bytes were allocated for this and %llu writes 20 characters plus a null. Always use the units mechanism rather than having different output paths for simplicity. Also, correct problems with display output where 1.10 was a larger number than 1.09, by multiplying by 10 and then dividing by 1024 instead of dividing by 100. (Remainders of >= 1000 would print as .10). Minor changes: Always display the decimal point instead of trying to omit it based on number of digits shown. Decide what units to use based on 1000 as a threshold, not 1024 (in other words, always print at most 3 digits before the decimal point). Signed-off-by: Michael Lyle <mlyle@lyle.org> Reported-by: Dmitry Yu Okunev <dyokunev@ut.mephi.ru> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-06bcache: Update continue_at() documentationDan Carpenter1-4/+0
continue_at() doesn't have a return statement anymore. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Coly Li <colyli@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-06bcache: silence static checker warningDan Carpenter1-3/+0
In olden times, closure_return() used to have a hidden return built in. We removed the hidden return but forgot to add a new return here. If "c" were NULL we would oops on the next line, but fortunately "c" is never NULL. Let's just remove the if statement. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-06bcache: fix for gc and write-back raceTang Junhui3-2/+10
gc and write-back get raced (see the email "bcache get stucked" I sended before): gc thread write-back thread | |bch_writeback_thread() |bch_gc_thread() | | |==>read_dirty() |==>bch_btree_gc() | |==>btree_root() //get btree root | | //node write locker | |==>bch_btree_gc_root() | | |==>read_dirty_submit() | |==>write_dirty() | |==>continue_at(cl, | | write_dirty_finish, | | system_wq); | |==>write_dirty_finish()//excute | | //in system_wq | |==>bch_btree_insert() | |==>bch_btree_map_leaf_nodes() | |==>__bch_btree_map_nodes() | |==>btree_root //try to get btree | | //root node read | | //lock | |-----stuck here |==>bch_btree_set_root() |==>bch_journal_meta() |==>bch_journal() |==>journal_try_write() |==>journal_write_unlocked() //journal_full(&c->journal) | //condition satisfied |==>continue_at(cl, journal_write, system_wq); //try to excute | //journal_write in system_wq | //but work queue is excuting | //write_dirty_finish() |==>closure_sync(); //wait journal_write execute | //over and wake up gc, |-------------stuck here |==>release root node write locker This patch alloc a separate work-queue for write-back thread to avoid such race. (Commit log re-organized by Coly Li to pass checkpatch.pl checking) Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Acked-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-06bcache: increase the number of open bucketsTang Junhui1-1/+3
In currently, we only alloc 6 open buckets for each cache set, but in usually, we always attach about 10 or so backend devices for each cache set, and the each bcache device are always accessed by about 10 or so threads in top application layer. So 6 open buckets are too few, It has led to that each of the same thread write data to different buckets, which would cause low efficiency write-back, and also cause buckets inefficient, and would be Very easy to run out of. I add debug message in bch_open_buckets_alloc() to print alloc bucket info, and test with ten bcache devices with a cache set, and each bcache device is accessed by ten threads. From the debug message, we can see that, after the modification, One bucket is more likely to assign to the same thread, and the data from the same thread are more likely to write the same bucket. Usually the same thread always write/read the same backend device, so it is good for write-back and also promote the usage efficiency of buckets. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-06bcache: Correct return value for sysfs attach errorsTony Asleson1-2/+2
If you encounter any errors in bch_cached_dev_attach it will return a negative error code. The variable 'v' which stores the result is unsigned, thus user space sees a very large value returned for bytes written which can cause incorrect user space behavior. Utilize 1 signed variable to use throughout the function to preserve error return capability. Signed-off-by: Tony Asleson <tasleson@redhat.com> Acked-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-06bcache: correct cache_dirty_target in __update_writeback_rate()Tang Junhui2-1/+21
__update_write_rate() uses a Proportion-Differentiation Controller algorithm to control writeback rate. A dirty target number is used in this PD controller to control writeback rate. A larger target number will make the writeback rate smaller, on the versus, a smaller target number will make the writeback rate larger. bcache uses the following steps to calculate the target number, 1) cache_sectors = all-buckets-of-cache-set * buckets-size 2) cache_dirty_target = cache_sectors * cached-device-writeback_percent 3) target = cache_dirty_target * (sectors-of-cached-device/sectors-of-all-cached-devices-of-this-cache-set) The calculation at step 1) for cache_sectors is incorrect, which does not consider dirty blocks occupied by flash only volume. A flash only volume can be took as a bcache device without cached device. All data sectors allocated for it are persistent on cache device and marked dirty, they are not touched by bcache writeback and garbage collection code. So data blocks of flash only volume should be ignore when calculating cache_sectors of cache set. Current code does not subtract dirty sectors of flash only volume, which results a larger target number from the above 3 steps. And in sequence the cache device's writeback rate is smaller then a correct value, writeback speed is slower on all cached devices. This patch fixes the incorrect slower writeback rate by subtracting dirty sectors of flash only volumes in __update_writeback_rate(). (Commit log composed by Coly Li to pass checkpatch.pl checking) Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-06bcache: gc does not work when triggering by manual commandTang Junhui1-1/+14
I try to execute the following command to trigger gc thread: [root@localhost internal]# echo 1 > trigger_gc But it does not work, I debug the code in gc_should_run(), It works only if in invalidating or sectors_to_gc < 0. So set sectors_to_gc to -1 to meet the condition when we trigger gc by manual command. (Code comments aded by Coly Li) Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-06bcache: Don't reinvent the wheel but use existing llist APIByungchul Park1-13/+2
Although llist provides proper APIs, they are not used. Make them used. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-06bcache: do not subtract sectors_to_gc for bypassed IOTang Junhui1-3/+3
Since bypassed IOs use no bucket, so do not subtract sectors_to_gc to trigger gc thread. Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Acked-by: Coly Li <colyli@suse.de> Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-06bcache: fix sequential large write IO bypassTang Junhui1-6/+0
Sequential write IOs were tested with bs=1M by FIO in writeback cache mode, these IOs were expected to be bypassed, but actually they did not. We debug the code, and find in check_should_bypass(): if (!congested && mode == CACHE_MODE_WRITEBACK && op_is_write(bio_op(bio)) && (bio->bi_opf & REQ_SYNC)) goto rescale that means, If in writeback mode, a write IO with REQ_SYNC flag will not be bypassed though it is a sequential large IO, It's not a correct thing to do actually, so this patch remove these codes. Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-06bcache: Fix leak of bdev referenceJan Kara1-0/+2
If blkdev_get_by_path() in register_bcache() fails, we try to lookup the block device using lookup_bdev() to detect which situation we are in to properly report error. However we never drop the reference returned to us from lookup_bdev(). Fix that. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-31md/bitmap: disable bitmap_resize for file-backed bitmaps.NeilBrown1-0/+5
bitmap_resize() does not work for file-backed bitmaps. The buffer_heads are allocated and initialized when the bitmap is read from the file, but resize doesn't read from the file, it loads from the internal bitmap. When it comes time to write the new bitmap, the bh is non-existent and we crash. The common case when growing an array involves making the array larger, and that normally means making the bitmap larger. Doing that inside the kernel is possible, but would need more code. It is probably easier to require people who use file-backed bitmaps to remove them and re-add after a reshape. So this patch disables the resizing of arrays which have file-backed bitmaps. This is better than crashing. Reported-by: Zhilong Liu <zlliu@suse.com> Fixes: d60b479d177a ("md/bitmap: add bitmap_resize function to allow bitmap resizing.") Cc: stable@vger.kernel.org (v3.5+). Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-28Merge tag 'v4.13-rc7' into for-4.14/block-postmergeJens Axboe2-16/+50
Linux 4.13-rc7 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-28dm ioctl: constify ioctl lookup tableEric Biggers1-1/+1
Constify the lookup table for device-mapper ioctls so that it is placed in .rodata. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28dm: constify argument arraysEric Biggers9-17/+18
The arrays of 'struct dm_arg' are never modified by the device-mapper core, so constify them so that they are placed in .rodata. (Exception: the args array in dm-raid cannot be constified because it is allocated on the stack and modified.) Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28dm integrity: count and display checksum failuresMikulas Patocka1-2/+8
This changes DM integrity to count the number of checksum failures and report the counter in response to STATUSTYPE_INFO request (via 'dmsetup status'). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28dm integrity: optimize writing dm-bufio buffers that are partially changedMikulas Patocka3-29/+77
Rather than write the entire dm-bufio buffer when only a subset is changed, improve dm-bufio (and dm-integrity) by only writing the subset of the buffer that changed. Update dm-integrity to make use of dm-bufio's new dm_bufio_mark_partial_buffer_dirty() interface. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28raid5-ppl: Recovery support for multiple partial parity logsPawel Baldysiak1-38/+90
Search PPL buffer in order to find out the latest PPL header (the one with largest generation number) and use it for recovery. The PPL entry format and recovery algorithm are the same as for single PPL approach. Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-28md: Runtime support for multiple pplsPawel Baldysiak6-8/+59
Increase PPL area to 1MB and use it as circular buffer to store PPL. The entry with highest generation number is the latest one. If PPL to be written is larger then space left in a buffer, rewind the buffer to the start (don't wrap it). Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-28dm rq: do not update rq partially in each ending bioMing Lei2-11/+8
We don't need to update the original dm request partially when ending each cloned bio: just update original dm request once when the whole cloned request is finished. This still allows full support for partial completion because a new 'completed' counter accounts for incremental progress as the clone bios complete. Partial request update can be a bit expensive, so we should try to avoid it, especially because it is run in softirq context. Avoiding all the partial request updates fixes both hard lockup and soft lockups that were easily reproduced while running Laurence's test[1] on IB/SRP. BTW, after d4acf3650c7c ("block: Make blk_mq_delay_kick_requeue_list() rerun the queue at a quiet time"), we need to make the test more aggressive for reproducing the lockup: 1) run hammer_write.sh 32 or 64 concurrently. 2) write 8M each time [1] https://marc.info/?l=linux-block&m=150220185510245&w=2 Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28dm rq: make dm-sq requeuing behavior consistent with dm-mq behaviorBart Van Assche1-4/+5
DM_MAPIO_DELAY_REQUEUE causes dm-mq to requeue after a delay but causes dm-sq to requeue immediately. Make the behavior of dm-sq consistent with that of dm-mq. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28dm mpath: complain about unsupported __multipath_map_bio() return valuesBart Van Assche1-0/+4
WARN_ONCE() if __multipath_map_bio() returns an unsupported return value. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28dm mpath: avoid that building with W=1 causes gcc 7 to complain about ↵Bart Van Assche1-0/+1
fall-through Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28dm mpath: do not lock up a CPU with requeuing activityBart Van Assche1-1/+0
When using the block layer in single queue mode, get_request() returns ERR_PTR(-EAGAIN) if the queue is dying and the REQ_NOWAIT flag has been passed to get_request(). Avoid that the kernel reports soft lockup complaints in this case due to continuous requeuing activity. Fixes: 7083abbbf ("dm mpath: avoid that path removal can trigger an infinite loop") Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Tested-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28dm: fix printk() rate limiting codeBart Van Assche1-10/+0
Using the same rate limiting state for different kinds of messages is wrong because this can cause a high frequency message to suppress a report of a low frequency message. Hence use a unique rate limiting state per message type. Fixes: 71a16736a15e ("dm: use local printk ratelimit") Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28dm mpath: retry BLK_STS_RESOURCE errorsBart Van Assche1-1/+0
Retry requests instead of failing them if an out-of-memory error occurs or the block driver below dm-mpath is busy. This restores the v4.12 behavior of noretry_error(), namely that -ENOMEM results in a retry. Fixes: 2a842acab109 ("block: introduce new block status code type") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-28dm: fix the second dec_pending() argument in __split_and_process_bio()Bart Van Assche1-1/+1
Detected by sparse. Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Laurence Oberman <loberman@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-08-25md/raid0: attach correct cgroup info in bioShaohua Li1-0/+1
The discard bio doesn't attach the original bio cgroup info. Normal bio is cloned, so is fine. Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25raid5: remove raid5_build_blockGuoqing Jiang1-10/+1
Now raid5_build_block is just called to set the sector of r5dev, raid5_compute_blocknr can be used directly for the purpose. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_showSong Liu1-2/+10
In r5c_journal_mode_show(), it is necessary to call mddev_lock() before accessing conf and conf->log. Otherwise, the conf->log may change (and become NULL). Signed-off-by: Song Liu <songliubraving@fb.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25md: replace seq_release_private with seq_releaseCihangir Akturk1-1/+1
Since commit f15146380d28 ("fs: seq_file - add event counter to simplify poll() support"), md.c code has been no longer used the private field of the struct seq_file, but seq_release_private() has been continued to be used to release the allocated seq_file. This seems to have been forgotten. So convert it to use seq_release() instead of seq_release_private(). Signed-off-by: Cihangir Akturk <cakturk@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25md: notify about new spare disk in the containerAlexey Obitotskiy1-0/+2
In case of external metadata arrays spare disks are added to containers first. mdadm keeps monitoring /proc/mdstat output and when spare disk is available, it moves it from the container to the array. The problem is there is no notification of new spare disk in the container and mdadm waits a long time (until timeout) before it takes the action. Signed-off-by: Alexey Obitotskiy <aleksey.obitotskiy@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-25md/raid1/10: reset bio allocated from mempoolShaohua Li2-4/+50
Data allocated from mempool doesn't always get initialized, this happens when the data is reused instead of fresh allocation. In the raid1/10 case, we must reinitialize the bios. Reported-by: Jonathan G. Underwood <jonathan.underwood@gmail.com> Fixes: f0250618361d(md: raid10: don't use bio's vec table to manage resync pages) Fixes: 98d30c5812c3(md: raid1: don't use bio's vec table to manage resync pages) Cc: stable@vger.kernel.org (4.12+) Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-24md/raid5: release/flush io in raid5_do_work()Song Liu1-0/+4
In raid5, there are scenarios where some ios are deferred to a later time, and some IO need a flush to complete. To make sure we make progress with these IOs, we need to call the following functions: flush_deferred_bios(conf); r5l_flush_stripe_to_raid(conf->log); Both of these functions are called in raid5d(), but missing in raid5_do_work(). As a result, these functions are not called when multi-threading (group_thread_cnt > 0) is enabled. This patch adds calls to these function to raid5_do_work(). Note for stable branches: r5l_flush_stripe_to_raid(conf->log) is need for 4.4+ flush_deferred_bios(conf) is only needed for 4.11+ Cc: stable@vger.kernel.org (4.4+) Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-24md/bitmap: copy correct data for bitmap superShaohua Li1-2/+2
raid5 cache could write bitmap superblock before bitmap superblock is initialized. The bitmap superblock is less than 512B. The current code will only copy the superblock to a new page and write the whole 512B, which will zero the the data after the superblock. Unfortunately the data could include bitmap, which we should preserve. The patch will make superblock read do 4k chunk and we always copy the 4k data to new page, so the superblock write will old data to disk and we don't change the bitmap. Reported-by: Song Liu <songliubraving@fb.com> Reviewed-by: Song Liu <songliubraving@fb.com> Cc: stable@vger.kernel.org (4.10+) Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-23block: replace bi_bdev with a gendisk pointer and partitions indexChristoph Hellwig38-153/+156
This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23raid5: remove a call to get_start_sectChristoph Hellwig1-1/+3
The block layer always remaps partitions before calling into the ->make_request methods of drivers. Thus the call to get_start_sect in in_chunk_boundary will always return 0 and can be removed. Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>