summaryrefslogtreecommitdiff
path: root/drivers/block/zram/zram_drv.c
AgeCommit message (Collapse)AuthorFilesLines
2021-11-06Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-18/+48
Merge misc updates from Andrew Morton: "257 patches. Subsystems affected by this patch series: scripts, ocfs2, vfs, and mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache, gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools, memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm, vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram, cleanups, kfence, and damon)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits) mm/damon: remove return value from before_terminate callback mm/damon: fix a few spelling mistakes in comments and a pr_debug message mm/damon: simplify stop mechanism Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions Docs/admin-guide/mm/damon/start: simplify the content Docs/admin-guide/mm/damon/start: fix a wrong link Docs/admin-guide/mm/damon/start: fix wrong example commands mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on mm/damon: remove unnecessary variable initialization Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM) selftests/damon: support watermarks mm/damon/dbgfs: support watermarks mm/damon/schemes: activate schemes based on a watermarks mechanism tools/selftests/damon: update for regions prioritization of schemes mm/damon/dbgfs: support prioritization weights mm/damon/vaddr,paddr: support pageout prioritization mm/damon/schemes: prioritize regions within the quotas mm/damon/selftests: support schemes quotas mm/damon/dbgfs: support quotas of schemes ...
2021-11-06zram: introduce an aged idle interfaceBrian Geffon1-16/+46
This change introduces an aged idle interface to the existing idle sysfs file for zram. When CONFIG_ZRAM_MEMORY_TRACKING is enabled the idle file now also accepts an integer argument. This integer is the age (in seconds) of pages to mark as idle. The idle file still supports 'all' as it always has. This new approach allows for much more control over which pages get marked as idle. [bgeffon@google.com: use IS_ENABLED and cleanup comment] Link: https://lkml.kernel.org/r/20210924161128.1508015-1-bgeffon@google.com [bgeffon@google.com: Sergey's cleanup suggestions] Link: https://lkml.kernel.org/r/20210929143056.13067-1-bgeffon@google.com Link: https://lkml.kernel.org/r/20210923130115.1344361-1-bgeffon@google.com Signed-off-by: Brian Geffon <bgeffon@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Jesse Barnes <jsbarnes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06zram: off by one in read_block_state()Dan Carpenter1-1/+1
snprintf() returns the number of bytes it would have printed if there were space. But it does not count the NUL terminator. So that means that if "count == copied" then this has already overflowed by one character. This bug likely isn't super harmful in real life. Link: https://lkml.kernel.org/r/20210916130404.GA25094@kili Fixes: c0265342bff4 ("zram: introduce zram memory tracking") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06zram_drv: allow reclaim on bio_allocJaewon Kim1-1/+1
The read_from_bdev_async is not called on atomic context. So GFP_NOIO is available rather than GFP_ATOMIC. If there were reclaimable pages with GFP_NOIO, we can avoid allocation failure and page fault failure. Link: https://lkml.kernel.org/r/20210908005241.28062-1-jaewon31.kim@samsung.com Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Reported-by: Yong-Taek Lee <ytk.lee@samsung.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-18block: switch polling to be bio basedChristoph Hellwig1-7/+3
Replace the blk_poll interface that requires the caller to keep a queue and cookie from the submissions with polling based on the bio. Polling for the bio itself leads to a few advantages: - the cookie construction can made entirely private in blk-mq.c - the caller does not need to remember the request_queue and cookie separately and thus sidesteps their lifetime issues - keeping the device and the cookie inside the bio allows to trivially support polling BIOs remapping by stacking drivers - a lot of code to propagate the cookie back up the submission path can be removed entirely. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-01block: move bd_mutex to struct gendiskChristoph Hellwig1-9/+9
Replace the per-block device bd_mutex with a per-gendisk open_mutex, thus simplifying locking wherever we deal with partitions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Link: https://lore.kernel.org/r/20210525061301.2242282-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-01zram: convert to blk_alloc_disk/blk_cleanup_diskChristoph Hellwig1-15/+4
Convert the zram driver to use the blk_alloc_disk and blk_cleanup_disk helpers to simplify gendisk and request_queue allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20210521055116.1053587-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-13zram: fix broken page writebackMinchan Kim1-3/+3
commit 0d8359620d9b ("zram: support page writeback") introduced two problems. It overwrites writeback_store's return value as kstrtol's return value, which makes return value zero so user could see zero as return value of write syscall even though it wrote data successfully. It also breaks index value in the loop in that it doesn't increase the index any longer. It means it can write only first starting block index so user couldn't write all idle pages in the zram so lose memory saving chance. This patch fixes those issues. Link: https://lkml.kernel.org/r/20210312173949.2197662-2-minchan@kernel.org Fixes: 0d8359620d9b("zram: support page writeback") Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Amos Bianchi <amosbianchi@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: John Dias <joaodias@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-13zram: fix return value on writeback_storeMinchan Kim1-3/+8
writeback_store's return value is overwritten by submit_bio_wait's return value. Thus, writeback_store will return zero since there was no IO error. In the end, write syscall from userspace will see the zero as return value, which could make the process stall to keep trying the write until it will succeed. Link: https://lkml.kernel.org/r/20210312173949.2197662-1-minchan@kernel.org Fixes: 3b82a051c101("drivers/block/zram/zram_drv.c: fix error return codes not being returned in writeback_store") Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: John Dias <joaodias@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26zsmalloc: account the number of compacted pages correctlyRokudo Yan1-1/+1
There exists multiple path may do zram compaction concurrently. 1. auto-compaction triggered during memory reclaim 2. userspace utils write zram<id>/compaction node So, multiple threads may call zs_shrinker_scan/zs_compact concurrently. But pages_compacted is a per zsmalloc pool variable and modification of the variable is not serialized(through under class->lock). There are two issues here: 1. the pages_compacted may not equal to total number of pages freed(due to concurrently add). 2. zs_shrinker_scan may not return the correct number of pages freed(issued by current shrinker). The fix is simple: 1. account the number of pages freed in zs_compact locally. 2. use actomic variable pages_compacted to accumulate total number. Link: https://lkml.kernel.org/r/20210202122235.26885-1-wu-yan@tcl.com Fixes: 860c707dca155a56 ("zsmalloc: account the number of compacted pages") Signed-off-by: Rokudo Yan <wu-yan@tcl.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-26zram: fix NULL check before some freeing functions is not neededTian Tao1-2/+1
fixed the below warning: /drivers/block/zram/zram_drv.c:534:2-8: WARNING: NULL check before some freeing functions is not needed. Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block: store a block_device pointer in struct bioChristoph Hellwig1-1/+1
Replace the gendisk pointer in struct bio with a pointer to the newly improved struct block device. From that the gendisk can be trivially accessed with an extra indirection, but it also allows to directly look up all information related to partition remapping. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-16Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-blockLinus Torvalds1-28/+6
Pull block updates from Jens Axboe: "Another series of killing more code than what is being added, again thanks to Christoph's relentless cleanups and tech debt tackling. This contains: - blk-iocost improvements (Baolin Wang) - part0 iostat fix (Jeffle Xu) - Disable iopoll for split bios (Jeffle Xu) - block tracepoint cleanups (Christoph Hellwig) - Merging of struct block_device and hd_struct (Christoph Hellwig) - Rework/cleanup of how block device sizes are updated (Christoph Hellwig) - Simplification of gendisk lookup and removal of block device aliasing (Christoph Hellwig) - Block device ioctl cleanups (Christoph Hellwig) - Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig) - Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig) - sbitmap improvements (Pavel Begunkov) - Hybrid polling fix (Pavel Begunkov) - bvec iteration improvements (Pavel Begunkov) - Zone revalidation fixes (Damien Le Moal) - blk-throttle limit fix (Yu Kuai) - Various little fixes" * tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits) blk-mq: fix msec comment from micro to milli seconds blk-mq: update arg in comment of blk_mq_map_queue blk-mq: add helper allocating tagset->tags Revert "block: Fix a lockdep complaint triggered by request queue flushing" nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class blk-mq: add new API of blk_mq_hctx_set_fq_lock_class block: disable iopoll for split bio block: Improve blk_revalidate_disk_zones() checks sbitmap: simplify wrap check sbitmap: replace CAS with atomic and sbitmap: remove swap_lock sbitmap: optimise sbitmap_deferred_clear() blk-mq: skip hybrid polling if iopoll doesn't spin blk-iocost: Factor out the base vrate change into a separate function blk-iocost: Factor out the active iocgs' state check into a separate function blk-iocost: Move the usage ratio calculation to the correct place blk-iocost: Remove unnecessary advance declaration blk-iocost: Fix some typos in comments blktrace: fix up a kerneldoc comment block: remove the request_queue to argument request based tracepoints ...
2020-12-15zram: break the strict dependency from lzoRui Salvaterra1-1/+1
From the beginning, the zram block device always enabled CRYPTO_LZO, since lzo-rle is hardcoded as the fallback compression algorithm. As a consequence, on systems where another compression algorithm is chosen (e.g. CRYPTO_ZSTD), the lzo kernel module becomes unused, while still having to be built/loaded. This patch removes the hardcoded lzo-rle dependency and allows the user to select the default compression algorithm for zram at build time. The previous behaviour is kept, as the default algorithm is still lzo-rle. Link: https://lkml.kernel.org/r/20201207121245.50529-1-rsalvaterra@gmail.com Signed-off-by: Rui Salvaterra <rsalvaterra@gmail.com> Suggested-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Suggested-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15zram: add stat to gather incompressible pages since zram set upMinchan Kim1-2/+4
Currently, zram supports the stat via /sys/block/zram/mm_stat to represent how many of incompressible pages are stored at the moment but it couldn't show how many times incompressible pages were wrote down since zram set up. It's also good indication to see how zram is effective in the system. Link: https://lkml.kernel.org/r/20201130201907.1284910-1-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15zram: support page writebackMinchan Kim1-4/+17
There is demand to writeback specific process pages to backing store instead of all idles pages in the system due to storage wear out concerns and to launching latency of apps which are most of the time idle but are critical for resume latency. This patch extends the writeback knob to support a specific page writeback. Link: https://lkml.kernel.org/r/20201020190506.3758660-1-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-01block: stop using bdget_disk for partition 0Christoph Hellwig1-12/+2
We can just dereference the point in struct gendisk instead. Also remove the now unused export. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01block: switch partition lookup to use struct block_deviceChristoph Hellwig1-1/+1
Use struct block_device to lookup partitions on a disk. This removes all usage of struct hd_struct from the I/O path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01block: allocate struct hd_struct as part of struct bdev_inodeChristoph Hellwig1-1/+1
Allocate hd_struct together with struct block_device to pre-load the lifetime rule changes in preparation of merging the two structures. Note that part0 was previously embedded into struct gendisk, but is a separate allocation now, and already points to the block_device instead of the hd_struct. The lifetime of struct gendisk is still controlled by the struct device embedded in the part0 hd_struct. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01zram: do not call set_blocksizeChristoph Hellwig1-10/+1
set_blocksize is used by file systems to use their preferred buffer cache block size. Block drivers should not set it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16zram: use set_capacity_and_notifyChristoph Hellwig1-5/+2
Use set_capacity_and_notify to set the size of both the disk and block device. This also gets the uevent notifications for the resize for free. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-24Merge tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-blockLinus Torvalds1-3/+5
Pull block fixes from Jens Axboe: - NVMe pull request from Christoph - rdma error handling fixes (Chao Leng) - fc error handling and reconnect fixes (James Smart) - fix the qid displace when tracing ioctl command (Keith Busch) - don't use BLK_MQ_REQ_NOWAIT for passthru (Chaitanya Kulkarni) - fix MTDT for passthru (Logan Gunthorpe) - blacklist Write Same on more devices (Kai-Heng Feng) - fix an uninitialized work struct (zhenwei pi)" - lightnvm out-of-bounds fix (Colin) - SG allocation leak fix (Doug) - rnbd fixes (Gioh, Guoqing, Jack) - zone error translation fixes (Keith) - kerneldoc markup fix (Mauro) - zram lockdep fix (Peter) - Kill unused io_context members (Yufen) - NUMA memory allocation cleanup (Xianting) - NBD config wakeup fix (Xiubo) * tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block: (27 commits) block: blk-mq: fix a kernel-doc markup nvme-fc: shorten reconnect delay if possible for FC nvme-fc: wait for queues to freeze before calling update_hr_hw_queues nvme-fc: fix error loop in create_hw_io_queues nvme-fc: fix io timeout to abort I/O null_blk: use zone status for max active/open nvmet: don't use BLK_MQ_REQ_NOWAIT for passthru nvmet: cleanup nvmet_passthru_map_sg() nvmet: limit passthru MTDS by BIO_MAX_PAGES nvmet: fix uninitialized work for zero kato nvme-pci: disable Write Zeroes on Sandisk Skyhawk nvme: use queuedata for nvme_req_qid nvme-rdma: fix crash due to incorrect cqe nvme-rdma: fix crash when connect rejected block: remove unused members for io_context blk-mq: remove the calling of local_memory_node() zram: Fix __zram_bvec_{read,write}() locking order skd_main: remove unused including <linux/version.h> sgl_alloc_order: fix memory leak lightnvm: fix out-of-bounds write to array devices->info[] ...
2020-10-19zram: Fix __zram_bvec_{read,write}() locking orderPeter Zijlstra1-3/+5
Mikhail reported a lockdep spat detailing how __zram_bvec_read() and __zram_bvec_write() use zstrm->lock and zspage->lock in opposite order. Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-16zram: failing to decompress is WARN_ON worthyDouglas Anderson1-1/+1
If we fail to decompress in zram it's a pretty serious problem. We were entrusted to be able to decompress the old data but we failed. Either we've got some crazy bug in the compression code or we've got memory corruption. At the moment, when this happens the log looks like this: ERR kernel: [ 1833.099861] zram: Decompression failed! err=-22, page=336112 ERR kernel: [ 1833.099881] zram: Decompression failed! err=-22, page=336112 ALERT kernel: [ 1833.099886] Read-error on swap-device (253:0:2688896) It is true that we have an "ALERT" level log in there, but (at least to me) it feels like even this isn't enough to impart the seriousness of this error. Let's convert to a WARN_ON. Note that WARN_ON is automatically "unlikely" so we can simply replace the old annotation with the new one. Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Sonny Rao <sonnyrao@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Link: https://lkml.kernel.org/r/20200917174059.1.If09c882545dbe432268f7a67a4d4cfcb6caace4f@changeid Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-24bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flagChristoph Hellwig1-1/+1
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the backing_dev_info shared between the block drivers and the writeback code. To help untangling the dependency replace it with a queue flag and a superblock flag derived from it. This also helps with the case of e.g. a file system requiring stable writes due to its own checksumming, but not forcing it on other users of the block device like the swap code. One downside is that we an't support the stable_pages_required bdi attribute in sysfs anymore. It is replaced with a queue attribute which also is writable for easier testing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24bdi: remove BDI_CAP_SYNCHRONOUS_IOChristoph Hellwig1-6/+13
BDI_CAP_SYNCHRONOUS_IO is only checked in the swap code, and used to decided if ->rw_page can be used on a block device. Just check up for the method instead. The only complication is that zram needs a second set of block_device_operations as it can switch between modes that actually support ->rw_page and those who don't. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23zram: cleanup backing_dev_storeChristoph Hellwig1-3/+4
Use blkdev_get_by_dev instead of bdgrab + blkdev_get. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02block: add a new revalidate_disk_size helperChristoph Hellwig1-2/+2
revalidate_disk is a relative awkward helper for driver use, as it first calls an optional driver method and then updates the block device size, while most callers either don't need the method call at all, or want to keep state between the caller and the called method. Add a revalidate_disk_size helper that just performs the update of the block device size from the gendisk one, and switch all drivers that do not implement ->revalidate_disk to use the new helper instead of revalidate_disk() Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-03Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-blockLinus Torvalds1-7/+7
Pull core block updates from Jens Axboe: "Good amount of cleanups and tech debt removals in here, and as a result, the diffstat shows a nice net reduction in code. - Softirq completion cleanups (Christoph) - Stop using ->queuedata (Christoph) - Cleanup bd claiming (Christoph) - Use check_events, moving away from the legacy media change (Christoph) - Use inode i_blkbits consistently (Christoph) - Remove old unused writeback congestion bits (Christoph) - Cleanup/unify submission path (Christoph) - Use bio_uninit consistently, instead of bio_disassociate_blkg (Christoph) - sbitmap cleared bits handling (John) - Request merging blktrace event addition (Jan) - sysfs add/remove race fixes (Luis) - blk-mq tag fixes/optimizations (Ming) - Duplicate words in comments (Randy) - Flush deferral cleanup (Yufen) - IO context locking/retry fixes (John) - struct_size() usage (Gustavo) - blk-iocost fixes (Chengming) - blk-cgroup IO stats fixes (Boris) - Various little fixes" * tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits) block: blk-timeout: delete duplicated word block: blk-mq-sched: delete duplicated word block: blk-mq: delete duplicated word block: genhd: delete duplicated words block: elevator: delete duplicated word and fix typos block: bio: delete duplicated words block: bfq-iosched: fix duplicated word iocost_monitor: start from the oldest usage index iocost: Fix check condition of iocg abs_vdebt block: Remove callback typedefs for blk_mq_ops block: Use non _rcu version of list functions for tag_set_list blk-cgroup: show global disk stats in root cgroup io.stat blk-cgroup: make iostat functions visible to stat printing block: improve discard bio alignment in __blkdev_issue_discard() block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers block: defer flush request no matter whether we have elevator block: make blk_timeout_init() static block: remove retry loop in ioc_release_fn() block: remove unnecessary ioc nested locking block: integrate bd_start_claiming into __blkdev_get ...
2020-07-01Revert "zram: convert remaining CLASS_ATTR() to CLASS_ATTR_RO()"Wade Mealing1-1/+2
Turns out that the permissions for 0400 really are what we want here, otherwise any user can read from this file. [fixed formatting, added changelog, and made attribute static - gregkh] Reported-by: Wade Mealing <wmealing@redhat.com> Cc: stable <stable@vger.kernel.org> Fixes: f40609d1591f ("zram: convert remaining CLASS_ATTR() to CLASS_ATTR_RO()") Link: https://bugzilla.redhat.com/show_bug.cgi?id=1847832 Reviewed-by: Steffen Maier <maier@linux.ibm.com> Acked-by: Minchan Kim <minchan@kernel.org> Link: https://lore.kernel.org/r/20200617114946.GA2131650@kroah.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-01block: move ->make_request_fn to struct block_device_operationsChristoph Hellwig1-5/+6
The make_request_fn is a little weird in that it sits directly in struct request_queue instead of an operation vector. Replace it with a block_device_operations method called submit_bio (which describes much better what it does). Also remove the request_queue argument to it, as the queue can be derived pretty trivially from the bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01zram: stop using ->queuedataChristoph Hellwig1-2/+1
Instead of setting up the queuedata as well just use one private data field. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27zram: nvdimm: use bio_{start,end}_io_acct and disk_{start,end}_io_acctChristoph Hellwig1-14/+10
Switch zram to use the nicer bio accounting helpers, and as part of that ensure each bio is counted as a single I/O request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-27block: simplify queue allocationChristoph Hellwig1-3/+1
Current make_request based drivers use either blk_alloc_queue_node or blk_alloc_queue to allocate a queue, and then set up the make_request_fn function pointer and a few parameters using the blk_queue_make_request helper. Simplify this by passing the make_request pointer to blk_alloc_queue, and while at it merge the _node variant into the main helper by always passing a node_id, and remove the superfluous gfp_mask parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25block: move the part_stat* helpers from genhd.h to a new headerChristoph Hellwig1-0/+1
These macros are just used by a few files. Move them out of genhd.h, which is included everywhere into a new standalone header. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-31drivers/block/zram/zram_drv.c: fix error return codes not being returned in ↵Colin Ian King1-2/+1
writeback_store Currently when an error code -EIO or -ENOSPC in the for-loop of writeback_store the error code is being overwritten by a ret = len assignment at the end of the function and the error codes are being lost. Fix this by assigning ret = len at the start of the function and remove the assignment from the end, hence allowing ret to be preserved when error codes are assigned to it. Addresses Coverity ("Unused value") Link: http://lkml.kernel.org/r/20191128122958.178290-1-colin.king@canonical.com Fixes: a939888ec38b ("zram: support idle/huge page writeback") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31zram: try to avoid worst-case scenario on same element pagesTaejoon Song1-2/+5
The worst-case scenario on finding same element pages is that almost all elements are same at the first glance but only last few elements are different. Since the same element tends to be grouped from the beginning of the pages, if we check the first element with the last element before looping through all elements, we might have some chances to quickly detect non-same element pages. 1. Test is done under LG webOS TV (64-bit arch) 2. Dump the swap-out pages (~819200 pages) 3. Analyze the pages with simple test script which counts the iteration number and measures the speed at off-line Under 64-bit arch, the worst iteration count is PAGE_SIZE / 8 bytes = 512. The speed is based on the time to consume page_same_filled() function only. The result, on average, is listed as below: Num of Iter Speed(MB/s) Looping-Forward (Orig) 38 99265 Looping-Backward 36 102725 Last-element-check (This Patch) 33 125072 The result shows that the average iteration count decreases by 13% and the speed increases by 25% with this patch. This patch does not increase the overall time complexity, though. I also ran simpler version which uses backward loop. Just looping backward also makes some improvement, but less than this patch. [taejoon.song@lge.com: fix off-by-one] Link: http://lkml.kernel.org/r/1578642001-11765-1-git-send-email-taejoon.song@lge.com Link: http://lkml.kernel.org/r/1575424418-16119-1-git-send-email-taejoon.song@lge.com Signed-off-by: Taejoon Song <taejoon.song@lge.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19zram: fix race between backing_dev_show and backing_dev_storeChenwandun1-2/+3
CPU0: CPU1: backing_dev_show backing_dev_store ...... ...... file = zram->backing_dev; down_read(&zram->init_lock); down_read(&zram->init_init_lock) file_path(file, ...); zram->backing_dev = backing_dev; up_read(&zram->init_lock); up_read(&zram->init_lock); gets the value of zram->backing_dev too early in backing_dev_show, which resultin the value being NULL at the beginning, and not NULL later. backtrace: d_path+0xcc/0x174 file_path+0x10/0x18 backing_dev_show+0x40/0xb4 dev_attr_show+0x20/0x54 sysfs_kf_seq_show+0x9c/0x10c kernfs_seq_show+0x28/0x30 seq_read+0x184/0x488 kernfs_fop_read+0x5c/0x1a4 __vfs_read+0x44/0x128 vfs_read+0xa0/0x138 SyS_read+0x54/0xb4 Link: http://lkml.kernel.org/r/1571046839-16814-1-git-send-email-chenwandun@huawei.com Signed-off-by: Chenwandun <chenwandun@huawei.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-26zram: pass down the bvec we need to read into in the work structJérôme Glisse1-2/+3
When scheduling work item to read page we need to pass down the proper bvec struct which points to the page to read into. Before this patch it uses a randomly initialized bvec (only if PAGE_SIZE != 4096) which is wrong. Note that without this patch on arch/kernel where PAGE_SIZE != 4096 userspace could read random memory through a zram block device (thought userspace probably would have no control on the address being read). Link: http://lkml.kernel.org/r/20190408183219.26377-1-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29drivers/block/zram/zram_drv.c: fix idle/writeback string compareMinchan Kim1-26/+6
Makoto report a below KASAN error: zram does out-of-bounds read. Because strscpy copies from source up to count bytes unconditionally. It could cause out-of-bounds read on next object in slab. To prevent it, use strlcpy which checks source's length automatically. BUG: KASAN: slab-out-of-bounds in strscpy+0x68/0x154 Read of size 8 at addr ffffffc0c3495a00 by task system_server/1314 .. Call trace: strscpy+0x68/0x154 idle_store+0xc4/0x34c dev_attr_store+0x50/0x6c sysfs_kf_write+0x98/0xb4 kernfs_fop_write+0x198/0x260 __vfs_write+0x10c/0x338 vfs_write+0x114/0x238 SyS_write+0xc8/0x168 __sys_trace_return+0x0/0x4 Allocated by task 1314: __kmalloc+0x280/0x318 kernfs_fop_write+0xac/0x260 __vfs_write+0x10c/0x338 vfs_write+0x114/0x238 SyS_write+0xc8/0x168 __sys_trace_return+0x0/0x4 Freed by task 2855: kfree+0x138/0x630 kernfs_put_open_node+0x10c/0x124 kernfs_fop_release+0xd8/0x114 __fput+0x130/0x2a4 ____fput+0x1c/0x28 task_work_run+0x16c/0x1c8 do_notify_resume+0x2bc/0x107c work_pending+0x8/0x10 The buggy address belongs to the object at ffffffc0c3495a00 which belongs to the cache kmalloc-128 of size 128 The buggy address is located 0 bytes inside of 128-byte region [ffffffc0c3495a00, ffffffc0c3495a80) The buggy address belongs to the page: page:ffffffbf030d2500 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0 flags: 0x4000000000010200(slab|head) page dumped because: kasan: bad access detected Memory state around the buggy address: ffffffc0c3495900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffffffc0c3495980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffffffc0c3495a00: 04 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ^ ffffffc0c3495a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffffffc0c3495b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Link: http://lkml.kernel.org/r/20190319231911.145968-1-minchan@kernel.org Cc: <stable@vger.kernel.org> [5.0] Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Makoto Wu <makotowu@google.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14zram: default to lzo-rle instead of lzoDave Rodgman1-1/+1
lzo-rle gives higher performance and similar compression ratios to lzo. Link: http://lkml.kernel.org/r/20190205155944.16007-4-dave.rodgman@arm.com Signed-off-by: Dave Rodgman <dave.rodgman@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08zram: idle writeback fixes and cleanupMinchan Kim1-24/+66
This patch includes some fixes and cleanup for idle-page writeback. 1. writeback_limit interface Now writeback_limit interface is rather conusing. For example, once writeback limit budget is exausted, admin can see 0 from /sys/block/zramX/writeback_limit which is same semantic with disable writeback_limit at this moment. IOW, admin cannot tell that zero came from disable writeback limit or exausted writeback limit. To make the interface clear, let's sepatate enable of writeback limit to another knob - /sys/block/zram0/writeback_limit_enable * before: while true : # to re-enable writeback limit once previous one is used up echo 0 > /sys/block/zram0/writeback_limit echo $((200<<20)) > /sys/block/zram0/writeback_limit .. .. # used up the writeback limit budget * new # To enable writeback limit, from the beginning, admin should # enable it. echo $((200<<20)) > /sys/block/zram0/writeback_limit echo 1 > /sys/block/zram/0/writeback_limit_enable while true : echo $((200<<20)) > /sys/block/zram0/writeback_limit .. .. # used up the writeback limit budget It's much strightforward. 2. fix condition check idle/huge writeback mode check The mode in writeback_store is not bit opeartion any more so no need to use bit operations. Furthermore, current condition check is broken in that it does writeback every pages regardless of huge/idle. 3. clean up idle_store No need to use goto. [minchan@kernel.org: missed spin_lock_init] Link: http://lkml.kernel.org/r/20190103001601.GA255139@google.com Link: http://lkml.kernel.org/r/20181224033529.19450-1-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Suggested-by: John Dias <joaodias@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: John Dias <joaodias@google.com> Cc: Srinivas Paladugu <srnvs@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28zram: writeback throttleMinchan Kim1-3/+49
If there are lots of write IO with flash device, it could have a wearout problem of storage. To overcome the problem, admin needs to design write limitation to guarantee flash health for entire product life. This patch creates a new knob "writeback_limit" for zram. writeback_limit's default value is 0 so that it doesn't limit any writeback. If admin want to measure writeback count in a certain period, he could know it via /sys/block/zram0/bd_stat's 3rd column. If admin want to limit writeback as per-day 400M, he could do it like below. MB_SHIFT=20 4K_SHIFT=12 echo $((400<<MB_SHIFT>>4K_SHIFT)) > \ /sys/block/zram0/writeback_limit. If admin want to allow further write again, he could do it like below echo 0 > /sys/block/zram0/writeback_limit If admin want to see remaining writeback budget, cat /sys/block/zram0/writeback_limit The writeback_limit count will reset whenever you reset zram (e.g., system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of writeback happened until you reset the zram to allocate extra writeback budget in next setting is user's job. [minchan@kernel.org: v4] Link: http://lkml.kernel.org/r/20181203024045.153534-8-minchan@kernel.org Link: http://lkml.kernel.org/r/20181127055429.251614-8-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Joey Pabalinas <joeypabalinas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28zram: add bd_stat statisticsMinchan Kim1-0/+29
bd_stat represents things that happened in the backing device. Currently it supports bd_counts, bd_reads and bd_writes which are helpful to understand wearout of flash and memory saving. [minchan@kernel.org: v4] Link: http://lkml.kernel.org/r/20181203024045.153534-7-minchan@kernel.org Link: http://lkml.kernel.org/r/20181127055429.251614-7-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Joey Pabalinas <joeypabalinas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28zram: support idle/huge page writebackMinchan Kim1-74/+173
Add a new feature "zram idle/huge page writeback". In the zram-swap use case, zram usually has many idle/huge swap pages. It's pointless to keep them in memory (ie, zram). To solve this problem, this feature introduces idle/huge page writeback to the backing device so the goal is to save more memory space on embedded systems. Normal sequence to use idle/huge page writeback feature is as follows, while (1) { # mark allocated zram slot to idle echo all > /sys/block/zram0/idle # leave system working for several hours # Unless there is no access for some blocks on zram, # they are still IDLE marked pages. echo "idle" > /sys/block/zram0/writeback or/and echo "huge" > /sys/block/zram0/writeback # write the IDLE or/and huge marked slot into backing device # and free the memory. } Per the discussion at https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u, This patch removes direct incommpressibe page writeback feature (d2afd25114f4 ("zram: write incompressible pages to backing device")). Below concerns from Sergey: == &< == "IDLE writeback" is superior to "incompressible writeback". "incompressible writeback" is completely unpredictable and uncontrollable; it depens on data patterns and compression algorithms. While "IDLE writeback" is predictable. I even suspect, that, *ideally*, we can remove "incompressible writeback". "IDLE pages" is a super set which also includes "incompressible" pages. So, technically, we still can do "incompressible writeback" from "IDLE writeback" path; but a much more reasonable one, based on a page idling period. I understand that you want to keep "direct incompressible writeback" around. ZRAM is especially popular on devices which do suffer from flash wearout, so I can see "incompressible writeback" path becoming a dead code, long term. == &< == Below concerns from Minchan: == &< == My concern is if we enable CONFIG_ZRAM_WRITEBACK in this implementation, both hugepage/idlepage writeck will turn on. However someuser want to enable only idlepage writeback so we need to introduce turn on/off knob for hugepage or new CONFIG_ZRAM_IDLEPAGE_WRITEBACK for those usecase. I don't want to make it complicated *if possible*. Long term, I imagine we need to make VM aware of new swap hierarchy a little bit different with as-is. For example, first high priority swap can return -EIO or -ENOCOMP, swap try to fallback to next lower priority swap device. With that, hugepage writeback will work tranparently. So we could regard it as regression because incompressible pages doesn't go to backing storage automatically. Instead, user should do it via "echo huge" > /sys/block/zram/writeback" manually. == &< == Link: http://lkml.kernel.org/r/20181127055429.251614-6-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28zram: introduce ZRAM_IDLE flagMinchan Kim1-3/+54
To support idle page writeback with upcoming patches, this patch introduces a new ZRAM_IDLE flag. Userspace can mark zram slots as "idle" via "echo all > /sys/block/zramX/idle" which marks every allocated zram slot as ZRAM_IDLE. User could see it by /sys/kernel/debug/zram/zram0/block_state. 300 75.033841 ...i 301 63.806904 s..i 302 63.806919 ..hi Once there is IO for the slot, the mark will be disappeared. 300 75.033841 ... 301 63.806904 s..i 302 63.806919 ..hi Therefore, 300th block is idle zpage. With this feature, user can how many zram has idle pages which are waste of memory. Link: http://lkml.kernel.org/r/20181127055429.251614-5-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28zram: refactor flags and writeback stuffMinchan Kim1-65/+40
Rename some variables and restructure some code for better readability in writeback and zs_free_page. Link: http://lkml.kernel.org/r/20181127055429.251614-4-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28zram: fix double free backing deviceMinchan Kim1-1/+3
If blkdev_get fails, we shouldn't do blkdev_put. Otherwise, kernel emits below log. This patch fixes it. WARNING: CPU: 0 PID: 1893 at fs/block_dev.c:1828 blkdev_put+0x105/0x120 Modules linked in: CPU: 0 PID: 1893 Comm: swapoff Not tainted 4.19.0+ #453 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 RIP: 0010:blkdev_put+0x105/0x120 Call Trace: __x64_sys_swapoff+0x46d/0x490 do_syscall_64+0x5a/0x190 entry_SYSCALL_64_after_hwframe+0x49/0xbe irq event stamp: 4466 hardirqs last enabled at (4465): __free_pages_ok+0x1e3/0x490 hardirqs last disabled at (4466): trace_hardirqs_off_thunk+0x1a/0x1c softirqs last enabled at (3420): __do_softirq+0x333/0x446 softirqs last disabled at (3407): irq_exit+0xd1/0xe0 Link: http://lkml.kernel.org/r/20181127055429.251614-3-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28zram: fix lockdep warning of free block handlingMinchan Kim1-17/+21
Patch series "zram idle page writeback", v3. Inherently, swap device has many idle pages which are rare touched since it was allocated. It is never problem if we use storage device as swap. However, it's just waste for zram-swap. This patchset supports zram idle page writeback feature. * Admin can define what is idle page "no access since X time ago" * Admin can define when zram should writeback them * Admin can define when zram should stop writeback to prevent wearout Details are in each patch's description. This patch (of 7): ================================ WARNING: inconsistent lock state 4.19.0+ #390 Not tainted -------------------------------- inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. zram_verify/2095 [HC0[0]:SC1[1]:HE1:SE0] takes: 00000000b1828693 (&(&zram->bitmap_lock)->rlock){+.?.}, at: put_entry_bdev+0x1e/0x50 {SOFTIRQ-ON-W} state was registered at: _raw_spin_lock+0x2c/0x40 zram_make_request+0x755/0xdc9 generic_make_request+0x373/0x6a0 submit_bio+0x6c/0x140 __swap_writepage+0x3a8/0x480 shrink_page_list+0x1102/0x1a60 shrink_inactive_list+0x21b/0x3f0 shrink_node_memcg.constprop.99+0x4f8/0x7e0 shrink_node+0x7d/0x2f0 do_try_to_free_pages+0xe0/0x300 try_to_free_pages+0x116/0x2b0 __alloc_pages_slowpath+0x3f4/0xf80 __alloc_pages_nodemask+0x2a2/0x2f0 __handle_mm_fault+0x42e/0xb50 handle_mm_fault+0x55/0xb0 __do_page_fault+0x235/0x4b0 page_fault+0x1e/0x30 irq event stamp: 228412 hardirqs last enabled at (228412): [<ffffffff98245846>] __slab_free+0x3e6/0x600 hardirqs last disabled at (228411): [<ffffffff98245625>] __slab_free+0x1c5/0x600 softirqs last enabled at (228396): [<ffffffff98e0031e>] __do_softirq+0x31e/0x427 softirqs last disabled at (228403): [<ffffffff98072051>] irq_exit+0xd1/0xe0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&zram->bitmap_lock)->rlock); <Interrupt> lock(&(&zram->bitmap_lock)->rlock); *** DEADLOCK *** no locks held by zram_verify/2095. stack backtrace: CPU: 5 PID: 2095 Comm: zram_verify Not tainted 4.19.0+ #390 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: <IRQ> dump_stack+0x67/0x9b print_usage_bug+0x1bd/0x1d3 mark_lock+0x4aa/0x540 __lock_acquire+0x51d/0x1300 lock_acquire+0x90/0x180 _raw_spin_lock+0x2c/0x40 put_entry_bdev+0x1e/0x50 zram_free_page+0xf6/0x110 zram_slot_free_notify+0x42/0xa0 end_swap_bio_read+0x5b/0x170 blk_update_request+0x8f/0x340 scsi_end_request+0x2c/0x1e0 scsi_io_completion+0x98/0x650 blk_done_softirq+0x9e/0xd0 __do_softirq+0xcc/0x427 irq_exit+0xd1/0xe0 do_IRQ+0x93/0x120 common_interrupt+0xf/0xf </IRQ> With writeback feature, zram_slot_free_notify could be called in softirq context by end_swap_bio_read. However, bitmap_lock is not aware of that so lockdep yell out: get_entry_bdev spin_lock(bitmap->lock); irq softirq end_swap_bio_read zram_slot_free_notify zram_slot_lock <-- deadlock prone zram_free_page put_entry_bdev spin_lock(bitmap->lock); <-- deadlock prone With akpm's suggestion (i.e. bitmap operation is already atomic), we could remove bitmap lock. It might fail to find a empty slot if serious contention happens. However, it's not severe problem because huge page writeback has already possiblity to fail if there is severe memory pressure. Worst case is just keeping the incompressible in memory, not storage. The other problem is zram_slot_lock in zram_slot_slot_free_notify. To make it safe is this patch introduces zram_slot_trylock where zram_slot_free_notify uses it. Although it's rare to be contented, this patch adds new debug stat "miss_free" to keep monitoring how often it happens. Link: http://lkml.kernel.org/r/20181127055429.251614-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-28zram: register default groups with device_add_disk()Hannes Reinecke1-21/+7
Register default sysfs groups during device_add_disk() to avoid a race condition with udev during startup. Signed-off-by: Hannes Reinecke <hare@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>