diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2019-09-17 16:57:47 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2019-09-17 16:57:47 -0700 |
commit | 7ad67ca5534ee7c958559c4ad610f05c4578e361 (patch) | |
tree | dc6b6a8a6b70b5f25b07bcdc06d8e77e705f6822 /Documentation | |
parent | 5260c2b863ef1152445ce93476c95d8c8a727eef (diff) | |
parent | 9c7eddf1b080f98fed1aadb74fe784f29bf77a08 (diff) |
Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
- Two NVMe pull requests:
- ana log parse fix from Anton
- nvme quirks support for Apple devices from Ben
- fix missing bio completion tracing for multipath stack devices
from Hannes and Mikhail
- IP TOS settings for nvme rdma and tcp transports from Israel
- rq_dma_dir cleanups from Israel
- tracing for Get LBA Status command from Minwoo
- Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
- Some consolidation between the fabrics transports for handling
the CAP register
- reset race with ns scanning fix for fabrics (move fabrics
commands to a dedicated request queue with a different lifetime
from the admin request queue)."
- controller reset and namespace scan races fixes
- nvme discovery log change uevent support
- naming improvements from Keith
- multiple discovery controllers reject fix from James
- some regular cleanups from various people
- Series fixing (and re-fixing) null_blk debug printing and nr_devices
checks (André)
- A few pull requests from Song, with fixes from Andy, Guoqing,
Guilherme, Neil, Nigel, and Yufen.
- REQ_OP_ZONE_RESET_ALL support (Chaitanya)
- Bio merge handling unification (Christoph)
- Pick default elevator correctly for devices with special needs
(Damien)
- Block stats fixes (Hou)
- Timeout and support devices nbd fixes (Mike)
- Series fixing races around elevator switching and device add/remove
(Ming)
- sed-opal cleanups (Revanth)
- Per device weight support for BFQ (Fam)
- Support for blk-iocost, a new model that can properly account cost of
IO workloads. (Tejun)
- blk-cgroup writeback fixes (Tejun)
- paride queue init fixes (zhengbin)
- blk_set_runtime_active() cleanup (Stanley)
- Block segment mapping optimizations (Bart)
- lightnvm fixes (Hans/Minwoo/YueHaibing)
- Various little fixes and cleanups
* tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits)
null_blk: format pr_* logs with pr_fmt
null_blk: match the type of parameter nr_devices
null_blk: do not fail the module load with zero devices
block: also check RQF_STATS in blk_mq_need_time_stamp()
block: make rq sector size accessible for block stats
bfq: Fix bfq linkage error
raid5: use bio_end_sector in r5_next_bio
raid5: remove STRIPE_OPS_REQ_PENDING
md: add feature flag MD_FEATURE_RAID0_LAYOUT
md/raid0: avoid RAID0 data corruption due to layout confusion.
raid5: don't set STRIPE_HANDLE to stripe which is in batch list
raid5: don't increment read_errors on EILSEQ return
nvmet: fix a wrong error status returned in error log page
nvme: send discovery log page change events to userspace
nvme: add uevent variables for controller devices
nvme: enable aen regardless of the presence of I/O queues
nvme-fabrics: allow discovery subsystems accept a kato
nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
nvme: Remove redundant assignment of cq vector
nvme: Assign subsys instance from first ctrl
...
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/admin-guide/cgroup-v2.rst | 97 | ||||
-rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 6 | ||||
-rw-r--r-- | Documentation/admin-guide/kernel-per-CPU-kthreads.rst | 8 | ||||
-rw-r--r-- | Documentation/block/null_blk.rst | 33 | ||||
-rw-r--r-- | Documentation/block/switching-sched.rst | 4 |
5 files changed, 118 insertions, 30 deletions
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 5f1c266131b0..0fa8c0e615c2 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1469,6 +1469,103 @@ IO Interface Files 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021 + io.cost.qos + A read-write nested-keyed file with exists only on the root + cgroup. + + This file configures the Quality of Service of the IO cost + model based controller (CONFIG_BLK_CGROUP_IOCOST) which + currently implements "io.weight" proportional control. Lines + are keyed by $MAJ:$MIN device numbers and not ordered. The + line for a given device is populated on the first write for + the device on "io.cost.qos" or "io.cost.model". The following + nested keys are defined. + + ====== ===================================== + enable Weight-based control enable + ctrl "auto" or "user" + rpct Read latency percentile [0, 100] + rlat Read latency threshold + wpct Write latency percentile [0, 100] + wlat Write latency threshold + min Minimum scaling percentage [1, 10000] + max Maximum scaling percentage [1, 10000] + ====== ===================================== + + The controller is disabled by default and can be enabled by + setting "enable" to 1. "rpct" and "wpct" parameters default + to zero and the controller uses internal device saturation + state to adjust the overall IO rate between "min" and "max". + + When a better control quality is needed, latency QoS + parameters can be configured. For example:: + + 8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0 + + shows that on sdb, the controller is enabled, will consider + the device saturated if the 95th percentile of read completion + latencies is above 75ms or write 150ms, and adjust the overall + IO issue rate between 50% and 150% accordingly. + + The lower the saturation point, the better the latency QoS at + the cost of aggregate bandwidth. The narrower the allowed + adjustment range between "min" and "max", the more conformant + to the cost model the IO behavior. Note that the IO issue + base rate may be far off from 100% and setting "min" and "max" + blindly can lead to a significant loss of device capacity or + control quality. "min" and "max" are useful for regulating + devices which show wide temporary behavior changes - e.g. a + ssd which accepts writes at the line speed for a while and + then completely stalls for multiple seconds. + + When "ctrl" is "auto", the parameters are controlled by the + kernel and may change automatically. Setting "ctrl" to "user" + or setting any of the percentile and latency parameters puts + it into "user" mode and disables the automatic changes. The + automatic mode can be restored by setting "ctrl" to "auto". + + io.cost.model + A read-write nested-keyed file with exists only on the root + cgroup. + + This file configures the cost model of the IO cost model based + controller (CONFIG_BLK_CGROUP_IOCOST) which currently + implements "io.weight" proportional control. Lines are keyed + by $MAJ:$MIN device numbers and not ordered. The line for a + given device is populated on the first write for the device on + "io.cost.qos" or "io.cost.model". The following nested keys + are defined. + + ===== ================================ + ctrl "auto" or "user" + model The cost model in use - "linear" + ===== ================================ + + When "ctrl" is "auto", the kernel may change all parameters + dynamically. When "ctrl" is set to "user" or any other + parameters are written to, "ctrl" become "user" and the + automatic changes are disabled. + + When "model" is "linear", the following model parameters are + defined. + + ============= ======================================== + [r|w]bps The maximum sequential IO throughput + [r|w]seqiops The maximum 4k sequential IOs per second + [r|w]randiops The maximum 4k random IOs per second + ============= ======================================== + + From the above, the builtin linear model determines the base + costs of a sequential and random IO and the cost coefficient + for the IO size. While simple, this model can cover most + common device classes acceptably. + + The IO cost model isn't expected to be accurate in absolute + sense and is scaled to the device behavior dynamically. + + If needed, tools/cgroup/iocost_coef_gen.py can be used to + generate device-specific coefficients. + io.weight A read-write flat-keyed file which exists on non-root cgroups. The default is "default 100". diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 3e2d22eb5e59..6ef205fd7c97 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1201,12 +1201,6 @@ See comment before function elanfreq_setup() in arch/x86/kernel/cpu/cpufreq/elanfreq.c. - elevator= [IOSCHED] - Format: { "mq-deadline" | "kyber" | "bfq" } - See Documentation/block/deadline-iosched.rst, - Documentation/block/kyber-iosched.rst and - Documentation/block/bfq-iosched.rst for details. - elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390] Specifies physical address of start of kernel core image elf header and optionally the size. Generally diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst index 4f18456dd3b1..baeeba8762ae 100644 --- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst +++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst @@ -274,9 +274,7 @@ To reduce its OS jitter, do any of the following: (based on an earlier one from Gilad Ben-Yossef) that reduces or even eliminates vmstat overhead for some workloads at https://lkml.org/lkml/2013/9/4/379. - e. Boot with "elevator=noop" to avoid workqueue use by - the block layer. - f. If running on high-end powerpc servers, build with + e. If running on high-end powerpc servers, build with CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS daemon from running on each CPU every second or so. (This will require editing Kconfig files and will defeat @@ -284,12 +282,12 @@ To reduce its OS jitter, do any of the following: due to the rtas_event_scan() function. WARNING: Please check your CPU specifications to make sure that this is safe on your particular system. - g. If running on Cell Processor, build your kernel with + f. If running on Cell Processor, build your kernel with CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from spu_gov_work(). WARNING: Please check your CPU specifications to make sure that this is safe on your particular system. - h. If running on PowerMAC, build your kernel with + g. If running on PowerMAC, build your kernel with CONFIG_PMAC_RACKMETER=n to disable the CPU-meter, avoiding OS jitter from rackmeter_do_timer(). diff --git a/Documentation/block/null_blk.rst b/Documentation/block/null_blk.rst index 31451d80783c..edbbab2f12f8 100644 --- a/Documentation/block/null_blk.rst +++ b/Documentation/block/null_blk.rst @@ -1,19 +1,16 @@ +.. SPDX-License-Identifier: GPL-2.0 + ======================== Null block device driver ======================== -1. Overview -=========== +Overview +======== -The null block device (/dev/nullb*) is used for benchmarking the various +The null block device (``/dev/nullb*``) is used for benchmarking the various block-layer implementations. It emulates a block device of X gigabytes in size. -The following instances are possible: - - Single-queue block-layer - - - Request-based. - - Single submission queue per device. - - Implements IO scheduling algorithms (CFQ, Deadline, noop). +It does not execute any read/write operation, just mark them as complete in +the request queue. The following instances are possible: Multi-queue block-layer @@ -27,15 +24,15 @@ The following instances are possible: All of them have a completion queue for each core in the system. -2. Module parameters applicable for all instances -================================================= +Module parameters +================= queue_mode=[0-2]: Default: 2-Multi-queue Selects which block-layer the module should instantiate with. = ============ 0 Bio-based - 1 Single-queue + 1 Single-queue (deprecated) 2 Multi-queue = ============ @@ -67,7 +64,7 @@ irqmode=[0-2]: Default: 1-Soft-irq completion_nsec=[ns]: Default: 10,000ns Combined with irqmode=2 (timer). The time each completion event must wait. -submit_queues=[1..nr_cpus]: +submit_queues=[1..nr_cpus]: Default: 1 The number of submission queues attached to the device driver. If unset, it defaults to 1. For multi-queue, it is ignored when use_per_node_hctx module parameter is 1. @@ -75,9 +72,11 @@ submit_queues=[1..nr_cpus]: hw_queue_depth=[0..qdepth]: Default: 64 The hardware queue depth of the device. -III: Multi-queue specific parameters +Multi-queue specific parameters +------------------------------- use_per_node_hctx=[0/1]: Default: 0 + Number of hardware context queues. = ===================================================================== 0 The number of submit queues are set to the value of the submit_queues @@ -87,6 +86,7 @@ use_per_node_hctx=[0/1]: Default: 0 = ===================================================================== no_sched=[0/1]: Default: 0 + Enable/disable the io scheduler. = ====================================== 0 nullb* use default blk-mq io scheduler @@ -94,6 +94,7 @@ no_sched=[0/1]: Default: 0 = ====================================== blocking=[0/1]: Default: 0 + Blocking behavior of the request queue. = =============================================================== 0 Register as a non-blocking blk-mq driver device. @@ -103,6 +104,7 @@ blocking=[0/1]: Default: 0 = =============================================================== shared_tags=[0/1]: Default: 0 + Sharing tags between devices. = ================================================================ 0 Tag set is not shared. @@ -111,6 +113,7 @@ shared_tags=[0/1]: Default: 0 = ================================================================ zoned=[0/1]: Default: 0 + Device is a random-access or a zoned block device. = ====================================================================== 0 Block device is exposed as a random-access block device. diff --git a/Documentation/block/switching-sched.rst b/Documentation/block/switching-sched.rst index 42042417380e..520f6b857544 100644 --- a/Documentation/block/switching-sched.rst +++ b/Documentation/block/switching-sched.rst @@ -2,10 +2,6 @@ Switching Scheduler =================== -To choose IO schedulers at boot time, use the argument 'elevator=deadline'. -'noop' and 'cfq' (the default) are also available. IO schedulers are assigned -globally at boot time only presently. - Each io queue has a set of io scheduler tunables associated with it. These tunables control how the io scheduler works. You can find these entries in:: |