summaryrefslogtreecommitdiff
path: root/Documentation
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2019-09-17 16:57:47 -0700
committerLinus Torvalds <torvalds@linux-foundation.org>2019-09-17 16:57:47 -0700
commit7ad67ca5534ee7c958559c4ad610f05c4578e361 (patch)
treedc6b6a8a6b70b5f25b07bcdc06d8e77e705f6822 /Documentation
parent5260c2b863ef1152445ce93476c95d8c8a727eef (diff)
parent9c7eddf1b080f98fed1aadb74fe784f29bf77a08 (diff)
Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe: - Two NVMe pull requests: - ana log parse fix from Anton - nvme quirks support for Apple devices from Ben - fix missing bio completion tracing for multipath stack devices from Hannes and Mikhail - IP TOS settings for nvme rdma and tcp transports from Israel - rq_dma_dir cleanups from Israel - tracing for Get LBA Status command from Minwoo - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself - Some consolidation between the fabrics transports for handling the CAP register - reset race with ns scanning fix for fabrics (move fabrics commands to a dedicated request queue with a different lifetime from the admin request queue)." - controller reset and namespace scan races fixes - nvme discovery log change uevent support - naming improvements from Keith - multiple discovery controllers reject fix from James - some regular cleanups from various people - Series fixing (and re-fixing) null_blk debug printing and nr_devices checks (André) - A few pull requests from Song, with fixes from Andy, Guoqing, Guilherme, Neil, Nigel, and Yufen. - REQ_OP_ZONE_RESET_ALL support (Chaitanya) - Bio merge handling unification (Christoph) - Pick default elevator correctly for devices with special needs (Damien) - Block stats fixes (Hou) - Timeout and support devices nbd fixes (Mike) - Series fixing races around elevator switching and device add/remove (Ming) - sed-opal cleanups (Revanth) - Per device weight support for BFQ (Fam) - Support for blk-iocost, a new model that can properly account cost of IO workloads. (Tejun) - blk-cgroup writeback fixes (Tejun) - paride queue init fixes (zhengbin) - blk_set_runtime_active() cleanup (Stanley) - Block segment mapping optimizations (Bart) - lightnvm fixes (Hans/Minwoo/YueHaibing) - Various little fixes and cleanups * tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits) null_blk: format pr_* logs with pr_fmt null_blk: match the type of parameter nr_devices null_blk: do not fail the module load with zero devices block: also check RQF_STATS in blk_mq_need_time_stamp() block: make rq sector size accessible for block stats bfq: Fix bfq linkage error raid5: use bio_end_sector in r5_next_bio raid5: remove STRIPE_OPS_REQ_PENDING md: add feature flag MD_FEATURE_RAID0_LAYOUT md/raid0: avoid RAID0 data corruption due to layout confusion. raid5: don't set STRIPE_HANDLE to stripe which is in batch list raid5: don't increment read_errors on EILSEQ return nvmet: fix a wrong error status returned in error log page nvme: send discovery log page change events to userspace nvme: add uevent variables for controller devices nvme: enable aen regardless of the presence of I/O queues nvme-fabrics: allow discovery subsystems accept a kato nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery() nvme: Remove redundant assignment of cq vector nvme: Assign subsys instance from first ctrl ...
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/admin-guide/cgroup-v2.rst97
-rw-r--r--Documentation/admin-guide/kernel-parameters.txt6
-rw-r--r--Documentation/admin-guide/kernel-per-CPU-kthreads.rst8
-rw-r--r--Documentation/block/null_blk.rst33
-rw-r--r--Documentation/block/switching-sched.rst4
5 files changed, 118 insertions, 30 deletions
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 5f1c266131b0..0fa8c0e615c2 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1469,6 +1469,103 @@ IO Interface Files
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
+ io.cost.qos
+ A read-write nested-keyed file with exists only on the root
+ cgroup.
+
+ This file configures the Quality of Service of the IO cost
+ model based controller (CONFIG_BLK_CGROUP_IOCOST) which
+ currently implements "io.weight" proportional control. Lines
+ are keyed by $MAJ:$MIN device numbers and not ordered. The
+ line for a given device is populated on the first write for
+ the device on "io.cost.qos" or "io.cost.model". The following
+ nested keys are defined.
+
+ ====== =====================================
+ enable Weight-based control enable
+ ctrl "auto" or "user"
+ rpct Read latency percentile [0, 100]
+ rlat Read latency threshold
+ wpct Write latency percentile [0, 100]
+ wlat Write latency threshold
+ min Minimum scaling percentage [1, 10000]
+ max Maximum scaling percentage [1, 10000]
+ ====== =====================================
+
+ The controller is disabled by default and can be enabled by
+ setting "enable" to 1. "rpct" and "wpct" parameters default
+ to zero and the controller uses internal device saturation
+ state to adjust the overall IO rate between "min" and "max".
+
+ When a better control quality is needed, latency QoS
+ parameters can be configured. For example::
+
+ 8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0
+
+ shows that on sdb, the controller is enabled, will consider
+ the device saturated if the 95th percentile of read completion
+ latencies is above 75ms or write 150ms, and adjust the overall
+ IO issue rate between 50% and 150% accordingly.
+
+ The lower the saturation point, the better the latency QoS at
+ the cost of aggregate bandwidth. The narrower the allowed
+ adjustment range between "min" and "max", the more conformant
+ to the cost model the IO behavior. Note that the IO issue
+ base rate may be far off from 100% and setting "min" and "max"
+ blindly can lead to a significant loss of device capacity or
+ control quality. "min" and "max" are useful for regulating
+ devices which show wide temporary behavior changes - e.g. a
+ ssd which accepts writes at the line speed for a while and
+ then completely stalls for multiple seconds.
+
+ When "ctrl" is "auto", the parameters are controlled by the
+ kernel and may change automatically. Setting "ctrl" to "user"
+ or setting any of the percentile and latency parameters puts
+ it into "user" mode and disables the automatic changes. The
+ automatic mode can be restored by setting "ctrl" to "auto".
+
+ io.cost.model
+ A read-write nested-keyed file with exists only on the root
+ cgroup.
+
+ This file configures the cost model of the IO cost model based
+ controller (CONFIG_BLK_CGROUP_IOCOST) which currently
+ implements "io.weight" proportional control. Lines are keyed
+ by $MAJ:$MIN device numbers and not ordered. The line for a
+ given device is populated on the first write for the device on
+ "io.cost.qos" or "io.cost.model". The following nested keys
+ are defined.
+
+ ===== ================================
+ ctrl "auto" or "user"
+ model The cost model in use - "linear"
+ ===== ================================
+
+ When "ctrl" is "auto", the kernel may change all parameters
+ dynamically. When "ctrl" is set to "user" or any other
+ parameters are written to, "ctrl" become "user" and the
+ automatic changes are disabled.
+
+ When "model" is "linear", the following model parameters are
+ defined.
+
+ ============= ========================================
+ [r|w]bps The maximum sequential IO throughput
+ [r|w]seqiops The maximum 4k sequential IOs per second
+ [r|w]randiops The maximum 4k random IOs per second
+ ============= ========================================
+
+ From the above, the builtin linear model determines the base
+ costs of a sequential and random IO and the cost coefficient
+ for the IO size. While simple, this model can cover most
+ common device classes acceptably.
+
+ The IO cost model isn't expected to be accurate in absolute
+ sense and is scaled to the device behavior dynamically.
+
+ If needed, tools/cgroup/iocost_coef_gen.py can be used to
+ generate device-specific coefficients.
+
io.weight
A read-write flat-keyed file which exists on non-root cgroups.
The default is "default 100".
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3e2d22eb5e59..6ef205fd7c97 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1201,12 +1201,6 @@
See comment before function elanfreq_setup() in
arch/x86/kernel/cpu/cpufreq/elanfreq.c.
- elevator= [IOSCHED]
- Format: { "mq-deadline" | "kyber" | "bfq" }
- See Documentation/block/deadline-iosched.rst,
- Documentation/block/kyber-iosched.rst and
- Documentation/block/bfq-iosched.rst for details.
-
elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390]
Specifies physical address of start of kernel core
image elf header and optionally the size. Generally
diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
index 4f18456dd3b1..baeeba8762ae 100644
--- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
+++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
@@ -274,9 +274,7 @@ To reduce its OS jitter, do any of the following:
(based on an earlier one from Gilad Ben-Yossef) that
reduces or even eliminates vmstat overhead for some
workloads at https://lkml.org/lkml/2013/9/4/379.
- e. Boot with "elevator=noop" to avoid workqueue use by
- the block layer.
- f. If running on high-end powerpc servers, build with
+ e. If running on high-end powerpc servers, build with
CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS
daemon from running on each CPU every second or so.
(This will require editing Kconfig files and will defeat
@@ -284,12 +282,12 @@ To reduce its OS jitter, do any of the following:
due to the rtas_event_scan() function.
WARNING: Please check your CPU specifications to
make sure that this is safe on your particular system.
- g. If running on Cell Processor, build your kernel with
+ f. If running on Cell Processor, build your kernel with
CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from
spu_gov_work().
WARNING: Please check your CPU specifications to
make sure that this is safe on your particular system.
- h. If running on PowerMAC, build your kernel with
+ g. If running on PowerMAC, build your kernel with
CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
avoiding OS jitter from rackmeter_do_timer().
diff --git a/Documentation/block/null_blk.rst b/Documentation/block/null_blk.rst
index 31451d80783c..edbbab2f12f8 100644
--- a/Documentation/block/null_blk.rst
+++ b/Documentation/block/null_blk.rst
@@ -1,19 +1,16 @@
+.. SPDX-License-Identifier: GPL-2.0
+
========================
Null block device driver
========================
-1. Overview
-===========
+Overview
+========
-The null block device (/dev/nullb*) is used for benchmarking the various
+The null block device (``/dev/nullb*``) is used for benchmarking the various
block-layer implementations. It emulates a block device of X gigabytes in size.
-The following instances are possible:
-
- Single-queue block-layer
-
- - Request-based.
- - Single submission queue per device.
- - Implements IO scheduling algorithms (CFQ, Deadline, noop).
+It does not execute any read/write operation, just mark them as complete in
+the request queue. The following instances are possible:
Multi-queue block-layer
@@ -27,15 +24,15 @@ The following instances are possible:
All of them have a completion queue for each core in the system.
-2. Module parameters applicable for all instances
-=================================================
+Module parameters
+=================
queue_mode=[0-2]: Default: 2-Multi-queue
Selects which block-layer the module should instantiate with.
= ============
0 Bio-based
- 1 Single-queue
+ 1 Single-queue (deprecated)
2 Multi-queue
= ============
@@ -67,7 +64,7 @@ irqmode=[0-2]: Default: 1-Soft-irq
completion_nsec=[ns]: Default: 10,000ns
Combined with irqmode=2 (timer). The time each completion event must wait.
-submit_queues=[1..nr_cpus]:
+submit_queues=[1..nr_cpus]: Default: 1
The number of submission queues attached to the device driver. If unset, it
defaults to 1. For multi-queue, it is ignored when use_per_node_hctx module
parameter is 1.
@@ -75,9 +72,11 @@ submit_queues=[1..nr_cpus]:
hw_queue_depth=[0..qdepth]: Default: 64
The hardware queue depth of the device.
-III: Multi-queue specific parameters
+Multi-queue specific parameters
+-------------------------------
use_per_node_hctx=[0/1]: Default: 0
+ Number of hardware context queues.
= =====================================================================
0 The number of submit queues are set to the value of the submit_queues
@@ -87,6 +86,7 @@ use_per_node_hctx=[0/1]: Default: 0
= =====================================================================
no_sched=[0/1]: Default: 0
+ Enable/disable the io scheduler.
= ======================================
0 nullb* use default blk-mq io scheduler
@@ -94,6 +94,7 @@ no_sched=[0/1]: Default: 0
= ======================================
blocking=[0/1]: Default: 0
+ Blocking behavior of the request queue.
= ===============================================================
0 Register as a non-blocking blk-mq driver device.
@@ -103,6 +104,7 @@ blocking=[0/1]: Default: 0
= ===============================================================
shared_tags=[0/1]: Default: 0
+ Sharing tags between devices.
= ================================================================
0 Tag set is not shared.
@@ -111,6 +113,7 @@ shared_tags=[0/1]: Default: 0
= ================================================================
zoned=[0/1]: Default: 0
+ Device is a random-access or a zoned block device.
= ======================================================================
0 Block device is exposed as a random-access block device.
diff --git a/Documentation/block/switching-sched.rst b/Documentation/block/switching-sched.rst
index 42042417380e..520f6b857544 100644
--- a/Documentation/block/switching-sched.rst
+++ b/Documentation/block/switching-sched.rst
@@ -2,10 +2,6 @@
Switching Scheduler
===================
-To choose IO schedulers at boot time, use the argument 'elevator=deadline'.
-'noop' and 'cfq' (the default) are also available. IO schedulers are assigned
-globally at boot time only presently.
-
Each io queue has a set of io scheduler tunables associated with it. These
tunables control how the io scheduler works. You can find these entries
in::