block, bfq: do not merge queues on flash storage with queueing

To boost throughput with a set of processes doing interleaved I/O (i.e., a set of processes whose individual I/O is random, but whose merged cumulative I/O is sequential), BFQ merges the queues associated with these processes, i.e., redirects the I/O of these processes into a common, shared queue. In the shared queue, I/O requests are ordered by their position on the medium, thus sequential I/O gets dispatched to the device when the shared queue is served. Queue merging costs execution time, because, to detect which queues to merge, BFQ must maintain a list of the head I/O requests of active queues, ordered by request positions. Measurements showed that this costs about 10% of BFQ's total per-request processing time. Request processing time becomes more and more critical as the speed of the underlying storage device grows. Yet, fortunately, queue merging is basically useless on the very devices that are so fast to make request processing time critical. To reach a high throughput, these devices must have many requests queued at the same time. But, in this configuration, the internal scheduling algorithms of these devices do also the job of queue merging: they reorder requests so as to obtain as much as possible a sequential I/O pattern. As a consequence, with processes doing interleaved I/O, the throughput reached by one such device is likely to be the same, with and without queue merging. In view of this fact, this commit disables queue merging, and all related housekeeping, for non-rotational devices with internal queueing. The total, single-lock-protected, per-request processing time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz (time measured with simple code instrumentation, and using the throughput-sync.sh script of the S suite [1], in performance-profiling mode). To put this result into context, the total, single-lock-protected, per-request execution time of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ). Disabling merging provides a further, remarkable benefit in terms of throughput. Merging tends to make many workloads artificially more uneven, mainly because of shared queues remaining non empty for incomparably more time than normal queues. So, if, e.g., one of the queues in a set of merged queues has a higher weight than a normal queue, then the shared queue may inherit such a high weight and, by staying almost always active, may force BFQ to perform I/O plugging most of the time. This evidently makes it harder for BFQ to let the device reach a high throughput. As a practical example of this problem, and of the benefits of this commit, we measured again the throughput in the nasty scenario considered in previous commit messages: dbench test (in the Phoronix suite), with 6 clients, on a filesystem with journaling, and with the journaling daemon enjoying a higher weight than normal processes. With this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any of the other I/O schedulers. As such, this is also likely to be the maximum possible throughput reachable with this workload on this device, because I/O is mostly random, and the other schedulers basically just pass I/O requests to the drive as fast as possible. [1] https://github.com/Algodev-github/S Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Francesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: Alessio Masola <alessio.masola@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
author: Paolo Valente <paolo.valente@linaro.org> 2019-03-12 09:59:30 +0100
committer: Jens Axboe <axboe@kernel.dk> 2019-04-01 08:15:40 -0600
commit: 8cacc5ab3eacf5284bc9b0d7d5b85b748a338104 (patch)
tree: d60138c474a05347c32f810c4ddcd614eca91348 /block/bfq-iosched.h
parent: 2341d662e9a2a5751ff8ac4ffa640fb493b0ee84 (diff)
1 files changed, 3 insertions, 0 deletions
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 26869cfbbfa9..829730b96fb2 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -497,6 +497,9 @@ struct bfq_data {
 	/* number of requests dispatched and waiting for completion */
 	int rq_in_driver;
 
+	/* true if the device is non rotational and performs queueing */
+	bool nonrot_with_queueing;
+
 	/*
 	 * Maximum number of requests in driver in the last
 	 * @hw_tag_samples completed requests.
author	Paolo Valente <paolo.valente@linaro.org>	2019-03-12 09:59:30 +0100
committer	Jens Axboe <axboe@kernel.dk>	2019-04-01 08:15:40 -0600
commit	8cacc5ab3eacf5284bc9b0d7d5b85b748a338104 (patch)
tree	d60138c474a05347c32f810c4ddcd614eca91348 /block/bfq-iosched.h
parent	2341d662e9a2a5751ff8ac4ffa640fb493b0ee84 (diff)