Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull block layer updates from Jens Axboe:
- Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
was initially a fork of CFQ, but subsequently changed to implement
fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
to be used on desktop type single drives, providing good fairness.
From Paolo.
- Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
using a scalable token based algorithm that throttles IO based on
live completion IO stats, similary to blk-wbt. From Omar.
- A series from Jan, moving users to separately allocated backing
devices. This continues the work of separating backing device life
times, solving various problems with hot removal.
- A series of updates for lightnvm, mostly from Javier. Includes a
'pblk' target that exposes an open channel SSD as a physical block
device.
- A series of fixes and improvements for nbd from Josef.
- A series from Omar, removing queue sharing between devices on mostly
legacy drivers. This helps us clean up other bits, if we know that a
queue only has a single device backing. This has been overdue for
more than a decade.
- Fixes for the blk-stats, and improvements to unify the stats and user
windows. This both improves blk-wbt, and enables other users to
register a need to receive IO stats for a device. From Omar.
- blk-throttle improvements from Shaohua. This provides a scalable
framework for implementing scalable priotization - particularly for
blk-mq, but applicable to any type of block device. The interface is
marked experimental for now.
- Bucketized IO stats for IO polling from Stephen Bates. This improves
efficiency of polled workloads in the presence of mixed block size
IO.
- A few fixes for opal, from Scott.
- A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
From a variety of folks, mostly Sagi and James Smart.
- A series from Bart, improving our exposed info and capabilities from
the blk-mq debugfs support.
- A series from Christoph, cleaning up how handle WRITE_ZEROES.
- A series from Christoph, cleaning up the block layer handling of how
we track errors in a request. On top of being a nice cleanup, it also
shrinks the size of struct request a bit.
- Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
never used by platforms, and the latter has outlived it's usefulness.
- Various little bug fixes and cleanups from a wide variety of folks.
* 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
block: hide badblocks attribute by default
blk-mq: unify hctx delay_work and run_work
block: add kblock_mod_delayed_work_on()
blk-mq: unify hctx delayed_run_work and run_work
nbd: fix use after free on module unload
MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
blk-mq-sched: alloate reserved tags out of normal pool
mtip32xx: use runtime tag to initialize command header
scsi: Implement blk_mq_ops.show_rq()
blk-mq: Add blk_mq_ops.show_rq()
blk-mq: Show operation, cmd_flags and rq_flags names
blk-mq: Make blk_flags_show() callers append a newline character
blk-mq: Move the "state" debugfs attribute one level down
blk-mq: Unregister debugfs attributes earlier
blk-mq: Only unregister hctxs for which registration succeeded
blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
blk-mq: Let blk_mq_debugfs_register() look up the queue name
blk-mq: Register <dev>/queue/mq after having registered <dev>/queue
ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
ide-pm: always pass 0 error to __blk_end_request_all
..
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI fix from Bjorn Helgaas:
"Sorry this is so late. It's been in -next for over a week, but I
forgot to send it on until now.
A single fix to the DT binding of the HiSilicon PCIe host support"
* tag 'pci-v4.11-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: hisi: Fix DT binding (hisi-pcie-almost-ecam)
|
|
This patch introduces a simple heuristic to load applications quickly,
and to perform the I/O requested by interactive applications just as
quickly. To this purpose, both a newly-created queue and a queue
associated with an interactive application (we explain in a moment how
BFQ decides whether the associated application is interactive),
receive the following two special treatments:
1) The weight of the queue is raised.
2) The queue unconditionally enjoys device idling when it empties; in
fact, if the requests of a queue are sync, then performing device
idling for the queue is a necessary condition to guarantee that the
queue receives a fraction of the throughput proportional to its weight
(see [1] for details).
For brevity, we call just weight-raising the combination of these
two preferential treatments. For a newly-created queue,
weight-raising starts immediately and lasts for a time interval that:
1) depends on the device speed and type (rotational or
non-rotational), and 2) is equal to the time needed to load (start up)
a large-size application on that device, with cold caches and with no
additional workload.
Finally, as for guaranteeing a fast execution to interactive,
I/O-related tasks (such as opening a file), consider that any
interactive application blocks and waits for user input both after
starting up and after executing some task. After a while, the user may
trigger new operations, after which the application stops again, and
so on. Accordingly, the low-latency heuristic weight-raises again a
queue in case it becomes backlogged after being idle for a
sufficiently long (configurable) time. The weight-raising then lasts
for the same time as for a just-created queue.
According to our experiments, the combination of this low-latency
heuristic and of the improvements described in the previous patch
allows BFQ to guarantee a high application responsiveness.
[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
Scheduler", Proceedings of the First Workshop on Mobile System
Technologies (MST-2015), May 2015.
http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Add complete support for full hierarchical scheduling, with a cgroups
interface. Full hierarchical scheduling is implemented through the
'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
associated with processes, and groups are represented in general by
entities. Given the bfq_queues associated with the processes belonging
to a given group, the entities representing these queues are sons of
the entity representing the group. At higher levels, if a group, say
G, contains other groups, then the entity representing G is the parent
entity of the entities representing the groups in G.
Hierarchical scheduling is performed as follows: if the timestamps of
a leaf entity (i.e., of a bfq_queue) change, and such a change lets
the entity become the next-to-serve entity for its parent entity, then
the timestamps of the parent entity are recomputed as a function of
the budget of its new next-to-serve leaf entity. If the parent entity
belongs, in its turn, to a group, and its new timestamps let it become
the next-to-serve for its parent entity, then the timestamps of the
latter parent entity are recomputed as well, and so on. When a new
bfq_queue must be set in service, the reverse path is followed: the
next-to-serve highest-level entity is chosen, then its next-to-serve
child entity, and so on, until the next-to-serve leaf entity is
reached, and the bfq_queue that this entity represents is set in
service.
Writeback is accounted for on a per-group basis, i.e., for each group,
the async I/O requests of the processes of the group are enqueued in a
distinct bfq_queue, and the entity associated with this queue is a
child of the entity associated with the group.
Weights can be assigned explicitly to groups and processes through the
cgroups interface, differently from what happens, for single
processes, if the cgroups interface is not used (as explained in the
description of the previous patch). In particular, since each node has
a full scheduler, each group can be assigned its own weight.
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
We tag as v0 the version of BFQ containing only BFQ's engine plus
hierarchical support. BFQ's engine is introduced by this commit, while
hierarchical support is added by next commit. We use the v0 tag to
distinguish this minimal version of BFQ from the versions containing
also the features and the improvements added by next commits. BFQ-v0
coincides with the version of BFQ submitted a few years ago [1], apart
from the introduction of preemption, described below.
BFQ is a proportional-share I/O scheduler, whose general structure,
plus a lot of code, are borrowed from CFQ.
- Each process doing I/O on a device is associated with a weight and a
(bfq_)queue.
- BFQ grants exclusive access to the device, for a while, to one queue
(process) at a time, and implements this service model by
associating every queue with a budget, measured in number of
sectors.
- After a queue is granted access to the device, the budget of the
queue is decremented, on each request dispatch, by the size of the
request.
- The in-service queue is expired, i.e., its service is suspended,
only if one of the following events occurs: 1) the queue finishes
its budget, 2) the queue empties, 3) a "budget timeout" fires.
- The budget timeout prevents processes doing random I/O from
holding the device for too long and dramatically reducing
throughput.
- Actually, as in CFQ, a queue associated with a process issuing
sync requests may not be expired immediately when it empties. In
contrast, BFQ may idle the device for a short time interval,
giving the process the chance to go on being served if it issues
a new request in time. Device idling typically boosts the
throughput on rotational devices, if processes do synchronous
and sequential I/O. In addition, under BFQ, device idling is
also instrumental in guaranteeing the desired throughput
fraction to processes issuing sync requests (see [2] for
details).
- With respect to idling for service guarantees, if several
processes are competing for the device at the same time, but
all processes (and groups, after the following commit) have
the same weight, then BFQ guarantees the expected throughput
distribution without ever idling the device. Throughput is
thus as high as possible in this common scenario.
- Queues are scheduled according to a variant of WF2Q+, named
B-WF2Q+, and implemented using an augmented rb-tree to preserve an
O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
also ready for hierarchical scheduling. However, for a cleaner
logical breakdown, the code that enables and completes
hierarchical support is provided in the next commit, which focuses
exactly on this feature.
- B-WF2Q+ guarantees a tight deviation with respect to an ideal,
perfectly fair, and smooth service. In particular, B-WF2Q+
guarantees that each queue receives a fraction of the device
throughput proportional to its weight, even if the throughput
fluctuates, and regardless of: the device parameters, the current
workload and the budgets assigned to the queue.
- The last, budget-independence, property (although probably
counterintuitive in the first place) is definitely beneficial, for
the following reasons:
- First, with any proportional-share scheduler, the maximum
deviation with respect to an ideal service is proportional to
the maximum budget (slice) assigned to queues. As a consequence,
BFQ can keep this deviation tight not only because of the
accurate service of B-WF2Q+, but also because BFQ *does not*
need to assign a larger budget to a queue to let the queue
receive a higher fraction of the device throughput.
- Second, BFQ is free to choose, for every process (queue), the
budget that best fits the needs of the process, or best
leverages the I/O pattern of the process. In particular, BFQ
updates queue budgets with a simple feedback-loop algorithm that
allows a high throughput to be achieved, while still providing
tight latency guarantees to time-sensitive applications. When
the in-service queue expires, this algorithm computes the next
budget of the queue so as to:
- Let large budgets be eventually assigned to the queues
associated with I/O-bound applications performing sequential
I/O: in fact, the longer these applications are served once
got access to the device, the higher the throughput is.
- Let small budgets be eventually assigned to the queues
associated with time-sensitive applications (which typically
perform sporadic and short I/O), because, the smaller the
budget assigned to a queue waiting for service is, the sooner
B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
- Weights can be assigned to processes only indirectly, through I/O
priorities, and according to the relation:
weight = 10 * (IOPRIO_BE_NR - ioprio).
The next patch provides, instead, a cgroups interface through which
weights can be assigned explicitly.
- If several processes are competing for the device at the same time,
but all processes and groups have the same weight, then BFQ
guarantees the expected throughput distribution without ever idling
the device. It uses preemption instead. Throughput is then much
higher in this common scenario.
- ioprio classes are served in strict priority order, i.e.,
lower-priority queues are not served as long as there are
higher-priority queues. Among queues in the same class, the
bandwidth is distributed in proportion to the weight of each
queue. A very thin extra bandwidth is however guaranteed to the Idle
class, to prevent it from starving.
- If the strict_guarantees parameter is set (default: unset), then BFQ
- always performs idling when the in-service queue becomes empty;
- forces the device to serve one I/O request at a time, by
dispatching a new request only if there is no outstanding
request.
In the presence of differentiated weights or I/O-request sizes,
both the above conditions are needed to guarantee that every
queue receives its allotted share of the bandwidth (see
Documentation/block/bfq-iosched.txt for more details). Setting
strict_guarantees may evidently affect throughput.
[1] https://lkml.org/lkml/2008/4/1/234
https://lkml.org/lkml/2008/11/11/148
[2] P. Valente and M. Andreolini, "Improving Application
Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
the 5th Annual International Systems and Storage Conference
(SYSTOR '12), June 2012.
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This patch introduces pblk, a host-side translation layer for
Open-Channel SSDs to expose them like block devices. The translation
layer allows data placement decisions, and I/O scheduling to be
managed by the host, enabling users to optimize the SSD for their
specific workloads.
An open-channel SSD has a set of LUNs (parallel units) and a
collection of blocks. Each block can be read in any order, but
writes must be sequential. Writes may also fail, and if a block
requires it, must also be reset before new writes can be
applied.
To manage the constraints, pblk maintains a logical to
physical address (L2P) table, write cache, garbage
collection logic, recovery scheme, and logic to rate-limit
user I/Os versus garbage collection I/Os.
The L2P table is fully-associative and manages sectors at a
4KB granularity. Pblk stores the L2P table in two places, in
the out-of-band area of the media and on the last page of a
line. In the cause of a power failure, pblk will perform a
scan to recover the L2P table.
The user data is organized into lines. A line is data
striped across blocks and LUNs. The lines enable the host to
reduce the amount of metadata to maintain besides the user
data and makes it easier to implement RAID or erasure coding
in the future.
pblk implements multi-tenant support and can be instantiated
multiple times on the same drive. Each instance owns a
portion of the SSD - both regarding I/O bandwidth and
capacity - providing I/O isolation for each case.
Finally, pblk also exposes a sysfs interface that allows
user-space to peek into the internals of pblk. The interface
is available at /dev/block/*/pblk/ where * is the block
device name exposed.
This work also contains contributions from:
Matias Bjørling <matias@cnexlabs.com>
Simon A. F. Lund <slund@cnexlabs.com>
Young Tack Jin <youngtack.jin@gmail.com>
Huaicheng Li <huaicheng@cs.uchicago.edu>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
scale to multiple queues. Users configure only two knobs, the target
read and synchronous write latencies, and the scheduler tunes itself to
achieve that latency goal.
The implementation is based on "tokens", built on top of the scalable
bitmap library. Tokens serve as a mechanism for limiting requests. There
are two tiers of tokens: queueing tokens and dispatch tokens.
A queueing token is required to allocate a request. In fact, these
tokens are actually the blk-mq internal scheduler tags, but the
scheduler manages the allocation directly in order to implement its
policy.
Dispatch tokens are device-wide and split up into two scheduling
domains: reads vs. writes. Each hardware queue dispatches batches
round-robin between the scheduling domains as long as tokens are
available for that domain.
These tokens can be used as the mechanism to enable various policies.
The policy Kyber uses is inspired by active queue management techniques
for network routing, similar to blk-wbt. The scheduler monitors
latencies and scales the number of dispatch tokens accordingly. Queueing
tokens are used to prevent starvation of synchronous requests by
asynchronous requests.
Various extensions are possible, including better heuristics and ionice
support. The new scheduler isn't set as the default yet.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This drivers was added in 2008, but as far as a I can tell we never had a
single platform that actually registered resources for the platform driver.
It's also been unmaintained for a long time and apparently has a ATA mode
that can be driven using the IDE/libata subsystem.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
The "hisilicon,pcie-almost-ecam" binding goes against the usual DT
conventions, and is non-sensical in that it describes the IP based on
what it isn't. Fix the DT binding with "hisilicon,hip06-pcie-ecam"
and "hisilicon,hip07-pcie-ecam".
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Dongdong Liu <liudongdong3@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are 3 small fixes for 4.11-rc6.
One resolves a reported issue with sysfs files that NeilBrown found,
one is a documenatation fix for the stable kernel rules, and the last
is a small MAINTAINERS file update for kernfs"
* tag 'driver-core-4.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
MAINTAINERS: separate out kernfs maintainership
sysfs: be careful of error returns from ops->show()
Documentation: stable-kernel-rules: fix stable-tag format
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS fixes from Al Viro:
"statx followup fixes and a fix for stack-smashing on alpha"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
alpha: fix stack smashing in old_adjtimex(2)
statx: Include a mask for stx_attributes in struct statx
statx: Reserve the top bit of the mask for future struct expansion
xfs: report crtime and attribute flags to statx
ext4: Add statx support
statx: optimize copy of struct statx to userspace
statx: remove incorrect part of vfs_statx() comment
statx: reject unknown flags when using NULL path
Documentation/filesystems: fix documentation for ->getattr()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fix from Linus Walleij:
"This late fix for pin control is hopefully the last I send this cycle.
The problem was detected early in the v4.11 release cycle and there
has been some back and forth on how to solve it. Sadly the proper fix
arrives late, but at least not too late.
An issue was detected with pin control on the Freescale i.MX after the
refactorings for more general group and function handling.
We now have the proper fix for this"
* tag 'pinctrl-v4.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: core: Fix pinctrl_register_and_init() with pinctrl_enable()
|
|
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
kill this hack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
A patch documenting how to specify which kernels a particular fix should
be backported to (seemingly) inadvertently added a minus sign after the
kernel version. This particular stable-tag format had never been used
prior to this patch, and was neither present when the patch in question
was first submitted (it was added in v2 without any comment).
Drop the minus sign to avoid any confusion.
Fixes: fdc81b7910ad ("stable_kernel_rules: Add clause about specification of kernel versions to patch.")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
We've added a considerable amount of fixes for stalls and issues
with the blk-mq scheduling in the 4.11 series since forking
off the for-4.12/block branch. We need to do improvements on
top of that for 4.12, so pull in the previous fixes to make
our lives easier going forward.
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Recent pinctrl changes to allow dynamic allocation of pins exposed one
more issue with the pinctrl pins claimed early by the controller itself.
This caused a regression for IMX6 pinctrl hogs.
Before enabling the pin controller driver we need to wait until it has
been properly initialized, then claim the hogs, and only then enable it.
To fix the regression, split the code into pinctrl_claim_hogs() and
pinctrl_enable(). And then let's require that pinctrl_enable() is always
called by the pin controller driver when ready after calling
pinctrl_register_and_init().
Depends-on: 950b0d91dc10 ("pinctrl: core: Fix regression caused by delayed
work for hogs")
Fixes: df61b366af26 ("pinctrl: core: Use delayed work for hogs")
Fixes: e566fc11ea76 ("pinctrl: imx: use generic pinctrl helpers for
managing groups")
Cc: Haojian Zhuang <haojian.zhuang@linaro.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Nishanth Menon <nm@ti.com>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Stefan Agner <stefan@agner.ch>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Gary Bisson <gary.bisson@boundarydevices.com>
Tested-by: Fabio Estevam <fabio.estevam@nxp.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm
From: Christoffer Dall <cdall@linaro.org>
KVM/ARM Fixes for v4.11-rc6
Fixes include:
- Fix a problem with GICv3 userspace save/restore
- Clarify GICv2 userspace save/restore ABI
- Be more careful in clearing GIC LRs
- Add missing synchronization primitive to our MMU handling code
|
|
As an oversight, for GICv2, we accidentally export the GICC_PMR register
in the format of the GICH_VMCR.VMPriMask field in the lower 5 bits of a
word, meaning that userspace must always use the lower 5 bits to
communicate with the KVM device and must shift the value left by 3
places to obtain the actual priority mask level.
Since GICv3 supports the full 8 bits of priority masking in the ICH_VMCR,
we have to fix the value we export when emulating a GICv2 on top of a
hardware GICv3 and exporting the emulated GICv2 state to userspace.
Take the chance to clarify this aspect of the ABI.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <cdall@linaro.org>
|
|
Following the recent merge of statx, correct the documented prototype
for the ->getattr() inode operation, and add an entry to the porting
file.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Merge misc fixes from Andrew Morton:
"11 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
kasan: do not sanitize kexec purgatory
drivers/rapidio/devices/tsi721.c: make module parameter variable name unique
mm/hugetlb.c: don't call region_abort if region_chg fails
kasan: report only the first error by default
hugetlbfs: initialize shared policy as part of inode allocation
mm: fix section name for .data..ro_after_init
mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd()
mm: workingset: fix premature shadow node shrinking with cgroups
mm: rmap: fix huge file mmap accounting in the memcg stats
mm: move mm_percpu_wq initialization earlier
mm: migrate: fix remove_migration_pte() for ksm pages
|
|
Disable kasan after the first report. There are several reasons for
this:
- Single bug quite often has multiple invalid memory accesses causing
storm in the dmesg.
- Write OOB access might corrupt metadata so the next report will print
bogus alloc/free stacktraces.
- Reports after the first easily could be not bugs by itself but just
side effects of the first one.
Given that multiple reports usually only do harm, it makes sense to
disable kasan after the first one. If user wants to see all the
reports, the boot-time parameter kasan_multi_shot must be used.
[aryabinin@virtuozzo.com: wrote changelog and doc, added missing include]
Link: http://lkml.kernel.org/r/20170323154416.30257-1-aryabinin@virtuozzo.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
"This fixes the following issues:
- memory corruption when kmalloc fails in xts/lrw
- mark some CCP DMA channels as private
- fix reordering race in padata
- regression in omap-rng DT description"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: xts,lrw - fix out-of-bounds write after kmalloc failure
crypto: ccp - Make some CCP DMA channels private
padata: avoid race in reordering
dt-bindings: rng: clocks property on omap_rng not always mandatory
|
|
Pull KVM fixes from Paolo Bonzini:
"All x86-specific, apart from some arch-independent syzkaller fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: cleanup the page tracking SRCU instance
KVM: nVMX: fix nested EPT detection
KVM: pci-assign: do not map smm memory slot pages in vt-d page tables
KVM: kvm_io_bus_unregister_dev() should never fail
KVM: VMX: Fix enable VPID conditions
KVM: nVMX: Fix nested VPID vmx exec control
KVM: x86: correct async page present tracepoint
kvm: vmx: Flush TLB when the APIC-access address changes
KVM: x86: use pic/ioapic destructor when destroy vm
KVM: x86: check existance before destroy
KVM: x86: clear bus pointer when destroyed
KVM: Documentation: document MCE ioctls
KVM: nVMX: don't reset kvm mmu twice
PTP: fix ptr_ret.cocci warnings
kvm: fix usage of uninit spinlock in avic_vm_destroy()
KVM: VMX: downgrade warning on unexpected exit code
|
|
throtl_slice is important for blk-throttling. It's called slice
internally but it really is a time window blk-throttling samples data.
blk-throttling will make decision based on the samplings. An example is
bandwidth measurement. A cgroup's bandwidth is measured in the time
interval of throtl_slice.
A small throtl_slice meanse cgroups have smoother throughput but burn
more CPUs. It has 100ms default value, which is not appropriate for all
disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes
it tunable.
Since throtl_slice isn't a time slice, the sysfs name
'throttle_sample_time' reflects its character better.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"A smattering of different small fixes for some random driver
subsystems. Nothing all that major, just resolutions for reported
issues and bugs.
All have been in linux-next with no reported issues"
* tag 'char-misc-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (21 commits)
extcon: int3496: Set the id pin to direction-input if necessary
extcon: int3496: Use gpiod_get instead of gpiod_get_index
extcon: int3496: Add dependency on X86 as it's Intel specific
extcon: int3496: Add GPIO ACPI mapping table
extcon: int3496: Rename GPIO pins in accordance with binding
vmw_vmci: handle the return value from pci_alloc_irq_vectors correctly
ppdev: fix registering same device name
parport: fix attempt to write duplicate procfiles
auxdisplay: img-ascii-lcd: add missing sentinel entry in img_ascii_lcd_matches
Drivers: hv: vmbus: Don't leak memory when a channel is rescinded
Drivers: hv: vmbus: Don't leak channel ids
Drivers: hv: util: don't forget to init host_ts.lock
Drivers: hv: util: move waiting for release to hv_utils_transport itself
vmbus: remove hv_event_tasklet_disable/enable
vmbus: use rcu for per-cpu channel list
mei: don't wait for os version message reply
mei: fix deadlock on mei reset
intel_th: pci: Add Gemini Lake support
intel_th: pci: Add Denverton SOC support
intel_th: Don't leak module refcount on failure to activate
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB/PHY fixes from Greg KH:
"Here are a number of small USB and PHY driver fixes for 4.11-rc4.
Nothing major here, just an bunch of small fixes, and a handfull of
good fixes from Johan for devices with crazy descriptors. There are a
few new device ids in here as well.
All of these have been in linux-next with no reported issues"
* tag 'usb-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (26 commits)
usb: gadget: f_hid: fix: Don't access hidg->req without spinlock held
usb: gadget: udc: remove pointer dereference after free
usb: gadget: f_uvc: Sanity check wMaxPacketSize for SuperSpeed
usb: gadget: f_uvc: Fix SuperSpeed companion descriptor's wBytesPerInterval
usb: gadget: acm: fix endianness in notifications
usb: dwc3: gadget: delay unmap of bounced requests
USB: serial: qcserial: add Dell DW5811e
usb: hub: Fix crash after failure to read BOS descriptor
ACM gadget: fix endianness in notifications
USB: usbtmc: fix probe error path
USB: usbtmc: add missing endpoint sanity check
USB: serial: option: add Quectel UC15, UC20, EC21, and EC25 modems
usb: musb: fix possible spinlock deadlock
usb: musb: dsps: fix iounmap in error and exit paths
usb: musb: cppi41: don't check early-TX-interrupt for Isoch transfer
usb-core: Add LINEAR_FRAME_INTR_BINTERVAL USB quirk
uwb: i1480-dfu: fix NULL-deref at probe
uwb: hwa-rc: fix NULL-deref at probe
USB: wusbcore: fix NULL-deref at probe
USB: uss720: fix NULL-deref at probe
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull more powerpc fixes from Michael Ellerman:
"These are all pretty minor. The fix for idle wakeup would be a bad bug
but has not been observed in practice.
The update to the gcc-plugins docs was Cc'ed to Kees and Jon, Kees
OK'ed it going via powerpc and I didn't hear from Jon.
- cxl: Route eeh events to all slices for pci_channel_io_perm_failure state
- powerpc/64s: Fix idle wakeup potential to clobber registers
- Revert "powerpc/64: Disable use of radix under a hypervisor"
- gcc-plugins: update architecture list in documentation
Thanks to: Andrew Donnellan, Nicholas Piggin, Paul Mackerras, Vaibhav
Jain"
* tag 'powerpc-4.11-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
gcc-plugins: update architecture list in documentation
Revert "powerpc/64: Disable use of radix under a hypervisor"
powerpc/64s: Fix idle wakeup potential to clobber registers
cxl: Route eeh events to all slices for pci_channel_io_perm_failure state
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"A handful of Sunxi and Rockchip clk driver fixes and a core framework
one where we need to copy a string because we can't guarantee it isn't
freed sometime later"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: sunxi-ng: fix recalc_rate formula of NKMP clocks
clk: sunxi-ng: Fix div/mult settings for osc12M on A64
clk: rockchip: Make uartpll a child of the gpll on rk3036
clk: rockchip: add "," to mux_pll_src_apll_dpll_gpll_usb480m_p on rk3036
clk: core: Copy connection id
dt-bindings: arm: update Armada CP110 system controller binding
clk: sunxi-ng: sun6i: Fix enable bit offset for hdmi-ddc module clock
clk: sunxi: ccu-sun5i needs nkmp
clk: sunxi-ng: mp: Adjust parent rate for pre-dividers
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull mmc fixes from Ulf Hansson:
"Here are a couple of mmc fixes intended for v4.11 rc4.
MMC core:
- Fix initialization of HS400-ES eMMC cards
- A couple of fixes for the mmc block device driver
- Resolved a compiler warning
MMC host:
- sdhci: Do not disable IRQs while waiting for clock
- sdhci-pci: Do not disable IRQs in sdhci_intel_set_power
- sdhci-of-arasan: Fix incorrect timeout clock
- mediatek: Fix bug for setting wrong clock frequency
- sdhci-of-at91: Use regulator to fix cmd timeout errors
- ushc: Fix NULL-deref at probe
- rockchip-dw-mshc: Rename RK1108 to RV1108 in DT"
* tag 'mmc-v4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: sdhci-pci: Do not disable interrupts in sdhci_intel_set_power
mmc: sdhci: Do not disable interrupts while waiting for clock
mmc: ushc: fix NULL-deref at probe
mmc: sdhci-of-at91: Support external regulators
mmc: core: mmc_blk_rw_cmd_err - remove unused variable
mmc: mediatek: Fixed bug where clock frequency could be set wrong
mmc: block: Fix cmd error reset failure path
mmc: block: Fix is_waiting_last_req set incorrectly
mmc: core: Fix access to HS400-ES devices
mmc: sdhci-of-arasan: fix incorrect timeout clock
dt-bindings: rockchip-dw-mshc: rename RK1108 to RV1108
|
|
Commit 52060836f79 ("dt-bindings: omap-rng: Document SafeXcel IP-76
device variant") update the omap_rng Device Tree binding to add support
for the IP-76 variation of the IP. As part of this change, a "clocks"
property was added, but is indicated as "Required", without indicated
it's actually only required for some compatible strings.
This commit fixes that, by explicitly stating that the clocks property
is only required with the inside-secure,safexcel-eip76 compatible
string.
Fixes: 52060836f79 ("dt-bindings: omap-rng: Document SafeXcel IP-76 device variant")
Cc: <stable@vger.kernel.org>
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos into drm-fixes
Just several fixups,
- fix page fault and vblank timeout issues due to delayed vblank handling.
- fix panel driver probing to fail without te-gpios property.
- fix potential security hole by using "%pK" format.
- fix wrong if statement condition.
And one cleanup which removes Exynos4415 SoC support which is not supported
anymore.
* 'exynos-drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos:
drm/exynos/dsi: make te-gpios optional
drm/exynos: Print kernel pointers in a restricted form
drm/exynos/decon5433: fix software trigger mask
drm/exynos/fimd: signal frame done interrupt at front porch
drm/exynos/decon5433: signal frame done interrupt at front porch
drm/exynos/decon5433: fix vblank event handling
drm/exynos: move crtc event handling to drivers callbacks
drm/exynos: Remove support for Exynos4415 (SoC not supported anymore)
drm/exynos/decon5433: & vs | typo
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb into usb-linus
Felipe writes:
usb: fixes for v4.11-rc4
f_acm got an endianness fix by Oliver Neukum. This has been around for a
long time but it's finally fixed.
f_hid learned that it should never access hidg->req without first
grabbing the spinlock.
Roger Quadros fixed two bugs in the f_uvc function driver.
Janusz Dziedzic fixed a very peculiar bug with EP0, one that's rather
difficult to trigger. When we're dealing with bounced EP0 requests, we
should delay unmap until after ->complete() is called.
UDC class got a use-after-free fix.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kishon/linux-phy into usb-linus
Kishon writes:
phy: for 4.11-rc
*) Revert USB3 PHY support for Broadcom NSP SoC
*) Fix compiler error on qcom-usb-hs when depends on EXTCON
is not added
*) Fix error handling in phy-exynos-pcie
Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com>
|
|
In order to make GPIO ACPI library stricter prepare users of
gpiod_get_index() to correctly behave when there no mapping is
provided by firmware.
Here we add explicit mapping between _CRS GpioIo() resources and
their names used in the driver.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com>
|
|
Commit 65c059bcaa73 ("powerpc: Enable support for GCC plugins") enabled GCC
plugins on powerpc, but neglected to update the architecture list in the
docs. Rectify this.
Fixes: 65c059bcaa73 ("powerpc: Enable support for GCC plugins")
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Support for Exynos4415 is going away because there are no internal nor
external users.
Since commit 46dcf0ff0de3 ("ARM: dts: exynos: Remove exynos4415.dtsi"),
the platform cannot be instantiated so remove also the drivers.
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Acked-by: Kukjin Kim <kgene@kernel.org>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
|
|
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Pull networking fixes from David Miller:
1) Ensure that mtu is at least IPV6_MIN_MTU in ipv6 VTI tunnel driver,
from Steffen Klassert.
2) Fix crashes when user tries to get_next_key on an LPM bpf map, from
Alexei Starovoitov.
3) Fix detection of VLAN fitlering feature for bnx2x VF devices, from
Michal Schmidt.
4) We can get a divide by zero when TCP socket are morphed into
listening state, fix from Eric Dumazet.
5) Fix socket refcounting bugs in skb_complete_wifi_ack() and
skb_complete_tx_timestamp(). From Eric Dumazet.
6) Use after free in dccp_feat_activate_values(), also from Eric
Dumazet.
7) Like bonding team needs to use ETH_MAX_MTU as netdev->max_mtu, from
Jarod Wilson.
8) Fix use after free in vrf_xmit(), from David Ahern.
9) Don't do UDP Fragmentation Offload on IPComp ipsec packets, from
Alexey Kodanev.
10) Properly check napi_complete_done() return value in order to decide
whether to re-enable IRQs or not in amd-xgbe driver, from Thomas
Lendacky.
11) Fix double free of hwmon device in marvell phy driver, from Andrew
Lunn.
12) Don't crash on malformed netlink attributes in act_connmark, from
Etienne Noss.
13) Don't remove routes with a higher metric in ipv6 ECMP route replace,
from Sabrina Dubroca.
14) Don't write into a cloned SKB in ipv6 fragmentation handling, from
Florian Westphal.
15) Fix routing redirect races in dccp and tcp, basically the ICMP
handler can't modify the socket's cached route in it's locked by the
user at this moment. From Jon Maxwell.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (108 commits)
qed: Enable iSCSI Out-of-Order
qed: Correct out-of-bound access in OOO history
qed: Fix interrupt flags on Rx LL2
qed: Free previous connections when releasing iSCSI
qed: Fix mapping leak on LL2 rx flow
qed: Prevent creation of too-big u32-chains
qed: Align CIDs according to DORQ requirement
mlxsw: reg: Fix SPVMLR max record count
mlxsw: reg: Fix SPVM max record count
net: Resend IGMP memberships upon peer notification.
dccp: fix memory leak during tear-down of unsuccessful connection request
tun: fix premature POLLOUT notification on tun devices
dccp/tcp: fix routing redirect race
ucc/hdlc: fix two little issue
vxlan: fix ovs support
net: use net->count to check whether a netns is alive or not
bridge: drop netfilter fake rtable unconditionally
ipv6: avoid write to a possibly cloned skb
net: wimax/i2400m: fix NULL-deref at probe
isdn/gigaset: fix NULL-deref at probe
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"Three cgroup fixes. Nothing critical:
- the pids controller could trigger suspicious RCU warning
spuriously. Fixed.
- in the debug controller, %p -> %pK to protect kernel pointer
from getting exposed.
- documentation formatting fix"
* 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroups: censor kernel pointer in debug files
cgroup/pids: remove spurious suspicious RCU usage warning
cgroup: Fix indenting in PID controller documentation
|
|
Rockchip finally named the SOC as RV1108, so change it.
Signed-off-by: Andy Yan <andy.yan@rock-chips.com>
Acked-by: Rob Herring <robh@kernel.org>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
It wasn't clear if the 'forwarding' setting needs to be enabled on the
interface that packets are received from, or on the interface that
packets are forwarded to, or both.
In fact (according to my code reading) the setting is relevant on the
interface that packets are received from, so this change updates the doc
to say that.
Signed-off-by: Neil Jerram <neil@tigera.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
- a workaround for a GIC erratum
- a missing stub function for CONFIG_IRQDOMAIN=n
- fixes for a couple of type inconsistencies
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/crossbar: Fix incorrect type of register size
irqchip/gicv3-its: Add workaround for QDF2400 ITS erratum 0065
irqdomain: Add empty irq_domain_check_msi_remap
irqchip/crossbar: Fix incorrect type of local variables
|
|
Pull KVM fixes from Radim Krčmář:
"ARM updates from Marc Zyngier:
- vgic updates:
- Honour disabling the ITS
- Don't deadlock when deactivating own interrupts via MMIO
- Correctly expose the lact of IRQ/FIQ bypass on GICv3
- I/O virtualization:
- Make KVM_CAP_NR_MEMSLOTS big enough for large guests with many
PCIe devices
- General bug fixes:
- Gracefully handle exception generated with syndroms that the host
doesn't understand
- Properly invalidate TLBs on VHE systems
x86:
- improvements in emulation of VMCLEAR, VMX MSR bitmaps, and VCPU
reset
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: nVMX: do not warn when MSR bitmap address is not backed
KVM: arm64: Increase number of user memslots to 512
KVM: arm/arm64: Remove KVM_PRIVATE_MEM_SLOTS definition that are unused
KVM: arm/arm64: Enable KVM_CAP_NR_MEMSLOTS on arm/arm64
KVM: Add documentation for KVM_CAP_NR_MEMSLOTS
KVM: arm/arm64: VGIC: Fix command handling while ITS being disabled
arm64: KVM: Survive unknown traps from guests
arm: KVM: Survive unknown traps from guests
KVM: arm/arm64: Let vcpu thread modify its own active state
KVM: nVMX: reset nested_run_pending if the vCPU is going to be reset
kvm: nVMX: VMCLEAR should not cause the vCPU to shut down
KVM: arm/arm64: vgic-v3: Don't pretend to support IRQ/FIQ bypass
arm64: KVM: VHE: Clear HCR_TGE when invalidating guest TLBs
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here is a number of different USB fixes for 4.11-rc2.
Seems like there were a lot of unresolved issues that people have been
finding for this subsystem, and a bunch of good security auditing
happening as well from Johan Hovold. There's the usual batch of gadget
driver fixes and xhci issues resolved as well.
All of these have been in linux-next with no reported issues"
* tag 'usb-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (35 commits)
usb: host: xhci-plat: Fix timeout on removal of hot pluggable xhci controllers
usb: host: xhci-dbg: HCIVERSION should be a binary number
usb: xhci: remove dummy extra_priv_size for size of xhci_hcd struct
usb: xhci-mtk: check hcc_params after adding primary hcd
USB: serial: digi_acceleport: fix OOB-event processing
MAINTAINERS: usb251xb: remove reference inexistent file
doc: dt-bindings: usb251xb: mark reg as required
usb: usb251xb: dt: add unit suffix to oc-delay and power-on-time
usb: usb251xb: remove max_{power,current}_{sp,bp} properties
usb-storage: Add ignore-residue quirk for Initio INIC-3619
USB: iowarrior: fix NULL-deref in write
USB: iowarrior: fix NULL-deref at probe
usb: phy: isp1301: Add OF device ID table
usb: ohci-at91: Do not drop unhandled USB suspend control requests
USB: serial: safe_serial: fix information leak in completion handler
USB: serial: io_ti: fix information leak in completion handler
USB: serial: omninet: drop open callback
USB: serial: omninet: fix reference leaks at open
USB: serial: io_ti: fix NULL-deref in interrupt callback
usb: dwc3: gadget: make to increment req->remaining in all cases
...
|
|
Merge fixes from Andrew Morton:
"26 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
userfaultfd: remove wrong comment from userfaultfd_ctx_get()
fat: fix using uninitialized fields of fat_inode/fsinfo_inode
sh: cayman: IDE support fix
kasan: fix races in quarantine_remove_cache()
kasan: resched in quarantine_remove_cache()
mm: do not call mem_cgroup_free() from within mem_cgroup_alloc()
thp: fix another corner case of munlock() vs. THPs
rmap: fix NULL-pointer dereference on THP munlocking
mm/memblock.c: fix memblock_next_valid_pfn()
userfaultfd: selftest: vm: allow to build in vm/ directory
userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED
userfaultfd: non-cooperative: fix fork fctx->new memleak
mm/cgroup: avoid panic when init with low memory
drivers/md/bcache/util.h: remove duplicate inclusion of blkdev.h
mm/vmstats: add thp_split_pud event for clarity
include/linux/fs.h: fix unsigned enum warning with gcc-4.2
userfaultfd: non-cooperative: release all ctx in dup_userfaultfd_complete
userfaultfd: non-cooperative: robustness check
userfaultfd: non-cooperative: rollback userfaultfd_exit
x86, mm: unify exit paths in gup_pte_range()
...
|
|
Patch series "userfaultfd non-cooperative further update for 4.11 merge
window".
Unfortunately I noticed one relevant bug in userfaultfd_exit while doing
more testing. I've been doing testing before and this was also tested
by kbuild bot and exercised by the selftest, but this bug never
reproduced before.
I dropped userfaultfd_exit as result. I dropped it because of
implementation difficulty in receiving signals in __mmput and because I
think -ENOSPC as result from the background UFFDIO_COPY should be enough
already.
Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit
wasn't exercised by the selftest and when I tried to exercise it, after
moving it to a more correct place in __mmput where it would make more
sense and where the vma list is stable, it resulted in the
event_wait_completion in D state. So then I added the second patch to
be sure even if we call userfaultfd_event_wait_completion too late
during task exit(), we won't risk to generate tasks in D state. The
same check exists in handle_userfault() for the same reason, except it
makes a difference there, while here is just a robustness check and it's
run under WARN_ON_ONCE.
While looking at the userfaultfd_event_wait_completion() function I
looked back at its callers too while at it and I think it's not ok to
stop executing dup_fctx on the fcs list because we relay on
userfaultfd_event_wait_completion to execute
userfaultfd_ctx_put(fctx->orig) which is paired against
userfaultfd_ctx_get(fctx->orig) in dup_userfault just before
list_add(fcs). This change only takes care of fctx->orig but this area
also needs further review looking for similar problems in fctx->new.
The only patch that is urgent is the first because it's an use after
free during a SMP race condition that affects all processes if
CONFIG_USERFAULTFD=y. Very hard to reproduce though and probably
impossible without SLUB poisoning enabled.
This patch (of 3):
I once reproduced this oops with the userfaultfd selftest, it's not
easily reproducible and it requires SLUB poisoning to reproduce.
general protection fault: 0000 [#1] SMP
Modules linked in:
CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G ------------ T 3.10.0+ #15
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000
RIP: 0010:[<ffffffff81451299>] [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
RSP: 0018:ffff8801f833fe80 EFLAGS: 00010202
RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600
RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000
R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700
FS: 0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
do_exit+0x297/0xd10
SyS_exit+0x17/0x20
tracesys+0xdd/0xe2
Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 <4c> 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80
RIP [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
RSP <ffff8801f833fe80>
---[ end trace 9fecd6dcb442846a ]---
In the debugger I located the "mm" pointer in the stack and walking
mm->mmap->vm_next through the end shows the vma->vm_next list is fully
consistent and it is null terminated list as expected. So this has to
be an SMP race condition where userfaultfd_exit was running while the
vma list was being modified by another CPU.
When userfaultfd_exit() run one of the ->vm_next pointers pointed to
SLAB_POISON (RBX is the vma pointer and is 0x6b6b..).
The reason is that it's not running in __mmput but while there are still
other threads running and it's not holding the mmap_sem (it can't as it
has to wait the even to be received by the manager). So this is an use
after free that was happening for all processes.
One more implementation problem aside from the race condition:
userfaultfd_exit has really to check a flag in mm->flags before walking
the vma or it's going to slowdown the exit() path for regular tasks.
One more implementation problem: at that point signals can't be
delivered so it would also create a task in D state if the manager
doesn't read the event.
The major design issue: it overall looks superfluous as the manager can
check for -ENOSPC in the background transfer:
if (mmget_not_zero(ctx->mm)) {
[..]
} else {
return -ENOSPC;
}
It's safer to roll it back and re-introduce it later if at all.
[rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT]
Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix typos and add the following to the scripts/spelling.txt:
overide||override
While we are here, fix the doubled "address" in the touched line
Documentation/devicetree/bindings/regulator/ti-abb-regulator.txt.
Also, fix the comment block style in the touched hunks in
drivers/media/dvb-frontends/drx39xyj/drx_driver.h.
Link: http://lkml.kernel.org/r/1481573103-11329-21-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix typos and add the following to the scripts/spelling.txt:
disble||disable
disbled||disabled
I kept the TSL2563_INT_DISBLED in /drivers/iio/light/tsl2563.c
untouched. The macro is not referenced at all, but this commit is
touching only comment blocks just in case.
Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix several issues in the intel_pstate driver and one issue in
the schedutil cpufreq governor, clean up that governor a bit and hook
up existing code for disabling cpufreq to a new kernel command line
option.
Specifics:
- Three fixes for intel_pstate problems related to the passive mode
(in which it acts as a regular cpufreq scaling driver), two for the
handling of global P-state limits and one for the handling of the
cpu_frequency tracepoint in that mode (Rafael Wysocki).
- Three fixes for the handling of P-state limits in intel_pstate in
the active mode (Rafael Wysocki).
- Introduction of a new cpufreq.off=1 kernel command line argument
that will disable cpufreq entirely if passed to the kernel and is
simply hooked up to the existing code used by Xen (Len Brown).
- Fix for the schedutil cpufreq governor to prevent it from using
stale raw frequency values in configurations with mutiple CPUs
sharing one policy object and a cleanup for it reducing its
overhead slightly (Viresh Kumar)"
* tag 'pm-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy
cpufreq: intel_pstate: Fix intel_pstate_verify_policy()
cpufreq: intel_pstate: Fix global settings in active mode
cpufreq: Add the "cpufreq.off=1" cmdline option
cpufreq: schedutil: Pass sg_policy to get_next_freq()
cpufreq: schedutil: move cached_raw_freq to struct sugov_policy
cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily
cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy()
cpufreq: intel_pstate: Do not use performance_limits in passive mode
|
|
* pm-cpufreq:
cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy
cpufreq: intel_pstate: Fix intel_pstate_verify_policy()
cpufreq: intel_pstate: Fix global settings in active mode
cpufreq: Add the "cpufreq.off=1" cmdline option
cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily
cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy()
cpufreq: intel_pstate: Do not use performance_limits in passive mode
|