|
Use the new mmu_notifier_addr_valid() helper to know whether an address is
undergoing any invalidation. This is more accurate than the mmu_notifier_count
used so far.
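A minimal sketch of how a driver fault path might use such a helper; the exact
signature of mmu_notifier_addr_valid() is an assumption based on the
description above, and driver_populate_range() is purely illustrative:
        /* Sketch only: signature and return convention are assumed. */
        static int driver_handle_fault(struct mm_struct *mm, unsigned long addr)
        {
                if (!mmu_notifier_addr_valid(mm, addr))
                        return -EAGAIN; /* addr is being invalidated, retry later */
                return driver_populate_range(mm, addr); /* safe to mirror addr */
        }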
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
|
|
Use context information on why a range is being invalidated to determine
whether we need to invalidate the secondary page table (spt) or not.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
|
|
This keeps a list of all virtual address ranges being invalidated (ie inside
a mmu_notifier_invalidate_range_start/end section). Also add a helper to
check if a range is undergoing such an invalidation. With this it is easy for
a concurrent thread to ignore invalidations that do not affect the virtual
address range it is working on.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Artemy Kovalyov <artemyko@mellanox.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
|
|
This patch just adds the information; it does not introduce any
optimization, thus there is no functional change with this patch.
The mmu_notifier callback for range invalidation happens for a number
of reasons. Provide some context information to the callback to allow for
optimization. For instance a device driver only needs to free its tracking
structures for a range if the notification is for an munmap. Prior to this
patch the driver would have to free them on each mmu_notifier callback
and reallocate them on the next page fault (as it would have to assume it
was an munmap).
A protection change also might turn into a no-op for a driver: if the driver
mapped a range read-only and the CPU page table is updated from read-write
to read-only, then the device page table does not need an update.
Those are just some of the optimizations this patch allows.
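A hedged sketch of the kind of callback optimization this enables; the event
field and the MMU_NOTIFY_* names below are illustrative assumptions, not
identifiers taken from the patch, and the driver_* helpers are hypothetical:
        static void driver_invalidate_range_start(struct mmu_notifier *mn,
                                                  const struct mmu_notifier_range *range)
        {
                switch (range->event) {
                case MMU_NOTIFY_UNMAP:
                        /* munmap: tracking structures for the range can be freed */
                        driver_free_range(mn, range->start, range->end);
                        break;
                case MMU_NOTIFY_PROTECTION_VMA:
                        /* read-write -> read-only is a no-op for a read-only mirror */
                        if (driver_range_is_read_only(mn, range->start, range->end))
                                break;
                        /* fall through */
                default:
                        driver_invalidate_spt(mn, range->start, range->end);
                        break;
                }
        }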
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Artemy Kovalyov <artemyko@mellanox.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
|
|
Using a struct for mmu_notifier_invalidate_range_start()|end() allows
adding more parameters in the future without having to change every
call site or every callback. There is no functional change with this
patch.
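The shape of such a struct might be roughly as follows; the field set is an
assumption derived from the mm/start/end parameters passed today, not the
patch's exact definition:
        /* Sketch: a container for the parameters currently passed individually. */
        struct mmu_notifier_range {
                struct mm_struct *mm;
                unsigned long start;
                unsigned long end;
        };
        void mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range);
        void mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range);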
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Artemy Kovalyov <artemyko@mellanox.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
|
|
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
|
|
This reverts 02ede1345ff68cd3e629864161c1d8adcef1eae5
Patch has been dropped from mmotm tree with reason: wrecked
|
|
fix build warning for CONFIG_SLOB:
mm/memcontrol.c:706:27: warning: 'get_mem_cgroup' defined but not used [-Wunused-function]
static struct mem_cgroup *get_mem_cgroup(struct mem_cgroup *memcg)
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803181318530.241887@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: kbuild test robot <lkp@intel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Calling __stack_chk_guard_setup() in decompress_kernel() is too late, so
stack checking always fails for decompress_kernel() itself. So remove
__stack_chk_guard_setup() and initialize __stack_chk_guard before we call
decompress_kernel().
The original code comes from ARM but is also used for MIPS and SH, so fix
them together. Without this fix, compressed booting of these archs will
fail because stack checking is enabled by default (>=4.16).
Link: http://lkml.kernel.org/r/1521186916-13745-1-git-send-email-chenhc@lemote.com
Signed-off-by: Huacai Chen <chenhc@lemote.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <james.hogan@mips.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
For kexec_file loading, if kexec_buf.top_down is 'true', the memory which
is used to load kernel/initrd/purgatory is supposed to be allocated from
top to bottom. This is also consistent with the old kexec loading
interface.
However, the current arch_kexec_walk_mem() doesn't do this. It ignores
kexec_buf.top_down and calls walk_system_ram_res() directly to go through
all resources of System RAM, to try to find a memory region which can
contain the specific kexec buffer, then calls locate_mem_hole_callback()
to allocate memory in that found memory region from top to bottom. This
is not right.
Here, check whether kexec_buf.top_down is 'true' in arch_kexec_walk_mem();
if so, call the newly added walk_system_ram_res_rev() to find a memory
region from top to bottom in which to load the kernel.
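A sketch of the intended flow in the generic helper; the crash-kernel branch
of the real function is elided here:
        int arch_kexec_walk_mem(struct kexec_buf *kbuf,
                                int (*func)(struct resource *, void *))
        {
                /* Honour top_down by walking System RAM from the highest region down. */
                if (kbuf->top_down)
                        return walk_system_ram_res_rev(0, ULONG_MAX, kbuf, func);
                return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
        }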
Link: http://lkml.kernel.org/r/20180322033722.9279-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Philipp Rudo <prudo@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This function, being a variant of walk_system_ram_res() introduced in
commit 8c86e70acead ("resource: provide new functions to walk through
resources"), walks through the list of all the resources of System RAM in
reverse order, i.e., from higher to lower.
It will be used in kexec_file code.
Link: http://lkml.kernel.org/r/20180322033722.9279-2-bhe@redhat.com
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Philipp Rudo <prudo@linux.vnet.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
As previously reported (https://patchwork.kernel.org/patch/8642031/) it's
possible to call shrink_dentry_list with a large number of dentries (>
10000). This, in turn, could trigger the softlockup detector and possibly
trigger a panic. In addition to the unmount path being vulnerable to this
scenario, at SuSE we've observed a similar situation happening during
process exit on processes that touch a lot of dentries. Here is an
excerpt from a crash dump. The number after the colon is the number of
dentries on the list passed to shrink_dentry_list:
PID 99760: 10722
PID 107530: 215
PID 108809: 24134
PID 108877: 21331
PID 141708: 16487
So we want to kill between 15k-25k dentries without yielding.
And one possible call stack looks like:
4 [ffff8839ece41db0] _raw_spin_lock at ffffffff8152a5f8
5 [ffff8839ece41db0] evict at ffffffff811c3026
6 [ffff8839ece41dd0] __dentry_kill at ffffffff811bf258
7 [ffff8839ece41df0] shrink_dentry_list at ffffffff811bf593
8 [ffff8839ece41e18] shrink_dcache_parent at ffffffff811bf830
9 [ffff8839ece41e50] proc_flush_task at ffffffff8120dd61
10 [ffff8839ece41ec0] release_task at ffffffff81059ebd
11 [ffff8839ece41f08] do_exit at ffffffff8105b8ce
12 [ffff8839ece41f78] sys_exit at ffffffff8105bd53
13 [ffff8839ece41f80] system_call_fastpath at ffffffff81532909
While some of the callers of shrink_dentry_list do use cond_resched, this
is not sufficient to prevent softlockups. So just move cond_resched into
shrink_dentry_list from its callers.
David said: I've found hundreds of occurrences of warnings that we emit
when need_resched stays set for a prolonged period of time with the
stack trace that is included in the change log.
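A minimal sketch of the change; the surrounding kill/requeue logic is elided:
        static void shrink_dentry_list(struct list_head *list)
        {
                while (!list_empty(list)) {
                        struct dentry *dentry;
                        /* Yield periodically so huge lists cannot trigger softlockups. */
                        cond_resched();
                        dentry = list_entry(list->prev, struct dentry, d_lru);
                        /* ... existing kill/requeue logic unchanged ... */
                }
        }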
Link: http://lkml.kernel.org/r/1521718946-31521-1-git-send-email-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
I got "oom_reaper: unable to reap pid:" messages when the victim thread
was blocked inside free_pgtables() (which occurred after returning from
unmap_vmas() and setting MMF_OOM_SKIP). We don't need to complain when
exit_mmap() already set MMF_OOM_SKIP.
[ 663.593821] Killed process 7558 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 664.684801] oom_reaper: unable to reap pid:7558 (a.out)
[ 664.892292] a.out D13272 7558 6931 0x00100084
[ 664.895765] Call Trace:
[ 664.897574] ? __schedule+0x25f/0x780
[ 664.900099] schedule+0x2d/0x80
[ 664.902260] rwsem_down_write_failed+0x2bb/0x440
[ 664.905249] ? rwsem_down_write_failed+0x55/0x440
[ 664.908335] ? free_pgd_range+0x569/0x5e0
[ 664.911145] call_rwsem_down_write_failed+0x13/0x20
[ 664.914121] down_write+0x49/0x60
[ 664.916519] ? unlink_file_vma+0x28/0x50
[ 664.919255] unlink_file_vma+0x28/0x50
[ 664.922234] free_pgtables+0x36/0x100
[ 664.924797] exit_mmap+0xbb/0x180
[ 664.927220] mmput+0x50/0x110
[ 664.929504] copy_process.part.41+0xb61/0x1fe0
[ 664.932448] ? _do_fork+0xe6/0x560
[ 664.934902] ? _do_fork+0xe6/0x560
[ 664.937361] _do_fork+0xe6/0x560
[ 664.939742] ? syscall_trace_enter+0x1a9/0x240
[ 664.942693] ? retint_user+0x18/0x18
[ 664.945309] ? page_fault+0x2f/0x50
[ 664.947896] ? trace_hardirqs_on_caller+0x11f/0x1b0
[ 664.951075] do_syscall_64+0x74/0x230
[ 664.953747] entry_SYSCALL_64_after_hwframe+0x42/0xb7
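A sketch of the check this suggests in the reaper's give-up path, assuming
the flag is tested right before the complaint is printed:
        /* exit_mmap() setting MMF_OOM_SKIP means there is nothing to complain about */
        if (test_bit(MMF_OOM_SKIP, &mm->flags))
                goto done;
        pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
                task_pid_nr(tsk), tsk->comm);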
Link: http://lkml.kernel.org/r/201803221946.DHG65638.VFJHFtOSQLOMOF@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This patch fixes a corner case for KSM. When two pages belong or belonged
to the same transparent hugepage, and they should be merged, KSM fails to
split the page, and therefore no merging happens.
This bug can be reproduced by:
* making sure ksm is running (disabling ksmtuned if necessary)
* enabling transparent hugepages
* allocating a THP-aligned 1-THP-sized buffer
e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21)
* filling it with the same values
e.g. memset(p, 42, 1<<21)
* performing madvise to make it mergeable
e.g. madvise(p, 1<<21, MADV_MERGEABLE)
* waiting for KSM to perform a few scans
The expected outcome is that all the pages get merged (1 shared and
the rest sharing); the actual outcome is that no pages get merged (1
unshared and the rest volatile).
The reason for this behaviour is that we increase the reference count once
for both pages we want to merge, but if they belong to the same hugepage
(or compound page), the reference counter used in both cases is the one of
the head of the compound page. This means that split_huge_page will find
a value of the reference counter too high and will fail.
This patch solves this problem by testing if the two pages to merge belong
to the same hugepage when attempting to merge them. If so, the hugepage
is split safely. This means that the hugepage is not split if not
necessary.
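For reference, a minimal userspace reproducer along the lines of the steps
above might look like this (assuming THP and KSM are already enabled on the
system):
        #include <stdlib.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>
        int main(void)
        {
                void *p;
                size_t sz = 1UL << 21;  /* one THP on amd64 */
                if (posix_memalign(&p, sz, sz))
                        return 1;
                memset(p, 42, sz);      /* identical contents, so the pages are mergeable */
                if (madvise(p, sz, MADV_MERGEABLE))
                        return 1;
                pause();                /* give ksmd time to scan; inspect /sys/kernel/mm/ksm/ */
                return 0;
        }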
Link: http://lkml.kernel.org/r/1521548069-24758-1-git-send-email-imbrenda@linux.vnet.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Co-authored-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#29:
while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
WARNING: line over 80 characters
#214: FILE: mm/vmscan.c:2581:
+ * zones or there is heavy usage of a slow backing device.
WARNING: line over 80 characters
#225: FILE: mm/vmscan.c:2592:
+ if (stat.nr_writeback && stat.nr_writeback == stat.nr_taken)
WARNING: line over 80 characters
#255: FILE: mm/vmscan.c:2630:
+ current_may_throttle() && pgdat_memcg_congested(pgdat, root))
total: 0 errors, 4 warnings, 189 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
./patches/mm-vmscan-dont-mess-with-pgdat-flags-in-memcg-reclaim.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
memcg reclaim may alter pgdat->flags based on the state of LRU lists in a
cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep in
congestion_wait(), PGDAT_DIRTY may force kswapd to write back filesystem
pages. But the worst here is PGDAT_CONGESTED, since it may force all
direct reclaims to stall in wait_iff_congested(). Note that only kswapd
has the power to clear any of these bits. This might just never happen if
cgroup limits are configured that way. So all direct reclaims will stall
as long as we have some congested bdi in the system.
Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
pgdat, so it's reasonable to leave all decisions about node state to
kswapd. Also add per-cgroup congestion state to avoid needlessly burning
CPU in cgroup reclaim if heavy congestion is observed.
Currently there is no need for per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
bits since they alter only kswapd behavior.
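Judging by the pgdat_memcg_congested() helper visible in the checkpatch entry
above for this patch, the combined check is presumably along these lines;
memcg_congested() stands in for whatever per-cgroup bit the patch introduces:
        /* Sketch: treat reclaim as congested if either the node or the cgroup says so. */
        static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
        {
                return test_bit(PGDAT_CONGESTED, &pgdat->flags) ||
                       (memcg && memcg_congested(pgdat, memcg));
        }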
The problem could be easily demonstrated by creating heavy congestion
in one cgroup:
echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/congester
echo 512M > /sys/fs/cgroup/congester/memory.max
echo $$ > /sys/fs/cgroup/congester/cgroup.procs
/* generate a lot of dirty data on slow HDD */
while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
....
while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
and some job in another cgroup:
mkdir /sys/fs/cgroup/victim
echo 128M > /sys/fs/cgroup/victim/memory.max
# time cat /dev/sda > /dev/null
real 10m15.054s
user 0m0.487s
sys 1m8.505s
According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
of the time sleeping there.
With the patch, 'cat' doesn't waste time there anymore:
# time cat /dev/sda > /dev/null
real 5m32.911s
user 0m0.411s
sys 0m56.664s
Link: http://lkml.kernel.org/r/20180315164553.17856-6-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We have a separate LRU list for each memory cgroup. Memory reclaim iterates
over cgroups and calls shrink_inactive_list() for every inactive LRU list.
Based on the state of a single LRU, shrink_inactive_list() may flag the
whole node as dirty, congested or under writeback. This is obviously wrong
and hurtful. It's especially hurtful when we have a possibly small
congested cgroup in the system. Then *all* direct reclaims waste time by
sleeping in wait_iff_congested().
Sum the reclaim stats across all visited LRUs on the node and flag the node
as dirty, congested or under writeback based on that sum. This only fixes
the problem for the global reclaim case. Per-cgroup reclaim will be
addressed separately by the next patch.
This change will also affect systems with no memory cgroups. The reclaimer
now makes decisions based on the reclaim stats of both the anon and file
LRU lists. E.g. if the file list is in a congested state and
get_scan_count() decided to reclaim some anon pages, the reclaimer will
start shrinking anon without delaying in wait_iff_congested() like it did
before. It seems to be a reasonable thing to do. Why waste time sleeping
before reclaiming anon, given that we are going to try to reclaim it anyway?
Link: http://lkml.kernel.org/r/20180315164553.17856-5-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Only kswapd can have non-zero nr_immediate, and current_may_throttle() is
always true for kswapd (PF_LESS_THROTTLE bit is never set) thus it's
enough to check stat.nr_immediate only.
Link: http://lkml.kernel.org/r/20180315164553.17856-4-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
With upcoming changes, keeping the mm_vmscan_lru_shrink_inactive tracepoint
intact would require some additional code churn. In particular 'struct
reclaim_stat' will gain wider usage, but we don't need the 'nr_activate',
'nr_ref_keep' and 'nr_unmap_fail' counters anywhere besides the tracepoint.
Since the mm_vmscan_lru_shrink_inactive tracepoint mostly provides
information collected by shrink_page_list(), we can just replace it with a
tracepoint in shrink_page_list(). We don't have the 'nr_scanned' and 'file'
arguments there, but the user can obtain this information from the
mm_vmscan_lru_isolate tracepoint.
Link: http://lkml.kernel.org/r/20180315164553.17856-3-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Update some comments that became stale since the transition from per-zone
to per-node reclaim.
Link: http://lkml.kernel.org/r/20180315164553.17856-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This fixes a warning shown when phys_addr_t is 32-bit int when compiling
with clang:
mm/memblock.c:927:15: warning: implicit conversion from 'unsigned long long'
to 'phys_addr_t' (aka 'unsigned int') changes value from
18446744073709551615 to 4294967295 [-Wconstant-conversion]
r->base : ULLONG_MAX;
^~~~~~~~~~
./include/linux/kernel.h:30:21: note: expanded from macro 'ULLONG_MAX'
#define ULLONG_MAX (~0ULL)
^~~~~
Link: http://lkml.kernel.org/r/20180319005645.29051-1-stefan@agner.ch
Signed-off-by: Stefan Agner <stefan@agner.ch>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
At present the construct
if (VM_WARN(...))
will compile OK with CONFIG_DEBUG_VM=y and will fail with
CONFIG_DEBUG_VM=n. The reason is that VM_{WARN,BUG}* have always been
special wrt. {WARN/BUG}* and never generate any code when DEBUG_VM is
disabled. So we cannot really use it in conditionals.
We considered changing things so that this construct works in both cases
but that might cause unwanted code generation with CONFIG_DEBUG_VM=n. It
is safer and simpler to make the build fail in both cases.
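For context, the two arms end up looking roughly like this (the exact
definitions in mmdebug.h may differ slightly):
        /* CONFIG_DEBUG_VM=y: cast to void so using the return value is a build error */
        #define VM_WARN(cond, format...) (void)WARN(cond, format)
        /* CONFIG_DEBUG_VM=n: expands to a void expression as well */
        #define VM_WARN(cond, format...) BUILD_BUG_ON_INVALID(cond)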
[akpm@linux-foundation.org: changelog]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use a helper function to prefetch the buddy as suggested by Dave Hansen.
Drop the change to list_add_tail() to avoid disordering pages.
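A sketch of what such a helper could look like, assuming the order-0 buddy is
what gets prefetched:
        static inline void prefetch_buddy(struct page *page)
        {
                unsigned long pfn = page_to_pfn(page);
                unsigned long buddy_pfn = __find_buddy_pfn(pfn, 0);
                struct page *buddy = page + (buddy_pfn - pfn);
                prefetch(buddy);
        }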
Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
Link: http://lkml.kernel.org/r/20180320113146.GB24737@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Suggested-by: Ying Huang <ying.huang@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There are a couple of places where parameter description and function name
do not match the actual code. Fix it.
Link: http://lkml.kernel.org/r/1520843448-17347-1-git-send-email-honglei.wang@oracle.com
Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Attempting to hotplug CPUs with CONFIG_VM_EVENT_COUNTERS enabled
can cause vmstat_update() to report a BUG due to preemption not
being disabled around smp_processor_id(). Discovered on Ubiquiti
EdgeRouter Pro with Cavium Octeon II processor.
BUG: using smp_processor_id() in preemptible [00000000] code:
kworker/1:1/269
caller is vmstat_update+0x50/0xa0
CPU: 0 PID: 269 Comm: kworker/1:1 Not tainted
4.16.0-rc4-Cavium-Octeon-00009-gf83bbd5-dirty #1
Workqueue: mm_percpu_wq vmstat_update
Stack : 0000002600000026 0000000010009ce0 0000000000000000 0000000000000001
0000000000000000 0000000000000000 0000000000000005 8001180000000800
00000000000000bf 0000000000000000 00000000000000bf 766d737461745f75
ffffffff83ad0000 0000000000000007 0000000000000000 0000000008000000
0000000000000000 ffffffff818d0000 0000000000000001 ffffffff81818a70
0000000000000000 0000000000000000 ffffffff8115bbb0 ffffffff818a0000
0000000000000005 ffffffff8144dc50 0000000000000000 0000000000000000
8000000088980000 8000000088983b30 0000000000000088 ffffffff813d3054
0000000000000000 ffffffff83ace622 00000000000000be 0000000000000000
00000000000000be ffffffff81121fb4 0000000000000000 0000000000000000
...
Call Trace:
[<ffffffff81121fb4>] show_stack+0x94/0x128
[<ffffffff813d3054>] dump_stack+0xa4/0xe0
[<ffffffff813fcfb8>] check_preemption_disabled+0x118/0x120
[<ffffffff811eafd8>] vmstat_update+0x50/0xa0
[<ffffffff8115b954>] process_one_work+0x144/0x348
[<ffffffff8115bd00>] worker_thread+0x150/0x4b8
[<ffffffff811622a0>] kthread+0x110/0x140
[<ffffffff8111c304>] ret_from_kernel_thread+0x14/0x1c
Link: http://lkml.kernel.org/r/1520881552-25659-1-git-send-email-steven.hill@cavium.com
Signed-off-by: Steven J. Hill <steven.hill@cavium.com>
Cc: Tejun Heo <htejun@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This patch fixes 5f48f0bd4e368425 ("mm, page_owner: skip unnecessary
stack_trace entries").
If we skip the first two entries, then the logic of checking the count
value against 2 for recursion is broken and the code will go one level deep
into recursion. So we need to check for only one call of
_RET_IP(__set_page_owner) while checking for recursion.
Current Backtrace while checking for recursion:-
(save_stack) from (__set_page_owner) // (But recursion returns true here)
(__set_page_owner) from (get_page_from_freelist)
(get_page_from_freelist) from (__alloc_pages_nodemask)
(__alloc_pages_nodemask) from (depot_save_stack)
(depot_save_stack) from (save_stack) // recursion should return true here
(save_stack) from (__set_page_owner)
(__set_page_owner) from (get_page_from_freelist)
(get_page_from_freelist) from (__alloc_pages_nodemask+)
(__alloc_pages_nodemask) from (depot_save_stack)
(depot_save_stack) from (save_stack)
(save_stack) from (__set_page_owner)
(__set_page_owner) from (get_page_from_freelist)
Correct Backtrace with fix:
(save_stack) from (__set_page_owner) // recursion returned true here
(__set_page_owner) from (get_page_from_freelist)
(get_page_from_freelist) from (__alloc_pages_nodemask+)
(__alloc_pages_nodemask) from (depot_save_stack)
(depot_save_stack) from (save_stack)
(save_stack) from (__set_page_owner)
(__set_page_owner) from (get_page_from_freelist)
Link: http://lkml.kernel.org/r/1521607043-34670-1-git-send-email-maninder1.s@samsung.com
Fixes: 5f48f0bd4e36 ("mm, page_owner: skip unnecessary stack_trace entries")
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Signed-off-by: Vaneet Narang <v.narang@samsung.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@techadventures.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ayush Mittal <ayush.m@samsung.com>
Cc: Prakash Gupta <guptap@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Vasyl Gomonovych <gomonovych@gmail.com>
Cc: Amit Sahrawat <a.sahrawat@samsung.com>
Cc: <pankaj.m@samsung.com>
Cc: Vaneet Narang <v.narang@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
If System V shmget/shmat operations are used to create a hugetlbfs backed
mapping, it is possible to munmap part of the mapping and split the
underlying vma such that it is not huge page aligned. This will
ultimately result in the following BUG:
kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt
8<--8<--8<--8< snip 8<--8<--8<--8<
CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G C E
4.15.0-10-generic #11-Ubuntu
NIP: c00000000036e764 LR: c00000000036ee48 CTR: 0000000000000009
REGS: c000003fbcdcf810 TRAP: 0700 Tainted: G C E
(4.15.0-10-generic)
MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002222 XER:
20040000
CFAR: c00000000036ee44 SOFTE: 1
GPR00: c00000000036ee48 c000003fbcdcfa90 c0000000016ea600 c000003fbcdcfc40
GPR04: c000003fd9858950 00007115e4e00000 00007115e4e10000 0000000000000000
GPR08: 0000000000000010 0000000000010000 0000000000000000 0000000000000000
GPR12: 0000000000002000 c000000007a2c600 00000fe3985954d0 00007115e4e00000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 00000fe398595a94 000000000000a6fc c000003fd9858950 0000000000018554
GPR24: c000003fdcd84500 c0000000019acd00 00007115e4e10000 c000003fbcdcfc40
GPR28: 0000000000200000 00007115e4e00000 c000003fbc9ac600 c000003fd9858950
NIP [c00000000036e764] __unmap_hugepage_range+0xa4/0x760
LR [c00000000036ee48] __unmap_hugepage_range_final+0x28/0x50
Call Trace:
[c000003fbcdcfa90] [00007115e4e00000] 0x7115e4e00000 (unreliable)
[c000003fbcdcfb50] [c00000000036ee48]
__unmap_hugepage_range_final+0x28/0x50
[c000003fbcdcfb80] [c00000000033497c] unmap_single_vma+0x11c/0x190
[c000003fbcdcfbd0] [c000000000334e14] unmap_vmas+0x94/0x140
[c000003fbcdcfc20] [c00000000034265c] exit_mmap+0x9c/0x1d0
[c000003fbcdcfce0] [c000000000105448] mmput+0xa8/0x1d0
[c000003fbcdcfd10] [c00000000010fad0] do_exit+0x360/0xc80
[c000003fbcdcfdd0] [c0000000001104c0] do_group_exit+0x60/0x100
[c000003fbcdcfe10] [c000000000110584] SyS_exit_group+0x24/0x30
[c000003fbcdcfe30] [c00000000000b184] system_call+0x58/0x6c
Instruction dump:
552907fe e94a0028 e94a0408 eb2a0018 81590008 7f9c5036 0b090000 e9390010
7d2948f8 7d2a2838 0b0a0000 7d293038 <0b090000> e9230086 2fa90000 419e0468
---[ end trace ee88f958a1c62605 ]---
This bug was introduced by 31383c6865a5 ("mm, hugetlbfs: introduce
->split() to vm_operations_struct"). A split function was added to
vm_operations_struct to determine if a mapping can be split. This was
mostly for device-dax and hugetlbfs mappings which have specific alignment
constraints.
Mappings initiated via shmget/shmat have their original vm_ops overwritten
with shm_vm_ops. shm_vm_ops functions will call back to the original
vm_ops if needed. Add such a split function to shm_vm_ops.
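The forwarding callback presumably looks something like this, following the
pattern of the other shm_vm_ops hooks:
        /* Sketch: let the backing file's vm_ops decide whether the vma may be split. */
        static int shm_split(struct vm_area_struct *vma, unsigned long addr)
        {
                struct shm_file_data *sfd = shm_file_data(vma->vm_file);
                if (sfd->vm_ops->split)
                        return sfd->vm_ops->split(vma, addr);
                return 0;
        }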
Link: http://lkml.kernel.org/r/20180321161314.7711-1-mike.kravetz@oracle.com
Fixes: 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
All the root caches are linked into slab_root_caches, which was introduced
by commit 510ded33e075 ("slab: implement slab_root_caches list"), but it
missed adding SLAB's kmem_cache to that list.
While experimenting with opt-in/opt-out kmem accounting, I noticed system
crashes due to a NULL dereference inside cache_from_memcg_idx() while
dereferencing kmem_cache.memcg_params.memcg_caches. A clean upstream
kernel will not see these crashes, but SLAB should be consistent with SLUB,
which does link its boot caches (kmem_cache_node and kmem_cache) into
slab_root_caches.
Link: http://lkml.kernel.org/r/20180319210020.60289-1-shakeelb@google.com
Fixes: 510ded33e075c ("slab: implement slab_root_caches list")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
madvised allocations") changed the page allocator to no longer detect thp
allocations based on __GFP_NORETRY.
It did not, however, modify the mem cgroup try_charge() path to avoid oom
kill for either khugepaged collapsing or thp faulting. It is never
expected to oom kill a process to allocate a hugepage for thp; reclaim is
governed by the thp defrag mode and MADV_HUGEPAGE, but allocations (and
charging) should fall back instead of oom killing processes.
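The natural shape of the fix, sketched here as an assumption rather than the
exact diff, is to make the memcg charge for a THP non-blocking in the same
way the page allocation already is:
        /* Sketch: charge with __GFP_NORETRY so memcg falls back instead of OOM killing. */
        if (mem_cgroup_try_charge(page, vma->vm_mm, gfp | __GFP_NORETRY, &memcg, true)) {
                put_page(page);
                return VM_FAULT_FALLBACK;
        }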
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803191409420.124411@chino.kir.corp.google.com
Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 726d061fbd36 ("mm: vmscan: kick flushers when we encounter dirty
pages on the LRU") added flusher invocation to shrink_inactive_list() when
many dirty pages on the LRU are encountered.
However, shrink_inactive_list() doesn't wake up flushers for legacy cgroup
reclaim, so the next commit bbef938429f5 ("mm: vmscan: remove old flusher
wakeup from direct reclaim path") removed the only source of flusher
wakeup in the legacy mem cgroup reclaim path.
This leads to premature OOM if there are too many dirty pages in the cgroup:
# mkdir /sys/fs/cgroup/memory/test
# echo $$ > /sys/fs/cgroup/memory/test/tasks
# echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
# dd if=/dev/zero of=tmp_file bs=1M count=100
Killed
dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
Call Trace:
dump_stack+0x46/0x65
dump_header+0x6b/0x2ac
oom_kill_process+0x21c/0x4a0
out_of_memory+0x2a5/0x4b0
mem_cgroup_out_of_memory+0x3b/0x60
mem_cgroup_oom_synchronize+0x2ed/0x330
pagefault_out_of_memory+0x24/0x54
__do_page_fault+0x521/0x540
page_fault+0x45/0x50
Task in /test killed as a result of limit of /test
memory: usage 51200kB, limit 51200kB, failcnt 73
memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Wake up flushers in legacy cgroup reclaim too.
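The fix is presumably as simple as dropping the global-reclaim-only condition
around the existing wakeup in shrink_inactive_list(), i.e. something like:
        /* Sketch: wake the flushers for cgroup (legacy) reclaim as well. */
        if (stat.nr_unqueued_dirty == nr_taken)
                wakeup_flusher_threads(WB_REASON_VMSCAN);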
Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com
Fixes: bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This reverts b92df1de5d289c0b ("mm: page_alloc: skip over regions of
invalid pfns where possible"). The commit is meant to be a boot init
speed up skipping the loop in memmap_init_zone() for invalid pfns. But
given some specific memory mapping on x86_64 (or more generally
theoretically anywhere but on arm with CONFIG_HAVE_ARCH_PFN_VALID) the
implementation also skips valid pfns which is plain wrong and causes
'kernel BUG at mm/page_alloc.c:1389!'
crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
kernel BUG at mm/page_alloc.c:1389!
invalid opcode: 0000 [#1] SMP
--
RIP: 0010:[<ffffffff8118833e>] [<ffffffff8118833e>] move_freepages+0x15e/0x160
RSP: 0018:ffff88054d727688 EFLAGS: 00010087
--
Call Trace:
[<ffffffff811883b3>] move_freepages_block+0x73/0x80
[<ffffffff81189e63>] __rmqueue+0x263/0x460
[<ffffffff8118c781>] get_page_from_freelist+0x7e1/0x9e0
[<ffffffff8118caf6>] __alloc_pages_nodemask+0x176/0x420
--
RIP [<ffffffff8118833e>] move_freepages+0x15e/0x160
RSP <ffff88054d727688>
crash> page_init_bug -v | grep RAM
<struct resource 0xffff88067fffd2f8> 1000 - 9bfff System RAM (620.00 KiB)
<struct resource 0xffff88067fffd3a0> 100000 - 430bffff System RAM ( 1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
<struct resource 0xffff88067fffd410> 4b0c8000 - 4bf9cfff System RAM ( 14.83 MiB = 15188.00 KiB)
<struct resource 0xffff88067fffd480> 4bfac000 - 646b1fff System RAM (391.02 MiB = 400408.00 KiB)
<struct resource 0xffff88067fffd560> 7b788000 - 7b7fffff System RAM (480.00 KiB)
<struct resource 0xffff88067fffd640> 100000000 - 67fffffff System RAM ( 22.00 GiB)
crash> page_init_bug | head -6
<struct resource 0xffff88067fffd560> 7b788000 - 7b7fffff System RAM (480.00 KiB)
<struct page 0xffffea0001ede200> 1fffff00000000 0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32 4096 1048575
<struct page 0xffffea0001ede200> 505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
<struct page 0xffffea0001ed8000> 0 0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA 1 4095
<struct page 0xffffea0001edffc0> 1fffff00000400 0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32 4096 1048575
BUG, zones differ!
crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffea0001e00000 78000000 0 0 0 0
ffffea0001ed7fc0 7b5ff000 0 0 0 0
ffffea0001ed8000 7b600000 0 0 0 0 <<<<
ffffea0001ede1c0 7b787000 0 0 0 0
ffffea0001ede200 7b788000 0 0 1 1fffff00000000
Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com
Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible")
Signed-off-by: Daniel Vacek <neelx@redhat.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
shmem_unused_huge_shrink() gets called from the reclaim path. Waiting for
the page lock may lead to deadlock there.
There was a bug report that may be attributed to this:
http://lkml.kernel.org/r/alpine.LRH.2.11.1801242349220.30642@mail.ewheeler.net
Replace lock_page() with trylock_page() and skip the page if we fail to
lock it. We will get to the page on the next scan.
We can test for PageTransHuge() outside the page lock as we only need
protection against splitting the page under us. Holding a pin on the page
is enough for this.
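A sketch of the resulting pattern; the label name and the surrounding list
handling are assumptions:
        /* Holding a pin is enough to test PageTransHuge() safely. */
        if (!PageTransHuge(page))
                goto drop;
        /* Do not block on the page lock in the reclaim path. */
        if (!trylock_page(page))
                goto drop;      /* we will get to this page on the next scan */
        split_huge_page(page);
        unlock_page(page);
drop:
        put_page(page);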
Link: http://lkml.kernel.org/r/20180316210830.43738-1-kirill.shutemov@linux.intel.com
Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Eric Wheeler <linux-mm@lists.ewheeler.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
deferred_split_scan() gets called from the reclaim path. Waiting for the
page lock may lead to deadlock there.
Replace lock_page() with trylock_page() and skip the page if we failed to
lock it. We will get to the page on the next scan.
Link: http://lkml.kernel.org/r/20180315150747.31945-1-kirill.shutemov@linux.intel.com
Fixes: 9a982250f773 ("thp: introduce deferred_split_huge_page()")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
khugepaged is not yet able to convert PTE-mapped huge pages back to
PMD-mapped ones. We do not collapse such pages. See the check in
khugepaged_scan_pmd().
But if, between khugepaged_scan_pmd() and __collapse_huge_page_isolate(),
somebody managed to instantiate a THP in the range and then split the PMD
back to PTEs, we would have a problem -- VM_BUG_ON_PAGE(PageCompound(page))
will get triggered.
It's possible since we drop mmap_sem during collapse to re-take it for
write.
Replace the VM_BUG_ON() with a graceful collapse failure.
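In __collapse_huge_page_isolate() that presumably becomes an early bail-out
instead of a BUG, roughly as below; the SCAN_PAGE_COMPOUND result code is an
assumption about the scan-result enum:
        /* A compound page here means the PMD was re-split under us; fail the collapse. */
        if (PageCompound(page)) {
                result = SCAN_PAGE_COMPOUND;
                goto out;
        }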
Link: http://lkml.kernel.org/r/20180315152353.27989-1-kirill.shutemov@linux.intel.com
Fixes: b1caa957ae6d ("khugepaged: ignore pmd tables with THP mapped with ptes")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This reverts commit 173950c2b1a0a6efb35e0e005f02f1495084b3e9.
|
|
This reverts fe1081e93ecc14ed2f6c620ce67fb402a1eab959
Patch has been dropped from mmotm tree with reason: obsolete
|
|
This reverts 709abb4526bcf5fe031ca726a7538b06682c9594
Patch has been dropped from mmotm tree with reason: wrecked
|
|
This reverts commit 864b75f9d6b0100bb24fdd9a20d156e7cda9b5ae.
Commit 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock
alignment") modified the logic in memmap_init_zone() to initialize
struct pages associated with invalid PFNs, to appease a VM_BUG_ON()
in move_freepages(), which is redundant by its own admission, and
dereferences struct page fields to obtain the zone without checking
whether the struct pages in question are valid to begin with.
Commit 864b75f9d6b0 only makes it worse, since the rounding it does
may cause pfn to assume the same value it had in a prior iteration of
the loop, resulting in an infinite loop and a hang very early in the
boot. Also, since it doesn't perform the same rounding on start_pfn
itself but only on intermediate values following an invalid PFN, we
may still hit the same VM_BUG_ON() as before.
So instead, let's fix this at the core, and ensure that the BUG
check doesn't dereference struct page fields of invalid pages.
Fixes: 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock alignment")
Tested-by: Jan Glauber <jglauber@cavium.com>
Tested-by: Shanker Donthineni <shankerd@codeaurora.org>
Cc: Daniel Vacek <neelx@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 3e04040df6d4613a8af5a80882d5f7f298f49810)
|
|
Use VM_WARN return value, remove too-obvious comment.
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nic Losby <blurbdust@gmail.com>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
These macros cause the build to break when someone uses their return value
with CONFIG_DEBUG_VM=n:
if (VM_WARN(from > to, "%s called with a negative range\n", __func__))
return -EINVAL;
mm/hugetlb.c: In function 'hugetlb_reserve_pages':
mm/hugetlb.c:4378: error: void value not ignored as it ought to be
Fix this up.
Reported-by: Michal Hocko <mhocko@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
add pool param to zs_huge_class_size() (Minchan)
Link: http://lkml.kernel.org/r/20180314081833.1096-3-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
add pool param to zs_huge_class_size() (Minchan)
Link: http://lkml.kernel.org/r/20180314081833.1096-2-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which clear
a given pud/pmd entry and free up lower level page table(s). Address
range associated with the pud/pmd entry must have been purged by INVLPG.
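On x86 the freeing side could look roughly like the following sketch of
pmd_free_pte_page(); pud_free_pmd_page() follows the same pattern one level
up:
        int pmd_free_pte_page(pmd_t *pmd)
        {
                pte_t *pte;
                if (pmd_none(*pmd))
                        return 1;
                /* Clear the pmd and free the pte page it pointed to. */
                pte = (pte_t *)pmd_page_vaddr(*pmd);
                pmd_clear(pmd);
                free_page((unsigned long)pte);
                return 1;
        }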
Link: http://lkml.kernel.org/r/20180314180155.19492-3-toshi.kani@hpe.com
Fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reported-by: Lei Li <lious.lilei@hisilicon.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
fix typo in pmd_free_pte_page() stub
Cc: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may create
pud/pmd mappings. Kernel panic was observed on arm64 systems with
Cortex-A75 in the following steps as described by Hanjun Guo.
1. ioremap a 4K size, valid page table will build,
2. iounmap it, pte0 will set to 0;
3. ioremap the same address with 2M size, pgd/pmd is unchanged,
then set a new value for the pmd;
4. pte0 is leaked;
5. CPU may meet exception because the old pmd is still in TLB,
which will lead to kernel panic.
This panic is not reproducible on x86. INVLPG, called from iounmap,
purges all levels of entries associated with the purged address on x86.
x86 still has a memory leak, though.
The patch changes the ioremap path to free unmapped page table(s) since
doing so in the unmap path has the following issues:
- The iounmap() path is shared with vunmap(). Since vmap() only
supports pte mappings, making vunmap() to free a pte page is an
overhead for regular vmap users as they do not need a pte page
freed up.
- Checking if all entries in a pte page are cleared in the unmap path
is racy, and serializing this check is expensive.
- The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
Clearing a pud/pmd entry before the lazy TLB purges needs extra TLB
purge.
Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which
clear a given pud/pmd entry and free up a page for the lower level
entries.
This patch implements their stub functions on x86 and arm64, which work
as a workaround.
Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com
Fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings")
Reported-by: Lei Li <lious.lilei@hisilicon.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Wang Xuefeng <wxf.wang@hisilicon.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chintan Pandya <cpandya@codeaurora.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
fix -ve left shift count on sh
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nic Losby <blurbdust@gmail.com>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Since commit ad67b74d2469d9b8 ("printk: hash addresses printed with
%p"), the virtual memory layout printed during boot up contains "ptrval"
instead of actual addresses:
Memory: 268040K/276480K available (2979K kernel code, 310K rwdata, 784K rodata, 144K init, 172K bss, 8440K reserved, 0K cma-reserved)
Virtual kernel memory layout:
vector : 0x003d2e74 - 0x003d3274 ( 1 KiB)
kmap : 0xd0000000 - 0xf0000000 ( 512 MiB)
vmalloc : 0x11800000 - 0xd0000000 (3048 MiB)
lowmem : 0x00000000 - 0x11000000 ( 272 MiB)
.init : 0x(ptrval) - 0x(ptrval) ( 144 KiB)
.text : 0x(ptrval) - 0x(ptrval) (2980 KiB)
.data : 0x(ptrval) - 0x(ptrval) (1095 KiB)
.bss : 0x(ptrval) - 0x(ptrval) ( 173 KiB)
Instead of changing the printing to "%px", and leaking virtual memory
layout information again, just remove the printing completely, cfr. e.g.
commit 071929dbdd865f77 ("arm64: Stop printing the virtual memory
layout").
All interesting information (actual section sizes) is already printed by
mem_init_print_info() just above anyway.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
(cherry picked from commit 445420c31cbba4a218b432bece0b500b6c4f9d63)
|
|
m32r allmodconfig complains with
security/integrity/digsig.c:146:2: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
vfree(data);
^
Signed-off-by: Michal Hocko <mhocko@suse.com>
|
|
KVM is hanging during postcopy live migration with userfaultfd because
get_user_pages_unlocked() is not capable of handling FOLL_NOWAIT.
Earlier FOLL_NOWAIT was only ever passed to get_user_pages().
Specifically faultin_page() (the callee of the get_user_pages_unlocked
caller) doesn't know that if FAULT_FLAG_RETRY_NOWAIT was set in the page
fault flags, when VM_FAULT_RETRY is returned, the mmap_sem wasn't actually
released (even if nonblocking is not NULL). So it sets *nonblocking to
zero and the caller won't release the mmap_sem, thinking it was already
released, but it wasn't, because of FOLL_NOWAIT.
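The fix presumably teaches faultin_page() to leave *nonblocking alone when
FAULT_FLAG_RETRY_NOWAIT is set, roughly:
        if (ret & VM_FAULT_RETRY) {
                /* With FAULT_FLAG_RETRY_NOWAIT the mmap_sem was not dropped, so do not pretend it was. */
                if (nonblocking && !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
                        *nonblocking = 0;
                return -EBUSY;
        }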
Link: http://lkml.kernel.org/r/20180302174343.5421-2-aarcange@redhat.com
Fixes: ce53053ce378c ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Tested-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 96312e61282ae3f6537a562625706498cbc75594)
|
|
security/keys/big_key.c needs vmalloc.h, per sfr
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|