2018-03-23  kvm: add address to mmu_notifier_retry() args and use new mmu_notifier helper  [mmu-notifier-rfc-kvm]  (Jérôme Glisse, 10 files, -13/+14)
Use the new mmu_notifier_addr_valid() helper to know if an address is undergoing any invalidation. This is more accurate than the mmu_notifier_count used so far. Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
2018-03-23  kvm: leverage mmu_notifier context information to avoid useless spt invalidation  (Jérôme Glisse, 1 file, -0/+14)
Use context information on why a range is being invalidated to determine whether we need to invalidate the secondary page table (spt) or not. Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
2018-03-23  mm/mmu_notifier: keep track of ranges being invalidated  [mmu-notifier-rfc]  (Jérôme Glisse, 2 files, -0/+66)
This keeps a list of all virtual address ranges being invalidated (i.e. inside a mmu_notifier_invalidate_range_start/end section). Also add a helper to check if a range is undergoing such invalidation. With this it is easy for a concurrent thread to ignore invalidations that do not affect the virtual address range it is working on. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Artemy Kovalyov <artemyko@mellanox.com> Cc: Evgeny Baskakov <ebaskakov@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Mark Hairgrove <mhairgrove@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com>
2018-03-23  mm/mmu_notifier: provide context information about range invalidation  (Jérôme Glisse, 15 files, -0/+53)
This patch just adds the information; it does not introduce any optimization, so there is no functional change with this patch. The mmu_notifier callback for range invalidation happens for a number of reasons. Provide some context information to the callback to allow for optimization. For instance, a device driver only needs to free its tracking structures for a range if the notification is for an munmap. Prior to this patch the driver would have to free them on each mmu_notifier callback and reallocate them on the next page fault (as it would have to assume it was an munmap). A protection change might also turn into a no-op for a driver: if the driver mapped a range read-only and the CPU page table is updated from read-write to read-only, then the device page table does not need an update. Those are just some of the optimizations this patch allows. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Artemy Kovalyov <artemyko@mellanox.com> Cc: Evgeny Baskakov <ebaskakov@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Mark Hairgrove <mhairgrove@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com>
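[Editor's illustration] A rough sketch of the driver-side optimization this enables. Everything here is hypothetical: the event argument, the MMU_NOTIFY_* names and the my_drv_* helpers are placeholders, not the RFC's actual API.

    /*
     * Hypothetical sketch only: event values and my_drv_* helpers are
     * illustrative, not the names introduced by this series.
     */
    static void my_drv_invalidate_range_start(struct mmu_notifier *mn,
                                              struct mm_struct *mm,
                                              unsigned long start,
                                              unsigned long end,
                                              int event)
    {
            struct my_drv *drv = container_of(mn, struct my_drv, notifier);

            switch (event) {
            case MMU_NOTIFY_UNMAP:
                    /* munmap: tracking structures can be freed outright */
                    my_drv_free_range(drv, start, end);
                    break;
            case MMU_NOTIFY_PROTECTION:
                    /* RW -> RO is a no-op if the device mapped it read-only */
                    if (my_drv_is_read_only(drv, start, end))
                            break;
                    my_drv_invalidate_range(drv, start, end);
                    break;
            default:
                    my_drv_invalidate_range(drv, start, end);
                    break;
            }
    }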
2018-03-23  mm/mmu_notifier: use struct for invalidate_range_start/end parameters  (Jérôme Glisse, 27 files, -293/+301)
Using a struct for mmu_notifier_invalidate_range_start()|end() makes it possible to add more parameters in the future without having to change every call site or every callback. There is no functional change with this patch. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Artemy Kovalyov <artemyko@mellanox.com> Cc: Evgeny Baskakov <ebaskakov@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Mark Hairgrove <mhairgrove@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com>
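[Editor's illustration] A minimal sketch of the "pass one struct instead of a growing argument list" pattern. The struct and field names are assumptions for illustration, not necessarily the series' exact layout.

    /* Illustrative only; field names are assumptions, not the patch's layout. */
    struct mmu_notifier_range {
            struct mm_struct *mm;
            unsigned long start;
            unsigned long end;
    };

    struct mmu_notifier_ops {
            /* Callbacks take one pointer, so adding a field later does not
             * require touching every caller or every implementation. */
            void (*invalidate_range_start)(struct mmu_notifier *mn,
                                           const struct mmu_notifier_range *range);
            void (*invalidate_range_end)(struct mmu_notifier *mn,
                                         const struct mmu_notifier_range *range);
    };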
2018-03-23  BUILDFIX  (Jérôme Glisse, 1 file, -0/+1)
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
2018-03-23Reverted "mm/hugetlb.c: clean up VM_WARN usage"Michal Hocko1-1/+4
This reverts commit 02ede1345ff68cd3e629864161c1d8adcef1eae5. Patch has been dropped from mmotm tree with reason: wrecked.
2018-03-23  mm: memcg: remote memcg charging for kmem allocations fix  (David Rientjes, 1 file, -9/+9)
fix build warning for CONFIG_SLOB: mm/memcontrol.c:706:27: warning: 'get_mem_cgroup' defined but not used [-Wunused-function] static struct mem_cgroup *get_mem_cgroup(struct mem_cgroup *memcg) Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803181318530.241887@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: kbuild test robot <lkp@intel.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-23  zboot: fix stack protector in compressed boot phase  (Huacai Chen, 7 files, -21/+20)
Calling __stack_chk_guard_setup() in decompress_kernel() is too late, so stack checking always fails for decompress_kernel() itself. So remove __stack_chk_guard_setup() and initialize __stack_chk_guard before we call decompress_kernel(). The original code comes from ARM but is also used for MIPS and SH, so fix them together. Without this fix, compressed booting of these archs will fail because stack checking is enabled by default (>=4.16). Link: http://lkml.kernel.org/r/1521186916-13745-1-git-send-email-chenhc@lemote.com Signed-off-by: Huacai Chen <chenhc@lemote.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: James Hogan <james.hogan@mips.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-23  kernel/kexec_file.c: load kernel at top of system RAM if required  (Baoquan He, 1 file, -0/+2)
For kexec_file loading, if kexec_buf.top_down is 'true', the memory which is used to load the kernel/initrd/purgatory is supposed to be allocated from top to down. This is also consistent with the old kexec loading interface. However, the current arch_kexec_walk_mem() doesn't behave like this. It ignores kexec_buf.top_down and calls walk_system_ram_res() directly to go through all resources of System RAM, to try to find a memory region which can contain the specific kexec buffer, then calls locate_mem_hole_callback() to allocate memory in that found memory region from top to down. This is not right. Add a check of kexec_buf.top_down in arch_kexec_walk_mem(); if it is 'true', call the newly added walk_system_ram_res_rev() to find a memory region from top to down in which to load the kernel. Link: http://lkml.kernel.org/r/20180322033722.9279-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Dave Young <dyoung@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Philipp Rudo <prudo@linux.vnet.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
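[Editor's illustration] The shape of the check being described is roughly the following; the signature of walk_system_ram_res_rev() is assumed to mirror walk_system_ram_res(), and the crash-kernel branch of the real function is omitted, so treat this as a sketch rather than the literal patch.

    /* Sketch: honor kbuf->top_down by walking System RAM from high to low. */
    int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
                                   int (*func)(struct resource *, void *))
    {
            if (kbuf->top_down)
                    return walk_system_ram_res_rev(0, ULONG_MAX, kbuf, func);
            return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
    }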
2018-03-23  kernel/kexec_file.c: add walk_system_ram_res_rev()  (AKASHI Takahiro, 2 files, -0/+66)
This function, a variant of walk_system_ram_res() introduced in commit 8c86e70acead ("resource: provide new functions to walk through resources"), walks through the list of all System RAM resources in reverse order, i.e., from higher to lower. It will be used in the kexec_file code. Link: http://lkml.kernel.org/r/20180322033722.9279-2-bhe@redhat.com Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Philipp Rudo <prudo@linux.vnet.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-23  fs/dcache.c: add cond_resched() in shrink_dentry_list()  (Nikolay Borisov, 1 file, -3/+3)
As previously reported (https://patchwork.kernel.org/patch/8642031/) it's possible to call shrink_dentry_list with a large number of dentries (> 10000). This, in turn, could trigger the softlockup detector and possibly trigger a panic. In addition to the unmount path being vulnerable to this scenario, at SuSE we've observed similar situation happening during process exit on processes that touch a lot of dentries. Here is an excerpt from a crash dump. The number after the colon are the number of dentries on the list passed to shrink_dentry_list: PID 99760: 10722 PID 107530: 215 PID 108809: 24134 PID 108877: 21331 PID 141708: 16487 So we want to kill between 15k-25k dentries without yielding. And one possible call stack looks like: 4 [ffff8839ece41db0] _raw_spin_lock at ffffffff8152a5f8 5 [ffff8839ece41db0] evict at ffffffff811c3026 6 [ffff8839ece41dd0] __dentry_kill at ffffffff811bf258 7 [ffff8839ece41df0] shrink_dentry_list at ffffffff811bf593 8 [ffff8839ece41e18] shrink_dcache_parent at ffffffff811bf830 9 [ffff8839ece41e50] proc_flush_task at ffffffff8120dd61 10 [ffff8839ece41ec0] release_task at ffffffff81059ebd 11 [ffff8839ece41f08] do_exit at ffffffff8105b8ce 12 [ffff8839ece41f78] sys_exit at ffffffff8105bd53 13 [ffff8839ece41f80] system_call_fastpath at ffffffff81532909 While some of the callers of shrink_dentry_list do use cond_resched, this is not sufficient to prevent softlockups. So just move cond_resched into shrink_dentry_list from its callers. David said: I've found hundreds of occurrences of warnings that we emit when need_resched stays set for a prolonged period of time with the stack trace that is included in the change log. Link: http://lkml.kernel.org/r/1521718946-31521-1-git-send-email-nborisov@suse.com Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
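[Editor's illustration] A minimal sketch of the change described, not the exact dcache code: the yield moves into the loop that walks the dentry list, so even a 20k-entry list cannot keep need_resched set for seconds.

    /* Sketch only: the point is cond_resched() inside the per-dentry loop. */
    static void shrink_dentry_list(struct list_head *list)
    {
            while (!list_empty(list)) {
                    struct dentry *dentry;

                    /* The fix: reschedule once per dentry instead of relying
                     * on (some of) the callers to do it for the whole list. */
                    cond_resched();

                    dentry = list_entry(list->prev, struct dentry, d_lru);
                    /* ... existing __dentry_kill()/requeue logic, unchanged ... */
            }
    }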
2018-03-23  mm,oom_reaper: check for MMF_OOM_SKIP before complaining  (Tetsuo Handa, 1 file, -1/+2)
I got "oom_reaper: unable to reap pid:" messages when the victim thread was blocked inside free_pgtables() (which occurred after returning from unmap_vmas() and setting MMF_OOM_SKIP). We don't need to complain when exit_mmap() already set MMF_OOM_SKIP. [ 663.593821] Killed process 7558 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB [ 664.684801] oom_reaper: unable to reap pid:7558 (a.out) [ 664.892292] a.out D13272 7558 6931 0x00100084 [ 664.895765] Call Trace: [ 664.897574] ? __schedule+0x25f/0x780 [ 664.900099] schedule+0x2d/0x80 [ 664.902260] rwsem_down_write_failed+0x2bb/0x440 [ 664.905249] ? rwsem_down_write_failed+0x55/0x440 [ 664.908335] ? free_pgd_range+0x569/0x5e0 [ 664.911145] call_rwsem_down_write_failed+0x13/0x20 [ 664.914121] down_write+0x49/0x60 [ 664.916519] ? unlink_file_vma+0x28/0x50 [ 664.919255] unlink_file_vma+0x28/0x50 [ 664.922234] free_pgtables+0x36/0x100 [ 664.924797] exit_mmap+0xbb/0x180 [ 664.927220] mmput+0x50/0x110 [ 664.929504] copy_process.part.41+0xb61/0x1fe0 [ 664.932448] ? _do_fork+0xe6/0x560 [ 664.934902] ? _do_fork+0xe6/0x560 [ 664.937361] _do_fork+0xe6/0x560 [ 664.939742] ? syscall_trace_enter+0x1a9/0x240 [ 664.942693] ? retint_user+0x18/0x18 [ 664.945309] ? page_fault+0x2f/0x50 [ 664.947896] ? trace_hardirqs_on_caller+0x11f/0x1b0 [ 664.951075] do_syscall_64+0x74/0x230 [ 664.953747] entry_SYSCALL_64_after_hwframe+0x42/0xb7 Link: http://lkml.kernel.org/r/201803221946.DHG65638.VFJHFtOSQLOMOF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-23  mm/ksm: fix interaction with THP  (Claudio Imbrenda, 1 file, -0/+28)
This patch fixes a corner case for KSM. When two pages belong or belonged to the same transparent hugepage, and they should be merged, KSM fails to split the page, and therefore no merging happens. This bug can be reproduced by: * making sure ksm is running (disabling ksmtuned if necessary) * enabling transparent hugepages * allocating a THP-aligned 1-THP-sized buffer e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21) * filling it with the same values e.g. memset(p, 42, 1<<21) * performing madvise to make it mergeable e.g. madvise(p, 1<<21, MADV_MERGEABLE) * waiting for KSM to perform a few scans The expected outcome is that all the pages get merged (1 shared and the rest sharing); the actual outcome is that no pages get merged (1 unshared and the rest volatile). The reason for this behaviour is that we increase the reference count once for both pages we want to merge, but if they belong to the same hugepage (or compound page), the reference counter used in both cases is the one of the head of the compound page. This means that split_huge_page will find a reference counter value that is too high and will fail. This patch solves the problem by testing whether the two pages to merge belong to the same hugepage when attempting to merge them. If so, the hugepage is split safely. This means that the hugepage is not split if not necessary. Link: http://lkml.kernel.org/r/1521548069-24758-1-git-send-email-imbrenda@linux.vnet.ibm.com Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Co-authored-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
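[Editor's illustration] The reproduction steps above translate into a small self-contained userspace program along these lines (a 2 MiB THP size is assumed, as on amd64):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define THP_SIZE (1UL << 21)    /* assumed 2 MiB THP, as on amd64 */

    int main(void)
    {
            void *p;

            /* THP-aligned, THP-sized buffer so it can be backed by one hugepage */
            if (posix_memalign(&p, THP_SIZE, THP_SIZE)) {
                    perror("posix_memalign");
                    return 1;
            }
            memset(p, 42, THP_SIZE);    /* identical pages: candidates for KSM */
            if (madvise(p, THP_SIZE, MADV_MERGEABLE)) {
                    perror("madvise");
                    return 1;
            }
            /* give ksmd a few scans; watch /sys/kernel/mm/ksm/pages_shared */
            sleep(60);
            return 0;
    }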
2018-03-23  mm-vmscan-dont-mess-with-pgdat-flags-in-memcg-reclaim-checkpatch-fixes  (Andrew Morton, 1 file, -12/+14)
WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line) #29: while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done & WARNING: line over 80 characters #214: FILE: mm/vmscan.c:2581: + * zones or there is heavy usage of a slow backing device. WARNING: line over 80 characters #225: FILE: mm/vmscan.c:2592: + if (stat.nr_writeback && stat.nr_writeback == stat.nr_taken) WARNING: line over 80 characters #255: FILE: mm/vmscan.c:2630: + current_may_throttle() && pgdat_memcg_congested(pgdat, root)) total: 0 errors, 4 warnings, 189 lines checked NOTE: For some of the reported defects, checkpatch may be able to mechanically convert to the typical style using --fix or --fix-inplace. ./patches/mm-vmscan-dont-mess-with-pgdat-flags-in-memcg-reclaim.patch has style problems, please review. NOTE: If any of the errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-23  mm/vmscan: don't mess with pgdat->flags in memcg reclaim  (Andrey Ryabinin, 4 files, -37/+70)
memcg reclaim may alter pgdat->flags based on the state of LRU lists in the cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep in congestion_wait(), PGDAT_DIRTY may force kswapd to writeback filesystem pages. But the worst here is PGDAT_CONGESTED, since it may force all direct reclaims to stall in wait_iff_congested(). Note that only kswapd has the power to clear any of these bits, and that might just never happen if cgroup limits are configured that way. So all direct reclaims will stall as long as we have some congested bdi in the system. Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole pgdat, so it's reasonable to leave all decisions about node state to kswapd. Also add per-cgroup congestion state to avoid needlessly burning CPU in cgroup reclaim if heavy congestion is observed. Currently there is no need for per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY bits since they alter only kswapd behavior. The problem could be easily demonstrated by creating heavy congestion in one cgroup: echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control mkdir -p /sys/fs/cgroup/congester echo 512M > /sys/fs/cgroup/congester/memory.max echo $$ > /sys/fs/cgroup/congester/cgroup.procs /* generate a lot of dirty data on slow HDD */ while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done & .... while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done & and some job in another cgroup: mkdir /sys/fs/cgroup/victim echo 128M > /sys/fs/cgroup/victim/memory.max # time cat /dev/sda > /dev/null real 10m15.054s user 0m0.487s sys 1m8.505s According to the tracepoint in wait_iff_congested(), the 'cat' spent 50% of the time sleeping there. With the patch, 'cat' doesn't waste time there anymore: # time cat /dev/sda > /dev/null real 5m32.911s user 0m0.411s sys 0m56.664s Link: http://lkml.kernel.org/r/20180315164553.17856-6-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-23  mm/vmscan: don't change pgdat state on base of a single LRU list state.  (Andrey Ryabinin, 1 file, -58/+73)
We have a separate LRU list for each memory cgroup. Memory reclaim iterates over cgroups and calls shrink_inactive_list() on every inactive LRU list. Based on the state of a single LRU, shrink_inactive_list() may flag the whole node as dirty, congested or under writeback. This is obviously wrong and hurtful. It's especially hurtful when we have a possibly small congested cgroup in the system: then *all* direct reclaims waste time by sleeping in wait_iff_congested(). Sum the reclaim stats across all visited LRUs on the node and flag the node as dirty, congested or under writeback based on that sum. This only fixes the problem for the global reclaim case. Per-cgroup reclaim will be addressed separately by the next patch. This change will also affect systems with no memory cgroups. The reclaimer now makes decisions based on the reclaim stats of both the anon and file LRU lists. E.g. if the file list is in a congested state and get_scan_count() decided to reclaim some anon pages, the reclaimer will start shrinking anon without a delay in wait_iff_congested(), like it did before. That seems to be a reasonable thing to do. Why waste time sleeping before reclaiming anon, given that we are going to try to reclaim it anyway? Link: http://lkml.kernel.org/r/20180315164553.17856-5-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-23  mm/vmscan: remove redundant current_may_throttle() check  (Andrey Ryabinin, 1 file, -1/+1)
Only kswapd can have non-zero nr_immediate, and current_may_throttle() is always true for kswapd (PF_LESS_THROTTLE bit is never set) thus it's enough to check stat.nr_immediate only. Link: http://lkml.kernel.org/r/20180315164553.17856-4-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-23  mm/vmscan: replace mm_vmscan_lru_shrink_inactive with shrink_page_list tracepoint  (Andrey Ryabinin, 2 files, -34/+20)
With upcoming changes, keeping the mm_vmscan_lru_shrink_inactive tracepoint intact would require some additional code churn. In particular, 'struct reclaim_stat' will gain wider usage, but we don't need the 'nr_activate', 'nr_ref_keep', 'nr_unmap_fail' counters anywhere besides the tracepoint. Since the mm_vmscan_lru_shrink_inactive tracepoint mostly provides information collected by shrink_page_list(), we can just replace it with a tracepoint in shrink_page_list(). We don't have the 'nr_scanned' and 'file' arguments there, but the user can obtain this information from the mm_vmscan_lru_isolate tracepoint. Link: http://lkml.kernel.org/r/20180315164553.17856-3-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-23  mm/vmscan: update stale comments  (Andrey Ryabinin, 1 file, -5/+5)
Update some comments that became stale since the transition from per-zone to per-node reclaim. Link: http://lkml.kernel.org/r/20180315164553.17856-2-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-23  mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t  (Stefan Agner, 1 file, -2/+2)
This fixes a warning shown when phys_addr_t is 32-bit int when compiling with clang: mm/memblock.c:927:15: warning: implicit conversion from 'unsigned long long' to 'phys_addr_t' (aka 'unsigned int') changes value from 18446744073709551615 to 4294967295 [-Wconstant-conversion] r->base : ULLONG_MAX; ^~~~~~~~~~ ./include/linux/kernel.h:30:21: note: expanded from macro 'ULLONG_MAX' #define ULLONG_MAX (~0ULL) ^~~~~ Link: http://lkml.kernel.org/r/20180319005645.29051-1-stefan@agner.ch Signed-off-by: Stefan Agner <stefan@agner.ch> Cc: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
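[Editor's illustration] Based on the warning quoted above, the change is the usual pattern of making the truncation explicit so clang stops complaining; a schematic before/after rather than the full hunk:

    /* before: ULLONG_MAX is silently truncated when phys_addr_t is 32-bit */
    end = r->base ? r->base : ULLONG_MAX;

    /* after: the truncation is intentional, so spell it out */
    end = r->base ? r->base : (phys_addr_t)ULLONG_MAX;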
2018-03-23  include/linux/mmdebug.h: make VM_WARN* non-rvals  (Michal Hocko, 1 file, -4/+4)
At present the construct if (VM_WARN(...)) will compile OK with CONFIG_DEBUG_VM=y and will fail with CONFIG_DEBUG_VM=n. The reason is that VM_{WARN,BUG}* have always been special wrt. {WARN/BUG}* and never generate any code when DEBUG_VM is disabled. So we cannot really use it in conditionals. We considered changing things so that this construct works in both cases but that might cause unwanted code generation with CONFIG_DEBUG_VM=n. It is safer and simpler to make the build fail in both cases. [akpm@linux-foundation.org: changelog] Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-23  mm-free_pcppages_bulk-prefetch-buddy-while-not-holding-lock-v4-update2  (Aaron Lu, 1 file, -11/+14)
Use a helper function to prefetch buddy as suggested by Dave Hansen. Drop the change of list_add_tail() to avoid disordering page. Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com Link: http://lkml.kernel.org/r/20180320113146.GB24737@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Suggested-by: Ying Huang <ying.huang@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-23  mm/memcontrol.c: fix parameter description mismatch  (Honglei Wang, 1 file, -3/+3)
There are a couple of places where parameter description and function name do not match the actual code. Fix it. Link: http://lkml.kernel.org/r/1520843448-17347-1-git-send-email-honglei.wang@oracle.com Signed-off-by: Honglei Wang <honglei.wang@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-23  mm/vmstat.c: Fix vmstat_update() preemption BUG.  (Steven J. Hill, 1 file, -0/+2)
Attempting to hotplug CPUs with CONFIG_VM_EVENT_COUNTERS enabled can cause vmstat_update() to report a BUG due to preemption not being disabled around smp_processor_id(). Discovered on Ubiquiti EdgeRouter Pro with Cavium Octeon II processor. BUG: using smp_processor_id() in preemptible [00000000] code: kworker/1:1/269 caller is vmstat_update+0x50/0xa0 CPU: 0 PID: 269 Comm: kworker/1:1 Not tainted 4.16.0-rc4-Cavium-Octeon-00009-gf83bbd5-dirty #1 Workqueue: mm_percpu_wq vmstat_update Stack : 0000002600000026 0000000010009ce0 0000000000000000 0000000000000001 0000000000000000 0000000000000000 0000000000000005 8001180000000800 00000000000000bf 0000000000000000 00000000000000bf 766d737461745f75 ffffffff83ad0000 0000000000000007 0000000000000000 0000000008000000 0000000000000000 ffffffff818d0000 0000000000000001 ffffffff81818a70 0000000000000000 0000000000000000 ffffffff8115bbb0 ffffffff818a0000 0000000000000005 ffffffff8144dc50 0000000000000000 0000000000000000 8000000088980000 8000000088983b30 0000000000000088 ffffffff813d3054 0000000000000000 ffffffff83ace622 00000000000000be 0000000000000000 00000000000000be ffffffff81121fb4 0000000000000000 0000000000000000 ... Call Trace: [<ffffffff81121fb4>] show_stack+0x94/0x128 [<ffffffff813d3054>] dump_stack+0xa4/0xe0 [<ffffffff813fcfb8>] check_preemption_disabled+0x118/0x120 [<ffffffff811eafd8>] vmstat_update+0x50/0xa0 [<ffffffff8115b954>] process_one_work+0x144/0x348 [<ffffffff8115bd00>] worker_thread+0x150/0x4b8 [<ffffffff811622a0>] kthread+0x110/0x140 [<ffffffff8111c304>] ret_from_kernel_thread+0x14/0x1c Link: http://lkml.kernel.org/r/1520881552-25659-1-git-send-email-steven.hill@cavium.com Signed-off-by: Steven J. Hill <steven.hill@cavium.com> Cc: Tejun Heo <htejun@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
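[Editor's illustration] One straightforward way to silence a splat like this is to keep smp_processor_id() inside a non-preemptible section when re-arming the work item. A hedged sketch of vmstat_update() with that shape; the actual fix may differ in detail.

    static void vmstat_update(struct work_struct *w)
    {
            if (refresh_cpu_vm_stats(true)) {
                    /* Re-arm on this CPU; disabling preemption keeps
                     * smp_processor_id() stable for the queueing call. */
                    preempt_disable();
                    queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
                                    this_cpu_ptr(&vmstat_work),
                                    round_jiffies_relative(sysctl_stat_interval));
                    preempt_enable();
            }
    }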
2018-03-23  mm/page_owner: fix recursion bug after changing skip entries  (Maninder Singh, 1 file, -3/+3)
This patch fixes 5f48f0bd4e368425 ("mm, page_owner: skip unnecessary stack_trace entries"). Because if we skip first two entries then logic of checking count value as 2 for recursion is broken and code will go in one depth recursion. so we need to check only one call of _RET_IP(__set_page_owner) while checking for recursion. Current Backtrace while checking for recursion:- (save_stack) from (__set_page_owner) // (But recursion returns true here) (__set_page_owner) from (get_page_from_freelist) (get_page_from_freelist) from (__alloc_pages_nodemask) (__alloc_pages_nodemask) from (depot_save_stack) (depot_save_stack) from (save_stack) // recursion should return true here (save_stack) from (__set_page_owner) (__set_page_owner) from (get_page_from_freelist) (get_page_from_freelist) from (__alloc_pages_nodemask+) (__alloc_pages_nodemask) from (depot_save_stack) (depot_save_stack) from (save_stack) (save_stack) from (__set_page_owner) (__set_page_owner) from (get_page_from_freelist) Correct Backtrace with fix: (save_stack) from (__set_page_owner) // recursion returned true here (__set_page_owner) from (get_page_from_freelist) (get_page_from_freelist) from (__alloc_pages_nodemask+) (__alloc_pages_nodemask) from (depot_save_stack) (depot_save_stack) from (save_stack) (save_stack) from (__set_page_owner) (__set_page_owner) from (get_page_from_freelist) Link: http://lkml.kernel.org/r/1521607043-34670-1-git-send-email-maninder1.s@samsung.com Fixes: 5f48f0bd4e36 ("mm, page_owner: skip unnecessary stack_trace entries") Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Signed-off-by: Vaneet Narang <v.narang@samsung.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@techadventures.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ayush Mittal <ayush.m@samsung.com> Cc: Prakash Gupta <guptap@codeaurora.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Vasyl Gomonovych <gomonovych@gmail.com> Cc: Amit Sahrawat <a.sahrawat@samsung.com> Cc: <pankaj.m@samsung.com> Cc: Vaneet Narang <v.narang@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-23  ipc/shm.c: add split function to shm_vm_ops  (Mike Kravetz, 1 file, -0/+12)
If System V shmget/shmat operations are used to create a hugetlbfs backed mapping, it is possible to munmap part of the mapping and split the underlying vma such that it is not huge page aligned. This will untimately result in the following BUG: kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310! Oops: Exception in kernel mode, sig: 5 [#1] LE SMP NR_CPUS=2048 NUMA PowerNV Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt 8<--8<--8<--8< snip 8<--8<--8<--8< CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G C E 4.15.0-10-generic #11-Ubuntu NIP: c00000000036e764 LR: c00000000036ee48 CTR: 0000000000000009 REGS: c000003fbcdcf810 TRAP: 0700 Tainted: G C E (4.15.0-10-generic) MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002222 XER: 20040000 CFAR: c00000000036ee44 SOFTE: 1 GPR00: c00000000036ee48 c000003fbcdcfa90 c0000000016ea600 c000003fbcdcfc40 GPR04: c000003fd9858950 00007115e4e00000 00007115e4e10000 0000000000000000 GPR08: 0000000000000010 0000000000010000 0000000000000000 0000000000000000 GPR12: 0000000000002000 c000000007a2c600 00000fe3985954d0 00007115e4e00000 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR20: 00000fe398595a94 000000000000a6fc c000003fd9858950 0000000000018554 GPR24: c000003fdcd84500 c0000000019acd00 00007115e4e10000 c000003fbcdcfc40 GPR28: 0000000000200000 00007115e4e00000 c000003fbc9ac600 c000003fd9858950 NIP [c00000000036e764] __unmap_hugepage_range+0xa4/0x760 LR [c00000000036ee48] __unmap_hugepage_range_final+0x28/0x50 Call Trace: [c000003fbcdcfa90] [00007115e4e00000] 0x7115e4e00000 (unreliable) [c000003fbcdcfb50] [c00000000036ee48] __unmap_hugepage_range_final+0x28/0x50 [c000003fbcdcfb80] [c00000000033497c] unmap_single_vma+0x11c/0x190 [c000003fbcdcfbd0] [c000000000334e14] unmap_vmas+0x94/0x140 [c000003fbcdcfc20] [c00000000034265c] exit_mmap+0x9c/0x1d0 [c000003fbcdcfce0] [c000000000105448] mmput+0xa8/0x1d0 [c000003fbcdcfd10] [c00000000010fad0] do_exit+0x360/0xc80 [c000003fbcdcfdd0] [c0000000001104c0] do_group_exit+0x60/0x100 [c000003fbcdcfe10] [c000000000110584] SyS_exit_group+0x24/0x30 [c000003fbcdcfe30] [c00000000000b184] system_call+0x58/0x6c Instruction dump: 552907fe e94a0028 e94a0408 eb2a0018 81590008 7f9c5036 0b090000 e9390010 7d2948f8 7d2a2838 0b0a0000 7d293038 <0b090000> e9230086 2fa90000 419e0468 ---[ end trace ee88f958a1c62605 ]--- This bug was introduced by 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct"). A split function was added to vm_operations_struct to determine if a mapping can be split. This was mostly for device-dax and hugetlbfs mappings which have specific alignment constraints. Mappings initiated via shmget/shmat have their original vm_ops overwritten with shm_vm_ops. shm_vm_ops functions will call back to the original vm_ops if needed. Add such a split function to shm_vm_ops. Link: http://lkml.kernel.org/r/20180321161314.7711-1-mike.kravetz@oracle.com Fixes: 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Reviewed-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
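[Editor's illustration] The pass-through described above ("shm_vm_ops functions will call back to the original vm_ops if needed") looks roughly like the sketch below; treat it as a sketch of the idea rather than the literal patch.

    /* Sketch: defer the ->split() decision to the underlying (e.g. hugetlbfs)
     * vm_ops that shm wrapped at attach time, mirroring the other shm_vm_ops
     * pass-through helpers. */
    static int shm_split(struct vm_area_struct *vma, unsigned long addr)
    {
            struct file *file = vma->vm_file;
            struct shm_file_data *sfd = shm_file_data(file);

            if (sfd->vm_ops->split)
                    return sfd->vm_ops->split(vma, addr);

            return 0;
    }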
2018-03-23  mm, slab: memcg_link the SLAB's kmem_cache  (Shakeel Butt, 1 file, -0/+1)
All the root caches are linked into slab_root_caches, which was introduced by commit 510ded33e075 ("slab: implement slab_root_caches list"), but it missed adding the SLAB's kmem_cache. While experimenting with opt-in/opt-out kmem accounting, I noticed system crashes due to a NULL dereference inside cache_from_memcg_idx() while dereferencing kmem_cache.memcg_params.memcg_caches. A clean upstream kernel will not see these crashes, but SLAB should be consistent with SLUB, which does link its boot caches (kmem_cache_node and kmem_cache) into slab_root_caches. Link: http://lkml.kernel.org/r/20180319210020.60289-1-shakeelb@google.com Fixes: 510ded33e075c ("slab: implement slab_root_caches list") Signed-off-by: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-23  mm, thp: do not cause memcg oom for thp  (David Rientjes, 2 files, -4/+9)
Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations") changed the page allocator to no longer detect thp allocations based on __GFP_NORETRY. It did not, however, modify the mem cgroup try_charge() path to avoid oom kill for either khugepaged collapsing or thp faulting. It is never expected to oom kill a process to allocate a hugepage for thp; reclaim is governed by the thp defrag mode and MADV_HUGEPAGE, but allocations (and charging) should fallback instead of oom killing processes. Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803191409420.124411@chino.kir.corp.google.com Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations") Signed-off-by: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <>
2018-03-23  mm/vmscan: wake up flushers for legacy cgroups too  (Andrey Ryabinin, 1 file, -15/+16)
Commit 726d061fbd36 ("mm: vmscan: kick flushers when we encounter dirty pages on the LRU") added flusher invocation to shrink_inactive_list() when many dirty pages on the LRU are encountered. However, shrink_inactive_list() doesn't wake up flushers for legacy cgroup reclaim, so the next commit bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path") removed the only source of flusher's wake up in legacy mem cgroup reclaim path. This leads to premature OOM if there is too many dirty pages in cgroup: # mkdir /sys/fs/cgroup/memory/test # echo $$ > /sys/fs/cgroup/memory/test/tasks # echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes # dd if=/dev/zero of=tmp_file bs=1M count=100 Killed dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0 Call Trace: dump_stack+0x46/0x65 dump_header+0x6b/0x2ac oom_kill_process+0x21c/0x4a0 out_of_memory+0x2a5/0x4b0 mem_cgroup_out_of_memory+0x3b/0x60 mem_cgroup_oom_synchronize+0x2ed/0x330 pagefault_out_of_memory+0x24/0x54 __do_page_fault+0x521/0x540 page_fault+0x45/0x50 Task in /test killed as a result of limit of /test memory: usage 51200kB, limit 51200kB, failcnt 73 memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0 kmem: usage 296kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB Wake up flushers in legacy cgroup reclaim too. Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com Fixes: bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path") Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Tested-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <>
2018-03-23Revert "mm: page_alloc: skip over regions of invalid pfns where possible"Daniel Vacek3-40/+1
This reverts b92df1de5d289c0b ("mm: page_alloc: skip over regions of invalid pfns where possible"). The commit is meant to be a boot init speed up skipping the loop in memmap_init_zone() for invalid pfns. But given some specific memory mapping on x86_64 (or more generally theoretically anywhere but on arm with CONFIG_HAVE_ARCH_PFN_VALID) the implementation also skips valid pfns which is plain wrong and causes 'kernel BUG at mm/page_alloc.c:1389!' crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1 kernel BUG at mm/page_alloc.c:1389! invalid opcode: 0000 [#1] SMP -- RIP: 0010:[<ffffffff8118833e>] [<ffffffff8118833e>] move_freepages+0x15e/0x160 RSP: 0018:ffff88054d727688 EFLAGS: 00010087 -- Call Trace: [<ffffffff811883b3>] move_freepages_block+0x73/0x80 [<ffffffff81189e63>] __rmqueue+0x263/0x460 [<ffffffff8118c781>] get_page_from_freelist+0x7e1/0x9e0 [<ffffffff8118caf6>] __alloc_pages_nodemask+0x176/0x420 -- RIP [<ffffffff8118833e>] move_freepages+0x15e/0x160 RSP <ffff88054d727688> crash> page_init_bug -v | grep RAM <struct resource 0xffff88067fffd2f8> 1000 - 9bfff System RAM (620.00 KiB) <struct resource 0xffff88067fffd3a0> 100000 - 430bffff System RAM ( 1.05 GiB = 1071.75 MiB = 1097472.00 KiB) <struct resource 0xffff88067fffd410> 4b0c8000 - 4bf9cfff System RAM ( 14.83 MiB = 15188.00 KiB) <struct resource 0xffff88067fffd480> 4bfac000 - 646b1fff System RAM (391.02 MiB = 400408.00 KiB) <struct resource 0xffff88067fffd560> 7b788000 - 7b7fffff System RAM (480.00 KiB) <struct resource 0xffff88067fffd640> 100000000 - 67fffffff System RAM ( 22.00 GiB) crash> page_init_bug | head -6 <struct resource 0xffff88067fffd560> 7b788000 - 7b7fffff System RAM (480.00 KiB) <struct page 0xffffea0001ede200> 1fffff00000000 0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32 4096 1048575 <struct page 0xffffea0001ede200> 505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0> <struct page 0xffffea0001ed8000> 0 0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA 1 4095 <struct page 0xffffea0001edffc0> 1fffff00000400 0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32 4096 1048575 BUG, zones differ! crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000 PAGE PHYSICAL MAPPING INDEX CNT FLAGS ffffea0001e00000 78000000 0 0 0 0 ffffea0001ed7fc0 7b5ff000 0 0 0 0 ffffea0001ed8000 7b600000 0 0 0 0 <<<< ffffea0001ede1c0 7b787000 0 0 0 0 ffffea0001ede200 7b788000 0 0 1 1fffff00000000 Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible") Signed-off-by: Daniel Vacek <neelx@redhat.com> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <>
2018-03-23  mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink()  (Kirill A. Shutemov, 1 file, -11/+20)
shmem_unused_huge_shrink() gets called from the reclaim path. Waiting for the page lock may lead to deadlock there. There was a bug report that may be attributed to this: http://lkml.kernel.org/r/alpine.LRH.2.11.1801242349220.30642@mail.ewheeler.net Replace lock_page() with trylock_page() and skip the page if we fail to lock it. We will get to the page on the next scan. We can test for PageTransHuge() outside the page lock as we only need protection against splitting the page under us. Holding a pin on the page is enough for this. Link: http://lkml.kernel.org/r/20180316210830.43738-1-kirill.shutemov@linux.intel.com Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Eric Wheeler <linux-mm@lists.ewheeler.net> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <>
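[Editor's illustration] The pattern applied here (and in the deferred_split_scan() fix in the next entry) is simply "never block on a page lock from the reclaim path". A schematic fragment of the idea, not the exact shmem code:

    /* Schematic: opportunistic locking inside a reclaim/shrinker scan loop. */
    if (!trylock_page(page))
            continue;       /* skip for now; the next scan will get to it */

    /* work that needs the page lock, e.g. split_huge_page(page) */
    unlock_page(page);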
2018-03-23  mm/thp: do not wait for lock_page() in deferred_split_scan()  (Kirill A. Shutemov, 1 file, -1/+3)
deferred_split_scan() gets called from reclaim path. Waiting for page lock may lead to deadlock there. Replace lock_page() with trylock_page() and skip the page if we failed to lock it. We will get to the page on the next scan. Link: http://lkml.kernel.org/r/20180315150747.31945-1-kirill.shutemov@linux.intel.com Fixes: 9a982250f773 ("thp: introduce deferred_split_huge_page()") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <>
2018-03-23  mm/khugepaged.c: convert VM_BUG_ON() to collapse fail  (Kirill A. Shutemov, 1 file, -1/+6)
khugepaged is not yet able to convert PTE-mapped huge pages back to PMD-mapped. We do not collapse such pages. See the check in khugepaged_scan_pmd(). But if, between khugepaged_scan_pmd() and __collapse_huge_page_isolate(), somebody managed to instantiate THP in the range and then split the PMD back to PTEs, we would have a problem: VM_BUG_ON_PAGE(PageCompound(page)) will get triggered. It's possible since we drop mmap_sem during collapse to re-take it for write. Replace the VM_BUG_ON() with a graceful collapse fail. Link: http://lkml.kernel.org/r/20180315152353.27989-1-kirill.shutemov@linux.intel.com Fixes: b1caa957ae6d ("khugepaged: ignore pmd tables with THP mapped with ptes") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <>
2018-03-23Revert "mm/memblock.c: hardcode the end_pfn being -1"Michal Hocko1-5/+5
This reverts commit 173950c2b1a0a6efb35e0e005f02f1495084b3e9.
2018-03-23Reverted "mm: remove unused arg from memblock_next_valid_pfn()"Michal Hocko3-3/+4
This reverts commit fe1081e93ecc14ed2f6c620ce67fb402a1eab959. Patch has been dropped from mmotm tree with reason: obsolete.
2018-03-23Reverted "mm/sparse.c: optimize memmap allocation during sparse_init()"Michal Hocko2-26/+12
This reverts commit 709abb4526bcf5fe031ca726a7538b06682c9594. Patch has been dropped from mmotm tree with reason: wrecked.
2018-03-15Revert "mm/page_alloc: fix memmap_init_zone pageblock alignment"Ard Biesheuvel1-8/+5
This reverts commit 864b75f9d6b0100bb24fdd9a20d156e7cda9b5ae. Commit 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock alignment") modified the logic in memmap_init_zone() to initialize struct pages associated with invalid PFNs, to appease a VM_BUG_ON() in move_freepages(), which is redundant by its own admission, and dereferences struct page fields to obtain the zone without checking whether the struct pages in question are valid to begin with. Commit 864b75f9d6b0 only makes it worse, since the rounding it does may cause pfn to assume the same value it had in a prior iteration of the loop, resulting in an infinite loop and a hang very early in the boot. Also, since it doesn't perform the same rounding on start_pfn itself but only on intermediate values following an invalid PFN, we may still hit the same VM_BUG_ON() as before. So instead, let's fix this at the core, and ensure that the BUG check doesn't dereference struct page fields of invalid pages. Fixes: 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock alignment") Tested-by: Jan Glauber <jglauber@cavium.com> Tested-by: Shanker Donthineni <shankerd@codeaurora.org> Cc: Daniel Vacek <neelx@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 3e04040df6d4613a8af5a80882d5f7f298f49810)
2018-03-15  mm/hugetlb.c: clean up VM_WARN usage  (Andrew Morton, 1 file, -4/+1)
Use VM_WARN return value, remove too-obvious comment. Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nic Losby <blurbdust@gmail.com> Cc: Yisheng Xie <xieyisheng1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-15  include/linux/mmdebug.h: fix VM_WARN[_*]() with CONFIG_DEBUG_VM=n  (Andrew Morton, 1 file, -4/+4)
These macros cause the build to break when someone uses their return value with CONFIG_DEBUG_VM=n: if (VM_WARN(from > to, "%s called with a negative range\n", __func__)) return -EINVAL; mm/hugetlb.c: In function 'hugetlb_reserve_pages': mm/hugetlb.c:4378: error: void value not ignored as it ought to be Fix this up. Reported-by: Michal Hocko <mhocko@kernel.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-15  zram-drop-max_zpage_size-and-use-zs_huge_class_size-v3  (Sergey Senozhatsky, 1 file, -1/+1)
add pool param to zs_huge_class_size() (Minchan) Link: http://lkml.kernel.org/r/20180314081833.1096-3-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-15  zsmalloc-introduce-zs_huge_class_size-function-v3  (Sergey Senozhatsky, 2 files, -2/+3)
add pool param to zs_huge_class_size() (Minchan) Link: http://lkml.kernel.org/r/20180314081833.1096-2-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-15  x86/mm: implement free pmd/pte page interfaces  (Toshi Kani, 1 file, -2/+26)
Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which clear a given pud/pmd entry and free up lower level page table(s). Address range associated with the pud/pmd entry must have been purged by INVLPG. Link: http://lkml.kernel.org/r/20180314180155.19492-3-toshi.kani@hpe.com Fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings") Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reported-by: Lei Li <lious.lilei@hisilicon.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Borislav Petkov <bp@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-15  mm-vmalloc-add-interfaces-to-free-unmapped-page-table-fix  (Andrew Morton, 1 file, -1/+1)
fix typo in pmd_free_pte_page() stub Cc: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-15  mm/vmalloc: add interfaces to free unmapped page table  (Toshi Kani, 4 files, -2/+48)
On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may create pud/pmd mappings. Kernel panic was observed on arm64 systems with Cortex-A75 in the following steps as described by Hanjun Guo. 1. ioremap a 4K size, valid page table will build, 2. iounmap it, pte0 will set to 0; 3. ioremap the same address with 2M size, pgd/pmd is unchanged, then set the a new value for pmd; 4. pte0 is leaked; 5. CPU may meet exception because the old pmd is still in TLB, which will lead to kernel panic. This panic is not reproducible on x86. INVLPG, called from iounmap, purges all levels of entries associated with purged address on x86. x86 still has memory leak. The patch changes the ioremap path to free unmapped page table(s) since doing so in the unmap path has the following issues: - The iounmap() path is shared with vunmap(). Since vmap() only supports pte mappings, making vunmap() to free a pte page is an overhead for regular vmap users as they do not need a pte page freed up. - Checking if all entries in a pte page are cleared in the unmap path is racy, and serializing this check is expensive. - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges. Clearing a pud/pmd entry before the lazy TLB purges needs extra TLB purge. Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which clear a given pud/pmd entry and free up a page for the lower level entries. This patch implements their stub functions on x86 and arm64, which work as workaround. Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com Fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings") Reported-by: Lei Li <lious.lilei@hisilicon.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Wang Xuefeng <wxf.wang@hisilicon.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Borislav Petkov <bp@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Chintan Pandya <cpandya@codeaurora.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-15  hugetlbfs-check-for-pgoff-value-overflow-v3-fix-fix  (Andrew Morton, 1 file, -1/+2)
fix -ve left shift count on sh Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nic Losby <blurbdust@gmail.com> Cc: Yisheng Xie <xieyisheng1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2018-03-15  m68k/mm: Stop printing the virtual memory layout  (Geert Uytterhoeven, 1 file, -27/+0)
Since commit ad67b74d2469d9b8 ("printk: hash addresses printed with %p"), the virtual memory layout printed during boot up contains "ptrval" instead of actual addresses: Memory: 268040K/276480K available (2979K kernel code, 310K rwdata, 784K rodata, 144K init, 172K bss, 8440K reserved, 0K cma-reserved) Virtual kernel memory layout: vector : 0x003d2e74 - 0x003d3274 ( 1 KiB) kmap : 0xd0000000 - 0xf0000000 ( 512 MiB) vmalloc : 0x11800000 - 0xd0000000 (3048 MiB) lowmem : 0x00000000 - 0x11000000 ( 272 MiB) .init : 0x(ptrval) - 0x(ptrval) ( 144 KiB) .text : 0x(ptrval) - 0x(ptrval) (2980 KiB) .data : 0x(ptrval) - 0x(ptrval) (1095 KiB) .bss : 0x(ptrval) - 0x(ptrval) ( 173 KiB) Instead of changing the printing to "%px", and leaking virtual memory layout information again, just remove the printing completely, cfr. e.g. commit 071929dbdd865f77 ("arm64: Stop printing the virtual memory layout"). All interesting information (actual section sizes) is already printed by mem_init_print_info() just above anyway. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> (cherry picked from commit 445420c31cbba4a218b432bece0b500b6c4f9d63)
2018-03-14  headers-untangle-kmemleakh-from-mmh-fix-fix  (Michal Hocko, 1 file, -0/+1)
m32r allmodconfig complains with security/integrity/digsig.c:146:2: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration] vfree(data); ^ Signed-off-by: Michal Hocko <mhocko@suse.com>
2018-03-14  mm/gup.c: teach get_user_pages_unlocked to handle FOLL_NOWAIT  (Andrea Arcangeli, 1 file, -2/+5)
KVM is hanging during postcopy live migration with userfaultfd because get_user_pages_unlocked is not capable to handle FOLL_NOWAIT. Earlier FOLL_NOWAIT was only ever passed to get_user_pages. Specifically faultin_page (the callee of get_user_pages_unlocked caller) doesn't know that if FAULT_FLAG_RETRY_NOWAIT was set in the page fault flags, when VM_FAULT_RETRY is returned, the mmap_sem wasn't actually released (even if nonblocking is not NULL). So it sets *nonblocking to zero and the caller won't release the mmap_sem thinking it was already released, but it wasn't because of FOLL_NOWAIT. Link: http://lkml.kernel.org/r/20180302174343.5421-2-aarcange@redhat.com Fixes: ce53053ce378c ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Tested-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 96312e61282ae3f6537a562625706498cbc75594)
2018-03-14  headers-untangle-kmemleakh-from-mmh-fix  (Andrew Morton, 1 file, -0/+1)
security/keys/big_key.c needs vmalloc.h, per sfr Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>