path: root/include
Age | Commit message | Author | Files | Lines
2016-11-18 | hmm/dmirror: dummy mirror support for fake device memory [hmm-v13] | Jérôme Glisse | 1 | -0/+6
Add fake device memory. Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
2016-11-18 | hmm/dmirror: dummy mirror driver for testing and showcasing the HMM | Jérôme Glisse | 1 | -0/+48
Just a dummy driver for test purposes. Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
2016-11-18 | mm/hmm/devmem: dummy HMM device as a helper for ZONE_DEVICE memory | Jérôme Glisse | 1 | -1/+21
This introduces a dummy HMM device class that device drivers can use to create an hmm_device for the sole purpose of registering device memory. It is useful for device drivers that want to manage multiple physical device memories under the same device umbrella. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Jatin Kumar <jakumar@nvidia.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: Sherry Cheung <SCheung@nvidia.com> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
2016-11-18 | mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory | Jérôme Glisse | 1 | -0/+113
This introduces a simple struct and associated helpers for device drivers to use when hotplugging un-addressable device memory as ZONE_DEVICE. It finds an unused physical address range and triggers memory hotplug for it, which allocates and initializes struct pages for the device memory. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Jatin Kumar <jakumar@nvidia.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: Sherry Cheung <SCheung@nvidia.com> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
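[Editor's note] A rough sketch of how a driver might consume such a helper follows. The names hmm_devmem_add(), struct hmm_devmem and mydev_devmem_ops are assumptions modelled on the description above, not necessarily the interface added by this series.

  /* Hypothetical driver init step: hotplug the card's memory as ZONE_DEVICE
   * so it gets backing struct pages. All identifiers below are assumed
   * names for the sketch, not this patch's actual API. */
  static struct hmm_devmem *mydev_devmem;

  static int mydev_add_memory(struct device *dev, unsigned long size)
  {
          /* The helper picks an unused physical address range and triggers
           * memory hotplug for it, allocating struct pages for the range. */
          mydev_devmem = hmm_devmem_add(&mydev_devmem_ops, dev, size);
          if (IS_ERR(mydev_devmem))
                  return PTR_ERR(mydev_devmem);

          return 0;
  }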
2016-11-18 | mm/hmm/migrate: new memory migration helper for use with device memory | Jérôme Glisse | 1 | -3/+51
This patch adds a new memory migration helper, which migrates the memory backing a range of virtual addresses of a process to different memory (which can be allocated through a special allocator). It differs from NUMA migration by working on a range of virtual addresses and thus doing migration in chunks that can be large enough to use a DMA engine or a special copy-offloading engine. Expected users are anyone with heterogeneous memory where different memories have different characteristics (latency, bandwidth, ...). As an example, IBM platforms with a CAPI bus can make use of this feature to migrate between regular memory and CAPI device memory. New CPU architectures with a pool of high-performance memory that is not managed as a cache but presented as regular memory (while being faster and lower latency than DDR) will also be prime users of this patch. Migration to private device memory will be useful for devices that have a large pool of such memory, like GPUs; NVidia plans to use HMM for that. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Jatin Kumar <jakumar@nvidia.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: Sherry Cheung <SCheung@nvidia.com> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
2016-11-18 | mm/hmm/migrate: add new boolean copy flag to migratepage() callback | Jérôme Glisse | 3 | -8/+15
Allow migration without copy in case the destination page already has the source page's content. This is useful for HMM migration to a device, where we copy the page before doing the final migration step. This feature needs a careful audit of filesystem code to make sure that no one can write to the source page while it is unmapped and locked. It should be safe for most filesystems, but as a precaution return an error until support for device migration is added to them. Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
2016-11-18 | mm/hmm/mirror: device page fault handler | Jérôme Glisse | 1 | -4/+29
This handles page faults on behalf of device drivers; unlike handle_mm_fault(), it does not trigger migration back to system memory for device memory. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Jatin Kumar <jakumar@nvidia.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: Sherry Cheung <SCheung@nvidia.com> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
2016-11-18 | mm/hmm/mirror: helper to snapshot CPU page table | Jérôme Glisse | 1 | -2/+28
This does not use the existing page table walker because we want to share the same code with our page fault handler. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Jatin Kumar <jakumar@nvidia.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: Sherry Cheung <SCheung@nvidia.com> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
2016-11-17 | mm/hmm/mirror: add range monitor helper, to monitor CPU page table update | Jérôme Glisse | 1 | -0/+18
Complement the hmm_vma_range_lock/unlock() mechanism with a range monitor that does not block CPU page table invalidation and thus does not guarantee forward progress. It is still useful, as in many situations concurrent CPU page table updates and CPU snapshots take place in different regions of the virtual address space. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Jatin Kumar <jakumar@nvidia.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: Sherry Cheung <SCheung@nvidia.com> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
2016-11-17 | mm/hmm/mirror: add range lock helper, prevent CPU page table update for the range | Jérôme Glisse | 1 | -0/+30
There are two possible strategies when it comes to snapshotting the CPU page table into the device page table. The first one snapshots the CPU page table while keeping track of active mmu_notifier callbacks. Once the snapshot is done, and before updating the device page table (in an atomic fashion), it checks the mmu_notifier sequence. If the sequence is the same as at the time the CPU page table was snapshotted, then no mmu_notifier ran in the meantime and the snapshot is accurate. If the sequence is different, then an mmu_notifier callback did run, the snapshot might no longer be valid, and the whole procedure must be restarted. The issue with this approach is that it does not guarantee forward progress for a device driver trying to mirror a range of the address space. The second solution, implemented by this patch, is to serialize CPU snapshots with mmu_notifier callbacks and have each wait on the other according to the order in which they happen. This guarantees forward progress for the driver. The drawback is that it can stall processes waiting on the mmu_notifier callback to finish, so things like direct page reclaim (or even indirect reclaim) might stall, which might increase overall kernel latency. For now, just accept this potential issue and wait for a real-world workload to be affected by it before trying to fix it. The fix is probably to introduce a new mmu_notifier_try_to_invalidate() that could return failure if it has to wait or sleep, and use it inside reclaim code to decide to skip to the next candidate for reclamation. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Jatin Kumar <jakumar@nvidia.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: Sherry Cheung <SCheung@nvidia.com> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
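[Editor's note] For reference, the first (non-blocking) strategy described above usually takes the shape of the retry loop below. All identifiers are made up to illustrate the snapshot-and-check-sequence idea; this is not the API added by the patch, which chose the serializing strategy instead.

  /* Hypothetical illustration of the "snapshot, then re-check the
   * mmu_notifier sequence" strategy described above. */
  static void mydev_update_device_pagetable(struct mydev_mirror *m,
                                            unsigned long start,
                                            unsigned long end)
  {
          unsigned long seq;

  again:
          /* Remember the invalidation count before taking the snapshot. */
          seq = READ_ONCE(m->invalidate_seq);

          /* Walk the CPU page table and record entries for [start, end). */
          mydev_snapshot_cpu_pagetable(m, start, end);

          /* Publish to the device page table atomically, but only if no
           * mmu_notifier invalidation ran while we were snapshotting. */
          spin_lock(&m->device_pt_lock);
          if (seq != READ_ONCE(m->invalidate_seq)) {
                  /* Snapshot may be stale: drop it and start over. This is
                   * why the scheme gives no forward-progress guarantee. */
                  spin_unlock(&m->device_pt_lock);
                  goto again;
          }
          mydev_commit_snapshot_to_device(m, start, end);
          spin_unlock(&m->device_pt_lock);
  }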
2016-11-17 | mm/hmm/mirror: mirror process address space on device with HMM helpers | Jérôme Glisse | 1 | -0/+97
This is heterogeneous memory management (HMM) process address space mirroring. In a nutshell, it provides an API to mirror a process address space on a device. This boils down to keeping the CPU and device page tables synchronized (we assume that both device and CPU are cache coherent, as PCIe devices can be). This patch provides a simple API for device drivers to achieve address space mirroring, thus avoiding each device driver having to grow its own CPU page table walker and its own CPU page table synchronization mechanism. This is useful for NVidia GPUs >= Pascal, Mellanox IB >= mlx5, and more hardware in the future. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Jatin Kumar <jakumar@nvidia.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: Sherry Cheung <SCheung@nvidia.com> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
2016-11-17 | mm/hmm: heterogeneous memory management (HMM for short) | Jérôme Glisse | 2 | -0/+144
HMM provides 3 separate pieces of functionality:

  - Mirroring: synchronize CPU page table and device page table
  - Device memory: allocate struct pages for device memory
  - Migration: migrate regular memory to device memory

This patch introduces some common helpers and definitions for all 3 of them. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Jatin Kumar <jakumar@nvidia.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com> Signed-off-by: Sherry Cheung <SCheung@nvidia.com> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
2016-11-14 | mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable | Jérôme Glisse | 3 | -3/+87
To allow use of device un-addressable memory inside a process, add a special swap type. Also add a new callback to handle page faults on such entries. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
2016-11-14 | mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory | Jérôme Glisse | 1 | -0/+7
HMM wants to remove device memory early, before device teardown, so add a helper to do that. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
2016-11-14 | mm/ZONE_DEVICE/free-page: callback when page is freed | Jérôme Glisse | 1 | -0/+4
When a ZONE_DEVICE page's refcount reaches 1, it means the page is free and nobody is holding a reference on it (only the device to which the memory belongs does). Add a callback and call it when that happens, so device drivers can implement their own free page management. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
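[Editor's note] A driver-side callback for such a hook could look roughly like the sketch below; the callback signature and the struct mydev_mem bookkeeping are assumptions for illustration only, not taken from this patch.

  /* Hypothetical free-page callback: once the last user drops its
   * reference, return the page's backing block to the device's own
   * allocator. Signature and struct mydev_mem are assumed names. */
  static void mydev_free_devpage(struct page *page, void *private)
  {
          struct mydev_mem *mem = private;
          unsigned long idx = page_to_pfn(page) - mem->base_pfn;

          spin_lock(&mem->lock);
          __set_bit(idx, mem->free_bitmap);   /* block is available again */
          spin_unlock(&mem->lock);
  }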
2016-11-14 | mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory | Jérôme Glisse | 1 | -3/+20
This adds support for un-addressable device memory. Such memory is hotplugged only so that we can have struct pages for it; it should never be mapped. This patch adds code to the mm page fault code path to catch any such mapping and SIGBUS on such an event. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
2016-11-14 | mm/memory/hotplug: convert device parameter bool to set of flags | Jérôme Glisse | 1 | -2/+15
This is only useful for arches where we support ZONE_DEVICE and where we also want to support un-addressable device memory. We need struct pages for such un-addressable memory, but we should avoid populating the kernel linear mapping for the physical address range because there is no real memory, or anything else, behind those physical addresses. Hence we need more flags than just knowing whether it is device memory or not. Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com>
2016-10-26 | mm, thp: avoid unlikely branches for split_huge_pmd | David Rientjes | 1 | -0/+2
While doing MADV_DONTNEED on a large area of thp memory, I noticed we encountered many unlikely() branches in profiles for each backing hugepage. This is because zap_pmd_range() would call split_huge_pmd(), which rechecked the conditions that were already validated, but as part of an unlikely() branch. Avoid the unlikely() branch when in a context where pmd is known to be good for __split_huge_pmd() directly. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1610181600300.84525@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
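[Editor's note] Concretely, a call site that has already tested the pmd can go straight to __split_huge_pmd() instead of the split_huge_pmd() wrapper, which would repeat the checks behind unlikely() branches. A simplified sketch of the zap_pmd_range() pattern after this change (details elided):

  /* Simplified sketch: the pmd has already been checked here, so call
   * __split_huge_pmd() directly rather than split_huge_pmd(), which
   * would re-run pmd_trans_huge()/pmd_devmap() as unlikely() branches. */
  if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
          if (next - addr != HPAGE_PMD_SIZE)
                  __split_huge_pmd(vma, pmd, addr, false, NULL);
          /* ... otherwise zap the huge pmd as a whole ... */
  }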
2016-10-24 | mm: replace access_remote_vm() write parameter with gup_flags | Lorenzo Stoakes | 1 | -1/+1
This removes the 'write' argument from access_remote_vm() and replaces it with 'gup_flags' as use of this function previously silently implied FOLL_FORCE, whereas after this patch callers explicitly pass this flag. We make this explicit as use of FOLL_FORCE can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 6347e8d5bcce33fc36e651901efefbe2c93a43ef)
2016-10-24 | mm: replace get_user_pages_remote() write/force parameters with gup_flags | Lorenzo Stoakes | 1 | -1/+1
This removes the 'write' and 'force' from get_user_pages_remote() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 9beae1ea89305a9667ceaab6d0bf46a045ad71e7)
2016-10-24 | mm: replace get_user_pages() write/force parameters with gup_flags | Lorenzo Stoakes | 1 | -1/+1
This removes the 'write' and 'force' from get_user_pages() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 768ae309a96103ed02eb1e111e838c87854d8b51)
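[Editor's note] At a call site the conversion looks roughly like this (the surrounding variables are illustrative):

  /* Before (4.8 and earlier): intent hidden behind two int parameters,
   * which were silently translated into FOLL_WRITE | FOLL_FORCE. */
  ret = get_user_pages(start, nr_pages, 1 /* write */, 1 /* force */,
                       pages, NULL);

  /* After: the caller spells out gup_flags, so FOLL_FORCE users are
   * visible (and auditable) at the call site. */
  ret = get_user_pages(start, nr_pages, FOLL_WRITE | FOLL_FORCE,
                       pages, NULL);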
2016-10-24 | mm: replace get_vaddr_frames() write/force parameters with gup_flags | Lorenzo Stoakes | 1 | -1/+1
This removes the 'write' and 'force' from get_vaddr_frames() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 7f23b3504a0df63b724180262c5f3f117f21bcae)
2016-10-24 | mm: replace get_user_pages_locked() write/force parameters with gup_flags | Lorenzo Stoakes | 1 | -1/+1
This removes the 'write' and 'force' use from get_user_pages_locked() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 3b913179c3fa89dd0e304193fa0c746fc0481447)
2016-10-24 | mm: replace get_user_pages_unlocked() write/force parameters with gup_flags | Lorenzo Stoakes | 1 | -1/+1
This removes the 'write' and 'force' use from get_user_pages_unlocked() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit c164154f66f0c9b02673f07aa4f044f1d9c70274)
2016-10-24 | mm: remove write/force parameters from __get_user_pages_unlocked() | Lorenzo Stoakes | 1 | -2/+1
This removes the redundant 'write' and 'force' parameters from __get_user_pages_unlocked() to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit d4944b0ecec0af882483fe44b66729316e575208)
2016-10-24 | mm: remove gup_flags FOLL_WRITE games from __get_user_pages() | Linus Torvalds | 1 | -0/+1
This is an ancient bug that was actually attempted to be fixed once (badly) by me eleven years ago in commit 4ceb5db9757a ("Fix get_user_pages() race for write access") but that was then undone due to problems on s390 by commit f33ea7f404e5 ("fix get_user_pages bug"). In the meantime, the s390 situation has long been fixed, and we can now fix it by checking the pte_dirty() bit properly (and do it better). The s390 dirty bit was implemented in abf09bed3cce ("s390/mm: implement software dirty bits") which made it into v3.9. Earlier kernels will have to look at the page state itself. Also, the VM has become more scalable, and what used a purely theoretical race back then has become easier to trigger. To fix it, we introduce a new internal FOLL_COW flag to mark the "yes, we already did a COW" rather than play racy games with FOLL_WRITE that is very fundamental, and then use the pte dirty flag to validate that the FOLL_COW flag is still valid. Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Nick Piggin <npiggin@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619)
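[Editor's note] The heart of the fix is a helper that only treats a read-only pte as writable for FOLL_FORCE when COW has already happened and the pte is dirty; reconstructed from the description above, it reads roughly like this:

  /* Sketch of the check described above: FOLL_FORCE may only reuse a
   * read-only pte for a write if we already broke COW on it (FOLL_COW)
   * and the pte is dirty, proving the COW actually took place. */
  static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
  {
          return pte_write(pte) ||
                 ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
                  pte_dirty(pte));
  }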
2016-10-12 | mm/zsmalloc: add per-class compact trace event | Ganesh Mahendran | 1 | -10/+30
Add a per-class compact trace event to get the number of migrated objects and the number of freed pages. The trace log looks like this:

  bash-3863 [001] .... 141.791366: zs_compact_start: pool zram0
  bash-3863 [001] .... 141.791372: zs_compact: class 254: 0 objects migrated, 0 pages freed
  bash-3863 [001] .... 141.791375: zs_compact: class 202: 0 objects migrated, 0 pages freed
  bash-3863 [001] .... 141.791385: zs_compact: class 190: 1 objects migrated, 3 pages freed
  bash-3863 [001] .... 141.791393: zs_compact: class 168: 2 objects migrated, 2 pages freed
  bash-3863 [001] .... 141.791396: zs_compact: class 151: 0 objects migrated, 0 pages freed
  bash-3863 [001] .... 141.791407: zs_compact: class 144: 5 objects migrated, 4 pages freed
  bash-3863 [001] .... 141.791427: zs_compact: class 126: 8 objects migrated, 8 pages freed
  bash-3863 [001] .... 141.791433: zs_compact: class 111: 1 objects migrated, 4 pages freed
  bash-3863 [001] .... 141.791459: zs_compact: class 107: 18 objects migrated, 12 pages freed
  bash-3863 [001] .... 141.791487: zs_compact: class 100: 18 objects migrated, 16 pages freed
  bash-3863 [001] .... 141.791509: zs_compact: class 94: 18 objects migrated, 9 pages freed
  bash-3863 [001] .... 141.791560: zs_compact: class 91: 44 objects migrated, 24 pages freed
  bash-3863 [001] .... 141.791605: zs_compact: class 83: 35 objects migrated, 20 pages freed
  bash-3863 [001] .... 141.791616: zs_compact: class 76: 8 objects migrated, 4 pages freed
  bash-3863 [001] .... 141.791644: zs_compact: class 74: 21 objects migrated, 9 pages freed
  bash-3863 [001] .... 141.791665: zs_compact: class 71: 18 objects migrated, 10 pages freed
  bash-3863 [001] .... 141.791736: zs_compact: class 67: 0 objects migrated, 0 pages freed
  bash-3863 [001] .... 141.791763: zs_compact: class 66: 22 objects migrated, 8 pages freed
  bash-3863 [001] .... 141.791820: zs_compact: class 62: 18 objects migrated, 6 pages freed
  bash-3863 [001] .... 141.791826: zs_compact: class 58: 1 objects migrated, 4 pages freed
  bash-3863 [001] .... 141.791829: zs_compact: class 57: 0 objects migrated, 0 pages freed
  bash-3863 [001] .... 141.791834: zs_compact: class 54: 2 objects migrated, 2 pages freed
  ...
  bash-3863 [001] .... 141.791952: zs_compact: class 4: 0 objects migrated, 0 pages freed
  bash-3863 [001] .... 141.791964: zs_compact: class 3: 14 objects migrated, 1 pages freed
  bash-3863 [001] .... 141.791966: zs_compact: class 2: 0 objects migrated, 0 pages freed
  bash-3863 [001] .... 141.791968: zs_compact: class 1: 0 objects migrated, 0 pages freed
  bash-3863 [001] .... 141.791971: zs_compact: class 0: 0 objects migrated, 0 pages freed
  bash-3863 [001] .... 141.791973: zs_compact_end: pool zram0: 155 pages compacted

Also this patch changes trace_zsmalloc_compact_start[end] to trace_zs_compact_start[end] to keep function naming consistent with others in zsmalloc. Link: http://lkml.kernel.org/r/1467882338-4300-8-git-send-email-opensource.ganesh@gmail.com Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-12 | mm/zsmalloc: add trace events for zs_compact | Ganesh Mahendran | 1 | -0/+56
Currently zsmalloc is widely used in Android devices. Sometimes we want to see how frequently zs_compact is triggered, how many pages are freed by zs_compact(), or which zsmalloc pool is compacted. We have backported zs_compact() to our product (kernel 3.18). It is useful for a long-running device, but there is no convenient way to get the detailed information about zs_compact() that is useful for performance optimization: how much time zs_compact took, which pool was compacted, how many pages were freed, and so on. With this information we will know what is going on in zs_compact and can draw the relation between free memory and zs_compact. Most of the time users can get brief information from trace_mm_shrink_slab_[start | end], but in some scenarios they do not use the zsmalloc shrinker and instead trigger compaction manually, so adding some trace events to zs_compact is convenient. We can also add some zsmalloc-specific information (pool name, total compacted pages, etc.) to the zsmalloc trace. This patch adds two trace events for zs_compact(); below is the trace log:

  -----------------------------
  root@land:/ # cat /d/tracing/trace
  kswapd0-125 [007] ...1 174.176979: zsmalloc_compact_start: pool zram0
  kswapd0-125 [007] ...1 174.181967: zsmalloc_compact_end: pool zram0: 608 pages compacted(total 1794)
  kswapd0-125 [000] ...1 184.134475: zsmalloc_compact_start: pool zram0
  kswapd0-125 [000] ...1 184.135010: zsmalloc_compact_end: pool zram0: 62 pages compacted(total 1856)
  kswapd0-125 [003] ...1 226.927221: zsmalloc_compact_start: pool zram0
  kswapd0-125 [003] ...1 226.928575: zsmalloc_compact_end: pool zram0: 250 pages compacted(total 2106)
  -----------------------------

Link: http://lkml.kernel.org/r/1465289804-4913-1-git-send-email-opensource.ganesh@gmail.com Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-12 | mm: split gfp_mask and mapping flags into separate fields | Michal Hocko | 2 | -12/+11
mapping->flags currently encodes two different things into a single flag: it contains the sticky gfp_mask for page cache allocations and the AS_ codes used to report errors/enospace and other states which are mapping specific. Condensing the two semantically unrelated things saves a few bytes but it also complicates other things. For one thing, the gfp flags space is reduced and in fact we are already running out of available bits. It can be assumed that more gfp flags will be necessary later on. To avoid growing struct address_space (at least on x86_64) we can stick the new field right after private_lock because we have a hole there.

  struct address_space {
          struct inode *                     host;            /*     0     8 */
          struct radix_tree_root             page_tree;       /*     8    16 */
          spinlock_t                         tree_lock;       /*    24     4 */
          atomic_t                           i_mmap_writable; /*    28     4 */
          struct rb_root                     i_mmap;          /*    32     8 */
          struct rw_semaphore                i_mmap_rwsem;    /*    40    40 */
          /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
          long unsigned int                  nrpages;         /*    80     8 */
          long unsigned int                  nrexceptional;   /*    88     8 */
          long unsigned int                  writeback_index; /*    96     8 */
          const struct address_space_operations * a_ops;      /*   104     8 */
          long unsigned int                  flags;           /*   112     8 */
          spinlock_t                         private_lock;    /*   120     4 */

          /* XXX 4 bytes hole, try to pack */

          /* --- cacheline 2 boundary (128 bytes) --- */
          struct list_head                   private_list;    /*   128    16 */
          void *                             private_data;    /*   144     8 */

          /* size: 152, cachelines: 3, members: 14 */
          /* sum members: 148, holes: 1, sum holes: 4 */
          /* last cacheline: 24 bytes */
  };

Link: http://lkml.kernel.org/r/20160912114852.GI14524@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-12 | treewide: remove redundant #include <linux/kconfig.h> | Masahiro Yamada | 2 | -2/+0
Kernel source files need not include <linux/kconfig.h> explicitly because the top Makefile forces to include it with: -include $(srctree)/include/linux/kconfig.h This commit removes explicit includes except the following: * arch/s390/include/asm/facilities_src.h * tools/testing/radix-tree/linux/kernel.h These two are used for host programs. Link: http://lkml.kernel.org/r/1473656164-11929-1-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-12 | mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping | Catalin Marinas | 1 | -0/+18
Some of the kmemleak_*() callbacks in memblock, bootmem, CMA convert a physical address to a virtual one using __va(). However, such physical addresses may sometimes be located in highmem and using __va() is incorrect, leading to inconsistent object tracking in kmemleak. The following functions have been added to the kmemleak API and they take a physical address as the object pointer. They only perform the corresponding action if the address has a lowmem mapping:

  kmemleak_alloc_phys
  kmemleak_free_part_phys
  kmemleak_not_leak_phys
  kmemleak_ignore_phys

The affected calling places have been updated to use the new kmemleak API. Link: http://lkml.kernel.org/r/1471531432-16503-1-git-send-email-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Vignesh R <vigneshr@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
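[Editor's note] A typical conversion at a calling site (for example in memblock) would look roughly like this; the argument values are illustrative:

  /* Before: converting the physical address is wrong if it lies in highmem. */
  kmemleak_alloc(__va(found), size, 0, 0);

  /* After: hand kmemleak the physical address; it only creates the tracking
   * object if the address actually has a lowmem mapping. */
  kmemleak_alloc_phys(found, size, 0, 0);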
2016-10-12 | kdump, vmcoreinfo: report memory sections virtual addresses | Thomas Garnier | 1 | -0/+6
KASLR memory randomization can randomize the base of the physical memory mapping (PAGE_OFFSET), vmalloc (VMALLOC_START) and vmemmap (VMEMMAP_START). Adding these variables on VMCOREINFO so tools can easily identify the base of each memory section. Link: http://lkml.kernel.org/r/1471531632-23003-1-git-send-email-thgarnie@google.com Signed-off-by: Thomas Garnier <thgarnie@google.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Xunlei Pang <xlpang@redhat.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: Kees Cook <keescook@chromium.org> Cc: Eugene Surovegin <surovegin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-12 | ipc/sem.c: fix complex_count vs. simple op race | Manfred Spraul | 1 | -0/+1
Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race: sem_lock has a fast path that allows parallel simple operations. There are two reasons why a simple operation cannot run in parallel:

  - a non-simple operation is ongoing (sma->sem_perm.lock held)
  - a complex operation is sleeping (sma->complex_count != 0)

As both facts are stored independently, a thread can bypass the current checks by sleeping in the right positions. See below for more details (or kernel bugzilla 105651). The patch fixes that by creating one variable (complex_mode) that tracks both reasons why parallel operations are not possible. The patch also updates stale documentation regarding the locking. With regards to stable kernels: the patch is required for all kernels that include commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?). The alternative is to revert the patch that introduced the race. The patch is safe for backporting, i.e. it makes no assumptions about memory barriers in spin_unlock_wait(). Background: here is the race in the current implementation:

  Thread A: (simple op)
    - does the first "sma->complex_count == 0" test
  Thread B: (complex op)
    - does sem_lock(): This includes an array scan. But the scan can't find
      Thread A, because Thread A does not own sem->lock yet.
    - the thread does the operation, increases complex_count,
      drops sem_lock, sleeps
  Thread A:
    - spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
    - sleeps before the complex_count test
  Thread C: (complex op)
    - does sem_lock (no array scan, complex_count==1)
    - wakes up Thread B.
    - decrements complex_count
  Thread A:
    - does the complex_count test

Bug: now both thread A and thread C operate on the same array, without any synchronization. Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") Link: http://lkml.kernel.org/r/1469123695-5661-1-git-send-email-manfred@colorfullife.com Reported-by: <felixh@informatik.uni-bremen.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: <1vier1@web.de> Cc: <stable@vger.kernel.org> [3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-12 | relay: Use irq_work instead of plain timer for deferred wakeup | Peter Zijlstra | 1 | -1/+2
Relay avoids calling wake_up_interruptible() for waking up readers/consumers waiting for the generation of new data from the context of the process which produced the data. This is apparently done to prevent the possibility of a deadlock in case the scheduler itself is generating data for the relay after acquiring rq->lock. The following patch used a timer (to be scheduled at the next jiffy) for delegating the wakeup to another context:

  commit 7c9cb38302e78d24e37f7d8a2ea7eed4ae5f2fa7
  Author: Tom Zanussi <zanussi@comcast.net>
  Date:   Wed May 9 02:34:01 2007 -0700

      relay: use plain timer instead of delayed work

      relay doesn't need to use schedule_delayed_work() for waking readers
      when a simple timer will do.

Scheduling a plain timer at the next jiffies boundary to do the wakeup causes a significant wakeup latency for the userspace client, which makes relay less suitable for the high-frequency low-payload use cases where the data gets generated at a very high rate, like multiple sub-buffers getting filled within a millisecond. Moreover, the timer is re-scheduled on every newly produced sub-buffer, so the timer keeps getting pushed out if sub-buffers are filled in very quick succession (less than a jiffy gap between filling of 2 sub-buffers). As a result relay runs out of sub-buffers to store the new data. By using irq_work it is ensured that the wakeup of the userspace client, blocked in the poll call, is done at the earliest opportunity (through self IPI or the next timer tick), enabling it to always consume the data in time. This also makes relay consistent with printk & ring buffers (trace), as they too use irq_work for deferred wakeup of readers. [arnd@arndb.de: select CONFIG_IRQ_WORK] Link: http://lkml.kernel.org/r/20160912154035.3222156-1-arnd@arndb.de [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1472906487-1559-1-git-send-email-akash.goel@intel.com Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Akash Goel <akash.goel@intel.com> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-12 | dma-mapping: introduce the DMA_ATTR_NO_WARN attribute | Mauricio Faria de Oliveira | 1 | -0/+5
Introduce the DMA_ATTR_NO_WARN attribute, and document it. Link: http://lkml.kernel.org/r/1470092390-25451-2-git-send-email-mauricfo@linux.vnet.ibm.com Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Jens Axboe <axboe@fb.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
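[Editor's note] Usage is simply another attribute bit passed to the *_attrs mapping helpers; a hedged example (the device and scatterlist variables are made up):

  /* Map a scatterlist but suppress the allocation-failure warning; the
   * driver (e.g. a block driver that can split the request and retry)
   * handles the failure itself. */
  int nents = dma_map_sg_attrs(dev, sgl, sg_count, DMA_TO_DEVICE,
                               DMA_ATTR_NO_WARN);
  if (!nents)
          return -ENOMEM;   /* retry later or with a smaller request */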
2016-10-12 | random: remove unused randomize_range() | Jason Cooper | 1 | -1/+0
All call sites for randomize_range have been updated to use the much simpler and more robust randomize_addr(). Remove the now unnecessary code. Link: http://lkml.kernel.org/r/20160803233913.32511-8-jason@lakedaemon.net Signed-off-by: Jason Cooper <jason@lakedaemon.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-12 | random: simplify API for random address requests | Jason Cooper | 1 | -0/+1
To date, all callers of randomize_range() have set the length to 0, and check for a zero return value. For the current callers, the only way to get zero returned is if end <= start. Since they are all adding a constant to the start address, this is unnecessary. We can remove a bunch of needless checks by simplifying the API to do just what everyone wants, return an address between [start, start + range). While we're here, s/get_random_int/get_random_long/. No current call site is adversely affected by get_random_int(), since all current range requests are < UINT_MAX. However, we should match caller expectations to avoid coming up short (ha!) in the future. All current callers to randomize_range() chose to use the start address if randomize_range() failed. Therefore, we simplify things by just returning the start address on error. randomize_range() will be removed once all callers have been converted over to randomize_addr(). Link: http://lkml.kernel.org/r/20160803233913.32511-2-jason@lakedaemon.net Signed-off-by: Jason Cooper <jason@lakedaemon.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Roberts, William C" <william.c.roberts@intel.com> Cc: Yann Droneaud <ydroneaud@opteya.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Nick Kralevich <nnk@google.com> Cc: Jeffrey Vander Stoep <jeffv@google.com> Cc: Daniel Cashman <dcashman@android.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
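[Editor's note] Per the description above, a caller ends up with something like the fragment below. The function name is taken from the commit text; its exact signature, and the base/range values, are assumptions for illustration.

  /* Pick a randomized base in [base, base + range); on failure the helper
   * simply returns 'base', so callers need no special error handling.
   * Name from the commit text; signature assumed. */
  unsigned long base = TASK_UNMAPPED_BASE;
  unsigned long addr = randomize_addr(base, 0x01000000UL);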
2016-10-12 | autofs4: move linux/auto_dev-ioctl.h to uapi/linux | Ian Kent | 2 | -208/+222
Since linux/auto_dev-ioctl.h wasn't included in include/linux/Kbuild it wasn't moved to uapi/linux as part of the uapi series. Link: http://lkml.kernel.org/r/20160812024901.12352.10984.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-12 | autofs: move inclusion of linux/limits.h to uapi | Tomohiro Kusumi | 2 | -1/+1
linux/limits.h should be included by the uapi header instead of linux/auto_fs.h so as not to cause a compile error in userspace:

  # cat << EOF > ./test1.c
  > #include <stdio.h>
  > #include <linux/auto_fs.h>
  > int main(void) {
  >     return 0;
  > }
  > EOF
  # gcc -Wall -g ./test1.c
  In file included from ./test1.c:2:0:
  /usr/include/linux/auto_fs.h:54:12: error: 'NAME_MAX' undeclared here (not in a function)
    char name[NAME_MAX+1];
              ^

Link: http://lkml.kernel.org/r/20160812024856.12352.24092.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <ikent@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-12 | autofs: remove AUTOFS_DEVID_LEN | Tomohiro Kusumi | 1 | -2/+0
This macro was never used by either the kernel or userspace, and it also doesn't represent a "devid length" in bytes (unless it was added to mean something else). Link: http://lkml.kernel.org/r/20160812024820.12352.21210.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <ikent@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-12 | llist: introduce llist_entry_safe() | Alexander Potapenko | 1 | -7/+19
Currently llist_for_each_entry() and llist_for_each_entry_safe() iterate while &pos->member != NULL. But when building the kernel with Clang, the compiler assumes &pos->member cannot be NULL if the member's offset is greater than 0. Therefore the loop condition is always true, and the loops become infinite. To work around this, introduce llist_entry_safe(), which returns NULL for NULL pointers, and additionally check that pos is not NULL in the list iterators before dereferencing it. Link: http://lkml.kernel.org/r/1474636978-41435-3-git-send-email-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-12 | include/linux: provide a safe version of container_of() | Alexander Potapenko | 1 | -0/+15
Patch series "Clang: avoid undefined behavior in llist iterators": This patchset fixes problems with pointer arithmetic overflow in the llist iterators, llist_for_each_entry() and llist_for_each_entry_safe(). Clang turns those macros into infinite loops, because they're operating with "negative" pointers. As a follow-up it may make sense to convert other uses of llist_entry() to llist_entry_safe(), or even replace uses of container_of() with container_of_safe(). This patch (of 2): Some code relies on "negative" (i.e. too big) pointer values being returned by container_of() when its first argument is NULL. But doing so breaks the compiler's assumption that pointer arithmetic never overflows. container_of_safe() checks its arguments and returns NULL in the case the member offset within the container is greater than the pointer to the member, otherwise it returns the result of container_of(). Link: http://lkml.kernel.org/r/1474636978-41435-2-git-send-email-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
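[Editor's note] A sketch consistent with that description is shown below; the macro actually proposed by the patch may differ in details.

  /* Sketch of container_of_safe() as described above: return NULL when the
   * member offset is greater than the pointer to the member (i.e. ptr is
   * NULL or would wrap "negative"), otherwise behave like container_of(). */
  #define container_of_safe(ptr, type, member) ({                          \
          void *__mptr = (void *)(ptr);                                    \
          ((unsigned long)__mptr >= offsetof(type, member)) ?              \
                  (type *)(__mptr - offsetof(type, member)) :              \
                  (type *)NULL; })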
2016-10-12 | include/linux/ctype.h: make isdigit() table lookupless | Alexey Dobriyan | 1 | -1/+4
Make isdigit() into a simple range-checking inline function:

  return '0' <= c && c <= '9';

This code is 1 branch, not 2, because any reasonable compiler can optimize it into SUB+CMP, so the code

  while (isdigit((c = *s++)))
          ...

remains 1 branch per iteration. HOWEVER, it suddenly doesn't do a table lookup priming a cacheline nobody cares about. Link: http://lkml.kernel.org/r/20160826190047.GA12536@p183.telecom.by Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-12 | radix-tree: 'slot' can be NULL in radix_tree_next_slot() | Ross Zwisler | 1 | -0/+8
There are four cases I can see where we could end up with a NULL 'slot' in radix_tree_next_slot(). Yet radix_tree_next_slot() never actually checks whether 'slot' is NULL. It just happens that for the cases where 'slot' is NULL, some other combination of factors prevents us from dereferencing it. It would be very easy for someone to unwittingly change one of these factors without realizing that we are implicitly depending on it to save us from a NULL pointer dereference. Add a comment documenting the things that allow 'slot' to be safely passed as NULL to radix_tree_next_slot(). Here are details on the four cases:

  1) radix_tree_iter_retry() via a non-tagged iteration like
     radix_tree_for_each_slot(). In this case we currently aren't seeing a
     bug because radix_tree_iter_retry() sets

       iter->next_index = iter->index;

     which means that in the else case in radix_tree_next_slot(), 'count'
     is zero, so we skip over the while() loop and effectively just return
     NULL without ever dereferencing 'slot'.

  2) radix_tree_iter_retry() via tagged iteration like
     radix_tree_for_each_tagged(). This case was giving us NULL pointer
     dereferences in testing, and was fixed with this commit:

       commit 3cb9185c6730 ("radix-tree: fix radix_tree_iter_retry() for
       tagged iterators.")

     This fix doesn't explicitly check for 'slot' being NULL, though; it
     works around the NULL pointer dereference by instead zeroing
     iter->tags in radix_tree_iter_retry(), which makes us bail out of the
     if() case in radix_tree_next_slot() before we dereference 'slot'.

  3) radix_tree_iter_next() via a non-tagged iteration like
     radix_tree_for_each_slot(). This currently happens in
     shmem_tag_pins() and shmem_partial_swap_usage(). As with non-tagged
     iteration, 'count' in the else case of radix_tree_next_slot() is
     zero, so we skip over the while() loop and effectively just return
     NULL without ever dereferencing 'slot'.

  4) radix_tree_iter_next() via tagged iteration like
     radix_tree_for_each_tagged(). This happens in shmem_wait_for_pins().
     radix_tree_iter_next() zeros out iter->tags, so we end up exiting
     radix_tree_next_slot() here:

       if (flags & RADIX_TREE_ITER_TAGGED) {
               void *canon = slot;

               iter->tags >>= 1;
               if (unlikely(!iter->tags))
                       return NULL;

Link: http://lkml.kernel.org/r/20160815194237.25967-2-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Shuah Khan <shuahkh@osg.samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2016-10-10 | seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char | Joe Perches | 1 | -2/+2
Allow some seq_puts removals by taking a string instead of a single char. [akpm@linux-foundation.org: update vmstat_show(), per Joe] Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Cc: Joe Perches <joe@perches.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 75ba1d07fd6a494851db5132612944a9d4773f9c)
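[Editor's note] The point of the change is that a short literal can now ride along with the number; roughly (the label and variable are illustrative):

  /* Before: the delimiter was a single char, so printing "label value"
   * needed a separate seq_puts() for the label. */
  seq_puts(m, "VmPeak:");
  seq_put_decimal_ull(m, ' ', peak);

  /* After: the delimiter is a const char *, so the label can be passed
   * directly and some seq_puts() calls go away. */
  seq_put_decimal_ull(m, "VmPeak:\t", peak);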
2016-10-10 | linux/mm.h: canonicalize macro PAGE_ALIGNED() definition | zijun_hu | 1 | -1/+1
The macro PAGE_ALIGNED() is prone to causing errors because it doesn't follow the convention of parenthesizing parameter @addr within the macro body. For example:

  unsigned long *ptr = kmalloc(...);
  PAGE_ALIGNED(ptr + 16);

For the left parameter of macro IS_ALIGNED(), (unsigned long)(ptr + 16) is desired, but the actual one is (unsigned long)ptr + 16. It is fixed by simply canonicalizing the macro PAGE_ALIGNED() definition. Link: http://lkml.kernel.org/r/57EA6AE7.7090807@zoho.com Signed-off-by: zijun_hu <zijun_hu@htc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 1061b0d21e16550e7d7893a5deee2e49ea3990ad)
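[Editor's note] The canonicalized definition simply parenthesizes the parameter before casting, along these lines:

  /* Canonicalized form: parenthesize @addr so that expressions such as
   * PAGE_ALIGNED(ptr + 16) cast the whole expression, not just ptr. */
  #define PAGE_ALIGNED(addr)  IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)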
2016-10-10 | mm: remove unnecessary condition in remove_inode_hugepages | zhong jiang | 1 | -1/+1
When a huge page is added to the page cache (huge_add_to_page_cache), the page private flag is cleared. Since this code (remove_inode_hugepages) will only be called for pages in the page cache, PagePrivate(page) will always be false. The patch removes the code without any functional change. Link: http://lkml.kernel.org/r/1475113323-29368-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 72e2936c04f7d2a4bf87d7f72d3bf11cf91ebb47)
2016-10-10 | mm: consolidate warn_alloc_failed users | Michal Hocko | 1 | -3/+2
warn_alloc_failed is currently used from the page and vmalloc allocators. This is a good reuse of the code except that vmalloc would appreciate a slightly different warning message. This is already handled by the fmt parameter except that "%s: page allocation failure: order:%u, mode:%#x(%pGg)" is printed anyway. This might be quite misleading because it might be a vmalloc failure which leads to the warning while the page allocator is not the culprit here. Fix this by always using the fmt string and only print the context that makes sense for the particular context (e.g. order makes only very little sense for the vmalloc context). Rename the function to not miss any user and also because a later patch will reuse it also for !failure cases. Link: http://lkml.kernel.org/r/20160929084407.7004-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 7877cdcc3893c1bd9a833b2f0398e7320794c6e6)
2016-10-10 | mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk | Andrea Arcangeli | 1 | -2/+8
The rmap_walk can access vm_page_prot (and potentially vm_flags in the pte/pmd manipulations). So it's not safe to wait for the caller to update vm_page_prot/vm_flags after vma_merge has returned, potentially removing the "next" vma and extending the "current" vma over the next->vm_start,vm_end range, but still with the "current" vma's vm_page_prot, after releasing the rmap locks. The vm_page_prot/vm_flags must be transferred from the "next" vma to the current vma while vma_merge still holds the rmap locks. The side effect of this race condition is pte corruption during migrate, as remove_migration_ptes, when run on an address of the "next" vma that got removed, used the vm_page_prot of the current vma.

  migrate                                    mprotect
  -------                                    --------
  migrating in "next" vma
                                             vma_merge()  # removes "next" vma and
                                                          # extends "current" vma
                                                          # current vma is not with
                                                          # vm_page_prot updated
  remove_migration_ptes
  read vm_page_prot of current "vma"
  establish pte with wrong permissions
                                             vm_set_page_prot(vma)  # too late!
                                             change_protection in the old vma range
                                             only, next range is not updated

This caused segmentation faults and potentially memory corruption in heavy mprotect loads with some light page migration caused by compaction in the background. Hugh Dickins pointed out the comment about the odd case 8 in vma_merge, which confirms that case 8 is the only buggy one where the race can trigger; in all other vma_merge cases the above cannot happen. This fix removes the oddness factor from case 8 and converts it from:

      AAAA
  PPPPNNNNXXXX -> PPPPNNNNNNNN

to:

      AAAA
  PPPPNNNNXXXX -> PPPPXXXXXXXX

XXXX has the right vma properties for the whole merged vma returned by vma_adjust, so it solves the problem fully. It has the added benefit that the callers could stop updating vma properties when vma_merge succeeds; however the callers are not updated by this patch (there are bits like VM_SOFTDIRTY that still need special care for the whole range, as the vma merging ignores them, but as long as they're not processed by rmap walks and instead they're accessed with the mmap_sem at least for reading, they are fine not to be updated within vma_adjust before releasing the rmap_locks). Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Aditya Mandaleeka <adityam@microsoft.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jan Vorlicek <janvorli@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit e86f15ee64d8ee46255d964d55f74f5ba9af8c36)
2016-10-10 | mm: vm_page_prot: update with WRITE_ONCE/READ_ONCE | Andrea Arcangeli | 1 | -1/+1
vma->vm_page_prot is read locklessly from the rmap_walk and may be updated concurrently; using WRITE_ONCE/READ_ONCE prevents the risk of reading intermediate values. Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jan Vorlicek <janvorli@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 6d2329f8872f23e46a19d240930571510ce525eb)
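[Editor's note] In practice the update and the lockless read are paired roughly like this (the writer-side expression is illustrative; the actual update site computes the new protection from the merged vma's flags):

  /* Writer side (e.g. after vma flags change): publish the new protection
   * in one shot so lockless readers never see a half-updated value. */
  WRITE_ONCE(vma->vm_page_prot, vm_get_page_prot(vma->vm_flags));

  /* Lockless reader side (e.g. the rmap walk): take a single snapshot. */
  pgprot_t prot = READ_ONCE(vma->vm_page_prot);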