summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2013-02-23memory-hotplug: remove sysfs file of nodeTang Chen3-5/+63
Introduce a new function try_offline_node() to remove sysfs file of node when all memory sections of this node are removed. If some memory sections of this node are not removed, this function does nothing. Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23memory_hotplug: clear zone when removing the memoryYasuaki Ishimatsu1-0/+207
When memory is added, we update zone's and pgdat's start_pfn and spanned_pages in __add_zone(). So we should revert them when the memory is removed. The patch adds a new function __remove_zone() to do this. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23memory-hotplug: integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP.Tang Chen1-11/+0
Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even if we use SPARSEMEM_VMEMMAP, we can unregister the memory_section. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23memory-hotplug: remove memmap of sparse-vmemmapTang Chen8-1/+31
Introduce a new API vmemmap_free() to free and remove vmemmap pagetables. Since pagetable implements are different, each architecture has to provide its own version of vmemmap_free(), just like vmemmap_populate(). Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc. [mhocko@suse.cz: fix implicit declaration of remove_pagetable] Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23memory-hotplug: remove page table of x86_64 architectureTang Chen1-0/+10
Search a page table about the removed memory, and clear page table for x86_64 architecture. [akpm@linux-foundation.org: make kernel_physical_mapping_remove() static] Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23memory-hotplug: common APIs to support page tables hot-removeWen Congyang4-22/+331
When memory is removed, the corresponding pagetables should alse be removed. This patch introduces some common APIs to support vmemmap pagetable and x86_64 architecture direct mapping pagetable removing. All pages of virtual mapping in removed memory cannot be freed if some pages used as PGD/PUD include not only removed memory but also other memory. So this patch uses the following way to check whether a page can be freed or not. 1) When removing memory, the page structs of the removed memory are filled with 0FD. 2) All page structs are filled with 0xFD on PT/PMD, PT/PMD can be cleared. In this case, the page used as PT/PMD can be freed. For direct mapping pages, update direct_pages_count[level] when we freed their pagetables. And do not free the pages again because they were freed when offlining. For vmemmap pages, free the pages and their pagetables. For larger pages, do not split them into smaller ones because there is no way to know if the larger page has been split. As a result, there is no way to decide when to split. We deal the larger pages in the following way: 1) For direct mapped pages, all the pages were freed when they were offlined. And since menmory offline is done section by section, all the memory ranges being removed are aligned to PAGE_SIZE. So only need to deal with unaligned pages when freeing vmemmap pages. 2) For vmemmap pages being used to store page_struct, if part of the larger page is still in use, just fill the unused part with 0xFD. And when the whole page is fulfilled with 0xFD, then free the larger page. [akpm@linux-foundation.org: fix typo in comment] [tangchen@cn.fujitsu.com: do not calculate direct mapping pages when freeing vmemmap pagetables] [tangchen@cn.fujitsu.com: do not free direct mapping pages twice] [tangchen@cn.fujitsu.com: do not free page split from hugepage one by one] [tangchen@cn.fujitsu.com: do not split pages when freeing pagetable pages] [akpm@linux-foundation.org: use pmd_page_vaddr()] [akpm@linux-foundation.org: fix used-uninitialised bug] Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section()Tang Chen2-5/+4
In __remove_section(), we locked pgdat_resize_lock when calling sparse_remove_one_section(). This lock will disable irq. But we don't need to lock the whole function. If we do some work to free pagetables in free_section_usemap(), we need to call flush_tlb_all(), which need irq enabled. Otherwise the WARN_ON_ONCE() in smp_call_function_many() will be triggered. If we lock the whole sparse_remove_one_section(), then we come to this call trace: ------------[ cut here ]------------ WARNING: at kernel/smp.c:461 smp_call_function_many+0xbd/0x260() Hardware name: PRIMEQUEST 1800E ...... Call Trace: smp_call_function_many+0xbd/0x260 smp_call_function+0x3b/0x50 on_each_cpu+0x3b/0xc0 flush_tlb_all+0x1c/0x20 remove_pagetable+0x14e/0x1d0 vmemmap_free+0x18/0x20 sparse_remove_one_section+0xf7/0x100 __remove_section+0xa2/0xb0 __remove_pages+0xa0/0xd0 arch_remove_memory+0x6b/0xc0 remove_memory+0xb8/0xf0 acpi_memory_device_remove+0x53/0x96 acpi_device_remove+0x90/0xb2 __device_release_driver+0x7c/0xf0 device_release_driver+0x2f/0x50 acpi_bus_remove+0x32/0x6d acpi_bus_trim+0x91/0x102 acpi_bus_hot_remove_device+0x88/0x16b acpi_os_execute_deferred+0x27/0x34 process_one_work+0x20e/0x5c0 worker_thread+0x12e/0x370 kthread+0xee/0x100 ret_from_fork+0x7c/0xb0 ---[ end trace 25e85300f542aa01 ]--- Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmapYasuaki Ishimatsu8-13/+111
For removing memmap region of sparse-vmemmap which is allocated bootmem, memmap region of sparse-vmemmap needs to be registered by get_page_bootmem(). So the patch searches pages of virtual mapping and registers the pages by get_page_bootmem(). NOTE: register_page_bootmem_memmap() is not implemented for ia64, ppc, s390, and sparc. So introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node() when platform doesn't support it. It's implemented by adding a new Kconfig option named CONFIG_HAVE_BOOTMEM_INFO_NODE, which will be automatically selected by memory-hotplug feature fully supported archs(currently only on x86_64). Since we have 2 config options called MEMORY_HOTPLUG and MEMORY_HOTREMOVE used for memory hot-add and hot-remove separately, and codes in function register_page_bootmem_info_node() are only used for collecting infomation for hot-remove, so reside it under MEMORY_HOTREMOVE. Besides page_isolation.c selected by MEMORY_ISOLATION under MEMORY_HOTPLUG is also such case, move it too. [mhocko@suse.cz: put register_page_bootmem_memmap inside CONFIG_MEMORY_HOTPLUG_SPARSE] [linfeng@cn.fujitsu.com: introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node()] [mhocko@suse.cz: remove the arch specific functions without any implementation] [linfeng@cn.fujitsu.com: mm/Kconfig: move auto selects from MEMORY_HOTPLUG to MEMORY_HOTREMOVE as needed] [rientjes@google.com: fix defined but not used warning] Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Wu Jianguo <wujianguo@huawei.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23memory-hotplug: introduce new arch_remove_memory() for removing page tableWen Congyang9-0/+97
For removing memory, we need to remove page tables. But it depends on architecture. So the patch introduce arch_remove_memory() for removing page table. Now it only calls __remove_pages(). Note: __remove_pages() for some archtecuture is not implemented (I don't know how to implement it for s390). Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23memory-hotplug: remove /sys/firmware/memmap/X sysfsYasuaki Ishimatsu3-14/+193
When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start, type} sysfs files are created. But there is no code to remove these files. This patch implements the function to remove them. We cannot free firmware_map_entry which is allocated by bootmem because there is no way to do so when the system is up. But we can at least remember the address of that memory and reuse the storage when the memory is added next time. This patch also introduces a new list map_entries_bootmem to link the map entries allocated by bootmem when they are removed, and a lock to protect it. And these entries will be reused when the memory is hot-added again. The idea is suggestted by Andrew Morton. NOTE: It is unsafe to return an entry pointer and release the map_entries_lock. So we should not hold the map_entries_lock separately in firmware_map_find_entry() and firmware_map_remove_entry(). Hold the map_entries_lock across find and remove /sys/firmware/memmap/X operation. And also, users of these two functions need to be careful to hold the lock when using these two functions. [tangchen@cn.fujitsu.com: Hold spinlock across find|remove /sys operation] [tangchen@cn.fujitsu.com: fix the wrong comments of map_entries] [tangchen@cn.fujitsu.com: reuse the storage of /sys/firmware/memmap/X/ allocated by bootmem] [tangchen@cn.fujitsu.com: fix section mismatch problem] [tangchen@cn.fujitsu.com: fix the doc format in drivers/firmware/memmap.c] Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Julian Calaby <julian.calaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23memory-hotplug: remove redundant codesWen Congyang1-47/+82
offlining memory blocks and checking whether memory blocks are offlined are very similar. This patch introduces a new function to remove redundant codes. Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23memory-hotplug: check whether all memory blocks are offlined or not when ↵Yasuaki Ishimatsu3-0/+55
removing memory We remove the memory like this: 1. lock memory hotplug 2. offline a memory block 3. unlock memory hotplug 4. repeat 1-3 to offline all memory blocks 5. lock memory hotplug 6. remove memory(TODO) 7. unlock memory hotplug All memory blocks must be offlined before removing memory. But we don't hold the lock in the whole operation. So we should check whether all memory blocks are offlined before step6. Otherwise, kernel maybe panicked. Offlining a memory block and removing a memory device can be two different operations. Users can just offline some memory blocks without removing the memory device. For this purpose, the kernel has held lock_memory_hotplug() in __offline_pages(). To reuse the code for memory hot-remove, we repeat step 1-3 to offline all the memory blocks, repeatedly lock and unlock memory hotplug, but not hold the memory hotplug lock in the whole operation. Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23memory-hotplug: try to offline the memory twice to avoid dependenceWen Congyang1-2/+14
memory can't be offlined when CONFIG_MEMCG is selected. For example: there is a memory device on node 1. The address range is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10, and memory11 under the directory /sys/devices/system/memory/. If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup when we online pages. When we online memory8, the memory stored page cgroup is not provided by this memory device. But when we online memory9, the memory stored page cgroup may be provided by memory8. So we can't offline memory8 now. We should offline the memory in the reversed order. When the memory device is hotremoved, we will auto offline memory provided by this memory device. But we don't know which memory is onlined first, so offlining memory may fail. In such case, iterate twice to offline the memory. 1st iterate: offline every non primary memory block. 2nd iterate: offline primary (i.e. first added) memory block. This idea is suggested by KOSAKI Motohiro. Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: memory_hotplug: no need to check res twice in add_memorySasha Levin1-2/+1
Remove one redundant check of res. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: make do_mmap_pgoff return populate as a size in bytes, not as a boolMichel Lespinasse6-14/+13
do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE multiple, however there was no equivalent code in mm_populate(), which caused issues. This could be fixed by introduced the same rounding in mm_populate(), however I think it's preferable to make do_mmap_pgoff() return populate as a size rather than as a boolean, so we don't have to duplicate the size rounding logic in mm_populate(). Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: introduce VM_POPULATE flag to better deal with racy userspace programsMichel Lespinasse5-15/+25
The vm_populate() code populates user mappings without constantly holding the mmap_sem. This makes it susceptible to racy userspace programs: the user mappings may change while vm_populate() is running, and in this case vm_populate() may end up populating the new mapping instead of the old one. In order to reduce the possibility of userspace getting surprised by this behavior, this change introduces the VM_POPULATE vma flag which gets set on vmas we want vm_populate() to work on. This way vm_populate() may still end up populating the new mapping after such a race, but only if the new mapping is also one that the user has requested (using MAP_SHARED, MAP_LOCKED or mlock) to be populated. Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: directly use __mlock_vma_pages_range() in find_extend_vma()Michel Lespinasse5-87/+9
In find_extend_vma(), we don't need mlock_vma_pages_range() to verify the vma type - we know we're working with a stack. So, we can call directly into __mlock_vma_pages_range(), and remove the last make_pages_present() call site. Note that we don't use mm_populate() here, so we can't release the mmap_sem while allocating new stack pages. This is deemed acceptable, because the stack vmas grow by a bounded number of pages at a time, and these are anon pages so we don't have to read from disk to populate them. Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: remove flags argument to mmap_regionMichel Lespinasse4-22/+18
After the MAP_POPULATE handling has been moved to mmap_region() call sites, the only remaining use of the flags argument is to pass the MAP_NORESERVE flag. This can be just as easily handled by do_mmap_pgoff(), so do that and remove the mmap_region() flags parameter. [akpm@linux-foundation.org: remove double parens] Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: use mm_populate() for mremap() of VM_LOCKED vmasMichel Lespinasse1-12/+13
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: use mm_populate() when adjusting brk with MCL_FUTURE in effectMichel Lespinasse1-4/+14
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: use mm_populate() for blocking remap_file_pages()Michel Lespinasse1-16/+6
Signed-off-by: Michel Lespinasse <walken@google.com> Reviewed-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: introduce mm_populate() for populating new vmasMichel Lespinasse7-22/+62
When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or with MCL_FUTURE in effect), we want to populate the pages within the newly created vmas. This may take a while as we may have to read pages from disk, so ideally we want to do this outside of the write-locked mmap_sem region. This change introduces mm_populate(), which is used to defer populating such mappings until after the mmap_sem write lock has been released. This is implemented as a generalization of the former do_mlock_pages(), which accomplished the same task but was using during mlock() / mlockall(). Signed-off-by: Michel Lespinasse <walken@google.com> Reported-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: remap_file_pages() fixesMichel Lespinasse1-8/+14
We have many vma manipulation functions that are fast in the typical case, but can optionally be instructed to populate an unbounded number of ptes within the region they work on: - mmap with MAP_POPULATE or MAP_LOCKED flags; - remap_file_pages() with MAP_NONBLOCK not set or when working on a VM_LOCKED vma; - mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in effect; - brk() when mlock(MCL_FUTURE) is in effect. Current code handles these pte operations locally, while the sourrounding code has to hold the mmap_sem write side since it's manipulating vmas. This means we're doing an unbounded amount of pte population work with mmap_sem held, and this causes problems as Andy Lutomirski reported (we've hit this at Google as well, though it's not entirely clear why people keep trying to use mlock(MCL_FUTURE) in the first place). I propose introducing a new mm_populate() function to do this pte population work after the mmap_sem has been released. mm_populate() does need to acquire the mmap_sem read side, but critically, it doesn't need to hold it continuously for the entire duration of the operation - it can drop it whenever things take too long (such as when hitting disk for a file read) and re-acquire it later on. The following patches are included - Patches 1 fixes some issues I noticed while working on the existing code. If needed, they could potentially go in before the rest of the patches. - Patch 2 introduces the new mm_populate() function and changes mmap_region() call sites to use it after they drop mmap_sem. This is inspired from Andy Lutomirski's proposal and is built as an extension of the work I had previously done for mlock() and mlockall() around v2.6.38-rc1. I had tried doing something similar at the time but had given up as there were so many do_mmap() call sites; the recent cleanups by Linus and Viro are a tremendous help here. - Patches 3-5 convert some of the less-obvious places doing unbounded pte populates to the new mm_populate() mechanism. - Patches 6-7 are code cleanups that are made possible by the mm_populate() work. In particular, they remove more code than the entire patch series added, which should be a good thing :) - Patch 8 is optional to this entire series. It only helps to deal more nicely with racy userspace programs that might modify their mappings while we're trying to populate them. It adds a new VM_POPULATE flag on the mappings we do want to populate, so that if userspace replaces them with mappings it doesn't want populated, mm_populate() won't populate those replacement mappings. This patch: Assorted small fixes. The first two are quite small: - Move check for vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR) within existing if (!(vma->vm_flags & VM_NONLINEAR)) block. Purely cosmetic. - In the VM_LOCKED case, when dropping PG_Mlocked for the over-mapped range, make sure we own the mmap_sem write lock around the munlock_vma_pages_range call as this manipulates the vma's vm_flags. Last fix requires a longer explanation. remap_file_pages() can do its work either through VM_NONLINEAR manipulation or by creating extra vmas. These two cases were inconsistent with each other (and ultimately, both wrong) as to exactly when did they fault in the newly mapped file pages: - In the VM_NONLINEAR case, new file pages would be populated if the MAP_NONBLOCK flag wasn't passed. If MAP_NONBLOCK was passed, new file pages wouldn't be populated even if the vma is already marked as VM_LOCKED. - In the linear (emulated) case, the work is passed to the mmap_region() function which would populate the pages if the vma is marked as VM_LOCKED, and would not otherwise - regardless of the value of the MAP_NONBLOCK flag, because MAP_POPULATE wasn't being passed to mmap_region(). The desired behavior is that we want the pages to be populated and locked if the vma is marked as VM_LOCKED, or to be populated if the MAP_NONBLOCK flag is not passed to remap_file_pages(). Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: avoid calling pgdat_balanced() needlesslyZlatko Calusic1-4/+11
Now that balance_pgdat() is slightly tidied up, thanks to more capable pgdat_balanced(), it's become obvious that pgdat_balanced() is called to check the status, then break the loop if pgdat is balanced, just to be immediately called again. The second call is completely unnecessary, of course. The patch introduces pgdat_is_balanced boolean, which helps resolve the above suboptimal behavior, with the added benefit of slightly better documenting one other place in the function where we jump and skip lots of code. Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: compaction: make __compact_pgdat() and compact_pgdat() return voidAndrew Morton2-10/+7
These functions always return 0. Formalise this. Cc: Jason Liu <r64343@freescale.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: make madvise(MADV_WILLNEED) support swap file prefetchShaohua Li1-4/+101
Make madvise(MADV_WILLNEED) support swap file prefetch. If memory is swapout, this syscall can do swapin prefetch. It has no impact if the memory isn't swapout. [akpm@linux-foundation.org: fix CONFIG_SWAP=n build] [sasha.levin@oracle.com: fix BUG on madvise early failure] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23memcg,vmscan: do not break out targeted reclaim without reclaimed pagesMichal Hocko1-10/+9
Targeted (hard resp soft) reclaim has traditionally tried to scan one group with decreasing priority until nr_to_reclaim (SWAP_CLUSTER_MAX pages) is reclaimed or all priorities are exhausted. The reclaim is then retried until the limit is met. This approach, however, doesn't work well with deeper hierarchies where groups higher in the hierarchy do not have any or only very few pages (this usually happens if those groups do not have any tasks and they have only re-parented pages after some of their children is removed). Those groups are reclaimed with decreasing priority pointlessly as there is nothing to reclaim from them. An easiest fix is to break out of the memcg iteration loop in shrink_zone only if the whole hierarchy has been visited or sufficient pages have been reclaimed. This is also more natural because the reclaimer expects that the hierarchy under the given root is reclaimed. As a result we can simplify the soft limit reclaim which does its own iteration. [yinghan@google.com: break out of the hierarchy loop only if nr_reclaimed exceeded nr_to_reclaim] [akpm@linux-foundation.org: use conventional comparison order] Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Ying Han <yinghan@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <htejun@gmail.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm/ksm.c: use new hashtable implementationSasha Levin1-18/+12
Switch ksm to use the new hashtable implementation. This reduces the amount of generic unrelated code in the ksm module. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm/huge_memory.c: use new hashtable implementationSasha Levin1-45/+9
Switch hugemem to use the new hashtable implementation. This reduces the amount of generic unrelated code in the hugemem. This also removes the dymanic allocation of the hash table. The upside is that we save a pointer dereference when accessing the hashtable, but we lose 8KB if CONFIG_TRANSPARENT_HUGEPAGE is enabled but the processor doesn't support hugepages. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: compaction: do not accidentally skip pageblocks in the migrate scannerMel Gorman1-3/+2
Compaction uses the ALIGN macro incorrectly with the migrate scanner by adding pageblock_nr_pages to a PFN. It happened to work when initially implemented as the starting PFN was also aligned but with caching restarts and isolating in smaller chunks this is no longer always true. The impact is that the migrate scanner scans outside its current pageblock. As pfn_valid() is still checked properly it does not cause any failure and the impact of the bug is that in some cases it will scan more than necessary when it crosses a page boundary but by no more than COMPACT_CLUSTER_MAX. It is highly unlikely this is even measurable but it's still wrong so this patch addresses the problem. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm/vmscan.c:__zone_reclaim(): replace max_t() with max()Andrew Morton1-2/+1
"mm: vmscan: save work scanning (almost) empty LRU lists" made SWAP_CLUSTER_MAX an unsigned long. Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Satoru Moriya <satoru.moriya@hds.com> Cc: Simon Jeons <simon.jeons@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm/page_alloc.c:__setup_per_zone_wmarks: make min_pages unsigned longAndrew Morton1-5/+2
`int' is an inappropriate type for a number-of-pages counter. While we're there, use the clamp() macro. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Satoru Moriya <satoru.moriya@hds.com> Cc: Simon Jeons <simon.jeons@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: reduce rmap overhead for ex-KSM page copies created on swap faultsJohannes Weiner2-7/+4
When ex-KSM pages are faulted from swap cache, the fault handler is not capable of re-establishing anon_vma-spanning KSM pages. In this case, a copy of the page is created instead, just like during a COW break. These freshly made copies are known to be exclusive to the faulting VMA and there is no reason to go look for this page in parent and sibling processes during rmap operations. Use page_add_new_anon_rmap() for these copies. This also puts them on the proper LRU lists and marks them SwapBacked, so we can get rid of doing this ad-hoc in the KSM copy code. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Satoru Moriya <satoru.moriya@hds.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: vmscan: compaction works against zones, not lruvecsJohannes Weiner1-88/+91
The restart logic for when reclaim operates back to back with compaction is currently applied on the lruvec level. But this does not make sense, because the container of interest for compaction is a zone as a whole, not the zone pages that are part of a certain memory cgroup. Negative impact is bounded. For one, the code checks that the lruvec has enough reclaim candidates, so it does not risk getting stuck on a condition that can not be fulfilled. And the unfairness of hammering on one particular memory cgroup to make progress in a zone will be amortized by the round robin manner in which reclaim goes through the memory cgroups. Still, this can lead to unnecessary allocation latencies when the code elects to restart on a hard to reclaim or small group when there are other, more reclaimable groups in the zone. Move this logic to the zone level and restart reclaim for all memory cgroups in a zone when compaction requires more free pages from it. [akpm@linux-foundation.org: no need for min_t] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Satoru Moriya <satoru.moriya@hds.com> Cc: Simon Jeons <simon.jeons@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: vmscan: clean up get_scan_count()Johannes Weiner1-21/+44
Reclaim pressure balance between anon and file pages is calculated through a tuple of numerators and a shared denominator. Exceptional cases that want to force-scan anon or file pages configure the numerators and denominator such that one list is preferred, which is not necessarily the most obvious way: fraction[0] = 1; fraction[1] = 0; denominator = 1; goto out; Make this easier by making the force-scan cases explicit and use the fractionals only in case they are calculated from reclaim history. [akpm@linux-foundation.org: avoid using unintialized_var()] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Satoru Moriya <satoru.moriya@hds.com> Cc: Simon Jeons <simon.jeons@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: vmscan: improve comment on low-page cache handlingJohannes Weiner1-5/+7
Fix comment style and elaborate on why anonymous memory is force-scanned when file cache runs low. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Satoru Moriya <satoru.moriya@hds.com> Cc: Simon Jeons <simon.jeons@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: vmscan: clarify how swappiness, highest priority, memcg interactJohannes Weiner1-9/+30
A swappiness of 0 has a slightly different meaning for global reclaim (may swap if file cache really low) and memory cgroup reclaim (never swap, ever). In addition, global reclaim at highest priority will scan all LRU lists equal to their size and ignore other balancing heuristics. UNLESS swappiness forbids swapping, then the lists are balanced based on recent reclaim effectiveness. UNLESS file cache is running low, then anonymous pages are force-scanned. This (total mess of a) behaviour is implicit and not obvious from the way the code is organized. At least make it apparent in the code flow and document the conditions. It will be it easier to come up with sane semantics later. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Satoru Moriya <satoru.moriya@hds.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Simon Jeons <simon.jeons@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: vmscan: save work scanning (almost) empty LRU listsJohannes Weiner2-5/+7
In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum amount of pages is scanned from the LRU lists on each iteration, to make progress. Do not make this minimum bigger than the respective LRU list size, however, and save some busy work trying to isolate and reclaim pages that are not there. Empty LRU lists are quite common with memory cgroups in NUMA environments because there exists a set of LRU lists for each zone for each memory cgroup, while the memory of a single cgroup is expected to stay on just one node. The number of expected empty LRU lists is thus memcgs * (nodes - 1) * lru types Each attempt to reclaim from an empty LRU list does expensive size comparisons between lists, acquires the zone's lru lock etc. Avoid that. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Satoru Moriya <satoru.moriya@hds.com> Cc: Simon Jeons <simon.jeons@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: memcg: only evict file pages when we have plentyJohannes Weiner1-9/+11
Commit e9868505987a ("mm, vmscan: only evict file pages when we have plenty") makes a point of not going for anonymous memory while there is still enough inactive cache around. The check was added only for global reclaim, but it is just as useful to reduce swapping in memory cgroup reclaim: 200M-memcg-defconfig-j2 vanilla patched Real time 454.06 ( +0.00%) 453.71 ( -0.08%) User time 668.57 ( +0.00%) 668.73 ( +0.02%) System time 128.92 ( +0.00%) 129.53 ( +0.46%) Swap in 1246.80 ( +0.00%) 814.40 ( -34.65%) Swap out 1198.90 ( +0.00%) 827.00 ( -30.99%) Pages allocated 16431288.10 ( +0.00%) 16434035.30 ( +0.02%) Major faults 681.50 ( +0.00%) 593.70 ( -12.86%) THP faults 237.20 ( +0.00%) 242.40 ( +2.18%) THP collapse 241.20 ( +0.00%) 248.50 ( +3.01%) THP splits 157.30 ( +0.00%) 161.40 ( +2.59%) Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Satoru Moriya <satoru.moriya@hds.com> Cc: Simon Jeons <simon.jeons@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23CMA: make putback_lru_pages() call conditionalSrinivas Pandruvada1-3/+5
As per documentation and other places calling putback_lru_pages(), putback_lru_pages() is called on error only. Make the CMA code behave consistently. [akpm@linux-foundation.org: remove a test-n-branch in the wrapup code] Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm/hugetlb.c: convert to pr_foo()Andrew Morton1-13/+9
Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm/memcontrol.c: convert printk(KERN_FOO) to pr_foo()Andrew Morton1-8/+7
Acked-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23memcg, oom: provide more precise dump info while memcg oom happeningSha Zhengju2-12/+41
Currently when a memcg oom is happening the oom dump messages is still global state and provides few useful info for users. This patch prints more pointed memcg page statistics for memcg-oom and take hierarchy into consideration: Based on Michal's advice, we take hierarchy into consideration: supppose we trigger an OOM on A's limit root_memcg | A (use_hierachy=1) / \ B C | D then the printed info will be: Memory cgroup stats for /A:... Memory cgroup stats for /A/B:... Memory cgroup stats for /A/C:... Memory cgroup stats for /A/B/D:... Following are samples of oom output: (1) Before change: mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fbfb>] dump_header+0x83/0x1ca ..... (call trace) [<ffffffff8168a818>] page_fault+0x28/0x30 <<<<<<<<<<<<<<<<<<<<< memcg specific information Task in /A/B/D killed as a result of limit of /A memory: usage 101376kB, limit 101376kB, failcnt 57 memory+swap: usage 101376kB, limit 101376kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat Mem-Info: Node 0 DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 ...... CPU 3: hi: 0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU 0: hi: 186, btch: 31 usd: 173 ...... CPU 3: hi: 186, btch: 31 usd: 130 <<<<<<<<<<<<<<<<<<<<< print global page state active_anon:92963 inactive_anon:40777 isolated_anon:0 active_file:33027 inactive_file:51718 isolated_file:0 unevictable:0 dirty:3 writeback:0 unstable:0 free:729995 slab_reclaimable:6897 slab_unreclaimable:6263 mapped:20278 shmem:35971 pagetables:5885 bounce:0 free_cma:0 <<<<<<<<<<<<<<<<<<<<< print per zone page state Node 0 DMA free:15836kB ... all_unreclaimable? no lowmem_reserve[]: 0 3175 3899 3899 Node 0 DMA32 free:2888564kB ... all_unrelaimable? no lowmem_reserve[]: 0 0 724 724 lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB 120710 total pagecache pages 0 pages in swap cache <<<<<<<<<<<<<<<<<<<<< print global swap cache stat Swap cache stats: add 0, delete 0, find 0/0 Free swap = 499708kB Total swap = 499708kB 1040368 pages RAM 58678 pages reserved 169065 pages shared 173632 pages non-shared [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2693] 0 2693 6005 1324 17 0 0 god [ 2754] 0 2754 6003 1320 16 0 0 god [ 2811] 0 2811 5992 1304 18 0 0 god [ 2874] 0 2874 6005 1323 18 0 0 god [ 2935] 0 2935 8720 7742 21 0 0 mal-30 [ 2976] 0 2976 21520 17577 42 0 0 mal-80 Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB We can see that messages dumped by show_free_areas() are longsome and can provide so limited info for memcg that just happen oom. (2) After change mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fd0b>] dump_header+0x83/0x1d1 .......(call trace) [<ffffffff8168a918>] page_fault+0x28/0x30 Task in /A/B/D killed as a result of limit of /A <<<<<<<<<<<<<<<<<<<<< memcg specific information memory: usage 102400kB, limit 102400kB, failcnt 140 memory+swap: usage 102400kB, limit 102400kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2260] 0 2260 6006 1325 18 0 0 god [ 2383] 0 2383 6003 1319 17 0 0 god [ 2503] 0 2503 6004 1321 18 0 0 god [ 2622] 0 2622 6004 1321 16 0 0 god [ 2695] 0 2695 8720 7741 22 0 0 mal-30 [ 2704] 0 2704 21520 17839 43 0 0 mal-80 Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB This version provides more pointed info for memcg in "Memory cgroup stats for XXX" section. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23drivers/md/persistent-data/dm-transaction-manager.c: rename HASH_SIZEAndrew Morton1-7/+7
Fix the warning: drivers/md/persistent-data/dm-transaction-manager.c:28:1: warning: "HASH_SIZE" redefined In file included from include/linux/elevator.h:5, from include/linux/blkdev.h:216, from drivers/md/persistent-data/dm-block-manager.h:11, from drivers/md/persistent-data/dm-transaction-manager.h:10, from drivers/md/persistent-data/dm-transaction-manager.c:6: include/linux/hashtable.h:22:1: warning: this is the location of the previous definition Cc: Alasdair Kergon <agk@redhat.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-21Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds77-1247/+1542
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm changes from Peter Anvin: "This is a huge set of several partly interrelated (and concurrently developed) changes, which is why the branch history is messier than one would like. The *really* big items are two humonguous patchsets mostly developed by Yinghai Lu at my request, which completely revamps the way we create initial page tables. In particular, rather than estimating how much memory we will need for page tables and then build them into that memory -- a calculation that has shown to be incredibly fragile -- we now build them (on 64 bits) with the aid of a "pseudo-linear mode" -- a #PF handler which creates temporary page tables on demand. This has several advantages: 1. It makes it much easier to support things that need access to data very early (a followon patchset uses this to load microcode way early in the kernel startup). 2. It allows the kernel and all the kernel data objects to be invoked from above the 4 GB limit. This allows kdump to work on very large systems. 3. It greatly reduces the difference between Xen and native (Xen's equivalent of the #PF handler are the temporary page tables created by the domain builder), eliminating a bunch of fragile hooks. The patch series also gets us a bit closer to W^X. Additional work in this pull is the 64-bit get_user() work which you were also involved with, and a bunch of cleanups/speedups to __phys_addr()/__pa()." * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (105 commits) x86, mm: Move reserving low memory later in initialization x86, doc: Clarify the use of asm("%edx") in uaccess.h x86, mm: Redesign get_user with a __builtin_choose_expr hack x86: Be consistent with data size in getuser.S x86, mm: Use a bitfield to mask nuisance get_user() warnings x86/kvm: Fix compile warning in kvm_register_steal_time() x86-32: Add support for 64bit get_user() x86-32, mm: Remove reference to alloc_remap() x86-32, mm: Remove reference to resume_map_numa_kva() x86-32, mm: Rip out x86_32 NUMA remapping code x86/numa: Use __pa_nodebug() instead x86: Don't panic if can not alloc buffer for swiotlb mm: Add alloc_bootmem_low_pages_nopanic() x86, 64bit, mm: hibernate use generic mapping_init x86, 64bit, mm: Mark data/bss/brk to nx x86: Merge early kernel reserve for 32bit and 64bit x86: Add Crash kernel low reservation x86, kdump: Remove crashkernel range find limit for 64bit memblock: Add memblock_mem_size() x86, boot: Not need to check setup_header version for setup_data ...
2013-02-21Merge branch 'x86-cpu-for-linus' of ↵Linus Torvalds5-32/+43
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu updates from Peter Anvin: "This is a corrected attempt at the x86/cpu branch, this time with the fixes in that makes it not break on KVM (current or past), or any other virtualizer which traps on this configuration. Again, the biggest change here is enabling the WC+ memory type on AMD processors, if the BIOS doesn't." * 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, kvm: Add MSR_AMD64_BU_CFG2 to the list of ignored MSRs x86, cpu, amd: Fix WC+ workaround for older virtual hosts x86, AMD: Enable WC+ memory type on family 10 processors x86, AMD: Clean up init_amd() x86/process: Change %8s to %s for pr_warn() in release_thread() x86/cpu/hotplug: Remove CONFIG_EXPERIMENTAL dependency
2013-02-21Merge tag 'please-pull-misc-3.9' of ↵Linus Torvalds4-11/+17
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux Pull misc ia64 bits from Tony Luck. * tag 'please-pull-misc-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: MAINTAINERS: update SGI & ia64 Altix stuff sysctl: Enable IA64 "ignore-unaligned-usertrap" to be used cross-arch
2013-02-21Merge branch 'for-linus' of ↵Linus Torvalds82-930/+1216
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 update from Martin Schwidefsky: "The most prominent change in this patch set is the software dirty bit patch for s390. It removes __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY and the page_test_and_clear_dirty primitive which makes the common memory management code a bit less obscure. Heiko fixed most of the PCI related fallout, more often than not missing GENERIC_HARDIRQS dependencies. Notable is one of the 3270 patches which adds an export to tty_io to be able to resize a tty. The rest is the usual bunch of cleanups and bug fixes." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits) s390/module: Add missing R_390_NONE relocation type drivers/gpio: add missing GENERIC_HARDIRQ dependency drivers/input: add couple of missing GENERIC_HARDIRQS dependencies s390/cleanup: rename SPP to LPP s390/mm: implement software dirty bits s390/mm: Fix crst upgrade of mmap with MAP_FIXED s390/linker skript: discard exit.data at runtime drivers/media: add missing GENERIC_HARDIRQS dependency s390/bpf,jit: add vlan tag support drivers/net,AT91RM9200: add missing GENERIC_HARDIRQS dependency iucv: fix kernel panic at reboot s390/Kconfig: sort list of arch selected config options phylib: remove !S390 dependeny from Kconfig uio: remove !S390 dependency from Kconfig dasd: fix sysfs cleanup in dasd_generic_remove s390/pci: fix hotplug module init s390/pci: cleanup clp page allocation s390/pci: cleanup clp inline assembly s390/perf: cpum_cf: fallback to software sampling events s390/mm: provide PAGE_SHARED define ...
2013-02-21Merge branch 'for-linus' of ↵Linus Torvalds71-824/+1405
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid Pull HID subsystem updates from Jiri Kosina: "HID subsystem and drivers update. Highlights: - new support of a group of Win7/Win8 multitouch devices, from Benjamin Tissoires - fix for compat interface brokenness in uhid, from Dmitry Torokhov - conversion of drivers to use hid_driver helper, by H Hartley Sweeten - HID over I2C transport received ACPI enumeration support, written by Mika Westerberg - there is an ongoing effort to make HID sensor hubs independent of USB transport. The first self-contained part of this work is provided here, done by Mika Westerberg - a few smaller fixes here and there, support for a couple new devices added" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (43 commits) HID: Correct Logitech order in hid-ids.h HID: LG4FF: Remove unnecessary deadzone code HID: LG: Prevent the Logitech Gaming Wheels deadzone HID: LG: Fix detection of Logitech Speed Force Wireless (WiiWheel) HID: LG: Add support for Logitech Momo Force (Red) Wheel HID: hidraw: print message when succesfully initialized HID: logitech: split accel, brake for Driving Force wheel HID: logitech: add report descriptor for Driving Force wheel HID: add ThingM blink(1) USB RGB LED support HID: uhid: make creating devices work on 64/32 systems HID: wiimote: fix nunchuck button parser HID: blacklist Velleman data acquisition boards HID: sensor-hub: don't limit the driver only to USB bus HID: sensor-hub: get rid of unused sensor_hub_grabbed_usages[] table HID: extend autodetect to handle I2C sensors as well HID: ntrig: use input_configured() callback to set the name HID: multitouch: do not use pointers towards hid-core HID: add missing GENERIC_HARDIRQ dependency HID: multitouch: make MT_CLS_ALWAYS_TRUE the new default class HID: multitouch: fix protocol for Elo panels ...
2013-02-21Merge branch 'for-linus' of ↵Linus Torvalds58-110/+215
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree from Jiri Kosina: "Assorted tiny fixes queued in trivial tree" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (22 commits) DocBook: update EXPORT_SYMBOL entry to point at export.h Documentation: update top level 00-INDEX file with new additions ARM: at91/ide: remove unsused at91-ide Kconfig entry percpu_counter.h: comment code for better readability x86, efi: fix comment typo in head_32.S IB: cxgb3: delay freeing mem untill entirely done with it net: mvneta: remove unneeded version.h include time: x86: report_lost_ticks doesn't exist any more pcmcia: avoid static analysis complaint about use-after-free fs/jfs: Fix typo in comment : 'how may' -> 'how many' of: add missing documentation for of_platform_populate() btrfs: remove unnecessary cur_trans set before goto loop in join_transaction sound: soc: Fix typo in sound/codecs treewide: Fix typo in various drivers btrfs: fix comment typos Update ibmvscsi module name in Kconfig. powerpc: fix typo (utilties -> utilities) of: fix spelling mistake in comment h8300: Fix home page URL in h8300/README xtensa: Fix home page URL in Kconfig ...