summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2009-03-06tracing, Text Edit Lock - Architecture Independent CodeMathieu Desnoyers1-0/+10
This is an architecture independant synchronization around kernel text modifications through use of a global mutex. A mutex has been chosen so that kprobes, the main user of this, can sleep during memory allocation between the memory read of the instructions it must replace and the memory write of the breakpoint. Other user of this interface: immediate values. Paravirt and alternatives are always done when SMP is inactive, so there is no need to use locks. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> LKML-Reference: <49B142D8.7020601@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-06Merge branch 'x86/core' into tracing/texteditIngo Molnar6-26/+1128
Conflicts: arch/x86/Kconfig block/blktrace.c kernel/irq/handle.c Semantic conflict: kernel/trace/blktrace.c Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-04Merge branch 'core/locking' into tracing/ftraceIngo Molnar5-0/+14
2009-03-04Merge branch 'x86/core' into core/percpuIngo Molnar3-27/+36
2009-03-04Merge branches 'x86/apic', 'x86/cpu', 'x86/fixmap', 'x86/mm', 'x86/sched', ↵Ingo Molnar4-31/+35
'x86/setup-lzma', 'x86/signal' and 'x86/urgent' into x86/core
2009-03-02Merge branches 'tracing/ftrace', 'tracing/mmiotrace' and 'linus' into ↵Ingo Molnar1-1/+9
tracing/core
2009-03-02x86, mm: dont use non-temporal stores in pagecache accessesIngo Molnar2-8/+5
Impact: standardize IO on cached ops On modern CPUs it is almost always a bad idea to use non-temporal stores, as the regression in this commit has shown it: 30d697f: x86: fix performance regression in write() syscall The kernel simply has no good information about whether using non-temporal stores is a good idea or not - and trying to add heuristics only increases complexity and inserts fragility. The regression on cached write()s took very long to be found - over two years. So dont take any chances and let the hardware decide how it makes use of its caches. The only exception is drivers/gpu/drm/i915/i915_gem.c: there were we are absolutely sure that another entity (the GPU) will pick up the dirty data immediately and that the CPU will not touch that data before the GPU will. Also, keep the _nocache() primitives to make it easier for people to experiment with these details. There may be more clear-cut cases where non-cached copies can be used, outside of filemap.c. Cc: Salman Qazi <sqazi@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-01Merge branch 'x86/urgent' into x86/patIngo Molnar2-23/+30
2009-03-01bootmem, x86: further fixes for arch-specific bootmem wrappingTejun Heo1-15/+30
Impact: fix new breakages introduced by previous fix Commit c132937556f56ee4b831ef4b23f1846e05fde102 tried to clean up bootmem arch wrapper but it wasn't quite correct. Before the commit, the followings were broken. * Low level interface functions prefixed with __ ignored arch preference. * reserve_bootmem(...) can't be mapped into reserve_bootmem_node(NODE_DATA(0)->bdata, ...) because the node is not preference here. The region specified MUST fall into the specified region; otherwise, it will panic. After the commit, * If allocation fails for the arch preferred node, it should fallback to whatever is available. Instead, it simply failed allocation. There are too many internal details to allow generic wrapping and still keep things simple for archs. Plus, all that arch wants is a way to prefer certain node over another. This patch drops the generic wrapping around alloc_bootmem_core() and add alloc_bootmem_core() instead. If necessary, arch can define bootmem_arch_referred_node() macro or function which takes all allocation information and returns the preferred node. bootmem generic code will always try the preferred node first and then fallback to other nodes as usual. Breakages noted and changes reviewed by Johannes Weiner. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
2009-03-01percpu: kill compile warning in pcpu_populate_chunk()Tejun Heo1-1/+1
Impact: remove compile warning Mark local variable map_end in pcpu_populate_chunk() with uninitialized_var(). The variable is always used in tandem with map_start and guaranteed to be initialized before use but gcc doesn't understand that. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ingo Molnar <mingo@elte.hu>
2009-02-27mm: fix lazy vmap purging (use-after-free error)Vegard Nossum1-1/+2
I just got this new warning from kmemcheck: WARNING: kmemcheck: Caught 32-bit read from freed memory (c7806a60) a06a80c7ecde70c1a04080c700000000a06709c1000000000000000000000000 f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f ^ Pid: 0, comm: swapper Not tainted (2.6.29-rc4 #230) EIP: 0060:[<c1096df7>] EFLAGS: 00000286 CPU: 0 EIP is at __purge_vmap_area_lazy+0x117/0x140 EAX: 00070f43 EBX: c7806a40 ECX: c1677080 EDX: 00027b66 ESI: 00002001 EDI: c170df0c EBP: c170df00 ESP: c178830c DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 CR0: 80050033 CR2: c7806b14 CR3: 01775000 CR4: 00000690 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: 00004000 DR7: 00000000 [<c1096f3e>] free_unmap_vmap_area_noflush+0x6e/0x70 [<c1096f6a>] remove_vm_area+0x2a/0x70 [<c1097025>] __vunmap+0x45/0xe0 [<c10970de>] vunmap+0x1e/0x30 [<c1008ba5>] text_poke+0x95/0x150 [<c1008ca9>] alternatives_smp_unlock+0x49/0x60 [<c171ef47>] alternative_instructions+0x11b/0x124 [<c171f991>] check_bugs+0xbd/0xdc [<c17148c5>] start_kernel+0x2ed/0x360 [<c171409e>] __init_begin+0x9e/0xa9 [<ffffffff>] 0xffffffff It happened here: $ addr2line -e vmlinux -i c1096df7 mm/vmalloc.c:540 Code: list_for_each_entry(va, &valist, purge_list) __free_vmap_area(va); It's this instruction: mov 0x20(%ebx),%edx Which corresponds to a dereference of va->purge_list.next: (gdb) p ((struct vmap_area *) 0)->purge_list.next Cannot access memory at address 0x20 It seems that we should use "safe" list traversal here, as the element is freed inside the loop. Please verify that this is the right fix. Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: <stable@kernel.org> [2.6.28.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-27mm: vmap fix overflowNick Piggin1-0/+7
The new vmap allocator can wrap the address and get confused in the case of large allocations or VMALLOC_END near the end of address space. Problem reported by Christoph Hellwig on a 32-bit XFS workload. Signed-off-by: Nick Piggin <npiggin@suse.de> Reported-by: Christoph Hellwig <hch@lst.de> Cc: <stable@kernel.org> [2.6.28.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-27Merge branches 'tracing/ftrace' and 'linus' into tracing/coreIngo Molnar1-22/+21
2009-02-26Merge branches 'x86/apic', 'x86/defconfig', 'x86/memtest', 'x86/mm' and ↵Ingo Molnar3-5/+10
'linus' into x86/core
2009-02-25shmem: fix shared anonymous accountingHugh Dickins1-22/+21
Each time I exit Firefox, /proc/meminfo's Committed_AS goes down almost 400 kB: OVERCOMMIT_NEVER would be allowing overcommits it should prohibit. Commit fc8744adc870a8d4366908221508bb113d8b72ee "Stop playing silly games with the VM_ACCOUNT flag" changed shmem_file_setup() to set the shmem file's VM_ACCOUNT flag according to VM_NORESERVE not being set in the vma flags; but did so only _after_ the shmem_acct_size(flags, size) call which is expected to pre-account a shared anonymous object. It's all clearer if we switch shmem.c over to use VM_NORESERVE throughout in place of !VM_ACCOUNT. But I very nearly sent in a patch which mistakenly removed the accounting from tmpfs files: shmem_get_inode()'s memset was good for not setting VM_ACCOUNT, but now it needs to set VM_NORESERVE. Rather than setting that by default, then perhaps clearing it again in shmem_file_setup(), let's pass it as a flag to shmem_get_inode(): that allows us to remove the #ifdef CONFIG_SHMEM from shmem_file_setup(). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-25x86: make vmap yell louder when it is used under irqs_disabled()Peter Zijlstra1-0/+3
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25x86, mm: pass in 'total' to __copy_from_user_*nocache()Ingo Molnar2-5/+7
Impact: cleanup, enable future change Add a 'total bytes copied' parameter to __copy_from_user_*nocache(), and update all the callsites. The parameter is not used yet - architecture code can use it to more intelligently decide whether the copy should be cached or non-temporal. Cc: Salman Qazi <sqazi@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-24Merge branch 'tj-percpu' of ↵Ingo Molnar5-19/+1104
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into core/percpu Conflicts: arch/x86/include/asm/pgtable.h
2009-02-24Merge branch 'tracing/ftrace'; commit 'v2.6.29-rc6' into tracing/coreIngo Molnar4-18/+36
2009-02-24percpu: add __read_mostly to variables which are mostly read onlyTejun Heo1-8/+8
Most global variables in percpu allocator are initialized during boot and read only from that point on. Add __read_mostly as per Rusty's suggestion. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au>
2009-02-24percpu: give more latitude to arch specific first chunk initializationTejun Heo1-33/+116
Impact: more latitude for first percpu chunk allocation The first percpu chunk serves the kernel static percpu area and may or may not contain extra room for further dynamic allocation. Initialization of the first chunk needs to be done before normal memory allocation service is up, so it has its own init path - pcpu_setup_static(). It seems archs need more latitude while initializing the first chunk for example to take advantage of large page mapping. This patch makes the following changes to allow this. * Define PERCPU_DYNAMIC_RESERVE to give arch hint about how much space to reserve in the first chunk for further dynamic allocation. * Rename pcpu_setup_static() to pcpu_setup_first_chunk(). * Make pcpu_setup_first_chunk() much more flexible by fetching page pointer by callback and adding optional @unit_size, @free_size and @base_addr arguments which allow archs to selectively part of chunk initialization to their likings. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-24percpu: remove unit_size power-of-2 restrictionTejun Heo1-14/+19
Impact: allow unit_size to be arbitrary multiple of PAGE_SIZE In dynamic percpu allocator, there is no reason the unit size should be power of two. Remove the restriction. As non-power-of-two unit size means that empty chunks fall into the same slot index as lightly occupied chunks which is bad for reclaming. Reserve an extra slot for empty chunks. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-24vmalloc: add @align to vm_area_register_early()Tejun Heo2-5/+8
Impact: allow larger alignment for early vmalloc area allocation Some early vmalloc users might want larger alignment, for example, for custom large page mapping. Add @align to vm_area_register_early(). While at it, drop docbook comment on non-existent @size. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
2009-02-24bootmem: clean up arch-specific bootmem wrappingTejun Heo1-3/+11
Impact: cleaner and consistent bootmem wrapping By setting CONFIG_HAVE_ARCH_BOOTMEM_NODE, archs can define arch-specific wrappers for bootmem allocation. However, this is done a bit strangely in that only the high level convenience macros can be changed while lower level, but still exported, interface functions can't be wrapped. This not only is messy but also leads to strange situation where alloc_bootmem() does what the arch wants it to do but the equivalent __alloc_bootmem() call doesn't although they should be able to be used interchangeably. This patch updates bootmem such that archs can override / wrap the backend function - alloc_bootmem_core() instead of the highlevel interface functions to allow simpler and consistent wrapping. Also, HAVE_ARCH_BOOTMEM_NODE is renamed to HAVE_ARCH_BOOTMEM. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@saeurebad.de>
2009-02-24percpu: fix pcpu_chunk_struct_sizeTejun Heo1-1/+1
Impact: fix short allocation leading to memory corruption While dropping rvalue wrapping macros around global parameters, pcpu_chunk_struct_size was set incorrectly resulting in shorter page pointer array. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-21Merge branch 'hibernate'Linus Torvalds2-18/+14
* hibernate: PM: Fix suspend_console and resume_console to use only one semaphore PM: Wait for console in resume PM: Fix pm_notifiers during user mode hibernation swsusp: clean up shrink_all_zones() swsusp: dont fiddle with swappiness PM: fix build for CONFIG_PM unset PM/hibernate: fix "swap breaks after hibernation failures" PM/resume: wait for device probing to finish Consolidate driver_probe_done() loops into one place
2009-02-21swsusp: clean up shrink_all_zones()Johannes Weiner1-12/+11
Move local variables to innermost possible scopes and use local variables to cache calculations/reads done more than once. No change in functionality (intended). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <gregkh@suse.de> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-21swsusp: dont fiddle with swappinessJohannes Weiner1-4/+1
sc.swappiness is not used in the swsusp memory shrinking path, do not set it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <gregkh@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-21PM/hibernate: fix "swap breaks after hibernation failures"Alan Jenkins1-2/+2
http://bugzilla.kernel.org/show_bug.cgi?id=12239 The image writing code dropped a reference to the current swap device. This doesn't show up if the hibernation succeeds - because it doesn't affect the image which gets resumed. But it means multiple _failed_ hibernations end up freeing the swap device while it is still use! swsusp_write() finds the block device for the swap file using swap_type_of(). It then uses blkdev_get() / blkdev_put() to open and close the block device. Unfortunately, blkdev_get() assumes ownership of the inode of the block_device passed to it. So blkdev_put() calls iput() on the inode. This is by design and other callers expect this behaviour. The fix is for swap_type_of() to take a reference on the inode using bdget(). Signed-off-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <gregkh@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-21percpu: clean up size usageTejun Heo1-11/+12
Andrew was concerned about the unit of variables named or have suffix size. Every usage in percpu allocator is in bytes but make it super clear by adding comments. While at it, make pcpu_depopulate_chunk() take int @off and @size like everyone else. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org>
2009-02-20vmalloc: call flush_cache_vunmap() from unmap_kernel_range()Tejun Heo1-0/+2
Impact: proper vcache flush on unmap_kernel_range() flush_cache_vunmap() should be called before pages are unmapped. Add a call to it in unmap_kernel_range(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: David S. Miller <davem@davemloft.net> Cc: <stable@kernel.org> [2.6.28.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-20slab: introduce kzfree()Johannes Weiner1-0/+20
kzfree() is a wrapper for kfree() that additionally zeroes the underlying memory before releasing it to the slab allocator. Currently there is code which memset()s the memory region of an object before releasing it back to the slab allocator to make sure security-sensitive data are really zeroed out after use. These callsites can then just use kzfree() which saves some code, makes users greppable and allows for a stupid destructor that isn't necessarily aware of the actual object size. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Matt Mackall <mpm@selenic.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-20Merge branch 'for-ingo' of ↵Ingo Molnar1-8/+8
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 into tracing/kmemtrace Conflicts: mm/slub.c
2009-02-20SLUB: Introduce and use SLUB_MAX_SIZE and SLUB_PAGE_SHIFT constantsChristoph Lameter1-8/+8
As a preparational patch to bump up page allocator pass-through threshold, introduce two new constants SLUB_MAX_SIZE and SLUB_PAGE_SHIFT and convert mm/slub.c to use them. Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-02-20percpu: implement new dynamic percpu allocatorTejun Heo2-0/+894
Impact: new scalable dynamic percpu allocator which allows dynamic percpu areas to be accessed the same way as static ones Implement scalable dynamic percpu allocator which can be used for both static and dynamic percpu areas. This will allow static and dynamic areas to share faster direct access methods. This feature is optional and enabled only when CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is defined by arch. Please read comment on top of mm/percpu.c for details. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org>
2009-02-20vmalloc: add un/map_kernel_range_noflush()Tejun Heo1-3/+64
Impact: two more public map/unmap functions Implement map_kernel_range_noflush() and unmap_kernel_range_noflush(). These functions respectively map and unmap address range in kernel VM area but doesn't do any vcache or tlb flushing. These will be used by new percpu allocator. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au>
2009-02-20vmalloc: implement vm_area_register_early()Tejun Heo1-0/+24
Impact: allow multiple early vm areas There are places where kernel VM area needs to be allocated before vmalloc is initialized. This is done by allocating static vm_struct, initializing several fields and linking it to vmlist and later vmalloc initialization picking up these from vmlist. This is currently done manually and if there's more than one such areas, there's no defined way to arbitrate who gets which address. This patch implements vm_area_register_early(), which takes vm_area struct with flags and size initialized, assigns address to it and puts it on the vmlist. This way, multiple early vm areas can determine which addresses they should use. The only current user - alpha mm init - is converted to use it. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-20percpu: kill percpu_alloc() and friendsTejun Heo1-13/+19
Impact: kill unused functions percpu_alloc() and its friends never saw much action. It was supposed to replace the cpu-mask unaware __alloc_percpu() but it never happened and in fact __percpu_alloc_mask() itself never really grew proper up/down handling interface either (no exported interface for populate/depopulate). percpu allocation is about to go through major reimplementation and there's no reason to carry this unused interface around. Replace it with __alloc_percpu() and free_percpu(). Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-20vmalloc: call flush_cache_vunmap() from unmap_kernel_range()Tejun Heo1-0/+2
Impact: proper vcache flush on unmap_kernel_range() flush_cache_vunmap() should be called before pages are unmapped. Add a call to it in unmap_kernel_range(). Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-19Merge branch 'linus' into tracing/blktraceIngo Molnar5-14/+43
Conflicts: block/blktrace.c Semantic merge: kernel/trace/blktrace.c Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-18Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds1-1/+1
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: block: fix deadlock in blk_abort_queue() for drivers that readd to timeout list block: fix booting from partitioned md array block: revert part of 18ce3751ccd488c78d3827e9f6bf54e6322676fb cciss: PCI power management reset for kexec paride/pg.c: xs(): &&/|| confusion fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free block: fix bad definition of BIO_RW_SYNC bsg: Fix sense buffer bug in SG_IO
2009-02-18mm: fix memmap init for handling memory holeKAMEZAWA Hiroyuki1-3/+20
Now, early_pfn_in_nid(PFN, NID) may returns false if PFN is a hole. and memmap initialization was not done. This was a trouble for sparc boot. To fix this, the PFN should be initialized and marked as PG_reserved. This patch changes early_pfn_in_nid() return true if PFN is a hole. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reported-by: David Miller <davem@davemlloft.net> Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18mm: clean up for early_pfn_to_nid()KAMEZAWA Hiroyuki1-1/+7
What's happening is that the assertion in mm/page_alloc.c:move_freepages() is triggering: BUG_ON(page_zone(start_page) != page_zone(end_page)); Once I knew this is what was happening, I added some annotations: if (unlikely(page_zone(start_page) != page_zone(end_page))) { printk(KERN_ERR "move_freepages: Bogus zones: " "start_page[%p] end_page[%p] zone[%p]\n", start_page, end_page, zone); printk(KERN_ERR "move_freepages: " "start_zone[%p] end_zone[%p]\n", page_zone(start_page), page_zone(end_page)); printk(KERN_ERR "move_freepages: " "start_pfn[0x%lx] end_pfn[0x%lx]\n", page_to_pfn(start_page), page_to_pfn(end_page)); printk(KERN_ERR "move_freepages: " "start_nid[%d] end_nid[%d]\n", page_to_nid(start_page), page_to_nid(end_page)); ... And here's what I got: move_freepages: Bogus zones: start_page[2207d0000] end_page[2207dffc0] zone[fffff8103effcb00] move_freepages: start_zone[fffff8103effcb00] end_zone[fffff8003fffeb00] move_freepages: start_pfn[0x81f600] end_pfn[0x81f7ff] move_freepages: start_nid[1] end_nid[0] My memory layout on this box is: [ 0.000000] Zone PFN ranges: [ 0.000000] Normal 0x00000000 -> 0x0081ff5d [ 0.000000] Movable zone start PFN for each node [ 0.000000] early_node_map[8] active PFN ranges [ 0.000000] 0: 0x00000000 -> 0x00020000 [ 0.000000] 1: 0x00800000 -> 0x0081f7ff [ 0.000000] 1: 0x0081f800 -> 0x0081fe50 [ 0.000000] 1: 0x0081fed1 -> 0x0081fed8 [ 0.000000] 1: 0x0081feda -> 0x0081fedb [ 0.000000] 1: 0x0081fedd -> 0x0081fee5 [ 0.000000] 1: 0x0081fee7 -> 0x0081ff51 [ 0.000000] 1: 0x0081ff59 -> 0x0081ff5d So it's a block move in that 0x81f600-->0x81f7ff region which triggers the problem. This patch: Declaration of early_pfn_to_nid() is scattered over per-arch include files, and it seems it's complicated to know when the declaration is used. I think it makes fix-for-memmap-init not easy. This patch moves all declaration to include/linux/mm.h After this, if !CONFIG_NODES_POPULATES_NODE_MAP && !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID -> Use static definition in include/linux/mm.h else if !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID -> Use generic definition in mm/page_alloc.c else -> per-arch back end function will be called. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: David Miller <davem@davemlloft.net> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18mm: task dirty accounting fixNick Piggin1-10/+3
YAMAMOTO-san noticed that task_dirty_inc doesn't seem to be called properly for cases where set_page_dirty is not used to dirty a page (eg. mark_buffer_dirty). Additionally, there is some inconsistency about when task_dirty_inc is called. It is used for dirty balancing, however it even gets called for __set_page_dirty_no_writeback. So rather than increment it in a set_page_dirty wrapper, move it down to exactly where the dirty page accounting stats are incremented. Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18vmalloc: add __get_vm_area_caller()Benjamin Herrenschmidt1-0/+8
We have get_vm_area_caller() and __get_vm_area() but not __get_vm_area_caller() On powerpc, I use __get_vm_area() to separate the ranges of addresses given to vmalloc vs. ioremap (various good reasons for that) so in order to be able to implement the new caller tracking in /proc/vmallocinfo, I need a "_caller" variant of it. (akpm: needed for ongoing powerpc development, so merge it early) [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18block: fix bad definition of BIO_RW_SYNCJens Axboe1-1/+1
We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before 213d9417fec62ef4c3675621b9364a667954d4dd. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-02-17Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds1-1/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, vm86: fix preemption bug x86, olpc: fix model detection without OFW x86, hpet: fix for LS21 + HPET = boot hang x86: CPA avoid repeated lazy mmu flush x86: warn if arch_flush_lazy_mmu_cpu is called in preemptible context x86/paravirt: make arch_flush_lazy_mmu/cpu disable preemption x86, pat: fix warn_on_once() while mapping 0-1MB range with /dev/mem x86/cpa: make sure cpa is safe to call in lazy mmu mode x86, ptrace, mm: fix double-free on race
2009-02-15lockdep: annotate reclaim context (__GFP_NOFS), fixIngo Molnar1-2/+1
Impact: fix build warning Fix: mm/vmscan.c: In function ‘kswapd’: mm/vmscan.c:1969: warning: ISO C90 forbids mixed declarations and code node_to_cpumask_ptr(cpumask, pgdat->node_id), has a side-effect: it defines the 'cpumask' local variable as well, so it has to go into the variable definition section. Sidenote: it might make sense to make this purpose of these macros more apparent, by naming them the standard way, such as: DEFINE_node_to_cpumask_ptr(cpumask, pgdat->node_id); (But that is outside the scope of this patch.) Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Mike Travis <travis@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14lockdep: annotate reclaim context (__GFP_NOFS)Nick Piggin5-0/+15
Here is another version, with the incremental patch rolled up, and added reclaim context annotation to kswapd, and allocation tracing to slab allocators (which may only ever reach the page allocator in rare cases, so it is good to put annotations here too). Haven't tested this version as such, but it should be getting closer to merge worthy ;) -- After noticing some code in mm/filemap.c accidentally perform a __GFP_FS allocation when it should not have been, I thought it might be a good idea to try to catch this kind of thing with lockdep. I coded up a little idea that seems to work. Unfortunately the system has to actually be in __GFP_FS page reclaim, then take the lock, before it will mark it. But at least that might still be some orders of magnitude more common (and more debuggable) than an actual deadlock condition, so we have some improvement I hope (the concept is no less complete than discovery of a lock's interrupt contexts). I guess we could even do the same thing with __GFP_IO (normal reclaim), and even GFP_NOIO locks too... but filesystems will have the most locks and fiddly code paths, so let's start there and see how it goes. It *seems* to work. I did a quick test. ================================= [ INFO: inconsistent lock state ] 2.6.28-rc6-00007-ged31348-dirty #26 --------------------------------- inconsistent {in-reclaim-W} -> {ov-reclaim-W} usage. modprobe/8526 [HC0[0]:SC0[0]:HE1:SE1] takes: (testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd] {in-reclaim-W} state was registered at: [<ffffffff80267bdb>] __lock_acquire+0x75b/0x1a60 [<ffffffff80268f71>] lock_acquire+0x91/0xc0 [<ffffffff8070f0e1>] mutex_lock_nested+0xb1/0x310 [<ffffffffa002002b>] brd_init+0x2b/0x216 [brd] [<ffffffff8020903b>] _stext+0x3b/0x170 [<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0 [<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff irq event stamp: 3929 hardirqs last enabled at (3929): [<ffffffff8070f2b5>] mutex_lock_nested+0x285/0x310 hardirqs last disabled at (3928): [<ffffffff8070f089>] mutex_lock_nested+0x59/0x310 softirqs last enabled at (3732): [<ffffffff8061f623>] sk_filter+0x83/0xe0 softirqs last disabled at (3730): [<ffffffff8061f5b6>] sk_filter+0x16/0xe0 other info that might help us debug this: 1 lock held by modprobe/8526: #0: (testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd] stack backtrace: Pid: 8526, comm: modprobe Not tainted 2.6.28-rc6-00007-ged31348-dirty #26 Call Trace: [<ffffffff80265483>] print_usage_bug+0x193/0x1d0 [<ffffffff80266530>] mark_lock+0xaf0/0xca0 [<ffffffff80266735>] mark_held_locks+0x55/0xc0 [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd] [<ffffffff802667ca>] trace_reclaim_fs+0x2a/0x60 [<ffffffff80285005>] __alloc_pages_internal+0x475/0x580 [<ffffffff8070f29e>] ? mutex_lock_nested+0x26e/0x310 [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd] [<ffffffffa002006a>] brd_init+0x6a/0x216 [brd] [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd] [<ffffffff8020903b>] _stext+0x3b/0x170 [<ffffffff8070f8b9>] ? mutex_unlock+0x9/0x10 [<ffffffff8070f83d>] ? __mutex_unlock_slowpath+0x10d/0x180 [<ffffffff802669ec>] ? trace_hardirqs_on_caller+0x12c/0x190 [<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0 [<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-13Merge branches 'tracing/ftrace', 'tracing/ring-buffer', 'tracing/sysprof', ↵Ingo Molnar11-43/+84
'tracing/urgent' and 'linus' into tracing/core