summaryrefslogtreecommitdiff
path: root/mm/slab.c
AgeCommit message (Collapse)AuthorFilesLines
2012-10-07Merge branch 'slab/for-linus' of ↵Linus Torvalds1-205/+143
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux Pull SLAB changes from Pekka Enberg: "New and noteworthy: * More SLAB allocator unification patches from Christoph Lameter and others. This paves the way for slab memcg patches that hopefully will land in v3.8. * SLAB tracing improvements from Ezequiel Garcia. * Kernel tainting upon SLAB corruption from Dave Jones. * Miscellanous SLAB allocator bug fixes and improvements from various people." * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (43 commits) slab: Fix build failure in __kmem_cache_create() slub: init_kmem_cache_cpus() and put_cpu_partial() can be static mm/slab: Fix kmem_cache_alloc_node_trace() declaration Revert "mm/slab: Fix kmem_cache_alloc_node_trace() declaration" mm, slob: fix build breakage in __kmalloc_node_track_caller mm/slab: Fix kmem_cache_alloc_node_trace() declaration mm/slab: Fix typo _RET_IP -> _RET_IP_ mm, slub: Rename slab_alloc() -> slab_alloc_node() to match SLAB mm, slab: Rename __cache_alloc() -> slab_alloc() mm, slab: Match SLAB and SLUB kmem_cache_alloc_xxx_trace() prototype mm, slab: Replace 'caller' type, void* -> unsigned long mm, slob: Add support for kmalloc_track_caller() mm, slab: Remove silly function slab_buffer_size() mm, slob: Use NUMA_NO_NODE instead of -1 mm, sl[au]b: Taint kernel when we detect a corrupted slab slab: Only define slab_error for DEBUG slab: fix the DEADLOCK issue on l3 alien lock slub: Zero initial memory segment for kmem_cache and kmem_cache_node Revert "mm/sl[aou]b: Move sysfs_slab_add to common" mm/sl[aou]b: Move kmem_cache refcounting to common code ...
2012-10-03Merge branch 'slab/tracing' into slab/for-linusPekka Enberg1-1/+1
2012-10-03Merge branch 'slab/common-for-cgroups' into slab/for-linusPekka Enberg1-150/+103
Fix up a trivial conflict with NUMA_NO_NODE cleanups. Conflicts: mm/slob.c Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-10-03Merge branch 'slab/next' into slab/for-linusPekka Enberg1-55/+40
2012-10-03slab: Fix build failure in __kmem_cache_create()Tetsuo Handa1-2/+3
Fix build failure with CONFIG_DEBUG_SLAB=y && CONFIG_DEBUG_PAGEALLOC=y caused by commit 8a13a4cc "mm/sl[aou]b: Shrink __kmem_cache_create() parameter lists". mm/slab.c: In function '__kmem_cache_create': mm/slab.c:2474: error: 'align' undeclared (first use in this function) mm/slab.c:2474: error: (Each undeclared identifier is reported only once mm/slab.c:2474: error: for each function it appears in.) make[1]: *** [mm/slab.o] Error 1 make: *** [mm] Error 2 Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-10-02Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds1-1/+1
Pull workqueue changes from Tejun Heo: "This is workqueue updates for v3.7-rc1. A lot of activities this round including considerable API and behavior cleanups. * delayed_work combines a timer and a work item. The handling of the timer part has always been a bit clunky leading to confusing cancelation API with weird corner-case behaviors. delayed_work is updated to use new IRQ safe timer and cancelation now works as expected. * Another deficiency of delayed_work was lack of the counterpart of mod_timer() which led to cancel+queue combinations or open-coded timer+work usages. mod_delayed_work[_on]() are added. These two delayed_work changes make delayed_work provide interface and behave like timer which is executed with process context. * A work item could be executed concurrently on multiple CPUs, which is rather unintuitive and made flush_work() behavior confusing and half-broken under certain circumstances. This problem doesn't exist for non-reentrant workqueues. While non-reentrancy check isn't free, the overhead is incurred only when a work item bounces across different CPUs and even in simulated pathological scenario the overhead isn't too high. All workqueues are made non-reentrant. This removes the distinction between flush_[delayed_]work() and flush_[delayed_]_work_sync(). The former is now as strong as the latter and the specified work item is guaranteed to have finished execution of any previous queueing on return. * In addition to the various bug fixes, Lai redid and simplified CPU hotplug handling significantly. * Joonsoo introduced system_highpri_wq and used it during CPU hotplug. There are two merge commits - one to pull in IRQ safe timer from tip/timers/core and the other to pull in CPU hotplug fixes from wq/for-3.6-fixes as Lai's hotplug restructuring depended on them." Fixed a number of trivial conflicts, but the more interesting conflicts were silent ones where the deprecated interfaces had been used by new code in the merge window, and thus didn't cause any real data conflicts. Tejun pointed out a few of them, I fixed a couple more. * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits) workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending() workqueue: use cwq_set_max_active() helper for workqueue_set_max_active() workqueue: introduce cwq_set_max_active() helper for thaw_workqueues() workqueue: remove @delayed from cwq_dec_nr_in_flight() workqueue: fix possible stall on try_to_grab_pending() of a delayed work item workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback() workqueue: use __cpuinit instead of __devinit for cpu callbacks workqueue: rename manager_mutex to assoc_mutex workqueue: WORKER_REBIND is no longer necessary for idle rebinding workqueue: WORKER_REBIND is no longer necessary for busy rebinding workqueue: reimplement idle worker rebinding workqueue: deprecate __cancel_delayed_work() workqueue: reimplement cancel_delayed_work() using try_to_grab_pending() workqueue: use mod_delayed_work() instead of __cancel + queue workqueue: use irqsafe timer for delayed_work workqueue: clean up delayed_work initializers and add missing one workqueue: make deferrable delayed_work initializer names consistent workqueue: cosmetic whitespace updates for macro definitions workqueue: deprecate system_nrt[_freezable]_wq workqueue: deprecate flush[_delayed]_work_sync() ...
2012-09-29Revert "mm/slab: Fix kmem_cache_alloc_node_trace() declaration"Pekka Enberg1-4/+4
This reverts commit 1e5965bf1f018cc30a4659fa3f1a40146e4276f6. Ezequiel Garcia has a better fix.
2012-09-25mm/slab: Fix kmem_cache_alloc_node_trace() declarationEzequiel Garcia1-4/+4
The bug was introduced in commit 4052147c0afa ("mm, slab: Match SLAB and SLUB kmem_cache_alloc_xxx_trace() prototype"). Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-25mm/slab: Fix typo _RET_IP -> _RET_IP_Ezequiel Garcia1-1/+1
The bug was introduced by commit 7c0cb9c64f83 ("mm, slab: Replace 'caller' type, void* -> unsigned long"). Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-25mm, slab: Rename __cache_alloc() -> slab_alloc()Ezequiel Garcia1-7/+7
This patch does not fix anything and its only goal is to produce common code between SLAB and SLUB. Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-25mm, slab: Match SLAB and SLUB kmem_cache_alloc_xxx_trace() prototypeEzequiel Garcia1-5/+5
This long (seemingly unnecessary) patch does not fix anything and its only goal is to produce common code between SLAB and SLUB. Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-25mm, slab: Replace 'caller' type, void* -> unsigned longEzequiel Garcia1-26/+24
This allows to use _RET_IP_ instead of builtin_address(0), thus achiveing implementation consistency in all three allocators. Though maybe a nitpick, the real goal behind this patch is to be able to obtain common code between SLAB and SLUB. Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-25mm, slab: Remove silly function slab_buffer_size()Ezequiel Garcia1-10/+2
This function is seldom used, and can be simply replaced with cachep->size. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-19mm, sl[au]b: Taint kernel when we detect a corrupted slabDave Jones1-0/+1
It doesn't seem worth adding a new taint flag for this, so just re-use the one from 'bad page' Acked-by: Christoph Lameter <cl@linux.com> # SLUB Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-19slab: Only define slab_error for DEBUGChristoph Lameter1-0/+2
On Tue, 11 Sep 2012, Stephen Rothwell wrote: > After merging the final tree, today's linux-next build (sparc64 defconfig) > produced this warning: > > mm/slab.c:808:13: warning: '__slab_error' defined but not used [-Wunused-function] > > Introduced by commit 945cf2b6199b ("mm/sl[aou]b: Extract a common > function for kmem_cache_destroy"). All uses of slab_error() are now > guarded by DEBUG. There is no use case left for slab builds without DEBUG. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-17slab: fix starting index for finding another objectJoonsoo Kim1-1/+1
In array cache, there is a object at index 0, check it. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17slab: do ClearSlabPfmemalloc() for all pages of slabMel Gorman1-2/+2
Right now, we call ClearSlabPfmemalloc() for first page of slab when we clear SlabPfmemalloc flag. This is fine for most swap-over-network use cases as it is expected that order-0 pages are in use. Unfortunately it is possible that that __ac_put_obj() checks SlabPfmemalloc on a tail page and while this is harmless, it is sloppy. This patch ensures that the head page is always used. This problem was originally identified by Joonsoo Kim. [js1304@gmail.com: Original implementation and problem identification] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-11slab: fix the DEADLOCK issue on l3 alien lockMichael Wang1-3/+3
DEADLOCK will be report while running a kernel with NUMA and LOCKDEP enabled, the process of this fake report is: kmem_cache_free() //free obj in cachep -> cache_free_alien() //acquire cachep's l3 alien lock -> __drain_alien_cache() -> free_block() -> slab_destroy() -> kmem_cache_free() //free slab in cachep->slabp_cache -> cache_free_alien() //acquire cachep->slabp_cache's l3 alien lock Since the cachep and cachep->slabp_cache's l3 alien are in the same lock class, fake report generated. This should not happen since we already have init_lock_keys() which will reassign the lock class for both l3 list and l3 alien. However, init_lock_keys() was invoked at a wrong position which is before we invoke enable_cpucache() on each cache. Since until set slab_state to be FULL, we won't invoke enable_cpucache() on caches to build their l3 alien while creating them, so although we invoked init_lock_keys(), the l3 alien lock class won't change since we don't have them until invoked enable_cpucache() later. This patch will invoke init_lock_keys() after we done enable_cpucache() instead of before to avoid the fake DEADLOCK report. Michael traced the problem back to a commit in release 3.0.0: commit 30765b92ada267c5395fc788623cb15233276f5c Author: Peter Zijlstra <peterz@infradead.org> Date: Thu Jul 28 23:22:56 2011 +0200 slab, lockdep: Annotate the locks before using them Fernando found we hit the regular OFF_SLAB 'recursion' before we annotate the locks, cure this. The relevant portion of the stack-trace: > [ 0.000000] [<c085e24f>] rt_spin_lock+0x50/0x56 > [ 0.000000] [<c04fb406>] __cache_free+0x43/0xc3 > [ 0.000000] [<c04fb23f>] kmem_cache_free+0x6c/0xdc > [ 0.000000] [<c04fb2fe>] slab_destroy+0x4f/0x53 > [ 0.000000] [<c04fb396>] free_block+0x94/0xc1 > [ 0.000000] [<c04fc551>] do_tune_cpucache+0x10b/0x2bb > [ 0.000000] [<c04fc8dc>] enable_cpucache+0x7b/0xa7 > [ 0.000000] [<c0bd9d3c>] kmem_cache_init_late+0x1f/0x61 > [ 0.000000] [<c0bba687>] start_kernel+0x24c/0x363 > [ 0.000000] [<c0bba0ba>] i386_start_kernel+0xa9/0xaf Reported-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> Acked-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptop Signed-off-by: Ingo Molnar <mingo@elte.hu> The commit moved init_lock_keys() before we build up the alien, so we failed to reclass it. Cc: <stable@vger.kernel.org> # 3.0+ Acked-by: Christoph Lameter <cl@linux.com> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-05mm/sl[aou]b: Move kmem_cache refcounting to common codeChristoph Lameter1-1/+0
Get rid of the refcount stuff in the allocators and do that part of kmem_cache management in the common code. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-05mm/sl[aou]b: Shrink __kmem_cache_create() parameter listsChristoph Lameter1-40/+33
Do the initial settings of the fields in common code. This will allow us to push more processing into common code later and improve readability. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-05mm/sl[aou]b: Move kmem_cache allocations into common codeChristoph Lameter1-18/+16
Shift the allocations to common code. That way the allocation and freeing of the kmem_cache structures is handled by common code. Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-05mm/sl[aou]b: Get rid of __kmem_cache_destroyChristoph Lameter1-25/+21
What is done there can be done in __kmem_cache_shutdown. This affects RCU handling somewhat. On rcu free all slab allocators do not refer to other management structures than the kmem_cache structure. Therefore these other structures can be freed before the rcu deferred free to the page allocator occurs. Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-05mm/sl[aou]b: Move freeing of kmem_cache structure to common codeChristoph Lameter1-1/+0
The freeing action is basically the same in all slab allocators. Move to the common kmem_cache_destroy() function. Reviewed-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-05mm/sl[aou]b: Use "kmem_cache" name for slab cache with kmem_cache structChristoph Lameter1-35/+37
Make all allocators use the "kmem_cache" slabname for the "kmem_cache" structure. Reviewed-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-05mm/sl[aou]b: Extract a common function for kmem_cache_destroyChristoph Lameter1-42/+3
kmem_cache_destroy does basically the same in all allocators. Extract common code which is easy since we already have common mutex handling. Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-09-05mm/sl[aou]b: Move list_add() to slab_common.cChristoph Lameter1-2/+5
Move the code to append the new kmem_cache to the list of slab caches to the kmem_cache_create code in the shared code. This is possible now since the acquisition of the mutex was moved into kmem_cache_create(). Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-08-29mm, slab: lock the correct nodelist after reenabling irqsDavid Rientjes1-0/+1
cache_grow() can reenable irqs so the cpu (and node) can change, so ensure that we take list_lock on the correct nodelist. This fixes an issue with commit 072bb0aa5e06 ("mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages") where list_lock for the wrong node was taken after growing the cache. Reported-and-tested-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21workqueue: make deferrable delayed_work initializer names consistentTejun Heo1-1/+1
Initalizers for deferrable delayed_work are confused. * __DEFERRED_WORK_INITIALIZER() * DECLARE_DEFERRED_WORK() * INIT_DELAYED_WORK_DEFERRABLE() Rename them to * __DEFERRABLE_WORK_INITIALIZER() * DECLARE_DEFERRABLE_WORK() * INIT_DEFERRABLE_WORK() This patch doesn't cause any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-17mm, slab: remove page_get_cacheDavid Rientjes1-6/+0
page_get_cache() isn't called from anything, so remove it. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-08-16slab: do not call compound_head() in page_get_cache()Michel Lespinasse1-1/+0
page_get_cache() does not need to call compound_head(), as its unique caller virt_to_slab() already makes sure to return a head page. Additionally, removing the compound_head() call makes page_get_cache() consistent with page_get_slab(). Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-31mm: micro-optimise slab to avoid a function callMel Gorman1-2/+26
Getting and putting objects in SLAB currently requires a function call but the bulk of the work is related to PFMEMALLOC reserves which are only consumed when network-backed storage is critical. Use an inline function to determine if the function call is required. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: introduce __GFP_MEMALLOC to allow access to emergency reservesMel Gorman1-1/+1
__GFP_MEMALLOC will allow the allocation to disregard the watermarks, much like PF_MEMALLOC. It allows one to pass along the memalloc state in object related allocation flags as opposed to task related flags, such as sk->sk_allocation. This removes the need for ALLOC_PFMEMALLOC as callers using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARK flag which is now enough to identify allocations related to page reclaim. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: sl[au]b: add knowledge of PFMEMALLOC reserve pagesMel Gorman1-18/+174
When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-09mm, sl[aou]b: Move kmem_cache_create mutex handling to common codeChristoph Lameter1-51/+1
Move the mutex handling into the common kmem_cache_create() function. Then we can also move more checks out of SLAB's kmem_cache_create() into the common code. Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-09mm, sl[aou]b: Use a common mutex definitionChristoph Lameter1-57/+51
Use the mutex definition from SLAB and make it the common way to take a sleeping lock. This has the effect of using a mutex instead of a rw semaphore for SLUB. SLOB gains the use of a mutex for kmem_cache_create serialization. Not needed now but SLOB may acquire some more features later (like slabinfo / sysfs support) through the expansion of the common code that will need this. Reviewed-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-09mm, sl[aou]b: Common definition for boot state of the slab allocatorsChristoph Lameter1-31/+14
All allocators have some sort of support for the bootstrap status. Setup a common definition for the boot states and make all slab allocators use that definition. Reviewed-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-09mm, sl[aou]b: Extract common code for kmem_cache_create()Christoph Lameter1-15/+9
Kmem_cache_create() does a variety of sanity checks but those vary depending on the allocator. Use the strictest tests and put them into a slab_common file. Make the tests conditional on CONFIG_DEBUG_VM. This patch has the effect of adding sanity checks for SLUB and SLOB under CONFIG_DEBUG_VM and removes the checks in SLAB for !CONFIG_DEBUG_VM. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-02slab: move FULL state transition to an initcallGlauber Costa1-4/+4
During kmem_cache_init_late(), we transition to the LATE state, and after some more work, to the FULL state, its last state This is quite different from slub, that will only transition to its last state (previously SYSFS), in a (late)initcall, after a lot more of the kernel is ready. This means that in slab, we have no way to taking actions dependent on the initialization of other pieces of the kernel that are supposed to start way after kmem_init_late(), such as cgroups initialization. To achieve more consistency in this behavior, that patch only transitions to the UP state in kmem_init_late. In my analysis, setup_cpu_cache() should be happy to test for >= UP, instead of == FULL. It also has passed some tests I've made. We then only mark FULL state after the reap timers are in place, meaning that no further setup is expected. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-02slab: Fix a typo in commit 8c138b "slab: Get rid of obj_size macro"Feng Tang1-1/+1
Commit 8c138b only sits in Pekka's and linux-next tree now, which tries to replace obj_size(cachep) with cachep->object_size, but has a typo in kmem_cache_free() by using "size" instead of "object_size", which casues some regressions. Reported-and-tested-by: Fengguang Wu <wfg@linux.intel.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Cc: Christoph Lameter <cl@linux.com> Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-02mm, slab: Build fix for recent kmem_cache changesThierry Reding1-1/+1
Commit 3b0efdf ("mm, sl[aou]b: Extract common fields from struct kmem_cache") renamed the kmem_cache structure's "next" field to "list" but forgot to update one instance in leaks_show(). Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-02slab: rename gfpflags to allocflagsGlauber Costa1-5/+5
A consistent name with slub saves us an acessor function. In both caches, this field represents the same thing. We would like to use it from the mem_cgroup code. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> CC: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-20slab/mempolicy: always use local policy from interrupt contextAndi Kleen1-2/+2
slab_node() could access current->mempolicy from interrupt context. However there's a race condition during exit where the mempolicy is first freed and then the pointer zeroed. Using this from interrupts seems bogus anyways. The interrupt will interrupt a random process and therefore get a random mempolicy. Many times, this will be idle's, which noone can change. Just disable this here and always use local for slab from interrupts. I also cleaned up the callers of slab_node a bit which always passed the same argument. I believe the original mempolicy code did that in fact, so it's likely a regression. v2: send version with correct logic v3: simplify. fix typo. Reported-by: Arun Sharma <asharma@fb.com> Cc: penberg@kernel.org Cc: cl@linux.com Signed-off-by: Andi Kleen <ak@linux.intel.com> [tdmackey@twitter.com: Rework control flow based on feedback from cl@linux.com, fix logic, and cleanup current task_struct reference] Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Mackey <tdmackey@twitter.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-14slab: Get rid of obj_size macroChristoph Lameter1-26/+21
The size of the slab object is frequently needed. Since we now have a size field directly in the kmem_cache structure there is no need anymore of the obj_size macro/function. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-14mm, sl[aou]b: Extract common fields from struct kmem_cacheChristoph Lameter1-59/+58
Define a struct that describes common fields used in all slab allocators. A slab allocator either uses the common definition (like SLOB) or is required to provide members of kmem_cache with the definition given. After that it will be possible to share code that only operates on those fields of kmem_cache. The patch basically takes the slob definition of kmem cache and uses the field namees for the other allocators. It also standardizes the names used for basic object lengths in allocators: object_size Struct size specified at kmem_cache_create. Basically the payload expected to be used by the subsystem. size The size of memory allocator for each object. This size is larger than object_size and includes padding, alignment and extra metadata for each object (f.e. for debugging and rcu). Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-14slab: Remove some accessorsChristoph Lameter1-27/+8
Those are rather trivial now and its better to see inline what is really going on. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-14slab: Use page struct fields instead of castingChristoph Lameter1-4/+4
Add fields to the page struct so that it is properly documented that slab overlays the lru fields. This cleans up some casts in slab. Reviewed-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-03-28Merge branch 'slab/for-linus' of ↵Linus Torvalds1-4/+52
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux Pull SLAB changes from Pekka Enberg: "There's the new kmalloc_array() API, minor fixes and performance improvements, but quite honestly, nothing terribly exciting." * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: mm: SLAB Out-of-memory diagnostics slab: introduce kmalloc_array() slub: per cpu partial statistics change slub: include include for prefetch slub: Do not hold slub_lock when calling sysfs_slab_add() slub: prefetch next freelist pointer in slab_alloc() slab, cleanup: remove unneeded return
2012-03-21cpuset: mm: reduce large amounts of memory barrier related damage v3Mel Gorman1-5/+8
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating at large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were 3.3.0-rc3 3.3.0-rc3 rc3-vanilla nobarrier-v2r1 Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%) Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%) Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%) Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%) Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%) Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%) Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%) Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%) Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%) Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%) Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%) Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%) Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%) Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%) Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%) MMTests Statistics: duration Sys Time Running Test (seconds) 135.68 132.17 User+Sys Time Running Test (seconds) 164.2 160.13 Total Elapsed Time (seconds) 123.46 120.87 The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-10mm: SLAB Out-of-memory diagnosticsRafael Aquini1-1/+50
Following the example at mm/slub.c, add out-of-memory diagnostics to the SLAB allocator to help on debugging certain OOM conditions. An example print out looks like this: <snip page allocator out-of-memory message> SLAB: Unable to allocate memory on node 0 (gfp=0x11200) cache: bio-0, object size: 192, order: 0 node 0: slabs: 3/3, objs: 60/60, free: 0 Signed-off-by: Rafael Aquini <aquini@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-01-23slab, cleanup: remove unneeded returnZhao Jin1-3/+2
The procedure ends right after the if-statement, so remove ``return''. Also move the last common statement outside. Signed-off-by: Zhao Jin <cronozhj@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>